Machine Learning Terms & Definitions
This glossary explains common machine learning terms you'll encounter when using Simply ML. Understanding these concepts will help you make better modeling decisions and interpret results correctly.
Performance Metrics
R² Score (R-squared / Coefficient of Determination)
Measures the proportion of variance in the target variable that's predictable from the features. Range: up to 1.0 (can be negative for models worse than predicting the mean). 1.0 = perfect predictions, 0.0 = model predicts no better than simply using the mean. Higher is better.
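A minimal pure-Python sketch of the definition above (1 minus residual sum of squares over total sum of squares), showing the two reference points:

```python
def r2_score(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
print(r2_score(y, y))          # perfect predictions -> 1.0
print(r2_score(y, [2.5] * 4))  # always predicting the mean -> 0.0
```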
RMSE (Root Mean Squared Error)
Average prediction error in the same units as the target variable. Penalizes large errors more heavily than small ones. Lower is better. More sensitive to outliers than MAE.
MAE (Mean Absolute Error)
Average absolute difference between predictions and actual values. Same units as target variable. Lower is better. Less sensitive to outliers than RMSE, easier to interpret.
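To see how RMSE penalizes large errors more than MAE does, here is a small sketch with one outlier error among small ones (pure Python, no libraries assumed):

```python
import math

def rmse(y_true, y_pred):
    """Square errors before averaging, so large errors dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Average of absolute errors; each error counts in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Three errors of 1 and one error of 10: the outlier inflates RMSE far more
y_true = [0, 0, 0, 0]
y_pred = [1, 1, 1, 10]
print(mae(y_true, y_pred))   # 3.25
print(rmse(y_true, y_pred))  # ~5.07
```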
Accuracy
Percentage of correct predictions in classification. Range: 0-100%. Can be misleading with imbalanced classes. Formula: (Correct Predictions) / (Total Predictions)
Precision
Of all positive predictions, what proportion were actually positive? Important when false positives are costly. Formula: True Positives / (True Positives + False Positives)
Recall (Sensitivity / True Positive Rate)
Of all actual positives, what proportion were correctly identified? Important when false negatives are costly. Formula: True Positives / (True Positives + False Negatives)
F1-Score
Harmonic mean of precision and recall. Balances both metrics. Range: 0-1, higher is better. Useful when you need balance between precision and recall.
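The three formulas above can be computed together from confusion-matrix counts. A sketch (the counts 8/2/4 are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(p)  # 0.8
print(r)  # ~0.667
print(f)  # ~0.727 (between the two, pulled toward the lower value)
```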
ROC-AUC (Area Under ROC Curve)
Measures model's ability to distinguish between classes across all thresholds. 0.5 = random guessing, 1.0 = perfect classification; values below 0.5 indicate worse-than-random ranking (often a sign of inverted labels). Threshold-independent metric.
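AUC has an equivalent ranking interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. A sketch of that pairwise view (ties count half):

```python
def roc_auc(y_true, scores):
    """AUC = fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of 4 positive/negative pairs are ranked correctly -> 0.75
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```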
Model Concepts
Overfitting
Model learns training data too well, including noise and outliers, resulting in poor performance on new data. Signs: Very high training accuracy but low test accuracy. Solutions: Regularization, more data, simpler model; use cross-validation to detect it.
Underfitting
Model is too simple to capture the underlying patterns in the data. Signs: Low accuracy on both training and test data. Solutions: More complex model, more features, less regularization.
Bias-Variance Tradeoff
Bias: Error from oversimplifying (underfitting). Variance: Error from sensitivity to training data fluctuations (overfitting). Goal: Find balance that minimizes total error.
Regularization
Technique to prevent overfitting by adding penalties for model complexity. L1 (Lasso): Can eliminate features. L2 (Ridge): Shrinks coefficients. Elastic Net: Combines both.
Cross-Validation
Method to assess model performance by splitting data into multiple folds, training on some and testing on others, then averaging results. Provides more reliable performance estimate than single train-test split.
Hyperparameters
Parameters set before training that control the learning process (e.g., K in KNN, C in SVM, alpha in Lasso). Unlike model parameters (learned from data), these must be chosen through tuning/validation.
Data Preprocessing
Standardization (Z-score Normalization)
Transforms features to have mean=0 and standard deviation=1. Formula: (x - mean) / std_dev. Essential for distance-based algorithms (KNN, SVM) and regularization. Preserves outliers.
Normalization (Min-Max Scaling)
Scales features to a fixed range, typically [0, 1]. Formula: (x - min) / (max - min). Useful when you need bounded values. More affected by outliers than standardization.
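The two scaling formulas above, side by side as a pure-Python sketch (a library implementation would also store the training statistics for reuse on new data):

```python
def standardize(xs):
    """Z-score: (x - mean) / std, giving mean 0 and std 1."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

def min_max(xs):
    """Min-max: (x - min) / (max - min), mapping onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

data = [2.0, 4.0, 6.0, 8.0]
print(standardize(data))  # symmetric around 0
print(min_max(data))      # first value 0.0, last value 1.0
```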
Feature Selection
Process of selecting relevant features and removing irrelevant ones. Benefits: Reduces overfitting, improves model performance, decreases training time. Methods: Lasso, correlation analysis, domain knowledge.
Feature Engineering
Creating new features from existing ones to improve model performance. Examples: Polynomial features, interaction terms, binning continuous variables, extracting date components.
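A toy sketch of the first two examples above (the feature names are illustrative, not anything Simply ML generates):

```python
def engineer(x1, x2):
    """Expand two raw features into polynomial and interaction terms."""
    return {
        "x1": x1,
        "x2": x2,
        "x1_squared": x1 ** 2,  # polynomial feature
        "x1_x2": x1 * x2,       # interaction term
    }

print(engineer(3.0, 2.0))
```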
Missing Data
Data points where values are absent. Handling methods: Remove rows/columns, impute with mean/median/mode, use algorithms that handle missing data, predict missing values.
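A sketch of the simplest imputation strategy named above, filling gaps (here represented as `None`) with the median of the observed values:

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Median of [1.0, 3.0, 100.0] is 3.0; note it ignores the outlier,
# which is why median is often preferred over mean for imputation
print(impute_median([1.0, None, 3.0, 100.0]))  # [1.0, 3.0, 3.0, 100.0]
```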
Outliers
Data points significantly different from other observations. Can indicate errors or rare events. Handling: Remove if errors, keep if legitimate, use robust methods (e.g. MAE-based loss), transform data.
Algorithm-Specific Terms
Multicollinearity
When predictor variables are highly correlated with each other. Problems: Unstable coefficient estimates, difficulty interpreting feature importance. Solutions: Ridge regression, remove correlated features, PCA.
Support Vectors
In SVM, the training points closest to the decision boundary (or outside the epsilon-tube in SVR). These are the only points that define the model. The fraction of training data they make up varies with the data and hyperparameters such as C, but is often substantial (roughly 30-70%).
Kernel Trick
Mathematical technique allowing algorithms to operate in high-dimensional spaces without explicitly computing transformations. Used in SVM to handle non-linear relationships efficiently.
Decision Boundary
The surface that separates different classes in a classification problem. Can be linear or non-linear depending on the algorithm and data. Visualized in 2D as lines/curves.
Distance Metrics
Methods to measure similarity between data points. Euclidean: Straight-line distance. Manhattan: Sum of absolute differences. Minkowski: Generalization of both. Used in KNN and clustering.
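The three metrics above in pure Python, showing the classic 3-4-5 example where Euclidean and Manhattan distances differ:

```python
def euclidean(a, b):
    """Straight-line distance: sqrt of summed squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Generalization: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7
```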
Curse of Dimensionality
Phenomena where algorithms become less effective as the number of features increases. In high dimensions, all points become roughly equidistant. Particularly affects KNN and distance-based methods.
Training & Validation
Training Set
Portion of data used to train (fit) the model. Typically 70-80% of total data. Model learns patterns from this data.
Test Set
Portion of data held out to evaluate final model performance. Typically 20-30% of total data. Never used during training or hyperparameter tuning. Provides unbiased performance estimate.
Validation Set
Separate portion used for hyperparameter tuning and model selection (often via cross-validation). Keeps the test set untouched, so repeated evaluation during tuning doesn't lead to overfitting on it.
K-Fold Cross-Validation
Data split into K equal parts (folds). Model trained K times, each time using K-1 folds for training and 1 for validation. Performance averaged across all K runs. Common K values: 5, 10.
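The fold-splitting step can be sketched in pure Python. Each index appears in exactly one validation fold, and the remaining K-1 folds form the training set for that run:

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k folds over n samples."""
    # Spread any remainder across the first n % k folds
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

# 10 samples, K=5: five runs, each validating on 2 held-out samples
for train, val in kfold_indices(10, 5):
    print(val)
```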
Grid Search
Systematic method to find optimal hyperparameters by trying all combinations in a specified grid. Example: Try C=[0.1, 1, 10] and gamma=[0.01, 0.1, 1] for SVM (9 combinations). Use with cross-validation.
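The exhaustive enumeration can be sketched with the exact grid from the example above. Here `cv_score` is a hypothetical stand-in for real cross-validated model evaluation, rigged so C=1, gamma=0.1 wins:

```python
from itertools import product

C_values = [0.1, 1, 10]
gamma_values = [0.01, 0.1, 1]

def cv_score(C, gamma):
    """Placeholder for cross-validated performance; higher is better."""
    return -(abs(C - 1) + abs(gamma - 0.1))  # pretend C=1, gamma=0.1 is optimal

# Try all 3 x 3 = 9 combinations and keep the best-scoring one
best = max(product(C_values, gamma_values), key=lambda cg: cv_score(*cg))
print(best)  # (1, 0.1)
```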
Stratified Sampling
Splitting data while preserving class proportions in train/test sets. Essential for imbalanced classification to ensure both sets represent all classes proportionally.
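A pure-Python sketch of the idea: sample the test fraction within each class separately, so class proportions carry over to both splits:

```python
from collections import defaultdict
import random

def stratified_split(labels, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)  # same fraction from each class
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test

# 8 A's and 4 B's: the test set keeps the 2:1 ratio (2 A's, 1 B)
labels = ["A"] * 8 + ["B"] * 4
train, test = stratified_split(labels)
```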
Model Types
Supervised Learning
Learning from labeled data where the target variable is known. Includes regression (predicting continuous values) and classification (predicting categories). All models in Simply ML are supervised.
Regression
Predicting continuous numerical values. Examples: House prices, temperature, sales revenue. Metrics: R², RMSE, MAE. Models: Linear, Polynomial, Ridge, Lasso, KNN, SVR.
Classification
Predicting categorical labels or classes. Examples: Spam/not spam, disease diagnosis, customer churn. Metrics: Accuracy, Precision, Recall, F1-Score. Models: Logistic, Ridge Logistic, KNN, SVM.
Parametric Models
Models that assume a specific functional form and learn parameters from data. Examples: Linear regression, Logistic regression. Pros: Fast, interpretable. Cons: Limited flexibility.
Non-Parametric Models
Models that don't assume a specific functional form. Flexibility grows with data size. Examples: KNN, SVM with RBF kernel. Pros: Flexible, captures complex patterns. Cons: Slower, require more data.
Common Issues
Imbalanced Classes
Classification problem where one class significantly outnumbers others. Example: 95% class A, 5% class B. Solutions: Class weights, resampling (SMOTE), use F1/AUC instead of accuracy, adjust prediction threshold.
Extrapolation
Making predictions outside the range of training data. Most models perform poorly at extrapolation. Particularly bad for polynomial and KNN. Use caution when predicting beyond training range.
Data Leakage
When information from test set influences training, leading to overly optimistic performance estimates. Common causes: Preprocessing before splitting, using future information, duplicate data across sets.
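The preprocessing-order cause is the most common one, and avoiding it is mechanical: split first, then fit any transform on the training split only. A sketch (the extreme test value is contrived to make the point):

```python
def fit_standardizer(train):
    """Fit scaling statistics on training data only, return a transform."""
    mean = sum(train) / len(train)
    std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5
    return lambda xs: [(x - mean) / std for x in xs]

data = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
train, test = data[:4], data[4:]

# Correct order: statistics come from the training split alone, so the
# test outlier (100.0) never influences the transform. Scaling the full
# dataset before splitting would leak that outlier into training.
scale = fit_standardizer(train)
train_scaled = scale(train)
test_scaled = scale(test)
```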
Convergence
When iterative optimization algorithms reach a stable solution. Non-convergence warnings indicate model didn't find optimal parameters. Solutions: More iterations, better initialization, scale data, adjust learning rate.
Quick Reference
When to Use Which Model?
- Linear relationship: Simple/Multiple Linear Regression
- Curved relationship: Polynomial Regression
- Many features, need selection: Lasso
- Many features, correlated: Ridge or Elastic Net
- Binary classification: Logistic or Ridge Logistic
- Complex non-linear patterns: KNN or SVM
- High-dimensional data: SVM or regularized models
- Small dataset, simple patterns: Linear/Logistic Regression
- Large dataset: Avoid KNN, consider Linear SVM
Always Remember:
- ✓ Standardize features for KNN, SVM, and regularized models
- ✓ Use cross-validation for hyperparameter tuning
- ✓ Check for overfitting (compare train vs test performance)
- ✓ Start simple, then increase complexity if needed
- ✓ Visualize your data before modeling
- ✓ Don't rely on a single metric
- ✓ Understand your data domain and context