Simply ML

Machine Learning Terms & Definitions

This glossary explains common machine learning terms you'll encounter when using Simply ML. Understanding these concepts will help you make better modeling decisions and interpret results correctly.

Performance Metrics

R² Score (R-squared / Coefficient of Determination)

Measures the proportion of variance in the target variable that's predictable from the features. Typically ranges from 0 to 1, but can be negative when a model predicts worse than the mean. 1.0 = perfect predictions, 0.0 = model predicts no better than simply using the mean. Higher is better.

RMSE (Root Mean Squared Error)

Average prediction error in the same units as the target variable. Formula: sqrt(mean((actual - predicted)^2)). Penalizes large errors more heavily than small ones. Lower is better. More sensitive to outliers than MAE.

MAE (Mean Absolute Error)

Average absolute difference between predictions and actual values. Formula: mean(|actual - predicted|). Same units as target variable. Lower is better. Less sensitive to outliers than RMSE, easier to interpret.
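
A minimal sketch computing all three regression metrics with scikit-learn (assuming scikit-learn and NumPy are available; the y_true and y_pred values are made-up):

    import numpy as np
    from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

    y_true = np.array([3.0, 5.0, 7.5, 10.0])   # actual values (made-up)
    y_pred = np.array([2.8, 5.4, 7.0, 10.5])   # model predictions (made-up)

    print("R^2: ", r2_score(y_true, y_pred))                      # variance explained
    print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))   # same units as target
    print("MAE: ", mean_absolute_error(y_true, y_pred))           # average absolute error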

Accuracy

Percentage of correct predictions in classification. Range: 0-100%. Can be misleading with imbalanced classes. Formula: (Correct Predictions) / (Total Predictions)

Precision

Of all positive predictions, what proportion were actually positive? Important when false positives are costly. Formula: True Positives / (True Positives + False Positives)

Recall (Sensitivity / True Positive Rate)

Of all actual positives, what proportion were correctly identified? Important when false negatives are costly. Formula: True Positives / (True Positives + False Negatives)

F1-Score

Harmonic mean of precision and recall. Range: 0-1, higher is better. Useful when you need a single metric that balances precision and recall.

ROC-AUC (Area Under ROC Curve)

Measures model's ability to rank positive cases above negative ones across all thresholds. Range: 0-1. 0.5 = random guessing, 1.0 = perfect classification; values below 0.5 mean the model ranks worse than random. Threshold-independent metric.
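
All five classification metrics in one scikit-learn sketch (labels and scores are made-up; note that roc_auc_score needs predicted probabilities or scores, not hard class labels):

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score, roc_auc_score)

    y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual classes
    y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard predictions
    y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted P(class = 1)

    print("Accuracy: ", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
    print("Recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
    print("F1:       ", f1_score(y_true, y_pred))          # harmonic mean of the two
    print("ROC-AUC:  ", roc_auc_score(y_true, y_score))    # threshold-independent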

Model Concepts

Overfitting

Model learns training data too well, including noise and outliers, resulting in poor performance on new data. Signs: Very high training accuracy but low test accuracy. Solutions: Regularization, more data, or a simpler model; use cross-validation to detect it.
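
The train-vs-test check, sketched with scikit-learn on synthetic data (K=1 nearest neighbors is used deliberately because it memorizes the training set):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=200, random_state=0)   # synthetic data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    print("train:", model.score(X_tr, y_tr))   # 1.0: memorized the training set
    print("test: ", model.score(X_te, y_te))   # lower; a large gap signals overfitting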

Underfitting

Model is too simple to capture the underlying patterns in the data. Signs: Low accuracy on both training and test data. Solutions: More complex model, more features, less regularization.

Bias-Variance Tradeoff

Bias: Error from oversimplifying (underfitting). Variance: Error from sensitivity to training data fluctuations (overfitting). Goal: Find balance that minimizes total error.

Regularization

Technique to prevent overfitting by adding penalties for model complexity. L1 (Lasso): Can eliminate features. L2 (Ridge): Shrinks coefficients. Elastic Net: Combines both.
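
In scikit-learn terms the three penalties look like this (a sketch; the alpha values are arbitrary):

    from sklearn.linear_model import Lasso, Ridge, ElasticNet

    lasso = Lasso(alpha=0.1)                      # L1: can zero out coefficients
    ridge = Ridge(alpha=1.0)                      # L2: shrinks coefficients toward zero
    enet  = ElasticNet(alpha=0.1, l1_ratio=0.5)   # mix of L1 and L2

    # After fitting, lasso.coef_ typically contains exact zeros,
    # while ridge.coef_ is shrunk but dense.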

Cross-Validation

Method to assess model performance by splitting data into multiple folds, training on some and testing on others, then averaging results. Provides more reliable performance estimate than single train-test split.
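
A minimal 5-fold cross-validation sketch with scikit-learn (synthetic regression data for illustration):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

    scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
    print(scores)          # one R^2 per fold
    print(scores.mean())   # averaged estimate, steadier than a single split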

Hyperparameters

Parameters set before training that control the learning process (e.g., K in KNN, C in SVM, alpha in Lasso). Unlike model parameters (learned from data), these must be chosen through tuning/validation.
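
In scikit-learn, for instance, hyperparameters are constructor arguments set before .fit() is called (a sketch):

    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.svm import SVC
    from sklearn.linear_model import Lasso

    knn   = KNeighborsRegressor(n_neighbors=5)   # K in KNN
    svm   = SVC(C=1.0)                           # C in SVM
    lasso = Lasso(alpha=0.1)                     # alpha in Lasso
    # Model parameters (e.g., lasso.coef_) are learned later, by .fit()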

Data Preprocessing

Standardization (Z-score Normalization)

Transforms features to have mean=0 and standard deviation=1. Formula: (x - mean) / std_dev. Essential for distance-based algorithms (KNN, SVM) and regularization. Does not bound values, so outliers remain extreme.

Normalization (Min-Max Scaling)

Scales features to a fixed range, typically [0, 1]. Formula: (x - min) / (max - min). Useful when you need bounded values. More affected by outliers than standardization.
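
Both transformations side by side with scikit-learn (a sketch; in practice, fit the scaler on training data only, as noted under Data Leakage below):

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    X = np.array([[1.0], [2.0], [3.0], [10.0]])   # made-up feature with an outlier

    print(StandardScaler().fit_transform(X))   # mean 0, std 1; outlier stays extreme
    print(MinMaxScaler().fit_transform(X))     # bounded [0, 1]; outlier squeezes the rest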

Feature Selection

Process of selecting relevant features and removing irrelevant ones. Benefits: Reduces overfitting, improves model performance, decreases training time. Methods: Lasso, correlation analysis, domain knowledge.

Feature Engineering

Creating new features from existing ones to improve model performance. Examples: Polynomial features, interaction terms, binning continuous variables, extracting date components.
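
Two of the listed examples as a sketch (PolynomialFeatures from scikit-learn, date parts with pandas; the data is made-up):

    import pandas as pd
    from sklearn.preprocessing import PolynomialFeatures

    # Polynomial and interaction terms: [a, b] -> [a, b, a^2, a*b, b^2]
    print(PolynomialFeatures(degree=2, include_bias=False).fit_transform([[2.0, 3.0]]))

    # Extracting date components as new features
    dates = pd.to_datetime(pd.Series(["2024-01-15", "2024-06-03"]))
    print(dates.dt.month.tolist(), dates.dt.dayofweek.tolist())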

Missing Data

Data points where values are absent. Handling methods: Remove rows/columns, impute with mean/median/mode, use algorithms that handle missing data, predict missing values.
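
A sketch of the imputation option using scikit-learn's SimpleImputer (made-up data; np.nan marks the missing entries):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, np.nan]])   # made-up data with gaps

    # Replace each missing value with its column's median
    print(SimpleImputer(strategy="median").fit_transform(X))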

Outliers

Data points significantly different from other observations. Can indicate errors or rare events. Handling: Remove if errors, keep if legitimate, use robust methods (e.g., MAE-based metrics, Huber loss), transform data.
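
A common detection heuristic is the IQR rule, sketched with NumPy (values beyond 1.5 x IQR from the quartiles are flagged; the data is made-up):

    import numpy as np

    x = np.array([10, 12, 11, 13, 12, 95])   # one extreme value
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    print(x[mask])   # -> [95]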

Algorithm-Specific Terms

Multicollinearity

When predictor variables are highly correlated with each other. Problems: Unstable coefficient estimates, difficulty interpreting feature importance. Solutions: Ridge regression, remove correlated features, PCA.
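
A quick check is the pairwise correlation matrix (a pandas sketch with made-up columns; absolute values near 1 between predictors signal trouble):

    import pandas as pd

    df = pd.DataFrame({
        "sqft":  [800, 1200, 1500, 2000],
        "rooms": [2, 3, 4, 5],          # nearly a linear function of sqft
        "age":   [30, 5, 12, 20],
    })
    print(df.corr())   # sqft vs rooms correlation is close to 1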

Support Vectors

In SVM, the training points closest to the decision boundary (or outside the epsilon-tube in SVR). These are the only points that define the model. The fraction of training points that become support vectors varies widely with C, the kernel, and how separable the data is.
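
After fitting an SVM in scikit-learn you can inspect them directly (a sketch on synthetic data):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, random_state=0)
    clf = SVC(kernel="rbf").fit(X, y)

    print(clf.support_vectors_.shape)   # the boundary-defining points
    print(clf.n_support_)               # count of support vectors per class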

Kernel Trick

Mathematical technique allowing algorithms to operate in high-dimensional spaces without explicitly computing transformations. Used in SVM to handle non-linear relationships efficiently.

Decision Boundary

The surface that separates different classes in a classification problem. Can be linear or non-linear depending on the algorithm and data. Visualized in 2D as lines/curves.

Distance Metrics

Methods to measure similarity between data points. Euclidean: Straight-line distance. Manhattan: Sum of absolute differences. Minkowski: Generalization of both. Used in KNN and clustering.
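
All three distances computed directly with NumPy (a sketch; p is the Minkowski order, where p=1 gives Manhattan and p=2 gives Euclidean):

    import numpy as np

    a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])

    euclidean = np.sqrt(np.sum((a - b) ** 2))        # 5.0: straight-line distance
    manhattan = np.sum(np.abs(a - b))                # 7.0: sum of absolute differences
    p = 3
    minkowski = np.sum(np.abs(a - b) ** p) ** (1/p)  # generalizes both

    print(euclidean, manhattan, minkowski)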

Curse of Dimensionality

Phenomenon where algorithms become less effective as the number of features increases. In high dimensions, all points become roughly equidistant. Particularly affects KNN and distance-based methods.
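
A small NumPy simulation of the equidistance effect (for random points, the ratio of nearest to farthest distance approaches 1 as dimensions grow):

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 100, 10000):
        X = rng.random((500, d))                       # 500 random points in d dims
        dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point
        print(d, dists.min() / dists.max())            # ratio creeps toward 1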

Training & Validation

Training Set

Portion of data used to train (fit) the model. Typically 70-80% of total data. Model learns patterns from this data.

Test Set

Portion of data held out to evaluate final model performance. Typically 20-30% of total data. Never used during training or hyperparameter tuning. Provides unbiased performance estimate.

Validation Set

Separate portion used for hyperparameter tuning and model selection during cross-validation. Keeps the test set untouched, so repeated evaluation doesn't overfit model choices to it.

K-Fold Cross-Validation

Data split into K equal parts (folds). Model trained K times, each time using K-1 folds for training and 1 for validation. Performance averaged across all K runs. Common K values: 5, 10.
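
The fold mechanics spelled out with scikit-learn's KFold (a sketch; cross_val_score performs this loop for you):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)

    scores = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])   # train on K-1 folds
        scores.append(model.score(X[val_idx], y[val_idx]))           # score the held-out fold
    print(sum(scores) / len(scores))   # average across the 5 runs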

Grid Search

Systematic method to find optimal hyperparameters by trying all combinations in a specified grid. Example: Try C=[0.1, 1, 10] and gamma=[0.01, 0.1, 1] for SVM (9 combinations). Use with cross-validation.
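
The exact grid from the example as a scikit-learn sketch (9 combinations, each scored with 5-fold cross-validation on synthetic data):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=0)

    grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}   # 3 x 3 = 9 combinations
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
    print(search.best_params_, search.best_score_)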

Stratified Sampling

Splitting data while preserving class proportions in train/test sets. Essential for imbalanced classification to ensure both sets represent all classes proportionally.
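
With scikit-learn, stratification is a single argument to train_test_split (a sketch on a synthetic imbalanced dataset):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=200, weights=[0.9], random_state=0)   # ~90/10 classes

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              stratify=y, random_state=0)
    print(np.bincount(y_tr) / len(y_tr))   # class proportions preserved...
    print(np.bincount(y_te) / len(y_te))   # ...in both sets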

Model Types

Supervised Learning

Learning from labeled data where the target variable is known. Includes regression (predicting continuous values) and classification (predicting categories). All models in Simply ML are supervised.

Regression

Predicting continuous numerical values. Examples: House prices, temperature, sales revenue. Metrics: R², RMSE, MAE. Models: Linear, Polynomial, Ridge, Lasso, KNN, SVR.

Classification

Predicting categorical labels or classes. Examples: Spam/not spam, disease diagnosis, customer churn. Metrics: Accuracy, Precision, Recall, F1-Score. Models: Logistic, Ridge Logistic, KNN, SVM.

Parametric Models

Models that assume a specific functional form and learn parameters from data. Examples: Linear regression, Logistic regression. Pros: Fast, interpretable. Cons: Limited flexibility.

Non-Parametric Models

Models that don't assume a specific functional form; their flexibility grows with the amount of data. Examples: KNN, SVM with RBF kernel. Pros: Flexible, captures complex patterns. Cons: Slower prediction, typically needs more data.

Common Issues

Imbalanced Classes

Classification problem where one class significantly outnumbers others. Example: 95% class A, 5% class B. Solutions: Class weights, resampling (SMOTE), use F1/AUC instead of accuracy, adjust prediction threshold.
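
One of the listed remedies, class weights, is a single argument in scikit-learn (a sketch; "balanced" reweights classes inversely to their frequency):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic data: roughly 95% class 0, 5% class 1
    X, y = make_classification(n_samples=400, weights=[0.95], random_state=0)

    model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
    # Errors on the rare class now cost more, countering the imbalance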

Extrapolation

Making predictions outside the range of training data. Most models perform poorly at extrapolation. Particularly unreliable for polynomial regression and KNN. Use caution when predicting beyond the training range.
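
A tiny NumPy demonstration (a cubic fit to sin(x) on [0, 5] tracks the data in-range but diverges far from it):

    import numpy as np

    x = np.linspace(0, 5, 20)
    coeffs = np.polyfit(x, np.sin(x), deg=3)   # fit a cubic on [0, 5]

    print(np.polyval(coeffs, 2.5))    # in-range: stays near the true curve
    print(np.polyval(coeffs, 10.0))   # extrapolated: large, far from sin(10) ~ -0.54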

Data Leakage

When information from test set influences training, leading to overly optimistic performance estimates. Common causes: Preprocessing before splitting, using future information, duplicate data across sets.
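
The preprocessing-before-splitting mistake and its fix, sketched with scikit-learn (bundling the scaler into a Pipeline makes each CV fold fit the scaler on its own training portion only):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=0)

    # Leaky: StandardScaler().fit_transform(X) before splitting lets test-set
    # statistics (mean, std) influence the training data.
    # Safe: the pipeline refits the scaler inside each training fold.
    pipe = make_pipeline(StandardScaler(), SVC())
    print(cross_val_score(pipe, X, y, cv=5).mean())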

Convergence

When iterative optimization algorithms reach a stable solution. Non-convergence warnings indicate model didn't find optimal parameters. Solutions: More iterations, better initialization, scale data, adjust learning rate.
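
A typical fix for a non-convergence warning, sketched for scikit-learn's LogisticRegression (scaling alone often resolves it; max_iter buys extra optimization steps):

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    model = make_pipeline(StandardScaler(),                    # scale data first
                          LogisticRegression(max_iter=5000))   # allow more iterations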

Quick Reference

When to Use Which Model?

  • Linear relationship: Simple/Multiple Linear Regression
  • Curved relationship: Polynomial Regression
  • Many features, need selection: Lasso
  • Many features, correlated: Ridge or Elastic Net
  • Binary classification: Logistic or Ridge Logistic
  • Complex non-linear patterns: KNN or SVM
  • High-dimensional data: SVM or regularized models
  • Small dataset, simple patterns: Linear/Logistic Regression
  • Large dataset: Avoid KNN, consider Linear SVM

Always Remember:

  • ✓ Standardize features for KNN, SVM, and regularized models
  • ✓ Use cross-validation for hyperparameter tuning
  • ✓ Check for overfitting (compare train vs test performance)
  • ✓ Start simple, then increase complexity if needed
  • ✓ Visualize your data before modeling
  • ✓ Don't rely on a single metric
  • ✓ Understand your data domain and context