Machine learning and ML models
What is machine learning?

Machine learning (ML) is a way to build systems that improve from experience (data) instead of only hand-written rules. You choose a model family, define an objective (what “better” means), fit parameters on historical examples, then run inference on new inputs. The goal is generalization: performing well on data you did not train on, within the limits of your signal, features, and safeguards.
ML sits between classical statistics and modern deep learning. Many production systems still use classical or gradient-boosted models for tabular data; vision, speech, and large-scale language tasks often use neural networks with representation learning. The unifying idea is the same: learn a mapping from inputs to outputs (or to useful internal representations) from data.
What is an ML model?
An ML model is the trained artifact: architecture plus learned parameters (weights) that implement a decision rule or generator. Training adjusts those parameters using a dataset and a loss or reward signal; inference applies the fixed model to new cases. Models differ by input type (tables, images, text, audio, graphs), by objective (classification, regression, ranking, generation), and by whether they are discriminative (predict labels or values) or generative (produce new samples or tokens).
Large language models are ML models over text (and often multimodal extensions). They are trained at scale on next-token or related objectives, then adapted for chat, tools, retrieval, and agents. This site’s LLM Engines page lists concrete model lines by provider; RAG and agents show how those models sit inside products.
Linear regression

Linear Regression
LinearRegression is a simple model that predicts the target by fitting the best straight line (or flat hyperplane) through your data—weights and intercept are chosen to minimize the sum of squared prediction errors on the training set.
When to use it: anytime you want a fast, interpretable baseline for predicting a number from tabular inputs—sales/demand forecasting (simple drivers), pricing or risk scoring with clear coefficients, A/B or experiment analysis (effect sizes), feature importance sanity checks, and as a first model before trees/boosting. It works best when relationships are roughly additive and you’ve handled scaling/outliers; if the signal is deeply nonlinear, you’ll usually beat it with richer models.
Flow: tabular feature matrix X and target y → Ordinary Least Squares (minimize sum of squared residuals) → learned weights and intercept → predict ŷ for new rows.
Example task: Predict a house’s sale price from a few numeric inputs you already have—living area (sq ft), number of bedrooms, and age of the building—using past sales as training data.
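A minimal sketch of that task with scikit-learn; the sale records below are made up for illustration, not real data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical past sales: [living area (sq ft), bedrooms, building age]
X = np.array([
    [1400, 3, 20],
    [1600, 3, 15],
    [1700, 4, 30],
    [1100, 2, 40],
    [2200, 4, 5],
    [1900, 3, 10],
])
y = np.array([245_000, 280_000, 290_000, 180_000, 380_000, 330_000])  # sale prices

model = LinearRegression().fit(X, y)   # OLS: minimizes sum of squared residuals
pred = model.predict([[1500, 3, 25]])  # price estimate for a new listing
print("weights:", model.coef_, "intercept:", model.intercept_)
print("predicted price:", pred[0])
```

The learned coefficients read directly as "dollars per extra square foot / bedroom / year of age," which is why this model doubles as an interpretable baseline.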
Ridge regression

Ridge
Ridge is linear regression that adds an L2 penalty on the weights so coefficients stay smaller and the model is less sensitive to correlated/noisy features, trading a bit of bias for more stable predictions.
When to use it: many features, collinear inputs (e.g. many similar sensors or one-hot columns), or more features than you trust for plain OLS—common in tabular ML baselines, genomics/text bag-of-words (many weak predictors), risk/scoring models where you want stable coefficients, and whenever LinearRegression overfits or explodes on correlated columns.
Flow: tabular X and target y → minimize squared errors plus λ times the sum of squared weights (L2; intercept usually not penalized) → shrunk, stable coefficients → predict ŷ on new rows.
Example task: Predict a continuous outcome from several numeric inputs where two or more features measure almost the same thing (high correlation). Compare ordinary least squares coefficients (often large and opposite-signed) with Ridge to see weights stabilize.
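One way to sketch that comparison, assuming synthetic data in which two features are near-duplicates (the scales and `alpha` below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.001, size=200)   # almost identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)  # true signal lives on x1

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)            # L2 penalty stabilizes the split

print("OLS coefs:  ", ols.coef_)    # may be large and opposite-signed
print("Ridge coefs:", ridge.coef_)  # weight shared evenly, sum still near 3
```

Because the penalty is symmetric in the weights, Ridge splits the shared signal roughly evenly across the correlated pair instead of letting one coefficient explode against the other.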
Lasso regression

Lasso
Lasso is linear regression with an L1 penalty that can shrink some feature weights exactly to zero, giving you a sparse model that behaves like automatic feature selection.
When to use it: lots of features and you want a simpler, cheaper model—high-dimensional tabular data, many one-hot or engineered features, text/count vectors, genomics, or baselines where you need interpretability (many coefficients forced to zero). It’s weaker than Ridge when features are highly correlated (it tends to pick one and zero the rest).
Flow: tabular X and target y → minimize squared errors plus λ times the sum of absolute weights (L1) → many weights hit zero, rest shrunk → sparse β → predict ŷ on new rows.
Example task: Predict a numeric target from dozens of numeric features where only a few truly matter; check how many coefficients Lasso zeroes out versus dense OLS on the same split.
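A runnable sketch of that check, assuming synthetic data where only the first 3 of 30 features carry signal (the `alpha` value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 30))                 # 30 features, mostly noise
y = 2 * X[:, 0] - 3 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty drives weights to zero
ols = LinearRegression().fit(X, y)             # dense: every coefficient nonzero

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ols = int(np.sum(ols.coef_ == 0))
print(f"Lasso zeroed {n_zero_lasso}/30 coefficients; OLS zeroed {n_zero_ols}/30")
```

The three true predictors survive with shrunk weights while most noise columns are zeroed exactly, which is the "automatic feature selection" behavior described above.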
ElasticNet regression

ElasticNet
ElasticNet is linear regression with both L1 (Lasso) and L2 (Ridge) penalties, so it can shrink coefficients like Ridge while still allowing some sparsity like Lasso, which helps when many features are correlated.
When to use it: tabular regression when you have many features and groups of correlated predictors—common in marketing mix models, credit/risk features, sensor bundles, genomics, and text/count features—where Lasso alone is unstable (randomly picks one of a correlated set) but you still want some shrinkage and simpler models than plain Ridge.
Flow: tabular X and target y → minimize squared errors plus α-weighted mix of L1 and L2 on weights (l1_ratio ρ in sklearn) → blended sparse/shrunk β → predict ŷ on new rows.
Example task: Predict a numeric target from tabular features with a correlated pair plus several noise columns; compare coefficient stability and sparsity from ElasticNet (tune l1_ratio) against Lasso-only on the same split.
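A sketch of that setup with a correlated pair sharing one underlying signal plus noise columns; the data and the `alpha`/`l1_ratio` values are illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(7)
z = rng.normal(size=300)                       # shared latent signal
x0 = z + rng.normal(scale=0.05, size=300)      # correlated pair: both measure z
x1 = z + rng.normal(scale=0.05, size=300)
noise = rng.normal(size=(300, 8))              # irrelevant columns
X = np.column_stack([x0, x1, noise])
y = 4 * z + rng.normal(scale=0.5, size=300)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso pair coefs:     ", lasso.coef_[:2])  # tends to pick one, zero the other
print("ElasticNet pair coefs:", enet.coef_[:2])   # shares weight across the pair
```

The L2 component makes ElasticNet distribute weight across the correlated pair, while the L1 component still suppresses the pure-noise columns.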
Decision tree regression

DecisionTreeRegressor
DecisionTreeRegressor is a tree-based model: it repeatedly splits the data using simple if/else rules on individual features, groups the training points into leaf buckets, and predicts the target in each bucket as the mean of the training targets in that bucket.
When to use it: nonlinear predictions with an easy-to-read rule list—pricing tiers, simple risk segments, ops thresholds, EDA and feature checks, or a teaching baseline; in production it’s often too unstable or overfit alone, so people use RandomForest or boosting instead while keeping the same splitting idea.
Flow: tabular X and target y → choose splits that reduce prediction error (e.g. MSE) on training rows → recurse until leaves → each leaf predicts the mean y of rows that fell there → for a new row, walk the rules to a leaf and output that mean.
Example task: Fit a shallow tree on a numeric target with a clear kink or tier in one feature (e.g. discount thresholds); print or plot the tree to read the if/else rules, then compare test error to a linear model on the same features.
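The task above can be sketched like this, assuming toy data with a single step at x = 5 (standing in for a discount threshold):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = np.where(x < 5, 10.0, 25.0) + rng.normal(scale=1.0, size=300)  # tier at x = 5
X = x.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_tr, y_tr)
lin = LinearRegression().fit(X_tr, y_tr)

print(export_text(tree, feature_names=["x"]))  # readable if/else rules
tree_mse = mean_squared_error(y_te, tree.predict(X_te))
lin_mse = mean_squared_error(y_te, lin.predict(X_te))
print("tree test MSE:  ", tree_mse)
print("linear test MSE:", lin_mse)
```

The printed rules recover the threshold near 5, and the tree's test error sits close to the noise floor while the straight line smears the step across the whole range.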
Random forest regression

RandomForestRegressor
RandomForestRegressor trains many deep decision trees on random row samples and random feature subsets, then predicts by averaging their outputs, giving a stronger, less overfit nonlinear regressor.
When to use it: default tabular regression when you want accuracy without much tuning—demand forecasting, pricing, LTV/churn dollar targets, real estate, manufacturing yield, fraud/risk amounts—and when you want robustness to messy nonlinear interactions plus feature importance for debugging.
Flow: tabular X and target y → bootstrap sample rows and subsample features for each tree → fit many full-depth (or limited) trees independently → at inference, run the row through every tree and average the leaf means → ŷ.
Example task: On the same train/test split, compare a single deep DecisionTreeRegressor (high train score, shaky test) with RandomForestRegressor (n_estimators 100–300); inspect feature_importances_ to see which columns the ensemble leans on.
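A runnable version of that comparison; the Friedman #1 synthetic dataset (10 features, only the first 5 informative) stands in for real tabular data, and `n_estimators=200` is one point in the 100–300 range suggested above:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)       # one deep tree
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("tree   train/test R²:", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("forest train/test R²:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
print("importances:", forest.feature_importances_.round(2))
```

The single tree memorizes the training rows (near-perfect train R², much weaker test R²), the averaged forest generalizes better, and `feature_importances_` concentrates on the five columns that actually drive the target.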
Support vector regression (SVR)

SVR
Support Vector Machines are a family of models that prefer a wide empty margin around a decision boundary—most influence comes from points near that zone (the support vectors), not from far-away easy examples, which tends to stabilize generalization. Classification uses SVC (the widest “street” between classes; kernels such as RBF yield nonlinear boundaries). SVR applies the same toolkit to numbers: it fits a regression function that stays as flat as possible while keeping most training points inside an ε-wide tube around the prediction, and kernels (e.g. RBF, polynomial) capture nonlinear shape using only a subset of support-vector points.
When to use it: small/medium tabular problems where you want smooth predictions and nonlinear structure without trees—QSAR/chemistry, some financial or engineering fits, text or structured features with RBF/poly kernels. Training cost scales poorly; on large messy tables, gradient boosting often wins.
Flow: scale features (critical for SVR) → choose kernel, C, and ε → solve for weights/support vectors so errors outside the tube are penalized while the function stays smooth → predict ŷ with the kernel expansion on support vectors only.
Example task: On mildly nonlinear tabular data, compare Pipeline(StandardScaler, SVR(kernel="linear")) with the same pipeline using kernel="rbf"; grid-search C and gamma on a validation fold and count support vectors (n_support_).
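A sketch of that comparison under stated assumptions: the mildly nonlinear data is synthetic, and the `C`/`gamma` grid values are illustrative starting points, not recommendations:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Scaling first is critical for SVR; make_pipeline names the step "svr".
linear = make_pipeline(StandardScaler(), SVR(kernel="linear")).fit(X_tr, y_tr)
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    {"svr__C": [1, 10], "svr__gamma": [0.1, 1.0]},
    cv=3,
).fit(X_tr, y_tr)

lin_r2, rbf_r2 = linear.score(X_te, y_te), grid.score(X_te, y_te)
best_svr = grid.best_estimator_.named_steps["svr"]
print("linear R²:", lin_r2, " rbf R²:", rbf_r2)
print("support vectors used:", len(best_svr.support_))
```

Only the rows indexed by `support_` enter the kernel expansion at prediction time; everything inside the ε-tube is ignored, which is the sparsity the section describes.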
MLPRegressor

MLPRegressor
MLPRegressor is a small feedforward neural network (stack of layers with nonlinear activations) that learns to map features to a numeric target by minimizing squared error on the training set.
When to use it: nonlinear tabular regression when you want a neural baseline without PyTorch or TensorFlow—engineering or sensor mappings, simple forecasting feature stacks, calibration or surrogate models. Boosting often wins on raw messy tables; scale features and guard against overfitting when data are limited.
Flow: tabular X and target y (scale inputs) → forward pass through hidden layers with activations → backprop adjusts weights to reduce MSE → repeat until convergence limits → predict ŷ for new rows.
Example task: Fit MLPRegressor with one or two hidden layers on scaled features; compare validation curves as you change hidden_layer_sizes and alpha (L2 on weights), then benchmark against RandomForestRegressor on the same split.
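A minimal sketch of that benchmark (no validation curves, just one configuration): the data is synthetic Friedman #1 and the layer sizes, `alpha`, and `max_iter` below are illustrative assumptions:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_friedman1(n_samples=800, noise=0.5, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=3)

# Scale inputs inside the pipeline; alpha is the L2 penalty on the weights.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), alpha=1e-3,
                 max_iter=2000, random_state=3),
).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=200, random_state=3).fit(X_tr, y_tr)

mlp_r2 = mlp.score(X_te, y_te)
forest_r2 = forest.score(X_te, y_te)
print("MLP test R²:   ", mlp_r2)
print("forest test R²:", forest_r2)
```

To mirror the example task fully, you would sweep `hidden_layer_sizes` and `alpha` (e.g. with `validation_curve` or `GridSearchCV`) rather than fixing them as done here.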
What ML is used for
Typical uses include:
- Prediction and scoring — churn, risk, demand, and other supervised learning tasks on structured data.
- Ranking and recommendations — search quality, feeds, and personalization.
- Computer vision — detection, segmentation, quality control, medical imaging assistance.
- Speech and audio — transcription, synthesis, diarization, keyword spotting.
- Natural language processing — classification, extraction, translation, summarization, conversational systems.
- Time series and forecasting — operations, finance, IoT.
- Anomaly and fraud detection — rare-event patterns in transactions or logs.
- Control and robotics — policies learned from simulation or demonstrations (often with heavy safety and validation).
In all of these, success depends on problem framing, data quality, evaluation, monitoring, and governance — not only on the choice of model class.
Conclusion
The models above share one pattern: define a hypothesis family, fit it to data with a clear objective, then judge it on held-out or live performance. Linear and regularized models give fast, interpretable baselines; trees and forests handle interactions and nonlinear shape with little feature engineering; SVR and small MLPs add other nonlinear options when scale and tuning constraints fit. None of that replaces careful features, validation, and monitoring—but it is the same “learn from examples” story that underpins large language models, retrieval, and agents on the rest of this section.