One Word, Three Jobs: Untangling the Regressions
Part of the series: Understanding and Harnessing LLMs
You have just spent an article inside a deep network — layers, neurons, activations, the whole machine. This one steps back to the doorway you walked through to get there, because three of the most ordinary terms in machine learning are standing right on it, and almost everyone mixes them up.
The terms are linear regression, logistic regression, and autoregression. They all carry the word “regression,” they all turn up constantly, and they are routinely half-understood — partly because the shared word is misleading, and partly because two of them live a double life, built one way by a classical-ML person and a completely different-looking way by a deep-learning person, while being the same model underneath.
That double life is exactly why these three words are the right lens for the seam between machine learning and deep learning. Two of them don’t sit on one side of that seam — they straddle it. Learn to see how, and the boundary stops looking like a wall and starts looking like what it actually is: a place where the same few ideas get built twice, in two vocabularies.
Here is the most useful compression — keep it in your head for the whole article:
Three terms say “regression,” but they aren’t three of a kind. Two of them — linear and logistic — are models that fit a function to data, and each can be built either as classical machine learning or as a neural network. The third — autoregression — isn’t a model at all; it’s a usage pattern you wrap around a model so it feeds on its own past. So it’s 2 + 1, not 3.
That “2 + 1” is the whole untangling. The first two are siblings — same job (fit a function), each buildable in two toolboxes. The third only sounds related; it answers a different question entirely, and it became the engine of every text-generating LLM.
The Knowledge Map

            "REGRESSION": one word doing three different jobs
                                    │
           ┌────────────────────────┴────────────────────────┐
           │ TWO that FIT A FUNCTION to data                 │ ONE that
           │ (and each one builds TWO ways)                  │ LOOPS on
           ▼                                                 ▼ its own past
┌─────────────────────┐   ┌─────────────────────┐   ┌──────────────────┐
│ LINEAR REGRESSION   │   │ LOGISTIC REGRESSION │   │ AUTOREGRESSION   │
│ continuous number   │   │ probability → class │   │ next value from  │
│ out (rent, price)   │   │ (spam / not spam)   │   │ its own past     │
└──────────┬──────────┘   └──────────┬──────────┘   └────────┬─────────┘
           │                         │                       │
  sklearn ⟷ nn.Linear      sklearn ⟷ one neuron    stats AR ⟷ RNN / GPT
─────────────────────────────────────────────────   ────────────────────
SAME MODEL, TWO TOOLBOXES: these straddle the       a USAGE PATTERN, not
machine-learning / deep-learning line               a model; became how
                                                    LLMs generate text
The top split is the point: 2 + 1. The left two share a job and each appears in two toolboxes (the ⟷). The right one is a different category — a loop, not a model. We’ll walk the map left to right: linear regression, then logistic (with a recall of what a neuron is), then autoregression as the odd one out, then the cross-cutting confusion (task vs. algorithm), the side-by-side table, and finally why the word itself misleads.
Linear Regression — A Line, Built Two Ways
[Map: TWO THAT FIT → Linear Regression]
Start with the most concrete version of the problem. You want to predict an apartment’s monthly rent from its size. You have some data:
    size  50 m²  →  rent  7,000
    size  70 m²  →  rent  9,400
    size 100 m²  →  rent 13,000
Plot those points and they fall close to a straight line. So you describe rent as a straight-line function of size:
rent = w · size + b
Find the w (how much each extra square meter adds) and the b (the base) that put the line as close as possible to your points, and you can predict the rent for a size you’ve never seen. That is linear regression: fit a straight line — in higher dimensions, a flat hyperplane — to predict a continuous number that can run anywhere from minus infinity to plus infinity.
Two things define it, and both will matter when we compare:
- The output is the raw linear combination. There is no squashing, no nonlinearity: w · size + b is the answer, full stop.
- The loss is MSE (mean squared error): training nudges w and b to shrink the squared distance between the line and the data points.
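To make the fit concrete, here is a minimal sketch in plain Python, assuming the three data points above are the whole training set. It uses the textbook closed-form least-squares solution for a single feature; the helper names `fit_line` and `mse` are illustrative and do not come from any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


sizes = [50, 70, 100]            # m², from the example above
rents = [7000, 9400, 13000]

w, b = fit_line(sizes, rents)
predicted_rent = w * 85 + b      # a size that is not in the data
print(round(w, 2), round(b, 2), round(predicted_rent, 2))
print(mse(sizes, rents, w, b))
```

These three points happen to lie exactly on a line, so the fit recovers w = 120 per m² and b = 1,000 up to floating-point rounding, with an error that is effectively zero. Real data would leave some residual.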
Now the part that makes this article’s whole point. There are two completely different-looking ways to build this exact model — and a developer crossing from classical ML into deep learning needs to recognize them as the same thing.
Toolbox 1 — classical machine learning (sklearn):
    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
Toolbox 2 — a neural network (PyTorch): one linear layer, no activation:
    import torch.nn as nn

    model = nn.Linear(input_dim, 1)   # one continuous number out
    # trained against nn.MSELoss()
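These are not analogous; they are the same model. An nn.Linear layer computes exactly w · x + b, which is the literal definition of linear regression, and it is trained against the same MSE objective. Only the procedure for finding w and b differs between the two toolboxes. The neural-network version is just linear regression wearing deep-learning clothes.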
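The two toolboxes usually differ in how they find w and b, not in what they fit: a least-squares solver computes the answer directly, while a neural-network build typically gets there by gradient descent on MSE. Below is a minimal plain-Python sketch of the second route. It assumes the three rent points from the example, rescaled (size in hundreds of m², rent in thousands) so that a simple fixed learning rate converges; the learning rate and step count are illustrative choices, not recommendations.

```python
# Rent data from the example, rescaled: size in hundreds of m², rent in thousands.
xs = [0.5, 0.7, 1.0]
ys = [7.0, 9.4, 13.0]

w, b = 0.0, 0.0            # start from an arbitrary line
learning_rate = 0.3        # illustrative choice
n = len(xs)

for _ in range(5000):
    # Gradients of MSE = mean((w*x + b - y)^2) with respect to w and b.
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2 * sum(errors) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Convert back to the original units: rent per m² and base rent.
w_per_m2 = w * 1000 / 100
b_rent = b * 1000
print(round(w_per_m2, 3), round(b_rent, 3))
```

On these points the loop settles on roughly 120 per m² and a base of roughly 1,000, which is the line that passes through all three data points.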
Here is the framing to carry forward: in the language of neural networks, linear regression is a single layer with its activation removed. Hold that “activation removed” — because the very next term puts the activation back, and that one addition is the first real step into deep learning.
Logistic Regression — The Single Neuron, Built Two Ways
[Map: TWO THAT FIT → Logistic Regression]
Change the question from “how much?” to “which kind?” You no longer want a continuous number; you want to know whether an email is spam. The honest answer isn’t a hard yes/no — it’s a probability: “87% likely spam.” A probability has to live between 0 and 1, and a raw line (w · x + b) doesn’t — it happily returns 5,000 or −3.
So you take the linear score and squash it into the (0, 1) range with the sigmoid function:
    z = w · x + b            ← the exact same linear step as linear regression
    output = sigmoid(z)      ← then squash into a probability between 0 and 1

    sigmoid(z) = 1 / (1 + e^(−z))

    z = −5  →  output ≈ 0.0067   (very likely NOT spam)
    z =  0  →  output = 0.5      (a coin flip)
    z = +5  →  output ≈ 0.9933   (very likely spam)
Then a threshold (usually 0.5) turns that probability into a decision: above 0.5, call it spam. That is logistic regression: a linear step, then a sigmoid squash, producing a probability you can threshold into a class. The loss is cross-entropy, not MSE: paired with a sigmoid, squared error gives a non-convex objective and only a weak gradient when the model is confidently wrong, whereas cross-entropy penalizes exactly those predictions heavily.
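The whole pipeline (linear step, sigmoid, threshold, cross-entropy) fits in a few lines of plain Python. This is a sketch under stated assumptions: the weights and the two email features are made-up numbers chosen so that z comes out to exactly 5, and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Logistic regression forward pass: linear step, then sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


def cross_entropy(p, y):
    """Binary cross-entropy for one example with true label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Hypothetical weights and features, chosen so that z = 2*1.5 + 1*2 + 0 = 5.
w, b = [2.0, 1.0], 0.0
x = [1.5, 2.0]

p = predict_proba(w, b, x)
is_spam = p > 0.5                  # the usual 0.5 threshold
loss_if_spam = cross_entropy(p, 1)      # confident and right
loss_if_not_spam = cross_entropy(p, 0)  # confident and wrong
print(round(p, 4), is_spam, loss_if_spam, loss_if_not_spam)
```

The loss is small when the confident prediction matches the label and large when it does not, which is the behavior that makes cross-entropy a good fit for probabilities.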
Notice what just happened relative to linear regression: same linear core, plus one nonlinear squash on top. That structure — linear combination, then a nonlinear activation — has a name you met in the deep-learning article. It is the structure of a single neuron.
▸ Recall: what “a neuron” actually is. A neuron is one weight vector dotted with the input, plus a bias, then passed through an activation function:
    out = φ( w · x + b )
          │  └────┬────┘
          │       └── one linear step
          └────────── one activation (the squash)
Inside a layer’s weight matrix, that weight vector is one row — so a layer of three neurons is three rows, each producing one output. You may have seen a neuron described as “a row of the weight matrix” and also as “a linear step followed by an activation.” These are not two definitions; they are the same neuron seen from two sides — the row is what it’s made of (its weights), the linear-step-plus-squash is what it does (its computation). The activation is applied to that row’s result.
Logistic regression is the case where the layer has exactly one neuron — one row, one output. With only a single neuron, “the row,” “the layer,” and “the neuron” are all the same object; there’s nothing to pick apart. That is why it earns the title the simplest possible neural network: it’s a layer you cannot make any smaller. The activation just happens to be the sigmoid.
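The "row of the weight matrix" view can be sketched in a few lines of plain Python. The names `layer`, `W3`, and `phi` are illustrative and the weights are made-up numbers; the point is only that each row yields one output, and that a one-row layer with a sigmoid is the logistic-regression computation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(W, biases, x, phi):
    """One layer: each row of W is one neuron's weight vector.

    For every row, take the dot product with x, add that neuron's bias,
    then apply the activation phi to the result.
    """
    return [phi(sum(w_i * x_i for w_i, x_i in zip(row, x)) + b)
            for row, b in zip(W, biases)]


x = [1.0, 2.0]

# A layer of three neurons: three rows, three outputs.
W3 = [[0.5, -1.0],
      [1.0, 1.0],
      [0.0, 0.0]]
out3 = layer(W3, [0.0, 0.0, 0.0], x, sigmoid)

# A layer of one neuron: one row, one output. With a sigmoid as the
# activation, this is the logistic-regression forward pass.
out1 = layer([[1.0, 1.0]], [-3.0], x, sigmoid)

print(len(out3), len(out1), out1[0])
```

Here the single neuron computes z = 1 + 2 − 3 = 0, so its output is the coin-flip probability 0.5.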
With that recall in hand, the two toolboxes look almost too similar to be different traditions:
Toolbox 1 — classical machine learning (sklearn):
    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model.fit(X_train, y_train)
Toolbox 2 — a neural network (PyTorch): one linear layer + a sigmoid:
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(input_dim, 1),
        nn.Sigmoid(),
    )
    # trained against cross-entropy loss (for this sigmoid output, nn.BCELoss())
Same model, two vocabularies — exactly as with linear regression. But logistic regression carries one extra gift, and it’s the bridge this whole article is built to show: it is a deep network with the layers taken out. Put hidden layers between the input and that final sigmoid, and you have a full deep-learning classifier. Nothing about the destination changes — only the depth. That is the seam between machine learning and deep learning, and as you can see, it’s a ramp, not a wall: logistic regression is the bottom of it.
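The "same destination, more depth" claim can be sketched by keeping the final sigmoid neuron and placing one hidden layer in front of it. This is a plain-Python forward pass only (no training), with made-up weights and a ReLU hidden activation picked purely for illustration; the article does not prescribe either.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def dense(W, biases, x, phi):
    """Apply one layer: rows of W are neurons, phi is the activation."""
    return [phi(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(W, biases)]


def logistic_regression(x, w_out, b_out):
    """No hidden layers: the input goes straight into the sigmoid neuron."""
    return dense([w_out], [b_out], x, sigmoid)[0]


def deep_classifier(x, W_hidden, b_hidden, w_out, b_out):
    """One hidden layer, then the same kind of sigmoid output neuron."""
    hidden = dense(W_hidden, b_hidden, x, relu)
    return dense([w_out], [b_out], hidden, sigmoid)[0]


x = [1.0, -2.0]                                   # hypothetical input features
p_shallow = logistic_regression(x, [0.5, 0.25], 0.0)
p_deep = deep_classifier(
    x,
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],         # three hidden neurons
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    -1.0,
)
print(p_shallow, p_deep)
```

Both functions end in the same sigmoid neuron and both return a probability in (0, 1); the only structural difference is the hidden layer the input passes through first.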
Autoregression — A Loop, Not a Model
[Map: ONE THAT LOOPS → Autoregression]
The first two terms were siblings. This one only sounds like family — and seeing why is the heart of the untangling.
Start, again, with the concrete problem. You’re predicting tomorrow’s value of something that changes over time — a stock price, a temperature, the next word in a sentence — and the most useful clues are its own recent past. Yesterday’s price tells you a lot about today’s. That’s the whole idea:
x(t) = f( x(t−1), x(t−2), ..., x(t−n) ) + noise
The model regresses on itself — its own earlier outputs become its inputs. Hence “auto” (self) + “regression.” It was born in 1920s–30s statistics, long before neural networks, and the classical version is a simple weighted sum of past values:
x(t) = 0.7 · x(t−1) + 0.2 · x(t−2) + noise
This is the AR in ARIMA, the workhorse of economic forecasting, stock and weather prediction — any series where the past predicts the next step.
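A minimal sketch of that classical equation in plain Python, using the 0.7 and 0.2 coefficients from the example. The noise term is left out so the result is deterministic, the two starting observations are hypothetical, and the function names are illustrative.

```python
def ar2_step(prev1, prev2, a1=0.7, a2=0.2):
    """One-step AR(2) prediction: a weighted sum of the two latest values.

    The noise term is omitted, so this returns the expected next value.
    """
    return a1 * prev1 + a2 * prev2


def forecast(history, steps):
    """Roll the model forward, feeding each prediction back in as input."""
    series = list(history)
    for _ in range(steps):
        series.append(ar2_step(series[-1], series[-2]))
    return series[len(history):]


observed = [10.0, 10.0]            # hypothetical last two observations
predictions = forecast(observed, 3)
print(predictions)
```

Starting from two observations of 10, the three forecasts come out near 9.0, 8.3, and 7.61. From the second step on, the inputs include the model's own earlier predictions rather than observed data.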
Now the crucial distinction. Look at what autoregression actually specifies: not how to fit a function (linear? sigmoid? a 96-layer Transformer?), but what you feed the function — its own previous outputs, in a loop. Autoregression is not a model. It is a usage pattern you wrap around a model. The model inside the loop can be anything: a classical AR equation, an RNN, an LSTM, or a giant Transformer.
That last one is why this term matters today. GPT is an autoregressive language model. It generates one token, appends it to the input, and runs again to get the next — looping on its own output exactly like the 1920s AR equation, just with a Transformer as the function inside the loop:
    Input: "The cat sat on the"        → generates: "mat"
    Input: "The cat sat on the mat"    → generates: "."
    Input: "The cat sat on the mat ."  → generates: [END]
The model never sees the future, only the prompt and what it has already produced. This is why text generation from an LLM is sequential: each token must exist before the next can be computed. Training does not have this constraint, because the whole target sequence is already known and every position can be predicted in parallel.
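The loop itself is small enough to write out. In this plain-Python sketch the "model" is a made-up lookup table on the last token, standing in for a Transformer that would score every token in its vocabulary; `generate`, `toy_next_token`, and `TOY_TABLE` are hypothetical names used only for illustration.

```python
END = "[END]"


def generate(next_token, prompt, max_new_tokens=20):
    """Autoregressive loop: append each new token, then run the model again."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)     # the model sees only what exists so far
        if token == END:
            break
        tokens.append(token)
    return tokens


# A stand-in "model": a lookup on the last token. The table is made up.
TOY_TABLE = {"the": "mat", "mat": ".", ".": END}


def toy_next_token(tokens):
    return TOY_TABLE.get(tokens[-1], END)


prompt = ["The", "cat", "sat", "on", "the"]
result = generate(toy_next_token, prompt)
print(" ".join(result))
```

This prints `The cat sat on the mat .`, reproducing the three steps above. Note that `generate` knows nothing about how `next_token` works: any function that maps the tokens so far to one more token can be dropped in.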
And here is the payoff of putting this term beside the other two. Linear and logistic regression answered “what shape of function fits my data?” — and each had two ways to build it. Autoregression answers a different question entirely — “what do I feed the function?” — and it is not “built two ways” like the others; it’s orthogonal to all of them. You can run any of those models autoregressively. It shares a word with linear and logistic regression and almost nothing else. That’s the 2 + 1.
A Different Mix-Up: The Task vs. The Algorithm
[Map: a cross-cutting caution]
There’s a second confusion that trips people, and it sits at a right angle to the first. It’s the habit of treating binary classification and logistic regression as the same thing. They aren’t even the same kind of thing.
- Binary classification is a task, a description of the problem: the output has exactly two possible classes. Spam or not spam. Cat or dog. Fraud or legitimate.
- Logistic regression is an algorithm, one specific way to solve that task: output a probability, then threshold it.
The task says what you’re trying to do; the algorithm says how you do it. And one task can be solved by many algorithms:
    Binary Classification (the TASK: two possible outputs)
    ├── Logistic Regression        ← one algorithm
    ├── Support Vector Machine
    ├── Decision Tree
    ├── Random Forest
    └── Neural Network with a sigmoid output
Logistic regression is purpose-built for binary classification — but binary classification does not require logistic regression. You could solve the same spam problem with a decision tree and never compute a single probability. Keeping the two layers straight — the problem you’re solving versus the tool you’re solving it with — is what lets you read “we used logistic regression for this binary classification task” and hear two distinct facts instead of one redundant phrase.
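The task/algorithm split shows up directly in code: one evaluation function for the task, two interchangeable classifiers for the algorithm. This plain-Python sketch uses made-up data (one feature per email) and hand-picked parameters rather than trained ones; the one-split "stump" stands in for a decision tree.

```python
import math


def logistic_classifier(x, w=1.2, b=-3.0):
    """Algorithm 1: compute a probability, then threshold it at 0.5."""
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return 1 if p > 0.5 else 0


def stump_classifier(x, cutoff=2.5):
    """Algorithm 2: a one-split decision tree. No probability is computed."""
    return 1 if x > cutoff else 0


def accuracy(classifier, examples):
    """The task's yardstick is the same whichever algorithm is used."""
    correct = sum(classifier(x) == label for x, label in examples)
    return correct / len(examples)


# Made-up data: x = number of suspicious links in an email, label 1 = spam.
emails = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (6, 1)]

acc_logistic = accuracy(logistic_classifier, emails)
acc_stump = accuracy(stump_classifier, emails)
print(acc_logistic, acc_stump)
```

Both classifiers label every example in this toy set correctly, yet only one of them ever produces a probability. The task stayed fixed while the algorithm changed.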
The Three, Side by Side
[Map: synthesis]
Everything above, in one view. The rows this article exists to make visible are the build rows: which of these you can build in both toolboxes, and which you can’t.
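| | Linear regression | Logistic regression | Autoregression |
|---|---|---|---|
| Origin | | | 1920s–30s statistics |
| Output | Continuous number (rent, price) | Probability → class (spam / not spam) | Next value from its own past |
| Job | Fit a function to data | Fit a function to data | Loop a model on its own past |
| Activation | None | Sigmoid | |
| Loss | MSE | Cross-entropy | |
| Classical-ML build | sklearn.LinearRegression | sklearn.LogisticRegression | statsmodels |
| Deep-learning build | nn.Linear | nn.Linear + nn.Sigmoid (one neuron) | |
| Built two ways? | Yes | Yes | No |
| What “regression” means here | Fit a function to data | Fit a function to data | The model takes its own past outputs as input |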
Read the three build rows (Classical-ML build, Deep-learning build, Built two ways?) top to bottom and the 2 + 1 is hard to miss. Linear and logistic share the toolbox duality; autoregression breaks the pattern. It has no “one model, two builds” because it isn’t a single model at all.
Why the Word “Regression” Misleads Everyone
[Map: the naming]
The mess is, in the end, a naming accident. “Regression” is doing two unrelated jobs across these three terms:
- In linear and logistic regression, “regression” means fit a function to data. The word comes from Francis Galton’s “regression to the mean” in the 1880s, a statistical observation about data pulling toward its average. This is the curve-fitting sense.
- In autoregression, the root is reused for something else entirely: the model takes its own past outputs as input. The “regression” here points inward, at self-reference, not at fitting a curve.
So the shared word signals far less kinship than it suggests. The classical AR equation does fit a weighted sum of a series’ own past values, which is how it picked up the name, but what the term carries into “autoregressive language model” is the self-feeding loop, not any particular curve-fitter. Once you see that, the three stop blurring together: two of them are curve-fitters that straddle the ML/DL line, and one is a self-referential loop that rides on top of any model you like.
Close: Three Words, One Border
You came into this article fresh from the inside of a deep network, and we used that vantage to sort out three terms that sound alike and aren’t.
Two of them — linear and logistic regression — turned out to be the same models in two costumes. Linear regression is a neural layer with the activation stripped off; logistic regression is a single neuron, the smallest neural network there is, and the bottom rung of the ramp that climbs into deep learning. The reason they’re built “two ways” is the reason they’re the perfect lens on the boundary: they don’t belong to machine learning or deep learning. They belong to both, and the boundary is just the place where the same idea gets a second vocabulary.
The third — autoregression — never belonged to that pair at all. It’s a loop, not a model; a way of using a predictor by feeding it its own past, and the engine behind the way today’s LLMs write — one token at a time.
So the next time you meet these words — in a paper, a library, a colleague’s offhand sentence — you’ll see past the shared syllables to what each one actually is: a line, a neuron, and a loop. One word, three jobs. And the border between machine learning and deep learning, seen through them, is no wall — it’s a doorway the same ideas walk through twice.
References
- scikit-learn: LinearRegression and LogisticRegression user guides. The classical-ML implementations of the first two terms, including the threshold-into-a-class behavior of logistic regression. https://scikit-learn.org/stable/modules/linear_model.html
- PyTorch: torch.nn documentation (Linear, Sigmoid, Sequential). The neural-network builds of the same two models: one linear layer with and without an activation. https://pytorch.org/docs/stable/nn.html
- Hyndman & Athanasopoulos: Forecasting: Principles and Practice (3rd ed.), chapters on AR and ARIMA models. The classical statistical home of autoregression, predating neural networks. https://otexts.com/fpp3/
- Radford et al.: Language Models are Unsupervised Multitask Learners (GPT-2, 2019). The autoregressive, one-token-at-a-time generation scheme underlying decoder-only LLMs. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf