
4.3 Machine learning

We reference the machine-learning and deep-learning refresher (Refreshers#Machine learning and deep learning) up front, because this chapter leans on it and does not repeat it: supervised learning, what a network is, and how training works by stochastic gradient descent (SGD) and backpropagation all live there. What this chapter is about is one idea that sits above any particular network — a move you could make with a linear model, a decision tree, or a hundred-layer transformer, and that the imaging world mostly makes with the last of these. The idea is to stop designing the operator that solves an imaging task and instead learn it from examples. The concrete deep-network operators that carry the idea out — the image-to-image nets, the semantic predictors, the generative translators — are the next chapter, Deep learning. Here we set up two things: what it means to learn an operator from data, and why the data — how you generate it, how faithfully you model the camera's noise, which dataset you train on — is as much the method as the model.

4.3.1 The framing: learned operators replace hand-designed ones

Every recovery task in this part so far arrived with a hand-designed operator. Deblurring used an inverse filter; denoising leaned on a smoothness prior; dehazing reached for a dark-channel heuristic; demosaicking interpolated the missing colors with a carefully tuned rule. In each case a person looked at the problem, decided what made a good answer, and wrote that decision down as a formula. This chapter is about a single move that quietly replaced all of it: instead of writing the operator, fit it to data. Collect a pile of example pairs — a degraded input next to the answer we wish we had gotten — and let a function with millions of free parameters adjust itself until it reproduces those answers (Figure 4.3.1). The function is typically a neural network; we write it $f_\theta$, where $\theta$ is the bag of parameters that training sets. The recovered image is then just $\hat I = f_\theta(\text{measurement})$, and training is the optimization

$$\min_\theta \; \sum_i \ell\!\big(f_\theta(x_i),\, y_i\big)$$

over a dataset of input–target pairs $(x_i, y_i)$: search over parameters $\theta$ for the ones that make the operator's output $f_\theta(x_i)$ land closest, under a loss $\ell$, to the desired target $y_i$ on every example. That is the whole framing. Everything difficult about it is hiding in two places — what network to use (deferred to Deep learning and the refresher) and where the pairs $(x_i, y_i)$ come from (the second half of this chapter).
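To make the optimization concrete, here is a minimal sketch in which $f_\theta$ is nothing more than a linear map, $f_\theta(x) = xW$, fit by plain gradient descent on a squared-error loss. Everything specific in it is an assumption made for illustration rather than part of the chapter's method: the toy 1-D "images" (random walks), the additive noise standing in for the degradation, the learning rate, and the use of full-batch gradient descent in place of SGD. The loss is the sum above scaled by $1/n$, which does not move the minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: n one-dimensional "images" of length d.
n, d = 2000, 16
# Clean targets y_i: smooth random walks (an assumption, not real image data).
Y = np.cumsum(rng.normal(0.0, 0.1, size=(n, d)), axis=1)
# Degraded inputs x_i: the targets plus additive noise.
X = Y + rng.normal(0.0, 0.3, size=(n, d))

# The operator f_theta(x) = x @ W, with theta = W (a d x d matrix).
W = np.eye(d)  # start from "do nothing"

def loss(W):
    """Mean over examples of the squared error between f_theta(x_i) and y_i."""
    return np.mean(np.sum((X @ W - Y) ** 2, axis=1))

initial = loss(W)
lr = 0.05  # step size, chosen by hand for this toy problem
for step in range(500):
    residual = X @ W - Y              # f_theta(x_i) - y_i for every example
    grad = 2.0 * X.T @ residual / n   # gradient of the loss with respect to W
    W -= lr * grad
final = loss(W)

print(f"loss before training: {initial:.3f}, after: {final:.3f}")
```

Nothing in the loop knows what "smooth" means; the fitted matrix ends up averaging neighboring samples only because the example pairs reward it. Swapping the linear map for a deep network changes the function family, not the framing.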

fig-learned-vs-handdesigned
Figure 4.3.1. The same inverse-problem skeleton, with the prior swapped. Left, the classical pipeline: minimize data-fit plus a hand-designed prior $\Phi$ — smoothness, sparsity, a dark-channel heuristic — every term written down by a person. Right, the learned pipeline: the prior, or the whole operator, becomes a function $f_\theta$ fit to a dataset of example pairs. The shape of the problem is unchanged; only the source of the prior moves, from a human to the data.

The reason this is not a wholesale abandonment of everything earlier in the part is that the skeleton survives intact. Recall the inverse-problem template from the previous chapter, $\hat I = \arg\min_I \lVert AI - b\rVert^2 + \lambda\,R(I)$ (Linear Inverse Problems and Regression): a data-fit term that keeps the answer consistent with the measurement, plus a prior $R$ that says what a plausible image looks like. A learned method keeps both halves. What it changes is where $R$ comes from. The hand-tuned regularizer — total variation, a sparse-gradient penalty, whatever a researcher guessed — is replaced by something fit from examples. In the strongest version the network is the prior, and the solver too: $f_\theta$ swallows the whole map from measurement to answer. But the bargain is unchanged — data-fit plus prior — and only the prior learned to read.
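For contrast with the learned route, the classical template can be solved in closed form when the prior is a simple quadratic. The sketch below is an illustration under stated assumptions, not a method from this book: a 1-D signal, a three-tap blur as $A$, and the hand-picked smoothness prior $R(I) = \lVert DI\rVert^2$ with $D$ a finite-difference matrix, so that the minimizer is $\hat I = (A^\top A + \lambda D^\top D)^{-1} A^\top b$. The value of $\lambda$ is hand-tuned, which is exactly the point.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
t = np.linspace(0.0, 1.0, d)
I_true = np.sin(2.0 * np.pi * t)  # hypothetical smooth "image"

# A: three-tap blur [0.25, 0.5, 0.25] written as a tridiagonal matrix.
A = 0.5 * np.eye(d) + 0.25 * np.eye(d, k=1) + 0.25 * np.eye(d, k=-1)
b = A @ I_true + rng.normal(0.0, 0.01, size=d)  # blurred, noisy measurement

# D: forward differences, so ||D I||^2 penalizes rough signals.
D = np.diff(np.eye(d), axis=0)

# Data-fit only: invert the blur directly (noise gets amplified).
I_naive = np.linalg.solve(A, b)

# Data-fit plus hand-designed prior: solve the regularized normal equations.
lam = 1.0  # prior weight, chosen by hand
I_reg = np.linalg.solve(A.T @ A + lam * D.T @ D, A.T @ b)

err_naive = np.linalg.norm(I_naive - I_true)
err_reg = np.linalg.norm(I_reg - I_true)
print(f"error without prior: {err_naive:.3f}, with prior: {err_reg:.3f}")
```

Both the form of $R$ and the value of $\lambda$ were decided by a person here. A learned method leaves the data-fit term alone and replaces those two decisions with something fit to example pairs.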

💡 Big lesson L8 — a learned operator swaps a hand-designed prior for one learned from data

A classical recovery method minimizes data-fit + a hand-tuned prior — smoothness, sparsity, a dark-channel heuristic, whatever a person decided made a good image. A learned method keeps the same skeleton but replaces that prior, or the whole operator, with a function $f_\theta$ fit to a dataset. The inverse-problem template does not change; the prior simply becomes data-driven. This is a throughline for the rest of the book: the deep-network realizations follow in Deep learning, a learned denoiser turns out to be a reusable prior you can plug into any solver (Denoising as a universal prior), and a diffusion model is the same idea taken to its generative limit (Generative AI and diffusion). The cost is real and worth naming up front — you now need data and compute, and the prior can hallucinate, inventing plausible detail that was never measured. (First appearance; the refresher carries a one-line callback. Big Lessons#L8)

Why did this take over, and why now? Three things arrived together: large datasets, graphics processing units (GPUs) fast enough to train on them, and a kit of reusable building blocks — convolutional networks, U-Nets, transformers — that turn out to be good at images. With those in hand, fitting a map that used to take a decade of hand-tuning became, roughly, a weekend of training. There is a one-line argument behind the trend, Sutton's Bitter Lesson: general methods that scale with data and compute tend, eventually, to beat methods that bake in human cleverness. We will see that play out task by task in the next chapter. But the honest reading of the lesson is not a victory lap; it is a warning to flag the costs — data hunger, compute, hallucinated detail — rather than to cheerlead, and we will keep flagging them.

A note on scope before we go on. This chapter is deliberately about the principle and the data, not the machinery. We will not build a network, choose between a U-Net and a transformer, or survey which architecture wins which task — that operator zoo is the next chapter, Deep learning, and the architectures and training procedures live in the refresher. What stays here is the one move common to all of them — learn the operator from examples — and the half of the problem that the move makes load-bearing: the data. As we are about to see, a learned operator is only as good as the pairs it trained on, and getting good pairs is most of the work.

4.3.2 The data story: synthetic data, noise models, datasets

If there is one lesson this chapter wants to leave you with beyond L8, it is this: the data is the method. A learned operator is exactly as good as the pairs it trained on, and good pairs are usually not lying around — they have to be manufactured. The architecture gets the headlines, but two networks of the same shape, one trained on careless data and one on faithful data, are not the same tool at all. So before we ever pick a model, we have to ask where the training pairs come from.

Generating synthetic data. The trick is to start from clean targets and simulate the degradation to produce (input, target) pairs at scale (Figure 4.3.2). This is often the only way to get supervised data at all: it is rarely practical to photograph the same scene both "clean" and "noisy" in perfect registration, so you take a clean image and add the corruption yourself, and you know the answer because you started from it. The same move mints super-resolution pairs (downsample a sharp image), deblurring pairs (convolve with a kernel), and dehazing pairs (composite in synthetic haze). The ground truth is free, because we built the input from the target; what costs us is realism.

fig-synthetic-data-pipeline
Figure 4.3.2. Manufacturing training pairs by simulating degradation. A clean image (the target) is pushed through a model of the camera and its corruptions — a realistic noise model, the image signal processor (ISP), a blur or a downsampling — to produce a matched degraded input. The pair (degraded, clean) becomes one training example. Because we start from the clean target, the ground truth is free; the realism of the simulated degradation is what decides whether the trained network transfers to real photos.
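The three pair-minting moves named above can be sketched in a few lines each. This is a deliberately naive illustration: the function names are hypothetical, the degradations are the simplest ones available (block-average downsampling, a separable box blur, a constant-transmission haze composite of the form $\text{hazy} = t \cdot \text{clean} + (1 - t) \cdot \text{airlight}$), and realistic pipelines are considerably more elaborate.

```python
import numpy as np

def sr_pair(clean, factor=2):
    """Super-resolution pair: block-average downsampling of a sharp image."""
    h, w = clean.shape
    h2, w2 = h - h % factor, w - w % factor  # crop so the blocks tile exactly
    target = clean[:h2, :w2]
    blocks = target.reshape(h2 // factor, factor, w2 // factor, factor)
    return blocks.mean(axis=(1, 3)), target

def blur_pair(clean, k=5):
    """Deblurring pair: separable box blur with a length-k kernel."""
    kernel = np.ones(k) / k
    smooth = lambda v: np.convolve(v, kernel, mode="same")
    blurred = np.apply_along_axis(smooth, 1, clean)    # along rows
    blurred = np.apply_along_axis(smooth, 0, blurred)  # along columns
    return blurred, clean

def haze_pair(clean, transmission=0.6, airlight=0.9):
    """Dehazing pair: composite the scene with a uniform airlight."""
    hazy = clean * transmission + airlight * (1.0 - transmission)
    return hazy, clean

rng = np.random.default_rng(0)
clean = rng.random((32, 32))  # stand-in for a clean grayscale image in [0, 1]

low, sharp = sr_pair(clean)
blurred, _ = blur_pair(clean)
hazy, _ = haze_pair(clean)
print(low.shape, sharp.shape, blurred.shape, hazy.shape)
```

Each function returns (input, target) in that order, and in every case the target is the image we started from. None of these degradations is realistic enough to train a network that transfers to real photos.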

Realistic noise models: the crux. Here is where naive synthetic data goes wrong. Train a denoiser on additive Gaussian noise and it will typically transfer poorly to real photographs, because real sensor noise does not follow that model: it is signal-dependent (Poisson shot noise plus read noise, so bright pixels are noisier in absolute terms even though their signal-to-noise ratio is higher), and by the time it reaches a finished image it has already been reshaped by the demosaicker and the image signal processor (ISP), which leave it spatially and chromatically correlated. Match the real noise model (cross-referenced to the shot-noise discussion in the Refreshers and Book 2) or the learned denoiser simply will not transfer. This is precisely why Gharbi mines hard cases for joint demosaicking-and-denoising, and why Real-ESRGAN (enhanced super-resolution generative adversarial network) models an elaborate chain of degradations rather than a single clean blur: the realism of the simulated corruption, not the cleverness of the net, is what decides whether the result survives outside the lab.
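A minimal sketch of the shot-plus-read noise model shows the signal dependence directly. Under this model a pixel whose expected photon count is $\mu$ has variance $\mu + \sigma_r^2$, with $\sigma_r$ the read-noise standard deviation in the same units. The function name and the parameter values (`photons_at_white`, `read_sigma`) are assumptions picked for illustration, not measurements of any real sensor.

```python
import numpy as np

def add_sensor_noise(clean, photons_at_white=1000.0, read_sigma=2.0, rng=None):
    """Simulate raw-domain noise on a linear image with values in [0, 1].

    photons_at_white: expected photon count for a pixel of value 1.0 (assumed).
    read_sigma: read-noise standard deviation, in photon units (assumed).
    """
    rng = np.random.default_rng() if rng is None else rng
    expected = clean * photons_at_white
    shot = rng.poisson(expected).astype(np.float64)        # signal-dependent part
    read = rng.normal(0.0, read_sigma, size=clean.shape)   # signal-independent part
    return (shot + read) / photons_at_white

rng = np.random.default_rng(0)
dark = np.full(100_000, 0.05)    # a flat dark patch
bright = np.full(100_000, 0.80)  # a flat bright patch

std_dark = add_sensor_noise(dark, rng=rng).std()
std_bright = add_sensor_noise(bright, rng=rng).std()

print(f"noise std, dark patch:   {std_dark:.4f}")
print(f"noise std, bright patch: {std_bright:.4f}")
print(f"SNR dark: {0.05 / std_dark:.1f}, SNR bright: {0.80 / std_bright:.1f}")
```

The bright patch is noisier in absolute terms yet cleaner in relative terms, and no single Gaussian noise level reproduces both behaviors at once. That mismatch is what a denoiser trained on fixed-strength Gaussian noise inherits.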

Reverse-engineering the camera pipeline. To synthesize a realistic raw-or-finished pair you have to replay the ISP — undo and redo black-level subtraction, white balance, demosaicking, the tone curve, and JPEG compression — so that the synthetic input carries the same fingerprints as the camera's real output. This is the same camera-pipeline machinery FlexISP describes, now in service of making fake data look real. A differentiable model of the pipeline is doubly useful: it both generates faithful training inputs and lets the degradation itself be tuned end-to-end against the data.
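As a rough sketch of the replay idea, the toy pipeline below undoes two ISP stages on a finished image, injects noise in the linear domain, and then redoes the stages. It is heavily simplified and every constant is hypothetical: fixed white-balance gains, a plain gamma curve standing in for the tone curve, Gaussian read noise only, and no black level, demosaicking, or JPEG stage. It is not FlexISP or any particular camera's pipeline.

```python
import numpy as np

WB_GAINS = np.array([2.0, 1.0, 1.5])  # hypothetical per-channel (R, G, B) gains
GAMMA = 2.2                           # plain gamma as a stand-in tone curve

def process(linear):
    """Forward toy ISP: white balance, clip, then tone curve."""
    balanced = linear * WB_GAINS
    return np.clip(balanced, 0.0, 1.0) ** (1.0 / GAMMA)

def unprocess(finished):
    """Inverse toy ISP: undo the tone curve, then undo white balance."""
    balanced = np.clip(finished, 0.0, 1.0) ** GAMMA
    return balanced / WB_GAINS

def synth_pair(clean_finished, read_sigma=0.01, rng=None):
    """Make a (noisy, clean) pair by corrupting in the linear domain."""
    rng = np.random.default_rng() if rng is None else rng
    linear = unprocess(clean_finished)
    noisy_linear = linear + rng.normal(0.0, read_sigma, size=linear.shape)
    noisy_finished = process(np.clip(noisy_linear, 0.0, None))
    return noisy_finished, clean_finished

rng = np.random.default_rng(0)
dark = np.full((200, 200, 3), 0.1)    # flat dark gray, finished domain
bright = np.full((200, 200, 3), 0.9)  # flat bright gray, finished domain

noisy_dark, _ = synth_pair(dark, rng=rng)
noisy_bright, _ = synth_pair(bright, rng=rng)
noise_dark = (noisy_dark - dark).std()
noise_bright = (noisy_bright - bright).std()
print(f"finished-domain noise std, dark: {noise_dark:.4f}, bright: {noise_bright:.4f}")
```

The noise added in the linear domain has the same strength everywhere, yet after the tone curve and the gains it is much stronger in the shadows and differs per channel. Those are fingerprints that adding noise directly to the finished image would miss.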

Standard datasets and benchmarks. It is worth knowing the workhorses and what each measures, because the benchmark quietly defines the task: ImageNet (pretraining and features), DIV2K (super-resolution), SIDD and DND (real-world denoising), MIT-Adobe FiveK (retouching), KITTI and NYU-Depth (depth), COCO (detection and segmentation). The MIT-Adobe FiveK set is a clean illustration of the point — the same scenes retouched by five expert editors — which is exactly what made a photographer's taste learnable in the first place. And a benchmark's biases become the model's biases: a depth dataset that is all indoor rooms will leave a model confused outdoors, so a dataset is never neutral. The choice of benchmark is a design decision as consequential as the choice of architecture.

A note on self-supervision. Two places in this story sidestep the data problem rather than solving it by brute force. Self-supervised denoising (Noise2Noise, and its single-image cousin Noise2Void) trains without clean targets at all, learning to denoise from noisy images alone, a direct answer to "we cannot photograph the clean version." And many of the pretrained backbones behind depth, detection, and learned perceptual metrics come from self- or weakly-supervised pretraining on unlabeled images. Both matter; neither needs its own section here, and the deeper treatment belongs to the machine-learning appendix and the advanced books.
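The reason training on noisy targets can work at all is easy to check on a toy problem: with a squared-error loss and zero-mean noise that is independent between two captures of the same scene, the noise in the target averages out, so fitting to a second noisy capture gives nearly the same operator as fitting to the clean target. The sketch below illustrates that idea with a linear least-squares "denoiser" on synthetic 1-D signals. It is an assumption-laden toy in the spirit of Noise2Noise, not the published method, and it does not cover the single-image case.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 20_000, 8, 0.3

def make_clean(count):
    """Hypothetical smooth 1-D signals (random walks)."""
    return np.cumsum(rng.normal(0.0, 0.1, size=(count, d)), axis=1)

clean = make_clean(n)
noisy_a = clean + rng.normal(0.0, sigma, size=(n, d))  # network input
noisy_b = clean + rng.normal(0.0, sigma, size=(n, d))  # second, independent capture

# Linear operator fit by least squares, once per choice of target.
W_supervised, *_ = np.linalg.lstsq(noisy_a, clean, rcond=None)  # clean targets
W_n2n, *_ = np.linalg.lstsq(noisy_a, noisy_b, rcond=None)       # noisy targets only

# Evaluate both against clean ground truth on held-out data.
test_clean = make_clean(5_000)
test_noisy = test_clean + rng.normal(0.0, sigma, size=test_clean.shape)

err_in = np.mean((test_noisy - test_clean) ** 2)
err_sup = np.mean((test_noisy @ W_supervised - test_clean) ** 2)
err_n2n = np.mean((test_noisy @ W_n2n - test_clean) ** 2)
print(f"input MSE {err_in:.4f}, supervised {err_sup:.4f}, noisy-target {err_n2n:.4f}")
```

The clean signals are used here only to build the data and to score the result; the second fit never sees them. The argument leans on the noise being zero-mean and independent across the two captures, which is an assumption real data may violate.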

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson L8 — a learned operator swaps a hand-designed prior for one learned from data

A classical recovery method minimizes data-fit + a hand-tuned prior — smoothness, sparsity, a dark-channel heuristic, whatever a person decided made a good image. A learned method keeps the same skeleton but replaces that prior, or the whole operator, with a function $f_\theta$ fit to a dataset. The inverse-problem template does not change; the prior simply becomes data-driven. This is a throughline for the rest of the book: the deep-network realizations follow in Deep learning, a learned denoiser turns out to be a reusable prior you can plug into any solver (Denoising as a universal prior), and a diffusion model is the same idea taken to its generative limit (Generative AI and diffusion). The cost is real and worth naming up front — you now need data and compute, and the prior can hallucinate, inventing plausible detail that was never measured. (First appearance; the refresher carries a one-line callback. Big Lessons#L8)