symbol

name / meaning

$A$

a matrix; a linear map / forward operator (as in $y = Ax$)

$x$

a vector — an image, or the unknown we solve for

$y$

a vector — the measurement / observation

$\langle u,v\rangle$

inner product, $u^\top v$ (alignment of two vectors)

\cdot\

norm (vector length); $\

=\sqrt{\langle v,v\rangle}$

$\Sigma$ / $U$ / $V$

singular value decomposition $A = U\Sigma V^\top$ ($\sigma_i$ = singular values)

$\nabla f$

gradient of $f$ (vector of partial derivatives; points uphill)

$J$

Jacobian $J_{ij}=\partial f_i/\partial x_j$ ($m\times n$; the derivative of a vector→vector map)

$H$

Hessian $H_{ij}=\partial^2 f/\partial x_i\partial x_j$ ($n\times n$; curvature) — clashes with entropy / image height, see note below

$\eta$

learning rate / gradient-descent step size

$\lambda$

regularization weight (data-fit vs prior trade-off)

$\mu$

mean / expected value, $E[X]$

$\sigma^2$

variance ($\sigma$ = standard deviation = noise level)

$\mathcal N$

Gaussian (normal) distribution, $\mathcal N(\mu,\sigma^2)$

$H$

entropy, $-\sum_i p_i\log p_i$ (bits to encode a source)

$p(x\mid y)$

conditional probability / Bayesian posterior

$\theta$

model parameters (the weights a network learns)

$\sigma(\cdot)$

a neuron's nonlinearity (activation function; e.g. ReLU, sigmoid)

No big lessons in this chapter — skip ahead to Problem Set 0 — Environment and C++ basics →

22.1 Refreshers⧉

The whole book runs on a handful of mathematical ideas, reused relentlessly. Images are vectors; the operations we do to them are linear maps, which lets us borrow all of linear algebra and the Fourier transform; recovering an image from imperfect measurements is an optimization — fit the data, respect a prior; the randomness we fight is probability; and the modern workhorses are learned. This appendix is that kit. Each section sketches one tool just far enough to read the chapters that use it, always with an eye on the imaging payoff, and points you to where to go deeper. It is deliberately tight: lean on the figures, and defer the proofs to the texts named in each section.

22.1.1 Linear algebra⧉

Vectors and matrices are the native language of imaging. A pixel grid is a vector; a blur, a color transform, a resampling is a matrix; and "solve the imaging problem" almost always means "solve a linear system." The recurring superpower, the one this section is really about, is the orthogonal basis: a coordinate system in which a hard operator becomes easy.

An image is a vector. A vector is just a length-$n$ list of numbers, which you can also picture as a point — or an arrow from the origin — in $n$-dimensional space. The key move of the whole book is to treat an image as one of these: stack its pixels into a single tall column (Figure 22.1.1). A one-megapixel grayscale photo is then a single point in a million-dimensional space, and everything we do to it — brighten, blur, sharpen, denoise — is geometry in that space. (Stacking is sometimes literal, in the code, and sometimes just a way to think; either way the algebra is the same.)

fig-image-as-vector — **Figure 22.1.1.** An image is a vector. A small pixel grid is "flattened" — its rows stacked end to end — into one tall column of numbers, a single point in a high-dimensional space (a megapixel image lives in a million dimensions). Brightening, blurring, and the rest are then operations on that one vector.

Inner product, norm, angle. The inner product $\langle u,v\rangle = u^\top v$ measures how much two vectors align; the norm $\|v\| = \sqrt{\langle v,v\rangle}$ is length; and $\langle u,v\rangle = 0$ means orthogonal (perpendicular). This one operation does an enormous amount of imaging work: it is how a cone "reads" a spectrum (project the spectrum onto the cone's sensitivity), how a filter responds to a patch, and how we measure the similarity between two images. Keep "inner product = how much of this is in that" as the intuition.

Matrices are linear maps. Write $y = Ax$: a matrix sends a vector to a vector, linearly — it respects sums and scaling, $A(\alpha x_1 + \beta x_2) = \alpha A x_1 + \beta A x_2$. For a $2\times 2$ $A$ you can see exactly what it does by watching it deform the unit square (and the grid around it) into a parallelogram (Figure 22.1.2): rotation, scaling, shear. A blur, an RGB-to-YCbCr color conversion, a downsampling, a perspective warp — every one is, or linearizes to, a single $y = Ax$. (One caution worth importing from the linear-image-processing chapter: matrix multiplication does not commute, $AB \ne BA$ in general, so the order of operations matters.)

fig-matrix-as-transform — **Figure 22.1.2.** A matrix is a linear map. A $2\times2$ matrix $A$ acting on the plane sends the unit square (and the grid) to a parallelogram — here a rotation, a scaling, and a shear shown side by side. Straight lines stay straight and the origin stays put; that is all "linear" means, and it is enough to describe blur, color conversion, and resampling.

Linear systems and inverse problems. Imaging forward models are read in the $y = Ax$ direction: a clean scene $x$ goes through some physical process $A$ (optical blur, mosaic sampling, projection) to produce a measurement $y$. Recovering $x$ from $y$ is then solving the system — inverting $A$. The catch, which sets up the rest of this appendix, is that $A$ is usually enormous, often rank-deficient (it throws information away — its null space is exactly what the measurement cannot see), and the data is noisy. So we almost never literally invert; we optimize (the next two sections). The vocabulary to keep: square vs over-/under-determined systems, rank, and null space.

Bases, projection, orthogonality. A basis is a choice of coordinate system; changing basis re-expresses the same vector in new coordinates. Projecting a vector $v$ onto a direction $u$ — its shadow on $u$ — is $\hat v = \frac{\langle u,v\rangle}{\langle u,u\rangle}\,u$ (Figure 22.1.3). Bases get dramatically nicer when they are orthonormal (mutually perpendicular unit vectors $\{e_i\}$): then the coordinates are just inner products, $x = \sum_i \langle e_i, x\rangle\, e_i$ — no system to solve, and energy is preserved (Parseval: the norm is the same in either basis). This is exactly why the book loves orthogonal bases (Fourier, orthonormal wavelets): they re-coordinatize an image for free. The flip side is a recurring source of pain in Color technology: the cones and display primaries form a non-orthogonal, non-negative basis, so reading off coordinates needs a dual basis and the clean symmetry breaks — much of color's trickiness is exactly this.

fig-vector-projection — **Figure 22.1.3.** Projection onto a direction. The vector $v$ casts a shadow onto the line spanned by $u$; the foot of the perpendicular is $\hat v = \frac{\langle u,v\rangle}{\langle u,u\rangle}u$, and the leftover $v-\hat v$ is orthogonal to $u$. In an orthonormal basis every coordinate is just such a projection — an inner product — which is why orthogonal bases are so convenient.

Eigenvectors and the SVD. Some directions a matrix only scales, never rotates — its eigenvectors, with eigenvalues the scale factors. For a symmetric matrix these directions are orthogonal (the spectral theorem), so diagonalizing it means finding the basis where the operator is nothing but per-axis scaling. The singular value decomposition $A = U\Sigma V^\top$ generalizes this to any matrix, even non-square: read it right to left as rotate ($V^\top$), scale by the singular values on the diagonal of $\Sigma$, rotate ($U$). Geometrically, $A$ takes the unit circle to an ellipse whose axes are the singular values (Figure 22.1.4). The SVD is the swiss-army knife behind low-rank approximation, PCA, the pseudo-inverse (how you "solve" a system that has no exact solution), and — crucially for us — the conditioning of deblurring: tiny singular values mean that inverting $A$ divides by something near zero, which is why deconvolution blows up noise (Linearity, Fourier, Aliasing and deblurring).

fig-svd-geometry — **Figure 22.1.4.** The SVD as rotate–scale–rotate. Any matrix $A = U\Sigma V^\top$ maps the unit circle to an ellipse: $V^\top$ rotates, $\Sigma$ stretches each axis by a singular value $\sigma_i$, and $U$ rotates again. The singular values are the ellipse's semi-axes; a small $\sigma_i$ is a direction $A$ nearly collapses — exactly where inverting $A$ amplifies noise.

The one move to remember

Almost every appearance of linear algebra in this book is the same trick: find the basis where a complicated operator becomes diagonal (just scaling), do the easy thing there, and come back. Convolution is diagonalized by the Fourier basis (Linearity, Fourier, Aliasing and deblurring); a covariance is diagonalized by its eigenvectors (PCA), which for natural images come out looking like Fourier modes anyway; pyramids and wavelets are change-of-basis to make scale explicit (Linear pyramids and wavelets). Same idea each time.

💡 Big lesson. Whenever possible, express a linear operation in a basis where it is diagonal. The "diagonalize / go to Fourier" maxim that runs through the book (Big Lessons) is, at bottom, this linear-algebra fact — and this section is where it is planted. For the deeper treatment, Strang's Linear Algebra and Learning from Data and chapters 2–4 of Deisenroth, Faisal & Ong are the companions.

22.1.2 Calculus: derivatives, gradients, integrals⧉

Three calculus ideas earn their keep in imaging. Derivatives measure change — and a change in image intensity is an edge. Gradients point uphill — so the negative gradient is the direction to walk to minimize a loss, which is how everything gets optimized. And integrals accumulate — a pixel value is, physically, an integral of light. That is most of what we need; the rest is the chain rule, stated once so the machine-learning section can lean on it.

Derivative = slope = rate of change. $f'(x) = \lim_{h\to0}\frac{f(x+h)-f(x)}{h}$ is the slope of the tangent (Figure 22.1.5, left). In a discrete image we approximate it with a finite difference — one pixel minus its neighbor — and that difference is large exactly where intensity changes fast, i.e. at edges. This is the whole basis of Sobel gradients, edge detection, and unsharp sharpening (Neighborhood operations and convolution).

Partial derivatives and the gradient. For a function of several variables $f(x_1,\dots,x_n)$, the gradient $\nabla f = (\partial f/\partial x_1, \dots, \partial f/\partial x_n)$ stacks the partials into a vector. It points in the direction of steepest ascent, and its length tells you how steep (Figure 22.1.5, right: arrows climbing a contour map). For an image, the gradient at a pixel points across the edge — its direction is the edge orientation and its magnitude is the edge strength.

fig-derivative-gradient — **Figure 22.1.5.** Derivative and gradient. Left: in 1-D, $f'(x)$ is the slope of the tangent line at $x$. Right: in 2-D, the gradient $\nabla f$ is a vector at each point pointing uphill (steepest ascent), drawn as arrows on a contour map — longer where the surface is steeper. For an image, that arrow points across the nearest edge.

Jacobian and Hessian — the multivariable second order. Two more objects appear the moment a function takes or returns vectors. A map that sends a vector to a vector, $\mathbf f:\mathbb R^n\to\mathbb R^m$, has a Jacobian $J$ — the $m\times n$ matrix of all first partial derivatives, $J_{ij}=\partial f_i/\partial x_j$. It is the best linear approximation of $\mathbf f$ near a point (the gradient is just the Jacobian of a scalar function, the $m=1$ case), and composing two maps multiplies their Jacobians: that is the vector form of the chain rule that backpropagation runs on, and the way a geometric warp is linearized locally. When the function is scalar but we care about its curvature, the Hessian $H$ is the $n\times n$ matrix of second partial derivatives, $H_{ij}=\partial^2 f/\partial x_i\partial x_j$. At a point where $\nabla f=\mathbf 0$ the Hessian says which kind of stationary point it is: positive-definite (curving up in every direction) means a genuine minimum, indefinite means a saddle. It is also what second-order optimizers exploit — Newton's method steps by $H^{-1}\nabla f$ instead of a fixed $-\eta\,\nabla f$, folding the local curvature into the step. In practice you meet the Jacobian constantly (anywhere a warp or a network is linearized) and the Hessian more rarely, mostly to reason about minima.

Chain rule. The derivative of a composition: $\frac{d}{dx}f(g(x)) = f'(g(x))\,g'(x)$. This unassuming rule is the engine of backpropagation. A deep network is a long composition of functions, and training it means differentiating a loss back through that whole stack — repeated chain rule, applied automatically. We state it here so the ML section can simply invoke it.

Integrals = accumulation. An integral is area under a curve — a continuous sum. In imaging this is not abstract: a pixel value is an integral, the scene's radiance accumulated over the sensor element's area, over the cone of incoming directions, over the exposure time, and over the spectral response of the color filter (Image formation and linear perspective). We constantly slide between the continuous picture (the integral) and the discrete one (a sum over samples) — that slide is exactly what sampling and Resampling are about.

Convexity, minima, and why gradients matter. A smooth function is at a minimum (or maximum, or saddle) where $\nabla f = 0$. If the function is convex — bowl-shaped, no false bottoms — then $\nabla f = 0$ pins down the global minimum, and you can reach it by simply walking downhill along $-\nabla f$. That is the bridge to the next section: optimization is "follow the negative gradient until the gradient is zero."

💡 Big lesson. Many imaging quantities are multiplicative — illumination $\times$ reflectance, or contrast as a ratio — and taking a $\log$ turns those products into sums and turns derivatives into relative (percentage) change. That is one reason log space keeps reappearing, from exposure stops to tone mapping (Big Lessons). For the calculus itself, chapter 5 of Deisenroth, Faisal & Ong (vector calculus, aimed at exactly this use) is the companion.

22.1.3 Optimization and regression⧉

Here is the secret that organizes most of computational photography: nearly every task is "find the image, or the parameters, that best explain the measurements." Concretely, minimize a loss that is a data-fit term plus a prior. Two tools cover the ground — least squares (when there's a closed form) and gradient descent (when there isn't) — and regularization is how we keep ill-posed inverse problems from exploding.

The pattern: loss = data term + prior. Almost every recovery problem has the shape

$$ \min_x \ \underbrace{\|Ax - y\|^2}_{\text{fit the measurement}} \ +\ \lambda\, \underbrace{R(x)}_{\text{prefer a plausible } x}. $$

Denoising, deblurring, demosaicking, HDR merging, white balance — same skeleton every time, only the forward operator $A$ and the prior $R$ change. Recognize this and half the book's algorithms become variations on one theme.

Least squares and the normal equations. Fitting a model — a line, a plane, a response curve — by minimizing the sum of squared residuals (Figure 22.1.6) has a clean closed-form solution, the normal equations $A^\top A\,\hat x = A^\top y$ (and when $A^\top A$ is singular, the SVD-based pseudo-inverse from the last section steps in). Why squared error specifically? Two reasons that recur: it is the maximum-likelihood estimate under Gaussian noise (a tie to the probability section), and squaring keeps the derivative linear, so the math stays tractable. Its weakness — squared error is dominated by a few large residuals, so it is fragile to outliers — is exactly what motivates the robust losses that appear later.

fig-least-squares — **Figure 22.1.6.** *Least squares. A cloud of points, the fitted line, and the vertical **residuals** — the gaps between data and fit. Least squares chooses the line that minimizes the sum of the* squared residuals; the closed-form answer is the normal equations $A^\top A\hat x = A^\top y$. Squaring makes the math linear but also makes the fit sensitive to outliers.

Gradient descent. When the problem is too big for a closed form, or nonlinear, we walk downhill: $x_{t+1} = x_t - \eta\,\nabla f(x_t)$ (Figure 22.1.7). The learning rate $\eta$ is the step size — too large and you overshoot or diverge, too small and you crawl. Three variants in one breath: stochastic gradient descent uses a random minibatch of data per step (how neural nets are actually trained, and why each step is cheap); momentum and Adam smooth and adapt the steps to go faster. If the loss is convex there is one basin and you find it; deep networks are wildly non-convex, full of local minima — and yet gradient descent works astonishingly well anyway, one of the field's happy mysteries.

fig-gradient-descent — **Figure 22.1.7.** Gradient descent. On a bowl-shaped loss surface (contour map), each step moves along $-\nabla f$, the steepest-downhill direction, by a distance set by the learning rate $\eta$. The path zig-zags down toward the minimum. Too large an $\eta$ overshoots the valley; too small and the path barely moves.

Ill-posedness and regularization. Inverse problems are usually ill-posed: many different $x$ explain the same $y$ (the null space again), or small measurement noise gets amplified into huge errors. The cure is the prior. A regularizer $R(x)$ encodes what a plausible image looks like — smoothness (Tikhonov), sparsity or piecewise-flatness ($\ell_1$, total variation), or, in the modern world, a learned prior. The weight $\lambda$ dials between trusting the data and trusting the prior. This is the principled fix for the noise-amplification we saw in deblurring (Linearity, Fourier, Aliasing and deblurring): regularization is what tames the division by small singular values.

Overfitting and the bias–variance trade-off. Fit your model too closely to the measurements and you fit the noise too, so it generalizes badly (Figure 22.1.8: a low-order fit underfits, a sky-high-order polynomial overfits, something in between is right). The dials are model capacity, the regularization strength, and an honest held-out validation set to tell which side you're on. Note that $\lambda$ here is the same dial as the regularization $\lambda$ above — and this is precisely the bridge to machine learning, where overfitting is the central worry.

fig-overfitting — **Figure 22.1.8.** Under-, just-right, and over-fitting. The same noisy points fit by polynomials of increasing degree: too low (underfit — misses the trend), about right (captures the trend), and too high (overfit — wiggles through every point, including the noise). The overfit curve has low error on these points but predicts badly on new ones.

💡 Big lesson. Because the measurement genuinely loses information, the prior is not optional — it is what makes recovery possible at all. This is the sense in which, across the book, "quantization is rarely the issue; the prior is" (Big Lessons): what limits a reconstruction is usually not the bits you kept but the assumptions you bring. The standing references here are Boyd & Vandenberghe's Convex Optimization, chapter 7 of Deisenroth, Faisal & Ong, and Strang on least squares and gradient descent. This whole machinery gets a chapter of its own in Single-image computational photography.

22.1.4 Probability and information theory⧉

Noise is random, so measuring it and removing it is probability. The best estimate given a noisy measurement and a prior is Bayes — and, satisfyingly, it turns out to be the same optimization we just did. And how many bits an image fundamentally needs is information theory, which is the theory behind compression. Yes, probability matters: it underlies noise, denoising, and every learned method. Keep it light here; the goal is fluency, not rigor.

Random variables, distributions, expectation, variance. A pixel reading is a random variable scattered around the true value. Its mean $\mu = E[X]$ is the value we actually want; its variance $\sigma^2 = E[(X-\mu)^2]$ (and its square root, the standard deviation $\sigma$) is the noise. Two facts do most of the work downstream: expectation is linear ($E[aX+bY] = aE[X]+bE[Y]$ always), and the variances of independent quantities add.

The two distributions we actually use. The Gaussian $\mathcal N(\mu,\sigma^2)$ — the bell curve (Figure 22.1.9), with its $\pm\sigma$ width — is everywhere: it models read noise, it is what the central limit theorem pushes sums toward, and it is the reason squared error is the "natural" loss (Gaussian noise $\Leftrightarrow$ least squares). The Poisson distribution governs photon counting: light arrives as discrete photons, and counting $N$ of them has $\mathrm{variance} = \mathrm{mean}$, so the noise std is $\sqrt N$. The consequence is the single most important fact about sensor noise: signal-to-noise ratio is worse in the shadows, where few photons land (Image formation and linear perspective).

fig-gaussian — **Figure 22.1.9.** The Gaussian (normal) distribution. The bell curve centered at the mean $\mu$, with width set by the standard deviation $\sigma$; about 68% of the probability lies within $\pm\sigma$. It models read noise, it is the limit the central limit theorem drives sums toward, and choosing it as the noise model is exactly what makes squared-error the natural loss.

Independence and averaging. Average $N$ independent noisy measurements of the same thing and the variance drops by a factor of $N$ — so the std drops by $\sqrt N$. This one fact is the engine of an enormous amount of computational photography: it is why we stack frames, why a longer exposure or a bigger pixel is cleaner, and the seed of the whole "quality through quantity" thread.

Conditional probability and Bayes. Bayes' rule, $p(x\mid y) \propto p(y\mid x)\,p(x)$, says the posterior (what we believe about the scene $x$ after seeing measurement $y$) is the likelihood (how the measurement is produced, the noise model) times the prior (what scenes are plausible), suitably normalized (Figure 22.1.10). Taking the most probable $x$ — MAP estimation — and writing it in $\log$ form turns it into $\min_x [-\log p(y\mid x) - \log p(x)]$, which is exactly the "data term + prior" loss of the optimization section. The two viewpoints are one thing seen twice. Discounting the illuminant (white balance), denoising, and demosaicking are all, at heart, Bayesian estimates.

fig-bayes-posterior — **Figure 22.1.10.** Bayes' rule as a product of curves. A broad **prior** $p(x)$ (what we expected) times a **likelihood** $p(y\mid x)$ (what this noisy measurement says) gives the **posterior** $p(x\mid y)$ — narrower than either, and pulled toward wherever they agree. MAP estimation picks the posterior's peak; in $\log$ form that is the "data-fit + prior" loss of the optimization section.

Information and entropy. The entropy $H = -\sum_i p_i \log p_i$ is the average "surprise" of a source — equivalently, the minimum number of bits per symbol needed to encode it. It is largest when outcomes are most uncertain: for a single coin flip, $H(p)$ peaks at $p = 0.5$ and vanishes at $p=0$ or $1$ (Figure 22.1.11). Low entropy means predictable means compressible — that is the theory beneath the entropy coding in File formats and compression. (Mutual information, "bits shared between two variables," is the same idea applied to dependence.) Kept light, this is enough to see why JPEG can shrink a photo of the sky and why it cannot shrink random noise: structure is low-entropy, noise is high-entropy.

fig-entropy-coin — **Figure 22.1.11.** Entropy of a coin. The entropy $H(p) = -p\log p - (1-p)\log(1-p)$ of a single biased coin, plotted against the probability $p$ of heads. It is zero at $p=0$ and $p=1$ (a certain outcome carries no information) and peaks at one bit when $p=0.5$ (maximum uncertainty). Predictable sources have low entropy and compress well.

💡 Big lesson. Averaging $N$ independent samples cuts the noise std by $\sqrt N$: quality through quantity, the seed of all multi-shot imaging (Denoising, Big Lessons). For the deeper story, MacKay's Information Theory, Inference, and Learning Algorithms (free, and superb on exactly the Bayes-meets-coding view above) and chapter 6 of Deisenroth, Faisal & Ong are the companions.

22.1.5 Machine learning and deep learning⧉

Learned methods now dominate denoising, demosaicking, super-resolution, tone mapping, and much else, so a reader needs the vocabulary. The good news: it is the same vocabulary as the optimization section, scaled up. Supervised learning is fitting a function by minimizing a loss; neural networks are composable differentiable building blocks; and training is gradient descent driven by backprop. The deep shift is conceptual, not mathematical — the hand-designed prior becomes a learned one.

Supervised learning = curve fitting, scaled up. Given example pairs $(x_i, y_i)$, find a model $\hat y = f_\theta(x)$ whose parameters $\theta$ minimize the average loss

$$ \mathcal L(\theta) = \tfrac1N \sum_i \ell\big(f_\theta(x_i),\, y_i\big). $$

That is precisely the regression of two sections ago, now with millions of parameters instead of a handful. Split the data into train / validation / test; the goal is generalization to new inputs, not memorization of the training set — the overfitting worry of Figure 22.1.8, made central.

The artificial neuron and layers. A single neuron computes $\sigma(w^\top x + b)$: a weighted sum of its inputs (an inner product! — the linear-algebra section again), shifted by a bias $b$, then squashed through a nonlinearity $\sigma$ such as a ReLU or sigmoid (Figure 22.1.12). Stack many neurons into a layer, stack layers into a multilayer perceptron, and a deep network is a long composition $f_\theta = f_L \circ \cdots \circ f_1$ of such blocks. The nonlinearity between layers is what makes the stack more expressive than one big matrix — without it the whole thing would collapse to a single linear map.

fig-neuron — **Figure 22.1.12.** An artificial neuron. Inputs $x_1,\dots,x_n$ are combined by a weighted sum $w^\top x + b$ — an inner product plus a bias — and passed through a nonlinearity $\sigma$ (e.g. ReLU) to give the output $\sigma(w^\top x + b)$. Stacking many of these into layers, and layers into a deep network, yields a flexible composition of simple differentiable parts.

Training = gradient descent + backprop. We minimize $\mathcal L(\theta)$ by stochastic gradient descent, $\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal L$. The gradient $\nabla_\theta\mathcal L$ is computed by backpropagation, which is just the chain rule (the calculus section) applied efficiently through the layer-by-layer composition. In practice an autodiff framework like PyTorch does this for you — you write the forward computation, and the gradients come back for free (the programming section, next). The whole training loop is "predict, measure the loss, backprop, step downhill," repeated over minibatches.

fig-train-val-loss — **Figure 22.1.13.** Training versus validation loss over epochs. The training loss falls steadily, but the validation loss falls and then turns back up: past that minimum the network is fitting the training set's noise (overfitting), and the gap between the two curves grows. The dashed line marks the best model — where you would **early-stop**. This is Figure 22.1.8's overfitting, seen through the lens of training dynamics.

Convolutional networks — the imaging-native architecture. The workhorse for images is the convolutional neural network: filters that are local (each output looks at a small neighborhood) and weight-shared (the same filter slides across the image), making them shift-equivariant. In other words, a CNN learns convolutions (Neighborhood operations and convolution) instead of having them hand-designed — a natural fit for images. For image-to-image tasks (denoise, demosaic, super-resolve) the dominant shape is an encoder–decoder / U-Net, which processes the image across scales, echoing the pyramids of Linear pyramids and wavelets. We keep this brief and forward-point to the advanced books for the real depth.

What changes versus classical methods. The skeleton is unchanged — still "loss = data-fit + prior." What changes is that the prior is now learned from a dataset rather than written down by hand. The upside is state-of-the-art quality; the costs are real and worth naming: appetite for data and compute, brittle failure modes, and — the one to flag for imaging — hallucinated detail, where the model invents plausible texture that was never measured. That last is not just a quality issue but an ethical and evidentiary one, taken up in Human factors and the art of photography.

💡 Big lesson. A learned method swaps a hand-tuned regularizer for a prior learned from data — the very same "data-fit + prior" loss of the optimization section, now data-driven (Optimization and regression). The companions are Prince's Understanding Deep Learning (free, modern, beautifully illustrated) and Goodfellow, Bengio & Courville's Deep Learning for the deeper standard. This section is the hinge between the book's "hand-designed operator" chapters and its "learned operator" ones across Books 2–3.

22.1.6 Programming: Python, C++, and PyTorch⧉

This book ships in two languages from one source — it renders as a Python book or a C++ book (Book Generation Process) — so this section sets expectations rather than teaching a language. The mental model is short: Python/NumPy for clarity and prototyping, C++ for performance, PyTorch for tensors plus autograd plus GPU. The one idea that unifies all three is the one from the linear-algebra section: an image is an array, and operations are whole-array, not pixel-by-pixel.

Why two languages. Python with NumPy reads almost exactly like the math — array operations, vectorization, broadcasting — which is why it's the language for explaining and prototyping. C++ is the same operations when speed and memory matter: the camera ISP, real-time pipelines, Halide-generated kernels. The dual-language build lets a listing exist once and render in whichever language the reader chose; this section is the shared mental model under both.

Images as arrays / tensors. An image is an $H \times W \times C$ array — a NumPy ndarray, a C++ buffer, or a torch.Tensor. The cardinal rule is to operate on the whole array at once (vectorized) rather than looping over pixels: img * 0.5 to darken, a + b to blend, a slice to crop. That is the linear-algebra "image as a vector" view made executable, and it is both clearer and far faster than a Python for-loop. Know your indexing, slicing, and broadcasting rules; and know that arrays are stored row-major in memory, so loop order and access pattern decide cache behavior — a thread we pick up in Performance engineering and Halide.

PyTorch in one paragraph. A PyTorch tensor is an array that can live on CPU or GPU, plus one extra power: autograd. Every operation you do is recorded in a graph, so a single call to loss.backward() hands you the gradient of the loss with respect to every input — the backprop of the ML section, automated. This is why modern imaging code is written in PyTorch even when nothing is being learned: you get free derivatives (for any optimization, not just training) and free GPU acceleration.

Numerical hygiene — the bugs that actually bite

A short checklist that will save you hours: watch floats vs ints (integer division and overflow silently corrupt images); mind value ranges and clip deliberately ($[0,1]$ vs $[0,255]$); above all, do arithmetic in linear space, not gamma space — blurring or averaging gamma-encoded values is physically wrong (the encoding Big Lesson; cross-ref Big Lessons). Beware in-place edits vs copies, and the NaNs / infs that creep in from divide-by-zero or $\log 0$. See Developing, Testing and Debugging.

Pointers to practice. The dual-language listings, the book's imageops / figstyle conventions (Conventions, Notation & Style), and the warmups in Exercises & Experiments make all of this concrete — start with the simplest possible loop: load an image, confirm it's just a NumPy array of numbers, do some arithmetic on it, and save it back out. Everything else in the book is that, scaled up.

22.1 Refreshers🔗⧉

22.1.1 Linear algebra🔗⧉

22.1.2 Calculus: derivatives, gradients, integrals🔗⧉

22.1.3 Optimization and regression🔗⧉

22.1.4 Probability and information theory🔗⧉

22.1.5 Machine learning and deep learning🔗⧉

22.1.6 Programming: Python, C++, and PyTorch🔗⧉