Computational Photography, an AI-powered Slopendium

▸Part 4 COMPUTATIONAL TOOLS⧉

The chapters before this one built images up from physics, perception, and the basic processing pipeline. This part assembles the **general-purpose computational tools** the rest of the book reaches for again and again: casting a recovery task as a **linear inverse problem** and solving it by regression; replacing a hand-designed operator with one **learned from data**; and **generating** plausible images with modern generative models. They are gathered here, before the single-image applications, because nearly every later part — deblurring, super-resolution, compositing, HDR, video — is at bottom an application of one of these three.

▸4.1 Linear Inverse Problems and Regression⧉

`fig-correspondence-then-transport` · the L17 spine — one scene displaced (two views / two faces / two frames / one long frame), each resolved by estimating a coordinate map (homography, morph field, flow, track, motion vector, camera path) then transporting pixels by one shared inverse-warp engine; the finding is hard, the moving is plumbing 🟨

`fig-image-as-vector` · an image *is* a vector: a 5×5 pixel grid unrolled into a tall column vector; n = H·W (Linear algebra) 🟩

`fig-least-squares` · line fit minimizing squared vertical residuals (Optimization & regression) 🟩

`fig-deblur-preview` · the deblur preview: sharp → blurred + noise → naive inverse amplifies noise; MTF 🟨

`fig-gradient-descent` · descent path on a convex-bowl contour (−∇f steps); too-large step overshoots (Optimization) 🟩

`fig-pyramid-reconstruction` · RECONSTRUCTION on a real photo: collapse the Laplacian pyramid coarse→fine (residual → +L_k each octave → exact image), plus an all-black per-pixel error panel (max ~1e-16 = lossless) (Reconstruction, BASIC) ✅

`fig-focus-stacking` · optics-chapter illustrative figure (07-07 Figure 4): a stepped-focus stack → sharpness selection → all-in-focus composite. Synthetic per-slice defocus on one photo (`sourced/corn-cobs.jpg`, © Frédo Durand) — license-safe. The full real-data treatment lives in part-08 `fig-focalstack-*` ✅

equations

forward model $y = Ax$

least squares $\hat x = \arg\min_x \tfrac12\|Ax - y\|^2$

normal equations $A^\top A\,x = A^\top y$

gradient of the data term $\nabla f(x) = A^\top(Ax - y)$

gradient step $x_{t+1} = x_t - \eta\,A^\top(Ax_t - y)$

convolution form $A x = k * x$ and $A^\top y = \tilde k * y$ (flipped kernel $\tilde k$)

per-frequency inverse $\hat x(\omega) = \hat y(\omega)/\hat k(\omega)$ (when it diagonalizes)

▸4.1.1 Blur is linear, so deblurring is inversion⧉

• **the setup**: a sharp scene $x$, a **known** blur kernel $k$ (camera-shake path, defocus disk, lens PSF, motion streak), and the photo we actually got, $y = k * x$. We are **non-blind** here — the kernel is *given* (measured, or recovered earlier); recovering the kernel *too* is **blind** deblurring, deferred to [[Advanced inverse problems]].

• **the one idea**: blur is a **linear, shift-invariant** operation, so it is a **matrix** $A$ acting on the image: $y = Ax$. Deblurring is therefore *inversion* — undo a matrix multiply. Everything below is what "undo" actually means when $A$ is a million-by-million matrix you must never write down. [deblur-linear-inversion figure]

• **three views of the same operator** (the figure ties them together): (i) a **convolution** $y = k * x$ — the intuitive, local view; (ii) a **giant matrix** $A$ — huge ($N\times N$ for $N = W\cdot H$ pixels), but **sparse** (each output touches only the kernel's support) and **structured** (every row is the same kernel, shifted — a *Toeplitz / circulant* matrix); (iii) a **map of vectors** $x \mapsto Ax$ in the image vector space. Same object; we pick whichever view makes the next step easy.

• **why not literal inversion?** Two reasons, both fatal: (1) $A$ is astronomically large — for a 12-megapixel image $A$ is $12\text{M}\times 12\text{M}$; you can't store it, let alone invert it. (2) Even conceptually, $A^{-1}$ is **ill-conditioned** — blur kills high frequencies, so the inverse must *amplify* them, and with them any noise (the deblur preview from the last chapter: naive inverse → exploding noise). We sidestep both: never form $A$ (this chapter), and tame the conditioning with a **prior** ([[Advanced inverse problems]]). [fig-deblur-preview — reuse]

• 💡 **Big lesson:** **diagonalize when you can.** A general blur matrix is a tangled mess to invert, but **Fourier diagonalizes convolution** ([[Linearity, Fourier, Aliasing and deblurring]]): in the frequency basis $A$ becomes a *diagonal* operator — multiply each frequency by $\hat k(\omega)$ — so its inverse is just **per-frequency division** $\hat x(\omega) = \hat y(\omega)/\hat k(\omega)$. That one fact is both the cleanest deblur (the FFT solver, below) and the cleanest explanation of *why* deblurring is hard: where $\hat k(\omega) \approx 0$ (a frequency the blur erased), dividing blows up. *(First appearance: [[Linearity, Fourier, Aliasing and deblurring]]; recurs here — full callback box.)*

▸4.1.2 Images as vectors, and the notation overload⧉

• **refresher (images as vectors)**: stack every pixel (and channel) of an image into one long **column vector** $x \in \mathbb{R}^N$, $N = W\cdot H\cdot C$ — exactly how the image already lives in memory (C++ planar $idx = c\,WH + y\,W + x$; NumPy `(H, W, C)` flattened). This **serialization** is mostly *conceptual*: it lets us speak the language of linear algebra (and call its libraries / solvers) without ever materializing the vector or the matrix. [image-as-vector figure]

• **what the vector view buys us**: addition, scaling, the **dot product** $\langle a, b\rangle = \sum_i a_i b_i$ (= correlation / projection), norms $\|r\|^2 = \langle r, r\rangle$ — and the fact that **every linear image operation is now a matrix**: exposure (a diagonal $A$), convolution (a Toeplitz/circulant $A$), downsampling, gradient. One frame, many operations.

• ⚠️ **notation overload (call it out explicitly)**: in linear algebra the unknown image is universally written $x$ — clashing with $x$ = the horizontal pixel coordinate. We keep both (they're standard) and **disambiguate by context**: $x$-the-vector is the whole serialized image; $x$-the-coordinate indexes a pixel. The measurement is $y$ (also overloaded with the row coordinate). *Queue for [[Notations]]:* register $x$ = image-as-vector, $y$ = measurement vector, $A$ = forward (system) operator/matrix, $k$ / $\tilde k$ = blur kernel / flipped kernel, $r = Ax - y$ = residual, $L$ = regularizer (Laplacian) operator.

▸4.1.3 Regression: deblurring as least squares⧉

• **the reframe**: don't ask "solve $Ax = y$ exactly" (with noise there may be **no** exact solution, and the inverse is unstable). Ask instead: **which sharp image $x$, once blurred, best reproduces the photo?** That is **regression** — fit $x$ to the data by minimizing squared error: $\hat x = \arg\min_x \tfrac12\|Ax - y\|^2$. Deblur, denoise, demosaic, HDR-merge, white-balance all share this skeleton; only $A$ changes. [fig-least-squares — reuse: points / residuals / fit, here with $A$ = blur]

• **why squared error**: it makes the problem **linear** (the gradient is linear in $x$), it is the maximum-likelihood fit under Gaussian noise (ties to Refreshers → Probability), and it has a clean closed form. Its blindness to outliers and its need for a prior come later ([[Advanced inverse problems]]).

• **the normal equations**: setting the gradient to zero, $\nabla f(x) = A^\top(Ax - y) = 0$, gives $\boxed{A^\top A\,x = A^\top y}$ — the **normal equations**, the least-squares analogue of "$Ax = y$." $A^\top A$ is **symmetric positive (semi-)definite**, which is exactly the structure the iterative solvers below exploit.

• **what $A^\top$ *is*, concretely**: the **transpose of a convolution by $k$ is a convolution by the flipped kernel** $\tilde k(\mathbf{u}) = k(-\mathbf{u})$ — i.e. correlation. So $A x = k * x$ (blur) and $A^\top y = \tilde k * y$ (the "back-projected" blur). This is the crucial hook: **both $A$ and $A^\top$ are just convolutions**, so we can apply them without ever building a matrix.

• **deferred (scope boundary)**: noiseless least squares over-fits noise — the cure is a **regularizer / prior**, $\min_x \tfrac12\|Ax - y\|^2 + \lambda R(x)$, turning the normal equations into $(A^\top A + \lambda L)\,x = A^\top y$. We name it here and **defer the why/how to [[Advanced inverse problems]]** (Tikhonov, total-variation, Wiener) and the *learned* prior to [[Machine learning]]. The solver machinery below is unchanged — only the operator picks up a $+\lambda L$.

▸4.1.4 Matrices without forming matrices: gradient descent and conjugate gradient⧉

• **the central trick of the chapter**: the normal-equation matrix $A^\top A$ is $N\times N$ and we will **never build it**. Every iterative solver below needs **only one primitive**: given a vector $v$, return $A^\top A\,v$ — computed as **two convolutions**, $A^\top(A v) = \tilde k * (k * v)$. *Matrix-free.* `apply_blur(v) = conv(v, k)`, `apply_blur_T(v) = conv(v, flip(k))`; the solver calls these and nothing else.

• **gradient descent**: walk downhill on $f(x)=\tfrac12\|Ax-y\|^2$. The gradient is $\nabla f(x) = A^\top(Ax - y)$ — **one forward blur, one residual, one transposed blur** per step — then $x_{t+1} = x_t - \eta\,\nabla f(x_t)$. Correct and trivial to code, but **slow**: the least-squares bowl is **elongated** (badly conditioned, because blur squashes high frequencies), so plain gradient descent **zig-zags** down the valley. [fig-gradient-descent — reuse: steps down a quadratic bowl]

• *pseudocode:* a matrix-free gradient-descent loop (init $x{=}0$; iterate: residual $A x{-}y$, gradient $A^\top$ of it, step by $-\eta\nabla$) using only the two convolutions.

• **conjugate gradient (CG)** — the workhorse: because $A^\top A$ is **symmetric positive-definite**, CG solves the normal equations in far fewer iterations by choosing **conjugate** (non-interfering) search directions instead of greedily steepest ones — it never undoes earlier progress, so it cuts straight down the elongated valley. Like gradient descent it touches $A$ only through **apply-operator** calls (one $A$ + one $A^\top$ per iteration); unlike it, it converges in tens, not thousands, of steps. Cite Shewchuk's *"…Without the Agonizing Pain"* for the intuition. [deblur-cg-vs-gd figure — CG's conjugate path vs GD's zig-zag on the same bowl]

• **preconditioning**: CG is fast when the bowl is **round** (well-conditioned). A **preconditioner** $M \approx (A^\top A)^{-1}$ that is cheap to apply (e.g. a diagonal, or an FFT-based approximate inverse) **re-rounds** the bowl and slashes the iteration count — preconditioned CG is the practical default.

▸4.1.5 How we actually solve it — forward-ref to the solver toolkit⧉

• **the situation**: $A^\top A$ is **huge but sparse and structured**, so in practice we use **iterative / multiscale** solvers — gradient descent and conjugate gradient (above), plus **multigrid** (coarse-to-fine on a pyramid) and the **FFT solver** (when the operator diagonalizes in Fourier) — never a direct inverse. These are the shared workhorses of the whole part, so they get their **own chapter next** ([[#Efficient solvers]]): which solver wins when, and why each is matrix-free.

• **the payoff (why these two chapters are the foundation of the part)**: this **same solver machinery** — set up $y = Ax$, minimize $\|Ax - y\|^2$ (+ a prior later), solve matrix-free with CG / multigrid / FFT — recurs across **gradient-domain HDR, [[Poisson image editing|Poisson cloning]], colorization, matting**, and most optimization-based edits in this part. Learn it once, here and in [[#Efficient solvers]]; every later chapter is a new $A$ and a new prior on the same spine.

▸4.2 Efficient solvers⧉

The previous chapter cast recovery as the normal equations $A^\top A\,x = A^\top y$ and made the key point that we **never build $A$** — every solver needs only to *apply* the blur and its transpose. This chapter is the toolkit that actually does the solving: a handful of matrix-free, iterative / multiscale methods, and the rule for which one to reach for. The same machinery returns in [[Poisson image editing]], colorization, matting, and gradient-domain HDR — learn it once here.

equations

matrix-free operator pair $A x = k * x$, $A^\top y = \tilde k * y$

gradient step $x_{t+1}=x_t-\eta\,A^\top(Ax_t-y)$

preconditioned CG ($M\approx(A^\top A)^{-1}$)

FFT one-shot $\hat x(\omega)=\hat y(\omega)\,\overline{\hat k(\omega)}/\big(|\hat k(\omega)|^2+\lambda|\hat L(\omega)|^2\big)$

▸4.2.1 Matrix-free, iterative, multiscale — the situation⧉

• **the situation**: $A^\top A$ is **huge but sparse and structured**, so we use **iterative / multiscale** solvers, never a direct inverse. Everything below touches $A$ only through the matrix-free pair `apply_blur(v) = conv(v, k)` and `apply_blur_T(v) = conv(v, flip(k))`. Three (plus one) tools, ordered by when each wins.

▸4.2.2 Gradient descent and conjugate gradient (recap + preconditioning)⧉

• **gradient descent and CG** (developed in the previous chapter) are the starting point: GD walks downhill on $\tfrac12\|Ax-y\|^2$ but **zig-zags** on the elongated, ill-conditioned least-squares bowl; **conjugate gradient** chooses non-interfering directions and converges in tens, not thousands, of steps. Both use only apply-operator calls. [deblur-cg-vs-gd figure — CG's conjugate path vs GD's zig-zag]

• *pseudocode:* the operator-only CG loop (residual, conjugate search direction, exact step $\alpha$, conjugacy update $\beta$) — one $A^\top A$ matvec = two convolutions per iteration, no matrix formed.

• **conjugate gradient + preconditioner** — the general-purpose default; works for *any* $A$ you can apply (and apply-transpose), even when the boundaries or the operator aren't shift-invariant. A **preconditioner** $M \approx (A^\top A)^{-1}$ that is cheap to apply re-rounds the bowl and slashes the iteration count — preconditioned CG is the practical default.

▸4.2.3 Multigrid: coarse-to-fine on a pyramid⧉

• **multigrid (coarse-to-fine)**: low-frequency error is what iterative solvers kill *slowest*, and it is exactly what a **coarse grid** represents cheaply. Solve on a coarse image, **prolong** (upsample) the correction, refine on the fine image — a **V-cycle** down and up an image **pyramid** — and convergence becomes near-linear in pixel count. Directly reuses the Gaussian/Laplacian-pyramid machinery of [[Linear pyramids and wavelets]] (reduce / expand). [deblur-multigrid-vcycle figure — V-cycle down and up the pyramid]

▸4.2.4 The FFT solver: diagonalize when you can⧉

• **FFT solver**: when the domain is **rectangular with periodic (or Neumann) boundaries**, the blur (and a Laplacian regularizer) is **circulant → diagonal in Fourier** (L5). Then there is **no iteration at all**: transform, **divide per frequency** $\hat x(\omega) = \hat y(\omega)\,\overline{\hat k(\omega)} \,/\, \big(|\hat k(\omega)|^2 + \lambda\,|\hat L(\omega)|^2\big)$, inverse-transform. Two FFTs and a pointwise divide — the fastest solve when its boundary assumption holds; ties to [[Linearity, Fourier, Aliasing and deblurring]]. [deblur-fft-solver figure — deblur as per-frequency division]

▸4.2.5 Which solver wins — and why this is the foundation of the part⧉

• **picking one**: **CG + preconditioner** for any operator you can apply (the safe default); **multigrid** when low-frequency error dominates and you have a pyramid; **FFT** when the operator is shift-invariant with nice boundaries (the fastest, but most restrictive).

• **the payoff (why this chapter is the foundation of the part)**: this **same solver machinery** — set up $y = Ax$, minimize $\|Ax - y\|^2$ (+ a prior later), solve matrix-free with CG / multigrid / FFT — recurs across **gradient-domain HDR, [[Poisson image editing|Poisson cloning]], colorization, matting**, and most optimization-based edits in this part. Learn it once here; every later chapter is a new $A$ and a new prior on the same spine.

▸4.3 Machine learning⧉

`fig-learned-vs-handdesigned` · same inverse-problem skeleton, prior swapped: classical (data-fit + hand prior $\Phi$) vs learned ($f_\theta$ fit to data) (ML) ✅

`fig-synthetic-data-pipeline` · manufacture (degraded, clean) training pairs by simulating the camera/degradation (ML) ✅

reference the ML & deep-learning refresher ([[Refreshers#Machine learning and deep learning]]) up front; this chapter does **not** re-teach networks — it sets up *learning an operator from data* and the **data** that powers it. The concrete deep-network operators are the next chapter, [[Deep learning]].

equations

learned operator $\hat I = f_\theta(\text{measurement})$ trained by $\min_\theta \sum_i \ell\!\big(f_\theta(x_i),\,y_i\big)$

reuse the inverse-problem $\hat I=\arg\min_I \lVert AI-b\rVert^2+\lambda\,R(I)$ from above to contrast a hand-tuned $R$ vs a learned one

▸4.3.1 The framing: learned operators replace hand-designed ones⧉

• **the move.** Every task in this part so far had a **hand-designed operator** — an inverse filter, a smoothness prior, a dark-channel heuristic. Replace it with a function $f_\theta$ (a net) **fit to example data**: collect pairs (degraded input, desired output), minimize a loss, deploy the learned map. [fig-learned-vs-handdesigned]

• **same skeleton, data-driven prior.** Recall the inverse-problem template $\hat I=\arg\min_I \lVert AI-b\rVert^2+\lambda R(I)$ ([[Optimization and regression]]). A learned method keeps **data-fit + prior**; it just replaces the hand-tuned $R$ (or the whole solver) with something **learned from a dataset** — so the network *is* the prior.

• 💡 **Big lesson:** L8 — *a learned operator swaps a hand-designed prior for one learned from data.* Same data-fit + prior skeleton as the inverse-problem chapters; the prior is now fit to examples, not hand-tuned. Cost: data, compute, and **hallucinated** detail. (first appearance — full box here; the ML refresher carries the one-line callback) [[Big Lessons#L8]]

• **why now (and why it works).** Big datasets + GPUs + CNN/U-Net/transformer building blocks make fitting these maps practical; intuition over derivation (nets are the Refreshers' job). One-line nod to **the Bitter Lesson** (Sutton): general methods that scale with data/compute tend to win over hand-engineered ones — but flag the costs and failure modes rather than cheerlead.

• **scope.** This chapter sets the **framing** and the **data** that powers learned operators; the **operator zoo** itself — low-level pixel-to-pixel, mid/high-level predictors, generative translation, and learned metrics — is the next chapter, [[Deep learning]]. Architectures and training: the Refreshers throughout.

• refs: Sutton 2019 (The Bitter Lesson); Torralba/Isola/Freeman Part III; Szeliski ch. 5.

▸4.3.2 The data story: synthetic data, noise models, datasets⧉

The recurring lesson of this chapter: **the data is the method.** A learned operator is only as good as the pairs it trains on, and good pairs are often *manufactured*.

• **generating synthetic data.** Start from clean targets and **simulate the degradation** to make (input, target) pairs at scale — the cheap way to get supervised data when ground truth is otherwise unobtainable (you can't photograph the same scene "clean" and "noisy" perfectly registered). [fig-synthetic-data-pipeline]

• **realistic noise models (the crux).** A net trained on *additive Gaussian* noise fails on real photos — real sensor noise is **signal-dependent (Poisson shot + read)**, post-ISP, and demosaicked. Match the **real noise model** (cross-ref the shot-noise / Poisson discussion, Refreshers & Book 2) or the learned denoiser won't transfer. This is why Gharbi mines hard cases and Real-ESRGAN models a complex degradation chain.

• **reverse-engineering the camera pipeline (ISP).** To synthesize realistic raw/finished pairs you must **invert/replay the ISP** — black level, white balance, demosaic, tone curve, compression — so synthetic data looks like the camera's actual output. Ties to FlexISP / the camera-pipeline material.

• **standard datasets & benchmarks.** Name the workhorses and *what each measures*: ImageNet (pretraining/features), DIV2K (super-res), SIDD/DND (real-world denoising), MIT-Adobe FiveK (retouching), KITTI/NYU (depth), COCO (detection/segmentation). The benchmark *defines the task*; its biases become the model's biases. (Full index + links: [[Appendices#Datasets]].)

• **a note on self-supervision (mention, don't expand).** Two spots here sidestep the data problem rather than brute-forcing it: (1) **self-supervised denoising** (Noise2Noise / Noise2Void) trains *without clean targets at all* — a direct answer to "we can't photograph the clean version"; (2) **pretrained backbones** behind depth/detection/perceptual metrics often come from self-/weakly-supervised pretraining on unlabeled images. Both matter; neither needs its own section — deep treatment belongs to the ML appendix / advanced books.

• refs: Gharbi et al. 2016; Wang et al. 2021 (Real-ESRGAN degradation model); Heide et al. 2014 (FlexISP); Bychkovsky et al. 2011 (FiveK); Lehtinen et al. 2018 (Noise2Noise) — **to download**; queue dataset citations (DIV2K, SIDD, NYU-Depth) — **to download.**

▸4.4 Deep learning⧉

`fig-downsample-aliasing` · 3 panels: the full **high-res input** zone plate (cos r², rings finer toward the edge — what's being shrunk) → ÷4 naive decimation (drop samples) → moiré aliasing → ÷4 prefilter-then-decimate (Gaussian low-pass first) → clean ✅

`fig-colorization-classical-vs-learned` · Levin 2004 scribble-propagation vs Zhang 2016 fully-automatic colorization (ML) ✅

⬜ figure not yet created

fig-depth-anything (one photo → its monocular depth map) fig-depth-anything

`fig-gan-pix2pix` · paired image-to-image translation: edge map → photo (conditional GAN) (ML) ✅

`fig-metric-degradation-gallery` · one clean photo vs four degradations (1-px shift, low-q JPEG, noise, blur), each labelled with PSNR + SSIM — PSNR and SSIM mostly agree but disagree where L2 is blind (shift scores worst PSNR yet looks identical; blur keeps high PSNR yet softens texture) (Image metrics, BASIC) ✅

these are the **deep-network** realizations of the learned-operator framing ([[Machine learning]]); architectures and training are the Refreshers' job — here we survey *what gets learned*.

equations

minimal (survey chapter) — perceptual / feature loss $\ell_{\text{feat}}=\sum_l \lVert \phi_l(\hat I)-\phi_l(I)\rVert^2$ over deep features $\phi_l$ (LPIPS)

reuse the learned operator $\hat I = f_\theta(\text{measurement})$ from [[Machine learning]]

▸4.4.1 Low-level learned operators (pixel-to-pixel)⧉

The bread-and-butter: an image-to-image net (typically a **U-Net** / encoder–decoder) maps a degraded or partial image to a restored one. Each task below is the *classical* chapter's task, now learned. [fig-learned-task-zoo]

• **denoising.** Train on (noisy, clean) pairs → a learned denoiser that beats hand-tuned NLM/bilateral, *if* the training noise matches the real sensor noise (→ the data story, [[Machine learning]]). Forward-ref: the denoiser as a reusable prior ([[Super-resolution and image priors#Denoising as a universal prior â Plug-and-Play and RED|Denoising as a universal prior]] — PnP/RED) and diffusion as iterated denoising ([[Generative AI and diffusion]]).

• **demosaicking — and joint demosaick+denoise.** Gharbi et al.: a single net learns to go **mosaic → full RGB**, jointly with denoising, trained on hard cases mined from data — outperforming the hand-built interpolators of Book 2. The headline case that "learn the operator end-to-end" beats a decades-tuned pipeline stage. [fig-learned-task-zoo: mosaic→RGB cell]

• refs: Gharbi et al. 2016 (deep joint demosaicking & denoising).

• **super-resolution.** Pose low-res → high-res as a learned map. Lineage from **Freeman et al.** (example-based / patch-dictionary super-res — the pre-deep ancestor) to modern nets: **SwinIR** (transformer restoration backbone) and **Real-ESRGAN** (GAN-trained on a *realistic* synthetic degradation model for in-the-wild photos). Note: the deep treatment of SR + priors may live in [[Super-resolution and image priors]]; here it's one entry in the zoo with a forward-ref.

• refs: Freeman et al. 2002 (example-based SR); Liang et al. 2021 (SwinIR); Wang et al. 2021 (Real-ESRGAN).

• **colorization — classical vs learned, side by side.** Two paradigms: **Levin et al. 2004** propagates a few user **scribbles** by optimization (edge-preserving — "same color where intensity is similar", an *affinity* prior, cross-ref L4); **Zhang, Isola & Efros 2016** learns **fully automatic** colorization from a huge corpus, posed as classification over quantized colors to fight desaturation. Contrast: hand-designed propagation vs learned-from-data plausibility. [fig-colorization-classical-vs-learned]

• refs: Levin, Lischinski & Weiss 2004 (colorization by optimization); Zhang, Isola & Efros 2016 (colorful image colorization).

• **monocular depth.** Estimate depth from a *single* RGB image — ill-posed, so the net learns a strong **scene prior**. **MiDaS**: trained across many datasets via a **scale/shift-invariant** loss to mix incompatible depth sources; **Depth Anything**: scale further with massive (pseudo-labelled) data → robust zero-shot depth. Note it's *relative* depth, and forward-ref geometric/dual-pixel depth (Books on geometry / multi-image). [fig-depth-anything]

• refs: Ranftl et al. 2019 (MiDaS); Yang et al. 2024 (Depth Anything).

• **learned exposure / retouch (tone & color adjustment).** Learn a photographer's **global/local adjustment** from before/after pairs. **Bychkovsky et al. (MIT-Adobe FiveK)**: the dataset of expert retouches that made this learnable. **HDRnet** (Gharbi 2017): predict the *coefficients of a bilateral-grid affine transform* — fast, edge-aware, runs on a phone (cross-ref the learned bilateral grid in edge-preserving). **Exposure** (Yuanming Hu 2018): learn a sequence of *interpretable* retouching operations (curves, white balance) rather than raw pixels — keep as the "learn the *operations*, not the pixels" variant.

• refs: Bychkovsky et al. 2011 (FiveK); Gharbi et al. 2017 (HDRnet); Hu et al. 2018 (Exposure).

▸4.4.2 Mid- and high-level learned predictors⧉

Not pixel-to-pixel restoration but **prediction of semantic structure** from a single image — the inputs to retouching, autofocus, framing, and selection.

• **sky / face detection.** The classic "where is the sky / where are the faces" detectors that drive auto-enhance and metering; the pre-deep (Viola–Jones-era) lineage → modern detectors. Keep brief; they're enabling, not the focus.

• **face landmarks & pose.** Predict facial keypoints + head pose — feeds portrait retouching, relighting, AR, and the *forensics* angle (consistency checks against deepfakes — cross-ref ethics). (Resolves the "photo bio here?" stub: **cut** — photo-bio/EXIF belongs in file formats, not here.)

• **saliency / gaze prediction (what draws the eye).** ML supplies the *estimator* (pre-deep Itti–Koch → learned saliency trained on eye-tracking data), but the model and its uses (auto-crop/framing, retargeting, attention-aware compression, quality/aesthetics) now live with the other perceptual models in **[[Human factors#Computational models of perception]]** — moved there so the *human-factors* framing owns it. Mentioned here only as one more learned predictor.

• refs: queue — Viola & Jones 2004 (face detection); learned-saliency ref (SALICON / DeepGaze) — **missing from `refs/`, to download.** (Saliency refs proper: see [[Human factors#Computational models of perception]].)

▸4.4.3 Generative models: image-to-image translation⧉

When the target isn't a unique "correct" image but a **plausible** one, shift from regression to **generative** modelling.

• **why generative.** Many single-image tasks are **one-to-many** (colorization, super-res, inpainting) — averaging plausible outputs (L2) gives blur; a generative model samples a *sharp, plausible* answer instead. Motivates the next two.

• **GANs (adversarial training).** A generator vs a discriminator that learns the "real-image" prior → sharp, plausible detail (the engine behind Real-ESRGAN's realism). One-paragraph intuition; depth deferred to [[Generative AI and diffusion]].

• **conditional / paired & unpaired translation.** **Pix2Pix** (Isola et al.): paired image-to-image translation — edges→photo, label-map→scene, day→night — a generic "learn the map between two image domains" tool. **CycleGAN**: the **unpaired** version (cycle-consistency) when you lack aligned pairs (horses↔zebras, summer↔winter). [fig-gan-pix2pix]

• **the pre-deep ancestor.** Mention **Image Analogies** (Hertzmann et al. 2001): "A:A′ :: B:B′" learned a filter/style from a *single* example pair by patch matching — the same translation idea before deep nets (ties to texture synthesis / patch-match).

• **forward-ref.** Diffusion models now dominate generative image-to-image (text-to-image, editing) — full treatment in [[Generative AI and diffusion]]; here just the placement (GAN → diffusion lineage).

• refs: queue — Isola et al. 2017 (Pix2Pix), Zhu et al. 2017 (CycleGAN), Goodfellow et al. 2014 (GANs), Hertzmann et al. 2001 (Image Analogies) — **missing from `refs/`, to download.** Forward-ref: Rombach et al. 2022, Brooks et al. 2023 (in the diffusion chapter).

▸4.4.4 Learned perceptual metrics and losses⧉

How do we *measure* whether a restored/generated image is good — and how do we *train* for it? The classical metrics break exactly where learned methods shine, and the metric you optimize decides the sharpness.

• **back-ref: why L2 / PSNR / SSIM / VDP fall short.** Introduced in Book 2: per-pixel L2 / **PSNR**, structural **SSIM**, and perceptual **VDP** all compare *low-level* statistics. They are **blind to texture and high-level similarity** — a slightly shifted but perceptually-identical texture scores terribly; a blurry mean scores *well*. So they reward exactly the wrong thing for generative restoration. [fig-perceptual-metric]

• **LPIPS — deep feature distance.** Compare two images by the distance between their **deep network features** $\ell_{\text{feat}}=\sum_l\lVert\phi_l(\hat I)-\phi_l(I)\rVert^2$, calibrated to human similarity judgments. Tracks perceived similarity far better than L2; it's also usable as a **training loss** (cf. Gatys' feature/perceptual loss, neural style).

• **DreamSim — learned perceptual embedding.** A more recent metric tuned to *mid-level* / **holistic** similarity (layout, pose, semantic content), capturing judgments LPIPS misses. Keep as "the next step beyond LPIPS."

• **from metric to loss — perceptual losses buy sharpness.** Training on per-pixel **MSE/L1** regresses to the **blurry mean**; adding a **perceptual (feature/LPIPS) loss** restores texture and an **adversarial (GAN)** term sharpens onto the real-image manifold. Caveat: perception–distortion trade-off (**L10**) — extra sharpness is partly hallucinated, so keep a fidelity anchor. [fig-perceptual-loss-sharpness — same input restored under L2 vs L1+perceptual vs +adversarial; blurry → crisp]

• **a go-to loss mix** for restoration/super-res/deblur: weighted sum $\mathcal{L}=\lambda_1\lVert\hat I-I\rVert_1 + \lambda_p\,\ell_{\text{LPIPS}} + \lambda_a\,\mathcal{L}_{\text{adv}}$ — **L1** fidelity/colour anchor ($\lambda_1{=}1$), **LPIPS/VGG** perceptual for texture ($\lambda_p{\approx}0.1$–$1$), a **small adversarial** term for crisp highs ($\lambda_a{\approx}0.01$–$0.1$, added last). Track LPIPS not PSNR.

• one-line takeaway: **metrics are themselves learned now** — and which metric you optimize quietly defines what "good" means (and what the model hallucinates toward).

• refs: back-ref Book 2 (PSNR/SSIM/VDP); Gatys, Ecker & Bethge 2016 (feature/perceptual loss); Zhang et al. 2018 (LPIPS) — **to download**; Fu et al. 2023 (DreamSim) — **to download.**

▸4.5 Generative AI and diffusion⧉

`fig-genai-evaluate-vs-sample` · the generative leap (L11): a prior you can only *evaluate* ($\Phi$/denoiser) vs one you can *sample* ($x\sim p(x)$) (GenAI) ✅

`fig-diffusion-forward-reverse` · the two chains: forward $q$ adds Gaussian noise to a real photo; reverse $p_\theta$ denoises back (GenAI) ✅

`fig-diffusion-demo` · live interactive demo (web edition): type a prompt and watch the reverse process denoise from noise, step by step; static fallback is a noise→photo filmstrip (GenAI) ✅

`fig-denoiser-as-prior-spectrum` · one PnP/RED prior slot, ever-stronger denoisers: hand-built → classical → learned → diffusion score (GenAI / Super-res) ✅

`fig-latent-diffusion` · encode → diffuse in a compressed latent → decode; prompt conditions by cross-attention (Stable Diffusion) (GenAI) ✅

`fig-posterior-sampling` · generative priors for inverse problems: one degraded $y$ → many plausible samples $x\sim p(x\mid y)$ (GenAI) ✅

`fig-conditioning-controlnet` · one prompt + a spatial hint (edges / pose / depth) → a generated image obeying both (GenAI) ✅

This chapter is the **generative limit** of the two before it. *Machine learning* replaced a hand-designed prior with a **learned** one (L8); *Super-resolution* showed the prior is **not optional** and that a **denoiser is a universal prior** you can plug into any solver (L10, PnP/RED, and the punchline that the **score is a denoiser**). Here that same prior becomes something you can **sample from** — a generative model — and the canonical one, **diffusion**, is literally that denoiser run in a loop. We reference the ML / deep-learning refresher ([[Refreshers#Machine learning and deep learning]]) for U-Nets, transformers, and attention; this chapter does **not** re-teach networks — it surveys the *generative* idea and where it plugs back into the book's inverse-problem spine.

equations

forward (noising) process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, $\epsilon\sim\mathcal N(0,I)$

training loss (predict the noise) $\mathbb{E}_{x_0,\epsilon,t}\big\lVert \epsilon - \epsilon_\theta(x_t, t)\big\rVert^2$

Tweedie / score = denoiser $\hat x_0 = x_t + \sigma_t^2\,\nabla_{x_t}\log p_{\sigma_t}(x_t)$, i.e. the score and a Gaussian denoiser are the same object

reverse (sampling) step — denoise a little, add a little noise, repeat from $x_T\sim\mathcal N(0,I)$ down to $x_0$

classifier-free guidance $\tilde\epsilon_\theta = \epsilon_\theta(x_t,\varnothing) + w\,\big(\epsilon_\theta(x_t,c) - \epsilon_\theta(x_t,\varnothing)\big)$

posterior sampling for an inverse problem — sample $x\sim p(x\mid y)\propto p(y\mid x)\,p(x)$ by alternating the diffusion prior step with a data-fit step against the forward model $A$ (a PnP/RED solver with a generative prior)

▸4.5.1 The framing: generation is learning and sampling a prior $p(x)$⧉

• **the move.** Every prior so far was something you **evaluate**: a penalty $\Phi(x)$ you *add* to a data term ($\min_x \tfrac12\|Ax-y\|^2 + \lambda\,\Phi(x)$), or a denoiser $\mathcal D$ you *plug* into a solver (PnP/RED). A **generative model** learns the distribution of natural images $p(x)$ well enough that you can **draw fresh samples** $x\sim p(x)$ — not score an image, but *produce* one. [fig-genai-evaluate-vs-sample]

• **why sampling changes everything.** With a sampleable prior you get three capabilities the evaluable prior couldn't give: (1) **unconditional generation** — invent a plausible image from nothing; (2) **conditioning** — sample from $p(x\mid c)$ for a text prompt, a sketch, a class, another image (text→image, image→image); (3) **posterior sampling** — condition the prior on a *measurement* $y$ and sample $p(x\mid y)\propto p(y\mid x)\,p(x)$, which is exactly **super-resolution / deblur / inpaint** with the strongest prior we have. The inverse-problem template is unchanged; the prior just learned to *generate*.

• **one-to-many is the point.** Restoration tasks like colorization, super-res, and inpainting are **one-to-many** — many sharp images explain one degraded input. Regression (L2) averages them into a blurry mean (the very failure motivated in [[Deep learning]]); a generative prior **samples one sharp, plausible answer** instead. [fig-posterior-sampling]

• 💡 **Big lesson:** L8 — *a learned operator swaps a hand-designed prior for one learned from data.* The generative prior here is the most extreme case of that swap: the entire distribution of natural images, learned. (Recurrence; first appears in [[Machine learning]].) [[Big Lessons#L8]]

• 💡 **Big lesson:** L10 — *the prior is not optional.* When the measurement destroys information (SR, inpaint, deblur), only a prior selects an answer. A generative model is that prior at full strength — and conditioning it on the measurement is posterior sampling. (Recurrence; first appears in [[Super-resolution and image priors]].) [[Big Lessons#L10]]

• 💡 **Big lesson:** L11 — *a generative model is a sampleable prior over images.* The earlier chapters' prior was something you *evaluate* (a penalty $\Phi$) or *denoise with*; a generative model is a prior you can **draw images from**, $x\sim p(x)$. That leap — from scoring to sampling — is what makes generation, text conditioning, and posterior sampling for inverse problems possible. (First appearance; registered as [[Big Lessons#L11]].)

• **the pre-deep ancestor.** "Sample plausible pixels" predates deep nets: **Efros–Leung texture synthesis** (1999) grew an image pixel-by-pixel by copying from matching neighbourhoods — a non-parametric sample from the patch distribution. Same instinct (draw from $p(x)$), no learned model. Ties to the patch/texture material and to Image Analogies ([[Deep learning]]).

• refs: Efros & Leung 1999 (texture synthesis); Goodfellow et al. 2014; Ho et al. 2020; Torralba/Isola/Freeman Part IX.

▸4.5.2 Diffusion: generation as iterated denoising⧉

• **the two chains.** A **forward** process gradually corrupts an image into pure Gaussian noise: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ — at $t=0$ the clean image, at $t=T$ static. A **reverse** process learns to undo one step of that corruption at every noise level. **Sampling** runs the reverse chain from $x_T\sim\mathcal N(0,I)$ down to a clean $x_0$ — an image *emerging from noise*. [fig-diffusion-forward-reverse]

• *pseudocode:* the reverse sampling loop (start from $x_T\sim\mathcal N(0,I)$; for $t$ from $T$ down to $1$: predict the noise $\epsilon_\theta$, denoise a step, reinject a little fresh noise) — the "sample, don't average" walk.

• **what the network learns (denoise everywhere).** Train a single net $\epsilon_\theta(x_t,t)$ to **predict the noise** added at level $t$, by minimizing $\mathbb{E}_{x_0,\epsilon,t}\lVert \epsilon - \epsilon_\theta(x_t,t)\rVert^2$ over clean images, random noise, and random levels. Predicting the noise *is* denoising — the model is a **denoiser at every noise level**.

• **the hinge — the score is a denoiser (Tweedie).** For a Gaussian-noised $x_t$, the MMSE denoiser (posterior mean) is $\hat x_0 = x_t + \sigma_t^2\,\nabla_{x_t}\log p_{\sigma_t}(x_t)$: the **score** $\nabla\log p$ and a **Gaussian denoiser** are the *same object*, one rearranged into the other. So a diffusion model *is* a learned score field / family of denoisers. (First appearance of this identity: *Denoising as a universal prior*, [[Super-resolution and image priors]]; recurs here as the chapter's spine.) [fig-score-is-denoiser]

• **therefore diffusion = PnP/RED at the continuous limit.** The sampling loop — *denoise a little, add a little noise, step down the schedule, repeat* — is exactly the **Plug-and-Play / RED** outer loop ([[Super-resolution and image priors]]), with the prior-step denoiser being the learned score run over a noise-level schedule. Generation is iterated denoising; the book's denoiser-as-prior abstraction and its hottest generative model are one mechanism.

• 💡 **callback to L10/L11:** the not-optional prior (L10) is now a *sampleable* one (L11) — and the machinery is the universal denoiser of the previous chapter, looped.

• **latent diffusion (why it scaled).** Diffusing in **pixels** at high resolution is expensive. **Latent diffusion** (Rombach 2022, Stable Diffusion) first **encodes** the image into a compact latent with an autoencoder, runs the whole diffusion process **in that latent**, then **decodes** — far cheaper, and the step that made open high-res text→image practical. [fig-latent-diffusion]

• refs: Ho et al. 2020 (DDPM); Song et al. 2021 (score SDE / samplers); Rombach et al. 2022 (latent diffusion); back-ref Venkatakrishnan 2013 / Romano 2017 (PnP/RED — the loop diffusion generalizes).

▸4.5.3 Conditioning: text, images, and control⧉

• **text→image.** A frozen **text encoder** (e.g. CLIP / T5) turns the prompt into embeddings that condition the denoiser at every step via **cross-attention** — the model samples from $p(x\mid \text{text})$ instead of $p(x)$. This is the headline capability of Stable Diffusion / DALL·E / Imagen-style systems.

• **text vs images — why two different generative models** (resolving the outline's old *Text vs images* stub): text is **discrete tokens** with a natural left-to-right order and a finite vocabulary, a perfect fit for **autoregressive** language models ($p(x)=\prod_i p(x_i\mid x_{<i})$, exact next-token likelihood); images are **continuous**, million-dimensional, and have **no canonical ordering**, a perfect fit for **diffusion** (add/remove continuous Gaussian noise over *all* pixels in parallel, no ordering). It is not fashion but **discrete-sequential vs continuous-parallel** — and the two modalities meet through a shared **CLIP** embedding, where an LLM-encoded prompt steers the image diffusion model.

• **classifier-free guidance (the knob).** Train the model **with and without** the condition (dropping $c$ sometimes), then at sampling time **extrapolate** along the conditioned-vs-unconditioned difference: $\tilde\epsilon_\theta = \epsilon_\theta(x_t,\varnothing) + w\big(\epsilon_\theta(x_t,c)-\epsilon_\theta(x_t,\varnothing)\big)$. The guidance weight $w$ trades **prompt fidelity vs diversity** — the single most-used quality dial.

• **structural control (ControlNet).** Beyond text, condition on a **spatial hint** — edges, pose, depth, segmentation — so the output obeys a given structure *and* the prompt. **ControlNet** (Zhang 2023) adds a trainable branch that injects the hint without retraining the base model. [fig-conditioning-controlnet]

• **instruction-based editing.** **InstructPix2Pix** (Brooks 2023): condition on *(an input image, a text instruction)* — "make it winter," "turn the car red" — and the model edits in place. The image→image lineage runs from Pix2Pix/CycleGAN ([[Deep learning]]) to diffusion editing here.

• refs: Ho & Salimans 2022 (CFG); Zhang et al. 2023 (ControlNet); Brooks et al. 2023 (InstructPix2Pix); Rombach et al. 2022.

▸4.5.4 Posterior sampling: generative priors for inverse problems⧉

• **the payoff for the rest of the book.** Plug the generative prior back into the inverse-problem template. To solve $y=Ax+n$, **sample the posterior** $p(x\mid y)\propto p(y\mid x)\,p(x)$ by **alternating** a diffusion prior step (denoise toward the manifold of natural images) with a **data-fit** step against the forward model $A$ — a PnP/RED solver whose prior is the strongest learned denoiser we have. Result: state-of-the-art **super-resolution, deblurring, inpainting, colorization**. [fig-posterior-sampling]

• **reconstruction vs hallucination (the honest caveat).** Because the prior now *invents* plausible detail, posterior sampling sits squarely on the **hallucination** side of L10's split: the detail is plausible, not measured. Different samples give different valid answers — useful (uncertainty) and dangerous (invented "evidence"). Always label which is which.

• **where it gets applied.** This abstraction is what the later application parts reach for: **compositing / inpainting** (fill a removed object), **relighting**, **video** (temporally-conditioned generation), and generative **super-resolution**. Here we establish the mechanism; each part applies it.

• refs: back-ref Venkatakrishnan 2013 / Romano 2017 (PnP/RED); Song et al. 2021 (conditional sampling); [[Super-resolution and image priors]] (the prior throughline this closes).

▸4.5.5 Other generative families, in brief⧉

• **GANs.** A generator vs a discriminator that learns the "real-image" boundary → sharp samples in one forward pass (the engine behind Real-ESRGAN's realism, [[Deep learning]]). Fast to sample, historically hard to train (mode collapse); largely overtaken by diffusion for general text→image, still used where one-shot speed matters.

• **VAEs.** Encode→sample-latent→decode; a principled probabilistic model, but samples are typically **blurrier**. Live on inside **latent diffusion** as the autoencoder that defines the latent space.

• **autoregressive / transformer image models.** Generate tokens (pixels or codebook entries) one at a time (PixelRNN/CNN, image transformers, VQ-based models); the same next-token idea as language models, now reborn in multimodal systems.

• one-line placement: **diffusion currently dominates** open image generation; the others matter as alternatives, components, or speed/latency trade-offs.

• refs: Goodfellow et al. 2014 (GANs); Kingma & Welling 2014 (VAE); van den Oord et al. 2016 (PixelCNN — queue ref).

▸4.5.6 Caveats and ethics⧉

• **hallucination is structural, not a bug.** A sampleable prior *must* invent detail — that is what makes it work and what makes it untrustworthy as evidence. Generated/super-resolved detail is **plausible, not real**; never treat it as measurement (forensics, medical, legal). Ties to L10's reconstruction-vs-hallucination split and to [[Human factors and the art of photography]].

• **bias and representation.** The model can only sample what its **training distribution** contained; skews in the data become skews in the outputs (demographics, styles, defaults). The benchmark/dataset defines the behaviour, as in [[Machine learning]].

• **deepfakes, provenance, consent.** Realistic synthesis enables **deepfakes** and misinformation; mitigations — watermarking, provenance/C2PA, detection (the forensics angle, cross-ref face-landmark consistency checks) — are partial. Training on scraped images raises **consent, copyright, and attribution** questions still unsettled.

• **environmental and access costs.** Training and serving large diffusion models is compute- and energy-intensive; capability concentrates with whoever can pay — flag, don't cheerlead (echoing the Bitter-Lesson caveat in [[Machine learning]]).

• refs: queue — provenance/C2PA, a deepfake-detection survey, a dataset-consent reference — **missing from `refs/`, to download**; cross-ref [[Human factors and the art of photography]] (ethics, forensics, the art-vs-evidence line).