Computational Photography, an AI-powered Slopendium

▸Part 22 APPENDICES⧉

▸22.1 Refreshers⧉

The book is built on a small kit of mathematical tools, used over and over: images are **vectors**, operations on them are **linear maps** (so we get to use all of linear algebra and the Fourier transform), recovering an image from measurements is an **optimization**, noise is **probability**, and the modern workhorses are **learned**. This appendix is the kit — each tool sketched just far enough to read the chapters that use it, with a pointer to where to go deeper.

▸22.1.1 Linear algebra⧉

`fig-pinhole-imaging` · imaging-scenario series (2/3): add a pinhole to the bare sensor — one ray per scene point → a dim **inverted** image (same tree+sensor+colours as fig-bare-sensor-averaging) ✅

`fig-matrix-as-transform` · a 2×2 matrix as a linear map: unit grid + basis vectors → deformed parallelogram under A (Linear algebra) 🟩

`fig-vector-projection` · projection & orthogonality: v̂ = ⟨u,v⟩/⟨u,u⟩·u and the perpendicular residual v−v̂ (Linear algebra) 🟩

`fig-svd-geometry` · SVD = rotate·scale·rotate: unit circle → ellipse with semi-axes σ₁u₁, σ₂u₂ (Linear algebra) 🟩

• **Vectors and what they mean here.** A length-$n$ list of numbers; geometrically a point/arrow in $n$-D space. **An image *is* a vector** — stack the pixels into one long column (fig-image-as-vector): a megapixel image is a point in a million-dimensional space. Everything we do to images is geometry in that space.

• **Inner product, norm, angle.** $\langle u,v\rangle=u^\top v$ measures alignment; $\|v\|=\sqrt{\langle v,v\rangle}$ is length; orthogonal means $\langle u,v\rangle=0$. The inner product is how a cone "reads" a spectrum, how a filter responds to a patch, how we measure image similarity.

• **Matrices as linear maps.** $y=Ax$ — a matrix sends vectors to vectors, linearly (preserves sums and scaling). Visualize a 2×2 $A$ as deforming the unit square / grid into a parallelogram (fig-matrix-as-transform). A blur, a color-space conversion, a resampling, a perspective warp — all are (or linearize to) $y=Ax$.

• **Linear systems and inverse problems.** Imaging forward models read $y=Ax$ (scene $x$ → measurement $y$); recovering $x$ is solving the system / inverting $A$. Usually $A$ is huge, rank-deficient or noisy, so we don't literally invert — we optimize (next sections). Square vs over/under-determined; rank; null space (what the measurement *cannot* see).

• **Bases, projection, orthogonality.** A basis is a coordinate system; changing basis re-expresses the same vector. Projecting $v$ onto $u$: $\hat v=\frac{\langle u,v\rangle}{\langle u,u\rangle}u$ (fig-vector-projection). In an **orthonormal** basis $\{e_i\}$ the coordinates are just inner products, $x=\sum_i\langle e_i,x\rangle e_i$ — clean, and energy-preserving (Parseval). This is why we love orthogonal bases (Fourier, orthonormal wavelets); **non-orthogonal, non-negative** bases (the cones/primaries of [[Color technology]]) need a *dual* basis and are the source of color's trickiness.

• **Eigenvectors and the SVD.** Eigenvectors are directions a matrix only scales; for symmetric matrices they can be chosen orthogonal (the spectral theorem), so diagonalizing = finding the basis where the operator is just per-axis scaling. The **SVD** $A=U\Sigma V^\top$ generalizes this to any matrix: rotate ($V^\top$), scale ($\Sigma$), rotate ($U$). The unit circle becomes an ellipse whose semi-axes point along the columns $u_i$ of $U$, with lengths equal to the singular values $\sigma_i$ (fig-svd-geometry). It underlies low-rank approximation, PCA, the pseudo-inverse, and the conditioning of deblurring (small singular values → noise blow-up).

• **Tie to the book's recurring move:** convolution is a linear operator that is *diagonalized by the Fourier basis* ([[Linearity, Fourier, Aliasing and deblurring]]); PCA diagonalizes a covariance; pyramids/wavelets are change-of-basis. Same idea each time — pick the basis where the operator is simple.

equations

$y = Ax$

inner product $\langle u,v\rangle = u^\top v$

projection $\hat v = \frac{\langle u,v\rangle}{\langle u,u\rangle}u$

orthonormal expansion $x = \sum_i \langle e_i,x\rangle e_i$

SVD $A = U\Sigma V^\top$
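The identities in this section can be checked numerically. The following is a minimal NumPy sketch, not the book's own listing: the vectors, the random orthonormal basis, and the 2×2 matrix are made-up illustration values.

```python
import numpy as np

# Projection of v onto u, and the perpendicular residual.
u = np.array([2.0, 0.0])
v = np.array([1.0, 3.0])
v_hat = (u @ v) / (u @ u) * u          # component of v along u
residual = v - v_hat                   # what is left is orthogonal to u

# Orthonormal expansion: the columns of Q are an orthonormal basis of R^4
# (obtained here from a QR factorization of a random matrix; an assumption
# made only to have some basis to test with).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
x = rng.standard_normal(4)
coeffs = Q.T @ x                       # coordinates are inner products <e_i, x>
x_rebuilt = Q @ coeffs                 # x = sum_i <e_i, x> e_i
energy_preserved = np.isclose(np.sum(coeffs**2), np.sum(x**2))  # Parseval

# SVD: A = U diag(s) V^T, singular values sorted largest first.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)
A_rebuilt = U @ np.diag(s) @ Vt
condition_number = s[0] / s[-1]        # large value = inversion amplifies noise
```

The condition number is the quantity to watch in deblurring: dividing by a small singular value multiplies whatever noise lies along that direction.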

▸22.1.2 Calculus: derivatives, gradients, integrals⧉

• **Derivative = slope = rate of change.** $f'(x)$ is the tangent slope (fig-derivative-gradient, left). In images, the **discrete derivative** (finite difference, neighbor minus neighbor) is large at **edges** — the basis of Sobel gradients and sharpening ([[Neighborhood operations and convolution]]).

• **Partial derivatives and the gradient.** For $f(x_1,\dots,x_n)$, the gradient $\nabla f$ stacks the partials; it points in the direction of **steepest ascent**, and its magnitude is the steepness (fig-derivative-gradient, right: arrows on a contour map). The image gradient points *across* an edge, from dark to bright, so the edge itself runs perpendicular to it; edge *orientation* is read off the gradient direction.

• **Jacobian and Hessian: the multivariable kit.** A map that takes a vector and returns a vector, $\mathbf f:\mathbb R^n\to\mathbb R^m$, has a **Jacobian** $J$: the $m\times n$ matrix of all first partials $J_{ij}=\partial f_i/\partial x_j$. It is the best **linear approximation** of $\mathbf f$ near a point (the gradient is just the scalar, $m{=}1$, case), and composing maps **multiplies** their Jacobians (the vector chain rule that backprop runs on; also how a warp is linearized locally). A **scalar** $f$ has a **Hessian** $H$: the $n\times n$ matrix of *second* partials $H_{ij}=\partial^2 f/\partial x_i\partial x_j$, i.e. **curvature**. At a stationary point ($\nabla f=0$) a **positive-definite** Hessian certifies a genuine local **minimum** (vs a saddle), and **Newton's method** steps by $-H^{-1}\nabla f$, which is second-order optimization (Gauss–Newton is its least-squares approximation). Keep it conceptual; the Jacobian is the one you meet constantly.

• **Chain rule.** Derivative of a composition — the engine of **backpropagation**: a deep network is a long composition, and training differentiates the loss through it by repeated chain rule. State it once here so the ML section can lean on it.

• **Integrals = accumulation.** Area under a curve; in imaging a **pixel value is an integral** — radiance integrated over the sensor's area, solid angle, exposure time, and spectral response ([[Image formation and linear perspective]]). Continuous vs discrete (sums) — we constantly move between them (sampling, [[Resampling]]).

• **Convexity, minima, and why gradients matter.** At a minimum of a smooth function $\nabla f=0$, but the same holds at maxima and saddles; if $f$ is **convex** (bowl-shaped), any point with $\nabla f=0$ is a global minimum. This is the bridge to optimization: follow $-\nabla f$ downhill.

equations

derivative $f'(x)=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}$

gradient $\nabla f=(\partial f/\partial x_1,\dots)$

chain rule $\frac{d}{dx}f(g(x))=f'(g(x))\,g'(x)$

**Jacobian** $J_{ij}=\partial f_i/\partial x_j$ (the $m\times n$ derivative of a vector→vector map; $=\nabla f$ when scalar)

**Hessian** $H_{ij}=\partial^2 f/\partial x_i\partial x_j$ (the $n\times n$ matrix of second partials — curvature)

integral $\int f$
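Two of the ideas above are easy to see in code: the discrete derivative is large only at an edge, and the Jacobian of a composition is the product of the Jacobians. The sketch below is a hedged illustration; the pixel row, the maps `g` and `f`, and the helper `numerical_jacobian` are hypothetical names and values chosen for the example.

```python
import numpy as np

# Discrete derivative: neighbour minus neighbour along one image row.
row = np.array([10.0, 10.0, 10.0, 200.0, 200.0, 200.0])
dx = np.diff(row)                          # nonzero only where the step is
edge_index = int(np.argmax(np.abs(dx)))    # the step sits between samples 2 and 3


def numerical_jacobian(func, x, h=1e-6):
    """Central-difference estimate of J[i, j] = d func_i / d x_j."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(func(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (np.asarray(func(x + step)) - np.asarray(func(x - step))) / (2 * h)
    return J


def g(x):
    """Hypothetical vector map R^2 -> R^2."""
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])


def f(y):
    """Hypothetical map R^2 -> R^1 (returned as a length-1 vector)."""
    return np.array([np.sin(y[0]) + y[1] ** 2])


x0 = np.array([0.5, -1.2])
J_composed = numerical_jacobian(lambda x: f(g(x)), x0)           # differentiate f∘g directly
J_chain = numerical_jacobian(f, g(x0)) @ numerical_jacobian(g, x0)  # J_f(g(x0)) · J_g(x0)
```

`J_composed` and `J_chain` agree up to finite-difference error, which is the vector chain rule that backpropagation applies layer by layer.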

▸22.1.3 Optimization and regression⧉

`fig-merit-function-loop` · lens design as optimization, a closed loop: parameters → ray-trace (fields × wavelengths × pupil zones) → merit function Φ (weighted spot/wavefront + constraints) → damped-least-squares step → repeat; the book's forward-model+loss+optimizer pattern (Lens optimization) ✅

`fig-gradient-descent` · descent path on a convex-bowl contour (−∇f steps); too-large step overshoots (Optimization) 🟩

`fig-overfitting` · figure not yet created ⬜

• **The pattern: loss = data term + prior.** Almost every recovery problem is $\min_x \underbrace{\|Ax-y\|^2}_{\text{fit the data}}+\lambda\, \underbrace{R(x)}_{\text{prefer plausible }x}$. Denoise, deblur, demosaic, HDR-merge, white-balance — same skeleton, different $A$ and $R$.

• **Least squares & the normal equations.** Fitting a line/plane/model by minimizing squared residuals (fig-least-squares: points, fit, vertical residuals); closed form $A^\top A\hat x=A^\top y$ (and the SVD/pseudo-inverse when $A^\top A$ is singular). Why *squared* error: it is the Gaussian-noise MLE (ties to the probability section), and it's the one that makes the math linear; its sensitivity to outliers (one large residual, squared, dominates the sum) motivates robust losses.

• **Gradient descent.** When there's no closed form (huge or nonlinear problems), step downhill: $x_{t+1}=x_t-\eta\nabla f$ (fig-gradient-descent: descent path on a contour bowl). Learning rate $\eta$ — too big overshoots/diverges, too small crawls. Variants in one breath: stochastic GD (a minibatch per step — how nets train), momentum/Adam. Convex → one basin; non-convex (deep nets) → local minima, and yet it works.

• **Ill-posedness & regularization.** Inverse problems often have many $x$ explaining $y$ (or noise blows up the inverse). A **regularizer** $R(x)$ encodes a prior — smoothness (Tikhonov), sparsity ($\ell_1$, total variation), or a learned prior. $\lambda$ trades data-fit against prior. This is the principled cure for the deblurring noise-amplification of [[Linearity, Fourier, Aliasing and deblurring]].

• **Overfitting and the bias–variance trade-off.** Fit the noise and you generalize badly (fig-overfitting: under-/just-right/over-fit polynomials). Capacity, regularization, and held-out validation — the same dial as $\lambda$, and the bridge to the ML section.

equations

least squares $\hat x=\arg\min_x\|Ax-y\|^2$, normal equations $A^\top A\,\hat x=A^\top y$

gradient step $x_{t+1}=x_t-\eta\nabla f(x_t)$

regularized $\min_x\|Ax-y\|^2+\lambda R(x)$
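The three equations above can be run side by side on a toy line fit. This is a minimal sketch under stated assumptions: the data (a line $y=2t+1$ with small Gaussian noise), the step-size choice $\eta = 1/L$, and the use of $R(x)=\|x\|^2$ (Tikhonov) are illustrative choices, not the book's listing.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y = 2 t + 1 plus small noise; unknowns x = (slope, intercept).
t = np.linspace(0.0, 1.0, 50)
A = np.column_stack([t, np.ones_like(t)])
y = 2.0 * t + 1.0 + 0.01 * rng.standard_normal(t.size)

# Closed form: normal equations A^T A x = A^T y.
x_ne = np.linalg.solve(A.T @ A, A.T @ y)

# Gradient descent on f(x) = ||Ax - y||^2, whose gradient is 2 A^T (Ax - y).
# Step size 1/L, where L = 2 * sigma_max(A)^2 bounds the gradient's Lipschitz constant.
L = 2.0 * np.linalg.norm(A, 2) ** 2
eta = 1.0 / L
x_gd = np.zeros(2)
for _ in range(5000):
    x_gd = x_gd - eta * 2.0 * A.T @ (A @ x_gd - y)

# Tikhonov regularization with R(x) = ||x||^2: (A^T A + lambda I) x = A^T y.
lam = 0.1
x_tik = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ y)
```

Gradient descent lands on the same answer as the normal equations, only more slowly; the regularized solution is pulled toward zero, and the pull grows with `lam`.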

▸22.1.4 Probability and information theory⧉

`fig-gaussian` · figure not yet created ⬜

`fig-bayes-posterior` · Bayes: posterior ∝ likelihood × prior (three curves) (Probability) 🟩

`fig-entropy-coin` · binary entropy H(p): 1 bit at p=0.5, 0 at the ends (Information theory) 🟩

• **The basics, in one breath.** A **probability** is a number in $[0,1]$; the probabilities of **independent** events **multiply** (AND), those of **mutually exclusive** events **add** (OR). A **distribution** says how probability is spread over the possible outcomes; its **expectation** $E[X]$ is the long-run average and its **variance** the spread. That is the whole vocabulary the rest of this section leans on, kept deliberately simple.

• **Random variables, distributions, expectation, variance.** A pixel reading is a random variable around the true value; mean = the value we want, **variance/std = the noise**. Expectation is linear; variances of independent things add.

• **The two distributions we actually use.** **Gaussian** $\mathcal N(\mu,\sigma^2)$ — the bell curve (fig-gaussian: μ, ±σ), read noise, the central limit theorem, and the reason squared error is "natural." **Poisson** — photon counting: shot noise with $\mathrm{variance}=\mathrm{mean}$, so std $=\sqrt N$ → SNR is **worse in the shadows** ([[Image formation and linear perspective]]).

• **Independence and averaging.** Average $N$ independent noisy measurements → variance drops by $N$, std by $\sqrt N$. This single fact is why we stack frames, why bigger pixels are cleaner, and the whole "quality through quantity" thread.

• **Conditional probability and Bayes.** $p(x\mid y)\propto p(y\mid x)\,p(x)$ (fig-bayes-posterior: prior × likelihood → posterior). **MAP estimation** = maximize the posterior = minimize $-\log p(y\mid x)-\log p(x)$, which is exactly the "data term + prior" optimization of the last section: the data term is the negative log-likelihood and $\lambda R(x)$ is the negative log-prior, so the two views are the same. Discounting the illuminant (white balance), denoising, and demosaicking can all be posed as Bayesian estimates.

• **Information & entropy.** $H=-\sum p\log p$ = average surprise = the bits needed to encode a source (fig-entropy-coin: $H(p)$ peaks at $p=0.5$). Low entropy = predictable = compressible — the theory behind entropy coding in [[File formats and compression]]. Mutual information as "shared bits." Keep it light: enough to know *why* JPEG can shrink an image and *why* it can't shrink random noise.

equations

mean $\mu=E[X]$, variance $\sigma^2=E[(X-\mu)^2]$

Gaussian $\mathcal N(\mu,\sigma^2)$

Poisson (shot noise) $\mathrm{var}=\mathrm{mean}$

Bayes $p(x\mid y)\propto p(y\mid x)p(x)$

entropy $H=-\sum p\log p$
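Two claims in this section can be tested by simulation: entropy peaks for a fair coin, and averaging $N$ independent shot-noise-limited frames cuts the standard deviation by $\sqrt N$. The sketch below assumes a made-up mean of 100 photons per pixel and 16 frames; `entropy_bits` is a hypothetical helper name.

```python
import math
import numpy as np


def entropy_bits(p):
    """Shannon entropy H = -sum p log2 p in bits; zero-probability terms contribute 0."""
    return -sum(q * math.log2(q) for q in p if q > 0)


h_fair = entropy_bits([0.5, 0.5])      # fair coin: the maximum for two outcomes
h_biased = entropy_bits([0.9, 0.1])    # predictable coin: fewer bits per toss
h_certain = entropy_bits([1.0, 0.0])   # no surprise at all

# Shot noise: photon counts are Poisson, so variance = mean and std = sqrt(mean).
rng = np.random.default_rng(0)
mean_photons = 100.0                   # assumed signal level
n_frames = 16
frames = rng.poisson(mean_photons, size=(n_frames, 50_000))

single_std = frames[0].std()               # close to sqrt(100) = 10
stacked_std = frames.mean(axis=0).std()    # close to 10 / sqrt(16) = 2.5
```

The ratio `single_std / stacked_std` comes out near 4, the $\sqrt{16}$ gain that motivates frame stacking.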

▸22.1.5 Machine learning and deep learning⧉

`fig-neuron` · figure not yet created ⬜

`fig-train-val-loss` · training vs validation loss; validation U-turn = overfitting / early stop (ML / deep learning) 🟩

• **Supervised learning = curve fitting, scaled up.** Given pairs $(x_i,y_i)$, find $f_\theta$ minimizing average loss $\mathcal L(\theta)$ — exactly the regression of two sections ago, with millions of parameters. Train/validation/test split; generalization is the goal, not memorization (callback to fig-overfitting).

• **The artificial neuron and layers.** A neuron computes $\sigma(w^\top x+b)$ — a weighted sum (an inner product!) through a nonlinearity (fig-neuron). Stack them into layers → a **multilayer perceptron**; a deep net is a long composition $f_\theta=f_L\circ\dots\circ f_1$ of such building blocks, hence very expressive.

• **Training = gradient descent + backprop.** Minimize $\mathcal L(\theta)$ by SGD; **backpropagation** is the chain rule (calculus section) computing $\nabla_\theta\mathcal L$ efficiently through the composition. Autodiff frameworks (PyTorch) do this for you (programming section).

• **Convolutional networks — the imaging-native architecture.** Weight-shared, local, shift-equivariant filters — literally learned convolutions ([[Neighborhood operations and convolution]]), so they fit images naturally; encoder–decoder / U-Net shapes for image-to-image tasks (denoise, demosaic, super-res). Brief, with forward-ref to the advanced books for depth.

• **What changes vs classical methods.** Same "loss = data-fit + prior" skeleton; the **prior is now learned** from a dataset rather than hand-designed. Strengths (state-of-the-art quality), costs (data, compute, failure modes, hallucinated detail — tie to the ethics discussion in [[Human factors and the art of photography]]).

equations

model $\hat y=f_\theta(x)$

loss $\mathcal L(\theta)=\frac1N\sum_i \ell(f_\theta(x_i),y_i)$

SGD $\theta\leftarrow\theta-\eta\nabla_\theta\mathcal L$

a neuron $\sigma(w^\top x+b)$
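To make "training = gradient descent + backprop" concrete, here is a single sigmoid neuron trained by minibatch SGD with the gradient written out by hand from the chain rule. It is a sketch under assumptions: the toy labels, the squared loss, the learning rate, and the batch size are arbitrary illustration choices, and a real project would let an autodiff framework compute the gradients.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_grads(w, b, X, y):
    """Mean squared loss of one neuron sigma(w^T x + b), plus chain-rule gradients."""
    a = sigmoid(X @ w + b)                       # forward pass, shape (N,)
    err = a - y
    loss = np.mean(err ** 2)
    dz = (2.0 / len(y)) * err * a * (1.0 - a)    # dL/da times da/dz
    return loss, X.T @ dz, dz.sum()              # loss, dL/dw, dL/db


rng = np.random.default_rng(0)
X = rng.standard_normal((256, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)        # hypothetical toy labels

w, b, eta = np.zeros(2), 0.0, 0.5
initial_loss, _, _ = loss_and_grads(w, b, X, y)

for step in range(500):
    batch = rng.choice(len(y), size=32, replace=False)   # one minibatch per step
    _, grad_w, grad_b = loss_and_grads(w, b, X[batch], y[batch])
    w, b = w - eta * grad_w, b - eta * grad_b            # theta <- theta - eta * grad

final_loss, _, _ = loss_and_grads(w, b, X, y)
```

The loss starts at 0.25 (the neuron outputs 0.5 for everything) and falls as the weights align with the direction that separates the two classes.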

▸22.1.6 Programming: Python, C++, and PyTorch⧉

⬜ figure not yet created

none — a prose + small-code-snippet section

• **Why two languages.** The book renders as a **Python book** *or* a **C++ book** from one source ([[Book Generation Process]]); this section is the shared mental model. Python/NumPy: arrays, vectorization, broadcasting — read like the math. C++: the same operations when speed/memory matters (ISP, Halide).

• **Images as arrays/tensors.** An image is an $H\times W\times C$ array (NumPy ndarray / `std::vector` / `torch.Tensor`); operations are whole-array (vectorized) not per-pixel loops — the linear-algebra view made executable. Indexing, slicing, broadcasting; row-major memory and why loop order matters (forward-ref to [[Performance engineering and Halide]]).

• **The speed differential.** For a low-level inner loop you write yourself (e.g. your own convolution), expect roughly **1000× Python→C++** and **10× more again for an optimized [[Performance engineering and Halide|Halide]] schedule**, as order-of-magnitude figures; this is why hot loops leave Python. See [[Performance engineering and Halide]].

• **PyTorch in one paragraph.** Tensors that live on CPU or GPU, plus **autograd** — every operation records a graph so `loss.backward()` gives gradients for free (the backprop of the ML section). This is why modern imaging code is written in it even when no "learning" is involved: free derivatives + GPU.

• **Numerical hygiene.** Floats vs ints, value ranges and clipping, **linear vs gamma space** before arithmetic (the encoding Big Lesson), in-place vs copy, NaNs/inf from divide-by-zero — the bugs that actually bite, cross-ref [[Developing, Testing and Debugging]].

• **Pointers to practice.** The dual-language listings, the `imageops`/`figstyle` style, and [[Exercises & Experiments]] warmups (load an image → it's a NumPy array → do arithmetic → save) that make all of the above concrete.

equations

none
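A few of the hygiene points above fit in a dozen lines of NumPy. This is a hedged sketch: the tiny constant image is made up, and the pure power law with exponent 2.2 is a stand-in for a real transfer curve such as sRGB, used here only to show why arithmetic belongs in linear space.

```python
import numpy as np

# A tiny hypothetical H x W x C image, 8-bit.
img = np.full((2, 3, 3), 200, dtype=np.uint8)

# Integer arithmetic wraps around silently: 200 + 100 = 300 -> 300 mod 256 = 44.
wrapped = img + np.uint8(100)

# Convert to float, do the arithmetic, clip to the valid range, convert back.
safe = np.clip(img.astype(np.float32) + 100.0, 0, 255).astype(np.uint8)

# Broadcasting: per-channel gains of shape (3,) apply to every pixel of (H, W, 3).
gains = np.array([2.0, 1.0, 0.5], dtype=np.float32)
balanced = img.astype(np.float32) / 255.0 * gains   # values above 1.0 still need clipping


def decode(v, gamma=2.2):
    """Encoded value -> linear light (simple power law; an assumption, not sRGB)."""
    return v ** gamma


def encode(v, gamma=2.2):
    """Linear light -> encoded value (inverse of decode)."""
    return v ** (1.0 / gamma)


black, white = 0.0, 1.0
naive_mid = 0.5 * (black + white)                            # averaging encoded values
linear_mid = encode(0.5 * (decode(black) + decode(white)))   # averaging light, then re-encoding
```

`naive_mid` is 0.5 while `linear_mid` is about 0.73: blending encoded values gives a result that is too dark, which is the practical reason to decode before doing arithmetic.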

22.2 Problem Set 0 — Environment and C++ basics⧉

22.3 Problem Set 1 — Image class, point operations, and color⧉

22.4 Problem Set 2 — Convolution and the bilateral filter⧉

22.5 Problem Set 3 — Denoising and demosaicking⧉

22.6 Problem Set 4 — High dynamic range⧉

22.7 Problem Set 5 — Resampling, warping, and morphing⧉

22.8 Problem Set 6 — Homographies and manual panoramas⧉

22.9 Problem Set 7 — Automatic panoramas⧉

22.10 Problem Set 8 — Non-photorealistic rendering⧉

22.11 Problem Set 9 — Make-your-own, video, and ethics⧉

▸22.12 EXIF and image metadata⧉

• **What EXIF is.** Structured **metadata embedded in the image file itself** (JPEG, TIFF, and most RAW/DNG), not a sidecar — the camera's record of *how* the shot was taken. In JPEG it rides in the **APP1 marker segment**; the payload is a little **TIFF/IFD** structure (tags → values). Standardized as **Exif** (JEITA/CIPA), with **DCF** governing on-card file/folder naming.

• **EXIF is not the only block.** A photo commonly carries two more metadata standards alongside EXIF: **XMP** (Adobe's XML-based metadata — where Lightroom/ACR edit settings, keywords, and ratings live, often in a **sidecar `.xmp`** for raw files) and **IPTC** (the press/captioning standard — caption, byline, credit, keywords, location). The three **co-occur** and can **disagree** (e.g. a date or copyright string present in more than one and edited in only one); a tool like `exiftool` reads all three at once.

• **Field groups** — the metadata sorted by what it describes:

• **capture settings**: exposure time / **shutter speed**, **f-number** (aperture), **ISO**, **exposure program / mode**, metering mode, **exposure bias** (compensation), flash fired/mode

• **optics**: **focal length** (and **35 mm-equivalent**), lens make / model, subject / focus distance

• **image**: pixel width & height, **orientation** flag (the in-camera rotation the viewer must apply), **color space** / embedded ICC profile, white-balance tag

• **time**: **DateTimeOriginal** (when the photo was taken) vs **DateTimeDigitized** (when it was written/scanned) — they differ for film scans and edits; plus sub-second and time-zone tags

• **place**: **GPS** latitude / longitude / altitude (and heading, timestamp)

• **MakerNotes**: proprietary, undocumented per-vendor blobs (often where the interesting bits — true ISO, lens corrections, shot count — hide)

• **embedded thumbnail / preview**: a small JPEG (and, in RAW, a full-res preview) carried alongside the full image

• **identity / file**: software, artist / copyright, camera & lens serial numbers, owner name

• ⚠️ **caveats** (why EXIF frustrates): coverage is **incomplete** (not every camera writes every field) and **inconsistent** across makers; some values, notably **ISO** and **shutter speed**, are **approximate / rounded** nominal settings, so never read them as calibrated radiometry.

• 🔒 **privacy.** EXIF leaks more than people expect: **GPS** coordinates (home/where a photo was taken), camera & lens **serial numbers** (links photos to one device), and **owner name** / copyright. Many platforms strip EXIF on upload; if sharing matters, strip it yourself (`exiftool -all=`).

• **Tools.** `exiftool` (the reference, widest tag coverage), `exiv2`; in Python `piexif` and `Pillow` (`Image.getexif()`), the C `libexif`.

• **How the book's pipeline reads EXIF.** The capture chapter pulls orientation, exposure triangle, and lens info from EXIF to annotate and correctly display the book's own photos — see [[Basic image processing and ISP]] → "Beyond the pixels: basic metadata and EXIF" and "Data versus metadata: EXIF".
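The container layout described at the top of this section (EXIF in a JPEG APP1 segment, wrapping a small TIFF structure) can be walked with the standard library alone. The sketch below is illustrative, not a replacement for `exiftool`: `find_exif_payload` and `tiff_byte_order` are hypothetical helper names, the byte stream is synthetic, and the parser ignores rarer cases such as standalone markers before the scan data or EXIF split across several segments.

```python
import struct


def find_exif_payload(jpeg_bytes):
    """Return the TIFF bytes inside the first Exif APP1 segment, or None."""
    if jpeg_bytes[:2] != b"\xff\xd8":                 # SOI marker opens every JPEG
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break                                     # lost marker sync; give up
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:                            # SOS: compressed image data follows
            break
        # Segment length is big-endian and counts its own two bytes.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        body = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xE1 and body.startswith(b"Exif\x00\x00"):   # APP1 with Exif header
            return body[6:]
        pos += 2 + length
    return None


def tiff_byte_order(tiff_bytes):
    """'II' means little-endian (Intel), 'MM' big-endian (Motorola)."""
    return {b"II": "little-endian", b"MM": "big-endian"}.get(tiff_bytes[:2])


def segment(marker, body):
    """Build one marker segment for the synthetic test stream."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(body) + 2) + body


# Synthetic stream: SOI, a JFIF APP0, an Exif APP1 holding an empty IFD, EOI.
tiff = b"MM\x00\x2a\x00\x00\x00\x08" + b"\x00\x00" + b"\x00\x00\x00\x00"
stream = (b"\xff\xd8"
          + segment(0xE0, b"JFIF\x00\x01\x02\x00\x00\x01\x00\x01\x00\x00")
          + segment(0xE1, b"Exif\x00\x00" + tiff)
          + b"\xff\xd9")
payload = find_exif_payload(stream)
```

From `payload` onward the data is ordinary TIFF: a byte-order mark, the magic number 42, and an offset to the first IFD of tag entries, which is the part a real reader would decode next.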

▸22.13 DNG: the Digital Negative⧉

• **What DNG is.** An **open, documented raw format** introduced by **Adobe in 2004**, built on **TIFF/EP**: a single, vendor-neutral, **archival** container for camera raw data — *one* published format instead of a proprietary raw per camera model (Canon CR2/CR3, Nikon NEF, Sony ARW, Fuji RAF, …). The spec is public, so anyone can read and write it.

• **The problem it solves.** Every camera ships its **own undocumented raw**; software must reverse-engineer each one, and old proprietary raws risk becoming **unreadable** as cameras and decoders age. DNG is a **stable, documented target** for long-term archiving and interchange — a "digital negative" you can still open decades later.

• **What's inside (the container).** A **TIFF/IFD** structure (same family as EXIF): the **raw image data**, full **EXIF + MakerNote** metadata, **XMP** (edit settings, ratings), one or more embedded **JPEG previews** + a **thumbnail**, and the **color/calibration** data a converter needs to render the raw.

• **Two flavors of raw payload.** **Mosaic ("raw") DNG** keeps the original **Bayer CFA** samples — demosaick later, the usual case (→ [[Demosaicking]]); **Linear DNG** stores already-**demosaicked** data (one RGB triple per pixel) — used after processing/remosaic or for non-Bayer sensors.

• **Color-rendering data (how raw → color).** DNG carries the math to turn sensor counts into colorimetric values: **ColorMatrix / ForwardMatrix / CameraCalibration** tags for (typically) **two reference illuminants** (e.g. Standard-A and D65) interpolated by correlated color temperature, plus **AsShotNeutral / AsShotWhiteXY** (the camera's white-balance estimate). **DNG camera profiles (DCP)** add hue/saturation/tone maps. (→ [[Color technology]] — camera color matrices & white balance.)

• **DNG opcodes.** A small list of **opcodes** the raw processor applies at decode: **WarpRectilinear / WarpFisheye** (geometric distortion correction), **FixVignetteRadial** (vignetting), **GainMap** (per-channel lens-shading / color cast), **TrimBounds**, defective-pixel maps. This is how DNG bakes in **lens corrections declaratively** (→ Optics, lens-profile correction).

• **Compression.** **Lossless JPEG** (default) or **uncompressed**; an optional **lossy DNG** trades some fidelity for size. (→ [[File formats and compression]].)

• **Embed-the-original.** A DNG can **wrap the original proprietary raw** inside itself, so conversion is **reversible** (extract the untouched original later).

• **Adoption & relatives.** Native raw on some cameras (**Leica, Pentax/Ricoh**), most Android phones via the **Camera2** API, and Apple **ProRAW** (which *is* DNG); the **Adobe DNG Converter** transcodes proprietary raws; **CinemaDNG** is the motion/video raw variant. Not universal — **Canon, Nikon, Sony** keep their own raw formats.

• **Trade-offs.** *Pros*: open, documented, archival; one pipeline; embedded corrections + profiles; lossless and often smaller than the native raw. *Cons*: a transcoding step; some maker-note / "secret-sauce" raw features may not map; not every tool round-trips a DNG perfectly.

• **Relation to EXIF.** DNG **contains** EXIF and XMP — it is a **superset container**, so the trust/coverage caveats of [[Appendices#EXIF and image metadata]] apply to the metadata it carries.
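The two-illuminant interpolation mentioned under color-rendering data can be sketched as follows. Treat the details as assumptions: to my understanding the DNG specification interpolates linearly in *inverse* correlated color temperature, the CCTs 2856 K and 6504 K are the nominal values for Standard-A and D65, and the two matrices are made-up placeholders rather than any camera's calibration.

```python
import numpy as np


def interpolation_weight(cct, cct_low=2856.0, cct_high=6504.0):
    """Weight of the low-CCT matrix: linear in 1/CCT, clamped to [0, 1]."""
    cct = min(max(cct, cct_low), cct_high)
    inv, inv_low, inv_high = 1.0 / cct, 1.0 / cct_low, 1.0 / cct_high
    return (inv - inv_high) / (inv_low - inv_high)


def interpolate_matrix(cct, m_low, m_high):
    """Blend the two calibration matrices for a scene illuminant of the given CCT."""
    g = interpolation_weight(cct)
    return g * m_low + (1.0 - g) * m_high


# Placeholder 3x3 matrices standing in for the two ColorMatrix tags (not real data).
matrix_a = np.array([[0.9, -0.3, 0.0],
                     [-0.4, 1.3, 0.1],
                     [0.0, 0.1, 0.6]])
matrix_d65 = np.array([[0.7, -0.2, -0.1],
                       [-0.5, 1.4, 0.1],
                       [-0.1, 0.2, 0.7]])

at_tungsten = interpolate_matrix(2856.0, matrix_a, matrix_d65)
at_daylight = interpolate_matrix(6504.0, matrix_a, matrix_d65)
in_between = interpolate_matrix(4000.0, matrix_a, matrix_d65)
```

At the two calibration illuminants the blend returns the corresponding matrix unchanged; in between, every entry lies between the two tabulated values. A real converter also has to estimate the scene CCT from AsShotNeutral, which this sketch leaves out.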

▸22.14 Datasets⧉

A pointer index, not a survey: the workhorse public datasets behind the methods in this book, grouped by task. Each entry is a name, a one-line description, and a link.

**Classification / features**

• **ImageNet** — million-image labelled classification set; the pretraining backbone behind most vision features. https://image-net.org

• **COCO** — common objects in context: detection, segmentation, captioning. https://cocodataset.org

• **Places** — scene-recognition dataset (millions of images, hundreds of scene categories). http://places2.csail.mit.edu

**Super-resolution**

• **DIV2K** — 2K-resolution high-quality images, the standard SR training set. https://data.vision.ee.ethz.ch/cvl/DIV2K

• **Flickr2K** — 2K Flickr images, often paired with DIV2K for training. *(verify URL)* https://github.com/limbee/NTIRE2017

• **Set5 / Set14 / BSD100 / Urban100** — small standard SR **test** sets (classic images, natural scenes, urban self-similar structure). *(verify URL)* https://github.com/jbhuang0604/SelfExSR

**Deblurring & restoration**

• **GoPro (Nah et al. 2017)** — sharp/blurry video-frame pairs synthesized from high-fps GoPro footage; the standard dynamic-scene motion-deblurring benchmark. *(verify URL)* https://seungjunnah.github.io/Datasets/gopro

• **REDS** — REalistic and Dynamic Scenes (NTIRE challenge set): high-quality video for deblurring, super-resolution, and denoising. *(verify URL)* https://seungjunnah.github.io/Datasets/reds

**Denoising**

• **SIDD** — Smartphone Image Denoising Dataset: real noisy/clean smartphone pairs. https://www.eecs.yorku.ca/~kamel/sidd/

• **DND** — Darmstadt Noise Dataset: real-photograph denoising benchmark (held-out ground truth). https://noise.visinf.tu-darmstadt.de

• **Kodak** — the 24 "kodim" lossless test images, a long-standing denoising/compression benchmark. *(verify URL)* https://r0k.us/graphics/kodak/

**HDR / tone mapping**

• **HDR+ burst dataset** — Google's raw burst dataset behind HDR+ computational photography. https://hdrplusdata.org

• **Kalantari HDR (dynamic scenes)** — multi-exposure bursts with motion, for HDR merging with moving content. *(verify URL)* https://www.robots.ox.ac.uk/~szwu/storage/hdr/kalantari17.html

• **Laval HDR (sky / indoor)** — high-dynamic-range outdoor-sky and indoor panoramas (lighting estimation). *(verify URL)* http://hdrdb.com

• **Fairchild HDR Photographic Survey** — calibrated HDR scenes for tone-mapping evaluation. *(verify URL)* http://markfairchild.org/HDR.html

**Retouching / enhancement**

• **MIT-Adobe FiveK** — 5,000 raw photos each retouched by five expert artists; the standard learned-enhancement set. https://data.csail.mit.edu/graphics/fivek

**Depth / flow**

• **NYU Depth V2** — indoor RGB-D (Kinect) for monocular depth. https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html

• **KITTI** — driving stereo / depth / flow benchmark. https://www.cvlibs.net/datasets/kitti/

• **Middlebury stereo** — classic high-accuracy stereo benchmark. https://vision.middlebury.edu/stereo/

• **MPI Sintel** — synthetic optical-flow benchmark with long-range, large-motion sequences. http://sintel.is.tue.mpg.de

**Color / white balance**

• **NUS / Gehler-Shi** — color-constancy sets with measured ground-truth illuminant (color-checker in scene). *(verify URL)* https://www2.cs.sfu.ca/~color/data/shi_gehler/

• **Cube+** — single-illuminant color-constancy images with a calibration cube. *(verify URL)* https://ipg.fer.hr/ipg/resources/color_constancy

**Faces**

• **CelebA / CelebA-HQ** — celebrity faces with attributes; HQ is the high-res variant for generative work. https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

• **FFHQ** — Flickr-Faces-HQ: 70k high-quality aligned faces (the StyleGAN set). https://github.com/NVlabs/ffhq-dataset

• **LFW** — Labeled Faces in the Wild: the classic face-verification benchmark. http://vis-www.cs.umass.edu/lfw/

**Inpainting / segmentation / matting**

• **Places2** — the same scene corpus listed as *Places* under Classification / features above; also the standard set for **inpainting** and scene parsing.

• **ADE20K** — densely annotated scene parsing / semantic segmentation. https://groups.csail.mit.edu/vision/datasets/ADE20K/

• **Composition-1k** — the standard alpha-matting benchmark (foregrounds composited over many backgrounds). *(verify URL)* https://sites.google.com/view/deepimagematting

▸22.15 A camera-feature wish list⧉

• **framing**: most asks are not hard *algorithms* — the book covers them — they are missing because of conservative firmware, cramped UI, and closed software. The list is the book held up as a mirror to real products.

• **exposure, ISO, HDR**: tiered **auto-ISO** (minimum shutter speed as a first-class rule); **ambient/flash balance** control; **in-camera HDR bracket → 16-bit raw merge** (→ [[HDR merging]]); **auto-ETTR** with a pull flag recorded in metadata; **frame averaging** for a low-ISO, low-noise long exposure (→ [[Denoising]], the $\sqrt{N}$ argument).

• **bracketing the controls**: alternate settings *within a burst* — flash/no-flash pairs, slow+fast shutter pairs, and **aperture bracketing** at constant exposure for post-hoc depth-of-field choice (→ [[Depth of field]]; light field).

• **focus & DoF**: **focus stacking** with step-size control and blur-boundary detection (→ [[Depth of field]]); focus-shift compensation; cycle through detected **faces/AF subjects** when it locks on the wrong one (→ [[Focus, autofocus]]).

• **computational raw / sensor**: a **true per-channel raw histogram** (not the JPEG proxy); **lossless-compressed raw**; **pixel-shift multi-shot high-res** (→ [[Super-resolution and image priors]]).

• **motion data & metadata**: record the **gyro/accelerometer** streams for shake removal, blur assessment, stabilization (→ [[Video stabilization and rolling-shutter correction]]) and automatic **perspective correction** (→ [[Perspective distortion and its correction]]); in-camera **rating/sorting** that round-trips to the editor (cross-ref [[Appendices#EXIF and image metadata]] — XMP/IPTC); save/restore settings as a text file; separate stills/video settings.

• **panorama**: an electronic **sweep-panorama** mode like a phone's (→ [[Automatic panorama stitching from multiple views and feature matching]]).

• **UI & ecosystem**: better self-timer / long-exposure / **AF-ON** ergonomics; **automatic Wi-Fi offload** at home; and the wish that subsumes the rest — **open the software ecosystem** so third parties can add exactly these features. The recurring lesson: the bottleneck is openness, not optics.

▸22.16 How this book was created⧉

This book was written **with an AI collaborator** (Anthropic's Claude), as a deliberate experiment in AI-assisted authoring — a fitting, slightly meta exercise for a book about computational imaging. The short version: an AI did much of the drafting, building, and checking; a human did all of the deciding. The longer version (below) matters because the failure modes everyone worries about — hallucinated facts, inconsistent notation, prose that drifts over hundreds of pages — are exactly what the method is built to prevent.

▸22.16.1 Two documents, not one⧉

• The central design decision: keep two living documents apart.

• **Outline** = an annotated spec of *what* the book says — structure, order of ideas, and per-section figures, equations/symbols, prerequisites, and the papers/slides it draws on.

• **Prose** = the actual text a reader sees — about *how* it reads. A **compilation** step turns outline (plus supporting files) into draft prose, the way a compiler turns source into a program.

• Point of the separation: leverage — structure/content reasoned about and rearranged cheaply in one place, voice/phrasing in another.

• The arrow runs both ways: when the author edits prose, the change is **triaged** home — *content* (new point, reorder, corrected claim) → outline; *voice* (word choice, rhythm) → style guide; purely local polish → locked in place so it is never silently overwritten.

• Central promise: *re-compiling reproduces essentially the author's edits* rather than regenerating something new. The outline is kept a strict **superset** of the prose's content, so nothing emerged-in-writing is lost on the next pass.
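The triage rule above is essentially a routing table. As a rough sketch (the enum, the dictionary, and the `triage` function are hypothetical names for illustration, not the actual tooling), it could look like this:

```python
from enum import Enum


class EditKind(Enum):
    CONTENT = "content"            # new point, reorder, corrected claim
    VOICE = "voice"                # word choice, rhythm
    LOCAL_POLISH = "local polish"  # purely local fix


# Where each kind of author edit is sent "home".
DESTINATION = {
    EditKind.CONTENT: "outline",
    EditKind.VOICE: "style guide",
    EditKind.LOCAL_POLISH: "locked in place",
}


def triage(edits):
    """Group (description, kind) pairs by the document that should absorb them."""
    routed = {destination: [] for destination in DESTINATION.values()}
    for description, kind in edits:
        routed[DESTINATION[kind]].append(description)
    return routed


routed = triage([
    ("corrected the claim about outliers", EditKind.CONTENT),
    ("shortened an opening sentence", EditKind.VOICE),
    ("fixed a comma in one caption", EditKind.LOCAL_POLISH),
])
```

Classifying an edit is the judgment call; once that is made, the routing is mechanical, which is what lets a re-compile reproduce the author's changes.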

▸22.16.2 Compiling a section⧉

• LLMs are at their worst in long, drifting conversations where early mistakes calcify and context "rots." Antidote: **files are the durable memory, the chat is disposable.**

• Each section is compiled in a **fresh, clean context** from an explicit **capsule** of exactly what it needs — never from accumulated conversation.

• The capsule holds: the section's outline bullets; a one-line chapter thesis + neighbouring sections for continuity; in-scope notation and relevant style rules; and — the key part — the **source material**.

• Two kinds of source, different roles: (1) the teaching the book grew from — lecture **slides** and cleaned **transcripts**, as machine-readable digests — supply *pedagogy and voice* (framing, worked examples, analogies, order of reveal); (2) the technical literature — cited **papers and textbooks** on disk — supply *precision* (definitions, results, attributions).

• Division of labour: slides/transcripts decide *how* to explain, papers decide *what is true*. Grounding every claim in a named source is the first defence against fabrication. The rule: write from the sources, not the model's memory.
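
• **Sketch (illustrative).** One way to picture the capsule is as a plain record that serializes to a single prompt and refuses to compile without sources. The class names, field names, and prompt layout are assumptions made for this example, not the book's real format.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Source:
    name: str
    role: str   # "pedagogy" (slides, transcripts) or "precision" (papers, textbooks)
    text: str

@dataclass
class Capsule:
    outline_bullets: list[str]
    chapter_thesis: str
    neighbours: list[str] = field(default_factory=list)
    notation: dict[str, str] = field(default_factory=dict)
    style_rules: list[str] = field(default_factory=list)
    sources: list[Source] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Serialize the capsule into one self-contained prompt (no chat history)."""
        if not self.sources:
            raise ValueError("no source material: refusing to write from memory")
        lines = ["THESIS: " + self.chapter_thesis]
        lines += ["NEIGHBOUR: " + n for n in self.neighbours]
        lines += [f"NOTATION: {sym} = {meaning}" for sym, meaning in self.notation.items()]
        lines += ["STYLE: " + rule for rule in self.style_rules]
        lines += ["OUTLINE: " + bullet for bullet in self.outline_bullets]
        for s in self.sources:
            lines.append(f"SOURCE ({s.role}) [{s.name}]: {s.text}")
        return "\n".join(lines)
```

• Because the prompt is built only from the record, two compilations of the same capsule start from identical context, which is the point of keeping the chat disposable.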

▸22.16.3 Figures as code⧉

• No diagram or example image is drawn by hand. Each figure is a small **program** that builds it — regenerates from source, stays consistent with the others, updates when data/style changes (the same reproducibility ethic the book preaches for image-processing code).

• Two libraries: `imageops` holds the actual image **operations** a figure demonstrates (exposure, contrast, tone mapping, convolution, pyramids, demosaicking…), written once so figure and prose can never disagree; `figstyle` holds one shared **visual style** — one palette, one font set, consistent labels — applied everywhere.

• A figure is specified in the outline alongside the text it supports, built to a **legibility standard** (no text smaller than body type, nothing clipped/overlapping), and checked so the prose actually *leverages* it.

• Photographs used as examples are the author's own, or carefully credited Creative-Commons images.
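
• **Sketch (illustrative).** The legibility standard lends itself to a mechanical check. The version below assumes each figure can report its text labels with a point size and a bounding box; `Label`, `BODY_PT`, and `legibility_issues` are hypothetical names, and the 10 pt body size is a placeholder, not the book's value.

```python
from dataclasses import dataclass
from itertools import combinations

BODY_PT = 10.0  # assumed body-type size; the book's real value is not stated

@dataclass(frozen=True)
class Label:
    text: str
    size_pt: float
    x0: float
    y0: float
    x1: float
    y1: float   # bounding box in figure units, with x0 < x1 and y0 < y1

def overlaps(a: Label, b: Label) -> bool:
    """True if the two bounding boxes intersect with nonzero area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1

def legibility_issues(labels, width, height):
    """List violations: text below body size, clipped text, overlapping text."""
    issues = []
    for lab in labels:
        if lab.size_pt < BODY_PT:
            issues.append(f"too small: {lab.text!r}")
        if lab.x0 < 0 or lab.y0 < 0 or lab.x1 > width or lab.y1 > height:
            issues.append(f"clipped: {lab.text!r}")
    for a, b in combinations(labels, 2):
        if overlaps(a, b):
            issues.append(f"overlap: {a.text!r} / {b.text!r}")
    return issues
```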

▸22.16.4 Text-to-image: cover art and stylized figures⧉

• A pleasing bit of recursion: the book's **decorative** imagery was made with a **generative text-to-image model** — the same diffusion/transformer family the book explains in [[Generative AI and diffusion]]. A book about computational photography, illustrated by computational photography's newest tool.

• The line to keep straight: **technical figures are code** (previous section) — generative models are *not* trusted to draw a ray diagram or a Bayer grid, where a hallucinated arrow or a mangled equation would be a lie. The model is used only where invention is the point: **ornament**, not evidence.

• **The model.** We used **Reve**, a text-to-image API, driven by two small scripts that read a key, post a prompt (or an image-edit instruction), and save the returned PNG — each call metered in credits, so generation is always a deliberate, bounded selection, never an open faucet.

• **Cover and title art** (`reve_gen.py` ← `cali.txt`). A bank of prompts, each pairing the book's title with one *calligraphic hand* (copperplate, blackletter, Spencerian, Carolingian, Insular/Celtic, Art Nouveau, Roman capitals, brush) and one *gloriously over-engineered futuristic camera* bristling with mismatched lenses — rendered as paper-quilling, gold leaf, illuminated manuscript, and so on. Each prompt is generated in **several variants**; the author picks the keepers. The whole cover suite is one prompt file plus one command.

• **Stylized figures** (`reve_stylize.py` ← `figstyles.txt`). The *opposite* direction — **image-to-image**: take an already-built, typeset figure and ask Reve to **re-render it in another artistic medium** (chalk on a blackboard, blueprint, da-Vinci ink, Diderot engraving, watercolour…) while *preserving every label and equation*. Used for chapter-opener flavour, never to compute or alter content.

• **The fidelity guard.** Generative edits love to redraw a typeset "$r_d$" as literal LaTeX or drop a subscript. Two defences: every edit instruction carries an explicit *keep the math as real typeset glyphs, no backslashes/dollar-signs* clause, and each result is optionally checked by a **vision model** (`reve_verify.py`) that compares it against the original and flags any **major** content drift (a missing panel, a changed sign, an illegible label) — cosmetic medium differences pass, meaning-changing ones are rejected.

• **Honesty.** Every generated image is decorative or illustrative and labelled as such; none is presented as a photograph, a dataset sample, or the output of an algorithm. Where a figure *mimics* a result without being one, it carries a visible **SIMULATED** mark (see the figure conventions). Provenance — model, prompt, and that the image is synthetic — is recorded with the asset.
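
• **Sketch (illustrative).** The prompt bank and the fidelity clause can be sketched without touching the image API at all (no Reve calls are shown, since the API's details are not given here). The lists are a subset of the hands and media named above; crossing every hand with every medium, the prompt wording, and the function names are assumptions made for the example.

```python
from itertools import product

TITLE = "Computational Photography"
HANDS = ["copperplate", "blackletter", "Spencerian"]            # subset of the hands listed above
MEDIA = ["paper quilling", "gold leaf", "illuminated manuscript"]

# Clause appended to every image-edit instruction (wording is illustrative).
FIDELITY_CLAUSE = (
    "Preserve every label and equation exactly; keep the math as real typeset "
    "glyphs, with no backslashes or dollar signs."
)

def cover_prompts(variants=2):
    """Yield (prompt, variant_index): one prompt per hand x medium, repeated per variant."""
    for hand, medium in product(HANDS, MEDIA):
        prompt = (
            f'The title "{TITLE}" in {hand} calligraphy, beside an over-engineered '
            f"futuristic camera bristling with mismatched lenses, rendered as {medium}."
        )
        for v in range(variants):
            yield prompt, v

def stylize_instruction(medium):
    """Build an image-to-image instruction that always carries the fidelity clause."""
    return f"Re-render this figure as {medium}. {FIDELITY_CLAUSE}"
```

• Counting the yielded pairs before any call is made gives the credit budget up front: 3 hands × 3 media × 2 variants is 18 generations.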

▸22.16.5 Text-to-3D: models for figures and interactive demos⧉

• A second generative tool, in a different medium: where a figure or an **interactive demo** needs a **3D model** — an insect, a camera, an eyeball, a studio umbrella, a human figure — we generate it from a text prompt with **Meshy**, a **text-to-3D** API, rather than hunting for a correctly-licensed mesh. The same recursion as the cover art: a book about imaging, populated by models a generative system imagined.

• **Where it is used.** The book's **interactive WebGL demos** for Photography 101 (the exposure triangle, portrait lighting, dolly zoom, depth of field, and face-distortion simulators) need real 3D geometry to light, focus, and reproject: a beetle / bee / ladybug for the macro depth-of-field bokeh, flowers for the blurred background, studio umbrellas for the lighting rig, diverse human figures for the portrait and perspective demos, and props (cameras, an eye) for general illustration.

• **The pipeline** (`generate_mesh.py` ← a prompt). One small script reads a key, posts a prompt, polls until the model is ready, downloads a **GLB**, and embeds it as base64 so it loads offline — no server, even from a local `file://` page. Two modes mirror a credit/quality trade-off: **preview** returns *untextured* geometry (cheaper) for demos that apply their **own** material — a metallic shell, a grid — and **refine** adds **PBR textures** when colour carries the meaning (a bee's stripes, a ladybug's spots). Untextured meshes are decimated to a light polygon budget; textured ones are kept whole (decimation would drop the texture coordinates). Each call is metered in credits, so generation stays a deliberate, bounded selection, never an open faucet.

• **Diversity, on purpose.** The human figures are generated as a deliberately **diverse set** — varied gender, race, age, and dress — so the people who populate the demos and figures are not a monoculture.

• **Honesty and credit.** As with the text-to-image art, these models are **illustrative stand-ins**, never presented as photographs of real objects, specimens, or people, and their synthetic provenance is recorded with the asset. **Every caption that uses a Meshy-generated model acknowledges it**, e.g. *"3D model generated with Meshy AI"*. The convention is firm: AI-made assets are labelled where they appear.
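
• **Sketch (illustrative).** The last two steps of the pipeline, choosing a mode and embedding the downloaded GLB, can be sketched with the standard library alone. No Meshy request is shown, because the API's details are not given here; `plan` and `glb_to_data_uri` are hypothetical helper names. The `glTF` magic bytes and the `model/gltf-binary` media type come from the glTF format itself.

```python
import base64

GLB_MAGIC = b"glTF"  # first four bytes of every binary glTF (GLB) file

def plan(colour_matters: bool) -> dict:
    """Pick the mode and post-processing following the trade-off described above."""
    if colour_matters:
        # Textured result; skip decimation so texture coordinates survive.
        return {"mode": "refine", "decimate": False}
    # Untextured geometry; the demo applies its own material.
    return {"mode": "preview", "decimate": True}

def glb_to_data_uri(glb_bytes: bytes) -> str:
    """Encode a GLB file as a data URI that a page can load without a server."""
    if glb_bytes[:4] != GLB_MAGIC:
        raise ValueError("not a GLB file (missing 'glTF' magic)")
    payload = base64.b64encode(glb_bytes).decode("ascii")
    return "data:model/gltf-binary;base64," + payload
```

• The cost of this design is size: base64 inflates the payload by about a third, which is one reason to decimate untextured meshes first.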

▸22.16.6 Text-to-video: animated covers⧉

• A third generative medium: the cover art is also brought to life as short, silent, **seamlessly looping videos** for the web edition. The index page offers an **animated / static** toggle, and animated mode plays one of the loops at random (`cover_animate.py` ← a cover image, via **Kling** image-to-video on **fal.ai**).

• **The hard part is the loop.** The obvious way to force a true loop — pin the model's *end frame* to the *start frame* — backfires: the laziest path to landing on the identical final frame is to barely move, so the clip comes out nearly frozen. The loops are made the other way round: generate slow, calm, **cyclic** motion — prompted so every element *returns to its starting state* (a full 360° rotation, figures walking back to their pose) — then close the seam with a short **forward crossfade-wrap** (a brief dissolve, never a reversing "ping-pong"). A flawless seam with lively motion is genuinely hard; truly cyclic motion, or a purpose-built looping-context video model, is the clean route.

• **Honesty.** Like the still art, the animations are generative and decorative — never presented as captured footage; their synthetic provenance is recorded with the asset.
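
• **Sketch (illustrative).** A forward crossfade-wrap can be written in a few lines: the last `overlap` frames are dissolved into the first `overlap` frames, and the clip is shortened by that amount. Frames are plain lists of pixel values here to keep the example dependency-free; the linear ramp and the function name are assumptions, not the actual `cover_animate.py`.

```python
def crossfade_wrap(frames, overlap):
    """Make a clip loop by dissolving its tail into its head.

    frames: list of frames, each a flat list of pixel values.
    overlap: number of frames consumed by the dissolve.
    Returns len(frames) - overlap frames; played on repeat, time only
    moves forward (there is no reversed "ping-pong" segment).
    """
    n = len(frames)
    if not 0 < overlap <= n // 2:
        raise ValueError("overlap must be between 1 and half the clip length")
    out = []
    for i in range(overlap):
        a = (i + 1) / (overlap + 1)   # weight of the head frame, rising toward 1
        tail, head = frames[n - overlap + i], frames[i]
        out.append([(1 - a) * t + a * h for t, h in zip(tail, head)])
    # The middle of the clip is passed through untouched.
    out.extend(frames[overlap:n - overlap])
    return out
```

• The dissolve only hides a seam that is already small. If the tail and head differ a lot, the blended frames show a ghosted double image, which is why the motion is prompted to be cyclic first.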

▸22.16.7 Verification and review⧉

• A chapter is not allowed to start drafting until it passes a **readiness gate**: outline fleshed out; prerequisites declared and satisfied upstream; figures built (or stubbed with placeholders); equations/notation locked; references resolving to real documents; source coverage checked.

• After a draft exists, it is run past a panel of automated **reviewers**, each a different lens: technical (inaccuracies, missing assumptions); pedagogical (where a student gets lost); prerequisite (assumed-but-never-introduced background); style (single-author voice); plus figure–text fit, citation correctness, and faithful coverage of slides/papers.

• Findings collect into a running issues list and steer the next revision. Mechanical checks — consistent notation, language-isolated code, sane value-range/encoding assumptions — run separately and automatically.

• **A second opinion from a rival model.** Because the drafting AI shares blind spots with its own prose, an *independent* model from a different vendor (here, OpenAI's GPT) is run as an outside reviewer over the outline or the drafts — at chapter, part, or whole-book scope — tasked only with disagreeing. Its notes are triaged: small, unambiguous fixes the editor applies directly; judgment calls escalated to the author. Crossing model families catches errors a single model would wave through (it flagged, for instance, a real optics slip in an early draft).

• The verifiers don't replace human review; they make it land on what matters.
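
• **Sketch (illustrative).** The readiness gate is a checklist, so it reduces to a record of booleans and a function that names whatever failed. The field names mirror the six conditions listed above; `ChapterStatus` and `readiness_failures` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class ChapterStatus:
    outline_complete: bool
    prerequisites_satisfied: bool
    figures_built_or_stubbed: bool
    notation_locked: bool
    references_resolve: bool
    source_coverage_checked: bool

def readiness_failures(status: ChapterStatus) -> list[str]:
    """Return the names of failed checks; an empty list means drafting may start."""
    return [name for name, ok in vars(status).items() if not ok]

if __name__ == "__main__":
    status = ChapterStatus(True, True, False, True, True, False)
    print(readiness_failures(status))
    # ['figures_built_or_stubbed', 'source_coverage_checked']
```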

▸22.16.8 Keeping a long book coherent⧉

• Hardest problem in piecewise generation: the pieces must add up to one object. Three measures.

• **Single-source registries**: one master **notation** table owns every symbol; one **glossary** owns every key term; a registry of recurring **"big lessons"** (imaging is multiplicative; color is non-orthogonal and non-negative; the right basis diagonalizes; edge-preserving filtering is about affinity; quantization is rarely the culprit) records where each first appears and recurs. Defined in one place, referenced everywhere → cannot drift.

• **Dependency graph**: each chapter declares what it *provides* and *requires*; the build checks nothing is used before introduced, and propagates a changed shared definition to every dependent chapter.

• **Version control**: every change to prose and tooling is committed, so generated and edited versions sit side by side, edits extract as diffs, and history is recoverable. A verbatim **log of every instruction** and a dated **process changelog** accompany the repo, so the (continuously evolving) method can be reconstructed; much of this appendix is built from those records.
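
• **Sketch (illustrative).** The dependency check amounts to one pass over the chapters in reading order. The data layout (a list of `(name, provides, requires)` tuples) and both function names are assumptions for the example; `dependents` only finds direct dependents.

```python
def check_order(chapters):
    """Find concepts that are used before they are introduced.

    chapters: list of (name, provides, requires) in reading order,
    where provides and requires are sets of concept names.
    Returns a list of (chapter, concept) violations.
    """
    seen = set()
    problems = []
    for name, provides, requires in chapters:
        for concept in sorted(requires - seen):
            problems.append((name, concept))
        seen |= provides
    return problems

def dependents(chapters, concept):
    """Chapters that require a concept: the ones to recheck when its definition changes."""
    return [name for name, _, requires in chapters if concept in requires]

if __name__ == "__main__":
    book = [
        ("Filtering", {"convolution"}, set()),
        ("Pyramids", {"pyramid"}, {"convolution"}),
        ("Blending", set(), {"pyramid", "fourier"}),
    ]
    print(check_order(book))                # [('Blending', 'fourier')]
    print(dependents(book, "convolution"))  # ['Pyramids']
```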

▸22.16.9 The toolchain⧉

• Prose written once in **neutral markup**, rendered to several outputs — print/PDF and a web edition — from a single source, so it is proofread once, not per-format.

• The same single-source trick produces two flavours, one with **Python** code and one with **C++**: shared explanation written once, language-specific snippets swapped in per build, so a reader sees only one language, never both.

• Derived artifacts — slide digests, figures, and eventually compiled chapters, index, glossary — follow an **incremental, build-style** discipline: each records its inputs and rebuilds only when those change, so a small edit doesn't trigger wholesale regeneration.

• The book carries a **version number** that ticks up on each recompile.
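
• **Sketch (illustrative).** The incremental discipline can be sketched as content hashing against a manifest: an artifact is rebuilt only if it is missing or the digest of its inputs has changed. The JSON manifest layout and the function names are assumptions; the book's build may record its inputs differently.

```python
import hashlib
import json
from pathlib import Path

def digest(inputs):
    """Hash the names and contents of the input files, in a stable order."""
    h = hashlib.sha256()
    for p in sorted(str(p) for p in inputs):
        h.update(p.encode("utf-8"))
        h.update(Path(p).read_bytes())
    return h.hexdigest()

def _load(manifest_path):
    path = Path(manifest_path)
    return json.loads(path.read_text()) if path.exists() else {}

def needs_rebuild(target, inputs, manifest_path):
    """True if the target is missing or its recorded input digest is stale."""
    if not Path(target).exists():
        return True
    return _load(manifest_path).get(str(target)) != digest(inputs)

def record_build(target, inputs, manifest_path):
    """Store the current input digest after a successful build."""
    manifest = _load(manifest_path)
    manifest[str(target)] = digest(inputs)
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
```

• Hashing contents rather than comparing timestamps means that checking out an old commit or touching a file without changing it does not trigger regeneration.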

▸22.16.10 What the machine did, and what it did not⧉

• "Written by AI" is both an overstatement and an undersell — worth being precise about the division of labour.

• The AI was excellent at the laborious-but-well-specified: drafting fluent prose from a tight, source-backed outline; writing and debugging figure-generation code; mechanical bookkeeping (notation tables, cross-references, consistency sweeps); and building the tooling that ran the process. It gave one author the throughput of a small team.

• It was *not* trusted with judgment. A human chose structure and argument, decided what mattered and what to cut, supplied the taste, caught subtle errors that survive plausible prose, and owns every remaining claim.

• The safeguards above — source-grounding vs fabrication, fresh per-section context vs drift, single-source registries vs incoherence, a reviewer panel feeding a human editor — exist precisely because an unsupervised model would fail in each way.

• Honest summary: the AI is a powerful collaborator and tireless build engine; the responsibility, like the byline, stays human.

▸22.16.11 Who wrote what: a per-part estimate⧉

• An honest, prompt-log-reconstructed table of the **human vs AI** split, **per part**, for the **outline** (content/structure) and the **figures**, rounded to 5%: proportions, not precision.

• Caveat up front: "authorship" of the outline is fuzzy. The human set structure, chose every topic, and supplied the course material, while the AI expanded those seeds into connective text; a low human % means *directed-not-absent*. A figure counted "AI-originated" was still requested in general terms and accepted/killed by the human.

• Two metrics: **outline human %** (originated as explicit human direction / course material vs AI expansion) and **figures human-specified %** (named/described by the human vs AI-proposed from a "suggest the figures" prompt).

• Patterns: human outline-share peaks where the book leans on the author's course/research (Fundamentals, Basic, Optics, Multiple-exposure, Performance/Halide, pset Appendices) and dips in survey-style later parts; human figure-share concentrates in the drafted-and-iterated parts (Fundamentals, Basic) and the Appendices (author's own handout figures). Whole-book ≈ **30–35%** human on the outline, ≈ **20%** human-specified on the figures; everything is reviewed and owned by the human regardless of who first drafted it.

▸22.17 The course tutor: a local, book-grounded AI teaching assistant⧉

• **What it is.** A terminal **and** browser chatbot tutor for the course, grounded in *this book* plus the lecture slides and pset handouts. It answers conceptual questions fully and, for problem-set work, **guides** — concept, the section to read, the next step — without handing over answers or solution code.

• **Local-first.** The default runs an open-weights model (**Gemma**, via Ollama) **on the student's own machine** — no API key, no per-token cost, works offline. An install-time probe picks a model tier that fits the hardware (a small model on any laptop; a larger, multimodal one on a GPU / Apple Silicon).

• **Retrieval-augmented (RAG).** The book drafts, outline, companion files, slide digests, and pset *handouts* are chunked, embedded, and stored in a small local vector index; each question retrieves the most relevant passages, which are fed to the model as grounded context — the same "write from the sources, not memory" rule the authoring pipeline uses. **Solutions and answer keys are deliberately never indexed.**

• **Guide, don't solve.** Enforced by a Socratic system prompt + worked exemplars, *not* by a brittle filter (a student can get answers from any frontier model anyway — the point is a genuinely useful *guided* tool, the CS50 stance). A **grounding gate** makes the tutor say "the material doesn't cover this" rather than confabulate when retrieval is weak (the Jill-Watson "decline when uncovered" behaviour).

• **It links and shows.** Answers cite the book by name and **link** to the published section; in the browser it renders **equations** (LaTeX/KaTeX) and shows the relevant **figures** inline — the reasons a browser UI beats a terminal for a visual, mathematical subject. A **vision-capable** model lets a student upload an image (their result, an artifact) and ask about it.

• **Two front-ends, one core.** A terminal client (lives in the dev shell, can see code, works over SSH) and a local browser app (equations, figures, image upload, a model dropdown) share the same retrieval + inference + logging core. An optional **online backup** (instructor-hosted) serves students whose machines can't run a local model.

• **Logging and analytics for the instructor.** Every interaction (question, retrieved sources, response) is logged and delivered to the instructor. A batch pipeline produces a **dashboard**: what help students seek, usage over time, per-student activity, retrieval-coverage gaps, and a **pedagogy audit** — an independent model reviews each turn for quality (correct / confusing / harmful), grounding, and answer-leaks, surfacing problem turns. Student identities are **pseudonymized and PII-scrubbed** before any third-party model sees a trace.

• **Privacy & honesty.** Interaction logs are student records (FERPA): disclosed in the syllabus, access-restricted, anonymized before external analysis. The tutor is labelled as an AI that can be wrong; it points students *back into the book* rather than substituting for it.
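
• **Sketch (illustrative).** Retrieval, the never-index-solutions rule, and the grounding gate fit in one small example. To stay dependency-free it uses a toy bag-of-words vector in place of a real embedding model, and the `min_score` threshold of 0.2 is an arbitrary assumption; all function names are hypothetical and this is not the tutor's actual code.

```python
import math
import re
from collections import Counter

EXCLUDED_KINDS = {"solution", "answer_key"}  # deliberately never indexed

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs):
    """docs: list of (source_name, kind, text). Returns (name, vector, text) entries."""
    return [(name, embed(text), text) for name, kind, text in docs if kind not in EXCLUDED_KINDS]

def retrieve(index, question, k=3, min_score=0.2):
    """Top-k passages scoring at least min_score; empty when retrieval is weak."""
    q = embed(question)
    scored = sorted(((cosine(q, vec), name, text) for name, vec, text in index), reverse=True)
    return [hit for hit in scored[:k] if hit[0] >= min_score]

def context_or_decline(index, question):
    """Grounding gate: decline instead of answering when nothing relevant is retrieved."""
    hits = retrieve(index, question)
    if not hits:
        return "The material doesn't cover this."
    return "\n".join(f"[{name}] {text}" for _, name, text in hits)
```

• Filtering at indexing time, rather than at answer time, means a solution can never be retrieved regardless of how the question is phrased.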

▸22.18 The semi-automatic grading system⧉

• **The shape (to confirm).** Each pset ships a **test harness** + reference data; submissions are built and run against it, image/numeric outputs compared to references **within a tolerance**, and a provisional score produced for a human to review and adjust.

• **What is automatic vs human (to confirm).** *Automatic:* compiles, runs without crashing, output matches reference within tolerance, performance within budget where relevant. *Human:* partial credit for near-misses, code style/clarity, the open-ended make-your-own / video / ethics components, and any appeal.

• **To fill in (from Frédo):** the exact harness and languages (Python/C++), per-pset reference outputs and tolerances, the rubric and point allocation, how grades + feedback are returned to students, the academic-integrity / permitted-AI-use policy (and how it interacts with [[Appendices#The course tutor: a local, book-grounded AI teaching assistant|the tutor]]), and any plagiarism/similarity checks.
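
• **Sketch (illustrative, pending the details above).** The automatic half of the comparison might look like the following. Everything here is an assumption, since the real harness, tolerances, and rubric are still to be confirmed: outputs are flat sequences of numbers, the tolerance is absolute, and each passed check is worth the same number of points.

```python
import math

def compare(output, reference, abs_tol=1e-3):
    """Compare flat sequences of pixel or numeric values.

    Returns (passed, max_abs_error). A length mismatch fails outright.
    """
    if len(output) != len(reference):
        return False, math.inf
    max_err = max((abs(o - r) for o, r in zip(output, reference)), default=0.0)
    return max_err <= abs_tol, max_err

def provisional_score(results, points_per_test=1.0):
    """Sum points for passed checks; a human reviews and adjusts the result."""
    return points_per_test * sum(bool(r) for r in results)
```

• Returning the maximum error alongside the verdict gives the human grader what is needed to judge a near-miss for partial credit.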