Computational Photography, an AI-powered Slopendium

▸Part 1 INTRO⧉

▸1.1 From Digital to Computational Photography⧉

⬜ figure not yet created

a montage of computational-photography results — HDR tone mapping (Durand & Dorsey 2002), refocus-after-capture (Ng et al. 2005), motion deblur (Levin et al. 2007), a stitched panorama (problem set) [results-montage figure] fig-results-montage

`fig-demosaick-snapshot` · even a plain snapshot needs computation: one colour per pixel (Bayer mosaic) → demosaicked RGB ✅

`fig-portrait-lighting-sim` · live portrait-lighting simulator (web edition): three soft area lights (key/fill/kicker — az/el/extent/intensity/colour, area-light-supersampled soft shadows) on a 3D face with a physiological melanin/hemoglobin skin model; a from-behind setup view with Meshy studio umbrellas + a camera rig; lighting presets; static fallback is a screenshot. *3D umbrella generated with Meshy AI.* ✅

At first, the digital revolution merely replaced film by digital sensors more amenable to quick snapshots, instant gratification and sharing. But it did more than that : it created the opportunity to add arbitrary computation between the photons in a scene and the final image.

This allows us to do really cool stuff : manage scenes where contrast is excessive, computationally remove blur, combine images into virtual wide angle panoramas, and even capture the full array of light to do stuff like refocusing after the fact and extract 3D.

[results-montage figure — HDR tone mapping (Durand et al.), refocus-after-capture (Ng et al.), motion deblur (Levin et al.), panorama (problem set)]

In fact, even a simple snapshot requires serious computation because each pixel usually captures only one of the 3 channels and the other two need to be reconstructed [demosaick-snapshot figure]. Cell phones these days take much better images than they have a right to because of computational photography.

Plus generative AI is poised to further revolutionize photography and image making with its ability to learn the space of all possible images in a manner reminiscent of a Borges novella (cite *The Library of Babel* and https://arxiv.org/abs/2310.01425)

▸1.1.1 Computation is the new optics⧉

• The principle of image formation stayed largely unchanged since photography's invention: a lens focuses scene rays onto a 2D sensor that records them directly; the final image is essentially a *copy* of the optical image, and quality came from better **optics**.

• Computational photography breaks this: insert computation **between the optical image and the final picture** → alleviate physical limits, enable flexible post-capture editing, record new information (depth, structure), and create novel visual experiences.

• The key shift: the optical image **need not resemble** the final image, which vastly expands imaging strategies. Computation is not just post-processing of a normally-formed image — it changes the optics too: **new optics are co-designed with the computation** to optimize the whole pipeline.

• **coded optics, concretely**: in **coded-aperture** imaging (Levin et al. 2007) a patterned mask replaces the round aperture so the out-of-focus blur becomes *invertible* and its shape encodes depth — the raw photo looks oddly blurred, but mask and deblurring algorithm are **designed together** so that what you recover *after* computation is sharp and depth-aware (forward ref: coded aperture / computational imaging in [[Advanced computational photography]]).

• **"Computation is the new optics."** Optics long extended our ability to image the world; digital processing now vastly expands how we *form* and enhance images.

• Analogy (Ansel Adams, also a trained pianist): the negative is a *score*, the print a *performance*; a "straight print" reproduces the negative mechanically — computation lets us further **interpret the score**.

▸1.1.2 The goals of computational photography⧉

• Prettier pictures

• Alleviate physical limitations

• Record more info (depth, structure, layering).

• Facilitate human-driven post-processing

• New visual experience

• Reveal the invisible

▸1.1.3 Many flavors: from traditional to generative⧉

• traditional, traditional with multiple exposures, new optics/change imaging

• (and now more and more **generative**).

• From generation (painting) to measurement/snapshot to generation again.

• **Sidebar: History of Photography**

See e.g. https://photographpreservationprogram.hsites.harvard.edu/harvards-history-photography-timeline-text-only

https://en.wikipedia.org/wiki/Timeline_of_photography_technology

• **timeline** (rough — to lay out as a clean chronology when writing):

• **optics first**: the **camera obscura** (known since antiquity; described by Alhazen ~1020; a drawing aid for Renaissance painters).

• **1826** Niépce — first permanent photograph (heliography); **1839** Daguerre (daguerreotype, *doesn't fade*) and Talbot (calotype — the negative→positive idea) — photography is announced.

• **1851** Archer — wet **collodion**, cutting exposure to a few seconds; **1861** **Maxwell** — first color photograph (three filters).

• **1888** Kodak roll film ("you press the button…") → **1900** Brownie — photography for everyone; **1895** the **Lumière** brothers (cinema; Autochrome color 1907).

• **1925** Leica — 35 mm becomes the standard small format; **1935** anti-reflection **lens coating** (Smakula, Zeiss; multi-coating / T* in the 1970s — cuts flare, lifts transmission & contrast); **1935** Kodachrome (Kodacolor color-negative film follows in **1942**) — practical color film; **1948** Land — **Polaroid** instant photography (print develops in the camera; folding **SX-70** SLR in **1972**); **1963** Instamatic.

• **1969–75** the **CCD** (Boyle & Smith) and the first digital-camera prototype (Sasson, Kodak, 1975); **1977** Konica C35 AF — first autofocus camera; **1985** Minolta Maxxum 7000 — **phase-detection autofocus** in the SLR; **1995** Canon — first optically **image-stabilized** lens (in-body sensor-shift IS later).

• **the image industry went digital before the camera did (software & scanning)**: even while film still ruled *capture*, the photo *industry* moved to digital **images** — **desktop publishing / digital prepress** (late 1980s) scanned film for page layout; **drum and film scanners**, plus Kodak **Photo CD** (1992), digitised negatives and slides; **Adobe Photoshop 1.0** (**1990**, Thomas & John Knoll) put the **digital darkroom** on a desktop; and **photojournalism** digitised early (Kodak DCS bodies + wire transmission) because filing a picture *fast* beat film quality. Digital images were edited, scanned, and transmitted years before most photographs were *captured* digitally.

• **1990s** first digital cameras (Kodak **DCS** 1991, Apple QuickTake 1994, Nikon **D1** 1999); **2000** first **camera phone**.

• **2012** ImageNet / AlexNet — the deep-learning revolution; **~2014** GANs, then **~2022** **diffusion** models / DALL·E — generative image AI.

• **branches and oddities** to weave in: the **flash**; **holography**; **Lippmann** integral imaging and Lippmann interference color photography; light-field cameras (**Lytro**); and the coining of the term **"computational photography"** (Steve Mann, mid-1990s; broadened by Marc Levoy in the mid-2000s).

• **Sidebar: Notable cameras**

• **Polaroid Land Camera (1948) / SX-70 (1972)** — instant photography: the print develops in your hand, no lab. Edwin Land made the photograph immediate decades before digital; the folding SX-70 SLR (self-developing print) is a landmark of optics and industrial design.

• **Kodak DCS (1991)** — the first commercial digital SLR (a Nikon body + a 1.3 MP Kodak digital back); digital photography begins as a professional tool.

• **Apple QuickTake 100 (1994)** — one of the first consumer digital cameras (0.3 MP).

• **Nikon D1 (1999)** — the first DSLR built as an integrated digital camera (not a film body retrofitted), priced to start displacing film in photojournalism.

• **first camera phone — Sharp J-SH04 (2000, Japan)** — the camera becomes something you always carry; the road to phones as the dominant cameras.

• **Casio EX-F1 (2008)** — an early consumer **high-speed / burst** camera (60 fps stills, up to 1200 fps video) — a hint of the "shoot many frames" computational era.

• **Lytro (2012)** — the first consumer **light-field** camera (refocus after capture) — computational photography as a product category.

▸1.1.4 Objective, perceptual, and subjective⧉

• computational photography spans a spectrum from **objective** to **subjective**:

• **objective** — physics and measurement: radiometry, optics, sensor noise, geometry, reconstruction error. There *is* a right answer — a deblurred image is closer to or further from the ground truth; SNR is a number; a calibration is right or wrong.

• **subjective** — taste and intent: what makes a "good" photograph, composition, a pleasing color grade, an artistic look. No single correct answer.

• **in between: perceptual** — how an image actually *looks* to a human. The surprise is that this middle ground is **fairly objective and quite well modelled mathematically**, along several dimensions: the **contrast sensitivity function**, color appearance, lightness/brightness, visual **masking**, perceived sharpness, and image-quality metrics (SSIM, VDP). "How it looks" can be measured and *predicted*, even though it is not pure physics — which is exactly what lets us **engineer for the eye** (tone mapping, chroma subsampling, perceptual metrics, where to spend bits or smoothing).

• this **objective → perceptual → subjective** spread is a recurring lens for the whole book: it helps to know *which kind of problem* you are solving before reaching for a tool.

▸1.1.5 A limited medium, or a window to the world?⧉

• **two ideals of the photograph.** One tradition treats the photograph as a **window onto reality**: the ideal image is a *perfect reproduction* of the scene, so that viewing the picture is, perceptually, the same as having been there. **Gabriel Lippmann** pursued exactly this — his **interference colour photography** (Nobel Prize, 1908) recorded the full spectrum at each point, and his **integral photography** (1908) recorded directional rays (the ancestor of the light field) — both attempts to capture *everything* the eye could have received. **Helmholtz**, by contrast, spent his *Handbook of Physiological Optics* cataloguing how far the eye is from a clean instrument, which already hints that "reproduce the light exactly" and "reproduce the *experience*" are different goals.

• **but the medium is not transparent — it has hard limits.** A photograph is, physically, a **bounded, flat, low-contrast token** of an unbounded 3-D world:

• **dynamic range** — the scene spans far more contrast (sun to shadow, ~$10^4$–$10^8$:1) than paper or a screen can show (~$10^2$:1); something must give (→ [[Tone mapping]], [[HDR merging]]).

• **flatness** — a 2-D projection discards depth, parallax, and the directional structure of light (→ [[Light fields]], [[3D and depth]]).

• **limited field of view** — a frame is a narrow window cut from a $360°$ surround (→ panoramas, [[Many images and photo collections]]).

• (and more: fixed focus/aperture, a single instant, three colour channels for an infinite spectrum.)

• **the limits are also opportunities — they are the medium's *language*.** This is the key reframing (and the heart of [[#Objective, perceptual, and subjective|the perceptual/subjective view]], developed in **Durand 2002**): a constraint is not only a problem to overcome but an **artistic affordance**.

• **flatness as composition.** By collapsing 3-D onto a plane, the photograph lets the artist **put into relation objects that are far apart in the scene** — align a foreground branch with a distant peak, frame a doorway around a subject, build graphic patterns out of unrelated depths. Flattening is *why* composition (the 2-D arrangement of the picture) is a craft at all.

• **limited contrast as relation.** Because the medium can't preserve the scene's enormous luminance ratios, it **brings parts of the scene into a shared tonal world** — a face in shadow and a bright sky can be made to coexist on the page, relating luminances that the eye would never have reconciled at once. Tone mapping is, in this light, not just damage control but a *compositional* act.

• same for the frame (exclusion as emphasis), the single instant (the "decisive moment"), and shallow depth of field (isolation by blur).

• **takeaway / why it matters for this book.** Computational photography is exactly the toolkit for *negotiating* this tension: it pushes toward the **window** ideal when we want fidelity (HDR, wide gamut, stitched panoramas, light fields — chasing Lippmann's dream with code), and it leans into the **medium's** affordances when we want expression (tone curves, framing, synthetic depth of field). Knowing which ideal you are serving — faithful reproduction or expressive depiction — tells you which tools to reach for.

▸1.1.6 A field at a crossroads⧉

• the field is thrilling precisely because it sits at the **crossroads of an unusual number of disciplines** — you rarely get to touch this many at once:

• **algorithms** · **AI / machine learning** · **physics** (optics, light transport, radiometry) · **chemistry** (film, sensors, dyes) · **electrical engineering** (sensors, circuits, the ISP) · **mechanical engineering** (lenses, shutters, stabilization, camera bodies) · **computer graphics** · **computer vision** · **human perception** · **art** (the craft of photography) · **human–computer interaction (HCI)** · **performance engineering** (real-time, on-device, Halide) · **ethics**

• and they show up *together*: a single feature — a phone's **night mode**, say — leans on physics (photons), EE (the sensor), algorithms (align + merge frames), AI (learned denoising), perception (tone mapping), performance engineering (finish in a second on a phone), and ethics (how much to "beautify") **all at once**. That breadth is the fun of it.

▸1.1.7 It's not just about photography⧉

• the cameras-and-pictures framing is the way *in*, but the machinery — form an image from measurements, model light/optics/sensors, invert with computation — is the same wherever images (or signals) are made.

• **the ordinary camera at extraordinary jobs** (the picture becomes an *input to a decision*, not the product): **robotics & self-driving** (perceive, localize, navigate — same alignment/depth/motion estimation), **surveillance/security**, **industrial inspection / machine vision** (defects on a production line), agriculture, drones, document scanning, biometrics.

• **more advanced optics**: **astronomy** (frame stacking = night mode at scale; adaptive optics; synthetic aperture — Event Horizon Telescope's Earth-sized virtual lens), **microscopy** (light-sheet / confocal / computational; super-resolution past the diffraction limit). Same game: co-design measurement + reconstruction.

• **other EM wavelengths**: infrared (thermal, night vision), ultraviolet, **X-ray → CT** (3D from many 1D projections — pure inverse problem), radio (radio astronomy, **SAR**).

• **medical imaging** is this discipline under another name: CT, **MRI**, **PET**, ultrasound — all regularized inverse problems recovering an image from indirect measurements.

• **non-imaging signals**: the toolkit (sampling, filtering, Fourier, denoising, priors, inverse problems) is really a **signal** toolkit. **Audio** is the obvious neighbor; example — Paul Lamere's **Infinite Jukebox** (find near-identical beats, jump between them → a song plays forever) is **Video Textures** (Schödl et al. 2000) transplanted from frames to sound.

• **computational imaging hiding in plain sight — the optical mouse**: an **optical mouse is a tiny visual sensor** (as the name says) — a **low-resolution camera** (a few tens of pixels square) plus an **image-processing chip** that tracks motion by **correlating successive frames** of the surface under it (the same cross-correlation / sub-pixel-shift estimation that powers alignment and optical flow), thousands of times a second. A computer-vision system shrunk into a $5 part you already own — computational imaging hiding in plain sight.

• **why imaging keeps reappearing — we really only measure electrons and photons**: a surprising amount of "non-visual" sensing is **visual under the hood**, because the two physical quantities we transduce well are **electrons** (charge, current, voltage) and **photons** (light). Electrons are the workhorse — almost every sensor ultimately produces a voltage — but **photons convert into electrons almost for free** via the **photoelectric effect**: a photon frees (or ejects) an electron, the insight that won **Einstein** his Nobel and exactly how a photosite works. So a huge range of measurements is staged as *make the quantity emit or modulate light, then count the photons as electrons*: fiber-optic and spectroscopic sensors, fluorescence assays, **scintillators** that turn X-rays / gamma into visible flashes for a camera, optical readout of pressure / strain / temperature, even reading bits off a disc. Photonic measurement recurs because the **photon ⇄ electron** conversion is cheap, fast, and linear — which is why this book's sensor, a photon counter, is a template far beyond photography.

• takeaway: you're not learning photography tricks but a **way of thinking about measured signals**, illustrated by photographs.

▸1.2 How to read this book⧉

`fig-chapter-dependency-graph` · graph of chapter dependencies (provides/requires) so a reader can chart a path 🟨

• the book is organized **cross-cuttingly**: a chapter is sometimes a **tool** (convolution, the bilateral filter), sometimes a **big idea** (priors, the light field), and the major themes deliberately **recur** across chapters — the same lesson seen from several angles. The redundancy is intentional; cross-references thread the pieces together.

• **hammers and nails**: we deliberately sought a **balance between "hammers and nails"** — between **tools/techniques** (the hammers: convolution, the bilateral filter, optimization, priors) and **applications/challenges** (the nails: HDR, panoramas, deblurring, refocusing). Rather than a purely tool-first or purely application-first table of contents, the book's structure is **organized by a combination of tools and applications/challenges**, alternating between the two so each motivates the other.

• **intuition first**: lead with the idea and a picture, then the math. The audience is a **CS undergrad**, new to imaging (what to know coming in is spelled out in *[[#Prerequisites]]* below).

• **how to read**: FUNDAMENTALS grounds the physics and perception; BASIC IMAGE PROCESSING gives the toolkit; later parts build single- and multi-image computational photography. Read mostly front-to-back, or follow the dependency graph below to chart your own path.

• **how this book came to be**:

• grew out of Frédo's **computational photography course** at MIT — its lectures, slides, and problem sets are this book's backbone.

• written **with an AI collaborator** (Claude) as a deliberate experiment in **AI-assisted authoring** — the outline, figures, drafts, and the book's own tooling were co-developed; a fitting, slightly meta nod to the generative-AI theme the book itself discusses.

• the **process** (how outline, figures, and prose are generated, verified, and kept consistent) is documented in an appendix — itself a small piece of AI methodology [forward ref: "how this book was created"].

• **chapter dependency graph**:

• *(placeholder — to create later)* a **graph of chapter dependencies** belongs here, at the end of the intro: a diagram showing which chapters rely on which, so a reader can chart their own path through the book. Generate it from the **provides / requires** graph (see [[Book Generation Process#Cross-chapter dependencies]] and each section's `- prereqs:`), once enough of the book is written for the dependencies to be stable.

▸1.2.1 Prerequisites⧉

• **the audience**: a **CS undergrad** (or equivalent) — new to imaging, but comfortable thinking computationally. We lead **intuition-first** — the idea and a picture, then the math.

• **math — the small kit** (refreshed in [[Appendices#Refreshers|the Refreshers appendix]], so a rusty reader can catch up): **multivariable calculus** — derivatives and integrals, the **gradient** (and the chain rule; a light touch of **Jacobian / Hessian**), because a pixel is an integral of light and almost every recovery is a **gradient-driven optimization**; and **basic linear algebra** — vectors, matrices, dot products, and a little **eigen / SVD**, because *images are vectors and our operations are linear maps*. A little **probability** (noise) and a bit of **machine learning** round it out. Anything heavier lives in the Refreshers, not the main text.

• **programming — basic is enough, and the rest is learned as you need it**: you should be comfortable with **basic programming** — variables, loops, arrays and indexing, functions — ideally in **Python** (the book also ships **C++**; see *[[#What language for computational photography?]]*). **No advanced programming is required**: you do **not** need to arrive knowing NumPy, PyTorch, OpenCV, GPU/parallel programming, or performance engineering. Those are introduced **where they're needed** and can be **picked up as you go** — the worth-knowing libraries and the language trade-offs are surveyed in the next section, and performance optimization is its own (optional) part. The book deliberately has you implement core algorithms **from scratch**, which keeps the programming elementary and lets the *ideas* carry the weight.

▸1.2.2 What language for computational photography?⧉

• the book is **language-agnostic in its ideas** but ships in two editions — a **Python** one and a **C++** one (never one book showing both) — because the two languages serve different moments: Python to *think and prototype*, C++ to *ship and run fast*. The concepts are identical; pick the edition that matches your goal.

• **why choose which**: **Python** for **deep learning, prototyping, and convenience**; **C++** for **performance**. The gap is not small. When you write the **low-level code yourself** (e.g. your own convolution rather than a library call), the **speed differential** is dramatic: in Frédo's experience, roughly **1000× from Python to optimized C++**, and **at least another ~10× from C++ to Halide**. So the language choice is also a performance choice — see [[Performance engineering and Halide]] and [[Appendices#Programming: Python, C++, and PyTorch]].

• **Python** — the right default for *learning, prototyping, and research*:

• **pros**: **readable** — NumPy array code reads almost like the math (`out = a + 0.5*b`); **fast to write and explore**, with interactive notebooks and instant feedback; a vast **ecosystem** (NumPy, SciPy, Pillow / OpenCV, scikit-image, Matplotlib, and the whole deep-learning stack — PyTorch, JAX); automatic memory; the lingua franca of research and ML.

• **cons**: **slow at the pixel level** — a naive per-pixel Python loop is orders of magnitude slower than C, so you must **vectorize** (push the loop down into NumPy's C core) or it crawls; the **GIL** limits real CPU threading; packaging / deployment is awkward; dynamic typing lets type bugs hide until run time; little control over memory layout or exact performance.

• bottom line: the right default for **learning, prototyping, and research** — and, through NumPy / PyTorch, fast *enough* for most of what we do here.

• **C++** — the right default for *performance and production*:

• **pros**: **fast and predictable** — close to the metal, with control over memory layout, cache behaviour, and SIMD; what real **camera ISPs, real-time pipelines, and shipped products** are written in; static types catch errors at compile time; deploys to phones and embedded hardware.

• **cons**: **verbose and slower to write**; **manual memory management**, a rich source of bugs (out-of-bounds, leaks, use-after-free); a heavier **build / toolchain** story; far less interactive (edit–compile–run, not edit–run). The productivity tax is real.

• bottom line: the right default for **performance and production** — once the algorithm is settled and it must run fast, on device, at scale.

• **Halide** — a **domain-specific language** (embedded in C++ / Python) for high-performance image processing: you write the **algorithm** (what each pixel computes) **separately** from the **schedule** (how to tile, vectorize, parallelize, and fuse it across the memory hierarchy).

• **pros**: near-hand-tuned speed **without** hand-writing the tangled, hardware-specific loops; the *same* algorithm retargets to CPU / GPU / DSP by swapping the schedule; the basis of real camera stacks.

• **cons**: another language and mental model to learn, best reserved for the **hot paths** you have already identified — premature scheduling is its own trap. Covered properly in [[Performance engineering and Halide]].

• **MATLAB (historical note)**: for years *the* platform for this work — a strong numerical/linear-algebra library and an IDE that handled **images natively** (an image is a matrix; `imread`/`imshow`/slicing; the Image Processing Toolbox); much of the literature was prototyped in it. **Largely displaced by Python** (NumPy fills the same array niche, free) and especially **PyTorch** (autodiff + GPU + deep learning, which MATLAB never matched). Worth knowing because older papers/code are in it, and its array-thinking carries over to NumPy.

• **Libraries (implement-from-scratch, then build on)**: the book has you **implement core algorithms from scratch** to understand them (the `imageops` reference), but to *build* / dive into advanced projects, stand on mature libraries — **Python**: Pillow (PIL), NumPy/SciPy, OpenCV, scikit-image, PyTorch (also a GPU-array + autodiff engine); **C++**: OpenCV (de-facto standard), Eigen (linear algebra), LibRaw (raw decode), Halide (hot paths).

• **side bar — Photoshop 1.0, in 128,000 lines (a language lesson from 1990)**: Adobe (via the **Computer History Museum**, 2013) released the source of **Photoshop 1.0.1** — **≈128,000 lines across 179 files**, written for the Macintosh by **Thomas Knoll** (its sole engineer for v1 — a computer-vision PhD student whose 1987 image-display hack became the app). The language split *is* this section's lesson, 35 years early: **≈75 % Pascal** (the productive high-level language) and **≈15 % hand-written 68000 assembly** for the speed-critical inner loops — *most of it readable, the hot pixels hand-tuned* — exactly the Python-vs-C++/Halide trade-off (a high-level language for the bulk, machine code where it counts; what **Halide** now automates). The code is famously *mostly uncommented but well-structured*. It's now studied as a cultural artifact: the **"Das digitale Bild"** project (DFG / LMU Munich) runs an ongoing **Critical-Code-Studies** *"reading the source code of Photoshop"* — code as a text to be close-read. **For scale:** today's Photoshop is reportedly **millions** of lines of C++ (Frédo's estimate ~**10 M**); **Adobe Camera Raw (ACR)** and **Lightroom** share the same Camera Raw engine, and a typical camera **ISP** is a large proprietary C/C++ codebase — exact modern sizes mostly **not public** [⚠️ confirm the ~10 M / current ACR/Lightroom/ISP figures]. (→ the hand-tuned-inner-loop angle returns in [[Performance engineering and Halide]].)

• **Vibe coding (LLM-assisted)**:

• writing image code with an **LLM** ("vibe coding") is now part of the toolkit: great for **boilerplate, I/O, visualization, and explaining errors**, and a quick way to draft a function or a test.

• **but** image code is **numerically and perceptually subtle** — "looks plausible" is not "correct" — so keep the model on a **short leash** for filtering, resampling, color, and antialiasing, and always **verify against a hand-checkable input** (a constant, an impulse, a half-plane) before trusting it. Treat generated image code as *guilty until tested* (see [[Developing, Testing and Debugging]]). AI coding sometimes **fumbles the trivial** even when it nails the hard parts — see the cross-fade anecdote in [[Basic image processing and ISP#Vibe coding: writing image code with an LLM]].

▸1.3 Basic intro to digital images⧉

Just so that we can be concrete and perform simple experiments for the FUNDAMENTALS part.

▸1.3.1 The data structure is gloriously simple⧉

• Although details can matter and can be subtle.

• In this chapter we give the basics and will worry about details later.

1.3.2 An image as a function over the plane⧉

▸1.3.3 An image as a discrete array⧉

fig-pixel-grid · a 32×32 image with each pixel a square + a zoom to the array of numbers (Image representation, INTRO) 🟨

• a tiny image is literally a grid of pixels, each a square holding a few numbers (R, G, B) — show a 32×32 image with each pixel rendered as a square + a zoom to the array of numbers [pixel-grid figure]

▸1.3.4 How the array sits in memory, briefly⧉

• an image is stored as a **1-D block** of numbers; the *packing* (which index runs fastest) is a choice — interleaved **HWC** (NumPy default) vs planar **CHW** (ML), and the C++ vs Python conventions differ. Here we keep it light (just enough to address a pixel — our C++ uses `c·W·H + y·W + x`) and treat memory layout properly in [[Image representation]].

▸1.3.5 Floats in [0, 1], for now⧉

• for now just think low value dark, high value bright,

• and one channel for each primary color (R, G, B)

• (we defer **gamma / encoding** to [[Light and physics]] and [[Image representation]]; here, just "bigger number = brighter")

▸1.3.6 Generating images from scratch⧉

• generate images: constant color, vertical gradient from black to white, Dirac, box, sine wave.

• Finger exercise: checkerboard. Line

• more advanced: mandelbrot? Sierpinsky?

▸1.3.7 Basic point operations⧉

• add or subtract constant, multiply by scalar (~brightening),

• non-linear gamma-ish curve for contrast (what’s a good curve ,

• finger exercise: convert to BW. Negative. Fake color scheme. Sepia (more red and green)

▸1.3.8 Domain operations⧉

• translation, mirror, rotate by 180 degrees. Rotate by 90 degrees.

• Edge effects for rotate 90 degrees (return black for now?).

• Implement with loop over output, not input. Doesn’t matter yet, will matter later.

▸1.3.9 Neighborhood operations⧉

• blur

• exercise: center-surround, finite difference gradient

▸1.3.10 Two-image operations⧉

• add

• subtract

• multiply? Graduated neutral density? Compositing/masking?

▸1.3.11 A little fun to finish⧉

• look apple Photo Booth

• half mirror

• averages of humans? Provide pre-aligned images or use one of the celeb datasets

▸1.4 Problem sets⧉

• **PS0 — Environment & C++ basics**: toolchain setup, a minimal `Image` class, image I/O. (→ [[Developing, Testing and Debugging]])

• **PS1 — Image class, point operations, color**: pixel access, brightness/contrast, luminance–chrominance (YUV), gamma, white balance, the Spanish-Castle illusion. (→ [[Image representation]], [[Point operations]], [[Human (and animal) vision and color]])

• **PS2 — Convolution & the bilateral filter**: box/Gaussian blur, gradients, sharpening, and bilateral denoising (with a YUV variant). (→ [[Neighborhood operations and convolution]], [[Bilateral filtering]])

• **PS3 — Denoising & demosaicking**: align-and-average a burst, ISO/variance; Bayer demosaicking (green-first, edge-based, color-difference) and Prokudin-Gorsky channel alignment. (→ [[Denoising]], [[Demosaicking]])

• **PS4 — HDR**: merge an exposure bracket into a radiance map and tone-map it. (→ [[HDR merging]], [[Tone mapping]])

• **PS5 — Resampling, warping & morphing**: nearest/bilinear/bicubic/Lanczos resampling; Beier–Neely segment warps; a full Beier–Neely morph. (→ [[Warping and resampling]], [[Morphing]])

• **PS6 — Homographies & manual panoramas**: homogeneous coordinates, warp by a homography, solve $H$ from four correspondences, stitch a planar mosaic. (→ [[Manual panorama stitching from multiple views]])

• **PS7 — Automatic panoramas**: Harris corners (structure tensor), patch descriptors, the second-nearest-neighbour ratio test, RANSAC, and feathered / two-scale blending. (→ [[Automatic panorama stitching from multiple views and feature matching]])

• **PS8 — Non-photorealistic rendering**: paintbrush splatting; single-scale, two-scale, and gradient-oriented painterly rendering. (→ [[Non-photorealistic rendering]])

• **PS9 — Make-your-own / video & ethics**: a self-proposed project (optical-flow retiming, phase-based video magnification, …) plus the ethics deliverable; its open menu of advanced topics seeds many [[Advanced computational photography]] sections. (→ [[Optical flow]], [[Video magnification]])

▸Part 2 FUNDAMENTALS OF PHOTOGRAPHY⧉

`fig-light-journey` · FUNDAMENTALS part opener — the journey of light: source → scene (reflection) → lens → sensor / retina → visual system, each step labelled with its chapter ✅

**Photography means writing with light** — from the Greek *phōs* ("light") and *graphē* ("drawing"). This first part follows light on its entire journey and studies each step in turn.

• **the path of light** (and the chapters that follow it): a **source** emits light → it strikes **objects** and is reflected, absorbed, and scattered (the scene) → it passes through a **lens** → it lands on a camera **sensor** or the **retina**'s cones and rods → and the **visual system** interprets it.

• **Light and physics** — what light *is* (wave / ray / photon), how it interacts with matter (reflection, refraction, scattering, the BRDF), and how we *measure* it (radiometry: radiance, irradiance, the inverse-square and cosine laws).

• **Image formation and linear perspective** — how a lens turns 3-D rays into a 2-D image (pinhole, perspective projection), depth of field, the **sensor** and the exposure triangle, and **noise**.

• **Human perception and color** — the eye and visual system; why three **cones** make us trichromatic; color as the projection of a spectrum onto three numbers; opponent processing, constancy, and the contrast sensitivity function.

• **Color technology** — the engineering payoff: **measuring** color (CIE), **encoding** it (linear / gamma / log, color spaces, CIELAB), and **reproducing** it (additive vs subtractive synthesis, gamuts, white balance, color management).

• **Human factors and the art of photography** — making *good* pictures (composition, light, portraiture), the perception of images, and the **ethics** of photography.

• **skippable, but foundational**: a reader who only wants to code can jump straight to BASIC IMAGE PROCESSING — but the **big lessons** of this part recur everywhere later. The ones to carry forward:

• **multiplicative vs additive** — much of imaging is *multiplicative* (illumination × albedo, contrast), some *additive* (incoherent light from different sources, blur); the right **encoding** (linear / log / gamma) follows from which one you are doing (→ [[Big Lessons]]).

• **gamma / ratios matter** — perception is roughly logarithmic, so we store images **gamma-encoded** to spend code levels where the eye looks; what matters is *ratios*, not absolute values.

• **noise is affine** — sensor noise is **read + shot** (a constant plus a term that grows with the signal); because shot noise is **Poisson**, SNR is worst in the **shadows**, which is exactly where noise *looks* worst.

• **projective geometry is matrices + a division** — perspective projection is a **linear map in homogeneous coordinates** followed by a divide-by-depth; that single trick handles cameras, warps, and stereo.

• **color is non-orthogonal and non-negative** — the cone "axes" overlap and light cannot be negative, which is what makes color sensing and reproduction genuinely hard (no perfect set of primaries; white balance is never exact).

• **by the end of this part** you can trace a photon from a lamp to a perceived color, reason quantitatively about exposure and noise, and you will carry the small set of principles the rest of the book keeps reusing.

▸2.1 Light and physics⧉

• chapter intro: a working model of light — its **color** (spectrum), its **interaction with surfaces**, and its **energy** (radiometry); closes with the **plenoptic function** that ties the pipeline together

▸2.1.1 Light: rays, waves, and the spectrum⧉

`fig-light-models` · three models of light: wave (λ) vs ray vs particle/photon propagation, and what each is good for 🟨

`fig-newton-prism` · Newton's prism experiment: white light dispersed into a spectrum (classic diagram) 🟨

`fig-em-spectrum` · EM spectrum / wavelength map 🟨

• three models: **ray** (geometry), **wave** (color, diffraction), **particle/photon** (counting → noise) — use whichever fits the question [wave/ray/particle figure]

• side bar — **wave–particle duality, made visible with a liquid analog**: light is *both* wave and particle, which is famously hard to picture. A startling **hydrodynamic analog** makes the marriage tangible: a tiny oil droplet **bouncing** on a vibrating fluid bath becomes a **"walker,"** carried across the surface by its own **pilot wave**, and these walkers reproduce quantum-like behaviour (single- and double-slit statistics, quantized orbits, tunnelling). They are *not* a theory of quantum mechanics, but they are a gorgeous classical picture of a particle that is inseparable from its wave. [embed video — Couder & Fort and **John Bush's** (MIT) walking-droplet experiments → [[Video Resources]]] (revisited at diffraction, below)

• wavelength, EM map, spectra

• **Newton's prism experiment** (the historical hook): a prism splits white light into the **spectrum** — white light is a *mixture* of wavelengths, and the *experimentum crucis* (a second prism) shows those pure colors don't split further. [classic diagram]

• a spectrum = power vs wavelength; like an **infinity of channels**, of which a camera (and the eye) keep only three → **metamerism** (Color technology)

equations

c = λν (wavelength × frequency)

▸2.1.2 Reflection, refraction, and what happens at a surface⧉

`fig-reflection-refraction-diffraction` · reflection (equal angles) · refraction (Snell, two indices) · diffraction 🟨

`fig-polarization` · unpolarized (random orientations) vs polarized light 🟨

`fig-atmospheric-scattering` · Rayleigh scattering: blue sky (short path) vs red sunset (long path) 🟨

• a camera sees nothing that hasn't bounced off something

• **reflection** (angle out = angle in); **refraction** (light slows in glass/water → *refractive index*; **Snell's law** n₁ sinθ₁ = n₂ sinθ₂); **diffraction** (→ Wave effects section below)

• marching-band (wavefront) intuition; fuller lens geometry deferred to Single lenses

• **interactive — a wave simulation of refraction** (`fig-refraction-wave-sim`): the same 2-D FDTD wave solver as the diffraction figure, but with a slower medium (higher index) filling the lower half of the box. An angled plane wave crosses the interface and its wavefronts **bend toward the normal** and **crowd together** (λ→λ/n) — the wave speed drops, so Snell's law becomes a *consequence* of the wave picture (the wave-level version of the marching-band intuition and Fermat's equal-path view in [[#Single lenses: refraction and Snell's law]]). Sliders for the index n₂ and the incidence angle, with an on-figure Snell check. Companion to `fig-diffraction-wave-sim`.

• **light–matter interaction is largely *resonance***: much of what light does to matter — the **index of refraction** itself, and **absorption** (hence material color) — comes from the **resonances** of atoms, molecules, and bound electrons. Light's oscillating field drives those bound charges, which resonate at characteristic frequencies (the **Lorentz-oscillator** picture); **near a resonance** the material **absorbs** strongly, and the **index of refraction varies with wavelength** (**dispersion**) according to how far each wavelength sits from the nearest resonance. This is why different materials transmit/absorb different colors, and why $n(\lambda)$ gives **prisms, rainbows, and chromatic aberration**. Flag forward: dispersion **shapes much of lens design** — achromatic/apochromatic doublets (crown+flint), the fight against chromatic aberration in fast lenses → [[Optics, lenses, and aberration correction]]. **Interactive** `fig-lorentz-resonance`: a driven bound electron (mass on a spring) with the absorption peak at ω₀ and the n(ω) normal→anomalous dispersion curve; sliders for driving frequency and damping.

• side bar: **polarization** — light is a *transverse* wave (field ⊥ propagation); **polarization** is the **orientation of that oscillation**. Natural light is **unpolarized** (orientation a rapidly varying mix, *no single direction*); reflection partly polarizes it → a polarizing filter cuts reflections / deepens the sky. Two everyday encounters: (a) **good sunglasses are polarizers** that selectively kill glare — *why* lives in the reflection/BRDF material (specular reflection off dielectrics is partially polarized, maximally at **Brewster's angle**) → [[#The BRDF and the look of materials]], the polarizer filter in [[#Lens filters: polarizers, ND, and graduated ND]], and reflection / specular removal in [[Single-image computational photography]]; (b) **LCD displays emit polarized light** — the liquid-crystal shutter forms an image by **rotating polarization between two crossed polarizers**, which is why a polarizer (or polarized sunglasses) makes some screens go dark at certain angles.

• **why some screens go dark through sunglasses and some don't.** Polarized sunglasses pass only **vertically** polarized light (so they can block the horizontally-polarized glare off roads and water). An LCD emits **linearly polarized** light fixed by its **exit polarizer**; whether the screen survives your sunglasses depends on *that polarizer's orientation*: if it emits **vertical** light it passes straight through, if **horizontal** the glasses **black it out**, and at **45°** it dims. Manufacturers choose the axis — laptops and monitors are usually safe in landscape, but many phones (and especially **car dashboards / GPS**, ATMs, and aircraft seat-back screens) are built with a diagonal or horizontal axis, so they wash out or go black when you tilt your head or rotate the phone 90°. The honest fix is to put a **quarter-wave retarder** *outside* the exit polarizer so the emitted light is **circularly** (not linearly) polarized — then it looks the same through any linear polarizer at any rotation; phones that stay bright through sunglasses generally do this.

• side bar: **why the sky is blue and the sunset orange** — **Rayleigh scattering**: air molecules scatter short (blue) wavelengths far more than long (red) ones (∝ 1/λ⁴). At midday the sunlight takes a *short* path through the atmosphere and the blue it sheds scatters into the sky in every direction → the sky reads **blue**; at sunset the light skims a *long* path, its blue is scattered away before it reaches us, and mostly **orange–red** survives → the sun reddens. (Larger droplets scatter all wavelengths about equally — **Mie** scattering → white haze and clouds.) [atmospheric scattering figure] [Minnaert]

• side bar: **rainbows** (spoiler: it's a caustic) — and a desktop demo. A **single spherical raindrop** bends sunlight three times: **refract** in at the front, **reflect** off the back, **refract** out again; because $n$ depends on wavelength (**dispersion**, above) each color exits at a slightly different angle. Across a whole sky of drops those exit rays **pile up** into a bright **caustic** at the **rainbow angle** — about **42° (red)** down to **~40° (violet)** measured from the **antisolar point** (the shadow of your head) — and *that* accumulation is the rainbow. A **glass sphere** reproduces it on a tabletop: it **stands in for a giant water droplet**, and a bright lamp throws its dispersed caustic onto a card (`fig-rainbow-glass-sphere`).

• **the color order, and the second bow**: the **primary** bow has **one** internal reflection — **red on the outside, violet inside**. A fainter **secondary** bow (~51°) comes from **two** internal reflections, with the colors **reversed**; between them sits **Alexander's dark band** (named for Alexander of Aphrodisias, ~200 AD), darker because one-reflection light emerges only at **≤42°** and two-reflection light only at **≥51°**, so the 42–51° gap gets almost no rainbow light (and the sky *inside* the primary is correspondingly brighter). Same geometry, one extra bounce. Real example: a **double bow over Loch Ness** (`fig-rainbow-loch-ness`, photo) showing both bows and the dark band.

• **dispersion alone is NOT a rainbow**: a prism only *fans* colours; the rainbow's brightness and fixed angle are a **caustic**. Vary the impact parameter and the deviation $D(b)$ of the **one-internal-reflection** ray bottoms out at a **minimum** (~138° → bow at 42°) where rays pile up. Crucially, a ray with **only two refractions and no internal reflection makes no bow** — its $D(b)$ is **monotonic**, no minimum, no pile-up. Dispersion only splits the already-bunched light into the band. Interactive Descartes diagram `fig-rainbow-droplet` (water drop; drag the impact parameter, choose the number of internal reflections, and tick **"zoom near the caustic"** to re-range both the slider and the plot tightly around the bow; the deviation-vs-$b$ plot marks the minimum, and a side panel plots the **ray density** — an integral over impact parameter — whose **spike** at the stationary deviation *is* the bow. Dispersion is always on, so three colour bands separate near the caustic).

• **it's an angle, not an object**: the bow is centered on the antisolar point, so it can't be reached and *everyone sees their own* — it is a **caustic of accumulated rays at a particular angle**, not a thing sitting in space (the same "it's a direction" lesson as vanishing points). [Minnaert] [`fig-rainbow-glass-sphere`, real demo photo; `fig-rainbow-droplet`, interactive]

• **scattering & absorption** in media (fog, haze, underwater): single vs. multiple scattering → de-weathering / dehazing; the same physics as the blue sky, at close range

▸2.1.3 The color of objects: illumination times reflectance⧉

`fig-reflectance-illumination-spectra` · example reflectance & illumination spectra 🟨

`fig-illum-times-reflectance` · light = illumination × reflectance (per-wavelength) ✅

• a **red apple is not red in the dark**: color = **illumination × reflectance**

• reflectance ρ(λ) is the surface's intrinsic, fixed signature; illumination S(λ) is whatever light is present; their per-wavelength product C(λ) reaches the eye

• this multiplicative entanglement is exactly what **white balance** must later undo

• scope caveat: the clean product assumes direction-independent reflectance (matte case); the general directional case is the **BRDF**, next

• side bar: **albedo can appear > 100% (fluorescence cheats)** — real (non-fluorescent) reflectance is **≤ 1 per wavelength** (a surface can't return more light than it receives at that wavelength). But a **fluorescent** material **absorbs** short-wavelength (often **ultraviolet**) light and **re-emits** it at **longer** wavelengths, so in the re-emitted band the measured "reflectance" **exceeds 1** — apparent albedo tops 100%. Examples: **optical brighteners** in laundry detergent and paper ("whiter than white"), **highlighter** ink, **fluorescent safety vests**. The trick is **moving energy between wavelengths**, not creating it.

equations

reflected color C(λ) = S(λ)·ρ(λ) (illuminant spectrum S × spectral reflectance ρ, per wavelength)

▸2.1.4 The BRDF and the look of materials⧉

`fig-brdf` · light bouncing off a surface — the BRDF (incoming/outgoing dirs, normal, reflection lobe, θ angles) 🟨

`fig-diffuse-specular-spheres` · Lambertian (matte) sphere vs shiny (specular) sphere — same shape/light, different BRDF 🟨

`fig-global-illumination` · color bleeding: a red wall casts red onto a white floor (multiple bounces) 🟨

• the **BRDF** (bidirectional reflectance distribution function): a surface re-emits incoming light into a *distribution* of outgoing directions — radiance toward q = f_r(p→q; n, λ) · incoming; *bidirectional* = depends on both the incoming and outgoing directions. [BRDF diagram]

• **diffuse (Lambertian) vs specular (shiny)**: a matte surface scatters into a broad lobe and looks equally bright from any view (the light it *receives* falls off as cos θ); a shiny one concentrates near the mirror direction, giving a view-dependent highlight; most materials are a blend. [Lambertian sphere vs shiny sphere]

• **side bar — where a material's colour actually comes from.** "Colour" is not one mechanism; the same red can be made several physically different ways, and they behave differently (some shift with viewing angle, some don't), which is why this belongs with the BRDF:

• **absorption (pigments & dyes)** — the everyday case: molecules/electrons absorb certain wavelengths (the **resonance** picture from *Light and physics*) and what's left is the colour. Paint, ink, skin, leaves (chlorophyll), most natural surfaces. **Angle-independent** and the basis of the *illumination × reflectance* model above.

• **structural colour (interference & diffraction)** — colour from **micro-geometry**, not pigment: **thin-film interference** (soap bubbles, oil slicks, anodized metal, the blue of a Morpho butterfly, peacock feathers), **diffraction gratings** (CD/DVD, holographic foil, some beetles), and **photonic crystals** (opal). The hallmark is **iridescence** — the hue **shifts with viewing/lighting angle** — because it comes from path-length differences, so it is strongly **directional** and a genuinely *spectral* BRDF effect, not a single albedo.

• **scattering** — wavelength-dependent **Rayleigh/Mie/Tyndall** scattering makes colour without any pigment: the **blue sky** and blue eyes, the **Tyndall blue** of some bird feathers and milk, the red of a sunset (cross-ref the Rayleigh sidebar in *Reflection, refraction*).

• **metallic reflection** — in metals the **free electrons** set the reflectance; most metals are grey/white (aluminium, silver), but **gold** and **copper** absorb in the blue and so reflect warm — a coloured *specular* with a tinted highlight (unlike dielectrics, whose highlight stays the light's colour). Matters for shading models (metal vs dielectric is the key axis in PBR).

• **emission & fluorescence** — the surface **makes** light: incandescence/LEDs (emit), and **fluorescence/phosphorescence** which absorb short λ and re-emit longer (the albedo->100% and "fluorescence breaks the BRDF" sidebars).

• takeaway: pigment colour is a per-wavelength **albedo** (the simple model), but interference, scattering, and metallic colour are **angle- and spectrum-dependent**, which is precisely why a faithful material model is a *spectral, directional* BRDF rather than a single RGB reflectance. [PBR: Pharr, Jakob & Humphreys]

• **light–matter interaction also polarizes the light**: reflection (especially the **specular** component) partially **polarizes** what bounces off a surface — maximally near **Brewster's angle** — so a full account of the BRDF is *polarization-dependent* (a **polarization BRDF / pBRDF**). This is exactly what a **polarizing filter** exploits to cut glare and reflections (→ Photography 101 — Lens filters) and what **shape-from-polarization** and reflection-removal methods use (→ Advanced). [cross-ref the polarization sidebar in *Reflection*]

• **side bar — why good sunglasses are polarizers.** Glare off roads, water, snow, and car hoods is light that reflected off a roughly **horizontal** surface, so (near Brewster's angle) it is strongly **horizontally polarized**. Quality sunglasses are **linear polarizers oriented to pass vertical and block horizontal** light, so they kill that glare specifically — not just dim everything like a plain neutral-density tint. That's the difference between "good" (polarized) and cheap (just dark) sunglasses: you see *into* the water and the wet road instead of at the sheen. The same physics is the camera's circular/linear polarizer, and the same selective extinction is shape-from-polarization's signal. (One catch: polarized glasses can make some **LCD screens / phone displays** go dark at certain angles, because those screens emit polarized light.)

• **try it yourself (how to test your sunglasses).** Hold the glasses up to a glary reflection — a car hood, a puddle, a window, a lake — and **rotate them ~90°**. If the glare **darkens in one orientation and comes back in the other**, they are genuine **polarizers**; if nothing changes, they are just tinted. (That is exactly what `fig-polarizer-sunglasses` shows. A second quick test: look at an LCD screen and tilt your head — polarized lenses go dark at some angles.)

• hint of **global illumination** (light bounces many times between surfaces) — and it is not just a rendering nicety, it is something the photographer actively **exploits** and occasionally **fights**:

• **handy — reflectors and bounce flash**: because light keeps bouncing, you can shape it by adding bounce surfaces. A **reflector** (a white/silver/gold panel) catches the key light and **bounces fill** into the shadows; a **bounce flash** fires at a ceiling or wall so the *whole room* becomes a big, soft secondary source instead of a hard on-axis point. Both are global illumination put to work — softening contrast and wrapping light around the subject (→ Photography 101 — fill & bounce flash; [[#Human factors and the art of photography]]; computational bounce flash in Computational illumination). [photo: someone holding a reflector to bounce fill onto a subject — `fig-reflector-fill`, **shoot our own** unless a CC image turns up]

• **a challenge — unwanted bounced color**: the same bounced light can **hurt** you, because it carries the *color* of whatever it bounced off (color bleeding — the very thing the `fig-global-illumination` figure shows). The classic case is shooting at a **pool**: the strong **blue** light bouncing up off the water casts a cold tint on faces that makes decent **skin tones** genuinely hard to hold (a green lawn or a red wall does the same in its own hue). Knowing the bounce is colored is half the fix — block it, add neutral fill, or correct it later.

• **side bar — fluorescence breaks the BRDF.** A BRDF is **per-wavelength**: same wavelength in, same wavelength out, just redistributed in direction. **Fluorescence** violates this — it **absorbs short-λ (often UV) light and re-emits at *longer* wavelengths** (the **Stokes shift**), moving energy *between* wavelengths. So a fluorescent material is not described by a scalar BRDF per λ but by a **reradiation (bispectral) matrix** — input wavelength → output wavelength. (This is also the cheat behind apparent albedo > 100% — cross-ref, don't re-derive, the albedo sidebar in [[#The color of objects: illumination times reflectance]].) Everywhere once you look: **optical brighteners**, **highlighter** ink, **minerals under UV**, **GFP** in bio-imaging, **banknote security inks** — and it is the working signal in fluorescence **microscopy** and **forensics**.

• **side bar — subsurface scattering (why skin isn't plastic).** A BRDF assumes light leaves at the *same point* it entered. In **translucent** materials light instead **enters, scatters around beneath the surface, and exits somewhere else**, so a point-wise BRDF is **insufficient** — you need the **BSSRDF** (bidirectional *surface-scattering* RDF), which couples an entry point to a separate exit point. Examples: **skin**, **marble**, **milk**, **wax**, **leaves**. This is what gives skin its soft, "lit-from-within" glow and rounded shadow terminator; render skin with a pure BRDF and it looks **plastic**. (Cross-ref translucency / global illumination in rendering.)

equations

BRDF — outgoing radiance toward q = f_r(p→q

n, λ) · incoming

▸2.1.5 Illuminants⧉

`fig-illuminant-spectra` · blackbody / illuminant spectra (D65, tungsten, fluorescent) 🟨

`fig-color-temperature` · the (inverted) color-temperature scale: warm (low K) → cool (high K) 🟨

• 'white' light is not one thing — the same scene at noon vs sunset shifts color

• **blackbody** & **color temperature** (low K warm/orange, high K cool/blue); **D65**, **tungsten** (warm), **fluorescent** (spiky → poor color rendition)

• insist again: the stimulus is the **spectral product** (illuminant × reflectance) → motivates white balance

equations

blackbody spectrum set by temperature → color temperature (K)

▸2.1.6 Radiometry: radiance, irradiance, exposure, and falloff⧉

`fig-radiance-irradiance` · radiance (per direction) vs irradiance (per area) 🟨

`fig-cosine-irradiance` · why irradiance carries a cosine: a beam of width w vs its surface footprint L = w/cosθ 🟨

`fig-seasons` · sidebar — seasons: N. America in summer vs winter (sun angle → energy per area, cosine law) 🟨

`fig-inverse-square-vs-radiance` · point source 1/r² vs distance-invariant surface radiance 🟨

• **radiance** L: power per unit area per unit solid angle carried *along a ray* — ultimately what a pixel measures; **conserved along a ray** in vacuum (a wall doesn't get dimmer as you step back, only smaller)

• **irradiance** E: power per unit area *arriving on a surface* (the sensor) — the integral of incoming radiance over the collecting solid angle, which grows with **aperture area**

• the **cos θ** in E = ∫ L cos θ dω is **projected area**: a beam (cylinder of light) of width w striking a surface at angle θ spreads over a larger footprint w/cos θ, so the same power is thinned → E ∝ cos θ (Lambert's cosine law). [cosine / projected-area figure]

• **sidebar — seasons**: the same cosine explains seasons — the Earth's 23.5° tilt makes summer sun strike North America near-vertically (concentrated) while winter sun arrives at a grazing angle (spread out, ~half the energy per area). [seasons figure]

• **units — radiometric vs photometric** (worth pinning down): the quantities above are **radiometric** (raw energy, in **watts**). Weight them by the eye's spectral sensitivity — the **luminous-efficiency curve $V(\lambda)$** — and you get the **photometric** quantities the photographic world actually quotes:

• radiant flux **W** ↔ luminous flux **lumen (lm)**

• irradiance **W/m²** ↔ **illuminance lux** (lm/m²) — light *arriving on* a surface (an incident-meter reading)

• radiance **W/m²/sr** ↔ **luminance cd/m²** (the "**nit**") — light *along a ray* / leaving a surface; this is the perceptual "brightness" of a patch, and what displays, HDR, and the photopic/scotopic scale (Perception) are all quoted in

• intensity **W/sr** ↔ luminous intensity **candela (cd)** — the SI **base** unit (≈ one candle)

• so **cd/m² = lm/(m²·sr)**; for scale, a monitor is ~100–300 cd/m², an HDR display ~1000+, paper in sunlight ~10⁴, and the photopic/scotopic ranges (Perception) live on this same axis

• **exposure** H = irradiance × time → ties straight back to aperture (D/f)² and shutter t (Exposure settings)

• **falloff**: a *point* source dims as **1/r²** (inverse-square), but scene **radiance is distance-invariant** — contrast the two carefully, it confuses people

• this radiometry underlies **HDR** (recovering scene radiance) and **computational illumination**

• (the off-axis **cos⁴** edge falloff / natural vignetting is deferred to **Sensors**, where the pixel irradiance integral and a finite aperture make it meaningful)

equations

radiance L (power per area per solid angle, **conserved along a ray** in vacuum)

irradiance E = ∫ L cos θ dω (power per area on the sensor)

exposure H = E·t

inverse-square E ∝ 1/r² (point source)

**photometric** counterparts via the luminous-efficiency curve V(λ): lumen, lux, candela, **cd/m² (nit)**

▸2.1.7 Wave effects, diffraction, and the diffraction limit⧉

`fig-diffraction-aperture-size` · two apertures: a smaller one diffracts more (spread θ ≈ λ/D) 🟨

`fig-airy-disk` · Airy disk = diffraction-limited PSF 🟨

`fig-aperture-sharpness-sweet-spot` · aberration vs diffraction → sharpness sweet spot 🟨

• light is a **wave**: at edges and small apertures it **diffracts** — bends and spreads instead of travelling in straight rays

• consequence for imaging: a **finite aperture cannot focus to a perfect point**; even an ideal lens images a point as a small blur, the **Airy disk**, whose size grows with wavelength × f-number

• the **diffraction limit**: the finest resolvable detail is set by wavelength and aperture (Rayleigh ≈ 1.22 λ/D); a **larger** aperture diffracts **less**

• so **stopping down too far softens** the image — aberrations dominate wide open, diffraction dominates stopped down → a sharpness **sweet spot** (developed with MTF in the Optics part)

• **coherent vs incoherent**: we mostly assume **incoherent** imaging (intensities add, stay positive, linear); coherent / wave optics returns for advanced optics (metalenses, lensless, holography)

• **a live demo — the wave equation, solved**: diffraction through a slit is shown as an actual 2-D wave simulation (FDTD), wavefronts bending and spreading past the opening, with the time-averaged intensity pattern alongside — and a companion widget where a *smaller aperture visibly spreads more* (θ ≈ λ/D). [`fig-diffraction-wave-sim` (animated), `fig-diffraction-aperture-size`]

• **the audio parallel** (use it — sound is the everyday 1-D wave): every wave effect here has a **sound** counterpart, and most people have *heard* them. Diffraction — **bass bends around a doorway** (long λ) while treble stays directional, exactly as a small aperture spreads light; plus **interference** and **beats**, **standing waves** in a pipe or room, and the **Doppler** shift. The governing math is shared (a wave equation, Fourier content, λ/D spreading), so acoustics is a reliable intuition pump for the optics — and the same FDTD solver in `fig-diffraction-wave-sim` *is* an acoustic simulation with light's numbers.

• side bar — **diffraction and duality, revisited with walkers**: diffraction is the *wave* story (a slit spreads light by θ ≈ λ/D); the *photon* story is that **individual** photons still arrive one at a time, yet pile up into the same fringe pattern. The **walking-droplet** analog (see [[#Light: rays, waves, and the spectrum]]) dramatizes the marriage: a single bouncing drop passing a slit is steered by its own **pilot wave** into a **diffraction-like** distribution — so you can literally *watch* a "particle" guided by a wave. [reuse the Couder/Bush walker video → [[Video Resources]]]

equations

Airy-disk angular radius θ ≈ 1.22 λ/D

blur-spot diameter ≈ 2.44 λ·N (N = f-number)

Rayleigh resolution criterion

▸2.2 Pinhole image formation and linear perspective⧉

`fig-camera-obscura` · camera obscura / pinhole (artist's) 🟨

`fig-camera-obscura-apparatus` · camera obscura apparatus (Diderot engraving) 🟨

`fig-bare-sensor-averaging` · a bare sensor averages all rays 🟨

`fig-pinhole-fov` · pinhole geometry: real (inverted) vs virtual (upright) image plane, and focal length → field of view 🟨

`fig-perspective-projection` · perspective projection equation: x'=f·X/Z, y'=f·Y/Z (divide by depth, scale by f) 🟨

`fig-perspective-projection-3d` · perspective projection in 3D: P=(X,Y,Z) → p=(x,y) on the image plane through the pinhole 🟨

`fig-perspective-vanishing-points` · perspective projection + vanishing points 🟨

⬜ figure not yet created

focal length vs subject distance (background compression) fig-focal-length-compression

`fig-face-distortion-sim` · live face-distortion simulator (web edition): focal length + magnification + eccentricity controls with a full-frame view and a face-cropped view, showing the close-wide "big nose" perspective and the wide-angle edge stretch at constant face size; subject = a face, a grid sphere, or one of six diverse Meshy humans framed on the face; static fallback is a screenshot. *Human 3D models generated with Meshy AI.* ✅

`fig-dolly-zoom-sim` · live dolly-zoom simulator (web edition): hold the subject size while focal length and camera distance trade off so only perspective / background scale changes; log focal (12–200 mm) + distance sliders, realistic Poly Haven environments, subject = head bust or one of six diverse Meshy humans. *Human 3D models generated with Meshy AI.* ✅

`fig-keystoning-cause` · keystoning is the tilt, not the lens — a camera tilted up makes world-parallel verticals converge (façade → trapezoid), while the fronto-parallel shot keeps them parallel; projection preserves lines, not parallelism 🟨

`fig-point-line-duality` · point↔line duality in homogeneous 2D: two points → a line, two lines → a point (same cross product) 🟨

`fig-projection-decomposition` · projection-matrix decomposition K·[R matplotlib, 3-stage pipeline

`fig-crop-focal-length` · cropping = changing focal length: full frame vs a crop box ≡ a longer-focal-length capture, upsampled 🟨

`fig-depth-vs-ray-length` · depth (z coordinate) vs ray length ‖P‖ from camera to a 3D point; unproject (pixel + depth → 3D) 🟨

• sensor alone would capture a diffuse light. We need to select which rays from which part of the scene contribute to which pixel.

equations

pinhole projection x = f·X/Z, y = f·Y/Z

homogeneous p ≈ K·[R|t]·P

field of view = 2·arctan(sensor / 2f)

▸2.2.1 Pinhole imaging and the perspective projection⧉

• pinhole,

• show camera obscure example (some artist)

• **geometry of pinhole imaging**: each scene point's chief ray passes through the pinhole onto the sensor. The physical sensor sits *behind* the pinhole, so the image is **inverted** (upside-down and left-right flipped).

• we instead use a **virtual image plane the same distance in front** of the pinhole — *not physical*, but it keeps the image **right-side up**, so we adopt it throughout (consistent with the camera looking down **+z** → [[Conventions, Notation & Style#Geometry & coordinate systems]]).

• the pinhole→sensor distance is the **focal length f**: a short f gives a **wide** field of view, a long f a **narrow** (telephoto) one — FOV = 2·arctan(sensor / 2f).

• **side bar — DIY pinhole.** A pinhole needs no lens, so it is the easiest real camera to build: punch a clean tiny hole in foil and tape it over the lens mount of a camera body, or over one end of a lightproof box (a shoebox / oatmeal tin) with film or paper at the other — a working **camera obscura**. Smaller holes are sharper but dimmer (diffraction sets the floor), so exposures run long.

• **side bar — the pinhole bag** (credit the idea to **Freeman, Torralba & Isola**'s online vision book, their Figure 5.5, at <https://visionbook.mit.edu/imaging.html>): pull a bag with a single small pinhole over your head, let your eyes dark-adapt, and you see the **upside-down image of the bright world outside projected on the inside of the bag** — pinhole projection made personal, and an unforgettable demo that the inverted image is real.

• exercise: make your own, do in a room

• mention animal eyes

• linear perspective when camera is at origin and along z axis

• **perspective projection equation**: a scene point (X, Y, Z) projects to screen coordinates **x' = f·X/Z** and **y' = f·Y/Z** — each screen coordinate is the world coordinate **divided by the depth Z and multiplied by the focal length f** (the "divide by z" that makes distant things small). [shown both as 2D similar-triangles and in a 3D view]

• 💡 **Big lesson — with the camera at the origin looking down +z, perspective is *mostly a division by depth*.** In that canonical frame the whole of linear (pinhole) perspective is just $x' = f\,X/Z,\ y' = f\,Y/Z$: divide each world coordinate by the depth $Z$, then scale by $f$. Everything else a camera does — sitting somewhere else, pointing some other way, landing on pixels — is bolted on *around* this one step. The **divide-by-depth is the whole of perspective**; the rest is bookkeeping. → register [[Big Lessons]]

▸2.2.2 Homogeneous coordinates⧉

• **homogeneous coordinates**: write a 2D point as (x, y, 1) up to scale, so projection, **translation**, and camera motion all become **matrix multiplies**

• **translation becomes a matrix, too.** A shift by $(t_x, t_y)$ is *not* linear on $(x, y)$, but on the homogeneous triple it is an ordinary matrix multiply: $\begin{bmatrix}1&0&t_x\\0&1&t_y\\0&0&1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix} = \begin{bmatrix}x+t_x\\y+t_y\\1\end{bmatrix}$, and in 3D the $4\times4$ analogue $\begin{bmatrix}I_3&\mathbf{t}\\\mathbf{0}&1\end{bmatrix}$ translates $(X, Y, Z, 1)$. This is *the* reason graphics and vision live in homogeneous coordinates — translation, rotation, projection, and the perspective divide all compose into one chain of matrix multiplies. [translation-matrix in homogeneous coordinates]

▸2.2.3 Translating and rotating the camera⧉

• so far the camera has sat at the **origin looking down +z**, where perspective is the clean divide-by-depth above. But a real camera is **somewhere else, pointing some other way** — so before we can use that canonical projection we must re-express the scene **in the camera's own frame**. This is the step that turns "an arbitrary camera" into "the camera at the origin."

• 💡 **moving the camera = applying the *inverse* transform to the world.** Sliding the camera right by $d$ is indistinguishable from sliding the whole **world left** by $d$; rotating the camera one way is rotating the world the other. So we never move the camera *inside* the projection — we move the **world into the camera's frame** by the inverse rigid transform, then project with the fixed origin/+z pinhole. (This is also why "camera-to-world" and "world-to-camera" matrices are inverses — the convention trap below.)

• that world→camera step is a **rotation R then a translation t**, written as one $4\times4$ matrix in homogeneous coordinates, $\begin{bmatrix}R&\mathbf{t}\\\mathbf{0}&1\end{bmatrix}$, carrying a world point $(X, Y, Z, 1)$ into the camera frame — where the canonical divide-by-depth then applies. Translating the camera is changing $\mathbf{t}$; aiming it is changing $R$.

• 💡 **Big lesson — perspective for an *arbitrary* camera is one $4\times4$ matrix followed by a divide by the last coordinate.** Pose the camera (world→camera $4\times4$), project, and map to pixels — every stage is a matrix on the homogeneous point, so the *whole* geometric pipeline composes into **one matrix multiply**, and the only nonlinearity — the perspective divide — *is* the final "divide through by the last coordinate." The arbitrary-camera case is the origin case (divide by depth) wrapped in two more matrices; homogeneous coordinates are exactly what make that wrapping elegant. → register [[Big Lessons]]

▸2.2.4 Intrinsics, extrinsics, and what cropping really does⧉

• moving camera, matrices, all that good stuff

• **camera intrinsics K** — give them a real treatment, not a bullet: reason first in **normalized image coordinates** (project onto the z=1 plane, independent of sensor size), then map to **pixels** with a second, **affine** matrix carrying the focal lengths fx, fy and the **principal point** (the arbitrary image centre). So the full projection **decomposes** as *(pixel affine) · (project at z=1) · [R|t]* = K·[R|t]. This split is also why **cropping an image = using a longer focal length** — you change only the pixel mapping, not the projection. [projection-matrix decomposition figure; crop = focal-length figure]

• 💡 **Big lesson — cropping a photo *is* changing the focal length.** From the projection geometry, cropping touches only the last stage, the pixel mapping $K$: keep a smaller patch of the normalized image and rescale it to fill the frame — exactly what enlarging $f_x, f_y$ does. A 2× crop and a 2× zoom produce the *same* picture (the crop just has fewer pixels). Scaled up to whole sensors, this same fact is the **crop factor** that relates sensor sizes. → register [[Big Lessons]]

• **camera extrinsics [R|t]** — where intrinsics are *inside* the camera, extrinsics are the camera's **pose in the world**: a **rigid 6-DOF** world→camera transform, a rotation **R** (3×3, 3 DOF — orientation) plus a translation **t** (3 DOF — position). Together they place the world into the camera's frame *before* projection, so the **full projection is intrinsics · extrinsics = K·[R|t]** (the [R|t] moves the world into camera coordinates; K projects and maps to pixels). "Moving the camera" is just changing R and t.

• **the convention trap — cam2world vs. world2cam** (the one place to slow down): the matrix [R|t] above maps **world → camera**; its inverse maps **camera → world**. Worse, packages disagree on whether **t is the camera's position** (its center C in world coordinates) or its **negation** in the projection (t = −R·C). Mixing the two silently flips signs / mirrors a reconstruction — always pin down *which direction* a stored pose maps and *what t means* before trusting it.

• this is exactly the machinery that **multiple-view geometry** and **panorama homographies** rely on — relating two cameras is composing their extrinsics; a pure rotation between views collapses [R|t] into a **homography** (→ [[Multiple view geometry]]; Pano & homographies).

▸2.2.5 What perspective preserves, and what it destroys⧉

• vanishing points

• **what projection preserves vs destroys**: straight **lines → lines** and incidence survive; **angles and lengths do not** (parallels converge to vanishing points, a circle can become an ellipse) — the core fact behind perspective

▸2.2.6 Photography with focal length: framing, magnification, and compression⧉

• Photography, focal length, field of view, various trigomometric ideas, magnification. Difference focal length vs distance and how that changes the background

• the same focal-length-vs-distance distinction governs **portraiture**: facial proportions (and even perceived personality) are set by **camera-to-subject distance**, which is *why* portrait lenses (~85–135 mm) flatter — they let you back up (Perona 2007; full treatment in [[#Lenses and focal length]]).

• sensor size and impact on perspective. Maybe too late?

• sidebar: the (weird) meaning of **"35 mm"** = a **24×36 mm** image frame. The "35 mm" is the *width of the film stock* (including the sprocket-hole margins), not the image; Barnack ran 35 mm cine film sideways, giving a 36 mm × 24 mm frame. Today this size is called **"full frame"**, and smaller sensors are quoted by their **crop factor** relative to it (APS-C ≈ 1.5×, Micro 4/3 ≈ 2×, phone sensors much smaller).

▸2.2.7 Depth, ray length, and unprojection⧉

• **depth vs. ray length**: a depth sensor's "depth" is the **z coordinate**, *not* the ray length ‖·‖ to the point — keep them distinct; **unproject** (pixel + depth → 3D point) inverts the projection (RealSense & friends do this live). Depth sensors exist (→ ToF / structured light / stereo). [depth-vs-ray-length / unproject figure]

▸2.3 Lens image formation⧉

• chapter intro: from the **pinhole** (sharp but dim) to a **lens** (bright but with one focus plane) — refraction at curved surfaces, the thin-lens linearization, and the depth-of-field that the finite aperture forces on us

▸2.3.1 Single lenses: refraction and Snell's law⧉

`fig-single-lens-refraction` · single lens, two interfaces, Snell at each → focus 🟨

`fig-thick-lens-snell` · thick lens: Snell's law at both interfaces, with the incidence/refraction angles and indices of refraction shown 🟨

`fig-fermat-equal-path` · the Fermat / equal-optical-path view of a lens: several rays object→focus, all the same optical path (glass detour offsets the slower speed) 🟨

`fig-thick-lens-wave-sim` · live 2-D FDTD wave sim of a **thick biconvex lens** focusing: a point source's diverging wave is reshaped by the slower glass into a converging one that meets at an image point; focus tracks 1/f=(n−1)2/R and 1/v=1/f−1/u (Fundamentals → Lens image formation, Fermat/equal-path) ✅

• a real lens is a piece of glass bounded by **two curved surfaces** (two interfaces)

• light bends at each interface by **refraction** — Snell's law: n₁ sin θ₁ = n₂ sin θ₂

• shown explicitly on a **thick lens** (the two interfaces drawn far apart): the angle of incidence and the angle of refraction at *each* surface, with the indices n₁ (air) and n₂ (glass) marked. [thick-lens Snell figure]

• a curved interface bends a bundle of rays; the two surfaces of a single lens together steer parallel light toward a **focal point**

• **a second view — Fermat / equal optical path length**: a lens focuses because **every** ray from the object to the focus has the *same optical path length* — the longer geometric detour through the thick centre is exactly compensated by the slower speed (higher index) of glass there, so all paths arrive **in phase** and interfere constructively. The ray (Snell) picture and this wave (Fermat) picture are two takes on the one fact. [equal-path-length figure]

• 🖱️ **Interactive (web edition):** a live **2-D wave simulation of a thick lens** — a point source on the left radiates circular waves that cross a biconvex glass slab (slower inside, higher index n); because the glass is thickest on the axis it retards the centre of each wavefront most, turning the diverging wave into a **converging** one that collapses to an **image point**. This is the Fermat/wave picture made literal: focusing emerges from the wave equation, not from a ray rule. Sliders for the surface curvature R, the index n, and the source distance u, with the focus tracking the **lensmaker's equation** $1/f=(n-1)\,2/R$ and the **imaging equation** $1/v=1/f-1/u$ live on the figure. The lens sits in an opaque mount (aperture stop) so all light is forced through the glass; absorbing (sponge) borders, so nothing reflects off the window edges. Companion to `fig-refraction-wave-sim` (single interface) and `fig-diffraction-wave-sim`. [`fig-thick-lens-wave-sim`; interactive, FDTD]

• **lensmaker's equation** ties focal length to the two radii of curvature and the glass index: 1/f = (n−1)(1/R₁ − 1/R₂)

• this already assumes small angles (paraxial); at large angles the rays do **not** meet at one point → **aberrations** (forward ref: Optics part, where compound lenses correct them)

• eyes have optics too (cornea + lens are the two main refracting elements)

equations

Snell's law n₁ sin θ₁ = n₂ sin θ₂

lensmaker 1/f = (n−1)(1/R₁ − 1/R₂)

▸2.3.2 Thin lens optics⧉

`fig-thin-lens-rays` · thin-lens ray diagram (focus, focal point) 🟨

`fig-thin-lens-conjugates` · conjugates: object at ∞ / far / close / at focal length → where the image forms 🟨

• the **thin lens** as a *simplification / linearization* of the single lens: collapse the two interfaces into one plane, assume small angles and negligible thickness → all bending happens once, at the lens plane (forward ref: Optics part for the accurate thick / compound model)

• precisely, **paraxial = first-order optics**: replace **sin θ ≈ θ** (keep only the linear term). Keeping the *next* term — sin θ ≈ θ − θ³/6 (**third-order optics**) — is exactly what produces the **Seidel aberrations**, so "linearization" and "no aberrations" are the *same* approximation (forward ref: Optics part)

• three-ray construction (parallel→focus, through-centre, focus→parallel) locates the image

• **focus equation** 1/f = 1/u + 1/v (object distance u, image distance v); an object at infinity focuses at f; **focusing** = changing v (move the lens or the sensor) to image nearer/farther objects

• **conjugates**: each object distance pairs with one image distance — object at ∞ → image at f; far → just beyond f; close → far behind; at the focal point F → rays exit parallel (image at ∞). [conjugates figure]

• **symmetry & reversibility**: the focus equation $1/f = 1/s_o + 1/s_i$ is **symmetric** in object and image distance, so object and image are **conjugate** — **swap them and the rays retrace the same path** (the reversibility of light paths). Object and image planes are **conjugate planes**, and the magnification follows directly ($m = -s_i/s_o$, so swapping inverts it). This object↔image **reciprocity** recurs throughout optics (e.g. why a lens images either way, and why entrance/exit pupils are conjugates).

• **magnification** m = −v/u (the image is inverted)

• **lensmaker's equation** (thin-lens form) 1/f = (n−1)(1/R₁ − 1/R₂) — connects f to the glass shape (from the single-lens section)

• **f-number** N = f/D sets the aperture (→ exposure, → blur)

• (depth of field gets its **own section**, next)

equations

thin lens / focus 1/f = 1/u + 1/v

magnification m = −v/u

lensmaker 1/f = (n−1)(1/R₁ − 1/R₂)

f-number N = f/D

▸2.3.3 Depth of field⧉

`fig-circle-of-confusion` · finite aperture & circle of confusion ✅

`fig-defocus-vs-parameters` · defocus blur (CoC) vs aperture and object distance 🟨

`fig-depth-of-field` · near/far limits where blur reaches c (linearized DoF) 🟨

`fig-dof-derivation-near` · near (front) limit by similar triangles — all quantities 🟨

`fig-dof-derivation-far` · far (back) limit by similar triangles — all quantities 🟨

`fig-dof-vs-parameters` · DoF vs aperture, focal length, focus distance 🟨

`fig-hyperfocal` · hyperfocal: focus at H → sharp from H/2 to ∞ 🟨

`fig-dof-same-framing` · same framing + same f-number → same object-space DoF (short vs long focal length); background blur still looks larger with the long lens 🟨

`fig-dof-double-cone` · object-space DoF as a double cone in front of the lens, waist on the focus plane (the blur for one scene point) 🟨

`fig-dof-depth-dependent` · defocus is NOT a convolution: blur radius depends on scene depth (not pixel location), and occlusion at depth edges 🟨

`fig-dof-background-blur` · DoF focal-length invariance is only first-order: background blur c vs background distance for 35/85/200 mm (matched framing & N) coincide near the subject and fan apart far away; + c vs focal length showing growth/saturation ✅

• **defocus / circle of confusion**: a point off the focus plane images to a point that is not on the sensor, so the aperture cone meets the sensor as a blur disk — the **circle of confusion**. A point still counts as sharp while its disk stays within an acceptable c. [CoC figure]

• the blur disk grows with the **aperture** (wider → bigger disk) and with the **object's distance from the focus plane** (zero exactly at focus); focal length and focus distance enter too. [defocus-vs-parameters figure]

• **depth of field** = the *range* of object distances whose blur disk stays within c; the **near** and **far** limits are where the blur just reaches c. [DoF near/far figure — each cone carried out to the CoC at the sensor]

• **deriving the limits**: on the image side a near-limit object images *behind* the sensor and a far-limit object *in front* of it; **similar triangles** relate the aperture D, that convergence distance, and the circle of confusion c — and together with the conjugate relations (1/f = 1/s + 1/v) they fix the near (front) and far (back) object-distance limits. [near-limit and far-limit similar-triangle figures]

• **hyperfocal distance** H = f²/(N·c): focus there and everything from **H/2 to infinity** is acceptably sharp — the focus setting that maximizes DoF (focusing at ∞ instead wastes the H/2…H zone). [hyperfocal figure]

• DoF **grows** with a smaller aperture (larger N), a shorter focal length, and a greater focus distance — these are the paraxial (linearized) formulas; the accurate model is in the Optics part. [DoF-vs-parameters figure]

• **the surprising invariance** (the lecture's flagged "important conclusion"): for a **fixed framing** (same subject size) and a **fixed f-number**, the depth of field in **object space** is essentially **independent of focal length** — switching 28 mm → 100 mm forces you to step back, and the larger physical aperture cancels the longer focal length. So "long lenses have shallow DoF" is really about **magnification**, not focal length per se. (The *background* still looks more blurred with the long lens — that's magnification of the out-of-focus disk, not a DoF change.) [same-framing/same-N → same DoF figure]

• ⚠️ **but the invariance is only *first-order* — the background gives the focal length away.** "Same framing + same f-number → same DoF" is the **small-defocus (paraxial)** result: it holds for the **near/far acceptable-sharpness limits straddling the subject**, but **not** for how hard a *distant* background is thrown out of focus. **Derivation** (thin lens; blur-disk diameter on the sensor): a point at distance $s_b$, with the lens focused at $s_f$, images to a circle of confusion $c \approx \dfrac{f^2}{N}\,\dfrac{|s_b-s_f|}{s_b\,s_f}$. **Fix the framing** — subject magnification $m=f/s_f$ held constant, so $s_f=f/m$ — and **fix $N$**, with the background a gap $\Delta=s_b-s_f$ behind the subject:

$$c(f)=\frac{m^2\Delta/N}{\,1+ m\Delta/f\,}\ \xrightarrow[f\to\infty]{}\ \frac{m^2\Delta}{N}.$$

The subject-zone DoF is ~independent of $f$ (the leading $f$ cancels — the **first-order invariance**), but the **background blur grows monotonically with focal length**, saturating at $m^2\Delta/N$. Two shots matched in framing *and* aperture render the **subject** identically, yet a **200 mm melts a far background far more** than a 35 mm: the correction term is order $m\Delta/f$ — negligible for a background near the focus plane (invariance looks exact), but dominant once the background is many focus-distances away. This is exactly why portrait shooters reach for long lenses for "background separation" even though the DoF on the *face* is unchanged. **Full derivation + plots** of $c$ vs. background distance for 35 / 85 / 200 mm (fixed framing & $N$) — the curves coincide near the focus plane and fan apart far from it. [`fig-dof-background-blur`]

• **object-space picture & a caveat**: the simplest view is a **double cone** in front of the lens whose waist sits on the focus plane — it gives the blur for **one** scene point. But **defocus is *not* a spatially-invariant convolution**: the circle of confusion depends on each point's **scene depth** (not its pixel location), and at depth discontinuities **occlusion** matters — which is why naive "just blur the image" fakes look wrong, and why light-field / depth-aware methods (Advanced) do better. [object-space double cone; depth-dependent-blur + occlusion figure]

• **sensor size scales DoF**: shrink the sensor (keeping FOV) and both the circle of confusion and the aperture shrink, so DoF grows **linearly** with sensor size — why phone cameras have huge DoF (and resort to *computational* background blur), and why **macro is easier on small sensors** (a scaled-down camera photographs a scaled-down world — modulo diffraction)

• 🖱️ **Interactive (web edition):** a live **macro depth-of-field simulator** — a to-scale thin-lens diagram, a lens-supersampled 3D photo (real bokeh), and a circle-of-confusion-vs-distance plot all driven by one shared optics model (focal length, f-number, sensor, focus distance, subject–background gap). A 3D bug (beetle / bee / ladybug) over a flower makes the macro lesson vivid: focus lands on the front of the face and the rest of the body, antennae, and background all fall out of focus. [fig-depth-of-field-sim; interactive] *(bug & flower 3D models generated with Meshy AI — acknowledge in the caption)*

equations

acceptable blur = circle of confusion c

defocus blur diameter b = D·|v − v′|/v′

hyperfocal H = f²/(N·c)

near limit D_n = s(H−f)/(H+s−2f)

far limit D_f = s(H−f)/(H−s)

total DoF = D_f − D_n

▸2.4 Image measurements as integrals⧉

• chapter intro: the unifying idea behind sensing — **measurement = integrating a slice of the plenoptic function** — then the physical sensor that does it (photosites, CCD/CMOS) and the **noise** that limits it

• sidebar — **photography as objective measurement**: the medium's long arc bends toward objective measurement (film/print = a *performance*; → measurement). The **digital sensor** is the far end: because each photosite *counts photons*, a **raw** image is ≈ a **calibrated radiometric measurement** — a physical map of scene radiance, in real units, of one snapshot of time — not just a rendering. This is what makes HDR / depth / deblur possible: the pixel numbers must *mean* something physical. Ties to the intro's **generation → measurement → generation** arc (→ [[01 Intro]]).

▸2.4.1 The plenoptic function⧉

`fig-contrast-af-curve` · contrast-detect AF: high-frequency contrast (sharpness) vs focus position, peaked at best focus; hill-climb arrows that must overshoot past the peak then step back (no direction cue) ✅

fig-pixel-measurement-integral · a pixel value = integral over aperture (thin lens ellipse) + finite pixel area (+ time, wavelength); the 4 domains → DoF, antialiasing, motion blur, color; linked to the plenoptic function 🟨

• (placed here, just before **sensors**, because it pays off as the framing for what a pixel actually measures)

• the **plenoptic function** describes *all the light filling space*: the radiance of every **ray**, parameterized by its 3D position, 2D direction, wavelength, and time — a 7-dimensional function [Adelson & Bergen]

• **any** imaging operation is the **measurement of some portion** of this function; the eye and the camera each sample a slice of it

• a conventional camera records only a **2D projection**: each pixel integrates all the rays reaching it, **discarding their direction** — information that richer ray-based / multi-aperture devices (Advanced part) can recover

• a recurring unifying view: imaging = sampling the plenoptic function

equations

plenoptic function P(x, y, z, θ, φ, λ, t) — a ray's 3D position, 2D direction, wavelength, and time (7D)

**add polarization** as an 8th axis (the field's oscillation orientation) — omitted classically, but physically present and informative (sky polarization, glare/reflections, shape-from-polarization

animals & polarization cameras sample it)

▸2.4.2 The pixel integral⧉

• **the chapter's thesis in one equation — a pixel value is an integral of the [[#The plenoptic function|plenoptic function]]** against a sampling kernel: each photosite *counts photons*, so it reports the **incoming radiance integrated over four physical dimensions**, weighted by the pixel's response. Schematically,

• pixel value ≈ ∫_time ∫_λ ∫_aperture ∫_(pixel area) L(x, y, θ, φ, λ, t) · r(λ) · (geometry) dA dω dλ dt

• i.e. **measurement = integrate the plenoptic function against the pixel's sampling kernel** — and the kernel's *support* in each dimension is exactly a photographic knob. [fig-pixel-integral: thin lens as an ellipse (aperture solid angle), the sensor, one finite pixel; the integral over aperture + pixel area]

• **each integration dimension ↔ a photographic parameter ↔ a downstream effect** (every later sensing topic is "what happens when you change the support of one axis"):

• **sensor footprint (x, y area)** — the pixel integrates radiance over its **finite area** on the sensor, which sets **resolution** and makes a sample a *box-filtered* average rather than a point sample → **sampling & anti-aliasing** (a bigger footprint blurs but suppresses aliasing; pre-filter vs. point-sample lives here).

• **aperture (solid angle of admitted rays)** — the pixel integrates over the **cone of directions** the lens admits, set by the **f-number** N = f/D. Wider aperture = more light *and* a wider integration cone = a shallower **depth of field** (out-of-focus points spread their energy across many pixels → the circle of confusion). The light-vs-blur trade is this single axis. (→ [[#Depth of field]])

• **exposure time** — the pixel integrates over the **shutter interval** t, which sets exposure (H = E·t) and, when the scene or camera moves during t, produces **motion blur** (the width of the temporal kernel). Light-vs-sharpness, the temporal sibling of aperture. (→ Motion / video)

• **wavelength × spectral sensitivity** — the pixel integrates radiance against its **spectral response** r(λ), collapsing a whole spectrum to one number (exactly like a cone, → [[#Color]]). A **color-filter array** gives each photosite a different r(λ) (R/G/B), so color is *spatially multiplexed* and must be reconstructed → **demosaicking** (Bayer).

• **why this framing pays off**: the rest of the sensing story is just *which kernel, how wide, and what it loses* — antialiasing widens the spatial kernel, depth of field is the aperture kernel, motion blur is the time kernel, demosaicking inverts the wavelength multiplexing. Each is also a **source of noise**: integrating fewer photons (small footprint, short time, narrow aperture) means a noisier estimate (→ Noise, next). Capture is **linear** in radiance, which is why this integral is well-defined and why **raw** values *mean* something physical.

▸2.5 Sensors: photosites, CCD vs CMOS⧉

`fig-sensor-microlens` · sensor cross-section: microlens → filter → photosite 🟨

`fig-rolling-shutter-pan` · a slanted, sheared vertical pole from an uncorrected fast pan vs a straightened, rectified version, illustrating line-by-line CMOS readout during camera motion 🟨

`fig-slowmo-axis` · four ways to treat the time axis on a shared timeline — normal capture (sparse), high-speed (dense true samples), interpolation (sparse reals with synthesized in-betweens), and long-exposure blur (the integral); only interpolation adds resolution after capture and can be wrong 🟨

`fig-mirrorless-anatomy` · labeled cutaway of a mirrorless full-frame camera (lens+mount, sensor/IBIS, mech+electronic shutter, EVF, processor/on-sensor PDAF, card+battery); SLR mirror-box inset for contrast ✅

`fig-rolling-shutter-skew` · rolling-shutter distortion — a global shutter keeping a vertical pole upright under a fast pan vs a rolling shutter where per-row readout $t(r)=t_\text{frame}+r\,t_\text{row}$ shears the pole and smears fan blades (the jello effect) 🟨

`fig-flash-sync` · flash & focal-plane-shutter sync: whole frame ≤ X-sync · slit/partial frame above it · high-speed-sync pulse train; fill-flash noted ✅

`fig-darkroom-dodge-burn` · the darkroom ancestor: dodging (hold back light) and burning (add light) under the enlarger as hand-painted spatially-varying exposure 🟨

▸2.5.1 Photon to number: the photosite⧉

• the **digital-sensor revolution came in two steps**: first **electronic** — turning light into an *analog electronic signal* (the photosite + readout; the era of the CCD and analog electronic / video imaging) — and only **then digital** — putting an **analog-to-digital converter (ADC)** on the path so that signal becomes **numbers** we can compute on. Computational photography lives on that second step; flag the distinction up front.

• a **sensor** is a 2-D array of **photosites** (pixels), each a potential well that **counts photons** → charge → voltage → a digital number; counting photons is what makes capture ~**linear** in radiance (keep this short — the deep sensor zoo is Advanced)

• a **microlens** over each photosite gathers light onto its active area (**fill factor** < 100% because of wiring), and a **color filter** sits under it → the **Bayer** mosaic (forward-ref: Demosaicking). [sensor cross-section figure]

• the scene-radiance → pixel-value mapping is the **camera response curve**: **raw** is ~linear in radiance, while the rendered (JPEG) image applies a non-linear response (≈ gamma + a tone curve — see Linear-vs-gamma in Color technology); **recovering** that response is exactly what HDR merging needs

▸2.5.2 CCD vs CMOS⧉

• **CCD vs CMOS**: a **CCD** shuttles charge across the chip in a "bucket brigade" to **one** amplifier at the edge (clean, uniform, but slow and power-hungry, and it reads the whole frame at once → naturally **global** shutter); **CMOS** puts an amplifier (and now the ADC + logic) **at every pixel** and reads in place, **row by row** (fast, cheap, low-power — now utterly dominant). [CCD bucket-brigade vs CMOS read-in-place schematic]

• what CMOS's per-pixel, row-by-row readout **unlocks** (and costs): on-chip ADC, region-of-interest and high-frame-rate readout, on-sensor HDR and PDAF and binning — but, because rows are read at different times, an **electronic rolling shutter** with its motion artifacts (below).

▸2.5.3 Time integration: the exposure (the time axis of the pixel integral)⧉

• the pixel value is an **integral over four dimensions** (aperture × pixel area × wavelength × **time**) — the [[#Image measurements as integrals]] chapter owns the first three; **this is the time one**. Over the **exposure time** Δt the well **accumulates** photon arrivals: value ∝ ∫ irradiance dt. [photons-accumulating-in-a-well figure]

• **exposure = irradiance × time** (the other half of the exposure triangle with aperture; ISO is gain *after*). **Reciprocity**: ½ the light for 2× the time gives the same exposure — until it doesn't (sensor reciprocity holds far better than film's reciprocity *failure*).

• the time integral is exactly **motion blur**: anything that moves during Δt is averaged along its path → a long exposure smears motion (light trails, silky water), a short one freezes it. The deliberate use and the *removal* of this integral are [[Motion blur and temporal sampling]] and deblurring.

• **what ends the integral is the shutter** — which is the rest of this chapter.

▸2.5.4 Mechanical vs electronic shutter⧉

• **mechanical shutter** — a physical curtain (or leaf) opens then closes to time the exposure. A focal-plane shutter is **two curtains**: at short times the second starts closing before the first fully opens, so a **slit** sweeps the frame (already a kind of rolling exposure, in hardware). A **leaf shutter** (in the lens) opens from the centre → exposes the whole frame at once (global) and **syncs flash at any speed**.

• **electronic shutter** — no moving parts: the sensor **resets** each photosite to start its integration and **reads it out** to end it. Silent, vibration-free, and capable of very short or very high-frame-rate exposures — but on CMOS it is usually **rolling** (below). Many cameras use an **electronic first curtain** (electronic start, mechanical end) to cut shutter shock while keeping a clean finish.

▸2.5.5 Global, rolling, and readout shutter⧉

• **global shutter**: every pixel integrates over the **same** time window (all start and stop together), then the frame is read out. No motion distortion. Native to CCD and to special global-shutter CMOS; costs transistors/area per pixel.

• **rolling shutter**: rows are exposed and **read out one after another**, so each row's time window is **shifted** slightly later than the one above. Cheap and standard on CMOS — but rows see the scene at different instants. [row-vs-time diagram: global = a rectangle in (row, time); rolling = a slanted parallelogram]

• the **artifacts** (all from "different rows, different times"): **skew** (a fast pan or a moving car leans over), **wobble / "jello"** (handheld video shimmer), the **spinning-propeller / guitar-string** surreal shapes, and **partial exposure with flash** (a brief flash lights only the rows open at that instant → a **bright band**, which is why flash sync speed exists). [📺 [rolling shutter explained](https://www.youtube.com/watch?v=dNVtMmLlnoE) — Smarter Every Day; `fig-rolling-shutter` interactive: a vertical rod / spinning blade under global vs rolling readout]

• **lighting & flicker banding**: mains-powered and many **LED/PWM** lights **flicker** (100/120 Hz, or fast PWM dimming); under a rolling shutter, rows captured at different phases of that flicker get different brightness → horizontal **banding** (and colour banding if RGB phosphors lag). Fixes: match the exposure to the flicker period, **anti-flicker** modes that time the readout, or a global shutter. Ties to video [[Video stabilization and rolling-shutter correction]].

• **correcting rolling shutter** computationally — estimate the per-row time offset and the camera/scene motion, then warp each row back to a common instant → [[Video stabilization and rolling-shutter correction]].

▸2.5.6 Modern sensor tricks (a taste; the zoo is Advanced)⧉

• **off-axis (cos⁴) falloff** still applies once there's a sensor + finite aperture: rays reaching the frame edge arrive obliquely → **cos⁴θ** dimming toward the corners (**natural vignetting**), distinct from optical vignetting (Optics).

• **dual conversion gain (DCG)** — a CMOS pixel can switch the **sense-node capacitance**, changing the conversion gain (µV per electron): **high-conversion-gain** mode (small capacitance → lower read noise, better low light / high ISO) vs **low-conversion-gain** mode (large capacitance → bigger full well, more highlight headroom). Two settings → two **native ISOs** (**dual native ISO**); reading/combining both extends dynamic range → [[Noise, signal-to-noise ratio and dynamic range]].

• **quad-Bayer (Tetracell / quad CFA)** — the CFA groups **2×2 same-color photosites** under one filter tile. Good light: the sensor **remosaics** back to full-resolution Bayer; low light: it **bins** each group into one larger effective pixel (less noise, ¼ resolution). How 48/50/108 MP phone sensors get clean low-light output and in-pixel staggered-exposure HDR. Variant: **3×3 Nonacell**. → [[Demosaicking]], [[HDR merging]].

• **dual-pixel and quad-pixel autofocus (on-chip lens, OCL)** — split a photosite into **2 (dual-pixel)** or **4 (quad-pixel, 2×2 OCL)** sub-photodiodes under **one microlens**; each sub-diode sees a different half/quadrant of the lens pupil → **phase-detection autofocus (PDAF) at every pixel** (a stereo baseline inside the pixel); sub-diodes are **summed** for the final image (no light loss). Sensor structure only — the optics/AF algorithm are in [[Focus, autofocus]].

▸2.6 Noise, signal-to-noise ratio and dynamic range⧉

⬜ figure not yet created

noisy image + flat-patch histogram (ISO 3200) fig-noise-histogram

⬜ figure not yet created

noise vs ISO fig-noise-vs-iso

`fig-noise-affine` · measured noise variance is affine in brightness (σ²≈gain·I+read²) — real per-pixel variance vs mean from a 50-frame aligned ISO-3200 burst, with the affine fit, read-noise floor, and highlight roll-off (Noise, SNR, dynamic range) ✅

`fig-noise-gaussian-pixels` · a single pixel's value across many frames is ≈ Gaussian — per-pixel histograms from the ISO-3200 burst with Gaussian overlays (Noise) ✅

`fig-noise-truncation` · noise clips at black/white so it is not zero-mean at the extremes — a near-black pixel pinned at 0 ~44% of frames, its average biased bright, vs an unbiased midtone (Noise; denoising trap) ✅

`fig-snr-vs-stddev` · noise std-dev map vs SNR map of a brightness ramp: std rises with brightness (∝√N), but SNR=√N is worst in the shadows 🟨

⬜ figure not yet created

**underexposure recovery — full-frame vs phone**: the same underexposed scene pushed up several stops in software, the full-frame frame cleaning up while the phone reveals amplified shadow noise / banding [fig-underexposure-recovery-ff-vs-phone fig-underexposure-recovery-ff-vs-phone

⬜ figure not yet created

photo] fig-results-montage

`fig-dynamic-range-comparison` · dynamic-range ladder in **stops** (horizontal bars): colour slide (~5–6) · reflective print (~6–7) · film negative (~12–13) · phone sensor (~10–12) · full-frame sensor (~14) · human eye instantaneous (~10–14) vs adapted (~20+) · an animal example · a typical sun-and-shadow scene (>20) — shows why no single capture holds a high-contrast scene (→ HDR) ✅

⬜ figure not yet created

an animal or two

• the noise **sources** (their variances add):

• **photon / shot noise** — the Poisson statistics of *counting photons*: variance = mean, so σ = √N. Fundamental (it's the light itself), dominates the midtones/highlights; averaging N independent frames cuts it by √N.

• **read noise** — added by the amplifier / ADC on readout; signal-independent, dominates the **shadows** and sets the noise floor (hence dynamic range).

• **thermal / dark current** — electrons freed by heat, independent of light; grows with exposure time and temperature (long exposures & astrophotography → sensor cooling, dark-frame subtraction).

• **fixed-pattern / pattern noise** — per-pixel gain/offset non-uniformity (PRNU/DSNU), hot/stuck pixels, banding; *structured*, so it reads as worse than random noise of equal magnitude, and is removed by calibration (flat / dark fields).

• 💡 **Big lesson — in a *linear* image, noise variance is an *affine* function of brightness.** The variances add: shot noise gives a term **proportional to the signal** (Poisson, variance = mean, scaled by the gain) and read noise a **constant** floor, so the total is **variance ≈ gain·signal + read²** — a straight line in brightness. You can *measure* it: take an aligned **burst** of a static scene and, per pixel, plot the variance across frames against the mean — the points trace that line, slope = photon gain, intercept = the read-noise floor. And a single pixel's value over many frames is **≈ Gaussian**, which is what licenses additive-Gaussian noise models. [`fig-noise-affine` — per-pixel variance vs brightness, measured from a 50-frame ISO-3200 burst; `fig-noise-gaussian-pixels` — per-pixel histograms] → register [[Big Lessons]]

• 💡 **Big lesson — noise is *clipped* at black and white, so near the extremes it is *not* zero-mean.** A recorded value is clamped to [0, max]: near black the negative half of the noise is cut off (frames pile up at 0), near white the positive half is. So at the extremes the noise distribution is **asymmetric** and its mean is pushed **inward**. The consequence for **denoising**: any naive average/smoothing returns that biased mean, so it makes **shadows come out too bright and highlights too dark** — a bias every denoiser has to correct for (carried forward to [[#Denoising]] in BASIC). [`fig-noise-truncation` — a near-black pixel pinned at 0 a third of the time, its average biased bright; from the ISO-3200 burst] → register [[Big Lessons]]

• **the key intuition — it's about ratios (SNR)**: shot noise is **Poisson**, so σ = √N grows with brightness — measured noise is actually **higher in the highlights**. Yet noise *looks* worst in the **shadows**, because perception cares about the **signal-to-noise ratio** SNR = N/√N = √N, which is worst where N is small. "Ratios are all that matters" — the same reason we encode in gamma/log, and the motivation for ETTR (below â Photography 101). [std-dev map vs SNR map figure]

• **dynamic range** = the brightest recordable signal (sensor **saturation / full-well**) ÷ the **noise floor** (read noise in the shadows); it sets how much of a high-contrast scene fits in a *single* exposure → the motivation for **HDR** (Multiple exposure)

• **HDR (high dynamic range)**, sometimes called **wide dynamic range (WDR)**, is a deliberately **fuzzy** term — flag this. It gets used for at least four different things: the *scene* (a high-contrast world), the *capture* (multi-exposure / a high-DR sensor), the *file/encoding* (float or ≥10-bit radiance, e.g. OpenEXR, HDR10), and the *display* (an HDR monitor) — and, loosely, for the **tone-mapped "HDR look."** There is no single sharp definition; say which sense you mean. (Developed in [[Multiple exposure imaging]].)

• 💡 **Big lesson (L15) — dynamic range is set by full-well capacity over the noise floor**: a single exposure records from a **top** (the photosite's **full-well capacity** — where it clips to white) down to a **bottom** (the **noise floor** — read noise in the shadows); their **ratio is the dynamic range** (quoted in stops). You **widen** it with a **bigger well** (larger photosites — why a full-frame sensor out-ranges a phone) or a **lower floor** (cooling, lower read noise), and you **beat** it altogether by merging **multiple exposures** (HDR). The capture-side companion to **L6**: what bounds a shot is this *range*, not the bit depth. → register [[Big Lessons]]

• **examples, in stops** (the figure): a color **slide** is narrow (~5–6 stops), a reflective **print** ~6–7, **film negative** wide (~12–13 latitude), a **phone** sensor ~10–12, a **full-frame** sensor ~14; the **human eye** is ~10–14 stops *instantaneously* but ~**20+** once **adaptation** is allowed — and some **animals** do better in their niche; an outdoor sun-and-shadow scene can exceed **20 stops**, which is why no single capture holds it [fig-dynamic-range-comparison]

• **see it — recover an underexposed shot, full-frame vs phone**: shoot the *same* underexposed scene as **raw** on a **full-frame** camera and on a **phone**, then **push the exposure up** in software (e.g. +3–4 stops). The full-frame frame cleans up (deep full-well + low read-noise floor → wide dynamic range), while the phone frame breaks down into **shadow noise and banding** as the lift amplifies its read-noise floor — a direct, visible demonstration that **dynamic range scales with photosite / sensor size**, and exactly why phones lean on **burst / HDR+** (Multiple exposure) rather than a single exposure. [fig-underexposure-recovery-ff-vs-phone; photo — paired Fredo full-frame DNG + phone raw] *(also a coding exercise — see [[Exercises & Experiments]])*

• 💡 **Big lesson (L6) — quantization is rarely the real problem**: with enough bits and a sane encoding, it's **noise and dynamic range** (this section) that bite — not the number of levels. The one exception is **gamma / banding in shadows**, which is exactly why gamma encoding exists (allocate codes perceptually); so "rarely" ≠ "never". Reminded again in BASIC → Image representation → *Float vs 8-bit*.

equations

shot noise Poisson (variance = mean), σ ∝ √N, SNR = √N

variances add (shot + read + thermal)

averaging N frames → noise /√N

PSNR = 10·log₁₀(MAX²/MSE)

dynamic range = full-well ÷ read-noise floor

▸2.7 Multiple view geometry⧉

• chapter intro: the leap from one image to **two** — what a second viewpoint buys you (depth), built on the stereo / disparity intuition, then the epipolar-geometry derivation of E/F, with only their estimation and the multi-view scaling-up forwarded

▸2.7.1 Two views: stereo and disparity⧉

⬜ figure not yet created

a stereo pair (anaglyph) fig-stereo-pair

`fig-disparity-geometry` · parallel-camera disparity geometry 🟨

⬜ figure not yet created

a disparity map fig-disparity-map

`fig-random-dot-stereogram` · a random-dot stereogram pair (Julesz): a shifted central patch → depth from disparity alone, no monocular cues 🟨

• two eyes / two cameras → **disparity** (nearer points shift more)

• **random-dot stereograms** (Julesz): two random-dot images identical except for a shifted patch fuse into a floating surface — proof that **disparity alone yields depth**, with no recognizable objects or monocular cues at all ("depth without objects") [random-dot stereogram figure]

• simple parallel-camera geometry: depth ∝ baseline · focal length / disparity

• sidebar — **widening your baseline**: since disparity ∝ baseline, gadgets that optically push your eyes farther apart exaggerate depth (**hyperstereo**) — Helmholtz's **telestereoscope** (mirrors, 1857), military **optical rangefinders** & the **scissors periscope** (*Scherenfernrohr*), vs **hypostereo** (shrunk baseline) for macro 3D

• disparity map; rectification (a homography) when the cameras aren't parallel

• ties to panorama stitching; forward ref: multi-view stereo & depth (Advanced)

equations

depth Z = f·B / d (baseline B, disparity d)

▸2.7.2 Two-view geometry: epipolar lines, and the essential and fundamental matrices⧉

`fig-epipolar-line` · the epipolar constraint: two views of a 3-D point, epipolar plane & line, the match constrained to a 1-D search ✅

• the key fact, in one picture: a point in one view does **not** pin its match to a point in the other — it constrains it to a **line**. Given a pixel in the left image, its 3-D point lies somewhere along a single ray; that ray projects, in the right image, to a line — the **epipolar line**. So matching is a **1-D search along a line**, not a 2-D search (this is exactly why we *rectify* stereo pairs — to make those lines horizontal). [fig-epipolar-line]

• **derive the matrix straight from that picture** — ray → plane → line → algebra:

• **(1) the ray (depth ambiguity)**: a pixel back-projects to a ray; in **calibrated** coords ($\hat{x}=K^{-1}x$) the 3-D point is $X=\lambda\hat{x}$, depth $\lambda$ unknown.

• **(2) the epipolar plane**: that ray together with the **second camera centre** spans a plane (equivalently: the plane through both camera centres and $X$ — independent of $\lambda$).

• **(3) the epipolar line**: the plane meets the second image in a line, so the match $\hat{x}'$ must lie on it.

• **(4) write the plane in 3-D**: with relative pose $(R,t)$ ($X\mapsto RX+t$; the first centre maps to $t$ = the **baseline** vector, the ray direction maps to $R\hat{x}$), the three vectors $\hat{x}'$, $t$, $R\hat{x}$ are **coplanar** → vanishing triple product $\hat{x}'\cdot(t\times R\hat{x})=0$ → $\hat{x}'^{\mathsf T}[t]_\times R\,\hat{x}=0$ → the **essential matrix** $E=[t]_\times R$.

• **(5) → the fundamental matrix**: in raw pixel coords, $x'^{\mathsf T}Fx=0$ with $F=K'^{-\mathsf T}[t]_\times R\,K^{-1}=K'^{-\mathsf T}EK^{-1}$; $Fx$ is the epipolar line. The relative pose $(R,t)$ (translation only up to scale) is recoverable from $E$ — the seed of multi-view reconstruction.

• **forward-ref** (deferred to Advanced/3-D part): *estimating* F/E from noisy matches (the **8-point algorithm**, RANSAC), rectification, triangulation, and **structure-from-motion / multi-view stereo**.

▸2.8 Human (and animal) vision and color⧉

▸2.8.1 Visual system anatomy:⧉

`fig-eye-cross-section` · eye cross-section 🟨

`fig-retina-layers` · retina layers & photoreceptor synapses 🟨

`fig-cone-rod-distribution` · cone/rod distribution & fovea 🟨

`fig-visual-pathway` · pathway retina→LGN→V1 🟨

`fig-photoreceptor-cell` · anatomy of a rod & cone cell 🟨

`fig-rods-cones-micrograph` · micrograph of rods & cones (primate retina) 🟨

• light, lens, pupil, retina,

• cones/rods, distribution, fovea — note there are **no S (blue) cones in the very centre of the fovea**, and far more L+M than S overall, so **blue is low-resolution and a bit out of focus** (chromatic aberration) — yet we perceive blue everywhere (the brain fills it in). Luminance ≈ the **L+M cone sum**, which is why **green carries most of the acuity** (and why Bayer uses 2× green).

• cells, ganglions, on-retina spatial processing, visual cortex, V1, higher.

• **the eye as a camera — and its refractive errors.** Like any lens system the eye can mis-focus: **myopia** (eyeball too long / cornea too curved → image forms *in front* of the retina), **hyperopia** (too short → *behind*), **astigmatism** (a non-spherical cornea → two focal lines, not one point), and **presbyopia** (the aging lens stiffens and loses **accommodation**). Each is corrected by a lens quoted in **diopters** (D = 1/focal-length in metres): a negative/diverging lens for myopia, positive/converging for hyperopia, a **cylindrical** lens (with an axis) for astigmatism. [`fig-refractive-errors`]

• **how the prescription is *measured* (the machines), and why it's just inverse imaging.** **(1) Retinoscopy** — shine a moving streak of light into the eye and watch the **retinal reflex** sweep across the pupil; its direction and speed reveal myopia vs hyperopia, and trial lenses are stacked until the reflex stops moving ("neutralization"). **(2) Autorefractor** — automates this objectively: it projects an **infrared** target onto the retina and analyses the **returning image** (its focus/size, or a Hartmann–Shack / Scheimpflug pattern) to solve for the corrective lens that makes the retinal image sharp — essentially a little camera running autofocus *on your eye*, in seconds. **(3) Phoropter** — the familiar "better one… or two?" wheel of trial lenses over an eye chart: a **subjective** refinement of the machine's objective estimate. **(4) Wavefront aberrometer (Shack–Hartmann)** — for the full optical picture: a **lenslet array** samples the wavefront leaving the eye and measures **higher-order aberrations** (coma, spherical, trefoil), not just sphere/cylinder — the same wavefront sensing used in **adaptive optics** and in planning **LASIK**. The throughline: measuring an eye is **inverse imaging** — find the optics that would make the retinal image (or the returning wavefront) ideal. (Cross-ref wavefront / Shack–Hartmann sensing and adaptive optics in [[Optics, lenses, and aberration correction]].)

▸2.8.2 Color⧉

`fig-cone-sensitivities` · cone spectral sensitivities (L/M/S) 🟨

`fig-spectrum-to-three-responses` · spectrum → 3 responses (projection) 🟨

`fig-metamers` · metamers (two spectra, one color) 🟨

`fig-ishihara` · Ishihara plate (“74”) 🟨

`fig-cone-mosaic` · trichromatic cone mosaic (L/M/S, fovea) 🟨

`fig-cone-response-matrix` · cone response as a 3×N matrix–vector product r = C·E (discretized λ); note the continuous / infinite-D limit r_k = ∫ c_k(λ) E(λ) dλ 🟨

`fig-nonorthogonal-dual-basis` · non-orthogonal cone basis (two vectors in the positive quadrant) and its dual: synthesis vs analysis; the dual (analysis) basis has negative coordinates 🟨

`fig-opsin-tree` · the opsin gene tree (schematic): ancestral opsin → RH1 rod + cone classes (SWS1/SWS2/RH2/LWS); the human S/M/L branch highlighted, with the recent primate L/M gene duplication marked ✅

• cones, projection from spectrum to 3 responses

• relate to measurement equation in image formation

• **the response is a matrix–vector product**: discretize the spectrum into N wavelength bins → a vector E ∈ ℝᴺ; stack the three cone sensitivities as the **rows of a 3×N matrix C** (one row per L/M/S); the cone responses are then **r = C·E**, i.e. each r_k = Σ_i c_k(λ_i)·E(λ_i). In the real world wavelength is **continuous**, so it's really **infinite-dimensional** — the sum becomes the integral r_k = ∫ c_k(λ)·E(λ) dλ. [cone-response matrix figure]

• insist a given cone does not make the difference between wavelength. Example of monochromatic stimuli

• point out they are non orthogonal and lots of stuff can’t be negative.

• This make linear algebra more messy

• 💡 **Big lesson:** **color is non-orthogonal *and* non-negative** — overlapping cone axes plus the no-negative-light constraint are what make color algebra tricky (analysis needs a dual basis with negative coordinates).

• side bar: **non-orthogonal bases & the dual basis** — the cone "axes" are **not orthogonal** (L and M overlap heavily), so the comfortable orthonormal intuition fails. Picture two basis vectors **c₁, c₂ both in the positive quadrant** of the (canonical) spectrum space, each standing for one cone. **Synthesis** is easy — any point is a combination a₁c₁ + a₂c₂ (a parallelogram). But **analysis** — recovering the coordinates a₁, a₂ — is **not** a plain projection onto c₁, c₂; you must project onto the **dual (reciprocal) basis** c₁\*, c₂\* (defined by cᵢ\*·cⱼ = δᵢⱼ). For a non-orthogonal positive basis the **dual vectors point partly into the negative quadrants** — they have **negative coordinates** in the original space. That is exactly why "stuff can't be negative" makes color algebra messy: the natural *analysis* vectors are negative. (Only for an **orthonormal** basis do a basis and its dual coincide — then analysis = synthesis = projection.) [non-orthogonal dual-basis figure]

• color blindness. Linear algebra perspective (projection loses info)

• we are all color blind

• side bar — **animal vision and the divergence of opsins**: "color" is just whatever set of **opsins** (photopigments) an animal carries, and these have **diverged enormously** across major animal groups — many insects are UV–blue–green trichromats, birds and reptiles are often **tetrachromats** (a fourth, UV cone), most mammals reverted to **dichromacy** (primate trichromacy is *re-evolved*, via a gene duplication), and the **mantis shrimp** carries ~12–16 photoreceptor classes. So human red–green color blindness is one point in a vast space — every species is "color blind" relative to some other. The opsin family tree and its divergence are traced in the "Evolution of Eyes" chapter of *Vision* (Cambridge); the schematic opsin gene tree (ancestral opsin → rod RH1 + cone classes, human S/M/L with the recent L/M duplication) is [fig-opsin-tree]. [→ Animal eyes]

• the **multistage color model** — the brain re-codes color in stages (simplified; see Reinhard et al., *Color Imaging* for the full version):

• **stage 1 — cones**: three heavily-overlapping LMS responses (above)

• **stage 2 — opponent recoding** in the retina / LGN: LMS → **light–dark, red–green, blue–yellow** (decorrelates the channels, matches the four unique hues; → next section)

• **stage 3 — appearance level**: a higher organization closer to **hue / saturation / brightness** (an HSV-like arrangement) where context, adaptation and memory act (→ color appearance models, Color technology)

• **focal colors & naming**: color *names* cluster around shared **focal colors** across languages (Berlin & Kay; the World Color Survey) — evidence the recoding is perceptual, not arbitrary

• experimental support: **psychophysics** (hue-cancellation — Hurvich & Jameson) and **physiology** (opponent cells — De Valois)

equations

cone response r_k = ∫ E(λ)·c_k(λ) dλ, k ∈ {L,M,S} (a projection)

metamerism = same r_k for different E(λ)

▸2.8.3 So what are the primary colors?⧉

• the perennial muddle (RGB? RYB? CMY?) has *two separate* causes, neither of which is a fact about light:

• **dual-process / opponent coding** (next section): the visual system re-wires the three cone signals into **red–green, blue–yellow, light–dark**, so *red, green, blue, and yellow* all feel "primary" perceptually — there are **four** psychological unique hues, not three. This is why RYB feels natural to painters and why the cones' actual L/M/S don't match anyone's intuition of "primaries."

• the rest is **additive vs subtractive synthesis** (Color technology): **RGB** are the *additive* (light) primaries, **CMY** the *subtractive* (ink/filter) primaries; grade-school **RYB** is a folk version of subtractive mixing. So "primary" conflates a **perception** fact with a **technology** choice — different goals, different "primaries."

• source: lecture — the opponent rewiring "is one of the many causes of all the confusion about what's a primary color" (transcript 10/2); pair with [[Glossary]] entry for *primary*.

▸2.8.4 Opponent process and the multistage model⧉

`fig-opponent-channels` · opponent channels (R–G, B–Y, light–dark) 🟨

• two stages: trichromatic at the cones, then opponent in the retina / LGN — red–green, blue–yellow, light–dark

• explains afterimages and color naming; and why **luma/chroma** (YCbCr — luma Y′ on gamma-encoded RGB, see *Linear vs Gamma*) compresses well (see JPEG)

equations

opponent = linear map LMS → (L−M, S−(L+M), L+M) (YCbCr-like)

▸2.8.5 Light adaptation⧉

`fig-spanish-castle` · afterimage / Spanish-Castle illusion ✅

`fig-photopic-scotopic-range` · the photopic / mesopic / scotopic regimes on a horizontal log-luminance axis (cd/m², ~10⁻⁶→10⁸): rod vs cone activity bands, the absolute threshold, the Purkinje shift, and landmarks (starlight, moonlight, indoor, overcast, sunlight, the sun); the eye's full range vs the few-log instantaneous range (→ adaptation, HDR) ✅

• we recenter the neural response around the current average brightness — spanning a huge range over *time*, not all at once

• mechanisms: neural, chemical (photopigment bleaching), and the pupil

• afterimages; the Spanish-Castle illusion

• **the operating regimes — photopic, mesopic, scotopic** (which receptors are working, and *how bright* that is, in **luminance** cd/m²):

• **photopic** (daylight, **cones**, full color, sharp foveal vision): roughly **> 3 cd/m²** — office light ~10²; an overcast sky ~10³–10⁴; a sunlit scene ~10⁴–10⁵

• **mesopic** (dusk / dim interior, **rods *and* cones together**): roughly **10⁻³ – 3 cd/m²** — colors desaturate and the **Purkinje shift** sets in (as rods take over, blues look relatively brighter and reds darker)

• **scotopic** (night, **rods only** — no color, poor acuity, and *off-fovea* since the fovea is all cones): below ~**10⁻³ cd/m²** — moonlit landscape ~10⁻²; starlight ~10⁻³–10⁻⁴; the **absolute threshold** of vision ~**10⁻⁶ cd/m²**

• big picture: the eye works across a staggering **~10⁻⁶ to ~10⁸ cd/m²** (≈ 14 log units), but only over a few log units **at once** — which is exactly why **adaptation** (above) and, in cameras, **HDR / tone mapping** exist. [fig-photopic-scotopic-range]

equations

Naka–Rushton response R = Iⁿ / (Iⁿ + I₀ⁿ) (saturating)

▸2.8.6 Lightness constancy⧉

`fig-checker-shadow` · Adelson's checker-shadow illusion — the original artwork: the illusion (squares A/B identical luminance) beside the equal-grey-bars proof ✅

• the visual system reports a surface's intrinsic **lightness** (its reflectance / albedo), **not** the raw luminance reaching the eye — a white page in shadow still looks white, a black one in sunlight still looks black, even when the shadowed white sends *less* light to the eye than the sunlit black

• it **discounts the illuminant**: the brain factors the retinal image into **illumination × reflectance** (the same entanglement from Light and physics) and reports the *reflectance* — the property we actually care about

• this is an **ill-posed inversion** (one image, two unknowns), so the brain leans on cues and priors: **edges and context** tell it where the *illumination* changes versus where the *surface* changes

• the canonical demo: **Adelson's checker-shadow illusion** — two squares of identical luminance look very different because the brain infers the cast shadow and discounts it

• **Retinex** (Land & McCann) models the same idea computationally: treat slow spatial gradients as illumination and sharp edges as reflectance changes, then integrate the edges back to lightness

• ties to **intrinsic images** (split a photo into reflectance + shading) and to the darkroom craft of **dodging & burning** / tone mapping

▸2.8.7 Color constancy⧉

`fig-white-balance-real` · white balance computed from scratch (von Kries per-channel gain in linear light): a warm-casted photo corrected by gray-world vs white-patch/max-RGB, with the recovered channel gains (PS1) ✅

⬜ figure not yet created

the same surface under two illuminants reading the same color fig-wb-before-after

⬜ figure not yet created

"the dress" — identical pixels read blue-black or white-gold depending on the illuminant the viewer assumes fig-the-dress

• the **color analog** of lightness constancy: we perceive an object's intrinsic **surface color** (its reflectance) despite large changes in the **illuminant** — a white shirt reads white under noon daylight, a warm lamp, or blue shade, even though the spectra reaching the eye differ a lot

• again the brain **discounts the illuminant**: it estimates the light's color and divides it out so the surface reads consistently — the same ill-posed (**illuminant × reflectance**) inversion, solved with assumptions (e.g. the average scene is roughly gray, or the brightest patch is white)

• a key mechanism is **von Kries adaptation**: an *independent gain on each cone type* (L, M, S) rescales the response to the prevailing light — plus spatial / contextual cues; it is **not perfect** (some cast remains, and memory colors bias us)

• this is the **biological counterpart of white balance** (Color technology) — a camera's automatic white balance must do, by algorithm, what our visual system does for free

• classics: **Retinex** (Land & McCann); **Bayesian** color constancy (Freeman & Brainard); gray-world / bright-pixel heuristics

• the **"the dress"** demo (2015): the *same* image pixels look **blue-black** to some viewers and **white-gold** to others — because the brain must *guess the illuminant* (cool daylight vs warm indoor light) and divides it out differently. A perfect, viral illustration that color constancy is an **ambiguous inference**, not a measurement — the visual system commits to a prior and different people commit differently. Ties to white balance (a camera faces the exact same ill-posed guess).

▸2.8.8 Contrast⧉

`fig-simultaneous-contrast` · simultaneous-contrast patch 🟨

`fig-contrast-ratios` · contrast-as-ratios strip 🟨

• contrast is about *ratios*: 1→2 looks like 100→200 (discounts the multiplicative effect of light) — foreshadows gamma / log encoding

• Weber's law / just-noticeable difference

• simultaneous contrast; the Adelson checker-shadow illusion

equations

Weber's law ΔI/I ≈ const

Michelson contrast = (I_max − I_min)/(I_max + I_min)

▸2.8.9 Spatial vision⧉

`fig-campbell-robson` · Campbell–Robson CSF chart 🟨

`fig-csf-chromatic` · three contrast-sensitivity functions: luminance (band-pass) vs red–green and blue–yellow (low-pass, lower cutoff) — we resolve colour more coarsely than brightness 🟨

`fig-blur-chroma-luma` · blur only the chroma (Lab) and the image still looks sharp; blur the luminance and it falls apart — the basis for chroma subsampling ✅

• Visual Acuity

• Simultaneous contrast

• Mach bands; lateral inhibition (center-surround receptive fields)

• **CSF** — measured by **psychophysics** (show **sine-wave gratings**, ask "do you see a grating?"); the **contrast sensitivity function** is band-pass in spatial frequency. Crucially the **chromatic** CSFs (red–green, blue–yellow) cut off at **far lower** spatial frequency than the **luminance** CSF — we resolve **color** much more coarsely than **brightness**. [three-CSF figure]

• the dramatic demo: in Lab, **blur the chroma** channels and the image still looks sharp; blur the **luminance** and it falls apart — the empirical basis for **chroma subsampling** (JPEG / YCbCr) and a confirmation of the opponent model. [blur-chroma vs blur-luminance figure]

▸2.8.10 Temporal vision⧉

• the visual system trades some spatial resolution for **temporal** resolution: a **temporal CSF** band-passes flicker, with a **critical flicker-fusion** frequency (~50–60 Hz in bright light, lower when dim) above which a flickering source looks steady — why displays/film refresh fast enough to fuse, and why old CRTs / fluorescents could flicker in peripheral vision

• **apparent motion** (the phi phenomenon) + persistence of vision make a sequence of frames read as continuous movement (→ Motion/Video)

• **temporal adaptation** (light/dark adaptation over seconds–minutes; → Light adaptation) and the eye's own motion (saccades, microsaccades) shape what we see over time

• exploited by computation: **frame-rate** choices, flicker-free capture, and **video magnification** (amplifying tiny temporal changes — Motion/Video)

▸2.8.11 Attention and eye movements⧉

⬜ figure not yet created

a fixation scanpath over an image fig-scanpath

• attention; fixations; saccades; smooth pursuit

• the eye samples a scene through rapid saccades + a high-res fovea

• saliency (what draws the eye) is modeled computationally → see [[Human factors#Computational models of perception]] (the model and its uses) — the ML estimator is in [[Computational tools#Machine learning]]

▸2.8.12 Perception of color and images⧉

`fig-hunt-effect` · colorfulness vs luminance (Hunt) 🟨

`fig-bezold-brucke-abney` · Bezold–Brücke / Abney demos ✅

• color reproduction kind of stuff

• blue shift

• colorfulness as a function of luminance (Hunt effect)

• hue shifts: Bezold–Brücke (with intensity), Abney effect (with added white)

▸2.8.13 Animal eyes⧉

`fig-eye-evolution` · the evolution of the eye, in cross-sections: flat photoreceptor patch (no image) → cup (directional) → pinhole (an image, no lens) → lensed eye — recapitulating sensor → camera-obscura → lens (Bonus: animal eyes) 🟨

• **the eye evolved as a sequence of small improvements**, and it recapitulates the optics of this book — from "no image" to a focused one. Following Nilsson's stages (Land & Nilsson, *Animal Eyes*):

• **a flat patch of photoreceptors** (an *eyespot*): just **photosites**, **no image formation** — it tells light from dark and, crudely, which side has *more* light (phototaxis), but cannot form a picture. This is the bare **sensor** with no optics.

• **a concavity / cup**: fold the patch into a **pit** and the rim's shading gives **directional** sensitivity — the deeper the cup, the better you can tell *where* light comes from.

• **a pinhole**: close the cup almost shut → a **pinhole eye**, a real (if dim) **image** with no lens at all (the **nautilus** still uses one) — exactly the camera obscura of [[#Pinhole image formation and linear perspective]].

• **a lens**: fill the aperture with a **refractive lens** (often a graded-index sphere) → a **bright, focused** image, independently evolved in fish, cephalopods, and vertebrates. Add a **cornea, iris, and accommodation** and you have the human eye.

• so the human eye is *one* solution; evolution reached the same goal by very different routes (below).

• **a tour of those other solutions** — a useful contrast to the human eye and to camera design:

• **compound eyes** (insects): many **ommatidia** → wide field of view and fast temporal response, but low spatial resolution

• **more cone types**: birds / reptiles are **tetrachromatic** (a 4th cone, into the UV); the **mantis shrimp** has ~12–16 photoreceptor classes (yet poor color *discrimination* — it seems to recognize rather than compare) — the **divergence of opsins** across animal groups is the color sidebar in [[#Color]]

• **tapetum lucidum** (cats, deer): a reflective layer behind the retina that boosts night vision (and causes eye-shine)

• **polarization vision** (cephalopods, many insects): they see light's **polarization** — a channel we're blind to

• **other eye designs**: **pinhole** (nautilus), **mirror eyes** (scallops — image formed by a concave mirror, not a lens), and the front-facing (**stereo**) vs side-facing (**field-of-view**) placement trade-off

• further reading: Land & Nilsson, *Animal Eyes*; the "Evolution of Eyes" chapter in *Vision* (Cambridge); a friendly overview at phos.co.uk, "The Evolution of Sight".

▸2.9 Color technology⧉

▸2.9.1 Analysis vs synthesis and non-orthogonality⧉

`fig-analysis-vs-synthesis` · analysis vs synthesis (sense→3 / reproduce←few) 🟨

• analysis = *sensing* color (project a spectrum onto 3 cone / sensor responses)

• synthesis = *reproducing* color (add a few primaries)

• both lose information: the projection is non-orthogonal and light can't be negative — a tension that recurs (sensing, displays, white balance)

▸2.9.2 Measuring color⧉

`fig-color-matching-setup` · color-matching experiment setup ✅

`fig-cie-cmfs` · CIE color-matching functions 🟨

`fig-chromaticity-diagram` · xy chromaticity + spectral locus + white point 🟨

• the color-matching experiment: match a test light by mixing 3 primaries

• trichromatic theory (von Helmholtz 1859); tristimulus values

• metamers: different spectra that match — *why* color reproduction is possible at all

• CIE 1931: color-matching functions, the xy chromaticity diagram, the spectral locus, the white point

• negative matches → why some colors need "impossible" primaries (see Reproducing color)

equations

tristimulus X = ∫ E(λ)·x̄(λ) dλ (and Y, Z)

chromaticity x = X/(X+Y+Z), y = Y/(X+Y+Z)

▸2.9.3 Linear vs gamma vs. log encoding⧉

`fig-gamma-curves` · gamma encode/decode curves 🟨

`fig-quantization-banding` · quantization banding (low-bit demo), worst in shadows ✅

⬜ figure not yet created

CRT power-law response (screen luminance vs grid voltage ≈ V^2.2)

`fig-slr-cross-section` · labelled cross-section of an SLR: 45° main reflex mirror sends light UP to the focusing screen + pentaprism → optical viewfinder; the semi-transparent main mirror + secondary (sub) mirror fold light DOWN to the phase-detect AF module in the mirror-box floor; focal-plane shutter + sensor/film behind; inset shows the mirror flipping up for exposure (companion to `fig-mirrorless-anatomy`) ✅

• different challenges: quantization, noise

• same observation: ratios matter

• **side bar — why gamma ≈ 2.2? the CRT accident**: old **CRT** displays had a *natural* power-law response — screen luminance ∝ (grid voltage)^γ, γ ≈ 2.2–2.5 (electron-gun physics: beam current vs control-grid voltage is a power law). So a CRT **automatically decodes** a gamma-encoded signal: send V = L^(1/γ) and the tube returns L = V^γ (linear light), no decode hardware. **sRGB's ≈2.2 codifies it.** Bonus: that curve ≈ the *inverse of human lightness perception* (Weber–Fechner), so it also puts code levels where the eye is sensitive — **why we kept gamma after CRTs vanished** (LCD/OLED emulate the curve for compatibility). [CRT-response figure]

• **side bar — even sRGB isn't quite continuous**: the ubiquitous sRGB transfer curve is **piecewise** — a short **linear** toe near black, $V = 12.92\,L$ for $L \le 0.0031308$, then a **power** segment $V = 1.055\,L^{1/2.4} - 0.055$ above it — meant to join smoothly at the knee. But the standardized constants (the $12.92$ slope, the $0.0031308$ breakpoint, the $0.055$ offset, the $1/2.4$ exponent) are **rounded and not mutually consistent**, so the two pieces **don't exactly meet**: at the breakpoint the linear part gives $\approx 0.04045$ while the power part gives $\approx 0.04059$ — a tiny **discontinuity** ($\sim 10^{-4}$), on top of the slope kink that's there by design. So even the encoding *every* image uses is, strictly, **not continuous** at the join. The lesson: real transfer functions are messier than the clean "$\gamma \approx 2.2$" story — **check the actual curve** (and its inverse) before doing arithmetic or quantizing in that space. *(A "corrected" sRGB nudges the breakpoint/scale so the pieces meet exactly.)*

• **side bar — gamma in film**: the word **gamma** comes from photography. Film's response is the **characteristic (Hurter–Driffield) curve** — **density** vs **log exposure** — with a toe (shadows), straight middle, and shoulder (highlights). Film **gamma = the slope of the straight portion** (γ = ΔD/Δlog H) = the medium's **contrast**, set by development (push-processing raises γ). So *film* gamma is a contrast/rendering parameter while *display* gamma is an encoding power law — both log–log slopes, different jobs. End-to-end **system gamma** (capture × display) is tuned slightly >1 (~1.1–1.5) so images look right in a **dim viewing surround**. [film characteristic-curve figure]

• 💡 **Big lesson:** **additive vs multiplicative → choice of encoding.** Light *adds* (incoherent sources, blur) but surfaces and contrast *multiply*; **log** makes the multiplicative native, **linear** suits the additive (deconvolution), **gamma** is the compromise with least crazy behaviour near zero.

• **log is becoming the standard for *video***: high-end and now prosumer cameras capture in **log** profiles — **S-Log / S-Log3** (Sony), **Log-C** (ARRI), **V-Log** (Panasonic), **C-Log** (Canon), **RED Log3G10** — and the **ACES** pipeline standardises a log working space (ACEScc / ACEScct). Log packs a wide scene dynamic range into limited bits while keeping the grade malleable: a **stop** becomes a roughly constant code increment, so exposure and contrast moves are uniform. The footage looks flat and desaturated straight out of camera and is **color-graded** to a display encoding; the broadcast-HDR transfer functions **HLG** and **PQ** (SMPTE ST 2084) are the display-side cousins. So in video the trio increasingly resolves to **log for capture, gamma / PQ / HLG for display**. (→ HDR, tone mapping.)

• **luma vs luminance** (a recurring confusion): **luminance** Y is the *linear*, perceptual brightness — a weighted sum of **linear** RGB (CIE Y); **luma** Y′ is the *same-shaped* weighted sum computed on the **gamma-encoded** RGB (the coding convenience behind YUV / YCbCr / JPEG). They differ because the weighting and the gamma don't commute.

• **our convention**: use **Rec. 601 weights — Y = 0.299 R + 0.587 G + 0.114 B** — matching video / JPEG / YCbCr and our programming exercises; mention **Rec. 709** (0.2126 / 0.7152 / 0.0722) as the HDTV / linear-light alternative.

equations

gamma encode V = L^(1/γ), decode L = V^γ (γ ≈ 2.2)

sRGB piecewise curve

**luma** Y′ = 0.299 R′ + 0.587 G′ + 0.114 B′ (Rec. 601, on *encoded* RGB) ≠ **luminance** Y (same-style weights on *linear* RGB

Rec. 709 alt 0.2126/0.7152/0.0722)

▸2.9.4 Linear (kind of) color spaces⧉

`fig-gamut-primaries` · sRGB vs wide-gamut primaries on xy 🟨

• **RGB and its flavours**: many RGB spaces differ only by their **primaries** (gamut) and **white point**, related by a single **3×3 matrix M** — sRGB / Rec. 709 (web, HDTV), **Adobe RGB** and **Display P3** (wider), **Rec. 2020** (UHD), **ProPhoto** (very wide, for editing). Converting between them (or to XYZ) is one matrix multiply *in linear light*.

• **opponent / luma-chroma spaces** (YCbCr, YUV, YCoCg): a fixed 3×3 map separating **luma** from two **chroma** axes — used for compression (chroma subsampling) and some editing.

• caveat: "linear" here means **a linear map between spaces**, not necessarily linear-*light* values — sRGB carries a gamma; XYZ is linear-light.

equations

RGB ↔ RGB and RGB → XYZ via a 3×3 matrix M

▸2.9.5 Non-linear space: CIELAB⧉

`fig-perceptual-uniformity` · a rainbow at N colours equally spaced in sRGB vs CIELAB; ΔE bars under each strip show sRGB steps are perceptually ragged, CIELAB steps even (Color technology — CIELAB) 🟩

`fig-cielab-ab-plane` · the a*b* plane 🟨

• **CIELAB** (and CIELUV): a deliberately **non-linear, perceptually-uniform** space — a cube-root compression of XYZ about a white point yields **L\*** (lightness) and **a\*, b\*** (opponent chroma), so **equal distances ≈ equal perceived differences**; color difference **ΔE = ‖ΔLab‖** (improved by ΔE2000)

• the motivating demo: sample a rainbow at **N colors equally spaced in sRGB vs in CIELAB** — the sRGB steps have wildly uneven perceived gaps (ΔE), the Lab steps are even [perceptual-uniformity figure]

• **why non-linear**: perception is roughly power-law (Weber–Fechner / Stevens), so a uniform space must warp XYZ — the same reason gamma encoding exists

• **friends**: HSV / HSL (handy for UI, *not* perceptually uniform), and modern uniform spaces — CAM02-UCS, **Oklab** — that fix CIELAB's known non-uniformities (the *Chong* perceptual color metric belongs here)

equations

CIELAB L* = 116·f(Y/Yₙ) − 16 (cube-root nonlinearity)

ΔE = ‖ΔLab‖

▸2.9.6 Reproducing color⧉

`fig-additive-subtractive` · additive vs subtractive synthesis 🟨

`fig-color-synthesis-bands` · additive vs subtractive synthesis, 3-band cartoon; subtractive is multiplicative (Color technology) 🟨

`fig-subtractive-spectra` · subtractive mixing as a per-wavelength multiplication: yellow & cyan filter transmittances and their product T_Y·T_C (= green) 🟨

`fig-gamut-mapping` · gamut + gamut mapping 🟨

`fig-repro-monochromatic` · reproducing a monochromatic signal (2D LA analogy) 🟨

• intro: fundamental challenge of spectrum non-orthogonality.

• Give example of reproducing a monochromatic signal.

• Show 2D linear algebra equivalent problem.

• Say that the big extra challenge is that we can only have positive light.

• Non-orthogonality + restriction to positive -> impossible problems.

• **additive vs subtractive** color synthesis — cartoon: split the spectrum into 3 equal bands and pretend they are pure **B, G, R** [3-band synthesis figure]

• **additive** (lights): start from black and *add* bands → R+G+B = white; pairs give yellow / cyan / magenta

• **subtractive** (filters/inks): start from white and each filter *removes* a band — but this is really a wavelength-by-wavelength **multiplication** of the spectrum, so "subtractive" is **misleading**; C·M·Y → black

• concrete proof it's a multiplication: a **yellow** filter (blocks blue) stacked with a **cyan** filter (blocks red) passes only the overlapping green band → **yellow × cyan = green**, not a "sum" of the two [subtractive spectra-multiplication figure: T_Y(λ)·T_C(λ)]

• the **overlapping-disks** picture makes the same point at a glance: RGB lights overlapping toward white (additive) vs CMY inks overlapping toward black (subtractive) [overlapping RGB/CMY disks figure]

• real systems are often **hybrid** — some light *added* and some *removed* — e.g. an **LCD projector**: a white lamp, LCD panels that *subtract* (modulate) each channel, then the three channels *added* on the screen

• various technology: CRT, LCD, projectors (DMD, LCD, laser)

• film, printer, halftoning

• gamut, gamut mapping

equations

additive mix = Σ wᵢ·Pᵢ (linear)

solve primary weights via 3×3 inverse

gamut = convex hull of primaries

▸2.9.7 Color management, ICC, and industry standards⧉

`fig-icc-workflow` · ICC workflow diagram 🟨

`fig-device-gamut-mismatch` · gamut mismatch across devices ✅

• the problem: **the same numbers look different on different devices** — each camera / monitor / printer has its own primaries, white point, gamut and transfer curve

• the solution — **ICC color management**: every device carries a **profile** mapping its values to/from a device-independent **profile-connection space** (XYZ or Lab); image → PCS → target device makes color *portable*

• **rendering intents** for out-of-gamut color: **perceptual** (compress the whole gamut, keep relations), **relative colorimetric** (clip + white-point match), **saturation**, **absolute colorimetric**

• **chromatic adaptation** between white points via a **Bradford 3×3** transform (the engineering cousin of von Kries adaptation, Perception)

• in practice: profiles embedded in files (→ File formats), monitor calibration, soft-proofing; **sRGB** as the safe default when no profile is present

equations

device ↔ profile-connection space (XYZ/Lab)

chromatic-adaptation transform (Bradford 3×3)

▸2.9.8 Sensing color: multiplexing strategies⧉

`fig-color-multiplexing` · four colour-sensing multiplexing strategies: temporal, spatial (Bayer), beam-split prism, depth (Foveon) 🟨

⬜ figure not yet created

temporal multiplexing artifact — Prokudin-Gorskii color fringes on moving water fig-prokudin-gorskii

• four ways to turn a monochrome sensor into a color one — **multiplex** the three (or more) measurements across some axis:

• **in time**: shoot R, G, B frames sequentially through filters — Maxwell's first color photograph (1861), **Prokudin-Gorskii**, flatbed scanners, astronomy filter wheels. Cheap and full-resolution, but **fails on motion** (the classic color fringes on moving water).

• **in space (the dominant choice)**: a **color filter array (CFA)** — the **Bayer mosaic** (2 green : 1 red : 1 blue) — one color per photosite, then **demosaick** to fill the rest; the eye does the same with its interleaved cone mosaic. Variants: Fuji's **X-Trans** (a larger, less-periodic tile to fight moiré) and **complementary CMY/CYGM** CFAs (more light through, messier color). Needs an **optical anti-aliasing (low-pass) filter** to tame high-frequency color artifacts, and suffers some channel **crosstalk**.

• **beam-splitter (prism)**: a **dichroic prism** splits the rays onto **three sensors** — **3-CCD / 3-chip**, the classic **broadcast video camera** — wasting no photons at full per-channel resolution, but bulky, costly, and hard to align.

• **in depth (stacked)**: stack wavelength-selective layers so one location senses all three — **Foveon** (silicon absorbs longer wavelengths deeper; used in **Sigma** cameras), echoing **Kodachrome**'s stacked dye layers. Full color at every pixel (**no demosaicking**), but trickier color separation and more chroma noise.

• **spectral / Lippmann**: record the standing-wave **interference** of the full spectrum (Lippmann, 1908 Nobel; a precursor to holography) — true spectral capture, not just 3 numbers.

• **hybrid**: combine these (e.g. spatial + temporal in video).

• this is *analysis* (sensing color) — the mirror of color *synthesis* (reproduction, above); the **demosaicking** algorithm itself is deferred to Basic Image Processing (which is why this is just the *sensing* half)

• **trichromatic (perceptual) vs spectral (physical) capture**: almost all imaging systems aim to capture **perceptual, trichromatic color** — three numbers that reproduce what a *human* would see (RGB ≈ the cones), so metamers are, by design, indistinguishable. **Multispectral / hyperspectral** imaging instead samples the **physical spectrum** in many narrow bands (tens to hundreds), keeping distinctions the eye throws away — for remote sensing, agriculture, art conservation, and machine vision, where the *material*, not the *appearance*, is what matters (and metamers must be told apart). Lippmann (above) is the limiting case: the full continuous spectrum.

▸2.9.9 Processing color: saturation and beyond⧉

`fig-point-op-saturation-vibrance` · uniform saturation (constant `s`, over-saturates already-vivid colours → clip) vs vibrance (per-pixel gain `g = 1+v(1−sat)·w_skin(hue)`, protects vivid colours & skin): input, saturation, vibrance + a gain-vs-saturation panel (Point operations → Basic colour enhancement, BASIC) 🟩

• **saturation**: scale the chroma — push colors away from the grey (luma) axis in an opponent / HSV space; a uniform scale over-saturates already-vivid colors and skin

• **vibrance**: a *smarter* saturation — boost **less-saturated** colors more and **protect** already-saturated ones and **skin tones** (a non-linear, hue-aware gain)

• **selective / HSL color** (Lightroom's mixer): per-hue-band hue/saturation/luminance, and **color grading** (shadows / mids / highlights tinting) from film & video

• the *mechanics* (curves, LUTs, pipeline placement) live in **Basic Image Processing** (point operations); here we own the **color-space** reasoning — which axis, which space, why protect skin

▸2.9.10 Converting to black and white⧉

`fig-bw-conversions` · one colour photo → black & white several ways: single channel (R/G/B) · average · weighted luminance · red-filter channel mix (dark sky) · an isoluminant pair collapsing to one grey — the choices differ (Converting to B&W, Color technology) ✅

• the key framing: "convert to black & white" is **under-specified** — you are projecting **three numbers to one**, an inherently **lossy, many-to-one** map, so there are *many* valid answers and the right one depends on **intent**, not correctness

• **pick a single channel** (R, G, or B): quick; occasionally useful (the **green** channel is close to luminance and usually least noisy; the **red** channel smooths skin and darkens skies) — but it throws away two-thirds of the data

• **average** $(R+G+B)/3$: simplest, but **perceptually wrong** — it makes blue as bright as green; skies and foliage come out muddy

• **weighted luminance** (the perceptual default): a weighted sum that matches the eye's sensitivity — **green dominates**, blue counts little — **Rec. 601** $0.299/0.587/0.114$ or **Rec. 709** $0.2126/0.7152/0.0722$; this is exactly the cone-weighting / luminance idea from [[Human (and animal) vision and color]]

• **but in which space?** the recurring **[[#Linear vs gamma vs. log encoding|luma vs luminance]]** distinction — the cheap version weights the **gamma-encoded** values (**luma** Y′, what most software does), the radiometrically correct version weights **linear** RGB (**luminance** Y) then re-encodes; they give visibly different greys (linear-light is darker in saturated reds/blues)

• **lightness, not luminance**: take **L\*** from **CIELAB** (perceptually uniform lightness) — or the UI-space shortcuts **HSL "lightness"** $(\max+\min)/2$ and **HSV "value"** $\max(R,G,B)$ — each a different grey again

• **custom channel mixer** (the artistic control): choose the weights yourself (keep $\sum w_i\approx 1$ to hold exposure) — this is Lightroom/Photoshop's **B&W mixer**, and the digital descendant of the darkroom **color filter on B&W film**: a **red** filter darkens a blue sky for dramatic clouds, a **yellow/green** filter lightens foliage, an **orange** filter smooths skin (Ansel Adams's black skies are a red filter on panchromatic film). So a B&W conversion is really a decision about *how each color should map to grey*

• **the hard case — isoluminant colors collapse**: two distinct hues with the *same* luminance (a red and a green of equal brightness) map to the **same grey**, erasing the edge between them; **contrast-preserving decolorization** methods (Color2Gray, Gooch et al. 2005; Grundland & Dodgson 2007) optimise the mapping to keep **color contrast** rather than only luminance — the principled answer to "the green apple and red apple shouldn't vanish into one grey"

equations

grey $= w_R R + w_G G + w_B B$ with $\sum w_i = 1$ (and *in which space* — luma on gamma RGB vs luminance on linear RGB)

▸2.9.11 Color appearance models⧉

`fig-color-dimensions` · the dimensions of color (hue/chroma/lightness) 🟨

`fig-color-solid` · a perceptual color solid ✅

• dimensions of color perception:hue, chroma, etc.

• perceptual color space

• HSV and other UI-spaces

• color appearance models

▸2.9.12 Skin tones⧉

`fig-skin-tone-vectorscope` · skin-tone locus on a vectorscope ✅

⬜ figure not yet created

diverse skin tones fig-diverse-skin-tones

⬜ figure not yet created

a Shirley card fig-shirley-card

• a *memory color*: viewers are unusually sensitive to skin looking right

• the skin-tone locus; cameras and film are tuned to render skin pleasingly

• diversity of skin tones and historical bias (e.g. Shirley cards) — a fairness tie-in (see ethics)

▸2.10 Photography and camera 101⧉

• chapter intro: having built the physics, we now sit behind a real camera — its **controls, modes, and guts** — and see how phones reach the same goal by a wholly different (computational) route

▸2.10.1 Basic photography: exposure settings — shutter, aperture, and ISO⧉

`fig-exposure-triangle` · exposure triangle 🟨

`fig-fstop-progression` · f-stop power-of-two progression 🟨 pilot rendered

⬜ figure not yet created

aperture vs depth of field fig-aperture-dof

⬜ figure not yet created

shutter vs motion blur fig-shutter-motion-blur

⬜ figure not yet created

metering & the 18% gray assumption (snow scene → underexposed without +EV) fig-metering-18-gray

`fig-ettr-histogram` · expose-to-the-right: an image histogram pushed toward the highlights without clipping (vs an under-exposed one bunched at the noise floor) 🟨

• effect on light (linear, quadratic). Why f/# are a ratio: same amount of light (divide fov, compensate with aperture)

• side effects (motion blur, depth of field). Will derive later

• **f-stops are a log₂ (doubling) scale of light**: the *f-number* itself is a **ratio** N = f/D, but what a photographer counts is **stops**, and **one stop = a factor of 2 in light**. Because light ∝ aperture **area** ∝ D² ∝ 1/N², the standard series **1.4, 2, 2.8, 4, 5.6, 8, 11, 16** steps by **√2 in N** so the area (and the light) **halves** each step. A "stop" is therefore a **log₂ unit**: $\text{stops} = \log_2(\text{light ratio})$ — and aperture, shutter time, and ISO are all measured in this same stop currency, which is exactly why you can trade one against another (the exposure triangle). Cameras add **⅓-stop** clicks for fine control. [f-stop power-of-two progression figure]

• 💡 **Big lesson:** **photographers measure light in *stops* — a log₂ (doubling) scale.** Aperture, shutter, ISO, and dynamic range are all counted in stops (factors of 2), so **exposure is additive in log space** — the same "ratios matter, work in log" principle as gamma and tone mapping.

• 🖱️ **Interactive (web edition):** a live **exposure-triangle simulator** — two 3D subjects at different depths walk in opposite directions while you set shutter, aperture, ISO, scene brightness, and sensor size; it renders the scene *physically* (time supersampling → real motion blur, aperture supersampling → real depth of field, a shot+read-noise model → real ISO grain), and an "auto-set triangle" holds the exposure constant so you feel each control's trade-off, not just its brightness. [fig-exposure-triangle-sim; interactive]

• **autoexposure & metering**: the meter integrates scene light (TTL — center-weighted / **spot** / **evaluative-matrix**) and renders it to a **mid-gray (~18%) assumption**, which **fails on high-/low-key scenes** — a snowfield meters to gray and comes out *underexposed* (you add +EV); a black cat meters to gray and comes out *overexposed*. Matrix metering (e.g. Nikon's 3D Color Matrix, trained on tens of thousands of photos) adds brightness/color/contrast/distance cues to guess intent. [18%-gray metering figure]

• **where does 18% come from?** it's the reflectance of the standard photographic **grey card** and the meter's stand-in for the *average reflectance of a "normal" scene*. The key idea: **18% is middle grey *perceptually*, not numerically** — because lightness is roughly **logarithmic** (Weber–Fechner; cf. CIE $L^*$), the perceptual midpoint between a black (~2–3% reflectance) and a white (~90%) sits near their **geometric mean** (√(0.03·0.9) ≈ 0.16), i.e. ~**18%**, *not* 50%. It is **Zone V** of Ansel Adams's Zone System (the middle of eleven zones). A meter's job is then "assume the scene averages to middle grey, and expose so that average lands at middle grey."

• 💡 **the radiometric meaning of (shutter, aperture, ISO) — a worked example (and why not to trust it).** The three controls aren't arbitrary dials: they tie the photo to **absolute scene radiometry** through two calibrated relations.

• **ISO is calibrated to the sensor** (ISO 12232). The **saturation-based speed** is $S_{\text{sat}} = 78 / H_{\text{sat}}$, where $H_{\text{sat}}$ (in **lux·s**) is the focal-plane *luminous exposure* that just **saturates** the sensor; so a higher ISO means less light is needed to fill the well. (The standard also defines SOS / REI variants tied to a mid-grey output level — which is partly why bodies disagree.)

• **the exposure (metering) equation** ties the controls to scene **luminance** $L$ (cd/m²): a "correct" (middle-grey) exposure satisfies $\boxed{\,N^2/t = L\,S/K\,}$, with $N$ the f-number, $t$ the shutter (s), $S$ the ISO, and $K \approx 12.5$ the meter calibration constant (ISO 2720).

• **worked example** — **f/2.8, 1/100 s, ISO 400.** The scene luminance this renders to middle grey: $L = K\,N^2/(t\,S) = 12.5 \times 2.8^2 / (0.01 \times 400) = 12.5\times 7.84/4 \approx \mathbf{24.5\ \text{cd/m}^2}$. The **focal-plane exposure** actually delivered to the sensor follows the *camera equation* $H = q\,L\,t/N^2$ with the lens factor $q=\tfrac{\pi}{4}T\cos^4\theta \approx 0.65$ on-axis (DxO uses $q=0.71$): $H \approx 0.65\times 24.5\times 0.01/7.84 \approx \mathbf{0.020\ \text{lux·s}}$. Sanity check against saturation: at ISO 400, $H_{\text{sat}}=78/400 = 0.195$ lux·s, so middle grey sits $\log_2(0.195/0.020)\approx\mathbf{3.3\ \text{stops}}$ below clipping — that's your highlight headroom. So a bare $(t,N,S)$ triple **does** pin down scene radiance in real units.

• ⚠️ **but two big caveats** before you believe the number. **(1) the nominal values lie** — the marked/EXIF $N$, $t$, $S$ are not the true ones (rounded shutter, manufacturer-tuned ISO that can be **⅓–1 stop** off the measured value; → [[Basic image processing and ISP#Beyond the pixels: basic metadata and EXIF]], [@burggraaff-etal-2019]). **(2) not all pixels see the same light** — lens **vignetting** and the **cos⁴θ** natural falloff (→ [[#Sensors: photosites, CCD vs CMOS]]) mean **edge pixels get less light than the centre** for the same scene radiance, so "absolute radiance per pixel" needs a per-lens **flat-field** correction, not one global scalar. Honest radiometry therefore **calibrates the actual body + lens + aperture** rather than trusting the triplet. (DxOMark's "measured ISO" is exactly this saturation-based calibration; their old *ISO-sensitivity* measurements page is now folded into the [DxOMark glossary → ISO speed](https://www.dxomark.com/glossary/iso-speed/) and the [sensor-testing protocol](https://www.dxomark.com/dxomark-camera-sensor-testing-protocol-and-scores/).) [refs: ISO 12232; ISO 2720; [@burggraaff-etal-2019]]

• caveat (the **18 vs 12.5%** confusion): light meters are actually calibrated to a fixed **luminance constant** ($K$, ISO 2720) that, with typical lens flare, corresponds to **~12–13%** reflectance — so the famous "18% card" and the meter's true midpoint differ by **~⅓–½ stop**; metering off an 18% card gives a slightly different exposure than the meter's internal target, a perennial source of forum arguments.

• **priority modes & their failure cases**: aperture-priority (set N, camera picks t) and shutter-priority (set t, camera picks N) — each fails when no valid partner exists (f/1.4 in bright sun needs an impossible shutter; 1/1000 in dim light needs an impossible aperture)

• **"expose to the right" (ETTR)**: push the exposure as bright as possible **without clipping** highlights, since the **shadows** are where noise lives; **raising ISO** beats brightening in software because ISO gain amplifies the signal *and* the pre-amplifier noise but **not** the downstream **read noise** (→ Noise) [ETTR histogram]

equations

exposure ∝ t·(D/f)²

one stop = ×2 light = 1 EV

EV = log₂(N²/t)

▸2.10.2 Exposure modes, UI, and auto-ISO⧉

`fig-mode-dial` · exposure mode dial (P/A/S/M) → who sets aperture vs shutter in the exposure triangle ✅

`fig-telephoto-vs-retrofocus` · two two-group schematics — telephoto (+ then −, principal plane H′ pushed in front → physical length < f) vs retrofocus/inverted-telephoto (− then +, long back-focal distance to clear the SLR mirror); marks f vs physical length, H′, F′ ✅

`fig-exposure-triangle-sim` · live exposure-triangle simulator (web edition): two 3D subjects at different depths walking opposite ways; shutter/aperture/ISO/scene-light/sensor sliders with brute-force time + aperture supersampling (real motion blur, depth of field, shot+read noise) and an auto-set triangle; static fallback is a screenshot ✅

• the **PASM** modes are just *who picks which leg of the exposure triangle*: **P** (program — camera picks aperture+shutter, you can shift the pair), **A/Av** (you set aperture → DoF, camera picks shutter), **S/Tv** (you set shutter → motion, camera picks aperture), **M** (you set both); plus full **Auto** / scene modes

• **auto-ISO** is the modern fourth knob: in A or S the camera still has only one free leg — auto-ISO frees a *second* one by letting ISO float, so you can hold *both* a chosen aperture **and** a minimum shutter speed; you cap the **max ISO** (noise budget) and a **min shutter** (often a "1/focal-length" rule, or a factor of it)

• **exposure compensation** (±EV) biases the meter's mid-grey target — the manual override for the 18%-grey failure (snow → +EV, dark scene → −EV) without leaving an auto mode

• **M + auto-ISO** is a popular combo: you nail aperture and shutter for look, and let ISO be the automatic variable, with ±EV still steering it

▸2.10.3 Focus, autofocus, and depth of field⧉

`fig-af-families` · autofocus families: contrast-detect hill-climb · phase-detect sub-aperture offset · on-sensor/dual-pixel PDAF ✅

`fig-focus-conjugate-move` · focusing geometry: as subject distance u changes the image distance v must change (1/f=1/u+1/v); the lens slides to keep the image plane on the fixed sensor — near vs far focus (lens extends vs retracts) ✅

`fig-depth-of-field-sim` · live macro depth-of-field simulator (web edition): a to-scale thin-lens diagram + a 3D photo (aperture-supersampled bokeh) + a circle-of-confusion plot all sharing one optics model; subject dropdown (Meshy beetle / bee / ladybug) over a flower background, focus landing on the front of the face so the body and antennae blur; static fallback is a screenshot. *Bug & flower 3D models generated with Meshy AI.* ✅

• **the two classic passive AF families**:

• **contrast-detection (CDAF)**: drive the lens until image **contrast/sharpness is maximal** — accurate but has to **hunt** (it doesn't know which way is "more in focus", only that it overshot); historically slow, common on early mirrorless / compacts

• **phase-detection (PDAF)**: look at the scene through **two opposite edges of the lens** (sub-apertures) → two slightly shifted images; the **phase difference gives the direction *and* amount** of defocus in **one shot**, so the lens moves straight to focus (open-loop, no hunt). Classic SLR mechanism (a dedicated AF sensor under the mirror); it is literally a **tiny stereo measurement**

• **on-sensor PDAF / dual-pixel**: mirrorless bodies put phase detection *on the imaging sensor* — masked AF pixels, or **dual-pixel** designs that split each photosite into two halves that see opposite sub-apertures → phase detection at (nearly) every pixel, the best of both worlds (used for Pixel's portrait-mode depth too — a literal micro-stereo pair)

• **depth-from-defocus AF (Panasonic DFD)**: a third route — model how the lens's blur (PSF) changes with focus and read the **defocus direction and amount** from **two frames at slightly different focus**, giving a PDAF-like one-step cue from a plain **contrast** sensor (no phase pixels); rides on CDAF, needs a **per-lens blur profile** (full treatment → Optics, *Focus*)

• **active AF** (for completeness / history): time-of-flight ultrasonic (old Polaroid sonar) and IR rangefinding — work in the dark but limited; distinct from an **AF-assist lamp** that merely adds contrast for passive AF

• **focus modes**: **AF-S / one-shot** (lock once, for still subjects) vs **AF-C / AI-Servo** (track continuously, predict motion, for action); plus **MF** with focus aids (peaking, magnify — see UI)

• **focus selection & automation** — where computation now lives: single-point / zone / wide-area selection, then **subject-detection AF** — **face → eye → animal/bird/vehicle** detect that picks and **tracks** the subject (deep-learning driven; →ML). **Eye-AF** is the headline modern feature for portraits.

• **handling tricks**: **back-button focus** (decouple AF from the shutter button — focus with a thumb button, shutter only releases) and **focus-and-recompose** (lock focus on the subject, then reframe — beware the small focus-plane error at wide apertures / close range)

▸2.10.4 Lenses and focal length⧉

`fig-focal-length-families` · focal-length families as nested FOV cones (ultrawide/wide/normal/short-tele/tele) with angle of view + one-word use each ✅

• **primes vs zooms**: a **prime** is one focal length — typically **brighter (faster), sharper, smaller, cheaper** for its quality, and "makes you think / move your feet"; a **zoom** trades some of that for **convenience** (one lens, many framings). Fredo's standing advice: a cheap **35–85 mm f/1.8 prime** is razor-sharp value; modern zooms are good but the basic kit zoom is the weak point.

• **the focal-length families** (full-frame reference; *what each is for*):

• **ultrawide** (≤24 mm): landscapes, interiors, exaggerated near/far — watch the perspective stretch

• **wide** (~24–35 mm): environmental / reportage, "the scene + context"

• **normal** (~50 mm): close to the eye's perspective, unobtrusive, fast & cheap

• **short telephoto** (~85–135 mm): **portraits** — flattering compression, easy subject separation (see the portrait-distance bullet below)

• **telephoto / super-tele** (~200 mm+): sports, wildlife, the moon — strong compression, brings far subjects close

• 🖱️ **Interactive (web edition):** a live **dolly-zoom simulator** — the camera dollies in/out while the focal length zooms to hold the subject the same size, so *only* perspective changes and the background rushes in or falls away (the "Vertigo" effect); log focal-length (12–200 mm) and distance sliders with a magnification-preserving link, realistic environment maps, and a subject that is a bust or one of several diverse 3D humans. It makes "focal length vs distance / background compression" tangible (the Pinhole chapter's lesson). [fig-dolly-zoom-sim; interactive] *(human 3D models generated with Meshy AI — acknowledge in the caption)*

• **why portrait lenses flatter — it's the *distance*, not the lens**: a "portrait" focal length (~**85–135 mm**) flatters because it lets you **back away** from the subject, and what actually shapes a face is the **camera-to-subject distance**. Up close (a wide lens), the nose is much nearer the lens than the ears, so it looms — bulbous nose, big forehead, receding ears; step back and those distances equalize, giving natural proportions. Striking finding: the chosen **distance changes not just proportions but perceived *personality*** — closer faces read as less trustworthy / less attractive in rating studies. Cite **Perona 2007** ("A new perspective on portraiture," *Journal of Vision* 7(9):992) and the perceived-traits-vs-distance study (Bryan et al., PMC4114730: <https://pmc.ncbi.nlm.nih.gov/articles/PMC4114730/>). The flip side — the **wide-angle face stretch** up close — and its correction are in [[Perspective distortion and its correction]].

• 🖱️ **Interactive (web edition):** a live **face-distortion simulator** — set the focal length, the magnification (face size), and the eccentricity (how far off-centre the face sits), and watch the face's *shape* change: the close-wide "big-nose" perspective when a short lens forces you in, and the wide-angle stretch as the face nears the frame edge. A full-frame view and a face-cropped view (constant size) let you compare shape, not position; the subject can be a face, a grid sphere, or one of several diverse 3D human faces. [fig-face-distortion-sim; interactive] *(human 3D models generated with Meshy AI — acknowledge in the caption)*

• **fast vs slow**: a **fast** lens has a **large maximum aperture** (small f-number, e.g. f/1.4) → more light + shallower DoF; a **slow** lens (f/5.6) is smaller/cheaper but light-starved. Zooms are often slow and/or have a **variable** max aperture across the zoom range.

• **the "35 mm-equiv" reminder** (cross-ref, don't re-derive): a focal length only means a field of view *relative to a sensor size*; smaller sensors **crop** → multiply by the **crop factor** to get the full-frame-equivalent FOV (APS-C ≈ 1.5×, M4/3 ≈ 2×, phones much more). See the Pinhole chapter's "35 mm" / crop-factor sidebar.

▸2.10.5 Lens filters: polarizers, ND, and graduated ND⧉

`fig-reflection-removal-cues` · cues that separate a window reflection from the transmitted scene — ghosting/double-image, polarization, focus difference 🟨

⬜ figure not yet created

**ND** enabling a daylight long exposure (silky water) [photo before/after] fig-gain-vignette-solve

`fig-jpeg-artifacts-photo` · JPEG blocking & ringing on a REAL photographic edge (white egret vs dark foliage) at q=10: grid-aligned zoom with 8×8 overlay — blocking in the smooth area, ringing along the edge (File formats, BASIC) ✅

• **polarizer (circular polarizer, CPL)**: a rotatable filter passing one polarization. Because **reflection partially polarizes light** (Brewster's angle; Light & physics), rotating it **cuts glare and reflections** off non-metal surfaces (water, glass, foliage), **deepens blue sky** (skylight ~90° from the sun is strongly polarized) and **boosts saturation**. Costs ~1–2 stops; "**circular**" (a linear polarizer + quarter-wave plate) so the camera's beam-splitter AF/metering still work. **Not reproducible in post** — it changes *which* light is captured.

• **neutral-density (ND) filter**: a **uniform grey** that cuts light by a chosen number of **stops** with (ideally) no color shift → lets you use a **slow shutter** (silky water, motion blur) or a **wide aperture** in bright light. Strong NDs (6–10 stops) enable **daytime long exposures**; a **variable ND** is two stacked polarizers rotated to dial the strength.

• **graduated ND (GND)**: dark at the top, clear at the bottom, with a **hard / soft / reverse** transition. Aligned to the **horizon**, it holds back a bright sky so a high-contrast scene fits in **one exposure** — an *optical* alternative to **HDR bracketing** (Image measurements; Multiple exposure).

• **UV / clear "protective" filter**: today mostly lens protection (perennial debate: protection vs an extra surface that can add flare); historically cut UV haze on film.

• **throughline**: polarizer and GND effects are **partly or wholly un-doable in software** (the polarizer changes the captured light; the GND is a manual exposure blend) — which is why physical filters survive the computational era, in contrast to a plain ND (≈ an exposure change software *can* fake).

▸2.10.6 Keeping the lens clean⧉

⬜ figure not yet created

the cleaning kit + order of operations (blower → brush → microfibre / lens-pen → fluid)

`fig-lens-flare-ghosting` · stray light: a bright source, a ray reflecting off two internal lens surfaces → a "ghost" landing displaced on the sensor, plus a veiling-glare wash; ghost vs flare labelled ✅

• a smudged or dusty front element **scatters light** → **veiling flare**, lower contrast, and soft dust spots (worst stopped down) — clean glass is real image quality, not fussiness

• **order of operations**: **blower first** (lift grit without dragging it), then a soft **brush**, then a **microfibre cloth / lens pen**, and only then **lens fluid** for stubborn smears — wipe **gently, centre-outward**; never a dry rough wipe (it grinds grit into the **coating**)

• **don't overdo it**: every wipe risks the anti-reflection coating, and a little dust is optically harmless; a **protective/UV filter** or the **lens hood** prevents most of the problem

• **sensor dust** is the cousin problem — dust on the **sensor (or its cover glass)** shows as **soft dark spots fixed in the frame** (vs moving with the lens), most visible at **small apertures** on plain areas (a clear **sky**); the cure is cleaning (rocket **blower**, then a **wet swab**) plus in-camera **dust-removal / ultrasonic shake** and **dust-mapping** for software subtraction — a maintenance reality of **interchangeable-lens** cameras (changing lenses lets dust in) (forward-ref dust-spot removal, BASIC)

▸2.10.7 Image stabilization⧉

`fig-ois-vs-ibis` · two stabilization strategies side-by-side: lens-shift OIS (floating lens group moves) vs sensor-shift IBIS (sensor moves on a magnetic stage), moving element highlighted, same gyro/IMU input ✅

`fig-stabilization` · image stabilization: hand-held blur vs OIS (lens shift) vs IBIS (sensor shift); ~2–5 stops, camera shake not subject motion ✅

• **what it does**: a gyroscope senses the camera's **angular shake** and a feedback loop counters it during the exposure, so you can hand-hold **slower shutter speeds** — typically a **gain of ~2–5 stops** (e.g. 1/100 s at 1000 mm on a monopod, per the slides)

• **three implementations**:

• **OIS — optical, in-lens**: a floating lens element shifts to counter shake (Canon IS, Nikon VR) — tuned per lens, also stabilizes the viewfinder/AF

• **IBIS — in-body**: the **sensor** itself rides a moving stage (Panasonic, Olympus, Sony) → works with *any* mounted lens, and can stack with OIS; also enables sensor-shift super-resolution / high-res modes

• **digital / EIS**: crop a margin and **warp each frame** to cancel motion — cheap, no moving parts (phones, video), but costs resolution/FOV and can add wobble

• **the crucial limit** — it fixes **camera shake, not subject motion**: a stabilized 1/4 s shot of a *static* scene is sharp, but a *moving* subject still blurs (you need a fast shutter for that). State this explicitly — a common beginner confusion.

• **sidebar — IS hacking**: the stabilizer is a controllable blur actuator — driving it deliberately enables **custom motion blur** / coded exposure (Bando–Holtzman ICCP 2011) → Advanced

▸2.10.8 UI and display⧉

`fig-viewfinder-types` · viewfinder types: OVF (optical) vs EVF (electronic) vs rear LCD ✅

• **viewfinders**: **OVF** (optical — an SLR mirror/prism or a rangefinder window) shows the real world directly, zero lag, but **not** the captured exposure/WB; **EVF** (electronic, mirrorless) and the **rear LCD** show the *sensor's* live feed — slightly laggy, but **WYSIWYG**: exposure, white balance, and depth-of-field preview all visible before the shot

• **exposure preview & the live histogram**: the EVF/LCD can show the image *as it will be captured* (exposure simulation), with a **live histogram** to judge clipping; this is the practical home of **ETTR** — push the histogram right without clipping

• **focus aids**: **focus peaking** (highlight high-contrast in-focus edges) and **magnify** for manual focus; **zebras** (animated stripes over near-clipped highlights) for exposure — borrowed from video

• the through-line: an EVF turns the *abstractions* of exposure and focus into **direct visual feedback**, which is why mirrorless changed how people shoot

▸2.10.9 Video modes⧉

`fig-shutter-angle` · shutter angle sets the blur — a rotating-disc shutter at $0°/180°/360°$ admitting a smaller/larger fraction of the frame interval $T$; $180°$ ($\tau=T/2$) is the cinematic film-look, small angles strobe 🟨

• **frame rate & shutter angle**: video sets exposure time as a fraction of the frame interval — the **180° shutter-angle** convention (shutter ≈ ½ the frame time) gives "natural" motion blur; high frame rate → slow motion

• **rolling shutter / "jello"**: CMOS reads the sensor **row by row**, so fast motion (or a fast pan, or a spinning prop) **skews/wobbles** — a **global shutter** avoids it but is rarer/costlier (ties back to CMOS readout in Image measurements as integrals)

• **log / flat profiles + grading**: shoot a **log** (flat, low-contrast) profile to preserve dynamic range, then **color-grade** in post — the video cousin of raw + tone curve

• **codecs & bitrate**: intra- vs inter-frame compression, 8- vs 10-bit, chroma subsampling (4:2:0 vs 4:2:2), bitrate — trade file size vs editing/grading headroom

• *(kept deliberately brief — the algorithms and the full motion treatment are in **Motion/Video**)*

▸2.10.10 Programmability, or the lack thereof⧉

`fig-bokeh-shapes` · bokeh: out-of-focus highlights take the aperture's shape — circular (many blades / wide open) vs polygonal (few blades, stopped down); synthetic defocus-disk fields + aperture insets ✅

• most cameras run **closed firmware**: you cannot run your own code on them, so the **computational-photography ideas in this book can't be tried on a stock camera** — a real obstacle for experimentation

• the **hacks**: **CHDK** (Canon PowerShot) and **Magic Lantern** (Canon DSLRs) are community firmware add-ons that expose scripting, raw video, bracketing, intervalometers, and focus stacking — born exactly because the manufacturers wouldn't

• the **open road**: **Android's Camera2 / CameraX** API (and Google's earlier FCam research camera) expose per-frame control of exposure / focus / gain and access to **raw burst** — which is *why phones, not cameras, became the home of computational photography*

• the **maker road**: a **Raspberry Pi + camera module** (the HQ camera, driven by **libcamera / picamera2**) is a cheap, fully **open and programmable** camera — per-frame exposure/gain control and **raw** access on a hobbyist board — the natural platform for *trying this book's algorithms* and building DIY computational cameras

• **why this matters for the book**: our algorithms (HDR+, burst denoising, computational bokeh) need **programmable capture** — control of the per-frame integral and access to raw; the platform that grants it is where the research happens

▸2.10.11 Anatomy of a modern full-frame interchangeable-lens camera⧉

• the **mirrorless block diagram** (light → bits): **lens + mount** (electronic contacts carry AF/aperture/IS) → **sensor on an IBIS stage** → **shutter** (a mechanical curtain *and/or* an electronic/rolling shutter) → **image processor + AF** (on-sensor PDAF, subject detect, the ISP) → **EVF + rear LCD** (the live feed) → **card + battery**

• the defining change from the **SLR**: the **mirror box and optical prism are gone** — the sensor feeds the EVF directly, which is what enables WYSIWYG preview, on-sensor PDAF across the frame, silent electronic shutter, and tighter lens-to-sensor distances (shorter flange distance → new mounts)

• each block is a previous section made physical — a good place to consolidate the chapter

▸2.10.12 Anatomy of cell-phone cameras⧉

`fig-phone-camera-array` · phone multi-camera array (ultrawide/wide/tele+periscope, tiny sensors, bright fixed lenses) → a computational-pipeline block → final photo ✅

• the hardware constraints: a **tiny sensor** behind a **bright, fixed-aperture lens** (no iris — aperture is fixed, exposure is shutter + ISO only); tiny sensor → **huge depth of field** (why phones must *fake* background blur) and **more noise** (why they burst-and-average)

• a **multi-camera array** stands in for a zoom: separate **ultrawide / wide / tele** modules, the long end often a **periscope** (a prism folds the light path sideways so a long focal length fits a thin body); switching "zoom" really switches cameras (+ crop/fusion between them)

• 💡 the key idea: the **computational pipeline does the heavy lifting** — demosaicking, multi-frame **HDR+ / burst denoising**, **computational bokeh** (depth from dual-pixel / multi-camera), night mode, super-resolution. The phone overcomes small-sensor physics with **computation**, not glass → forward-ref **Basic ISP** and **Multiple-exposure**.

• this is the chapter's punchline and the bridge to the rest of the book: cheap optics + a smart pipeline beats good optics alone

▸2.10.13 Types of cameras⧉

`fig-lens-design-tradeoff` · lens-design Pareto cartoon as a 6-axis radar (sharpness, speed, size, weight, cost, distortion): fast pro prime vs compact phone lens overlaid, plus a dashed polygon showing where software correction shifts the phone lens (Lens optimization) ✅

• **by sensor size** (the spine — sets DoF, light-gathering, and body size; cross-ref crop factor): **phone** → **1-inch compacts** → **Micro Four Thirds** → **APS-C** → **full-frame (35 mm)** → **medium format** → **large-format / view cameras**

• **by lens**: **fixed-lens** (phones, compacts, bridge/superzooms, instant) vs **interchangeable-lens (ILC)** (mirrorless, DSLR, medium format)

• **by mechanism / viewfinder**: **DSLR** (mirror + OVF, legacy) vs **mirrorless** (EVF, dominant — the anatomy above); **rangefinder** (Leica-style window + focus patch); **point-and-shoot / compact**; **bridge / superzoom** (fixed huge-range zoom on a small sensor); **action / 360** (rugged, ultrawide, fixed); **instant** (Polaroid / Instax, self-developing print); **glasses / wearable cameras** (the camera leaves your hands entirely — Google Glass, Snap Spectacles, Ray-Ban Meta, GoPro-on-a-head; **point-of-view, always-available capture**, raising the obvious **privacy / consent** questions — cross-ref Ethics); **film** bodies (still made — see the darkroom below); **view / technical cameras** (large format + movements → tilt-shift, Optics)

• **by platform — how the camera is *carried* (aerial photography)**: the **drone / UAV** is the modern default (a gimballed camera that flies), but it sits at the end of a long lineage of getting a camera *above* the scene when no tripod will reach — **balloon** photography (Nadar over Paris, 1858), **kite** aerial photography (Arthur Batut, 1880s; the rig + a timer/altitude trigger), **pigeon** cameras, and **pole / mast** photography (a camera on a tall pole or monopod for low-altitude overhead frames). The platform buys a **viewpoint**, not a different sensor — the same imaging, lifted.

• the practical lesson: **sensor size and lens system, not megapixels**, mostly separate these — the rest is ergonomics, platform, and price

▸2.10.14 Cameras beyond photography⧉

`fig-lagrangian-vs-eulerian` · the organizing diagram — Lagrangian (arrows following particles over time → flow and tracking, output trajectories) vs Eulerian (a fixed grid, each pixel plotting intensity-over-time → magnification, never asking where anything went) 🟨

• **industrial / machine vision**: defect inspection, line metrology — often **global-shutter** mono sensors with fixed lighting and GigE/USB3 interfaces; the "image" is really a **pass/fail or a measurement**

• **barcode & QR readers**: dedicated retail/warehouse scanners, and the **QR / barcode** reader built into every phone camera — the camera as a **data input** (locate → rectify → decode a 1D/2D code to a number or URL), a tiny computer-vision pipeline rather than a picture

• **robotics & drones**: cameras for **SLAM**, obstacle avoidance, **visual-inertial odometry** (camera + IMU), and grasping — output is **pose / depth / a control signal**, not a picture

• **automotive (ADAS / self-driving)**: surround cameras fused with radar and lidar for lane-keeping, sign and pedestrian detection — **high dynamic range and reliability** over prettiness

• **scientific & specialty**: microscope and telescope cameras, **high-speed** cameras, **thermal / multispectral / hyperspectral**, and **event cameras** (per-pixel brightness-change sensors) — many revisited in Advanced

• **cooled astronomy cameras**: dedicated deep-sky cameras **actively cool the sensor** (a thermoelectric **Peltier** stage, often to a fixed *set-point* tens of °C below ambient) to crush **dark current** for minutes-long exposures; typically **mono** (paired with a filter wheel — LRGB / narrowband), **regulated** so dark frames match, and **no IR-cut / no Bayer** for sensitivity — the noise chapter's thermal term, attacked with refrigeration → cross-ref *Denoising by averaging* (stacking) and the noise section

• **medical**: endoscopes, retinal and surgical cameras, capsule cameras — imaging in service of diagnosis

• **ubiquitous & embedded**: webcams and video-conferencing, **AR/VR** inside-out and eye tracking, doorbell / CCTV security, automotive cabin sensing — cheap modules, computation-heavy

• throughline (→ Intro): once the camera is a **measurement device**, "good" stops meaning *pretty* and starts meaning **accurate, fast, and robust** for the task

▸2.10.15 A camera's other sensors⧉

`fig-pinhole-imaging` · imaging-scenario series (2/3): add a pinhole to the bare sensor — one ray per scene point → a dim **inverted** image (same tree+sensor+colours as fig-bare-sensor-averaging) ✅

⬜ figure not yet created

deferred]

• the image sensor isn't the only sensor; a small suite surrounds it, and several feed the *computational* pipeline, not just the metadata

• **IMU** (accelerometer + gyroscope) — the camera's sense of its own motion; drives **stabilization** (tells OIS/IBIS which way to counter-shift), auto-rotate / level horizon, and increasingly **computation**: EIS, seeding burst/panorama alignment, gyro-based deblur (knowing the motion helps estimate the blur kernel) → Motion / Deblurring

• **context / metadata sensors**: **GPS/GNSS** (geotag → EXIF), **magnetometer/compass** (heading; AR), **barometer** (altitude)

• **ambient-light sensor (ALS)**: a small front-facing photodiode — increasingly a multi-channel **color / spectral** version — that reads the **level and color of the surrounding light** (a *non-imaging* light meter sitting beside the imaging sensor). It drives **auto screen brightness**, seeds a **white-balance prior** (what illuminant are we under? → cross-ref automatic white balance, Color technology), and a fast variant detects the **flicker** of mains-powered fluorescent/LED lighting (the 50/60 Hz pulsing) so the camera can phase its exposures to dodge **banding** (*flicker-free / anti-banding* mode)

• **dedicated focus/depth sensors**: **laser / ToF** rangefinder for fast low-light AF, **lidar** scanner for coarse depth (portrait blur, AR) — beyond on-sensor PDAF

• **microphones** (video sound; arrays → directional/wind-aware audio; sync with IMU), **housekeeping** (thermometer → dark-current/noise comp; Hall sensors → lens/stabilizer element position; real-time clock → timestamps)

• throughline: a camera is now a small **sensor-fusion** platform; the side channels (esp. the IMU) are part of *how the picture is computed*, not just notes attached to it

▸2.10.16 Flash and lighting⧉

`fig-point-op-levels` · levels on a real (flat/hazy) photo: bunched-up luma histogram stretched out to fill [0,1] by setting a black point and a white point — input + after + transfer curve with the clip-and-stretch anchors and the two histograms (Point operations → Black point, white point, and levels, BASIC) ✅

`fig-wide-angle-perspective-distortion` · wide-angle perspective distortion: off-axis spheres image as radially-stretched ellipses (flat-sensor geometry), and faces near a wide frame's edge are widened — not a lens flaw; forward-ref to correction (Pinhole image formation) ✅

• **TTL metering**: the camera fires a **pre-flash**, meters the return **through the lens**, and sets flash power automatically — the flash analog of auto-exposure

• **sync speed & high-speed sync**: a **focal-plane** shutter only fully uncovers the sensor up to its **X-sync speed** (~1/200 s) — faster than that and a moving slit means the flash can't light the whole frame; **high-speed sync (HSS)** pulses the flash rapidly to cover the slit (at a big power cost). A **leaf shutter** (in-lens) opens fully at *any* speed → **syncs at all shutter speeds** (a real advantage for daylight fill).

• **fill flash**: in harsh/backlit light, add flash at **−1 to −2 EV** (the slides say −1.5 to −2, i.e. 3–4× below ambient) to **open the shadows** without looking flashed — it's **dynamic-range management**, not "more light"

• **bounce & off-camera**: never bare flash head-on (flat, harsh, red-eye) — **bounce** off a ceiling/wall (~45°) or use a diffuser to enlarge the source → softer light; **off-camera** flash gives directional, shaped light (→ Computational illumination; computational bounce flash, Davis SIGA 2016)

• 🖱️ **Interactive (web edition):** a live **portrait-lighting simulator** — three soft area lights (key / fill / kicker, each with azimuth, elevation, size/softness, intensity, and colour) on a 3D face, with **area-light-supersampled soft shadows** and a physiological melanin/hemoglobin skin model; a from-behind "setup" view shows the lights as studio umbrellas plus the camera, and lighting presets (butterfly, loop, Rembrandt, split, clamshell, rim) let you feel how light *placement and size* sculpt a face. [fig-portrait-lighting-sim; interactive] *(3D studio umbrella generated with Meshy AI — acknowledge in the caption)*

▸2.10.17 The traditional darkroom⧉

⬜ figure not yet created

chemical darkroom — film develop pipeline + enlarger printing, labelling exposure time / contrast grade / dodging / burning [fig-darkroom-enlarger fig-darkroom-enlarger

⬜ figure not yet created

deferred]

• film **development**: developer (exposed silver-halide → metallic silver) → **stop bath** → **fixer** (dissolves unexposed grains) → a **negative** (dark where the scene was bright)

• **printing**: an **enlarger** projects the negative onto light-sensitive **paper**, developed in turn → a positive print; the enlarger is a *second, fully controlled exposure*

• print controls: **exposure time** = overall lightness; **contrast grade** (paper grade / filter) = tone-mapping steepness; **dodging** = hold light back → lighter region; **burning** = add light → darker region

• the origin of the **dodge/burn** tools, the **exposure/contrast** sliders, and the **Zone System** (the 18% gray of metering = Zone V); makes the "print = performance of the negative" metaphor concrete

▸2.10.18 The digital darkroom: editing software⧉

⬜ figure not yet created

Lightroom-style develop panel vs Photoshop-style layer workspace, each tool annotated with the chapter that derives it [fig-editing-lr-vs-ps fig-editing-lr-vs-ps

⬜ figure not yet created

schem/screenshot

⬜ figure not yet created

deferred]

• **parametric / non-destructive (Lightroom style)**: the original file is untouched; edits are a re-rendered *recipe*. Global: white balance, exposure, contrast, highlights/shadows/whites/blacks, tone curve, HSL, clarity/texture/dehaze, sharpening, noise reduction, lens corrections, crop. Local: masks, graduated/radial filters, AI subject/sky masks. Plus **asset management** (catalog, rating, keyword, search) across whole shoots

• **pixel editor (Photoshop style)**: edits pixels directly — layers, selections, masks; clone stamp & healing brush, content-aware fill, dodge/burn, curves/levels adjustment layers, filters, text, montage → retouching & compositing

• the split: **global · reversible · many photos** vs **local · pixel-level · one photo**; serious workflows use both (develop in LR → retouch/composite in PS)

• 💡 the bridge: **the slider is the interface; the rest of the book is the machinery** — name each tool and forward-ref the chapter that explains it

▸Part 3 BASIC IMAGE PROCESSING AND ISP⧉

At the end of this part, the readers will be able to implement a complete ISP or very basic version of a photo editing software (mini lightroom) with exposure, contrast, etc.

▸3.1 Image representation⧉

`fig-image-memory-layout` · a 2×3 RGB image packed into 1-D memory two ways — interleaved HWC (NumPy) vs planar CHW (ML), with the stride formulas (Image representation, BASIC) 🟨

`fig-edge-handling-photo` · the four edge modes on a real crop — black / clamp / mirror / wrap continued past the border; real-image companion to `fig-edge-handling` (Image representation → falling off the edge, BASIC) ✅

▸3.1.1 An image is an array⧉

array, basic representation and manipulation

• an image is an H×W×C array: a 2D grid of pixels, each a vector of C channels (usually R, G, B)

• sometimes extra channels, for example hyperspectral or for transparency — more in compositing chapter

• warn that it can be a problem when reading PNG if you assume just RGB and you get more channels

▸3.1.2 Float vs 8-bit vs more bits⧉

We'll focus on floats in [0,1]

8 bits are used on disk and for fast computation but dealing with 8 bit arithmetic is a pain. We discuss it in later chapter (ref).

• side bar: **how floating point works** — a float stores a number as **sign × mantissa × 2^exponent**, binary scientific notation: the **exponent** slides the binary point so the same ~24 mantissa bits (32-bit `float`) give **relative** precision across a huge range. What this means for images:

• **precision is relative** — fine resolution near 0, coarser for large values; a good match for **linear radiance / HDR** (and roughly for how we perceive light)

• values **outside [0,1] are fine** (super-whites, the negative lobes of a sharpening filter) — nothing clamps until you output

• `0.1` is **not exact** in binary, and adding numbers of very different magnitude **loses the low bits** (a long sum of many small values drifts) — usually negligible for images, occasionally not

• **inf** and **NaN** (`0/0`, `log` of a negative, `0×inf`) **propagate** and silently poison an image — a classic bug source; check for them

• `float32` is the working default; **`float16`/half** halves memory (ML, some HDR pipelines), `float64` is rarely needed for images

• [forward ref: the pains of **8-bit integer** arithmetic — overflow, rounding, no headroom — later chapter]

• **reminder** — 💡 **L6, quantization is rarely the real problem** (the lesson is introduced in FUNDAMENTALS → [[Image Formation and linear perspective#Noise|Noise]]): with enough bits and a sane encoding it's **noise and dynamic range** that bite, not the number of levels — which is exactly why **floats with headroom** are the sane default here.

• **how many bits for what** (match the depth to the *job*):

• **8-bit integer** — **display and delivery**, and compact storage. ~256 levels is enough *perceptually* **only because gamma** puts the codes where the eye is sensitive (the darks — see *encoding* below); but it has **no headroom** and **rounds on every operation**, so it's a poor *working* format.

• **10–14-bit** — what **camera raw** delivers, sized to the sensor's dynamic range (and to dual-conversion-gain / HDR sensors, FUNDAMENTALS).

• **16-bit integer** — the safe **intermediate for editing**: repeated operations don't accumulate visible **banding / rounding** the way 8-bit does (Photoshop's "16-bit mode").

• **float32 (linear light)** — the sane **working / HDR** format: relative precision, values allowed **outside [0,1]**, nothing clamps until output. **float16 / half** halves memory for ML and HDR pipelines.

• **the rule of thumb**: **more bits — and float — wherever you do *math* or hold *high dynamic range*; few bits only at the display / output end.** Spend precision where rounding and range actually hurt.

• 💡 **bit-depth and encoding decisions go together.** How many bits you need depends on the **encoding** you store them in, because a big part of **gamma's job is to give you better *uniform* quantization**: it spaces the available codes the way the eye perceives brightness — fine steps in the darks, coarse in the highlights — so **8 *gamma*-encoded bits** look smooth where **8 *linear* bits** would band badly in the shadows. (That is not gamma's *only* job — it also matches display response and historical pipelines, see *encoding* just below — but "make a few bits go further" is a big one.) The converse: you do **math in linear**, and linear light wants **more bits / float** to avoid that banding. So don't pick a bit depth and an encoding separately — **choose them as a pair**.

▸3.1.3 What the numbers actually mean: encoding⧉

say clearly when things are encoded, when things are decoded.

KEY framing: a bare array of numbers is meaningless without knowing its encoding / color space

Discuss that different aspects of comp photo call for different encoding: linear (e.g. for dealing with physics, white balance, radiometry) vs. gamma vs. log

• rehash gamma and other encoding, ref to early section/chapter

• JPEG/film are usually some weird remapped versions for beautification

• as a result, you have more radiometric approaches to image editing/processing (starting with calibrated/linear/RAW values), vs older approaches that are more hand wavy because values were less clear

▸3.1.4 What the numbers actually mean: color spaces⧉

• the same three letters hide real differences: **channel order** (RGB vs BGR — OpenCV), **color space / primaries** (sRGB vs Adobe RGB vs Display P3 → Color technology), **encoding** (linear vs gamma), **range** (`[0,1]` float vs `[0,255]` uint8 vs 10/12-bit), and **premultiplied vs straight alpha**. A bare array doesn't say which — read the metadata / profile, don't assume.

▸3.1.5 Beyond the pixels: basic metadata and EXIF⧉

• an image carries more than the pixel array: **basic metadata** — pixel **width** and **height**, channel count, bit depth, the file **name / path**, creation date

• richer **EXIF** capture metadata written by the camera: **ISO, shutter speed, aperture, focal length**, lens, timestamp, **GPS**, **orientation** flag, embedded ICC profile

• ⚠️ **EXIF is frustrating in practice**: not every camera writes every field and the set / format is **inconsistent** across makers; some values (notably **ISO** and **shutter speed**) are **approximate / rounded** and **not radiometrically correct** — don't treat them as calibrated measurements

• **concrete numbers** (so the warning isn't hand-waving): the recorded **shutter speed is the *nominal* one** — a phone that actually exposed for 1/3.0 s or 1/3.9 s may write **"1/3 s" in both cases**, a photometric error of up to **~30 %** (the kind of EXIF exposure-time rounding documented in consumer-camera calibration work, [@burggraaff-etal-2019]); and the **EXIF ISO is the manufacturer's *setting*, not the measured sensitivity** — independent RAW measurements (DxOMark, ISO 12232 saturation method) routinely find the **real ISO differs from the marked one by up to ~⅓–1 stop** (a body set to "ISO 100" may measure ~80, "ISO 3200" may measure ~2000), because makers tune the gain/headroom and the standard leaves slack. So for any **radiometric** use (HDR weights, noise modelling, exposure fusion) treat EXIF shutter/ISO as a *hint*, and prefer a per-body **calibration** or the **DNG `NoiseProfile`** when available. (Refs: [@burggraaff-etal-2019]; DxOMark *ISO speed* glossary + sensor protocol; ISO 12232.)

• a catalog of the typical fields lives in the appendix → [[Common EXIF fields]]; the file-format side of metadata is in [[File formats and compression#Data versus metadata: EXIF|File formats]]

• 💡 **what metadata *should* travel with an image** (a recurring wish in this book): beyond the basics, a sane representation would carry exactly the things a bare array can't imply — the **encoding** (linear / gamma / log), the **color space / primaries** (ideally an embedded **ICC** profile), the **noise level** (the read + shot noise parameters, so downstream code knows the per-pixel uncertainty), and — when known — the **point-spread function (PSF)** (for deconvolution / deblurring). Most formats carry the first two unreliably and the last two not at all, yet later algorithms (white balance, **denoising**, **deblurring**, HDR) silently *need* this — so an honest image is pixels **plus** this provenance.

• 💡 **noise level deserves to be metadata — and almost never is.** Of that wish-list, the **noise level** is the most useful and the most neglected: a tiny per-sensor calibration (read noise + a shot-noise/gain term, i.e. the variance as a function of signal at the current ISO) tells downstream code **how much to trust each pixel**. It is **rarely written into any image file**, yet **denoising, HDR merging, deblurring, and any weighted least-squares fit** all silently need it — without it they fall back to guessing the noise from the image. It costs only a couple of numbers to store, so carrying it alongside the pixels would make a lot of estimation principled instead of ad hoc. (**DNG**'s `NoiseProfile` tag is a rare format that actually does this.) → [[Noise, signal-to-noise ratio and dynamic range]], [[Denoising]].

▸3.1.6 Three kinds of operation⧉

• preview the taxonomy of image operations by what they touch (point / domain / neighborhood) — developed in Point operations

▸3.1.7 How the array sits in memory⧉

packing multidimensional arrays in memory. Various options.

Which one we choose (different in Python and our C++ implementation)

• channel order: RGB vs BGR (e.g. OpenCV)

• interleaved (HWC) vs planar (CHW); CHW is what PyTorch wants later

• **the C++ image data structure** (C++ profile — show this as the book's working representation): a minimal `Image` holding `width, height, channels` and a flat `std::vector<float> data` in **planar** order, indexed `data[c*width*height + y*width + x]`, with a `float& operator()(int x, int y, int c)` accessor hiding the index arithmetic. (Python profile: a NumPy `H×W×C` array.)

▸3.1.8 Pixel coordinate conventions⧉

origin top-left (y down) vs bottom-left (y up)

pixel center vs pixel corner: the half-pixel offset — a classic bug source; matters for resampling / warping (where-from vs where-to)

▸3.1.9 A typed image: the type travels with the pixels⧉

• **the problem**: a `float[H][W][3]` is **mute** — it does not say whether it is **linear or gamma**, **sRGB or P3**, `[0,1]` or `[0,255]`, **straight or premultiplied** alpha. Nearly every "why are my colors wrong?" bug is an operation applied to data in the **wrong type** (e.g. white-balancing or blending gamma-encoded values as if they were linear). The cure is to make the image **typed**.

• **a good image type encodes its gamma (and the rest of its provenance)**: alongside the array, a small **type record / tags** — **encoding** (linear / gamma / log), **color space / primaries** (ideally an embedded **ICC** profile), **range and bit-depth**, **alpha kind** (straight vs premultiplied). The encoding is **part of the value, not a fact you keep in your head**. (This is the "what metadata *should* travel with an image" wish from *Beyond the pixels*, promoted into the type.)

• **conversion = cast = convert the numbers AND re-tag** (the key discipline): a `to_linear()` must **divide out the gamma *and* set the encoding field to `linear`**; a `to_uint8()` must **scale and round *and* set the range / bit-depth tag**; a color-space convert applies the 3×3 *and* updates the **primaries** tag; (un)premultiply flips the **alpha** tag. A cast that changes the numbers **without** updating the tag (or the reverse) is exactly the bug. Done right, **type and data are always consistent**, and an operation can **refuse, or silently auto-convert,** a mismatched input.

• **why it's worth the ceremony**: it makes the **illegal states unrepresentable**. White balance and exposure can *insist* on **linear** input (multiply-in-linear, [[Color technology]]); a blend can *insist* on matching **alpha kind** and **color space**; a save can *insist* on an output **encoding**. A type check catches the classic mistakes up front, instead of the *viewer* catching them later. It is **units for images**.

• **in practice**: most arrays are untyped (NumPy, OpenCV) — which is *why* these bugs are everywhere; but **color-managed** apps, Adobe's pipeline, and **DNG-style** containers ([[Appendices#DNG: the Digital Negative]]) effectively carry this type, and the book's own `imageops` treats encoding/space explicitly. The takeaway: **track the type, and let conversions move it — don't keep it in your head.**

▸3.1.10 Alpha and extra channels⧉

• **RGBA**: a 4th **alpha** channel = per-pixel opacity / coverage; **premultiplied** (RGB already scaled by α) vs **straight** alpha is a classic bug — full treatment in **Compositing**

• other extra channels: a depth / Z channel, a matte, or **hyperspectral** (many narrow bands) — the array just gains channels

▸3.1.11 Video and more dimensions⧉

• **video** = an image sequence: just an extra **time** axis (H×W×C×T); most per-frame operations generalize directly (and motion couples frames — Motion/Video)

• **higher-dimensional** imaging: volumetric **3D** (medical CT/MRI: H×W×D), **4D light fields** (two spatial + two angular), spectral cubes — the same array machinery, more axes

▸3.1.12 Reading a pixel — and falling off the edge⧉

• wrap pixel access in a **safe accessor** so out-of-bounds reads don't crash or read garbage — and decide what a pixel *past the boundary* means:

• **black / zero** (pad with 0 — darkens edges), **clamp / replicate** (repeat the edge pixel), **mirror / reflect** (fold the image back), **wrap / tile** (periodic — what the DFT assumes)

• the choice matters for **convolution, resampling and warping** at borders; mirror is the usual safe default [edge-handling modes figure]

▸3.2 Developing, Testing and Debugging⧉

▸3.2.1 The workflow: build, run, look at the picture⧉

• your favorite IDE or text editor; build, run, repeat

• the key habit: **save intermediate images to disk** — the "printf" of image code

• optional: display from code; a small UI (html?)

▸3.2.2 Principles⧉

• 💡 **Big lesson — most debugging is tedious but simple, not clever.** **90% of debugging is methodical, not inspired**, and it comes down to two things: (1) **choose smart inputs** — small, hand-verifiable cases where you already know the right answer (constant, impulse, edge — see *Test on inputs you can verify by hand*, below); and (2) **understand the execution** — walk through it **step by step** and **output / print as many intermediate values and images as you can**. The bug almost always reveals itself the moment you *look* at an intermediate, instead of reasoning about it in your head. The rest of this section is just that lesson, unpacked.

• doubt everything; assume no piece works until you've checked it

• test incrementally — never write the whole program before testing a piece

• isolate pieces; binary-search / divide and conquer to localize a bug

• change one thing at a time

• display, don't guess: dump intermediate results instead of reasoning in your head

• to confirm a step does what you think, deliberately break it and watch the output change

▸3.2.3 Test on inputs you can verify by hand⧉

• small (e.g. 3×3) synthetic images: constant, impulse, edge / half-plane (half black, half white), rectangle

• feed these to intermediate stages too, not just the final output

▸3.2.4 Per-algorithm debug recipes⧉

• each algorithm also has its *own* hand-checkable test inputs — those recipes now live as a **debug sidebar in each algorithm's own chapter**: impulse → kernel in [[Neighborhood operations and convolution]]; constant / rectangle in [[Demosaicking]]; the $\sigma_r$ limits of the **bilateral filter** in [[Bilateral filtering]]; a 1–2 px shift for **image alignment** in [[Multiple exposure imaging|Multiple-exposure imaging]]

▸3.2.5 Crashes and bounds⧉

• a seg fault usually surfaces far from its cause; suspect an array/vector index

• print the index and the array size; **assert index ∈ [0, size)** in a debug build — ties to the edge accessor in Image representation

▸3.2.6 Vibe coding: writing image code with an LLM⧉

• mostly good, not always predictable (e.g. antialiasing, bad cross-fading)

• **like all AI — sometimes remarkable, sometimes it fumbles something trivial**: reproducing **Kemelmacher-Shlizerman & Seitz's *Photobios*** ([[Many images and photo collections]]), the model autonomously nailed the *hard* parts — **face registration** from landmarks and a **dynamic-programming** path through the photo collection — yet repeatedly botched a **one-line cross-fade** `frame = (1−t)·start + t·end`, needing ~5 re-prompts to get it right. Lesson: **review even the trivial parts** — competence on the hard stuff is no guarantee on the easy stuff.

• **vibe debugging / vibe explaining**: use the LLM to explain an error, propose a minimal repro, or suggest where to look — but **verify against a hand-checkable input** (above) before trusting a fix; treat generated image code as *guilty until tested*

• our guidance: keep the LLM on a short leash for **numerically / perceptually subtle** code (filtering, resampling, color, antialiasing) where "looks plausible" ≠ correct; lean on it for boilerplate, I/O, and visualization

▸3.3 Point operations⧉

`fig-operation-types` · the three image-operation types — range (point) vs domain (spatial) vs neighborhood (Values / point ops, BASIC) 🟨

▸3.3.1 Three kinds of operation — a reminder⧉

• 🔁 **quick reminder** (already previewed back in *Image representation → Three kinds of operation*): image operations split by **what they touch** — **point** (range; same pixel's *value* → one 1-D curve), **domain** (a *different* location → resampling/warping), **neighborhood** (a *window* → convolution/bilateral). No need to re-derive it here — this chapter just **zooms into the point branch**, the simplest, and where we start. [the three operation types figure, with the **point-operation** branch **highlighted**]

• the **histogram** (the distribution of values — its own chapter, next) is the natural companion to point ops

• develop everything on a grayscale value; for color, apply the same curve per channel (saturation is the exception)

▸3.3.2 The point-operation curve⧉

for single channel : simple 1D input output curve. the whole operation *is* that curve (the editor's curves tool); identity = the 45° diagonal

one big question: in which units (linear, gamma, log)

Say that traditional photography has always involved controlling the mapping from scene radiance to output photo tone.

side bar: the zone system. Zones are intensity ranges. encourages photographers to think/act on mapping from scene to negative and then from negative to print. Sometimes in a local way. Lots of tools: exposure, type of film, pre-exposure, chemistry, darkroom.

• 💡 **Big lesson:** **additive vs multiplicative → choice of encoding** — ask whether a tonal op is additive (→ linear/display offset) or multiplicative (→ linear light, a sum in log); light and perception are multiplicative

▸3.3.3 Exposure: a multiply in linear light⧉

`fig-point-op-exposure` · exposure as a multiply in linear light: input, output (+1 stop), and the remapping curve (Basic Image Processing → point operations) 🟨

`fig-point-op-exposure-spaces` · exposure +1 stop done right (linear light) vs naively on the gamma values (over-brightens, clips at input ½): two outputs + remapping curves 🟨

• exposure is multiplicative in linear space ($L_\text{out} = 2^s L_\text{in}$, one stop = a factor of 2)

• the +1-stop curve bows upward in display space; highlights clip at 1 (unrecoverable)

• do it in linear light; the naive gamma-space multiply clips at input ½

▸3.3.4 Brightness vs. exposure⧉

`fig-point-op-brightness-vs-exposure` · brightness (additive, lifts blacks → milky) vs exposure (×2 in linear light, keeps 0→0): two outputs + both remapping curves 🟨

• brightness = additive offset $L_\text{out}=L_\text{in}+b$; lifts blacks, fogs the shadows

• recommend against brightness; exposure is more radiometric (keeps black at black). Brightness survives as a legacy slider / quick shadow lift

▸3.3.5 Contrast: steepen about a pivot⧉

`fig-point-op-contrast` · contrast as a steeper tone curve about a mid-gray pivot: input, output, and the remapping curve (Basic Image Processing → point operations) 🟨

`fig-point-op-contrast-spaces` · the same contrast (×gain about mid-gray) in gamma (sRGB) / linear (no gamma) / log: three outputs + remapping curves (linear crushes shadows, log preserves them; black clamped to ε for log) 🟨

• contrast = steepen about a pivot $L_\text{out}=g(L_\text{in}-p)+p$; gain >1 steepens and clips both ends

• big question: in linear, gamma, or log (three legitimate looks; defer the full argument to tone mapping)

• sidebar: the S-curve (sigmoid) — roll off the extremes instead of clipping

▸3.3.6 Black point, white point, and levels⧉

• **levels**: set the **black point** and **white point** (+ a mid-grey **gamma**) — a piecewise remap that stretches the used range to [0,1]; the workhorse first edit

• read them off the **histogram**: clip the empty ends, pull the mids with the gamma slider

▸3.3.7 The general tone curve⧉

• a free-form **tone curve**: drag control points, smoothed by a **spline** kept **monotone** (Catmull-Rom / monotone cubic) so tones never invert [tone-curve / spline figure]

• **sidebar — splines (the math)**: fit a smooth curve through the control points (knots) $(x_i,y_i)$ as a **piecewise cubic** — one cubic per interval, the pieces joined for continuity.

• **cubic Hermite** form on $[x_i,x_{i+1}]$ with $t=\frac{x-x_i}{x_{i+1}-x_i}\in[0,1]$: $p(t)=h_{00}(t)\,y_i+h_{10}(t)\,m_i+h_{01}(t)\,y_{i+1}+h_{11}(t)\,m_{i+1}$, with basis $h_{00}=2t^3-3t^2+1,\ h_{10}=t^3-2t^2+t,\ h_{01}=-2t^3+3t^2,\ h_{11}=t^3-t^2$. The knot **tangents** $m_i$ are the free choice — picking them is what separates the schemes.

• **Catmull–Rom** (interpolating, $C^1$): tangent = centred difference $m_i=\frac{y_{i+1}-y_{i-1}}{x_{i+1}-x_{i-1}}$ — local and cheap, but can **overshoot** (so it may go non-monotone).

• **natural cubic** ($C^2$): require continuous second derivatives and minimize total curvature $\int (p'')^2$, which gives a **tridiagonal** linear system for the knot slopes — smoothest, but still can overshoot.

• **monotone cubic** (Fritsch–Carlson): start from Hermite tangents, then **clamp** them so every segment is monotone — if a secant slope $\Delta_i=\frac{y_{i+1}-y_i}{x_{i+1}-x_i}=0$ set the neighbouring tangents to $0$; otherwise keep $(\alpha,\beta)=(m_i/\Delta_i,\ m_{i+1}/\Delta_i)$ inside the circle $\alpha^2+\beta^2\le 9$. This **guarantees no inversion**, which a **tone curve** requires (brighter in $\Rightarrow$ brighter out). [tone-curve / spline figure]

• **B-splines** are the *approximating* cousin — the curve follows a control polygon and need **not** pass through the points; smoother and easy to keep monotone, at the cost of exact interpolation.

• subsumes exposure / contrast / levels; container for the principled tone-mapping curves (next chapter)

• **sidebar — how analog photography shaped the same mapping**: long before a digital curve, photographers controlled the **scene-tone → printed-tone** mapping by physical and chemical means —

• **exposure**: where you *place* a scene luminance on the film's response (the Zone System's "place this shadow on Zone III, let the highlights fall where they may")

• the film's **characteristic (H&D) curve** and its **gamma / contrast** — chosen by emulsion *and* by **development** (time / temperature / agitation / developer; **push / pull**; "**N / N+1 / N−1**" development to expand or compress the scene's range onto the film)

• **pre-exposure / flashing**: a uniform sub-threshold exposure that lifts deep shadows above the film's *toe* (raising the floor without touching highlights)

• at print time: the **paper grade** (contrast grade 0–5), the print developer, and local **dodging & burning** (hold back / add light per region)

• the free-form tone curve (and the per-region masks of *Local point operations*, below) is the **digital descendant** of this whole toolkit. *Cite **Ansel Adams**' trilogy — **The Camera**, **The Negative**, **The Print** — and the **Zone System** (developed in *The Negative*).*

▸3.3.8 Basic color enhancement: saturation and vibrance⧉

• the **basic color edits** every editor exposes — the color analogue of the tone curve. **This is the home for the implementable mechanics** (the equations below); the *perceptual-space* nuance — which space to scale chroma in, why skin needs protecting — is the job of [[Color technology]] (don't defer the equation there)

• **saturation**: split each pixel into **luminance** $Y$ and **chroma** $(\text{RGB}-Y)$, scale the chroma, recombine — $\text{RGB}' = Y + s\,(\text{RGB}-Y)$ (so $s=0$→grey, $s=1$→identity, $s>1$→more saturated; equivalently $(1-s)\,Y + s\,\text{RGB}$). A *uniform* $s$ over-saturates already-vivid colors and skin

• **vibrance** (implementable): the same operation but with a **per-pixel gain** that fades as the pixel's own saturation rises, and protects skin — replace the constant $s$ by

$$g = 1 + v\,(1-\text{sat})\,w_\text{skin}(\text{hue}), \qquad \text{RGB}' = Y + g\,(\text{RGB}-Y)$$

where **sat** is the pixel's current saturation (e.g. HSV $S = \tfrac{\max-\min}{\max}$, or chroma magnitude / luma), **v** the vibrance amount, and **w_skin(hue)** a hue weight $\approx 0$ on skin / orange hues and $1$ elsewhere — so muted colors get the full boost while vivid colors and skin barely move (set $w_\text{skin}\equiv 1$ for the skin-agnostic version)

• which space to scale chroma in (perceptually uniform), and why skin needs protecting — deferred to [[Color technology]]

equations

saturation $\text{RGB}' = Y + s\,(\text{RGB} - Y)$ ($Y$ = luminance broadcast to the 3 channels

$s$: 0→grey, 1→identity, >1→boost)

vibrance = the same with a **per-pixel** gain replacing $s$ — $g = 1 + v\,(1-\text{sat})\,w_\text{skin}(\text{hue})$, then $\text{RGB}' = Y + g\,(\text{RGB} - Y)$

▸3.3.9 Lookup tables⧉

• **LUT (look-up table)**: bake any color/tone mapping into a precomputed table indexed by pixel value, so applying it is one **lookup per pixel** — fast, and the standard way to ship a *look*

• **1-D LUT** = three independent per-channel curves (exposure / contrast / levels / gamma collapse into one); **3-D LUT** = a full RGB→RGB map that handles **cross-channel** effects a 1-D LUT can't (white balance, film/grade looks — the `.cube` files of video), at the cost of a coarser sampled grid + interpolation

• LUTs are also a **speed** trick: precompute an expensive per-value function once, then index it (ties to fixed-point / 8-bit pipelines)

▸3.3.10 Video color grading⧉

• a **separate craft** from stills / radiometric editing — the *look* of motion pictures, with its own tools and vocabulary; less radiometric, more color craft

• the colorist's primaries: **lift / gamma / gain** wheels (shadows / midtones / highlights, each with a color offset); **primary** (whole-frame) vs **secondary** (qualified by hue / luma, or masked) corrections

• the working space is usually a camera **log** encoding (S-Log, Log-C) under a color-management pipeline (**ACES**); the finished *look* is shipped as a **3-D LUT** (`.cube`), applied per frame (ties to [[#Lookup tables]])

• the **teal-and-orange** look: push skin into warm **orange**, everything else toward its **teal** complement — complementary contrast that makes people pop; popularized by ***O Brother, Where Art Thou?*** (2000 — among the first features with a full **digital intermediate**; Coen brothers, cinematographer Roger Deakins)

• the **stills-side cousin**: Lightroom's HSL / color-mixer and color-grading panels

• ties: [[#Lookup tables]] (3-D LUTs are the delivery format); [[Color technology]] (color management, log / ACES); tone mapping (the dynamic-range side)

▸3.3.11 Local point operations⧉

• strictly not a point operation: apply a tool through a **selection mask** (the result depends on *where*, not just the pixel's value)

• **selections**: by value / color (magic wand), by **gradient** (linear / radial), by **brush**, **edge-aware** (snap to edges → Big lesson: edge-preserving = affinity), and increasingly **AI / semantic** (segment-anything-style)

• 💡 **the moment a point operation goes *local*, it raises the central question of "*which pixels?*" — segmentation and matting.** A hard binary **selection** is a *segmentation*; a soft-edged one (feathering, hair, motion blur, partial coverage) is a **matte** (a per-pixel α, exactly the alpha channel of [[#Alpha and extra channels]]). So even a humble masked exposure tweak is the front door to a deep topic — how to decide the mask, and how to make its border *soft and correct*. **Forward ref → [[Compositing, segmentation and matting]]** (the full treatment: matting equation, trimaps, alpha estimation), with edge-aware mask refinement in [[Guided image filtering]] / [[Bilateral filtering]].

• the bridge to **local tone mapping** and edge-preserving filtering (next chapters)

▸3.4 Histograms⧉

`fig-histogram` · an image histogram (per-channel) + cumulative histogram; its shape depends on the encoding space (BASIC tone mapping) 🟨

`fig-histogram-encoding-spaces` · the same image's luminance histogram in linear / gamma (sRGB) / log; 18% gray marked at 0.18 / ~0.46 / ~0.75 — the encoding reshapes the value axis (BASIC histograms) 🟩

`fig-histogram-equalization` · histogram equalization — the CDF used as the transfer curve, before/after (BASIC) 🟨

`fig-histogram-matching` · histogram matching — source + target images and their histograms → matched result, with the composed transfer curve $\text{CDF}_\text{tgt}^{-1}\circ\text{CDF}_\text{src}$ (BASIC histograms) ✅

• the **histogram** — the distribution of pixel values — is the natural companion to point operations: this short chapter is how to *read* it, and how it can *drive* an automatic tone curve

▸3.4.1 The histogram⧉

• the **histogram** = the distribution of pixel values; reading it tells you exposure, clipping, contrast, and where the tones sit [histogram figure]

▸3.4.2 What cameras and editing apps actually show⧉

• the on-screen **RGB and/or luminance histogram** in the camera and in editing apps; a **live histogram** in the viewfinder / on the rear screen while composing

• **clipping warnings** ("**blinkies**"): blown highlights and crushed shadows flash to flag lost detail at each end

• ⚠️ the displayed histogram is typically of the **gamma / display-encoded** image, **not** the linear data — so its shape (and where mid-grey sits) reflects the encoding, not scene radiance (ties to *depends on the encoding space*, below)

▸3.4.3 The 3-D color histogram⧉

• beyond per-channel: a **3-D color histogram** = the distribution of pixels in **color space** (a cloud / occupancy in the color volume)

• its shape **depends on the color space** you build it in — RGB cube vs a perceptual space (Lab / opponent) bin the same pixels very differently

• used for **palette extraction, color transfer**, and **segmentation** (clusters in color space)

▸3.4.4 The histogram depends on the encoding space⧉

• its **shape depends on the encoding space** — the *same* image has a very different histogram in linear vs gamma vs log [histogram-encoding-spaces figure: linear piles into shadows, gamma spreads it, log spreads shadows more; 18% gray lands at 0.18 / ~0.46 / ~0.75]

▸3.4.5 A histogram is a sampled estimate⧉

• warning: it's a **sampled / discretized** estimate — spikes and gaps are artifacts of **binning / quantization**, not the scene

▸3.4.6 Histogram equalization⧉

• **histogram equalization**: use the image's **cumulative histogram (CDF)** as the tone curve → flattens the histogram, maximizing global contrast (often too aggressively) [histogram-equalization figure: the CDF *is* the transfer curve]

• pseudocode: bin → running-total CDF scan → normalize → apply as a per-pixel LUT

▸3.4.7 Histogram matching⧉

• **histogram matching** (a.k.a. histogram **specification**) generalizes equalization: instead of flattening to a *uniform* target, reshape an image's histogram to match **any target distribution** — a chosen shape, or *another image's* histogram

• **the recipe = compose two CDFs**: equalize the source (push it through its own CDF, $\text{CDF}_\text{src}$, to a uniform), then **"un-equalize" through the target's inverse CDF** — $L_\text{out} = \text{CDF}_\text{tgt}^{-1}\big(\text{CDF}_\text{src}(L_\text{in})\big)$; both are 1-D LUTs, so matching is still a single **point operation** [histogram-matching figure: source histogram + target histogram → matched result + the composed transfer curve]

• **equalization is the special case** where the target is uniform ($\text{CDF}_\text{tgt}=$ identity) — so matching, equalization, and a hand-drawn levels curve are one family

• **for color**: matching each channel independently **shifts hues** (the channels aren't independent), so do it in a **decorrelated / perceptual** space, or use the proper multi-dimensional version — **color transfer** (Reinhard et al. 2001: match the **mean and standard deviation** per channel in the decorrelated $l\alpha\beta$ space), with the exact N-D form via **optimal transport** (Pitié et al.)

• 💡 forward ref: histogram / color matching is the workhorse behind **harmonization** (make a pasted region's tones match the destination — Compositing), **color grading / style transfer** (impose a reference look), and **texture synthesis** (match value statistics) — all instances of "make this image's statistics look like that one's"

• refs: Reinhard, Ashikhmin, Gooch & Shirley 2001 (color transfer between images); Pitié, Kokaram & Dahyot 2007 (automated N-D color transfer / optimal transport); Gonzalez & Woods (histogram specification — textbook)

▸3.4.8 Constraining the slope: Ward's histogram-based tone mapping⧉

• **Greg Ward's constrained / histogram-based tone mapping** (1997): equalize, but **cap the local slope** so contrast is never exaggerated beyond what the eye accepts — a perceptually-bounded version for HDR [Ward et al. 1997]

▸3.5 Tone mapping⧉

`fig-reinhard-curve` · the Reinhard global tone curve L/(1+L) mapping [0,∞)→[0,1); naive clip vs Reinhard on a simulated-HDR image (BASIC) 🟨

`fig-tonemap-real` · global vs local tone mapping RESULT on a real HDR photo (seal at marina): naive single exposure · global Reinhard (one slope flattens local contrast → hazy) · local bilateral base+detail split (range fits, detail stays crisp) — real-image companion to `fig-tonemap-global-vs-local` (BASIC tone mapping) ✅

`fig-zone-system` · the Zone System strip 0–X, Zone V = 18% mid-grey (scene → print) (BASIC) 🟨

• a **quick intro** here (full treatment in the HDR chapter): tone mapping = choosing the **mapping from scene tones to display tones**, the natural payoff of point operations

▸3.5.1 The dynamic-range problem⧉

• **dynamic range** = ratio of brightest to darkest (contrast, not exposure); scene ~1:100,000, sensor ~1:1000, display ~1:20–500

• two problems: **capture** (merge multiple exposures → HDR chapter) vs **display** (squeeze a huge range onto the screen *while keeping detail* — this chapter); can't just rescale (shadows collapse)

• sidebar: limited range can be a virtue (W. Eugene Smith; lighting reduces range at capture)

▸3.5.2 Two families: global and local⧉

• **global** = one curve everywhere (a point operation); simple, fast, no halos, but a flattened slope loses local contrast

• **local** = curve varies spatially (a neighborhood operation); split large-scale vs detail, compress only large-scale — needs an edge-aware blur [global-vs-local figure]

▸3.5.3 Global tone mapping: one curve for everything⧉

• naive clip / rescale both fail (linear answer to a multiplicative problem)

• the **Reinhard** global operator `L_out = L/(1+L)` (or a white-point variant) maps `[0,∞) → [0,1)` — a principled, halo-free default global tone map [Reinhard curve figure]

• **one curve for the whole image** (Reinhard's `L/(1+L)`, a log/gamma, an S-curve) — simple, no halos, but can't squeeze a huge range into a small one without flattening local contrast

• apply to **luminance** only, rescale RGB by the ratio (keep hue)

• family: gamma, log, S-curve, the freehand tone curve; Reinhard is the principled, parameter-light member

▸3.5.4 Histogram equalization and Ward's bound⧉

• **histogram equalization** is itself a tone map: CDF as the curve, data-driven, max global contrast — but no restraint (exaggerates busy regions, amplifies noise)

• **Ward et al. 1997**: cap the local slope → perceptually-bounded equalization for HDR

▸3.5.5 Local tone mapping and the halo problem: why naive local fails⧉

• the curve **varies spatially** (compress the slow, large-scale variations, keep local detail) — needs a **blur / edge-aware** step, so it waits for convolution & edge-preserving; its failure mode is **halos** around strong edges [local-tone-mapping halo figure]

• best-of-both decomposition: large-scale layer + detail layer; compress only large-scale, add detail back

• naive Gaussian blur crosses edges → contaminated large-scale layer → **halo** band at strong edges; fix = edge-preserving (bilateral) blur

• 💡 **Big lesson:** **edge-preserving = affinity** — weight a blur by value affinity as well as spatial distance (the bilateral filter)

▸3.5.6 Why log space pays off⧉

• **log encoding pays off here**: range compression is multiplicative, so log turns it into a simple subtraction (→ Big lesson: additive vs multiplicative)

• detail = log(L) − large-scale (a subtraction); compress + recombine = additions

• 💡 **Big lesson:** **additive vs multiplicative → work in log** — light/exposure/range are multiplicative; log turns them into addition, which blurs/decomposes/compresses cleanly

▸3.5.7 The analog ancestor: the Zone System⧉

• Ansel Adams' **Zone System** = global tone mapping: carve light into 11 zones (a stop apart), Zone V = 18% grey; place a scene tone in a zone, realize it through exposure/development/paper — the analog of Reinhard's white point [Zone System strip figure]

• side bar: **dodging & burning** — the darkroom ancestor of local tone mapping (hold back / add light locally), even local contrast / sharpness

• **video grading evolved separately** — less radiometric (lift/gamma/gain), rich craft; same operations, different dialect → see [[#Video color grading]] (Point operations)

▸3.5.8 A caveat: HDR tone mapping isn't common in practice⧉

• ⚠️ honest note: **full HDR tone mapping is not very commonly used in practice** — it's **hard to control** and tends to look **artificial / "HDR-look"** (haloes, flat, garish)

• most photographers prefer **simpler exposure / contrast adjustments** and **exposure fusion** over an aggressive HDR operator

▸3.5.9 Where this is going⧉

• recap: tone mapping = display-side range compression; global (halo-free, loses local contrast) vs local (crisp detail, needs edge-preserving blur, risks halos); all cleanest in log; HDR chapter picks up the threads

▸3.6 Neighborhood operations and convolution⧉

`fig-convolution-slide` · a kernel sliding over an image, weighted-summing a neighborhood into one output pixel (Convolution, BASIC) 🟨

`fig-convolution-flip` · the flip: where-from vs where-to — convolution `g(x−x')` vs correlation 🟨

`fig-psf-impulse` · impulse in → kernel out: convolving a Dirac reads off the PSF / impulse response 🟨

`fig-blur-zoo` · box vs Gaussian kernels — profiles + 2-D stencils, Gaussian truncated at ~3σ 🟨

`fig-unsharp-mask` · unsharp-mask decomposition: input − blur = detail (high-pass), output = input + k·detail 🟨

`fig-sharpen-kernel` · the sharpening kernel δ − blur as a +centre / −surround stencil 🟨

`fig-separable` · separability — a 2-D Gaussian = 1-D ⊗ 1-D (blur rows then columns); O(r²) → O(2r) 🟨

`fig-gradient-sobel` · Sobel x / y → gradient magnitude (edge strength) on an image 🟨

`fig-convolution-probability` · convolution as the sum of random variables: box ⊛ box = triangle (two dice) 🟨

equations

convolution (I*g)(x) = Σ I(x')·g(x−x')

Gaussian g(x) ∝ exp(−x²/2σ²)

unsharp out = in + k·(in − blur(in))

sharpening kernel δ − g

▸3.6.1 Motivation: blur and sharpen⧉

• a pixel computed from its neighbors (recall the three operation types)

• blur also comes from optics: diffraction, spherical aberration → motivates removing optical artifacts (forward ref: deblurring) so important to understand if we want to prevent and remove it.

▸3.6.2 Convolution 101⧉

• blur as a weighted average of neighbors.

• linear shift-invariant (LSI) filtering; why linearity matters (with a non-shift-invariant counter-example)

• the convolution algorithm: loop over the output

▸3.6.3 The flip: where-from vs. where-to⧉

• point spread function (PSF) / impulse response

• **debugging / hand-checkable inputs**: an **impulse** reads off the kernel (the PSF); a **constant** must pass through a normalized kernel unchanged; a **shift kernel** just translates the image; a **half-plane** (edge) shows the edge response; verify a **1-D 3×3** by hand first

• without the flip it's called correlation. USeful too but not as useful and doesn't have all the good mathematical symmetries.

• sidebar: convolution in probability (sum of random variables, Laplace first derivation of convolution, flip is obvious: if we want the sum to be a given number, then the second number has to be the sum minus the first one)

• **sidebar — debugging convolution** (recipe moved here from Developing, testing & debugging): convolve an **impulse** and the output **is your kernel**, laid out and flipped — read it off number-for-number to catch a wrong weight, an off-by-one in centering, and the **flip** (convolution vs correlation) at a glance; a **constant** comes back unchanged iff the weights sum to 1; an off-centre single-1 **shift kernel** translates the image by exactly that offset (tests sign & direction); a **half-plane** shows the border behaviour; do the very first test in **1-D (1×3)**, where every output is hand-computable. [fig-psf-impulse]

▸3.6.4 Properties: the impulse, normalization, symmetry, commutativity⧉

• **normalization**: the weights sum to 1 to preserve brightness (a blur shouldn't darken the image)

• **commutativity / associativity**: I*g = g*I and (I*g)*h = I*(g*h) — so filters can be pre-combined and reordered (the flip is what buys these symmetries)

• **linearity**: scales and superposes — the very reason Fourier will help (next chapter)

• the **identity** is the impulse δ (I*δ = I); **separability** gets its own point below

▸3.6.5 Nitty-gritty: finite images⧉

• edge padding (ties to the edge/accessor policies above)

• centering the kernel

▸3.6.6 A blur zoo⧉

• box filter

• Gaussian (formula, truncation at ~3σ)

▸3.6.7 Separability⧉

• a 2-D filter is **separable** if it's an outer product of two 1-D filters (the Gaussian is): blur **rows, then columns**

• big **cost** win: a (2r+1)×(2r+1) kernel costs O(r²) per pixel; done separably, **O(2r)** — the practical reason large Gaussian blurs are fast [separability figure]

▸3.6.8 Gradient and oriented filters⧉

• **derivative filters**: finite differences estimate ∂I/∂x and ∂I/∂y; the **Sobel** kernel pairs a derivative (one axis) with a little smoothing (the other) for noise robustness

• the **gradient** ∇I = (Iₓ, I_y) → its **magnitude** (edge strength) and **orientation** (edge direction): the basis of edge detection (forward ref: features / Canny) [gradient / Sobel figure]

▸3.6.9 Sharpening⧉

• unsharp mask (add back the high frequencies)

• sidebar: film photography unsharp mask.

▸3.6.10 Non-linear sharpening⧉

• linear unsharp masking has two failure modes: it **amplifies noise** (in flat areas the high frequencies are mostly noise) and **haloes** at strong edges

• **adaptive / non-linear** unsharp masking **gates** the boost — sharpen **more in detailed / edge areas, less in smooth ones** (by local **edgeness** or **intensity**); cubic / rational variants (Ramponi) and adaptive control (Polesel et al.) — the principled version is in the edge-preserving part

The canonical reference is Polesel, Ramponi & Mathews, "Image Enhancement via Adaptive Unsharp Masking," IEEE Trans. Image Processing 9(3):505–510, 2000, whose whole point is that an adaptive filter controls the sharpening so enhancement happens in high-detail areas and little or none in smooth areas. Its predecessor is Ramponi, "A cubic unsharp masking technique for contrast enhancement," Signal Processing 67:211–222, 1998, which introduced the nonlinear (rational/cubic) variant. Also worth citing in this cluster: Russo, "An image enhancement technique combining sharpening and noise reduction," IEEE Trans. Instrum. Meas., 2002. [port + 3](https://baldigital.port.ac.uk/?p=5058)

say we'll have more in the edge preserving part

▸3.6.11 Edge-preserving filtering: the nonlinear escape⧉

`fig-bilateral-1d` · noisy step edge: noisy input vs Gaussian (smears step) vs bilateral (denoises plateaus, keeps step); side panel — per-pixel spatial×range weight collapses to one side at the edge (Bilateral filtering) ✅

• a Gaussian blur **smears edges** because it weights neighbours by spatial distance alone (a near-but-different pixel pollutes the average)

• fix: weight each neighbour by **value similarity** too → the **bilateral filter** (Tomasi & Manduchi 1998): spatial Gaussian × range Gaussian on the intensity difference [edge-preserving figure]

• per-pixel normalization (weights depend on the values); **nonlinear** and **not shift-invariant** — leaves the clean world of convolution (no commutativity, no Fourier diagonalization)

• 💡 **Big lesson:** **edge-preserving = affinity** — use the color/intensity difference as how much two pixels "belong together"; the engine of edge-aware denoising/sharpening, halo-free local tone mapping, matting, colorization

▸3.6.12 Where this goes next⧉

• two kinds of weighted average: fixed weights → convolution (linear, shift-invariant); image-dependent weights → edge-preserving (bilateral, nonlinear)

• two threads forward: **Fourier** (next — convolution becomes per-frequency multiplication, enables deblurring) and the **affinity** lesson (recurs all book)

▸3.7 Fourier⧉

`fig-fourier-basis-matrix` · the DFT as a change-of-basis matrix (real & imaginary parts) 🟨

`fig-sine-eigenvectors` · sine waves are the eigenvectors of convolution: same wave out, only amplitude/phase change 🟨

`fig-fourier-2pixel` · the 2-pixel example: convolution [0.8,0.2] diagonalized by a 45° rotation to [1,1]/[1,−1] 🟨

`fig-fourier-magnitude-phase` · an image's Fourier magnitude & phase, and the phase-swap demo (phase carries structure) 🟨

`fig-compact-space-frequency` · compact in space ⇔ spread in frequency (narrow vs wide Gaussian and its transform) 🟨

`fig-aliasing` · sampling a sine too coarsely → aliasing (high frequency folds to a low one) 🟨

`fig-sampling-comb` · sampling = multiply by a comb → spectral replicas; pre-filter (low-pass) before downsampling 🟨

`fig-sinc-vs-practical` · ideal sinc reconstruction vs a practical filter, in space and frequency 🟨

`fig-deblur-preview` · the deblur preview: sharp → blurred + noise → naive inverse amplifies noise; MTF 🟨

`fig-svd-geometry` · SVD = rotate·scale·rotate: unit circle → ellipse with semi-axes σ₁u₁, σ₂u₂ (Linear algebra) 🟩

`fig-focus-stacking` · optics-chapter illustrative figure (07-07 Figure 4): a stepped-focus stack → sharpness selection → all-in-focus composite. Synthetic per-slice defocus on one photo (`sourced/corn-cobs.jpg`, © Frédo Durand) — license-safe. The full real-data treatment lives in part-08 `fig-focalstack-*` ✅

fig-pixel-timeseries-bandpass · one fixed pixel, signal to output — its value over time, its temporal spectrum (a small in-band peak among DC and noise), a band-pass keeping that band, and the band scaled by $\alpha$ and added back; identical at every pixel 🟨

equations

DFT Î(u) = Σ I(x)·e^{−2πi ux/N}

Euler e^{iθ} = cos θ + i sin θ

convolution theorem ℱ{I*g} = Î·ĝ

eigen-relation g * e^{iωx} = ĝ(ω)·e^{iωx}

Nyquist — sample > 2× the highest frequency

▸3.7.1 Images as vectors in a high-dimensional space⧉

Why? Allows us to use linear algebra, e.g. to solve and understand inverse problems such as deblurring.

cite Strang.

Useful both conceptually (translation to the language of linear algebra) as well as practically (use linear algebra libraries and well-established solvers)

Similar to how we store in memory: serialize all pixels/channels.

Note: one source of notation overload: images are often named x.

Operations on vector spaces and what they mean for images: addition, multiplication by scalar

dot product aka scalar product aka dot product aka inner product

Matrices without forming matrices

Once images are modeled as vector, linear operators (exposure, convolution) can be represented as matrices. But giant matrices. In a case like convolution

And just like with any vector space we can do a change of basis. See next.

▸3.7.2 Why Fourier⧉

Convolution is a linear operation. It can be represented as a matrix. But matrices and linear operations can be unwieldy to understand. For example, we may want to know how easy it it to invert a matrix (if we want to deblur).

Said differently, the output of a convolution for a pixel depends on other pixels and these cross-dependencies create complexity.

However, there is a special case of matrix that is easy to manage : a diagonal matrix. In a diagonal matrix, there is no cross-talk, each input just undergoes a scalar multiplication, which means the inverse is a simple scalar division. The difficulty of inversion boils down to how close the scalar is from zero.

Fortunately for us, we can do a change of basis (change of image representation) so that convolution turns into a diagonal operation. That change of basis is the Fourier transform. Warning: the only bug is it assumes that the image is cyclic. But still insightful.

And we have **already seen** that frequency is a good way to model **human vision**: the contrast sensitivity function (**CSF**) back in the perception chapter (Spatial vision) describes the eye's sensitivity as a function of *spatial frequency*. So the Fourier view is not only a computational convenience for convolution — it also matches how the visual system is organized, which is why frequency keeps coming back (JPEG, image-quality metrics, tone mapping). → see Human perception → Spatial vision / CSF.

• **sidebar — a Fourier transform you can see** (motivation: it's physical, not just an imposed trick): the little **starburst** a distant point of light (a star, a streetlight) makes through a small aperture/lens is ≈ the **Fourier transform of that aperture** (squared magnitude) — round hole → **Airy** disc + rings; stopped-down iris → **sunstar** spikes (one streak ⟂ each blade-edge). The optics performs the transform **for free**, in the instant light crosses the lens — no FFT. Lesson: Fourier is a *natural, physical* way waves + apertures organize themselves; we're borrowing it for images. → ties to [[06 Optics, lenses, and aberration correction]] (diffraction / the **MTF** = FT of the PSF).

▸3.7.3 Definition: one coefficient per wave⧉

We focus on greyscale but it's the same for RGB.

Fourier transform: instead of one value per pixel, one coefficient per sine wave.

It's orthonormal, so coefficients are just dot products (sum/integrals)

• complex exponential carries a cosine + sine per frequency (Euler $e^{i\theta}=\cos\theta+i\sin\theta$); real part = cosine content, imag = sine — needed to capture any phase

• ⚠️ **so many Fourier conventions**: complex-exponential vs real sin/cos form; whether the $2\pi$ lives in the exponent, in the frequency, or in the normalization; continuous transform vs Fourier series vs DFT; normalization ($1/N$, $1/\sqrt N$, or none — and which side carries it); sign of the exponent ("forward vs backward"); angular vs ordinary frequency. These differ across textbooks, NumPy, and signal processing — **always check which convention a source uses** before you trust a formula. [fig-fourier-conventions — author's own "Fourier flavors" convention-warning illustrations, shown **full-width** so each reads clearly; from Fredo's slides 3_Linear_image_processing, slides 67–68 (author-created images)]

Figure: show change of basis matrix [⚠️ legibility: text labels currently unreadable — enlarge / redraw]

In general, Fourier diagonlizes all shift and shift-invariant operators.

• 💡 **Big lesson:** **diagonalize when you can** — express a linear operator in a basis where it's orthogonal/diagonal; **Fourier diagonalizes convolution**, turning it into per-frequency multiplication.

• **sidebar — Fourier and the heat equation** (where this trick came from): Fourier (1807) was solving **heat diffusion** in a 1-D rod with some initial temperature profile. His move was to write the temperature as a sum of **cosines** — and in that basis the heat equation **diagonalizes**: each frequency evolves on its own, and a single cosine simply **decays exponentially in time** (∝ e^{−k²t}), the highest frequencies fastest, so the profile smooths out. The cosines are the **eigenfunctions** of the Laplacian — the same reason Fourier diagonalizes convolution here.

• the heat equation also hands us the **two canonical solution families of linear differential equations: sine waves (in space) and exponentials (in time)** — separation of variables gives **spatial sinusoids** whose **amplitudes decay exponentially**, $\propto e^{-k^2 t}$, the highest frequencies dying fastest, so the profile smooths out. [fig-heat-diffusion-1d — a 1-D temperature profile (amplitude on y, position on x) smoothing over time: shown for (a) a general signal and (b) a single sine wave that simply decays in place]

• the general lesson: **diagonalizing is the win for any linear ODE / PDE** — in the eigenbasis a coupled system decouples into independent scalar equations whose solutions are **exponentials** (e^{λt}) or, for oscillatory / conservative systems, **sine waves**. Find the right basis (the eigenvectors) and a hard differential equation becomes trivial per-mode arithmetic.

▸3.7.4 Sines are the eigenvectors of convolution⧉

• feed a pure sine into any blur → the **same sine**, only scaled and phase-shifted: sines are the **eigenvectors** of convolution, eigenvalue $\hat g(\omega)$ = the kernel's FT [sine-eigenvectors figure]

• blur leaves low freqs untouched, crushes high ones → blur is a **low-pass filter**

• the **convolution theorem** $\mathcal{F}\{I*g\}=\hat I\cdot\hat g$: convolution in space = multiplication in frequency

▸3.7.5 The 2-pixel example⧉

Give the example of a 2-pixel image visualized as a2D Euclidean plane. Our linear convolution has coefficients 0.8, 0.2 The Fourier transform is just along [1, 1] and [1, -1] (rotation by 45 degrees). Our simple convolution becomes diagonal

▸3.7.6 Reading an image's Fourier transform⧉

• 2-D: two frequency indices (u,v); display as **magnitude** + **phase**

• **symmetries**: a real image has a **conjugate-symmetric** spectrum (half the coefficients are redundant); low freqs at center, high at corners; natural images mostly low-freq; a shift → a **phase ramp**

• **magnitude vs. phase** — how to *read* an image's FT: the **magnitude** shows which frequencies / orientations are present; the **phase** carries the *structure*. Swap two images' magnitude and phase and you recognize the one whose **phase** you kept. [magnitude/phase figure]

• **compact in space ⇔ spread in frequency** — a narrow Gaussian has a wide spectrum; the locality / uncertainty tradeoff that recurs in pyramids & wavelets [compact-space-vs-frequency figure]

• sidebar: **MTF** for lenses: the optical system's frequency response is the FT of its PSF (ties to the diffraction limit, Optics)

▸3.7.7 The fine print: two limitations of Fourier⧉

• before we lean on it, two honest caveats that the rest of the book keeps bumping into:

• **the DFT treats the signal as cyclic / periodic**: it tiles the image and wraps around, so content leaking off the right edge re-enters on the left → **boundary leakage / ringing**. This is what motivates **windowing** before an FFT, and it is exactly the **periodic-boundary assumption** baked into the FFT deblur solver (forward ref: the FFT-based deconvolution in [[Computational tools]] inherits this wrap-around)

• **the basis has infinite / global support**: every coefficient is a **global wave** spanning the whole image, so there is **no spatial localization** — move a single pixel and *every* coefficient changes. A Fourier coefficient can tell you *which* frequencies are present but not *where*. This is precisely the motivation for **wavelets / pyramids**, which trade a little frequency precision for spatial locality → [[#Linear pyramids and wavelets]]

▸3.7.8 Sampling and aliasing⧉

example of spatial aliasing. Show classical sine wave (prudential tower photo downsampling). Intuitively we need to break down the content that cannot be represented (return grey instead of black or white). You can think of it as the average over a pixel. More formally, we need Fourier.

As it turns out, aliasing too is easy in Fourier (because sampling grid is essentially a set of shifted spikes)

• **Nyquist–Shannon**: to represent a signal you must sample at **more than twice** its highest frequency; above that, high frequencies **fold (alias)** down to low ones — the moiré / wagon-wheel effect [aliasing figure] [📺 embed aliasing/sampling explainer → https://www.youtube.com/watch?v=yr3ngmRuGUc, see [[Video Resources]]]

• anti-aliasing: low-pass (pre-filter) before you downsample

• **sidebar — Aliasing on purpose: dynamic moiré prints**: aliasing / moiré is usually a nuisance, but it can be **engineered**. A printed **micro-pattern** photographed by a camera produces a moiré whose appearance **encodes information** — the relative camera pose / motion — because the beat pattern is exquisitely sensitive to the sampling geometry. Cite MIT's **KinéCam** dynamic prints (https://hcie.csail.mit.edu/research/kinecam/kinecam.html). Ties to **structured / coded patterns** elsewhere in the book.

• the space-vs-frequency tradeoff (sharp in one ⇒ spread in the other)

• sidebar: a primal-domain (no-Fourier) perspective too

• perfect sinc vs practical filters. (more in resampling section)

▸3.7.9 Application preview: can we deblur, given the blur?⧉

We will discuss more in later chapter, but quickly show:

sharp input image

blurry image (say with box. Maybe just 1D ?) plus noise.

linear inversion with amplified noise.

Fourier MTF

▸3.8 Resampling⧉

`fig-resample-forward-inverse` · concrete 2× upsampling on a 5×5 → 10×10 rainbow grid: FORWARD pushes input (i,j)→output (2i,2j) so 1 of every 2×2 output block is filled and 3 are black holes (regular lattice of gaps); INVERSE loops output, samples input via f⁻¹ (nearest) → every pixel filled (2×2 colour blocks). Forward leaves holes, inverse fills everything (Resampling, BASIC) ✅

`fig-linear-interp-1d` · 1-D linear interpolation — `im[1.3]` from its two neighbours 🟨

`fig-bilinear` · bilinear interpolation — the 4-neighbour weighting on the unit square 🟨

`fig-resample-kernels-real` · the kernel ladder on a REAL photo (Boston skyline + rainbow) resampled under a 35° rotation, magnified on a detail: nearest (blocky) → bilinear (smeared) → bicubic (edges restored) → Lanczos (sharpest) — adds Lanczos + the rotation case to `fig-interp-comparison` ✅

`fig-reconstruction-pipeline` · the resampling pipeline: sample → reconstruct → prefilter → resample 🟨

`fig-aliasing-photo` · aliasing/moiré on a real photo: a window-grid facade decimated with no prefilter folds into moiré bands — the 2-D, real-image companion to `fig-aliasing` (Sampling and aliasing) ✅

equations

1-D linear interp

general resample out(x) = Σ I(x')·k(f⁻¹(x) − x')

Lanczos `sinc(πx)·sinc(πx/a)`

Mitchell–Netravali cubic (B, C)

▸3.8.1 Domain operations: moving pixels around⧉

domain operations, function from plane to itself

▸3.8.2 Start with scaling up⧉

start with example of scaling up (upsampling). Missing values. Say downsampling will be a bit different (too many values) but the solution will be surprisingly similar.

▸3.8.3 The naive idea, and why it leaves gaps⧉

start naive : loop over input pixels. Point out problem (gaps)

▸3.8.4 Loop over the output, use the inverse transform⧉

better solution: loop over output (inverse mapping: for each output pixel, inverse-transform and sample the input)

• where from vs where to: always inverse transform

▸3.8.5 Nearest neighbour⧉

challenge

▸3.8.6 Linear interpolation, in 1-D first⧉

derived with intuitive 1D linear interpolation

▸3.8.7 Bilinear, in 2-D⧉

say in2D it can't be purely linear unless we triangulate

sidebar: smart triangulation where we pick the best diagonal for each square (the one joining the two most similar values)

Say we do linear horizontally first, then vertical

exercise : show it's the same if we do it in the opposite order.

• pseudocode: upsample = loop over output, inverse-map to fractional coords, bilinear-blend the four corners (two horizontal blends then one vertical)

▸3.8.8 The convolution perspective⧉

say that the interpolation perspective is fine for linear and upsampling, but because this is about (re)sampling, Fourier has better insights. In practice, means reconstruction through convolution.

side bar: proper resampling analysis: start with sampled signal, reconstruct, prefilter, resample. The final filter is a combination of prefilter and reconstruct. Depending on whether we up or downsample one of them dominates.

• warning: much about bicubic wrong on the internet. Including in old python libraries

▸3.8.9 Better kernels: bicubic and Lanczos⧉

They all approximate a perfect sinc but with a finite support and try to find a tradeoff between loss of medium frequency and aliasing of high frequencies. Show in Fourier : perfect box, smooth approximation

Netravali Mitchell.

form of bicubic.

required symmetries, remaining two parameters

perceptual study.

point out that photoshop gives you a choice.

▸3.8.10 Upsampling vs. downsampling: scale the kernel⧉

The key is how you scale the filter.

• 💡 **Big lesson:** **prefilter before you downsample** — when you throw samples away, first blur / area-average out the detail too fine for the new grid, or it **aliases** into bold false low-frequency moiré; naive nearest-pixel decimation is the classic bug (→ **L16**)

▸3.8.11 Prefiltering when the transform isn't a clean scale⧉

• for a **shear / rotation / homography** the ideal prefilter is **spatially varying and anisotropic** — an output pixel back-projects to a stretched **ellipse** in the input, so exact filtering is costly; **mip-maps** and **EWA** (elliptical weighted average) approximate it

• side bar: **Andrew Adams' challenge** — rotate an image by **1° three-hundred-sixty times**: a naive resampler blurs it to mush (errors compound), a good one (proper prefilter + reconstruction) returns nearly the original — a stress test for resampling quality

▸3.8.12 Where this goes next⧉

• recap: resampling = reconstruct-then-resample, reconstruction = convolution with a finite-budget sinc approximation

• forward: the reduce step → Gaussian/Laplacian pyramids (next); inverse-mapping → warping & morphing; resampler quality → image metrics

▸3.9 Linear pyramids and wavelets⧉

`fig-gaussian-pyramid` · the Gaussian pyramid — repeated blur + downsample tower (Pyramids, BASIC) 🟨

`fig-laplacian-pyramid` · the Laplacian pyramid — band-pass levels (G_k − expand(G_{k+1})) 🟨

`fig-pyramid-encode-decode` · the Laplacian pyramid as an encoder → decoder (exact reconstruction) 🟨

`fig-pyramid-frequency-bands` · each pyramid level = a frequency band (concentric Fourier rings) 🟨

`fig-pyramid-blending` · pyramid blending walkthrough on Earth+Jupiter — source A · source B · naive (hard mask, visible seam) · pyramid blend (seamless) ✅

`fig-coring` · coring — zero the small detail coefficients (noise), keep the large ones 🟨

equations

reduce / expand (blur + ↓2 / ↑2 + blur)

L_k = G_k − expand(G_{k+1})

reconstruction G_k = L_k + expand(G_{k+1})

▸3.9.1 Halfway between space and frequency⧉

• a multiresolution representation — **halfway between primal (space) and Fourier (frequency)**: localized in *both*, which is exactly why it's so practical

▸3.9.2 The idea: process an image at many resolutions⧉

• multiresolution, coarse-to-fine (original papers, 1980s)

▸3.9.3 The Gaussian pyramid⧉

• repeatedly **blur then downsample by 2** → a tower of ever-coarser, smaller images (each ¼ the pixels): the **reduce** operator

• cheap (a geometric series → only ~4/3 the original pixel count in total); the basis for coarse-to-fine search and **scale space** [Gaussian pyramid figure]

▸3.9.4 The Laplacian pyramid⧉

• construction: each level = **G_k − expand(G_{k+1})** (the detail the coarser level missed) — a **band-pass** image [Laplacian pyramid figure]

▸3.9.5 Reconstruction: an exact encoder and decoder⧉

• reconstruction: **exactly invertible** — add the levels back coarse → fine (an **encoder → decoder**) [encode/decode figure]

• pseudocode: Laplacian build (reduce + difference loop) and collapse (expand + add loop), each the exact inverse of the other

▸3.9.6 Each level is a frequency band⧉

• frequency intuition: each level holds a **band** of detail, coarse → fine [frequency-bands figure]

▸3.9.7 What the bands look like on a real image⧉

• properties: **decorrelated** and **sparse** on natural images (mostly near-zero → compresses, denoises) — sparsity drives compression (JPEG 2000) and denoising (coring)

▸3.9.8 Honest limitations⧉

• **overcomplete** (>1 coefficient per pixel — a frame, not a tight basis); only **approximately shift-invariant**; **isotropic** (no orientation) — each limitation motivates a relative

▸3.9.9 Cousins: wavelets and steerable pyramids⧉

• **wavelets**: a true tight basis (one coefficient per pixel, no redundancy) → natural for compression (JPEG 2000); but less Fourier-friendly, not shift/rotation-invariant

• sidebar: **Fourier, wavelets, and the locality dial** — Fourier (all frequency) ↔ primal grid (all space); wavelets/pyramids in between, a tunable space-vs-frequency tradeoff

• **steerable pyramids** (Simoncelli & Freeman 1995): add **orientation** (oriented sub-bands), nicely shift/rotation-invariant, even more overcomplete — texture, video magnification; (**multigrid** in passing)

▸3.9.10 What pyramids are good for⧉

• multiscale / **pyramid blending**: blend each Laplacian level under a **Gaussian-blurred mask**, recombine → seamless joins (the apple/orange) — see compositing & pano [pyramid blending figure]

• exposure fusion, HDR+ tone mapping — see HDR

• pyramid **denoising / coring**: zero out the small detail coefficients (mostly noise), keep the large ones — see Denoising [coring figure]

• local Laplacian filtering (edge-aware detail) — see edge-preserving

▸3.9.11 Multiresolution as a recurring theme⧉

• coarse-to-fine (optical flow), scale (SIFT), pooling / U-Net (deep nets) — forward refs

▸3.10 Image metrics⧉

`fig-ssim-vs-mse` · same PSNR, very different SSIM — pixel error ≠ perceived quality (Image metrics, BASIC) 🟨

• we **often need to compare two images** — fidelity, alignment, an ML training loss

equations

MSE = mean(‖I−J‖²)

PSNR = 10·log₁₀(MAX²/MSE)

SSIM = luminance·contrast·structure (local windows)

▸3.10.1 Full-reference vs. no-reference⧉

• **full-reference**: compare against a known-good reference (denoising / compression / resampling) — most of this chapter

• **no-reference / blind**: score a single image with no reference (needs a model of natural images) — mostly set aside

▸3.10.2 Why not just L2?⧉

• **MSE / L2** = $\frac1N\sum(I-J)^2$, simple, cheap, differentiable, but **doesn't match perception**: a tiny shift, a small gamma change, or a touch of noise can give the *same* MSE as a perceptually-awful edit (treats the image as a bag of numbers, ignores edges/texture/faces)

• sidebar: why squared (L2 vs L1) — smooth at 0, punishes outliers; L1 more robust

▸3.10.3 PSNR⧉

• **PSNR = 10·log₁₀(MAX²/MSE)** (dB) — a monotone re-expression of MSE; bigger is better (~30–40 dB decent); inherits MSE's blind spots

• sidebar: at least measure it in a **perceptual color space** (Lab / Luv, ΔE) so error ≈ perceived difference

▸3.10.4 SSIM⧉

• **SSIM** (structural similarity): compares **luminance, contrast and structure** in local windows — far better correlated with perceived quality; the *same* PSNR can have very different SSIM [SSIM-vs-MSE figure]

▸3.10.5 VDP and HDR-VDP⧉

• **visible-difference predictors**: a full perceptual model (the **CSF + masking** from the perception chapter) that predicts *where* a difference is visible; **HDR-VDP** extends it to HDR

▸3.10.6 Learned metrics⧉

• forward ref: **LPIPS, DreamSim** — deep-feature distances → [[Deep learning]]

• forward ref: a differentiable metric is also a **training loss** — minimizing LPIPS/perceptual distance (not per-pixel MSE) is how learned methods get **sharp** results → [[Deep learning#Learned perceptual metrics and losses]]

▸3.10.7 The right metric depends on the task⧉

• no single best metric: choose by what you're evaluating (fidelity vs perceived quality vs a training loss)

▸3.11 Denoising⧉

`fig-denoising-before-after` · denoising a real colour crop (realistic shot+read noise): noisy vs Gaussian (smears edges) vs bilateral (edge-preserving) (Denoising, BASIC) ✅

`fig-averaging-convergence` · averaging N noisy frames → noise drops as 1/√N (1, 3, 9, 25 frames), on a real colour photo with realistic shot+read noise (Denoising, BASIC) ✅

▸3.11.1 What is noise?⧉

• sources: photon/Poisson, read, thermal (see noise / image formation)

• look at it: histogram of a flat grey patch, a scanline

• reminder: affine model (in linear space) is good.

• reminder: sensor saturates on both ends

▸3.11.2 Denoising by averaging multiple shots⧉

• 1 → 3 → 5 images: the noise visibly drops [averaging-convergence figure]

• the statistics: independent noise averages out — variance ∝ 1/N, so **noise drops as 1/√N**; SNR, PSNR improve accordingly

• needs **alignment** first (brute force; → burst / HDR+, see multiple-exposure)

▸3.11.3 Denoising from a single image⧉

• no other measurements of the same pixel → use redundancy *inside* the image: neighbours are near-repeated measurements (same 1/√N win), but averaging smears edges

• everything below = a different answer to "which neighbours count as the same value?"

▸3.11.4 Spatial averaging and its limits⧉

• spatial averaging: Gaussian blur — and why it **smears edges** [denoising before/after figure]

• bias–variance trade-off: wider filter cuts more noise but blurs more

• **median** filter: ignores the odd wild neighbour; good against speckle / salt-and-pepper

• sidebar: optimal *linear* filter = **Wiener** (per-frequency keep ∝ signal/noise) — proof we must leave fixed linear filters

▸3.11.5 The bilateral filter: averaging by affinity⧉

• bilateral filter (edge-preserving) — see edge-preserving chapter — weight a neighbour by spatial *and* value closeness; per-pixel normalization, non-linear, not shift-invariant; color = distance in color space

• 💡 **Big lesson:** **edge-preserving = affinity** — use the **color / intensity difference** between two pixels as how much they "belong together"; the bilateral filter weights neighbours by that affinity.

• debug by pushing the range σ to extremes (wide → Gaussian, narrow → no smoothing); set width by the noise level

▸3.11.6 Self-similarity: non-local means and BM3D⧉

• **non-local means**: drop "nearby" — average pixels whose *surrounding patch* matches, anywhere in the image (a bilateral filter in patch space)

• **BM3D**: group similar patches, filter jointly in a transform domain — the long-time benchmark

• sidebar: wavelet / pyramid **coring** — zero the small (noisy) detail coefficients, keep the large (edge) ones — see pyramids

• sidebar: **learned denoisers** (U-Net) beat BM3D; shade into generative priors; **diffusion models** are trained denoisers — see ML

▸3.11.7 Denoise color more than brightness⧉

• eyes have coarser acuity for color than brightness → chrominance noise is uglier and safe to smooth hard

• denoise in **YUV**: gentle on luminance, aggressive (large radius) on chrominance — kills color blotches with no visible cost; ties to perception

▸3.11.8 Noise estimation⧉

• you need the noise level to denoise well: **calibrate** it (shoot a flat field at each ISO → a per-ISO noise model), or **estimate from the image** (the std of flat patches / a high-frequency residual)

▸3.11.9 The limits of denoising⧉

• there's a floor: **Levin et al.** bound how much *any* denoiser can recover — once noise swamps the local structure, detail is unrecoverable; stronger **priors** (and more photons) are the only way past it

• **side bar (ethics) — when denoising becomes beautification**: pushed too far, denoising (especially on **skin**) erases **texture** (pores, fine lines) → the **"plastic" / wax-figure** look; skin-smoothing, blemish-removal and **beautification** are the *same operation* dialled up (suppress high-freq variation in detected skin). Phones often apply it **by default**, non-consensually — silently imposing a flawless-skin standard, making the photo unfaithful to the person, and affecting **self-image** (esp. young people). Where does a denoiser stop cleaning an image and start editing a face? → ties to ethics ([[Human factors and the art of photography]])

▸3.12 Demosaicking⧉

`fig-demosaick-real-bayer` · the PS3 raw mosaic (NO-PARKING sign) demosaicked: ground truth · simulated RGGB Bayer mosaic · naive bilinear (false colour on the window grid) · green-based (clean), with a zoom (Demosaicking, BASIC) ✅

`fig-demosaick-before-after` · Bayer → RGB: the mosaic, naive bilinear demosaick (zipper / fringing), edge-aware (Demosaicking, BASIC) 🟨

▸3.12.1 Quad-Bayer / Nonacell: remosaic before demosaicking⧉

• **quad-Bayer and Nonacell sensors** (2×2 or 3×3 same-color groups — see Sensors) require a **CFA-conversion front-end** step before the normal demosaicking pipeline:

• **full-resolution mode**: a **remosaic** maps each 2×2 (or 3×3) group back to a single-sample-per-site standard Bayer mosaic (e.g. picks the top-left sub-pixel of each group); standard demosaicking then runs as normal

• **binning / low-light mode**: each group is **summed or averaged** into one pixel, producing a quarter- (or ninth-) resolution standard Bayer image; demosaicking follows on that smaller, less noisy mosaic

• the key point: **normal Bayer demosaicking algorithms receive a standard mosaic** either way — the remosaic / binning stage is a CFA-format adapter that sits ahead of it

• → cross-ref Sensors quad-Bayer bullet ([[02 Fundamentals#Sensors photosites CCD vs CMOS]])

▸3.12.2 Reminder: the Bayer mosaic⧉

• one color per photosite (CCD / CMOS); the **RGGB** pattern; twice as many greens (luminance acuity) [Bayer mosaic figure]

• RAW files: the mosaic **before** demosaicking (linear)

3.12.3 The task: full RGB at every pixel⧉

▸3.12.4 The naive approach: interpolate each channel on its own⧉

• treat the 3 channels as sparse images, interpolate each (= upsampling); missing green = average of 4 measured neighbours

• works, but artifacts: **zippering** and **color fringing** at edges [demosaick before/after figure]

• **debugging / hand-checkable inputs**: a **constant** image must come back unchanged (nothing to interpolate); a **rectangle** on a flat field makes the **zipper / fringing** appear right at its edges — the minimal inputs that expose demosaicking errors

▸3.12.5 Why naive interpolation zippers: averaging across an edge⧉

• a skipped pixel's 4 neighbours straddle the edge → averaging mixes the two sides → values oscillate (the zipper); root cause = **averaging across the edge**

▸3.12.6 Doing better: edge-directed interpolation⧉

• interpolate along edges, not across them: pick the axis whose two neighbours agree most (smaller |up−down| vs |left−right|)

• bets on a piecewise-smooth world; collapses the zipper

• 💡 **Big lesson:** **let the data pick the direction** — adapt the operation to local structure (precursor to bilateral / affinity)

▸3.12.7 The harder half: red and blue, and color fringing⧉

• R/B are sparser (¼ sampled); harder edge direction

• even perfect per-channel interpolation gives **color fringing**: the 3 channels are sampled at different locations, so their edges don't line up (orange one side, cyan the other) — a between-channel problem

▸3.12.8 Green-based demosaicking: interpolate the color difference⧉

• color channels are highly correlated: **color (R−G, B−G) varies slowly**, brightness has the sharp edges

• recipe: reconstruct green first (edge-directed), form R−G / B−G at measured pixels, interpolate those *naively* (safe — smooth), add green back

• 💡 **Big lesson:** **luminance carries detail, chrominance is smooth** — green ≈ luminance (interpolate carefully), differences ≈ chrominance (interpolate crudely)

• sidebar: difference vs ratio (constant-hue); divide-by-near-zero risk

• still not perfect: fine color texture → **maze / labyrinth** artifacts

• the **OLPF / anti-aliasing filter**: a birefringent layer that slightly blurs before sampling (low-pass before sampling, from the aliasing chapter) — trades sharpness for fewer color artifacts; weakened / dropped as demosaicking improved (computation displacing optics)

• ⚠️ **increasingly omitted** on modern cameras: **higher pixel counts** make aliasing / moiré less visible, and buyers want **maximum sharpness** — a deliberate **sharpness-vs-moiré trade** (drop the filter, accept occasional moiré)

▸3.12.10 Beyond hand-tuned: joint denoising and learned demosaicking⧉

• RAW is noisy; demosaicking and **denoising** are entangled → do them **jointly** (better than in sequence)

• learned joint denoise+demosaick networks (Gharbi et al. 2016; Adobe Enhance Details) — forward ref to ML

▸3.12.11 Cross-reference: other ways to sense color⧉

• temporal multiplexing (Prokudin-Gorskii, color wheel) — static scenes only

• spatial multiplexing (the mosaic itself, the retina's cone mosaic), 3-chip prism (3-CCD)

• depth multiplexing (film tripack, Foveon)

• ties back to the Color technology

▸3.12.12 Where this sits in the pipeline⧉

• one of the *first* ISP stages (after black-level / defective-pixel), on linear RAW, **before** white balance / color matrix / tone / sharpen / gamma — order matters

▸3.13 Auto-exposure and auto white balance⧉

`fig-coma` · coma: an off-axis parallel bundle where each annular aperture zone images to a different height; chief ray through the lens centre + zonal rays missing a common focus → the one-sided comet ("coma") spot with a head and a radial tail ✅

▸3.13.1 Auto-exposure: metering⧉

• **the AE problem**: pick aperture/shutter/ISO so the scene lands at the **middle-grey** (~18%) target; the catch is *which* pixels are representative

• **average / center-weighted** metering — average the frame (fooled by bright skies → silhouettes, snow → grey); center-weighting assumes a centered subject

• **matrix / evaluative / pattern** metering — split into **zones**, combine brightness + contrast + focus point (+ color + lens distance = Nikon's **3D Color Matrix**); historically a lookup against a database of reference scenes; the default today

• **spot** metering — meter one small zone; the photographer chooses the anchored tone (Adams's Zone System)

• **learned** metering — a small network (often + semantic segmentation: sky/face/backlight) predicts the photographer's exposure; phones choose exposure jointly with burst/HDR (deliberately under-expose, recover shadows by stacking → Multiple exposure), leaning on the affine+truncated noise model

• 💡 same arc as AWB below: one hand-coded statistic → multi-zone pattern model → learned estimator, all guessing intent from pixels

▸3.13.2 White balance and color constancy⧉

⬜ figure not yet created

same object under different illuminants (before/after WB) fig-wb-before-after

`fig-von-kries` · Von Kries per-channel gains 🟨

⬜ figure not yet created

illuminant metamerism (cloth matching) fig-illuminant-metamerism

`fig-isp-walk-real` · the ISP pipeline walked on ONE real photo (boatman gathering lotus, overcast lake, Sony DNG): raw (linear, no WB — flat & green) · white balance · tone+colour (gamma encode) · denoise · sharpen · finished JPEG, frame-coloured by linear-light vs encoded regime — real-image companion to `fig-isp-block-diagram` (Recap ISP, BASIC) ✅

• spectral perspective

• Von Kries

• **white balance is a per-channel multiply — but only correct in *linear light***: it scales each channel by a gain (Von Kries), and a scaling **commutes with a *linear* map but not a non-linear one**. Apply the same gains to **gamma- / tone-curve-encoded (non-linear)** values and you get the *wrong* answer — the channel gains then **shift hue and change apparent saturation**, not just neutralize the cast. So white balance belongs **early, in the linear pipeline**, alongside exposure (also a multiply-in-linear-light). [fig-wb-linear-vs-nonlinear — the same white-balance gains applied in linear light vs. on gamma-encoded values: the hue/saturation shift] (→ [[#Linear vs gamma vs. log encoding]])

• **white balance from color temperature and tint (the raw-converter slider model)** — the two sliders just pick a **target illuminant white point**, and the diagonal gains follow:

1. **Temperature** $T$ (Kelvin) selects a point on the **Planckian (blackbody) locus** — its chromaticity $(x_c(T), y_c(T))$ from a standard closed-form fit (Kim et al. 2002): warm/low-K → orange, cool/high-K → blue.

2. **Tint** nudges that point **perpendicular to the locus**, along the **green↔magenta** axis (a signed $\Delta uv$ in CIE-1960 $uv$) — correcting casts a pure blackbody can't, e.g. fluorescent green.

3. Convert the white point to tristimulus: $X = x/y,\; Y = 1,\; Z = (1-x-y)/y$, then to the working/sensor space $\mathbf{W} = M_{XYZ\to RGB}\,(X,1,Z)$.

4. Set the **Von Kries gains** $k_c = G_w / W_c$ (normalize to green) so a surface under that illuminant reads **neutral**.

So "$T=5500$ K, tint $0$" and "$(k_R,k_G,k_B)$" are **two faces of the same diagonal correction** — the sliders are a human-friendly parameterization of the per-channel gains. (M lives in [[#Reproducing color]]; *where* the multiply must happen is [[#Linear vs gamma vs. log encoding]].)

equations

Von Kries (diagonal): R′ = k_R·R, G′ = k_G·G, B′ = k_B·B

or a general 3×3. **From color temperature T (K) + tint to gains**: T → Planckian-locus chromaticity (x_c(T), y_c(T)) (closed-form fit, Kim et al. 2002)

**tint** = signed offset Δuv **⊥ to the locus** in CIE-1960 uv (green −, magenta +)

(x,y) → XYZ at Y=1 [X = x/y, Z = (1−x−y)/y]

illuminant working-RGB **W = M_{XYZ→RGB}·(X, 1, Z)**

gains **k_c = G_w / W_c** (Von Kries diagonal, normalized to green) so that illuminant maps to neutral

▸3.13.3 Automatic white balance⧉

⬜ figure not yet created

grey-world vs bright-pixel on an example fig-awb-greyworld

⬜ figure not yet created

failure on mixed illuminants fig-awb-mixed-fail

• old style: grey assumption, bright pixels

• 3D matrix regression style

• modern ML solutions

• **skin tones** — skin is a **memory colour** cameras are tuned around (face detection feeds the meter + AWB so the *person* lands right); but that tuning has a **history of bias** (the **Shirley card** calibrated film/AE/AWB to light skin, rendering darker skin poorly). Obligation: design & test AE/WB across the full range of skin tones → ethics in [[Ethics of computational photography]]

equations

grey-world: k_c = mean_grey / mean_c

bright-pixel estimate

▸3.13.4 The limits of white balance, and CRI⧉

⬜ figure not yet created

a mixed-illuminant scene fig-mixed-illuminant-scene

• multiple / mixed illuminants: one global per-channel gain can't fix a scene lit by two light colors (forward ref: multi-light WB, Hsu et al.)

• metamerism failure: WB can't undo a bad illuminant spectrum (narrow-band fluorescent / cheap LED)

• Color Rendering Index (CRI): how faithfully a source renders colors vs a reference illuminant

▸3.14 File formats and compression⧉

`fig-jpeg-pipeline` · the JPEG encoder — RGB → opponent colour → chroma subsample → 8×8 DCT → quantize → entropy code (File formats, BASIC) 🟨

`fig-dct-basis` · the 8×8 DCT-II cosine basis (DC top-left → high frequency bottom-right) 🟨

`fig-chroma-subsampling` · chroma subsampling grids 4:4:4 / 4:2:2 / 4:2:0 (chroma sampled coarser than luma) 🟨

`fig-jpeg-artifacts` · JPEG blocking & ringing at low quality (original vs q=8, zoomed crop) 🟨

`fig-jpeg-quality-levels` · JPEG quality sweep (q=85→40→20→10→5): the same photo's edge-crop at each quality with the whole-image file size beneath — degradation grows as size drops (File formats, BASIC) 🟩

▸3.14.1 The big picture: none, lossless, lossy⧉

uncompressed/raw → lossless (PNG, TIFF) → lossy (JPEG)

lossless = exploit statistical redundancy; lossy = also discard perceptually unimportant info (ties to perception chapter)

▸3.14.2 Data versus metadata: EXIF⧉

• stuff like name, image size

• capture settings (ISO, shutter, aperture, focal length), GPS

• orientation flag (the infamous "why is my image sideways" bug)

• ICC color profile (ties to color management)

• typical-fields **catalog → appendix [[Common EXIF fields]]**; the practical caveats (coverage incomplete & inconsistent; ISO / shutter approximate & not radiometric) are introduced in [[#Image representation]]

-Fun with EXIF: my kids growing up, photos from X years ago

▸3.14.3 PNG: the format we read from⧉

• we use PNG because it's easy to read

• mention alphas in passing to warn people (because they need png with no alpha)

▸3.14.4 JPEG: compression by perception⧉

Central idea : leverage human perception (cite early chapter) to adjust the number of bits to what matters. Remind people of CSF and different color sensitivity

• **framing**: JPEG is a perfect illustration that **perceptual criteria can be made quantitative and objective** — "what the eye won't miss" becomes a concrete **quantization table** and bit allocation rather than hand-waving; perception turned into numbers you can compute, tune (the quality knob), and measure

• the **quality knob** = a multiplier on the quantization table; show a **quality sweep** of the same photo (q ≈ 85 → 5) with the resulting file size under each — degradation (softening → blocking → color bleed) grows as the file shrinks [JPEG quality-levels figure]

#### Stage 1: opponent color

Opponent color space

#### Stage 2: chroma subsampling

chroma subsampling (4:2:0 / 4:2:2): coarser chroma than luma because color CSF is lower

#### Stage 3: the 8×8 DCT

DCT

#### Stage 4: quantization — the lossy step

quantization table

#### Stage 5: entropy coding

classic entropy coding, with smart block order

#### Limitations and artifacts

limitation: 8-bit only (motivates other formats); artifacts: blocking, ringing

▸3.14.5 RAW files: before the cooking⧉

usually before demosaicking

linear

proprietary vs DNG

• **"RAW" is not always purely raw**: many cameras pre-cook it to some degree — black-level subtraction, defective/hot-pixel correction, sometimes lens-shading/vignetting correction or even mild denoising; some "RAW" is lossy-compressed or non-linearly encoded. So RAW is a *spectrum* of how cooked it is, not literally the untouched sensor readout.

• tools/libraries to read RAW: **dcraw** (the classic), **LibRaw** (C++, dcraw's successor) and its Python wrapper **rawpy**; open-source developers **darktable** / **RawTherapee**; **Adobe DNG Converter / DNG SDK**; **exiftool** for metadata; commercial **Lightroom / Capture One**.

▸3.14.6 HDR formats: more than 8 bits⧉

quickly in passing

EXR (OpenEXR), Radiance .hdr, 16-bit TIFF — needed to store HDR output (see HDR chapter)

▸3.14.7 Modern formats⧉

HEIC/HEIF (Apple default now, 10-bit / HDR-capable), WebP, AVIF, JPEG XL

▸3.14.8 Neural / learned compression⧉

• **the idea**: replace the hand-designed transform + quantizer + entropy coder of JPEG with a **trained autoencoder** — an **analysis** network maps the image to a latent, the latent is quantized and entropy-coded under a **learned prior**, a **synthesis** network reconstructs. The whole pipeline is optimized **end-to-end** for a **rate–distortion** objective (bits vs. a distortion/perceptual loss), so the codec *learns* what to throw away.

• **why it matters**: at low bitrates learned codecs beat JPEG/even AVIF on perceptual metrics, and they can optimize for **perceptual / GAN** losses rather than MSE (sharper, more natural reconstructions at the same rate). They also generalize: **video**, **depth**, and **scientific** data, and the latents double as features for downstream tasks.

• **the catches**: heavy compute (a neural net to *decode*, not just encode), **non-determinism / version-coupling** across encoder–decoder, and no universal standard yet — though **JPEG AI** (ISO/IEC) is the first standardized learned image codec. Ties to [[Deep learning]] (autoencoders, entropy models) and to generative priors in [[Single-image computational photography]] (super-resolution, inpainting are the same rate–distortion-with-a-prior idea pushed to extreme compression).

▸3.14.9 Other formats in passing: TIFF and GIF⧉

• photoshop, layers

• mention alpha again in passing.

▸3.15 Recap ISP, non-destructive editing:⧉

`fig-isp-block-diagram` · the ISP pipeline: RAW → black level → demosaick → white balance → denoise → tone/colour → sharpen → gamma → JPEG (Recap ISP, BASIC) 🟨

▸3.15.1 A basic ISP⧉

• the **image signal processor** runs a fixed **pipeline** from RAW to JPEG; the **order matters** — mostly **linear light** until the end [ISP block-diagram figure]

• **3A** (auto-exposure, auto-white-balance, auto-focus — AF in Optics); **black-level** subtraction; **hot / defective-pixel** correction

• **demosaick** (→ ch) · **denoise** (→ ch) · optional **lens corrections** (vignetting, CA, distortion)

• **white balance** + **color matrix** (sensor → standard RGB) · **tone mapping / curves** (incl. local) · **sharpening**

• **gamma encoding** · **beautification curves** · **JPEG** compression (→ ch)

• advanced: **aberration correction**, computational stages (HDR+, night mode) — later parts

• the throughline (Big lesson): do the physics (WB, demosaick, denoise) in **linear**, **encode (gamma) near the end**

▸3.15.2 Recap 2: non-destructive editing⧉

• e.g. **Adobe Lightroom**: keep the **RAW** untouched and store edits as **parameters** (a recipe — exposure, contrast, curves, masks), recomputed on the fly

• benefits: **reversible, re-editable, resolution-independent**; needs a **fast (GPU) pipeline** to recompute interactively

• the parametric pipeline ≈ a programmable ISP exposed to the user

▸Part 4 COMPUTATIONAL TOOLS⧉

The chapters before this one built images up from physics, perception, and the basic processing pipeline. This part assembles the **general-purpose computational tools** the rest of the book reaches for again and again: casting a recovery task as a **linear inverse problem** and solving it by regression; replacing a hand-designed operator with one **learned from data**; and **generating** plausible images with modern generative models. They are gathered here, before the single-image applications, because nearly every later part — deblurring, super-resolution, compositing, HDR, video — is at bottom an application of one of these three.

▸4.1 Linear Inverse Problems and Regression⧉

`fig-correspondence-then-transport` · the L17 spine — one scene displaced (two views / two faces / two frames / one long frame), each resolved by estimating a coordinate map (homography, morph field, flow, track, motion vector, camera path) then transporting pixels by one shared inverse-warp engine; the finding is hard, the moving is plumbing 🟨

`fig-image-as-vector` · an image *is* a vector: a 5×5 pixel grid unrolled into a tall column vector; n = H·W (Linear algebra) 🟩

`fig-least-squares` · line fit minimizing squared vertical residuals (Optimization & regression) 🟩

`fig-gradient-descent` · descent path on a convex-bowl contour (−∇f steps); too-large step overshoots (Optimization) 🟩

`fig-pyramid-reconstruction` · RECONSTRUCTION on a real photo: collapse the Laplacian pyramid coarse→fine (residual → +L_k each octave → exact image), plus an all-black per-pixel error panel (max ~1e-16 = lossless) (Reconstruction, BASIC) ✅

equations

forward model $y = Ax$

least squares $\hat x = \arg\min_x \tfrac12\|Ax - y\|^2$

normal equations $A^\top A\,x = A^\top y$

gradient of the data term $\nabla f(x) = A^\top(Ax - y)$

gradient step $x_{t+1} = x_t - \eta\,A^\top(Ax_t - y)$

convolution form $A x = k * x$ and $A^\top y = \tilde k * y$ (flipped kernel $\tilde k$)

per-frequency inverse $\hat x(\omega) = \hat y(\omega)/\hat k(\omega)$ (when it diagonalizes)

▸4.1.1 Blur is linear, so deblurring is inversion⧉

• **the setup**: a sharp scene $x$, a **known** blur kernel $k$ (camera-shake path, defocus disk, lens PSF, motion streak), and the photo we actually got, $y = k * x$. We are **non-blind** here — the kernel is *given* (measured, or recovered earlier); recovering the kernel *too* is **blind** deblurring, deferred to [[Advanced inverse problems]].

• **the one idea**: blur is a **linear, shift-invariant** operation, so it is a **matrix** $A$ acting on the image: $y = Ax$. Deblurring is therefore *inversion* — undo a matrix multiply. Everything below is what "undo" actually means when $A$ is a million-by-million matrix you must never write down. [deblur-linear-inversion figure]

• **three views of the same operator** (the figure ties them together): (i) a **convolution** $y = k * x$ — the intuitive, local view; (ii) a **giant matrix** $A$ — huge ($N\times N$ for $N = W\cdot H$ pixels), but **sparse** (each output touches only the kernel's support) and **structured** (every row is the same kernel, shifted — a *Toeplitz / circulant* matrix); (iii) a **map of vectors** $x \mapsto Ax$ in the image vector space. Same object; we pick whichever view makes the next step easy.

• **why not literal inversion?** Two reasons, both fatal: (1) $A$ is astronomically large — for a 12-megapixel image $A$ is $12\text{M}\times 12\text{M}$; you can't store it, let alone invert it. (2) Even conceptually, $A^{-1}$ is **ill-conditioned** — blur kills high frequencies, so the inverse must *amplify* them, and with them any noise (the deblur preview from the last chapter: naive inverse → exploding noise). We sidestep both: never form $A$ (this chapter), and tame the conditioning with a **prior** ([[Advanced inverse problems]]). [fig-deblur-preview — reuse]

• 💡 **Big lesson:** **diagonalize when you can.** A general blur matrix is a tangled mess to invert, but **Fourier diagonalizes convolution** ([[Linearity, Fourier, Aliasing and deblurring]]): in the frequency basis $A$ becomes a *diagonal* operator — multiply each frequency by $\hat k(\omega)$ — so its inverse is just **per-frequency division** $\hat x(\omega) = \hat y(\omega)/\hat k(\omega)$. That one fact is both the cleanest deblur (the FFT solver, below) and the cleanest explanation of *why* deblurring is hard: where $\hat k(\omega) \approx 0$ (a frequency the blur erased), dividing blows up. *(First appearance: [[Linearity, Fourier, Aliasing and deblurring]]; recurs here — full callback box.)*

▸4.1.2 Images as vectors, and the notation overload⧉

• **refresher (images as vectors)**: stack every pixel (and channel) of an image into one long **column vector** $x \in \mathbb{R}^N$, $N = W\cdot H\cdot C$ — exactly how the image already lives in memory (C++ planar $idx = c\,WH + y\,W + x$; NumPy `(H, W, C)` flattened). This **serialization** is mostly *conceptual*: it lets us speak the language of linear algebra (and call its libraries / solvers) without ever materializing the vector or the matrix. [image-as-vector figure]

• **what the vector view buys us**: addition, scaling, the **dot product** $\langle a, b\rangle = \sum_i a_i b_i$ (= correlation / projection), norms $\|r\|^2 = \langle r, r\rangle$ — and the fact that **every linear image operation is now a matrix**: exposure (a diagonal $A$), convolution (a Toeplitz/circulant $A$), downsampling, gradient. One frame, many operations.

• ⚠️ **notation overload (call it out explicitly)**: in linear algebra the unknown image is universally written $x$ — clashing with $x$ = the horizontal pixel coordinate. We keep both (they're standard) and **disambiguate by context**: $x$-the-vector is the whole serialized image; $x$-the-coordinate indexes a pixel. The measurement is $y$ (also overloaded with the row coordinate). *Queue for [[Notations]]:* register $x$ = image-as-vector, $y$ = measurement vector, $A$ = forward (system) operator/matrix, $k$ / $\tilde k$ = blur kernel / flipped kernel, $r = Ax - y$ = residual, $L$ = regularizer (Laplacian) operator.

▸4.1.3 Regression: deblurring as least squares⧉

• **the reframe**: don't ask "solve $Ax = y$ exactly" (with noise there may be **no** exact solution, and the inverse is unstable). Ask instead: **which sharp image $x$, once blurred, best reproduces the photo?** That is **regression** — fit $x$ to the data by minimizing squared error: $\hat x = \arg\min_x \tfrac12\|Ax - y\|^2$. Deblur, denoise, demosaic, HDR-merge, white-balance all share this skeleton; only $A$ changes. [fig-least-squares — reuse: points / residuals / fit, here with $A$ = blur]

• **why squared error**: it makes the problem **linear** (the gradient is linear in $x$), it is the maximum-likelihood fit under Gaussian noise (ties to Refreshers → Probability), and it has a clean closed form. Its blindness to outliers and its need for a prior come later ([[Advanced inverse problems]]).

• **the normal equations**: setting the gradient to zero, $\nabla f(x) = A^\top(Ax - y) = 0$, gives $\boxed{A^\top A\,x = A^\top y}$ — the **normal equations**, the least-squares analogue of "$Ax = y$." $A^\top A$ is **symmetric positive (semi-)definite**, which is exactly the structure the iterative solvers below exploit.

• **what $A^\top$ *is*, concretely**: the **transpose of a convolution by $k$ is a convolution by the flipped kernel** $\tilde k(\mathbf{u}) = k(-\mathbf{u})$ — i.e. correlation. So $A x = k * x$ (blur) and $A^\top y = \tilde k * y$ (the "back-projected" blur). This is the crucial hook: **both $A$ and $A^\top$ are just convolutions**, so we can apply them without ever building a matrix.

• **deferred (scope boundary)**: noiseless least squares over-fits noise — the cure is a **regularizer / prior**, $\min_x \tfrac12\|Ax - y\|^2 + \lambda R(x)$, turning the normal equations into $(A^\top A + \lambda L)\,x = A^\top y$. We name it here and **defer the why/how to [[Advanced inverse problems]]** (Tikhonov, total-variation, Wiener) and the *learned* prior to [[Machine learning]]. The solver machinery below is unchanged — only the operator picks up a $+\lambda L$.

▸4.1.4 Matrices without forming matrices: gradient descent and conjugate gradient⧉

• **the central trick of the chapter**: the normal-equation matrix $A^\top A$ is $N\times N$ and we will **never build it**. Every iterative solver below needs **only one primitive**: given a vector $v$, return $A^\top A\,v$ — computed as **two convolutions**, $A^\top(A v) = \tilde k * (k * v)$. *Matrix-free.* `apply_blur(v) = conv(v, k)`, `apply_blur_T(v) = conv(v, flip(k))`; the solver calls these and nothing else.

• **gradient descent**: walk downhill on $f(x)=\tfrac12\|Ax-y\|^2$. The gradient is $\nabla f(x) = A^\top(Ax - y)$ — **one forward blur, one residual, one transposed blur** per step — then $x_{t+1} = x_t - \eta\,\nabla f(x_t)$. Correct and trivial to code, but **slow**: the least-squares bowl is **elongated** (badly conditioned, because blur squashes high frequencies), so plain gradient descent **zig-zags** down the valley. [fig-gradient-descent — reuse: steps down a quadratic bowl]

• *pseudocode:* a matrix-free gradient-descent loop (init $x{=}0$; iterate: residual $A x{-}y$, gradient $A^\top$ of it, step by $-\eta\nabla$) using only the two convolutions.

• **conjugate gradient (CG)** — the workhorse: because $A^\top A$ is **symmetric positive-definite**, CG solves the normal equations in far fewer iterations by choosing **conjugate** (non-interfering) search directions instead of greedily steepest ones — it never undoes earlier progress, so it cuts straight down the elongated valley. Like gradient descent it touches $A$ only through **apply-operator** calls (one $A$ + one $A^\top$ per iteration); unlike it, it converges in tens, not thousands, of steps. Cite Shewchuk's *"…Without the Agonizing Pain"* for the intuition. [deblur-cg-vs-gd figure — CG's conjugate path vs GD's zig-zag on the same bowl]

• **preconditioning**: CG is fast when the bowl is **round** (well-conditioned). A **preconditioner** $M \approx (A^\top A)^{-1}$ that is cheap to apply (e.g. a diagonal, or an FFT-based approximate inverse) **re-rounds** the bowl and slashes the iteration count — preconditioned CG is the practical default.

▸4.1.5 How we actually solve it — forward-ref to the solver toolkit⧉

• **the situation**: $A^\top A$ is **huge but sparse and structured**, so in practice we use **iterative / multiscale** solvers — gradient descent and conjugate gradient (above), plus **multigrid** (coarse-to-fine on a pyramid) and the **FFT solver** (when the operator diagonalizes in Fourier) — never a direct inverse. These are the shared workhorses of the whole part, so they get their **own chapter next** ([[#Efficient solvers]]): which solver wins when, and why each is matrix-free.

• **the payoff (why these two chapters are the foundation of the part)**: this **same solver machinery** — set up $y = Ax$, minimize $\|Ax - y\|^2$ (+ a prior later), solve matrix-free with CG / multigrid / FFT — recurs across **gradient-domain HDR, [[Poisson image editing|Poisson cloning]], colorization, matting**, and most optimization-based edits in this part. Learn it once, here and in [[#Efficient solvers]]; every later chapter is a new $A$ and a new prior on the same spine.

▸4.2 Efficient solvers⧉

The previous chapter cast recovery as the normal equations $A^\top A\,x = A^\top y$ and made the key point that we **never build $A$** — every solver needs only to *apply* the blur and its transpose. This chapter is the toolkit that actually does the solving: a handful of matrix-free, iterative / multiscale methods, and the rule for which one to reach for. The same machinery returns in [[Poisson image editing]], colorization, matting, and gradient-domain HDR — learn it once here.

equations

matrix-free operator pair $A x = k * x$, $A^\top y = \tilde k * y$

gradient step $x_{t+1}=x_t-\eta\,A^\top(Ax_t-y)$

preconditioned CG ($M\approx(A^\top A)^{-1}$)

FFT one-shot $\hat x(\omega)=\hat y(\omega)\,\overline{\hat k(\omega)}/\big(|\hat k(\omega)|^2+\lambda|\hat L(\omega)|^2\big)$

▸4.2.1 Matrix-free, iterative, multiscale — the situation⧉

• **the situation**: $A^\top A$ is **huge but sparse and structured**, so we use **iterative / multiscale** solvers, never a direct inverse. Everything below touches $A$ only through the matrix-free pair `apply_blur(v) = conv(v, k)` and `apply_blur_T(v) = conv(v, flip(k))`. Three (plus one) tools, ordered by when each wins.

▸4.2.2 Gradient descent and conjugate gradient (recap + preconditioning)⧉

• **gradient descent and CG** (developed in the previous chapter) are the starting point: GD walks downhill on $\tfrac12\|Ax-y\|^2$ but **zig-zags** on the elongated, ill-conditioned least-squares bowl; **conjugate gradient** chooses non-interfering directions and converges in tens, not thousands, of steps. Both use only apply-operator calls. [deblur-cg-vs-gd figure — CG's conjugate path vs GD's zig-zag]

• *pseudocode:* the operator-only CG loop (residual, conjugate search direction, exact step $\alpha$, conjugacy update $\beta$) — one $A^\top A$ matvec = two convolutions per iteration, no matrix formed.

• **conjugate gradient + preconditioner** — the general-purpose default; works for *any* $A$ you can apply (and apply-transpose), even when the boundaries or the operator aren't shift-invariant. A **preconditioner** $M \approx (A^\top A)^{-1}$ that is cheap to apply re-rounds the bowl and slashes the iteration count — preconditioned CG is the practical default.

▸4.2.3 Multigrid: coarse-to-fine on a pyramid⧉

• **multigrid (coarse-to-fine)**: low-frequency error is what iterative solvers kill *slowest*, and it is exactly what a **coarse grid** represents cheaply. Solve on a coarse image, **prolong** (upsample) the correction, refine on the fine image — a **V-cycle** down and up an image **pyramid** — and convergence becomes near-linear in pixel count. Directly reuses the Gaussian/Laplacian-pyramid machinery of [[Linear pyramids and wavelets]] (reduce / expand). [deblur-multigrid-vcycle figure — V-cycle down and up the pyramid]

▸4.2.4 The FFT solver: diagonalize when you can⧉

• **FFT solver**: when the domain is **rectangular with periodic (or Neumann) boundaries**, the blur (and a Laplacian regularizer) is **circulant → diagonal in Fourier** (L5). Then there is **no iteration at all**: transform, **divide per frequency** $\hat x(\omega) = \hat y(\omega)\,\overline{\hat k(\omega)} \,/\, \big(|\hat k(\omega)|^2 + \lambda\,|\hat L(\omega)|^2\big)$, inverse-transform. Two FFTs and a pointwise divide — the fastest solve when its boundary assumption holds; ties to [[Linearity, Fourier, Aliasing and deblurring]]. [deblur-fft-solver figure — deblur as per-frequency division]

▸4.2.5 Which solver wins — and why this is the foundation of the part⧉

• **picking one**: **CG + preconditioner** for any operator you can apply (the safe default); **multigrid** when low-frequency error dominates and you have a pyramid; **FFT** when the operator is shift-invariant with nice boundaries (the fastest, but most restrictive).

• **the payoff (why this chapter is the foundation of the part)**: this **same solver machinery** — set up $y = Ax$, minimize $\|Ax - y\|^2$ (+ a prior later), solve matrix-free with CG / multigrid / FFT — recurs across **gradient-domain HDR, [[Poisson image editing|Poisson cloning]], colorization, matting**, and most optimization-based edits in this part. Learn it once here; every later chapter is a new $A$ and a new prior on the same spine.

▸4.3 Machine learning⧉

`fig-learned-vs-handdesigned` · same inverse-problem skeleton, prior swapped: classical (data-fit + hand prior $\Phi$) vs learned ($f_\theta$ fit to data) (ML) ✅

`fig-synthetic-data-pipeline` · manufacture (degraded, clean) training pairs by simulating the camera/degradation (ML) ✅

reference the ML & deep-learning refresher ([[Refreshers#Machine learning and deep learning]]) up front; this chapter does **not** re-teach networks — it sets up *learning an operator from data* and the **data** that powers it. The concrete deep-network operators are the next chapter, [[Deep learning]].

equations

learned operator $\hat I = f_\theta(\text{measurement})$ trained by $\min_\theta \sum_i \ell\!\big(f_\theta(x_i),\,y_i\big)$

reuse the inverse-problem $\hat I=\arg\min_I \lVert AI-b\rVert^2+\lambda\,R(I)$ from above to contrast a hand-tuned $R$ vs a learned one

▸4.3.1 The framing: learned operators replace hand-designed ones⧉

• **the move.** Every task in this part so far had a **hand-designed operator** — an inverse filter, a smoothness prior, a dark-channel heuristic. Replace it with a function $f_\theta$ (a net) **fit to example data**: collect pairs (degraded input, desired output), minimize a loss, deploy the learned map. [fig-learned-vs-handdesigned]

• **same skeleton, data-driven prior.** Recall the inverse-problem template $\hat I=\arg\min_I \lVert AI-b\rVert^2+\lambda R(I)$ ([[Optimization and regression]]). A learned method keeps **data-fit + prior**; it just replaces the hand-tuned $R$ (or the whole solver) with something **learned from a dataset** — so the network *is* the prior.

• 💡 **Big lesson:** L8 — *a learned operator swaps a hand-designed prior for one learned from data.* Same data-fit + prior skeleton as the inverse-problem chapters; the prior is now fit to examples, not hand-tuned. Cost: data, compute, and **hallucinated** detail. (first appearance — full box here; the ML refresher carries the one-line callback) [[Big Lessons#L8]]

• **why now (and why it works).** Big datasets + GPUs + CNN/U-Net/transformer building blocks make fitting these maps practical; intuition over derivation (nets are the Refreshers' job). One-line nod to **the Bitter Lesson** (Sutton): general methods that scale with data/compute tend to win over hand-engineered ones — but flag the costs and failure modes rather than cheerlead.

• **scope.** This chapter sets the **framing** and the **data** that powers learned operators; the **operator zoo** itself — low-level pixel-to-pixel, mid/high-level predictors, generative translation, and learned metrics — is the next chapter, [[Deep learning]]. Architectures and training: the Refreshers throughout.

• refs: Sutton 2019 (The Bitter Lesson); Torralba/Isola/Freeman Part III; Szeliski ch. 5.

▸4.3.2 The data story: synthetic data, noise models, datasets⧉

The recurring lesson of this chapter: **the data is the method.** A learned operator is only as good as the pairs it trains on, and good pairs are often *manufactured*.

• **generating synthetic data.** Start from clean targets and **simulate the degradation** to make (input, target) pairs at scale — the cheap way to get supervised data when ground truth is otherwise unobtainable (you can't photograph the same scene "clean" and "noisy" perfectly registered). [fig-synthetic-data-pipeline]

• **realistic noise models (the crux).** A net trained on *additive Gaussian* noise fails on real photos — real sensor noise is **signal-dependent (Poisson shot + read)**, post-ISP, and demosaicked. Match the **real noise model** (cross-ref the shot-noise / Poisson discussion, Refreshers & Book 2) or the learned denoiser won't transfer. This is why Gharbi mines hard cases and Real-ESRGAN models a complex degradation chain.

• **reverse-engineering the camera pipeline (ISP).** To synthesize realistic raw/finished pairs you must **invert/replay the ISP** — black level, white balance, demosaic, tone curve, compression — so synthetic data looks like the camera's actual output. Ties to FlexISP / the camera-pipeline material.

• **standard datasets & benchmarks.** Name the workhorses and *what each measures*: ImageNet (pretraining/features), DIV2K (super-res), SIDD/DND (real-world denoising), MIT-Adobe FiveK (retouching), KITTI/NYU (depth), COCO (detection/segmentation). The benchmark *defines the task*; its biases become the model's biases. (Full index + links: [[Appendices#Datasets]].)

• **a note on self-supervision (mention, don't expand).** Two spots here sidestep the data problem rather than brute-forcing it: (1) **self-supervised denoising** (Noise2Noise / Noise2Void) trains *without clean targets at all* — a direct answer to "we can't photograph the clean version"; (2) **pretrained backbones** behind depth/detection/perceptual metrics often come from self-/weakly-supervised pretraining on unlabeled images. Both matter; neither needs its own section — deep treatment belongs to the ML appendix / advanced books.

• refs: Gharbi et al. 2016; Wang et al. 2021 (Real-ESRGAN degradation model); Heide et al. 2014 (FlexISP); Bychkovsky et al. 2011 (FiveK); Lehtinen et al. 2018 (Noise2Noise) — **to download**; queue dataset citations (DIV2K, SIDD, NYU-Depth) — **to download.**

▸4.4 Deep learning⧉

`fig-downsample-aliasing` · 3 panels: the full **high-res input** zone plate (cos r², rings finer toward the edge — what's being shrunk) → ÷4 naive decimation (drop samples) → moiré aliasing → ÷4 prefilter-then-decimate (Gaussian low-pass first) → clean ✅

`fig-colorization-classical-vs-learned` · Levin 2004 scribble-propagation vs Zhang 2016 fully-automatic colorization (ML) ✅

⬜ figure not yet created

fig-depth-anything (one photo → its monocular depth map) fig-depth-anything

`fig-gan-pix2pix` · paired image-to-image translation: edge map → photo (conditional GAN) (ML) ✅

`fig-metric-degradation-gallery` · one clean photo vs four degradations (1-px shift, low-q JPEG, noise, blur), each labelled with PSNR + SSIM — PSNR and SSIM mostly agree but disagree where L2 is blind (shift scores worst PSNR yet looks identical; blur keeps high PSNR yet softens texture) (Image metrics, BASIC) ✅

these are the **deep-network** realizations of the learned-operator framing ([[Machine learning]]); architectures and training are the Refreshers' job — here we survey *what gets learned*.

equations

minimal (survey chapter) — perceptual / feature loss $\ell_{\text{feat}}=\sum_l \lVert \phi_l(\hat I)-\phi_l(I)\rVert^2$ over deep features $\phi_l$ (LPIPS)

reuse the learned operator $\hat I = f_\theta(\text{measurement})$ from [[Machine learning]]

▸4.4.1 Low-level learned operators (pixel-to-pixel)⧉

The bread-and-butter: an image-to-image net (typically a **U-Net** / encoder–decoder) maps a degraded or partial image to a restored one. Each task below is the *classical* chapter's task, now learned. [fig-learned-task-zoo]

• **denoising.** Train on (noisy, clean) pairs → a learned denoiser that beats hand-tuned NLM/bilateral, *if* the training noise matches the real sensor noise (→ the data story, [[Machine learning]]). Forward-ref: the denoiser as a reusable prior ([[Super-resolution and image priors#Denoising as a universal prior â Plug-and-Play and RED|Denoising as a universal prior]] — PnP/RED) and diffusion as iterated denoising ([[Generative AI and diffusion]]).

• **demosaicking — and joint demosaick+denoise.** Gharbi et al.: a single net learns to go **mosaic → full RGB**, jointly with denoising, trained on hard cases mined from data — outperforming the hand-built interpolators of Book 2. The headline case that "learn the operator end-to-end" beats a decades-tuned pipeline stage. [fig-learned-task-zoo: mosaic→RGB cell]

• refs: Gharbi et al. 2016 (deep joint demosaicking & denoising).

• **super-resolution.** Pose low-res → high-res as a learned map. Lineage from **Freeman et al.** (example-based / patch-dictionary super-res — the pre-deep ancestor) to modern nets: **SwinIR** (transformer restoration backbone) and **Real-ESRGAN** (GAN-trained on a *realistic* synthetic degradation model for in-the-wild photos). Note: the deep treatment of SR + priors may live in [[Super-resolution and image priors]]; here it's one entry in the zoo with a forward-ref.

• refs: Freeman et al. 2002 (example-based SR); Liang et al. 2021 (SwinIR); Wang et al. 2021 (Real-ESRGAN).

• **colorization — classical vs learned, side by side.** Two paradigms: **Levin et al. 2004** propagates a few user **scribbles** by optimization (edge-preserving — "same color where intensity is similar", an *affinity* prior, cross-ref L4); **Zhang, Isola & Efros 2016** learns **fully automatic** colorization from a huge corpus, posed as classification over quantized colors to fight desaturation. Contrast: hand-designed propagation vs learned-from-data plausibility. [fig-colorization-classical-vs-learned]

• refs: Levin, Lischinski & Weiss 2004 (colorization by optimization); Zhang, Isola & Efros 2016 (colorful image colorization).

• **monocular depth.** Estimate depth from a *single* RGB image — ill-posed, so the net learns a strong **scene prior**. **MiDaS**: trained across many datasets via a **scale/shift-invariant** loss to mix incompatible depth sources; **Depth Anything**: scale further with massive (pseudo-labelled) data → robust zero-shot depth. Note it's *relative* depth, and forward-ref geometric/dual-pixel depth (Books on geometry / multi-image). [fig-depth-anything]

• refs: Ranftl et al. 2019 (MiDaS); Yang et al. 2024 (Depth Anything).

• **learned exposure / retouch (tone & color adjustment).** Learn a photographer's **global/local adjustment** from before/after pairs. **Bychkovsky et al. (MIT-Adobe FiveK)**: the dataset of expert retouches that made this learnable. **HDRnet** (Gharbi 2017): predict the *coefficients of a bilateral-grid affine transform* — fast, edge-aware, runs on a phone (cross-ref the learned bilateral grid in edge-preserving). **Exposure** (Yuanming Hu 2018): learn a sequence of *interpretable* retouching operations (curves, white balance) rather than raw pixels — keep as the "learn the *operations*, not the pixels" variant.

• refs: Bychkovsky et al. 2011 (FiveK); Gharbi et al. 2017 (HDRnet); Hu et al. 2018 (Exposure).

▸4.4.2 Mid- and high-level learned predictors⧉

Not pixel-to-pixel restoration but **prediction of semantic structure** from a single image — the inputs to retouching, autofocus, framing, and selection.

• **sky / face detection.** The classic "where is the sky / where are the faces" detectors that drive auto-enhance and metering; the pre-deep (Viola–Jones-era) lineage → modern detectors. Keep brief; they're enabling, not the focus.

• **face landmarks & pose.** Predict facial keypoints + head pose — feeds portrait retouching, relighting, AR, and the *forensics* angle (consistency checks against deepfakes — cross-ref ethics). (Resolves the "photo bio here?" stub: **cut** — photo-bio/EXIF belongs in file formats, not here.)

• **saliency / gaze prediction (what draws the eye).** ML supplies the *estimator* (pre-deep Itti–Koch → learned saliency trained on eye-tracking data), but the model and its uses (auto-crop/framing, retargeting, attention-aware compression, quality/aesthetics) now live with the other perceptual models in **[[Human factors#Computational models of perception]]** — moved there so the *human-factors* framing owns it. Mentioned here only as one more learned predictor.

• refs: queue — Viola & Jones 2004 (face detection); learned-saliency ref (SALICON / DeepGaze) — **missing from `refs/`, to download.** (Saliency refs proper: see [[Human factors#Computational models of perception]].)

▸4.4.3 Generative models: image-to-image translation⧉

When the target isn't a unique "correct" image but a **plausible** one, shift from regression to **generative** modelling.

• **why generative.** Many single-image tasks are **one-to-many** (colorization, super-res, inpainting) — averaging plausible outputs (L2) gives blur; a generative model samples a *sharp, plausible* answer instead. Motivates the next two.

• **GANs (adversarial training).** A generator vs a discriminator that learns the "real-image" prior → sharp, plausible detail (the engine behind Real-ESRGAN's realism). One-paragraph intuition; depth deferred to [[Generative AI and diffusion]].

• **conditional / paired & unpaired translation.** **Pix2Pix** (Isola et al.): paired image-to-image translation — edges→photo, label-map→scene, day→night — a generic "learn the map between two image domains" tool. **CycleGAN**: the **unpaired** version (cycle-consistency) when you lack aligned pairs (horses↔zebras, summer↔winter). [fig-gan-pix2pix]

• **the pre-deep ancestor.** Mention **Image Analogies** (Hertzmann et al. 2001): "A:A′ :: B:B′" learned a filter/style from a *single* example pair by patch matching — the same translation idea before deep nets (ties to texture synthesis / patch-match).

• **forward-ref.** Diffusion models now dominate generative image-to-image (text-to-image, editing) — full treatment in [[Generative AI and diffusion]]; here just the placement (GAN → diffusion lineage).

• refs: queue — Isola et al. 2017 (Pix2Pix), Zhu et al. 2017 (CycleGAN), Goodfellow et al. 2014 (GANs), Hertzmann et al. 2001 (Image Analogies) — **missing from `refs/`, to download.** Forward-ref: Rombach et al. 2022, Brooks et al. 2023 (in the diffusion chapter).

▸4.4.4 Learned perceptual metrics and losses⧉

How do we *measure* whether a restored/generated image is good — and how do we *train* for it? The classical metrics break exactly where learned methods shine, and the metric you optimize decides the sharpness.

• **back-ref: why L2 / PSNR / SSIM / VDP fall short.** Introduced in Book 2: per-pixel L2 / **PSNR**, structural **SSIM**, and perceptual **VDP** all compare *low-level* statistics. They are **blind to texture and high-level similarity** — a slightly shifted but perceptually-identical texture scores terribly; a blurry mean scores *well*. So they reward exactly the wrong thing for generative restoration. [fig-perceptual-metric]

• **LPIPS — deep feature distance.** Compare two images by the distance between their **deep network features** $\ell_{\text{feat}}=\sum_l\lVert\phi_l(\hat I)-\phi_l(I)\rVert^2$, calibrated to human similarity judgments. Tracks perceived similarity far better than L2; it's also usable as a **training loss** (cf. Gatys' feature/perceptual loss, neural style).

• **DreamSim — learned perceptual embedding.** A more recent metric tuned to *mid-level* / **holistic** similarity (layout, pose, semantic content), capturing judgments LPIPS misses. Keep as "the next step beyond LPIPS."

• **from metric to loss — perceptual losses buy sharpness.** Training on per-pixel **MSE/L1** regresses to the **blurry mean**; adding a **perceptual (feature/LPIPS) loss** restores texture and an **adversarial (GAN)** term sharpens onto the real-image manifold. Caveat: perception–distortion trade-off (**L10**) — extra sharpness is partly hallucinated, so keep a fidelity anchor. [fig-perceptual-loss-sharpness — same input restored under L2 vs L1+perceptual vs +adversarial; blurry → crisp]

• **a go-to loss mix** for restoration/super-res/deblur: weighted sum $\mathcal{L}=\lambda_1\lVert\hat I-I\rVert_1 + \lambda_p\,\ell_{\text{LPIPS}} + \lambda_a\,\mathcal{L}_{\text{adv}}$ — **L1** fidelity/colour anchor ($\lambda_1{=}1$), **LPIPS/VGG** perceptual for texture ($\lambda_p{\approx}0.1$–$1$), a **small adversarial** term for crisp highs ($\lambda_a{\approx}0.01$–$0.1$, added last). Track LPIPS not PSNR.

• one-line takeaway: **metrics are themselves learned now** — and which metric you optimize quietly defines what "good" means (and what the model hallucinates toward).

• refs: back-ref Book 2 (PSNR/SSIM/VDP); Gatys, Ecker & Bethge 2016 (feature/perceptual loss); Zhang et al. 2018 (LPIPS) — **to download**; Fu et al. 2023 (DreamSim) — **to download.**

▸4.5 Generative AI and diffusion⧉

`fig-genai-evaluate-vs-sample` · the generative leap (L11): a prior you can only *evaluate* ($\Phi$/denoiser) vs one you can *sample* ($x\sim p(x)$) (GenAI) ✅

`fig-diffusion-forward-reverse` · the two chains: forward $q$ adds Gaussian noise to a real photo; reverse $p_\theta$ denoises back (GenAI) ✅

`fig-diffusion-demo` · live interactive demo (web edition): type a prompt and watch the reverse process denoise from noise, step by step; static fallback is a noise→photo filmstrip (GenAI) ✅

`fig-denoiser-as-prior-spectrum` · one PnP/RED prior slot, ever-stronger denoisers: hand-built → classical → learned → diffusion score (GenAI / Super-res) ✅

`fig-latent-diffusion` · encode → diffuse in a compressed latent → decode; prompt conditions by cross-attention (Stable Diffusion) (GenAI) ✅

`fig-posterior-sampling` · generative priors for inverse problems: one degraded $y$ → many plausible samples $x\sim p(x\mid y)$ (GenAI) ✅

`fig-conditioning-controlnet` · one prompt + a spatial hint (edges / pose / depth) → a generated image obeying both (GenAI) ✅

This chapter is the **generative limit** of the two before it. *Machine learning* replaced a hand-designed prior with a **learned** one (L8); *Super-resolution* showed the prior is **not optional** and that a **denoiser is a universal prior** you can plug into any solver (L10, PnP/RED, and the punchline that the **score is a denoiser**). Here that same prior becomes something you can **sample from** — a generative model — and the canonical one, **diffusion**, is literally that denoiser run in a loop. We reference the ML / deep-learning refresher ([[Refreshers#Machine learning and deep learning]]) for U-Nets, transformers, and attention; this chapter does **not** re-teach networks — it surveys the *generative* idea and where it plugs back into the book's inverse-problem spine.

equations

forward (noising) process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, $\epsilon\sim\mathcal N(0,I)$

training loss (predict the noise) $\mathbb{E}_{x_0,\epsilon,t}\big\lVert \epsilon - \epsilon_\theta(x_t, t)\big\rVert^2$

Tweedie / score = denoiser $\hat x_0 = x_t + \sigma_t^2\,\nabla_{x_t}\log p_{\sigma_t}(x_t)$, i.e. the score and a Gaussian denoiser are the same object

reverse (sampling) step — denoise a little, add a little noise, repeat from $x_T\sim\mathcal N(0,I)$ down to $x_0$

classifier-free guidance $\tilde\epsilon_\theta = \epsilon_\theta(x_t,\varnothing) + w\,\big(\epsilon_\theta(x_t,c) - \epsilon_\theta(x_t,\varnothing)\big)$

posterior sampling for an inverse problem — sample $x\sim p(x\mid y)\propto p(y\mid x)\,p(x)$ by alternating the diffusion prior step with a data-fit step against the forward model $A$ (a PnP/RED solver with a generative prior)

▸4.5.1 The framing: generation is learning and sampling a prior $p(x)$⧉

• **the move.** Every prior so far was something you **evaluate**: a penalty $\Phi(x)$ you *add* to a data term ($\min_x \tfrac12\|Ax-y\|^2 + \lambda\,\Phi(x)$), or a denoiser $\mathcal D$ you *plug* into a solver (PnP/RED). A **generative model** learns the distribution of natural images $p(x)$ well enough that you can **draw fresh samples** $x\sim p(x)$ — not score an image, but *produce* one. [fig-genai-evaluate-vs-sample]

• **why sampling changes everything.** With a sampleable prior you get three capabilities the evaluable prior couldn't give: (1) **unconditional generation** — invent a plausible image from nothing; (2) **conditioning** — sample from $p(x\mid c)$ for a text prompt, a sketch, a class, another image (text→image, image→image); (3) **posterior sampling** — condition the prior on a *measurement* $y$ and sample $p(x\mid y)\propto p(y\mid x)\,p(x)$, which is exactly **super-resolution / deblur / inpaint** with the strongest prior we have. The inverse-problem template is unchanged; the prior just learned to *generate*.

• **one-to-many is the point.** Restoration tasks like colorization, super-res, and inpainting are **one-to-many** — many sharp images explain one degraded input. Regression (L2) averages them into a blurry mean (the very failure motivated in [[Deep learning]]); a generative prior **samples one sharp, plausible answer** instead. [fig-posterior-sampling]

• 💡 **Big lesson:** L8 — *a learned operator swaps a hand-designed prior for one learned from data.* The generative prior here is the most extreme case of that swap: the entire distribution of natural images, learned. (Recurrence; first appears in [[Machine learning]].) [[Big Lessons#L8]]

• 💡 **Big lesson:** L10 — *the prior is not optional.* When the measurement destroys information (SR, inpaint, deblur), only a prior selects an answer. A generative model is that prior at full strength — and conditioning it on the measurement is posterior sampling. (Recurrence; first appears in [[Super-resolution and image priors]].) [[Big Lessons#L10]]

• 💡 **Big lesson:** L11 — *a generative model is a sampleable prior over images.* The earlier chapters' prior was something you *evaluate* (a penalty $\Phi$) or *denoise with*; a generative model is a prior you can **draw images from**, $x\sim p(x)$. That leap — from scoring to sampling — is what makes generation, text conditioning, and posterior sampling for inverse problems possible. (First appearance; registered as [[Big Lessons#L11]].)

• **the pre-deep ancestor.** "Sample plausible pixels" predates deep nets: **Efros–Leung texture synthesis** (1999) grew an image pixel-by-pixel by copying from matching neighbourhoods — a non-parametric sample from the patch distribution. Same instinct (draw from $p(x)$), no learned model. Ties to the patch/texture material and to Image Analogies ([[Deep learning]]).

• refs: Efros & Leung 1999 (texture synthesis); Goodfellow et al. 2014; Ho et al. 2020; Torralba/Isola/Freeman Part IX.

▸4.5.2 Diffusion: generation as iterated denoising⧉

• **the two chains.** A **forward** process gradually corrupts an image into pure Gaussian noise: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ — at $t=0$ the clean image, at $t=T$ static. A **reverse** process learns to undo one step of that corruption at every noise level. **Sampling** runs the reverse chain from $x_T\sim\mathcal N(0,I)$ down to a clean $x_0$ — an image *emerging from noise*. [fig-diffusion-forward-reverse]

• *pseudocode:* the reverse sampling loop (start from $x_T\sim\mathcal N(0,I)$; for $t$ from $T$ down to $1$: predict the noise $\epsilon_\theta$, denoise a step, reinject a little fresh noise) — the "sample, don't average" walk.

• **what the network learns (denoise everywhere).** Train a single net $\epsilon_\theta(x_t,t)$ to **predict the noise** added at level $t$, by minimizing $\mathbb{E}_{x_0,\epsilon,t}\lVert \epsilon - \epsilon_\theta(x_t,t)\rVert^2$ over clean images, random noise, and random levels. Predicting the noise *is* denoising — the model is a **denoiser at every noise level**.

• **the hinge — the score is a denoiser (Tweedie).** For a Gaussian-noised $x_t$, the MMSE denoiser (posterior mean) is $\hat x_0 = x_t + \sigma_t^2\,\nabla_{x_t}\log p_{\sigma_t}(x_t)$: the **score** $\nabla\log p$ and a **Gaussian denoiser** are the *same object*, one rearranged into the other. So a diffusion model *is* a learned score field / family of denoisers. (First appearance of this identity: *Denoising as a universal prior*, [[Super-resolution and image priors]]; recurs here as the chapter's spine.) [fig-score-is-denoiser]

• **therefore diffusion = PnP/RED at the continuous limit.** The sampling loop — *denoise a little, add a little noise, step down the schedule, repeat* — is exactly the **Plug-and-Play / RED** outer loop ([[Super-resolution and image priors]]), with the prior-step denoiser being the learned score run over a noise-level schedule. Generation is iterated denoising; the book's denoiser-as-prior abstraction and its hottest generative model are one mechanism.

• 💡 **callback to L10/L11:** the not-optional prior (L10) is now a *sampleable* one (L11) — and the machinery is the universal denoiser of the previous chapter, looped.

• **latent diffusion (why it scaled).** Diffusing in **pixels** at high resolution is expensive. **Latent diffusion** (Rombach 2022, Stable Diffusion) first **encodes** the image into a compact latent with an autoencoder, runs the whole diffusion process **in that latent**, then **decodes** — far cheaper, and the step that made open high-res text→image practical. [fig-latent-diffusion]

• refs: Ho et al. 2020 (DDPM); Song et al. 2021 (score SDE / samplers); Rombach et al. 2022 (latent diffusion); back-ref Venkatakrishnan 2013 / Romano 2017 (PnP/RED — the loop diffusion generalizes).

▸4.5.3 Conditioning: text, images, and control⧉

• **text→image.** A frozen **text encoder** (e.g. CLIP / T5) turns the prompt into embeddings that condition the denoiser at every step via **cross-attention** — the model samples from $p(x\mid \text{text})$ instead of $p(x)$. This is the headline capability of Stable Diffusion / DALL·E / Imagen-style systems.

• **text vs images — why two different generative models** (resolving the outline's old *Text vs images* stub): text is **discrete tokens** with a natural left-to-right order and a finite vocabulary, a perfect fit for **autoregressive** language models ($p(x)=\prod_i p(x_i\mid x_{<i})$, exact next-token likelihood); images are **continuous**, million-dimensional, and have **no canonical ordering**, a perfect fit for **diffusion** (add/remove continuous Gaussian noise over *all* pixels in parallel, no ordering). It is not fashion but **discrete-sequential vs continuous-parallel** — and the two modalities meet through a shared **CLIP** embedding, where an LLM-encoded prompt steers the image diffusion model.

• **classifier-free guidance (the knob).** Train the model **with and without** the condition (dropping $c$ sometimes), then at sampling time **extrapolate** along the conditioned-vs-unconditioned difference: $\tilde\epsilon_\theta = \epsilon_\theta(x_t,\varnothing) + w\big(\epsilon_\theta(x_t,c)-\epsilon_\theta(x_t,\varnothing)\big)$. The guidance weight $w$ trades **prompt fidelity vs diversity** — the single most-used quality dial.

• **structural control (ControlNet).** Beyond text, condition on a **spatial hint** — edges, pose, depth, segmentation — so the output obeys a given structure *and* the prompt. **ControlNet** (Zhang 2023) adds a trainable branch that injects the hint without retraining the base model. [fig-conditioning-controlnet]

• **instruction-based editing.** **InstructPix2Pix** (Brooks 2023): condition on *(an input image, a text instruction)* — "make it winter," "turn the car red" — and the model edits in place. The image→image lineage runs from Pix2Pix/CycleGAN ([[Deep learning]]) to diffusion editing here.

• refs: Ho & Salimans 2022 (CFG); Zhang et al. 2023 (ControlNet); Brooks et al. 2023 (InstructPix2Pix); Rombach et al. 2022.

▸4.5.4 Posterior sampling: generative priors for inverse problems⧉

• **the payoff for the rest of the book.** Plug the generative prior back into the inverse-problem template. To solve $y=Ax+n$, **sample the posterior** $p(x\mid y)\propto p(y\mid x)\,p(x)$ by **alternating** a diffusion prior step (denoise toward the manifold of natural images) with a **data-fit** step against the forward model $A$ — a PnP/RED solver whose prior is the strongest learned denoiser we have. Result: state-of-the-art **super-resolution, deblurring, inpainting, colorization**. [fig-posterior-sampling]

• **reconstruction vs hallucination (the honest caveat).** Because the prior now *invents* plausible detail, posterior sampling sits squarely on the **hallucination** side of L10's split: the detail is plausible, not measured. Different samples give different valid answers — useful (uncertainty) and dangerous (invented "evidence"). Always label which is which.

• **where it gets applied.** This abstraction is what the later application parts reach for: **compositing / inpainting** (fill a removed object), **relighting**, **video** (temporally-conditioned generation), and generative **super-resolution**. Here we establish the mechanism; each part applies it.

• refs: back-ref Venkatakrishnan 2013 / Romano 2017 (PnP/RED); Song et al. 2021 (conditional sampling); [[Super-resolution and image priors]] (the prior throughline this closes).

▸4.5.5 Other generative families, in brief⧉

• **GANs.** A generator vs a discriminator that learns the "real-image" boundary → sharp samples in one forward pass (the engine behind Real-ESRGAN's realism, [[Deep learning]]). Fast to sample, historically hard to train (mode collapse); largely overtaken by diffusion for general text→image, still used where one-shot speed matters.

• **VAEs.** Encode→sample-latent→decode; a principled probabilistic model, but samples are typically **blurrier**. Live on inside **latent diffusion** as the autoencoder that defines the latent space.

• **autoregressive / transformer image models.** Generate tokens (pixels or codebook entries) one at a time (PixelRNN/CNN, image transformers, VQ-based models); the same next-token idea as language models, now reborn in multimodal systems.

• one-line placement: **diffusion currently dominates** open image generation; the others matter as alternatives, components, or speed/latency trade-offs.

• refs: Goodfellow et al. 2014 (GANs); Kingma & Welling 2014 (VAE); van den Oord et al. 2016 (PixelCNN — queue ref).

▸4.5.6 Caveats and ethics⧉

• **hallucination is structural, not a bug.** A sampleable prior *must* invent detail — that is what makes it work and what makes it untrustworthy as evidence. Generated/super-resolved detail is **plausible, not real**; never treat it as measurement (forensics, medical, legal). Ties to L10's reconstruction-vs-hallucination split and to [[Human factors and the art of photography]].

• **bias and representation.** The model can only sample what its **training distribution** contained; skews in the data become skews in the outputs (demographics, styles, defaults). The benchmark/dataset defines the behaviour, as in [[Machine learning]].

• **deepfakes, provenance, consent.** Realistic synthesis enables **deepfakes** and misinformation; mitigations — watermarking, provenance/C2PA, detection (the forensics angle, cross-ref face-landmark consistency checks) — are partial. Training on scraped images raises **consent, copyright, and attribution** questions still unsettled.

• **environmental and access costs.** Training and serving large diffusion models is compute- and energy-intensive; capability concentrates with whoever can pay — flag, don't cheerlead (echoing the Bitter-Lesson caveat in [[Machine learning]]).

• refs: queue — provenance/C2PA, a deepfake-detection survey, a dataset-consent reference — **missing from `refs/`, to download**; cross-ref [[Human factors and the art of photography]] (ethics, forensics, the art-vs-evidence line).

▸Part 5 EDGES MATTER⧉

The thread of this part is that **where the edges are** governs the edit. Edges carry most of an image's meaning, so three families of method are organised around them: **gradient-domain (Poisson) editing** reconstructs an image *from* its edges (seamless cloning, gradient-domain HDR, photomontage); **edge-preserving filtering** smooths everything *except* across edges (bilateral, guided, local Laplacian → tone mapping, detail, denoising); and **seam optimization** cuts *along* the least-noticeable edges (seam carving, graph-cut compositing). All three lean on the linear-systems and convolution machinery from earlier parts; they belong together because they share one idea — respect the edges.

▸5.1 Poisson image editing⧉

`fig-seamless-cloning` · Poisson seamless cloning (Pérez 2003), EDGES MATTER → Poisson editing: a disk of Jupiter cloned onto Earth — destination (target ring) · source patch · naive paste (visible ring) · Poisson paste (seam gone) ✅

`fig-gradient-domain-pipeline` · gradient-domain (Poisson) workflow: image → take gradients (∇f) → modify the field → integrate by solving ∇²f=div v → reconstructed (EDGES MATTER → Poisson editing) ✅

`fig-mixing-gradients` · mixing gradients — keep the per-pixel *stronger* gradient so destination texture shows through holey / transparent inserts (naive vs max-gradient paste) (Poisson editing) ✅

⬜ figure not yet created

`fig-chong-colorspace-results` — results from Chong, Gortler & Zickler 2008 [placeholder]

`fig-poisson-vs-laplacian-blend` · Poisson vs Laplacian-pyramid blending, contrasted on the DC: 1-D three cases (matched level — both agree; mismatched pedestal — Poisson re-lights, pyramid leaves a halo; constant — Poisson ramps/dissolves, pyramid keeps a feathered plateau) + 2-D (paste a constant square — Poisson dissolves it via the harmonic membrane, pyramid keeps a feathered version) ✅

`fig-fourier-lowpass-highpass` · low-pass vs high-pass a real photo by masking its spectrum and inverse-transforming: keep center → blurred, drop center → edges only (convolution = mult in frequency, by hand) (Reading an image's Fourier transform) ✅

equations

the **Poisson equation** $\nabla^2 f = \operatorname{div}\mathbf{v}$ (reconstruct the field $f$ whose gradient best matches guidance $\mathbf{v}$)

the **discrete Laplacian** system $4f_{p}-\sum_{q\in N(p)} f_{q}=\sum_{q\in N(p)}\!\big(\,v_{pq}\,\big)$ (one sparse equation per pixel $p$ over its 4-neighbours $N(p)$)

the **seamless-cloning Dirichlet boundary condition** $f|_{\partial\Omega}=f^{*}|_{\partial\Omega}$ (interior gradients from the source, boundary values pinned to the destination $f^{*}$)

▸5.1.1 The key idea: edit gradients, not pixels⧉

• the **key idea**: the eye cares about **gradients** (local contrast), not absolute values — so edit the **gradient field** $\mathbf{v}$ and reconstruct the image $f$ whose gradients best match it. A field rarely has an exact integral (it is generally not a true gradient — see the conservative-fields sidebar), so it is a **least-squares** reconstruction $\min_f \int \lVert \nabla f - \mathbf{v}\rVert^2$, whose normal equations are the **Poisson equation** $\nabla^2 f = \operatorname{div}\mathbf{v}$ — i.e. exactly the inverse-problem machinery of [[Linear Inverse Problems and Regression]], applied to gradients.

• 💡 **Big lesson:** *the eye cares about gradients (local contrast), not absolute values* — so edit the gradient field and reconstruct. A mismatched absolute level (a paste's lighting/color shift, an HDR scene's enormous range) is invisible once the *gradients* are right; this is the same reason lightness constancy works and the visual system encodes differences (→ register/first appearance: [[Big Lessons#L9]]).

▸5.1.2 Seamless cloning: the headline application⧉

• **seamless cloning** (the headline app, Pérez 2003): to paste a region $\Omega$ without a visible seam, copy the source's **gradients** $\mathbf{v}=\nabla g$, not its pixels, but pin the interior to the destination's **boundary values** $f|_{\partial\Omega}=f^{*}|_{\partial\Omega}$ (a **Dirichlet** condition); solving $\nabla^2 f=\operatorname{div}\mathbf{v}$ fills an interior that flows smoothly out of the destination at the seam, so lighting/color mismatches are absorbed and the paste blends invisibly. `fig-seamless-cloning`

▸5.1.3 Reconstruction = solving the Poisson equation⧉

• the discrete **Laplacian** (a convolution kernel — the 5-point stencil) turns $\nabla^2 f = \operatorname{div}\mathbf{v}$ into **one equation per pixel**, $4f_p-\sum_{q\in N(p)}f_q=\sum_{q\in N(p)} v_{pq}$ over the 4-neighbours $N(p)$ — a **large, sparse, structured** linear system. **Boundary conditions** close it: **Dirichlet** (fixed boundary $f|_{\partial\Omega}=f^{*}|_{\partial\Omega}$) for cloning; **Neumann** (zero normal derivative) when integrating a whole image with no enclosing destination.

• pseudocode block: the transparent **Jacobi / Gauss-Seidel** iterate-to-convergence solve — each pixel ← ¼·(sum of 4 neighbours + divergence rhs), swept until settled, boundary held fixed — as the slow baseline the CG / multigrid / FFT solvers accelerate.

• the **Fourier-domain view (one FFT divide)**: gradient and Laplacian are **convolutions** (linear shift-invariant), so on a rectangular domain with periodic/Neumann boundaries they are **diagonal in the Fourier basis** — every frequency decouples and the whole Poisson equation $\nabla^2 f=\operatorname{div}\mathbf{g}$ solves in **one per-frequency division**, $\hat f(\omega)=\widehat{\operatorname{div}\mathbf{g}}(\omega)/\widehat{\nabla^2}(\omega)$ (dividing by the Laplacian's spectrum $-(\omega_x^2+\omega_y^2)$; a **DCT** variant handles Neumann/free boundary conditions cleanly). This is the **same "diagonalize when you can" (L5) FFT solver** as in deblurring — see [[Efficient solvers]] / [[Computational tools]] and [[Linearity, Fourier, Aliasing and deblurring]]. Subtlety: the **DC / mean is undetermined** — the Laplacian's spectrum is **zero at $\omega=0$** (its zero eigenvalue), so gradients pin down everything **but the overall offset**, which must be fixed by a boundary value or a reference. [`fig-poisson-fourier`: the Poisson solve as a per-frequency division — divide by the Laplacian's spectrum $-(\omega_x^2+\omega_y^2)$]

#### The same solver as the regression chapter

• **same solver as the regression chapter**: don't form the matrix — use **conjugate gradient** (with a preconditioner), **multigrid** (coarse-to-fine — ties to [[Linear pyramids and wavelets]]), or an **FFT solver** on a rectangular periodic/Neumann domain (the Laplacian **diagonalizes in Fourier** — ties to [[Linearity, Fourier, Aliasing and deblurring]]). This is the *same* machinery as [[Linear Inverse Problems and Regression]], colorization, and matting.

#### Which space to integrate in: linear vs. gamma vs. log

• **which space to integrate in (linear vs gamma vs log)**: integrate in a space where equal *gradients* mean equal *perceived* contrast. Plain **linear** light over-weights highlights and lets bright regions dominate the seam; **log** (or a perceptual encoding) makes a ratio a fixed gradient, which matches the eye and keeps shadow detail — so for tone/exposure-mismatched clones and gradient-domain HDR we integrate in **log**, then map back. We state the chosen encoding explicitly per edit (the encoding-clarity rule).

▸5.1.4 Poisson vs. pyramid blending⧉

• the **membrane** lens on the two seam-hiders: **Poisson** = gradient-domain + pinned boundary → copies *gradients* and **discards the source DC** (re-set by the boundary; adding a constant changes nothing); **Laplacian-pyramid blend** ([[Linear pyramids and wavelets]]; Burt–Adelson spline) = cross-fade *values* band-by-band → **keeps the DC**, feathered by a coarse-wide mask

• discriminator — **paste a constant**: Poisson collapses to the bare **membrane** (Laplace, boundary = destination) and the constant *vanishes* (a line in 1-D, a soap film in 2-D); the pyramid's coarse band **copies the DC**, so a feathered constant *survives*. This is **L9** (gradients, not absolute values) seen from both sides

• upshot: Poisson **re-lights** the insert (seamless, but bleeds / erases flat regions); the pyramid **preserves** appearance and only dissolves the seam (robust, solver-free, but a true brightness mismatch lingers as a halo). Cloning → Poisson; panorama → pyramid

• [`fig-poisson-vs-laplacian-blend`: 1-D textured + constant source, and a 2-D constant square — Poisson vs pyramid]

▸5.1.5 Advanced techniques⧉

#### Mixing gradients

• **mixing gradients** (Pérez 2003): per pixel, keep the **stronger** of the source's and destination's gradient, $\mathbf{v}=\arg\max(\lVert\nabla g\rVert,\lVert\nabla f^{*}\rVert)$, before solving — lets you insert **holey / lacy objects**, **transparent overlays**, or **text onto texture** so the destination's texture shows through the gaps. `fig-mixing-gradients`

• **interactive demo** (`fig-poisson-blend`): a seamless-cloning playground — upload/pick source+target, paint Ω, drag/scale the source, solve per-channel by SOR; switch color space (**linear / gamma / Chong log**) and toggle **mixed gradients**, with live source/dest/∇/reconstructed-∇/residual panels + a scanline. Standalone in `interactive-sandbox/poisson-blend/` (iframe-embedded, static PNG fallback), like the [[photo101-simulators]]. Ties together seamless cloning, the linear-vs-log encoding choice, and mixing gradients.

#### Illumination invariance, pixel encoding and color space

• **preferred solution = Chong's color space (Chong, Gortler & Zickler 2008)**: remind the reader of the transform from **RGB** (or **XYZ**) into Chong's space; because it is a **log** space it is **illumination-invariant** — the multiplicative shading becomes an **additive offset** that gradient-domain editing simply ignores, so equal gradients mean equal perceived contrast regardless of lighting. `fig-chong-colorspace-results` (results from their paper)

• **healing brush = a covariant derivative (Georgiev 2004), folded in**: Photoshop's healing brush is gradient-domain cloning made *multiplicative* — Georgiev replaces the plain gradient with a **covariant derivative** (work on $\log f$, transport gradients as ratios) so a patch matches the destination's *illumination*, not its absolute brightness — the same log/illumination-invariance point baked into the operator. Note the historical order: **Photoshop's healing brush came first (2002); Pérez et al. (2003) were inspired by it** and formalized the gradient-domain view.

#### Conservative-gradient simplification: the membrane shortcut

• **conservative-gradient simplification (Sylvain Paris / Farbman et al. 2009; Bhat 2010)**: when the guidance field is (or is made) **conservative**, the expensive Poisson solve collapses to adding a smooth **membrane** — solve $\nabla^2 u = 0$ (a Laplace, not Poisson, problem) for the boundary-mismatch correction $u$ and add it to the naive paste. Far cheaper, and the route to **real-time** gradient-domain editing.

#### Other gradient-domain edits

• **section framing — "cover your tracks"**: Poisson is the great tool for **hiding the evidence of an edit** — make any *local* edit blend so it leaves **no artifact at its boundary** (the destination flows smoothly out of the seam). Whatever you did inside the region, the gradient-domain reconstruction makes the join invisible.

• **panorama / seam blending**: stitch overlapping images by Poisson-blending across the chosen seam so exposure/color mismatches vanish at the join → forward-ref to the panorama / stitching part (where the seam is *found*; here it is *hidden*).

• **other gradient-domain edits**: HDR **tone compression** (attenuate large gradients, then integrate — Fattal 2002; integrate in log); **photomontage** (stitch many photos by choosing gradients + seams — Agarwala 2004); shadow / reflection removal; flash–no-flash fusion. All change only the **guidance field** $\mathbf{v}$ and reuse the same solver.

▸5.2 Bilateral filtering⧉

▸5.2.1 Motivation — local tone-mapping haloes⧉

**Motivation — local tone mapping haloes.** `fig-tonemap-halo`

• the setup (craft callback): HDR **local tone mapping** compresses range by a **two-scale decomposition** — split into a large-scale **base** (the overall light) and a fine **detail** layer, compress the base, keep the detail, recombine (the *best-of-both* move, [[Style Reference]] rule 5)

• do it with a **plain Gaussian** base and it **bleeds across edges**: at a bright-sky / dark-mountain boundary the blur averages sky into the mountain, so after compressing the base a **bright halo** glows along the silhouette — the classic tone-mapping artefact

• *we want* a base layer that is smooth **within** a region but **stops at edges** — exactly an edge-preserving filter

• other motivations (state plainly, don't list-open): edits that must **not bleed past edges** — denoise without smudging detail; sharpen / abstract for stylization; build selection masks. Cross-link to [[#Compositing, segmentation and matting]] and stylization (NPR).

▸5.2.2 Bilateral filter⧉

• **start from Gaussian blur** (name-then-use): the output is a weighted average of neighbours, weight = a spatial Gaussian $g_s(\lVert p-q\rVert)$ — close pixels count more. Illustrate on a **1-D scanline** = one row of the image. `fig-bilateral-1d`

• **sidebar — debugging the bilateral** (recipe moved here from Developing, testing & debugging): crank the **range variance $\sigma_r$ very large** → the range term drops out and the bilateral must collapse to a **plain Gaussian** (compare against your already-tested Gaussian); shrink **$\sigma_r$ very small** → pixels only average with near-identical neighbours, so **edges stay untouched** (run it on the half-plane / rectangle). Two extreme settings, two predictable outcomes that bracket the whole filter. `fig-bilateral-limits`

• **the problem**: near an edge the Gaussian **gathers information from across the boundary** — it averages the dark side into the bright side, smearing the step (the same bleed that made the halo)

• **the fix — a range weight**: multiply in a *second* Gaussian on the **value difference**, $g_r(\lvert I_p-I_q\rvert)$, that **down-weights pixels too different in value**. A neighbour that is close *in space* but far *in intensity* (across the edge) gets almost zero weight.

• **the bilateral weight** (Tomasi & Manduchi 1998): $w(p,q)=g_s(\lVert p-q\rVert)\,g_r(\lvert I_p-I_q\rvert)$, and the output is the normalized average $I^{\mathrm{bf}}_p=\frac1{W_p}\sum_q w(p,q)\,I_q$ with $W_p=\sum_q w(p,q)$. Read it back in words ([[Style Reference]] rule 3): *"average my neighbours, but only those that are both nearby and similar to me."* Two widths: $\sigma_s$ in pixels (how big a region), $\sigma_r$ in intensity units (how different is "too different").

• pseudocode block (from the slides): the brute-force double loop — precompute the **spatial** weight $g_s$ once, **recompute the range weight $g_r$ per pair** (it depends on the value difference, so it can't be tabulated or reused from convolution code), accumulate $\sum w\,I_q$ and $\sum w$, divide per pixel; truncate only the spatial Gaussian; RGB 3-D distance for color.

• **1-D demo**: a noisy **step edge** → Gaussian smears the step; the bilateral **denoises each plateau but keeps the step crisp**, because the weight collapses to one side at the edge. Then the **2-D image**: texture smoothed, edges held. `fig-bilateral-1d`, `fig-bilateral-2d`

• **the affinity reading** (the conceptual payoff): the range Gaussian *is* a similarity / **affinity** between two pixels — "how much do they belong together" — computed from their value difference. Every later method in this chapter is a variation on how to choose that affinity.

💡 **Big lesson:** edge-preserving = affinity — use the **color / intensity difference** between two pixels as how much they *belong together* (an **affinity**), then average (or connect) pixels in proportion to it. The bilateral's range weight $g_r$ is the first instance; the *same* affinity drives the bilateral grid, the guided filter, NL-means, colorization, matting and segmentation. *(L4 · first full edge-preserving treatment of the principle; registered in BASIC → Denoising, fully developed here.)*

#### Robustness for values with few neighbours

• **the failure**: a pixel whose value is **rare** (few neighbours within $\sigma_r$) has a tiny, noisy normalizer $W_p$ — the standard normalized bilateral becomes unstable / over-trusts the few similar pixels, and the Durand–Dorsey normalization is, strictly, **wrong** there

• **the fix — unnormalized / delta bilateral** (Aubry et al. 2014): express the output as a **delta from the input** rather than a fresh normalized average — $I^{\mathrm{bf}}_p=I_p+\sum_q w(p,q)\,(I_q-I_p)$ — and **drop the $1/W_p$ division**

• **why it works (intuition)**: each neighbour contributes a *correction* $w\,(I_q-I_p)$ pulling toward the input value; when few neighbours match, you make a **small correction** instead of dividing by a near-zero $W_p$ and amplifying noise — so a lonely value stays close to itself rather than blowing up

• ties to **local Laplacian filters** (Paris / Aubry): the unnormalized form is the bilateral cousin of the multiscale local Laplacian — see [[#Local Laplacian filters]] (and [[Linear pyramids and wavelets]])

#### Debugging — hand-checkable limits

• **large range variance** ($\sigma_r\to\infty$): $g_r\to1$, the range weight does nothing → the bilateral **collapses to a plain Gaussian** blur

• **tiny range variance** ($\sigma_r\to0$): only the pixel itself survives → output ≈ **identity**, edges (and everything) untouched

• these two limits are an **easy correctness test** for an implementation; a middle $\sigma_r$ interpolates between them

• also useful for tuning: pick $\sigma_s$ from the spatial scale of structure to remove, $\sigma_r$ from the contrast of edges to preserve (state the encoding first — see meta)

▸5.2.3 Cross / joint bilateral⧉

• **the twist**: read the **range weight from a *different* (guide) image** than the one being averaged — $g_r(\lvert J_p-J_q\rvert)$ with guide $J$, while still averaging the target $I$. The affinity comes from a *cleaner* signal.

• **why**: when the target is too noisy / dark for its own edges to be trusted, borrow edges from a **better-exposed guide**

• **flash / no-flash** (the headline app): smooth the noisy **no-flash** ambient image using **flash-image** edges as the guide — keeps the mood, kills the noise. Forward-ref → [[#Illumination related effects in a single image]] / flash–no-flash (Computational illumination). `fig-crossbilateral`

• a special case of the general **joint / guided** idea (→ guided filter, below)

▸5.2.4 Joint / cross bilateral upsampling⧉

• **the idea (Kopf et al. 2007)**: compute an **expensive** operation at **low resolution** — tone mapping, depth, stereo, colorization, a graph-cut label map — then **upsample it to full resolution guided by the full-res image**

• **how**: it's a cross/joint bilateral where the **range weight is read from the high-res guide** (the original image), so the cheap low-res solution snaps to the guide's edges and **stays sharp** — no blurry haloes across boundaries

• **why it matters**: a heavy solver runs on a tiny image, edges come (for free) from the big one — best-of-both, and the conceptual ancestor of HDRnet's learned edge-aware upsampling (below)

▸5.2.5 Bilateral grid⧉

• **the idea** (Jiawen Chen's SIGGRAPH paper): turn the **nonlinear 2-D** bilateral into a **linear 3-D** filter by adding the **range / value as a third axis** — lift $(x,y)\to(x,y,I)$ into a coarse 3-D **grid**. `fig-bilateral-grid`

• **splat → blur → slice**: *splat* pixels into grid cells (accumulate value + weight); *blur* with a **plain separable 3-D Gaussian** (now an ordinary convolution — edges are automatic because dissimilar values live in different grid layers); *slice* back out by interpolating at each pixel's $(x,y,I)$

• pseudocode block (from the slides): **splat / blur / slice** as code — accumulate value and weight into cell $(x/\sigma_s,\,y/\sigma_s,\,I/\sigma_r)$; separable 3-D Gaussian on **both** grid and weight; slice by trilinear lookup at each pixel's $(x,y,I)$ and divide value by weight (that division *is* the bilateral normalizer $1/W_p$).

• **why it's fast**: the grid is **coarse** (downsampled in space and range), so it runs real-time on a GPU; ties to [[#Differentiable image pipelines and algorithm optimization (Halide)]] (Halide schedules exactly this pipeline)

• connects to the affinity lesson: the third axis *is* the affinity made geometric — pixels far apart in value are far apart in the grid

• **what the grid unlocks** (uses): (1) **cross / joint bilateral filtering** — splat the target, read the range axis from a guide; (2) **local histogram equalization** — each grid cell already stores a **local histogram** (the per-value counts splatted into it), so you can equalize / manipulate tone *locally* straight off the grid.

▸5.2.6 Bilateral-grid learning (HDRnet)⧉

• **learn the grid**: a CNN predicts, at **low resolution**, the *coefficients* of a per-cell affine transform stored in a **bilateral grid**; slice + apply at **full resolution** → real-time learned enhancement on a phone

• the bilateral grid becomes a **learned, edge-aware upsampling** structure — best-of-both: a heavy network on a tiny image, edge-aware detail on the big one

• ties to L8 (a learned operator swaps a hand-designed prior for a learned one) — a one-line callback here, not a full box; xref ML / [[Machine learning]]

▸5.2.7 Edge-respecting selection brush⧉

• **same affinity, made interactive**: the affinity that stops the bilateral from averaging across edges also powers an **edge-respecting selection / enhancement brush** — a manual local-edit brush whose painted effect **does not bleed across edges** (it follows the same value/color affinity)

• so it is the **interactive cousin of edge-preserving filtering**: where the bilateral decides automatically "these pixels belong together," the brush lets a user *paint* an edit that respects exactly the same boundaries — dodge/burn, local tone or color tweaks that stop at the object's edge.

▸5.3 Locally adaptive regression kernel (LARK)⧉

• **the view**: treat denoising / upsampling as **local regression** — fit a smooth function to nearby pixels, weighted by a kernel

• **steering / adaptive kernels**: shape the kernel to the **local image structure** — estimate the local gradient, then **stretch the kernel *along* edges and squeeze it *across* them** (an anisotropic, data-dependent weight) so it averages along the edge, not across it

• so LARK is the **regression flavour** of the same affinity idea: the kernel geometry is *derived from* local structure, generalizing the bilateral's scalar range weight to an oriented one

• also a **descriptor**: LARK signatures are used for detection / matching (Seo & Milanfar) — note in passing, not developed

▸5.4 NL-means (non-local means)⧉

• **the leap — compare patches, not pixels**: the bilateral's affinity uses a single pixel's value; NL-means measures similarity between **small patches** around two pixels, so it matches **texture / structure**, not just brightness. `fig-nlmeans-patches`

• **non-local**: the search is **not limited to a spatial neighbourhood** — any patch anywhere in the image (a repeated texture, another brick, another eye) can vote; the spatial Gaussian is dropped or relaxed

• **patch distance & weight**: $d^2(p,q)=\sum_k g_a(k)\,\lVert I(p+k)-I(q+k)\rVert^2$ over patch offsets $k$ (optionally Gaussian-weighted $g_a$), weight $\propto e^{-d^2/2\sigma^2}$; average the **centre** pixels of the matching patches

• **why it denoises well**: a clean signal repeats, noise does not — so averaging many similar patches cancels noise while preserving structure

• the affinity reading again (now a patch-space affinity); the direct precursor to **BM3D** and to the self-similarity priors used in learned denoising (forward-ref ML)

• **BM3D** (Dabov et al. 2007) — the high-water mark of the patch / self-similarity idea, and for years the classical denoising **gold standard**. Where NL-means *averages* similar patches, BM3D goes further in two stages: (1) **block-matching** groups the most-similar patches into a **3-D stack**; (2) **collaborative filtering** applies a 3-D transform to that stack and **jointly shrinks** the coefficients (hard-threshold, then a Wiener pass using a first-pass estimate), exploiting the fact that aligned similar patches are *sparse together* — far more so than any one patch alone. Aggregate the filtered patches back (weighted by how many survived) and the result beats plain NL-means. Same lesson as the whole chapter — **find what belongs together (here, similar patches) and process them jointly** — pushed from a weighted average to a sparse transform-domain shrinkage; the bridge from bilateral/NL-means affinity to sparsity/dictionary and learned self-similarity denoisers (forward-ref [[Deep learning]]).

▸5.5 Local Laplacian filters⧉

`fig-local-laplacian` · local Laplacian filters on a real HDR sunset (Paris–Hasinoff–Kautz 2011, implemented from scratch): naive global (blown) vs Gaussian base (halo) vs local Laplacian (halo-free) tone mapping, plus the compressive remapping curve $r_g(\cdot)$ (detail centre + identity tails) and a zoom comparison proving the silhouette stays halo-free (Local Laplacian filters) ✅

⬜ figure not yet created

before/after detail-boost with **no halo** fig-wb-before-after

• **the odd one out — not an affinity method**: every other technique here weights *pairs* of pixels by similarity; local Laplacian instead works **band-by-band in a Laplacian pyramid** with a **pointwise remapping** — no range weight, no base/detail split (so none of the halos / gradient reversals those can produce). The multiscale cousin of the bilateral, by a different mechanism. Cross-ref [[Linear pyramids and wavelets]] (the pyramid it's built on) and [[Bilateral filtering]].

• **the mechanism**: to compute the output's Laplacian coefficient at a pixel and level, take the pixel's value $g$, apply a **remapping function** $r_g(\cdot)$ to the *whole* image (a smooth curve, centred on $g$, that decides what is "detail" vs "edge"), build a Laplacian pyramid of that remapped image, and read off **just this level's coefficient at this pixel**; collapse the resulting pyramid for the result.

• **the knob is the remapping $r$**: near $g$ (small amplitude = **detail**) it **compresses or amplifies** — controlling detail / local contrast; far from $g$ (large amplitude = **edge**) it stays the **identity** — so strong edges pass through untouched. One curve → detail enhancement, tone mapping, or inverse tone mapping, all **artifact-free**.

• **why it matters**: state-of-the-art **edge-aware detail manipulation and HDR tone mapping** with no halos and no gradient reversal — the failure modes that motivated the bilateral in the first place.

• **cost & the fast version**: the naive form rebuilds a pyramid per pixel-and-level (expensive); **Aubry et al. 2014** gives a fast approximation and shows the **unnormalized bilateral** is its single-scale relative (ties back to [[Bilateral filtering#Bilateral filter|the unnormalized bilateral]]).

▸5.6 Edge-preserving optimization — colorization⧉

• **the shift — filtering → optimization**: instead of one weighted-average *pass*, solve a **global least-squares** problem whose smoothness term uses the affinity as a weight. The same affinity, now inside an energy.

• **colorization** (the demo): the user scribbles a few colors; **propagate** them so that **neighbouring pixels of similar luminance get similar color** — minimize $\sum_p (U_p-\sum_{q\in N(p)} a_{pq} U_q)^2$ over the chrominance $U$, with affinity $a_{pq}$ **large where $I_p\approx I_q$** (small luminance difference) — exactly the bilateral's range idea inside a quadratic energy

• **solve it** with the linear-systems machinery of the regression chapter — sparse, structured, solved by CG / multigrid (cross-ref [[Single-image computational photography#Blind deblurring]], [[Linear Inverse Problems and Regression]])

• the **affinity matrix** here is the same object as the **matting Laplacian** and the graph-cut edge weights → forward-ref [[#Compositing, segmentation and matting]] (matting / segmentation = affinity again)

• also placeable under [[Single-image computational photography#Blind deblurring]] (the "Colorization as optimization?" stub) — keep the *technique* here, cross-ref from there

▸5.7 Guided image filtering⧉

• **the idea**: like the joint bilateral (filter $I$, take structure from a **guide** $J$) but the output is a **local linear transform of the guide**, $I^{\mathrm{out}}=a_k J + b_k$ fit per window by **least squares** — so it preserves the guide's edges by construction

• **why prefer it**: **$O(N)$**, **no range-quantization artefacts** (the gradient reversal the bilateral can show), and **differentiable** → drops into learning pipelines

• **uses**: edge-preserving smoothing; detail enhancement; **flash / no-flash**; **matting / feathering**; **dehazing** (guide = the hazy image — xref He, Sun & Tang 2009 dark-channel, [[#Dehazing as a prior-driven inverse problem]]); depth upsampling

• the affinity is now **implicit** in the local linear fit — same family, different mechanism; a fitting end-point for the chapter's through-line (bilateral → grid → learned → regression → optimization → guided, all *affinity*)

▸5.8 Seam optimization⧉

• **the third leg of EDGES MATTER** (the part's framing payoff): Poisson **reconstructs from edges**, bilateral **smooths except across edges**, seam optimization **cuts along edges**. State it as the dual of edge preservation: edge-preserving filtering works *hard to avoid* edges; seam methods *go looking for them* — a transition is invisible exactly where the image is already changing, so we hide the cut **in the edges / busy texture** and keep it **out of smooth regions** where a seam would scream. (Callback to the part intro's "respect the edges.")

• the **unifying machinery**: all three sub-families are **discrete optimization on a graph or grid** — a 1-D least-cost path (**DP**), a 2-D globally-optimal boundary (**graph cut / min-cut**), or a spectral partition (**normalized cuts**). What changes between methods is **the energy / affinity**, not the optimizer.

💡 **Big lesson:** *many image edits are **discrete optimization on a graph** — dynamic programming, min-cut/max-flow, or spectral relaxation — and the **energy (affinity) you choose is the whole game**.* The optimizer is off-the-shelf; **what** you penalize (gradient magnitude, label disagreement across edges, region/boundary cost) is the design. This is the **L4 affinity** principle (Bilateral) turned into a *cut*: there the affinity said "average these together"; here it says "**don't cut here**." *(Register as new lesson **L12 · Image edits as discrete optimization on a graph (the energy is the design)**, first appearance here; "also relevant": Poisson/photomontage, compositing/segmentation/matting, texture synthesis, stereo/MRF labeling. Full box at this section's head; one-line callback to L4.)*

▸5.8.1 Find the non-edges — least-noticeable cuts⧉

• the one idea: a cut is **least noticeable where the image is already changing** (a strong edge or busy texture) and **most noticeable in smooth regions** — so *seek* edges to hide the seam, the mirror image of edge-preserving smoothing.

• the **panorama motivation** (craft callback): naive reprojection shows a hard seam, smooth/feather blending **ghosts** moving objects; the fix is to **route the seam around disagreement** — build a difference map of "bad things happen here," then find the boundary that crosses the fewest changing pixels. (Pays off the panorama-blending forward-ref to graph cut.)

• **two regimes**: **interactive** path-finding where a user guides the cut (scissors, snakes), vs **automatic** global optimization (seam carving, graph cut, normalized cuts). Present the automatic ones as the main line; scissors/snakes as the interactive ancestors.

▸5.8.2 Intelligent scissors / live-wire — the cut as a least-cost path⧉

• **the idea** (Mortensen & Barrett 1995, *name-then-use*): an interactive selection tool — the user clicks a seed on an object boundary; as the cursor moves, a **"live wire" snaps to the strongest nearby edge** and traces the boundary in real time. Click to anchor, continue.

• **mechanism = shortest path on a cost graph**: pixels are nodes; edge cost is **low where image gradient is high** (cost $\propto$ inverse edge magnitude, + gradient-direction terms). Dijkstra from the seed gives the **least-cost path** to the cursor — the DP/shortest-path idea made interactive. `fig-snake-livewire`

• the payoff: same machinery as seam carving (least-cost path), different goal — **select/cut along** an edge rather than carve through a non-edge. The cost function is the design.

▸5.8.3 Snakes (active contours) — the continuous cousin⧉

• **the idea** (Kass, Witkin & Terzopoulos 1988): a deformable **curve** that **relaxes onto an object boundary** by minimizing an energy — the variational, continuous counterpart to live-wire's discrete path.

• **the energy** (read it back): $E=\int E_\text{int}+E_\text{image}+E_\text{con}$ — **internal** keeps the curve smooth (elastic tension $\alpha$, bending stiffness $\beta$); **image** $E_\text{image}=-\lvert\nabla I\rvert^2$ pulls it toward **strong edges**; **constraint** lets the user nudge it. Minimized by gradient descent / a sparse banded solve — same linear-algebra machinery as the regression chapter.

• honest tradeoff: snakes are **local** (need a good initialization, can get stuck) — which motivates the **global** optimizers next; note balloon/GVF/level-set variants only in passing.

▸5.8.4 Graph cut — globally optimal boundaries by min-cut/max-flow⧉

• **the leap from local to global** (Boykov & Jolly 2001): pose the cut as a **binary labeling** — every pixel is **object** or **background** — and minimize a **graph energy** whose minimum is found **exactly** (binary case) by **min-cut = max-flow**. No local minima, unlike snakes.

• **the energy** $E(L)=\sum_p D_p(L_p)+\sum_{(p,q)}V_{pq}(L_p,L_q)$ (read back): a **data term** $D_p$ — how well pixel $p$ fits each label (from user scribbles / color models) — plus a **smoothness term** $V_{pq}$ — a penalty for neighbours getting *different* labels, made **small across true edges** so the boundary **prefers to cut along edges**. (This $V_{pq}$ **is** the L4 affinity / the bilateral range weight — callback.) `fig-graphcut-segmentation`

• **the graph construction**: pixels = nodes; two terminals **source (object) / sink (background)**; **t-links** carry $D_p$, **n-links** carry $V_{pq}$; the **min cut** severs the cheapest set of edges separating source from sink → the segmentation. Solve by **max-flow** (Boykov–Kolmogorov 2004).

• **multi-label, in passing**: more than two labels (photomontage, stereo, denoising) is NP-hard exactly — solved approximately by **$\alpha$-expansion** (Boykov, Veksler & Zabih 2001), iterated binary graph cuts. Note, don't develop.

• **GrabCut** (Rother, Kolmogorov & Blake 2004): graph-cut segmentation made **easy to drive** — the user drags **one rectangle**, iterated graph cuts + updated **GMM color models** extract the object. (Cross-ref the segmentation/matting section, which owns the *application*.)

▸5.8.5 Normalized cuts — the spectral relaxation⧉

• **the problem with plain min-cut**: minimizing the raw cut weight is **biased toward tiny cuts** — it happily lops off a single isolated pixel. We want **balanced** partitions.

• **the fix** (Shi & Malik 2000): normalize the cut by each side's total **association** — $\mathrm{Ncut}(A,B)=\frac{\mathrm{cut}}{\mathrm{assoc}(A,V)}+\frac{\mathrm{cut}}{\mathrm{assoc}(B,V)}$ — penalizing unbalanced splits. The affinity $w_{uv}$ is again the **L4 affinity** (color / proximity similarity).

• **how it's solved (intuition only — keep eigen-machinery light, V13)**: exact Ncut is NP-hard, but it **relaxes to a generalized eigenproblem** $(D-W)x=\lambda D x$; the **second-smallest eigenvector** gives a soft partition you threshold — a **spectral** (vs combinatorial) route to the cut. Defer eigen-fluency to the appendix.

• the contrast worth stating: graph-cut = **combinatorial, needs seeds, exact binary**; normalized cut = **spectral, unsupervised, approximate** — two answers to "where's the best boundary," same affinity input.

▸5.8.6 Seam carving — content-aware resize by DP⧉

• **the application** (Avidan & Shamir 2007, motivation-first): resize a photo to a new aspect ratio **without squashing the subject or cropping it off** — instead of uniform scaling or cropping, **remove (or insert) the least-important pixels**, one connected seam at a time. `fig-seam-carving`

• **a seam** = an 8-connected top-to-bottom (or left-to-right) **path of one pixel per row** whose total **energy** $\sum e(i,j)$ is **minimum**; energy $e=\lvert\partial_x I\rvert+\lvert\partial_y I\rvert$ (gradient magnitude / saliency) — so the seam **threads through low-energy, low-detail regions** and avoids the subject.

• **the DP** (the worked mechanism, `fig-seam-energy-dp`): fill a **cumulative-energy** table top-down, $M(i,j)=e(i,j)+\min\!\big(M(i{-}1,j{-}1),M(i{-}1,j),M(i{-}1,j{+}1)\big)$; the cheapest endpoint $\arg\min_j M(H,j)$ + **back-trace** = the optimal seam. Remove it (narrow), or duplicate it (widen); **repeat**. Read back: *"the cheapest way down is the cheapest way to my best parent, plus my own cost."*

• pseudocode block: the DP sweep that fills $M$ row-by-row with the 3-way $\min$, then the $\arg\min$ + upward back-trace that reads off the seam.

• **the same DP as live-wire/scissors** (through-line): a 1-D least-cost path on a gradient-derived energy — only the **energy** and the **direction of the cut** differ.

• honest limits: seam carving **fails on busy/structured images** (no low-energy path exists → it carves through content); better saliency/forward-energy helps — note, don't develop.

▸5.8.7 Time-lapse via DP — a seam through time⧉

• **the idea**: choosing frames/exposures for a smooth **time-lapse or day-to-night sequence** is the **same least-cost path**, but the axis is **time** — pick a sequence minimizing inter-frame jumps (exposure/color discontinuity), a DP over candidate frames. ⚠️ *coverage note: no dedicated slide/transcript found — confirm scope or trim to a one-line example under seam carving.*

▸5.8.8 Video textures — looping a seam in time⧉

• **the idea** (Schödl et al. 2000): turn a short clip of a stochastic phenomenon (waving flag, candle, fountain) into an **arbitrarily long, seamless loop** by finding **good transitions** — frames that look similar enough to jump between invisibly.

• **mechanism = a cut in the time graph**: frames are nodes; a transition cost = **inter-frame similarity** (low cost = invisible jump); finding loops / a play order is **least-cost path / DP** on this graph — the seam idea applied to **time**. `fig-video-texture-loop`

• the payoff: same "hide the seam where things match" principle as panorama blending and seam carving — now between **frames** instead of pixels.

▸5.8.9 GraphCut textures & photomontage — cut, then blend⧉

• **GraphCut textures** (Kwatra et al. 2003): synthesize/extend a texture (or composite two images) by **graph-cut along the best seam** through the overlap region — the min cut picks the path where the two sources **agree best**, so the join is invisible. The texture-synthesis sibling of panorama seam-finding.

• **interactive digital photomontage** (Agarwala et al. 2004 — **shared with Poisson**): the headline **"cut + blend" pipeline** — **graph cut** chooses *which source contributes each pixel* (the seam, this section), then a **gradient-domain (Poisson) blend** hides any residual seam (the reconstruct-from-edges section). The clean statement of how the three legs **compose**: cut along the edges, then reconstruct across the seam. (Explicit two-way xref to [[#Poisson image editing]].)

▸5.8.10 Where this connects — MRFs and beyond⧉

• **MRFs**: the graph-cut energy $E(L)=\sum D_p+\sum V_{pq}$ **is** a pairwise **Markov Random Field** — data + pairwise prior — and graph cut / belief propagation are its inference engines. Name the connection (so the reader meets "MRF" once); the broader Bayesian/MRF treatment lives with inverse problems / segmentation, cross-ref rather than develop here. *(Resolves the old "MRFs? / Bill's Bayesian stuff" stub.)*

• closing through-line: scissors · snakes · seam carving · graph cut · normalized cuts · video textures are **one idea wearing different optimizers** — *cut where it won't be seen; the energy is the design.*

▸5.9 Recap: which edge-aware technique when?⧉

▸5.9.1 The three relationships to edges⧉

• **reconstruct *from* edges — Poisson / gradient-domain**: throw away absolute level, keep gradients, integrate back. Reach for it to **hide an edit's boundary** — seamless cloning, healing, gradient-domain HDR, photomontage/panorama blending.

• **smooth *except across* edges — bilateral & friends**: average pixels that *belong together* (the L4 affinity). Reach for it to **denoise / decompose / enhance without haloes or bleeding** — bilateral, joint/cross & upsampling, grid, guided filter, local Laplacian, NL-means, LARK, colorization.

• **cut *along* edges — seam optimization**: find the least-noticeable path/boundary. Reach for it to **resize, select, or composite** by hiding a seam in the busy regions — seam carving, scissors/snakes, graph cut, normalized cuts, GraphCut textures.

▸5.9.2 Pros / cons / cost — when to reach for each⧉

• **Poisson**: global, seamless, encoding-sensitive; cost = a **sparse linear solve** (CG / multigrid / FFT; membrane shortcut when conservative). Pick when the artifact lives **at a boundary**.

• **bilateral & friends**: mostly a **single pass** (grid → real-time; guided $O(N)$; local Laplacian heavier; colorization a solve). Pick when you must **process within regions but not across edges**. Watch range-quantization / gradient-reversal artefacts (guided & local Laplacian avoid them).

• **seam optimization**: **discrete optimization** — DP (1-D path, cheap), graph cut (2-D, exact binary, needs seeds), normalized cut (spectral, unsupervised, approximate). Pick when the right answer is **where to put the boundary**; fails on busy images with no low-energy path.

• **they compose**: photomontage / panorama = **graph-cut seam + Poisson blend** (cut along edges, then reconstruct across the seam) — the part's payoff.

▸Part 6 WARPING AND MORPHING⧉

💡 **Big lesson (L17 · transport half):** almost every geometric technique is **(1) establish a correspondence / coordinate map**, then **(2) transport pixels along it** by warping and resampling. This part owns step (2) — one shared engine serving rectification, panoramas, morphing, stabilization, and frame interpolation — and the ways to *build* a map from sparse control. The hard *estimation* half (step 1) is [[Matching pixels across space and time]].

▸6.1 Warping and resampling⧉

`fig-warp-domain-vs-range` · one image, two orthogonal arrows — the domain arrow bends the grid (a pixel slides, carrying its colour; a warp), the range arrow recolours in place (a point/tone op); warping moves where, not what 🟨

⬜ figure not yet created

`fig-forward-warp-holes` (forward/input-driven warp scatters input pixels to fractional output locations → gaps (uncovered output) and pile-ups (many inputs to one output)) fig-forward-warp-holes

`fig-inverse-warp-lookup` · inverse (output-driven) warping, the useful idiom — loop over output pixels, push each back through $f^{-1}$, sample there; every output covered once, the only difficulty a clean resampling 🟨

`fig-recon-filters-1d` · nearest (boxy staircase), linear (kinked through every sample), and cubic (smooth with mild overshoot) reconstruction of the same 1-D sample row — the sharpness/smoothness trade 🟨

⬜ figure not yet created

`fig-bilinear-weights` (the four surrounding pixels and the fractional-coordinate area weights of bilinear interpolation) fig-bilinear-weights

`fig-minify-aliasing-moire` · the L16 picture — a fine grating decimated naively (folds into false moiré) vs area-averaged first then decimated (clean); same size, opposite outcomes, decided by prefiltering 🟨

⬜ figure not yet created

`fig-kernel-scaling-minify` (the reconstruction kernel widened by the downscale factor so it averages over the right footprint — "gather farther when shrinking") fig-kernel-scaling-minify

`fig-dof-ladder` · the degrees-of-freedom ladder — a unit square under translation (2) → similarity (4) → affine (6) → projective (8) → free-form ($\infty$), each rung breaking a preserved property 🟨

`fig-mesh-triangulation-warp` · piecewise-affine mesh warp — control matches triangulated, each triangle carrying its own affine map; exact and fast but only $C^0$, creasing along shared edges 🟨

⬜ figure not yet created

`fig-tps-rbf-warp` (a few control points displaced → a smooth thin-plate-spline field filling the gaps, minimal bending) fig-tps-rbf-warp

`fig-beier-neely-segments` · Beier–Neely feature-line warp — before/after directed segments, one segment's $(u,v)$ frame, and the distance-weighted blend of several segments' proposed displacements ($w_i=(\ell_i^p/(a+d_i))^b$) 🟨

`fig-arap-vs-affine` · drag one handle — an unconstrained affine/least-squares warp shears and stretches, vs an as-rigid-as-possible / MLS warp that keeps each neighbourhood near a rotation so the shape bends without stretching 🟨

💡 **Big lesson (L16 · Prefilter before you downsample):** a warp that *shrinks* the image (minifies) is throwing samples away — and if you just decimate (keep every $N$th source pixel, or sample a narrow reconstruction kernel) the detail too fine for the new grid **folds down into bold false moiré**, not into nothing. The fix is to **area-average first**: widen the resampling kernel in proportion to the downscale factor $s$ (so it gathers over the whole footprint each output pixel now covers) and renormalize — that low-pass blur is the prefilter. Enlarging needs *no* prefilter (clamp the kernel width to 1); shrinking *always* does. (→ see Big lesson **L16**, first placed in BASIC → Resampling; the frequency-domain *why* is **L5** — aliasing folds frequencies above the new Nyquist. The same lesson governs every geometric resample in this part: rectification corners that shrink, stabilization crops, optical-flow rendering.)

⚠️ **Candidate new part-spine lesson (flag only — not registered).** Almost every algorithm in this part follows one shape: **establish a correspondence / coordinate map between two images (or an image and a target frame), then transport pixels along it.** Warping makes the map explicit and applies it; perspective rectification *solves* the map (a homography) from constraints; panorama stitching *estimates* the map from features; morphing *interpolates between* two maps; optical flow *estimates a dense map* from brightness constancy; stabilization *smooths a map over time*; frame interpolation *renders along* a map. A lesson like **"find the coordinate map, then move the pixels"** would unify the part the way L10 unifies the inverse-problem part. Worth registering as a part-spine lesson for MOTION/WARPING — *noted here, deliberately not registered* (registry-first convention: register in [[Big Lessons]] before placing).

equations

warp as a domain map $\text{out}(x,y)=\text{in}\big(f^{-1}(x,y)\big)$ (move the domain, keep the range)

**forward** warp $\text{out}(f(x,y))\leftarrow \text{in}(x,y)$ (scatters → holes)

1-D **linear** reconstruction $\text{im}(x)= (1-\alpha)\,\text{im}[\lfloor x\rfloor]+\alpha\,\text{im}[\lceil x\rceil]$, $\alpha=x-\lfloor x\rfloor$

**bilinear** = separable 1-D linear in $x$ then $y$ (four-neighbour area weights)

general **convolution resampling** $\text{out}(x)=\sum_{x'}k\big(f^{-1}(x)-x'\big)\,\text{in}(x')$ with analytic kernel $k$ centred at the fractional source location

**minification kernel scaling** $k_s(t)=\tfrac1s\,k(t/s)$ for downscale factor $s$ (widen + renormalize → area-average = prefilter, **L16**)

**affine** $\begin{psmallmatrix}x'\\y'\end{psmallmatrix}=A\begin{psmallmatrix}x\\y\end{psmallmatrix}+\mathbf t$ (6 DOF)

**projective / homography** $\begin{psmallmatrix}x'\\y'\\w'\end{psmallmatrix}=H\begin{psmallmatrix}x\\y\\1\end{psmallmatrix}$, then $(x'/w',y'/w')$ (8 DOF

solve from 4 correspondences — [[Manual panorama stitching from multiple views]])

**TPS / RBF** field $f(\mathbf p)=A\mathbf p+\mathbf t+\sum_i w_i\,\phi(\lVert\mathbf p-\mathbf c_i\rVert)$ with $\phi(r)=r^2\log r$ (thin-plate radial basis)

**Beier–Neely** single-segment map via the segment frame $u=\dfrac{(\mathbf X-\mathbf P)\cdot(\mathbf Q-\mathbf P)}{\lVert\mathbf Q-\mathbf P\rVert^2},\ v=\dfrac{(\mathbf X-\mathbf P)\cdot\perp(\mathbf Q-\mathbf P)}{\lVert\mathbf Q-\mathbf P\rVert}$, then $\mathbf X'=\mathbf P'+u(\mathbf Q'-\mathbf P')+\dfrac{v\,\perp(\mathbf Q'-\mathbf P')}{\lVert\mathbf Q'-\mathbf P'\rVert}$, multi-segment weight $w_i=\Big(\dfrac{\ell_i^{p}}{a+d_i}\Big)^{b}$.

▸6.1.1 Warping is a domain transform — move *where*, not *what*⧉

• **the one-line definition**: a warp applies a function $f:\mathbb{R}^2\to\mathbb{R}^2$ to the image *domain* — "if $(x,y)$ had color $c$ in the input, then $f(x,y)$ has color $c$ in the output." It deforms **where pixels live**, leaving their **colors (the range) untouched**. Contrast a *point/tone operation*, which edits the range (colors) and leaves the domain fixed. Warping is the domain twin of everything in BASIC point-ops. [`fig-warp-domain-vs-range`]

• **the rubber-sheet picture**: imagine the image printed on rubber and stretched — pixels slide to new positions; the picture content rides along. This is the right mental model for *all* of it, from a tiny lens-distortion correction to "liquify"/slimming, to morphing, to a full perspective rectification.

• **why it's everywhere** (motivate before machinery): geometric tasks across this whole part are warps — **optical aberration / lens-distortion correction**, **video stabilization**, **panorama reprojection** (fix parallax / align to a reference), **wide-angle distortion correction**, **morphable face models**, **image-based rendering**, and the cosmetic "slim the subject." One primitive, reused everywhere — which is exactly why this chapter comes first in the part.

• **two-and-a-half hard questions** (the chapter's skeleton, stated up front): given the rubber-sheet idea, three things are genuinely non-trivial — (1) *which direction* do we transform (forward or inverse)? (2) *how do we read a color at a non-integer location* (resampling/reconstruction)? and (2.5, only for free-form warps) *how do we even specify $f$* from a handful of user inputs? The next three sections take these in turn.

▸6.1.2 Forward vs inverse warping — why output-driven wins⧉

• **forward warp (input-driven)**: loop over every *input* pixel $(x,y)$, compute its destination $f(x,y)$, and write its color there: $\text{out}(f(x,y))\leftarrow\text{in}(x,y)$. Intuitive — "push each pixel to where it goes." But it has **two failure modes** baked in. [`fig-forward-warp-holes`]

• **holes / gaps**: destinations $f(x,y)$ are generally **non-integer** and **non-surjective** — when the warp stretches a region, neighbouring inputs land far apart, leaving output pixels that **no input ever writes** (black speckle / cracks).

• **pile-ups / double-writes**: where the warp compresses a region, **many** inputs land on the **same** output pixel — who wins? You'd have to splat-and-accumulate with weights and normalize, which is fiddly and still doesn't fill the gaps.

• **inverse warp (output-driven)** — *the useful one*: loop over every *output* pixel $(x,y)$, push it **backward** through $f^{-1}$ to find where it came from in the input, and **look up** that color: $\text{out}(x,y)=\text{in}\big(f^{-1}(x,y)\big)$. [`fig-inverse-warp-lookup`]

• **why it wins**: the main loop is over **output** pixels, so **every output is covered exactly once** — no gaps, no double-writes by construction. The non-integer problem is pushed entirely into the *input* lookup, where it's a clean **resampling** question (next section) rather than a scatter problem.

• **take-home (state it as a rule)**: *main loop over OUTPUT pixels; use the INVERSE transform; resample the input.* This is the canonical resampling/warping idiom and recurs verbatim in panorama warping ([[Manual panorama stitching from multiple views]]) and rectification.

• **pseudocode**: a three-line block (loop output pixels → `(x',y')=f⁻¹(x,y)` → `out[x,y]=resample(in,x',y')`) makes the output-driven loop and once-per-pixel coverage concrete.

• **the catch — you need $f^{-1}$**: inverse warping needs the *backward* map. For **parametric** warps this is free (invert a matrix; for a homography, use $H^{-1}$). For a warp you only know **forwards** — most importantly a **dense optical-flow field** (each input pixel's motion) — you don't have $f^{-1}$ in closed form, and either invert the field numerically or fall back to forward splatting. So forward warping isn't *wrong*, it's the tool when the field is only available input-side. (Cross-ref [[Optical flow]] — rendering a frame along an estimated flow is exactly this case.)

• **boundary handling**: $f^{-1}(x,y)$ can land **outside** the input. Decide the policy — **black/zero padding**, **clamp-to-edge** (replicate the border pixel), **mirror**, or **mark as undefined** (transparent / alpha). Rectification and stabilization both produce outputs that read past the input border (the corners), so this is not a corner case — it's the common case.

▸6.1.3 Resampling — reading a value at a non-integer location⧉

• **the core problem**: inverse warping asks for $\text{in}$ at a **fractional** coordinate $f^{-1}(x,y)$. An image is *samples* of an underlying continuous signal; to read between samples we **reconstruct** the continuous signal and evaluate it — this is *resampling* (reconstruct, then re-sample). The reconstruction filter is the whole story.

• **nearest neighbour**: round to the closest sample. Cheapest; **blocky / jaggy** (a box reconstruction kernel) — visible stair-steps under rotation/scaling. Fine only for label maps / when you must not invent in-between values. [`fig-recon-filters-1d`, "nearest"]

• **bilinear**: linear interpolation in 1-D, $\text{im}(x)=(1-\alpha)\,\text{im}[\lfloor x\rfloor]+\alpha\,\text{im}[\lceil x\rceil]$ with $\alpha$ the fractional part; in 2-D it's **separable** — interpolate along $x$ on the two top and two bottom neighbours, then along $y$ (or $y$ then $x$ — same result, a worthwhile exercise). Equivalently the four-neighbour **area weights**. Smooth, cheap, the workhorse default; slightly soft (a triangle/tent kernel attenuates high frequencies). [`fig-bilinear-weights`]

• **bicubic / Lanczos** — *the right framing*: do **not** think "fit a cubic to the data." Think **convolution with an analytic kernel**: $\text{out}(x)=\sum_{x'}k\big(f^{-1}(x)-x'\big)\,\text{in}(x')$, the kernel $k$ centred at the (fractional) source location and sampled at the surrounding integer inputs. **Bicubic** (the Mitchell–Netravali family) uses a wider, sharper, negative-lobed kernel over a $4\times4$ neighbourhood — sharper than bilinear, with slight ringing/overshoot you can tune. **Lanczos** is a windowed sinc, sharpest of the common kernels (more ringing). All are **separable** (run in $x$, then $y$), and in general the weights **can't be precomputed** because the fractional offset differs per output pixel (though for integer or rational scale factors they become periodic and *can* be tabulated). [`fig-recon-filters-1d`]

• **upsampling vs downsampling is the key split** (lead-in to L16): the kernel-as-convolution view makes the difference precise. **Upsampling (enlarging)**: reconstruct and read — clamp the kernel width to 1 input pixel; *no* prefilter needed (you're not discarding anything). **Downsampling (minifying)**: you're **discarding** samples, so reconstruction alone **aliases**.

• 💡 **the L16 discipline** — *prefilter before you downsample* (see the boxed lesson above): blur/area-average **before** decimating. In the convolution-resampling view this is one clean move: **scale the kernel up by the downscale factor**, $k_s(t)=\tfrac1s\,k(t/s)$, and renormalize — now each output pixel **gathers over the whole footprint** it covers in the input, i.e. averages instead of point-sampling. The infamous bug is **keep-every-Nth decimation** (or a width-1 kernel on minify): fine detail folds into bold **moiré / herringbone**. The honest rule: *upsampling clamps the kernel to width 1; downsampling widens it proportionally to scale.* [`fig-minify-aliasing-moire`, `fig-kernel-scaling-minify`]

• **practical notes**: resampling is **separable** — scale/resample in $x$ (producing an anisotropically-sized intermediate image), then in $y$; downsampling-in-$y$-first vectorizes well. Implementation-wise the clean trick (Adams) is to **form the tiny sparse 1-D resampling matrix** (one row/column's worth of weights, with boundary handling and renormalization folded in) and apply it to every row/column — the weight matrix is negligible next to the image. **For minification, MIP-maps** (a prefiltered pyramid) precompute the area-averages once and read them back per query — the standard real-time answer to L16. (Cross-ref [[Linearity, Fourier, Aliasing and deblurring]] for the Nyquist/aliasing account; pyramids for the band structure.)

• **the honest limit of interpolation** (forward-ref): resampling adds *pixels*, never new *frequencies* — bicubic/Lanczos upsampling looks soft because the high frequencies were never measured. Recovering plausible detail is **super-resolution**, a prior-driven inverse problem, not resampling (cross-ref [[Super-resolution and image priors]] — reconstruction vs hallucination).

▸6.1.4 Specifying the warp I — parametric models and the degrees-of-freedom (DOF) ladder⧉

• **the idea**: for many warps $f$ is a **low-parameter** function — fit it to a few correspondences and apply it everywhere. Climbing the **degrees-of-freedom ladder**, each rung adds freedom and **breaks a preserved property**: [`fig-dof-ladder`]

• **translation** (2 DOF) — preserves everything but position.

• **rigid / Euclidean** (3) — + rotation; preserves lengths & angles.

• **similarity** (4) — + uniform scale; preserves angles & ratios (shape).

• **affine** (6) — $\binom{x'}{y'}=A\binom{x}{y}+\mathbf t$; preserves **parallelism** and **ratios along lines**, *not* angles/lengths. Straight lines stay straight; a square → a parallelogram.

• **projective / homography** (8) — $\binom{x'}{y'}{w'}=H\binom{x}{y}{1}$ then divide by $w'$; preserves **straight lines only** — parallels may **converge** (a square → an arbitrary quadrilateral). This is the rung that does **perspective**, and the subject of the next chapter and of panorama stitching.

• **why homogeneous coordinates** (link, don't re-derive): the homography is non-linear in $(x,y)$ because of the **division by $w'$**, but **linear** in homogeneous coordinates — a single $3\times3$ matrix multiply followed by the perspective divide. It *is* the reprojection of a plane under a 3-D rotation (project, rotate, reproject), which is why "depth doesn't matter for a planar-or-rotation warp." Full development: FUNDAMENTALS — Pinhole; the **solve** for $H$ from correspondences: [[Manual panorama stitching from multiple views]].

• **fitting a parametric warp**: each point correspondence gives **two equations**; an affine needs **3** correspondences (6 unknowns), a homography **4** (8 unknowns up to scale). More correspondences → **overdetermined** → **least squares** (or SVD for the homography null-vector). This is a clean instance of [[Linear Inverse Problems and Regression]] — *the warp parameters are the regression unknowns.* (Robustness to wrong matches → RANSAC, developed for stitching.)

• **strength and limit**: parametric warps are **global, smooth, few-parameter, robust** — perfect when the deformation really is rigid/affine/projective (a plane, a camera rotation, a lens model). They **cannot** express a local, free-form deformation — a smile, a morph, a hand-drawn stretch. For that, you need a field built from sparse control (next).

▸6.1.5 Specifying the warp II — free-form warps from sparse correspondences⧉

• **the setup**: the user gives a **sparse** set of correspondences — a few control points, or pairs of feature segments — and we must produce a **dense** warp field $f$ defined at every pixel by *interpolating* the sparse constraints. Three standard constructions, trading smoothness, locality, and control:

• **mesh / triangulation warp (piecewise-affine)**: triangulate the control points (e.g. Delaunay); inside each triangle the three vertex correspondences **exactly determine an affine map** (6 DOF, 3 points), so warp each triangle by its own affine. Simple, fast, exact at the controls, and the standard morphing workhorse. Downside: **$C^0$ only** — the field is continuous but **creases** at triangle edges (derivative jumps), and quality depends on the triangulation. [`fig-mesh-triangulation-warp`]

• **RBF / thin-plate-spline (TPS) warp (smooth scattered-data interpolation)**: fit a **smooth** field that interpolates the control displacements with **minimal bending energy**: $f(\mathbf p)=A\mathbf p+\mathbf t+\sum_i w_i\,\phi(\lVert\mathbf p-\mathbf c_i\rVert)$, an affine "global" part plus a sum of **radial basis functions** centred at the controls. The thin-plate kernel $\phi(r)=r^2\log r$ minimizes integrated second-derivative energy — the smoothest interpolant (the physical analogy: a thin metal plate bent to pass through the constraints). The weights $w_i$ (and $A,\mathbf t$) solve a small **linear system**. Smooth everywhere ($C^1$+), no creases; **global support** (every control influences every pixel, so it can be less local than a mesh) and an $O(n^2)$–$O(n^3)$ solve in the number of controls. The go-to for smooth medical/face registration and morphometrics (Bookstein). [`fig-tps-rbf-warp`]

• **feature-based field warping — Beier & Neely 1992**: specify the warp with **pairs of directed line segments** (before/after) rather than points — natural for features like an eyebrow or a jawline. Each segment pair $(\mathbf{PQ}\to\mathbf{P'Q'})$ defines a **local coordinate frame**: a coordinate $u$ **along** the segment (normalized, $0$ at $\mathbf P$, $1$ at $\mathbf Q$) and $v$ **perpendicular** (in distance units). For a single segment, map a point by reading its $(u,v)$ in the **destination** segment's frame (inverse warp) and reconstructing it in the source frame:

• $u=\dfrac{(\mathbf X-\mathbf P)\cdot(\mathbf Q-\mathbf P)}{\lVert\mathbf Q-\mathbf P\rVert^2}$, $\quad v=\dfrac{(\mathbf X-\mathbf P)\cdot\perp(\mathbf Q-\mathbf P)}{\lVert\mathbf Q-\mathbf P\rVert}$; then $\mathbf X'=\mathbf P'+u\,(\mathbf Q'-\mathbf P')+\dfrac{v\,\perp(\mathbf Q'-\mathbf P')}{\lVert\mathbf Q'-\mathbf P'\rVert}$.

• **note** (Beier–Neely's own finding): $u$ scales with the segment (stretch), but $v$ is kept **absolute** — they tried scaling $v$ too and it looked worse.

• **multiple segments**: each segment $i$ proposes a displaced point $\mathbf X'_i$; the result is a **distance-weighted average**, weight $w_i=\big(\ell_i^{\,p}/(a+d_i)\big)^{b}$ — longer segments ($\ell_i$) carry more influence, nearer ones ($d_i$ = distance to the segment) dominate; $a,b,p$ tune the falloff. [`fig-beier-neely-segments`]

• **provenance / use**: the technique behind Michael Jackson's *Black or White* morph; a classic course problem set. *Built here, used in [[Morphing]].* (Ref: Beier & Neely 1992; the canonical warping text is Wolberg 1990.)

• **as-rigid-as-possible (ARAP) / MLS warps — penalize shear**: the catch with the above is that interpolating positions alone lets shapes **shear and squash** unnaturally when you drag a handle. **As-rigid-as-possible** (Igarashi 2005) and **moving-least-squares** rigid/similarity deformation (Schaefer 2006) add an explicit objective: each local neighbourhood should transform **as close to a rotation+translation (rigid) as possible** — minimize the deviation from rigidity rather than just matching control points. The deformation then **preserves local shape** (a dragged limb bends but doesn't stretch), which is what makes interactive shape manipulation and detail-preserving warps look natural. Conceptually: same "dense field from sparse handles," but with a **regularizer on local rigidity** instead of plain smoothness. [`fig-arap-vs-affine`]

• **choosing** (decision aid): a **plane / camera rotation / lens model** → parametric (affine/homography), fit by least squares. A **face/shape morph** with creases acceptable and speed wanted → **mesh/triangulation**. A **smooth** registration with few controls → **TPS/RBF**. **Feature-line** control (eyebrows, jaw) → **Beier–Neely**. **Interactive handle-dragging** where shapes must not shear → **ARAP/MLS**. Whatever the choice, the *application* is identical: produce $f$, then **inverse-warp and resample** (the first three sections) — that primitive never changes.

6.2 Resampling for complex spatial transforms⧉

▸6.3 Morphing⧉

`fig-morph-crossfade-ghost` · why a cross-dissolve ghosts — two misaligned faces averaged at $t=0.5$ superimpose features into a translucent four-eyed, two-mouthed ghost; the blend handled colour but ignored position ✅

⬜ figure not yet created

the failure that motivates everything)

`fig-morph-shape-vs-color` · the two orthogonal knobs of a morph — range (colour) as the cross-dissolve and domain (shape) as feature-position interpolation; a true morph is their product, the square's corners labelling the four extremes 🟨

`fig-morph-recipe` · the three-step morph at $t=0.5$ — interpolate the features to the midpoint shape, warp both images to that same shape, then cross-dissolve; a sharp single face because alignment preceded the blend 🟨

⬜ figure not yet created

`fig-beier-neely-oneseg` (a single before/after line pair: the $(u,v)$ coordinate frame along/perpendicular to the segment, and how a point $X$ maps to $X'$) fig-beier-neely-oneseg

⬜ figure not yet created

`fig-beier-neely-multiseg` (several line pairs, each proposing an $X'_i$, combined by a distance/length **weighted average** — the field warp) fig-beier-neely-multiseg

`fig-mesh-vs-field-morph` · two ways to drive the same morph — mesh (per-triangle affine, strictly local, fiddly, can crease) vs field/Beier–Neely (a few control lines, globally-supported distance-weighted field, sparse but fold-prone) 🟨

`fig-view-morph-prewarp` · why two views need view morphing — a naive 2-D morph bends straight edges and shrinks a rotating object, vs view morphing (prewarp to aligned epipolar lines → interpolate → postwarp) keeping lines straight and shape intact 🟨

💡 **Big lesson (recurrence — *align before you blend*):** averaging two images only makes sense once their features sit at the *same place*. A plain cross-dissolve blends colors at fixed coordinates, so any misalignment **double-exposes into a ghost**; the cure is to **warp into correspondence first, then average** — the same move as panorama de-ghosting and multi-frame super-resolution's sub-pixel alignment. Morphing is this lesson taken to its limit: not just align-then-average, but interpolate the *alignment itself* over $t$. (Companion to the *capture-then-decide / align-then-merge* family; the dense automatic aligner is [[Optical flow]].)

▸6.3.1 Why a cross-dissolve isn't enough — the ghosting motivation⧉

• **the goal**: *smoothly* turn one image into another — a face into another face, one video keyframe into the next (slow-motion / frame interpolation). "Smooth" means the in-between $I_t$ should look like a *plausible single image*, not a blend of two.

• **the naive attempt — cross-fading (a.k.a. cross-dissolve)**: linearly interpolate colors pixel-by-pixel, $\;\text{out}[y,x] = (1-t)\,I_0[y,x] + t\,I_1[y,x]$. This is the right idea for **color** but ignores **position**.

• **why it ghosts**: the two images' **features are not aligned** — eyes, mouth, silhouette sit at different pixels. Blending colors at fixed locations super-imposes a left-image eye on a right-image cheek → a translucent **double exposure** (four eyes, two mouths). And you generally *cannot* fix this with one global alignment (translate/rotate/scale) — faces differ locally, non-rigidly. [`fig-morph-crossfade-ghost`]

• **the diagnosis**: cross-fading interpolates only the **range** (color). To get a believable in-between we must *also* interpolate the **domain** — the **location** of every feature. This is a **domain transform** (a warp), and it's the missing half.

• 💡 **the throughline**: *separate shape and color, and interpolate each.* Everything below is how to (a) get a dense **shape** interpolation from sparse correspondence and (b) combine it with the **color** interpolation. (See the *align before you blend* big-lesson box above.)

▸6.3.2 Two interpolations: domain (shape) and range (color)⧉

• **range interpolation (color)** — the cross-dissolve we already have: $\;C(x) = (1-t)\,C_0(x) + t\,C_1(x)$. A straight linear blend of pixel values.

• **domain interpolation (location)** — treat a feature **point** as a vector and interpolate *where it is*: for corresponding points $P$ (in $I_0$) and $Q$ (in $I_1$), the in-between location is $\;V = (1-t)\,P + t\,Q$. At $t=0$ the feature is where $I_0$ put it; at $t=1$ where $I_1$ put it; in between it travels along the straight segment. [`fig-morph-shape-vs-color`]

• **warping = a domain transform**: moving pixels spatially while leaving their colors alone, $\;C'(x,y) = C\big(f(x,y)\big)$. In practice we use the **inverse / backward** map — for each output pixel, look up where it came from: $\;\text{out}(x,y) = \text{im}\big(f^{-1}(x,y)\big)$ — so every output pixel is filled (no holes) and resampling is well-defined. This is exactly the engine of [[Warping and resampling]]; morphing is one of its biggest customers (it warps twice per frame).

• 💡 **the morph in one line**: *interpolate the domain like a vector, then interpolate the range (color).* A morph is the **product** of a shape interpolation and a color interpolation — two independent linear blends applied together. [`fig-morph-shape-vs-color`]

▸6.3.3 The morphing recipe (combine both)⧉

• **inputs**: two images $I_0$ and $I_1$, plus **sparse correspondences** the user specifies on both (matching line segments, or matching mesh vertices — *which* primitive is the next two sections). Output: a sequence of in-betweens $I_t$, $t\in(0,1)$.

• **per intermediate frame $I_t$** (the canonical four steps):

1. **interpolate the feature geometry** to an intermediate shape: each corresponding pair (point/segment/vertex) $\to$ its in-between location $P_t = (1-t)\,P_0 + t\,P_1$.

2. **build a dense warp field** from the sparse pairs — two of them: one taking $I_0$'s features onto the intermediate shape, one taking $I_1$'s features onto the *same* intermediate shape.

3. **warp both images** to that common shape. Now $I_0$ and $I_1$ are **feature-aligned**: both pairs of eyes land on the same pixels, both mouths coincide.

4. **cross-dissolve** the two warped images: $I_t = (1-t)\,\text{warp}(I_0) + t\,\text{warp}(I_1)$. Because they're aligned, the blend is sharp — **no ghosting**. [`fig-morph-recipe`]

• **why warp *both* (not just one)**: warping only $I_0$ all the way to $I_1$'s shape and dissolving would make features *slide* while colors fade — and would over-distort early/late frames. Meeting **in the middle** (both warp partway to the shared intermediate shape) keeps distortions small and symmetric, and is what makes the motion read as one object transforming.

• **endpoints check**: at $t=0$ the intermediate shape = $I_0$'s shape, warp of $I_0$ is identity, color weight on $I_1$ is 0 → output is exactly $I_0$ (and symmetrically at $t=1$). The morph is a *genuine* interpolation, not just a fancy blend.

• **note (cross-ref)**: the correspondences here are **user-specified and sparse**. The **automatic, dense** way to get correspondence between two images is [[Optical flow]] — that's exactly what frame-interpolation / slow-motion methods use to morph between *video* frames without hand-drawn lines ([[Frame interpolation and slow-motion synthesis]]).

▸6.3.4 Field morphing: the Beier–Neely line-pair warp⧉

• **the primitive**: a **pair of directed line segments** — one drawn on the "before" image, one on the "after" — saying *"this feature edge here corresponds to this feature edge there."* A handful of segments (jaw, eyes, nose bridge, hairline) is enough; the warp interpolates everywhere else. (Beier & Neely 1992 — "Feature-Based Metamorphosis"; the technique behind Michael Jackson's *Black or White* video and a classic pset.)

• **one segment defines a coordinate change**: a before/after pair $\overline{PQ}\to\overline{P'Q'}$ induces a similarity (rotate + scale + translate) — the simplest transform that carries one segment onto the other. Set up a **segment-relative coordinate system**: along the segment, a **normalized** coordinate $u\in[0,1]$ (0 at $P$, 1 at $Q$); perpendicular to it, a **signed distance** $v$ (in pixels). For a query point $X$:

$$u = \frac{(X-P)\cdot(Q-P)}{\lVert Q-P\rVert^2}, \qquad v = \frac{(X-P)\cdot \text{perp}(Q-P)}{\lVert Q-P\rVert},$$

where $\text{perp}(\cdot)$ is the 90°-rotated vector. Then the corresponding point in the *other* image is reconstructed from the *primed* segment:

$$X' = P' + u\,(Q'-P') + \frac{v\cdot\text{perp}(Q'-P')}{\lVert Q'-P'\rVert}.$$

Read it: **$u$ scales** with the segment (a feature twice as long stretches with it), but **$v$ stays in absolute pixel units** (perpendicular offset is *not* rescaled — Beier & Neely report that scaling $v$ too looked worse). [`fig-beier-neely-oneseg`]

• **inverse-map direction**: you compute $(u,v)$ in the **destination** ("after") frame and look up the source — because warping uses the **inverse** map ($\text{out}(x)=\text{im}(f^{-1}(x))$), so each output pixel is filled.

• **multiple segments → a weighted average of displacements**: each line pair $i$ proposes its own target $X'_i$. Combine them by a **distance-weighted average**, where a line's influence falls off with the query point's distance to that segment and grows with the segment's length:

$$X' = \frac{\sum_i w_i\,X'_i}{\sum_i w_i}, \qquad w_i = \left(\frac{\text{length}_i^{\,p}}{a + \text{dist}_i}\right)^{b}.$$

The tunables $a,b,p$ control smoothness ($a$: how strongly far lines still pull), falloff ($b$), and the length bias ($p$). Distance-to-a-segment is the usual capped point-to-segment distance (perpendicular within the span, endpoint distance outside). [`fig-beier-neely-multiseg`]

• **strengths / weaknesses**: *strength* — sparse, intuitive control (draw a few feature lines), smooth fields, no mesh to manage. *weakness* — every line affects every pixel (global support → $O(\#\text{pixels}\times\#\text{lines})$ per frame), and far-apart lines can fight, giving "ghost" pulls and folds; careful $a,b,p$ tuning and debugging the distance function are part of the work.

• **bells & whistles**: morph the **foreground only** (matte it first) so the background doesn't drag artifacts; **non-uniform** timing (different features morph at different rates) for more lifelike transitions.

▸6.3.5 Mesh / triangulation morphing (the alternative warp)⧉

• **the primitive**: instead of lines, place a **mesh of corresponding control points** on both images (or a regular grid deformed to match features). Triangulate the points; each triangle in $I_0$ maps to the matching triangle in $I_1$.

• **the warp**: within each triangle the map is a single **affine** (barycentric) transform — fast, local, and exactly invertible. The intermediate shape is the mesh with vertices at $P_t=(1-t)P_0+tP_1$; warp each source by its per-triangle affine onto that mesh, then dissolve. (Spline/free-form mesh variants — Wolberg 1990; multilevel free-form, Lee et al. 1996 — trade the piecewise-affine creases for smoother fields.)

• **field vs mesh — the trade**: *mesh* warps have **local support** (a vertex moves only its incident triangles → fast, predictable, no far-line fights) but the control net is fiddly and triangle edges can crease; *field (Beier–Neely)* warps have **global, line-based** control (sparse and intuitive) but are slower and can fold. Same job — *turn sparse correspondence into a dense, invertible warp* — two different bases. [`fig-mesh-vs-field-morph`]

• **either way, the same resampling**: each warp is a domain transform that must **reconstruct + prefilter** as it resamples (minified regions averaged, magnified regions smoothly interpolated) — the EWA/mip machinery of [[Warping and resampling]]. Morphing is a stress test for good resampling.

▸6.3.6 View morphing — the geometrically-correct in-between of two views⧉

• **when a 2-D morph is *wrong***: if $I_0$ and $I_1$ are two **photographs of the same 3-D scene** from different viewpoints (e.g. a face turning), a naive linear morph **bends straight lines and shrinks the object** — the in-between is not a valid view of *any* rigid scene. Linearly interpolating image positions is **not** the same as interpolating the underlying 3-D geometry. [`fig-view-morph-prewarp`]

• **the fix (Seitz & Dyer 1996, *View Morphing*)** — three steps that make the morph a *physically valid* intermediate view:

1. **prewarp** both images by homographies that **rectify** them to a common plane (so the two image planes are parallel and corresponding **epipolar lines** are horizontal and aligned);

2. **linearly interpolate** the now-rectified images (a plain morph — but in this rectified frame the linear interpolation *is* geometrically correct, because corresponding points move along aligned scanlines);

3. **postwarp** the result by an interpolated homography to the desired in-between view.

• **why it works**: rectification turns the general two-view geometry into the **parallel-camera (pure-translation) case**, where image-space linear interpolation exactly corresponds to moving the camera along the baseline — straight lines stay straight, the object keeps its shape, and the in-between is a true perspective view. It's the projectively-aware upgrade of the plain morph.

• **the connection to flow / stereo**: view morphing needs **corresponding points** between the views and the epipolar geometry relating them — exactly the **correspondence** problem of [[Optical flow]] and stereo, and the **epipolar** constraint of [[Image alignment]] / multi-view. With dense correspondence you can synthesize in-between *views* automatically — the bridge from morphing to **image-based rendering** and to flow-driven **frame interpolation** ([[Frame interpolation and slow-motion synthesis]]).

▸6.3.7 Recap and significance⧉

• **the lasting ideas**: (1) **linear interpolation of color alone introduces blur/ghosts** — you must interpolate position too; (2) **separate shape and color**, then recombine; (3) **non-rigid alignment** of different images is a reusable primitive. [`fig-morph-recipe`]

• **where it goes**: **special effects** (the canonical face-morph); **video frame interpolation / slow-motion** and even **MPEG-style** motion prediction (morph-by-motion-field); **morphable models** for face recognition/animation (warp a mean face by shape parameters); **motion magnification** (Liu et al. 2005 — *analyze* a motion field, **magnify** it, and warp by the magnified field — a morph whose correspondence comes from motion analysis, robust to occlusion). The common thread: once you can **warp one image into correspondence with another**, you can interpolate, exaggerate, average, or re-render — and *getting* that correspondence automatically is the subject of the next chapter.

6.4 Morphable models⧉

6.5 Shape-preserving warping⧉

6.6 Video in-betweening⧉

▸6.7 Perspective distortion and its correction⧉

`fig-rectify-before-after` · rectifying by homography — a keystoned input with four corner handles dragged onto a known rectangle (vanishing point marked) re-rendered fronto-parallel by $\text{out}(\mathbf x)=\text{in}(H^{-1}\mathbf x)$ 🟨

⬜ figure not yet created

`fig-vanishing-point-autosolve` (detected parallel-line bundles meeting at vanishing points → the homography that sends the vertical VP to infinity straightens the verticals) fig-vanishing-point-autosolve

`fig-tilt-shift-scheimpflug` · fixing perspective in the lens — shift moves the sensor within an oversized image circle while staying parallel to the façade (no converging verticals), tilt sets the focus plane via the Scheimpflug condition 🟨

⬜ figure not yet created

`fig-rectify-resample-stretch` (the rectified output's non-uniform resolution: the far/top corners are enlarged and softened while the near edge is compressed — why one homography only rectifies a *plane*). fig-rectify-resample-stretch

💡 **Big lesson (L16 recurrence · prefilter the shrinking corners):** rectification does not resample uniformly — to make the *far* (top) of the façade as wide as the *near* edge, the homography **enlarges** the top corners (upsample → just interpolate) and **compresses** other regions (downsample → must prefilter, or alias). So a faithful rectifier applies the **prefilter-before-downsample** discipline *per region*, with the kernel footprint set by the **local Jacobian** of $H$ (how much each area shrinks). (→ see Big lesson **L16**, [[Warping and resampling]] / BASIC Resampling.)

equations

the rectifying map is a **homography** $\mathbf x' \simeq H\mathbf x$ (homogeneous

divide by $w'$) — the projective warp of [[Warping and resampling]]

solve $H$ from **4 correspondences** (building corners → target rectangle), each pair giving 2 linear equations, $H$ up to scale → 8 DOF (least squares / SVD, per [[Manual panorama stitching from multiple views]])

applied by **inverse warp + resample** $\text{out}(\mathbf x)=\text{in}\big(H^{-1}\mathbf x\big)$

the **auto** constraint: choose $H$ so the bundle of (imaged) parallel lines maps to a **vanishing point at infinity** (verticals become vertical / parallel).

▸6.7.1 Keystoning is projection, not a lens flaw⧉

• **the symptom**: aim the camera **up** at a tall building and its **parallel verticals converge** toward the top (keystoning); aim it down and they splay. Wide-angle shots show the same projective effect as exaggerated **near/far stretch** (the close part of a façade looms, the far part shrinks). Photographers find converging verticals **objectionable** in architecture — a building "looks more formal if the verticals stay vertical." [`fig-keystoning-cause`]

• **the cause** (the key reframing): this is **perspective projection doing its job**, not the lens misbehaving. Projection **preserves straight lines** but **does not preserve parallelism, angles, or lengths** — so a bundle of world-parallel verticals images as a bundle of lines **converging to a vanishing point**. (Contrast with *radial lens distortion* — barrel/pincushion — which **bends straight lines** and *is* an optical defect; that's a **different correction**, now developed in [[Optics, lenses, and aberration correction]] (*Aberrations correction* — radial distortion correction). Keystoning bends nothing; lines stay straight, they just stop being parallel.)

• **faces in a wide-angle shot — the perceptual stake**: the same near/far stretch is what makes **faces** shot up close with a wide lens look wrong — the nose looms, the ears recede. There's no single plane to rectify (a face is 3-D), so the fix is **per-face / content-aware un-distortion** — locally re-warp each face region toward how it would have looked from farther away. What makes this worth doing is *perceptual*, not merely geometric: a face's perceived **shape — and even its perceived personality** (trustworthiness, competence, attractiveness) — **depends on the camera-to-subject distance**, which is the basis for correcting wide-angle portrait distortion (Perona 2007; perceived-trait-vs-distance study Bryan et al. PMC4114730). This is the same focal-length-vs-distance fact that governs portrait lenses — cross-ref [[Fundamentals]] (Lenses and focal length).

• **the giveaway**: if the **image plane is parallel to the subject plane** (sensor parallel to the façade), then world-parallel lines parallel to the sensor **stay parallel in the image** — no convergence. Convergence appears exactly when you **tilt** the sensor off the façade. So the distortion is a function of **camera orientation**, and that is precisely what a homography can undo.

▸6.7.2 The fix is a homography — re-render the façade fronto-parallel⧉

• **the idea**: the converging-verticals photo *is* a perspective image of a real planar rectangle (the façade). A **homography** maps between any two images of a plane — so there exists a single $3\times3$ projective warp $H$ that re-renders the façade **as if the sensor had been parallel to it**, sending the converging lines back to parallel. This is the **projective rung** of the DOF ladder from [[Warping and resampling]], applied here with the *target* being a fronto-parallel rectangle. [`fig-rectify-before-after`]

• **why one homography suffices (for a plane)**: reprojecting a plane = "project, rotate the camera, reproject," which is exactly a homography (depth drops out — *all points on the plane*, indeed all points along each ray, reproject identically). This is the same "depth doesn't matter for a planar-or-rotation warp" fact that makes panorama stitching work. **Cross-ref [[Manual panorama stitching from multiple views]]** — *identical machinery*; there the second image is another *view*, here it's the *target rectangle*.

• **solving $H$ — method 1, four corner correspondences**: pick **4 points** on the building (the corners of a known-rectangular feature — a window, the façade outline) and specify where they should go (a true rectangle, verticals vertical). 4 correspondences → 8 equations → solve the 8-DOF homography (up to scale) by **least squares / SVD**, exactly as in [[Manual panorama stitching from multiple views]]. This is the manual **perspective crop** in Photoshop. [`fig-rectify-before-after`, the 4 points]

• **solving $H$ — method 2, automatic via vanishing points**: detect bundles of **parallel lines** (the verticals, and the horizontals) — each images as a bundle meeting at a **vanishing point**. Choose the homography that sends the relevant vanishing point **to infinity** (making the verticals vertical/parallel again, and optionally the horizontals too). This needs **no user clicks** — it's the basis of Lightroom's **Upright** (it finds straight lines and squares them up) and of single-view metric rectification (Liebowitz & Zisserman 1998; Criminisi 2000). [`fig-vanishing-point-autosolve`]

• **applying it**: once $H$ is known, **inverse-warp and resample** — $\text{out}(\mathbf x)=\text{in}(H^{-1}\mathbf x)$, loop over output pixels, push back through $H^{-1}$, resample the input (the primitive from [[Warping and resampling]]). Same forward-vs-inverse and resampling care as everywhere.

• **in practice (tools)**: **Lightroom Transform / "Upright"** (auto and guided modes), **Photoshop perspective crop**, and dedicated architecture tools all do exactly this homography. The user experience is "drag the building straight"; the engine is the 4-point or vanishing-point homography.

▸6.7.3 The optical alternative at capture — tilt-shift / Scheimpflug⧉

• **fix it in the lens, not in post**: a **view camera** or **tilt-shift lens** corrects perspective **at capture**. The lens projects an image circle **larger than the sensor**; **shifting** the lens (or the sensor / back) selects an **off-centre** part of that circle while keeping the **sensor parallel to the façade**. Because the sensor stays parallel to the subject plane, **verticals never converge** in the first place — no post-warp, no resampling penalty. [`fig-tilt-shift-scheimpflug`]

• **equivalent to "wider-angle + crop"**: a shift is geometrically the same as taking a **wider, centred** view and **cropping** to the off-axis region you want — which is also why software "perspective crop" can approximate it (at the cost of resolution). **Tilt** (the other view-camera movement) instead controls the **plane of focus** via the **Scheimpflug** condition (lens, sensor, and subject planes meet in a line) — used to get a slanted plane all-in-focus, or, reversed, the "miniature/tilt-shift" toy-town look. **Cross-ref Optics — Special optics** for the optical derivation; here we only note it as the at-capture counterpart of the homography. The two are related by — unsurprisingly — **a homography** (the shifted-sensor image and the tilted-sensor image of the same façade differ by a projective warp).

• **trade-off** (when to use which): optical correction **preserves resolution and avoids resampling artefacts** but needs special (expensive, manual) gear and must be decided **at capture**; the software homography is **free and deferrable** ("decide later") but **resamples** (next).

▸6.7.4 The catch — resampling cost and "only a plane rectifies exactly"⧉

• **rectification resamples and stretches**: making the far/top of the façade as wide as the near edge means the homography **enlarges** some regions and **compresses** others → the output has **non-uniform resolution** (top corners enlarged and **softened**; other parts compressed). The enlarged regions are interpolation-limited (no new detail — cross-ref [[Super-resolution and image priors]]); the compressed regions must be **prefiltered before downsampling** (**L16**, see the recurrence box) or they alias. The kernel footprint should follow the **local Jacobian** of $H$. [`fig-rectify-resample-stretch`]

• **you lose frame / pixels**: the rectified rectangle generally doesn't fill the original frame — corners read **outside** the input (boundary policy from [[Warping and resampling]]) and you usually **crop** to the valid region, sacrificing field of view and resolution. "Upright" with auto-crop is managing exactly this.

• **only a plane is exact** (the fundamental limit): a single homography exactly rectifies a **single plane**. A real 3-D scene (a building with depth, foreground objects, a receding street) has **multiple planes at different depths** — one homography can square up **one** of them, but everything off that plane is left distorted (and parallax between depths can't be removed by any 2-D warp). Genuinely correcting a 3-D scene needs **depth or multiple views** (image-based rendering / multi-view) — cross-ref view morphing and the panorama parallax caveat in [[Manual panorama stitching from multiple views]]. The honest scope of this chapter: **planar** perspective correction.

▸6.7.5 Where this sits — one map, then transport⧉

• **the single instance of the part spine (L17)**: this is the **simplest** member of the *establish a coordinate map, then transport the pixels along it* family — the map is **parametric** (one homography), the scene is a **single plane**, the input is **one image**. The map is *solved* (4 corners / vanishing points), not estimated.

• **closing link / where the family goes harder**: the next time the **same** homography machinery appears it aligns **two views** into a panorama ([[Manual panorama stitching from multiple views]], where the map must be *estimated* from feature matches); after that the maps become **dense and estimated** (optical flow) rather than parametric and solved. Same two steps throughout; only the map gets harder.

▸Part 7 MATCHING PIXELS ACROSS SPACE AND TIME⧉

💡 **Big lesson (L17 · estimation half):** correspondence-then-transport — this part owns the **hard, ill-posed estimation** half (where did each pixel go?), ill-posed by the **aperture problem** and broken by occlusion and large motion; [[Warping and morphing]] owns the transport. Matching is literally the inverse of warping: warping consumes a coordinate map, matching produces one.

7.1 Sparse matching⧉

▸7.2 Feature tracking⧉

`fig-track-vs-flow` · two regimes of correspondence — dense flow (one pair, a vector at every pixel; dense but short) vs tracking (a few points threaded as trajectories across many frames; sparse but long) 🟨

`fig-klt-is-harris` · one matrix, two readings — the same structure-tensor ellipse read as Harris ("good corner to detect?") and as KLT ("motion well-conditioned?"); detection and trackability are the same statement about $M$ 🟨

`fig-good-features-to-track` · Shi–Tomasi points land only on corners — markers clustering on corners/junctions, never flat sky or single edges, with windows annotated by their eigenvalue pair; "good feature" = "well-conditioned $M$" = "corner" 🟨

⬜ figure not yet created

`fig-klt-pyramid-iteration` (per-feature coarse-to-fine: predict on a blurred level, warp, re-solve the $2\times2$ system, refine down the pyramid — the LK iteration localised to one patch) fig-klt-pyramid-iteration

`fig-track-drift-occlusion` · three ways a long track fails — drift (window slides off as sub-pixel errors accumulate), occlusion (dissimilarity spikes → drop the feature), re-detection (a fresh Shi–Tomasi point spawned to replace a lost one) 🟨

⬜ figure not yet created

`fig-modern-point-trackers` (KLT's single greedy patch vs a learned tracker — CoTracker — tracking *many* points *jointly* through an occlusion, recovering the point when it reappears) fig-modern-point-trackers

The previous section estimated a **dense** flow field — a motion vector at *every* pixel — between *one* pair of frames. This section does the complementary thing: follow a **sparse** handful of *distinctive* points across *many* frames. It is the same physics (brightness constancy, the aperture problem) restricted to the places where the math is well-behaved, and run forward in time. The payoff of the restriction is the chapter's spine: **the matrix that says a patch is a good corner to *detect* is the same matrix that must be invertible to *track* it.** Detect with it (Harris), track with it (KLT), refuse to track where it is singular (Shi–Tomasi). One $2\times2$ matrix, three jobs.

💡 **Big lesson (recurrence of L8 — a learned operator swaps a hand-designed prior for one learned from data):** classical KLT tracks a hand-built brightness patch by a hand-derived least-squares update; modern point trackers (PIPs, CoTracker) keep the *exact same task* — "where did this point go?" — but replace the patch with **learned features** and the greedy per-point update with a **learned, occlusion-aware, track-together** module. The skeleton (initialise a trajectory, iteratively refine it against appearance, predict visibility) is unchanged; only the operator is now fit to data. (→ see Big lesson **L8**; first appears in [[Deep learning]]. The same "neuralise the classical iteration" move powers RAFT in [[Optical flow]].)

equations

per-feature LK normal equations $M\,\Delta\mathbf{u}=\mathbf{b}$, with the **structure tensor** $M=\sum_{\mathbf{x}\in W} \begin{psmallmatrix}I_x^2 & I_xI_y\\ I_xI_y & I_y^2\end{psmallmatrix}$ and $\mathbf{b}=-\sum_{\mathbf{x}\in W} I_t\begin{psmallmatrix}I_x\\ I_y\end{psmallmatrix}$ → update $\Delta\mathbf{u}=M^{-1}\mathbf{b}$

**Shi–Tomasi trackability** $\min(\lambda_1,\lambda_2)>\lambda_{\min}$ (a feature is trackable iff $M$ is well-conditioned)

affine-warp **dissimilarity** $\varepsilon=\sum_W [\,I_2(A\mathbf{x}+\mathbf{d})-I_1(\mathbf{x})\,]^2$ (occlusion / lost-track test)

▸7.2.1 Tracking vs dense flow: sparse-but-long vs dense-but-short⧉

• **the two regimes** of correspondence (the taxonomy that organises this whole part): **dense optical flow** gives a 2-D motion vector at *every* pixel, but typically only between *two successive frames*; **feature tracking** gives motion at only a *sparse* set of points, but *threads them across a long sequence* of frames. Dense-but-short vs sparse-but-long. [`fig-track-vs-flow`]

• **why go sparse** (the case for tracking, from the slides' "motivations for sparse"): **scalability** — far fewer unknowns (a few hundred points, not a million pixels), which mattered enormously historically and still matters for real-time SLAM and on-device tracking; **robustness / conditioning** — *some pixels are genuinely easier to track than others*, so spend effort only where the answer is reliable (a corner), and refuse to guess in flat regions or along edges where it is hopeless (the aperture problem). Tracking is dense flow's *quality-controlled* sibling: it only reports where it is confident.

• **the Lagrangian framing** (forward-ref, made precise in [[Motion blur and temporal sampling]]): tracking **follows a particle** — "this *physical* point, where is it now?" — the **Lagrangian** view. Dense per-pixel flow leans **Eulerian** — "at this *fixed* pixel, what moved through?" This is the organising dichotomy for the part; tracking is its cleanest Lagrangian instance. (The slides note matching-based flow "becomes more Lagrangian" precisely by *following the patch* rather than differentiating at a fixed pixel.)

• **where tracked points go next**: the sparse trajectories are the *input* to structure-from-motion, SLAM, video stabilization, and bundle adjustment — wherever you need *the same physical points* identified over time, not a per-pixel flow. (Cross-ref [[Automatic panorama stitching from multiple views and feature matching]] — features → 3D.)

▸7.2.2 KLT = Lucas–Kanade, per feature, iterated over time⧉

• **the core move**: take the **Lucas–Kanade** patch solve from [[Optical flow]] and apply it *to one small window around a feature*, then *iterate*, then *repeat for the next frame*. That is the whole Kanade–Lucas–Tomasi (KLT) tracker. (Lucas & Kanade 1981 = the registration update; Tomasi & Kanade 1991 = packaging it as a feature *tracker*; Shi & Tomasi 1994 = *which* features.)

• **the per-patch system** (reuse, don't re-derive — it is LK restricted to a window $W$): assume every pixel in $W$ shares one displacement $\mathbf{u}=(u,v)$ and stack the brightness-constancy constraint $I_x u + I_y v + I_t = 0$ over the window. The least-squares normal equations are

$$M\,\Delta\mathbf{u}=\mathbf{b},\qquad M=\sum_{\mathbf{x}\in W}\begin{pmatrix}I_x^2 & I_xI_y\\ I_xI_y & I_y^2\end{pmatrix},\qquad \mathbf{b}=-\sum_{\mathbf{x}\in W} I_t\begin{pmatrix}I_x\\ I_y\end{pmatrix},$$

so the update is $\Delta\mathbf{u}=M^{-1}\mathbf{b}$. **That $M$ is exactly the structure tensor** of Harris (the second-moment matrix) — *the same matrix, no new math*. (Slides: "per-patch linear system… can be seen as per-patch 1-step Horn–Schunck with strong constant smoothness.") [`fig-klt-is-harris`]

• **iterate, because LK is only a linearisation**: a single solve assumes sub-pixel motion (brightness constancy was Taylor-linearised). So **warp** the second-frame patch by the current estimate, recompute $I_t$ on the warped patch, re-solve for a residual $\Delta\mathbf{u}$, and **add** it — repeat until convergence. This is Gauss–Newton on the patch SSD (Baker & Matthews 2004's account).

• **coarse-to-fine for large motion**: LK assumes motion of roughly $\lesssim 1$ pixel. For real videos, run the iteration **down a pyramid** (Bouguet 2000): solve on a heavily-blurred/downsampled level where the displacement *is* small, **propagate** the estimate as the initial warp for the next-finer level, refine. Same coarse-to-fine logic as dense flow, localised to each feature. [`fig-klt-pyramid-iteration`]

• **across time**: having localised the feature in frame $t{+}1$, use that as the start for frame $t{+}2$, and so on. KLT is this $2\times2$ solve threaded greedily down the whole clip — cheap (one tiny matrix inverse per feature per frame) but **greedy and memoryless**, which is exactly what makes drift and occlusion bite (below).

• **pseudocode**: a short KLT loop (per feature: iterate warp-recompute-$I_t$ → build $M,\mathbf b$ → $\Delta\mathbf u=M^{-1}\mathbf b$ → add; then carry $\mathbf u$ to the next frame) shows the three nested layers — converge, then thread over time.

▸7.2.3 Good features to track = where the structure tensor is well-conditioned⧉

• 💡 **the hinge of the chapter**: the update $\Delta\mathbf{u}=M^{-1}\mathbf{b}$ is only meaningful when **$M$ is invertible and well-conditioned**. And $M$'s conditioning is read off its **eigenvalues** $\lambda_1\ge\lambda_2$ — the *same* eigenvalues that classify a patch as flat / edge / corner in Harris. So "*is this a good feature to track?*" is literally "*is the tracking matrix well-conditioned?*" which is *literally* "*is this a corner?*" — one question wearing three hats. [`fig-klt-is-harris`, `fig-good-features-to-track`]

• **the three cases** (reused from the Harris derivation, now read as *trackability*):

• **flat region** ($\lambda_1\approx\lambda_2\approx 0$): $M$ is near-zero, $M^{-1}$ blows up — no texture, motion totally ambiguous. **Untrackable.**

• **edge** ($\lambda_1\gg\lambda_2\approx 0$): $M$ is rank-deficient — motion *along* the edge is unconstrained (the **aperture problem**). You can recover only the across-edge component. **Half-trackable** — and the unstable half drifts.

• **corner** ($\lambda_1,\lambda_2$ both large): $M$ is well-conditioned, motion fully determined in both directions. **Trackable.**

• **Shi–Tomasi "good features to track"** (1994): pick features by the simple, principled test

$$\min(\lambda_1,\lambda_2) > \lambda_{\min},$$

i.e. require the *smaller* eigenvalue to be large enough — *both* directions well-constrained. This is the conditioning of $M$ stated directly; it is the same $\min(\lambda_1,\lambda_2)$ Shi–Tomasi corner response cross-referenced in [[Automatic panorama stitching from multiple views and feature matching]]. The Harris response $R=\det M - k(\operatorname{tr}M)^2$ is a (determinant-trace) proxy for the same thing. **The detector and the tracker agree by construction** — Shi & Tomasi derived "what is a good feature to *detect*" *from* "what is a good feature to *track*", not the other way round. [`fig-good-features-to-track`]

• **practical recipe**: compute $M$ over a window at every pixel (image gradients → products → Gaussian-weighted sum, exactly the Harris pipeline), score by $\min(\lambda_1,\lambda_2)$, **non-max suppress**, keep the top-$N$ well-spread points — these are the features you then KLT-track.

• **the window-size tension** (Shi & Tomasi, slide "Rest of the story"): a *small* window localises precisely but sees little texture (poorly conditioned $M$) and is fooled by noise; a *large* window is better conditioned but blurs across **multiple motions** (a single $\mathbf{u}$ no longer fits — depth discontinuities, object boundaries). The sweet spot trades localisation against conditioning — the recurring small-aperture-vs-large-aperture tension of all patch methods.

▸7.2.4 What breaks long-term tracking: drift, appearance change, occlusion — and re-detection⧉

• **drift** — the cost of greediness: each frame's tiny LK residual is imperfect; chained over hundreds of frames the errors **accumulate** and the window slowly slides off the true feature (sub-pixel bias becomes pixels). The classic fix is to **anchor to the first frame**: track against the *original* template (frame $t_0$), not the previous frame, so error doesn't compound — but then **appearance change** fights you. [`fig-track-drift-occlusion`(a)]

• **appearance change** — the template goes stale: over a long clip the patch changes due to **lighting**, **out-of-plane rotation**, **scale** (the object nears/recedes), and perspective. A pure-translation template match degrades. Shi & Tomasi's fix: track translation frame-to-frame (small motion, cheap) but **verify against the anchor template with an *affine* warp** — allow the patch to rotate/scale/shear when checking quality. The affine **dissimilarity**

$$\varepsilon=\sum_{\mathbf{x}\in W}\big[\,I_2(A\mathbf{x}+\mathbf{d})-I_1(\mathbf{x})\,\big]^2$$

measures how well the current patch still matches its origin under the best affine warp $A$; a large $\varepsilon$ means the feature is no longer itself.

• **occlusion** — the point disappears: when a feature passes **behind** another object (or off-frame, or into a shadow/specular flip), there is *no* correct match — but a greedy LK solve will happily lock onto *whatever* nearby texture minimises SSD, silently corrupting the track. **Detection: the dissimilarity $\varepsilon$ spikes** → declare the feature **lost** and drop it rather than trust a garbage match. (Slides "Issue: Motion Boundaries — no matches can be found at motion boundaries… also at frame boundary.") [`fig-track-drift-occlusion`(b)]

• **re-detection** — keep the set populated: as features are lost to drift/occlusion/leaving-frame, periodically **re-run Shi–Tomasi** and spawn fresh features in the gaps, so the tracked set stays dense enough and well-distributed. Detection and tracking become an interleaved loop (slides: "Combine tracking and detection; keep refining detection model over time"). [`fig-track-drift-occlusion`(c)]

• **the honest limitation** that motivates the next section: classical KLT is **per-point, greedy, and forgetful** — each feature is tracked independently with no memory of where it went and no way to *recover* a point after it reappears from behind an occluder. Points also **drift relative to each other** (no global consistency). These are exactly the failures learned trackers were built to fix.

▸7.2.5 Modern point trackers (briefly)⧉

• **the shift** (the *bitter-lesson* successor to KLT; **L8**): keep the task — "track this physical point through a video, even through occlusion" — but replace the hand-built brightness patch with **learned features** and the greedy per-frame solve with a **learned iterative update**, trained on synthetic video with ground-truth tracks (FlyingThings3D, Kubric, etc.). The classical scaffolding survives *as architecture*: a 4-D **cost volume** (appearance similarity), **iterative refinement** of the trajectory, and now an explicit **visibility / occlusion** prediction. [`fig-modern-point-trackers`]

• **PIPs — Particle Video Revisited** (Harley et al. 2022): tracks **one point at a time** over a temporal window; solves jointly for its $(x,y)$ location *and* an appearance vector across all frames, **predicts visibility per frame**, and **initialises with a constant** location/appearance then iteratively refines (RAFT-style). The reintroduction of *long-range, occlusion-aware* particle tracking that KLT lacked.

• **TAP-Vid** (Doersch et al. 2022): frames the modern task — **Track Any Point** — and supplies the **benchmark** that made progress measurable (the long-standing bottleneck, as in optical flow, is **ground-truth data**).

• **CoTracker — track points *together*** (Karaev et al. 2023): the key insight, stated as a slogan, is that **tracking points jointly beats tracking them independently** — points on the same object constrain each other, giving **global consistency** and far better **occlusion handling** (a point hidden behind an object is inferred from its visible neighbours). Mechanism: a **transformer** over a sliding window with **factorised attention across space, time, and the group of tracks**, iteratively updating each track's position/appearance/visibility — plus extra **register tokens** so the network has scratch space for global info (otherwise it hijacks real pixels, causing spiky artifacts).

• **why this closes the section**: every failure of greedy KLT — drift, single-point myopia, silent occlusion corruption — is answered by *learn the features, refine iteratively, track together, predict visibility*. The structure-tensor "is this point trackable?" question is now answered implicitly by the learned features and the joint optimisation, but the **problem** is identical to the one Lucas, Kanade, and Tomasi posed. (Full model details: [[Deep learning]]; the same *neuralise-the-classical-iteration* pattern is RAFT in [[Optical flow]].)

7.3 Robustness: the ratio test and RANSAC⧉

7.4 Deep learning approaches to sparse matching⧉

7.5 Misc: fast matching⧉

▸7.6 Optical flow⧉

`fig-flow-field-colorkey` · from two frames to a dense field — a per-pixel flow shown with the standard colour key (hue = direction, saturation = speed, after Baker et al. 2011); coherent regions read as one colour, static background near-white ✅

⬜ figure not yet created

`fig-brightness-constancy` (a patch at $(x,y,t)$ reappears at $(x+u,y+v,t+1)$ with the *same* intensity — the assumption, and where it breaks: lighting change, specularity, occlusion) fig-brightness-constancy

`fig-flow-constraint-line` · one equation, a line of answers — $I_x u + I_y v + I_t = 0$ as a line in the $(u,v)$ plane; the gradient fixes only the normal component (normal flow), the along-edge tangent undetermined 🟨

`fig-aperture-problem` · why one pixel is not enough — a straight edge through an aperture (three true motions, identical appearance, only normal recoverable) and the barber-pole illusion (diagonal stripes appearing to move straight up) 🟨

`fig-zoom-mechanism` · a zoom at two focal lengths (short-f/wide vs long-f/tele): moving variator + compensator groups slide along the axis to change f while the focal plane (sensor) stays fixed; motion arrows between states ✅

`fig-flow-corner-edge-flat` · the structure tensor $A^\top A$ deciding where flow is solvable — corner (two large eigenvalues, full 2-D flow), edge (rank-deficient, only normal flow), flat (indeterminate); the same picture that selected Harris corners 🟨

⬜ figure not yet created

`fig-LK-vs-HS` (Lucas–Kanade: solve a tiny system per window, local fig-klt-pyramid-iteration

`fig-mc-as-flow` · same correspondence, two budgets — a dense smooth optical-flow field vs the codec's one-constant-vector-per-block field; the same "where did this come from?" coarsened to what is cheap to estimate and transmit 🟨

`fig-coarse-to-fine-flow` · making large motion sub-pixel — a Gaussian pyramid where coarse displacement is fractional; at each level upsample-and-scale, warp, estimate the small residual and add; a Laplacian view of the flow 🟨

`fig-raft-skeleton` · RAFT, the classical pipeline neuralized — learned feature + context encoders, an all-pairs 4-D correlation volume, and a recurrent GRU update operator iterating; the boxes line up with data term, cost volume, and warp-then-refine 🟨

💡 **Big lesson (the chapter's core — *brightness constancy + the aperture problem*):** track a point by the one thing that's (assumed) conserved — its **brightness**. Linearizing that single scalar equation gives **one constraint per pixel**, $I_x u + I_y v + I_t = 0$, for **two** unknowns $(u,v)$ — so locally motion is **fundamentally underdetermined**: you can only see the component **along the image gradient** (normal flow), never the component **along an edge** (the *aperture problem*). Every flow method is a strategy for supplying the missing second equation — from a **neighbor** (Lucas–Kanade: pixels in a patch share a motion → the well-/ill-posedness is the **same corner/edge/flat structure tensor as Harris**), from a **smoothness prior** (Horn–Schunck), or from **learned context** (RAFT). *Only the gradient carries motion information, and one gradient is never enough.*

▸7.6.1 What optical flow is — and whether it is even well-defined⧉

• **definition**: the **2-D motion vector of every pixel** between two (usually consecutive) frames — a dense field $(u(x,y),\,v(x,y))$. Visualized with a **color key**: hue = motion direction, saturation = speed (Baker et al. 2011). [`fig-flow-field-colorkey`]

• **why it matters (downstream)**: frame interpolation / slow-motion, video super-resolution, in-painting, stabilization, special effects; action recognition, tracking, surveillance; egomotion / SLAM; **video compression** (motion-compensated prediction); medical registration; even optical mice and fluid-flow physics. A workhorse primitive.

• **relation to stereo**: disparity is optical flow **specialized to a static scene with a known camera baseline** — the search collapses to the **1-D epipolar line**, so there's a **single unknown (depth/disparity) per pixel** instead of two. Flow is the unconstrained 2-D cousin (cross-ref [[Image alignment]] / multi-view).

• **honesty up front**: optical flow is **ill-defined** in places — flat untextured regions (no information), occlusion / motion boundaries (a pixel *has* no correspondence), lighting changes and specularities (brightness *isn't* constant). Worth keeping in mind that "the flow" is an idealization that the math itself will warn us is sometimes unrecoverable.

• **the progression of "what is a match"** (foreshadow the chapter and its sequels): per-pixel **RGB + brightness constancy + gradient** (this chapter) → **window/patch** matching (SSD, cost volume) → hand-designed **features** (SIFT/HOG) → **learned** features (RAFT and successors). Each step buys robustness at the cost of locality.

▸7.6.2 Brightness constancy and the optical-flow constraint⧉

• **the assumption (brightness / color constancy)**: a scene point keeps the same intensity as it moves over one timestep — it just *relocates*:

$$I(x,\,y,\,t) = I(x+u,\;y+v,\;t+1).$$

($u,v$ = the per-pixel displacement we want.) This is the "color constancy assumption" of Horn & Schunck 1981. [`fig-brightness-constancy`]

• **linearize (first-order Taylor)**: expand the right side about $(x,y,t)$, assuming small motion:

$$I(x+u,y+v,t+1) \approx I(x,y,t) + I_x\,u + I_y\,v + I_t.$$

Subtract $I(x,y,t)$ from both sides of brightness constancy → the **optical-flow constraint equation (OFCE)**:

$$\boxed{\,I_x\,u + I_y\,v + I_t = 0\,}\qquad\text{equivalently}\qquad \nabla I\cdot(u,v) + I_t = 0,$$

with $I_x,I_y$ the **spatial** gradients and $I_t$ the **temporal** gradient (frame difference).

• 💡 **"only the gradient matters"**: motion is invisible in **flat** regions — where $\nabla I = 0$ the constraint says $I_t=0$ and tells you *nothing* about $(u,v)$. Information about motion lives **entirely in the image gradient**. (This is exactly the lever Eulerian video magnification pulls: amplify the *temporal* change $I_t$ in proportion to the spatial gradient.)

• **the constraint is a line, not a point**: one scalar equation in two unknowns. In the $(u,v)$ plane it's a **line**; the true flow lies *somewhere on it* but is otherwise undetermined. We can pin down only the component **along $\nabla I$**: the **normal flow** $\;u_\perp = -I_t / \lVert\nabla I\rVert$ (magnitude), directed along the gradient. The component **tangent to the edge is free**. [`fig-flow-constraint-line`]

▸7.6.3 The aperture problem⧉

• **the statement**: from a *single* pixel (one OFCE) you can recover **only the normal flow** — the motion **perpendicular to the local edge**. The motion **along** the edge is unobservable. *One equation, two unknowns.* [`fig-aperture-problem`]

• **the intuition (the "aperture")**: watch a moving straight edge through a small hole. A long diagonal bar sliding behind the aperture looks the same whether it moves horizontally, vertically, or perpendicular to itself — you cannot tell. Only motion that **changes what's inside the aperture** (the ⟂ component) is visible. The **barber-pole** illusion (stripes appear to move *up* though the pole spins sideways) is the same effect.

• **why it's not a bug**: the math is *honestly reporting* a genuine ambiguity — the data truly don't determine the tangential motion locally. So "solving" flow = **adding a second assumption** to supply the missing equation. The two classic choices:

• **a local one** — neighbors share the motion → **Lucas–Kanade**;

• **a global one** — the field is smooth → **Horn–Schunck**.

• **the perceptual flip side**: human vision suffers the aperture problem too (we mis-perceive these stimuli), and resolves it by **integrating across the object** (corners, line-endings) — exactly what the algorithms do by aggregating a window or imposing smoothness.

▸7.6.4 Lucas–Kanade — local constant-flow least squares (and the structure tensor)⧉

• **the assumption**: pixels in a **small window** $W$ all share the **same** motion $(u,v)$. Now we have **many** OFCEs (one per pixel in $W$) for the **two** unknowns — over-determined → **least squares** resolves the aperture problem *and* regularizes against noise. (Lucas & Kanade 1981.)

• **stack the equations**: with $n$ pixels in $W$, form

$$A = \begin{bmatrix} I_x^{(1)} & I_y^{(1)} \\ \vdots & \vdots \\ I_x^{(n)} & I_y^{(n)} \end{bmatrix},\quad b = \begin{bmatrix} I_t^{(1)} \\ \vdots \\ I_t^{(n)} \end{bmatrix},\qquad A\begin{bmatrix}u\\v\end{bmatrix} = -b.$$

The least-squares solution solves the **normal equations**:

$$\boxed{\,A^\top A \begin{bmatrix}u\\v\end{bmatrix} = -A^\top b\,},\qquad A^\top A = \sum_W \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix},\quad A^\top b = \sum_W \begin{bmatrix} I_x I_t \\ I_y I_t\end{bmatrix}.$$

• 💡 **the matrix $A^\top A$ is the Harris structure tensor** (the second-moment matrix). The very same object decides corner detection ([[Automatic panorama stitching from multiple views and feature matching]]) and flow conditioning — **flow is solvable exactly where Harris finds a corner**:

• **two large eigenvalues** (a **corner** / well-textured patch): $A^\top A$ invertible, full 2-D flow recoverable, well-conditioned. *These are the "good features to track."*

• **one large eigenvalue** (an **edge**): rank-deficient → only **normal flow** (the aperture problem, *re-derived* — along-edge motion is the null direction).

• **both small** (a **flat** region): no information; flow indeterminate.

This is the bridge from dense flow to **sparse feature tracking** (KLT): just track the pixels where $A^\top A$ is well-conditioned. [`fig-flow-corner-edge-flat`]

• **scope & relation to HS**: LK assumes **small motion** ($\sim$1 px, from the linearization) and a window large enough to be informative but small enough to be locally constant. It can be read as a **per-patch, one-step Horn–Schunck with a hard "constant flow" smoothness** — local and direct, vs HS's global and iterative. Extended to large motion via **coarse-to-fine pyramids** and iterative warping. LK is also the engine of **parametric image alignment** (register two images under one affine/homography — same gradient least-squares; cross-ref [[Image alignment]]). [`fig-LK-vs-HS`]

▸7.6.5 Horn–Schunck — global smoothness regularization⧉

• **the assumption**: instead of constancy in a window, impose that the flow field is **globally smooth** — *neighboring pixels have similar motion*. This fills in the unconstrained (aperture / flat) directions from the data-rich ones, propagating motion into textureless regions. (Horn & Schunck 1981.)

• **the energy** (data-fit + prior — the classic inverse-problem shape):

$$E(u,v) = \iint \underbrace{\big(I_x\,u + I_y\,v + I_t\big)^2}_{\text{data: brightness constancy}} \;+\; \lambda\,\underbrace{\big(\lVert\nabla u\rVert^2 + \lVert\nabla v\rVert^2\big)}_{\text{prior: smoothness}}\;\,dx\,dy.$$

The **data term** is the OFCE residual (penalizes violating brightness constancy); the **smoothness term** penalizes spatial *variation* of the flow; $\lambda$ trades them off (large $\lambda$ → smoother, more diffused; small $\lambda$ → trusts the data, noisier).

• **it's convex / linear to solve**: quadratic in $(u,v)$ → the Euler–Lagrange equations are **linear** (a large sparse system coupling all pixels — a screened-Poisson-like diffusion). Solve by **iterative** updates (Gauss–Seidel / Jacobi): each pixel's flow relaxes toward its **neighborhood average** corrected by the local brightness constraint. Globally smooth, fills the apertures.

• **LK vs HS — the contrast to land**: **local window + least squares** (LK) vs **global field + smoothness prior** (HS); **one-step** vs **iterative**; **constant-in-a-patch** vs **smoothly-varying-everywhere**. Two ways to supply the aperture problem's missing equation — *aggregate locally* or *regularize globally*. [`fig-LK-vs-HS`]

• **modern classical flow** — the *Secrets* paper (Sun, Roth & Black 2010): the bones are still HS, but what actually wins is **robust** (non-quadratic, $L_1$/Charbonnier) penalties on both terms (so edges/occlusions don't over-smooth), **median filtering** of the flow between iterations, **coarse-to-fine pyramids**, and **edge-aware / non-local** smoothness. The lesson: the *energy is the design* — same skeleton, better data/prior terms.

▸7.6.6 Large motion: coarse-to-fine warping⧉

• **the problem**: the Taylor linearization (hence both LK and HS) assumes **sub-pixel** motion. Real motion is many pixels — the linearization is invalid and the solver diverges or finds the wrong (aliased) match.

• **the fix — pyramids** (cross-ref [[Linear pyramids and wavelets]]): build a **Gaussian/Laplacian pyramid** of both frames. On the **coarsest** (heavily blurred & downsampled) level, even large motion is only a fraction of a pixel → the linear method works. Then, going finer:

1. **upsample** the coarse flow to the next level (and scale its magnitude);

2. **warp** frame 2 toward frame 1 by the current flow estimate ([[Warping and resampling]]);

3. estimate only the **residual** (now small) flow on the warped pair, and **add** it to the running estimate.

Repeat to the finest level. [`fig-coarse-to-fine-flow`]

• **why it's the right picture**: this is a **Laplacian-pyramid** view of flow — get the *low-frequency / large-scale* motion on blurry images, then add the *high-frequency* motion as a residual at each finer scale. The warp-then-refine loop is exactly the iterative scheme that **RAFT** and PWC-Net later learn.

• **pseudocode**: a coarse-to-fine block (loop levels: upsample-and-scale `flow` → `warp(frame2, flow)` → estimate residual `ΔLK` → add) makes the warp-then-refine loop explicit — the scheme RAFT later neuralizes.

• **a different attack — the cost volume**: rather than linearize, **match** patches directly. For each pixel in frame 1, score similarity (SSD / correlation) against candidate offsets in frame 2 → a **cost volume** (4-D: $(x_1,y_1)\times(x_2,y_2)$, or per-displacement). This keeps brightness constancy but **drops the linearization** (more "Lagrangian"), handling large motion at the cost of a big search — pruned by **coarse-to-fine** and (later) learned features. The cost volume is the centerpiece of learned flow.

▸7.6.7 Learned flow — RAFT (neuralize the classical pipeline)⧉

• **the framing (L8)**: keep the classical *skeleton* — match (data term) + iterative update (with smoothness/context) — but **learn** the pieces instead of hand-designing them. (Forward-ref [[Machine learning]]: a learned operator swaps a hand-built prior/feature for one fit to data.)

• **RAFT** (Teed & Deng 2020, ECCV best paper) — the three learned ingredients on the classical scaffold: [`fig-raft-skeleton`]

1. **learned per-pixel features** replace raw RGB in the matching (robust to lighting, texture) — plus a separate **context** encoder of frame 1 for edge-aware regularization;

2. an **all-pairs 4-D correlation (cost) volume** — precompute similarity between every pair of pixels, pooled into a **pyramid** for multi-scale lookup (the cost-volume idea, made efficient);

3. a **recurrent update operator** (GRU, shared weights) that **iteratively refines** the flow field — at each step it **looks up** correlations at the current flow estimate and predicts an increment, exactly **mimicking an iterative optimizer** (the unrolled HS/coarse-to-fine loop), but with the update *learned*. All differentiable, trained end-to-end.

• **why flow got the CNN treatment late**: flow is **non-local** (a pixel can match anywhere) — hard for ordinary convolutions; the explicit **cost volume** is the inductive bias that makes non-local matching tractable. Training data is scarce (real flow ground truth is hard), so synthetic sets (FlyingChairs/Things, Sintel, KITTI) and **self-supervised warping losses** (SMURF) do heavy lifting.

• **the through-line to land**: RAFT is the classical pipeline **with its hand-crafted parts (features, the iterative minimizer) replaced by learned ones** — *scaffolding from traditional methods, weights from data*. It is **not** "bitter-lessoned" into a generic transformer: the cost-volume + recurrent-update **inductive bias still wins**, which is itself the lesson — non-local matching is hard enough that problem-specific structure still pays. (Lineage to note: FlowNet 2015, the first CNN flow; PWC-Net 2018, the *pyramid+warping+cost-volume* "Lucas–Kanade-flavored" net; transformer variants for the cost volume.)

• **closing hand-off**: with a dense, automatic correspondence field in hand, the **transport** tasks open up — warp/interpolate between frames for **slow-motion** ([[Frame interpolation and slow-motion synthesis]]), stabilize, compress, magnify motion ([[Morphing]] — motion magnification), or lift to **scene flow / SfM** (the 3-D cousins). *Correspondence first, then transport* — the part's spine.

7.7 Deep learning approaches to optical flow⧉

▸Part 8 SINGLE IMAGE COMPUTATIONAL PHOTOGRAPHY⧉

▸8.1 Recap: tone mapping⧉

`fig-tonemap-taxonomy` · a taxonomy of local tone mapping — bilateral base/detail, gradient-domain, local Laplacian, exposure fusion, learned — placed on a shared axis 🟨

⬜ figure not yet created

`fig-darkroom-dodge-burn` (a Zone-System / dodge-&-burn darkroom print beside its digital local-tone-mapping analogue) fig-editing-lr-vs-ps

This chapter doesn't introduce new algorithms — it **gathers** the tone-mapping thread that has run since BASIC into one picture, and ties it back to the **analog darkroom** where tone control began. (Scope note: this single-image recap deliberately covers the **single-image** operators; the **multi-image** tone mapping — HDR *merge* / exposure fusion — lives in [[Multiple exposure imaging]] and is only forward-referenced here. **See Known Issue.**)

▸8.1.1 Global vs local, re-hashed⧉

• **global** tone mapping applies **one curve to every pixel** (a function of intensity alone — gamma, an S-curve, Reinhard's global operator). Simple, fast, monotone — but a single curve **cannot both compress the overall range and preserve local contrast**: flatten the range and local detail goes with it.

• **local** tone mapping lets the mapping **depend on a pixel's neighbourhood** — compress the *large-scale* range while *keeping* local contrast. The cost is the risk of **halos** (and reversals) where the locality is mishandled — the recurring failure mode introduced in BASIC.

• the recap point: every local operator below is a different answer to the same question — *how do you separate "overall brightness" (compress it) from "local detail" (keep it)?*

▸8.1.2 A taxonomy of local methods⧉

• **gradient-domain** (Fattal et al. 2002): attenuate **large gradients**, then **reconstruct** the image from the edited gradient field (a Poisson solve) — compress range by squashing big luminance steps while leaving small ones. Developed in [[Poisson image editing]].

• **bilateral base/detail** (Durand & Dorsey 2002): split the image into a **base** (large-scale, edge-preserving bilateral) and **detail**; **compress the base**, **keep the detail**, recombine. The canonical edge-aware local operator — developed in [[Edge-preserving techniques]].

• **local Laplacian** (Paris, Hasinoff & Kautz 2011): manipulate a **Laplacian pyramid** with a per-coefficient remapping that **avoids halos by construction** — local contrast control without the bilateral's artifacts. [[Edge-preserving techniques]].

• **exposure fusion** (Mertens et al. 2007): blend a **bracket of exposures** by per-pixel quality weights (well-exposedness, contrast, saturation) in a pyramid — *tone mapping without ever forming an HDR image.* (It straddles single/multi-image — flagged in the Known Issue.)

• **learned** (HDRnet, Gharbi 2017; FiveK retouching): **predict** a local (bilateral-grid affine) tone mapping from data — the operator itself learned from expert edits. [[Deep learning]].

• the table to land: same goal (separate and treat large-scale vs local), five mechanisms — gradient field, base/detail, pyramid remap, multi-exposure blend, learned. [fig-tonemap-taxonomy]

▸8.1.3 The darkroom ancestor: dodge & burn and the Zone System⧉

• tone mapping is **not** a digital invention — photographers have **locally controlled tone** since the wet darkroom. **Dodge** (hold light back from a region → lighter) and **burn** (add light → darker) are **manual local tone mapping** with the hands and bits of card under the enlarger.

• the **Zone System** (Ansel Adams): a disciplined method for **mapping scene luminances onto print tones** — meter the scene, place key values on chosen "zones," and develop/print to control where highlights and shadows land. It is **tone mapping as craft**: deciding the global transfer *and* the local adjustments before computation existed.

• the analog **characteristic curve** of film/paper (the H&D curve — toe, linear, shoulder) is the **global** tone curve in physical form; its shoulder is highlight roll-off, its toe is shadow compression — exactly what a digital global operator emulates.

• the bridge to land: digital global tone mapping = the **characteristic curve**; digital local tone mapping = **dodge & burn + the Zone System**, now automatic and per-pixel. The computation **mechanized** a century-old craft (cross-ref [[Human factors and the art of photography]]; the Intro's Adams "negative = score, print = performance" analogy).

▸8.2 Super-resolution and image priors⧉

`fig-sr-ill-posed` · many sharp HR images collapse to the same blurry LR image — the SR inverse is one-to-many (ill-posed); a prior picks one 🟨

⬜ figure not yet created

`fig-sr-recon-vs-halluc` (one low-res crop → reconstruction-based result from a burst (real detail) vs hallucination-based result from a learned prior (invented but plausible detail)) fig-sr-recon-vs-halluc

`fig-burst-subpixel` · hand-jitter across a burst lands samples between the LR grid points; merging the sub-pixel-shifted frames fills the HR grid 🟨

⬜ figure not yet created

aliasing made useful, after Wronski 2019)

`fig-pnp-loop` · Plug-and-Play / RED: alternate a data-fidelity step with a denoiser-as-prior step until convergence — the prior is a black-box denoiser 🟨

💡 **Big lesson (L10 · The prior is not optional):** when the measurement genuinely destroys information — super-resolution past the sensor's sampling, deblurring at killed frequencies, inpainting a hole — *no amount of cleverness recovers it from the data alone*. A **prior** (what natural images look like) is what selects an answer; it is a load-bearing part of the algorithm, not a tuning knob. The honest split: **reconstruction** priors fuse genuinely-measured extra samples; **hallucination** priors *invent* plausible detail that was never measured. (Registered in [[Big Lessons]] as **L10**; first appears here; recurs in [[#Blind deblurring]] — blind deblurring & dark-channel prior — and forward in [[Generative AI and diffusion]].)

equations

SR forward model $y = (k * x)\downarrow_s + n$ (blur by PSF $k$, downsample by factor $s$, add noise $n$)

MAP / variational recovery $\hat{x} = \arg\min_x \tfrac{1}{2}\lVert (k*x)\downarrow_s - y\rVert^2 + \lambda\, \Phi(x)$ (data-fit + prior $\Phi$)

PnP iteration (HQS form) — alternate a **data-fit** step $z \leftarrow \arg\min_x \tfrac{1}{2}\lVert (k*x)\downarrow_s - y\rVert^2 + \tfrac{\rho}{2}\lVert x - \tilde{x}\rVert^2$ and a **denoise** step $\tilde{x} \leftarrow \mathcal{D}_\sigma(z)$ (the prior, *any* denoiser $\mathcal{D}$)

RED gradient $\nabla \Phi(x) = x - \mathcal{D}(x)$ (prior gradient = residual of a denoiser)

Tweedie $\hat{x} = z + \sigma^2\, \nabla_z \log p_\sigma(z)$ (the score IS a Gaussian denoiser → diffusion = iterated $\mathcal{D}$)

▸8.2.1 What problem super-resolution solves (and why it's ill-posed)⧉

• **goal**: produce a higher-resolution image than the camera measured — more pixels that are *meaningfully* sharper, not just upsampled. Contrast with plain **interpolation** (bicubic), which adds pixels but no new frequencies → soft.

• **the forward model**: a low-res measurement is a high-res scene $x$ **blurred** by the optics/PSF $k$, **downsampled** by factor $s$, plus **noise** $n$: $y = (k * x)\downarrow_s + n$. Exactly the linear forward model $y = Ax + n$ of [[Linear Inverse Problems and Regression]], with $A = (\text{downsample})\cdot(\text{blur})$. [`fig-sr-forward-model`]

• **why it's ill-posed**: downsampling is **many-to-one** — it discards the high frequencies above the new Nyquist limit. Infinitely many high-res $x$ pass through the *same* $y$ (add any detail that the blur+downsample would have annihilated and $y$ is unchanged). So $A$ has no usable inverse; the data **under-determines** the answer. [`fig-sr-ill-posed`]

• **the resolution — a prior**: recover with **data-fit + regularizer**, $\hat{x} = \arg\min_x \tfrac12\lVert (k*x)\downarrow_s - y\rVert^2 + \lambda\,\Phi(x)$. The data term keeps $\hat{x}$ consistent with what was measured; $\Phi(x)$ (the **prior**) picks, among all consistent $x$, the one that *looks like a real image*. This is the MAP reading: $\Phi = -\log p(x)$, a prior × likelihood (Refreshers — Bayes).

• 💡 **the throughline**: without $\Phi$ the problem has no answer — see the **L10 big-lesson box** above. Everything below is *which* prior and *how* it's applied.

• **intuition-first framing**: SR is not "magnifying glass for files." It is "guess the most plausible scene that would have produced this blurry, aliased, noisy little image" — and *plausible* is defined by the prior.

▸8.2.2 Scenarios: single-image, burst, and hybrid space–time⧉

• **classic single-image SR**: one frame in, one bigger frame out. The data fixes nothing new above Nyquist → **all** the extra detail comes from the prior. Two classic priors: **internal** (patches recur across scales within the *same* image — Glasner 2009) and **external/example-based** (a database / learned model of how low-res patches map to high-res — Freeman 2002). Pure single-image SR is therefore **hallucination-leaning** by necessity. [`fig-sr-recon-vs-halluc`, right panel]

• **multi-frame / burst SR** (Wronski 2019, the Pixel "Super Res Zoom"): capture a **burst** of frames; **hand tremor** shifts the scene by **sub-pixel** amounts between frames. Each frame samples the *same* continuous scene at slightly different grid offsets → align with sub-pixel **registration** and fuse → a **denser** effective sampling grid. This is **genuine information**: aliasing (normally a defect) becomes the signal, because the shifts let you separate frequencies a single frame folded together. No dedicated prior needed for the *recovered* detail — the frames supply it (though robust fusion still needs a noise/motion model). [`fig-burst-subpixel`]

• key sub-points: **sub-pixel alignment is the whole game** (mis-registration → blur/ghosting); **robust merge** rejects moving objects / occlusion (else ghosts); replaces the demosaic step too (the color mosaic's offsets help). Ties to BASIC — sampling/aliasing and to multi-frame denoising ($1/\sqrt{N}$ averaging).

• **hybrid space–time SR**: trade **temporal** resolution for **spatial** (or vice versa) — combine several low-spatial/high-temporal captures (or a fast + a slow camera) so motion provides extra spatial samples and vice versa. Same idea as burst, generalized to the time axis.

• **the organizing axis** — *how much is measured vs invented*: many sub-pixel-shifted frames ⇒ mostly **reconstruction**; a single frame ⇒ mostly **hallucination**; real systems sit in between. This axis is the next section. [`fig-sr-recon-vs-halluc`]

▸8.2.3 Reconstruction vs hallucination — measured detail vs invented detail⧉

• **the honest distinction**, and why it matters for a reader: both make a sharper picture, but only one adds **truth**.

• **reconstruction-based**: the extra detail was **genuinely measured** — by multiple sub-pixel-shifted frames (burst), or by deconvolving frequencies the optics merely attenuated (not killed). It is *recovered*. Verifiable, repeatable, faithful.

• **hallucination-based**: a **learned prior** synthesizes detail that is *plausible* for natural images but was **never in the measurement** — pores on a face, text on a distant sign. It can look stunning and be **wrong** (invent the wrong digits, the wrong texture). [`fig-sr-recon-vs-halluc`]

• **the learned hallucination models** (used here as **priors**, treated as *models* in [[Deep learning]]): **Real-ESRGAN** (Wang 2021 — GAN-trained on a realistic degradation pipeline, sharp/textured outputs) and **SwinIR** (Liang 2021 — transformer restoration). They learn $p(\text{high-res} \mid \text{low-res})$ from data and sample a plausible completion. **Forward-ref**: their architecture and training live in the ML chapter; here they're one end of the prior axis.

• **caveats to surface** (pedagogical honesty): metric trap — a hallucinated result can score *better* on PSNR/SSIM yet be unfaithful (and a faithful one can score worse); the **perception–distortion tradeoff**. Forensic/ethical note: SR'd faces and license plates are **not evidence**. (Cross-ref learned perceptual metrics — LPIPS — in [[Deep learning]].)

• **resolve the open "in ML at this point?" question** — **verdict: keep super-resolution HERE**, as the *image-prior* story that unifies the whole single-image part. The chapter owns the **prior** abstraction (ill-posedness → regularizer → denoiser-as-prior → diffusion). The **learned models** themselves (Real-ESRGAN, SwinIR, GANs, diffusion nets) are taught as models in [[Deep learning]] / [[Generative AI and diffusion]], cross-referenced — not duplicated. So: priors here, models there.

▸8.2.4 Denoising as a universal prior — Plug-and-Play and RED⧉

• **the unifying move**: in $\hat{x} = \arg\min_x \tfrac12\lVert Ax - y\rVert^2 + \lambda\Phi(x)$, splitting algorithms (**ADMM / HQS**) separate the two terms into alternating steps. The regularizer step turns out to be **exactly a denoising operation** (the proximal operator of $\Phi$ = "find the nearby image the prior likes" = denoise). [`fig-pnp-loop`]

• **Plug-and-Play (PnP) priors** (Venkatakrishnan 2013): so **drop in *any* denoiser** $\mathcal{D}_\sigma$ for that step — even one with **no closed-form prior** (BM3D, NL-means, a learned CNN). Alternate:

• **data-fit step**: $z \leftarrow \arg\min_x \tfrac12\lVert (k*x)\downarrow_s - y\rVert^2 + \tfrac{\rho}{2}\lVert x - \tilde{x}\rVert^2$ (enforce the measurement; a linear solve — CG/FFT, per [[Linear Inverse Problems and Regression]]);

• **denoise step**: $\tilde{x} \leftarrow \mathcal{D}_\sigma(z)$ (impose the prior — *any* denoiser).

• payoff: **decouples** the imaging physics (the forward model $A$, swap per task — SR, deblur, inpaint, demosaic) from the **image prior** (the denoiser). One good denoiser ⇒ a solver for *every* inverse problem. (FlexISP, Heide 2014, builds a whole ISP this way.) [`fig-denoiser-as-prior-spectrum`]

• **RED (Regularization by Denoising)** (Romano 2017): make the prior **explicit** from a denoiser. Define $\Phi(x) = \tfrac12 x^\top(x - \mathcal{D}(x))$; under mild conditions its **gradient is just the denoising residual**, $\nabla\Phi(x) = x - \mathcal{D}(x)$. Then plain **gradient steps** minimize the regularized objective — the denoiser defines an actual energy, not only an opaque proximal step. (PnP ≈ proximal/implicit; RED ≈ explicit-gradient.)

• **why "universal"** (the section's headline): the **denoiser is the prior**. Improve the denoiser and *every* recovery task improves for free. This reframes "image prior" from a hand-derived penalty (TV, sparsity) to "whatever your best denoiser knows." [`fig-denoiser-as-prior-spectrum` — the spectrum from TV → BM3D → learned → diffusion]

• 💡 **callback to L10**: PnP/RED is the *mechanism* by which the not-optional prior enters — the denoise step IS the prior step.

▸8.2.5 Diffusion is iterated denoising (the continuous limit)⧉

• **Tweedie's formula**: for a Gaussian-noised observation $z = x + \sigma\epsilon$, the **MMSE denoiser** (posterior mean) is $\hat{x} = z + \sigma^2\,\nabla_z \log p_\sigma(z)$. So the **score** $\nabla_z \log p_\sigma(z)$ and a **Gaussian denoiser** are the *same object* (one is the other rearranged). [`fig-denoiser-as-prior-spectrum`, right end]

• **therefore diffusion = iterated denoising**: a diffusion model learns the score at every noise level; **sampling** repeatedly applies "denoise a little, add a little noise, repeat" from pure noise down to a clean image. That outer loop is precisely **PnP/RED taken to the continuous limit** — the prior-step denoiser is the learned score, run over a noise-level schedule.

• **consequence for recovery**: conditioning diffusion sampling on a measurement ($y$ via the forward model $A$) = a PnP/RED solver whose prior is the strongest learned denoiser we have → state-of-the-art SR / deblur / inpaint. This is the bridge from this chapter's *prior* story to generative models.

• **forward-ref**: the full diffusion treatment (DDPM, latent diffusion, score-matching, samplers) is in [[Generative AI and diffusion]]; here we only establish the **identity** denoiser ≡ score ≡ diffusion-step, closing the prior throughline. The diffusion chapter cross-refs back to *Denoising as a universal prior*.

• **closing idea** (hand-off to [[#Blind deblurring]]): the same "data-fit + prior, and the prior is doing the heavy lifting" story now plays out where the **forward operator itself is unknown** — blind deblurring — and where hand-built priors (dark-channel) still earn their keep.

`fig-naive-deconv-noise` · naive inverse filtering explodes: dividing by the blur's spectrum amplifies noise at its near-zeros 🟨

`fig-wiener-tradeoff` · the Wiener filter as a regularised inverse — the $1/\mathrm{SNR}$ term trades deconvolution sharpness against noise amplification 🟨

`fig-natural-image-gradient-prior` · natural-image gradients are heavy-tailed (mostly flat, rare strong edges) vs a Gaussian — the sparse prior that breaks the blind tie 🟨

`fig-camera-shake-nonuniform` · camera-shake blur is spatially varying — a rotation gives a different PSF in each corner of the frame 🟨

⬜ figure not yet created

`fig-colorization-scribbles` (gray photo + a few color scribbles → propagated color, stopping at edges) fig-colorization-scribbles

The previous chapter inverted a **known** blur on a **clean** image — a textbook least-squares solve. Reality breaks both assumptions: sensors add **noise**, and you usually **don't know the blur kernel**. This chapter is what you do then. The unifying frame is one line — *recovery = data-fit + prior* — and each section just swaps in the prior that suits the degradation. Forward-ref the punchline early: when the prior is *learned* rather than hand-designed, this is exactly [[Machine learning]] and [[Super-resolution and image priors]] (PnP/RED).

💡 **Big lesson:** *diagonalize when you can* — shift-invariant blur is a **convolution**, so it's **diagonal in Fourier**: deblurring becomes a per-frequency **division** by the kernel's frequency response, and the Wiener filter is just that division, *regularized* per frequency. (→ see Big lesson L5, first placed in [[Linearity, Fourier, Aliasing and deblurring]] — recurs here as the reason both inverse and Wiener filters have one-line Fourier forms.)

equations

forward model with noise $Y = K * X + n$

**naive (inverse-filter) deconvolution** $\hat X = \mathcal F^{-1}\!\big[\hat Y / \hat K\big]$ — blows up where $\hat K \to 0$

**Wiener / regularized inverse** (Fourier, per frequency) $\hat X = \dfrac{\overline{K}\,Y}{|K|^2 + \mathrm{SNR}^{-1}}$ (equivalently $|K|^2/(|K|^2 + S_n/S_x)$ applied to the inverse filter)

**blind objective** $\displaystyle (\hat X,\hat K) = \arg\min_{X,K}\

\underbrace{\lVert K * X - Y\rVert^2}_{\text{data fit}} + \underbrace{\lambda\,\rho(\nabla X)}_{\text{image prior}} + \underbrace{\mu\,\psi(K)}_{\text{kernel prior}}$ with $\rho$ a **sparse / heavy-tailed** gradient penalty ($\lVert\nabla X\rVert_\alpha,\ \alpha<1$)

**color2gray** objective = match grayscale gradients to (signed) color-difference gradients $\min_g \sum \lVert \nabla g - \theta\rVert^2$

**colorization** objective = minimise color variation between pixels weighted by intensity affinity $\min_U \sum_x \big(U(x) - \sum_{x'\in N(x)} w_{xx'} U(x')\big)^2$ (developed in [[Edge-preserving techniques]])

▸8.3.1 Deblurring in the presence of noise: why naive inversion fails⧉

• the trap: with a *known* kernel and *no* noise, deconvolution is exact — divide by the kernel in Fourier, done. Add even a *whisper* of noise and that exact inverse is **useless**. [fig-naive-deconv-noise]

• where it breaks, intuitively: blur is a **low-pass** filter — it *kills* high frequencies (the kernel's frequency response $\hat K$ goes to ~0 there). The inverse filter $1/\hat K$ therefore **divides by almost zero** at exactly those frequencies. The signal there is gone, but the *noise* isn't — so you amplify noise by huge factors. Result: an image of pure speckle, not a sharp photo.

• the linear-algebra twin: same statement as **small singular values** of the blur matrix. Inversion = dividing each component by its singular value; the tiny ones (high frequencies) are where noise explodes. Naive inversion has **no defense** because nothing tells it "this frequency is noise, not signal."

• 💡 **Big lesson (recurrence of L10):** *real measurements have noise, and inversion amplifies it — so naive inversion fails and a prior is not optional.* The forward problem (blur, project, downsample) is stable and loses information; running it backward is **ill-posed** and **noise-amplifying**. You cannot recover what the forward model destroyed — you can only **fill it in with prior knowledge.** This is the spine of the whole chapter (and of super-resolution, denoising, inpainting). (→ see Big lesson **L10** · the prior is not optional; first appears in [[#Super-resolution and image priors]].)

• the fix, stated: don't invert blindly — **trade off** fidelity to the measurement against plausibility of the answer. Where the kernel is strong (low freq), trust the data; where it's weak (high freq, noise-dominated), back off and rely on what we *expect* images to look like. That trade-off, optimal under a Gaussian-noise / known-spectrum assumption, **is the Wiener filter** (next).

• encoding note: deconvolution assumes the blur is **linear in scene radiance**, so do it in **linear light** (un-gamma first), per L2 — gamma/log distorts the additive convolution model. State this assumption up front.

▸8.3.2 The Wiener filter — the regularized, noise-aware inverse⧉

• one-line idea: take the naive inverse filter and **attenuate it wherever noise would dominate**, frequency by frequency. The amount of attenuation is governed by the **signal-to-noise ratio** at each frequency.

• the formula (per frequency, in Fourier): $\hat X = \dfrac{\overline{K}\,Y}{|K|^2 + \mathrm{SNR}^{-1}}$. Read it: the $\overline K/|K|^2$ part *is* the inverse filter; the $+\,\mathrm{SNR}^{-1}$ in the denominator is the **regularizer** that stops the blow-up. Equivalently the inverse filter times a gain $|K|^2/(|K|^2 + S_n/S_x)$ (noise spectrum $S_n$, signal spectrum $S_x$). [fig-wiener-tradeoff]

• limit checks (hand-verifiable): **high SNR** ($\mathrm{SNR}^{-1}\!\to 0$) → recovers the exact inverse filter $1/\hat K$ (no noise, invert fully). **Where the kernel vanishes** ($|K|\to 0$) → the $+\,\mathrm{SNR}^{-1}$ keeps the denominator finite, so the gain → 0 instead of ∞: those lost frequencies are **left alone**, not amplified. Exactly the desired behavior.

• where it comes from (just the intuition, no derivation): the **MMSE** linear estimator under Gaussian noise — minimize expected squared error between estimate and truth → this filter. The "$\mathrm{SNR}^{-1}$" is literally a **prior on the image's power spectrum** ($1/f$-ish for natural images). So even the classic Wiener filter is **already a prior-driven method** — that's the bridge to everything that follows.

• limitation → the sequel: Wiener assumes a **known kernel** and a **stationary Gaussian** prior (a power-spectrum prior). Real images aren't Gaussian — they have **edges**, and the right prior is **sparse gradients**, not a power spectrum. Swapping the Gaussian prior for a sparse-gradient one gives the modern non-blind deconvolution (Levin 2007; Krishnan & Fergus 2009) — and when even the *kernel* is unknown, we get blind deblurring (next).

• forward-ref: the general version "any denoiser as the prior" is **Plug-and-Play / RED** in [[Super-resolution and image priors]]; the *learned* prior is [[Machine learning]].

• the new difficulty: now we **don't know the blur kernel** $K$ either (camera shake gives an unknown, photo-specific PSF). We must recover **both** $X$ and $K$ from one blurry $Y$ — twice as many unknowns as equations.

• the chicken-and-egg: knowing the sharp image makes the kernel easy (divide), and knowing the kernel makes the image easy (non-blind deconv) — but we have **neither**. [fig-blind-chicken-egg]

• why it's genuinely ill-posed (the trap to call out): the blurry image **explains itself** — $Y = (\text{sharp }X) * (\text{real blur }K)$ fits, but so does $Y = (\text{the blurry image}) * (\delta)$, i.e. "no blur at all." A naive joint optimizer happily returns the **no-op / delta-kernel** solution because it fits the data perfectly. Data-fit alone is hopeless.

• what breaks the tie — **natural-image priors:** real sharp photos have **sparse, heavy-tailed gradients** (mostly flat regions, a few strong edges); **blurring spreads** those strong edges into many weak gradients, *raising* the prior's cost. So the prior **prefers the sharp explanation** — the delta-kernel "no-blur" answer is penalized because the *blurry* image has the wrong (too-dense, too-Gaussian) gradient statistics. [fig-natural-image-gradient-prior]

• the objective: $(\hat X,\hat K) = \arg\min_{X,K}\ \lVert K*X - Y\rVert^2 + \lambda\,\rho(\nabla X) + \mu\,\psi(K)$, with $\rho$ a **sparse** penalty ($\lVert\nabla X\rVert_\alpha$, $\alpha<1$, a hyper-Laplacian) and $\psi$ a non-negativity / sparsity / energy constraint on the kernel. Non-convex in $(X,K)$ jointly.

• two ways people solve it:

• **alternating minimization** (the obvious approach): fix $X$, solve for $K$; fix $K$, do non-blind deconv for $X$; repeat — usually **coarse-to-fine** (estimate the kernel on a downsampled image first, where it's small, then refine). Simple but can collapse to the trivial solution if the prior isn't handled carefully.

• **pseudocode block** (in the draft): the coarse-to-fine alternation — `K ← δ; for each pyramid level: repeat { X ← deconv given K; K ← fit given X }; final non-blind deconv`.

• **the Fergus 2006 insight / Levin 2009 lesson:** don't MAP-estimate $X$ and $K$ *jointly* — that **biases toward the blurry (no-blur) solution**. Instead **marginalize over the image** and estimate the **kernel alone** (variational Bayes), *then* do a single non-blind deconvolution. Levin et al. 2009's evaluation is the canonical "here's why naive MAP fails and what actually works" result — worth stating as the takeaway.

• attribution to state plainly: **Fergus et al. 2006** — "Removing Camera Shake from a Single Photograph" — put blind deblurring on the map for photographs using a natural-image gradient prior + variational inference; **Levin et al. 2007/2009** clarified the *role of the prior* and *why the estimator design matters*.

• forward-ref: the learned successors (CNN/diffusion deblurring) in [[Deep learning]] — same prior, now learned from data instead of hand-modeled.

▸8.3.4 A more realistic blur model: spatially-varying (camera-shake) blur⧉

• the assumption we've been making: a **single** kernel convolved over the whole image (**shift-invariant** blur). Real **camera shake** violates it.

• why shake is *not* a convolution: hand shake is mostly **rotation** of the camera, not translation. Rotation moves different parts of the frame by **different amounts** (and in different directions) — corners smear more than the center, and the smear *curves*. There is **no single PSF** that, convolved everywhere, reproduces it. [fig-camera-shake-nonuniform]

• the better model (Whyte 2011/2014): the blurry image is a **weighted sum of the sharp image under many camera poses** — integrate over the camera's **rotational trajectory** during the exposure. Equivalently a **spatially-varying PSF** that is consistent because it comes from *one* underlying 3-D camera motion (far fewer unknowns than an independent kernel per pixel).

• consequence: deblurring becomes inversion of this **global, non-stationary** operator — still linear in $X$, still the data-fit + sparse-prior recipe, but the forward operator is the pose-integral, not a convolution (so **no single-FFT trick**; back to the iterative solvers of the previous chapter).

• also note (Whyte 2014): real shots have **saturated highlights**, which break the linear model — handling clipping is part of "realistic" deblurring.

• forward-ref (the optimistic flip side): instead of *fighting* an accidental, hard-to-invert blur, **engineer** the PSF so it's easy to invert and even encodes depth — **coded aperture / motion-invariant capture / wavefront coding** (Levin 2007; Dowski & Cathey 1995) in the coded-imaging / PSF-engineering part. "Make the forward operator nice" beats "invert a nasty one."

▸8.3.5 Decolorization (Color2Gray) as gradient-matching optimization⧉

• **placement note**: color2gray and the colorization framing below are *not* deblurring, but they are the same **data-fit + prior** inverse-problem recipe this chapter teaches — kept here as the general "advanced inverse problem" instances (their *solvers* live in [[Poisson image editing]] / [[Edge-preserving techniques]]).

• the problem: convert color → grayscale **preserving contrast that luminance alone destroys.** Two colors can be **isoluminant** (different hues, *same* luminance) — naive "take the luminance / desaturate" maps them to the **identical gray**, erasing a contrast a viewer clearly saw. [fig-color2gray]

• the framing (Gooch et al. 2005): pose it as an **optimization** — find the grayscale image $g$ whose **gradients best match the perceived color differences** between neighboring pixels. Where colors differ (even isoluminantly), force a gray-level difference; carry the **sign** so ordering is consistent.

• objective (the gradient-domain shape, ties back to [[Poisson image editing]]): $\min_g \sum \lVert \nabla g - \theta\rVert^2$, where $\theta$ is a target gradient derived from **CIELAB** luminance + chrominance differences (a signed color-contrast). Solve the resulting **Poisson-type linear system** — *the same machinery as gradient-domain editing.*

• why it sits here: it is a **least-squares reconstruction from a target gradient field** — the inverse-problem / gradient-domain pattern again, with the "prior" being *match perceived contrast*. Short section; mainly a demonstration that the framework generalizes beyond restoration.

• encoding note: contrasts are computed in a **perceptually uniform** space (CIELAB), not linear RGB — state the space explicitly (L1/L2: work where differences mean what you want).

▸8.3.6 Colorization as an optimization (the inverse-problem framing)⧉

• decision (resolve "here or in edge-preserving?"): **introduce the framing here; develop the solver in [[Edge-preserving techniques]]** (*Edge-preserving optimization (colorization)*). Here it's *another instance of the chapter's recipe*; there it's the *worked edge-aware affinity solve.* Cross-ref both ways.

• the problem: a **grayscale photo plus a few color scribbles** → propagate color over the whole image, **stopping at edges** (don't bleed color across object boundaries). [fig-colorization-scribbles]

• the inverse-problem framing (Levin, Lischinski & Weiss 2004): the **unknown** is the per-pixel chrominance $U$; the **constraints** are the scribbles; the **prior** is *neighboring pixels with similar intensity should get similar color* — an **affinity** between pixels keyed on intensity difference. Minimize color variation subject to the scribbles: $\min_U \sum_x \big(U(x) - \sum_{x'\in N(x)} w_{xx'}\,U(x')\big)^2$, with weights $w_{xx'}$ large when intensities match. A **sparse linear least-squares** solve (the same CG/multigrid machinery as [[Efficient solvers]] and Poisson editing).

• 💡 **Big lesson (callback):** the weight $w_{xx'}$ = "how much these two pixels belong together, judged by their intensity difference" *is* an **affinity** — exactly **edge-preserving = affinity** (→ see Big lesson L4, first placed in the bilateral filter; the colorization solver is the optimization-form of that idea). This is the hinge to [[Edge-preserving techniques]], where the affinity/edge-aware machinery is developed in full.

• forward-ref: the **learned** alternative — predict plausible colors from a gray image with a CNN, no scribbles (Zhang, Isola & Efros 2016) — in [[Deep learning]] / colorization.

▸8.4 Dehazing⧉

Haze isn't a deblurring problem — light is **scattered**, not convolved — but it's the same *idea* as the previous chapter in a new costume: an image-formation model that is **under-determined**, made solvable by **one well-chosen statistical prior**.

equations

**haze image model** $Y(x)=t(x)\,X(x) + A\,(1-t(x))$ (transmission $t=e^{-\beta d}$, airlight $A$, depth $d$, scattering coeff $\beta$)

**dark channel** $X^{\text{dark}}(x)=\min_c \min_{x'\in\Omega(x)} X_c(x')$

▸8.4.1 Dehazing as a prior-driven inverse problem⧉

• the shape: an under-determined image-formation model + a statistical prior that makes it solvable — the **same thesis** as blind deblurring, even though the physics (atmospheric scattering, not blur) is different.

• the (physical) forward model: $Y(x) = t(x)\,X(x) + A\,(1-t(x))$ — each pixel is the clean radiance $X$ attenuated by **transmission** $t(x)$ (how much light survives the haze; $t = e^{-\beta d}$, $d$ = scene depth), **blended with airlight** $A$ (the bright, scattered atmospheric light). Haze = a depth-dependent fade toward $A$.

• why it's under-determined: per pixel we know $Y$ (3 numbers) but want $X$ (3), $t$ (1), and the global $A$ — more unknowns than measurements. Need a prior. [fig-dark-channel]

• the **dark-channel prior** (He, Sun & Tang 2009): in a haze-free outdoor patch, *some* color channel is **nearly zero** somewhere (shadows, colorful/dark surfaces) — the patch's $\min_c\min_{\text{window}} X_c \approx 0$. Haze *lifts* this dark channel (adds airlight everywhere), so **how bright the dark channel is ≈ how much haze ≈ $1-t$.** That single observation pins down the transmission map.

• the pipeline (state as steps): estimate $A$ from the haziest (brightest-dark-channel) pixels → estimate a coarse $t(x)$ from the dark channel → **refine the transmission's edges** so it follows real object boundaries (a matting / guided-filter / edge-aware solve — He's original used soft matting, later the **guided filter**, [[Edge-preserving techniques]]) → invert the model for $X$. [fig-dark-channel]

• the takeaway to land: *the prior is doing all the work.* The model is unsolvable as stated; one well-chosen statistical regularity (dark channel) turns it into a clean recovery — the **same moral as Wiener and blind deblurring**, just a different prior.

• xref: the refinement step ties into [[Edge-preserving techniques]] (guided filtering); the learned successors live in [[Deep learning]].

▸8.4.2 The wider lineage: scattering physics, learned dehazing, and beyond⧉

• **before the dark channel — physics from scattering.** **Cozman & Krotkov** ([@cozman-krotkov-1997]) turned the haze model around to *recover depth from scattering* — given the atmosphere, the depth-dependent fade *is* a depth cue (haze carries information, it doesn't only destroy it). **Narasimhan & Nayar**'s vision-and-weather work ([@narasimhan-nayar-2002]) laid down the systematic physics of imaging through fog, haze, and rain and how to invert it (classically from two images or a known airlight) — the physical foundation the single-image priors later shortcut.

• **learned dehazing.** Replace the hand-built prior with a network. **Zhang, Sindagi & Patel** ([@zhang-sindagi-patel-2019]) train a single network that **jointly estimates the transmission map and produces the dehazed image** end-to-end, instead of estimating $t$ by a prior and inverting separately — the learned successor to the dark-channel pipeline (forward-ref [[Deep learning]]). (Also flagged: **SDDE** — *queue exact ref / author to confirm*.)

• **out of the air, into the water.** **Sea-thru** ([@akkaynak-treibitz-2019]) corrects **underwater** images with the *right* physics: unlike atmospheric haze, water attenuates each wavelength differently and the veiling light varies with range, so a single global airlight is simply wrong — Sea-thru estimates wideband, range-dependent attenuation (from a known depth/range map) and removes the water's color cast where naive dehazing fails.

• **haze as a task, not just a nuisance.** **Sakaridis, Dai & Van Gool** ([@sakaridis-etal-2018]) synthesize fog onto clear images to build training data for **semantic scene understanding in fog**, so dehazing becomes a means to robust *perception* (driving, robotics) — closing back to the depth-from-scattering view that fog *carries* scene information.

▸8.4.3 Differentiable image pipelines and algorithm optimization (Halide)⧉

• the meta-point that closes the inverse-problem run: every method across these two chapters is *an optimization with a prior* — but who **optimizes the pipeline itself**? When a restoration/ISP pipeline has knobs (filter weights, kernel parameters, the whole sequence of stages), you want to **tune them by gradient descent against a quality loss** — which means the pipeline must be **differentiable** and **fast**.

• **Halide** (Ragan-Kelley et al. 2013): a language that **separates *what* an image pipeline computes (the algorithm) from *how* it runs (the schedule)** — tiling, vectorization, parallelism, fusion. One algorithm, many schedules → portable high performance without rewriting the math. This is *the* tool the rest of computational photography is implemented in.

• **differentiable Halide** (gradient Halide, Li et al. 2018): add **automatic differentiation** through the pipeline → you can **backprop a loss to the pipeline's parameters** (or use the pipeline as a layer inside a network). Now "tune the deblurring/dehazing/ISP parameters" or "learn the filter" is just gradient descent through a fast, scheduled pipeline.

• why it closes the chapter: it's the **engine** under the whole "data-fit + prior, minimized" program — and the bridge to [[Machine learning]] (differentiable pipelines = the boundary where classical optimization meets learned components; cf. FlexISP, Heide 2014, as a differentiable-ISP example).

• scope note: keep this **conceptual** (separation of algorithm/schedule; differentiability as an enabler), not a Halide tutorial — point to the language for hands-on use.

▸8.5 Style transfer⧉

`fig-style-transfer-zoo` · the zoo of style transfer — patch/analogies, texture statistics (Gram), neural optimisation, feed-forward, image-to-image GANs — same skeleton, swap the prior 🟨

`fig-neural-style-content-style` · neural style: a content loss (deep feature match) balanced against a style loss (Gram-matrix match), summed into one objective 🟨

The thread of this chapter is one sentence: **match the *look* of a reference onto a target, keeping the target's content.** It runs from **classical** methods — match patches or low-order statistics by hand — to **learned** ones — match deep-feature statistics, or learn a whole image-to-image mapping. Several of these pieces appear as *models* in [[Deep learning]]; here they're gathered under the single idea of **style transfer**.

equations

(light) **neural style** objective $\min_{\hat I}\ \alpha\,\lVert\phi_\ell(\hat I)-\phi_\ell(I_{\text{content}})\rVert^2 + \beta\sum_\ell \lVert G_\ell(\hat I)-G_\ell(I_{\text{style}})\rVert^2$, with **Gram matrix** $G_\ell = \phi_\ell\phi_\ell^\top$ (style = feature correlations)

feed-forward variant trains a net to minimize the same **perceptual loss** in one pass

▸8.5.1 Classical style transfer: patches and statistics⧉

• **Image Analogies** (Hertzmann et al. 2001): give an example pair **A → A′** (an image and its filtered/stylized version) and a new target **B**; synthesize **B′** so that *"A : A′ :: B : B′"* — learned **by patch matching** (for each B patch, find the best-matching A patch and copy its A′ counterpart). A single example teaches a filter/style — texture-synthesis machinery turned into a transfer, the **pre-deep ancestor** of neural style (cross-ref [[Deep learning]] and [[#Inpainting, texture synthesis]]).

• **color (palette) transfer** (Reinhard et al. 2001): the leanest classical method — make a target photo take on a reference's **palette/mood** by matching just the **mean and standard deviation** of each color channel. Convert both images to a **decorrelated color space** (l$\alpha\beta$ / Ruderman — axes statistically near-independent for natural scenes, so channels can be retargeted separately without cross-talk), then per channel **shift** to the reference's mean and **scale** by the ratio of standard deviations. Only **first- and second-order statistics** — no patches, no histograms — yet enough to swap a sunset's warmth onto a daylight shot. The classical, **pre-deep ancestor of neural style / color grading**: where Gatys later matches *deep-feature* second-order statistics (Gram matrices), Reinhard matches *raw-pixel* low-order statistics in a smart color space. (Forward-ref the learned / style-based color-grading successors → [[Deep learning]] and *Style transfer as image-to-image translation* below; the decorrelated transfer space ties to [[Human (and animal) vision and color]].)

• **two-scale tone / style transfer** (Bae, Paris & Durand 2006): transfer the **pictorial look** of a model photograph onto a target by **separating scales** — an edge-preserving **base / detail** split — and transferring the **large-scale tonal distribution** (a histogram-match of the base) and the **detail/texture** amount independently. Classic example: give a snapshot the tonal drama of an Ansel Adams print. Ties to the base–detail decomposition of [[Edge-preserving techniques]] and to histogram matching (BASIC).

• the classical takeaway: style here is **low-order statistics + patches** — global mean/variance (Reinhard), histograms, local texture, exemplar patches — matched by hand-designed pipelines. The arc rises in statistical order: Reinhard's first/second moments → histograms/patches → the deep-feature statistics below, where the learned methods replace "hand-designed statistic" with "deep-feature statistic."

▸8.5.2 Neural style and feed-forward stylization⧉

• **neural style** (Gatys, Ecker & Bethge 2016): the breakthrough that **content = deep CNN features** (what is where) and **style = the *correlations* among those features**, captured by the **Gram matrix** $G_\ell=\phi_\ell\phi_\ell^\top$ at each layer. **Optimize an image** to simultaneously match the target's content features and the reference's style Gram matrices: $\min_{\hat I}\ \alpha\lVert\phi(\hat I)-\phi(I_c)\rVert^2 + \beta\sum_\ell\lVert G_\ell(\hat I)-G_\ell(I_s)\rVert^2$. Slow (an optimization per image), but it reframed style as *deep-feature statistics*. [fig-neural-style-content-style]

• **feed-forward / perceptual-loss style transfer** (Johnson, Alahi & Fei-Fei 2016): **train a network once** to produce the stylized output in a **single forward pass**, by minimizing the *same* content+style **perceptual loss** (deep-feature distance) Gatys optimized iteratively. Real-time stylization; the same **perceptual / feature loss** reappears as a training loss for super-res and as LPIPS (cross-ref [[Deep learning]]).

• **Markovian / patch-based neural style** (Li & Wand 2016): combine **MRFs with CNNs** — match style on **local neighbourhoods of deep features** (a patch/Markovian style term) rather than global Gram statistics, better preserving local structure. Sits **between Image Analogies (patches, no net) and Gatys (global feature statistics)** — patches, but in deep-feature space.

▸8.5.3 Style transfer as image-to-image translation⧉

• **the GAN view**: when "style" is a whole **domain** (photos↔paintings, day↔night, summer↔winter), pose stylization as **learned image-to-image translation** — a generator maps one domain's look onto another, a discriminator keeps it plausible.

• **Pix2Pix** (Isola et al. 2017): **paired** translation — when you have aligned (input, styled-output) pairs, learn the map directly with a conditional GAN. The generic "learn the map between two image domains" tool.

• **CycleGAN** (Zhu et al. 2017): the **unpaired** version — when no aligned pairs exist (you have *photos* and *Monets*, but not the same scene as both), **cycle-consistency** ($B\to A\to B$ should return $B$) lets you learn the style mapping without pairs. The workhorse for "make my photo look like X" when X is only a *collection*.

• placement: these are surveyed as *models* in [[Deep learning]] (generative image-to-image); here they're the **learned, domain-level** end of the style-transfer spectrum. Forward-ref: text-/exemplar-driven stylization with diffusion ([[Generative AI and diffusion]]) is the current frontier.

• refs: Reinhard et al. 2001 (color transfer between images); Hertzmann et al. 2001; Bae, Paris & Durand 2006; Gatys et al. 2016; Johnson, Alahi & Fei-Fei 2016; Li & Wand 2016; Isola et al. 2017 (Pix2Pix); Zhu et al. 2017 (CycleGAN).

▸8.6 Inpainting, texture synthesis⧉

`fig-inpaint-prior-spectrum` · the hole-filling prior spectrum: PDE/diffusion (smoothness) → exemplar/texture (self-similarity) → learned/generative (semantics) 🟨

`fig-pde-isophote-diffusion` · PDE inpainting continues isophotes (level lines) smoothly into the hole — diffusion of structure inward (Bertalmío 2000) 🟨

`fig-thin-lens-vs-real` · side-by-side "convenient lie vs reality": ideal thin lens (single plane, all parallel rays → one focus) vs a real thick multi-surface lens where outer rays focus short of the paraxial focus (spherical aberration → blur, no single point) ✅

`fig-efros-leung-growth` · Efros–Leung non-parametric synthesis: grow one pixel at a time by matching its known neighbourhood against the sample 🟨

`fig-quilting-seam` · image quilting: lay overlapping patches and cut along the minimum-error boundary so seams disappear (Efros–Freeman 2001) 🟨

⬜ figure not yet created

`fig-object-removal` (Criminisi: a person erased, the hole filled by confidence-/edge-priority patch order so structures continue across it) fig-object-removal

⬜ figure not yet created

`fig-highlight-recovery` (a clipped specular spot → inpainted plausible surface color) fig-highlight-recovery

This chapter is the most literal instance of 💡 L10. A hole has **no measurement at all** — the only thing that can fill it is a model of what images look like. So inpainting is best read as a **ladder of priors**, from the weakest ("be smooth") to the strongest (a learned generative model), and the whole chapter is *which prior, and where it copies-vs-invents*. The recurring honesty: filled pixels are **plausible, never measured** — verifiable when the prior copies genuine structure from elsewhere, an outright **guess** when it hallucinates.

💡 **Big lesson (recurrence of L10 · the prior is not optional):** filling a hole is the purest case — there is *zero* data inside it, so **100% of the answer comes from the prior.** Smoothness, self-similarity, a photo database, a learned net — each is a different prior, and the result is exactly as good (and as honest) as the prior is. (→ see Big lesson **L10**, first placed in [[#Super-resolution and image priors]]; here taken to its extreme — no data, all prior.)

▸8.6.1 Inpainting as filling unmeasured pixels — the spectrum of priors⧉

• **the problem**: given an image with a **masked region** (a scratch, a date-stamp, a removed power line, a missing JPEG block), produce a complete image where the hole is **plausible and seamless**. There is **no data in the hole** — the entire fill is the prior's invention. [`fig-inpaint-prior-spectrum`]

• **the organizing axis — weak → strong prior** (this is the chapter):

• **smoothness / PDE** (weakest): *"continue the surrounding color/edges smoothly."* Great for **thin** gaps; **blurs** anything large because smoothness has no notion of *texture*.

• **self-similarity / exemplar** (the workhorse): *"this image repeats itself — copy patches from elsewhere in the same image."* Texture synthesis, clone/heal, Criminisi object removal. Copies **real** structure (reconstruction-leaning), but only structure the image already contains.

• **database** (Hays & Efros): *"the answer is in some other photo"* — retrieve a matching scene from millions and graft a region in. The prior is *the rest of the internet*.

• **learned generative** (strongest): a net that has internalized $p(\text{image})$ **generates** a completion (context encoders → gated conv → diffusion inpainting). Most flexible, most prone to **hallucination**.

• **intuition-first framing**: not "Photoshop magic" — *"the algorithm answers a question it has no data for, by assuming the missing region looks like (a) its neighbours, (b) some other part of this image, (c) some other image, or (d) a typical image."* Those four assumptions are the four families.

• 💡 **the throughline**: this is L10 with the dial at max — see the box. Everything below is *which* prior.

▸8.6.2 PDE / diffusion-based inpainting⧉

• **the idea** (Bertalmío et al. 2000): mimic what a **museum restorer** does to a cracked painting — **continue the lines** arriving at the hole's boundary smoothly into it. Formally, **propagate the image's smoothness (its Laplacian) *along* the level-lines (isophotes)** so edges arriving at the hole are extended across it rather than blurred over. [`fig-pde-isophote-diffusion`]

• **why "along isophotes" matters**: naive diffusion (solve Laplace's equation in the hole — a membrane / harmonic fill) smears *across* edges → a soft blur. Steering the diffusion **along** the level-lines keeps an edge continuing as an edge. (Telea 2004 is a fast **marching-method** approximation.)

• **good for / hard limit**: **thin** structures — scratches, text overlays, fine cracks, scanner dust. For a **large** hole there is nothing to propagate but flatness → the fill goes **blurry and textureless** — the motivation for the exemplar/texture methods next.

• **placement note**: this is the **gradient-domain / Poisson** family in disguise — cross-ref [[Poisson image editing]] (💡 L9): seamless *cloning* edits the gradient field too; PDE inpainting edits it where there's *no source at all*.

▸8.6.3 Texture synthesis (Efros–Leung; Efros–Freeman quilting)⧉

• **the reframing**: a large hole in a *textured* region (grass, gravel, a sweater) can't be diffused — but it **can be synthesized**, because texture is **self-similar**: patches recur.

#### Non-parametric texture synthesis (Efros–Leung)

• **non-parametric texture synthesis** (Efros & Leung 1999): synthesize **one pixel at a time** — for the next unfilled pixel $p$, take its **already-known neighbourhood** $N(p)$, search the source for **windows that match it** (Gaussian-weighted SSD over the known part), pick a best match, and **copy its centre pixel**. No texture *model* is fit — the sample image **is** the model (a Markov-random-field assumption). [`fig-efros-leung-growth`]

• **the one knob — window size**: too **small** → loses large-scale structure (noise); too **large** → copies the source **verbatim** and is slow. Must straddle the texture's characteristic scale. Honest cost: **slow** (a full search per pixel) → the motivation for PatchMatch ([[#Patch match]]).

• **pseudocode block** (in the draft): the frontier loop — `while unfilled: pick the most-surrounded pixel p; SSD-match N(p) against source windows; copy a near-best match's centre into p`.

#### Image quilting (Efros–Freeman)

• **image quilting** (Efros & Freeman 2001): synthesize a **whole patch at a time** — lay down overlapping blocks copied from the source, and where two overlap, **cut along the minimum-error boundary** through the overlap (a tiny dynamic-program / 1-D graph cut) so the seam is invisible. Faster than pixel-at-a-time, and the seam-cut prefigures **GraphCut textures** (Kwatra 2003) and 💡 **L12**. [`fig-quilting-seam`]

• **parametric texture** (the road not taken): **Heeger & Bergen 1995** / **Portilla & Simoncelli 2000** match **statistics** of **steerable-pyramid** subband responses — the direct ancestor of **neural style's Gram-matrix** idea (cross-ref [[Style transfer]], [[Linear pyramids and wavelets]]).

• 💡 **L12 connection**: the patch + min-error-seam view is **discrete optimization on a graph** (match cost + seam cost) — the energy-is-the-design lesson (→ Big lesson **L12**).

▸8.6.4 Exemplar inpainting: clone, healing brush, and object removal (Criminisi)⧉

• **clone brush**: the manual exemplar prior — **copy pixels from a chosen source region**. Pure copy → **visible seams** where lighting differs. The **healing brush** fixes that by transferring **texture/detail** but **matching the destination's color/illumination** — paste the source's *gradients* and re-level (the **gradient-domain / Poisson** trick — cross-ref [[Poisson image editing]], 💡 L9).

• **exemplar-based inpainting / object removal** (Criminisi, Pérez & Toyama 2004): automate the clone brush with texture-synthesis-style **patch copying** plus a crucial **fill order** — a **priority** $P(p)=C(p)\,D(p)$ (a **confidence** term × a **data/edge** term) fills **high-priority patches first**, so strong linear structures (a wall edge, a horizon) continue across the hole *before* flat regions. [`fig-object-removal`]

• **scaling → next sections**: the search is slow and limited to *this* image; **PatchMatch** ([[#Patch match]]) makes it interactive; a **database** (Hays & Efros) or a **learned net** lifts the "only this image" limit.

▸8.6.5 Data-driven scene completion (Hays & Efros)⧉

• **the provocation**: if exemplar methods are limited to *this image's* content, **use a different image.** Hays & Efros 2007 fill a hole by **retrieving, from millions of photos, scenes that match the hole's surround**, then **seamlessly graft** a region from a good match (alignment + a graph-cut seam + Poisson blend). [`fig-inpaint-prior-spectrum`, database panel]

• **the lesson — "the internet is your prior"**: with a big enough database, *some* photo already contains a plausible completion; you **retrieve** rather than model. The non-parametric prior taken to web scale.

• **honest caveats**: it grafts a **semantically different** scene — coherent and photoreal, but not *this* scene's truth (a vivid L10 "plausible ≠ measured"); needs a huge database and good scene matching.

• **why it's the hinge to deep learning**: "retrieve the right completion from millions of images" is what a **generative net** does *implicitly* — it has compressed those millions into weights and can **synthesize** the completion.

▸8.6.6 Deep inpainting (context encoders → partial/gated conv → diffusion)⧉

• **the move** (💡 L8, [[Deep learning]]): replace the hand-built prior with a **learned** one — a net trained to **regenerate masked regions**.

• **context encoders** (Pathak et al. 2016): the first CNN inpainter (encoder–decoder, reconstruction + adversarial loss) — captures **semantics** but early versions were blurry.

• **partial / gated convolutions** (Liu et al. 2018; Yu et al. 2019): fix the core nuisance — a vanilla conv near a hole mixes in **garbage masked pixels**. Partial conv renormalizes by the *valid* fraction; gated conv *learns* a soft mask gate → robust to **free-form** holes, the practical "remove this object" workhorse.

• **diffusion inpainting** (Lugmayr et al. 2022, RePaint): the strongest prior — treat the hole as a **measurement constraint** and, at every denoising step, **pin the known region** to the (renoised) real pixels: $x_{t-1}\!\leftarrow\!\text{mask}\odot q(x^{\text{known}}_{t-1}) + (1-\text{mask})\odot \mathcal{D}_\theta(x_t)$. This is **exactly PnP/RED posterior sampling** (💡 L11) with operator = "keep the known pixels" — the continuation of [[#Super-resolution and image priors]].

• **forward reference**: the architectures live in [[Deep learning]] / [[Generative AI and diffusion]]; here they're the **strong-prior end**. Honest note: most flexible **and** most prone to **hallucinate** — an inpainted region is **not evidence** (L10 ethics).

▸8.6.7 Highlight / specular recovery⧉

• **the framing — clipping is a hole**: a **blown highlight** is a region where the sensor **saturated** — the true radiance above the clip is *unmeasured*. So **highlight recovery is inpainting a clipped region**: fill the saturated pixels using the **unclipped channels** and surrounding texture as the prior. [`fig-highlight-recovery`]

• **two angles**: (a) **specular vs diffuse separation** — recover the diffuse layer under the specular add-on (cross-ref [[#Illumination related effects in a single image]]); (b) **per-channel clipping recovery** — when only one channel clipped, the others carry the hue, so extrapolate the missing channel (a small, well-posed inpainting).

• **why it lives here**: same L10 idea — *recover what the measurement destroyed* (by clipping rather than masking); the single-image cousin of **HDR merge** (capture the highlight with a shorter exposure instead of guessing it — forward-ref [[Multiple exposure imaging]]).

▸8.6.8 Epitomes (compact patch models)⧉

• **the idea** (Jojic, Frey & Kannan 2003): summarize an image (or video) by a **small "epitome"** — a miniature holding the **patch statistics** of the whole, learned so that every original patch has a close match somewhere inside it. A **generative patch model**: the big image is explained as a quilt of epitome patches.

• **uses**: denoising, inpainting, super-resolution, and texture transfer all fall out of "find each patch's place in the epitome, read back a clean version" — the **patch-NN** idea of [[#Patch match]] with a *learned compact dictionary* instead of the image itself.

• **place**: the pre-deep cousin of today's learned patch / latent priors (**L8 / L11**); a bridge from exemplar texture synthesis to modern generative models.

▸8.7 Patch match⧉

`fig-nnf-field` · the nearest-neighbour field: every patch points to its best match elsewhere — visualised as a colour-coded offset map 🟨

`fig-patchmatch-three-steps` · PatchMatch's three moves — random initialisation → propagation (good offsets spread to neighbours) → random search (refine locally) 🟨

⬜ figure not yet created

`fig-patchmatch-apps` (one image → hole-filled, retargeted (wider, no distortion), reshuffled (an object dragged, surround re-synthesized)) fig-patchmatch-apps

`fig-shiftmap-labeling` · Shift-Map editing as a graph-cut labelling: each output pixel is assigned a shift, optimised for data + smoothness 🟨

The previous chapter's exemplar methods (Efros–Leung, Criminisi) all live or die on one operation: *for each patch, find its most similar patch elsewhere.* Done naively that's a full search per patch — far too slow to be interactive, and the real reason texture synthesis felt like a batch job. This chapter is the **two ways to make that fast or optimal**: a brilliant **randomized heuristic** (PatchMatch) that gets a near-perfect answer in milliseconds, and a **global graph-cut** (Shift-Map) that gets the *optimal* answer more slowly. The unifying object is the **nearest-neighbour field**; the unifying lesson is 💡 **L12** — *the edit is an optimization over a correspondence, and the energy you pick is the design.*

💡 **Big lesson (recurrence of L12 · image edits as optimization on a graph):** "where does each output pixel come from?" is a **labelling** — assign every output pixel a **shift** into the input. Pick the **energy** — a data term (respect the user's keep/remove/resize constraints) plus a **seam** term (neighbouring pixels should take consistent offsets, so cuts fall on real edges) — and a generic **graph-cut** solver does the rest. PatchMatch is the *fast heuristic* answer to the same labelling; Shift-Map is the *globally-optimal* one. (→ see Big lesson **L12**, first placed in Seam optimization; it is the affinity lesson L4 turned into a *cut*.)

▸8.7.1 The nearest-neighbour field, and why exhaustive search is the bottleneck⧉

• **the shared sub-problem**: inpainting, texture synthesis, retargeting, NL-means denoising, Image Analogies — all need, for each output patch, **the most similar source patch**. Collect those answers → a **nearest-neighbour field (NNF)**: an **offset** $f(p)=\Delta$ per pixel s.t. $P_{\text{in}}(p+\Delta)$ best matches $P_{\text{out}}(p)$. [`fig-nnf-field`]

• **the cost wall**: computed exhaustively, each patch is compared to **every** candidate → quadratic; kd-trees help but stay far from interactive on megapixels. *This* is why Efros–Leung and Criminisi were slow, and why "Content-Aware Fill" didn't exist as a click-and-wait tool until the NNF could be computed in real time.

• **the reframing**: compute the NNF **approximately, but fast** — a **randomized heuristic** (PatchMatch) or a **global energy minimization** (Shift-Map). The rest of the chapter is those two.

▸8.7.2 PatchMatch — randomized correspondence [@barnes-etal-2009|Barnes et al. 2009]⧉

• **the two facts it exploits**:

1. **coherence** — adjacent output patches usually map to **adjacent** source patches, so **one good offset is good for its neighbours.**

2. **the birthday-paradox / law of large numbers** — over a whole image **some** patches get a good match by **random guessing**; you just need to **spread** the lucky hits.

• **the algorithm — three moves** [`fig-patchmatch-three-steps`]:

1. **random initialization**: every pixel gets a **random** offset. Mostly bad — a few good.

2. **propagation**: scan (alternating raster order each iteration); for $p$, also **try the offsets its scanned neighbours used** (shifted by one) → good offsets **flood-fill** across coherent regions.

3. **random search**: refine the best by testing random offsets in an **exponentially shrinking window** ($w\cdot\alpha^i$) — large jumps escape local optima, small jumps fine-tune.

• iterate 2–5 times → a clean NNF in **milliseconds**, orders of magnitude faster than exhaustive search.

• **pseudocode block** (in the draft): the init-then-sweep loop — `init f randomly; repeat passes (alternating scan): for each p: propagate from neighbours, then random-search in a shrinking window; return f`.

• **why it converges fast**: propagation makes a single lucky match **infect** an entire coherent region in one scan; random search avoids getting stuck. *Randomness + coherence beats brute force.*

• **generalized PatchMatch** (Barnes et al. 2010): k nearest neighbours + **rotations/scales** — needed for matching transformed content (style transfer, denoising).

• 💡 **why it sits next to Shift-Map**: PatchMatch answers "where does each patch come from?" **heuristically and fast**; Shift-Map answers the **same** question **optimally and slower** (next).

▸8.7.3 Applications: hole filling, retargeting, reshuffle⧉

• **interactive hole filling / Content-Aware Fill**: run PatchMatch between hole and rest, copy best-matching patches in, iterate coarse-to-fine for global coherence — the engine that turned exemplar inpainting into a **real-time** tool. The *objective* it approximately optimizes is **bidirectional similarity** (Simakov et al. 2008 — output should be *coherent* and *complete*). [`fig-patchmatch-apps`, hole panel]

• **image retargeting** (content-aware resize): change aspect ratio while **preserving important content** (vs **seam carving**, Avidan & Shamir 2007, the greedy cousin) — reshape without squashing faces/buildings. [`fig-patchmatch-apps`, retarget panel]

• **reshuffle**: **drag an object** to a new location; re-synthesize the rest so the scene stays coherent. [`fig-patchmatch-apps`, reshuffle panel]

• **and beyond**: **video completion** (Wexler et al. 2007), **Image Melding** (Darabi et al. 2012 — PatchMatch + gradient-domain), denoising self-similarity, the search inside Image Analogies (cross-ref [[Style transfer]]).

• **honest framing (L10 callback)**: all still **copy** real content (reconstruction-leaning within the image); the *generative* successors (diffusion inpainting/outpainting) **invent** content — more powerful, less faithful.

▸8.7.4 Shift-Map image editing [@pritch-etal-2009|Pritch et al. 2009] — editing as graph-cut labelling⧉

• **the reframing**: instead of a heuristic search, decide **per output pixel** a single **shift** $M(p)$ into the input by **minimizing a global energy**, then read the output by copying `input(p + M(p))` — the output is a **rearrangement** of input pixels under a smooth offset field. [`fig-shiftmap-labeling`]

• **the energy** (💡 L12): $E(M)=\sum_p E_{\text{data}}(M(p)) + \lambda\sum_{(p,q)} E_{\text{smooth}}(M(p),M(q))$.

• **data term** encodes the task: retargeting (force output size / drop columns), object removal (forbid shifts into the removed region), reshuffle (pin content to its new place).

• **smoothness / seam term** penalizes **inconsistent neighbouring shifts** by **color + gradient discontinuity** — cheap seams fall on **real edges** (same principle as quilting's min-error cut).

• **the solver — graph cuts / α-expansion** (Boykov et al. 2001), often **coarse-to-fine** (the label set — all shifts — is large) → a **globally (near-)optimal** offset map.

• **PatchMatch vs Shift-Map — the duality**: both produce an **offset field** rearranging input into output. PatchMatch = randomized **heuristic**, fast, approximate, great for dense synthesis (fill, reshuffle). Shift-Map = a **global graph-cut energy**, slower, optimal for its energy, with clean explicit seams — great for retargeting / removal. This is the **L4→L12** progression: an **affinity** (patch similarity / seam cost) used by a **fast local rule** or a **global cut** — *the affinity is the design; the solver is a choice.*

• **cross refs**: seam carving (Avidan & Shamir 2007) is the greedy special case Shift-Map generalizes; the seamless blend after a cut is the **Poisson** solve ([[Poisson image editing]], L9); the learned successor is [[Generative AI and diffusion]] (L11).

▸8.8 Compositing, segmentation and matting⧉

`fig-matting-equation` · the matting equation — one boundary pixel is $\alpha F + (1-\alpha)B$: a foreground colour, a background colour, and a soft coverage $\alpha$ 🟨

`fig-graphcut-segmentation` · binary segmentation as a min cut: pixel nodes, source/sink terminals, t-links (Dp) + n-links (Vpq, small across edges), min cut severs the cheap edges (Seam optimization) ✅

`fig-segmentation-graphcut` · binary segmentation as a min-cut on a grid graph wired to source/sink terminals; the cut is the object boundary (GrabCut inset) 🟨

`fig-segmentation-three-framings` · three framings of segmentation — min-cut, normalized-cut, and least-cost path (intelligent scissors) — on the same boundary 🟨

⬜ figure not yet created

`fig-trimap-to-alpha` (user trimap FG/BG/unknown → recovered continuous $\alpha$ over the unknown band, e.g. hair, after closed-form matting) fig-trimap-to-alpha

`fig-key-vs-measure` · two ways to get the cut-out: *key* it (constrain the background — green screen) vs *measure* the separation (depth / IR / flash) 🟨

⬜ figure not yet created

`fig-harmonization` (a correct-$\alpha$ cutout pasted raw ("stickered on") vs Poisson-blended vs color/lighting-harmonized → "belongs") fig-harmonization

The previous chapters fixed *one* image — denoise, deblur, recover detail. Now we **combine** images: take the subject out of one photo and drop it into another. That splits cleanly into two jobs — **cut** (which pixels are the subject?) and **blend** (lay it down so the join is invisible). The first is **segmentation/matting**; the second is **compositing** (and its seamless cousins, Poisson and pyramid blending, built in EDGES MATTER and only pointed to here). The honest twist is that "which pixels are the subject" has no crisp answer at hair, fur, motion blur and glass — so the right object is not a binary mask but a **soft $\alpha$**, and the model of the world is one line: $C = \alpha F + (1-\alpha)B$.

💡 **Big lesson (recurrence of L12 · image edits as discrete optimization on a graph — the energy is the design):** a *hard* selection is exactly the **graph cut** of [[Seam optimization]] — pixels are nodes, you add a **region data term** (this color looks like foreground / background) and the same **edge-aware smoothness term** (cheap to cut where neighbouring colors differ), and min-cut/max-flow returns the **globally optimal** boundary for two labels. The optimizer is off-the-shelf; *choosing the energy is the whole design.* And the smoothness weight $V_{pq}\propto e^{-\lVert C_p-C_q\rVert^2/2\sigma^2}$ is an **affinity** — so this is also a recurrence of **L4 · edge-preserving = affinity**: the very same "how much do these two pixels belong together" that drove the bilateral filter, now deciding *where not to cut*. (→ see Big lesson **L12**, first placed in [[Seam optimization]], extending **L4**, first placed in [[Bilateral filtering]]; here both recur, as a *cut* and, below, as the *matting Laplacian*.)

equations

**compositing / matting equation** $C = \alpha F + (1-\alpha)B,\ \alpha\in[0,1]$ — *forward* = composite, *inverse* = matting (under-determined: 3 knowns, 7 unknowns per pixel)

**graph-cut segmentation energy** $E(\mathbf{L}) = \sum_p D_p(L_p) + \lambda \sum_{(p,q)\in\mathcal N} V_{pq}(L_p,L_q)$ with **data term** $D_p$ (color fit FG vs BG) and **edge-aware smoothness** $V_{pq}\propto \exp(-\lVert C_p-C_q\rVert^2/2\sigma^2)$ (an **affinity**, **L4**), minimized **exactly** for two labels by min-cut/max-flow (**L12**)

**closed-form matting** $\hat\alpha = \arg\min_\alpha \alpha^\top L\,\alpha$ s.t. trimap, where $L$ is the **matting Laplacian** (a pixel-affinity matrix from a local color-line model) — *the same affinity-Laplacian as colorization*

▸8.8.1 Segmentation: cutting the object out⧉

• **the task**: produce a **binary mask** — these pixels are the subject, those are not. Three classic framings, the *same* lesson (**L12**) with three different optimizers — the energy/affinity you pick is the design, not the solver.

#### Interactive graph-cut segmentation

• **interactive graph-cut segmentation** (Boykov & Jolly 2001): treat the image as a **grid graph**; the user paints a few **FG** and **BG** strokes. Build a color model from the strokes (a **data/region term** $D_p$), add an **edge-aware smoothness term** $V_{pq}\propto e^{-\lVert C_p-C_q\rVert^2/2\sigma^2}$ (cheap to cut where colors differ), and minimize $E=\sum_p D_p+\lambda\sum_{pq}V_{pq}$ by **min-cut/max-flow**. For two labels this is **globally optimal** — *why* graph cut beats greedy region-growing. [`fig-segmentation-graphcut`]

#### GrabCut: less input, same cut

• **GrabCut** (Rother et al. 2004): the user drags a **loose rectangle**; fit **Gaussian-mixture color models** to FG/BG, run the cut, **re-estimate** the models, **iterate**. A rectangle plus a couple of strokes → a clean mask — the interaction model "select subject" tools descend from.

#### Two other framings: normalized cuts and intelligent scissors

• **normalized cuts** (Shi & Malik 2000): the **spectral** framing — cut a pixel-affinity graph into balanced pieces by a **normalized** cut cost, solved by an **eigenvector** of the graph Laplacian (a relaxation) → *unsupervised* grouping. Same affinity, different optimizer. [`fig-segmentation-three-framings`]

• **intelligent scissors / live-wire** (Mortensen & Barrett 1995): the **boundary** framing — a **least-cost path** (dynamic programming, costs low along strong edges) **snaps** to the contour as you trace. Energy on the *boundary*, optimizer = shortest-path. Three energies — region, spectral, boundary — one **L12** lesson. [`fig-segmentation-three-framings`]

#### The bridge to soft selection

• **the bridge to soft selection**: every method here returns a **hard 0/1** mask, which through **hair** looks scissor-cut — jagged, missing wisps, fringed. The fix is to relax the label to a **continuous $\alpha\in[0,1]$** — **matting**, next.

▸8.8.2 The matting equation and the alpha channel⧉

• **the model**: each observed pixel is a **mixture**, $C = \alpha F + (1-\alpha)B,\ \alpha\in[0,1]$ — $\alpha=1$ pure foreground, $\alpha=0$ pure background, and the *interesting* pixels are **in between** (a strand of hair, a motion-blurred edge, an out-of-focus border, frosted glass). The matte $\alpha$ is the **coverage / transparency** channel. [`fig-matting-equation`]

• **compositing = the forward direction** (the "over" operator, Porter & Duff 1984 — *queue ref*): given $F,\alpha$ and a new $B$, the composite is one **per-pixel lerp**. (In practice colors are **premultiplied**, $F'=\alpha F$, so over is $C=F'+(1-\alpha)B$ — cheaper, correct under filtering.) This part is easy.

• **matting = the inverse, brutally ill-posed**: from one photo you observe only $C$ (3 numbers) and must recover $F$ (3), $B$ (3), $\alpha$ (1) — **7 unknowns, 3 equations** per pixel.

• 💡 **callback to L10 (the prior is not optional)**: every method adds information — a user **trimap** + a **prior/affinity** (natural matting), a **known background** (chroma key), or an **extra measurement** (depth/IR/multi-flash). Same moral as super-resolution and dehazing. (→ see Big lesson **L10**, [[#Super-resolution and image priors]].)

▸8.8.3 Trimaps: turning the impossible inverse into a tractable one⧉

• **the trimap** is the standard input: a three-way label — **definitely FG**, **definitely BG**, and a thin **unknown** band $\Omega$ straddling the boundary (the hair, the blur). From a scribble, a dilated mask, or learned.

• **why it's enough**: in the definite regions $\alpha$ is pinned and $F$ or $B$ is *observed*; the algorithm solves $\alpha,F,B$ only in the **narrow unknown band**, with nearby known colors as anchors — "7 unknowns everywhere" → "a few unknowns in a thin strip with strong boundary conditions." [`fig-trimap-to-alpha`]

• **the through-line**: *less user input is the research arc* — hard scribbles (graph cut) → loose rectangle (GrabCut) → trimap (matting) → **nothing** (learned/SAM).

▸8.8.4 Natural-image matting: Bayesian and closed-form⧉

• **Bayesian matting** (Chuang et al. 2001 — *queue ref*): a **per-pixel MAP** — model nearby FG/BG colors as Gaussians, find the $(F,B,\alpha)$ best explaining $C$, sweep outward from known regions. Intuitive but **local** (neighbouring alphas needn't agree).

• **closed-form matting** (Levin et al. 2008 — *queue ref*): **eliminate $F,B$** — assume in a small window FG and BG each lie on a **color line**, so $\alpha$ is locally an **affine function of color**, and solve $\hat\alpha=\arg\min_\alpha \alpha^\top L\,\alpha$ s.t. the trimap, where $L$ is the **matting Laplacian** (a sparse **pixel-affinity** matrix). A **sparse linear solve**. [`fig-trimap-to-alpha`]

• 💡 **the hinge — recurrence of L4**: the matting Laplacian $L$ **is an affinity matrix** — the *same object* as the bilateral weight, the graph-cut smoothness term, and the **colorization** solve ([[#Colorization as an optimization (the inverse-problem framing)]]). "**Propagate $\alpha$ along affinities, anchored by the trimap**" is colorization with $\alpha$ in place of chroma. (→ Big lesson **L4**, [[Bilateral filtering]]; the *cut* form is **L12**.)

• **robust / sampling hybrids** (Wang & Cohen 2007 — *queue ref*): combine color-sampling with the affinity smoothness — the workhorse just before the learned era.

▸8.8.5 Easy matting I — blue/green/chroma keying (constrain the background)⧉

• **the trick**: make the inverse *well-posed by construction* — shoot against a **known, uniform, saturated background** (green/blue screen) → $B$ is **known** (and far from skin/hair), so only $\alpha,F$ remain (Smith & Blinn 1996 — *queue ref*). [`fig-key-vs-measure`, left]

• **green vs blue**: **green** dominates today (two green photosites per Bayer cell → less noise; rarer in skin/clothing); **blue** was the film standard.

• **practice / failure modes**: **spill** (green bounce on edges → de-spill), **uneven screen lighting** (the "known" $B$ isn't constant), **fine detail** (hair still needs a soft $\alpha$ at the boundary). *Why studios obsess over even screen lighting* — it keeps $B$ genuinely known.

• **the moral**: chroma key removes unknowns by **controlling the scene**.

▸8.8.6 Easy matting II — IR, depth and multi-flash matting (measure the separation)⧉

• **the other road**: **add a measurement** that distinguishes FG from BG — **computational illumination** (light/sense so the *capture* hands you the matte).

• **depth matting**: a **depth sensor** (ToF, stereo, dual-pixel, LiDAR) gives "near = foreground" independent of color — phone **portrait mode**, video-call background blur. The depth edge is coarse → used as a **trimap generator**, refined with closed-form/guided matting (same affinity solve; cf. the guided-filter transmission refinement in [[#Dehazing]]).

• **IR / "Disney" matting**: light the background in the **infrared** (a retro-reflective IR background; IR-keyed matting) → a *second IR image* is a built-in green screen with **no visible colored screen** to spill/relight.

• **flash / multi-flash matting** (*queue ref*): a **flash** falls off as $1/d^2$, brightening the *near* foreground far more than the far background → differencing flash/no-flash (or multi-flash occlusion shadows) separates FG/BG by **illumination falloff**. [`fig-key-vs-measure`, right]

• **the moral**: these spend a *sensor or a light* to make the matte well-posed, not *user input* or a *prior* — the computational-illumination thread.

▸8.8.7 Rotoscoping and video matting (coherence over time)⧉

• **rotoscoping**: the hand-craft ancestor — tracing a moving subject **frame by frame**; modern tools interpolate **keyframed splines/masks** — segmentation with a **temporal** axis.

• **the new requirement — temporal coherence**: a per-frame matte that **flickers** looks worse than a slightly-wrong but **stable** one, so video matting adds a **temporal smoothness / propagation** term (track, or optical-flow-warp the matte forward) — the graph/affinity now spans a video volume.

• **applications**: green-screen VFX, live virtual backgrounds, object removal/replacement, the **selection** layer under any video edit (ties to [[Non-photorealistic rendering]]). Learned video matting (Lin et al. 2021 — *queue ref*) does this with a recurrent net.

▸8.8.8 Learned segmentation and matting (the prior, learned)⧉

• 💡 **callback to L8**: deep methods keep the *same skeleton* — select, then recover $\alpha$ — but replace the hand-built color model / affinity / trimap-propagation with a function **fit to data**. The matting equation and "cut then blend" don't change; the prior does. (→ Big lesson **L8**, [[Deep learning]].)

• **SAM — Segment Anything** (Kirillov et al. 2023): a **promptable** model — click/box/"everything" → clean masks for **arbitrary objects**, no per-class training; collapses GrabCut's interaction to a prompt and its color model to a pretrained net. The "**cut anything**" front-end feeding masks/trimaps downstream.

• **deep alpha matting** (Xu et al. 2017 — *queue ref*; trimap-free MODNet / robust matting, Lin et al. 2021 — *queue ref*): a network regresses $\alpha$ (and $F$) directly, learning the hair/edge prior; trimap-free variants push user input to **zero** (the end of the "less input" arc), only as good as their training distribution.

• placement: taught as **models** in [[Deep learning]]; here the **learned-prior** end of the same select-then-recover spectrum. Generative *object insertion* (re-render the blended result incl. shadow) is [[Generative AI and diffusion]] (**L11**).

▸8.8.9 Harmonization: making the composite belong⧉

• **the problem after the cut**: even with a *perfect* $\alpha$, a raw paste looks "**stickered on**" — different light, white balance, contrast, **grain** than the new background. The matte was right; the *appearance* is wrong. [`fig-harmonization`]

• **two layers of fix, both pointing to EDGES MATTER**:

• **seamless blending** — **Poisson / gradient-domain** (Pérez et al. 2003) pastes the foreground's **gradients**, re-deriving the boundary color from the background (**L9**, [[Poisson image editing]]); **multi-band / pyramid blending** (Burt & Adelson 1983) blends each band over a band-appropriate width (developed in [[Linear pyramids and wavelets]]). *Used here, not re-derived.*

• **global appearance harmonization** — match the foreground's **color distribution, illumination direction, contrast and grain** (classical: transfer color statistics; learned: a net predicting a harmonized composite). Fixes lighting *consistency*, which Poisson alone doesn't (Poisson hides the *seam*, not a wrong *shading direction*).

• **the takeaway**: a believable composite needs **all three** — a correct **matte** (the cut), a seamless **blend** (Poisson/pyramid), and consistent **appearance** (harmonization).

▸8.8.10 Where the blends live: Poisson, pyramid, photomontage, and fusion (cross-refs)⧉

• **resolving the stub's placeholders** — these belong to other parts (*cut here, blend there*):

• **Poisson / gradient-domain blending** → [[Poisson image editing]] (EDGES MATTER); used above (**L9**).

• **pyramid / multi-band blending** → [[Linear pyramids and wavelets]] (Burt & Adelson 1983).

• **interactive digital photomontage** (Agarwala et al. 2004) → the canonical *cut-then-blend* system: a **graph-cut seam** picks which source photo per pixel (group photo with everyone smiling, clean-plate removal, extended DoF), then a **Poisson blend** hides seams — exactly this part's two halves wired together (graph cut = **L12**; Poisson = **L9**); placed at the seam/blending crossover in EDGES MATTER, showcased here.

• **multi-image fusion** (HDR merge, exposure fusion, focus stacking, flash/no-flash) → [[Multiple exposure imaging]] (and the [[#Recap: tone mapping]] recap); shares "weighted per-pixel combine in a pyramid" with pyramid blending but combines **exposures/focus**, not **objects**.

• the one-line map: **segmentation/matting = the cut (this part); Poisson/pyramid/photomontage = the blend (EDGES MATTER); HDR/fusion = the multi-capture combine ([[Multiple exposure imaging]]).**

⬜ figure not yet created

`fig-intrinsic-decomp` (one photo → its **reflectance** layer (flat albedo) × **shading** layer (smooth illumination, no texture)) fig-intrinsic-decomp

`fig-retinex-thresholding` · Retinex: threshold the log-gradient (small steps = shading, large steps = reflectance edges), then re-integrate 🟨

⬜ figure not yet created

`fig-multi-illuminant-wb` (a scene lit warm tungsten on one side, cool window light on the other → one global white balance leaves *one* half wrong fig-multi-illuminant-wb

`fig-hdr-merge-antelope` · the HDR merge on a REAL bracket (Antelope Canyon): two exposures (short/long) + their per-pixel reliability weight maps → reliability-weighted radiance average → tone-mapped result; both sun and shadows readable ✅

`fig-dichromatic-model` · the dichromatic reflection model: observed colour = a body (diffuse) component + a specular (illuminant-coloured) component — a plane in colour space 🟨

A single photograph mixes things the eye effortlessly keeps apart: it instantly reads a white shirt in shadow as *white-and-shadowed*, not as a grey shirt. The camera can't — it records the **product**, $I = R\cdot S$, illumination times surface (**L1**). This chapter is the project of **un-mixing** that product from one image — separating light from surface, or one layer of light from another. And it is the same story as super-resolution and deblurring: the un-mixing is **under-determined**, so a **prior** does the work (**L10**). The unifying frame is the **intrinsic-image decomposition**; everything else here is a special case of it. (These are the *single-image* methods; their easier *multi-image* cousins — flash/no-flash, a rotating polarizer, a light stage — live in **Computational illumination (Advanced)**.)

💡 **Big lesson (recurrence of L1 + L10):** because the world is **multiplicative** — what you measure is light **×** surface — recovering *either factor* from one image is **ill-posed**: a dark patch can be a dim surface in bright light or a bright surface in dim light, and the pixel can't tell you which. So separating illumination from reflectance (or a reflection from a transmission, a shadow from a stain, a highlight from a color) always needs a **prior** about how surfaces, light, and natural images behave — not a tuning knob but the load-bearing part of the method. (→ see **L1** · multiplicative world, FUNDAMENTALS; → see **L10** · the prior is not optional, [[#Super-resolution and image priors]].)

equations

**image as product** $I(x) = R(x)\,S(x)$ — log makes it additive $\log I = \log R + \log S$ (**L1/L2**)

**Retinex / gradient classification** — assign $\nabla(\log I)$ to $\nabla R$ if $|\nabla\log I| > \tau$ (sharp = reflectance) else to $\nabla S$ (smooth = shading), then **Poisson-integrate** each layer (**L9**)

**layer superposition (reflection)** $I = T + R$ (transmitted + reflected), under-determined → priors

**dichromatic model** $I(x) = m_d(x)\,\Lambda_{\text{body}} + m_s(x)\,\Lambda_{\text{illum}}$ (diffuse in the **surface** color direction + specular in the **illuminant** color direction — two color vectors, separable per pixel)

▸8.9.1 Intrinsic images: the unifying frame ($I = R\cdot S$)⧉

• **the decomposition**: split one image into two **intrinsic layers** — **reflectance** $R$ (the surface's albedo, independent of lighting) and **shading** $S$ (the illumination, independent of surface). Per channel $I = R\cdot S$; in **log** it's additive, $\log I = \log R + \log S$ (**L2** — why illumination work lives in log/linear, not gamma). [`fig-intrinsic-decomp`]

• **why it's the umbrella**: once you can name "this variation is *light*, that is *surface*," every effect here is an edit on one layer — **white balance** corrects the *color* of $S$; **shadow removal** flattens a dark region of $S$; **highlight removal** strips a specular add-on; **reflection removal** un-sums two whole $I$'s.

• **why it's hard**: infinitely many $(R,S)$ multiply to the same $I$ — the multiplicative ill-posedness in the box. You need a prior on *which* variations are reflectance and which are shading.

• **Retinex** (Land & McCann 1971) — the **classic gradient prior**: assume **reflectance changes are sharp** (a painted edge) while **shading changes are smooth** (light falls off gradually). Threshold the (log-)gradient: **big jumps → reflectance**, **gentle ramps → shading**; then **Poisson-integrate** each (**L9**). Crude but captures the whole intuition: *edges are surface, ramps are light.* [`fig-retinex-thresholding`]

• **SIRFS** (Barron & Malik 2015 — *queue ref*): jointly recover **Shape, Illumination, Reflectance** by **MAP** with hand-built priors (reflectance piecewise-constant/sparse, shape smooth, illumination low-frequency). Same data-fit + prior recipe; learned successors (CNNs predicting $R,S$) in [[Deep learning]] (**L8**).

• **the takeaway**: intrinsic decomposition is *the* generalization; the rest is "I care about one layer, and I have a cue that pins it down."

▸8.9.2 Multiple-light / spatially-varying white balance⧉

• **the assumption that breaks**: ordinary white balance estimates **one** illuminant for the whole frame and **divides** it out (**L1**, [[Color technology]]) — right only under a *single* color of light.

• **the failure**: real scenes mix lights — **warm tungsten** vs **cool daylight** through a window; flash + ambient; sun + open-shade blue. One global correction neutralizes *one* region and casts the other: fix the tungsten side and the window goes blue. [`fig-multi-illuminant-wb`]

• **the fix — per-pixel light mixture** (Hsu et al. 2008): model each pixel as a **blend of two illuminant colors**, $I(x) = (\alpha(x)L_1 + (1-\alpha(x))L_2)\cdot R(x)$, solve for the **mixing map** $\alpha(x)$ — a **spatially-varying** white balance — then divide out the *local* light. The two colors come from the user, clipped highlights, or clustering.

• **why it's intrinsic-image**: recovering the *color* of the shading layer as it varies — a chrominance-only intrinsic split; the map $\alpha(x)$ is **smooth** (light varies gently) — the Retinex prior on illuminant *color*.

• **encoding note**: divide in **linear light** (mixing/dividing illuminants is linear in radiance — **L2**).

• **cross-ref**: the *easy* multi-image version (flash/no-flash, the flash's known color disambiguates) is in **Computational illumination (Advanced)**; learned single-image in [[Deep learning]].

▸8.9.3 Reflection removal — pulling apart a transmission and a glass reflection⧉

• **the problem**: shoot through a window → **two images summed**, the **transmitted** scene ($T$) plus a **reflection** ($R$): $I = T + R$ (this one is **additive**, unlike $R\cdot S$; **L2**). Pulling them apart from one $I$ is hopelessly under-determined — pure prior territory (**L10**).

• **the cue zoo** (each a different prior breaking the tie): [`fig-reflection-removal-cues`]

• **defocus / blur**: you focus on the *transmitted* scene → the **reflection is out of focus** (weak, broad gradients). Assign sharp gradients to $T$, soft to $R$ (Fattal 2008 — a Retinex-like split on *focus*).

• **ghosting**: a *thick* pane reflects twice → the reflection appears **doubled/shifted**; that known double-impulse lets you identify and subtract $R$.

• **sparse-gradient prior** (Levin & Weiss 2007 — *queue ref*): both layers have **sparse, heavy-tailed gradients**; the sum has *too many* edges → find the $(T,R)$ split **minimizing total edges** (optionally with user scribbles). Same sparse-gradient prior as blind deblurring.

• **polarization**: light off glass is **partially polarized** (Brewster) → a polarizer attenuates $R$ differently from $T$. *(The genuine multi-image method — rotate a polarizer, two frames — lives in **Computational illumination (Advanced)**; single-image methods try to get this without the extra shot.)*

• **learned** (Zhang et al. 2018 — *queue ref*): a CNN trained on synthesized $T+R$ pairs predicts $T$ with a **perceptual loss** so the layer *looks* like a real photo (**L8**).

• **the takeaway**: "$I = T + R$ has no answer; pick the cue (focus, ghosting, edge-sparsity, polarization, a learned prior)." A cousin of intrinsic images — un-summing two illuminations.

▸8.9.4 Shadow detection and shadow removal⧉

• **what a shadow *is*, intrinsically**: a region where the **shading** $S$ drops (less light) while **reflectance** $R$ is **unchanged** — same paint, just darker. So shadow removal is an **edit on the shading layer only**: flatten the darkening, keep the texture. This is why it lives under intrinsic images.

• **the detection cue (the crux)**: tell a **shadow edge** (shading discontinuity) from a **material edge** (reflectance): a shadow edge changes **brightness but not hue much** (same surface color), is often **soft** (penumbra) and geometrically smooth; shadow regions are lit by **skylight** (bluer) not direct sun (warmer) — a color tell.

• **removal as a gradient-domain edit** (**L9**): label the shadow *boundary* gradients, **zero them**, and **Poisson-reconstruct** — interior texture (small gradients) preserved, the big shadow step removed. Same machinery as gradient-domain compositing. Refine the matte with the **fast bilateral solver** (Barron 2017) so the correction follows the real penumbra (**L4**).

• **the penumbra subtlety**: a hard binary mask leaves a dark halo; you need a **soft shadow matte** and to **scale** (multiply the shadowed radiance back up by the recovered shading ratio — a multiplicative correction in linear light, **L1**).

• **cross-ref**: the easy multi-image version (**flash/no-flash**) is in **Computational illumination (Advanced)**; learned shadow removal in [[Deep learning]].

▸8.9.5 Specular-highlight removal / "fake polarization" (the dichromatic model)⧉

• **the problem**: glossy surfaces (skin, leaves, plastic, wet roads) show **specular highlights** — the light source reflected, carrying the *illuminant's* color, not the surface's; they blow out texture and fool color/material estimation. We want the **matte (diffuse-only)** image.

• **the dichromatic reflection model** (Shafer 1985 — *queue ref*): reflected light is the **sum of two terms** with **different colors** — a **diffuse/body** term in the **surface's** color + a **specular/interface** term in the **illuminant's** color: $I(x) = m_d(x)\Lambda_{\text{body}} + m_s(x)\Lambda_{\text{illum}}$. The falloffs vary per pixel, but there are only **two color directions**.

• **the geometric picture (why separable)**: plot a glossy patch's pixels in RGB → an **"L"/dog-leg**: a **line** along the body color (matte shading) with a **spur** toward the **illuminant white** (the specular ramp). Identify the two directions (Lee 1979; Klinker–Shafer–Kanade 1988 — *queue ref*), then **project each pixel back onto the diffuse line**. [`fig-dichromatic-model`]

• **"fake polarization"**: a **real** specular killer is a **polarizing filter** (specular reflection is polarized); doing the *same job* purely **computationally from one image** via the dichromatic model is the "fake polarization" — the polarizer's matte look **without** the filter or extra shot. (Lee 1979 also used the highlight's illuminant color for **color constancy** — a highlight is a free white reference.)

• **why it's an intrinsic-image cousin**: the specular term is an **additive contamination of the shading layer** in the *illuminant's* color; removing it recovers the diffuse $R\cdot S$. A **color-direction** prior rather than a gradient prior — different cue, same un-mixing goal.

• **cross-ref**: the genuine optical version (rotate a polarizer) and multi-light specular removal are in **Computational illumination (Advanced)**; learned separation in [[Deep learning]]. **Encoding**: separate in **linear light** so the dichromatic *sum* is physically additive (**L2**).

▸8.10 Non-photorealistic rendering⧉

`fig-npr-painterly-layers` · painterly rendering in coarse-to-fine brush layers: big strokes lay the ground, smaller strokes add detail where it matters (Hertzmann 1998) 🟨

`fig-npr-cartoon-real` · the Winnemöller edge-preserving abstraction pipeline on a real photo: input → iterated bilateral flatten → luminance quantise → thresholded DoG ink → cartoon composite (NPR) ✅

`fig-npr-flow-dog` · isotropic Difference-of-Gaussians lines vs flow-based coherent lines that follow the local edge tangent (Kang et al. 2007) 🟨

`fig-npr-brush-pset` · the brush-stroke problem-set figure: placing, orienting and sizing strokes from image gradients (course p-set) 🟨

Everything before this chapter tried to make images **truer** — sharper, cleaner, better-exposed. NPR does the opposite on purpose: it throws information **away** to make a photograph more like a **drawing or a painting** — to *communicate* rather than reproduce. The surprising lesson is that the tools are the **same** ones from EDGES MATTER: the operation that makes a good cartoon — **flatten the unimportant texture, keep and darken the meaningful edges** — is exactly the **edge-preserving / base–detail** decomposition (**L4**), used to *abstract* instead of *enhance*. A light chapter and a coda: *learned* stylization is in [[#Style transfer]] above and [[Deep learning]]; here we cover the **classical, structure-driven** core and the course **brush p-set**.

equations

(light) **stroke placement** — paint where the canvas differs from a **blurred reference** (error $> T$), stroke **orientation $\perp \nabla I$**, **size coarse→fine** across pyramid levels

**DoG edges** $D = G_{\sigma_1}*I - G_{\sigma_2}*I$, thresholded → ink lines

**abstraction** = **bilateral**-flattened (quantized) color **+** DoG edges overlaid

▸8.10.1 What NPR is for, and the one idea⧉

• **the goal**: a **depiction**, not a reproduction — painterly, sketch, or cartoon, **emphasizing structure** and **suppressing clutter**. Closer to *illustration* than photography; the value is in **what's left out**.

• **the one idea — structure-aware abstraction**: keep what the eye uses (strong **edges**, **large-scale tone**, the silhouette), simplify the rest (fine texture, mid-tone noise, busy backgrounds). **L13** (separate the subject, commit) executed in pixels, using **L4** (edge-preserving) machinery.

• **intuition-first**: a cartoon = **flat color regions + bold outlines.** Flat regions come from an **edge-preserving smoother** (don't blur across edges); outlines from an **edge detector**. That two-part recipe is the whole field in one sentence.

▸8.10.2 Stroke-based / painterly rendering (and the brush p-set)⧉

• **the idea** (Hertzmann 1998 — *queue ref*): repaint the photo with **discrete brush strokes**, **coarse-to-fine** — big rough strokes over a heavily-blurred version, then progressively **smaller** strokes **only where** the painting still differs from the (less-blurred) reference. Detail appears where needed; flat areas stay loose. [`fig-npr-painterly-layers`]

• **the structure that makes it look good (not random daubs)**:

• **orientation**: strokes run **perpendicular to the image gradient** (along the "grain") — following a cheek's contour or a sky's flow (**L9**);

• **size = scale**: coarse strokes on coarse pyramid levels, fine on fine — a direct use of the **Gaussian / image pyramid** ([[Linear pyramids and wavelets]]);

• **curved strokes**: trace along the orientation field for long painterly strokes, stopping at edges;

• **error-driven placement**: add a stroke only where canvas-vs-reference error exceeds a threshold → the painting **converges** to the photo at the chosen abstraction level.

• **the parameters are the *style***: stroke size range, opacity, jitter, curvature, blur radius → impressionist vs pointillist vs expressionist from *one* algorithm. (The **Gaussian pyramid** sets stroke scale and the blurred references.)

• **the course p-set**: implement the **layered, gradient-following brush painter** — build the pyramid of blurred references, place oriented strokes coarse-to-fine, explore a couple of brush/parameter styles on your own photo. [`fig-npr-brush-pset`] Use the photos-from-Fredo / sourced images for the worked example.

▸8.10.3 Edge-preserving abstraction: bilateral + Difference-of-Gaussians⧉

• **the canonical real-time pipeline** (Winnemöller et al. 2006 — *queue ref*): [`fig-npr-abstraction-pipeline`]

1. **flatten regions** with a **bilateral filter** (iterated) — smooths within objects, **stops at edges** (**L4**), collapsing texture into flat cartoon regions; optionally **quantize** color/luminance into a few bands (the posterized look);

2. **draw lines** with a **Difference-of-Gaussians**, $D = G_{\sigma_1}*I - G_{\sigma_2}*I$, thresholded into **black ink edges**;

3. **composite**: flat quantized color **+** DoG lines = a cartoon, in real time, even on video.

• **why this *is* EDGES MATTER**: the bilateral step is literally the [[Edge-preserving techniques]] base/detail filter (**L4**) — here used to **destroy** detail deliberately, not enhance it. Same tool, opposite intent.

• **flow-based DoG / coherent lines** (Kang et al. 2007 — *queue ref*): plain DoG gives **broken, noisy** edges; compute a smooth **edge-tangent flow** and run the DoG **along** it → **long, clean, coherent** lines like a confident pen. [`fig-npr-flow-dog`]

• **temporal coherence** (for video): filter with motion-awareness so flat regions and lines don't **flicker** frame-to-frame.

▸8.10.4 Example-based stylization and the bridge to neural style⧉

• **eye-tracked, perceptually-guided abstraction** (DeCarlo & Santella 2002, *Stylization and Abstraction of Photographs*, SIGGRAPH): the chapter's deepest answer to *what to keep*. Instead of inferring importance from edges alone, they **record where a viewer actually looks** (an eye-tracker) and build a **hierarchical, multi-scale** abstraction that **preserves detail at the fixated, meaningful regions and simplifies everything else** — bold flat color and clean bounding contours from a segmentation hierarchy, with detail *dialed in by measured visual attention*. NPR reframed as **directing the eye**: the abstraction reads as right because it is tied to **perception**, not just to gradients (**L13** — separate the subject and commit). This is the model of *meaningful* (importance-driven) abstraction, and the natural complement to the bilateral + DoG pipeline above, which abstracts the frame uniformly.

• **Image Analogies** (Hertzmann et al. 2001): give one example pair **A → A′** (a photo and its hand-stylized version) and a new photo **B**; synthesize **B′** so *A:A′ :: B:B′* by **patch matching** — *learn the style from one example*. The pre-deep ancestor of neural style; developed in [[#Style transfer]] (and surveyed in [[Deep learning]]) — appearing in both by design.

• **the bridge to neural style**: classical NPR hand-codes *what to keep* (edges, strokes, flat regions); **neural style** (Gatys 2016) replaces those rules with **deep-feature statistics** (content features + Gram-matrix style), and **feed-forward** stylization (Johnson 2016) makes it real-time — the same painterly goal, the prior now **learned** (**L8**). Full treatment in [[#Style transfer]] / [[Deep learning]].

• **the takeaway**: across painterly strokes, bilateral+DoG cartoons, and neural style, the constant is **abstraction guided by structure** — decide what carries meaning, keep and exaggerate it, simplify the rest. NPR is **L4 + L13** turned into a rendering style.

▸8.10.5 Region-based stylization: stained glass, low-poly, mosaics⧉

• **the idea**: instead of strokes or edges, **partition** the image into a few large regions and fill each with a **flat color** (its mean) — *abstraction by tessellation*. Pick the tiling to respect structure: a **Voronoi / centroidal** tessellation ("**stained glass by numbers**"), a **Delaunay low-poly** triangulation seeded at salient/edge points, or **superpixels** (SLIC). Same NPR thesis — *abstraction guided by structure* (**L4**) — with "what to keep" carried by **region boundaries** rather than brush orientation.

• **the dial = number / shape of regions**: few large cells → bold, poster-like; many small cells → near-photographic. Edge-aware seeding keeps tiles from straddling strong contours (the [[Bilateral filtering]] / segmentation link).

• **p-set lineage**: the "stained glass by numbers" make-your-own assignment; kin to **photomosaics** ([[10 Many images and photo collections#Photomosaics, Salavon]]) and to **graph-cut region labelling** ([[Seam optimization]]).

▸8.10.6 Artistic screening and halftoning⧉

• **the idea**: reproduce a continuous-tone image with **discrete marks** — the classic print problem — but make the *marks themselves* expressive. Ordinary **halftoning** trades spatial resolution for tone (clustered-dot screens, **ordered dithering**, **error diffusion** à la Floyd–Steinberg); **artistic screening** (Ostromoukhov & Hersch) replaces the regular dot screen with **shaped, image-bearing screen elements** so the halftone microstructure carries a *second* picture, lettering, or texture. Tone is still right at a distance; up close the rendering is a deliberate pattern.

• **why it's NPR**: same thesis — *abstraction guided by structure* (**L4**) — here the abstraction lives in **how tone is quantized into marks**, not in strokes, edges, or regions. Stippling, engraving/hatching, and ASCII-art are the same move with different mark vocabularies.

• **the dial = the screen / mark set**: choice of screen element, frequency, and angle sets the style; error-diffusion vs ordered-dither sets the "texture" of the noise (blue-noise dithers look cleanest). Ties to [[Color technology]] (CMYK screen angles, moiré) and the perceptual reason it works (the eye's spatial averaging — [[Human (and animal) vision and color]]).

• **p-set lineage**: the "artistic screening" make-your-own assignment.

▸Part 9 OPTICS, LENSES, AND ABERRATION CORRECTION⧉

Bottom line: in theory (thin-lens optics) everything converges nicely. In practice it is hard to make **all** the rays, from **all** field points, at **all** wavelengths, through **all** of the aperture, converge to the right place.

• because Snell's law is **non-linear** (the cubic term the thin lens dropped) and lenses are **not infinitely thin**;

• because the refractive index depends on **wavelength** (dispersion);

• because real surfaces, mounts, and air gaps add stray light, manufacturing error, and motion.

But with **computation** we can calibrate and correct — so the optics only has to get *close*, and software finishes the job. That partnership is the spine of the part.

---

▸9.1 Aberrations and optical challenges⧉

`fig-doublet-triplet-gauss` · cross-sections of three classic designs as stacked glass elements — achromatic doublet (2), Cooke triplet (3, + − +), double-Gauss (6, ≈symmetric about the stop); each labels element count and the aperture stop ✅

`fig-double-gauss` · a labelled double-Gauss cross-section: 6 elements ≈symmetric about the central aperture stop, parallel ray bundle traced to the focal plane; notes near-mirror symmetry cancels odd aberrations (coma, distortion, lateral colour) ✅

`fig-principal-planes-pupils` · a thick-lens cardinal-points diagram: principal planes H, H′, nodal points N, N′, entrance & exit pupils (the stop imaged by front/rear groups), focal points F, F′; f measured from H′ via the H→H′ equivalent-thin-lens construction ✅

equations

thick-lens / system focal length via the two principal planes (FUNDAMENTALS' $1/f=1/u+1/v$ holds when $u,v$ are measured from $H, H'$)

f-number from the **entrance-pupil** diameter $N = f/D_{\text{EP}}$ (not the front element)

T-stop $T = N/\sqrt{\tau}$ (transmittance $\tau<1$)

effective vs back focal length.

▸9.1.1 Taxonomy of challenges⧉

• a single lens fixes **one** degree of freedom (its power/focal length) but you need to satisfy **many** constraints simultaneously: bring *every* field point to focus, at *every* wavelength, across the *whole* aperture, onto a *flat* sensor.

• the thin lens **dropped** everything past the linear $\sin\theta\approx\theta$; restore the next term $\sin\theta\approx\theta-\theta^3/6$ (**third-order / Seidel optics**) and rays from one point **no longer meet at one point**. The departures are the aberrations — same approximation, dropped one term later (FUNDAMENTALS margin note). [`fig-seidel-aberrations`]

• two families: **monochromatic** (geometry of the surfaces, present even in one color — the five Seidel) and **chromatic** (because $n(\lambda)$ — dispersion). Each is *described* here; each is *corrected* in *Aberrations correction*.

#### spherical aberration

• rays through the **edge** of the aperture focus **nearer** than paraxial rays → a soft glowing core; worst wide open. (Deliberately left in for *soft-focus*, *Special optics*.) [`fig-spherical-aberration`]

#### coma

• off-axis a point images as a **comet/teardrop** smear (asymmetric) — the name *is* the comet shape.

#### astigmatism

• off-axis, tangential and sagittal line-features focus at **different depths** → you can't focus radial and tangential detail at once.

#### field curvature

• (Petzval): the sharp image lies on a **curved surface**, not the flat sensor → corners soft when the center is sharp (why "flat-field" objectives are prized).

#### chromatic aberration

• glass index varies with wavelength (**dispersion**, measured by the **Abbe number** $V_d$), so colors refract differently. [`fig-chromatic-aberration`]

• **axial (longitudinal)**: R/G/B focus at **different depths** → color halos; worst at fast apertures.

• **lateral (transverse)**: R/G/B imaged at **different magnifications** → colored fringes that **grow toward the corners** (and the everyday **purple fringing** at high-contrast edges, partly CA, partly sensor/flare). Lateral CA is essentially a **per-channel radial scaling**; axial CA is a depth-dependent blur — the two are a continuum but call for **different fixes** (*Aberrations correction*).

#### radial distortion

• magnification varies with field → straight lines bow (**barrel** / **pincushion**). Uniquely, distortion **moves no light energy** (points stay sharp, just mis-placed), so it is the **easiest to correct in software** — a **radial** remapping. [`fig-distortion-barrel-pincushion`]

• the standard parametric models used for lens-profile correction and camera calibration map undistorted radius $r_u$ and distorted radius $r_d$ (both measured from the distortion center):

• **Brown–Conrady** (radial, forward): $r_d = r_u\,(1 + k_1 r_u^2 + k_2 r_u^4 + k_3 r_u^6)$, with optional **tangential** (decentering) terms $p_1, p_2$ for off-axis-mounted elements; $k_1<0$ gives barrel, $k_1>0$ pincushion.

• **division model** (Fitzgibbon): $r_u = r_d / (1 + \lambda_1 r_d^2 + \lambda_2 r_d^4 + \dots)$ — often a single $\lambda_1$ captures even strong (wide-angle) distortion and inverts more cleanly, which is why it is popular for calibration.

#### wave effects and diffraction

• light is a **wave**; a wave through an aperture **diffracts** (spreads), so even a perfect lens images a point as the **Airy** disc + rings, angular radius $\theta \approx \lambda/D$ — a **floor** (the diffraction limit), not a defect. Bigger aperture → less diffraction.

• **the pattern *is* the aperture's Fourier transform** (squared magnitude = the diffraction-limited PSF): round hole → Airy disc; a wider hole has a narrower transform = a smaller spot (the $\lambda/D$ floor restated). Explains **sunstars** = FT of the polygonal blade opening (each blade-edge → a perpendicular streak; $n$ blades → $n$ spikes if even, $2n$ if odd). The lens computes a Fourier transform for free → ties to [[03 Basic image processing and ISP]] (Fourier) and the **MTF**.

• stopping down cures geometric aberrations but eventually raises the diffraction floor → every lens has a **sweet spot** ~2–3 stops down; past it, uniform softening.

#### vignetting

#### flare and coating

▸9.2 Aberrations correction⧉

`fig-achromat-doublet` · an achromatic doublet: positive crown cemented to negative flint; singlet panel (red & blue split = chromatic aberration) vs doublet panel (red = blue at a common focus); labels crown/flint, low/high dispersion ✅

⬜ figure not yet created

the dispersion cancels)

fig-lens-profile-correction · computational lens correction: a real window-grid facade degraded (barrel + vignette + lateral CA) → a measured profile's three radial maps (warp / gain / per-channel rescale) → corrected; note: geometric defects only, not blur (Aberrations correction) ✅

`fig-psf-deconvolution` · non-blind deconvolution by the measured PSF: a real crop convolved with a comatic PSF (inset) + noise → Wiener deconvolution (sharper) → under-regularized (noise amplified); image = scene ∗ PSF (Aberrations correction) ✅

equations

**achromat condition** $\phi_1/V_1 + \phi_2/V_2 = 0$ (powers weighted by dispersion cancel the chromatic term — Abbe number $V_d=(n_d-1)/(n_F-n_C)$ from the challenges chapter)

image = scene $*$ **PSF** $\to$ correction = **deconvolution** by the *measured, spatially-varying* PSF (cross-ref Wiener $\hat X = \overline K Y/(|K|^2+\mathrm{SNR}^{-1})$).

▸9.2.1 The two families of cure⧉

• every aberration is *described once* in *Aberrations and optical challenges*; here each is *corrected* — **in glass** (shape and combine elements, pick glasses, stop down) or **in software** (remap, rescale, deconvolve). Each design feature cancels a specific aberration, so the cures are organized by the defect they target.

▸9.2.2 Correction in glass⧉

• **achromat** (crown + flint doublet): pair a low-dispersion positive element with a high-dispersion negative one so the **chromatic terms cancel** at two wavelengths — the **achromat condition** $\phi_1/V_1+\phi_2/V_2=0$. [`fig-achromat-doublet`]

• **apochromat / superachromat**: correct **three (or more) wavelengths**, using **ED / low-dispersion / fluorite** glasses — for long telephotos where residual color is visible.

• **aspheric elements**: a non-spherical surface gives the designer an extra knob to kill **spherical aberration** with **fewer elements** (key to compact, fast, and **cell-phone** lenses). Sidebar: **spherical surfaces are easy to grind** (any two ground together tend to a sphere), so aspheres came late and rely on **precision molding** — which is exactly why cheap **molded plastic/glass** aspheres gave phone cameras their edge.

• **stopping down** reduces the aberrations that scale with aperture (spherical, coma, lateral color, vignetting) — the universal field fix — until **diffraction** sets the **sweet spot** (FUNDAMENTALS diffraction limit; usually ~2–3 stops down).

▸9.2.3 Computational correction⧉

• **lens-profile correction** (in the ISP): from a **measured profile** per lens+focal length+aperture, software undoes **geometric distortion** (radial warp via the Brown–Conrady / division models from *Aberrations and optical challenges* — fit $k_1,k_2,\dots$ or $\lambda_1$ per lens; detailed below in *Radial distortion correction*), **vignetting** (a radial gain map — both optical *and* the FUNDAMENTALS $\cos^4$), and **lateral CA** (per-channel radial rescale). Cheap, robust, ubiquitous — the everyday face of optics-plus-computation. [`fig-lens-profile-correction`]

• **deconvolution by the measured PSF**: an aberrated lens has a known, **spatially-varying PSF**; measuring it lets you **deconvolve** to recover sharpness — the same machinery as [[Slopendium/08 Single-image computational photography|deblurring]] (Wiener / regularized inverse), here with a **calibrated** rather than unknown kernel, so it is non-blind and per-field-position. [`fig-psf-deconvolution`]

• limits: deconvolution **amplifies noise** at frequencies the lens killed (FUNDAMENTALS/deblurring L) — it sharpens what was *attenuated*, not what was *destroyed*; and a **spatially-varying** PSF means per-region kernels.

• **forward-ref**: instead of *fighting* the lens's accidental blur, **engineer** a PSF that inverts cleanly and even encodes depth — **coded aperture / wavefront coding** (Dowski & Cathey 1995) and end-to-end optics (Advanced).

▸9.2.4 Radial distortion correction⧉

• **the most software-friendly aberration** (it is a *lens* defect, not a projection one — moved here from the warping part, where keystoning lives): radial distortion **moves no light energy** (points stay sharp, only mis-placed), so it comes off almost perfectly as a **1-D remap in radius**, with **no loss of sharpness** — the cleanest cure in the catalogue. Because it is so correctable it is increasingly *designed in*: phone/compact wide-angles ship heavy **barrel** distortion on purpose (smaller, faster glass) and trust the ISP to straighten it. [`fig-radial-distortion-correction`]

• **the symptom**: world-straight lines image as **curves** — **barrel** (bow **outward**, magnification falls with radius — wide-angle), **pincushion** (bow **inward** — telephoto), or a **mustache/wavy** mix; radially symmetric about the principal point, **worst at the edges**; the extreme is the **fisheye** (so strong it is the intended projection, not a defect).

• **the model — per-radius remap**: a 1-D function of radius applied along each ray (points move only **in/out**, never sideways) — the same forward **Brown–Conrady** $r_d = r_u(1+k_1 r_u^2+k_2 r_u^4+\dots)$ ($k_1<0$ barrel, $k_1>0$ pincushion) and **division** $r_u = r_d/(1+\lambda_1 r_d^2+\dots)$ models stated in *Aberrations and optical challenges*; correcting the image **inverts** that map per radius.

• **where the parameters come from**: a **lens profile** (a per-(lens, focal length, aperture) table shipped by **Lightroom/ACR, PTLens, DxO**, keyed off **EXIF** — the same profile that undoes vignetting and lateral CA in one pass) or **plumb-line calibration** (choose the $k_i$/$\lambda$ that make imaged straight lines straight — minimize their residual curvature).

• **applying it — inverse-warp + resample**: loop over output pixels, push each back through the radial map to its source radius, and resample the input (the transport engine of [[Warping and resampling]]); the stretched/compressed edge regions carry the usual **prefilter-before-downsample** care (**L16**), and out-of-frame corners follow the boundary/crop policy. The warping part now points *back here* for this correction.

---

▸9.3 Lens optimization⧉

`fig-merit-function-loop` · lens design as optimization, a closed loop: parameters → ray-trace (fields × wavelengths × pupil zones) → merit function Φ (weighted spot/wavefront + constraints) → damped-least-squares step → repeat; the book's forward-model+loss+optimizer pattern (Lens optimization) ✅

⬜ figure not yet created

before/after optimization) fig-wb-before-after

`fig-rms-spot-vs-field` · RMS spot radius (µm) vs field (center→corner) for 3 wavelengths, rising toward the edge, with the Airy diffraction limit as a shaded dashed floor (Lens optimization) ✅

• the move: add elements → add **surfaces, glasses, and air gaps** → add free parameters to *trade off* aberrations against each other. The art is doing it with as few elements as the quality target allows (cost, weight, flare).

equations

ray-trace forward model — apply Snell's law $n_1\sin\theta_1=n_2\sin\theta_2$ surface-by-surface (no paraxial approximation)

merit function $\Phi = \sum_{\text{fields}}\sum_{\lambda}\sum_{\text{pupil rays}} w_i\,\lVert (x_i,y_i) - (\bar x,\bar y)\rVert^2 \

+\

(\text{constraints on } f, \text{length, edge thickness}\dots)$ — a weighted sum of squared transverse ray errors plus penalties

damped-least-squares (Levenberg–Marquardt-style) update on the parameter vector.

▸9.3.1 The high-level idea: design as optimization⧉

• restate the impossibility: beyond first order there is **no formula** that places every ray; you **search** the parameter space instead.

• **parameters (the unknowns)**: each surface's **radius of curvature**, the **glass** of each element (index *and* dispersion — a discrete catalog choice), element **thicknesses** and **air-gap spacings**, **aspheric** coefficients, and the **stop** size/position.

• **objective (the merit function)**: a single number summing **residual aberration** over a grid of **field angles**, a set of **wavelengths** (e.g. C, d, F lines), and a sampling of **aperture/pupil** zones — typically RMS spot size or wavefront error — plus **hard constraints** (target $f$, total length, positive edge thickness, vignetting limits).

• "**the big loss integral**": conceptually, integrate the blur over *all the conditions the lens must satisfy at once* (every field point, color, aperture position) and drive it down — the discrete merit function is that integral, sampled.

▸9.3.2 The forward model: ray-tracing and spot diagrams⧉

• evaluate a candidate design by **exact ray-tracing**: launch many rays from each field point through a grid of pupil points, refract them by **Snell's law at every surface** (the full non-linear law, not the paraxial one), and see where they land.

• a **spot diagram** plots those landing points at the image plane for one field point and color: a tight cluster = sharp; a smeared blob = aberrated. Different colors land in different clusters (chromatic). This *is* the lens's PSF, sampled by rays. [`fig-spot-diagram`]

• the optimizer shrinks these spots across all fields/colors simultaneously. [`fig-merit-function-loop`, `fig-rms-spot-vs-field`]

▸9.3.3 From hand calculation to software⧉

• **the old-fashioned way**: designers traced rays **by hand** (and with logarithm tables, then mechanical/early electronic calculators), iterating curvatures laboriously — a few rays a day. The classic doublets/triplets were found this way.

• **today**: optical-design software (**Zemax OpticStudio**, **CODE V**, OSLO) ray-traces thousands of rays per evaluation and runs **damped-least-squares** optimization over the parameter vector, plus **global** search to escape local minima and **tolerancing** to keep the design manufacturable.

• **how people actually use it** (workflow, not a black box): start from a **starting point** (the program's library, a scaled existing lens, or a competitor's **patent** prescription) — never a blank sheet; set up the *problem* — field points, wavelengths + weights, f-number, then mark which parameters are **variables** and which are pinned by **solves** (auto-recompute one number to hold focal length / back focus); pick the **default** RMS-spot/wavefront merit + hand-added **operands** (cap distortion, edge thickness, length, glass cost); run **local** damped-least-squares (seconds), read **spot diagrams / ray fans / MTF**, nudge weights, re-run — a tight **interactive loop**, optimizer does arithmetic, human supplies judgment; when stuck, unleash **global** search (Zemax "Hammer", CODE V global synthesis) for hours/overnight; then **tolerance** + export to mechanical CAD / stray-light. The craft moved *up a level*: from tracing rays by hand to **steering** an optimizer.

• it is the same machinery as the rest of the book — forward model + merit/loss + iterative optimizer — applied to glass.

▸9.3.4 Tradeoffs: the design is always a compromise⧉

• the merit function lives under hard **constraints**, and the axes fight: **speed** (low f-number) vs. **aberration control**; **size/weight** vs. **correction**; **cost** (number of elements, exotic glass, aspheres) vs. **sharpness**; **zoom range** vs. **everything**. You buy one by spending another. [`fig-lens-design-tradeoff`]

• this is exactly why **computation** is attractive: let the optics be *good enough and cheap*, and **correct the rest in software** (lens-profile correction, deconvolution — *Aberrations correction*), shifting the operating point on the Pareto front.

• **forward-ref — end-to-end optimization**: make the *whole pipeline* (lens + sensor + reconstruction network) differentiable and optimize the **merit of the final image**, not of the optics alone — the optics may then look "wrong" by classical metrics yet give better pictures (Advanced part: differentiable / computational optics, coded apertures, wavefront coding).

▸9.3.5 Tolerancing: from the nominal design to a manufacturable one⧉

⬜ figure not yet created

a tolerance budget — performance (MTF) vs each perturbation

⬜ figure not yet created

a **Monte-Carlo** as-built MTF distribution (nominal vs the manufacturing spread, with a yield cutoff) [schem/plot]

• **the error sources**: per-surface **radius** error, **thickness / airspace** error, glass **index & dispersion** (melt-to-melt) variation, **wedge**, surface **irregularity** (departure from the test sphere), and assembly **decenter / tilt / roll** of each element

• **sensitivity vs statistics**: a **sensitivity analysis** ranks which perturbations hurt most (the tight tolerances that drive cost); a **Monte-Carlo** draws many random as-built lenses and predicts the **performance distribution** and the manufacturing **yield**

• **compensators**: leave a few adjustments free at assembly — refocus, set one **airspace**, or shim/center an element — to *recover* performance, so tolerances elsewhere can loosen (cheaper)

• **the punchline**: nominal performance is the easy part; **as-built** performance and **yield** decide cost and feasibility. Loosen tolerances and lean on **computational correction** (lens-profile correction, deconvolution — *Aberrations correction*) to claw back the residual — another reason cheap optics + computation wins.

---

▸9.4 A short bestiary of classic designs⧉

• **achromatic doublet**: a positive crown + negative flint, usually cemented; the *minimum* system that corrects color (two wavelengths) and is also used to tame spherical aberration. (Full treatment in *Aberrations correction*.) [`fig-doublet-triplet-gauss`]

• **Cooke triplet**: three elements (+ − +) — the smallest design with enough freedom to correct **all** the primary (Seidel) aberrations *and* color; the ancestor of most simple lenses. [`fig-doublet-triplet-gauss`]

• **double-Gauss (Planar)**: roughly **symmetric about the aperture stop**, 6–7 elements; symmetry automatically cancels the *odd* aberrations (coma, distortion, lateral color), which is why it dominates fast normal lenses ($f$/2 and faster). [`fig-double-gauss`]

• **telephoto vs. retrofocus** — two ways to decouple physical length from focal length: [`fig-telephoto-vs-retrofocus`]

• **telephoto** (positive group then negative): the rear negative group pushes the rear principal plane *forward*, so a long-focal-length lens is **physically shorter** than its $f$.

• **retrofocus / inverted telephoto** (negative then positive): pushes the rear principal plane *back*, giving a long **back-focal distance** — needed so a wide-angle lens clears the SLR mirror box (and historically why mirrorless can go wider/smaller).

• **zoom**: a variable-focal-length system — groups slide so $f$ changes (the *variator*) while a *compensator* group keeps the image plane fixed; many elements, many compromises, the hardest thing to optimize. [`fig-zoom-mechanism`]

▸9.4.1 The lens as a system: cardinal points, pupils, f-number, T-stop⧉

• **principal planes** $H, H'$: the two planes from which object/image distances and focal length are measured; with them, the thin-lens equation $1/f=1/u+1/v$ holds for the whole thick system. **Nodal points** (where a ray crosses the axis undeviated) coincide with the principal points in air — the pivot for "no-parallax" panorama heads. [`fig-principal-planes-pupils`]

• **entrance & exit pupils**: the images of the aperture stop seen from object / image space. The **entrance pupil** is the opening that actually gathers light, so the true f-number is $N = f/D_{\text{EP}}$ — *not* $f$ over the front-element diameter. Pupil position governs perspective of the bokeh and the **no-parallax point**.

• **f-number** (recap from FUNDAMENTALS) sets light-gathering and DoF; here it is pinned to the entrance pupil. **Vignetting** clips the pupil off-axis (next section's cat's-eye bokeh and the optical part of corner darkening).

• **T-stop**: the f-number assumes 100% transmission; real glass and coatings lose some, so cinema lenses are marked in **T-stops**, $T = N/\sqrt{\tau}$ with transmittance $\tau$, giving the *actual* exposure. Ties to *Glare suppression* (coatings raise $\tau$) and FUNDAMENTALS exposure.

• **element count today**: modern zooms and fast primes run **10–20+ elements** in several groups (with aspheres and special glasses) — a direct consequence of stacking corrections; phone "lenses" are 5–8 tiny molded-plastic/glass elements. The count is the visible cost of cancelling everything in glass instead of software.

---

▸9.5 Special optics⧉

`fig-scheimpflug-tilt` · the Scheimpflug principle: tilting the lens makes the lens plane, image plane & plane of focus meet in a common line → the focus plane tilts into a wedge of sharpness; contrasted with the parallel (untilted) case ✅

`fig-fisheye-projection` · projection mappings — image radius r vs incidence angle θ for rectilinear (r=f·tanθ, blows up near 90°), equidistant (r=f·θ) and equisolid (r=2f·sin(θ/2)); why fisheye reaches ~180° where rectilinear can't ✅

`fig-seidel-aberrations` · the five monochromatic (Seidel) aberrations at a glance — spherical, coma, astigmatism, field curvature, distortion — each a tiny ray/spot icon + a one-line "what it looks like" note ✅

`fig-catadioptric-mirror-lens` · a mirror (catadioptric) lens: light folds via primary + secondary mirror (Cassegrain); central obstruction → ring-shaped (doughnut) bokeh — folded path + doughnut-highlight inset ✅

`fig-anamorphic-squeeze` · anamorphic optics: 2× horizontal squeeze at capture → de-squeeze in post; a circle → vertical oval on sensor → restored; oval bokeh + horizontal flares noted ✅

`fig-fresnel-zones` · a Fresnel lens: collapse a thick convex lens into concentric ring "zones" that keep the local surface slope but drop the bulk; cross-section of a normal lens vs its Fresnel equivalent ✅

`fig-grin-vs-metalens` · three ways to bend light: shaped bulk lens (refraction), a GRIN rod (curved ray through a varying-index medium, flat faces), and a metalens (flat surface, sub-wavelength nanostructures setting a phase profile); "keep the phase, drop the thickness" ✅

`fig-tilt-shift-perspective` · the *shift* movement (not tilt): tilting the camera up keystones a building (converging verticals) vs. keeping the sensor vertical/parallel and raising the lens within a large image circle, so verticals stay parallel — architectural perspective control ✅

`fig-telescope-types` · refractor (doublet objective → focus) vs. reflectors — Newtonian (concave primary + flat 45° secondary → side focus) and Cassegrain (primary + convex secondary → focus through a hole behind the primary); glass vs. mirrors to a focus ✅

`fig-microscope-objective` · high-NA objective at a short working distance: finite (fixed tube length, objective forms the image directly) vs. infinity-corrected (subject at front focus → parallel "infinity space" → separate tube lens forms the image) ✅

`fig-soft-focus` · a soft-focus portrait lens: under-corrected spherical aberration (marginal rays cross nearer than paraxial → halo at the paraxial plane) and the resulting PSF — a sharp core inside a broad glow vs. a well-corrected lens; a *wanted* aberration ✅

`fig-apodization-pupil` · an apodizing filter: pupil transmission profile (hard-edged plain aperture vs. radially graded) → the defocus disk each makes (flat disk with a hard ring vs. a smoothly fading disk, no ring) — shaping the bokeh PSF in hardware ✅

equations

**Scheimpflug condition** (lens plane, image plane, focus plane meet in one line) + the **Hinge rule** for where the focal wedge pivots

fisheye projection laws $r = f\theta$ (equidistant), $r = 2f\sin(\theta/2)$ (equisolid-angle) vs. rectilinear $r = f\tan\theta$

anamorphic squeeze factor (typ. 2×)

Fresnel = local surface slope preserved, bulk removed.

▸9.5.1 Tilt-shift and the Scheimpflug principle⧉

• **shift**: translate the lens parallel to the sensor to control **perspective** (keep verticals parallel — architecture) and to **stitch** without parallax; the image circle must be large enough to cover the shifted frame. [`fig-tilt-shift-perspective`]

• **tilt**: rotate the lens so the **plane of focus tilts** away from parallel-to-sensor. By the **Scheimpflug condition**, the lens plane, image plane, and plane of sharp focus all meet along one line — so you can lay the focal plane along a tabletop (everything sharp) or slice a thin wedge across a scene (the "miniature-faking" look). [`fig-scheimpflug-tilt`]

• ties to *Depth of field*: tilt reshapes the in-focus region from a slab into a **wedge**; it does not add DoF, it **reorients** it.

▸9.5.2 Fisheye and non-rectilinear projection⧉

• a normal lens targets **rectilinear** projection ($r=f\tan\theta$) — straight lines stay straight — which **blows up** past ~100° field. A **fisheye** instead adopts a projection that *bends* straight lines to fit a huge field (often **180°+**) onto the sensor: **equidistant** $r=f\theta$ or **equisolid-angle** $r=2f\sin(\theta/2)$. [`fig-fisheye-projection`]

• the barrel "distortion" is **the intended mapping**, not an aberration — and it is **invertible in software** to rectilinear (at the cost of stretched corners), the cleanest example of optics-and-computation splitting the work.

▸9.5.3 Mirrors: catadioptric and reflecting systems⧉

• **catadioptric** = mirrors **and** lenses. Folding the path with mirrors makes a **compact long telephoto** (mirror lens) and is naturally **achromatic** (reflection has no dispersion). The **central obstruction** gives a **ring-shaped pupil**, so out-of-focus highlights render as **doughnuts** — the signature mirror-lens bokeh (cross-ref *Depth of field*). [`fig-catadioptric-mirror-lens`]

• **telescopes**: reflectors (Newtonian, Cassegrain) and catadioptrics (Schmidt–Cassegrain, Maksutov) scale the same idea to large apertures; (space) telescopes also motivate the **glare/baffling** chapter. [`fig-telescope-types`]

▸9.5.4 Periscope / folded-lens design (smartphone telephoto)⧉

• **the constraint**: a phone body is only a few millimetres thick, so a lens whose axis runs perpendicular to the screen cannot be physically long — capping the focal length and thus the telephoto reach.

• **the fix**: a **prism or mirror** placed just behind the entrance window **bends the optical axis ~90°**, so the lens groups lie *sideways* along the phone's width or length, where there is room for a long focal length; the sensor sits parallel to the phone's back. Because the path is "folded" the design is called a **periscope** or **folded-optics** lens. [`fig-catadioptric-mirror-lens` (cross-ref, same folding idea)]

• **what it buys**: a long effective focal length — ~5×–10× optical zoom / ~100–250 mm equivalent — in a thin body, without the protruding bump a straight long-lens design would require.

• **mechanics**: moving internal groups give continuous or stepped zoom; **OIS** is commonly implemented by *tilting or shifting the prism* (or a group) rather than floating a lens element — a compact, low-power actuator.

• **tetraprism (Apple, iPhone 15 Pro Max)**: folds the path **multiple times** with a tetrahedral prism, lengthening the effective optical path further within the same device width.

• **history**: popularized by the **Huawei P30 Pro (2019)**, then adopted by Samsung (Galaxy S20 Ultra and successors) and Apple; the reason phones acquired genuine long-reach telephoto. Cross-ref: the **telephoto design** in [[A short bestiary of classic designs]] compresses *physical length* relative to focal length by placing principal planes; the periscope instead *reorients* the whole long lens so its length runs parallel to the phone face — two independent strategies for compact telephoto, usable together.

• **trade-offs**: the small front window and folded path limit the **maximum aperture** (less light than a rear-camera main module); each periscope module covers a fixed or narrow zoom range; prism alignment and cost add manufacturing complexity.

▸9.5.5 Anamorphic optics⧉

• a cylindrical-element system **squeezes the image horizontally** (classically 2×) so a wide aspect ratio fits a narrower frame, then a matching de-squeeze restores it on projection. [`fig-anamorphic-squeeze`]

• side effects that became a *look*: **oval bokeh** and **horizontal (blue) flare streaks** — an aesthetic now imitated computationally.

▸9.5.6 Macro, microscope objectives, and telescopes⧉

• **macro**: images at **life-size or larger** ($m\ge 1$), where the thin-lens conjugates run the *other* way (image distance large), DoF is razor-thin (cross-ref *Depth of field* — **focus stacking** lives there; [Stanford focal-stack paper](https://graphics.stanford.edu/papers/focalstack/focalstack.pdf)), and special **flat-field** correction matters.

• **microscope objectives** and **telescope objectives** are the same compound-lens optimization pushed to extreme magnification / aperture, with their own corrections (apochromatic, flat-field). The high-NA microscope objective images a tiny subject at a short **working distance**, in either a **finite** (fixed tube length) or modern **infinity-corrected** (parallel "infinity space" + a separate tube lens) configuration. [`fig-microscope-objective`]

▸9.5.7 Soft-focus and apodization⧉

• **soft-focus** portrait lenses deliberately leave **under-corrected spherical aberration**, overlaying a sharp core with a glowing halo — a *wanted* aberration. [`fig-soft-focus`]

• **apodization**: a radially-graded (smooth-edged) pupil that **softens the bokeh transition** (smooth-edged defocus disks, no hard ring) — shaping the PSF on purpose, a hardware cousin of computational bokeh. [`fig-apodization-pupil`]

▸9.5.8 From shaped glass to thin structures: Fresnel, diffractive, GRIN, metalenses⧉

`fig-diffraction-grating-sim` · live 2-D FDTD wave sim of a **diffraction grating** (many slits): plane wave splits into orders along d·sinθ=mλ, orders sweep with wavelength (longer λ → larger deviation); intensity comb peaks at the orders (OPTICS → Diffractive elements) ✅

• *(throughline to the Advanced part: "keep the phase / slope, drop the thickness.")*

#### Fresnel lenses

• a **Fresnel lens** keeps only the part of a lens that actually bends light — its **surface slope** — and throws away the bulk glass: slice the lens into concentric annular zones and **collapse each to one thin sheet**, so the smooth profile becomes a set of **sawtooth rings** (a ring of tiny prisms) with the same local refraction as the original curve.

• **why**: dramatically **thinner, lighter, cheaper** than an equivalent thick lens — at the cost of **image quality** (scattering and diffraction at the ring discontinuities, the visible concentric-ring artifacts), so it is used where thinness / weight / cost beat sharpness.

• **history & uses**: invented by **Augustin Fresnel** for **lighthouses** (a huge aperture without an unliftable blob of glass); today in **condenser / projection optics**, **screen & page magnifiers**, **solar concentrators**, **vehicle brake lights**, thin illumination optics, and the **Fresnel/PF telephoto** elements in modern camera lenses.

• **the throughline**: it is the first step from "shape the glass" toward **encoding the lens as a thin structure** → diffractive optics, kinoforms, and ultimately **flat optics / metalenses** (Advanced); the same "keep the phase, drop the thickness" idea recurs in **wavefront coding**.

#### Diffractive elements

• bend light by **diffraction** off fine surface structure rather than bulk refraction; their dispersion has the **opposite sign** to glass, so a diffractive element can **correct chromatic aberration** in a hybrid lens (Canon DO, Nikon PF). Thin, light, but flare-prone at the rings. [`fig-grin-vs-metalens`]

• the canonical diffractive element is the **diffraction grating** — a periodic array of slits/grooves of period $d$. Wavelets from adjacent periods reinforce only where their path difference is a whole number of wavelengths, the **grating equation $d\sin\theta_m = m\lambda$**, so incident light splits into discrete **orders** $m=0,\pm1,\pm2,\dots$ at angles $\theta_m$. Because $\theta_m$ **grows with $\lambda$**, a grating **disperses** white light into a spectrum — the heart of the **spectrometer** — dispersing in the *opposite* sense to a prism (prisms bend blue most; gratings bend red/long-λ most), and the same physics behind the rainbow sheen of a **CD/DVD**. 🖱️ **Interactive (web edition):** a live 2-D **FDTD wave simulation** — a plane wave hits a multi-slit grating and breaks into orders along the $d\sin\theta=m\lambda$ directions; **sweep the wavelength and watch the orders fan wider** (longer λ → larger deviation), or change the period. Absorbing (sponge + Mur) borders, so nothing reflects off the window edges. Companion to the single-slit `fig-diffraction-wave-sim` (FUNDAMENTALS). [`fig-diffraction-grating-sim`; interactive, FDTD]

#### Mirrors (cross-link)

• see *Mirrors: catadioptric and reflecting systems* above (reflection is the other way to avoid dispersion).

#### GRIN (gradient-index)

• the index **varies continuously inside** the glass, so a ray bends along a *curved* path even through flat surfaces — focusing with a homogeneous-looking slab. Used in fiber coupling, endoscopes, photocopier rod arrays, and the eye's own lens. [`fig-grin-vs-metalens`]

#### Metalenses (forward-ref)

• a flat **sub-wavelength nanostructure metasurface** imposes the lens's phase profile with essentially **zero thickness** — the endpoint of the Fresnel→diffractive→flat-optics arc. Full treatment in the **Advanced** part; placed here as the conceptual destination.

---

▸9.6 Focus, autofocus⧉

⬜ figure not yet created

hunting to find it)

`fig-pdaf-principle` · phase-detect AF: a separator splits the beam to two line sensors; the signed phase (separation) of the two profiles gives direction + amount of defocus ("stereo in the lens"); in-focus vs front/back-focus ✅

fig-dual-pixel-af · on-sensor / dual-pixel AF: a photodiode split into L/R halves under one microlens; the L vs R images give per-pixel disparity → defocus; split-pixel cross-section + the two readout images ✅

`fig-depth-from-defocus` · depth from defocus: the blur-disk diameter grows with distance from the focus plane; two shots at different focus/aperture → estimate depth from the relative blur ✅

equations

focus error → sensor shift via $1/f=1/u+1/v$ (differentiate to get sensor move per diopter)

contrast metric (sum of squared gradients / high-pass energy) maximized

phase-detect: **disparity between the two pupil half-images is proportional to defocus** (it is stereo with a baseline = the pupil split — ties to FUNDAMENTALS $Z=fB/d$)

defocus blur radius $\propto$ distance from focus and aperture (from *Depth of field*).

▸9.6.1 Focus mechanics⧉

• focusing moves the **lens (or a focusing group, or the sensor)** to set the image distance $v$ so the subject's conjugate falls on the sensor; an object at infinity sits at $v=f$, nearer objects push $v$ back (FUNDAMENTALS). [`fig-focus-conjugate-move`]

• **manual focus aids**: ground-glass/split-prism, the **rangefinder** (a coincidence/disparity device — cross-ref FUNDAMENTALS stereo), and modern **focus peaking** / magnified live-view.

• **focus-by-wire**: in many AF lenses the focus ring is *not* mechanically linked — it sends an electronic signal to a motor (stepper / ultrasonic **USM** / linear voice-coil), enabling fast, quiet, programmable, and lens-data-aware focusing.

• **focus breathing**: as you refocus, a lens's **focal length / field of view subtly changes** — the image appears to **zoom slightly** as the focus pulls (because the focusing group's motion also shifts the effective $f$). An artifact in two regimes: in **video** it shows as a visible "breathing" during **focus pulls** (distracting in cinematography), and in **focus stacking / depth-from-defocus** it means consecutive frames **don't register** (different magnification → must be re-aligned, cross-ref *Depth of field* §focus stacking and part 08's focal-stack chapter, which calls it out). Modern **cine lenses** are designed to be breathing-free, and some cameras/software **correct** it digitally.

▸9.6.2 Contrast-detection AF⧉

• premise: a sharp image has **more high-frequency energy** than a blurred one. Maximize a **contrast metric** (sum of squared gradients / high-pass energy) over the focus position — gradient *ascent* on the sharpness hill. [`fig-contrast-af-curve`]

• strength: needs **no special hardware**, uses the imaging sensor itself, very accurate at the peak.

• weakness: the metric gives **no direction** — the camera must **hunt** (step, observe, maybe reverse), so it is slower and visibly racks back and forth; struggles in low contrast / low light.

▸9.6.3 Phase-detection AF (the split-pupil / stereo trick)⧉

• compare light from **two halves of the exit pupil**: in a dedicated AF sensor a **separator (secondary) lens** re-images the two pupil halves onto **two line sensors**. The **lateral offset** between the two patterns is a **signed** measurement of defocus — its **sign gives the direction**, its **magnitude gives how far** — so the lens can be driven to focus in **one move**, no hunting. [`fig-pdaf-principle`]

• this is literally **stereo inside the lens**: the baseline is the pupil split, and the offset is a disparity proportional to defocus (cf. FUNDAMENTALS $Z=fB/d$).

• classic **SLR** implementation: a partly-transparent main mirror + sub-mirror diverts light down to a dedicated PDAF module — fast, but a *separate* optical path needing calibration (front/back-focus).

• **sidebar — why the reflex mirror is disappearing** (synthesis of the author's 2018 *"DSLR will probably die"* essay, see [[Blog Index]]): the mirror's old benefits (optical TTL view, dedicated fast PDAF) shrink as EVFs improve and on-sensor PDAF closes the gap; its costs do not — added mechanism + per-body/lens calibration, larger body, the **retrofocus tax** on wide/fast lenses (cross-ref bestiary), and it **occludes the sensor while focusing** so no pre-shot scene analysis (face/eye); for video the mirror stays up, forfeiting all advantages. Reflex → mirrorless; AF migrates onto the main sensor.

▸9.6.4 On-sensor PDAF and dual-pixel AF⧉

• **masked / phase-detect pixels**: scatter pixels across the imaging sensor whose left or right half is masked, recovering the two pupil views *on the main sensor* — no separate module (enables mirrorless/live-view PDAF). [`fig-dual-pixel-af`]

• **dual-pixel AF**: split **every** photosite into **two half-photodiodes under one microlens** (reusing the FUNDAMENTALS microlens stack) — each pixel is its own phase-detect pair, giving dense, fast AF *and* a low-baseline disparity map (small depth signal usable for portrait-mode and refocus).

▸9.6.5 Depth from focus / defocus, and learned subject AF⧉

• **the two are different strategies**: **depth from focus** *sweeps* the focus through its range and picks, per pixel, the **setting that maximizes sharpness** — many shots, no blur model needed (the focus position *is* the depth). **Depth from defocus** instead estimates depth from the **amount of blur** in just **one or two** shots, given a **known blur (PSF) model** of how defocus scales with depth and aperture — far fewer frames, but it leans on the model.

• **depth from focus**: sweep focus, find for each pixel the position of **maximum sharpness** → depth (also the basis of **focus stacking**, *Depth of field* — the focal-stack sharpness map is this same computation, [Stanford focal-stack paper](https://graphics.stanford.edu/papers/focalstack/focalstack.pdf)).

• **depth from defocus**: from a few images at different focus/aperture, infer per-pixel **blur radius** → depth, with far fewer shots (Pentland 1987; Subbarao). The defocus PSF is the bridge to deconvolution (*Aberrations correction*). [`fig-depth-from-defocus`]

• **depth-from-defocus *as autofocus* — Panasonic DFD (Depth From Defocus)**: rather than build a depth map, use defocus to **drive focus**. With a stored model of how a given lens's blur (PSF) changes with focus, the camera compares **two frames at slightly different focus** and infers the **direction *and* amount** of defocus — a PDAF-like, near-single-step cue from an ordinary **contrast-detect** sensor with **no phase pixels**. It rides on CDAF (confirms at the sharpness peak) and is strong for video; its weakness is the reliance on a **per-lens blur profile** (and on the subject holding still between the two frames).

• **subject / eye AF (learned)**: a detector finds **faces, eyes, animals, vehicles**, and tracks them, then drives PDAF to the chosen point — AF's *decision* layer is now a recognition problem (cross-ref the learned-detection chapter). Other rangefinding aids historically/peripherally: active **IR / ultrasonic** (early Polaroid SONAR) and **LiDAR/ToF** assist in phones.

▸9.6.6 Focusing in astrophotography⧉

• **why autofocus fails**: stars are **dim point sources at infinity** with almost no contrast for CDAF and too little light for PDAF — so astro focus is essentially **manual** and demands *more* precision than daytime (a star is the ultimate test of focus; the circle of confusion must be tiny)

• **you can't just rack to infinity**: most lenses **focus past infinity** (for thermal expansion / AF overshoot), so the "∞" hard stop is *not* in focus — you must focus empirically each session, and **refocus as the temperature drops** (the barrel contracts)

• **the practical methods**: focus on a **bright star** with **magnified live view**; or use a **Bahtinov mask** — a grid placed over the aperture whose **diffraction spikes** form an X with a central spike that is centred *only at exact focus* (a precise, eyeballable null); software **FWHM / HFD** metrics on a star automate the same idea, and **motorized focusers with autofocus routines** (minimize star size vs focuser position — contrast-AF, done deliberately and slowly) are standard on deep-sky rigs

• cross-ref: the daytime AF families above (why they don't transfer); *Depth of field* (tiny CoC); FUNDAMENTALS Focus (the practical operator's view)

---

▸9.7 Depth of field⧉

`fig-dof-coc` · depth of field via the circle of confusion: the double cone behind the lens; points whose blur disk stays within the CoC limit c at the sensor are "acceptably sharp" → near/far DoF limits; CoC + hyperfocal H=f²/(N·c) labeled ✅

⬜ figure not yet created

pointer to FUNDAMENTALS for the full geometry)

`fig-cats-eye-vignetting` · cat's-eye bokeh: off-axis defocus disks clipped into lemon / cat's-eye shapes by optical (mechanical) vignetting toward the frame edge; round at centre → clipped at edge, tilted toward centre ✅

equations

**all imported** from FUNDAMENTALS (circle of confusion, hyperfocal $H=f^2/(Nc)$, near/far limits, DoF scaling). New here: aperture-shape ⇒ defocus-disk shape (the disk is the **image of the aperture stop**)

cat's-eye area loss from off-axis pupil clipping.

▸9.7.1 Recap: the geometry of focus (pointer, not re-derivation)⧉

• briefly restate the circle of confusion, the near/far limits, **hyperfocal** $H=f^2/(Nc)$, and the three knobs (smaller aperture, shorter $f$, farther focus all deepen DoF) — then **point to FUNDAMENTALS** for the full similar-triangles derivation and the framing-invariance result. [`fig-dof-coc`]

• restate the load-bearing caveat: **defocus is not a 2-D convolution** (it is depth- and occlusion-dependent) — the reason naive portrait-mode fakes fail and depth-aware/light-field methods are needed (forward-ref Advanced).

▸9.7.2 The bokeh look: shape and structure of the blur⧉

• the out-of-focus highlight is the **image of the aperture stop**, so its **shape = the aperture's shape**: a round iris → round disks, a few straight blades → **polygons** (pentagons, heptagons). [`fig-bokeh-shapes`]

• **disk structure** sets "good" vs. "bad" bokeh: a **bright-edged** disk (from over-corrected spherical aberration) looks **busy/nervous**; a **uniform or soft-edged** disk (under-corrected, or apodized — *Special optics*) looks **creamy**. The *whole* point of a portrait lens is the disk profile, not just resolution.

• **cat's-eye / lemon bokeh**: toward the frame edge the pupil is **clipped by the lens barrel and front/rear elements** (optical vignetting), so off-axis defocus disks are **cut to lemon shapes**, swirling toward the corners; stopping down restores round disks (and reduces corner vignetting). [`fig-cats-eye-vignetting`]

• ties to *Compound lenses* (pupil/vignetting) and *Special optics* (mirror-lens doughnuts, anamorphic ovals, apodizers).

▸9.7.3 Extending DoF: focus stacking⧉

• **focus stacking**: capture a **stack** at stepped focus distances, then composite the **sharpest pixels** from each into one **all-in-focus** image — the standard fix where DoF is hopeless (**macro**, product, landscape near-to-far) ([Stanford focal-stack paper](https://graphics.stanford.edu/papers/focalstack/focalstack.pdf)). [`fig-focus-stacking`]

• requires per-pixel **sharpness selection** + alignment (focus *breathing* changes magnification between shots → must register); a multi-image-fusion problem (cross-ref the fusion/stacking machinery).

• **the sharpness map is depth-from-defocus**: estimating, per pixel, *which slice of the stack is sharpest* is exactly **depth from focus/defocus** (cross-ref *Focus, autofocus*) — the focus position of maximum sharpness corresponds to that pixel's **depth**, so a focal stack yields an all-in-focus image **and** a depth map for free.

• contrast with *increasing* DoF by stopping down — which costs light and eventually **diffraction** softens everything (FUNDAMENTALS): stacking beats the diffraction wall.

▸9.7.4 Controlling and faking DoF⧉

• **shallow on purpose**: open up / step closer / longer lens to *magnify* the subject — recall DoF is set by **magnification**, not focal length per se (FUNDAMENTALS framing-invariance).

• **fake / computational DoF (portrait mode)**: estimate a **depth map** (dual-pixel disparity from *Focus*, stereo, or learned mono-depth) → apply **depth- and occlusion-aware** blur with a chosen disk shape. Honest about why it is hard (hair, edges, transparency) — full treatment forward-referenced to the Advanced computational-camera part; here we only set up *why* a 2-D blur is wrong and *what* a depth map buys.

• **reminder — defocus is NOT a single 2-D convolution**: the blur **kernel size varies with depth** (a far point blurs more than a near one), and **occlusion boundaries break shift-invariance** (a sharp foreground edge bleeds onto a blurred background, not symmetrically) — so no one uniform kernel applies. This is exactly why portrait-mode background blur needs a **depth map** rather than a flat Gaussian (cross-ref the *depth-of-field geometry* recap above and FUNDAMENTALS [[Manuscript/Draft — 02-02 Image formation and linear perspective|Image formation §Depth of field]]).

• **sidebar — coherent imaging (Barbastathis): out-of-depth is *black*, not blurry**: everything above is **incoherent** imaging, where an out-of-focus object still contributes light (a blur disk). George Barbastathis (MIT) notes that **coherent** imaging built around a *prepared* depth can do something qualitatively different — off-depth objects contribute **no light at all**, so they come out **black, not blurry** (incoherent imaging adds positive *intensities*, so every depth contributes; coherent imaging adds *complex amplitudes*, which can interfere destructively and cancel). Reframes DoF as a hard **on/off depth gate** rather than a graceful blur.

---

▸9.8 Fake (synthetic) depth of field⧉

---

▸9.9 Glare suppression⧉

`fig-dr-clip-vs-noise` · one exposure's two walls — clipping at full-well above, the noise floor below — shading the narrow band of scene contrast that fits between them 🟨

`fig-ar-coating-interference` · anti-reflection coating: a quarter-wave film cancels the front-surface reflection by destructive interference (top-surface reflection vs film–glass reflection, extra path λ/2 → out of phase); two emergent wavelets summing to ~0 + design rules $n_c=\sqrt{n_1 n_2}$, $d=\lambda/4n_c$ ✅

`fig-lens-hood-baffles` · lens hood + internal blackened baffles: an off-axis source (sun outside the frame) is blocked by the hood; stray light that sneaks in is trapped/absorbed by baffles; wanted image-forming rays (within the angle of view) pass through to the sensor ✅

`fig-veiling-glare-psf` · a PSF with a bright core + a broad low-level skirt (veiling glare) on a log-radial profile; second panel shows the integrated veil lifting blacks / reducing contrast ✅

equations

Fresnel reflectance at an interface $R=\left(\frac{n_1-n_2}{n_1+n_2}\right)^2$ (~4%/surface for air–glass → why an uncoated multi-element lens loses huge contrast)

**quarter-wave AR condition** — coating index $n_c=\sqrt{n_1 n_2}$ and thickness $\lambda/4$ for destructive interference

veiling glare modeled as the image $*$ a **broad glare PSF** → deconvolution.

▸9.9.1 Where stray light comes from: flare, ghosting, veiling glare⧉

• at **every** air–glass surface a few percent of light **reflects** (Fresnel, ~4% uncoated); a stray ray can reflect **twice** and reach the sensor as a **ghost** — a discrete, often aperture-shaped bright blob, usually mirrored across center from a strong source. [`fig-lens-flare-ghosting`]

• many weak scatters and reflections sum into **veiling glare** — a low-frequency **wash that lifts the blacks** and crushes contrast (worst with a bright source just outside the frame, or shooting into the sun/backlight).

• **why compound lenses make it worse**: $N$ surfaces give $O(N^2)$ inter-reflection pairs — a 15-element lens has dozens of surfaces, so without coatings it would be hopeless (cross-ref *Compound lenses* element count, and the T-stop transmittance loss).

• **flare as an aesthetic (cinematography)**: lens flare was long considered a **flaw / a no-no** in cinematography (the whole point of coatings and hoods is to *kill* it), but **modern films deliberately court it** for a dramatic / realistic, "you're really there" effect — **J.J. Abrams' anamorphic streaks** are the cliché. It also **conveys very high contrast / the presence of a bright source**: flare reads as "that light is *intense*," a cue the eye trusts, so a little flare can sell a bright window or the sun better than a clean frame. (An aesthetic now also **faked computationally** — cross-ref anamorphic flare, *Special optics*.)

• **not the same as a sunstar**: lens **flare/ghosts** are **inter-surface reflections**; the **star / sunstar** pattern radiating from a bright point is a **diffraction spike** — diffraction at the **aperture blades** (the FT of the blade polygon, cross-ref *Aberrations*/wave effects), present even in a single perfectly-coated element. Different physics, often confused.

▸9.9.2 Hardware suppression⧉

• **anti-reflection coatings**: a thin transparent film of index $n_c=\sqrt{n_1 n_2}$ and thickness **λ/4** makes the reflections off its two faces **interfere destructively**, cancelling the surface reflection over a band; **multi-layer** stacks broaden it to the whole visible range (the colored sheen of modern coatings). This dropped per-surface reflection from ~4% toward ~0.1–0.5%, which is what *made* many-element lenses practical. **Cross-ref the FUNDAMENTALS lens-coating history.** [`fig-ar-coating-interference`]

• **sidebar — the coating, Zeiss, and the spoils of war**: Smakula's 1935 Zeiss-Jena patent (DE 685767), classified as a military secret (the red "T"/Transparenz mark, ancestor of T\*); independently re-invented in the open by Strong 1936 / Blodgett 1938; then the 1945 Allied seizure of German patents (FIAT microfilming, US Alien Property vesting, 1946 London Agreement) and the Jena↔Oberkochen split of Zeiss scattered it. Lesson: it spread because it was discovered in the open more than once, not because one patent fell.

• **everyday encounters with the same trick**: these are the **same coatings on eyeglasses** (anti-reflective lenses — fewer ghost reflections, less glare for the wearer and for people looking at you) and the **anti-glare / AR coatings on displays and picture-frame glass** (museum / "non-reflective" glass) — the quarter-wave **interference** trick is all around you, not just inside camera lenses.

• **lens hood**: blocks off-axis light (a bright source *outside* the frame) from striking the front element at all — the cheapest, most effective anti-flare tool. [`fig-lens-hood-baffles`]

• **internal baffles, blackened edges, light traps, and anti-reflective barrel paint**: stop grazing rays from skipping down the inside of the barrel; critical in **(space) telescopes**, where stray sunlight must be baffled to image faint targets (cross-ref *Special optics* telescopes). [`fig-lens-hood-baffles`]

▸9.9.3 Computational deflare and glare deconvolution⧉

• model veiling glare as the true image **convolved with a broad glare PSF** (plus discrete ghosts); **measure** that glare spread function and **deconvolve / subtract** it to **recover contrast** — same PSF-deconvolution toolbox as *Aberrations correction*, applied to stray light. [`fig-veiling-glare-psf`]

• **dehaze-style** contrast recovery (cross-ref [[Slopendium/08 Single-image computational photography|dehazing]]) and **lens-specific flare/ghost removal** (learned, since ghost patterns are lens-specific). Important for **HDR** capture, where veiling glare from bright sources otherwise corrupts the dark-end measurements (Talvala 2007). Forward-ref Advanced for learned deflare.

---

▸9.10 Optical stabilization⧉

`fig-handshake-motion-blur` · hand-shake motion blur: a small angular shake $\Delta\theta$ over the exposure smears a point into a streak of length $\approx f\,\Delta\theta$ on the sensor (lens as pivot, focal length $f$); the $t<1/f$ shutter rule of thumb ✅

`fig-gyro-feedback-loop` · the stabilization control loop as a block diagram: gyro/IMU senses angular rate → controller → actuator moves the lens group/sensor → position sensor feeds residual error back (sense→control→actuate→measure, ~kHz) ✅

`fig-digital-vs-optical-stab` · optical (in lens/sensor, before capture, full resolution & FOV) vs digital/electronic IS (crop-and-warp in software after capture): the EIS safety-margin crop + warp costing resolution/field of view vs the optical in-lens/sensor correction ✅

equations

image-plane shift for a camera rotation $\Delta\theta$ is $\approx f\,\Delta\theta$ (so **longer $f$ ⇒ more blur ⇒ more correction needed**)

stabilization benefit quoted in **stops** = $\log_2$ of the shutter-time ratio it allows

the actuator must move the element/sensor to make the **net image displacement ≈ 0** over the exposure.

▸9.10.1 The problem: hand-shake and the blur budget⧉

• hand tremor **rotates** the camera by small angles during the exposure; the image shifts by $\approx f\,\Delta\theta$, so the **smear grows with both shutter time and focal length** (the old "1/focal-length" minimum-shutter rule). [`fig-handshake-motion-blur`]

• restate the **blur budget** from FUNDAMENTALS: you want a short shutter to freeze shake, but that costs light → wider aperture (less DoF) or higher ISO (more noise). Stabilization **relaxes** that budget by letting you use a *longer* shutter without smear.

• **key scope limit (state up front)**: stabilization fights **camera** motion only. A **moving subject** still blurs — no amount of IS helps; that is **motion deblurring** (forward-ref Advanced/Motion).

▸9.10.2 Optical stabilization: lens-shift vs. sensor-shift (IBIS)⧉

• **sensing**: **MEMS gyroscopes** (and accelerometers) measure the camera's angular velocity in real time; a controller computes the counter-motion and drives an actuator in a **feedback loop** to null the image displacement. [`fig-gyro-feedback-loop`]

• **lens-shift OIS**: a **floating lens group** moves **laterally** (voice-coil actuators) to **steer the image back** onto its original spot. Pros: optimized per lens, the viewfinder/AF sees the stabilized image; corrects shift well. (Canon IS / Nikon VR.) [`fig-ois-vs-ibis`]

• **sensor-shift / IBIS**: the **sensor** rides a magnetic stage and **translates (and rotates/roll)** to follow the image. Pros: works with **any** lens (even un-stabilized/adapted), can correct **roll** (a 5th axis) and pitch/yaw translation; combines with lens OIS ("dual IS"). [`fig-ois-vs-ibis`]

• **how many stops**: modern systems buy **~3–7 stops** of slower shutter; the limit is residual low-frequency drift, actuator travel, and (for IBIS) keeping the moved sensor inside the lens's image circle.

▸9.10.3 Digital / electronic stabilization and the computational alternatives⧉

• **electronic / digital stabilization**: **crop** a margin and **warp each frame** to cancel inter-frame motion (video) — no moving parts, but **costs resolution/field of view** and works *between* frames, not *within* a single exposure (so it doesn't reduce per-frame motion blur the way optical IS does). Phones combine gyro data + optical flow for very stable video. [`fig-digital-vs-optical-stab`]

• **rolling-shutter** interaction: on CMOS, fast motion + electronic stabilization must also undo **rolling-shutter** skew (cross-ref Motion/Video).

• **the computational-photography flip**: instead of one long stabilized exposure, take a **burst of short, blur-free frames** and **align + merge** them — short shutters freeze shake, alignment removes the offset, averaging recovers SNR ($1/\sqrt{N}$). This **handheld burst** approach (cross-ref multi-frame denoising / super-res) is how phones get sharp low-light shots without bulky IS — the software counterpart to optical stabilization (forward-ref Advanced).

▸9.11 The eye as an optical instrument: vision and its correction⧉

`fig-eye-as-camera` · the eye as a camera: cornea + crystalline lens = objective, iris/pupil = aperture, retina = sensor (fovea), accommodation = autofocus — labelled cross-section ✅

`fig-refractive-errors` · the four refractive errors — myopia / hyperopia / astigmatism / presbyopia — where the image focuses vs the retina and the correcting (negative / positive / cylindrical / add) lens ✅

⬜ figure not yet created

bifocal vs progressive vs AF-glasses focal zones [schem] fig-bifocal-progressive-af

▸9.11.1 The eye as a camera⧉

• **cornea + crystalline lens** form the objective (most of the power is the **cornea**); the **iris/pupil** is the aperture (~f/2.1 dilated to ~f/8 constricted); the **retina** is the sensor (the **fovea** is the sharp centre); **accommodation** — the ciliary muscle reshaping the soft lens — is the eye's **autofocus** (cross-ref *Focus*)

• like any lens the eye has **aberrations** and a diffraction-limited **PSF**; visual acuity ("20/20") is its resolution spec — the optical chain of this whole part, in wetware

▸9.11.2 Refractive errors: the eye out of focus⧉

• **myopia (near-sighted)**: eye too long / too powerful → image focuses **in front of** the retina; distant objects blur → corrected with a **negative (diverging)** lens. (Now near-epidemic, linked to near-work and lack of outdoor light.)

• **hyperopia (far-sighted)**: eye too short → focus falls **behind** the retina → corrected with a **positive (converging)** lens

• **astigmatism**: the cornea/lens is not rotationally symmetric (**toric**) → different power in different meridians, orientation-dependent blur → corrected with a **cylindrical** lens — literally this part's *astigmatism*, in the eye

• **presbyopia**: with age the lens stiffens and **accommodation fails** → can no longer focus up close (reading distance); near-universal after ~45 — the driver of the next section

• aside — **the pinhole effect**: squinting (or **pinhole glasses**) sharpens by stopping the pupil down (more DoF), at the cost of light (FUNDAMENTALS — DoF)

▸9.11.3 Correcting vision⧉

• **spectacles** (the original computation-free fix), **contact lenses** (sit on the cornea → wide field, no prism/distortion), **intraocular lenses (IOLs)** (replace the lens, e.g. cataract surgery), and **refractive surgery** (**LASIK / PRK** reshape the cornea — surgically editing the prescription)

• eyewear is **applied lens design**: anti-reflection **coatings**, **high-index** glass for thinner lenses, **aspheric** surfaces to cut aberration, **photochromic** (auto-darkening) and **polarized** sunglasses (cut glare — the same physics as the camera polarizer)

• **the pinhole trick — a smaller aperture sharpens any eye.** Squinting, looking through a pinhole, or **pinhole (stenopeic) glasses** improve vision for *any* refractive error with no prescription: shrinking the entrance aperture keeps only the near-axial rays, so the retinal **blur circle collapses** — the eye's version of *stopping a lens down for more depth of focus*. The catches are a camera's: the image gets **dim**, and a tiny enough hole is **diffraction-limited**.

• **conversely, a wide-open pupil makes things worse.** In dim light (or under dilation) the pupil opens to admit marginal rays — and exactly as a wide camera aperture gives a **shallow depth of focus**, it also exposes the eye's own **aberrations** (spherical aberration, "night myopia," halos/starbursts around lights). Night acuity drops for the same reason a lens is softest wide open and sharpest stopped down; stopping the eye down (bright light, or a pinhole) fixes both the defocus *and* the aberrations at once.

• **pinhole as sunglasses.** Because it also blocks most of the light, a pinhole or narrow slit doubles as a glare-cutting **dark aperture** — sharpening *and* dimming together. Inuit **snow goggles** (a slit cut in bone/antler/wood) are exactly this, used for millennia against snow glare.

• **side bar — a short history of correcting human vision.** Magnifying **reading stones** (plano-convex glass laid on text) appear by the 11th century; the first wearable **spectacles** in Italy around **1286** (convex lenses for presbyopia and hyperopia), with concave lenses for myopia following in the 15th–16th centuries. **Bifocals** are credited to Benjamin **Franklin** (1784). Practical **contact lenses** date to the late 19th century (glass scleral lenses, Fick 1888), turning soft/hydrogel in the 1960s–70s. **Intraocular lenses** begin with Harold **Ridley** (1949). **Refractive surgery** — radial keratotomy, then **LASIK / PRK** — reshapes the cornea from the 1980s–90s. The thread: each step moves the correction **closer to the eye** — on the nose, on the cornea, inside the eye, then carved into the cornea itself.

▸9.11.4 Presbyopia and the bifocal problem: from bifocals to AF glasses⧉

• the problem: presbyopia needs **different power for near vs far**, but a simple lens has **one** focal length

• **bifocals / trifocals** (Franklin): two or three zones with a visible line (near segment low) — abrupt jump and an "image-jump" artifact

• **progressive (varifocal) lenses**: a **continuous gradient** of power top→bottom, no line — at the cost of **peripheral distortion** and a narrow usable corridor

• **adjustable / autofocus (AF) glasses** — the *active* alternative: **focus-tunable lenses** whose power changes on demand, so one pair spans the whole range:

• **fluidic / Alvarez lenses** — manually pumped or slid to set power (adjustable-focus eyewear for the developing world — Josh Silver)

• **liquid-crystal / electronic lenses** — a voltage changes the LC retardance → an electrically **switchable or tunable** add-power (a programmable **phase** element — the same family as the **LC-SLMs** of the Advanced part); paired with an **eye-tracker / rangefinder** they *autofocus* to wherever you look (Deep Optics / "32°N")

• the throughline: **AF glasses are accommodation re-implemented in hardware** — the eye's lost autofocus restored by a tunable lens, the wearable cousin of camera autofocus and of the phase modulators in the Advanced part

▸Part 10 MULTIPLE EXPOSURE IMAGING⧉

`fig-mei-axes` · one scene captured along six axes (time/exposure/viewpoint/focus/wavelength/illumination) — many captures → one better image; the L14 spine 🟨

💡 **Big lesson (L14 · Capture the full set, decide later):** the unifying move of the whole part. Instead of forcing one capture to make every tradeoff at once — exposure, focus, viewpoint, even which moment — **take many captures that each commit differently, and resolve the tradeoff per-pixel in software afterward.** Averaging defers the noise/exposure tradeoff; HDR defers "expose for highlights or shadows?"; focal stacks defer "focus near or far?"; panoramas defer "which field of view?"; hyperspectral defers "which three colors?"; time-lapse defers "which illumination?". Capture is cheap and parallel; the decision is a downstream computation. (Registered in [[Big Lessons]] as **L14**; recurs across every chapter of this part and forward in [[Advanced computational photography]] — light fields are the same idea for *viewpoint/focus*.)

**A very old idea.** Long before digital, photographers combined multiple captures by hand: Gustave Le Gray's 1850s seascapes sandwiched a separately-exposed sky negative with a sea negative to beat the medium's dynamic range; others physically cut and pasted prints to widen the field of view (e.g. W. H. Fox Talbot's printing establishment at Reading, c. 1845). The arithmetic is now done in software and per-pixel, but the instinct — *one frame isn't enough, so take several and combine* — is the same. (Historical figures: Le Gray combination prints; Talbot establishment — see `fig-le-gray-historical`.)

**Roadmap.** The part walks the axes of "more than one capture." **[[#Denoising by averaging]]** adds *time* to beat noise ($1/\sqrt N$), with deep-sky astrophotography as the extreme. **[[#Image alignment]]** is the glue every later chapter needs — you must register before you combine. **[[#HDR merging]]** adds *exposure* to beat dynamic range, and **[[#Application to cell phones: HDR+ and burst imaging]]** shows the burst form that ships in every phone. **[[#Manual panorama stitching from multiple views]]** and **[[#Automatic panorama stitching from multiple views and feature matching]]** add *viewpoint* — the homography that needs no depth, found automatically via features + RANSAC. **[[#Blending]]** hides the seams, and **[[#Bells and whistles]]** plus **[[#Continuous panoramas (e.g. on cell phones)]]** handle projections, drift, motion, and the phone sweep. **[[#Focal stacks and depth of field extension]]** adds *focus*, **[[#Hyperspectral imaging, color wheels]]** adds *wavelength*, and **[[#Intrinsic images with time lapse]]** adds *illumination* — all the same stack-and-decide idea on a new axis.

equations

averaging $\operatorname{Var}[\bar z] = \sigma^2/N \Rightarrow \text{std} \propto 1/\sqrt N$

HDR radiance $\hat L = \frac{\sum_i w(z_i)\,z_i/t_i}{\sum_i w(z_i)}$

homography $\mathbf x' \simeq H\mathbf x$, pure rotation $H = K R K^{-1}$

RANSAC $N = \log(1-p)/\log(1-w^s)$

all-in-focus $k^*(p) = \arg\max_k s_k(p)$

▸10.1 Denoising by averaging⧉

⬜ figure not yet created

`fig-averaging-N-frames` (one noisy frame → mean of 3 → mean of 5 → mean of 45: the grey patch's histogram visibly narrows, after the Canon-1D ISO-3200 slide sequence) fig-noise-histogram

⬜ figure not yet created

doubling cleanliness costs 4× the frames)

⬜ figure not yet created

`fig-iid-variance-adds` (schematic of the derivation: $N$ independent noise samples, variance adds then $\div N^2$ → $\sigma^2/N$) fig-iid-variance-adds

`fig-mean-vs-median-stack` · plain-mean stack (residual ghost streak from one outlier frame) vs median / sigma-clip stack (streak rejected) of the same frames, with the per-pixel sample distribution inset ✅

`fig-calibration-triad` · a light frame's bias / dark / flat components and the calibration formula $X_\text{cal}=(X_\text{light}-X_\text{dark})/(X_\text{flat}-X_\text{bias})$ that subtracts the additive terms and divides out the multiplicative one 🟨

`fig-clipping-bias` · how clamping noise at zero biases a dark region bright under averaging, and how a pre-read positive offset restores zero-mean behaviour 🟨

⬜ figure not yet created

`fig-astro-stack-reveal` (a single 30-s sub of a faint galaxy — almost noise — vs 200 subs stacked: structure emerges) fig-astro-stack-reveal

⬜ figure not yet created

`fig-sub-length-tradeoff` (read-noise floor vs total integration time: fewer-longer beats more-shorter once read noise is sub-dominant) fig-sub-length-tradeoff

💡 **Big lesson (the $1/\sqrt N$ law — averaging $N$ independent measurements halves the noise every 4× frames):** model each pixel as a true value plus **independent, zero-mean** noise. Averaging $N$ frames leaves the true value alone (its average is the true value) but the **noise variance falls as $1/N$**, so the **standard deviation — the visible error — falls as $1/\sqrt N$**. This is the part's recurring quantitative result: it is the same law behind HDR's noise-weighted merge, burst low-light (HDR+ / Night Sight), burst super-resolution, and astrophotography's hours of integration. The diminishing-returns sting is built in — *each extra stop of cleanliness costs four times the frames*. (It ties to **L15** — averaging is the algorithmic way to lower the effective **noise floor**, the complement to widening the well; the two together set how much real signal you can pull from shadow. Registered in [[Big Lessons]] alongside L15; first developed here.)

💡 **Big lesson (L14 · capture the full set, decide later):** rather than commit at the instant of exposure, **record the whole set and choose afterward**. Averaging is the first instance in this part: don't take *one* careful frame and hope — take a **burst** and combine. The same move recurs as exposure bracketing (capture all exposures → [[#HDR merging]]), focus stacks (capture all focal planes), and panoramas (capture all directions). The cost is data and a reconstruction step; the payoff is deferring an irreversible decision out of the moment of capture. (Registered in [[Big Lessons]] as **L14**; this part is its natural home.)

The previous part measured and characterised noise. This part *fights* it — and the bluntest, most trustworthy weapon is to take the **same picture many times and average**. No prior, no model of what images look like: just repeated measurement and the law of large numbers. Where single-image denoising ([[Bilateral filtering]]) must *guess* which neighbours to trust, averaging multiple frames adds **genuine independent information** with each shot. The whole chapter is one derivation and its consequences.

equations

noise model per pixel per frame $X_i = \mu + n_i$, with $\mathbb{E}[n_i]=0$ (zero-mean) and $\operatorname{Var}[n_i]=\sigma^2$, the $n_i$ **independent** across frames

the average $\bar X = \tfrac1N\sum_{i=1}^N X_i$

**unbiased** in the signal $\mathbb{E}[\bar X]=\mu$

the two variance rules $\operatorname{Var}[kX]=k^2\operatorname{Var}[X]$ and (independence) $\operatorname{Var}\big[\sum_i X_i\big]=\sum_i\operatorname{Var}[X_i]$

hence $\operatorname{Var}[\bar X]=\tfrac{1}{N^2}\cdot N\sigma^2=\dfrac{\sigma^2}{N}$, so $\boxed{\sigma_{\bar X}=\dfrac{\sigma}{\sqrt N}}$

sample-variance estimator with **Bessel correction** $\hat\sigma^2=\tfrac{1}{N-1}\sum_i (X_i-\bar X)^2$ (divide by $N{-}1$, not $N$ — removes the downward bias from reusing the same samples for mean and variance)

SNR $=\mu/\sigma$, in dB $20\log_{10}(\mu/\sigma)$, improving by $10\log_{10}N$ ($\approx 3$ dB per doubling)

calibration $\

X_\text{cal}=\dfrac{X_\text{light}-X_\text{dark}}{X_\text{flat}-X_\text{bias}}$ (subtract additive fixed-pattern, divide out multiplicative).

▸10.1.1 Why averaging works — the $1/\sqrt N$ derivation⧉

• **the observation**, before any math: point a tripod-mounted camera at a static, evenly-lit grey patch and shoot a burst. Each pixel *should* read one constant value; instead it **fluctuates** frame to frame, and across the patch (it should be flat) it's grainy. Two faces of the same thing — noise as **fluctuation over time** and **fluctuation over space when the answer should be constant**. [`fig-averaging-N-frames`]

• **the naive algorithm** is a one-liner: sum the frames, divide by $N$.

```

mean = zeros_like(imSeq[0])

for im in imSeq: mean += im

mean /= len(imSeq)

```

Visibly, 1 → 3 → 5 → 45 frames: the grey patch's histogram tightens, dark areas clean up. *Why* — and *how fast*? [`fig-averaging-N-frames`]

• **the model** (the load-bearing assumption): each pixel's value in frame $i$ is the **true signal plus noise**, $X_i = \mu + n_i$, where the noise is **zero-mean** ($\mathbb{E}[n_i]=0$ — averaging is only honest if the noise has no DC offset) and **independent** across frames (frame $i$'s noise tells you nothing about frame $j$'s — true for shot & read noise, *false* for fixed-pattern noise; flag this now, fix it in *Calibration*). Each has variance $\operatorname{Var}[n_i]=\sigma^2$.

• **the two tools** (state them as the reusable algebra): for a random variable, scaling squares the variance, $\operatorname{Var}[kX]=k^2\operatorname{Var}[X]$; and for **independent** variables, **variance of the sum is the sum of variances**, $\operatorname{Var}[\sum_i X_i]=\sum_i\operatorname{Var}[X_i]$ (independence is what kills the cross terms / covariance). [`fig-iid-variance-adds`]

• **the derivation** (two lines): the average is $\bar X=\tfrac1N\sum_i X_i$. Its **mean is the true value** — $\mathbb{E}[\bar X]=\tfrac1N\sum_i\mathbb{E}[X_i]=\mu$ — so averaging is **unbiased**: it doesn't distort the signal. Its **variance**: pull out the $\tfrac1N$ (squares, by the first rule), then add (by independence):

$$\operatorname{Var}[\bar X]=\frac{1}{N^2}\operatorname{Var}\Big[\sum_i X_i\Big]=\frac{1}{N^2}\cdot N\sigma^2=\frac{\sigma^2}{N}.$$

Take the square root for the human-readable number: $\sigma_{\bar X}=\sigma/\sqrt N$.

• 💡 **the punchline — the $1/\sqrt N$ law** (see the big-lesson box above): **variance drops as $1/N$, std as $1/\sqrt N$.** This is the chapter's whole quantitative content. Its sting is the diminishing return: $4\times$ the frames for $1\times$ less noise (one stop); 100 frames buys a $10\times$ cleaner image, not 100×. In SNR terms, $+10\log_{10}N$ dB — about **3 dB per doubling**. [`fig-sqrtN-curve`]

• **estimating the noise itself** (aside, for the analysis pipeline): from the same stack you can *measure* $\sigma^2$ per pixel — but divide by $N-1$, not $N$ (**Bessel's correction**), because you used the same samples to estimate the mean, which correlates the residuals and otherwise *under*-estimates the variance. The coin-flip-with-$N{=}2$ example makes this concrete: the $N$ estimator averages to half the true variance; the $N{-}1$ estimator is unbiased.

• **what it confirms about the sensor** (callback to the noise part): map the measured $\sigma$ across a real scene and it tracks the noise sources — more noise (numerically) in **bright** areas (shot noise $\propto\sqrt{\#\text{photons}}$), yet **better SNR** there; the log-SNR image "looks like the photo." Averaging lowers the floor uniformly; it does not change *which* source dominates where.

▸10.1.2 When the plain mean is wrong: robust combination⧉

• **the failure of the mean**: the derivation assumed *clean* IID noise. A real stack has **outliers** — a satellite or plane streaks one frame, a cosmic ray spikes a pixel, someone walks through. The mean is **not robust**: a single wild value drags the average, leaving a faint **ghost trail** exactly where the outlier was. [`fig-mean-vs-median-stack`]

• **median**: take the per-pixel **median** across the stack instead of the mean. One streaked frame is just an extreme value the median ignores → the streak vanishes. Cost: the median is **noisier than the mean** for clean Gaussian data (its variance is $\approx\!1.57\times$ larger, so you lose roughly a third of your $\sqrt N$ benefit) — robustness isn't free.

• **sigma-clipping** (the practical compromise, the astro/stacking default): iterate — compute per-pixel mean and std, **reject** samples more than $k\sigma$ (say $2$–$3\sigma$) from the mean, recompute on the survivors. You get the mean's **efficiency** on the inliers *and* the median's **rejection** of outliers. This is what RegiStax / DeepSkyStacker do under "kappa-sigma" or "winsorized" stacking.

• **the design choice to surface**: this is the same *reject-or-average* decision as edge-preserving filtering (**L4** — use a difference to decide who belongs together), now applied **across frames** instead of across space: a sample too far from its peers is treated as not-belonging. Plain mean = trust everyone; median/sigma-clip = trust the consensus.

• **forward-ref**: the same robust-combine logic reappears as **de-ghosting** in HDR merging and panoramas (reject pixels that disagree because something *moved*), and as the **robust merge** in burst SR ([[Super-resolution and image priors]]).

▸10.1.3 Calibration frames: what averaging can't fix⧉

• **the limit of averaging** (callback to the model): the $1/\sqrt N$ law only kills **zero-mean, independent** noise. Two things break that assumption and survive any amount of averaging — **fixed-pattern** offsets (the *same* error every frame: hot pixels, amp glow, per-pixel dark current, vignetting, dust shadows) and **clipping** (a non-linearity that biases the mean). These need to be *removed*, not averaged. [`fig-calibration-triad`]

• **the calibration triad** — each frame is the inverse of one degradation:

• **bias / offset**: shoot the shortest possible exposure with the lens capped → captures the read-out **electronic offset** (and the deliberate offset many cameras add so noise stays *zero-mean* — see clipping below). **Subtract** it.

• **darks**: shoot capped frames at the **same exposure time and temperature** as your lights → captures **dark current** (thermal charge that accumulates per second) and **hot pixels**. **Subtract** them. *This is why astro cameras are **cooled and temperature-regulated*** — dark current roughly doubles every 6–8 °C, and a *repeatable* temperature makes the dark frame actually match the light frame (callback to FUNDAMENTALS — cooled cameras, **L15** lower floor).

• **flats**: shoot an evenly-lit blank field → captures the **multiplicative** response variation — **vignetting** (corners darker) and **dust** shadows. **Divide** by it (normalised).

• in one expression: $X_\text{cal}=\dfrac{X_\text{light}-X_\text{dark}}{X_\text{flat}-X_\text{bias}}$ — subtract the **additive** fixed-pattern terms, divide out the **multiplicative** ones. (And you average *many* of each calibration frame too, so they don't re-inject noise — $1/\sqrt N$ all the way down.)

• **the clipping bias** (a subtle one worth a figure): the sensor **clamps at 0 and at max**, so in deep shadows the noise is *not* zero-mean — the below-zero excursions are cut off, leaving only the positive ones. Averaging then pulls a true-black region **too bright**. The fix is a **deliberate positive offset** (Canon's trick) added before read-out so the full noise distribution stays above 0 and *stays zero-mean*; subtract it back in calibration. [`fig-clipping-bias`]

• **dithering** (the trick that *converts* fixed-pattern into averageable noise): deliberately **nudge the framing** a pixel or two between frames. After re-alignment, any residual fixed-pattern error lands on a *different* part of the sky each frame, so it becomes effectively random across the registered stack — and *then* averaging/clipping can remove it. (This is also why burst SR's hand-tremor is a feature, not a bug — [[Super-resolution and image priors]].)

▸10.1.4 Deep-sky astrophotography: averaging at the extreme⧉

• the purest use of "average to denoise": a faint nebula/galaxy buried in noise and light pollution → shoot **many long "subs"**, **align** them (the sky rotates — cross-ref [[#Image alignment]] below), and **average** → signal stays while noise falls as **1/√N**, pulling structure out of near-nothing. This is the $1/\sqrt N$ law pushed to its physical limit: hours of total integration (hundreds of subs) to lift a galaxy that is *invisible* in any single frame. [`fig-astro-stack-reveal`]

• **calibration frames** matter as much as the lights: subtract **darks** (dark current + hot pixels — why cooled astro cameras regulate temperature), divide by **flats** (vignetting + dust), remove **bias/offset** — the fixed-pattern terms of the noise chapter, taken out by calibration (see *Calibration frames* above — astro is where the triad is non-negotiable).

• **robust combine, not a plain mean**: per-pixel **sigma-clipping / median** across the stack rejects satellites, planes, and cosmic rays; **dithering** (nudging the framing between subs) turns fixed-pattern residue into something averaging *can* remove (see *robust combination* and *dithering* above). [`fig-mean-vs-median-stack`]

• the budget: read noise per sub favors **fewer, longer** subs; saturation, tracking error, and clipping favor **more, shorter** ones — the same exposure trade-off, astro edition (cross-ref [[Noise, signal-to-noise ratio and dynamic range]], ETTR). Quantitatively: once each sub's signal is well **above the read-noise floor**, lengthening subs no longer helps the $\sqrt N$ math — but it reduces the *number* of read-noise events, so when read noise is significant (light-polluted or narrowband), **fewer-longer** wins; when tracking is shaky or stars saturate, **more-shorter** wins. [`fig-sub-length-tradeoff`]

• cross-ref: **cooled cameras** (FUNDAMENTALS *Cameras beyond photography*); **lucky imaging** for planetary (Big data — pick the *sharpest* frames rather than averaging all, when atmospheric seeing dominates); the **astro hub** (Adjacent fields). Note the contrast with planetary lucky imaging: deep-sky *averages everything* to beat read/shot noise; planetary *selects the best few percent* to beat atmospheric blur — same burst, opposite combine rule.

▸10.2 Image alignment⧉

`fig-align-before-average` · averaging an unregistered hand-held burst (a blurry double) vs averaging the same burst after alignment (sharp, clean, $1/\sqrt N$ benefit visible) ✅

fig-ssd-search-surface · the SSD score over candidate $(dx,dy)$ shifts drawn as a single-welled basin whose minimum marks the true alignment 🟨

⬜ figure not yet created

`fig-shifted-copy-debug` (an image aligned to a copy of itself shifted by +2 px → the score map's minimum must sit exactly at $(+2,0)$: the hand-checkable test) fig-shifted-copy-debug

`fig-phase-correlation-spike` · two shifted frames producing, via the inverse FFT of the normalized cross-power spectrum, a single sharp delta whose position is the inter-frame shift 🟨

⬜ figure not yet created

**L5** in one picture)

`fig-coarse-to-fine-pyramid` · alignment estimated cheaply at the coarsest pyramid level then propagated and refined down to full resolution, capturing large motions while avoiding local minima 🟨

Every method in this part — averaging just now, HDR and panorama and focal stacks to come — assumes that *corresponding scene points already sit on the same pixel across frames*. They don't: hands tremble, the Earth rotates, even a "locked" tripod drifts. **Alignment is the unglamorous precondition** of all of it, and it is unforgiving — averaging two frames that disagree by a pixel doesn't clean the image, it **blurs** it into a double exposure. This chapter is the shared registration toolbox, and because the same machinery — find the motion between frames — is exactly what **video stabilization and optical flow** need, it hands off naturally to the video part.

equations

alignment as $\hat{T}=\arg\min_T \sum_{x}\big(I_1(x)-I_2(T x)\big)^2$ (find the transform that makes one image match the other)

**SSD** over a shift $\mathrm{SSD}(dx,dy)=\sum_{x,y}\big(I_1(x,y)-I_2(x{+}dx,y{+}dy)\big)^2$

**NCC** (gain/offset-invariant) $\mathrm{NCC}=\dfrac{\sum (I_1-\bar I_1)(I_2-\bar I_2)}{\sqrt{\sum(I_1-\bar I_1)^2}\sqrt{\sum(I_2-\bar I_2)^2}}$

the **shift theorem** $I_2(x)=I_1(x-d)\

\Leftrightarrow\

\hat I_2(\omega)=\hat I_1(\omega)\,e^{-i\,\omega\cdot d}$ (a translation is a pure phase ramp — **L5**)

**phase correlation** — the normalised cross-power spectrum $R(\omega)=\dfrac{\hat I_1\,\overline{\hat I_2}}{|\hat I_1\,\overline{\hat I_2}|}=e^{i\,\omega\cdot d}$, whose inverse transform $\mathcal F^{-1}\{R\}$ is a **delta at the shift $d$**.

▸10.2.1 Why you must align first⧉

• **the precondition**, made vivid: take a hand-held burst and average it *without* aligning → you don't get a cleaner image, you get a **blurry double**. The $1/\sqrt N$ benefit only materialises once the frames agree pixel-for-pixel; alignment is what *earns* it. [`fig-align-before-average`]

• **the two failure scales**: a **sub-pixel** misalignment loses fine detail (the average smears across the offset — a soft low-pass); a **whole-pixel-or-more** misalignment **ghosts** (edges appear twice). In astro the driver is unavoidable — the **sky rotates** through the exposure, so without per-sub alignment the stars trail and the stack is mush.

• **the reframe**: alignment is itself an **inverse problem / optimization** — find the transform $T$ that makes $I_2(T\!\cdot)$ match $I_1$, i.e. $\hat T=\arg\min_T\sum_x\big(I_1(x)-I_2(Tx)\big)^2$. Everything below is *which family of $T$* (translation, homography, dense flow) and *how you search it*.

▸10.2.2 Brute-force translational alignment (SSD / NCC)⧉

• **the simplest model — pure translation**: for a tight burst, frame-to-frame motion is well-approximated by a 2-D **shift** $(dx,dy)$. Try **all shifts** within $\pm\,$maxOffset and keep the best-scoring one. (This is literally a pset: brute force over a shift window.)

• **the score — SSD**: sum of squared differences, $\mathrm{SSD}(dx,dy)=\sum_{x,y}\big(I_1(x,y)-I_2(x{+}dx,y{+}dy)\big)^2$ — small when the frames agree. Plot it over the shift window and you get a **basin with a clear minimum** at the true shift. [`fig-ssd-search-surface`]

• *pseudocode*: the brute-force aligner as a double loop over the shift window — score each candidate by SSD, keep the argmin.

• **when SSD lies — NCC**: SSD assumes **brightness constancy** (same exposure/illumination). Across an HDR bracket or with a brightness drift, that's false and SSD breaks. **Normalised cross-correlation** is invariant to a per-frame **gain and offset** — subtract each patch's mean, divide by its std, correlate: $\mathrm{NCC}=\dfrac{\sum (I_1-\bar I_1)(I_2-\bar I_2)}{\sqrt{\sum(I_1-\bar I_1)^2}\,\sqrt{\sum(I_2-\bar I_2)^2}}$, **maximised** at the true alignment. Use SSD for same-exposure bursts, NCC (or align in a normalised/gradient domain) when brightness changes between frames.

• **debugging / hand-checkable input**: align a synthetic image to a copy of itself **shifted by 1–2 px** — the alignment score (e.g. SSD of the difference) must be **smallest at the true shift**; validate the search this way before trusting real data. Concretely: shift a known image by $(+2,0)$, run the search, and confirm the score map's minimum sits *exactly* at $(+2,0)$. If it doesn't, your indexing/sign/interpolation is wrong — and you'll never catch it on noisy real frames. This shifted-copy test is the cheapest, most reliable unit test an aligner has. [`fig-shifted-copy-debug`]

• **sub-pixel refinement** (forward-ref): integer-shift search gets you close; fit a parabola to the SSD basin (or upsample) for **sub-pixel** accuracy — which is exactly the precision burst **super-resolution** depends on ([[Super-resolution and image priors]]). Shifting by a fractional pixel means **interpolation** (bilinear/bicubic — BASIC Resampling).

▸10.2.3 Phase correlation: alignment in Fourier (L5)⧉

• **the idea** (and the lesson): a **pure shift** is, in Fourier, just a **phase ramp** — the magnitudes are unchanged, only the phases tilt: $I_2(x)=I_1(x-d)\Leftrightarrow\hat I_2(\omega)=\hat I_1(\omega)e^{-i\omega\cdot d}$ (the **shift theorem**). So the *entire* translation lives in the phase. 💡 **Big lesson (L5 · diagonalize when you can):** convolution and correlation are **diagonal in Fourier** — instead of an $O(\text{window}^2)$ search over shifts, one FFT pair recovers the global shift directly. (→ see Big lesson **L5**, first placed in BASIC — [[Linearity, Fourier, Aliasing and deblurring]]; here it turns brute-force search into a single transform.)

• **the algorithm** — phase correlation (Kuglin & Hines 1975): form the **normalised cross-power spectrum** $R(\omega)=\dfrac{\hat I_1\,\overline{\hat I_2}}{|\hat I_1\,\overline{\hat I_2}|}=e^{i\omega\cdot d}$ (normalising to *unit magnitude* keeps only the phase — robust to brightness and to dominant low-frequency content). Its inverse transform $\mathcal F^{-1}\{R\}$ is a **single sharp delta at the shift $d$** — read off the peak location. [`fig-phase-correlation-spike`]

• **why it's nice**: global, fast ($O(n\log n)$), and the delta's **sharpness/height** doubles as a confidence measure. Extends (log-polar — Reddy & Chatterji 1996) to recover **rotation and scale** as shifts in a transformed domain.

• **its limits** (be honest): it assumes a **single global shift** over the whole frame — it can't handle parallax, scene motion, or a per-region warp. Those need a richer transform (homography → panorama chapter) or **per-pixel** motion (optical flow → video).

▸10.2.4 Coarse-to-fine alignment on a pyramid⧉

• **the two problems brute force has**: large motions blow up the search window (cost), and a raw SSD surface can have **local minima** (texture/aliasing) that trap a naive search.

• **the fix — multiscale**: build an **image pyramid** (cross-ref [[Linear pyramids and wavelets]]) and align **coarse-to-fine**. At the coarsest (tiny, blurred) level a *small* shift covers a *large* real motion and the score surface is smooth (no local minima) → get a cheap rough estimate; **propagate it down**, refining at each finer level with only a tiny local search. [`fig-coarse-to-fine-pyramid`]

• **why it works** (frequency view, **L5/L16**): the coarse level keeps only **low frequencies**, whose SSD basin is wide and unimodal; finer levels add the high-frequency detail that sharpens the estimate but would otherwise create false minima. Prefiltering on the way down the pyramid is the same **L16** "blur before you decimate" discipline.

• **in production**: HDR+ (Hasinoff 2016) aligns burst frames this way — a **tiled, coarse-to-fine** search per tile, then a robust merge — handling both global hand motion and local scene motion at once.

• **the hand-off to video**: replace "one transform per frame" with "**one motion vector per pixel**" and coarse-to-fine matching becomes **optical flow**; constrain $T$ to a homography and it becomes **stabilization / panorama**. The alignment machinery here is the entry point to [[Slopendium/09 Motion, warping, and video|Motion, warping, and video]] — *this chapter links forward there.*

▸10.3 HDR merging⧉

`fig-exposure-stack-slices` · N exposures as overlapping reliable bands of the (log) radiance axis that slide with exposure time and together tile a range no single shot could hold 🟨

`fig-vary-exposure-knobs` · the four exposure knobs and their side effects — none for shutter, depth-of-field for aperture, noise for ISO, color cast plus camera contact for ND 🟨

⬜ figure not yet created

`fig-pixel-ratio-calibration` (two adjacent exposures, scatter of $I_i$ vs $I_j$ over the jointly-good pixels → slope $= k_i/k_j$ fig-pixel-ratio-calibration

⬜ figure not yet created

median is robust)

`fig-response-curve-debevec` · a camera's non-linear response $f(kL)$ and the recovered inverse $g(z)$ mapping pixel value to log-radiance, constrained by multiply-exposed pixels up to a global scale (Debevec–Malik) 🟨

`fig-histogram-exposure` · one real scene at three exposures (−2 / as-shot / +2 stops), each with its luminance histogram — distribution slides left→right, clipping spikes pin to the edge; crushed-shadow / clipped-highlight callouts (BASIC histograms) ✅

`fig-weight-triangle` · the reliability weight $w(z)$ as a hat that vanishes at both clipping and noise extremes, with an SNR-optimal variant skewed toward brighter samples, plus per-frame contribution bands 🟨

⬜ figure not yet created

`fig-naive-vs-optimal-weights` (same bracket merged with binary/hat weights vs inverse-variance weights — shadow noise visibly lower, after the Hasinoff "Nancy Church" result) fig-naive-vs-optimal-weights

`fig-hasinoff-snr-optimal-set` · SNR vs log scene brightness for naive bracketing, an SNR-optimal high-ISO short-exposure set, and an ideal sensor, the optimal set nearly matching the ideal and beating naive in the shadows (Hasinoff–Durand–Freeman) 🟨

`fig-hdrplus-pipeline` · the HDR+ flow: under-exposed raw burst → pick reference → align → robust raw-domain merge → demosaic → tone-map → JPEG, highlighting that merge precedes demosaic 🟨

💡 **Big lesson (L15 · Dynamic range = full-well capacity ÷ noise floor):** a single exposure records from a **top** — the photosite's **full-well capacity**, where values saturate/clip — down to a **bottom** — the **read-noise floor** in the shadows. Their **ratio is the dynamic range** (in stops): how much scene contrast fits in *one* shot, typically far less than the 10–12 orders of magnitude the world throws at us. You can **widen** the range with a bigger well (larger photosites — a full-frame sensor beats a phone) or a lower floor (cooling, lower read noise) — but the photographic workhorse is to **beat it by merging multiple exposures**: bracket the scene, let each shot own a different slice of the radiance axis, and stitch the slices into one radiance map. That merge is this chapter. (Registered in [[Big Lessons]] as **L15**; first appears in FUNDAMENTALS — Noise/SNR/DR; this is its capture-side **HDR** recurrence. The companion is **L6** — with a sane encoding it's *range*, not bit depth, that bounds a shot.)

💡 **Big lesson (L14 · Capture the full set, decide later):** rather than commit to one exposure at the instant of capture, **record the whole bracket and choose afterward**. HDR bracketing is the exposure-axis instance of the same move that gives focus stacks (capture every focal plane) and light-field cameras (capture every ray → refocus later). The cost is data and a harder reconstruction (alignment, merge); the payoff is deferring an irreversible decision — *which exposure?* — out of the moment of capture. (→ see Big lesson **L14**; first placed with light-field / plenoptic cameras; recurs here as HDR exposure bracketing and again in the next chapter as the **burst**.)

equations

image formation with clipping $I_i(x,y)=\mathrm{clip}\big(k_i\,L(x,y)+n\big)$, exposure factor $k_i\propto t_i\cdot(\text{ISO})/N_{\!f}^2$ (shutter $t_i$, aperture $f$-number $N_f$, ISO gain)

**linear radiance estimate** $\displaystyle \hat L(x,y)=\frac{\sum_i w(z_i)\,z_i/t_i}{\sum_i w(z_i)}$ ($z_i=I_i(x,y)$)

**pairwise scale** $k_i/k_j=\mathrm{median}_{x:\,w_i,w_j>0}\big(I_i(x,y)/I_j(x,y)\big)$ (linear capture)

**Debevec response recovery** $g(z_i)=\ln L + \ln t_i$ — solve for the response $g=\ln f^{-1}$ and the $\ln L$ jointly by least squares with a smoothness term, up to one additive constant

**Debevec merge (log domain)** $\ln \hat L=\dfrac{\sum_i w(z_i)\,(g(z_i)-\ln t_i)}{\sum_i w(z_i)}$

**two-observation optimal blend** $\hat L = a x+(1-a)y$, $a^\*=\sigma_y^2/(\sigma_x^2+\sigma_y^2)$ → general **inverse-variance** weights $w_i=1/\sigma_i^2$, $\hat L=\sum_i (z_i/k_i)/\sigma_i^2 \big/ \sum_i 1/\sigma_i^2$

**per-pixel noise variance** $\sigma^2(x)=a\,x + \sigma_{\text{read}}^2$ (photon term ∝ signal + constant read term).

▸10.3.1 The HDR challenge⧉

• **the gap, stated in numbers**: real scenes span **10 to 12 orders of magnitude** of illumination (the eye adapts from $\sim 10^{-6}$ to $10^{6}$ cd/m²; a single indoor-with-window scene is routinely $1{:}100{,}000$). A negative or sensor records only **2–3 orders** in one shot; a print or display shows even less — typically $1{:}20$ to $1{:}50$, at best $\sim 1{:}500$. *Two distinct problems fall out:* **(1) record** the information (a single capture can't), and **(2) display** it (the medium can't show it even once recorded). This chapter is problem (1) — **recording** via merge. Problem (2), **tone reproduction**, is forward/elsewhere. [`fig-dr-clip-vs-noise`]

• **why one exposure fails — the two walls** (this *is* **L15**): turn exposure up and the **highlights clip** — any radiance past the photosite's full-well capacity saturates to the max value, all detail above it collapses to white. Turn exposure down and the **shadows drown** — the signal sinks below the **read-noise floor**, and brightening them in post just amplifies noise. The window between the two walls is the **single-shot dynamic range**; a high-DR scene doesn't fit. [`fig-dr-clip-vs-noise`]

• **the formation model** (carry it through the whole chapter): scene radiance $L(x,y)$ reaches a photosite, is scaled by an **exposure factor** $k_i$ (set by shutter, aperture, ISO), has **noise** added, and **clips** above the well capacity: $I_i(x,y)=\mathrm{clip}\big(k_i\,L(x,y)+n\big)$. Everything below is *invert this*: estimate the single $L$ from the several clipped, noisy $I_i$.

• **the resolution — bracket and merge** (this *is* **L14**): shoot $N$ exposures that **sequentially measure different segments of the range** — a short one keeps the highlights, a long one digs out the shadows. Each $I_i$ is reliable only over a *band* of radiance (above its noise, below its clip); slide the band with $t_i$ and the bands tile the whole scene range. Merge the reliable parts into one floating-point radiance map. [`fig-exposure-stack-slices`]

• **scope note (what this chapter is *not*)**: the output is a **linear radiance map** — one float per $(x,y,c)$, values may be $\gg 1$ (stored in OpenEXR / RGBE / DNG; see [[#Application to cell phones: HDR+ and burst imaging]] for the raw-domain variant). **Tone mapping** — compressing that map to a displayable image while preserving local detail (bilateral / gradient-domain operators) — is a separate stage, cross-referenced to [[Edge-preserving techniques]] and recapped in [[Single-image computational photography]]. *Merge here, tone-reproduce there.*

• **caveat — modern single-shot DR**: today's sensors already give $\sim 13$–14 stops in one capture, so HDR-merge is less universally necessary than it was; even a single raw can benefit from *tone mapping* alone (noise in the shadows is the limit). HDR merge still wins when the scene exceeds the sensor's single-shot range — and, crucially, the **cell-phone burst** (next chapter) reuses the same merge for *denoising*, not only range.

• **the old, analog precursors** (history, ties to the chapter's opening): photographers fought dynamic range long before digital — Gustave Le Gray (1857) **sandwiched two negatives** (one exposed for sky, one for sea); fill-flash and **graduated ND filters** compress scene contrast at capture; the darkroom's **dodging and burning** and **Zone System** are tone reproduction by hand. Surface these as the lineage; the digital merge automates the negative-sandwich.

▸10.3.2 Data capture: how to vary the exposure⧉

• **the question**: to tile the radiance axis you need to *change the exposure factor* $k_i$ between shots. Four knobs do it — **shutter speed, aperture, ISO, neutral-density (ND) filter** — and they are **not** interchangeable: each has a side effect on the image beyond brightness. [`fig-vary-exposure-knobs`]

• **shutter speed — the clean knob (use this)**: range $\sim$30 s to 1/4000 s ≈ **6 orders of magnitude**. **Pros: reliable and linear** — doubling exposure time doubles collected photons, so $k_i\propto t_i$ exactly, which is what the merge math assumes. **Con:** very long exposures add their own noise (dark current) and risk motion. This is the default for bracketing because it changes *only* the exposure, nothing else about the image.

• **aperture — changes the picture**: range $\sim f/1.4$ to $f/22 \approx$ **2.5 orders** ($k\propto 1/N_f^2$). **Con:** it changes **depth of field** — the bracket frames would have *different focus blur*, breaking the "only the exposure changes" assumption. Use only when desperate.

• **ISO — amplifies, doesn't collect**: range $\sim$100 to 1600 ≈ **1.5 orders**. ISO is **analog/digital gain applied after capture**, not more light — so it scales signal *and* read-noise-referred terms and generally **raises noise**. Useful when desperate, but a poor way to extend range on its own. (Hasinoff's insight, below, is that ISO *is* a useful degree of freedom once you model noise properly.)

• **neutral-density (ND) filter — darkens the world**: up to $\sim$4 densities ≈ 4 orders, **stackable**. **Cons:** not perfectly neutral (introduces a **color shift**), imprecise, and you must **touch the camera** to add it (risking shake → needs re-registration). **Pro:** works with strobe/flash where shutter can't, a good complement when desperate.

• **the takeaway** (ties to **L7** — stops): exposure is **counted in stops** (factors of 2, additive in log space), and the clean way to walk the radiance axis is **shutter**; the other knobs trade a side effect (DoF, noise, color) for range and are fallbacks. *Forward-ref:* which exposures to pick — not just *how* to vary, but the **optimal set** — is the *Optimizing the capture and merge* section (Hasinoff).

• **the merge assumes static + registered**: the basic merge assumes **only the exposure changes — no motion**. Hand-held bracketing breaks this; align first (Ward 2003's **median-threshold bitmap**, MTB — a contrast-robust alignment that compares images thresholded at their median, accelerated on a pyramid) and reject **ghosts** from moving objects. Cross-ref [[#Image alignment]]; the burst-imaging chapter makes robust alignment central.

• **on-chip HDR — the sensor does the merge (no software bracket)**: everything above assumes *software* fuses a bracket of separate frames. But **on-chip HDR is now an industry-wide sensor capability** — many modern image sensors **produce a high-dynamic-range signal directly, on-chip**, folding the "capture several slices and combine" step into the readout itself. **Sony** is a prominent example (its DOL / dual-read schemes, below), but **OmniVision**, **Samsung (ISOCELL)**, and **onsemi** and other automotive vendors (split-pixel designs) all do on-chip HDR by various strategies — it's a hardware strategy, not one vendor's feature. This is the hardware cousin of bracketing: instead of $N$ frames merged in post, the slices are captured and combined *inside the sensor / ISP*, so the photographer gets one wide-DR raw out. The main strategies differ in **how** they multiplex the dynamic range — in **time**, in **space**, or in **gain**: [`fig-hdr-sensor-strategies`]

• **DOL-HDR / "double read" (digital overlap)** — read the same exposure twice, or interleave **long + short** exposures line-by-line, then combine; the long read recovers shadows, the short read the highlights. This is the Sony **"double read."** *Time-multiplexed:* the long and short captures are not simultaneous, so anything moving between them **ghosts** (motion artifacts) — the same de-ghosting problem as a hand-held bracket, now inside one frame.

• **dual-conversion-gain (DCG)** — each pixel can be read at **two readout (conversion) gains**: high gain for a low noise floor in the shadows, low gain for more full-well headroom in the highlights; the two reads are merged. *Gain-multiplexed:* both reads see the **same instant**, so it is the **cleanest** option — no motion artifacts, no resolution loss — but the extra range it buys is **bounded** by the gain spread.

• **split-pixel / dual-photodiode** — each photosite carries a **large + small** sub-photodiode: the large one saturates early (shadows/midtones), the small one keeps reading into bright highlights. Common in **automotive** sensors (sun-glare, tunnel exits). *Space-multiplexed:* costs **resolution / fill-factor and SNR** (the small diode is noisy), but both sub-pixels are exposed simultaneously, so no motion ghosting.

• **lateral-overflow / LOFIC / voltage-domain tricks** — add **extra charge capacity** per pixel (a lateral-overflow integration capacitor that catches the charge spilling past the photodiode's well), extending full-well in the **voltage/charge domain** without a second exposure.

• **spatially-varying exposure (SVE / "assorted pixels")** — bake **different sensitivities into neighbouring pixels** (an exposure mosaic, like a Bayer pattern but for ND), then demosaic into an HDR image — **Nayar & Mitsunaga**. *Space-multiplexed:* trades **resolution** for range, single-shot so no motion artifacts.

• **logarithmic / companding sensors** — a pixel whose response is **log (or piecewise-companded)** in radiance, capturing many decades in one read at the cost of **lower contrast resolution / fixed-pattern non-uniformity** in any given band.

• **the trade-off, in one line**: **time-multiplexed** schemes (DOL double-read) buy range with **motion artifacts**; **space-multiplexed** schemes (split-pixel, SVE/assorted pixels) buy it with **resolution / SNR**; **gain-domain** (DCG) is the **cleanest** — same instant, full resolution — but its extra range is **bounded**. The software **HDR+ burst** of the next chapter ([[#Application to cell phones: HDR+ and burst imaging]]) sits at the opposite end: maximal flexibility and denoising-for-free, paid for in compute and alignment. This is the **computational-sensor** trade space — push the work into silicon or into software — developed in [[Advanced computational photography]] (cross-ref). [`fig-hdr-sensor-strategies`]

▸10.3.3 Curve calibration⧉

• **the calibration question**: to merge, you must know how a pixel value $z_i$ relates to scene radiance — i.e. recover the **exposure factors** $k_i$ (or their ratios) and, if the capture is non-linear, the **camera response curve** $f$ where $z = f(k\,L)$. Two regimes: **linear capture** (easy, a ratio) and **non-linear capture** (regression on pixels seen at multiple exposures).

• **linear case — just a ratio**: if the images are **linearly encoded** (raw, or un-gamma'd — `dcraw -g 2.2 0` gives linear-ish; **L2**: do radiometry in linear light), then in any non-clipped, non-noisy pixel $I_i = k_i L$. So for a pair of adjacent exposures the radiance cancels: $I_i/I_j = k_i/k_j$ — *the same value at every good pixel* if linearity holds. Estimate it robustly: $k_i/k_j=\mathrm{median}_{x:\,w_i,w_j>0}\big(I_i(x,y)/I_j(x,y)\big)$, taking the **median** over pixels where **both** images are reliable. **Chain** the pairwise ratios to get every $k_i/k_0$ — you only ever recover exposure up to **one global scale factor** (which is fine: radiance maps are relative unless you photometrically calibrate). [`fig-pixel-ratio-calibration`]

• *practical note:* you can also just read $k_i$ from the **EXIF exposure data** ($t_i$, $f$, ISO); the ratio-from-pixels method is the fallback when the metadata is missing or the response is suspect, and it's more robust because it's measured from the actual data.

• **non-linear case — recover the response curve (Debevec–Malik 1997)**: real cameras (and film, and JPEG) apply a **non-linear response** $f$ (gamma, tone curve, film characteristic). Then $z_i = f(k_i L)$ and you must recover $f$ (equivalently its log-inverse $g=\ln f^{-1}$) before you can merge. The trick: a pixel imaged at **several exposures** gives several $(z_i, t_i)$ pairs constraining the *same* radiance — that over-determination pins down the curve. Write, in the log domain, $g(z_i) = \ln L + \ln t_i$. The unknowns are the curve $g(\cdot)$ (one value per possible pixel level, e.g. 256) and one $\ln L$ per pixel. Solve a **linear least squares** over all pixels and exposures, with two riders: a **smoothness penalty** on $g''$ (a real response is smooth), and a **reliability weight** $w(z)$ down-weighting near-clip and near-noise samples. Recovered only up to **one additive constant** (= the global scale). [`fig-response-curve-debevec`]

• **Mitsunaga–Nayar 1999 — even less is assumed**: models the response as a **low-order polynomial** and recovers it (and the unknown exposure ratios jointly) by **radiometric self-calibration** — you don't even need to trust the EXIF exposure ratios. Same spirit: pixels seen at multiple exposures over-determine the curve; fit a smooth response. (Mann & Picard 1995's "comparametric" / Wyckoff work is the earlier multiple-exposure precursor.)

• **why regression beats raw ratios** (the cleaner-way bridge): a **robust regression** for $g$ and $L$ avoids brittle per-pair ratios, **directly models uncertainty due to noise** (heavier weight where SNR is high), and **generalizes to the full non-linear response** in one framework. This is the seam into *Optimizing the merge*: once you've written it as least squares, the *right* weights are inverse-variance — the next two sections.

• **encoding note (L2)**: merging is an **additive** operation in radiance, so it must be done in **linear light** (or, equivalently, the log-radiance domain of Debevec). Gamma/tone-curved values are *not* proportional to radiance and would merge wrong — un-apply the response first. (The *display* side, by contrast, lives in a perceptual/log space — that's tone mapping, elsewhere.)

▸10.3.4 Combining exposures⧉

• **the assembly recipe** (three steps): (1) figure out the **scale factor** $k_i$ between images — from EXIF or pixel ratios (previous section); (2) compute a **weight map** $w_i(x,y)$ for each image marking which pixels are usable; (3) **reconstruct** the radiance per pixel as a **weighted combination** of the rescaled observations. [`fig-weight-triangle`]

• **which pixels are useful — the reliability mask**: at each pixel, an exposure's value is trustworthy only if it is **neither clipped nor noise-dominated**. Reject **clipped** highlights (e.g. $z > 0.99$) — they only tell you "at least this bright," not how bright. Reject **too-dark / too-noisy** shadows (e.g. $z < 0.002$) — the read-noise floor swamps the signal. What's left is the usable band. (P-set 5 uses **binary** $w\in\{0,1\}$, extensible to a smooth scalar; weights can differ per color channel.)

• **the merge — a weighted average in radiance**: rescale each usable observation to radiance by dividing out its exposure factor, $z_i/t_i$ (or $z_i/k_i$), then average, weighted by reliability:

$$\hat L(x,y)=\frac{\sum_i w(z_i)\,z_i/t_i}{\sum_i w(z_i)}.$$

In the non-linear (Debevec) case, the identical average runs in the **log domain** on the recovered curve: $\ln\hat L=\big(\sum_i w(z_i)\,(g(z_i)-\ln t_i)\big)\big/\big(\sum_i w(z_i)\big)$. Either way it is a **reliability-weighted mean** — and where several exposures overlap, the averaging also **denoises** (the $1/\sqrt N$ effect of [[#Denoising by averaging]], reliability-weighted). [`fig-weight-triangle`]

• *pseudocode*: the merge as one per-pixel pass over the bracket — divide out each exposure factor, accumulate by reliability weight, normalize, with the all-bad fallback to the brightest/shortest frame.

• **edge case — pixels bad in *every* exposure**: a pixel may be **clipped in all** images (a light source) or **noise-dominated in all** (deepest shadow). Don't discard it: **keep clipped pixels in the *brightest*-exposed image** (the one shot long enough to have *some* value there — for the darkest part) and **keep dark pixels in the *shortest* exposure**. Careful with the threshold implementation: the literal values $0.0$ / $1.0$ must actually be *kept* for these fallback images, not rejected. (Mind the direction: dark pixels of the *brightest* image, bright pixels of the *darkest* — not the reverse.)

• **the connection to denoising and to fusion** (the unifying view): the unweighted, single-radiance-band case of this average is exactly **average-to-denoise** ([[#Denoising by averaging]]) — HDR merge generalizes it with per-observation reliability/SNR weights. A *different* philosophy is **exposure fusion** (Mertens, Kautz & Van Reeth 2007): skip the radiance map entirely — score each bracket pixel by a **per-pixel quality** (well-exposedness, local contrast, saturation), then **blend the bracket directly in a Laplacian pyramid** weighted by those scores. No response-curve recovery, no HDR float map, no separate tone-mapping step — the multi-scale (pyramid) blend hides the exposure seams and yields a *displayable* image in one shot. Cross-ref the pyramid-blend machinery in [[Linear pyramids and wavelets]]; fusion is the pragmatic "merge straight to a viewable image" alternative to "build an HDR map, then tone-map."

▸10.3.5 Optimizing the capture and merge⧉

• **the better question** (beyond binary weights): when **multiple exposures both see a pixel validly**, they have **different noise**. Plain binary masks throw away information — using *only* the least-noisy observation gets no $1/\sqrt N$ benefit; there must be a way to use the noisier one *at least a little*. The principled answer: weight by **reliability = inverse noise variance**.

• **the two-observation derivation** (the kernel of it): given two unbiased estimates $x,y$ of the same quantity with variances $\sigma_x^2,\sigma_y^2$, combine as $\hat L = a x + (1-a) y$. Its variance is $a^2\sigma_x^2 + (1-a)^2\sigma_y^2$; differentiate w.r.t. $a$, set to zero → $a^\* = \dfrac{\sigma_y^2}{\sigma_x^2+\sigma_y^2}$. The **optimal combination weights each estimate by the inverse of its variance** (check: equal variances → $a=1/2$, the plain average; one estimate far noisier → it gets little weight but not zero). Generalize: $w_i = 1/\sigma_i^2$, giving $\hat L = \dfrac{\sum_i (z_i/k_i)/\sigma_i^2}{\sum_i 1/\sigma_i^2}$ — **inverse-variance (minimum-variance) weighting**, the BLUE estimator. [`fig-naive-vs-optimal-weights`]

• **plug in the camera's noise model**: variance is dominated by two terms (FUNDAMENTALS — Noise): **photon (shot) noise** with variance $\propto$ signal (dominates in the highlights) and **read noise** with constant variance (dominates in the shadows): $\sigma^2(x) = a\,x + \sigma_{\text{read}}^2$, where $a$ and $\sigma_{\text{read}}$ depend on camera and ISO. Calibrate by shooting repeated frames (variance across the stack) or derive it on paper. Now the merge weights aren't a hand-drawn hat — they come from physics.

• **what falls out** (a clean, slightly surprising result): rescaling an observation by $1/k_i$ to radiance amplifies **both signal and noise**, so $\mathrm{Var}(z_i/k_i)\approx \sigma_i^2/k_i^2$. Working the algebra through the inverse-variance formula for the **photon-noise-only** case: the per-pixel radiance $L$ and the slope $a$ are common to every term (they appear in numerator and normalizer) and **cancel**, and the two $k_i$ factors cancel too — so **all pixels of a given exposure get the same weight**, $w'_i \propto k_i$ (longer/brighter exposures, with relatively less photon noise, weighted up). The images aren't literally rescaled to radiance in the sum, but the normalization makes it equivalent. (With **read noise** included the cancellation is incomplete and the optimal weights *do* vary per pixel — read noise is what makes the optimum genuinely spatially varying.) Still reject clipped/dark pixels with the binary mask, *then* weight the survivors by inverse variance. [`fig-naive-vs-optimal-weights`]

• **Hasinoff–Durand–Freeman 2010 — optimizing the *capture set*, not just the merge**: don't stop at optimal *weights* for a *given* bracket — choose the **optimal set of exposures** to shoot. With a full noise model (photon + read, and **exploiting ISO** as a real degree of freedom), pick the shutter/ISO sequence that **maximizes worst-case SNR** across the scene's brightness range under a fixed time budget. The counter-intuitive result: the SNR-optimal set often uses **higher ISO and shorter exposures** than naive bracketing — e.g. for one scene, standard bracketing ISO 100 (1/100, 1/25, 1/6 s) is beaten by an SNR-optimal ISO 3200 (1/3200, 1/125, 1/5 s), gaining $\sim$12 dB in the dark regions and approaching the **ideal HDR sensor** curve. [`fig-hasinoff-snr-optimal-set`]

• **the through-line into cell phones**: Hasinoff's "fewer/shorter/higher-ISO, then merge optimally" logic — *underexpose to avoid clipping and to freeze motion, recover the rest by computation* — is exactly the philosophy of **HDR+** (next section), where the bracket becomes a **burst of identical underexposed raws** merged for denoising *and* range at once. (Other refs in this vein: Granados et al. 2010, optimal per-pixel weights from the full noise model; Ward 2003, robust capture with registration + ghost/flare removal.)

▸10.4 Application to cell phones: HDR+ and burst imaging⧉

⬜ figure not yet created

redraw from the Hasinoff 2016 system diagram)

`fig-burst-vs-bracket` · a few-frame varied-exposure bracket (needs tripod, ghosts on motion) vs a many-frame identical-short-exposure burst (hand-held, motion frozen, denoised by averaging) 🟨

⬜ figure not yet created

robust merge rejects the outlier frames per tile → clean) fig-ghost-from-misalignment

⬜ figure not yet created

`fig-burst-superres-link` (the same aligned burst, but sub-pixel hand-tremor shifts sampled on a finer grid → recovered resolution, demosaic replaced — pointer to [[Single-image computational photography]], after Wronski 2019). fig-burst-superres-link

equations

burst formation $I_i(x,y)=\mathrm{clip}\big(k\,L(W_i(x,y)) + n_i\big)$ — **same** small exposure $k$ every frame, each warped by an estimated alignment $W_i$ (≈ identity + sub-pixel hand motion)

**merged estimate** = robust weighted average over aligned frames, $\hat L(x,y)=\sum_i w_i\,I_i(W_i^{-1}x)\big/\sum_i w_i$, with $w_i$ down-weighting frames that disagree with the reference (motion/occlusion) — outlier rejection on top of the inverse-variance weighting of [[#HDR merging]]

**noise after merge** $\sigma_{\text{merged}}\approx \sigma_{\text{single}}/\sqrt{N}$ for the static, well-aligned pixels (the denoise)

**why underexpose** — choose $k$ so that the brightest scene radiance stays below clip ($k\,L_{\max}<1$), accepting a higher relative noise that the $1/\sqrt N$ merge then removes.

▸10.4.1 Why a phone shoots a burst, and why it underexposes⧉

• **the phone's predicament**: a tiny sensor (small photosites → low full-well capacity → low single-shot DR and high relative noise — **L15**), **no tripod**, an unsteady hand, and a scene with people moving. Classic bracketing — a few *different* exposures, assume static — fails: the long exposure blurs from hand shake, subjects move between frames (ghosts), and the small sensor's shadows are noisy anyway. [`fig-burst-vs-bracket`]

• **the HDR+ move (Hasinoff 2016)**: replace the bracket with a **burst of many *identical* frames**, each at the **same short, deliberately under-exposed** raw exposure. Short exposures **freeze motion** (no per-frame blur) and are easy to align; *many* of them, **averaged**, beat the noise down by $1/\sqrt N$ — recovering the shadow detail you'd otherwise have gotten from a long exposure, but without the blur. The burst does double duty: **denoising** (the $1/\sqrt N$ of [[#Denoising by averaging]]) **and HDR** (range recovered from the merge) in **one** operation. This is **L14** on a phone — capture the full set (the burst), decide later (merge + tone-map). [`fig-burst-vs-bracket`]

• **why *under*expose — the irreversibility asymmetry** (the crux): the two failure walls are **not** symmetric. A **clipped highlight is gone forever** — once a photosite saturates, the information is destroyed and no merge recovers it. A **dark-but-unclipped shadow is only noisy**, and noise is **recoverable** by averaging $N$ frames. So bias the exposure **down**: pick $k$ small enough that even the **brightest** scene radiance stays below clipping ($k\,L_{\max}<1$), guaranteeing **highlight headroom**, and pay for it with extra shadow noise that the burst merge then removes. You trade a *recoverable* problem (noise) for avoiding an *unrecoverable* one (clipping). [`fig-why-underexpose`]

• **this is Hasinoff 2010 on a phone**: it's exactly the **SNR-optimal capture set** logic — shorter exposures, higher effective ISO, recover by computation — instantiated as "one short exposure repeated $N$ times." The 2010 paper said *how to choose the optimal set*; HDR+ says *for a phone, the optimal set is a uniform under-exposed burst*, because that also solves motion and alignment.

▸10.4.2 The HDR+ pipeline: align and robust-merge in raw⧉

• **the pipeline** (end to end): **capture** a burst of $N$ under-exposed raw frames (raw = linear, pre-demosaic, **L2**) → **select a reference** (a sharp, well-exposed-enough frame, often an early one) → **align** every other frame to the reference (tile-based, sub-pixel) → **robustly merge** the aligned frames **in the raw domain** → emit **one denoised, high-DR raw** → **demosaic**, then **tone-map** and finish (the displayable JPEG). The key architectural choice: **merge happens in raw, before demosaicking**, so denoise + HDR are a single operation on the cleanest (linear, mosaiced) data. [`fig-hdrplus-pipeline`]

• **the merge is a robust weighted average** — the [[#HDR merging]] merge with one addition: $\hat L = \sum_i w_i\,I_i(W_i^{-1}x)/\sum_i w_i$, where $W_i$ is the estimated alignment of frame $i$ and the weight $w_i$ combines **inverse-variance reliability** (from the noise model) with **outlier rejection** — a frame (or tile) that *disagrees* with the reference (because a subject moved, or alignment failed) is **down-weighted or dropped**, so it can't contribute a ghost. Conceptually a per-frequency **Wiener-style merge** in the tile transform: where frames agree, average hard (full denoise); where they disagree, trust the reference (no ghost). [`fig-ghost-from-misalignment`]

• **alignment is the whole game** (and ghosts are its failures): mis-registration turns the helpful average into **blur and double-images**. HDR+ aligns in **tiles** (a coarse-to-fine, pyramid search per tile) to handle locally-varying motion, and the **robust** merge is the safety net for the residual motion the alignment can't undo (occlusion, fast subjects). The trade is explicit: more aggressive averaging = more denoising but more ghost risk; the robustness weight tunes it per pixel. (Cross-ref [[#Image alignment]]; this is the same MTB/pyramid-search lineage as Ward 2003, made local and sub-pixel.)

• **why raw, before demosaic**: doing the merge on **linear raw** keeps the noise model simple and correct (photon + read, **L15**), lets the merge *also* serve as the demosaic's denoiser, and avoids baking in the non-linear tone curve before the additive average (**L2** — average in linear light). Tone mapping comes **after** the merge (local tone mapping; HDRnet / Gharbi 2017 learns this step on a **bilateral grid** — forward-ref to [[Edge-preserving techniques]] and [[Deep learning]]).

▸10.4.3 From burst HDR to burst super-resolution⧉

• **the same substrate, one knob turned**: HDR+ aligns and merges a burst to gain **SNR and range**. **Burst super-resolution** (Wronski et al. 2019, the Pixel "Super Res Zoom") aligns and merges the *same kind of burst* to gain **resolution**. The only difference that matters is **sub-pixel** alignment: the unavoidable **hand tremor** between frames shifts the scene by fractions of a pixel, so each frame samples the continuous scene on a slightly **offset grid** — fuse them on a **finer grid** and you recover detail beyond a single frame's sampling. [`fig-burst-superres-link`]

• **aliasing becomes the signal**: a single frame's sampling **folds** high frequencies (aliasing — normally a defect); the sub-pixel-shifted frames let you *separate* those folded frequencies, turning aliasing into recovered resolution. This is **reconstruction-based** super-resolution (genuine measured detail), not hallucination — cross-ref the reconstruction-vs-hallucination axis in [[Single-image computational photography#Reconstruction vs hallucination]]. It even **replaces demosaicking**: the color mosaic's own per-channel offsets, combined across the burst, reconstruct full color directly. [`fig-burst-superres-link`]

• **the unifying statement** (close the chapter): HDR merge, burst denoising, and burst super-resolution are **one operation** — *align a burst, then robustly merge* — with the dial set to recover **range** (HDR+), **SNR** ($1/\sqrt N$ denoise), or **resolution** (sub-pixel super-res), and usually **all three at once** on a modern phone. The full super-resolution treatment (the prior, the ill-posedness, the fusion math) lives in [[Single-image computational photography]]; here it closes the multiple-exposure arc — *capture the set, align, merge, decide later* (**L14**), beating the single-shot limits of **L15** by computation.

▸10.5 Manual panorama stitching from multiple views⧉

`fig-pano-rotate-vs-translate` · pure camera rotation about a fixed center (views related by one homography) vs camera translation (parallax, no single map) 🟨

`fig-depth-cancels-ray` · points at different depths along one ray collapsing to a single pixel after rotation and reprojection — depth drops out of the view-to-view map 🟨

⬜ figure not yet created

`fig-pinhole-divide-by-z` (3D point $(x,y,z)$ → image $(x/z,y/z)$: projection = divide by depth) fig-pinhole-divide-by-z

`fig-homography-quad` · a homography sending a rectangle to a general quadrilateral, keeping lines straight while letting parallels converge (8 DOF; affine would keep a parallelogram) 🟨

⬜ figure not yet created

reprojecting to another line is *linear* up there and depth-independent — the toy version of the whole idea) fig-iid-variance-adds

⬜ figure not yet created

`fig-pano-warp-to-reference` (pick one image as reference fig-pano-warp-to-reference

`fig-pano-4clicks` · the manual stitching pipeline: click four correspondences across the overlap, solve $H$, warp, blend into one wide frame 🟨

`fig-doc-flatten` · a skewed photo of a page rectified to a flat scan by clicking its four corners and warping by the recovered homography (the planar-scene case) 🟨

💡 **Big lesson (L14 · Capture the full set, decide later):** *the move that drives this whole part is — instead of committing one capture parameter at the instant of exposure, **record the whole set and choose afterward**.* A panorama defers your **field of view**: shoot several overlapping frames now, decide the final framing/crop of the wide view later. It is the same move as exposure bracketing → HDR (capture all exposures), focus stacks (all focal planes), and burst imaging (every instant). The cost is data and a harder reconstruction (alignment + blending); the payoff is taking an irreversible decision *out* of the moment of capture. (Registered in [[Big Lessons]] as **L14** — *first appears* with light-field/plenoptic cameras, but it is the **spine of this part** and surfaces in every section here; one-line callbacks at HDR, focus stacks, and panoramas.)

A panorama is the oldest computational-photography trick: a single shot can't see wide enough, so take **several overlapping photos** and stitch them into one wide image. The deep question is *what relates two overlapping photos of the same scene?* — and the surprisingly clean answer (no 3D, no depth) is what makes stitching a tractable problem rather than a full reconstruction. This chapter does the geometry by hand; the **manual** version asks the user to click the correspondences, and the [[#Automatic panorama stitching from multiple views and feature matching|next chapter]] automates them.

equations

pinhole projection $(x,y,z)\mapsto(x/z,\,y/z)$ (project = divide by depth)

homography $\mathbf{x}' \simeq H\mathbf{x}$ with $H\in\mathbb{R}^{3\times3}$, **8 DOF** (defined up to scale, $H\sim kH$)

pure-rotation view-to-view map $H = K R K^{-1}$ — **independent of depth**

for the calibrated/normalised case $H = R$ acting on rays $(x,y,1)$

the per-point divide $\big(x',y'\big)=\big(\tfrac{a x+by+c}{gx+hy+i},\,\tfrac{dx+ey+f}{gx+hy+i}\big)$

the linear system $A\mathbf{h}=\mathbf{0}$ (each correspondence → 2 rows

4 correspondences → 8 equations in 8 free unknowns).

▸10.5.1 The scenario, and the one rule: rotate, don't translate⧉

• **the setup**: you stand in one place and **pan the camera** across a scene, taking overlapping frames. Choose one frame as the **reference**; reproject (warp) the others into its coordinate system; blend the overlaps. The output is a virtual **wide-angle** view. [`fig-pano-4clicks`]

• **the one rule** — pure **rotation about the optical centre**. The camera may rotate freely (pan, tilt, roll) but its **viewpoint must not move**. If you keep the optical centre fixed, every pair of views is related by a single homography (derived below) *regardless of scene geometry*. [`fig-pano-rotate-vs-translate`]

• **why translation breaks it**: if you *walk* (translate the optical centre), near and far objects shift by **different** amounts — **parallax**. There is then no single 2D map between the views; you'd need the depth of every pixel (i.e. actual 3D reconstruction). The slogan for the reader: **"rotate, don't walk."** The one exception is a **planar scene** (next-to-last section) — there a homography works *even if you translate*, because a plane has, in effect, one depth law.

• **forward-pointer**: the *warping itself* — forward vs inverse warp, resampling, antialiasing — and the closely-related **perspective-distortion correction** are developed in [[Slopendium/09 Motion, warping, and video|Motion, warping, and video]]. Here we only need the **geometry** (which $H$) and how to **solve** for it.

▸10.5.2 Refresher: pinhole projection is "divide by depth"⧉

• **the pinhole model**: a 3D point $(x,y,z)$ in the camera frame images to $(x/z,\,y/z)$ on the plane $z=1$. **Projection is division by depth.** Far things look small because you divide by a bigger $z$. [`fig-pinhole-divide-by-z`]

• **rotating the camera = rotating the world the other way**: to project a point into a camera that has been rotated by $R$, apply $R^{-1}$ (equivalently $R$ to the ray) then divide. So a rotated view of the *same* scene is: rotate the 3D points, then re-project (divide by the new $z'$).

• **the trap we're about to escape**: this recipe needs the 3D point — and for stitching we only have the **first image's pixels**, not their depths $z$. Are we stuck? (No — that's the next section.)

▸10.5.3 Why you don't need 3D: depth cancels for a pure rotation⧉

• **the key observation** — *all points along one viewing ray reproject to the same place.* A pixel $(x,y)$ in image 1 is the image of an entire **ray** of 3D points $(tx,ty,t)$ for every depth $t$. Rotate that whole ray by $R$ and re-project: the **divide by $z'$ removes the common factor $t$**. Every point on the ray lands on the *same* pixel in the rotated view. So the destination pixel **does not depend on depth at all**. [`fig-depth-cancels-ray`]

• **therefore pick $z=1$**: since depth doesn't matter, treat each pixel as the point $(x,y,1)$ — pretend the image is a plane at distance 1 — rotate by $R$, divide by the new third coordinate. That is the entire mapping. We *chose* a depth only for convenience; any choice gives the same answer. [`fig-homog-coords-1d` shows the 1D toy: reprojecting points to another line is linear and depth-free]

• **the equation** — chaining "un-project (apply $K^{-1}$), rotate ($R$), re-project (apply $K$)" gives the view-to-view map

$$\mathbf{x}' \simeq H\,\mathbf{x}, \qquad H = K\,R\,K^{-1},$$

where $K$ is the camera's intrinsics (focal length, principal point) and $\simeq$ means "up to scale" (you still divide by the last coordinate). **Crucially, no $z$ appears in $H$** — depth has cancelled. For normalised coordinates ($K=I$) it's just $H=R$ acting on rays $(x,y,1)$.

• **the punchline to state plainly**: *for a camera rotating about its optical centre, the relationship between two images is a homography that is independent of the scene.* No depth, no 3D model, no parallax — one $3\times3$ matrix relates the views everywhere. This is the chapter's central idea and the reason panorama stitching is a clean linear problem. (Reused in [[#Automatic panorama stitching from multiple views and feature matching|automatic stitching]] and in perspective correction.)

▸10.5.4 Homographies and homogeneous coordinates⧉

• **homogeneous coordinates** — represent a 2D point $(u,v)$ by **three** numbers $(x,y,w)$ with $u=x/w,\ v=y/w$. Recover the Euclidean point by **dividing by the last coordinate**. All scalings $(wx,wy,w)$ name the same point → coordinates are defined **up to scale**. Motivations: they make **perspective** (the divide) and **translation** expressible as a single matrix multiply, and they're exactly "treat the image as a plane at $z=1$ in 3D." [`fig-homog-coords-1d`]

• **a homography** is any $3\times3$ matrix $H$ acting on homogeneous coordinates: compute $\mathbf{x}'=H\mathbf{x}$ (plain matrix multiply), then **divide by $w'$** to read off image coordinates. It is the most general **projective mapping of a plane** — and, per the previous section, exactly the camera-rotation map. [`fig-homography-quad`]

• **what it does geometrically**: maps the input **rectangle to an arbitrary quadrilateral**; parallel lines need not stay parallel (they can converge to a vanishing point), but **straight lines stay straight**. Same as "project the plane, rotate, re-project." [`fig-homography-quad`]

• **8 degrees of freedom**: $H$ has 9 entries but is defined **up to scale** ($H$ and $kH$ give the same map, since $(kw'x',\dots)$ and $(w'x',\dots)$ are the same Euclidean point) → **8 DOF**. This is why **4 point correspondences** (8 numbers) are exactly enough to pin it down.

• **the per-point formula** (writing $H=\begin{psmallmatrix}a&b&c\\ d&e&f\\ g&h&i\end{psmallmatrix}$): $\ x'=\dfrac{ax+by+c}{gx+hy+i},\quad y'=\dfrac{dx+ey+f}{gx+hy+i}.$ The denominator $w'=gx+hy+i$ is the perspective divide; $g,h$ are what make lines converge (zero $g,h$ → an affine map, no perspective).

▸10.5.5 Solving for H from correspondences⧉

• **the goal**: given $\ge4$ matched point pairs $\mathbf{x}_i\leftrightarrow\mathbf{x}_i'$, find $H$ mapping each $\mathbf{x}_i$ to $\mathbf{x}_i'$. In manual stitching the user supplies these by clicking; automatically they come from feature matching (next chapter).

• **make it linear**: the relation $\mathbf{x}'\simeq H\mathbf{x}$ looks nonlinear because of the divide, but **cross-multiplying** by the denominator clears it. Each correspondence yields **two linear equations** in the 9 unknown entries of $H$ — e.g. one of them is $a x + b y + c - x'(g x + h y + i)=0$. Stack them: $A\mathbf{h}=\mathbf{0}$, where $\mathbf{h}$ is the 9-vector of $H$'s entries. With 4 pairs you get $8$ equations.

• **the scale freedom** (and how to kill it): $\mathbf{h}=\mathbf{0}$ is a trivial solution and $H$ is only defined up to scale, so you must **fix the scale**:

• **dirty version** — assume $i\neq0$ and **set $i=1$** (true unless the camera is rotated ~90°). Then 8 unknowns, $A\mathbf{x}=\mathbf{b}$ with a nonzero right side; solve directly (4 pairs) or by **least squares** if you have more pairs (overconstrained $\arg\min\lVert A\mathbf{x}-\mathbf{b}\rVert^2$, the normal equations $A^\top A\mathbf{x}=A^\top\mathbf{b}$). Good enough for a problem set.

• **clean version** — keep all 9 unknowns and solve the homogeneous $A\mathbf{h}=\mathbf{0}$ by **SVD**: $H$ is the **singular vector with the smallest singular value** (the null space). No arbitrary assumption about $i$. [link to [[Linear Inverse Problems and Regression]] for the SVD null-space solve]

• **overdetermined is good**: more than 4 correspondences → solve in **least squares**; this averages out clicking noise. (And, in the next chapter, lets **RANSAC** discard wrong matches before the final fit.)

• **a fitting analogy**: just like fitting a line $x'=ap+b$ from many noisy $(p,x')$ pairs, different correspondence sets can yield the *same* $H$; the unknowns are $H$'s coefficients, the data are the point pairs.

▸10.5.6 Warping and assembling the panorama⧉

• **pick a reference, warp the rest**: choose one image's coordinate frame as the canvas; for every other image, warp it into that frame using its homography (warp one-by-one toward the reference). The union is the wide panorama; pixels no part of any image reaches stay **black**. [`fig-pano-warp-to-reference`]

• **inverse warp** (the right way to resample): loop over **output** pixels $(x,y)$, map *back* into the source with $H^{-1}$, then divide and sample (with interpolation/antialiasing) — this guarantees every output pixel is filled, no holes. (Forward warp leaves gaps.) The full forward-vs-inverse warp and resampling story is in [[Slopendium/09 Motion, warping, and video|Warping]].

• **bounding box**: to size the canvas, **forward**-map the source corners through $H$ to see where the warped image lands; then **inverse**-warp to fill the output. (Forward for the bbox, inverse for the pixels.)

• **then blend** — the warped images overlap and won't match exactly (exposure, vignetting, tiny misalignment); merging them seamlessly is the [[#Blending|Blending]] section's job (feathered/two-scale/Poisson). Here we just place them.

• **other projections** (pointer): warping everything onto one flat reference plane stretches badly for very wide fields of view; **cylindrical / spherical** projections (and "little planet" stereographic) are the fix — covered later under *Other projections*.

▸10.5.7 Another application: document flattening and merging⧉

• **the planar twin**: everything above needed *pure rotation* because general scenes have depth. But if the scene is a **single plane**, a homography relates **any** two views of it — *even when the camera translates* — because a plane imposes one depth relationship across the whole surface (no independent per-pixel depth). A plane photographed from any viewpoint maps to a plane by a homography.

• **document flattening (rectification)**: photograph a page, whiteboard, painting, or building façade at an angle and it appears as a **keystoned quadrilateral**. Click (or detect) its **four corners**, set them as correspondences to a target rectangle, solve for $H$, and **warp** → a fronto-parallel, "scanned" version with verticals vertical and the page rectangular. [`fig-doc-flatten`]

• **document merging**: to capture a large document (or a whiteboard too big for one frame), shoot **overlapping** planar patches and stitch them with planar homographies — the same pipeline as a panorama, but the homography is justified by the **planar scene** rather than by pure rotation. Useful for book scanning, mosaicing microscope slides, aerial/satellite image tiles of (locally) flat ground.

• **same machinery, different justification** — worth stating for the reader: panoramas (rotation) and document rectification/merging (plane) are *the same homography solve*; only the geometric reason the homography is valid differs. (Perspective-distortion correction — making a building's verticals parallel — is the same trick and is developed in [[Slopendium/09 Motion, warping, and video|Warping]].)

▸10.6 Automatic panorama stitching from multiple views and feature matching⧉

`fig-stata-stitch-pipeline` · the full automatic-stitching pipeline on a REAL pair (Stata Center): Harris corners → oriented descriptors → putative matches → RANSAC inliers → homography warp & blend → seamless panorama (real-image capstone for `fig-feature-pipeline`) ✅

`fig-corner-flat-edge` · why corners are the best keypoints — only a corner's small window changes under a shift in every direction (flat / edge / corner columns, aperture problem) 🟨

`fig-harris-eigen-ellipse` · the structure tensor's eigenvalues linked to the flat / edge / corner classification and the sign of the Harris response $R=\det M-k(\operatorname{tr}M)^2$ 🟨

⬜ figure not yet created

axes $\propto\lambda^{-1/2}$

`fig-harris-not-scale-invariant` · one scene corner classified "corner" at one scale and "edge" at another under a fixed window, motivating scale selection 🟨

⬜ figure not yet created

`fig-scale-space-bumps` (1D bumps of three widths blurred over a stack of Gaussians fig-scale-space-bumps

`fig-dog-pyramid` · the Difference-of-Gaussians pyramid and the 26-neighbour space-and-scale extremum test that detects SIFT keypoints 🟨

`fig-sift-descriptor` · the $16\times16$ window → $4\times4$ grid of 8-bin orientation histograms → 128-D normalised vector that is the SIFT descriptor 🟨

⬜ figure not yet created

`fig-canonical-orientation` (gradient-orientation histogram of the patch fig-canonical-orientation

`fig-ratio-test-histograms` · overlapping raw-distance histograms vs well-separated ratio $d_1/d_2$ histograms for correct/incorrect matches, showing why the ratio test discriminates 🟨

`fig-ransac-line` · RANSAC's sample → fit → count-inliers → keep-best → re-fit loop on a toy line-fitting example, four panels 🟨

⬜ figure not yet created

`fig-ransac-pano-inliers` (a real overlap: matches colored — rejected-by-ratio-test, RANSAC outliers, inliers — and the resulting clean stitch) fig-ransac-pano-inliers

⬜ figure not yet created

`fig-ransac-iterations-graph` (probability-of-failure vs inlier fraction $w$, family of curves for $N=10,10^2,\dots,10^6$ iterations — the exponentials at work) fig-ransac-iterations-graph

The [[#Manual panorama stitching from multiple views|manual pipeline]] is complete except for one human step: someone clicked the corresponding points. That step is the bottleneck — slow, and impossible at scale (gigapixel mosaics, photo collections, video). This chapter replaces the clicks with **automatic feature matching**, which Durand's slides rightly call *"perhaps the most important innovation in computer vision and computational photography in the last twenty years"* — the same machinery powers 3D reconstruction, tracking, object recognition, retrieval, and robot navigation. We keep the homography solve from before; we only manufacture its inputs.

equations

window SSD under a shift $E(u,v)=\sum_{x,y} w(x,y)\,[I(x+u,y+v)-I(x,y)]^2$

**Taylor → quadratic form** $E(u,v)\approx \begin{psmallmatrix}u&v\end{psmallmatrix} M \begin{psmallmatrix}u\\ v\end{psmallmatrix}$ with the **structure tensor** $M=\sum w\begin{psmallmatrix}I_x^2&I_xI_y\\ I_xI_y&I_y^2\end{psmallmatrix}$

**Harris response** $R=\det M-k(\operatorname{tr}M)^2=\lambda_1\lambda_2-k(\lambda_1+\lambda_2)^2$ ($k\approx0.04$–$0.06$)

Shi–Tomasi $R=\min(\lambda_1,\lambda_2)$

**scale-space** $L(\cdot,\sigma)=G_\sigma*I$, **DoG** $D=L(\cdot,k\sigma)-L(\cdot,\sigma)\approx(k{-}1)\sigma^2\nabla^2 G*I$ (scale-normalised Laplacian)

**descriptor distance** SSD $d=\lVert \mathbf{f}_i-\mathbf{f}_j\rVert^2$

**ratio test** $d_1/d_2<\tau$ ($\tau\approx0.8$, $d_1\le d_2$ the two nearest neighbours)

**RANSAC inlier test** $\lVert \mathbf{x}_i'-H\mathbf{x}_i\rVert<\varepsilon$

probability a random $s$-sample is all-inlier $w^s$

**iteration count** $N=\dfrac{\log(1-p)}{\log(1-w^s)}$ (succeed with prob. $p$

inlier fraction $w$

sample size $s{=}4$ for a homography).

▸10.6.1 Why not brute force, and the two sub-problems⧉

• **the dead end** (worth showing): the old way matched images **densely** — try all translations, compute the sum-of-squared-differences (SSD) over all pixels, keep the best. For a **homography** that means discretising **8 coefficients** — combinatorially hopeless — and each trial requires a full warp + SSD. Even reduced to a couple of rotational DOF it's restrictive, slow, and **not robust** to moving objects or distortion. [`fig-feature-pipeline` motivates the alternative]

• **the modern idea — sparse matching**: ignore uniform regions; focus on a few **very distinctive points** (interest points), match them **locally** (which point ↔ which point), and add machinery to be **robust to local mistakes**. Sparse, fast, robust.

• **the two problems to solve** (state them as the chapter's skeleton): **(1) repeatable detection** — detect the *same* scene point **independently** in each image (if a point is found in one image but not the other, it can never match); **(2) distinctive description** — for a detected point, **recognise** its correspondent in the other image, via a descriptor that's similar for the same scene point and different for others. [`fig-corner-flat-edge`]

• **the discipline to emphasise**: detection and description run **independently in each image**. Only **at the very end**, given a bag of (point, descriptor) features per image, do we compute correspondences. We analyse the detector/descriptor on *corresponding* scene points only to *check* they'd behave consistently.

▸10.6.2 Basic feature matching — Harris corners and what makes a good keypoint⧉

• **what makes a good keypoint**: a point you can **localise precisely** and **re-find reliably**. The test (Moravec's idea): slide a small window a little in **any** direction and measure how much the patch changes. **Flat** region → no change in any direction (not localisable); **edge** → no change *along* the edge (slides freely → ambiguous); **corner** → **large change in every direction** → uniquely pinned down. *Corners make the best keypoints.* [`fig-corner-flat-edge`]

• **quantify it**: the windowed change under a shift $(u,v)$ is $E(u,v)=\sum_{x,y} w(x,y)\,[I(x+u,y+v)-I(x,y)]^2$, with $w$ a box or Gaussian window. Computing this for all shifts is costly — so **Taylor-expand** $I(x+u,y+v)\approx I+I_xu+I_yv$ to get a **quadratic form**: $E(u,v)\approx \begin{psmallmatrix}u&v\end{psmallmatrix}M\begin{psmallmatrix}u\\ v\end{psmallmatrix}$.

• **the structure tensor** $M=\sum_{x,y} w\begin{psmallmatrix}I_x^2 & I_xI_y\\ I_xI_y & I_y^2\end{psmallmatrix}$ (a.k.a. second-moment matrix) packs the local gradient statistics into a $2\times2$ symmetric matrix. Its **eigenvalues** $\lambda_1,\lambda_2$ are the change rates along the two principal directions; the iso-$E$ contour is an **ellipse** with axes $\propto\lambda^{-1/2}$. [`fig-harris-eigen-ellipse`]

• **classify by eigenvalues**: both small → **flat**; one large, one small → **edge** (the freely-sliding direction has tiny $\lambda$); **both large** → **corner**. [`fig-harris-eigen-ellipse`]

• **the Harris response** (avoid computing eigenvalues explicitly): $R=\det M - k(\operatorname{tr}M)^2 = \lambda_1\lambda_2 - k(\lambda_1+\lambda_2)^2$, $k\approx0.04$–$0.06$. $R$ is large-positive at corners, large-negative on edges, small in flat regions. (**Shi–Tomasi** variant: just use $R=\min(\lambda_1,\lambda_2)$.)

• **the algorithm**: compute gradients $I_x,I_y$; form $M$ per pixel (windowed sums of products); compute $R$; **threshold** $R$; keep **local maxima** of $R$ (non-max suppression over a $k\times k$ neighbourhood) → the detected corners. [workflow figure within `fig-harris-eigen-ellipse` family]

• **invariances** (what survives between two views): **rotation** — the ellipse rotates but its **eigenvalues don't change**, so $R$ is rotation-invariant (a clean example of eigen-analysis giving a rotation-invariant quantity); **intensity** — uses only **derivatives**, so invariant to a brightness shift $I\to I+b$, and partially to scaling $I\to aI$ (the threshold is the catch). **Not invariant to scale**: zoom in and a corner's neighbourhood looks like an edge → it gets classified differently. That failure motivates SIFT. [`fig-harris-not-scale-invariant`]

• **the simple descriptor** (the basic, pre-SIFT version — enough for a first system): take the **$k\times k$ patch** of luminance around the corner; **blur** slightly (reduce sampling/aliasing); **subtract the mean and divide by the standard deviation** (kill brightness/contrast differences from exposure, vignetting, clouds). The descriptor is just those $k\times k$ normalised numbers.

• **brute-force matching**: for each feature in image 1, scan all features in image 2 and keep the one with smallest descriptor **SSD** $d=\lVert\mathbf{f}_i-\mathbf{f}_j\rVert^2$. This gives every corner a match — *including many wrong ones* (even points in non-overlapping regions get a "match"). Two cleanups follow: the **ratio test** (next-but-one) and **RANSAC**.

▸10.6.3 SIFT and advanced feature matching⧉

• **the goal**: a detector that finds the same point across **scale and rotation** changes, and a descriptor that's correspondingly **invariant** — so wide-baseline, zoomed, rotated views still match. SIFT (Lowe 2004) is the canonical answer. [`fig-feature-pipeline`]

• **scale selection via scale-space**: blur the image with Gaussians of increasing $\sigma$ to build a **scale-space** $L(\cdot,\sigma)=G_\sigma*I$. The right window size for a feature is found as an **extremum over scale** of a scale-invariant function. A plain average is too flat to peak; instead use a **centre-surround / contrast** operator — the **Difference-of-Gaussians** $D=L(\cdot,k\sigma)-L(\cdot,\sigma)$, which approximates the **scale-normalised Laplacian** $\sigma^2\nabla^2 G$. [`fig-scale-space-bumps`]

• **characteristic scale**: the DoG response, tracked across scale at a point, **peaks at a scale proportional to the feature's size** (the slides' 1D-bumps demo: widths 5/9/15 peak at $\sigma\approx1.7/3/5.2$ — a constant ratio). That peak scale is the feature's **characteristic scale**, found **independently in each image**, so a zoomed copy is detected at the correspondingly larger scale. [`fig-scale-space-bumps`]

• **SIFT detection**: find **local extrema of DoG in space *and* scale** — a point larger/smaller than all **26** neighbours (8 in-scale + 9+9 in adjacent scales) of a **DoG pyramid**, processed one **octave** (factor-of-2) at a time. Then **refine** to sub-pixel/sub-scale by fitting a quadratic (Brown & Lowe 2002), and **reject** low-contrast points and edge-like points (the principal-curvature / Harris ratio test). The alternative **Harris-Laplacian** (Mikolajczyk & Schmid) maximises Harris in space and Laplacian in scale. [`fig-dog-pyramid`]

• **canonical orientation** (for rotation invariance): build a **histogram of local gradient orientations** (weighted by gradient magnitude) over the keypoint's neighbourhood at its scale; the **peak** sets the keypoint's **canonical angle**. Describing the patch *relative* to this angle makes the descriptor rotation-invariant. [`fig-canonical-orientation`]

• **the SIFT descriptor** (128-D gradient-orientation histogram): resample a **$16\times16$** window in the patch's **canonical scale and orientation**; compute gradients, **weight by a Gaussian** (smooth falloff toward the edge); divide into a **$4\times4$** grid of cells; in each cell accumulate an **8-bin orientation histogram** (weighted by magnitude, with trilinear interpolation so a gradient spreads smoothly across 8 bins — 4 spatial + 2 orientation). $4\times4\times8=\mathbf{128}$ numbers. [`fig-sift-descriptor`]

• **illumination invariance**: **normalise** the 128-vector to unit length (kills contrast scaling); **clamp** any component $>0.2$ and renormalise (a few large gradients — e.g. from a specular edge — don't dominate); using **gradient orientations** is already fairly lighting-robust. Net invariances: **scale, rotation, and affine illumination** (and graceful degradation under small affine geometry).

• **why it matters** — locality (robust to occlusion/clutter, no segmentation), distinctiveness (a single feature can be matched against a database of thousands), quantity (many per image), efficiency. **Matching at scale** uses approximate nearest-neighbour structures — **k-d trees / Best-Bin-First** (Beis & Lowe) — instead of brute force.

• **forward-pointers**: SIFT features feed **structure-from-motion** at internet scale — **Photo Tourism / Photosynth** (Snavely et al. 2006), and today **COLMAP** → Gaussian splatting. The modern **learned** successors (SuperPoint, LightGlue, neural matchers) are cross-referenced in [[Deep learning]] — same detect/describe/match skeleton, learned end-to-end.

▸10.6.4 Second-nearest-neighbour test⧉

• **the problem it solves**: brute-force matching returns a nearest neighbour for **every** feature — including features with **no true correspondent** (background, non-overlap, repeated texture). We need to tell good matches from bad. [continues `fig-feature-pipeline`]

• **why an absolute threshold fails**: thresholding the raw descriptor distance ("if $d$ too big, reject") doesn't cleanly separate good from bad — the distributions **overlap**; there's no clean cutoff (good matches can be far, bad matches can be near). [`fig-ratio-test-histograms`]

• **Lowe's ratio test** (the fix): compare the **best** match distance $d_1$ to the **second-best** $d_2$. Keep the match only if $\boxed{d_1/d_2 < \tau}$ (typically $\tau\approx0.8$). Intuition: a **distinctive, correct** match is *much* closer than every alternative ($d_1\ll d_2$, ratio small); an **ambiguous** match (repeated texture, no true correspondent) has a near-tie best/second-best ($d_1\approx d_2$, ratio near 1) → reject. The *relative* test is far more discriminative than the absolute one. [`fig-ratio-test-histograms` shows the well-separated ratio distributions for correct vs incorrect matches]

• **status after the ratio test**: a set of matches that is **mostly** correct — but **not entirely**. Some outliers survive (and near-duplicate structure can still fool it). That residual error is exactly what **RANSAC** mops up next.

▸10.6.5 RANSAC⧉

• **the setup**: we have point correspondences, most good (inliers) but some wrong (**outliers**), and we want the homography $H$. A plain **least-squares** fit over *all* matches is hopeless — even a few gross outliers drag the fit arbitrarily far off (squared error rewards them). We need a **robust** estimator. [`fig-ransac-pano-inliers`]

• **RANSAC** (RANdom SAmple Consensus, Fischler & Bolles 1981) — the loop:

1. **randomly pick a minimal sample** — for a homography, **4 correspondences** ($s=4$);

2. **fit the model exactly** from that sample (solve $H$);

3. **count inliers** — all matches with **small reprojection error** $\lVert\mathbf{x}_i'-H\mathbf{x}_i\rVert<\varepsilon$;

4. **keep the sample with the largest consensus set** (most inliers);

5. after the loop, **re-fit $H$ by least squares on all inliers** (optionally reweighted to pull in near-inliers). [`fig-ransac-line` shows the loop on a toy line fit; `fig-ransac-pano-inliers` on a real overlap]

• *pseudocode*: the canonical robust-fitting loop as a fence — sample 4, fit, count inliers, keep best, re-fit on inliers.

• **the intuition**: a model fit to an **all-inlier** sample is approximately correct and so will be **endorsed by many other inliers**; a model contaminated by an outlier is wrong and gets **few votes**. Consensus, not least squares, identifies the good model. The minimal sample size keeps the chance of a clean draw as high as possible.

• **probabilistic analysis — how many iterations?** Let $w$ be the **fraction of inliers** and $s$ the sample size. The probability a single random sample is **all inliers** is $w^s$ (independent draws → product / AND). The probability that $N$ samples **all fail** to be clean is $(1-w^s)^N$ (each independent → product). To **succeed with probability $p$**, demand $(1-w^s)^N \le 1-p$, giving

$$N \;=\; \frac{\log(1-p)}{\log(1-w^s)}.$$

[`fig-ransac-iterations-graph`]

• **read the three knobs** (state the takeaways):

• **sample size $s$ / model complexity** — the **bad** exponential: $w^s$ shrinks fast as $s$ grows (homography needs $s=4$, so $w^4$). More parameters → many more iterations.

• **inlier fraction $w$** — the **base** of the exponential: cleaner matches → dramatically fewer iterations. (This is why the **ratio test runs first** — it *raises $w$*.)

• **iterations $N$** — the **good** exponential: failure probability falls geometrically with $N$, so a few extra iterations buy enormous reliability.

• **worked numbers** (from the slides, to ground it): with $s=4$ — at $w=0.5$, $w^s=0.0625$, so $N=100$ → fail prob $\approx0.0016$ (1 in 600), $N=1000$ → 1 in $10^{28}$. At $w=0.3$, $N=100$ → fail $\approx0.44$, $N=1000$ → 1 in 3400. At $w=0.1$, you need $\sim10^5$ iterations to be safe. **Inlier fraction dominates.**

• **the bigger lesson**: RANSAC is a **general robust-fitting** tool (lines, fundamental matrices, planes, …) — *fit a low-parameter model from a random minimal sample, score by consensus, keep the best, re-fit on inliers.* Cheap when the model needs few points and inliers aren't too rare.

• **closing the loop — the whole auto-pano system**: extract **Harris** (or SIFT) interest points → describe (normalised patch / SIFT) → match by descriptor distance → **ratio test** rejects ambiguous matches → **RANSAC** fits the homography robustly → warp every image to the reference → **[[#Blending|blend]]**. That is automatic panorama stitching (Brown & Lowe 2007), and the same feature pipeline scales up to **bundle adjustment** ([[#Bundle adjustment]]) and full 3D reconstruction.

▸10.7 Blending⧉

⬜ figure not yet created

`fig-blend-visible-seam` (a registered two-image mosaic with a hard photometric seam — exposure + white-balance + vignetting mismatch — the problem statement) fig-blend-visible-seam

⬜ figure not yet created

`fig-blend-feather-ghost` (single distance-weighted average → the same scene where small misregistration produces a doubled/ghosted edge and overall blur) fig-blend-feather-ghost

`fig-blend-twoscale-split` · a source split into low and high bands, the low band blended smoothly and the high band composited winner-take-all, plus a hard-cut / feather / two-scale comparison 🟨

`fig-blend-mask-pyramid` · the blend mask decomposed by a Gaussian pyramid, the transition wide at coarse levels and sharp at fine levels, explaining why width tracks frequency band 🟨

⬜ figure not yet created

`fig-blend-laplacian-bands` (per-band blend $L^k_{out}=m^k L^k_A+(1-m^k)L^k_B$ then collapse → the seamless multiband result) fig-blend-laplacian-bands

`fig-blend-poisson-vs-pyramid` · pyramid vs Poisson blending on a paste with a DC offset — the pyramid leaves a faint low-frequency halo, Poisson absorbs the offset into the boundary 🟨

`fig-blend-seam-graphcut` · two overlapping frames with a moving person and the min-cut seam routed through low-disagreement pixels around the person, taking them entirely from one source 🟨

💡 **Big lesson (L9 · the eye cares about gradients, not absolute values):** a stitch seam is jarring not because the *average* brightnesses differ but because the **gradient is wrong right at the boundary** — a sudden jump the visual system reads as an edge that isn't in the scene. So the entire blending ladder is a sequence of ways to *fix the gradients at the seam*: smooth-blend the low frequencies so the jump is spread out (two-scale, multiband), or paste the gradients and solve for values so the jump is **absorbed into an invisible DC shift** (Poisson), or route the seam through pixels whose gradients already match so there is nothing to fix (graph cut). A constant brightness/color offset between two frames is **invisible once the boundary gradient is right** — exactly why gradient-domain blending works. (→ first appears in [[Poisson image editing]] — *the key idea*; recurs here as the organizing principle of stitch blending.)

equations

**feathering / alpha** $I_{out}(p)=\dfrac{\sum_i w_i(p)\,I_i(p)}{\sum_i w_i(p)}$ with $w_i$ a distance-to-boundary weight

**two-scale** — split $I_i=L_i+H_i$ ($L_i=G_\sigma * I_i$, $H_i=I_i-L_i$)

blend low band by feathering, high band winner-take-all $H_{out}(p)=H_{\arg\max_i w_i(p)}(p)$

**multiband / Laplacian** per band $L^k_{out}=m^k\,L^k_A+(1-m^k)\,L^k_B$ where $m^k$ is the **Gaussian-pyramid** level-$k$ of the mask, then collapse $\sum_k \mathrm{expand}(L^k_{out})$

**Poisson blend** $\displaystyle \min_f \iint_\Omega \lVert\nabla f - \mathbf v\rVert^2\ \text{s.t. } f|_{\partial\Omega}=I_{\text{target}}|_{\partial\Omega}$, Euler–Lagrange $\nabla^2 f=\operatorname{div}\mathbf v$

**seam energy** $E=\sum_p D_p(\ell_p)+\sum_{(p,q)}V_{pq}(\ell_p,\ell_q)$ (data + smoothness, minimized by graph cut)

▸10.7.1 Why a hard seam is visible — the photometric mismatch⧉

• the setup: by this point the images are **geometrically aligned** (homography from RANSAC, then reprojected into a common frame). Geometry is *not* the problem anymore. The problem is **color**. [`fig-blend-visible-seam`]

• the three culprits, all real and all unavoidable in a handheld pano:

• **exposure / auto-exposure drift**: the camera re-meters between shots (or you changed shutter/ISO), so the same wall is brighter in one frame than the next.

• **white-balance drift**: auto-WB shifts per frame → a color cast that differs across the seam.

• **vignetting**: every lens darkens toward the corners, so the *same* scene point is darker when it lands near a frame edge than when it lands center — guaranteeing a brightness ramp that disagrees across overlaps.

• why the eye pounces on it (the **L9** hook, full box above): a hard seam is a **discontinuity in the gradient field** — a step edge the visual system flags as a real boundary. A *smooth* brightness difference of the same magnitude, by contrast, is nearly invisible. So "hide the seam" really means "**don't leave a gradient discontinuity at the boundary**."

• encoding note (L2): do all blending in **linear light** (un-gamma first). Light from overlapping frames *adds* linearly; averaging gamma-encoded values darkens midtones and warps the blend. State this before any weighted average.

• the naive non-answer — **hard cut** ("basic reprojection"): just paint each output pixel from whichever source covers it. Result: a sharp, ugly seam exactly where exposure/WB/vignetting disagree. This is the baseline every method below beats. [`fig-blend-visible-seam`]

▸10.7.2 Feathering / alpha blending — and why it ghosts⧉

• the first real idea (**Solution 1: smooth blending**): make each output pixel a **weighted average** of the sources, with the weight **falling off toward each frame's boundary**: $I_{out}(p)=\dfrac{\sum_i w_i(p)\,I_i(p)}{\sum_i w_i(p)}$. A source counts fully in its interior and fades to zero at its edge, so the transition is gradual and the seam softens.

• computing the weight cheaply: the distance-to-boundary is **awkward in the output frame** (the homography warps it) but **trivial in each source frame** — a **separable, piecewise-linear** ramp (1 in the middle, 0 at the edge, in $x$ and in $y$). Compute the ramp in source space, warp it along with the image. [`fig-blend-feather-ghost`]

• engineering note worth stating (slide 77): keep an **extra "sum of weights" image** alongside the accumulator and divide at the end — it makes the normalization $\sum_i w_i$ a one-liner and handles arbitrary overlap counts.

• the catch — **ghosting and blur**: feathering averages *aligned* pixels, but registration is never perfect (parallax, moving objects, lens distortion). Averaging slightly-misaligned high-frequency detail **doubles edges** (ghosts) and **smears texture**. The wider the feather, the worse the blur. [`fig-blend-feather-ghost`]

• the diagnosis that motivates everything next: feathering uses **one transition width for all frequencies**. But the *right* width is **frequency-dependent** — low frequencies (the exposure/vignetting ramp we *want* to merge) need a **wide, smooth** blend; high frequencies (sharp detail we want to keep crisp and un-ghosted) need a **narrow, almost-abrupt** transition. One width cannot serve both. That realization is the whole ladder.

▸10.7.3 Two-scale blending — the simple split (the pset method)⧉

• the idea (the user's own course method, after Brown & Lowe's two-band simplification): split each source into **just two bands** and blend them **differently**.

• **low band** $L_i = G_\sigma * I_i$ (Gaussian-blurred): carries exposure, white balance, vignetting — the *slow* stuff we want to merge smoothly. **Smooth-blend** it with the feathering weights above. A wide transition here corrects exposure/vignetting differences without introducing a visible step. [`fig-blend-twoscale-split`]

• **high band** $H_i = I_i - L_i$ (the residual detail): carries sharp edges and texture. Here a smooth average is exactly what *causes* ghosting — so instead use **winner-take-all**: at each pixel keep the high-frequency detail from the source with the **highest weight**, $H_{out}(p)=H_{\arg\max_i w_i(p)}(p)$. An **abrupt** transition keeps detail sharp and avoids duplicated objects.

• recombine: $I_{out} = L_{out} + H_{out}$. [`fig-blend-twoscale-result`]

• why it works in one line: **smooth where the eye tolerates smoothness (low frequencies), abrupt where the eye demands sharpness (high frequencies).** It directly implements "match the transition width to the scale," just with two scales instead of a continuum.

• where it sits in the ladder: this is the **2-level special case** of Laplacian-pyramid blending (next). Brown & Lowe showed two bands are often *enough* for panoramas, which is why it's a great pset: most of the benefit, a fraction of the code. [`fig-blend-twoscale-result`]

• limitation (the push to multiband): two scales is a coarse quantization of "frequency." With strong exposure differences or fine repeating texture across the seam, two bands can still leave a faint transition. The principled version uses a **full pyramid**.

▸10.7.4 Multiband / Laplacian-pyramid blending — a transition per band⧉

• the principle (Burt & Adelson 1983): generalize two-scale to a **continuum of scales** via the **Laplacian pyramid**. Decompose each source into band-pass levels $L^k_A, L^k_B$; **blend each band with a transition width matched to that band's scale**; then **collapse** the blended pyramid back to an image. Cross-ref [[Linear pyramids and wavelets]] for the pyramid itself. [`fig-blend-laplacian-bands`]

• the per-band blend (the headline equation): $L^k_{out}=m^k\,L^k_A+(1-m^k)\,L^k_B$, where $m^k$ is the blend mask at level $k$. Then reconstruct $I_{out}=\sum_k \mathrm{expand}(L^k_{out})$.

• *pseudocode*: the whole method as a fence — build two Laplacian pyramids plus a Gaussian pyramid of the mask, blend each level, collapse.

• the crucial detail — **the mask uses a Gaussian pyramid, not a Laplacian one**: $m^k = G^k(\text{mask})$. Blurring the binary mask more and more as you go coarser means the **transition gets wider at coarse scales and stays sharp at fine scales** — exactly the frequency-dependent width feathering couldn't deliver. (Slide 19: "Transition width proportional to pyramid level; decompose mask with **Gaussian** (NOT Laplacian) pyramid.") [`fig-blend-mask-pyramid`]

• why this beats two-scale: every frequency band gets *its own* transition width — fine detail switches over a few pixels (no ghosting), the coarse exposure/vignetting ramp blends over a wide region (no visible step). It's the optimal frequency-tradeoff, made automatic by the pyramid.

• **L16 callout (prefilter before downsample):** building the Gaussian/Laplacian pyramid *correctly* requires **blurring before you decimate** at each level — otherwise high frequencies **alias** into the coarse levels and the blend gets bold false low-frequency artifacts (moiré) that no amount of clever masking fixes. The pyramid's quality is only as good as its prefilter. (→ see Big lesson **L16** · prefilter before you downsample; first appears in BASIC — Resampling.) This is *why* the pyramid construction, not just the blend formula, matters.

• relation to exposure fusion (forward-ref): replace the **spatial** mask $m$ with a **per-pixel quality weight** (well-exposedness, contrast, saturation) and the *identical* machinery fuses an exposure bracket into a tone-mapped image — **Exposure Fusion** (Mertens 2007), used on phones. Same pyramid blend, different mask. (→ [[#HDR merging]], [[#Application to cell phones: HDR+ and burst imaging]].)

▸10.7.5 Poisson / gradient-domain blending — paste gradients, solve for values⧉

• the move (Pérez, Gangnet & Blake 2003): stop manipulating *pixel values* across the seam and manipulate the **gradient field** instead. Inside the region you're pasting, demand that the result's gradient match the **source's gradient**; on the boundary, demand it match the **target's values**. Then *solve for the pixels* whose gradients best satisfy that. This is the **same solve as seamless cloning** in [[Poisson image editing]] — here the "source region" is one panorama image and the "background" is the other. Cross-ref the worked single-image version. [`fig-blend-poisson-vs-pyramid`]

• the objective (the headline equation): $\displaystyle \min_f \iint_\Omega \lVert\nabla f - \mathbf v\rVert^2$ subject to $f|_{\partial\Omega}=I_{\text{target}}|_{\partial\Omega}$, where $\mathbf v$ is the **guidance field** (the pasted source gradient). Its Euler–Lagrange condition is the **Poisson equation** $\nabla^2 f = \operatorname{div}\mathbf v$ — a large **sparse linear system**, solved matrix-free with conjugate gradient (never forming the matrix; just repeated Laplacian convolutions — [[Efficient solvers]], and the deblurring solver of [[Single-image computational photography#Blind deblurring]]).

• why it kills the seam (the **L9** payoff): the visual system reads **gradients**, and we've made the gradients across the boundary **continuous by construction**. Any **constant brightness/color offset** between the two frames is simply **spread out and absorbed** as an invisible low-frequency shift (in 1-D: a linear ramp interpolating the boundary mismatch). The seam *vanishes* because there's no gradient discontinuity left to see — even when exposures differ substantially.

• **Poisson vs. pyramid — the DC-handling contrast** (cross-ref the single-image treatment in [[Poisson image editing]] → *Poisson vs. pyramid blending*): both hide the seam, but they handle the **DC (overall level) difference** differently. **Pyramid blending** blends the coarsest (DC-ish) band over a wide transition — it *averages* the level mismatch, which can leave a faint low-frequency halo and a compromise brightness. **Poisson** doesn't average the level at all; it **pins the boundary to the target and lets the interior float**, re-deriving absolute values from gradients — so it can shift the pasted region's whole brightness to match, but it also *inherits the target's lighting* (good for compositing, sometimes wrong for panoramas where you want to keep a source's own exposure). [`fig-blend-poisson-vs-pyramid`]

• practical caveats to surface (slides 150–152): Poisson assumes the two backgrounds are **photometrically similar**; pasting from dark into bright loses **contrast**, because Poisson reproduces *linear* differences while contrast is *multiplicative* — the fix is to do the Poisson solve in a **log / perceptual color space** (or use covariant derivatives, à la the Photoshop healing brush). State this as the honest limit. (L1/L2: contrast is multiplicative → work in log.)

• gradient-mixing bonus: you can take, per pixel, the **stronger** of source vs target gradient (max) instead of always the source — useful for blending over textured backgrounds (slide 137).

▸10.7.6 Seam optimization — route the seam instead of fading it⧉

• the reframing (Agarwala 2004; Kwatra 2003): all the methods above blend over a **fixed** seam (wherever the overlap is). Seam optimization asks the prior question — **where should the seam even be?** Choose the boundary to thread through pixels where the **two images already agree**, so there's barely anything to blend. [`fig-blend-seam-graphcut`]

• the formulation as **discrete optimization on a graph**: label each output pixel with the **source it comes from**, $\ell_p$. Minimize $E=\sum_p D_p(\ell_p)+\sum_{(p,q)}V_{pq}(\ell_p,\ell_q)$ — a **data term** $D_p$ (must lie inside the chosen source / respect coverage) plus a **smoothness term** $V_{pq}$ that **penalizes a label change between neighbours by how much the two sources disagree there**. Minimizing $E$ is a **min-cut / max-flow** (graph cut) — globally optimal for two labels. Cross-ref [[Seam optimization]] for the machinery.

• **L12 callout (edits as discrete optimization on a graph):** the optimizer (graph cut) is off-the-shelf; **the energy is the entire design.** Make $V_{pq}$ = "color disagreement between the sources at this edge" and the min-cut automatically prefers to cut **where the images match** — e.g. along a uniform sky, *not* through a person's face. This is the **affinity principle (L4)** turned into a *cut* ("don't cut here") rather than an average. (→ see Big lesson **L12** · image edits as discrete optimization on a graph; first appears in [[Seam optimization]].)

• how it composes with the rest of the ladder: seam optimization decides **where** to put the boundary; you then still run a **Poisson or pyramid blend across that optimal seam** to mop up any residual exposure mismatch. Graph-cut + Poisson is the canonical photomontage compose (Agarwala 2004): **cut, then reconstruct.**

• the natural bridge to the next chapter: because the seam can route *around* a region, graph cut is also how you handle **moving objects** — pick the seam (and thus one source) so a walking person is taken *entirely* from one frame, never half-and-half. That's **deghosting**, picked up in *Bells and whistles*.

▸10.8 Bells and whistles⧉

`fig-projections-three` · the same wide scene unrolled three ways — plane (lines straight, FOV limited), cylinder (verticals straight, full horizontal sweep), sphere (full surround, everything curves) — with a "use when" each 🟨

⬜ figure not yet created

`fig-projection-line-bending` (a straight architectural edge: planar keeps it straight but can't exceed ~120° fig-projection-line-bending

`fig-bundle-drift` · a chained-homography 360° panorama (visible gap from accumulated drift) vs a bundle-adjusted one (error distributed around the loop, closure seamless) 🟨

⬜ figure not yet created

`fig-gain-vignette-solve` (per-image gain + a vignetting falloff recovered jointly so overlaps agree — before/after) fig-gain-vignette-solve

`fig-deghost-seam` · a blended overlap where a walking person is doubled into a ghost vs a deghosted result that routes the seam to take the person from one frame only 🟨

`fig-parallax-breaks-homography` · two shots from a translated camera where aligning the background doubles the foreground and aligning the foreground tears the background — one homography cannot handle depth-dependent parallax 🟨

The pairwise pipeline (match → RANSAC → homography → blend) makes a *demo*. A *tool* has to survive 360° loops, lens imperfections, people who wander through the shot, and a photographer who took a step sideways. Each subsection below is one such assumption breaking, and the repair. The recurring fix is the same: **treat the whole panorama as one global estimation problem instead of a chain of independent local ones.** *(To consult Ce Liu & Miki Rubinstein for the production deghosting / motion-handling and continuous-capture details — queue marker.)*

equations

**bundle adjustment** $\displaystyle \min_{\{\theta_j\}} \sum_{j,k}\sum_{i} \rho\big(\lVert \hat{\mathbf x}^j_i(\theta_j,\theta_k) - \mathbf x^k_i \rVert\big)$ — jointly over all camera params $\theta_j$ (rotation $R_j$, focal $f_j$, distortion, gain), summed over all feature correspondences $i$ across all overlapping image pairs $(j,k)$, $\rho$ a robust loss

**gain compensation** $\min_{\{g_j\}} \sum_{(j,k)}\sum_{p\in \text{overlap}} (g_j I_j(p) - g_k I_k(p))^2 + \lambda\sum_j (g_j-1)^2$

**radial distortion** $r_d = r(1+\kappa_1 r^2 + \kappa_2 r^4)$ folded into $\theta_j$

▸10.8.1 Other projections⧉

• the question: a panorama can span more than a flat image plane can hold, so onto **what surface** do you unroll it? The mapping you pick determines what looks natural and what distorts.

• **planar** (project onto a flat plane, like one big photo): keeps **straight lines straight** — great for architecture and narrow fields of view. But it **stretches the corners** badly and **cannot exceed ~120°**: as the field of view grows, edges race off to infinity (a 180° planar pano is impossible). [`fig-projections-three`]

• **cylindrical** (wrap onto a cylinder, unroll): natural for a **wide horizontal sweep** (the classic "turn in place" pano). **Vertical lines stay straight**; the horizon becomes a straight band; horizontal straight lines that aren't at eye level **bend**. Handles a full 360° horizontally. Needs the focal length to set the cylinder radius.

• **spherical / equirectangular** (wrap onto a sphere): the choice for a **full 360°×180°** panorama (VR, street view). Everything maps, but **almost every straight line curves** and the poles stretch.

• the unavoidable tradeoff to state plainly (**"straight lines bend"**): you **cannot** simultaneously (a) cover a very wide field of view and (b) keep all straight lines straight on a flat output — it's the cartographer's dilemma (you can't flatten a sphere without distortion). Planar buys straight lines at the cost of FOV; cylindrical/spherical buy FOV at the cost of bent lines. **Pick by content**: lots of straight architecture and a modest sweep → planar; a wide landscape turn → cylindrical; immersive 360° → spherical. [`fig-projection-line-bending`]

• practical note: the projection is chosen *after* alignment — it's a re-parameterization of the output canvas, independent of the homographies that registered the frames.

▸10.8.2 Bundle adjustment⧉

• the failure it fixes — **drift from chaining**: the naive pipeline composes homographies pairwise ($H_{13}=H_{23}H_{12}$, …). Each pairwise estimate has small error; **composing them accumulates** the error, so by the time you go around a 360° loop the two ends **don't meet** — a visible gap or step. Sequential = error **piles up**. [`fig-bundle-drift`]

• the fix — **one global solve**: instead of fixing cameras one pair at a time, **jointly optimize the parameters of *all* cameras at once** to best explain *all* the feature correspondences simultaneously. Error is then **distributed evenly** around the loop (and the loop **closes**) rather than dumped at the seam. [`fig-bundle-drift`]

• the objective: $\displaystyle \min_{\{\theta_j\}} \sum_{(j,k)}\sum_{i} \rho\big(\lVert \hat{\mathbf x}^j_i(\theta_j,\theta_k) - \mathbf x^k_i \rVert\big)$ — minimize the total **reprojection error** of every matched feature across every overlapping pair, over all camera parameters $\theta_j$ (rotation $R_j$, focal length $f_j$, and below, distortion and gain). A big **nonlinear least-squares**, solved by **Levenberg–Marquardt**, **initialized from the RANSAC pairwise homographies** and grown one image at a time. $\rho$ is a **robust** loss so surviving outliers don't dominate.

• **folding in vignetting and radial distortion** (the reason it's all *one* solve): the same global optimization can carry extra per-image unknowns —

• **gain / exposure compensation**: a per-image gain $g_j$ (and optionally a per-channel WB) chosen so overlaps **agree photometrically**: $\min_{\{g_j\}} \sum_{(j,k)}\sum_{p}(g_j I_j(p) - g_k I_k(p))^2 + \lambda\sum_j (g_j-1)^2$ (the regularizer keeps gains near 1 so the whole pano doesn't drift dark/bright). This is what makes blending's job easy — it removes most of the exposure mismatch *before* the seam blend.

• **vignetting**: a radial brightness-falloff model per lens, estimated so the *same* scene point matches whether it lands center or corner.

• **radial distortion**: $r_d = r(1+\kappa_1 r^2 + \kappa_2 r^4)$ — barrel/pincushion warp folded in as parameters $\kappa$, so straight scene lines align across frames instead of fighting the homography. (Model lives in [[Optics, lenses, and aberration correction]].)

• the lesson to land: **geometry, photometry, and lens imperfections are all one joint estimation.** Solving them together (rather than as a chain of patches) is what gives a globally consistent, seam-friendly mosaic. [`fig-gain-vignette-solve`]

▸10.8.3 Movement and parallax handling⧉

*(to consult Ce Liu & Miki Rubinstein — production deghosting / motion handling; keep this queue marker until confirmed.)*

#### Deghosting: one source per moving region

• **movement / ghost handling (deghosting)** — the problem: bundle adjustment and blending assume a **static scene**, but people walk, leaves move, cars pass. A moving object sits in *different places* in overlapping frames; **blending averages it → it appears twice (a ghost) or as a translucent smear.** Distinguish this from misregistration ghosting: here the pixels are *correctly* aligned for the background but the *object itself* changed. [`fig-deghost-seam`]

• the fix — **pick one source per moving region**: don't average across a moving object; take it **entirely from a single frame**. Detect disagreement regions (large per-pixel difference *after* photometric/geometric alignment = "something moved here"), then choose the seam so each such region comes from one frame only.

• why it's a **seam / graph-cut** problem (ties to [[#Blending]] → *Seam optimization*): "one source per moving region" is exactly a **labeling** decision — make the graph-cut smoothness term cheap to cut *around* a moving blob and the optimizer routes the seam to keep the object whole, from one frame. Then Poisson/pyramid-blend the chosen seam. So deghosting is **not a new algorithm** — it's the seam energy of *Blending*, with motion-disagreement added to the cost. (**L12** again: the energy is the design.)

• bonus modes: you can deliberately keep **all** copies (a "stroboscopic" multiplicity pano) or **none** (remove transient people) by flipping which label each moving region gets — same machinery.

#### Parallax: when translation breaks the homography

• **parallax handling** — the deeper limit: a homography aligns two frames **exactly only if** the camera underwent **pure rotation** (about its optical center) **or** the scene is **planar / very distant**. If the camera **translated** between shots, **near and far objects shift by different amounts** (motion parallax) — and **no single homography can align both**: line up the background and the foreground doubles; line up the foreground and the background tears. [`fig-parallax-breaks-homography`]

• why this is fundamental, not a tuning issue: a homography is a **2-D** warp (8 DOF); translation through a **3-D** scene induces a **depth-dependent** displacement that a global 2-D map simply can't represent. The mismatch grows with translation and with scene depth range (worst for close foreground).

• the practical escapes:

• **rotate about the no-parallax point** (the entrance pupil / "nodal point") so there's *no* translation → the homography is exact again. This is why tripod panos use a nodal-slide.

• **accept it locally**: route the **seam** through low-parallax regions (graph cut again) and blend, hiding the unavoidable mismatch where the scene is far/uniform.

• **go 3-D** when translation is essential: estimate depth / use multi-view geometry rather than a homography — the hand-off to [[Matching pixels across space and time]] and structure-from-motion.

• the connection forward: phone **sweep panoramas** are continuous translation+rotation, so parallax (and rolling shutter) are front-and-center there — next chapter.

▸10.9 Continuous panoramas (e.g. on cell phones)⧉

⬜ figure not yet created

`fig-sweep-incremental` (a phone panning fig-sweep-incremental

`fig-sweep-central-strip` · a single frame with only its central strip retained, annotated low-vignetting / low-distortion / low-parallax, and narrow overlaps feathered between strips 🟨

⬜ figure not yet created

strips overlap just enough to feather)

`fig-portrait-undistortion` · distortion-free wide-angle portraits (Shih, Lai & Liang 2019): stretched edge face → content-aware warp mesh (locally stereographic over faces, perspective elsewhere) → corrected face with straight lines preserved (Perspective distortion and its correction) ✅

A cell-phone "sweep" panorama is the same goal — one wide image from many views — but the *capture model* is inverted: instead of shooting N discrete frames and stitching offline, the phone **captures a continuous video while you pan** and builds the mosaic **as the frames arrive**. That streaming constraint, plus the fact that you're hand-holding and continuously translating, reshapes every design choice. *(To consult Ce Liu & Miki Rubinstein for the mobile-pipeline specifics — queue marker.)*

equations

**per-strip registration** incremental homography $H_{t} = H_{t-1}\,\Delta H_t$ ($\Delta H_t$ = frame-to-frame, often near pure rotation, predicted from the **gyroscope**)

**rolling-shutter** per-row pose $\mathbf p(y)=\mathbf p_0 + y\cdot \dot{\mathbf p}$ (camera moves *during* a frame's readout → row $y$ captured at time $t_0+y\,t_{\text{row}}$)

**strip feather** $I_{out}=\alpha I_{\text{new strip}} + (1-\alpha)I_{\text{mosaic}}$ across the narrow overlap

▸10.9.1 Incremental registration of a video stream⧉

• the streaming model: frames arrive at video rate; the phone **registers each new frame against the growing mosaic** (frame-to-mosaic), not against all previous frames. Each registration is a **small incremental homography** $H_t = H_{t-1}\,\Delta H_t$, where $\Delta H_t$ is the *frame-to-frame* motion — small, mostly rotation, and **cheaply predicted from the phone's gyroscope** so the visual tracker only has to refine it. This is what makes it real-time: never an all-pairs solve, just one fast update per frame. [`fig-sweep-incremental`]

• the user guide ribbon: phones show an **on-screen level/arrow** because the registration assumes a smooth, roughly-constant sweep — keeping the motion regular keeps $\Delta H_t$ small and well-conditioned, and keeps strips overlapping.

• drift over a long sweep: errors still **accumulate** along a long pan (same drift problem as *Bells and whistles*); phones mitigate with the gyro prior, occasional re-registration, and by keeping per-frame motion tiny. A full bundle adjustment is usually too heavy for real time, so it's an **incremental / windowed** version.

▸10.9.2 Mosaicking a moving strip (and why a *central* strip)⧉

• the core trick: from each frame, **paste only a thin vertical strip taken from the center of the frame** onto the moving mosaic canvas; consecutive strips **overlap slightly** and are **feathered** together ($I_{out}=\alpha I_{\text{new}}+(1-\alpha)I_{\text{mosaic}}$). The mosaic grows sideways as you pan. [`fig-sweep-central-strip`]

• why the **center** strip is the clever part — it minimizes *three* problems at once, exactly where each is smallest:

• **vignetting** is smallest at the frame center (corners are darkest) → strips are photometrically uniform.

• **radial distortion** is smallest near the optical axis → strips warp least, lines stay straight.

• **parallax** is smallest at the center / along the rotation axis, and because each strip is **narrow**, the depth-dependent displacement across the strip is tiny → translation-induced doubling is suppressed *without* needing 3-D. A narrow strip is "locally a homography" even when the global motion isn't.

• so the strip design is the continuous-pano analog of *seam routing*: instead of choosing a seam through a wide overlap, you only ever keep the slice where the registration assumptions hold best. [`fig-sweep-central-strip`]

▸10.9.3 Rolling shutter and exposure drift⧉

• **rolling shutter** — the sensor caveat that dominates fast pans: a CMOS sensor reads out **row by row**, so the bottom of a frame is captured a few milliseconds *after* the top. While you pan, the camera **moves during that readout**, so different rows see different camera poses: $\mathbf p(y)=\mathbf p_0 + y\,\dot{\mathbf p}$, row $y$ exposed at $t_0+y\,t_{\text{row}}$. The result is **skew / wobble** — vertical poles lean, a fast pan shears the whole strip. [`fig-rolling-shutter-pan`]

• corrections: (a) **pan slowly** (the user-facing fix — the guide ribbon enforces it); (b) **model the per-row motion** (from the gyro) and **rectify** each row to a common pose before pasting — the same rolling-shutter rectification used in video stabilization ([[Video]]). Because we already use a narrow central strip, the per-strip rolling-shutter warp is small and easy to undo.

• **exposure / white-balance drift** — the photometric caveat: over a long sweep the auto-exposure and auto-WB re-meter continuously (you pan from shadow into sun), so brightness and color **ramp** along the mosaic. The fix is **continuous gain compensation** (the streaming version of *Bells and whistles*' gain comp): estimate a per-frame gain so each new strip matches the mosaic at the overlap, optionally lock AE/AWB at the start of the sweep. Because strips are blended over a narrow overlap, residual drift shows as a gentle, low-frequency ramp the eye tolerates (L9 again — no hard gradient at any one strip boundary).

• the takeaway: a sweep pano is the discrete pipeline's assumptions (static scene, global shutter, fixed exposure, pure rotation) **all relaxed at once and handled incrementally** — narrow central strips + gyro-predicted registration + continuous gain comp + rolling-shutter rectification are how a phone makes a hand-waved video look like one clean wide photograph.

▸10.10 Focal stacks and depth of field extension⧉

⬜ figure not yet created

`fig-focalstack-problem` (a macro/close-up scene at one aperture: only a thin slab is sharp, foreground and background mush — "even f/16 isn't enough") fig-focalstack-problem

`fig-focalstack-capture` · a focus stack — several frames of one scene focused at different distances, the sharp slab moving through depth — plus an inset on refocusing the lens vs translating on a rail 🟨

`fig-focalstack-sharpness` · one frame walked through high-pass → square → smooth to produce its per-pixel sharpness map $s_k$ (pipeline layout with arrows + $\LaTeX$ labels) ✅

`fig-focalstack-why-square` · a 1-D edge whose signed band-pass averages to ~0, and how squaring it makes the local average register the presence of detail 🟨

`fig-focalstack-argmax-composite` · the per-pixel argmax selection map (which-frame-won / coarse depth) beside the all-in-focus image, plus a zoom comparing $\gamma=1$ vs $\gamma=4$ weighting ✅

⬜ figure not yet created

exponent-1 vs exponent-4 weighting)

`fig-focalstack-photomontage` · a weighted-sum merge (ghosting and blur at depth edges) vs a graph-cut-plus-Poisson merge (clean seams) of the same focal stack (Agarwala et al.) 🟨

`fig-focalstack-magnification` · a focus-at-infinity frame overlaid with a focus-close frame to show that refocusing changes scale, so frames must be registered before merging 🟨

💡 **Big lesson (L14 · Capture the full set, decide later):** instead of committing the focus distance at the instant of exposure, **record a whole stack of focal planes and choose per pixel afterward**. The slogan: *focus stacking is to the focus distance what HDR bracketing is to exposure and what the light-field camera is to the aperture* — defer an irreversible capture decision out of the moment of exposure, paying with **more data** and a **harder reconstruction (a per-pixel merge)** for the payoff of an all-in-focus image no single shot could make. (Registered in [[Big Lessons]] as **L14**; its full first-appearance box sits at the light-field / plenoptic introduction — here it recurs on the **focus axis**, the third sibling after HDR's **exposure axis** and panorama's **viewpoint axis**.)

equations

per-pixel **sharpness** $s_k(p)=\sum_{q\in w(p)} \lVert \nabla I_k(q)\rVert^2$ (local **high-frequency energy** — squared band-pass / Laplacian power, then window-summed/Gaussian-smoothed)

**all-in-focus selection** $k^\*(p)=\arg\max_k s_k(p)$, output $I(p)=I_{k^\*(p)}(p)$

**soft weighted composite** $I(p)=\dfrac{\sum_k w_k(p)\,I_k(p)}{\sum_k w_k(p)}$ with $w_k(p)=s_k(p)^\gamma$ (exponent $\gamma\!\approx\!4$ → near-argmax but smoother)

**Photomontage energy** $E(\ell)=\sum_p D_p(\ell_p)+\sum_{p\sim q}V_{pq}(\ell_p,\ell_q)$ minimized by graph cut (data $D$ = $-$sharpness of label $\ell_p$

smoothness $V$ = seam-cost penalizing visible transitions), then **Poisson** reconstruction across the chosen labels.

▸10.10.1 Why limited depth of field is the problem⧉

• **the optics constraint** (recap from [[Optics]] / Image formation): only objects within the **depth of field** image inside an acceptable **circle of confusion**; everything nearer or farther blurs. DoF *grows* with a smaller aperture (bigger $N$) — but a small aperture lets in less light (→ noise) and eventually hits the **diffraction limit**, so sharpness *falls* again. There's a floor: **even at the smallest usable aperture, you often can't get enough DoF.** [`fig-focalstack-problem`]

• **where it bites**: **macro** (bug/flower close-ups), **microscopy**, **product/jewelry**, **landscape** wanting both a near rock and a far peak crisp. These are exactly the cases the optics chapter flags as unwinnable in one shot.

• **the escape — change axis**: don't push the aperture harder; **take many frames at different focus distances** and **merge**, keeping each pixel from the frame where it's sharpest. The thin in-focus slab **sweeps through depth** across the stack, so *every* surface is sharp in *some* frame. [`fig-focalstack-capture`]

• **the analogy that places it in this part** (L14): this is the **same move** as the chapter's other multi-shot methods — HDR captures every *exposure* and picks per-pixel; panorama captures every *viewpoint* and stitches; the focal stack captures every *focal plane* and selects. And the **dual** escape is the [[Advanced computational photography#Light field and multi-aperture imaging|light-field camera]]: record all the aperture **rays** and **refocus after capture** — in fact a captured light field can be refocused into a focal stack and then merged exactly as here (Ng 2005).

• **important non-convolution caveat** (from optics): defocus is **not** a single convolution of the image — the blur radius depends on **scene depth**, not output pixel location, and **occlusion** matters at depth edges. That is *why* a naive merge struggles at depth boundaries (next sections) and why the good merge needs coherent seams.

▸10.10.2 A simple algorithm — sharpness = local high-frequency energy, then argmax⧉

*(Preserves the user's WIP "A simple algorithm" — same intent, fleshed.)*

• **the core idea**: a region is **in focus** when it carries **local high-frequency detail** — sharp edges, texture. So measure, per pixel and per frame, how much high-frequency content sits in a small window, and at each output pixel **pick the frame with the most**. [`fig-focalstack-sharpness`]

• **the sharpness measure**: take the **band-pass / high-frequency** component of frame $k$ (a Laplacian, a high-pass, or a band of a pyramid), and accumulate its **energy** in a window $w(p)$:

$$s_k(p)=\sum_{q\in w(p)} \lVert \nabla I_k(q)\rVert^2 \quad(\text{local high-frequency / Laplacian power}).$$

• **why you must *square* before averaging** (the subtle step from the slides): the raw high-frequency signal **swings positive and negative** — around an edge it goes $+\to 0\to-$. Its local average is **~zero by construction** (low frequencies were removed), so you cannot smooth it directly. **Square it** (or take $|\cdot|$) → all-positive → now the local (Gaussian) average correctly reads "there *is* detail here." [`fig-focalstack-why-square`]

• **the selection**: the all-in-focus image is the per-pixel argmax,

$$k^\*(p)=\arg\max_k s_k(p),\qquad I(p)=I_{k^\*(p)}(p).$$

(Aside: $k^\*(p)$ is itself a coarse **depth map** — "which focal plane was sharpest here" — i.e. **shape-from-focus**, Nayar–Nakagawa 1994. Cross-ref depth estimation.)

• **softening it — weighted composite**: a hard per-pixel argmax is noisy and shows seams. Two fixes from the slides, both about **giving more weight to the best frame while staying spatially coherent**:

• **weighted sum** (then **normalize!**): $I(p)=\dfrac{\sum_k w_k(p)\,I_k(p)}{\sum_k w_k(p)}$ with $w_k(p)=s_k(p)$ — smooth, but a blurry frame still leaks in.

• **sharpen the weights with an exponent**: $w_k(p)=s_k(p)^{\gamma}$, $\gamma\approx4$ — pushes toward argmax (the winner dominates) while keeping a smooth blend; "subtle but sharper" than $\gamma=1$. [`fig-focalstack-argmax-composite`]

• **and smooth the selection map**: beyond weighting, **spatially regularize the choice** (smooth $s_k$, or smooth/MRF the label field) so neighbours tend to come from the same frame — avoiding salt-and-pepper switching between frames. This "spatial coherence in the choice of pixels" is exactly what the advanced method makes principled. [`fig-focalstack-argmax-composite`, weight-visualization inset]

▸10.10.3 The more advanced method: Interactive Digital Photomontage (graph-cut + Poisson)⧉

*(Preserves the user's WIP "More advanced method — Aseem's Photomontage".)*

• **what the simple method gets wrong**: weighted blending **mixes** frames near depth edges → residual blur/ghosting and visible transitions where sharpness ties (the non-convolution / occlusion problem). We want a **clean per-pixel choice** that is **spatially coherent** *and* an **invisible blend** at the boundaries where the choice changes.

• **the recipe — Interactive Digital Photomontage** (Agarwala et al. 2004; "all-in-focus" is one of its showcase applications): two stages, both already developed in this part —

1. **Graph-cut label selection** (L12, [[Seam optimization]]): assign each pixel a **source-frame label** $\ell_p$ by minimizing a discrete energy

$$E(\ell)=\sum_p D_p(\ell_p)+\sum_{p\sim q}V_{pq}(\ell_p,\ell_q),$$

where the **data term** $D_p$ prefers the **sharpest** frame at $p$ (e.g. $D_p(\ell)=-s_\ell(p)$) and the **smoothness term** $V_{pq}$ is a **seam cost** that penalizes a label change where the two frames *disagree* (so cuts hide along matching, low-contrast boundaries). Solved globally by **min-cut / max-flow**. This is precisely the seam-optimization energy — the **energy is the design**, the optimizer is off-the-shelf.

2. **Gradient-domain (Poisson) blend** (L9, [[Poisson image editing]]): even the best label map leaves small exposure/color/level mismatches across a seam. Copy each region's **gradients**, then **reconstruct** the image whose gradients best match (solve $\nabla^2 f=\operatorname{div}\mathbf v$). The absolute-level jumps vanish; only the (correct) gradients remain → a **seamless** all-in-focus composite. [`fig-focalstack-photomontage`]

• **why this is the same two lessons twice**: *cut* where it's cheap (L12 graph cut) then *reconstruct* in the gradient domain (L9 Poisson) — the identical **cut-then-reconstruct compose** used for panorama blending and object compositing in this part. Focal-stack all-in-focus is just that pipeline with **"sharpest frame"** as the data term.

• **scale** (from the slides): real macro montages stack **tens** of frames (e.g. 55) — the graph cut handles many labels; commercial tools (**Helicon Focus**, **Zerene Stacker**, Photoshop **Auto-Blend Layers**) implement essentially this.

▸10.10.4 Capturing the stack — hardware, and the magnification trap⧉

• **two ways to sweep focus** [`fig-focalstack-capture`, inset]:

• **refocus the lens** (change the focus distance, camera fixed) — the natural way; what an autofocus motor or a tunable lens does.

• **translate the camera** along the optical axis (a **focus rail**, e.g. CogniSys **StackShot**) — common in macro, where moving the whole camera a few mm is easier and more repeatable than refocusing.

• **the warning — refocusing changes magnification** (the slides' "swept under the rug"): focusing **closer** moves the lens away from the sensor, which **changes the field of view / magnification** (magnification $\sim f/D$). So frames in a focus sweep are **not pixel-aligned** — a focus-at-infinity frame and a focus-close frame show the scene at slightly **different scales**. [`fig-focalstack-magnification`]

• **consequence — register before you merge**: you must **scale/register** the frames (estimate a per-frame scale, often near-pure zoom about the optical axis) **before** computing sharpness and selecting — otherwise the all-in-focus merge stitches mis-registered detail and you get double edges. Translating on a rail has a related but different geometry (perspective/parallax shift), and may even **change viewpoint** slightly (the slides note "here we change the viewpoint as well"), needing a fuller alignment. This is the same **align-then-merge** discipline as HDR and panorama in this part ([[#Image alignment]]).

• **single-exposure alternatives** (pointers): **scanning / sweep-during-exposure** macro rigs integrate the stack optically in one exposure; and the **dual escape** — a [[Advanced computational photography#Light field and multi-aperture imaging|light-field camera]] captures all focal planes in **one** shot and refocuses computationally (Ng 2005), which can then be merged exactly as above. Time-/light-efficient capture of focal stacks is analyzed by Hasinoff & Kutulakos 2009.

▸10.11 Hyperspectral imaging, color wheels⧉

`fig-hyperspectral-cube` · the cube as an image stack indexed by wavelength, contrasting a pixel's full spectrum against the three numbers RGB retains 🟨

`fig-hyperspectral-rgb-vs-spectrum` · two materials identical in RGB but clearly distinct in their measured reflectance spectra (with the three RGB sensitivities overlaid) — the metamer case for many bands 🟨

`fig-hyperspectral-capture` · the three capture architectures (spectral scan, pushbroom line-scan, snapshot mosaic) and the spatial-vs-spectral-vs-time tradeoff triangle 🟨

⬜ figure not yet created

pushbroom slit+prism scanning lines in space

⬜ figure not yet created

`fig-hyperspectral-uses` (material map / NDVI vegetation map / art-conservation pigment or underdrawing reveal). fig-hyperspectral-uses

💡 **Big lesson (L14 · Capture the full set, decide later — wavelength axis):** RGB throws away the spectrum at the instant of capture (three broad bands, irreversibly mixed). Hyperspectral imaging **records the whole spectrum per pixel and decides spectral questions afterward** — the same defer-the-decision move as HDR (exposure axis), focal stacks (focus axis) and light fields (aperture axis), now on the **wavelength axis**. Cost: data, light, capture time. Payoff: you can ask *what is this made of* long after the shutter. (→ see Big lesson **L14**, [[Big Lessons]]; first-appearance box at the light-field introduction.)

equations

**measurement as projection** — a camera channel $c$ records $I_c(x,y)=\int S(x,y,\lambda)\,R_c(\lambda)\,d\lambda$

RGB is this with **3** broad $R_c$

hyperspectral is the same with **many narrow** band responses $R_b(\lambda)$ (near-delta), recovering the spectrum $S(x,y,\lambda)$ sampled at the bands — i.e. the cube $I(x,y,\lambda_b)$

**NDVI** $=\dfrac{I_{\mathrm{NIR}}-I_{\mathrm{red}}}{I_{\mathrm{NIR}}+I_{\mathrm{red}}}$ (a band ratio — a per-pixel multiplicative/normalized index, L1).

▸10.11.1 Why three numbers aren't enough — RGB as a 3-sample projection⧉

• **color is a projection** (recap, [[Human (and animal) vision and color]]): a camera channel integrates the incoming spectrum $S(\lambda)$ against a **spectral sensitivity** $R_c(\lambda)$: $I_c=\int S(\lambda)R_c(\lambda)\,d\lambda$. RGB uses **three broad, overlapping** $R_c$ → three numbers. That projection is **lossy and many-to-one**: distinct spectra can give **identical** RGB (**metamers**). [`fig-hyperspectral-rgb-vs-spectrum`]

• **what's lost matters**: two pigments, two plant species, fresh vs spoiled produce, a forgery's ink vs the original — can look identical to RGB yet have **different spectra**. The information needed to tell them apart is *exactly* what RGB discarded.

• **the generalization**: keep the same equation but use **many narrow** band responses $R_b(\lambda)$ (approaching deltas at wavelengths $\lambda_b$). Now each pixel yields a **sampled spectrum** $S(x,y,\lambda_b)$ — the **hyperspectral data cube** $I(x,y,\lambda)$, an image stack whose third axis is **wavelength**. **RGB is just this cube with 3 broad bands** (and a CFA/Bayer mosaic is a 3-band snapshot mosaic — cross-ref [[Color technology]]). [`fig-hyperspectral-cube`]

• **multispectral vs hyperspectral**: *multispectral* = a handful of bands (e.g. Landsat's ~7, or RGB+NIR); *hyperspectral* = many **contiguous, narrow** bands (tens–hundreds) → a quasi-continuous spectrum per pixel. Same idea, different sampling density along $\lambda$.

▸10.11.2 Building the spectral stack — filter wheels, tunable filters, pushbroom, snapshot⧉

• **the unifying view (L14)**: each method **fills the cube** $I(x,y,\lambda)$ differently, trading **time vs spatial vs spectral** resolution — the spectral analogue of "how do you capture the focal/exposure stack." [`fig-hyperspectral-capture`]

• **spectral scanning — filter wheel / tunable filter**: put a **bandpass filter in front of the sensor** and capture **one wavelength band at a time**, then change the band:

• a mechanical **color/filter wheel** rotates discrete narrow filters into place;

• an electronically **tunable filter** — **LCTF** (liquid-crystal) or **AOTF** (acousto-optic) — sweeps the passband with **no moving parts**, fast and programmable.

• tradeoff: **full spatial resolution**, many bands, but **sequential in time** → the **scene (and camera) must hold still** across the sweep. Great for microscopy, lab, art on an easel.

• **spatial scanning — pushbroom (line-scan)**: image a **single line** of the scene through a **slit + dispersing element** (prism/grating), so the sensor records that line spread across **all wavelengths at once** (one sensor axis = position along the line, the other = wavelength). **Sweep the line** across the scene (platform motion) to fill $(x,y)$:

• natural when the scene already moves past the sensor — **satellites/aircraft** (the ground sweeps beneath), **conveyor belts** (food/mineral sorting).

• tradeoff: **all bands captured per line simultaneously**, but needs **relative motion** and good line-to-line registration.

• **snapshot spectral imaging**: capture the whole cube in **one exposure** — e.g. a **spectral filter mosaic** (a Bayer-like array but with many band filters) or computational/coded designs (Hagen & Kudenov 2013). Tradeoff: **one shot (handles motion/video)** but **trades spatial resolution** for spectral bands (the mosaic spends pixels on wavelength).

• **coded snapshot / CASSI — the inverse-problem version** (Wagadarikar et al. 2008): the most compelling snapshot designs are **computational** — a **coded aperture + a dispersive element** (prism/grating) optically **multiplex** the full $(x,y,\lambda)$ cube onto a *single* 2-D sensor frame. The coded mask shears and scrambles the bands so each sensor pixel sums a *coded mixture* of wavelengths; the cube is then **recovered by solving a compressive-sensing inverse problem** — find the $(x,y,\lambda)$ cube whose coded projection matches the measured frame, regularized by a **sparsity / learned prior** (the spectrum is sparse in some basis, or a learned spectral prior fills in the rest). This is **snapshot compressive spectral imaging (CASSI)**: you *trade a reconstruction for a single-shot capture* — one frame instead of the slow filter-wheel/pushbroom **scan**, at the cost of solving an under-determined inverse problem. It is the same coded-imaging / compressive-sensing move as coded-aperture and single-pixel imaging — cross-ref **compressive sensing / coded imaging** in [[Advanced computational photography]].

• **the recurring tradeoff triangle**: you can have any two of {full **spatial** resolution, full **spectral** resolution, **single-shot/fast**} cheaply, but not all three — spectral scan sacrifices time, pushbroom needs motion, snapshot sacrifices spatial resolution.

▸10.11.3 What it's for — material ID, agriculture, art and beyond⧉

• **material identification / unmixing**: a pixel's **spectrum is a fingerprint**; match it to a library to **classify materials** (minerals, plastics, vegetation species), or **unmix** a pixel that is a blend of several materials. The core remote-sensing use (imaging spectrometers like AVIRIS). [`fig-hyperspectral-uses`]

• **agriculture / vegetation**: plants reflect strongly in **near-infrared**; indices like **NDVI** $=\frac{I_{\mathrm{NIR}}-I_{\mathrm{red}}}{I_{\mathrm{NIR}}+I_{\mathrm{red}}}$ reveal vigour, water stress, disease before it's visible — a per-pixel **band ratio** (a normalized, multiplicative index — L1, the multiplicative-world lesson). Precision agriculture, crop monitoring.

• **art conservation & forensics**: reveal **underdrawings, pentimenti, retouching**, identify **pigments**, and detect **forgeries** by spectral differences invisible to the eye / to RGB; document and authenticate.

• **other**: **food inspection** (ripeness, contamination), **medical/biological** imaging (tissue oxygenation, fluorescence microscopy with LCTF/AOTF), **recycling/sorting**.

• **closing tie**: in every case the discipline is L14 — **record the full spectral set, then ask the question** (which material? healthy or stressed? original or forgery?) long after capture. The wavelength axis joins exposure (HDR), focus (focal stacks) and aperture (light fields) as another axis you can defer.

▸10.12 Intrinsic images with time lapse⧉

⬜ figure not yet created

`fig-intrinsic-timelapse-stack` (a fixed-camera day sequence: shadows sweep across a constant scene fig-intrinsic-timelapse-stack

`fig-intrinsic-weiss-median` · noisy per-frame log-gradient maps with moving shadow edges, their per-pixel median over time giving clean reflectance gradients, then a Poisson integration to a flat-lit reflectance plus residual shading (Weiss) 🟨

💡 **Big lesson (L14 · Capture the full set, decide later — time/illumination axis):** one image can't separate **reflectance** from **shading** (the split is ambiguous — L10). A **time-lapse stack under changing light** breaks the tie: record the **whole sequence** and decide per-pixel afterward — **what stays constant is reflectance, what varies is shading**. Same defer-the-decision move as HDR (exposure), focal stacks (focus), hyperspectral (wavelength) and light fields (aperture) — here the deferred axis is **time / illumination**. (→ see Big lesson **L14**, [[Big Lessons]]; and **L10**, the prior-not-optional ambiguity this resolves.)

equations

image model $I(x,y,t)=R(x,y)\cdot S(x,y,t)$ (reflectance constant in $t$, shading varies)

in **log** $\log I = \log R + \log S$ (product → sum, L1/L2)

spatial gradient $\nabla\log I(t)=\nabla\log R+\nabla\log S(t)$

**Weiss estimator** $\widehat{\nabla\log R}(x,y)=\operatorname{median}_t\,\nabla\log I(x,y,t)$ (shading gradients vary/cancel, reflectance gradient persists)

recover $\log R$ by **gradient-domain reconstruction** (Poisson, $\nabla^2\log R=\operatorname{div}\,\widehat{\nabla\log R}$)

then $\log S(t)=\log I(t)-\log R$.

▸10.12.1 The split, and why one image can't do it⧉

• **the model**: an image is approximately **reflectance × shading**, $I=R\cdot S$ — albedo (the surface's intrinsic color/lightness, **fixed**) times illumination/shading (how much light reaches each point, **scene/lighting-dependent**). Separating them is the classic **intrinsic-images** problem (Barrow & Tenenbaum 1978): recover the "paint" from the "light."

• **why it's ill-posed from one frame** (L10): infinitely many $(R,S)$ pairs multiply to the **same** $I$ — is a dark patch dark *paint* or *shadow*? Single-image methods need a **prior** (Retinex's assumption: **large** gradients are reflectance edges, **smooth/gentle** gradients are shading — threshold the gradient field, then integrate; the gradient-domain cousin in [[Poisson image editing]]). A prior is **not optional** here.

• **the time-lapse escape (L14)**: fix the camera and shoot a **day-long sequence**. The **scene's reflectance is constant**; only the **illumination/shading moves** (sun arc, shadows sweeping). So the ambiguity disappears in the stack: **constant-in-time ⇒ reflectance; varying-in-time ⇒ shading.** No hand prior needed — the *temporal* dimension supplies the disambiguation, exactly as extra focal planes or exposures supplied it for the other stacks. [`fig-intrinsic-timelapse-stack`]

▸10.12.2 Weiss 2001 — the median of log-gradients⧉

• **go to log** (L1/L2): $\log I = \log R + \log S$ turns the product into a **sum**, so gradients are additive: $\nabla\log I(t)=\nabla\log R + \nabla\log S(t)$. (Work in linear light first, then log — the encoding lesson.)

• **the key statistic — median over time**: at each pixel, the **reflectance** gradient $\nabla\log R$ is the **same in every frame**, while the **shading** gradient $\nabla\log S(t)$ **moves** (shadow edges sweep through, sometimes here, sometimes gone). A **moving-shadow edge appears in only a minority of frames** → it is an **outlier**. So take the **per-pixel median across time** of the log-gradient:

$$\widehat{\nabla\log R}(x,y)=\operatorname{median}_t\,\nabla\log I(x,y,t).$$

The median **rejects** the transient shading gradients and keeps the **persistent** reflectance gradient — a one-line robust estimator. [`fig-intrinsic-weiss-median`]

• **reconstruct (L9)**: integrate the estimated reflectance gradient field back to an image — a **gradient-domain / Poisson** solve, $\nabla^2\log R=\operatorname{div}\,\widehat{\nabla\log R}$ — yielding the **flat-lit reflectance** image. Then the per-frame **shading** falls out by subtraction: $\log S(t)=\log I(t)-\log R$.

• **why this is elegant**: it is **L1/L2** (log makes ×→+), **robust statistics** (median kills outlier shading edges), **L9** (integrate gradients) and **L14** (the temporal stack disambiguates) in four lines — and it needs **no learned prior**, in contrast to single-image intrinsic decomposition.

▸10.12.3 Where this chapter belongs — passive vs. active illumination⧉

• **Recommendation: keep this chapter HERE, in *Multiple exposure imaging*, as the brief time/illumination sibling — and plant a one-line forward-pointer to *Computational illumination*.** Rationale:

• The organizing principle of *this* part is **L14 — capture a stack along one axis, decide later**: HDR (exposure), panorama (viewpoint), focal stacks (focus), hyperspectral (wavelength). Time-lapse intrinsic images is the **time/illumination** axis — it belongs to the **same family** and reinforces the spine. Its machinery (log, gradients, **median**, Poisson reconstruction) reuses tools this part already teaches.

• It is fundamentally about **merging a passively-captured set of frames** — the camera is **fixed** and you exploit illumination that **happens** to vary. That is multi-exposure-imaging in spirit.

• **Computational illumination** is about **actively controlling** the light (flash/no-flash, multi-flash, photometric stereo, structured light). Weiss is the **passive** counterpart: same $I=R\cdot S$ goal, but you don't *set* the lighting, you *observe* it change. So note the kinship and **forward-ref** computational illumination (where the active versions live), but the **passive time-lapse** belongs here.

• Resolves the WIP's "??? Or in computational illumination?": **here**, brief, with an explicit cross-reference to computational illumination for the active-lighting decompositions.

▸Part 11 MANY IMAGES AND PHOTO COLLECTIONS⧉

reference ML

11.1 Database, Lightroom-style⧉

11.2 Retrieval⧉

▸11.3 Auto curation⧉

standard criteria (sharpness

ML

▸11.4 Life logging cameras⧉

• always-on **wearable cameras** that passively capture your day (Microsoft **SenseCam**, **Narrative Clip**, **Google Clips**, Memoto) → a firehose of thousands of images/day

• **SenseCam** (Microsoft Research, **Hodges et al. 2006**) is the seminal device: a small chest-worn camera with a fish-eye lens that fires **automatically from onboard sensors** (accelerometer, ambient-light, and passive-infrared body-heat triggers) — no shutter press — grabbing ~2,000–3,000 images a day. Tellingly, it was built and studied less as a *camera* than as a **memory prosthesis**: clinical trials found that reviewing SenseCam image streams helped patients with amnesia re-consolidate **autobiographical memory** far better than a written diary. That reframes the wearable lifelog as a tool for *recall* rather than photography — and it is exactly what spawns the big-data **curation, summarization, retrieval, and privacy** problems below.

• the big-data problem they *create*: **curation, summarization, retrieval, and privacy** at personal-archive scale (ties to auto-curation / retrieval above)

• selection becomes the product: **what to keep** — Google Clips ran on-device ML to grab candid moments (the blind / anticliché camera idea, automated, below)

• **privacy & ethics**: bystander consent, always-on recording, lifelog security (→ Human factors ethics; adjacent fields)

▸11.5 Lucky imaging (planetary / lunar astro)⧉

• bright targets (the Moon, planets) are blurred and wobbled by atmospheric **seeing**; instead of one exposure, shoot **thousands of fast frames** (usually video) and **keep only the sharpest few %** — the brief moments when the turbulence settles

• a **big-data selection** problem: score every frame by a **sharpness metric**, **rank**, keep the best, then **align + stack** the survivors (1/√N again) → ground-based resolution approaching the telescope's diffraction limit, cheaply (a poor-man's adaptive optics)

• ties to **auto curation** above (rank-and-select from a huge frame set) and to **Denoising by averaging** (stack the keepers, Multiple exposure); refs: Law et al. 2006

11.6 Inpainting Hays and Efros⧉

11.7 Photo tourism⧉

▸11.8 Photobios⧉

• **Photobios** (the *Exploring Photobios* project of **Ira Kemelmacher-Shlizerman** and **Steve Seitz**): face photo collections **over time** — align many portraits of one person to a common frame and order them → a smooth **time-lapse of a face aging** (Kemelmacher-Shlizerman et al. 2011). A big-collection instance of [[Morphing]] + alignment, where the **data** (not a model) supplies the in-betweens; kin to [[#Photo tourism]] (a collection becomes an experience).

11.10 Anticliche camera⧉

11.11 Pix 2 GPS⧉

11.12 Personalized priors⧉

11.13 Photomosaics, Salavon⧉

▸Part 12 VIDEO⧉

**Roadmap.** **[[Motion blur and temporal sampling]]** sets up the time axis (a frame is an integral; time is sampled; the Lagrangian-vs-Eulerian framing). The applications follow: **[[Video compression and motion compensation]]** (correspondence on a budget), **[[Video magnification]]** (the Eulerian reveal), **[[Video stabilization and rolling-shutter correction]]** (estimate → smooth → re-render), **[[Frame interpolation and slow-motion synthesis]]** (transport to the in-between), and a closing **[[Video editing]]** coda (timelines, summarization, reduce-over-time filters, transcript-based editing).

equations

motion blur $B(\mathbf x)=\tfrac1\tau\int_0^\tau I(\mathbf x-\mathbf v t)\,dt$

Eulerian magnification $I'(\mathbf x,t)=I(\mathbf x,t)+\alpha\,\mathcal B\{I(\mathbf x,t)\}$

▸12.1 Motion blur and temporal sampling⧉

`fig-motion-blur-integral` · a frame is an integral over the exposure — a bright point traversing the frame while the shutter is open records a streak of length $\lVert\mathbf v\rVert\tau$; motion blur is a 1-D box convolution along $\mathbf v$ 🟨

⬜ figure not yet created

`fig-blur-spatial-vs-temporal` (side-by-side: a spatial Gaussian PSF blurring a static edge vs a temporal box PSF blurring a *moving* edge — same convolution, different axis) fig-blur-spatial-vs-temporal

`fig-temporal-aliasing-wagonwheel` · the wagon-wheel effect — a spoked wheel sampled below temporal Nyquist appears to stall near it and spin backward past it; the temporal twin of moiré 🟨

⬜ figure not yet created

`fig-temporal-nyquist` (a sinusoidal motion signal sampled at frame rate $f_s$: $f_s>2f_{\text{motion}}$ reconstructs the true motion fig-temporal-nyquist

`fig-blur-as-temporal-prefilter` · one knob, two outcomes — the same fast motion captured long (blurred but not aliased; the window low-passed before sampling) vs short/high-angle (sharp but strobing); the exposure window is the temporal anti-alias filter, L16 in time 🟨

The two great themes of spatial imaging are **convolution** (a measurement integrates over a region — the PSF) and **sampling** (a discrete grid can only represent frequencies below Nyquist, else they alias). This section's single claim is that **both recur, identically, on the *time* axis** — and together they organise the entire part. A frame is an *integral over the exposure* → **motion blur** is convolution *in time*. A video is a *sequence of temporal samples* → motion faster than the frame rate **aliases** (the wagon wheel) *in time*. And the question "do I think of motion as *particles I follow* or *a time-series at each fixed pixel*?" is the **Lagrangian vs Eulerian** distinction that separates flow/tracking from video magnification.

💡 **Big lesson (recurrence of L5 / L16 — Nyquist, and *prefilter before you downsample*, on the time axis):** the spatial sampling laws apply unchanged in **time**. **L5:** a frame rate $f_s$ can only faithfully capture temporal frequencies below $f_s/2$; faster motion **folds down** to a false low frequency — the **wagon-wheel effect** is temporal moiré. **L16:** the cure for aliasing is to **prefilter before sampling** — and in time the prefilter is *built into the camera*: integrating over the exposure window (which produces motion blur) is exactly the temporal **anti-alias filter**. So motion blur and temporal aliasing are not two problems but **the same tradeoff** — a longer exposure removes strobing by *blurring*, a shorter one gives sharp frames that *strobe*. (→ see Big lessons **L5** & **L16**, first placed in [[Linearity, aliasing and deblurring]] / [[Basic image processing and ISP]]; the **wagon-wheel** is L16's named temporal instance.)

equations

**motion blur** $B(\mathbf{x})=\dfrac{1}{\tau}\displaystyle\int_0^\tau I(\mathbf{x}-\mathbf{v}\,t)\,dt$ (a directional integration along the motion vector $\mathbf v$ over exposure $\tau$) $= I * k_{\mathbf v}$ with **path kernel** $k_{\mathbf v}$ a 1-D box of length $\|\mathbf v\|\tau$ oriented along $\mathbf v$ (constant-velocity case)

**shutter angle** $\theta=360^\circ\cdot\tau/T$ (exposure $\tau$ as a fraction of frame interval $T=1/f_s$)

**temporal Nyquist** $f_s > 2\,f_{\text{motion}}$ (sample faster than twice the motion's highest temporal frequency, else alias)

**aliased apparent frequency** $f_{\text{app}}=|f_{\text{motion}} - n f_s|$ (the folded-down false frequency — wagon wheel)

▸12.1.1 A frame is an integral over time → motion blur⧉

• **the cause** (intuition first): a photo is **not** an instant — the shutter is open for an exposure time $\tau$, and the sensor **accumulates** all the light that lands during that window. If a bright feature *moves* across the frame while the shutter is open, its light is **smeared** along the path it traveled — a **streak**, not a point. Motion blur is the *temporal* version of "a pixel is a weighted average of what it saw," with the average taken over *time* instead of over a spatial neighbourhood. [`fig-motion-blur-integral`]

• **the equation** — blur is a time integral: for image-plane velocity $\mathbf v$ (constant over the exposure), the blurred frame is

$$B(\mathbf{x})=\frac{1}{\tau}\int_0^\tau I(\mathbf{x}-\mathbf{v}\,t)\,dt.$$

Read it literally: the value at pixel $\mathbf x$ is the **average** of the *sharp* scene $I$ over all the positions the scene occupied (offset $-\mathbf v t$) during the exposure.

• **it is a directional convolution** (the punchline — the temporal analog of the spatial PSF): that integral is exactly a **convolution** $B = I * k_{\mathbf v}$ where the **path kernel** $k_{\mathbf v}$ is a **1-D box** of length $\|\mathbf v\|\tau$, oriented along the motion direction $\mathbf v$, normalised to unit area. Constant velocity → a straight box; curved motion → a curved path kernel. **Spatial blur smears over a spatial neighbourhood; motion blur smears over the motion path** — same operator, the axis is just time-projected-onto-space. (Recall from BASIC that convolution *was* the sum-of-random-variables / averaging operation — here the "random variable" is *position-during-exposure*.) [`fig-blur-spatial-vs-temporal`]

• **shutter speed / shutter angle** — the photographer's control of $\tau$: motion-blur length is $\|\mathbf v\|\tau$, so it scales with **exposure time**. In cine, exposure is expressed as a **shutter angle** $\theta=360^\circ\cdot\tau/T$ — the fraction of the frame interval $T=1/f_s$ the (historically rotating-disc) shutter is open. The **$180^\circ$** convention ($\tau=T/2$) gives the natural "film look"; $360^\circ$ blurs maximally (the whole interval), small angles give crisp, **strobed** frames (the *Saving Private Ryan* look). [`fig-shutter-angle`]

• **the tradeoff to state plainly**: longer $\tau$ → more light (better SNR, **L15**) but **more blur**; shorter $\tau$ → sharper frames but less light *and* — the next subsection — **temporal aliasing**. You cannot have sharp, well-exposed, *and* alias-free for fast motion at a fixed frame rate; you pick two.

• **removing it** (forward-ref): if the path kernel $k_{\mathbf v}$ is **known**, deblurring is **deconvolution** by it — but a straight-line box kernel has *zeros* in its frequency response, so naive inversion is ill-posed (it is exactly the Wiener / blind-deblurring story of [[Blind deblurring]], now with a *motion* PSF). When motion **differs across the frame** (different objects, different speeds) the blur is **spatially varying** — no single kernel — connecting to non-uniform deblurring. And you can **engineer** the kernel to invert cleanly: **coded / flutter shutter** (Raskar 2006) flips the shutter open/closed during $\tau$ to give $k_{\mathbf v}$ a broadband, zero-free spectrum; **motion-invariant capture** (Levin 2008) sweeps the camera so *every* speed gets the *same* invertible PSF. (Same "make the forward operator nice" philosophy as coded aperture.)

▸12.1.2 Time is sampled → temporal aliasing, the wagon-wheel effect⧉

• **the cause**: a video is a **discrete sequence** of frames at rate $f_s$ — it **samples** the continuous motion. As with any sampling, motion components **faster than half the frame rate cannot be represented** and instead **alias** — fold down to a false, slower (even reversed) apparent motion. This is **L5/L16 on the time axis**, full stop. [`fig-temporal-nyquist`]

• **the temporal Nyquist condition**: to capture motion of temporal frequency $f_{\text{motion}}$ (e.g. a wheel's spokes passing a point), you need

$$f_s > 2\,f_{\text{motion}}.$$

Violate it and the apparent frequency folds to $f_{\text{app}}=|f_{\text{motion}}-n f_s|$ for the integer $n$ that brings it into $[0,f_s/2]$.

• **the wagon-wheel effect** — the canonical demo: a spoked wheel filmed at frame rate $f_s$. When spoke-passing frequency is **below** Nyquist the wheel spins **forward** correctly; as it approaches $f_s$ the wheel appears to **slow, stop, then spin backward** — because each frame catches the next spoke *just short of* where the previous one was, so the sampled snapshots imply a small *backward* step. It is the **temporal twin of spatial moiré**: undersampling maps a high frequency to a wrong low one. Helicopter rotors and car rims under video/streetlights show the same. [`fig-temporal-aliasing-wagonwheel`]

• **strobing / judder** — the everyday face of it: fast pans or fast subjects at low frame rate or short shutter look **stuttery / juddery** rather than smoothly moving — the motion is *temporally undersampled* and the eye sees discrete jumps instead of continuous flow. (Stroboscopic lighting freezing or reversing fans is the same effect with the *light* doing the sampling.)

• **the fixes** (the sampling toolkit, in time): **raise $f_s$** (higher frame rate → higher temporal Nyquist — the brute-force cure, what high-speed cameras buy); or **prefilter before sampling** — which on the time axis means **integrate longer** (a longer exposure = a wider temporal box = a low-pass that attenuates the offending high temporal frequencies *before* the frame samples them). That prefilter **is motion blur** — which is why the two subsections are one story:

▸12.1.3 Motion blur *is* the temporal prefilter (the two are one tradeoff)⧉

• 💡 **the unification** (the L16 payoff): the temporal anti-alias filter you would *want* before sampling a video is a **low-pass over time** — and the camera already has one, *for free*: **integration over the exposure window**. A long exposure **box-filters the motion in time** before the frame rate samples it, suppressing the high temporal frequencies that would otherwise alias. **Motion blur and temporal aliasing are the two outcomes of one knob** ($\tau$): integrate more → **blur** (aliasing prefiltered away); integrate less → **sharp frames that strobe** (aliasing let through). [`fig-blur-as-temporal-prefilter`]

• **why you can't win for free**: this is the exact **L16** spatial dilemma — prefiltering trades aliasing for blur — transplanted to time. The "right" exposure is a *frequency-domain* compromise: long enough to kill aliasing, short enough to keep wanted motion sharp. CGI renderers face the dual problem and *add synthetic motion blur* precisely as a temporal anti-alias filter (Potmesil & Chakravarty 1983) — they integrate over sub-frame time samples so fast motion doesn't strobe.

• **the escape hatch — capture the full set, decide later** (cross-ref **L14**): shoot at a **high frame rate with a short shutter** (sharp, alias-free *if* $f_s$ is high enough) and **synthesize** any desired motion blur or slow-motion afterward by integrating/interpolating frames — defer the exposure-vs-blur decision out of the moment of capture. ("High-speed photography is to shutter speed what light fields are to aperture", **L14**; forward-ref [[Frame interpolation and slow-motion synthesis]].)

▸12.1.4 Lagrangian vs Eulerian: the organizing distinction for the part⧉

• 💡 **the dichotomy** (the frame that ties the whole part together): there are **two ways to describe a moving scene**, borrowed from fluid dynamics —

• **Lagrangian** — *follow the particles*: attach yourself to a *physical point* and ask where **it** goes over time. This is **tracking** and **optical flow** — a *displacement field* of *which point moved where* ($\frac{D}{Dt}$, the material derivative). The output is **trajectories / correspondences**.

• **Eulerian** — *watch fixed locations*: stay at a **fixed pixel** and record the **time series of intensity** passing through it. No correspondence, no "which point" — just *what happened at this grid cell over time*. The output is a **per-pixel signal in time**. [`fig-lagrangian-vs-eulerian`]

• **why it is the right organising axis** (and *where* each method sits):

• **Lagrangian methods** — [[Optical flow]] (dense, *follow every pixel's* motion), [[Feature tracking]] (sparse, *follow distinctive points*). Hard because correspondence is **non-local** and ill-posed (aperture problem, occlusion); the slides repeatedly note flow's *"Lagrangian nature"* as its core difficulty. Matching-based flow is "more Lagrangian" than differential flow precisely because it *follows the patch* rather than differentiating at a fixed pixel.

• **Eulerian methods** — **video magnification** ([[#Video magnification]]): *do not* solve correspondence at all; instead take the **time series at each fixed pixel**, **band-pass filter it in time**, and **amplify** the tiny temporal variations (a pulse, a sub-pixel sway, a sound-induced vibration) — then add them back. By *never* following particles, Eulerian processing sidesteps the hard correspondence problem entirely — that is *why* it works on motions too small to track. **This section exists to set that up.** (Recall the flow-slides' aside "Recall Eulerian Video Magnification" appearing right beside the differential brightness-constancy derivation — the contrast is deliberate.)

• **the tradeoff between them**: Lagrangian gives you **explicit, large-range correspondence** (you know *where the ball went*) but must *solve* the hard ill-posed matching; Eulerian is **trivial to compute** (just per-pixel temporal filtering) and brilliant for **tiny, sub-pixel** changes, but it has **no notion of where anything went** and breaks down for **large** motion (a big displacement is not a small per-pixel perturbation). Small motion → Eulerian; large motion → Lagrangian. Phase-based magnification is a clever middle ground (Eulerian *in a steerable-pyramid phase representation*, which encodes local motion).

• **optional sidebar — the fluid-dynamics origin**: the terms come from describing fluid flow. The **Lagrangian** specification tracks individual fluid *parcels* along their trajectories (a dye droplet); the **Eulerian** specification fixes a measurement grid and records velocity/density *at each fixed point* as fluid passes (a weather station). Computer-graphics fluid solvers make the same choice — **particle / SPH** methods are Lagrangian, **grid-based** solvers are Eulerian, and many real simulators are *hybrid* (FLIP, particle-in-cell) — exactly mirroring why vision uses **tracked particles** for large motion and **per-pixel grids** for tiny motion. The same dichotomy, the same reasons. [optional sidebar box]

12.2 Time-lapse photography⧉

▸12.3 Video compression and motion compensation⧉

`fig-video-temporal-redundancy` · temporal redundancy made visible — three near-identical consecutive frames whose per-pixel difference $I_t-I_{t-1}$ is near-black except a thin fringe at moving edges and a disoccluded patch; the only new information to code 🟨

`fig-mc-prediction-loop` · the motion-compensated prediction loop — encoder (block-match → motion vectors, residual $r=I_t-\hat I_t$ → DCT → quantize → entropy-code, plus a reconstruction branch storing the lossy reference) and the mirror-image decoder; $\min D+\lambda R$ in the mode-decision box 🟨

⬜ figure not yet created

decoder mirrors it)

⬜ figure not yet created

`fig-block-matching-search` (one macroblock in frame $t$, its search window in the reference frame, the best-match offset = the motion vector) fig-block-matching-search

⬜ figure not yet created

`fig-residual-vs-naive` (left: code the whole block fig-residual-vs-naive

`fig-ipb-gop` · a GOP timeline — an I-frame opening the group, P-frames predicting forward, B-frames predicting from past and future references; arrows are prediction dependencies, so coding order $\neq$ display order 🟨

The still-image part of this book taught one compression idea over and over: **don't store what the viewer won't miss, and don't store what you can predict.** JPEG ([[File formats and compression]]) throws away high-frequency chroma and finely-quantizes the DCT because perception won't notice. Video adds a second, far larger redundancy that stills can't touch: **the next frame looks almost exactly like this one.** At 30–60 fps the camera and the world barely move between frames; most of the picture is *already on screen*. Coding each frame independently (so-called **Motion-JPEG** — literally JPEG-per-frame) ignores this entirely and is wildly wasteful. Real codecs **predict each frame from its neighbours and code only the prediction error.** That is the whole subject.

💡 **Big lesson (correspondence on a budget):** **motion compensation is optical flow you can afford.** A codec needs, for every block, *where did this come from in a frame we already have?* — exactly the correspondence question of [[Optical flow]]. But it does not need a physically correct, dense, sub-pixel flow field; it needs a **coarse, block-constant** field that is **cheap to estimate** and, crucially, **cheap to transmit** — and it chooses that field to **minimise total bits**, not endpoint error. So decades before learned optical flow, video codecs were already estimating dense-ish correspondence at massive scale — just optimised for rate, not accuracy. The recurring move: *reuse what the decoder already has, and code only the difference.* (Recurs as the framing for [[Video magnification]], which keeps the difference instead of discarding it; and is the rate-aware cousin of the correspondence story in [[Optical flow]].)

equations

motion-compensated **residual** $r(x,y)=I_t(x,y)-\hat I_t(x,y)$ where $\hat I_t(x,y)=I_{\text{ref}}\big(x-u,\,y-v\big)$ is the reference frame shifted by the block's motion vector $(u,v)$

**block-matching cost** $(u,v)=\arg\min_{(u,v)}\sum_{(x,y)\in B}\big|I_t(x,y)-I_{\text{ref}}(x-u,y-v)\big|$ (SAD over a block $B$, sometimes SSD)

**rate–distortion** objective $\min\ D+\lambda R$ — choose motion vectors/modes to minimise distortion $D$ *plus* $\lambda$ times the bits $R$ (the codec optimises bits, not physical accuracy)

coded frame $\approx$ entropy-code$\big(\text{quantize}(\text{DCT}(r))\big)$ + motion vectors

▸12.3.1 Why video compresses far better than still × N — temporal redundancy⧉

• **the observation**: consecutive frames are nearly identical. Subtract one from the next and the difference image $I_t-I_{t-1}$ is **near-zero almost everywhere** — non-zero only at moving edges, occlusions, and lighting changes. That near-black difference image *is the only new information*. [`fig-video-temporal-redundancy`]

• **the naive baseline (Motion-JPEG)**: code every frame as an independent JPEG. Simple, edit-friendly, random-access — but it pays the *full* still-image cost $N$ times and exploits **none** of the temporal redundancy. It's the "still × N" strawman; real codecs beat it by 5–50×.

• **the lever**: instead of coding $I_t$, code the **residual** between $I_t$ and a **prediction** $\hat I_t$ built from already-decoded frames. If the prediction is good, the residual is **mostly zero** → after DCT + quantization almost all coefficients vanish → entropy coding crushes it. *Good prediction = small residual = few bits.* [`fig-residual-vs-naive`]

• **two redundancies, stacked**: video keeps JPEG's **spatial/perceptual** redundancy removal (DCT, chroma subsampling, perceptual quantization — all carry over unchanged) **and adds temporal** redundancy removal on top. The temporal one is the big win.

• **why simple frame-differencing isn't enough**: just coding $I_t-I_{t-1}$ works when nothing moves, but **any motion** turns a sharp edge into a bright residual band (the edge is in a different place). The fix is to *predict with motion* — line the block up with where it actually went — so the residual stays small even as things move. That is **motion compensation** (next).

▸12.3.2 Motion-compensated prediction — the core trick⧉

• **the idea**: a moving object's block in frame $t$ is, to first order, **the same pixels** as some block in a **reference frame**, just **displaced**. So for each block, find that displacement, copy the matched patch as the prediction, and code only the leftover. [`fig-mc-prediction-loop`]

• **macroblocks**: partition the frame into blocks (classically 16×16 luma "macroblocks"; modern codecs use variable sizes down to 4×4). Each block is predicted **independently** with its own motion.

• **block matching** (the correspondence step): for a block $B$ in frame $t$, search a window in the reference frame for the **best-matching** patch and record the offset $(u,v)$ — the **motion vector**:

$$(u,v)=\arg\min_{(u,v)}\sum_{(x,y)\in B}\big|\,I_t(x,y)-I_{\text{ref}}(x-u,\,y-v)\,\big|$$

(sum of absolute differences, SAD; SSD is the smooth cousin — same cost-volume idea as [[Optical flow]], just per-block instead of per-pixel). Search is sped up with hierarchical/coarse-to-fine search and predicted starting points; **sub-pixel** motion vectors (half/quarter-pel, with interpolation) sharpen the match. [`fig-block-matching-search`]

• *pseudocode*: block matching as a per-block search over the displacement window — minimize SAD, store the motion vector, then form the residual.

• **the prediction**: $\hat I_t(x,y)=I_{\text{ref}}(x-u,\,y-v)$ — the reference frame shifted by the block's motion vector.

• **the residual** — the thing we actually code:

$$r(x,y)=I_t(x,y)-\hat I_t(x,y)$$

small wherever the match was good. [`fig-residual-vs-naive`]

• **reuse the JPEG pipeline on the residual**: take $r$ block by block → **DCT** → **quantize** (a quantization table, with a quality/QP knob, exactly as in [[File formats and compression]]) → **entropy-code** the quantized coefficients. The codec transmits, per block: the **motion vector** $(u,v)$ + the **coded residual**. (Motion vectors themselves are predictively coded from neighbours — they're spatially correlated too.)

• **the decoder mirrors the encoder**: decode the residual, fetch the same reference patch via the transmitted motion vector, add them: $I_t=\hat I_t+r$. Critically the encoder predicts from the **reconstructed (lossy) reference**, not the original, so encoder and decoder stay in lock-step (no drift).

• **rate–distortion is the real objective**: the encoder doesn't pick the *most accurate* motion vector — it picks the one minimising $D+\lambda R$: reconstruction error **plus** the bits to send the vector and residual. A slightly-worse match that costs far fewer bits wins. This is *why* the field is "on a budget."

▸12.3.3 I, P, and B frames; GOP structure⧉

• **the prediction graph**: not every frame can be predicted from another — you need an entry point, and you'd like both backward and forward prediction. Three frame types: [`fig-ipb-gop`]

• **I-frame (Intra)**: coded **alone**, no temporal prediction — essentially a **JPEG** (spatial/DCT only). Big (most bits) but **self-contained**: a random-access point and a resync after errors.

• **P-frame (Predicted)**: predicted from a **previous** decoded frame (forward only). Sends motion vectors + residual; much smaller than an I-frame.

• **B-frame (Bidirectional)**: predicted from frames on **both** sides (a past **and** a future reference), often averaging the two predictions. Smallest of all — great for smooth motion and revealed/occluded regions — but requires **out-of-order** decoding (the future reference must be decoded first), which is why **coding order ≠ display order**.

• **GOP (Group of Pictures)**: the repeating pattern, e.g. `I B B P B B P …`, restarting at the next I-frame. The GOP length trades off:

• **short GOP** (frequent I-frames) → easy random access / seeking, robust to errors, **bigger** files;

• **long GOP** (rare I-frames) → best compression, but slow seeking and error propagation until the next I-frame.

• **practical reads**: streaming/editing want shorter GOPs (seek, splice); archival/broadcast push longer GOPs for size. **All-intra** (I-only) modes exist for editing-friendly, frame-accurate video at the cost of size — that's essentially Motion-JPEG with a better intra coder.

▸12.3.4 Why this is "optical flow on a budget"⧉

• **the equivalence to state plainly**: the field of motion vectors over all blocks **is a correspondence field** — a piecewise-constant (one vector per block) approximation to the **optical flow** between frame and reference ([[Optical flow]]). Motion compensation *is* dense correspondence, computed at scale, in every video you've ever watched. [`fig-mc-as-flow`]

• **but it's the budget version**, and the differences are the point:

• **block-constant, not per-pixel**: one vector covers a whole block (real flow varies within it) — the residual mops up the within-block error.

• **rate-optimal, not accuracy-optimal**: chosen to minimise **bits** ($D+\lambda R$), so it will happily pick a *physically wrong* vector if it codes cheaper. A codec's motion field is **not** a measurement of scene motion — don't read it as ground-truth flow.

• **block-matching, not variational**: pure SAD/SSD search (the cost-volume primitive of [[Optical flow]]) with no global smoothness solve — smoothness comes implicitly from predictively coding the vectors and from rate cost.

• **handles failure by falling back**: where prediction is hopeless (scene cut, fast new content, occlusion), the encoder just codes the block (or whole frame) **intra** — a graceful degradation to the still-image path.

• **the conceptual bridge**: the *same* aperture problem, search window, and sub-pixel interpolation appear in both. Learned optical flow (RAFT etc., [[Optical flow]]) and learned video codecs both build on this cost-volume correspondence skeleton; the codec just got there first, optimised for a different loss.

▸12.3.5 Modern codecs in one breath⧉

• **the lineage, same skeleton**: MPEG-1/2 → **H.264/AVC** (Wiegand 2003) → **H.265/HEVC** (Sullivan 2012) → **AV1** (AOMedia 2018) → H.266/VVC. Every one keeps **motion-compensated block prediction + transform-coded residual + entropy coding**. What advances is *how cleverly each piece is done*, not the architecture.

• **what they add on the same bones**: **variable / recursive block sizes** (big blocks for flat motion, tiny for detail); **multiple reference frames** and **better sub-pixel** interpolation; **intra prediction** *within* a frame (predict a block from already-decoded neighbours — directional modes — so even I-frames beat JPEG); richer **entropy coders** (CABAC); **in-loop deblocking/SAO** filters to hide block edges; **AV1** adds warped/affine motion (a step *toward* true flow), overlapped block MC, and is **royalty-free**.

• **the takeaway**: 25 years of codec progress is **better prediction + smarter residual coding within the unchanged motion-compensation framework** — and the framework is, at heart, *correspondence-on-a-budget* applied to the temporal axis. Learned neural codecs now replace pieces (the motion model, the residual transform) with networks, but the **predict-then-code-the-residual** spine is the same.

▸12.4 Video magnification⧉

⬜ figure not yet created

`fig-eulerian-vs-lagrangian-mag` (two routes to amplify small motion: **Lagrangian** — estimate flow $\delta(t)$, scale it, re-render fig-eulerian-vs-lagrangian-mag

fig-pixel-timeseries-bandpass · one fixed pixel, signal to output — its value over time, its temporal spectrum (a small in-band peak among DC and noise), a band-pass keeping that band, and the band scaled by $\alpha$ and added back; identical at every pixel 🟨

⬜ figure not yet created

`fig-pulse-color-mag` (a face video fig-pulse-color-mag

`fig-firstorder-motion-mag` · intensity amplification is motion amplification to first order — a 1-D edge displaced by $\delta(t)$ giving temporal change $\delta(t)I_x$, amplified to a $(1+\alpha)\delta(t)$ shift; right panel where a large $\delta$ or sharp edge breaks it (haloing, clipping) 🟨

⬜ figure not yet created

`fig-phase-vs-linear-mag` (same clip: linear intensity magnification (haloing, noise blow-up) vs phase-based magnification in a steerable pyramid (clean, larger $\alpha$)) fig-phase-vs-linear-mag

⬜ figure not yet created

`fig-visual-microphone` (sound vibrates a plant/chip-bag

You cannot see a heartbeat in a face, or a skyscraper sway, or a wall flex under a passing truck — these changes are real but **below the threshold of perception**, buried in pixel values that look constant. Yet the information is *there* in an ordinary video: the green channel of a cheek really does brighten and dim with each pulse; an edge really does move a fraction of a pixel as a structure vibrates. **Video magnification** is the family of methods that **pulls out these tiny temporal variations and scales them up** until they're plainly visible — turning a normal camera into an instrument for the invisible.

💡 **Big lesson (Eulerian vs Lagrangian — the cheaper view often wins):** there are two ways to amplify small motion, the same dichotomy from [[Motion blur and temporal sampling]]. The **Lagrangian** way *follows points*: estimate the optical-flow displacement $\delta(t)$ of each feature, multiply it, and re-render the frame with the exaggerated motion. It's intuitive but **fragile** — it inherits every failure of [[Optical flow]] (occlusion, aperture problem, sub-pixel error), and tracking *tiny* motions accurately is exactly where flow is weakest. The **Eulerian** way *stays put*: at each **fixed pixel** treat the value over time as a signal, **band-pass** it, amplify, add back — **no motion estimation at all**. For small variations this is dramatically simpler and more robust, and it amplifies **color** changes (a pulse) just as naturally as motion. The recurring moral: *picking the right frame of reference (fixed grid vs moving point) can turn a hard estimation problem into a trivial filtering one.*

equations

**Eulerian linear amplification** $I'(x,t)=I(x,t)+\alpha\,B\{I(x,t)\}$, where $B\{\cdot\}$ is a temporal **band-pass** of the per-pixel signal and $\alpha$ the magnification factor

**first-order motion link** — a feature translating as $I(x,t)=f(x+\delta(t))$ band-passes to $B\approx\delta(t)\,I_x$, and adding $\alpha B$ gives $I'(x,t)\approx f\big(x+(1+\alpha)\delta(t)\big)$ (amplifying the band-passed intensity ≈ amplifying the displacement by $(1+\alpha)$)

**phase-based** — in a complex steerable sub-band $S(x,t)=A(x,t)\,e^{i\phi(x,t)}$, temporally band-pass the **local phase** $\phi$ and amplify it ($\phi\to\phi+\alpha\,B\{\phi\}$), which shifts the sub-band signal without amplifying amplitude noise

▸12.4.1 The two views: Lagrangian (track-then-amplify) vs Eulerian (per-pixel signal)⧉

• **the goal**: take subtle temporal change and make it visible — without (ideally) needing to solve correspondence. [`fig-eulerian-vs-lagrangian-mag`]

• **Lagrangian magnification** (Liu et al. 2005, the precursor): **estimate the motion** (optical flow) of features, **scale the displacement** $\delta(t)\to(1+\alpha)\delta(t)$, and **warp** the frame to render the exaggerated motion. Physically explicit, but: needs **accurate flow** for *sub-pixel* motions (the hardest regime), needs **layer/occlusion** handling, and tends to **break and tear** where tracking fails. Powerful in principle, brittle in practice.

• **Eulerian magnification** (Wu et al. 2012, the breakthrough): **fix the spatial grid** and look at the **time series** $t\mapsto I(x,y,t)$ at each pixel. **Band-pass** it temporally to the band you care about (e.g. 0.8–4 Hz for heart rate), **amplify** that band, **add it back**. **No flow, no warping, no tracking.** It's a pure **temporal signal-processing** operation done independently per pixel (per spatial sub-band). [`fig-pixel-timeseries-bandpass`]

• **the connection to the rest of the book**: this is the **Eulerian frame of reference** introduced in [[Motion blur and temporal sampling]] (watch a fixed location and ask *how does it change here?*) versus the **Lagrangian** (follow a particle), exactly the fluid-simulation analogy. Video magnification is the clearest payoff of that distinction.

▸12.4.2 Eulerian video magnification — band-pass and amplify⧉

• **the pipeline**, per pixel / per spatial band:

1. **decompose spatially** into a pyramid ([[Linear pyramids and wavelets]]) — process each band/scale separately (this also lets you spatially smooth, which controls noise).

2. **temporal band-pass** each pixel's time series with $B\{\cdot\}$ — keep only the frequencies of the phenomenon (pulse band, vibration band); discard the DC "static" part and out-of-band content.

3. **amplify and recombine**:

$$I'(x,t)=I(x,t)+\alpha\,B\{I(x,t)\}$$

add an $\alpha$-scaled copy of the band-passed signal back onto the original. Collapse the pyramid → the magnified video. [`fig-pixel-timeseries-bandpass`]

#### Why amplifying a color signal reveals a pulse

• **the cleanest case — no motion at all**: blood flow modulates skin reflectance slightly; the pixel's green-channel value oscillates at the heart rate. Band-pass to 0.8–4 Hz, amplify → the face visibly **blushes** in time with the (recoverable) pulse waveform. **No model of motion needed** — it's literally a color change over time. [`fig-pulse-color-mag`] (This is the basis of contactless **photoplethysmography** / remote vital signs.)

#### Why amplifying intensity also reveals motion — the first-order argument

• **the argument** (one Taylor expansion): take a 1-D feature translating by a tiny displacement, $I(x,t)=f\big(x+\delta(t)\big)$. First-order Taylor: $I(x,t)\approx f(x)+\delta(t)\,f_x(x)$, so the **temporal variation at a fixed pixel is $\delta(t)\,I_x$** (displacement × spatial gradient — the same brightness-constancy term as [[Optical flow]], here *exploited* rather than solved). Band-passing isolates $B\approx\delta(t)\,I_x$; adding $\alpha B$ gives

$$I'(x,t)\approx f(x)+(1+\alpha)\,\delta(t)\,I_x\approx f\big(x+(1+\alpha)\delta(t)\big),$$

i.e. the result **looks like the feature moved $(1+\alpha)$ times as far.** Amplifying the per-pixel intensity signal *is* amplifying the motion — **without ever estimating it** (the formal content of the **L17 exception**). [`fig-firstorder-motion-mag`]

#### The catches — noise and large motion

• **first-order, so small/smooth only**: the approximation is **first-order** → only valid for **small** $\delta$ and **smooth** edges; large motions or sharp edges produce **artifacts/haloing** (the linearization breaks). [`fig-firstorder-motion-mag`, right]

• **it amplifies noise too**: the band-pass passes sensor noise in-band; multiplying by $\alpha$ blows it up. Remedy: **spatially blur / use coarser pyramid levels** for large $\alpha$ (trade spatial detail for SNR), and bound $\alpha$ by the spatial wavelength (amplify low spatial frequencies more).

• **single global band**: fine for one dominant frequency (a pulse), clumsy for broadband motion. The limit the next method addresses.

▸12.4.3 The phase-based successor — amplify phase, not intensity⧉

• **the diagnosis**: linear *intensity* magnification conflates "the edge moved" with "the brightness changed," so it amplifies amplitude **noise** and **clips/halos** at edges. Motion is better expressed not as intensity change but as a **shift in local phase**.

• **the fix — work in a complex steerable pyramid** ([[Linear pyramids and wavelets]]): decompose each frame into oriented, band-pass sub-bands whose coefficients are **complex**, $S(x,t)=A(x,t)\,e^{i\phi(x,t)}$, with **amplitude** $A$ and **local phase** $\phi$. By the Fourier shift theorem, **local translation of a sub-band shows up as a change in its phase $\phi$**, while the amplitude $A$ (the "content") stays put.

• **the operation**: **temporally band-pass the phase** $\phi(x,t)$ and amplify *it*, $\phi\to\phi+\alpha\,B\{\phi\}$, leaving $A$ alone. This **shifts** each sub-band — i.e. *moves* the content — by the amplified amount, **without scaling amplitude noise**. [`fig-phase-vs-linear-mag`]

• **why it's better**: (1) **far fewer artifacts** — phase manipulation is a clean sub-pixel shift, not an intensity add, so much less haloing; (2) **supports larger $\alpha$** (bigger, cleaner magnification); (3) **noise is translated, not amplified** (it moves with the band instead of blowing up). The cost is the steerable-pyramid machinery and more compute.

• **the unifying view**: linear Eulerian ≈ amplify the **intensity** band-pass; phase-based ≈ amplify the **phase** band-pass in an oriented complex pyramid. Both are **Eulerian** (per-location, no flow) — phase-based just uses a representation in which "small motion" lives on a cleaner axis.

▸12.4.4 Uses, and the bookend to compression⧉

• **vital signs (contactless)**: recover **heart rate / pulse waveform** (and even breathing) from a webcam video of a face or hand — remote photoplethysmography, telehealth, driver monitoring, neonatal care. [`fig-pulse-color-mag`]

• **structural & material vibration**: visualize and **measure** how bridges, buildings, blades, and machines vibrate; reveal **resonant modes** and material properties from ordinary video — a cheap, full-field, non-contact alternative to accelerometers.

• **the visual microphone — sound from silent video** (Davis et al. 2014): sound waves vibrate everyday objects (a chip bag, a plant leaf, a glass of water) by **micrometres**; magnifying and reading those per-sub-band micro-motions **recovers the audio** that caused them — *speech and music from a silent, high-speed video*. The most striking demonstration that the signal was there all along. [`fig-visual-microphone`]

• **other cues**: subtle **facial / lip** motion for speech and affect; revealing **breathing**, fluid flow, and slow drifts; an analysis tool for any "is something moving that I can't see?" question.

• **the bookend to compression**: [[Video compression and motion compensation]] spends all its effort **predicting away and discarding** the inter-frame difference (it's redundant, throw it out). Video magnification does the **exact opposite** — it treats that same tiny temporal residual as **the signal**, and **amplifies** it. Same temporal-difference quantity; opposite intent. A clean way to remember both: *compression hides the small temporal change; magnification reveals it.*

▸12.5 Video stabilization and rolling-shutter correction⧉

`fig-stab-pipeline` · the stabilization pipeline — estimate (fit inter-frame $F_t$ from KLT/RANSAC, chain into the real path $C_t$) → smooth ($C_t\to P_t$) → re-render (warp by $W_t=P_t C_t^{-1}$, crop to a valid window); the path is the correspondence, the warp the transport 🟨

`fig-stab-trajectory-smoothing` · smoothing the camera-path signal — a jittery raw trajectory $C_t$, a Gaussian low-pass that removes tremor but lags/overshoots at pans, and an $L_1$-optimal path snapping to static/constant-velocity/eased segments with sharp transitions 🟨

`fig-stab-crop-window` · the stabilization↔crop tradeoff on one frame — the captured rectangle warped by $W_t$ to a tilted quad with empty margins, the crop window the largest rectangle valid for every frame (upscaled back); more shake removed forces a smaller crop 🟨

⬜ figure not yet created

`fig-l1-cinematic-paths` (the three allowed motion primitives — static hold, constant pan, smooth ease-in/out — that L1 stitches together) fig-l1-cinematic-paths

⬜ figure not yet created

`fig-rolling-shutter-perrow` (the row-time diagram: each scanline exposed at $t_0 + r\,\Delta t$, each reading a different camera pose, then rectified to one virtual instant) fig-rolling-shutter-perrow

The previous chapters made motion an *enemy to estimate* (optical flow) or a *cue to track* (KLT). Here motion is the thing we want to **edit**: handheld footage carries an involuntary high-frequency camera path on top of the intended one. Every method in this chapter is one pipeline — **measure the real path, design a better path, warp the frames onto it** — and the only thing you give up is the **border pixels** that the better path no longer sees.

💡 **Big lesson (recurrence of L16 · prefilter before you downsample / temporal aliasing):** stabilization is *temporal* signal processing on the camera-path signal — and the same Nyquist intuition applies. We **low-pass the trajectory** to remove jitter, and rolling-shutter "jello" is precisely a **temporal aliasing** artifact (rows sampled at different times under fast motion, the video cousin of the wagon-wheel effect). (→ see Big lesson **L16**, first placed in BASIC → Resampling; here it recurs as *smooth the path in time, and beware time-varying sampling within a frame*.)

equations

cumulative path $C_t = F_t F_{t-1}\cdots F_1$ from inter-frame transforms $F_t$ (each a homography or affine fit to feature matches)

update/correction warp $W_t = P_t\, C_t^{-1}$ sending real pose $C_t$ to smoothed pose $P_t$

low-pass smoothing $P_t = \sum_k g_k\, C_{t-k}$ (Gaussian weights $g_k$)

**L1 objective** $\min_{P}\ \lVert D^1 P\rVert_1 + \lVert D^2 P\rVert_1 + \lVert D^3 P\rVert_1$ (penalize 1st/2nd/3rd path derivatives in $L_1$ → mostly-zero derivatives = static / constant-velocity / constant-acceleration segments) subject to the crop-window inclusion constraints

rolling-shutter row time $t(r) = t_{\text{frame}} + r\cdot t_{\text{row}}$ and per-row pose $R(t(r))$, rectify pixel $(r,c)$ by reprojecting through $R(t_{\text{ref}})\,R(t(r))^{-1}$

▸12.5.1 What stabilization is: a camera-path signal to be smoothed⧉

• **the framing**: the recorded video = an intended camera motion (a slow pan, a hold) **plus** an unwanted high-frequency tremor. The tremor is a *signal on the camera path*. Stabilization removes it without removing the intended motion — exactly a **low-pass filter on the trajectory**, with the subtleties that (a) the "samples" are 2-D image transforms, not scalars, and (b) we must not crop away the whole frame. [`fig-stab-pipeline`]

• **the three stages** (the spine of the chapter): **(1) estimate** the per-frame camera motion and accumulate it into a trajectory; **(2) smooth** that trajectory into the steady path a tripod/dolly/Steadicam would have followed; **(3) re-render** each frame by warping it from its real pose to the smoothed pose, then **crop/inpaint** the exposed borders.

• **why it works without 3-D**: for a *purely rotating* camera (or a distant/planar scene) the inter-frame motion is a **homography** — the same "depth doesn't matter for a rotation or a plane" fact behind panorama reprojection (cross-ref panorama *Warping and assembling*). Most handheld shake is dominated by rotation, so a 2-D motion model already does the heavy lifting; true translation through a 3-D scene (parallax) is the hard case that needs content-preserving warps (Liu 2009) or 3-D reconstruction.

▸12.5.2 Stage 1 — estimating the camera trajectory⧉

• **inter-frame motion**: between consecutive frames, fit a global 2-D transform $F_t$ (translation → affine → **homography**, in increasing fidelity) to a set of **feature correspondences**. Get the correspondences from **KLT tracks** or matched corners (cross-ref [[Feature tracking]]), and fit **robustly with RANSAC** so that moving foreground objects don't hijack the camera-motion estimate (cross-ref panorama *Automatic stitching* — same RANSAC homography fit). [`fig-stab-pipeline`, left]

• **cumulative trajectory**: chain the inter-frame transforms into an absolute path, $C_t = F_t F_{t-1}\cdots F_1$. The camera path is now a *sequence of transforms* (or, parameterized, a multi-dimensional time signal: translation, rotation, scale, …). [`fig-stab-trajectory-smoothing`]

• **drift — the catch to name**: each $F_t$ has small estimation error, and chaining **accumulates** it, so $C_t$ slowly drifts (the same error-accumulation that **bundle adjustment** fixes in panoramas/SfM — cross-ref). For finite clips this is usually tolerable; long sequences want global optimization or loop closure.

• **sensor alternative**: instead of estimating rotation from pixels, **read it from the phone's gyroscope** — Karpenko et al. 2011 integrate the gyro to get the camera's rotation directly, sidestepping feature tracking (cheap, robust to scene content, and exactly the per-instant rotation rolling-shutter correction also needs). Forward-pointer to Stage-4 rolling shutter, where the gyro really pays off.

▸12.5.3 Stage 2 — smoothing the path (low-pass vs L1-optimal cinematic paths)⧉

• **naive low-pass**: convolve the trajectory with a temporal Gaussian, $P_t = \sum_k g_k\,C_{t-k}$ — simple, removes jitter. **Problems**: it **lags** and **overshoots** at genuine motion onsets (start/stop of a real pan), it **blurs intended motion**, and a fixed kernel can't tell "deliberate whip-pan" from "tremor." Also needs future frames (acausal) or it lags. [`fig-stab-trajectory-smoothing`]

• **L1-optimal cinematic paths (Grundmann et al. 2011)** — the key idea: don't just *smooth*; **synthesize the path a professional camera operator would shoot**, which is made of three primitives — **static holds**, **constant-velocity pans/dollies**, and **smooth ease-in/ease-out** transitions. These correspond to the path's 1st/2nd/3rd derivatives being **zero** for long stretches. So minimize the $L_1$ norm of those derivatives:

$$\min_{P}\ \ w_1\lVert D^1 P\rVert_1 + w_2\lVert D^2 P\rVert_1 + w_3\lVert D^3 P\rVert_1$$

$L_1$ (not $L_2$) is the whole trick: it is **sparsity-inducing**, so the optimal path has derivatives that are **exactly zero almost everywhere**, broken by a few **deliberate transitions** — i.e. crisp static-then-constant-then-eased segments rather than a perpetually wobbling smooth curve. (Same sparsity intuition as the sparse-gradient image priors in [[Blind deblurring]], applied to the *time* axis.) [`fig-l1-cinematic-paths`]

• **the crop constraint, made explicit**: the optimization is **constrained** so that the smoothed frame always stays **inside** the real frame — i.e. the crop window (a rectangle inset by a chosen margin) must, after warping by $P_t C_t^{-1}$, contain only real pixels. Grundmann solve it as a **linear program** (L1 objective + linear inclusion constraints). This bakes the stabilization↔crop tradeoff directly into the path design.

• **content-preserving warps (Liu et al. 2009)** — for true 3-D parallax: a single homography can't re-render a translating camera correctly (near and far move differently). Liu's method allows a **spatially-varying warp** guided by a sparse 3-D reconstruction, with a regularizer that keeps the warp **as-rigid-as-possible** so straight lines and shapes are preserved even though it isn't a global homography (ties to the as-rigid-as-possible warping note in [[Warping and resampling]]). This is the bridge from 2-D-path stabilization to genuinely 3-D-aware stabilization.

▸12.5.4 Stage 3 — re-rendering and the stabilization↔crop tradeoff⧉

• **the warp**: for each frame, apply the **correction transform** $W_t = P_t\,C_t^{-1}$ — "undo where the camera actually was, redo where the smooth path wants it," then **resample** (cross-ref [[Warping and resampling]] for the resampling/interpolation). [`fig-stab-pipeline`, right]

• **why borders go missing**: warping to the smoothed pose swings the frame quad off the original image grid; parts of the output rectangle fall **outside** the captured pixels → empty margins. [`fig-stab-crop-window`]

• **two fixes, a real tradeoff**:

• **crop & zoom** (standard): keep only a smaller central window that stays valid for all frames, then upscale back to full size. **More stabilization ⇒ smaller crop window ⇒ less field of view + softer (upscaled) image.** This is *the* tradeoff — typical consumer modes crop ~10–20%. [`fig-stab-crop-window`]

• **inpaint / fill from neighbors**: paint the missing borders using content from **other frames** (temporally nearby frames saw those pixels) or a generative **inpainter** (forward-ref EDGES MATTER inpainting / [[Generative AI and diffusion]]). Keeps full FOV at the cost of synthesized (possibly wrong) borders and flicker risk.

• **adaptive crop**: choose the crop/margin **per clip** or per segment so smooth shots crop little and violent shake crops more — turning the fixed tradeoff into a content-aware one.

▸12.5.5 Rolling-shutter correction: per-row pose and rectification⧉

• **the cause**: most CMOS sensors are **read out row by row**, not all at once (global shutter). Row $r$ is exposed at time $t(r) = t_{\text{frame}} + r\cdot t_{\text{row}}$. If the camera (or subject) moves appreciably during that **readout sweep** (a few to ~30 ms), each row records a **different camera pose**. [`fig-rolling-shutter-perrow`]

• **the artifacts** (name them, they're diagnostic): **skew/shear** of vertical edges under horizontal pan; **wobble/"jello"** under tremor; **partial exposure** banding and the spinning-propeller "smear" — all because the top and bottom of the frame are effectively *different time samples*. This is **temporal aliasing within a single frame** (callback to L16). [`fig-rolling-shutter-skew`]

• **the model & fix**: treat the per-row pose as a continuous function of time $R(t(r))$. **Rectify** by reprojecting every pixel from the pose at its own row-time onto a single **reference pose** $R(t_{\text{ref}})$ (e.g. the first row's, or the frame center's):

$$(r,c)\ \longmapsto\ \text{reproject through } R(t_{\text{ref}})\,R\!\big(t(r)\big)^{-1}.$$

Each row is warped by a slightly different transform → the per-row warps "un-shear" the frame back to one virtual instant.

• **where the pose comes from**: **Karpenko et al. 2011** read $R(t)$ from the phone's **gyroscope** (after calibrating the gyro-to-frame timing and the row readout time $t_{\text{row}}$) — robust and cheap, and it doubles as the stabilization rotation estimate. **Forssén & Ringaby 2010** instead estimate the per-row rotation **from the image** (feature tracks + a continuous-time motion model), then resample. Either way the output is a rolling-shutter-free frame, after which ordinary stabilization runs.

• **order & coupling**: rolling-shutter rectification should precede (or be jointly solved with) trajectory smoothing — otherwise the homography estimates in Stage 1 are corrupted by the very shear we're trying to remove. This is exactly the rolling-shutter caveat the panorama chapter forward-refs for **continuous phone sweeps** (cross-ref [[Continuous panoramas (e.g. on cell phones)]]), where readout-time motion is most severe.

• **the capture-side escape**: a **global-shutter** sensor reads all rows at once and has no rolling shutter at all — the optical fix, paid for in sensor cost/complexity. (Same spirit as "engineer the system so you don't have to invert a nasty artifact.")

▸12.6 Frame interpolation and slow-motion synthesis⧉

`fig-interp-as-morph` · interpolation as a morph — a plain cross-dissolve of two frames yielding two ghosts vs a flow-warp where corresponded points slide halfway before blending so the subject moves rather than fades; the correspondence here is automatic optical flow 🟨

`fig-flow-warp-blend` · the flow-based interpolation pipeline — estimate bidirectional flow, backward-warp each frame to time $t$ ($-t\mathbf F_{0\to1}$ and $(1-t)\mathbf F_{1\to0}$), convex-blend by $(1-t),t$; a highlighted disocclusion hole present in only one warped frame 🟨

⬜ figure not yet created

`fig-forward-vs-backward-warp` (splatting a pixel forward leaves holes/collisions fig-forward-vs-backward-warp

⬜ figure not yet created

`fig-occlusion-disocclusion` (a moving foreground: the background it *uncovers* exists in only one of the two frames → which frame to copy from) fig-occlusion-disocclusion

⬜ figure not yet created

`fig-film-largemotion` (a fast-moving subject where small-motion flow fails and FILM's scale-agnostic pyramid still tracks it) fig-film-largemotion

A normal camera shoots, say, 30 fps; to play a moment 8× slower and smooth you'd need ~240 distinct instants per second that were never recorded. You can either **capture** them (a high-speed camera — true, but expensive and light-hungry) or **synthesize** them. Frame interpolation synthesizes: given two real frames, invent the ones that belong **between** them. The whole problem reduces to a question you already met in [[Morphing]] — *what moved where?* — answered here by **optical flow** or by a **learned** model.

💡 **Big lesson (recurrence of L14 · capture the full set, decide later — and its limits):** high-speed photography is the honest way to own every instant (capture the full temporal set, pick the moment/shutter later). Interpolation is the **budget substitute**: when you *didn't* capture the full set, a **prior** invents the missing instants. So this chapter sits at the seam between **L14** (capture everything) and **L10** (the prior is not optional) — the in-between frames are partly **reconstructed** (where flow is reliable) and partly **hallucinated** (across occlusions). (→ see Big lessons **L14**, first in MULTIPLE EXPOSURE, and **L10**, first in Super-resolution.)

equations

linear motion assumption $x_t = (1-t)\,x_0 + t\,x_1$ along a flow vector

backward-warp synthesis $I_t(\mathbf{p}) = (1-t)\,I_0(\mathbf{p} - t\,\mathbf{F}_{0\to 1}(\mathbf{p})) + t\,I_1(\mathbf{p} + (1-t)\,\mathbf{F}_{1\to 0}(\mathbf{p}))$ (warp each source to $t$, then convex-blend by temporal distance)

occlusion-aware blend $I_t = \dfrac{(1-t)\,V_0\,W_0 + t\,V_1\,W_1}{(1-t)\,V_0 + t\,V_1}$ with visibility masks $V_0,V_1$ (down-weight the frame where a pixel is occluded)

flow consistency check $\lVert \mathbf{F}_{0\to1}(\mathbf p) + \mathbf{F}_{1\to0}(\mathbf p + \mathbf F_{0\to1}(\mathbf p))\rVert \le \tau$ (forward-backward agreement → trust map)

▸12.6.1 Why interpolate: faking slow-motion and up-converting frame rate⧉

• **slow-motion**: to slow footage $N\times$ smoothly you need $N{-}1$ synthetic frames between every real pair. Interpolation manufactures them from ordinary 30/60 fps footage — no high-speed camera, no extra light. [`fig-slowmo-axis`]

• **frame-rate up-conversion**: 24→48, 30→60→120 fps for smoother playback, VR/high-refresh displays, and TV "motion smoothing" (the much-debated "soap-opera effect").

• **the four ways to treat time** (place interpolation among rivals): **normal capture** = sparse samples with gaps; **high-speed capture** = dense true samples (expensive, dim); **interpolation** = synthesized in-betweens (cheap, can be wrong); **long-exposure / motion blur** = the *integral* over time, not the instants (cross-ref temporal averaging / "max-over-time" in [[Video editing]]). Interpolation is the only one that adds temporal resolution *after* capture. [`fig-slowmo-axis`]

• **the honest limit**: synthesized frames are guesses. Where motion is large, fast, or self-occluding, they can warp, ghost, or tear — which is exactly what the learned methods target.

▸12.6.2 Interpolation = morphing between adjacent frames⧉

• **the equivalence**: a [[Morphing]] takes two images + a correspondence and produces in-betweens by **warping both toward an intermediate shape and cross-dissolving**. Frame interpolation is that, with **two consecutive video frames** as the endpoints. [`fig-interp-as-morph`]

• **what's added over a static morph**: in classic morphing a human supplies the correspondence (Beier–Neely feature lines); here the correspondence is the **optical flow**, estimated automatically. So interpolation = **automatic** morphing, where motion estimation replaces hand-drawn correspondence.

• **why not just cross-dissolve**: a plain temporal cross-dissolve (no warp) **double-exposes** the moving object — you see two ghosts fading between positions, not one object *moving*. The warp is what turns a dissolve into motion. [`fig-interp-as-morph`, ghost vs slide]

▸12.6.3 Flow-based interpolation: warp both frames to the midpoint and blend⧉

• **the recipe** (intuition first): assume a pixel moves along a roughly **straight, constant-velocity** path between $A$ and $B$. To make the frame at fraction $t\in(0,1)$: estimate flow both ways, **pull** each source frame to time $t$ along (a $t$-scaled slice of) its flow, then **blend** the two warped images weighted by temporal closeness. [`fig-flow-warp-blend`]

$$I_t(\mathbf{p}) = (1-t)\,I_0\!\big(\mathbf{p} - t\,\mathbf{F}_{0\to 1}\big) + t\,I_1\!\big(\mathbf{p} + (1-t)\,\mathbf{F}_{1\to 0}\big).$$

#### Forward vs backward warping

• **the resampling detail that matters**: **forward warping (splatting)** pushes each source pixel to its destination — but destinations collide and leave **holes** (no source pixel lands there). **Backward warping** asks, for each *output* pixel, where it came from, and samples there — giving a **full, hole-free grid** (cross-ref [[Warping and resampling]] on backward resampling). Hence the recipe formula above is written as a **backward** sample. The snag: backward warping needs the flow *at time $t$*, which must itself be approximated (e.g. by splatting the flow, or a learned flow-at-$t$). [`fig-forward-vs-backward-warp`]

#### Occlusions and holes: the real difficulty

• **the hard case**: pixels that are **disoccluded** (background uncovered by a moving foreground) exist in **only one** of the two frames; a pixel **occluded** in $B$ should be taken from $A$, and vice versa. Naive blending of both ghosts the boundary. The fix is an **occlusion/visibility mask** $V_0,V_1$ that down-weights the frame where a pixel isn't visible:

$$I_t = \frac{(1-t)\,V_0\,(\text{warp of }I_0) + t\,V_1\,(\text{warp of }I_1)}{(1-t)\,V_0 + t\,V_1}.$$

Get the masks from a **forward–backward flow consistency check** — where $\mathbf{F}_{0\to1}$ and $\mathbf{F}_{1\to0}$ disagree, a pixel is likely occluded; trust the other frame there. Genuine holes (visible in neither, or torn) need **inpainting** (forward-ref). [`fig-occlusion-disocclusion`]

• **failure modes to name**: large displacements break the small-motion flow assumption (objects "teleport," flow finds the wrong match); thin/fast structures (a swung bat, spokes) tear; non-linear motion violates the straight-line assumption. These are exactly what learned synthesis attacks.

▸12.6.4 Learned synthesis: Super SloMo and FILM⧉

• **the learned-operator move** (💡 L8): keep the same skeleton — estimate motion, warp, blend — but **train it end-to-end** on real video so the network learns better flow, better occlusion handling, and a learned blend/refinement, rather than hand-tuning each. (→ see Big lesson **L8**: a learned operator swaps a hand-designed prior for one learned from data.)

• **Super SloMo (Jiang et al. 2018)**: a two-network design — first estimate **bidirectional flow** between $A$ and $B$, then a second network **refines the intermediate flow at each time $t$ and predicts visibility maps** for occlusion-aware warping+blend. Crucially it can synthesize **any number** of intermediate frames (variable-length slow-mo) from one flow estimate, and it's **trained end-to-end** with the warped-and-blended output supervised against held-out real frames.

• **FILM (Reda et al. 2022)** — the **large-motion** specialist: ordinary interpolators fail when things move *far* between frames (near-duplicate photos, action shots). FILM uses a **scale-agnostic feature pyramid** with **shared weights across scales**, so the *same* matcher operates from coarse (big motion) to fine (detail) — large motion at a coarse scale looks like small motion. It's a **single unified network** (no separate flow net), trained with a **perceptual/Gram-matrix (style) loss** to keep synthesized detail sharp rather than blurry, and is notably good at interpolating between **far-apart still photos** ("near-duplicate" photo animation). [`fig-film-largemotion`]

• **why learned wins where flow loses**: the network learns a **prior** over how scenes move and how disoccluded regions should look, so it can **hallucinate plausible content** in holes and across large gaps where explicit flow has no correspondence to copy — the L8/L10 pairing again (learned prior fills what measurement didn't supply).

• **the models live elsewhere**: architectures, training, and losses are taught as *models* in [[Machine learning]] / [[Generative AI and diffusion]]; here we use them as the **interpolation operator** and stress *what they add over hand-built flow* — robustness to large motion and clean occlusion handling. (Same "priors here, models there" split as Super-resolution.) Forward-pointer: diffusion-based video models push interpolation/in-betweening further still.

▸12.7 Video editing⧉

`fig-nle-timeline` · the non-linear timeline — parallel video/audio tracks, clips with trim handles, a playhead, a labelled hard cut and a cross-dissolve transition; the random-access reference-list model that replaced sequential tape splicing 🟨

⬜ figure not yet created

`fig-reduce-over-time` (one stacked video → four outputs side by side: **mean** (motion-blur/long-exposure), **median** (people vanish), **max** (bright streaks / star trails), **min** (darkest-pixel) — the per-pixel reduce family) fig-reduce-over-time

⬜ figure not yet created

`fig-median-deghost` (a busy plaza, N frames → median → empty plaza: transient people removed) fig-median-deghost

⬜ figure not yet created

`fig-everyone-smiling` (a burst of a group portrait → per-region best-frame selection → one frame where everyone's eyes open / smiling, after photomontage) fig-everyone-smiling

`fig-hyperlapse` · hyperlapse vs naive fast-forward — a shaky first-person walk, a jagged frame-dropping $10\times$ speed-up (nauseating), and a hyperlapse that reconstructs the 3-D trajectory, fits a smooth virtual path, and renders along it (stable); speed-up and stabilisation together 🟨

`fig-transcript-edit` · transcript-based editing — a word-timestamped transcript with a filler "um" struck through, and the timeline below with the aligned micro-segment removed and clips closed up; speech-to-text as a 1-D handle on video 🟨

This closing chapter spends the part's machinery. Editing is where alignment, flow, warping, and robust per-pixel statistics stop being algorithms and become **tools an editor reaches for**. It's deliberately lighter and demo-driven — and honest where a specific product or paper is uncertain (marked *queue*). The through-line: almost every "video effect" is **align the frames, then reduce or select across time**, or **re-time the timeline**.

equations

per-pixel temporal reduce $O(\mathbf p) = \operatorname*{reduce}_{t} I_t(\mathbf p)$ for $\operatorname{reduce}\in\{\text{mean},\ \text{median},\ \max,\ \min\}$ (after alignment)

mean = long-exposure $O = \tfrac1N\sum_t I_t$

median = robust background $O(\mathbf p)=\operatorname{median}_t I_t(\mathbf p)$ (transients are outliers → rejected)

selection composite $O(\mathbf p) = I_{t^\star(\mathbf p)}(\mathbf p)$ with $t^\star(\mathbf p)=\arg\max_t s(I_t,\mathbf p)$ (pick the frame maximizing a per-region score $s$ — e.g. "eyes open / smiling")

▸12.7.1 Non-linear editing: the timeline metaphor⧉

• **what "non-linear" means**: on tape, you edited **in order** by physically splicing — to change shot 3 you re-recorded everything after it. Digital **non-linear editing (NLE)** stores clips as **random-access** files, so you assemble them on a **timeline** in any order, re-arrange freely, and the edit is just a **list of references + in/out points** (non-destructive — the source media is never altered). [`fig-nle-timeline`]

• **the timeline**: parallel **video and audio tracks**, **clips** with trim handles, a **playhead**; edits are **cuts** (hard joins) and **transitions** (cross-dissolve, wipe, fade) — a transition is literally a temporal **cross-dissolve / morph** between two clips (cross-ref [[Morphing]] and the interpolation cross-dissolve).

• **why it matters here**: the timeline is the substrate the rest of the chapter manipulates — summarization *shortens* it, fun filters *collapse* a stretch of it to one frame, transcript editing *re-orders* it from text, and slow-mo/speed-ramps (cross-ref [[Frame interpolation and slow-motion synthesis]]) *re-time* it.

• **honest scope note**: the craft of *editing* (pacing, continuity, the cut on action — Murch's *In the Blink of an Eye*) is a whole art; we touch only the computational hooks. *Queue* a sidebar on editing grammar.

▸12.7.2 Summarization: keyframes, fast-forward, and highlights⧉

• **the goal**: compress *watching time*, not file size — turn an hour of footage into something **glanceable** or **skimmable**.

• **representative / keyframes**: pick a small set of frames that **cover** the content — e.g. one per **shot** (detect shot boundaries by large inter-frame change), or cluster frames and take a medoid per cluster. Output: a contact-sheet/storyboard of the video. (Ties to scene-change detection; *queue* survey ref Truong & Venkatesh 2007.)

• **adaptive fast-forward / hyperlapse**: speed up long first-person footage. Naive $10\times$ frame-dropping is **unwatchably shaky** (every step jolts). **Hyperlapse** (Kopf et al. 2014) reconstructs the camera path in 3-D, **chooses a smooth virtual path** through it, and renders along that path — *stabilization fused with speed-up* (cross-ref [[Video stabilization and rolling-shutter correction]] for the path-smoothing; cross-ref [[Optical flow]] for the motion). Speed can be **content-adaptive** — slow on interesting/eventful stretches, fast on dull ones. [`fig-hyperlapse`]

• **highlight / event detection**: surface the *interesting* parts — score frames by **saliency / motion / faces / audio peaks**, or a **learned interestingness** model (forward-ref [[Machine learning]]), then keep the top segments (sports highlights, GoPro "best moments"). *Queue* specific systems.

▸12.7.3 Fun temporal filters: reduce-over-time⧉

• **the unifying primitive**: after the frames are **aligned** (stabilized — cross-ref, this is the prerequisite, or the result ghosts), apply a **per-pixel reduction across time**: $O(\mathbf p) = \operatorname*{reduce}_t I_t(\mathbf p)$. The choice of reducer is the whole effect. [`fig-reduce-over-time`]

• **mean over time** = a **long exposure from video**: $O = \tfrac1N\sum_t I_t$. Moving water silks out, traffic becomes light ribbons, people blur away — and it's the **denoise-by-averaging** filter ($1/\sqrt N$) from MULTIPLE EXPOSURE seen as an *effect* instead of a fix. (Same align-then-average pattern.)

• **median over time** = **robust background / people-removal**: $O(\mathbf p)=\operatorname{median}_t I_t(\mathbf p)$. Anything **transient** (a person walking through, a passing car) appears at any given pixel only briefly → it's an **outlier** the median **rejects**, leaving the static scene. The classic "**empty the busy plaza**" / tourist-removal trick — and exactly why median is the **robust** statistic (cross-ref BASIC denoising). [`fig-median-deghost`]

• **max over time** = **bright-streak accumulation**: $O(\mathbf p)=\max_t I_t(\mathbf p)$. Keep the brightest value each pixel ever saw → **star trails**, light-painting, lightning, firework trails, a runner's headlamp streak. (The part's "**max over time**" fun filter.) **Min over time** is the dual (darkest-pixel) — useful to drop specular flashes / find the persistent dark background.

• **selection composite** (not a reduce but a **pick**): choose, per region, the **best** frame from a burst — $O(\mathbf p)=I_{t^\star(\mathbf p)}(\mathbf p)$ with $t^\star(\mathbf p)=\arg\max_t s(I_t,\mathbf p)$ for a score $s$. This is the **"everyone smiling / no blinks"** group-photo trick: for each face, splice in the frame where that person's **eyes are open and they're smiling**, blended seamlessly. It's **interactive digital photomontage** (Agarwala et al. 2004) — select-best-pixel over a stack — applied to a short burst; the seamless merge uses graph-cut + Poisson blending (cross-ref EDGES MATTER). [`fig-everyone-smiling`]

• **the prerequisite, restated**: all of these assume the frames are **registered**; on handheld video, **stabilize/align first** (cross-ref [[Video stabilization and rolling-shutter correction]]) or the static scene smears too. Same align-then-combine discipline as HDR/panorama.

▸12.7.4 Transcript-based editing⧉

• **the idea**: edit a **talking-head** video by editing its **transcript**. Run **speech-to-text (ASR)** with **word-level timestamps** (forward-ref [[Machine learning]]); the text aligns to the audio, the audio to the video. **Delete a sentence in the text → the matching clip is removed from the timeline**; reorder paragraphs → reorder shots. [`fig-transcript-edit`]

• **the killer feature**: **filler-word removal** — find and strike every "um," "uh," "you know," and the corresponding micro-segments vanish, tightening the cut automatically. Also: search-and-replace navigation ("jump to where they said X"), and **overdub**.

• **the research root**: Berthouzoz, Li & Agrawala 2012, *Tools for Placing Cuts and Transitions in Interview Video*, automated finding clean cut points (at sentence/pause boundaries, hidden cuts) for interview footage — the academic ancestor of **Descript**-style transcript editing. (*Queue* exact citation/products.)

• **why it works**: speech is a **clean alignment signal** for dialogue-driven video; the transcript is a *one-dimensional handle* on a two-dimensional medium. Limits: it's for **spoken-word** content (useless for action/music), and **jump cuts** between removed segments may need a **transition** or a flow-based **seam-hiding** frame interpolation (cross-ref [[Frame interpolation and slow-motion synthesis]]) to look smooth.

▸12.7.5 Coda: storyboards, interviews, and where this part lands⧉

• **storyboards** — the **timeline as a sequence of stills**, looking two directions:

• *forward (pre-visualization)*: a **storyboard** plans a shoot before any footage exists — a panel per intended shot (framing, camera move, action), the director's equivalent of an outline. **Animatics** add rough timing/motion; modern **pre-viz** does it in 3-D. Increasingly **text-to-image / text-to-video** models draft storyboard panels from a script (forward-ref [[Generative AI and diffusion]]) — generation standing in for the not-yet-captured frames.

• *backward (auto-storyboard / recap)*: turn an *existing* video into a storyboard automatically — the **keyframe-summarization** output above (one representative frame per shot, laid out as a contact sheet) used to **skim, recap, or index** footage; comic-style layouts (Schödl/Agarwala-line "video summarization as a comic"; *queue ref*) size and arrange the keyframes by importance.

• the link: a storyboard is the **summarization primitive in both tenses** — keyframes chosen to *cover* the story, whether the story is planned (forward) or already shot (backward). [`fig-storyboard`]

• **interviews / multi-cam** (brief): the transcript-edit + multi-track timeline shine for **interview** assembly (sync cameras by audio, cut between angles on the transcript). *Queue* specific tooling.

• **the part's arc, closed**: we began with motion as *physics to model* (flow, warping, magnification, compression), turned it into *correction* (stabilization, rolling shutter) and *synthesis* (interpolation, slow-mo), and end here with *editing* — motion as raw material for storytelling. Every effect was one of two moves: **register-then-reduce/select across time**, or **re-time the timeline** — the same "align, then combine, then decide" spine that runs through the whole book.

▸Part 13 ADVANCED COMPUTATIONAL PHOTOGRAPHY ON THE BLEEDING EDGE⧉

▸13.1 Exotic / advanced optics⧉

13.1.1 GRIN⧉

13.1.2 Metalenses and advanced crazy optics à la Barbastathis (coherent though)⧉

13.1.3 Non-linear optics⧉

▸13.2 Coded imaging,⧉

13.2.1 Wavefront coding⧉

13.2.2 Coded aperture⧉

13.2.3 Code in time (phase, amplitude)⧉

▸13.2.4 Optical modulators (spatial light modulators): DMD and LCD/LCoS⧉

• **DMD** — micromirror array, each mirror tilts ± → **binary amplitude**, very fast (kHz); DLP projectors, single-pixel / compressive imaging, coded exposure & illumination

• **LCD / LCoS** — liquid-crystal pixels rotate polarization → with a polarizer give **greyscale amplitude**; in **phase-only** mode the controllable retardance imposes an arbitrary **phase** profile (programmable lens / hologram / aberration corrector)

• a programmable SLM is the **active** cousin of the fixed coded masks above (and of metalenses) — same idea (shape the wavefront), but reconfigurable on the fly

13.2.5 Theorems⧉

13.2.6 End-to-end optimization⧉

13.2.7 Lensless imaging?⧉

13.2.8 Compressive sensing⧉

13.2.9 Discuss optical computing⧉

▸13.3 Light field and multi-aperture imaging⧉

▸13.3.1 Multi-aperture imaging⧉

• **camera arrays** (Wilburn et al. 2005): a **grid of synchronized cameras** captures many viewpoints at once. Fuse them for **synthetic aperture** (below), **high-speed** video (stagger the triggers → interleave frames), **high dynamic range** (vary exposure per camera), or **super-resolution** (sub-pixel-shifted full-res views). The Stanford **multi-camera array** is the canonical research instrument.

• **synthetic aperture**: align all the small-aperture array views to a chosen **focal plane** and **average** them. The result behaves like a **single lens with an aperture as wide as the whole array** — an extremely shallow depth of field that **blurs foreground occluders into oblivion**, letting you **see through** foliage/crowds to a subject behind them; refocus by choosing a different alignment plane. The array's signature trick.

• **multi-camera phones**: modern phones carry a **cluster of cameras of different focal lengths** — **ultrawide · wide · tele** (sometimes a periscope tele) — and **fuse** them: switch/blend between modules across the zoom range, borrow detail from the sharper module, and use the **disparity between cameras** for **depth / portrait-mode** background blur. Note the caveat: this is **discrete optical steps + computational interpolation**, *not* a continuous optical zoom — and parallax between modules must be reconciled in software.

• **light.co** — the (now-defunct) **Light L16**: sixteen small camera modules of mixed focal lengths in a phone-sized body, fused into one high-resolution image with computed depth — a commercial **multi-aperture array** product; instructive as both the promise and the difficulty of array fusion.

• **bullet time**: a **ring/arc of synchronized cameras** fired at once (or in rapid sequence) → a **frozen instant viewed from a sweeping viewpoint** (the *Matrix* effect). A camera array used for its **angular** dimension rather than its aperture — the most recognizable array demo.

• **plenoptic vs array — the tradeoff to land**: both sample the light field, but a **plenoptic** camera buys **fine angular** sampling at the cost of **spatial** resolution (angular samples come *out of* the sensor's pixels), while an **array** keeps each view at **full resolution** but samples angle **coarsely** over a **wide baseline** (→ stronger parallax, true synthetic aperture). Form factor (one body vs a rig) and baseline (mm vs cm–m) are the practical axes.

▸13.3.2 Light fields 101⧉

• plus non uniformity of parameterization, paraxial

▸13.3.3 Light field cameras⧉

• **integral photography (Gabriel Lippmann, 1908)** — the origin of light-field capture: photograph the scene through an **array of tiny lenses** (a "fly's-eye" / lenslet sheet) onto a single plate, so each lenslet records a slightly **different view** (an **elemental image**) → the plate stores the scene's **directional** information, i.e. (a slice of) the light field. Viewed back through the *same* lens array it replays a **3-D, parallax-correct (autostereoscopic)** image. The direct ancestor of every plenoptic camera and of lenticular / **integral-imaging 3-D displays**.

• **the plenoptic / light-field camera** lineage: **Adelson & Wang 1992** (the plenoptic camera) → **Ng 2006 / Lytro** — a **microlens array** in front of the sensor samples the rays *inside* the main lens (the same integral-photography trick, now digital), trading **spatial resolution for angular samples** → **refocus, viewpoint shift, and depth after capture** (the "capture the full set, decide later" lesson, L14).

• **Lytro** (and **Raytrix**) as products; cross-ref the Intro history (Lippmann; Lytro 2012).

▸13.3.4 Refocusing⧉

`fig-lightfield-refocus` · **light field → refocus → merge:** refocus a light field by **shift-and-add** (refocused near / refocused far) → the synthesised focal stack merged to one all-in-focus image. placed in **[[11 Advanced computational photography]] § Refocusing** (forward-ref'd from the DoF & focal-stack chapters); registered here next to its focus-stacking siblings ✅

• **shift-and-add refocusing** — once the light field is captured, choosing a focus plane is just **summing the sub-aperture views, each shifted in proportion to its position in the aperture** (a shear of the light field, then integrate over angle). Different shifts pick different virtual focus planes, so a *single* capture yields a whole **focal stack**, refocusable **after** the shot. Ng's **Fourier-slice photography** (2005) is the frequency-domain fast version (refocusing = taking a sheared 2-D slice of the 4-D light field's spectrum).

• this is the **dual of focus stacking** ([[Multiple exposure imaging]] — *Focal stacks and depth of field extension*): there you capture **frames** at stepped focus and merge; here you capture **rays** and *synthesise* the frames, then the **same sharpness-weighted merge** (with the high exponent) collapses them to all-in-focus. The per-pixel sharpest plane is again a coarse **depth map** (shape-from-focus). Both are **L14** — capture the full set, decide later — on the focus axis.

13.3.5 Aberration correction⧉

▸13.3.6 I want a light field camera⧉

see my blog https://www.thecomputationalphotographer.net/2020/02/i-want-a-light-field-camera/

13.3.7 Dual pixel autofocus⧉

▸13.3.8 Plus light field in time = high speed⧉

• tracking

13.3.9 Lenticular displays here⧉

13.3.11 Dave Brady’s gigapixel stuff⧉

▸13.4 Extra sensors and non-visual data⧉

13.4.1 Accelerometer, gyro⧉

13.4.2 Sound⧉

13.4.3 GPS⧉

13.4.4 Compass⧉

13.4.5 Near IR⧉

13.4.6 Temperature⧉

▸13.5 Computational illumination⧉

13.5.1 Flash / no flash⧉

13.5.2 Ramesh’s multiflash⧉

13.5.3 Removing flash artifacts⧉

13.5.4 Dark flash, (plus Stasi version!)⧉

▸13.5.5 Strobe photography⧉

• and traps

13.5.6 Matting again?⧉

13.5.7 Direct/indirect⧉

13.5.8 Polarization?⧉

13.5.9 Smart flash⧉

13.5.10 Light domes⧉

13.5.11 Tabletop version⧉

▸13.5.12 Drone lighting (computational rim illumination with aerial robots)⧉

• mount a light on a **quadrotor** and let it *fly itself* into position: a feedback loop measures the subject's **rim light** (the bright contour that separates subject from background) and commands the drone to hold a target rim width/placement as the subject or photographer moves — the light source becomes an **autonomous, computationally-controlled actuator** rather than a fixed stand. Computational illumination where the optimized degree of freedom is the **light's position in space**; ties the lighting sections above to the robotics/active-capture theme (cross-ref FUNDAMENTALS — *Cameras beyond photography*).

13.5.13 Lukas multi-illumination⧉

▸13.5.14 Dual photography⧉

• sidebar: Yang privacy

13.5.15 Intrinsic images with time lapse⧉

13.5.16 Structured light scanning⧉

▸13.5.17 Exploiting natural variations⧉

• time lapse tricks

13.5.18 Coherent imaging⧉

13.5.19 Confocal, Ioannis⧉

13.5.20 Schlieren⧉

▸13.5.21 Relighting, shadow removal⧉

• not perfect place but OK

13.5.22 Reflection removal⧉

13.5.23 In general, what to say about intrinsic images?⧉

▸13.6 Computational sensors⧉

13.6.1 Assorted pixels⧉

13.6.2 Fuji⧉

▸13.6.3 Depth sensors⧉

• a taxonomy of ways to recover per-pixel **range $Z$** — the dimension a single camera throws away in the divide-by-$Z$ projection (FUNDAMENTALS):

• **stereo** (passive): two cameras a baseline $B$ apart; triangulate depth from **disparity**, $Z = fB/d$ (cross-ref *Multiple view geometry*). Cheap and passive, but fails on **textureless / repetitive** surfaces — the correspondence problem.

• **stereo with structured light** (active): *project* a known pattern — stripes, a pseudo-random **speckle / dot** pattern, or phase-shifted sinusoids — so correspondence is solved even on a blank wall; one camera + one projector triangulate. The original **Kinect v1** (PrimeSense speckle); industrial **phase-shift profilometry** for 3-D scanning. Robust indoors; struggles in bright sun and on specular / very dark surfaces.

• **LiDAR** (light detection and ranging): emit a laser pulse and time its return (**direct time-of-flight**) for each scanned or flashed direction → a depth **point cloud**. Scanning (spinning or MEMS-mirror) vs **flash** LiDAR (a SPAD array — cross-ref single-photon sensors below); long range, works in the dark — the basis of self-driving range sensing and recent phone/tablet LiDAR.

• **time-of-flight** (the other active route): continuous-wave (**phase**) ToF — e.g. **Kinect v2** — vs direct/pulsed ToF (LiDAR); see the ToF section below.

• others, in passing: depth from **focus / defocus** and **dual-pixel** (Optics — Focus/AF), **monocular learned depth** (COMPUTATIONAL TOOLS — Deep learning), and shape-from-X.

13.6.4 Time of flight⧉

13.6.5 Single-photon sensors (SPAD, avalanche, photon counting)⧉

▸13.6.6 Doppler / velocity imaging⧉

• per-pixel **velocity** from a frequency / phase shift — optical Doppler, Doppler **lidar / radar**, laser **vibrometry**; the imaging cousin of medical **Doppler ultrasound** and Doppler weather radar

• measures motion **directly** (radial velocity) rather than inferring it from frame-to-frame optical flow → fast, robust in low light; pairs with ToF (range + range-rate)

13.6.7 Event sensors⧉

13.6.8 Lincoln lab sensors⧉

▸13.7 More temporal and Video stuff⧉

13.7.1 High-speed cameras⧉

13.7.2 Ultra high speed imaging (Velten, etc.)⧉

13.8 De-weathering (fog, rain)⧉

▸Part 14 3D AND DEPTH⧉

▸14.1 What "depth" means, and where it comes from⧉

• recap, don't re-teach: **depth = the z-coordinate**, *not* the ray length to the point; **unprojection** (pixel + depth → 3-D point) inverts perspective projection (defined in [[Multiple view geometry]] / Fundamentals). A **depth map** is just a per-pixel z image; a **point map** (DUSt3R's term, below) is its unprojected 3-D field.

• the cues/sensors that *give* depth, as a quick inventory (each lives in its home chapter; collected here for contrast):

• **stereo / disparity** — two views, triangulate matched points (baseline + correspondence) → [[Multiple view geometry]]

• **multi-view** — many views (the SfM/MVS pipeline below)

• **active depth** — **time-of-flight**, **structured light** (project a known pattern), **LiDAR** — return z directly (pointer to sensors/optics; RealSense & phone face-ID dot projectors)

• **defocus / focus** — depth-from-defocus and focal stacks → [[Advanced computational photography#Light field and multi-aperture imaging]], Depth of field

• **dual-pixel** — the half-aperture parallax phones already capture for autofocus, reused for portrait depth → Optics (dual-pixel AF)

• **motion** — structure-from-motion / optical-flow parallax → [[Matching pixels across space and time]]

• **monocular cues** — a *single* still already implies depth to a human (occlusion, perspective, texture gradient, shading, familiar size, haze); turning those into numbers is the next chapter

• the recurring catch: most reconstructions recover geometry only **up to a similarity** (scale/rotation/translation) unless something fixes the **absolute scale** — a known baseline, a calibration target, or a metric sensor. Keep "relative" vs "metric" depth distinct.

▸14.2 Monocular depth estimation (one image → depth)⧉

⬜ figure not yet created

`fig-monocular-cues` (one photo annotated with each depth cue) fig-depth-anything

▸14.3 Single-image 3-D: tour into the picture, photo pop-up, 3-D Ken Burns⧉

⬜ figure not yet created

`fig-popup-ground-vertical-sky` (geometric-context labels → folded pop-up)

▸14.4 Multi-view 3-D reconstruction: the classic pipeline⧉

▸14.5 Photos → radiance fields and Gaussian splatting (NeRF, 3DGS)⧉

⬜ figure not yet created

`fig-nvs-vs-reconstruction` (the 3-D↔2-D triangle: rendering, reconstruction, NVS)

⬜ figure not yet created

`fig-3dgs-splat` (anisotropic 3-D Gaussians → projected & alpha-blended)

▸14.6 Feed-forward (amortized) 3-D: skip the per-scene optimization⧉

▸14.7 Re-photography⧉

⬜ figure not yet created

the estimated relative-pose vector) fig-rolling-shutter-perrow

▸14.8 The landscape, and is 3-D a "fake task"?⧉

• a deliberately reflective close (slides 84–86): the field is a **complex landscape** along several axes — *reconstruction* (output 3-D) vs *NVS* (output 2-D); *overfit-per-scene* (NeRF, 3DGS) vs *amortized* (DUSt3R, VGGT); *baked illumination* vs *relightable*; **number of input views** (1 → 360° → ∞); the **3-D representation** used (volumetric/fuzzy vs hard surface vs 4-D ray space — or none); how much **prior** is learned; world-space vs camera-space.

• the provocation worth leaving the reader with: the **Bitter Lesson** (Sutton) and the **"parable of the parser"** — when data + compute are plentiful, the hand-built intermediate (here, *explicit 3-D*) may be an unnecessary **"fake task"** that learning can skip. **Is 3-D still needed, or is it scaffolding we'll discard?** (Counterpoint: real applications — robotics, AR, fabrication, autonomous driving — keep wanting *actual* geometry. → [[Adjacent fields and applications]].)

• frontier (slide 50): scalability / long context, beyond shape+texture (materials, **relightability**), **motion** (dynamic scenes), less supervision, fewer inductive biases.

▸14.9 3-D displays⧉

• the *output* counterpart to all of the above — once you have a 3-D scene, how do you *show* it with parallax? Brief pointer, since the optics live with light fields: **lenticular / parallax-barrier** autostereoscopic displays, **light-field displays**, head-tracked & VR/AR stereo, and research like **Project Starline**. Cross-link to the light-field display material in [[Advanced computational photography#Light field and multi-aperture imaging]] (lenticular displays, 3-D TV) rather than re-deriving.

▸Part 15 REVEALING THE INVISIBLE⧉

▸15.1 Near-infrared (NIR) photography⧉

• **how**: silicon photodiodes respond well into the **near-infrared**, so every camera ships with an **IR-cut filter** to keep IR from contaminating color. Two routes to NIR imaging: (1) **convert the camera** — remove the internal IR-cut filter (and optionally replace it with clear glass or an IR-pass filter); (2) **filter the lens** — an **IR-pass (visible-blocking) filter** (e.g. R72 / 720 nm) on a stock camera, at the cost of long exposures since most IR is being thrown away by the still-present internal cut filter.

• **what changes — the look**: **foliage glows bright white** (chlorophyll/leaf cell structure reflects NIR strongly) — the classic **"Wood effect"** (after R. W. Wood, 1910); **skin** goes smooth and waxy (NIR penetrates the surface, veins show); **blue sky darkens** and **haze is cut** (less Rayleigh scattering at longer wavelengths → see further); **water** goes inky black.

• **false-color IR**: since NIR is invisible, render it by **mapping bands to visible channels** (e.g. NIR→red, red→green, …) — the surreal magenta-foliage "Aerochrome" look; a channel-remapping choice, not a measurement.

• **uses beyond art**: **forensics & art conservation** (read faded ink, underdrawings, alterations beneath paint); **agriculture / remote sensing** — **NDVI** $=(\text{NIR}-\text{Red})/(\text{NIR}+\text{Red})$, the vegetation-health index that exploits the foliage-glows-in-NIR effect; **astronomy** (NIR pierces dust, redshifted light — e.g. JWST); and as an **extra channel** for computational photography (NIR-assisted denoising / dehazing / dark flash, cross-ref).

15.2 Accidental cameras⧉

15.3 Motion and video magnification⧉

15.4 Visual microphone⧉

15.5 Corner camera⧉

15.6 Active non-line-of-sight⧉

▸15.7 Passive non-line-of-sight⧉

`fig-photoshop-resample` · the resampling kernel zoo as a shipping product's UI: Photoshop *Image Size* resample-method dropdown screenshot + on-figure legend mapping each menu option to its kernel — Nearest Neighbor→box, Bilinear→tent, Bicubic→cubic, Bicubic Smoother→cubic biased smooth (enlarge), Bicubic Sharper→cubic biased sharp (reduce), Preserve Details 1.0/2.0→edge-aware/learned upsampler. © Adobe Inc., educational commentary ✅

`fig-nn-downsample-adversarial` · "decimation reveals hidden content": plant single pixels of a 2nd image at the NN sample sites (s·i, s·j) of a base photo — naive NN ÷s reconstructs the **hidden** image, prefiltered (area) ÷s averages it away and keeps the base. Panels: attacked image, pixel-zoom showing the planted dots, hidden reference, NN result, prefiltered result, "why it works". Image-scaling attack — cite arXiv:2104.11222 (Tamkin et al.) & arXiv:2104.08690 (Quiring et al.) 🟨

• **the setup**: no laser, no time-of-flight — only **ambient light** from the hidden scene spilling onto a **relay surface** you can see (a floor, a wall). An **accidental occluder** — a vertical **edge**, a **doorframe**, an opaque object — selectively blocks rays, so different parts of the visible surface "see" different parts of the hidden region. The occluder is what makes the problem solvable: a fully open scene blurs everything together.

• **occluder as a crude lens / coded aperture**: the edge or object **modulates** the hidden light, much like a **pinhole or coded aperture** modulates a normal scene. The light measured on the visible surface is (hidden image) **convolved** with the occluder's transfer/visibility function — so recovering the hidden image is a **deconvolution / inverse problem**, where the **occluder geometry** is itself part of the unknown.

• **corner camera (1-D)**: at a wall corner, the vertical edge acts as a 1-D **angular** aperture — the **penumbra** gradient on the floor encodes a 1-D image of the hidden scene **vs. angle around the corner**; differencing/inverting the penumbra yields a 1-D **video** of motion in the hidden room (Bouman et al. 2017). Cross-ref `[[#Corner camera]]`.

• **computational periscopy (2-D, single photo)**: with an **opaque occluder** of known or estimated shape between the hidden scene and the wall, a **single photograph** of the wall contains enough structure to **invert** for a full **2-D image** of the hidden scene — the occluder's cast shadow plays the role of a coded aperture (Saunders, Murray-Bruce & Goyal 2019).

• **passive vs active — the contrast**: **active** NLOS (above) sends a controlled light pulse and times its return, buying strong signal and depth at the cost of specialized hardware; **passive** NLOS is **photon-starved and ill-conditioned** (faint penumbra signal, unknown occluder), trading robustness for working with **ordinary cameras and existing light** — and it degrades gracefully as the accidental occluder becomes less ideal.

15.8 Mm-wave, wifi⧉

▸Part 16 ADJACENT FIELDS AND APPLICATIONS⧉

16.1 Modern sensors⧉

▸16.2 Astro⧉

▸16.2.1 Extreme long exposure⧉

• the regime where the **exposure is minutes to a year**, so the sky's motion *is* the subject:

• **star trails**: the Earth's rotation turns stars into arcs about the celestial pole — captured either as one very long exposure or (better, for noise/saturation) as a **stack of many short frames** combined with a *lighten* blend (cross-ref Multiple exposure)

• **solargraphy**: a **pinhole camera on photo paper** left for **days to a whole year** records the Sun's daily arcs marching across the sky with the seasons — *no development*, the latent image is scanned directly; the canonical "sun over a year" image

• **why you usually stack instead of one long exposure**: a single ultra-long digital exposure piles up **dark current, light pollution, and saturation**; many shorter **subs** that are aligned and averaged win on noise and dynamic range (→ DSO, below) — solargraphy survives only because photo paper is near-grainless and self-limiting

• ties to FUNDAMENTALS: long exposures expose the **thermal/dark-current** noise term and motivate **cooling**; the sky's motion motivates **equatorial tracking mounts** (and software **field de-rotation**)

▸16.2.2 Tracking, stacking, and selection (pointers)⧉

• **deep-sky objects (DSO)** — faint nebulae/galaxies: track the sky, shoot many long **subs**, calibrate and **average** → see [[Slopendium/09 Multiple exposure imaging|Denoising by averaging]]

• **planetary / lunar** — bright but blurred by **atmospheric seeing**: shoot thousands of fast frames and keep only the sharpest — **lucky imaging** → see [[Slopendium/10 Big data|Big data]]

16.3 X-ray⧉

16.4 Medical⧉

▸16.5 Microscopy⧉

• confocal

16.6 Mm-wave⧉

16.7 Music, sound⧉

16.8 Fluorescence⧉

16.9 Opto-acoustic⧉

16.10 Ultrasound⧉

16.11 Aerial imaging⧉

16.12 Computer vision⧉

16.13 Robotics, driving⧉

▸Part 17 HUMAN FACTORS⧉

▸17.1 Human factors and the art of photography⧉

▸17.1.1 Make better photos⧉

⬜ figure not yet created

before/after cluttered vs clean background fig-background-cleanup

⬜ figure not yet created

shallow-DoF isolation fig-dof-isolation

⬜ figure not yet created

layering / separation fig-layering

⬜ figure not yet created

off-center (rule of thirds) vs centered

• it's a **vocabulary to talk about pictures**, not a creativity recipe — take pictures and **critique your own**

• the throughline: **layering & separation** — make the subject stand out from its surroundings

• 💡 **Big lesson:** separate the subject from its surroundings, and commit — every rule below is one tactic for layering/separation; whatever you do (off-center, blur, tilt, saturate), do it boldly or not at all (the timid half-measure is the amateur's tell)

#### Composition

• **control the background** (the #1 fix): move your feet, shallow depth of field, telephoto compression, crop / get close, clean up (clone / Poisson), desaturate / darken, or go black & white

• **simplify — get close**: "if you can't make it good, make it tight"

• **off-center the subject** (rule of thirds); leave room ahead of motion or gaze — or deliberately **center** for symmetry

• **negative space**: give the subject room to breathe

• **pick the viewpoint** ("move your feet"): get **low** or to **eye level** (people, animals), occasionally high — the angle beats the zoom

• **horizontal vs vertical**: match the frame orientation to the subject

• **mind the geometry**: keep verticals straight, avoid near-parallels and accidental alignments (correct perspective with a homography later), build on diagonals, try an unusual angle

• **sweep the frame edges** before you shoot — catch distractions and anything half-cut by the border

• **foreground helps** (landscapes): a near element adds depth

#### Light

• **it's the light that counts**: soft, directional light flatters; harsh midday sun rarely does

• **open shade / overcast** = soft, even light — often the easiest good light

• **golden hour** (sunrise / sunset) for warm, low, directional light; **blue hour** just after sunset

• **fill the shadows**: fill flash, **bounce** flash (off a wall or ceiling, ~45°), a diffuser, or a reflector — never bare flash head-on

• the deeper idea: most lighting rules are about **managing dynamic range** (soft light / fill / reflectors compress contrast); studio three-point lighting; HDR + tone mapping in post

#### Portraits

• it's **all about the eyes** (focus there, catchlight); shoot near **eye level**; a short **telephoto** (~85–135 mm) flatters faces; light the face softly, warm WB a touch; basic **lighting & makeup** shape a face with light and shadow; over-shoot

#### Then, lightly, in post

• (→ ISP / Color / HDR): shoot **RAW**, set **white balance**, fix exposure, **manage contrast** with a tone curve, dodge / burn, crop; boost saturation/vibrance a touch; HDR + tone-mapping for high-contrast scenes

• governing principle: **whatever you do, do it less** — restraint is the mark of the finished image

▸17.1.2 Typical shooting scenarios⧉

• a quick field guide — per genre, the lens, the exposure priorities, and the one thing that matters most:

• **portrait**: short telephoto (85–135 mm), wide aperture for shallow depth of field + subject isolation, focus on the near **eye**, soft directional light, slightly warm white balance

• **landscape**: wide lens, small aperture (f/8–f/16) for deep depth of field, low ISO + tripod, focus near hyperfocal, a foreground element for depth, golden / blue hour; bracket for HDR

• **wildlife**: long telephoto (300–600 mm), fast shutter to freeze motion, continuous autofocus + burst, high ISO as needed, get to eye level, patience

• **studio**: controlled strobes (key / fill / rim), low ISO, aperture for depth of field, mind flash sync speed, shape the light (softbox / reflector), often tethered

• **sports / action**: fast shutter (1/1000 s+), continuous autofocus tracking + high frame rate, wide aperture / high ISO; pan to blur the background while keeping the subject sharp

• **architecture / real estate**: wide or **tilt-shift** lens to keep verticals straight (or correct perspective later with a homography), small aperture, level the camera, bracket interiors for the window-vs-room dynamic range

• **macro**: high magnification — its own section below

▸17.1.3 Macro photography⧉

• **macro** = photographing small subjects at high magnification (life-size **1:1**, or greater)

• **the central problem is depth of field**: at 1:1 the in-focus band is millimetres even stopped down, so —

• stop down (f/8–f/16+), but watch **diffraction** softening at very small apertures (a real ceiling at high magnification)

• **focus stacking**: shoot a stack at stepped focus and combine the sharp slices (→ Multiple-exposure / focus stacks)

• small sensors get macro depth of field almost for free (sensor-size scaling of DoF — see Image formation)

• **working distance vs magnification**: a longer macro focal length buys more **working distance** at the *same* magnification (so you can light the subject) without changing the depth-of-field band — same magnification ⇒ same DoF (Image formation)

• **gear**: a true macro lens, extension tubes, a reversed lens, or close-up diopters; a **focusing rail** for precise steps

• **light**: the subject is so close the lens shadows it — **ring flash / twin flash**, diffusers; tripod for static subjects

• **macro 3D**: **hypostereo** (closely spaced lenses, tiny baseline) for natural-looking depth on very near subjects (cross-ref stereo)

▸17.1.4 Special effect photography⧉

⬜ figure not yet created

light painting fig-light-painting

⬜ figure not yet created

multi-Fredo fig-multi-fredo

⬜ figure not yet created

box rigs fig-box-rigs

• **light painting**: a long exposure in the dark while moving a light (or the camera) — the sensor accumulates the trail (→ exposure / long exposure)

• **multi-Fredo / clone shots**: a locked-off camera + several frames of the same person in different spots, composited — one subject, many times

• **box rigs & multi-camera**: bullet-time / array rigs and beam-splitters — capturing one moment from many viewpoints or instants

• others: tilt-shift / miniature faking, intentional camera movement, prism / kaleidoscope, infrared & UV

▸17.1.5 Fun artsy stuff⧉

sprinkle throughout books

• Hockney

• Jason salavon

• face detector, clouds

• Bill time action visualization

• weird non-linear perspective, push broom.

• multiperspective panos

• text output camera

• blind camera

• anticliche camera

▸17.1.6 Perception of art⧉

• how the **visual system shapes how we make and read images** (Aaron Hertzmann's "why does art look the way it does"; Livingstone's *Vision and Art*):

• **luminance vs color**: the luminance channel carries form, depth and motion; equiluminant (color-only) art looks unstable / shimmery (Impressionism, Op-art) — ties to the chroma CSF (Spatial vision)

• **central vs peripheral vision**: low peripheral acuity explains the **Mona Lisa's elusive smile** and why a loose sketch reads from across the room

• **shortcuts that match perception**: edges / line drawings (we encode edges), exaggerated contrast (lateral inhibition), chiaroscuro & cast shadows for depth, atmospheric perspective (scattering)

• **the camera vs the eye**: a photo flattens what the eye explores over time (no foveation, one exposure / one focus) — part of why "the photo doesn't look like I remember it"

• forward tie-ins: NPR / stylization (Advanced); aesthetics & saliency models (ML)

▸17.2 Ethics of computational photography⧉

• photography has **never been "the truth"** — framing, exposure and selection are choices; computation only widens the gap between scene and image

• **manipulation & provenance**: retouching → deepfakes; the line between *enhancement* and *deception*; **forensics** (detecting edits) and **provenance** (C2PA / content credentials, watermarking) — see forensics (Missing stuff)

• **fairness & representation**: film and auto-exposure / AWB were tuned for light skin (**Shirley cards**) — a bias that persists in metering, white balance and face detection; design for **diverse skin tones** (→ Skin tones)

• **privacy & consent**: faces, surveillance, always-on cameras, face / plate recognition; the right not to be captured

• **generative AI**: training-data consent & copyright, synthetic "photographs", and what *photographic evidence* means when any image can be generated (→ Generative AI)

• the book's stance: name each capability *and* its misuse, and prefer transparency (disclose computation) — an editorial line to keep consistent

▸17.3 Computational models of perception⧉

`fig-interp-comparison` · UPSAMPLING ×8 on a real photo (iguana eye + scales): nearest vs bilinear vs **two** bicubics bracketing the sharpness↔aliasing/ringing tradeoff — Catmull–Rom (0, ½) sharp vs Mitchell (⅓, ⅓) smooth; tight zoom (blocky) + larger crop (real pixel size) 🟨

▸17.4 Spatial (and spatio-temporal) vision⧉

17.5 User studies⧉

▸Part 18 IMAGE FORENSICS AND AUTHENTICATION⧉

The rest of this book is mostly about *making* images — better, brighter, sharper, or out of nothing. This part is about *believing* them. Cheap editing tools, and now generative models that synthesize a convincing photograph from a sentence, have severed the old reflexive link between "photograph" and "something that happened." Two responses run in parallel. **Forensics** works on an image you are handed cold — no cooperation, possibly an adversary on the other side — and looks for the statistical and physical fingerprints that authentic capture leaves and editing or generation disturbs. **Authentication** works the other way around: it asks creators and cameras to *sign* their work and every edit to it, so trust comes from a verifiable record rather than a hunt for tells. Neither is sufficient alone — forensics gives evidence, not proof, and provenance is opt-in — so the honest position is that they are complementary.

▸18.1 Image Forensics⧉

The premise of all blind forensics: a real photograph is the output of a specific **physical pipeline** — a particular sensor, a color-filter mosaic, a demosaicking algorithm, a lens, a JPEG encoder, a single illumination of a single 3-D scene — and that pipeline stamps the pixels with **consistent low-level regularities**. Splice two photos together, paint something out, scale a pasted region, or synthesize an image from a network, and those regularities are broken locally or never reproduced. Forensics is the art of modeling the expected regularity and flagging where it fails.

▸18.1.1 The problem and the threat model⧉

• **passive / blind, after the fact, adversarial.** Forensics assumes **no cooperation**: no watermark, no signature, possibly a skilled forger who knows the detectors. You get pixels (maybe a file) and must reason backward. Contrast the next chapter, which assumes cooperation up front.

• **three questions.** *Is it authentic?* (manipulation detection), *what was changed and where?* (localization), and *where did it come from?* (source attribution / was it AI-generated). Different traces answer different questions.

• **why now.** Editing went from darkroom craft to a one-tap "remove object"; generation went from impossible to a text prompt. The barrier to a convincing fake collapsed, so detection had to industrialize.

▸18.1.2 Sensor and pipeline traces: PRNU, CFA, and noise⧉

• **sensor fingerprint — PRNU (photo-response non-uniformity).** Every sensor's photosites have tiny, *permanent*, manufacturing-random gain differences, so a given camera multiplies the scene by a fixed, near-unique noise pattern. Average many of a camera's images (especially flat, bright ones) to estimate its **PRNU fingerprint**; then **correlate** a suspect image's noise residual with candidate fingerprints to (a) **identify the source camera** (forensic "ballistics") and (b) **localize tampering** — a pasted-in region carries a *different* (or no) fingerprint, so its correlation drops out. Lukáš–Fridrich–Goljan's signature result. Robust-ish, but defeatable by strong denoising, re-compression, or deliberate fingerprint transplant.

• **CFA / demosaicking correlations.** A single-chip camera measures one color per pixel (the **Bayer mosaic**) and *interpolates* the rest, which imprints a periodic correlation lattice between a pixel and its neighbors ([[Demosaicking]]). Resaving, resizing, or splicing disturbs that lattice; estimating the local interpolation pattern reveals regions that don't match the camera's demosaicking (Popescu & Farid).

• **noise & resampling inconsistency.** Authentic capture has a coherent, signal-dependent noise level (shot + read, [[Fundamentals]] — Noise). A spliced region from another image, or one that was **scaled/rotated**, carries a *different* noise floor and tell-tale **resampling** periodicities (interpolation correlates neighbors in a detectable way). Local noise-level and resampling maps expose the seams.

• **lens traces.** Optical **chromatic aberration** and **vignetting/lens-distortion** vary smoothly and predictably across a real frame; a pasted object violates the global model (Johnson & Farid). Tie-in to [[Optics, lenses, and aberration correction]].

▸18.1.3 Compression, geometry, and metadata forensics⧉

• **JPEG forensics — the file remembers.** JPEG compresses 8×8 blocks with a quantization table; **re-saving** leaves **double-compression** artifacts in the DCT-coefficient histograms, **blocking-grid misalignment** where a region was pasted off-grid, and **JPEG ghosts** — re-compress the whole image at a quality matching a hidden region's prior compression and that region's error *dips*, making it pop out (Farid). The quantization table itself is a coarse camera/software signature. (Background: [[File formats and compression]].)

• **lighting, shadows, and geometry consistency.** A composite must obey **one** physical scene: a single dominant **light direction** (estimate it from shading and from specular highlights / catchlights in eyes), consistent **shadows** (cast-shadow constraints, the shadow of every object must agree on the light), consistent **reflections**, and consistent **perspective / vanishing points**. Physically impossible lighting is one of the hardest tells for a forger to fake and one of the most convincing for a court (Farid's geometric methods). Ties to the BRDF / shading and projective geometry earlier in the book.

• **copy-move and splicing.** **Copy-move** (clone-stamp within one image — hiding an object by covering it with a patch of the same image) is found by matching **blocks or keypoints** that repeat at an unusual offset (Fridrich et al.; SIFT-based variants — cross-ref [[Automatic panorama stitching from multiple views and feature matching]]). **Splicing** (paste from another image) is caught by the seam/noise/CFA/illumination inconsistencies above.

• **metadata & file-structure forensics.** **EXIF** inconsistencies (a "straight-from-camera" JPEG with editing software in the tags, an embedded **thumbnail** that doesn't match the full image, an impossible timestamp/GPS, a quantization table that no camera uses) are quick, useful tells — but **trivially stripped or forged**, so they corroborate, never prove. Direct tie to [[Appendices#EXIF and image metadata]].

▸18.1.4 Deepfakes, GAN/diffusion fingerprints, and learned detection⧉

• **the synthetic-media problem.** Generative models ([[Generative AI and diffusion]]) make **face swaps / reenactment ("deepfakes")** and **fully synthetic** images that pass the eye. Detection shifts from hand-modeled traces to **learned classifiers** trained on real-vs-fake corpora.

• **generative fingerprints.** Up-sampling layers and the generator architecture leave **frequency-domain artifacts** — periodic grid/checkerboard spectra, unnatural high-frequency statistics — and even **model-specific "fingerprints"** that can attribute an image to a particular generator (Marra, Yu). Wang et al. showed a detector trained on one CNN generator's images can **generalize surprisingly well** with the right augmentation — *for a while*.

• **the generalization wall.** A detector trained on yesterday's generator often **fails on tomorrow's** (new architectures, diffusion vs GAN, post-processing, social-media re-compression). This is the central, unsolved difficulty; benchmarks like **FaceForensics++** and the **DFDC** track it.

• **biological/behavioral cues (faces, video).** Inconsistent **blinking**, pulse (subtle skin-color **video magnification** — cross-ref [[Video magnification]]), lip-sync, head-pose vs gaze, and lighting-on-skin physics catch many face fakes — and get patched in the next generation.

▸18.1.5 Why forensics is evidence, not proof — and the hand-off to provenance⧉

• 💡 **the cat-and-mouse reality.** Every published detector is a **training target** for the next forger; **anti-forensics** deliberately launders fingerprints (denoise, re-add plausible noise, re-compress, adversarial perturbations). Robustness and generalization, not peak accuracy on a fixed benchmark, are what matter.

• **calibrated honesty.** Forensics produces a **likelihood with a false-positive rate**, region maps, and physical arguments a human expert weighs — not a yes/no oracle. Over-claimed "AI detectors" do real harm. The discipline's own honesty is part of the methodology.

• **the structural fix.** Because detection is adversarial and asymptotically losing against ever-better generators, the durable answer is to **stop guessing** and instead **carry trust with the asset** — signed provenance at capture and through every edit. That is the next chapter.

▸18.2 Authentication and Provenance (C2PA)⧉

⬜ figure not yet created

their complementarity [fig-provenance-vs-forensics

`fig-rotation-challenge` · Andrew Adams' rotation challenge: rotate a full turn in N steps of 360/N° (each step bilinear-resamples the PREVIOUS result), for N∈{10,60,360}; track a validity mask through identical rotations so corners that ever leave the frame go black. Result returns to 0° but is resampled N times → accumulated blur/mush + shrinking valid region toward the inscribed disc. More, smaller rotations compound MORE damage (rotate k× by 360/N, not once by k·360/N) ✅

⬜ figure not yet created

**Durable Content Credentials** — manifest + invisible **watermark** + **fingerprint**, so a screenshot that strips the metadata can still be **recovered** by matching the watermark/fingerprint to a provenance store [fig-durable-credentials fig-editing-lr-vs-ps

`fig-cos4-vignetting` · natural vignetting: relative illumination ∝ cos⁴θ falling toward the corner, with an image-corner darkening illustration (companion to `fig-cos4-falloff`, ties θ to image radius + a picture) ✅

Forensics is a losing race in the limit: as generators improve, the pixel-level tells vanish. Authentication changes the game by **not relying on the pixels to confess**. Instead it asks the *honest* parts of the pipeline — the camera, the editing app, the AI tool, the publisher — to **cryptographically sign what they did**, and binds that signed record to the image. A reader then **verifies a signature against a trust list** rather than hunting for artifacts. The catch, which we keep front and center, is that this is **opt-in**: it can prove an image *is* what it claims, but a missing or stripped credential proves nothing — so it complements forensics rather than replacing it.

▸18.2.1 From detection to attestation: trustworthy cameras and watermarking⧉

• **the shift.** Stop asking "does this image betray a fake?" and start asking "**what does this image's signed record say, and does it verify?**" Trust derives from cryptography and a known signer, not from artifacts.

• **the trustworthy camera (the old idea).** Friedman (1993) proposed a camera that **digitally signs each photo at capture** with a private key, so any later edit breaks the signature. The seed of all provenance — held back for decades by missing standards, key management, and the fact that *every* edit (even a legitimate crop) invalidates a naive whole-image signature. C2PA's manifest-of-edits is the modern fix.

• **watermarking — robust, fragile, semi-fragile.** Embed an imperceptible signal in the pixels. **Robust** watermarks survive compression/cropping/screenshotting (used to carry an ID or recover provenance); **fragile** ones break on any edit (tamper *evidence*); **semi-fragile** tolerate benign processing but break on malicious edits. Watermarking is provenance's way to survive when the metadata is stripped (see Durable Content Credentials). Text: Cox–Miller–Bloom.

▸18.2.2 C2PA and Content Credentials: the standard⧉

• **who and what.** **C2PA** — the *Coalition for Content Provenance and Authenticity* — is an open standard formed by merging Adobe's **Content Authenticity Initiative (CAI)** with **Project Origin** (BBC, Microsoft), now backed by Adobe, Microsoft, BBC, Intel, **Sony, Nikon, Leica, Canon**, Truepic, Google, OpenAI, and others. The consumer-facing brand is **Content Credentials** ("a nutrition label for digital content").

• **the manifest.** Attached to (or associated with) the asset is a **manifest**: a **tamper-evident, cryptographically-signed** bundle of **assertions** ("claims") — e.g. *capture device & settings*, *date/time/location*, the **list of edits** performed, **thumbnails** of intermediate states, **AI-generation disclosure** (was a generative model used, and how), and **who signed**. Change a pixel without updating the manifest and the **hash** no longer matches → tamper is evident.

• **the claim signature & trust.** The manifest is signed with an **X.509 certificate**; verification walks the **certificate chain** to a **trust list** of accepted signers (cameras, apps, publishers). So "verified" means *this manifest was signed by a key we recognize and the bytes haven't changed since* — **not** that the content is "true," only that its provenance record is **authentic and intact**.

• **hard vs soft binding.** **Hard binding** = a cryptographic **hash of the pixels** in the manifest (any pixel change is detected, but the credential is lost if the file is re-encoded or the manifest is stripped). **Soft binding** = an imperceptible **watermark** and/or a perceptual **fingerprint** that lets a stripped or re-encoded asset be **re-matched** to its manifest in a provenance store. C2PA supports both.

• **ingredients and provenance chains.** Edits **compose**: open a photo, crop it, color-grade it, paste in an element, export. Each step records its **ingredients** (the assets it consumed) and adds a **signed assertion**, building an **auditable lineage** from capture to publish. You can trace where a published image's pieces came from.

• **Durable Content Credentials.** Because metadata is easily stripped (a screenshot, a re-upload), Adobe/CAI pair the **signed manifest + invisible watermark + fingerprint** so provenance is **recoverable** even after stripping: match the watermark/fingerprint to the cloud provenance store and re-surface the credential. This is the practical answer to "but you can just screenshot it."

• **verification & UX.** Published assets carry a small **"cr" Content Credentials pin**; clicking it (or dropping the file into *verify.contentauthenticity.org*) shows the manifest — device, edits, AI use, signer. The whole point is a **one-click, human-readable** provenance view.

▸18.2.3 AI disclosure, watermarking, and the regulatory push⧉

• **disclosing the synthetic.** Generative tools increasingly **attach C2PA at creation**: Adobe **Firefly**, OpenAI's image models, Microsoft, and others stamp "**AI-generated**" assertions, so synthetic content is *labeled by the maker* rather than *caught by a detector*. This is provenance's most immediate, highest-leverage use.

• **AI watermarking.** Complementary invisible watermarks designed for *generated* content — Google DeepMind's **SynthID** (for images, audio, and text) — let a generator mark its own output so it can later be recognized even without metadata. Pairs with C2PA's soft binding.

• **regulation.** The **EU AI Act** (and similar proposals) push **mandatory labeling/transparency** for AI-generated and manipulated media, which is accelerating provenance adoption across cameras, editors, platforms, and model providers.

▸18.2.4 Limits, critiques, and the forensics partnership⧉

• ⚠️ **opt-in: absence is not guilt.** No credential ≠ fake; a credential ≠ "true." Provenance proves **origin and integrity of the record**, not the *honesty* of the scene. A staged photo can carry a perfectly valid credential.

• ⚠️ **strippable (mitigated, not solved).** Plain metadata is removed by a screenshot or a re-upload; **Durable Content Credentials** (watermark + fingerprint + cloud store) recover it, but watermarks can be attacked and fingerprints are probabilistic.

• ⚠️ **governance & trust lists.** Who decides which signers are trusted? Centralized trust lists carry power and single points of failure; key compromise is a real risk.

• ⚠️ **privacy.** Rich provenance can leak **location, device serials, and identity** (echoing the EXIF privacy issue) — so the standard makes many assertions optional and redactable.

• **the synthesis.** **Forensics** (detect, no cooperation, adversarial) and **authentication** (attest, cooperative, opt-in) are two halves of one trust system: provenance makes honesty cheap and verifiable; forensics handles everything *without* a credential. Trust on the modern internet of images comes from **both**, plus the human judgment the [[Human factors and the art of photography|ethics]] discussion frames.

▸Part 19 SYSTEMS⧉

19.1 Programmable cameras⧉

19.2 Image processing libraries⧉

19.3 Lightroom-style raw developers⧉

19.4 Photoshop-style editors⧉

19.5 Networking and image transport⧉

19.6 Photography programming on phones⧉

▸Part 20 PERFORMANCE ENGINEERING AND HALIDE⧉

▸20.1 8-bit and fixed-point arithmetic⧉

• how about fp16???

▸20.2 Algorithmic speedups⧉

• separable

• recursive filters

• **integral images / summed-area tables** — precompute a running 2-D prefix sum of the image once; afterwards **any** axis-aligned box sum is just **four array lookups** ($S_{D}-S_{B}-S_{C}+S_{A}$), independent of the box size, so a box filter is **O(1) per pixel** at any radius (and multi-scale features cost the same). First introduced in graphics as the **summed-area table** (Crow 1984) for mip-free texture filtering, then **re-invented as the "integral image"** by **Viola & Jones (2001)** to sum Haar-like features fast enough for real-time face detection — a clean example of the same trick rediscovered across fields. (Watch precision: the running sum grows large, so it wants wide integer/float accumulators.)

▸20.3 Performance engineering and scheduling (Halide)⧉

• Modern machines

• challenges: parallelism, locality/cache behavior

• solutions: slice, fuse

• **Why low-level matters — the language speed differential**: when you write the **low-level pixel code yourself** (e.g. **your own convolution**), the speed gaps are dramatic — in Fredo's experience roughly **~1000× from Python to optimized C++**, and **at least another ~10× from C++ to Halide**, won back through scheduling (tiling, vectorization/SIMD, multithreading, producer-consumer fusion). This is exactly **why** Python is right for prototyping, glue, and deep learning, but **C++/Halide earn their place for the inner loops**. See [[Appendices#Programming: Python, C++, and PyTorch]] and [[Intro#What language for computational photography?]].

• **historical example — Photoshop 1.0 (1990), the split done by hand**: the discipline predates the tools. Thomas Knoll wrote **Photoshop 1.0** ≈75 % in **Pascal** but dropped into **68000 assembly** for the speed-critical inner loops — *productive language for the bulk, hand-tuned machine code for the hot pixels* (≈128,000 lines total; source released by the Computer History Museum, 2013). That **manual** algorithm-vs-hand-optimization split is exactly what **Halide** automates with its algorithm/schedule separation — the same idea, now mechanized and retargetable. (Full tidbits in the language sidebar, [[Intro#What language for computational photography?]].)

• Halide

• **the scheduling primitives — `compute_at` and the locality/recompute trade-off**: a Halide schedule is mostly choosing, for each stage, **where it is computed relative to its consumer** — `compute_root` (compute the whole stage once, store it all), `compute_at` (compute just the slice needed inside the consumer's loop, trading **recomputation** for **locality / less memory traffic**), and `store_at`/`fold` in between. This producer–consumer granularity *is* the core of the algorithm/schedule split. [📺 **embed Fredo's `compute_at` explainer** → https://www.youtube.com/watch?v=ViFfigvV418 — see [[Video Resources]]]

• **Benchmarking and the development loop** (the part people skip): you do **not** get fast code by *one-shotting* it — not by hand, and not by asking an LLM to write it in one go. Fast code comes from a **loop**: change the code → **verify correctness** → **measure** performance → repeat. Wire an LLM into exactly that loop (give it a correctness check and a benchmark whose output it can read) and it will converge to fast *and* correct code; the bottleneck is that the model needs some **benchmarking hygiene** trained in, or it does naïve things — e.g. running several benchmarks **in parallel**, so they fight over cores and cache and every number is garbage.

• **a little statistics is mandatory.** A trap worth naming: run *one* benchmark several times, see that the runtimes are **stable**, and conclude your measurements are reliable — then run a **thousand different** benchmarks in one big batch and read the **outliers** as real performance regressions. Low variance *within* a single benchmark tells you nothing about the **tail across a thousand** of them: with that many trials, extreme outliers are *expected from noise alone*, so without accounting for the multiple comparisons (and for shared-machine contention) you will chase regressions that aren't there.

▸20.4 Hardware backends: GPU, NPU, DSP⧉

*(Parked here for now; the real-world ways imaging code is made fast on devices, beyond hand-written C++/Halide.)*

• **GPU compute — OpenCL / Vulkan for image processing**: the per-pixel work (filters, pyramids, warps, blending, demosaicking) is **embarrassingly parallel**, so it maps naturally to the GPU. **OpenCL** kernels and **Vulkan compute shaders** are the cross-platform ways to run it on mobile and desktop GPUs; **Halide can target these backends**, so the same algorithm retargets from CPU SIMD to GPU compute. The GPU is the workhorse for **real-time** image processing (camera preview, video).

• **On-device ML framework (e.g. LiteRT)**: trained imaging models — denoise, super-resolution, segmentation/portrait, semantic ISP, HDR fusion — increasingly run **locally on the device** for latency, privacy, and offline use. **LiteRT** (formerly TensorFlow Lite), Core ML, and ONNX Runtime Mobile load **quantized (int8/fp16)** models and **delegate** the heavy layers to the GPU or NPU. Cross-ref [[Deep learning]] (the models) and [[On-device, mobile and real-time considerations]] if present.

• **NPU acceleration for imaging models**: dedicated **neural processing units** (Apple **Neural Engine**, Qualcomm **Hexagon** AI, Google **Tensor** TPU) run CNN/transformer inference at **far higher throughput per watt** than CPU/GPU — the engine behind Night mode, learned demosaicking/denoising, face & scene understanding in the live pipeline. The scheduling problem shifts from cache tiling to **operator fusion, quantization, and keeping the NPU fed**.

• **DSP programming for real-time 3A control**: the **3A** loop — **auto-exposure, auto-focus, auto-white-balance** — needs per-frame scene statistics and control at video rate with minimal power, so it typically runs on a low-power **DSP** (e.g. Hexagon) inside the camera subsystem/ISP rather than the application CPU. Real-time, deadline-driven, and tightly coupled to the sensor — a different optimization regime from batch image processing.

20.4.1 Karima’s search⧉

▸Part 21 CONCLUSIONS, DISCUSSION⧉

Measurement, generation

Sociology

▸21.1 Recap in context⧉

21.1.1 Modern phones, multiple apertures, pano, HDR+⧉

21.1.2 Recap: a modern mirrorless camera⧉

21.1.3 Recap: a modern cell phone multi camera⧉

21.1.4 Lightroom⧉

21.1.5 Photoshop⧉

▸Part 22 APPENDICES⧉

▸22.1 Refreshers⧉

The book is built on a small kit of mathematical tools, used over and over: images are **vectors**, operations on them are **linear maps** (so we get to use all of linear algebra and the Fourier transform), recovering an image from measurements is an **optimization**, noise is **probability**, and the modern workhorses are **learned**. This appendix is the kit — each tool sketched just far enough to read the chapters that use it, with a pointer to where to go deeper.

▸22.1.1 Linear algebra⧉

`fig-matrix-as-transform` · a 2×2 matrix as a linear map: unit grid + basis vectors → deformed parallelogram under A (Linear algebra) 🟩

`fig-vector-projection` · projection & orthogonality: v̂ = ⟨u,v⟩/⟨u,u⟩·u and the perpendicular residual v−v̂ (Linear algebra) 🟩

• **Vectors and what they mean here.** A length-$n$ list of numbers; geometrically a point/arrow in $n$-D space. **An image *is* a vector** — stack the pixels into one long column (fig-image-as-vector): a megapixel image is a point in a million-dimensional space. Everything we do to images is geometry in that space.

• **Inner product, norm, angle.** $\langle u,v\rangle=u^\top v$ measures alignment; $\|v\|=\sqrt{\langle v,v\rangle}$ is length; orthogonal means $\langle u,v\rangle=0$. The inner product is how a cone "reads" a spectrum, how a filter responds to a patch, how we measure image similarity.

• **Matrices as linear maps.** $y=Ax$ — a matrix sends vectors to vectors, linearly (preserves sums and scaling). Visualize a 2×2 $A$ as deforming the unit square / grid into a parallelogram (fig-matrix-as-transform). A blur, a color-space conversion, a resampling, a perspective warp — all are (or linearize to) $y=Ax$.

• **Linear systems and inverse problems.** Imaging forward models read $y=Ax$ (scene $x$ → measurement $y$); recovering $x$ is solving the system / inverting $A$. Usually $A$ is huge, rank-deficient or noisy, so we don't literally invert — we optimize (next sections). Square vs over/under-determined; rank; null space (what the measurement *cannot* see).

• **Bases, projection, orthogonality.** A basis is a coordinate system; changing basis re-expresses the same vector. Projecting $v$ onto $u$: $\hat v=\frac{\langle u,v\rangle}{\langle u,u\rangle}u$ (fig-vector-projection). In an **orthonormal** basis $\{e_i\}$ the coordinates are just inner products, $x=\sum_i\langle e_i,x\rangle e_i$ — clean, and energy-preserving (Parseval). This is why we love orthogonal bases (Fourier, orthonormal wavelets); **non-orthogonal, non-negative** bases (the cones/primaries of [[Color technology]]) need a *dual* basis and are the source of color's trickiness.

• **Eigenvectors and the SVD.** Eigenvectors are directions a matrix only scales; for symmetric matrices they're orthogonal (the spectral theorem) — diagonalizing = finding the basis where the operator is just per-axis scaling. The **SVD** $A=U\Sigma V^\top$ generalizes this to any matrix: rotate ($V^\top$), scale ($\Sigma$), rotate ($U$) — the unit circle becomes an ellipse whose axes are the singular values (fig-svd-geometry). It underlies low-rank approximation, PCA, the pseudo-inverse, and the conditioning of deblurring (small singular values → noise blow-up).

• **Tie to the book's recurring move:** convolution is a linear operator that is *diagonalized by the Fourier basis* ([[Linearity, Fourier, Aliasing and deblurring]]); PCA diagonalizes a covariance; pyramids/wavelets are change-of-basis. Same idea each time — pick the basis where the operator is simple.

equations

$y = Ax$

inner product $\langle u,v\rangle = u^\top v$

projection $\hat v = \frac{\langle u,v\rangle}{\langle u,u\rangle}u$

orthonormal expansion $x = \sum_i \langle e_i,x\rangle e_i$

SVD $A = U\Sigma V^\top$

▸22.1.2 Calculus: derivatives, gradients, integrals⧉

• **Derivative = slope = rate of change.** $f'(x)$ is the tangent slope (fig-derivative-gradient, left). In images, the **discrete derivative** (finite difference, neighbor minus neighbor) is large at **edges** — the basis of Sobel gradients and sharpening ([[Neighborhood operations and convolution]]).

• **Partial derivatives and the gradient.** For $f(x_1,\dots,x_n)$, the gradient $\nabla f$ stacks the partials; it points in the direction of **steepest ascent**, and its magnitude is the steepness (fig-derivative-gradient, right: arrows on a contour map). Edge *orientation* is the direction of the image gradient.

• **Jacobian and Hessian — the multivariable kit.** A map that takes a vector and returns a vector, $\mathbf f:\mathbb R^n\to\mathbb R^m$, has a **Jacobian** $J$: the $m\times n$ matrix of all first partials $J_{ij}=\partial f_i/\partial x_j$ — the best **linear approximation** of $\mathbf f$ near a point (the gradient is just the scalar, $m{=}1$, case), and composing maps **multiplies** their Jacobians (the vector chain rule that backprop runs on; also how a warp is linearized locally). A **scalar** $f$ has a **Hessian** $H$: the $n\times n$ matrix of *second* partials $H_{ij}=\partial^2 f/\partial x_i\partial x_j$, i.e. **curvature**. At a stationary point ($\nabla f=0$) a **positive-definite** Hessian certifies a genuine **minimum** (vs a saddle), and **Newton's method** steps by $H^{-1}\nabla f$ — second-order optimization (Gauss–Newton for least squares). Keep it conceptual; the Jacobian is the one you meet constantly.

• **Chain rule.** Derivative of a composition — the engine of **backpropagation**: a deep network is a long composition, and training differentiates the loss through it by repeated chain rule. State it once here so the ML section can lean on it.

• **Integrals = accumulation.** Area under a curve; in imaging a **pixel value is an integral** — radiance integrated over the sensor's area, solid angle, exposure time, and spectral response ([[Image formation and linear perspective]]). Continuous vs discrete (sums) — we constantly move between them (sampling, [[Resampling]]).

• **Convexity, minima, and why gradients matter.** A function is minimized where $\nabla f=0$; if it's **convex** (bowl-shaped) that's the global minimum. This is the bridge to optimization: follow $-\nabla f$ downhill.

equations

derivative $f'(x)=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}$

gradient $\nabla f=(\partial f/\partial x_1,\dots)$

chain rule $\frac{d}{dx}f(g(x))=f'(g)g'(x)$

**Jacobian** $J_{ij}=\partial f_i/\partial x_j$ (the $m\times n$ derivative of a vector→vector map

$=\nabla f$ when scalar)

**Hessian** $H_{ij}=\partial^2 f/\partial x_i\partial x_j$ (the $n\times n$ matrix of second partials — curvature)

integral $\int f$

▸22.1.3 Optimization and regression⧉

⬜ figure not yet created

fig-overfitting

• **The pattern: loss = data term + prior.** Almost every recovery problem is $\min_x \underbrace{\|Ax-y\|^2}_{\text{fit the data}}+\lambda\, \underbrace{R(x)}_{\text{prefer plausible }x}$. Denoise, deblur, demosaic, HDR-merge, white-balance — same skeleton, different $A$ and $R$.

• **Least squares & the normal equations.** Fitting a line/plane/model by minimizing squared residuals (fig-least-squares: points, fit, vertical residuals); closed form $A^\top A\hat x=A^\top y$ (and the SVD/pseudo-inverse when $A^\top A$ is singular). Why *squared* error — Gaussian-noise MLE (ties to the probability section), and it's the one that makes the math linear; its blindness to outliers motivates robust losses.

• **Gradient descent.** When there's no closed form (huge or nonlinear problems), step downhill: $x_{t+1}=x_t-\eta\nabla f$ (fig-gradient-descent: descent path on a contour bowl). Learning rate $\eta$ — too big overshoots/diverges, too small crawls. Variants in one breath: stochastic GD (a minibatch per step — how nets train), momentum/Adam. Convex → one basin; non-convex (deep nets) → local minima, and yet it works.

• **Ill-posedness & regularization.** Inverse problems often have many $x$ explaining $y$ (or noise blows up the inverse). A **regularizer** $R(x)$ encodes a prior — smoothness (Tikhonov), sparsity ($\ell_1$, total variation), or a learned prior. $\lambda$ trades data-fit against prior. This is the principled cure for the deblurring noise-amplification of [[Linearity, Fourier, Aliasing and deblurring]].

• **Overfitting and the bias–variance trade-off.** Fit the noise and you generalize badly (fig-overfitting: under-/just-right/over-fit polynomials). Capacity, regularization, and held-out validation — the same dial as $\lambda$, and the bridge to the ML section.

equations

least squares $\hat x=\arg\min_x\|Ax-y\|^2$, normal equations $A^\top A\,\hat x=A^\top y$

gradient step $x_{t+1}=x_t-\eta\nabla f(x_t)$

regularized $\min_x\|Ax-y\|^2+\lambda R(x)$

▸22.1.4 Probability and information theory⧉

⬜ figure not yet created

fig-gaussian

`fig-bayes-posterior` · Bayes: posterior ∝ likelihood × prior (three curves) (Probability) 🟩

`fig-entropy-coin` · binary entropy H(p): 1 bit at p=0.5, 0 at the ends (Information theory) 🟩

• **the basics, in one breath.** A **probability** is a number in $[0,1]$; **independent** events **multiply** (AND), **mutually-exclusive** ones **add** (OR). A **distribution** says how probability is spread over the possible outcomes; its **expectation** $E[X]$ is the long-run average and its **variance** the spread. That is the whole vocabulary the rest of this section leans on — kept deliberately simple.

• **Random variables, distributions, expectation, variance.** A pixel reading is a random variable around the true value; mean = the value we want, **variance/std = the noise**. Expectation is linear; variances of independent things add.

• **The two distributions we actually use.** **Gaussian** $\mathcal N(\mu,\sigma^2)$ — the bell curve (fig-gaussian: μ, ±σ), read noise, the central limit theorem, and the reason squared error is "natural." **Poisson** — photon counting: shot noise with $\mathrm{variance}=\mathrm{mean}$, so std $=\sqrt N$ → SNR is **worse in the shadows** ([[Image formation and linear perspective]]).

• **Independence and averaging.** Average $N$ independent noisy measurements → variance drops by $N$, std by $\sqrt N$. This single fact is why we stack frames, why bigger pixels are cleaner, and the whole "quality through quantity" thread.

• **Conditional probability and Bayes.** $p(x\mid y)\propto p(y\mid x)\,p(x)$ (fig-bayes-posterior: prior × likelihood → posterior). **MAP estimation** = maximize the posterior = the optimization "data term ($\log$ likelihood) + prior ($\log$ prior)" of the last section — the two views are the same. Discounting the illuminant (white balance), denoising, and demosaicking are all Bayesian estimates.

• **Information & entropy.** $H=-\sum p\log p$ = average surprise = the bits needed to encode a source (fig-entropy-coin: $H(p)$ peaks at $p=0.5$). Low entropy = predictable = compressible — the theory behind entropy coding in [[File formats and compression]]. Mutual information as "shared bits." Keep it light: enough to know *why* JPEG can shrink an image and *why* it can't shrink random noise.

equations

mean $\mu=E[X]$, variance $\sigma^2=E[(X-\mu)^2]$

Gaussian $\mathcal N(\mu,\sigma^2)$

Poisson (shot noise) $\mathrm{var}=\mathrm{mean}$

Bayes $p(x\mid y)\propto p(y\mid x)p(x)$

entropy $H=-\sum p\log p$

▸22.1.5 Machine learning and deep learning⧉

⬜ figure not yet created

fig-neuron

`fig-train-val-loss` · training vs validation loss; validation U-turn = overfitting / early stop (ML / deep learning) 🟩

• **Supervised learning = curve fitting, scaled up.** Given pairs $(x_i,y_i)$, find $f_\theta$ minimizing average loss $\mathcal L(\theta)$ — exactly the regression of two sections ago, with millions of parameters. Train/validation/test split; generalization is the goal, not memorization (callback to fig-overfitting).

• **The artificial neuron and layers.** A neuron computes $\sigma(w^\top x+b)$ — a weighted sum (an inner product!) through a nonlinearity (fig-neuron). Stack them into layers → a **multilayer perceptron**; a deep net is a long composition $f_\theta=f_L\circ\dots\circ f_1$ of such building blocks, hence very expressive.

• **Training = gradient descent + backprop.** Minimize $\mathcal L(\theta)$ by SGD; **backpropagation** is the chain rule (calculus section) computing $\nabla_\theta\mathcal L$ efficiently through the composition. Autodiff frameworks (PyTorch) do this for you (programming section).

• **Convolutional networks — the imaging-native architecture.** Weight-shared, local, shift-equivariant filters — literally learned convolutions ([[Neighborhood operations and convolution]]), so they fit images naturally; encoder–decoder / U-Net shapes for image-to-image tasks (denoise, demosaic, super-res). Brief, with forward-ref to the advanced books for depth.

• **What changes vs classical methods.** Same "loss = data-fit + prior" skeleton; the **prior is now learned** from a dataset rather than hand-designed. Strengths (state-of-the-art quality), costs (data, compute, failure modes, hallucinated detail — tie to the ethics discussion in [[Human factors and the art of photography]]).

equations

model $\hat y=f_\theta(x)$

loss $\mathcal L(\theta)=\frac1N\sum_i \ell(f_\theta(x_i),y_i)$

SGD $\theta\leftarrow\theta-\eta\nabla_\theta\mathcal L$

a neuron $\sigma(w^\top x+b)$

▸22.1.6 Programming: Python, C++, and PyTorch⧉

⬜ figure not yet created

none — a prose + small-code-snippet section

• **Why two languages.** The book renders as a **Python book** *or* a **C++ book** from one source ([[Book Generation Process]]); this section is the shared mental model. Python/NumPy: arrays, vectorization, broadcasting — read like the math. C++: the same operations when speed/memory matters (ISP, Halide).

• **Images as arrays/tensors.** An image is an $H\times W\times C$ array (NumPy ndarray / `std::vector` / `torch.Tensor`); operations are whole-array (vectorized) not per-pixel loops — the linear-algebra view made executable. Indexing, slicing, broadcasting; row-major memory and why loop order matters (forward-ref to [[Performance engineering and Halide]]).

• **The speed differential.** For a low-level inner loop you write yourself (e.g. your own convolution), expect roughly **~1000× Python→C++** and **~10× more again for an optimized [[Performance engineering and Halide|Halide]] schedule** — why hot loops leave Python; see [[Performance engineering and Halide]].

• **PyTorch in one paragraph.** Tensors that live on CPU or GPU, plus **autograd** — every operation records a graph so `loss.backward()` gives gradients for free (the backprop of the ML section). This is why modern imaging code is written in it even when no "learning" is involved: free derivatives + GPU.

• **Numerical hygiene.** Floats vs ints, value ranges and clipping, **linear vs gamma space** before arithmetic (the encoding Big Lesson), in-place vs copy, NaNs/inf from divide-by-zero — the bugs that actually bite, cross-ref [[Developing, Testing and Debugging]].

• **Pointers to practice.** The dual-language listings, the `imageops`/`figstyle` style, and [[Exercises & Experiments]] warmups (load an image → it's a NumPy array → do arithmetic → save) that make all of the above concrete.

equations

none

22.2 Problem Set 0 — Environment and C++ basics⧉

22.3 Problem Set 1 — Image class, point operations, and color⧉

22.4 Problem Set 2 — Convolution and the bilateral filter⧉

22.5 Problem Set 3 — Denoising and demosaicking⧉

22.6 Problem Set 4 — High dynamic range⧉

22.7 Problem Set 5 — Resampling, warping, and morphing⧉

22.8 Problem Set 6 — Homographies and manual panoramas⧉

22.9 Problem Set 7 — Automatic panoramas⧉

22.10 Problem Set 8 — Non-photorealistic rendering⧉

22.11 Problem Set 9 — Make-your-own, video, and ethics⧉

▸22.12 EXIF and image metadata⧉

• **What EXIF is.** Structured **metadata embedded in the image file itself** (JPEG, TIFF, and most RAW/DNG), not a sidecar — the camera's record of *how* the shot was taken. In JPEG it rides in the **APP1 marker segment**; the payload is a little **TIFF/IFD** structure (tags → values). Standardized as **Exif** (JEITA/CIPA), with **DCF** governing on-card file/folder naming.

• **EXIF is not the only block.** A photo commonly carries two more metadata standards alongside EXIF: **XMP** (Adobe's XML-based metadata — where Lightroom/ACR edit settings, keywords, and ratings live, often in a **sidecar `.xmp`** for raw files) and **IPTC** (the press/captioning standard — caption, byline, credit, keywords, location). The three **co-occur** and can **disagree** (e.g. a date or copyright string present in more than one and edited in only one); a tool like `exiftool` reads all three at once.

• **Field groups** — the metadata sorted by what it describes:

• **capture settings**: exposure time / **shutter speed**, **f-number** (aperture), **ISO**, **exposure program / mode**, metering mode, **exposure bias** (compensation), flash fired/mode

• **optics**: **focal length** (and **35 mm-equivalent**), lens make / model, subject / focus distance

• **image**: pixel width & height, **orientation** flag (the in-camera rotation the viewer must apply), **color space** / embedded ICC profile, white-balance tag

• **time**: **DateTimeOriginal** (when the photo was taken) vs **DateTimeDigitized** (when it was written/scanned) — they differ for film scans and edits; plus sub-second and time-zone tags

• **place**: **GPS** latitude / longitude / altitude (and heading, timestamp)

• **MakerNotes**: proprietary, undocumented per-vendor blobs (often where the interesting bits — true ISO, lens corrections, shot count — hide)

• **embedded thumbnail / preview**: a small JPEG (and, in RAW, a full-res preview) carried alongside the full image

• **identity / file**: software, artist / copyright, camera & lens serial numbers, owner name

• ⚠️ **caveats** (why EXIF frustrates): coverage is **incomplete** (not every camera writes every field) and **inconsistent** across makers; some values — notably **ISO** and **shutter speed** — are **approximate / rounded** and **not radiometrically correct**, so never read them as calibrated radiometry.

• 🔒 **privacy.** EXIF leaks more than people expect: **GPS** coordinates (home/where a photo was taken), camera & lens **serial numbers** (links photos to one device), and **owner name** / copyright. Many platforms strip EXIF on upload; if sharing matters, strip it yourself (`exiftool -all=`).

• **Tools.** `exiftool` (the reference, widest tag coverage), `exiv2`; in Python `piexif` and `Pillow` (`Image.getexif()`), the C `libexif`.

• **How the book's pipeline reads EXIF.** The capture chapter pulls orientation, exposure triangle, and lens info from EXIF to annotate and correctly display the book's own photos — see [[Basic image processing and ISP]] → "Beyond the pixels: basic metadata and EXIF" and "Data versus metadata: EXIF".

▸22.13 DNG: the Digital Negative⧉

• **What DNG is.** An **open, documented raw format** introduced by **Adobe in 2004**, built on **TIFF/EP**: a single, vendor-neutral, **archival** container for camera raw data — *one* published format instead of a proprietary raw per camera model (Canon CR2/CR3, Nikon NEF, Sony ARW, Fuji RAF, …). The spec is public, so anyone can read and write it.

• **The problem it solves.** Every camera ships its **own undocumented raw**; software must reverse-engineer each one, and old proprietary raws risk becoming **unreadable** as cameras and decoders age. DNG is a **stable, documented target** for long-term archiving and interchange — a "digital negative" you can still open decades later.

• **What's inside (the container).** A **TIFF/IFD** structure (same family as EXIF): the **raw image data**, full **EXIF + MakerNote** metadata, **XMP** (edit settings, ratings), one or more embedded **JPEG previews** + a **thumbnail**, and the **color/calibration** data a converter needs to render the raw.

• **Two flavors of raw payload.** **Mosaic ("raw") DNG** keeps the original **Bayer CFA** samples — demosaick later, the usual case (→ [[Demosaicking]]); **Linear DNG** stores already-**demosaicked** data (one RGB triple per pixel) — used after processing/remosaic or for non-Bayer sensors.

• **Color-rendering data (how raw → color).** DNG carries the math to turn sensor counts into colorimetric values: **ColorMatrix / ForwardMatrix / CameraCalibration** tags for (typically) **two reference illuminants** (e.g. Standard-A and D65) interpolated by correlated color temperature, plus **AsShotNeutral / AsShotWhiteXY** (the camera's white-balance estimate). **DNG camera profiles (DCP)** add hue/saturation/tone maps. (→ [[Color technology]] — camera color matrices & white balance.)

• **DNG opcodes.** A small list of **opcodes** the raw processor applies at decode: **WarpRectilinear / WarpFisheye** (geometric distortion correction), **FixVignetteRadial** (vignetting), **GainMap** (per-channel lens-shading / color cast), **TrimBounds**, defective-pixel maps. This is how DNG bakes in **lens corrections declaratively** (→ Optics, lens-profile correction).

• **Compression.** **Lossless JPEG** (default) or **uncompressed**; an optional **lossy DNG** trades some fidelity for size. (→ [[File formats and compression]].)

• **Embed-the-original.** A DNG can **wrap the original proprietary raw** inside itself, so conversion is **reversible** (extract the untouched original later).

• **Adoption & relatives.** Native raw on some cameras (**Leica, Pentax/Ricoh**), most Android phones via the **Camera2** API, and Apple **ProRAW** (which *is* DNG); the **Adobe DNG Converter** transcodes proprietary raws; **CinemaDNG** is the motion/video raw variant. Not universal — **Canon, Nikon, Sony** keep their own raw formats.

• **Trade-offs.** *Pros*: open, documented, archival; one pipeline; embedded corrections + profiles; lossless and often smaller than the native raw. *Cons*: a transcoding step; some maker-note / "secret-sauce" raw features may not map; not every tool round-trips a DNG perfectly.

• **Relation to EXIF.** DNG **contains** EXIF and XMP — it is a **superset container**, so the trust/coverage caveats of [[Appendices#EXIF and image metadata]] apply to the metadata it carries.

▸22.14 Datasets⧉

A pointer index, not a survey: the workhorse public datasets behind the methods in this book, grouped by task. Each entry is a name, a one-line description, and a link.

**Classification / features**

• **ImageNet** — million-image labelled classification set; the pretraining backbone behind most vision features. https://image-net.org

• **COCO** — common objects in context: detection, segmentation, captioning. https://cocodataset.org

• **Places** — scene-recognition dataset (millions of images, hundreds of scene categories). http://places2.csail.mit.edu

**Super-resolution**

• **DIV2K** — 2K-resolution high-quality images, the standard SR training set. https://data.vision.ee.ethz.ch/cvl/DIV2K

• **Flickr2K** — 2K Flickr images, often paired with DIV2K for training. *(verify URL)* https://github.com/limbee/NTIRE2017

• **Set5 / Set14 / BSD100 / Urban100** — small standard SR **test** sets (classic images, natural scenes, urban self-similar structure). *(verify URL)* https://github.com/jbhuang0604/SelfExSR

**Deblurring & restoration**

• **GoPro (Nah et al. 2017)** — sharp/blurry video-frame pairs synthesized from high-fps GoPro footage; the standard dynamic-scene motion-deblurring benchmark. *(verify URL)* https://seungjunnah.github.io/Datasets/gopro

• **REDS** — REalistic and Dynamic Scenes (NTIRE challenge set): high-quality video for deblurring, super-resolution, and denoising. *(verify URL)* https://seungjunnah.github.io/Datasets/reds

**Denoising**

• **SIDD** — Smartphone Image Denoising Dataset: real noisy/clean smartphone pairs. https://www.eecs.yorku.ca/~kamel/sidd/

• **DND** — Darmstadt Noise Dataset: real-photograph denoising benchmark (held-out ground truth). https://noise.visinf.tu-darmstadt.de

• **Kodak** — the 24 "kodim" lossless test images, a long-standing denoising/compression benchmark. *(verify URL)* https://r0k.us/graphics/kodak/

**HDR / tone mapping**

• **HDR+ burst dataset** — Google's raw burst dataset behind HDR+ computational photography. https://hdrplusdata.org

• **Kalantari HDR (dynamic scenes)** — multi-exposure bursts with motion, for HDR merging with moving content. *(verify URL)* https://www.robots.ox.ac.uk/~szwu/storage/hdr/kalantari17.html

• **Laval HDR (sky / indoor)** — high-dynamic-range outdoor-sky and indoor panoramas (lighting estimation). *(verify URL)* http://hdrdb.com

• **Fairchild HDR Photographic Survey** — calibrated HDR scenes for tone-mapping evaluation. *(verify URL)* http://markfairchild.org/HDR.html

**Retouching / enhancement**

• **MIT-Adobe FiveK** — 5,000 raw photos each retouched by five expert artists; the standard learned-enhancement set. https://data.csail.mit.edu/graphics/fivek

**Depth / flow**

• **NYU Depth V2** — indoor RGB-D (Kinect) for monocular depth. https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html

• **KITTI** — driving stereo / depth / flow benchmark. https://www.cvlibs.net/datasets/kitti/

• **Middlebury stereo** — classic high-accuracy stereo benchmark. https://vision.middlebury.edu/stereo/

• **MPI Sintel** — synthetic optical-flow benchmark with long-range, large-motion sequences. http://sintel.is.tue.mpg.de

**Color / white balance**

• **NUS / Gehler-Shi** — color-constancy sets with measured ground-truth illuminant (color-checker in scene). *(verify URL)* https://www2.cs.sfu.ca/~color/data/shi_gehler/

• **Cube+** — single-illuminant color-constancy images with a calibration cube. *(verify URL)* https://ipg.fer.hr/ipg/resources/color_constancy

**Faces**

• **CelebA / CelebA-HQ** — celebrity faces with attributes; HQ is the high-res variant for generative work. https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

• **FFHQ** — Flickr-Faces-HQ: 70k high-quality aligned faces (the StyleGAN set). https://github.com/NVlabs/ffhq-dataset

• **LFW** — Labeled Faces in the Wild: the classic face-verification benchmark. http://vis-www.cs.umass.edu/lfw/

**Inpainting / segmentation / matting**

• **Places2** — same scene corpus as *Places / Places2* (Classification, above); also the standard set for **inpainting** and scene parsing.

• **ADE20K** — densely annotated scene parsing / semantic segmentation. https://groups.csail.mit.edu/vision/datasets/ADE20K/

• **Composition-1k** — the standard alpha-matting benchmark (foregrounds composited over many backgrounds). *(verify URL)* https://sites.google.com/view/deepimagematting

▸22.15 A camera-feature wish list⧉

• **framing**: most asks are not hard *algorithms* — the book covers them — they are missing because of conservative firmware, cramped UI, and closed software. The list is the book held up as a mirror to real products.

• **exposure, ISO, HDR**: tiered **auto-ISO** (minimum shutter speed as a first-class rule); **ambient/flash balance** control; **in-camera HDR bracket → 16-bit raw merge** (→ [[HDR merging]]); **auto-ETTR** with a pull flag recorded in metadata; **frame averaging** for a low-ISO, low-noise long exposure (→ [[Denoising]], the $\sqrt{N}$ argument).

• **bracketing the controls**: alternate settings *within a burst* — flash/no-flash pairs, slow+fast shutter pairs, and **aperture bracketing** at constant exposure for post-hoc depth-of-field choice (→ [[Depth of field]]; light field).

• **focus & DoF**: **focus stacking** with step-size control and blur-boundary detection (→ [[Depth of field]]); focus-shift compensation; cycle through detected **faces/AF subjects** when it locks on the wrong one (→ [[Focus, autofocus]]).

• **computational raw / sensor**: a **true per-channel raw histogram** (not the JPEG proxy); **lossless-compressed raw**; **pixel-shift multi-shot high-res** (→ [[Super-resolution and image priors]]).

• **motion data & metadata**: record the **gyro/accelerometer** streams for shake removal, blur assessment, stabilization (→ [[Video stabilization and rolling-shutter correction]]) and automatic **perspective correction** (→ [[Perspective distortion and its correction]]); in-camera **rating/sorting** that round-trips to the editor (cross-ref [[Appendices#EXIF and image metadata]] — XMP/IPTC); save/restore settings as a text file; separate stills/video settings.

• **panorama**: an electronic **sweep-panorama** mode like a phone's (→ [[Automatic panorama stitching from multiple views and feature matching]]).

• **UI & ecosystem**: better self-timer / long-exposure / **AF-ON** ergonomics; **automatic Wi-Fi offload** at home; and the wish that subsumes the rest — **open the software ecosystem** so third parties can add exactly these features. The recurring lesson: the bottleneck is openness, not optics.

▸22.16 How this book was created⧉

This book was written **with an AI collaborator** (Anthropic's Claude), as a deliberate experiment in AI-assisted authoring — a fitting, slightly meta exercise for a book about computational imaging. The short version: an AI did much of the drafting, building, and checking; a human did all of the deciding. The longer version (below) matters because the failure modes everyone worries about — hallucinated facts, inconsistent notation, prose that drifts over hundreds of pages — are exactly what the method is built to prevent.

▸22.16.1 Two documents, not one⧉

• The central design decision: keep two living documents apart.

• **Outline** = an annotated spec of *what* the book says — structure, order of ideas, and per-section figures, equations/symbols, prerequisites, and the papers/slides it draws on.

• **Prose** = the actual text a reader sees — about *how* it reads. A **compilation** step turns outline (plus supporting files) into draft prose, the way a compiler turns source into a program.

• Point of the separation: leverage — structure/content reasoned about and rearranged cheaply in one place, voice/phrasing in another.

• The arrow runs both ways: when the author edits prose, the change is **triaged** home — *content* (new point, reorder, corrected claim) → outline; *voice* (word choice, rhythm) → style guide; purely local polish → locked in place so it is never silently overwritten.

• Central promise: *re-compiling reproduces essentially the author's edits* rather than regenerating something new. The outline is kept a strict **superset** of the prose's content, so nothing emerged-in-writing is lost on the next pass.

▸22.16.2 Compiling a section⧉

• LLMs are at their worst in long, drifting conversations where early mistakes calcify and context "rots." Antidote: **files are the durable memory, the chat is disposable.**

• Each section is compiled in a **fresh, clean context** from an explicit **capsule** of exactly what it needs — never from accumulated conversation.

• The capsule holds: the section's outline bullets; a one-line chapter thesis + neighbouring sections for continuity; in-scope notation and relevant style rules; and — the key part — the **source material**.

• Two kinds of source, different roles: (1) the teaching the book grew from — lecture **slides** and cleaned **transcripts**, as machine-readable digests — supply *pedagogy and voice* (framing, worked examples, analogies, order of reveal); (2) the technical literature — cited **papers and textbooks** on disk — supply *precision* (definitions, results, attributions).

• Division of labour: slides/transcripts decide *how* to explain, papers decide *what is true*. Grounding every claim in a named source is the first defence against fabrication. The rule: write from the sources, not the model's memory.

▸22.16.3 Figures as code⧉

• No diagram or example image is drawn by hand. Each figure is a small **program** that builds it — regenerates from source, stays consistent with the others, updates when data/style changes (the same reproducibility ethic the book preaches for image-processing code).

• Two libraries: `imageops` holds the actual image **operations** a figure demonstrates (exposure, contrast, tone mapping, convolution, pyramids, demosaicking…), written once so figure and prose can never disagree; `figstyle` holds one shared **visual style** — one palette, one font set, consistent labels — applied everywhere.

• A figure is specified in the outline alongside the text it supports, built to a **legibility standard** (no text smaller than body type, nothing clipped/overlapping), and checked so the prose actually *leverages* it.

• Photographs used as examples are the author's own, or carefully credited Creative-Commons images.

▸22.16.4 Text-to-image: cover art and stylized figures⧉

• A pleasing bit of recursion: the book's **decorative** imagery was made with a **generative text-to-image model** — the same diffusion/transformer family the book explains in [[Generative AI and diffusion]]. A book about computational photography, illustrated by computational photography's newest tool.

• The line to keep straight: **technical figures are code** (previous section) — generative models are *not* trusted to draw a ray diagram or a Bayer grid, where a hallucinated arrow or a mangled equation would be a lie. The model is used only where invention is the point: **ornament**, not evidence.

• **The model.** We used **Reve**, a text-to-image API, driven by two small scripts that read a key, post a prompt (or an image-edit instruction), and save the returned PNG — each call metered in credits, so generation is always a deliberate, bounded selection, never an open faucet.

• **Cover and title art** (`reve_gen.py` ← `cali.txt`). A bank of prompts, each pairing the book's title with one *calligraphic hand* (copperplate, blackletter, Spencerian, Carolingian, Insular/Celtic, Art Nouveau, Roman capitals, brush) and one *gloriously over-engineered futuristic camera* bristling with mismatched lenses — rendered as paper-quilling, gold leaf, illuminated manuscript, and so on. Each prompt is generated in **several variants**; the author picks the keepers. The whole cover suite is one prompt file plus one command.

• **Stylized figures** (`reve_stylize.py` ← `figstyles.txt`). The *opposite* direction — **image-to-image**: take an already-built, typeset figure and ask Reve to **re-render it in another artistic medium** (chalk on a blackboard, blueprint, da-Vinci ink, Diderot engraving, watercolour…) while *preserving every label and equation*. Used for chapter-opener flavour, never to compute or alter content.

• **The fidelity guard.** Generative edits love to redraw a typeset "$r_d$" as literal LaTeX or drop a subscript. Two defences: every edit instruction carries an explicit *keep the math as real typeset glyphs, no backslashes/dollar-signs* clause, and each result is optionally checked by a **vision model** (`reve_verify.py`) that compares it against the original and flags any **major** content drift (a missing panel, a changed sign, an illegible label) — cosmetic medium differences pass, meaning-changing ones are rejected.

• **Honesty.** Every generated image is decorative or illustrative and labelled as such; none is presented as a photograph, a dataset sample, or the output of an algorithm. Where a figure *mimics* a result without being one, it carries a visible **SIMULATED** mark (see the figure conventions). Provenance — model, prompt, and that the image is synthetic — is recorded with the asset.

▸22.16.5 Text-to-3D: models for figures and interactive demos⧉

• A second generative tool, in a different medium: where a figure or an **interactive demo** needs a **3D model** — an insect, a camera, an eyeball, a studio umbrella, a human figure — we generate it from a text prompt with **Meshy**, a **text-to-3D** API, rather than hunting for a correctly-licensed mesh. The same recursion as the cover art: a book about imaging, populated by models a generative system imagined.

• **Where it is used.** The book's **interactive WebGL demos** for Photography 101 (the exposure triangle, portrait lighting, dolly zoom, depth of field, and face-distortion simulators) need real 3D geometry to light, focus, and reproject: a beetle / bee / ladybug for the macro depth-of-field bokeh, flowers for the blurred background, studio umbrellas for the lighting rig, diverse human figures for the portrait and perspective demos, and props (cameras, an eye) for general illustration.

• **The pipeline** (`generate_mesh.py` ← a prompt). One small script reads a key, posts a prompt, polls until the model is ready, downloads a **GLB**, and embeds it as base64 so it loads offline — no server, even from a local `file://` page. Two modes mirror a credit/quality trade-off: **preview** returns *untextured* geometry (cheaper) for demos that apply their **own** material — a metallic shell, a grid — and **refine** adds **PBR textures** when colour carries the meaning (a bee's stripes, a ladybug's spots). Untextured meshes are decimated to a light polygon budget; textured ones are kept whole (decimation would drop the texture coordinates). Each call is metered in credits, so generation stays a deliberate, bounded selection, never an open faucet.

• **Diversity, on purpose.** The human figures are generated as a deliberately **diverse set** — varied gender, race, age, and dress — so the people who populate the demos and figures are not a monoculture.

• **Honesty and credit.** As with the text-to-image art, these models are **illustrative stand-ins**, never presented as photographs of real objects, specimens, or people, and their synthetic provenance is recorded with the asset. **Every caption that uses a Meshy-generated model acknowledges it** — e.g. *"3D model generated with Meshy AI"* — the convention is firm: AI-made assets are labelled where they appear.

▸22.16.6 Text-to-video: animated covers⧉

• A third generative medium: the cover art is also brought to life as short, silent, **seamlessly-looping videos** for the web edition — the index page offers an **animated / static** toggle, and animated mode plays one of the loops at random (`cover_animate.py` ← a cover image, via **Kling** image-to-video on **fal.ai**).

• **The hard part is the loop.** The obvious way to force a true loop — pin the model's *end frame* to the *start frame* — backfires: the laziest path to landing on the identical final frame is to barely move, so the clip comes out nearly frozen. The loops are made the other way round: generate slow, calm, **cyclic** motion — prompted so every element *returns to its starting state* (a full 360° rotation, figures walking back to their pose) — then close the seam with a short **forward crossfade-wrap** (a brief dissolve, never a reversing "ping-pong"). A flawless seam with lively motion is genuinely hard; truly cyclic motion, or a purpose-built looping-context video model, is the clean route.

• **Honesty.** Like the still art, the animations are generative and decorative — never presented as captured footage; their synthetic provenance is recorded with the asset.

▸22.16.7 Verification and review⧉

• A chapter is not allowed to start drafting until it passes a **readiness gate**: outline fleshed out; prerequisites declared and satisfied upstream; figures built (or stub-placeheld); equations/notation locked; references resolving to real documents; source coverage checked.

• After a draft exists, it is run past a panel of automated **reviewers**, each a different lens: technical (inaccuracies, missing assumptions); pedagogical (where a student gets lost); prerequisite (assumed-but-never-introduced background); style (single-author voice); plus figure–text fit, citation correctness, and faithful coverage of slides/papers.

• Findings collect into a running issues list and steer the next revision. Mechanical checks — consistent notation, language-isolated code, sane value-range/encoding assumptions — run separately and automatically.

• **A second opinion from a rival model.** Because the drafting AI shares blind spots with its own prose, an *independent* model from a different vendor (here, OpenAI's GPT) is run as an outside reviewer over the outline or the drafts — at chapter, part, or whole-book scope — tasked only with disagreeing. Its notes are triaged: small, unambiguous fixes the editor applies directly; judgment calls escalated to the author. Crossing model families catches errors a single model would wave through (it flagged, for instance, a real optics slip in an early draft).

• The verifiers don't replace human review; they make it land on what matters.

▸22.16.8 Keeping a long book coherent⧉

• Hardest problem in piecewise generation: the pieces must add up to one object. Three measures.

• **Single-source registries**: one master **notation** table owns every symbol; one **glossary** owns every key term; a registry of recurring **"big lessons"** (imaging is multiplicative; color is non-orthogonal and non-negative; the right basis diagonalizes; edge-preserving filtering is about affinity; quantization is rarely the culprit) records where each first appears and recurs. Defined in one place, referenced everywhere → cannot drift.

• **Dependency graph**: each chapter declares what it *provides* and *requires*; the build checks nothing is used before introduced, and propagates a changed shared definition to every dependent chapter.

• **Version control**: every change to prose and tooling is committed, so generated and edited versions sit side by side, edits extract as diffs, and history is recoverable. A verbatim **log of every instruction** and a dated **process changelog** accompany the repo, so the (continuously evolving) method can be reconstructed — much of what this appendix is built from.

▸22.16.9 The toolchain⧉

• Prose written once in **neutral markup**, rendered to several outputs — print/PDF and a web edition — from a single source, so it is proofread once, not per-format.

• The same single-source trick produces two flavours, one with **Python** code and one with **C++**: shared explanation written once, language-specific snippets swapped in per build, so a reader sees only one language, never both.

• Derived artifacts — slide digests, figures, and eventually compiled chapters, index, glossary — follow an **incremental, build-style** discipline: each records its inputs and rebuilds only when those change, so a small edit doesn't trigger wholesale regeneration.

• The book carries a **version number** that ticks up on each recompile.

▸22.16.10 What the machine did, and what it did not⧉

• "Written by AI" is both an overstatement and an undersell — worth being precise about the division of labour.

• The AI was excellent at the laborious-but-well-specified: drafting fluent prose from a tight, source-backed outline; writing and debugging figure-generation code; mechanical bookkeeping (notation tables, cross-references, consistency sweeps); and building the tooling that ran the process. It gave one author the throughput of a small team.

• It was *not* trusted with judgment. A human chose structure and argument, decided what mattered and what to cut, supplied the taste, caught subtle errors that survive plausible prose, and owns every remaining claim.

• The safeguards above — source-grounding vs fabrication, fresh per-section context vs drift, single-source registries vs incoherence, a reviewer panel feeding a human editor — exist precisely because an unsupervised model would fail in each way.

• Honest summary: the AI is a powerful collaborator and tireless build engine; the responsibility, like the byline, stays human.

▸22.16.11 Who wrote what: a per-part estimate⧉

• an honest, prompt-log-reconstructed table of the **human vs AI** split, **per part**, for the **outline** (content/structure) and the **figures** — rounded to 5%, proportions not precision.

• caveat up front: "authorship" of the outline is fuzzy — the human set structure, chose every topic, and supplied the course material, while the AI expanded those seeds into connective text; a low human % means *directed-not-absent*. A figure counted "AI-originated" was still requested in general terms and accepted/killed by the human.

• two metrics: **outline human %** (originated as explicit human direction / course material vs AI expansion) and **figures human-specified %** (named/described by the human vs AI-proposed from a "suggest the figures" prompt).

• patterns: human outline-share peaks where the book leans on the author's course/research (Fundamentals, Basic, Optics, Multiple-exposure, Performance/Halide, pset Appendices) and dips in survey-style later parts; human figure-share concentrates in the drafted-and-iterated parts (Fundamentals, Basic) and the Appendices (author's own handout figures). Whole-book ≈ **30–35%** human on the outline, ≈ **20%** human-specified on the figures — everything reviewed and owned by the human regardless of who first drafted it.

▸22.17 The course tutor: a local, book-grounded AI teaching assistant⧉

• **What it is.** A terminal **and** browser chatbot tutor for the course, grounded in *this book* plus the lecture slides and pset handouts. It answers conceptual questions fully and, for problem-set work, **guides** — concept, the section to read, the next step — without handing over answers or solution code.

• **Local-first.** The default runs an open-weights model (**Gemma**, via Ollama) **on the student's own machine** — no API key, no per-token cost, works offline. An install-time probe picks a model tier that fits the hardware (a small model on any laptop; a larger, multimodal one on a GPU / Apple Silicon).

• **Retrieval-augmented (RAG).** The book drafts, outline, companion files, slide digests, and pset *handouts* are chunked, embedded, and stored in a small local vector index; each question retrieves the most relevant passages, which are fed to the model as grounded context — the same "write from the sources, not memory" rule the authoring pipeline uses. **Solutions and answer keys are deliberately never indexed.**

• **Guide, don't solve.** Enforced by a Socratic system prompt + worked exemplars, *not* by a brittle filter (a student can get answers from any frontier model anyway — the point is a genuinely useful *guided* tool, the CS50 stance). A **grounding gate** makes the tutor say "the material doesn't cover this" rather than confabulate when retrieval is weak (the Jill-Watson "decline when uncovered" behaviour).

• **It links and shows.** Answers cite the book by name and **link** to the published section; in the browser it renders **equations** (LaTeX/KaTeX) and shows the relevant **figures** inline — the reasons a browser UI beats a terminal for a visual, mathematical subject. A **vision-capable** model lets a student upload an image (their result, an artifact) and ask about it.

• **Two front-ends, one core.** A terminal client (lives in the dev shell, can see code, works over SSH) and a local browser app (equations, figures, image upload, a model dropdown) share the same retrieval + inference + logging core. An optional **online backup** (instructor-hosted) serves students whose machines can't run a local model.

• **Logging and analytics for the instructor.** Every interaction (question, retrieved sources, response) is logged and delivered to the instructor. A batch pipeline produces a **dashboard**: what help students seek, usage over time, per-student activity, retrieval-coverage gaps, and a **pedagogy audit** — an independent model reviews each turn for quality (correct / confusing / harmful), grounding, and answer-leaks, surfacing problem turns. Student identities are **pseudonymized and PII-scrubbed** before any third-party model sees a trace.

• **Privacy & honesty.** Interaction logs are student records (FERPA): disclosed in the syllabus, access-restricted, anonymized before external analysis. The tutor is labelled as an AI that can be wrong; it points students *back into the book* rather than substituting for it.

▸22.18 The semi-automatic grading system⧉

• **The shape (to confirm).** Each pset ships a **test harness** + reference data; submissions are built and run against it, image/numeric outputs compared to references **within a tolerance**, and a provisional score produced for a human to review and adjust.

• **What is automatic vs human (to confirm).** *Automatic:* compiles, runs without crashing, output matches reference within tolerance, performance within budget where relevant. *Human:* partial credit for near-misses, code style/clarity, the open-ended make-your-own / video / ethics components, and any appeal.

• **To fill in (from Frédo):** the exact harness and languages (Python/C++), per-pset reference outputs and tolerances, the rubric and point allocation, how grades + feedback are returned to students, the academic-integrity / permitted-AI-use policy (and how it interacts with [[Appendices#The course tutor: a local, book-grounded AI teaching assistant|the tutor]]), and any plagiarism/similarity checks.

▸Part 23 BIBLIOGRAPHY⧉

The book's references — the **books, papers, and sites** cited throughout — collected in one place. Every in-text citation (written `[@key]` in the drafts) links here, and in the HTML build hovering a citation shows the full reference as a tooltip.

The list is maintained as machine-readable data in [[bibliography|Sources/bibliography.md]] and **rendered** to the bibliography page, grouped into **Textbooks & books**, **Papers**, and **Web, standards & misc**, sorted alphabetically. Each entry carries a stable **citation key**; drafts cite it as `[@key]` (optionally `[@key|custom text]`).

*(The rendered list is generated — edit `Sources/bibliography.md` and rebuild rather than hand-editing the page.)*

▸Part 24 MISSING STUFF, BUGS⧉

24.1 Tone mapping doesn't have its own chapter. Just an application of edge aware stuff.⧉

24.2 Denoising by averaging is after basic denoising⧉

24.3 Photomosaics, my self organizing maps⧉

▸24.4 Rephotography⧉

in multiple images?

24.5 Extreme low light⧉

▸24.6 Perspective disortion⧉

with warping? Homography?

24.7 Tilt shift,⧉

▸24.8 Connectivity⧉

Sharing

Network

▸24.9 Misc.⧉

Something about plugins

Something about software engineering and image processing libraries

Zero lag shutter

Beautification????

Decluttering

Hybrid images

Depth map extraction, especially for fake Depth of field