💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
Computational Photography, an AI-powered Slopendium — 10 Multiple exposure imaging
expand to📖 Full book outlinejump to1 parts · 12 chapters · 50 sections · 60 figures embedded · 33 placeholders · double-click a figure to enlarge
Part 10 MULTIPLE EXPOSURE IMAGING
fig-mei-axes
fig-mei-axes · one scene captured along six axes (time/exposure/viewpoint/focus/wavelength/illumination) — many captures → one better image; the L14 spine 🟨
fig-slr-cross-section
fig-slr-cross-section · labelled cross-section of an SLR: 45° main reflex mirror sends light UP to the focusing screen + pentaprism → optical viewfinder; the semi-transparent main mirror + secondary (sub) mirror fold light DOWN to the phase-detect AF module in the mirror-box floor; focal-plane shutter + sensor/film behind; inset shows the mirror flipping up for exposure (companion to `fig-mirrorless-anatomy`)
💡 **Big lesson (L14 · Capture the full set, decide later):** the unifying move of the whole part. Instead of forcing one capture to make every tradeoff at once — exposure, focus, viewpoint, even which moment — **take many captures that each commit differently, and resolve the tradeoff per-pixel in software afterward.** Averaging defers the noise/exposure tradeoff; HDR defers "expose for highlights or shadows?"; focal stacks defer "focus near or far?"; panoramas defer "which field of view?"; hyperspectral defers "which three colors?"; time-lapse defers "which illumination?". Capture is cheap and parallel; the decision is a downstream computation. (Registered in [[Big Lessons]] as **L14**; recurs across every chapter of this part and forward in [[Advanced computational photography]] — light fields are the same idea for *viewpoint/focus*.)
**A very old idea.** Long before digital, photographers combined multiple captures by hand: Gustave Le Gray's 1850s seascapes sandwiched a separately-exposed sky negative with a sea negative to beat the medium's dynamic range; others physically cut and pasted prints to widen the field of view (e.g. W. H. Fox Talbot's printing establishment at Reading, c. 1845). The arithmetic is now done in software and per-pixel, but the instinct — *one frame isn't enough, so take several and combine* — is the same. (Historical figures: Le Gray combination prints; Talbot establishment — see `fig-le-gray-historical`.)
**Roadmap.** The part walks the axes of "more than one capture." **[[#Denoising by averaging]]** adds *time* to beat noise ($1/\sqrt N$), with deep-sky astrophotography as the extreme. **[[#Image alignment]]** is the glue every later chapter needs — you must register before you combine. **[[#HDR merging]]** adds *exposure* to beat dynamic range, and **[[#Application to cell phones: HDR+ and burst imaging]]** shows the burst form that ships in every phone. **[[#Manual panorama stitching from multiple views]]** and **[[#Automatic panorama stitching from multiple views and feature matching]]** add *viewpoint* — the homography that needs no depth, found automatically via features + RANSAC. **[[#Blending]]** hides the seams, and **[[#Bells and whistles]]** plus **[[#Continuous panoramas (e.g. on cell phones)]]** handle projections, drift, motion, and the phone sweep. **[[#Focal stacks and depth of field extension]]** adds *focus*, **[[#Hyperspectral imaging, color wheels]]** adds *wavelength*, and **[[#Intrinsic images with time lapse]]** adds *illumination* — all the same stack-and-decide idea on a new axis.
equations
averaging $\operatorname{Var}[\bar z] = \sigma^2/N \Rightarrow \text{std} \propto 1/\sqrt N$
HDR radiance $\hat L = \frac{\sum_i w(z_i)\,z_i/t_i}{\sum_i w(z_i)}$
homography $\mathbf x' \simeq H\mathbf x$, pure rotation $H = K R K^{-1}$
RANSAC $N = \log(1-p)/\log(1-w^s)$
all-in-focus $k^*(p) = \arg\max_k s_k(p)$
10.1 Denoising by averaging
⬜ figure not yet created
`fig-averaging-N-frames` (one noisy frame → mean of 3 → mean of 5 → mean of 45: the grey patch's histogram visibly narrows, after the Canon-1D ISO-3200 slide sequence) fig-noise-histogram
fig-snr-vs-stddev
fig-snr-vs-stddev · noise std-dev map vs SNR map of a brightness ramp: std rises with brightness (∝√N), but SNR=√N is worst in the shadows 🟨
⬜ figure not yet created
doubling cleanliness costs 4× the frames)
⬜ figure not yet created
`fig-iid-variance-adds` (schematic of the derivation: $N$ independent noise samples, variance adds then $\div N^2$ → $\sigma^2/N$) fig-iid-variance-adds
fig-mean-vs-median-stack
fig-mean-vs-median-stack · plain-mean stack (residual ghost streak from one outlier frame) vs median / sigma-clip stack (streak rejected) of the same frames, with the per-pixel sample distribution inset
fig-calibration-triad
fig-calibration-triad · a light frame's bias / dark / flat components and the calibration formula $X_\text{cal}=(X_\text{light}-X_\text{dark})/(X_\text{flat}-X_\text{bias})$ that subtracts the additive terms and divides out the multiplicative one 🟨
fig-clipping-bias
fig-clipping-bias · how clamping noise at zero biases a dark region bright under averaging, and how a pre-read positive offset restores zero-mean behaviour 🟨
⬜ figure not yet created
`fig-astro-stack-reveal` (a single 30-s sub of a faint galaxy — almost noise — vs 200 subs stacked: structure emerges) fig-astro-stack-reveal
⬜ figure not yet created
`fig-sub-length-tradeoff` (read-noise floor vs total integration time: fewer-longer beats more-shorter once read noise is sub-dominant) fig-sub-length-tradeoff
💡 **Big lesson (the $1/\sqrt N$ law — averaging $N$ independent measurements halves the noise every 4× frames):** model each pixel as a true value plus **independent, zero-mean** noise. Averaging $N$ frames leaves the true value alone (its average is the true value) but the **noise variance falls as $1/N$**, so the **standard deviation — the visible error — falls as $1/\sqrt N$**. This is the part's recurring quantitative result: it is the same law behind HDR's noise-weighted merge, burst low-light (HDR+ / Night Sight), burst super-resolution, and astrophotography's hours of integration. The diminishing-returns sting is built in — *each extra stop of cleanliness costs four times the frames*. (It ties to **L15** — averaging is the algorithmic way to lower the effective **noise floor**, the complement to widening the well; the two together set how much real signal you can pull from shadow. Registered in [[Big Lessons]] alongside L15; first developed here.)
💡 **Big lesson (L14 · capture the full set, decide later):** rather than commit at the instant of exposure, **record the whole set and choose afterward**. Averaging is the first instance in this part: don't take *one* careful frame and hope — take a **burst** and combine. The same move recurs as exposure bracketing (capture all exposures → [[#HDR merging]]), focus stacks (capture all focal planes), and panoramas (capture all directions). The cost is data and a reconstruction step; the payoff is deferring an irreversible decision out of the moment of capture. (Registered in [[Big Lessons]] as **L14**; this part is its natural home.)
The previous part measured and characterised noise. This part *fights* it — and the bluntest, most trustworthy weapon is to take the **same picture many times and average**. No prior, no model of what images look like: just repeated measurement and the law of large numbers. Where single-image denoising ([[Bilateral filtering]]) must *guess* which neighbours to trust, averaging multiple frames adds **genuine independent information** with each shot. The whole chapter is one derivation and its consequences.
equations
noise model per pixel per frame $X_i = \mu + n_i$, with $\mathbb{E}[n_i]=0$ (zero-mean) and $\operatorname{Var}[n_i]=\sigma^2$, the $n_i$ **independent** across frames
the average $\bar X = \tfrac1N\sum_{i=1}^N X_i$
**unbiased** in the signal $\mathbb{E}[\bar X]=\mu$
the two variance rules $\operatorname{Var}[kX]=k^2\operatorname{Var}[X]$ and (independence) $\operatorname{Var}\big[\sum_i X_i\big]=\sum_i\operatorname{Var}[X_i]$
hence $\operatorname{Var}[\bar X]=\tfrac{1}{N^2}\cdot N\sigma^2=\dfrac{\sigma^2}{N}$, so $\boxed{\sigma_{\bar X}=\dfrac{\sigma}{\sqrt N}}$
sample-variance estimator with **Bessel correction** $\hat\sigma^2=\tfrac{1}{N-1}\sum_i (X_i-\bar X)^2$ (divide by $N{-}1$, not $N$ — removes the downward bias from reusing the same samples for mean and variance)
SNR $=\mu/\sigma$, in dB $20\log_{10}(\mu/\sigma)$, improving by $10\log_{10}N$ ($\approx 3$ dB per doubling)
calibration $\
X_\text{cal}=\dfrac{X_\text{light}-X_\text{dark}}{X_\text{flat}-X_\text{bias}}$ (subtract additive fixed-pattern, divide out multiplicative).
10.2 Image alignment
fig-align-before-average
fig-align-before-average · averaging an unregistered hand-held burst (a blurry double) vs averaging the same burst after alignment (sharp, clean, $1/\sqrt N$ benefit visible)
fig-ssd-search-surface
fig-ssd-search-surface · the SSD score over candidate $(dx,dy)$ shifts drawn as a single-welled basin whose minimum marks the true alignment 🟨
⬜ figure not yet created
`fig-shifted-copy-debug` (an image aligned to a copy of itself shifted by +2 px → the score map's minimum must sit exactly at $(+2,0)$: the hand-checkable test) fig-shifted-copy-debug
fig-phase-correlation-spike
fig-phase-correlation-spike · two shifted frames producing, via the inverse FFT of the normalized cross-power spectrum, a single sharp delta whose position is the inter-frame shift 🟨
⬜ figure not yet created
**L5** in one picture)
fig-coarse-to-fine-pyramid
fig-coarse-to-fine-pyramid · alignment estimated cheaply at the coarsest pyramid level then propagated and refined down to full resolution, capturing large motions while avoiding local minima 🟨
Every method in this part — averaging just now, HDR and panorama and focal stacks to come — assumes that *corresponding scene points already sit on the same pixel across frames*. They don't: hands tremble, the Earth rotates, even a "locked" tripod drifts. **Alignment is the unglamorous precondition** of all of it, and it is unforgiving — averaging two frames that disagree by a pixel doesn't clean the image, it **blurs** it into a double exposure. This chapter is the shared registration toolbox, and because the same machinery — find the motion between frames — is exactly what **video stabilization and optical flow** need, it hands off naturally to the video part.
equations
alignment as $\hat{T}=\arg\min_T \sum_{x}\big(I_1(x)-I_2(T x)\big)^2$ (find the transform that makes one image match the other)
**SSD** over a shift $\mathrm{SSD}(dx,dy)=\sum_{x,y}\big(I_1(x,y)-I_2(x{+}dx,y{+}dy)\big)^2$
**NCC** (gain/offset-invariant) $\mathrm{NCC}=\dfrac{\sum (I_1-\bar I_1)(I_2-\bar I_2)}{\sqrt{\sum(I_1-\bar I_1)^2}\sqrt{\sum(I_2-\bar I_2)^2}}$
the **shift theorem** $I_2(x)=I_1(x-d)\
\Leftrightarrow\
\hat I_2(\omega)=\hat I_1(\omega)\,e^{-i\,\omega\cdot d}$ (a translation is a pure phase ramp — **L5**)
**phase correlation** — the normalised cross-power spectrum $R(\omega)=\dfrac{\hat I_1\,\overline{\hat I_2}}{|\hat I_1\,\overline{\hat I_2}|}=e^{i\,\omega\cdot d}$, whose inverse transform $\mathcal F^{-1}\{R\}$ is a **delta at the shift $d$**.
10.3 HDR merging
fig-dr-clip-vs-noise
fig-dr-clip-vs-noise · one exposure's two walls — clipping at full-well above, the noise floor below — shading the narrow band of scene contrast that fits between them 🟨
fig-exposure-stack-slices
fig-exposure-stack-slices · N exposures as overlapping reliable bands of the (log) radiance axis that slide with exposure time and together tile a range no single shot could hold 🟨
fig-vary-exposure-knobs
fig-vary-exposure-knobs · the four exposure knobs and their side effects — none for shutter, depth-of-field for aperture, noise for ISO, color cast plus camera contact for ND 🟨
fig-depth-of-field-sim
fig-depth-of-field-sim · live macro depth-of-field simulator (web edition): a to-scale thin-lens diagram + a 3D photo (aperture-supersampled bokeh) + a circle-of-confusion plot all sharing one optics model; subject dropdown (Meshy beetle / bee / ladybug) over a flower background, focus landing on the front of the face so the body and antennae blur; static fallback is a screenshot. *Bug & flower 3D models generated with Meshy AI.*
fig-exposure-triangle-sim
fig-exposure-triangle-sim · live exposure-triangle simulator (web edition): two 3D subjects at different depths walking opposite ways; shutter/aperture/ISO/scene-light/sensor sliders with brute-force time + aperture supersampling (real motion blur, depth of field, shot+read noise) and an auto-set triangle; static fallback is a screenshot
⬜ figure not yet created
`fig-pixel-ratio-calibration` (two adjacent exposures, scatter of $I_i$ vs $I_j$ over the jointly-good pixels → slope $= k_i/k_j$ fig-pixel-ratio-calibration
⬜ figure not yet created
median is robust)
fig-response-curve-debevec
fig-response-curve-debevec · a camera's non-linear response $f(kL)$ and the recovered inverse $g(z)$ mapping pixel value to log-radiance, constrained by multiply-exposed pixels up to a global scale (Debevec–Malik) 🟨
fig-histogram-exposure
fig-histogram-exposure · one real scene at three exposures (−2 / as-shot / +2 stops), each with its luminance histogram — distribution slides left→right, clipping spikes pin to the edge; crushed-shadow / clipped-highlight callouts (BASIC histograms)
fig-weight-triangle
fig-weight-triangle · the reliability weight $w(z)$ as a hat that vanishes at both clipping and noise extremes, with an SNR-optimal variant skewed toward brighter samples, plus per-frame contribution bands 🟨
⬜ figure not yet created
`fig-naive-vs-optimal-weights` (same bracket merged with binary/hat weights vs inverse-variance weights — shadow noise visibly lower, after the Hasinoff "Nancy Church" result) fig-naive-vs-optimal-weights
fig-hasinoff-snr-optimal-set
fig-hasinoff-snr-optimal-set · SNR vs log scene brightness for naive bracketing, an SNR-optimal high-ISO short-exposure set, and an ideal sensor, the optimal set nearly matching the ideal and beating naive in the shadows (Hasinoff–Durand–Freeman) 🟨
fig-hdrplus-pipeline
fig-hdrplus-pipeline · the HDR+ flow: under-exposed raw burst → pick reference → align → robust raw-domain merge → demosaic → tone-map → JPEG, highlighting that merge precedes demosaic 🟨
💡 **Big lesson (L15 · Dynamic range = full-well capacity ÷ noise floor):** a single exposure records from a **top** — the photosite's **full-well capacity**, where values saturate/clip — down to a **bottom** — the **read-noise floor** in the shadows. Their **ratio is the dynamic range** (in stops): how much scene contrast fits in *one* shot, typically far less than the 10–12 orders of magnitude the world throws at us. You can **widen** the range with a bigger well (larger photosites — a full-frame sensor beats a phone) or a lower floor (cooling, lower read noise) — but the photographic workhorse is to **beat it by merging multiple exposures**: bracket the scene, let each shot own a different slice of the radiance axis, and stitch the slices into one radiance map. That merge is this chapter. (Registered in [[Big Lessons]] as **L15**; first appears in FUNDAMENTALS — Noise/SNR/DR; this is its capture-side **HDR** recurrence. The companion is **L6** — with a sane encoding it's *range*, not bit depth, that bounds a shot.)
💡 **Big lesson (L14 · Capture the full set, decide later):** rather than commit to one exposure at the instant of capture, **record the whole bracket and choose afterward**. HDR bracketing is the exposure-axis instance of the same move that gives focus stacks (capture every focal plane) and light-field cameras (capture every ray → refocus later). The cost is data and a harder reconstruction (alignment, merge); the payoff is deferring an irreversible decision — *which exposure?* — out of the moment of capture. (→ see Big lesson **L14**; first placed with light-field / plenoptic cameras; recurs here as HDR exposure bracketing and again in the next chapter as the **burst**.)
equations
image formation with clipping $I_i(x,y)=\mathrm{clip}\big(k_i\,L(x,y)+n\big)$, exposure factor $k_i\propto t_i\cdot(\text{ISO})/N_{\!f}^2$ (shutter $t_i$, aperture $f$-number $N_f$, ISO gain)
**linear radiance estimate** $\displaystyle \hat L(x,y)=\frac{\sum_i w(z_i)\,z_i/t_i}{\sum_i w(z_i)}$ ($z_i=I_i(x,y)$)
**pairwise scale** $k_i/k_j=\mathrm{median}_{x:\,w_i,w_j>0}\big(I_i(x,y)/I_j(x,y)\big)$ (linear capture)
**Debevec response recovery** $g(z_i)=\ln L + \ln t_i$ — solve for the response $g=\ln f^{-1}$ and the $\ln L$ jointly by least squares with a smoothness term, up to one additive constant
**Debevec merge (log domain)** $\ln \hat L=\dfrac{\sum_i w(z_i)\,(g(z_i)-\ln t_i)}{\sum_i w(z_i)}$
**two-observation optimal blend** $\hat L = a x+(1-a)y$, $a^\*=\sigma_y^2/(\sigma_x^2+\sigma_y^2)$ → general **inverse-variance** weights $w_i=1/\sigma_i^2$, $\hat L=\sum_i (z_i/k_i)/\sigma_i^2 \big/ \sum_i 1/\sigma_i^2$
**per-pixel noise variance** $\sigma^2(x)=a\,x + \sigma_{\text{read}}^2$ (photon term ∝ signal + constant read term).
10.4 Application to cell phones: HDR+ and burst imaging
fig-hdrplus-pipeline
fig-hdrplus-pipeline · the HDR+ flow: under-exposed raw burst → pick reference → align → robust raw-domain merge → demosaic → tone-map → JPEG, highlighting that merge precedes demosaic 🟨
⬜ figure not yet created
redraw from the Hasinoff 2016 system diagram)
fig-dynamic-range-comparison
fig-dynamic-range-comparison · dynamic-range ladder in **stops** (horizontal bars): colour slide (~5–6) · reflective print (~6–7) · film negative (~12–13) · phone sensor (~10–12) · full-frame sensor (~14) · human eye instantaneous (~10–14) vs adapted (~20+) · an animal example · a typical sun-and-shadow scene (>20) — shows why no single capture holds a high-contrast scene (→ HDR)
fig-burst-vs-bracket
fig-burst-vs-bracket · a few-frame varied-exposure bracket (needs tripod, ghosts on motion) vs a many-frame identical-short-exposure burst (hand-held, motion frozen, denoised by averaging) 🟨
fig-pinhole-imaging
fig-pinhole-imaging · imaging-scenario series (2/3): add a pinhole to the bare sensor — one ray per scene point → a dim **inverted** image (same tree+sensor+colours as fig-bare-sensor-averaging)
⬜ figure not yet created
robust merge rejects the outlier frames per tile → clean) fig-ghost-from-misalignment
⬜ figure not yet created
`fig-burst-superres-link` (the same aligned burst, but sub-pixel hand-tremor shifts sampled on a finer grid → recovered resolution, demosaic replaced — pointer to [[Single-image computational photography]], after Wronski 2019). fig-burst-superres-link
equations
burst formation $I_i(x,y)=\mathrm{clip}\big(k\,L(W_i(x,y)) + n_i\big)$ — **same** small exposure $k$ every frame, each warped by an estimated alignment $W_i$ (≈ identity + sub-pixel hand motion)
**merged estimate** = robust weighted average over aligned frames, $\hat L(x,y)=\sum_i w_i\,I_i(W_i^{-1}x)\big/\sum_i w_i$, with $w_i$ down-weighting frames that disagree with the reference (motion/occlusion) — outlier rejection on top of the inverse-variance weighting of [[#HDR merging]]
**noise after merge** $\sigma_{\text{merged}}\approx \sigma_{\text{single}}/\sqrt{N}$ for the static, well-aligned pixels (the denoise)
**why underexpose** — choose $k$ so that the brightest scene radiance stays below clip ($k\,L_{\max}<1$), accepting a higher relative noise that the $1/\sqrt N$ merge then removes.
10.5 Manual panorama stitching from multiple views
fig-pano-rotate-vs-translate
fig-pano-rotate-vs-translate · pure camera rotation about a fixed center (views related by one homography) vs camera translation (parallax, no single map) 🟨
fig-depth-cancels-ray
fig-depth-cancels-ray · points at different depths along one ray collapsing to a single pixel after rotation and reprojection — depth drops out of the view-to-view map 🟨
⬜ figure not yet created
`fig-pinhole-divide-by-z` (3D point $(x,y,z)$ → image $(x/z,y/z)$: projection = divide by depth) fig-pinhole-divide-by-z
fig-homography-quad
fig-homography-quad · a homography sending a rectangle to a general quadrilateral, keeping lines straight while letting parallels converge (8 DOF; affine would keep a parallelogram) 🟨
fig-pinhole-imaging
fig-pinhole-imaging · imaging-scenario series (2/3): add a pinhole to the bare sensor — one ray per scene point → a dim **inverted** image (same tree+sensor+colours as fig-bare-sensor-averaging)
⬜ figure not yet created
reprojecting to another line is *linear* up there and depth-independent — the toy version of the whole idea) fig-iid-variance-adds
⬜ figure not yet created
`fig-pano-warp-to-reference` (pick one image as reference fig-pano-warp-to-reference
fig-pano-4clicks
fig-pano-4clicks · the manual stitching pipeline: click four correspondences across the overlap, solve $H$, warp, blend into one wide frame 🟨
fig-doc-flatten
fig-doc-flatten · a skewed photo of a page rectified to a flat scan by clicking its four corners and warping by the recovered homography (the planar-scene case) 🟨
💡 **Big lesson (L14 · Capture the full set, decide later):** *the move that drives this whole part is — instead of committing one capture parameter at the instant of exposure, **record the whole set and choose afterward**.* A panorama defers your **field of view**: shoot several overlapping frames now, decide the final framing/crop of the wide view later. It is the same move as exposure bracketing → HDR (capture all exposures), focus stacks (all focal planes), and burst imaging (every instant). The cost is data and a harder reconstruction (alignment + blending); the payoff is taking an irreversible decision *out* of the moment of capture. (Registered in [[Big Lessons]] as **L14** — *first appears* with light-field/plenoptic cameras, but it is the **spine of this part** and surfaces in every section here; one-line callbacks at HDR, focus stacks, and panoramas.)
A panorama is the oldest computational-photography trick: a single shot can't see wide enough, so take **several overlapping photos** and stitch them into one wide image. The deep question is *what relates two overlapping photos of the same scene?* — and the surprisingly clean answer (no 3D, no depth) is what makes stitching a tractable problem rather than a full reconstruction. This chapter does the geometry by hand; the **manual** version asks the user to click the correspondences, and the [[#Automatic panorama stitching from multiple views and feature matching|next chapter]] automates them.
equations
pinhole projection $(x,y,z)\mapsto(x/z,\,y/z)$ (project = divide by depth)
homography $\mathbf{x}' \simeq H\mathbf{x}$ with $H\in\mathbb{R}^{3\times3}$, **8 DOF** (defined up to scale, $H\sim kH$)
pure-rotation view-to-view map $H = K R K^{-1}$ — **independent of depth**
for the calibrated/normalised case $H = R$ acting on rays $(x,y,1)$
the per-point divide $\big(x',y'\big)=\big(\tfrac{a x+by+c}{gx+hy+i},\,\tfrac{dx+ey+f}{gx+hy+i}\big)$
the linear system $A\mathbf{h}=\mathbf{0}$ (each correspondence → 2 rows
4 correspondences → 8 equations in 8 free unknowns).
10.6 Automatic panorama stitching from multiple views and feature matching
fig-stata-stitch-pipeline
fig-stata-stitch-pipeline · the full automatic-stitching pipeline on a REAL pair (Stata Center): Harris corners → oriented descriptors → putative matches → RANSAC inliers → homography warp & blend → seamless panorama (real-image capstone for `fig-feature-pipeline`)
fig-corner-flat-edge
fig-corner-flat-edge · why corners are the best keypoints — only a corner's small window changes under a shift in every direction (flat / edge / corner columns, aperture problem) 🟨
fig-harris-eigen-ellipse
fig-harris-eigen-ellipse · the structure tensor's eigenvalues linked to the flat / edge / corner classification and the sign of the Harris response $R=\det M-k(\operatorname{tr}M)^2$ 🟨
⬜ figure not yet created
axes $\propto\lambda^{-1/2}$
fig-harris-not-scale-invariant
fig-harris-not-scale-invariant · one scene corner classified "corner" at one scale and "edge" at another under a fixed window, motivating scale selection 🟨
⬜ figure not yet created
`fig-scale-space-bumps` (1D bumps of three widths blurred over a stack of Gaussians fig-scale-space-bumps
fig-dog-pyramid
fig-dog-pyramid · the Difference-of-Gaussians pyramid and the 26-neighbour space-and-scale extremum test that detects SIFT keypoints 🟨
fig-sift-descriptor
fig-sift-descriptor · the $16\times16$ window → $4\times4$ grid of 8-bin orientation histograms → 128-D normalised vector that is the SIFT descriptor 🟨
⬜ figure not yet created
`fig-canonical-orientation` (gradient-orientation histogram of the patch fig-canonical-orientation
fig-ratio-test-histograms
fig-ratio-test-histograms · overlapping raw-distance histograms vs well-separated ratio $d_1/d_2$ histograms for correct/incorrect matches, showing why the ratio test discriminates 🟨
fig-ransac-line
fig-ransac-line · RANSAC's sample → fit → count-inliers → keep-best → re-fit loop on a toy line-fitting example, four panels 🟨
⬜ figure not yet created
`fig-ransac-pano-inliers` (a real overlap: matches colored — rejected-by-ratio-test, RANSAC outliers, inliers — and the resulting clean stitch) fig-ransac-pano-inliers
⬜ figure not yet created
`fig-ransac-iterations-graph` (probability-of-failure vs inlier fraction $w$, family of curves for $N=10,10^2,\dots,10^6$ iterations — the exponentials at work) fig-ransac-iterations-graph
The [[#Manual panorama stitching from multiple views|manual pipeline]] is complete except for one human step: someone clicked the corresponding points. That step is the bottleneck — slow, and impossible at scale (gigapixel mosaics, photo collections, video). This chapter replaces the clicks with **automatic feature matching**, which Durand's slides rightly call *"perhaps the most important innovation in computer vision and computational photography in the last twenty years"* — the same machinery powers 3D reconstruction, tracking, object recognition, retrieval, and robot navigation. We keep the homography solve from before; we only manufacture its inputs.
equations
window SSD under a shift $E(u,v)=\sum_{x,y} w(x,y)\,[I(x+u,y+v)-I(x,y)]^2$
**Taylor → quadratic form** $E(u,v)\approx \begin{psmallmatrix}u&v\end{psmallmatrix} M \begin{psmallmatrix}u\\ v\end{psmallmatrix}$ with the **structure tensor** $M=\sum w\begin{psmallmatrix}I_x^2&I_xI_y\\ I_xI_y&I_y^2\end{psmallmatrix}$
**Harris response** $R=\det M-k(\operatorname{tr}M)^2=\lambda_1\lambda_2-k(\lambda_1+\lambda_2)^2$ ($k\approx0.04$–$0.06$)
Shi–Tomasi $R=\min(\lambda_1,\lambda_2)$
**scale-space** $L(\cdot,\sigma)=G_\sigma*I$, **DoG** $D=L(\cdot,k\sigma)-L(\cdot,\sigma)\approx(k{-}1)\sigma^2\nabla^2 G*I$ (scale-normalised Laplacian)
**descriptor distance** SSD $d=\lVert \mathbf{f}_i-\mathbf{f}_j\rVert^2$
**ratio test** $d_1/d_2<\tau$ ($\tau\approx0.8$, $d_1\le d_2$ the two nearest neighbours)
**RANSAC inlier test** $\lVert \mathbf{x}_i'-H\mathbf{x}_i\rVert<\varepsilon$
probability a random $s$-sample is all-inlier $w^s$
**iteration count** $N=\dfrac{\log(1-p)}{\log(1-w^s)}$ (succeed with prob. $p$
inlier fraction $w$
sample size $s{=}4$ for a homography).
10.7 Blending
⬜ figure not yet created
`fig-blend-visible-seam` (a registered two-image mosaic with a hard photometric seam — exposure + white-balance + vignetting mismatch — the problem statement) fig-blend-visible-seam
⬜ figure not yet created
`fig-blend-feather-ghost` (single distance-weighted average → the same scene where small misregistration produces a doubled/ghosted edge and overall blur) fig-blend-feather-ghost
fig-blend-twoscale-split
fig-blend-twoscale-split · a source split into low and high bands, the low band blended smoothly and the high band composited winner-take-all, plus a hard-cut / feather / two-scale comparison 🟨
fig-blend-mask-pyramid
fig-blend-mask-pyramid · the blend mask decomposed by a Gaussian pyramid, the transition wide at coarse levels and sharp at fine levels, explaining why width tracks frequency band 🟨
⬜ figure not yet created
`fig-blend-laplacian-bands` (per-band blend $L^k_{out}=m^k L^k_A+(1-m^k)L^k_B$ then collapse → the seamless multiband result) fig-blend-laplacian-bands
fig-blend-poisson-vs-pyramid
fig-blend-poisson-vs-pyramid · pyramid vs Poisson blending on a paste with a DC offset — the pyramid leaves a faint low-frequency halo, Poisson absorbs the offset into the boundary 🟨
fig-blend-seam-graphcut
fig-blend-seam-graphcut · two overlapping frames with a moving person and the min-cut seam routed through low-disagreement pixels around the person, taking them entirely from one source 🟨
💡 **Big lesson (L9 · the eye cares about gradients, not absolute values):** a stitch seam is jarring not because the *average* brightnesses differ but because the **gradient is wrong right at the boundary** — a sudden jump the visual system reads as an edge that isn't in the scene. So the entire blending ladder is a sequence of ways to *fix the gradients at the seam*: smooth-blend the low frequencies so the jump is spread out (two-scale, multiband), or paste the gradients and solve for values so the jump is **absorbed into an invisible DC shift** (Poisson), or route the seam through pixels whose gradients already match so there is nothing to fix (graph cut). A constant brightness/color offset between two frames is **invisible once the boundary gradient is right** — exactly why gradient-domain blending works. (→ first appears in [[Poisson image editing]] — *the key idea*; recurs here as the organizing principle of stitch blending.)
equations
**feathering / alpha** $I_{out}(p)=\dfrac{\sum_i w_i(p)\,I_i(p)}{\sum_i w_i(p)}$ with $w_i$ a distance-to-boundary weight
**two-scale** — split $I_i=L_i+H_i$ ($L_i=G_\sigma * I_i$, $H_i=I_i-L_i$)
blend low band by feathering, high band winner-take-all $H_{out}(p)=H_{\arg\max_i w_i(p)}(p)$
**multiband / Laplacian** per band $L^k_{out}=m^k\,L^k_A+(1-m^k)\,L^k_B$ where $m^k$ is the **Gaussian-pyramid** level-$k$ of the mask, then collapse $\sum_k \mathrm{expand}(L^k_{out})$
**Poisson blend** $\displaystyle \min_f \iint_\Omega \lVert\nabla f - \mathbf v\rVert^2\ \text{s.t. } f|_{\partial\Omega}=I_{\text{target}}|_{\partial\Omega}$, Euler–Lagrange $\nabla^2 f=\operatorname{div}\mathbf v$
**seam energy** $E=\sum_p D_p(\ell_p)+\sum_{(p,q)}V_{pq}(\ell_p,\ell_q)$ (data + smoothness, minimized by graph cut)
10.8 Bells and whistles
fig-projections-three
fig-projections-three · the same wide scene unrolled three ways — plane (lines straight, FOV limited), cylinder (verticals straight, full horizontal sweep), sphere (full surround, everything curves) — with a "use when" each 🟨
⬜ figure not yet created
`fig-projection-line-bending` (a straight architectural edge: planar keeps it straight but can't exceed ~120° fig-projection-line-bending
fig-bundle-drift
fig-bundle-drift · a chained-homography 360° panorama (visible gap from accumulated drift) vs a bundle-adjusted one (error distributed around the loop, closure seamless) 🟨
⬜ figure not yet created
`fig-gain-vignette-solve` (per-image gain + a vignetting falloff recovered jointly so overlaps agree — before/after) fig-gain-vignette-solve
fig-deghost-seam
fig-deghost-seam · a blended overlap where a walking person is doubled into a ghost vs a deghosted result that routes the seam to take the person from one frame only 🟨
fig-parallax-breaks-homography
fig-parallax-breaks-homography · two shots from a translated camera where aligning the background doubles the foreground and aligning the foreground tears the background — one homography cannot handle depth-dependent parallax 🟨
The pairwise pipeline (match → RANSAC → homography → blend) makes a *demo*. A *tool* has to survive 360° loops, lens imperfections, people who wander through the shot, and a photographer who took a step sideways. Each subsection below is one such assumption breaking, and the repair. The recurring fix is the same: **treat the whole panorama as one global estimation problem instead of a chain of independent local ones.** *(To consult Ce Liu & Miki Rubinstein for the production deghosting / motion-handling and continuous-capture details — queue marker.)*
equations
**bundle adjustment** $\displaystyle \min_{\{\theta_j\}} \sum_{j,k}\sum_{i} \rho\big(\lVert \hat{\mathbf x}^j_i(\theta_j,\theta_k) - \mathbf x^k_i \rVert\big)$ — jointly over all camera params $\theta_j$ (rotation $R_j$, focal $f_j$, distortion, gain), summed over all feature correspondences $i$ across all overlapping image pairs $(j,k)$, $\rho$ a robust loss
**gain compensation** $\min_{\{g_j\}} \sum_{(j,k)}\sum_{p\in \text{overlap}} (g_j I_j(p) - g_k I_k(p))^2 + \lambda\sum_j (g_j-1)^2$
**radial distortion** $r_d = r(1+\kappa_1 r^2 + \kappa_2 r^4)$ folded into $\theta_j$
10.9 Continuous panoramas (e.g. on cell phones)
⬜ figure not yet created
`fig-sweep-incremental` (a phone panning fig-sweep-incremental
fig-sweep-central-strip
fig-sweep-central-strip · a single frame with only its central strip retained, annotated low-vignetting / low-distortion / low-parallax, and narrow overlaps feathered between strips 🟨
⬜ figure not yet created
strips overlap just enough to feather)
fig-rolling-shutter-pan
fig-rolling-shutter-pan · a slanted, sheared vertical pole from an uncorrected fast pan vs a straightened, rectified version, illustrating line-by-line CMOS readout during camera motion 🟨
fig-portrait-undistortion
fig-portrait-undistortion · distortion-free wide-angle portraits (Shih, Lai & Liang 2019): stretched edge face → content-aware warp mesh (locally stereographic over faces, perspective elsewhere) → corrected face with straight lines preserved (Perspective distortion and its correction)
A cell-phone "sweep" panorama is the same goal — one wide image from many views — but the *capture model* is inverted: instead of shooting N discrete frames and stitching offline, the phone **captures a continuous video while you pan** and builds the mosaic **as the frames arrive**. That streaming constraint, plus the fact that you're hand-holding and continuously translating, reshapes every design choice. *(To consult Ce Liu & Miki Rubinstein for the mobile-pipeline specifics — queue marker.)*
equations
**per-strip registration** incremental homography $H_{t} = H_{t-1}\,\Delta H_t$ ($\Delta H_t$ = frame-to-frame, often near pure rotation, predicted from the **gyroscope**)
**rolling-shutter** per-row pose $\mathbf p(y)=\mathbf p_0 + y\cdot \dot{\mathbf p}$ (camera moves *during* a frame's readout → row $y$ captured at time $t_0+y\,t_{\text{row}}$)
**strip feather** $I_{out}=\alpha I_{\text{new strip}} + (1-\alpha)I_{\text{mosaic}}$ across the narrow overlap
10.10 Focal stacks and depth of field extension
⬜ figure not yet created
`fig-focalstack-problem` (a macro/close-up scene at one aperture: only a thin slab is sharp, foreground and background mush — "even f/16 isn't enough") fig-focalstack-problem
fig-focalstack-capture
fig-focalstack-capture · a focus stack — several frames of one scene focused at different distances, the sharp slab moving through depth — plus an inset on refocusing the lens vs translating on a rail 🟨
fig-focalstack-sharpness
fig-focalstack-sharpness · one frame walked through high-pass → square → smooth to produce its per-pixel sharpness map $s_k$ (pipeline layout with arrows + $\LaTeX$ labels)
fig-focalstack-why-square
fig-focalstack-why-square · a 1-D edge whose signed band-pass averages to ~0, and how squaring it makes the local average register the presence of detail 🟨
fig-focalstack-argmax-composite
fig-focalstack-argmax-composite · the per-pixel argmax selection map (which-frame-won / coarse depth) beside the all-in-focus image, plus a zoom comparing $\gamma=1$ vs $\gamma=4$ weighting
⬜ figure not yet created
exponent-1 vs exponent-4 weighting)
fig-focalstack-photomontage
fig-focalstack-photomontage · a weighted-sum merge (ghosting and blur at depth edges) vs a graph-cut-plus-Poisson merge (clean seams) of the same focal stack (Agarwala et al.) 🟨
fig-focalstack-magnification
fig-focalstack-magnification · a focus-at-infinity frame overlaid with a focus-close frame to show that refocusing changes scale, so frames must be registered before merging 🟨
💡 **Big lesson (L14 · Capture the full set, decide later):** instead of committing the focus distance at the instant of exposure, **record a whole stack of focal planes and choose per pixel afterward**. The slogan: *focus stacking is to the focus distance what HDR bracketing is to exposure and what the light-field camera is to the aperture* — defer an irreversible capture decision out of the moment of exposure, paying with **more data** and a **harder reconstruction (a per-pixel merge)** for the payoff of an all-in-focus image no single shot could make. (Registered in [[Big Lessons]] as **L14**; its full first-appearance box sits at the light-field / plenoptic introduction — here it recurs on the **focus axis**, the third sibling after HDR's **exposure axis** and panorama's **viewpoint axis**.)
equations
per-pixel **sharpness** $s_k(p)=\sum_{q\in w(p)} \lVert \nabla I_k(q)\rVert^2$ (local **high-frequency energy** — squared band-pass / Laplacian power, then window-summed/Gaussian-smoothed)
**all-in-focus selection** $k^\*(p)=\arg\max_k s_k(p)$, output $I(p)=I_{k^\*(p)}(p)$
**soft weighted composite** $I(p)=\dfrac{\sum_k w_k(p)\,I_k(p)}{\sum_k w_k(p)}$ with $w_k(p)=s_k(p)^\gamma$ (exponent $\gamma\!\approx\!4$ → near-argmax but smoother)
**Photomontage energy** $E(\ell)=\sum_p D_p(\ell_p)+\sum_{p\sim q}V_{pq}(\ell_p,\ell_q)$ minimized by graph cut (data $D$ = $-$sharpness of label $\ell_p$
smoothness $V$ = seam-cost penalizing visible transitions), then **Poisson** reconstruction across the chosen labels.
10.11 Hyperspectral imaging, color wheels
fig-hyperspectral-cube
fig-hyperspectral-cube · the cube as an image stack indexed by wavelength, contrasting a pixel's full spectrum against the three numbers RGB retains 🟨
fig-hyperspectral-rgb-vs-spectrum
fig-hyperspectral-rgb-vs-spectrum · two materials identical in RGB but clearly distinct in their measured reflectance spectra (with the three RGB sensitivities overlaid) — the metamer case for many bands 🟨
fig-hyperspectral-capture
fig-hyperspectral-capture · the three capture architectures (spectral scan, pushbroom line-scan, snapshot mosaic) and the spatial-vs-spectral-vs-time tradeoff triangle 🟨
⬜ figure not yet created
pushbroom slit+prism scanning lines in space
fig-demosaick-snapshot
fig-demosaick-snapshot · even a plain snapshot needs computation: one colour per pixel (Bayer mosaic) → demosaicked RGB
⬜ figure not yet created
`fig-hyperspectral-uses` (material map / NDVI vegetation map / art-conservation pigment or underdrawing reveal). fig-hyperspectral-uses
💡 **Big lesson (L14 · Capture the full set, decide later — wavelength axis):** RGB throws away the spectrum at the instant of capture (three broad bands, irreversibly mixed). Hyperspectral imaging **records the whole spectrum per pixel and decides spectral questions afterward** — the same defer-the-decision move as HDR (exposure axis), focal stacks (focus axis) and light fields (aperture axis), now on the **wavelength axis**. Cost: data, light, capture time. Payoff: you can ask *what is this made of* long after the shutter. (→ see Big lesson **L14**, [[Big Lessons]]; first-appearance box at the light-field introduction.)
equations
**measurement as projection** — a camera channel $c$ records $I_c(x,y)=\int S(x,y,\lambda)\,R_c(\lambda)\,d\lambda$
RGB is this with **3** broad $R_c$
hyperspectral is the same with **many narrow** band responses $R_b(\lambda)$ (near-delta), recovering the spectrum $S(x,y,\lambda)$ sampled at the bands — i.e. the cube $I(x,y,\lambda_b)$
**NDVI** $=\dfrac{I_{\mathrm{NIR}}-I_{\mathrm{red}}}{I_{\mathrm{NIR}}+I_{\mathrm{red}}}$ (a band ratio — a per-pixel multiplicative/normalized index, L1).
10.12 Intrinsic images with time lapse
⬜ figure not yet created
`fig-intrinsic-timelapse-stack` (a fixed-camera day sequence: shadows sweep across a constant scene fig-intrinsic-timelapse-stack
fig-intrinsic-weiss-median
fig-intrinsic-weiss-median · noisy per-frame log-gradient maps with moving shadow edges, their per-pixel median over time giving clean reflectance gradients, then a Poisson integration to a flat-lit reflectance plus residual shading (Weiss) 🟨
💡 **Big lesson (L14 · Capture the full set, decide later — time/illumination axis):** one image can't separate **reflectance** from **shading** (the split is ambiguous — L10). A **time-lapse stack under changing light** breaks the tie: record the **whole sequence** and decide per-pixel afterward — **what stays constant is reflectance, what varies is shading**. Same defer-the-decision move as HDR (exposure), focal stacks (focus), hyperspectral (wavelength) and light fields (aperture) — here the deferred axis is **time / illumination**. (→ see Big lesson **L14**, [[Big Lessons]]; and **L10**, the prior-not-optional ambiguity this resolves.)
equations
image model $I(x,y,t)=R(x,y)\cdot S(x,y,t)$ (reflectance constant in $t$, shading varies)
in **log** $\log I = \log R + \log S$ (product → sum, L1/L2)
spatial gradient $\nabla\log I(t)=\nabla\log R+\nabla\log S(t)$
**Weiss estimator** $\widehat{\nabla\log R}(x,y)=\operatorname{median}_t\,\nabla\log I(x,y,t)$ (shading gradients vary/cancel, reflectance gradient persists)
recover $\log R$ by **gradient-domain reconstruction** (Poisson, $\nabla^2\log R=\operatorname{div}\,\widehat{\nabla\log R}$)
then $\log S(t)=\log I(t)-\log R$.