10.4 Application to cell phones: HDR+ and burst imaging⧉
Hold up a phone in a backlit room — someone standing in front of a bright window — and press the shutter. The phone does not, as a tripod-bound photographer would, fire three frames at three different exposures and stitch their reliable parts together. It does something that looks almost perverse: it shoots a burst of a dozen or more frames that are all the same exposure, and every one of them is too dark. Then, in a fraction of a second, it aligns them and merges them into a single image whose highlights are intact, whose shadows are clean, and whose noise is far below what any one of those frames showed. That is HDR+, the pipeline Hasinoff and colleagues (2016) built for Google's Pixel phones, and it is the form in which the whole MULTIPLE-EXPOSURE part actually reaches a billion pockets.
The radical move is the one that looks perverse: replace the bracket with a burst of identical, deliberately under-exposed raw frames. Everything else in this chapter explains why that single choice is the right one for a phone, and how it lets one pipeline do denoising and HDR at the same time, in the raw domain, before demosaicking. We will see why a burst beats a bracket when nothing is on a tripod and everything moves; why you bias the exposure down and recover the shadows by averaging rather than risk clipping the highlights; how the align-and-robust-merge core works on linear raw; and finally how the very same aligned burst, with one knob turned, yields resolution instead of range — burst super-resolution (Wronski).
10.4.1 Why a phone shoots a burst, and why it underexposes⧉
Start with the phone's predicament, because every design choice falls out of it. The sensor is tiny. Small photosites hold few electrons — a low full-well capacity — so a single shot captures little dynamic range and carries high relative noise, exactly the noise-floor squeeze of big lesson L15. There is no tripod: the camera rides an unsteady hand. And the scene is alive — people move, leaves stir, a car passes. Classic exposure bracketing, the method of HDR merging, assumes almost the opposite of all this: a static scene, a steady mount, and a handful of different exposures, one of which must be long enough to dig the shadows out. On a phone every one of those assumptions breaks. The long frame of the bracket blurs from hand shake; subjects move between the differently-exposed frames and merge into ghosts; and the small sensor's shadows are noisy no matter how carefully you expose.
The HDR+ answer is to throw out the bracket and shoot a burst of many identical frames, each at the same short, deliberately under-exposed raw exposure (Figure 10.4.1). Short exposures buy two things at once. First, they freeze motion — no per-frame blur, and because every frame is exposed identically there is no exposure change to confuse the aligner, which makes registration easy and robust. Second, many short frames, averaged, beat the noise down by the $1/\sqrt N$ law of Denoising by averaging — recovering the same shadow detail a single long exposure would have given, but without any of its blur. The burst therefore does double duty: it is a denoiser (the $1/\sqrt N$ of averaging) and an HDR merge (range recovered from combining frames that each hold a usable slice of the scene) in one operation. This is big lesson L14 — capture the full set, decide later — instantiated on a phone: capture the whole burst now, and defer the noise-versus-range tradeoff to a downstream merge.
But why under-expose? This is the crux, and it turns on an asymmetry that is easy to state and easy to forget: the two failure walls of a single exposure are not symmetric. A clipped highlight is gone forever. Once a photosite saturates — once it fills its well and reads the maximum value — the information above that level is destroyed, and no amount of merging, averaging, or cleverness recovers it; every saturated frame agrees on the same flat white. A dark-but-unclipped shadow, by contrast, is merely noisy — and noise, as we have spent two chapters establishing, is recoverable: average $N$ frames and it falls as $1/\sqrt N$. So you deliberately bias the exposure down. Pick the per-frame exposure factor $k$ small enough that even the brightest radiance in the scene stays below the clipping ceiling,
guaranteeing highlight headroom in every frame, and pay for that guarantee with extra shadow noise — which the burst merge then removes. You are trading a recoverable problem (noise) to avoid an unrecoverable one (clipping). It is the same logic, made vivid, as exposing-to-protect-the-highlights, but pushed all the way: never clip, and let computation buy back the shadows (Figure 10.4.2).
The burst is L14 on a phone: capture many identical frames now and defer the noise-versus-range tradeoff to a downstream merge. (L14 first appears in this part's introduction; the light-field/plenoptic camera (Advanced computational photography) is the same idea for the focus/viewpoint axis — the elegant special case.)
This is precisely the SNR-optimal capture logic of Hasinoff, Durand and Freeman (2010), instantiated on hardware. That earlier work asked, given a full noise model and a fixed time budget, which set of exposures collects the most signal-per-noise — and its answer was, in general, to favour shorter exposures at higher effective gain and recover by computation. HDR+ is the special case of that principle for a phone: the optimal set is a uniform, under-exposed burst, because choosing identical short frames does not only optimize SNR — it simultaneously freezes motion and makes alignment tractable. The 2010 paper said how to choose the optimal set in general; HDR+ says for a phone, that set is one short exposure repeated $N$ times.
10.4.2 The HDR+ pipeline: align and robust-merge in raw⧉
The pipeline, end to end, is short to state. Capture a burst of $N$ under-exposed raw frames — raw meaning linear and pre-demosaic, the cleanest form of the data, by the encoding discipline of big lesson L2. Select a reference frame: a sharp, sufficiently-exposed one, often an early frame in the burst, against which everything else will be registered. Align every other frame to that reference, with a tile-based, sub-pixel search. Robustly merge the aligned frames in the raw domain. Emit one denoised, high-dynamic-range raw. Only then demosaic it, and finally tone-map and finish into the displayable JPEG (Figure 10.4.3).
The single most important architectural choice in that list is where the merge happens: in raw, before demosaicking. Doing the combine on linear, still-mosaiced data means denoise and HDR are one operation on the cleanest possible signal — no non-linear tone curve has yet been baked in to corrupt the additive average (L2 again — average in linear light), the photon-plus-read noise model stays simple and correct, and the merge can serve as the demosaic's denoiser rather than fighting it. Tone mapping is a strictly later stage.
The merge itself is the reliability-weighted average of HDR merging with one addition. Writing $W_i$ for the estimated warp that aligns frame $i$ to the reference, the merged estimate at a pixel is
where each weight $w_i$ combines two things: the inverse-variance reliability from the noise model — the same SNR-optimal weighting derived in HDR merging, which trusts brighter, less-noisy samples more — and an outlier-rejection term. The outlier term is what makes a burst safe. A frame, or a tile within a frame, that disagrees with the reference — because a subject moved, or because alignment simply failed there — is down-weighted or dropped entirely, so it cannot contribute a ghost. Conceptually this is a per-frequency Wiener-style merge carried out in a tile transform: where the frames agree, average hard and reap the full $1/\sqrt N$ denoise; where they disagree, fall back to the reference and accept that pixel's noise rather than smear two scenes together. For the static, well-aligned majority of the image, the merge collapses to plain averaging and the noise after the merge is
the $1/\sqrt N$ law of Denoising by averaging delivered exactly.
This makes plain that alignment is the whole game, and that ghosts are alignment's failures. A burst merge is only ever as good as its registration: mis-register the frames and the helpful average turns into blur and double images — precisely the failure Image alignment warned of, now the dominant risk. HDR+ aligns in tiles, running a coarse-to-fine pyramid search per tile so it can track locally-varying motion (a hand drift here, a moving arm there) rather than assume one global shift. The robust weight is then the safety net for whatever residual motion the tile alignment cannot undo — occlusions, fast subjects, the edges of moving objects. The trade is explicit and tunable per pixel: more aggressive averaging means more denoising but more ghost risk, and the robustness weight is the dial between them. This is the same median-threshold-bitmap and pyramid-search lineage as the registration of Image alignment and Ward (1997), made local and sub-pixel for the burst case.
10.4.3 From burst HDR to burst super-resolution⧉
Everything so far aligned and merged a burst to win SNR and range. The closing observation of this chapter is that the same substrate — align a burst, then robustly merge — wins resolution the moment you turn one knob. That is burst super-resolution (SR), the method Wronski and colleagues (2019) shipped as the Pixel's "Super Res Zoom," and the only difference that matters is that the alignment is now sub-pixel in a way we exploit rather than merely tolerate.
Here is the mechanism, and it is a small miracle of turning a bug into a feature. The unavoidable hand tremor between burst frames shifts the scene by fractions of a pixel. In Denoising by averaging and Image alignment that tremor was a nuisance to be aligned away; in burst SR it is the signal. Because each frame samples the continuous scene on a slightly offset grid, the frames are not redundant copies — they are complementary samples of the underlying image. Fuse them onto a finer grid and you recover detail beyond what any single frame's sampling could hold. The hand-tremor "dithering" that Denoising by averaging noted in passing — the nudge that makes fixed-pattern noise averageable — is here doing structural work: it is what spreads the samples out.
The deepest way to see it is that aliasing becomes the signal. A single frame's sampling folds high frequencies down into low ones — aliasing, ordinarily a defect to be prefiltered away. But the sub-pixel-shifted frames carry that folded information at different phases, and combining them lets you un-fold it — separate the aliased frequencies and place them back where they belong. This is genuine, measured detail recovered from the burst: reconstruction-based super-resolution, not the hallucinated detail a single-image prior would invent. That distinction — reconstruction versus hallucination — is the organizing axis of Super-resolution and image priors, where the full treatment of the prior, the ill-posedness, and the fusion math lives; we only need the merge-substrate link here. Burst SR even replaces demosaicking: the color filter mosaic gives each channel its own sampling offsets, and combining those across the burst reconstructs full-resolution color directly, with no separate demosaic step.
The unifying statement, and the close of the multiple-exposure arc: HDR merge, burst denoising, and burst super-resolution are one operation — align a burst, then robustly merge — with a single dial set to recover range (HDR+), SNR (the $1/\sqrt N$ denoise), or resolution (sub-pixel super-res), and on a modern phone usually all three at once. It is big lesson L14 taken to its conclusion — capture the set, align, merge, decide later — beating the single-shot limits of L15 not with a bigger sensor but with computation.
Big lessons of this chapter
The recurring principles from this chapter, gathered for review.
The burst is L14 on a phone: capture many identical frames now and defer the noise-versus-range tradeoff to a downstream merge. (L14 first appears in this part's introduction; the light-field/plenoptic camera (Advanced computational photography) is the same idea for the focus/viewpoint axis — the elegant special case.)