💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

10.2 HDR merging

📎 Problem set

PS4 merges an exposure bracket into a radiance map and tone-maps it. → Problem sets (appendix).

A scene with a window in it is two scenes. There is the bright world outside — sunlit clouds, a lawn — running thousands of times brighter than the dim interior, the bookshelf and the face of the person sitting in the shade. Your eye sweeps the room and sees both at once, because it adapts locally as it goes. Your camera cannot. Expose for the face and the window blows out to a featureless white rectangle; expose for the window and the room collapses into black. There is no single shutter speed that holds both, and the reason is not a defect to be engineered away so much as a hard limit on what one exposure can record.

That limit has two walls. Turn the exposure up far enough and the brightest parts of the scene clip: every photosite has a finite full-well capacity, and any radiance past it saturates to the maximum value, all detail above collapsing to a flat white. Turn the exposure down and the dimmest parts drown: the signal sinks beneath the sensor's read-noise floor, and trying to lift those shadows in post just amplifies noise. Between the two walls is the slice of the world a single shot can faithfully hold — and for a high-contrast scene, that slice is far too narrow.

The fix is the subject of this chapter, and it is the same move that gives us focus stacks and light-field cameras: don't commit to one exposure — capture the whole bracket and decide afterward. Shoot several frames, each measuring a different slice of the radiance axis: a short exposure that keeps the highlights, a long one that digs out the shadows, and a few in between. Then merge the reliable parts of each into one high-dynamic-range (HDR) radiance map — a single floating-point estimate of the scene radiance $L(x,y)$ at every pixel, recovered from $N$ clipped, noisy, differently-exposed observations.

Two questions organize the whole chapter. First, calibration: how does a pixel value relate to scene radiance — what is the exposure factor (and, if the camera is non-linear, the response curve) that maps radiance to the number stored? Second, combination: given the usable observations, how do you average them — and what makes one observation more trustworthy than another? Everything below is those two questions. A third question — how to squeeze the finished HDR map back onto a low-contrast display — is tone reproduction, a separate problem we recap in Recap: tone mapping and whose operators live in Bilateral filtering. Merge here, tone-reproduce there.

💡 Big lesson (L15)

Dynamic range = full-well capacity ÷ noise floor. A single exposure records from a top — the photosite's full-well capacity, where values saturate and clip — down to a bottom — the read-noise floor in the shadows. The ratio of those two is the dynamic range, measured in stops: how much scene contrast fits in one shot. It is typically far less than the ten-to-twelve orders of magnitude the world spans. You can widen the window — a bigger well (larger photosites; a full-frame sensor beats a phone) or a lower floor (cooling, lower read noise) — but the photographic workhorse is to beat it by merging multiple exposures: bracket the scene, let each shot own a different slice of the radiance axis, and stitch the slices into one radiance map. That merge is this chapter. (Registered as L15; first appears in the FUNDAMENTALS noise chapter; this is its capture-side HDR recurrence. Its companion is L6 — with a sane encoding it is range, not bit depth, that bounds a shot.)

This is big lesson L14 — capture the full set, decide later — on the exposure axis: record the whole bracket and choose afterward, deferring the irreversible which exposure? out of the fragile moment of capture. (Registered as L14; first appears in this part's introduction; the light-field/plenoptic camera (Advanced computational photography) is the same idea for the focus/viewpoint axis — the special case.)

10.2.1 The HDR challenge

It helps to put numbers on the gap. Real scenes span ten to twelve orders of magnitude of illumination — the human eye, adapting as it goes, works from roughly $10^{-6}$ to $10^{6}\,\text{cd/m}^2$, and a single ordinary room with a window is routinely $1{:}100{,}000$ from the brightest to the dimmest point. A negative or a sensor records only two to three orders in one shot. A print or a screen shows even less — typically $1{:}20$ to $1{:}50$, at best around $1{:}500$. Two genuinely distinct problems fall out of those numbers: (1) record the information, which a single capture cannot, and (2) display it, which the medium cannot do even once the information is in hand. This chapter is problem (1), recording, via merge. Problem (2) is tone reproduction, deferred (Figure 10.2.1).

To make the recording problem precise, carry one formation model through the entire chapter. Scene radiance $L(x,y)$ reaches a photosite, is scaled by an exposure factor $k_i$ (set by shutter, aperture, and ISO), has noise $n$ added, and clips above the well capacity:

$$ I_i(x,y) = \mathrm{clip}\big(k_i\,L(x,y) + n\big). $$

Read it back: the $i$-th frame is the one scene radiance, brightened by that frame's exposure factor, corrupted by noise, and flattened wherever it tries to exceed the ceiling. Everything below is the inverse of this equation — estimate the single $L$ from the several clipped, noisy $I_i$. The two walls of L15 are right there in the model: the $\mathrm{clip}$ is the upper wall, and the $n$ is the lower one. Push $k_i$ up and more of the scene hits the clip; push it down and more of the scene sinks into $n$. No single $k_i$ avoids both for a high-range scene.

The resolution is L14 made concrete. Shoot $N$ exposures that sequentially measure different segments of the range. Each frame $I_i$ is reliable only over a band of radiance — above its noise floor, below its clip ceiling — and changing the exposure factor slides that band along the radiance axis. Choose the factors so the bands tile the whole scene range, then merge the reliable parts into one floating-point radiance map (Figure 10.2.2). That map is the chapter's output: one float per $(x,y,c)$, with values that may run well past $1$, stored in a format built for it (OpenEXR; the red-green-blue-exponent format (RGBE); or Adobe's Digital Negative (DNG)). Compressing it to something a screen can show is the separate tone-mapping stage.

Two honest caveats frame the practice. The first is modern single-shot range: today's sensors already deliver roughly thirteen to fourteen stops in one capture, so explicit HDR bracketing is less universally necessary than it once was — a single raw, tone-mapped, often suffices, with shadow noise the limiting factor. HDR merge still wins whenever the scene genuinely exceeds the sensor's single-shot range, and — crucially — the cell-phone burst reuses the very same merge for denoising, not only for range. The second is lineage: photographers fought dynamic range long before digital. Gustave Le Gray (1857) sandwiched two negatives, one exposed for the sky and one for the sea; graduated neutral-density filters and fill-flash compress scene contrast at capture; the darkroom's dodging, burning, and Ansel Adams's Zone System Adams, The Negative are tone reproduction done by hand. The digital merge automates the negative sandwich.

fig-dr-clip-vs-noise
Figure 10.2.1. One scene, one exposure, and the two walls of the single-shot dynamic range. A high-contrast scene (interior plus bright window) is shown with its radiance histogram laid against the photosite's response: highlights past the full-well capacity clip to a flat ceiling (the window blows to white), and shadows beneath the read-noise floor drown in noise (the interior sinks to black). The usable window between the two walls — the single-shot dynamic range — is shaded; the scene's true range overflows it on both ends. This is L15 drawn for one capture: dynamic range is the ratio of that ceiling to that floor.
fig-exposure-stack-slices
Figure 10.2.2. A bracket tiles the radiance axis. The horizontal axis is scene radiance (log scale, in stops); each exposure $I_i$ in the bracket is drawn as a usable band — above its noise floor, below its clip ceiling — and increasing the shutter time $t_i$ slides the band leftward (toward dimmer radiance). A short exposure owns the highlights, a long one owns the shadows, and the bands overlap so that together they cover the full scene range. The merge keeps the reliable part of each band and stitches them into one radiance map spanning the whole axis.
fig-exposure-bracket-real
Figure 10.2.3. The same idea on a real bracket — a dim interior with a blown-out window onto Boston harbour, a textbook high-dynamic-range scene. Four frames of a longer sequence, from the shortest exposure (the window and skyline are correctly exposed, but the room is lost in noise-black) to the longest (the room opens up, but the window clips to featureless white). No single frame holds both ends; each captures a different band of scene radiance, exactly the bands the previous figure drew — which is why the scene must be bracketed and merged.

10.2.2 Data capture: how to vary the exposure

To tile the radiance axis you must change the exposure factor $k_i$ between shots. Four knobs do it — shutter speed, aperture, ISO, and a neutral-density (ND) filter — and they are emphatically not interchangeable. Each has a side effect on the image beyond brightness, and those side effects decide which knob a careful bracket should use (Figure 10.2.4).

Shutter speed is the clean knob, and the default. Its range runs from roughly 30 seconds to 1/4000 of a second — about six orders of magnitude, more than enough to bracket almost any scene. Its great virtue is that it is reliable and linear: doubling the exposure time doubles the collected photons, so $k_i \propto t_i$ exactly, which is precisely what the merge math below assumes. Its only cost is that very long exposures accumulate their own dark-current noise and risk subject motion. Because it changes only the exposure and nothing else about the image, shutter speed is the right tool for bracketing.

The other three knobs each break the "only the exposure changes" assumption in some way. Aperture spans roughly $f/1.4$ to $f/22$, about two and a half orders ($k \propto 1/N_f^2$ in the $f$-number), but opening or closing it changes the depth of field — the bracket frames would carry different focus blur, so the merged result would be inconsistent. Use it only when desperate. ISO spans roughly 100 to 1600, about one and a half orders, and it is gain applied after capture, not more light — it scales the signal and the noise, so on its own it generally raises noise and is a poor way to extend range. (Its rehabilitation, once you model noise properly, is Hasinoff's insight at the end of the chapter.) A neutral-density filter darkens the incoming world by up to about four densities, and filters stack, but ND is not perfectly neutral (it introduces a color cast), it is imprecise, and adding one means you must touch the camera — risking shake and forcing a re-registration. Its one real advantage is that it works with strobe and flash, where shutter speed cannot reach.

The takeaway ties to the stops idea: exposure is counted in stops, factors of two that are additive in log space, and the clean way to walk the radiance axis is shutter speed. The other knobs trade a side effect — depth of field, noise, color — for range, and are fallbacks. There is a deeper question lurking here: not just how to vary the exposure but which exposures to pick — the optimal set — and that is the subject of the chapter's last section.

One assumption underlies the basic merge and deserves stating loudly: it presumes only the exposure changes between frames — no motion, perfect registration. A tripod gives you that. Hand-held bracketing does not: you must align the frames first (Ward's median-threshold bitmap (MTB) is a contrast-robust alignment that compares frames thresholded at their median and runs coarse-to-fine on a pyramid) and reject ghosts where objects in the scene moved between frames. Alignment is the sibling chapter's business; the burst-imaging chapter makes robust alignment central.

fig-vary-exposure-knobs
Figure 10.2.4. The four exposure knobs and their side effects, in one comparison. Each row varies one knob across a small bracket: shutter (clean and linear — only brightness changes, $k \propto t$); aperture (brightness changes but so does depth of field — the out-of-focus background sharpens or blurs across the bracket); ISO (brightness changes by post-capture gain, and noise visibly grows with it); ND filter (brightness drops but a color cast creeps in, and a hand icon flags that you must touch the camera). The shutter row is marked as the one knob that changes nothing but the exposure.

Everything so far has assumed that software does the merging: you shoot a bracket of separate frames and fuse them in post. But the same "capture several slices and combine them" move can be folded into the readout itself, and increasingly is. On-chip — or "direct" — HDR is now an industry-wide sensor capability: many modern image sensors produce a high-dynamic-range signal inside the silicon, handing the photographer one wide-range raw instead of a stack to merge. Sony is a prominent example, with its digital-overlap "double-read" scheme, but it is not one vendor's trick — OmniVision, Samsung's ISOCELL, and onsemi and the other automotive sensor houses all deliver on-chip HDR by one strategy or another. The strategies are best read by where they multiplex the dynamic range — in time, in space, or in gain — because that axis is exactly what each one pays for it (Figure 10.2.5).

Time-multiplexed: the double read. DOL-HDR (digital-overlap) reads the sensor twice — the same exposure, or interleaved long and short exposures line by line — and combines them: the long read recovers the shadows, the short read holds the highlights. This is Sony's "double read." Because the long and short captures are not simultaneous, anything moving between them ghosts — the very de-ghosting problem of a hand-held bracket, now reappearing inside a single frame.

Gain-multiplexed: dual conversion gain. DCG reads each pixel at two readout (conversion) gains — a high gain for a low noise floor in the shadows, a low gain for more full-well headroom in the highlights — then merges the two. Since both reads sample the same instant, it is the cleanest of the lot: no motion artifacts, no resolution loss. Its catch is that the extra range it buys is bounded by the spread between the two gains.

Space-multiplexed: split pixels and assorted pixels. Two schemes trade pixels for range. A split-pixel (dual-photodiode) design puts a large and a small sub-photodiode in each photosite: the large one saturates early and owns the shadows and midtones, the small one keeps reading into bright highlights — common in automotive sensors, which must survive a tunnel exit or direct sun-glare. The older spatially-varying exposure (SVE), or "assorted pixels," of Nayar and Mitsunaga (2000) bakes different sensitivities into neighboring pixels — an exposure mosaic, a Bayer pattern but for neutral density — then demosaics the result into an HDR image. Both are single-shot, so neither ghosts; both pay in resolution and SNR (the small diode, or the darkened pixels, are noisier). A related lateral-overflow (LOFIC) / voltage-domain trick adds extra charge capacity per pixel — a capacitor that catches the charge spilling past the photodiode's well — extending full-well in the charge domain without a second exposure at all. At the far end sit logarithmic / companding sensors, whose per-pixel response is log (or piecewise-companded) in radiance, capturing many decades in one read at the cost of lower contrast resolution and fixed-pattern non-uniformity within any band.

The whole family reduces to one trade-off, and it is worth stating in a line. Time-multiplexed schemes (DOL double-read) buy range with motion artifacts; space-multiplexed schemes (split-pixel, SVE) buy it with resolution and SNR; gain-domain DCG is the cleanest — same instant, full resolution — but its extra range is bounded. The software HDR+ burst of the next chapter sits at the opposite pole: maximal flexibility and denoising for free, paid for in compute and alignment. This is the computational-sensor trade space in miniature — push the work into silicon or push it into software — and it is taken up in earnest in Advanced computational photography.

fig-hdr-sensor-strategies
Figure 10.2.5. On-chip ("direct") HDR sensors, organized by where they multiplex the dynamic range. Time-multiplexed (DOL "double read"): long and short reads of the same frame are combined — schematic shows the two interleaved reads and a ghosted moving object flagging the motion artifact. Space-multiplexed (split-pixel large+small sub-photodiode, and SVE / assorted-pixels exposure mosaic): neighboring sites carry different sensitivities, single-shot but at a resolution/SNR cost. Gain-multiplexed (dual-conversion-gain): the same instant read at two readout gains — cleanest, but its added range is bounded. Each cell labels its cost (motion / resolution+SNR / bounded range), and a fourth panel contrasts the software HDR+ burst — maximal flexibility, paid in compute and alignment.

10.2.3 Curve calibration

Before you can merge, you must know how a pixel value $z_i$ relates to scene radiance — that is, recover the exposure factors $k_i$ (or at least their ratios) and, if the capture is non-linear, the camera response curve $f$ where $z = f(kL)$. There are two regimes, easy and less easy: a linear capture, where calibration is a single ratio, and a non-linear capture, where it is a regression.

Linear case — just a ratio. If the images are linearly encoded — raw, or un-gamma'd — then at any pixel that is neither clipped nor noise-dominated, $I_i = k_i L$. (This is also why radiometry must be done in linear light in the first place: only there is the pixel value proportional to radiance.) Take a pair of adjacent exposures and the unknown radiance cancels:

$$ \frac{I_i}{I_j} = \frac{k_i\,L}{k_j\,L} = \frac{k_i}{k_j}, $$

which is the same number at every good pixel, if linearity holds. Estimate it robustly with the median over the pixels where both frames are reliable,

$$ \frac{k_i}{k_j} = \operatorname*{median}_{x:\,w_i,w_j>0}\!\left(\frac{I_i(x,y)}{I_j(x,y)}\right), $$

and chain the pairwise ratios to recover every $k_i/k_0$. Note you only ever pin exposure down up to one global scale factor — which is fine, since radiance maps are inherently relative unless you do an absolute photometric calibration. In practice you can often just read $k_i$ straight from the camera's Exchangeable Image File Format (EXIF) metadata ($t_i$, $f$-number, ISO); the ratio-from-pixels method is the fallback for when the metadata is missing or the response is suspect, and it has the virtue of being measured from the actual data.

Non-linear case — recover the response curve. Real cameras (and film, and JPEG) apply a non-linear response $f$ — a gamma, a tone curve, a film characteristic. Then $z_i = f(k_i L)$, and you must recover $f$ (equivalently its log-inverse $g = \ln f^{-1}$) before you can merge. The trick of Debevec and Malik (1997) is that a pixel imaged at several exposures gives several $(z_i, t_i)$ pairs that all constrain the same radiance — and that over-determination pins down the curve. Working in the log domain, each observation says

$$ g(z_i) = \ln L + \ln t_i. $$

The unknowns are the curve $g(\cdot)$ — one value per possible pixel level, say 256 of them — and one $\ln L$ per pixel. Stack one such equation per pixel per exposure and you have a linear least-squares system in $g$ and the $\ln L$'s, solved jointly, exactly the kind of inverse problem of Linear Inverse Problems and Regression. Two riders make it well-behaved: a smoothness penalty on $g''$ (a real response curve is smooth, not jagged), and the same reliability weight $w(z)$ we will use in the merge, down-weighting samples near the clip and near the noise floor. The recovered curve is fixed only up to one additive constant — again, the global scale (Figure 10.2.6).

Mitsunaga and Nayar (1999) assume even less. They model the response as a low-order polynomial and recover it — together with the unknown exposure ratios — by radiometric self-calibration, so you need not even trust the EXIF exposure ratios. The spirit is identical: pixels seen at multiple exposures over-determine the curve, so fit a smooth response to them. (Mann & Picard 1995's "comparametric," or Wyckoff, work is the earlier multiple-exposure precursor to both.)

Why prefer a regression over the brittle per-pair ratios? Because a robust regression for $g$ and $L$ directly models the uncertainty due to noise — heavier weight where the SNR is high — and generalizes the linear and non-linear cases in one framework. That is exactly the seam into the next two sections: once the problem is written as a weighted least squares, the right weights are not a hand-drawn shape but the inverse of each observation's noise variance.

One encoding note holds the whole chapter together. Merging is an additive operation in radiance, so it must be performed in linear light — or, equivalently, in the log-radiance domain that Debevec's $g$ lives in. Gamma- or tone-curved values are not proportional to radiance and would merge wrong; un-apply the response first. The display side, by contrast, lives in a perceptual or log space tuned to the eye — but that is tone mapping, and it happens elsewhere.

fig-response-curve-debevec
Figure 10.2.6. The recovered camera response (Debevec–Malik). Left: the camera's non-linear response curve $z = f(kL)$ — pixel value against log exposure — an S-shaped film/JPEG curve. Right: its log-inverse $g(z) = \ln(\text{exposure})$, recovered by least squares from pixels seen at many exposures and straightened into log-radiance. Faint clusters of sample points (one cluster per pixel, spread across exposures and shifted by $\ln t_i$) are shown pinning the curve down, up to one global vertical offset; a smoothness penalty keeps $g$ from wiggling between samples.

10.2.4 Combining exposures

With calibration in hand, the merge is a three-step recipe: (1) recover the scale factor $k_i$ for each image — from EXIF or from pixel ratios, the previous section; (2) compute a weight map $w_i(x,y)$ marking which pixels of each image are usable; (3) reconstruct the radiance per pixel as a weighted combination of the rescaled observations.

The weight map answers a simple question at every pixel: is this observation trustworthy? It is trustworthy only if it is neither clipped nor noise-dominated. So we reject clipped highlights — say $z > 0.99$ — because a clipped pixel tells you only "at least this bright," not how bright. And we reject too-dark, too-noisy shadows — say $z < 0.002$ — because there the read-noise floor swamps the signal. What survives is the usable band of that exposure. The simplest weight is binary, $w \in \{0, 1\}$, but it extends naturally to a smooth scalar (a soft "hat" that fades toward both ends), and the weights can differ per color channel.

The merge itself is a weighted average in radiance. Rescale each usable observation back to radiance by dividing out its exposure factor — $z_i/t_i$, or more generally $z_i/k_i$ — then average, weighted by reliability:

$$ \hat L(x,y) = \frac{\sum_i w(z_i)\,z_i/t_i}{\sum_i w(z_i)}. $$

Read it back: the estimated radiance is the reliability-weighted mean of the rescaled observations. In code this is one pass per pixel over the bracket — divide out each exposure factor, accumulate by reliability, normalize:

for each pixel (x, y):
    num = 0
    den = 0
    for each exposure i:
        z = I_i[x, y]
        w = weight(z)              # 0 at clip and at the noise floor
        num += w · z / t_i         # rescale to radiance, weight by reliability
        den += w
    if den > 0:
        L[x, y] = num / den
    else:                          # bad in every frame
        L[x, y] = fallback(x, y)   # clipped → brightest exp.; dark → shortest

In the non-linear (Debevec) case the identical average runs in the log domain on the recovered curve,

$$ \ln \hat L(x,y) = \frac{\sum_i w(z_i)\,\big(g(z_i) - \ln t_i\big)}{\sum_i w(z_i)}, $$

but it is the same reliability-weighted mean either way. And where several exposures overlap and all see a pixel validly, the averaging does double duty: it also denoises, by exactly the $1/\sqrt N$ effect of the denoising sibling, now reliability-weighted. This is the unifying observation of the part — HDR merge is averaging-to-denoise, generalized with per-observation reliability weights; set every weight equal and a single radiance band and it collapses to plain averaging (Figure 10.2.7).

One edge case needs care: a pixel can be bad in every exposure — clipped in all of them (a bare light source) or noise-dominated in all (the deepest shadow). Do not discard it. Keep the clipped pixel from the brightest-exposed image and the dark pixel from the shortest-exposed image — the frames that at least recorded some value there. The implementation trap is in the thresholds: the literal values $0.0$ and $1.0$ must actually be kept for these fallback frames, not swept away by the same reject rule. And mind the direction — it is the dark pixels of the brightest image and the bright pixels of the darkest, not the reverse.

A different philosophy deserves a name here, because it short-circuits the whole pipeline. Exposure fusion (Mertens, Kautz and Van Reeth 2007) skips the radiance map entirely. Instead of recovering a response curve and building a floating-point HDR image, it scores each pixel of each bracket frame by a per-pixel quality — well-exposedness, local contrast, saturation — and then blends the bracket directly in a Laplacian pyramid, weighted by those scores. There is no response-curve recovery, no HDR float map, and no separate tone-mapping step: the multi-scale pyramid blend hides the exposure seams and emits a displayable image in one shot. It reuses exactly the pyramid-blending machinery of Linear pyramids and wavelets. Fusion is the pragmatic "merge straight to a viewable image" alternative to "build an HDR map, then tone-map" — a different answer to the same starting problem, and an excellent one when a display-ready result is all you need.

fig-weight-triangle
Figure 10.2.7. The reliability weight $w(z)$ as a function of pixel value. The weight is zero at the clipped end (e.g. $z \ge 0.99$) and zero at the dark, noisy end (e.g. $z \le 0.002$), rising to a plateau in the well-exposed middle — a hat shape. A dashed overlay shows the SNR-optimal version of the weight, skewed toward the brighter, less-noisy samples rather than symmetric. Below, a sketch of three bracket frames shows each contributing only over the band where its weight is non-zero, with overlap regions (where averaging also denoises) shaded.
fig-hdr-merge-antelope
Figure 10.2.8. The merge on a real bracket (Antelope Canyon). A shorter exposure keeps the bright sunlit slot but buries the rock in noise-dark shadow; a longer exposure lifts the shadows but clips the sun. Each frame's weight map (rainbow — bright means high weight) marks where it is well-exposed, neither clipped nor noisy. The reliability-weighted average in linear radiance keeps the good band from each, and the float radiance map is then tone-mapped for display, so both the sun and the deep shadows read at once.

The weighted average produces a floating-point radiance map, not a displayable image — its range still exceeds any screen. Compressing it for display is tone mapping, and even the simplest local operator splits the image into a low-frequency base and a high-frequency detail layer and squeezes only the base. The one choice that matters is the base filter: a Gaussian base bleeds contrast across high-contrast boundaries into a grey halo, whereas an edge-aware (bilateral) base respects the boundary and keeps the detail crisp — the motivation for the bilateral filter in Edges matter, with tone mapping developed in Basic image processing and ISP.

fig-tonemap-gauss-vs-bilateral
Figure 10.2.9. Tone-mapping the Antelope Canyon HDR two ways. Gaussian base: the smoothing crosses the bright-slot edge and lifts a washed-out grey halo along the boundary. Bilateral base: the edge-aware smoothing stops at the boundary, so the rock stays dark and saturated and the edge stays crisp — same range compression, no halo. The zoomed crops isolate the boundary where the difference shows.

10.2.5 Optimizing the capture and merge

The binary mask is leaving information on the table. When two or more exposures both see a pixel validly, they carry different noise — and a binary "use only the cleanest one" throws away the rest, forfeiting the $1/\sqrt N$ benefit of combining them. There must be a way to use the noisier observation a little. The principled answer is to weight each observation by its reliability, defined as inverse noise variance.

The kernel of it is a two-observation derivation. Given two unbiased estimates $x$ and $y$ of the same quantity, with variances $\sigma_x^2$ and $\sigma_y^2$, combine them as $\hat L = a\,x + (1-a)\,y$. The variance of that combination is $a^2 \sigma_x^2 + (1-a)^2 \sigma_y^2$; differentiate with respect to $a$, set to zero, and solve:

$$ a^{*} = \frac{\sigma_y^2}{\sigma_x^2 + \sigma_y^2}. $$

The optimum weights each estimate by the inverse of its variance. The sanity checks land: equal variances give $a = 1/2$, the plain average; one estimate far noisier than the other gets little weight, but never zero weight. Generalizing to many observations, $w_i = 1/\sigma_i^2$, so

$$ \hat L = \frac{\sum_i (z_i/k_i)\,/\,\sigma_i^2}{\sum_i 1/\sigma_i^2}, $$

which is inverse-variance (minimum-variance) weighting — the best linear unbiased estimator.

To make those weights concrete, plug in the camera's noise model. The per-pixel variance is dominated by two terms: photon (shot) noise, whose variance is proportional to the signal and so dominates in the highlights, and read noise, whose variance is constant and so dominates in the shadows:

$$ \sigma^2(x) = a\,x + \sigma_\text{read}^2, $$

where $a$ and $\sigma_\text{read}$ depend on the camera and the ISO. You can calibrate this by shooting repeated frames and measuring the variance across the stack, or derive it on paper. Either way, the merge weights are no longer a hand-drawn hat — they come from the physics of the sensor.

Something clean and slightly surprising falls out. Rescaling an observation to radiance by $1/k_i$ amplifies both signal and noise, so $\operatorname{Var}(z_i/k_i) \approx \sigma_i^2/k_i^2$. Push that through the inverse-variance formula in the photon-noise-only case: the per-pixel radiance $L$ and the photon slope $a$ are common to every term and cancel between numerator and normalizer, and the two $k_i$ factors cancel as well — so every pixel of a given exposure receives the same weight, $w'_i \propto k_i$. Longer, brighter exposures (with relatively less photon noise per unit signal) are simply weighted up. Add the read-noise term back, though, and the cancellation is incomplete: the optimal weights then do vary per pixel. Read noise is precisely what makes the optimum genuinely spatially varying. In all cases you still reject the clipped and the dead-dark pixels with the binary mask first, then weight the survivors by inverse variance.

Hasinoff, Durand and Freeman (2010) push the idea one level up: don't stop at optimal weights for a given bracket — choose the optimal set of exposures to shoot in the first place. With the full noise model (photon plus read) and ISO treated as a real degree of freedom, they pick the shutter-and-ISO sequence that maximizes the worst-case SNR across the scene's brightness range under a fixed time budget. The result is counter-intuitive: the SNR-optimal set often uses higher ISO and shorter exposures than naive bracketing. For one scene, standard ISO-100 bracketing ($1/100$, $1/25$, $1/6$ s) is beaten by an SNR-optimal ISO-3200 set ($1/3200$, $1/125$, $1/5$ s), gaining around 12 dB in the dark regions and approaching the curve of an ideal HDR sensor (Figure 10.2.10). (Granados et al. 2010 develop the optimal per-pixel weights from the full noise model in the same vein.)

That logic — fewer, shorter, higher-ISO frames, then merge optimally; underexpose to avoid clipping and freeze motion, and recover the rest by computation — is exactly the philosophy that the cell-phone HDR pipeline runs with (Hasinoff et al. 2016), where the bracket becomes a burst of identical underexposed raws merged for denoising and range at once. The burst chapter takes it from here.

fig-hasinoff-snr-optimal-set
Figure 10.2.10. The SNR-optimal exposure set (Hasinoff–Durand–Freeman). SNR (in dB) is plotted against log scene brightness for three strategies: standard bracketing at ISO 100, the SNR-optimal high-ISO short-exposure set, and the ideal HDR sensor as an upper bound. The optimal set lifts the worst-case (darkest-region) SNR by roughly 12 dB over naive bracketing and tracks close to the ideal curve across the range — the counter-intuitive payoff of using higher ISO and shorter exposures once the noise model is taken seriously.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (L15)

Dynamic range = full-well capacity ÷ noise floor. A single exposure records from a top — the photosite's full-well capacity, where values saturate and clip — down to a bottom — the read-noise floor in the shadows. The ratio of those two is the dynamic range, measured in stops: how much scene contrast fits in one shot. It is typically far less than the ten-to-twelve orders of magnitude the world spans. You can widen the window — a bigger well (larger photosites; a full-frame sensor beats a phone) or a lower floor (cooling, lower read noise) — but the photographic workhorse is to beat it by merging multiple exposures: bracket the scene, let each shot own a different slice of the radiance axis, and stitch the slices into one radiance map. That merge is this chapter. (Registered as L15; first appears in the FUNDAMENTALS noise chapter; this is its capture-side HDR recurrence. Its companion is L6 — with a sane encoding it is range, not bit depth, that bounds a shot.)