💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

10.9 Focal stacks and depth of field extension

A single photograph can only be sharp over a limited slab of depth. The optics chapter made this precise: only scene points whose blur disc stays inside an acceptable circle of confusion count as in focus, and that acceptable band — the depth of field — is set by the aperture, the focal length, and the focusing distance. Stop the lens down and the band widens, but you pay twice for it: a smaller aperture lets in less light (more noise) and eventually runs into the diffraction limit, at which point sharpness falls again. So there is a floor. In macro photography, microscopy, product and jewelry shots, or a landscape that wants both a near rock and a far peak crisp, even the smallest usable aperture cannot hold the whole scene in focus at once. These are exactly the cases the optics chapter flagged as unwinnable in a single exposure.

The escape is not to push the optics harder but to change axis. Instead of committing to one focus distance at the instant of exposure, capture a stack of frames focused at different distances, and merge them, keeping each output pixel from the frame in which that scene point is sharpest. As the focus sweeps front to back, the thin in-focus slab marches through depth, so every surface in the scene is sharp in some frame. Stitch the sharp pieces together and you get an all-in-focus (extended depth-of-field) image that no single exposure could have taken.

That is the whole chapter, and it is one idea you have already met three times. The merge is a per-pixel selection — measure local sharpness, take the argmax across the stack — and the good merge is the same graph-cut label choice plus gradient-domain blend that hid the seams in panorama stitching and object compositing. We will build the simple selection algorithm first, upgrade it to the principled Photomontage version, and close on the capture hardware and the one trap that quietly sabotages a naive stack: refocusing the lens changes the magnification. One flag to plant up front: every merge below assumes the frames are already registered (pixel-aligned); the hardware section at the end explains why that assumption is non-trivial — refocusing subtly rescales the frame — and how to restore it.

💡 Big lesson (L14 · capture the full set, decide later)

Instead of fixing the focus distance at the instant of exposure, record a whole stack of focal planes and choose per pixel afterward. The slogan: focus stacking is to the focus distance what HDR bracketing is to exposure and what the light-field camera is to the aperture — defer an irreversible capture decision out of the moment of exposure, paying with more data and a harder reconstruction (a per-pixel merge) for an all-in-focus image no single shot could make. This is the same move as HDR (the exposure axis) and panorama (the viewpoint axis); here the axis is focus. (Registered as L14; it first appears in this part's introduction, and the light-field / plenoptic camera in Advanced computational photography is the same idea for the focus axis — the elegant special case. Here it recurs on the focus axis.)

10.9.1 Why limited depth of field is the problem

It is worth being precise about what we are fighting, because the precise statement is also what tells us why a naive merge struggles. From Optics, lenses, and aberration correction: only objects within the depth of field image inside an acceptable circle of confusion; everything nearer or farther blurs. The band grows as you stop down — a bigger $f$-number $N$ means more depth in focus — but a small aperture means less light and, past a point, diffraction that softens the whole image again. There is no aperture at which a deep macro scene comes out fully sharp. Where this bites hardest is macro (a bug or a flower at close range), microscopy, product and jewelry work, and landscape that wants a near foreground and a far horizon both crisp.

One caveat from optics matters for everything that follows: defocus is not a single convolution of the image. The blur radius depends on scene depth, not on output-pixel location, and at depth edges occlusion intervenes — a sharp foreground edge sits in front of a blurred background, and which one a pixel belongs to flips abruptly. This is why a naive blend struggles precisely at depth boundaries, and why the good merge needs coherent seams rather than a smooth cross-fade. Keep the depth edge in mind; it is where every method earns or loses its keep.

10.9.2 A simple algorithm: sharpness = local high-frequency energy, then argmax

The core idea is a one-line answer to "which frame is sharpest here?" A region is in focus exactly when it carries local high-frequency detail — crisp edges and fine texture. Defocus is a low-pass blur; it destroys high frequencies. So measure, per pixel and per frame, how much high-frequency content sits in a small window, and at each output pixel pick the frame that has the most.

Concretely, take the band-pass / high-frequency component of frame $I_k$ — a Laplacian, a high-pass filter, or one band of a pyramid — and accumulate its energy over a small window $w(p)$ around the pixel:

$$ s_k(p) = \sum_{q \in w(p)} \lVert \nabla I_k(q) \rVert^2 . $$

Read this back: the sharpness of frame $k$ at pixel $p$ is the local high-frequency power — squared gradient (or squared Laplacian) summed over a neighborhood. The window matters: a single pixel's gradient is too noisy a vote, so we pool a small patch's worth of high-frequency energy into one sharpness reading (Figure 10.9.1).

There is one subtle step here that trips people up. You must square (or take the absolute value) before you average. The raw high-frequency signal swings positive and negative: crossing an edge it runs $+ \to 0 \to -$. Its local average is therefore near zero by construction — the low frequencies, which carry any net offset, were exactly what the high-pass removed. Smooth the raw band-pass and you get nothing. Square it first and every value is positive, so the local (Gaussian) average now correctly reads "there is detail here." This is the same reason variance, not the mean of a zero-mean signal, measures activity (Figure 10.9.2).

With a per-frame sharpness map in hand, the all-in-focus image is the per-pixel argmax — at each pixel, copy from whichever frame was sharpest:

$$ k^*(p) = \arg\max_k s_k(p), \qquad I(p) = I_{k^*(p)}(p) . $$

A nice aside: the selection map $k^*(p)$ — "which focal plane was sharpest here" — is itself a coarse depth map, since each frame corresponds to a known focus distance. Recovering depth this way is shape-from-focus Nayar & Nakagawa 1994, and it ties the focal stack to depth estimation: the same sharpness-argmax that builds the all-in-focus image also sketches the scene's geometry.

A hard argmax, though, is noisy and seam-ridden — at a pixel where two frames are nearly tied, the winner flips on a coin, and neighboring pixels can end up sourced from wildly different frames (salt-and-pepper switching). Two softening fixes from the slides, both about weighting the best frame more heavily while staying spatially coherent:

Beyond reweighting, it helps to spatially regularize the choice itself: smooth the sharpness maps $s_k$, or smooth the label field, so neighboring pixels tend to draw from the same frame. This "spatial coherence in the choice of pixels" is the right instinct, and it is exactly what the advanced method makes principled instead of ad hoc (Figure 10.9.3).

fig-focalstack-capture
Figure 10.9.1. Why one exposure is not enough, and how the stack fixes it. At a single aperture only a thin slab of depth falls inside the acceptable circle of confusion, so the subject is sharp in one band while foreground and background mush into blur — stopping down to f/16 widens the slab but cannot cover the whole scene, and diffraction softens everything past the smallest usable aperture. The cure: the same scene shot as $N$ frames focused front to back, with the thin in-focus slab sweeping through depth across the sequence, so every surface is sharp in some frame. An inset contrasts the two ways to sweep focus: refocus the lens (camera fixed, focus distance changed) versus translate the camera on a rail along the optical axis. The merge keeps each pixel from the frame where it is sharpest.
fig-focalstack-sharpness
Figure 10.9.2. Building the sharpness map. Left to right for one input frame: (1) the frame; (2) its band-pass (signed, shown red for positive / green for negative), which swings through zero; (3) the band-pass squared, now all-positive; (4) the squared band-pass Gaussian-smoothed = the per-pixel sharpness map $s_k$, bright where the frame carries local detail. This is the pipeline behind $s_k(p)=\sum_{w}\lVert\nabla I_k\rVert^2$.
fig-focalstack-argmax-composite
Figure 10.9.3. Selection and softening. Side by side: the per-pixel argmax selection map (a random color per source frame, so it reads as a coarse depth map / which-frame-won image) and the resulting all-in-focus composite. A second pair contrasts exponent-1 weighting (smoother but softer, blurry frames leak in) against exponent-4 weighting $w_k=s_k^4$ (near-argmax, the sharp winner dominates), showing the latter is subtly but visibly crisper.
fig-focalstack-why-square
Figure 10.9.4. Why square before averaging. A 1-D edge: the band-pass response swings $+\to 0\to -$ across the edge, so its raw local average is about zero — smoothing it directly reports "no detail." Squaring makes the response all-positive, so the local average now tracks "there is detail here." The same argument justifies using squared gradient / Laplacian power as the sharpness measure.

10.9.3 The more advanced method: Interactive Digital Photomontage (graph-cut + Poisson)

The weighted blend has a built-in flaw, and it lives exactly where the optics warned us it would — at depth edges. Because the blend mixes frames wherever sharpness is close, near a depth boundary it averages a sharp foreground against a blurred background (and vice versa), leaving residual blur, ghosting, and visible transitions. Occlusion means the "right" answer flips abruptly across that boundary, but a smooth weighted sum cannot flip abruptly. What we actually want is a clean per-pixel choice that is spatially coherent, joined by an invisible blend only at the boundaries where the choice changes.

This is precisely the cut-then-reconstruct problem this part has already solved twice, and the focal-stack merge is just one application of the general tool: Interactive Digital Photomontage Agarwala et al. 2004. It is a two-stage pipeline, and both stages are chapters you have already read.

Box — Interactive Digital Photomontage (the cut-then-reconstruct compose)

Agarwala, Dontcheva, Agrawala, Drucker, Colburn, Curless, Salesin and Cohen (2004) frame compositing a stack of registered images as: (1) choose a source label per pixel by minimizing a discrete labeling energy with a graph cut (L12, Seam optimization), and (2) hide the residual mismatch across the chosen seams with a gradient-domain (Poisson) blend (L9, Poisson image editing). The data term encodes what you want — here, "the sharpest frame" — and the smoothness term encodes where a switch is cheap. All-in-focus extended depth of field is one of the paper's showcase applications; the same machinery does best-exposure HDR-like selection, group-photo "best face" merging, and clean-plate background extraction. This is the L14 decide-later merge given its principled form: the energy is the design; the optimizer is off the shelf.

Stage one — graph-cut label selection (L12, Seam optimization). Assign each pixel $p$ a source-frame label $\ell_p$ — which frame it copies from — by minimizing a discrete energy over the whole label field,

$$ E(\ell) = \sum_p D_p(\ell_p) \;+\; \sum_{p \sim q} V_{pq}(\ell_p, \ell_q) . $$

The data term $D_p$ prefers the sharpest frame at $p$ — for instance $D_p(\ell) = -\,s_\ell(p)$, so high sharpness is low cost. The smoothness term $V_{pq}$ is a seam cost: it penalizes changing the label between neighbors $p$ and $q$ in proportion to how much the two frames disagree there, so the cut prefers to switch along matching, low-contrast boundaries where the transition will be invisible. Minimizing $E$ globally is the min-cut / max-flow problem from the seam chapter — the energy is the design, the optimizer is standard.

Stage two — gradient-domain (Poisson) blend (L9, Poisson image editing). Even the best label map leaves small exposure, color, and level mismatches across each seam. So instead of copying each region's pixels, copy its gradients, and reconstruct the image whose gradients best match by solving the Poisson equation $\nabla^2 f = \operatorname{div}\mathbf{v}$. The absolute-level jumps at the seams vanish; only the (correct) gradients survive — a seamless all-in-focus composite (Figure 10.9.5).

The point worth carrying away is that this is the same two lessons, twice over: cut where it is cheap (L12 graph cut), then reconstruct in the gradient domain (L9 Poisson). It is the identical compose used for panorama blending and object compositing earlier in this part; focal-stack all-in-focus is just that pipeline with "sharpest frame" plugged in as the data term. In practice real macro montages stack tens of frames — fifty-five is not unusual — and the graph cut handles many labels without complaint. Commercial tools (Helicon Focus, Zerene Stacker, and Photoshop's Auto-Blend Layers) implement essentially this recipe.

fig-focalstack-photomontage
Figure 10.9.5. Simple blend versus Photomontage. The same focal stack merged two ways. Weighted-sum blend: at depth boundaries the mix of a sharp and a blurred frame leaves ghosting and residual blur — the non-convolution / occlusion problem made visible. Graph-cut + Poisson (Agarwala et al.): a coherent label map cuts along low-contrast boundaries and a gradient-domain blend hides the residual mismatch, giving a clean, seamless all-in-focus result. The contrast is sharpest exactly at depth edges.

10.9.4 Capturing the stack — hardware, and the magnification trap

There are two ways to sweep focus across the stack, and they differ in their geometry, which matters for the merge.

Now the trap, the thing the slides note gets "swept under the rug." Refocusing changes magnification. Focusing closer extends the lens away from the sensor, and as the lens extends the image scale grows — the same scene fills a slightly larger fraction of the frame, so the field of view narrows. The frame "breathes" in or out as the focus runs through the stack, which means the frames in a focus sweep are not pixel-aligned: a frame focused at infinity and a frame focused up close show the same scene at slightly different scales.

Refocusing rescales the frame — register before you merge

Because refocusing changes magnification, the frames of a focus sweep are at slightly different scales. If you compute sharpness and select across mis-registered frames, the all-in-focus merge stitches detail that does not line up and you get double edges — a ghosted, fattened look along every contour. You must estimate a per-frame scale (often a near-pure zoom about the optical axis) and register the frames before measuring sharpness and selecting. Translating on a rail has a different but related geometry — a perspective / parallax shift, and the slides note it can even change the viewpoint slightly — so it may need a fuller alignment, not just a scale. This is the same align-then-merge discipline that HDR and panorama insisted on earlier in this part: register first, combine second (Figure 10.9.6).

Two pointers close the loop. Single-exposure alternatives exist: some macro rigs scan focus during one exposure, integrating the stack optically in a single shot. And there is a dual escape that this chapter keeps gesturing at — the light-field (plenoptic) camera Ng 2005 records all the aperture rays in one shot and lets you refocus after capture computationally. A captured light field can be refocused into a whole focal stack and then merged with exactly the Photomontage pipeline above, which is why the two ideas are duals: the focal stack defers focus by capturing frames, the light field by capturing rays. Both are L14 on the focus axis, and both are taken up in Advanced computational photography. (Time- and light-efficient capture of focal stacks specifically is analyzed by Hasinoff & Kutulakos 2009.)

fig-focalstack-magnification
Figure 10.9.6. The magnification trap. The same scene focused at infinity versus focused close: the field of view — and hence the magnification — differs, so the two frames are at slightly different scales and do not pixel-align. The caption shows that merging without correcting this produces double edges, and that estimating a per-frame scale (a near-pure zoom about the optical axis) and registering the frames first fixes it.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (L14 · capture the full set, decide later)

Instead of fixing the focus distance at the instant of exposure, record a whole stack of focal planes and choose per pixel afterward. The slogan: focus stacking is to the focus distance what HDR bracketing is to exposure and what the light-field camera is to the aperture — defer an irreversible capture decision out of the moment of exposure, paying with more data and a harder reconstruction (a per-pixel merge) for an all-in-focus image no single shot could make. This is the same move as HDR (the exposure axis) and panorama (the viewpoint axis); here the axis is focus. (Registered as L14; it first appears in this part's introduction, and the light-field / plenoptic camera in Advanced computational photography is the same idea for the focus axis — the elegant special case. Here it recurs on the focus axis.)