10.10 Focal stacks and depth of field extension⧉
A single photograph can only be sharp over a limited slab of depth. The optics chapter made this precise: only scene points whose blur disc stays inside an acceptable circle of confusion count as in focus, and that acceptable band — the depth of field — is set by the aperture, the focal length, and the focusing distance. Stop the lens down and the band widens, but you pay twice for it: a smaller aperture lets in less light (more noise) and eventually runs into the diffraction limit, at which point sharpness falls again. So there is a floor. In macro photography, microscopy, product and jewelry shots, or a landscape that wants both a near rock and a far peak crisp, even the smallest usable aperture cannot hold the whole scene in focus at once. These are exactly the cases the optics chapter flagged as unwinnable in a single exposure.
The escape is not to push the optics harder but to change axis. Instead of committing to one focus distance at the instant of exposure, capture a stack of frames focused at different distances, and merge them, keeping each output pixel from the frame in which that scene point is sharpest. As the focus sweeps front to back, the thin in-focus slab marches through depth, so every surface in the scene is sharp in some frame. Stitch the sharp pieces together and you get an all-in-focus (extended depth-of-field) image that no single exposure could have taken.
That is the whole chapter, and it is one idea you have already met three times. The merge is a per-pixel selection — measure local sharpness, take the argmax across the stack — and the good merge is the same graph-cut label choice plus gradient-domain blend that hid the seams in panorama stitching and object compositing. We will build the simple selection algorithm first, upgrade it to the principled Photomontage version, and close on the capture hardware and the one trap that quietly sabotages a naive stack: refocusing the lens changes the magnification. One flag to plant up front: every merge below assumes the frames are already registered (pixel-aligned); the hardware section at the end explains why that assumption is non-trivial — refocusing subtly rescales the frame — and how to restore it.
Instead of fixing the focus distance at the instant of exposure, record a whole stack of focal planes and choose per pixel afterward. The slogan: focus stacking is to the focus distance what HDR bracketing is to exposure and what the light-field camera is to the aperture — defer an irreversible capture decision out of the moment of exposure, paying with more data and a harder reconstruction (a per-pixel merge) for an all-in-focus image no single shot could make. This is the same move as HDR (the exposure axis) and panorama (the viewpoint axis); here the axis is focus. (Registered as L14; it first appears in this part's introduction, and the light-field / plenoptic camera in Advanced computational photography is the same idea for the focus axis — the elegant special case. Here it recurs on the focus axis.)
10.10.1 Why limited depth of field is the problem⧉
It is worth being precise about what we are fighting, because the precise statement is also what tells us why a naive merge struggles. From Optics, lenses, and aberration correction: only objects within the depth of field image inside an acceptable circle of confusion; everything nearer or farther blurs. The band grows as you stop down — a bigger $f$-number $N$ means more depth in focus — but a small aperture means less light and, past a point, diffraction that softens the whole image again. There is no aperture at which a deep macro scene comes out fully sharp. Where this bites hardest is macro (a bug or a flower at close range), microscopy, product and jewelry work, and landscape that wants a near foreground and a far horizon both crisp.
One caveat from optics matters for everything that follows: defocus is not a single convolution of the image. The blur radius depends on scene depth, not on output-pixel location, and at depth edges occlusion intervenes — a sharp foreground edge sits in front of a blurred background, and which one a pixel belongs to flips abruptly. This is why a naive blend struggles precisely at depth boundaries, and why the good merge needs coherent seams rather than a smooth cross-fade. Keep the depth edge in mind; it is where every method earns or loses its keep.
10.10.2 A simple algorithm: sharpness = local high-frequency energy, then argmax⧉
The core idea is a one-line answer to "which frame is sharpest here?" A region is in focus exactly when it carries local high-frequency detail — crisp edges and fine texture. Defocus is a low-pass blur; it destroys high frequencies. So measure, per pixel and per frame, how much high-frequency content sits in a small window, and at each output pixel pick the frame that has the most.
Concretely, take the band-pass / high-frequency component of frame $I_k$ — a Laplacian, a high-pass filter, or one band of a pyramid — and accumulate its energy over a small window $w(p)$ around the pixel:
Read this back: the sharpness of frame $k$ at pixel $p$ is the local high-frequency power — squared gradient (or squared Laplacian) summed over a neighborhood. The window matters: a single pixel's gradient is too noisy a vote, so we pool a small patch's worth of high-frequency energy into one sharpness reading (Figure 10.10.1).
There is one subtle step here that trips people up. You must square (or take the absolute value) before you average. The raw high-frequency signal swings positive and negative: crossing an edge it runs $+ \to 0 \to -$. Its local average is therefore near zero by construction — the low frequencies, which carry any net offset, were exactly what the high-pass removed. Smooth the raw band-pass and you get nothing. Square it first and every value is positive, so the local (Gaussian) average now correctly reads "there is detail here." This is the same reason variance, not the mean of a zero-mean signal, measures activity (Figure 10.10.2).
With a per-frame sharpness map in hand, the all-in-focus image is the per-pixel argmax — at each pixel, copy from whichever frame was sharpest:
A nice aside: the selection map $k^*(p)$ — "which focal plane was sharpest here" — is itself a coarse depth map, since each frame corresponds to a known focus distance. Recovering depth this way is shape-from-focus Nayar & Nakagawa 1994, and it ties the focal stack to depth estimation: the same sharpness-argmax that builds the all-in-focus image also sketches the scene's geometry.
A hard argmax, though, is noisy and seam-ridden — at a pixel where two frames are nearly tied, the winner flips on a coin, and neighboring pixels can end up sourced from wildly different frames (salt-and-pepper switching). Two softening fixes from the slides, both about weighting the best frame more heavily while staying spatially coherent:
- Weighted sum, then normalize. Blend all frames with weights proportional to sharpness, $I(p) = \dfrac{\sum_k w_k(p)\, I_k(p)}{\sum_k w_k(p)}$ with $w_k(p) = s_k(p)$. This is smooth, but a blurry frame still leaks in and softens the result.
- Sharpen the weights with an exponent. Use $w_k(p) = s_k(p)^{\gamma}$ with $\gamma \approx 4$. The exponent pushes the blend toward the argmax — the sharpest frame dominates — while keeping a smooth transition rather than a hard switch. Subtle, but visibly sharper than $\gamma = 1$.
Beyond reweighting, it helps to spatially regularize the choice itself: smooth the sharpness maps $s_k$, or smooth the label field, so neighboring pixels tend to draw from the same frame. This "spatial coherence in the choice of pixels" is the right instinct, and it is exactly what the advanced method makes principled instead of ad hoc (Figure 10.10.3).
10.10.3 The more advanced method: Interactive Digital Photomontage (graph-cut + Poisson)⧉
The weighted blend has a built-in flaw, and it lives exactly where the optics warned us it would — at depth edges. Because the blend mixes frames wherever sharpness is close, near a depth boundary it averages a sharp foreground against a blurred background (and vice versa), leaving residual blur, ghosting, and visible transitions. Occlusion means the "right" answer flips abruptly across that boundary, but a smooth weighted sum cannot flip abruptly. What we actually want is a clean per-pixel choice that is spatially coherent, joined by an invisible blend only at the boundaries where the choice changes.
This is precisely the cut-then-reconstruct problem this part has already solved twice, and the focal-stack merge is just one application of the general tool: Interactive Digital Photomontage Agarwala et al. 2004. It is a two-stage pipeline, and both stages are chapters you have already read.
Agarwala, Dontcheva, Agrawala, Drucker, Colburn, Curless, Salesin and Cohen (2004) frame compositing a stack of registered images as: (1) choose a source label per pixel by minimizing a discrete labeling energy with a graph cut (L12, Seam optimization), and (2) hide the residual mismatch across the chosen seams with a gradient-domain (Poisson) blend (L9, Poisson image editing). The data term encodes what you want — here, "the sharpest frame" — and the smoothness term encodes where a switch is cheap. All-in-focus extended depth of field is one of the paper's showcase applications; the same machinery does best-exposure HDR-like selection, group-photo "best face" merging, and clean-plate background extraction. This is the L14 decide-later merge given its principled form: the energy is the design; the optimizer is off the shelf.
Stage one — graph-cut label selection (L12, Seam optimization). Assign each pixel $p$ a source-frame label $\ell_p$ — which frame it copies from — by minimizing a discrete energy over the whole label field,
The data term $D_p$ prefers the sharpest frame at $p$ — for instance $D_p(\ell) = -\,s_\ell(p)$, so high sharpness is low cost. The smoothness term $V_{pq}$ is a seam cost: it penalizes changing the label between neighbors $p$ and $q$ in proportion to how much the two frames disagree there, so the cut prefers to switch along matching, low-contrast boundaries where the transition will be invisible. Minimizing $E$ globally is the min-cut / max-flow problem from the seam chapter — the energy is the design, the optimizer is standard.
Stage two — gradient-domain (Poisson) blend (L9, Poisson image editing). Even the best label map leaves small exposure, color, and level mismatches across each seam. So instead of copying each region's pixels, copy its gradients, and reconstruct the image whose gradients best match by solving the Poisson equation $\nabla^2 f = \operatorname{div}\mathbf{v}$. The absolute-level jumps at the seams vanish; only the (correct) gradients survive — a seamless all-in-focus composite (Figure 10.10.5).
The point worth carrying away is that this is the same two lessons, twice over: cut where it is cheap (L12 graph cut), then reconstruct in the gradient domain (L9 Poisson). It is the identical compose used for panorama blending and object compositing earlier in this part; focal-stack all-in-focus is just that pipeline with "sharpest frame" plugged in as the data term. In practice real macro montages stack tens of frames — fifty-five is not unusual — and the graph cut handles many labels without complaint. Commercial tools (Helicon Focus, Zerene Stacker, and Photoshop's Auto-Blend Layers) implement essentially this recipe.
10.10.4 Capturing the stack — hardware, and the magnification trap⧉
There are two ways to sweep focus across the stack, and they differ in their geometry, which matters for the merge.
- Refocus the lens (camera fixed, focus distance changed). This is the natural way — what an autofocus motor or a tunable / liquid lens does. You hold the camera still and step the focus from near to far.
- Translate the camera along the optical axis on a focus rail (for example the CogniSys StackShot). This is common in macro, where nudging the whole camera a few millimeters between frames is easier and far more repeatable than refocusing a macro lens.
Now the trap, the thing the slides note gets "swept under the rug." Refocusing changes magnification. Focusing closer extends the lens away from the sensor, and as the lens extends the image scale grows — the same scene fills a slightly larger fraction of the frame, so the field of view narrows. The frame "breathes" in or out as the focus runs through the stack, which means the frames in a focus sweep are not pixel-aligned: a frame focused at infinity and a frame focused up close show the same scene at slightly different scales.
Because refocusing changes magnification, the frames of a focus sweep are at slightly different scales. If you compute sharpness and select across mis-registered frames, the all-in-focus merge stitches detail that does not line up and you get double edges — a ghosted, fattened look along every contour. You must estimate a per-frame scale (often a near-pure zoom about the optical axis) and register the frames before measuring sharpness and selecting. Translating on a rail has a different but related geometry — a perspective / parallax shift, and the slides note it can even change the viewpoint slightly — so it may need a fuller alignment, not just a scale. This is the same align-then-merge discipline that HDR and panorama insisted on earlier in this part: register first, combine second (Figure 10.10.6).
Two pointers close the loop. Single-exposure alternatives exist: some macro rigs scan focus during one exposure, integrating the stack optically in a single shot. And there is a dual escape that this chapter keeps gesturing at — the light-field (plenoptic) camera Ng 2005 records all the aperture rays in one shot and lets you refocus after capture computationally. A captured light field can be refocused into a whole focal stack and then merged with exactly the Photomontage pipeline above, which is why the two ideas are duals: the focal stack defers focus by capturing frames, the light field by capturing rays. Both are L14 on the focus axis, and both are taken up in Advanced computational photography. (Time- and light-efficient capture of focal stacks specifically is analyzed by Hasinoff & Kutulakos 2009.)
Big lessons of this chapter
The recurring principles from this chapter, gathered for review.
Instead of fixing the focus distance at the instant of exposure, record a whole stack of focal planes and choose per pixel afterward. The slogan: focus stacking is to the focus distance what HDR bracketing is to exposure and what the light-field camera is to the aperture — defer an irreversible capture decision out of the moment of exposure, paying with more data and a harder reconstruction (a per-pixel merge) for an all-in-focus image no single shot could make. This is the same move as HDR (the exposure axis) and panorama (the viewpoint axis); here the axis is focus. (Registered as L14; it first appears in this part's introduction, and the light-field / plenoptic camera in Advanced computational photography is the same idea for the focus axis — the elegant special case. Here it recurs on the focus axis.)