💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

10.2 Image alignment

Every method in this part — averaging just now, HDR and panorama and focal stacks to come — assumes that corresponding scene points already sit on the same pixel across frames. They don't: hands tremble, the Earth rotates, even a "locked" tripod drifts. Alignment is the unglamorous precondition of all of it, and it is unforgiving — averaging two frames that disagree by a pixel doesn't clean the image, it blurs it into a double exposure. This chapter is the shared registration toolbox, and because the same machinery — find the motion between frames — is exactly what video stabilization and optical flow need, it hands off naturally to the video part (Video). Later chapters specialize this toolbox to their own needs, so do not be surprised when they introduce their own aligners: HDR brackets register with a brightness-robust median-threshold bitmap (MTB) (HDR merging) — the brightness-changing case that is exactly why NCC, not SSD, is the right score here; bursts use a tiled, sub-pixel search (Application to cell phones: HDR+ and burst imaging); and focal stacks add scale and registration handling for the slight magnification change as focus shifts (Focal stacks and depth of field extension).

10.2.1 Why you must align first

Take a hand-held burst and average it without aligning, and you do not get the clean image the previous chapter promised — you get a blurry double. The $1/\sqrt N$ noise benefit only materializes once the frames agree pixel-for-pixel; alignment is what earns it. The averaging law is not wrong, it simply has a precondition, and registration is that precondition (Figure 10.2.1).

Misalignment fails in two distinct regimes, and it is worth keeping them apart because they cost you different things. A sub-pixel misalignment loses fine detail: the average smears across the small offset, acting as a soft low-pass filter — you keep one image, just blurrier. A whole-pixel-or-more misalignment ghosts: edges appear twice, the telltale signature of an unregistered stack. In deep-sky astrophotography the driver is unavoidable and continuous — the sky rotates through the exposure — so without per-sub alignment the stars trail and the stack turns to mush rather than sharpening.

The clean way to think about all of this is to recognize alignment as itself an inverse problem / optimization: find the transform $T$ that makes $I_2(T\cdot)$ match $I_1$,

$$ \hat T = \arg\min_T \sum_x \big( I_1(x) - I_2(Tx) \big)^2 . $$

Read back in words: find the transform that, applied to the second frame, makes it agree with the first as closely as possible, summed over the image. Everything in the rest of the chapter is two choices inside this one statement: which family $T$ lives in (a translation, a homography, a per-pixel flow field) and how you search that family. We start with the smallest family — pure translation — because it is honest, it is a problem set, and it already exposes every idea that the richer families reuse.

10.2.2 Brute-force translational alignment (SSD / NCC)

For a tight hand-held burst, frame-to-frame motion is well approximated by a single 2-D shift $(dx, dy)$: the whole image slides as one. The simplest possible aligner takes this literally — try every shift within a window $\pm$ maxOffset and keep the one that scores best. This is genuinely a problem-set algorithm: a double loop over candidate shifts, no cleverness required.

What makes it work is the score. The natural one is the sum of squared differences (SSD) between the reference frame and the shifted candidate,

$$ \mathrm{SSD}(dx, dy) = \sum_{x,y} \big( I_1(x,y) - I_2(x+dx,\, y+dy) \big)^2 , $$

which is small exactly when the two frames agree. The aligner is then just a double loop over the shift window, scoring each candidate and keeping the best:

best = ∞
for dx in -maxOffset .. maxOffset:
    for dy in -maxOffset .. maxOffset:
        s = ∑_{x,y} (I1[x, y] - I2[x+dx, y+dy])²
        if s < best:
            best = s
            (dx*, dy*) = (dx, dy)
return (dx*, dy*)

The key property — the one the whole method stands on — is that if you plot $\mathrm{SSD}$ over the shift window you get a basin with a single clear minimum sitting at the true shift (Figure 10.2.2). The aligner is just an argmin over that surface. The minimum is at the true shift because that is where every edge in $I_2$ lands back on its partner in $I_1$ and the squared differences collapse toward (noise-floor) zero; any other shift drags strong edges across mismatched content and the score climbs.

SSD comes with an assumption baked in, and it is important to name it: brightness constancy — the same scene point has the same pixel value in both frames, i.e. the exposure and illumination did not change. For a same-exposure burst that holds. But across an HDR bracket, or under a brightness drift, it is plainly false: a pixel that is twice as bright in the longer exposure inflates the squared difference everywhere, and SSD reports a large score even at the correct alignment. The fix is normalized cross-correlation (NCC), which is invariant to a per-frame gain and offset. Subtract each patch's mean, divide by its standard deviation, and correlate:

$$ \mathrm{NCC} = \frac{\sum (I_1 - \bar I_1)(I_2 - \bar I_2)}{\sqrt{\sum (I_1 - \bar I_1)^2}\,\sqrt{\sum (I_2 - \bar I_2)^2}} , $$

which is maximized (toward $+1$) at the true alignment regardless of an overall brightness scaling or shift. The rule of thumb: use SSD for same-exposure bursts; reach for NCC (or align in a normalized or gradient domain) the moment brightness changes between frames — which is precisely the HDR case the next chapter depends on.

The cleanest historical illustration of exactly this aligner is Prokudin-Gorsky's color photography (circa 1910): the same scene shot through blue, green, and red filters onto three separate plates, recombined a century later as one image's B, G, R channels. The camera shifted slightly between the three exposures, so stacking the plates as-is leaves colored fringes at every edge; registering each channel to the others by a small shift search — NCC, since the filtered plates differ in brightness — collapses the fringes into a clean color image (Figure 10.2.5). It is the translational aligner of this section applied to three channels instead of two frames.

There is one debugging idea here worth more than any amount of staring at real data, and it is the cheapest, most reliable unit test an aligner has. Align a synthetic image to a copy of itself shifted by a known amount. Take a known image, shift it by, say, $(+2, 0)$, run your search, and confirm the score map's minimum sits exactly at $(+2, 0)$. It is hand-checkable: you know the answer in advance, by construction. If the minimum lands anywhere else, your indexing, your sign convention, or your interpolation is wrong — and you will never catch that bug on noisy real frames, where you cannot tell a one-pixel error from the truth. Build this test before you trust the aligner on anything.

Two extensions point forward. First, the integer-shift search gets you close; for true sub-pixel accuracy you fit a parabola to the SSD basin around its minimum (or upsample the surface) and read off the analytic vertex — and shifting by a fractional pixel means interpolation (bilinear or bicubic, per Resampling). That sub-pixel precision is the whole game in burst super-resolution (Super-resolution and image priors), where the tiny inter-frame offsets are not noise to be removed but signal to be exploited. Second, brute force over a window is fine for small motions and one transform; it becomes the wrong tool the moment the search window grows, which is what the pyramid below fixes.

10.2.3 Phase correlation: alignment in Fourier (L5)

Brute force searches the shift one candidate at a time. There is a way to recover the entire shift in a single shot, and it comes straight from the Fourier transform. A pure translation does something remarkably clean in the frequency domain: it leaves every magnitude untouched and tilts only the phases. This is the shift theorem,

$$ I_2(x) = I_1(x - d) \quad\Longleftrightarrow\quad \hat I_2(\omega) = \hat I_1(\omega)\, e^{-i\,\omega\cdot d} , $$

read as: shifting an image by $d$ multiplies its spectrum by the phase ramp $e^{-i\omega\cdot d}$ and changes nothing else. So the whole translation lives in the phase, encoded as a linear ramp whose slope is the shift.

💡 Big lesson (L5)

Diagonalize when you can. Convolution and correlation are diagonal in the Fourier basis — the complex exponentials are their eigenvectors — so an operation that costs an $O(\text{window}^2)$ search in the spatial domain can collapse to a per-frequency multiply and a single transform pair. Here that is the difference between scanning every candidate shift and reading the answer off one inverse fast Fourier transform (FFT). Whenever the operator you face is a shift, a convolution, or a correlation, ask first whether moving to the frequency domain turns the work into a diagonal (per-frequency) operation. (Registered as L5, first placed in Linearity, Fourier, aliasing and deblurring; here it turns a brute-force search into a single transform.)

The algorithm that exploits this is phase correlation (Kuglin & Hines 1975). Form the normalized cross-power spectrum of the two frames,

$$ R(\omega) = \frac{\hat I_1(\omega)\, \overline{\hat I_2(\omega)}}{\big| \hat I_1(\omega)\, \overline{\hat I_2(\omega)} \big|} = e^{i\,\omega\cdot d} , $$

where the overline is the complex conjugate. The cross-power spectrum $\hat I_1 \overline{\hat I_2}$ already carries the phase difference between the frames; normalizing to unit magnitude throws away the magnitudes entirely and keeps only that phase, which is exactly the ramp $e^{i\omega\cdot d}$. Two things follow. First, the inverse transform of a pure phase ramp is a single sharp delta located at the shift $d$ — so $\mathcal F^{-1}\{R\}$ is a spike whose peak position you simply read off (Figure 10.2.3). Second, because we discarded magnitudes, the result is robust to brightness changes and to whatever dominant low-frequency content the scene happens to have. It is global, it costs $O(n \log n)$ for an FFT pair rather than a quadratic search, and the sharpness and height of the delta double as a confidence measure — a crisp tall spike means a confident, unambiguous shift.

Phase correlation also extends gracefully. Mapping the spectra into log-polar coordinates turns a rotation and a scaling into shifts along the new axes, so the same phase-correlation machinery recovers rotation and scale as well as translation (Reddy & Chatterji 1996). But be honest about the one assumption it cannot escape: it presumes a single global shift over the entire frame. It cannot, by construction, handle parallax, independently moving objects, or a per-region warp. The moment the motion is richer than one rigid slide, you need a richer transform — a homography for camera rotation (Matching pixels across space and time and the panorama chapters) — or genuinely per-pixel motion, which is optical flow (Matching pixels across space and time).

10.2.4 Coarse-to-fine alignment on a pyramid

Brute-force search has two failure modes, and a single idea cures both. The first is cost: a large motion forces a large search window, and the work grows with its area. The second is more insidious — a raw SSD surface is not always a clean basin. Repetitive texture and aliasing carve local minima into it, false wells that trap a naive search far from the true shift.

The cure is multiscale — build an image pyramid (Linear pyramids and wavelets) and align coarse-to-fine. At the coarsest level the image is tiny and heavily blurred, and two things become true at once. A small shift at that level corresponds to a large real motion, so a cheap search over a few coarse pixels covers a big displacement; and because the coarse level retains only low frequencies, its SSD surface is wide, smooth, and unimodal — the local minima are simply not there yet. So you get a cheap, robust, rough estimate at the top. Then you propagate it down: carry the estimate to the next finer level, refine it with only a tiny local search, and repeat. Each finer level adds the high-frequency detail that sharpens the estimate — the very detail that, searched cold, would have manufactured false minima (Figure 10.2.4).

The frequency view makes the logic exact, and it ties back to two lessons you already have. The coarse level keeps only low frequencies, whose SSD basin is wide and single-welled; finer levels reintroduce the high frequencies that pin down sub-pixel position but would otherwise riddle the surface with traps — so you commit to the low frequencies first and let the high frequencies only refine, never decide. And the prefiltering you do on the way down the pyramid — blur, then decimate — is exactly the L16 discipline from sampling: you must low-pass before you subsample or aliasing corrupts the coarse levels you are trusting. The pyramid is not an optimization trick bolted on; it is the same multiresolution structure of Linear pyramids and wavelets put to work as a search strategy.

This is not a toy. Google's HDR+ burst pipeline (Hasinoff et al. 2016) aligns its frames exactly this way — a tiled, coarse-to-fine search per image tile, followed by a robust merge — so that one pass handles both global hand motion and local scene motion across the burst, in raw, on a phone. The same coarse-to-fine alignment is what burst super-resolution builds on (Wronski et al. 2019). The canonical survey of this whole landscape — translational and parametric models, hierarchical search, Fourier methods — is Szeliski's Image Alignment and Stitching tutorial (Szeliski 2006); the textbook treatment is Szeliski's Computer Vision.

And this is precisely the hand-off to video. Keep the coarse-to-fine matching but replace "one transform per frame" with "one motion vector per pixel" and the machinery becomes optical flow; constrain $T$ to a homography and it becomes stabilization or panorama registration. There is also a differential alternative to all this searching — instead of trying candidate shifts, linearize the image and solve for the motion that cancels the brightness difference, the gradient-based method of Lucas & Kanade 1981 — which is the doorway to the flow formulation. Either way, the alignment toolbox in this chapter is the entry point to the dense, per-pixel correspondence machinery of Matching pixels across space and timethis chapter links forward there.

fig-align-before-average
Figure 10.2.1. Alignment is a precondition of the averaging benefit. The same hand-held burst is combined two ways. Without alignment: averaging the raw frames produces a blurry double image — the frames disagree by a pixel or more, so the mean smears every edge; no noise benefit is visible, only ghosting. With alignment: registering the frames first, then averaging, yields a sharp, clean result in which the $1/\sqrt N$ noise reduction finally shows. Alignment does not improve the average — it is what makes the average worth taking.
fig-ssd-search-surface
Figure 10.2.2. The SSD score as a search surface. For a translational aligner, the sum of squared differences $\mathrm{SSD}(dx,dy)$ is plotted as a 2-D surface (or heat map) over the window of candidate shifts $(dx,dy)$. The surface is a clear basin with a single minimum, and that minimum sits at the true shift — the point where the two frames' edges land back on top of each other. The aligner is simply an argmin over this surface.
fig-phase-correlation-spike
Figure 10.2.3. Phase correlation, big lesson L5 in one picture. Two shifted frames are transformed to Fourier; their normalized cross-power spectrum $R(\omega)=\hat I_1\overline{\hat I_2}/|\hat I_1\overline{\hat I_2}| = e^{i\omega\cdot d}$ keeps only the phase, which is a pure ramp. The inverse FFT of that ramp is a single sharp delta, and the location of the spike is the shift $d$ — read directly, with no search. The height and sharpness of the spike serve as a confidence measure.
fig-coarse-to-fine-pyramid
Figure 10.2.4. Coarse-to-fine alignment on a pyramid. Both frames are built into image pyramids. At the coarsest (tiny, blurred) level a small, cheap search recovers a rough shift over a wide, single-welled SSD basin — a small shift there spans a large real motion, and the low-frequency surface has no local minima to trap the search. The estimate is then propagated down the pyramid, refined level by level with only a tiny local search, the high-frequency detail at each step sharpening the shift without manufacturing false minima.
fig-channel-align-sergey
Figure 10.2.5. Translational alignment in the wild — Prokudin-Gorsky's color-separation plates. Left: the three glass-plate negatives, the same scene shot through blue, green, and red filters. Top right: stacking them as the B, G, R channels with no registration — the plates shifted between exposures, so every edge carries a colored fringe. Bottom right: shifting each channel to its best match (an SSD/NCC argmin over a small window) collapses the fringes into a clean color image. The same brute-force shift search as the burst aligner, applied across color channels. (Original photographs public domain, U.S. Library of Congress.)

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (L5)

Diagonalize when you can. Convolution and correlation are diagonal in the Fourier basis — the complex exponentials are their eigenvectors — so an operation that costs an $O(\text{window}^2)$ search in the spatial domain can collapse to a per-frequency multiply and a single transform pair. Here that is the difference between scanning every candidate shift and reading the answer off one inverse fast Fourier transform (FFT). Whenever the operator you face is a shift, a convolution, or a correlation, ask first whether moving to the frequency domain turns the work into a diagonal (per-frequency) operation. (Registered as L5, first placed in Linearity, Fourier, aliasing and deblurring; here it turns a brute-force search into a single transform.)