7.1 Brute force⧉
Every multi-frame method you will meet in this book — averaging a burst to denoise, merging an exposure bracket into a high-dynamic-range image, stitching a panorama, building a focal stack, super-resolving from a sequence — silently assumes that corresponding scene points already sit on the same pixel in every frame. They do not. Hand tremor jitters a hand-held burst; the sky rotates through a long-exposure astrophotograph; a tripod drifts; the subject breathes. Combining frames that disagree by even a pixel does not clean the image — it blurs it. So before any of those payoffs is available, the frames must be registered, and registration is the subject of this chapter.
We will pose it as a single optimization and then spend the rest of the chapter on how to solve it. The cleanest version takes the smallest possible transform — a pure 2-D translation — because it is honest, it is genuinely a problem-set algorithm, and it exposes every idea the richer families (a homography for a panorama, a dense field for flow) will reuse.
A pure translation is diagonal in the Fourier basis: it leaves every frequency magnitude untouched and tilts only the phases, by a ramp whose slope is the shift. So a brute-force $O(\text{window}^2)$ shift search collapses to a single FFT pair — form the normalized cross-power spectrum, inverse-transform, and read the shift off the location of the resulting spike. Whenever the operator you face is a shift, a convolution, or a correlation, ask first whether the frequency domain turns the work into a per-frequency (diagonal) operation. (L5 is first placed in Linearity, Fourier, aliasing and deblurring; here it turns a search into a transform.)
7.1.1 Why you must align first⧉
Recall the law that makes burst photography worth doing: average $N$ independent frames of the same scene and the noise falls by $1/\sqrt N$ while the signal stays put. That law has a precondition that is easy to forget — it only holds if the frames are registered, so that the same scene point contributes to the same output pixel in every frame. Average a hand-held burst without aligning it and you do not get the clean $1/\sqrt N$ image; you get a blurry double (Figure 7.1.1). Alignment is not a refinement you bolt on for extra polish — it is the thing that earns the noise benefit in the first place.
It helps to name the two failure regimes, because they look different and cost differently. A sub-pixel misalignment blurs: the mean low-passes across the small offset, and you get one image, softer. A whole-pixel-or-more misalignment ghosts: every edge appears twice, the unmistakable signature of an unregistered stack. Deep-sky astrophotography is the extreme case — over a long session the sky rotates through the field, so without per-sub-exposure alignment the stars trail and the whole stack mushes into smear rather than resolving faint detail.
Formally, alignment is an inverse problem: find the transform $T$ that makes the second frame, re-sampled through $T$, agree with the first,
Everything in this chapter is two choices made inside this objective: which family $T$ is drawn from (translation here; a homography for camera rotation and panoramas; a per-pixel field for flow, next), and how you search that family. This is exactly the correspondence-first half of the part's spine (L17): recover the map $T$, and the downstream transport — the averaging, the merging, the stitching — follows for free.
7.1.2 Brute-force translational alignment (SSD / NCC)⧉
For a tight burst the whole image slides as one rigid 2-D shift $(dx,dy)$, and the simplest aligner imaginable just tries every shift in a search window and keeps the best-scoring one. A double loop over $(dx,dy)$, no cleverness at all — and that bluntness is a virtue when you are first learning the problem, because there is nowhere for a subtle bug to hide. Everything then rides on the score: the function that says how well two frames agree at a given candidate shift (Figure 7.1.2).
The default score is the sum of squared differences (SSD),
which is small exactly when the frames line up — every edge has landed back on its partner. Plotted over the search window, SSD is a single-welled basin whose minimum sits at the true shift $d$, and the aligner is nothing but an argmin over that surface. But notice what SSD assumes: brightness constancy — the same scene point keeps the same pixel value across frames. That is true for a same-exposure burst and plainly false across an HDR bracket, where a pixel that is twice as bright in one frame inflates the squared difference everywhere, swamping the alignment signal. The fix is the normalized cross-correlation (NCC),
which subtracts each patch's mean and divides by its standard deviation before correlating, making it invariant to per-frame gain and offset and maximized toward $+1$ at the true alignment. Rule of thumb: SSD for same-exposure bursts; reach for NCC the moment the brightness changes between frames — the HDR case.
Two practical notes. First, the cheapest unit test you will ever write: align a synthetic image to a known shift of itself and confirm the minimum lands exactly there, by construction. Catch the indexing, sign, and interpolation bugs on that toy, not on noisy real frames where a one-pixel error hides in the grain. Second, sub-pixel accuracy: the integer-grid minimum is rarely the true minimum, so fit a parabola to the SSD basin (or upsample the surface) and read off the vertex — a fractional shift, realized by bilinear or bicubic interpolation when you resample. Those tiny fractional offsets are noise to be removed when you are denoising, but they are signal to be exploited when you are super-resolving: a burst whose frames are sub-pixel-shifted samples the scene more densely than any one frame can, which is precisely what burst super-resolution mines (Wronski et al. 2019; cross-ref Super-resolution and image priors).
The most charming example of this exact aligner in the wild predates digital photography by a century. Around 1910, Sergei Prokudin-Gorsky photographed the Russian Empire in color by exposing three glass plates in quick succession through blue, green, and red filters. The camera shifted slightly between the three exposures, so naively stacking the digitized plates produces lurid colored fringes along every edge. The cure is exactly a translational shift-search per channel — and because the three filtered plates differ in brightness (a red flower is bright on the red plate, dark on the blue), it is an NCC search, not SSD. Recover the small inter-channel shifts, re-stack, and the fringes collapse into a clean color photograph (Figure 7.1.3). It is the same aligner you just wrote, applied across channels instead of across frames.
7.1.3 Phase correlation — the whole shift from one FFT⧉
Brute force is honest but wasteful: an $O(\text{window}^2)$ search to find a single shift. There is a way to read the shift off in one shot, and it comes straight from the shift theorem (L5, Linearity, Fourier, aliasing and deblurring). A pure translation multiplies the spectrum by a pure phase ramp and changes nothing else:
The entire translation lives in the phase, encoded as a ramp whose slope is the shift. The magnitudes are untouched — which is the diagonalization the big lesson promised.
Phase correlation (Kuglin & Hines 1975) turns this observation into an algorithm. Form the normalized cross-power spectrum,
where the overbar is complex conjugation. Normalizing each frequency to unit magnitude discards the magnitudes entirely and keeps only the phase ramp $e^{i\omega\cdot d}$. And the inverse Fourier transform of a pure phase ramp is a single sharp delta at $d$: $\mathcal F^{-1}\{R\}$ is a spike whose location is the shift you wanted. No search, no basin — just read off the peak position (Figure 7.1.4). The cost is one FFT pair, $O(n\log n)$ instead of $O(\text{window}^2)$; the method is robust to brightness changes for free, because the magnitudes (where gain lives) were thrown away; and the spike's sharpness and height double as a confidence measure — a crisp tall peak means a clean single shift, a smeared low one means trouble.
The same machinery stretches a little further. By first re-expressing the images in log-polar coordinates, a rotation becomes a shift in the angular axis and a scale change becomes a shift in the radial axis, so phase correlation recovers rotation and scale too (Reddy & Chatterji 1996). But there is one hard limit you must respect: phase correlation presumes one global shift for the whole image. It has no notion of parallax, of independently moving objects, or of a warp that varies from region to region. The moment the motion is richer than a single rigid translation, you need a parametric model with more degrees of freedom — a homography for a panorama under camera rotation — or a genuinely per-pixel field, which is optical flow (Optical flow, next).
7.1.4 Coarse-to-fine alignment on a pyramid⧉
Brute force has two failure modes that bite together. Large motion forces a large — and therefore expensive — search window; and a raw SSD surface is not always the clean single-welled basin of the idealized figure. Repetitive texture (a brick wall, a picket fence) and aliasing carve local minima into it, and a cold search started far from the answer can fall into the wrong well and lock there. Both problems have one cure: go multiscale (Figure 7.1.5).
Build a Gaussian pyramid of both frames and align coarse-to-fine (the pyramid of Linear pyramids and wavelets, here repurposed as a search strategy). At the tiny, heavily-blurred top of the pyramid, a small search in pixels covers a large real motion, because each coarse pixel stands for many fine ones — so large displacements come cheap. Better still, the low-frequency surface at the top is wide and single-welled: blurring has washed out the repetitive high-frequency detail that produced the local minima, so the coarse search has nothing to fall into and returns a robust rough estimate. Propagate that estimate down to the next level, scale it up, and refine it with a tiny local search; the high frequencies you add as you descend only sharpen the estimate, they never get to decide it.
Two lessons recur here, and they will recur again in flow and tracking. First: commit to low frequencies first — they own the clean basin — and let the high frequencies refine, never overturn. Second: building the pyramid the right way is blur-then-decimate, which is exactly the prefilter-before-subsampling discipline of L16 (Linearity, Fourier, aliasing and deblurring); skip the blur and the coarse levels alias, manufacturing the very spurious minima you went multiscale to avoid. The pyramid is not a bolt-on trick — it is the multiresolution structure of the image put directly to work as search. This is exactly how Google's HDR+ registers its bursts in production (Hasinoff et al. 2016): tiled, coarse-to-fine alignment followed by a robust merge.
Finally, the hand-off. Keep the coarse-to-fine matching machinery but replace "one transform per frame" with "one vector per pixel," and you have arrived at optical flow (next section). The differential alternative points the same way: rather than search over shifts, linearize the brightness difference and solve directly for the small motion that cancels it (Lucas & Kanade 1981) — the doorway from this direct-alignment toolbox into the dense per-pixel correspondence machinery of the rest of the part. (For the canonical survey of this entire global-alignment toolbox, see Szeliski 2006, Image Alignment and Stitching.) This chapter is the global root; Sparse matching takes the sparse branch and Optical flow the dense one.
Big lessons of this chapter
The recurring principles from this chapter, gathered for review.
A pure translation is diagonal in the Fourier basis: it leaves every frequency magnitude untouched and tilts only the phases, by a ramp whose slope is the shift. So a brute-force $O(\text{window}^2)$ shift search collapses to a single FFT pair — form the normalized cross-power spectrum, inverse-transform, and read the shift off the location of the resulting spike. Whenever the operator you face is a shift, a convolution, or a correlation, ask first whether the frequency domain turns the work into a per-frequency (diagonal) operation. (L5 is first placed in Linearity, Fourier, aliasing and deblurring; here it turns a search into a transform.)