10.6 Automatic panorama stitching from multiple views and feature matching⧉
PS7 builds the automatic panorama: Harris, descriptors, the 2NN test, RANSAC, and blending. → Problem sets (appendix).
The manual pipeline is complete except for one human step: someone clicked the corresponding points. That click is the bottleneck — slow, and impossible at scale, where "scale" means gigapixel mosaics, internet photo collections, and video. This chapter replaces the clicks with automatic feature matching, which is fairly described as perhaps the most important innovation in computer vision and computational photography of the last twenty years: the same machinery powers 3D reconstruction, tracking, object recognition, retrieval, and robot navigation. We keep the homography solve from before untouched; we only manufacture its inputs.
This chapter sits on the part spine L14 — capture the full set, decide later (introduced in this part's intro): a panorama defers your field of view, shooting several overlapping frames now and deciding the final wide view later, and everything here is the alignment step that lets that "decide later" merge happen. The light-field capture of Advanced computational photography is the special case where the deferred decision is focus and viewpoint at once.
10.6.1 Why not brute force, and the two sub-problems⧉
The honest place to start is with the method this one replaces, because seeing why it fails fixes the design. The old way matched images densely: try all candidate alignments, warp one image onto the other, compute the sum-of-squared-differences (SSD) over all overlapping pixels, and keep the alignment with the smallest error. For a translation that is two parameters to search — already slow but tractable. For a homography it is eight coefficients, and discretising an eight-dimensional search is combinatorially hopeless, with every trial demanding a full warp plus a full-image SSD. Even cut down to a couple of rotational degrees of freedom it is restrictive, slow, and — the deciding flaw — not robust: a moving object or a little lens distortion corrupts the global SSD and drags the answer off (Figure 10.6.1).
The modern idea is to go sparse. Ignore the uniform regions, which carry no alignment information anyway, and concentrate on a few very distinctive points — interest points or keypoints. Match those points locally (which point in image one corresponds to which point in image two), and add machinery to stay robust to the inevitable local mistakes. Sparse, fast, and robust, all at once.
That plan decomposes into two sub-problems, and they are worth stating sharply because they are the chapter's skeleton:
- repeatable detection — detect the same scene point independently in each image. If a point is found in one image but missed in the other, it can never be matched, no matter how good the rest of the pipeline is. Detection must therefore be repeatable across the changes of viewpoint, scale, and lighting that separate two frames.
- distinctive description — for a detected point, summarise its neighbourhood with a descriptor that is similar for the same scene point seen in two views and different for unrelated points, so that the correct correspondent can be recognised among many candidates.
One discipline governs both and is easy to forget: detection and description run independently in each image. Only at the very end, given a bag of (point, descriptor) features per image, do we compute correspondences. We will analyse the detector and descriptor on corresponding scene points — but only to check that they would behave consistently; the algorithm itself never gets to peek across images until the matching step.
10.6.2 Basic feature matching: Harris corners and what makes a good keypoint⧉
What makes a point worth detecting? Two things: you must be able to localise it precisely — pin down where it is to the pixel — and re-find it reliably in another view. There is a clean test for both, due originally to Moravec: take a small window around the candidate, slide it a little in every direction, and measure how much the windowed image changes. Three cases sort themselves out (Figure 10.6.2). In a flat region the window looks the same wherever you slide it — no change in any direction, so the point cannot be localised at all. On an edge the window changes when you slide across the edge but not when you slide along it — it slides freely in one direction, so its position along the edge is ambiguous (this is the aperture problem, the same one that haunts optical flow). On a corner the window changes substantially under a shift in every direction — and that is exactly what pins the point down uniquely. Corners make the best keypoints.
Now quantify the test. The windowed change produced by a shift $(u,v)$ is a weighted SSD,
with $w(x,y)$ a box or Gaussian window centred on the point. Evaluating this for every shift is expensive, so we Taylor-expand the shifted image, $I(x+u,y+v)\approx I + I_x u + I_y v$, where $I_x,I_y$ are the image gradients. Substituting and collecting terms turns $E$ into a quadratic form in the shift,
The $2\times2$ symmetric matrix $M$ is the structure tensor (also called the second-moment matrix): it packs the local gradient statistics — averaged over the window — into a single object. Its eigenvalues $\lambda_1,\lambda_2$ are the rates at which $E$ grows along the two principal directions, and the iso-$E$ contour is an ellipse whose axes are proportional to $\lambda^{-1/2}$ (a long axis means slow change, a short axis means fast change). Reading off the cases is now a matter of eigenvalues: both small means flat; one large, one small means an edge, the small eigenvalue being the freely-sliding direction; both large means a corner (Figure 10.6.3).
We would rather not compute eigenvalues explicitly at every pixel. Harris and Stephens (1988) gave a response that needs only the determinant and trace, both cheap:
The determinant $\det M=\lambda_1\lambda_2$ is large only when both eigenvalues are large, which is the corner condition; the trace penalty subtracts off the edge case (one large eigenvalue). So $R$ comes out large-positive at corners, large-negative on edges, and near zero in flat regions — a single scalar that classifies the point. (A common variant, Shi–Tomasi, drops the trick and simply uses $R=\min(\lambda_1,\lambda_2)$, the smallest growth rate, which is large only at corners.) The full Harris detector is then short: compute the gradients $I_x,I_y$; form $M$ at every pixel as windowed sums of gradient products; evaluate $R$; threshold it; and keep the local maxima of $R$ by non-maximum suppression over a small neighbourhood. What survives are the detected corners.
What survives a change of view is the real question, so it is worth being precise about Harris's invariances. Under rotation, the iso-$E$ ellipse rotates with the image but its eigenvalues do not change — they are intrinsic to the matrix — so $R$ is rotation-invariant. (This is a clean example of eigen-analysis handing you a rotation-invariant quantity for free.) Under an intensity change, $R$ depends only on derivatives, so it is exactly invariant to a brightness offset $I\to I+b$ and partially invariant to a contrast scaling $I\to aI$ — partial only because the fixed threshold on $R$ does not rescale with the image. The fatal gap is scale: Harris is not scale-invariant. Zoom in on a corner and its neighbourhood, seen through the same small window, now looks like a smooth edge, so the detector classifies it differently at different scales and fails to re-find it (Figure 10.6.4). That single failure is what motivates SIFT.
Detection gives us where; we still need what, a descriptor to recognise the point again. The basic, pre-SIFT descriptor — enough for a first working system — is almost embarrassingly simple: take the small patch of luminance around the corner, blur it slightly to suppress sampling and aliasing, then subtract its mean and divide by its standard deviation. The blur tames noise; the mean-and-standard-deviation normalisation kills the brightness and contrast differences that exposure, vignetting, or a passing cloud impose between frames. The descriptor is just those normalised patch values, read out as a vector $\mathbf{f}$. To match, scan: for each feature in image one, compute the descriptor SSD $d=\lVert\mathbf{f}_i-\mathbf{f}_j\rVert^2$ against every feature in image two and keep the nearest. This dutifully hands every corner a match — including corners that have no true correspondent at all, because they sit in a non-overlapping region or in repeated texture. Brute-force nearest-neighbour matching is therefore a first draft riddled with wrong answers; two later cleanups, the ratio test and RANSAC, exist precisely to repair it.
10.6.3 SIFT and advanced feature matching⧉
The goal now is sharper: a detector that finds the same point across changes of scale and rotation, paired with a descriptor that is correspondingly invariant, so that wide-baseline, zoomed, and rotated views still match. The Scale-Invariant Feature Transform, or SIFT, of Lowe (2004) is the canonical answer, and it has two halves: a scale-and-rotation-covariant detector, and a scale-and-rotation-invariant descriptor. It is long, so it helps to hold the four moving parts in view: (1) scale-space and the Difference-of-Gaussians, which supplies a scale axis; (2) keypoint detection, the space-and-scale extrema of DoG; (3) a canonical orientation per keypoint, which buys rotation invariance; and (4) the 128-D descriptor built in that canonical frame. We take them in turn.
1. Scale-space and the Difference-of-Gaussians. Scale selection comes from scale-space. Blur the image with Gaussians of increasing $\sigma$ to build a stack $L(\cdot,\sigma)=G_\sigma * I$, the image viewed at a continuum of scales. The right window size for a feature is then found as an extremum over scale of some scale-invariant function evaluated at the point. A plain Gaussian average is too featureless to peak, so SIFT uses a centre-surround (contrast) operator: the Difference-of-Gaussians $D=L(\cdot,k\sigma)-L(\cdot,\sigma)$, which closely approximates the scale-normalised Laplacian $\sigma^2\nabla^2 G$ and is cheap because it is just a subtraction of two pyramid levels. Tracked across scale at a fixed point, the DoG response peaks at a scale proportional to the feature's size — blur up a small blob and the response keeps climbing until the operator's scale matches the blob, then falls off. The peak location is the feature's characteristic scale (Figure 10.6.6). Because it is found independently in each image, a zoomed copy of the feature is detected at the correspondingly larger scale, which is exactly the repeatability we need.
2. Keypoint detection. SIFT detection then looks for local extrema of DoG in space and scale: a point that is larger or smaller than all 26 of its neighbours — the 8 around it in its own scale plus the 9 in each adjacent scale — within a DoG pyramid processed one octave (a factor of two in $\sigma$) at a time. Two refinements clean this up: a quadratic fit localises each extremum to sub-pixel and sub-scale precision, and two filters reject unreliable keypoints — low-contrast points (unstable under noise) and edge-like points (poorly localised along the edge, caught by a principal-curvature ratio that is essentially the Harris edge test again). The detected keypoints carry a location and a scale (Figure 10.6.6).
3. Canonical orientation. For rotation invariance the detector assigns each keypoint a canonical orientation. Build a histogram of local gradient orientations over the keypoint's neighbourhood at its scale, each gradient voting with weight equal to its magnitude; the peak of that histogram is the keypoint's canonical angle. Describing the patch relative to this angle makes everything downstream rotation-invariant: rotate the image and the gradients rotate with it, the histogram peak rotates with them, and the description, measured against the peak, is unchanged.
4. The 128-D descriptor. The SIFT descriptor itself is a 128-dimensional gradient-orientation histogram, built so as to bake in the invariances the detector earned. Resample a $16\times16$ window around the keypoint in its canonical scale and orientation — so the same scene patch is sampled identically regardless of how it was zoomed or rotated. Compute the gradients in that window and weight them by a Gaussian that falls off toward the edge (so a feature at the rim contributes gently and small misregistrations matter less). Divide the window into a $4\times4$ grid of cells, and in each cell accumulate an 8-bin orientation histogram, each gradient voting by its magnitude, with trilinear interpolation spreading every vote smoothly across neighbouring bins — two spatial and one orientation — so that a small shift of the patch does not snap a gradient abruptly from one bin to the next. The result is $4\times4\times8=\mathbf{128}$ numbers. A last pass buys illumination invariance: normalise the 128-vector to unit length (which removes any overall contrast scaling), then clamp every component above $0.2$ and renormalise (so that a few very large gradients — from a specular edge, say — cannot dominate). The net invariances are scale, rotation, and affine illumination, with graceful degradation under small affine geometry changes.
It is worth pausing on why this design is so durable. It is local, so it tolerates occlusion and clutter and needs no segmentation; distinctive, so a single feature can be matched against a database of thousands; plentiful, with many features per image; and efficient to compute. At scale, matching no longer scans linearly but uses approximate-nearest-neighbour structures — k-d trees with best-bin-first search — to find each descriptor's nearest neighbour fast. Downstream, SIFT features feed structure-from-motion at internet scale, the lineage that runs from Photo Tourism and Photosynth through today's COLMAP (the widely used open-source structure-from-motion package) and into Gaussian-splatting reconstruction. The modern learned successors — SuperPoint, LightGlue, neural matchers trained end-to-end — are cross-referenced in Deep learning; they keep this exact detect/describe/match skeleton and only replace the hand-designed pieces with learned ones.
10.6.4 Second-nearest-neighbour test⧉
Brute-force matching has a structural defect: it returns a nearest neighbour for every feature, including the many features that have no true correspondent — those in the background, in the non-overlapping margins, or in repeated texture where one brick looks like the next. We need a way to tell a trustworthy match from a spurious one.
The obvious move — threshold the raw descriptor distance, rejecting any match whose $d$ is too large — does not work cleanly, because the distributions of $d$ for correct and incorrect matches overlap. A correct match can be fairly far (the patch was lit differently, slightly warped) while an incorrect match can be deceptively near (two genuinely similar bits of texture), so no single absolute cutoff separates them (Figure 10.6.8).
Lowe's fix is to make the test relative. For each feature, find not just its nearest neighbour at distance $d_1$ but also its second-nearest at distance $d_2$, and keep the match only when
with $d_1\le d_2$ the two smallest descriptor distances. The intuition is exactly right. A distinctive, correct match is much closer to its true correspondent than to anything else, so $d_1 \ll d_2$ and the ratio is small. An ambiguous match — a repeated pattern, or a feature with no real correspondent — has a near-tie between its best and second-best candidate, so $d_1 \approx d_2$ and the ratio sits near one; we reject it. Crucially, the ratio distributions for correct and incorrect matches are well separated even though the raw-distance distributions overlap, so a single threshold $\tau$ does cleanly what an absolute threshold could not. The relative test asks not "is this match close?" but "is this match uniquely close?" — and uniqueness is what correspondence actually requires.
After the ratio test we hold a set of matches that is mostly correct — but not entirely. Some outliers survive, and near-duplicate structure can still slip a wrong match through. That residual contamination is precisely the job of RANSAC.
10.6.5 RANSAC⧉
We now have point correspondences, most of them good (inliers) but some still wrong (outliers), and we want the homography $H$. A plain least-squares fit over all matches is hopeless: least squares minimises squared error, which rewards the worst offenders, so even a handful of gross outliers drags the fit arbitrarily far from the truth. We need a robust estimator — one that can find the right model while ignoring the bad data.
RANSAC — RANdom SAmple Consensus, due to Fischler and Bolles (1981) — does this with a strikingly simple loop:
- Randomly pick a minimal sample. A homography is determined by four correspondences, so draw $s=4$ matches at random.
- Fit the model exactly from that sample — solve for $H$.
- Count inliers — every match whose reprojection error is small, $\lVert \mathbf{x}_i' - H\mathbf{x}_i \rVert < \varepsilon$ for a tolerance $\varepsilon$.
- Keep the sample with the largest consensus set — the $H$ that the most matches agree with.
- After the loop, re-fit $H$ by least squares on all its inliers (optionally reweighted to pull in near-inliers), now that the outliers are gone.
As a loop:
best = ∅ for N iterations: sample = random 4 matches H = fit_homography(sample) inliers = {i : ‖x'_i - H·x_i‖ < ε} if |inliers| > |best|: best = inliers return least_squares_fit(best)
The reason it works is the consensus idea. A model fit to an all-inlier sample is approximately correct, so it will be endorsed by many other inliers — they all describe the same underlying geometry. A model contaminated by even one outlier is wrong and earns few votes, because the outlier pulled it somewhere no real correspondence lives. Agreement, not squared error, picks the good model out of the noise. And using the minimal sample size is deliberate: the fewer points a draw needs, the higher the chance that a given draw is clean (Figure 10.6.9).
How many draws do we need? Let $w$ be the fraction of inliers among the matches and $s$ the sample size. A single random sample is all-inlier with probability $w^s$ — independent draws, so the probabilities multiply (an AND). The probability that $N$ independent samples all fail to be clean is then $(1-w^s)^N$. To succeed with probability $p$, we demand $(1-w^s)^N \le 1-p$, and solving for $N$ gives the standard count
Three knobs live in that formula, and reading them off is the whole engineering intuition. The sample size $s$ sits in the bad exponential: $w^s$ collapses fast as $s$ grows, so a more complex model — more parameters, larger $s$ — costs dramatically more iterations (a homography's $s=4$ is already nontrivial). The inlier fraction $w$ is the base of that exponential: cleaner matches mean far fewer iterations, which is exactly why the ratio test runs first — it throws out ambiguous matches and so raises $w$. And the iteration count $N$ is the good exponential: the failure probability $(1-w^s)^N$ falls geometrically with $N$, so a few extra iterations buy enormous reliability.
A worked number makes the exponentials concrete. Take a homography, $s=4$, and aim for $p=0.99$, so $\log(1-p)=\log(0.01)\approx-4.6$. With a clean half-inlier set, $w=0.5$, we have $w^s=0.5^4=0.0625$, and
iterations suffice — fewer than a hundred random draws to be 99% sure of a clean sample. Push the same formula and the knobs come alive: running $N=1000$ at $w=0.5$ drives the failure probability down to roughly $(0.9375)^{1000}\approx 10^{-28}$. But let the inlier fraction sag to $w=0.3$ and $w^s=0.0081$, so $N=100$ draws still fail about $44\%$ of the time and you need on the order of a thousand to be safe; at $w=0.1$, $w^s=10^{-4}$, and you need on the order of $10^4$–$10^5$ iterations (tens of thousands). The lesson is blunt: the inlier fraction dominates, which is why every cheap thing you can do to raise $w$ before RANSAC — the ratio test, above all — pays off enormously here.
RANSAC's reach extends far past homographies. It is a general robust-fitting tool — lines, planes, fundamental matrices, anything you can fit from a small sample and score by agreement — and the recipe is always the same: fit a low-parameter model from a random minimal sample, score it by consensus, keep the best, then re-fit on the inliers. It is cheap exactly when the model needs few points and the inliers are not too rare.
With RANSAC the automatic pipeline closes. Extract Harris or SIFT interest points; describe them (normalised patch, or the SIFT vector); match by descriptor distance; let the ratio test discard ambiguous matches; let RANSAC fit the homography robustly from what remains; warp every image into the reference frame; and blend. That is automatic panorama stitching in the sense of Brown and Lowe (2007) — and, with invariant features the way Brown, Szeliski and Winder (2005) characterised them, the same feature pipeline scales straight up to bundle adjustment and full 3D reconstruction.
The same pipeline run over many overlapping frames — match each new view to the growing mosaic, reject outliers, warp, blend — assembles a wide panorama from a handful of hand-held shots. (Distributing the small per-pair alignment errors evenly around a long sweep, so the ends meet cleanly, is the job of bundle adjustment in Bells and whistles.)