symbol

meaning (this chapter)

note

$I_x,\,I_y$

image gradients in $x$ and $y$

from Notations; reused from convolution / refreshers

$w(x,y)$

the detection window / weight (box or Gaussian) summed over in $M$

local; cf. $w$ = RANSAC inlier fraction below — disambiguate by context

$E(u,v)$

windowed SSD change under a shift $(u,v)$; the quadratic form $(u\,v)\,M\,(u\,v)^\top$

new (this chapter)

$M$

the structure tensor (second-moment matrix) $\sum w\,(I_x^2,\,I_xI_y;\,I_xI_y,\,I_y^2)$

new (this chapter)

$\lambda_1,\lambda_2$

the eigenvalues of $M$ — the two principal change rates

new; classify flat / edge / corner

$R$

the Harris corner response $R=\det M - k\,(\operatorname{tr}M)^2$

new; large-positive at corners

$k$

the Harris constant, $\approx 0.04$–$0.06$

new (this chapter); cf. $k\sigma$ scale ratio below

$\sigma$, $G_\sigma$

scale and the Gaussian of that scale; $L(\cdot,\sigma)=G_\sigma*I$ is scale-space

from Notations / Linear pyramids and wavelets

DoG, $D$

Difference-of-Gaussians $D=L(\cdot,k\sigma)-L(\cdot,\sigma)\approx\sigma^2\nabla^2 G * I$

new (scale-normalised Laplacian)

$\mathbf{f}$

a descriptor vector (normalised patch, or the 128-D SIFT vector)

new (this chapter)

$d$, $d_1$, $d_2$

descriptor distance $d=\lVert\mathbf{f}_i-\mathbf{f}_j\rVert^2$; nearest and second-nearest

new (this chapter)

$\tau$

the ratio-test threshold, keep iff $d_1/d_2<\tau$ ($\tau\approx0.8$)

new (this chapter)

$\varepsilon$

the RANSAC inlier reprojection tolerance $\lVert\mathbf{x}_i'-H\mathbf{x}_i\rVert<\varepsilon$

new (this chapter)

$w,\,s,\,p,\,N$

RANSAC: inlier fraction, minimal sample size ($s=4$ for $H$), target success probability, iteration count $N=\log(1-p)/\log(1-w^s)$

new (this chapter); $w$ here ≠ window weight above

$H$

the homography being fit (derived as $H=KRK^{-1}$ in Manual panorama stitching from multiple views)

reused; the model RANSAC robustly estimates

No big lessons in this chapter — skip ahead to Blending →

10.5 Automatic panorama stitching from multiple views and feature matching⧉

📎 Problem set

PS7 builds the automatic panorama: Harris, descriptors, the 2NN test, RANSAC, and blending. → Problem sets (appendix).

The manual pipeline is complete except for one human step: someone clicked the corresponding points. That click is the bottleneck — slow, and impossible at scale, where "scale" means gigapixel mosaics, internet photo collections, and video. This chapter replaces the clicks with automatic feature matching, which is fairly described as perhaps the most important innovation in computer vision and computational photography of the last twenty years: the same machinery powers 3D reconstruction, tracking, object recognition, retrieval, and robot navigation. We keep the homography solve from before untouched; we only manufacture its inputs.

This chapter sits on the part spine L14 — capture the full set, decide later (introduced in this part's intro): a panorama defers your field of view, shooting several overlapping frames now and deciding the final wide view later, and everything here is the alignment step that lets that "decide later" merge happen. The light-field capture of Advanced computational photography is the special case where the deferred decision is focus and viewpoint at once.

10.5.1 Why not brute force, and the two sub-problems⧉

The honest place to start is with the method this one replaces, because seeing why it fails fixes the design. The old way matched images densely: try all candidate alignments, warp one image onto the other, compute the sum-of-squared-differences (SSD) over all overlapping pixels, and keep the alignment with the smallest error. For a translation that is two parameters to search — already slow but tractable. For a homography it is eight coefficients, and discretising an eight-dimensional search is combinatorially hopeless, with every trial demanding a full warp plus a full-image SSD. Even cut down to a couple of rotational degrees of freedom it is restrictive, slow, and — the deciding flaw — not robust: a moving object or a little lens distortion corrupts the global SSD and drags the answer off (Figure 10.5.1).

The modern idea is to go sparse. Ignore the uniform regions, which carry no alignment information anyway, and concentrate on a few very distinctive points — interest points or keypoints. Match those points locally (which point in image one corresponds to which point in image two), and add machinery to stay robust to the inevitable local mistakes. Sparse, fast, and robust, all at once.

That plan decomposes into two sub-problems, and they are worth stating sharply because they are the chapter's skeleton:

repeatable detection — detect the same scene point independently in each image. If a point is found in one image but missed in the other, it can never be matched, no matter how good the rest of the pipeline is. Detection must therefore be repeatable across the changes of viewpoint, scale, and lighting that separate two frames.
distinctive description — for a detected point, summarise its neighbourhood with a descriptor that is similar for the same scene point seen in two views and different for unrelated points, so that the correct correspondent can be recognised among many candidates.

One discipline governs both and is easy to forget: detection and description run independently in each image. Only at the very end, given a bag of (point, descriptor) features per image, do we compute correspondences. We will analyse the detector and descriptor on corresponding scene points — but only to check that they would behave consistently; the algorithm itself never gets to peek across images until the matching step.

fig-feature-pipeline — **Figure 10.5.1.** The whole automatic system on one strip. Five boxes, left to right: **detect** distinctive points independently in each image (Harris corners, or scale-invariant feature transform (SIFT) keypoints); **describe** each point's neighbourhood with an invariant descriptor (a normalised patch, or the 128-D SIFT vector); **match** descriptors across images by nearest neighbour and prune ambiguous matches with the **ratio test** $d_1/d_2<\tau$; **robustly fit** the homography $H$ from the surviving, still-outlier-ridden matches with **RANSAC**; then **warp and blend** into the panorama. The dense brute-force alternative — discretise the homography's eight coefficients and search by full-image SSD — is shown crossed out beside it: combinatorially hopeless and not robust to motion.

10.5.2 Basic feature matching: Harris corners and what makes a good keypoint⧉

What makes a point worth detecting? Two things: you must be able to localise it precisely — pin down where it is to the pixel — and re-find it reliably in another view. There is a clean test for both, due originally to Moravec: take a small window around the candidate, slide it a little in every direction, and measure how much the windowed image changes. Three cases sort themselves out (Figure 10.5.2). In a flat region the window looks the same wherever you slide it — no change in any direction, so the point cannot be localised at all. On an edge the window changes when you slide across the edge but not when you slide along it — it slides freely in one direction, so its position along the edge is ambiguous (this is the aperture problem, the same one that haunts optical flow). On a corner the window changes substantially under a shift in every direction — and that is exactly what pins the point down uniquely. Corners make the best keypoints.

Now quantify the test. The windowed change produced by a shift $(u,v)$ is a weighted SSD,

$$ E(u,v) = \sum_{x,y} w(x,y)\,\big[\,I(x+u,y+v) - I(x,y)\,\big]^2, $$

with $w(x,y)$ a box or Gaussian window centred on the point. Evaluating this for every shift is expensive, so we Taylor-expand the shifted image, $I(x+u,y+v)\approx I + I_x u + I_y v$, where $I_x,I_y$ are the image gradients. Substituting and collecting terms turns $E$ into a quadratic form in the shift,

$$ E(u,v) \approx \begin{pmatrix} u & v \end{pmatrix} M \begin{pmatrix} u \\ v \end{pmatrix}, \qquad M = \sum_{x,y} w(x,y) \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}. $$

The $2\times2$ symmetric matrix $M$ is the structure tensor (also called the second-moment matrix): it packs the local gradient statistics — averaged over the window — into a single object. Its eigenvalues $\lambda_1,\lambda_2$ are the rates at which $E$ grows along the two principal directions, and the iso-$E$ contour is an ellipse whose axes are proportional to $\lambda^{-1/2}$ (a long axis means slow change, a short axis means fast change). Reading off the cases is now a matter of eigenvalues: both small means flat; one large, one small means an edge, the small eigenvalue being the freely-sliding direction; both large means a corner (Figure 10.5.3).

We would rather not compute eigenvalues explicitly at every pixel. Harris and Stephens (1988) gave a response that needs only the determinant and trace, both cheap:

$$ R = \det M - k\,(\operatorname{tr} M)^2 = \lambda_1 \lambda_2 - k\,(\lambda_1 + \lambda_2)^2, \qquad k \approx 0.04\text{–}0.06. $$

The determinant $\det M=\lambda_1\lambda_2$ is large only when both eigenvalues are large, which is the corner condition; the trace penalty subtracts off the edge case (one large eigenvalue). So $R$ comes out large-positive at corners, large-negative on edges, and near zero in flat regions — a single scalar that classifies the point. (A common variant, Shi–Tomasi, drops the trick and simply uses $R=\min(\lambda_1,\lambda_2)$, the smallest growth rate, which is large only at corners.) The full Harris detector is then short: compute the gradients $I_x,I_y$; form $M$ at every pixel as windowed sums of gradient products; evaluate $R$; threshold it; and keep the local maxima of $R$ by non-maximum suppression over a small neighbourhood. What survives are the detected corners.

What survives a change of view is the real question, so it is worth being precise about Harris's invariances. Under rotation, the iso-$E$ ellipse rotates with the image but its eigenvalues do not change — they are intrinsic to the matrix — so $R$ is rotation-invariant. (This is a clean example of eigen-analysis handing you a rotation-invariant quantity for free.) Under an intensity change, $R$ depends only on derivatives, so it is exactly invariant to a brightness offset $I\to I+b$ and partially invariant to a contrast scaling $I\to aI$ — partial only because the fixed threshold on $R$ does not rescale with the image. The fatal gap is scale: Harris is not scale-invariant. Zoom in on a corner and its neighbourhood, seen through the same small window, now looks like a smooth edge, so the detector classifies it differently at different scales and fails to re-find it (Figure 10.5.4). That single failure is what motivates SIFT.

Detection gives us where; we still need what, a descriptor to recognise the point again. The basic, pre-SIFT descriptor — enough for a first working system — is almost embarrassingly simple: take the small patch of luminance around the corner, blur it slightly to suppress sampling and aliasing, then subtract its mean and divide by its standard deviation. The blur tames noise; the mean-and-standard-deviation normalisation kills the brightness and contrast differences that exposure, vignetting, or a passing cloud impose between frames. The descriptor is just those normalised patch values, read out as a vector $\mathbf{f}$. To match, scan: for each feature in image one, compute the descriptor SSD $d=\lVert\mathbf{f}_i-\mathbf{f}_j\rVert^2$ against every feature in image two and keep the nearest. This dutifully hands every corner a match — including corners that have no true correspondent at all, because they sit in a non-overlapping region or in repeated texture. Brute-force nearest-neighbour matching is therefore a first draft riddled with wrong answers; two later cleanups, the ratio test and RANSAC, exist precisely to repair it.

fig-corner-flat-edge — **Figure 10.5.2.** What makes a good keypoint: slide a small window and watch the patch. Three columns. **Flat** — slide in any direction and nothing changes, so the point cannot be localised. **Edge** — slide across the edge and the patch changes, but slide along it and it does not (the aperture problem): the position along the edge is ambiguous. **Corner** — slide in any direction at all and the patch changes a lot, so the point is uniquely pinned down. Each column shows the window at the centre and a few shifted copies, with the change $E(u,v)$ small (flat), one-directional (edge), or large in every direction (corner).

fig-harris-eigen-ellipse — **Figure 10.5.3.** The structure tensor read through its eigenvalues. Left: the iso-error ellipse of $E(u,v)=(u\,v)\,M\,(u\,v)^\top$, axes proportional to $\lambda^{-1/2}$, drawn for a flat patch (large round ellipse — slow change everywhere), an edge (a long thin ellipse — fast across, slow along), and a corner (a small tight ellipse — fast in every direction). Right: the $(\lambda_1,\lambda_2)$ plane partitioned into **flat** (both small), **edge** (one large, one small), and **corner** (both large) regions, with contours of the Harris response $R=\det M-k(\operatorname{tr}M)^2$ overlaid — positive in the corner wedge, negative along the edge axes.

fig-harris-not-scale-invariant — **Figure 10.5.4.** Harris is not scale-invariant. The same scene corner is shown at two zoom levels with the **same** fixed detection window. Zoomed out, the window spans the corner and Harris fires "corner." Zoomed in, the same window now sees only a locally straight edge and Harris fires "edge" — so the point is detected at one scale and missed at the other, and can never match across the two views. The cure is to select a scale per feature, which is what SIFT adds.

fig-harris-real — **Figure 10.5.5.** Harris on a real image (Stata Center view 1), the conceptual figures cashed out. From the input, gradients are accumulated over a window into the **structure tensor** $M=\sum w\,\nabla I\,\nabla I^{\top}$; the scalar **Harris response** $R=\det M - k(\operatorname{tr}M)^2$ is large only where both eigenvalues are large — a true two-dimensional corner — and the heatmap lights up on building corners, window-grid junctions, and foliage tips while flat walls and sky stay dark; a **threshold plus non-maximum suppression** then keeps the local peaks as the detected corners. The same operator runs on every view before matching.

10.5.3 SIFT and advanced feature matching⧉

The goal now is sharper: a detector that finds the same point across changes of scale and rotation, paired with a descriptor that is correspondingly invariant, so that wide-baseline, zoomed, and rotated views still match. The Scale-Invariant Feature Transform, or SIFT, of Lowe (2004) is the canonical answer, and it has two halves: a scale-and-rotation-covariant detector, and a scale-and-rotation-invariant descriptor. It is long, so it helps to hold the four moving parts in view: (1) scale-space and the Difference-of-Gaussians, which supplies a scale axis; (2) keypoint detection, the space-and-scale extrema of DoG; (3) a canonical orientation per keypoint, which buys rotation invariance; and (4) the 128-D descriptor built in that canonical frame. We take them in turn.

1. Scale-space and the Difference-of-Gaussians. Scale selection comes from scale-space. Blur the image with Gaussians of increasing $\sigma$ to build a stack $L(\cdot,\sigma)=G_\sigma * I$, the image viewed at a continuum of scales. The right window size for a feature is then found as an extremum over scale of some scale-invariant function evaluated at the point. A plain Gaussian average is too featureless to peak, so SIFT uses a centre-surround (contrast) operator: the Difference-of-Gaussians $D=L(\cdot,k\sigma)-L(\cdot,\sigma)$, which closely approximates the scale-normalised Laplacian $\sigma^2\nabla^2 G$ and is cheap because it is just a subtraction of two pyramid levels. Tracked across scale at a fixed point, the DoG response peaks at a scale proportional to the feature's size — blur up a small blob and the response keeps climbing until the operator's scale matches the blob, then falls off. The peak location is the feature's characteristic scale (Figure 10.5.6). Because it is found independently in each image, a zoomed copy of the feature is detected at the correspondingly larger scale, which is exactly the repeatability we need.

2. Keypoint detection. SIFT detection then looks for local extrema of DoG in space and scale: a point that is larger or smaller than all 26 of its neighbours — the 8 around it in its own scale plus the 9 in each adjacent scale — within a DoG pyramid processed one octave (a factor of two in $\sigma$) at a time. Two refinements clean this up: a quadratic fit localises each extremum to sub-pixel and sub-scale precision, and two filters reject unreliable keypoints — low-contrast points (unstable under noise) and edge-like points (poorly localised along the edge, caught by a principal-curvature ratio that is essentially the Harris edge test again). The detected keypoints carry a location and a scale (Figure 10.5.6).

3. Canonical orientation. For rotation invariance the detector assigns each keypoint a canonical orientation. Build a histogram of local gradient orientations over the keypoint's neighbourhood at its scale, each gradient voting with weight equal to its magnitude; the peak of that histogram is the keypoint's canonical angle. Describing the patch relative to this angle makes everything downstream rotation-invariant: rotate the image and the gradients rotate with it, the histogram peak rotates with them, and the description, measured against the peak, is unchanged.

4. The 128-D descriptor. The SIFT descriptor itself is a 128-dimensional gradient-orientation histogram, built so as to bake in the invariances the detector earned. Resample a $16\times16$ window around the keypoint in its canonical scale and orientation — so the same scene patch is sampled identically regardless of how it was zoomed or rotated. Compute the gradients in that window and weight them by a Gaussian that falls off toward the edge (so a feature at the rim contributes gently and small misregistrations matter less). Divide the window into a $4\times4$ grid of cells, and in each cell accumulate an 8-bin orientation histogram, each gradient voting by its magnitude, with trilinear interpolation spreading every vote smoothly across neighbouring bins — two spatial and one orientation — so that a small shift of the patch does not snap a gradient abruptly from one bin to the next. The result is $4\times4\times8=\mathbf{128}$ numbers. A last pass buys illumination invariance: normalise the 128-vector to unit length (which removes any overall contrast scaling), then clamp every component above $0.2$ and renormalise (so that a few very large gradients — from a specular edge, say — cannot dominate). The net invariances are scale, rotation, and affine illumination, with graceful degradation under small affine geometry changes.

It is worth pausing on why this design is so durable. It is local, so it tolerates occlusion and clutter and needs no segmentation; distinctive, so a single feature can be matched against a database of thousands; plentiful, with many features per image; and efficient to compute. At scale, matching no longer scans linearly but uses approximate-nearest-neighbour structures — k-d trees with best-bin-first search — to find each descriptor's nearest neighbour fast. Downstream, SIFT features feed structure-from-motion at internet scale, the lineage that runs from Photo Tourism and Photosynth through today's COLMAP (the widely used open-source structure-from-motion package) and into Gaussian-splatting reconstruction. The modern learned successors — SuperPoint, LightGlue, neural matchers trained end-to-end — are cross-referenced in Deep learning; they keep this exact detect/describe/match skeleton and only replace the hand-designed pieces with learned ones.

fig-dog-pyramid — **Figure 10.5.6.** Difference-of-Gaussians scale-space and keypoint detection. One octave of a Gaussian pyramid ($L(\cdot,\sigma)$ at increasing blur) and the adjacent-level differences $D=L(\cdot,k\sigma)-L(\cdot,\sigma)$ stacked into a DoG volume. Tracked across scale at a fixed point, the DoG response **peaks at a scale proportional to the feature's size** — the feature's characteristic scale, found independently per image, so a zoomed copy is detected at the correspondingly larger scale (a plain Gaussian average is too flat to peak; the centre-surround DoG is what creates the maximum). A SIFT keypoint is a point that is a local maximum or minimum over its 26 space-and-scale neighbours — the 8 in its own DoG level and the 9 in each neighbouring level. The volume is processed one octave at a time; detected extrema are then refined to sub-pixel/sub-scale and culled of low-contrast and edge-like responses.

fig-sift-descriptor — **Figure 10.5.7.** Building the 128-D SIFT descriptor. The keypoint's **canonical orientation** is first fixed as the dominant peak of a magnitude-weighted gradient-orientation histogram over its neighbourhood (so the descriptor, measured relative to this angle, is rotation-invariant). A $16\times16$ window is then resampled in that canonical scale and orientation; its gradients (arrows) are weighted by a Gaussian falloff (the overlaid circle). The window is split into a $4\times4$ grid of cells, and within each cell the gradients accumulate into an **8-bin orientation histogram** (the small radial plots), with trilinear interpolation spreading each vote across neighbouring spatial and orientation bins. Concatenating $4\times4\times8$ gives a 128-vector, then normalised to unit length, clamped at $0.2$, and renormalised for illumination invariance.

10.5.4 Second-nearest-neighbour test⧉

Brute-force matching has a structural defect: it returns a nearest neighbour for every feature, including the many features that have no true correspondent — those in the background, in the non-overlapping margins, or in repeated texture where one brick looks like the next. We need a way to tell a trustworthy match from a spurious one.

The obvious move — threshold the raw descriptor distance, rejecting any match whose $d$ is too large — does not work cleanly, because the distributions of $d$ for correct and incorrect matches overlap. A correct match can be fairly far (the patch was lit differently, slightly warped) while an incorrect match can be deceptively near (two genuinely similar bits of texture), so no single absolute cutoff separates them (Figure 10.5.8).

Lowe's fix is to make the test relative. For each feature, find not just its nearest neighbour at distance $d_1$ but also its second-nearest at distance $d_2$, and keep the match only when

$$ \frac{d_1}{d_2} < \tau, \qquad \tau \approx 0.8, $$

with $d_1\le d_2$ the two smallest descriptor distances. The intuition is exactly right. A distinctive, correct match is much closer to its true correspondent than to anything else, so $d_1 \ll d_2$ and the ratio is small. An ambiguous match — a repeated pattern, or a feature with no real correspondent — has a near-tie between its best and second-best candidate, so $d_1 \approx d_2$ and the ratio sits near one; we reject it. Crucially, the ratio distributions for correct and incorrect matches are well separated even though the raw-distance distributions overlap, so a single threshold $\tau$ does cleanly what an absolute threshold could not. The relative test asks not "is this match close?" but "is this match uniquely close?" — and uniqueness is what correspondence actually requires.

After the ratio test we hold a set of matches that is mostly correct — but not entirely. Some outliers survive, and near-duplicate structure can still slip a wrong match through. That residual contamination is precisely the job of RANSAC.

fig-ratio-test-histograms — **Figure 10.5.8.** Why the ratio test beats an absolute threshold. Top: histograms of the **raw** descriptor distance $d_1$ for correct (green) and incorrect (red) matches — heavily **overlapping**, so no vertical cutoff cleanly separates them. Bottom: histograms of the **ratio** $d_1/d_2$ for the same two populations — now **well separated**, correct matches piling up at small ratios (the true match is much nearer than any alternative) and incorrect matches near one (a near-tie between best and second-best). A single threshold $\tau\approx0.8$ keeps almost all correct matches and rejects most wrong ones.

10.5.5 RANSAC⧉

We now have point correspondences, most of them good (inliers) but some still wrong (outliers), and we want the homography $H$. A plain least-squares fit over all matches is hopeless: least squares minimises squared error, which rewards the worst offenders, so even a handful of gross outliers drags the fit arbitrarily far from the truth. We need a robust estimator — one that can find the right model while ignoring the bad data.

RANSAC — RANdom SAmple Consensus, due to Fischler and Bolles (1981) — does this with a strikingly simple loop:

Randomly pick a minimal sample. A homography is determined by four correspondences, so draw $s=4$ matches at random.
Fit the model exactly from that sample — solve for $H$.
Count inliers — every match whose reprojection error is small, $\lVert \mathbf{x}_i' - H\mathbf{x}_i \rVert < \varepsilon$ for a tolerance $\varepsilon$.
Keep the sample with the largest consensus set — the $H$ that the most matches agree with.
After the loop, re-fit $H$ by least squares on all its inliers (optionally reweighted to pull in near-inliers), now that the outliers are gone.

As a loop:

best = ∅
for N iterations:
    sample = random 4 matches
    H = fit_homography(sample)
    inliers = {i : ‖x'_i - H·x_i‖ < ε}
    if |inliers| > |best|:
        best = inliers
return least_squares_fit(best)

The reason it works is the consensus idea. A model fit to an all-inlier sample is approximately correct, so it will be endorsed by many other inliers — they all describe the same underlying geometry. A model contaminated by even one outlier is wrong and earns few votes, because the outlier pulled it somewhere no real correspondence lives. Agreement, not squared error, picks the good model out of the noise. And using the minimal sample size is deliberate: the fewer points a draw needs, the higher the chance that a given draw is clean (Figure 10.5.9).

How many draws do we need? Let $w$ be the fraction of inliers among the matches and $s$ the sample size. A single random sample is all-inlier with probability $w^s$ — independent draws, so the probabilities multiply (an AND). The probability that $N$ independent samples all fail to be clean is then $(1-w^s)^N$. To succeed with probability $p$, we demand $(1-w^s)^N \le 1-p$, and solving for $N$ gives the standard count

$$ N = \frac{\log(1-p)}{\log(1-w^s)}. $$

Three knobs live in that formula, and reading them off is the whole engineering intuition. The sample size $s$ sits in the bad exponential: $w^s$ collapses fast as $s$ grows, so a more complex model — more parameters, larger $s$ — costs dramatically more iterations (a homography's $s=4$ is already nontrivial). The inlier fraction $w$ is the base of that exponential: cleaner matches mean far fewer iterations, which is exactly why the ratio test runs first — it throws out ambiguous matches and so raises $w$. And the iteration count $N$ is the good exponential: the failure probability $(1-w^s)^N$ falls geometrically with $N$, so a few extra iterations buy enormous reliability.

A worked number makes the exponentials concrete. Take a homography, $s=4$, and aim for $p=0.99$, so $\log(1-p)=\log(0.01)\approx-4.6$. With a clean half-inlier set, $w=0.5$, we have $w^s=0.5^4=0.0625$, and

$$ N = \frac{\log(0.01)}{\log(1-0.0625)} = \frac{-4.6}{-0.0645} \approx 71 $$

iterations suffice — fewer than a hundred random draws to be 99% sure of a clean sample. Push the same formula and the knobs come alive: running $N=1000$ at $w=0.5$ drives the failure probability down to roughly $(0.9375)^{1000}\approx 10^{-28}$. But let the inlier fraction sag to $w=0.3$ and $w^s=0.0081$, so $N=100$ draws still fail about $44\%$ of the time and you need on the order of a thousand to be safe; at $w=0.1$, $w^s=10^{-4}$, and you need on the order of $10^4$–$10^5$ iterations (tens of thousands). The lesson is blunt: the inlier fraction dominates, which is why every cheap thing you can do to raise $w$ before RANSAC — the ratio test, above all — pays off enormously here.

RANSAC's reach extends far past homographies. It is a general robust-fitting tool — lines, planes, fundamental matrices, anything you can fit from a small sample and score by agreement — and the recipe is always the same: fit a low-parameter model from a random minimal sample, score it by consensus, keep the best, then re-fit on the inliers. It is cheap exactly when the model needs few points and the inliers are not too rare.

With RANSAC the automatic pipeline closes. Extract Harris or SIFT interest points; describe them (normalised patch, or the SIFT vector); match by descriptor distance; let the ratio test discard ambiguous matches; let RANSAC fit the homography robustly from what remains; warp every image into the reference frame; and blend. That is automatic panorama stitching in the sense of Brown and Lowe (2007) — and, with invariant features the way Brown, Szeliski and Winder (2005) characterised them, the same feature pipeline scales straight up to bundle adjustment and full 3D reconstruction.

fig-ransac-line — **Figure 10.5.9.** RANSAC on a toy line fit, four panels. (1) A scatter of points: a clear linear trend plus scattered outliers. (2) A random **minimal sample** of two points and the line through them; count how many points fall within a band $\pm\varepsilon$ of it (the **consensus set**). (3) A bad draw — one sample point is an outlier — gives a tilted line with few inliers, and is discarded. (4) The best draw's line, with its large consensus set highlighted, followed by a least-squares **re-fit on the inliers** alone. The minimal-sample-then-consensus loop, in miniature.

fig-stata-stitch-pipeline — **Figure 10.5.10.** The whole pipeline on a real pair (the MIT Stata Center, two overlapping views). **Detect & describe:** Harris corners fire at trackable two-dimensional structure, and each gets an orientation-normalised descriptor patch. **Match & verify:** nearest-descriptor matching gives many **putative** correspondences, a good fraction of them wrong; RANSAC keeps only the **inliers** consistent with a single homography. **Warp & blend:** the recovered homography warps one view onto the other and the overlap is blended, yielding a seamless panorama. Every stage of this chapter, end to end, on one scene.

The same pipeline run over many overlapping frames — match each new view to the growing mosaic, reject outliers, warp, blend — assembles a wide panorama from a handful of hand-held shots. (Distributing the small per-pair alignment errors evenly around a long sweep, so the ends meet cleanly, is the job of bundle adjustment in Bells and whistles.)

fig-autostitch-gallery — **Figure 10.5.11.** Automatic stitching at scale, on real scenes: a Boston-skyline sunset and the Guédelon castle site, each assembled from several overlapping hand-held frames by exactly this corners → match → RANSAC → warp → blend pipeline. The bowed black border is the signature of projecting many frames onto one planar canvas (the projection choice is revisited in Bells and whistles).

10.5 Automatic panorama stitching from multiple views and feature matching🔗⧉

10.5.1 Why not brute force, and the two sub-problems🔗⧉

10.5.2 Basic feature matching: Harris corners and what makes a good keypoint🔗⧉

10.5.3 SIFT and advanced feature matching🔗⧉

10.5.4 Second-nearest-neighbour test🔗⧉

10.5.5 RANSAC🔗⧉

10.5 Automatic panorama stitching from multiple views and feature matching⧉

10.5.1 Why not brute force, and the two sub-problems⧉

10.5.2 Basic feature matching: Harris corners and what makes a good keypoint⧉

10.5.3 SIFT and advanced feature matching⧉

10.5.4 Second-nearest-neighbour test⧉

10.5.5 RANSAC⧉