💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

7.4 Robustness: the ratio test and RANSAC

A keypoint matcher hands you a pile of correspondences that is mostly noise. Repeated texture (one window of brick looks like the next), occlusion (the point a descriptor matched to is simply not in the other image), and ordinary descriptor collisions all produce confident-but-wrong matches. And the cost of even one of them is steep: least squares minimizes a sum of squared residuals, so a single far-off outlier — a match that says "this corner of the roof went to that corner of the sky" — dominates the sum and bends the entire fitted transform off the truth. You cannot average your way out of an outlier; you have to reject it. This is the same lesson the brittle least-squares fit taught back in Linear Inverse Problems and Regression, now with real, adversarial outliers attached.

Two classic filters make matching usable, and they attack the problem from opposite ends. One prunes matches before any geometry is fit, by spotting which matches are ambiguous. The other fits a transform despite the outliers that survive pruning, by replacing the fragile least-squares average with a robust consensus vote. Neither alone is enough; together they turn a 50%-garbage match set into a clean geometric model.

7.4.1 The ratio test — reject ambiguous matches before fitting

The instinct, when you have a keypoint in image A and a set of candidates in image B, is to keep the match if the nearest descriptor is close. Lowe's ratio test (Lowe 2004) says that is the wrong question. Closeness in absolute terms is unreliable — a descriptor on a high-contrast corner is close to its true partner, a descriptor on flat texture is far from everything, and a single threshold on distance cannot serve both. The right question is about ambiguity: is the nearest match decisively better than the second-nearest one?

So for each keypoint, find its two nearest neighbours in descriptor space, at distances $d_1\le d_2$, and accept the match only if

$$ \frac{d_1}{d_2} < \rho, \qquad \rho \approx 0.7\text{–}0.8. $$

The logic is clean. A distinctive point — a unique corner of a window frame — has one obviously-best partner and everything else far away, so $d_1\ll d_2$ and the ratio is small: keep it. A point on repeated texture — one brick among a hundred — has a best match that is barely better than the next brick over, so $d_1\approx d_2$, the ratio is near $1$, and the match is thrown out (Figure 7.4.1). The test measures exactly the failure mode that breaks matching: it does not ask "is this match good?" but "is there a competitor that makes this match untrustworthy?" That is why it works where an absolute-distance threshold does not — repeated structure produces near-ties, and near-ties are precisely what the ratio catches.

It is one line of code, it runs before any geometry, and it removes the majority of false matches at a stroke. The slides call its survivors the "after 2NN test" set — feature points with mostly-good matches, but not all good. That residual is the honest caveat: the ratio test does cost some true matches in genuinely repetitive scenes (a real corner that happens to have a near-twin elsewhere gets rejected as ambiguous), and a lower $\rho$ trades recall for precision. Whatever it leaves behind still contains outliers — which is exactly why the next stage exists.

fig-ratio-test
Figure 7.4.1. Ambiguity, not distance, is the test. Left: one query keypoint and its two nearest descriptors in the other image, with distances $d_1$ (nearest) and $d_2$ (second-nearest). A distinctive match has $d_1\ll d_2$ — one clear winner — and passes ($d_1/d_2<\rho$, kept). An ambiguous match (a point on repeated texture) has $d_1\approx d_2$ — two near-tied candidates — and fails (rejected). Right: histograms of the ratio $d_1/d_2$ for true versus false matches, which separate cleanly around $\rho\approx0.8$: true matches pile up at small ratios, false matches spread toward $1$. The threshold slices between the two populations.

7.4.2 RANSAC — fit from a minimal random sample, score by consensus

After the ratio test, outliers remain — so we must fit the transform robustly, in a way that no number of bad matches can sabotage. RANSAC — RANdom SAmple Consensus, from Fischler & Bolles (1981) — does this with an idea that is almost embarrassingly simple, and exactly right. Rather than fit one model to all the data and hope, fit many models to tiny random subsets and let the data vote.

The loop, for fitting a transform $T$ (Figure 7.4.3 shows it on a homography):

  1. Draw a minimal random sample — the fewest matches that determine the model. Two points for a line, four point-pairs for a homography $H$, and so on. Minimal is deliberate: the smaller the sample, the higher the chance it lands entirely on inliers.
  2. Fit $T$ from just that sample (an exact solve — four pairs give an exact homography, no least squares yet).
  3. Score by consensus. Count how many of all the other matches agree with this $T$ — the inliers — using the test $\lVert \mathbf{x}'_i - T(\mathbf{x}_i)\rVert < \epsilon$: a match is an inlier if mapping $\mathbf{x}_i$ through $T$ lands within $\epsilon$ pixels of its claimed partner $\mathbf{x}'_i$.
  4. Repeat, keeping the $T$ with the largest inlier set.
  5. Refit that winning model by least squares on all its inliers — now that the outliers are excluded, a least-squares average is finally safe and gives the best estimate.

Why this beats least squares is the whole point. A least-squares fit averages in every match, so a handful of gross outliers drag the answer off (Figure 7.4.2, the canonical picture: the least-squares line tilts toward the outliers, while RANSAC's best minimal-sample line threads the inlier majority). RANSAC instead ignores outliers rather than averaging them: a minimal sample that happens to be all-inlier defines a model the inlier majority votes for, and the outliers — being mutually inconsistent — never form a large consensus of their own. The method is robust even when most of the matches are wrong, which is exactly the regime keypoint matching lives in.

How many iterations do we need? This is the one piece of analysis worth doing, because it tells you the algorithm is cheap. Suppose the inlier fraction is $w$ and the model needs a minimal sample of size $s$. The probability that a single random draw is all inliers is $w^s$ (each of the $s$ picks independently inlier). So the probability a single draw is contaminated is $1-w^s$, and the probability that all $N$ draws fail is $(1-w^s)^N$. Demand that this failure probability be at most $1-p$ for a target success probability $p$, and solve:

$$ N = \frac{\log(1-p)}{\log\!\big(1-w^{s}\big)}. $$

Read the three dependencies off this formula, because they are the design knobs (the slides call them the good, the bad, and the base of the exponential). The sample size $s$ sits in an exponent — it is the bad exponential, so you want the minimal solver and not one match more (this is why "minimal sample" is load-bearing, not a stylistic choice). The inlier fraction $w$ sets the base of the exponential — a richer match set converges far faster. And $N$ itself is the good exponential: more iterations drive failure down geometrically, and iterations are cheap (a four-point solve plus a residual count). A concrete case from the slides makes it vivid: with a homography ($s=4$) and a 50% inlier fraction ($w=0.5$), a single draw succeeds only 6% of the time — but $N=100$ iterations cuts the failure probability to about $1$ in $600$, and $N=1000$ to astronomically small. Even at a punishing $w=0.1$, raising $N$ buys success; you simply pay more iterations. The takeaway is that RANSAC trades a guaranteed-correct but outlier-fragile solve for a probabilistically-correct but outlier-immune one, and the probability is yours to set as high as you like for the price of a few more cheap trials.

fig-ransac-line-fit
Figure 7.4.2. Why RANSAC beats least squares — the canonical picture. A set of 2-D points: an inlier majority lying along a line, plus a few gross outliers off to the side. Least squares (fit to all points) is dragged off toward the outliers — the squared residuals of the far points dominate the sum and tilt the line. RANSAC draws minimal samples of two points; a sample landing on two inliers defines a line that the inlier majority votes for (the largest consensus set), and the outliers are simply ignored. The RANSAC line passes cleanly through the inlier majority while the least-squares line bends toward the outliers.
fig-ransac-homography
Figure 7.4.3. RANSAC on a real homography. Left: two views of a scene with raw nearest-neighbour matches drawn as lines — many of them cross wildly, the signature of outliers (a corner in one image "matched" to an unrelated point in the other). Right: the RANSAC inlier set — the subset of matches consistent with a single homography $H$ (the four-point minimal solver, scored by reprojection within $\epsilon$). These lines are clean and roughly parallel, all obeying one consistent warp; the recovered $H$ is shown beside them. The same match set, before and after consensus filtering.
💡 Big lesson

Consensus is more robust than averaging. A least-squares fit is an average: it minimizes a sum of squared residuals, so it is perfectly behaved when every data point is a slightly-noisy inlier — and catastrophically fragile the instant one point is a gross outlier, because that one term dominates the sum. RANSAC's move is to stop averaging the raw data and instead let the data vote: fit from a minimal sample (small enough to occasionally be outlier-free), score by how many others agree, and keep the model with the most votes — averaging only at the very end, after the outliers have been excluded. The pattern recurs far beyond matching: whenever your data is contaminated by points that don't merely add noise but actively lie, replace the global average with a consensus-of-a-clean-subset. Robustness is not a better fit to all the data; it is knowing which data to fit.

7.4.3 Variants and degeneracy

The sample-and-score skeleton is so useful that a family of variants has grown around it, each improving one piece while keeping the loop intact:

All four keep the draw-a-sample, score-by-agreement, keep-the-best loop; they change how you sample, how you score, or how you refine, never the core idea.

There is one honest caveat that no amount of iteration fixes, and it is worth stating plainly: RANSAC finds consensus, not truth. The algorithm trusts whatever model the most points agree on — but a degenerate configuration can make a wrong model win with overwhelming consensus. The textbook case lives in two-view geometry (Multiple view geometry): if you are estimating the fundamental matrix $F$ to recover 3-D structure, but all the matched points happen to lie on a single plane (a building façade, a tabletop), then the data is consistent with infinitely many $F$'s, and RANSAC will confidently return one of them — geometrically valid by its inlier count, but useless for reconstruction. Consensus and correctness diverge exactly when the data is degenerate. The standard guards are a planar-versus-3-D check (test whether the inliers are explained by a homography alone, which flags the planar degeneracy) and choosing minimal solvers matched to the geometry you actually want. The lesson sits one level above the algorithm: a method that rewards agreement will be fooled by data that agrees for the wrong reason.

These three pieces — ratio test, RANSAC, degeneracy guard — are what make sparse matching deliver. The very next chapter, Automatic panorama stitching and feature matching, runs the full pipeline in anger: Harris corners → descriptors → ratio test → RANSAC homography → blend; and the same robust-fitting machinery underwrites every two-view estimator in Multiple view geometry. Prune the ambiguous, fit by consensus, and distrust the suspiciously perfect.


Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson

Consensus is more robust than averaging. A least-squares fit is an average: it minimizes a sum of squared residuals, so it is perfectly behaved when every data point is a slightly-noisy inlier — and catastrophically fragile the instant one point is a gross outlier, because that one term dominates the sum. RANSAC's move is to stop averaging the raw data and instead let the data vote: fit from a minimal sample (small enough to occasionally be outlier-free), score by how many others agree, and keep the model with the most votes — averaging only at the very end, after the outliers have been excluded. The pattern recurs far beyond matching: whenever your data is contaminated by points that don't merely add noise but actively lie, replace the global average with a consensus-of-a-clean-subset. Robustness is not a better fit to all the data; it is knowing which data to fit.