💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

7.3 Sparse matching

The previous part transported pixels along maps that were given or hand-drawn. This part estimates the maps, and the cheapest, most robust way to start is not to match every pixel but to match a sparse handful of points you can find again. The plan is three stages, in strict order. Detect distinctive points — the where. Describe each one's neighbourhood with a vector — the what. Match those vectors across images — the which-goes-with-which. Two design goals govern the whole pipeline and pull in opposite directions. Detection must be repeatable: the same physical point should fire in both images despite a change of viewpoint, zoom, and lighting — if a corner is found in one image but not the other, there is no chance to match it. Description must be distinctive yet invariant: unique enough that the right match stands out unambiguously, but stable enough to survive the change between the two images. Repeatability and distinctiveness are the two halves of the contract, and almost every design decision below is a negotiation between them.

💡 Big lesson — the structure tensor, three jobs

The same $2\times2$ matrix $M=\sum_W \nabla I\,\nabla I^{\top}$ answers three questions across this part, and the unity is the point. Is this a good point to detect? — yes where $M$ has two large eigenvalues, a corner: that is Harris / Shi–Tomasi, here. Is this patch's motion solve well-conditioned? — the very same matrix, now required invertible: that is KLT (Feature tracking). What is the local flow constraint? — $M$ is the normal-equations matrix $A^\top A$: that is Lucas–Kanade (Optical flow). Detect with it, track with it, solve flow with it — one matrix, read three ways. Corners are special for a deep reason: the same both-directions-vary condition that localises a point is also what makes its motion recoverable. So "good to detect" and "good to track" are not two coincidentally-aligned criteria; they are one statement about $M$.

7.3.1 Where to look — corners, and the structure tensor

Why corners? Because a point can only be localised where the intensity changes in both directions. Walk through the three cases (Figure 7.3.1). A flat patch can be slid in any direction and looks identical — there is nothing to lock onto, so its position is hopeless. An edge is worse than it looks: you can slide it along itself and see no change, so its position along the edge is unrecoverable — this is the aperture problem of Optical flow wearing a detection costume. Only a corner — where two differently-oriented edges meet — pins down uniquely, because every small shift changes the appearance.

The quantitative test for "varies in both directions" is the structure tensor $M$, also called the second-moment matrix or, in this context, the Harris matrix. Imagine shifting a window $W$ by a small $(\Delta x,\Delta y)$ and measuring the resulting sum-of-squared-differences; a first-order Taylor expansion turns that error surface into a quadratic form whose matrix is exactly

$$ M=\sum_{\mathbf x\in W} w(\mathbf x)\begin{pmatrix}I_x^2 & I_xI_y\\ I_xI_y & I_y^2\end{pmatrix}, $$

the gradient outer products $\nabla I\,\nabla I^{\top}$ accumulated over the window with weights $w$ (a Gaussian, in practice, for a smooth falloff). The two eigenvalues $\lambda_1\ge\lambda_2$ of $M$ measure how strongly the intensity varies along its two principal directions, and they read off the three cases cleanly: flat means both eigenvalues tiny ($M$ near zero, an error surface that is a flat dish — slide anywhere for free); edge means one large eigenvalue and one near zero ($M$ rank-deficient, a long trough — free to slide along it); corner means both eigenvalues large ($M$ well-conditioned, a steep bowl — every shift costs). Drawn as an error ellipse with axes $\lambda_1^{-1/2},\lambda_2^{-1/2}$, the partition is one picture: a slit for an edge, a long trough for nothing, a round-ish blob for a corner.

fig-structure-tensor-ellipse
Figure 7.3.1. One matrix, the flat/edge/corner partition. The second-moment matrix $M=\sum_W \left(\begin{smallmatrix}I_x^2 & I_xI_y\\ I_xI_y & I_y^2\end{smallmatrix}\right)$ drawn as an error ellipse (axes scaled by $\lambda_1^{-1/2},\lambda_2^{-1/2}$) over three patches. Flat (both eigenvalues tiny): a large, near-circular ellipse — the SSD error stays low for shifts in every direction, so the point is unlocalisable. Edge (one large eigenvalue, one near-zero): a long thin slit — error rises across the edge but stays flat along it (the aperture problem). Corner (both eigenvalues large): a small round-ish ellipse — every shift incurs cost, so the point is pinned. The eigenvalues of one $2\times2$ matrix sort every patch into these three.

Two responses turn the eigenvalues into a detector. Harris (Harris & Stephens 1988) avoids computing eigenvalues — expensive, in 1988 — and uses the cheap determinant-and-trace combination

$$ R=\det M-k\,(\operatorname{tr}M)^2=\lambda_1\lambda_2-k(\lambda_1+\lambda_2)^2, $$

with $k\approx 0.04$–$0.06$. $R$ is large and positive only when both eigenvalues are large (a corner), strongly negative on edges (one eigenvalue dominates), and near zero on flat regions. The pipeline is then mechanical: compute gradients, form the three products $I_x^2, I_y^2, I_xI_y$, Gaussian-smooth each, combine into $R$, threshold, and non-maximum-suppress so each corner fires exactly once (Figure 7.3.2). A pleasant property falls out for free: because $R$ is built from gradients it is invariant to an intensity offset ($I\to I+b$), and because the structure-tensor ellipse merely rotates with the image, $R$ is invariant to image rotation — only the absolute scaling of the threshold and, as we will see, the image scale give it trouble.

Shi–Tomasi (Shi & Tomasi 1994, "good features to track") takes the more direct route and scores a corner by the smaller eigenvalue itself,

$$ \min(\lambda_1,\lambda_2) > \tau : $$

a point is a good corner exactly when its weaker direction is still strong, i.e. both directions are well-constrained. This is the same criterion that makes a feature trackable — it is precisely the condition that $M$ be well-conditioned for the KLT motion solve in Feature tracking — which is why Shi & Tomasi derived "what is a good feature to detect" from "what is a good feature to track." Harris versus Shi–Tomasi is two readings of the same matrix: Shi–Tomasi is slightly more reliable, Harris slightly cheaper, and in practice they pick nearly the same points.

fig-harris-corners
Figure 7.3.2. Harris corners, from response to points. Left: an input photograph. Middle: the Harris response $R=\det M-k(\operatorname{tr}M)^2$ as a heatmap — hot at corners and texture junctions, strongly negative (cool) along edges, near zero on flat regions like open sky. Right: the corners surviving thresholding and non-maximum suppression, overlaid as points — they land on actual corners (window frames, tile junctions, rooflines meeting) and never on a single straight edge or a flat patch. The heatmap makes the eigenvalue story visible: corners are the isolated hot peaks the response picks out.

7.3.2 Invariance — surviving scale and rotation

Harris solves where to look within a single image, but a corner detected at one zoom or one orientation must be found, and described the same way, in an image where the object is bigger or rotated. Across far views this is the whole game: the two photos differ in scale (you stood closer, or zoomed) and in rotation (you tilted the camera, or the object turned). Plain Harris fails the scale test outright — blur an image down and a corner becomes a smooth bump that the detector reads as flat, or as an edge. Repeatability across far views therefore demands that detector and descriptor be covariant with scale and rotation: find the same physical point, and measure its neighbourhood in a frame that rotates and scales with the point itself.

Scale is SIFT's signature contribution (Lowe 2004). The idea is to give each keypoint its own characteristic scale, found automatically and consistently across images. Build a scale-space — the image repeatedly blurred by a Gaussian of growing $\sigma$, so that fine structure dissolves as you climb — and look for a function of $\sigma$ that peaks at the size of the structure under the point. A center-surround / band-pass response does exactly this: it is near zero when the window is much smaller or much larger than the feature and peaks when they match. SIFT computes it cheaply as the Difference-of-Gaussians,

$$ D(x,y,\sigma)=\big(G_{k\sigma}-G_{\sigma}\big)*I, $$

a close and inexpensive approximation to the scale-normalised Laplacian. A keypoint is then a point that is a local extremum of $D$ in all three coordinates jointly — a maximum or minimum over $(x,y)$ and over neighbouring scales $\sigma$ (Figure 7.3.3). The scale at which the extremum occurs is the feature's characteristic scale: describe the neighbourhood at that scale, and the descriptor no longer cares how zoomed-in the photo was. (In practice the scale-space is organised into octaves — halving the resolution each octave — and the raw extrema are refined to sub-pixel, sub-scale precision by fitting a quadratic, with weak or edge-like extrema discarded by a DoG-value threshold and a principal-curvature ratio test, the Harris idea reused to reject points strung out along an edge.)

fig-scale-space-dog
Figure 7.3.3. A keypoint is an extremum in space and scale. A Difference-of-Gaussian stack: the image is blurred by Gaussians of increasing $\sigma$ (climbing the page) and adjacent blur levels are subtracted to give a band-pass response $D(x,y,\sigma)=(G_{k\sigma}-G_\sigma)*I$ at each scale. A keypoint (circled) is a point that is a local maximum or minimum among its neighbours in $(x,y)$ and across the two adjacent scales — a peak in the full $(x,y,\sigma)$ volume. The scale at which the peak occurs sets the circle's size: that characteristic scale is what makes the later descriptor zoom-invariant.

Rotation is handled by an analogous canonical-frame trick. Around the keypoint, at its scale, build a histogram of local gradient orientations (weighted by gradient magnitude) and read off its peak — the dominant orientation. Then measure the descriptor relative to that orientation, as if the patch had been rotated to a canonical upright pose. Rotate the object in the world and its dominant orientation rotates with it, so the descriptor comes out unchanged. Illumination invariance is bought the same cheap way it was for Harris: use gradients, which are blind to an additive brightness offset, and normalise the final vector to remove a multiplicative gain. With characteristic scale and dominant orientation in hand, every keypoint comes with a stable local coordinate frame $(x,y,\sigma,\theta)$ — and the descriptor's only remaining job is to summarise what the neighbourhood looks like in that frame.

7.3.3 Describing a neighbourhood — SIFT and its zoo

The simplest possible descriptor is the raw patch — a small window of pixel values around the keypoint, matched by SSD or NCC. It is invariant to nothing on its own; bolted onto the canonical frame above (resample the window at the keypoint's scale, rotate it to the dominant orientation, normalise its intensities) it becomes serviceable, and this is essentially the Multiscale Oriented Patches descriptor. But raw patches are brittle: a one-pixel misregistration or a small non-rigid deformation changes every value, so the SSD jumps. The landmark improvement is to describe the patch by its gradients, histogrammed over cells — distinctive, yet tolerant of exactly those small shifts.

That is SIFT (Lowe 2004), the Scale-Invariant Feature Transform, and it is worth stating concretely (Figure 7.3.4). Take a $16\times16$ window around the keypoint, sampled at the feature's scale and rotated to its dominant orientation. Divide it into a $4\times4$ grid of cells. In each cell, accumulate the gradient orientations into an 8-bin histogram, each gradient weighted by its magnitude and by a Gaussian centred on the keypoint (so the window edges count less), and spread across bins by trilinear interpolation (so a gradient near a cell or bin boundary does not flip discontinuously). Concatenate the $4\times4$ cells $\times\,8$ orientations into a single 128-dimensional vector. Finally, $L_2$-normalise it — then clamp every component to a ceiling (0.2) and renormalise, which blunts the influence of a few very large gradients from, say, a strong specular edge. The genius of the design is the histogram: by binning gradients within a cell rather than recording their exact positions, SIFT tolerates small spatial shifts and deformations while remaining highly distinctive — the resolved tension between invariance and discriminability. Empirically it survives viewpoint changes up to roughly $60^\circ$ of out-of-plane rotation and dramatic illumination shifts, which is why it dominated matching, stitching, and recognition for a decade.

fig-sift-descriptor
Figure 7.3.4. Building the 128-D SIFT vector. Left: the gradient field in a window around the keypoint, rotated to the keypoint's dominant orientation so the descriptor is rotation-invariant, with a Gaussian weighting (faded toward the edges). Middle: the window split into a $4\times4$ grid of cells; within each cell the gradients are accumulated into an 8-bin orientation histogram (drawn as a small star of bars, weighted by gradient magnitude). Right: the $4\times4\times8=128$ histogram values concatenated into one vector, $L_2$-normalised with a clamp for contrast invariance. Histogramming gradients per cell is what buys tolerance to small shifts while staying distinctive.

SIFT is accurate but not cheap, and a zoo of successors trades accuracy for speed along a single axis — speed ↔ robustness ↔ invariance (Figure 7.3.5). SURF (Bay et al. 2006) approximates SIFT's Gaussian derivatives with box filters evaluated in $O(1)$ via the integral image, buying a large speed-up for a small robustness cost. FAST (Rosten & Drummond 2006) is a corner detector only — it tests pixels on a circle for a contiguous arc brighter or darker than the centre — extremely fast, with no descriptor of its own. The binary descriptors are the real-time workhorses: BRIEF (Calonder et al. 2010) encodes a patch as a string of bits, each one the answer to "is pixel $p$ brighter than pixel $q$?" for a fixed set of test pairs, and ORB (Rublee et al. 2011) wraps FAST detection plus a rotation-aware BRIEF into a complete, patent-free, SIFT-free pipeline; BRISK adds a scale-space and a structured sampling pattern. Binary descriptors are tiny and are matched not by Euclidean distance but by Hamming distance — a XOR and a popcount, almost free on modern hardware. At the other end, KAZE / AKAZE build their scale-space with nonlinear (edge-preserving) diffusion instead of Gaussian blur, respecting object boundaries for better localisation at higher cost. The through-line: SIFT remains the accuracy benchmark; binary descriptors are what you reach for when you need thousands of matches per frame on a phone.

fig-descriptor-zoo
Figure 7.3.5. The descriptor zoo, on the axes that matter. SIFT, SURF, ORB, and BRIEF placed on a speed ↔ robustness ↔ invariance layout. SIFT: a 128-D float vector — most robust and invariant, slowest. SURF: box-filter / integral-image approximation of SIFT — faster, slightly less robust. ORB / BRIEF: binary strings of pixel-pair intensity comparisons, matched by Hamming distance — tiny and blazing fast, ORB adding rotation and FAST detection for a full SIFT-free pipeline. Float-versus-binary is the headline split; each point on the diagram is a different bet on how much robustness to spend for speed.

7.3.4 Matching — nearest neighbour in descriptor space

With every keypoint reduced to a vector, matching is conceptually trivial: a keypoint in image A matches the keypoint in image B whose descriptor is nearest (Figure 7.3.6). The distance is Euclidean for float descriptors (SIFT, SURF) and Hamming for binary ones (ORB, BRIEF). Done naively this is exhaustive — compare every descriptor in A against every descriptor in B — which is fine for a single pair of images but becomes the bottleneck once you match against a large database (Photo Tourism reconstructs cities from tens of thousands of photos, each with hundreds of features). The standard remedy is approximate nearest-neighbour search — k-d trees with best-bin-first, or modern hashing — which trades a small chance of missing the true nearest neighbour for an enormous speed-up; the details live in Misc: fast matching.

The crucial caveat is that the raw nearest-neighbour matches are not ready to use. Repeated texture (a row of identical windows, a brick wall), occlusion, and ordinary descriptor collisions all manufacture matches that are confident but wrong — and a single wrong correspondence can wreck a least-squares fit of the transform you ultimately want. The fix is two-stage and is the subject of the next section: Lowe's ratio test prunes ambiguous matches before fitting (accept a match only if the nearest descriptor is decisively closer than the second nearest), and then RANSAC fits a geometric transform robustly despite whatever outliers survive. That is Robustness: the ratio test and RANSAC, which this section feeds — and from there the clean correspondences flow into panorama stitching (Automatic panorama stitching and feature matching) and 3-D reconstruction (3D and depth). A closing note on where this is heading: the entire hand-designed pipeline — detector, descriptor, matcher — is exactly the kind of three-stage system deep learning later swallows whole, replacing each stage with a learned operator (SuperPoint, SuperGlue, and the pointmap regressors of Deep learning approaches to sparse matching); but the problem it solves is the one stated here, and SIFT remains the baseline they are measured against.

fig-descriptor-matching
Figure 7.3.6. Nearest-neighbour matching, before and after the ratio test. Two views of the same scene with their keypoints, joined by lines to their nearest neighbour in descriptor space. Left (raw): many lines, a tangle of crossings — repeated texture and occlusion have produced confident-but-wrong matches alongside the good ones. Right (after the ratio test): the survivors, where the nearest descriptor was decisively closer than the second-nearest ($d_1/d_2$ small); the ambiguous matches on repeated structure are gone, leaving a clean, consistent set. The rejected matches are exactly the handoff to the robustness section that fits a transform to what remains.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson — the structure tensor, three jobs

The same $2\times2$ matrix $M=\sum_W \nabla I\,\nabla I^{\top}$ answers three questions across this part, and the unity is the point. Is this a good point to detect? — yes where $M$ has two large eigenvalues, a corner: that is Harris / Shi–Tomasi, here. Is this patch's motion solve well-conditioned? — the very same matrix, now required invertible: that is KLT (Feature tracking). What is the local flow constraint? — $M$ is the normal-equations matrix $A^\top A$: that is Lucas–Kanade (Optical flow). Detect with it, track with it, solve flow with it — one matrix, read three ways. Corners are special for a deep reason: the same both-directions-vary condition that localises a point is also what makes its motion recoverable. So "good to detect" and "good to track" are not two coincidentally-aligned criteria; they are one statement about $M$.