💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

7.5 Deep learning approaches to sparse matching

The classical pipeline — a hand-designed detector, a hand-designed descriptor, a nearest-neighbour matcher — is exactly the kind of three-stage hand-built system that deep learning likes to swallow whole. Each stage was an act of human intuition: Harris decided a corner was where the structure tensor had two large eigenvalues; SIFT decided a neighbourhood was best summarised by gradient-orientation histograms; the matcher decided two points correspond when their descriptors are nearest in Euclidean distance. Every one of those decisions is a candidate for the L8 move — replace the hand-designed operator with one trained from data — and the history of learned matching is the story of that move marching down the pipeline.

It arrives in three waves, and the right way to read them is as a steadily larger bite. First, learned local features that play SIFT's role: a single network detects and describes, trained for repeatability and matchability rather than designed by intuition (LIFT, SuperPoint, D2-Net, R2D2). Second, learned matching that reasons about the whole set of features at once instead of deciding each point independently: SuperGlue and the faster LightGlue cast matching as an attention-based assignment, while detector-free matchers like LoFTR skip keypoints entirely. Third, the 3-D-aware turn, where DUSt3R and MASt3R regress dense pointmaps straight from an image pair, so correspondence, camera pose, and depth all fall out together and matching stops being a separate stage at all. Front end, then middle, then the whole thing folded into geometry. The task is fixed; the pipeline is progressively learned.

💡 Big lesson (L8, recurrence)

A learned operator swaps a hand-designed prior for one learned from data — and here it walks the whole pipeline. Classical sparse matching is three hand-built operators in series: a detector, a descriptor, a matcher (plus a robust back end). Wave one neuralizes the front end — keypoints and descriptors are now network outputs, trained for the very property (repeatability, matchability) the hand designs only approximated. Wave two neuralizes the middle — the matcher becomes a network that sees all the features at once and respects their mutual geometric consistency, instead of a per-point nearest-neighbour lookup. Wave three dissolves the pipeline into geometry — predict the scene's 3-D points per pixel and the correspondence is whatever pixels land on the same point. The deeper you go, the more of the classical chain is absorbed; what survives untouched is the problem Lowe and Harris posed. (L8 first appears in Deep learning; this is a recurrence, the same move that powers RAFT in Optical flow and the learned point trackers in Feature tracking.)

7.5.1 Learned local features — SIFT's job, done by a network

The first wave replaces SIFT's two separate stages — the Difference-of-Gaussian detector and the gradient-histogram descriptor — with a single convolutional network that does both. The architecture is now standard: a shared encoder processes the image once, then splits into two heads. A detector head emits a keypoint score map (where the keypoints are), and a descriptor head emits a dense descriptor field (one descriptor vector per location). The two heads are trained together, so the points the detector likes are exactly the points the descriptor can match well — a cooperation the hand-built pipeline never had, since SIFT's detector and descriptor were designed independently. LIFT (Yi et al., 2016) first showed the whole detect-orient-describe pipeline could be learned end-to-end; SuperPoint (DeTone et al., 2018) made it fast and self-supervised; D2-Net (Dusmanu et al., 2019) and R2D2 (Revaud et al., 2019) refined what "a good keypoint" should mean (Figure 7.5.1).

The obvious difficulty is supervision: there is no ground-truth "corner" to train against. Nobody can label, pixel by pixel, which points are truly repeatable across viewpoint and lighting, so a network has nothing to imitate. SuperPoint's answer — homographic adaptation — is clever enough to have become the template. Take an image, warp it by many known random homographies, run the current detector on each warped copy, then un-warp the detections back to the original frame and demand that they agree. A point that fires consistently across many synthetic warps is, by construction, repeatable under those transforms; a point that fires in only one warp is noise. Repeatability-under-known-transforms becomes the training signal, manufactured from the warps themselves, with no human annotation. R2D2 pushes the idea further by training the descriptor for reliability alongside the detector's repeatability — explicitly learning that a point can be easy to find yet hard to match (a repeated texture), and that those points should be down-weighted.

What is striking is how little the interface changes. The task is identical to SIFT's — emit repeatable, matchable points — and only the operators move from hand-design to data: what counts as a keypoint, and what the descriptor encodes. Because the output is still a set of (location, descriptor) pairs, these networks are drop-in front ends: their points feed the very same ratio-test-then-RANSAC back end from Robustness: the ratio test and RANSAC, with no change downstream. That drop-in property is why they spread so fast — you could swap SuperPoint for SIFT in an existing structure-from-motion pipeline and keep everything else.

fig-learned-feature-pipeline
Figure 7.5.1. One network does SIFT's two jobs. A shared CNN encoder ingests the image once and splits into two heads: a detector head producing a keypoint score map (bright where a repeatable point should fire) and a descriptor head producing a dense descriptor field (a vector per location), the two trained to cooperate. The inset shows the self-supervision trick — homographic adaptation: warp the image by a known homography, detect on the warped copy, un-warp the detections back, and demand they agree with the detections on the original; repeatability under known warps is the manufactured training label.

7.5.2 Learned matching — reason about the whole set at once

Even with learned features, the matching step in the classical pipeline is embarrassingly local: each keypoint in image A is matched to the keypoint in image B whose descriptor is nearest, $\arg\min_j \lVert d_i - d_j \rVert$, decided independently for every point. This is blind to a fact the geometry knows perfectly well — that the matches must be mutually consistent. The true correspondences obey a single coherent transformation; their connecting lines barely cross; one point can match at most one other. Nearest-neighbour matching throws all of that away and decides each point in isolation, then leans entirely on RANSAC downstream to clean up the mess. The second wave asks the matcher itself to reason about the whole set at once.

SuperGlue (Sarlin et al., 2020) does this with attention. Treat the two keypoint sets as the nodes of a graph and alternate two kinds of attention: self-attention within an image (so each point is informed by the other points in its own image — the local geometric context) and cross-attention between images (so each point is informed by the candidate matches in the other image). Stacked several layers deep, this lets every match be decided in light of all the others — a point near an unambiguous corner inherits confidence from its neighbour, an isolated point on repeated texture is rightly distrusted. The network outputs a learned score matrix $S$, where $S_{ij}$ scores keypoint $i$ in A against keypoint $j$ in B, and the final assignment is found by solving a differentiable optimal-transport problem: find the soft assignment $P$ that

$$ \max_{P} \;\sum_{i,j} P_{ij}\,S_{ij}, $$

subject to $P$ being (doubly stochastic) — each row and column summing toward one — so that the solution respects the one-to-one structure of correspondence. The constraint is enforced by the Sinkhorn algorithm, which alternately normalises the rows and columns of $\exp(S)$ until both marginals are satisfied; because each step is differentiable, gradients flow back through the whole matcher and it can be trained end-to-end. Crucially, the score matrix is augmented with a dustbin — an extra row and column that absorb keypoints with no valid match (occluded points, points outside the overlap), so the assignment can confidently decline to match rather than being forced to pair every point. Contrast this with the classical rule: where nearest-neighbour-plus-ratio-test decides $\arg\min_j \lVert d_i - d_j \rVert$ one point at a time, SuperGlue solves a single global assignment over the whole set (Figure 7.5.2).

The cost of that global reasoning is speed — the attention layers are quadratic in the number of keypoints, which is painful on a phone or a robot. LightGlue (Lindenberger et al., 2023) keeps SuperGlue's quality but makes it adaptive: it predicts after each layer whether the match is already confident enough to stop, and it prunes keypoints that are clearly unmatchable early, so easy image pairs finish in a few layers and only the hard ones pay full price. The result is a matcher that runs at interactive rates on-device while inheriting SuperGlue's global-consistency benefit.

A more radical move drops the detector entirely. LoFTR (Sun et al., 2021) is detector-free: instead of finding keypoints and then matching them, it matches densely and coarse-to-fine, using transformers to establish correspondences on a coarse grid and then refining each to sub-pixel accuracy. The payoff is dramatic on the surfaces where detectors fail worst — low-texture walls, repetitive patterns, weakly-textured indoor scenes — where Harris or SuperPoint fire almost nowhere, leaving classical matching with too few points to fit a geometry. By never committing to a sparse keypoint set, LoFTR can produce correspondences where there are no distinctive corners at all. The detector, assumed indispensable since Harris, turns out to be optional.

fig-superglue-attention
Figure 7.5.2. Global matching versus independent nearest-neighbours. The two keypoint sets are drawn as graph nodes; self-attention edges connect points within each image and cross-attention edges connect points across the pair, so every match is decided in the context of all the others. The output is a soft assignment matrix $P$ (one keypoint set along each axis), with an extra dustbin row and column that absorb unmatched points; the bright cells are the recovered one-to-one correspondences. Beside it, the classical baseline: each query keypoint matched to its single nearest descriptor, decided alone, with no cross-talk and no way to abstain.

7.5.3 The 3-D-aware turn — pointmaps subsume matching

The third wave stops treating matching as a problem in its own right. The observation is that in nearly every application, sparse matching is only a means to an end — you match features in order to estimate the camera pose, or to triangulate 3-D structure, or to build a panorama. The classical chain is long and brittle: detect, describe, match, ratio-test, RANSAC the fundamental or essential matrix, recover relative pose, triangulate. Each stage can fail, and errors compound. What if you simply asked a network for the answer — the 3-D geometry — and let matching be whatever falls out of it?

That is what DUSt3R (Wang et al., 2024) does. From an image pair, it regresses two dense pointmaps: for every pixel in each image, a 3-D point, with both pointmaps expressed in one common coordinate frame (the frame of the first camera). That single output is astonishingly rich. Correspondence becomes trivial — two pixels correspond when their predicted 3-D points coincide, so matching is read off the aligned pointmaps with no keypoint stage. Relative pose falls out, because the two pointmaps already live in a shared frame, so the transformation between cameras is implicit in the prediction. Depth falls out, since a pointmap is a per-pixel 3-D position. The long classical chain — detector → descriptor → matcher → robust pose → triangulation — collapses into a single feed-forward pass that predicts scene geometry directly (Figure 7.5.3). MASt3R (Leroy et al., 2024) adds a matching head on top of the same pointmap backbone, producing sharper, more accurate correspondences for tasks (like large-scale localization) that still want explicit, reliable matches — recovering the precision of dedicated matching while keeping the geometric grounding.

The conceptual shift is worth dwelling on. Matching is no longer a stage that produces an input to geometry; it is a by-product of predicting geometry. The network is not asked "which point goes with which?" but "where is everything in 3-D?", and correspondence is the answer's shadow. This is the bridge to 3D and depth — pointmap regression is precisely the feed-forward reconstruction primitive that chapter builds on — and it is, as of this writing, the current frontier of the field. It also closes the throughline cleanly: the front end was neuralized first (learned features), then the middle (learned matching), and now the whole pipeline is folded into the geometry it was always serving. The general "neuralize a hand-built operator" pattern that drives all three waves lives in Deep learning; here we have watched it consume an entire classical pipeline, stage by stage, until the stages themselves disappeared.

fig-dust3r-pointmap
Figure 7.5.3. Matching as a by-product of geometry. An image pair (two views of one scene) feeds a network that regresses two dense pointmaps — a 3-D point per pixel for each image — both expressed in one common coordinate frame. From the aligned pointmaps, three things are simply read off: correspondence (pixels whose 3-D points coincide), relative camera pose (implicit in the shared frame), and per-pixel depth. No detector, no descriptor, no explicit matching or triangulation stage — the classical chain has collapsed into one feed-forward prediction of scene geometry.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (L8, recurrence)

A learned operator swaps a hand-designed prior for one learned from data — and here it walks the whole pipeline. Classical sparse matching is three hand-built operators in series: a detector, a descriptor, a matcher (plus a robust back end). Wave one neuralizes the front end — keypoints and descriptors are now network outputs, trained for the very property (repeatability, matchability) the hand designs only approximated. Wave two neuralizes the middle — the matcher becomes a network that sees all the features at once and respects their mutual geometric consistency, instead of a per-point nearest-neighbour lookup. Wave three dissolves the pipeline into geometry — predict the scene's 3-D points per pixel and the correspondence is whatever pixels land on the same point. The deeper you go, the more of the classical chain is absorbed; what survives untouched is the problem Lowe and Harris posed. (L8 first appears in Deep learning; this is a recurrence, the same move that powers RAFT in Optical flow and the learned point trackers in Feature tracking.)