💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

7.7 Misc: fast matching

Once you have descriptors, the cost is dominated not by computing them but by the nearest-neighbour search in a high-dimensional space — and exact nearest neighbour is hopeless at scale. A 128-dimensional SIFT vector lives exactly where the curse of dimensionality bites hardest: the kd-tree that makes nearest-neighbour search logarithmic in two or three dimensions degenerates past a few dozen, until it is no faster than checking every point. So the toolkit is approximate nearest neighbour and clever propagation. For sparse descriptors there are three workhorses — randomized kd-trees and FLANN, locality-sensitive hashing, and product quantization. For dense patch correspondence, PatchMatch abandons trees entirely and rides spatial coherence. And underneath both sits Andrew Adams's random-projection lineage, the engine that — not coincidentally — also accelerates the smoothing filters of Denoising. Three sections, one repeated lesson: stop insisting on the exact answer, and stop treating every query as if it were the first.

💡 Big lesson (fast matching)

At scale, good-enough-and-fast beats optimal-and-slow; and propagation beats independent search. Two morals braided together. The first is a concession: in high dimensions the exact nearest neighbour costs as much as brute force, so every practical matcher trades a little recall for orders of magnitude of speed — and the operating point on that trade-off is a tuning knob, not a fixed property of the algorithm. The second is an opportunity: when the thing you are matching is spatially coherent — neighbouring pixels move together, neighbouring patches match to neighbouring patches — you should not search for each match from scratch. Let a good answer flood outward to its neighbours, who almost certainly share it. The deepest version of the lesson is the unification at the end: the high-dimensional nearest-neighbour engine that finds correspondences is the engine that runs the bilateral filter and NL-means — matching and filtering are the same search, and an accelerator for one is an accelerator for the other.

7.7.1 Approximate nearest neighbours for sparse descriptors

Start with the sparse case: you have two sets of descriptors — say a few thousand SIFT vectors per image, or a database of a billion — and for each query vector you want the closest one in Euclidean (or Hamming) distance. The textbook data structure is the kd-tree, which recursively splits the point set on alternating coordinate axes and, at query time, descends to the leaf containing the query and then backtracks into neighbouring cells that might hold something closer. In low dimensions this backtracking is bounded and the search is genuinely logarithmic. In high dimensions it is a catastrophe: the query's true nearest neighbour can sit just across almost any of the splitting planes, so the tree must backtrack through exponentially many cells. By the time you reach the 128 dimensions of SIFT, the kd-tree has quietly become a slow brute-force scan (Figure 7.7.1). This is the curse of dimensionality, and it is the wall every method here is built to get around — not by searching faster, but by searching less, and accepting an approximate nearest neighbour in exchange.

Three families do the work. The first is randomized kd-trees, productized as FLANN ([@muja-lowe-2009] — "Fast Library for Approximate Nearest Neighbors", and note whose name is on it: Lowe, the author of SIFT, building the search his own descriptors needed). The trick is to build several kd-trees, each split on randomly rotated axes, and at query time search all of them but only to a bounded number of leaves — you stop early, before backtracking explodes, and take the best candidate found across the trees. You give up the guarantee of finding the true nearest neighbour; you gain a search that stays fast in high dimensions, with recall you dial in by how many leaves you allow. The second is locality-sensitive hashing (LSH): design a hash function with the property that nearby points are likely to collide into the same bucket, then, to answer a query, hash it and scan only the handful of points in its bucket. Where a hash table normally scatters similar keys, LSH deliberately clusters them, turning "find the nearest point" into "look in the right drawer." The third, and the one that scales to billions, is product quantization (PQ) ([@jegou-douze-schmid-2011]): split each vector into chunks, quantize each chunk against its own small codebook, and store every vector as a short tuple of codebook indices. A billion 128-dimensional floats will not fit in memory; a billion 16-byte PQ codes will — and because distances decompose chunk-by-chunk, comparing a query against a code becomes a few table look-ups instead of a 128-term dot product. This is the lineage that became the modern vector database — FAISS, ScaNN — the productized form of this very chapter.

The point of all three is the trade-off itself, not any single setting. Each knob trades recall (did you find the true nearest neighbour?) against speed and memory, and the right operating point is an application decision, not an algorithm one. A one-off structure-from-motion reconstruction run overnight can afford near-exact search; a phone retrieving an image against a million-item index, in the time it takes to lift it, cannot, and will happily accept ninety-percent recall for a hundred-fold speed-up. When a system today says it does "vector search," this is what it is doing.

fig-ann-tradeoff
Figure 7.7.1. Why exact nearest neighbour fails in high dimensions, and what approximate search buys. Left: a 2-D sketch — a query point and its true nearest neighbour, with the kd-tree cells drawn in. An exact search must backtrack across many cells because the true neighbour can lie just across any splitting plane, and the number of cells to check explodes with dimension; an approximate method checks only a few buckets or leaves and returns a point that is near-est, not provably nearest. Right: the accuracy-versus-speed curve this implies — exact search pinned at one extreme (perfect recall, ruinous cost), with LSH / PQ / randomized kd-trees strung along the trade-off, each dialling recall down for speed and memory. The figure's whole message is that the curve, not any single point on it, is the object of study.

7.7.2 PatchMatch — randomized dense correspondence by propagation

The approximate-nearest-neighbour toolbox above treats every query independently — each descriptor goes off and searches the tree alone. For dense correspondence that independence is wasteful, and PatchMatch ([@barnes-shechtman-finkelstein-goldman-2009]) is the algorithm that throws it away. The setting is a nearest-neighbour field (NNF): for every pixel in an image, an offset pointing to the most similar patch elsewhere (in the same image, for inpainting and retargeting, or in a second image, for correspondence). Computed naively, that is one tree search per pixel — millions of them. PatchMatch computes it with no tree at all, and converges astonishingly fast, by exploiting a single observation: the answer is spatially coherent. Neighbouring pixels almost always have similar offsets, because real images are made of coherent regions that move and repeat as wholes. So a neighbour's match is not just a hint — it is, with high probability, your match shifted by one pixel.

The algorithm is three steps, and the middle one is the whole idea. Random initialization: give every pixel a random offset. This sounds hopeless, but in an image of a million pixels, by sheer luck some of those random offsets are accidentally excellent — a scattering of correct matches hidden in the confetti. Propagation: sweep across the image, and at each pixel test the offsets its already-visited neighbours are using; if a neighbour's offset (applied here) beats the current one, adopt it. A single accidentally-correct offset thus floods outward, snapping whole coherent regions into place in one pass — and alternating the sweep direction each iteration lets good matches propagate up-and-left as well as down-and-right. Random search: to escape local optima, test a few random offsets in a window around the current best, with the window radius shrinking geometrically — large jumps first, to find a better basin, then ever-finer refinement within it. Two passes of propagate-and-search are typically enough to turn a field of random confetti into a clean, coherent correspondence — fast enough that it powered interactive image inpainting, retargeting, and morphing, the applications taken up in depth in Patch match.

The lesson generalizes well past patches. Propagation beats independent search whenever the answer is spatially coherent — which is precisely why coarse-to-fine pyramids work (a coarse estimate propagates down in scale) and why optical-flow solvers spread motion across smooth regions. PatchMatch is the purest statement of the principle: do not search a million times: search a few times, get lucky somewhere, and let coherence carry the luck everywhere else (Figure 7.7.2).

fig-patchmatch-propagation
Figure 7.7.2. PatchMatch in three steps, from confetti to coherent. The nearest-neighbour field shown after each stage. (1) Random init: every pixel gets a random offset — the field is noise, but in a million pixels a few offsets are accidentally correct (the seeds). (2) Propagation: a sweep passes each good offset to its neighbours; the accidentally-correct seeds flood outward and big coherent regions snap into place at once. (3) Random search: shrinking-radius random jumps around each current best escape local optima and refine the remaining errors. The arc across the three panels — confetti → coherent regions → clean field, in a couple of iterations — is the visual argument that exploiting coherence makes a per-pixel tree search unnecessary.

7.7.3 Fast high-dimensional matching by random projection

The last thread is the most quietly influential, and it belongs to Andrew Adams. The problem is the same curse of dimensionality, attacked from a different angle: rather than partition the space with axis-aligned splits (the kd-tree's downfall), project the high-dimensional points onto a few random directions and place them on a sparse lattice, so that nearby points land in the same or adjacent lattice cells and can be matched by splatting values onto cells and gathering them back. The cost grows only near-linearly with dimension instead of exponentially — exactly the property the kd-tree lacks. Two papers define the lineage: Gaussian kd-trees ([@adams-gelfand-dolson-levoy-2009]), which randomize the tree's splits and sample so that high-dimensional queries stay cheap, and the permutohedral lattice ([@adams-baek-davis-2010]), which replaces the rectangular grid with the lattice that tiles high-dimensional space with the fewest vertices per cell — the natural grid for splat-blur-slice in many dimensions. Both trade exactness for a cost that barely grows with dimension, which is the only thing that matters once $d$ is large.

Here is the unification worth flagging, and it is the deepest of the chapter's lessons: the very same high-dimensional nearest-neighbour engine accelerates the bilateral filter and NL-means denoising. This is not an analogy. The bilateral filter averages each pixel with others that are nearby both in space and in colour — that is, nearby in a five-dimensional space (two spatial, three colour). NL-means averages each patch with others whose appearance is similar — nearness in a high-dimensional patch space. Both are, structurally, "for each point, find and average its high-dimensional neighbours" — which is exactly the matching problem. So Adams's lattice, built to find correspondences fast, runs an edge-aware smoothing filter fast by the same code path: project, splat onto the lattice, blur on the lattice, slice back. Matching and edge-aware filtering are revealed to be one high-dimensional search wearing two hats, and an accelerator for either is an accelerator for both (Denoising).

That coincidence is the reason this miscellaneous chapter earns its place beside the rest of the part. The earlier sections asked what makes a good match — a corner, a descriptor, a brightness-constant patch. This one asks how to find it fast enough to matter, and the answer — approximate it, propagate it, project it — turns out to be shared machinery that reaches well beyond matching, into the filtering and denoising chapters that follow.


Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (fast matching)

At scale, good-enough-and-fast beats optimal-and-slow; and propagation beats independent search. Two morals braided together. The first is a concession: in high dimensions the exact nearest neighbour costs as much as brute force, so every practical matcher trades a little recall for orders of magnitude of speed — and the operating point on that trade-off is a tuning knob, not a fixed property of the algorithm. The second is an opportunity: when the thing you are matching is spatially coherent — neighbouring pixels move together, neighbouring patches match to neighbouring patches — you should not search for each match from scratch. Let a good answer flood outward to its neighbours, who almost certainly share it. The deepest version of the lesson is the unification at the end: the high-dimensional nearest-neighbour engine that finds correspondences is the engine that runs the bilateral filter and NL-means — matching and filtering are the same search, and an accelerator for one is an accelerator for the other.