7.0 MATCHING PIXELS ACROSS SPACE AND TIME

Pixels move. Photograph a façade from two angles and every brick lands somewhere new; shoot thirty frames a second of a running child and every patch of her drifts a little from one frame to the next; point two cameras at the same scene and a point splits into two image locations. Different situations — two views, two times — and the same underlying fact: the image content has been displaced, and we want to recover that displacement. This is the exact inverse of warping. The previous part, Warping and morphing, was given a coordinate map and moved pixels along it; this part is given the moved pixels and must find the map. Transport is the plumbing; matching is the science — and the hard, ill-posed half of the whole correspondence-then-transport story.

The payoff for recovering correspondence is enormous, and worth seeing before the machinery. Once two images are in correspondence, the simplest possible operation — averaging them — becomes powerful: aligned frames averaged together cancel noise (the root of burst photography and multi-frame super-resolution); aligned views merged together stitch a panorama; aligned exposures combine into HDR. Misalign them and the same average ghosts into a double exposure. So almost everything downstream is align, then combine, and this part is the "align" — with the recurring twist that, having found a set of matches, you usually fit a spatial transformation to them (a translation, an affine, a homography, or a full dense field), choosing the model the scene allows and rejecting the matches that disagree.
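The align-then-combine claim is easy to check numerically. The sketch below is an illustration under stated assumptions, not a burst-photography pipeline: the scene, the integer shifts, and the noise level are all made up, and the shifts are handed to the code rather than estimated (estimating them is the job of this part). Averaging the aligned frames should cut the noise by roughly the square root of the frame count, while averaging the same frames unaligned leaves a ghosted result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clean scene: a bright square on a dark background.
clean = np.zeros((64, 64))
clean[24:40, 24:40] = 1.0

def noisy_burst(shifts, sigma=0.2):
    """One noisy frame per shift; each frame shows the scene moved by (dy, dx)."""
    return [np.roll(clean, s, axis=(0, 1)) + rng.normal(0.0, sigma, clean.shape)
            for s in shifts]

def rms_error(img):
    """Root-mean-square difference from the clean scene."""
    return float(np.sqrt(np.mean((img - clean) ** 2)))

# Assumed camera shake between frames, in whole pixels.
shifts = [(0, 0), (3, -2), (-4, 5), (6, 1), (-1, -6), (2, 4), (-5, 2), (4, -3)]
frames = noisy_burst(shifts)

# Align, then combine: undo each (known) shift before averaging.
aligned = np.mean([np.roll(f, (-dy, -dx), axis=(0, 1))
                   for f, (dy, dx) in zip(frames, shifts)], axis=0)

# Combine without aligning: the same average, now a multiple exposure.
ghosted = np.mean(frames, axis=0)

single_err = rms_error(frames[0])
aligned_err = rms_error(aligned)
ghosted_err = rms_error(ghosted)
print(f"single frame {single_err:.3f}  aligned mean {aligned_err:.3f}  "
      f"unaligned mean {ghosted_err:.3f}")
```

With eight frames the aligned error should land near the single-frame error divided by $\sqrt{8}$; the unaligned average keeps the same residual noise but adds the smeared copies of the square on top.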

Correspondence comes in two flavours, and the split organises the part. Sparse matching finds a few reliable, distinctive points — corners and keypoints — and matches those: cheap, robust, and enough for stitching, pose, and tracking. Dense matching estimates a displacement at every pixel — optical flow between frames, disparity between stereo views — richer but far more ill-posed. And the matching happens across space (different viewpoints: panoramas, 3-D reconstruction) or across time (consecutive video frames). The same vocabulary covers all four quadrants.

Whichever flavour, one difficulty is inescapable: matching is ill-posed. A smooth, textureless wall offers nothing to lock onto; a straight edge suffers the aperture problem (peer at a moving edge through a straw and you cannot tell sideways drift from along-the-edge drift); occlusion means some pixels have no honest match at all; and large, fast motion breaks the gentle assumptions estimators lean on. Every method here is, at bottom, a way of adding an assumption — distinctive corners, local windows, global smoothness, a learned prior — to make an under-determined problem answerable, and a way of staying robust when most of the answers are wrong.
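The wall, the edge, and the corner can be told apart with the structure tensor that later chapters build on. The sketch below is a minimal illustration on three synthetic patches (the patch size, the intensities, and the helper name `structure_tensor_eigs` are assumptions made for the example): it sums the outer products of the image gradient over a patch and looks at the two eigenvalues of the resulting 2×2 matrix.

```python
import numpy as np

def structure_tensor_eigs(patch):
    """Eigenvalues (ascending) of the 2x2 structure tensor summed over a patch."""
    gy, gx = np.gradient(patch.astype(float))   # derivatives along rows, columns
    J = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(J)

n = 15
flat = np.full((n, n), 0.5)          # textureless wall: no gradient at all

edge = np.zeros((n, n))
edge[:, n // 2:] = 1.0               # vertical edge: gradient in x only

corner = np.zeros((n, n))
corner[n // 2:, n // 2:] = 1.0       # corner: gradients in both x and y

eigs = {name: structure_tensor_eigs(p)
        for name, p in [("flat", flat), ("edge", edge), ("corner", corner)]}
for name, (lo, hi) in eigs.items():
    print(f"{name:6s} smaller eigenvalue {lo:.2f}  larger eigenvalue {hi:.2f}")
```

Both eigenvalues vanish on the flat patch (nothing to lock onto), one vanishes on the edge (motion along the edge is invisible, which is the aperture problem), and both are well away from zero on the corner, which is why corners make trustworthy anchors.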

The roadmap runs sparse-to-dense, classic-to-learned. Sparse matching detects repeatable keypoints (from Harris and Shi–Tomasi corners, the structure tensor again, through SIFT and its zoo of faster cousins) and describes them so they match across changes in scale, rotation, and lighting. Feature tracking is the temporal sparse cousin: the KLT tracker follows good corners through a clip, and they are trackable for the same structure-tensor reason that makes them good keypoints. The chapter on robustness (the ratio test and RANSAC) turns noisy matches into a usable transform: the second-nearest-neighbour ratio test prunes ambiguous matches, and RANSAC fits a model to random minimal samples and keeps the one with the most inliers. The chapter on deep learning approaches to sparse matching replaces the hand-designed detector, descriptor, and matcher with learned ones (SuperPoint/SuperGlue, LoFTR) and with 3-D-aware pointmap regressors (DUSt3R, MASt3R). The miscellany chapter on fast matching is the speed story: approximate nearest neighbours, PatchMatch, and Andrew Adams's projection-based high-dimensional matching.
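The ratio test and RANSAC described above can be sketched in a few lines. Everything in the example is synthetic and assumed for illustration: random 32-dimensional vectors stand in for descriptors, the motion model is a pure translation (so one match is a minimal sample), and the thresholds `ratio=0.8` and `tol=2.0` are plausible defaults rather than values taken from this text.

```python
import numpy as np

rng = np.random.default_rng(1)

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Keep a match only if the best neighbour clearly beats the second best."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    rows = np.arange(len(desc_a))
    best, second = order[:, 0], order[:, 1]
    keep = d[rows, best] < ratio * d[rows, second]
    return np.stack([rows[keep], best[keep]], axis=1)

def ransac_translation(src, dst, n_iters=200, tol=2.0):
    """Fit dst = src + t robustly; a single match is a minimal sample."""
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        i = rng.integers(len(src))
        t = dst[i] - src[i]                               # hypothesis from one match
        inliers = np.linalg.norm(src + t - dst, axis=1) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on every inlier of the winning hypothesis.
    t = np.mean(dst[best_inliers] - src[best_inliers], axis=0)
    return t, best_inliers

# Synthetic keypoints: image B is image A translated by true_t, plus noise.
n, n_dup = 80, 10
true_t = np.array([12.0, -7.0])
pts_a = rng.uniform(0, 200, (n, 2))
desc_a = rng.normal(size=(n, 32))
pts_b = pts_a + true_t + rng.normal(0, 0.3, (n, 2))
desc_b = desc_a + rng.normal(0, 0.1, (n, 32))

# Repeated texture: near-copies of the first n_dup descriptors at unrelated places.
pts_b = np.vstack([pts_b, rng.uniform(0, 200, (n_dup, 2))])
desc_b = np.vstack([desc_b, desc_a[:n_dup] + rng.normal(0, 0.1, (n_dup, 32))])

# Gross errors: a quarter of the true partners sit at the wrong location.
bad = rng.choice(n, n // 4, replace=False)
pts_b[bad] = rng.uniform(0, 200, (len(bad), 2))

matches = ratio_test_matches(desc_a, desc_b)
src, dst = pts_a[matches[:, 0]], pts_b[matches[:, 1]]
t_est, inliers = ransac_translation(src, dst)
print(f"{len(matches)} of {n} matches survive the ratio test, "
      f"{inliers.sum()} are RANSAC inliers, t = {t_est.round(2)}")
```

The two filters catch different failures. The ratio test drops matches whose descriptor has a near-twin (the repeated texture), and RANSAC drops matches that look fine in descriptor space but disagree with the consensus motion. For an affine or a homography the same loop applies with a larger minimal sample (three or four matches).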

Then the dense half. Optical flow estimates per-pixel motion from brightness constancy ($I_x u + I_y v + I_t = 0$), meets the aperture problem (one equation per pixel, two unknowns $u$ and $v$), and answers it in two classic ways: Lucas–Kanade pools a local window (inverting the structure tensor $A^\top A$), while Horn–Schunck imposes global smoothness. It then scales up with pyramids and modern occlusion- and edge-aware tricks. The chapter on deep learning approaches to optical flow shows the dominant modern recipe: take a classical iterative method and unroll it into a trained network (FlowNet, PWC-Net, RAFT).
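The Lucas–Kanade answer fits in one least-squares solve. The sketch below is a single-window, single-scale illustration under assumptions of its own: a smooth synthetic scene (the `scene` function is invented for the example), a small sub-pixel displacement, and no pyramid. Each pixel contributes one row $[I_x \; I_y]$ to $A$ and one entry $-I_t$ to $b$, and the window's flow solves $(A^\top A)\,[u, v]^\top = A^\top b$.

```python
import numpy as np

def lucas_kanade_window(frame0, frame1):
    """Least-squares flow (u, v) for one window, assumed to move as a unit."""
    Iy, Ix = np.gradient(frame0)          # spatial derivatives (rows = y, cols = x)
    It = frame1 - frame0                  # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    # Normal equations; A^T A is the structure tensor of the window.
    return np.linalg.solve(A.T @ A, A.T @ b)

def scene(x, y):
    """Hypothetical smooth texture with variation in both directions."""
    return np.sin(0.3 * x) + np.cos(0.25 * y) + 0.5 * np.sin(0.2 * (x + y))

y, x = np.mgrid[0:32, 0:32].astype(float)
true_u, true_v = 0.4, -0.25               # assumed motion, in pixels per frame

frame0 = scene(x, y)
frame1 = scene(x - true_u, y - true_v)    # same content, displaced by (u, v)

u, v = lucas_kanade_window(frame0, frame1)
print(f"estimated flow ({u:.3f}, {v:.3f}), true flow ({true_u}, {true_v})")
```

The solve only works because this window has gradients in both directions, so $A^\top A$ is invertible. On an edge-only or flat window the matrix is singular or nearly so, which is the aperture problem showing up as a numerical failure. The linearisation also assumes small motion, which is why practical versions iterate and work coarse-to-fine on a pyramid.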

Hold on to one sentence and the chapters stop looking like a grab-bag of corners and flows: recover where each pixel went (sparsely or densely, across space or time), then fit a transform you trust to the matches. The maps you recover here are applied by Warping and morphing; their temporal applications follow in Video, and matching across many views is what builds structure in 3D and depth.

