💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 2 big lessons ↓

7.2 Feature tracking

The previous section estimated a dense flow field — a motion vector at every pixel — between one pair of frames. This section does the complementary thing: follow a sparse handful of distinctive points across many frames. It is the same physics — brightness constancy, the aperture problem — but restricted to the places where the math is well-behaved, and run forward in time. The payoff of that restriction is the spine of the whole section: the matrix that says a patch is a good corner to detect is the same matrix that must be invertible to track it. Detect with it (Harris), track with it (KLT), refuse to track where it is singular (Shi–Tomasi). One $2\times2$ matrix, three jobs.

💡 Big lesson (L17, recurrence)

Tracking is a sparse correspondence. Where dense flow asks "where did every pixel go?", tracking asks "where did these few physical points go?" — and threads the answer across a long sequence. It is the Lagrangian instance of correspondence (follow the particle, Motion blur and temporal sampling) to dense flow's Eulerian leaning (watch the fixed pixel). The sparse trajectories it produces are the raw input to everything downstream that needs the same physical points over time: video stabilization, structure-from-motion, simultaneous localization and mapping (SLAM), and bundle adjustment. Hold this placement in view — the rest of the section is the mechanics of doing the sparse thing well, and the rest of the part is what those sparse tracks feed.

💡 Big lesson (L8, recurrence)

A learned operator swaps a hand-designed prior for one learned from data. Classical KLT tracks a hand-built brightness patch by a hand-derived least-squares update. The modern point trackers at the end of this section keep the exact same task — "where did this point go?" — but replace the patch with learned features and the greedy per-point update with a learned, occlusion-aware, track-together module. The skeleton is unchanged: initialise a trajectory, iteratively refine it against appearance, predict visibility; only the operator is now fit to data. The same neuralise-the-classical-iteration move powers RAFT in Optical flow. (L8 first appears in Deep learning; this is a recurrence.)

7.2.1 Tracking vs dense flow: sparse-but-long vs dense-but-short

Correspondence comes in two regimes, and the distinction organises this whole part. Dense optical flow gives a two-dimensional motion vector at every pixel — a full field — but typically only between two successive frames. Feature tracking gives motion at only a sparse set of points, but threads them across a long sequence. Dense-but-short versus sparse-but-long (Figure 7.2.1). Each is the right tool in a different regime: you reach for dense flow when you need motion everywhere — to warp a whole frame, interpolate slow motion, or segment by motion — and for tracking when you need the same physical points identified over a long clip.

Why go sparse at all? Two reasons, and they are the reasons tracking has never been retired. The first is scalability: a few hundred points carry far fewer unknowns than a million pixels. That mattered enormously when the field was young and it still matters today, for real-time SLAM and on-device tracking where you cannot afford a dense solve per frame. The second is deeper and is really about conditioning: some pixels are genuinely easier to track than others. Spend effort only where the answer is reliable — at a corner — and refuse to guess in flat regions or along edges, where the motion is fundamentally ambiguous (the aperture problem you met in Optical flow). In this light, tracking is dense flow's quality-controlled sibling: it reports motion only where it is confident, and stays silent where dense flow would be forced to hallucinate.

There is a useful label for what tracking does, made precise later in Motion blur and temporal sampling: tracking is the Lagrangian view of motion — "this physical point, where is it now?" — following a particle through the scene. Dense per-pixel flow leans Eulerian — "at this fixed pixel, what moved through?" That Lagrangian/Eulerian dichotomy organises the whole part, and tracking is its cleanest Lagrangian instance: it literally follows the patch rather than differentiating at a fixed location.

fig-track-vs-flow
Figure 7.2.1. Two regimes of correspondence, in one strip. Dense flow (left): a single pair of frames, with a colored two-dimensional motion vector at every pixel — dense, but spanning only two frames. Tracking (right): a handful of marked points threaded as trajectories across many frames — sparse, but long. The contrast is exactly dense-but-short versus sparse-but-long: where dense flow fills one gap completely, tracking carries a few reliable points across the whole clip.

7.2.2 KLT = Lucas–Kanade, per feature, iterated over time

The workhorse of sparse tracking is the Kanade–Lucas–Tomasi (KLT) tracker, and its definition is almost anticlimactically simple: take the Lucas–Kanade patch solve from Optical flow, apply it to one small window around a feature, iterate it to convergence, then repeat for the next frame. That is the entire algorithm. The lineage is worth naming because it factors the credit cleanly: Lucas & Kanade (1981) gave the iterative registration update; Tomasi & Kanade 1991 packaged that update as a feature tracker; and Shi & Tomasi (1994) settled which features to track. KLT is the assembly of those three.

The per-patch system is pure reuse — it is Lucas–Kanade restricted to a window $W$, so we will set it up but not re-derive it. Assume every pixel in the window shares one displacement $\mathbf{u}=(u,v)$, and stack the brightness-constancy constraint $I_x u + I_y v + I_t = 0$ over all the pixels in $W$. That is an over-determined linear system in two unknowns, and its least-squares normal equations are

$$ M\,\Delta\mathbf{u} = \mathbf{b}, \qquad M=\sum_{\mathbf{x}\in W}\begin{pmatrix}I_x^2 & I_xI_y\\ I_xI_y & I_y^2\end{pmatrix}, \qquad \mathbf{b}=-\sum_{\mathbf{x}\in W} I_t\begin{pmatrix}I_x\\ I_y\end{pmatrix}, $$

so the displacement update is simply

$$ \Delta\mathbf{u} = M^{-1}\mathbf{b}. $$

Now look hard at that $2\times2$ matrix $M$. It is exactly the structure tensor of Harris — the second-moment matrix of gradient products, Gaussian-weighted over the window — the very object that classifies a patch as flat, edge, or corner in Automatic panorama stitching from multiple views and feature matching. The same matrix, no new math. The whole detection apparatus from corner finding turns out to be precisely the apparatus the motion solve needs (Figure 7.2.2). (Seen another way, the per-patch solve is a one-step Horn–Schunck with a strong constant-smoothness assumption over the window.)

One iteration is not enough, and the reason is built into the derivation: brightness constancy was Taylor-linearised, so a single solve is only valid for sub-pixel motion. The fix is Gauss–Newton on the patch sum-of-squared-differences: warp the second-frame patch by the current estimate $\mathbf{u}$, recompute the temporal difference $I_t$ on the warped patch, re-solve for a residual $\Delta\mathbf{u}$, and add it — repeat until the residual is negligible (the inverse-compositional account of Baker & Matthews 2004 unifies the bookkeeping).

That handles small motion; real videos move farther than a pixel between frames. The standard remedy is the same coarse-to-fine logic as dense flow, now localised to each feature: run the iteration down a pyramid (Bouguet, 2000). Solve on a heavily blurred and downsampled level, where the displacement is small enough for the linearisation to hold; propagate that estimate as the initial warp for the next-finer level; refine; repeat to full resolution.

Finally, thread it through time. Having localised the feature in frame $t{+}1$, use that location as the starting guess for frame $t{+}2$, and so on down the clip. KLT is this tiny $2\times2$ solve threaded greedily through the whole video — one small matrix inverse per feature per frame, almost free. Assembled, the three layers — iterate to converge, thread to the next frame — read as one short loop:

for each feature p with window W:
    u = 0
    for each frame:
        repeat until Δu small:
            warp W by current u and recompute Iₜ
            build M, b over W from Iₓ, I_y, Iₜ
            Δu = M⁻¹ b
            u ← u + Δu
        report p + u; carry u as the guess for the next frame

But "greedy and memoryless" is exactly the property that makes drift and occlusion bite, as we will see two sections on.

fig-klt-is-harris
Figure 7.2.2. One matrix, two readings — the spine of the section. The same $2\times2$ structure-tensor ellipse (axes $\lambda_1,\lambda_2$) is drawn twice. Left, labelled "is this a good corner to detect?" — the Harris reading: a round ellipse (two large eigenvalues) marks a corner. Right, labelled "is this patch's motion solve well-conditioned?" — the KLT reading: the same round ellipse means $M^{-1}$ is stable and the displacement is determined in both directions. A degenerate ellipse (one tiny eigenvalue) is simultaneously "not a corner" and "not trackable." Detection and trackability are the same statement about $M$.

7.2.3 Good features to track = where the structure tensor is well-conditioned

Here is the hinge the whole section turns on. The update $\Delta\mathbf{u}=M^{-1}\mathbf{b}$ is only meaningful when $M$ is invertible and well-conditioned — when you can actually trust the inverse. And the conditioning of $M$ is read straight off its eigenvalues $\lambda_1\ge\lambda_2$ — the same eigenvalues that, in the Harris derivation, sort a patch into flat, edge, or corner. So the question "is this a good feature to track?" is literally "is the tracking matrix well-conditioned?", which is in turn literally "is this a corner?" One question wearing three hats.

Walk the three cases — they are exactly the Harris partition, now read as trackability:

The selection rule falls right out of this. Shi & Tomasi's (1994) "good features to track" criterion is the simple, principled test

$$ \min(\lambda_1,\lambda_2) > \lambda_{\min}, $$

read as: require the smaller eigenvalue to be large enough — that is, demand that both directions be well-constrained, not just one. This is nothing but the conditioning of $M$ stated directly, and it is the same $\min(\lambda_1,\lambda_2)$ corner response cross-referenced in Automatic panorama stitching from multiple views and feature matching. The Harris response $R=\det M - k\,(\operatorname{tr} M)^2$ is a determinant-and-trace proxy for the same quantity. The striking part is the direction of the logic: the detector and the tracker agree by construction, because Shi & Tomasi derived "what makes a good feature to detect" from "what makes a good feature to track" — not the other way round (Figure 7.2.3).

In practice the recipe is exactly the Harris pipeline: compute $M$ over a window at every pixel (image gradients → pointwise products → Gaussian-weighted sum), score each pixel by $\min(\lambda_1,\lambda_2)$, non-maximum-suppress, and keep the top-$N$ well-spread points. Those are the features you then KLT-track.

One real tension governs the window size, and it is the recurring small-versus-large-aperture trade of every patch method. A small window localises precisely but sees little texture — $M$ is poorly conditioned and noise fools it. A large window is better conditioned but spans multiple motions: across a depth discontinuity or an object boundary, a single displacement $\mathbf{u}$ no longer fits the whole window. The sweet spot trades localisation against conditioning.

fig-good-features-to-track
Figure 7.2.3. Shi–Tomasi points land only on corners. A photograph with selected features overlaid: markers cluster on corners and texture junctions — never on the flat sky or along a single straight edge (the railing, the roofline). A few windows are annotated with their local eigenvalue pair $(\lambda_1,\lambda_2)$: large-and-large at the kept corners, large-and-near-zero on the rejected edges, near-zero-and-near-zero on the rejected flat regions. The overlay makes "good feature to track" visibly identical to "well-conditioned $M$" identical to "corner."

7.2.4 What breaks long-term tracking: drift, appearance change, occlusion — and re-detection

A single small patch tracked greedily over a long video is not enough, and the failures are predictable. There are three of them, and one standard response.

Drift — the cost of greediness. Each frame's tiny Lucas–Kanade (LK) residual is imperfect, and chained over hundreds of frames the sub-pixel errors accumulate until the window has slowly slid off the true feature — sub-pixel bias becomes whole pixels (Figure 7.2.4a). The classic remedy is to anchor to the first frame: match against the original template from frame $t_0$ rather than the previous frame, so the error does not compound. But anchoring trades one problem for another, because over a long clip the template goes stale.

Appearance change — the template goes stale. Over many frames the patch genuinely changes: lighting shifts, the object undergoes out-of-plane rotation, its scale changes as it nears or recedes, perspective skews it. A pure-translation match against a fixed template degrades. Shi & Tomasi's answer is a two-speed scheme: track plain translation frame-to-frame (small motion, cheap) but verify against the anchor template with an affine warp, allowing the patch to rotate, scale, and shear when you check whether it is still itself. The affine dissimilarity

$$ \varepsilon=\sum_{\mathbf{x}\in W}\big[\,I_2(A\mathbf{x}+\mathbf{d})-I_1(\mathbf{x})\,\big]^2 $$

measures how well the current patch still matches its origin under the best affine warp $A$ (with translation $\mathbf{d}$). A large $\varepsilon$ means the feature is no longer the thing you started tracking.

Occlusion — the point disappears. When a feature passes behind another object — or off the frame edge, or through a shadow or specular flip — there is simply no correct match. The danger is that a greedy LK solve does not know this: it will happily lock onto whatever nearby texture minimises the sum of squared differences (SSD), silently corrupting the track. The detection is the same dissimilarity: $\varepsilon$ spikes, at which point you declare the feature lost and drop it rather than trust a garbage match (Figure 7.2.4b). Motion boundaries are the canonical place this happens — no good match exists where two motions meet.

Re-detection — keep the set populated. As features bleed away to drift, occlusion, and leaving the frame, periodically re-run Shi–Tomasi and spawn fresh features in the gaps, so the tracked set stays dense enough and well-distributed (Figure 7.2.4c). Detection and tracking become an interleaved loop rather than a one-shot detect-then-track pipeline.

Step back and the honest limitation is clear, and it is what motivates the next section. Classical KLT is per-point, greedy, and forgetful: each feature is tracked independently, with no memory of where it went and no way to recover a point once it reappears from behind an occluder. Points also drift relative to one another, with no enforced global consistency. These are exactly the failures the learned trackers were built to fix.

fig-track-drift-occlusion
Figure 7.2.4. Three ways a long track fails, one point over a long clip. (a) Drift: the tracking window slowly slides off the true feature as per-frame sub-pixel errors accumulate. (b) Occlusion: the point passes behind an object; the dissimilarity $\varepsilon$ spikes, signalling "no valid match" → drop the feature rather than lock onto background texture. (c) Re-detection: a fresh Shi–Tomasi feature is spawned to replace a lost one, keeping the tracked set populated and well-spread. Together they show why a single greedy patch is not enough over a long sequence.

7.2.5 Modern point trackers (briefly)

The failures of greedy KLT have a modern answer, and it is a clean instance of L8: the bitter-lesson successor keeps the task — "track this physical point through a video, even through occlusion" — but replaces the hand-built brightness patch with learned features and the greedy per-frame solve with a learned iterative update, trained on synthetic video with ground-truth tracks. Strikingly, the classical scaffolding survives as architecture: a cost volume of appearance similarity, iterative refinement of the trajectory, and now an explicit visibility / occlusion prediction. The skeleton Lucas, Kanade, and Tomasi laid down is intact; only the operator is fit to data. We sketch the family here and leave the full models to Deep learning.

Why this closes the section: every failure of greedy KLT — drift, single-point myopia, silent occlusion corruption — is answered by learn the features, refine iteratively, track together, predict visibility. The structure-tensor question "is this point trackable?" is now answered implicitly, inside the learned features and the joint optimisation, rather than by an explicit eigenvalue test. But the problem is identical to the one Lucas, Kanade, and Tomasi posed. The classical idea did not get replaced so much as absorbed — and the same neuralise-the-classical-iteration pattern is exactly RAFT for dense flow in Optical flow.


Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (L17, recurrence)

Tracking is a sparse correspondence. Where dense flow asks "where did every pixel go?", tracking asks "where did these few physical points go?" — and threads the answer across a long sequence. It is the Lagrangian instance of correspondence (follow the particle, Motion blur and temporal sampling) to dense flow's Eulerian leaning (watch the fixed pixel). The sparse trajectories it produces are the raw input to everything downstream that needs the same physical points over time: video stabilization, structure-from-motion, simultaneous localization and mapping (SLAM), and bundle adjustment. Hold this placement in view — the rest of the section is the mechanics of doing the sparse thing well, and the rest of the part is what those sparse tracks feed.

💡 Big lesson (L8, recurrence)

A learned operator swaps a hand-designed prior for one learned from data. Classical KLT tracks a hand-built brightness patch by a hand-derived least-squares update. The modern point trackers at the end of this section keep the exact same task — "where did this point go?" — but replace the patch with learned features and the greedy per-point update with a learned, occlusion-aware, track-together module. The skeleton is unchanged: initialise a trajectory, iteratively refine it against appearance, predict visibility; only the operator is now fit to data. The same neuralise-the-classical-iteration move powers RAFT in Optical flow. (L8 first appears in Deep learning; this is a recurrence.)