Computational Photography, an AI-powered Slopendium — 07 Matching pixels across space and time

▸Part 7 MATCHING PIXELS ACROSS SPACE AND TIME⧉

💡 **Big lesson (L17 · estimation half):** correspondence-then-transport — this part owns the **hard, ill-posed estimation** half (where did each pixel go?), ill-posed by the **aperture problem** and broken by occlusion and large motion; [[Warping and morphing]] owns the transport. Matching is literally the inverse of warping: warping consumes a coordinate map, matching produces one.

7.1 Sparse matching⧉

▸7.2 Feature tracking⧉

`fig-track-vs-flow` · two regimes of correspondence — dense flow (one pair, a vector at every pixel; dense but short) vs tracking (a few points threaded as trajectories across many frames; sparse but long) 🟨

`fig-klt-is-harris` · one matrix, two readings — the same structure-tensor ellipse read as Harris ("good corner to detect?") and as KLT ("motion well-conditioned?"); detection and trackability are the same statement about $M$ 🟨

`fig-good-features-to-track` · Shi–Tomasi points land only on corners — markers clustering on corners/junctions, never flat sky or single edges, with windows annotated by their eigenvalue pair; "good feature" = "well-conditioned $M$" = "corner" 🟨

`fig-klt-pyramid-iteration` · per-feature coarse-to-fine: predict on a blurred level, warp, re-solve the $2\times2$ system, refine down the pyramid (the LK iteration localised to one patch) ⬜ figure not yet created

`fig-track-drift-occlusion` · three ways a long track fails — drift (window slides off as sub-pixel errors accumulate), occlusion (dissimilarity spikes → drop the feature), re-detection (a fresh Shi–Tomasi point spawned to replace a lost one) 🟨

`fig-modern-point-trackers` · KLT's single greedy patch vs a learned tracker (CoTracker) tracking *many* points *jointly* through an occlusion, recovering the point when it reappears ⬜ figure not yet created

The previous section estimated a **dense** flow field — a motion vector at *every* pixel — between *one* pair of frames. This section does the complementary thing: follow a **sparse** handful of *distinctive* points across *many* frames. It is the same physics (brightness constancy, the aperture problem) restricted to the places where the math is well-behaved, and run forward in time. The payoff of the restriction is the chapter's spine: **the matrix that says a patch is a good corner to *detect* is the same matrix that must be invertible to *track* it.** Detect with it (Harris), track with it (KLT), refuse to track where it is singular (Shi–Tomasi). One $2\times2$ matrix, three jobs.

💡 **Big lesson (recurrence of L8 — a learned operator swaps a hand-designed prior for one learned from data):** classical KLT tracks a hand-built brightness patch by a hand-derived least-squares update; modern point trackers (PIPs, CoTracker) keep the *exact same task* — "where did this point go?" — but replace the patch with **learned features** and the greedy per-point update with a **learned, occlusion-aware, track-together** module. The skeleton (initialise a trajectory, iteratively refine it against appearance, predict visibility) is unchanged; only the operator is now fit to data. (→ see Big lesson **L8**; first appears in [[Deep learning]]. The same "neuralise the classical iteration" move powers RAFT in [[Optical flow]].)

equations

per-feature LK normal equations $M\,\Delta\mathbf{u}=\mathbf{b}$, with the **structure tensor** $M=\sum_{\mathbf{x}\in W} \begin{psmallmatrix}I_x^2 & I_xI_y\\ I_xI_y & I_y^2\end{psmallmatrix}$ and $\mathbf{b}=-\sum_{\mathbf{x}\in W} I_t\begin{psmallmatrix}I_x\\ I_y\end{psmallmatrix}$ → update $\Delta\mathbf{u}=M^{-1}\mathbf{b}$

**Shi–Tomasi trackability** $\min(\lambda_1,\lambda_2)>\lambda_{\min}$ (a feature is trackable iff $M$ is well-conditioned)

affine-warp **dissimilarity** $\varepsilon=\sum_W [\,I_2(A\mathbf{x}+\mathbf{d})-I_1(\mathbf{x})\,]^2$ (occlusion / lost-track test)

▸7.2.1 Tracking vs dense flow: sparse-but-long vs dense-but-short⧉

• **the two regimes** of correspondence (the taxonomy that organises this whole part): **dense optical flow** gives a 2-D motion vector at *every* pixel, but typically only between *two successive frames*; **feature tracking** gives motion at only a *sparse* set of points, but *threads them across a long sequence* of frames. Dense-but-short vs sparse-but-long. [`fig-track-vs-flow`]

• **why go sparse** (the case for tracking, from the slides' "motivations for sparse"): **scalability** — far fewer unknowns (a few hundred points, not a million pixels), which mattered enormously historically and still matters for real-time SLAM and on-device tracking; **robustness / conditioning** — *some pixels are genuinely easier to track than others*, so spend effort only where the answer is reliable (a corner), and refuse to guess in flat regions or along edges where it is hopeless (the aperture problem). Tracking is dense flow's *quality-controlled* sibling: it only reports where it is confident.

• **the Lagrangian framing** (forward-ref, made precise in [[Motion blur and temporal sampling]]): tracking **follows a particle** — "this *physical* point, where is it now?" — the **Lagrangian** view. Dense per-pixel flow leans **Eulerian** — "at this *fixed* pixel, what moved through?" This is the organising dichotomy for the part; tracking is its cleanest Lagrangian instance. (The slides note matching-based flow "becomes more Lagrangian" precisely by *following the patch* rather than differentiating at a fixed pixel.)

• **where tracked points go next**: the sparse trajectories are the *input* to structure-from-motion, SLAM, video stabilization, and bundle adjustment — wherever you need *the same physical points* identified over time, not a per-pixel flow. (Cross-ref [[Automatic panorama stitching from multiple views and feature matching]] — features → 3D.)

▸7.2.2 KLT = Lucas–Kanade, per feature, iterated over time⧉

• **the core move**: take the **Lucas–Kanade** patch solve from [[Optical flow]] and apply it *to one small window around a feature*, then *iterate*, then *repeat for the next frame*. That is the whole Kanade–Lucas–Tomasi (KLT) tracker. (Lucas & Kanade 1981 = the registration update; Tomasi & Kanade 1991 = packaging it as a feature *tracker*; Shi & Tomasi 1994 = *which* features.)

• **the per-patch system** (reuse, don't re-derive — it is LK restricted to a window $W$): assume every pixel in $W$ shares one displacement $\mathbf{u}=(u,v)$ and stack the brightness-constancy constraint $I_x u + I_y v + I_t = 0$ over the window. The least-squares normal equations are

$$M\,\Delta\mathbf{u}=\mathbf{b},\qquad M=\sum_{\mathbf{x}\in W}\begin{pmatrix}I_x^2 & I_xI_y\\ I_xI_y & I_y^2\end{pmatrix},\qquad \mathbf{b}=-\sum_{\mathbf{x}\in W} I_t\begin{pmatrix}I_x\\ I_y\end{pmatrix},$$

so the update is $\Delta\mathbf{u}=M^{-1}\mathbf{b}$. **That $M$ is exactly the structure tensor** of Harris (the second-moment matrix) — *the same matrix, no new math*. (Slides: "per-patch linear system… can be seen as per-patch 1-step Horn–Schunck with strong constant smoothness.") [`fig-klt-is-harris`]

• **iterate, because LK is only a linearisation**: a single solve assumes sub-pixel motion (brightness constancy was Taylor-linearised). So **warp** the second-frame patch by the current estimate, recompute $I_t$ on the warped patch, re-solve for a residual $\Delta\mathbf{u}$, and **add** it — repeat until convergence. This is Gauss–Newton on the patch SSD (Baker & Matthews 2004's account).

• **coarse-to-fine for large motion**: LK assumes motion of roughly $\lesssim 1$ pixel. For real videos, run the iteration **down a pyramid** (Bouguet 2000): solve on a heavily-blurred/downsampled level where the displacement *is* small, **propagate** the estimate as the initial warp for the next-finer level, refine. Same coarse-to-fine logic as dense flow, localised to each feature. [`fig-klt-pyramid-iteration`]

• **across time**: having localised the feature in frame $t{+}1$, use that as the start for frame $t{+}2$, and so on. KLT is this $2\times2$ solve threaded greedily down the whole clip — cheap (one tiny matrix inverse per feature per frame) but **greedy and memoryless**, which is exactly what makes drift and occlusion bite (below).

• **pseudocode**: a short KLT loop (per feature: iterate warp → recompute $I_t$ → build $M,\mathbf b$ → $\Delta\mathbf u=M^{-1}\mathbf b$ → add; then carry $\mathbf u$ to the next frame) shows the three nested layers: iterate to convergence, refine down the pyramid, then thread over time.
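A minimal NumPy sketch of the KLT loop, under stated assumptions: translation-only windows, a single pyramid level (the coarse-to-fine layer is left out to keep it short), gradients taken from the earlier frame, and a synthetic analytic texture standing in for video. The names `bilinear`, `klt_step`, `make_frame` and the parameters `half`, `iters`, `tol` are illustrative choices, not taken from the slides.

```python
import numpy as np

def bilinear(img, ys, xs):
    """Sample img at float (row, col) coordinates, clamping to the border."""
    h, w = img.shape
    ys, xs = np.clip(ys, 0, h - 1), np.clip(xs, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    fy, fx = ys - y0, xs - x0
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bot = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bot * fy

def klt_step(I1, I2, p, half=7, iters=20, tol=1e-3):
    """Track one feature at p = (x, y) from frame I1 to frame I2."""
    Iy, Ix = np.gradient(I1)                       # returns d/drow, d/dcol
    oy, ox = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    ys, xs = p[1] + oy, p[0] + ox                  # window W around the feature
    T = bilinear(I1, ys, xs)                       # template patch
    gx, gy = bilinear(Ix, ys, xs), bilinear(Iy, ys, xs)
    M = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])   # structure tensor
    u = np.zeros(2)
    for _ in range(iters):
        J = bilinear(I2, ys + u[1], xs + u[0])     # warp frame 2 by the estimate
        It = J - T                                 # recompute I_t on the warped patch
        b = -np.array([np.sum(It * gx), np.sum(It * gy)])
        du = np.linalg.solve(M, b)                 # delta_u = M^{-1} b
        u += du
        if np.hypot(du[0], du[1]) < tol:
            break
    return np.asarray(p, float) + u

def make_frame(k, shift=(0.7, -0.4), size=96):
    """Synthetic smooth texture translated by k * shift pixels (illustrative)."""
    y, x = np.mgrid[0:size, 0:size].astype(float)
    xs, ys = x - k * shift[0], y - k * shift[1]
    return np.sin(0.2 * xs) + np.sin(0.2 * ys) + 0.5 * np.sin(0.1 * (xs + ys))

frames = [make_frame(k) for k in range(5)]
track = [np.array([40.0, 40.0])]
for I1, I2 in zip(frames[:-1], frames[1:]):        # thread greedily over time
    track.append(klt_step(I1, I2, track[-1]))
print(np.round(track[-1], 2))                      # true position is (42.8, 38.4)
```

Because each frame pair starts from the previous answer and re-samples its template there, small per-frame errors are carried forward, which is the greedy, memoryless behaviour that produces drift on longer clips.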

▸7.2.3 Good features to track = where the structure tensor is well-conditioned⧉

• 💡 **the hinge of the chapter**: the update $\Delta\mathbf{u}=M^{-1}\mathbf{b}$ is only meaningful when **$M$ is invertible and well-conditioned**. And $M$'s conditioning is read off its **eigenvalues** $\lambda_1\ge\lambda_2$ — the *same* eigenvalues that classify a patch as flat / edge / corner in Harris. So "*is this a good feature to track?*" is literally "*is the tracking matrix well-conditioned?*" which is *literally* "*is this a corner?*" — one question wearing three hats. [`fig-klt-is-harris`, `fig-good-features-to-track`]

• **the three cases** (reused from the Harris derivation, now read as *trackability*):

• **flat region** ($\lambda_1\approx\lambda_2\approx 0$): $M$ is near-zero, $M^{-1}$ blows up — no texture, motion totally ambiguous. **Untrackable.**

• **edge** ($\lambda_1\gg\lambda_2\approx 0$): $M$ is rank-deficient — motion *along* the edge is unconstrained (the **aperture problem**). You can recover only the across-edge component. **Half-trackable** — and the unstable half drifts.

• **corner** ($\lambda_1,\lambda_2$ both large): $M$ is well-conditioned, motion fully determined in both directions. **Trackable.**

• **Shi–Tomasi "good features to track"** (1994): pick features by the simple, principled test

$$\min(\lambda_1,\lambda_2) > \lambda_{\min},$$

i.e. require the *smaller* eigenvalue to be large enough — *both* directions well-constrained. This is the conditioning of $M$ stated directly; it is the same $\min(\lambda_1,\lambda_2)$ Shi–Tomasi corner response cross-referenced in [[Automatic panorama stitching from multiple views and feature matching]]. The Harris response $R=\det M - k(\operatorname{tr}M)^2$ is a (determinant-trace) proxy for the same thing. **The detector and the tracker agree by construction** — Shi & Tomasi derived "what is a good feature to *detect*" *from* "what is a good feature to *track*", not the other way round. [`fig-good-features-to-track`]

• **practical recipe**: compute $M$ over a window at every pixel (image gradients → products → Gaussian-weighted sum, exactly the Harris pipeline), score by $\min(\lambda_1,\lambda_2)$, **non-max suppress**, keep the top-$N$ well-spread points — these are the features you then KLT-track.

• **the window-size tension** (Shi & Tomasi, slide "Rest of the story"): a *small* window localises precisely but sees little texture (poorly conditioned $M$) and is fooled by noise; a *large* window is better conditioned but blurs across **multiple motions** (a single $\mathbf{u}$ no longer fits — depth discontinuities, object boundaries). The sweet spot trades localisation against conditioning — the recurring small-aperture-vs-large-aperture tension of all patch methods.
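The practical recipe above can be sketched in a few lines of NumPy. This is an illustration under assumptions: a plain box window replaces the Gaussian-weighted sum, non-max suppression is a simple greedy minimum-distance rule, and `box_sum`, `shi_tomasi`, `good_features`, `lam_min`, and `min_dist` are hypothetical names and defaults.

```python
import numpy as np

def box_sum(a, half):
    """Sum of a over a (2*half+1)^2 window at every pixel (zero padding)."""
    k = 2 * half + 1
    c = np.cumsum(np.cumsum(np.pad(a, half), axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))                # integral image with a zero border
    return c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]

def shi_tomasi(img, half=3):
    """Smaller eigenvalue of the structure tensor M at every pixel."""
    Iy, Ix = np.gradient(img.astype(float))
    sxx = box_sum(Ix * Ix, half)
    sxy = box_sum(Ix * Iy, half)
    syy = box_sum(Iy * Iy, half)
    # closed form for the smaller eigenvalue of [[sxx, sxy], [sxy, syy]]
    return 0.5 * ((sxx + syy) - np.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2))

def good_features(img, n=4, half=3, min_dist=8, lam_min=0.5):
    """Top-n points by min eigenvalue, greedily kept at least min_dist apart."""
    score = shi_tomasi(img, half)
    picked = []
    for idx in np.argsort(score, axis=None)[::-1]:  # best score first
        y, x = np.unravel_index(idx, score.shape)
        if score[y, x] <= lam_min or len(picked) == n:
            break
        if all((x - px) ** 2 + (y - py) ** 2 >= min_dist ** 2 for px, py in picked):
            picked.append((int(x), int(y)))
    return picked

# A bright square on a dark background: four corners, four edges, flat elsewhere.
img = np.zeros((64, 64))
img[20:44, 20:44] = 1.0
score = shi_tomasi(img)
points = good_features(img)
print(points)
```

On this test image the score vanishes in the flat background and along the middle of each edge (one eigenvalue is zero there), so the only points that survive the threshold sit next to the four corners.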

▸7.2.4 What breaks long-term tracking: drift, appearance change, occlusion — and re-detection⧉

• **drift** — the cost of greediness: each frame's tiny LK residual is imperfect; chained over hundreds of frames the errors **accumulate** and the window slowly slides off the true feature (sub-pixel bias becomes pixels). The classic fix is to **anchor to the first frame**: track against the *original* template (frame $t_0$), not the previous frame, so error doesn't compound — but then **appearance change** fights you. [`fig-track-drift-occlusion`(a)]

• **appearance change** — the template goes stale: over a long clip the patch changes due to **lighting**, **out-of-plane rotation**, **scale** (the object nears/recedes), and perspective. A pure-translation template match degrades. Shi & Tomasi's fix: track translation frame-to-frame (small motion, cheap) but **verify against the anchor template with an *affine* warp** — allow the patch to rotate/scale/shear when checking quality. The affine **dissimilarity**

$$\varepsilon=\sum_{\mathbf{x}\in W}\big[\,I_2(A\mathbf{x}+\mathbf{d})-I_1(\mathbf{x})\,\big]^2$$

measures how well the current patch still matches its origin under the best affine warp $A$; a large $\varepsilon$ means the feature is no longer itself.

• **occlusion** — the point disappears: when a feature passes **behind** another object (or off-frame, or into a shadow/specular flip), there is *no* correct match — but a greedy LK solve will happily lock onto *whatever* nearby texture minimises SSD, silently corrupting the track. **Detection: the dissimilarity $\varepsilon$ spikes** → declare the feature **lost** and drop it rather than trust a garbage match. (Slides "Issue: Motion Boundaries — no matches can be found at motion boundaries… also at frame boundary.") [`fig-track-drift-occlusion`(b)]

• **re-detection** — keep the set populated: as features are lost to drift/occlusion/leaving-frame, periodically **re-run Shi–Tomasi** and spawn fresh features in the gaps, so the tracked set stays dense enough and well-distributed. Detection and tracking become an interleaved loop (slides: "Combine tracking and detection; keep refining detection model over time"). [`fig-track-drift-occlusion`(c)]

• **the honest limitation** that motivates the next section: classical KLT is **per-point, greedy, and forgetful** — each feature is tracked independently with no memory of where it went and no way to *recover* a point after it reappears from behind an occluder. Points also **drift relative to each other** (no global consistency). These are exactly the failures learned trackers were built to fix.
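As a small illustration of the lost-track test, the sketch below evaluates the affine dissimilarity $\varepsilon$ for a given $(A,\mathbf d)$ and shows it spiking when an occluder covers the patch. Assumptions: window offsets $\mathbf x$ are measured from the feature centre, $A$ and $\mathbf d$ are supplied rather than optimised, and the `threshold` value is hypothetical (in practice it would be tuned per sequence).

```python
import numpy as np

def bilinear(img, ys, xs):
    """Sample img at float (row, col) coordinates, clamping to the border."""
    h, w = img.shape
    ys, xs = np.clip(ys, 0, h - 1), np.clip(xs, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    fy, fx = ys - y0, xs - x0
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bot = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bot * fy

def dissimilarity(I1, I2, center, A, d, half=7):
    """Sum over W of [I2(A x + d) - I1(x)]^2, with x relative to center = (x, y)."""
    oy, ox = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    cx, cy = center
    T = bilinear(I1, cy + oy, cx + ox)              # anchor template
    wx = A[0, 0] * ox + A[0, 1] * oy + d[0]         # affine-warped offsets
    wy = A[1, 0] * ox + A[1, 1] * oy + d[1]
    J = bilinear(I2, cy + wy, cx + wx)
    return float(np.sum((J - T) ** 2))

rng = np.random.default_rng(0)
I1 = rng.random((64, 64))
I2 = np.roll(I1, shift=(-2, 3), axis=(0, 1))        # content moves by (+3, -2) in (x, y)
A, d = np.eye(2), np.array([3.0, -2.0])

eps_visible = dissimilarity(I1, I2, (30, 30), A, d)

I2_occ = I2.copy()
I2_occ[20:36, 25:41] = 0.0                          # an occluder covers the patch
eps_occluded = dissimilarity(I1, I2_occ, (30, 30), A, d)

threshold = 1.0                                     # hypothetical value
lost = eps_occluded > threshold                     # drop the feature if True
print(eps_visible, eps_occluded, lost)
```

A full Shi–Tomasi monitor would also minimise $\varepsilon$ over $A$ and $\mathbf d$ before thresholding; here the correct warp is simply handed in.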

▸7.2.5 Modern point trackers (briefly)⧉

• **the shift** (the *bitter-lesson* successor to KLT; **L8**): keep the task — "track this physical point through a video, even through occlusion" — but replace the hand-built brightness patch with **learned features** and the greedy per-frame solve with a **learned iterative update**, trained on synthetic video with ground-truth tracks (FlyingThings3D, Kubric, etc.). The classical scaffolding survives *as architecture*: a 4-D **cost volume** (appearance similarity), **iterative refinement** of the trajectory, and now an explicit **visibility / occlusion** prediction. [`fig-modern-point-trackers`]

• **PIPs — Particle Video Revisited** (Harley et al. 2022): tracks **one point at a time** over a temporal window; solves jointly for its $(x,y)$ location *and* an appearance vector across all frames, **predicts visibility per frame**, and **initialises with a constant** location/appearance then iteratively refines (RAFT-style). The reintroduction of *long-range, occlusion-aware* particle tracking that KLT lacked.

• **TAP-Vid** (Doersch et al. 2022): frames the modern task — **Track Any Point** — and supplies the **benchmark** that made progress measurable (the long-standing bottleneck, as in optical flow, is **ground-truth data**).

• **CoTracker — track points *together*** (Karaev et al. 2023): the key insight, stated as a slogan, is that **tracking points jointly beats tracking them independently** — points on the same object constrain each other, giving **global consistency** and far better **occlusion handling** (a point hidden behind an object is inferred from its visible neighbours). Mechanism: a **transformer** over a sliding window with **factorised attention across space, time, and the group of tracks**, iteratively updating each track's position/appearance/visibility — plus extra **register tokens** so the network has scratch space for global info (otherwise it hijacks real pixels, causing spiky artifacts).

• **why this closes the section**: every failure of greedy KLT — drift, single-point myopia, silent occlusion corruption — is answered by *learn the features, refine iteratively, track together, predict visibility*. The structure-tensor "is this point trackable?" question is now answered implicitly by the learned features and the joint optimisation, but the **problem** is identical to the one Lucas, Kanade, and Tomasi posed. (Full model details: [[Deep learning]]; the same *neuralise-the-classical-iteration* pattern is RAFT in [[Optical flow]].)

7.3 Robustness: the ratio test and RANSAC⧉

7.4 Deep learning approaches to sparse matching⧉

7.5 Misc: fast matching⧉

▸7.6 Optical flow⧉

`fig-flow-field-colorkey` · from two frames to a dense field — a per-pixel flow shown with the standard colour key (hue = direction, saturation = speed, after Baker et al. 2011); coherent regions read as one colour, static background near-white ✅

`fig-brightness-constancy` · a patch at $(x,y,t)$ reappears at $(x+u,y+v,t+1)$ with the *same* intensity: the assumption, and where it breaks (lighting change, specularity, occlusion) ⬜ figure not yet created

`fig-flow-constraint-line` · one equation, a line of answers — $I_x u + I_y v + I_t = 0$ as a line in the $(u,v)$ plane; the gradient fixes only the normal component (normal flow), the along-edge tangent undetermined 🟨

`fig-aperture-problem` · why one pixel is not enough — a straight edge through an aperture (three true motions, identical appearance, only normal recoverable) and the barber-pole illusion (diagonal stripes appearing to move straight up) 🟨

`fig-zoom-mechanism` · a zoom at two focal lengths (short-f/wide vs long-f/tele): moving variator + compensator groups slide along the axis to change f while the focal plane (sensor) stays fixed; motion arrows between states ✅

`fig-flow-corner-edge-flat` · the structure tensor $A^\top A$ deciding where flow is solvable — corner (two large eigenvalues, full 2-D flow), edge (rank-deficient, only normal flow), flat (indeterminate); the same picture that selected Harris corners 🟨

`fig-LK-vs-HS` · Lucas–Kanade: solve a tiny system per window, local … ⬜ figure not yet created

`fig-mc-as-flow` · same correspondence, two budgets — a dense smooth optical-flow field vs the codec's one-constant-vector-per-block field; the same "where did this come from?" coarsened to what is cheap to estimate and transmit 🟨

`fig-coarse-to-fine-flow` · making large motion sub-pixel — a Gaussian pyramid where coarse displacement is fractional; at each level upsample-and-scale, warp, estimate the small residual and add; a Laplacian view of the flow 🟨

`fig-raft-skeleton` · RAFT, the classical pipeline neuralized — learned feature + context encoders, an all-pairs 4-D correlation volume, and a recurrent GRU update operator iterating; the boxes line up with data term, cost volume, and warp-then-refine 🟨

💡 **Big lesson (the chapter's core — *brightness constancy + the aperture problem*):** track a point by the one thing that's (assumed) conserved — its **brightness**. Linearizing that single scalar equation gives **one constraint per pixel**, $I_x u + I_y v + I_t = 0$, for **two** unknowns $(u,v)$ — so locally motion is **fundamentally underdetermined**: you can only see the component **along the image gradient** (normal flow), never the component **along an edge** (the *aperture problem*). Every flow method is a strategy for supplying the missing second equation — from a **neighbor** (Lucas–Kanade: pixels in a patch share a motion → the well-/ill-posedness is the **same corner/edge/flat structure tensor as Harris**), from a **smoothness prior** (Horn–Schunck), or from **learned context** (RAFT). *Only the gradient carries motion information, and one gradient is never enough.*

▸7.6.1 What optical flow is — and whether it is even well-defined⧉

• **definition**: the **2-D motion vector of every pixel** between two (usually consecutive) frames — a dense field $(u(x,y),\,v(x,y))$. Visualized with a **color key**: hue = motion direction, saturation = speed (Baker et al. 2011). [`fig-flow-field-colorkey`]

• **why it matters (downstream)**: frame interpolation / slow-motion, video super-resolution, in-painting, stabilization, special effects; action recognition, tracking, surveillance; egomotion / SLAM; **video compression** (motion-compensated prediction); medical registration; even optical mice and fluid-flow physics. A workhorse primitive.

• **relation to stereo**: disparity is optical flow **specialized to a static scene with a known camera baseline** — the search collapses to the **1-D epipolar line**, so there's a **single unknown (depth/disparity) per pixel** instead of two. Flow is the unconstrained 2-D cousin (cross-ref [[Image alignment]] / multi-view).

• **honesty up front**: optical flow is **ill-defined** in places — flat untextured regions (no information), occlusion / motion boundaries (a pixel *has* no correspondence), lighting changes and specularities (brightness *isn't* constant). Worth keeping in mind that "the flow" is an idealization that the math itself will warn us is sometimes unrecoverable.

• **the progression of "what is a match"** (foreshadow the chapter and its sequels): per-pixel **RGB + brightness constancy + gradient** (this chapter) → **window/patch** matching (SSD, cost volume) → hand-designed **features** (SIFT/HOG) → **learned** features (RAFT and successors). Each step buys robustness at the cost of locality.

▸7.6.2 Brightness constancy and the optical-flow constraint⧉

• **the assumption (brightness / color constancy)**: a scene point keeps the same intensity as it moves over one timestep — it just *relocates*:

$$I(x,\,y,\,t) = I(x+u,\;y+v,\;t+1).$$

($u,v$ = the per-pixel displacement we want.) This is the "color constancy assumption" of Horn & Schunck 1981. [`fig-brightness-constancy`]

• **linearize (first-order Taylor)**: expand the right side about $(x,y,t)$, assuming small motion:

$$I(x+u,y+v,t+1) \approx I(x,y,t) + I_x\,u + I_y\,v + I_t.$$

Substitute this expansion into brightness constancy and cancel $I(x,y,t)$ from both sides → the **optical-flow constraint equation (OFCE)**:

$$\boxed{\,I_x\,u + I_y\,v + I_t = 0\,}\qquad\text{equivalently}\qquad \nabla I\cdot(u,v) + I_t = 0,$$

with $I_x,I_y$ the **spatial** gradients and $I_t$ the **temporal** gradient (frame difference).

• 💡 **"only the gradient matters"**: motion is invisible in **flat** regions. Where $\nabla I = 0$ the constraint reduces to $I_t=0$ and tells you *nothing* about $(u,v)$. Information about motion lives **entirely in the image gradient**. (This is exactly the lever Eulerian video magnification pulls: to first order the *temporal* change is $I_t=-\nabla I\cdot(u,v)$, so amplifying $I_t$ magnifies the apparent motion.)

• **the constraint is a line, not a point**: one scalar equation in two unknowns. In the $(u,v)$ plane it's a **line**; the true flow lies *somewhere on it* but is otherwise undetermined. We can pin down only the component **along $\nabla I$**: the **normal flow** $\;u_\perp = -I_t / \lVert\nabla I\rVert$ (magnitude), directed along the gradient. The component **tangent to the edge is free**. [`fig-flow-constraint-line`]
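A short NumPy check of the constraint line, offered as a sketch: on a linear intensity ramp the OFCE holds exactly, so two different true motions that share the same component along $\nabla I$ produce identical frame pairs and identical normal flow. The function names `normal_flow` and `ramp`, and the ramp coefficients `a`, `b`, are illustrative.

```python
import numpy as np

def normal_flow(I1, I2, eps=1e-8):
    """Per-pixel normal flow: signed speed -I_t / |grad I| along the unit gradient."""
    Iy, Ix = np.gradient(I1.astype(float))
    It = I2.astype(float) - I1
    mag = np.hypot(Ix, Iy)
    safe = np.maximum(mag, eps)
    speed = np.where(mag > eps, -It / safe, 0.0)   # set to 0 where the image is flat
    return speed * Ix / safe, speed * Iy / safe, speed

def ramp(u, v, a=0.3, b=0.4, size=32):
    """Linear ramp a*x + b*y translated by (u, v)."""
    y, x = np.mgrid[0:size, 0:size].astype(float)
    return a * (x - u) + b * (y - v)

I1 = ramp(0.0, 0.0)
# Two different true motions with the same projection onto the gradient (0.6, 0.8).
un_a, vn_a, speed_a = normal_flow(I1, ramp(1.0, 0.5))
un_b, vn_b, speed_b = normal_flow(I1, ramp(0.2, 1.1))
print(un_a[0, 0], vn_a[0, 0], speed_a[0, 0])
print(np.allclose(un_a, un_b), np.allclose(vn_a, vn_b))
```

Both motions give a normal flow of about $(0.6, 0.8)$ with unit speed: the data pin down the point of the constraint line closest to the origin and nothing more.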

▸7.6.3 The aperture problem⧉

• **the statement**: from a *single* pixel (one OFCE) you can recover **only the normal flow** — the motion **perpendicular to the local edge**. The motion **along** the edge is unobservable. *One equation, two unknowns.* [`fig-aperture-problem`]

• **the intuition (the "aperture")**: watch a moving straight edge through a small hole. A long diagonal bar sliding behind the aperture looks the same whether it moves horizontally, vertically, or perpendicular to itself, provided the three motions share the same component normal to the bar; you cannot tell them apart. Only motion that **changes what's inside the aperture** (the ⟂ component) is visible. The **barber-pole** illusion (stripes appear to move *up* though the pole spins sideways) is the same effect.

• **why it's not a bug**: the math is *honestly reporting* a genuine ambiguity — the data truly don't determine the tangential motion locally. So "solving" flow = **adding a second assumption** to supply the missing equation. The two classic choices:

• **a local one** — neighbors share the motion → **Lucas–Kanade**;

• **a global one** — the field is smooth → **Horn–Schunck**.

• **the perceptual flip side**: human vision suffers the aperture problem too (we mis-perceive these stimuli), and resolves it by **integrating across the object** (corners, line-endings) — exactly what the algorithms do by aggregating a window or imposing smoothness.

▸7.6.4 Lucas–Kanade — local constant-flow least squares (and the structure tensor)⧉

• **the assumption**: pixels in a **small window** $W$ all share the **same** motion $(u,v)$. Now we have **many** OFCEs (one per pixel in $W$) for the **two** unknowns — over-determined → **least squares** resolves the aperture problem *and* regularizes against noise. (Lucas & Kanade 1981.)

• **stack the equations**: with $n$ pixels in $W$, form

$$A = \begin{bmatrix} I_x^{(1)} & I_y^{(1)} \\ \vdots & \vdots \\ I_x^{(n)} & I_y^{(n)} \end{bmatrix},\quad b = \begin{bmatrix} I_t^{(1)} \\ \vdots \\ I_t^{(n)} \end{bmatrix},\qquad A\begin{bmatrix}u\\v\end{bmatrix} = -b.$$

The least-squares solution solves the **normal equations**:

$$\boxed{\,A^\top A \begin{bmatrix}u\\v\end{bmatrix} = -A^\top b\,},\qquad A^\top A = \sum_W \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix},\quad A^\top b = \sum_W \begin{bmatrix} I_x I_t \\ I_y I_t\end{bmatrix}.$$

• 💡 **the matrix $A^\top A$ is the Harris structure tensor** (the second-moment matrix). The very same object decides corner detection ([[Automatic panorama stitching from multiple views and feature matching]]) and flow conditioning — **flow is solvable exactly where Harris finds a corner**:

• **two large eigenvalues** (a **corner** / well-textured patch): $A^\top A$ invertible, full 2-D flow recoverable, well-conditioned. *These are the "good features to track."*

• **one large eigenvalue** (an **edge**): rank-deficient → only **normal flow** (the aperture problem, *re-derived* — along-edge motion is the null direction).

• **both small** (a **flat** region): no information; flow indeterminate.

This is the bridge from dense flow to **sparse feature tracking** (KLT): just track the pixels where $A^\top A$ is well-conditioned. [`fig-flow-corner-edge-flat`]

• **scope & relation to HS**: LK assumes **small motion** ($\sim$1 px, from the linearization) and a window large enough to be informative but small enough to be locally constant. It can be read as a **per-patch, one-step Horn–Schunck with a hard "constant flow" smoothness** — local and direct, vs HS's global and iterative. Extended to large motion via **coarse-to-fine pyramids** and iterative warping. LK is also the engine of **parametric image alignment** (register two images under one affine/homography — same gradient least-squares; cross-ref [[Image alignment]]). [`fig-LK-vs-HS`]
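The stacked system above maps directly onto a least-squares call. The sketch below is one way to write a single LK solve for one window, assuming a small sub-pixel shift and synthetic analytic images; `lucas_kanade_window`, `texture`, and `edge_only` are hypothetical names. It also returns the eigenvalues of $A^\top A$ so the corner and edge cases can be compared.

```python
import numpy as np

def lucas_kanade_window(I1, I2, cx, cy, half=7):
    """One LK solve for the window centred at (cx, cy)."""
    Iy, Ix = np.gradient(I1)                       # returns d/drow, d/dcol
    It = I2 - I1                                   # temporal gradient
    win = (slice(cy - half, cy + half + 1), slice(cx - half, cx + half + 1))
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)   # n x 2
    b = It[win].ravel()                                        # n
    uv, *_ = np.linalg.lstsq(A, -b, rcond=None)    # least squares for A [u v]^T = -b
    return uv, np.linalg.eigvalsh(A.T @ A)         # eigenvalues in ascending order

def texture(x, y):
    """Gradients in many directions: a corner-like, well-textured patch."""
    return np.sin(0.2 * x) + np.sin(0.2 * y) + 0.5 * np.sin(0.1 * (x + y))

def edge_only(x, y):
    """Vertical stripes: gradients in x only, an edge-like patch."""
    return np.sin(0.2 * x)

y, x = np.mgrid[0:64, 0:64].astype(float)
u_true, v_true = 0.3, -0.2

uv_corner, eig_corner = lucas_kanade_window(
    texture(x, y), texture(x - u_true, y - v_true), 32, 32)
uv_edge, eig_edge = lucas_kanade_window(
    edge_only(x, y), edge_only(x - u_true, y - v_true), 32, 32)
print(uv_corner, eig_corner)
print(uv_edge, eig_edge)
```

On the textured patch both eigenvalues are well away from zero and the solve recovers roughly $(0.3, -0.2)$. On the stripes the smaller eigenvalue is zero, and `lstsq` falls back to the minimum-norm answer, which is the normal flow: the horizontal component is recovered and the vertical one is reported as zero.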

▸7.6.5 Horn–Schunck — global smoothness regularization⧉

• **the assumption**: instead of constancy in a window, impose that the flow field is **globally smooth** — *neighboring pixels have similar motion*. This fills in the unconstrained (aperture / flat) directions from the data-rich ones, propagating motion into textureless regions. (Horn & Schunck 1981.)

• **the energy** (data-fit + prior — the classic inverse-problem shape):

$$E(u,v) = \iint \underbrace{\big(I_x\,u + I_y\,v + I_t\big)^2}_{\text{data: brightness constancy}} \;+\; \lambda\,\underbrace{\big(\lVert\nabla u\rVert^2 + \lVert\nabla v\rVert^2\big)}_{\text{prior: smoothness}}\;\,dx\,dy.$$

The **data term** is the OFCE residual (penalizes violating brightness constancy); the **smoothness term** penalizes spatial *variation* of the flow; $\lambda$ trades them off (large $\lambda$ → smoother, more diffused; small $\lambda$ → trusts the data, noisier).

• **it's convex / linear to solve**: quadratic in $(u,v)$ → the Euler–Lagrange equations are **linear** (a large sparse system coupling all pixels — a screened-Poisson-like diffusion). Solve by **iterative** updates (Gauss–Seidel / Jacobi): each pixel's flow relaxes toward its **neighborhood average** corrected by the local brightness constraint. Globally smooth, fills the apertures.

• **LK vs HS — the contrast to land**: **local window + least squares** (LK) vs **global field + smoothness prior** (HS); **one-step** vs **iterative**; **constant-in-a-patch** vs **smoothly-varying-everywhere**. Two ways to supply the aperture problem's missing equation — *aggregate locally* or *regularize globally*. [`fig-LK-vs-HS`]

• **modern classical flow** — the *Secrets* paper (Sun, Roth & Black 2010): the bones are still HS, but what actually wins is **robust** (non-quadratic, $L_1$/Charbonnier) penalties on both terms (so edges/occlusions don't over-smooth), **median filtering** of the flow between iterations, **coarse-to-fine pyramids**, and **edge-aware / non-local** smoothness. The lesson: the *energy is the design* — same skeleton, better data/prior terms.
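A compact sketch of the Horn–Schunck relaxation, with assumptions stated: the Laplacian is approximated by $4(\bar u - u)$ using a plain 4-neighbour mean (so the constant `4 * lam` appears in the denominator), borders are replicated, and the test images are a synthetic texture with a known constant shift. `neighbor_avg`, `horn_schunck`, and the default `lam` and `iters` are illustrative.

```python
import numpy as np

def neighbor_avg(f):
    """Mean of the four axis neighbours, with replicated borders."""
    p = np.pad(f, 1, mode="edge")
    return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])

def horn_schunck(I1, I2, lam=0.01, iters=500):
    """Jacobi-style iterations on the Euler-Lagrange equations of the HS energy."""
    Iy, Ix = np.gradient(I1)                       # returns d/drow, d/dcol
    It = I2 - I1
    u, v = np.zeros_like(I1), np.zeros_like(I1)
    denom = 4.0 * lam + Ix ** 2 + Iy ** 2
    for _ in range(iters):
        ub, vb = neighbor_avg(u), neighbor_avg(v)  # neighbourhood averages
        r = (Ix * ub + Iy * vb + It) / denom       # scaled OFCE residual of the average
        u, v = ub - Ix * r, vb - Iy * r            # relax toward the constraint line
    return u, v

def texture(x, y):
    return np.sin(0.2 * x) + np.sin(0.2 * y) + 0.5 * np.sin(0.1 * (x + y))

y, x = np.mgrid[0:64, 0:64].astype(float)
I1, I2 = texture(x, y), texture(x - 0.3, y + 0.2)  # true flow is (0.3, -0.2)
u, v = horn_schunck(I1, I2, lam=0.01, iters=1000)
print(u[8:-8, 8:-8].mean(), v[8:-8, 8:-8].mean())
```

Each update is the neighbourhood average pulled back toward the local constraint line, so pixels with weak gradient inherit their flow from better-constrained neighbours. Raising `lam` makes that inheritance stronger and the field smoother.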

▸7.6.6 Large motion: coarse-to-fine warping⧉

• **the problem**: the Taylor linearization (hence both LK and HS) assumes **sub-pixel** motion. Real motion is many pixels — the linearization is invalid and the solver diverges or finds the wrong (aliased) match.

• **the fix — pyramids** (cross-ref [[Linear pyramids and wavelets]]): build a **Gaussian/Laplacian pyramid** of both frames. On the **coarsest** (heavily blurred & downsampled) level, even large motion is only a fraction of a pixel → the linear method works. Then, going finer:

1. **upsample** the coarse flow to the next level (and scale its magnitude);

2. **warp** frame 2 toward frame 1 by the current flow estimate ([[Warping and resampling]]);

3. estimate only the **residual** (now small) flow on the warped pair, and **add** it to the running estimate.

Repeat to the finest level. [`fig-coarse-to-fine-flow`]

• **why it's the right picture**: this is a **Laplacian-pyramid** view of flow — get the *low-frequency / large-scale* motion on blurry images, then add the *high-frequency* motion as a residual at each finer scale. The warp-then-refine loop is exactly the iterative scheme that **RAFT** and PWC-Net later learn.

• **pseudocode**: a coarse-to-fine block (loop levels: upsample-and-scale `flow` → `warp(frame2, flow)` → estimate residual `ΔLK` → add) makes the warp-then-refine loop explicit — the scheme RAFT later neuralizes.

• **a different attack — the cost volume**: rather than linearize, **match** patches directly. For each pixel in frame 1, score similarity (SSD / correlation) against candidate offsets in frame 2 → a **cost volume** (4-D: $(x_1,y_1)\times(x_2,y_2)$, or per-displacement). This keeps brightness constancy but **drops the linearization** (more "Lagrangian"), handling large motion at the cost of a big search — pruned by **coarse-to-fine** and (later) learned features. The cost volume is the centerpiece of learned flow.
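The upsample, warp, refine loop from the numbered steps can be sketched for the simplest possible flow model: one global translation per image, so "upsampling the flow" reduces to doubling a 2-vector. Assumptions in this sketch: 2×2 block averaging stands in for a proper Gaussian pyramid, images are at least a few dozen pixels wide at the coarsest level, and `down2`, `refine_translation`, `coarse_to_fine`, `margin`, and `levels` are hypothetical names and defaults.

```python
import numpy as np

def bilinear(img, ys, xs):
    """Sample img at float (row, col) coordinates, clamping to the border."""
    h, w = img.shape
    ys, xs = np.clip(ys, 0, h - 1), np.clip(xs, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    fy, fx = ys - y0, xs - x0
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bot = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bot * fy

def down2(img):
    """Halve resolution by averaging 2x2 blocks (a crude blur-and-decimate)."""
    h, w = img.shape
    img = img[:h - h % 2, :w - w % 2]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def refine_translation(I1, I2, d, iters=10, margin=8):
    """Iterated LK for one global shift d = (dx, dy), starting from the given d."""
    Iy, Ix = np.gradient(I1)
    h, w = I1.shape
    inner = (slice(margin, h - margin), slice(margin, w - margin))
    ys, xs = np.mgrid[margin:h - margin, margin:w - margin].astype(float)
    gx, gy, T = Ix[inner], Iy[inner], I1[inner]
    M = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    d = np.asarray(d, float).copy()
    for _ in range(iters):
        It = bilinear(I2, ys + d[1], xs + d[0]) - T   # warp frame 2, then difference
        d += np.linalg.solve(M, -np.array([np.sum(gx * It), np.sum(gy * It)]))
    return d

def coarse_to_fine(I1, I2, levels=3):
    pyr = [(I1, I2)]
    for _ in range(levels - 1):
        pyr.append((down2(pyr[-1][0]), down2(pyr[-1][1])))
    d = np.zeros(2)
    for lvl, (a, b) in enumerate(reversed(pyr)):      # coarsest level first
        if lvl > 0:
            d *= 2.0                                  # upsample-and-scale the estimate
        d = refine_translation(a, b, d)               # estimate and add the residual
    return d

def scene(x, y):
    return np.sin(0.15 * x) + np.sin(0.15 * y) + 0.5 * np.sin(0.1 * (x + y) + 1.0)

y, x = np.mgrid[0:128, 0:128].astype(float)
I1, I2 = scene(x, y), scene(x - 6.0, y + 4.0)         # content moves by (+6, -4)
d = coarse_to_fine(I1, I2)
print(np.round(d, 3))
```

At the coarsest of the three levels the 6-pixel shift is only 1.5 pixels, which is where the first solve happens; each finer level only has to correct what the level above left behind. A dense version would carry a flow image instead of the vector `d` and resample it when moving to the next level.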

▸7.6.7 Learned flow — RAFT (neuralize the classical pipeline)⧉

• **the framing (L8)**: keep the classical *skeleton* — match (data term) + iterative update (with smoothness/context) — but **learn** the pieces instead of hand-designing them. (Forward-ref [[Machine learning]]: a learned operator swaps a hand-built prior/feature for one fit to data.)

• **RAFT** (Teed & Deng 2020, ECCV best paper) — the three learned ingredients on the classical scaffold: [`fig-raft-skeleton`]

1. **learned per-pixel features** replace raw RGB in the matching (robust to lighting, texture) — plus a separate **context** encoder of frame 1 for edge-aware regularization;

2. an **all-pairs 4-D correlation (cost) volume** — precompute similarity between every pair of pixels, pooled into a **pyramid** for multi-scale lookup (the cost-volume idea, made efficient);

3. a **recurrent update operator** (GRU, shared weights) that **iteratively refines** the flow field — at each step it **looks up** correlations at the current flow estimate and predicts an increment, exactly **mimicking an iterative optimizer** (the unrolled HS/coarse-to-fine loop), but with the update *learned*. All differentiable, trained end-to-end.
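To make ingredient 2 concrete, here is a NumPy sketch of an all-pairs correlation volume. It is an illustration only: random vectors stand in for learned features, the $1/\sqrt{D}$ scaling is an assumption of this sketch, the multi-scale pooling is omitted, and a plain per-pixel argmax (`argmax_flow`, a hypothetical helper) replaces RAFT's learned recurrent update.

```python
import numpy as np

def correlation_volume(f1, f2):
    """All-pairs scaled dot products between two (H, W, D) feature maps.

    Returns an (H, W, H, W) array: entry [i, j, k, l] compares pixel (i, j)
    of frame 1 with pixel (k, l) of frame 2.
    """
    return np.einsum("ijd,kld->ijkl", f1, f2) / np.sqrt(f1.shape[-1])

def argmax_flow(corr):
    """Stand-in for the learned update: take the best match for every pixel."""
    H, W = corr.shape[:2]
    best = corr.reshape(H, W, -1).argmax(axis=-1)
    y2, x2 = np.unravel_index(best, corr.shape[2:])
    y1, x1 = np.mgrid[0:H, 0:W]
    return x2 - x1, y2 - y1                        # (u, v) per pixel

rng = np.random.default_rng(0)
f1 = rng.standard_normal((12, 16, 128))            # stand-in for learned features
f2 = np.roll(f1, shift=(2, -3), axis=(0, 1))       # content moves by (-3, +2) in (x, y)
corr = correlation_volume(f1, f2)
u, v = argmax_flow(corr)
print(corr.shape, u[0, 5], v[0, 5])
```

The volume grows with the square of the pixel count, which is why it is usually built on downsampled feature maps. Pixels whose match wraps around the border under `np.roll` are not meaningful in this toy setup.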

• **why flow got the CNN treatment late**: flow is **non-local** (a pixel can match anywhere), which is hard for ordinary convolutions; the explicit **cost volume** is the inductive bias that makes non-local matching tractable. Training data is scarce (real flow ground truth is hard to obtain), so synthetic sets (FlyingChairs/Things, Sintel), the sparse real-world ground truth of KITTI, and **self-supervised warping losses** (SMURF) do heavy lifting.

• **the through-line to land**: RAFT is the classical pipeline **with its hand-crafted parts (features, the iterative minimizer) replaced by learned ones** — *scaffolding from traditional methods, weights from data*. It is **not** "bitter-lessoned" into a generic transformer: the cost-volume + recurrent-update **inductive bias still wins**, which is itself the lesson — non-local matching is hard enough that problem-specific structure still pays. (Lineage to note: FlowNet 2015, the first CNN flow; PWC-Net 2018, the *pyramid+warping+cost-volume* "Lucas–Kanade-flavored" net; transformer variants for the cost volume.)

• **closing hand-off**: with a dense, automatic correspondence field in hand, the **transport** tasks open up — warp/interpolate between frames for **slow-motion** ([[Frame interpolation and slow-motion synthesis]]), stabilize, compress, magnify motion ([[Morphing]] — motion magnification), or lift to **scene flow / SfM** (the 3-D cousins). *Correspondence first, then transport* — the part's spine.

7.7 Deep learning approaches to optical flow⧉