
7.8 Deep learning approaches to optical flow

The learned-flow story is best told as a single pattern, not a parade of network names. The pattern is this: take the classical iteration of Optical flow and unroll it into a differentiable network trained end to end, so the hand-tuned data term, smoothness regularizer, and coarse-to-fine schedule become learned. Read against the previous section, almost nothing about the structure changes — there are still features to match, still a cost surface that scores candidate displacements, still an iterative update that refines the field, still a coarse-to-fine reach for large motion. What changes is who designs the operators: a human, by derivation and tuning, or gradient descent, from data. Hold that contrast steady; it is the whole section.

The lineage makes the point concrete. FlowNet ([@dosovitskiy-etal-2015]) was the proof of concept: the first convolutional network to regress a dense flow field directly, showing the problem was learnable at all (its successor FlowNet 2.0, [@ilg-etal-2017], stacked several such networks and refined the result to make it accurate). PWC-Net ([@sun-etal-2018], at NVIDIA) made the lineage explicit by folding the classical coarse-to-fine pipeline (Pyramid, Warping, Cost-volume) into one compact, fast network: the same warp-then-refine loop from the last section, now with learned features and a learned per-level update. RAFT (Teed and Deng 2020) is the design that came to dominate: build a four-dimensional all-pairs correlation volume, holding every pixel's similarity to every pixel of the other frame, and apply a recurrent update operator that behaves like an unrolled iterative optimizer refining the field. Finally, transformer flows, FlowFormer ([@huang-etal-2022]) and GMFlow ([@xu-etal-2022]), reframe the whole thing as global matching, replacing the local recurrent update with attention. Throughout, the skeleton is the classical one; learning supplies the operators.

💡 Big lesson (L8, recurrence)

A learned operator swaps a hand-designed piece for one fit to data — the neuralize-the-classical-iteration move. Modern flow networks are not free-form regressors that dissolved the problem into a generic stack of layers. They keep the classical algorithm's inductive bias — features, a cost volume, an iterative update, coarse-to-fine — as architecture, and learn only the operators inside that scaffold. That is exactly why they beat both the hand-built original (data tunes every term) and a structure-free CNN (the bias makes flow well-posed). The same move powered the learned point trackers of Feature tracking and gets its general framing in Deep learning. The slogan to carry: scaffolding from the traditional method, weights from data.

7.8.1 The unrolling principle — neuralize the classical solver

Start with the move itself, because once you see it the rest of the section is variations on one theme. A classical flow method iterates an optimizer: compute a data or gradient term from brightness constancy, take a step, repeat until the field settles. Write that step in its plainest form,

$$ \mathbf f^{(k+1)} \;=\; \mathbf f^{(k)} \;-\; \eta\,\nabla E\big(\mathbf f^{(k)}\big), $$

where $\mathbf f=(u,v)$ is the flow field, $E$ is the data-plus-smoothness energy of Horn–Schunck, and $\eta$ is a step size. Unrolling takes this loop and straightens it: instead of running an opaque optimizer to convergence, lay out a fixed number of iterations as a stack of differentiable blocks, replace the hand-tuned pieces inside each block — the data term, the smoothness prior, the step size, the coarse-to-fine schedule — with learned operators, and train the entire stack end to end against ground-truth flow. The gradient step becomes a learned update,

$$ \mathbf f^{(k+1)} \;=\; \mathbf f^{(k)} \;+\; \mathrm{GRU}_\theta\big(\mathbf f^{(k)},\ \text{corr lookup}\big), $$

in which a small recurrent network $\mathrm{GRU}_\theta$, reading the cost volume at the current estimate, stands in for the gradient step. The classical solver computed a descent direction by differentiating a fixed energy; the unrolled solver predicts an increment from learned weights. Same loop shape, learned interior (Figure 7.8.1).
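The two update rules above can be put side by side in a few lines. What follows is a minimal sketch under stated assumptions, not any published network: the field is one-dimensional, the energy is a toy quadratic standing in for the Horn–Schunck data and smoothness terms, and the names `energy_grad`, `classical_solver`, and `unrolled_solver` are hypothetical. Each unrolled block here reuses the hand-derived gradient step, purely to show that the classical step is one particular choice of update; in a trained network each block would be a learned module.

```python
import numpy as np

def energy_grad(f, d, lam):
    """Gradient of the toy energy E(f) = 0.5*||f - d||^2 + 0.5*lam*||D f||^2.

    d is a toy data target and D is the forward difference. Both are
    illustrative stand-ins for the data and smoothness terms.
    """
    diff = np.diff(f)              # D f, length n - 1
    smooth = np.zeros_like(f)
    smooth[:-1] -= diff            # accumulate D^T (D f)
    smooth[1:] += diff
    return (f - d) + lam * smooth

def classical_solver(f0, d, lam, eta, tol=1e-8, max_iter=10_000):
    """Hand-tuned loop: gradient descent until the step is negligible."""
    f = f0.copy()
    for _ in range(max_iter):
        step = eta * energy_grad(f, d, lam)
        f = f - step
        if np.max(np.abs(step)) < tol:
            break
    return f

def unrolled_solver(f0, update_fns):
    """Unrolled twin: a fixed stack of blocks, each applying f <- f + update(f)."""
    f = f0.copy()
    for update in update_fns:
        f = f + update(f)
    return f

d = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
f0 = np.zeros_like(d)
lam, eta, K = 0.5, 0.2, 12

# Assumption for illustration: every block is the hand-derived gradient step.
hand_blocks = [lambda f: -eta * energy_grad(f, d, lam)] * K
f_unrolled = unrolled_solver(f0, hand_blocks)
f_classical = classical_solver(f0, d, lam, eta)
```

With these blocks, `f_unrolled` is exactly `K` classical iterations and sits closer to the converged `f_classical` than the starting field does. Replacing the lambda with a trained module, and backpropagating a loss on the final field through all `K` blocks, is the unrolling move described above.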

Why does this win over both alternatives? Because it keeps the inductive bias of the classical algorithm (the very structure that makes flow well-posed in the first place) while letting data tune every operator. A free-form CNN that maps two frames to a flow field directly has to rediscover matching, regularization, and multi-scale reasoning from scratch, with no built-in notion that flow is a correspondence problem, and so it tends to generalize poorly beyond its training data. The hand-built original has the right structure but the wrong constants: its penalties and schedules were tuned by intuition. Unrolling takes the structure from one and the operators from the other. This is the recurrence of L8, and it is not special to flow: it is the same pattern behind the learned point trackers of Feature tracking (initialize a trajectory, refine it iteratively, predict visibility; only the operator is learned) and the general thesis of Deep learning.

fig-flow-unrolling
Figure 7.8.1. The pattern in one diagram. Left: a classical iterative solver drawn as a loop — compute the data/gradient term $\nabla E(\mathbf f^{(k)})$, take a step $\mathbf f^{(k+1)}=\mathbf f^{(k)}-\eta\,\nabla E$, repeat — with the data term, smoothness regularizer, and step size annotated as hand-tuned. Right: its unrolled twin — the same loop straightened into a fixed stack of differentiable blocks, each block now a learned module $\mathbf f^{(k+1)}=\mathbf f^{(k)}+\mathrm{GRU}_\theta(\mathbf f^{(k)},\text{corr})$, with the same three pieces re-annotated as now learned and the whole stack trained end to end. The boxes line up one-to-one: the gradient step on the left is the GRU update on the right.

7.8.2 Cost volumes and warping inside the net

Two of the classical parts deserve their own look, because they are where matching actually happens inside the network: the cost volume and the warp.

The cost (correlation) volume is the learned analogue of the brute-force SSD search surface from image alignment and the classical patch-matching of Optical flow. The classical method, for a pixel in frame 1, scored its sum-of-squared-differences against every candidate offset in frame 2 and took the argmin. The learned version does the same thing on learned features rather than raw RGB: it computes, for each pixel in frame 1, the correlation (a dot product of feature vectors) against candidate pixels in frame 2, either over a local neighbourhood (PWC-Net) or over all pixels (RAFT's four-dimensional all-pairs volume). The result is a similarity map per pixel whose peak is the match (Figure 7.8.2). The decisive difference is what happens next: the classical method took the argmin of its SSD surface, committing hard to the single best-scoring offset; the network instead reads the volume. It hands the whole similarity map to the update operator, which can weigh a sharp peak against an ambiguous ridge, fuse it with context, and refine softly. Using learned features here is exactly what lets a match survive the lighting changes and weak texture that break brightness constancy on raw pixels. It is also worth naming why flow got the deep-learning treatment relatively late: matching is intrinsically non-local (a pixel can correspond to anywhere in the other frame), which ordinary convolutions, with their small receptive fields, handle badly. The explicit cost volume is the inductive bias that makes non-local matching tractable, and every method here keeps it in some form.
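To make the two flavours of volume concrete, here is a small NumPy sketch. It is an illustration under assumptions rather than either network's implementation: the "learned" features are just random unit vectors, frame 2 is frame 1 circularly shifted by a known displacement, and the function names `all_pairs_correlation` and `local_correlation` are hypothetical. The scaling by $1/\sqrt{C}$ is a common normalization chosen here for illustration.

```python
import numpy as np

def all_pairs_correlation(feat1, feat2):
    """4-D volume: corr[i, j, k, l] = <feat1[i, j], feat2[k, l]> / sqrt(C)."""
    c = feat1.shape[-1]
    return np.einsum("ijc,klc->ijkl", feat1, feat2) / np.sqrt(c)

def local_correlation(feat1, feat2, radius):
    """Local volume: scores for offsets (dy, dx) in [-radius, radius]^2.

    Out-of-bounds candidates are scored as -inf so they can never win.
    """
    h, w, c = feat1.shape
    n = 2 * radius + 1
    out = np.full((h, w, n, n), -np.inf)
    for a, dy in enumerate(range(-radius, radius + 1)):
        for b, dx in enumerate(range(-radius, radius + 1)):
            ys = slice(max(0, -dy), min(h, h - dy))   # frame-1 rows with a valid candidate
            xs = slice(max(0, -dx), min(w, w - dx))
            yt = slice(max(0, dy), min(h, h + dy))    # the matching frame-2 rows
            xt = slice(max(0, dx), min(w, w + dx))
            dots = np.sum(feat1[ys, xs] * feat2[yt, xt], axis=-1)
            out[ys, xs, a, b] = dots / np.sqrt(c)
    return out

# Toy data (assumption): random unit-norm features, frame 2 = frame 1 shifted.
rng = np.random.default_rng(0)
feat1 = rng.standard_normal((6, 7, 16))
feat1 /= np.linalg.norm(feat1, axis=-1, keepdims=True)
feat2 = np.roll(feat1, shift=(1, 2), axis=(0, 1))    # true displacement (dy, dx) = (1, 2)

corr = all_pairs_correlation(feat1, feat2)           # shape (6, 7, 6, 7)
local = local_correlation(feat1, feat2, radius=2)    # shape (6, 7, 5, 5)

# The classical move: commit to the single best candidate for pixel (y, x).
y, x = 1, 3
match_yx = np.unravel_index(np.argmax(corr[y, x]), corr.shape[2:])
offset_idx = np.unravel_index(np.argmax(local[y, x]), local.shape[2:])
offset = (offset_idx[0] - 2, offset_idx[1] - 2)      # convert window index to (dy, dx)
```

Here `match_yx` is the absolute position of the best match in frame 2 and `offset` is the same answer expressed as a displacement. A network would not stop at the argmax: it would pass the whole map `corr[y, x]` (or `local[y, x]`) on to its update operator.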

Warping closes the loop between the volume and the iteration, and it is the move that makes PWC-Net so transparently the classical pipeline. Before correlating at a given pyramid level, PWC-Net warps the frame-2 features toward frame 1 by the current flow estimate, so that only a small residual motion remains to be measured — which means a local cost volume, computed over a tiny neighbourhood, suffices. This is precisely the classical coarse-to-fine warp-then-refine loop, now differentiable and living inside the network: align by what you know, measure the leftover, add it back. And the pyramid survives as the network's own multi-scale structure — large motion captured cheaply at the coarse top, refined as you descend — except that the prefilter and the per-level update are learned rather than a fixed Gaussian blur and a fixed least-squares step. PWC-Net is, almost literally, Lucas–Kanade's coarse-to-fine schedule with every hand-built part swapped for a trained one.
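The "align by what you know, measure the leftover" step can be sketched as a backward warp with bilinear sampling. This is a minimal NumPy illustration, assuming channel-last feature maps, a flow stored as $(u, v)$ per pixel, and border clamping for samples that fall outside the image; the function name `warp_backward` is hypothetical and is not PWC-Net's own code.

```python
import numpy as np

def warp_backward(feat2, flow):
    """Sample feat2 at (y + v, x + u) with bilinear interpolation.

    feat2: (H, W, C) features of frame 2.
    flow:  (H, W, 2) current estimate, flow[..., 0] = u (x), flow[..., 1] = v (y).
    Sample positions are clamped to the image border (an illustrative choice).
    """
    h, w, _ = feat2.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]                          # fractional parts as blend weights
    wy = (y - y0)[..., None]
    top = (1 - wx) * feat2[y0, x0] + wx * feat2[y0, x1]
    bot = (1 - wx) * feat2[y1, x0] + wx * feat2[y1, x1]
    return (1 - wy) * top + wy * bot

# Toy data (assumption): frame 2 is frame 1 shifted, so the true flow is u = 2, v = 1.
rng = np.random.default_rng(0)
feat1 = rng.standard_normal((8, 9, 4))
feat2 = np.roll(feat1, shift=(1, 2), axis=(0, 1))

flow_true = np.tile([2.0, 1.0], (8, 9, 1))
aligned = warp_backward(feat2, flow_true)             # matches feat1 away from the border

flow_coarse = np.tile([2.0, 0.0], (8, 9, 1))          # estimate is one pixel short in v
partly_aligned = warp_backward(feat2, flow_coarse)    # only a 1-pixel residual remains
```

Warping by the true flow reproduces `feat1` in the interior, while warping by the coarse estimate leaves a one-pixel vertical offset. That leftover is what a small local cost volume then has to find, which is why a tiny search neighbourhood is enough after the warp.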

fig-cost-volume
Figure 7.8.2. The match surface, learned. Two frames are each passed through a shared encoder to feature maps. For one pixel in frame 1 (marked), its correlation is computed against candidates in frame 2 — over a local neighbourhood (PWC-Net, after warping by the current flow) or against all pixels (RAFT's 4-D all-pairs volume) — producing a per-pixel similarity map. The map's peak (bright) is the most likely displacement; its shape encodes confidence — a sharp peak at a corner, a ridge along an edge (the aperture problem, visible in the volume). It is the learned analogue of the classical SSD search surface, except the network reads the whole map rather than taking its argmax.

7.8.3 RAFT and the recurrent update

RAFT is the design that came to dominate, and it is best read as the unrolling principle taken to its clean limit (Figure 7.8.3). Three classical parts are neuralized and arranged just so. First, learned features: a shared encoder turns both frames into per-pixel feature maps for matching, and a separate context encoder of frame 1 supplies the edge-aware regularization that Horn–Schunck got from a smoothness prior, so the data term and the prior are each now a learned network. Second, an all-pairs correlation volume: RAFT computes the four-dimensional volume of similarities between every pair of pixels once, up front, and pools it into a small pyramid for cheap multi-scale lookup. Third, a recurrent update operator: a GRU with tied weights that, at each step, looks up correlations around the current flow estimate, combines them with the context features, and emits a refinement $\Delta\mathbf f$. This is the learned update standing in for the gradient step, applied over and over with the same shared weights: the unrolled iterative optimizer made literal.
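The pooling step in the second part is simple enough to sketch directly. The version below is an assumption-laden illustration: it average-pools only the two frame-2 axes of a 4-D volume by a factor of two per level, keeps the frame-1 axes at full resolution, and drops an odd trailing row or column. The function name `correlation_pyramid` and the `num_levels` parameter are hypothetical.

```python
import numpy as np

def correlation_pyramid(corr, num_levels=3):
    """Average-pool the last two (frame-2) axes of a 4-D volume by 2 per level."""
    pyramid = [corr]
    for _ in range(num_levels - 1):
        c = pyramid[-1]
        h, w, h2, w2 = c.shape
        h2e, w2e = h2 - h2 % 2, w2 - w2 % 2           # drop an odd trailing row/column
        blocks = c[:, :, :h2e, :w2e].reshape(h, w, h2e // 2, 2, w2e // 2, 2)
        pyramid.append(blocks.mean(axis=(3, 5)))      # mean over each 2x2 block
    return pyramid

# Toy volume (assumption): random similarities for a 4x5 frame 1 and an 8x8 frame 2.
rng = np.random.default_rng(0)
corr = rng.standard_normal((4, 5, 8, 8))
pyramid = correlation_pyramid(corr, num_levels=3)
shapes = [level.shape for level in pyramid]           # (4,5,8,8), (4,5,4,4), (4,5,2,2)
```

Because only the candidate axes shrink, every frame-1 pixel keeps its own similarity map at each level. A lookup window of fixed radius at a coarser level therefore covers a larger span of candidate displacements at no extra cost per lookup.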

Two design choices explain why RAFT spread far beyond flow. The first is that it has no coarse-to-fine pyramid of the field at all — the recurrence does the refining, not a cascade of resolutions. Because the update weights are tied across iterations, you can run any number of refinement steps at test time: more steps buy more accuracy, and the same module trained at one budget generalizes to another. That shared-weight recurrence is exactly an unrolled optimizer that never fixed its iteration count. The second is modularity: "learned features → all-pairs correlation volume → recurrent update" is a self-contained recipe that drops, almost unchanged, into neighbouring correspondence problems. The RAFT-everything spinoffs — RAFT-Stereo for disparity, scene-flow and tracking variants — are the same skeleton pointed at a different match. The classical idea was not replaced so much as absorbed into a reusable module.
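The lookup-then-update recurrence, and the freedom to pick the iteration count at test time, can be shown on a one-dimensional toy. This is a sketch under heavy assumptions and not RAFT itself: the features are a hand-made smooth encoding so that similarity falls off with distance from the true match, the flow is kept integer-valued, and `toy_update` is a hard-coded rule (step to the best-scoring offset in the window) standing in for the trained GRU. All function names are hypothetical.

```python
import numpy as np

def lookup(corr, flow, radius):
    """Read corr[x, x + flow[x] + r] for r in [-radius, radius]; -inf if out of bounds."""
    n = corr.shape[0]
    offsets = np.arange(-radius, radius + 1)
    pos = np.arange(n)[:, None] + flow[:, None] + offsets[None, :]
    valid = (pos >= 0) & (pos < n)
    scores = corr[np.arange(n)[:, None], np.clip(pos, 0, n - 1)]
    return np.where(valid, scores, -np.inf), offsets

def toy_update(scores, offsets):
    """Stand-in for the learned GRU: step to the best-scoring offset in the window."""
    return offsets[np.argmax(scores, axis=1)]

def refine(corr, num_iters, radius=2):
    """Apply the same update (tied parameters) num_iters times from zero flow."""
    flow = np.zeros(corr.shape[0], dtype=int)
    for _ in range(num_iters):
        scores, offsets = lookup(corr, flow, radius)
        flow = flow + toy_update(scores, offsets)
    return flow

# Toy features (assumption): a smooth 2-D encoding of position, so that
# corr[x, x'] = cos(omega * (x' - d - x)) peaks at the true match x' = x + d.
n, d = 32, 7
omega = np.pi / (2 * n)
x = np.arange(n)
feat1 = np.stack([np.cos(omega * x), np.sin(omega * x)], axis=1)
feat2 = np.stack([np.cos(omega * (x - d)), np.sin(omega * (x - d))], axis=1)
corr = feat1 @ feat2.T                                # all-pairs volume, shape (n, n)

interior = x + d < n                                  # pixels whose match is in the frame
flow_2 = refine(corr, num_iters=2)                    # partway there
flow_4 = refine(corr, num_iters=4)                    # reaches the true displacement
flow_8 = refine(corr, num_iters=8)                    # extra steps leave it unchanged
```

With a window radius of 2 and a true displacement of 7, two iterations move interior pixels to a flow of 4, four iterations reach 7, and further iterations keep it there. The same `refine` function serves every budget because nothing in the update depends on the iteration index, which is the property the tied-weight recurrence gives RAFT.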

The transformer flows close the section by changing only the update. FlowFormer ([@huang-etal-2022]) and GMFlow ([@xu-etal-2022]) recast flow as global matching: instead of a local recurrent lookup that nudges the field a little at a time, they apply attention over the cost volume, letting every pixel directly attend to every candidate match in one global step. It is the same reframing that detector-free matchers brought to sparse correspondence — trust a global, attention-computed match surface rather than a local iterative refinement. Strikingly, the cost volume itself stays; what attention replaces is the update rule, not the matching primitive. Even at the transformer end of the lineage, the classical scaffolding — features and a similarity volume — is still doing the load-bearing work.
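The "one global step" idea can be illustrated with a softmax over all candidates followed by an expected match position. The sketch below is written in the spirit of global matching and is not the architecture of FlowFormer or GMFlow: the features are random unit vectors, the temperature value is an arbitrary illustrative choice, and the function name `global_matching_flow` is hypothetical.

```python
import numpy as np

def global_matching_flow(feat1, feat2, temperature=0.01):
    """Softmax over all candidates, then expected match position minus own position."""
    h, w, c = feat1.shape
    corr = feat1.reshape(h * w, c) @ feat2.reshape(h * w, c).T   # (HW, HW) similarities
    logits = corr / temperature
    logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)                      # each row sums to 1
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs, ys], axis=-1).reshape(h * w, 2).astype(float)  # (x, y) per pixel
    return (attn @ grid - grid).reshape(h, w, 2)                 # flow as (u, v)

# Toy data (assumption): random unit-norm features, frame 2 = frame 1 shifted.
rng = np.random.default_rng(0)
feat1 = rng.standard_normal((5, 6, 64))
feat1 /= np.linalg.norm(feat1, axis=-1, keepdims=True)
feat2 = np.roll(feat1, shift=(1, 2), axis=(0, 1))                # true flow: u = 2, v = 1

flow = global_matching_flow(feat1, feat2)                        # shape (5, 6, 2)
```

No iteration is involved: every pixel weighs every candidate at once, and the similarity matrix is still the cost volume in flattened form. For pixels whose match does not wrap around the border, the recovered flow is approximately $(u, v) = (2, 1)$.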

And with a dense, automatic, learned correspondence field finally in hand, the transport half of the part opens up exactly as it did for classical flow: warp and interpolate between frames for slow-motion synthesis (Frame interpolation and slow-motion synthesis), stabilize, compress, or magnify motion. Correspondence first, then transport — the learned flow is the correspondence, produced now by a net that kept the classical algorithm's bones and grew its operators from data.

fig-raft-skeleton
Figure 7.8.3. RAFT: the classical pipeline, neuralized. (1) Learned features — a shared encoder replaces raw RGB for matching, with a separate context encoder of frame 1 supplying edge-aware regularization. (2) An all-pairs 4-D correlation volume precomputes the similarity between every pair of pixels once, pooled into a pyramid for multi-scale lookup. (3) A recurrent (GRU) update operator with tied weights iteratively refines the flow: each step looks up correlations at the current estimate and emits an increment $\mathbf f^{(k+1)}=\mathbf f^{(k)}+\mathrm{GRU}_\theta(\mathbf f^{(k)},\text{corr})$ — a learned iterative optimizer with no coarse-to-fine pyramid. The three blocks map one-to-one onto the classical data term, cost volume, and warp-then-refine loop; transformer flows (FlowFormer/GMFlow) swap block (3)'s local recurrence for global attention.
