12.2 Video compression and motion compensation
The still-image part of this book taught one compression idea over and over: don't store what the viewer won't miss, and don't store what you can predict. JPEG (File formats and compression) throws away high-frequency chroma and coarsely quantizes the high-frequency DCT coefficients because perception won't notice. Video adds a second, far larger redundancy that stills cannot touch: the next frame looks almost exactly like this one. At 30–60 frames per second the camera and the world barely move between frames; most of the picture is already on screen. Coding each frame independently (so-called Motion-JPEG, literally JPEG per frame) ignores this entirely and is wildly wasteful. Real codecs predict each frame from its neighbours and code only the prediction error. That is the whole subject.
Correspondence on a budget: motion compensation is optical flow you can afford. A codec needs, for every block, the same thing Optical flow asks — where did this come from in a frame we already have? — but it does not need a physically correct, dense, sub-pixel flow field. It needs a coarse, block-constant correspondence field that is cheap to estimate and, above all, cheap to transmit, and it chooses that field to minimise total bits, not endpoint error. So decades before learned optical flow, video codecs were already estimating dense-ish correspondence at massive scale — just optimised for rate, not accuracy. The recurring move: reuse what the decoder already has, and code only the difference. It returns inverted in Video magnification, which keeps the temporal difference instead of discarding it, and it is the rate-aware cousin of the correspondence story in Optical flow.
12.2.1 Why video compresses far better than still × N: temporal redundancy
Here is the single observation the chapter stands on. Take two consecutive frames and subtract one from the other. The difference image $I_t - I_{t-1}$ is near-zero almost everywhere — black except at moving edges, occlusions, and lighting changes. That near-black difference image is the only new information the second frame carries; everything else was already on screen and need not be sent again (Figure 12.2.1).
The naive baseline ignores this. Motion-JPEG codes every frame as an independent JPEG: simple, edit-friendly, randomly accessible at any frame — but it pays the full still-image cost $N$ times and exploits none of the temporal redundancy. It is the "still $\times\,N$" strawman, and real codecs beat it by 5–50$\times$.
The lever is to stop coding the frame and start coding the surprise. Instead of coding $I_t$, code the residual between $I_t$ and a prediction $\hat I_t$ assembled from already-decoded frames. If the prediction is good the residual is mostly zero, so after the DCT and quantization almost all coefficients vanish, and entropy coding then crushes a block that is nearly all zeros into almost nothing. The slogan is good prediction $=$ small residual $=$ few bits.
Two redundancies are now stacked. Video keeps JPEG's spatial and perceptual redundancy removal — the DCT, chroma subsampling, perceptual quantization, all carried over unchanged from File formats and compression — and adds temporal redundancy removal on top. The temporal one is the large win.
The only question left is how to build $\hat I_t$. The crude answer, predict $I_t$ by the previous frame $I_{t-1}$ and code $I_t - I_{t-1}$, works when nothing moves but fails the moment anything does: a moving sharp edge sits in a different place in the two frames, so the difference lights up as a bright residual band exactly along every edge — precisely where coding is expensive. The fix is to predict with motion: line each block up with where it actually went, so the residual stays small even as the scene moves. That is motion compensation.
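A minimal sketch of that contrast, assuming a synthetic noise-textured frame and a known global shift (both hypothetical stand-ins for real video, chosen so the arithmetic is easy to check). It compares the residual left by reusing the previous frame unchanged with the residual left after lining the frame up with the motion first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "scene": a textured 64x64 frame (hypothetical stand-in for real video).
prev_frame = rng.integers(0, 256, size=(64, 64)).astype(np.int32)

# Next frame: the whole scene moved 3 pixels to the right (np.roll wraps at the edge).
shift = 3
next_frame = np.roll(prev_frame, shift, axis=1)

# Naive prediction: reuse the previous frame unchanged.
naive_residual = next_frame - prev_frame

# Motion-compensated prediction: shift the previous frame by the known motion first.
mc_prediction = np.roll(prev_frame, shift, axis=1)
mc_residual = next_frame - mc_prediction

naive_energy = int(np.abs(naive_residual).sum())
mc_energy = int(np.abs(mc_residual).sum())
print(naive_energy, mc_energy)
```

Noise texture is the worst case for the naive difference, since every pixel is an "edge"; real footage sits between the two extremes, and the motion has to be estimated rather than known.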
12.2.2 Motion-compensated prediction: the core trick
The idea is geometric and simple. A moving object's block in frame $t$ is, to first order, the same pixels as some block in a reference frame, merely displaced. So for each block we find that displacement, copy the matched patch from the reference as our prediction, and code only what is left over.
Macroblocks. Partition the frame into blocks — classically $16\times 16$ luma "macroblocks," though modern codecs use variable sizes down to $4\times 4$. Each block is predicted independently, with its own motion.
Block matching is the correspondence step, and it is exactly the cost-volume search of Optical flow run per block. For a block $B$ in frame $t$, search a window $W$ in the reference frame for the best-matching patch and record the offset $(u,v)$, the motion vector:
$$(u,v) = \arg\min_{(u',v') \in W} \sum_{(x,y)\in B} \bigl| I_t(x,y) - I_{\text{ref}}(x-u',\, y-v') \bigr|$$
This cost is the sum of absolute differences (SAD); the sum of squared differences (SSD) is its smooth cousin — the same matching primitive as Optical flow, just summed over a whole block instead of a single pixel's neighbourhood. Per block, it is a search over the displacement window for the cheapest match:
    for each block B in frame I_t:
        best = ∞
        for (u, v) in search window:
            sad = ∑_{(x,y) in B} |I_t[x, y] - I_ref[x-u, y-v]|
            if sad < best:
                best = sad
                mv[B] = (u, v)
        residual[B] = B - (I_ref shifted by mv[B])
The search is sped up with hierarchical, coarse-to-fine search and with predicted starting points from neighbouring blocks, and sub-pixel motion vectors — half- or quarter-pel, found by interpolating the reference — sharpen the match further.
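The exhaustive search can be sketched in NumPy as follows. This is an illustrative full search for a single block, with none of the hierarchical or predicted-start speedups and integer-pel vectors only; the function name `block_match`, its parameters, and the synthetic test frames are assumptions made for the example.

```python
import numpy as np

def block_match(cur, ref, top, left, size=16, radius=7):
    """Exhaustive SAD search for one block; returns ((u, v), best_sad).

    Convention (matching the text): the prediction for pixel (x, y) is
    ref[y - v, x - u], so (u, v) is the displacement from reference to current.
    """
    block = cur[top:top + size, left:left + size].astype(np.int64)
    best_sad, best_mv = None, (0, 0)
    for v in range(-radius, radius + 1):
        for u in range(-radius, radius + 1):
            r0, c0 = top - v, left - u
            # Skip candidates whose patch falls outside the reference frame.
            if r0 < 0 or c0 < 0 or r0 + size > ref.shape[0] or c0 + size > ref.shape[1]:
                continue
            patch = ref[r0:r0 + size, c0:c0 + size].astype(np.int64)
            sad = int(np.abs(block - patch).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (u, v)
    return best_mv, best_sad

rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(64, 64))
# Current frame: reference content moved 2 px right and 1 px down.
cur = np.roll(ref, shift=(1, 2), axis=(0, 1))
mv, sad = block_match(cur, ref, top=24, left=24)
print(mv, sad)
```

The cost is one SAD per candidate offset, so a radius of 7 means 225 block comparisons per block, which is why practical encoders prune the search.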
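The prediction is then just the reference frame shifted by that vector,
$$\hat I_t(x,y) = I_{\text{ref}}(x-u,\, y-v), \qquad (x,y)\in B,$$
and the residual (the thing we actually code) is what the prediction missed,
$$r(x,y) = I_t(x,y) - \hat I_t(x,y),$$
small wherever the match was good.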
Reuse the JPEG pipeline on the residual. Now comes the payoff of having built the still chapter first. Take $r$ block by block and run the same lossy chain: DCT, then quantize against a quantization table with the usual quality / quantization-parameter (QP) knob exactly as in File formats and compression, then entropy-code the quantized coefficients. The codec transmits, per block, only the motion vector $(u,v)$ plus the coded residual. (The motion vectors are themselves predictively coded from their neighbours, since they too are spatially correlated — adjacent blocks tend to move together.)
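To see why a small residual costs so few bits, here is a rough sketch that pushes two $8\times 8$ blocks through the same transform-and-quantize step. It assumes a single uniform quantizer step in place of a full quantization table, builds the orthonormal DCT-II by hand rather than relying on a library, and uses random data as a stand-in for an image block and a residual block; `dct_matrix` and `code_block` are hypothetical helper names.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis as an n x n matrix C, so coeffs = C @ x @ C.T."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def code_block(block, step):
    """Transform, uniformly quantize, and reconstruct one square block."""
    c = dct_matrix(block.shape[0])
    coeffs = c @ block @ c.T
    levels = np.round(coeffs / step).astype(int)   # what would be entropy-coded
    recon = c.T @ (levels * step) @ c              # decoder-side reconstruction
    return levels, recon

rng = np.random.default_rng(2)
# Hypothetical data: a busy image block vs. a small motion-compensated residual.
image_block = rng.integers(0, 256, size=(8, 8)).astype(float)
residual_block = rng.normal(0.0, 2.0, size=(8, 8))

step = 16.0  # one uniform step, standing in for a full quantization table
img_levels, _ = code_block(image_block, step)
res_levels, res_recon = code_block(residual_block, step)

print(np.count_nonzero(img_levels), np.count_nonzero(res_levels))
```

The count of nonzero levels is only a proxy for bits, but it is the quantity the entropy coder feeds on: a block of (almost) all zeros is what "crushes into almost nothing".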
The decoder mirrors the encoder. It decodes the residual, fetches the same reference patch via the transmitted motion vector, and adds them: $I_t = \hat I_t + r$. One subtlety keeps the two sides in lock-step: the encoder predicts from the reconstructed (lossy) reference the decoder will actually have, not from the pristine original. If it predicted from the original, the small quantization errors would compound frame after frame and the two sides would slowly drift apart; predicting from the reconstruction makes encoder and decoder see the same reference and eliminates drift.
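The drift argument can be checked with a toy simulation. The sketch below assumes zero motion, a uniform residual quantizer, and a scene that brightens by a fixed amount each frame; all of these, and the function names, are illustrative choices rather than any real codec's behaviour. The only difference between the two runs is which reference the encoder predicts from.

```python
import numpy as np

def quantize(r, step):
    """Uniform quantizer: the lossy step applied to every residual."""
    return step * np.round(r / step)

def simulate(frames, step, closed_loop):
    """Return the decoder's per-frame max abs reconstruction error."""
    dec = frames[0].copy()       # frame 0 sent losslessly (stand-in for an I-frame)
    enc_ref = frames[0].copy()   # what the encoder predicts from
    errors = []
    for t in range(1, len(frames)):
        residual = quantize(frames[t] - enc_ref, step)
        dec = dec + residual     # decoder: its own reference plus the residual
        # Closed loop: encoder tracks the decoder's reconstruction.
        # Open loop: encoder keeps predicting from the pristine original.
        enc_ref = dec.copy() if closed_loop else frames[t].copy()
        errors.append(float(np.abs(dec - frames[t]).max()))
    return errors

base = np.arange(16.0).reshape(4, 4)
frames = [base + 4.0 * t for t in range(20)]   # slow brightening, 4 levels per frame

open_err = simulate(frames, step=10.0, closed_loop=False)
closed_err = simulate(frames, step=10.0, closed_loop=True)
print(open_err[-1], max(closed_err))
```

In the open-loop run each per-frame change is below half a quantizer step, so it is rounded away every time and the decoder falls further behind with each frame. In the closed-loop run the encoder sees the accumulated error in its own reference and corrects it, so the error never exceeds half a step.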
Rate–distortion is the real objective. The encoder does not pick the most accurate motion vector. It picks the one that minimises
$$J = D + \lambda R,$$
the reconstruction distortion $D$ plus $\lambda$ times the bits $R$ needed to send the vector and its residual. A slightly worse match that codes far cheaper wins outright. This $\min D + \lambda R$ trade-off — the same Lagrangian balance that sets a quantization table's aggressiveness in File formats and compression — is why the motion field is on a budget: it is optimised for bits, not for physical correctness (Figure 12.2.2).
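A toy version of that decision, under loudly stated assumptions: the rate term counts only the bits for the motion-vector difference from a predicted vector (residual bits are left out), and those bits are approximated by signed Exp-Golomb code lengths. The candidate distortions are made-up numbers, and `rd_choose` is a hypothetical helper, not any encoder's API.

```python
def se_golomb_bits(v):
    """Bit length of a signed Exp-Golomb code for integer v."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    # floor(log2(code_num + 1)) computed exactly with integer arithmetic
    return 2 * ((code_num + 1).bit_length() - 1) + 1

def rd_choose(candidates, mv_pred, lam):
    """Pick the candidate minimising J = D + lam * R.

    candidates: list of ((u, v), distortion).
    mv_pred: predicted vector from neighbouring blocks; only the difference is coded.
    """
    best = None
    for (u, v), dist in candidates:
        rate = se_golomb_bits(u - mv_pred[0]) + se_golomb_bits(v - mv_pred[1])
        cost = dist + lam * rate
        if best is None or cost < best[0]:
            best = (cost, (u, v), dist, rate)
    return best

# Hypothetical candidates: an accurate but far-off vector vs. a cheap zero vector.
candidates = [((5, -6), 100.0), ((0, 0), 130.0)]

accuracy_only = rd_choose(candidates, mv_pred=(0, 0), lam=0.0)
rate_aware = rd_choose(candidates, mv_pred=(0, 0), lam=4.0)
print(accuracy_only[1], rate_aware[1])
```

With $\lambda = 0$ the better match wins; with $\lambda = 4$ the 14 bits needed for the far-off vector outweigh its 30-unit distortion advantage, and the cheaper zero vector is chosen instead.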
12.2.3 I, P, and B frames; GOP structure
Not every frame can be predicted from another — you need an entry point with no dependency, and you would like to predict from both directions when you can. Codecs therefore define three frame types, and a frame's type is exactly its place in the prediction graph (Figure 12.2.3).
- I-frame (Intra). Coded alone, with no temporal prediction — essentially a JPEG, spatial DCT only. It is the largest of the three (it spends the most bits) but it is self-contained: an I-frame is a random-access point you can seek to, and a resynchronisation point after a transmission error.
- P-frame (Predicted). Predicted from a previous decoded frame, forward only. It sends motion vectors plus a residual and is far smaller than an I-frame.
- B-frame (Bidirectional). Predicted from frames on both sides — a past and a future reference — often by averaging the two predictions. These are the smallest of all, excellent for smooth motion and for regions revealed or occluded between references. The price is out-of-order decoding: the future reference must already be decoded, so a codec's coding order differs from its display order.
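A small numeric illustration of why averaging two references helps, assuming a hypothetical flat patch in the middle of a linear fade and no motion at all. It is a sketch of the bidirectional idea only; real codecs weight and motion-compensate each reference separately.

```python
import numpy as np

# Hypothetical 4x4 patch during a linear fade: brightness 100 -> 120 -> 140.
past = np.full((4, 4), 100.0)      # earlier reference (already decoded)
current = np.full((4, 4), 120.0)   # the B-frame being coded
future = np.full((4, 4), 140.0)    # later reference (decoded ahead of display)

forward_pred = past                      # P-style: one reference
bidir_pred = 0.5 * (past + future)       # B-style: average of both references

forward_sad = float(np.abs(current - forward_pred).sum())
bidir_sad = float(np.abs(current - bidir_pred).sum())
print(forward_sad, bidir_sad)
```

The forward-only prediction misses the brightness change at every pixel, while the average of past and future lands exactly on the in-between frame.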
GOP (Group of Pictures). The repeating arrangement of these types — for example I B B P B B P …, restarting at the next I-frame — is the GOP, and its length is a direct trade-off:
- short GOP (frequent I-frames) → easy random access and seeking, robust to errors, but bigger files;
- long GOP (rare I-frames) → best compression, but slow seeking and error propagation that lasts until the next I-frame arrives to flush it.
In practice streaming and editing favour shorter GOPs (you seek and splice constantly); archival and broadcast push longer GOPs for size. All-intra (I-only) modes also exist, for editing-friendly, frame-accurate video at the cost of size — which is essentially Motion-JPEG with a better intra coder.
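The gap between display order and coding order is easy to make concrete. The sketch below assumes the simple classic scheme in which each B-frame references the nearest I- or P-frame on either side; `coding_order` is a hypothetical helper, and real encoders have more flexible reference structures.

```python
def coding_order(display_types):
    """Reorder frames so every B-frame follows both of its references.

    display_types: string such as "IBBPBBP" in display order.
    Trailing B-frames with no future anchor are emitted last.
    """
    order, pending_b = [], []
    for idx, ftype in enumerate(display_types):
        if ftype == "B":
            pending_b.append(idx)       # must wait for the future reference
        else:
            order.append(idx)           # anchor frame (I or P) is coded first
            order.extend(pending_b)     # then the B-frames it unlocks
            pending_b = []
    order.extend(pending_b)
    return order

display = "IBBPBBP"
order = coding_order(display)
print([f"{display[i]}{i}" for i in order])
# prints ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

Each P-frame jumps ahead of the B-frames that precede it on screen, which is also why B-frames add latency: the decoder cannot show B1 until P3 has arrived and been decoded.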
12.2.4 Why this is "optical flow on a budget"
It is worth stating the equivalence plainly, because it is the whole reason this chapter lives next to Optical flow. The field of motion vectors over all the blocks of a frame is a correspondence field — a piecewise-constant, one-vector-per-block approximation to the optical flow between the frame and its reference. Motion compensation is dense correspondence, computed at scale, running silently inside every video you have ever watched (Figure 12.2.4).
But it is the budget version, and the differences are the point.
- Block-constant, not per-pixel. One vector covers a whole block, even though the true flow varies within it. The residual mops up the within-block error — the codec does not need the flow to be right, only cheap, because whatever it gets wrong is paid for once, in residual bits.
- Rate-optimal, not accuracy-optimal. The vector is chosen to minimise bits ($D + \lambda R$), so the encoder will happily pick a physically wrong vector when it codes cheaper. A codec's motion field is not a measurement of scene motion; do not read it as ground-truth flow.
- Block-matching, not variational. It is pure SAD/SSD search — the cost-volume primitive of Optical flow — with no global smoothness solve. What smoothness there is comes implicitly, from predictively coding the vectors and from the rate cost of disagreeing with neighbours.
- Graceful fallback. Where prediction is hopeless — a scene cut, fast new content, a large occlusion — the encoder simply codes the block, or the whole frame, intra, degrading gracefully back to the still-image path.
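To make the "block-constant" point tangible, a codec's motion vectors can be expanded into a per-pixel field and compared, in shape at least, with a dense flow map. The sketch assumes a regular grid of equal-sized blocks and a made-up $2\times 2$ grid of vectors; `dense_flow_from_blocks` is a hypothetical helper name.

```python
import numpy as np

def dense_flow_from_blocks(mv_field, block=16):
    """Expand an (H_b, W_b, 2) motion-vector grid to a per-pixel (H, W, 2) field."""
    return np.repeat(np.repeat(mv_field, block, axis=0), block, axis=1)

# Hypothetical vectors for a 32x32 frame split into four 16x16 blocks.
mv_field = np.array([[[2, 1], [0, 0]],
                     [[0, 0], [-3, 4]]])
flow = dense_flow_from_blocks(mv_field, block=16)

print(flow.shape)
print(flow[0, 0], flow[15, 16], flow[31, 31])
```

The result has one vector per pixel but only as many distinct vectors as there are blocks, with hard jumps at block boundaries. That staircase, and the fact that each vector was picked for bits rather than accuracy, is what separates it from a measured flow field.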
The conceptual bridge runs both ways. The same aperture problem, search window, and sub-pixel interpolation appear in the codec and in flow estimation. Learned optical flow (RAFT and its kin, Teed & Deng 2020; see Optical flow) and learned video codecs both build on this cost-volume correspondence skeleton — the codec just got there first, optimised for a different loss.
12.2.5 Modern codecs in one breath
The lineage is long and the skeleton never changes: Moving Picture Experts Group (MPEG)-1/2 → H.264 / Advanced Video Coding (AVC) (Wiegand, Sullivan, Bjøntegaard and Luthra 2003) → H.265 / High Efficiency Video Coding (HEVC) (Sullivan, Ohm, Han and Wiegand 2012) → AV1 (AOMedia Video 1; AOMedia, 2018) → H.266 / Versatile Video Coding (VVC). Every one of them keeps the same three pieces: motion-compensated block prediction $+$ transform-coded residual $+$ entropy coding. What advances generation to generation is how cleverly each piece is done, not the architecture.
What they add, on the same bones:
- Variable / recursive block sizes — big blocks for flat, uniform motion, tiny blocks for fine detail, chosen per region.
- Multiple reference frames and better sub-pixel interpolation, so a block can be predicted from whichever past (or future) frame matches best.
- Intra prediction within a frame — predict a block from its already-decoded spatial neighbours using directional modes — so that even I-frames now beat a plain JPEG.
- Richer entropy coders such as CABAC (context-adaptive binary arithmetic coding).
- In-loop deblocking and sample-adaptive filters to hide the block edges that motion compensation and block-transform coding would otherwise leave.
- AV1 additionally brings warped / affine motion (a deliberate step toward true flow), overlapped-block motion compensation, and the practical advantage of being royalty-free.
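Of the additions above, intra prediction is the easiest to sketch in a few lines. The example below assumes just three modes (vertical, horizontal, DC) and picks among them by SAD; real codecs offer many more directions and choose by rate–distortion cost. The helper names and the striped test block are illustrative.

```python
import numpy as np

def intra_predict(top, left):
    """Three simple intra predictors for an n x n block.

    top: the n reconstructed pixels directly above the block.
    left: the n reconstructed pixels directly to its left.
    """
    n = len(top)
    return {
        "vertical": np.tile(top, (n, 1)),               # copy the row above downward
        "horizontal": np.tile(left[:, None], (1, n)),   # copy the left column rightward
        "dc": np.full((n, n), (top.mean() + left.mean()) / 2.0),
    }

def best_intra_mode(block, top, left):
    """Return the mode with the lowest SAD and the residual it leaves."""
    preds = intra_predict(top, left)
    sads = {m: float(np.abs(block - p).sum()) for m, p in preds.items()}
    mode = min(sads, key=sads.get)
    return mode, block - preds[mode]

# Hypothetical block of vertical stripes that continue from the row above.
top = np.array([10.0, 50.0, 90.0, 130.0])
left = np.array([10.0, 10.0, 10.0, 10.0])
block = np.tile(top, (4, 1))

mode, residual = best_intra_mode(block, top, left)
print(mode, float(np.abs(residual).sum()))
```

This is the same predict-then-code-the-residual move as motion compensation, only the "reference" is the block's already-decoded spatial neighbours instead of another frame.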
The takeaway is that twenty-five years of codec progress is better prediction plus smarter residual coding inside an unchanged motion-compensation framework — and that framework is, at heart, correspondence on a budget applied to the temporal axis. Even today's learned neural codecs, which replace individual pieces (the motion model, the residual transform) with networks, keep the same predict-then-code-the-residual spine. Build the still pipeline once and add temporal prediction on top, and you have, in outline, every video codec there is.
Big lessons of this chapter
The recurring principles from this chapter, gathered for review.
Correspondence on a budget: motion compensation is optical flow you can afford. A codec needs, for every block, the same thing Optical flow asks — where did this come from in a frame we already have? — but it does not need a physically correct, dense, sub-pixel flow field. It needs a coarse, block-constant correspondence field that is cheap to estimate and, above all, cheap to transmit, and it chooses that field to minimise total bits, not endpoint error. So decades before learned optical flow, video codecs were already estimating dense-ish correspondence at massive scale — just optimised for rate, not accuracy. The recurring move: reuse what the decoder already has, and code only the difference. It returns inverted in Video magnification, which keeps the temporal difference instead of discarding it, and it is the rate-aware cousin of the correspondence story in Optical flow.