
12.2 Video compression and motion compensation

The still-image part of this book taught one compression idea over and over: don't store what the viewer won't miss, and don't store what you can predict. JPEG (File formats and compression) throws away high-frequency chroma and coarsely quantizes the high-frequency DCT coefficients because perception won't notice. Video adds a second, far larger redundancy that stills cannot touch: the next frame looks almost exactly like this one. At 30–60 frames per second the camera and the world barely move between frames; most of the picture is already on screen. Coding each frame independently (so-called Motion-JPEG, literally JPEG per frame) ignores this entirely and is wildly wasteful. Real codecs predict each frame from its neighbours and code only the prediction error. That is the whole subject.

💡 Big lesson (L17, recurrence)

Correspondence on a budget: motion compensation is optical flow you can afford. A codec needs, for every block, the same thing Optical flow asks — where did this come from in a frame we already have? — but it does not need a physically correct, dense, sub-pixel flow field. It needs a coarse, block-constant correspondence field that is cheap to estimate and, above all, cheap to transmit, and it chooses that field to minimise total bits, not endpoint error. So decades before learned optical flow, video codecs were already estimating dense-ish correspondence at massive scale — just optimised for rate, not accuracy. The recurring move: reuse what the decoder already has, and code only the difference. It returns inverted in Video magnification, which keeps the temporal difference instead of discarding it, and it is the rate-aware cousin of the correspondence story in Optical flow.

12.2.1 Why video compresses far better than still × N — temporal redundancy

Here is the single observation the chapter stands on. Take two consecutive frames and subtract one from the other. The difference image $I_t - I_{t-1}$ is near-zero almost everywhere — black except at moving edges, occlusions, and lighting changes. That near-black difference image is the only new information the second frame carries; everything else was already on screen and need not be sent again (Figure 12.2.1).

The naive baseline ignores this. Motion-JPEG codes every frame as an independent JPEG: simple, edit-friendly, randomly accessible at any frame — but it pays the full still-image cost $N$ times and exploits none of the temporal redundancy. It is the "still $\times\,N$" strawman, and real codecs beat it by 5–50$\times$.

The lever is to stop coding the frame and start coding the surprise. Instead of coding $I_t$, code the residual between $I_t$ and a prediction $\hat I_t$ assembled from already-decoded frames. If the prediction is good the residual is mostly zero, so after the DCT and quantization almost all coefficients vanish, and entropy coding then crushes a block that is nearly all zeros into almost nothing. The slogan is good prediction $=$ small residual $=$ few bits.

Two redundancies are now stacked. Video keeps JPEG's spatial and perceptual redundancy removal (the DCT, chroma subsampling, and perceptual quantization, all carried over unchanged from File formats and compression) and adds temporal redundancy removal on top. The temporal one is the large win.

The only question left is how to build $\hat I_t$. The crude answer, predict $I_t$ by the previous frame $I_{t-1}$ and code $I_t - I_{t-1}$, works when nothing moves but fails the moment anything does: a moving sharp edge sits in a different place in the two frames, so the difference lights up as a bright residual band exactly along every edge — precisely where coding is expensive. The fix is to predict with motion: line each block up with where it actually went, so the residual stays small even as the scene moves. That is motion compensation.
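A minimal sketch of this failure and its fix, assuming NumPy and a synthetic scene (a textured square sliding 3 pixels to the right over a flat background). The helper `make_frame` and all sizes are hypothetical choices for illustration only. It compares the plain frame difference with the difference taken after shifting the previous frame by the true motion.

```python
import numpy as np

def make_frame(shift, size=64):
    """Synthetic frame: a textured 16x16 square on a flat background,
    displaced `shift` pixels to the right (hypothetical test scene)."""
    frame = np.full((size, size), 50.0)
    rng = np.random.default_rng(0)            # same texture in every frame
    texture = rng.uniform(100, 200, (16, 16))
    frame[24:40, 20 + shift:36 + shift] = texture
    return frame

prev, curr = make_frame(0), make_frame(3)

# Crude prediction: use the previous frame unchanged.
plain_diff = curr - prev

# Motion-aware prediction: shift the previous frame by the known motion.
# np.roll wraps around, which is harmless here because the border is flat.
mc_diff = curr - np.roll(prev, 3, axis=1)

nonzero_plain = np.count_nonzero(plain_diff)
nonzero_mc = np.count_nonzero(mc_diff)
print(f"plain difference: {nonzero_plain} nonzero pixels of {curr.size}")
print(f"motion-compensated difference: {nonzero_mc} nonzero pixels")
```

Even the plain difference touches only a small fraction of the frame, which is the temporal redundancy of Section 12.2.1. Lining the block up with where it went removes the remaining residual entirely in this idealised case; real footage leaves a small residual from noise, lighting, and occlusion.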

Figure 12.2.1. Temporal redundancy, made visible. Three consecutive frames of a moving scene (top row) look nearly identical. The per-pixel frame difference $I_t - I_{t-1}$ (bottom) is near-black everywhere except a thin bright fringe at moving edges and at a newly revealed (disoccluded) patch. That sparse difference image is the only genuinely new information — "this is all that is left to code."

12.2.2 Motion-compensated prediction — the core trick

The idea is geometric and simple. A moving object's block in frame $t$ is, to first order, the same pixels as some block in a reference frame, merely displaced. So for each block we find that displacement, copy the matched patch from the reference as our prediction, and code only what is left over.

Macroblocks. Partition the frame into blocks — classically $16\times 16$ luma "macroblocks," though modern codecs use variable sizes down to $4\times 4$. Each block is predicted independently, with its own motion.

Block matching is the correspondence step, and it is exactly the cost-volume search of Optical flow run per block. For a block $B$ in frame $t$, search a window in the reference frame for the best-matching patch and record the offset $(u,v)$ — the motion vector:

$$ (u,v) = \arg\min_{(u,v)} \sum_{(x,y)\in B} \big|\, I_t(x,y) - I_{\text{ref}}(x-u,\,y-v) \,\big| . $$

This cost is the sum of absolute differences (SAD); the sum of squared differences (SSD) is its smooth cousin — the same matching primitive as Optical flow, just summed over a whole block instead of a single pixel's neighbourhood. Per block, it is a search over the displacement window for the cheapest match:

for each block B in frame I_t:
    best = ∞
    for (u, v) in search window:
        sad = ∑_{(x,y) in B} |I_t[x, y] - I_ref[x-u, y-v]|
        if sad < best:
            best = sad
            mv[B] = (u, v)
    residual[B] = B - I_ref shifted by mv[B]

The search is sped up with hierarchical, coarse-to-fine search and with predicted starting points from neighbouring blocks, and sub-pixel motion vectors — half- or quarter-pel, found by interpolating the reference — sharpen the match further.
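The block search above can be sketched as runnable code. This is a plain exhaustive (full-search) SAD matcher at integer-pixel precision, assuming NumPy and floating-point frames. The function name `block_match`, the block size, and the search radius are illustrative assumptions, and none of the speed-ups or sub-pixel refinements are included.

```python
import numpy as np

def block_match(cur, ref, top, left, size=8, radius=4):
    """Full-search SAD block matching (illustrative sketch).

    Returns ((u, v), sad) such that the best prediction of the block is
    ref[y - v, x - u], matching the convention I_ref(x - u, y - v).
    x is the column index, y is the row index.
    """
    block = cur[top:top + size, left:left + size]
    h, w = ref.shape
    best_sad, best_mv = np.inf, (0, 0)
    for v in range(-radius, radius + 1):
        for u in range(-radius, radius + 1):
            y0, x0 = top - v, left - u
            # Skip candidates that fall outside the reference frame.
            if y0 < 0 or x0 < 0 or y0 + size > h or x0 + size > w:
                continue
            cand = ref[y0:y0 + size, x0:x0 + size]
            sad = np.abs(block - cand).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (u, v)
    return best_mv, best_sad

# Usage: the current frame is the reference moved 3 px right and 2 px down.
rng = np.random.default_rng(1)
ref = rng.uniform(0, 255, (32, 32))
cur = np.roll(ref, (2, 3), axis=(0, 1))       # cur[y, x] = ref[y - 2, x - 3]

mv, sad = block_match(cur, ref, top=12, left=12)
u, v = mv
prediction = ref[12 - v:20 - v, 12 - u:20 - u]
residual = cur[12:20, 12:20] - prediction
print(mv, sad)
```

Frames are kept as floats so that the subtraction cannot wrap around, which it would with unsigned 8-bit pixel arrays.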

The prediction is then just the reference frame shifted by that vector,

$$ \hat I_t(x,y) = I_{\text{ref}}(x-u,\,y-v), $$

and the residual — the thing we actually code — is what the prediction missed,

$$ r(x,y) = I_t(x,y) - \hat I_t(x,y), $$

small wherever the match was good.

Reuse the JPEG pipeline on the residual. Now comes the payoff of having built the still chapter first. Take $r$ block by block and run the same lossy chain: DCT, then quantize against a quantization table with the usual quality / quantization-parameter (QP) knob exactly as in File formats and compression, then entropy-code the quantized coefficients. The codec transmits, per block, only the motion vector $(u,v)$ plus the coded residual. (The motion vectors are themselves predictively coded from their neighbours, since they too are spatially correlated — adjacent blocks tend to move together.)
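To make "good prediction, few bits" concrete, here is a small sketch of the transform-and-quantize step applied to two residual blocks. It assumes NumPy, builds an orthonormal 8×8 DCT-II matrix by hand, and uses a single uniform step size instead of a full quantization table; that simplification, along with the names `dct_matrix`, `code_block`, and `decode_block`, is an assumption made for illustration.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

C = dct_matrix(8)

def code_block(residual, step):
    """2-D DCT followed by uniform quantization to integer levels."""
    coeffs = C @ residual @ C.T
    return np.round(coeffs / step).astype(int)

def decode_block(levels, step):
    """Dequantize and invert the DCT (what the decoder reconstructs)."""
    return C.T @ (levels * step) @ C

rng = np.random.default_rng(0)
good = rng.uniform(-1, 1, (8, 8))             # residual after a good match
bad = np.zeros((8, 8))
bad[:, 4:] = 100.0                            # residual with a misaligned edge

levels_good = code_block(good, step=16)
levels_bad = code_block(bad, step=16)
print("nonzero levels, good match:", np.count_nonzero(levels_good))
print("nonzero levels, bad match: ", np.count_nonzero(levels_bad))
```

The well-predicted block quantizes to all zeros, so the entropy coder has almost nothing to send. The block containing a misaligned edge keeps several nonzero coefficients that must be paid for.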

The decoder mirrors the encoder. It decodes the quantized residual $\tilde r$, fetches the same reference patch via the transmitted motion vector, and adds them: $\tilde I_t = \hat I_t + \tilde r$, a close approximation to $I_t$ rather than an exact copy, because quantization is lossy. One subtlety keeps the two sides in lock-step: the encoder predicts from the reconstructed (lossy) reference the decoder will actually have, not from the pristine original. If it predicted from the original, the small quantization errors would compound frame after frame and the two sides would slowly drift apart; predicting from the reconstruction makes encoder and decoder see the same reference and eliminates drift.
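The drift argument can be checked with a toy simulation. The sketch below assumes NumPy, zero motion, a slow fade of 0.4 grey levels per frame, and a uniform residual quantizer with step 1; all of these are illustrative assumptions, as is the function name `simulate`. It compares predicting from the original previous frame (open loop) with predicting from the decoder's reconstruction (closed loop).

```python
import numpy as np

def quantize(r, step):
    """Uniform scalar quantizer: the lossy part of the residual path."""
    return step * np.round(r / step)

def simulate(frames, step, closed_loop):
    """Return the worst pixel error of the decoder's output per frame."""
    recon = frames[0].copy()       # assume the first frame arrives intact
    errors = []
    for t in range(1, len(frames)):
        # Closed loop predicts from what the decoder has; open loop
        # predicts from the pristine previous frame, which it lacks.
        reference = recon if closed_loop else frames[t - 1]
        residual = quantize(frames[t] - reference, step)
        recon = recon + residual   # the decoder can only add to its own recon
        errors.append(float(np.abs(recon - frames[t]).max()))
    return errors

frames = [np.full((4, 4), 100.0 + 0.4 * t) for t in range(20)]
open_err = simulate(frames, step=1.0, closed_loop=False)
closed_err = simulate(frames, step=1.0, closed_loop=True)
print("open loop, final error:  ", open_err[-1])
print("closed loop, worst error:", max(closed_err))
```

In the open-loop case every residual of 0.4 rounds to zero, so the decoder never sees the fade and its error grows with every frame. In the closed-loop case the unsent error stays in the next residual, so the error never exceeds half a quantization step.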

Rate–distortion is the real objective. The encoder does not pick the most accurate motion vector. It picks the one that minimises

$$ D + \lambda R, $$

the reconstruction distortion $D$ plus $\lambda$ times the bits $R$ needed to send the vector and its residual. A slightly worse match that codes far cheaper wins outright. This $\min D + \lambda R$ trade-off — the same Lagrangian balance that sets a quantization table's aggressiveness in File formats and compression — is why the motion field is on a budget: it is optimised for bits, not for physical correctness (Figure 12.2.2).
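As a small worked example of this selection rule, the sketch below picks between two candidate vectors for one block. The distortion and bit counts are made-up numbers, and the `Candidate` type and `rd_choose` function are hypothetical names; a real encoder would measure $D$ and $R$ by trial-coding each candidate.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    mv: tuple          # motion vector (u, v)
    distortion: float  # D: reconstruction error if this vector is used
    bits: float        # R: bits for the vector plus its coded residual

def rd_choose(candidates, lam):
    """Pick the candidate with the lowest Lagrangian cost D + lam * R."""
    return min(candidates, key=lambda c: c.distortion + lam * c.bits)

# Hypothetical numbers: an accurate but expensive vector versus a
# slightly worse match that is much cheaper to send.
candidates = [
    Candidate(mv=(5, -3), distortion=100.0, bits=60.0),
    Candidate(mv=(0, 0), distortion=130.0, bits=12.0),
]

accuracy_only = rd_choose(candidates, lam=0.0)   # costs: 100 vs 130
rate_aware = rd_choose(candidates, lam=1.0)      # costs: 160 vs 142
print(accuracy_only.mv, rate_aware.mv)
```

With $\lambda = 0$ the most accurate vector wins. Once bits carry a price, the cheaper zero vector wins even though it matches worse.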

Figure 12.2.2. The motion-compensated prediction loop. Encoder (top): the current frame $I_t$ and a stored reference frame enter block matching, which emits motion vectors; the predictor $\hat I_t$ is subtracted from $I_t$ to form the residual $r$, which goes through DCT $\to$ quantize $\to$ entropy-code into the bitstream alongside the (predictively coded) motion vectors. A reconstruction branch inverse-quantizes, applies the inverse DCT, and adds $\hat I_t$ back to store the lossy reference. Decoder (bottom): a mirror image, namely entropy-decode, inverse-quantize, inverse DCT, fetch the reference patch via the motion vector, and add to recover the reconstructed frame. The whole rate–distortion choice $\min D + \lambda R$ lives in the block-matching / mode-decision box.

12.2.3 I, P, and B frames; GOP structure

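Not every frame can be predicted from another: you need an entry point with no dependency, and you would like to predict from both directions when you can. Codecs therefore define three frame types, and a frame's type is exactly its place in the prediction graph (Figure 12.2.3).

I-frames (intra). Coded with no reference to any other frame, essentially as a still image. They are the entry points for seeking and error recovery, and the most expensive frames.

P-frames (predicted). Motion-compensated from an earlier reference frame, so only motion vectors and residual are sent.

B-frames (bi-directional). Motion-compensated from both a past and a future reference, which usually gives the best prediction and the fewest bits. The price is that the future reference must be decoded first.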

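GOP (Group of Pictures). The repeating arrangement of these types (for example I B B P B B P …, restarting at the next I-frame) is the GOP, and its length is a direct trade-off. A longer GOP spends fewer bits on expensive I-frames and so compresses better, but seeking and splicing become coarser and a decoding error persists until the next I-frame. A shorter GOP costs more bits but gives more frequent entry points.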

In practice streaming and editing favour shorter GOPs (you seek and splice constantly); archival and broadcast push longer GOPs for size. All-intra (I-only) modes also exist, for editing-friendly, frame-accurate video at the cost of size — which is essentially Motion-JPEG with a better intra coder.

Figure 12.2.3. A GOP timeline. Frames laid out left to right in display order: an I-frame opens the group, P-frames predict forward from the previous reference (arrows pointing right), and B-frames predict from both a past and a future reference (arrows from both sides). The arrows are prediction dependencies; note that the future reference of a B-frame must be decoded before it, so coding order $\neq$ display order. The group repeats at the next I-frame.
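The reordering implied by "coding order $\neq$ display order" can be sketched in a few lines. This is a simplified model, assuming each B-frame depends only on the nearest I- or P-frame on either side; the function name `coding_order` is hypothetical, and real codecs allow richer reference structures.

```python
def coding_order(display_types):
    """Map a display-order string of frame types to a coding order.

    Each run of B-frames is held back until the I- or P-frame that
    follows it (its future reference) has been emitted.
    Returns a list of display indices in the order they are coded.
    """
    order, pending_b = [], []
    for idx, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(idx)          # wait for the future reference
        else:
            order.append(idx)              # I or P: code it now
            order.extend(pending_b)        # then the B-frames it unlocks
            pending_b = []
    # Trailing B-frames have no future reference inside this group;
    # this sketch simply appends them.
    order.extend(pending_b)
    return order

gop = "IBBPBBP"
order = coding_order(gop)
print(order)                               # [0, 3, 1, 2, 6, 4, 5]
print("".join(gop[i] for i in order))      # IPBBPBB
```

The decoder therefore has to buffer decoded frames and release them in display order, which is one source of the latency that B-frames add.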

12.2.4 Why this is "optical flow on a budget"

It is worth stating the equivalence plainly, because it is the whole reason this chapter lives next to Optical flow. The field of motion vectors over all the blocks of a frame is a correspondence field — a piecewise-constant, one-vector-per-block approximation to the optical flow between the frame and its reference. Motion compensation is dense correspondence, computed at scale, running silently inside every video you have ever watched (Figure 12.2.4).

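But it is the budget version, and the differences are the point. The codec's field is block-constant rather than per-pixel. It is chosen to minimise bits ($D + \lambda R$) rather than endpoint error, so a vector may be physically wrong as long as it predicts well. And it must be transmitted, so the cost of sending each vector counts against it.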

The conceptual bridge runs both ways. The same aperture problem, search window, and sub-pixel interpolation appear in the codec and in flow estimation. Learned optical flow (RAFT and its kin, Teed & Deng 2020; see Optical flow) and learned video codecs both build on this cost-volume correspondence skeleton — the codec just got there first, optimised for a different loss.

Figure 12.2.4. Same correspondence, two budgets. Left: a dense, per-pixel optical-flow field over a frame (smooth color-coded vectors, varying continuously across the moving object). Right: the codec's motion-vector field over the same frame — one constant vector per block, a coarse, blocky quantization of the same underlying correspondence. Same question ("where did this come from?"), coarsened to whatever is cheap to estimate and cheap to transmit.
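The relationship in Figure 12.2.4 can be imitated numerically by coarsening a dense flow field to one integer vector per block. The sketch below assumes NumPy and a flow array of shape (height, width, 2); the function name `flow_to_block_vectors` is hypothetical. It only illustrates the geometry: an actual encoder does not derive its vectors from a flow field, but searches for them under the rate–distortion objective.

```python
import numpy as np

def flow_to_block_vectors(flow, block=8):
    """Average a dense flow field over each block and round to integers.

    flow has shape (H, W, 2). The result has shape (H // block,
    W // block, 2): one constant vector per block.
    """
    h, w, _ = flow.shape
    hb, wb = h // block, w // block
    # Split rows and columns into (block index, offset within block).
    tiles = flow[:hb * block, :wb * block].reshape(hb, block, wb, block, 2)
    return np.round(tiles.mean(axis=(1, 3))).astype(int)

# Usage: two regions moving differently across a 16x16 frame.
flow = np.zeros((16, 16, 2))
flow[:, :8] = (2.2, 0.1)      # left half drifts about 2 px horizontally
flow[:, 8:] = (-1.0, 2.9)     # right half moves about 3 px vertically

block_mv = flow_to_block_vectors(flow, block=8)
print(block_mv.shape)
print(block_mv[0, 0], block_mv[0, 1])
```

The 512 flow components of the dense field collapse to 8 integers, which is what makes the block field cheap to transmit.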

12.2.5 Modern codecs in one breath

The lineage is long and the skeleton never changes: Moving Picture Experts Group (MPEG)-1/2 → H.264 / Advanced Video Coding (AVC) (Wiegand, Sullivan, Bjøntegaard and Luthra 2003) → H.265 / High Efficiency Video Coding (HEVC) (Sullivan, Ohm, Han and Wiegand 2012) → AV1 (AOMedia Video 1; AOMedia, 2018) → H.266 / Versatile Video Coding (VVC). Every one of them keeps the same three pieces: motion-compensated block prediction $+$ transform-coded residual $+$ entropy coding. What advances generation to generation is how cleverly each piece is done, not the architecture.

What they add, on the same bones:

The takeaway is that twenty-five years of codec progress is better prediction plus smarter residual coding inside an unchanged motion-compensation framework — and that framework is, at heart, correspondence on a budget applied to the temporal axis. Even today's learned neural codecs, which replace individual pieces (the motion model, the residual transform) with networks, keep the same predict-then-code-the-residual spine. Build the still pipeline once and add temporal prediction on top, and you have, in outline, every video codec there is.

