Computational Photography, an AI-powered Slopendium

▸Part 12 VIDEO⧉

**Roadmap.** **[[Motion blur and temporal sampling]]** sets up the time axis (a frame is an integral; time is sampled; the Lagrangian-vs-Eulerian framing). The applications follow: **[[Video compression and motion compensation]]** (correspondence on a budget), **[[Video magnification]]** (the Eulerian reveal), **[[Video stabilization and rolling-shutter correction]]** (estimate → smooth → re-render), **[[Frame interpolation and slow-motion synthesis]]** (transport to the in-between), and a closing **[[Video editing]]** coda (timelines, summarization, reduce-over-time filters, transcript-based editing).

equations

motion blur $B(\mathbf x)=\tfrac1\tau\int_0^\tau I(\mathbf x-\mathbf v t)\,dt$

Eulerian magnification $I'(\mathbf x,t)=I(\mathbf x,t)+\alpha\,\mathcal B\{I(\mathbf x,t)\}$

▸12.1 Motion blur and temporal sampling⧉

`fig-motion-blur-integral` · a frame is an integral over the exposure — a bright point traversing the frame while the shutter is open records a streak of length $\lVert\mathbf v\rVert\tau$; motion blur is a 1-D box convolution along $\mathbf v$ 🟨

⬜ figure not yet created

`fig-blur-spatial-vs-temporal` (side-by-side: a spatial Gaussian PSF blurring a static edge vs a temporal box PSF blurring a *moving* edge — same convolution, different axis) fig-blur-spatial-vs-temporal

`fig-shutter-angle` · shutter angle sets the blur — a rotating-disc shutter at $0°/180°/360°$ admitting a smaller/larger fraction of the frame interval $T$; $180°$ ($\tau=T/2$) is the cinematic film-look, small angles strobe 🟨

`fig-temporal-aliasing-wagonwheel` · the wagon-wheel effect — a spoked wheel sampled below temporal Nyquist appears to stall near it and spin backward past it; the temporal twin of moiré 🟨

⬜ figure not yet created

`fig-temporal-nyquist` (a sinusoidal motion signal sampled at frame rate $f_s$: $f_s>2f_{\text{motion}}$ reconstructs the true motion fig-temporal-nyquist

`fig-blur-as-temporal-prefilter` · one knob, two outcomes — the same fast motion captured long (blurred but not aliased; the window low-passed before sampling) vs short/high-angle (sharp but strobing); the exposure window is the temporal anti-alias filter, L16 in time 🟨

`fig-lagrangian-vs-eulerian` · the organizing diagram — Lagrangian (arrows following particles over time → flow and tracking, output trajectories) vs Eulerian (a fixed grid, each pixel plotting intensity-over-time → magnification, never asking where anything went) 🟨

The two great themes of spatial imaging are **convolution** (a measurement integrates over a region — the PSF) and **sampling** (a discrete grid can only represent frequencies below Nyquist, else they alias). This section's single claim is that **both recur, identically, on the *time* axis** — and together they organise the entire part. A frame is an *integral over the exposure* → **motion blur** is convolution *in time*. A video is a *sequence of temporal samples* → motion faster than the frame rate **aliases** (the wagon wheel) *in time*. And the question "do I think of motion as *particles I follow* or *a time-series at each fixed pixel*?" is the **Lagrangian vs Eulerian** distinction that separates flow/tracking from video magnification.

💡 **Big lesson (recurrence of L5 / L16 — Nyquist, and *prefilter before you downsample*, on the time axis):** the spatial sampling laws apply unchanged in **time**. **L5:** a frame rate $f_s$ can only faithfully capture temporal frequencies below $f_s/2$; faster motion **folds down** to a false low frequency — the **wagon-wheel effect** is temporal moiré. **L16:** the cure for aliasing is to **prefilter before sampling** — and in time the prefilter is *built into the camera*: integrating over the exposure window (which produces motion blur) is exactly the temporal **anti-alias filter**. So motion blur and temporal aliasing are not two problems but **the same tradeoff** — a longer exposure removes strobing by *blurring*, a shorter one gives sharp frames that *strobe*. (→ see Big lessons **L5** & **L16**, first placed in [[Linearity, aliasing and deblurring]] / [[Basic image processing and ISP]]; the **wagon-wheel** is L16's named temporal instance.)

equations

**motion blur** $B(\mathbf{x})=\dfrac{1}{\tau}\displaystyle\int_0^\tau I(\mathbf{x}-\mathbf{v}\,t)\,dt$ (a directional integration along the motion vector $\mathbf v$ over exposure $\tau$) $= I * k_{\mathbf v}$ with **path kernel** $k_{\mathbf v}$ a 1-D box of length $\|\mathbf v\|\tau$ oriented along $\mathbf v$ (constant-velocity case)

**shutter angle** $\theta=360^\circ\cdot\tau/T$ (exposure $\tau$ as a fraction of frame interval $T=1/f_s$)

**temporal Nyquist** $f_s > 2\,f_{\text{motion}}$ (sample faster than twice the motion's highest temporal frequency, else alias)

**aliased apparent frequency** $f_{\text{app}}=|f_{\text{motion}} - n f_s|$ (the folded-down false frequency — wagon wheel)

▸12.1.1 A frame is an integral over time → motion blur⧉

• **the cause** (intuition first): a photo is **not** an instant — the shutter is open for an exposure time $\tau$, and the sensor **accumulates** all the light that lands during that window. If a bright feature *moves* across the frame while the shutter is open, its light is **smeared** along the path it traveled — a **streak**, not a point. Motion blur is the *temporal* version of "a pixel is a weighted average of what it saw," with the average taken over *time* instead of over a spatial neighbourhood. [`fig-motion-blur-integral`]

• **the equation** — blur is a time integral: for image-plane velocity $\mathbf v$ (constant over the exposure), the blurred frame is

$$B(\mathbf{x})=\frac{1}{\tau}\int_0^\tau I(\mathbf{x}-\mathbf{v}\,t)\,dt.$$

Read it literally: the value at pixel $\mathbf x$ is the **average** of the *sharp* scene $I$ over all the positions the scene occupied (offset $-\mathbf v t$) during the exposure.

• **it is a directional convolution** (the punchline — the temporal analog of the spatial PSF): that integral is exactly a **convolution** $B = I * k_{\mathbf v}$ where the **path kernel** $k_{\mathbf v}$ is a **1-D box** of length $\|\mathbf v\|\tau$, oriented along the motion direction $\mathbf v$, normalised to unit area. Constant velocity → a straight box; curved motion → a curved path kernel. **Spatial blur smears over a spatial neighbourhood; motion blur smears over the motion path** — same operator, the axis is just time-projected-onto-space. (Recall from BASIC that convolution *was* the sum-of-random-variables / averaging operation — here the "random variable" is *position-during-exposure*.) [`fig-blur-spatial-vs-temporal`]

• **shutter speed / shutter angle** — the photographer's control of $\tau$: motion-blur length is $\|\mathbf v\|\tau$, so it scales with **exposure time**. In cine, exposure is expressed as a **shutter angle** $\theta=360^\circ\cdot\tau/T$ — the fraction of the frame interval $T=1/f_s$ the (historically rotating-disc) shutter is open. The **$180^\circ$** convention ($\tau=T/2$) gives the natural "film look"; $360^\circ$ blurs maximally (the whole interval), small angles give crisp, **strobed** frames (the *Saving Private Ryan* look). [`fig-shutter-angle`]

• **the tradeoff to state plainly**: longer $\tau$ → more light (better SNR, **L15**) but **more blur**; shorter $\tau$ → sharper frames but less light *and* — the next subsection — **temporal aliasing**. You cannot have sharp, well-exposed, *and* alias-free for fast motion at a fixed frame rate; you pick two.

• **removing it** (forward-ref): if the path kernel $k_{\mathbf v}$ is **known**, deblurring is **deconvolution** by it — but a straight-line box kernel has *zeros* in its frequency response, so naive inversion is ill-posed (it is exactly the Wiener / blind-deblurring story of [[Blind deblurring]], now with a *motion* PSF). When motion **differs across the frame** (different objects, different speeds) the blur is **spatially varying** — no single kernel — connecting to non-uniform deblurring. And you can **engineer** the kernel to invert cleanly: **coded / flutter shutter** (Raskar 2006) flips the shutter open/closed during $\tau$ to give $k_{\mathbf v}$ a broadband, zero-free spectrum; **motion-invariant capture** (Levin 2008) sweeps the camera so *every* speed gets the *same* invertible PSF. (Same "make the forward operator nice" philosophy as coded aperture.)

▸12.1.2 Time is sampled → temporal aliasing, the wagon-wheel effect⧉

• **the cause**: a video is a **discrete sequence** of frames at rate $f_s$ — it **samples** the continuous motion. As with any sampling, motion components **faster than half the frame rate cannot be represented** and instead **alias** — fold down to a false, slower (even reversed) apparent motion. This is **L5/L16 on the time axis**, full stop. [`fig-temporal-nyquist`]

• **the temporal Nyquist condition**: to capture motion of temporal frequency $f_{\text{motion}}$ (e.g. a wheel's spokes passing a point), you need

$$f_s > 2\,f_{\text{motion}}.$$

Violate it and the apparent frequency folds to $f_{\text{app}}=|f_{\text{motion}}-n f_s|$ for the integer $n$ that brings it into $[0,f_s/2]$.

• **the wagon-wheel effect** — the canonical demo: a spoked wheel filmed at frame rate $f_s$. When spoke-passing frequency is **below** Nyquist the wheel spins **forward** correctly; as it approaches $f_s$ the wheel appears to **slow, stop, then spin backward** — because each frame catches the next spoke *just short of* where the previous one was, so the sampled snapshots imply a small *backward* step. It is the **temporal twin of spatial moiré**: undersampling maps a high frequency to a wrong low one. Helicopter rotors and car rims under video/streetlights show the same. [`fig-temporal-aliasing-wagonwheel`]

• **strobing / judder** — the everyday face of it: fast pans or fast subjects at low frame rate or short shutter look **stuttery / juddery** rather than smoothly moving — the motion is *temporally undersampled* and the eye sees discrete jumps instead of continuous flow. (Stroboscopic lighting freezing or reversing fans is the same effect with the *light* doing the sampling.)

• **the fixes** (the sampling toolkit, in time): **raise $f_s$** (higher frame rate → higher temporal Nyquist — the brute-force cure, what high-speed cameras buy); or **prefilter before sampling** — which on the time axis means **integrate longer** (a longer exposure = a wider temporal box = a low-pass that attenuates the offending high temporal frequencies *before* the frame samples them). That prefilter **is motion blur** — which is why the two subsections are one story:

▸12.1.3 Motion blur *is* the temporal prefilter (the two are one tradeoff)⧉

• 💡 **the unification** (the L16 payoff): the temporal anti-alias filter you would *want* before sampling a video is a **low-pass over time** — and the camera already has one, *for free*: **integration over the exposure window**. A long exposure **box-filters the motion in time** before the frame rate samples it, suppressing the high temporal frequencies that would otherwise alias. **Motion blur and temporal aliasing are the two outcomes of one knob** ($\tau$): integrate more → **blur** (aliasing prefiltered away); integrate less → **sharp frames that strobe** (aliasing let through). [`fig-blur-as-temporal-prefilter`]

• **why you can't win for free**: this is the exact **L16** spatial dilemma — prefiltering trades aliasing for blur — transplanted to time. The "right" exposure is a *frequency-domain* compromise: long enough to kill aliasing, short enough to keep wanted motion sharp. CGI renderers face the dual problem and *add synthetic motion blur* precisely as a temporal anti-alias filter (Potmesil & Chakravarty 1983) — they integrate over sub-frame time samples so fast motion doesn't strobe.

• **the escape hatch — capture the full set, decide later** (cross-ref **L14**): shoot at a **high frame rate with a short shutter** (sharp, alias-free *if* $f_s$ is high enough) and **synthesize** any desired motion blur or slow-motion afterward by integrating/interpolating frames — defer the exposure-vs-blur decision out of the moment of capture. ("High-speed photography is to shutter speed what light fields are to aperture", **L14**; forward-ref [[Frame interpolation and slow-motion synthesis]].)

▸12.1.4 Lagrangian vs Eulerian: the organizing distinction for the part⧉

• 💡 **the dichotomy** (the frame that ties the whole part together): there are **two ways to describe a moving scene**, borrowed from fluid dynamics —

• **Lagrangian** — *follow the particles*: attach yourself to a *physical point* and ask where **it** goes over time. This is **tracking** and **optical flow** — a *displacement field* of *which point moved where* ($\frac{D}{Dt}$, the material derivative). The output is **trajectories / correspondences**.

• **Eulerian** — *watch fixed locations*: stay at a **fixed pixel** and record the **time series of intensity** passing through it. No correspondence, no "which point" — just *what happened at this grid cell over time*. The output is a **per-pixel signal in time**. [`fig-lagrangian-vs-eulerian`]

• **why it is the right organising axis** (and *where* each method sits):

• **Lagrangian methods** — [[Optical flow]] (dense, *follow every pixel's* motion), [[Feature tracking]] (sparse, *follow distinctive points*). Hard because correspondence is **non-local** and ill-posed (aperture problem, occlusion); the slides repeatedly note flow's *"Lagrangian nature"* as its core difficulty. Matching-based flow is "more Lagrangian" than differential flow precisely because it *follows the patch* rather than differentiating at a fixed pixel.

• **Eulerian methods** — **video magnification** ([[#Video magnification]]): *do not* solve correspondence at all; instead take the **time series at each fixed pixel**, **band-pass filter it in time**, and **amplify** the tiny temporal variations (a pulse, a sub-pixel sway, a sound-induced vibration) — then add them back. By *never* following particles, Eulerian processing sidesteps the hard correspondence problem entirely — that is *why* it works on motions too small to track. **This section exists to set that up.** (Recall the flow-slides' aside "Recall Eulerian Video Magnification" appearing right beside the differential brightness-constancy derivation — the contrast is deliberate.)

• **the tradeoff between them**: Lagrangian gives you **explicit, large-range correspondence** (you know *where the ball went*) but must *solve* the hard ill-posed matching; Eulerian is **trivial to compute** (just per-pixel temporal filtering) and brilliant for **tiny, sub-pixel** changes, but it has **no notion of where anything went** and breaks down for **large** motion (a big displacement is not a small per-pixel perturbation). Small motion → Eulerian; large motion → Lagrangian. Phase-based magnification is a clever middle ground (Eulerian *in a steerable-pyramid phase representation*, which encodes local motion).

• **optional sidebar — the fluid-dynamics origin**: the terms come from describing fluid flow. The **Lagrangian** specification tracks individual fluid *parcels* along their trajectories (a dye droplet); the **Eulerian** specification fixes a measurement grid and records velocity/density *at each fixed point* as fluid passes (a weather station). Computer-graphics fluid solvers make the same choice — **particle / SPH** methods are Lagrangian, **grid-based** solvers are Eulerian, and many real simulators are *hybrid* (FLIP, particle-in-cell) — exactly mirroring why vision uses **tracked particles** for large motion and **per-pixel grids** for tiny motion. The same dichotomy, the same reasons. [optional sidebar box]

12.2 Time-lapse photography⧉

▸12.3 Video compression and motion compensation⧉

`fig-video-temporal-redundancy` · temporal redundancy made visible — three near-identical consecutive frames whose per-pixel difference $I_t-I_{t-1}$ is near-black except a thin fringe at moving edges and a disoccluded patch; the only new information to code 🟨

`fig-mc-prediction-loop` · the motion-compensated prediction loop — encoder (block-match → motion vectors, residual $r=I_t-\hat I_t$ → DCT → quantize → entropy-code, plus a reconstruction branch storing the lossy reference) and the mirror-image decoder; $\min D+\lambda R$ in the mode-decision box 🟨

⬜ figure not yet created

decoder mirrors it)

⬜ figure not yet created

`fig-block-matching-search` (one macroblock in frame $t$, its search window in the reference frame, the best-match offset = the motion vector) fig-block-matching-search

⬜ figure not yet created

`fig-residual-vs-naive` (left: code the whole block fig-residual-vs-naive

`fig-ipb-gop` · a GOP timeline — an I-frame opening the group, P-frames predicting forward, B-frames predicting from past and future references; arrows are prediction dependencies, so coding order $\neq$ display order 🟨

`fig-mc-as-flow` · same correspondence, two budgets — a dense smooth optical-flow field vs the codec's one-constant-vector-per-block field; the same "where did this come from?" coarsened to what is cheap to estimate and transmit 🟨

The still-image part of this book taught one compression idea over and over: **don't store what the viewer won't miss, and don't store what you can predict.** JPEG ([[File formats and compression]]) throws away high-frequency chroma and finely-quantizes the DCT because perception won't notice. Video adds a second, far larger redundancy that stills can't touch: **the next frame looks almost exactly like this one.** At 30–60 fps the camera and the world barely move between frames; most of the picture is *already on screen*. Coding each frame independently (so-called **Motion-JPEG** — literally JPEG-per-frame) ignores this entirely and is wildly wasteful. Real codecs **predict each frame from its neighbours and code only the prediction error.** That is the whole subject.

💡 **Big lesson (correspondence on a budget):** **motion compensation is optical flow you can afford.** A codec needs, for every block, *where did this come from in a frame we already have?* — exactly the correspondence question of [[Optical flow]]. But it does not need a physically correct, dense, sub-pixel flow field; it needs a **coarse, block-constant** field that is **cheap to estimate** and, crucially, **cheap to transmit** — and it chooses that field to **minimise total bits**, not endpoint error. So decades before learned optical flow, video codecs were already estimating dense-ish correspondence at massive scale — just optimised for rate, not accuracy. The recurring move: *reuse what the decoder already has, and code only the difference.* (Recurs as the framing for [[Video magnification]], which keeps the difference instead of discarding it; and is the rate-aware cousin of the correspondence story in [[Optical flow]].)

equations

motion-compensated **residual** $r(x,y)=I_t(x,y)-\hat I_t(x,y)$ where $\hat I_t(x,y)=I_{\text{ref}}\big(x-u,\,y-v\big)$ is the reference frame shifted by the block's motion vector $(u,v)$

**block-matching cost** $(u,v)=\arg\min_{(u,v)}\sum_{(x,y)\in B}\big|I_t(x,y)-I_{\text{ref}}(x-u,y-v)\big|$ (SAD over a block $B$, sometimes SSD)

**rate–distortion** objective $\min\ D+\lambda R$ — choose motion vectors/modes to minimise distortion $D$ *plus* $\lambda$ times the bits $R$ (the codec optimises bits, not physical accuracy)

coded frame $\approx$ entropy-code$\big(\text{quantize}(\text{DCT}(r))\big)$ + motion vectors

▸12.3.1 Why video compresses far better than still × N — temporal redundancy⧉

• **the observation**: consecutive frames are nearly identical. Subtract one from the next and the difference image $I_t-I_{t-1}$ is **near-zero almost everywhere** — non-zero only at moving edges, occlusions, and lighting changes. That near-black difference image *is the only new information*. [`fig-video-temporal-redundancy`]

• **the naive baseline (Motion-JPEG)**: code every frame as an independent JPEG. Simple, edit-friendly, random-access — but it pays the *full* still-image cost $N$ times and exploits **none** of the temporal redundancy. It's the "still × N" strawman; real codecs beat it by 5–50×.

• **the lever**: instead of coding $I_t$, code the **residual** between $I_t$ and a **prediction** $\hat I_t$ built from already-decoded frames. If the prediction is good, the residual is **mostly zero** → after DCT + quantization almost all coefficients vanish → entropy coding crushes it. *Good prediction = small residual = few bits.* [`fig-residual-vs-naive`]

• **two redundancies, stacked**: video keeps JPEG's **spatial/perceptual** redundancy removal (DCT, chroma subsampling, perceptual quantization — all carry over unchanged) **and adds temporal** redundancy removal on top. The temporal one is the big win.

• **why simple frame-differencing isn't enough**: just coding $I_t-I_{t-1}$ works when nothing moves, but **any motion** turns a sharp edge into a bright residual band (the edge is in a different place). The fix is to *predict with motion* — line the block up with where it actually went — so the residual stays small even as things move. That is **motion compensation** (next).

▸12.3.2 Motion-compensated prediction — the core trick⧉

• **the idea**: a moving object's block in frame $t$ is, to first order, **the same pixels** as some block in a **reference frame**, just **displaced**. So for each block, find that displacement, copy the matched patch as the prediction, and code only the leftover. [`fig-mc-prediction-loop`]

• **macroblocks**: partition the frame into blocks (classically 16×16 luma "macroblocks"; modern codecs use variable sizes down to 4×4). Each block is predicted **independently** with its own motion.

• **block matching** (the correspondence step): for a block $B$ in frame $t$, search a window in the reference frame for the **best-matching** patch and record the offset $(u,v)$ — the **motion vector**:

$$(u,v)=\arg\min_{(u,v)}\sum_{(x,y)\in B}\big|\,I_t(x,y)-I_{\text{ref}}(x-u,\,y-v)\,\big|$$

(sum of absolute differences, SAD; SSD is the smooth cousin — same cost-volume idea as [[Optical flow]], just per-block instead of per-pixel). Search is sped up with hierarchical/coarse-to-fine search and predicted starting points; **sub-pixel** motion vectors (half/quarter-pel, with interpolation) sharpen the match. [`fig-block-matching-search`]

• *pseudocode*: block matching as a per-block search over the displacement window — minimize SAD, store the motion vector, then form the residual.

• **the prediction**: $\hat I_t(x,y)=I_{\text{ref}}(x-u,\,y-v)$ — the reference frame shifted by the block's motion vector.

• **the residual** — the thing we actually code:

$$r(x,y)=I_t(x,y)-\hat I_t(x,y)$$

small wherever the match was good. [`fig-residual-vs-naive`]

• **reuse the JPEG pipeline on the residual**: take $r$ block by block → **DCT** → **quantize** (a quantization table, with a quality/QP knob, exactly as in [[File formats and compression]]) → **entropy-code** the quantized coefficients. The codec transmits, per block: the **motion vector** $(u,v)$ + the **coded residual**. (Motion vectors themselves are predictively coded from neighbours — they're spatially correlated too.)

• **the decoder mirrors the encoder**: decode the residual, fetch the same reference patch via the transmitted motion vector, add them: $I_t=\hat I_t+r$. Critically the encoder predicts from the **reconstructed (lossy) reference**, not the original, so encoder and decoder stay in lock-step (no drift).

• **rate–distortion is the real objective**: the encoder doesn't pick the *most accurate* motion vector — it picks the one minimising $D+\lambda R$: reconstruction error **plus** the bits to send the vector and residual. A slightly-worse match that costs far fewer bits wins. This is *why* the field is "on a budget."

▸12.3.3 I, P, and B frames; GOP structure⧉

• **the prediction graph**: not every frame can be predicted from another — you need an entry point, and you'd like both backward and forward prediction. Three frame types: [`fig-ipb-gop`]

• **I-frame (Intra)**: coded **alone**, no temporal prediction — essentially a **JPEG** (spatial/DCT only). Big (most bits) but **self-contained**: a random-access point and a resync after errors.

• **P-frame (Predicted)**: predicted from a **previous** decoded frame (forward only). Sends motion vectors + residual; much smaller than an I-frame.

• **B-frame (Bidirectional)**: predicted from frames on **both** sides (a past **and** a future reference), often averaging the two predictions. Smallest of all — great for smooth motion and revealed/occluded regions — but requires **out-of-order** decoding (the future reference must be decoded first), which is why **coding order ≠ display order**.

• **GOP (Group of Pictures)**: the repeating pattern, e.g. `I B B P B B P …`, restarting at the next I-frame. The GOP length trades off:

• **short GOP** (frequent I-frames) → easy random access / seeking, robust to errors, **bigger** files;

• **long GOP** (rare I-frames) → best compression, but slow seeking and error propagation until the next I-frame.

• **practical reads**: streaming/editing want shorter GOPs (seek, splice); archival/broadcast push longer GOPs for size. **All-intra** (I-only) modes exist for editing-friendly, frame-accurate video at the cost of size — that's essentially Motion-JPEG with a better intra coder.

▸12.3.4 Why this is "optical flow on a budget"⧉

• **the equivalence to state plainly**: the field of motion vectors over all blocks **is a correspondence field** — a piecewise-constant (one vector per block) approximation to the **optical flow** between frame and reference ([[Optical flow]]). Motion compensation *is* dense correspondence, computed at scale, in every video you've ever watched. [`fig-mc-as-flow`]

• **but it's the budget version**, and the differences are the point:

• **block-constant, not per-pixel**: one vector covers a whole block (real flow varies within it) — the residual mops up the within-block error.

• **rate-optimal, not accuracy-optimal**: chosen to minimise **bits** ($D+\lambda R$), so it will happily pick a *physically wrong* vector if it codes cheaper. A codec's motion field is **not** a measurement of scene motion — don't read it as ground-truth flow.

• **block-matching, not variational**: pure SAD/SSD search (the cost-volume primitive of [[Optical flow]]) with no global smoothness solve — smoothness comes implicitly from predictively coding the vectors and from rate cost.

• **handles failure by falling back**: where prediction is hopeless (scene cut, fast new content, occlusion), the encoder just codes the block (or whole frame) **intra** — a graceful degradation to the still-image path.

• **the conceptual bridge**: the *same* aperture problem, search window, and sub-pixel interpolation appear in both. Learned optical flow (RAFT etc., [[Optical flow]]) and learned video codecs both build on this cost-volume correspondence skeleton; the codec just got there first, optimised for a different loss.

▸12.3.5 Modern codecs in one breath⧉

• **the lineage, same skeleton**: MPEG-1/2 → **H.264/AVC** (Wiegand 2003) → **H.265/HEVC** (Sullivan 2012) → **AV1** (AOMedia 2018) → H.266/VVC. Every one keeps **motion-compensated block prediction + transform-coded residual + entropy coding**. What advances is *how cleverly each piece is done*, not the architecture.

• **what they add on the same bones**: **variable / recursive block sizes** (big blocks for flat motion, tiny for detail); **multiple reference frames** and **better sub-pixel** interpolation; **intra prediction** *within* a frame (predict a block from already-decoded neighbours — directional modes — so even I-frames beat JPEG); richer **entropy coders** (CABAC); **in-loop deblocking/SAO** filters to hide block edges; **AV1** adds warped/affine motion (a step *toward* true flow), overlapped block MC, and is **royalty-free**.

• **the takeaway**: 25 years of codec progress is **better prediction + smarter residual coding within the unchanged motion-compensation framework** — and the framework is, at heart, *correspondence-on-a-budget* applied to the temporal axis. Learned neural codecs now replace pieces (the motion model, the residual transform) with networks, but the **predict-then-code-the-residual** spine is the same.

▸12.4 Video magnification⧉

⬜ figure not yet created

`fig-eulerian-vs-lagrangian-mag` (two routes to amplify small motion: **Lagrangian** — estimate flow $\delta(t)$, scale it, re-render fig-eulerian-vs-lagrangian-mag

fig-pixel-timeseries-bandpass · one fixed pixel, signal to output — its value over time, its temporal spectrum (a small in-band peak among DC and noise), a band-pass keeping that band, and the band scaled by $\alpha$ and added back; identical at every pixel 🟨

⬜ figure not yet created

`fig-pulse-color-mag` (a face video fig-pulse-color-mag

`fig-firstorder-motion-mag` · intensity amplification is motion amplification to first order — a 1-D edge displaced by $\delta(t)$ giving temporal change $\delta(t)I_x$, amplified to a $(1+\alpha)\delta(t)$ shift; right panel where a large $\delta$ or sharp edge breaks it (haloing, clipping) 🟨

⬜ figure not yet created

`fig-phase-vs-linear-mag` (same clip: linear intensity magnification (haloing, noise blow-up) vs phase-based magnification in a steerable pyramid (clean, larger $\alpha$)) fig-phase-vs-linear-mag

⬜ figure not yet created

`fig-visual-microphone` (sound vibrates a plant/chip-bag

You cannot see a heartbeat in a face, or a skyscraper sway, or a wall flex under a passing truck — these changes are real but **below the threshold of perception**, buried in pixel values that look constant. Yet the information is *there* in an ordinary video: the green channel of a cheek really does brighten and dim with each pulse; an edge really does move a fraction of a pixel as a structure vibrates. **Video magnification** is the family of methods that **pulls out these tiny temporal variations and scales them up** until they're plainly visible — turning a normal camera into an instrument for the invisible.

💡 **Big lesson (Eulerian vs Lagrangian — the cheaper view often wins):** there are two ways to amplify small motion, the same dichotomy from [[Motion blur and temporal sampling]]. The **Lagrangian** way *follows points*: estimate the optical-flow displacement $\delta(t)$ of each feature, multiply it, and re-render the frame with the exaggerated motion. It's intuitive but **fragile** — it inherits every failure of [[Optical flow]] (occlusion, aperture problem, sub-pixel error), and tracking *tiny* motions accurately is exactly where flow is weakest. The **Eulerian** way *stays put*: at each **fixed pixel** treat the value over time as a signal, **band-pass** it, amplify, add back — **no motion estimation at all**. For small variations this is dramatically simpler and more robust, and it amplifies **color** changes (a pulse) just as naturally as motion. The recurring moral: *picking the right frame of reference (fixed grid vs moving point) can turn a hard estimation problem into a trivial filtering one.*

equations

**Eulerian linear amplification** $I'(x,t)=I(x,t)+\alpha\,B\{I(x,t)\}$, where $B\{\cdot\}$ is a temporal **band-pass** of the per-pixel signal and $\alpha$ the magnification factor

**first-order motion link** — a feature translating as $I(x,t)=f(x+\delta(t))$ band-passes to $B\approx\delta(t)\,I_x$, and adding $\alpha B$ gives $I'(x,t)\approx f\big(x+(1+\alpha)\delta(t)\big)$ (amplifying the band-passed intensity ≈ amplifying the displacement by $(1+\alpha)$)

**phase-based** — in a complex steerable sub-band $S(x,t)=A(x,t)\,e^{i\phi(x,t)}$, temporally band-pass the **local phase** $\phi$ and amplify it ($\phi\to\phi+\alpha\,B\{\phi\}$), which shifts the sub-band signal without amplifying amplitude noise

▸12.4.1 The two views: Lagrangian (track-then-amplify) vs Eulerian (per-pixel signal)⧉

• **the goal**: take subtle temporal change and make it visible — without (ideally) needing to solve correspondence. [`fig-eulerian-vs-lagrangian-mag`]

• **Lagrangian magnification** (Liu et al. 2005, the precursor): **estimate the motion** (optical flow) of features, **scale the displacement** $\delta(t)\to(1+\alpha)\delta(t)$, and **warp** the frame to render the exaggerated motion. Physically explicit, but: needs **accurate flow** for *sub-pixel* motions (the hardest regime), needs **layer/occlusion** handling, and tends to **break and tear** where tracking fails. Powerful in principle, brittle in practice.

• **Eulerian magnification** (Wu et al. 2012, the breakthrough): **fix the spatial grid** and look at the **time series** $t\mapsto I(x,y,t)$ at each pixel. **Band-pass** it temporally to the band you care about (e.g. 0.8–4 Hz for heart rate), **amplify** that band, **add it back**. **No flow, no warping, no tracking.** It's a pure **temporal signal-processing** operation done independently per pixel (per spatial sub-band). [`fig-pixel-timeseries-bandpass`]

• **the connection to the rest of the book**: this is the **Eulerian frame of reference** introduced in [[Motion blur and temporal sampling]] (watch a fixed location and ask *how does it change here?*) versus the **Lagrangian** (follow a particle), exactly the fluid-simulation analogy. Video magnification is the clearest payoff of that distinction.

▸12.4.2 Eulerian video magnification — band-pass and amplify⧉

• **the pipeline**, per pixel / per spatial band:

1. **decompose spatially** into a pyramid ([[Linear pyramids and wavelets]]) — process each band/scale separately (this also lets you spatially smooth, which controls noise).

2. **temporal band-pass** each pixel's time series with $B\{\cdot\}$ — keep only the frequencies of the phenomenon (pulse band, vibration band); discard the DC "static" part and out-of-band content.

3. **amplify and recombine**:

$$I'(x,t)=I(x,t)+\alpha\,B\{I(x,t)\}$$

add an $\alpha$-scaled copy of the band-passed signal back onto the original. Collapse the pyramid → the magnified video. [`fig-pixel-timeseries-bandpass`]

#### Why amplifying a color signal reveals a pulse

• **the cleanest case — no motion at all**: blood flow modulates skin reflectance slightly; the pixel's green-channel value oscillates at the heart rate. Band-pass to 0.8–4 Hz, amplify → the face visibly **blushes** in time with the (recoverable) pulse waveform. **No model of motion needed** — it's literally a color change over time. [`fig-pulse-color-mag`] (This is the basis of contactless **photoplethysmography** / remote vital signs.)

#### Why amplifying intensity also reveals motion — the first-order argument

• **the argument** (one Taylor expansion): take a 1-D feature translating by a tiny displacement, $I(x,t)=f\big(x+\delta(t)\big)$. First-order Taylor: $I(x,t)\approx f(x)+\delta(t)\,f_x(x)$, so the **temporal variation at a fixed pixel is $\delta(t)\,I_x$** (displacement × spatial gradient — the same brightness-constancy term as [[Optical flow]], here *exploited* rather than solved). Band-passing isolates $B\approx\delta(t)\,I_x$; adding $\alpha B$ gives

$$I'(x,t)\approx f(x)+(1+\alpha)\,\delta(t)\,I_x\approx f\big(x+(1+\alpha)\delta(t)\big),$$

i.e. the result **looks like the feature moved $(1+\alpha)$ times as far.** Amplifying the per-pixel intensity signal *is* amplifying the motion — **without ever estimating it** (the formal content of the **L17 exception**). [`fig-firstorder-motion-mag`]

#### The catches — noise and large motion

• **first-order, so small/smooth only**: the approximation is **first-order** → only valid for **small** $\delta$ and **smooth** edges; large motions or sharp edges produce **artifacts/haloing** (the linearization breaks). [`fig-firstorder-motion-mag`, right]

• **it amplifies noise too**: the band-pass passes sensor noise in-band; multiplying by $\alpha$ blows it up. Remedy: **spatially blur / use coarser pyramid levels** for large $\alpha$ (trade spatial detail for SNR), and bound $\alpha$ by the spatial wavelength (amplify low spatial frequencies more).

• **single global band**: fine for one dominant frequency (a pulse), clumsy for broadband motion. The limit the next method addresses.

▸12.4.3 The phase-based successor — amplify phase, not intensity⧉

• **the diagnosis**: linear *intensity* magnification conflates "the edge moved" with "the brightness changed," so it amplifies amplitude **noise** and **clips/halos** at edges. Motion is better expressed not as intensity change but as a **shift in local phase**.

• **the fix — work in a complex steerable pyramid** ([[Linear pyramids and wavelets]]): decompose each frame into oriented, band-pass sub-bands whose coefficients are **complex**, $S(x,t)=A(x,t)\,e^{i\phi(x,t)}$, with **amplitude** $A$ and **local phase** $\phi$. By the Fourier shift theorem, **local translation of a sub-band shows up as a change in its phase $\phi$**, while the amplitude $A$ (the "content") stays put.

• **the operation**: **temporally band-pass the phase** $\phi(x,t)$ and amplify *it*, $\phi\to\phi+\alpha\,B\{\phi\}$, leaving $A$ alone. This **shifts** each sub-band — i.e. *moves* the content — by the amplified amount, **without scaling amplitude noise**. [`fig-phase-vs-linear-mag`]

• **why it's better**: (1) **far fewer artifacts** — phase manipulation is a clean sub-pixel shift, not an intensity add, so much less haloing; (2) **supports larger $\alpha$** (bigger, cleaner magnification); (3) **noise is translated, not amplified** (it moves with the band instead of blowing up). The cost is the steerable-pyramid machinery and more compute.

• **the unifying view**: linear Eulerian ≈ amplify the **intensity** band-pass; phase-based ≈ amplify the **phase** band-pass in an oriented complex pyramid. Both are **Eulerian** (per-location, no flow) — phase-based just uses a representation in which "small motion" lives on a cleaner axis.

▸12.4.4 Uses, and the bookend to compression⧉

• **vital signs (contactless)**: recover **heart rate / pulse waveform** (and even breathing) from a webcam video of a face or hand — remote photoplethysmography, telehealth, driver monitoring, neonatal care. [`fig-pulse-color-mag`]

• **structural & material vibration**: visualize and **measure** how bridges, buildings, blades, and machines vibrate; reveal **resonant modes** and material properties from ordinary video — a cheap, full-field, non-contact alternative to accelerometers.

• **the visual microphone — sound from silent video** (Davis et al. 2014): sound waves vibrate everyday objects (a chip bag, a plant leaf, a glass of water) by **micrometres**; magnifying and reading those per-sub-band micro-motions **recovers the audio** that caused them — *speech and music from a silent, high-speed video*. The most striking demonstration that the signal was there all along. [`fig-visual-microphone`]

• **other cues**: subtle **facial / lip** motion for speech and affect; revealing **breathing**, fluid flow, and slow drifts; an analysis tool for any "is something moving that I can't see?" question.

• **the bookend to compression**: [[Video compression and motion compensation]] spends all its effort **predicting away and discarding** the inter-frame difference (it's redundant, throw it out). Video magnification does the **exact opposite** — it treats that same tiny temporal residual as **the signal**, and **amplifies** it. Same temporal-difference quantity; opposite intent. A clean way to remember both: *compression hides the small temporal change; magnification reveals it.*

▸12.5 Video stabilization and rolling-shutter correction⧉

`fig-stab-pipeline` · the stabilization pipeline — estimate (fit inter-frame $F_t$ from KLT/RANSAC, chain into the real path $C_t$) → smooth ($C_t\to P_t$) → re-render (warp by $W_t=P_t C_t^{-1}$, crop to a valid window); the path is the correspondence, the warp the transport 🟨

`fig-stab-trajectory-smoothing` · smoothing the camera-path signal — a jittery raw trajectory $C_t$, a Gaussian low-pass that removes tremor but lags/overshoots at pans, and an $L_1$-optimal path snapping to static/constant-velocity/eased segments with sharp transitions 🟨

`fig-stab-crop-window` · the stabilization↔crop tradeoff on one frame — the captured rectangle warped by $W_t$ to a tilted quad with empty margins, the crop window the largest rectangle valid for every frame (upscaled back); more shake removed forces a smaller crop 🟨

⬜ figure not yet created

`fig-l1-cinematic-paths` (the three allowed motion primitives — static hold, constant pan, smooth ease-in/out — that L1 stitches together) fig-l1-cinematic-paths

`fig-rolling-shutter-skew` · rolling-shutter distortion — a global shutter keeping a vertical pole upright under a fast pan vs a rolling shutter where per-row readout $t(r)=t_\text{frame}+r\,t_\text{row}$ shears the pole and smears fan blades (the jello effect) 🟨

⬜ figure not yet created

`fig-rolling-shutter-perrow` (the row-time diagram: each scanline exposed at $t_0 + r\,\Delta t$, each reading a different camera pose, then rectified to one virtual instant) fig-rolling-shutter-perrow

The previous chapters made motion an *enemy to estimate* (optical flow) or a *cue to track* (KLT). Here motion is the thing we want to **edit**: handheld footage carries an involuntary high-frequency camera path on top of the intended one. Every method in this chapter is one pipeline — **measure the real path, design a better path, warp the frames onto it** — and the only thing you give up is the **border pixels** that the better path no longer sees.

💡 **Big lesson (recurrence of L16 · prefilter before you downsample / temporal aliasing):** stabilization is *temporal* signal processing on the camera-path signal — and the same Nyquist intuition applies. We **low-pass the trajectory** to remove jitter, and rolling-shutter "jello" is precisely a **temporal aliasing** artifact (rows sampled at different times under fast motion, the video cousin of the wagon-wheel effect). (→ see Big lesson **L16**, first placed in BASIC → Resampling; here it recurs as *smooth the path in time, and beware time-varying sampling within a frame*.)

equations

cumulative path $C_t = F_t F_{t-1}\cdots F_1$ from inter-frame transforms $F_t$ (each a homography or affine fit to feature matches)

update/correction warp $W_t = P_t\, C_t^{-1}$ sending real pose $C_t$ to smoothed pose $P_t$

low-pass smoothing $P_t = \sum_k g_k\, C_{t-k}$ (Gaussian weights $g_k$)

**L1 objective** $\min_{P}\ \lVert D^1 P\rVert_1 + \lVert D^2 P\rVert_1 + \lVert D^3 P\rVert_1$ (penalize 1st/2nd/3rd path derivatives in $L_1$ → mostly-zero derivatives = static / constant-velocity / constant-acceleration segments) subject to the crop-window inclusion constraints

rolling-shutter row time $t(r) = t_{\text{frame}} + r\cdot t_{\text{row}}$ and per-row pose $R(t(r))$, rectify pixel $(r,c)$ by reprojecting through $R(t_{\text{ref}})\,R(t(r))^{-1}$

▸12.5.1 What stabilization is: a camera-path signal to be smoothed⧉

• **the framing**: the recorded video = an intended camera motion (a slow pan, a hold) **plus** an unwanted high-frequency tremor. The tremor is a *signal on the camera path*. Stabilization removes it without removing the intended motion — exactly a **low-pass filter on the trajectory**, with the subtleties that (a) the "samples" are 2-D image transforms, not scalars, and (b) we must not crop away the whole frame. [`fig-stab-pipeline`]

• **the three stages** (the spine of the chapter): **(1) estimate** the per-frame camera motion and accumulate it into a trajectory; **(2) smooth** that trajectory into the steady path a tripod/dolly/Steadicam would have followed; **(3) re-render** each frame by warping it from its real pose to the smoothed pose, then **crop/inpaint** the exposed borders.

• **why it works without 3-D**: for a *purely rotating* camera (or a distant/planar scene) the inter-frame motion is a **homography** — the same "depth doesn't matter for a rotation or a plane" fact behind panorama reprojection (cross-ref panorama *Warping and assembling*). Most handheld shake is dominated by rotation, so a 2-D motion model already does the heavy lifting; true translation through a 3-D scene (parallax) is the hard case that needs content-preserving warps (Liu 2009) or 3-D reconstruction.

▸12.5.2 Stage 1 — estimating the camera trajectory⧉

• **inter-frame motion**: between consecutive frames, fit a global 2-D transform $F_t$ (translation → affine → **homography**, in increasing fidelity) to a set of **feature correspondences**. Get the correspondences from **KLT tracks** or matched corners (cross-ref [[Feature tracking]]), and fit **robustly with RANSAC** so that moving foreground objects don't hijack the camera-motion estimate (cross-ref panorama *Automatic stitching* — same RANSAC homography fit). [`fig-stab-pipeline`, left]

• **cumulative trajectory**: chain the inter-frame transforms into an absolute path, $C_t = F_t F_{t-1}\cdots F_1$. The camera path is now a *sequence of transforms* (or, parameterized, a multi-dimensional time signal: translation, rotation, scale, …). [`fig-stab-trajectory-smoothing`]

• **drift — the catch to name**: each $F_t$ has small estimation error, and chaining **accumulates** it, so $C_t$ slowly drifts (the same error-accumulation that **bundle adjustment** fixes in panoramas/SfM — cross-ref). For finite clips this is usually tolerable; long sequences want global optimization or loop closure.

• **sensor alternative**: instead of estimating rotation from pixels, **read it from the phone's gyroscope** — Karpenko et al. 2011 integrate the gyro to get the camera's rotation directly, sidestepping feature tracking (cheap, robust to scene content, and exactly the per-instant rotation rolling-shutter correction also needs). Forward-pointer to Stage-4 rolling shutter, where the gyro really pays off.

▸12.5.3 Stage 2 — smoothing the path (low-pass vs L1-optimal cinematic paths)⧉

• **naive low-pass**: convolve the trajectory with a temporal Gaussian, $P_t = \sum_k g_k\,C_{t-k}$ — simple, removes jitter. **Problems**: it **lags** and **overshoots** at genuine motion onsets (start/stop of a real pan), it **blurs intended motion**, and a fixed kernel can't tell "deliberate whip-pan" from "tremor." Also needs future frames (acausal) or it lags. [`fig-stab-trajectory-smoothing`]

• **L1-optimal cinematic paths (Grundmann et al. 2011)** — the key idea: don't just *smooth*; **synthesize the path a professional camera operator would shoot**, which is made of three primitives — **static holds**, **constant-velocity pans/dollies**, and **smooth ease-in/ease-out** transitions. These correspond to the path's 1st/2nd/3rd derivatives being **zero** for long stretches. So minimize the $L_1$ norm of those derivatives:

$$\min_{P}\ \ w_1\lVert D^1 P\rVert_1 + w_2\lVert D^2 P\rVert_1 + w_3\lVert D^3 P\rVert_1$$

$L_1$ (not $L_2$) is the whole trick: it is **sparsity-inducing**, so the optimal path has derivatives that are **exactly zero almost everywhere**, broken by a few **deliberate transitions** — i.e. crisp static-then-constant-then-eased segments rather than a perpetually wobbling smooth curve. (Same sparsity intuition as the sparse-gradient image priors in [[Blind deblurring]], applied to the *time* axis.) [`fig-l1-cinematic-paths`]

• **the crop constraint, made explicit**: the optimization is **constrained** so that the smoothed frame always stays **inside** the real frame — i.e. the crop window (a rectangle inset by a chosen margin) must, after warping by $P_t C_t^{-1}$, contain only real pixels. Grundmann solve it as a **linear program** (L1 objective + linear inclusion constraints). This bakes the stabilization↔crop tradeoff directly into the path design.

• **content-preserving warps (Liu et al. 2009)** — for true 3-D parallax: a single homography can't re-render a translating camera correctly (near and far move differently). Liu's method allows a **spatially-varying warp** guided by a sparse 3-D reconstruction, with a regularizer that keeps the warp **as-rigid-as-possible** so straight lines and shapes are preserved even though it isn't a global homography (ties to the as-rigid-as-possible warping note in [[Warping and resampling]]). This is the bridge from 2-D-path stabilization to genuinely 3-D-aware stabilization.

▸12.5.4 Stage 3 — re-rendering and the stabilization↔crop tradeoff⧉

• **the warp**: for each frame, apply the **correction transform** $W_t = P_t\,C_t^{-1}$ — "undo where the camera actually was, redo where the smooth path wants it," then **resample** (cross-ref [[Warping and resampling]] for the resampling/interpolation). [`fig-stab-pipeline`, right]

• **why borders go missing**: warping to the smoothed pose swings the frame quad off the original image grid; parts of the output rectangle fall **outside** the captured pixels → empty margins. [`fig-stab-crop-window`]

• **two fixes, a real tradeoff**:

• **crop & zoom** (standard): keep only a smaller central window that stays valid for all frames, then upscale back to full size. **More stabilization ⇒ smaller crop window ⇒ less field of view + softer (upscaled) image.** This is *the* tradeoff — typical consumer modes crop ~10–20%. [`fig-stab-crop-window`]

• **inpaint / fill from neighbors**: paint the missing borders using content from **other frames** (temporally nearby frames saw those pixels) or a generative **inpainter** (forward-ref EDGES MATTER inpainting / [[Generative AI and diffusion]]). Keeps full FOV at the cost of synthesized (possibly wrong) borders and flicker risk.

• **adaptive crop**: choose the crop/margin **per clip** or per segment so smooth shots crop little and violent shake crops more — turning the fixed tradeoff into a content-aware one.

▸12.5.5 Rolling-shutter correction: per-row pose and rectification⧉

• **the cause**: most CMOS sensors are **read out row by row**, not all at once (global shutter). Row $r$ is exposed at time $t(r) = t_{\text{frame}} + r\cdot t_{\text{row}}$. If the camera (or subject) moves appreciably during that **readout sweep** (a few to ~30 ms), each row records a **different camera pose**. [`fig-rolling-shutter-perrow`]

• **the artifacts** (name them, they're diagnostic): **skew/shear** of vertical edges under horizontal pan; **wobble/"jello"** under tremor; **partial exposure** banding and the spinning-propeller "smear" — all because the top and bottom of the frame are effectively *different time samples*. This is **temporal aliasing within a single frame** (callback to L16). [`fig-rolling-shutter-skew`]

• **the model & fix**: treat the per-row pose as a continuous function of time $R(t(r))$. **Rectify** by reprojecting every pixel from the pose at its own row-time onto a single **reference pose** $R(t_{\text{ref}})$ (e.g. the first row's, or the frame center's):

$$(r,c)\ \longmapsto\ \text{reproject through } R(t_{\text{ref}})\,R\!\big(t(r)\big)^{-1}.$$

Each row is warped by a slightly different transform → the per-row warps "un-shear" the frame back to one virtual instant.

• **where the pose comes from**: **Karpenko et al. 2011** read $R(t)$ from the phone's **gyroscope** (after calibrating the gyro-to-frame timing and the row readout time $t_{\text{row}}$) — robust and cheap, and it doubles as the stabilization rotation estimate. **Forssén & Ringaby 2010** instead estimate the per-row rotation **from the image** (feature tracks + a continuous-time motion model), then resample. Either way the output is a rolling-shutter-free frame, after which ordinary stabilization runs.

• **order & coupling**: rolling-shutter rectification should precede (or be jointly solved with) trajectory smoothing — otherwise the homography estimates in Stage 1 are corrupted by the very shear we're trying to remove. This is exactly the rolling-shutter caveat the panorama chapter forward-refs for **continuous phone sweeps** (cross-ref [[Continuous panoramas (e.g. on cell phones)]]), where readout-time motion is most severe.

• **the capture-side escape**: a **global-shutter** sensor reads all rows at once and has no rolling shutter at all — the optical fix, paid for in sensor cost/complexity. (Same spirit as "engineer the system so you don't have to invert a nasty artifact.")

▸12.6 Frame interpolation and slow-motion synthesis⧉

`fig-interp-as-morph` · interpolation as a morph — a plain cross-dissolve of two frames yielding two ghosts vs a flow-warp where corresponded points slide halfway before blending so the subject moves rather than fades; the correspondence here is automatic optical flow 🟨

`fig-flow-warp-blend` · the flow-based interpolation pipeline — estimate bidirectional flow, backward-warp each frame to time $t$ ($-t\mathbf F_{0\to1}$ and $(1-t)\mathbf F_{1\to0}$), convex-blend by $(1-t),t$; a highlighted disocclusion hole present in only one warped frame 🟨

⬜ figure not yet created

`fig-forward-vs-backward-warp` (splatting a pixel forward leaves holes/collisions fig-forward-vs-backward-warp

⬜ figure not yet created

`fig-occlusion-disocclusion` (a moving foreground: the background it *uncovers* exists in only one of the two frames → which frame to copy from) fig-occlusion-disocclusion

`fig-slowmo-axis` · four ways to treat the time axis on a shared timeline — normal capture (sparse), high-speed (dense true samples), interpolation (sparse reals with synthesized in-betweens), and long-exposure blur (the integral); only interpolation adds resolution after capture and can be wrong 🟨

⬜ figure not yet created

`fig-film-largemotion` (a fast-moving subject where small-motion flow fails and FILM's scale-agnostic pyramid still tracks it) fig-film-largemotion

A normal camera shoots, say, 30 fps; to play a moment 8× slower and smooth you'd need ~240 distinct instants per second that were never recorded. You can either **capture** them (a high-speed camera — true, but expensive and light-hungry) or **synthesize** them. Frame interpolation synthesizes: given two real frames, invent the ones that belong **between** them. The whole problem reduces to a question you already met in [[Morphing]] — *what moved where?* — answered here by **optical flow** or by a **learned** model.

💡 **Big lesson (recurrence of L14 · capture the full set, decide later — and its limits):** high-speed photography is the honest way to own every instant (capture the full temporal set, pick the moment/shutter later). Interpolation is the **budget substitute**: when you *didn't* capture the full set, a **prior** invents the missing instants. So this chapter sits at the seam between **L14** (capture everything) and **L10** (the prior is not optional) — the in-between frames are partly **reconstructed** (where flow is reliable) and partly **hallucinated** (across occlusions). (→ see Big lessons **L14**, first in MULTIPLE EXPOSURE, and **L10**, first in Super-resolution.)

equations

linear motion assumption $x_t = (1-t)\,x_0 + t\,x_1$ along a flow vector

backward-warp synthesis $I_t(\mathbf{p}) = (1-t)\,I_0(\mathbf{p} - t\,\mathbf{F}_{0\to 1}(\mathbf{p})) + t\,I_1(\mathbf{p} + (1-t)\,\mathbf{F}_{1\to 0}(\mathbf{p}))$ (warp each source to $t$, then convex-blend by temporal distance)

occlusion-aware blend $I_t = \dfrac{(1-t)\,V_0\,W_0 + t\,V_1\,W_1}{(1-t)\,V_0 + t\,V_1}$ with visibility masks $V_0,V_1$ (down-weight the frame where a pixel is occluded)

flow consistency check $\lVert \mathbf{F}_{0\to1}(\mathbf p) + \mathbf{F}_{1\to0}(\mathbf p + \mathbf F_{0\to1}(\mathbf p))\rVert \le \tau$ (forward-backward agreement → trust map)

▸12.6.1 Why interpolate: faking slow-motion and up-converting frame rate⧉

• **slow-motion**: to slow footage $N\times$ smoothly you need $N{-}1$ synthetic frames between every real pair. Interpolation manufactures them from ordinary 30/60 fps footage — no high-speed camera, no extra light. [`fig-slowmo-axis`]

• **frame-rate up-conversion**: 24→48, 30→60→120 fps for smoother playback, VR/high-refresh displays, and TV "motion smoothing" (the much-debated "soap-opera effect").

• **the four ways to treat time** (place interpolation among rivals): **normal capture** = sparse samples with gaps; **high-speed capture** = dense true samples (expensive, dim); **interpolation** = synthesized in-betweens (cheap, can be wrong); **long-exposure / motion blur** = the *integral* over time, not the instants (cross-ref temporal averaging / "max-over-time" in [[Video editing]]). Interpolation is the only one that adds temporal resolution *after* capture. [`fig-slowmo-axis`]

• **the honest limit**: synthesized frames are guesses. Where motion is large, fast, or self-occluding, they can warp, ghost, or tear — which is exactly what the learned methods target.

▸12.6.2 Interpolation = morphing between adjacent frames⧉

• **the equivalence**: a [[Morphing]] takes two images + a correspondence and produces in-betweens by **warping both toward an intermediate shape and cross-dissolving**. Frame interpolation is that, with **two consecutive video frames** as the endpoints. [`fig-interp-as-morph`]

• **what's added over a static morph**: in classic morphing a human supplies the correspondence (Beier–Neely feature lines); here the correspondence is the **optical flow**, estimated automatically. So interpolation = **automatic** morphing, where motion estimation replaces hand-drawn correspondence.

• **why not just cross-dissolve**: a plain temporal cross-dissolve (no warp) **double-exposes** the moving object — you see two ghosts fading between positions, not one object *moving*. The warp is what turns a dissolve into motion. [`fig-interp-as-morph`, ghost vs slide]

▸12.6.3 Flow-based interpolation: warp both frames to the midpoint and blend⧉

• **the recipe** (intuition first): assume a pixel moves along a roughly **straight, constant-velocity** path between $A$ and $B$. To make the frame at fraction $t\in(0,1)$: estimate flow both ways, **pull** each source frame to time $t$ along (a $t$-scaled slice of) its flow, then **blend** the two warped images weighted by temporal closeness. [`fig-flow-warp-blend`]

$$I_t(\mathbf{p}) = (1-t)\,I_0\!\big(\mathbf{p} - t\,\mathbf{F}_{0\to 1}\big) + t\,I_1\!\big(\mathbf{p} + (1-t)\,\mathbf{F}_{1\to 0}\big).$$

#### Forward vs backward warping

• **the resampling detail that matters**: **forward warping (splatting)** pushes each source pixel to its destination — but destinations collide and leave **holes** (no source pixel lands there). **Backward warping** asks, for each *output* pixel, where it came from, and samples there — giving a **full, hole-free grid** (cross-ref [[Warping and resampling]] on backward resampling). Hence the recipe formula above is written as a **backward** sample. The snag: backward warping needs the flow *at time $t$*, which must itself be approximated (e.g. by splatting the flow, or a learned flow-at-$t$). [`fig-forward-vs-backward-warp`]

#### Occlusions and holes: the real difficulty

• **the hard case**: pixels that are **disoccluded** (background uncovered by a moving foreground) exist in **only one** of the two frames; a pixel **occluded** in $B$ should be taken from $A$, and vice versa. Naive blending of both ghosts the boundary. The fix is an **occlusion/visibility mask** $V_0,V_1$ that down-weights the frame where a pixel isn't visible:

$$I_t = \frac{(1-t)\,V_0\,(\text{warp of }I_0) + t\,V_1\,(\text{warp of }I_1)}{(1-t)\,V_0 + t\,V_1}.$$

Get the masks from a **forward–backward flow consistency check** — where $\mathbf{F}_{0\to1}$ and $\mathbf{F}_{1\to0}$ disagree, a pixel is likely occluded; trust the other frame there. Genuine holes (visible in neither, or torn) need **inpainting** (forward-ref). [`fig-occlusion-disocclusion`]

• **failure modes to name**: large displacements break the small-motion flow assumption (objects "teleport," flow finds the wrong match); thin/fast structures (a swung bat, spokes) tear; non-linear motion violates the straight-line assumption. These are exactly what learned synthesis attacks.

▸12.6.4 Learned synthesis: Super SloMo and FILM⧉

• **the learned-operator move** (💡 L8): keep the same skeleton — estimate motion, warp, blend — but **train it end-to-end** on real video so the network learns better flow, better occlusion handling, and a learned blend/refinement, rather than hand-tuning each. (→ see Big lesson **L8**: a learned operator swaps a hand-designed prior for one learned from data.)

• **Super SloMo (Jiang et al. 2018)**: a two-network design — first estimate **bidirectional flow** between $A$ and $B$, then a second network **refines the intermediate flow at each time $t$ and predicts visibility maps** for occlusion-aware warping+blend. Crucially it can synthesize **any number** of intermediate frames (variable-length slow-mo) from one flow estimate, and it's **trained end-to-end** with the warped-and-blended output supervised against held-out real frames.

• **FILM (Reda et al. 2022)** — the **large-motion** specialist: ordinary interpolators fail when things move *far* between frames (near-duplicate photos, action shots). FILM uses a **scale-agnostic feature pyramid** with **shared weights across scales**, so the *same* matcher operates from coarse (big motion) to fine (detail) — large motion at a coarse scale looks like small motion. It's a **single unified network** (no separate flow net), trained with a **perceptual/Gram-matrix (style) loss** to keep synthesized detail sharp rather than blurry, and is notably good at interpolating between **far-apart still photos** ("near-duplicate" photo animation). [`fig-film-largemotion`]

• **why learned wins where flow loses**: the network learns a **prior** over how scenes move and how disoccluded regions should look, so it can **hallucinate plausible content** in holes and across large gaps where explicit flow has no correspondence to copy — the L8/L10 pairing again (learned prior fills what measurement didn't supply).

• **the models live elsewhere**: architectures, training, and losses are taught as *models* in [[Machine learning]] / [[Generative AI and diffusion]]; here we use them as the **interpolation operator** and stress *what they add over hand-built flow* — robustness to large motion and clean occlusion handling. (Same "priors here, models there" split as Super-resolution.) Forward-pointer: diffusion-based video models push interpolation/in-betweening further still.

▸12.7 Video editing⧉

`fig-nle-timeline` · the non-linear timeline — parallel video/audio tracks, clips with trim handles, a playhead, a labelled hard cut and a cross-dissolve transition; the random-access reference-list model that replaced sequential tape splicing 🟨

⬜ figure not yet created

`fig-reduce-over-time` (one stacked video → four outputs side by side: **mean** (motion-blur/long-exposure), **median** (people vanish), **max** (bright streaks / star trails), **min** (darkest-pixel) — the per-pixel reduce family) fig-reduce-over-time

⬜ figure not yet created

`fig-median-deghost` (a busy plaza, N frames → median → empty plaza: transient people removed) fig-median-deghost

⬜ figure not yet created

`fig-everyone-smiling` (a burst of a group portrait → per-region best-frame selection → one frame where everyone's eyes open / smiling, after photomontage) fig-everyone-smiling

`fig-hyperlapse` · hyperlapse vs naive fast-forward — a shaky first-person walk, a jagged frame-dropping $10\times$ speed-up (nauseating), and a hyperlapse that reconstructs the 3-D trajectory, fits a smooth virtual path, and renders along it (stable); speed-up and stabilisation together 🟨

`fig-transcript-edit` · transcript-based editing — a word-timestamped transcript with a filler "um" struck through, and the timeline below with the aligned micro-segment removed and clips closed up; speech-to-text as a 1-D handle on video 🟨

This closing chapter spends the part's machinery. Editing is where alignment, flow, warping, and robust per-pixel statistics stop being algorithms and become **tools an editor reaches for**. It's deliberately lighter and demo-driven — and honest where a specific product or paper is uncertain (marked *queue*). The through-line: almost every "video effect" is **align the frames, then reduce or select across time**, or **re-time the timeline**.

equations

per-pixel temporal reduce $O(\mathbf p) = \operatorname*{reduce}_{t} I_t(\mathbf p)$ for $\operatorname{reduce}\in\{\text{mean},\ \text{median},\ \max,\ \min\}$ (after alignment)

mean = long-exposure $O = \tfrac1N\sum_t I_t$

median = robust background $O(\mathbf p)=\operatorname{median}_t I_t(\mathbf p)$ (transients are outliers → rejected)

selection composite $O(\mathbf p) = I_{t^\star(\mathbf p)}(\mathbf p)$ with $t^\star(\mathbf p)=\arg\max_t s(I_t,\mathbf p)$ (pick the frame maximizing a per-region score $s$ — e.g. "eyes open / smiling")

▸12.7.1 Non-linear editing: the timeline metaphor⧉

• **what "non-linear" means**: on tape, you edited **in order** by physically splicing — to change shot 3 you re-recorded everything after it. Digital **non-linear editing (NLE)** stores clips as **random-access** files, so you assemble them on a **timeline** in any order, re-arrange freely, and the edit is just a **list of references + in/out points** (non-destructive — the source media is never altered). [`fig-nle-timeline`]

• **the timeline**: parallel **video and audio tracks**, **clips** with trim handles, a **playhead**; edits are **cuts** (hard joins) and **transitions** (cross-dissolve, wipe, fade) — a transition is literally a temporal **cross-dissolve / morph** between two clips (cross-ref [[Morphing]] and the interpolation cross-dissolve).

• **why it matters here**: the timeline is the substrate the rest of the chapter manipulates — summarization *shortens* it, fun filters *collapse* a stretch of it to one frame, transcript editing *re-orders* it from text, and slow-mo/speed-ramps (cross-ref [[Frame interpolation and slow-motion synthesis]]) *re-time* it.

• **honest scope note**: the craft of *editing* (pacing, continuity, the cut on action — Murch's *In the Blink of an Eye*) is a whole art; we touch only the computational hooks. *Queue* a sidebar on editing grammar.

▸12.7.2 Summarization: keyframes, fast-forward, and highlights⧉

• **the goal**: compress *watching time*, not file size — turn an hour of footage into something **glanceable** or **skimmable**.

• **representative / keyframes**: pick a small set of frames that **cover** the content — e.g. one per **shot** (detect shot boundaries by large inter-frame change), or cluster frames and take a medoid per cluster. Output: a contact-sheet/storyboard of the video. (Ties to scene-change detection; *queue* survey ref Truong & Venkatesh 2007.)

• **adaptive fast-forward / hyperlapse**: speed up long first-person footage. Naive $10\times$ frame-dropping is **unwatchably shaky** (every step jolts). **Hyperlapse** (Kopf et al. 2014) reconstructs the camera path in 3-D, **chooses a smooth virtual path** through it, and renders along that path — *stabilization fused with speed-up* (cross-ref [[Video stabilization and rolling-shutter correction]] for the path-smoothing; cross-ref [[Optical flow]] for the motion). Speed can be **content-adaptive** — slow on interesting/eventful stretches, fast on dull ones. [`fig-hyperlapse`]

• **highlight / event detection**: surface the *interesting* parts — score frames by **saliency / motion / faces / audio peaks**, or a **learned interestingness** model (forward-ref [[Machine learning]]), then keep the top segments (sports highlights, GoPro "best moments"). *Queue* specific systems.

▸12.7.3 Fun temporal filters: reduce-over-time⧉

• **the unifying primitive**: after the frames are **aligned** (stabilized — cross-ref, this is the prerequisite, or the result ghosts), apply a **per-pixel reduction across time**: $O(\mathbf p) = \operatorname*{reduce}_t I_t(\mathbf p)$. The choice of reducer is the whole effect. [`fig-reduce-over-time`]

• **mean over time** = a **long exposure from video**: $O = \tfrac1N\sum_t I_t$. Moving water silks out, traffic becomes light ribbons, people blur away — and it's the **denoise-by-averaging** filter ($1/\sqrt N$) from MULTIPLE EXPOSURE seen as an *effect* instead of a fix. (Same align-then-average pattern.)

• **median over time** = **robust background / people-removal**: $O(\mathbf p)=\operatorname{median}_t I_t(\mathbf p)$. Anything **transient** (a person walking through, a passing car) appears at any given pixel only briefly → it's an **outlier** the median **rejects**, leaving the static scene. The classic "**empty the busy plaza**" / tourist-removal trick — and exactly why median is the **robust** statistic (cross-ref BASIC denoising). [`fig-median-deghost`]

• **max over time** = **bright-streak accumulation**: $O(\mathbf p)=\max_t I_t(\mathbf p)$. Keep the brightest value each pixel ever saw → **star trails**, light-painting, lightning, firework trails, a runner's headlamp streak. (The part's "**max over time**" fun filter.) **Min over time** is the dual (darkest-pixel) — useful to drop specular flashes / find the persistent dark background.

• **selection composite** (not a reduce but a **pick**): choose, per region, the **best** frame from a burst — $O(\mathbf p)=I_{t^\star(\mathbf p)}(\mathbf p)$ with $t^\star(\mathbf p)=\arg\max_t s(I_t,\mathbf p)$ for a score $s$. This is the **"everyone smiling / no blinks"** group-photo trick: for each face, splice in the frame where that person's **eyes are open and they're smiling**, blended seamlessly. It's **interactive digital photomontage** (Agarwala et al. 2004) — select-best-pixel over a stack — applied to a short burst; the seamless merge uses graph-cut + Poisson blending (cross-ref EDGES MATTER). [`fig-everyone-smiling`]

• **the prerequisite, restated**: all of these assume the frames are **registered**; on handheld video, **stabilize/align first** (cross-ref [[Video stabilization and rolling-shutter correction]]) or the static scene smears too. Same align-then-combine discipline as HDR/panorama.

▸12.7.4 Transcript-based editing⧉

• **the idea**: edit a **talking-head** video by editing its **transcript**. Run **speech-to-text (ASR)** with **word-level timestamps** (forward-ref [[Machine learning]]); the text aligns to the audio, the audio to the video. **Delete a sentence in the text → the matching clip is removed from the timeline**; reorder paragraphs → reorder shots. [`fig-transcript-edit`]

• **the killer feature**: **filler-word removal** — find and strike every "um," "uh," "you know," and the corresponding micro-segments vanish, tightening the cut automatically. Also: search-and-replace navigation ("jump to where they said X"), and **overdub**.

• **the research root**: Berthouzoz, Li & Agrawala 2012, *Tools for Placing Cuts and Transitions in Interview Video*, automated finding clean cut points (at sentence/pause boundaries, hidden cuts) for interview footage — the academic ancestor of **Descript**-style transcript editing. (*Queue* exact citation/products.)

• **why it works**: speech is a **clean alignment signal** for dialogue-driven video; the transcript is a *one-dimensional handle* on a two-dimensional medium. Limits: it's for **spoken-word** content (useless for action/music), and **jump cuts** between removed segments may need a **transition** or a flow-based **seam-hiding** frame interpolation (cross-ref [[Frame interpolation and slow-motion synthesis]]) to look smooth.

▸12.7.5 Coda: storyboards, interviews, and where this part lands⧉

• **storyboards** — the **timeline as a sequence of stills**, looking two directions:

• *forward (pre-visualization)*: a **storyboard** plans a shoot before any footage exists — a panel per intended shot (framing, camera move, action), the director's equivalent of an outline. **Animatics** add rough timing/motion; modern **pre-viz** does it in 3-D. Increasingly **text-to-image / text-to-video** models draft storyboard panels from a script (forward-ref [[Generative AI and diffusion]]) — generation standing in for the not-yet-captured frames.

• *backward (auto-storyboard / recap)*: turn an *existing* video into a storyboard automatically — the **keyframe-summarization** output above (one representative frame per shot, laid out as a contact sheet) used to **skim, recap, or index** footage; comic-style layouts (Schödl/Agarwala-line "video summarization as a comic"; *queue ref*) size and arrange the keyframes by importance.

• the link: a storyboard is the **summarization primitive in both tenses** — keyframes chosen to *cover* the story, whether the story is planned (forward) or already shot (backward). [`fig-storyboard`]

• **interviews / multi-cam** (brief): the transcript-edit + multi-track timeline shine for **interview** assembly (sync cameras by audio, cut between angles on the transcript). *Queue* specific tooling.

• **the part's arc, closed**: we began with motion as *physics to model* (flow, warping, magnification, compression), turned it into *correction* (stabilization, rolling shutter) and *synthesis* (interpolation, slow-mo), and end here with *editing* — motion as raw material for storytelling. Every effect was one of two moves: **register-then-reduce/select across time**, or **re-time the timeline** — the same "align, then combine, then decide" spine that runs through the whole book.