Computational Photography, an AI-powered Slopendium — 12 Video
Part 12 VIDEO
**Roadmap.** **[[Motion blur and temporal sampling]]** sets up the time axis (a frame is an integral; time is sampled; the Lagrangian-vs-Eulerian framing). The applications follow: **[[Video compression and motion compensation]]** (correspondence on a budget), **[[Video magnification]]** (the Eulerian reveal), **[[Video stabilization and rolling-shutter correction]]** (estimate → smooth → re-render), **[[Frame interpolation and slow-motion synthesis]]** (transport to the in-between), and a closing **[[Video editing]]** coda (timelines, summarization, reduce-over-time filters, transcript-based editing).
equations
motion blur $B(\mathbf x)=\tfrac1\tau\int_0^\tau I(\mathbf x-\mathbf v t)\,dt$
Eulerian magnification $I'(\mathbf x,t)=I(\mathbf x,t)+\alpha\,\mathcal B\{I(\mathbf x,t)\}$
12.1 Motion blur and temporal sampling
fig-motion-blur-integral
fig-motion-blur-integral · a frame is an integral over the exposure — a bright point traversing the frame while the shutter is open records a streak of length $\lVert\mathbf v\rVert\tau$; motion blur is a 1-D box convolution along $\mathbf v$ 🟨
⬜ figure not yet created
`fig-blur-spatial-vs-temporal` (side-by-side: a spatial Gaussian PSF blurring a static edge vs a temporal box PSF blurring a *moving* edge — same convolution, different axis) fig-blur-spatial-vs-temporal
fig-shutter-angle
fig-shutter-angle · shutter angle sets the blur — a rotating-disc shutter at $0°/180°/360°$ admitting a smaller/larger fraction of the frame interval $T$; $180°$ ($\tau=T/2$) is the cinematic film-look, small angles strobe 🟨
fig-temporal-aliasing-wagonwheel
fig-temporal-aliasing-wagonwheel · the wagon-wheel effect: a spoked wheel whose spoke-passing rate exceeds temporal Nyquist ($f_s/2$) appears to spin backward, and appears to stall when that rate matches the frame rate $f_s$; the temporal twin of moiré 🟨
⬜ figure not yet created
`fig-temporal-nyquist` (a sinusoidal motion signal sampled at frame rate $f_s$: $f_s>2f_{\text{motion}}$ reconstructs the true motion fig-temporal-nyquist
fig-blur-as-temporal-prefilter
fig-blur-as-temporal-prefilter · one knob, two outcomes: the same fast motion captured with a long exposure (blurred but not aliased; the exposure window low-passes before sampling) vs a short exposure / small shutter angle (sharp but strobing); the exposure window is the temporal anti-alias filter, L16 in time 🟨
fig-lagrangian-vs-eulerian
fig-lagrangian-vs-eulerian · the organizing diagram — Lagrangian (arrows following particles over time → flow and tracking, output trajectories) vs Eulerian (a fixed grid, each pixel plotting intensity-over-time → magnification, never asking where anything went) 🟨
The two great themes of spatial imaging are **convolution** (a measurement integrates over a region, described by the PSF) and **sampling** (a discrete grid can only represent frequencies below Nyquist; anything higher aliases). This section's single claim is that **both recur, identically, on the *time* axis**, and together they organise the entire part. A frame is an *integral over the exposure* → **motion blur** is convolution *in time*. A video is a *sequence of temporal samples* → motion with temporal frequencies above half the frame rate **aliases** (the wagon wheel) *in time*. And the question "do I think of motion as *particles I follow* or *a time-series at each fixed pixel*?" is the **Lagrangian vs Eulerian** distinction that separates flow/tracking from video magnification.
💡 **Big lesson (recurrence of L5 / L16 — Nyquist, and *prefilter before you downsample*, on the time axis):** the spatial sampling laws apply unchanged in **time**. **L5:** a frame rate $f_s$ can only faithfully capture temporal frequencies below $f_s/2$; faster motion **folds down** to a false low frequency — the **wagon-wheel effect** is temporal moiré. **L16:** the cure for aliasing is to **prefilter before sampling** — and in time the prefilter is *built into the camera*: integrating over the exposure window (which produces motion blur) is exactly the temporal **anti-alias filter**. So motion blur and temporal aliasing are not two problems but **the same tradeoff** — a longer exposure removes strobing by *blurring*, a shorter one gives sharp frames that *strobe*. (→ see Big lessons **L5** & **L16**, first placed in [[Linearity, aliasing and deblurring]] / [[Basic image processing and ISP]]; the **wagon-wheel** is L16's named temporal instance.)
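As a worked check on how strong this built-in prefilter is, the exposure window can be treated as a temporal box filter of duration $\tau$ acting on the per-pixel signal, and its frequency response evaluated at temporal Nyquist. This is a sketch under the idealised assumption of a perfectly rectangular shutter; $H(f)$ is notation introduced here, and $\theta=360^\circ\cdot\tau/T$ is the shutter angle with frame interval $T=1/f_s$.

```latex
H(f) = \frac{1}{\tau}\int_0^{\tau} e^{-i 2\pi f t}\,dt
     = e^{-i\pi f\tau}\,\frac{\sin(\pi f\tau)}{\pi f\tau},
\qquad
|H(f)| = \left|\frac{\sin(\pi f\tau)}{\pi f\tau}\right|.

% At temporal Nyquist f = f_s/2 = 1/(2T), with shutter angle \theta = 360^\circ\,\tau/T:
f\tau = \frac{\theta}{720^\circ}
\;\Rightarrow\;
|H|_{180^\circ} = \frac{\sin(\pi/4)}{\pi/4} \approx 0.90,
\qquad
|H|_{360^\circ} = \frac{\sin(\pi/2)}{\pi/2} = \frac{2}{\pi} \approx 0.64.
```

Under these assumptions the first null of $|H|$ sits at $f=1/\tau$, so the box is a gentle low-pass rather than a brick wall: a longer exposure softens strobing progressively, which is consistent with the tradeoff described above.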
equations
**motion blur** $B(\mathbf{x})=\dfrac{1}{\tau}\displaystyle\int_0^\tau I(\mathbf{x}-\mathbf{v}\,t)\,dt$ (a directional integration along the motion vector $\mathbf v$ over exposure $\tau$) $= I * k_{\mathbf v}$ with **path kernel** $k_{\mathbf v}$ a 1-D box of length $\|\mathbf v\|\tau$ oriented along $\mathbf v$ (constant-velocity case)
**shutter angle** $\theta=360^\circ\cdot\tau/T$ (exposure $\tau$ as a fraction of frame interval $T=1/f_s$)
**temporal Nyquist** $f_s > 2\,f_{\text{motion}}$ (sample faster than twice the motion's highest temporal frequency, else alias)
**aliased apparent frequency** $f_{\text{app}}=|f_{\text{motion}} - n f_s|$ with $n=\operatorname{round}(f_{\text{motion}}/f_s)$ (the folded-down false frequency, always in $[0,\,f_s/2]$; this is the wagon wheel)
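The four formulas above can be sketched numerically. This is a minimal illustration, not a camera model: the function names are hypothetical, the blur is a 1-D scanline with constant velocity along the row, and `blur_len` stands for $\lVert\mathbf v\rVert\tau$ rounded to whole pixels.

```python
import numpy as np

def shutter_angle(tau, fps):
    """Shutter angle in degrees: 360 * tau / T with T = 1 / fps."""
    return 360.0 * tau * fps

def apparent_frequency(f_motion, fps):
    """Folded-down frequency |f_motion - n * fps|, n = round(f_motion / fps)."""
    n = np.round(f_motion / fps)
    return abs(f_motion - n * fps)

def motion_blur_1d(scanline, blur_len):
    """Convolve a scanline with a normalized box of blur_len pixels.

    Assumes constant velocity along the row, so blur_len ~ |v| * tau.
    """
    kernel = np.ones(blur_len) / blur_len
    return np.convolve(scanline, kernel, mode="same")

# A bright point becomes a streak whose total energy is unchanged.
point = np.zeros(21)
point[10] = 1.0
streak = motion_blur_1d(point, blur_len=5)
```

For example, at 24 fps a spoke-passing rate of 23 Hz gives `apparent_frequency(23, 24)` equal to 1 Hz, and an exposure of 1/48 s gives a 180° shutter. The function returns only the magnitude of the folded frequency, not the apparent direction of rotation.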
12.2 Time-lapse photography
12.3 Video compression and motion compensation
fig-video-temporal-redundancy
fig-video-temporal-redundancy · temporal redundancy made visible — three near-identical consecutive frames whose per-pixel difference $I_t-I_{t-1}$ is near-black except a thin fringe at moving edges and a disoccluded patch; the only new information to code 🟨
fig-mc-prediction-loop
fig-mc-prediction-loop · the motion-compensated prediction loop — encoder (block-match → motion vectors, residual $r=I_t-\hat I_t$ → DCT → quantize → entropy-code, plus a reconstruction branch storing the lossy reference) and the mirror-image decoder; $\min D+\lambda R$ in the mode-decision box 🟨
⬜ figure not yet created
`fig-block-matching-search` (one macroblock in frame $t$, its search window in the reference frame, the best-match offset = the motion vector) fig-block-matching-search
⬜ figure not yet created
`fig-residual-vs-naive` (left: code the whole block fig-residual-vs-naive
fig-ipb-gop
fig-ipb-gop · a GOP timeline — an I-frame opening the group, P-frames predicting forward, B-frames predicting from past and future references; arrows are prediction dependencies, so coding order $\neq$ display order 🟨
fig-mc-as-flow
fig-mc-as-flow · same correspondence, two budgets — a dense smooth optical-flow field vs the codec's one-constant-vector-per-block field; the same "where did this come from?" coarsened to what is cheap to estimate and transmit 🟨
The still-image part of this book taught one compression idea over and over: **don't store what the viewer won't miss, and don't store what you can predict.** JPEG ([[File formats and compression]]) subsamples chroma and coarsely quantizes the high-frequency DCT coefficients because perception won't notice. Video adds a second, far larger redundancy that stills can't touch: **the next frame looks almost exactly like this one.** At 30–60 fps the camera and the world barely move between frames; most of the picture is *already on screen*. Coding each frame independently (so-called **Motion-JPEG**, literally JPEG-per-frame) ignores this redundancy entirely and spends bits re-sending content the decoder already has. Real codecs **predict each frame from its neighbours and code only the prediction error.** That is the whole subject.
💡 **Big lesson (correspondence on a budget):** **motion compensation is optical flow you can afford.** A codec needs, for every block, *where did this come from in a frame we already have?* — exactly the correspondence question of [[Optical flow]]. But it does not need a physically correct, dense, sub-pixel flow field; it needs a **coarse, block-constant** field that is **cheap to estimate** and, crucially, **cheap to transmit** — and it chooses that field to **minimise total bits**, not endpoint error. So decades before learned optical flow, video codecs were already estimating dense-ish correspondence at massive scale — just optimised for rate, not accuracy. The recurring move: *reuse what the decoder already has, and code only the difference.* (Recurs as the framing for [[Video magnification]], which keeps the difference instead of discarding it; and is the rate-aware cousin of the correspondence story in [[Optical flow]].)
equations
motion-compensated **residual** $r(x,y)=I_t(x,y)-\hat I_t(x,y)$ where $\hat I_t(x,y)=I_{\text{ref}}\big(x-u,\,y-v\big)$ is the reference frame shifted by the block's motion vector $(u,v)$
**block-matching cost** $(u^\star,v^\star)=\arg\min_{(u,v)}\sum_{(x,y)\in B}\big|I_t(x,y)-I_{\text{ref}}(x-u,y-v)\big|$ (SAD over a block $B$, sometimes SSD; the minimiser $(u^\star,v^\star)$ is the block's motion vector)
**rate–distortion** objective $\min\ D+\lambda R$ — choose motion vectors/modes to minimise distortion $D$ *plus* $\lambda$ times the bits $R$ (the codec optimises bits, not physical accuracy)
coded frame $\approx$ entropy-code$\big(\text{quantize}(\text{DCT}(r))\big)$ + motion vectors
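The block-matching cost and residual above can be sketched as an exhaustive SAD search. This is a minimal illustration, not how any particular codec searches (real encoders use faster search patterns, sub-pixel refinement, and the rate term $\lambda R$); `block_match`, `residual`, `size`, and `radius` are hypothetical names, and the sign convention follows $I_{\text{ref}}(x-u,\,y-v)$.

```python
import numpy as np

def block_match(cur, ref, top, left, size=8, radius=4):
    """Exhaustive SAD search for the block of `cur` at (top, left).

    Returns ((u, v), sad) such that cur[y, x] ~ ref[y - v, x - u] on the block.
    """
    block = cur[top:top + size, left:left + size].astype(np.int64)
    best_sad, best_uv = None, (0, 0)
    for v in range(-radius, radius + 1):
        for u in range(-radius, radius + 1):
            y0, x0 = top - v, left - u
            if y0 < 0 or x0 < 0 or y0 + size > ref.shape[0] or x0 + size > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[y0:y0 + size, x0:x0 + size].astype(np.int64)
            sad = np.abs(block - cand).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_uv = sad, (u, v)
    return best_uv, best_sad

def residual(cur, ref, top, left, uv, size=8):
    """Motion-compensated residual r = I_t - I_ref shifted by (u, v)."""
    u, v = uv
    block = cur[top:top + size, left:left + size].astype(np.int64)
    pred = ref[top - v:top - v + size, left - u:left - u + size].astype(np.int64)
    return block - pred
```

When the current frame is a pure translation of the reference, the search recovers the shift and the residual is all zeros, which is the best case for the entropy coder. Minimising SAD alone ignores the cost of transmitting the vector; a rate-aware encoder would add that term before choosing.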
12.4 Video magnification
⬜ figure not yet created
`fig-eulerian-vs-lagrangian-mag` (two routes to amplify small motion: **Lagrangian** — estimate flow $\delta(t)$, scale it, re-render fig-eulerian-vs-lagrangian-mag
fig-pixel-timeseries-bandpass
fig-pixel-timeseries-bandpass · one fixed pixel, signal to output — its value over time, its temporal spectrum (a small in-band peak among DC and noise), a band-pass keeping that band, and the band scaled by $\alpha$ and added back; identical at every pixel 🟨
⬜ figure not yet created
`fig-pulse-color-mag` (a face video fig-pulse-color-mag
fig-firstorder-motion-mag
fig-firstorder-motion-mag · intensity amplification is motion amplification to first order — a 1-D edge displaced by $\delta(t)$ giving temporal change $\delta(t)I_x$, amplified to a $(1+\alpha)\delta(t)$ shift; right panel where a large $\delta$ or sharp edge breaks it (haloing, clipping) 🟨
⬜ figure not yet created
`fig-phase-vs-linear-mag` (same clip: linear intensity magnification (haloing, noise blow-up) vs phase-based magnification in a steerable pyramid (clean, larger $\alpha$)) fig-phase-vs-linear-mag
⬜ figure not yet created
`fig-visual-microphone` (sound vibrates a plant/chip-bag
You cannot see a heartbeat in a face, or a skyscraper sway, or a wall flex under a passing truck — these changes are real but **below the threshold of perception**, buried in pixel values that look constant. Yet the information is *there* in an ordinary video: the green channel of a cheek really does brighten and dim with each pulse; an edge really does move a fraction of a pixel as a structure vibrates. **Video magnification** is the family of methods that **pulls out these tiny temporal variations and scales them up** until they're plainly visible — turning a normal camera into an instrument for the invisible.
💡 **Big lesson (Eulerian vs Lagrangian — the cheaper view often wins):** there are two ways to amplify small motion, the same dichotomy from [[Motion blur and temporal sampling]]. The **Lagrangian** way *follows points*: estimate the optical-flow displacement $\delta(t)$ of each feature, multiply it, and re-render the frame with the exaggerated motion. It's intuitive but **fragile** — it inherits every failure of [[Optical flow]] (occlusion, aperture problem, sub-pixel error), and tracking *tiny* motions accurately is exactly where flow is weakest. The **Eulerian** way *stays put*: at each **fixed pixel** treat the value over time as a signal, **band-pass** it, amplify, add back — **no motion estimation at all**. For small variations this is dramatically simpler and more robust, and it amplifies **color** changes (a pulse) just as naturally as motion. The recurring moral: *picking the right frame of reference (fixed grid vs moving point) can turn a hard estimation problem into a trivial filtering one.*
equations
**Eulerian linear amplification** $I'(x,t)=I(x,t)+\alpha\,\mathcal B\{I(x,t)\}$, where $\mathcal B\{\cdot\}$ is a temporal **band-pass** of the per-pixel signal and $\alpha$ the magnification factor
**first-order motion link**: a feature translating as $I(x,t)=f(x+\delta(t))\approx f(x)+\delta(t)\,f'(x)$ band-passes to $\mathcal B\{I\}\approx\delta(t)\,I_x$ (with $I_x=f'(x)$, assuming $\delta(t)$ is small and lies inside the pass-band), and adding $\alpha\,\mathcal B\{I\}$ gives $I'(x,t)\approx f\big(x+(1+\alpha)\delta(t)\big)$ (amplifying the band-passed intensity ≈ amplifying the displacement by $(1+\alpha)$)
**phase-based**: in a complex steerable sub-band $S(x,t)=A(x,t)\,e^{i\phi(x,t)}$, temporally band-pass the **local phase** $\phi$ and amplify it ($\phi\to\phi+\alpha\,\mathcal B\{\phi\}$), which shifts the sub-band signal without amplifying amplitude noise
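The linear Eulerian rule can be sketched in a few lines. This is a minimal illustration that assumes an ideal (FFT brick-wall) temporal band-pass and no spatial pyramid, which is a simplification of practical pipelines; `temporal_bandpass` and `eulerian_magnify` are hypothetical names, and the video is taken to be a float array of shape `(T, H, W)`.

```python
import numpy as np

def temporal_bandpass(video, fps, f_lo, f_hi):
    """Ideal temporal band-pass applied independently at every pixel.

    video: float array of shape (T, H, W); keeps f_lo <= f <= f_hi (Hz).
    """
    spectrum = np.fft.rfft(video, axis=0)
    freqs = np.fft.rfftfreq(video.shape[0], d=1.0 / fps)
    keep = (freqs >= f_lo) & (freqs <= f_hi)
    spectrum[~keep] = 0.0  # zero every temporal bin outside the band
    return np.fft.irfft(spectrum, n=video.shape[0], axis=0)

def eulerian_magnify(video, fps, f_lo, f_hi, alpha):
    """I' = I + alpha * bandpass(I); no motion estimation anywhere."""
    return video + alpha * temporal_bandpass(video, fps, f_lo, f_hi)

# Synthetic "pulse": a 1 Hz ripple of amplitude 0.01 on a constant pixel value.
fps, n_frames = 30, 120
t = np.arange(n_frames) / fps
pixel = 0.5 + 0.01 * np.sin(2 * np.pi * 1.0 * t)
video = np.broadcast_to(pixel[:, None, None], (n_frames, 2, 2)).copy()
magnified = eulerian_magnify(video, fps, f_lo=0.8, f_hi=1.2, alpha=10.0)
```

In this synthetic case the in-band ripple grows by a factor of $(1+\alpha)$ while the constant level is untouched. Any noise that falls inside the pass-band is amplified by the same factor, which is one motivation for the phase-based variant.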
12.5 Video stabilization and rolling-shutter correction
fig-stab-pipeline
fig-stab-pipeline · the stabilization pipeline — estimate (fit inter-frame $F_t$ from KLT/RANSAC, chain into the real path $C_t$) → smooth ($C_t\to P_t$) → re-render (warp by $W_t=P_t C_t^{-1}$, crop to a valid window); the path is the correspondence, the warp the transport 🟨
fig-stab-trajectory-smoothing
fig-stab-trajectory-smoothing · smoothing the camera-path signal — a jittery raw trajectory $C_t$, a Gaussian low-pass that removes tremor but lags/overshoots at pans, and an $L_1$-optimal path snapping to static/constant-velocity/eased segments with sharp transitions 🟨
fig-stab-crop-window
fig-stab-crop-window · the stabilization↔crop tradeoff on one frame — the captured rectangle warped by $W_t$ to a tilted quad with empty margins, the crop window the largest rectangle valid for every frame (upscaled back); more shake removed forces a smaller crop 🟨
⬜ figure not yet created
`fig-l1-cinematic-paths` (the three allowed motion primitives — static hold, constant pan, smooth ease-in/out — that L1 stitches together) fig-l1-cinematic-paths
fig-rolling-shutter-skew
fig-rolling-shutter-skew · rolling-shutter distortion — a global shutter keeping a vertical pole upright under a fast pan vs a rolling shutter where per-row readout $t(r)=t_\text{frame}+r\,t_\text{row}$ shears the pole and smears fan blades (the jello effect) 🟨
⬜ figure not yet created
`fig-rolling-shutter-perrow` (the row-time diagram: each scanline exposed at $t_0 + r\,\Delta t$, each reading a different camera pose, then rectified to one virtual instant) fig-rolling-shutter-perrow
The previous chapters made motion an *enemy to estimate* (optical flow) or a *cue to track* (KLT). Here motion is the thing we want to **edit**: handheld footage carries an involuntary high-frequency camera path on top of the intended one. Every method in this chapter is one pipeline — **measure the real path, design a better path, warp the frames onto it** — and the only thing you give up is the **border pixels** that the better path no longer sees.
💡 **Big lesson (recurrence of L16 · prefilter before you downsample / temporal aliasing):** stabilization is *temporal* signal processing on the camera-path signal — and the same Nyquist intuition applies. We **low-pass the trajectory** to remove jitter, and rolling-shutter "jello" is precisely a **temporal aliasing** artifact (rows sampled at different times under fast motion, the video cousin of the wagon-wheel effect). (→ see Big lesson **L16**, first placed in BASIC → Resampling; here it recurs as *smooth the path in time, and beware time-varying sampling within a frame*.)
equations
cumulative path $C_t = F_t F_{t-1}\cdots F_1$ from inter-frame transforms $F_t$ (each a homography or affine fit to feature matches)
update/correction warp $W_t = P_t\, C_t^{-1}$ sending real pose $C_t$ to smoothed pose $P_t$
low-pass smoothing $P_t = \sum_k g_k\, C_{t-k}$ (Gaussian weights $g_k$)
**L1 objective** $\min_{P}\ \lVert D^1 P\rVert_1 + \lVert D^2 P\rVert_1 + \lVert D^3 P\rVert_1$ (penalize 1st/2nd/3rd path derivatives in $L_1$ → mostly-zero derivatives = static / constant-velocity / constant-acceleration segments) subject to the crop-window inclusion constraints
rolling-shutter row time $t(r) = t_{\text{frame}} + r\cdot t_{\text{row}}$ and per-row pose $R(t(r))$, rectify pixel $(r,c)$ by reprojecting through $R(t_{\text{ref}})\,R(t(r))^{-1}$
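The estimate → smooth → re-render chain can be sketched with the Gaussian low-pass variant. This is a minimal illustration under a strong assumption: the inter-frame transforms are pure translations, so averaging the $3\times3$ matrices entry by entry is meaningful (for general homographies one would smooth a parameterisation instead). All function names are hypothetical, and the $L_1$ path optimisation and rolling-shutter rectification are not shown.

```python
import numpy as np

def translation(tx, ty):
    """3x3 homogeneous translation (stand-in for a fitted inter-frame F_t)."""
    M = np.eye(3)
    M[0, 2], M[1, 2] = tx, ty
    return M

def cumulative_path(F):
    """C_t = F_t F_{t-1} ... F_1 for a sequence of 3x3 transforms."""
    C, path = np.eye(3), []
    for F_t in F:
        C = F_t @ C
        path.append(C)
    return np.stack(path)

def gaussian_smooth_path(C, sigma=5.0):
    """P_t = sum_k g_k C_{t-k}, entry-wise, with edge padding at both ends."""
    radius = int(3 * sigma)
    k = np.arange(-radius, radius + 1)
    g = np.exp(-0.5 * (k / sigma) ** 2)
    g /= g.sum()
    padded = np.pad(C, ((radius, radius), (0, 0), (0, 0)), mode="edge")
    return np.stack([np.tensordot(g, padded[t:t + len(g)], axes=1)
                     for t in range(len(C))])

def correction_warps(C, P):
    """W_t = P_t C_t^{-1}, batched over frames."""
    return P @ np.linalg.inv(C)

# Jittery pan: roughly 1 px/frame to the right, with random hand shake.
rng = np.random.default_rng(0)
F = [translation(1.0 + rng.normal(0, 0.5), rng.normal(0, 0.5)) for _ in range(200)]
C = cumulative_path(F)
P = gaussian_smooth_path(C, sigma=5.0)
W = correction_warps(C, P)
```

Each `W[t]` would then be applied to frame `t` before cropping to the window that stays valid for every frame; a larger `sigma` removes more shake and, as a consequence, forces a smaller crop.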
12.6 Frame interpolation and slow-motion synthesis
fig-interp-as-morph
fig-interp-as-morph · interpolation as a morph — a plain cross-dissolve of two frames yielding two ghosts vs a flow-warp where corresponded points slide halfway before blending so the subject moves rather than fades; the correspondence here is automatic optical flow 🟨
fig-flow-warp-blend
fig-flow-warp-blend · the flow-based interpolation pipeline: estimate bidirectional flow, backward-warp each frame to time $t$ (sampling offsets $-t\,\mathbf F_{0\to1}$ and $-(1-t)\,\mathbf F_{1\to0}$), convex-blend by $(1-t),t$; a highlighted disocclusion hole present in only one warped frame 🟨
⬜ figure not yet created
`fig-forward-vs-backward-warp` (splatting a pixel forward leaves holes/collisions fig-forward-vs-backward-warp
⬜ figure not yet created
`fig-occlusion-disocclusion` (a moving foreground: the background it *uncovers* exists in only one of the two frames → which frame to copy from) fig-occlusion-disocclusion
fig-slowmo-axis
fig-slowmo-axis · four ways to treat the time axis on a shared timeline — normal capture (sparse), high-speed (dense true samples), interpolation (sparse reals with synthesized in-betweens), and long-exposure blur (the integral); only interpolation adds resolution after capture and can be wrong 🟨
⬜ figure not yet created
`fig-film-largemotion` (a fast-moving subject where small-motion flow fails and FILM's scale-agnostic pyramid still tracks it) fig-film-largemotion
A normal camera shoots, say, 30 fps; to play a moment 8× slower and smooth you'd need ~240 distinct instants per second that were never recorded. You can either **capture** them (a high-speed camera — true, but expensive and light-hungry) or **synthesize** them. Frame interpolation synthesizes: given two real frames, invent the ones that belong **between** them. The whole problem reduces to a question you already met in [[Morphing]] — *what moved where?* — answered here by **optical flow** or by a **learned** model.
💡 **Big lesson (recurrence of L14 · capture the full set, decide later — and its limits):** high-speed photography is the honest way to own every instant (capture the full temporal set, pick the moment/shutter later). Interpolation is the **budget substitute**: when you *didn't* capture the full set, a **prior** invents the missing instants. So this chapter sits at the seam between **L14** (capture everything) and **L10** (the prior is not optional) — the in-between frames are partly **reconstructed** (where flow is reliable) and partly **hallucinated** (across occlusions). (→ see Big lessons **L14**, first in MULTIPLE EXPOSURE, and **L10**, first in Super-resolution.)
equations
linear motion assumption $x_t = (1-t)\,x_0 + t\,x_1$ along a flow vector
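backward-warp synthesis $I_t(\mathbf{p}) = (1-t)\,I_0(\mathbf{p} - t\,\mathbf{F}_{0\to 1}(\mathbf{p})) + t\,I_1(\mathbf{p} - (1-t)\,\mathbf{F}_{1\to 0}(\mathbf{p}))$ (warp each source to $t$, then convex-blend by temporal distance; since $\mathbf F_{1\to0}\approx-\mathbf F_{0\to1}$, the sample taken from $I_1$ lies ahead of $\mathbf p$ along the motion)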
occlusion-aware blend $I_t = \dfrac{(1-t)\,V_0\,W_0 + t\,V_1\,W_1}{(1-t)\,V_0 + t\,V_1}$ with $W_0,W_1$ the two frames backward-warped to time $t$ and $V_0,V_1$ their visibility masks (down-weight the frame where a pixel is occluded)
flow consistency check $\lVert \mathbf{F}_{0\to1}(\mathbf p) + \mathbf{F}_{1\to0}(\mathbf p + \mathbf F_{0\to1}(\mathbf p))\rVert \le \tau$ (forward-backward agreement → trust map)
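The warp-and-blend step and the forward-backward check can be sketched in one dimension. This is a minimal illustration that assumes the flows are already known and uses the linear-motion approximation of evaluating each flow at the target position; it samples frame 0 at $\mathbf p - t\,\mathbf F_{0\to1}$ and frame 1 at $\mathbf p - (1-t)\,\mathbf F_{1\to0}$. The function names and the tolerance `tol` are hypothetical, and occlusion masks are left out.

```python
import numpy as np

def sample_1d(signal, pos):
    """Linearly interpolate `signal` at fractional positions (clamped at the ends)."""
    return np.interp(pos, np.arange(len(signal)), signal)

def interpolate_frame(I0, I1, F01, F10, t):
    """Backward-warp both frames to time t in [0, 1], then blend by temporal distance."""
    p = np.arange(len(I0), dtype=float)
    W0 = sample_1d(I0, p - t * F01)          # frame 0 carried forward to time t
    W1 = sample_1d(I1, p - (1 - t) * F10)    # frame 1 carried back to time t
    return (1 - t) * W0 + t * W1

def fb_consistent(F01, F10, tol=0.5):
    """Trust map: F01(p) + F10(p + F01(p)) should be near zero where flow is reliable."""
    p = np.arange(len(F01), dtype=float)
    return np.abs(F01 + sample_1d(F10, p + F01)) <= tol

def bump(center, n=64, sigma=2.0):
    """Synthetic 1-D 'subject': a Gaussian blob at `center`."""
    return np.exp(-0.5 * ((np.arange(n) - center) / sigma) ** 2)

# The subject moves 8 px to the right between the two frames.
I0, I1 = bump(20), bump(28)
F01, F10 = np.full(64, 8.0), np.full(64, -8.0)
mid = interpolate_frame(I0, I1, F01, F10, t=0.5)
cross_dissolve = 0.5 * I0 + 0.5 * I1
```

Here `mid` contains a single blob halfway along the motion, whereas `cross_dissolve` keeps two half-strength ghosts at the original positions. Pixels that fail `fb_consistent` are the ones where a visibility-weighted blend would favour one source frame.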
12.7 Video editing
fig-nle-timeline
fig-nle-timeline · the non-linear timeline — parallel video/audio tracks, clips with trim handles, a playhead, a labelled hard cut and a cross-dissolve transition; the random-access reference-list model that replaced sequential tape splicing 🟨
⬜ figure not yet created
`fig-reduce-over-time` (one stacked video → four outputs side by side: **mean** (motion-blur/long-exposure), **median** (people vanish), **max** (bright streaks / star trails), **min** (darkest-pixel) — the per-pixel reduce family) fig-reduce-over-time
⬜ figure not yet created
`fig-median-deghost` (a busy plaza, N frames → median → empty plaza: transient people removed) fig-median-deghost
⬜ figure not yet created
`fig-everyone-smiling` (a burst of a group portrait → per-region best-frame selection → one frame where everyone's eyes open / smiling, after photomontage) fig-everyone-smiling
fig-hyperlapse
fig-hyperlapse · hyperlapse vs naive fast-forward — a shaky first-person walk, a jagged frame-dropping $10\times$ speed-up (nauseating), and a hyperlapse that reconstructs the 3-D trajectory, fits a smooth virtual path, and renders along it (stable); speed-up and stabilisation together 🟨
fig-transcript-edit
fig-transcript-edit · transcript-based editing — a word-timestamped transcript with a filler "um" struck through, and the timeline below with the aligned micro-segment removed and clips closed up; speech-to-text as a 1-D handle on video 🟨
This closing chapter spends the part's machinery. Editing is where alignment, flow, warping, and robust per-pixel statistics stop being algorithms and become **tools an editor reaches for**. It's deliberately lighter and demo-driven — and honest where a specific product or paper is uncertain (marked *queue*). The through-line: almost every "video effect" is **align the frames, then reduce or select across time**, or **re-time the timeline**.
equations
per-pixel temporal reduce $O(\mathbf p) = \operatorname*{reduce}_{t} I_t(\mathbf p)$ for $\operatorname{reduce}\in\{\text{mean},\ \text{median},\ \max,\ \min\}$ (after alignment)
mean = long-exposure $O = \tfrac1N\sum_t I_t$
median = robust background $O(\mathbf p)=\operatorname{median}_t I_t(\mathbf p)$ (transients are outliers → rejected)
selection composite $O(\mathbf p) = I_{t^\star(\mathbf p)}(\mathbf p)$ with $t^\star(\mathbf p)=\arg\max_t s(I_t,\mathbf p)$ (pick the frame maximizing a per-region score $s$ — e.g. "eyes open / smiling")
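The reduce-over-time family and the selection composite can be sketched directly on an aligned frame stack. This is a minimal illustration that assumes alignment has already been done and that the score is given per pixel; `REDUCERS`, `reduce_over_time`, and `select_best` are hypothetical names, and the seam-hiding photomontage step is not shown.

```python
import numpy as np

REDUCERS = {"mean": np.mean, "median": np.median, "max": np.max, "min": np.min}

def reduce_over_time(frames, how="median"):
    """Per-pixel reduction over the time axis of an aligned (T, H, W) stack."""
    return REDUCERS[how](frames, axis=0)

def select_best(frames, scores):
    """Selection composite: at each pixel copy the frame with the highest score.

    frames, scores: arrays of shape (T, H, W).
    """
    t_star = np.argmax(scores, axis=0)
    return np.take_along_axis(frames, t_star[None], axis=0)[0]

# Static background of 0.5 with one bright transient per frame, each at a new pixel.
n_frames, height, width = 7, 4, 4
frames = np.full((n_frames, height, width), 0.5)
for i in range(n_frames):
    frames[i, i // width, i % width] = 1.0

background = reduce_over_time(frames, "median")   # transients rejected as outliers
trails = reduce_over_time(frames, "max")          # every transient kept
long_exposure = reduce_over_time(frames, "mean")  # transients averaged into a faint smear
brightest = select_best(frames, scores=frames)    # score = intensity, for illustration
```

The median works here because each pixel is covered by a transient in fewer than half of the frames; a pixel occupied most of the time would keep the occupant instead of the background.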