💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

12.4 Video stabilization and rolling-shutter correction

The previous chapters made motion an enemy to estimate (optical flow) or a cue to track (KLT). Here motion is the thing we want to edit. Hold a phone and walk, and the footage carries an involuntary, high-frequency tremor riding on top of whatever motion you intended — the slow pan, the deliberate hold. A shaky video, in other words, is a good video plus an unwanted signal, and that signal lives on the camera path: the trajectory the camera swung through over the course of the clip. Stabilization is the act of subtracting it.

Every method in this chapter is one pipeline, and it is worth stating before any of the details: measure the real path, design a better path, warp the frames onto it. Measure the jittery trajectory the camera actually followed; design the smooth trajectory a tripod, dolly, or Steadicam would have produced instead; then re-render each frame by warping it from where the camera was to where the smooth path says it should be. The only thing you give up — and it is unavoidable — is the border pixels that the better path no longer sees, which must be cropped away or inpainted back. That tension, between how much shake you remove and how much frame you keep, is the one genuine tradeoff in the chapter, and we will name it precisely.

💡 Big lesson (L17, recurrence)

This chapter is the part spine in its most literal form. Correspondence first: estimate where each pixel came from — here, as a camera path, the smoothed trajectory $P_t$ that maps each output frame back to the captured one. Then transport: warp and resample the pixels along that map with the one shared engine of Warping and resampling. The estimation is the hard, ill-posed half (feature tracks corrupted by moving objects, error that drifts as you chain frames); the transport is plumbing you already own. Separating the two is exactly what lets the same warp engine serve rectification, panoramas, morphing, and now stabilization alike — what changes is only how you got the map. The rolling-shutter section at the end is the same spine applied within a frame: a per-row correspondence, then a per-row transport. (Part spine L17; registered in the part introduction; recurs in every chapter of this part.)

There is a second lesson underneath, recurring rather than new. Stabilization is temporal signal processing on the camera-path signal, and the same Nyquist intuition from resampling applies: we low-pass the trajectory to strip the jitter while keeping the intended motion, exactly as we low-pass an image before downsampling it. Rolling-shutter "jello," when we reach it, is the time-axis sampling problem within a single frame — rows captured at different instants under fast motion — a loose echo of the L16 sampling lesson rather than a strict instance of it.

12.4.1 What stabilization is: a camera-path signal to be smoothed

The framing is the whole chapter in miniature. The recorded video equals an intended camera motion — a slow pan, a hold — plus an unwanted high-frequency tremor, and that tremor is a signal living on the camera path. Stabilization removes the tremor without removing the intended motion, which is to say it is a low-pass filter on the trajectory. Two subtleties keep it from being a one-line exercise: the "samples" we are filtering are not scalars but 2-D image transforms (a homography per frame, or its parameters), and we cannot let the filtered frame swing so far that it crops away the whole picture.

That gives the three stages that form the spine of the chapter:

  1. Estimate the per-frame camera motion and accumulate it into a trajectory.
  2. Smooth that trajectory into the steady path a tripod or dolly would have followed.
  3. Re-render each frame by warping it from its real pose to the smoothed pose — then crop or inpaint the exposed borders.

One fact makes the whole enterprise tractable without any 3-D reconstruction. For a purely rotating camera — or for a scene that is distant or planar — the motion between two frames is exactly a homography, the same "depth doesn't matter for a rotation or a plane" fact that lets a panorama reproject onto a common surface (cross-ref Automatic panorama stitching from multiple views and feature matching). And most handheld shake is dominated by rotation: a trembling wrist rocks the camera far more than it translates it. So a 2-D motion model already does the heavy lifting. The hard case — genuine translation through a 3-D scene, where near and far objects parallax past each other at different rates — is the one that breaks the single-homography assumption and needs the content-preserving warps we reach in Stage 2 (Figure 12.4.1).

fig-stab-pipeline
Figure 12.4.1. The stabilization pipeline as one block diagram. Left to right: a sequence of input frames; (1) estimate — fit an inter-frame transform $F_t$ to feature correspondences (KLT tracks, robustly via RANSAC) and chain them into the cumulative real path $C_t = F_t F_{t-1}\cdots F_1$; (2) smooth — low-pass or $L_1$-optimize $C_t$ into the steady path $P_t$; (3) re-render — warp each frame by the correction transform $W_t = P_t\,C_t^{-1}$ and crop the exposed borders to a valid window. The recovered camera path is the correspondence; the warp is the transport.

12.4.2 Stage 1 — estimating the camera trajectory

To recover the path, we work locally and then accumulate. Between two consecutive frames, fit a global 2-D transform $F_t$ to a set of point correspondences. The transform comes in increasing fidelity — a pure translation is the crudest model, an affine map adds rotation/scale/shear, and a full homography captures the perspective of a rotating camera — and you pick the simplest one that fits your footage. The correspondences themselves come from KLT feature tracks or matched corners (cross-ref Feature tracking): a good corner is trackable for the same reason it is a good keypoint, which is why the tracking chapter pays off directly here.

The one defensive measure that matters is robust fitting. A moving foreground object — a person walking through the shot — drags its own feature tracks along with it, and if those tracks enter the fit they will pull the estimated camera motion toward the object's motion. So we fit $F_t$ with RANSAC, exactly the robust-homography machinery from automatic panorama stitching: sample a minimal set, count inliers, keep the consensus, and let the moving object's tracks fall out as outliers (cross-ref Automatic panorama stitching from multiple views and feature matching).

With the inter-frame transforms in hand, chain them into an absolute trajectory:

$$ C_t = F_t\, F_{t-1} \cdots F_1 . $$

The camera path is now a sequence of transforms — or, if you parameterize it, a multi-dimensional time signal: translation in $x$ and $y$, rotation, scale, and so on, each a 1-D curve over the clip. This is the object we will smooth.

The catch to name is drift. Every $F_t$ carries a small estimation error, and chaining accumulates those errors, so $C_t$ slowly wanders away from the truth — the same error-accumulation that bundle adjustment corrects in panoramas and structure-from-motion. For a finite clip the drift is usually tolerable; long sequences want global optimization or loop closure to pin it down.

There is a sensor-side alternative that sidesteps feature tracking entirely. Rather than estimate the camera's rotation from pixels, read it from the phone's gyroscope: Karpenko, Jacobs, Baek and Levoy (2011) integrate the gyro signal to obtain the camera's rotation directly. It is cheap, robust to scene content (no features to lose on a blank wall), and — crucially — it is exactly the per-instant rotation that rolling-shutter correction will also need at the end of the chapter. Hold that thread; it is where the gyro really pays off.

12.4.3 Stage 2 — smoothing the path (low-pass vs. L1-optimal cinematic paths)

Now we design the better path $P_t$. The obvious move is naive low-pass smoothing: convolve the trajectory with a temporal Gaussian,

$$ P_t = \sum_k g_k\, C_{t-k}, $$

with Gaussian weights $g_k$. It is simple and it does remove jitter, but it has three honest failings. It lags and overshoots at genuine motion onsets — the start and stop of a real pan get smeared past their true timing. It blurs intended motion, because a fixed kernel cannot tell a deliberate whip-pan from a tremor; both are "fast," so both get attenuated. And to avoid lagging it needs future frames, making it acausal — fine for offline processing, awkward for live preview (Figure 12.4.2).

The deeper idea, due to Grundmann, Kwatra and Essa (2011), is to stop smoothing and start synthesizing: produce the path a professional camera operator would have shot. A trained operator's footage is built from three primitives — static holds (the camera locked off), constant-velocity pans or dollies (a steady sweep), and smooth ease-in/ease-out transitions between them. In terms of the path's derivatives, these are stretches where the first, second, or third derivative is exactly zero: a hold has zero velocity, a constant pan has zero acceleration, an eased move has zero jerk. So Grundmann minimizes the $L_1$ norm of those three derivatives,

$$ \min_{P}\ \ w_1\lVert D^1 P\rVert_1 + w_2\lVert D^2 P\rVert_1 + w_3\lVert D^3 P\rVert_1 , $$

and the choice of $L_1$, not $L_2$, is the whole trick. The $L_1$ norm is sparsity-inducing: it drives the penalized quantities to be exactly zero almost everywhere, broken by a few deliberate jumps. So the optimal path comes out as crisp static-then-constant-then-eased segments rather than a perpetually wobbling smooth curve — the optimizer literally invents the cuts a director would make. (It is the same sparsity intuition as the sparse-gradient image priors in blind deblurring, transplanted from the spatial axis to the time axis.)

This is also where the crop tradeoff gets baked directly into the path design rather than patched on afterward. The optimization is constrained so that the smoothed frame always stays inside the real frame: the crop window — a rectangle inset by a chosen margin — must, after warping by $P_t\,C_t^{-1}$, contain only real pixels. Grundmann solves the whole thing as a linear program — an $L_1$ objective with linear inclusion constraints — so the path it returns is the steadiest one that never asks for pixels the camera never captured.

fig-stab-trajectory-smoothing
Figure 12.4.2. Smoothing the camera-path signal, one parameter shown (say horizontal position over time). The jittery raw trajectory $C_t$ wobbles frame to frame. A Gaussian low-pass (dashed) removes the tremor but rounds the corners — it lags and overshoots where a real pan starts and stops. The $L_1$-optimal path (solid) instead snaps to piecewise static / constant-velocity / eased segments with sharp, deliberate transitions, matching what a camera operator would actually shoot.

There is one regime the single-homography model simply cannot handle: genuine 3-D parallax. When the camera truly translates through a scene, near and far objects move by different amounts, and no single global warp can re-render the frame correctly — re-render the foreground right and the background tears, or vice versa. Liu, Gleicher, Jin and Agarwala (2009) address this with content-preserving warps: a spatially-varying warp, guided by a sparse 3-D reconstruction of the scene, that moves different image regions by different amounts. To keep the spatially-varying warp from shredding the picture, it carries an as-rigid-as-possible regularizer — a penalty that keeps local shapes and straight lines intact even though the global map is no longer a homography (the same as-rigid-as-possible idea behind the free-form warps in Warping and resampling). Content-preserving warps are the bridge from 2-D-path stabilization to genuinely 3-D-aware stabilization, accepting a small amount of geometric "wrongness" in exchange for a perceptually clean result.

12.4.4 Stage 3 — re-rendering and the stabilization↔crop tradeoff

With the smoothed path $P_t$ designed, the transport is mechanical. For each frame, apply the correction transform

$$ W_t = P_t\, C_t^{-1}, $$

which reads exactly as it should: undo where the camera actually was ($C_t^{-1}$), then redo where the smooth path wants it ($P_t$). Apply $W_t$ to the frame and resample, using the inverse-warp-and-reconstruct engine of Warping and resampling — loop over output pixels, push each back through $W_t$, read the color there. This is the "transport" half of the part spine, identical in kind to the transport in rectification, panoramas, and morphing.

But the warp swings the frame quad off the original image grid, and that is where the borders go missing. Stabilizing means tilting and shifting each frame to sit on the smooth path; parts of the output rectangle then fall outside the pixels the camera actually captured, leaving empty margins along the edges that swung inward. There are two fixes, and the choice between them is the genuine tradeoff of the chapter:

A practical refinement softens the tradeoff: choose the crop margin adaptively, per clip or per segment, so that smooth shots crop almost nothing and only the violently shaky stretches pay the full field-of-view (FOV) cost. That turns a fixed global penalty into a content-aware one — the same spirit as adaptive everything else in this book.

fig-stab-crop-window
Figure 12.4.3. The stabilization↔crop tradeoff on a single frame. The original captured frame (outer rectangle) is warped by $W_t = P_t\,C_t^{-1}$ to the smoothed pose, so the warped quad (tilted, shifted) no longer fills the rectangle — empty margins appear where it swung inward. The crop window (inner rectangle) is the largest rectangle that stays inside the valid pixels for every frame; it is upscaled back to full size. Removing more shake forces a smaller crop window, trading field of view and sharpness for steadiness.

12.4.5 Rolling-shutter correction: per-row pose and rectification

So far we treated each frame as a single snapshot taken at one instant. For most CMOS sensors that is a lie, and the lie has visible consequences. Unlike a global-shutter sensor, which reads every pixel at once, a CMOS sensor is read out row by row: row $r$ is exposed at time

$$ t(r) = t_{\text{frame}} + r \cdot t_{\text{row}}, $$

so the top of the frame and the bottom of the frame are captured a few — up to roughly thirty — milliseconds apart. If the camera or the subject moves appreciably during that readout sweep, each row records a slightly different camera pose. The frame is no longer one time sample; it is a stack of them.

The artifacts are diagnostic, and worth naming because each one tells you what happened (Figure 12.4.4). A horizontal pan skews vertical edges into slants, because each row was shot from a slightly panned-over viewpoint. A handheld tremor produces a wobble the community calls "jello," the image rippling as if seen through water. A spinning fan or propeller smears into impossible curved blades. All of it traces to the same cause — top and bottom of the frame are different time samples — which makes rolling-shutter distortion a time-axis sampling problem within a single frame (rows captured at different instants), a loose within-frame echo of the L16 sampling lesson rather than frequency-folding aliasing proper.

The model and the fix are the part spine again, now applied per row. Treat the camera pose as a continuous function of time, $R\big(t(r)\big)$, so that each scanline has its own pose. To rectify, pick a single reference pose — the first row's, or the frame center's — and reproject every pixel from the pose at its row-time onto that common instant:

$$ (r,c)\ \longmapsto\ \text{reproject through } R(t_{\text{ref}})\, R\big(t(r)\big)^{-1}. $$

Each row gets warped by a slightly different transform; together the per-row warps "un-shear" the frame back to a single virtual instant, undoing the jello. It is correspondence-then-transport at the granularity of a scanline.

fig-rolling-shutter-skew
Figure 12.4.4. Rolling-shutter distortion. Left, global shutter: a vertical pole (or a spinning fan) under a fast horizontal pan stays upright — every row shares one exposure instant. Right, rolling shutter: because each row is read at $t(r) = t_{\text{frame}} + r\,t_{\text{row}}$, successive rows see a slightly panned pose, so the pole shears into a slant and the fan blades smear into curves — the "jello" effect. Top and bottom of the frame are different time samples.

Everything now hinges on where the per-row pose comes from, and there are two answers. Karpenko et al. (2011) read $R(t)$ from the phone's gyroscope — after a one-time calibration of the gyro-to-frame timing and the row readout time $t_{\text{row}}$ — which is robust, cheap, and (the thread from Stage 1) doubles as the rotation estimate for ordinary stabilization. Forssén and Ringaby (2010) instead estimate the per-row rotation from the image itself, fitting feature tracks to a continuous-time motion model and resampling against it. Either way the output is a rolling-shutter-free frame, after which conventional stabilization runs on clean input.

That ordering is not optional. Rolling-shutter rectification should precede trajectory smoothing — or be solved jointly with it — because otherwise the Stage-1 homography fits are corrupted by the very shear we are trying to remove: you would be estimating a camera path from frames that each lie about their own geometry. This is precisely the rolling-shutter caveat that the panorama chapter forward-references for continuous phone sweeps, where readout-time motion bites hardest because the phone is deliberately swinging the whole time (cross-ref Continuous panoramas (e.g. on cell phones)).

Finally, the capture-side escape worth stating: a global-shutter sensor reads all rows at once and has no rolling shutter to correct — the optical fix, paid for in sensor cost and complexity. It is the same engineering instinct that recurs throughout the book: when you can, build the system so you never have to invert the nasty artifact in the first place.


Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (L17, recurrence)

This chapter is the part spine in its most literal form. Correspondence first: estimate where each pixel came from — here, as a camera path, the smoothed trajectory $P_t$ that maps each output frame back to the captured one. Then transport: warp and resample the pixels along that map with the one shared engine of Warping and resampling. The estimation is the hard, ill-posed half (feature tracks corrupted by moving objects, error that drifts as you chain frames); the transport is plumbing you already own. Separating the two is exactly what lets the same warp engine serve rectification, panoramas, morphing, and now stabilization alike — what changes is only how you got the map. The rolling-shutter section at the end is the same spine applied within a frame: a per-row correspondence, then a per-row transport. (Part spine L17; registered in the part introduction; recurs in every chapter of this part.)