12.3 Video magnification
You cannot see a heartbeat in a face, or a skyscraper sway, or a wall flex under a passing truck — these changes are real but below the threshold of perception, buried in pixel values that look constant. Yet the information is there in an ordinary video: the green channel of a cheek really does brighten and dim with each pulse; an edge really does move a fraction of a pixel as a structure vibrates. Video magnification is the family of methods that pulls out these tiny temporal variations and scales them up until they are plainly visible — turning a normal camera into an instrument for the invisible.
The surprise is how little it takes. The instinct, schooled by the rest of this part, is to track: find each feature, measure its tiny displacement, exaggerate it, re-render. That route exists, and we will meet it. But the breakthrough was to discover that for small changes you can skip tracking entirely and get the same result — and often a better one — by pure temporal filtering at each fixed pixel. That refusal to estimate correspondence is what makes this chapter the exception that proves the part's rule.
12.3.1 The two views: Lagrangian (track-then-amplify) vs. Eulerian (per-pixel signal)
There are two ways to make a small motion big, and they are the same dichotomy you met in Motion blur and temporal sampling: do you follow the point (Lagrangian) or stay at the location (Eulerian)?
The Lagrangian route, the precursor due to Liu, Torralba, Freeman, Durand and Adelson (2005, Motion Magnification), is the one this part has trained you to expect. Estimate the motion: run Optical flow to recover each feature's displacement $\delta(t)$, scale the displacement $\delta(t)\to(1+\alpha)\,\delta(t)$, and warp the frame to render the exaggerated movement. It is physically explicit and intuitive. It is also fragile: it inherits every failure mode of optical flow (occlusion, the aperture problem, sub-pixel error), and estimating tiny motions accurately is precisely the regime where flow is weakest. Powerful in principle, brittle in practice, and it cannot touch a pure color change (a pulse) because there is no point to follow.
The Eulerian route — the breakthrough, Wu, Rubinstein, Shih, Guttag, Durand and Freeman (2012) — fixes the spatial grid and never moves off it. At each pixel $\mathbf x$, look at the time series $t\mapsto I(\mathbf x,t)$. Band-pass it temporally to the band you care about — say $0.8$–$4$ Hz for a human pulse — amplify that band, and add it back. No flow, no warping, no tracking. It is a pure temporal signal-processing operation done independently at every pixel (within every spatial sub-band). And because it operates on the raw value over time, it amplifies a color change (blood flushing a cheek) exactly as naturally as it amplifies a sub-pixel motion.
The motion part runs on one premise: to manipulate motion you must first establish correspondence, that is, estimate where each thing went (optical flow, Kanade–Lucas–Tomasi (KLT) tracks, homography paths) and then act on that displacement field. Video magnification is the deliberate exception. It refuses to estimate correspondence and filters per pixel instead: it never asks "where did this go?", only "how does this location change in time?" For small variations this turns a hard, brittle estimation problem (sub-pixel flow) into a trivial, robust filtering one (a 1-D band-pass), and as a bonus it reveals color change, which has no correspondence to track at all. The recurring moral: picking the right frame of reference, fixed grid versus moving point, can dissolve a hard estimation problem into an easy filtering one. (Registered as the L17 exception; the rule itself governs Optical flow, Feature tracking, and Video stabilization and rolling-shutter correction.)
12.3.2 Eulerian video magnification: band-pass and amplify
The full pipeline, run per spatial band and independently at each pixel, is three steps.
First, decompose spatially into a pyramid (Linear pyramids and wavelets) and process each scale separately. This costs almost nothing and buys two things: it lets us amplify different spatial frequencies by different amounts, and it lets us spatially smooth a band — the lever that controls noise, as we will see.
Second, temporally band-pass each pixel's time series with an operator we write $\mathcal B\{\cdot\}$. Keep only the temporal frequencies of the phenomenon — the pulse band, the vibration band — and discard the rest: the DC (zero-frequency) "static" part that carries the unchanging scene, and the out-of-band content that is noise or irrelevant motion.
Third, amplify and recombine. Add an $\alpha$-scaled copy of the band-passed signal back onto the original,
$$\tilde I(\mathbf x,t)=I(\mathbf x,t)+\alpha\,\mathcal B\{I(\mathbf x,t)\},$$
then collapse the pyramid to get the magnified video. Read back in words: leave the scene as it is, but add $\alpha$ times the part of each pixel's time series that lives in the band of interest. The factor $\alpha$ is the magnification — $\alpha=10$ makes the in-band variation ten times larger; $\alpha=0$ leaves the video untouched (Figure 12.3.1).
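The band-pass-and-amplify step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: it works at a single spatial scale (the pyramid of step one is omitted), it uses an ideal brick-wall FFT filter for $\mathcal B$, and the names `ideal_bandpass` and `magnify` are ours. The video is synthetic: a static gray patch with a faint 1 Hz flicker.

```python
import numpy as np

def ideal_bandpass(series, fps, f_lo, f_hi):
    """Brick-wall temporal band-pass along axis 0 (time), via the real FFT."""
    n = series.shape[0]
    freqs = np.fft.rfftfreq(n, d=1.0 / fps)          # frequency of each bin, in Hz
    spectrum = np.fft.rfft(series, axis=0)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0.0  # drop DC and out-of-band bins
    return np.fft.irfft(spectrum, n=n, axis=0)

def magnify(video, fps, f_lo, f_hi, alpha):
    """Return video + alpha * B{video} for a float array shaped (T, H, W)."""
    video = np.asarray(video, dtype=float)
    return video + alpha * ideal_bandpass(video, fps, f_lo, f_hi)

# Synthetic test video: constant 0.5 plus a 1 Hz oscillation of amplitude 0.001.
fps = 30.0
t = np.arange(300) / fps
flicker = 0.001 * np.sin(2 * np.pi * 1.0 * t)
video = 0.5 + flicker[:, None, None] * np.ones((1, 4, 4))

out = magnify(video, fps, f_lo=0.8, f_hi=4.0, alpha=50.0)

# Ratio of temporal standard deviations at one pixel: the in-band gain.
gain = out[:, 0, 0].std() / video[:, 0, 0].std()
print(f"in-band gain: {gain:.2f}")
```

Because the whole temporal variation of this toy video lies inside the pass band, the measured gain should sit at $1+\alpha$. A practical system would more likely use a causal IIR band-pass so that frames can be processed as they arrive; the FFT filter is chosen here only because it is the shortest thing to write down.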
Why amplifying a color signal reveals a pulse
The cleanest case has no motion in it at all. Blood flow modulates skin reflectance very slightly, so a face pixel's green-channel value oscillates faintly at the heart rate. Band-pass that time series to $0.8$–$4$ Hz, amplify, add back — and the face visibly blushes in time with the pulse, whose waveform you can now read straight off the recovered signal. There is no model of motion anywhere in this: it is literally a color changing over time, isolated and scaled. This is the basis of contactless remote photoplethysmography — recovering vital signs from an ordinary webcam.
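To make the pulse-reading step concrete, here is a small sketch that estimates a heart rate from a green-channel trace. Everything in it is an assumption for illustration: the trace is synthetic (a constant skin tone, a 1.2 Hz oscillation, and Gaussian sensor noise), it stands in for the spatial average of a skin region, and `dominant_frequency` is a hypothetical helper that simply picks the strongest FFT bin inside the pulse band.

```python
import numpy as np

def dominant_frequency(series, fps, f_lo, f_hi):
    """Return the frequency (Hz) of the largest spectral peak inside [f_lo, f_hi]."""
    series = np.asarray(series, dtype=float)
    series = series - series.mean()                 # remove the static (DC) part
    freqs = np.fft.rfftfreq(series.size, d=1.0 / fps)
    magnitude = np.abs(np.fft.rfft(series))
    band = (freqs >= f_lo) & (freqs <= f_hi)        # restrict the search to the pulse band
    return freqs[band][np.argmax(magnitude[band])]

# Synthetic green-channel trace: 20 s at 30 fps, pulse at 1.2 Hz (72 beats per minute).
rng = np.random.default_rng(0)
fps, seconds, pulse_hz = 30.0, 20, 1.2
t = np.arange(int(fps * seconds)) / fps
green = (0.6
         + 0.002 * np.sin(2 * np.pi * pulse_hz * t)   # faint pulse signal
         + 0.001 * rng.standard_normal(t.size))       # sensor noise

est_hz = dominant_frequency(green, fps, f_lo=0.8, f_hi=4.0)
print(f"estimated pulse: {est_hz * 60:.1f} beats per minute")
```

The frequency resolution of this estimate is one over the clip duration (0.05 Hz here, or 3 beats per minute), which is one reason real measurements use longer windows or finer spectral estimators.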
Why amplifying intensity also reveals motion: the first-order argument
The remarkable claim is that the same operation, applied to a moving edge, amplifies the motion — even though we never estimated it. The argument is a single Taylor expansion, and it is worth doing because it shows exactly when the trick works and when it breaks.
Take a 1-D feature translating by a tiny displacement, so the intensity profile $f$ simply slides: $I(x,t)=f\big(x+\delta(t)\big)$. Expand to first order in the small displacement,
$$I(x,t)\approx f(x)+\delta(t)\,\frac{\partial f(x)}{\partial x}\equiv f(x)+\delta(t)\,I_x,$$
so the temporal variation at a fixed pixel is $\delta(t)\,I_x$: the displacement times the spatial gradient. (This is the same product, displacement $\times$ image gradient, that the brightness-constancy equation of Optical flow is built on; here we exploit it rather than solve it.) The band-pass isolates exactly this term, $\mathcal B\{I\}\approx\delta(t)\,I_x$, and adding $\alpha$ times it gives
$$I(x,t)+\alpha\,\mathcal B\{I\}\approx f(x)+(1+\alpha)\,\delta(t)\,I_x\approx f\big(x+(1+\alpha)\,\delta(t)\big),$$
where the last step runs the Taylor expansion backward: a signal equal to $f(x)$ plus $(1+\alpha)\delta(t)$ times its own gradient is, to first order, just $f$ shifted by $(1+\alpha)\delta(t)$. So the result looks exactly as if the feature had moved $(1+\alpha)$ times as far. Amplifying the per-pixel intensity signal is amplifying the motion — without ever estimating it (Figure 12.3.2). That is the whole magic trick, and it is the formal content of the L17 exception.
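The first-order argument is easy to check numerically. The sketch below is an illustration under assumptions of our own choosing: the profile $f$ is a Gaussian bump of width 5 pixels, and the band-pass is idealized as "whatever differs from the static profile," i.e. $\mathcal B\{I\}=I-f$. It compares the linearly magnified signal with the profile genuinely shifted $(1+\alpha)$ times as far, once for a tiny displacement and once for a large one.

```python
import numpy as np

def profile(x):
    """Smooth 1-D intensity profile f(x): a Gaussian bump, sigma = 5 pixels."""
    return np.exp(-0.5 * (x / 5.0) ** 2)

def magnification_error(delta, alpha, x):
    """Max abs difference between linear magnification and a true (1 + alpha) * delta shift."""
    observed = profile(x + delta)               # I(x, t) = f(x + delta)
    variation = observed - profile(x)           # idealized band-pass output, approx delta * f'(x)
    magnified = observed + alpha * variation    # I + alpha * B{I}
    target = profile(x + (1 + alpha) * delta)   # what a real, larger motion would look like
    return np.max(np.abs(magnified - target))

x = np.linspace(-30.0, 30.0, 601)
alpha = 10.0
err_small = magnification_error(delta=0.05, alpha=alpha, x=x)  # (1 + alpha) * delta = 0.55 px
err_large = magnification_error(delta=1.0, alpha=alpha, x=x)   # (1 + alpha) * delta = 11 px

print(f"small displacement error: {err_small:.4f}")
print(f"large displacement error: {err_large:.4f}")
```

For the small displacement the two signals agree to within a fraction of a percent of the profile's peak; for the large one the magnified signal overshoots badly, which is the haloing described next.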
The catches: noise and large motion
None of this is free, and the limits follow directly from the derivation.
The approximation is first-order, so it holds only for small $\delta$ and smooth edges. Large motions or sharp edges violate the linearization, and the reconstruction over- or under-shoots — the visible symptom is haloing around amplified edges (Figure 12.3.2, right).
It also amplifies noise. The temporal band-pass passes in-band sensor noise along with the signal, and multiplying by $\alpha$ blows it up just as much. The remedy exploits the spatial pyramid from step one: spatially blur, or use coarser pyramid levels, for large $\alpha$ — trading spatial detail for signal-to-noise — and bound $\alpha$ as a function of spatial wavelength, amplifying low spatial frequencies (which are robust) more than high ones (which are noisy). Finally, the method commits to a single global band, which is ideal for one dominant frequency (a pulse) but clumsy for broadband motion. These limits are exactly what the next method addresses.
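One way to picture "bound $\alpha$ as a function of spatial wavelength" is a per-level cap. The sketch below is hypothetical throughout: the helpers `max_alpha` and `alpha_per_level` are ours, each pyramid level is assumed to carry a representative wavelength that doubles from one level to the next, and the default fraction of one eighth of a wavelength follows the bound $(1+\alpha)\,\delta<\lambda/8$ as we understand it from the linear Eulerian paper. Treat that fraction as a tunable assumption.

```python
def max_alpha(wavelength, delta, fraction=1 / 8):
    """Largest alpha satisfying (1 + alpha) * delta <= fraction * wavelength, floored at 0."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return max(0.0, fraction * wavelength / delta - 1.0)

def alpha_per_level(alpha, base_wavelength, n_levels, delta):
    """Cap the requested alpha at each pyramid level.

    Level k is assumed to represent spatial wavelength base_wavelength * 2**k (pixels),
    so coarse levels tolerate more amplification than fine ones.
    """
    return [min(alpha, max_alpha(base_wavelength * 2 ** k, delta))
            for k in range(n_levels)]

# Requested alpha of 20, finest level at a 4-pixel wavelength, expected motion 0.1 px.
levels = alpha_per_level(alpha=20.0, base_wavelength=4.0, n_levels=5, delta=0.1)
for k, a in enumerate(levels):
    print(f"level {k}: alpha = {a:.1f}")
```

The fine levels are throttled and the coarse levels receive the full requested gain, which is the "amplify low spatial frequencies more than high ones" rule stated above.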
12.3.3 The phase-based successor: amplify phase, not intensity
The diagnosis of what went wrong with intensity magnification is sharp: it conflates "the edge moved" with "the brightness changed." Both show up as a change in $I(\mathbf x,t)$, so amplifying that change amplifies amplitude noise and clips at edges. The fix, due to Wadhwa, Rubinstein, Durand and Freeman (2013), is to express motion on a cleaner axis: not as an intensity change, but as a shift in local phase.
The vehicle is a complex steerable pyramid (Linear pyramids and wavelets): decompose each frame into oriented, band-pass sub-bands whose coefficients are complex,
$$c(\mathbf x,t)=A(\mathbf x,t)\,e^{i\phi(\mathbf x,t)},$$
with an amplitude $A$ (the "content" — how much edge energy is here) and a local phase $\phi$ (where, within the wavelength, the edge sits). By the Fourier shift theorem, a local translation of a sub-band appears purely as a change in its phase $\phi$, while the amplitude $A$ stays put. Phase is the natural coordinate for small motion.
So the operation is: temporally band-pass the phase and amplify it, leaving the amplitude alone,
$$A\,e^{i\phi}\;\longrightarrow\;A\,e^{i\left(\phi+\alpha\,\mathcal B\{\phi\}\right)},$$
which shifts each sub-band — physically moves the content — by the amplified amount, without scaling amplitude noise. The wins are three: far fewer artifacts, because a phase change is a clean sub-pixel shift rather than an intensity addition, so much less haloing; support for larger $\alpha$, hence bigger and cleaner magnification; and noise that is translated, not amplified — it rides along with the band instead of blowing up. The cost is the steerable-pyramid machinery and more computation (Figure 12.3.3).
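The core idea, that scaling a phase change scales a shift, can be demonstrated without a steerable pyramid. The sketch below is a deliberately simplified stand-in, not the published method: it uses the global 1-D Fourier transform instead of localized oriented sub-bands, and the phase difference from a reference frame instead of a temporal band-pass. The function name `phase_magnify` and the Gaussian test profile are our own assumptions. It magnifies a 0.3-pixel shift by $\alpha=10$ and compares the result with linear intensity magnification.

```python
import numpy as np

def phase_magnify(frame, reference, alpha):
    """Scale each Fourier coefficient's phase change relative to reference by (1 + alpha)."""
    F = np.fft.fft(frame)
    R = np.fft.fft(reference)
    dphi = np.angle(F * np.conj(R))        # per-frequency phase change, wrapped to (-pi, pi]
    new_phase = np.angle(R) + (1 + alpha) * dphi
    return np.fft.ifft(np.abs(F) * np.exp(1j * new_phase)).real  # amplitude left untouched

def profile(x):
    """Gaussian bump, sigma = 5 samples, centered on a 256-sample periodic grid."""
    return np.exp(-0.5 * ((x - 128.0) / 5.0) ** 2)

x = np.arange(256, dtype=float)
delta, alpha = 0.3, 10.0

reference = profile(x)                        # frame at rest
frame = profile(x + delta)                    # frame after a sub-pixel shift
target = profile(x + (1 + alpha) * delta)     # the shift we want to synthesize

phase_out = phase_magnify(frame, reference, alpha)
linear_out = frame + alpha * (frame - reference)   # intensity-based magnification

err_phase = np.max(np.abs(phase_out - target))
err_linear = np.max(np.abs(linear_out - target))
print(f"phase-based error: {err_phase:.2e}, linear error: {err_linear:.2e}")
```

For a global translation of a smooth profile, the phase route reproduces the larger shift almost exactly while the linear route overshoots. Real scenes move locally and differently in different places, which is why the actual method needs the localized, oriented sub-bands of the complex steerable pyramid rather than a single global transform.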
The unifying view: linear Eulerian magnification amplifies the intensity band-pass; phase-based magnification amplifies the phase band-pass in an oriented complex pyramid. Both are squarely Eulerian — per-location, no flow, the L17 exception intact. Phase-based magnification simply chooses a representation in which "small motion" lives on a cleaner axis.
12.3.4 Uses, and the bookend to compression
Once tiny temporal change is a thing you can pull out and scale, a row of applications opens up.
Contactless vital signs. Recover a heart rate and pulse waveform — and even breathing — from a webcam video of a face or a hand. Remote photoplethysmography of this kind feeds telehealth, driver monitoring, and neonatal care, where attaching a sensor is awkward or impossible.
Structural and material vibration. Visualize and measure how bridges, buildings, turbine blades, and machines vibrate; reveal resonant modes and infer material properties from ordinary video. It is a cheap, full-field, non-contact alternative to bolting on accelerometers — every pixel is a virtual sensor.
The visual microphone — sound from a silent video. The most striking demonstration that the signal was there all along, due to Davis, Rubinstein, Wadhwa, Mysore, Durand and Freeman (2014). Sound waves vibrate everyday objects — a chip bag, a plant leaf, a glass of water — by micrometres. Reading those per-sub-band micro-motions out of a high-speed video recovers the audio that caused them: intelligible speech and music from a silent recording of a vibrating object. The same Eulerian per-location reading that magnifies a pulse can, run in reverse, demodulate a soundtrack from a bag of chips.
Other cues. Subtle facial and lip motion for speech and affect; revealing breathing, fluid flow, and slow drifts; an analysis tool for any "is something moving that I can't see?" question.
The bookend to compression. It is worth ending where the part's other temporal extreme sits. Video compression and motion compensation spends all its effort predicting away and discarding the inter-frame difference — that residual is redundant, so throw it out. Video magnification does the exact opposite: it treats that same tiny temporal residual as the signal, and amplifies it. Same quantity, opposite intent. The clean way to remember both: compression hides the small temporal change; magnification reveals it. And both differ from the part's mainstream — flow, tracking, stabilization, interpolation — in the way this chapter has stressed throughout: those follow points and need correspondence, while magnification, the L17 exception, sits at a fixed pixel and filters.
Big lessons of this chapter
The recurring principles from this chapter, gathered for review.
The motion part runs on one premise: to manipulate motion you must first establish correspondence, that is, estimate where each thing went (optical flow, Kanade–Lucas–Tomasi (KLT) tracks, homography paths) and then act on that displacement field. Video magnification is the deliberate exception. It refuses to estimate correspondence and filters per pixel instead: it never asks "where did this go?", only "how does this location change in time?" For small variations this turns a hard, brittle estimation problem (sub-pixel flow) into a trivial, robust filtering one (a 1-D band-pass), and as a bonus it reveals color change, which has no correspondence to track at all. The recurring moral: picking the right frame of reference, fixed grid versus moving point, can dissolve a hard estimation problem into an easy filtering one. (Registered as the L17 exception; the rule itself governs Optical flow, Feature tracking, and Video stabilization and rolling-shutter correction.)