| symbol | meaning (this chapter) | note |
| --- | --- | --- |
| $I(\mathbf x,t)$ | the video as intensity at fixed position $\mathbf x$ and time $t$; the Eulerian object, a per-pixel time series | from Notations; the $(x,y,t)$ video of Motion blur and temporal sampling |
| $I'(\mathbf x,t)$ | the magnified video, $I + \alpha\,\mathcal B\{I\}$ | new (this chapter) |
| $\mathcal B\{\cdot\}$ | a temporal band-pass of a per-pixel time series; keeps the frequencies of the phenomenon, drops DC and out-of-band | new (this chapter) |
| $\alpha$ | the magnification factor; $\alpha\,\mathcal B$ is added back; $\alpha=0$ leaves the video unchanged | new (this chapter) |
| $\delta(t)$ | the small displacement of a translating feature; the temporal change at a pixel is $\delta(t)\,I_x$ | shared with Optical flow |
| $I_x$, $f_x$ | the spatial gradient of the intensity profile (the brightness-constancy term) | from Notations; shared with Optical flow |
| $S(\mathbf x,t)$ | a complex steerable sub-band, $S=A\,e^{i\phi}$ | new (phase-based, Linear pyramids and wavelets) |
| $A(\mathbf x,t)$, $\phi(\mathbf x,t)$ | the sub-band amplitude (content) and local phase (position within the wavelength); motion = a shift in $\phi$ | new (this chapter) |


12.3 Video magnification

You cannot see a heartbeat in a face, or a skyscraper sway, or a wall flex under a passing truck — these changes are real but below the threshold of perception, buried in pixel values that look constant. Yet the information is there in an ordinary video: the green channel of a cheek really does brighten and dim with each pulse; an edge really does move a fraction of a pixel as a structure vibrates. Video magnification is the family of methods that pulls out these tiny temporal variations and scales them up until they are plainly visible — turning a normal camera into an instrument for the invisible.

The surprise is how little it takes. The instinct, schooled by the rest of this part, is to track: find each feature, measure its tiny displacement, exaggerate it, re-render. That route exists, and we will meet it. But the breakthrough was to discover that for small changes you can skip tracking entirely and get the same result — and often a better one — by pure temporal filtering at each fixed pixel. That refusal to estimate correspondence is what makes this chapter the exception that proves the part's rule.

12.3.1 The two views: Lagrangian (track-then-amplify) vs. Eulerian (per-pixel signal)

There are two ways to make a small motion big, and they are the same dichotomy you met in Motion blur and temporal sampling: do you follow the point (Lagrangian) or stay at the location (Eulerian)?

The Lagrangian route (the precursor: Liu, Torralba, Freeman, Durand and Adelson, 2005, Motion Magnification) is the one this part has trained you to expect. Estimate the motion: run Optical flow to recover each feature's displacement $\delta(t)$, scale the displacement $\delta(t)\to(1+\alpha)\,\delta(t)$, and warp the frame to render the exaggerated movement. It is physically explicit and intuitive. It is also fragile: it inherits every failure mode of optical flow (occlusion, the aperture problem, sub-pixel error), and estimating tiny motions accurately is precisely the regime where flow is weakest. Powerful in principle, brittle in practice, and it cannot touch a pure color change (a pulse) because there is no point to follow.
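To make the track-then-amplify recipe concrete, here is a minimal 1-D sketch. It is not the method of the cited paper: the single global shift, the one-parameter least-squares estimate, and the names `lagrangian_magnify_1d` and `profile` are all assumptions made for illustration.

```python
import numpy as np

def lagrangian_magnify_1d(f0, f1, alpha):
    """Track-then-amplify in 1-D: estimate one global shift, scale it, re-warp f0."""
    x = np.arange(f0.size, dtype=float)
    fx = np.gradient(f0)  # spatial gradient of the reference frame
    # Least-squares shift from the linearised model f1 - f0 ~ delta * fx
    # (a one-parameter, Lucas-Kanade-style estimate; assumed here for illustration).
    delta_hat = np.sum(fx * (f1 - f0)) / np.sum(fx * fx)
    # Warp: sample the reference frame at x + (1 + alpha) * delta_hat.
    magnified = np.interp(x + (1.0 + alpha) * delta_hat, x, f0)
    return delta_hat, magnified

x = np.arange(200, dtype=float)

def profile(shift):
    """A smooth bump f(x + shift); hypothetical test signal."""
    return np.exp(-0.5 * ((x + shift - 100.0) / 5.0) ** 2)

delta_true, alpha = 0.05, 20.0
delta_hat, magnified = lagrangian_magnify_1d(profile(0.0), profile(delta_true), alpha)
target = profile((1.0 + alpha) * delta_true)  # the bump displaced (1 + alpha) times as far

print("estimated shift:", delta_hat)
print("max error vs. truly shifted bump:", np.max(np.abs(magnified - target)))
```

Everything downstream depends on `delta_hat`: any error in the estimate is multiplied by $(1+\alpha)$ before the warp, which is the fragility described above.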

The Eulerian route — the breakthrough, Wu, Rubinstein, Shih, Guttag, Durand and Freeman (2012) — fixes the spatial grid and never moves off it. At each pixel $\mathbf x$, look at the time series $t\mapsto I(\mathbf x,t)$. Band-pass it temporally to the band you care about — say $0.8$–$4$ Hz for a human pulse — amplify that band, and add it back. No flow, no warping, no tracking. It is a pure temporal signal-processing operation done independently at every pixel (within every spatial sub-band). And because it operates on the raw value over time, it amplifies a color change (blood flushing a cheek) exactly as naturally as it amplifies a sub-pixel motion.

💡 Big lesson (L17, recurrence)

The motion part runs on one premise: to manipulate motion you must first establish correspondence, that is, estimate where each thing went (optical flow, Kanade–Lucas–Tomasi (KLT) tracks, homography paths) and then act on that displacement field. Video magnification is the deliberate exception. It refuses to estimate correspondence and filters per pixel instead: it never asks "where did this go?", only "how does this location change in time?" For small variations this turns a hard, brittle estimation problem (sub-pixel flow) into a trivial, robust filtering one (a 1-D band-pass), and as a bonus it reveals color change, which has no correspondence to track at all. The recurring moral: picking the right frame of reference (fixed grid versus moving point) can dissolve a hard estimation problem into an easy filtering one. (Registered as the L17 exception; the rule itself governs Optical flow, Feature tracking, and Video stabilization and rolling-shutter correction.)

12.3.2 Eulerian video magnification — band-pass and amplify

The full pipeline, run per spatial band and independently at each pixel, is three steps.

First, decompose spatially into a pyramid (Linear pyramids and wavelets) and process each scale separately. This costs almost nothing and buys two things: it lets us amplify different spatial frequencies by different amounts, and it lets us spatially smooth a band — the lever that controls noise, as we will see.

Second, temporally band-pass each pixel's time series with an operator we write $\mathcal B\{\cdot\}$. Keep only the temporal frequencies of the phenomenon — the pulse band, the vibration band — and discard the rest: the DC (zero-frequency) "static" part that carries the unchanging scene, and the out-of-band content that is noise or irrelevant motion.

Third, amplify and recombine. Add an $\alpha$-scaled copy of the band-passed signal back onto the original,

$$ I'(\mathbf x,t) = I(\mathbf x,t) + \alpha\,\mathcal B\{I(\mathbf x,t)\}, $$

then collapse the pyramid to get the magnified video. Read back in words: leave the scene as it is, but add $\alpha$ times the part of each pixel's time series that lives in the band of interest. The factor $\alpha$ is the magnification: because the amplified copy is added on top of the original, the in-band variation grows by a factor of $1+\alpha$ (for a band-pass with unit gain in the passband), so $\alpha=10$ makes it eleven times larger, while $\alpha=0$ leaves the video untouched (Figure 12.3.1).

Figure 12.3.1. One fixed pixel, from signal to amplified output, left to right: (1) the pixel's value over time $t\mapsto I(\mathbf x,t)$ — a noisy near-flat trace; (2) its temporal spectrum, with a small peak at the phenomenon's frequency buried among DC and noise; (3) a temporal band-pass $\mathcal B$ keeping only that band; (4) the band scaled by $\alpha$ and added back, $I'=I+\alpha\,\mathcal B\{I\}$ — the same trace with its in-band wiggle now plainly visible. The operation is identical at every pixel and needs no knowledge of any other pixel.
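The band-pass-and-add-back steps can be sketched in a few lines of NumPy. This is a single-scale sketch under stated assumptions: the spatial pyramid is omitted, $\mathcal B$ is taken to be an ideal FFT band-pass, and the function names and the synthetic clip are ours, not part of any reference implementation.

```python
import numpy as np

def temporal_bandpass(video, fps, f_lo, f_hi):
    """Ideal temporal band-pass B{.}, applied independently at every pixel.
    `video` has shape (T, H, W); time runs along axis 0."""
    freqs = np.fft.rfftfreq(video.shape[0], d=1.0 / fps)
    keep = (freqs >= f_lo) & (freqs <= f_hi)
    spectrum = np.fft.rfft(video, axis=0)
    spectrum[~keep] = 0.0  # zero DC and every out-of-band frequency bin
    return np.fft.irfft(spectrum, n=video.shape[0], axis=0)

def eulerian_magnify(video, fps, f_lo, f_hi, alpha):
    """I' = I + alpha * B{I}: no flow, no warping, no tracking."""
    return video + alpha * temporal_bandpass(video, fps, f_lo, f_hi)

# Synthetic clip (assumed): a static 8x8 scene plus a 1.2 Hz flicker too small to see.
fps, n_frames, alpha = 30.0, 300, 10.0
t = np.arange(n_frames) / fps
scene = np.linspace(0.2, 0.8, 64).reshape(8, 8)
wiggle = 0.002 * np.sin(2.0 * np.pi * 1.2 * t)
video = scene[None, :, :] + wiggle[:, None, None]

magnified = eulerian_magnify(video, fps, 0.8, 4.0, alpha)
in_band = magnified[:, 0, 0] - scene[0, 0]
gain = (in_band.max() - in_band.min()) / (wiggle.max() - wiggle.min())
print("in-band gain:", gain)  # close to 1 + alpha
```

The static scene sits entirely in the DC bin, so the filter removes it and only the flicker is scaled. A practical pipeline would run the same two functions on each pyramid level and would usually prefer a smoother temporal filter than an ideal one.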

Why amplifying a color signal reveals a pulse

The cleanest case has no motion in it at all. Blood flow modulates skin reflectance very slightly, so a face pixel's green-channel value oscillates faintly at the heart rate. Band-pass that time series to $0.8$–$4$ Hz, amplify, add back — and the face visibly blushes in time with the pulse, whose waveform you can now read straight off the recovered signal. There is no model of motion anywhere in this: it is literally a color changing over time, isolated and scaled. This is the basis of contactless remote photoplethysmography — recovering vital signs from an ordinary webcam.
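Reading a rate off such a signal takes little more than a spectrum peak. The sketch below runs on a synthetic stand-in for the green-channel trace; the amplitude, drift, noise level, and the helper name `pulse_rate_bpm` are assumptions. In practice the trace would be the green channel averaged over a skin region.

```python
import numpy as np

def pulse_rate_bpm(green, fps, f_lo=0.8, f_hi=4.0):
    """Dominant in-band frequency of a green-channel time series, in beats per minute."""
    green = np.asarray(green, dtype=float)
    spectrum = np.abs(np.fft.rfft(green - green.mean()))  # remove DC before peak-picking
    freqs = np.fft.rfftfreq(green.size, d=1.0 / fps)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return 60.0 * freqs[band][np.argmax(spectrum[band])]

# Synthetic trace (assumed): 20 s at 30 fps, a faint 1.1 Hz pulse,
# a slow illumination drift, and sensor noise.
rng = np.random.default_rng(0)
fps = 30.0
t = np.arange(600) / fps
green = (0.6
         + 0.003 * np.sin(2.0 * np.pi * 1.1 * t)
         + 0.0005 * t
         + rng.normal(0.0, 0.002, t.size))

bpm = pulse_rate_bpm(green, fps)
print("estimated pulse:", bpm, "beats per minute")  # 1.1 Hz corresponds to 66
```

Restricting the search to the 0.8–4 Hz band is what keeps the slow drift, which is far larger than the pulse in raw amplitude, from winning the peak.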

Why amplifying intensity also reveals motion — the first-order argument

The remarkable claim is that the same operation, applied to a moving edge, amplifies the motion — even though we never estimated it. The argument is a single Taylor expansion, and it is worth doing because it shows exactly when the trick works and when it breaks.

Take a 1-D feature translating by a tiny displacement, so the intensity profile $f$ simply slides: $I(x,t)=f\big(x+\delta(t)\big)$. Expand to first order in the small displacement,

$$ I(x,t) \approx f(x) + \delta(t)\,f_x(x), $$

so the temporal variation at a fixed pixel is $\delta(t)\,f_x(x)$, written $\delta(t)\,I_x$ from here on because $I_x\approx f_x$ to first order: the displacement times the spatial gradient. (This is the same product, displacement $\times$ image gradient, that the brightness-constancy equation of Optical flow is built on; here we exploit it rather than solve it.) Provided $\delta(t)$ lies inside the temporal passband, the band-pass removes the static term $f(x)$ and isolates exactly this term, $\mathcal B\{I\}\approx\delta(t)\,I_x$, and adding $\alpha$ times it gives

$$ I'(x,t) \approx f(x) + (1+\alpha)\,\delta(t)\,I_x \approx f\big(x+(1+\alpha)\,\delta(t)\big), $$

where the last step runs the Taylor expansion backward: a signal equal to $f(x)$ plus $(1+\alpha)\delta(t)$ times its own gradient is, to first order, just $f$ shifted by $(1+\alpha)\delta(t)$. So the result looks exactly as if the feature had moved $(1+\alpha)$ times as far. Amplifying the per-pixel intensity signal is amplifying the motion — without ever estimating it (Figure 12.3.2). That is the whole magic trick, and it is the formal content of the L17 exception.

Figure 12.3.2. Intensity amplification is motion amplification, to first order. Left: a 1-D edge $f$ at two instants, displaced by a tiny $\delta(t)$; the temporal change at a fixed pixel is $\delta(t)\,I_x$ (the bump rides the slope of the edge). Adding $\alpha$ times it reconstructs $f$ shifted by $(1+\alpha)\delta(t)$, so the edge appears to have moved farther. Right: where it breaks. For a large $\delta$ or a sharp edge the first-order approximation fails, and the reconstruction over- or under-shoots, producing haloing and clipping rather than a clean shift.
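The first-order claim is easy to check numerically. The sketch below assumes a smooth edge $f(u)=\tanh(u/2)$ (our choice, purely for illustration) and compares $f + (1+\alpha)\,\delta\,f_x$ against the truly shifted edge, once for a tiny displacement and once for a larger one.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)

def f(u):
    """A smooth edge profile (assumed for illustration)."""
    return np.tanh(u / 2.0)

def fx(u):
    """Exact spatial derivative of f."""
    return 0.5 / np.cosh(u / 2.0) ** 2

def magnify_error(delta, alpha):
    """Max gap between I' = f + (1 + alpha) * delta * f_x and the truly shifted edge."""
    linear = f(x) + (1.0 + alpha) * delta * fx(x)
    shifted = f(x + (1.0 + alpha) * delta)
    return np.max(np.abs(linear - shifted))

small = magnify_error(delta=0.01, alpha=10.0)  # amplified shift of 0.11
large = magnify_error(delta=0.20, alpha=10.0)  # amplified shift of 2.2

print("small displacement, max error:", small)
print("large displacement, max error:", large)
```

With the small displacement the two curves agree closely; with the larger one the linear reconstruction overshoots the edge's own range, which is the haloing of the right-hand panel.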

The catches — noise and large motion

None of this is free, and the limits follow directly from the derivation.

The approximation is first-order, so it holds only for small $\delta$ and smooth edges. Large motions or sharp edges violate the linearization, and the reconstruction over- or under-shoots — the visible symptom is haloing around amplified edges (Figure 12.3.2, right).
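How small is small enough depends on the spatial wavelength $\lambda$ of the content. The analysis in Wu et al. (2012) reports the guideline $(1+\alpha)\,\delta(t) < \lambda/8$. The sketch below simply solves that inequality for $\alpha$; the function name and the example displacement are our assumptions.

```python
def alpha_cap(wavelength_px, delta_px):
    """Value of alpha at which (1 + alpha) * delta reaches wavelength / 8 (never below 0)."""
    return max(wavelength_px / (8.0 * delta_px) - 1.0, 0.0)

delta_px = 0.1  # assumed true displacement, in pixels
caps = {lam: alpha_cap(lam, delta_px) for lam in (4.0, 16.0, 64.0)}
for lam, cap in caps.items():
    print(f"wavelength {lam:5.1f} px -> alpha up to about {cap:.1f}")
```

Fine detail (short wavelengths) tolerates only a modest $\alpha$, while coarse structure can be amplified much further before the linearization gives out.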

It also amplifies noise. The temporal band-pass passes in-band sensor noise along with the signal, and multiplying by $\alpha$ blows it up just as much. The remedy exploits the spatial pyramid from step one: spatially blur, or use coarser pyramid levels, for large $\alpha$ — trading spatial detail for signal-to-noise — and bound $\alpha$ as a function of spatial wavelength, amplifying low spatial frequencies (which are robust) more than high ones (which are noisy). Finally, the method commits to a single global band, which is ideal for one dominant frequency (a pulse) but clumsy for broadband motion. These limits are exactly what the next method addresses.
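The trade of spatial detail for signal-to-noise can be seen on synthetic data. In this sketch a 16×16 patch shares one faint 1.2 Hz signal buried in independent per-pixel noise. Averaging the whole patch stands in for moving to a coarse pyramid level, and all names and noise levels are assumptions.

```python
import numpy as np

def bandpass_time(x, fps, f_lo, f_hi):
    """Ideal temporal band-pass along axis 0."""
    freqs = np.fft.rfftfreq(x.shape[0], d=1.0 / fps)
    spec = np.fft.rfft(x, axis=0)
    spec[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spec, n=x.shape[0], axis=0)

rng = np.random.default_rng(0)
fps, n_frames = 30.0, 300
t = np.arange(n_frames) / fps
signal = 0.002 * np.sin(2.0 * np.pi * 1.2 * t)  # the phenomenon
video = 0.5 + signal[:, None, None] + rng.normal(0.0, 0.01, (n_frames, 16, 16))

band_px = bandpass_time(video, fps, 0.8, 4.0)                        # per pixel
band_pooled = bandpass_time(video.mean(axis=(1, 2)), fps, 0.8, 4.0)  # pooled first

# In-band error left after filtering; multiplying by alpha scales both equally.
err_px = np.std(band_px - signal[:, None, None])
err_pooled = np.std(band_pooled - signal)
print("per-pixel in-band noise:", err_px)
print("pooled in-band noise:   ", err_pooled)
```

Pooling 256 independent pixels cuts the in-band noise by roughly a factor of 16, at the price of all spatial detail inside the patch. That is the lever a large $\alpha$ has to pull.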

12.3.3 The phase-based successor — amplify phase, not intensity

The diagnosis of what went wrong with intensity magnification is sharp: it conflates "the edge moved" with "the brightness changed." Both show up as a change in $I(\mathbf x,t)$, so amplifying that change amplifies amplitude noise and clips at edges. The fix, due to Wadhwa, Rubinstein, Durand and Freeman (2013), is to express motion on a cleaner axis: not as an intensity change, but as a shift in local phase.

The vehicle is a complex steerable pyramid (Linear pyramids and wavelets): decompose each frame into oriented, band-pass sub-bands whose coefficients are complex,

$$ S(\mathbf x,t) = A(\mathbf x,t)\,e^{\,i\,\phi(\mathbf x,t)}, $$

with an amplitude $A$ (the "content": how much edge energy is here) and a local phase $\phi$ (where, within the wavelength, the edge sits). By the Fourier shift theorem, a translation of a sub-band that is small compared with the filter's spatial support appears, to a good approximation, purely as a change in its phase $\phi$, while the amplitude $A$ stays put. Phase is the natural coordinate for small motion.
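As a one-line check of that claim, assume (purely for illustration) that a sub-band holds a single complex sinusoid of spatial frequency $k$ and constant amplitude $A$, translating by $\delta(t)$:

$$ S(x,t) = A\,e^{\,i k\,(x+\delta(t))} = A\,e^{\,i\,\phi(x,t)}, \qquad \phi(x,t) = kx + k\,\delta(t). $$

The amplitude does not depend on $\delta$ at all, and the only temporal variation is the phase term $k\,\delta(t)$, the displacement expressed in radians of the sub-band's wavelength. Real sub-bands are only approximately sinusoidal, so for them the statement holds approximately.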

So the operation is: temporally band-pass the phase and amplify it, leaving the amplitude alone,

$$ \phi \;\to\; \phi + \alpha\,\mathcal B\{\phi\}, $$

which shifts each sub-band — physically moves the content — by the amplified amount, without scaling amplitude noise. The wins are three: far fewer artifacts, because a phase change is a clean sub-pixel shift rather than an intensity addition, so much less haloing; support for larger $\alpha$, hence bigger and cleaner magnification; and noise that is translated, not amplified — it rides along with the band instead of blowing up. The cost is the steerable-pyramid machinery and more computation (Figure 12.3.3).
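A 1-D sketch shows the difference from intensity amplification. It rests on several assumptions: a one-sided FFT band-pass stands in for the complex steerable pyramid, the signal is a single periodic sinusoid, and the phase difference between two frames stands in for the temporally band-passed phase $\mathcal B\{\phi\}$. All names are ours.

```python
import numpy as np

def complex_subband(frame, k_lo, k_hi):
    """One-sided spatial band-pass: keep positive-frequency bins k_lo..k_hi,
    so the output is complex, S = A * exp(i * phi)."""
    spec = np.fft.fft(frame)
    mask = np.zeros(frame.size)
    mask[k_lo:k_hi + 1] = 1.0
    return np.fft.ifft(spec * mask)

n, cycles = 256, 8                 # 8 cycles over 256 samples: wavelength 32 px
x = np.arange(n)
k = 2.0 * np.pi * cycles / n

def profile(shift):
    """The sinusoid translated by `shift` pixels (hypothetical test signal)."""
    return np.cos(k * (x + shift))

delta, alpha = 0.5, 10.0           # amplified shift of 5.5 px

S0 = complex_subband(profile(0.0), 4, 16)
S1 = complex_subband(profile(delta), 4, 16)
dphi = np.angle(S1 * np.conj(S0))          # phase change between the two frames
S_mag = S1 * np.exp(1j * alpha * dphi)     # phi -> phi + alpha * dphi; amplitude untouched
phase_based = 2.0 * S_mag.real             # real signal from its one-sided complex half

linear = profile(delta) + alpha * (profile(delta) - profile(0.0))  # intensity amplification
target = profile((1.0 + alpha) * delta)

print("phase-based max error:", np.max(np.abs(phase_based - target)))
print("linear max error:     ", np.max(np.abs(linear - target)))
```

For this idealized signal the phase-based result matches the truly shifted sinusoid to rounding error, while intensity amplification at the same $\alpha$ is far off, because an amplified shift of 5.5 px is well past what the first-order argument tolerates at a 32 px wavelength.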

The unifying view: linear Eulerian magnification amplifies the intensity band-pass; phase-based magnification amplifies the phase band-pass in an oriented complex pyramid. Both are squarely Eulerian — per-location, no flow, the L17 exception intact. Phase-based magnification simply chooses a representation in which "small motion" lives on a cleaner axis.


Figure 12.3.3. Same clip, two magnifications. Linear intensity magnification (left): amplifying $\mathcal B\{I\}$ directly produces visible haloing at edges and blown-up noise, and caps out at a modest $\alpha$. Phase-based magnification (right): in a complex steerable pyramid, amplifying the band-passed local phase $\phi\to\phi+\alpha\,\mathcal B\{\phi\}$ shifts each sub-band cleanly, giving far fewer artifacts, much larger usable $\alpha$, and noise that moves with the band rather than amplifying.

12.3.4 Uses, and the bookend to compression

Once tiny temporal change is a thing you can pull out and scale, a row of applications opens up.

Contactless vital signs. Recover a heart rate and pulse waveform — and even breathing — from a webcam video of a face or a hand. Remote photoplethysmography of this kind feeds telehealth, driver monitoring, and neonatal care, where attaching a sensor is awkward or impossible.

Structural and material vibration. Visualize and measure how bridges, buildings, turbine blades, and machines vibrate; reveal resonant modes and infer material properties from ordinary video. It is a cheap, full-field, non-contact alternative to bolting on accelerometers — every pixel is a virtual sensor.

The visual microphone — sound from a silent video. The most striking demonstration that the signal was there all along, due to Davis, Rubinstein, Wadhwa, Mysore, Durand and Freeman (2014). Sound waves vibrate everyday objects — a chip bag, a plant leaf, a glass of water — by micrometres. Reading those per-sub-band micro-motions out of a high-speed video recovers the audio that caused them: intelligible speech and music from a silent recording of a vibrating object. The same Eulerian per-location reading that magnifies a pulse can, run in reverse, demodulate a soundtrack from a bag of chips.

Other cues. Subtle facial and lip motion for speech and affect; revealing breathing, fluid flow, and slow drifts; an analysis tool for any "is something moving that I can't see?" question.

The bookend to compression. It is worth ending where the part's other temporal extreme sits. Video compression and motion compensation spends all its effort predicting away and discarding the inter-frame difference — that residual is redundant, so throw it out. Video magnification does the exact opposite: it treats that same tiny temporal residual as the signal, and amplifies it. Same quantity, opposite intent. The clean way to remember both: compression hides the small temporal change; magnification reveals it. And both differ from the part's mainstream — flow, tracking, stabilization, interpolation — in the way this chapter has stressed throughout: those follow points and need correspondence, while magnification, the L17 exception, sits at a fixed pixel and filters.

