12.5 Frame interpolation and slow-motion synthesis
A normal camera shoots, say, 30 fps; to play a moment 8× slower and still look smooth you'd need around 240 distinct instants per second that were never recorded. You have two ways to get them. You can capture them — a high-speed camera that genuinely records 240 or 1000 frames a second — which is the honest answer but an expensive and light-hungry one: shorter exposures mean less light per frame, so high-speed footage wants bright scenes or fast lenses. Or you can synthesize them: given two real frames, invent the ones that belong between them. That is frame interpolation, and the whole problem reduces to a question you already met in Morphing — what moved where? — answered here either by optical flow or by a learned model that has seen enough video to guess.
This is the part's recurring move in a new key. To make a frame at time $t$ between two real ones, you first estimate a correspondence (where did each pixel go between frame $A$ and frame $B$?) and then transport the pixels partway along it (warp each frame toward time $t$ and blend). Correspondence first, then transport — the same skeleton as rectification, panoramas, morphing, and stabilization, only now the target of the transport is a moment rather than a viewpoint.
Frame interpolation is the part spine run on the time axis. (1) Estimate a correspondence — the optical flow between two consecutive frames, the dense map of where each pixel went. (2) Transport the pixels along it — warp each frame to the in-between time $t$ and blend. The correspondence is the hard, ill-posed half (broken by occlusion and large motion); the warp-and-resample transport is the same shared engine as everywhere else in the part. The one genuinely new wrinkle is that we transport not to a new viewpoint but to a new instant, and the in-between time can be chosen freely after the fact. (Registered in Big Lessons as L17, first appearing in the part introduction; this is a recurrence.)
High-speed photography is the honest way to own every instant: capture the full temporal set, then pick the moment and the playback speed later. Interpolation is the budget substitute for when you didn't capture the full set — a prior invents the missing instants instead. So this chapter sits exactly at the seam between L14 (capture everything) and L10 (the prior is not optional): an in-between frame is partly reconstructed, where the flow is reliable and there is real evidence to transport, and partly hallucinated, across the occluded regions no frame ever saw. (L14 first appears in the multiple-exposure material; L10 in super-resolution; both recur here.)
12.5.1 Why interpolate: faking slow-motion and up-converting frame rate
Two jobs share one primitive. The first is slow-motion: to slow footage $N\times$ smoothly you need $N-1$ synthetic frames between every real pair, and interpolation manufactures them from ordinary 30- or 60-fps footage — no high-speed camera, no extra light, the slowdown decided in post. The second is frame-rate up-conversion: 24→48, or 30→60→120 fps, for smoother playback, high-refresh and virtual reality (VR) displays, and the television "motion smoothing" that produces the much-debated soap-opera effect (so much added smoothness that filmed drama starts to look like cheap live video). Both are the same operation — synthesize a frame at a chosen fractional time — applied either at many $t$'s for slow-motion or at a few for a higher frame rate (Figure 12.5.1).
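The frame-count arithmetic can be made concrete with a small sketch. The function names `intermediate_times` and `output_frame_count` are illustrative, not from any library; the only assumption is that the synthetic frames are spaced evenly in time between each real pair.

```python
from fractions import Fraction


def intermediate_times(slowdown: int) -> list[Fraction]:
    """Fractional times t in (0, 1) of the synthetic frames between one real pair.

    An N-times slowdown needs N - 1 new frames per gap, at t = k / N.
    """
    if slowdown < 1:
        raise ValueError("slowdown factor must be at least 1")
    return [Fraction(k, slowdown) for k in range(1, slowdown)]


def output_frame_count(real_frames: int, slowdown: int) -> int:
    """Total frames once every gap between consecutive real frames is filled."""
    if real_frames < 1:
        raise ValueError("need at least one real frame")
    gaps = real_frames - 1
    return real_frames + gaps * (slowdown - 1)


# An 8x slowdown asks for seven in-betweens per real pair.
times_8x = intermediate_times(8)
# Doubling the frame rate (for example 30 -> 60 fps) asks for one, at t = 1/2.
times_2x = intermediate_times(2)
```

Slow-motion and up-conversion differ only in the argument: the same synthesis routine would be called once per entry of the returned list.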
It helps to place interpolation among its rivals by asking what each does to the time axis (Figure 12.5.1). Normal capture takes sparse samples with gaps between them. High-speed capture takes dense, true samples — the ground truth interpolation is trying to imitate — at the cost of light and storage. Interpolation synthesizes the in-betweens: cheap, capturable on any camera, but it can be wrong. And a long exposure does something different in kind — it records the integral of the scene over the shutter interval, the motion smeared into one blurred frame rather than resolved into instants (this is the temporal-averaging view that returns in Video editing). Of the four, interpolation is the only one that adds temporal resolution after the shutter has closed — which is also why it is the only one that can be flatly wrong about what happened.
That honesty is the standing caveat for the whole chapter: synthesized frames are guesses. Where motion is large, fast, or self-occluding, they can warp, ghost, or tear. Naming those failure modes precisely is most of the work, because they are exactly what the learned methods at the end of the chapter are built to attack.
12.5.2 Interpolation = morphing between adjacent frames
The cleanest way to understand frame interpolation is to recognize it as something you have already built. Morphing takes two images plus a correspondence between them and produces in-betweens by warping both toward an intermediate shape and cross-dissolving. Frame interpolation is precisely that, with two consecutive video frames standing in for the two faces (Figure 12.5.2). The two endpoints, the warp to the middle, and the blend are all identical.
What changes is where the correspondence comes from. In classic morphing a human supplies it, drawing Beier–Neely feature lines from one image to the other by hand. In frame interpolation the correspondence is the optical flow, estimated automatically between the two frames. So the slogan for the whole chapter is:
Frame interpolation is automatic morphing between adjacent frames — motion estimation replaces the hand-drawn correspondence.
Everything that makes interpolation hard, and everything the learned methods buy, lives in that one substitution: the correspondence is no longer given, it must be recovered, and it is recovered imperfectly.
This also explains, in one stroke, why you cannot skip the warp. A plain temporal cross-dissolve — fade frame $A$ out while frame $B$ fades in, with no warp — does not show an object moving; it shows two ghosts of the object, one at each position, the first dimming as the second brightens (Figure 12.5.2). The eye reads this as a double exposure, not as motion. The warp is exactly what collapses those two ghosts into one object that slides from $A$'s position to $B$'s. Morphing's lesson, restated for video: the cross-dissolve hides the seam in appearance, but only the warp moves the geometry, and motion is geometry.
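A one-dimensional toy makes the two-ghosts effect measurable. This is a sketch under stated assumptions: the "object" is a Gaussian bump, its displacement `d` is handed to us rather than estimated, and `bump` is a helper invented for the illustration.

```python
import numpy as np


def bump(center: float, width: float = 2.0, n: int = 64) -> np.ndarray:
    """A 1-D 'object': a Gaussian blob of brightness centred at `center`."""
    x = np.arange(n, dtype=float)
    return np.exp(-0.5 * ((x - center) / width) ** 2)


x = np.arange(64, dtype=float)
frame_a = bump(16.0)   # object at x = 16 in frame A
frame_b = bump(48.0)   # object at x = 48 in frame B
d = 32.0               # true displacement, assumed known for this toy
t = 0.5                # synthesize the midpoint

# Plain cross-dissolve: no geometry moves, so both copies survive at half strength.
dissolve = (1 - t) * frame_a + t * frame_b

# Warp, then dissolve: sample each frame where the object sat in it.
warp_a = np.interp(x - t * d, x, frame_a)        # A carried forward by t * d
warp_b = np.interp(x + (1 - t) * d, x, frame_b)  # B carried back by (1 - t) * d
warped = (1 - t) * warp_a + t * warp_b
```

The dissolve has two half-height peaks at the endpoints and almost nothing at the midpoint; the warped blend has a single full-height peak at x = 32, which is the object in motion.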
12.5.3 Flow-based interpolation: warp both frames to the midpoint and blend
Take the morphing recipe literally and you get the classical, flow-based interpolator. Start from an assumption — that between two close frames a pixel moves along a roughly straight, constant-velocity path. Then to synthesize the frame at fraction $t\in(0,1)$:
- Estimate flow both ways between the frames: $\mathbf{F}_{0\to1}$ (where each pixel of $I_0$ goes in $I_1$) and $\mathbf{F}_{1\to0}$ (the reverse).
- Pull each source frame to time $t$ along a $t$-scaled slice of its flow — frame $0$ is dragged a fraction $t$ of the way toward frame $1$, and frame $1$ a fraction $1-t$ of the way back toward frame $0$.
- Blend the two warped images, weighted by temporal closeness: a frame near $t=0$ should look mostly like $I_0$, one near $t=1$ mostly like $I_1$.
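Written out, the synthesized frame is a convex blend of two backward-warped sources (Figure 12.5.3):
$$I_t(\mathbf{p}) = (1-t)\, I_0\big(\mathbf{p} - t\,\mathbf{F}_{0\to1}(\mathbf{p})\big) + t\, I_1\big(\mathbf{p} - (1-t)\,\mathbf{F}_{1\to0}(\mathbf{p})\big)$$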
Read it back: for each output pixel $\mathbf{p}$, look back along the flow to where it sat in each real frame, sample the color there, and mix the two by how close $t$ is to each frame. The straight-line assumption is what lets us scale the flow vectors by $t$ — we are sliding each pixel a fraction of the way along the one displacement we estimated.
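The recipe can be sketched in a few lines of NumPy. This is a minimal illustration, not a production interpolator: images are single-channel arrays, flows are `(H, W, 2)` arrays of `(dx, dy)` pixel displacements, the flows are read on the output grid as an approximation, and `bilinear_sample` and `interpolate_frame` are names chosen for this example.

```python
import numpy as np


def bilinear_sample(img: np.ndarray, xs: np.ndarray, ys: np.ndarray) -> np.ndarray:
    """Sample a (H, W) image at float coordinates, clamping to the border."""
    h, w = img.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx = xs - x0
    fy = ys - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bottom = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bottom


def interpolate_frame(i0, i1, f01, f10, t):
    """Backward-warp both frames to time t and blend by temporal closeness.

    Assumes straight-line, constant-velocity motion, and evaluates the
    estimated flows at the output pixel (an approximation).
    """
    h, w = i0.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Look back a fraction t along the 0->1 flow to find the pixel in frame 0.
    warped0 = bilinear_sample(i0, xs - t * f01[..., 0], ys - t * f01[..., 1])
    # Look back a fraction (1 - t) along the 1->0 flow to find it in frame 1.
    warped1 = bilinear_sample(i1, xs - (1 - t) * f10[..., 0], ys - (1 - t) * f10[..., 1])
    return (1 - t) * warped0 + t * warped1
```

For a scene that translates rigidly, the two warped images coincide and the blend is exact; the interesting failures appear only when the flows disagree or the motion is not a straight line.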
Forward vs backward warping
The detail hiding in that formula is which way you warp, and it is the same resampling choice from Warping and resampling, now decisive. Forward warping (splatting) takes each source pixel and pushes it to its destination at time $t$. It is the natural reading of "this pixel moves there," but it leaves a mess: several source pixels can land on the same output location (collisions), while other output locations receive no source pixel at all (holes). A splatted image is a sparse spray of dots, not a clean grid. Backward warping inverts the question: for each output pixel, it asks where that pixel came from and samples the source there. Every output pixel is filled exactly once — a full, hole-free grid — which is why the formula above is written as a backward sample (you evaluate $I_0$ and $I_1$ at shifted locations of the output grid). Backward warping is preferred for the same reason it is preferred everywhere in the part: it covers the output exactly once and reduces the whole thing to a clean resampling at fractional coordinates.
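Collisions and holes are easy to count in one dimension. The sketch below assumes a nearest-pixel splat and a hand-made flow in which the left half of a row moves right while the right half stands still; `forward_splat_counts` is an illustrative name.

```python
import numpy as np


def forward_splat_counts(flow: np.ndarray, t: float) -> np.ndarray:
    """How many source pixels land on each output pixel of a 1-D row.

    Each source pixel x is pushed to round(x + t * flow[x]); pixels pushed
    outside the row are dropped.
    """
    n = len(flow)
    targets = np.rint(np.arange(n) + t * flow).astype(int)
    inside = (targets >= 0) & (targets < n)
    return np.bincount(targets[inside], minlength=n)


# Left half moves 4 px to the right, right half is stationary.
flow = np.array([4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0])
counts = forward_splat_counts(flow, t=0.5)

holes = int(np.sum(counts == 0))       # output pixels nobody landed on
collisions = int(np.sum(counts > 1))   # output pixels hit more than once
```

Here the moving half vacates two pixels (holes) and lands on top of two stationary ones (collisions). A backward warp over the same row would visit each of the eight output pixels exactly once by construction, which is the property the text relies on.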
There is one catch, and it is honest to state it. A backward warp needs the flow defined at time $t$ — the displacement of the output pixel, which lives on the synthetic intermediate frame we do not yet have. We only estimated flow at the real frames. So the intermediate flow must itself be approximated — by splatting the estimated flow forward to time $t$, or, in the learned methods below, by a network that predicts the flow-at-$t$ directly. The gap between "the flow I have" and "the flow the backward warp wants" is one of the seams where interpolation gets hard, and one of the things learning helps close.
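One way to picture the splatting option is the following sketch. It is an assumption-laden simplification: targets are rounded to the nearest pixel, colliding contributions are simply averaged (a real system would need to prefer the foreground), and output pixels that receive nothing are reported in a mask rather than filled. The name `splat_flow_to_t` is invented for the example.

```python
import numpy as np


def splat_flow_to_t(f01: np.ndarray, t: float):
    """Approximate the flow from time t back to frame 0 on the output grid.

    Each frame-0 pixel travels t * f01 to reach time t, so from its landing
    spot the way back to frame 0 is -t * f01. Returns the splatted flow
    (H, W, 2) and a boolean mask of output pixels that received a value.
    """
    h, w, _ = f01.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + t * f01[..., 0]).astype(int)
    ty = np.rint(ys + t * f01[..., 1]).astype(int)
    ok = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)

    acc = np.zeros((h, w, 2))
    cnt = np.zeros((h, w))
    # np.add.at accumulates correctly when several sources share a target.
    np.add.at(acc, (ty[ok], tx[ok]), -t * f01[ok])
    np.add.at(cnt, (ty[ok], tx[ok]), 1.0)

    filled = cnt > 0
    acc[filled] /= cnt[filled][:, None]
    return acc, filled
```

The unfilled pixels in the returned mask are the flow-level version of the holes that forward warping leaves in an image, which is why this step is itself only an approximation.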
Occlusions and holes: the real difficulty
If the warp-and-blend were the whole story, interpolation would be easy. It is not, because of occlusion. When a foreground object moves, it uncovers background behind it — a strip of wall that was hidden in frame $A$ and visible in frame $B$, or the reverse. Such disoccluded pixels exist in only one of the two frames. Blending both frames there is wrong: the frame that doesn't contain the background contributes garbage (the foreground that was covering it), and you get a ghost smeared along the moving boundary. The correct rule is simple to state — a pixel occluded in $B$ should be taken entirely from $A$, and vice versa — and the trick is knowing which is which.
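Encode that knowledge as a pair of visibility masks $V_0,V_1$, one per source frame, each down-weighting the frame in which a pixel is not honestly visible. The occlusion-aware blend then becomes a weighted average rather than a fixed convex one:
$$I_t(\mathbf{p}) = \frac{(1-t)\,V_0(\mathbf{p})\, I_0\big(\mathbf{p} - t\,\mathbf{F}_{0\to1}(\mathbf{p})\big) + t\,V_1(\mathbf{p})\, I_1\big(\mathbf{p} - (1-t)\,\mathbf{F}_{1\to0}(\mathbf{p})\big)}{(1-t)\,V_0(\mathbf{p}) + t\,V_1(\mathbf{p})}$$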
Where a pixel is visible in both frames, $V_0=V_1=1$ and this collapses back to the ordinary temporal blend. Where it is occluded in frame $1$, $V_1\to0$ and the result is taken entirely from $I_0$ — the visible witness wins, and the denominator renormalizes so the brightness stays right.
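The two limiting cases can be checked numerically. The sketch assumes the two frames have already been warped to time $t$ (here `warped0` and `warped1`) and that the masks take values in $[0,1]$; the small `eps` guard and the function name are choices made for this example.

```python
import numpy as np


def occlusion_aware_blend(warped0, warped1, v0, v1, t, eps=1e-8):
    """Visibility-weighted temporal blend of two frames already warped to time t.

    Where both masks are zero the pixel has no witness; the guard on the
    denominator returns 0 there, leaving a hole to be filled some other way.
    """
    numerator = (1 - t) * v0 * warped0 + t * v1 * warped1
    denominator = (1 - t) * v0 + t * v1
    return numerator / np.maximum(denominator, eps)
```

With both masks at one the denominator is exactly one and the ordinary blend comes back; with one mask at zero the surviving weight cancels against the denominator, so brightness is preserved instead of being dimmed by the missing frame.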
Where do the masks come from? The workhorse is a forward–backward flow consistency check. Follow the flow from a pixel forward into the other frame and then back again; if you return to where you started, the two flows agree and the pixel is probably visible in both frames. If you don't return, that is, if
$$\big\lVert \mathbf{F}_{0\to1}(\mathbf{p}) + \mathbf{F}_{1\to0}\big(\mathbf{p} + \mathbf{F}_{0\to1}(\mathbf{p})\big) \big\rVert > \tau$$
for some tolerance $\tau$, then the forward and backward flows disagree, which is the signature of an occlusion, and you should trust the other frame there. This consistency check is the same robustness tool that flags bad matches throughout Optical flow; here it doubles as an occlusion detector. The residual hard case is a pixel visible in neither frame, or a region torn by fast thin motion: a genuine hole with no correspondence to copy from. There, transport has nothing to transport, and you must inpaint: fill from a learned prior of how scenes look. That is the cliff the flow-based method walks up to and the learned methods step over.
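A round-trip test of this kind might be sketched as follows. The assumptions are loud ones: the backward flow is looked up at the nearest pixel rather than interpolated, pixels whose forward flow leaves the image are clamped to the border instead of being flagged, and the resulting mask lives on frame 0's grid, so it would still have to be carried to time $t$ before use in a blend. `fb_consistent` and the default `tau` are illustrative.

```python
import numpy as np


def fb_consistent(f01: np.ndarray, f10: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """True where a frame-0 pixel returns to itself after the forward-backward trip.

    f01 and f10 are (H, W, 2) flows in (dx, dy) pixels. A pixel p is marked
    consistent when |f01(p) + f10(p + f01(p))| <= tau.
    """
    h, w, _ = f01.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where does each frame-0 pixel land in frame 1? (nearest pixel, clamped)
    qx = np.clip(np.rint(xs + f01[..., 0]).astype(int), 0, w - 1)
    qy = np.clip(np.rint(ys + f01[..., 1]).astype(int), 0, h - 1)
    # Forward displacement plus the backward displacement found at the landing spot.
    round_trip = f01 + f10[qy, qx]
    return np.linalg.norm(round_trip, axis=-1) <= tau
```

Running the same function with the roles of the two flows swapped gives the corresponding mask on frame 1's grid.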
It is worth naming the failure modes plainly, because they motivate everything that follows. Large displacements break the small-motion assumption optical flow leans on: the object effectively teleports between frames and the flow estimator locks onto the wrong match. Thin, fast structures — a swung bat, bicycle spokes, a thrown ball — move farther than their own width between frames and tear. And non-linear motion — anything accelerating or curving — violates the straight-line, constant-velocity premise that let us scale the flow by $t$ in the first place. Each is a place where the hand-built recipe quietly produces a wrong guess, and each is exactly what learned synthesis is trained to survive.
12.5.4 Learned synthesis: Super SloMo and FILM
The classical pipeline has a fixed skeleton — estimate motion, warp, blend, handle occlusion — and every joint of it was hand-tuned: the flow estimator, the way intermediate flow is approximated, the occlusion heuristic, the blend. The learned move (the recurring L8) keeps the skeleton but trains it end-to-end on real video, so that each joint is learned from data rather than designed by hand. Crucially, you can supervise the whole thing for free: take a real high-frame-rate clip, hide the middle frames, ask the network to synthesize them, and compare against the frames you hid. The ground truth is the footage itself.
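The free supervision amounts to an indexing scheme over a high-frame-rate clip. The sketch below only enumerates frame indices; the names and the tuple layout are assumptions for illustration and do not describe any particular paper's data loader.

```python
def make_training_samples(num_frames: int, gap: int) -> list[tuple[int, int, int, float]]:
    """Enumerate (input0, input1, target, t) index tuples from one clip.

    The two inputs are `gap` frames apart; every frame strictly between them
    is hidden from the network and used as ground truth at t = k / gap.
    """
    if gap < 2:
        raise ValueError("gap must be at least 2 to leave a frame to hide")
    samples = []
    for start in range(0, num_frames - gap):
        end = start + gap
        for k in range(1, gap):
            samples.append((start, end, start + k, k / gap))
    return samples


# A 9-frame clip with inputs 8 frames apart yields seven supervised targets.
samples = make_training_samples(num_frames=9, gap=8)
```

Each tuple is one training example: feed frames `input0` and `input1` with time `t`, then compare the synthesized frame against the real frame at `target`.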
Super SloMo (Jiang et al. 2018) follows the classical skeleton almost beat for beat, but learns each step. A first network estimates bidirectional flow between $I_0$ and $I_1$. A second network then refines the intermediate flow at each time $t$, closing exactly the "flow-at-$t$" gap we flagged for backward warping, and predicts the visibility maps $V_0,V_1$ for an occlusion-aware warp-and-blend. The payoff over the hand-built version is twofold: the occlusion handling and intermediate-flow refinement are learned rather than heuristic, and because the time $t$ enters as an input, a single bidirectional flow estimate can be reused to synthesize any number of intermediate frames, so variable-length slow-motion, $2\times$ or $8\times$, comes from the same pair of flow fields. The whole network is trained end-to-end with its warped-and-blended output supervised against held-out real frames.
FILM (Frame Interpolation for Large Motion; Reda et al. 2022) attacks the failure mode the classical method handles worst: large motion. When things move far between frames — action shots, or a pair of near-duplicate photos taken seconds apart — small-motion flow simply has no honest correspondence to follow, and the classical pipeline tears or ghosts. FILM's central idea is a scale-agnostic feature pyramid with weights shared across scales, so the same matcher runs from coarse to fine. The trick is that large motion at a coarse scale looks like small motion: downsample far enough and a leap across the frame becomes a one-pixel step the matcher can lock onto, after which finer scales recover detail. Two further choices matter. FILM is a single unified network — no separate flow estimator bolted on — and it is trained with a perceptual (Gram-matrix / style) loss that penalizes the texture statistics of the output, which keeps synthesized regions sharp rather than dissolving into the blur a plain pixel-error loss tends to produce in uncertain areas. The result is notably good at the extreme case of interpolating between far-apart still photos, animating a "near-duplicate" pair into smooth motion (Figure 12.5.4).
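Two of these ideas reduce to short calculations. The sketch below is not FILM's code: `levels_to_reach` just counts halvings of a displacement, and `style_loss` compares Gram matrices of arbitrary `(C, H, W)` feature arrays, with a normalization chosen for the example. In a real system the features would come from a trained network rather than being supplied by hand.

```python
import numpy as np


def levels_to_reach(displacement_px: float, max_step_px: float = 1.0) -> int:
    """Number of 2x downsamplings until a displacement shrinks to max_step_px."""
    level = 0
    while displacement_px > max_step_px:
        displacement_px /= 2.0
        level += 1
    return level


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlations of a (C, H, W) feature map.

    Summing over positions discards where a texture occurs and keeps only
    which channels fire together.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)


def style_loss(features_a: np.ndarray, features_b: np.ndarray) -> float:
    """Mean squared difference between the two Gram matrices."""
    diff = gram_matrix(features_a) - gram_matrix(features_b)
    return float(np.mean(diff ** 2))
```

A 64-pixel leap becomes a one-pixel step after six halvings, which is the sense in which large motion looks small at a coarse scale. The Gram comparison ignores pixel positions entirely, so it rewards output with the right texture statistics even where the exact placement is uncertain, whereas a plain pixel error would favour a blurry average there.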
Why does learning win where explicit flow loses? Because the network learns a prior over how scenes move and how disoccluded regions should look. Explicit flow can only ever copy a color it can find a correspondence for; across a large gap or a true hole there is no such color, and the classical method has nothing to transport. A trained network can hallucinate plausible content there — the L8/L10 pairing once more: the learned prior supplies what the measurement did not. The honest flip side is that "plausible" is not "true," which is why these methods are evaluated against held-out real frames and why their confident guesses can still be confidently wrong.
A note on scope, matching the standing split in this book. The models themselves — the network architectures, the training procedures, the perceptual losses — are taught as models in Machine learning and Generative AI and diffusion. Here we treat them as the interpolation operator and ask only what they add over hand-built flow: robustness to large motion, clean learned occlusion handling, and a prior that fills what correspondence cannot. The forward pointer is that diffusion-based video models push in-betweening further still, generating frames with even less reliance on an explicit warp.