6.6 Video in-betweening
In hand-drawn animation, a senior artist draws the key frames — the extreme poses — and an assistant fills in the in-between frames that carry the motion from one key to the next. That filling-in is called tweening, and "in-betweening" is the same job done photographically: given two frames $I_0$ and $I_1$ — two drawn keys, or two adjacent frames of a captured video — synthesize the frame $I_t$ that belongs between them at time $t\in(0,1)$. Set $t=0.5$ and you have doubled the frame rate; sample $t$ on a fine grid and you have slow motion from ordinary footage.
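The choice of $t$ values can be made concrete with a few lines of Python. This is a minimal sketch; the helper name `inbetween_times` and the integer `factor` parameter are introduced here for illustration and do not come from the text.

```python
def inbetween_times(factor):
    """Times t in (0, 1) at which to synthesize frames between two captured
    frames so that the frame rate is multiplied by `factor`."""
    return [i / factor for i in range(1, factor)]

print(inbetween_times(2))  # one new frame, at the midpoint
print(inbetween_times(4))  # three new frames, evenly spaced
```

A factor of 2 asks for the single midpoint frame. Larger factors ask for several evenly spaced in-betweens from the same pair of captured frames.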
The temptation, exactly as in Morphing, is to reach for the cross-dissolve: set every output pixel to the linear blend of the two frames,

$$I_t(\mathbf{x}) = (1-t)\,I_0(\mathbf{x}) + t\,I_1(\mathbf{x}).$$
And exactly as in morphing, it fails in one specific way. If anything in the scene moved between the frames, that object sits at different pixels in $I_0$ and $I_1$, so averaging their colors at fixed locations leaves a ghost — two faint, translucent copies of the moving thing, sliding through each other, instead of one sharp object caught mid-motion (Figure 6.6.1, top). A cross-dissolve fades; it does not move. The fix is the morph's fix, carried to video: estimate the correspondence between the frames, warp both toward the intermediate time, and then blend — and now the moving object travels to a single mid-position and stays sharp.
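The ghost is easy to reproduce on a toy example. The sketch below uses a hypothetical 1-D scanline in place of an image: a bright two-pixel object on a dark background that shifts four pixels to the right between the two frames. The function name and the data are illustrative assumptions.

```python
def cross_dissolve(frame0, frame1, t):
    """Blend two equally sized scanlines pixel by pixel at fixed locations."""
    return [(1.0 - t) * a + t * b for a, b in zip(frame0, frame1)]

# Hypothetical frames: the object occupies pixels 1-2, then pixels 5-6.
frame0 = [0, 1, 1, 0, 0, 0, 0, 0]
frame1 = [0, 0, 0, 0, 0, 1, 1, 0]

midpoint = cross_dissolve(frame0, frame1, 0.5)
print(midpoint)
```

The result holds two half-brightness copies of the object at its start and end positions and nothing at pixels 3-4, where a correct mid-motion frame would place a single full-brightness object.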
So in-betweening is morphing across $t$ with an estimated correspondence. Everything from Morphing transfers: the separation of shape (where things are) and color (what color they are), the meet-in-the-middle symmetric warp, the align-before-you-blend discipline. What is genuinely new — and what the rest of this short chapter is about — is twofold: the correspondence is now dense and automatic rather than hand-drawn, and real moving scenes force two issues the still-image morph could ignore, occlusion and large motion.
To synthesize a frame between two key frames you cannot average their colors where they sit — that ghosts any motion. You must estimate a correspondence between the frames (optical flow, or a learned matcher), warp both toward the in-between time, and only then blend, just as a face morph warps both faces to a common shape before cross-dissolving. The one new obligation over the still-image morph is visibility: where a moving object has covered or uncovered background, a pixel exists in only one of the two frames, and the blend must defer to the frame that can see it (or in-paint) rather than average. Correspond, transport, blend — with an occlusion-aware blend — is the whole of frame interpolation; only how the correspondence is found changes from classical flow to learned synthesis.
6.6.1 Correspondence, now estimated: flow and learned matching
The morph of Morphing got its correspondence from a person — a few line pairs, a handful of mesh vertices. Between two video frames that is hopeless: there are too many moving pixels and too many frames, and nobody is going to draw lines on a 60-frame slow-motion clip. The correspondence must be estimated, dense, and per-pixel.
The classical source is optical flow (Optical flow): a field $\mathbf{u}(x,y)$ assigning to each pixel of $I_0$ the displacement that carries it to its match in $I_1$. Given that flow, in-betweening is mechanical. To build the frame at time $t$, scale the flow and warp: a pixel that moves by $\mathbf{u}$ over the full interval has moved by $t\,\mathbf{u}$ at time $t$. Move the content of $I_0$ forward along the flow by $t\,\mathbf{u}$ and the content of $I_1$ back along it by $(1-t)\,\mathbf{u}$ (that is, a displacement of $-(1-t)\,\mathbf{u}$) so both land at the intermediate position, then blend. This is precisely the morph's "interpolate the geometry to an intermediate shape, warp both to it, cross-dissolve," with the flow field playing the role the line pairs played before. As always, each warp is implemented as a backward (inverse) mapping that resamples its source (reconstruct + prefilter, the engine of Warping and resampling), so every output pixel is filled.
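The scale-warp-blend recipe can be sketched on a 1-D scanline. This is an illustration under stated assumptions, not a full implementation: the flow is taken to be already available at each output pixel (real flow lives on the grid of $I_0$ and would need to be carried to the intermediate grid first), resampling is plain linear interpolation with clamped edges and no prefilter, and occlusion is ignored. All function names are hypothetical.

```python
import math

def sample_linear(scanline, x):
    """Linearly interpolate a scanline at real-valued position x (edges clamped)."""
    x = min(max(x, 0.0), len(scanline) - 1.0)
    i = min(int(math.floor(x)), len(scanline) - 2)
    frac = x - i
    return (1.0 - frac) * scanline[i] + frac * scanline[i + 1]

def interpolate_frame(frame0, frame1, flow, t):
    """Backward-warp both frames to time t along the flow, then blend.

    flow[x] is the frame0-to-frame1 displacement assumed to hold at
    output pixel x (an approximation, see the lead-in).
    """
    out = []
    for x in range(len(frame0)):
        from0 = sample_linear(frame0, x - t * flow[x])          # position at time 0
        from1 = sample_linear(frame1, x + (1.0 - t) * flow[x])  # position at time 1
        out.append((1.0 - t) * from0 + t * from1)
    return out

# Hypothetical data: a 2-pixel object moving 4 pixels right over a flat background.
frame0 = [0, 1, 1, 0, 0, 0, 0, 0]
frame1 = [0, 0, 0, 0, 0, 1, 1, 0]
flow = [4.0] * len(frame0)

tween = interpolate_frame(frame0, frame1, flow, 0.5)
print(tween)
```

With this uniform flow the object lands at pixels 3-4 at full brightness, the single sharp mid-position that a fixed-location blend cannot produce. A uniform flow over a flat background hides the occlusion problem entirely, which is why this sketch needs no visibility handling.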
The modern source is a learned matcher: a network trained on real video to predict, directly from $I_0$ and $I_1$, the flow and the per-pixel visibility needed to combine them — or, more aggressively, to synthesize the in-between pixels outright without ever exposing an explicit flow. The shift from flow-based to learned in-betweening is less a change of recipe than a change of who estimates the correspondence and how the blend is decided; we return to it at the end of the chapter.
6.6.2 Occlusion and disocclusion: why a tween is more than a warped cross-dissolve
Here is the issue a still-image face morph never had to face. In a real moving scene, an object in front covers background as it advances and uncovers it as it retreats. Consider a foreground bar sliding to the right between the two key frames (Figure 6.6.2). Background just to the left of the bar was hidden in $I_0$ and is revealed in $I_1$ — it is visible in $I_1$ only (a disocclusion). Background just to the right of the bar is still visible in $I_0$ but will be covered by the bar in $I_1$ — it is visible in $I_0$ only (an occlusion).
At the in-between time, those one-sided regions are real background that must appear, but a symmetric warp-and-average has no valid pixel for them in one of the two frames — averaging a real pixel with a foreground-occluded one smears the foreground's color across the background. The cure is occlusion reasoning: build a per-pixel visibility map and, in each one-sided region, defer to the frame that can see the pixel — take the disoccluded strip from $I_1$, the about-to-be-occluded strip from $I_0$ — and only blend where the pixel is visible in both. Where a region is visible in neither (it can happen — a thin sliver swept by fast motion), there is nothing to copy and the missing content must be in-painted (hallucinated from the surrounding texture). This visibility logic — copy where one-sided, blend where two-sided, in-paint where neither — is exactly what separates a true tween from a ghosting cross-dissolve, and it is the single hardest part of frame interpolation.
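The copy, blend, or in-paint rule can be written down directly once the warped frames and their visibility masks are in hand. The sketch below assumes both are supplied by an earlier stage; how the masks are estimated is outside its scope. The function name, the mask representation (one boolean per pixel), and the sample values are illustrative assumptions. The data mimic a bar (value 99) sliding right over a textured background, seen at the midpoint.

```python
def occlusion_aware_blend(warped0, warped1, visible0, visible1, t):
    """Combine two frames already warped to time t, respecting visibility.

    Returns (pixels, holes). holes lists the indices visible in neither
    frame; those pixels are left as None for a later in-painting step.
    """
    pixels, holes = [], []
    for x, (a, b, v0, v1) in enumerate(zip(warped0, warped1, visible0, visible1)):
        if v0 and v1:
            pixels.append((1.0 - t) * a + t * b)  # two-sided: blend
        elif v0:
            pixels.append(float(a))               # only frame 0 sees it: copy
        elif v1:
            pixels.append(float(b))               # only frame 1 sees it: copy
        else:
            pixels.append(None)                   # seen by neither: in-paint later
            holes.append(x)
    return pixels, holes

T, F = True, False
# Hypothetical warped frames at t = 0.5. Pixel 2 is background still hidden
# behind the bar in frame 0; pixel 5 is background already covered in frame 1.
warped0 = [10, 20, 99, 99, 99, 60, 70, 80]
warped1 = [10, 20, 30, 99, 99, 99, 70, 80]
visible0 = [T, T, F, T, T, T, T, T]
visible1 = [T, T, T, T, T, F, T, T]

pixels, holes = occlusion_aware_blend(warped0, warped1, visible0, visible1, 0.5)
naive = [0.5 * a + 0.5 * b for a, b in zip(warped0, warped1)]
print(pixels)
print(naive)
```

The occlusion-aware result recovers the background values 30 and 60 on either side of the bar, while the naive average mixes the bar's value into both pixels.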
6.6.3 Large motion, and the climb from flow to learning
The second new difficulty is large motion. Optical flow is reliable for small, smooth displacements; when something moves far between frames (a fast object, a low frame rate, a wide camera pan), flow estimation breaks down, the warp lands pixels in the wrong place, and the in-between tears or doubles. Classical flow-based interpolation is therefore strongest on gentle motion and weakest on exactly the cases (sports, action, big disocclusions) where slow motion is most wanted.
This is what motivated the move to learned in-betweening (Figure 6.6.3). The progression runs in three stages. First, classical flow-based: estimate flow, warp both frames to $t$, blend with explicit occlusion masks — transparent, but fragile under large motion, thin structures, and disocclusion. Second, learned synthesis — FILM ([@reda-2022], Frame Interpolation for Large Motion) and RIFE ([@huang-2022], a real-time intermediate-flow network) — where a network trained end-to-end on real video predicts the flow and the visibility, or synthesizes the in-between pixels directly; this is far more robust to large motion, with RIFE fast enough for real time and FILM tuned for the very large displacements that defeat classical flow. The slow-motion line begins here too: Super SloMo ([@jiang-2018]) learns to predict intermediate optical flow and per-pixel visibility maps, then warps and blends to produce arbitrary-time in-betweens — many frames between two captured ones — which is in-betweening turned into a slow-motion engine. Third, video-diffusion in-betweening: a generative model conditioned on both key frames denoises the missing frames, hallucinating plausible content in disoccluded regions rather than copying it — powerful where there is genuinely no pixel to warp, but heavy and able to drift from what truly happened.
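One way to write the visibility-weighted warp-and-blend that flow-plus-visibility methods of this kind perform is sketched below. The notation is introduced here for illustration and is not taken from the text: $W_0$ and $W_1$ denote $I_0$ and $I_1$ after warping to time $t$, and $V_0, V_1 \in [0,1]$ are per-pixel visibility weights.

$$I_t(\mathbf{x}) = \frac{(1-t)\,V_0(\mathbf{x})\,W_0(\mathbf{x}) + t\,V_1(\mathbf{x})\,W_1(\mathbf{x})}{(1-t)\,V_0(\mathbf{x}) + t\,V_1(\mathbf{x})}$$

Where both weights are 1 this reduces to the ordinary cross-dissolve of the warped frames. Where one weight is 0 the output copies the other frame. Where both are 0 the denominator vanishes and the pixel has to be in-painted.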
The throughline across all three is unchanged. Every method still corresponds, transports, and blends; what climbs is how robustly the correspondence is found and how the occlusion-aware blend is decided — from a hand-tuned flow-plus-mask, to a network that predicts both, to a generative model that invents what neither frame can see.
6.6.4 Recap and significance
In-betweening closes the morphing arc by carrying it into time. The recipe is the morph's — correspond, transport, blend — but with the correspondence estimated (dense flow, or a learned matcher) rather than drawn, and with two obligations the still-image morph never had: occlusion reasoning (copy from the frame that can see each pixel, blend only where both can, in-paint where neither) and robustness to large motion (which pushed the field from classical flow to learned synthesis). The reusable idea is that a doubled frame rate, slow motion, and a face morph are the same operation at different time scales — a fact worth carrying because the deeper, video-centric development of it, including multi-frame slow-motion and the FILM/RIFE/Super SloMo systems in full, is the subject of Frame interpolation and slow-motion synthesis. Morphing was the still-image showpiece; in-betweening is what happens when the two images are simply consecutive frames, and it is where the warp-and-blend machinery of this whole part meets the synthesis-of-motion problems of the VIDEO part.
Big lessons of this chapter
The recurring principles from this chapter, gathered for review.
To synthesize a frame between two key frames you cannot average their colors where they sit — that ghosts any motion. You must estimate a correspondence between the frames (optical flow, or a learned matcher), warp both toward the in-between time, and only then blend, just as a face morph warps both faces to a common shape before cross-dissolving. The one new obligation over the still-image morph is visibility: where a moving object has covered or uncovered background, a pixel exists in only one of the two frames, and the blend must defer to the frame that can see it (or in-paint) rather than average. Correspond, transport, blend — with an occlusion-aware blend — is the whole of frame interpolation; only how the correspondence is found changes from classical flow to learned synthesis.