
6.6 Video in-betweening

In hand-drawn animation, a senior artist draws the key frames — the extreme poses — and an assistant fills in the in-between frames that carry the motion from one key to the next. That filling-in is called tweening, and "in-betweening" is the same job done photographically: given two frames $I_0$ and $I_1$ — two drawn keys, or two adjacent frames of a captured video — synthesize the frame $I_t$ that belongs between them at time $t\in(0,1)$. Set $t=0.5$ and you have doubled the frame rate; sample $t$ on a fine grid and you have slow motion from ordinary footage.
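As a small illustration of the arithmetic, the sketch below lists the in-between times for an $n$-fold frame-rate increase and counts the resulting frames. The helper names `tween_times` and `output_frame_count` are hypothetical, chosen only for this example.

```python
def tween_times(n):
    """In-between times for an n-fold frame-rate increase: t = k/n, k = 1..n-1."""
    return [k / n for k in range(1, n)]

def output_frame_count(num_frames, n):
    """Frames after inserting n-1 in-betweens into each of the num_frames-1 gaps."""
    return (num_frames - 1) * n + 1

print(tween_times(2))             # doubling the frame rate needs only t = 0.5
print(tween_times(4))             # 4x slow motion samples three times per gap
print(output_frame_count(30, 4))  # frames in the 4x version of a 30-frame clip
```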

The temptation, exactly as in Morphing, is to reach for the cross-dissolve: set every output pixel to the linear blend of the two frames,

$$ I_t[y,x] = (1-t)\,I_0[y,x] + t\,I_1[y,x]. $$

And exactly as in morphing, it fails in one specific way. If anything in the scene moved between the frames, that object sits at different pixels in $I_0$ and $I_1$, so averaging their colors at fixed locations leaves a ghost — two faint, translucent copies of the moving thing, sliding through each other, instead of one sharp object caught mid-motion (Figure 6.6.1, top). A cross-dissolve fades; it does not move. The fix is the morph's fix, carried to video: estimate the correspondence between the frames, warp both toward the intermediate time, and then blend — and now the moving object travels to a single mid-position and stays sharp.
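The ghost is easy to reproduce on a toy example. The sketch below uses a made-up one-row "frame" (a bright three-pixel object on a dark background, shifted six pixels between $I_0$ and $I_1$); the array sizes and the helper name `cross_dissolve` are assumptions for illustration only.

```python
import numpy as np

def cross_dissolve(i0, i1, t):
    """Linear blend of two frames at fixed pixel locations."""
    return (1.0 - t) * i0 + t * i1

# Hypothetical one-row frames: the object sits at columns 2..4 in I0
# and at columns 8..10 in I1, so its true midpoint is columns 5..7.
i0 = np.zeros((1, 14))
i0[0, 2:5] = 1.0
i1 = np.zeros((1, 14))
i1[0, 8:11] = 1.0

mid = cross_dissolve(i0, i1, 0.5)
print(mid[0])
```

The printed row holds two half-intensity copies, one at each key-frame position, and zeros at columns 5..7 where the object should actually be at $t=0.5$.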

So in-betweening is morphing across $t$ with an estimated correspondence. Everything from Morphing transfers: the separation of shape (where things are) and color (what color they are), the meet-in-the-middle symmetric warp, the align-before-you-blend discipline. What is genuinely new — and what the rest of this short chapter is about — is twofold: the correspondence is now dense and automatic rather than hand-drawn, and real moving scenes force two issues the still-image morph could ignore, occlusion and large motion.

💡 Big lesson (L17, recurrence) — in-betweening is morphing across time

To synthesize a frame between two key frames you cannot average their colors where they sit — that ghosts any motion. You must estimate a correspondence between the frames (optical flow, or a learned matcher), warp both toward the in-between time, and only then blend, just as a face morph warps both faces to a common shape before cross-dissolving. The one new obligation over the still-image morph is visibility: where a moving object has covered or uncovered background, a pixel exists in only one of the two frames, and the blend must defer to the frame that can see it (or in-paint) rather than average. Correspond, transport, blend — with an occlusion-aware blend — is the whole of frame interpolation; only how the correspondence is found changes from classical flow to learned synthesis.

**Figure 6.6.1.** In-betweening is the morph's warp-then-blend, taken to video. A textured disk translates across a textured background between key frames $I_0$ (disk left) and $I_1$ (disk right). **Top (naive cross-dissolve)** at $t=0.5$, $\tfrac12 I_0+\tfrac12 I_1$: because the disk moved, averaging colors at fixed pixels leaves **two faint copies, a ghost**, not a moving object. **Bottom (warp to the midpoint):** estimate the correspondence (here the disk's translation), warp $I_0$ **forward** and $I_1$ **backward** so the disk in both lands at its $t=0.5$ position, then blend the now-aligned frames → **one sharp in-between disk**. Both frames warp only partway and meet in the middle, exactly the symmetric morph of the previous chapter.

6.6.1 Correspondence, now estimated: flow and learned matching

The morph of Morphing got its correspondence from a person — a few line pairs, a handful of mesh vertices. Between two video frames that is hopeless: there are too many moving pixels and too many frames, and nobody is going to draw lines on a 60-frame slow-motion clip. The correspondence must be estimated, dense, and per-pixel.

The classical source is optical flow (Optical flow): a field $\mathbf{u}(x,y)$ assigning to each pixel of $I_0$ the displacement that carries it to its match in $I_1$. Given that flow, in-betweening is mechanical. To build the frame at time $t$, scale the flow and warp: a pixel that moves by $\mathbf{u}$ over the full interval has moved by $t\,\mathbf{u}$ at time $t$. Displace the content of $I_0$ by $t\,\mathbf{u}$ (forward in time) and the content of $I_1$ by $-(1-t)\,\mathbf{u}$ (backward in time) so both land at the intermediate position, then blend. This is precisely the morph's "interpolate the geometry to an intermediate shape, warp both to it, cross-dissolve," with the flow field playing the role the line pairs played before. "Forward" and "backward" here name the direction in time, not the implementation: as always, each warp is computed by backward mapping, in which every output pixel looks up its source location and resamples (reconstruct + prefilter, the engine of Warping and resampling), so every output pixel is filled.
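The recipe can be sketched in a few lines of NumPy. This is a minimal illustration, not a production interpolator: the function names `sample_bilinear` and `tween` are hypothetical, the flow is assumed to be given (here a constant translation), and the flow is read at the output pixel, which is only an approximation unless the flow is locally constant. Occlusion is ignored entirely.

```python
import numpy as np

def sample_bilinear(img, xs, ys):
    """Sample a 2-D image at float coordinates, clamping at the border."""
    h, w = img.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx = xs - x0
    fy = ys - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bot = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bot

def tween(i0, i1, flow, t):
    """Warp both frames to time t by backward mapping, then cross-dissolve.

    flow[..., 0] and flow[..., 1] are the x and y displacements from I0 to I1.
    """
    h, w = i0.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    ux, uy = flow[..., 0], flow[..., 1]
    # An output pixel p shows content that was at p - t*u in I0 ...
    w0 = sample_bilinear(i0, xs - t * ux, ys - t * uy)
    # ... and that will be at p + (1 - t)*u in I1.
    w1 = sample_bilinear(i1, xs + (1 - t) * ux, ys + (1 - t) * uy)
    return (1 - t) * w0 + t * w1

# Hypothetical test scene: a 3x3 bright block moves 6 pixels to the right.
i0 = np.zeros((5, 16))
i0[1:4, 2:5] = 1.0
i1 = np.zeros((5, 16))
i1[1:4, 8:11] = 1.0
flow = np.zeros((5, 16, 2))
flow[..., 0] = 6.0  # assumed known; a real system would estimate it

mid = tween(i0, i1, flow, 0.5)
print(mid[2])
```

With the correspondence supplied, the block appears once, at full intensity, at columns 5..7, which is its true position at $t=0.5$.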

The modern source is a learned matcher: a network trained on real video to predict, directly from $I_0$ and $I_1$, the flow and the per-pixel visibility needed to combine them — or, more aggressively, to synthesize the in-between pixels outright without ever exposing an explicit flow. The shift from flow-based to learned in-betweening is less a change of recipe than a change of who estimates the correspondence and how the blend is decided; we return to it at the end of the chapter.

6.6.2 Occlusion and disocclusion: why a tween is more than a warped cross-dissolve

Here is the issue a still-image face morph never had to face. In a real moving scene, an object in front covers background as it advances and uncovers it as it retreats. Consider a foreground bar sliding to the right between the two key frames (Figure 6.6.2). Background just to the left of the bar was hidden in $I_0$ and is revealed in $I_1$ — it is visible in $I_1$ only (a disocclusion). Background just to the right of the bar is still visible in $I_0$ but will be covered by the bar in $I_1$ — it is visible in $I_0$ only (an occlusion).

At the in-between time, those one-sided regions are real background that must appear, but a symmetric warp-and-average has no valid pixel for them in one of the two frames — averaging a real pixel with a foreground-occluded one smears the foreground's color across the background. The cure is occlusion reasoning: build a per-pixel visibility map and, in each one-sided region, defer to the frame that can see the pixel — take the disoccluded strip from $I_1$, the about-to-be-occluded strip from $I_0$ — and only blend where the pixel is visible in both. Where a region is visible in neither (it can happen — a thin sliver swept by fast motion), there is nothing to copy and the missing content must be in-painted (hallucinated from the surrounding texture). This visibility logic — copy where one-sided, blend where two-sided, in-paint where neither — is exactly what separates a true tween from a ghosting cross-dissolve, and it is the single hardest part of frame interpolation.
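The copy / blend / in-paint rule can be written down directly once the two frames have been warped to time $t$ and a visibility mask is available for each. The sketch below assumes those inputs are given; the function name `occlusion_aware_blend`, the mask names, and the toy pixel values are hypothetical. How the masks are estimated, and how holes are in-painted, is left open.

```python
import numpy as np

def occlusion_aware_blend(w0, w1, vis0, vis1, t):
    """Blend two frames already warped to time t, respecting visibility.

    vis0 / vis1 are boolean masks: True where the warped I0 / I1 has a valid
    pixel. Returns the blended frame and a mask of pixels left to in-paint.
    """
    both = vis0 & vis1
    only0 = vis0 & ~vis1
    only1 = ~vis0 & vis1
    hole = ~vis0 & ~vis1          # visible in neither frame

    out = np.zeros_like(w0, dtype=float)
    out[both] = (1 - t) * w0[both] + t * w1[both]  # two-sided: blend
    out[only0] = w0[only0]                         # one-sided: copy from I0
    out[only1] = w1[only1]                         # one-sided: copy from I1
    return out, hole

# Toy row: background is 10; the value 90 is foreground color sitting where
# that frame cannot actually see the background.
w0 = np.array([10.0, 90.0, 10.0, 10.0, 10.0, 0.0])
w1 = np.array([10.0, 10.0, 10.0, 90.0, 10.0, 0.0])
vis0 = np.array([True, False, True, True, True, False])
vis1 = np.array([True, True, True, False, True, False])

out, hole = occlusion_aware_blend(w0, w1, vis0, vis1, 0.5)
naive = 0.5 * w0 + 0.5 * w1
print(out, hole, naive)
```

In this toy row the naive average smears foreground color into pixels 1 and 3, while the visibility-aware blend recovers the background there and reports the last pixel as a hole for in-painting.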

**Figure 6.6.2.** Occlusion is what a tween must handle that a still-image morph need not. A foreground bar moves right between key frames $I_0$ and $I_1$ over a textured background. At the in-between $t=0.5$ the bar sits in the middle, and the swept band splits into one-sided regions: the strip the bar **uncovered** is **visible in $I_1$ only** (disoccluded; take it from $I_1$), the strip the bar is **about to cover** is **visible in $I_0$ only** (take it from $I_0$); everywhere else the background is visible in **both** frames and is **safe to blend**. A symmetric warp cannot simply average in the one-sided regions, because one frame has no valid pixel there, so it must reason about **visibility** and defer to the frame that can see each pixel.

6.6.3 Large motion, and the climb from flow to learning

The second new difficulty is large motion. Optical flow is reliable for small, smooth displacements; when something moves far between frames — a fast object, a low frame rate, a wide camera pan — flow estimation breaks down, the warp lands pixels in the wrong place, and the in-between tears or doubles. Classical flow-based interpolation is therefore strongest on gentle motion and weakest on exactly the cases (sports, action, big disocclusions) where slow-motion is most wanted.

This is what motivated the move to learned in-betweening (Figure 6.6.3). The progression runs in three stages. First, classical flow-based: estimate flow, warp both frames to $t$, blend with explicit occlusion masks — transparent, but fragile under large motion, thin structures, and disocclusion. Second, learned synthesis — FILM ([@reda-2022], Frame Interpolation for Large Motion) and RIFE ([@huang-2022], a real-time intermediate-flow network) — where a network trained end-to-end on real video predicts the flow and the visibility, or synthesizes the in-between pixels directly; this is far more robust to large motion, with RIFE fast enough for real time and FILM tuned for the very large displacements that defeat classical flow. The slow-motion line begins here too: Super SloMo ([@jiang-2018]) learns to predict intermediate optical flow and per-pixel visibility maps, then warps and blends to produce arbitrary-time in-betweens — many frames between two captured ones — which is in-betweening turned into a slow-motion engine. Third, video-diffusion in-betweening: a generative model conditioned on both key frames denoises the missing frames, hallucinating plausible content in disoccluded regions rather than copying it — powerful where there is genuinely no pixel to warp, but heavy and able to drift from what truly happened.

The throughline across all three is unchanged. Every method still corresponds, transports, and blends; what climbs is how robustly the correspondence is found and how the occlusion-aware blend is decided — from a hand-tuned flow-plus-mask, to a network that predicts both, to a generative model that invents what neither frame can see.

**Figure 6.6.3.** The progression of in-betweening methods, all solving the same task: synthesize the frame between two key frames. **Classic flow-based:** estimate optical flow, warp both to $t$ and blend with explicit occlusion masks; transparent but fragile under large motion, thin structure, and disocclusion. **Learned synthesis (FILM, RIFE):** a network trained end-to-end predicts flow plus visibility, or synthesizes pixels directly; robust to large motion, real-time (RIFE) or large-displacement-sharp (FILM). **Video diffusion:** a generative model conditioned on both key frames denoises the missing frames, hallucinating disoccluded detail; powerful but heavy and able to drift from the true motion. Left to right: more learned, larger motion and occlusion handled, less tied to a literal warp.

6.6.4 Recap and significance

In-betweening closes the morphing arc by carrying it into time. The recipe is the morph's — correspond, transport, blend — but with the correspondence estimated (dense flow, or a learned matcher) rather than drawn, and with two obligations the still-image morph never had: occlusion reasoning (copy from the frame that can see each pixel, blend only where both can, in-paint where neither) and robustness to large motion (which pushed the field from classical flow to learned synthesis). The reusable idea is that a doubled frame rate, slow motion, and a face morph are the same operation at different time scales — a fact worth carrying because the deeper, video-centric development of it, including multi-frame slow-motion and the FILM/RIFE/Super SloMo systems in full, is the subject of Frame interpolation and slow-motion synthesis. Morphing was the still-image showpiece; in-betweening is what happens when the two images are simply consecutive frames, and it is where the warp-and-blend machinery of this whole part meets the synthesis-of-motion problems of the VIDEO part.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (L17, recurrence) — in-betweening is morphing across time

| symbol | meaning (this chapter) | note |
|---|---|---|
| $I_0,\ I_1$ | the two key frames being interpolated (the endpoints, at $t=0$ and $t=1$) | from Morphing; here adjacent video frames |
| $t$ | the in-between time $t\in[0,1]$; $t=0$ gives $I_0$, $t=1$ gives $I_1$ | from Morphing; here literal time |
| $I_t$ | the synthesized in-between frame at time $t$ (warp both toward $t$, then occlusion-aware blend) | new framing (this chapter) |
| $\mathbf{u}(x,y)$ | the optical-flow field carrying a pixel of $I_0$ to its match in $I_1$; the frame at $t$ warps by $t\,\mathbf{u}$ and $-(1-t)\,\mathbf{u}$ | from Optical flow; the estimated correspondence |
| $V(x,y)$ | a per-pixel visibility map: is this pixel seen in $I_0$, in $I_1$, in both, or neither; used to copy / blend / in-paint | new (this chapter) |
