10.8 Continuous panoramas (e.g. on cell phones)

A cell-phone "sweep" panorama has the same goal as everything else in this part — one wide image assembled from many views — but the capture model is inverted. Instead of shooting $N$ discrete frames and stitching them offline, the phone captures a continuous video while you pan and builds the mosaic as the frames arrive. You wave the camera in one smooth motion; an on-screen ribbon nudges you to keep level; and a clean wide photograph appears the instant you stop. That streaming constraint, together with the fact that you are hand-holding and therefore continuously translating, reshapes every design choice the discrete pipeline made on a tripod.

The throughline is a single trade. A heavy offline optimization — register all pairs, run bundle adjustment, choose seams, blend — becomes affordable in real time only because the phone never stitches more than the newest sliver of the newest frame. Each arriving frame is registered against the growing mosaic, a thin vertical strip is taken from its centre, and that strip is feathered onto the canvas. Everything that makes a hand-waved video look like one photograph follows from that discipline.

10.8.1 Incremental registration of a video stream

Frames arrive at video rate, and the phone registers each new frame against the growing mosaic: frame-to-mosaic, not all-pairs. The registration is a small incremental homography: if $H_{t-1}$ maps the previous frame into the mosaic, then

$$ H_t = H_{t-1}\,\Delta H_t, $$

where $\Delta H_t$ is the frame-to-frame motion, mapping frame $t$ into the coordinates of frame $t-1$. Because the camera moved only a little since the last frame, $\Delta H_t$ is close to the identity and mostly rotation, and, crucially, it can be predicted cheaply from the phone's gyroscope before any pixels are touched. The inertial sensor reports the inter-frame rotation directly; the visual tracker then only has to refine a prediction that is already close, rather than search from scratch. This is what buys real time: never an all-pairs solve, just one fast, well-initialized update per frame.
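A minimal sketch of the gyroscope prediction step, under stated assumptions: the camera undergoes pure rotation between frames, so the induced homography is $K R K^{-1}$; the intrinsics `K`, the angular rate `omega`, and every function name here are hypothetical illustrations, not a phone's actual API. The rotation is taken to map frame-$t$ camera coordinates into frame-$(t-1)$ coordinates, matching the direction of $\Delta H_t$ above.

```python
import numpy as np

def rotation_from_gyro(omega, dt):
    """Rotation matrix for angular velocity `omega` (rad/s, 3-vector) held for `dt` seconds."""
    rotvec = np.asarray(omega, dtype=float) * dt
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    kx, ky, kz = rotvec / theta
    skew = np.array([[0.0, -kz, ky],
                     [kz, 0.0, -kx],
                     [-ky, kx, 0.0]])
    # Rodrigues' formula for the rotation about unit axis k by angle theta.
    return np.eye(3) + np.sin(theta) * skew + (1.0 - np.cos(theta)) * (skew @ skew)

def predict_delta_h(K, R_delta):
    """Homography induced by a pure rotation, scaled so that dH[2, 2] == 1."""
    dH = K @ R_delta @ np.linalg.inv(K)
    return dH / dH[2, 2]

def chain(H_prev, dH):
    """One incremental update: H_t = H_{t-1} @ dH_t (renormalised)."""
    H = H_prev @ dH
    return H / H[2, 2]

def apply_h(H, x, y):
    """Map pixel (x, y) through homography H."""
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]

# Hypothetical camera: 1000 px focal length, principal point at (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
omega = (0.0, 0.5, 0.0)   # assumed steady yaw of 0.5 rad/s
dt = 1.0 / 30.0           # assumed 30 frames per second

dH = predict_delta_h(K, rotation_from_gyro(omega, dt))
H = np.eye(3)             # the first frame defines the mosaic coordinates
for _ in range(30):       # one second of panning
    H = chain(H, dH)
```

In a real pipeline the visual tracker would refine each `dH` before it is chained; this sketch keeps only the inertial prediction to show how cheap the per-frame update is.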

The on-screen level/arrow ribbon is part of the algorithm, not just the user interface. The incremental registration assumes a smooth, roughly constant sweep; keeping the motion regular keeps each $\Delta H_t$ small and well-conditioned, and keeps consecutive strips overlapping by just the right amount to register and feather. A jerky pan breaks the small-motion assumption and the mosaic tears.
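To see why pan speed matters, here is a rough back-of-the-envelope sketch. It assumes a pinhole camera, a steady yaw, and that only the columns near the optical axis are of interest; the function names and the numbers in the usage lines are illustrative assumptions, not values from any particular phone.

```python
import math

def per_frame_shift_px(focal_px, pan_rate_rad_s, fps):
    """Approximate horizontal image shift between consecutive frames, near the image centre."""
    return focal_px * math.tan(pan_rate_rad_s / fps)

def min_strip_width_px(focal_px, pan_rate_rad_s, fps, overlap_px):
    """Narrowest strip that still covers the per-frame shift plus a feathering overlap."""
    return math.ceil(per_frame_shift_px(focal_px, pan_rate_rad_s, fps)) + overlap_px

# Assumed values: 1500 px focal length, 30 degrees per second pan, 30 fps, 8 px overlap.
shift = per_frame_shift_px(1500.0, math.radians(30.0), 30.0)
width = min_strip_width_px(1500.0, math.radians(30.0), 30.0, 8)
```

Doubling the pan rate roughly doubles the shift, so a fixed strip width that overlaps comfortably at a slow sweep can leave gaps at a fast one, which is one reason the ribbon asks for a steady pace.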

Errors still accumulate along a long sweep — the same drift we met for global panoramas in Bells and whistles, where small per-pair rotation errors compound into a visible mismatch when the loop closes. A full bundle adjustment over the whole sweep is too heavy for real time, so phones use the lightweight substitutes: the gyroscope prior to anchor absolute orientation, occasional re-registration against earlier mosaic content, and an incremental / windowed optimization that refines only the recent past. The drift is managed, not eliminated, and the central-strip trick below keeps whatever remains from ever becoming a hard seam.

10.8.2 Mosaicking a moving strip (and why a central strip)

From each registered frame the phone keeps only a thin vertical strip taken from the centre of the frame and pastes it onto the moving mosaic canvas. Consecutive strips overlap slightly and are feathered together across that narrow overlap,

$$ I_{\text{out}} = \alpha\, I_{\text{new strip}} + (1-\alpha)\, I_{\text{mosaic}}, $$

with $\alpha$ ramping across the seam exactly as in Blending. The mosaic grows sideways as you pan, one strip at a time (Figure 10.8.1).
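The feathering formula can be sketched in a few lines. This is an illustration only: it assumes grayscale float images, strips that are already registered and differ from the canvas by a pure horizontal offset, and a linear ramp for $\alpha$; `paste_strip` and its parameters are hypothetical names.

```python
import numpy as np

def paste_strip(canvas, strip, x0, overlap):
    """Feather `strip` onto `canvas` in place, starting at column `x0`.

    canvas:  (H, W) float array holding the mosaic so far.
    strip:   (H, w) float array, the new central strip.
    overlap: number of leading strip columns that cover existing mosaic content.
    """
    w = strip.shape[1]
    alpha = np.ones(w)
    if overlap > 0:
        # Linear ramp across the overlap: mostly mosaic at the left edge,
        # mostly new strip at the right edge of the seam.
        alpha[:overlap] = (np.arange(overlap) + 0.5) / overlap
    region = canvas[:, x0:x0 + w]
    canvas[:, x0:x0 + w] = alpha * strip + (1.0 - alpha) * region
    return canvas

# Two flat strips with different brightness, overlapping by two columns.
canvas = np.zeros((4, 20))
paste_strip(canvas, np.full((4, 8), 10.0), x0=0, overlap=0)
paste_strip(canvas, np.full((4, 8), 20.0), x0=6, overlap=2)
```

Across the two overlapping columns the value steps from the old strip's level toward the new one instead of jumping, which is all the feather is meant to do.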

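Why the centre of the frame? Because a central strip minimizes three different problems at once, each of which is smallest right there on the optical axis:

Vignetting. Lens falloff darkens the corners and is lowest at the centre of the frame.

Radial distortion. Distortion grows with distance from the optical axis, so lines stay straightest near the centre.

Parallax. Parallax is lowest along the rotation axis, and because the strip is narrow, depth-dependent displacement across it is negligible, so a hand-held translation does not double-image. The strip is "locally a homography" even when the global motion is not.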

Seen this way, the central strip is the continuous-pano analog of seam routing from Blending: rather than choosing a seam through a wide overlap after the fact, you only ever keep the slice where the registration and photometric assumptions already hold best. The seam is chosen by construction, at the moment of capture.

fig-sweep-central-strip
Figure 10.8.1. Incremental registration and why a central strip. A phone pans across a scene; each new video frame is registered against the growing mosaic (not against all previous frames) via a small incremental homography $H_t=H_{t-1}\,\Delta H_t$, where the frame-to-frame increment $\Delta H_t$ is predicted from the gyroscope and only refined by the visual tracker — one fast update per frame, never an all-pairs solve, with an on-screen guide ribbon keeping the user panning smoothly and level. From each frame only a thin central vertical strip is kept. Annotations mark why the centre is the sweet spot: vignetting is lowest there (corners darkest), radial distortion is lowest near the optical axis (lines stay straight), and parallax is lowest along the rotation axis — and because the strip is narrow, depth-dependent displacement across it is negligible, so a hand-held translation does not double-image. Consecutive strips overlap just enough to feather ($I_{\text{out}}=\alpha I_{\text{new}}+(1-\alpha)I_{\text{mosaic}}$). The strip is "locally a homography" even when the global motion is not.

10.8.3 Rolling shutter and exposure drift

Two caveats that a tripod hides come to dominate the sweep, one geometric and one photometric.

Rolling shutter is the geometric one, and it bites hardest on a fast pan. A complementary metal-oxide-semiconductor (CMOS) sensor reads out row by row, so the bottom of a frame is captured a few milliseconds after the top. While you pan, the camera moves during that readout, so different rows of one frame see different camera poses. Model the pose as varying linearly with row index $y$,

$$ \mathbf p(y) = \mathbf p_0 + y\,\dot{\mathbf p}, \qquad \text{row } y \text{ exposed at } t_0 + y\,t_{\text{row}}, $$

where $t_{\text{row}}$ is the per-row readout time and $\dot{\mathbf p}$ is the pose change per row. Because the camera has rotated slightly between the first and last row, the frame is sheared: vertical poles lean, and a fast horizontal pan skews or wobbles the whole strip (Figure 10.8.2). There are two corrections, and the phone uses both. The user-facing one is to pan slowly, which is exactly what the guide ribbon enforces, since a slow sweep makes the intra-frame motion negligible. The algorithmic one is to model the per-row motion (again from the gyroscope, which timestamps the camera's rotation finely enough to know $\dot{\mathbf p}$) and rectify each row to a common pose before pasting; this is the same rolling-shutter rectification used for video stabilization in Video. And here the central-strip trick pays off a second time. Because each pasted strip is narrow, the rolling-shutter warp inside it reduces to little more than one horizontal offset per row, which is easy to undo. When the phone is held so that the sensor reads out along the pan direction, the strip also spans only a few rows' worth of readout time, so the warp itself is small.
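A minimal sketch of per-row rectification, assuming a constant-rate horizontal pan, small angles, and content near the optical axis, so that row $y$ is displaced horizontally by roughly $f\,\omega\,t_{\text{row}}\,y$ pixels relative to row 0. The function name and parameters are hypothetical, and the sign of `omega` sets which way the poles lean.

```python
import numpy as np

def rectify_rolling_shutter(frame, omega, focal_px, t_row):
    """Undo the shear caused by a steady horizontal pan during row-by-row readout.

    frame:    (H, W) grayscale image.
    omega:    pan rate in rad/s (assumed constant over the readout).
    focal_px: focal length in pixels.
    t_row:    readout time per row in seconds.
    """
    h, w = frame.shape
    cols = np.arange(w, dtype=float)
    out = np.empty_like(frame, dtype=float)
    for y in range(h):
        shift = focal_px * omega * t_row * y        # displacement of row y, in pixels
        # Resample the row so content seen at x + shift moves back to x.
        out[y] = np.interp(cols + shift, cols, frame[y])
    return out

# Synthetic check: a vertical pole that leans by one pixel per row.
observed = np.zeros((8, 64))
for y in range(8):
    observed[y, 30 + y] = 1.0
straight = rectify_rolling_shutter(observed, omega=1.0, focal_px=1000.0, t_row=1e-3)
```

The readout time in the synthetic check is exaggerated so the lean is a whole pixel per row; on a real sensor the per-row shift is a small fraction of a pixel and only adds up over the full frame height.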

Exposure and white-balance drift is the photometric caveat. Over a long sweep the auto-exposure (AE) and auto-white-balance (AWB) re-meter continuously — you pan from shadow into sun, from indoor tungsten toward a window — so brightness and color ramp along the mosaic. The fix is the streaming form of the gain compensation from Bells and whistles: continuous gain compensation, estimating a per-frame gain (and color scaling) so each new strip matches the mosaic where they overlap, optionally locking AE/AWB at the start of the sweep to prevent the ramp in the first place. Because each strip is feathered over only a narrow overlap, any residual drift appears as a gentle, low-frequency ramp across the whole panorama rather than a step at any one strip boundary — and a slow ramp is exactly what the eye forgives, while a hard edge is what it catches. This is big lesson L9 at work in the photometric domain: there is no offending gradient at any seam, so the slowly changing absolute level goes unnoticed.
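One simple way to estimate such a gain, sketched under assumptions: the new strip and the mosaic are already registered over their overlap, pixel values are linear, and a single multiplicative gain per color channel is enough. The least-squares form below is an illustrative choice, not necessarily what any phone ships, and `estimate_gains` is a hypothetical name.

```python
import numpy as np

def estimate_gains(new_overlap, mosaic_overlap, eps=1e-8):
    """Per-channel gain g minimising sum((g * new - mosaic)**2) over the overlap.

    Both inputs are (H, W, C) float arrays covering the same registered region.
    """
    n = np.asarray(new_overlap, dtype=float)
    m = np.asarray(mosaic_overlap, dtype=float)
    num = (n * m).sum(axis=(0, 1))
    den = (n * n).sum(axis=(0, 1)) + eps   # eps guards against an all-black overlap
    return num / den

# Synthetic overlap: the new frame is darker in red and brighter in blue.
rng = np.random.default_rng(0)
mosaic_patch = rng.uniform(0.2, 0.8, size=(16, 4, 3))
true_gains = np.array([1.2, 1.0, 0.8])
new_patch = mosaic_patch / true_gains

gains = estimate_gains(new_patch, mosaic_patch)
corrected = new_patch * gains              # apply before feathering the strip
```

Because the gain is estimated only from the narrow overlap, it can be noisy from frame to frame; any smoothing over time would be an additional design choice on top of this sketch.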

The takeaway is that a sweep panorama is the discrete pipeline's assumptions — static scene, global shutter, fixed exposure, pure rotation — all relaxed at once and absorbed incrementally. Narrow central strips, gyroscope-predicted registration, continuous gain compensation, and per-row rolling-shutter rectification are, together, how a phone turns a hand-waved video into one clean wide photograph.

fig-rolling-shutter-pan
Figure 10.8.2. Rolling shutter in a fast pan. A CMOS sensor reads out row by row, so during a quick horizontal pan the bottom rows are captured later than the top rows and see a slightly rotated camera: $\mathbf p(y)=\mathbf p_0+y\,\dot{\mathbf p}$, row $y$ exposed at $t_0+y\,t_{\text{row}}$. Left (uncorrected): a vertical pole is rendered slanted and wobbly because the frame is sheared by the intra-frame motion. Right (corrected): modelling the per-row pose from the gyroscope and rectifying each row to a common pose straightens the pole. Panning slowly (the guide ribbon) and using a narrow strip both shrink the effect.
💡 Big lesson (L14 · recurrence)

The sweep panorama is another instance of capture the full set, decide later — here the set is a whole video stream of overlapping views, and the panorama is reconstructed from it. What is distinctive is how the set is captured and consumed: not as a deliberate, separable bracket (as in high-dynamic-range (HDR) exposure or a focal stack) but as a continuous stream processed incrementally in real time, where the reconstruction runs as the data arrives rather than offline afterward. The cost L14 always charges — more data, a harder reconstruction — is paid here in streaming registration, rolling-shutter rectification, and continuous gain compensation, in exchange for a wide image no single phone frame could hold. (Registered in Big Lessons as L14; first introduced in this part's intro, with the light-field / plenoptic case in Advanced computational photography as its special form. Recurs here on the viewpoint axis, in its streaming, real-time form.)

