10.8 Continuous panoramas (e.g. on cell phones)
A cell-phone "sweep" panorama has the same goal as everything else in this part — one wide image assembled from many views — but the capture model is inverted. Instead of shooting $N$ discrete frames and stitching them offline, the phone captures a continuous video while you pan and builds the mosaic as the frames arrive. You wave the camera in one smooth motion; an on-screen ribbon nudges you to keep level; and a clean wide photograph appears the instant you stop. That streaming constraint, together with the fact that you are hand-holding and therefore continuously translating, reshapes every design choice the discrete pipeline made on a tripod.
The throughline is a single trade. A heavy offline optimization — register all pairs, run bundle adjustment, choose seams, blend — becomes affordable in real time only because the phone never stitches more than the newest sliver of the newest frame. Each arriving frame is registered against the growing mosaic, a thin vertical strip is taken from its centre, and that strip is feathered onto the canvas. Everything that makes a hand-waved video look like one photograph follows from that discipline.
10.8.1 Incremental registration of a video stream
Frames arrive at video rate, and the phone registers each new frame against the growing mosaic: frame-to-mosaic, not all-pairs. The registration is a small incremental homography: if $H_{t-1}$ maps the previous frame into the mosaic, then

$$H_t = H_{t-1}\,\Delta H_t,$$

where $\Delta H_t$ is the frame-to-frame motion, taken here to map frame $t$ into frame $t-1$. Because the camera moved only a little since the last frame, $\Delta H_t$ is small and mostly rotation, and, crucially, it can be predicted cheaply from the phone's gyroscope before any pixels are touched. The inertial sensor reports the inter-frame rotation directly; the visual tracker then only has to refine a prediction that is already close, rather than search from scratch. This is what buys real time: never an all-pairs solve, just one fast, well-initialized update per frame.
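The gyroscope-predicted update can be sketched as follows. This is a minimal illustration, not the pipeline any particular phone runs. It assumes the composition $H_t = H_{t-1}\,\Delta H_t$ with $\Delta H_t$ mapping frame $t$ into frame $t-1$, a pure-rotation model $\Delta H_t \approx K R K^{-1}$, gyroscope axes aligned with the camera axes, and a constant angular velocity between frames. The function names, the intrinsics `K_cam`, and all numbers are hypothetical.

```python
import numpy as np

def rotation_from_gyro(omega, dt):
    """Rotation matrix for angular velocity `omega` (rad/s) held for `dt` seconds,
    via Rodrigues' formula."""
    theta = np.asarray(omega, dtype=float) * dt
    angle = np.linalg.norm(theta)
    if angle < 1e-12:
        return np.eye(3)
    k = theta / angle
    skew = np.array([[0.0, -k[2], k[1]],
                     [k[2], 0.0, -k[0]],
                     [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * skew + (1.0 - np.cos(angle)) * (skew @ skew)

def predict_delta_h(K_cam, omega, dt):
    """Pure-rotation homography taking frame-t pixels to frame-(t-1) pixels.
    Assumes the gyro axes coincide with the camera axes (an idealisation)."""
    R = rotation_from_gyro(omega, dt)
    return K_cam @ R @ np.linalg.inv(K_cam)

def update_frame_to_mosaic(H_prev, delta_h):
    """Compose H_t = H_{t-1} @ delta_h and normalise the scale."""
    H = H_prev @ delta_h
    return H / H[2, 2]

# Hypothetical intrinsics and a 0.5 rad/s pan about the vertical axis at 30 fps.
K_cam = np.array([[1000.0, 0.0, 640.0],
                  [0.0, 1000.0, 360.0],
                  [0.0, 0.0, 1.0]])
delta_h = predict_delta_h(K_cam, omega=(0.0, 0.5, 0.0), dt=1.0 / 30.0)
H_t = update_frame_to_mosaic(np.eye(3), delta_h)

# Where the centre pixel of the new frame lands in the previous frame / mosaic.
centre = H_t @ np.array([640.0, 360.0, 1.0])
centre_px = centre[:2] / centre[2]
```

In a real tracker this prediction would only seed the visual refinement; the sketch stops at the prediction and composition steps.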
The on-screen level/arrow ribbon is part of the algorithm, not just the user interface. The incremental registration assumes a smooth, roughly constant sweep; keeping the motion regular keeps each $\Delta H_t$ small and well-conditioned, and keeps consecutive strips overlapping by just the right amount to register and feather. A jerky pan breaks the small-motion assumption and the mosaic tears.
Errors still accumulate along a long sweep. This is the same drift we met for global panoramas in Bells and whistles, where small per-pair rotation errors compound into a visible mismatch when the loop closes. A full bundle adjustment over the whole sweep is too heavy for real time, so phones use lightweight substitutes: the gyroscope prior to anchor absolute orientation, occasional re-registration against earlier mosaic content, and an incremental (windowed) optimization that refines only the recent past. The drift is managed, not eliminated, and the central-strip trick below keeps whatever remains from ever becoming a hard seam.
10.8.2 Mosaicking a moving strip (and why a central strip)
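From each registered frame the phone keeps only a thin vertical strip taken from the centre of the frame and pastes it onto the moving mosaic canvas. Consecutive strips overlap slightly and are feathered together across that narrow overlap,

$$M(\mathbf{x}) \leftarrow \alpha(\mathbf{x})\,S_t(\mathbf{x}) + \bigl(1-\alpha(\mathbf{x})\bigr)\,M(\mathbf{x}),$$

writing $M$ for the mosaic so far and $S_t$ for the new strip warped into mosaic coordinates, with $\alpha$ ramping across the seam exactly as in Blending. The mosaic grows sideways as you pan, one strip at a time (Figure 10.8.1).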
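As a rough sketch of the strip-pasting step, the snippet below feathers one strip onto a canvas with a linear $\alpha$ ramp over the overlap columns and $\alpha = 1$ elsewhere. The linear ramp, the function name `feather_strip`, and its parameters are assumptions made for illustration. The strip is assumed to be already warped into mosaic coordinates, and the canvas is assumed to be a floating-point array.

```python
import numpy as np

def feather_strip(mosaic, strip, x0, overlap):
    """Paste `strip` into `mosaic` starting at column `x0`.

    The first `overlap` columns of the strip are blended with existing mosaic
    content using a linear alpha ramp; the remaining columns overwrite the canvas.
    Works for grayscale (H, W) or colour (H, W, C) float arrays; edits in place.
    """
    w = strip.shape[1]
    alpha = np.ones(w)
    if overlap > 0:
        # Ramp from near 0 (trust the mosaic) to near 1 (trust the new strip).
        alpha[:overlap] = (np.arange(overlap) + 0.5) / overlap
    # Shape alpha so it broadcasts over rows (and channels, if present).
    alpha = alpha.reshape(1, w, *([1] * (strip.ndim - 2)))
    region = mosaic[:, x0:x0 + w]
    mosaic[:, x0:x0 + w] = alpha * strip + (1.0 - alpha) * region
    return mosaic

# Toy example: the canvas holds value 1.0 in its first 10 columns, and a new
# strip of value 3.0 arrives overlapping the last 4 of those columns.
canvas = np.zeros((4, 20))
canvas[:, :10] = 1.0
new_strip = np.full((4, 8), 3.0)
feather_strip(canvas, new_strip, x0=6, overlap=4)
```

After the call, columns 6 to 9 of the toy canvas step smoothly from the old value toward the new one, and columns 10 to 13 hold the strip unchanged.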
Why the centre of the frame? Because a central strip minimizes three different problems at once, each of which is smallest right there on the optical axis:
- Vignetting is smallest at the frame centre — the corners are the darkest part of any lens's falloff — so centre strips are photometrically uniform and butt together without a brightness step.
- Radial distortion is smallest near the optical axis, so centre strips warp the least and straight lines stay straight. The frame edges, where distortion is worst, are simply never used.
- Parallax is smallest along the rotation axis, and because each strip is narrow, the depth-dependent displacement across the strip is tiny. The hand-held translation that would double-image a wide overlap produces almost no disparity inside a thin central slice. A narrow central strip is "locally a homography" even when the global hand motion is not — so the phone suppresses parallax doubling without ever recovering 3-D.
Seen this way, the central strip is the continuous-pano analog of seam routing from Blending: rather than choosing a seam through a wide overlap after the fact, you only ever keep the slice where the registration and photometric assumptions already hold best. The seam is chosen by construction, at the moment of capture.
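To get a feel for how much the centre helps photometrically and geometrically, the sketch below compares the edge of a narrow central strip with the edge of the full frame. It uses two textbook stand-in models that the text above does not commit to: the $\cos^4$ falloff law for vignetting and a one-coefficient radial distortion model. The focal length, strip width, and coefficient `k1` are hypothetical values, and real lenses will differ.

```python
import numpy as np

def vignetting_cos4(x_px, f_px):
    """Relative illumination under the cos^4 falloff model at a horizontal
    offset of `x_px` pixels from the optical axis (1.0 on the axis)."""
    return np.cos(np.arctan(x_px / f_px)) ** 4

def radial_shift_px(x_px, f_px, k1):
    """Pixel displacement from the one-term radial model x_d = x * (1 + k1 * r^2),
    with r measured in normalised (focal-length) units along the horizontal axis."""
    r = x_px / f_px
    return f_px * r * k1 * r ** 2

# Hypothetical camera: 1000 px focal length, 1280 px wide frame, 32 px wide strip.
f_px, k1 = 1000.0, -0.1
strip_edge, frame_edge = 16.0, 640.0

falloff_strip = vignetting_cos4(strip_edge, f_px)
falloff_frame = vignetting_cos4(frame_edge, f_px)
shift_strip = radial_shift_px(strip_edge, f_px, k1)
shift_frame = radial_shift_px(frame_edge, f_px, k1)
```

Under these assumed numbers the strip edge loses a small fraction of a percent of brightness and moves by well under a pixel, while the frame edge loses roughly half its brightness and moves by tens of pixels.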
10.8.3 Rolling shutter and exposure drift
Two caveats that a tripod hides come to dominate the sweep, one geometric and one photometric.
Rolling shutter is the geometric one, and it bites hardest on a fast pan. A complementary metal-oxide-semiconductor (CMOS) sensor reads out row by row, so the bottom of a frame is captured a few milliseconds after the top. While you pan, the camera moves during that readout, so different rows of one frame see different camera poses. Model the pose as varying linearly with row index $y$,

$$\mathbf p(y) = \mathbf p_0 + \dot{\mathbf p}\,t_{\text{row}}\,y,$$

where $\mathbf p_0$ is the pose at the first row, $\dot{\mathbf p}$ is its rate of change, and $t_{\text{row}}$ is the per-row readout time. Because the camera has rotated slightly between the first and last row, the frame is sheared: vertical poles lean, and a fast horizontal pan skews or wobbles the whole strip (Figure 10.8.2).

There are two corrections, and the phone uses both. The user-facing one is to pan slowly, which is exactly what the guide ribbon enforces, since a slow sweep makes the intra-frame motion negligible. The algorithmic one is to model the per-row motion (again from the gyroscope, which timestamps the camera's rotation finely enough to know $\dot{\mathbf p}$) and rectify each row to a common pose before pasting, the same rolling-shutter rectification used for video stabilization in Video. Here the central-strip trick pays off a second time when the sensor's readout rows run parallel to the strip: the narrow strip then spans only a few rows' worth of readout time, so its rolling-shutter warp is small and easy to undo. When the rows run across the strip instead, the strip spans the full readout and relies on the per-row rectification.
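A minimal sketch of per-row rectification under the linear row model follows. It assumes a pure, constant yaw rate during readout, a small-angle approximation in which row $y$ is displaced horizontally by about $f\,\omega\,t_{\text{row}}\,y$ pixels, and a grayscale frame. The function name and parameters are hypothetical, and the sign of the shift depends on axis conventions that a real implementation would have to pin down.

```python
import numpy as np

def rectify_rolling_shutter(frame, omega_yaw, t_row, f_px):
    """Undo a horizontal rolling-shutter shear on a grayscale frame.

    Assumed model: content in row y is displaced by s*y pixels in +x relative
    to row 0, with s = f_px * omega_yaw * t_row. Each row is resampled so that
    all rows share the pose of row 0. Samples falling outside the row are
    clamped to the nearest edge value by np.interp.
    """
    h, w = frame.shape
    x = np.arange(w, dtype=float)
    shear_per_row = f_px * omega_yaw * t_row
    out = np.empty((h, w), dtype=float)
    for y in range(h):
        # Read each output pixel from where the displaced content actually is.
        out[y] = np.interp(x + shear_per_row * y, x, frame[y])
    return out

# Toy check: a horizontal ramp sheared by 0.5 px per row, then rectified.
h, w = 8, 32
s = 0.5                       # equals f_px * omega_yaw * t_row below
cols = np.arange(w, dtype=float)
rows = np.arange(h, dtype=float)
sheared = cols[None, :] - s * rows[:, None]
restored = rectify_rolling_shutter(sheared, omega_yaw=0.5, t_row=1e-3, f_px=1000.0)
```

Away from the clamped right-hand border, every row of `restored` reproduces the unsheared ramp. The readout time used here is exaggerated so that the shear is visible on a tiny array.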
Exposure and white-balance drift is the photometric caveat. Over a long sweep the auto-exposure (AE) and auto-white-balance (AWB) re-meter continuously (you pan from shadow into sun, from indoor tungsten toward a window), so brightness and color ramp along the mosaic. The fix is the streaming form of the gain compensation from Bells and whistles: estimate a per-frame gain (and color scaling) so that each new strip matches the mosaic where they overlap. Optionally, the phone can also lock AE/AWB at the start of the sweep to prevent the ramp in the first place. Because each strip is gain-matched and then feathered over only a narrow overlap, any residual drift appears as a gentle, low-frequency ramp across the whole panorama rather than a step at any one strip boundary. A slow ramp is exactly what the eye forgives, while a hard edge is what it catches. This is big lesson L9 at work in the photometric domain: there is no offending gradient at any seam, so the slowly changing absolute level goes unnoticed.
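One simple way to realise the per-frame gain estimate is a per-channel least-squares fit over the overlap region, sketched below. This is an illustrative choice rather than the method of any specific phone. It assumes linear (not gamma-encoded) intensities and colour arrays of shape (H, W, C). The function names are hypothetical, and the temporal smoothing step is an optional extra that the text above does not describe.

```python
import numpy as np

def estimate_gain(mosaic_overlap, strip_overlap, eps=1e-6):
    """Per-channel gain g minimising sum((g * strip - mosaic)**2) over the overlap.

    Both inputs are (H, W, C) arrays covering the same mosaic pixels.
    """
    m = np.asarray(mosaic_overlap, dtype=float)
    s = np.asarray(strip_overlap, dtype=float)
    m = m.reshape(-1, m.shape[-1])
    s = s.reshape(-1, s.shape[-1])
    # Closed-form least squares per channel; eps guards against an all-black overlap.
    return (s * m).sum(axis=0) / ((s * s).sum(axis=0) + eps)

def smooth_gain(g_prev, g_new, beta=0.2):
    """Optional exponential smoothing so the gain cannot jump between strips."""
    return (1.0 - beta) * np.asarray(g_prev, dtype=float) + beta * np.asarray(g_new, dtype=float)

# Toy example: the new strip is darker and colour-shifted relative to the mosaic.
rng = np.random.default_rng(0)
mosaic_patch = rng.uniform(0.1, 1.0, size=(6, 4, 3))
true_gain = np.array([1.2, 1.0, 0.8])
strip_patch = mosaic_patch / true_gain

gain = estimate_gain(mosaic_patch, strip_patch)
corrected = strip_patch * gain     # now matches the mosaic in the overlap
```

Multiplying the whole strip by `gain` before feathering removes the step at the seam; any error left in the estimate then shows up only as the slow ramp described above.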
The takeaway is that a sweep panorama is the discrete pipeline's assumptions — static scene, global shutter, fixed exposure, pure rotation — all relaxed at once and absorbed incrementally. Narrow central strips, gyroscope-predicted registration, continuous gain compensation, and per-row rolling-shutter rectification are, together, how a phone turns a hand-waved video into one clean wide photograph.
The sweep panorama is another instance of capture the full set, decide later — here the set is a whole video stream of overlapping views, and the panorama is reconstructed from it. What is distinctive is how the set is captured and consumed: not as a deliberate, separable bracket (as in high-dynamic-range (HDR) exposure or a focal stack) but as a continuous stream processed incrementally in real time, where the reconstruction runs as the data arrives rather than offline afterward. The cost L14 always charges — more data, a harder reconstruction — is paid here in streaming registration, rolling-shutter rectification, and continuous gain compensation, in exchange for a wide image no single phone frame could hold. (Registered in Big Lessons as L14; first introduced in this part's intro, with the light-field / plenoptic case in Advanced computational photography as its special form. Recurs here on the viewpoint axis, in its streaming, real-time form.)