10.8 Bells and whistles⧉
The pairwise pipeline — match, RANSAC, homography, blend — makes a demo. A tool has to survive 360° loops, lens imperfections, people who wander through the shot, and a photographer who took a step sideways. Each section below is one such assumption breaking, and the repair. The recurring fix is the same throughout: treat the whole panorama as one global estimation problem instead of a chain of independent local ones. That single reframing — local chain to global solve — is the spine of the chapter, and it is worth keeping in view because it recurs far beyond panoramas: it is the same idea that underlies structure-from-motion and simultaneous localization and mapping (SLAM), where a moving camera must stay globally consistent over thousands of frames.
Following the part spine L14 — capture the full set, decide later, a panorama defers the choice of field of view at capture and decides the framing, projection surface, global geometry, and which frame owns each moving object afterward in software; this chapter is where that deferred decision gets cashed out at production quality.
10.8.1 Other projections⧉
A panorama can span far more of the world than a single flat image plane can hold, which forces a question the two-image demo never had to ask: onto what surface do you unroll the mosaic? The frames have already been registered to one another by the homographies; choosing a projection is a separate, later step — a re-parameterization of the output canvas, independent of the alignment. But it is not a cosmetic choice. The surface you pick decides what looks natural and what distorts, and there is a hard geometric tradeoff at its heart that no cleverness escapes.
Planar projection unrolls the mosaic onto a single flat plane, as if it were one enormous photograph. Its virtue is the one architects and engineers care about most: it keeps straight lines straight. A planar panorama of a building has straight edges where the building has straight edges. The price is field of view. As the panorama widens, the corners stretch grotesquely and the edges race toward infinity; a planar projection cannot exceed roughly 120°, and a 180° planar panorama is mathematically impossible — the rays at the edge would need to land infinitely far out on the plane. Planar is therefore the right choice for narrow-to-moderate sweeps where keeping lines straight matters.
Cylindrical projection wraps the mosaic onto a cylinder and then unrolls it. This is the natural surface for the classic "turn in place" panorama — a wide horizontal sweep, even a full 360° around the horizontal. Here vertical lines stay straight and the horizon becomes a straight horizontal band, but horizontal straight lines that are not at eye level bend into curves. The construction needs the focal length, which sets the cylinder's radius.
Spherical (or equirectangular) projection wraps the mosaic onto a sphere, and it is the choice for a full $360° \times 180°$ panorama — the immersive surround used in virtual reality and street-view imagery. Everything in the scene maps onto the output, but almost every straight line curves, and the poles (straight up and straight down) stretch severely.
The pattern across the three is one tradeoff stated plainly, and it is worth naming because it is unavoidable: you cannot simultaneously cover a very wide field of view and keep all straight lines straight on a flat output. This is the cartographer's dilemma — you cannot flatten a sphere onto a plane without distortion, which is why every world map lies about something (areas, angles, or distances). Planar buys straight lines at the cost of field of view; cylindrical and spherical buy field of view at the cost of bent lines. So pick by content: lots of straight architecture and a modest sweep favor planar; a wide landscape turn favors cylindrical; an immersive 360° surround demands spherical (Figure 10.8.1).
Because the projection is just a re-parameterization of the output canvas, nothing stops you from choosing a deliberately creative one. The stereographic "little planet" projects a full $360°$ panorama from the zenith down onto a plane, wrapping the horizon into a circle and curling the ground into a tiny world seen from above — the same registered mosaic, unrolled for effect rather than fidelity. It is the playful end of the same freedom bundle adjustment and blending exist to protect.
10.8.2 Bundle adjustment⧉
The naive pipeline builds a multi-image panorama the obvious way: it estimates each homography between adjacent frames and composes them to relate distant frames — $H_{13} = H_{23} H_{12}$, and so on around the sequence. Each pairwise homography carries a small estimation error, and composing a chain of them accumulates those errors. The symptom is unmistakable on a closed loop: shoot a full 360° panorama and, by the time the chain comes back around to where it started, the two ends do not meet — there is a visible gap or vertical step at the closure (Figure 10.8.3). Sequential estimation lets error pile up and dumps the whole accumulated discrepancy at the seam.
The fix is the chapter's recurring move made concrete. Instead of fixing the cameras one pair at a time, jointly optimize the parameters of all cameras at once so that they best explain all the feature correspondences simultaneously. This is bundle adjustment, the global-consistency engine borrowed from photogrammetry and structure-from-motion (Brown and Lowe, 2007; the canonical synthesis is Triggs et al. 2000). When the solve is global, the residual error is distributed evenly around the loop instead of being dumped at one seam — and so the loop closes, with the leftover error spread thin enough to be invisible.
Concretely, bundle adjustment minimizes the total reprojection error of every matched feature across every overlapping pair of images. Writing $\theta_j$ for the parameter bundle of camera $j$ (its rotation $R_j$, focal length $f_j$, and — below — its distortion and gain), and $\hat{\mathbf{x}}^j_i$ for where feature $i$ should land in image $j$ given those parameters,
Read it back: choose all the cameras' parameters together so that, summed over every feature $i$ seen in every overlapping pair $(j,k)$, each feature's predicted position agrees as closely as possible with where it was actually observed. This is a large nonlinear least-squares problem — the same kind of inverse problem as Linear Inverse Problems and Regression, but nonlinear in the rotations, so it is solved by Levenberg–Marquardt, initialized from the RANSAC pairwise homographies and grown one image at a time. The function $\rho$ is a robust loss (rather than a plain squared norm) so that any mismatches that survived RANSAC cannot dominate the fit.
The real payoff of phrasing alignment as one global solve is that the same optimization can carry extra unknowns that have nothing to do with geometry — and folding them in is what turns a geometrically-aligned mosaic into a seamless one. Two lens-and-sensor imperfections in particular belong inside the solve rather than being patched afterward.
The first is gain (exposure) compensation. Frames shot around a panorama rarely share an exposure exactly — auto-exposure drifts, the sun moves behind a cloud — so the overlaps disagree in brightness. Assign each image a per-image gain $g_j$ (and optionally a per-channel white balance) and choose the gains so that overlapping pixels agree photometrically:
The first term pulls the overlapping brightnesses into agreement; the regularizer $\lambda \sum_j (g_j-1)^2$ keeps every gain near 1 so the whole panorama does not drift uniformly dark or bright (the trivial all-zero solution that the first term alone would happily accept). Solving this before the seam blend removes most of the exposure mismatch up front — which is exactly what makes the blending of the previous chapter easy, because a blender hides a small residual mismatch far more gracefully than a large one.
The second is vignetting — the radial brightness falloff that darkens a lens's corners relative to its center. A per-lens falloff model is estimated jointly so that the same scene point matches in brightness whether it happens to land at the center of one frame or the corner of the next. And one can fold in radial distortion as well: the barrel or pincushion warp that bends straight lines near the frame edge, modeled as
with the distortion coefficients $\kappa$ added to the per-camera parameters $\theta_j$ so that straight scene lines align across frames rather than fighting the homography at the seams. (The distortion model itself is developed in Optics, lenses, and aberration correction; here it is simply another set of unknowns in the global fit.)
The lesson to land is the one the whole chapter keeps circling back to: geometry, photometry, and lens imperfections are not three separate clean-up passes — they are one joint estimation. Solving them together, rather than as a sequence of independent patches each fighting the others' residuals, is what yields a globally consistent, seam-friendly mosaic.
10.8.3 Movement and parallax handling⧉
The previous two sections repaired the geometry and photometry of the panorama while still assuming two things the world routinely violates: that the scene is static, and that the camera only rotated. This section is what happens when each of those assumptions breaks. The first failure — something moved between frames — is fixed by a tool you already have. The second — the camera translated — is a genuine limit of the homography model, and worth understanding precisely so you know when to stop fighting it.
Deghosting: one source per moving region⧉
Bundle adjustment and blending both assume a static scene, but in practice people walk, leaves rustle, and cars drive through the overlap between two frames. A moving object sits in different places in the two frames that see it, and when the blender — which was designed to cross-fade correctly aligned pixels — averages those two positions, the object appears twice as a ghost, or smears into a translucent double. It is important to distinguish this from the misregistration ghosting of the alignment chapter: there the cure was better alignment, because the pixels themselves were mis-positioned. Here the background pixels are aligned correctly; it is the object that genuinely changed place between exposures, and no amount of better registration can reconcile two truthful but incompatible observations.
The fix is to stop averaging across the moving object and instead take it entirely from a single frame. Detect the disagreement regions first — after geometric and photometric alignment, a large per-pixel difference between overlapping frames is the signature that "something moved here." Then route the composite so that each such region is drawn from exactly one frame, keeping the moving object whole.
Crucially, this is not a new algorithm — it is precisely the seam-optimization problem of Seam optimization with one extra term added to the cost. Recall the seam energy from Blending and Seam optimization, $E = \sum_p D_p + \sum_{pq} V_{pq}$ — a per-pixel data cost $D_p$ for which frame each output pixel is drawn from, plus a smoothness cost $V_{pq}$ for cutting between neighbors. "One source per moving region" is exactly a per-pixel labeling decision: which source frame does each output pixel come from? Make the graph-cut smoothness energy cheap to cut around a moving blob and expensive to cut through it, and the optimizer (Boykov, Veksler and Zabih, 2001) automatically routes the seam around the object, so it is taken whole from one frame. Then blend along the chosen seam with the pyramid or Poisson machinery of Linear pyramids and wavelets and Poisson image editing. Deghosting is therefore the seam energy of the blending chapter, with motion-disagreement folded into the cut cost — the same "the energy is the design" principle that governs the whole EDGES-MATTER part. And because the choice is just a label per region, the same machinery gives you the playful variants for free: keep all copies of a person who walked through and you get a "stroboscopic" multiplicity panorama; keep none and you have cleanly removed every transient passer-by (Figure 10.8.4).
Parallax: when translation breaks the homography⧉
The deeper limit is geometric, and it is not a bug to be tuned away. Recall from Manual panorama stitching from multiple views why a homography can align two frames in the first place: it is exact only if the camera underwent pure rotation about its optical center, or the scene is planar or very distant. Under pure rotation every scene point, near or far, moves between the two views by the same projective map — which is what let a single $3 \times 3$ homography $H \simeq K R K^{-1}$ stand in for the whole scene, with no dependence on depth.
The moment the camera translates — the photographer takes a step sideways — that guarantee collapses. Now near and far objects shift by different amounts: this is motion parallax, the everyday effect by which nearby fence posts streak past a train window while distant mountains barely move. And no single homography can align both: line up the background and the foreground doubles; line up the foreground and the background tears (Figure 10.8.5).
It helps to see exactly why this is fundamental rather than a matter of a better fit. A homography is a 2-D warp with 8 degrees of freedom — one global mapping from image plane to image plane. But translation through a 3-D scene induces a displacement that depends on each point's depth, and a single 2-D map simply has no way to encode "shift the near things more than the far things." The discrepancy grows with the amount of translation and with the range of depths in the scene, so it is worst with close foreground in front of a distant background — exactly the situation where it is most visible.
Because the limit is structural, the practical escapes either avoid the translation or stop using a homography:
- Rotate about the no-parallax point. If the camera pivots about its entrance pupil there is no translation of the optical center, so the pure-rotation assumption holds exactly and the homography is once again valid. This is precisely why tripod panoramas use a nodal slide to position the lens's pupil over the axis of rotation.
- Accept it locally. Route the seam through low-parallax regions — distant sky, uniform wall — using the graph cut once more, and blend, so that the unavoidable mismatch is hidden where the scene is far away or featureless enough that no one notices.
- Go 3-D. When translation is essential to the shot, abandon the single-homography model and estimate scene depth with genuine multi-view geometry (Hartley and Zisserman). This is the hand-off to structure-from-motion and to Matching pixels across space and time, where a per-pixel, depth-aware warp replaces the one global map.
This is also why the continuous "sweep" panorama on a phone — covered next — is the hardest case of all: a handheld sweep is continuous translation plus rotation, so parallax (and rolling-shutter wobble) are front and center rather than incidental, and the static-scene, pure-rotation idealization that made the demo work is never quite true.