💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to

2.7 Multiple view geometry

A single photograph is missing one number per pixel: the depth $Z$. The perspective projection $x = fX/Z$, $y = fY/Z$ of the previous chapters maps a whole ray of 3D points — every point at the same direction but different distance — onto one image point, and the division by $Z$ that shrinks distant objects also erases how distant they were. You can recover $Z$ with extra hardware (the structured-light and time-of-flight depth sensors mentioned earlier), but the oldest and most general way to get it back is simply to look from a second place. That is what your two eyes do, what a stereo rig does, and what a moving camera does over time; the geometry is the same in every case, and this chapter is about that geometry.

The story comes in two parts. First the intuition, which is genuinely simple and which evolution discovered long before we did: a nearby object shifts more than a far one when you change viewpoint, and that shift — the disparity — inverts cleanly to distance. Then a teaser of the algebra that makes two-view geometry into a precise tool: the fact that a point in one image constrains its match in the other to a line, not a point, and the single small matrix that encodes that constraint. The intuition we develop in full; the algebra we state and forward to the Advanced part, where the multi-view machinery — many cameras, structure from motion, neural scene representations — properly belongs.

2.7.1 Two views: stereo and disparity

One camera throws away depth. Two cameras, looking at the same scene from slightly different positions, can get it back — which is why you have two eyes (Figure 2.7.1). The cue is disparity: a nearby object shifts a lot between the two viewpoints, a faraway object barely shifts at all. Hold a finger close and blink each eye in turn and it jumps; do the same with a distant building and it stays put. That difference in shift is depth information, and your visual system fuses the two retinal images into a single vivid sense of three dimensions — stereopsis.

[figure fig-stereo-pair not built]
Figure 2.7.1. A stereo pair. The same scene photographed from two horizontally offset viewpoints (shown here as a red–cyan anaglyph, or as a side-by-side pair for a stereoscope). Near objects sit at noticeably different image positions in the two views; far objects barely move. That position difference — disparity — encodes depth, and fusing the two views yields a vivid 3D impression.

That disparity alone carries depth — with no help from recognizing objects and no monocular cues like shading, occlusion, or familiar size — is not obvious, and Bela Julesz proved it with a beautiful experiment: the random-dot stereogram (Figure 2.7.2). Take a field of random black-and-white dots, copy it for the second eye, and shift a central square patch sideways. Each image on its own is pure noise — no edges, no shapes, nothing to recognize. But fuse the two and a square floats crisply in front of the background. "Depth without objects": the visual system computes depth purely from the per-pixel shift, before it has any idea what it is looking at. This is the strongest possible evidence that stereo depth is a low-level geometric computation, and it is the conceptual ancestor of every stereo-matching algorithm.

fig-random-dot-stereogram
Figure 2.7.2. A random-dot stereogram (Julesz). Two images of random dots are identical except that a central patch is shifted horizontally in one. Viewed monocularly, each is featureless noise; fused binocularly, the shifted patch springs forward as a floating surface. Depth emerges from disparity alone, with no recognizable objects or monocular cues — "depth without objects."

The geometry, for parallel cameras. Put two cameras side by side, optical axes parallel, separated by a baseline $B$, both with focal length $f$. A scene point at depth $Z$ projects into each camera; because the cameras are displaced by $B$, its two image positions differ by the disparity $d$. Drawing the two projection triangles and using similar triangles — the same move as the perspective equation of the Pinhole chapter — gives a clean relation:

$$ Z = \frac{f\,B}{d}. $$

Read it back: depth is focal length times baseline, divided by disparity. A larger shift $d$ means a closer point (small $Z$); a vanishing shift means a far point ($Z \to \infty$) — exactly the finger-versus-building intuition, now quantitative. A wider baseline $B$ or a longer focal length $f$ spreads the disparities out and measures depth more precisely, which is why stereo rigs — and our two eyes — have a baseline at all (Figure 2.7.3). The same relation also explains stereo's Achilles' heel: because $d \propto 1/Z$, faraway points crowd into tiny, nearly equal disparities, so depth precision degrades with distance — a one-pixel error in $d$ is harmless up close and catastrophic far away.

fig-disparity-geometry
Figure 2.7.3. Parallel-camera stereo geometry. Two cameras with focal length f are separated by a baseline B, axes parallel. A scene point at depth Z projects to two image positions whose difference is the disparity d. Similar triangles give Z = f·B / d: depth is inversely proportional to disparity, and proportional to baseline times focal length. Near points (large d) are easy to localize; far points (small d) are hard.
Sidebar — widening your baseline: telestereoscopes and rangefinders

Because disparity grows with the baseline ($d = fB/Z$), you can amplify your own depth perception by optically pushing your two eyes farther apart than the $\sim$6.5 cm your skull came with. Hermann von Helmholtz built exactly this in 1857 — the telestereoscope, two pairs of angled mirrors that feed each eye a view gathered a foot or more to the side, so distant mountains and buildings, normally flat to the naked eye, leap into exaggerated relief (hyperstereo). The military turned the same trick into instruments: the optical rangefinders on warships and artillery were long horizontal tubes with mirrors or prisms at each end and a baseline of one to ten meters, letting a gunner triangulate the range to a target kilometres away by fusing the two views; the scissors periscope (Scherenfernrohr) gave a trench observer both a raised viewpoint and a wide-baseline 3D view over the parapet. The flip side, hypostereo, shrinks the baseline — closely spaced lenses — for natural-looking depth on tiny or very near subjects in macro 3D. It is all the same lever from $Z = fB/d$: more baseline, more disparity, more depth. You are not stuck with the interocular distance you were born with.

Stereo matching — finding each point's match in the other view and reading off its disparity — produces a disparity map, and through $Z = fB/d$, a depth map (Figure 2.7.4). Real rigs are rarely perfectly parallel, so a preliminary step rectifies the pair: a 2D projective transform (a homography) reprojects both images as if the cameras were parallel, after which matching need only search along horizontal lines. That same homography machinery stitches panoramas — images taken from one viewpoint in different directions are also related by a homography — and the dual-pixel autofocus of a modern camera (covered in Photography and camera 101) is a stereo pair the width of a single lens. The hard part of matching is always the same: textureless regions and occlusions, where a point has no distinctive match or no match at all. The full multi-view story — many cameras, structure from motion, neural radiance fields and Gaussian splatting — is the subject of the Advanced part. Here the takeaway is just the lever: two views plus disparity equals depth.

[figure fig-disparity-map not built]
Figure 2.7.4. A disparity map. For a stereo pair (here a standard benchmark scene), each pixel's match in the other view gives its disparity, rendered as brightness — near surfaces bright (large disparity), far surfaces dark. Through Z = f·B/d this disparity map is a depth map. The hard parts are textureless regions and occlusions, where a match is ambiguous or absent.

2.7.2 Two-view geometry: epipolar lines, and the essential and fundamental matrices

The parallel-camera picture is the easy case, where matches lie on the same horizontal row. The general two-view setup — two cameras in arbitrary positions and orientations — has a deeper structure, and it rests on a single fact that is worth seeing once even though we defer its algebra. The fact is this: a point in one view does not pin its match to a point in the other — it constrains it to a line.

Here is why, in one picture (Figure 2.7.5). Take a pixel in the left image. By the divide-by-$Z$ logic of projection, you do not know its depth, so its 3D point could be anywhere along a single ray shooting out from the left camera's center through that pixel. Now look at that whole ray from the right camera: the ray is a line in space, and a line in space projects to a line in the right image. So the match for your left pixel, wherever it turns out to be, must lie somewhere on that projected line — the epipolar line. Matching is therefore a 1-D search along a line, not a 2-D search over the whole image — a huge simplification, and exactly the reason we rectify stereo pairs: rectification rotates both images so that every epipolar line becomes a horizontal row, recovering the easy parallel-camera case of the previous section.

fig-epipolar-line
Figure 2.7.5. The epipolar constraint. A pixel in the left image back-projects to a ray (its 3D point lies somewhere along it). Seen from the right camera, that ray projects to a single line — the epipolar line — so the matching point must lie on it: matching is a 1-D search along a line, not a 2-D search. The two camera centers and the 3D point define the epipolar plane, whose intersection with each image plane is the epipolar line.

That point-to-line constraint is encoded by a single $3 \times 3$ matrix, and we can derive it directly from the picture just drawn — the ray, the plane, and the line, in that order.

Step 1 — the ray (depth ambiguity). Work in the first camera's frame, with its center at the origin. A pixel back-projects to a ray: in calibrated coordinates — the pixel $x$ premultiplied by the inverse intrinsics, $\hat{x} = K^{-1}x$, which strips out focal length and principal point and leaves a pure direction — the unknown 3-D point is

$$ X = \lambda\,\hat{x}, \qquad \lambda > 0, $$

where the depth $\lambda$ is exactly the quantity the divide-by-$Z$ projection threw away. The true point could sit at any $\lambda$ along this ray.

Step 2 — the epipolar plane. That ray, together with the second camera's center, spans a plane: the epipolar plane. Equivalently, it is the plane through the two camera centers and the unknown point $X$ — and whatever $\lambda$ turns out to be, $X$ lies in this one plane (Figure 2.7.5).

Step 3 — the epipolar line. The plane cuts the second image in a line — the epipolar line — so the match $\hat{x}'$ is trapped on it. That is steps 1–3 of the picture; now we just write the plane down.

Step 4 — the plane equation, in the second camera's frame. Let the second camera be rotated and translated from the first by $(R, t)$, so a point with first-frame coordinates $X$ has second-frame coordinates $R X + t$. Two facts follow:

Now collect the three vectors that emanate from the second center into the epipolar plane: the match direction $\hat{x}'$ (toward $X$), the baseline $t$ (toward the first center), and the rotated ray $R\hat{x}$ (along the ray itself). All three lie in that one plane, so they are coplanar (Figure 2.7.6) — and coplanarity of three vectors is exactly a vanishing scalar triple product:

$$ \hat{x}' \cdot \big(t \times R\,\hat{x}\big) = 0. $$

Writing the cross product $t\times(\cdot)$ as the skew-symmetric matrix $[t]_\times$ turns the triple product into a bilinear form, and that is the matrix:

$$ \hat{x}'^{\mathsf{T}}\,[t]_\times R\,\hat{x} = 0 \;\equiv\; \hat{x}'^{\mathsf{T}} E\,\hat{x} = 0, \qquad E = [t]_\times R. $$

$E = [t]_\times R$ is the essential matrix — the entire epipolar constraint, in calibrated coordinates, assembled from nothing but the relative pose $(R, t)$.

fig-epipolar-coplanarity
Figure 2.7.6. Why the epipolar constraint is a single 3×3 matrix. Seen from the second camera's center $O'$, three vectors all lie in the one epipolar plane: the baseline $t$ (toward the first center $O$), the rotated back-projected ray $R\hat{x}$ (along the first camera's ray), and the match direction $\hat{x}'$ (toward the 3-D point $X$). Three coplanar vectors span a zero-volume parallelepiped, so their scalar triple product vanishes — $\hat{x}'\cdot(t\times R\hat{x})=0$ — and rewriting the cross product as the skew matrix $[t]_\times$ turns that into $\hat{x}'^{\mathsf T}[t]_\times R\,\hat{x}=0 \equiv \hat{x}'^{\mathsf T}E\,\hat{x}=0$, with $E=[t]_\times R$.

From $E$ to $F$. In raw pixel coordinates the intrinsics are unknown or unremoved; with $x = K\hat{x}$ and $x' = K'\hat{x}'$ for the two cameras' intrinsics $K, K'$, substitute $\hat{x} = K^{-1}x$ and $\hat{x}' = K'^{-1}x'$:

$$ x'^{\mathsf{T}}\,\underbrace{K'^{-\mathsf{T}}\,[t]_\times R\,K^{-1}}_{\displaystyle F}\,x = 0, \qquad F = K'^{-\mathsf{T}} E\,K^{-1}. $$

This is the fundamental matrix $F$: the same constraint when the cameras are uncalibrated. It also closes the loop with the picture — $F x$ is a 3-vector of line coefficients, the epipolar line in the second image, so $x'^{\mathsf{T}}(F x) = 0$ says precisely "the match $x'$ lies on that line."

What this buys, and what we still defer. Everything in two-view geometry flows from this one matrix. Because $E = [t]_\times R$, the relative pose $(R, t)$ of the two cameras can be recovered from $E$ (the translation only up to scale — a single pair of images cannot know the absolute size of the world), and relative poses chained across many views reconstruct both the cameras and the scene. What we still leave to the Advanced part is the estimation and the scaling-up: recovering $F$ or $E$ from noisy point matches (the eight-point algorithm and its robust, RANSAC-wrapped descendants), the rectification math, triangulation (intersecting two verified rays to recover the 3-D point), and the leap to many images — structure from motion and multi-view stereo, and the neural scene representations built on top of them. Carry away the small, durable core: a second viewpoint gives depth; disparity inverts to distance through $Z = fB/d$; and in general a point constrains its match to an epipolar line — $x'^{\mathsf{T}} F x = 0$, with $F = K'^{-\mathsf{T}}[t]_\times R\,K^{-1}$ and its calibrated heart the essential matrix $E = [t]_\times R$.