2.7 Multiple view geometry⧉
A single photograph is missing one number per pixel: the depth $Z$. The perspective projection $x = fX/Z$, $y = fY/Z$ of the previous chapters maps a whole ray of 3D points — every point at the same direction but different distance — onto one image point, and the division by $Z$ that shrinks distant objects also erases how distant they were. You can recover $Z$ with extra hardware (the structured-light and time-of-flight depth sensors mentioned earlier), but the oldest and most general way to get it back is simply to look from a second place. That is what your two eyes do, what a stereo rig does, and what a moving camera does over time; the geometry is the same in every case, and this chapter is about that geometry.
The story comes in two parts. First the intuition, which is genuinely simple and which evolution discovered long before we did: a nearby object shifts more than a far one when you change viewpoint, and that shift — the disparity — inverts cleanly to distance. Then a teaser of the algebra that makes two-view geometry into a precise tool: the fact that a point in one image constrains its match in the other to a line, not a point, and the single small matrix that encodes that constraint. The intuition we develop in full; the algebra we state and forward to the Advanced part, where the multi-view machinery — many cameras, structure from motion, neural scene representations — properly belongs.
2.7.1 Two views: stereo and disparity⧉
One camera throws away depth. Two cameras, looking at the same scene from slightly different positions, can get it back — which is why you have two eyes (Figure 2.7.1). The cue is disparity: a nearby object shifts a lot between the two viewpoints, a faraway object barely shifts at all. Hold a finger close and blink each eye in turn and it jumps; do the same with a distant building and it stays put. That difference in shift is depth information, and your visual system fuses the two retinal images into a single vivid sense of three dimensions — stereopsis.
That disparity alone carries depth — with no help from recognizing objects and no monocular cues like shading, occlusion, or familiar size — is not obvious, and Bela Julesz proved it with a beautiful experiment: the random-dot stereogram (Figure 2.7.2). Take a field of random black-and-white dots, copy it for the second eye, and shift a central square patch sideways. Each image on its own is pure noise — no edges, no shapes, nothing to recognize. But fuse the two and a square floats crisply in front of the background. "Depth without objects": the visual system computes depth purely from the per-pixel shift, before it has any idea what it is looking at. This is the strongest possible evidence that stereo depth is a low-level geometric computation, and it is the conceptual ancestor of every stereo-matching algorithm.
The geometry, for parallel cameras. Put two cameras side by side, optical axes parallel, separated by a baseline $B$, both with focal length $f$. A scene point at depth $Z$ projects into each camera; because the cameras are displaced by $B$, its two image positions differ by the disparity $d$. Drawing the two projection triangles and using similar triangles — the same move as the perspective equation of the Pinhole chapter — gives a clean relation:
Read it back: depth is focal length times baseline, divided by disparity. A larger shift $d$ means a closer point (small $Z$); a vanishing shift means a far point ($Z \to \infty$) — exactly the finger-versus-building intuition, now quantitative. A wider baseline $B$ or a longer focal length $f$ spreads the disparities out and measures depth more precisely, which is why stereo rigs — and our two eyes — have a baseline at all (Figure 2.7.3). The same relation also explains stereo's Achilles' heel: because $d \propto 1/Z$, faraway points crowd into tiny, nearly equal disparities, so depth precision degrades with distance — a one-pixel error in $d$ is harmless up close and catastrophic far away.
Because disparity grows with the baseline ($d = fB/Z$), you can amplify your own depth perception by optically pushing your two eyes farther apart than the $\sim$6.5 cm your skull came with. Hermann von Helmholtz built exactly this in 1857 — the telestereoscope, two pairs of angled mirrors that feed each eye a view gathered a foot or more to the side, so distant mountains and buildings, normally flat to the naked eye, leap into exaggerated relief (hyperstereo). The military turned the same trick into instruments: the optical rangefinders on warships and artillery were long horizontal tubes with mirrors or prisms at each end and a baseline of one to ten meters, letting a gunner triangulate the range to a target kilometres away by fusing the two views; the scissors periscope (Scherenfernrohr) gave a trench observer both a raised viewpoint and a wide-baseline 3D view over the parapet. The flip side, hypostereo, shrinks the baseline — closely spaced lenses — for natural-looking depth on tiny or very near subjects in macro 3D. It is all the same lever from $Z = fB/d$: more baseline, more disparity, more depth. You are not stuck with the interocular distance you were born with.
Stereo matching — finding each point's match in the other view and reading off its disparity — produces a disparity map, and through $Z = fB/d$, a depth map (Figure 2.7.4). Real rigs are rarely perfectly parallel, so a preliminary step rectifies the pair: a 2D projective transform (a homography) reprojects both images as if the cameras were parallel, after which matching need only search along horizontal lines. That same homography machinery stitches panoramas — images taken from one viewpoint in different directions are also related by a homography — and the dual-pixel autofocus of a modern camera (covered in Photography and camera 101) is a stereo pair the width of a single lens. The hard part of matching is always the same: textureless regions and occlusions, where a point has no distinctive match or no match at all. The full multi-view story — many cameras, structure from motion, neural radiance fields and Gaussian splatting — is the subject of the Advanced part. Here the takeaway is just the lever: two views plus disparity equals depth.
2.7.2 Two-view geometry: epipolar lines, and the essential and fundamental matrices⧉
The parallel-camera picture is the easy case, where matches lie on the same horizontal row. The general two-view setup — two cameras in arbitrary positions and orientations — has a deeper structure, and it rests on a single fact that is worth seeing once even though we defer its algebra. The fact is this: a point in one view does not pin its match to a point in the other — it constrains it to a line.
Here is why, in one picture (Figure 2.7.5). Take a pixel in the left image. By the divide-by-$Z$ logic of projection, you do not know its depth, so its 3D point could be anywhere along a single ray shooting out from the left camera's center through that pixel. Now look at that whole ray from the right camera: the ray is a line in space, and a line in space projects to a line in the right image. So the match for your left pixel, wherever it turns out to be, must lie somewhere on that projected line — the epipolar line. Matching is therefore a 1-D search along a line, not a 2-D search over the whole image — a huge simplification, and exactly the reason we rectify stereo pairs: rectification rotates both images so that every epipolar line becomes a horizontal row, recovering the easy parallel-camera case of the previous section.
That point-to-line constraint is encoded by a single $3 \times 3$ matrix, and we can derive it directly from the picture just drawn — the ray, the plane, and the line, in that order.
Step 1 — the ray (depth ambiguity). Work in the first camera's frame, with its center at the origin. A pixel back-projects to a ray: in calibrated coordinates — the pixel $x$ premultiplied by the inverse intrinsics, $\hat{x} = K^{-1}x$, which strips out focal length and principal point and leaves a pure direction — the unknown 3-D point is
where the depth $\lambda$ is exactly the quantity the divide-by-$Z$ projection threw away. The true point could sit at any $\lambda$ along this ray.
Step 2 — the epipolar plane. That ray, together with the second camera's center, spans a plane: the epipolar plane. Equivalently, it is the plane through the two camera centers and the unknown point $X$ — and whatever $\lambda$ turns out to be, $X$ lies in this one plane (Figure 2.7.5).
Step 3 — the epipolar line. The plane cuts the second image in a line — the epipolar line — so the match $\hat{x}'$ is trapped on it. That is steps 1–3 of the picture; now we just write the plane down.
Step 4 — the plane equation, in the second camera's frame. Let the second camera be rotated and translated from the first by $(R, t)$, so a point with first-frame coordinates $X$ has second-frame coordinates $R X + t$. Two facts follow:
- the first camera's center — the origin of frame 1 — maps to $t$, so $t$ is the baseline vector from the second center to the first;
- the ray direction $\hat{x}$ becomes $R\hat{x}$ in the second frame.
Now collect the three vectors that emanate from the second center into the epipolar plane: the match direction $\hat{x}'$ (toward $X$), the baseline $t$ (toward the first center), and the rotated ray $R\hat{x}$ (along the ray itself). All three lie in that one plane, so they are coplanar (Figure 2.7.6) — and coplanarity of three vectors is exactly a vanishing scalar triple product:
Writing the cross product $t\times(\cdot)$ as the skew-symmetric matrix $[t]_\times$ turns the triple product into a bilinear form, and that is the matrix:
$E = [t]_\times R$ is the essential matrix — the entire epipolar constraint, in calibrated coordinates, assembled from nothing but the relative pose $(R, t)$.
From $E$ to $F$. In raw pixel coordinates the intrinsics are unknown or unremoved; with $x = K\hat{x}$ and $x' = K'\hat{x}'$ for the two cameras' intrinsics $K, K'$, substitute $\hat{x} = K^{-1}x$ and $\hat{x}' = K'^{-1}x'$:
This is the fundamental matrix $F$: the same constraint when the cameras are uncalibrated. It also closes the loop with the picture — $F x$ is a 3-vector of line coefficients, the epipolar line in the second image, so $x'^{\mathsf{T}}(F x) = 0$ says precisely "the match $x'$ lies on that line."
What this buys, and what we still defer. Everything in two-view geometry flows from this one matrix. Because $E = [t]_\times R$, the relative pose $(R, t)$ of the two cameras can be recovered from $E$ (the translation only up to scale — a single pair of images cannot know the absolute size of the world), and relative poses chained across many views reconstruct both the cameras and the scene. What we still leave to the Advanced part is the estimation and the scaling-up: recovering $F$ or $E$ from noisy point matches (the eight-point algorithm and its robust, RANSAC-wrapped descendants), the rectification math, triangulation (intersecting two verified rays to recover the 3-D point), and the leap to many images — structure from motion and multi-view stereo, and the neural scene representations built on top of them. Carry away the small, durable core: a second viewpoint gives depth; disparity inverts to distance through $Z = fB/d$; and in general a point constrains its match to an epipolar line — $x'^{\mathsf{T}} F x = 0$, with $F = K'^{-\mathsf{T}}[t]_\times R\,K^{-1}$ and its calibrated heart the essential matrix $E = [t]_\times R$.