9.6 Focus, autofocus⧉
A lens has exactly one plane of the scene that it images sharply onto the sensor. Everything closer or farther blurs, gently at first and then badly, into the circles of confusion that FUNDAMENTALS made precise. Focusing is the act of choosing which scene plane that is — sliding the in-focus plane forward or back through the scene until it lands on whatever you care about. Mechanically this is almost trivial: move a piece of glass a fraction of a millimeter. The hard part is the decision. Given the blurry image the sensor sees right now, is the subject in front of the focal plane or behind it, and by how much? A camera that cannot answer the second question can still focus — it just has to guess and check, nudging the lens and watching whether the picture got sharper or worse. A camera that can answer it focuses in a single confident move. The whole chapter is about climbing from the first kind to the second, and then about reusing the machinery — defocus carries depth — to measure the scene's 3D shape.
9.6.1 Focus mechanics⧉
Start with what "focus" physically is. The thin-lens conjugate relation from FUNDAMENTALS, $1/f = 1/u + 1/v$, ties the object distance $u$ to the image distance $v$ at which that object comes to a sharp point. An object at infinity ($u \to \infty$) focuses at $v = f$, right at the focal length; pull the object nearer and $v$ grows — the sharp image moves back, away from the lens. To keep that sharp image pinned on a fixed sensor, then, the camera must move the lens (or a focusing group inside it, or the sensor) forward as the subject comes closer. That motion is focusing, and Figure 1 shows the geometry: focusing slides the system so that the chosen conjugate lands exactly on the sensor.
How far must the lens move? Differentiate the conjugate relation. Near the infinity setting the required image-distance travel per unit of subject closeness is small for a distant subject and grows steeply as the subject approaches — which is why a focus ring's last few degrees cover the macro range and most of its travel is spent on the near field. In numbers it is easiest to think in diopters (reciprocal metres of object distance): a fixed change in subject diopters demands a roughly fixed sensor shift of order $f^2$ times that change, so long lenses need far more focus travel than short ones for the same change in subject distance. This is the unglamorous mechanical fact behind everything that follows: the focus actuator has a finite, calibrated range of motion, and the AF brain must tell it how far and which way to go.
Before autofocus existed, the photographer was that brain, aided by manual-focus aids that make defocus visible to the eye. A ground-glass screen shows the projected image directly — sharp when focused, smeared when not. A split-image rangefinder (the split prism in an old single-lens reflex (SLR) focusing screen) is cleverer: a pair of opposed prisms splits the central spot into two halves that are aligned only at exact focus and offset when defocused — a built-in disparity device, and a foreshadowing of phase detection below. A true rangefinder camera triangulates the subject with a separate optical baseline, a coincidence-image patch that the photographer brings into register by turning the focus ring — literally stereo (cross-ref the FUNDAMENTALS rangefinder sidebar, $Z = fB/d$). Modern electronic finders add focus peaking (high-contrast edges outlined in color, since sharp edges carry the most high-frequency energy — exactly the cue contrast AF will automate) and magnified live view.
The last piece of mechanics is the actuator and its decoupling from the ring. In most modern autofocus lenses the focus ring is not mechanically connected to the glass at all. Turning it sends an electronic signal to a motor — a stepper, an ultrasonic motor (USM), or a linear voice-coil — that drives the focusing group. This is focus-by-wire, and it is what makes fast, silent, programmable focus possible: the motor can slew the whole range in a fraction of a second, hold position without power, respond to a touchscreen tap, and use the lens's own distance encoders and aberration data to correct for focus shift. The decoupling matters for the rest of the chapter because it means the AF system is free to command any lens position instantly — the question is purely what position to command, and that is an algorithm.
9.6.2 Contrast-detection AF⧉
The first algorithm is the most direct, and it needs no special hardware at all — just the imaging sensor the camera already has. The premise is a single observation about defocus: a sharp image contains more high-frequency energy than a blurred one. Blur is a low-pass operation; it smears edges, attenuates fine detail, and flattens the local contrast. So if you measure how much high-frequency content the image has and maximize it over lens position, the maximum occurs exactly at focus. This is contrast-detection autofocus (CDAF).
Concretely, the camera computes a contrast metric over the focus region — a sum of squared image gradients, or the energy passing a high-pass filter, or the variance of the local Laplacian. Any of these rises as the image sharpens and falls as it blurs, tracing out a single-peaked curve as the lens sweeps through focus (Figure 2). Finding focus is then gradient ascent on the sharpness hill: climb until the metric stops increasing, and you are at the top.
Contrast detection's strength is that it is exact and cheap: it reads the very sensor that will take the picture, so there is no separate optical path to calibrate and no front/back-focus error — the metric peaks precisely where the recorded image is sharpest. Its fatal weakness is hidden in the phrase gradient ascent: a single measurement of the metric tells you the image's sharpness but not the slope of the hill. From one blurry frame the camera cannot tell whether focus lies nearer or farther — both directions look equally blurry. So it must hunt: nudge the lens one way, measure again, and infer the direction from whether the metric rose or fell, possibly reversing if it guessed wrong. This trial-and-error is visible as the lens racking back and forth before settling, it is slow, and it fails outright in low contrast or low light, where the hill is so flat that sensor noise swamps the metric and the peak is undetectable. Contrast AF is accurate but blind to direction — which is exactly the gap the next method fills.
9.6.3 Phase-detection AF (the split-pupil / stereo trick)⧉
Suppose that instead of one view of the scene you had two — taken from two different parts of the lens's aperture. Then you would have a stereo pair, and stereo gives you not just that the subject is out of focus but which way and how far, in a single shot. This is the idea of phase-detection autofocus (PDAF), and it is the most important idea in the chapter.
Here is the geometry. Light from a single subject point fills the whole exit pupil of the lens and converges toward the conjugate image point. If the sensor sits exactly at that convergence point, all the rays — from the left of the pupil, the right of the pupil, everywhere — pile onto one spot: in focus. But if the sensor is in front of the convergence point, the cone has not finished closing, and rays from the left half of the pupil land to one side while rays from the right half land to the other; the two half-pupil images are separated. If the sensor is behind the convergence point, the cone has crossed over, and the two half-pupil images separate the other way. The separation between the two half-pupil views is therefore a signed quantity: its sign tells you the direction of defocus (front or back), and its magnitude tells you how far out of focus you are.
A dedicated PDAF module makes those two views explicit. After the main lens, a small separator (secondary) lens — really a pair of tiny lenslets behind a mask — re-images the left and right halves of the exit pupil onto two separate line sensors. When the main lens is focused, the two line patterns line up. When it is defocused, the two patterns slide apart by an offset proportional to the defocus, with a sign set by the defocus direction (Figure 3). The camera correlates the two line signals to read off that lateral offset in one measurement — and from the offset it computes exactly how far and which way to drive the focus motor. No hunting: one read, one move, done. This is why phase detection is fast where contrast detection dawdles.
It is worth stating plainly what this is: PDAF is stereo, performed inside one lens. The two halves of the exit pupil are two viewpoints; the line separating them is a baseline; the lateral offset between the two half-pupil images is a disparity. The same relation that governed two-camera stereo in FUNDAMENTALS, $Z = fB/d$, governs the inside of the lens — here the baseline $B$ is the distance between the centroids of the two pupil halves, and the disparity $d$ is the measured offset, which is directly proportional to the defocus. A camera does not need a second body to do stereo; the aperture is already wide enough to hold two viewpoints, and reading the disparity between them is reading the focus error. Every PDAF variant below is just a different way of capturing those two pupil halves.
The classic implementation lives in the SLR. The main reflex mirror is partly transparent; behind it a small sub-mirror angles a fraction of the light downward to a dedicated PDAF module in the floor of the mirror box, with its separator lenses and line-sensor arrays (often arranged in a cross or a grid of points across the frame). This gives the fast, single-shot focus that made action and sports photography practical with autofocus. Its drawback is structural: the AF sensor is on a separate optical path from the imaging sensor, so any mechanical discrepancy between the two shows up as front- or back-focus error that must be calibrated out per body and per lens. Move the focusing function onto the imaging sensor itself and that calibration problem disappears — which is the next step.
The sub-mirror PDAF module just described is the single-lens reflex's signature trick, and also the reason the reflex design is being retired. The mirror earned its place with two historical advantages: it let the photographer view through the lens optically, and it fed a dedicated, fast phase-detection module. Both advantages are now shrinking. Electronic viewfinders have become good enough — and are needed for video regardless — and on-sensor phase detection (next section) has nearly closed the autofocus gap. The mirror's costs, meanwhile, have not shrunk. It adds moving parts and the per-body, per-lens front/back-focus calibration noted above; it enlarges the body; it imposes the retrofocus tax on wide-angle and large-aperture lenses by forbidding glass near the sensor (cross-ref A short bestiary of classic designs — the mirror box is exactly the standoff a retrofocus design must buy); and, most telling for this book, it occludes the imaging sensor while you focus, so the camera cannot run scene analysis — face and eye detection, subject recognition — on the actual image before the shutter fires. For video the mirror must stay up the whole time, forfeiting every one of its advantages at once. A design whose benefits are met elsewhere and whose costs are paid in size, complexity, lens freedom, and the very sensor access that computational photography depends on is a design on the way out — the reflex camera is yielding to mirrorless, and the autofocus story is migrating wholesale onto the main sensor. (This synthesis follows the author's 2018 essay "The DSLR will probably die" — Sources/Blog/the-dslr-will-probably-die-are-mirrorless-the-future-of-large-standalone-cameras.md, Blog Index.)
9.6.4 On-sensor PDAF and dual-pixel AF⧉
Phase detection's payoff — signed, single-shot focus — is too good to leave stranded on a separate module, especially once cameras went mirrorless and lost the sub-mirror entirely. The fix is to capture the two pupil halves on the main imaging sensor, so the same pixels that take the picture also report phase. There are two ways to do it.
The first is masked (phase-detect) pixels. Scatter a sparse set of special pixels across the imaging array, each with half of its light-collecting area masked off — some masked on the left, some on the right. A left-masked pixel sees only light coming from the right half of the pupil, and vice versa; pairing nearby left- and right-masked pixels recovers the two half-pupil views directly on the recording sensor. This brings PDAF to mirrorless and live-view shooting with no separate module and no separate calibration. The cost is that masked pixels are partially blind — they collect less light and must be interpolated over in the final image, like defective pixels — so they are kept sparse, which limits how dense and low-light-capable this flavor of on-sensor PDAF can be.
The second way wastes no light at all: dual-pixel autofocus. Instead of masking, split every photosite into two side-by-side half-photodiodes sharing one microlens (Figure 4). The microlens — the same per-pixel microlens stack introduced in FUNDAMENTALS — focuses the incoming cone so that the left half-diode sees the left half of the pupil and the right half-diode sees the right half. Each pixel is therefore its own miniature phase-detect pair, and every pixel participates. Compare the left-diode image to the right-diode image across the frame and you get a dense disparity signal — phase detection at every point — which is why dual-pixel AF is fast and tracks well across the whole frame. For the final photograph the two halves are simply summed into a normal pixel value, so no light and no resolution is thrown away.
The disparity these dense methods produce is not only a focus signal but a low-resolution depth signal. Because the baseline is just the small split across a single lens's pupil, the disparities are tiny and the depth map is coarse and noisy — far weaker than a wide-baseline two-camera rig. But it is real depth, captured by the main sensor in one exposure, and it is exactly what feeds portrait-mode background blur and small post-capture refocus on phones (forward-ref to the computational-DoF treatment in Depth of field and the light-field part). The same disparity that aims the lens also begins to measure the scene — which brings us to the explicitly computational siblings of autofocus.
9.6.5 Depth from focus / defocus, and learned subject AF⧉
Autofocus reads defocus to remove it. Turn the question around — read defocus to measure the scene — and the same physics becomes a depth sensor. Two classic recipes do this.
Depth from focus (DfF) sweeps the lens through its whole focus range while capturing a stack of images, then asks, for each pixel independently, at which focus setting was it sharpest — using the very contrast metric from contrast AF, evaluated per pixel. The focus position of each pixel's sharpness peak maps directly to that pixel's depth, since each focus setting corresponds to one in-focus distance. It needs many exposures (one per focus step) but no model of the blur. This is precisely the machinery behind focus stacking, which keeps the sharpest pixel from each slice to build an all-in-focus image rather than a depth map — the full treatment lives in Depth of field.
Depth from defocus (DfD) is the frugal cousin: rather than sweeping the entire range, take just two or a few images at different focus or aperture settings and infer, per pixel, the blur radius of the defocus disk. The geometry from FUNDAMENTALS' depth-of-field section ties that blur radius to distance-from-focus and aperture, so estimating the blur recovers the depth — from a handful of shots instead of a full stack (Figure 5). The foundational work is Pentland's 1987 method and Subbarao's contemporaneous analysis; both estimate the local point-spread function (PSF) and solve for the depth that explains it. That defocus PSF is also the bridge to deconvolution — if you know the blur, you can both read depth from it and undo it — tying DfD to the deblurring and aberration-correction machinery in Aberrations correction.
There is a third use of the same physics, halfway between measuring depth and aiming the lens: depth from defocus as autofocus, the trick behind Panasonic's DFD (Depth From Defocus) system. Rather than build a depth map at all, it uses defocus to drive focus. The camera carries a stored model of how a given lens's blur PSF changes with focus, captures two frames at slightly different focus settings, and compares them against that model to infer not just that the image is defocused but the direction and amount of the defocus — a PDAF-like, near-single-step cue extracted from an ordinary contrast-detect sensor with no phase pixels on the array. In effect it recovers phase detection's signed error from defocus comparison alone. It rides on top of CDAF — confirming focus at the sharpness peak — and is especially strong for video, where the smooth, directed pull beats contrast AF's visible hunting. Its weakness is the dependence on a per-lens blur profile (so it works only with characterized lenses) and on the subject holding still between the two frames.
The last rung of the ladder moves the decision of what to focus on from optics into recognition. Everything so far answers "how do I make this point sharp?" — but choosing which point is the photographer's real intent, and modern cameras increasingly learn it. Subject and eye autofocus runs a detector — for human faces and eyes, for animals (their eyes specifically), for birds, vehicles, and trains — over the live feed, picks the subject (a person's near eye, say), tracks it as it moves, and then drives the underlying PDAF to focus on exactly that point. The focusing physics is unchanged; what is new is that AF's decision layer is now a recognition problem, solved by the learned-detection methods developed in the computational-tools part and the big-data part. Alongside this learned, passive path sit the active rangefinders cameras have used over the years to assist focus rather than choose it: early infrared and ultrasonic systems (the Polaroid sound-ranging (SONAR) camera famously pinged the subject with ultrasound), and today's LiDAR / time-of-flight modules on phones that supply a direct depth measurement to seed and confirm focus in the dark, where contrast and phase signals both fade. The ladder thus ends where the book's big themes converge: optics provides the signal, computation reads it, and a learned model decides what the picture is of.
9.6.6 Focusing in astrophotography⧉
Every method in this chapter assumes a subject with contrast and light to spare. The night sky has neither, which makes astrophotography the one regime where the whole AF ladder simply collapses — and where, paradoxically, focus must be set more precisely than ever. Stars are dim point sources at infinity: there is almost no local contrast for contrast detection to climb, and far too few photons for phase detection to correlate. So astro focus is essentially manual, and unforgiving — a star is the ultimate focus target, an unresolved point that betrays the smallest error, so the circle of confusion you can tolerate is tiny (cross-ref Depth of field).
The obvious shortcut — just rack the lens to its infinity stop — does not work. Most lenses are built to focus past infinity (to leave room for thermal expansion and autofocus overshoot), so the "∞" hard stop is not the in-focus position. Worse, the true infinity setting drifts during a session: as the night cools, the barrel contracts and the focus shifts, so a rig that was sharp at dusk needs refocusing as the temperature drops. Focus has to be found empirically, and re-found.
The practical methods all turn a star into a visible, nullable focus cue. The simplest is to put a bright star in magnified live view and adjust until it shrinks to the tightest point. The classic aid is the Bahtinov mask — a grid placed over the aperture whose slats throw diffraction spikes that cross as an X with a central spike running through the middle; the central spike sits exactly centred between the two arms of the X only at perfect focus, giving a precise, eyeballable null that is far more sensitive than judging a blob's size. Software automates the same idea by measuring a star's size directly — its FWHM (full width at half maximum) or HFD (half-flux diameter) — and on deep-sky rigs a motorized focuser runs an autofocus routine that steps the focuser and minimizes star size versus position. That last technique is, in essence, contrast-detection AF — climbing the same sharpness hill — but done deliberately and slowly, with a metric chosen for point sources, rather than the instant grab a daytime camera attempts. The daytime AF families above (Section Contrast-detection AF onward) simply do not transfer: there is no contrast hill to climb in real time and no light to split into pupil halves (cross-ref FUNDAMENTALS Focus for the practical operator's view).