💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

9.11 Fake (synthetic) depth of field

A large-sensor camera with a fast lens isolates its subject by physics: the background falls outside a shallow depth of field and dissolves into the soft, structured blur the last chapter dissected. A phone cannot do this. Its sensor is fingernail-sized and its lens is correspondingly short, and — by the sensor-size scaling FUNDAMENTALS derived and Depth of field restated — that combination gives it an enormous depth of field in which almost everything is sharp at once. For snapshots this is a gift; for portraits it reads as flat, busy, un-cinematic. So the phone does in software what the glass cannot do in hardware: it estimates the scene's depth (or at least which pixels are the subject), then synthesizes the background blur, growing the blur with defocus and shaping it like a real aperture. This is fake, or synthetic, depth of field — the ubiquitous "portrait mode." It is the computational counterpart to the optical depth of field of the previous chapter, and the whole of it is a pipeline whose stages we take in turn: why phones must fake it, where the depth comes from, how depth becomes a blur radius, why a Gaussian betrays the fake, how occlusion is handled, and where it all breaks.

9.11.1 Why phones must fake it

The root cause is geometric, and it was settled two chapters ago: depth of field scales with sensor size. Shrink the sensor while keeping the field of view fixed and you must shorten the focal length to match, and both moves deepen the depth of field — so a phone's tiny sensor and short lens conspire to keep the entire scene, from a near hand to a far wall, comfortably inside the band of acceptable sharpness. A full-frame camera at f/1.8 throws a portrait's background into a soft wash; a phone framing the identical portrait holds that background nearly as crisp as the face. There is no aperture setting on the phone that fixes this, because the limitation is not the aperture — it is the sensor. The phone is, optically, a deep-depth-of-field instrument, and no amount of opening up will make it a shallow one.

What photographers want from a portrait, though, is exactly the opposite: the subject isolated, lifted off a soft background that does not compete for attention. Since the optics refuse to deliver it, the phone computes it. "Portrait mode" estimates where each pixel sits in depth — or, more cheaply, simply which pixels belong to the subject and which to the background — and then blurs everything that is not the subject by an amount that grows with its distance from a chosen virtual focal plane, painting in a believable aperture-shaped blur as it goes (Figure 9.11.1). The output is a picture that looks as if it came off a large sensor and a fast lens, assembled from one that did neither. This is the computational sibling of the optical Depth of field chapter: same target look, opposite means.

fig-portrait-mode-pipeline
Figure 9.11.1. The portrait-mode pipeline on one photograph. Four panels left to right: (1) the input, a deep-depth-of-field phone capture in which the subject and the cluttered background are both sharp; (2) the estimated per-pixel depth (or subject/background) map, shown as a grayscale layer, near = bright, far = dark; (3) the depth-aware blur, each background pixel splatting an aperture-shaped disk whose radius grows with its distance from the virtual focal plane while the subject stays untouched; (4) the composited result, the crisp subject sitting on a smoothly blurred background with bloomed highlight disks. A caption note marks the subject silhouette as the seam where matting must be exact.

9.11.2 Where the depth (or subject) comes from

A synthetic blur is only as honest as the depth that drives it, and there are four places a phone can get that depth, in rough order of hardware cost. The first is dual-pixel disparity. As Focus, autofocus explained, a dual-pixel sensor splits every photosite into two half-photodiodes under one microlens, so the left and right halves see the left and right halves of the lens's pupil — a stereo pair with a baseline of a few millimetres inside a single lens. Correlating the left-half image against the right-half image gives a dense, if shallow, per-pixel disparity, and disparity is defocus is depth. This is precisely the signal Wadhwa et al. exploited in their 2018 Synthetic Depth-of-Field with a Single-Camera Mobile Phone — the basis of Google's single-camera Portrait mode, which extracts a usable depth map from one lens by reading the dual-pixel split.

The second source is multi-camera stereo: a phone with two rear lenses has a real (if small) baseline between them, and the classic two-view matching of Sparse matching yields a depth map — subject to the usual headaches of textureless regions and half-occluded pixels that any stereo system fights. Because the baseline is short and the match must run on a power budget, fast approximate solvers matter; Barron et al.'s 2015 Fast Bilateral-Space Stereo is the canonical move here, solving the stereo (and the subsequent edge-aware depth smoothing) in a bilateral space so that the depth snaps to image edges cheaply enough to run on a phone. The third source needs no special hardware at all: monocular depth networks that predict depth from a single image, the subject of Monocular depth estimation (one image → depth). They are metrically loose — they recover relative depth far better than absolute distance — but for a blur whose only job is to look plausible, relative depth ordering is most of what matters.

The fourth source is the pragmatist's shortcut, and the one most phones actually ship: segmentation and matting. Often a full depth map is overkill, because the photographer only wants "the person sharp, everything else soft." A clean subject mask — a person-segmentation network, refined into a soft alpha matte as in Compositing, segmentation and matting — suffices to blur everything that is not the subject, with no per-pixel depth at all. The cost is that the background is then blurred uniformly rather than progressively by distance, so a far wall and a near shoulder behind the subject get the same blur; the gain is robustness and speed. In practice a production portrait mode blends these sources — dual-pixel or stereo depth where it is reliable, a learned matte to nail the subject's silhouette, a monocular net to fill the gaps — because each is strong exactly where the others are weak.

9.11.3 From depth to blur — the thin-lens circle

Given a depth (or subject) map, the blur itself is the honest part of the pipeline, because the optics it imitates are simple and exactly known. Choose a virtual focal plane at depth $z_f$ — usually the subject's distance — and give each pixel a blur kernel whose radius grows with how far that pixel's depth $z$ departs from the focal plane. This is the same circle-of-confusion law that governs a real lens, imported wholesale from Depth of field and FUNDAMENTALS:

$$ c(z) \;\propto\; \left| \frac{1}{z_f} - \frac{1}{z} \right|. $$

The proportionality constant rolls up the chosen virtual aperture and focal length — it is the knob the "blur strength" slider turns. The structure of the formula is what matters: a pixel at the focal plane ($z = z_f$) gets zero blur and stays razor sharp; a pixel far behind the subject gets a large kernel; and because of the reciprocals, the blur grows asymmetrically — distant background blurs more aggressively than equally-displaced foreground, exactly as a real lens renders it. A phone can evaluate this per pixel trivially fast.

That is the whole of the easy part, and it is worth being clear that everything hard about portrait mode is around this equation, not in it. The circle-of-confusion law tells you the radius of each pixel's blur with optical exactness; it tells you nothing about the shape of the kernel (the next section), nothing about what happens at the silhouette where a sharp subject meets a blurred background (the section after), and it presupposes a depth map whose accuracy at edges is precisely where the depth sources above are least trustworthy. The thin-lens circle is the one stage of the pipeline that cannot really go wrong — which is exactly why the tells of a fake never come from it.

9.11.4 Realistic bokeh — why a Gaussian looks fake

The single most common giveaway of a cheap synthetic blur is that it is a Gaussian. A Gaussian blur is a convolution with a soft, bell-shaped kernel: it is cheap, separable, and utterly wrong-looking, for two reasons. First, it has no shape — a real out-of-focus highlight, as Depth of field established, is the image of the aperture, a hard-edged round (or polygonal) disk, not a fuzzy smudge that fades to nothing at its edge. Second, and more damning, a Gaussian dims bright points: it spreads a highlight's energy over a wide soft footprint so the peak drops, whereas a real lens turns a clipped highlight into a bloomed bright disk that holds its brightness right to its hard rim. A background string of fairy lights, through real glass, becomes a constellation of glowing coins; through a Gaussian, it becomes a row of dim grey puddles. The eye knows the difference instantly even when it cannot name it (Figure 9.11.2).

fig-gaussian-vs-bokeh
Figure 9.11.2. Why a Gaussian looks fake. The same defocused background — a scatter of bright point highlights and some texture — rendered two ways. Left: a flat Gaussian blur — uniform mush, no disk shape, highlights smeared into dim soft blobs that have lost their brightness. Right: a true aperture-disk bokeh — each highlight a hard-edged round disk, the clipped bright points bloomed into glowing coins that keep their intensity to the rim, polygonal disks where a bladed aperture is simulated. An inset shows the two kernels side by side: a soft bell vs. a flat-topped disk.

Getting the disks right turns on how the blur is computed, and the choice is between scatter and gather. A scatter (or splatting) renderer walks the source pixels and, for each one, splats an aperture-shaped disk of the right radius onto the output, accumulating overlapping disks. Crucially the splat must be energy-correct: each source pixel deposits a disk of total weight one, so its energy is spread but conserved, and a saturated highlight — which carried clipped, abnormally concentrated energy — splats into a disk that stays bright across its whole face. This is what produces blooming for free and renders the hard disk edge faithfully. The price is cost: scatter is a write-heavy, unpredictable-memory operation that maps awkwardly onto a GPU. A gather renderer instead walks the output pixels and, for each, averages a depth-dependent neighbourhood of the input — cheap, regular, GPU-friendly, but structurally bad at exactly the things scatter does well, because averaging in dims highlights and a gather kernel cannot easily respect depth discontinuities. Production portrait modes pick a point on this trade-off — often a gather pass for speed with special handling to re-brighten highlights and patch edges — and the AIM "Bokeh Effect Rendering" challenges (Ignatov et al.) pushed the frontier further still, training networks to learn the whole rendering end-to-end and reproduce a real wide-aperture lens's bokeh from a phone's deep-depth-of-field input.

9.11.5 Occlusion-aware compositing and matting

Here the load-bearing caveat from Depth of field comes due: defocus is not a 2-D convolution. A real lens does not blur the flat image; it blurs the scene, in depth, and at every depth discontinuity the geometry is subtle. A sharp foreground subject must occlude the background behind it — the background's blur grows right up to the subject's silhouette and stops, while the crisp subject sits cleanly on top. Run a naive whole-image blur instead, and the subject's own sharp edge gets dragged outward into the blur, painting a tell-tale bright or dark halo around the person — the single most recognizable artifact of a bad portrait fake (Figure 9.11.3).

fig-occlusion-aware-blur
Figure 9.11.3. Occlusion-aware blur vs. naive convolution. Left: a sharp foreground subject over a background, blurred by a naive 2-D convolution of the whole image — the subject's hard edge bleeds into the blur, leaving a bright/dark halo ringing the silhouette. Right: the same scene rendered depth- and occlusion-aware — the background blurs only up to the subject's silhouette, the crisp subject occludes it, and the edge is clean. A zoom inset on the shoulder shows the halo in one and its absence in the other.

The fix is to render in depth order, as layers, rather than convolving the image as a single sheet. Separate the scene into a sharp subject and a background (the matte does this), blur only the background up to the subject's silhouette, then composite the untouched sharp subject on top — so the subject's edge is never smeared into the blur. This exposes the one place where synthetic depth of field can never equal the real thing: disocclusion. When the foreground edge softens, a real lens reveals a sliver of background that was hidden directly behind the sharp subject — but the phone never captured that sliver, because it has no light field, only a single all-in-focus image. That background has no real data; it must be inpainted or guessed. This is the irreducible gap between fake and optical depth of field: the phone is filling in geometry it never saw.

And the matte itself is the battleground, because the silhouette is rarely a clean line. Hair, fur, foliage, and glass defeat a hard binary mask — a strand of wind-blown hair is half subject, half background within a single pixel — so portrait modes lean on alpha matting: a soft, fractional foreground coverage per pixel that lets the blur blend partially through wisps of hair. Abuolaim & Brown's 2020 work on defocus from dual-pixel data is instructive here, recovering a defocus map directly from the dual-pixel views to drive exactly this kind of depth-aware deblurring and rendering. Even so, alpha matting is hard, and it leaves residue — which is the cue for the failure gallery.

9.11.6 Failure modes

It is worth naming the classic tells, because once you can see them you cannot unsee them in a friend's phone portrait (Figure 9.11.4). The first is the edge halo: a matting error around fine, high-frequency boundaries — wind-blown hair, a frizzy sweater, backlit foliage — where the matte cannot decide subject from background and leaves either a soft halo of background bleeding into the subject or a fringe of sharp subject stranded in the blur. The second is the flat, shapeless blur: a Gaussian instead of an aperture disk, dimming the highlights and erasing disk structure, so the background looks smeared rather than defocused. The third is the wrong subject — the segmentation grabs the railing instead of the person, or sharpens the coffee cup and blurs the face — a depth/segmentation failure that no amount of pretty bokeh can rescue. The fourth is foreground bleed: a sharp hand or shoulder closer than the focal plane gets softened into the background because the depth map put it on the wrong side of $z_f$.

fig-portrait-failure-modes
Figure 9.11.4. A gallery of portrait-mode tells. Four crops, each a classic failure: (1) a matting halo around wind-blown hair, where the soft matte leaks background into the strands; (2) the wrong subject blurred — a foreground railing left sharp while the person behind it is softened, the segmentation having grabbed the rail; (3) a flat, shapeless Gaussian blur with dim, structureless highlights instead of bokeh disks; (4) foreground bleed, a sharp hand reaching toward the camera mistakenly blurred into the background by a depth error. Each crop is annotated with the failing pipeline stage.

The honest summary is the chapter's whole thesis in one line: portrait mode is a depth-estimation + matting + plausible-blur stack, and a composited result is only ever as good as its weakest stage — which, in practice, is almost always the matte at fine edges. The depth-to-blur circle is exact; the bokeh can be made gorgeous; but a single misjudged strand of hair or a railing mistaken for a subject undoes all of it, because the eye polices silhouettes ruthlessly. This is the deepest lesson of the chapter, and it generalizes far beyond portrait mode to every composited, multi-stage computational image: the picture inherits the failures of its frailest component.

💡 The big lesson

Defocus is not a 2-D convolution — and "portrait mode" is not a blur filter. A real lens blurs the scene in depth, with the blur radius set by each point's distance from the focal plane and with occlusion bookkeeping at every silhouette. Faking that look is therefore a depth + matting + plausible-blur pipeline, not a single Gaussian: estimate where each pixel is, matte the subject cleanly, grow an aperture-shaped, energy-preserving blur with the circle-of-confusion law $c(z)\propto|1/z_f - 1/z|$, and composite in depth order so the sharp subject occludes the blurred background. The result is only as convincing as its weakest stage — usually the matte at hair and fine edges — and it can never fully equal real optics, because the background hidden behind the subject's softened silhouette was never captured and must be guessed.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 The big lesson

Defocus is not a 2-D convolution — and "portrait mode" is not a blur filter. A real lens blurs the scene in depth, with the blur radius set by each point's distance from the focal plane and with occlusion bookkeeping at every silhouette. Faking that look is therefore a depth + matting + plausible-blur pipeline, not a single Gaussian: estimate where each pixel is, matte the subject cleanly, grow an aperture-shaped, energy-preserving blur with the circle-of-confusion law $c(z)\propto|1/z_f - 1/z|$, and composite in depth order so the sharp subject occludes the blurred background. The result is only as convincing as its weakest stage — usually the matte at hair and fine edges — and it can never fully equal real optics, because the background hidden behind the subject's softened silhouette was never captured and must be guessed.