3.13 Auto-exposure and auto white balance

A camera makes two guesses about light on every shot, and both are under-determined — they must recover intent from pixels that confound the answer. Auto-exposure (AE) decides how much light to collect; auto white balance (AWB) decides what color the light was so it can be divided out. Neither has a labelled ground truth in the frame, so each leans on assumptions, and the history of both is the story of those assumptions getting smarter: a flat average, then a zone/pattern model, then a learned one.

3.13.1 Auto-exposure: metering

Before the shutter even opens, the camera must choose aperture, shutter, and ISO so the scene lands at a sensible brightness — the metering problem. The reference is the middle-grey target (≈18% reflectance) from the exposure discussion in Photography and camera 101: the meter tries to place the scene's representative tone there. The catch is exactly the one white balance hits below — which pixels are representative? — and metering modes are a ladder of ever-smarter answers.

Average and center-weighted metering. The simplest meter averages the whole frame to one number and exposes so that average is middle grey. It is fooled by anything that is not average: a bright sky pulls the average up, so the meter under-exposes and the foreground falls into shadow (the classic backlit-subject silhouette); a snow field reads too bright and is rendered a dingy grey. Center-weighted metering improves the guess by weighting the middle of the frame most, on the assumption that the subject is usually near the center.
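The two meters above can be sketched in a few lines, assuming a linear-light luminance image with values in [0, 1]. The Gaussian weighting, the `sigma_frac` parameter, and the toy backlit scene are illustrative assumptions, not any particular camera's weighting.

```python
import numpy as np

MIDDLE_GREY = 0.18  # target linear-light value for the metered tone


def metering_gain(luma, weights=None):
    """Exposure gain that places the (weighted) mean of `luma` at middle grey."""
    metered = np.average(luma, weights=weights)
    return MIDDLE_GREY / metered


def center_weights(shape, sigma_frac=0.25):
    """Gaussian weights peaking at the frame centre (hypothetical weighting)."""
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2, (w - 1) / 2
    sigma = sigma_frac * min(h, w)
    return np.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))


# Toy backlit scene: bright sky on top, mid-tone ground, dark subject in the centre.
scene = np.full((100, 100), 0.2)
scene[:50, :] = 0.9
scene[35:65, 35:65] = 0.05

g_avg = metering_gain(scene)
g_cw = metering_gain(scene, center_weights(scene.shape))
print(f"average gain: {g_avg:.3f}, centre-weighted gain: {g_cw:.3f}")
```

On this scene the centre-weighted gain comes out larger than the average gain, because the dark subject counts for more, but the bright sky still pulls both meters well below the gain the subject alone would need.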

Matrix / evaluative / pattern metering. Modern cameras divide the frame into a grid of zones and combine them intelligently rather than flatly. They measure each zone's brightness, the contrast between zones, which zone holds the focus point, and sometimes color, and feed those into a model, historically a lookup against a database of tens of thousands of reference scenes. Nikon's 3D Color Matrix metering folds in color and the distance the lens reports (hence "3D"); Canon's evaluative metering and others are cousins. The model recognizes patterns like "bright top zone, darker bottom, subject in focus near the center → backlit landscape" and biases the exposure to protect the subject. Pattern/evaluative metering is the default mode on most modern cameras.

Spot metering is the opposite extreme: meter one small zone and place that tone at middle grey. It hands the photographer exact control over which tone is anchored — the basis of Ansel Adams's Zone System — at the cost of having to choose wisely.
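A minimal sketch of spot metering, again assuming linear-light luminance in [0, 1]. The circular spot, the function name `spot_meter`, and the toy scene are assumptions made for illustration. It returns the gain and the fraction of pixels pushed to clipping.

```python
import numpy as np


def spot_meter(luma, center, radius, target=0.18):
    """Meter a small circular zone and place its mean tone at `target`.

    Returns (gain, fraction of pixels at or above 1.0 after the gain).
    """
    h, w = luma.shape
    y, x = np.mgrid[0:h, 0:w]
    mask = (y - center[0]) ** 2 + (x - center[1]) ** 2 <= radius ** 2
    gain = target / luma[mask].mean()
    clipped = float(np.mean(luma * gain >= 1.0))
    return gain, clipped


# Toy backlit scene: bright sky on top, mid-tone ground, dark subject in the centre.
scene = np.full((100, 100), 0.2)
scene[:50, :] = 0.9
scene[35:65, 35:65] = 0.05

gain_subject, clip_subject = spot_meter(scene, center=(50, 50), radius=5)
gain_sky, clip_sky = spot_meter(scene, center=(10, 50), radius=5)
print(gain_subject, clip_subject)  # subject placed correctly, sky clips
print(gain_sky, clip_sky)          # sky placed at grey, nothing clips
```

Moving the spot from the subject to the sky changes only where the meter looks, yet the gain drops by more than an order of magnitude, which is the trade the photographer is choosing.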

fig-ae-metering
Figure 3.13.1. Auto-exposure metering modes (interactive). Pick a scene and a metering mode and the demo scales the exposure (a multiply in linear light) so the metered region averages to ≈18% grey, reporting the metered tone, the gain, and the fraction of highlights it clips. Average meters the whole frame — a bright sky or water drags the average up, the gain comes out small, and the subject falls dark. Centre-weighted biases toward the middle but is still pulled by a big bright surround. Spot meters one small zone you can move and resize — it places that zone correctly at the cost of clipping the bright background. Same scene, same target — only where you meter changes, and the exposure with it. Try the backlit sunset scene, then move the spot from the dark foreground to the bright sky.

Learned metering. The newest cameras and phones replace the hand-built zone model with a machine-learned one: a small network looks at a downsampled frame (often plus a semantic segmentation — sky, face, backlight) and predicts the exposure a good photographer would pick, trained on large sets of human-rated images. Phones go further and choose exposure jointly with their burst / HDR pipeline, deliberately under-exposing each frame to protect highlights and then recovering the shadows by stacking (→ multiple-exposure imaging) — leaning on the affine-and-truncated noise model from the noise chapter to know how far the shadows can be pushed. Auto-exposure has thus traveled the same arc we are about to see for auto white balance: one hand-coded statistic → a multi-zone pattern model → a learned estimator, all guessing intent from pixels.

Auto white balance is the color twin of this exposure guess, and it climbs the very same ladder.

3.13.2 White balance and color constancy

We arrive at the camera's hardest color guess. The light that reaches the sensor is the product of the illuminant spectrum and the surface reflectance — so a white shirt under tungsten light sends warm, orange-ish light to the sensor, and the very same shirt under open shade sends cool, bluish light. Yet to a human the shirt looks white under both, because the visual system discounts the illuminant — the color constancy of the perception chapter. A camera has no such automatic constancy; it must estimate the illuminant and divide it out, and that operation is white balance (Figure 3.13.2).

[figure fig-wb-before-after not built]
Figure 3.13.2. White balance, before and after. The same neutral object photographed under a warm (tungsten) and a cool (shade) illuminant looks orange and blue before correction; after white balancing, the neutral reads neutral under both and the surface colors agree. White balance is the camera's stand-in for the eye's color constancy.

The standard model is beautifully simple. Von Kries adaptation says: correct each color channel by an independent gain,

$$ R' = k_R\,R, \qquad G' = k_G\,G, \qquad B' = k_B\,B, $$

a diagonal $3 \times 3$ transform (Figure 3.13.3). The gains are chosen so that a known neutral in the scene comes out neutral ($R' = G' = B'$). This is the same diagonal-adaptation idea as the Bradford transform in color management, and it is the camera's stand-in for the adaptation the cones perform biologically, acting on the three channel responses rather than on the spectrum itself (Land and McCann's Retinex, 1971, is the influential model of how the visual system might compute it). A general $3 \times 3$ matrix can do a little better than the pure diagonal when channels interact, but the diagonal von Kries gain is the workhorse.

fig-von-kries
Figure 3.13.3. Von Kries white balance: three independent channel gains. The correction is a diagonal map — scale R, G, and B by separate constants $k_R, k_G, k_B$ — chosen so a scene neutral becomes equal in all three channels. Geometrically it stretches the color cube independently along each axis. Simple, cheap, and a surprisingly good model of the eye's own adaptation.
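As a concrete sketch of the diagonal correction, the gains can be computed from a measured neutral patch. The patch values are made up, and normalising so that $k_G = 1$ is a common convention assumed here rather than something the model requires.

```python
import numpy as np


def von_kries_gains(neutral_rgb):
    """Gains k_R, k_G, k_B that map a measured neutral to R' = G' = B'.

    Normalised so k_G = 1 (an assumed convention; any common scale works).
    """
    r, g, b = neutral_rgb
    return np.array([g / r, 1.0, g / b])


def apply_gains(image, gains):
    """Per-channel multiply; equivalent to image @ diag(gains)."""
    return image * gains


# Hypothetical linear-light reading of a grey card under warm light.
neutral = np.array([0.60, 0.40, 0.20])
k = von_kries_gains(neutral)
balanced = apply_gains(neutral, k)
print(k, balanced)
```

Broadcasting the three gains over the last axis is exactly the diagonal $3 \times 3$ map, so the same `apply_gains` call works on a single patch or on a whole H x W x 3 image.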

One subtlety dictates where in the pipeline this correction belongs. White balance is a per-channel multiply, and a scaling commutes with a linear map but not with a non-linear one. Apply the same gains $k_R, k_G, k_B$ to linear-light values and you simply rescale the channels, as intended; apply them instead to gamma- or tone-curve-encoded (non-linear) values and the result is wrong — the gains no longer merely neutralize the cast but also shift hue and change apparent saturation (Figure 3.13.4). The reason is exactly the additive-vs-multiplicative lesson from the encoding section: a multiply is only faithful in the space where the arithmetic is linear. So white balance, like exposure (another multiply-in-linear-light), belongs early, in the linear part of the pipeline, before the gamma curve is applied — get the order wrong and you tint the very colors you were trying to correct.

fig-wb-linear-vs-nonlinear
Figure 3.13.4. White balance must be done in linear light. The same von Kries gains $k_R, k_G, k_B$ are applied two ways: to linear-light RGB (left), where they cleanly neutralize the cast, and to gamma-encoded RGB (right), where the identical gains additionally shift hue and change apparent saturation because a scaling does not commute with the non-linear encoding. The error is a visible color twist, not just a brightness change — which is why white balance lives early, in the linear pipeline.
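The ordering argument can be checked numerically. This sketch assumes a pure power-law encoding with exponent 2.2 as a simplified stand-in for a real gamma or tone curve, and a made-up neutral patch; both are illustrative assumptions.

```python
import numpy as np

GAMMA = 2.2  # assumed pure power-law encoding, a simplification of real curves


def encode(linear):
    """Linear light -> gamma-encoded values."""
    return np.clip(linear, 0.0, 1.0) ** (1.0 / GAMMA)


# Hypothetical linear-light reading of a neutral patch under a warm cast.
neutral_linear = np.array([0.60, 0.40, 0.20])
k = neutral_linear[1] / neutral_linear  # gains that neutralise it in linear light

right = encode(neutral_linear * k)   # balance in linear light, then encode
wrong = encode(neutral_linear) * k   # encode first, then apply the same gains

print("linear then encode:", right)  # equal channels: neutral
print("encode then gains: ", wrong)  # unequal channels: a new cast
```

With a pure power law, multiplying encoded values by $k$ is the same as multiplying linear values by $k^{2.2}$, so the same gains over-correct and the patch swings from a warm cast to a blue-heavy one. With a piecewise or tone-mapped curve the error is no longer even a clean per-channel gain.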

3.13.3 Automatic white balance

In practice the camera has no labelled neutral patch, so it must guess the illuminant from the image itself — automatic white balance (AWB). The classic assumptions are statistical. Grey-world assumes the scene averages to grey, so the channel gains are

$$ k_c = \frac{\text{mean grey}}{\text{mean}_c}, $$

scaling each channel so the means agree (Figure 3.13.5). The bright-pixel (white-patch) assumption instead takes the brightest pixels to be a white highlight reflecting the illuminant directly, and balances on those. Both fail in predictable ways — grey-world tips on a scene dominated by one color (a forest, a red wall), bright-pixel tips on a colored specular highlight.

fig-awb-greyworld
Figure 3.13.5. Automatic white balance, as results. Top — recovering a cast: a neutral scene given a warm tungsten cast (a per-channel multiply in linear light) is corrected by both grey-world ($k_c=\bar g/\bar c$, assume the scene averages to grey) and white-patch ($k_c\propto 1/\,\text{p99}(c)$, assume the brightest pixels are white) — both undo the cast cleanly when the scene really is, on average, neutral. Bottom — grey-world's failure: on a scene dominated by one true colour (a red flower bed), grey-world misreads the red as a cast and drains it, while white-patch — anchored on the bright neutral highlights — keeps the colour. The assumption is the algorithm; when the scene breaks it, the estimate breaks.
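Both classic estimators fit in a few lines. This is a sketch under assumptions: "mean grey" is taken to be the mean of the three channel means, white-patch uses the 99th percentile per channel, and the two test scenes are synthetic.

```python
import numpy as np


def grey_world_gains(img):
    """k_c = (mean of channel means) / mean_c, assuming the scene averages to grey."""
    means = img.reshape(-1, 3).mean(axis=0)
    return means.mean() / means


def white_patch_gains(img, percentile=99):
    """k_c proportional to 1 / p99(c), assuming the brightest pixels are white."""
    bright = np.percentile(img.reshape(-1, 3), percentile, axis=0)
    return bright.max() / bright


# Case 1: a truly neutral scene under a warm cast (per-channel multiply).
rng = np.random.default_rng(0)
grey = rng.uniform(0.05, 0.9, size=(64, 64, 1))
cast = np.array([1.0, 0.7, 0.4])
neutral_scene = np.repeat(grey, 3, axis=2) * cast
gw_fixed = neutral_scene * grey_world_gains(neutral_scene)
wp_fixed = neutral_scene * white_patch_gains(neutral_scene)

# Case 2: a red-dominated scene with a band of bright neutral highlights, no cast.
red_scene = np.tile([0.8, 0.2, 0.2], (64, 64, 1))
red_scene[:8] = 0.95
gw_red = grey_world_gains(red_scene)
wp_red = white_patch_gains(red_scene)
print("grey-world on red scene: ", gw_red)  # pulls red down, pushes G and B up
print("white-patch on red scene:", wp_red)  # leaves the channels alone
```

In the first case both estimators return equal channels at every pixel. In the second, grey-world treats the red content as a cast while white-patch, anchored on the neutral highlights, returns unit gains.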

Modern cameras go well beyond these. A common middle ground is to regress the illuminant from image statistics with a learned $3 \times 3$ correction, and the current best methods are machine-learned color constancy — a small network or a learned histogram model (Barron's Fast Fourier Color Constancy, 2017, recasts the estimate as a convolution over a log-chroma histogram). These learn the priors that grey-world and bright-pixel only crudely assume, but they are still solving the same under-determined problem: recover two numbers (the illuminant's chromaticity) from an image that confounds light and surface.

The key reframing is to stop thinking of the image as a picture and start thinking of it as a histogram of log-chroma: bin every pixel by the ratios of its channels (how red-to-green, how blue-to-green it is), and a global change of illuminant simply translates that whole histogram rigidly across the log-chroma plane. Estimating the illuminant is then nothing but finding where the histogram has been shifted to — a localization problem — which a learned filter can solve as a single convolution, made fast by doing it in the Fourier domain. The figures below walk through the method.
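The translation property is easy to verify. The sketch below assumes the coordinates $u = \log(G/R)$ and $v = \log(G/B)$ (one common choice of log-chroma axes), strictly positive pixel values, and a made-up illuminant; the bin count and histogram extent are arbitrary.

```python
import numpy as np


def log_chroma(img):
    """Per-pixel log-chroma coordinates u = log(G/R), v = log(G/B)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return np.log(g / r), np.log(g / b)


def log_chroma_histogram(img, bins=32, extent=4.0):
    """Normalised 2-D histogram of the image in the (u, v) plane."""
    u, v = log_chroma(img)
    hist, _, _ = np.histogram2d(
        u.ravel(), v.ravel(), bins=bins,
        range=[[-extent, extent], [-extent, extent]],
    )
    return hist / hist.sum()


rng = np.random.default_rng(0)
scene = rng.uniform(0.05, 1.0, size=(64, 64, 3))  # surfaces under white light
illuminant = np.array([1.0, 0.7, 0.4])            # hypothetical warm light
lit = scene * illuminant

u0, v0 = log_chroma(scene)
u1, v1 = log_chroma(lit)
shift_u = np.log(illuminant[1] / illuminant[0])
shift_v = np.log(illuminant[1] / illuminant[2])
print("every pixel moved by:", shift_u, shift_v)
```

Every pixel moves by the same `(shift_u, shift_v)`, so the histogram of `lit` is the histogram of `scene` slid across the plane, and recovering those two numbers is recovering the illuminant's chromaticity.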

fig-ffcc-pipeline
Figure 3.13.6. The Fast Fourier Color Constancy pipeline, from Barron. The input image is binned into a two-dimensional log-chroma histogram; a learned filter is convolved over that histogram — efficiently, in the Fourier domain — to produce a heat map whose peak localizes the illuminant; and that estimate is read back out as the per-channel gains that white-balance the image. The reframing to keep is that color constancy has become localization in a histogram: because changing the light merely translates the log-chroma histogram, finding the light is finding the shift. This is the method's overview (Barron 2017).
fig-ffcc-model
Figure 3.13.7. The FFCC model, from Barron. The estimator is a single learned filter applied to the log-chroma histogram, trained discriminatively so that convolving it with any scene's histogram puts its response peak at the true illuminant. Because illuminant change is a translation of the histogram, one shift-invariant filter — a convolution — suffices, and the whole estimate reduces to a few small Fourier-domain operations, which is what makes the method both accurate and exceptionally fast. This is the learned model at the heart of the method (Barron 2017).
fig-ffcc-localization
Figure 3.13.8. Illuminant estimation as localization, from Barron. Convolving the learned filter with a scene's log-chroma histogram yields a response surface over the chroma plane; the white balance is read off as the location of its peak. Seeing the estimate as the argmax of a heat map is what makes the rest possible — it turns "what color was the light?" into "where is the bump?", a question a convolution answers directly, and one whose confidence we can read from how sharp the bump is. From the FFCC method (Barron 2017).
fig-ffcc-von-mises
Figure 3.13.9. Reading a confidence from the FFCC response, from Barron. Rather than just take the peak, the method fits a bivariate von Mises distribution to the response surface. The von Mises is the natural bell curve for a quantity that wraps around, and the FFCC histogram does wrap: the Fourier-domain convolution treats it as periodic along both log-chroma axes. The fitted mean gives the illuminant estimate and its spread gives a calibrated uncertainty, so the camera knows not only its best guess but how much to trust it, which matters when fusing the estimate over a video or with other cues. The lesson is that localizing in a histogram hands you a full distribution, not just a point. From the FFCC method (Barron 2017).
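To see why a wrap-around distribution is the right tool, here is a simplified one-dimensional illustration of reading a mean and a sharpness from a periodic response by circular moments. It is not the published fitting procedure, which is bivariate and differs in detail; the function name and the toy responses are assumptions.

```python
import numpy as np


def circular_mean_and_concentration(p):
    """Mean location (in bins) and resultant length of a periodic response.

    The resultant length is near 1 for a sharp bump and near 0 for a flat one.
    """
    n = len(p)
    theta = 2 * np.pi * np.arange(n) / n  # map bins onto a circle
    p = p / p.sum()
    c = np.sum(p * np.cos(theta))
    s = np.sum(p * np.sin(theta))
    mean_bin = (np.arctan2(s, c) % (2 * np.pi)) * n / (2 * np.pi)
    return mean_bin, np.hypot(c, s)


n = 64
sharp = np.zeros(n)
sharp[[62, 63, 0]] = [1.0, 2.0, 1.0]  # a bump that straddles the wrap-around
flat = np.ones(n)

mean_bin, r_sharp = circular_mean_and_concentration(sharp)
_, r_flat = circular_mean_and_concentration(flat)
print(mean_bin, r_sharp, r_flat)
```

An ordinary weighted average of the bin indices would put the bump near bin 47, far from any of its mass. The circular mean places it at bin 63, and the resultant length separates the confident response from the flat one.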
fig-ffcc-result-1
Figure 3.13.10. FFCC white-balance results on the Gehler–Shi benchmark, from Barron. Scenes from the standard color-constancy dataset, each shown with its raw cast and after correction by the learned illuminant estimate, with the recovered light compared to the ColorChecker ground truth. Notice how a method whose entire inference is a small Fourier-domain convolution over a histogram nevertheless tracks the true illuminant across very different scenes — the priors grey-world only gestures at have been learned. The first set of published results (Barron 2017).
fig-ffcc-result-2
Figure 3.13.11. Further FFCC results on Gehler–Shi, from Barron. A second set of benchmark scenes corrected by the method, including cases that defeat the classic statistical assumptions — frames dominated by a single strong color, where grey-world would drain the very hue that belongs in the scene. The takeaway is that the learned histogram filter has absorbed enough of natural-scene statistics to keep the color when grey-world would lose it, which is exactly the gap between a hand-coded statistic and a learned estimator that this chapter has been tracing. More published results (Barron 2017).
Sidebar — Dataset: Gehler-Shi + NUS (Cheng)

Both are color-constancy datasets in which a ColorChecker supplies the ground-truth illuminant. NUS (Cheng) spans 8 cameras and is the primary set here; Gehler-Shi is the classic baseline. Gehler-Shi: <https://www2.cs.sfu.ca/~color/data/shi_gehler/> · NUS: <https://cvil.eecs.yorku.ca/projects/public_html/illuminant/illuminant.html>. See the Datasets appendix.

Skin tones — a calibration target, a memory colour, and a fairness caveat

Skin is the colour cameras are quietly tuned around, for two reasons. First, faces are the most common and most scrutinised subject, and skin is a memory colour — we have a strong prior for what it should look like — so modern metering and white balance are biased to render it well: face detection feeds both the meter and the AWB so that the person, not the average scene, lands at a sensible exposure and a believable warmth. Second, and less happily, that tuning carries a history of bias. Colour film, and later automatic exposure and white balance, were calibrated against the Shirley card — a reference photo of a light-skinned woman — so the defaults were fit to light skin and rendered darker skin poorly (under-exposed, with the wrong cast), a bias that persists wherever metering, white balance, and face detection inherit those defaults. The obligation is concrete: design and test exposure and white balance across the full range of skin tones, not just the historical reference. We take this up as an ethics question — the Shirley card, fairness in capture — in Ethics of computational photography.

3.13.4 The limits of white balance, and CRI

White balance is one global guess, and two situations defeat it no matter how clever the estimator — the second of which forces us to measure the quality of the light itself, its color rendering index (CRI). The first is mixed illumination: a single per-channel gain cannot simultaneously fix a face lit warm by a lamp and cool by a window, because the right correction differs across the frame (Figure 3.13.12). Doing it properly needs per-region or multi-light white balance, which we take up in the advanced material (Hsu et al.). The second is metameric failure of the illuminant. White balance can only rescale channels; it cannot resurrect color information the light never delivered. A light with a spiky, gap-ridden spectrum — a cheap fluorescent tube or a low-quality LED — simply fails to illuminate certain wavelengths, so reflectances that differ only in those bands collapse to the same camera response and no gain can pull them apart.

[figure fig-mixed-illuminant-scene not built]
Figure 3.13.12. Why one white balance is not enough. A scene lit by two differently-colored sources — warm interior light and cool daylight through a window — cannot be neutralized by a single global gain: balancing for one light leaves the other cast. A correct fix needs spatially-varying, multi-illuminant white balance.

This is exactly the regime Hsu and colleagues set out to crack: not a cleverer global gain, but a per-pixel estimate of which light is doing the lighting, and then a different correction at every point. Their three figures below carry the argument from the failure to the fix.

Video: Light Mixture Estimation for Spatially Varying White Balance (Hsu, Mertens, Paris, Avidan & Durand 2008): estimating, at every pixel, the mixture of two illuminants and white-balancing each region for the light that actually falls on it.
fig-light-mixture-wb-1
Figure 3.13.13. Mixed lighting, where a single white balance must lose. The scene is lit by two illuminants of different colour at once (a warm indoor source and cool daylight), so the cast a camera should remove changes from one part of the frame to the next. Balance for the warm light and the daylit regions go blue; balance for the daylight and the warm regions go orange; there is no one gain that neutralizes both, which is the whole problem a global white balance cannot solve (Hsu et al. 2008).
fig-light-mixture-wb-2
Figure 3.13.14. The spatially-varying fix, in overview. Rather than guess one illuminant for the frame, the method estimates at every pixel how much of each of the two lights is contributing (a per-pixel mixture of the two illuminants) and then white-balances each region for the light that actually falls on it. The reframing to keep is that a mixed-light scene has no single answer, so the unknown becomes a smoothly-varying mixing map rather than a single gain (Hsu et al. 2008).
fig-light-mixture-wb-8
Figure 3.13.15. Global versus spatially-varying white balance, compared. The same mixed-light scene is corrected two ways: a single best global gain, which can only ever neutralize one of the two lights and leaves the other tinted, and the multi-illuminant result, which carries the right correction into each region so warm-lit and daylit areas both read neutral at once. This is the payoff of treating the illuminant as a per-pixel mixture rather than one number for the frame (Hsu et al. 2008).
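The correction step, once a mixing map is in hand, is simple to sketch. The hard part of the method is estimating that map, which this example does not attempt: here `alpha` and the two light colours are assumed known, and the scene is a synthetic grey wall.

```python
import numpy as np


def spatially_varying_wb(img, alpha, light_a, light_b):
    """Divide each pixel by its own mixed illuminant.

    alpha: H x W map, the fraction of light_a at each pixel (assumed known here).
    """
    mix = alpha[..., None] * light_a + (1.0 - alpha[..., None]) * light_b
    return img / mix


def global_wb(img, light):
    """Single gain for the whole frame: divide by one illuminant."""
    return img / light


h, w = 8, 32
light_a = np.array([1.0, 0.7, 0.4])  # hypothetical warm lamp
light_b = np.array([0.7, 0.9, 1.0])  # hypothetical cool window light
alpha = np.tile(np.linspace(1.0, 0.0, w), (h, 1))  # lamp on the left, window on the right

reflectance = np.full((h, w, 3), 0.5)  # a uniformly grey wall
mix = alpha[..., None] * light_a + (1.0 - alpha[..., None]) * light_b
captured = reflectance * mix

local = spatially_varying_wb(captured, alpha, light_a, light_b)
single = global_wb(captured, light_a)
print("per-pixel result, right edge:", local[0, -1])
print("global result, right edge:   ", single[0, -1])
```

The per-pixel version returns the grey wall everywhere. The global version is neutral only at the left edge, where the lamp alone lights the wall, and grows bluer toward the window.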

The quality of a light source is quantified by its CRI (color rendering index): how faithfully a source renders a set of test colors compared with a reference illuminant of the same correlated color temperature. A perfect incandescent or daylight source scores $100$; a spiky fluorescent or cheap LED may score in the $50$s, mis-rendering whole families of color even after white balance (Figure 3.13.16). High CRI is why good studios and museums pay for their lights, and why two prints can match under daylight yet diverge under store fluorescents — illuminant metamerism made expensive.

fig-cri-comparison
Figure 3.13.16. High versus low color rendering index. A ColorChecker chart rendered under a high-CRI reference (D65, CRI 100) and under a low-CRI spiky fluorescent (the CIE standard "F-series" fluorescent illuminant number four, CRI $\approx 52$), each white-adapted so the neutrals stay neutral — so the chromatic distortion is what shows. Both spectral power distributions and CRI scores are printed; the patches the spiky spectrum mis-renders are outlined. White balance keeps the greys honest but cannot recover the colors the gappy spectrum never lit.

It is worth playing with this limit, because the failure is sharper than a static figure can show. The interactive demo below renders a ColorChecker from spectra: pick an illuminant — smooth daylight, a warm tungsten, or a spiky fluorescent or LED — and watch the captured chart, each patch's resulting spectrum (reflectance × illuminant), and the best white balance attainable two ways: a diagonal (von Kries) correction and a full $3\times3$ matrix, each fit by least squares against the D65 ground truth, with its residual error printed (Figure 3.13.17). Under a smooth light the diagonal nearly recovers the truth and the $3\times3$ closes the rest. Under a spiky light the residual will not go to zero — not for the diagonal, and not even for the full $3\times3$ — because once the illuminant fails to deliver whole bands of the spectrum, different reflectances collapse to the same three camera numbers and no linear map can pull them back apart. The temperature/tint sliders let you try to white-balance by hand, and you will find you can never match the ground truth on a bad light: a manual correction is only a diagonal, and a diagonal is the weakest of the corrections that already cannot win.

fig-white-balance
Figure 3.13.17. The limits of white balance, interactively. A Macbeth ColorChecker is rendered from spectral reflectances under a chosen illuminant (daylight D65, tungsten A, a spiky fluorescent, or a warm-white LED). The demo shows the captured chart beside the D65 ground truth, the selected patch's resulting spectrum, and the best correction by a diagonal (von Kries) gain versus a full $3\times3$ matrix — each least-squares-fit to the truth with its mean $\Delta E$ printed — plus manual temperature/tint sliders. Smooth illuminants correct nearly perfectly; spiky ones leave an irreducible residual no $3\times3$ can remove, the metameric failure of the light made tangible.
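The two least-squares fits can be sketched on synthetic data. Everything here is an assumption made for illustration: Gaussian camera sensitivities, random smooth reflectances in place of the ColorChecker, an equal-energy reference in place of D65, and an RMS error in camera RGB in place of $\Delta E$.

```python
import numpy as np

wl = np.arange(400, 701, 10, dtype=float)  # wavelengths in nm


def bump(center, width):
    """Gaussian bump over wavelength."""
    return np.exp(-0.5 * ((wl - center) / width) ** 2)


# Hypothetical camera sensitivities (R, G, B), shape (31, 3).
sens = np.stack([bump(600, 30), bump(540, 30), bump(450, 30)], axis=1)

# 24 smooth random reflectances built from low-frequency cosines.
rng = np.random.default_rng(0)
basis = np.stack([np.cos(np.pi * k * (wl - 400) / 300) for k in range(4)])
refl = np.clip(0.5 + 0.2 * rng.uniform(-1, 1, size=(24, 4)) @ basis, 0.02, 0.98)


def capture(illum):
    """Camera RGB of every patch under the given illuminant, shape (24, 3)."""
    return (refl * illum) @ sens


def fit_errors(x, target):
    """RMS residual of the best diagonal and the best full 3x3 correction."""
    k = np.sum(x * target, axis=0) / np.sum(x * x, axis=0)  # per-channel least squares
    err_diag = np.sqrt(np.mean((x * k - target) ** 2))
    m, *_ = np.linalg.lstsq(x, target, rcond=None)          # full 3x3 least squares
    err_full = np.sqrt(np.mean((x @ m - target) ** 2))
    return err_diag, err_full


reference = np.ones_like(wl)                            # equal-energy stand-in for D65
smooth = 0.4 + 0.6 * (wl - 400) / 300                   # gently sloped warm light
spiky = bump(435, 8) + bump(545, 8) + bump(610, 8)      # three narrow emission lines

target = capture(reference)
errors = {name: fit_errors(capture(illum), target)
          for name, illum in [("smooth", smooth), ("spiky", spiky)]}
for name, (err_diag, err_full) in errors.items():
    print(f"{name}: diagonal {err_diag:.5f}, full 3x3 {err_full:.5f}")
```

The full $3\times3$ fit can never be worse than the diagonal one, since every diagonal matrix is also a candidate $3\times3$. What to look for is whether the residual under the spiky light stays away from zero even for the full matrix, as the argument above predicts for real spectra.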