Illumination related effects in a single image

| Symbol | Meaning (this chapter) | Note |
|---|---|---|
| $I$ | the observed image (per channel): the product or sum the camera records | from Notations |
| $R$ | reflectance / albedo: the surface's own color, independent of lighting | new (this chapter) |
| $S$ | shading / illumination: the light across the geometry, independent of surface | new (this chapter) |
| $I=R\cdot S$ | the intrinsic-image model; additive in log, $\log I=\log R+\log S$ | the chapter's headline equation (L1/L2) |
| $T$, $R_{\text{refl}}$ | the transmitted and reflected layers of a glass shot; $I=T+R_{\text{refl}}$ (additive) | new (this chapter) |
| $\alpha(x)$, $L_1,L_2$ | the per-pixel illuminant mixing map and the two illuminant colors | new (multi-illuminant white balance) |
| $m_d, m_s$ | per-pixel diffuse and specular weights (dichromatic model) | new (this chapter) |
| $\Lambda_{\text{body}}, \Lambda_{\text{illum}}$ | the surface and illuminant color directions in the dichromatic model | new (this chapter) |
| $\tau$ | the Retinex gradient threshold separating reflectance edges from shading ramps | new (this chapter) |

A single photograph mixes things the eye effortlessly keeps apart. Look at a white shirt half in shadow and you read it instantly as white-and-shadowed — one surface, unevenly lit — not as a grey shirt that happens to be pale at the top. The camera cannot do this. It records only the product: at every pixel, the light that fell on the surface times how much of that light the surface sent back. Write it as $I = R\cdot S$ — illumination (shading) $S$ times reflectance $R$ (L1). The shirt's whiteness lives in $R$; the shadow lives in $S$; the camera hands you only their product and leaves you to guess which is which.

This chapter is the project of un-mixing that product from one image — pulling the light apart from the surface, or one layer of light apart from another. And it is the same story as super-resolution and deblurring told in a new costume: the un-mixing is under-determined, so a prior is what makes recovery possible (L10). The unifying frame is the intrinsic-image decomposition, and everything else in the chapter — white balancing a scene lit by two different lamps, deleting a reflection off a shop window, lifting a shadow off a face, killing the glossy highlight on a leaf — is a special case of it: I care about one layer, and I have a cue that pins it down.

💡 Big lesson (recurrence of L1 + L10)

Because the world is multiplicative — what you measure is light × surface — recovering either factor from one image is ill-posed. A dark patch can be a dim surface in bright light or a bright surface in dim light, and the single pixel cannot tell you which. So separating illumination from reflectance — or a reflection from a transmission, a shadow from a stain, a highlight from a color — always leans on a prior about how surfaces, light, and natural images behave. The prior is not a tuning knob you could omit; it is the load-bearing part of the method, exactly as it was for the inverse-problem chapters. (→ see Big lesson L1 · the multiplicative world, FUNDAMENTALS; → see Big lesson L10 · the prior is not optional, Super-resolution and image priors.)

A standing note on encoding, since every method here depends on it (L2). The relation $I = R\cdot S$ is multiplicative, so the natural place to work is log, where it becomes additive, $\log I = \log R + \log S$ — a multiplicative haze, shadow, or illuminant becomes a clean offset you can manipulate with gradient and affinity tools. The layer-superposition effects (a glass reflection summed onto a transmission, a specular term added onto the diffuse) are instead genuinely additive in radiance, so those must be separated in linear light, before gamma. Declaring the space up front is not pedantry: do reflection removal in gamma and the "sum" you are trying to un-sum is not actually a sum.
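A minimal sketch of the encoding point, using made-up single-pixel values and a plain power-law curve as a stand-in for a real camera transfer function (both are assumptions for illustration only). It checks that the product becomes a sum in log, and that a linear-light sum stops being a sum once each value is gamma-encoded.

```python
import math

# Hypothetical single-pixel values in linear light, chosen only for illustration.
R, S = 0.5, 0.4                       # reflectance and shading
I = R * S                             # multiplicative model, I = R * S
log_sum = math.log(R) + math.log(S)   # the same pixel written as a sum in log

T, R_refl = 0.30, 0.10                # transmitted and reflected radiance
I_glass = T + R_refl                  # superposition is a true sum in linear light


def encode(v, gamma=2.2):
    """Plain power-law encoding, a stand-in for a real camera transfer curve."""
    return v ** (1.0 / gamma)


# After encoding, the recorded value is no longer the sum of the encoded layers.
gamma_mismatch = encode(I_glass) - (encode(T) + encode(R_refl))

print(f"log I - (log R + log S) = {math.log(I) - log_sum:.1e}")
print(f"encoded sum mismatch    = {gamma_mismatch:.3f}")
```

The first difference is zero up to rounding; the second is far from zero, which is the sense in which the "sum" in gamma space is not actually a sum.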

8.9.1 Intrinsic images: the unifying frame ($I = R\cdot S$)

Start with the decomposition that names everything else. Split one photograph into two intrinsic layers: a reflectance image $R$ — the surface's own albedo, the color and lightness of the paint, independent of how it happens to be lit — and a shading image $S$ — the illumination, the play of light and shadow across the geometry, independent of what the surface is made of. Per color channel, $I = R\cdot S$; in log it is additive, $\log I = \log R + \log S$. The reflectance layer looks like a flat, evenly-lit version of the scene with all the modeling light removed; the shading layer looks like a smooth grey rendering of the geometry with all the texture and markings gone (Figure 8.9.1).

Why is this the umbrella? Because once you can say "this variation is light and that variation is surface," every other effect in the chapter becomes an edit on one layer and not the other. White balance corrects the color of $S$. Shadow removal flattens a dark region of $S$ while leaving $R$ alone. Highlight removal strips a specular term that has contaminated $S$ in the wrong color. Reflection removal is the one outlier — it un-sums two whole images rather than splitting one into factors — but it is the same un-mixing instinct. Name the layers and the rest is bookkeeping.

And why is it hard? For the reason in the big-lesson box: infinitely many pairs $(R,S)$ multiply to the very same $I$. Halve the reflectance everywhere and double the shading and the product is unchanged; the data alone cannot prefer one factorization over another. You need a prior that says which kinds of variation are reflectance and which are shading.
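The ambiguity is easy to verify numerically. The sketch below uses a hypothetical four-pixel scene (the values are assumptions, not data from the chapter) and builds two alternative factorizations that reproduce exactly the same observed image.

```python
# Hypothetical 4-pixel scene: a two-tone surface under uneven light.
R = [0.8, 0.8, 0.2, 0.2]
S = [1.0, 0.5, 1.0, 0.5]


def render(refl, shade):
    """Form the observed image as the per-pixel product of the two layers."""
    return [r * s for r, s in zip(refl, shade)]


I = render(R, S)

# Alternative 1: halve the reflectance everywhere and double the shading.
R_half = [r / 2.0 for r in R]
S_double = [s * 2.0 for s in S]

# Alternative 2: call every variation "surface" and the light perfectly flat.
R_all = list(I)
S_flat = [1.0] * len(I)

for name, layers in [("true", (R, S)),
                     ("half/double", (R_half, S_double)),
                     ("all-surface", (R_all, S_flat))]:
    print(name, render(*layers))
```

All three lines print the same image, so nothing in $I$ alone can rank these explanations; only a prior can.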

Figure 8.9.1. Intrinsic-image decomposition. One input photograph (a textured, unevenly-lit object) is split into two layers whose product reconstructs it. Reflectance $R$: the surface albedo — the paint and markings, rendered as if under flat, even light, with all the modeling shadow removed. Shading $S$: the illumination — a smooth, near-greyscale rendering of the geometry and shadows, with all the surface texture removed. Per channel $I = R\cdot S$ (additive in log, $\log I=\log R+\log S$). The split is non-unique: many $(R,S)$ give the same $I$, which is why a prior is required.

The classic prior is Retinex (Land & McCann 1971), and its intuition is one sentence: reflectance changes are sharp, shading changes are smooth. A painted edge, where one surface color abuts another, produces a sudden jump in the image. A shading change, such as light falling off across a curved surface or the soft edge of a shadow, produces a gentle ramp. So walk along the image and look at the gradient of the log image: where it jumps hard, that variation is surface, so assign it to $\nabla R$; where it ramps gently, that variation is light, so assign it to $\nabla S$ (both understood as gradients of the log layers). The decision rule is a simple threshold on the gradient magnitude, $|\nabla \log I| > \tau$ for reflectance (Figure 8.9.2). Having sorted every gradient into one of two piles, you integrate each pile back into an image by exactly the gradient-domain reconstruction of Poisson image editing: solve the Poisson equation whose guidance field is the reflectance gradients, and again for the shading gradients (L9). Retinex is crude, because a hard threshold mislabels soft material edges and sharp shadow edges (the two failure modes the rest of the field spends its energy on), but it captures the whole instinct in one move: edges are surface, ramps are light.

**Figure 8.9.2.** Retinex gradient classification. A 1-D scan of $\log I$ crosses two kinds of edge. **A paint (reflectance) edge:** a sharp step, where the surface color changes abruptly. **A shadow (shading) edge:** a smooth ramp, where the light falls off gradually across the penumbra. Retinex thresholds the log-gradient: magnitudes above $\tau$ are declared **reflectance** ($\nabla R$), the gentle remainder is declared **shading** ($\nabla S$). Each gradient field is then **Poisson-integrated** back into a layer (L9). The cartoon failure mode is also visible: a very soft material edge (below $\tau$) leaks into shading, and a hard-edged shadow (above $\tau$) leaks into reflectance.
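The Retinex recipe can be sketched on a 1-D scanline. This is a toy under stated assumptions, not the original Land and McCann procedure: the signal and the value of `tau` are hypothetical, a running sum stands in for the Poisson solve (in 1-D the two coincide), and the unknown constant of integration is arbitrarily assigned to the shading layer.

```python
import math


def retinex_1d(signal, tau):
    """Split a positive 1-D linear-light signal into (reflectance, shading)
    by thresholding the log-gradient and reintegrating each pile."""
    log_i = [math.log(v) for v in signal]
    grad = [b - a for a, b in zip(log_i, log_i[1:])]

    # Sort gradients: large jumps are "surface", the gentle remainder is "light".
    grad_r = [g if abs(g) > tau else 0.0 for g in grad]
    grad_s = [g - gr for g, gr in zip(grad, grad_r)]

    # 1-D integration is a running sum. The global scale is unobservable,
    # so anchor log R at 0 and let log S carry the offset log I[0].
    log_r, log_s = [0.0], [log_i[0]]
    for gr, gs in zip(grad_r, grad_s):
        log_r.append(log_r[-1] + gr)
        log_s.append(log_s[-1] + gs)
    return [math.exp(v) for v in log_r], [math.exp(v) for v in log_s]


# Hypothetical scanline: a paint edge (0.8 -> 0.2) under a smooth light ramp.
R_true = [0.8] * 4 + [0.2] * 4
S_true = [1.0 - 0.5 * i / 7 for i in range(8)]
I = [r * s for r, s in zip(R_true, S_true)]

R_hat, S_hat = retinex_1d(I, tau=0.5)
print([round(v, 3) for v in R_hat])
print([round(v, 3) for v in S_hat])
```

The recovered reflectance is flat on each side of the paint edge and the two layers multiply back to the input. The small piece of the light ramp that happens to fall on the edge sample is swept into reflectance along with the paint step, a miniature version of the threshold's mislabeling.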

The modern version keeps the same recipe, a data-fit term plus a hand-built prior recovered by maximum a posteriori (MAP) estimation, but enriches the priors and adds a third unknown. Barron and Malik's shape, illumination, and reflectance from shading (SIRFS; Barron & Malik 2015) recovers all three jointly from a single image, using a reflectance prior that albedo is piecewise-constant and draws from a small palette; a shape prior that surfaces are smooth and bend gently; and an illumination prior that lighting is low-frequency (a few soft sources, not a thicket of hard ones). Maximizing the posterior under these priors is the same data-fit-plus-prior optimization that runs through the whole part. SIRFS is the high-water mark of hand-built priors for this problem; the learned successors, convolutional neural networks (CNNs) trained to predict $R$ and $S$ directly from a photo, live in Deep learning, and they are the canonical instance of L8 (a learned operator swapping a hand-designed prior for one fit to data).

The takeaway to carry forward: intrinsic decomposition is the generalization. The remaining sections each pick one layer to recover and bring a sharper cue than Retinex's blunt threshold to pin it down.

8.9.2 Multiple-light / spatially-varying white balance

Ordinary white balance, as set up in Color technology and the ISP chapter, estimates one illuminant color for the whole frame and divides it out (L1 — white balance is a division). That is exactly right when the scene is lit by a single color of light, and exactly wrong the moment it is not.

And real scenes are routinely not. Step into a room at dusk: warm tungsten lamps on one side, cool blue daylight through the window on the other. Or fire a flash indoors and let the warm ambient fill the rest. Or shoot a face half in direct sun (warm) and half in open shade (lit by the blue sky). A single global correction can neutralize one of these and only one — fix the tungsten side and the window goes glacial blue; neutralize the window and the lamp-lit side turns orange (Figure 8.9.3). There is no one number that is right everywhere, because there is no one illuminant.

Figure 8.9.3. Multi-illuminant white balance. A scene lit by warm tungsten on the left and cool window daylight on the right. One global white balance: correcting for the tungsten neutralizes the left but pushes the right cold blue (and vice versa) — a single illuminant estimate cannot be right everywhere. Per-pixel mixture (Hsu 2008): estimate a smooth mixing map $\alpha(x)$ blending the two illuminant colors $L_1,L_2$, then divide out the local light — both halves come out neutral. Done in linear light, where mixing and dividing illuminants is the physically additive/multiplicative operation it should be (L2).

The fix is to let the white balance vary across the frame. Hsu and colleagues (Hsu et al. 2008) model each pixel as lit by a blend of two known illuminant colors, $L_1$ and $L_2$, in proportions that vary spatially:

$$ I(x) = \big(\alpha(x)\,L_1 + (1-\alpha(x))\,L_2\big)\cdot R(x). $$

The unknown is the per-pixel mixing map $\alpha(x)$ — how much of each light reaches that pixel. Solve for $\alpha(x)$ and you have a spatially-varying white balance: at each pixel you know the local illuminant color, so you divide that out and recover a neutral $R(x)$. The two endpoint colors $L_1,L_2$ come from the user clicking a warm and a cool region, from clipped highlights (which take the illuminant's color), or from clustering the image's colors into two illuminant directions.
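To make the division step concrete, here is a small sketch of the mixture model applied to three pixels of a grey surface. It assumes the mixing map is already known, which skips the hard part of the method (estimating $\alpha(x)$); the illuminant colors, the surface, and the function names are hypothetical, and all values are in linear light.

```python
def local_illuminant(alpha, L1, L2):
    """Blend the two illuminant colors with mixing weight alpha."""
    return tuple(alpha * a + (1.0 - alpha) * b for a, b in zip(L1, L2))


def white_balance_pixel(pixel, alpha, L1, L2):
    """Divide the local illuminant out of one linear-light RGB pixel."""
    L = local_illuminant(alpha, L1, L2)
    return tuple(p / l for p, l in zip(pixel, L))


# Hypothetical illuminants: warm tungsten and cool daylight.
L1 = (1.0, 0.8, 0.5)
L2 = (0.6, 0.8, 1.0)
grey = (0.5, 0.5, 0.5)            # the neutral surface we hope to recover
alphas = [1.0, 0.5, 0.0]          # lamp side, middle of the room, window side

# Forward model: I(x) = (alpha L1 + (1 - alpha) L2) * R(x), per channel.
observed = [tuple(l * r for l, r in zip(local_illuminant(a, L1, L2), grey))
            for a in alphas]

per_pixel = [white_balance_pixel(p, a, L1, L2) for p, a in zip(observed, alphas)]
global_wb = [white_balance_pixel(p, 1.0, L1, L2) for p in observed]  # tungsten only

print("per-pixel:", per_pixel)
print("global   :", global_wb)
```

The per-pixel correction returns the grey surface at every position, while the single tungsten correction leaves the window-side pixel with far more blue than red, which is the "glacial blue" failure described above.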

What makes this an intrinsic-image method, and not a new trick, is that it is recovering the color of the shading layer as it varies across the scene — a chrominance-only intrinsic split. And the prior is once again Retinex's: the mixing map $\alpha(x)$ is assumed smooth, because light varies gently across a room. That smoothness is what regularizes the otherwise-underdetermined per-pixel estimate — the same "ramps are light" assumption, now applied to the color of the illumination rather than its brightness. As always, do the dividing in linear light, where mixing and removing illuminants is the operation the radiometry says it is (L2).

The easier multi-image version disambiguates the lights for free: a flash/no-flash pair gives you one frame under a light of known color (the flash) and one under the ambient, so the mixture is observed rather than inferred — that method lives in Advanced computational photography. The learned single-image version — a network that predicts a per-pixel illuminant map — is in Deep learning.

8.9.3 Reflection removal: pulling apart a transmission and a glass reflection

Photograph a painting behind glass, or a street through a shop window, and you capture two scenes summed on top of each other: the transmitted layer $T$ you meant to shoot, plus a reflection $R_{\text{refl}}$ of whatever was behind the camera. This effect is additive, unlike the multiplicative $R\cdot S$ — the glass passes one image and mirrors another, and the sensor adds them:

$$ I = T + R_{\text{refl}}. $$

Recovering both $T$ and $R_{\text{refl}}$ from a single $I$ is hopelessly under-determined — two unknown images, one observed sum, with no constraint tying them down. This is the purest prior territory in the chapter (L10): the data fixes nothing, so the entire answer comes from what you assume about the two layers. What makes reflection removal a good teaching case is that there is not one prior but a small zoo of them, each exploiting a different physical cue that the reflection differs from the transmission in some measurable way (Figure 8.9.4).

**Figure 8.9.4.** The reflection-removal cue zoo. Shooting through a glass pane records $I = T + R_{\text{refl}}$, a sharp **transmitted** layer plus an overlaid **reflection**. Each panel shows a cue that breaks the tie. **Defocus:** focusing on $T$ leaves $R_{\text{refl}}$ blurred (broad, weak gradients), so assign sharp gradients to $T$ and soft ones to $R_{\text{refl}}$. **Ghosting:** a thick pane reflects off both surfaces, so $R_{\text{refl}}$ appears as a shifted double image, a known double-impulse to identify and subtract. **Edge sparsity:** both layers have sparse, heavy-tailed gradients, but their sum has too many edges, so find the split that minimizes total edge count. **Polarization:** glancing reflection is partially polarized (Brewster), so a polarizer attenuates $R_{\text{refl}}$ differently from $T$.

Defocus / blur. You focus the camera on the transmitted scene, so the reflection, typically at a different distance, falls out of focus: it contributes weak, broad gradients, while the transmission contributes sharp ones. Fattal's single-image method (Fattal 2008) exploits exactly this: assign the sharp gradients to $T$ and the soft ones to $R_{\text{refl}}$. Note that this is Retinex again, run on focus instead of on light-versus-surface: the same "sort the gradients, then integrate" move from the intrinsic-image section.

Ghosting. A thick glass pane reflects light off both its front and back surfaces, so the reflection arrives twice, slightly shifted — a doubled ghost. That shift is a known double-impulse kernel; spotting the characteristic doubling lets you identify the reflected layer and subtract it.
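A toy version of the double-impulse idea, under strong assumptions that a real method would not get for free: the shift and the attenuation of the second bounce are known, the signal is 1-D, and the reflection is considered in isolation (no transmitted layer, no noise). The function names are hypothetical.

```python
def add_ghost(layer, shift, atten):
    """Apply the two-tap kernel: out[n] = layer[n] + atten * layer[n - shift]."""
    return [v + (atten * layer[n - shift] if n >= shift else 0.0)
            for n, v in enumerate(layer)]


def remove_ghost(ghosted, shift, atten):
    """Invert the kernel sample by sample:
    layer[n] = ghosted[n] - atten * layer[n - shift]."""
    layer = []
    for n, v in enumerate(ghosted):
        echo = atten * layer[n - shift] if n >= shift else 0.0
        layer.append(v - echo)
    return layer


# Hypothetical reflected scanline: a single bright bar.
reflection = [0.0, 0.0, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0]
ghosted = add_ghost(reflection, shift=2, atten=0.5)
recovered = remove_ghost(ghosted, shift=2, atten=0.5)

print(ghosted)     # the bar plus a weaker copy two samples later
print(recovered)   # the single bar again
```

The doubled structure is what marks a layer as reflection: the transmitted layer passes through the pane once and carries no such echo, so a known kernel of this form gives a handle for telling the two apart.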

Sparse-gradient prior. Levin and Weiss (Levin & Weiss 2007) lean on natural-image statistics: each layer on its own has sparse, heavy-tailed gradients (mostly flat, a few strong edges), so their sum has too many edges and its gradient histogram is too dense. Among all the ways to split $I$ into $T+R_{\text{refl}}$, prefer the one that minimizes the total number of edges across both layers, optionally guided by a few user scribbles marking which layer an edge belongs to. This is the very same sparse-gradient prior that breaks the chicken-and-egg of blind deblurring: the natural-image prior earning its keep again.
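The edge-count preference can be illustrated with two piecewise-constant 1-D layers. This sketch only scores candidate splits; it does not search for one, and the layers, the tolerance `eps`, and the helper name are all hypothetical.

```python
def edge_count(x, eps=1e-9):
    """Number of adjacent sample pairs whose difference exceeds eps."""
    return sum(1 for a, b in zip(x, x[1:]) if abs(b - a) > eps)


# Hypothetical layers, one edge each, at different positions.
T = [0.2, 0.2, 0.2, 0.6, 0.6, 0.6, 0.6, 0.6]
R_refl = [0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.3]
I = [t + r for t, r in zip(T, R_refl)]

# Candidate 1: the true split, where each edge lives in exactly one layer.
true_cost = edge_count(T) + edge_count(R_refl)

# Candidate 2: share every edge equally between the layers (each gets I / 2).
half = [v / 2.0 for v in I]
shared_cost = edge_count(half) + edge_count(half)

print("edges in I        :", edge_count(I))
print("true split cost   :", true_cost)
print("shared split cost :", shared_cost)
```

Splits that smear an edge across both layers pay for it twice, so the count favors giving each edge to a single layer. The count alone cannot say which layer an edge belongs to (assigning everything to one layer scores the same as the true split here), which is the role the user scribbles play.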

Polarization. Light reflected off glass at a glancing angle is partially polarized (the Brewster effect), while the transmitted light is much less so. A polarizing filter therefore attenuates $R_{\text{refl}}$ differently from $T$. The genuine method rotates a real polarizer and takes two frames — that lives in Advanced computational photography — but single-image methods try to exploit the polarization signature without the extra shot.

Learned. Zhang and colleagues (Zhang et al. 2018) train a CNN on synthesized $T+R_{\text{refl}}$ composites to predict the transmitted layer directly, using a perceptual loss so that the recovered $T$ looks like a real photograph rather than merely fitting pixels. This is L8 in its native habitat: the hand-built priors above (defocus, sparsity, ghosting) replaced by a prior fit to data.

The takeaway is the honest one: "$I = T + R_{\text{refl}}$ has no answer." Every method here is just a different way of inserting an assumption that the reflection is somehow distinguishable — blurrier, doubled, edge-sparser, polarized, or simply unlike a natural transmitted scene. It is a cousin of intrinsic images: instead of factoring light from surface, we are un-summing one illumination from another.

8.9.4 Shadow detection and shadow removal

What is a shadow, intrinsically? It is a region where the shading $S$ drops — less light reaches the surface — while the reflectance $R$ is unchanged. Same paint, just darker. That single observation places shadow removal squarely under intrinsic images: it is an edit on the shading layer only — flatten the darkening, keep the texture and markings untouched. If you could perfectly factor $I$ into $R\cdot S$, shadow removal would be trivial; the difficulty is the usual one, telling the two layers apart at the shadow's edge.

So the crux is detection: distinguish a shadow edge (a shading discontinuity — same surface, less light) from a material edge (a reflectance discontinuity — the surface itself changes). Several cues separate them:

- A shadow edge changes brightness but not hue much: it is the same surface on both sides, so its color stays put while its lightness drops. A material edge changes color outright.
- A shadow edge is often soft (a penumbra, where the light source is gradually occluded) and geometrically smooth, tracking the shape of the occluder rather than the texture of the surface.
- Shadowed regions are lit by skylight (bluer) rather than by direct sun (warmer), so a shadow often carries a faint color tell: the shaded side is not just darker but cooler.
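The first cue (brightness changes, color mostly does not) can be turned into a toy test on the two sides of an edge. This is only an illustration: the tolerance `chroma_tol`, the sample colors, and the function names are hypothetical, the pixels are assumed non-black and in linear light, and a real detector would combine this with the softness and geometry cues.

```python
def chromaticity(rgb):
    """Normalize an RGB triple by its sum, discarding overall brightness."""
    total = sum(rgb)
    return tuple(c / total for c in rgb)


def classify_edge(side_a, side_b, chroma_tol=0.05):
    """Label an edge 'shadow' if chromaticity barely moves across it,
    otherwise 'material'. The tolerance leaves room for a slight blue shift."""
    ca, cb = chromaticity(side_a), chromaticity(side_b)
    shift = max(abs(a - b) for a, b in zip(ca, cb))
    return "shadow" if shift < chroma_tol else "material"


lit = (0.60, 0.40, 0.20)          # sunlit surface
shaded = (0.24, 0.17, 0.10)       # same surface, darker and slightly bluer
other_paint = (0.20, 0.40, 0.60)  # a different surface color

print(classify_edge(lit, shaded))
print(classify_edge(lit, other_paint))
```

The tolerance is the whole difficulty in miniature: set it too tight and the skylight tint makes real shadows look like material changes, too loose and subtle paint edges pass as shadows.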

Once the shadow boundary is found, removal is a gradient-domain edit of exactly the kind in Poisson image editing (L9). Label the gradients that lie on the shadow boundary — the big step from lit to shaded — and zero them out, leaving every other gradient (the surface texture, which is small and lives inside both regions) intact. Then Poisson-reconstruct. The texture survives because its gradients were never touched; the shadow vanishes because its defining step was deleted. This is the same machinery as gradient-domain compositing — and indeed the outline of Poisson image editing flagged shadow removal as one of the "change only the guidance field" edits.
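A 1-D sketch of the "zero the boundary gradient, then reintegrate" edit. It works on the log of a linear-light signal (a choice made here so that the edit acts as a scale rather than an offset), takes the boundary location as given, and uses a hypothetical textured surface arranged so that no texture edge coincides with the shadow boundary.

```python
import math


def remove_shadow_1d(signal, boundary):
    """Zero the log-gradient at the given boundary indices, then reintegrate.
    grad[k] is the step between samples k and k + 1."""
    log_i = [math.log(v) for v in signal]
    grad = [b - a for a, b in zip(log_i, log_i[1:])]
    for k in boundary:
        grad[k] = 0.0                      # delete the lit-to-shaded step
    out = [log_i[0]]                       # anchor on the first (lit) sample
    for g in grad:
        out.append(out[-1] + g)            # 1-D stand-in for the Poisson solve
    return [math.exp(v) for v in out]


# Hypothetical scanline: fine texture, with a hard shadow over the right half.
R = [0.5, 0.6, 0.5, 0.6, 0.6, 0.5, 0.6, 0.5]
S = [1.0] * 4 + [0.4] * 4
I = [r * s for r, s in zip(R, S)]

restored = remove_shadow_1d(I, boundary=[3])
print([round(v, 3) for v in restored])
```

The texture inside the shadow comes back at full brightness because its gradients were never touched. If a texture edge did fall exactly on the boundary sample, zeroing that gradient would erase it too, which is one reason a crisp binary boundary is a fragile assumption.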

Two subtleties keep this honest. First, the penumbra. A hard binary shadow mask, zeroed at a crisp boundary, leaves a dark halo where the real shadow softly faded, so you need a soft shadow matte, and the correction must follow the real penumbra. The fast bilateral solver (Barron 2017) is the right tool to refine the matte: it snaps the soft correction to the image's own edges, which is precisely the edge-aware affinity idea (L4), where pixels that look alike should get a similar correction. Second, the correction is multiplicative, not additive. The shadowed region received less light by some factor, so you recover it by scaling the shadowed radiance back up by the recovered shading ratio: a multiply in linear light (L1), not a brightness offset. Add a constant instead of multiplying and the lifted shadow will not match the surrounding tone.
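The multiplicative-versus-additive point fits in a few lines. The shading ratio and the two surface tones below are made-up linear-light values; the additive offset is calibrated on the brighter tone to give it the best possible chance.

```python
factor = 0.4                    # hypothetical shading ratio inside the shadow
lit = [0.8, 0.2]                # a bright and a dark surface tone, fully lit
shadowed = [v * factor for v in lit]

# Correct fix: scale the shadowed radiance back up by the shading ratio.
multiplied = [v / factor for v in shadowed]

# Tempting wrong fix: add whatever offset repairs the bright tone.
offset = lit[0] - shadowed[0]
added = [v + offset for v in shadowed]

print("target    :", lit)
print("multiplied:", [round(v, 3) for v in multiplied])
print("added     :", [round(v, 3) for v in added])
```

The multiply restores both tones. The offset repairs the bright tone by construction but lifts the dark one far too high, flattening the contrast inside the former shadow.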

The easy multi-image version is, once more, flash/no-flash (the flash floods the scene and erases the ambient shadow, so differencing the pair localizes it) — in Advanced computational photography — and the learned single-image shadow removers are in Deep learning.

8.9.5 Specular-highlight removal / "fake polarization" (the dichromatic model)

Glossy surfaces — skin, leaves, plastic, wet asphalt — show specular highlights: bright spots that are the light source reflected off the surface, not the surface itself. A highlight carries the illuminant's color, not the surface's, and it blows out texture, hides detail, and wrecks any downstream estimate of color or material. We would like the matte image — the diffuse-only version, as if the gloss had been wiped off.

The key is the dichromatic reflection model (Shafer 1985), which says that the light leaving a glossy surface is the sum of two terms with two different colors: a diffuse (body) term, light that entered the material and scattered back out carrying the surface's color; and a specular (interface) term, light that bounced straight off the surface without entering, carrying the illuminant's color. Per pixel,

$$ I(x) = m_d(x)\,\Lambda_{\text{body}} + m_s(x)\,\Lambda_{\text{illum}}, $$

where $\Lambda_{\text{body}}$ is the (fixed) surface color direction, $\Lambda_{\text{illum}}$ is the (fixed) illuminant color direction, and the scalar weights $m_d(x), m_s(x)$ — how much diffuse and how much specular — vary from pixel to pixel with the geometry. The crucial point: across a single-colored glossy patch the weights change everywhere, but there are only two color directions in play.

That is what makes the separation possible, and the geometric picture is the clearest way to see it (Figure 8.9.5). Plot the pixels of a glossy, uniformly-colored patch as points in RGB space. They do not scatter randomly; they form an "L" or dog-leg: a line running along the body color $\Lambda_{\text{body}}$ (the matte shading, where $m_s = 0$ and only $m_d$ varies) with a spur branching off toward the illuminant white $\Lambda_{\text{illum}}$ (the highlight, where $m_s$ climbs). Lee (1986) and, with the full cluster analysis, Klinker, Shafer and Kanade (Klinker et al. 1988) showed how to identify the two arms of the dog-leg and then project each pixel back onto the diffuse line, subtracting off the specular component along $\Lambda_{\text{illum}}$, to recover the matte image.

**Figure 8.9.5.** The dichromatic model in RGB. The pixels of a single glossy, uniformly-colored patch, plotted as points in RGB color space, form an **"L" / dog-leg**: a **diffuse line** along the surface color $\Lambda_{\text{body}}$ (matte shading, specular weight $m_s=0$) and a **specular spur** branching toward the illuminant color $\Lambda_{\text{illum}}$ (the highlight). Because $I = m_d\Lambda_{\text{body}} + m_s\Lambda_{\text{illum}}$ has only two color directions, the highlight is removed by **projecting each pixel back onto the diffuse line**, that is, subtracting its component along the illuminant direction. The separation is done in linear light, where the dichromatic sum is genuinely additive (L2).
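The projection step can be sketched for a single pixel once the two color directions are in hand. The sketch assumes both directions are already known (finding them is the cluster-analysis part and is not shown), uses a per-pixel least-squares fit as one simple way to obtain the weights, and works in linear light; the colors and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def remove_specular(pixel, body, illum):
    """Fit pixel ~ m_d * body + m_s * illum by least squares (2x2 normal
    equations), then subtract the specular part. Returns (m_d, m_s, diffuse)."""
    bb, bs, ss = dot(body, body), dot(body, illum), dot(illum, illum)
    pb, ps = dot(pixel, body), dot(pixel, illum)
    det = bb * ss - bs * bs
    if abs(det) < 1e-12:
        raise ValueError("body and illuminant directions are parallel")
    m_d = (pb * ss - bs * ps) / det
    m_s = (bb * ps - bs * pb) / det
    diffuse = tuple(p - m_s * s for p, s in zip(pixel, illum))
    return m_d, m_s, diffuse


# Hypothetical reddish surface under a white illuminant.
body = (0.8, 0.3, 0.2)
illum = (1.0, 1.0, 1.0)

# A highlight pixel built from the model with m_d = 0.5 and m_s = 0.3.
pixel = tuple(0.5 * b + 0.3 * s for b, s in zip(body, illum))

m_d, m_s, diffuse = remove_specular(pixel, body, illum)
print(round(m_d, 6), round(m_s, 6))
print(tuple(round(c, 6) for c in diffuse))
```

The guard on `det` reflects a real limitation: when the surface color is close to the illuminant color (a white or grey object), the two arms of the dog-leg are nearly parallel and the separation becomes ill-conditioned.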

This is why the technique is called "fake polarization." The real way to kill a specular highlight is a polarizing filter on the lens: specular reflection is polarized, so a rotated polarizer cuts it. Doing the same job purely computationally, from one image via the dichromatic model, gives you the polarizer's matte look without the filter and without the extra exposure; hence "fake." (There is a bonus: because the highlight carries the illuminant's color, it doubles as a free white reference. Lee (1986) used exactly this for color constancy: a highlight tells you the color of the light for free, which is the same idea the clipped-highlight cue fed into multi-illuminant white balance above.)

And this is its place in the chapter's frame. The specular term is an additive contamination of the shading layer, in the wrong color — the illuminant's rather than the surface's. Strip it and what remains is the clean diffuse $R\cdot S$. The cue is different from everything before — a color-direction prior rather than a gradient or smoothness prior — but the goal is identical: un-mix a layer that the camera summed in. As with reflection removal, the separation must happen in linear light, where the dichromatic sum is physically additive (L2); the genuine optical version (a rotating polarizer) and multi-light specular separation are in Advanced computational photography, and the learned separators are in Deep learning.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (recurrence of L1 + L10)
