10.11 Intrinsic images with time lapse
Bolt a camera to a windowsill and let it shoot one frame every few minutes from dawn to dusk. Over the day you collect hundreds of pictures of the same scene — the same bricks, the same lawn, the same parked car — yet no two frames look alike. What changes is not the scene but the light: the sun arcs overhead, shadows of the eaves and the railing sweep slowly across the wall, a bright patch crawls along the ground and is gone an hour later. The bricks were always that shade of red; the lawn was always that green. The paint never moved. Only the lighting did.
That single fact is the lever. An image is, to a first approximation, the product of two things — what a surface is made of and how it is lit. The first is its reflectance (or albedo): the fraction of light each point sends back, a property of the surface itself, fixed for the day. The second is its shading: how much light actually reaches that point, which depends on the sun's position, on shadows, on the whole geometry of illumination. Pulling these two apart — recovering the paint from the light — is the classic intrinsic-images problem, and from a single photograph it is hopeless: a dark patch could equally be dark paint or a cast shadow, and infinitely many reflectance/shading pairs reproduce the very same image (this is the prior-not-optional ambiguity, L10, met head-on in Illumination related effects in a single image). But across the time-lapse stack the ambiguity simply dissolves. Whatever stays constant frame to frame is reflectance; whatever changes is shading. No clever prior is needed — the temporal axis does the disambiguating, exactly as extra exposures did it for HDR and extra focal planes did it for the all-in-focus merge. This is the L14 idea once more, on a new axis.
The chapter is short because the payoff is a single, almost startlingly clean estimator, due to Weiss 2001: work in the log domain, take the median over time of the log-gradients, and you have the reflectance gradient field; re-integrate it with a Poisson solve and you have the flat-lit reflectance image, with the per-frame shading falling out for free.
10.11.1 The split, and why one image can't do it
Write the model down. At pixel $(x,y)$ in the frame captured at time $t$, the observed image is the product of a time-invariant reflectance and a time-varying shading,
$$I(x,y,t) = R(x,y)\,S(x,y,t),$$
where $R$ is the surface albedo — the what-it's-made-of, fixed across the whole sequence — and $S(x,y,t)$ is the how-it's-lit, sliding as the sun moves and shadows sweep through. (This is the same reflectance×shading factorization behind lightness constancy, developed in Human (and animal) vision and color; here we exploit it across frames rather than within one.) Intrinsic-image decomposition is the task of recovering the two factors $R$ and $S$ from the observed $I$.
From one frame this is fundamentally under-determined. There is one equation and two unknowns at every pixel, $I = R\cdot S$, so for any observed $I$ there are infinitely many factorizations: scale $R$ up by any positive factor and $S$ down by the same factor, and the product is unchanged. Is that dark region on the wall dark brick or the shadow of the railing? The pixel alone cannot say. Single-image methods break the tie only by importing an assumption about how reflectance and shading differ. The Retinex prior, for instance, declares that large gradients are reflectance edges (a sharp boundary between two paints) while gentle, smooth gradients are shading (light falling off softly), then thresholds the gradient field on that basis and integrates. That works, but it is a hand-built prior, and it is not optional: remove it and the problem has no answer. This is the single-image, ill-posed version of the story, and it is the subject of Illumination related effects in a single image (L10: the prior is not optional).
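The ambiguity can be checked numerically. The sketch below is illustrative only: the patch size, the value ranges, and the names `R_alt` and `S_alt` are assumptions made for the demonstration. It builds one image from a reflectance/shading pair, then rescales the two factors per pixel and confirms that the product, the only thing a single frame records, is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth for a tiny 4x4 patch (values are illustrative).
R = rng.uniform(0.2, 0.9, size=(4, 4))   # reflectance
S = rng.uniform(0.1, 1.0, size=(4, 4))   # shading
I = R * S                                # what the camera actually records

# Any positive per-pixel scale k yields another valid factorization.
k = rng.uniform(0.5, 2.0, size=(4, 4))
R_alt = R * k
S_alt = S / k

same_image = np.allclose(R_alt * S_alt, I)     # the product is unchanged
different_factors = not np.allclose(R_alt, R)  # but the factors are not
print(same_image, different_factors)
```

Nothing in `I` distinguishes `(R, S)` from `(R_alt, S_alt)`; only an outside assumption, or more frames, can.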
The time-lapse stack supplies, for free, what the single image lacked. Fix the camera so the scene registers pixel-for-pixel across the day, and shoot a long sequence. Now $R(x,y)$ is genuinely the same in every frame, and only $S(x,y,t)$ moves. The two unknowns are no longer symmetric: one is constant in time, the other is not. So the rule for reading the stack writes itself — constant-in-time means reflectance; varying-in-time means shading. The extra frames have replaced the prior. Where the single-image method had to guess which gradients were paint, the stack lets us measure it (Figure 10.11.1).
10.11.2 Weiss 2001: the median of log-gradients
Yair Weiss (2001), in Deriving Intrinsic Images from Image Sequences, turns the reading rule into a one-line estimator. Two moves do all the work: go to the log domain, and take the median over time.
Go to log. The model is multiplicative, and multiplicative quantities are awkward, but a logarithm turns a product into a sum. Taking logs,
$$\log I(x,y,t) = \log R(x,y) + \log S(x,y,t),$$
so the observed log-image is the sum of a constant log-reflectance and a time-varying log-shading. This is the same encoding lesson that recurs throughout the book: radiometry is done in linear light, and ratios become differences in log (L1/L2). Because gradients are linear, the spatial gradient of the log-image is likewise a sum,
$$\nabla \log I(x,y,t) = \nabla \log R(x,y) + \nabla \log S(x,y,t),$$
where $\nabla$ is the spatial gradient (the same per-pixel $x$- and $y$-difference field from Poisson image editing). The first term, the reflectance gradient, is the same in every frame. The second, the shading gradient, moves: a shadow edge produces a large gradient where it currently sits, and nothing where it is not.
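A small numerical check of the two identities, as a sketch under assumptions: the helper `forward_gradient` (plain forward differences, last column and row left at zero) and the synthetic 8x8 patch with a shadow over its right half are made up for illustration, not taken from the original method.

```python
import numpy as np

def forward_gradient(img):
    """Forward differences in x and y; the last column/row stays zero."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, :-1] = img[:, 1:] - img[:, :-1]
    gy[:-1, :] = img[1:, :] - img[:-1, :]
    return gx, gy

rng = np.random.default_rng(1)
R = rng.uniform(0.2, 0.9, size=(8, 8))  # textured reflectance
S = np.ones((8, 8))
S[:, 4:] = 0.3                          # hard shadow over the right half
I = R * S

gx_I, gy_I = forward_gradient(np.log(I))
gx_R, gy_R = forward_gradient(np.log(R))
gx_S, gy_S = forward_gradient(np.log(S))

# The log-gradient of the image is the sum of the two log-gradients.
sum_holds = np.allclose(gx_I, gx_R + gx_S) and np.allclose(gy_I, gy_R + gy_S)

# The shading gradient is nonzero only where the shadow edge currently sits.
shadow_cols = np.flatnonzero(np.abs(gx_S).sum(axis=0))
print(sum_holds, shadow_cols)
```

The shading term contributes a single column of large gradients at the shadow boundary and exactly zero elsewhere, which is the structure the median step relies on.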
Take the median over time. Now look at one pixel across the whole stack and ask what its log-gradient does over time. The reflectance contribution $\nabla\log R$ is a fixed vector present in every frame: at a true brick-to-mortar edge, every frame sees that edge. The shading contribution $\nabla\log S(t)$ is transient: a moving shadow edge passes over this pixel in only a minority of frames, lighting up the gradient there for a little while and then leaving. In the language of the averaging chapter, the persistent reflectance gradient is the signal and the passing shadow edges are outliers, and the right tool against outliers is not the mean (which a single hard shadow edge would drag) but the median, which stays put as long as the extreme values occupy fewer than half the frames. So the estimate of the reflectance gradient at each pixel is simply the per-pixel median, over time, of the observed log-gradient, taken separately for the $x$- and $y$-components:
$$\widehat{\nabla \log R}(x,y) = \operatorname{median}_{t}\; \nabla \log I(x,y,t).$$
Read it back: at each pixel, the reflectance gradient is whatever the log-gradient is for most of the day. The median keeps the gradient that persists and rejects the gradients that merely passed through — a robust estimator in one line, with no learned prior and no threshold to tune (Figure 10.11.2). (Like any robust combine across a stack, this is the same reject-or-average instinct as the median in the denoising chapter, now applied across frames to gradients rather than to pixel values.)
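The estimator can be exercised on a synthetic stack. What follows is a minimal sketch, not the paper's implementation: the function names `log_gradients` and `median_reflectance_gradient`, the forward-difference gradient, and the toy scene (one vertical paint edge, one hard shadow edge advancing a column per frame) are all assumptions chosen so that each pixel sees the shadow edge in exactly one frame out of twelve.

```python
import numpy as np

def log_gradients(images):
    """Forward-difference gradients of log(images) over the last two axes."""
    log_i = np.log(images)
    gx = np.zeros_like(log_i)
    gy = np.zeros_like(log_i)
    gx[..., :, :-1] = log_i[..., :, 1:] - log_i[..., :, :-1]
    gy[..., :-1, :] = log_i[..., 1:, :] - log_i[..., :-1, :]
    return gx, gy

def median_reflectance_gradient(stack):
    """Per-pixel, per-component median over time of the log-gradients."""
    gx, gy = log_gradients(stack)          # stack has shape (T, H, W)
    return np.median(gx, axis=0), np.median(gy, axis=0)

# Toy scene: a paint edge at column 8, and a shadow edge that moves each frame.
h, w, t = 6, 16, 12
R = np.full((h, w), 0.8)
R[:, 8:] = 0.4
cols = np.arange(w)
edge = np.arange(t) + 2                    # shadow boundary column per frame
S = np.where(cols[None, None, :] < edge[:, None, None], 0.25, 1.0)  # (T, 1, W)
stack = R[None, :, :] * S                  # broadcasts to (T, H, W)

gx_med, gy_med = median_reflectance_gradient(stack)
gx_true, gy_true = log_gradients(R)

# For contrast: the temporal mean is dragged by the passing shadow edge.
gx_all, _ = log_gradients(stack)
gx_mean = gx_all.mean(axis=0)

median_ok = np.allclose(gx_med, gx_true) and np.allclose(gy_med, gy_true)
mean_error = np.abs(gx_mean - gx_true).max()
print(median_ok, mean_error)
```

In this toy case the median recovers the reflectance gradient exactly, including at the column where the shadow edge once coincided with the paint edge, while the mean keeps a residue of every shadow edge that passed through. Real sequences are noisier, and a shadow that lingers over a pixel for more than half the frames would defeat the median.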
Reconstruct (L9). The median gives us a gradient field for log-reflectance, not the log-reflectance image itself, so the final step is to integrate that field back into an image. This is exactly the gradient-domain reconstruction of Poisson image editing: find the image $\log\hat{R}$ whose gradients best match the prescribed median field $\widehat{\nabla \log R}$, which is the Poisson equation
$$\nabla^2 \log \hat{R} = \nabla \cdot \widehat{\nabla \log R},$$
solved by the same sparse least-squares machinery: conjugate gradient, multigrid, or a fast Fourier transform (FFT) solve. (The median field is generally not the gradient of any single image: the median is taken independently at each pixel and for each gradient component, so wherever moving shadows have perturbed it the field need not be conservative. The solve is therefore a least-squares projection, not an exact integration, which is precisely why a Poisson solve is the right and necessary tool, per the conservative-field discussion in Poisson image editing.) Exponentiating gives the flat-lit reflectance image $R$, determined up to a global scale because gradients fix the log-image only up to an additive constant: the scene as if painted under perfectly even light, every transient cast shadow erased. And once $R$ is known, the per-frame shading falls out by subtraction in the log domain, one image per frame,
$$\log S(x,y,t) = \log I(x,y,t) - \log R(x,y),$$
leaving a shading sequence that contains only the moving light — the sweep of shadows with the texture removed.
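The whole pipeline fits in a short script. This is a sketch under stated assumptions rather than a production solver: `gradient_operator` and `integrate_least_squares` are hypothetical helper names, the gradient operator is built as a dense matrix and handed to `np.linalg.lstsq` (workable only for tiny images; a real implementation would use a sparse or FFT-based Poisson solve), and the unknown additive constant is fixed from the ground truth purely so the result can be compared against it.

```python
import numpy as np

def gradient_operator(h, w):
    """Dense forward-difference matrices acting on a flattened (h*w,) image."""
    idx = np.arange(h * w).reshape(h, w)

    def diff_matrix(src, dst):
        d = np.zeros((src.size, h * w))
        rows = np.arange(src.size)
        d[rows, src.ravel()] = -1.0
        d[rows, dst.ravel()] = 1.0
        return d

    dx = diff_matrix(idx[:, :-1], idx[:, 1:])   # horizontal neighbours
    dy = diff_matrix(idx[:-1, :], idx[1:, :])   # vertical neighbours
    return dx, dy

def integrate_least_squares(gx, gy):
    """Image u minimizing |Dx u - gx|^2 + |Dy u - gy|^2 (constant is arbitrary).

    The normal equations of this problem are the discrete Poisson equation.
    Expects gx of shape (h, w-1) and gy of shape (h-1, w).
    """
    h, w = gx.shape[0], gy.shape[1]
    dx, dy = gradient_operator(h, w)
    a = np.vstack([dx, dy])
    b = np.concatenate([gx.ravel(), gy.ravel()])
    u, *_ = np.linalg.lstsq(a, b, rcond=None)
    return u.reshape(h, w)

# Synthetic stack: textured reflectance, shadow edge advancing one column per frame.
h, w, t = 6, 16, 12
rng = np.random.default_rng(2)
R_true = rng.uniform(0.3, 0.9, size=(h, w))
cols = np.arange(w)
edge = np.arange(t) + 2
S_true = np.where(cols[None, None, :] < edge[:, None, None], 0.25, 1.0)
S_true = np.broadcast_to(S_true, (t, h, w))
stack = R_true[None, :, :] * S_true

# 1) log, 2) gradients, 3) median over time.
log_i = np.log(stack)
gx = np.median(log_i[:, :, 1:] - log_i[:, :, :-1], axis=0)   # (h, w-1)
gy = np.median(log_i[:, 1:, :] - log_i[:, :-1, :], axis=0)   # (h-1, w)

# 4) Integrate. The additive constant is unobservable; it is set from the
#    ground truth here only to make the comparison below meaningful.
log_r = integrate_least_squares(gx, gy)
log_r += np.log(R_true).mean() - log_r.mean()
R_est = np.exp(log_r)

# 5) Per-frame shading: log S = log I - log R, i.e. S = I / R.
S_est = stack / R_est[None, :, :]

reflectance_ok = np.allclose(R_est, R_true)
shading_ok = np.allclose(S_est, S_true)
print(reflectance_ok, shading_ok)
```

The recovery is exact here because every pixel sees the shadow edge in only one frame, so the median field happens to be a true gradient field. On real footage the field is only approximately conservative and the solve returns the closest image in the least-squares sense.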
What makes this elegant is how much it gets from how little. Four lessons meet in four lines: L1/L2 (log turns × into +), robust statistics (the median kills the transient shadow gradients), L9 (integrate the gradients to reconstruct), and L14 (the temporal stack is what disambiguates in the first place) — and the whole thing needs no learned prior, in pointed contrast to the single-image decomposition it sidesteps.
10.11.3 Where this chapter belongs: passive vs. active illumination
A word on placement, because this chapter sits at a natural fork. It lives here, among the multiple-exposure stacks, and only points forward to computational illumination — and the reason is the part's spine. Every chapter of this part captures a stack along one axis and decides later (L14): exposure for HDR, viewpoint for panoramas, focus for focal stacks, wavelength for hyperspectral. Time-lapse intrinsic images is simply the time / illumination axis of that same family, and it reuses the very tools this part already teaches — log encoding, gradients, the robust median, and a Poisson reconstruction. Mechanically and in spirit it is multi-exposure imaging: a fixed camera passively merging a set of frames, exploiting illumination that happens to vary.
The kinship to flag, and forward-reference, is with computational illumination — flash/no-flash, multi-flash, photometric stereo, structured light — which chases the same $I = R\cdot S$ goal but by actively controlling the light. Weiss is the passive counterpart of all of those: you do not set the lighting, you observe it change over the day and let time do the work that an active rig would otherwise have to engineer. So the active decompositions live there; this passive, time-lapse one belongs here, brief, with the kinship noted in passing.
Capture the full set, decide later — here the temporal stack defers which part of each pixel is paint and which is light: whatever is constant across the stack is reflectance, whatever varies is shading. This is the time/illumination recurrence of the lesson; it is introduced in this part's intro and registered in Big Lessons as L14.