💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

8.3 Blind deblurring

A photograph taken at dusk, handheld, at a thirtieth of a second: the streetlamp smears into a comma, the lettering on a shop sign doubles, every point of light draws the same little squiggle. That squiggle is the answer to a question — how did the camera move while the shutter was open? — and it is the same squiggle everywhere in the frame, because every point of the scene was painted across the sensor by the same hand tremor. In principle, if you knew that squiggle exactly, you could undo it: the blurry image is the sharp scene convolved with the squiggle, and convolution is linear, so deblurring is inversion. That is the textbook story of the previous chapter, and it is a lie in two places. First, the sensor added noise, and we are about to see that inverting a blur amplifies noise so violently that the naive answer is pure speckle. Second — the harder lie — you do not actually know the squiggle. You have to recover the camera's motion and the sharp scene from the single blurry photograph, with more unknowns than measurements.

This chapter is the hard sequel to linear inversion. We will see why exact deconvolution detonates the moment any noise is present, and meet the Wiener filter, the regularized inverse that defuses it frequency by frequency. We will then confront the blind problem — kernel unknown — and watch a natural-image prior break a chicken-and-egg that data alone cannot. We will upgrade the blur model to the spatially-varying reality of camera shake, which is a rotation, not a single 2-D kernel. And we will close by showing that the very same data-fit + prior recipe solves two problems that are not deblurring at all — turning a color image to gray while keeping its contrast (Color2Gray), and propagating a few color scribbles across a gray photo (colorization) — because the recipe is the real lesson, not the blur.

8.3.1 Deblurring in the presence of noise: why naive inversion fails

Start with the easy version to see exactly where it breaks. The blur is a known kernel $K$, there is no noise, and the observed image is $Y = K * X$. Convolution is a per-frequency multiply (this is L5 — convolution diagonalizes in Fourier, from Linearity, Fourier, Aliasing and deblurring), so in the Fourier domain $\hat Y = \hat K \,\hat X$, and recovering the sharp image is one division: $\hat X = \hat Y / \hat K$, then transform back. Clean, exact, done.

Now add the faintest whisper of noise — $Y = K * X + n$ — and that exact inverse becomes useless, not slightly worse but catastrophically, hilariously useless (Figure 8.3.1). To see why, remember what blur is: a low-pass filter. It smooths, so it attenuates high frequencies, and a real blur kernel's frequency response $\hat K$ goes essentially to zero at the high frequencies — those are exactly the fine details the blur erased. The inverse filter is $1/\hat K$, so at those frequencies it divides by almost zero. The trouble is that the signal there is gone, but the noise is not: noise is broadband, it has energy at every frequency including the ones where $\hat K \approx 0$. Dividing that noise by a near-zero number multiplies it by a gigantic factor. The reconstruction is dominated, totally, by amplified high-frequency noise — an image of speckle that contains no photograph at all.

The linear-algebra twin of this statement is worth keeping, because it generalizes beyond convolutions. Writing the blur as a matrix $A$ (so $Y = AX + n$), inversion divides each component of the data by a singular value of $A$. The low frequencies are the large singular values; the high frequencies, where the blur has killed the signal, are the tiny singular values. Dividing by a tiny singular value is the matrix version of dividing by $\hat K \approx 0$ — it is precisely there that noise explodes. Naive inversion has no defense, because nothing in $\hat Y / \hat K$ tells it "this frequency is noise, not signal." It trusts the data equally everywhere, and the data is worthless where the blur was strong.

This is another recurrence of L10 — the prior is not optional; see Super-resolution and image priors. Deblurring is the case where the forward operator is a convolution that kills high frequencies, so running it backward is ill-posed and noise-amplifying: you can only fill in what the blur destroyed with prior knowledge of what images look like.

So the fix is not to invert blindly but to trade off two things: fidelity to the measurement, and plausibility of the answer. Where the kernel is strong — low frequencies, high signal-to-noise — trust the data and invert. Where the kernel is weak — high frequencies drowning in noise — back off and lean on what we expect images to look like. Under a Gaussian-noise assumption with a known signal spectrum, the optimal version of that trade-off has a name and a one-line formula: the Wiener filter, next.

One encoding caution before we go, following our standing rule of declaring the space every operation lives in. Deconvolution assumes the blur is linear in scene radiance — the convolution model $Y = K*X$ is additive and linear — so it must be done in linear light, with gamma undone first (this is L2, from Linear vs Gamma vs log encoding). A gamma or log curve distorts the additive convolution, so deblurring a gamma-encoded image is solving the wrong equation. State this up front and un-gamma before you invert.

fig-naive-deconv-noise
Figure 8.3.1. The catastrophe of naive inversion. Left: a sharp image $X$. Middle: the same image blurred and corrupted by a tiny amount of noise, $Y = K*X + n$ — visibly just a soft, slightly grainy version of the original. Right: the exact inverse filter $\hat X = \hat Y / \hat K$ applied to the middle image — not a sharp photograph but a field of high-frequency speckle, the original scene completely buried. The inverse filter divides by $\hat K \approx 0$ at high frequencies, where the signal is gone but the noise is not, amplifying that noise by enormous factors.

8.3.2 The Wiener filter — the regularized, noise-aware inverse

The idea in one line: take the naive inverse filter and attenuate it wherever noise would dominate, frequency by frequency, with the amount of attenuation set by the signal-to-noise ratio at each frequency. Where the data is trustworthy, invert fully; where it is noise, hold back.

Written per frequency in the Fourier domain, the Wiener filter is

$$ \hat X \;=\; \frac{\overline{K}\,\hat Y}{|K|^2 + \mathrm{SNR}^{-1}}, $$

where $\overline K$ is the complex conjugate of $\hat K$ and $\mathrm{SNR}$ is the signal-to-noise ratio at that frequency. Read the formula as the inverse filter plus a brake. The part $\overline K / |K|^2$ is the inverse filter (it equals $1/\hat K$, just written so the denominator is real and positive). The $+\,\mathrm{SNR}^{-1}$ added to the denominator is the regularizer — the term that stops the blow-up. Equivalently you can read it as the inverse filter scaled by a per-frequency gain $|K|^2 / (|K|^2 + S_n/S_x)$, where $S_n$ and $S_x$ are the noise and signal power spectra and $\mathrm{SNR} = S_x/S_n$.

The behavior at the two extremes is hand-verifiable and is exactly what we asked for (Figure 8.3.2). High SNR at a frequency means $\mathrm{SNR}^{-1} \to 0$, so the denominator becomes $|K|^2$ and the whole thing collapses to the exact inverse filter $1/\hat K$: no noise, invert fully, recover the signal. Where the kernel vanishes — $|K| \to 0$, the killed high frequencies — the $|K|^2$ term goes to zero but the $+\,\mathrm{SNR}^{-1}$ keeps the denominator finite, so the gain goes smoothly to zero instead of blowing up to infinity. Those lost frequencies are simply left alone — set to zero — rather than amplified into speckle. The filter automatically passes the low frequencies (where the kernel is strong), rolls off through the mid-band, and stops at the noise floor.

Where does it come from? Just the intuition, no derivation: the Wiener filter is the minimum-mean-squared-error linear estimator under Gaussian noise — among all linear filters, it minimizes the expected squared error between the estimate and the truth, and the algebra of that minimization yields the formula above. The pedagogically important point is what the $\mathrm{SNR}^{-1}$ term is: it encodes a prior on the image's power spectrum. Natural images have a roughly $1/f$ spectrum — most of their energy is at low frequencies — and that assumption is precisely what tells the filter "expect little signal up here, so do not trust the data up here." So even the classic Wiener filter, the oldest trick in the book, is already a prior-driven method. That is the bridge to everything that follows: the only question from here on is which prior and how.

Its limitation points straight at the sequel. Wiener assumes a known kernel and a stationary Gaussian prior — a power-spectrum prior, equivalent to assuming the image is a blurry, textureless cloud. But real images are not Gaussian: they have sharp edges, and the right prior for edges is sparse gradients, not a power spectrum. Swapping the Gaussian power-spectrum prior for a sparse-gradient one gives modern non-blind deconvolution (Levin et al. 2007 Levin et al. 2007; Krishnan & Fergus 2009 Krishnan & Fergus 2009, whose hyper-Laplacian prior makes that solve fast). And when even the kernel is unknown, we get blind deblurring — the next section. The fully general version, "use any denoiser as the prior," is Plug-and-Play / RED in Super-resolution and image priors; the learned prior is Deep learning.

fig-wiener-tradeoff
Figure 8.3.2. The Wiener filter's frequency response, plotted against spatial frequency at three SNR levels. Each curve is the per-frequency gain $|K|^2/(|K|^2 + \mathrm{SNR}^{-1})$ overlaid on the naive inverse filter $1/|K|$ (dashed, shooting to infinity where $|K|\to 0$). High SNR (low noise): the Wiener curve tracks the inverse filter far up into the high frequencies before rolling off — invert aggressively, the data is trustworthy. Low SNR (high noise): the curve rolls off early and gently, refusing to amplify the noisy high band. At every SNR the gain passes the low frequencies, rolls off through the mid-band, and goes to zero exactly where the kernel vanishes — never to infinity.

8.3.3 Blind deblurring — estimating the kernel and the image

Now drop the last assumption. We no longer know the blur kernel $K$ — camera shake produces an unknown, photo-specific squiggle of a PSF — so we must recover both $X$ and $K$ from a single blurry $Y$. That is twice as many unknowns as before — more unknowns than measurements. It sounds hopeless, and naively it is.

The structure of the difficulty is a chicken-and-egg (Figure 8.3.3). If you knew the sharp image $X$, finding the kernel would be easy — $K$ is just $Y$ deconvolved by $X$, a division in Fourier. If you knew the kernel $K$, finding the image would be the non-blind deconvolution we just did — a Wiener or sparse-prior solve. Each half is easy given the other; the trouble is we have neither.

Worse, the problem is genuinely ill-posed in a way that traps the obvious optimizer. The blurry image explains itself. The true factorization $Y = (\text{sharp } X) * (\text{real blur } K)$ fits the data perfectly — but so does the trivial factorization $Y = (\text{the blurry image itself}) * (\delta)$, where $\delta$ is the identity (delta) kernel meaning "no blur at all." Both reproduce $Y$ exactly. So a naive joint optimizer that only fits the data will happily return the no-op answer: declare the blurry image already sharp and the kernel a single point. Data-fit alone is hopeless — it cannot tell deblurring from doing nothing, because doing nothing fits perfectly.

What breaks the tie is a prior on natural images, and the specific statistic is the one we met in gradient-domain editing: real sharp photos have sparse, heavy-tailed gradients (Figure 8.3.4). A natural scene is mostly flat regions — sky, walls, skin — punctuated by a few strong edges. Plot the histogram of gradient magnitudes on a log scale and it is sharply peaked at zero with long, heavy tails: lots of near-zero gradients, a small number of large ones. Blurring destroys this signature. Convolving with a spread-out kernel takes each sharp edge and smears it into many small gradients, filling in the mid-magnitudes and pulling the histogram toward the bell-shaped Gaussian of a smooth image. So the blurry, self-explaining "no-blur" answer has the wrong gradient statistics — too dense, too Gaussian — and a sparse-gradient prior penalizes it. The prior prefers the sharp explanation, because only the genuinely sharp image has the heavy-tailed gradients of a real photo. This is L9 turned into a discriminator: the eye cares about gradients, and so, it turns out, does the statistics that tell blurry from sharp.

The objective puts data-fit and two priors together:

$$ (\hat X, \hat K) \;=\; \arg\min_{X,K}\; \underbrace{\lVert K * X - Y\rVert^2}_{\text{data fit}} \;+\; \underbrace{\lambda\,\rho(\nabla X)}_{\text{image prior}} \;+\; \underbrace{\mu\,\psi(K)}_{\text{kernel prior}}, $$

where $\rho$ is a sparse, heavy-tailed gradient penalty — $\lVert \nabla X\rVert_\alpha$ with $\alpha < 1$, a so-called hyper-Laplacian, which is what makes the heavy-tailed statistics cheap and dense-gradient blur expensive — and $\psi$ is a constraint on the kernel (non-negativity, since a PSF cannot be negative; sparsity; and unit energy). The kernel prior matters: the squiggle of a hand tremor is a thin, connected, sparse curve, not a fuzzy blob. This objective is non-convex in $(X, K)$ jointly, so how you optimize it is itself a design decision — and a subtle one.

Two approaches, with an important caveat between them:

K ← δ                                  # start from "no blur"
for each pyramid level, coarse to fine:
    upsample X and K to this level
    repeat until converged:
        X ← argmin_X ‖K * X − Y‖² + λ·ρ(∇X)     # non-blind deconv, K fixed
        K ← argmin_K ‖K * X − Y‖² + μ·ψ(K)      # fit kernel, X fixed
return non-blind deconvolution of Y with the final K

It is simple and often works, but it can collapse to the trivial delta-kernel solution if the prior is not handled carefully — the optimizer slides toward "no blur" exactly as warned above.

The attribution is worth stating plainly. Fergus, Singh, Hertzmann, Roweis & Freeman (2006) Fergus et al. 2006, "Removing Camera Shake from a Single Photograph," put blind deblurring on the map for real photographs, using a natural-image gradient prior together with variational inference to estimate the kernel. Levin, Fergus, Durand & Freeman (2007) Levin et al. 2007 clarified the role of the sparse prior in non-blind deconvolution, and Levin, Weiss, Durand & Freeman (2009) Levin et al. 2009 gave the canonical evaluation — the careful analysis of why naive joint MAP fails and why marginalizing over the image is what actually works. If you remember one thing from this section, make it that: in a blind problem, the estimator design, not just the prior, is load-bearing. The learned successors — convolutional neural network (CNN) and diffusion deblurring — live in Deep learning; they keep this exact structure but replace the hand-modeled sparse prior with one learned from data (L8).

fig-blind-chicken-egg
Figure 8.3.3. The blind chicken-and-egg, and why data-fit alone fails. The same blurry photograph $Y$ is "explained" two ways that both fit the data exactly. Left: $Y = (\text{sharp } X) * (\text{real squiggle } K)$ — the true factorization, a crisp scene blurred by the camera's motion. Right: $Y = (\text{the blurry image}) * (\delta)$ — declare the blurry image already sharp and the kernel a single point: "no blur at all." A naive optimizer fitting only $\lVert K*X - Y\rVert^2$ cannot choose between them and slides to the trivial delta-kernel answer. Only a prior that knows real images have sparse gradients breaks the tie (Figure 8.3.4).
fig-natural-image-gradient-prior
Figure 8.3.4. Why the prior prefers the sharp explanation. Log-scale histograms of gradient magnitude for (1) a real sharp photograph — sharply peaked at zero with long, heavy tails: mostly flat regions, a few strong edges; and (2) the same photograph after blurring — the peak lowered and the mid-magnitudes filled in, the distribution pulled toward a smooth Gaussian as each sharp edge is smeared into many weak gradients. A sparse, heavy-tailed prior $\rho(\nabla X) = \lVert\nabla X\rVert_\alpha$, $\alpha<1$, assigns low cost to the sharp histogram and high cost to the blurred one — so among explanations that fit the data, it selects the sharp $X$ and rejects the dense-gradient no-blur answer.

8.3.4 A more realistic blur model: spatially-varying (camera-shake) blur

Everything above assumed a single kernel convolved over the whole image — shift-invariant blur, where the squiggle is the same in every corner. That assumption is the foundation of the FFT trick and of "the kernel" as one small object. And real camera shake violates it.

The reason is geometric: hand shake is mostly rotation of the camera, not translation (Figure 8.3.5). When you cannot hold a camera still, your wrist and arm rotate it — pitch and yaw — far more than they slide it sideways. And a rotation moves different parts of the image by different amounts and in different directions. A point near the optical axis barely moves; a point out at the corner of the frame sweeps through a much larger arc. So the corners smear more than the center, and because the motion is a rotation the smear curves differently across the frame. There is simply no single PSF that, convolved everywhere, reproduces this — the blur is genuinely spatially varying.

Camera shake is not the only blur that breaks shift-invariance, and it is worth naming its kin so the reader sees the pattern. Defocus blurs each pixel by an amount that depends on the depth of what it images — the blur disk grows with distance from the focus plane, and it changes abruptly across occluding edges — so defocus is not a convolution either (it is depth-dependent, and undoing it is entangled with estimating depth). The lens aberrations of the Optics part — spherical aberration, coma, astigmatism, field curvature — are mild near the optical axis and grow toward the corners, so their blur kernel varies across the field; and chromatic aberration makes it differ per colour channel. All of these share camera shake's structure: a blur that changes with position (or depth) rather than one fixed kernel. In each case the shift-invariant convolution is only a local approximation, valid within a small patch — exactly the "blur isn't always a convolution" caveat from Neighborhood operations and convolution — and the recipe that follows for camera shake (model the true physical cause, then invert a non-stationary linear operator) is the template for handling them all; the deblurring intuitions about which information the blur destroyed carry over essentially intact.

The better model (Whyte, Sivic, Zisserman & Ponce, 2011/2014 Whyte et al. 2011/2014) does not try to write down a different kernel at every pixel independently — that would be hopelessly many unknowns. Instead it goes back to the physics: the blurry image is a weighted sum of the sharp image seen under many camera poses, integrated over the camera's rotational trajectory during the exposure. The unknown is that trajectory — a path through the low-dimensional space of camera rotations — and it is one underlying 3-D motion. From it, the spatially-varying PSF at every pixel is derived, automatically consistent across the frame because it comes from a single physical cause. Far fewer unknowns than an independent kernel per pixel, and physically meaningful ones.

The consequence for the math is that deblurring becomes inversion of a global, non-stationary linear operator — the pose-integral — rather than a convolution. It is still linear in $X$, and it is still the same data-fit + sparse-prior recipe; what changes is that the forward operator no longer diagonalizes in Fourier, so there is no single-FFT trick (L5 does not apply when the operator is not a convolution). We are back to the matrix-free iterative solvers of the previous chapter — conjugate gradient and friends — applying the pose-integral operator and its transpose at each step. A further wrinkle that Whyte 2014 handles and that any "realistic" deblurring must face: real shots have saturated highlights (a streetlamp clips to pure white), and clipping is a non-linearity that breaks the linear convolution model outright — handling those saturated pixels specially is part of making the method work on actual photographs.

There is an optimistic flip side worth flagging as a forward-reference. Instead of fighting an accidental, hard-to-invert blur, you can engineer the PSF so that it is easy to invert — and even encodes useful information like depth. Coded aperture, motion-invariant photography, and wavefront coding (Levin et al. 2007; Dowski & Cathey 1995 Dowski & Cathey 1995) all design the blur deliberately, so the forward operator is nice by construction rather than nasty by accident. "Make the forward operator invertible" beats "invert a nasty one" — that story lives in the coded-imaging / PSF-engineering part.

fig-camera-shake-nonuniform
Figure 8.3.5. Why camera shake is not a convolution. Left: the camera rotates (pitch/yaw) during the exposure — the dominant component of hand tremor. Right: the resulting blur across the frame, drawn as the little streak each point of light leaves. Near the optical axis the streaks are tiny; out toward the corners they grow long and curve, because a rotation moves peripheral rays through a larger arc than central ones. No single kernel convolved everywhere can produce this spatially-varying, curving smear — the blur must be modeled as the sharp scene integrated over the camera's rotational trajectory.

8.3.5 Decolorization (Color2Gray) as gradient-matching optimization

A placement note first, because the next two sections are not deblurring. They are, however, the same data-fit + prior inverse-problem recipe this chapter teaches, applied to problems that are not restoration at all — which is the whole point of including them. The recipe generalizes; the blur was just the first instance. (Their efficient solvers live in Poisson image editing and Edge-preserving optimization — colorization; here we only set up the framing.)

The problem: convert a color image to grayscale while preserving contrast that luminance alone destroys. Two colors can be isoluminant — different hues with the same luminance, like a particular red and a particular green tuned to match in brightness. The naive conversion — "take the luminance" or "desaturate" — maps both to the identical gray, erasing a contrast the viewer plainly saw (Figure 8.3.6). A red figure on a green background of equal luminance simply vanishes into a flat gray field. This is a real failure: think of a pie chart, a subway map, or a photograph of red berries in green leaves, all of which can dissolve into mush when desaturated.

The framing of Gooch, Olsen, Tumblin and Gooch 2005 is to pose decolorization as an optimization in the gradient domain — exactly the move of Poisson image editing. Find the grayscale image $g$ whose gradients best match the perceived color differences between neighboring pixels. Where two neighbors differ in color — even isoluminantly — force a difference in gray level between them; and carry the sign of that difference so the ordering stays consistent (brighter-looking colors map consistently lighter, not arbitrarily). The objective is the familiar gradient-matching least-squares,

$$ \min_g\; \sum \lVert \nabla g - \theta \rVert^2, $$

where $\theta$ is a target gradient field derived from differences in CIELAB (the perceptually uniform $L^*a^*b^*$ color space of the Commission Internationale de l'Éclairage (CIE)) — combining the luminance difference where it exists with a signed chrominance difference where it does not, so that an isoluminant color step still produces a non-zero target gradient. Minimizing this is a Poisson-type linear system — the very same machinery as gradient-domain editing — solved by the very same CG / multigrid / FFT solvers. The "prior" here is unusual but real: it is match perceived contrast, the instruction that the output's local differences should track what a viewer perceives rather than what the luminance channel alone records. It is a least-squares reconstruction from a target gradient field, the gradient-domain pattern yet again, which is why it belongs in this chapter even though no blur is involved.

One encoding note, our standing rule: the color contrasts must be computed in a perceptually uniform space — CIELAB — not in linear or gamma RGB, so that "equal color difference" means equal perceived difference (this is the L1/L2 discipline of working where differences mean what you want). Stating the space is part of the method, not a footnote.

[figure fig-color2gray not built]
Figure 8.3.6. Decolorization that keeps isoluminant contrast. Left: a color image containing an isoluminant pair — a red shape on a green field of equal luminance. Middle: naive grayscale (take the luminance / desaturate) — the red and green map to the same gray, and the shape disappears into a flat field; a contrast the viewer clearly saw is erased. Right: Color2Gray (Gooch et al. 2005) — solving $\min_g \sum\lVert\nabla g - \theta\rVert^2$ with a target gradient $\theta$ built from signed CIELAB color differences forces a gray-level step wherever colors differ, so the shape reappears with its contrast intact.

8.3.6 Colorization as an optimization (the inverse-problem framing)

The companion problem runs the other way: a grayscale photo plus a few color scribbles — a stroke of blue on the sky, red on a coat, green on the grass — and the goal is to propagate that color across the whole image, filling every pixel, while stopping at edges so color does not bleed across object boundaries (the red of the coat must not leak onto the gray wall behind it). A decision on placement: we introduce the framing here, as another instance of the chapter's recipe; the worked, edge-aware affinity solver is developed in Edge-preserving optimization — colorization (cross-referenced both ways).

The inverse-problem framing is Levin, Lischinski & Weiss (2004) Levin et al. 2004. The unknown is the per-pixel chrominance $U$ (the color to fill in; the luminance is already given by the gray image). The constraints are the scribbles — at scribbled pixels, $U$ is pinned to the chosen color. And the prior is a single sentence: neighboring pixels with similar intensity should get similar color. Encode that as an affinity $w_{xx'}$ between neighboring pixels $x$ and $x'$, large when their gray intensities match and small when they differ — i.e. small across an edge. Then minimize the color variation between each pixel and the affinity-weighted average of its neighbors, subject to the scribbles:

$$ \min_U\; \sum_x \Big( U(x) - \sum_{x' \in N(x)} w_{xx'}\, U(x') \Big)^2. $$

Read it: each pixel's color should equal the intensity-weighted average of its neighbors' colors — so color flows freely through regions of constant intensity and is blocked at intensity edges, where the affinity drops to near zero. With the scribbles as boundary conditions, this is a sparse linear least-squares problem, solved by the same CG / multigrid machinery as Efficient solvers and Poisson editing. Color spreads from each scribble outward, pooling within objects and halting at their boundaries, with no scribble ever needed in the interior of a uniform region.

💡 Big lesson (callback to L4)

The weight $w_{xx'}$ — "how much these two pixels belong together, judged by their intensity difference" — is an affinity. This is the same idea as edge-preserving = affinity (L4, first placed in Bilateral filtering): an edge is exactly where the affinity between neighbors collapses, and an edge-aware operation is one that averages within high-affinity regions and refuses to average across low-affinity ones. The colorization objective is the optimization form of that idea — the bilateral filter's local average turned into a global linear solve. This is the hinge to Edge-preserving optimization — colorization, where the affinity / edge-aware machinery is developed in full (the colorization "matting Laplacian" is precisely an affinity matrix). (Callback to L4; not a first appearance.)

The forward-reference closes the loop: the learned alternative dispenses with scribbles entirely — train a CNN to predict plausible colors directly from a gray image (Zhang, Isola & Efros, 2016 Zhang et al. 2016), replacing the hand-built affinity prior with one learned from millions of color photographs (L8 again). That, and deep deblurring, dehazing, and the generative priors that supersede all the hand-designed ones in this chapter, are Deep learning and Generative AI and diffusion — but the skeleton never changes. It is always data-fit + prior; this chapter has just been a tour of which prior, and how.

[figure fig-colorization-scribbles not built]
Figure 8.3.7. Colorization by optimization (Levin, Lischinski & Weiss 2004). Left: a grayscale photo with a few color scribbles — strokes of the intended hue on the sky, a coat, the grass. Right: the propagated result — color spread across every pixel by solving $\min_U \sum_x \big(U(x) - \sum_{x'} w_{xx'} U(x')\big)^2$ with the scribbles pinned. The affinity weights $w_{xx'}$ are large between similar-intensity neighbors and near zero across intensity edges, so color flows freely within objects and stops at their boundaries — the coat's red does not bleed onto the wall behind it.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (callback to L4)

The weight $w_{xx'}$ — "how much these two pixels belong together, judged by their intensity difference" — is an affinity. This is the same idea as edge-preserving = affinity (L4, first placed in Bilateral filtering): an edge is exactly where the affinity between neighbors collapses, and an edge-aware operation is one that averages within high-affinity regions and refuses to average across low-affinity ones. The colorization objective is the optimization form of that idea — the bilateral filter's local average turned into a global linear solve. This is the hinge to Edge-preserving optimization — colorization, where the affinity / edge-aware machinery is developed in full (the colorization "matting Laplacian" is precisely an affinity matrix). (Callback to L4; not a first appearance.)