
8.8 Colorization

Hand a computer a black-and-white photograph — a 1930s street scene, a frame of silent film, an ancestor's portrait — and ask it to colour it. The result, when it works, feels like a small resurrection: the past in colour, as if the camera had been better all along. It is also a textbook ill-posed inverse problem, and a uniquely instructive one, because it strips the part's recurring theme down to its bones. In super-resolution or deblurring the measurement still carries most of the answer and the prior only refines it. In colorization the measurement carries the structure — every edge, every texture, every shadow — but none of the colour. The luminance tells you where things are and how light they are; it says nothing about whether the dress is red or blue. So the colour you get out is, almost entirely, the colour the prior put in. Colorization is the part's clearest proof that the prior is not a tie-breaker but the source of the signal.

8.8.1 One channel in, three channels out

Work in a luminance/chrominance space, such as CIE $L^*a^*b^*$ or $YUV$, where one axis is lightness and the other two are colour (Color technology). A greyscale image is the luminance channel $L$; colorization must produce the two chrominance channels $(a,b)$ at every pixel. That is two unknowns per pixel recovered from one measurement: under-determined by construction, exactly the shape of every recovery in this part, and the reason the answer needs a prior at all (L10, Super-resolution and image priors).
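The split is easy to make concrete. The sketch below is a minimal illustration, assuming the BT.601 $YUV$ matrix (an assumed choice; any luminance/chrominance space makes the same point) and hypothetical helper names `split_luma_chroma` and `merge_luma_chroma`. It builds two clearly different colours on the same luminance, which is exactly the information a greyscale image throws away.

```python
import numpy as np

# BT.601 analogue YUV matrix (an assumed choice for illustration).
# Row 0 produces luminance Y; rows 1 and 2 produce chrominance U and V.
RGB_TO_YUV = np.array([
    [ 0.299,    0.587,    0.114  ],
    [-0.14713, -0.28886,  0.436  ],
    [ 0.615,   -0.51499, -0.10001],
])
YUV_TO_RGB = np.linalg.inv(RGB_TO_YUV)

def split_luma_chroma(rgb):
    """rgb: float array (H, W, 3). Returns luminance (H, W) and chrominance (H, W, 2)."""
    yuv = rgb @ RGB_TO_YUV.T
    return yuv[..., 0], yuv[..., 1:]

def merge_luma_chroma(luma, chroma):
    """Inverse of split_luma_chroma: recombine one luminance and two chrominance channels."""
    yuv = np.concatenate([luma[..., None], chroma], axis=-1)
    return yuv @ YUV_TO_RGB.T

# Two different colours that share one luminance value.
luma = np.full((1, 1), 0.5)
warm = merge_luma_chroma(luma, np.array([[[-0.10, 0.20]]]))
cool = merge_luma_chroma(luma, np.array([[[0.20, -0.10]]]))

# Both collapse to the same greyscale pixel: the measurement cannot tell them apart.
luma_warm, _ = split_luma_chroma(warm)
luma_cool, _ = split_luma_chroma(cool)
```

A colorizer sees only `luma_warm` (identical to `luma_cool`) and has to decide between `warm`, `cool`, and every other chrominance pair, which is why the choice must come from a prior.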

fig-colorization-luminance-chrominance
Figure 8.8.1. The problem made concrete. A colour photograph splits into a luminance channel — the greyscale image a colorizer is given, carrying every edge and texture — and two chrominance channels — the colour it must invent (shown here as the pure colour field at constant lightness). The third panel recombines them into the colour image. The measurement supplies all the structure and none of the colour, so the colour you see is the colour the prior put in.

What makes it especially under-determined is that the missing quantity is multimodal. A grey car could plausibly be red, blue, silver, or black; a grey wall could be any pastel. The luminance does not merely under-specify the colour — it leaves several genuinely different colours equally consistent. Hold on to that fact; it is the hinge the whole chapter turns on, because the methods differ mostly in how they choose among the modes, and the automatic methods can fail precisely by refusing to choose.

What rescues the problem is that colour is not free to vary pixel by pixel. Within a single object it is nearly constant and changes only where the luminance changes — the shirt is one colour up to its outline, the sky one colour up to the horizon. And the world supplies strong statistical priors: skies and water trend blue, foliage green, skin to a narrow band of warm tones. Every method below is a different way of injecting one or both of these — local smoothness keyed to luminance edges, and global knowledge of what the world's colours usually are.

8.8.2 A spectrum of priors: scribbles, references, and learned models

The methods sort cleanly by who supplies the colour prior, from a human in the loop to a model that has seen everything (Figure 8.8.2).

fig-colorization-spectrum
Figure 8.8.2. The chapter on one axis: colorization methods arranged by who supplies the colour prior, from a human's hand (left) to a model trained on data (right). Across it, human control falls and learned world-knowledge rises — scribble propagation (the human paints colours), example/reference (copy from a chosen photo), user-guided learned (a learned prior plus a few hints), fully automatic (a network invents colour from millions of images), and generative (sample diverse colourings). The two ends differ in what they know: the human knows what colour the scene was; the model knows what colour such a scene usually is.

Interactive — the human picks the colours, the algorithm spreads them. The user crayons a few coarse colour strokes and an optimization propagates them along luminance affinity — colour flooding regions of smooth lightness and halting at luminance edges — so the human supplies which colours go roughly where and the algorithm supplies the spread. This is Levin, Lischinski & Weiss's (2004) scribble colorization, developed in full as the canonical filtering-becomes-optimization example in Edge-preserving optimization — colorization (whose affinity solve is, not coincidentally, the matting Laplacian of Compositing, segmentation and matting); here we only need that it sits at the human-supplies-the-prior end of the spectrum — accurate and controllable, but work, since every photograph needs someone who knows what colour things were.
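To see the propagation at work, here is a deliberately reduced sketch on a one-dimensional row of pixels. It is not the authors' implementation: it assumes a two-neighbour window, a Gaussian affinity on luminance difference with a hand-picked `sigma`, and a dense solve, and the function name `propagate_scribbles_1d` is hypothetical. Each unscribbled pixel is asked to equal the affinity-weighted average of its neighbours; each scribbled pixel is pinned to its hint.

```python
import numpy as np

def propagate_scribbles_1d(luma, hints, sigma=0.05):
    """Spread sparse chrominance hints along a 1-D row of pixels.

    luma:  (N,) luminance values.
    hints: dict {pixel index: chrominance value} supplied by the user.
    sigma: assumed affinity bandwidth (illustrative, not a tuned value).
    """
    n = len(luma)
    A = np.eye(n)
    b = np.zeros(n)
    for r in range(n):
        if r in hints:
            b[r] = hints[r]          # identity row: chroma[r] = hint
            continue
        nbrs = [s for s in (r - 1, r + 1) if 0 <= s < n]
        # Affinity is high between pixels of similar luminance, tiny across an edge.
        w = np.array([np.exp(-(luma[r] - luma[s]) ** 2 / (2 * sigma ** 2)) for s in nbrs])
        w /= w.sum()
        for s, ws in zip(nbrs, w):
            A[r, s] = -ws            # row: chroma[r] - sum_s w_rs * chroma[s] = 0
    return np.linalg.solve(A, b)

# A dark region next to a bright one, with one scribble in each.
row_luma = np.array([0.2] * 5 + [0.8] * 5)
row_chroma = propagate_scribbles_1d(row_luma, {1: 0.4, 8: -0.3})
```

In this toy row the hint at pixel 1 floods the dark region and the hint at pixel 8 floods the bright one, and neither crosses the luminance edge between pixels 4 and 5. A real image needs a two-dimensional neighbourhood and a sparse solver, and the system is solved once per chrominance channel.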

Example-based — a reference photo supplies the colours. Drop the scribbles and instead hand the system a colour image of a similar scene: another beach, another face, another bowl of fruit. For each greyscale pixel, find the location in the reference whose neighbourhood looks most alike — matched on luminance and a little local texture — and copy that location's chrominance across (Welsh, Ashikhmin & Mueller, 2002; refined to match within segmented regions by Irony et al., 2005). The prior is now "this other photograph," and the method is a colour cousin of the histogram and colour transfer of Histograms and of the patch-copying in Inpainting, texture synthesis — transport colour from where it exists to where it is missing. It removes the per-stroke labour but trades it for the labour of finding a good reference, and it fails when no region of the reference matches a region of the target.
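The matching step can be sketched in a few lines. This is a simplified, brute-force illustration rather than the published method: it assumes a two-number feature per pixel (luminance plus the standard deviation of a 3×3 neighbourhood), an assumed `texture_weight`, and an exhaustive nearest-neighbour search, and it skips the luminance remapping and sampling a practical system would use. The names `local_std` and `transfer_chroma` are hypothetical.

```python
import numpy as np

def local_std(luma, radius=1):
    """Standard deviation of luminance in a (2*radius+1)^2 window around each pixel."""
    k = 2 * radius + 1
    padded = np.pad(luma, radius, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.std(axis=(-2, -1))

def transfer_chroma(target_luma, ref_luma, ref_chroma, texture_weight=0.5):
    """Copy chrominance from the best-matching reference pixel to each target pixel.

    target_luma: (H, W); ref_luma: (Hr, Wr); ref_chroma: (Hr, Wr, 2).
    texture_weight is an assumed knob balancing luminance against local texture.
    """
    def features(luma):
        return np.stack([luma, texture_weight * local_std(luma)], axis=-1).reshape(-1, 2)

    t_feat, r_feat = features(target_luma), features(ref_luma)
    # Brute-force squared distances, shape (num target pixels, num reference pixels).
    d2 = ((t_feat[:, None, :] - r_feat[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    flat_chroma = ref_chroma.reshape(-1, ref_chroma.shape[-1])
    return flat_chroma[nearest].reshape(target_luma.shape + (ref_chroma.shape[-1],))
```

The distance matrix grows with the product of the two image sizes, so this form only suits small examples. The failure mode described above is also visible in the code: `argmin` always returns some reference pixel, even when nothing in the reference resembles the target region.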

Fully automatic — a network supplies the colours, learned from everything. Train a convolutional network on a vast corpus of colour photographs by the cheapest supervision imaginable: strip each one to greyscale, ask the network to predict its colour, and compare against the original you hid (Zhang, Isola & Efros, 2016; Iizuka et al., 2016; Larsson et al., 2016). No human, no reference; the prior is the statistics of millions of images, learned (L8, Deep learning). The network discovers the world's regularities — grass is green, lips are red, skies grade blue to white — and applies them automatically. This is the version that colours a museum's archive overnight, and it is where the multimodality of the problem turns from a footnote into the central design decision.

fig-colorful-colorization-1
Figure 8.8.3. Fully automatic colorization, the paper's gallery. A spread of greyscale photographs — portraits, animals, landscapes, a ladybird — each colourised with no user input at all, the plausible colours learned from a large corpus of colour images stripped to grey and back. Posed as classification over quantized colours (with rare colours up-weighted) so the output stays saturated rather than sliding toward the desaturated browns a regression loss favours (Zhang et al. 2016).

8.8.3 The multimodality trap: why naïve colorization goes muddy

Here is the chapter's one genuinely new idea, and it is a lesson that reaches far past colorization. Suppose you train the automatic network the obvious way: predict the chrominance $(a,b)$ at each pixel and minimise the squared error against the true colour — a regression loss. It will produce desaturated, muddy images, every confident colour washed toward a sad greyish brown. The cause is not a weak network or too little data; it is the loss, and the multimodality we flagged at the start.

When a grey pixel could plausibly be red or blue, the squared-error loss is minimised not by guessing one but by predicting the average of the two — which is a dull grey in the middle. A network trained to minimise mean-squared error learns to output the mean of the conditional colour distribution, and the mean of several vivid, incompatible options is a colour no real object has. Averaging a multimodal distribution lands you in the empty valley between its peaks (Figure 8.8.4). The very ambiguity that makes colorization interesting is what a regression loss handles worst.
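A two-mode toy case makes the arithmetic explicit. The chrominance values below are illustrative, not measured colours: two equally likely, equally saturated modes placed symmetrically about neutral. The sketch compares the expected squared error of predicting the mean against that of committing to one mode.

```python
import numpy as np

# Two equally plausible chrominance values for the same grey pixel
# (illustrative numbers in an (a, b)-like plane; (0, 0) is neutral grey).
modes = np.array([[60.0, 40.0], [-60.0, -40.0]])
probs = np.array([0.5, 0.5])

def expected_sq_error(pred):
    """Expected squared error of a fixed prediction under the two-mode distribution."""
    return float(np.sum(probs * np.sum((modes - pred) ** 2, axis=1)))

mean_pred = probs @ modes                          # the squared-error minimiser
err_mean = expected_sq_error(mean_pred)            # 5200.0
err_commit = expected_sq_error(modes[0])           # 10400.0
saturation_of_mean = float(np.linalg.norm(mean_pred))   # 0.0, i.e. pure grey
```

Predicting the mean halves the expected loss relative to committing to either mode, so a network trained on this loss is rewarded for the grey answer even though no object in the data ever had that colour.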

fig-colorization-multimodal
Figure 8.8.4. Why mean-squared regression desaturates. A single greyscale object — say a balloon — is consistent with several saturated colours (red, blue, green): its conditional colour distribution over the $(a,b)$ plane is multimodal, a few separated peaks. A regression loss is minimised by predicting the mean of that distribution, which falls in the empty centre — a desaturated grey that matches no mode. Treating colorization as classification over quantized $(a,b)$ bins (with rare, saturated bins up-weighted) lets the network commit to one peak and return a vivid colour. The same "average the modes and get mush versus pick one and stay sharp" choice reappears as the reason diffusion samples rather than averages (L11).

The fix is to stop averaging. Zhang and colleagues recast colorization as classification: quantize the $(a,b)$ plane into a few hundred colour bins and have the network predict a probability over bins for each pixel, then either pick the most probable bin or take the expectation after sharpening the distribution with a low "temperature" so that it commits to a peak. Because real photographs are dominated by dull backgrounds, they also re-weight the rare, saturated colours up during training, so the network is not lulled into always betting on grey. The result is the saturated, decisive colouring of the gallery above. The deeper point generalises: when the thing you are predicting is multimodal, do not regress to its mean; model the distribution and commit to a mode. That is exactly why the generative chapter's diffusion models sample instead of average (L11, Generative AI and diffusion): the muddy-brown colorization is the same failure as the blurry mean image, and "predict a distribution, then draw from it" is the same cure.
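The decoding and re-weighting steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the bins here are a full 23×23 grid with spacing 10 (the paper keeps only in-gamut bins), the `mix` value and the smoothed inverse-frequency form of the weights are assumptions, and the names `annealed_mean` and `rebalancing_weights` are hypothetical.

```python
import numpy as np

# Assumed quantization: a regular grid over the (a, b) plane, spacing 10.
GRID = np.arange(-110, 111, 10, dtype=float)
BIN_CENTRES = np.array([(a, b) for a in GRID for b in GRID])   # (Q, 2), Q = 529

def nearest_bin(ab):
    """Index of the bin whose centre is closest to a chrominance pair."""
    return int(np.argmin(np.sum((BIN_CENTRES - np.asarray(ab)) ** 2, axis=1)))

def annealed_mean(probs, temperature):
    """Sharpen a per-pixel bin distribution, then take its expectation.

    temperature = 1 gives the plain mean; temperature -> 0 approaches the mode.
    """
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()                 # numerical stability
    sharpened = np.exp(logits)
    sharpened /= sharpened.sum()
    return sharpened @ BIN_CENTRES

def rebalancing_weights(bin_freq, mix=0.5):
    """Up-weight rare bins: inverse of the frequency mixed with a uniform floor."""
    bin_freq = np.asarray(bin_freq, dtype=float)
    w = 1.0 / ((1 - mix) * bin_freq + mix / len(bin_freq))
    return w / np.sum(bin_freq * w)        # normalise so the expected weight is 1

# A pixel whose distribution has two rival peaks.
pixel_probs = np.zeros(len(BIN_CENTRES))
pixel_probs[nearest_bin((60, 40))] = 0.55
pixel_probs[nearest_bin((-60, -40))] = 0.45

washed_out = annealed_mean(pixel_probs, temperature=1.0)    # near neutral grey
committed = annealed_mean(pixel_probs, temperature=0.05)    # near the stronger peak
```

Lowering the temperature interpolates between the two behaviours discussed above: at 1 the expectation lands in the valley between the peaks, and as it falls the result moves toward the dominant mode while still varying smoothly across the image.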

8.8.4 Closing the loop: learned priors with a human's hints

The spectrum is not really a line with the human at one end and the machine at the other: its two halves combine, and the combination is what actually shipped. Real-time user-guided colorization (Zhang, Zhu, Isola et al., 2017) feeds the network the greyscale image and a handful of user colour points, and is trained to honour the hints while filling everything else from its learned prior. It is the scribble idea of Levin with the hand-built affinity propagation swapped out for a trained model that already knows how the world's colours usually spread: drop one red dot on a balloon and the whole balloon turns a plausible red at once, because the network supplies both the propagation and the knowledge of typical colours that the old optimization had to be given by hand. This resolves the chapter's central tension in one stroke (the human chooses the colours where ambiguity matters, the learned model handles how they spread and everything left unspecified), and it runs at interactive speed. It is also the most-used member of this whole chapter: the same research lineage is the engine behind Photoshop's Neural Filters Colorize tool, the academic colorization prior shipping in a product millions reach for to colour their grandparents' photographs.
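One plausible way to hand hints to such a network is as extra input channels, sketched below. The layout is an assumption for illustration (greyscale, two sparse chrominance channels, and a binary mask marking where a hint exists), as is the idea of simulating hints during training by revealing the true colour at a few random pixels. The function names are hypothetical and no network is defined here.

```python
import numpy as np

def build_hint_input(luma, hints):
    """Stack greyscale, sparse colour hints, and a hint mask into one input array.

    luma:  (H, W) greyscale image.
    hints: dict {(row, col): (a, b)} of user-chosen chrominance points.
    Returns an (H, W, 4) array: [luma, hint_a, hint_b, mask].
    """
    h, w = luma.shape
    hint_ab = np.zeros((h, w, 2))
    mask = np.zeros((h, w, 1))
    for (row, col), ab in hints.items():
        hint_ab[row, col] = ab
        mask[row, col, 0] = 1.0      # distinguishes "no hint" from "hint of neutral grey"
    return np.concatenate([luma[..., None], hint_ab, mask], axis=-1)

def simulate_hints(true_ab, num_points, rng):
    """Training-time stand-in for a user: reveal the true colour at random pixels."""
    h, w, _ = true_ab.shape
    flat = rng.choice(h * w, size=num_points, replace=False)
    return {(int(i // w), int(i % w)): true_ab[i // w, i % w].copy() for i in flat}
```

The mask channel matters because a zero in the hint channels is ambiguous on its own: it could mean "the user said this pixel is neutral" or "the user said nothing". With zero hints the input reduces to the fully automatic case, so one model can cover both ends of the spectrum.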

Generative colorization takes the idea one step further. The learned colorizers so far all return a single colouring — including the conditional-GAN, image-to-image kind (the pix2pix recipe that also does labels→scene and edges→photo; Isola et al., 2017, Deep learning), which despite its adversarial training still maps a greyscale image to one fixed output, so it sits squarely in the learned-automatic camp. A truly generative prior instead lets you draw several different plausible colourings of the same greyscale photo, or steer them with a text prompt ("a red car"): diffusion and autoregressive models make the prior not just learned but sampleable (L11). Colorization stops pretending there is one right answer and offers a distribution of them.

8.8.5 Plausible is not correct

Which forces the chapter's honest caveat, and it is a sharp one. A colourised historical photograph is a confident guess, not a recovery. The original colours were never measured; they are gone. The network supplies the statistically most likely colours for a scene of that shape, which is often convincing and occasionally completely wrong — a dress that was green rendered a plausible blue, a uniform's true regimental colour invented. Colorization fabricates information that looks like measurement, which is why it sits uneasily next to provenance and authenticity (Image Forensics and Authentication): the output is indistinguishable, to the eye, from a real colour photograph, yet none of its colour is evidence of anything. It is worth keeping distinct from false colour — the deliberate mapping of non-visible data (infrared, depth, an electron micrograph's intensities) to hues, which makes no claim to be the scene's true colour and is honest about it.

The same fact reshapes how you evaluate colorization. Measuring colour error against the original photograph — a PSNR to ground truth — is misleading, because the original is just one of the plausible colourings, and a different-but-equally-plausible result is scored as a failure. The honest metric is a colorization Turing test: show people the result and ask whether they believe it is a real colour photograph. Zhang's automatic colourings fooled human judges roughly a third of the time — a far more meaningful number than any pixel-wise error, and a reminder that for every generative recovery in this part, "looks real" and "matches the held-out truth" are different questions, and only the first is the one the prior was ever trying to answer.
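A small numerical case shows how the pixel-wise score misleads. The values are illustrative assumptions: a flat patch whose true chrominance is one saturated mode, scored against a vivid colouring that picked the other equally plausible mode and against a washed-out neutral grey. The peak value of 220 used in the PSNR is likewise an assumed chrominance range.

```python
import numpy as np

def psnr(pred, truth, peak=220.0):
    """Peak signal-to-noise ratio in dB; `peak` is an assumed chrominance range."""
    mse = np.mean((pred - truth) ** 2)
    return float(10 * np.log10(peak ** 2 / mse))

shape = (8, 8, 2)
truth = np.broadcast_to(np.array([60.0, 40.0]), shape)             # what the scene really was
vivid_other = np.broadcast_to(np.array([-60.0, -40.0]), shape)     # plausible, but the other mode
muddy_grey = np.zeros(shape)                                       # the regression-style average

psnr_vivid = psnr(vivid_other, truth)
psnr_grey = psnr(muddy_grey, truth)
```

Here the grey patch outscores the vivid one by about 6 dB, even though a viewer would accept the vivid patch as a real photograph and reject the grey one. The metric rewards hedging, which is the opposite of what a convincing colouring requires.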

8.8.6 Where it sits

Colorization is the limit case of the whole single-image part: the recovery in which the data supplies the skeleton and the prior supplies everything you actually came for. Read across the spectrum and you read the book's three reusable tools pointed at one task — solve an affinity-weighted optimization with a human's hints (Edge-preserving optimization — colorization), transport colour from an example (Histograms, Inpainting, texture synthesis), learn the prior from data (Deep learning), and finally sample it (Generative AI and diffusion). The same affinity matrix returns as matting (Compositing, segmentation and matting); the same multimodality lesson returns as why diffusion samples; the same caveat — plausible is not true — returns wherever a learned prior fills in what the measurement left out. Colour is the most visible thing a prior can invent, which is exactly what makes colorization the cleanest place to watch one work.