
4.4 Deep learning

The previous chapter, Machine learning, made one move and named it: take a single-image task (denoise, demosaic, upsample, colorize, retouch, estimate depth) and replace the hand-designed operator with a function $f_\theta$ fit to data (the L8 big lesson; the data-fit-plus-prior skeleton is untouched, only the prior is now learned). It also insisted that the data is as much the method as the model. What it left deliberately abstract was $f_\theta$ itself. This chapter makes it concrete: in practice $f_\theta$ is almost always a deep neural network, and the concrete realizations of that framing are what we survey here.

A word on scope first, because it governs everything below. This is not a course in neural networks. How a convolutional neural network (CNN) slides filters over an image, why a U-Net's skip connections matter, what a transformer's attention computes, how the parameters get fit — all of that is the refresher's job (Refreshers#Machine learning and deep learning), and we lean on it without restating it. Our question is narrower and more useful: granted that you can fit an image-to-image map, what map should you fit, and what changes when you do. We organize the answer by level — low-level operators that map pixels to pixels, then mid- and high-level predictors that read structure out of an image, then generative models for when the answer is not unique, and finally the learned metrics that decide whether any of it worked.

4.4.1 Low-level learned operators (pixel-to-pixel)

The bread-and-butter case is an image-to-image network — typically a U-Net or encoder–decoder — mapping a degraded or partial image to a restored one. Each task below is a task an earlier chapter solved by hand, now learned from examples instead (Figure 4.4.1).

fig-learned-task-zoo
Figure 4.4.1. A small zoo of low-level learned operators, each shown as an input→output pair on one photo: noisy→clean (denoising), Bayer mosaic→full RGB (demosaicking), low-resolution→high-resolution (super-resolution), and grayscale→color (colorization). Every panel is a task an earlier chapter solved by hand; here a single image-to-image network learns the map from data.

Denoising. Train on pairs of (noisy, clean) images and you get a learned denoiser that beats a hand-tuned non-local means or bilateral filter — provided the noise it trained on matches the noise the sensor actually makes. That proviso is not a footnote; it is the entire data story of Machine learning, and it is why a denoiser tuned in the lab can fall apart on a real phone photo. This learned denoiser turns out to be far more than a denoiser, too: it is a reusable prior you can drop into any inverse-problem solver (Denoising as a universal prior — Plug-and-Play and regularization by denoising), and a denoiser run in a loop is precisely what a diffusion model is (Generative AI and diffusion). We keep meeting this object.

Demosaicking, and joint demosaick-plus-denoise. Gharbi and colleagues (2016) trained a single network to go from a Bayer mosaic straight to full RGB, doing the demosaicking and the denoising at once, on hard cases mined automatically from data. It outperformed the hand-built interpolators of Book 2 — interpolators tuned over decades. This is the headline instance of L8: learning the operator end to end beat a long-polished pipeline stage. It is also a quiet argument for jointly learning steps the classical pipeline kept apart, since the net is free to trade off demosaicking against denoising however the data rewards, rather than committing to one before the other the way a fixed sequence must.

Super-resolution. Pose low-resolution → high-resolution as a learned map — super-resolution (SR). The lineage runs from Freeman's pre-deep, example-based work (Freeman et al. 2002), which stitched in high-frequency detail copied from a dictionary of patches, through to modern networks: SwinIR (Liang et al. 2021), a transformer restoration backbone, and Real-ESRGAN (Wang et al. 2021), an enhanced super-resolution GAN (ESRGAN) trained against a deliberately realistic model of how real photos degrade so that it survives in the wild. The deep treatment, with priors, lives in Super-resolution and image priors; here it is one entry in the zoo and a preview of L10 — when the measurement throws information away, the prior is what puts it back. Note the seam this opens: a single "correct" high-resolution image rarely exists, which is exactly why the strongest super-resolvers reach for the generative machinery later in this chapter.

Colorization — classical and learned, side by side. This task makes the contrast vivid, because two paradigms sit right next to each other (Figure 4.4.2). Levin, Lischinski and Weiss (2004) colorize from a few user scribbles, propagated across the image by optimization under one simple rule: neighboring pixels with similar intensity should get similar color. That rule is an affinity prior, the same edge-aware idea as the bilateral filter (big lesson L4), and it is entirely hand-designed; it is also, tellingly, the very matrix-free least-squares solve of Linear Inverse Problems and Regression, run on color instead of sharpness. Zhang, Isola and Efros (2016) instead learn fully automatic colorization from a huge corpus of color photos stripped to gray and back, posed as classification over quantized colors to keep the output from sliding toward desaturated browns. One propagates a human's chosen colors by a human's chosen rule; the other invents plausible colors learned from millions of images. Same task, opposite philosophies.

fig-colorization-classical-vs-learned
Figure 4.4.2. Colorization, classical versus learned. Left, Levin et al. (2004): a few user color scribbles are propagated across the gray image by edge-aware optimization — a hand-designed affinity prior. Right, Zhang et al. (2016): a network colorizes the same gray image fully automatically, the colors learned from a large corpus. Hand-designed propagation versus learned-from-data plausibility, on one image.
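The propagation rule in the classical method (neighbors with similar intensity should get similar color) can be made concrete on a single scanline. The sketch below is a simplified 1-D illustration, not the authors' implementation: the function name `colorize_scanline`, the Gaussian affinity, and the width `sigma` are assumptions chosen for the example. Each unscribbled pixel is constrained to equal the affinity-weighted average of its two neighbors, scribbled pixels are held fixed, and the resulting linear system is solved directly.

```python
import numpy as np

def colorize_scanline(gray, scribbles, sigma=0.05):
    """Propagate scribbled chroma along a 1-D scanline (illustrative sketch).

    gray      : (n,) intensities in [0, 1]
    scribbles : dict {pixel index: chroma value} chosen by the user
    Unscribbled pixels must equal the affinity-weighted mean of their
    neighbors; the affinity is large when intensities are similar.
    """
    gray = np.asarray(gray, dtype=float)
    n = len(gray)
    A = np.eye(n)
    b = np.zeros(n)
    for r in range(n):
        if r in scribbles:
            b[r] = scribbles[r]          # hard constraint: u_r = scribbled value
            continue
        nbrs = [s for s in (r - 1, r + 1) if 0 <= s < n]
        w = np.exp(-(gray[r] - gray[nbrs]) ** 2 / (2 * sigma ** 2))
        w = w / w.sum()                  # affinities sum to one per pixel
        A[r, nbrs] = -w                  # row encodes u_r - sum_s w_rs u_s = 0
    return np.linalg.solve(A, b)

# A dark region next to a bright one, with one scribble in each.
gray = np.array([0.2] * 5 + [0.8] * 5)
chroma = colorize_scanline(gray, {0: 1.0, 9: -1.0})
```

Because the affinity across the intensity step is almost zero, each scribble fills its own region and stops at the edge, which is the edge-aware behavior the paragraph describes. A real image needs a 2-D neighborhood and a sparse solver; the dense solve here is only for readability.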

Monocular depth. Estimate depth from a single RGB image. This is badly ill-posed — one photo is consistent with infinitely many 3-D scenes — so the network must lean hard on a learned scene prior about how the world is usually shaped. MiDaS (Ranftl et al. 2019) trained across many incompatible depth datasets at once with a scale- and shift-invariant loss, so that depth maps recorded in different units could all teach the same net; Depth Anything (Yang et al. 2024) pushed the idea to a massive pile of pseudo-labeled images and got robust, zero-shot depth on photos it had never seen (Figure 4.4.3). Two cautions: what these predict is relative depth, not metric distance, and geometric depth from stereo or dual-pixel sensors — measured rather than inferred — is a separate story for the multi-image and geometry parts.

[figure fig-depth-anything not built]
Figure 4.4.3. (Placeholder — figure not yet built.) Monocular depth from a single photo. One ordinary RGB image (left) is mapped by a learned network to a dense depth map (right), near surfaces bright, far ones dark. The problem is ill-posed — a single image fixes no true distances — so the result is driven by a strong learned prior about how scenes are usually arranged, and recovers relative, not metric, depth.
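The idea of a scale- and shift-invariant loss can be sketched in a few lines: before measuring error, fit the one scale and one shift that best align the prediction with the target, so that two depth maps differing only in units or offset count as identical. This is a minimal least-squares version under assumed names (`align_scale_shift`, `ssi_mse`); the published MiDaS loss differs in details such as the space it operates in and the use of robust variants.

```python
import numpy as np

def align_scale_shift(pred, target):
    """Least-squares scale s and shift t so that s * pred + t best matches target."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)
    return s, t

def ssi_mse(pred, target):
    """Mean squared error after the best scale-and-shift alignment."""
    s, t = align_scale_shift(pred, target)
    return float(np.mean((s * pred + t - target) ** 2))

rng = np.random.default_rng(0)
depth = rng.random((8, 8))             # hypothetical ground-truth depth map
same_up_to_units = 3.0 * depth + 5.0   # same geometry, different scale and offset
unrelated = rng.random((8, 8))         # a prediction with no relation to depth

loss_same = ssi_mse(same_up_to_units, depth)
loss_unrelated = ssi_mse(unrelated, depth)
```

The first loss is zero up to rounding while the second is not, which is why such a loss lets datasets recorded in incompatible units supervise one network, and also why the output is relative rather than metric depth.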

Learned exposure and retouch. Here the "degradation" is taste: learn a photographer's tone and color adjustment from before/after pairs. Bychkovsky and colleagues (2011) built the MIT-Adobe FiveK dataset — the same scenes retouched by five expert editors — which is what made the adjustment learnable at all; the data, once again, is the enabling move. HDRnet (Gharbi et al. 2017) runs fast enough for a phone by predicting not output pixels but the coefficients of a bilateral-grid affine transform, an edge-aware local adjustment (the learned bilateral grid, cross-referenced in the edge-preserving chapter). Exposure (Hu et al. 2018) takes a different and instructive line: it learns a sequence of interpretable operations — curves, white balance, contrast — rather than raw pixels, so the result is an edit a photographer can read and undo. Learn the pixels, or learn the operations: both are on the table, and the choice trades raw quality against the editability and trust of a human-readable result.

4.4.2 Mid- and high-level learned predictors

Not every learned operator outputs an image. A second family predicts semantic structure from a single photo — where things are, what they are — and that structure becomes the input to retouching, autofocus, framing, and selection. These are enabling rather than central, so we keep them brief.

Sky and face detection are the classic "where is the sky, where are the faces" predictors that drive auto-enhance and metering: brighten the faces, deepen the sky, expose for the subject. The lineage runs from pre-deep detectors of the Viola–Jones era (Viola & Jones 2004 — the cascade of simple features that first made real-time face detection practical) to modern networks; the point for us is that a retouching pipeline must know what it is looking at before it can decide how to adjust it.

Face landmarks and pose predict facial keypoints and head orientation, which feed portrait retouching, relighting, and augmented reality — and, increasingly, forensics, since a face whose landmarks are geometrically inconsistent is a tell for a manipulated or synthetic image (cross-referenced to the ethics discussion in Human factors and the art of photography).

Saliency and gaze prediction estimate where people look in an image. Pre-deep models built saliency from hand-designed contrast features — the classic Itti, Koch and Niebur (1998) model combined center-surround contrasts of color, intensity, and orientation into one saliency map — while learned models train directly on eye-tracking data. The uses are practical: auto-cropping and reframing, retargeting, aesthetics scoring, and attention-aware compression that spends bits where the eye will actually land.

4.4.3 Generative models: image-to-image translation

Sometimes there is no single correct answer. Ask a network to colorize a gray photo, super-resolve a tiny one, or fill a hole, and many different outputs are equally plausible — the task is one-to-many. Regression handles this badly: minimizing squared error pulls the network toward the average of all plausible answers, and the average of many sharp images is a blurry one. (We pin down exactly why that average looks bad yet scores well when we reach metrics, below.) The cure is to stop averaging and start sampling: model the space of plausible answers and draw one sharp, plausible member. That is the shift from regression to generative modeling.
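The pull toward a blurry average is easy to reproduce numerically. The toy below assumes a hypothetical one-to-many task in which a sharp step edge is equally likely to sit at any of four positions; it compares the expected squared error of the per-pixel mean with that of one sharp candidate. The names (`edge`, `plausible`, `expected_mse`) are illustrative only.

```python
import numpy as np

n = 16

def edge(pos):
    """A sharp step edge: 0 left of pos, 1 from pos onward."""
    return (np.arange(n) >= pos).astype(float)

# Four equally plausible sharp answers for the same input.
plausible = np.stack([edge(p) for p in (5, 7, 9, 11)])

def expected_mse(guess):
    """Average squared error of one guess against all plausible answers."""
    return float(np.mean((plausible - guess) ** 2))

mean_guess = plausible.mean(axis=0)   # what squared-error regression converges to
sharp_guess = plausible[1]            # one sample from the plausible set

mse_of_mean = expected_mse(mean_guess)
mse_of_sharp = expected_mse(sharp_guess)
```

The mean guess is a soft ramp rather than an edge, yet it has the lower expected error. Sampling returns `sharp_guess` or one of its siblings instead: a higher expected error, but an output that actually looks like an edge.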

GANs. A generative adversarial network (GAN) pits two networks against each other: a generator proposes images while a discriminator learns to tell real from generated. Train them in opposition and the generator is pushed to produce output the discriminator cannot distinguish from real — which means sharp, plausible detail rather than a blurry mean. This adversarial training (Goodfellow et al. 2014) is the engine behind Real-ESRGAN's realism. We give it one paragraph of intuition; the full generative story, and why diffusion has largely displaced GANs, is Generative AI and diffusion.

Conditional translation, paired and unpaired. Pix2Pix (Isola et al. 2017) is the generic tool: a conditional GAN that learns to map one image domain to another from paired examples — edges→photo, label-map→street scene, day→night (Figure 4.4.4). When you have no aligned pairs, CycleGAN (Zhu et al. 2017) does the unpaired version, keeping the translation honest with a cycle-consistency constraint (translate there and back and you should land where you started): horses↔zebras, summer↔winter, photo↔painting, all without a single matched pair.

fig-gan-pix2pix
Figure 4.4.4. Paired image-to-image translation with a conditional GAN. An edge map (left) is the input; the network generates a plausible photo (right) consistent with those edges, having learned the edges→photo mapping from paired examples (Pix2Pix). The same generic recipe maps label-maps to street scenes, sketches to renderings, or day to night — "learn the map between two image domains" — and is the GAN ancestor of the diffusion-based editing in the next chapter.
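The cycle-consistency constraint used for unpaired translation can be written down directly: translate there and back, and penalize the distance from where you started, in both directions. The sketch below uses toy affine maps as stand-ins for the two generator networks; `G`, `F_good`, and `F_bad` are hypothetical, and only the loss itself reflects the idea in the text.

```python
import numpy as np

def cycle_consistency_loss(G, F, x, y):
    """L1 cycle loss: F(G(x)) should return x, and G(F(y)) should return y."""
    forward = np.mean(np.abs(F(G(x)) - x))
    backward = np.mean(np.abs(G(F(y)) - y))
    return float(forward + backward)

# Toy stand-ins for the two translators (domain X -> Y and Y -> X).
G = lambda x: 2.0 * x + 1.0
F_good = lambda y: (y - 1.0) / 2.0    # exact inverse of G
F_bad = lambda y: y                   # ignores what G did

rng = np.random.default_rng(0)
x = rng.random((4, 4))                # unpaired sample from domain X
y = rng.random((4, 4))                # unpaired sample from domain Y

loss_good = cycle_consistency_loss(G, F_good, x, y)
loss_bad = cycle_consistency_loss(G, F_bad, x, y)
```

Note that `x` and `y` are never compared with each other, which is what lets the constraint work without matched pairs. In training this term is added to the adversarial losses of the two generators rather than used alone.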

The pre-deep ancestor. "Translate one image into the style of another" predates deep nets. Image Analogies (Hertzmann et al. 2001) inferred a filter from a single example pair by matching patches: given A and its filtered version A′, apply the same transformation to a new B to get B′. It follows the same instinct as Pix2Pix ("A is to A′ as B is to B′"), but with no trained model, just patch lookup (this ties to the texture-synthesis and patch-matching material).
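The "A is to A′ as B is to B′" lookup can be sketched on 1-D signals. This is a deliberately stripped-down illustration, assuming a brute-force nearest-patch search and hypothetical helper names (`patches`, `analogy_1d`); the published method is multi-scale and also takes the already-synthesized output into account, which is omitted here.

```python
import numpy as np

def patches(signal, radius):
    """All windows of length 2 * radius + 1, one centered on each sample (edge-padded)."""
    padded = np.pad(signal, radius, mode="edge")
    return np.stack([padded[i:i + 2 * radius + 1] for i in range(len(signal))])

def analogy_1d(A, A_prime, B, radius=1):
    """For each sample of B, find the most similar patch in A and copy A' from there."""
    pa, pb = patches(A, radius), patches(B, radius)
    # Squared distance between every patch of B and every patch of A.
    dist = ((pb[:, None, :] - pa[None, :, :]) ** 2).sum(axis=2)
    best = dist.argmin(axis=1)
    return A_prime[best]

A = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
A_prime = 1.0 - A                      # the example "filter": invert the signal
B = np.array([1, 1, 0, 0, 1, 1, 0, 0], dtype=float)

B_prime = analogy_1d(A, A_prime, B)
```

No parameters are fit anywhere: the "filter" exists only as the example pair, and applying it is a search. Every patch of `B` happens to occur in `A` in this example, so the lookup reproduces the inversion exactly; with unseen patches the result is only as good as the nearest match.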

Forward reference. Diffusion models now dominate generative image-to-image work — text-to-image, instruction-based editing, the rest. We place the lineage here (GAN → diffusion) and develop it fully in Generative AI and diffusion, where it reconnects to the denoiser-as-prior throughline of this part.

4.4.4 Learned perceptual metrics and losses

There is a question we have been deferring: how do we measure whether a restored or generated image is any good? The answer matters more than it sounds, because the metric you optimize quietly defines what "good" means — and the classical metrics break exactly where learned methods are supposed to shine.

The classical metrics, all introduced in Book 2, are per-pixel mean squared error (MSE) and its decibel form, peak signal-to-noise ratio (PSNR); the structural similarity index (SSIM); and the perceptual visual-difference predictor (VDP). They compare low-level statistics, and they share a fatal weakness in this setting: they are blind to texture and to high-level similarity (Figure 4.4.5). Shift a patch of grass by two pixels and it looks unchanged to a human viewer but scores terribly under MSE, because almost no pixel still lines up with its original counterpart. Worse, the blurry mean that regression produces scores well (it is, after all, close to every plausible answer on average) even though it looks obviously wrong. So these metrics reward precisely what we do not want in generative restoration: smooth, safe, blurry, pixel-aligned.

fig-perceptual-metric
Figure 4.4.5. Why per-pixel error is the wrong ruler. Two distortions of one image are constructed to have the same mean squared error from the original: one is a tiny spatial shift of a texture (perceptually nearly identical), the other is additive noise (perceptually much worse). MSE and PSNR cannot tell them apart, because they compare pixels at fixed locations. This is the motivation for a learned perceptual metric like the learned perceptual image patch similarity (LPIPS), which compares deep features instead.
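The construction in the figure can be reproduced in a few lines. The sketch below assumes a random array as a stand-in for a fine texture, a wrap-around two-pixel shift, and Gaussian noise rescaled so that its MSE matches the shift's exactly; `mse` and `psnr` are written out by hand with an assumed peak value of 1.

```python
import numpy as np

def mse(a, b):
    """Per-pixel mean squared error."""
    return float(np.mean((a - b) ** 2))

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in decibels, for images with the given peak value."""
    return float(10.0 * np.log10(peak ** 2 / mse(a, b)))

rng = np.random.default_rng(0)
texture = rng.random((64, 64))          # stand-in for a fine texture in [0, 1]
shifted = np.roll(texture, 2, axis=1)   # two-pixel shift (wraps at the border)

noise = rng.standard_normal(texture.shape)
noise *= np.sqrt(mse(shifted, texture) / np.mean(noise ** 2))  # match the shift's MSE
noisy = texture + noise                 # left unclipped so the match stays exact

mse_shift, mse_noise = mse(shifted, texture), mse(noisy, texture)
psnr_shift, psnr_noise = psnr(shifted, texture), psnr(noisy, texture)
```

Both distortions receive the same MSE and the same PSNR by construction, although one is the same texture moved slightly and the other is the texture buried in noise. Any metric that compares pixels at fixed locations inherits this blindness.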

LPIPS — deep feature distance. Instead of comparing pixels, compare the two images' deep network features — the learned perceptual image patch similarity, LPIPS (Zhang et al. 2018). Run both images through a pretrained network, pull the feature maps $\phi_l$ at several layers $l$, and measure the distance between them. The plain feature, or perceptual, loss is

$$\ell_{\text{feat}} = \sum_l \lVert \phi_l(\hat I) - \phi_l(I)\rVert^2,$$

and LPIPS proper goes one step further: it unit-normalizes each spatial position's feature vector across channels (written $\tilde\phi_l$ below) and applies learned per-channel weights $w_l$, fit directly to human similarity judgments,

$$\ell_{\text{LPIPS}} = \sum_l \frac{1}{H_l W_l}\sum_{h,w}\bigl\lVert w_l \odot \bigl(\tilde\phi_l(\hat I)_{h,w} - \tilde\phi_l(I)_{h,w}\bigr)\bigr\rVert^2,$$

so each feature channel contributes in proportion to how much it actually matters perceptually. In words: rather than ask whether two images agree pixel by pixel, we run each through a network and ask whether they activate the same features — edges, textures, parts — at each layer, then sum the disagreement. Because deep features respond to texture and structure rather than exact pixel position, this tracks perceived similarity far better than MSE. It is also usable as a training loss — the same feature/perceptual loss Gatys, Ecker and Bethge (2016) used for neural style transfer — so you can train a network to look right rather than to match pixels.
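The LPIPS formula maps almost line for line onto array code. The sketch below is not the reference implementation: random arrays stand in for the feature maps of a pretrained network and for the learned weights, and the names `unit_normalize` and `lpips_like` are assumptions. It shows only the arithmetic: normalize across channels at each position, weight per channel, square, average over space, sum over layers.

```python
import numpy as np

def unit_normalize(feat, eps=1e-10):
    """Scale each spatial position's feature vector (over channels) to unit length.

    feat has shape (C, H, W).
    """
    norm = np.sqrt(np.sum(feat ** 2, axis=0, keepdims=True))
    return feat / (norm + eps)

def lpips_like(feats_hat, feats_ref, weights):
    """Sum over layers of the spatially averaged, channel-weighted squared difference."""
    total = 0.0
    for f_hat, f_ref, w in zip(feats_hat, feats_ref, weights):
        diff = unit_normalize(f_hat) - unit_normalize(f_ref)   # (C, H, W)
        weighted = w[:, None, None] * diff                     # per-channel weights
        total += np.mean(np.sum(weighted ** 2, axis=0))        # mean over H and W
    return float(total)

# Stand-ins for two layers of features from a pretrained network (hypothetical shapes).
rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8)]
feats_a = [rng.standard_normal(s) for s in shapes]
feats_b = [rng.standard_normal(s) for s in shapes]
weights = [rng.random(s[0]) for s in shapes]   # learned in the real metric, random here

d_same = lpips_like(feats_a, feats_a, weights)
d_diff = lpips_like(feats_a, feats_b, weights)
```

One consequence of the normalization step is visible immediately: multiplying a layer's features by a positive constant leaves the distance unchanged, so the metric responds to the pattern of activations rather than their overall magnitude.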

DreamSim — holistic similarity. A more recent metric (Fu et al. 2023) tuned to mid-level, holistic similarity: layout, pose, semantic content — the things LPIPS still misses because it stays fairly local. Keep it as "the step beyond LPIPS": the trend is toward metrics that judge images the way a person glancing at them would, comparing what the picture is rather than how its pixels line up.

From metric to loss: how perceptual losses buy sharpness. This matters in practice because the metric you train against decides what the network produces. Train a restoration or super-resolution network to minimize per-pixel MSE (or L1) and it learns to hedge: when several sharp outputs are all plausible, the single image that minimizes expected squared error is their blurry mean (L1 pulls toward the per-pixel median instead, which is still one hedged estimate rather than a sample), so networks trained on pixel error alone tend to wash out fine texture (Figure 4.4.6). Swap in, or add, a perceptual (feature) loss: push prediction and target through a fixed pretrained network and penalize the distance between their deep features (the plain feature loss, or LPIPS). This rewards the network for reproducing texture and structure rather than exact pixels, so it stops fearing detail it cannot place pixel-perfectly. Add an adversarial (GAN) term, a discriminator trained to tell restored from real, and the output is pushed onto the manifold of real images, crisper still. The honest caveat is the perception–distortion trade-off (L10): the sharper and more "real" you push, the more the model invents detail that may not be true. That sharpness is bought partly with hallucination, which is why a fidelity anchor stays in the mix.

fig-perceptual-loss-sharpness
Figure 4.4.6. The loss you train on decides the sharpness. The same degraded input restored by the same network under different losses: L2/MSE regresses to the blurry mean (safe, soft, detail-free); adding a perceptual (feature/LPIPS) loss brings back texture; an adversarial term sharpens it onto the manifold of real images. Per-pixel error actually favors the blurry result, which is why fidelity-only training looks washed out, and why the extra sharpness is partly invented (the perception–distortion trade-off, L10).

A go-to loss mix. When you reach for a perceptual loss, the robust default is a weighted sum, never a single term:

$$\mathcal{L} = \lambda_1\,\lVert \hat I - I\rVert_1 \;+\; \lambda_p\,\ell_{\text{LPIPS}}(\hat I, I) \;+\; \lambda_a\,\mathcal{L}_{\text{adv}}(\hat I).$$

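The roles, with a sane starting point: the $L_1$ term is the fidelity anchor that keeps the output tied to the target; the LPIPS term rewards texture and structure rather than exact pixels; the adversarial term pushes the output onto the manifold of real images. Start with $\lambda_a$ small relative to $\lambda_1$ and $\lambda_p$, and raise it only while the added detail stays believable.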

Tune by eye on held-out images, track LPIPS rather than PSNR, and treat "more adversarial" as the knob that trades truth for crispness.
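As a sketch of how the weighted sum is assembled, the function below combines the three terms with explicit weights. Several things here are assumptions rather than anything the text prescribes: the adversarial term is written as a non-saturating generator loss on the discriminator's "real" probability, the $L_1$ term is averaged rather than summed over pixels, the perceptual term and the discriminator are passed in as callables, and the default weights are placeholders, not recommended values.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error (the L1 norm divided by the pixel count)."""
    return float(np.mean(np.abs(pred - target)))

def generator_adv_loss(d_prob, eps=1e-12):
    """Non-saturating generator loss: small when the discriminator says 'real'."""
    return float(-np.mean(np.log(d_prob + eps)))

def total_loss(pred, target, perceptual_fn, discriminator_fn,
               lam_1=1.0, lam_p=1.0, lam_a=0.01):
    """L = lam_1 * L1 + lam_p * perceptual + lam_a * adversarial.

    Default weights are illustrative placeholders only.
    """
    return (lam_1 * l1_loss(pred, target)
            + lam_p * perceptual_fn(pred, target)
            + lam_a * generator_adv_loss(discriminator_fn(pred)))

# Hypothetical stand-ins so the sketch runs without a trained network.
def toy_perceptual(a, b):
    """Compare horizontal gradients instead of deep features."""
    return float(np.mean((np.diff(a, axis=1) - np.diff(b, axis=1)) ** 2))

def toy_discriminator(img):
    """A discriminator that is maximally unsure about every image."""
    return np.array([0.5])

rng = np.random.default_rng(0)
target = rng.random((8, 8))
pred = target + 0.1 * rng.standard_normal((8, 8))

loss = total_loss(pred, target, toy_perceptual, toy_discriminator)
```

Keeping each term a separate function makes it easy to log the three components individually during training, which helps when deciding how far to turn the adversarial knob.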

The one-line takeaway closes the chapter's loop and echoes its opening L8 callback: the metrics are learned now too. We began by learning the operator, and we end by learning the very ruler we measure it with. Convenient — but it deserves a wary glance, because whichever metric you optimize silently decides what counts as a good image, and therefore what the model will hallucinate toward when the data runs out.