3.10 Image metrics
Almost everything in this book eventually compares two images. You denoise a photo and want to know whether the result is closer to the clean original. You align two frames of a burst and need to score how well they overlap. You compress a file and have to decide whether quality 80 is acceptable, or whether the artifacts have become visible. You train a neural network and need a loss — a single number the optimizer can push downhill. In every case the underlying question is the same: given two images, how different are they? This chapter is about turning that question into a number — and about the uncomfortable fact that the obvious number is frequently the wrong one.
We will build up in order of how much human vision each metric bakes in. The simplest, the mean squared error (MSE), knows nothing about seeing and compares pixels directly. The next, structural similarity (SSIM), knows a little about how the eye reads structure. A full visible differences predictor (VDP) carries an entire perceptual model. And the newest, learned metrics, soak up perception implicitly from data. None is universally "best" — which is the honest theme we begin with and never abandon.
3.10.1 Full-reference vs. no-reference
The first fork is whether you have something to compare against.
A full-reference metric assumes you hold a known-good reference image and want to score a candidate against it. This is the natural setting for almost everything we do: denoising (the clean original is the reference), compression (the uncompressed source), resampling (the higher-resolution original). You have both pictures; you want one number for how far the candidate has drifted. Most of this chapter lives here.
A no-reference (or blind) metric is harder: you are handed a single image and asked to score its quality with nothing to compare to. There is no ground truth, so the metric must lean on a built-in model of what natural, undegraded photographs look like — and flag deviations (blockiness, blur, noise) as defects. This is what a phone uses to decide a shot is too blurry to keep, or what a streaming service uses to monitor quality at scale. It is genuinely useful, but it rests on a learned or hand-built prior rather than a direct comparison, so we mostly set it aside and concentrate on the full-reference case where the math is cleaner and the intuition is sharper.
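To make the no-reference idea concrete, here is a minimal sketch of one hand-built heuristic: the variance of a discrete Laplacian as a sharpness score. It is only an illustration of scoring a single image against a prior ("sharp photos have fine detail"), not the method any particular phone or streaming service uses. The function names `laplacian_variance` and `box_blur3` are made up for this example, and any keep-or-discard threshold would have to be tuned on real data.

```python
import numpy as np

def laplacian_variance(img):
    """Toy no-reference sharpness score: variance of a 4-neighbor Laplacian.

    `img` is a 2-D grayscale array. Higher means more fine detail;
    a low value suggests blur.
    """
    img = np.asarray(img, dtype=np.float64)
    # Discrete Laplacian on the interior pixels (the border is skipped).
    lap = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
           - 4.0 * img[1:-1, 1:-1])
    return float(lap.var())

def box_blur3(img):
    """3x3 box blur on the interior, used here only to fake a blurry shot."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    acc = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            acc += img[dy:dy + h - 2, dx:dx + w - 2]
    return acc / 9.0

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))     # stand-in for a detailed photo
blurry = box_blur3(sharp)        # same content with fine detail removed

score_sharp = laplacian_variance(sharp)
score_blurry = laplacian_variance(blurry)   # noticeably lower than score_sharp
```

Notice that nothing here compares two images: the score depends on one image and on the built-in assumption that detail is good, which is exactly why such metrics can be fooled by content that is legitimately smooth.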
3.10.2 Why not just L2?
If you have two images, the most obvious distance is to subtract them, square the differences, and average:
$$\mathrm{MSE}(I, J) = \frac{1}{N} \sum_{i=1}^{N} (I_i - J_i)^2$$
where $I$ and $J$ are the reference and candidate, the sum runs over all pixels (and channels), and $N$ is how many values you summed. This is the mean squared error: the squared L2 distance between the two images viewed as long vectors of numbers, divided by the number of entries. It is everything an engineer wants: trivial to compute, cheap, and, crucially for machine learning, smoothly differentiable, so you can backpropagate through it.
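The definition translates almost word for word into NumPy. The sketch below assumes both images are arrays of the same shape; the function name `mse` and the tiny 2x2 example are illustrative only. The one practical trap it guards against is integer overflow: subtracting two `uint8` arrays directly wraps around instead of going negative.

```python
import numpy as np

def mse(ref, cand):
    """Mean squared error over all pixels and channels."""
    # Cast first: subtracting uint8 arrays directly would wrap around.
    ref = np.asarray(ref, dtype=np.float64)
    cand = np.asarray(cand, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError("images must have the same shape")
    return float(np.mean((ref - cand) ** 2))

ref = np.array([[10, 200], [30, 40]], dtype=np.uint8)
cand = np.array([[12, 198], [30, 44]], dtype=np.uint8)
error = mse(ref, cand)   # (4 + 4 + 0 + 16) / 4 = 6.0
```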
It also routinely disagrees with your eyes. The problem is that MSE treats an image as a bag of numbers with no notion of geometry, edges, texture, or faces. Consider which changes it punishes heavily and which it lets off lightly:
- A one-pixel shift. Translate the whole image by a single pixel and it looks identical to a human — but every edge now lands on different neighbors, so the squared differences are large and MSE spikes. The metric screams about a change nobody can see.
- A faint global gamma or brightness tweak. A small, smooth change in tone is barely noticeable, yet because it touches every pixel a little, it can rack up the same total squared error as a localized, glaring artifact.
- Localized structured damage. Blocky compression artifacts on a face, or a smeared edge, can be perceptually catastrophic while contributing only a modest sum of squares — the same MSE as a change you would forgive instantly.
In other words, MSE has no idea where the error is or what structure it destroys. Two edits with identical MSE can sit at opposite ends of the perceptual scale. This single fact is the engine of the whole chapter: it is why we keep adding perceptual machinery.
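A small synthetic experiment makes the mismatch tangible. The sketch below builds a striped test image and applies the three kinds of change from the list above; the image, the offset of 0.02, and the 8x8 block are arbitrary choices made for this illustration, picked so the arithmetic comes out cleanly.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))

# 64x64 test image: vertical stripes, 4 pixels wide, values 0 and 1.
cols = np.arange(64)
ref = np.tile(((cols // 4) % 2).astype(float), (64, 1))

# (a) One-pixel horizontal shift (wrapping around at the border).
shifted = np.roll(ref, 1, axis=1)

# (b) Faint global brightness offset of 0.02 on every pixel (not clipped).
offset = ref + 0.02

# (c) Localized damage: one 8x8 block (1/64 of the pixels) off by 0.16.
damaged = ref.copy()
damaged[28:36, 28:36] += 0.16

mse_shift = mse(ref, shifted)    # 2 of every 8 columns flip -> 0.25
mse_offset = mse(ref, offset)    # 0.02**2 = 0.0004
mse_damage = mse(ref, damaged)   # 0.16**2 / 64 = 0.0004
```

The invisible shift scores hundreds of times worse than either of the other two, and the faint global offset ties exactly with a concentrated patch of damage eight times larger in amplitude. MSE cannot tell (b) from (c) because it only sees the total, never the location.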
Why square the differences instead of taking absolute values? Squaring makes the error smooth at zero (no kink in the derivative), which optimizers like, and it punishes large outliers disproportionately: one badly wrong pixel dominates. That can be a feature or a bug. The L1 distance (mean of absolute differences) is more robust: it shrugs off a few wild pixels and is the better choice when your errors are spiky (a handful of hot pixels, salt-and-pepper noise) rather than uniformly spread. Neither is perceptual; the choice between them is about how you want outliers to count.
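The difference in how the two norms count outliers shows up in a two-case toy example. The 100-pixel "image" and the error magnitudes below are invented for illustration, chosen so that both error patterns have the same MSE.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def mae(a, b):
    """L1 counterpart of MSE: mean of absolute differences."""
    return float(np.mean(np.abs(a - b)))

ref = np.zeros(100)

# Pattern A: every pixel is off by 0.1 (uniformly spread error).
spread = ref + 0.1
# Pattern B: a single hot pixel is off by 1.0, the rest are exact.
spiky = ref.copy()
spiky[0] = 1.0

mse_spread, mse_spiky = mse(ref, spread), mse(ref, spiky)   # 0.01 and 0.01
mae_spread, mae_spiky = mae(ref, spread), mae(ref, spiky)   # 0.1  and 0.01
```

Under L2 the single hot pixel weighs as much as a uniform error across the whole image; under L1 it counts for a tenth of that. Which behavior you want depends on whether the hot pixel is the thing you care about or the thing you want to ignore.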
3.10.3 PSNR
In practice people rarely quote MSE directly; they quote the peak signal-to-noise ratio (PSNR), a logarithmic re-expression measured in decibels:
$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\text{MAX}^2}{\mathrm{MSE}}\right)$$
Here $\text{MAX}$ is the largest possible pixel value — $1$ for floating-point images in $[0,1]$, or $255$ for 8-bit. The logarithm compresses a huge range of error magnitudes into a friendly scale, and the sign is flipped so that bigger is better: a perfect match gives infinite PSNR, while as the error grows the PSNR falls. As a rough rule of thumb, PSNR in the 30–40 dB range is considered decent for compression and denoising work, but the exact thresholds are domain-dependent.
The essential thing to understand is that PSNR is just MSE wearing nicer units. It is a strictly decreasing function of MSE, so it ranks any set of images in exactly the same order MSE would — and it inherits every one of MSE's blind spots. A one-pixel shift still tanks the PSNR; a perceptually awful localized artifact can still score the same PSNR as a harmless global tweak. PSNR is convenient and ubiquitous in papers, but it is not a perceptual metric.
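Because PSNR is only a change of units, the code is a thin wrapper around MSE. In this sketch the function name `psnr` and the parameter `max_value` are illustrative, and images are assumed to be floating point in $[0,1]$ unless you pass `max_value=255` for 8-bit data.

```python
import numpy as np

def psnr(ref, cand, max_value=1.0):
    """Peak signal-to-noise ratio in decibels."""
    ref = np.asarray(ref, dtype=np.float64)
    cand = np.asarray(cand, dtype=np.float64)
    err = np.mean((ref - cand) ** 2)
    if err == 0:
        return float("inf")          # identical images
    return float(10.0 * np.log10(max_value ** 2 / err))

ref = np.full((8, 8), 0.5)
psnr_small = psnr(ref, ref + 0.01)   # MSE 1e-4 -> about 40 dB
psnr_large = psnr(ref, ref + 0.1)    # MSE 1e-2 -> about 20 dB
psnr_same = psnr(ref, ref)           # inf
```

Every tenfold increase in the per-pixel error amplitude costs 20 dB, and the ordering of the candidates is exactly the ordering MSE would give, just reversed.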
If you are going to use MSE or PSNR anyway (and for all their faults, you often will), there is one nearly free improvement: compute the error in an approximately perceptually uniform color space instead of raw RGB. A difference of 0.05 in a deep shadow and the same 0.05 in a bright highlight are not equally visible (our response to light is roughly logarithmic), and the same is true across colors. Converting first to CIELAB, where equal distances correspond roughly to equal perceived differences, turns raw error into something closer to perceived error, which is exactly what the color-difference measure $\Delta E$ does (CIELAB is built and explained in Color technology). It fixes nothing structural, but measuring in Lab or Luv rather than in linear or gamma-encoded RGB is almost costless and almost always better. A sum of squares is only as meaningful as the space you measure it in.
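Here is a minimal sketch of the shadow-versus-highlight point. It assumes the input is linear-light RGB with sRGB primaries and a D65 white point, and it uses the simplest color difference (CIE76, plain Euclidean distance in CIELAB). Gamma-encoded sRGB values would need to be linearized first, which this sketch does not do. The helper names are made up for the example.

```python
import numpy as np

# Linear sRGB (D65) -> CIE XYZ matrix, and the D65 reference white.
RGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                       [0.2126729, 0.7151522, 0.0721750],
                       [0.0193339, 0.1191920, 0.9503041]])
WHITE_D65 = np.array([0.95047, 1.0, 1.08883])

def linear_rgb_to_lab(rgb):
    """Convert linear-light RGB in [0, 1], shape (..., 3), to CIELAB."""
    xyz = np.asarray(rgb, dtype=np.float64) @ RGB_TO_XYZ.T
    t = xyz / WHITE_D65
    delta = 6.0 / 29.0
    # Cube root above a small threshold, linear segment below it.
    f = np.where(t > delta ** 3, np.cbrt(t), t / (3 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def delta_e76(rgb1, rgb2):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return np.linalg.norm(linear_rgb_to_lab(rgb1) - linear_rgb_to_lab(rgb2), axis=-1)

def gray(v):
    return np.array([v, v, v])

# The same 0.05 step in linear light, once in shadow and once in highlight.
de_shadow = float(delta_e76(gray(0.05), gray(0.10)))
de_highlight = float(delta_e76(gray(0.90), gray(0.95)))
```

In raw linear RGB the two steps have identical squared error. In Lab the shadow step comes out at roughly 11 units and the highlight step at roughly 2, which matches the intuition that a small change is far more visible in the dark.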
3.10.4 SSIM
The fix for MSE's blindness is to stop comparing isolated pixels and start comparing local neighborhoods. The structural similarity (SSIM) index does exactly this: it slides a small window over both images and, at each location, compares three things the eye actually cares about:
- luminance: do the two windows have the same local brightness (their means)?
- contrast: do they have the same local contrast (their standard deviations)?
- structure: once brightness and contrast are normalized away, do the patterns correlate (their covariance)?
SSIM multiplies these three terms together. The luminance and contrast terms are ratios that sit between 0 and 1; the structure term is a correlation and can range from $-1$ to $1$. Small constants are added to each to keep the division stable when a window is nearly flat. The product is the local similarity: $1$ means the windows are identical, and lower values mean they have drifted apart. The key ingredient is the structure term: by dividing out the local mean and contrast first, SSIM asks whether the pattern survived, independent of an overall brightness or contrast shift, which is much closer to how a person judges whether two images "look the same." The upshot is that SSIM generally correlates better with perceived quality than PSNR, and the same PSNR can correspond to wildly different SSIM.
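For reference, the three terms can be written out in the commonly cited formulation. The symbols are introduced here and are not defined elsewhere in this section: $x$ and $y$ are the two windows, $\mu$ and $\sigma$ their local means and standard deviations, $\sigma_{xy}$ their covariance, and $C_1$, $C_2$, $C_3$ the small stabilizing constants.

$$l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}$$

With the usual choice $C_3 = C_2/2$, the contrast and structure terms merge and the product collapses to a single expression:

$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$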
One detail worth making concrete: SSIM is a map before it is a number. Because the window slides across the image, what SSIM produces first is a per-pixel structural-similarity image — bright wherever the two pictures agree, dark exactly where the structure broke. The single "SSIM" value people quote is just that map's average. The map is the more informative object, because it localizes the damage in a way a scalar never can.
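The map-then-average structure is easy to see in code. The following is a simplified sketch, not a reference implementation: it assumes a uniform 7x7 window (the widely used formulation weights the window with a Gaussian), population statistics, grayscale input, and the conventional constants $(0.01 L)^2$ and $(0.03 L)^2$ for data range $L$. The function name `ssim_map` and its parameters are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ssim_map(x, y, win=7, data_range=1.0):
    """Per-window SSIM for two 2-D grayscale images (valid windows only)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    # Views of shape (H - win + 1, W - win + 1, win, win).
    wx = sliding_window_view(x, (win, win))
    wy = sliding_window_view(y, (win, win))
    mu_x = wx.mean(axis=(-2, -1))
    mu_y = wy.mean(axis=(-2, -1))
    var_x = wx.var(axis=(-2, -1))
    var_y = wy.var(axis=(-2, -1))
    cov = (wx * wy).mean(axis=(-2, -1)) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

rng = np.random.default_rng(0)
ref = rng.random((32, 32))
damaged = ref.copy()
damaged[:16, :16] = 0.5            # flatten the top-left quadrant

smap = ssim_map(ref, damaged)      # 26x26 map of local similarity
score = float(smap.mean())         # the scalar people usually quote
```

In the undamaged corner the map stays at 1, while over the flattened quadrant it drops close to 0. The scalar `score` blends the two and tells you only that something, somewhere, went wrong.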
SSIM is not the last word. It has knobs (the window size, the small stabilizing constants), and it is usually computed on luminance, so it can miss color-only errors. A multi-scale variant, MS-SSIM, evaluates structural similarity across several resolutions, and it and many later refinements improve on the original. But SSIM captured something real and cheap, it is differentiable enough to serve as a training loss, and "report PSNR and SSIM" became standard practice precisely because the two together say more than either alone: PSNR for raw fidelity, SSIM for whether the structure survived.
3.10.5 VDP and HDR-VDP
SSIM bakes in a little perception, by hand. The next step bakes in a lot. A visible differences predictor (VDP) is built around an explicit, quantitative model of the human visual system — and instead of one summary score, it produces a map: for each pixel, the probability that a human would actually notice the difference there.
The model inside a VDP is assembled from the perception machinery we have already met. It applies the eye's contrast sensitivity function (CSF) — our sensitivity as a function of spatial frequency, from the perception chapter — so that errors at frequencies the eye barely resolves are discounted. It models visual masking: a difference hidden in busy, high-contrast texture is far less visible than the same difference on a smooth patch, because surrounding detail masks it. And it accounts for the fact that visibility is threshold-based — below a certain contrast, a difference is simply invisible, full stop. The result is not "how big is the error" but "how likely is a person to see it, pixel by pixel." That is a fundamentally more honest question for tasks like deciding whether a compression artifact will be noticed.
HDR-VDP extends this model to high dynamic range (HDR) imagery, where pixel values span many orders of magnitude in luminance — from deep shadow to direct light source — and the eye's sensitivity changes dramatically with adaptation level across that range. It rebuilds the perceptual model to be calibrated in physical luminance, so it can predict visible differences in scenes far brighter and darker than a standard 8-bit display can show. VDP-style metrics are heavier to compute than SSIM and require a calibrated model of viewing conditions, but when the question is genuinely "will anyone see this?" they are the principled answer.
3.10.6 Learned metrics
The most recent turn is to stop hand-designing the perceptual model and instead learn it from data. The key observation is that a deep neural network trained for ordinary vision tasks (recognizing objects, say) develops internal features that capture image structure, texture, and semantics in a way that lines up remarkably well with human perception — almost as a side effect. So instead of comparing raw pixels, you push both images through such a network and compare them in feature space.
This is the idea behind learned perceptual image patch similarity (LPIPS) and successors like DreamSim: the distance between two images is the distance between their deep-feature representations, and that distance correlates with human similarity judgments far better than MSE or even SSIM — it forgives a small shift the way a person does, while flagging the structured, semantic damage a person notices. The catch is that these metrics are heavier, depend on the particular trained network behind them, and inherit whatever biases that network learned. We only flag them here as a forward reference; the machinery — what those deep features are and how the networks are trained — waits for Deep learning (COMPUTATIONAL TOOLS).
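The shape of the computation can be sketched without any trained network. In the toy below, `toy_features` is a hypothetical stand-in that returns two "layers" of hand-made features; it is not LPIPS, DreamSim, or any published model, and a real learned metric would substitute activations from a pretrained network. Published metrics also differ in details this sketch omits, such as learned weighting of feature channels. What carries over is the recipe: extract features from both images, normalize them, and measure distance there instead of in pixel space.

```python
import numpy as np

def toy_features(img):
    """Hypothetical stand-in for a pretrained network's activations.

    Returns two "layers" of hand-made features for a 2-D grayscale image.
    """
    img = np.asarray(img, dtype=np.float64)
    gx = np.diff(img, axis=1, append=img[:, -1:])   # horizontal gradient
    gy = np.diff(img, axis=0, append=img[-1:, :])   # vertical gradient
    layer1 = np.stack([img, gx, gy], axis=-1)       # (H, W, 3)
    energy = gx ** 2 + gy ** 2
    h, w = energy.shape
    # 2x2 average pooling of gradient energy (odd trailing row/col dropped).
    pooled = (energy[:h - h % 2, :w - w % 2]
              .reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    layer2 = np.stack([pooled, np.sqrt(pooled)], axis=-1)
    return [layer1, layer2]

def feature_distance(a, b, extract=toy_features, eps=1e-10):
    """Sum over layers of the mean squared distance between
    channel-normalized feature vectors."""
    total = 0.0
    for fa, fb in zip(extract(a), extract(b)):
        fa = fa / (np.linalg.norm(fa, axis=-1, keepdims=True) + eps)
        fb = fb / (np.linalg.norm(fb, axis=-1, keepdims=True) + eps)
        total += float(np.mean(np.sum((fa - fb) ** 2, axis=-1)))
    return total

rng = np.random.default_rng(0)
img = rng.random((32, 32))
noisy = np.clip(img + 0.2 * rng.standard_normal(img.shape), 0.0, 1.0)

d_same = feature_distance(img, img)     # identical images -> 0.0
d_noisy = feature_distance(img, noisy)  # positive
```

Passing the extractor as an argument is a deliberate choice: the distance is only as perceptual as the features behind it, which is the dependence on the trained network that the text warns about.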
The same forward reference carries a second idea worth planting now: a differentiable metric is also a training loss. Because LPIPS (and the plainer feature/perceptual loss it builds on) is differentiable, you can optimize it: train a restoration or generation network to minimize perceptual distance rather than per-pixel MSE, which is exactly how learned methods escape the blurry-mean trap and produce sharp, textured results. We develop this, including perceptual losses and a practical go-to loss mix, in the "Learned perceptual metrics and losses" section of Deep learning.
3.10.7 The right metric depends on the task
There is no single best image metric, and choosing one is part of the engineering. The right answer depends entirely on what you are measuring for:
- If you need a differentiable training loss that is cheap and stable, MSE (or an L1 variant) is the workhorse, blind spots and all — and is often combined with SSIM or a learned term to claw back some perceptual sensitivity.
- If you are reporting fidelity in a paper for comparability with prior work, PSNR remains the lingua franca — but quote SSIM alongside it.
- If you care about perceived quality — will a viewer notice the artifact? — reach for SSIM, a VDP, or a learned metric, and prefer a map over a scalar when you want to know where the damage is.
- If you have no reference, you are in the harder no-reference regime and depend on a model of natural images.
The recurring lesson of this whole part — that how you encode and measure light should match the physics and the perception you care about — lands here too. A metric is a hypothesis about what matters; pick the one whose hypothesis matches your task, and never mistake a convenient number for a perceptual truth.