8.2 Super-resolution and image priors
Zoom all the way into a digital photo and, sooner or later, you hit the pixels: a license plate two blocks away is a smear of eight or nine grey squares, a distant face is a beige lozenge with two dark dots. Television detectives say "enhance" and the smear resolves into crisp characters. It is the most famous impossible request in imaging — and the interesting thing is not that it's impossible, but exactly which part is impossible and which part is not.
Here is the honest version. The grey squares are all the camera measured; the fine detail that distinguishes a 3 from an 8 was averaged away when the lens blurred the scene and the sensor sampled it onto a coarse grid. Nothing in those squares uniquely tells you which character was there — infinitely many sharp scenes would have produced the very same smear. So no algorithm can recover the answer from the data alone. And yet a good super-resolution method will often print something sharp and plausible anyway. Where did that come from? Not the measurement. It came from a prior — everything the method already knows about what plates, faces, and natural images look like. That is the whole chapter in one sentence: the prior is what makes recovery possible, and the only honest question is whether the prior is reconstructing detail that was genuinely measured somewhere, or inventing detail that merely looks right.
We will pose super-resolution as the same linear inverse problem we met in regression, see precisely why it is ill-posed, and install the prior as a regularizer. Then we follow that one idea outward: across single-image and burst capture (the reconstruction-vs-hallucination axis), to the Plug-and-Play insight that any denoiser is a prior you can plug into any solver, ending at the identity that makes diffusion just iterated denoising — closing the prior throughline and handing off to the generative chapters.
8.2.1 What problem super-resolution solves (and why it's ill-posed)
The goal is not "more pixels." Bicubic interpolation already gives you more pixels — it fits a smooth surface through the samples you have and reads off intermediate values — but it adds no new frequencies; the result is just a soft, enlarged version of the input. Super-resolution wants pixels that are meaningfully sharper: detail finer than the original sampling, frequencies that weren't in the input at all. That is a categorically harder request, and to see why, we need to write down where a low-resolution image comes from.
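The claim that interpolation enlarges without adding frequencies can be checked numerically. The sketch below is illustrative only: it swaps bicubic interpolation for ideal band-limited interpolation (zero-padding the spectrum), which makes the "no new frequencies" property exact rather than approximate. The function name `fourier_upsample` and the test signal are hypothetical, and the signal length is assumed odd to avoid special handling of the Nyquist bin.

```python
import numpy as np

def fourier_upsample(x, factor):
    """Band-limited upsampling of a real 1-D signal by zero-padding its spectrum.

    Assumes x.size is odd, so there is no Nyquist bin to split.
    """
    n = x.size
    spectrum = np.fft.rfft(x)
    padded = np.zeros(n * factor // 2 + 1, dtype=complex)
    padded[:spectrum.size] = spectrum          # copy the old band, leave the new band empty
    # irfft divides by the new length, so rescale to preserve amplitudes.
    return np.fft.irfft(padded, n=n * factor) * factor

rng = np.random.default_rng(0)
x = rng.standard_normal(15)                    # the "low-resolution" samples
up = fourier_upsample(x, factor=2)             # twice as many samples

# Energy above the original band limit: this is where new detail would have to live.
new_band = np.abs(np.fft.rfft(up))[x.size // 2 + 1:]
```

The enlarged signal passes through the original samples (`up[::2]` matches `x`), yet `new_band` is zero to rounding error: more pixels, no new detail.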
A low-resolution measurement $y$ is a high-resolution scene $x$ run through three stages: the optics and pixel aperture blur it by a point-spread function $k$; the sensor downsamples it by some factor $s$ (keeping one sample where the scene had $s$ along each axis); and the electronics add noise $n$. In symbols,
$$y = (k * x)\downarrow_s + n.$$
This is exactly the linear forward model $y = Ax + n$ of Linear Inverse Problems and Regression, with the operator $A = (\text{downsample by } s)\cdot(\text{blur by } k)$ — a convolution followed by a decimation. Super-resolution is the inverse problem: recover $x$ from $y$ (Figure 8.2.2).
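A minimal 1-D sketch of this forward operator may help make the three stages concrete. It assumes a circular (wrap-around) blur so the convolution can be done with an FFT, and Gaussian noise; the function name `forward_model` and all parameter values are illustrative, not taken from the text.

```python
import numpy as np

def forward_model(x, kernel, s, noise_sigma=0.0, rng=None):
    """Blur x with kernel (circularly), keep every s-th sample, optionally add noise."""
    n = x.size
    # Zero-pad the kernel to the signal length and convolve in the Fourier domain.
    k_padded = np.zeros(n)
    k_padded[:kernel.size] = kernel
    blurred = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k_padded)))
    # Decimation: one sample survives out of every s.
    y = blurred[::s]
    if noise_sigma > 0:
        rng = np.random.default_rng() if rng is None else rng
        y = y + noise_sigma * rng.standard_normal(y.shape)
    return y

x = np.arange(16, dtype=float)        # toy high-resolution "scene"
kernel = np.ones(4) / 4.0             # 4-tap box blur as a stand-in point-spread function
y = forward_model(x, kernel, s=4)     # 16 samples in, 4 samples out
```

Without the noise term the map is linear in `x`, which is what lets the problem be written as $y = Ax + n$ even though $A$ is never formed as an explicit matrix.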
Why can't we just invert $A$? Because downsampling is many-to-one, and irreversibly so. Blurring already attenuated the high frequencies; downsampling then discards everything above the new Nyquist limit — it folds or throws away exactly the detail that distinguishes the sharp scenes from one another. Add any high-frequency pattern that the blur-and-downsample would have annihilated, and $y$ does not change at all. So infinitely many high-resolution images $x$ pass through the same low-resolution $y$ (Figure 8.2.1). The operator $A$ has no usable inverse; the data under-determines the answer. This is not a numerical inconvenience to be fixed with more bits or a cleverer solver — it is a genuine loss of information, baked into the physics of sampling.
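The many-to-one claim is easy to see in a toy case. The sketch below assumes the simplest possible forward model, a box blur of width $s$ followed by decimation by $s$, which amounts to taking block means; the signals and the name `blur_and_downsample` are hypothetical.

```python
import numpy as np

def blur_and_downsample(x, s):
    """Box blur of width s followed by decimation by s, modelled here as block means."""
    return x.reshape(-1, s).mean(axis=1)

s = 2
x_a = np.array([1.0, 1.0, 4.0, 4.0, 2.0, 2.0, 3.0, 3.0])   # one candidate scene
ghost = np.array([1.0, -1.0] * 4)                          # finest alternating pattern
x_b = x_a + 2.5 * ghost                                    # a visibly different scene

y_a = blur_and_downsample(x_a, s)
y_b = blur_and_downsample(x_b, s)
```

Here `y_a` and `y_b` are identical even though `x_a` and `x_b` differ by 2.5 at every sample: the alternating pattern averages to zero inside each block, so it lies in the null space of the operator and the measurement cannot see it.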
The resolution is the one we have used for every inverse problem: recover with data-fit plus a regularizer,
$$\hat{x} = \arg\min_x \tfrac{1}{2}\lVert Ax - y\rVert^2 + \lambda\,\Phi(x).$$
The first term keeps $\hat{x}$ consistent with what was measured — push it through the forward model and you should get back (close to) $y$. The second term, the prior $\Phi(x)$ with weight $\lambda$, picks — among the infinitely many $x$ that fit the data equally well — the one that most looks like a real image. Read probabilistically (Refreshers → Bayes), this is the maximum a posteriori (MAP) estimate: the data term is the negative log-likelihood of the noise, and $\Phi(x) = -\log p(x)$ is the negative log-prior, so minimizing the sum maximizes $p(x \mid y) \propto p(y \mid x)\,p(x)$ — posterior $\propto$ likelihood $\times$ prior.
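For readers who want the intermediate steps, the MAP reading can be written out as follows. This is a sketch under one stated assumption that the text leaves implicit: the noise is i.i.d. Gaussian with variance $\sigma_n^2$ (the symbol $\sigma_n$ is introduced here for illustration).

```latex
\begin{aligned}
\hat{x}_{\text{MAP}}
  &= \arg\max_x \; p(x \mid y)
   = \arg\max_x \; p(y \mid x)\,p(x) \\
  &= \arg\min_x \; \bigl[-\log p(y \mid x) - \log p(x)\bigr] \\
  &= \arg\min_x \; \frac{1}{2\sigma_n^2}\lVert Ax - y\rVert^2 + \Phi(x)
     \qquad \text{(assuming } n \sim \mathcal{N}(0, \sigma_n^2 I)\text{)} \\
  &= \arg\min_x \; \tfrac{1}{2}\lVert Ax - y\rVert^2 + \lambda\,\Phi(x),
     \qquad \lambda = \sigma_n^2 .
\end{aligned}
```

Under that assumption the weight $\lambda$ is the noise variance: the noisier the measurement, the more the estimate leans on the prior.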
Thomas Bayes (c. 1701–1761) was an English Presbyterian minister whose single posthumous paper — read to the Royal Society in 1763 by his friend Richard Price — gave us the rule for updating a belief with evidence: posterior $\propto$ likelihood $\times$ prior. That one line is the backbone of the MAP estimate above, of the priors that regularize every ill-posed reconstruction in this part, and of the Bayesian reading of denoising, white balance, and blind deblurring elsewhere in the book. Fittingly for the patron saint of uncertainty, even this portrait is of doubtful authenticity — it may not be Bayes at all. Portrait: 1936 reproduction of disputed origin, public domain (via Wikimedia Commons).
The point to sit with is that without $\Phi$ the problem has no answer. The data term alone is satisfied by infinitely many images — including absurd ones full of high-frequency garbage that the forward model happens to annihilate. The prior is what breaks that tie, and everything in the rest of the chapter is a question of which prior and how it is applied.
When the measurement genuinely destroys information — super-resolution past the sensor's sampling, deblurring at frequencies the blur killed, inpainting a hole, dehazing — no amount of cleverness recovers it from the data alone, and inverting the forward model also amplifies noise (it divides by the small singular values / vanishing frequencies). A prior — a model of what natural images look like — is what selects an answer; it is a load-bearing part of the algorithm, not a tuning knob. The honest split: reconstruction priors fuse genuinely-measured detail (extra sub-pixel samples, frequencies merely attenuated); hallucination priors invent plausible detail that was never measured. (Registered as L10 — first appearance here. Recurs in Blind deblurring — noisy and blind deconvolution, the dark-channel prior — and forward in Generative AI and diffusion as a sampleable generative prior, L11; the learned-prior version is L8.)
So the right mental model for super-resolution is not "magnifying glass for files." It is "guess the most plausible scene that would have produced this blurry, aliased, noisy little image" — and plausible is defined entirely by the prior.
Figure: different high-resolution scenes (a 3, an 8, and a fence pattern) are each run through blur-then-downsample, and all produce the identical low-resolution $y$ (the high frequencies that distinguish them are exactly what sampling discards). The forward arrow is well-defined and lossy; the backward arrow is one-to-many. The data cannot choose among the preimages; only a prior can.
8.2.2 Scenarios: single-image, burst, and hybrid space–time
The objective above is fixed; what changes from one super-resolution system to the next is how much real information the measurement actually contains — and therefore how much work the prior has to do. There is a spectrum here, and it is worth laying out the three landmark cases.
Classic single-image super-resolution (SR) is the hardest: one frame in, one bigger frame out. The data fixes nothing new above the Nyquist limit, so every extra frequency in the output is supplied by the prior. Two classic priors do this. An internal prior exploits patch recurrence: small patches in a natural image tend to repeat across the image and across scales. A little corner or edge that appears large somewhere in the photo appears small somewhere else, so the image is, in effect, its own example database of how its details look when shrunk (Glasner, Bagon and Irani 2009). An external / example-based prior instead learns, from a separate database of low-/high-resolution pairs, how low-resolution patches map to high-resolution ones, and pastes in the learned completion (Freeman, Jones and Pasztor 2002). Either way, pure single-image SR is hallucination-leaning by necessity: the detail it prints was never in the one frame it was given.
Burst (multi-frame) SR changes the game, because it changes the measurement. This is the idea behind the Pixel phone's "Super Res Zoom" (Wronski et al. 2019): instead of one frame, capture a rapid burst, and exploit the fact that you cannot hold a camera perfectly still. Hand tremor shifts the scene by sub-pixel amounts between frames: a third of a pixel here, two-thirds there. Each frame therefore samples the same continuous scene on a slightly offset grid. Align the frames to sub-pixel precision and fuse them, and those staggered samples interleave into a denser effective sampling grid than any single frame provided (Figure 8.2.4). This is genuine new information: the extra samples were really measured. The mechanism is delicious: aliasing, normally a defect (the moiré of a downsampled fence), becomes the signal, because the sub-pixel shifts between frames are precisely what let you disentangle the high frequencies that a single frame folded together. No dedicated hallucination prior is needed for the recovered detail; the frames supply it. (A noise/motion model is still needed for robust fusion, as discussed below, but that is a different job from inventing detail.)
Two sub-points carry the burst idea:
- Sub-pixel alignment is the whole game. The benefit comes entirely from knowing each frame's offset to a fraction of a pixel. Mis-register, and instead of a denser grid you get blur and ghosting — the samples land in the wrong places and average into mush.
- The merge must be robust. A burst of a real scene contains moving objects, occlusions, and parallax. Fusing them naively smears the moving car into a ghost. A robust merge rejects samples that disagree with the rest, falling back to fewer-but-consistent frames where the scene moved. Conveniently, the same multi-frame machinery also replaces demosaicking: the color-filter mosaic puts each color on its own offset sub-grid, and the burst's sub-pixel shifts help fill those in too. (And note the kinship with multi-frame denoising: averaging $N$ aligned frames with independent noise cuts the noise standard deviation by a factor of $\sqrt{N}$, so burst SR gets sharper and cleaner at once.)
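The interleaving idea, and the cost of mis-registration, can be sketched in one dimension. This is a deliberately idealized toy: it assumes point sampling (no blur), no noise, a static scene, and offsets that happen to be exact multiples of one third of a coarse pixel. Real burst pipelines must estimate the offsets and merge robustly; every name below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
scene = rng.standard_normal(12)          # stand-in for the scene on the fine grid

s = 3
# Three frames, each keeping every s-th point at a different known offset.
# On the coarse grid these are shifts of 0, 1/3 and 2/3 of a pixel.
offsets = [0, 1, 2]
frames = [scene[o::s] for o in offsets]

# Fusion with correct registration: put each frame's samples back at its own offset.
fused = np.empty_like(scene)
for o, frame in zip(offsets, frames):
    fused[o::s] = frame

# Fusion with wrong registration: same samples, wrong assumed offsets.
misaligned = np.empty_like(scene)
for o, frame in zip([1, 2, 0], frames):
    misaligned[o::s] = frame
```

With the right offsets, `fused` equals `scene` exactly, because the fine grid was genuinely measured across the three frames. With the wrong offsets the very same samples produce a scrambled result, which is why sub-pixel alignment carries the whole method.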
Hybrid space–time SR generalizes the burst idea to the time axis: trade temporal resolution for spatial (or the reverse). Combine several captures that are low in space but high in time — or pair a fast, low-resolution camera with a slow, high-resolution one — so that the object's motion sweeps it across the sampling grid and supplies extra spatial samples (and vice versa). It is the same trick as hand tremor, with motion doing the shifting.
These cases line up on a single organizing axis — how much detail is measured versus invented. Many sub-pixel-shifted frames sit at the reconstruction end (the detail is real); a single frame sits at the hallucination end (the detail is a plausible guess); real systems live in between. That axis is important enough — and ethically loaded enough — to get its own section next.
8.2.3 Reconstruction vs hallucination: measured detail vs invented detail
Both kinds of super-resolution hand you a sharper picture. Only one of them adds truth, and a reader who cannot tell them apart will trust a forgery. So state the distinction plainly:
- Reconstruction-based super-resolution recovers detail that was genuinely measured — by multiple sub-pixel-shifted frames (a burst), or by deconvolving frequencies that the optics merely attenuated rather than killed (a frequency that is weak but nonzero in $y$ can, with care, be amplified back). The extra detail is recovered: verifiable, repeatable, faithful to the scene.
- Hallucination-based super-resolution uses a learned prior to synthesize detail that is plausible for natural images but was never in the measurement: pores on a cheek, individual hairs, characters on a distant sign. It can look spectacular — and it can be wrong. The model will happily render the wrong digits on the plate or the wrong weave on a fabric, because it is sampling a likely completion, not reading a measurement (Figure 8.2.3).
The headline hallucination models, used here as priors and treated as models in their own right in Deep learning, are Real-ESRGAN (Wang et al. 2021), an enhanced super-resolution generative adversarial network (ESRGAN) trained against a realistic degradation pipeline to produce sharp, textured output, and SwinIR (Liang et al. 2021), a transformer-based restoration network. Both effectively learn $p(\text{high-res} \mid \text{low-res})$ from data and emit a plausible sample from it. Their architectures and training belong to the machine-learning chapters; here they occupy one end of the prior axis, and that is all we need from them.
Two caveats deserve to be surfaced for honesty's sake. First, a metric trap: pixelwise scores such as peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) measure average error against a reference, not whether any particular detail is real. A soft, hedged output can outscore a sharp one, and a sharp, hallucinated result can look far more convincing than a faithful one and still be wrong. This is the perception–distortion tradeoff, the formal statement that you cannot simultaneously maximize fidelity-to-truth and perceptual realism past a certain frontier. (The learned perceptual image patch similarity (LPIPS) metric, which tries to track human judgment of similarity, is discussed in Deep learning.) Second, a forensic and ethical note that follows directly from L10: a super-resolved face or license plate is not evidence. The sharp detail was supplied by a prior over faces and plates in general, not measured from this one; presenting it as fact is presenting the model's guess as the scene.
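As a concrete reference for the metric discussion, here is how PSNR is computed, applied to two toy estimates of a step edge. The signals are made up for illustration: one estimate is soft but has the edge in the right place, the other is crisp but has the edge one sample off.

```python
import numpy as np

def psnr(estimate, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((estimate - reference) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

truth         = np.array([0.0, 0.0, 0.0, 0.00, 1.00, 1.0, 1.0, 1.0])
soft_estimate = np.array([0.0, 0.0, 0.0, 0.25, 0.75, 1.0, 1.0, 1.0])   # blurry, edge in place
crisp_wrong   = np.array([0.0, 0.0, 0.0, 1.00, 1.00, 1.0, 1.0, 1.0])   # sharp, edge misplaced

psnr_soft = psnr(soft_estimate, truth)     # about 18.06 dB
psnr_crisp = psnr(crisp_wrong, truth)      # about 9.03 dB
```

The crisp estimate looks more like a real edge, yet it scores about 9 dB lower because PSNR only averages squared pixel errors. A score of this kind says nothing on its own about whether a sharp detail is true.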
A placement note, since the question naturally arises: why is super-resolution here, in the single-image part, and not in the deep-learning chapters where the flashy models live? Because this chapter owns the prior abstraction — the throughline from ill-posedness to regularizer to denoiser-as-prior to diffusion — and super-resolution is the cleanest place to teach it. The learned models themselves (Real-ESRGAN, SwinIR, GANs, diffusion networks) are taught as models in Deep learning and Generative AI and diffusion, and cross-referenced, not duplicated. The slogan: priors here, models there.
8.2.4 Denoising as a universal prior: Plug-and-Play and RED
So far the prior $\Phi$ has been something you write down — total variation, sparse gradients, $-\log p(x)$ for some model. The modern view is more powerful and, at first, a little startling: the prior is whatever your best denoiser knows. To see why, look at how the recovery objective is actually minimized.
Return to $\hat{x} = \arg\min_x \tfrac{1}{2}\lVert Ax - y\rVert^2 + \lambda\Phi(x)$ with the two terms — data-fit and prior — pulling on the same variable $x$. Splitting algorithms (ADMM, or the simpler half-quadratic splitting (HQS)) handle this the way one always handles two coupled terms: introduce a copy of the variable so each term gets its own, then alternate, nudging the two copies toward agreement. The data-fit term gets a step that enforces the measurement; the prior term gets a step that enforces the prior. And here is the observation that opens everything up: the prior step — formally the proximal operator of $\Phi$, "find the nearby image the prior likes best" — is exactly a denoising operation. Concretely, that proximal step solves the MAP denoising sub-problem $\arg\min_x \tfrac12\lVert x-z\rVert^2 + \lambda\Phi(x)$ — "find the nearest clean-looking image to $z$" — which is precisely what a denoiser does. Denoising is "take this image and move it to the nearest clean-looking one," which is the same sentence (Figure 8.2.5).
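One classical case makes "proximal step equals denoiser" tangible. For the sparsity prior $\Phi(x) = \lVert x\rVert_1$ (chosen here purely as an example; the text does not single it out), the proximal sub-problem has a closed-form solution, soft-thresholding, which is the familiar shrinkage denoiser: small entries are treated as noise and zeroed, large ones are kept and pulled in slightly.

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * ||x||_1: argmin_x 0.5*||x - z||^2 + lam*||x||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_objective(x, z, lam):
    """The denoising sub-problem the proximal step is supposed to minimize."""
    return 0.5 * np.sum((x - z) ** 2) + lam * np.sum(np.abs(x))

z = np.array([3.0, -0.2, 0.1, -2.0, 0.05])   # two large "signal" entries plus small "noise"
lam = 0.5
x_hat = soft_threshold(z, lam)               # large entries shrink by lam, small ones vanish
```

Evaluating `prox_objective` at `x_hat` and at nearby perturbed points shows that the shrinkage output does minimize the sub-problem, so in this case the prior step and a denoiser are literally the same function.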
Plug-and-Play (PnP) priors (Venkatakrishnan, Bouman and Wohlberg 2013) take the obvious next step. If the prior step is a denoiser, then drop in any denoiser $\mathcal{D}_\sigma$, even one with no closed-form $\Phi$ behind it at all (BM3D, NL-means, a learned convolutional neural network (CNN)). The iteration alternates two steps:
- data-fit step — $z \leftarrow \arg\min_x \tfrac{1}{2}\lVert (k * x)\downarrow_s - y\rVert^2 + \tfrac{\rho}{2}\lVert x - \tilde{x}\rVert^2$: pull the estimate toward consistency with the measurement, staying near the current denoised guess $\tilde{x}$. This is a linear solve — conjugate gradient or a fast Fourier transform (FFT), exactly the matrix-free machinery of Linear Inverse Problems and Regression — and it is the only place the imaging physics $A$ enters.
- denoise step — $\tilde{x} \leftarrow \mathcal{D}_\sigma(z)$: impose the prior by running any denoiser.
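The two alternating steps above can be sketched as a short loop. To keep it self-contained, this sketch makes simplifying assumptions that are not in the text: the task is 1-D deblurring with a circular blur and no downsampling ($s = 1$), so the data-fit step has a closed-form FFT solution (with decimation one would typically fall back to conjugate gradient), and the denoiser is a placeholder moving average standing in for BM3D or a learned network. All function names and parameter values are hypothetical.

```python
import numpy as np

def blur_fft(kernel, n):
    """Frequency response of a circular blur (kernel zero-padded to length n)."""
    k = np.zeros(n)
    k[:kernel.size] = kernel
    return np.fft.fft(k)

def data_fit_step(y, x_tilde, K, rho):
    """Closed-form minimizer of 0.5*||k*x - y||^2 + 0.5*rho*||x - x_tilde||^2 via FFT."""
    numerator = np.conj(K) * np.fft.fft(y) + rho * np.fft.fft(x_tilde)
    return np.real(np.fft.ifft(numerator / (np.abs(K) ** 2 + rho)))

def box_denoiser(z, width=3):
    """Placeholder denoiser: circular moving average. Any denoiser can be swapped in."""
    half = width // 2
    return sum(np.roll(z, shift) for shift in range(-half, half + 1)) / width

def pnp_hqs(y, K, denoiser, rho=0.1, n_iters=30):
    """Alternate the physics step and the prior step for a fixed number of iterations."""
    x_tilde = y.copy()
    for _ in range(n_iters):
        z = data_fit_step(y, x_tilde, K, rho)   # only place the forward model appears
        x_tilde = denoiser(z)                   # only place the image prior appears
    return x_tilde

rng = np.random.default_rng(0)
n = 64
x_true = np.zeros(n)
x_true[20:40] = 1.0                              # toy piecewise-constant scene
K = blur_fft(np.ones(5) / 5.0, n)
y = np.real(np.fft.ifft(K * np.fft.fft(x_true))) + 0.01 * rng.standard_normal(n)
x_hat = pnp_hqs(y, K, box_denoiser)
```

The design point is visible in the signatures: `data_fit_step` never sees the denoiser and `denoiser` never sees `K`, so changing the task means changing only the first function, and improving the prior means changing only the second.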
The payoff is a clean decoupling. The data-fit step knows the physics (swap $A$ to change the task: super-resolution, deblurring, inpainting, demosaicking) and knows nothing about images. The denoise step knows the image prior and nothing about the task. So one good denoiser becomes a solver for every inverse problem: improve the denoiser and every recovery task improves for free. (FlexISP, Heide et al. 2014, builds an entire camera ISP this way, covering demosaicking, denoising, and deblurring, around a single denoiser-as-prior.) This is what "universal" means in the section title.
RED (Regularization by Denoising) (Romano, Elad and Milanfar 2017) makes the prior explicit rather than implicit. Where PnP slips a denoiser into a proximal slot, leaving it slightly unclear what energy, if any, is being minimized, RED defines an honest regularizer directly from the denoiser, $\Phi(x) = \tfrac{1}{2}\,x^{\top}\big(x - \mathcal{D}(x)\big)$, and shows that under mild conditions its gradient is just the denoising residual,
$$\nabla\Phi(x) = x - \mathcal{D}(x).$$
That is a remarkably clean object: the prior's gradient is how much the denoiser wants to change the image. Where the image already looks clean, $\mathcal{D}(x)\approx x$ and the gradient vanishes; where it looks noisy/unnatural, the residual points toward the fix. With it you can minimize the regularized objective by plain gradient steps, the denoiser defining an actual energy rather than an opaque proximal map. (Roughly: PnP is the implicit / proximal form, RED the explicit / gradient form of the same idea.)
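The gradient identity can be checked numerically in the one case where it is easy to verify by hand: a linear denoiser with a symmetric Jacobian. The sketch below assumes such a denoiser (a symmetric 3-tap circular smoothing filter, chosen only for illustration) and compares the denoising residual against a finite-difference gradient of the RED energy. It is a sanity check under that assumption, not a statement about general learned denoisers.

```python
import numpy as np

def denoiser(x):
    """Placeholder denoiser: symmetric circular 3-tap smoothing (linear, symmetric Jacobian)."""
    return 0.25 * np.roll(x, 1) + 0.5 * x + 0.25 * np.roll(x, -1)

def red_energy(x):
    """RED regularizer: 0.5 * x^T (x - D(x))."""
    return 0.5 * x @ (x - denoiser(x))

def red_gradient(x):
    """Claimed gradient of the RED energy: the denoising residual x - D(x)."""
    return x - denoiser(x)

def numerical_gradient(f, x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
max_gap = np.max(np.abs(numerical_gradient(red_energy, x) - red_gradient(x)))
flat_residual = red_gradient(np.ones(16))   # an already-smooth image
```

For this denoiser `max_gap` sits at rounding-error level, and `flat_residual` is zero: an image the denoiser leaves untouched contributes no gradient, exactly as the text describes.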
This reframes "image prior" entirely. It is no longer a hand-derived penalty like total variation or sparsity; it is "whatever your best denoiser knows." And it places every prior on a single spectrum (Figure 8.2.6), from hand-built penalties through classical denoisers to learned networks — all occupying the same slot in the same solver loop.
This is also exactly the mechanism by which the not-optional prior of L10 actually enters the computation: the denoise step is the prior step. It is, as well, the concrete face of the learned-operator lesson L8 — when $\mathcal{D}$ is a learned denoiser, you have swapped a hand-designed prior for one learned from data, while the data-fit + prior skeleton stays put.
8.2.5 Diffusion is iterated denoising (the continuous limit)
The spectrum in Figure 8.2.6 ends at the diffusion score, and that endpoint is not a metaphor — it is an exact identity that closes the prior throughline and hands off to the generative chapters. It rests on one classical fact.
Tweedie's formula. Suppose you observe a clean image corrupted by Gaussian noise, $z = x + \sigma\epsilon$. The best possible denoiser in the mean-squared sense, the minimum mean-squared error (MMSE) estimator, which is the posterior mean $\mathbb{E}[x \mid z]$, has a startlingly clean form:
$$\mathbb{E}[x \mid z] = z + \sigma^2\,\nabla_z \log p_\sigma(z),$$
where $p_\sigma$ is the distribution of noisy images at noise level $\sigma$. The quantity $\nabla_z \log p_\sigma(z)$ is the score — the gradient of the log-density, pointing "uphill" toward where clean images are more likely. So Tweedie says the score and a Gaussian denoiser are the same object: a denoiser is the score, rearranged (move toward higher density by exactly $\sigma^2$ times the score, and you land on the posterior mean). This is the right end of Figure 8.2.6 made precise.
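Tweedie's formula can be verified by hand in the one case where everything is available in closed form. The sketch assumes a scalar Gaussian prior, $x \sim \mathcal{N}(0, \tau^2)$ (the symbol $\tau$ and the numeric values are illustrative). Then the noisy density is $p_\sigma = \mathcal{N}(0, \tau^2 + \sigma^2)$, its score is $-z/(\tau^2 + \sigma^2)$, and the posterior mean is the standard shrinkage $\tfrac{\tau^2}{\tau^2 + \sigma^2}\,z$.

```python
import numpy as np

tau, sigma = 2.0, 1.0     # prior std and noise std (illustrative values)

def score(z):
    """Score of the noisy density N(0, tau^2 + sigma^2): d/dz log p_sigma(z)."""
    return -z / (tau**2 + sigma**2)

def tweedie_denoiser(z):
    """Posterior mean written the Tweedie way: z + sigma^2 * score(z)."""
    return z + sigma**2 * score(z)

def posterior_mean(z):
    """Standard Gaussian posterior mean E[x | z] for x ~ N(0, tau^2), z = x + sigma*eps."""
    return tau**2 / (tau**2 + sigma**2) * z

z_grid = np.linspace(-5.0, 5.0, 11)

# Monte Carlo check that the Tweedie denoiser actually reduces the error.
rng = np.random.default_rng(0)
x = tau * rng.standard_normal(100_000)
z = x + sigma * rng.standard_normal(100_000)
mse_noisy = np.mean((z - x) ** 2)                        # close to sigma^2
mse_denoised = np.mean((tweedie_denoiser(z) - x) ** 2)   # close to tau^2*sigma^2/(tau^2+sigma^2)
```

On `z_grid` the two expressions agree to machine precision, which is the identity in miniature: knowing the score at noise level $\sigma$ is the same as knowing the MMSE denoiser at that level.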
Therefore diffusion is iterated denoising. A diffusion model is trained to estimate the score $\nabla_z \log p_\sigma$ at every noise level $\sigma$ (Ho, Jain and Abbeel 2020; Rombach et al. 2022, who run it in a compressed latent space). To sample a fresh image, it starts from pure noise and repeatedly applies "denoise a little, add a little noise back, repeat," walking $\sigma$ down a schedule from large to zero until a clean image emerges. Read against the previous section, that outer loop is exactly PnP/RED taken to a continuous limit: the prior-step denoiser is the learned score, applied over a whole schedule of noise levels rather than one. Diffusion sampling is a denoiser-as-prior solver.
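A toy sampler shows the loop structure without any neural network. The sketch rests on several assumptions not in the text: the "image" distribution is a scalar Gaussian $\mathcal{N}(\mu, \tau^2)$, so the MMSE denoiser at every noise level is known in closed form and plays the role of the learned score; and the update is a deterministic, DDIM-style variant that scales the remaining noise down at each step instead of injecting fresh noise. All names and numbers are illustrative.

```python
import numpy as np

mu, tau = 3.0, 1.0                       # toy "clean image" distribution N(mu, tau^2)

def denoise(z, sigma):
    """MMSE denoiser for the toy prior; by Tweedie this equals z + sigma^2 * score(z)."""
    return mu + tau**2 / (tau**2 + sigma**2) * (z - mu)

def sample(n_samples, sigmas, rng):
    """Walk sigma down the schedule, denoising a little at every level."""
    z = sigmas[0] * rng.standard_normal(n_samples)        # start from (almost) pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0 = denoise(z, sigma)                            # current guess of the clean sample
        z = x0 + (sigma_next / sigma) * (z - x0)          # keep a sigma_next-sized share of the noise
    return z

rng = np.random.default_rng(0)
sigmas = np.geomspace(50.0, 0.01, 200)   # noise schedule, large to nearly zero
samples = sample(20_000, sigmas, rng)
sample_mean, sample_std = samples.mean(), samples.std()
```

Starting from noise that knows nothing about $\mu$ or $\tau$, the samples end up with mean close to 3 and standard deviation close to 1. The only ingredient that encodes the target distribution is the denoiser, applied once per noise level.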
The consequence for recovery is immediate. Condition the diffusion sampler on a measurement — fold in the data-fit step for some forward model $A$ at each iteration — and you have a PnP/RED-style solver whose prior is the strongest learned denoiser we have. That is precisely how state-of-the-art super-resolution, deblurring, and inpainting are now done: posterior sampling under a generative prior. It is the bridge from this chapter's prior story to the generative one.
We establish only the identity here — denoiser $\equiv$ score $\equiv$ diffusion step — to close the throughline. The full treatment (DDPM, latent diffusion, score-matching, samplers, and the generative leap from scoring a prior to sampling one, L11) lives in Generative AI and diffusion, which cross-references back to this section.
And that sets up the hand-off. The story of this chapter — recovery = data-fit + prior, and the prior is doing the heavy lifting — now plays out where the difficulty is even sharper: where the forward operator itself is unknown (blind deblurring), where naive inversion amplifies noise catastrophically (the Wiener filter), and where a single hand-built statistical prior still earns its keep (the dark-channel prior for haze). That is the next chapter, Blind deblurring — the same lesson, harder problems.
Big lessons of this chapter
The recurring principles from this chapter, gathered for review.
When the measurement genuinely destroys information — super-resolution past the sensor's sampling, deblurring at frequencies the blur killed, inpainting a hole, dehazing — no amount of cleverness recovers it from the data alone, and inverting the forward model also amplifies noise (it divides by the small singular values / vanishing frequencies). A prior — a model of what natural images look like — is what selects an answer; it is a load-bearing part of the algorithm, not a tuning knob. The honest split: reconstruction priors fuse genuinely-measured detail (extra sub-pixel samples, frequencies merely attenuated); hallucination priors invent plausible detail that was never measured. (Registered as L10 — first appearance here. Recurs in Blind deblurring — noisy and blind deconvolution, the dark-channel prior — and forward in Generative AI and diffusion as a sampleable generative prior, L11; the learned-prior version is L8.)