4.1 Linear Inverse Problems and Regression⧉
A picture comes back soft, and you know roughly why. The lens hunted and never quite locked focus; a hand drifted through a long exposure in a dim café; diffraction smeared away the last of the fine detail. Crucially, you also know how the softening happened — you measured the lens's point-spread function (PSF) on a calibration target, or you reconstructed the camera-shake trajectory from the photo itself. With the blur in hand, the natural wish is to run it backward and get the sharp scene returned. That is precisely the question the previous chapter posed and then set aside: given the blur, can it be undone? Fourier offered a tantalizing first answer — divide each frequency back out — and an immediate warning — that the division detonates wherever the blur had pressed a frequency down to nearly nothing.
This chapter builds the real machine behind that wish, and the machine turns out to dwarf deblurring itself. The same forward model and the same solvers reappear in nearly every optimization-based edit later in the book. So the honest framing is not "we are learning to deblur" but "we are learning the template that deblurring, Poisson cloning, colorization, and matting are all special cases of." The route has three legs. First: see that a known blur is a linear operator, hence a matrix, so recovery is inversion. Second: recast that inversion as regression — never demand an exact solution; instead ask which sharp image best explains the photo in a least-squares sense. Third, the engineering core: solve that regression without ever assembling the matrix, with iterative solvers that only need to apply the blur and its transpose. We finish by stepping back to see why this one chapter underwrites the entire part.
4.1.1 Blur is linear, so deblurring is inversion⧉
Pin down the cast. There is a sharp scene we are after; call it $x$. There is a known blur kernel $k$ — a lens PSF, a defocus disk, a motion streak, the looping scribble a shaky hand traces. And there is the photograph we were actually handed, $y = k * x$: the sharp scene convolved with that kernel. Throughout this chapter we sit in the non-blind case — the kernel is given, whether measured beforehand or recovered earlier. Recovering the kernel too, from the blurred image alone, is blind deblurring, a strictly harder problem we hand off to Advanced inverse problems.
Now the load-bearing idea. Blur is linear and shift-invariant. Linear means blurring a sum of two scenes equals the sum of their blurs, and scaling a scene scales its blur. Shift-invariant means sliding the scene over just slides the result. And any linear map on a vector is a matrix. So somewhere there sits a matrix $A$ — the forward operator — for which the photo is simply
Said plainly: $y$ is what comes out when the sharp image $x$ is pushed through the blur $A$. Deblurring is then nothing more than inversion — recover $x$ from $y$. The entire remainder of the chapter is an unpacking of what "invert" can possibly mean when $A$ is a matrix we must never write down.
It pays to keep three pictures of $A$ in view simultaneously, because we hop between them depending on which makes the next move cheap (Figure 1). Picture one is the convolution view, $y = k * x$: the familiar local story where each output pixel is a kernel-weighted blend of its neighbors. Picture two is the matrix view, $y = Ax$, whose payoff is purely that it lets us talk linear algebra. That matrix is gigantic — for an image of $N = W\cdot H$ pixels it is $N \times N$, so a 12-megapixel photo gives a twelve-million-by-twelve-million matrix — yet it is far from a dense slab. It is sparse, because each output pixel reaches only the few inputs under the kernel's footprint, so the overwhelming majority of entries are zero. And it is structured: every row holds the same kernel, merely shifted to a new center. (A matrix built from one pattern shifted along each row is Toeplitz, or circulant if the image is treated as wrapping around — the very wrap-around that let Fourier be exact last chapter.) Picture three is the vector-map view: $A$ as a function $x \mapsto Ax$ carrying one point of the million-dimensional image space to another. One operator, three faces.
Why not just compute $A^{-1}$ and call it done? Two reasons, each fatal on its own. The first is brute size: a twelve-million-square matrix cannot even be stored, never mind inverted — the storage is astronomical and a direct inverse costs vastly more still. The second runs deeper. Suppose we could form $A^{-1}$; it would be hopelessly ill-conditioned. Blur is a low-pass filter, crushing high frequencies toward zero, so its inverse has to amplify those same frequencies back up — and right alongside the signal, it amplifies whatever noise the sensor sprinkled on the photo. This is the deblur preview from the previous chapter, exactly (Figure 2): divide each frequency back out and the true detail does return, but wherever the blur attenuated a frequency a hundredfold, the noise added after the blur is boosted a hundredfold too, and the recovered image disappears under bright speckled garbage. We dodge both troubles. Size we beat in this chapter, by never forming $A$. Conditioning we beat in the next, by adding a prior (Advanced inverse problems).
A general blur matrix is a tangled, million-dimensional object that no one inverts by hand. But Fourier diagonalizes convolution (first met in Linearity, Fourier, Aliasing and deblurring): rewrite $A$ in the frequency basis and the tangle dissolves into a diagonal operator — it just multiplies each frequency $\omega$ by the kernel's spectrum $\hat k(\omega)$. A diagonal operator is trivial to invert: divide each coordinate by its own scalar. So in Fourier coordinates deblurring collapses to per-frequency division, $\hat x(\omega) = \hat y(\omega)/\hat k(\omega)$. That single fact is at once the cleanest deblur available (the fast Fourier transform (FFT) solver at the end of this chapter) and the cleanest diagnosis of why deblurring is hard: wherever $\hat k(\omega) \approx 0$ — a frequency the blur erased outright — dividing by it explodes. "Diagonalize when you can" (L5) is the move we return to again and again; it may be the single most useful reflex in applied linear algebra.
4.1.2 Images as vectors, and the notation overload⧉
We have been quietly treating the image as a vector $x$ all along; it is time to make that explicit, because everything ahead is spoken in fluent linear algebra and that only works once the image is a vector. Take every pixel — every channel too, for color — and stack them into one tall column vector $x \in \mathbb{R}^N$ with $N = W\cdot H\cdot C$. This is the same flattening the pixels already undergo in memory: a C++ planar buffer indexed cWH + yW + x, or a NumPy array of shape (H, W, C) read out in order. The serialization is almost entirely a conceptual device — we essentially never materialize the giant vector, and certainly never the matrix. Its whole job is to let us speak linear algebra and call* its libraries and solvers, while the real arithmetic stays down in the cheap, local idiom of convolutions (Figure 3).
What does the vector view actually hand us? The entire linear-algebra toolbox. Images now add and scale like vectors. The dot product $\langle a, b\rangle = \sum_i a_i b_i$ — one number measuring how much two images overlap — gives correlation and projection, and through it the squared norm $\|r\|^2 = \langle r, r\rangle$, the total energy in an image, which is exactly the error that regression will drive down. And every linear image operation is now a matrix: a global exposure change is a diagonal $A$, a convolution is a Toeplitz $A$, downsampling and the gradient are matrices too. One frame, every operation.
One piece of notation deserves a loud warning before it trips us, because it is both genuinely confusing and utterly standard. In linear algebra the unknown we solve for is always written $x$. In imaging, $x$ is just as reliably the horizontal pixel coordinate. We keep both — fighting either convention against an entire field would cost more than it saves — and we tell them apart by context. When $x$ is the thing we recover, it is the whole serialized image, a vector; when $x$ indexes a position inside an image, it is a coordinate. The measurement $y$ carries the same clash against the vertical coordinate. In practice the two roles never collide in a single sentence, and context settles which is meant; but it is worth knowing the overload is deliberate, not an accident.
4.1.3 Regression: deblurring as least squares⧉
We said we would not solve $Ax = y$ exactly, and now we can give the real reason. With noise in the photo there may be no image $x$ that reproduces $y$ on the nose — the equation is inconsistent — and even where a solution does exist, we just watched it turn unstable. So we change the question entirely. Rather than "find the $x$ with $Ax = y$ exactly," we ask: which sharp image $x$, once blurred, best reproduces the photo we actually got? That is a regression problem — fit $x$ to the data by minimizing the squared discrepancy between the blurred candidate and the measurement:
In words: range over all conceivable sharp images, blur each, and keep the one whose blur lands nearest the photo, "nearest" meaning smallest total squared pixel difference. The mental picture is the textbook least-squares fit (Figure 4): data, a model, and the model nudged until the squared residuals shrink as far as they can — except here the "model" is an entire sharp image and the "fit" is how faithfully its blur matches the photo. This is the same skeleton as fitting a line to a scatter of points; only $A$ changes. Deblur, denoise, demosaic, HDR-merge, white balance — all the same problem in different operators.
Why squared error in particular? Three reasons, each of which pays off later. It keeps the problem linear: the squared norm is quadratic in $x$, so its gradient is linear in $x$ — which is exactly the hook that lets the fast solvers below exist at all. It is the maximum-likelihood fit when the noise is Gaussian, the most natural noise model, so least squares is not arbitrary but the statistically principled choice (the tie to probability lives in the Refreshers). And it has a clean closed form. The flip side — that squared error shrugs off outliers and, left alone, will cheerfully over-fit noise — is real, and it is precisely what the prior in the next chapter is for.
To locate the minimum, set the gradient to zero. The gradient of $\tfrac12\|Ax-y\|^2$ is
and zeroing it gives the central equation of the chapter:
These are the normal equations — the least-squares understudy for "$Ax = y$." Read them this way: the best $x$ is the one whose residual $Ax - y$, pushed back through $A^\top$, vanishes — geometrically, the residual ends up orthogonal to everything $A$ can possibly produce, which is exactly what "closest achievable fit" means. The matrix $A^\top A$ is symmetric and positive (semi-)definite, and that is no idle remark: it is the very structure the iterative solvers below are engineered to exploit.
That leaves one mysterious character, $A^\top$. What is the transpose of a blur, in concrete terms? You might brace for something exotic; it is just another convolution. The transpose of "convolve by $k$" is "convolve by the flipped kernel" $\tilde k(\mathbf u) = k(-\mathbf u)$ — the kernel mirrored through its center, which turns convolution into correlation. So
This is the hook the whole chapter dangles from. Both $A$ and $A^\top$ are merely convolutions. Either can be applied with one call to the blur routine we already own — conv(·, k) for $A$, conv(·, flip(k)) for $A^\top$ — and no matrix is ever touched. Concretely, apply_blur(x) = conv(x, k) and apply_blur_T(y) = conv(y, flip(k)) are the only two primitives any solver below will ask for.
One scope marker before we push on. Bare least squares, with nothing else, over-fits the noise — it is the regression face of the unstable inverse from Figure 2. The remedy is a regularizer, or prior: bolt on a penalty $R(x)$ that prefers plausible images, turning the objective into $\min_x \tfrac12\|Ax - y\|^2 + \lambda R(x)$ and the normal equations into $(A^\top A + \lambda L)\,x = A^\top y$, where $L$ is the operator behind the penalty (often a Laplacian, penalizing roughness) and $\lambda$ sets prior against data-fit. We name it here and defer the why and how — Tikhonov, total-variation, Wiener — to Advanced inverse problems, and the learned prior to Machine learning. Deferring is safe for the happiest of reasons: the solver machinery below does not change one bit. The prior only appends a $+\lambda L$ to the operator; everything that follows applies a little more and is otherwise word-for-word identical.
4.1.4 Matrices without forming matrices: gradient descent and conjugate gradient⧉
Here is the central trick of the chapter — the move that makes the linear-algebra framing usable rather than merely pretty. The normal-equation matrix $A^\top A$ is $N\times N$, twelve million square, and we will never build it. Every solver we reach for needs exactly one primitive: given a vector $v$, return $A^\top A\,v$. And that we can deliver, cheaply, as two convolutions — blur $v$, then blur the result by the flipped kernel:
No matrix is formed, stored, or inverted anywhere. The solver calls apply_blur and apply_blur_T, and nothing else. This is what matrix-free means, and it is the reason a problem with twelve million unknowns fits in memory and runs on a laptop.
Begin with the simplest such solver: gradient descent (GD). We are minimizing the bowl-shaped $f(x) = \tfrac12\|Ax - y\|^2$, whose gradient we already wrote, $\nabla f(x) = A^\top(Ax - y)$. Each step is therefore: blur the current estimate, subtract the photo to get the residual, blur that by the flipped kernel, and step downhill —
one forward blur, one residual, one transposed blur per iteration, scaled by the step size $\eta$. It is correct and almost embarrassingly short to code — the whole solver is the loop below, touching $A$ only through the two convolutions:
x ← 0 while not converged: r ← A·x − y # residual: blur the guess, subtract the photo g ← Aᵀ·r # gradient: blur the residual by the flipped kernel x ← x − η·g # step downhill return x
Read it back: form the residual with one forward blur, push it through the transpose with one more, and step against the gradient by $\eta$ — no matrix, just apply_blur and apply_blur_T. Its flaw is that it is slow. The least-squares bowl is badly elongated — a long, narrow trough rather than a round basin — because the blur squashes high frequencies hard and low frequencies barely, so the surface is far steeper in some directions than others. Plain gradient descent, forever stepping straight downhill, zig-zags across such a trough, taking thousands of timid steps to inch along its length (Figure 5). It works; you simply would not want to wait for it.
The cure is conjugate gradient (CG), the workhorse of the chapter. Because $A^\top A$ is symmetric positive-definite — the structural fact flagged a moment ago — CG can be far cleverer than steepest descent. One honest caveat first. For a blur that kills frequencies outright — the very ill-conditioning that set all this in motion — $A^\top A$ is only positive semi-definite: those erased frequencies sit in its null space, contributing nothing. Strict definiteness, which CG genuinely wants, is therefore absent for pure least squares. It is restored by the regularizer: the $+\lambda L$ term lifts those null directions, making $A^\top A + \lambda L$ properly positive-definite, and that regularized operator is the one CG actually runs against. With definiteness secured, here is the idea. Instead of greedily grabbing the steepest direction at every step — which keeps undoing earlier progress, hence the zig-zag — CG picks a sequence of conjugate directions: directions that, measured through the operator, do not interfere, so progress won along one is never spoiled by the next. The result is that CG marches straight down the trough rather than rattling across it, converging in tens of iterations where gradient descent needed thousands. Best of all, CG touches $A$ only through the same apply-operator calls — one $A$ and one $A^\top$ per iteration, the identical two convolutions — so it inherits the matrix-free property at no cost. The one-line intuition to keep: CG never re-searches a direction it has already exhausted — each new search direction is chosen $A^\top A$-conjugate to all the previous ones, giving monotone, non-interfering progress, with a guarantee of reaching the exact minimum in at most $n$ steps and, in practice, far fewer. For the geometry of why conjugate directions behave this well, the classic and genuinely readable account is Shewchuk's An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.
CG runs fast on a round bowl and slower on a stretched one, so one final idea earns its place: preconditioning. A preconditioner is a cheap-to-apply approximation $M \approx (A^\top A)^{-1}$ — a diagonal, say, or an FFT-based rough inverse — applied each iteration to effectively re-round the bowl before CG walks it. A good one can slash the iteration count by an order of magnitude, and preconditioned CG is the practical default solver for problems of this shape.
4.1.5 Efficient solvers⧉
Step back and survey the toolbox, because the best solver depends on the problem at hand. The premise never changes: $A^\top A$ is huge but sparse and structured, so we never go near a direct inverse and always solve iteratively or multiscale. Three tools cover the territory, ordered by when each one wins.
The first is conjugate gradient with a preconditioner, just described — the general-purpose default. Its great virtue is generality: it asks only that you can apply $A$ and $A^\top$, so it handles any operator, even when the boundaries are awkward or the blur is not perfectly shift-invariant (a spatially varying PSF, say). When in doubt, reach for this.
The second is multigrid, a coarse-to-fine method borrowed from numerical analysis. Its motivation is a specific weakness of iterative solvers: they crush high-frequency error quickly but drain low-frequency error agonizingly slowly — a smooth, broad mistake takes an age to seep out one local step at a time. Yet low-frequency error is exactly what a coarse grid represents cheaply and a few iterations dispatch. So multigrid does the clever, obvious thing: solve on a small, coarse version of the image (where today's low frequencies are this grid's high frequencies, killed fast), prolong the correction back up — upsample it — and refine on the fine image. Running down to coarse and back up to fine traces a V-cycle on an image pyramid, and it makes convergence near-linear in the pixel count (Figure 6). It reuses, almost verbatim, the reduce/expand machinery of Linear pyramids and wavelets — the same Gaussian-pyramid bargain, now pressed into service to accelerate a solver rather than to blend images.
The third is the FFT solver, fastest of all when its assumption holds. Recall the big lesson: on a rectangular domain with periodic (or Neumann) boundaries the blur is circulant, so Fourier diagonalizes it (L5) — and a Laplacian regularizer diagonalizes right alongside it. When everything is diagonal there is no system left to iterate on. We transform, divide per frequency, and transform back. Without a prior this is just the per-frequency $\hat x(\omega) = \hat y(\omega)/\hat k(\omega)$ from the big-lesson box; with a prior included, the formula picks up precisely the safety valve that bare division lacked:
which is the regularized normal equations $(A^\top A + \lambda L)x = A^\top y$ written out frequency by frequency: the numerator is $A^\top y$ in Fourier, the denominator is $A^\top A + \lambda L$, and the $+\lambda$ term in the denominator is exactly what stops the division from blowing up where $\hat k \approx 0$ — the regularized, stable cousin of the naive inverse from Figure 2. Two FFTs and a pointwise divide, no iteration whatsoever. The catch is the boundary assumption: it wants the image to wrap, and real images do not, so the edges misbehave unless handled (padding, windowing). Where the assumption is tolerable, nothing is faster; where it is not, fall back to preconditioned CG. This is the through-line from Linearity, Fourier, Aliasing and deblurring reaching its destination — the FFT "divide it back out" previewed there is exactly this solver, now properly regularized.
That is the entire toolkit, and here is why it was worth assembling so carefully: it carries over almost unchanged across the whole single-image part. Set up the forward model $y = Ax$, minimize $\|Ax - y\|^2$ (plus a prior), and solve it matrix-free with CG, multigrid, or an FFT. Gradient-domain HDR, Poisson cloning, colorization, and matting are each just this machine with a different $A$ and a different prior — Poisson editing, for one, is the same least-squares solve run on image gradients instead of pixels. Learn the spine once, here, and every later chapter in this part becomes a new operator on the same skeleton. That is what makes this the foundation chapter: not because deblurring is the most important application, but because deblurring is the cleanest place to first meet a machine we will run for the rest of the book.
Big lessons of this chapter
The recurring principles from this chapter, gathered for review.
A general blur matrix is a tangled, million-dimensional object that no one inverts by hand. But Fourier diagonalizes convolution (first met in Linearity, Fourier, Aliasing and deblurring): rewrite $A$ in the frequency basis and the tangle dissolves into a diagonal operator — it just multiplies each frequency $\omega$ by the kernel's spectrum $\hat k(\omega)$. A diagonal operator is trivial to invert: divide each coordinate by its own scalar. So in Fourier coordinates deblurring collapses to per-frequency division, $\hat x(\omega) = \hat y(\omega)/\hat k(\omega)$. That single fact is at once the cleanest deblur available (the fast Fourier transform (FFT) solver at the end of this chapter) and the cleanest diagnosis of why deblurring is hard: wherever $\hat k(\omega) \approx 0$ — a frequency the blur erased outright — dividing by it explodes. "Diagonalize when you can" (L5) is the move we return to again and again; it may be the single most useful reflex in applied linear algebra.