symbol

meaning (this chapter)

note

$A$, $A^\top$

the forward operator (blur / convolution) and its transpose; applied matrix-free as $Av=\mathrm{conv}(v,k)$, $A^\top v=\mathrm{conv}(v,\mathrm{flip}(k))$

from Linear Inverse Problems and Regression

$A^\top A$

the normal-equation matrix — symmetric positive-definite, huge, sparse, structured; never formed

from Linear Inverse Problems and Regression

$x$, $y$

the image we solve for and the observed data

from Notations

$r$

the residual $r = Ax - y$

local (this chapter)

$\eta$

the gradient-descent step size

local (this chapter)

$M$

the preconditioner $M \approx (A^\top A)^{-1}$, cheap to apply; re-rounds the bowl

new (this chapter)

$k$, $\tilde k$

the blur kernel and its flip; $A^\top$ correlates with $\tilde k = \mathrm{flip}(k)$

from Linear Inverse Problems and Regression

$\hat x(\omega)$, $\hat k(\omega)$

the Fourier transforms of image and kernel; the FFT solver divides per frequency $\omega$

from Linearity, Fourier, Aliasing and deblurring

$L$, $\lambda$

the Laplacian regularizer and its weight in the FFT divide (full treatment deferred to Advanced inverse problems)

forward-ref

💡 In a hurry? Jump to this chapter’s 2 big lessons ↓

4.2 Efficient solvers⧉

The previous chapter cast recovery as the normal equations $A^\top A\,x = A^\top y$ and made the key point that we never build $A$ — every solver needs only to apply the blur and its transpose. This chapter is the toolkit that actually does the solving: a handful of matrix-free, iterative / multiscale methods, and the rule for which one to reach for. The same machinery returns in Poisson image editing, colorization, matting, and gradient-domain HDR — learn it once here.

4.2.1 Matrix-free, iterative, multiscale — the situation⧉

Recall the obstacle. For a one-megapixel image, the unknown $x$ has a million entries, so the normal-equation matrix $A^\top A$ is a million-by-million object — a trillion numbers, far too large to store, let alone invert directly. The saving grace, established last chapter, is that this matrix is never arbitrary: it is huge but sparse and structured. Sparse because a blur mixes each output pixel from only a small neighborhood, so each row of $A$ has a handful of nonzeros; structured because the same kernel acts everywhere, so $A$ is a convolution and every row is the same stencil, just shifted. A direct inverse would destroy both properties — the inverse of a sparse matrix is generally dense — so we do not compute one. We solve iteratively, improving a running guess, and sometimes multiscale, working coarse-to-fine on a pyramid.

The discipline that makes all of this possible is to touch $A$ only through the matrix-free operator pair from Linear Inverse Problems and Regression:

$$ \mathtt{apply\_blur}(v) = \mathrm{conv}(v, k), \qquad \mathtt{apply\_blur\_T}(v) = \mathrm{conv}(v, \mathrm{flip}(k)). $$

The first applies $A$ (blur by the kernel $k$); the second applies $A^\top$ (correlate with the flipped kernel). Composed, they give $A^\top A\,v$ in two convolutions, which is the single primitive every iterative solver below consumes. No solver in this chapter ever asks for an entry of $A^\top A$; each asks only "given this image $v$, hand me $A^\top A\,v$," and that is a pair of convolutions over an image — never a matrix in sight. With that primitive in hand, the rest of the chapter is three tools, plus one, ordered by when each wins.

4.2.2 Gradient descent and conjugate gradient (recap + preconditioning)⧉

We met both methods in Linear Inverse Problems and Regression; here we recap them with the geometry that explains why one is so much better than the other, and then add the preconditioner that makes the better one practical.

Gradient descent (GD) is the obvious method. We are minimizing the least-squares energy $f(x) = \tfrac12\lVert Ax - y\rVert^2$, whose gradient is

$$ \nabla f(x) = A^\top(Ax - y), $$

one forward blur to form the residual $r = Ax - y$, one transposed blur to turn it into a gradient — exactly the matrix-free pair. We then step downhill, $x_{t+1} = x_t - \eta\,\nabla f(x_t)$, for a small step size $\eta$. It is correct and a handful of lines to code. It is also slow, and the reason is geometric. The energy $f$ is a quadratic bowl, and its shape is set by the eigenvalues of $A^\top A$. A blur crushes high frequencies toward zero while leaving low frequencies near one, so those eigenvalues span an enormous range — the operator is ill-conditioned — and the bowl is not round but a long, narrow valley, hundreds or thousands of times longer than it is wide. Gradient descent always steps perpendicular to the local contour, which in a narrow valley points across the valley rather than down it. So the iterate zig-zags from one wall to the other, creeping along the floor a sliver at a time, and convergence drags out to thousands of steps (Figure 4.2.1).

Conjugate gradient (CG) is the workhorse that fixes exactly this. Because $A^\top A$ is symmetric and positive-definite — the structure least squares always produces — CG can be far smarter about its directions. Instead of greedily following the steepest descent and undoing its own progress, it chooses a sequence of conjugate search directions: each new direction is non-interfering with respect to the operator, so once CG has minimized along a direction it never has to revisit it. The effect is that CG cuts straight down the long axis of the valley instead of bouncing across it, and it converges in tens of steps, not thousands. Crucially, it pays the same per-step cost as gradient descent: one application of $A$ and one of $A^\top$ per iteration, and a few image dot products (the inner products CG needs are just sums of pixelwise products). It, too, is fully matrix-free. For the geometric intuition behind why conjugacy works — and it is genuinely worth the read — Shewchuk's tutorial An Introduction to the Conjugate Gradient Method Without the Agonizing Pain is the standard reference. The loop is short, and every appearance of the operator is one of the two convolutions:

r ← b − Aᵀ·A·x          # initial residual, b = Aᵀy; uses apply_blur then apply_blur_T
d ← r                    # first search direction
while not converged:
    q ← Aᵀ·A·d           # the only operator call per step: two convolutions
    α ← ⟨r, r⟩ / ⟨d, q⟩  # exact step along d (⟨·,·⟩ is a sum of pixel products)
    x ← x + α·d
    r_new ← r − α·q
    β ← ⟨r_new, r_new⟩ / ⟨r, r⟩
    d ← r_new + β·d      # next direction, conjugate to all previous
    r ← r_new
return x

Read it back: each step takes the exact stride $\alpha$ down the current direction $d$, then bends $d$ by $\beta$ so the new direction is $A^\top A$-conjugate to every earlier one — which is why CG never re-searches a direction and converges in tens of steps, not thousands.

CG is fast when the bowl is round and slow, like everything else, when it is elongated. So we reshape the bowl. A preconditioner is a matrix $M \approx (A^\top A)^{-1}$ that is cheap to apply, and applying it before each CG step re-rounds the valley — it compresses the long axis so that the conditioning the solver actually faces is far better than the raw operator's. A good preconditioner can slash the iteration count by an order of magnitude, and it need not be exact: even a crude approximate inverse helps, because CG only needs the bowl more round, not perfectly so. Common choices are a simple diagonal scaling (cheap, sometimes enough) or an FFT-based approximate inverse that captures the operator's dominant frequency behavior — which previews the FFT solver below, deployed here as a preconditioner rather than a full solve. Preconditioned CG is the practical default: it works for any operator you can apply and apply-transpose, even when the boundaries are awkward or the blur varies across the image, where the FFT solver below does not apply at all.

fig-cg-vs-gd — **Figure 4.2.1.** Gradient descent zig-zags; conjugate gradient cuts straight. Both solvers descend the same elongated least-squares bowl — a long, narrow valley whose shape comes from the ill-conditioned operator $A^\top A$. **Gradient descent (GD):** each step points perpendicular to the contour, across the valley rather than down it, so the path **zig-zags** wall to wall and inches along the floor — thousands of steps. **Conjugate gradient (CG):** non-interfering, conjugate directions never undo earlier progress, so the path cuts down the long axis and reaches the minimum in a handful of steps. Inset: a **preconditioner** $M\approx(A^\top A)^{-1}$ re-rounds the bowl, shrinking the long axis so even GD-like steps converge fast. *(Deferred — to be generated.)*

💡 Big lesson — matrix-free iterative solving

The inverse problems of this part produce a system $A^\top A\,x = A^\top y$ that is far too large to form or invert directly, but is sparse and structured — $A$ is a convolution. So never build the matrix: solve iteratively, touching $A$ only through apply-operator calls (a forward convolution and its transpose). Plain gradient descent zig-zags on the elongated, ill-conditioned bowl; conjugate gradient picks non-interfering directions and converges in tens of steps, not thousands; a preconditioner $M\approx(A^\top A)^{-1}$ re-rounds the bowl and slashes the count further. This same matrix-free spine — set up $y=Ax$, minimize $\lVert Ax-y\rVert^2$, solve with CG / multigrid / FFT — drives Poisson image editing, colorization, matting, and gradient-domain HDR.

4.2.3 Multigrid: coarse-to-fine on a pyramid⧉

There is one error iterative solvers are conspicuously bad at, and it is the same one for gradient descent and CG alike: low-frequency error, the broad, smooth, large-scale component of the discrepancy. A blur barely touches low frequencies — they pass through $A$ almost unchanged — so on the bowl they correspond to the long, flat axis the solver crawls along. The local convolutions a solver applies move information only one pixel-ring per step, so a smooth error spread across the whole image takes as many steps to correct as the image is wide. High-frequency error, by contrast, dies in a few iterations. So an iterative solver tends to clean up the fine detail quickly and then stall, grinding away at a slowly-fading smooth residual.

Multigrid turns this weakness into a strategy by noticing that a low frequency on a fine grid is a high frequency on a coarse one. Downsample the problem onto a small image — a Gaussian-pyramid reduce — and what was a broad, slow error becomes a sharp, fast one the solver kills in a few cheap steps. So multigrid solves a coarse version of the system first, prolongs (upsamples — a pyramid expand) the resulting correction back to the fine grid, and adds it in, then does a few fine-grid iterations to clean up the high frequencies the coarse grid could not represent. Recurse this down to a tiny image and back up — descending the pyramid resolving ever-lower frequencies, then ascending to refine — and you trace the characteristic V-cycle (Figure 4.2.2). The payoff is dramatic: each frequency band is handled on the grid where it is cheap, so convergence becomes near-linear in pixel count rather than dragging with the image's width. This is literally the Gaussian/Laplacian-pyramid machinery of Linear pyramids and wavelets — the same reduce and expand — repurposed from a representation into a solver.

fig-multigrid-vcycle — **Figure 4.2.2.** Multigrid: a V-cycle down and up an image pyramid. The error iterative solvers kill **slowest** is low-frequency (broad, smooth), and that is exactly what a **coarse grid** represents cheaply. **Down the V:** a few iterations on the fine grid clear the high-frequency error, then **restrict** (downsample, pyramid reduce) the remaining smooth residual to a coarser grid, where it is now high-frequency and quick to solve; recurse to the coarsest grid. **Up the V:** **prolong** (upsample, pyramid expand) each coarse correction back, add it to the finer guess, and do a few cleanup iterations. Each frequency band is solved on the grid where it is cheap, so total work is near-linear in pixel count. *(Deferred — to be generated.)*

4.2.4 The FFT solver: diagonalize when you can⧉

The fastest solver of all needs no iteration whatsoever — but it buys that speed with a restrictive assumption. Suppose the operator is a genuine convolution on a rectangular domain with periodic boundaries (or Neumann boundaries, which a cosine transform handles the same way). Then $A$ is circulant, and by big lesson L5 — convolution diagonalizes in Fourier, because the sine waves are its eigenvectors — applying $A$ becomes a per-frequency multiply by the kernel's frequency response $\hat k(\omega)$ (Linearity, Fourier, Aliasing and deblurring). A diagonal system is trivial to invert: divide. The entire normal-equation solve, including a Laplacian regularizer $L$ with weight $\lambda$ if we want one, collapses to a single pointwise division in the frequency domain,

$$ \hat x(\omega) = \frac{\hat y(\omega)\,\overline{\hat k(\omega)}}{\lvert \hat k(\omega)\rvert^2 + \lambda\,\lvert \hat L(\omega)\rvert^2}, $$

read back as: at each frequency, divide the data by the operator's response there, damped by the regularizer. The recipe is three steps and no loop: FFT the data, divide per frequency by the formula above, inverse-FFT the result (Figure 4.2.3). Two FFTs and a pointwise divide — for a megapixel image this is milliseconds, dramatically faster than any iterative solve. The catch is in the fine print of "circulant": the moment the boundaries are not periodic, the blur varies across the image (spatially-varying defocus, motion blur from a wandering camera path), or the operator is anything but a clean convolution, the Fourier diagonalization breaks and we fall back to preconditioned CG or multigrid. This one-shot Fourier solve is the engine behind fast non-blind deconvolution methods such as Krishnan & Fergus (2009), which accelerate the non-blind deblurring of Levin et al. (2007) precisely by pushing the heavy linear solve into the frequency domain.

fig-fft-solver — **Figure 4.2.3.** The FFT solver: deblurring as a per-frequency division. When the blur is a true convolution on a rectangular periodic (or Neumann) domain, the operator is **circulant**, hence **diagonal in Fourier** (L5) — its eigenvectors are the sine waves. **Three steps, no iteration:** (1) **FFT** the blurry data $y$; (2) at each frequency, **divide** by the kernel's response, regularized, $\hat x(\omega)=\hat y(\omega)\,\overline{\hat k(\omega)}/(\lvert\hat k(\omega)\rvert^2+\lambda\lvert\hat L(\omega)\rvert^2)$; (3) **inverse-FFT** back to the sharp image $x$. The denominator's $\lambda\lvert\hat L\rvert^2$ term keeps the divide from blowing up where $\hat k(\omega)$ is near zero (the high frequencies the blur destroyed). Fastest solver available — but only when the circulant assumption holds. *(Deferred — to be generated.)*

💡 Big lesson (L5)

Diagonalize when you can. A shift-invariant operator on a nice domain is a convolution, so it is diagonal in Fourier — its eigenvectors are the sine waves. A solve that would otherwise demand an iterative method then collapses to a per-frequency division: one FFT, a pointwise divide (regularized so near-zero frequencies do not explode), one inverse FFT. This is why both the naive inverse filter and the Wiener filter have one-line Fourier forms, and why the FFT solver is the fastest tool in this chapter whenever its circulant assumption holds. (L5 — first placed in Linearity, Fourier, Aliasing and deblurring; recurs here as the FFT solver.)

4.2.5 Which solver wins — and why this is the foundation of the part⧉

The three tools sort cleanly by how much structure the problem hands you, trading generality for speed:

Preconditioned CG — the safe default. Reach for it whenever you can apply $A$ and $A^\top$, which is essentially always. It does not care whether the operator is shift-invariant, tolerates awkward boundaries and spatially-varying blur, and a cheap preconditioner keeps the iteration count low. When in doubt, this is the answer.
Multigrid — when low-frequency error dominates and you have a pyramid. Smooth, large-scale problems (the Poisson and Laplace solves of Poisson image editing are the canonical case) are exactly where plain iteration stalls and where the V-cycle's coarse-to-fine structure shines, delivering near-linear scaling.
FFT — when the operator is shift-invariant with nice boundaries. The fastest by a wide margin, a no-iteration one-shot divide — but the most restrictive, valid only when the operator is genuinely circulant.

The deeper point is why this short chapter is the foundation of the whole part. Every optimization-based edit ahead follows the same recipe: write the data as $y = Ax$, minimize $\lVert Ax - y\rVert^2$ (plus a prior, later — see Advanced inverse problems, which only adds $+\lambda L$ to the operator and leaves the solvers untouched), and solve it matrix-free with one of these three methods. Gradient-domain HDR, Poisson cloning, colorization, and matting are not separate algorithms to learn from scratch; each is a new choice of $A$ and a new prior bolted onto this same spine. Learn the solver once, here, and you have learned the computational engine under most of the book.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson — matrix-free iterative solving

💡 Big lesson (L5)

4.2 Efficient solvers🔗⧉

4.2.1 Matrix-free, iterative, multiscale — the situation🔗⧉

4.2.2 Gradient descent and conjugate gradient (recap + preconditioning)🔗⧉

4.2.3 Multigrid: coarse-to-fine on a pyramid🔗⧉

4.2.4 The FFT solver: diagonalize when you can🔗⧉

4.2.5 Which solver wins — and why this is the foundation of the part🔗⧉

Big lessons of this chapter

4.2 Efficient solvers⧉

4.2.1 Matrix-free, iterative, multiscale — the situation⧉

4.2.2 Gradient descent and conjugate gradient (recap + preconditioning)⧉

4.2.3 Multigrid: coarse-to-fine on a pyramid⧉

4.2.4 The FFT solver: diagonalize when you can⧉

4.2.5 Which solver wins — and why this is the foundation of the part⧉