3.9 Linear pyramids and wavelets⧉
Picture this problem. You have a photograph of Earth and a photograph of Jupiter, and you want to join the left half of one to the right half of the other so smoothly that nobody can find the seam. Try the obvious thing — cut both down the middle and paste — and a hard line runs down the join: every detail, coarse and fine, changes at one pixel. Try to hide that line by blurring across the boundary, and now you have smeared Earth's coastlines into Jupiter's cloud bands; the broad colors finally cross-fade, but the fine texture has turned to mush. Neither works because each uses a single scale of blending. What the picture actually needs is to blend gently where it is smooth — the slow sweep of color across the two disks — and sharply where it has detail — the cloud bands, the coastlines — at the same time. A single blur cannot do that, because a blur has one width. The moment we stop treating the image as one thing and start treating it as a stack of things at different scales, the problem dissolves.
That stack is what this chapter is about. We will build a multiresolution representation of an image — a tower of versions running coarse to fine — and then learn to split it into independent bands of detail that we can edit one at a time. The payoff is a representation that is local in both space and frequency, the very combination the Fourier chapter told us we could not get for free. Most of the chapter is spent on the two classic constructions, the Gaussian pyramid and the Laplacian pyramid of Burt and Adelson; we close with their principled relatives, wavelets and steerable pyramids.
3.9.1 Halfway between space and frequency⧉
It is worth naming the problem up front, because it is the same one the whole back half of basic image processing keeps circling. So far we have two ways to represent an image, and each is perfect at one thing and useless at the other. In the primal domain a pixel is a value at a place: perfectly local in space, but with no notion of scale — it cannot tell you whether you are looking at a broad gradient or a fine edge. In the Fourier domain a coefficient is the amount of one sine wave: perfectly local in frequency, telling you exactly which scales are present, but each coefficient is spread over the entire image, with no notion of where.
Fourier paid off handsomely for convolution — it diagonalized it — but it came with two catches we already met. It assumes the image wraps around (it is cyclic, a torus), which is a lie about the boundary; and it is global, since a perfectly shift-invariant basis function has to have support everywhere. A great deal of practical image processing is really the search for a middle ground between Fourier's beautiful frequency structure and the wish to stay local in space. Bicubic resampling kernels were one such compromise; JPEG's 8×8 blocks were another. Pyramids and wavelets are the cleanest compromise of the lot.
The Fourier chapter's lesson was diagonalize when you can — work in the basis where your operator is simple. This chapter adds the practical sequel: the perfect basis is rarely the useful one. Fourier is perfectly localized in frequency and not at all in space; the primal grid is the reverse. The uncertainty principle (compact in space ⇔ spread in frequency, from the Fourier chapter) says you cannot have both perfectly. A pyramid surrenders a little of each — it chops frequency into a few coarse bands and keeps each band laid out spatially — and that modest sacrifice is exactly what makes per-scale, per-region editing possible. Reach for a representation that is local in both, not for the one that is perfect in one.
3.9.2 The idea: process an image at many resolutions⧉
The core idea, due to Peter Burt and Edward Adelson in the early 1980s, is almost embarrassingly simple: represent the image at several resolutions at once, and handle each scale on its own. A coarse version holds the broad layout — the overall light, the large shapes. A fine version holds the texture and the edges. Once you can reach into each scale independently, a startling range of operations falls out nearly for free: blending (Earth and Jupiter), denoising, tone mapping, compression, feature analysis, coarse-to-fine motion estimation, and — much later in the book — the pooling layers and U-Nets of deep networks. We will preview several of these at the end and develop them properly in their own chapters; the point for now is that they are all the same trick wearing different hats.
There are two flavors of pyramid, and the difference maps cleanly onto frequency. A low-pass pyramid stores, at each level, a lower-resolution copy of the whole image — that is the Gaussian pyramid. A band-pass pyramid stores, at each level, only a narrow band of frequencies — the detail one scale adds over the next coarser one — and that is the Laplacian pyramid. We build the low-pass one first, because the band-pass one is simply the differences between its levels.
3.9.3 The Gaussian pyramid⧉
Begin with the original image. Blur it with a small Gaussian, then throw away every other pixel in each direction, so the result is half as wide and half as tall — a quarter of the pixels. Do the same to that image, and again, and again, until you are down to a handful of pixels. The sequence of ever-smaller, ever-coarser images is the Gaussian pyramid, and stacking them smallest-on-top is what earns it the name (Figure 3.9.1). We write the levels $G_0$ (the original, full resolution), $G_1$, $G_2$, and so on, each $G_{k+1}$ one octave coarser than $G_k$.
The blur-then-downsample step has a name, the reduce operator: $G_{k+1} = \text{reduce}(G_k)$, where reduce means "blur with a small Gaussian, then drop every other sample." The blur is not optional. We are halving the sampling rate, and the Fourier chapter was emphatic that you must low-pass before you downsample, or you alias — moiré and jaggies folded in from the frequencies the coarse grid cannot represent. The Gaussian is the prefilter; downsampling by two is the resample. So a pyramid level is not "the image, smaller," it is "the image with its top octave of frequencies properly removed, then sampled at the rate that octave-removed image deserves."
We will also need the opposite motion, going from a small image to a larger one. Call it expand: upsample by two (insert a zero between every pair of samples) and blur to fill in the gaps. Expand does not recover what reduce discarded — the fine detail is gone for good — but it does produce a smooth, full-size image that agrees with the coarse one. Reduce shrinks, expand grows; they are approximate inverses, and the gap between them is precisely the detail we are about to harvest.
You might worry that storing the image at every scale balloons the memory. It barely does. Each level has a quarter of the pixels of the one below — $1 + \tfrac14 + \tfrac1{16} + \tfrac1{64} + \cdots$ — and that geometric series sums to 4/3. The entire Gaussian pyramid, from full resolution down to a single pixel, costs only about a third more storage than the original image. Coarse-to-fine search, scale-space analysis, and everything else in this chapter ride on that bargain.
3.9.4 The Laplacian pyramid⧉
The Gaussian pyramid is useful but redundant in an unsatisfying way: each level repeats most of the information already in the level below, just blurrier. The interesting question is not "what does the image look like coarse" but "what does each scale add." That is what the Laplacian pyramid captures, and it is built straight out of the Gaussian one.
Take a Gaussian level $G_k$. The next coarser level $G_{k+1}$ is a blurred, shrunk version of it; expand it back up to $G_k$'s size and you get a smooth prediction of $G_k$ — everything except the finest octave of detail, which the blur removed. Whatever $G_k$ has that this prediction lacks is exactly the band of detail that this scale contributes. So subtract:
That difference image, $L_k$, is the Laplacian pyramid level $k$ (Figure 3.9.2). It is band-pass: it holds neither the broad low frequencies (those survived into $G_{k+1}$ and got subtracted away) nor any frequencies too fine for $G_k$ to represent — just the octave in between. Read the equation back in words: each Laplacian level is what one Gaussian level has that the next coarser one, blown back up, is missing. The coarsest level of the whole structure is special — there is no coarser Gaussian to subtract — so we keep the smallest Gaussian image $G_n$ itself as the apex, a tiny low-pass residual carrying the average brightness and the very largest-scale variation.
The name is a small piece of history worth a line. Expanding $G_{k+1}$ and subtracting it from $G_k$ is, up to scaling, a difference of Gaussians — one blur minus a slightly bigger blur — and a difference of Gaussians is a well-known approximation to the Laplacian operator, the thing that measures how much a pixel differs from the average of its neighbors. So each Laplacian coefficient behaves like a local second derivative: near zero in flat regions, large at edges and texture. Hence "Laplacian pyramid."
3.9.5 Reconstruction: an exact encoder and decoder⧉
Here is the property that turns the Laplacian pyramid from a curiosity into a workhorse: it loses nothing. We can put the image back together exactly. Just rearrange the construction equation. We defined $L_k = G_k - \text{expand}(G_{k+1})$, so
Read this as a recipe that climbs from coarse to fine. Start at the apex with the residual $G_n$. Expand it and add the band-pass level $L_{n-1}$ to recover $G_{n-1}$. Expand that and add $L_{n-2}$ to recover $G_{n-2}$. Keep going one level at a time, and the last addition hands you back $G_0$ — the original image, pixel for pixel. The build (analysis) and collapse (synthesis) are the two loops, one the exact inverse of the other:
# analysis: build the Laplacian pyramid G[0] = im for k in 0 .. n-1: G[k+1] = reduce(G[k]) # blur, then downsample by 2 L[k] = G[k] - expand(G[k+1]) # band-pass detail at scale k # L[0..n-1] plus the residual G[n] are the pyramid # synthesis: collapse it back, coarse to fine out = G[n] # the low-pass residual for k in n-1 .. 0: out = L[k] + expand(out) # undo each subtraction return out # equals the original, up to round-off
Nothing was approximated; every subtraction in the analysis is undone by an addition in the synthesis (Figure 3.9.4).
This is the encoder–decoder framing, and it is the right way to hold the whole construction in your head. Building the Laplacian pyramid is the encoder (or analysis): it takes the image apart into independent scale bands. Reconstructing is the decoder (or synthesis): it puts them back. The two are exact inverses, which is precisely what licenses the central move of this chapter — edit the bands in the middle. Because each $L_k$ is an independent band of frequencies, you can darken one, amplify another, zero a third, then run the decoder, and you get a coherent image back with no seams between scales. Blending, coring, detail enhancement, and tone mapping are all just different things to do to the bands before re-synthesizing.
Seeing it run on real pixels makes the "loses nothing" claim concrete. Start from the coarse residual — a tiny, blurry image with only the broadest tone — and add the bands back one octave at a time: the picture sharpens from broad light, to coarse structure, to fine edges and texture, until the last band hands back the original. Subtract that reconstruction from the original and the difference is flat black — exactly zero, up to floating-point round-off (Figure 3.9.4a).
3.9.6 Each level is a frequency band⧉
The frequency picture ties the pyramid back to Fourier and is worth making explicit. Sort the Fourier plane into concentric rings by frequency: low frequencies near the center, high frequencies toward the edges. The Gaussian residual $G_n$ is the small central disk. Each Laplacian level $L_k$ is one annulus — one ring — of that plane: a band of frequencies roughly an octave wide, with the finest band $L_0$ the outer ring and coarser bands stepping inward (Figure 3.9.5). The bands tile the frequency plane; stacked, they cover everything, which is just the reconstruction property seen from the Fourier side.
But — and this is the entire point — within each band the information is still laid out spatially. The Laplacian level is an image: it tells you not only which frequencies are present but roughly where in the picture they live. That is the local-in-both property we have been chasing. The cost is that the frequency bands are crude — a handful of octave-wide rings rather than Fourier's per-frequency precision — and the price of spatial locality is some smearing of frequency, exactly as the uncertainty principle demands. A Laplacian pyramid is a deliberately coarse carving of frequency into bands, with the position within each band left spatial: a sensible middle of the road between the primal grid and Fourier.
3.9.7 What the bands look like on a real image⧉
Two properties of the Laplacian pyramid on natural photographs are worth naming, because they are why it works so well in practice. First, the bands are decorrelated: the broad illumination lives in the residual, the edges in one band, the fine texture in another, and they do not duplicate one another the way Gaussian levels do. Second, and more striking, the band-pass levels are sparse. Look at a Laplacian level and most of it is flat grey — exactly zero — because most of a photograph is smooth: skies, walls, skin, out-of-focus background. Only at edges and texture do the coefficients light up. A local second derivative is zero wherever the image is locally linear, and most of an image is locally linear.
That sparsity is the engine behind two whole families of applications. It is why pyramids compress well — a representation that is mostly zeros is cheap to store (this is essentially what JPEG 2000 does with its wavelet cousin). And it is why they denoise well: noise is spread evenly across all the coefficients, but the signal is concentrated in the few large ones, so a rule as crude as "zero out the small coefficients" throws away mostly noise and keeps mostly edges. We return to that idea, coring, at the end of the chapter.
3.9.8 Honest limitations⧉
The Laplacian pyramid is a wonderful default, not a perfect basis, and it pays to be blunt about its shortcomings. It is overcomplete: with that 4/3-size Gaussian pyramid behind it, the Laplacian pyramid stores more numbers than the original image had pixels. It is not a tight, minimal basis the way Fourier is — it is a redundant frame, fine for editing (the redundancy buys robustness) but wasteful for compression, where you would rather have one coefficient per pixel. It is only approximately shift-invariant: nudge the image by a pixel and the coefficients shuffle a little, because of the downsampling. And it is isotropic — it has no notion of direction. A horizontal edge and a vertical edge of the same scale land in the same band, indistinguishable; the pyramid sees scale but is blind to orientation. Each of these limitations is the motivation for one of its relatives.
3.9.9 Cousins: wavelets and steerable pyramids⧉
The Laplacian pyramid was the first of a family, and its siblings each fix one of those limitations.
Wavelets are the principled, mathematically tight version. Where the Laplacian pyramid is overcomplete, an orthogonal wavelet transform is a true basis: exactly one coefficient per pixel, no redundancy, perfectly invertible. Like the pyramid, each wavelet is localized in both space and frequency — that is the whole reason wavelets exist, the same compromise made with cleaner mathematics. The trade is that wavelets are less Fourier-friendly and, in their basic orthogonal form, neither shift- nor rotation-invariant — shift the image and the coefficients change in awkward ways. Their tightness makes them the natural choice for compression, which is exactly where they ended up: JPEG 2000 is built on a wavelet transform. For our intuition-first purposes the takeaway is simple — a wavelet is what you get when you insist the multiresolution decomposition be a genuine, non-redundant basis, and you pay for that insistence with messier behavior under shifts.
It helps to see the three representations on one axis. Pure Fourier is all frequency, no space: each basis function is a sine wave filling the image. The pure primal grid is all space, no frequency: each basis function is a single pixel. Wavelets and pyramids sit in between — each basis function is a little bump, somewhat localized in space and covering a band of frequencies — and you can even dial where they sit, trading spatial sharpness for frequency sharpness. This is the same uncertainty tradeoff from the Fourier chapter, now offered as a design knob rather than a fixed pair of extremes.
Steerable pyramids (Simoncelli and Freeman, 1995) go the other way and add back what the Laplacian pyramid lacks: a notion of orientation. They split each frequency band further into a few oriented sub-bands — horizontal, vertical, diagonal — so the representation captures not just the scale of a feature but its direction, and does so in a way that is very nicely shift- and rotation-invariant. The price is even more overcompleteness. They are the tool of choice when orientation matters: texture analysis, and the phase-based video magnification we will meet later. (There is also multigrid, a coarse-to-fine relative from numerical analysis used to solve large linear systems fast, which we mention only in passing.)
The map, then: the Laplacian pyramid is the friendly, overcomplete, isotropic default; wavelets make it tight and minimal at the cost of nice invariances; steerable pyramids make it oriented and beautifully invariant at the cost of even more redundancy. Different points on the same space-versus-frequency, basis-versus-frame tradeoff.
3.9.10 What pyramids are good for⧉
We have built the representation; the rest of the book spends it. A quick tour of the applications — each developed properly in its own chapter — shows how much mileage one decomposition gives.
Multiscale blending is the Earth and Jupiter we opened with, and it is Burt and Adelson's original payoff. The recipe is elegant: build a Laplacian pyramid of each source image and a Gaussian pyramid of the mask that says where to switch from one to the other; then, level by level, mix the two source bands using the mask at that scale,
and finally run the decoder on the blended pyramid (Figure 3.9.6). The crucial detail is that the mask gets its own Gaussian pyramid, not a Laplacian one: $\text{mask}_k$ is the mask blurred and shrunk to level $k$, so it is broad and soft at the coarse levels and crisp at the fine ones. The transition is therefore wide for the low frequencies and narrow for the high ones — a gentle cross-fade of the broad colors, a crisp switch of the fine texture. That is exactly the "blend smoothly where smooth, sharply where detailed" we wanted and could not get from a single blur. The full treatment is in the compositing and panorama chapters.
Because this blend runs over the values band by band, it keeps each source's own brightness — its DC — and merely feathers the two across the mask. That is exactly what sets it apart from the gradient-domain Poisson blend, which discards the source's level and re-derives it from the boundary; the two are placed side by side, including the revealing case of pasting a constant, in Poisson image editing → Poisson vs. pyramid blending.
Exposure fusion and high dynamic range (HDR) tone mapping are the same blend in disguise: instead of two unrelated pictures, you merge several differently-exposed shots of one scene, taking each pixel from whichever exposure renders it best and stitching the choices together across scales with a Laplacian pyramid so there are no seams. This is the heart of HDR+ on cell phones; the full story is in the HDR chapter.
Pyramid denoising, or coring, cashes in the sparsity we noted earlier. Build the Laplacian pyramid, then clamp the small coefficients to exactly zero and keep the large ones (Figure 3.9.7). In smooth regions the band-pass coefficients are tiny and mostly noise, so zeroing them cleans up the flats; at strong edges the coefficients are large, so they survive untouched and the structure is preserved. It is the crudest possible edge-preserving denoiser, and it works because of the statistics of natural images — most coefficients really are near zero. Fancier versions (Simoncelli and Adelson) replace the hard clamp with a Bayesian estimate that uses the noise level and a natural-image prior, but the intuition is the same. More in the denoising chapter.
Local Laplacian filtering (Paris, Aubry and colleagues, 2011–2015) is the modern, edge-aware extreme: it manipulates the Laplacian coefficients with a per-pixel nonlinearity to boost detail or compress range without the halos a naive high-frequency boost produces. It has replaced bilateral filtering inside Adobe's products. We develop it in the edge-preserving part; here it stands as proof that this 1980s representation is still doing frontline work.
3.9.11 Multiresolution as a recurring theme⧉
Step back and the pyramid stops looking like one chapter's tool and starts looking like a recurring idea the whole book leans on. Coarse-to-fine is how we will estimate optical flow: motion is too large to track at full resolution, so you find it on a tiny coarse image where the motion is small, then refine it level by level down the pyramid. Scale-space is how feature detectors like the scale-invariant feature transform (SIFT) find points that look the same regardless of how close the camera was — by searching across the levels of a Gaussian pyramid. And in deep learning, the pooling layers that shrink a feature map and the U-Net that goes coarse-to-fine-and-back are pyramids by another name; the U-Net in particular, with its downsampling encoder and upsampling decoder, is the exact shape of the analysis/synthesis structure we built here, and it is the workhorse behind modern denoising diffusion models. The same $1 + \tfrac14 + \tfrac1{16} + \cdots$ bargain, the same coarse-to-fine flow of information, shows up again and again. Learn it once, here, and you will recognize it everywhere.
Big lessons of this chapter
The recurring principles from this chapter, gathered for review.
The Fourier chapter's lesson was diagonalize when you can — work in the basis where your operator is simple. This chapter adds the practical sequel: the perfect basis is rarely the useful one. Fourier is perfectly localized in frequency and not at all in space; the primal grid is the reverse. The uncertainty principle (compact in space ⇔ spread in frequency, from the Fourier chapter) says you cannot have both perfectly. A pyramid surrenders a little of each — it chops frequency into a few coarse bands and keeps each band laid out spatially — and that modest sacrifice is exactly what makes per-scale, per-region editing possible. Reach for a representation that is local in both, not for the one that is perfect in one.