5.7 Guided image filtering

The cross/joint bilateral filter, from the previous chapter, taught one liberating move: the edges you preserve need not come from the image you are smoothing. Hand the filter a separate guidance image (a clean flash photo, a sharp color image, a coarse depth map's high-resolution companion) and it will borrow that guide's edges to steer the smoothing of something noisier or coarser. It is a powerful idea, but it inherits the bilateral's two standing complaints. The bilateral is slow when its spatial window is large, because its per-pixel cost grows with the window area (the bilateral grid fixed this, but at the price of quantizing the range axis). And near a strong edge the bilateral can produce gradient reversal: a faint, unnatural overshoot where a smooth ramp acquires a little kink, an artefact of the range weighting, which leaves a pixel close to an edge with a lopsided set of neighbours to average. Both complaints have the same root: the bilateral decides each output pixel by a weighted vote over its neighbours, and the votes are expensive to count and unreliable where few neighbours qualify.

The guided filter (He, Sun and Tang, 2010) keeps the cross-bilateral's guide but throws out the vote. It asks a different question of each little window of the image: what is the best straight-line relationship between the guide and the input here? Fit that line, apply it, and move on. Because the output is, locally, a straight line in the guide, it can only ever do what a straight line does, and a straight line through a step is still a step. The guide's edges therefore pass through by construction, with none of the bilateral's window-dependent cost and none of its reversal. This chapter builds that idea from the line up, shows why it is O(N) (linear in the number of pixels, independent of window size), reads the affinity back out of it, and then walks the list of places it has quietly become the default tool.

5.7.1 The key idea: the output is a local line in the guide

Here is the whole filter in one picture. Take a small window of the image — say a $9\times9$ patch. Inside it, look at the guidance image $I$ (the image whose edges we trust) and the input $p$ (the possibly-noisy image we want to clean). Plot, for every pixel in the window, the guide value $I$ against the input value $p$, one dot per pixel. Now fit a straight line through that cloud of dots: find the slope $a$ and intercept $b$ for which

$$ q = a\,I + b $$

best matches the input over the window, in the least-squares sense. The fitted line is the filter: the output $q$ at each pixel is whatever the line predicts from that pixel's guide value, $q = aI + b$.

Read this back in words: in each little window, explain the output as the best linear function of the guidance image. Two consequences follow immediately, and they are the entire reason the filter works.

First, edges are preserved. Suppose the window straddles an edge in the guide — the guide jumps from dark to bright across it. A straight line $q = aI + b$ maps that jump straight through: where $I$ steps, $q$ steps by exactly $a$ times as much. The output inherits the guide's edge. Crucially, it cannot do otherwise — there is no setting of two numbers $a$ and $b$ that turns a step in $I$ into a smooth ramp in $q$. Edge preservation is not enforced by a weight that we tuned; it is baked into the shape of the model. Take the gradient of both sides and you see it cleanly: $\nabla q = a\,\nabla I$, so wherever the guide has structure, the output carries a scaled copy of it, and nowhere else.

Second, flat regions get smoothed. Where the guide is nearly constant over the window, there is no edge for the line to lock onto; the slope $a$ collapses toward zero and the output becomes nearly $q \approx b$, a local average of the input. So in the interior of a smooth region the filter behaves like a plain blur — exactly what we want — while at the guide's edges it holds the line. One model, two behaviours, selected automatically by whether the guide has a gradient to track.

Figure 5.7.1. The guided filter is a local line in the guide. Left, a small window straddling an edge; for every pixel in it we plot the guidance value $I$ (horizontal) against the input value $p$ (vertical), one dot per pixel — the dots fall into two clusters, one per side of the edge. Centre, the least-squares line $q=aI+b$ fitted through the cloud: it runs from one cluster to the other, so a step in $I$ becomes a step in $q$ — the edge passes through. Right, a window inside a flat region: the guide barely varies, the dots pile up in a vertical smear, the best-fit slope $a\to 0$, and the output collapses to the local mean $b$ — i.e. a blur. The same fit smooths flats and keeps edges, with no range weight anywhere.

5.7.2 Fitting the line, and the one knob

What sets the slope $a$? Ordinary least squares over the window gives the textbook answer — the slope is the covariance of guide and input divided by the variance of the guide:

$$ a \;=\; \frac{\operatorname{cov}_w(I,p)}{\operatorname{var}_w(I) + \varepsilon}, \qquad b \;=\; \bar p_w - a\,\bar I_w, $$

where the subscript $w$ means "averaged over the window," $\bar I_w$ and $\bar p_w$ are the window means, and $\varepsilon$ is the filter's single knob, explained in a moment. The intercept $b$ is then just whatever makes the line pass through the window's centre of mass $(\bar I_w, \bar p_w)$. Read the slope back: how strongly does the input track the guide here, relative to how much the guide itself varies? Where guide and input rise and fall together (an edge they share), the covariance is large, the slope is steep, and the edge is carried through. Where the guide is flat, its variance is tiny and — but for $\varepsilon$ — the slope would be ill-defined; $\varepsilon$ is what keeps it tame.

That $\varepsilon$ is the regularizer, and it is the whole user-facing control. It penalizes a large slope, so when the guide's variance in a window is smaller than $\varepsilon$, the fit gives up on tracking the guide and lets $a\to 0$, smoothing instead. When the guide's variance is larger than $\varepsilon$ — a real edge — the slope survives and the edge is preserved. So $\varepsilon$ plays exactly the role the bilateral's range width $\sigma_r$ plays: it is the threshold separating "noise/texture, please smooth" from "edge, please keep." A guide fluctuation counts as an edge when its variance clears $\varepsilon$; everything quieter is treated as something to average away. (Like the bilateral, the filter runs on the values the eye sees, so we state the encoding before quoting an $\varepsilon$: an $\varepsilon$ chosen on gamma-encoded values means something different from one chosen on linear light.)
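To make the role of $\varepsilon$ concrete, here is a minimal sketch of the fit for a single window, in plain Python on a nine-sample 1-D window. The function name `fit_line` and the sample values are illustrative assumptions; the formulas are the two given above. One window straddles an edge in the guide, the other sits in a nearly flat region where the guide only jitters.

```python
def fit_line(I_win, p_win, eps):
    """Regularized least-squares fit of p ~ a*I + b over one window."""
    n = len(I_win)
    mean_I = sum(I_win) / n
    mean_p = sum(p_win) / n
    var_I = sum((x - mean_I) ** 2 for x in I_win) / n
    cov_Ip = sum((x - mean_I) * (y - mean_p)
                 for x, y in zip(I_win, p_win)) / n
    a = cov_Ip / (var_I + eps)   # slope: covariance over regularized variance
    b = mean_p - a * mean_I      # intercept: line passes through the window means
    return a, b

eps = 0.01  # hypothetical setting, for values scaled to [0, 1]

# Window straddling an edge: guide steps from 0 to 1, input follows with noise.
I_edge = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
p_edge = [0.05, -0.03, 0.02, -0.04, 1.03, 0.98, 1.01, 0.97, 1.02]

# Window in a flat region: guide variance is far below eps.
I_flat = [0.50, 0.51, 0.49, 0.50, 0.51, 0.49, 0.50, 0.50, 0.50]
p_flat = [0.55, 0.42, 0.51, 0.47, 0.58, 0.44, 0.52, 0.49, 0.53]

a_edge, b_edge = fit_line(I_edge, p_edge, eps)
a_flat, b_flat = fit_line(I_flat, p_flat, eps)
a_flat_unreg, _ = fit_line(I_flat, p_flat, 0.0)  # same window, no regularizer

print("edge window: a =", round(a_edge, 3), " b =", round(b_edge, 3))
print("flat window: a =", round(a_flat, 3), " b =", round(b_flat, 3))
print("flat window without eps: a =", round(a_flat_unreg, 3))
```

In the edge window the guide's variance dwarfs $\varepsilon$, so the slope stays close to 1. In the flat window the variance is far below $\varepsilon$, so the slope is pushed to nearly zero and $b$ lands on the window mean of the input. Setting $\varepsilon = 0$ on the same flat window lets the slope chase the jitter, which is exactly what the knob is there to prevent.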

One loose end: windows overlap. A given pixel lies inside many windows — one centred on each of its neighbours — and each of those windows fits its own line, so each makes its own prediction for that pixel. The guided filter resolves this in the simplest possible way: it averages the per-window coefficients. Compute $(a,b)$ for every window, then for each pixel set $\bar a$ and $\bar b$ to the average of the $a$'s and $b$'s of all windows covering it, and output

$$ q = \bar a\,I + \bar b. $$

Because a pixel's guide value $I_i$ is the same in every window that covers it, averaging the coefficients is the same as averaging the per-window predictions: the mean of $a_w I_i + b_w$ over those windows is exactly $\bar a\,I_i + \bar b$. What keeps the gradient relation $\nabla q \approx \bar a\,\nabla I$ intact is that $\bar a$ and $\bar b$ are themselves blurred fields: they vary slowly, so $\bar a$ acts like a locally constant scale on the guide's gradient, and the edge stays sharp. (This averaging is also, quietly, why the filter is so well-behaved at the edge itself: a pixel right on a strong edge sits in windows that mostly agree on a steep slope, so the edge is reinforced, not blurred.)

Figure 5.7.2. The guided-filter pipeline, all box filters. Inputs: guidance $I$ and input $p$ (often the same image for plain smoothing). (1) Per-window statistics: by sliding box (running) averages, compute the window means $\bar I_w,\bar p_w$, the variance $\operatorname{var}_w(I)$ and the covariance $\operatorname{cov}_w(I,p)$. (2) Fit the line: per window, $a=\operatorname{cov}_w(I,p)/(\operatorname{var}_w(I)+\varepsilon)$ and $b=\bar p_w-a\,\bar I_w$. (3) Average the coefficients: box-filter the $a$ and $b$ fields into $\bar a,\bar b$ to resolve overlapping windows. (4) Apply: $q=\bar a\,I+\bar b$. Steps (1) and (3) are means over a window and steps (2) and (4) are pointwise arithmetic, so the whole filter is a handful of box filters whose cost is independent of window size.
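The four steps can be sketched end to end on a 1-D signal. This is a dependency-free illustration under stated assumptions, not a reference implementation: the names `box_mean` and `guided_filter_1d` are hypothetical, windows are simply truncated at the signal ends, and the test signal (a clean step as guide, the same step plus alternating noise as input) is made up for the demonstration.

```python
def box_mean(x, r):
    """Mean of x over [i - r, i + r], truncated at the ends, via prefix sums."""
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    out = []
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out

def guided_filter_1d(I, p, r, eps):
    # (1) Per-window statistics, each one a box mean.
    mean_I = box_mean(I, r)
    mean_p = box_mean(p, r)
    mean_II = box_mean([x * x for x in I], r)
    mean_Ip = box_mean([x * y for x, y in zip(I, p)], r)
    var_I = [m2 - m * m for m2, m in zip(mean_II, mean_I)]
    cov_Ip = [mxy - mx * my for mxy, mx, my in zip(mean_Ip, mean_I, mean_p)]
    # (2) Fit the line in every window.
    a = [c / (v + eps) for c, v in zip(cov_Ip, var_I)]
    b = [mp - ak * mi for mp, ak, mi in zip(mean_p, a, mean_I)]
    # (3) Average the coefficients of all windows covering each pixel.
    a_bar = box_mean(a, r)
    b_bar = box_mean(b, r)
    # (4) Apply the averaged line to the guide.
    return [ab * x + bb for ab, x, bb in zip(a_bar, I, b_bar)]

# Guide: clean step. Input: the same step plus alternating +/-0.1 noise.
I = [0.0] * 20 + [1.0] * 20
p = [v + 0.1 * (-1) ** i for i, v in enumerate(I)]
q = guided_filter_1d(I, p, r=4, eps=0.01)

print("jump at the edge, input :", round(p[20] - p[19], 3))
print("jump at the edge, output:", round(q[20] - q[19], 3))
print("residual noise in the flat interior:", round(abs(q[10]), 4))
```

On this signal the step between samples 19 and 20 survives almost intact while the alternating noise in the flat interior is averaged down by roughly an order of magnitude. A 2-D version has the same shape, with each `box_mean` call replaced by a 2-D box filter.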

5.7.3 Why it is fast: O(N) regardless of window size

The bilateral's cost is its Achilles' heel: a brute-force bilateral over a window of radius $r$ does work proportional to the window area at every pixel, so widening the window makes it slower. The guided filter has no such dependence. Look back at the recipe: every quantity it needs — the means $\bar I_w, \bar p_w$, the variance, the covariance — is a local average over the window, and a local average is exactly a box filter. And a box filter, computed with a running sum (a sliding accumulator that adds the pixel entering the window and subtracts the one leaving), costs the same per pixel no matter how big the window is. (The variance and covariance need only the means of $I$, $p$, $I^2$, and $Ip$, each a box filter, combined pointwise — for instance $\operatorname{var}_w(I) = \overline{I^2}_w - \bar I_w^{\,2}$.)

So the entire filter is a fixed number of box filters plus some pointwise arithmetic, and its total cost is O(N) — linear in the number of pixels $N$, with a constant that does not grow with the window radius. This is the headline practical advantage. Where the bilateral forces a trade between window size and speed, the guided filter charges the same whether the window is $3\times3$ or $33\times33$, which is precisely why it became the go-to fast edge-preserving filter and a standard drop-in replacement for the bilateral when speed matters.
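The sliding accumulator is short enough to write out. The sketch below (1-D, plain Python, hypothetical function names, windows truncated at the ends) computes a box sum with one addition and one subtraction per sample and checks it against the brute-force sum that revisits the whole window every time.

```python
def box_sum_running(x, r):
    """Box sum over [i - r, i + r] with a sliding accumulator."""
    n = len(x)
    out = [0.0] * n
    acc = sum(x[:r + 1])            # window for i = 0 covers indices 0..r
    out[0] = acc
    for i in range(1, n):
        if i + r < n:
            acc += x[i + r]         # sample entering on the right
        if i - r - 1 >= 0:
            acc -= x[i - r - 1]     # sample leaving on the left
        out[i] = acc
    return out

def box_sum_brute(x, r):
    """Same box sum, recomputed from scratch at every sample."""
    n = len(x)
    return [sum(x[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]

x = [((7 * i) % 11) / 10 for i in range(50)]   # arbitrary test signal

max_diff = {}
for r in (1, 4, 16):
    fast = box_sum_running(x, r)
    slow = box_sum_brute(x, r)
    max_diff[r] = max(abs(f - s) for f, s in zip(fast, slow))
    print("radius", r, "max difference", max_diff[r])
```

The loop body of `box_sum_running` does the same amount of work for every radius, whereas `box_sum_brute` touches up to $2r+1$ samples per output. In 2-D the same trick is applied along rows and then along columns, or replaced by a summed-area table.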

5.7.4 No gradient reversal

The bilateral's other failing is subtler and worth naming precisely. Gradient reversal is an artefact in which a smooth ramp, after filtering, comes back with its slope locally reversed: a little overshoot or undershoot that creates a faint false edge where the input had none. It happens near strong edges, and its cause is the bilateral's range weighting: close to an edge, a pixel's neighbourhood is dominated by samples from one side, the weighted average pulls toward that side, and the reconstructed detail can tip the wrong way. Sharpening or detail-boosting on top of a bilateral base layer makes these reversals plainly visible as halo-like rings and ridges.

The guided filter is free of gradient reversal, and the reason is the linear model again. The output obeys $\nabla q \approx \bar a\,\nabla I$ with $\bar a$ a slowly varying, smoothly blurred field, so the output's gradient is just a gently rescaled copy of the guide's gradient. In the self-guided case that detail enhancement uses ($I = p$), the covariance in the slope formula is the variance itself, so every $a = \operatorname{var}_w(I)/(\operatorname{var}_w(I)+\varepsilon)$ lies between 0 and 1, and so does the average $\bar a$. The gradient can shrink, but it cannot change sign relative to the guide: there is no mechanism for the slope to flip. A monotone ramp in the guide produces a monotone ramp in the output. No range bins to quantize means no quantization artefact, and the structural linearity means no overshoot. This is the second reason the guided filter is preferred for detail enhancement and tone work: you can push the detail layer hard without summoning the rings.
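A quick numerical check of the sign argument, for the self-guided case where the guide is the input itself ($I = p$). There the covariance in the slope formula equals the variance, so each window's slope is $\operatorname{var}_w(I)/(\operatorname{var}_w(I)+\varepsilon)$. The sketch below (plain Python, 1-D, hypothetical function name and test signal) computes that slope for every window of a ramp followed by a step.

```python
def self_guided_slopes(I, r, eps):
    """Per-window slope a = var / (var + eps) when the guide is the input."""
    n = len(I)
    slopes = []
    for k in range(n):
        win = I[max(0, k - r):min(n, k + r + 1)]   # truncated at the ends
        mu = sum(win) / len(win)
        var = sum((v - mu) ** 2 for v in win) / len(win)
        slopes.append(var / (var + eps))
    return slopes

# A gentle ramp, then a jump to a flat plateau.
I = [0.02 * i for i in range(20)] + [1.0] * 20
slopes = self_guided_slopes(I, r=4, eps=0.01)

print("smallest slope:", min(slopes))
print("largest slope :", round(max(slopes), 3))
```

Every slope is non-negative and strictly below 1: near zero on the plateau, close to 1 in windows that straddle the jump. Their box average $\bar a$ inherits the same bounds, which is the property the sign argument relies on. With a separate guide the slope is no longer constrained this way, so this check covers the self-guided case only.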

5.7.5 Reading the affinity back out

We have not mentioned an affinity once — no range weight, no patch distance, no smoothness energy — and yet the guided filter belongs squarely in the affinity family. To see it, ask the implicit question: in computing the output at pixel $i$, how much does the input at some other pixel $j$ contribute? Expand the algebra (substitute the line fit into the coefficient average and collect terms) and the guided filter turns out to be, like every method in this chapter, a weighted average of the input, $q_i = \sum_j W_{ij}\,p_j$, with a specific weight kernel

$$ W_{ij} \;\propto\; \sum_{w \ni i,j} \Bigl(1 + \frac{(I_i-\bar I_w)(I_j-\bar I_w)}{\operatorname{var}_w(I)+\varepsilon}\Bigr), $$

summed over windows $w$ containing both pixels. The shape of this weight is the affinity. Two pixels that fall on the same side of an edge have guide values that deviate from the window mean in the same direction, so the product $(I_i-\bar I_w)(I_j-\bar I_w)$ is positive and their weight is large — they belong together. Two pixels on opposite sides of an edge deviate in opposite directions, the product is negative, the weight shrinks toward zero — they are kept apart. That is the L4 affinity exactly: high weight for pixels of the same surface, low weight across an edge, read off the guide's structure. The bilateral wrote this affinity as an explicit range Gaussian; the guided filter dissolves it into the arithmetic of a local linear fit, where you can no longer point at it — but it is doing the same job.
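The kernel can be tabulated directly for a small 1-D guide to see the affinity it encodes. The sketch below is an illustration under assumptions: the function name is hypothetical, windows are truncated at the ends, and the normalization (divide each window's term by the window size, then by the number of windows covering pixel $i$) is the one that makes each row sum to 1, consistent with the proportionality above.

```python
def guided_kernel_1d(I, r, eps):
    """Weights W[i][j] such that q_i = sum_j W[i][j] * p_j for guide I."""
    n = len(I)
    W = [[0.0] * n for _ in range(n)]
    cover = [0] * n                      # number of windows covering each pixel
    for k in range(n):                   # one window centred on each pixel k
        idx = range(max(0, k - r), min(n, k + r + 1))
        m = len(idx)
        mu = sum(I[j] for j in idx) / m
        var = sum((I[j] - mu) ** 2 for j in idx) / m
        for i in idx:
            cover[i] += 1
            for j in idx:
                term = 1.0 + (I[i] - mu) * (I[j] - mu) / (var + eps)
                W[i][j] += term / m
    return [[w / cover[i] for w in W[i]] for i in range(n)]

# Guide with a step between samples 7 and 8.
I = [0.0] * 8 + [1.0] * 8
W = guided_kernel_1d(I, r=2, eps=0.01)

same_side = W[7][6]     # neighbour at distance 1, same side of the edge
across_edge = W[7][8]   # neighbour at distance 1, other side of the edge
print("weight to same-side neighbour  :", round(same_side, 4))
print("weight to across-edge neighbour:", round(across_edge, 4))
print("row sum for pixel 7:", round(sum(W[7]), 6))
```

Pixel 7 sits right at the edge. Its two nearest neighbours are equally far away, yet the one on the same side receives a weight many times larger than the one across the step, and each row still sums to 1, so the output is a genuine weighted average of the input.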

This is the cleanest stopping point for the chapter's through-line. The affinity opened the part as the bilateral's explicit range weight; method by method — bilateral, then the grid, then a learned grid, then regression, then patches, then optimization — it migrated into different mechanisms; here it has gone fully implicit, hidden inside the per-window line. Same family, last disguise.

5.7.6 Where it is used

The guided filter's combination of edge preservation, O(N) speed, freedom from reversal, and one more property — it is a smooth, differentiable chain of local averages and pointwise divisions — has made it a default building block in a surprising number of places.

A closing tie to the filter it descends from. The cross/joint bilateral of Bilateral filtering asked "how similar are these two pixels in the guide?" and voted accordingly; the guided filter asks "what line through the guide explains the input here?" and lets the line carry the edges. Same goal, same affinity underneath, but the guided filter trades the vote for a fit, and that one trade buys O(N) speed, no gradient reversal, and a differentiable operator. It is where the affinity story, having spent itself six ways, comes quietly to rest.