5.7 Guided image filtering
The cross/joint bilateral filter, from the previous chapter, taught one liberating move: the edges you preserve need not come from the image you are smoothing. Hand the filter a separate guidance image (a clean flash photo, a sharp color image, the high-resolution companion of a coarse depth map) and it will borrow that guide's edges to steer the smoothing of something noisier or coarser. It is a powerful idea, but it inherits the bilateral's two standing complaints. The bilateral is slow when its spatial window is large, because its cost grows with the window area (the bilateral grid fixed this, but at the price of quantizing the range axis). And near a strong edge the bilateral can produce gradient reversal: a faint, unnatural overshoot where a smooth ramp acquires a little kink, an artefact of the range-weighted average itself. Both complaints have the same root: the bilateral decides each output pixel by a weighted vote over its neighbours, and the votes are expensive to count and unreliable near an edge, where one side dominates the tally.
The guided filter (He, Sun and Tang 2010) keeps the cross-bilateral's guide but throws out the vote. It asks a different question of each little window of the image: what is the best straight-line relationship between the guide and the input here? Fit that line, apply it, and move on. Because the output is, locally, a straight line in the guide, it can only ever do what a straight line does, and a straight line through a step is still a step. The guide's edges therefore pass through by construction, without the bilateral's window-dependent cost and without its reversal. This chapter builds that idea from the line up, shows why it is O(N) (linear in the number of pixels, independent of window size), reads the affinity back out of it, and then walks the list of places it has quietly become the default tool.
5.7.1 The key idea: the output is a local line in the guide
Here is the whole filter in one picture. Take a small window of the image, say a $9\times9$ patch. Inside it, look at the guidance image $I$ (the image whose edges we trust) and the input $p$ (the possibly-noisy image we want to clean). Plot, for every pixel in the window, the guide value $I$ against the input value $p$, one dot per pixel. Now fit a straight line through that cloud of dots: find the slope $a$ and intercept $b$ for which
$$a\,I + b$$
best matches the input over the window, in the least-squares sense. The fitted line is the filter: the output $q$ at each pixel is whatever the line predicts from that pixel's guide value, $q = aI + b$.
Read this back in words: in each little window, explain the output as the best linear function of the guidance image. Two consequences follow immediately, and they are the entire reason the filter works.
First, edges are preserved. Suppose the window straddles an edge in the guide — the guide jumps from dark to bright across it. A straight line $q = aI + b$ maps that jump straight through: where $I$ steps, $q$ steps by exactly $a$ times as much. The output inherits the guide's edge. Crucially, it cannot do otherwise — there is no setting of two numbers $a$ and $b$ that turns a step in $I$ into a smooth ramp in $q$. Edge preservation is not enforced by a weight that we tuned; it is baked into the shape of the model. Take the gradient of both sides and you see it cleanly: $\nabla q = a\,\nabla I$, so wherever the guide has structure, the output carries a scaled copy of it, and nowhere else.
Second, flat regions get smoothed. Where the guide is nearly constant over the window, there is no edge for the line to lock onto; the slope $a$ collapses toward zero and the output becomes nearly $q \approx b$, a local average of the input. So in the interior of a smooth region the filter behaves like a plain blur — exactly what we want — while at the guide's edges it holds the line. One model, two behaviours, selected automatically by whether the guide has a gradient to track.
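The edge-preservation argument can be checked on a single window. The sketch below is only an illustration under assumed values: the nine-pixel window, the step heights 0.2 and 0.8, the noise level and the variable names are all made up here, and the fit uses plain least squares with no regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# One 9-pixel window straddling an edge: the guide steps from 0.2 to 0.8,
# and the input is a noisy copy of that step.
guide = np.array([0.2] * 5 + [0.8] * 4)
noisy_input = guide + rng.normal(0.0, 0.05, guide.size)

# Least-squares line through the (guide, input) dots.
# np.polyfit returns the highest-degree coefficient first: slope, then intercept.
a, b = np.polyfit(guide, noisy_input, deg=1)

# The filter output inside this window is the line's prediction.
q = a * guide + b

print("slope a:", a, "intercept b:", b)
print("output step:", q[-1] - q[0], "= a * guide step:", a * (guide[-1] - guide[0]))
```

Because the guide takes only two values in this window, the fitted line can only output two values: the noise on each side is averaged away while the step between the sides survives, scaled by $a$.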
5.7.2 Fitting the line, and the one knob
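What sets the slope $a$? Ordinary least squares over the window gives the textbook answer, the covariance of guide and input divided by the variance of the guide:
$$a = \frac{\operatorname{cov}_w(I, p)}{\operatorname{var}_w(I) + \varepsilon} = \frac{\overline{Ip}_w - \bar I_w\,\bar p_w}{\operatorname{var}_w(I) + \varepsilon},$$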
where the subscript $w$ means "averaged over the window," $\bar I_w$ and $\bar p_w$ are the window means, and $\varepsilon$ is the filter's single knob, explained in a moment. The intercept $b$ is then just whatever makes the line pass through the window's centre of mass $(\bar I_w, \bar p_w)$. Read the slope back: how strongly does the input track the guide here, relative to how much the guide itself varies? Where guide and input rise and fall together (an edge they share), the covariance is large, the slope is steep, and the edge is carried through. Where the guide is flat, its variance is tiny and — but for $\varepsilon$ — the slope would be ill-defined; $\varepsilon$ is what keeps it tame.
That $\varepsilon$ is the regularizer, and it is the whole user-facing control. It penalizes a large slope, so when the guide's variance in a window is smaller than $\varepsilon$, the fit gives up on tracking the guide and lets $a\to 0$, smoothing instead. When the guide's variance is larger than $\varepsilon$ — a real edge — the slope survives and the edge is preserved. So $\varepsilon$ plays exactly the role the bilateral's range width $\sigma_r$ plays: it is the threshold separating "noise/texture, please smooth" from "edge, please keep." A guide fluctuation counts as an edge when its variance clears $\varepsilon$; everything quieter is treated as something to average away. (Like the bilateral, the filter runs on the values the eye sees, so we state the encoding before quoting an $\varepsilon$: an $\varepsilon$ chosen on gamma-encoded values means something different from one chosen on linear light.)
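A worked pair of numbers shows the threshold behaviour. This assumes the self-guided case $I = p$ (so the covariance equals the variance), intensities in $[0,1]$, and an illustrative $\varepsilon = 0.01$; none of these values come from the text.

$$
a = \frac{\operatorname{var}_w(I)}{\operatorname{var}_w(I) + \varepsilon}
$$

$$
\text{edge window (half at } 0.2\text{, half at } 0.8\text{):}\quad \operatorname{var}_w(I) = 0.3^2 = 0.09, \qquad a = \frac{0.09}{0.09 + 0.01} = 0.9
$$

$$
\text{flat window (noise of standard deviation } 0.02\text{):}\quad \operatorname{var}_w(I) = 0.0004, \qquad a = \frac{0.0004}{0.0004 + 0.01} \approx 0.04
$$

The edge window keeps nine tenths of the guide's step; the flat window keeps almost none of its fluctuation and outputs close to the window mean.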
One loose end: windows overlap. A given pixel lies inside many windows, one centred on each of its neighbours, and each of those windows fits its own line, so each makes its own prediction for that pixel. The guided filter resolves this in the simplest possible way: it averages the per-window coefficients. Compute $(a,b)$ for every window, then for each pixel set $\bar a$ and $\bar b$ to the average of the $a$'s and $b$'s of all windows covering it, and output
$$q = \bar a\,I + \bar b.$$
Averaging the coefficients rather than the predictions is what keeps the gradient relation $\nabla q \approx \bar a\,\nabla I$ intact — $\bar a$ varies slowly because it is itself a blurred field, so it acts like a locally constant scale on the guide's gradient, and the edge stays sharp. (This averaging is also, quietly, why the filter is so well-behaved at the edge itself: a pixel right on a strong edge sits in windows that mostly agree on a steep slope, so the edge is reinforced, not blurred.)
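The whole recipe (window means, slope, intercept, coefficient averaging) fits in a few lines for a 1D signal. This is a minimal sketch, not a reference implementation: the names `box_mean` and `guided_filter_1d`, the test signal, and the choices `r = 8`, `eps = 0.01` are assumptions made here, and the box mean uses `np.convolve` for brevity, so it is not the constant-time version.

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1)-sample window; the window shrinks at the borders."""
    kernel = np.ones(2 * r + 1)
    sums = np.convolve(x, kernel, mode="same")
    counts = np.convolve(np.ones_like(x), kernel, mode="same")
    return sums / counts

def guided_filter_1d(guide, inp, r, eps):
    """Per-window line fit followed by averaging of the coefficients."""
    mean_I = box_mean(guide, r)
    mean_p = box_mean(inp, r)
    cov_Ip = box_mean(guide * inp, r) - mean_I * mean_p
    var_I = box_mean(guide * guide, r) - mean_I ** 2
    a = cov_Ip / (var_I + eps)      # slope of each window's line
    b = mean_p - a * mean_I         # line passes through the window means
    a_bar = box_mean(a, r)          # average the coefficients of all
    b_bar = box_mean(b, r)          # windows covering each pixel
    return a_bar * guide + b_bar

rng = np.random.default_rng(0)
guide = np.where(np.arange(200) < 100, 0.2, 0.8)   # clean step edge
noisy = guide + rng.normal(0.0, 0.05, guide.size)  # noisy input to clean
q = guided_filter_1d(guide, noisy, r=8, eps=0.01)

print("noise std in flat region, before/after:", noisy[20:80].std(), q[20:80].std())
print("jump across the edge, guided vs box blur:",
      q[100] - q[99], box_mean(noisy, 8)[100] - box_mean(noisy, 8)[99])
```

On this signal the flat regions come out much quieter than the input, while the jump between the two pixels on either side of the edge stays close to the guide's, where a plain box blur of the same radius spreads it over the whole window.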
5.7.3 Why it is fast: O(N) regardless of window size
The bilateral's cost is its Achilles' heel: a brute-force bilateral over a window of radius $r$ does work proportional to the window area at every pixel, so widening the window makes it slower. The guided filter has no such dependence. Look back at the recipe: every quantity it needs — the means $\bar I_w, \bar p_w$, the variance, the covariance — is a local average over the window, and a local average is exactly a box filter. And a box filter, computed with a running sum (a sliding accumulator that adds the pixel entering the window and subtracts the one leaving), costs the same per pixel no matter how big the window is. (The variance and covariance need only the means of $I$, $p$, $I^2$, and $Ip$, each a box filter, combined pointwise — for instance $\operatorname{var}_w(I) = \overline{I^2}_w - \bar I_w^{\,2}$.)
So the entire filter is a fixed number of box filters plus some pointwise arithmetic, and its total cost is O(N) — linear in the number of pixels $N$, with a constant that does not grow with the window radius. This is the headline practical advantage. Where the bilateral forces a trade between window size and speed, the guided filter charges the same whether the window is $3\times3$ or $33\times33$, which is precisely why it became the go-to fast edge-preserving filter and a standard drop-in replacement for the bilateral when speed matters.
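The constant-cost box filter can be sketched with a summed-area table, a close cousin of the running sum described above: one pass of cumulative sums, then four table lookups per pixel whatever the radius. The function name `box_mean_2d` and the border convention (windows clipped to the image) are assumptions of this sketch.

```python
import numpy as np

def box_mean_2d(img, r):
    """Mean over a (2r+1) x (2r+1) window using a summed-area table.

    Work per pixel is four lookups and a division, independent of r.
    Windows are clipped at the image border, so border means use fewer samples.
    """
    h, w = img.shape
    # Summed-area table with a leading row and column of zeros.
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    # Clipped window bounds (end indices are exclusive).
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + r + 1, 0, w)
    total = (sat[y1][:, x1] - sat[y0][:, x1]
             - sat[y1][:, x0] + sat[y0][:, x0])
    area = (y1 - y0)[:, None] * (x1 - x0)[None, :]
    return total / area

rng = np.random.default_rng(1)
image = rng.random((64, 48))

# Window variance from two box means, combined pointwise.
local_var = box_mean_2d(image * image, 5) - box_mean_2d(image, 5) ** 2
print("smallest local variance:", local_var.min())
```

Swapping `box_mean_2d` in for every window average in the recipe gives the O(N) filter; changing `r` changes only which table entries are read, not how many.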
5.7.4 No gradient reversal
The bilateral's other failing is subtler and worth naming precisely. Gradient reversal is an artefact in which a smooth ramp, after filtering, comes back with its slope locally reversed: a little overshoot or undershoot that creates a faint false edge where the input had none. It happens near strong edges, and its cause is the bilateral's range weighting: close to an edge, a pixel's neighbourhood is dominated by samples from one side, the weighted average pulls toward that side, and the reconstructed detail can tip the wrong way. Sharpening or detail-boosting on top of a bilateral base layer makes these reversals plainly visible as halo-like rings and ridges.
The guided filter is free of gradient reversal, and the reason is the linear model again. Because the output obeys $\nabla q \approx \bar a\,\nabla I$ with $\bar a$ a slowly varying, smoothly blurred field, the output's gradient is just a gently rescaled copy of the guide's gradient. As long as $\bar a$ is non-negative, which is guaranteed in the self-guided case $I = p$ where every slope is a variance ratio between 0 and 1, the gradient can be larger or smaller but it cannot change sign relative to the guide: there is no mechanism for the slope to flip. A monotone ramp in the guide produces a monotone ramp in the output. There are no range bins to quantize, so there is no quantization artefact, and the structural linearity means no overshoot. This is the second reason the guided filter is preferred for detail enhancement and tone work: you can push the detail layer hard without summoning the rings.
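A short derivation makes the sign argument concrete for the self-guided case $I = p$, the setting of base/detail decomposition. It treats $\bar a$ as locally constant, as the text does, and the boost factor $\lambda \ge 0$ is a hypothetical parameter introduced here for illustration.

$$
a_w = \frac{\operatorname{var}_w(p)}{\operatorname{var}_w(p) + \varepsilon} \in [0, 1)
\quad\Longrightarrow\quad
\bar a \in [0, 1)
$$

$$
\nabla q \approx \bar a\,\nabla p,
\qquad
\nabla (p - q) \approx (1 - \bar a)\,\nabla p,
\qquad
\nabla \bigl(q + \lambda\,(p - q)\bigr) \approx \bigl(\bar a + \lambda\,(1 - \bar a)\bigr)\,\nabla p
$$

Every coefficient on the right is non-negative, so the base layer $q$, the detail layer $p - q$, and the boosted image all keep the sign of the input's gradient under this approximation.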
5.7.5 Reading the affinity back out
We have not mentioned an affinity once (no range weight, no patch distance, no smoothness energy), and yet the guided filter belongs squarely in the affinity family. To see it, ask the implicit question: in computing the output at pixel $i$, how much does the input at some other pixel $j$ contribute? Expand the algebra (substitute the line fit into the coefficient average and collect terms) and the guided filter turns out to be, like every method in this chapter, a weighted average of the input, $q_i = \sum_j W_{ij}\,p_j$, with a specific weight kernel (writing $|w|$ for the number of pixels in a window)
$$W_{ij} = \frac{1}{|w|^2} \sum_{w \ni i,j} \left( 1 + \frac{(I_i - \bar I_w)(I_j - \bar I_w)}{\operatorname{var}_w(I) + \varepsilon} \right),$$
summed over windows $w$ containing both pixels. The shape of this weight is the affinity. Two pixels that fall on the same side of an edge have guide values that deviate from the window mean in the same direction, so the product $(I_i-\bar I_w)(I_j-\bar I_w)$ is positive and their weight is large — they belong together. Two pixels on opposite sides of an edge deviate in opposite directions, the product is negative, the weight shrinks toward zero — they are kept apart. That is the L4 affinity exactly: high weight for pixels of the same surface, low weight across an edge, read off the guide's structure. The bilateral wrote this affinity as an explicit range Gaussian; the guided filter dissolves it into the arithmetic of a local linear fit, where you can no longer point at it — but it is doing the same job.
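The kernel can be evaluated directly on a toy guide to see the affinity at work. The sketch below is an illustration under assumptions: a 1D step guide, the helper name `guided_weights_1d`, the values `r = 5` and `eps = 1e-3`, and a sum restricted to windows that lie fully inside the signal.

```python
import numpy as np

def guided_weights_1d(guide, i, r, eps):
    """Row i of the implicit kernel W for a 1D guide.

    Sums over every full window (centre k, radius r) that contains pixel i.
    """
    n = guide.size
    size = 2 * r + 1
    weights = np.zeros(n)
    for k in range(max(i - r, r), min(i + r, n - 1 - r) + 1):
        window = slice(k - r, k + r + 1)
        mu = guide[window].mean()
        var = guide[window].var()   # population variance over the window
        weights[window] += 1.0 + (guide[i] - mu) * (guide[window] - mu) / (var + eps)
    return weights / size ** 2

guide = np.where(np.arange(100) < 50, 0.2, 0.8)   # step edge between 49 and 50
row = guided_weights_1d(guide, i=48, r=5, eps=1e-3)

print("weights sum to:", row.sum())
print("same side, 4 pixels away (j=44):", row[44])
print("across the edge, 4 pixels away (j=52):", row[52])
```

Pixel 44 and pixel 52 are equally far from pixel 48, but only the first lies on the same side of the edge, and its weight comes out far larger. For an interior pixel the weights sum to one, as a weighted average requires.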
This is the cleanest stopping point for the chapter's through-line. The affinity opened the part as the bilateral's explicit range weight; method by method — bilateral, then the grid, then a learned grid, then regression, then patches, then optimization — it migrated into different mechanisms; here it has gone fully implicit, hidden inside the per-window line. Same family, last disguise.
5.7.6 Where it is used
The guided filter's combination of edge preservation, O(N) speed, freedom from reversal, and one more property — it is a smooth, differentiable chain of local averages and pointwise divisions — has made it a default building block in a surprising number of places.
- Edge-preserving smoothing and detail enhancement. Set the guide equal to the input ($I = p$) and you have a self-guided edge-aware smoother: a fast, reversal-free stand-in for the bilateral, ideal as the base layer in base/detail decomposition for tone mapping and detail boosting. The seed of this whole part — the halo-free tone manipulation that motivated the bilateral — is served just as well, and faster, by the guided filter.
- Flash / no-flash fusion. Exactly the cross-bilateral application of the last chapter (Petschnigg et al. 2004; Eisemann and Durand 2004): smooth the noisy ambient (no-flash) image using the clean flash image as the guide, so its trustworthy edges steer the denoising. The guided filter does this with the same recipe and none of the bilateral's cost.
- Matting and feathering (Compositing, segmentation and matting). The guided filter's weight kernel above is closely related to the matting Laplacian — both encode "pixels are coupled when a local color line fits them" — which is why guiding by the color image gives a fast way to refine a coarse matte into one that hugs the true object boundary. Feathering a hard mask into a soft alpha is the same operation.
- Dehazing: transmission refinement. Single-image dehazing first estimates a rough, blocky transmission map (how much of each pixel is haze versus scene) from the dark-channel prior (He, Sun and Tang 2009); see Dehazing. That rough map must be refined to align with the actual scene edges before it is used, and the guided filter does it directly with the hazy image itself as the guide. In fact the guided filter was introduced partly to replace the slow matting-Laplacian solve that the original dehazing work used for exactly this step.
- Joint upsampling. A quantity computed cheaply at low resolution — a depth map, a segmentation, a chrominance field — can be upsampled to align with a high-resolution guide by guiding on the full-resolution color image. The guide supplies the missing high-frequency edges; the low-resolution quantity supplies the values. This is the joint-bilateral-upsampling idea done in O(N).
- A differentiable layer in networks. Because every operation is an average or a pointwise arithmetic step, the guided filter is differentiable end to end, so it drops into a learning pipeline as a layer — a fast, edge-aware module a network can train through, often used to push a coarse network output back onto the full-resolution image's edges.
A closing tie to the filter it descends from. The cross/joint bilateral of Bilateral filtering asked "how similar are these two pixels in the guide?" and voted accordingly; the guided filter asks "what line through the guide explains the input here?" and lets the line carry the edges. Same goal, same affinity underneath, but the guided filter trades the vote for a fit, and that one trade buys O(N) speed, no gradient reversal, and a differentiable operator. It is where the affinity story, having spent itself six ways, comes quietly to rest.