💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

5.2 Bilateral filtering

📎 Problem set

PS2 implements the bilateral filter (with a YUV chroma variant). → Problem sets (appendix).

A Gaussian blur asks a single question of every neighbour: how close are you? A nearby pixel counts for a lot, a distant one for little, and that is the entire policy. Inside the flat interior of a region that is exactly right. At an edge it is exactly wrong: the blur reaches across the boundary and averages two things that have nothing to do with each other — the bright sky and the dark mountain it sits against. This chapter builds the standard cure, the bilateral filter, which adds a second question to the average: not only how close is this neighbour? but how similar in value is it? That second question — read off the value difference between two pixels — is an affinity, a measure of how much two pixels belong together, and it turns out to be the organizing idea behind an entire family of edge-aware methods. We meet the bilateral first as the remedy for one specific, ugly artefact, and then follow its affinity through cross/joint filtering, the bilateral grid, and a learned grid (HDRnet).

5.2.1 Motivation — local tone-mapping haloes

Start where the bilateral filter earns its keep in this book: tone mapping a high dynamic range (HDR) image. A real scene spans a contrast far larger than any screen or print can reproduce — a sunlit window and the shadowed room behind it can differ by thousands to one — so we must compress the range while keeping the picture legible. The move that works is a two-scale decomposition: split the image into a large-scale base layer (the overall distribution of light — where it is bright, where it is dark) and a fine-scale detail layer (texture, small edges, the fine print), then crush the base hard, leave the detail alone, and recombine. It is the best-of-both trick (Style Reference rule 5): squeeze the part of the signal that carries the offending range, keep the part that carries the look. For HDR we run all of this on log luminance, because the eye reads contrast multiplicatively — a fixed ratio of light should map to a fixed step.

The obvious way to extract the base layer is the tool we already own: a Gaussian blur. Blur the log-luminance image with a wide Gaussian and what survives is the slow, large-scale variation; the detail is whatever the blur threw away. In the flat interior of a region this works beautifully. At a strong edge it fails, visibly and spectacularly.

Picture a bright sky meeting a dark mountain silhouette. The Gaussian base layer is a weighted average of the neighbourhood, and right at the silhouette that neighbourhood straddles the boundary: the blur pulls bright sky values onto the dark side of the edge and dark mountain values onto the bright side. The base no longer respects the edge — it ramps smoothly across it. Compress that base, add the detail back, and the mismatch surfaces as an overshoot: a bright fringe glowing on the dark side of the silhouette, a dark fringe on the bright side. That glowing fringe hugging every high-contrast edge is the classic halo, the signature artefact of naive local tone mapping (Figure 5.2.1).

fig-tonemap-halo
Figure 5.2.1. The halo, and its cure. Local tone mapping splits the image into a smooth base (the overall light) and a fine detail layer, compresses the base, and recombines. With a plain Gaussian base (centre), the smoothing leaks across the strong silhouette edge; after the base is compressed and the detail added back, the mismatch overshoots into a glowing halo — a bright fringe on the dark side of the edge, a dark fringe on the bright side. With a bilateral base (right), the base layer stops at the edge, so there is no overshoot and the halo is gone. That halo is exactly the motivation for the edge-preserving filter this chapter builds.

What we want is a base layer that is smooth within a region but does not average across the boundary into the next region — a filter that blurs everywhere except where the image has a real edge. That is precisely an edge-preserving filter, and the bilateral is the canonical one.

The halo is the cleanest motivation, but far from the only one. The same wish — do not let the filter bleed past an edge — turns up all over computational photography. Denoise a portrait and you want the grain gone from the cheek without the eyelashes smearing into the skin. Stylize or abstract an image and you want flat, poster-like regions with the boundaries kept crisp. Build a selection mask and you want it to snap to edges, not slop across them. These are the same request, served by the same machinery (cross-link to #Compositing, segmentation and matting and to stylization / non-photorealistic rendering (NPR)).

5.2.2 Bilateral filter

Begin from what we already know (name-then-use). A Gaussian blur replaces each pixel with a weighted average of its neighbours, the weight being a spatial Gaussian $g_s(\lVert p - q\rVert)$ of the distance between the centre pixel $p$ and a neighbour $q$: nearby pixels count for a lot, distant ones taper off. It is easiest to watch this on a 1-D scanline — a single row of the image — where the picture becomes a wiggly curve of value against position (Figure 5.2.2).

The trouble, once more, is the edge. A step edge on the scanline — values low on the left, high on the right — gets smeared by the Gaussian, because right at the step the averaging window reaches across the boundary and blends the two plateaus together. This is exactly the bleed that built the halo, seen one dimension down.

The fix is a single extra factor. Keep the spatial Gaussian, but multiply in a second Gaussian — this one on the value difference between the centre pixel and the neighbour, $g_r(\lvert I_p - I_q\rvert)$. This range weight asks "are these two pixels similar in value?" and down-weights neighbours that are too different. A neighbour that is close in space but far in value — one living on the other side of an edge — now earns a weight near zero, however spatially near it sits. The filter, all by itself, declines to average across the edge.

Putting the two Gaussians together gives the bilateral filter (Tomasi & Manduchi 1998). The weight a neighbour $q$ receives at pixel $p$ is the product of the spatial and range terms,

$$ w(p,q) \;=\; g_s\!\big(\lVert p - q\rVert\big)\; g_r\!\big(\lvert I_p - I_q\rvert\big), \qquad g_s(d) = e^{-d^2/2\sigma_s^2}, \quad g_r(\Delta) = e^{-\Delta^2/2\sigma_r^2}, $$

and the output is the normalized weighted average of the neighbours,

$$ I^{\mathrm{bf}}_p \;=\; \frac{1}{W_p}\sum_q w(p,q)\, I_q, \qquad W_p = \sum_q w(p,q). $$

Read it back in words (Style Reference rule 3): "average my neighbours, but only those that are both nearby and similar to me." The two widths do different jobs. $\sigma_s$, in pixels, sets how large a region we pool over — the reach of the spatial Gaussian. $\sigma_r$, in intensity units, sets how different is "too different" — the value gap beyond which a neighbour stops counting. Because $\sigma_r$ is measured in the same units as $I$, we must fix the encoding before we can quote it: for HDR tone mapping the filter runs on log luminance (Durand & Dorsey 2002, where a range $\sigma_r \approx 0.4$ in $\log_{10}$ is typical); for ordinary edits it runs on gamma values.

As code, the brute-force bilateral is a double loop with a per-pixel normalization, and one subtlety distinguishes it from convolution: you may precompute the spatial weight $g_s$ (it is the same window everywhere), but the range weight $g_r$ must be recomputed for every pair, because it depends on the value difference at that pixel — so it cannot be tabulated or reused from your Gaussian-blur code. Truncate only the spatial Gaussian to a finite window.

precompute g_s over the window          # spatial weights, same for every pixel
for each output pixel p:
    sum ← 0;  W ← 0                      # accumulator and normalizer
    for each neighbour q in the window around p:
        w ← g_s(‖p − q‖) · g_r(|I[p] − I[q]|)   # range term recomputed per pair
        sum ← sum + w · I[q]
        W   ← W + w
    I_bf[p] ← sum / W                    # per-pixel normalization

For a color image the range distance $|I_p - I_q|$ is the 3-D distance in RGB (or YUV); everything else is unchanged.

Sidebar — debugging the bilateral

Two extreme settings of the range width $\sigma_r$ give two predictable outcomes that bracket the whole filter — the fastest way to sanity-check an implementation (the recipe also lives in Developing, testing & debugging). Figure 5.2.7

If either limit misbehaves, the bug is in the range term or the normalization — fix that before trusting any middle $\sigma_r$.

fig-bilateral-1d
Figure 5.2.2. The bilateral filter on a 1-D scanline. A noisy step edge (one row of an image). A plain Gaussian averages across the step and smears it into a ramp. The bilateral filter denoises each plateau but keeps the step crisp, because at the edge the range weight kills every neighbour on the far side. The inset plots the weight at a probe pixel sitting just left of the edge: the symmetric spatial Gaussian (what a blur would use) versus the bilateral weight $g_s\cdot g_r$, which collapses to one side — almost all of it on the near (left) plateau, almost none across the edge.

The 1-D picture carries straight to the 2-D image: the bilateral smooths flat texture and noise inside each region while holding the edges, so a portrait emerges with clean skin and intact eyelashes, and an HDR base layer stops cleanly at the silhouette instead of ramping across it (Figure 5.2.3).

fig-bilateral-2d
Figure 5.2.3. The bilateral filter on a real photograph. Left, the input — a white egret against busy foliage. Centre, a plain Gaussian blur of the same radius: it smooths the foliage texture but also smears the bird's edges into the background. Right, the bilateral filter: texture and noise are smoothed away within each region while the strong edges — the bird's silhouette, the beak — stay crisp. The same behaviour as the 1-D scanline, now on a 2-D image.

One observation lifts the bilateral from a useful trick to the organizing idea of the whole chapter. The range Gaussian $g_r$ is not just a clever weight — it is a similarity, an affinity between two pixels: a number that says how much these two pixels belong together, computed from their value difference. The bilateral averages pixels in proportion to that affinity. Every later method in this chapter — and several beyond it — is a different answer to one question: how should we measure whether two pixels belong together?

💡 Big lesson — edge-preserving = affinity

Use the color / intensity difference between two pixels as how much they belong together — an affinity — then average (or connect) pixels in proportion to it. Pixels with high affinity are treated as measurements of the same underlying thing and get averaged together; pixels with low affinity — across an edge — are kept apart. The bilateral filter's range weight $g_r$ is the first full instance, but the same affinity drives the bilateral grid, the guided filter, non-local means (NL-means), colorization, and matting / segmentation that follow. Edge-preserving is affinity. (This is L4, registered in Denoising where the bilateral first appeared for noise; this is its full edge-preserving treatment.)

Robustness for values with few neighbours

The normalized form above carries a quiet weakness, and it bites exactly where it should not. Consider a pixel whose value is rare — a lone bright speck against a dark wall, a thin highlight, a pixel sitting on a tiny isolated feature. Very few of its neighbours fall within $\sigma_r$ of its value, so its normalizer $W_p = \sum_q w(p,q)$ is tiny and noisy: a sum of a mere handful of small weights. Dividing by a near-zero $W_p$ amplifies whatever noise those few neighbours carry, and the standard normalized average over-trusts them. In this regime the Durand–Dorsey normalization is, strictly, the wrong thing to do — it manufactures a confident estimate out of almost no evidence.

The fix is to stop dividing, and to express the output as a correction to the input rather than a fresh average built from scratch (Aubry et al. 2014). Write the bilateral output as the input plus a sum of per-neighbour pulls, and drop the $1/W_p$:

$$ I^{\mathrm{bf}}_p \;=\; I_p \;+\; \sum_q w(p,q)\,\big(I_q - I_p\big). $$

This is the unnormalized, or delta, bilateral. Read each term: a neighbour $q$ contributes $w(p,q)\,(I_q - I_p)$, a nudge toward its own value scaled by how much it belongs. When many neighbours match — a flat region — these nudges accumulate and pull the pixel firmly toward the local consensus, just as the normalized filter does. But when few neighbours match — a rare value — the nudges are few and small, so the pixel makes a small correction and stays close to itself, instead of dividing by a near-zero $W_p$ and blowing up. A lonely value is left roughly alone, which is the honest answer when there is little to average it against.

This unnormalized form is also the bilateral cousin of the local Laplacian filters: working in deltas from the input, scale by scale, is exactly the move that makes the multiscale local Laplacian fast and robust (Paris / Aubry; cross-ref Local Laplacian filters and Linear pyramids and wavelets).

Sidebar — anisotropic diffusion and the robust-statistics view

The delta form above is more than a normalization fix: it is one step of gradient descent on a robust energy, and that observation ties the bilateral filter to a second classic family — anisotropic diffusion. Read $\sum_q w(p,q)\,(I_q-I_p)$ as a weighted, edge-stopping diffusion of intensity: where neighbours agree (small $|I_q-I_p|$) heat flows freely and the region smooths; where they disagree across an edge (large $|I_q-I_p|$) the weight $w$ collapses and flow stops. That is exactly Perona–Malik anisotropic diffusion (Perona & Malik 1990) — iterate $\partial_t I = \operatorname{div}\!\big(g(|\nabla I|)\,\nabla I\big)$ with a conductance $g$ that vanishes for large gradients, and edges survive because no heat crosses them. The bilateral filter is essentially a large-step, non-iterated version of the same idea. What unifies them is robust statistics (Black et al. 1998; Durand & Dorsey 2002). Smoothing a pixel toward its neighbours is estimation: minimize a penalty $\sum_q \rho(I_q-I_p)$ on the value differences. A quadratic penalty $\rho(x)=x^2$ lets every neighbour pull in proportion to its difference — so large differences (edges) pull hardest, and the estimate is the plain average that blurs edges. Swap in a robust penalty that saturates (Lorentzian, Tukey biweight) and its derivative — the influence function $\psi=\rho'$ — rises and then returns to zero for large differences: a neighbour across an edge is an outlier and exerts no pull. Perona–Malik's conductance $g$, the bilateral's range weight $g_r$, and a robust influence function are the same curve — edge preservation is just outlier rejection (Figure 5.2.5). The idea generalizes from a scalar conductance to a tensor one — diffusion that flows along a structure but not across it. That is the basis of diffusion-tensor imaging (DTI) in medicine: each voxel carries a $3\times3$ diffusion tensor (drawn as an ellipsoid, elongated along the direction water moves most freely), estimated from the diffusion of water in tissue, which lets tractography trace white-matter fibre bundles in the brain (Figure 5.2.6).

fig-bilateral-image
Figure 5.2.4. The bilateral filter on a real image. Unlike a Gaussian blur, which smooths across edges and softens the whole image, the bilateral filter weights neighbours by both spatial distance and intensity similarity — so it smooths flat regions while preserving strong edges. Compare the input, the bilateral result, and a plain Gaussian of the same spatial extent; vary the range and spatial widths to see the trade-off.
fig-robust-diffusion
Figure 5.2.5. Edge-preserving smoothing as robust estimation. Left: a robust error norm $\rho$ (which saturates) versus the quadratic $\rho(x)=x^2$ (which grows without bound). Right: the influence function $\psi=\rho'$ — for the quadratic it grows linearly, so large differences pull hardest and blur edges; for the robust norm it rises and then returns to zero, so a neighbour across a strong edge is treated as an outlier and exerts no smoothing force. The bilateral range weight, the Perona–Malik conductance $g(|\nabla I|)$, and this influence function are the same curve. After Perona & Malik (1990), Black et al. (1998), and Durand & Dorsey (2002).
fig-diffusion-tensor
Figure 5.2.6. Anisotropic diffusion with a tensor conductance — the basis of diffusion-tensor imaging (DTI). Each location carries a diffusion tensor, visualized as an ellipsoid: elongated where diffusion is directional (along a fibre bundle), round where it is isotropic. Letting smoothing flow along structures but not across them generalizes the scalar edge-stopping conductance of Perona–Malik to a direction-dependent one, and in medicine recovers and traces the brain's white-matter pathways. (Visualization by Thomas Schultz, CC BY-SA 3.0, via Wikimedia Commons.)

Debugging — hand-checkable limits

The bilateral has two limits you can predict by hand, and together they make an excellent correctness test for an implementation (Figure 5.2.7). Push the range width to each extreme:

A middle $\sigma_r$ interpolates between these two: enough to smooth away noise and texture, not so much that it bleeds across real edges. If your implementation does not reproduce a Gaussian at large $\sigma_r$ and the identity at small $\sigma_r$, it is wrong — these are the first two things to check (and the recipe in the debugging sidebar above).

fig-bilateral-limits
Figure 5.2.7. The two debugging limits. The same input filtered across a sweep of range widths $\sigma_r$. At large $\sigma_r$ the range weight does nothing and the bilateral is a plain Gaussian blur (edges washed out); at tiny $\sigma_r$ only the pixel itself counts and the output is the identity (input untouched); in between, the filter smooths within regions while holding edges. The scanline plot underneath makes it concrete: the identity tracks the input exactly, the Gaussian washes it out, and a middle $\sigma_r$ denoises the plateaus while keeping the step.

The same two knobs guide tuning. Pick $\sigma_s$ from the spatial scale of the structure you mean to remove (how large a region should be pooled). Pick $\sigma_r$ from the contrast of the edges you mean to preserve — set it just below the value jump of a real edge, so the filter smooths everything gentler than that and stops at anything sharper. And state the encoding first: the same numeric $\sigma_r$ means something completely different in linear light, in gamma, and in log luminance.

5.2.3 Cross / joint bilateral

So far the affinity has been read off the same image we are smoothing — the range weight uses $I$ itself. But there is no reason it must. The cross (or joint) bilateral filter reads the range weight from a different image, a guide $J$, while still averaging the target $I$:

$$ I^{\mathrm{cb}}_p \;=\; \frac{1}{W_p}\sum_q g_s\!\big(\lVert p-q\rVert\big)\, g_r\!\big(\lvert J_p - J_q\rvert\big)\, I_q, \qquad W_p = \sum_q g_s\!\big(\lVert p-q\rVert\big)\, g_r\!\big(\lvert J_p - J_q\rvert\big). $$

The averaging runs over the target $I$; the affinity comes from the guide $J$. Why want that? Because sometimes the image you wish to smooth is too noisy or too dark for its own edges to be trusted — read the affinity off $I$ and you are reading off the noise. If you hold a cleaner signal that shares the same scene geometry, borrow its edges instead.

The headline application is flash / no-flash photography (Petschnigg et al. 2004; Eisemann & Durand 2004). Shoot a dim scene twice: once without flash (the right mood and color, but dark and noisy) and once with flash (clean and well-exposed, but flat, harsh lighting). The two share the same edges — the same object boundaries are present in both. So smooth the noisy no-flash image using the flash image as the guide: the flash image's clean edges drive the range weight, telling the filter where it may and may not average, while the values being averaged come from the ambient no-flash shot. The result keeps the no-flash mood and kills the no-flash noise (Figure 5.2.8). We develop the full flash/no-flash pipeline later (forward-ref → #Illumination related effects in a single image, computational illumination).

fig-crossbilateral
Figure 5.2.8. The cross / joint bilateral: read the range (edge) weight from a guide image while averaging the target. Left, the no-flash ambient shot — the right mood, but dark and noisy. Centre, the flash image used as the guide — clean, well-exposed edges. Right, the joint bilateral result: the ambient image smoothed using the flash image's edges to steer the filter, so it is denoised without smearing across object boundaries — the no-flash look kept, the noise gone.

The cross bilateral is the first instance of a more general joint / guided idea — filter one image using structure taken from another — that culminates in the guided filter (later in the part).

5.2.4 Bilateral grid

The bilateral filter is non-linear: the weights depend on the pixel values, so they differ at every output pixel, and the whole thing is not a convolution. That is why you cannot reuse your fast Gaussian-blur code, and why a brute-force bilateral is a slow double loop. The bilateral grid (Chen, Paris & Durand 2007) is the beautiful observation that you can turn it back into an ordinary convolution — by adding a dimension (Figure 5.2.9).

The key picture: the product of a spatial Gaussian (in $x, y$) and a range Gaussian (in $I$) is just a single 3-D Gaussian in the space $(x, y, I)$. So lift every pixel out of the 2-D plane into a 3-D space by using its value as a third coordinate — pixel $(x, y)$ with value $I(x, y)$ goes to grid position $(x, y, I(x,y))$. In this lifted space the nonlinear 2-D bilateral becomes a plain, separable 3-D Gaussian blur — an ordinary linear convolution — and the edge-preserving behaviour is now automatic: two pixels on opposite sides of an edge have very different values, so they land in different layers of the grid, and a 3-D blur simply cannot reach between them.

The pipeline has three steps — splat, blur, slice:

The whole pipeline is short, and every step is an ordinary operation — accumulate, blur, look up:

# splat — build the coarse grid (downsample by σ_s in space, σ_r in range)
for each pixel (x, y):
    c ← (x/σ_s, y/σ_s, I[x,y]/σ_r)      # the cell this pixel lands in
    grid[c]   ← grid[c]   + I[x, y]      # accumulate value
    weight[c] ← weight[c] + 1            # accumulate count

# blur — plain separable 3-D Gaussian, on BOTH grid and weight
blur grid, weight along x, then y, then I    # e.g. [1 4 6 4 1]/16 per axis

# slice — read back at each pixel's own (x, y, I), normalized
for each pixel (x, y):
    c ← (x/σ_s, y/σ_s, I[x,y]/σ_r)
    out[x, y] ← grid[c] / weight[c]      # trilinear interpolation at c
return out

Why is this fast? Because the grid is coarse. The right resolution is set by the filter's own widths — a grid cell about $\sigma_s$ wide in space and $\sigma_r$ deep in range — so the grid is heavily downsampled in both space and value, often by large factors. A small grid, a separable blur, a trilinear read: the whole thing runs in real time on a GPU. This is the pipeline that Halide schedules for live bilateral filtering (cross-ref #Differentiable image pipelines and algorithm optimization (Halide)).

fig-bilateral-grid
Figure 5.2.9. The bilateral grid: lift $(x, y, I)$ → splat · blur · slice. A nonlinear 2-D bilateral becomes a linear 3-D blur by promoting the value $I$ to a third axis. Splat the pixels into a coarse 3-D grid (each cell accumulates a value and a weight); blur with a plain separable 3-D Gaussian (an ordinary convolution — edges are automatic because dissimilar values sit in different grid layers); slice the result back by trilinear interpolation at each pixel's $(x, y, I)$, dividing value by weight to normalize. A plain blur that will not cross edges.

The grid also makes the affinity lesson geometric: the third axis is the affinity, drawn as a coordinate. Two pixels far apart in value are literally far apart in the grid, so a local blur — which only mixes nearby cells — can never average them together. The bilateral's "do these pixels belong together?" becomes, in the grid, simply "are these cells close?"

5.2.5 Bilateral-grid learning (HDRnet)

The grid stores, per cell, a value and a weight — a small, edge-aware, low-resolution summary of the image from which a full-resolution result is sliced. HDRnet (Gharbi et al. 2017, Deep Bilateral Learning) takes the obvious next step: rather than fill the grid by splatting the input, learn what goes in it. A convolutional neural network runs on a low-resolution copy of the image and predicts, per grid cell, the coefficients of a per-cell affine transform — a little color/tone map to apply. Then, at full resolution, the network slices that grid at each pixel's $(x, y, I)$ (an edge-aware, trilinear lookup) and applies the sliced affine transform to the original pixel.

This is best-of-both once more: run a heavy network on a tiny image (cheap, and able to capture a global, learned enhancement), then carry the result back to the full-resolution image through an edge-aware upsampling structure that keeps the detail crisp. The bilateral grid has become a learned, edge-preserving upsampling layer — fast enough to run live on a phone. It is also a clean instance of swapping a hand-designed operator for a learned one (a one-line callback to L8; cross-ref Machine learning).


Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson — edge-preserving = affinity

Use the color / intensity difference between two pixels as how much they belong together — an affinity — then average (or connect) pixels in proportion to it. Pixels with high affinity are treated as measurements of the same underlying thing and get averaged together; pixels with low affinity — across an edge — are kept apart. The bilateral filter's range weight $g_r$ is the first full instance, but the same affinity drives the bilateral grid, the guided filter, non-local means (NL-means), colorization, and matting / segmentation that follow. Edge-preserving is affinity. (This is L4, registered in Denoising where the bilateral first appeared for noise; this is its full edge-preserving treatment.)