💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to

5.4 NL-means (non-local means)

The bilateral filter and its regression cousin both judge a neighbour from a single pixel's value (or its gradient). That is a flimsy basis for the question every edge-preserving method is really asking — do these two pixels belong to the same thing? — and noise attacks it head-on. Take two pixels that genuinely belong to the same wood grain, the same patch of sky, the same cheek; let each be struck by its own random noise, and their values can drift apart far enough that the bilateral reads them as different surfaces and withholds their affinity. The very thing we are trying to remove is the thing corrupting the decision about what to average. A test built on one number is fragile precisely because one number is so easy for noise to move.

Non-local means (Buades, Coll & Morel 2005) repairs this with a single, decisive leap, and then a second one rides along with it. The first leap is to stop comparing pixels and start comparing patches: to decide how loudly some candidate pixel should vote for the pixel we are denoising, compare the little patch around the candidate with the little patch around the centre. Two pixels belong together when their surroundings match — and a whole neighbourhood of values is something noise cannot easily spoof, because noise would have to corrupt every offset of the patch in the same coordinated way to forge a match, which it essentially never does. The second leap is in the name. Once the affinity is a whole-patch comparison, there is no longer any reason to keep the search near the pixel we are fixing — so we don't. This chapter is those two leaps and what falls out of them.

5.4.1 The leap: compare patches, not pixels

Set the two leaps side by side, because together they are the method.

Patches, not pixels. The bilateral asks: is your value close to mine? Non-local means asks: does your neighbourhood look like mine? Around the pixel $p$ we are denoising, take a small patch — call its values, read off relative to the centre, $N_p$ (a $7\times 7$ window is typical). Around any candidate pixel $q$, take the same-shaped patch $N_q$. The candidate gets a large vote only when $N_p$ and $N_q$ are alike as patches. A pixel sitting in the middle of a brick gets to borrow strength from every other brick, because every brick's surroundings look like every other's — mortar lines in the same places, the same speckle of texture — even though the individual centre values, scrambled by noise, may not match at all. Matching the surroundings rather than the point is what makes the affinity robust (Figure 5.4.1).

Non-local. Because the test is now a patch comparison, the spatial Gaussian that kept the bilateral local has lost its job. In the bilateral, proximity was a proxy for "probably the same surface"; with a real patch test in hand we can check sameness directly, so we drop the proximity prior — or, for speed, relax it to a generous search window rather than the whole frame. The consequence is the method's signature: any patch anywhere in the image may vote. Another brick three metres down the wall, another strand of the same hair, the mirror-image eye on the other side of a face, the next repeat of a wallpaper motif — all of them are legitimate, high-weight contributors. Natural images repeat themselves at every scale, and non-local means is the filter that cashes that repetition wherever it lies, not merely next door. This is the "non-local" in the name, and it is the deeper of the two leaps: self-similarity is a denoising resource, and most of it is not local.

Sidebar — why self-similarity buys denoising (frame averaging in disguise)

The reason non-local means denoises so well is an old idea wearing patch clothing. If you photograph a static scene many times and average the frames, the noise — independent from shot to shot — shrinks like $1/\sqrt{n}$, while the fixed signal survives untouched. That is frame averaging, the surest denoiser there is, and it needs $n$ genuine repeats of the same scene. Non-local means finds those repeats inside a single image. A clean signal repeats — the same texture, the same structure, recurs all over a natural photo — but the noise on each instance is independent. So when we pool the centre pixels of many well-matching patches, we are doing frame averaging on the cheap: the repeated clean value reinforces, the independent noise cancels, and structure is preserved because we only ever averaged things whose surroundings already agreed. The patch test is what guarantees the "frames" are really the same scene; the non-local search is what lets us find enough of them. The catch is honest — the matches are approximate, not exact repeats, so the cancellation is imperfect and very rare structures (a one-off blemish, a unique corner) have few partners to average with and denoise least.

5.4.2 The distance and the weight

The mechanics are exactly the bilateral's, with the single-pixel difference swapped for a patch distance. The dissimilarity between the patches around $p$ and $q$ is a sum of squared value differences taken offset by offset across the patch,

$$ d^2(p,q) = \sum_{k} g_a(k)\,\lVert I(p+k) - I(q+k) \rVert^2 , $$

where $k$ ranges over the patch offsets and $g_a$ is an optional small Gaussian that leans the comparison on the patch centre (offsets near the middle count for more than offsets at the rim). Read it back in words: lay the two patches on top of each other, walk them in lockstep, and add up how different they are point by point. A small total means the surroundings match.

The affinity is then the same exponential falloff we have used since the bilateral — a small distance becomes a large weight,

$$ w(p,q) = \exp\!\left(-\frac{d^2(p,q)}{h^2}\right) , $$

with $h$ a single scalar that sets how forgiving the match must be (a larger $h$ admits looser matches and denoises harder, at the risk of blurring genuine differences; it is tuned to the noise level, and is sometimes written $h^2 = 2\sigma^2$ in the patch-distance units). The denoised value at $p$ is the affinity-weighted average of the centre pixels of the matching patches, normalized:

$$ \widehat{I}(p) = \frac{\sum_q w(p,q)\,I(q)}{\sum_q w(p,q)} . $$

The thing to hold onto is which pixel each weight delivers. The weight is computed from the whole patch $N_q$, but what we average in is only the centre value $I(q)$. The patch is the credential; the centre pixel is the vote it earns. Pool enough such votes from genuine matches and the noise cancels while the structure — present in every matching patch's centre alike — comes through clean.

💡 The L4 affinity, at higher resolution

Non-local means is not a new principle; it is the L4 affinity lesson seen through a sharper lens. The bilateral's slogan was average neighbours of a similar value. Non-local means rewrites it as average pixels whose surroundings look alike — two precise swaps and nothing else. Swap the single-pixel value difference $\lvert I_p - I_q\rvert$ for the whole-patch distance $d(p,q)$; swap the local spatial window for the whole image (or a wide search window). The exponential affinity and the normalized weighted average are identical to the bilateral's. So non-local means is a bilateral filter living in patch space — which is exactly why it belongs in the edge-preserving family and not off on its own. It also pushes the affinity idea past mere edge preservation: by judging similarity over a patch, it matches texture and structure, not just brightness, and so it can keep fine repeated detail that a value-only filter would smooth away as noise.

fig-nlmeans-patches
Figure 5.4.1. Non-local means: compare patches, not pixels. The pixel being denoised sits at the centre of a small reference patch (highlighted). Several other patches scattered across the image — a repeated texture, another instance of the same structure — match it closely; each carries its similarity weight $w(p,q)=\exp(-d^2(p,q)/h^2)$, computed from the whole patch but spent on the patch's centre pixel. NL-means averages those centre pixels, pooling many independent noisy measurements of the same underlying signal. The matches need not be nearby: the search is non-local, any patch in the image may vote — which is exactly why image self-similarity is such a strong denoising prior.

5.4.3 Where it sits: ancestors, successors, and the patch-method family

Non-local means is a hinge in the book's story, looking back to filtering and forward to almost everything patch-based.

Looking back, it is the bilateral filter generalized — the affinity made robust by widening its window from a pixel to a patch (see Bilateral filtering for the affinity machinery we are reusing). Looking sideways, the same self-similarity it exploits for denoising is the engine of patch-based synthesis: copying and stitching matching patches is how we fill holes and grow texture in Inpainting, texture synthesis, and finding those matches fast — the expensive non-local search, made practical — is precisely the problem Patch match solves with a randomized nearest-neighbour scheme. Non-local means searches for similar patches to average them away; texture synthesis and inpainting search for similar patches to copy them in; PatchMatch is the shared accelerator. They are one family built on one fact about natural images: patches repeat.

Looking forward, non-local means is the direct ancestor of the strongest classical denoiser. Block-matching and 3-D filtering (BM3D) takes the same matched patches but, instead of simply averaging their centres, stacks the matching patches into a small 3-D group and filters them jointly — transforming across the stack so that the shared signal concentrates into a few coefficients while the independent noise spreads thin, then transforming back. It is non-local means with a smarter combination step in place of the plain average, and for years it set the bar that learned methods had to beat.

That lineage carries one more idea worth flagging now and developing later. A good denoiser encodes, implicitly, a strong prior about what clean natural images look like — here, that they are self-similar. Modern reconstruction methods exploit this directly: they plug a powerful denoiser (non-local means, BM3D, or a learned network) into an inverse problem as the regularizer, letting "what a denoiser would do" stand in for "what a plausible image looks like." That denoiser-as-prior link, and the learned self-similarity priors that descend from non-local means, are taken up in Super-resolution and image priors — where the self-similarity that here cancels noise becomes the assumption that lets us invent detail we never measured.