💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

8.7 Patch match

A hole-filler erasing a tourist from a vacation photo, a texture synthesizer growing more brick wall, a content-aware resizer widening a panorama without stretching the faces — under the hood these tools all spend almost all their time asking one small question, over and over: for this little patch of the output, where in the source is the patch that looks most like it? Answer it once and you have copied a believable scrap of content into place; answer it for every output pixel and you have a dense map of where each piece of the result came from. The trouble is the cost. Asked naively — compare each patch against every candidate — that search is so slow that for years exemplar methods felt like batch jobs: kick one off, go get coffee, come back to a result. "Content-Aware Fill" did not exist as a click-and-wait tool, because the wait was minutes, not milliseconds.

This chapter is about making that question fast, and the object it produces is the nearest-neighbour field — for each output patch, the offset to its best source match. There are two roads. The first is a beautiful randomized heuristic, PatchMatch Barnes, Shechtman, Finkelstein and Goldman, 2009, which exploits two facts about real images to get a near-perfect field in a few passes over the image. The second is a global energy minimization, Shift-Map editing Pritch, Kav-Venaki and Peleg, 2009, which poses the same "where does each output pixel come from" question as a graph-cut labelling and solves it optimally for the energy you wrote down. Fast-and-approximate versus slow-and-optimal, over the same underlying object — the two-sided view of patch-based editing.

💡 Big lesson (recurrence of L12)

Image edits are discrete optimization over a correspondence — and the energy is the design. "Where does each output pixel come from?" is a labelling: assign every output pixel a shift into the input. Pick the energy — a data term (respect the user's keep / remove / resize constraints) plus a seam term (neighbouring pixels should take consistent offsets, so the cuts where they disagree fall on real edges) — and a generic graph-cut solver returns the globally optimal field. PatchMatch is the fast heuristic answer to that same labelling; Shift-Map is the globally optimal one. The solver is off the shelf; choosing the affinity and the energy is the whole job. This is the affinity lesson L4 ("how much do these two pixels belong together") turned into a cut — "don't cut here" — rather than an average. (→ see Big lesson L12, first placed in Seam optimization; it extends L4, first placed in Bilateral filtering.)

8.7.1 The nearest-neighbour field, and why exhaustive search is the bottleneck

Start by naming the shared sub-problem precisely, because once you see it you see it everywhere. Exemplar inpainting (Criminisi's method, and the older Efros–Leung texture synthesis, Efros & Leung 1999), patch-based texture synthesis, content-aware retargeting, even non-local-means denoising and Image Analogies — each one repeatedly needs, for a given output patch, the most similar patch in some source region. Collect those answers for every pixel and you have a nearest-neighbour field (NNF): an offset field

$$ f:\; p \mapsto \Delta, \qquad \text{minimizing}\quad D(p) = \big\lVert P_{\text{out}}(p) - P_{\text{in}}\big(p + f(p)\big)\big\rVert^2, $$

where $P_{\text{out}}(p)$ is the small patch (say $7\times 7$) centred at output pixel $p$, $P_{\text{in}}(p+\Delta)$ is the source patch reached by adding the offset $\Delta = f(p)$, and $D$ is the sum-of-squared-differences patch distance. Read it back: for each output pixel, $f$ stores the displacement to the source patch that matches it best, and $D$ is how good that match is. The field $f$ is the whole object of this chapter; everything downstream is "now copy along $f$" (Figure 8.7.1).

Now the cost wall. Computed exhaustively, finding $f(p)$ for one pixel means comparing its patch against every candidate patch in the source — and you do that for every output pixel. That is quadratic in the number of pixels, and on a multi-megapixel image it is hopeless for anything interactive. Acceleration structures like kd-trees over patch descriptors help, but they still fall far short of real time on full-resolution photos, and they cost memory. This quadratic search — not the copying, not the blending — is the reason exemplar texture synthesis and Criminisi inpainting were slow, and the reason a one-click "remove this object" tool had to wait for a faster way to get the field.

The reframing that unlocks the chapter is to stop insisting on the exact nearest-neighbour field. We almost never need it. We need a field good enough that the patches it points to are visually right — and that we can settle for approximately, but fast. Two ways to do that: a randomized heuristic (PatchMatch, next) and a global energy minimization (Shift-Map, at the end). The rest of the chapter is those two.

fig-nnf-field
Figure 8.7.1. The nearest-neighbour field made visible. Left: an output image (or the region being synthesized), tiled to suggest its overlapping patches. Right: the NNF as a color-coded offset map — each output pixel is colored by the displacement $\Delta=f(p)$ to its best-matching source patch (hue = direction, saturation = length of the offset). Smooth regions of constant color are coherent: whole blocks of output map to a single translated block of source. The field, not any one match, is what every patch-based edit consumes. (Deferred — to be generated.)

8.7.2 PatchMatch — randomized correspondence Barnes et al. 2009

PatchMatch is one of those algorithms that sounds like it cannot possibly work, then converges in front of your eyes. It rests on two facts about natural images, and the whole design is a way to cash them in.

The first is coherence. Real images are made of coherent regions — a stretch of wall, a patch of sky — and within them, adjacent output patches usually map to adjacent source patches. If the patch at $p$ matched well to the source patch at $p+\Delta$, then the patch just to the right of $p$ very likely matches well to the source patch just to the right of $p+\Delta$ — the same offset $\Delta$, give or take. So one good offset is good for its neighbours, and a single correct match wants to spread.

The second is a birthday-paradox / law-of-large-numbers argument. Suppose you assign every output pixel a completely random offset. Most will be terrible. But there are a great many pixels, and the source has only so many places a given patch could go — so purely by chance, some patches will land on a good or near-good match. You do not need most guesses to be good; you need a few lucky hits anywhere in the image, because coherence will let you spread them. Random guessing seeds; coherence harvests.

The algorithm cashes in both with three moves, iterated a handful of times (Figure 8.7.2):

  1. Random initialization. Give every pixel a random offset into the valid source region. The resulting field is mostly garbage — but, by the birthday argument, salted with a few good matches.
  2. Propagation. Scan the image in raster order (top-left to bottom-right), and alternate the scan direction each iteration (bottom-right to top-left on the next pass, so information flows both ways). At pixel $p$, do not just keep its current offset — also try the offsets its already-scanned neighbours used, the pixel to its left and the one above, each shifted by one to account for the step between them. Keep whichever of the candidates gives the smallest patch distance $D$. The effect is a flood fill: a single good offset, once found, infects its whole coherent region in one sweep, because each pixel copies the good offset from the neighbour that already has it.
  3. Random search. Propagation can only spread offsets that already exist somewhere; it cannot discover a better one or fine-tune. So after propagating at $p$, refine its current best by testing a sequence of random offsets in an exponentially shrinking window around it — candidates of the form $f(p) + w\,\alpha^{i}\,R_i$ for $i = 0, 1, 2, \dots$, where $w$ is the image size, $\alpha < 1$ (say $\tfrac12$), and $R_i$ is a fresh random direction. The first, large jumps can escape a bad local optimum entirely; the later, tiny jumps polish the match to the exact pixel. Keep the best seen.

The whole algorithm is short — initialize, then sweep, propagating and searching at each pixel:

initialize f[p] ← random offset, for every pixel p
repeat a few passes (alternate scan direction each pass):
    for each pixel p in scan order:
        propagate: f[p] ← argmin over {f[p], f[p_prev]+δ, f[p_above]+δ} of D
        random search: for i = 0, 1, 2, …:
            try f[p] + w·αⁱ·Rᵢ;  keep it if D is smaller
return f

where p_prev and p_above are the two already-scanned neighbours (shifted by the step δ between them), w is the image size, α < 1, and Rᵢ is a fresh random direction. Run propagation-plus-search for two to five passes and the field is clean — a near-exact NNF computed in milliseconds, orders of magnitude faster than exhaustive search, with a memory footprint of just one offset (and its distance) per pixel.

Why does it converge so fast, when the per-pixel search it replaces is quadratic? Because the work is shared in exactly the way the two facts allow. Propagation turns one lucky match into an entire region's worth of good matches for free — the expensive discovery happens once per coherent region, and coherence broadcasts it. Random search guarantees no pixel stays stuck on a bad offset that propagation cannot reach, and its exponential window means it covers "anywhere in the image" and "right next door" with a logarithmic number of tries. The slogan is worth keeping: randomness plus coherence beats brute force. You are not avoiding the search; you are letting the structure of images do almost all of it for you.

One extension matters for later chapters. Generalized PatchMatch Barnes, Shechtman, Goldman and Finkelstein, 2010 lifts the field from a single nearest neighbour per pixel to the $k$ nearest neighbours, and from pure translations to offsets that also include rotations and scales. That is what you need when the content you want to match is transformed — a texture at a different orientation, a stylized example, the rotated-patch matching inside denoising and style transfer (cross-ref Style transfer). Same core loop — init, propagate, search — over a richer space of offsets.

fig-patchmatch-three-steps
Figure 8.7.2. PatchMatch in three moves, shown as the NNF cleaning up. (1) Random initialization: the offset field is noise — every pixel points somewhere random, a few happen to be good. (2) Propagation: scanning in (alternating) raster order, each pixel also tries its left/up neighbours' offsets shifted by one; a single good offset flood-fills across a coherent region, so blocks of the field snap to a common color. (3) Random search: around each pixel's current best, test offsets in an exponentially shrinking window ($w\,\alpha^{i}$) — big jumps escape local optima, small jumps fine-tune. After a couple of iterations the field is clean. The point: a few lucky random hits, spread by coherence, replace an exhaustive search. (Deferred — to be generated.)

8.7.3 Applications: hole filling, retargeting, reshuffle

Once the NNF is milliseconds away, a family of edits that used to be impractical becomes interactive — and they are mostly the same engine pointed at different constraints (Figure 8.7.3).

Interactive hole filling — the engine inside Content-Aware Fill. To fill a hole, repeatedly run PatchMatch between the hole's patches and the rest of the image, copy in the best-matching source patches, average overlapping copies, and iterate coarse-to-fine down an image pyramid so that the fill is globally consistent and not just locally plausible. This is precisely the search that turned exemplar inpainting from a batch job into a real-time "remove the tourist" tool. What objective is it (approximately) optimizing? Bidirectional similarity Simakov, Caspi, Shechtman and Irani, 2008: a good result is both coherent — every output patch has a good match in the source, so nothing looks fabricated — and complete — every important source patch appears somewhere in the output, so nothing important is dropped. PatchMatch is the fast search that makes minimizing that two-sided similarity interactive.

Image retargeting — content-aware resize. Change a photo's aspect ratio — make it wider for a banner, narrower for a phone — while preserving the important content rather than uniformly squashing it (which turns faces into ovals and buildings into parallelograms). The patch-based version re-synthesizes the image at the new size so that every output patch still has a good source match, letting unimportant, repetitive regions (sky, grass) absorb the change. This is the smooth cousin of seam carving (Avidan & Shamir 2007), which retargets by greedily removing one minimum-energy connected seam of pixels at a time; we will see in the Shift-Map section that seam carving is the greedy special case of a global offset-labelling.

Reshuffle. Drag an object to a new spot in the image and let the algorithm re-synthesize the rest so the scene stays coherent — the moved object lands cleanly and the gap it left behind, and the place it now covers, are filled from surrounding content. Same NNF machinery, now with a user constraint pinning some content to a new location.

And beyond. The same engine drives video completion Wexler, Shechtman and Irani, 2007 — fill a hole consistently across time, not just space — Image Melding Darabi, Shechtman, Barnes, Goldman and Sen, 2012 — PatchMatch with rotation/scale plus a gradient-domain blend, so the copied patches are also seamlessly integrated — the self-similarity search inside denoising, and the patch search inside Image Analogies (cross-ref Style transfer).

A note of honesty before we move on, because it sets up the rest of the part. Every method here still copies real content — it rearranges patches that genuinely exist in the image, so it is reconstruction-leaning within the image: faithful, but bounded by what the photo already contains. It cannot invent a face that was never there, only borrow one. The generative successors — diffusion inpainting and outpainting — synthesize new content instead of copying it: far more powerful (they can hallucinate a plausible building behind the removed sign), and correspondingly less faithful (they can hallucinate the wrong one). That is the L10 / L11 hand-off to Generative AI and diffusion: copying is bounded and honest; generation is unbounded and must be trusted.

[figure fig-patchmatch-apps not built]
Figure 8.7.3. One image, three patch-based edits, all driven by the same NNF engine. Hole fill: an object is masked out and the gap re-synthesized from surrounding patches (coarse-to-fine), the engine behind Content-Aware Fill. Retarget: the frame is made wider without distortion — unimportant regions (sky, ground) stretch, faces and buildings keep their proportions, versus the squashed result of a naive resize. Reshuffle: an object is dragged to a new location and the surround re-synthesized so the scene stays coherent. The common thread: choose where output patches may come from, then let the field do the rest. (Deferred — to be generated.)

8.7.4 Shift-Map image editing Pritch et al. 2009 — editing as graph-cut labelling

PatchMatch finds a good offset field by a clever search. Shift-Map editing asks for the offset field that is optimal for an explicit energy, and gets it by graph cuts. The reframing is the chapter's big lesson made literal: decide, per output pixel, a single shift $M(p)$ into the input, and then read the output off by copying

$$ \text{output}(p) = \text{input}\big(p + M(p)\big). $$

So the output is a rearrangement of input pixels under a smooth offset (shift) field — exactly the same kind of object as the NNF, only now chosen by global optimization instead of randomized search (Figure 8.7.4).

What makes one shift field better than another? An energy in two parts (the L12 template):

$$ E(M) \;=\; \sum_{p} E_{\text{data}}\big(M(p)\big) \;+\; \lambda \sum_{(p,q)\in\mathcal{N}} E_{\text{smooth}}\big(M(p), M(q)\big). $$

Minimize $E(M)$ and you have the globally (near-)optimal rearrangement. The solver is graph cuts / $\alpha$-expansion (Boykov et al. 2001): the labels are the candidate shifts, the data term sets each pixel's cost for taking a label, the smoothness term sets the cost of a seam between neighbouring labels, and min-cut/max-flow finds the cheapest labelling. The label set — all possible shifts — is large, so Shift-Map runs coarse-to-fine: solve on a downsampled image with a small set of shifts, then refine the offsets on finer levels. The result is an offset map with clean, explicit seams placed exactly where the eye will not catch them.

The duality — PatchMatch versus Shift-Map. Hold the two side by side and the chapter closes on itself. Both produce an offset field that rearranges the input into the output. PatchMatch finds it by a randomized heuristic: fast, approximate, and ideal when you want a dense field for synthesis (hole filling, reshuffle), where every output pixel needs a plausible source and a clean global seam matters less than speed. Shift-Map finds it by minimizing a global graph-cut energy: slower, but optimal for its energy and producing clean, explicit seams — ideal for retargeting and removal, where you want a defensible, artifact-free cut rather than a fast fill. This is the L4 → L12 progression in one picture: an affinity (patch similarity, or seam cost) is consumed either by a fast local rule (propagation) or by a global cut ($\alpha$-expansion). The affinity is the design; the solver is a choice.

fig-shiftmap-labeling
Figure 8.7.4. Shift-Map editing as a labelling. Left: the output, with pixels colored by their chosen shift $M(p)$ into the input — large regions share one shift (a translated block of input), and the graph-cut seams are the thin boundaries where the shift jumps. Right (or inset): those seams overlaid on the image, showing that they fall on real edges (where the color/gradient discontinuity penalty is cheap), so the rearrangement is invisible. Compare PatchMatch's field (found by randomized search) with this one (found by minimizing $E(M)=\sum_p E_{\text{data}}+\lambda\sum E_{\text{smooth}}$ via $\alpha$-expansion): the same offset-field object, one heuristic, one optimal. (Deferred — to be generated.)

A few cross-references close the loop. Seam carving (Avidan & Shamir 2007) is the greedy special case Shift-Map generalizes — it removes one optimal seam at a time, where Shift-Map optimizes the whole rearrangement at once. The seamless blend after a cut or patch composite — to erase whatever residual mismatch the seam leaves — is the gradient-domain Poisson solve of Poisson image editing (the L9 half of "cut here, blend there"). And the learned successor to all of it — generate the fill rather than copy it, predict dense correspondence with deep features rather than search for it — lives in Generative AI and diffusion (L11) and Deep learning.


Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (recurrence of L12)

Image edits are discrete optimization over a correspondence — and the energy is the design. "Where does each output pixel come from?" is a labelling: assign every output pixel a shift into the input. Pick the energy — a data term (respect the user's keep / remove / resize constraints) plus a seam term (neighbouring pixels should take consistent offsets, so the cuts where they disagree fall on real edges) — and a generic graph-cut solver returns the globally optimal field. PatchMatch is the fast heuristic answer to that same labelling; Shift-Map is the globally optimal one. The solver is off the shelf; choosing the affinity and the energy is the whole job. This is the affinity lesson L4 ("how much do these two pixels belong together") turned into a cut — "don't cut here" — rather than an average. (→ see Big lesson L12, first placed in Seam optimization; it extends L4, first placed in Bilateral filtering.)