💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 2 big lessons ↓

8.6 Inpainting, texture synthesis

Erase a person from a holiday snapshot and you are left with a person-shaped hole — a region of pixels you never photographed, surrounded by sand and sea you did. What should go in the hole? There is no measurement to consult; the camera recorded nothing there but the person who is now gone. And yet you have a strong sense of what should be there: more sand, more sea, the horizon line continuing unbroken from one side of the hole to the other. That sense is not information from the photograph. It is a prior — your internal model of what beach photographs look like — and it is doing one hundred percent of the work.

This is the cleanest illustration in the whole book of the lesson that ran through super-resolution and deblurring: when the measurement genuinely gives you nothing, no amount of cleverness recovers it from the data, and a prior is what selects an answer. Super-resolution had to invent high frequencies the sensor folded away; inpainting has to invent whole regions. The dial is turned all the way up — inside the hole there is zero data — so the result is exactly as good, and exactly as honest, as the prior is.

The chapter is one spectrum. At the weak end, a smoothness prior simply continues the surrounding color and edges inward — wonderful for a thin scratch, hopeless for a large hole. Stronger is a self-similarity prior that copies patches from elsewhere in the same image, which is what makes texture synthesis and object removal work. Stronger still is a database prior that pastes from other images entirely — "the internet is your prior." And at the strong end sits a learned generative prior that has compressed millions of images into weights and simply generates a completion. Each rung copies more and invents more; we climb it in order.

💡 Big lesson (recurrence of L10)

The prior is not optional — and filling a hole is the purest case. There is zero data inside the hole, so 100% of the answer comes from the prior. Smoothness, self-similarity, a photo database, a learned net — each is a different prior, and the result is exactly as good (and as honest) as the prior is. The split that organized super-resolution survives intact: a prior is reconstruction-leaning where it copies genuine structure from somewhere real, and hallucination where it invents detail that was never anywhere. (Registered as L10; first appears in Super-resolution and image priors; here taken to its extreme — no data, all prior.)

8.6.1 Inpainting as filling unmeasured pixels — the spectrum of priors

State the problem plainly. We are given an image $I$ with a masked region $\Omega$ — the hole — marked by the user or by a defect detector, and we want a complete image in which $\Omega$ is plausible and seamless: the eye should not be able to find it. The defining feature is the one we keep returning to: there is no data in $\Omega$, so the entire fill is the prior's invention. The only thing the algorithm has to work with is the known surround and whatever assumptions it makes about how the hole relates to it.

Read that way, the four families of methods are just four different answers to the question "what does the missing region look like?" (Figure 8.6.1):

Those four assumptions — looks like (a) its neighbours, (b) another part of this image, (c) another image, (d) a typical image — are the four families, and they are ordered weak to strong. The rest of the chapter walks the ladder. The framing to hold onto is not "Photoshop magic" but the honest one: the algorithm is answering a question it has no data for, by assuming the hole resembles one of those four things.

fig-inpaint-prior-spectrum
Figure 8.6.1. One hole, filled four ways — the weak→strong prior axis. A single masked region (say a removed object on a textured natural scene) is completed by: (1) diffusion / PDE — color and edges smoothed inward; correct near the rim but blurry and textureless in the middle; (2) exemplar / PatchMatch — patches copied from elsewhere in the same image; the right texture and copied real structure, limited to what this image contains; (3) database / Hays–Efros — a region from a different, scene-matched photograph grafted in; photoreal and coherent but a different scene's content; (4) diffusion-net / learned — a completion synthesized from a learned image prior; sharp and plausible but invented, not measured. The panels move left to right from weakest to strongest prior, and from "blurry but safe" to "sharp but possibly wrong."

8.6.2 PDE / diffusion-based inpainting

The weakest rung borrows its idea from the art world. A museum conservator repairing a cracked painting does not invent new subject matter inside the crack; she continues the lines that arrive at its edges, carrying a contour or a brushstroke smoothly across the gap so the eye glides over it. Bertalmío, Sapiro, Caselles and Ballester 2000, in the paper that named image inpainting, made exactly this into an algorithm.

The naive version of "be smooth" is to solve Laplace's equation inside the hole — fill $\Omega$ with the harmonic function that interpolates its boundary values, the soap-film membrane we met in Poisson image editing. That diffuses color inward, but it diffuses it in all directions equally, so an edge arriving at the boundary is smeared across into a soft blur rather than continued as an edge. The conservator's instinct is the opposite: diffuse along the contour, not across it.

Bertalmío's method encodes that instinct. The level-lines of an image — the curves of constant intensity, called isophotes — are perpendicular to the gradient $\nabla I$, so the direction along an isophote is $\nabla^\perp I$ (the gradient rotated by ninety degrees). The method propagates the image's smoothness — concretely, its Laplacian $\Delta I$, a measure of how "unfinished" the smoothing is — along those level-lines:

$$ \partial_t I \;=\; \nabla(\Delta I)\cdot \nabla^{\perp} I . $$

Read it back: push the smoothness information in the direction the level-lines run. An edge that meets the hole is an isophote; following it carries the edge straight across the gap instead of blurring it out. Run this evolution to steady state and the hole fills with colors and contours that continue what surrounds it (Figure 8.6.2). Telea 2004 gave a fast approximation built on the fast-marching method, sweeping inward from the boundary in one pass instead of iterating a PDE to convergence — the same idea, cheap enough for interactive scratch removal.

This works beautifully for thin structures — scratches, scanner dust, text overlays, hairline cracks — where the two sides of the gap are close and there is genuinely little to invent: continuing the lines is almost as good as having measured them. But it has a hard ceiling. For a large hole there is nothing to propagate inward except flatness; the isophotes peter out long before they reach the centre, and the fill goes blurry and textureless. Diffusion knows how to continue an edge but has no concept of texture — it cannot manufacture the grain of grass or the speckle of gravel, because those are not smooth at all. That failure is precisely the motivation for the next rung.

This rung is, structurally, the gradient-domain / Poisson family in disguise (cross-ref Poisson image editing, 💡 L9): seamless cloning edits the gradient field where it has a source to copy from, and PDE inpainting edits it where there is no source at all — it manufactures the gradients to integrate from a smoothness assumption rather than copying them.

fig-pde-isophote-diffusion
Figure 8.6.2. PDE inpainting continues level-lines into the gap. Left: a scratch crosses an edge (say the boundary between a dark roof and a bright sky). Middle (naive harmonic fill): solving Laplace's equation diffuses color across the edge, smearing the sharp boundary into a soft grey blur inside the gap. Right (isophote-driven, Bertalmío): diffusing the Laplacian along the isophotes $\nabla^{\perp}I$ continues the roof–sky edge straight across the scratch — it stays an edge. Below: the same method on a large hole fails — with no nearby structure to propagate, the interior fills blurry and textureless, motivating the exemplar methods.

8.6.3 Texture synthesis (Efros–Leung; Efros–Freeman quilting)

The fix for a large hole in a textured region — grass, gravel, a knit sweater, foliage — is to stop trying to diffuse it and start trying to synthesize it. The leverage is a single observation: texture is self-similar. A patch of lawn looks like every other patch of lawn; the missing piece is not smooth, but it is statistically a copy of what surrounds it. So rather than model the texture, copy it.

Non-parametric texture synthesis (Efros–Leung)

Efros and Leung's 1999 method Efros & Leung 1999 makes this astonishingly literal: it synthesizes the texture one pixel at a time, and the sample image itself is the model — no parameters are fit at all. To fill the next unknown pixel $p$, take its already-known neighbourhood $N(p)$ — the ring of pixels around $p$ that are already filled or were never missing — and search the source texture for windows that match it, scoring each candidate window $N(q)$ by a Gaussian-weighted sum of squared differences (SSD) over only the known part of $p$'s neighbourhood. Pick a window that matches well, and copy its centre pixel into $p$:

$$ \hat I(p) \;=\; I\!\left(\arg\min_{q}\ \big\lVert w \odot \big(N(p) - N(q)\big)\big\rVert^2\right), $$

where $w$ is a Gaussian weighting that trusts the pixels nearest $p$ most. Then $p$ is known, the frontier advances by one pixel, and you repeat:

while any pixel is unfilled:
    p ← unfilled pixel with the most known neighbours
    for each candidate window N(q) in the source:
        d[q] ← ‖ w ⊙ (N(p) − N(q)) ‖²   over the known part of N(p)
    pick q at random among the windows with d[q] near the minimum
    Î(p) ← I(q)        # copy the matching window's centre pixel

In words: to decide what goes here, find the places in the image whose surroundings look most like this pixel's surroundings, and copy what they have in the middle. (Filling the most-surrounded pixel first keeps the synthesis frontier coherent; accepting a near-best match at random, rather than the single best, avoids verbatim repetition.) It is the Markov-random-field assumption made into an algorithm — a pixel is conditionally determined by its neighbourhood, and the empirical distribution of "neighbourhood → centre" is the source image.

There is exactly one knob, and it controls everything: the window size (Figure 8.6.3). Make it too small and the match sees only a few pixels, so it captures the texture's fine grain but loses any large-scale structure — bricks dissolve into a noisy mush of brick-colored speckle. Make it too large and the only windows that match are near-identical to the source, so the method copies the source verbatim, tiling it like wallpaper (and slowly, since each match is a full search). The window must straddle the texture's characteristic scale — large enough to see a whole brick, small enough not to lock onto one specific brick. That single trade-off — local randomness versus large-scale faithfulness — is the recurring theme of every patch method that follows.

The honest cost is speed: a full search over the source for every pixel is brutally slow, which is exactly the bottleneck Patch match later removes Barnes et al. 2009, and Wei & Levoy 2000 had already sped up by using fixed-shape neighbourhoods amenable to tree search.

fig-efros-leung-growth
Figure 8.6.3. Texture grown pixel-by-pixel, and the window-size knob. Top: the growth process — for the next unfilled pixel $p$ (on the synthesis frontier), its known neighbourhood $N(p)$ is matched by Gaussian-weighted SSD against candidate windows $N(q)$ throughout the source; the centre of a near-best match is copied into $p$, and the frontier advances. Bottom, three results from one texture as window size grows: too small → the texture's large-scale structure is lost, output looks like noise of the right color; just right → straddles the texture's characteristic scale, output looks like more of the same texture; too large → matches force near-exact copies, output reproduces the source verbatim (and is slow).

Image quilting (Efros–Freeman)

Synthesizing a single pixel per search is wasteful when whole patches of the source are reusable. Efros and Freeman's 2001 image quilting Efros & Freeman 2001 lays down overlapping blocks copied from the source, choosing each new block so that its overlap region matches what is already there. That alone would leave a visible seam wherever two blocks meet, so quilting adds the crucial trick: in the overlap strip, it does not blend — it cuts along the minimum-error boundary, the path through the overlap where the two blocks disagree least, and takes one block on each side of the cut (Figure 8.6.4). Finding that least-disagreement path through a strip is a tiny dynamic program — a one-dimensional shortest-path / graph cut — and because the seam wanders through wherever the two patches happen to agree, it becomes invisible.

Quilting is far faster than pixel-at-a-time synthesis (a whole block per search, not a pixel), and its min-error-seam idea is the direct ancestor of GraphCut textures (Kwatra et al. 2003 Kwatra et al. 2003), which generalize the one-dimensional cut to a full two-dimensional graph cut: place a patch and let min-cut/max-flow decide, per pixel, whether to keep the old texture or the new one, with the cut running along the seam of least visible disagreement.

That is not a coincidence; it is a big lesson. The patch-plus-min-error-seam view is discrete optimization on a graph — a match cost (how well a patch fits) plus a seam cost (how visible the boundary is) — minimized by an off-the-shelf optimizer, with the energy doing all the design work.

💡 Big lesson (L12)

The quilting seam is a graph cut — the energy is the design. Stitching two overlapping patches by choosing the minimum-error boundary through their overlap is exactly discrete optimization on a graph: a dynamic program in one dimension (quilting), a min-cut/max-flow in two (GraphCut textures, Kwatra et al. 2003). The optimizer is off-the-shelf; what you design is the energymatch cost (does this patch belong here?) plus seam cost (is the boundary visible?). The same machinery cuts panorama and photomontage seams and segments objects; texture synthesis is one more instance. (See L12image edits as discrete optimization on a graph; first placed in Seam optimization.)

fig-quilting-seam
Figure 8.6.4. Image quilting stitches patches along a minimum-error boundary. Left: two source blocks are placed with an overlap region; naively butting them together leaves a hard seam down the middle of the overlap. Middle: the per-pixel squared difference within the overlap strip — bright where the two blocks disagree, dark where they agree. Right: a minimum-error path (a 1-D dynamic program / graph cut) is routed through the dark, low-disagreement valleys; the left block is kept on one side of the cut and the right block on the other. The seam follows wherever the textures already match, so it disappears — the prefiguring of GraphCut textures (Kwatra 2003) and 💡 L12.

A note on the road not taken: instead of copying patches, one can fit an explicit parametric model of the texture's statistics — Heeger & Bergen 1995 match the histograms of steerable-pyramid subband responses, and Portilla & Simoncelli 2000 match a richer set of correlations among those responses. Synthesize by coercing white noise to have the same subband statistics. These never matched the visual quality of patch copying for structured textures, but they are conceptually important: matching the statistics of multiscale filter responses is the direct ancestor of neural style's Gram-matrix idea (cross-ref Style transfer, Linear pyramids and wavelets).

8.6.4 Exemplar inpainting: clone, healing brush, and object removal (Criminisi)

Texture synthesis fills a textured hole; exemplar inpainting turns the same patch-copying machinery on real photographs, where a hole sits among structured content — an edge, a horizon, a corner — that must continue across it. The lineage starts with two manual tools every retoucher knows.

The clone brush is the bare exemplar prior under hand control: pick a source region and paint its pixels into the hole. Pure copying betrays itself instantly — wherever the source and destination differ in lighting or color, a hard seam appears. The healing brush (Photoshop 7, 2002) fixes exactly that by transferring the source's texture and detail while re-matching the destination's color and illumination — it pastes the source's gradients and re-levels to the surround, the gradient-domain / Poisson trick (cross-ref Poisson image editing, 💡 L9). The relationship is the one we already drew: Pérez, Gangnet & Blake's gradient-domain cloning Pérez et al. 2003 was inspired by the healing brush and gave it its clean formulation (Georgiev 2005 later recast it as a covariant-derivative correction).

fig-clone-heal-brush
Figure 8.6.5. Clone vs heal, interactively. The clone brush copies pixels verbatim from a chosen source — wherever source and destination differ in tone or colour, a hard seam appears. The heal brush copies the source's texture but re-matches the destination's local tone (a cheap stand-in for the gradient-domain re-levelling described above), so the patch melts in. Drag the green dot to pick what to copy, drag on the photo to paint, switch brushes — or upload your own image.

Criminisi, Pérez and Toyama 2004 automated the clone brush into full object removal. The texture-synthesis idea — copy the best-matching patch — is sound, but applied in a careless order it fails on the very thing that matters in photographs: linear structure. If you fill flat regions before edges, a wall's edge or a horizon line arrives at the hole from two sides and the two extensions do not meet; the structure breaks. Criminisi's contribution is a fill order that processes the hole's boundary in a deliberate sequence, by assigning each boundary patch a priority

$$ P(p) \;=\; C(p)\,D(p), $$

the product of a confidence term $C(p)$ — how much of this patch's neighbourhood is already reliably known, so we fill the most-surrounded regions first — and a data term $D(p)$ — how strongly an image structure (an isophote, an edge) runs into the hole at $p$. The data term is the key: it pushes patches where a strong edge meets the boundary to the front of the queue, so linear structures are continued across the hole before the flat regions on either side are filled in (Figure 8.6.6). The effect is that a removed person's hole is closed by sand and sea that meet correctly at the horizon, not by two mismatched halves.

The limits are the limits of self-similarity itself: the search is slow and confined to this image's content. PatchMatch (Patch match) removes the slowness, making the search interactive; a database (Hays & Efros) or a learned net removes the "only this image" ceiling.

[figure fig-object-removal not built]
Figure 8.6.6. Criminisi exemplar inpainting: fill order continues structure. Left: a photo with a foreground person to remove; the mask $\Omega$ leaves a hole straddling a strong linear structure (a horizon / wall edge). Middle: the priority $P(p)=C(p)D(p)$ along the hole boundary — high (bright) where an edge runs into the hole (the data term) and where the surround is well known (the confidence term). Patches are filled highest-priority first, so the edge is propagated across the hole early, before the flat regions. Right: the completed image — the structure continues unbroken and the textured regions are filled by copied patches; the person is gone with no visible seam.

8.6.5 Data-driven scene completion (Hays & Efros)

Every method so far is trapped inside one image: if this photo does not contain the right grass, sky, or building façade, self-similarity has nothing to copy. Hays and Efros 2007 made the obvious-in-hindsight leap — use a different image — into a system. To fill a hole, they retrieve, from millions of photographs, scenes whose surround matches the area around the hole, then take a good match, align it, cut a region out along a graph-cut seam, and Poisson-blend it into place. The completion is borrowed wholesale from a photo that was never of this scene at all.

The slogan is "the internet is your prior." With a large enough database, some photograph already contains a plausible thing to put in the hole — a stretch of sky with the right clouds, a building wall of the right kind — so you retrieve the answer instead of modelling it. This is the non-parametric, copy-from-an-example prior taken to web scale: the exemplar is no longer a patch of this image but a region of some image.

The honesty cost is steep and worth stating: the grafted region is from a semantically different scene. The result is coherent and photoreal, but it is not this scene's truth — the clouds are real clouds, just not the clouds that were there. It is the most vivid demonstration of L10's "plausible ≠ measured." And it needs a genuinely huge database plus reliable scene matching to find a graft that fits.

Scene completion is also the conceptual hinge to deep learning. "Retrieve the right completion from millions of images" is exactly what a generative network does implicitly: it has compressed those millions into its weights, so instead of searching a database at fill time it can synthesize the completion directly. Database retrieval and learned generation are the same prior — the rest of the world's images — stored two different ways.

8.6.6 Deep inpainting (context encoders → partial/gated conv → diffusion)

The strong end of the ladder replaces the hand-built prior with a learned one (💡 L8, Deep learning): a network trained to regenerate masked regions, having seen enough images to know what plausibly belongs in a hole.

Context encoders Pathak et al. 2016 were the first convolutional neural network (CNN) inpainter — an encoder–decoder trained with a reconstruction loss plus an adversarial loss (a generative adversarial network (GAN) discriminator pressuring the fill to look real). They captured semantics that no copy-based method could — completing a face with a plausible eye, not a patch of skin — but early versions were blurry, because a reconstruction loss averages over the many plausible completions.

Partial and gated convolutions Liu et al. 2018; @yu-etal-2019|Yu et al. 2019 fixed a quiet but crippling nuisance: a vanilla convolution applied near a hole mixes the hole's garbage masked pixels into its output, poisoning everything downstream. Partial convolution renormalizes each output by the fraction of valid (known) pixels in its receptive field and shrinks the hole layer by layer; gated convolution goes further and learns a soft per-pixel mask gate. The payoff is robustness to free-form holes of any shape — the practical "select this object and remove it" workhorse behind modern editing tools.

Diffusion inpainting Lugmayr et al. 2022, RePaint is the strongest prior and closes the loop back to Super-resolution and image priors. Treat the hole as a measurement constraint — the known pixels are the measurement, the masked pixels are unknown — and run a pretrained diffusion model's denoising loop, but at every step pin the known region to the real pixels (renoised to that step's noise level) while letting the network fill the rest:

$$ x_{t-1} \;\leftarrow\; \text{mask}\odot q\!\left(x^{\text{known}}_{t-1}\right) \;+\; (1-\text{mask})\odot \mathcal{D}_\theta(x_t). $$

This is exactly PnP/RED-style posterior sampling (💡 L11) with the forward operator set to "keep the known pixels": the data-fit step re-injects the measurement, the denoise step imposes the prior, and the loop alternates — the same recipe as super-resolution and deblurring, with masking as the measurement. The architectures themselves live in Deep learning / Generative AI and diffusion; here they are simply the strong-prior end of the spectrum.

The honesty note must be loudest here, because this is where it bites: the learned prior is the most flexible and the most prone to hallucinate. It will confidently invent a face, a license plate, a sign's text that were never measured and may be wrong. An inpainted region is not evidence — a forensic and ethical line this part has drawn since super-resolution (L10).

8.6.7 Highlight / specular recovery

A last, less obvious instance closes the chapter, because it is inpainting in disguise. A blown highlight — a glaring specular spot on metal, a sky clipped to pure white, a sunlit window — is a region where the sensor saturated: every pixel hit its ceiling, so the true radiance above the clip was never measured. A clipped region is a hole, and highlight recovery is inpainting it — filling the saturated pixels with a plausible surface color using the surroundings and whatever was not clipped as the prior (Figure 8.6.7).

There are two complementary angles. The first is specular–diffuse separation: a highlight is the specular reflection added on top of the underlying diffuse surface, so recovering the diffuse layer beneath it is the goal (cross-ref the illumination-in-a-single-image material). The second, and more directly inpainting-flavoured, is per-channel clipping recovery: highlights rarely clip all three color channels at once, so when only the red channel has saturated, the unclipped green and blue still carry the hue and the local gradient, and the missing red can be extrapolated from them and from the surround — a small, comparatively well-posed inpainting whose prior is "the channels move together."

It lives here for the same L10 reason as everything else: recover what the measurement destroyed — destroyed by clipping rather than by masking, but unmeasured all the same. And it has an honest, non-inpainting cousin worth flagging: rather than guess the highlight, capture it with a shorter exposure and merge — the high-dynamic-range (HDR) move of trading a guess for a measurement (forward-ref Multiple exposure imaging). Inpainting the clip is what you do when you only have the one frame.

[figure fig-highlight-recovery not built]
Figure 8.6.7. Highlight recovery as inpainting a clipped region. Left: a photo with a blown specular highlight (e.g. a bright spot on a glossy fruit or a clipped patch of sky) — the saturated pixels read pure white, their true radiance unmeasured. Middle: the clipped region treated as a mask $\Omega$; note that often only one or two channels saturated, so the unclipped channels still carry hue and gradient information at those pixels. Right: the recovered image — the saturated region filled with a plausible surface color by extrapolating the unclipped channels and the surrounding texture; the highlight reads as a bright but colored surface rather than a flat white hole.

8.6.8 Epitomes — a compact patch model

The exemplar methods above treat the image itself as the dictionary of patches to copy from. Epitomes (Jojic, Frey & Kannan 2003) instead learn a miniature — an "epitome" a fraction of the image's size — fit so that every patch of the original has a close match somewhere inside it. It is a generative patch model: the full image is explained as a quilt of overlapping epitome patches, and the usual tasks fall out of "find each patch's place in the epitome and read back a clean (or completed) version" — denoising, inpainting, super-resolution, texture transfer. It is the same patch-nearest-neighbour idea that drives texture synthesis and Patch match, but with a learned, compact dictionary in place of the raw image — and so the pre-deep cousin of today's learned patch and latent priors (L8 / L11), a bridge from exemplar copying to the modern generative models of Generative AI and diffusion.


Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (recurrence of L10)

The prior is not optional — and filling a hole is the purest case. There is zero data inside the hole, so 100% of the answer comes from the prior. Smoothness, self-similarity, a photo database, a learned net — each is a different prior, and the result is exactly as good (and as honest) as the prior is. The split that organized super-resolution survives intact: a prior is reconstruction-leaning where it copies genuine structure from somewhere real, and hallucination where it invents detail that was never anywhere. (Registered as L10; first appears in Super-resolution and image priors; here taken to its extreme — no data, all prior.)

💡 Big lesson (L12)

The quilting seam is a graph cut — the energy is the design. Stitching two overlapping patches by choosing the minimum-error boundary through their overlap is exactly discrete optimization on a graph: a dynamic program in one dimension (quilting), a min-cut/max-flow in two (GraphCut textures, Kwatra et al. 2003). The optimizer is off-the-shelf; what you design is the energymatch cost (does this patch belong here?) plus seam cost (is the boundary visible?). The same machinery cuts panorama and photomontage seams and segments objects; texture synthesis is one more instance. (See L12image edits as discrete optimization on a graph; first placed in Seam optimization.)