11.5 Inpainting Using Millions of Photographs

The previous sections built up a database as raw material — a wall of tiles in Photo Mosaics, a searchable index in Retrieval. This section spends that database on a single concrete task: fill a hole in a photograph. The classical way, covered in Inpainting and texture synthesis, synthesizes the missing region from the image's own texture. Hays and Efros's Scene Completion makes a startling substitution — fill the hole with pixels borrowed from another photograph of a similar scene — and it works precisely because, across a corpus of millions, a matching scene almost always exists. The result is the clean foil to today's model-based diffusion inpainting, and the ancestor of every "millions of photographs" method that follows in this part.

💡 The big lesson

The data is the prior. A learned inpainting model encodes "what scenes look like" in its weights; Scene Completion encodes the same knowledge in the collection itself — no parametric model, no training, just a large enough pile of real photographs and a fast nearest-neighbour search. Scale does the work a model would otherwise do: at a few million images, the database has effectively already seen a plausible completion of almost any ordinary scene, and the algorithm's only job is to find it and paste it in cleanly. This is the wager Pix 2 GPS and Personalized priors each make in their own domain — the collection is the model — and the whole part is variations on it. The reason to state it loudly here is the contrast it sets up: when diffusion inpainting later puts the prior back into weights, you will be able to see exactly what was traded away.

11.5.1 Why self-similar inpainting isn't enough

Classical inpainting copies patches from within the same image. The texture-synthesis lineage — Efros & Leung (1999), then [@criminisi-2004] — grows the known texture inward across the hole, matching each new patch against the surrounding image. For a small hole in homogeneous texture this is exactly right: a scratch across a brick wall, a speck on a lawn, a removed power line against open sky all close seamlessly, because the pixels that belong in the hole genuinely are present nearby. The image is its own database, and for these cases it is a sufficient one.

The method breaks the moment the hole is large or semantic. Remove a person standing in front of a building and the missing region is not more of some texture already on screen — it is the part of the building they were occluding, content that the image simply does not contain. No amount of copying from the visible pixels can invent the window, the doorway, the horizon line that belonged behind the subject. Self-similar synthesis can only ever extend what is there; it cannot supply what is absent.

The reframing is the whole idea. The pixels you need almost certainly exist — they are just in a different photograph, of a similar scene, taken by someone else. A thousand other people have photographed a building, a beach, a street corner, a mountain ridge like yours. So stop treating inpainting as synthesis ("invent the missing pixels from this image") and treat it as retrieval-and-paste ("find a real photograph whose content fits the hole, and borrow it"). The image's own pixels were the wrong database; the right one is the rest of the photographed world.

11.5.2 Scene completion from a huge database

The pipeline has three moves: retrieve, align-and-cut, blend (Figure 11.5.1). The first is the heart of it. Describe the image around the hole with a scene-level descriptor. Hays and Efros use GIST ([@oliva-torralba-2001]), a coarse summary of the scene's spatial layout (horizon, openness, dominant structures) that is deliberately blind to fine detail. Then search a database of millions of photographs for scenes whose GIST matches the query's, with the comparison restricted to the context surrounding the hole (Retrieval). You are not looking for the same object; you are looking for the same kind of scene: a matching horizon, a compatible scene type, a plausible global structure into which the borrowed patch can slot. Adding a coarse colour match to GIST tightens the candidates to scenes that will also composite without a jarring tonal jump.
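The retrieval step can be sketched in a few lines. This is a minimal illustration under loud assumptions, not Hays and Efros's implementation: `layout_descriptor` is a hypothetical, crude stand-in for GIST (gradient energy per orientation, pooled over a coarse grid, where real GIST uses a bank of oriented multi-scale filters), the colour term is omitted, and `rank_scenes` scans the database by brute force where a real system would use a fast index. What it does show is the key idea: grid cells are weighted by how much of them lies outside the hole, so the match is driven by context only.

```python
import numpy as np

def layout_descriptor(gray, valid, grid=4, n_orient=4):
    """Hypothetical stand-in for GIST: mean gradient energy per orientation
    bin, pooled over a grid x grid layout. Pixels where `valid` is False
    (the hole) contribute nothing. Returns (descriptor, per-cell weight)."""
    # Flatten the hole to the mean of the context to limit spurious edges.
    gray = np.where(valid, gray.astype(float), gray[valid].mean())
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy) * valid
    # Fold orientation into [0, pi) and quantise into n_orient bins.
    theta = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.minimum((theta / np.pi * n_orient).astype(int), n_orient - 1)
    h, w = gray.shape
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    desc = np.zeros((grid, grid, n_orient))
    weight = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            cell = (slice(ys[i], ys[i + 1]), slice(xs[j], xs[j + 1]))
            n_valid = valid[cell].sum()
            weight[i, j] = n_valid / valid[cell].size  # fraction of context
            for k in range(n_orient):
                energy = mag[cell][bins[cell] == k].sum()
                desc[i, j, k] = energy / max(n_valid, 1)
    return desc, weight

def rank_scenes(query, hole, database, k=3):
    """Indices of the k database images whose layout best matches the
    query's context. Brute-force scan: an assumption made for clarity."""
    q, w = layout_descriptor(query, ~hole)
    dists = []
    for img in database:
        d, _ = layout_descriptor(img, np.ones(img.shape, dtype=bool))
        # Cells mostly covered by the hole count for little.
        dists.append(np.sum(w[..., None] * (q - d) ** 2))
    return np.argsort(dists)[:k]

# Toy usage: three synthetic "scenes" with different dominant structure.
yy, xx = np.mgrid[0:64, 0:64]
database = [np.sin(0.5 * yy), np.sin(0.5 * xx), np.sin(0.3 * xx + 0.6 * yy)]
hole = np.zeros((64, 64), dtype=bool)
hole[24:40, 24:40] = True
query = database[1].copy()
query[hole] = 0.0  # the content to be filled is unknown
top_matches = rank_scenes(query, hole, database, k=2)
```

Returning the top `k` indices rather than a single winner matches the section's point that several candidate scenes are carried forward to compositing.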

For each good match, the missing content must be cut out and pasted in. The matched photograph is aligned to the query, the region covering the hole is identified, and the algorithm chooses where exactly to splice — finding a seam through the overlap zone where the borrowed patch and the original agree most closely. That seam is a graph cut ([@kwatra-2003]): the optimal boundary is the minimum-cost cut of a graph whose edge weights penalize visible mismatch, so the splice line threads through pixels where the two images already look alike. Then Poisson blending ([@perez-2003]) reconciles the residual difference — solving for an image whose gradients match the borrowed patch's interior while its values match the original at the boundary, so colour and brightness transition invisibly across the seam (Poisson image editing / Blending). Cut where the images agree, then blend away what disagreement remains.
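The blending half of this step can be sketched as a small gradient-domain solve. Treat this as an illustration under stated assumptions, not the solver used by Pérez et al. or by Scene Completion: `neighbour_sum` and `poisson_blend` are hypothetical names, the Jacobi iteration is chosen here only because it is short, the images are single-channel and already aligned, and `mask` (the region inside the seam) is assumed not to touch the image border. Inside the mask the code solves the discrete Poisson equation, so the result keeps the borrowed patch's Laplacian while the original image fixes the values on the boundary.

```python
import numpy as np

def neighbour_sum(a):
    """Sum of the four axis-aligned neighbours of every pixel
    (image edges are replicated)."""
    p = np.pad(a, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]

def poisson_blend(target, source, mask, iters=2000):
    """Fill `mask` in `target` with content whose gradients follow `source`.

    Assumptions of this sketch: 2-D float-compatible arrays of equal shape,
    boolean `mask` that stays clear of the image border, and enough Jacobi
    iterations for the masked region to converge."""
    target = target.astype(float)
    source = source.astype(float)
    # Discrete Laplacian of the borrowed patch: the detail to preserve.
    lap = 4.0 * source - neighbour_sum(source)
    f = target.copy()
    f[mask] = source[mask]  # initial guess: a plain cut-and-paste
    for _ in range(iters):
        # Jacobi update of 4*f_p - sum(f_q) = lap_p, applied inside the mask.
        update = (neighbour_sum(f) + lap) / 4.0
        f[mask] = update[mask]
    return f

# Toy usage: the borrowed patch is the original scene but 50 levels brighter.
rng = np.random.default_rng(0)
target = rng.uniform(0, 255, (20, 20))
source = target + 50.0
mask = np.zeros((20, 20), dtype=bool)
mask[6:14, 6:14] = True
blended = poisson_blend(target, source, mask, iters=500)
```

In the toy case the two images differ only by a constant brightness offset, so their gradients agree and the solve removes the offset entirely. A plain paste would have left a visible step of 50 levels at the seam. Jacobi converges slowly on large regions, so a practical implementation would likely use a sparse direct or multigrid solver instead.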

The last ingredient is a stance, not an algorithm: offer several answers, not one. Because many database scenes match the context, there is no single correct fill: a removed building could plausibly be replaced by any compatible building, a missing field by any compatible field. Rather than commit to one, Scene Completion composites the top handful of matches and presents several equally plausible completions side by side (Figure 11.5.1). Surfacing the genuine ambiguity is part of the result, and it is the same intellectual honesty as the heatmap-not-a-pin output of Pix 2 GPS: when the data admits many answers, report many answers.

**Figure 11.5.1.** Inpainting by retrieval, end to end. **Far left:** a photograph with a large hole (a removed building or person), the surrounding context shaded to show what the scene descriptor sees. **Centre-left:** the top contextually matching scenes pulled from a multi-million-image database by GIST: different real places that share the query's layout and horizon. **Centre-right:** for one match, the splice: the graph-cut seam threaded through the overlap where the two images agree, then Poisson-blended so colour and gradient cross the boundary invisibly. **Far right:** several finished completions, each borrowing a different real scene, visibly plausible and visibly different because the data admits more than one answer. The missing pixels were never in the image; they were in the collection.

11.5.3 The data is the prior, and its modern opposite

Step back from the mechanics and the philosophical claim is sharp. Scene Completion uses no parametric model and no training. What "scenes look like" is not learned into any weights; it is simply present, distributed across millions of photographs, and read out at query time by nearest-neighbour search. The prior is the database. This is why the result improves as the corpus grows, and why Hays and Efros found a step-change in quality between tens of thousands of images and millions: at small scale the nearest match is a poor fit and the seam betrays it, while at millions the nearest match is genuinely of a scene like yours and the composite is convincing. Scale is doing the work a learned prior would otherwise do, the same wager Pix 2 GPS makes for geolocation and Personalized priors makes for a single subject.
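The scaling argument can be made concrete with a toy experiment. This is only a sketch under strong assumptions and does not reproduce Hays and Efros's measurements: descriptors are drawn as independent Gaussian vectors (real scene descriptors are highly structured), and `nearest_distance_by_scale`, `dim`, and `seed` are hypothetical names chosen for illustration. It reports how far the best match is from a query as the collection grows.

```python
import numpy as np

def nearest_distance_by_scale(sizes, dim=8, seed=0):
    """Distance from one random query to its nearest neighbour, measured
    for nested random databases of the given sizes (toy descriptors)."""
    rng = np.random.default_rng(seed)
    database = rng.normal(size=(max(sizes), dim))
    query = rng.normal(size=dim)
    dists = np.linalg.norm(database - query, axis=1)
    # Each smaller database is a prefix of the larger one, so the best
    # match can only stay the same or improve as the collection grows.
    return {n: float(dists[:n].min()) for n in sizes}

for n, d in nearest_distance_by_scale([100, 10_000, 200_000]).items():
    print(f"{n:>7} images: nearest match at distance {d:.3f}")
```

Because the databases are nested, the nearest distance can never get worse with more data. How quickly it improves depends on the descriptor's effective dimensionality, which is one way to read the observation that the method needed millions of images before matches became convincing.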

The contrast with model-based inpainting (treated later, in the diffusion sections) is the cleanest way to see what data-as-prior buys and costs. Modern diffusion inpainting bakes the prior into network weights and synthesizes the fill: it iteratively denoises new, coherent pixels conditioned on the surrounding context, inventing content that was never photographed. Data-as-prior versus weights-as-prior: the two have opposite failure modes. Scene Completion borrows a real photograph, so its pixels are always individually plausible, but it may import the wrong specifics: a building of the right type but visibly the wrong building, complete with a stranger's window boxes. Diffusion invents coherent pixels tailored to this image, but they are a confident fabrication, plausible-looking content that no camera ever recorded. One retrieves truth that may not belong; the other manufactures belonging that is not true. Same task, opposite philosophy; that opposition is exactly why Scene Completion is worth meeting first, as the data-driven pole against which the model-driven era defines itself.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

| object | meaning (this chapter) | source |
| --- | --- | --- |
| GIST | the coarse scene descriptor (spatial-layout summary) computed on the context around the hole and matched against the database | Retrieval ([@oliva-torralba-2001]; Oliva & Torralba, 2001) |
| graph cut | the seam-finding step: minimum-cost boundary through the overlap, threaded where query and borrowed patch agree | Blending ([@kwatra-2003]; Kwatra et al., 2003) |
| Poisson blending | the seamless composite: gradient-domain solve matching the patch's interior gradients and the original's boundary values | Poisson image editing ([@perez-2003]; Pérez et al., 2003) |
