11.5 Inpainting Using Millions of Photographs
The previous sections built up a database as raw material — a wall of tiles in Photo Mosaics, a searchable index in Retrieval. This section spends that database on a single concrete task: fill a hole in a photograph. The classical way, covered in Inpainting and texture synthesis, synthesizes the missing region from the image's own texture. Hays and Efros's Scene Completion makes a startling substitution — fill the hole with pixels borrowed from another photograph of a similar scene — and it works precisely because, across a corpus of millions, a matching scene almost always exists. The result is the clean foil to today's model-based diffusion inpainting, and the ancestor of every "millions of photographs" method that follows in this part.
The data is the prior. A learned inpainting model encodes "what scenes look like" in its weights; Scene Completion encodes the same knowledge in the collection itself — no parametric model, no training, just a large enough pile of real photographs and a fast nearest-neighbour search. Scale does the work a model would otherwise do: at a few million images, the database has effectively already seen a plausible completion of almost any ordinary scene, and the algorithm's only job is to find it and paste it in cleanly. This is the wager Pix 2 GPS and Personalized priors each make in their own domain — the collection is the model — and the whole part is variations on it. The reason to state it loudly here is the contrast it sets up: when diffusion inpainting later puts the prior back into weights, you will be able to see exactly what was traded away.
11.5.1 Why self-similar inpainting isn't enough
Classical inpainting copies patches from within the same image. The texture-synthesis lineage — Efros & Leung (1999), then [@criminisi-2004] — grows the known texture inward across the hole, matching each new patch against the surrounding image. For a small hole in homogeneous texture this is exactly right: a scratch across a brick wall, a speck on a lawn, a removed power line against open sky all close seamlessly, because the pixels that belong in the hole genuinely are present nearby. The image is its own database, and for these cases it is a sufficient one.
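The self-similar idea can be sketched in a few lines. What follows is a minimal illustration, not the algorithm of either cited paper: it fills a single missing pixel by exhaustively searching the same image for the fully known patch that best matches the known pixels around the hole. The function name `best_source_patch`, the brute-force search, and the striped test image are all assumptions made for the example.

```python
import numpy as np

def best_source_patch(image, known, center, half=1):
    """Find the fully known patch that best matches the known pixels around `center`.

    image  : 2-D float array (grayscale)
    known  : 2-D bool array, True where pixels are valid (outside the hole)
    center : (row, col) of the target patch, which may overlap the hole
    """
    r, c = center
    target = image[r - half:r + half + 1, c - half:c + half + 1]
    valid = known[r - half:r + half + 1, c - half:c + half + 1]
    best_cost, best_pos = np.inf, None
    rows, cols = image.shape
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            window = known[i - half:i + half + 1, j - half:j + half + 1]
            if not window.all():
                continue  # source patches must lie entirely outside the hole
            source = image[i - half:i + half + 1, j - half:j + half + 1]
            # Sum of squared differences over the target's known pixels only.
            cost = np.sum(((source - target) ** 2)[valid])
            if cost < best_cost:
                best_cost, best_pos = cost, (i, j)
    return best_pos, best_cost

# Toy texture (an assumption for the demo): vertical stripes with period 4,
# with one pixel knocked out.
truth = np.tile(np.arange(12) % 4, (8, 1)).astype(float)
known = np.ones_like(truth, dtype=bool)
known[4, 5] = False
damaged = np.where(known, truth, 0.0)

pos, cost = best_source_patch(damaged, known, center=(4, 5))
filled = damaged.copy()
filled[4, 5] = damaged[pos]  # copy the centre pixel of the best match
```

On a repeating texture the search finds an exact match elsewhere in the image, which is the favourable case described above. The same search has nothing useful to return when the hole hides content that appears nowhere else in the frame.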
The method breaks the moment the hole is large or semantic. Remove a person standing in front of a building and the missing region is not more of some texture already on screen — it is the part of the building they were occluding, content that the image simply does not contain. No amount of copying from the visible pixels can invent the window, the doorway, the horizon line that belonged behind the subject. Self-similar synthesis can only ever extend what is there; it cannot supply what is absent.
The reframing is the whole idea. The pixels you need almost certainly exist — they are just in a different photograph, of a similar scene, taken by someone else. A thousand other people have photographed a building, a beach, a street corner, a mountain ridge like yours. So stop treating inpainting as synthesis ("invent the missing pixels from this image") and treat it as retrieval-and-paste ("find a real photograph whose content fits the hole, and borrow it"). The image's own pixels were the wrong database; the right one is the rest of the photographed world.
11.5.2 Scene completion from a huge database
The pipeline has three moves: retrieve, align-and-cut, blend (Figure 1). The first is the heart of it. Describe the image around the hole with a scene-level descriptor — Hays and Efros use GIST ([@oliva-torralba-2001]), a coarse summary of the scene's spatial layout (horizon, openness, dominant structures), deliberately blind to fine detail. Then search a database of millions of photographs for scenes whose GIST — computed on the context surrounding the hole — matches the query (Retrieval). You are not looking for the same object; you are looking for the same kind of scene: a matching horizon, a compatible scene type, a plausible global structure into which the borrowed patch can slot. Adding a coarse colour match to GIST tightens the candidates to scenes that will also composite without a jarring tonal jump.
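The retrieval move can be sketched as follows. This is a hedged illustration rather than the published pipeline: the descriptor here is a crude stand-in for GIST (mean gradient magnitude per cell of a coarse grid, with no oriented filter bank), and the names `layout_descriptor` and `retrieve`, the 4 x 4 grid, and the three synthetic "scenes" are assumptions made for the example. The one idea it tries to capture faithfully is that the query is compared only on context the hole did not destroy, by weighting each cell by its share of measurable pixels.

```python
import numpy as np

def layout_descriptor(image, known, grid=4):
    """Return (descriptor, weights): mean edge energy per grid cell, and the
    fraction of each cell that could be measured without touching the hole."""
    img = np.where(known, image, np.nan).astype(float)
    gy, gx = np.gradient(img)       # NaN spreads to pixels bordering the hole
    energy = np.hypot(gx, gy)
    valid = ~np.isnan(energy)
    h, w = img.shape
    bh, bw = h // grid, w // grid

    def pool(a):
        # Sum over each of the grid x grid cells.
        return a[:bh * grid, :bw * grid].reshape(grid, bh, grid, bw).sum(axis=(1, 3))

    counts = pool(valid.astype(float))
    sums = pool(np.where(valid, energy, 0.0))
    desc = sums / np.maximum(counts, 1.0)   # cells with no valid pixels get 0
    weights = counts / (bh * bw)            # ... and weight 0 in the distance
    return desc.ravel(), weights.ravel()

def retrieve(query, known, database, k=3, grid=4):
    """Rank database images by weighted descriptor distance to the query context."""
    q, w = layout_descriptor(query, known, grid)
    everything = np.ones_like(known, dtype=bool)
    dists = []
    for cand in database:
        d, _ = layout_descriptor(cand, everything, grid)
        dists.append(np.sum(w * (q - d) ** 2))
    order = np.argsort(dists)[:k]
    return order, np.asarray(dists)[order]

# Three synthetic 16 x 16 "scenes" (assumptions for the demo).
rows = np.arange(16, dtype=float)[:, None] * np.ones((1, 16))
cols = np.arange(16, dtype=float)[None, :] * np.ones((16, 1))
open_scene = np.maximum(rows - 7.0, 0.0)   # flat "sky" above, ramp below
flat_scene = np.zeros((16, 16))            # no structure at all
busy_scene = 4.0 * ((cols // 2) % 2)       # stripes everywhere
database = [open_scene, flat_scene, busy_scene]

known = np.ones((16, 16), dtype=bool)
known[6:12, 6:12] = False                  # the hole
order, dists = retrieve(open_scene, known, database, k=3)
```

With millions of images a linear scan like this would be replaced by an approximate nearest-neighbour index, and a colour term could be added to the distance in the same weighted form.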
For each good match, the missing content must be cut out and pasted in. The matched photograph is aligned to the query, the region covering the hole is identified, and the algorithm chooses where exactly to splice — finding a seam through the overlap zone where the borrowed patch and the original agree most closely. That seam is a graph cut ([@kwatra-2003]): the optimal boundary is the minimum-cost cut of a graph whose edge weights penalize visible mismatch, so the splice line threads through pixels where the two images already look alike. Then Poisson blending ([@perez-2003]) reconciles the residual difference — solving for an image whose gradients match the borrowed patch's interior while its values match the original at the boundary, so colour and brightness transition invisibly across the seam (Poisson image editing / Blending). Cut where the images agree, then blend away what disagreement remains.
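The blending step can be made concrete with a small solver. The sketch below assumes a grayscale image, a hole that does not touch the image border, and plain Jacobi iteration in place of a proper sparse solver; `poisson_blend` and the toy images are illustrative names and data, not part of any library. It solves the discrete Poisson equation inside the mask: each unknown pixel is set so that its Laplacian matches the source's, while pixels outside the mask stay fixed to the target.

```python
import numpy as np

def poisson_blend(target, source, mask, iters=2000):
    """Fill `mask` so the result's Laplacian matches `source` inside the mask
    while pixels outside stay equal to `target` (Dirichlet boundary).
    Assumes the mask does not touch the image border (np.roll wraps around)."""
    out = target.astype(float).copy()
    src = source.astype(float)
    # Discrete Laplacian of the source: 4*g_p minus the sum of its 4 neighbours.
    lap = (4 * src
           - np.roll(src, 1, 0) - np.roll(src, -1, 0)
           - np.roll(src, 1, 1) - np.roll(src, -1, 1))
    for _ in range(iters):
        neighbours = (np.roll(out, 1, 0) + np.roll(out, -1, 0)
                      + np.roll(out, 1, 1) + np.roll(out, -1, 1))
        # Jacobi update of 4*f_p - sum(f_q) = lap_p for pixels inside the mask.
        out[mask] = (neighbours[mask] + lap[mask]) / 4.0
    return out

# Toy data (assumptions for the demo): the borrowed photograph has the same
# structure as the true scene but is 50 grey levels brighter.
r, c = np.mgrid[0:8, 0:8]
truth = 3.0 * r + 2.0 * (c % 3)
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                      # hole, kept away from the border
target = np.where(mask, 0.0, truth)        # query photograph with the hole blanked
source = truth + 50.0                      # matched photograph, brighter exposure

pasted = np.where(mask, source, target)    # naive paste: a visible step at the seam
blended = poisson_blend(target, source, mask)
```

Because only the source's gradients are used, the constant brightness offset vanishes and the blended result agrees with the original at the boundary. The seam itself is not modelled here: in this sketch the mask plays the role of the region chosen by the cut.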
The last step is a stance, not an algorithm: offer several answers, not one. The three moves above produce one composite per matched scene; this step decides what to do with them. Because many database scenes match the context, there is no single correct fill: a removed building could plausibly be replaced by any compatible building, a missing field by any compatible field. Rather than commit to one, Scene Completion composites the top handful of matches and presents several equally plausible completions side by side (Figure 1). Surfacing the genuine ambiguity is part of the result, and it is the same intellectual honesty as the heatmap-not-a-pin output of Pix 2 GPS: when the data admits many answers, report many answers.
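Presenting several answers amounts to ranking the composites and keeping the best few. The sketch below assumes a simple additive score (retrieval distance plus a weighted seam cost); the `Completion` record, the `seam_weight` parameter, and all the numbers are hypothetical and chosen only to show the shape of the step.

```python
from dataclasses import dataclass

@dataclass
class Completion:
    name: str
    scene_distance: float  # descriptor mismatch from retrieval (lower is better)
    seam_cost: float       # cost of the chosen seam (lower is better)

def top_completions(candidates, k=3, seam_weight=1.0):
    """Return the k candidates with the lowest combined score.
    The additive score is an assumption made for this sketch."""
    def score(c):
        return c.scene_distance + seam_weight * c.seam_cost
    return sorted(candidates, key=score)[:k]

# Hypothetical candidates with made-up costs.
candidates = [
    Completion("beach_a", 0.20, 0.90),
    Completion("beach_b", 0.30, 0.10),
    Completion("harbour", 0.25, 0.40),
    Completion("forest", 0.90, 0.50),
]
shortlist = [c.name for c in top_completions(candidates, k=3)]
```

Note that the best scene match (`beach_a`) is not the best completion once the seam is taken into account, which is one reason to composite several matches before choosing what to show.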
11.5.3 The data is the prior, and its modern opposite
Step back from the mechanics and the philosophical claim is sharp. Scene Completion uses no parametric model and no training. What "scenes look like" is not learned into any weights; it is simply present, distributed across the millions of photographs, and read out at query time by nearest-neighbour search. The prior is the database. This is why the result improves as the corpus grows, and why Hays and Efros found a step-change in quality between tens of thousands of images and millions: at small scale the nearest match is a poor fit and the seam betrays it, while at millions the nearest match is genuinely of a scene like yours and the composite is convincing. Scale is doing the work a learned prior would otherwise do, the same wager Pix 2 GPS makes for geolocation and Personalized priors makes for a single subject.
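The effect of scale on nearest-neighbour quality can be seen even on synthetic data. The sketch below is only a geometric illustration: the "descriptors" are uniform random vectors, the dimension of 8 and the database sizes are arbitrary assumptions, and nothing here reproduces the experiment reported by Hays and Efros. It measures how far the closest database entry is from a query as the database grows.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8                                    # hypothetical descriptor length
database = rng.random((10000, dim))        # stand-in for scene descriptors
queries = rng.random((20, dim))            # stand-in for query contexts

def mean_nn_distance(queries, database):
    """Average Euclidean distance from each query to its closest database entry."""
    diffs = queries[:, None, :] - database[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return float(dists.min(axis=1).mean())

sizes = [100, 1000, 10000]
# Each smaller database is a prefix of the larger one, so adding images can
# only bring the nearest neighbour closer, never push it away.
nn_distance = [mean_nn_distance(queries, database[:n]) for n in sizes]
```

The distance to the nearest neighbour shrinks as entries are added, but slowly, and more slowly still in higher dimensions. That slow decay is consistent with the observation above that small corpora yield poor matches and that very large ones are needed before the nearest scene is a convincing fit.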
The contrast with model-based inpainting, treated later in the diffusion sections, is the cleanest way to see what data-as-prior buys and costs. Modern diffusion inpainting bakes the prior into network weights and synthesizes the fill: it iteratively denoises new, coherent pixels conditioned on the surrounding context, inventing content that was never photographed. Data-as-prior and weights-as-prior have opposite failure modes. Scene Completion borrows a real photograph, so its pixels are always individually plausible, but it may import the wrong specifics: a building of the right type but visibly the wrong building, complete with a stranger's window boxes. Diffusion invents coherent pixels tailored to this image, but they are a confident fabrication, plausible-looking content that no camera ever recorded. One retrieves truth that may not belong; the other manufactures belonging that is not true. Same task, opposite philosophy, and that opposition is exactly why Scene Completion is worth meeting first, as the data-driven pole against which the model-driven era defines itself.