💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

11.1 Photo Mosaics

A photomosaic is built from one simple instruction repeated over a grid. Take the target image — a portrait, say — and downsample it to a coarse grid of cells, each cell a single average colour. Then, for every cell, reach into a tile library of thousands of small photographs and pick the one whose appearance best matches that cell. Lay the chosen tiles down in place of the cells, and the assembled wall of photographs reads, from far enough away, as the original target. Up close it dissolves back into hundreds of unrelated snapshots; from across the room it snaps into the portrait. That two-distances-two-images effect is the whole point, and it is engineered, not accidental.

The matching at the heart of it is a nearest-neighbour search over an image database — the same primitive that, scaled up and dressed in learned features, drives every method in this part (Retrieval). For each target cell you compute a descriptor and ask the tile library for the photo whose descriptor is closest. The simplest descriptor is a single mean colour: average each library tile down to one RGB value, average each target cell the same way, and match by nearest colour in RGB space. A slightly richer descriptor is a small low-resolution thumbnail of each tile — a handful of pixels — matched by sum-of-squared-differences against the equally-downsampled cell, so that not just the average colour but the coarse internal layout (a dark top, a bright bottom) lines up. Either way the recipe is identical: downsample the target, describe each cell, and run a per-cell nearest-neighbour query against the tile library (Figure 1).

The original popular photomosaics are Robert Silvers and Michael Hawkins's Photomosaics (around 1996), and the technique was given an algorithmic treatment by Finkelstein and Range; from there it became a standard teaching case for nearest-neighbour matching over an image collection. What makes it such a good first example is honesty about where the difficulty lives: the matching is trivial, and almost all of the craft is in everything that surrounds it — colour correction, repeat avoidance, and the multi-scale layout — which is exactly the texture of real collection-scale image problems.

fig-photomosaic
Figure 11.1.1. A target portrait rendered as a photomosaic, three readings in one figure. Full view (left): the assembled grid of tile photographs, which reads cleanly as the target face from a distance. Zoom inset (centre): a magnified corner showing that every "pixel" of the portrait is in fact an individual, unrelated photograph. Matching panel (right): one target cell shown beside its computed mean colour, and the nearest-neighbour tile pulled from the library by that colour — the per-cell search drawn out explicitly, with a few runner-up tiles to show the candidates it chose among.

11.1.1 The tiling-and-matching pipeline

Strip the pipeline to its bones and it is three steps. First, downsample the target to the grid resolution — one descriptor per cell, where the grid spacing is the tile size in the final image. Second, describe each cell: mean colour for the cheap version, a small thumbnail descriptor for the better one. Third, match each cell to the library by nearest neighbour under that descriptor, and paste the winning tile into the cell. The coarseness of the target downsample is the single most important knob — it sets how many tiles span the image, and therefore the viewing distance at which the illusion fuses.

The descriptor choice is the one genuine design decision, and it is a miniature of the descriptor question that runs through all of Retrieval. A mean colour is one number per channel and matches in constant time per comparison; it is blind to everything but average tone, so a tile of a red barn and a tile of a red sunset are interchangeable. A low-resolution thumbnail descriptor — say a $4\times4$ or $8\times8$ downsample — keeps a little spatial structure, so the chosen tile shares not just the cell's colour but its rough gradient, which markedly improves the read at intermediate distances. Richer still, one could match on a small texture or edge descriptor, but for photomosaics the returns flatten quickly: the eye is going to throw the tile detail away anyway.

Two implementation details matter in practice. The library tiles should be pre-cropped to the cell aspect ratio and pre-summarised to their descriptors once, up front, so the per-cell query is a fast lookup rather than a re-computation. And because the target cell and the tile must be compared on equal footing, both are reduced to the same descriptor resolution before the distance is taken — exactly the downsample-then-compare discipline that keeps the comparison meaningful.

11.1.2 Colour correction and avoiding repeats

Pure nearest-neighbour matching has two visible failure modes, and the craft of a good photomosaic is in fixing both. The first is colour mismatch: even the best-matching tile is rarely the exact colour the cell wants, and a grid of slightly-off tiles looks muddy. The standard remedy is a per-tile colour correction — gently shift each placed tile's colour toward its target cell's colour, by tinting or a mild channel scaling, so the assembled grid carries the target's colour field faithfully while each tile stays recognisably a photograph. There is a dial here: correct too little and the target washes out; correct too much and the tiles all bleed toward flat colour patches and lose their identity as photos. The pleasing setting is a partial nudge, not a full repaint.

The second failure mode is repetition. With a finite library and a greedy nearest-neighbour rule, large flat regions of the target — a clear sky, a cheek — will all match the same best tile, producing conspicuous blocks of identical photographs that shatter the illusion of variety. The fix is to forbid or penalise reuse: once a tile is placed, remove it from the candidate pool, or down-weight it for nearby cells, so the search is pushed to its second- and third-best choices and the grid stays visually diverse. This turns the per-cell independent nearest-neighbour search into a mildly constrained assignment — still cheap, but no longer purely greedy — and it is the difference between a photomosaic that looks crafted and one that looks like a tiling bug.

11.1.3 Multi-scale and irregular tilings

A uniform grid spends the same number of tiles on a featureless sky as on a detailed eye, which is backwards: the eye needs fine tiles to render its structure, the sky needs almost none. Multi-scale tilings fix this by varying the tile size across the image — small tiles where the target has high-frequency detail, large tiles where it is smooth — typically driven by a quadtree that subdivides only where the target's local variance is high. The result spends the tile budget where the perception actually needs it, sharpening faces and edges without wasting photographs on flat fields.

The same logic pushes toward irregular tilings that abandon the square grid altogether — packing tiles of varying shape, or jittering their placement, so the assembled image avoids the tell-tale regular lattice and reads more like a continuous picture. These are refinements on the same core operation, not new algorithms: each cell, whatever its size or shape, is still described and matched by nearest neighbour against the library. The multi-scale machinery simply decides which cells to ask about, leaving the matching engine untouched — a clean separation of "where to look" from "what to match," which recurs throughout the part.

11.1.4 Why it resolves into the target at a distance

The reason a photomosaic works is not in the algorithm at all — it is in the viewer. Human vision is multi-scale and effectively low-pass at a distance: as you step back, the fine detail of each tile shrinks below your eye's resolving power and is averaged away, leaving only the coarse spatial-frequency content — and that coarse content is, by construction, the downsampled target (Human vision). The visual system's contrast sensitivity function (CSF) rolls off at high spatial frequencies, so the high-frequency tile texture contributes nothing to the percept at viewing distance; what survives is exactly the low-frequency colour field the matching put there. The two-distances-two-images effect is therefore a designed exploitation of how the eye integrates across scale.

This is why the downsample resolution and the intended viewing distance are two ends of the same parameter. A coarse grid fuses at close range but is crude; a fine grid demands you step further back before the tiles blur into the target, but renders it more faithfully. It also explains why mean-colour matching, crude as it is, suffices: the only tile property that survives to the fused percept is its average colour, so matching average colour is matching precisely the quantity that will matter. Everything else about the tile is a near-distance reward — the pleasure of discovering it is a real photograph — that the far-distance percept simply discards.

💡 The big lesson

A huge photo collection is raw material, and the engine that mines it is nearest-neighbour matching over a descriptor. The photomosaic is this idea at its most playful and most legible: a tiny descriptor (mean colour), a per-cell nearest-neighbour query, and a wall of photographs that resolves into a target because human vision is low-pass at a distance. Hold onto the shape of it, because the rest of the part is the same move at larger scale and higher stakes — swap the mean-colour descriptor for a learned embedding, swap the tile library for millions of internet photos, and swap "fill this cell" for "complete this scene," "locate this photo on Earth," or "restore this exact person." The collection-as-material creed and the find-the-nearest-neighbour engine never change; only the descriptor, the database size, and the consequence do.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 The big lesson

A huge photo collection is raw material, and the engine that mines it is nearest-neighbour matching over a descriptor. The photomosaic is this idea at its most playful and most legible: a tiny descriptor (mean colour), a per-cell nearest-neighbour query, and a wall of photographs that resolves into a target because human vision is low-pass at a distance. Hold onto the shape of it, because the rest of the part is the same move at larger scale and higher stakes — swap the mean-colour descriptor for a learned embedding, swap the tile library for millions of internet photos, and swap "fill this cell" for "complete this scene," "locate this photo on Earth," or "restore this exact person." The collection-as-material creed and the find-the-nearest-neighbour engine never change; only the descriptor, the database size, and the consequence do.