💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

11.2 Retrieval

The previous section built one image out of thousands of small ones, picking each tile by a nearest-neighbour search over a photo library (Photo Mosaics). That tile search is a toy instance of the problem this section makes general: content-based image retrieval — finding images by what they look like, not by a filename or a tag. The recipe never changes in spirit. Turn each image into a point in some space, build an index over the collection, and answer a query by returning its nearest neighbours under a distance that means dissimilarity. What changes across three decades is only the two ingredients: how you turn an image into a point, and how you find the neighbours fast. The classic systems borrow their space and their index wholesale from text search; the deep systems learn the space; the web-scale systems approximate the search. One idea, applied at ever-larger scale.

💡 The big lesson — distance is similarity; the rest is the space and the index

Retrieval is one move repeated: embed an image as a vector so that nearness means likeness, then find the nearest vectors. Every system in this section is a different bet on the two free choices. The space runs from a hand-built colour/texture histogram, through a bag-of-visual-words vector, to a learned CLIP embedding — each more robust to viewpoint, lighting, and clutter than the last, and CLIP's so robust it spans modalities, letting a sentence and a photo be neighbours. The index runs from a brute-force scan, through a text-style inverted file, to an approximate nearest-neighbour graph that searches billions of vectors in milliseconds. Choose the space well and dissimilar things land far apart; choose the index well and you find the near ones without looking at the far ones. Hold this in view — it is also exactly why retrieval is the quiet engine under every "millions of photos" method in the rest of the part: each one just plugs a different descriptor into the same nearest-neighbour query.

11.2.1 Classic CBIR — histograms and the text-retrieval analogy

The earliest systems took the most literal reading of "matching by appearance": describe a whole image by a single global descriptor — a histogram of its colours, or of its textures — and rank database images by histogram distance. The appeal is that such descriptors are cheap and naturally tolerant of translation and rotation: shuffle the pixels of a red barn and its colour histogram is unmoved. The flaw is the same property read as a weakness — a global histogram is blind to layout, so a red barn and a red sunset collide, and two utterly different scenes that happen to share a palette score as twins. These systems set the template that everything below inherits — image → vector → nearest neighbour — but they had little power to discriminate among images, and so they retrieved by vibe rather than by content.

The breakthrough was to stop describing the whole image and start describing its parts, by stealing the entire architecture of text search. This is bag-of-visual-words, introduced for images by Sivic & Zisserman's Video Google (2003), and the analogy is exact enough to be worth drawing out (Figure 1). Extract local features from the image — the SIFT-style keypoints and descriptors of Sparse matching. Quantize each descriptor against a fixed vocabulary of cluster centres (learned once, by k-means over a sample of descriptors), so that each local feature collapses to the index of its nearest centre: a visual word. An image is now a bag of visual words, structurally identical to a text document as a bag of terms, and the whole text-retrieval toolbox transfers verbatim. Build an inverted index — a map from each visual word to the list of images containing it — so a query touches only the images that share a word with it, never the whole collection.

The other import from text is the weighting, and it is the section's first equation. Not all words are equally informative: a visual word that appears in nearly every image (the analogue of "the" or "and") says almost nothing about which image you want, while a rare, distinctive word is strong evidence. TF-IDF weighting encodes exactly this intuition, multiplying a word's frequency within an image by the logarithm of its rarity across the corpus:

$$ w_{t,d}=\mathrm{tf}_{t,d}\cdot\log\frac{N}{n_t}, $$

where $\mathrm{tf}_{t,d}$ is the count of visual word $t$ in image $d$, $N$ is the total number of images, and $n_t$ is the number of images that contain $t$ at all. The first factor rewards a word that is prominent here; the second, the inverse document frequency, down-weights a word that is common everywhere. With each image reduced to a TF-IDF-weighted, sparse word vector, a query is answered by cosine similarity — the dot product of the two vectors normalised by their lengths,

$$ \cos\theta=\frac{\mathbf q\cdot\mathbf d}{\lVert\mathbf q\rVert\,\lVert\mathbf d\rVert}, $$

which measures the angle between the query and a candidate, ignoring how many features each happens to have. Because the vectors are sparse and the inverted index has already narrowed the field, this is a fast sparse dot product, not a dense scan — text search, reused pixel-for-concept.

A flat vocabulary of a few thousand words is fine for a few thousand images but far too coarse for millions, where every word collides too often to discriminate. Nistér & Stewénius (2006) scaled the vocabulary itself with a vocabulary tree: recursive k-means partitions descriptor space hierarchically, yielding an effective vocabulary of millions of words, and — crucially — quantizing a new descriptor takes only a logarithmic walk down the tree rather than a comparison against every centre. That single change made real-time recognition over enormous databases practical, and it is the move that turned bag-of-words from a clever analogy into a deployable system.

One thing the bag-of-words throws away comes back to bite it: where the features were. By design the bag discards all geometry, so an image that shares many words with the query scattered at random scores the same as one whose words are arranged the same way — a true match and a coincidence are indistinguishable on word counts alone. The fix is spatial verification, due to Philbin et al. (2007) in their Oxford-buildings work: take the top-ranked candidates from the bag-of-words stage and re-rank them with a geometric check. Fit a transform between the matched words of query and candidate, count how many matches are inliers to that transform using RANSAC (Robustness: the ratio test and RANSAC), and keep only candidates whose shared words are spatially consistent with the query (Figure 2). Bag-of-words proposes cheaply; geometry confirms strictly — the same cheap-propose-then-verify pattern as the ratio-test-then-RANSAC pipeline of Sparse matching.

fig-bag-of-visual-words
Figure 11.2.1. The text-retrieval analogy, drawn side by side. Left, an image: local features (Sparse matching) extracted from a photo, each descriptor quantized to its nearest centre in a learned vocabulary of visual words — so the image becomes a histogram over words, exactly a document's bag of terms. Right, the index: an inverted index mapping each visual word to the list of images that contain it, with words weighted by TF-IDF ($w_{t,d}=\mathrm{tf}_{t,d}\log(N/n_t)$) so distinctive words count more and ubiquitous ones less. A query image is reduced to its own word vector and scored against the index by cosine similarity — a sparse dot product. The panel makes the central conceit explicit: an image, indexed and ranked, is just a document.

11.2.2 Deep retrieval — learned embeddings and CLIP

The hand-built descriptors of the classic era are robust to what their designers anticipated — an additive brightness offset, a rotation, a modest zoom — and brittle to everything else. A change of viewpoint that re-arranges occlusions, a shift from day to night, heavy clutter: these break a colour histogram and strain even a bag of visual words. The deep era's answer is to stop designing the space and learn it. Train a network so that visually or semantically similar images land close together in a fixed-length vector — typically by metric or contrastive learning, which explicitly pulls matching pairs together and pushes mismatched pairs apart (Deep learning). Once trained, retrieval is again plain nearest-neighbour, now in the learned embedding space — and the embedding absorbs the invariances that the hand-built descriptors could only approximate. This is the section's recurrence of L8: the learned operator swaps a hand-designed prior (the histogram, the fixed vocabulary) for one fit from data.

The landmark is CLIP (Radford et al. 2021), and its move is to make the embedding cross-modal. CLIP trains two encoders together — one for images, one for text — on hundreds of millions of image-caption pairs scraped from the web, with a contrastive objective that places a photo and its matching caption near each other in one shared space. The consequence is striking: because a sentence and a matching photo are now neighbours, you can search an image collection with words. "A dog on a skateboard," typed as a query, embeds to a point, and the nearest images to that point are dogs on skateboards — with no tags, no filenames, no manual annotation anywhere in the loop. Cross-modal retrieval like this is the quiet engine behind much of modern semantic image search, and the same shared text-image space later conditions generative models, letting a text prompt steer what an image model produces.

It is worth naming what the learned embedding gives up, because the classic pipeline still has a niche. CLIP-style embeddings excel at semantic likeness — "these are both photos of a beach at sunset" — but they are not built to certify that two images are the same physical scene down to the geometry. For that exact-instance question — is this the same Oxford building, the same painting — the bag-of-words-plus-spatial-verification pipeline, with its explicit geometric inlier check, remains sharper. In practice the modern systems combine them: a learned embedding for a fast, robust semantic shortlist, then a geometric verification for instance certainty.

11.2.3 Retrieval at scale — approximate nearest neighbour

Whatever the space — histogram, bag-of-words, or CLIP embedding — the query is the same nearest-neighbour search, and at web scale the search itself is the bottleneck. With billions of vectors in the index, comparing a query against every one of them exactly is hopeless; a single brute-force scan would take far longer than any user will wait. So the entire field runs on approximate nearest-neighbour (ANN): give up the guarantee of finding the exact closest vector in exchange for finding a very probably close one, orders of magnitude faster. The trade is a sliver of recall for an enormous gain in speed, and the recall lost is usually invisible — the second-nearest neighbour is, by construction, almost as good as the first. The standard tools are an inverted-file index paired with product quantization (the basis of FAISS, the billion-scale library of Johnson et al. 2019), navigable-graph indexes like HNSW, and locality-sensitive hashing; the mechanics live in Misc: fast matching.

That one fast query, with the right descriptor plugged in, is everything the rest of the part runs on, and it is worth seeing the same machine drive each application (Figure 2). Ranking database images by similarity to a query is reverse image searchfind where this picture came from — when the database is the indexed web. Make the descriptor sensitive enough to compression, resizing, and watermarking and the same query becomes near-duplicate detection — finding the same shot re-encoded, the foundation of copyright and provenance matching, and of the burst-grouping step in Auto curation. Swap in a scene-level descriptor and retrieve from a corpus of millions and you have the heart of Inpainting Using Millions of Photographs; index the corpus by geotag and the same neighbours' coordinates answer where on Earth was this taken in Pix 2 GPS. The descriptor and the index change; the query — return the nearest neighbours of this point — does not.

fig-image-retrieval
Figure 11.2.2. One query, a ranked row of neighbours. A query image on the left; to its right, the database images returned in order of decreasing similarity — the nearest neighbour first, then progressively less similar matches trailing off. An inset shows the two-stage pipeline behind the order: a fast bag-of-visual-words / embedding retrieval produces the shortlist by cosine similarity, then spatial verification (RANSAC inlier count between matched features) re-ranks the top candidates so that geometrically consistent matches rise and coincidental word-collisions fall. The ranked strip is the universal output: every "millions of photos" method downstream is this same query with a different descriptor.

Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 The big lesson — distance is similarity; the rest is the space and the index

Retrieval is one move repeated: embed an image as a vector so that nearness means likeness, then find the nearest vectors. Every system in this section is a different bet on the two free choices. The space runs from a hand-built colour/texture histogram, through a bag-of-visual-words vector, to a learned CLIP embedding — each more robust to viewpoint, lighting, and clutter than the last, and CLIP's so robust it spans modalities, letting a sentence and a photo be neighbours. The index runs from a brute-force scan, through a text-style inverted file, to an approximate nearest-neighbour graph that searches billions of vectors in milliseconds. Choose the space well and dissimilar things land far apart; choose the index well and you find the near ones without looking at the far ones. Hold this in view — it is also exactly why retrieval is the quiet engine under every "millions of photos" method in the rest of the part: each one just plugs a different descriptor into the same nearest-neighbour query.