💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
💡 In a hurry? Jump to this chapter’s 1 big lesson ↓

11.8 Pix 2 GPS

IM2GPS asks an audacious question — where on Earth was this photograph taken? — and answers it with nothing but the pixels and a giant geotagged collection. No EXIF GPS, no named landmark, no map: describe the query by its visual content, retrieve the photographs that look most like it from a database of millions of geotagged images, and let their coordinates vote on where the query was shot. The planet itself becomes the model. The crucial move — and the reason the result is honest rather than a parlour trick — is that the answer is a distribution, not a point: a nondescript field or beach stays geographically vague, while distinctive architecture, vegetation, or signage pins the location tightly. This is the geolocation sibling of Inpainting Using Millions of Photographs (Hays & Efros 2008, "IM2GPS"), and it makes the same bet as the rest of the part — that at sufficient scale, the collection is the prior.

💡 The big lesson

Location is a distribution recovered from data, not a coordinate computed from a model. IM2GPS never builds a parametric model of "what places look like" and never regresses a single latitude and longitude. It does something humbler and more honest: it asks the geotagged web "which photos look like this one, and where were they taken?", then reports the spread of those answers as a probability map over the globe. The spread is not a failure to commit — it is the result. A generic scene spreads the mass across every plausible region; a distinctive one concentrates it. The same data-as-prior creed that lets Inpainting Using Millions of Photographs borrow a real patch lets IM2GPS borrow a real coordinate — and the same creed, aimed at a single person rather than the planet, reappears in Personalized priors. Hold this in view: the technique is retrieval, the output is a posterior, and the model is the crowd's photographs.

11.8.1 Geolocation as retrieval over a geotagged corpus

The whole method is Retrieval pointed at a corpus that happens to carry coordinates. Describe the query image $I$ with the same kind of global, scene-level descriptors that classic content-based retrieval uses — a mix of colour histograms, texture statistics, and a coarse scene/layout descriptor (the GIST-style "gist of a scene" that captures whether you are looking at a coastline, a street, a forest, a mountain). Then find the query's nearest visual neighbours among millions of geotagged Flickr photos. Nothing in this step knows anything about geography; it only knows what the query looks like. The geography enters entirely through the neighbours' attached GPS tags, which become the evidence for where the query was taken.

That re-framing is the clever part. There is no map in the original system, no per-place classifier, no list of landmarks to match against. The database's empirical "what places look like" — accumulated, for free, from millions of strangers tagging their holiday snapshots with locations — does all the work. This is exactly the wager of Inpainting Using Millions of Photographs: scale substitutes for a learned model. If the collection is large enough, then for almost any query some genuinely similar scene has already been photographed and geotagged, and the right answer is sitting in the database waiting to be retrieved. The descriptors and the fast nearest-neighbour search are the engine; the coordinates are just metadata that rides along.

It is worth being honest about what the visual features can and cannot do. They are good at scene type and texture — Mediterranean stucco versus Alpine timber, desert scrub versus tropical canopy, the colour cast of a particular sky — and at coarse layout. They are not, in this classic form, recognising a specific building by its facade (that is the landmark-recognition problem, which leans on the local-feature, spatial-verification side of Retrieval). IM2GPS deliberately works at the level of scene appearance, which is why it generalises to anonymous places that no landmark recogniser would ever have catalogued.

11.8.2 The answer is a distribution

Once you have the query's nearest neighbours, you do not pick the single best match and copy its coordinate — that would throw away the most useful thing the method knows, which is how confident to be. Instead, aggregate the retrieved neighbours' coordinates into a probability distribution over the globe: a kernel (Parzen) density estimate that places a soft bump at each neighbour's location and sums them, so the posterior is

$$ p(\text{lat},\text{lon}\mid I) \;\propto\; \sum_{j \in \text{NN}(I)} w_j \; K\big((\text{lat},\text{lon}) - (\text{lat}_j,\text{lon}_j)\big), $$

a weighted sum of kernels $K$ centred on the geotags of the visual neighbours $\text{NN}(I)$. Where many similar-looking photos cluster geographically, the bumps pile up into a sharp peak; where the neighbours scatter across continents, the mass smears out (Figure 1). The map, not a pin, is the deliverable.

And the spread is the point, not an embarrassment. Uncertainty is the result. A photo of a generic sandy beach is consistent with every warm coastline on Earth, so the honest output is a diffuse band across the tropics and Mediterranean — and reporting that diffuseness is more useful than confidently guessing one wrong beach. A photo carrying distinctive cues — a particular style of architecture, a writing system on a shop sign, an unmistakable vegetation or terrain — collapses the mass onto a small region. Reading the shape of the distribution tells you both the best guess and, just as importantly, how much to trust it and where the plausible alternatives lie. This is the same humility that the rest of the book prizes: a method that reports its own confidence beats one that always answers with false precision.

fig-im2gps
Figure 11.8.1. From one photo to a probability map of where on Earth it was taken. Left: a query photograph, with no GPS and no named landmark — just visual content (architecture, vegetation, sky, signage). Middle: its nearest visual neighbours retrieved from a database of millions of geotagged Flickr photos, ranked by appearance similarity — each carrying the coordinate where it was taken. Right: those coordinates aggregated by a kernel density estimate into a world-map heatmap of likely locations. Two cases are shown stacked: a distinctive scene whose neighbours agree, concentrating the mass into a tight hot spot; and a generic scene (a plain beach or field) whose neighbours scatter, smearing the mass diffusely across every plausible region. The heatmap — not a single pin — is the output, and its tightness reports the method's confidence.

11.8.3 The learned successors

The classic IM2GPS recipe is explicit retrieval: hand-built descriptors, a nearest-neighbour search, a kernel estimate over the neighbours' tags. The deep era keeps the task and the distribution-over-the-globe output but swaps the mechanism. PlaNet (Weyand, Kostrikov & Philbin 2016) reframes geolocation as classification: partition the surface of the Earth into a few thousand discrete geo-cells (smaller cells where photos are dense, larger where they are sparse), and train a deep network to classify a photo directly into a cell. The network learns the planet's appearance end-to-end from the geotagged corpus rather than relying on hand-designed colour and texture features, and its softmax over cells is the probability map over the globe — the same honest, distributional output as IM2GPS, now produced by learned weights instead of an explicit neighbour search.

The two approaches are the geolocation instance of a pattern that recurs across this book: data-as-prior (retrieve a real coordinate from real photographs) versus weights-as-prior (bake the planet's appearance into a trained network). They share IM2GPS's founding insight — that location is recoverable from visual content and is best reported as a distribution — and they share its dependence on a massive geotagged collection. What changed is only the operator: a hand-built descriptor and a nearest-neighbour vote gave way to a learned classifier (Deep learning). The problem Hays and Efros posed is intact; the engine under it was neuralised, exactly as it was for Retrieval's embeddings and for so much else in this part.


Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 The big lesson

Location is a distribution recovered from data, not a coordinate computed from a model. IM2GPS never builds a parametric model of "what places look like" and never regresses a single latitude and longitude. It does something humbler and more honest: it asks the geotagged web "which photos look like this one, and where were they taken?", then reports the spread of those answers as a probability map over the globe. The spread is not a failure to commit — it is the result. A generic scene spreads the mass across every plausible region; a distinctive one concentrates it. The same data-as-prior creed that lets Inpainting Using Millions of Photographs borrow a real patch lets IM2GPS borrow a real coordinate — and the same creed, aimed at a single person rather than the planet, reappears in Personalized priors. Hold this in view: the technique is retrieval, the output is a posterior, and the model is the crowd's photographs.