11.3 Auto curation

You took four thousand photos on vacation. You want the twelve keepers. Auto-curation is the machine doing the culling. It scores each shot for technical quality (is it sharp, is it well exposed, are the eyes open) and for learned aesthetics, groups the near-duplicate bursts and keeps only the best of each, then assembles a small, diverse highlight set. It is the selection problem that a firehose of images forces on anyone who shoots digitally, and it sits one rung above retrieval (Retrieval): retrieval finds the images that match a query, curation finds the ones that are good and discards the rest. The pipeline reads like a triage. Reject the technically broken (blurry, blown-out, blinked), then score the survivors for appeal, collapse the near-duplicate bursts to their best member, and finally choose a set that is both good and varied.

**Figure 11.3.1.** A burst becomes a single keeper. A row of near-duplicate shots of one moment (the same group, the same sunset, taken a second apart), each annotated with two scores: a **technical-quality** score (sharpness, exposure, eyes-open) and a learned **aesthetics** score. The pipeline groups the row as one near-duplicate cluster, keeps the **single best member** (sharp, well-exposed, everyone's eyes open and smiling), and dims the rest as discarded. A side panel shows the failures the early stages catch (a motion-blurred frame, a blown-out highlight clip, a mid-blink face), flagged before any question of taste enters.

11.3.1 Technical quality: the easy rejects

The first stage of the triage spends no taste at all. It asks only whether the frame is measurably broken, and the appeal of starting here is that these judgments have ground truth: a motion-blurred frame is objectively blurred, a clipped highlight is objectively clipped, and you can flag both with a cheap, interpretable score before any subjective question enters. Sharpness is the canonical example — a blurred photo has little high-frequency content, so the energy in its gradients, or the variance of its Laplacian, falls; a sharp one spikes. Exposure is read off the histogram: mass piled against zero means crushed shadows, mass piled against the maximum means blown highlights, and either is a defect the camera cannot recover. Noise estimation rounds out the trio. None of these requires a learned model; they are signal statistics, and they earn their place at the front of the pipeline precisely because they are certain and nearly free to compute.
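The sharpness and exposure checks described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library on tiny synthetic frames, not a production detector: the function names, the clipping bounds `low` and `high`, and the thresholds `min_sharpness` and `max_clipped` are all assumptions chosen for the example, and real thresholds would need tuning per camera and resolution.

```python
from statistics import pvariance

def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a list of rows of grayscale intensities in 0..255.
    Blurred frames have weak gradients, so this value drops.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)

def clipped_fractions(gray, low=5, high=250):
    """Fractions of pixels piled against the dark and bright ends."""
    pixels = [p for row in gray for p in row]
    n = len(pixels)
    shadows = sum(p <= low for p in pixels) / n
    highlights = sum(p >= high for p in pixels) / n
    return shadows, highlights

def is_technical_reject(gray, min_sharpness=50.0, max_clipped=0.25):
    # All thresholds here are illustrative assumptions, not tuned values.
    shadows, highlights = clipped_fractions(gray)
    return (laplacian_variance(gray) < min_sharpness
            or shadows > max_clipped
            or highlights > max_clipped)

# Synthetic 8x8 frames: a hard checkerboard (lots of detail), a flat grey
# frame (no detail at all), and a checkerboard with half its pixels at 255.
sharp = [[200 if (x + y) % 2 else 50 for x in range(8)] for y in range(8)]
flat = [[128] * 8 for _ in range(8)]
half_blown = [[255 if (x + y) % 2 else 100 for x in range(8)] for y in range(8)]

print(is_technical_reject(sharp))       # False
print(is_technical_reject(flat))        # True (no high-frequency content)
print(is_technical_reject(half_blown))  # True (too many clipped highlights)
```

Note that the sharpness score depends on scene content as well as focus, so a threshold like this is best compared across frames of the same scene, such as the members of one burst.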

Faces are special, and they get their own detectors, because a frame can be flawless by every signal statistic and still be a reject the instant you look at the people in it. The two cues that matter most are eyes-open / blink detection and smile / expression scoring. A technically perfect group portrait with one person caught mid-blink is unusable, and no amount of sharpness or exposure quality rescues it — which is exactly why the "best shot" features on phone cameras lean so heavily on these face cues. Google's Top Shot and Apple's equivalent both quietly run blink and smile detectors over the moments around the shutter press, picking the frame where everyone's eyes are open and the expressions are good (Figure 11.3.1). The lesson of this stage is that a great deal of culling is just defect rejection, and defect rejection is the part of curation that is genuinely solved — interpretable, fast, and not a matter of opinion.
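As a rough sketch of how face cues might pick a frame from a burst, the example below scores each frame by its worst face, on the reasoning that one blink spoils a group shot. It does not reproduce any phone vendor's method. The per-face probabilities are assumed to come from some upstream face-attribute detector that is not shown, and the 0.7 / 0.3 weighting and the function names are hypothetical.

```python
def frame_face_score(faces):
    """Score a frame by its weakest face.

    `faces` is a list of (eyes_open_prob, smile_prob) pairs, assumed to be
    produced by an upstream detector. The weights are illustrative only.
    """
    if not faces:
        return 0.0  # assumption: no faces means no face-based preference
    return min(0.7 * eyes + 0.3 * smile for eyes, smile in faces)

def best_frame(burst):
    """Index of the frame whose worst face looks best."""
    return max(range(len(burst)), key=lambda i: frame_face_score(burst[i]))

# Three frames of the same two people, with made-up detector outputs.
burst = [
    [(0.98, 0.90), (0.05, 0.80)],  # second person caught mid-blink
    [(0.95, 0.60), (0.97, 0.70)],  # everyone's eyes open
    [(0.99, 0.95), (0.40, 0.90)],  # second person half-blinking
]

print(best_frame(burst))  # 1
```

Taking the minimum rather than the mean is the design choice that matters: a mean would let one excellent face hide another person's blink.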

11.3.2 Aesthetics: the hard, learned part

A photo can clear every technical hurdle and still be boring. Sharp, well-exposed, eyes open — and utterly forgettable. The second stage of the triage is the one that tries to predict appeal, and unlike the first it has no clean ground truth, so it is learned from large datasets of human-rated photographs. The reference corpus is AVA (Murray, Marchesotti & Perronnin 2012), a quarter-million images each carrying a distribution of scores from a photography-contest community. The standout method is NIMA (Talebi & Milanfar 2018), and its key move is subtle and worth stating precisely: rather than regress a single quality number, NIMA predicts the entire distribution of human ratings. That distribution captures something a scalar cannot — not just whether an image is good on average, but how divisive it is. A photo that everyone rates a flat 5 and a photo that half the crowd loves and half hates can share a mean and yet be wildly different pictures, and predicting the spread tells them apart.
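To make the point about distributions concrete, the sketch below computes the mean and spread of two hand-made rating histograms on a 1 to 10 scale, plus a distance between distributions based on their cumulative sums, in the style of the earth mover's distance loss used to train NIMA. The histograms are invented for illustration and the function names are assumptions; no model or real AVA data is involved.

```python
from math import sqrt

def mean_and_spread(probs):
    """Mean and standard deviation of a rating distribution.

    `probs[i]` is the probability of score i + 1.
    """
    scores = range(1, len(probs) + 1)
    mean = sum(s * p for s, p in zip(scores, probs))
    var = sum(p * (s - mean) ** 2 for s, p in zip(scores, probs))
    return mean, sqrt(var)

def emd_loss(predicted, target, r=2):
    """Distance between two rating distributions via their cumulative sums."""
    cum_p = cum_t = 0.0
    total = 0.0
    for p, t in zip(predicted, target):
        cum_p += p
        cum_t += t
        total += abs(cum_p - cum_t) ** r
    return (total / len(predicted)) ** (1 / r)

# Everyone rates a flat 5, versus half the crowd at 1 and half at 9.
consensus = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
divisive = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0]

print(mean_and_spread(consensus))  # (5.0, 0.0)
print(mean_and_spread(divisive))   # (5.0, 4.0)
print(emd_loss(divisive, consensus) > 0)  # True
```

Both histograms share a mean of 5, so a scalar regressor trained on means would treat them as the same photo. The spread, and the cumulative-sum distance, separate them.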

Here the honest caveat earns its keep, and it is the reason this subsection cannot be read as a verdict. Aesthetics is subjective and culturally loaded (Human factors). A learned scorer does not encode beauty; it encodes the average taste of its annotators — in AVA's case, the conventions of one online photography community at one moment in time. That average is genuinely useful for ranking a personal collection, which is all the curation pipeline asks of it, but it is not a pronouncement on whether a photograph is any good. Worse, by construction it regresses toward the conventional, so it systematically undervalues the unconventional — the deliberately off-kilter, the anticliché shot that is interesting because it breaks the rules the annotators rewarded (Artistic projects with photo collections). The anticliché camera is curation's contrarian twin, and the gap between them is exactly the gap between "what the crowd scores highly" and "what is worth keeping." A curation system that only ever surfaces the high-scoring frame will quietly flatten a collection toward the postcard, and knowing that is part of using the scorer well.

11.3.3 Grouping, summary, and diversity

The single biggest win on a real camera roll is not aesthetics at all — it is near-duplicate grouping. Bursts, re-tries, and the reflexive habit of firing three shots to be safe produce clusters of nearly identical frames, and the obvious move is to group each cluster and keep only its best member. The grouping reuses the very machinery of the previous section: the same learned embeddings that let Retrieval rank a database by similarity also place near-duplicate frames almost on top of each other in the embedding space, so a simple proximity threshold collapses a burst into one cluster. Within each cluster, the technical-quality and aesthetics scores from the earlier stages pick the keeper — the sharpest frame, the one where nobody blinked, the best-composed of the bunch (Figure 11.3.1). This is where the four-thousand-to-twelve compression mostly happens, and it is the stage that most reliably matches what a human editor would do by hand.
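The proximity-threshold grouping can be sketched as a single greedy pass over the embeddings, followed by a best-of-cluster pick. This is one simple way to do it under stated assumptions, not the only or the canonical one: the three-dimensional vectors stand in for real learned embeddings, the `threshold` of 0.95 is illustrative, and `scores` is assumed to be a combined quality score from the earlier stages.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def group_near_duplicates(embeddings, threshold=0.95):
    """Greedy single-pass grouping.

    Each photo joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new one.
    """
    clusters = []  # each cluster is a list of photo indices
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(emb, embeddings[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def pick_keepers(clusters, scores):
    """Keep the highest-scoring member of each cluster."""
    return [max(cluster, key=lambda i: scores[i]) for cluster in clusters]

# Toy embeddings (stand-ins for learned ones) and made-up quality scores.
embeddings = [
    [1.00, 0.00, 0.10],  # sunset, take 1
    [0.99, 0.02, 0.10],  # sunset, take 2
    [0.98, 0.00, 0.12],  # sunset, take 3
    [0.00, 1.00, 0.00],  # group portrait, take 1
    [0.05, 0.99, 0.00],  # group portrait, take 2
]
scores = [0.61, 0.84, 0.70, 0.55, 0.90]

clusters = group_near_duplicates(embeddings)
keepers = pick_keepers(clusters, scores)
print(clusters)  # [[0, 1, 2], [3, 4]]
print(keepers)   # [1, 4]
```

Comparing against a single representative keeps the pass cheap, at the cost of depending on photo order. A full clustering method would be more robust on a large roll.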

But picking the best of each cluster is still not enough, because of a failure mode that is easy to miss until you see it. A highlight reel of the twelve highest-scoring photos is very often twelve takes of the same hero shot: the one gorgeous sunset, photographed a dozen ways, each version scoring high and each crowding out everything else the trip contained. The fix is to make diversity an explicit objective rather than a hoped-for side effect. Summary selection adds a coverage / diversity term so the kept set spans the trip instead of dwelling on its single best moment. That term can be formalized as submodular maximization, where each added photo's marginal value shrinks the more the set already covers what it would have contributed, or as a clustering over the collection's people, places, and moments. Quality alone ranks; quality-plus-diversity summarizes. This is the same problem that lifelogging makes acute (Life logging cameras), where a day of passive capture must be compressed not to its prettiest frames but to a faithful, varied account of the day, and it is the stage where curation stops being a filter and becomes an editor.
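A small sketch of the quality-plus-diversity idea: greedy selection under a facility-location coverage objective, which is one standard submodular choice, with a weighted quality bonus added. The similarity matrix, the quality scores, and `quality_weight` are invented for illustration, and the text above does not commit to this particular objective.

```python
def greedy_summary(similarity, quality, k, quality_weight=0.5):
    """Greedily pick k photos maximizing coverage plus weighted quality.

    similarity[i][j] in [0, 1] says how well photo j represents photo i.
    Coverage of a set S is the sum over i of max_{j in S} similarity[i][j],
    so a photo's marginal gain shrinks once similar photos are chosen.
    """
    n = len(quality)
    covered = [0.0] * n  # best similarity to anything chosen so far
    chosen = []
    for _ in range(min(k, n)):
        def gain(j):
            coverage_gain = sum(
                max(similarity[i][j] - covered[i], 0.0) for i in range(n)
            )
            return coverage_gain + quality_weight * quality[j]
        best = max((j for j in range(n) if j not in chosen), key=gain)
        chosen.append(best)
        covered = [max(covered[i], similarity[i][best]) for i in range(n)]
    return chosen

# Five photos: three sunsets, one portrait, one street market (toy data).
groups = [0, 0, 0, 1, 2]
quality = [0.95, 0.93, 0.92, 0.70, 0.60]
n = len(groups)
similarity = [
    [1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1)
     for j in range(n)]
    for i in range(n)
]

top_by_quality = sorted(range(n), key=lambda j: -quality[j])[:3]
summary = greedy_summary(similarity, quality, k=3)
print(top_by_quality)  # [0, 1, 2]  (three sunsets)
print(summary)         # [0, 3, 4]  (sunset, portrait, market)
```

Ranking by quality alone returns the three sunsets. The greedy summary takes the best sunset first, after which the other sunsets add almost no new coverage, so the portrait and the market win the remaining slots despite their lower scores.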

💡 The big lesson

Curation is retrieval pointed at quality and diversity instead of similarity. The same embeddings, the same nearest-neighbour machinery, the same "image as a point in a space" all carry over, but where retrieval asks "what matches this?", curation asks "what is worth keeping?" and answers it as a triage: reject the measurably broken first (sharpness, exposure, blink: interpretable, certain, nearly free), score the survivors for learned aesthetics second (useful for ranking, never a verdict, because it encodes the crowd's average taste and undervalues the anticliché), then collapse near-duplicate bursts to their best member (the biggest real-world win), and finally select for diversity so the keepers span the collection rather than re-shooting its one hero moment. The order is the design: certainty up front, taste at the back. And the deepest caveat survives the whole pipeline: every late stage encodes somebody's taste, and it matters whose, which is why the same machinery that culls a vacation roll so well is also the machinery that quietly flattens it toward the conventional.
