8.5 Style transfer⧉
Here are two pictures. One is a snapshot you took on a flat, overcast afternoon — correct, sharp, lifeless. The other is a print by a photographer whose work you love, all deep shadows and luminous skies, or a Van Gogh whose every brushstroke swirls. You do not want the painter's subject and you do not want the master's scene — you want your own picture, your own content, wearing their look. That is style transfer, and it is one of the oldest wishes in image-making: make my thing look like that thing.
The wish is easy to state and slippery to pin down, because "look" is not one quantity. It is the distribution of tones (the moody near-black shadows), the texture and grain, the palette, the way edges are rendered (a hard brushstroke versus a soft gradient). The history of style transfer is really the history of what we decide "style" is made of — and how we measure and impose it. The arc runs cleanly from hand-built answers to learned ones. The earliest methods say style is low-order statistics and patches: match a histogram, copy a texture, find for each spot in your image the most similar spot in the example and borrow its rendering. The modern methods say style is the statistics of deep features — and the headline trick, the Gram matrix, turns out to be the exact same idea the classical texture people had, just measured one layer up. That continuity is the spine of the chapter: classical texture statistics → deep-feature statistics, the same thought, learned instead of designed.
We walk it in three steps. First the classical lineage — Image Analogies (learn a filter from one example pair by patch matching) and two-scale tone/style transfer (split into base + detail, transfer each). Then neural style — content as deep features, style as Gram-matrix feature correlations, an image optimized to match both — and its feed-forward speed-up, which trains a network to do in one pass what optimization did slowly. Finally the domain-level view: when "style" is a whole collection of images (photos↔paintings, summer↔winter), stylization becomes image-to-image translation — Pix2Pix when you have paired data, CycleGAN when you do not.
8.5.1 Classical style transfer: patches and statistics⧉
Before deep features, "style" had to be written down by hand — as a statistic you could measure on the reference and force onto the target, or as a stock of example patches you could copy. A small family of methods stakes out the territory, and the cleanest of them is the leanest.
Color (palette) transfer (Reinhard et al. 2001) is the simplest statistic-matching style transfer there is — and the one to start from, because it shows the whole idea in miniature. Suppose you want your daylight snapshot to carry a reference's palette and mood — the warmth of a sunset, the cool cast of an overcast morning. Reinhard's recipe is to match only the mean and standard deviation of the color distributions: shift each channel so the target's mean lands on the reference's, and scale by the ratio of standard deviations, so the spread of colors matches too. The one subtlety is where you do this. Done in RGB, the channels are correlated and the shifts spill into each other; so both images are first converted to a decorrelated color space — the $l\alpha\beta$ space of Ruderman, whose axes are statistically near-independent for natural scenes — where each channel can be retargeted on its own without cross-talk (the choice of space ties back to Human (and animal) vision and color). That is the entire method: first- and second-order statistics, per channel, in a smart space — no patches, no histograms — yet enough to pour a sunset's warmth over a flat afternoon. It is the pre-deep ancestor of neural style and of modern color grading: where Gatys will later match the second-order statistics of deep features (the Gram matrix), Reinhard matches the first and second moments of raw pixels. Hold the comparison — it is the chapter's spine compressed to a single line.
Image Analogies (Hertzmann, Jacobs, Oliver, Curless and Salesin, 2001; Hertzmann et al. 2001) frames the problem as an explicit analogy. You give it a training pair — an image $A$ and its filtered or stylized version $A'$ (an unfiltered photo and its watercolor, say, or a daytime render and its night version) — and a new target $B$. It synthesizes $B'$ so that the relationship reads "$A$ is to $A'$ as $B$ is to $B'$." The mechanism is patch matching: walk over $B'$ pixel by pixel, and for each location find the patch in $A$ that best matches the corresponding patch in $B$ (and the part of $B'$ already synthesized), then copy across the co-located $A'$ pixel. A single example pair teaches the operator — no training set, no network — and the operator is "whatever transformation took $A$ to $A'$." It is the pre-deep ancestor of neural style: the idea that you can learn a look from one example by matching local appearance is exactly what the neural methods will do, only with a learned notion of "similar patch" instead of raw pixels (cross-ref Deep learning and the texture-synthesis section of this part). Its limit is also instructive — because it matches raw patches, it transfers a look only when "similar in pixels" means "should be rendered alike," which is why it shines on textures and stylized filters and struggles when the correspondence is more abstract.
Two-scale tone and style transfer (Bae, Paris and Durand 2006) takes a different and very photographic tack: it transfers the pictorial look of a model photograph — the tonal drama of an Ansel Adams print, say — onto an ordinary snapshot, by separating scales and transferring each independently. It splits an image, using an edge-preserving base/detail decomposition (the one developed in Bilateral filtering), into a base layer (large-scale tone — the broad distribution of lights and darks) and a detail layer (fine texture and local contrast). Then it matches each separately: the base is brought to the model's tonal distribution by a histogram match (BASIC), so your sky picks up the model's brooding gradient; the detail is rescaled to the model's amount of texture, so your print gets the model's crispness or its softness. Pulling the two scales apart is what makes it work — match the whole histogram at once and you would crush local contrast trying to fix the global tone, or vice versa. This is style as two low-order statistics, measured at two scales — and it makes the classical thesis concrete.
The classical takeaway is one sentence. Style here is low-order statistics plus patches — global means and variances, histograms, the amount of local texture, exemplar patches — matched by hand-designed pipelines. And the lineage has a direction: it is a rising statistical order. Reinhard matches first and second moments; histogram and patch methods match whole distributions and local appearance; the neural methods below will match the second-order statistics of deep features. Each rung keeps less of the layout and summarizes more of the look, but the move is always the same — measure a statistic on the reference, force it onto the target's content. Notice what counts as "style" in every case: not the content (where the edges and objects are), but aggregate properties — how tones are distributed, how much texture there is, what a typical patch looks like — properties that are roughly the same everywhere in the image and so can be summarized by a statistic. Hold that thought. The neural methods below change exactly one thing: they replace "hand-designed statistic" with "deep-feature statistic." Everything else — the idea that style is a summary and content is a layout — survives the jump.
8.5.2 Neural style and feed-forward stylization⧉
The breakthrough was a reframing, not a new optimizer. Gatys, Ecker and Bethge (2016; Gatys et al. 2016) asked what content and style are, in the language of a convolutional network trained for recognition (a net from Oxford's Visual Geometry Group (VGG)), and gave two crisp answers.
Content is what the deep features encode — what is where. Run the image through the network and read off the activations $\phi_\ell$ at some layer $\ell$. Deep layers respond to objects and their arrangement and are largely indifferent to exact pixels — two photos of the same scene under different rendering have nearly the same deep features. So "keep the content of $I_c$" means keep its deep features: penalize $\lVert \phi_\ell(\hat I) - \phi_\ell(I_c)\rVert^2$ at a chosen layer.
Style is the correlations among those features — and emphatically not where they are. This is the load-bearing idea. At a layer with $C$ feature channels, the activation is a stack of $C$ feature maps. Ask: across the spatial extent of the image, which channels tend to fire together? Flatten each channel's map into a long vector and take all the pairwise inner products — that is the Gram matrix,
where in the compact form $\phi_\ell$ is the activation reshaped from a tensor into a $C\times HW$ matrix (each of the $C$ channels flattened into a length-$HW$ row), so that $\phi_\ell\phi_\ell^{\top}$ is a $C\times C$ matrix of feature co-occurrence — exactly the index-form sum that follows. Summing over all spatial positions is the crucial move: it discards location and keeps only how much each pair of features appears together, anywhere. That is precisely a statistic of texture — "brushstroke-like channel and orange-ish channel fire together a lot" — with the layout thrown away. Two images with the same Gram matrices have the same stuff (the same strokes, the same palette of local patterns) arranged however you like. So "give $\hat I$ the style of $I_s$" means match the Gram matrices: $\sum_\ell \lVert G_\ell(\hat I) - G_\ell(I_s)\rVert^2$, summed over several layers to capture style at several scales.
Put the two together and stylization is one optimization — start from noise (or from the content image) and descend on
with $\alpha,\beta$ trading content fidelity against stylistic strength. The first term pins what is where to the content image; the second pours the reference's texture statistics over it; the optimizer finds the image that satisfies both (Figure 8.5.2). It is slow — a full gradient-descent per image, minutes per result — but it changed the field, because it said style is deep-feature statistics and made the content/style split a pair of measurable losses.
The classical methods and the neural one are the same skeleton with one substitution. Both define style as a summary statistic of the image and synthesize the output by matching that statistic while preserving a layout. The classical pipelines used hand-designed statistics — color means and variances, a tone histogram, a texture amount, raw exemplar patches. Gatys keeps the skeleton exactly and swaps in a statistic computed from features learned from data: the Gram matrix is "which learned channels co-occur," and "learned channels" is the whole difference. The texture-synthesis intuition (style = co-occurrence statistics, content = layout) is unchanged; only the space the statistics live in moved from pixels to deep features. This is L8 in miniature — the inverse/synthesis template is identical, the prior just became data-driven — and it is why the classical → neural transition reads as continuity, not rupture. (Registered as L8 — first appears in Machine learning; recurs here as the classical-statistics → Gram-matrix bridge.)
Feed-forward / perceptual-loss stylization (Johnson, Alahi and Fei-Fei 2016) removes the speed problem without changing the idea. Gatys optimizes an image against the content+style loss; the feed-forward version instead trains a network, once, per style to map any content image to its stylized version in a single forward pass. The training objective is the same loss Gatys minimized — the content feature distance plus the Gram-matrix style distance — now called a perceptual loss because it is measured in deep-feature space rather than on pixels. After training, stylization is real-time: feed an image, get the stylized result, no optimization loop. The pattern is worth naming because it recurs all over the book: take an expensive per-instance optimization, amortize it into a network trained to minimize the same objective. That same perceptual / feature loss reappears as a training loss for super-resolution and, as a learned perceptual metric, as LPIPS (cross-ref Deep learning).
Markovian / patch-based neural style (Li and Wand 2016) sits squarely between Image Analogies and Gatys, and explains why. The Gram matrix is a global statistic — it pools over the entire image, so it can scramble local structure (a face stylized by global Gram matching may grow an eye where there should be a cheek, because globally the textures match). Li and Wand combine Markov random fields with CNNs: match style on local neighborhoods of deep features — for each patch of $\hat I$'s features, find a matching patch of the style image's features and pull toward it — rather than on a single global correlation matrix. That is exactly Image Analogies' patch matching, but performed in deep-feature space instead of on raw pixels, and exactly Gatys' "style = feature statistics," but local instead of global. Patches, but learned; statistics, but spatial. The three methods are corners of one triangle: Image Analogies (patches, no net), Gatys (global feature statistics), Li and Wand (local feature patches).
8.5.3 Style transfer as image-to-image translation⧉
So far "style" has been the look of a single reference. Often it is really a whole domain: you do not have one Monet you want to match, you have the set of all Monets, and you want your photo to look like it belongs to that set. Likewise photos↔paintings, day↔night, summer↔winter, horses↔zebras. Here stylization becomes learned image-to-image translation — train a generator to map an image from one domain to the other, and a discriminator — together, a generative adversarial network (GAN) — to keep the output looking like a genuine member of the target domain. Style is no longer a statistic you measure on one reference; it is membership in a collection, enforced adversarially.
Pix2Pix (Isola, Zhu, Zhou and Efros, 2017; Isola et al. 2017) is the paired case: when you have aligned input/output pairs — the same scene as a photo and as a segmentation map, a sketch and its finished drawing, a daytime and a nighttime shot of one street — learn the map directly with a conditional GAN (the generator conditions on the input image; the discriminator judges input–output pairs). It is the generic "learn the mapping between two image domains when you can put the domains in correspondence" tool, and it works strikingly well across an absurd range of paired tasks.
CycleGAN (Zhu, Park, Isola and Efros, 2017; Zhu et al. 2017) handles the far more common unpaired case. You almost never have aligned pairs for the interesting styles — there is no photograph of the exact scene Monet painted, only piles of photos and piles of Monets. CycleGAN learns the mapping anyway, using cycle consistency: train two generators, $A\!\to\!B$ and $B\!\to\!A$, and require that translating a photo to a painting and back returns the original photo ($B\!\to\!A\!\to\!B$ should be the identity). That round-trip constraint forces the translation to preserve content (otherwise you could not get back) while the adversarial loss forces the look to match the target collection. It is the workhorse behind "make my photo look like X" when X exists only as a collection — the unpaired, domain-level end of the style-transfer spectrum.
Where these sit. They are surveyed as models (conditional GANs, cycle consistency, adversarial training) in Deep learning; here they are framed as the learned, domain-level answer to the chapter's one question — match the look of a reference (now a whole collection) onto a target's content. And the spectrum keeps extending: the current frontier is text- and exemplar-driven stylization with diffusion — "render this photo in the style of …" from a prompt, or from a single style image — which is the forward limit treated in Generative AI and diffusion. The thread is unbroken from Image Analogies' single example pair to a diffusion model conditioned on a phrase: define what style is made of, measure it on the reference, impose it on the target's content. Only the definition of "style" — raw patches, tone histograms, Gram matrices, domain membership, a text embedding — keeps getting richer.
Big lessons of this chapter
The recurring principles from this chapter, gathered for review.
The classical methods and the neural one are the same skeleton with one substitution. Both define style as a summary statistic of the image and synthesize the output by matching that statistic while preserving a layout. The classical pipelines used hand-designed statistics — color means and variances, a tone histogram, a texture amount, raw exemplar patches. Gatys keeps the skeleton exactly and swaps in a statistic computed from features learned from data: the Gram matrix is "which learned channels co-occur," and "learned channels" is the whole difference. The texture-synthesis intuition (style = co-occurrence statistics, content = layout) is unchanged; only the space the statistics live in moved from pixels to deep features. This is L8 in miniature — the inverse/synthesis template is identical, the prior just became data-driven — and it is why the classical → neural transition reads as continuity, not rupture. (Registered as L8 — first appears in Machine learning; recurs here as the classical-statistics → Gram-matrix bridge.)