
6.4 Morphable models

A single linear morph turns one face into another by interpolating a hand-drawn correspondence: it knows about exactly two faces and the straight line between them. Suppose instead you have a hundred faces, all registered to one another, and you ask a bigger question — not "what is between these two," but "what does the space of all of them look like, and can I synthesize, recognize, or edit a face by navigating that space?" That is the move from a morph to a morphable model, and it is one of the most influential ideas in this part: a whole family of shapes — faces, bodies, a class of objects — captured as a mean plus a basis of deformations, so that any member is a low-dimensional coordinate vector and any coordinate vector is a plausible member.

The construction is short to state and is built from machinery you already own. Put a training set of examples into dense correspondence (every example's nose-tip maps to every other's), align away the nuisance pose with Procrustes, and take the principal components (PCA) of the aligned shapes. PCA hands back a mean shape and an ordered basis of eigen-deformations — the directions in which the family varies most — and any instance is written

$$ \mathbf{s}(\mathbf{c}) = \bar{\mathbf{s}} + \sum_{k} c_k\,\sigma_k\,\mathbf{e}_k, $$

the mean plus a weighted sum of eigen-modes, with coordinates $\mathbf{c}=(c_1,c_2,\dots)$ measured in units of each mode's standard deviation $\sigma_k$. Read in this part's language, $\bar{\mathbf{s}}$ is a mean shape and the $\mathbf{e}_k$ are a learned warp basis over it: each eigen-mode is a small canonical deformation of the mean, and an instance is the mean warped by a linear combination of them. A morphable model is morphing, with the correspondence field constrained to the subspace the data taught us is plausible (Figure 6.4.1, Figure 6.4.2).

The same idea recurs at three scales of fidelity, and they share one skeleton. In 2-D, Cootes, Taylor and colleagues' Active Shape / Active Appearance Models (ASM 1995, AAM 2001) build PCA models of landmark shapes (and, for AAMs, of the shape-free texture too) and fit them to images. In 3-D, Blanz and Vetter's 3D Morphable Model of faces (1999) is the landmark result: a PCA model of 3-D face geometry and texture, fit to a single photograph by analysis-by-synthesis to recover shape, albedo, pose, and lighting at once. The same recipe gives SMPL for human bodies (Loper et al. 2015) and FLAME for faces (Li et al. 2017). Egger et al.'s 2020 survey is the map of the whole territory. We develop the skeleton once; the differences are scale (2-D vs 3-D), what gets a basis (shape, texture, expression), and how the prior is parameterized (linear PCA vs a neural net).

💡 Big lesson (L17, recurrence) — a morphable model is a learned warp basis over a mean

Everywhere in this part the recipe is build a coordinate map, then transport the pixels; what changes is how the map is obtained. A parametric warp solves it from constraints; a morph interpolates a hand-drawn one between two images; a morphable model learns the basis of maps from a training population and then lets you pick one by its coordinates. The mean shape is the origin, the eigen-deformations are the basis vectors, and an instance is a short vector $\mathbf{c}$ of how far to move along each. The payoff of folding correspondence into a learned basis rather than a per-pair drawing is a prior: not every warp is allowed, only the ones the family actually exhibits — which is exactly what lets the model fill in an occluded cheek, reject an implausible fit, and generalize from one photo to a full 3-D face. The same shape — mean + learned basis of deformations + coordinates — underlies SMPL bodies, FLAME faces, and, at the end of the chapter, the latent spaces of generative networks. (L17 is registered in the part introduction, Warping and morphing; here the map is a statistical model rather than a drawn or estimated one.)

📎 Connection

A morphable model is the family-constrained sibling of the per-pair morph of Morphing; its fitting step is an inverse-rendering analysis-by-synthesis that anticipates monocular 3-D reconstruction in 3D and depth; and its linear PCA prior is the classical ancestor of the neural image and shape priors in Deep learning and Super-resolution and image priors. Downstream it powers identity-preserving portrait editing, relighting, and avatar animation.

6.4.1 Step 1 — dense correspondence: the precondition for averaging shapes

Everything rests on a precondition that is easy to state and hard to satisfy: before you can average a hundred faces or take their PCA, every face must be in dense correspondence with every other. That is, the same anatomical point (the tip of the nose, the corner of the left eye, a specific spot on the cheek) must occupy the same index in every example's data vector. Only then does "average the noses" mean averaging noses rather than averaging a nose in one image with a cheek in another. This is the exact lesson of Morphing (align before you blend), promoted from two images to a whole population: you cannot meaningfully average, or compute a covariance over, points that do not correspond.

For 2-D models the correspondence is often a fixed set of landmarks placed on every training image — eye corners, nostrils, jaw points — by hand or by a detector; an instance is then the concatenation of its landmark coordinates, $\mathbf{s} = (x_1,y_1,\dots,x_K,y_K)$. For dense models (a 3-D face mesh, a body) every vertex must correspond across the set, which is itself a registration problem — fit a common template mesh to every scan so that vertex $j$ lands on the same anatomical point in all of them. Blanz and Vetter's original 3DMM obtained this dense correspondence between laser scans with an optical-flow-style algorithm; modern pipelines use template-fitting and non-rigid registration. The quality of the whole model is bounded by the quality of this correspondence: if it is sloppy, the "mean face" is a blurred average of misaligned parts, and the eigen-modes mix true variation with registration error.
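A toy illustration of why correspondence matters, assuming nothing beyond NumPy: the same unit-circle contour is stored twelve times, once with consistent vertex indexing and once with the starting vertex shifted in each copy. The names `contour`, `corresponded`, and `shuffled` are hypothetical, and a circle stands in for a face outline.

```python
import numpy as np

K = 12
angles = 2 * np.pi * np.arange(K) / K
contour = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (K, 2) unit circle

# Population A: vertex j is the same point on the contour in every example.
corresponded = np.stack([contour for _ in range(K)])

# Population B: identical geometry, but each example starts its indexing
# at a different vertex, so index j no longer means the same point.
shuffled = np.stack([np.roll(contour, shift, axis=0) for shift in range(K)])

mean_ok = corresponded.mean(axis=0)    # still the unit circle
mean_bad = shuffled.mean(axis=0)       # every vertex averages to the centroid

radius_ok = np.linalg.norm(mean_ok, axis=1)
radius_bad = np.linalg.norm(mean_bad, axis=1)
```

Every example in both populations is a perfect circle, yet the uncorresponded mean collapses to a single point. Real registration error is less extreme, but it blurs the mean and contaminates the covariance in the same way.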

6.4.2 Step 2 — Procrustes alignment: factor out pose so PCA sees shape

Two photos of the same face at different positions, scales, and in-plane rotations have different landmark coordinates, yet they are the same shape. If we ran PCA on the raw coordinates, the first and largest eigen-modes would be wasted describing pose (global translation, rotation, scale), the nuisance variation we do not want the model to spend its budget on. The fix is generalized Procrustes analysis: align every example to a shared reference with a similarity transform (translate to a common centroid, rescale to a common size, rotate to best match), iterating the reference as the running mean until it stops moving. What remains after Procrustes is shape in the technical sense, geometry with similarity (translation, rotation, uniform scale) divided out, and that is what we feed to PCA. (Pose itself is not discarded; it becomes a few explicit parameters bolted onto the model at fit time, alongside the coordinates $\mathbf{c}$.) The discipline is the same one that opened Warping and resampling's DOF ladder: separate the global rigid/similarity part from the non-rigid part, and model only the latter statistically.
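The alignment loop can be sketched in a few lines for 2-D landmarks. This is a minimal version under stated assumptions: shapes are `(K, 2)` arrays already in correspondence, "common size" means unit Frobenius norm after centering, and the function names `align_similarity` and `generalized_procrustes` are illustrative rather than taken from any library.

```python
import numpy as np

def align_similarity(shape, ref):
    """Align one (K, 2) landmark set to ref: translation, uniform scale, rotation."""
    a = shape - shape.mean(axis=0)            # remove translation
    b = ref - ref.mean(axis=0)
    a = a / np.linalg.norm(a)                 # remove scale (unit Frobenius norm)
    b = b / np.linalg.norm(b)
    # Orthogonal Procrustes: the rotation r minimizing ||a @ r - b||
    u, _, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(u @ vt))        # forbid reflections
    r = u @ np.diag([1.0, d]) @ vt
    return a @ r

def generalized_procrustes(shapes, iters=20, tol=1e-10):
    """shapes: (N, K, 2). Returns the aligned shapes and their normalized mean."""
    ref = shapes[0] - shapes[0].mean(axis=0)  # start from the first example
    ref = ref / np.linalg.norm(ref)
    for _ in range(iters):
        aligned = np.stack([align_similarity(s, ref) for s in shapes])
        mean = aligned.mean(axis=0)
        mean = mean / np.linalg.norm(mean)
        if np.linalg.norm(mean - ref) < tol:  # reference stopped moving
            break
        ref = mean
    return aligned, mean
```

Flattening each row of `aligned` to a length-`2K` vector gives the input PCA expects. The translation, scale, and rotation that were divided out are the pose parameters that return as explicit unknowns at fit time.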

6.4.3 Step 3 — PCA: a mean and a basis of eigen-deformations

With the population corresponded and aligned, PCA does the rest. Stack the $N$ aligned shape-vectors as rows, subtract the mean shape $\bar{\mathbf{s}}$, and take the eigenvectors of the covariance (in practice via the SVD of the centered data). The eigenvectors $\mathbf{e}_1,\mathbf{e}_2,\dots$, ordered by decreasing eigenvalue, are the eigen-deformations (for faces, evocatively, the "eigenfaces" of shape), and each carries a standard deviation $\sigma_k$ — how far the family typically moves along it. Any instance is

$$ \mathbf{s}(\mathbf{c}) = \bar{\mathbf{s}} + \sum_{k=1}^{M} c_k\,\sigma_k\,\mathbf{e}_k, $$

a mean plus a short ($M$-term) weighted sum (Figure 6.4.1). Three properties make this the workhorse of the whole field:

The figures here run a real, tiny PCA on a small synthetic population of 2-D face landmarks: the mean is the genuine average, and the modes shown are the actual leading eigenvectors of that population's covariance — so "PC1 ≈ aspect" is what the data produced, not a hand-drawn cartoon.
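A minimal sketch of that kind of computation follows. It is not the code that produced the figures: the synthetic population, the seed, and the function names `fit_shape_pca`, `synthesize`, and `project` are all assumptions made for illustration.

```python
import numpy as np

def fit_shape_pca(aligned, n_modes):
    """aligned: (N, D) rows of aligned shape vectors.

    Returns the mean (D,), modes (M, D) with orthonormal rows e_k,
    and per-mode standard deviations sigma_k (M,).
    """
    mean = aligned.mean(axis=0)
    centered = aligned - mean
    # SVD of the centered data: rows of vt are the covariance eigenvectors,
    # already ordered by decreasing singular value.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    sigmas = sing / np.sqrt(len(aligned) - 1)
    return mean, vt[:n_modes], sigmas[:n_modes]

def synthesize(mean, modes, sigmas, c):
    """s(c) = mean + sum_k c_k * sigma_k * e_k, with c in sigma units."""
    return mean + (c * sigmas) @ modes

def project(mean, modes, sigmas, shape):
    """Coordinates of a shape in sigma units (inverse of synthesize on the subspace)."""
    return (modes @ (shape - mean)) / sigmas

# Hypothetical population: 100 shapes of 8 landmarks that vary along two
# hidden directions, one strong and one weak.
rng = np.random.default_rng(0)
K = 8
base = rng.normal(size=2 * K)
d1, d2 = rng.normal(size=(2, 2 * K))
population = (base
              + 3.0 * rng.normal(size=(100, 1)) * d1
              + 0.5 * rng.normal(size=(100, 1)) * d2)

mean, modes, sigmas = fit_shape_pca(population, n_modes=2)
c = project(mean, modes, sigmas, population[0])
reconstruction = synthesize(mean, modes, sigmas, c)
```

Because this toy population lies exactly in a two-dimensional affine subspace, two modes reconstruct every example. Real data needs more modes, and the number kept is a trade-off between compactness and fidelity.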

Figure 6.4.1. A mean shape plus its first few eigen-deformations, from a real (small) PCA on synthetic 2-D face landmarks. The center column is the mean face $\bar{\mathbf{s}}$. Each row sweeps one principal component from $-2\sigma$ through the mean to $+2\sigma$ ($\mathbf{s}=\bar{\mathbf{s}}+c_k\sigma_k\mathbf{e}_k$), holding all other coordinates at zero; the faint grey outline behind each variant is the mean, for reference. The leading mode (PC1) is the family's biggest axis of variation (here, overall aspect), PC2/PC3 finer axes. An instance is the mean warped by a weighted sum of these modes — the eigen-deformations are a learned warp basis over the mean.
Figure 6.4.2. The shape space: every training face is a point, and the axes are the principal components. The blue cloud is the corresponded-and-aligned training population plotted on (PC1, PC2) in $\sigma$ units; the mean sits at the origin. A few sample shapes are drawn at their own coordinates, showing that moving in the plane is moving through face shape — narrow/tall in one corner, wide in another. Any face is a coordinate vector $\mathbf{c}=(c_1,c_2,\dots)$; any (small-norm) coordinate vector is a plausible face. Fitting (Figure 6.4.4) is the search for the point that matches a given photo.

6.4.4 Step 4 — shape and appearance are separate bases

A face is not only a shape; it is also a coloring — skin tone, the dark of the brows and lips, shading. A morphable model handles these with two separate bases, and keeping them separate is exactly the separate-shape-from-color discipline of Morphing, now applied to a whole family. Geometry gets the shape PCA above. Appearance (texture) gets its own PCA — but, crucially, the textures are first warped to the mean shape so they are compared in a common, shape-free frame (every face's texture sampled at the same canonical landmarks before averaging). The result is a mean texture $\bar{\mathbf{t}}$ and an appearance basis $\mathbf{a}_k$, and a full instance is the pair

$$ \mathbf{s}(\mathbf{c}) = \bar{\mathbf{s}} + \sum_k c_k\,\sigma_k\,\mathbf{e}_k, \qquad \mathbf{t}(\mathbf{b}) = \bar{\mathbf{t}} + \sum_k b_k\,\tau_k\,\mathbf{a}_k, $$

with two coordinate vectors: $\mathbf{c}$ for geometry, $\mathbf{b}$ for appearance. To render an instance you synthesize the texture in the shape-free frame, then warp it onto the synthesized shape using the warp engine of Warping and resampling (after morphing, the morphable model is that engine's biggest customer): texture-on-the-mean, deformed to the geometry the shape coordinates request (Figure 6.4.3). Why separate them? Because the two vary independently (a wide face can have any skin tone; a smile, which is geometry, is not a color change), and because untangling them is what makes the model useful: relighting edits the appearance/shading without touching identity geometry; re-posing edits geometry without re-painting the texture; recognition can compare two faces in shape alone, robust to lighting. Active Appearance Models combine the two into a single joint appearance model; the 3DMM keeps shape, texture, and the imaging parameters (pose, illumination) explicit so each can be read off after fitting.

Figure 6.4.3. The model carries two bases. Left — shape basis: a geometry instance, the mean warped by a few eigen-deformations (displacement arrows from the mean landmarks to the instance). Middle — appearance basis: a texture instance, defined in the shape-free frame on the mean shape (mean texture plus eigen-color). Right — combined instance: the appearance is warped onto the shape to synthesize one face — texture-on-the-mean, deformed to the requested geometry. Shape and appearance vary independently, which is what lets you re-pose without re-painting and relight without re-shaping.

6.4.5 Step 5 — fitting a new photo: analysis-by-synthesis

A model is only useful if you can fit it to a new image — recover the coordinates of the face in this photo. The principle is analysis-by-synthesis (equivalently, inverse rendering or render-and-compare): rather than read the parameters off the image directly, search for the parameters whose synthesized image best reproduces the input. Concretely, hold a current guess of the coordinates $\mathbf{c},\mathbf{b}$ and the imaging parameters (pose, camera, lighting); render the model under them; compare the render to the target and measure the residual (the mismatch, in landmark positions and/or pixel colors); and update the parameters to shrink it. Iterate until the residual stops falling. It is the morphable model's answer to the same loop that drives every fitting problem in the book — propose, render, compare, correct (Figure 6.4.4).

The objective being minimized is the residual plus a prior term that keeps the coordinates plausible: typically $\sum_k c_k^2$ (small in $\sigma$ units), which is, up to a factor of $\tfrac12$ and an additive constant, the negative log of the Gaussian that PCA gave us in Step 3. That regularizer is the whole reason a morphable model can fit a single photograph and still produce a complete, sensible 3-D face: the data constrain the visible front of the face, and the prior fills in the occluded sides and the parts the lighting hid, by pulling the unconstrained coordinates toward the mean. Without the prior the fit would overfit the visible pixels and hallucinate a monster around the back. This is precisely the regularized inverse problem structure of Linear Inverse Problems and Regression and Super-resolution and image priors (data term + prior), with the morphable model supplying an unusually strong, learned, class-specific prior. The non-convex search (over geometry, texture, pose, and light at once) is the hard part in practice; Blanz and Vetter optimized it directly, while modern systems often regress the coordinates with a neural network in a single forward pass and refine from there (Section 6.4.7).
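The data-term-plus-prior structure is easiest to see in a deliberately simplified case. The sketch below assumes landmarks only, pose already removed, and a known visibility mask, so the objective is quadratic and has a closed-form minimizer. A full 3DMM fit is non-convex and iterative, and this is not that algorithm. The toy model, the mask, and the name `fit_landmarks` are hypothetical.

```python
import numpy as np

def fit_landmarks(mean, modes, sigmas, target, visible, lam=1.0):
    """Minimize ||W (mean + B c - target)||^2 + lam * ||c||^2 over c (sigma units).

    modes: (M, D) orthonormal rows e_k; sigmas: (M,);
    visible: (D,) boolean mask, True where a coordinate was observed.
    """
    basis = (modes * sigmas[:, None]).T              # (D, M), columns sigma_k * e_k
    w = visible.astype(float)                        # 1 = observed, 0 = occluded
    normal = basis.T @ (w[:, None] * basis) + lam * np.eye(len(sigmas))
    rhs = basis.T @ (w * (target - mean))
    return np.linalg.solve(normal, rhs)

# Hypothetical toy model: 16 coordinates, 3 orthonormal modes.
rng = np.random.default_rng(1)
D, M = 16, 3
mean = rng.normal(size=D)
modes = np.linalg.qr(rng.normal(size=(D, M)))[0].T   # (M, D), orthonormal rows
sigmas = np.array([3.0, 1.0, 0.5])

c_true = np.array([1.0, -0.5, 2.0])
target = mean + (c_true * sigmas) @ modes

# Only the first half of the coordinates is observed.
half_visible = np.arange(D) < D // 2
c_half = fit_landmarks(mean, modes, sigmas, target, half_visible, lam=1e-8)
completed = mean + (c_half * sigmas) @ modes         # includes the hidden half

# Nothing observed: the prior is all that is left, so the fit is the mean.
c_none = fit_landmarks(mean, modes, sigmas, target, np.zeros(D, dtype=bool), lam=1e-8)
```

With half the coordinates hidden, the low-dimensional basis still determines the rest, which is the "fill in the occluded cheek" behaviour in miniature. With nothing observed, the solution falls back to `c = 0`, the mean face. Larger `lam` moves any fit toward that fallback.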

Figure 6.4.4. Fitting by analysis-by-synthesis. A target "photo" (left, red), then the model's successive guesses: it starts at the mean and steps through the shape space, each guess rendered and compared to the target. The yellow arrows are the residual — how far each model landmark still sits from its target — and they drive the coordinate update; the residual (RMS) shrinks across iterations (here $0.100\!\to\!0.045\!\to\!0.009$), the faint red ghost behind each guess being the target. The strip below is the loop: coordinates $\mathbf{c}$ → render → compare to target (residual) → update $\mathbf{c}$, repeated until the residual stops shrinking. Fitting is a search in the shape space for the point that reproduces the image.

6.4.6 Step 6 — once fitted, edit by moving in the space

The reason to go to all this trouble is what fitting enables. Once a photo is reduced to a point $\mathbf{c}$ (plus appearance $\mathbf{b}$, pose, and lighting), every downstream operation is a move in the space — and because the space is the learned manifold of plausible faces, every move stays on a plausible face. Re-pose / re-illuminate: change the explicit pose and lighting parameters and re-render — the same identity from a new angle or under new light, the bread and butter of 3DMM-based face manipulation. Interpolate / average: the straight line between two fitted points is a morph that never leaves the family — every in-between is a valid face, no ghosting, because the correspondence is baked into the basis (this is the morph of Morphing, with the correspondence supplied by the model rather than drawn). Edit semantically: add a multiple of a mode that correlates with an attribute — older, wider, more of a smile — to push the face along that attribute while holding the rest. Recognize: compare faces in the identity coordinates, where pose and lighting have been factored out, so the same person under different conditions lands at nearly the same point — the use that motivated much of the original work.

This last group surfaces the identity-vs-expression factorization. The geometry of a face is the identity (bone structure, who it is — roughly constant for a person) composed with an expression (smile, frown, blink — momentary). A flat shape PCA blends the two; the more capable models give them separate factors — FLAME, for instance, carries distinct identity, expression, and pose/jaw parameters — so you can swap one person's expression onto another's identity, or animate a fixed identity through expressions. SMPL does the analogous split for bodies: a shape space (the person's build) and a pose space (the skeleton's articulation), with pose-dependent deformations layered on. The general pattern is to factor the deformation into independent groups — identity, expression, pose — each its own little basis, which is the rigging artist's blendshapes wearing a statistician's hat.
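Two of these moves, interpolation and expression transfer, reduce to arithmetic on coordinate vectors once the model is linear. The sketch below assumes a purely additive two-factor shape model (mean plus identity offset plus expression offset). That is a simplification: FLAME and SMPL also include pose-dependent corrections and skinning, which are omitted here. The bases are random stand-ins, and all names are hypothetical.

```python
import numpy as np

def synthesize_factored(mean, id_basis, exp_basis, c_id, c_exp):
    """Additive two-factor shape: mean + identity offset + expression offset."""
    return mean + c_id @ id_basis + c_exp @ exp_basis

def interpolate(c_a, c_b, t):
    """Point a fraction t of the way from c_a to c_b in coordinate space."""
    return (1.0 - t) * c_a + t * c_b

# Random stand-in model: 20 coordinates, 4 identity modes, 2 expression modes.
rng = np.random.default_rng(2)
D = 20
mean = rng.normal(size=D)
id_basis = rng.normal(size=(4, D))
exp_basis = rng.normal(size=(2, D))

# Two fitted faces: person A is neutral, person B is smiling (say).
a_id, a_exp = rng.normal(size=4), np.zeros(2)
b_id, b_exp = rng.normal(size=4), np.array([1.5, -0.3])

a_neutral = synthesize_factored(mean, id_basis, exp_basis, a_id, a_exp)
b_neutral = synthesize_factored(mean, id_basis, exp_basis, b_id, np.zeros(2))

# Expression transfer: keep A's identity, take B's expression coordinates.
a_with_b_expression = synthesize_factored(mean, id_basis, exp_basis, a_id, b_exp)

# Identity morph: the midpoint in coordinate space is a member of the family.
mid_id = interpolate(a_id, b_id, 0.5)
midpoint = synthesize_factored(mean, id_basis, exp_basis, mid_id, np.zeros(2))
```

In this linear setting the midpoint in coordinate space equals the vertex-wise average of the two shapes, which is only meaningful because the basis already encodes the correspondence.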

6.4.7 Step 7 — from linear PCA to neural priors

PCA is a linear, Gaussian model: instances are linear combinations of the basis, and the prior is a single ellipsoid. That is its strength — interpretable, cheap, a closed-form mean and modes — and its ceiling. A linear basis cannot capture nonlinear variation (the manifold of faces is curved; large expressions and extreme shapes bend off the flat subspace), it ties detail to how well correspondence and the Gaussian assumption hold, and it tops out at a certain realism. The modern arc keeps the morphable-model idea — a mean and a learned space of deformations you fit and then navigate — but replaces the linear basis with a neural one:

The constant across the whole arc is the framing this chapter began with. Whether the basis is a handful of eigenfaces or the latent space of a billion-parameter network, a morphable model is a mean object plus a learned space of deformations, fit to data by analysis-by-synthesis, and used by moving through that space — a learned warp basis over a mean shape. PCA was the first, clearest instance; the priors got stronger, the idea did not change.

6.4.8 Recap and significance

Three ideas outlast the specific models. First, a class of shapes is a low-dimensional space: corresponded, aligned, and PCA'd, a whole family collapses to a mean plus a basis of eigen-deformations, and any member is a short coordinate vector — any face is a point in this space. Second, separate shape from appearance (and, further, identity from expression from pose): independent factors, each its own basis, recombined by warping texture onto geometry — the morphing discipline at population scale. Third, fit by analysis-by-synthesis with a prior: search the coordinates that reproduce the image, regularized toward plausibility, so the model can fill in what a single photo cannot show.

Those three put the morphable model at a crossroads in the book. Backward, it is morphing constrained to a learned family (Morphing) built on the warp engine (Warping and resampling). Forward, its analysis-by-synthesis fitting is the template for monocular 3-D face and body reconstruction (3D and depth); its data-term-plus-prior objective is the classical face of the regularized inverse problems that recur in Super-resolution and image priors; and its linear basis is the ancestor of the neural priors and generative latent spaces of Deep learning — the same idea, a learned space of deformations over a mean, scaled up. The throughline of the whole part holds to the end: find the coordinate map, then move the pixels — only here the map is learned from a population, and moving through it is how you synthesize, recognize, and edit an entire visual class.

