6.4 Morphable models
A single linear morph turns one face into another by interpolating a hand-drawn correspondence: it knows about exactly two faces and the straight line between them. Suppose instead you have a hundred faces, all registered to one another, and you ask a bigger question — not "what is between these two," but "what does the space of all of them look like, and can I synthesize, recognize, or edit a face by navigating that space?" That is the move from a morph to a morphable model, and it is one of the most influential ideas in this part: a whole family of shapes — faces, bodies, a class of objects — captured as a mean plus a basis of deformations, so that any member is a low-dimensional coordinate vector and any coordinate vector is a plausible member.
The construction is short to state and is built from machinery you already own. Put a training set of examples into dense correspondence (every example's nose-tip maps to every other's), align away the nuisance pose with Procrustes, and take the principal components (PCA) of the aligned shapes. PCA hands back a mean shape and an ordered basis of eigen-deformations — the directions in which the family varies most — and any instance is written
$$\mathbf{s} \;=\; \bar{\mathbf{s}} \;+\; \sum_{k} c_k\,\sigma_k\,\mathbf{e}_k ,$$
the mean plus a weighted sum of eigen-modes, with coordinates $\mathbf{c}=(c_1,c_2,\dots)$ measured in units of each mode's standard deviation $\sigma_k$. Read in this part's language, $\bar{\mathbf{s}}$ is a mean shape and the $\mathbf{e}_k$ are a learned warp basis over it: each eigen-mode is a small canonical deformation of the mean, and an instance is the mean warped by a linear combination of them. A morphable model is morphing, with the correspondence field constrained to the subspace the data taught us is plausible (Figure 6.4.1, Figure 6.4.2).
The same idea recurs at three scales of fidelity, and they share one skeleton. In 2-D, the Active Shape Models of Cootes, Taylor and colleagues (ASM, 1995) and the Active Appearance Models of Cootes, Edwards and Taylor (AAM, 2001) build PCA models of landmark shapes (and, for AAMs, of the shape-free texture too) and fit them to images. In 3-D, Blanz and Vetter's 3D Morphable Model of faces (1999) is the landmark result: a PCA model of 3-D face geometry and texture, fit to a single photograph by analysis-by-synthesis to recover shape, albedo, pose, and lighting at once. The same recipe gives SMPL for human bodies (Loper et al. 2015) and FLAME for faces (Li et al. 2017). Egger et al.'s 2020 survey is the map of the whole territory. We develop the skeleton once; the differences are scale (2-D vs 3-D), what gets a basis (shape, texture, expression), and how the prior is parameterized (linear PCA vs a neural net).
Everywhere in this part the recipe is build a coordinate map, then transport the pixels; what changes is how the map is obtained. A parametric warp solves it from constraints; a morph interpolates a hand-drawn one between two images; a morphable model learns the basis of maps from a training population and then lets you pick one by its coordinates. The mean shape is the origin, the eigen-deformations are the basis vectors, and an instance is a short vector $\mathbf{c}$ of how far to move along each. The payoff of folding correspondence into a learned basis rather than a per-pair drawing is a prior: not every warp is allowed, only the ones the family actually exhibits — which is exactly what lets the model fill in an occluded cheek, reject an implausible fit, and generalize from one photo to a full 3-D face. The same shape — mean + learned basis of deformations + coordinates — underlies SMPL bodies, FLAME faces, and, at the end of the chapter, the latent spaces of generative networks. (L17 is registered in the part introduction, Warping and morphing; here the map is a statistical model rather than a drawn or estimated one.)
A morphable model is the family-constrained sibling of the per-pair morph of Morphing; its fitting step is an inverse-rendering analysis-by-synthesis that anticipates monocular 3-D reconstruction in 3D and depth; and its linear PCA prior is the classical ancestor of the neural image and shape priors in Deep learning and Super-resolution and image priors. Downstream it powers identity-preserving portrait editing, relighting, and avatar animation.
6.4.1 Step 1: dense correspondence, the precondition for averaging shapes
Everything rests on a precondition that is easy to state and hard to satisfy: before you can average a hundred faces or take their PCA, every face must be in dense correspondence with every other — the same anatomical point (the tip of the nose, the corner of the left eye, a specific spot on the cheek) must occupy the same index in every example's data vector. Only then does "average the noses" mean averaging noses rather than averaging a nose in one image with a cheek in another. This is the exact lesson of Morphing — align before you blend — promoted from two images to a whole population: you cannot meaningfully average, or compute a covariance over, points that do not correspond.
For 2-D models the correspondence is often a fixed set of landmarks placed on every training image — eye corners, nostrils, jaw points — by hand or by a detector; an instance is then the concatenation of its landmark coordinates, $\mathbf{s} = (x_1,y_1,\dots,x_K,y_K)$. For dense models (a 3-D face mesh, a body) every vertex must correspond across the set, which is itself a registration problem — fit a common template mesh to every scan so that vertex $j$ lands on the same anatomical point in all of them. Blanz and Vetter's original 3DMM obtained this dense correspondence between laser scans with an optical-flow-style algorithm; modern pipelines use template-fitting and non-rigid registration. The quality of the whole model is bounded by the quality of this correspondence: if it is sloppy, the "mean face" is a blurred average of misaligned parts, and the eigen-modes mix true variation with registration error.
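The indexing requirement can be made concrete with a minimal sketch. The helper names `to_shape_vector` and `to_landmarks` and the landmark coordinates are invented for illustration; the only point is that averaging is meaningful when index $j$ names the same anatomical point in every vector, and meaningless when it does not.

```python
import numpy as np

def to_shape_vector(points):
    """Flatten (K, 2) landmarks into (x1, y1, ..., xK, yK)."""
    return np.asarray(points, dtype=float).reshape(-1)

def to_landmarks(s):
    """Inverse of to_shape_vector: back to a (K, 2) array."""
    return np.asarray(s, dtype=float).reshape(-1, 2)

# Two toy "faces" with K = 3 landmarks in the same order:
# left eye, right eye, nose tip (coordinates are made up).
face_a = np.array([[-1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
face_b = np.array([[-1.2, 1.1], [1.2, 1.1], [0.0, -0.2]])

# Corresponded: index j is the same anatomical point in both vectors.
good_mean = to_landmarks((to_shape_vector(face_a) + to_shape_vector(face_b)) / 2)

# Broken correspondence: face_b now lists the nose tip first, so the
# "average" mixes a left eye with a nose tip, a right eye with a left eye,
# and a nose tip with a right eye.
face_b_shuffled = face_b[[2, 0, 1]]
bad_mean = to_landmarks((to_shape_vector(face_a) + to_shape_vector(face_b_shuffled)) / 2)
```

Here `good_mean` is still a face-like triangle, while `bad_mean` is a point set that belongs to no face at all, which is the blurred-average failure described above in miniature.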
6.4.2 Step 2: Procrustes alignment, factoring out pose so PCA sees shape
Two photos of the same face at different positions, scales, and in-plane rotations have different landmark coordinates, yet they are the same shape. If we ran PCA on the raw coordinates, the first and largest eigen-modes would be wasted describing pose (global translation, rotation, scale), the nuisance variation we do not want the model to spend its budget on. The fix is generalized Procrustes analysis: align every example to a shared reference with a similarity transform (translate to a common centroid, rescale to a common size, rotate to best match), then replace the reference with the mean of the aligned shapes and repeat until it stops moving. What remains after Procrustes is shape in the technical sense, geometry with similarity (translation, rotation, uniform scale) divided out, and that is what we feed to PCA. (Pose itself is not discarded; it becomes a few explicit parameters bolted onto the model at fit time, alongside the coordinates $\mathbf{c}$.) The discipline is the same one that opened Warping and resampling's DOF ladder: separate the global rigid/similarity part from the non-rigid part, and model only the latter statistically.
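A minimal sketch of generalized Procrustes analysis for 2-D landmarks follows, assuming the shapes arrive as an `(N, K, 2)` array already in correspondence. The function names, the unit-norm convention for the reference, and the stopping tolerance are choices made for this illustration, not a prescribed implementation.

```python
import numpy as np

def align_similarity(shape, ref):
    """Align a (K, 2) shape to a centered (K, 2) reference using
    translation, uniform scale, and rotation (no reflection)."""
    X = shape - shape.mean(axis=0)               # remove translation
    U, S, Vt = np.linalg.svd(X.T @ ref)          # orthogonal Procrustes
    d = np.sign(np.linalg.det(U @ Vt))           # -1 would mean a reflection
    D = np.diag([1.0, d])
    R = U @ D @ Vt                               # best rotation for X @ R ~ ref
    scale = (S * np.diag(D)).sum() / (X ** 2).sum()
    return scale * X @ R

def generalized_procrustes(shapes, n_iter=20, tol=1e-10):
    """Iteratively align all shapes to their own running mean."""
    shapes = np.asarray(shapes, dtype=float)     # (N, K, 2)
    ref = shapes[0] - shapes[0].mean(axis=0)     # initial reference: first shape
    ref /= np.linalg.norm(ref)                   # fix the overall size
    for _ in range(n_iter):
        aligned = np.stack([align_similarity(s, ref) for s in shapes])
        new_ref = aligned.mean(axis=0)
        new_ref -= new_ref.mean(axis=0)
        new_ref /= np.linalg.norm(new_ref)
        converged = np.linalg.norm(new_ref - ref) < tol
        ref = new_ref
        if converged:
            break
    return aligned, ref
```

The rows of `aligned.reshape(N, -1)` are the pose-free shape vectors that the PCA step consumes. The scale and rotation that were removed are not lost; they can be re-estimated per image at fit time.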
6.4.3 Step 3: PCA, a mean and a basis of eigen-deformations
With the population corresponded and aligned, PCA does the rest. Stack the $N$ aligned shape-vectors as rows, subtract the mean shape $\bar{\mathbf{s}}$, and take the eigenvectors of the covariance (in practice via the SVD of the centered data). The eigenvectors $\mathbf{e}_1,\mathbf{e}_2,\dots$, ordered by decreasing eigenvalue, are the eigen-deformations (for faces, evocatively, the "eigenfaces" of shape), and each carries a standard deviation $\sigma_k$ — how far the family typically moves along it. Any instance is
$$\mathbf{s} \;=\; \bar{\mathbf{s}} \;+\; \sum_{k=1}^{M} c_k\,\sigma_k\,\mathbf{e}_k ,$$
a mean plus a short ($M$-term) weighted sum (Figure 6.4.1). Three properties make this the workhorse of the whole field:
- Low-dimensional. The eigenvalues fall off fast: a handful of modes capture most of the variance, so a face that lived in a few hundred raw coordinates is well described by perhaps a few dozen $c_k$. The family really is a low-dimensional manifold sitting inside a high-dimensional coordinate space, and PCA finds the flat that best approximates it.
- Interpretable, ordered modes. Because the modes are sorted by variance, the first ones capture the biggest axes of variation — for faces, things like overall width/aspect, or a slim-vs-round face; later ones, finer detail. Sweeping a single $c_k$ from $-2\sigma$ to $+2\sigma$ traces that one axis of facial variation while everything else is held at the mean (the rows of Figure 6.4.1). The modes are not guaranteed to be human-namable, but the leading ones usually are.
- A built-in prior. A coordinate vector with small $\|\mathbf{c}\|$ (each $c_k$ within a couple of $\sigma$) is a typical member of the family; a vector far out in the tails is implausible. This is the statistical prior the model buys you: it is not merely a parameterization but a probability over shapes (a Gaussian, under PCA), and that is what later lets fitting regularize toward plausible faces and reject nonsense.
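The PCA step itself is only a few lines. The sketch below assumes the aligned shapes are stacked as rows of an `(N, D)` matrix and that `n_modes` does not exceed the rank of the centered data (so every retained $\sigma_k$ is nonzero). The function names are illustrative.

```python
import numpy as np

def fit_shape_pca(aligned, n_modes):
    """Mean shape, eigen-deformations (rows), and per-mode standard deviations."""
    X = np.asarray(aligned, dtype=float)         # (N, D), one shape per row
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the covariance eigenvectors,
    # and S**2 / (N - 1) are the corresponding eigenvalues.
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    sigmas = S[:n_modes] / np.sqrt(X.shape[0] - 1)
    return mean, Vt[:n_modes], sigmas

def synthesize(mean, modes, sigmas, c):
    """Instance = mean + sum_k c_k * sigma_k * e_k, with c in sigma units."""
    return mean + (np.asarray(c, dtype=float) * sigmas) @ modes

def project(mean, modes, sigmas, s):
    """Coordinates of a shape (or rows of shapes) in sigma units."""
    return ((np.asarray(s, dtype=float) - mean) @ modes.T) / sigmas
```

Sweeping one mode is then `synthesize(mean, modes, sigmas, c)` with `c` zero except for a single entry moved between $-2$ and $+2$. Because `project` divides by $\sigma_k$, the training coordinates have unit standard deviation along every retained mode, which is what makes $\|\mathbf{c}\|$ a sensible plausibility measure.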
The figures here run a real, tiny PCA on a small synthetic population of 2-D face landmarks: the mean is the genuine average, and the modes shown are the actual leading eigenvectors of that population's covariance — so "PC1 ≈ aspect" is what the data produced, not a hand-drawn cartoon.
6.4.4 Step 4: shape and appearance are separate bases
A face is not only a shape; it is also a coloring — skin tone, the dark of the brows and lips, shading. A morphable model handles these with two separate bases, and keeping them separate is exactly the separate-shape-from-color discipline of Morphing, now applied to a whole family. Geometry gets the shape PCA above. Appearance (texture) gets its own PCA — but, crucially, the textures are first warped to the mean shape so they are compared in a common, shape-free frame (every face's texture sampled at the same canonical landmarks before averaging). The result is a mean texture $\bar{\mathbf{t}}$ and an appearance basis $\mathbf{a}_k$, and a full instance is the pair
$$\mathbf{s} = \bar{\mathbf{s}} + \sum_{k} c_k\,\sigma_k\,\mathbf{e}_k, \qquad \mathbf{t} = \bar{\mathbf{t}} + \sum_{k} b_k\,\mathbf{a}_k,$$
with two coordinate vectors: $\mathbf{c}$ for geometry, $\mathbf{b}$ for appearance. To render an instance you synthesize the texture in the shape-free frame, then warp it onto the synthesized shape using the warp engine of Warping and resampling (the morphable model is that engine's biggest customer after morphing): texture-on-the-mean, deformed to the geometry the shape coordinates request (Figure 6.4.3). Why separate them? Because the two vary largely independently (a wide face can have any skin tone; a smile is a change of geometry, not of color) and because untangling them is what makes the model useful: relighting edits the appearance/shading without touching identity geometry; re-posing edits geometry without re-painting the texture; recognition can compare two faces in shape alone, robust to lighting. Active Appearance Models combine the two into a single joint appearance model; the 3DMM keeps shape, texture, and the imaging parameters (pose, illumination) explicit so each can be read off after fitting.
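The two-basis structure can be sketched as follows, assuming the textures have already been warped to the mean shape and flattened into vectors (the warp itself is not shown). The dictionary layout and function names are hypothetical. The appearance coordinates $\mathbf{b}$ are left in raw texture units here, which is a simplification of this sketch.

```python
import numpy as np

def pca(X, n_modes):
    """Mean, unit-norm modes (rows), and per-mode standard deviations."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_modes], S[:n_modes] / np.sqrt(len(X) - 1)

def build_model(shapes, shape_free_textures, n_shape, n_tex):
    """Two independent PCAs: one on aligned shapes, one on shape-free textures."""
    s_mean, E, sigma = pca(shapes, n_shape)
    t_mean, A, _ = pca(shape_free_textures, n_tex)
    return {"s_mean": s_mean, "E": E, "sigma": sigma, "t_mean": t_mean, "A": A}

def instance(model, c, b):
    """Return the (shape, texture) pair for coordinates c and b."""
    s = model["s_mean"] + (np.asarray(c, dtype=float) * model["sigma"]) @ model["E"]
    t = model["t_mean"] + np.asarray(b, dtype=float) @ model["A"]
    return s, t
```

Changing `c` leaves the returned texture untouched and changing `b` leaves the shape untouched. A full renderer would take the pair and warp `t` from the mean-shape frame onto `s`.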
6.4.5 Step 5: fitting a new photo by analysis-by-synthesis
A model is only useful if you can fit it to a new image — recover the coordinates of the face in this photo. The principle is analysis-by-synthesis (equivalently, inverse rendering or render-and-compare): rather than read the parameters off the image directly, search for the parameters whose synthesized image best reproduces the input. Concretely, hold a current guess of the coordinates $\mathbf{c},\mathbf{b}$ and the imaging parameters (pose, camera, lighting); render the model under them; compare the render to the target and measure the residual (the mismatch, in landmark positions and/or pixel colors); and update the parameters to shrink it. Iterate until the residual stops falling. It is the morphable model's answer to the same loop that drives every fitting problem in the book — propose, render, compare, correct (Figure 6.4.4).
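For 2-D landmarks the propose, render, compare, correct loop can be written out in full. The sketch below is a simplified stand-in for a real fitter: it assumes the target is a set of detected landmarks rather than pixels, models pose as a 2-D similarity (encoded as complex numbers $a$ and $t$), and alternates a pose update with a least-squares shape update. The function names are illustrative, and no prior term is included yet.

```python
import numpy as np

def as_complex(s):
    """(x1, y1, ..., xK, yK) -> K complex numbers x + iy."""
    p = np.asarray(s, dtype=float).reshape(-1, 2)
    return p[:, 0] + 1j * p[:, 1]

def fit_pose(z, w):
    """Least-squares 2-D similarity w ~ a*z + t, with complex a (scale and
    rotation) and complex t (translation)."""
    zc, wc = z - z.mean(), w - w.mean()
    a = np.vdot(zc, wc) / np.vdot(zc, zc)        # vdot conjugates its first argument
    return a, w.mean() - a * z.mean()

def fit_by_synthesis(target, mean, modes, sigmas, n_iter=30):
    B = (modes * sigmas[:, None]).T              # (2K, M); column k is sigma_k * e_k
    w = as_complex(target)
    c = np.zeros(len(sigmas))                    # propose: start at the mean shape
    residuals = []
    for _ in range(n_iter):
        z = as_complex(mean + B @ c)             # render in the model frame
        a, t = fit_pose(z, w)                    # correct the pose
        residuals.append(np.mean(np.abs(a * z + t - w) ** 2))   # compare
        back = (w - t) / a                       # target pulled into the model frame
        back_vec = np.column_stack([back.real, back.imag]).reshape(-1)
        c = np.linalg.lstsq(B, back_vec - mean, rcond=None)[0]  # correct the shape
    return c, a, t, residuals
```

Each half-step minimizes the same squared residual over one group of parameters with the other held fixed, so the recorded `residuals` never increase. A pixel-based fitter has the same loop shape but replaces the closed-form updates with gradient steps through a renderer.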
The objective being minimized is the residual plus a prior term that keeps the coordinates plausible, typically $\sum_k c_k^2$ (small when every $c_k$ is small in $\sigma$ units), which is, up to a constant factor and offset, the negative log of the Gaussian that PCA gave us in Step 3. That regularizer is the whole reason a morphable model can fit a single photograph and still produce a complete, sensible 3-D face: the data constrain the visible front of the face, and the prior fills in the occluded sides and the parts the lighting hid, by pulling the unconstrained coordinates toward the mean. Without the prior the fit would overfit the visible pixels and hallucinate a monster around the back. This is precisely the regularized inverse problem structure of Linear Inverse Problems and Regression and Super-resolution and image priors (data term + prior), with the morphable model supplying an unusually strong, learned, class-specific prior. The non-convex search (over geometry, texture, pose, and light at once) is the hard part in practice; Blanz and Vetter optimized it directly, while modern systems often regress the coordinates with a neural network in a single forward pass and refine from there (Section 6.4.7).
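When pose is held fixed and the data term is a squared landmark error, the regularized objective has a closed-form minimizer, which makes the role of the prior easy to see. The sketch below assumes a boolean `visible` mask over the entries of the shape vector and a weight `lam` on $\sum_k c_k^2$ (under a Gaussian noise model `lam` would be the noise variance; here it is simply a tuning parameter). The function name is illustrative.

```python
import numpy as np

def fit_with_prior(s_obs, visible, mean, modes, sigmas, lam=1.0):
    """Minimise ||s_obs - s(c)||^2 over the visible coordinates plus
    lam * sum_k c_k^2, and return c with the completed shape."""
    B = (modes * sigmas[:, None]).T              # (D, M); column k is sigma_k * e_k
    Bv = B[visible]                              # rows the data actually constrain
    rhs = Bv.T @ (s_obs[visible] - mean[visible])
    c = np.linalg.solve(Bv.T @ Bv + lam * np.eye(B.shape[1]), rhs)
    return c, mean + B @ c                       # hidden coordinates come from the model
```

The returned shape covers every coordinate, including the hidden ones: the visible entries determine `c`, and the basis carries that estimate to the rest. Raising `lam` shrinks `c` toward zero, and with nothing visible at all the fit returns the mean shape exactly.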
6.4.6 Step 6: once fitted, edit by moving in the space
The reason to go to all this trouble is what fitting enables. Once a photo is reduced to a point $\mathbf{c}$ (plus appearance $\mathbf{b}$, pose, and lighting), every downstream operation is a move in the space, and because the space is the learned manifold of plausible faces, every move stays on a plausible face.
- Re-pose / re-illuminate. Change the explicit pose and lighting parameters and re-render: the same identity from a new angle or under new light, the bread and butter of 3DMM-based face manipulation.
- Interpolate / average. The straight line between two fitted points is a morph that never leaves the family: every in-between is a valid face, with no ghosting, because the correspondence is baked into the basis (this is the morph of Morphing, with the correspondence supplied by the model rather than drawn).
- Edit semantically. Add a multiple of a mode that correlates with an attribute (older, wider, more of a smile) to push the face along that attribute while holding the rest.
- Recognize. Compare faces in the identity coordinates, where pose and lighting have been factored out, so the same person under different conditions lands at nearly the same point. This is the use that motivated much of the original work.
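Three of these moves are one-liners once faces are coordinate vectors. The sketch below assumes a matrix `C` of fitted coordinates (one row per face) and a numeric attribute label per face. Picking the single most correlated mode is only one simple way to find an attribute direction, and all function names are illustrative.

```python
import numpy as np

def interpolate(c0, c1, t):
    """Point a fraction t of the way along the straight line from c0 to c1."""
    return (1.0 - t) * np.asarray(c0, dtype=float) + t * np.asarray(c1, dtype=float)

def most_correlated_mode(C, attribute):
    """Index of the mode whose coordinate correlates most with the attribute,
    together with that (Pearson) correlation."""
    C = np.asarray(C, dtype=float)
    a = np.asarray(attribute, dtype=float)
    Cc, ac = C - C.mean(axis=0), a - a.mean()
    corr = (Cc.T @ ac) / (np.linalg.norm(Cc, axis=0) * np.linalg.norm(ac))
    k = int(np.argmax(np.abs(corr)))
    return k, corr[k]

def push_along_mode(c, k, amount):
    """Semantic edit: move `amount` sigma units along mode k, hold the rest."""
    out = np.array(c, dtype=float)
    out[k] += amount
    return out

def identity_distance(c_a, c_b):
    """Recognition score: Euclidean distance between coordinate vectors."""
    return float(np.linalg.norm(np.asarray(c_a) - np.asarray(c_b)))
```

Each edited or interpolated vector is turned back into a face by the model's synthesis step. Keeping the result within a couple of $\sigma$ per mode is what keeps it plausible.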
The last two uses, semantic editing and recognition, surface the identity-vs-expression factorization. The geometry of a face is the identity (bone structure, who it is, roughly constant for a person) composed with an expression (smile, frown, blink: momentary). A flat shape PCA blends the two; the more capable models give them separate factors. FLAME, for instance, carries distinct identity, expression, and pose/jaw parameters, so you can swap one person's expression onto another's identity, or animate a fixed identity through expressions. SMPL does the analogous split for bodies: a shape space (the person's build) and a pose space (the skeleton's articulation), with pose-dependent deformations layered on. The general pattern is to factor the deformation into independent groups (identity, expression, pose), each its own little basis, which is the rigging artist's blendshapes wearing a statistician's hat.
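A purely additive version of this factorization is easy to sketch. It is a simplification for illustration and not the actual FLAME or SMPL formulation (both add articulated pose and pose-dependent correctives). The assumed data layout is one neutral shape per person plus expressive shapes tagged with a person index; coordinates are in raw units rather than $\sigma$ units to keep the code short, and all names are hypothetical.

```python
import numpy as np

def build_factored_model(neutral, expressive, person, n_id, n_exp):
    """neutral: (P, D), one neutral shape per person.
    expressive: (N, D) expressive shapes; person[i] is the row of `neutral`
    that belongs to the same individual as expressive[i]."""
    neutral = np.asarray(neutral, dtype=float)
    mean = neutral.mean(axis=0)
    # Identity basis: PCA over neutral shapes only.
    _, _, Vt_id = np.linalg.svd(neutral - mean, full_matrices=False)
    # Expression basis: principal directions of the change from each person's
    # own neutral shape (not re-centered, so zero offset means "neutral").
    offsets = np.asarray(expressive, dtype=float) - neutral[person]
    _, _, Vt_exp = np.linalg.svd(offsets, full_matrices=False)
    return mean, Vt_id[:n_id], Vt_exp[:n_exp]

def synthesize(mean, E_id, E_exp, c_id, c_exp):
    """Shape = mean + identity deformation + expression deformation."""
    return mean + np.asarray(c_id) @ E_id + np.asarray(c_exp) @ E_exp
```

Expression transfer is then a matter of pairing one face's `c_exp` with another face's `c_id` before calling `synthesize`. In this additive model the same `c_exp` produces the same displacement on every identity.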
6.4.7 Step 7: from linear PCA to neural priors
PCA is a linear, Gaussian model: instances are linear combinations of the basis, and the prior is a single ellipsoid. That is its strength — interpretable, cheap, a closed-form mean and modes — and its ceiling. A linear basis cannot capture nonlinear variation (the manifold of faces is curved; large expressions and extreme shapes bend off the flat subspace), it ties detail to how well correspondence and the Gaussian assumption hold, and it tops out at a certain realism. The modern arc keeps the morphable-model idea — a mean and a learned space of deformations you fit and then navigate — but replaces the linear basis with a neural one:
- Regression networks that map a photo straight to 3DMM coordinates in one forward pass, trained with a render-and-compare (differentiable-rendering) loss — analysis-by-synthesis with the search amortized into a network. This is how monocular face reconstruction became fast and robust (cross-ref 3D and depth).
- Nonlinear / neural morphable models that learn the shape and appearance decoders themselves (autoencoders, implicit fields), so the "basis" is a nonlinear function of a latent code rather than a fixed set of eigen-vectors — more expressive than PCA at the cost of interpretability.
- Generative latent spaces — GANs and diffusion models — as implicit morphable models: a face GAN's latent space is a learned, navigable manifold of faces, where interpolation and attribute edits are again "moves in the space," now over a vastly richer (but less labeled) coordinate system (cross-ref Deep learning).
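The difference between a linear basis and a learned nonlinear decoder can be shown with a toy pair of decoders. The one-hidden-layer network below uses random, untrained weights and exists only to illustrate the interface; it is not a model of any published architecture, and the names are hypothetical.

```python
import numpy as np

def linear_decoder(mean, basis):
    """PCA-style decoder: shape = mean + z @ basis."""
    return lambda z: mean + np.asarray(z, dtype=float) @ basis

def mlp_decoder(mean, W1, b1, W2):
    """One-hidden-layer decoder: shape = mean + tanh(z @ W1 + b1) @ W2."""
    return lambda z: mean + np.tanh(np.asarray(z, dtype=float) @ W1 + b1) @ W2

rng = np.random.default_rng(0)
D, M, H = 10, 2, 16                      # shape dim, latent dim, hidden units
mean = rng.normal(size=D)
lin = linear_decoder(mean, rng.normal(size=(M, D)))
mlp = mlp_decoder(mean, rng.normal(size=(M, H)), rng.normal(size=H),
                  rng.normal(size=(H, D)))

z0, z1 = np.array([2.0, -1.0]), np.array([-1.0, 3.0])
z_mid = 0.5 * (z0 + z1)

# Linear: decoding the midpoint equals averaging the two decoded shapes.
lin_gap = np.linalg.norm(lin(z_mid) - 0.5 * (lin(z0) + lin(z1)))
# Nonlinear: a straight line in latent space is a curve in shape space.
mlp_gap = np.linalg.norm(mlp(z_mid) - 0.5 * (mlp(z0) + mlp(z1)))
```

`lin_gap` is zero up to rounding, while `mlp_gap` is not. Both decoders are used the same way (pick a code, decode, compare to the image), which is why the fitting and editing machinery carries over when the basis becomes a network.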
The constant across the whole arc is the framing this chapter began with. Whether the basis is a handful of eigenfaces or the latent space of a billion-parameter network, a morphable model is a mean object plus a learned space of deformations, fit to data by analysis-by-synthesis, and used by moving through that space — a learned warp basis over a mean shape. PCA was the first, clearest instance; the priors got stronger, the idea did not change.
6.4.8 Recap and significance
Three ideas outlast the specific models. First, a class of shapes is a low-dimensional space: corresponded, aligned, and PCA'd, a whole family collapses to a mean plus a basis of eigen-deformations, and any member is a short coordinate vector — any face is a point in this space. Second, separate shape from appearance (and, further, identity from expression from pose): independent factors, each its own basis, recombined by warping texture onto geometry — the morphing discipline at population scale. Third, fit by analysis-by-synthesis with a prior: search the coordinates that reproduce the image, regularized toward plausibility, so the model can fill in what a single photo cannot show.
Those three put the morphable model at a crossroads in the book. Backward, it is morphing constrained to a learned family (Morphing) built on the warp engine (Warping and resampling). Forward, its analysis-by-synthesis fitting is the template for monocular 3-D face and body reconstruction (3D and depth); its data-term-plus-prior objective is the classical face of the regularized inverse problems that recur in Super-resolution and image priors; and its linear basis is the ancestor of the neural priors and generative latent spaces of Deep learning — the same idea, a learned space of deformations over a mean, scaled up. The throughline of the whole part holds to the end: find the coordinate map, then move the pixels — only here the map is learned from a population, and moving through it is how you synthesize, recognize, and edit an entire visual class.