💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
jump to
💡 In a hurry? Jump to this chapter’s 3 big lessons ↓

8.8 Compositing, segmentation and matting

Cut a person out of one photograph and drop them into another. State the task plainly and it splits into two jobs that, once separated, stay separate for the rest of the chapter: which pixels are the subject (a selection), and how to lay those pixels over the new background so the seam is invisible (a blend). The first job is segmentation and matting; the second is compositing, together with its seamless cousins, Poisson and pyramid blending, which we built in the EDGES MATTER part and only point back to here.

The honest difficulty is hiding in the word "which." At a clean, in-focus silhouette — the edge of a coffee mug against a wall — a pixel is unambiguously mug or wall, and a binary mask is the right answer. But run the same question along a head of windblown hair, a motion-blurred arm, the out-of-focus rim of a leaf, or a pane of frosted glass, and there is no honest yes-or-no. A single sensor pixel there integrated light from both the foreground strand and the background behind it; it is genuinely part foreground and part background. The faithful description of such a pixel is not a label but a mixture, and writing that mixture down gives us the one equation this whole chapter orbits:

$$ C = \alpha F + (1-\alpha)B, \qquad \alpha \in [0,1]. $$

Every observed pixel $C$ is a blend of a foreground color $F$ and a background color $B$ in proportion $\alpha$ — the alpha, or matte: $\alpha=1$ is pure foreground, $\alpha=0$ is pure background, and the interesting pixels are the in-between ones at the boundary (Figure 8.8.1). This is the compositing equation, and it points two ways. Read forward — you are given $F$, $B$, and $\alpha$, and you produce $C$ — it is compositing, the easy direction, a per-pixel blend. Read backward — you see only $C$ and must recover $\alpha$, $F$, and $B$ — it is matting, and it is brutally ill-posed: three numbers known, seven unknown, per pixel. That backward direction is where almost all the work, and all the cleverness, lives.

The unifying machinery is the spine of this chapter, and it is machinery you have already met. A hard selection — a binary mask — is a graph cut: a globally-optimal binary labelling whose energy is a region data term plus an edge-aware smoothness term, exactly the discrete optimization of Seam optimization (L12), and that smoothness term is an affinity (L4). A soft selection — the matte $\alpha$ — is the same idea relaxed to a continuous value: the matting Laplacian turns out to be a pixel-affinity matrix, so natural-image matting is "propagate $\alpha$ along affinities," which is the colorization solve of the colorization solve in the deblurring chapter wearing a different hat. Once $\alpha$ is in hand, the over operator drops the cutout onto the new background; making it actually look like it belongs — consistent color, light, and grain — is harmonization, and erasing the seam itself is Poisson / pyramid blending, both developed in EDGES MATTER and used, not re-derived, here.

💡 Big lesson (recurrence of L12 — image edits as discrete optimization on a graph; the energy is the design)

A hard selection is exactly the graph cut of Seam optimization. Make the pixels nodes of a grid graph, add a region data term ("this color looks like foreground / like background") and an edge-aware smoothness term ("it's cheap to cut between two pixels whose colors already differ"), and min-cut / max-flow returns the globally optimal boundary — for two labels, exactly optimal, which is why graph cut beats greedy region-growing. The optimizer is off-the-shelf; choosing the energy is the whole design. And the smoothness weight $V_{pq}\propto e^{-\lVert C_p-C_q\rVert^2/2\sigma^2}$ is an affinity — so this is also a recurrence of L4 · edge-preserving = affinity: the very same "how much do these two pixels belong together" that drove the bilateral filter, now deciding where not to cut. Below, the same affinity returns once more as the matting Laplacian — a cut and a soft propagation are the two faces of one idea. (→ see Big lesson L12, first placed in Seam optimization, extending L4, first placed in Bilateral filtering.)

fig-matting-equation
Figure 8.8.1. The matting equation made visible. A single observed pixel $C$ taken from a hairy boundary is decomposed into its three constituents: a foreground color $F$, a background color $B$, and the soft coverage $\alpha$ that mixes them, $C=\alpha F+(1-\alpha)B$. A magnified strip across the strand of hair runs $\alpha$ from $1$ (solid hair) through the fractional values where a single sensor pixel straddled hair and sky, down to $0$ (clear background). The point: at the boundary a pixel is not foreground or background but a genuine mixture, and the honest unknown is the continuous $\alpha\in[0,1]$, not a binary label.

8.8.1 Segmentation: cutting the object out

Start with the hard version of the cut: produce a binary mask — these pixels are the subject, those are not. There are three classic ways to pose it, and the lesson they teach together is L12: each is a discrete optimization on a graph, the three differ only in what energy they minimize and which off-the-shelf optimizer solves it, and the energy you pick — not the solver — is the design.

Interactive graph-cut segmentation

The cleanest framing is Boykov and Jolly's (2001) interactive graph cut. Treat the image as a grid graph whose nodes are pixels, and let the user paint a few foreground and background strokes. From those strokes build a color model — say, the color distribution of foreground pixels versus background pixels — which gives a per-pixel data term $D_p(L_p)$: the cost of labelling pixel $p$ as foreground or background, low when $p$'s color matches that label's model. Add the edge-aware smoothness term

$$ V_{pq} \propto \exp\!\big(-\lVert C_p - C_q\rVert^2 / 2\sigma^2\big), $$

a penalty for giving neighbouring pixels $p,q$ different labels — large (expensive to cut) when their colors are similar, small (cheap to cut) when their colors already differ. The total energy is

$$ E(\mathbf L) = \sum_p D_p(L_p) \;+\; \lambda \sum_{(p,q)\in\mathcal N} V_{pq}(L_p,L_q), $$

over all labellings $\mathbf L$ assigning each pixel to foreground or background. Minimizing this is exactly a min-cut / max-flow problem: add a source terminal (foreground) and a sink terminal (background), wire each pixel to the terminals with t-links weighted by its data term and to its neighbours with n-links weighted by the smoothness term, and the minimum cut separating source from sink is the optimal labelling, with the cut running along the object boundary (Figure 8.8.2). For two labels this minimum is found globally and exactly — there is no local-minimum trap — which is precisely why graph cut beats greedy region-growing, and why the fast max-flow algorithms (Boykov et al. 2001; Boykov and Kolmogorov 2004, whose solver and $\alpha$-expansion move are the workhorse here) made it practical.

fig-segmentation-graphcut
Figure 8.8.2. Interactive segmentation as a min-cut. Left: the image as a grid graph — every pixel is a node, joined to its 4-neighbours by n-links (weighted by color affinity $V_{pq}\propto e^{-\lVert C_p-C_q\rVert^2/2\sigma^2}$, thick where neighbours match) and to a source (foreground) and sink (background) terminal by t-links (weighted by the region data term $D_p$ from the user's strokes / color model). Middle: the minimum cut — the cheapest set of edges to sever to disconnect source from sink — threads through where colors change and where the data term is weak. Right: that cut is the object boundary, yielding the binary mask. Inset: GrabCut's looser interaction — a rectangle around the subject is enough to seed the color models, after which the cut is re-run and the models re-estimated, iterating to a clean mask.

GrabCut: less input, same cut

GrabCut (Rother et al. 2004) is the same machinery with a far gentler interaction. Instead of foreground/background strokes, the user simply drags a loose rectangle around the subject. Everything outside is seeded as background; everything inside is provisionally foreground. Fit Gaussian-mixture color models to those two sets, run the graph cut, then re-estimate the color models from the resulting labelling and cut again — iterate until it settles. A rectangle, plus perhaps a stroke or two to fix mistakes, becomes a clean mask. This is the interaction model that every "select subject" tool descends from, and it begins a thread that runs through the rest of the chapter: less user input is the research arc.

Two other framings: normalized cuts and intelligent scissors

The same selection can be posed two other ways, with two other optimizers — the point of the next figure (Figure 8.8.3).

Normalized cuts (Shi and Malik 2000) is the spectral framing. Build a pixel-affinity graph and cut it into pieces, but penalize the cut normalized by the size of the pieces so the solution does not trivially lop off a single pixel; the resulting normalized-cut cost is minimized (in a relaxation) by an eigenvector of the graph Laplacian. The affinity is the same "how much do these pixels belong together," but the optimizer is an eigensolver and the result is unsupervised grouping — no strokes, no terminals, just the image's own structure carved into balanced regions.

Intelligent scissors, or live-wire (Mortensen and Barrett 1995), is the boundary framing. Here the energy lives on the contour, not the regions: define an edge cost that is low along strong image edges, and as the user traces near the object's outline, a least-cost path (dynamic programming) snaps to the true contour between the clicked anchor points. The optimizer is shortest-path; the design is again the energy — what counts as a cheap place to put the boundary.

Region (graph cut), spectral (normalized cuts), boundary (scissors): three energies, three optimizers, one lesson L12.

fig-segmentation-three-framings
Figure 8.8.3. One task, three graph energies. The same object is selected three ways. (a) Min-cut (graph cut / GrabCut): a region energy — data term + edge-aware smoothness — minimized by max-flow; the cut is a closed boundary around the region. (b) Normalized cut (spectral): the pixel-affinity graph is partitioned to minimize a size-normalized cut cost, solved by an eigenvector of the graph Laplacian — unsupervised, balanced grouping. (c) Least-cost path (intelligent scissors / live-wire): the energy lives on the boundary, cheap along strong edges; dynamic programming finds the shortest such path and it snaps to the contour as the user traces. Three optimizers (max-flow, eigensolver, shortest-path) for one selection — the energy is the design, not the solver (L12).

The bridge to soft selection

Every method here returns a hard 0/1 mask. Run any of them through a head of hair and the result looks scissor-cut: jagged, fringed, with the fine wisps either chopped off or surrounded by a halo of the old background. The mask had to commit each pixel wholly to foreground or background, and the boundary pixels are exactly the ones for which that commitment is a lie. The fix is to stop committing — to relax the binary label to a continuous $\alpha\in[0,1]$. That is matting, and it is the rest of the chapter.

8.8.2 The matting equation and the alpha channel

We have already written the model: each observed pixel is a mixture, $C = \alpha F + (1-\alpha)B$, with $\alpha$ the coverage or transparency of the foreground. The whole point of $\alpha$ is to be fractional at the boundary — a hair strand that covers a third of a pixel contributes $\alpha\approx\tfrac13$ of its color, the sky behind it the other two-thirds — so that the cutout can carry soft, believable edges instead of a stair-stepped silhouette.

Compositing — the forward direction — is the easy half, and it is just the over operator (Porter and Duff 1984, who formalized digital compositing and premultiplied alpha). Given a foreground $F$ with matte $\alpha$ and any new background $B'$, the composite is one per-pixel blend, $C = \alpha F + (1-\alpha)B'$. In practice colors are stored premultiplied by their alpha, $F' = \alpha F$, so the over operator becomes the cheaper and filter-correct $C = F' + (1-\alpha)B'$ — premultiplication is what lets you downsample, blur, or interpolate an image with transparency without colors bleeding out of the antialiased edges. None of this is hard; it is the inverse that hurts.

Matting — the inverse direction — asks you to look at a single photograph, observe only $C$ (three numbers, the RGB of each pixel), and recover $F$ (three), $B$ (three), and $\alpha$ (one): seven unknowns from three equations, per pixel. There is no solving that as stated. Every matting method in this chapter is, at bottom, a different answer to one question — where does the missing information come from?

This is another recurrence of L10 — the prior is not optional; see Super-resolution and image priors (the learned version is L8). Matting is a genuinely under-determined inverse problem, and the methods differ only in what they add to pin it down: a user trimap plus a natural-image prior (natural matting), a known background (chroma key), an extra measurement (depth, infrared, flash), or a learned model (deep matting, Segment Anything (SAM)) — in every case load-bearing, not a tuning knob.

8.8.3 Trimaps: turning the impossible inverse into a tractable one

The standard way to make matting tractable is the trimap: a three-way labelling of the image into definitely foreground, definitely background, and a thin band of unknown pixels $\Omega$ straddling the boundary — the hair, the blur, the soft rim. The trimap can come from a few user scribbles, from dilating a rough binary mask outward and inward by a few pixels, or, increasingly, from a learned segmenter.

Why does this rescue the problem? Because in the definite regions the answer is no longer unknown: where a pixel is definitely foreground, $\alpha=1$ and $C$ is the foreground color $F$; where definitely background, $\alpha=0$ and $C$ is $B$. The algorithm then has to solve for $\alpha,F,B$ only in the narrow unknown band, and it can borrow the nearby observed foreground and background colors as boundary anchors. "Seven unknowns everywhere" collapses to "a few unknowns in a thin strip, hemmed in by strong known values on both sides" (Figure 8.8.4). That is a problem with an answer.

The trimap also makes the chapter's research arc legible. Watch the user input shrink: hard foreground/background scribbles (graph cut) → a loose rectangle (GrabCut) → a trimap band (classical matting) → nothing at all (learned / SAM). Less input, every step.

[figure fig-trimap-to-alpha not built]
Figure 8.8.4. From trimap to alpha. Left: the user (or a rough segmenter) supplies a trimap — white = definitely foreground, black = definitely background, grey = the unknown band $\Omega$ straddling the hair. In white the matte is pinned to $\alpha=1$ and $C=F$ is observed; in black, $\alpha=0$ and $C=B$ is observed; only the grey strip is solved. Right: the recovered continuous $\alpha$ over the unknown band — individual wisps of hair emerge with fractional coverage, each pixel's $\alpha$ set by how much hair it actually covered (after closed-form matting). The thin band plus strong boundary anchors is what turns the impossible 7-unknowns-everywhere inverse into a tractable few-unknowns-in-a-strip solve.

8.8.4 Natural-image matting: Bayesian and closed-form

When the only extra information is a trimap and a prior — no green screen, no depth sensor — we are in natural-image matting: pull a soft matte out of an ordinary photograph against an ordinary, cluttered background. Two methods anchor the classical era, and the second reveals the affinity hinge that ties this whole chapter to the previous part.

Bayesian matting (Chuang et al. 2001) attacks the unknown band pixel by pixel, working outward from the known regions. For each unknown pixel it gathers nearby known foreground and background colors, models each as a Gaussian color cluster, and finds the $(F,B,\alpha)$ that best explains the observed $C$ in a maximum-a-posteriori sense — likeliest mixture given the local color statistics. Sweep this from the trimap boundary inward and the band fills in. It is intuitive and often works, but it is fundamentally local: each pixel is solved on its own, so neighbouring alphas have no built-in reason to agree, and the matte can be noisy where colors are ambiguous.

Closed-form matting (Levin, Lischinski and Weiss 2008) makes the global leap. Its key move is to eliminate $F$ and $B$ entirely. Assume that within any small window the foreground colors lie on a single color line and the background colors on another — a mild, usually-true local model. Under that assumption one can show that $\alpha$ is, within each window, an affine function of the pixel color; pushing this through algebraically removes $F$ and $B$ and leaves a quadratic energy in $\alpha$ alone:

$$ \hat\alpha = \arg\min_\alpha \; \alpha^\top L\,\alpha \quad \text{subject to the trimap constraints,} $$

where $L$ is the matting Laplacian — a large, sparse pixel-affinity matrix whose entries encode, from the local color-line model, how strongly each pair of neighbouring pixels should share the same $\alpha$. Minimizing a quadratic subject to linear (trimap) constraints is a sparse linear solve — the very same conjugate-gradient / multigrid machinery as Poisson editing and regression (Figure 8.8.4). The matte comes out globally consistent because $L$ couples the whole band at once, not one pixel at a time.

💡 Big lesson (the hinge — recurrence of L4 — edge-preserving = affinity)

The matting Laplacian $L$ is an affinity matrix. It is the same kind of object as the bilateral filter's weight, as the graph-cut smoothness term $V_{pq}$ above, and — most pointedly — as the affinity matrix in the colorization solve (Levin, Lischinski and Weiss 2004, the same authors' earlier work). Closed-form matting is, almost line for line, colorization with $\alpha$ in place of chroma: "propagate the unknown value along pixel affinities, anchored by the user's hard constraints." The cut (L12) and the soft propagation (this) are the two faces of one principle — measure how much two pixels belong together, then either don't cut between them or make them share a value. (→ see Big lesson L4, first placed in Bilateral filtering; the cut form is L12.)

Robust / sampling hybrids (Wang and Cohen 2007 and the broader robust-matting line) combine the two ideas — sample candidate $(F,B)$ colors per pixel like Bayesian matting, and impose the affinity smoothness of the matting Laplacian — and were the workhorse just before learned methods took over.

8.8.5 Easy matting I — blue/green/chroma keying (constrain the background)

There is a way to make matting well-posed by construction: take three of the seven unknowns off the table by controlling the scene. Shoot the subject against a known, uniform, saturated background — the green or blue screen — so that $B$ is simply known everywhere, and chosen to be a color far from anything in skin, hair, or clothing (Smith and Blinn 1996, who put blue-screen matting on a clean algebraic footing). With $B$ fixed, only $\alpha$ and $F$ remain, and the per-pixel solve becomes easy and stable (Figure 8.8.5, left).

Green versus blue: green dominates today for two practical reasons — a Bayer sensor has two green photosites per cell, so the green channel is the least noisy, and green is rarer than blue in skin tones and common wardrobe. Blue was the film-era standard. Failure modes are exactly the ways the "known $B$" assumption breaks: spill, where green light bounces off the screen onto the subject's edges and hair (handled by de-spill correction); uneven screen lighting, where the background is not actually one constant color across the frame; and fine detail, where even a perfect green screen leaves a genuinely soft $\alpha$ at the hair boundary that still has to be matted. This is exactly why a film studio obsesses over lighting the screen flatly and evenly — every unevenness is a place where $B$ stops being known. The moral: chroma key removes unknowns by controlling the scene.

8.8.6 Easy matting II — IR, depth and multi-flash matting (measure the separation)

The other road to an easy matte spends not control over the scene but an extra measurement: arrange the capture so that the sensor itself hands you something that separates foreground from background, independent of color. This is the computational-illumination thread.

Depth matting uses a depth sensor — time-of-flight, stereo, the dual-pixel split on a phone sensor, or LiDAR — to label "near = foreground, far = background" with no reference to color at all. This is what phone portrait mode and video-call background blur run on. The catch is that depth edges are coarse and noisy compared to the color image, so depth is rarely used as the final matte; instead it generates a trimap automatically (near, far, and an uncertain band at the depth discontinuity), which is then refined by a closed-form or guided-filter matte on the color image — the same affinity solve, and the same edge-aware refinement step as the dehazing transmission map in Dehazing.

Infrared matting (the "Disney" approach) lights the background in the infrared with a retro-reflective IR backdrop, invisible to the eye and to the color image. A second, IR-only image then sees the backdrop glowing uniformly bright — a built-in green screen with no visible colored screen on set to spill onto the subject or to be relit. You get the constrained-$B$ benefit of chroma key without a colored screen in shot.

Flash / multi-flash matting exploits the inverse-square falloff of a flash. A flash brightens the near foreground enormously and the far background hardly at all (light falls off as $1/d^2$), so differencing a flash and a no-flash frame — or reading the occlusion shadows cast in a multi-flash rig — separates foreground from background by illumination falloff rather than color (Figure 8.8.5, right). The moral here is the complement of chroma key: spend a sensor or a light to make the matte well-posed, rather than spending user input or a prior.

fig-key-vs-measure
Figure 8.8.5. Two roads to an easy matte. Left — constrain $B$ (green screen): shoot against a known, uniform, saturated background, so $B$ is known everywhere and only $\alpha,F$ remain; the cartoon shows the subject on green and the clean matte that falls out, plus a callout of the failure modes (spill onto hair, uneven screen lighting). Right — measure the separation (depth / IR / multi-flash): add a measurement that distinguishes near from far independent of color — a depth map ("near = foreground"), an IR backdrop image (an invisible green screen), or a flash/no-flash difference (the near subject brightens far more than the far background, $1/d^2$). One road controls the scene; the other spends a sensor or a light.

8.8.7 Rotoscoping and video matting (coherence over time)

Add a time axis and selection becomes video matting, with one genuinely new requirement.

Rotoscoping is the hand-craft ancestor: trace the moving subject frame by frame, originally over filmed footage by hand. Modern tools make it bearable by letting the artist keyframe spline contours or masks at a few frames and interpolate between them — segmentation with a temporal axis bolted on.

The new requirement is temporal coherence. A matte that is individually plausible on each frame but flickers between frames — a hair edge that shimmers, a boundary that jitters by a pixel — looks worse to a viewer than a matte that is slightly wrong but perfectly stable. So video matting adds a temporal smoothness or propagation term: track the subject, or optical-flow-warp the matte from one frame to the next, so the affinity graph now spans the whole video volume rather than a single frame. The applications are everywhere a moving subject must be isolated — green-screen visual effects, live virtual backgrounds, object removal or replacement, and the selection layer under any video edit (a substrate for the stylization of Non-photorealistic rendering). Learned video matting (Lin et al. 2021, real-time background matting) does the per-frame regression and the temporal coherence together with a recurrent network.

8.8.8 Learned segmentation and matting (the prior, learned)

The deep-learning era does not change the skeleton of this chapter at all. The task is still select, then recover $\alpha$; the matting equation $C=\alpha F+(1-\alpha)B$ is unchanged; "cut, then blend" is unchanged. What changes is the prior: the hand-built color model, hand-designed affinity, and hand-drawn trimap are replaced by a function fit to data.

💡 Big lesson (callback to L8 — a learned operator swaps a hand-designed prior for one learned from data)

Deep segmentation and matting keep the same template as their classical predecessors and swap only the prior — the color model, the affinity, the trimap propagation — for one learned from a dataset. The inverse-problem structure is identical; the prior is now data-driven, with the usual costs (data, compute, and failure on inputs unlike the training distribution). (→ see Big lesson L8, first placed in Machine learning; it extends the prior-not-optional L10 of this chapter.)

SAM — Segment Anything (Kirillov et al. 2023) is the present-day front-end. It is a promptable segmentation model: click a point, drag a box, or ask for "everything," and it returns clean masks for arbitrary objects with no per-class training. SAM effectively collapses GrabCut's whole interaction — the rectangle, the color model, the iterate — into a single prompt against a large pretrained network, and serves as the "cut anything" stage that feeds masks or trimaps to everything downstream.

Deep alpha matting (Xu et al. 2017's deep image matting, and trimap-free successors such as MODNet and Lin et al. 2021's robust matting) trains a network to regress the soft $\alpha$ (and often $F$) directly, learning the hair-and-edge prior from labelled examples instead of assuming a local color line. The trimap-free variants push the user input to zero — the end of the "less input" arc that began with Boykov–Jolly's strokes — at the cost of being only as good as their training distribution. As always, these are taught as models in Deep learning; here they are simply the learned-prior end of the same select-then-recover spectrum. Fully generative object insertion — re-rendering the composite, shadow and all, rather than just cutting and pasting — belongs to Generative AI and diffusion (L11).

8.8.9 Harmonization: making the composite belong

Suppose the cut is finally perfect — a flawless soft $\alpha$, every wisp of hair accounted for. Drop the cutout onto the new background and it can still look wrong: it sits there "stickered on," lit by a different sun, white-balanced by a different camera, carrying a different contrast and a different film grain than its new surroundings. The matte was right; the appearance is wrong (Figure 8.8.6). Fixing that takes two further layers, both of which were built in EDGES MATTER and are only invoked here.

Seamless blending hides the seam — the abrupt jump in color and brightness right at the boundary. Poisson / gradient-domain blending (Pérez et al. 2003) pastes the foreground's gradients rather than its pixels and re-derives the boundary color from the background, so the cutout slides in tone to match its surroundings at the rim (L9, Poisson image editing). Multi-band / pyramid blending (Burt and Adelson 1983) instead cross-fades the two images band by band, feathering coarse bands over a wide transition and fine bands over a narrow one (Linear pyramids and wavelets). Both are used here, not re-derived.

Global appearance harmonization fixes what blending cannot: the foreground's overall color distribution, illumination direction, contrast, and grain still have to be made consistent with the background. Classical methods transfer color statistics from background to foreground (Reinhard et al. 2001); learned methods train a network to predict a harmonized composite directly. The distinction worth holding onto is that Poisson blending hides the seam but not a wrong shading direction — paste a face lit from the left onto a scene lit from the right and Poisson will make the boundary invisible while leaving the lighting plainly contradictory; only harmonization addresses that. The takeaway is that a believable composite needs all three ingredients: a correct matte (the cut), a seamless blend (Poisson or pyramid), and consistent appearance (harmonization).

[figure fig-harmonization not built]
Figure 8.8.6. Three stages of belonging, same cutout. (a) Raw paste: a correct-$\alpha$ cutout dropped straight onto the new background — "stickered on," with a visible brightness/color jump at the rim and mismatched overall tone. (b) Poisson / pyramid blend: the seam is gone (the gradient-domain or multi-band blend re-lights the boundary to match), but a wrong global lighting or color cast can remain. (c) Harmonized: the foreground's color distribution, contrast, and grain are matched to the background as well, so it finally reads as part of the scene. The progression makes the point that the matte is necessary but not sufficient — a believable composite needs cut + blend + appearance.

8.8.10 Where the blends live: Poisson, pyramid, photomontage, and fusion (cross-refs)

This chapter owns the cut; the blends it leans on were built elsewhere, and it is worth a one-paragraph map of where each lives so the cross-references resolve cleanly.

Poisson / gradient-domain blending is Poisson image editing (EDGES MATTER), used above for the seam (L9). Pyramid / multi-band blending is Linear pyramids and wavelets (Burt and Adelson 1983). Interactive digital photomontage (Agarwala et al. 2004) is the canonical system that wires both halves of this chapter together: a graph-cut seam (L12) chooses, per region, which source photograph contributes its pixels — to assemble a group portrait where everyone is smiling, to paint out a tourist against a clean plate, or to extend depth of field — and then a Poisson blend (L9) hides whatever residual mismatch the seam leaves. Cut along the edges, then reconstruct across them; it sits at the seam/blend crossover in EDGES MATTER and is showcased here. Finally, multi-image fusion — high-dynamic-range (HDR) merge, exposure fusion, focus stacking, flash/no-flash — lives in Multiple exposure imaging (and the Recap: tone mapping recap); it shares the "weighted per-pixel combine in a pyramid" mechanics with pyramid blending but combines exposures or focus, not objects.

The one-line map: segmentation/matting is the cut (this chapter); Poisson / pyramid / photomontage is the blend (EDGES MATTER); HDR / fusion is the multi-capture combine (Multiple exposure imaging).


Big lessons of this chapter

The recurring principles from this chapter, gathered for review.

💡 Big lesson (recurrence of L12 — image edits as discrete optimization on a graph; the energy is the design)

A hard selection is exactly the graph cut of Seam optimization. Make the pixels nodes of a grid graph, add a region data term ("this color looks like foreground / like background") and an edge-aware smoothness term ("it's cheap to cut between two pixels whose colors already differ"), and min-cut / max-flow returns the globally optimal boundary — for two labels, exactly optimal, which is why graph cut beats greedy region-growing. The optimizer is off-the-shelf; choosing the energy is the whole design. And the smoothness weight $V_{pq}\propto e^{-\lVert C_p-C_q\rVert^2/2\sigma^2}$ is an affinity — so this is also a recurrence of L4 · edge-preserving = affinity: the very same "how much do these two pixels belong together" that drove the bilateral filter, now deciding where not to cut. Below, the same affinity returns once more as the matting Laplacian — a cut and a soft propagation are the two faces of one idea. (→ see Big lesson L12, first placed in Seam optimization, extending L4, first placed in Bilateral filtering.)

💡 Big lesson (the hinge — recurrence of L4 — edge-preserving = affinity)

The matting Laplacian $L$ is an affinity matrix. It is the same kind of object as the bilateral filter's weight, as the graph-cut smoothness term $V_{pq}$ above, and — most pointedly — as the affinity matrix in the colorization solve (Levin, Lischinski and Weiss 2004, the same authors' earlier work). Closed-form matting is, almost line for line, colorization with $\alpha$ in place of chroma: "propagate the unknown value along pixel affinities, anchored by the user's hard constraints." The cut (L12) and the soft propagation (this) are the two faces of one principle — measure how much two pixels belong together, then either don't cut between them or make them share a value. (→ see Big lesson L4, first placed in Bilateral filtering; the cut form is L12.)

💡 Big lesson (callback to L8 — a learned operator swaps a hand-designed prior for one learned from data)

Deep segmentation and matting keep the same template as their classical predecessors and swap only the prior — the color model, the affinity, the trimap propagation — for one learned from a dataset. The inverse-problem structure is identical; the prior is now data-driven, with the usual costs (data, compute, and failure on inputs unlike the training distribution). (→ see Big lesson L8, first placed in Machine learning; it extends the prior-not-optional L10 of this chapter.)