💬Comments welcome. To leave a note, select any text and click the note / highlight button that pops up — or open the panel with the tab at the top-right (‹). Notes are visible only inside our private review group.
Computational Photography, an AI-powered Slopendium — 14 3D and depth
expand to📖 Full book outlinejump to1 parts · 9 chapters · 0 sections · 10 figures embedded · 5 placeholders · double-click a figure to enlarge
Part 14 3D AND DEPTH
14.1 What "depth" means, and where it comes from
• recap, don't re-teach: **depth = the z-coordinate**, *not* the ray length to the point; **unprojection** (pixel + depth → 3-D point) inverts perspective projection (defined in [[Multiple view geometry]] / Fundamentals). A **depth map** is just a per-pixel z image; a **point map** (DUSt3R's term, below) is its unprojected 3-D field.
• the cues/sensors that *give* depth, as a quick inventory (each lives in its home chapter; collected here for contrast):
• **stereo / disparity** — two views, triangulate matched points (baseline + correspondence) → [[Multiple view geometry]]
• **multi-view** — many views (the SfM/MVS pipeline below)
• **active depth** — **time-of-flight**, **structured light** (project a known pattern), **LiDAR** — return z directly (pointer to sensors/optics; RealSense & phone face-ID dot projectors)
• **defocus / focus** — depth-from-defocus and focal stacks → [[Advanced computational photography#Light field and multi-aperture imaging]], Depth of field
• **dual-pixel** — the half-aperture parallax phones already capture for autofocus, reused for portrait depth → Optics (dual-pixel AF)
• **motion** — structure-from-motion / optical-flow parallax → [[Matching pixels across space and time]]
• **monocular cues** — a *single* still already implies depth to a human (occlusion, perspective, texture gradient, shading, familiar size, haze); turning those into numbers is the next chapter
• the recurring catch: most reconstructions recover geometry only **up to a similarity** (scale/rotation/translation) unless something fixes the **absolute scale** — a known baseline, a calibration target, or a metric sensor. Keep "relative" vs "metric" depth distinct.
14.2 Monocular depth estimation (one image → depth)
⬜ figure not yet created
`fig-monocular-cues` (one photo annotated with each depth cue) fig-depth-anything
fig-pinhole-imaging
fig-pinhole-imaging · imaging-scenario series (2/3): add a pinhole to the bare sensor — one ray per scene point → a dim **inverted** image (same tree+sensor+colours as fig-bare-sensor-averaging)
fig-depth-of-field-sim
fig-depth-of-field-sim · live macro depth-of-field simulator (web edition): a to-scale thin-lens diagram + a 3D photo (aperture-supersampled bokeh) + a circle-of-confusion plot all sharing one optics model; subject dropdown (Meshy beetle / bee / ladybug) over a flower background, focus landing on the front of the face so the body and antennae blur; static fallback is a screenshot. *Bug & flower 3D models generated with Meshy AI.*
14.3 Single-image 3-D: tour into the picture, photo pop-up, 3-D Ken Burns
fig-pinhole-imaging
fig-pinhole-imaging · imaging-scenario series (2/3): add a pinhole to the bare sensor — one ray per scene point → a dim **inverted** image (same tree+sensor+colours as fig-bare-sensor-averaging)
⬜ figure not yet created
`fig-popup-ground-vertical-sky` (geometric-context labels → folded pop-up)
fig-resample-forward-inverse
fig-resample-forward-inverse · concrete 2× upsampling on a 5×5 → 10×10 rainbow grid: FORWARD pushes input (i,j)→output (2i,2j) so 1 of every 2×2 output block is filled and 3 are black holes (regular lattice of gaps); INVERSE loops output, samples input via f⁻¹ (nearest) → every pixel filled (2×2 colour blocks). Forward leaves holes, inverse fills everything (Resampling, BASIC)
14.4 Multi-view 3-D reconstruction: the classic pipeline
fig-light-models
fig-light-models · three models of light: wave (λ) vs ray vs particle/photon propagation, and what each is good for 🟨
fig-slr-cross-section
fig-slr-cross-section · labelled cross-section of an SLR: 45° main reflex mirror sends light UP to the focusing screen + pentaprism → optical viewfinder; the semi-transparent main mirror + secondary (sub) mirror fold light DOWN to the phase-detect AF module in the mirror-box floor; focal-plane shutter + sensor/film behind; inset shows the mirror flipping up for exposure (companion to `fig-mirrorless-anatomy`)
14.5 Photos → radiance fields and Gaussian splatting (NeRF, 3DGS)
⬜ figure not yet created
`fig-nvs-vs-reconstruction` (the 3-D↔2-D triangle: rendering, reconstruction, NVS)
fig-pinhole-imaging
fig-pinhole-imaging · imaging-scenario series (2/3): add a pinhole to the bare sensor — one ray per scene point → a dim **inverted** image (same tree+sensor+colours as fig-bare-sensor-averaging)
⬜ figure not yet created
`fig-3dgs-splat` (anisotropic 3-D Gaussians → projected & alpha-blended)
fig-isp-walk-real
fig-isp-walk-real · the ISP pipeline walked on ONE real photo (boatman gathering lotus, overcast lake, Sony DNG): raw (linear, no WB — flat & green) · white balance · tone+colour (gamma encode) · denoise · sharpen · finished JPEG, frame-coloured by linear-light vs encoded regime — real-image companion to `fig-isp-block-diagram` (Recap ISP, BASIC)
14.6 Feed-forward (amortized) 3-D: skip the per-scene optimization
fig-pinhole-imaging
fig-pinhole-imaging · imaging-scenario series (2/3): add a pinhole to the bare sensor — one ray per scene point → a dim **inverted** image (same tree+sensor+colours as fig-bare-sensor-averaging)
14.7 Re-photography
fig-portrait-lighting-sim
fig-portrait-lighting-sim · live portrait-lighting simulator (web edition): three soft area lights (key/fill/kicker — az/el/extent/intensity/colour, area-light-supersampled soft shadows) on a 3D face with a physiological melanin/hemoglobin skin model; a from-behind setup view with Meshy studio umbrellas + a camera rig; lighting presets; static fallback is a screenshot. *3D umbrella generated with Meshy AI.*
⬜ figure not yet created
the estimated relative-pose vector) fig-rolling-shutter-perrow
14.8 The landscape, and is 3-D a "fake task"?
• a deliberately reflective close (slides 84–86): the field is a **complex landscape** along several axes — *reconstruction* (output 3-D) vs *NVS* (output 2-D); *overfit-per-scene* (NeRF, 3DGS) vs *amortized* (DUSt3R, VGGT); *baked illumination* vs *relightable*; **number of input views** (1 → 360° → ∞); the **3-D representation** used (volumetric/fuzzy vs hard surface vs 4-D ray space — or none); how much **prior** is learned; world-space vs camera-space.
• the provocation worth leaving the reader with: the **Bitter Lesson** (Sutton) and the **"parable of the parser"** — when data + compute are plentiful, the hand-built intermediate (here, *explicit 3-D*) may be an unnecessary **"fake task"** that learning can skip. **Is 3-D still needed, or is it scaffolding we'll discard?** (Counterpoint: real applications — robotics, AR, fabrication, autonomous driving — keep wanting *actual* geometry. → [[Adjacent fields and applications]].)
• frontier (slide 50): scalability / long context, beyond shape+texture (materials, **relightability**), **motion** (dynamic scenes), less supervision, fewer inductive biases.
14.9 3-D displays
• the *output* counterpart to all of the above — once you have a 3-D scene, how do you *show* it with parallax? Brief pointer, since the optics live with light fields: **lenticular / parallax-barrier** autostereoscopic displays, **light-field displays**, head-tracked & VR/AR stereo, and research like **Project Starline**. Cross-link to the light-field display material in [[Advanced computational photography#Light field and multi-aperture imaging]] (lenticular displays, 3-D TV) rather than re-deriving.