7.10 Body pose estimation

Move from the face to the whole body and the problem becomes articulated: a person is a kinematic chain of joints, and pose estimation recovers a skeleton of keypoints — head, shoulders, elbows, wrists, hips, knees, ankles — for every person in the frame, and increasingly a full 3-D body mesh. It is correspondence once more, but now between the image and a known articulated model of the human body: instead of "where did this pixel go," the question is "where is each named joint." The standard 2-D target is the COCO 17-keypoint skeleton; hands (21 points each) and face extend it to whole-body pose.
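To make "named joints" concrete, the COCO skeleton can be written down as a small data structure. The 17 keypoint names and their index order below follow the COCO keypoint annotations; the `LIMBS` edge list is only an illustrative choice of which joints to connect when drawing, and the names `COCO_KEYPOINTS`, `LIMBS`, and `LIMB_INDEX` are assumptions of this sketch rather than part of any library.

```python
# COCO 17-keypoint names, in the dataset's index order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Illustrative limb list (an assumption of this sketch): pairs of joint
# names to connect when drawing the skeleton.
LIMBS = [
    ("left_shoulder", "right_shoulder"),
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("left_shoulder", "left_hip"), ("right_shoulder", "right_hip"),
    ("left_hip", "right_hip"),
    ("left_hip", "left_knee"), ("left_knee", "left_ankle"),
    ("right_hip", "right_knee"), ("right_knee", "right_ankle"),
]

# The same limbs as index pairs into a (17, 2) or (17, 3) keypoint array.
LIMB_INDEX = [
    (COCO_KEYPOINTS.index(a), COCO_KEYPOINTS.index(b)) for a, b in LIMBS
]
```

A pose for one person is then just an array with one row per name, typically `(x, y, score)`, and `LIMB_INDEX` tells a drawing routine which rows to join.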

7.10.1 Top-down vs bottom-up

Multi-person pose forces an architectural fork, and it is the one thing worth understanding before you pick a tool (Figure 7.10.1). A top-down method first detects each person with an off-the-shelf person detector, then estimates the keypoints inside each box independently. It is accurate and simple, because each crop holds one well-framed person, but its cost grows linearly with the number of people and it inherits the detector's misses, which are most common in crowds. HRNet (Sun et al. 2019 (HRNet)) is the canonical top-down backbone: it keeps a high-resolution representation throughout the network (rather than downsampling and then upsampling), which is what precise joint localisation needs. A bottom-up method instead finds all keypoints in the whole image in one pass and then groups them into individuals. The network cost is independent of crowd size (only the lightweight grouping step grows with the number of candidate joints), and there is no person detector to fail when people overlap. OpenPose (Cao et al. 2017 (OpenPose)) is the landmark bottom-up system: alongside a heat-map per keypoint it predicts part affinity fields, a 2-D vector field along each limb that says "this elbow belongs to that wrist", turning the grouping into a bipartite matching per limb type. The lineage traces to DeepPose (Toshev & Szegedy 2014 (DeepPose)), the first to regress joint coordinates directly with a deep network.
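Both paradigms usually end in the same small step: turning a heat-map per keypoint into coordinates. A minimal NumPy sketch of the top-down case follows, assuming the network has already produced a `(K, H, W)` array of heat-maps for one person crop. The function names, the plain argmax decoding, and the half-pixel offset are assumptions of this example; production systems generally refine the peak to sub-pixel accuracy.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Take the peak of each (H, W) heat-map.

    heatmaps: (K, H, W) array, one map per keypoint.
    Returns a (K, 3) array of (x, y, score) in heat-map pixels.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)                 # flat index of each peak
    ys, xs = np.unravel_index(idx, (h, w))    # back to (row, col)
    scores = flat[np.arange(k), idx]
    return np.stack([xs, ys, scores], axis=1).astype(np.float64)

def to_image_coords(kps, box, heatmap_size):
    """Map heat-map coordinates from a person crop back to the full image.

    box: (x0, y0, width, height) of the detected person in image pixels.
    heatmap_size: (H, W) of the heat-maps.
    """
    x0, y0, bw, bh = box
    h, w = heatmap_size
    out = kps.copy()
    out[:, 0] = x0 + (kps[:, 0] + 0.5) * bw / w   # +0.5: centre of the cell
    out[:, 1] = y0 + (kps[:, 1] + 0.5) * bh / h
    return out
```

In a top-down pipeline this pair runs once per detected box, which is exactly where the per-person cost comes from.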

**Figure 7.10.1.** 2-D human pose: the COCO skeleton, and the two paradigms. Left: a single person's **17-keypoint** skeleton (joints as dots, limbs as edges). Right: **top-down** (detect each person, then find keypoints in each box; cost grows with the crowd) versus **bottom-up** (find all keypoints, then group them into people via part-affinity links; constant cost, crowd-robust). The keypoints are the correspondence between the image and a known articulated body model.
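The part-affinity grouping can be sketched in a few lines. The idea is to score a candidate limb by sampling the vector field along the segment between two joint candidates and averaging its dot product with the segment direction, then to pair candidates by score. The function names, the nearest-pixel sampling, the threshold, and the greedy pairing are all simplifications assumed for this example; an optimal assignment solver could be substituted for the greedy loop.

```python
import numpy as np

def paf_score(paf, a, b, n_samples=10):
    """Average alignment of a limb's vector field with the segment a -> b.

    paf: (2, H, W) array holding the x and y components of the field.
    a, b: (x, y) candidate joint positions in pixels.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    d = b - a
    length = np.linalg.norm(d)
    if length < 1e-6:
        return 0.0
    u = d / length                                   # unit limb direction
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = a[None, :] + ts[:, None] * d[None, :]      # points along the segment
    xs = np.clip(np.rint(pts[:, 0]).astype(int), 0, paf.shape[2] - 1)
    ys = np.clip(np.rint(pts[:, 1]).astype(int), 0, paf.shape[1] - 1)
    vecs = paf[:, ys, xs]                            # (2, n_samples)
    return float((vecs[0] * u[0] + vecs[1] * u[1]).mean())

def match_limb(paf, starts, ends, min_score=0.5):
    """Greedily pair start joints (e.g. elbows) with end joints (e.g. wrists)."""
    cands = [(paf_score(paf, s, e), i, j)
             for i, s in enumerate(starts) for j, e in enumerate(ends)]
    cands.sort(reverse=True)                         # best-scoring limbs first
    used_i, used_j, pairs = set(), set(), []
    for score, i, j in cands:
        if score < min_score:
            break
        if i in used_i or j in used_j:
            continue                                 # joint already claimed
        pairs.append((i, j))
        used_i.add(i)
        used_j.add(j)
    return pairs
```

Running `match_limb` once per limb type and chaining the resulting pairs through shared joints assembles full skeletons, without ever cropping out a person.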

7.10.2 On-device, real time

For phones and interactive use, BlazePose / MediaPipe Pose (Bazarevsky et al. 2020 (BlazePose)) is the workhorse: a lightweight detector-plus-tracker that returns 33 body keypoints (a superset of COCO with extra hand and foot points) in real time on-device, using the same detect-once-then-track trick as the face pipeline to avoid re-detecting every frame. MediaPipe also ships Hands (21 keypoints per hand) and a Holistic model that fuses face, hands, and body — whole-person tracking from a single camera.
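The detect-once-then-track control flow is simple enough to sketch without any particular library. In the sketch below, `detect` and `estimate` are hypothetical callables standing in for a person detector and a keypoint network, and the confidence threshold and box margin are arbitrary assumptions; this is not MediaPipe's API, only the shape of the loop.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple
import numpy as np

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

def box_from_keypoints(kps: np.ndarray, margin: float = 0.25) -> Box:
    """Bounding box of (N, 2) keypoints, expanded by a relative margin."""
    x0, y0 = kps.min(axis=0)
    x1, y1 = kps.max(axis=0)
    mx, my = margin * (x1 - x0), margin * (y1 - y0)
    return (float(x0 - mx), float(y0 - my), float(x1 + mx), float(y1 + my))

def track_pose(
    frames: Iterable,
    detect: Callable[[object], Optional[Box]],               # hypothetical detector
    estimate: Callable[[object, Box], Tuple[np.ndarray, float]],  # hypothetical keypoint net
    min_score: float = 0.5,
) -> Iterator[Optional[np.ndarray]]:
    """Yield keypoints per frame, running the detector only when needed."""
    roi: Optional[Box] = None
    for frame in frames:
        if roi is None:
            roi = detect(frame)           # expensive path: first frame or after a loss
            if roi is None:
                yield None                # nobody in view
                continue
        kps, score = estimate(frame, roi)
        if score < min_score:
            roi = None                    # tracking lost: re-detect on the next frame
            yield None
            continue
        roi = box_from_keypoints(kps)     # next frame's region comes from this pose
        yield kps
```

As long as the keypoint network stays confident, the detector never runs again, which is where most of the real-time budget is recovered.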

7.10.3 Lifting to 3-D: parametric bodies

Two-dimensional keypoints under-determine a 3-D pose (the perspective divide threw depth away, as ever), so the richest approach is to regress the parameters of a full parametric body model rather than joints alone. SMPL (Loper et al. 2015 (SMPL)) is the standard: a learned model that turns a low-dimensional shape vector (the person's build) and a pose vector (the joint rotations) into a full 3-D mesh, and it is differentiable, so it can be fitted to image evidence by gradient descent or predicted by a network. HMR (Kanazawa et al. 2018 (HMR)) and its many descendants regress SMPL parameters directly from one image, recovering shape and pose end-to-end. This is the basis of markerless motion capture, AR avatars and virtual try-on, sports biomechanics, and fitness coaching.
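A toy version shows what "shape vector plus pose vector in, mesh out" means mechanically. The sketch below is not SMPL: it has a single joint, a fixed joint location, and no pose-dependent corrections, and every name and array in it is an assumption of the example. What it does share with SMPL-style models is the overall recipe of a template mesh, linear shape blend shapes, axis-angle joint rotations, and linear blend skinning.

```python
import numpy as np

def rodrigues(axis_angle):
    """Axis-angle vector (3,) -> rotation matrix (3, 3)."""
    axis_angle = np.asarray(axis_angle, dtype=float)
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def toy_body(beta, theta, template, shape_dirs, joint, weights):
    """One-joint parametric 'body' (toy stand-in, not the real SMPL).

    beta:       (B,)      shape coefficients
    theta:      (3,)      axis-angle rotation of the single joint
    template:   (V, 3)    rest-pose vertices
    shape_dirs: (V, 3, B) per-vertex displacement for each shape coefficient
    joint:      (3,)      joint location (fixed here for simplicity)
    weights:    (V,)      how strongly each vertex follows the joint, in [0, 1]
    """
    beta = np.asarray(beta, dtype=float)
    shaped = template + shape_dirs @ beta          # linear shape blend shapes
    R = rodrigues(theta)
    rotated = (shaped - joint) @ R.T + joint       # rotate about the joint
    w = np.asarray(weights, dtype=float)[:, None]
    return (1.0 - w) * shaped + w * rotated        # linear blend skinning
```

Every operation here is differentiable in `beta` and `theta`, which is the property that lets a network regress the parameters or an optimiser fit them to 2-D keypoints.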

7.10.4 Which library to use (optional)

Again, stand on the shoulders of the toolkits. MediaPipe Pose / Tasks is the easiest path to real-time, on-device, whole-body tracking. OpenPose is the classic multi-person bottom-up system. MMPose is the research-grade framework with the widest model zoo (HRNet, ViTPose, and the rest); Detectron2 covers keypoints through its Keypoint R-CNN models. AlphaPose is a strong, accurate top-down multi-person option. Pick one for your latency/accuracy budget; the interesting work is almost always what you do with the skeleton, not the skeleton itself.
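As one small instance of "what you do with the skeleton", a joint angle falls straight out of three keypoints, whichever library produced them. The helper below is a sketch with a hypothetical name; it works on 2-D or 3-D points, and with 2-D keypoints the angle is measured in the image plane, so it is foreshortened whenever the limb points toward or away from the camera.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b, in degrees, between segments b->a and b->c.

    For an elbow: a = shoulder, b = elbow, c = wrist.
    Returns NaN if either segment has zero length.
    """
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    u, v = a - b, c - b
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom < 1e-12:
        return float("nan")
    cos_angle = np.clip(np.dot(u, v) / denom, -1.0, 1.0)  # guard rounding error
    return float(np.degrees(np.arccos(cos_angle)))
```

Tracked over time, a handful of such angles is already enough for rep counting or a simple range-of-motion check.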

AR	augmented reality
COCO	Common Objects in Context (dataset)
HMR	Human Mesh Recovery
SMPL	Skinned Multi-Person Linear model
