
7.10 Body pose estimation

Move from the face to the whole body and the problem becomes articulated: a person is a kinematic chain of joints, and pose estimation recovers a skeleton of keypoints — head, shoulders, elbows, wrists, hips, knees, ankles — for every person in the frame, and increasingly a full 3-D body mesh. It is correspondence once more, but now between the image and a known articulated model of the human body: instead of "where did this pixel go," the question is "where is each named joint." The standard 2-D target is the COCO 17-keypoint skeleton; hands (21 points each) and face extend it to whole-body pose.
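As a concrete picture of what a 2-D pose is as data, the sketch below lists the 17 COCO keypoint names in their standard index order and stores one person's pose as a mapping from joint name to position and confidence. The `Keypoint` class, the `LIMBS` edge list (a subset chosen for drawing, not the official COCO edge list) and the `limb_segments` helper are assumptions made for illustration.

```python
from dataclasses import dataclass

# The 17 COCO keypoint names, in the dataset's standard index order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Illustrative limb edges for drawing (an assumption of this sketch).
LIMBS = [
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("left_hip", "left_knee"), ("left_knee", "left_ankle"),
    ("right_hip", "right_knee"), ("right_knee", "right_ankle"),
    ("left_shoulder", "right_shoulder"), ("left_hip", "right_hip"),
    ("left_shoulder", "left_hip"), ("right_shoulder", "right_hip"),
]


@dataclass
class Keypoint:
    x: float      # pixel column
    y: float      # pixel row
    score: float  # confidence in [0, 1]


def limb_segments(pose, min_score=0.3):
    """Return ((x1, y1), (x2, y2)) segments for limbs whose two joints are confident.

    pose: dict mapping a joint name from COCO_KEYPOINTS to a Keypoint.
    """
    segments = []
    for a, b in LIMBS:
        ka, kb = pose.get(a), pose.get(b)
        if ka is None or kb is None:
            continue  # joint not predicted at all
        if ka.score >= min_score and kb.score >= min_score:
            segments.append(((ka.x, ka.y), (kb.x, kb.y)))
    return segments
```

Keying the pose by joint name rather than by bare index keeps downstream code readable when a different skeleton (for example a whole-body one) is swapped in.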

7.10.1 Top-down vs bottom-up

Multi-person pose forces an architectural fork, and it is the one thing worth understanding before you pick a tool (Figure 7.10.1). A top-down method first detects each person with an off-the-shelf person detector, then estimates the keypoints inside each box independently. This is accurate and simple, because each crop holds one well-framed person, but its cost grows with the number of people and it inherits the detector's misses in crowds. HRNet (Sun et al. 2019 (HRNet)) is the canonical top-down backbone: it keeps a high-resolution representation throughout the network (rather than downsampling and then upsampling), which is what precise joint localisation needs. A bottom-up method instead finds all keypoints in the whole image in a single pass and then groups them into individuals. The network cost is roughly constant regardless of crowd size (only the lightweight grouping step grows with the number of people), and the approach tends to hold up better when people overlap. OpenPose (Cao et al. 2017 (OpenPose)) is the landmark bottom-up system: alongside a heat-map per keypoint it predicts part affinity fields, a 2-D vector field along each limb that says "this elbow belongs to that wrist", which turns the grouping into a bipartite matching between candidate joints. The lineage traces to DeepPose (Toshev & Szegedy 2014 (DeepPose)), the first to regress joint coordinates directly with a deep network.
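The part-affinity idea can be sketched in a few lines. The score of a candidate elbow-wrist pair is the average alignment between the predicted vector field and the straight segment joining the two joints, and pairs are then assigned one-to-one. This is a minimal illustration, not OpenPose's implementation: the function names, the nearest-pixel sampling, the `min_score` threshold and the greedy assignment (standing in for an optimal bipartite matcher) are all assumptions.

```python
import math


def paf_score(paf_x, paf_y, a, b, num_samples=10):
    """Average alignment between a part affinity field and the segment a -> b.

    paf_x, paf_y: 2-D lists indexed [row][col] holding the field's x and y components.
    a, b: (x, y) pixel coordinates of two candidate joints.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)
        col = int(round(a[0] + t * dx))  # nearest-pixel sampling
        row = int(round(a[1] + t * dy))
        total += paf_x[row][col] * ux + paf_y[row][col] * uy
    return total / num_samples


def match_limbs(elbows, wrists, paf_x, paf_y, min_score=0.5):
    """Greedy one-to-one assignment of elbows to wrists, best PAF score first."""
    candidates = []
    for i, elbow in enumerate(elbows):
        for j, wrist in enumerate(wrists):
            score = paf_score(paf_x, paf_y, elbow, wrist)
            if score >= min_score:
                candidates.append((score, i, j))
    candidates.sort(reverse=True)
    used_elbows, used_wrists, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_elbows and j not in used_wrists:
            pairs.append((i, j))
            used_elbows.add(i)
            used_wrists.add(j)
    return pairs
```

A segment that cuts across two different people's forearms passes through pixels where the field is zero or points elsewhere, so its average score drops below that of the true pairing.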

Figure 7.10.1. 2-D human pose: the COCO skeleton, and the two paradigms. Left: a single person's 17-keypoint skeleton (joints as dots, limbs as edges). Right: top-down (detect each person, then find keypoints in each box — cost grows with the crowd) versus bottom-up (find all keypoints, then group them into people via part-affinity links — constant cost, crowd-robust). The keypoints are the correspondence between the image and a known articulated body model.
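The top-down half of the figure reduces to a simple decoding step: for each detected box, take the peak of each keypoint heat-map predicted on the crop and map it back to full-image pixels. The sketch below assumes heat-maps arrive as plain 2-D lists and boxes as `(x0, y0, width, height)`; both conventions and the function names are illustrative, and real systems usually add sub-pixel refinement.

```python
def decode_heatmap(heatmap, box):
    """Map the peak of one keypoint heat-map back to full-image coordinates.

    heatmap: 2-D list [row][col] of scores predicted for one person crop.
    box: (x0, y0, width, height) of that person's detection in the full image.
    Returns (x, y, score) in full-image pixels.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    best_score, best_row, best_col = max(
        (heatmap[r][c], r, c) for r in range(rows) for c in range(cols)
    )
    x0, y0, width, height = box
    # Use cell centres so the estimate is not biased toward the top-left corner.
    x = x0 + (best_col + 0.5) * width / cols
    y = y0 + (best_row + 0.5) * height / rows
    return x, y, best_score


def top_down_pose(boxes, heatmaps_per_box):
    """Decode every keypoint of every detected person.

    heatmaps_per_box[i] is the list of heat-maps (one per keypoint) for boxes[i],
    so the work done here scales with the number of detected people.
    """
    return [
        [decode_heatmap(heatmap, box) for heatmap in heatmaps]
        for box, heatmaps in zip(boxes, heatmaps_per_box)
    ]
```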

7.10.2 On-device, real time

For phones and interactive use, BlazePose / MediaPipe Pose (Bazarevsky et al. 2020 (BlazePose)) is the workhorse: a lightweight detector-plus-tracker that returns 33 body keypoints (a superset of COCO with extra face, hand and foot points) in real time on-device. It uses the same detect-once-then-track trick as the face pipeline: the person detector runs on the first frame, later frames derive the region of interest from the previous frame's keypoints, and detection is repeated only when tracking is lost. MediaPipe also ships Hands (21 keypoints per hand) and a Holistic model that fuses face, hands, and body for whole-person tracking from a single camera.
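The detect-once-then-track control flow can be sketched independently of any particular model. In the code below, `detect_person` and `estimate_keypoints` are hypothetical callables standing in for the detector and the landmark network. Deriving the next region of interest from an expanded bounding box of the previous keypoints is a simplifying assumption, not the scheme MediaPipe itself uses.

```python
def bounding_box(keypoints):
    """Tight (x0, y0, x1, y1) box around a list of (x, y, score) keypoints."""
    xs = [k[0] for k in keypoints]
    ys = [k[1] for k in keypoints]
    return min(xs), min(ys), max(xs), max(ys)


def expand_box(box, margin):
    """Grow a box by a fraction of its size on every side."""
    x0, y0, x1, y1 = box
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
    return x0 - dx, y0 - dy, x1 + dx, y1 + dy


def track_pose(frames, detect_person, estimate_keypoints,
               min_score=0.5, margin=0.25):
    """Yield keypoints per frame, running the detector only when tracking is lost.

    detect_person(frame) -> (x0, y0, x1, y1) or None        (hypothetical)
    estimate_keypoints(frame, box) -> [(x, y, score), ...]  (hypothetical)
    """
    roi = None
    for frame in frames:
        if roi is None:
            roi = detect_person(frame)  # the expensive step
            if roi is None:
                yield None              # nobody in view
                continue
        keypoints = estimate_keypoints(frame, roi)
        mean_score = sum(k[2] for k in keypoints) / len(keypoints)
        if mean_score < min_score:
            roi = None                  # tracking lost: re-detect on the next frame
            yield None
            continue
        # Reuse this frame's pose to predict where to look in the next frame.
        roi = expand_box(bounding_box(keypoints), margin)
        yield keypoints
```

On a steady video the detector runs once and every later frame costs only one landmark-network pass, which is where the real-time budget comes from.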

7.10.3 Lifting to 3-D: parametric bodies

Two-dimensional keypoints under-determine a 3-D pose (the perspective divide threw depth away, as ever), so the richest representation regresses a full parametric body model. SMPL (Loper et al. 2015 (SMPL)) is the standard: a learned model that turns a low-dimensional shape vector (the person's build) and pose vector (the joint angles) into a full 3-D mesh, differentiable and easy to fit. HMR (Kanazawa et al. 2018 (HMR)) and its many descendants regress SMPL parameters directly from one image, recovering shape and pose end-to-end. This is the basis of markerless motion capture, AR avatars and virtual try-on, sports biomechanics, and fitness coaching.
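To make the shape/pose split concrete, here is a deliberately tiny stand-in: a planar three-bone arm whose joint positions are a function of a shape vector (per-bone scale) and a pose vector (relative joint angles). It is a toy under stated assumptions, not SMPL. The real model produces a full 3-D mesh through learned blend shapes and skinning, and the bone lengths and function name below are hypothetical.

```python
import math

# Template bone lengths in metres for a toy arm (hypothetical values).
TEMPLATE_BONES = [0.30, 0.25, 0.08]  # upper arm, forearm, hand


def toy_body(shape, pose, root=(0.0, 0.0)):
    """Map (shape, pose) to joint positions for a planar kinematic chain.

    shape: one scale factor per bone (the person's build).
    pose: one relative joint angle per bone, in radians (the joint angles).
    Returns the list of joint positions, starting with the root.
    """
    x, y = root
    heading = 0.0
    joints = [(x, y)]
    for length, scale, angle in zip(TEMPLATE_BONES, shape, pose):
        heading += angle  # rotations accumulate down the chain
        x += scale * length * math.cos(heading)
        y += scale * length * math.sin(heading)
        joints.append((x, y))
    return joints
```

Because the mapping is a composition of smooth operations, fitting it to observed 2-D keypoints is an ordinary optimisation over a handful of parameters. The same property is what makes a parametric body model practical to fit or to regress with a network.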

7.10.4 Which library to use (optional)

Again, stand on the shoulders of the toolkits. MediaPipe Pose / Tasks is the easiest path to real-time, on-device, whole-body tracking. OpenPose is the classic multi-person bottom-up system. MMPose is the research-grade framework with the widest model zoo (HRNet, ViTPose, and the rest), Detectron2 provides Keypoint R-CNN baselines alongside its detectors, and AlphaPose is a strong, accurate multi-person option. Pick one for your latency/accuracy budget; the interesting work is almost always what you do with the skeleton, not the skeleton itself.