7.10 Body pose estimation
Move from the face to the whole body and the problem becomes articulated: a person is a kinematic chain of joints, and pose estimation recovers a skeleton of keypoints — head, shoulders, elbows, wrists, hips, knees, ankles — for every person in the frame, and increasingly a full 3-D body mesh. It is correspondence once more, but now between the image and a known articulated model of the human body: instead of "where did this pixel go," the question is "where is each named joint." The standard 2-D target is the COCO 17-keypoint skeleton; hands (21 points each) and face extend it to whole-body pose.
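To make the "named joint" idea concrete, here is a minimal sketch of the COCO 17-keypoint layout as a plain Python data structure. The keypoint names and their order follow the COCO convention; the `COCO_LIMBS` edge list and the `make_pose` helper are illustrative assumptions (one common way to draw the skeleton), not part of any particular library.

```python
# The 17 COCO keypoints, in the dataset's standard index order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Limbs as (start, end) name pairs. This edge list is an assumed drawing
# convention for illustration; it is not part of the keypoint definition.
COCO_LIMBS = [
    ("left_shoulder", "right_shoulder"), ("left_hip", "right_hip"),
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("left_shoulder", "left_hip"), ("right_shoulder", "right_hip"),
    ("left_hip", "left_knee"), ("left_knee", "left_ankle"),
    ("right_hip", "right_knee"), ("right_knee", "right_ankle"),
]


def make_pose(xyc):
    """Map a list of 17 (x, y, confidence) triples to a name-keyed dict."""
    if len(xyc) != len(COCO_KEYPOINTS):
        raise ValueError("expected 17 keypoints")
    return dict(zip(COCO_KEYPOINTS, xyc))
```

A pose estimator's raw output is typically an array in this index order, so a name-keyed view like `make_pose(...)["left_wrist"]` keeps downstream code readable.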
7.10.1 Top-down vs bottom-up
Multi-person pose forces an architectural fork, and it is the one thing worth understanding before you pick a tool (Figure 7.10.1). A top-down method first detects each person (with an off-the-shelf person detector), then estimates the keypoints inside each box independently. It is accurate and simple, because each crop holds one well-framed person, but its cost grows with the number of people and it inherits the detector's misses in crowds. HRNet (Sun et al. 2019 (HRNet)) is the canonical top-down backbone: it keeps a high-resolution representation throughout the network (rather than downsampling and then upsampling), which is what precise joint localisation needs. A bottom-up method instead finds all keypoints in the whole image at once and then groups them into individuals. The network pass costs the same regardless of crowd size (only the comparatively cheap grouping step grows with the number of people), and it stays robust when people overlap. OpenPose (Cao et al. 2017 (OpenPose)) is the landmark bottom-up system: alongside a heat-map per keypoint it predicts part affinity fields, a vector field along each limb that says "this elbow belongs to that wrist", turning the grouping into a clean bipartite matching for each limb type. The lineage traces to DeepPose (Toshev & Szegedy 2014 (DeepPose)), the first to regress joint coordinates directly with a deep network.
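The part-affinity idea can be sketched in a few lines. The code below is an illustrative simplification, not OpenPose's implementation: it scores a candidate limb by averaging the dot product between the field and the limb direction at sample points along the segment, then pairs joints with a greedy best-first pass that stands in for a proper bipartite assignment. The function names, the nested-list field layout, and the `min_score` threshold are all assumptions made for the example.

```python
import math


def paf_score(paf_x, paf_y, a, b, samples=10):
    """Average alignment between the field and the segment from joint a to b.

    paf_x, paf_y: 2-D lists indexed [row][col] with the field's x/y components.
    a, b: (x, y) pixel coordinates of two candidate joints.
    samples: number of points sampled along the segment (at least 2).
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        col = int(round(a[0] + t * dx))
        row = int(round(a[1] + t * dy))
        total += paf_x[row][col] * ux + paf_y[row][col] * uy
    return total / samples


def match_limbs(paf_x, paf_y, starts, ends, min_score=0.5):
    """Greedy one-to-one pairing of start joints with end joints.

    Candidate pairs are visited from highest to lowest score; a pair is kept
    only if neither joint has been used yet. Returns (start_idx, end_idx) pairs.
    """
    scored = sorted(
        ((paf_score(paf_x, paf_y, s, e), i, j)
         for i, s in enumerate(starts) for j, e in enumerate(ends)),
        reverse=True,
    )
    used_starts, used_ends, pairs = set(), set(), []
    for score, i, j in scored:
        if score < min_score:
            break  # remaining candidates are all weaker
        if i not in used_starts and j not in used_ends:
            pairs.append((i, j))
            used_starts.add(i)
            used_ends.add(j)
    return pairs
```

With two people side by side, an elbow and wrist belonging to the same forearm score close to 1 because the field points along the segment joining them, while a cross-person pairing cuts across regions where the field is zero or misaligned and scores low.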
7.10.2 On-device, real time
For phones and interactive use, BlazePose / MediaPipe Pose (Bazarevsky et al. 2020 (BlazePose)) is the workhorse: a lightweight detector-plus-tracker that returns 33 body keypoints (a superset of COCO with extra face, hand, and foot points) in real time on-device. It uses the same detect-once-then-track trick as the face pipeline: the person detector runs only on the first frame or after tracking is lost, and otherwise the previous frame's pose supplies the crop for the next. MediaPipe also ships Hands (21 keypoints per hand) and a Holistic model that fuses face, hands, and body for whole-person tracking from a single camera.
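The detect-once-then-track control flow can be sketched independently of any particular model. In the sketch below, `detect_person` and `estimate_pose` are hypothetical callables standing in for the detector and the landmark network, and `min_confidence` is an assumed threshold; none of these names come from MediaPipe's API.

```python
def run_pose_pipeline(frames, detect_person, estimate_pose, min_confidence=0.5):
    """Yield one pose (or None) per frame, detecting only when tracking is lost.

    detect_person(frame) -> region of interest, or None if nobody is found.
    estimate_pose(frame, roi) -> (keypoints, confidence, next_roi).
    Both callables are hypothetical stand-ins for real models.
    """
    roi = None
    for frame in frames:
        if roi is None:
            roi = detect_person(frame)  # expensive step, run only on demand
            if roi is None:
                yield None  # nobody in view this frame
                continue
        keypoints, confidence, next_roi = estimate_pose(frame, roi)
        if confidence < min_confidence:
            roi = None  # tracking lost: re-detect on the next frame
            yield None
        else:
            roi = next_roi  # the pose itself predicts where to look next
            yield keypoints
```

The saving comes from the fact that, while tracking holds, only the landmark network runs per frame; the detector is paid for once at start-up and again only after a failure.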
7.10.3 Lifting to 3-D: parametric bodies
Two-dimensional keypoints under-determine a 3-D pose (the perspective divide threw depth away, as ever), so the richest representation regresses a full parametric body model. SMPL (Loper et al. 2015 (SMPL)) is the standard: a learned model that turns a low-dimensional shape vector (the person's build) and pose vector (the joint angles) into a full 3-D mesh, differentiable and easy to fit. HMR (Kanazawa et al. 2018 (HMR)) and its many descendants regress SMPL parameters directly from one image, recovering shape and pose end-to-end. This is the basis of markerless motion capture, AR avatars and virtual try-on, sports biomechanics, and fitness coaching.
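The depth ambiguity is easy to demonstrate with a pinhole projection. This is a minimal sketch: the focal length and principal point are made-up intrinsics, and the wrist coordinates are arbitrary illustrative values in the camera frame.

```python
def project(point, focal_length=1000.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame 3-D point (X, Y, Z) to pixels (u, v).

    The default intrinsics are assumed values chosen for illustration.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    # The division by z is the perspective divide: it discards absolute depth.
    return (focal_length * x / z + cx, focal_length * y / z + cy)


wrist_near = (0.2, -0.1, 2.0)                    # metres, 2 m from the camera
wrist_far = tuple(2.5 * c for c in wrist_near)   # same viewing ray, 5 m away

print(project(wrist_near))
print(project(wrist_far))
```

Both points land on the same pixel (up to floating-point rounding), so a 2-D keypoint alone cannot say which one was observed. A parametric body model narrows the answer by only admitting poses and shapes that a human body can actually take.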
7.10.4 Which library to use (optional)
Again, stand on the shoulders of the toolkits. MediaPipe Pose / Tasks is the easiest path to real-time, on-device, whole-body tracking. OpenPose is the classic multi-person bottom-up system. MMPose is the research-grade framework with the widest model zoo (HRNet, ViTPose, and the rest), Detectron2 offers Keypoint R-CNN baselines inside a general detection framework, and AlphaPose is a strong, accurate multi-person option. Pick one for your latency/accuracy budget; the interesting work is almost always what you do with the skeleton, not the skeleton itself.
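As one small example of doing something with the skeleton, the sketch below computes the angle at a joint from three 2-D keypoints, the kind of quantity a fitness or biomechanics application might track. The `joint_angle` helper is hypothetical and independent of any of the libraries above; it assumes keypoints arrive as (x, y) pairs in image coordinates.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by keypoints a-b-c.

    Example: a = shoulder, b = elbow, c = wrist gives the elbow angle.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("coincident keypoints: angle undefined")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))
```

Because this works on 2-D image coordinates, the angle is foreshortened whenever the limb points toward or away from the camera; a 3-D pose avoids that at the cost of a heavier model.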