Computational Photography, an AI-powered Slopendium — 20 Performance engineering and Halide

▸Part 20 PERFORMANCE ENGINEERING AND HALIDE⧉

▸20.1 8-bit and fixed-point arithmetic⧉

• how about fp16???

▸20.2 Algorithmic speedups⧉

• separable

• recursive filters

• **integral images / summed-area tables** — precompute a running 2-D prefix sum of the image once; afterwards **any** axis-aligned box sum is just **four array lookups** ($S_{D}-S_{B}-S_{C}+S_{A}$), independent of the box size, so a box filter is **O(1) per pixel** at any radius (and multi-scale features cost the same). First introduced in graphics as the **summed-area table** (Crow 1984) for mip-free texture filtering, then **re-invented as the "integral image"** by **Viola & Jones (2001)** to sum Haar-like features fast enough for real-time face detection — a clean example of the same trick rediscovered across fields. (Watch precision: the running sum grows large, so it wants wide integer/float accumulators.)

▸20.3 Performance engineering and scheduling (Halide)⧉

• Modern machines

• challenges: parallelism, locality/cache behavior

• solutions: slice, fuse

• **Why low-level matters — the language speed differential**: when you write the **low-level pixel code yourself** (e.g. **your own convolution**), the speed gaps are dramatic — in Fredo's experience roughly **~1000× from Python to optimized C++**, and **at least another ~10× from C++ to Halide**, won back through scheduling (tiling, vectorization/SIMD, multithreading, producer-consumer fusion). This is exactly **why** Python is right for prototyping, glue, and deep learning, but **C++/Halide earn their place for the inner loops**. See [[Appendices#Programming: Python, C++, and PyTorch]] and [[Intro#What language for computational photography?]].

• **historical example — Photoshop 1.0 (1990), the split done by hand**: the discipline predates the tools. Thomas Knoll wrote **Photoshop 1.0** ≈75 % in **Pascal** but dropped into **68000 assembly** for the speed-critical inner loops — *productive language for the bulk, hand-tuned machine code for the hot pixels* (≈128,000 lines total; source released by the Computer History Museum, 2013). That **manual** algorithm-vs-hand-optimization split is exactly what **Halide** automates with its algorithm/schedule separation — the same idea, now mechanized and retargetable. (Full tidbits in the language sidebar, [[Intro#What language for computational photography?]].)

• Halide

• **the scheduling primitives — `compute_at` and the locality/recompute trade-off**: a Halide schedule is mostly choosing, for each stage, **where it is computed relative to its consumer** — `compute_root` (compute the whole stage once, store it all), `compute_at` (compute just the slice needed inside the consumer's loop, trading **recomputation** for **locality / less memory traffic**), and `store_at`/`fold` in between. This producer–consumer granularity *is* the core of the algorithm/schedule split. [📺 **embed Fredo's `compute_at` explainer** → https://www.youtube.com/watch?v=ViFfigvV418 — see [[Video Resources]]]

• **Benchmarking and the development loop** (the part people skip): you do **not** get fast code by *one-shotting* it — not by hand, and not by asking an LLM to write it in one go. Fast code comes from a **loop**: change the code → **verify correctness** → **measure** performance → repeat. Wire an LLM into exactly that loop (give it a correctness check and a benchmark whose output it can read) and it will converge to fast *and* correct code; the bottleneck is that the model needs some **benchmarking hygiene** trained in, or it does naïve things — e.g. running several benchmarks **in parallel**, so they fight over cores and cache and every number is garbage.

• **a little statistics is mandatory.** A trap worth naming: run *one* benchmark several times, see that the runtimes are **stable**, and conclude your measurements are reliable — then run a **thousand different** benchmarks in one big batch and read the **outliers** as real performance regressions. Low variance *within* a single benchmark tells you nothing about the **tail across a thousand** of them: with that many trials, extreme outliers are *expected from noise alone*, so without accounting for the multiple comparisons (and for shared-machine contention) you will chase regressions that aren't there.

▸20.4 Hardware backends: GPU, NPU, DSP⧉

*(Parked here for now; the real-world ways imaging code is made fast on devices, beyond hand-written C++/Halide.)*

• **GPU compute — OpenCL / Vulkan for image processing**: the per-pixel work (filters, pyramids, warps, blending, demosaicking) is **embarrassingly parallel**, so it maps naturally to the GPU. **OpenCL** kernels and **Vulkan compute shaders** are the cross-platform ways to run it on mobile and desktop GPUs; **Halide can target these backends**, so the same algorithm retargets from CPU SIMD to GPU compute. The GPU is the workhorse for **real-time** image processing (camera preview, video).

• **On-device ML framework (e.g. LiteRT)**: trained imaging models — denoise, super-resolution, segmentation/portrait, semantic ISP, HDR fusion — increasingly run **locally on the device** for latency, privacy, and offline use. **LiteRT** (formerly TensorFlow Lite), Core ML, and ONNX Runtime Mobile load **quantized (int8/fp16)** models and **delegate** the heavy layers to the GPU or NPU. Cross-ref [[Deep learning]] (the models) and [[On-device, mobile and real-time considerations]] if present.

• **NPU acceleration for imaging models**: dedicated **neural processing units** (Apple **Neural Engine**, Qualcomm **Hexagon** AI, Google **Tensor** TPU) run CNN/transformer inference at **far higher throughput per watt** than CPU/GPU — the engine behind Night mode, learned demosaicking/denoising, face & scene understanding in the live pipeline. The scheduling problem shifts from cache tiling to **operator fusion, quantization, and keeping the NPU fed**.

• **DSP programming for real-time 3A control**: the **3A** loop — **auto-exposure, auto-focus, auto-white-balance** — needs per-frame scene statistics and control at video rate with minimal power, so it typically runs on a low-power **DSP** (e.g. Hexagon) inside the camera subsystem/ISP rather than the application CPU. Real-time, deadline-driven, and tightly coupled to the sensor — a different optimization regime from batch image processing.

20.4.1 Karima’s search⧉