• Modern machines
• challenges: parallelism, locality/cache behavior
• solutions: slice, fuse
• **Why low-level matters (the language speed differential)**: when you write the **low-level pixel code yourself** (e.g. **your own convolution**), the speed gaps are dramatic. In Fredo's experience they are roughly **1000× from Python to optimized C++**, and **at least another 10× from C++ to Halide**, won back through scheduling (tiling, vectorization/SIMD, multithreading, producer-consumer fusion). This is exactly **why** Python is right for prototyping, glue, and deep learning, but **C++/Halide earn their place for the inner loops**. See [[Appendices#Programming: Python, C++, and PyTorch]] and [[Intro#What language for computational photography?]].
• **historical example — Photoshop 1.0 (1990), the split done by hand**: the discipline predates the tools. Thomas Knoll wrote **Photoshop 1.0** ≈75 % in **Pascal** but dropped into **68000 assembly** for the speed-critical inner loops — *productive language for the bulk, hand-tuned machine code for the hot pixels* (≈128,000 lines total; source released by the Computer History Museum, 2013). That **manual** algorithm-vs-hand-optimization split is exactly what **Halide** automates with its algorithm/schedule separation — the same idea, now mechanized and retargetable. (Full tidbits in the language sidebar, [[Intro#What language for computational photography?]].)
• Halide
• **the scheduling primitives (`compute_at` and the locality/recompute trade-off)**: a Halide schedule is mostly choosing, for each stage, **where it is computed relative to its consumer**. The options are `compute_root` (compute the whole stage once, store it all), `compute_at` (compute just the slice needed inside the consumer's loop, trading **recomputation** for **locality / less memory traffic**), and, in between, `store_at` with `fold_storage` (allocate the buffer at an outer loop level so values are reused across iterations, optionally folded into a small circular buffer). This producer–consumer granularity *is* the core of the algorithm/schedule split. [📺 **embed Fredo's `compute_at` explainer** → https://www.youtube.com/watch?v=ViFfigvV418; see [[Video Resources]]]
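The trade-off above can be sketched in plain Python, without Halide. This is only a toy model under stated assumptions: a 1D two-stage pipeline of 3-tap sums, with hypothetical helper names (`schedule_root`, `schedule_at_tile`) that are not Halide API. It counts how often the producer stage is evaluated and how large the intermediate buffer gets under each choice.

```python
def producer(inp, i):
    """Stage 1: 3-tap sum over the input, reads inp[i..i+2]."""
    return inp[i] + inp[i + 1] + inp[i + 2]


def consumer(buf, j):
    """Stage 2: 3-tap sum over stage-1 values, reads buf[j..j+2]."""
    return buf[j] + buf[j + 1] + buf[j + 2]


def schedule_root(inp, n_out):
    """Analogue of compute_root: materialize all of stage 1, then run stage 2."""
    n_mid = n_out + 2  # stage 2 needs two extra stage-1 values at the end
    mid = [producer(inp, i) for i in range(n_mid)]
    out = [consumer(mid, j) for j in range(n_out)]
    return out, {"producer_evals": n_mid, "peak_storage": n_mid}


def schedule_at_tile(inp, n_out, tile):
    """Analogue of compute_at a tile loop: per tile, compute only the slice needed."""
    out, evals, peak = [], 0, 0
    for start in range(0, n_out, tile):
        width = min(tile, n_out - start)
        # Small local buffer: the tile plus a 2-element halo, recomputed per tile.
        local = [producer(inp, start + k) for k in range(width + 2)]
        evals += len(local)
        peak = max(peak, len(local))
        out.extend(consumer(local, k) for k in range(width))
    return out, {"producer_evals": evals, "peak_storage": peak}


n_out = 64
inp = list(range(n_out + 4))  # two stacked 3-tap stages need 4 extra inputs

out_root, stats_root = schedule_root(inp, n_out)
out_tiled, stats_tiled = schedule_at_tile(inp, n_out, tile=8)

assert out_root == out_tiled  # the schedule changes cost, never the result
print("root :", stats_root)
print("tiled:", stats_tiled)
```

With these sizes the tiled version evaluates the producer 80 times instead of 66 (the halo is recomputed at every tile boundary) but never holds more than 10 intermediate values instead of 66. Shrinking the tile pushes further toward recomputation and locality; growing it pushes back toward `compute_root`.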
• **Benchmarking and the development loop** (the part people skip): you do **not** get fast code by *one-shotting* it — not by hand, and not by asking an LLM to write it in one go. Fast code comes from a **loop**: change the code → **verify correctness** → **measure** performance → repeat. Wire an LLM into exactly that loop (give it a correctness check and a benchmark whose output it can read) and it will converge to fast *and* correct code; the bottleneck is that the model needs some **benchmarking hygiene** trained in, or it does naïve things — e.g. running several benchmarks **in parallel**, so they fight over cores and cache and every number is garbage.
• **a little statistics is mandatory.** A trap worth naming: run *one* benchmark several times, see that the runtimes are **stable**, and conclude your measurements are reliable — then run a **thousand different** benchmarks in one big batch and read the **outliers** as real performance regressions. Low variance *within* a single benchmark tells you nothing about the **tail across a thousand** of them: with that many trials, extreme outliers are *expected from noise alone*, so without accounting for the multiple comparisons (and for shared-machine contention) you will chase regressions that aren't there.
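Both points above can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than a recommended harness: the helper names (`check_and_time`, `simulate_suite`), the 2% Gaussian noise model, the 5% "regression" cutoff, and the Bonferroni-style correction chosen as one simple way to account for multiple comparisons.

```python
import random
import statistics
import time


def check_and_time(candidate, reference, arg, repeats=5):
    """Verify first, then measure: a wrong candidate is never timed."""
    if candidate(arg) != reference(arg):
        raise AssertionError("candidate is wrong; its speed is irrelevant")
    timings = []
    for _ in range(repeats):  # sequential runs, never in parallel
        t0 = time.perf_counter()
        candidate(arg)
        timings.append(time.perf_counter() - t0)
    return min(timings)  # the minimum is least affected by one-off interference


def simulate_suite(n_benchmarks, sigma, rng):
    """After/before runtime ratios for benchmarks whose true speed did NOT change."""
    ratios = []
    for _ in range(n_benchmarks):
        before = 1.0 + rng.gauss(0.0, sigma)  # assumed noise model
        after = 1.0 + rng.gauss(0.0, sigma)
        ratios.append(after / before)
    return ratios


rng = random.Random(0)
sigma = 0.02  # each measurement is stable to about 2%
ratios = simulate_suite(1000, sigma, rng)

# Naive reading: anything more than 5% slower is a "regression".
naive_cut = 1.05
naive_flags = sum(r > naive_cut for r in ratios)

# Bonferroni-style cutoff: spread a 5% false-alarm budget over all benchmarks.
alpha = 0.05
z = statistics.NormalDist().inv_cdf(1.0 - alpha / len(ratios))
ratio_sigma = sigma * 2 ** 0.5  # approximate std of a ratio of two noisy runs
strict_cut = 1.0 + z * ratio_sigma
strict_flags = sum(r > strict_cut for r in ratios)

print(f"naive cutoff  {naive_cut:.3f}: {naive_flags} false 'regressions'")
print(f"strict cutoff {strict_cut:.3f}: {strict_flags} false 'regressions'")

# The verify-then-measure step on a trivial, hypothetical workload.
data = list(range(1000, 0, -1))
best = check_and_time(sorted, lambda xs: sorted(xs), data)
print(f"best of 5 runs: {best:.6f} s")
```

Every flagged benchmark in the simulation is a false alarm by construction, since nothing changed between "before" and "after". The corrected cutoff lands near 11% in this model, which is why a batch of a thousand benchmarks needs a much wider margin than the stability of any single one suggests.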