Automatic Prompt Engineer (APE). Instead of hand-crafting prompts through trial and error, let AI generate and test dozens of candidates to find the best one.
Writing good prompts is surprisingly hard. Small changes in wording can swing accuracy by 10% or more. Most people find good prompts through intuition and trial-and-error — testing a few variations, picking what seems to work. But what if you could test fifty variations systematically?
APE treats prompt discovery as a search problem. It asks AI to generate many candidate instructions for a task, tests each one against real examples, and picks the winner. This automated search famously discovered a chain-of-thought prompt that outperformed the best human-designed version — strong evidence that AI can be better at writing prompts than the people who design the models themselves.
This composition builds on:

- Ask a Better Question
- Show by Example

APE automates the insight that prompt quality matters enormously. It uses examples to guide candidate generation and systematic evaluation to find the best phrasing.
Task: Find the best prompt for solving math word problems.

Human-designed prompt: "Let's think step by step."
78.7% accuracy on math benchmarks.

APE-discovered prompt: "Let's work this out in a step by step way to be sure we have the right answer."
82.0% accuracy, a 3.3-point improvement.
The subtle difference in phrasing — adding "to be sure we have the right answer" — was something no human had thought to try, but the automated search found it.
Humans can test maybe 5–10 prompt variations before running out of ideas or patience. APE tests 50–100+ candidates systematically. The search covers phrasing variations that humans wouldn't think to try, and the evaluation is objective — measured accuracy, not subjective judgment.
It works because prompt sensitivity is real: tiny wording changes cause big performance swings. The only way to navigate this landscape reliably is to search it broadly and measure rigorously. APE does both.
Generate many candidate prompts. Test each on real examples. Pick the winner. Let AI discover phrasings that humans would never try — and that actually work better.
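That generate–test–select loop can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `mock_answer` is a deterministic stand-in for a real model call (it rewards more careful prompts by construction), and a real APE run would sample candidates from an LLM and score them on a held-out set.

```python
from typing import Callable


def ape_search(
    candidates: list[str],
    examples: list[tuple[str, str]],
    answer_fn: Callable[[str, str], str],
) -> tuple[str, float]:
    """Score every candidate prompt on labeled examples; return the winner."""
    best_prompt, best_acc = "", -1.0
    for prompt in candidates:
        correct = sum(answer_fn(prompt, q) == gold for q, gold in examples)
        acc = correct / len(examples)
        if acc > best_acc:  # keep the highest-scoring candidate
            best_prompt, best_acc = prompt, acc
    return best_prompt, best_acc


def mock_answer(prompt: str, question: str) -> str:
    """Toy stand-in for an LLM call: careful prompts get better answers."""
    a, b = (int(t) for t in question.split("+"))
    if "to be sure" in prompt:
        return str(a + b)  # most careful prompt: always correct
    if "step by step" in prompt:
        # somewhat careful: slips on larger sums
        return str(a + b) if a + b <= 10 else str(a + b + 1)
    return str(a + b + 1)  # sloppy prompt: always off by one


examples = [("2+3", "5"), ("10+7", "17"), ("4+4", "8")]
candidates = [
    "Answer the question.",
    "Let's think step by step.",
    "Let's work this out in a step by step way to be sure we have the right answer.",
]

best, acc = ape_search(candidates, examples, mock_answer)
print(best, acc)  # the "to be sure" prompt wins with accuracy 1.0
```

The only APE-specific machinery is `ape_search`: swap `mock_answer` for a real model call and `candidates` for LLM-generated paraphrases of a seed instruction, and the same loop performs the actual search.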
APE is a foundational idea behind DSPy, which takes prompt optimization much further — automatically compiling entire prompt pipelines, not just single instructions. Think of APE as the simple, powerful core that DSPy builds a full framework around.
It's also related to Directional Stimulus Prompting, which uses a small model to generate hints that steer a large model. Both are about optimizing what you feed to the model, but APE optimizes the instruction while Directional Stimulus optimizes the hints given alongside the question.