Ruairidh Wynne-McCorry
I'm Ruairidh (/roo-ree/), a senior engineer who's spent the last few years building product at scale. I write about the things I'm actually working on: LLM evaluation, systems design, and whatever rabbit hole I've gone down that week. Code over commentary.
Writing
I tested 5 frontier LLMs on a long agent task. Only Sonnet 4.6 broke.
Sonnet 4.6 silently dropped a one-line format rule on 80% of long agent runs. The fix turned out to be one extra sentence in the initial user message. Tested across Opus 4.7, Sonnet 4.6, Haiku 4.5, GPT-4.1, and GPT-4o: 270 trajectories, ~$40 of API spend, and four of the five models perfect end-to-end.
17 min read
Compressing Prompts with an Autoresearch Loop
I tested 5 production system prompts to find out what's safe to cut, what's load-bearing, and why the popular "compress by 75%" claims fall apart under scrutiny.
11 min read