Commentary

  • Passive context (AGENTS.md) currently outperforms active retrieval (skills)
  • Skills are still useful for vertical, action-specific workflows
  • I think it is fair to say that LLMs are bad at reliably picking tools, skills, or docs. If information is needed, keep it always present in context rather than fetching it through a separate call. The best results in this eval came from removing choices and ambiguity.
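
To make the distinction in these bullets concrete, here is a minimal sketch of the two prompt shapes. It is only an illustration under stated assumptions: the function names, the skill names, and the sample notes are all hypothetical and do not come from the eval or from any real agent framework.

```python
# Hypothetical helpers for illustration; not part of any real framework.

def build_passive_prompt(project_notes: str, task: str) -> str:
    """Passive context: inline the notes so the model never has to fetch them."""
    return f"Project notes:\n{project_notes}\n\nTask:\n{task}"


def build_active_prompt(skill_names: list[str], task: str) -> str:
    """Active retrieval: list skills by name; the model must pick one and request it."""
    listing = "\n".join(f"- {name}" for name in skill_names)
    return f"Available skills (request one if needed):\n{listing}\n\nTask:\n{task}"


# Made-up sample data standing in for an AGENTS.md file and a task.
notes = "Use the v2 router API. Run tests with `make test`."
task = "Add a settings page."

passive = build_passive_prompt(notes, task)
active = build_active_prompt(["router-docs", "testing"], task)
```

In the passive version the notes are already in the prompt, so there is no decision to get wrong. In the active version the model sees only skill names and has to choose correctly before it sees any of the content, which is the step the commentary identifies as unreliable.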