Enterprise AI Agents Implode: IBM's Brutal Breakdown of 310 Epic Fails
Gemini-3-Flash: 2.6 failures per trace. GPT-OSS-120B: 5.3. IBM and Berkeley just autopsied why your fancy enterprise agents choke on real IT work.
⚡ Key Takeaways
- Frontier models fail cleanly (2.6 modes/trace); open ones cascade wildly (5.3).
- Incorrect Verification (FM-3.3) predicts most flops—agents lie about winning.
- Fixes: External checks, state machines, forced clarifications. No self-grading.
🧠 What's your take on this?
Cast your vote and see what theAIcatchup readers think
Worth sharing?
Get the best AI stories of the week in your inbox — no noise, no spam.
Originally reported by Hugging Face Blog