Voxli
The multi-turn failures that prompt evals can't see
Most agent failures we see in pilots don't show up on prompt evals.
Voxli
The failed Tool Call when Simulating a Customer Conversation Across Three LLMs
Recently, to assess AI Agent performance with tool calls, we executed the same multi-turn conversation across the three tiers of OpenAI's GPT-5.4: standard, mini, and nano.
Mahey Qadir
The Risks of Agent Speculation
It’s no surprise that hallucinations are a common known failure during agentic AI testing. The agent starts to overpromise, begins to fabricate answers and even claims that it…
Voxli