Discussion about this post

ToxSec

“The agent stack is not the LLM stack. A chatbot needs inference and maybe RAG. An agent needs state management across multi-step execution, tool access governed by protocols, memory that persists across sessions, autonomous reasoning loops, and guardrails that constrain behavior in real time.”

I really like this post; you do a great job of explaining all of this while clearing up some really common misconceptions. This was a great read!
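To make the quoted stack difference concrete, here is a minimal sketch of such an agent loop in Python. Every name in it (call_llm, TOOLS, the JSON action format, the step budget) is a hypothetical placeholder for illustration, not the post's architecture or any provider's actual API:

```python
# Hypothetical sketch: the pieces the quote names that a chatbot doesn't need.
import json

def call_llm(messages):
    """Placeholder for any chat-completion API; returns the model's reply text."""
    raise NotImplementedError("wire up a provider SDK here")

# Tool access: an explicit registry the loop may call on the model's behalf.
TOOLS = {"search": lambda query: f"results for {query!r}"}

MAX_STEPS = 10               # guardrail: hard cap on autonomous iterations
ALLOWED_TOOLS = {"search"}   # guardrail: allow-list enforced at run time

def run_agent(task, memory):
    """One multi-step episode. `memory` is the part that persists across
    sessions (e.g. rows in a database), unlike the per-episode `state`."""
    state = {"task": task, "steps": []}   # state management across the loop
    messages = [
        {"role": "system", "content": (
            "You are an agent. Reply with JSON: "
            '{"tool": name, "args": {...}} or {"final": answer}.')},
        {"role": "user", "content": f"Task: {task}\nMemory: {memory}"},
    ]
    for _ in range(MAX_STEPS):            # autonomous reasoning loop
        action = json.loads(call_llm(messages))
        if "final" in action:
            memory.append({"task": task, "result": action["final"]})  # persist
            return action["final"]
        if action.get("tool") not in ALLOWED_TOOLS:   # real-time guardrail
            messages.append({"role": "user", "content": "Tool not allowed."})
            continue
        result = TOOLS[action["tool"]](**action["args"])
        state["steps"].append((action, result))      # multi-step execution state
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    raise RuntimeError("step budget exhausted")
```

A chatbot needs only the call_llm line; everything else in the loop is the agent-specific surface area the quote is describing.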

Pawel Jozefiak

The 37-point eval gap is the most honest number in the whole piece: 89% built observability, 52% built formal evals. That's teams that can see the logs but have no idea whether it's working.

But that gap is downstream of a harder problem: most teams haven't defined success at the task level. You can't build evals without a spec. The tooling exists. The bottleneck is articulating what "done" looks like.
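As a concrete illustration of "you can't build evals without a spec", here is a minimal sketch of a task-level eval in Python. The example task, checks, and pass threshold are invented assumptions, not anything from the post:

```python
# Hedged sketch: "done" written down as machine-checkable assertions per task.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    task: str                              # the input the agent receives
    checks: list[Callable[[str], bool]]    # the spec: what "done" means here

CASES = [
    EvalCase(
        task="Summarize the incident report and propose a next step",
        checks=[
            lambda out: "next step" in out.lower(),   # must propose an action
            lambda out: len(out.split()) <= 200,      # must stay concise
        ],
    ),
]

def run_eval(agent: Callable[[str], str], threshold: float = 0.9) -> bool:
    """Pass iff the fraction of fully satisfied cases meets the threshold."""
    passed = sum(all(c(agent(case.task)) for c in case.checks) for case in CASES)
    return passed / len(CASES) >= threshold
```

The hard part is exactly what the comment says: writing the lambdas, not running the harness.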

I've been running a self-improving agent for months (nightshift planning, dayshift execution, feedback loops). The stack converged fast: provider SDK, Postgres, MCP. Done in weeks. What took months was boundary design: where human judgment ends, what failure modes actually matter, what counts as a good output. The infrastructure is a solved problem.

The hard part is still human.

