The Challenges of Building a Fair AI Memory Benchmark
Evaluating AI memory systems is notoriously difficult. Static Q&A datasets do not reflect how autonomous coding agents actually work. Here is how we built an open-source, multi-session benchmark that exposed a critical flaw in existing memory tools: slow ingestion.
Why Standard Benchmarks Fail
Most memory benchmarks test simple retrieval: "The user's favorite color is blue. What is the user's favorite color?"
But real-world software engineering is messy. Decisions evolve. In session one, the team might choose a monolithic architecture. In session two, they switch to microservices. In session three, they revert back to a modular monolith. A good memory system must track this non-linear history and return the current state, not just a blend of past text.
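The "return the current state" requirement can be made concrete with a small sketch. This is a hypothetical illustration, not Memstate's actual data model: a per-topic decision log where the latest entry wins but earlier entries survive as provenance.

```python
from dataclasses import dataclass, field

@dataclass
class DecisionLog:
    # topic -> ordered list of decisions; earlier entries are history, not truth
    history: dict = field(default_factory=dict)

    def record(self, topic: str, decision: str) -> None:
        self.history.setdefault(topic, []).append(decision)

    def current(self, topic: str) -> str:
        # Latest revision wins; a naive text blend would mix all three.
        return self.history[topic][-1]

log = DecisionLog()
log.record("architecture", "monolith")
log.record("architecture", "microservices")
log.record("architecture", "modular monolith")
print(log.current("architecture"))  # → modular monolith
```

The point of the sketch is the asymmetry: writes append, but reads resolve to a single current answer rather than retrieving every past statement with equal weight.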
Designing a Fair Test
We wanted to create a level playing field. We gave both Memstate and competing tools (like Mem0) the exact same advantages:
- Custom Prompts: Each tool was allowed a custom system prompt explaining how to use its specific MCP tools effectively.
- State-of-the-Art Models: We used Claude Opus 4.6 and Gemini 3 to ensure the LLM itself was not the bottleneck.
- Real Scenarios: We designed 5 complex coding challenges, including Database Schema Evolution, Auth System Migration, and Team Decision Reversals.
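To make the shape of these scenarios concrete, here is a hypothetical encoding of one multi-session challenge (the structure and field names are illustrative, not the benchmark's actual schema): each session feeds the agent a new development, and the final probe checks that recall reflects the latest state.

```python
# Illustrative scenario structure: sessions evolve a fact over time,
# and the probe is only answerable from the most recent session.
SCENARIO = {
    "name": "Database Schema Evolution",
    "sessions": [
        {"id": 1, "event": "users table uses auto-increment integer IDs"},
        {"id": 2, "event": "migration adds a uuid column alongside the int ID"},
        {"id": 3, "event": "int IDs dropped; uuid is now the primary key"},
    ],
    "probe": "What is the primary key of the users table?",
    "expected": "uuid",
}

assert SCENARIO["sessions"][-1]["event"].startswith("int IDs dropped")
```

A memory system that blends all three sessions will hedge between integer IDs and UUIDs; one that tracks revisions answers cleanly.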

The Surprise Finding: Ingestion Speed Matters
During early runs of the test, we let the coding agents move as fast as they wanted. This exposed a massive vulnerability in tools like Mem0: ingestion latency.
When an agent finishes a task, it sends a summary to the memory system. If the memory system takes 5-10 seconds to process and index that memory, the agent has already moved on to the next task. When it immediately queries the memory system for the next step, the previous facts are missing.
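The race described above is easy to reproduce. Here is a minimal simulation (not any vendor's real code) of a memory backend that indexes writes asynchronously: if the agent queries before background indexing finishes, the fact is simply absent.

```python
import threading
import time

class SlowMemory:
    """Toy backend: add() returns immediately, indexing lags behind."""

    def __init__(self, ingest_latency: float):
        self.ingest_latency = ingest_latency
        self.index = set()

    def add(self, fact: str) -> None:
        def _ingest():
            time.sleep(self.ingest_latency)  # simulated extraction/indexing delay
            self.index.add(fact)
        threading.Thread(target=_ingest, daemon=True).start()

    def recall(self, fact: str) -> bool:
        return fact in self.index

mem = SlowMemory(ingest_latency=0.1)
mem.add("schema v2 uses UUID primary keys")
stale = mem.recall("schema v2 uses UUID primary keys")  # agent moves on instantly
time.sleep(0.5)
fresh = mem.recall("schema v2 uses UUID primary keys")  # visible only after indexing
print(stale, fresh)  # → False True
```

The first recall fails not because the memory was lost, but because the agent outran its own write.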

Memstate AI solves this because our custom-trained fact extraction models are heavily optimized for speed. Memories are ingested and available for recall in milliseconds, ensuring the agent never outpaces its own brain.
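Write-to-recall latency is measurable for any backend. The sketch below shows one way a harness could time it (the `client` interface with `add`/`search` methods is an assumption for illustration; the in-memory `FakeClient` exists only to make the example runnable):

```python
import time

def time_to_visibility(client, fact: str, timeout: float = 30.0) -> float:
    """Write a fact, then poll until it is retrievable; return elapsed seconds."""
    start = time.monotonic()
    client.add(fact)
    while time.monotonic() - start < timeout:
        if fact in client.search(fact):
            return time.monotonic() - start
        time.sleep(0.01)
    raise TimeoutError(f"{fact!r} never became recallable")

class FakeClient:
    # Synchronous stand-in for a real memory tool under test.
    def __init__(self):
        self.facts = []

    def add(self, fact):
        self.facts.append(fact)

    def search(self, query):
        return [f for f in self.facts if query in f]

latency = time_to_visibility(FakeClient(), "auth moved to OAuth2")
# near-zero for this synchronous fake; seconds for a slow async indexer
```

Running the same probe against each tool in the benchmark turns "fast enough for agents" into a number rather than a claim.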
Verify the Results Yourself
We believe in transparency. The entire benchmark suite, including all scenarios, prompts, and evaluation scripts, is open-source.