
The Challenges of Building a Fair AI Memory Benchmark

Evaluating AI memory systems is notoriously difficult. Static Q&A datasets do not reflect how autonomous coding agents actually work. Here is how we built an open-source, multi-session benchmark that exposed a critical flaw in existing memory tools: ingestion speed.

March 25, 2026 · Jason

Why Standard Benchmarks Fail

Most memory benchmarks test simple retrieval: "The user's favorite color is blue. What is the user's favorite color?"

But real-world software engineering is messy. Decisions evolve. In session one, the team might choose a monolithic architecture. In session two, they switch to microservices. In session three, they revert back to a modular monolith. A good memory system must track this non-linear history and return the current state, not just a blend of past text.
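The "current state, not a blend" requirement can be illustrated with a minimal sketch (hypothetical names, not Memstate's actual API): a decision log that keeps the full non-linear history but answers queries with only the latest value.

```python
from dataclasses import dataclass, field

@dataclass
class Fact:
    session: int
    value: str

@dataclass
class DecisionLog:
    """Tracks an evolving decision: latest write wins, history is kept."""
    history: list[Fact] = field(default_factory=list)

    def record(self, session: int, value: str) -> None:
        self.history.append(Fact(session, value))

    def current(self) -> str:
        # Answer with the most recent decision, not a blend of past text.
        return self.history[-1].value

arch = DecisionLog()
arch.record(1, "monolith")
arch.record(2, "microservices")
arch.record(3, "modular monolith")

assert arch.current() == "modular monolith"
assert len(arch.history) == 3  # earlier decisions remain queryable
```

A naive embedding search over all three sessions would surface "monolith" and "microservices" with similar scores; keying on recency is what makes the answer correct.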

Designing a Fair Test

We wanted to create a level playing field. We gave both Memstate and competing tools (like Mem0) the exact same advantages:

  • Custom Prompts: Each tool was allowed a custom system prompt explaining how to use its specific MCP tools effectively.
  • State-of-the-Art Models: We used Claude Opus 4.6 and Gemini 3 to ensure the LLM itself was not the bottleneck.
  • Real Scenarios: We designed 5 complex coding challenges, including Database Schema Evolution, Auth System Migration, and Team Decision Reversals.

[Figure: AI memory benchmark results — Memstate 84.4%, Mem0 20.4%]
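A scenario in a suite like this can be expressed as plain data: an ordered list of sessions plus the answer the memory system should give afterwards. The structure below is illustrative only; the open-source suite's actual schema may differ.

```python
# Hypothetical scenario definition for a multi-session benchmark.
scenario = {
    "name": "Team Decision Reversal",
    "sessions": [
        "Session 1: team adopts a monolithic architecture.",
        "Session 2: team migrates to microservices.",
        "Session 3: team reverts to a modular monolith.",
    ],
    # The correct answer is the final state, not any intermediate one.
    "question": "What architecture is the team using now?",
    "expected": "modular monolith",
}

def grade(answer: str, expected: str) -> bool:
    """Crude containment check; real suites typically use an LLM judge."""
    return expected.lower() in answer.lower()

assert grade("They settled on a modular monolith.", scenario["expected"])
assert not grade("They use microservices.", scenario["expected"])
```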

The Surprise Finding: Ingestion Speed Matters

During early test runs, we let the coding agents move at full speed. This exposed a major vulnerability in tools like Mem0: ingestion latency.

When an agent finishes a task, it sends a summary to the memory system. If the memory system takes 5-10 seconds to process and index that memory, the agent has already moved on to the next task. When it immediately queries the memory system for the next step, the previous facts are missing.
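This read-after-write gap is easy to simulate. In the toy model below (timings are illustrative, not measured from any real tool), `save()` returns immediately while indexing happens in the background; if indexing takes longer than the agent's pause between tasks, the query misses the fact the agent just saved.

```python
import threading
import time

class ToyMemory:
    """Toy memory store whose indexing lags behind writes."""
    def __init__(self, index_delay: float):
        self.index_delay = index_delay
        self.indexed: set[str] = set()

    def save(self, fact: str) -> None:
        # save() acks immediately; indexing completes index_delay seconds later.
        threading.Timer(self.index_delay, self.indexed.add, [fact]).start()

    def recall(self, fact: str) -> bool:
        return fact in self.indexed

fast = ToyMemory(index_delay=0.01)   # milliseconds-scale ingestion
slow = ToyMemory(index_delay=0.5)    # seconds-scale ingestion

for mem in (fast, slow):
    mem.save("schema uses UUID primary keys")

time.sleep(0.1)  # the agent moves on, then queries ~100 ms later

assert fast.recall("schema uses UUID primary keys")       # indexed in time
assert not slow.recall("schema uses UUID primary keys")   # still ingesting
```

The agent querying the slow store gets nothing back, and without the fact it just wrote, its most likely failure mode is to guess.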

[Figure: Memstate's fast ingestion vs. Mem0's lagging ingestion]
Slow ingestion causes agents to hallucinate because the facts they just saved are not available yet.

Memstate AI solves this because our custom-trained fact extraction models are heavily optimized for speed. Memories are ingested and available for recall in milliseconds, ensuring the agent never outpaces its own brain.
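When a memory backend cannot guarantee millisecond availability, agent harnesses sometimes fall back to a read-your-writes barrier: poll until the saved fact is actually retrievable before starting the next task. This is a generic workaround pattern (hypothetical helper, not part of any tool's API), and it trades agent throughput for consistency.

```python
import time

def wait_until_visible(recall, fact: str,
                       timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll a recall callable until the fact is retrievable or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if recall(fact):
            return True
        time.sleep(interval)
    return False

# Usage with any store exposing membership-style recall:
store = set()
store.add("auth migrated to OAuth2")

assert wait_until_visible(lambda f: f in store, "auth migrated to OAuth2")
assert not wait_until_visible(lambda f: f in store, "missing fact", timeout=0.3)
```

A fast-ingestion backend makes this barrier a no-op; a slow one forces the agent to stall for however long indexing takes.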

Verify the Results Yourself

We believe in transparency. The entire benchmark suite, including all scenarios, prompts, and evaluation scripts, is open-source.

View Benchmark Repository

Build with reliable memory

Get the memory system that actually keeps up with your AI agents.