Live Benchmark Standings

Memory Systems Leaderboard

Ranked by weighted score across 5 real-world multi-session coding scenarios. Each scenario was designed to reflect how software engineers actually work: requirements change, decisions reverse, and the agent must track the current truth across completely fresh context windows.

Agent model: Claude Sonnet 4.6 · 20 runs per scenario per system · Both systems given custom prompt prefixes
Overall standings

Scores shown as percentages. Higher is better except where noted.

Rank  System    Overall
#1    Memstate  84.4%
#2    Mem0      20.4%

Metric breakdown

Each metric is scored independently. Full descriptions of what each metric tests and how it is scored are available in the benchmark repository.

Metric                     Memstate   Mem0
Overall Score              84.4%      20.4%
Fact Recall Accuracy       92.2%      17.5%
Conflict Detection         95.0%      20.2%
Cross-Session Continuity   88.7%      17.2%
Token Efficiency*          16.2%      40.0%

Token efficiency is only a win when accuracy stays high. A system that retrieves nothing uses zero tokens. Memstate's low token score reflects precise, targeted retrieval rather than dumping large context blobs into every prompt.

Scenario breakdown

These are not toy Q&A tests. Each scenario simulates a real software engineering project evolving across multiple sessions. The agent starts every session with a blank context window and must rely entirely on its memory system to reconstruct what it knows.

Web App Architecture Evolution

The agent worked on a full-stack web application across 6 sessions. In session 1, the team chose a monolithic Next.js architecture. By session 3, they switched to a microservices model. In session 5, they partially reverted to a modular monolith for the auth and billing services only. The agent had to track the current architecture for each service independently, not just the most recent global decision.

Difficulty: Hard · Memstate 85.8% · Mem0 70.6% · Gap: +15.2% in favor of Memstate

Auth System Migration

The project started with JWT-based authentication. Across 5 sessions, the team migrated to session-cookie auth, then added OAuth for social login, then reverted the session-cookie approach for API clients back to JWT. The agent had to correctly recall which auth strategy applied to which client type at any given point, and detect when a question about auth referred to a state that had since changed.

Difficulty: Very Hard · Memstate 85.1% · Mem0 9.0% · Gap: +76.0% in favor of Memstate

Database Schema Evolution

A PostgreSQL schema evolved across 7 sessions. Tables were added, columns were renamed, data types changed, and two tables were merged into one. The agent was asked specific questions mid-migration: 'What is the current column name for user email?' and 'Does the orders table still have a status_code field?' Answering correctly required knowing the exact schema state at each point in time, not just the initial or final version.

Difficulty: Very Hard · Memstate 81.0% · Mem0 10.3% · Gap: +70.8% in favor of Memstate
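Answering a mid-migration question like "what is the current column name for user email?" amounts to replaying the recorded schema changes up to the present. The sketch below is a hypothetical illustration of that replay (the migration format and helper names are invented for this example, and are not part of either system under test):

```python
# Hypothetical migration log, recorded across sessions.
# Each entry: ("add_column", table, column) or
#             ("rename_column", table, old_name, new_name)
migrations = [
    ("add_column", "users", "email"),
    ("rename_column", "users", "email", "email_address"),
]

def current_columns(migrations, table):
    """Replay the migration log to recover the table's current columns."""
    cols = set()
    for op in migrations:
        if op[0] == "add_column" and op[1] == table:
            cols.add(op[2])
        elif op[0] == "rename_column" and op[1] == table and op[2] in cols:
            # The old name no longer exists; only the new name is current.
            cols.discard(op[2])
            cols.add(op[3])
    return cols

print(current_columns(migrations, "users"))  # {'email_address'}
```

A memory system that stores only the session-1 fact ("the users table has an email column") answers this question wrong; the scenario rewards systems that supersede stale facts as the schema evolves.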

API Versioning Conflicts

The team maintained three concurrent API versions (v1, v2, v3) across 8 sessions. Breaking changes were introduced in v2 that did not apply to v1. v3 deprecated several v2 endpoints. The agent had to recall version-specific constraints accurately when asked about a specific endpoint, and correctly identify when a question about 'the API' was ambiguous and needed version clarification.

Difficulty: Extreme · Memstate 85.0% · Mem0 4.1% · Gap: +80.9% in favor of Memstate

Team Decision Reversal

This scenario simulated real engineering team dynamics. Across 6 sessions, the team made and then reversed four major architectural decisions: the message queue choice (Kafka then RabbitMQ then back to Kafka), the deployment target (AWS ECS then Kubernetes then back to ECS), the ORM library (Prisma then Drizzle), and the caching strategy (Redis then in-memory then Redis again). The agent had to recall the current decision for each area without conflating it with a previous state.

Difficulty: Extreme · Memstate 85.2% · Mem0 7.8% · Gap: +77.4% in favor of Memstate
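The behavior the reversal scenarios test can be illustrated with a small sketch: a per-key decision log where the latest entry supersedes earlier ones, so a reversal never leaks stale state. This is a hypothetical illustration of the required behavior, not the Memstate implementation:

```python
class DecisionLog:
    """Toy decision tracker: the newest entry for a key is the current truth."""

    def __init__(self):
        self._history = {}  # key -> list of (session, value), in order

    def record(self, key, value, session):
        self._history.setdefault(key, []).append((session, value))

    def current(self, key):
        # Latest entry wins: a reversal back to Kafka supersedes RabbitMQ.
        return self._history[key][-1][1]

    def was_ever(self, key, value):
        # Useful for conflict detection: did an earlier session say otherwise?
        return any(v == value for _, v in self._history.get(key, []))

log = DecisionLog()
log.record("message_queue", "Kafka", session=1)
log.record("message_queue", "RabbitMQ", session=3)
log.record("message_queue", "Kafka", session=5)
print(log.current("message_queue"))  # Kafka
```

A system that retrieves memories by similarity alone can surface the session-3 RabbitMQ decision alongside the session-5 reversal with no signal about which is current; the scenario scores whether the agent reliably reports the latest state for each decision area.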

Fairness and methodology

  • Both systems used the same agent model: Claude Sonnet 4.6.
  • Each tool was given a custom system prompt prefix explaining how to use its MCP tools most effectively. Neither system had an unfair instruction advantage.
  • Scenarios reflect real multi-session software engineering work. Requirements change. Decisions reverse. The agent must track the current truth, not just the most recent text.
  • Each scenario was run 20 times per system to account for LLM non-determinism. Scores are averages across all runs.
  • Token efficiency is weighted at only 10% to avoid rewarding systems that retrieve nothing or compress memory so aggressively that accuracy suffers.

Scoring weights

Accuracy (fact recall): 40%
Conflict Detection:     25%
Context Continuity:     25%
Token Efficiency:       10%
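The overall score is a weighted sum of the four metric scores, and can be sanity-checked against the published per-metric numbers. A minimal sketch using the weights above (not the benchmark's actual scoring script):

```python
# Scoring weights as published in the methodology.
WEIGHTS = {
    "fact_recall": 0.40,
    "conflict_detection": 0.25,
    "context_continuity": 0.25,
    "token_efficiency": 0.10,
}

def overall(scores):
    """Weighted sum of per-metric scores (all expressed as percentages)."""
    return sum(WEIGHTS[metric] * scores[metric] for metric in WEIGHTS)

memstate = {"fact_recall": 92.2, "conflict_detection": 95.0,
            "context_continuity": 88.7, "token_efficiency": 16.2}
mem0 = {"fact_recall": 17.5, "conflict_detection": 20.2,
        "context_continuity": 17.2, "token_efficiency": 40.0}

print(round(overall(memstate), 1))  # matches the 84.4% leaderboard figure
print(round(overall(mem0), 1))      # matches the 20.4% leaderboard figure
```

Plugging in the per-metric scores from the table above reproduces both overall figures within rounding.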

Source: Open benchmark reports in memstate-ai/memstate-benchmark — see BENCHMARK.md and per-adapter result files. All scenario prompts, scoring scripts, and raw results are public and reproducible.

This benchmark is optimized for AI agent coding workflows where factual recall, evolving decisions, and deterministic context matter most. Freeform document indexing may favor different retrieval patterns, but that is not the primary use case evaluated here.