Memory Systems Leaderboard
Ranked by weighted score across 5 real-world multi-session coding scenarios. Each scenario was designed to reflect how software engineers actually work: requirements change, decisions reverse, and the agent must track the current truth across completely fresh context windows.
Metric breakdown
Each metric is scored independently. Hover the metric name for a full description of what was tested and how it was scored.
Scenario breakdown
These are not toy Q&A tests. Each scenario simulates a real software engineering project evolving across multiple sessions. The agent starts every session with a blank context window and must rely entirely on its memory system to reconstruct what it knows.
Web App Architecture Evolution
The agent worked on a full-stack web application across 6 sessions. In session 1, the team chose a monolithic Next.js architecture. By session 3, they switched to a microservices model. In session 5, they partially reverted to a modular monolith for the auth and billing services only. The agent had to track the current architecture for each service independently, not just the most recent global decision.
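To make that requirement concrete, here is a minimal TypeScript sketch of the state the agent effectively has to reconstruct. The names (`ServiceDecision`, `currentArchitecture`) and the exact log entries are illustrative, not part of the benchmark harness.

```ts
// Illustrative only: per-service architecture decisions as a session-ordered log.
type ServiceDecision = {
  session: number;
  service: string; // e.g. "auth", "billing", "catalog"
  architecture: "monolith" | "microservices" | "modular-monolith";
};

const log: ServiceDecision[] = [
  { session: 1, service: "auth",    architecture: "monolith" },
  { session: 1, service: "catalog", architecture: "monolith" },
  { session: 3, service: "auth",    architecture: "microservices" },
  { session: 3, service: "catalog", architecture: "microservices" },
  { session: 5, service: "auth",    architecture: "modular-monolith" }, // partial revert
];

// The current truth is the latest decision per service,
// not the latest decision overall.
function currentArchitecture(service: string): string | undefined {
  return log
    .filter((d) => d.service === service)
    .sort((a, b) => b.session - a.session)[0]?.architecture;
}

currentArchitecture("auth");    // "modular-monolith"
currentArchitecture("catalog"); // "microservices" (untouched by the session-5 revert)
```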
Auth System Migration
The project started with JWT-based authentication. Across 5 sessions, the team migrated to session-cookie auth, then added OAuth for social login, then moved API clients back from session cookies to JWT. The agent had to correctly recall which auth strategy applied to which client type at any given point, and detect when a question about auth referred to a state that had since changed.
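A sketch of the same idea for auth, again with hypothetical names: the strategy is keyed by client type, and keeping the full history is what lets a system flag questions about a superseded state.

```ts
// Illustrative only: auth strategy per client type, session-ordered.
type AuthEvent = { session: number; clientType: "web" | "api"; strategy: string };

const history: AuthEvent[] = [
  { session: 1, clientType: "web", strategy: "jwt" },
  { session: 1, clientType: "api", strategy: "jwt" },
  { session: 2, clientType: "web", strategy: "session-cookie" },
  { session: 2, clientType: "api", strategy: "session-cookie" },
  { session: 4, clientType: "api", strategy: "jwt" }, // API clients reverted
];

function currentStrategy(clientType: "web" | "api"): string | undefined {
  return history.filter((e) => e.clientType === clientType).at(-1)?.strategy;
}

// A question references a stale state if the strategy it mentions
// was once true for that client type but is no longer current.
function isStale(clientType: "web" | "api", strategy: string): boolean {
  return (
    currentStrategy(clientType) !== strategy &&
    history.some((e) => e.clientType === clientType && e.strategy === strategy)
  );
}

currentStrategy("api");           // "jwt"
isStale("api", "session-cookie"); // true: that approach was reverted
```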
Database Schema Evolution
A PostgreSQL schema evolved across 7 sessions. Tables were added, columns were renamed, data types changed, and two tables were merged into one. The agent was asked specific questions mid-migration: 'What is the current column name for user email?' and 'Does the orders table still have a status_code field?' Answering correctly required knowing the exact schema state at each point in time, not just the initial or final version.
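The point-in-time questions map naturally onto an event-sourced view of the schema. A minimal sketch, covering only a few of the change types the scenario uses; all names and entries are hypothetical.

```ts
// Illustrative only: schema changes as an event log,
// replayed up to a session to answer point-in-time questions.
type SchemaEvent =
  | { session: number; kind: "addColumn"; table: string; column: string }
  | { session: number; kind: "renameColumn"; table: string; from: string; to: string }
  | { session: number; kind: "dropColumn"; table: string; column: string };

const events: SchemaEvent[] = [
  { session: 1, kind: "addColumn", table: "users", column: "email" },
  { session: 2, kind: "addColumn", table: "orders", column: "status_code" },
  { session: 4, kind: "renameColumn", table: "users", from: "email", to: "email_address" },
  { session: 6, kind: "dropColumn", table: "orders", column: "status_code" },
];

function columnsAsOf(table: string, session: number): Set<string> {
  const cols = new Set<string>();
  const relevant = events
    .filter((e) => e.table === table && e.session <= session)
    .sort((a, b) => a.session - b.session);
  for (const e of relevant) {
    if (e.kind === "addColumn") cols.add(e.column);
    if (e.kind === "renameColumn") { cols.delete(e.from); cols.add(e.to); }
    if (e.kind === "dropColumn") cols.delete(e.column);
  }
  return cols;
}

columnsAsOf("users", 5);                     // Set { "email_address" }
columnsAsOf("orders", 7).has("status_code"); // false
```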
API Versioning Conflicts
The team maintained three concurrent API versions (v1, v2, v3) across 8 sessions. Breaking changes were introduced in v2 that did not apply to v1. v3 deprecated several v2 endpoints. The agent had to recall version-specific constraints accurately when asked about a specific endpoint, and correctly identify when a question about 'the API' was ambiguous and needed version clarification.
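A sketch of the ambiguity check, with hypothetical endpoints and statuses: a version-free question is only safe to answer when every version agrees.

```ts
// Illustrative only: endpoint status tracked per API version.
type Status = "active" | "deprecated" | "removed";

const endpoints: Record<string, Record<string, Status>> = {
  "/orders":     { v1: "active", v2: "active", v3: "deprecated" },
  "/users/bulk": { v2: "active", v3: "deprecated" },
};

function lookup(endpoint: string, version?: string): Status | "ambiguous" | undefined {
  const byVersion = endpoints[endpoint] ?? {};
  if (version) return byVersion[version];
  // Without a version, answer only when all versions agree.
  const distinct = new Set(Object.values(byVersion));
  return distinct.size === 1 ? [...distinct][0] : "ambiguous";
}

lookup("/orders", "v3"); // "deprecated"
lookup("/orders");       // "ambiguous": needs version clarification
```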
Team Decision Reversal
This scenario simulated real engineering team dynamics. Across 6 sessions, the team made and then reversed four major architectural decisions: the message queue choice (Kafka then RabbitMQ then back to Kafka), the deployment target (AWS ECS then Kubernetes then back to ECS), the ORM library (Prisma then Drizzle), and the caching strategy (Redis then in-memory then Redis again). The agent had to recall the current decision for each area without conflating it with a previous state.
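Reduced to code, this scenario is last-write-wins per decision area; the failure mode being tested is answering from an intermediate state. A sketch using the decisions above (session numbers are illustrative):

```ts
// Illustrative only: folding a session-ordered decision log
// leaves exactly one current choice per area.
const decisions: Array<[session: number, area: string, choice: string]> = [
  [1, "queue", "Kafka"],    [2, "queue", "RabbitMQ"],    [5, "queue", "Kafka"],
  [1, "deploy", "AWS ECS"], [3, "deploy", "Kubernetes"], [6, "deploy", "AWS ECS"],
  [2, "orm", "Prisma"],     [4, "orm", "Drizzle"],
  [1, "cache", "Redis"],    [3, "cache", "in-memory"],   [5, "cache", "Redis"],
];

const current = new Map<string, string>();
for (const [, area, choice] of decisions.sort((a, b) => a[0] - b[0])) {
  current.set(area, choice); // later sessions overwrite earlier ones
}

current.get("queue"); // "Kafka" (not the intermediate RabbitMQ state)
current.get("orm");   // "Drizzle"
```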
Fairness and methodology
- Both systems used the same agent model: Claude Sonnet 4.6.
- Each system was given a custom system-prompt prefix explaining how to use its MCP tools most effectively, so neither had an unfair instruction advantage.
- Scenarios reflect real multi-session software engineering work. Requirements change. Decisions reverse. The agent must track the current truth, not just the most recent text.
- Each scenario was run 20 times per system to account for LLM non-determinism. Scores are averages across all runs.
- Token efficiency is weighted at only 10% to avoid rewarding systems that retrieve nothing or compress memory so aggressively that accuracy suffers.
Scoring weights
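The leaderboard score is a weighted average of the per-metric scores, with each metric itself averaged over the 20 runs. A minimal sketch of that computation follows; only the 10% token-efficiency weight is stated in the methodology, so the other weights below are placeholders, not the benchmark's actual values.

```ts
// Illustrative only: weighted average of per-metric scores in [0, 1].
type Metric = { name: string; weight: number; score: number };

function weightedScore(metrics: Metric[]): number {
  const totalWeight = metrics.reduce((sum, m) => sum + m.weight, 0);
  return metrics.reduce((sum, m) => sum + m.weight * m.score, 0) / totalWeight;
}

weightedScore([
  { name: "accuracy",         weight: 0.45, score: 0.92 }, // placeholder weight
  { name: "state-tracking",   weight: 0.45, score: 0.88 }, // placeholder weight
  { name: "token-efficiency", weight: 0.10, score: 0.60 }, // stated: 10%
]); // 0.87
```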
Why token efficiency has an asterisk
Token efficiency is only a win when accuracy stays high. A system that retrieves nothing uses zero tokens. Memstate's low token score reflects precise, targeted retrieval rather than dumping large context blobs into every prompt.
Source: Open benchmark reports in memstate-ai/memstate-benchmark — see BENCHMARK.md and per-adapter result files. All scenario prompts, scoring scripts, and raw results are public and reproducible.
This benchmark is optimized for AI agent coding workflows where factual recall, evolving decisions, and deterministic context matter most. Freeform document indexing may favor different retrieval patterns, but that is not the primary use case evaluated here.