Memory Systems Leaderboard
Ranked by weighted score across 5 real-world multi-session coding scenarios. Each scenario was designed to reflect how software engineers actually work: requirements change, decisions reverse, and the agent must track the current truth across completely fresh context windows.
Metric breakdown
Each metric is scored independently. Hover the metric name for a full description of what was tested and how it was scored.
Scenario breakdown
These are not toy Q&A tests. Each scenario simulates a real software engineering project evolving across multiple sessions. The agent starts every session with a blank context window and must rely entirely on its memory system to reconstruct what it knows.
Web App Architecture Evolution
The agent worked on a full-stack web application across 6 sessions. In session 1, the team chose a monolithic Next.js architecture. By session 3, they switched to a microservices model. In session 5, they partially reverted to a modular monolith for the auth and billing services only. The agent had to track the current architecture for each service independently, not just the most recent global decision.
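To make that requirement concrete, here is a minimal TypeScript sketch of the state the agent effectively has to reconstruct. The names (`ServiceDecision`, `currentArchitecture`) and the exact log entries are illustrative, not part of the benchmark harness.

```ts
// Illustrative only: per-service architecture decisions as a session-ordered log.
type ServiceDecision = {
  session: number;
  service: string; // e.g. "auth", "billing", "catalog"
  architecture: "monolith" | "microservices" | "modular-monolith";
};

const log: ServiceDecision[] = [
  { session: 1, service: "auth",    architecture: "monolith" },
  { session: 1, service: "catalog", architecture: "monolith" },
  { session: 3, service: "auth",    architecture: "microservices" },
  { session: 3, service: "catalog", architecture: "microservices" },
  { session: 5, service: "auth",    architecture: "modular-monolith" }, // partial revert
];

// The current truth is the latest decision per service,
// not the latest decision overall.
function currentArchitecture(service: string): string | undefined {
  return log
    .filter((d) => d.service === service)
    .sort((a, b) => b.session - a.session)[0]?.architecture;
}

currentArchitecture("auth");    // "modular-monolith"
currentArchitecture("catalog"); // "microservices" (untouched by the session-5 revert)
```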
Auth System Migration
The project started with JWT-based authentication. Across 5 sessions, the team migrated to session-cookie auth, then added OAuth for social login, then moved API clients back from session cookies to JWT. The agent had to correctly recall which auth strategy applied to which client type at any given point, and detect when a question about auth referred to a state that had since changed.
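A sketch of the same idea for auth, again with hypothetical names: the strategy is keyed by client type, and keeping the full history is what lets a system flag questions about a superseded state.

```ts
// Illustrative only: auth strategy per client type, session-ordered.
type AuthEvent = { session: number; clientType: "web" | "api"; strategy: string };

const history: AuthEvent[] = [
  { session: 1, clientType: "web", strategy: "jwt" },
  { session: 1, clientType: "api", strategy: "jwt" },
  { session: 2, clientType: "web", strategy: "session-cookie" },
  { session: 2, clientType: "api", strategy: "session-cookie" },
  { session: 4, clientType: "api", strategy: "jwt" }, // API clients reverted
];

function currentStrategy(clientType: "web" | "api"): string | undefined {
  return history.filter((e) => e.clientType === clientType).at(-1)?.strategy;
}

// A question references a stale state if the strategy it mentions
// was once true for that client type but is no longer current.
function isStale(clientType: "web" | "api", strategy: string): boolean {
  return (
    currentStrategy(clientType) !== strategy &&
    history.some((e) => e.clientType === clientType && e.strategy === strategy)
  );
}

currentStrategy("api");           // "jwt"
isStale("api", "session-cookie"); // true: that approach was reverted
```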
Database Schema Evolution
A PostgreSQL schema evolved across 7 sessions. Tables were added, columns were renamed, data types changed, and two tables were merged into one. The agent was asked specific questions mid-migration: 'What is the current column name for user email?' and 'Does the orders table still have a status_code field?' Answering correctly required knowing the exact schema state at each point in time, not just the initial or final version.
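The point-in-time questions map naturally onto an event-sourced view of the schema. A minimal sketch, covering only a few of the change types the scenario uses; all names and entries are hypothetical.

```ts
// Illustrative only: schema changes as an event log,
// replayed up to a session to answer point-in-time questions.
type SchemaEvent =
  | { session: number; kind: "addColumn"; table: string; column: string }
  | { session: number; kind: "renameColumn"; table: string; from: string; to: string }
  | { session: number; kind: "dropColumn"; table: string; column: string };

const events: SchemaEvent[] = [
  { session: 1, kind: "addColumn", table: "users", column: "email" },
  { session: 2, kind: "addColumn", table: "orders", column: "status_code" },
  { session: 4, kind: "renameColumn", table: "users", from: "email", to: "email_address" },
  { session: 6, kind: "dropColumn", table: "orders", column: "status_code" },
];

function columnsAsOf(table: string, session: number): Set<string> {
  const cols = new Set<string>();
  const relevant = events
    .filter((e) => e.table === table && e.session <= session)
    .sort((a, b) => a.session - b.session);
  for (const e of relevant) {
    if (e.kind === "addColumn") cols.add(e.column);
    if (e.kind === "renameColumn") { cols.delete(e.from); cols.add(e.to); }
    if (e.kind === "dropColumn") cols.delete(e.column);
  }
  return cols;
}

columnsAsOf("users", 5);                     // Set { "email_address" }
columnsAsOf("orders", 7).has("status_code"); // false
```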
API Versioning Conflicts
The team maintained three concurrent API versions (v1, v2, v3) across 8 sessions. Breaking changes were introduced in v2 that did not apply to v1. v3 deprecated several v2 endpoints. The agent had to recall version-specific constraints accurately when asked about a specific endpoint, and correctly identify when a question about 'the API' was ambiguous and needed version clarification.
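A sketch of the ambiguity check, with hypothetical endpoints and statuses: a version-free question is only safe to answer when every version agrees.

```ts
// Illustrative only: endpoint status tracked per API version.
type Status = "active" | "deprecated" | "removed";

const endpoints: Record<string, Record<string, Status>> = {
  "/orders":     { v1: "active", v2: "active", v3: "deprecated" },
  "/users/bulk": { v2: "active", v3: "deprecated" },
};

function lookup(endpoint: string, version?: string): Status | "ambiguous" | undefined {
  const byVersion = endpoints[endpoint] ?? {};
  if (version) return byVersion[version];
  // Without a version, answer only when all versions agree.
  const distinct = new Set(Object.values(byVersion));
  return distinct.size === 1 ? [...distinct][0] : "ambiguous";
}

lookup("/orders", "v3"); // "deprecated"
lookup("/orders");       // "ambiguous": needs version clarification
```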
Team Decision Reversal
This scenario simulated real engineering team dynamics. Across 6 sessions, the team made and then reversed four major architectural decisions: the message queue choice (Kafka then RabbitMQ then back to Kafka), the deployment target (AWS ECS then Kubernetes then back to ECS), the ORM library (Prisma then Drizzle), and the caching strategy (Redis then in-memory then Redis again). The agent had to recall the current decision for each area without conflating it with a previous state.
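Reduced to code, this scenario is last-write-wins per decision area; the failure mode being tested is answering from an intermediate state. A sketch using the decisions above (session numbers are illustrative):

```ts
// Illustrative only: folding a session-ordered decision log
// leaves exactly one current choice per area.
const decisions: Array<[session: number, area: string, choice: string]> = [
  [1, "queue", "Kafka"],    [2, "queue", "RabbitMQ"],    [5, "queue", "Kafka"],
  [1, "deploy", "AWS ECS"], [3, "deploy", "Kubernetes"], [6, "deploy", "AWS ECS"],
  [2, "orm", "Prisma"],     [4, "orm", "Drizzle"],
  [1, "cache", "Redis"],    [3, "cache", "in-memory"],   [5, "cache", "Redis"],
];

const current = new Map<string, string>();
for (const [, area, choice] of decisions.sort((a, b) => a[0] - b[0])) {
  current.set(area, choice); // later sessions overwrite earlier ones
}

current.get("queue"); // "Kafka" (not the intermediate RabbitMQ state)
current.get("orm");   // "Drizzle"
```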
Fairness and methodology
- Both systems used the same agent model: Claude Sonnet 4.6.
- Each system was given a custom system-prompt prefix explaining how to use its MCP tools most effectively, so neither had an unfair instruction advantage.
- Scenarios reflect real multi-session software engineering work. Requirements change. Decisions reverse. The agent must track the current truth, not just the most recent text.
- Each scenario was run 20 times per system to account for LLM non-determinism. Scores are averages across all runs.
- Token efficiency is weighted at only 10% to avoid rewarding systems that retrieve nothing or compress memory so aggressively that accuracy suffers.
Scoring weights
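The leaderboard score is a weighted average of the per-metric scores, with each metric itself averaged over the 20 runs. A minimal sketch of that computation follows; only the 10% token-efficiency weight is stated in the methodology, so the other weights below are placeholders, not the benchmark's actual values.

```ts
// Illustrative only: weighted average of per-metric scores in [0, 1].
type Metric = { name: string; weight: number; score: number };

function weightedScore(metrics: Metric[]): number {
  const totalWeight = metrics.reduce((sum, m) => sum + m.weight, 0);
  return metrics.reduce((sum, m) => sum + m.weight * m.score, 0) / totalWeight;
}

weightedScore([
  { name: "accuracy",         weight: 0.45, score: 0.92 }, // placeholder weight
  { name: "state-tracking",   weight: 0.45, score: 0.88 }, // placeholder weight
  { name: "token-efficiency", weight: 0.10, score: 0.60 }, // stated: 10%
]); // 0.87
```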
Why token efficiency has an asterisk
Token efficiency is only a win when accuracy stays high. A system that retrieves nothing uses zero tokens. Memstate's low token score reflects precise, targeted retrieval rather than dumping large context blobs into every prompt.
Source: Open benchmark reports in memstate-ai/memstate-benchmark — see BENCHMARK.md and per-adapter result files. All scenario prompts, scoring scripts, and raw results are public and reproducible.
This benchmark is optimized for AI agent coding workflows where factual recall, evolving decisions, and deterministic context matter most. Freeform document indexing may favor different retrieval patterns, but that is not the primary use case evaluated here.