Benchmark Methodology
Memstate is evaluated against alternative memory systems using public, reproducible, multi-session coding scenarios. It is built for AI agent coding work where facts must stay reliable, structure must stay navigable, and decisions can change quickly across sessions.
Winner: Memstate
Date: 2026-03-12
Model: Claude Sonnet 4.6
Head-to-head metrics
[Per-system benchmark percentages and the scoring-weight table; see the repository's result files.]
Scenario coverage
Each scenario simulates a real software engineering project evolving across multiple sessions. The agent starts every session with a blank context window and must rely entirely on its memory system to reconstruct what it knows.
The agent worked on a full-stack web application across 6 sessions. In session 1, the team chose a monolithic Next.js architecture. By session 3, they switched to a microservices model. In session 5, they partially reverted to a modular monolith for the auth and billing services only. The agent had to track the current architecture for each service independently, not just the most recent global decision.
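To make concrete what the agent must reconstruct, the per-service state after session 5 can be sketched as a last-write-wins decision log. This is an illustrative sketch only, not part of the benchmark harness; the service names and session numbers echo the scenario description above.

```python
# Illustrative sketch: the scenario's architecture decisions as a
# (session, target, choice) log. "*" marks a global decision that
# applies to every service not later overridden individually.
decisions = [
    (1, "*", "monolith-nextjs"),        # session 1: monolithic Next.js
    (3, "*", "microservices"),          # session 3: global switch
    (5, "auth", "modular-monolith"),    # session 5: partial revert
    (5, "billing", "modular-monolith"),
]

def current_architecture(service, log):
    """Return the most recent decision that applies to `service`.

    The log is in session order, so later matching entries win.
    """
    best = None
    for session, target, choice in log:
        if target in ("*", service):
            best = choice
    return best

print(current_architecture("auth", decisions))      # modular-monolith
print(current_architecture("payments", decisions))  # microservices
```

A memory system that only stores "the latest architecture decision" collapses this log to a single value and answers wrong for every service except the last one touched.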
The project started with JWT-based authentication. Across 5 sessions, the team migrated to session-cookie auth, added OAuth for social login, then moved API clients back from session cookies to JWT. The agent had to recall which auth strategy applied to which client type at any given point, and detect when a question about auth referred to a state that had since changed.
A PostgreSQL schema evolved across 7 sessions. Tables were added, columns were renamed, data types changed, and two tables were merged into one. The agent was asked specific questions mid-migration: 'What is the current column name for user email?' and 'Does the orders table still have a status_code field?' Answering correctly required knowing the exact schema state at each point in time, not just the initial or final version.
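The point-in-time questions this scenario asks can be mimicked by replaying a migration log up to a given step. The migrations below are hypothetical and only echo the column names quoted above; they are not the benchmark's actual schema.

```python
# Hypothetical migration log replay; not the benchmark's actual schema.
migrations = [
    ("add_column",    "users",  "email_address"),
    ("rename_column", "users",  "email_address", "email"),
    ("add_column",    "orders", "status_code"),
    ("drop_column",   "orders", "status_code"),
]

def schema_after(log, n):
    """Replay the first n migrations and return {table: set(columns)}."""
    schema = {}
    for op in log[:n]:
        kind, table = op[0], op[1]
        cols = schema.setdefault(table, set())
        if kind == "add_column":
            cols.add(op[2])
        elif kind == "drop_column":
            cols.discard(op[2])
        elif kind == "rename_column":
            cols.discard(op[2])
            cols.add(op[3])
    return schema

# "What is the current column name for user email?" after migration 2:
print(schema_after(migrations, 2)["users"])  # {'email'}
# "Does orders still have status_code?" after migration 4:
print("status_code" in schema_after(migrations, 4)["orders"])  # False
```

Answering correctly requires this kind of replay semantics; retrieving the text of the first or last migration alone gives the wrong answer mid-migration.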
The team maintained three concurrent API versions (v1, v2, v3) across 8 sessions. Breaking changes were introduced in v2 that did not apply to v1. v3 deprecated several v2 endpoints. The agent had to recall version-specific constraints accurately when asked about a specific endpoint, and correctly identify when a question about 'the API' was ambiguous and needed version clarification.
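The version-scoping this scenario tests can be sketched as a per-version endpoint table plus an ambiguity check. The endpoint names and states below are invented for illustration and do not come from the benchmark.

```python
# Hypothetical per-version endpoint states; names are invented.
endpoints = {
    "v1": {"/users": "stable"},
    "v2": {"/users": "breaking-change", "/reports": "stable"},
    "v3": {"/users": "stable", "/reports": "deprecated"},
}

def answer(endpoint, version=None):
    """Answer a question about an endpoint, flagging ambiguity when the
    question names no version and the state differs across versions."""
    if version is not None:
        return endpoints[version].get(endpoint, "absent")
    states = {v: tbl.get(endpoint, "absent") for v, tbl in endpoints.items()}
    if len(set(states.values())) > 1:
        return f"ambiguous across versions: {states}"
    return next(iter(states.values()))

print(answer("/reports", "v3"))  # deprecated
print(answer("/users"))          # ambiguous across versions: {...}
```

The second call models the scenario's hardest requirement: recognizing that "the API" is underspecified rather than silently answering for one version.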
This scenario simulated real engineering team dynamics. Across 6 sessions, the team revisited four major architectural decisions, most of them more than once: the message queue (Kafka, then RabbitMQ, then back to Kafka), the deployment target (AWS ECS, then Kubernetes, then back to ECS), the ORM library (Prisma, then Drizzle), and the caching strategy (Redis, then in-memory, then Redis again). The agent had to recall the current decision in each area without conflating it with a previous state.
Fairness and transparency
Both systems were given identical agent models (Claude Sonnet 4.6) and identical scenario setups.
Each tool was allowed a custom system prompt prefix explaining how to use its MCP tools most effectively. Neither system had an unfair instruction advantage.
Scenarios were designed to reflect real multi-session software engineering work, not toy Q&A pairs. Requirements change. Decisions reverse. The agent must track the current truth, not just the most recent text.
Each scenario was run 20 times per system to account for LLM non-determinism. Scores are averages across all runs.
Token efficiency is included at a lower weight (10%) to avoid rewarding systems that retrieve nothing or compress memory so aggressively that accuracy suffers.
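The averaging and weighting described above can be sketched as follows. Only the 10% token-efficiency weight is stated in the methodology; the 60/30 split between accuracy and continuity below is a placeholder, not the published weighting.

```python
# Composite score sketch. Only the 0.10 token-efficiency weight is
# stated in the methodology; the other weights are placeholders.
WEIGHTS = {"accuracy": 0.60, "continuity": 0.30, "token_efficiency": 0.10}

def composite(run_scores):
    """Average each metric over all runs, then apply the weights."""
    n = len(run_scores)
    means = {m: sum(r[m] for r in run_scores) / n for m in WEIGHTS}
    return sum(WEIGHTS[m] * means[m] for m in means)

# Two of the 20 runs, with per-run benchmark percentages:
runs = [
    {"accuracy": 90.0, "continuity": 88.0, "token_efficiency": 70.0},
    {"accuracy": 92.0, "continuity": 86.0, "token_efficiency": 74.0},
]
print(round(composite(runs), 2))  # 87.9
```

Capping token efficiency at a small weight means a system cannot climb the ranking by retrieving nothing; accuracy and continuity dominate the composite.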
The full benchmark suite, all scenario prompts, scoring scripts, and raw result files are published at github.com/memstate-ai/memstate-benchmark and can be reproduced by anyone.
Token efficiency is marked with an asterisk because lower usage can reflect less useful retrieval. The benchmark prioritizes correctness and continuity for real production workflows.
Mem0 may perform better for broad freeform text indexing in some workloads, but that is not the focus of this benchmark. This evaluation targets what coding agents need: deterministic memory, accurate facts, and robust handling of changing plans.
Track current standings
See how memory systems stack up as more adapters are added.
Open leaderboard

Run and verify everything yourself
The scenarios, scoring, and result generation are fully public in GitHub.
View repository

Source of benchmark numbers: memstate-ai/memstate-benchmark; see BENCHMARK.md and adapter-level result files.