Benchmark Methodology
Memstate is evaluated against alternative memory systems using public, reproducible, multi-session coding scenarios. It is built for AI agent coding work where facts must stay reliable, structure must stay navigable, and decisions can change quickly across sessions.
Winner: Memstate
Date: 2026-03-12
Model: Claude Sonnet 4.6
Head-to-head metrics
[Per-system benchmark percentages and the scoring-weight table; see the repository's result files.]
Scenario coverage
Each scenario simulates a real software engineering project evolving across multiple sessions. The agent starts every session with a blank context window and must rely entirely on its memory system to reconstruct what it knows.
The agent worked on a full-stack web application across 6 sessions. In session 1, the team chose a monolithic Next.js architecture. By session 3, they switched to a microservices model. In session 5, they partially reverted to a modular monolith for the auth and billing services only. The agent had to track the current architecture for each service independently, not just the most recent global decision.
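To make concrete what the agent must reconstruct, the per-service state after session 5 can be sketched as a last-write-wins decision log. This is an illustrative sketch only, not part of the benchmark harness; the service names and session numbers echo the scenario description above.

```python
# Illustrative sketch: the scenario's architecture decisions as a
# (session, target, choice) log. "*" marks a global decision that
# applies to every service not later overridden individually.
decisions = [
    (1, "*", "monolith-nextjs"),        # session 1: monolithic Next.js
    (3, "*", "microservices"),          # session 3: global switch
    (5, "auth", "modular-monolith"),    # session 5: partial revert
    (5, "billing", "modular-monolith"),
]

def current_architecture(service, log):
    """Return the most recent decision that applies to `service`.

    The log is in session order, so later matching entries win.
    """
    best = None
    for session, target, choice in log:
        if target in ("*", service):
            best = choice
    return best

print(current_architecture("auth", decisions))      # modular-monolith
print(current_architecture("payments", decisions))  # microservices
```

A memory system that only stores "the latest architecture decision" collapses this log to a single value and answers wrong for every service except the last one touched.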
The project started with JWT-based authentication. Across 5 sessions, the team migrated to session-cookie auth, added OAuth for social login, then moved API clients back from session cookies to JWT. The agent had to recall which auth strategy applied to which client type at any given point, and detect when a question about auth referred to a state that had since changed.
A PostgreSQL schema evolved across 7 sessions. Tables were added, columns were renamed, data types changed, and two tables were merged into one. The agent was asked specific questions mid-migration: 'What is the current column name for user email?' and 'Does the orders table still have a status_code field?' Answering correctly required knowing the exact schema state at each point in time, not just the initial or final version.
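The point-in-time questions this scenario asks can be mimicked by replaying a migration log up to a given step. The migrations below are hypothetical and only echo the column names quoted above; they are not the benchmark's actual schema.

```python
# Hypothetical migration log replay; not the benchmark's actual schema.
migrations = [
    ("add_column",    "users",  "email_address"),
    ("rename_column", "users",  "email_address", "email"),
    ("add_column",    "orders", "status_code"),
    ("drop_column",   "orders", "status_code"),
]

def schema_after(log, n):
    """Replay the first n migrations and return {table: set(columns)}."""
    schema = {}
    for op in log[:n]:
        kind, table = op[0], op[1]
        cols = schema.setdefault(table, set())
        if kind == "add_column":
            cols.add(op[2])
        elif kind == "drop_column":
            cols.discard(op[2])
        elif kind == "rename_column":
            cols.discard(op[2])
            cols.add(op[3])
    return schema

# "What is the current column name for user email?" after migration 2:
print(schema_after(migrations, 2)["users"])  # {'email'}
# "Does orders still have status_code?" after migration 4:
print("status_code" in schema_after(migrations, 4)["orders"])  # False
```

Answering correctly requires this kind of replay semantics; retrieving the text of the first or last migration alone gives the wrong answer mid-migration.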
The team maintained three concurrent API versions (v1, v2, v3) across 8 sessions. Breaking changes were introduced in v2 that did not apply to v1. v3 deprecated several v2 endpoints. The agent had to recall version-specific constraints accurately when asked about a specific endpoint, and correctly identify when a question about 'the API' was ambiguous and needed version clarification.
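The version-scoping this scenario tests can be sketched as a per-version endpoint table plus an ambiguity check. The endpoint names and states below are invented for illustration and do not come from the benchmark.

```python
# Hypothetical per-version endpoint states; names are invented.
endpoints = {
    "v1": {"/users": "stable"},
    "v2": {"/users": "breaking-change", "/reports": "stable"},
    "v3": {"/users": "stable", "/reports": "deprecated"},
}

def answer(endpoint, version=None):
    """Answer a question about an endpoint, flagging ambiguity when the
    question names no version and the state differs across versions."""
    if version is not None:
        return endpoints[version].get(endpoint, "absent")
    states = {v: tbl.get(endpoint, "absent") for v, tbl in endpoints.items()}
    if len(set(states.values())) > 1:
        return f"ambiguous across versions: {states}"
    return next(iter(states.values()))

print(answer("/reports", "v3"))  # deprecated
print(answer("/users"))          # ambiguous across versions: {...}
```

The second call models the scenario's hardest requirement: recognizing that "the API" is underspecified rather than silently answering for one version.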
This scenario simulated real engineering team dynamics. Across 6 sessions, the team revisited four major architectural decisions, most of them more than once: the message queue (Kafka, then RabbitMQ, then back to Kafka), the deployment target (AWS ECS, then Kubernetes, then back to ECS), the ORM library (Prisma, then Drizzle), and the caching strategy (Redis, then in-memory, then Redis again). The agent had to recall the current decision in each area without conflating it with a previous state.
Fairness and transparency
Both systems were given identical agent models (Claude Sonnet 4.6) and identical scenario setups.
Each tool was allowed a custom system prompt prefix explaining how to use its MCP tools most effectively. Neither system had an unfair instruction advantage.
Scenarios were designed to reflect real multi-session software engineering work, not toy Q&A pairs. Requirements change. Decisions reverse. The agent must track the current truth, not just the most recent text.
Each scenario was run 20 times per system to account for LLM non-determinism. Scores are averages across all runs.
Token efficiency is included at a lower weight (10%) to avoid rewarding systems that retrieve nothing or compress memory so aggressively that accuracy suffers.
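The averaging and weighting described above can be sketched as follows. Only the 10% token-efficiency weight is stated in the methodology; the 60/30 split between accuracy and continuity below is a placeholder, not the published weighting.

```python
# Composite score sketch. Only the 0.10 token-efficiency weight is
# stated in the methodology; the other weights are placeholders.
WEIGHTS = {"accuracy": 0.60, "continuity": 0.30, "token_efficiency": 0.10}

def composite(run_scores):
    """Average each metric over all runs, then apply the weights."""
    n = len(run_scores)
    means = {m: sum(r[m] for r in run_scores) / n for m in WEIGHTS}
    return sum(WEIGHTS[m] * means[m] for m in means)

# Two of the 20 runs, with per-run benchmark percentages:
runs = [
    {"accuracy": 90.0, "continuity": 88.0, "token_efficiency": 70.0},
    {"accuracy": 92.0, "continuity": 86.0, "token_efficiency": 74.0},
]
print(round(composite(runs), 2))  # 87.9
```

Capping token efficiency at a small weight means a system cannot climb the ranking by retrieving nothing; accuracy and continuity dominate the composite.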
The full benchmark suite, all scenario prompts, scoring scripts, and raw result files are published at github.com/memstate-ai/memstate-benchmark and can be reproduced by anyone.
Token efficiency is marked with an asterisk because lower usage can reflect less useful retrieval. The benchmark prioritizes correctness and continuity for real production workflows.
Mem0 may perform better for broad freeform text indexing in some workloads, but that is not the focus of this benchmark. This evaluation targets what coding agents need: deterministic memory, accurate facts, and robust handling of changing plans.
Track current standings
See how memory systems stack up as more adapters are added.
Open leaderboard

Run and verify everything yourself
The scenarios, scoring, and result generation are fully public in GitHub.
View repository

Source of benchmark numbers: memstate-ai/memstate-benchmark; see BENCHMARK.md and adapter-level result files.