Benchmarks

How Memax performs on public memory benchmarks — retrieval quality, QA accuracy, and cost.

Memax is evaluated on independent public benchmarks to measure retrieval quality and end-to-end question-answering against real long-context memory workloads.

Headline

On LongMemEval (ICLR 2025) — the standard benchmark for long-term memory systems — Memax reaches:

96.9% retrieval recall@5 (top-tier, ahead of published dense-retrieval baselines)
91.2% QA accuracy — ranked #9 worldwide, on par with Hindsight (91.4%)
$0.046 / question end-to-end — one of the most cost-efficient systems above the 90% accuracy bar

All results are single-pass, no tuning on the benchmark. The leaderboard entries above and below us that beat 91.2% use significantly more expensive generator models or iterate repeatedly on the test set.

"QA accuracy" is end-to-end: retrieve → synthesize answer → judge. "Retrieval recall" isolates just the retrieval stage. Most published systems report one or the other — we report both for transparency.

Benchmark

LongMemEval is the community-standard benchmark for long-term memory:

Dataset: longmemeval_s_cleaned.json — 500 questions, ~115k tokens/question, ~48 sessions/question
Six question types: single-session-user, single-session-preference, single-session-assistant, multi-session, temporal-reasoning, knowledge-update
Judge: GPT-4o-2024-08-06, matching LongMemEval's official methodology
419 questions evaluated after applying the benchmark's standard exclusions (abstention + no-target-turn questions)

Retrieval — recall_any@k

How often at least one evidence chunk appears in the top-k.

Metric	k=1	k=3	k=5	k=10	k=30
recall_any@k	0.8520	0.9379	0.9690	0.9833	0.9881
recall_all@k	0.2363	0.7924	0.8974	0.9427	0.9761
ndcg_any@k	0.8520	0.8753	0.8980	0.9079	0.9131

Recall comparison vs published systems

Only a handful of systems report retrieval metrics at all — most report end-to-end QA accuracy.

System	recall_any@5	Notes
BM25 (LongMemEval paper)	0.51	Sparse keyword baseline
Stella V5 1.5B (LongMemEval paper)	0.70	Dense retrieval, 1.5B params
MemPalace raw (ChromaDB)	0.966	Verbatim sessions, MiniLM
Memax	0.969	Single pass, no tuning
MemPalace hybrid v4 (held-out)	0.984	+ keyword/temporal boost

QA accuracy — leaderboard

End-to-end QA on LongMemEval s_cleaned (500 questions, GPT-4o judge).

Rank	System	QA Accuracy	Generator	$/Q
1	agentmemory	96.2%	Claude Opus	$0.170
2	PwC Chronos	95.6%	Enhanced config	$0.10–$0.54
3	Mastra OM	94.87%	GPT-5-mini	$0.130
4	Backboard	93.4%	GPT-4.1	~$0.070
5	Ensue	93.2%	GPT-5-mini	~$0.003
6	OMEGA	93.2%	GPT-4.1	$0.022
7	MemMachine	93.0%	GPT-5-mini	$0.005
8	Hindsight	91.4%	Gemini-3-Pro	$0.650
9	Memax	91.2%	Claude Sonnet	$0.046
11	Emergence AI	86.0%	GPT-4o	$0.057
12	Supermemory	85.2%	Gemini-3-Pro	$0.110
17	Zep / Graphiti	71.2%	GPT-4o	$0.058
19	Full-context baseline	60.2%	GPT-4o	$0.290
20	Mem0 (independent eval)	49.0%	—	—

Showing representative entries. Sources: self-reported results from each system's repo, blog, or paper. Estimates for $/Q come from each system's disclosed embedding + LLM usage.

QA accuracy — per question type

How Memax scores across the six LongMemEval question types.

Type	QA Accuracy	N
single-session-user	97.1%	70
single-session-preference	93.3%	30
single-session-assistant	100.0%	56
multi-session	83.5%	133
temporal-reasoning	91.7%	133
knowledge-update	91.0%	78
abstention	96.7%	30
Overall	91.2%	500

Cost efficiency

At $0.046 / question, Memax is one of the cheapest systems above 90% QA accuracy. The big driver is the number of LLM calls per question, not the per-token price of the generator:

Systems that call an LLM per session during ingestion (~48 sessions per question) — e.g. Hindsight ($0.65/q with Gemini-3-Pro), Chronos ($0.10–$0.54/q) — spend most of their budget on extraction, not answer generation.
Memax does one generator call per question (the answer synthesis). Even with a per-token-expensive generator like Sonnet, one call × ~question-context tokens stays well under $0.05.
Supermemory (Gemini-3-Pro, $0.11/q) and Memax (Sonnet, $0.046/q) at the same QA tier show that call count, not model price, dominates.

Prompt caching and batch API discounts (not applied here) would cut cost further.

What's not in these numbers

Production ingestion enrichment — our production ingestion also does LLM-based enrichment per memory (titles, summaries, tags). The benchmark harness intentionally skips this to match how peer systems report LongMemEval, so the 91.2% is a conservative lower bound.
Upgrading the generator — Opus-class models (used by the #1 system) would likely push QA higher with no retrieval changes.
Multi-hop retrieval tuning — the weakest categories (multi-session, temporal-reasoning) are where multiple evidence chunks must all land in top-k. Known improvement area, not yet tuned.

Caveats worth reading

Corpus construction is not standardized. LongMemEval's paper prescribes user-turn-only indexing for retrieval, but all top QA systems index both user and assistant turns. Our reported result uses the same all-turns approach for comparability with other top systems; a paper-compliant user-only run yields 83.4% QA.
Generator model dominates QA scores. Systems using Opus, GPT-5-mini, or Gemini-3-Pro have a mechanical advantage over those using cheaper models, independent of retrieval quality.
No standardized protocol across the ecosystem. Systems vary in corpus construction, generator model, judge model, dataset variant, and whether they iterate on the test set. We report a single pass with no tuning.
Retrieval vs QA are different metrics. A system with great retrieval can still have mediocre QA (and vice versa) depending on the generator.

Reproducibility

The dataset (longmemeval_s_cleaned.json), the evaluation script, and the GPT-4o judge prompt are public — available in the LongMemEval repo. Anyone running a comparable system can produce a directly comparable score with the same methodology.

Benchmark results are a snapshot in time — we re-run as the retrieval stack evolves. The specific retrieval techniques and models behind each stage are implementation details and subject to change without notice.

On this page