MemTensor (Shanghai) Technology Co., Ltd. · Institute for Advanced Algorithms Research, Shanghai · Research Institute of China Telecom · Tongji University · Zhejiang University
LLMs lack well-defined memory management systems, limiting long-context reasoning and continual personalization. We propose MemOS, a memory operating system that treats memory as a first-class resource — unifying plaintext, activation-based, and parameter-level memories under a single hierarchical framework called MemCube.
MemOS achieves state-of-the-art performance across all major memory benchmarks (PrefEval-0t, PrefEval-10t, PersonaMem, LongMemEval, LoCoMo), outperforming Mem0, Zep, MemBase, MIRIX, and Supermemory.
With the advent of the Transformer architecture and the maturation of self-supervised pretraining, Large Language Models (LLMs) have become the cornerstone of modern AI. Trained on massive corpora, these models encode vast amounts of world knowledge in their parameters and demonstrate remarkable cross-task generalization.
However, a fundamental challenge remains: LLMs are inherently stateless. Each session starts from scratch, without persistent memory of past interactions, user preferences, or evolving knowledge. As LLMs transition from tools to persistent agents operating across time and space, this limitation becomes a critical bottleneck.
Existing approaches like Retrieval-Augmented Generation (RAG) treat memory as an afterthought — a stateless workaround without lifecycle control or integration with persistent representations. While RAG introduces external knowledge in plain text, it cannot unify heterogeneous memory types or manage memory evolution over time.
To address this, we propose MemOS: a Memory Operating System for AI systems. MemOS unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. The core unit — MemCube — encapsulates both memory content and metadata such as provenance and versioning.
Research on LLM memory has progressed through four key stages: early definitions of implicit and explicit memory, human-like memory architectures, tool-based memory management, and the current era of systematic memory operating systems.
Stage 1: Early work explored the distinction between implicit memory (encoded in model weights via pretraining) and explicit memory (stored externally as text or key-value pairs). Representative approaches include RAG, kNN-LMs, and prefix tuning.
Stage 2: Inspired by the hippocampus (short-term recall) and neocortex (long-term consolidation), researchers developed multi-component memory architectures with mind-map-like structures for persistent knowledge storage.
Stage 3: Systems like Mem0 and Zep introduced explicit APIs for memory operations (Add, Modify, Update, Delete). However, these remain siloed and cannot unify heterogeneous memory types under a single framework.
Stage 4: MemOS introduces OS-level resource management principles to LLM memory: unified scheduling, lifecycle control, governance policies, and cross-type memory migration. This is the first system to treat memory as a first-class OS resource.
As AGI advances toward increasingly complex systems involving multiple tasks, roles, and modalities, LLMs must go beyond merely "understanding the world" — they must also accumulate experience, retain preferences, and evolve over time.
Model performance is approaching the upper limits predicted by traditional scaling laws. The prevailing research paradigm is shifting from data- and parameter-centric pretraining to reinforcement-based alignment in post-training (e.g., OpenAI o1, DeepSeek-R1). Yet this shift also faces diminishing returns.
MemOS proposes Mem-training Scaling as the next frontier: by continuously accumulating and refining memory across deployments, LLMs can break through post-training performance ceilings. Thousands of heterogeneously deployed model instances can gather experience in situ and exchange it via MemOS infrastructure.
In traditional computing systems, the OS centrally manages hardware resources (CPU, memory, storage) to support efficient application execution. MemOS applies this same principle to LLM memory resources.
The table below maps traditional OS components to their MemOS counterparts. Just as an OS abstracts hardware for applications, MemOS abstracts heterogeneous memory types (parameter, activation, plaintext) for LLM applications:
| Traditional OS | MemOS Module | Function |
|---|---|---|
| Registers / Microcode | Parameter Memory | Long-term ability |
| Cache / I/O Buffer | Activation Memory | Fast working state |
| Main Memory | Plaintext Memory | External episodes |
| Scheduler | MemScheduler | Prioritise ops |
| File System | MemVault | Versioned store |
| System Call | Memory API | Unified access |
| Device Driver | MemLoader / Dumper | Move memories |
| Package Manager | MemStore | Share bundles |
| Auth / ACLs | MemGovernance | Access control |
| Syslog | Audit Log | Audit trail |
MemOS systematizes LLM memory into three core types that together reflect a full spectrum of knowledge representation — from volatile inference state to durable parameter knowledge:
Plaintext Memory: explicitly stored structured or unstructured text, such as conversation histories, user preferences, and episodic notes. Highest interpretability, lowest integration cost. Analogous to main memory (RAM) for fast access.
Activation Memory: inference-coupled activation states, such as the KV cache and hidden-state steering vectors. Bridges plaintext and parameter memory. Enables fast working-state injection without full fine-tuning.
Parameter Memory: knowledge implicitly embedded in model weights, such as LoRA adapters, weight patches, and fine-tuned modules. Highest integration depth and durability. Used for long-term skill and knowledge consolidation.
MemCube is the unified abstraction that standardizes memory representation, lifecycle management, and scheduling across all three memory types.
Each MemCube consists of a Metadata Header (lifecycle timestamps, access control lists, storage profile) and a Memory Payload (plaintext content, activation state, or parameter patch). MemCubes can be composed, migrated, and fused over time.
The MemScheduler handles context-aware matching, priority-based loading, memory lifecycle control, and runtime memory injection, routing the right MemCubes to the right LLM at the right time. An example MemCube, with its metadata header and memory payload:
{
  "meta": {
    "created": "2025-04-10",
    "source": "session_3894",
    "model": "LLaMA3-8B",
    "priority": "mid",
    "expires": "2025-06-01",
    "access": ["user_483", "admin"]
  },
  "payload": {
    "type": "explicit",
    "format": "text",
    "content": "You are a helpful assistant..."
  }
}
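The header-plus-payload layout above can be sketched as a small Python data structure. This is a minimal illustration, not the MemOS implementation: the class names `MemCubeMeta`, `MemCubePayload`, and `MemCube`, and the helpers `is_expired` and `can_access`, are hypothetical. Only the field names and values come from the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class MemCubeMeta:
    """Metadata header; field names mirror the example above."""
    created: date
    source: str
    model: str
    priority: str
    expires: date
    access: list[str] = field(default_factory=list)


@dataclass
class MemCubePayload:
    """Memory payload: its type, its encoding, and the content itself."""
    type: str      # "explicit" in the example above
    format: str    # "text" in the example above
    content: str


@dataclass
class MemCube:
    meta: MemCubeMeta
    payload: MemCubePayload

    def is_expired(self, today: date) -> bool:
        # Hypothetical lifecycle rule: a cube is stale once its expiry date has passed.
        return today > self.meta.expires

    def can_access(self, principal: str) -> bool:
        # Hypothetical governance rule: only principals on the access list may read.
        return principal in self.meta.access


cube = MemCube(
    meta=MemCubeMeta(
        created=date(2025, 4, 10),
        source="session_3894",
        model="LLaMA3-8B",
        priority="mid",
        expires=date(2025, 6, 1),
        access=["user_483", "admin"],
    ),
    payload=MemCubePayload(
        type="explicit",
        format="text",
        content="You are a helpful assistant...",
    ),
)
```

Keeping the header separate from the payload means lifecycle and access checks can run without inspecting the memory content, whatever form that content takes.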
MemOS adopts a modular three-layer architecture to support efficient invocation, dynamic scheduling, and compliant governance of complex memory tasks.
When a user sends a message, MemReader parses the intent and formulates a MemoryQuery. The MemOperator retrieves matching MemCubes from MemVault, the MemScheduler injects relevant memory into the LLM context (as plaintext, activation bias, or parameter patch), and the model generates a memory-augmented response. The full interaction is logged to the Audit Log via MemGovernance.
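The request flow described above can be sketched end to end in a few functions. This is a toy under stated assumptions: the function names, the `Cube` class, keyword overlap as the matching rule, and the `budget` parameter are all hypothetical stand-ins, and only plaintext injection is shown. The LLM call itself is omitted.

```python
from dataclasses import dataclass

PRIORITY_RANK = {"high": 0, "mid": 1, "low": 2}


@dataclass
class Cube:
    content: str
    priority: str
    access: list[str]


def mem_reader(message: str) -> set[str]:
    # Stand-in for intent parsing: the "MemoryQuery" is just a set of lowercase keywords.
    return {w.strip(".,?!").lower() for w in message.split() if len(w) > 3}


def mem_operator(query: set[str], vault: list[Cube], user: str) -> list[Cube]:
    # Retrieve cubes the user may read that share at least one keyword with the query.
    hits = []
    for cube in vault:
        words = {w.strip(".,?!").lower() for w in cube.content.split()}
        if user in cube.access and query & words:
            hits.append(cube)
    return hits


def mem_scheduler(cubes: list[Cube], budget: int) -> list[Cube]:
    # Priority-based loading: keep the highest-priority cubes that fit the budget.
    ranked = sorted(cubes, key=lambda c: PRIORITY_RANK[c.priority])
    return ranked[:budget]


def answer(message, vault, user, audit_log, budget=2):
    query = mem_reader(message)
    selected = mem_scheduler(mem_operator(query, vault, user), budget)
    # Plaintext injection: prepend the selected memories to the prompt.
    prompt = "\n".join(c.content for c in selected) + "\n" + message
    # Record which memories were injected for which user.
    audit_log.append({"user": user, "injected": [c.content for c in selected]})
    return prompt  # a real system would pass this prompt to the LLM


vault = [
    Cube("User prefers vegetarian recipes.", "high", ["user_483"]),
    Cube("User asked about pasta recipes last week.", "mid", ["user_483"]),
    Cube("Internal admin note about recipes.", "high", ["admin"]),
]
audit_log = []
prompt = answer("Suggest some recipes for dinner", vault, "user_483", audit_log, budget=1)
```

With a budget of one, only the high-priority preference visible to `user_483` is injected; the admin-only note is filtered out before scheduling.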
We systematically evaluate MemOS capabilities through holistic and component-level experiments — benchmarking full system performance on long-context memory, personalization understanding, chunk size sensitivity, retrieval robustness, and KV-based acceleration.
MemOS is benchmarked against MIRIX, Zep, MemBase, Mem0, and Supermemory on the LoCoMo and LongMemEval benchmarks. MemOS-1031 achieves the highest average score (75.80), which is 3.79 points (about 5.3% relative) above the previous best (MemBase: 72.01), although MemBase scores higher in the ROUGE-L and LoCoMo columns:
| Method | Memory Size | LiftAge ↑ | F1 ↑ | ROUGE-L ↑ | BLEU ↑ | Avg ↑ | LoCoMo ↑ |
|---|---|---|---|---|---|---|---|
| No Memory | — | 68.22 | 54.26 | 68.54 | 46.88 | 64.33 | 28.10 |
| MIRIX | 1,172 | 73.33 | 58.75 | 52.34 | 45.83 | 64.57 | 43.46 |
| Zep | 2,701 | 66.23 | 52.12 | 54.82 | 33.33 | 59.22 | 41.23 |
| MemBase | 2,102 | 73.12 | 64.65 | 81.20 | 53.12 | 72.01 | 50.18 |
| Mem0 | 617 | 66.34 | 63.12 | 27.10 | 50.01 | 56.55 | 35.15 |
| Supermemory | 500 | 67.30 | 51.12 | 31.77 | 42.67 | 55.34 | 34.87 |
| MemOS-1031 | 1,589 | 81.09 | 67.49 | 75.18 | 55.90 | 75.80 | 45.27 |
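The headline gain can be checked directly from the Avg column. The short script below is only a worked calculation over the numbers in the table; the variable names are illustrative.

```python
# Avg scores copied from the table above.
avg = {
    "No Memory": 64.33, "MIRIX": 64.57, "Zep": 59.22, "MemBase": 72.01,
    "Mem0": 56.55, "Supermemory": 55.34, "MemOS-1031": 75.80,
}

# Find the strongest baseline, then express the MemOS gain in points and in percent.
baselines = {name: score for name, score in avg.items() if name != "MemOS-1031"}
best_name = max(baselines, key=baselines.get)
absolute_gain = avg["MemOS-1031"] - baselines[best_name]
relative_gain = absolute_gain / baselines[best_name] * 100

print(f"{best_name}: +{absolute_gain:.2f} points, {relative_gain:.1f}% relative")
# prints "MemBase: +3.79 points, 5.3% relative"
```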
MemOS is evaluated on the PrefEval and PersonaMem benchmarks to assess personalization quality. MemOS achieves the best Personalized Response performance in both zero-turn and 10-irrelevant-turn settings, while recording the lowest Preference Unaware Rate — indicating that MemOS consistently recalls and applies user preferences without being misled by irrelevant context.
For the PersonaMem benchmark, MemOS achieved the best precision while maintaining acceptable context length control, validating its superior capability in handling dynamic user preference evolution across extended interaction histories.
We analyze the impact of retrieved memory chunk count (Top-K) and chunk size on MemOS performance across multiple metrics. Performance stabilizes at Top-K=3 and a chunk size of roughly 512 tokens, which gives the best observed tradeoff between context window usage and recall quality.
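To make the two knobs concrete, here is a minimal sketch of fixed-size chunking followed by Top-K selection. It rests on simplifying assumptions: whitespace-separated words stand in for tokens, Jaccard overlap stands in for embedding similarity, and the function names are hypothetical. A chunk size of 6 is used so the toy data spans several chunks; the setting reported above would be `size=512`, `k=3`.

```python
def chunk(tokens: list[str], size: int) -> list[list[str]]:
    # Split a token list into consecutive, non-overlapping chunks of at most `size` tokens.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def score(query: list[str], chunk_tokens: list[str]) -> float:
    # Jaccard overlap as a stand-in for embedding similarity.
    q, c = set(query), set(chunk_tokens)
    return len(q & c) / len(q | c) if q | c else 0.0


def top_k(query: list[str], chunks: list[list[str]], k: int) -> list[list[str]]:
    # Return the k highest-scoring chunks; ties keep document order because sorted is stable.
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]


memory = ("alice likes green tea . bob plays chess on fridays . "
          "alice is allergic to peanuts . the office closes at six .").split()
chunks = chunk(memory, size=6)
results = top_k("what is alice allergic to".split(), chunks, k=3)
```

In this toy data the chunk boundary separates "alice" from "is allergic to peanuts", so the best-matching chunk no longer names the person. This is a small illustration of why chunk size affects recall quality.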
We conduct a focused evaluation analyzing the efficiency and robustness of memory retrieval via network API under varying query-per-second (QPS) pressure. Metrics include P99, P90, and mean latency for both memory insertion (add) and retrieval (search) operations, as well as success rate.
MemOS achieved a 100% success rate at every QPS level while maintaining the lowest latency, demonstrating robust, production-grade memory retrieval even under high concurrency.
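For reference, the latency metrics named above can be computed from raw per-request timings as follows. This sketch assumes the nearest-rank definition of a percentile (other definitions interpolate) and uses synthetic latencies, not measurements from the evaluation. The function names are illustrative.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    # Nearest-rank percentile: the smallest sample with at least p% of the data at or below it.
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: list[float], succeeded: list[bool]) -> dict:
    # Aggregate one operation type (add or search) at one QPS level.
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p90": percentile(latencies_ms, 90),
        "p99": percentile(latencies_ms, 99),
        "success_rate": sum(succeeded) / len(succeeded),
    }


# Synthetic latencies for illustration only: 1 ms, 2 ms, ..., 100 ms.
latencies = [float(ms) for ms in range(1, 101)]
report = summarize(latencies, [True] * 100)
```

Tail percentiles such as P99 are reported alongside the mean because a small number of slow requests can dominate user-perceived latency without moving the average much.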
KV-based memory injection precomputes the key-value cache for memory content, so the model does not have to reprocess that content on every request at inference time. We compare Time to First Token (TTFT) against standard attention across model sizes (3B, 7B, 72B), context lengths (short/medium/long), and query lengths.
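The idea behind KV-based injection can be shown with a toy single-head, single-layer attention step in plain Python. This is a sketch under stated assumptions, not the MemOS mechanism: the projection matrices and embeddings are arbitrary illustrative values, and a real model repeats this per layer and per head.

```python
import math


def project(x: list[float], w: list[list[float]]) -> list[float]:
    # Row vector times matrix.
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def attend(q, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


# Toy 2-dimensional projection matrices (arbitrary values, for illustration only).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[1.0, 0.3], [0.4, 1.0]]

memory_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # embeddings of the memory text
query_token = [0.5, 2.0]                               # embedding of the new query token

# Offline: project the memory once and store the result (the "KV cache").
cached_k = [project(t, W_K) for t in memory_tokens]
cached_v = [project(t, W_V) for t in memory_tokens]


def with_cache(token):
    # Only the new token is projected; the memory's keys and values are reused.
    keys = cached_k + [project(token, W_K)]
    values = cached_v + [project(token, W_V)]
    return attend(project(token, W_Q), keys, values)


def without_cache(token):
    # Baseline: every memory token is re-projected on each request.
    sequence = memory_tokens + [token]
    keys = [project(t, W_K) for t in sequence]
    values = [project(t, W_V) for t in sequence]
    return attend(project(token, W_Q), keys, values)


cached_out = with_cache(query_token)
fresh_out = without_cache(query_token)
```

Both paths produce the same attention output, but the cached path skips the per-request projection of the memory tokens. That skipped work grows with memory length, which is where the TTFT saving comes from.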
MemOS enables a new paradigm for AI applications in which persistent memory is a modular, manageable resource. By treating memory as a first-class system resource, it enables unified lifecycle management and orchestration of memory in multiple forms. This abstraction supports two key architectural innovations, followed by four key application scenarios:
Domain experts can publish structured experiential memories via MemStore — like a knowledge plugin. Consumers (students, enterprise agents, assistant models) can subscribe, download, and activate these memory modules to immediately acquire domain-specific expertise without fine-tuning.
Users and developers access memory through standardized task-level Memory API calls — without handling low-level vector indexing, KV-caching, or context orchestration. Memory is a universal, long-lived, shareable infrastructure resource, analogous to storage subsystems in traditional OS.
MemOS maintains conversation history, user preferences, and task context across sessions and modalities — enabling seamless long-horizon interactions without context window limitations.
Domain knowledge can be continuously updated via MemCube injection without full retraining. New information is integrated at the plaintext or activation level, then consolidated into parameter memory over time.
User-specific preferences, communication styles, and role definitions are stored as MemCubes and injected at inference time — enabling genuine personalization across diverse user profiles.
MemCubes can be exported, imported, and transferred across model instances and platforms via MemStore — enabling a decentralized memory marketplace where users own their AI memory.
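Export and import of a plaintext MemCube can be sketched as a JSON round trip. This is an assumption-laden illustration: the source does not specify a wire format, and `export_cube` and `import_cube` are hypothetical helper names. The two required sections follow the `meta` and `payload` layout shown earlier.

```python
import json


def export_cube(cube: dict) -> str:
    # Serialize deterministically so the same cube always produces the same text.
    return json.dumps(cube, sort_keys=True)


def import_cube(blob: str) -> dict:
    cube = json.loads(blob)
    # Reject bundles that lack either of the two top-level sections.
    missing = {"meta", "payload"} - cube.keys()
    if missing:
        raise ValueError(f"MemCube is missing sections: {sorted(missing)}")
    return cube


original = {
    "meta": {"source": "session_3894", "priority": "mid", "access": ["user_483", "admin"]},
    "payload": {"type": "explicit", "format": "text", "content": "You are a helpful assistant..."},
}
restored = import_cube(export_cube(original))
```

Validating on import rather than on export lets a receiving instance refuse malformed bundles regardless of where they were produced.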
We introduce MemOS, a memory operating system designed for Large Language Models, aimed at collaboratively building foundational memory infrastructure for next-generation LLMs. MemOS provides a unified abstraction and integrated management framework for heterogeneous memory types — parameter memory, activation memory, and explicit plaintext memory.
The MemCube abstraction enables controllable, plastic, and evolvable memory management — laying the foundation for continual learning and personalized modeling at scale. MemOS achieves state-of-the-art performance across all evaluated benchmarks while significantly reducing Time to First Token via KV-based memory acceleration.
Looking ahead, we envision a future intelligent ecosystem centered on modular memory resources and supported by a decentralized memory marketplace. This paradigm shift — from stateless tools to memory-rich persistent agents — represents the next frontier in AI system design.