Forgetting is a feature: building brain-shaped memory for LLMs

Every LLM conversation starts from zero. You explain your project, your preferences, the bug you fixed last Tuesday, and the next session none of it exists. The obvious fix is memory, and the obvious implementation is a database that accumulates everything. I built that first instinct into claude-engram, my memory system for Claude Code, and then spent most of the project learning why the brain doesn't work that way.

The system that came out the other end is brain-shaped on purpose: it gates what gets stored, lets unimportant memories decay on a power-law curve, consolidates during "sleep," and compresses old details into gist. The hook I keep coming back to is that forgetting isn't the failure mode of this system. It's the product.

Why remembering everything fails

A memory system's output is ultimately a briefing: when a session starts, what does the model need to know? If you store everything, the briefing is either enormous (and the important parts drown) or you need a perfect retrieval system to pick the right needles every time (you will not build one). Signal-to-noise is the whole game, and storage discipline is cheaper than retrieval brilliance.

Humans handle this with a piece of hardware called the hippocampus, which decides what's worth encoding in the first place, and with forgetting curves that have been measured since Ebbinghaus: memory strength falls off following a power law, fast at first, then slowly, unless something reinforces it. That's not a flaw to engineer around. It's a relevance filter that runs for free.

An LLM as a hippocampus

claude-engram hooks into Claude Code's lifecycle — session start, post-response, pre-compaction, session end — and exposes an MCP server with eight tools so Claude can search, store, reinforce, and forget memories mid-conversation. No manual note-taking; it works invisibly.

The gate is the first brain-shaped piece. A separate Haiku instance scores every candidate memory on four dimensions: novelty, relevance, emotional weight, and prediction error — that last one being the "wait, that's not what I expected" signal that makes biological memory sit up. Only content that clears the bar gets stored. Most of what happens in a coding session is ephemeral, and the gate's job is to let it stay that way.
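A minimal sketch of what a gate like this could look like once the scoring model has returned its four numbers. The weights, the threshold, and every name here are hypothetical, chosen only for illustration and not taken from claude-engram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateScores:
    """Scores in [0, 1] for the four dimensions the gate looks at."""
    novelty: float
    relevance: float
    emotional_weight: float
    prediction_error: float


# Hypothetical weights and threshold (assumptions, not the project's values).
WEIGHTS = {
    "novelty": 0.30,
    "relevance": 0.30,
    "emotional_weight": 0.15,
    "prediction_error": 0.25,
}
STORE_THRESHOLD = 0.5


def salience(scores: GateScores, weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of the four dimension scores."""
    total = sum(weights.values())
    return sum(getattr(scores, name) * w for name, w in weights.items()) / total


def should_store(scores: GateScores, threshold: float = STORE_THRESHOLD) -> bool:
    """Only candidates that clear the bar are kept; the rest stay ephemeral."""
    return salience(scores) >= threshold
```

With these made-up numbers, a surprising architecture decision scored (0.9, 0.8, 0.3, 0.9) clears the bar, while a routine file listing scored (0.1, 0.4, 0.0, 0.1) does not.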

The scoring isn't static, either. Reinforce, forget, and prune events emit training signals, and the per-dimension extraction weights adapt to how you actually behave — a dopamine-style loop. If I keep reinforcing memories about architecture decisions and pruning trivia, the gate learns my tastes.
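One way such a feedback loop could work, sketched under assumptions: each event carries a reward sign, each weight is nudged in proportion to how strongly the memory scored on that dimension, and the weights are renormalized. The update rule, learning rate, and floor are illustrative guesses, not the project's actual rule.

```python
# Assumed reward signs: reinforcement is positive, forgetting and pruning negative.
SIGNALS = {"reinforce": 1.0, "forget": -1.0, "prune": -1.0}


def adapt_weights(
    weights: dict[str, float],
    scores: dict[str, float],
    event: str,
    learning_rate: float = 0.05,
) -> dict[str, float]:
    """Return new per-dimension weights after one training signal."""
    reward = SIGNALS[event]
    updated = {
        # Floor at 0.01 so no dimension is ever switched off entirely.
        name: max(0.01, w + learning_rate * reward * scores[name])
        for name, w in weights.items()
    }
    total = sum(updated.values())
    return {name: w / total for name, w in updated.items()}
```

Reinforcing a memory that scored high on novelty shifts weight toward novelty; pruning the same memory shifts it away.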

Strength is computed, never stored

This is my favorite design decision in the project. A memory's strength is not a column in a table. It's derived at recall time:

strength = salience
         + retrieval boost
         + consolidation bonus
         − decay × √age

The decay term follows the power-law shape from Ebbinghaus and Wixted: steep early, gentle later. Because strength is a pure function of stored facts (salience, retrieval history, consolidation status, age), there's no background job sweeping the database to decrement scores and no rows quietly mutating. And if I want to tune the forgetting curve, every memory in the system follows the new curve instantly, including ones written months ago.
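The formula translates almost directly into a pure function. In this sketch the field names, the per-retrieval boost, and all constants are assumptions; only the overall shape (salience plus boosts, minus decay times the square root of age) comes from the formula above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryFacts:
    """The stored facts; strength itself is never persisted."""
    salience: float
    retrieval_count: int
    consolidated: bool
    age_days: float


@dataclass(frozen=True)
class Curve:
    """Tunable constants (hypothetical values)."""
    retrieval_boost: float = 0.1      # added per past retrieval (assumed linear)
    consolidation_bonus: float = 0.2  # added once a memory has been consolidated
    decay: float = 0.1                # multiplies sqrt(age)


def strength(m: MemoryFacts, curve: Curve = Curve()) -> float:
    """Derive strength at recall time from stored facts and the current curve."""
    return (
        m.salience
        + curve.retrieval_boost * m.retrieval_count
        + (curve.consolidation_bonus if m.consolidated else 0.0)
        - curve.decay * math.sqrt(m.age_days)
    )
```

Because the curve is an argument rather than stored state, calling strength(m, Curve(decay=0.2)) re-scores a months-old memory under the new curve with no migration. The square root also gives the steep-then-gentle shape: with decay 0.1, the first day costs 0.1, while the step from day 15 to day 16 costs about 0.013.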

Recall ranks by relevance × strength, where relevance comes from a hybrid: Voyage embeddings with cosine similarity, layered over token-level fuzzy matching. A memory you haven't touched in months scores low even if it's topically dead-on, which is exactly what you want from a briefing.
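A rough sketch of that ranking, with several stand-ins: embedding vectors are passed in as plain lists rather than fetched from Voyage, the fuzzy matcher is Python's difflib, and the 70/30 blend between the two signals is an assumption about what "layered over" might mean in practice.

```python
import math
from difflib import SequenceMatcher


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuzzy_token_score(query: str, text: str) -> float:
    """Average, over query tokens, of the best fuzzy match among text tokens."""
    q_tokens = query.lower().split()
    t_tokens = text.lower().split()
    if not q_tokens or not t_tokens:
        return 0.0
    best = [
        max(SequenceMatcher(None, q, t).ratio() for t in t_tokens)
        for q in q_tokens
    ]
    return sum(best) / len(best)


def relevance(query, query_vec, text, text_vec, embed_share=0.7):
    """Blend embedding similarity with token-level fuzzy matching (assumed mix)."""
    semantic = max(0.0, cosine(query_vec, text_vec))
    return embed_share * semantic + (1 - embed_share) * fuzzy_token_score(query, text)


def rank(query, query_vec, memories):
    """memories: list of (text, vector, strength) tuples; returns best first."""
    return sorted(
        memories,
        key=lambda m: relevance(query, query_vec, m[0], m[1]) * max(0.0, m[2]),
        reverse=True,
    )
```

The multiplication is what makes a faded memory lose to a fresher, slightly less on-topic one: a perfect relevance of 1.0 at strength 0.1 still scores below a relevance of 0.7 at strength 0.9.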

And forgetting here isn't deletion. Decayed memories slide into an archive, and a separate deep_recall tool can still reach them. Psychology draws a line between true forgetting and retrieval failure, where the memory is one you can't summon at will but recognize instantly when prompted. The system keeps that distinction: weak memories leave the briefing long before they leave the disk.
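The two tiers can be illustrated with a toy model. The deep_recall name comes from the tool mentioned above, but the body here is only a stand-in: the archive threshold is hypothetical, and substring matching replaces real relevance scoring.

```python
ARCHIVE_THRESHOLD = 0.2  # hypothetical cutoff below which a memory is archived


def partition(memories: dict[str, float]):
    """Split {text: strength} into (active, archived). Nothing is deleted."""
    active = {t: s for t, s in memories.items() if s >= ARCHIVE_THRESHOLD}
    archived = {t: s for t, s in memories.items() if s < ARCHIVE_THRESHOLD}
    return active, archived


def recall(query: str, memories: dict[str, float]) -> list[str]:
    """Ordinary recall only sees the active tier."""
    active, _ = partition(memories)
    return [t for t in active if query.lower() in t.lower()]


def deep_recall(query: str, memories: dict[str, float]) -> list[str]:
    """Deep recall searches everything on disk, archive included."""
    return [t for t in memories if query.lower() in t.lower()]
```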

Sleep

The second brain-shaped piece is consolidation. Periodically, a Sonnet-driven cycle does what sleep does for you: it merges redundant memories, resolves contradictions (newest wins), extracts recurring patterns into generalized memories, and prunes the dead ones. Ten observations that I keep choosing a particular testing pattern become one semantic memory that says so.

Above 100 memories, consolidation goes two-pass for cost reasons: Haiku sweeps the whole store and flags clusters of merge candidates, and Sonnet only processes the flagged groups. It's the same tiering logic as everywhere else (cheap model triages, capable model operates), and it's a big part of why the whole system runs at roughly $0.05 to $0.15 a day.
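The control flow of the two-pass scheme might look something like the sketch below. The two models are represented by injected callables, triage and operate, which are hypothetical stand-ins: triage returns clusters of indices worth a closer look, and operate returns the consolidated replacement for one group.

```python
from typing import Callable

TWO_PASS_THRESHOLD = 100  # the store size mentioned above


def consolidate(
    memories: list[str],
    triage: Callable[[list[str]], list[list[int]]],
    operate: Callable[[list[str]], list[str]],
) -> list[str]:
    """Run one consolidation cycle, tiering the work when the store is large."""
    if len(memories) <= TWO_PASS_THRESHOLD:
        # Small store: the capable model can afford to see everything.
        return operate(memories)

    # Large store: the cheap model flags clusters of merge candidates...
    clusters = triage(memories)
    flagged = {i for cluster in clusters for i in cluster}

    # ...unflagged memories pass through untouched...
    result = [m for i, m in enumerate(memories) if i not in flagged]

    # ...and the capable model only processes the flagged groups.
    for cluster in clusters:
        result.extend(operate([memories[i] for i in cluster]))
    return result
```

The cost saving falls out of the structure: with 101 memories and one flagged pair, the expensive callable sees two items instead of 101.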

There's also a slower transformation running underneath, borrowed from Fuzzy Trace Theory: episodic memories degrade into semantic ones. After seven days, a detailed memory compresses to its gist. This matches how you work. You don't remember the exact stack trace from last month; you remember "that bug was a timezone thing." The gist is what's useful at briefing time, and it's a fraction of the tokens.
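As a sketch, the episodic-to-semantic step is a small age check around a summarizer. The seven-day cutoff is from the text; the Memory shape, the kind labels, and the summarize callable (which would be a model call in practice) are assumptions.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Callable

GIST_AFTER = timedelta(days=7)


@dataclass(frozen=True)
class Memory:
    text: str
    created_at: datetime
    kind: str = "episodic"  # becomes "semantic" once compressed


def compress_if_old(
    memory: Memory,
    now: datetime,
    summarize: Callable[[str], str],
) -> Memory:
    """Replace an old episodic memory's detail with its gist; leave others alone."""
    if memory.kind != "episodic" or now - memory.created_at < GIST_AFTER:
        return memory
    return replace(memory, text=summarize(memory.text), kind="semantic")
```

Checking the kind first makes the step idempotent, so a memory that has already collapsed to gist is never summarized a second time.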

A couple of smaller mechanisms round it out. Proactive interference: when a new memory updates an old one, the old trace's salience gets dampened immediately, so superseded facts don't compete with their replacements. Temporal association: recalling one memory surfaces others from the same session, the way one detail of a day drags its neighbors along.
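Both mechanisms are small enough to sketch in a few lines. The dampening factor and the Trace fields are hypothetical; the source only says that the old trace's salience is dampened and that same-session memories surface together.

```python
from dataclasses import dataclass

DAMPENING = 0.5  # hypothetical factor applied to a superseded trace


@dataclass
class Trace:
    id: str
    session_id: str
    salience: float


def supersede(old: Trace) -> None:
    """Proactive interference: dampen the old trace as soon as it is updated."""
    old.salience *= DAMPENING


def associated(recalled: Trace, store: list[Trace]) -> list[Trace]:
    """Temporal association: other traces from the same session as the recalled one."""
    return [
        t for t in store
        if t.session_id == recalled.session_id and t.id != recalled.id
    ]
```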

What the briefing feels like

All of this exists to serve one moment: session start. The briefing is context-adaptive — project-scoped memories get boosted when you're in that project, while global identity and preference memories stay in scope everywhere. What comes out is short and dense: who you are, how you like to work, what's true about this project, what happened recently that still matters. Old noise has decayed, duplicates have merged, details have collapsed to gist.
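A toy version of the context-adaptive selection, assuming each memory carries a scope that is either "global" or a project name. The boost factor and the limit are made up, and this sketch leaves other projects' memories in the pool unboosted rather than filtering them out, which is one possible reading of the behavior described.

```python
PROJECT_BOOST = 1.5  # hypothetical multiplier for the current project's memories


def briefing(memories: list[dict], current_project: str, limit: int = 5) -> list[str]:
    """Pick the strongest memories for session start, favoring the current project.

    Each memory is a dict with 'text', 'scope' and 'strength' keys.
    """
    def score(m: dict) -> float:
        boost = PROJECT_BOOST if m["scope"] == current_project else 1.0
        return m["strength"] * boost

    ranked = sorted(memories, key=score, reverse=True)
    return [m["text"] for m in ranked[:limit]]
```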

That quality is not produced by the recall algorithm. It's produced by everything the system declined to store and allowed to fade. The forgetting is the briefing.

The unglamorous parts

For anyone building something similar: the neuroscience mapping was the fun 30%. The rest was file locking, pre-consolidation backups, auto-rotating logs, cursor-tracked transcript parsing, a SQLite event dashboard, a one-step installer, and 130-plus tests. There's also a second implementation of the same ideas that runs as a single React artifact inside Claude.ai with zero infrastructure, plus a companion benchmark repo (recall-bench) for checking that recall actually recalls.

But the idea I'd want you to take is the smaller one. When people hear their AI's memory system forgets things, they assume it's a limitation to apologize for. It's the opposite. A memory that keeps everything is a junk drawer. A memory that forgets is an editor — and an editor is what you want whispering in the model's ear when a session begins.