The leaked Claude Code source code reveals a fact that was previously hard to verify: Claude Code’s lifecycle extends far beyond the user-visible request-response loop. While your cursor blinks in the input box as you compose your next message, Claude Code’s backend is executing dozens of asynchronous tasks — speculative execution, memory extraction, document maintenance, context compaction, and more. Every moment you assume is idle is actually one of the system’s most computation-dense work periods.
These background mechanisms are not clever innovations unique to Claude Code. They represent a set of universal patterns that the AI agent industry is converging on, and understanding them has direct value for building any agent system.
The design philosophy of our own context infrastructure closely aligns with Claude Code’s background activity system. Claude Code’s memory extraction, automatic compaction, and prompt cache optimization can seamlessly transfer to our workflows because the underlying problem being solved is the same: how to maintain coherent cognitive state for an agent across multiple interactions. OpenClaw’s heartbeat auto-distillation mechanism is another instantiation of the same pattern. When we see Claude Code’s auto-dream consolidating session memories every 24 hours and OpenClaw’s heartbeat distilling context at a fixed cadence, two systems that evolved independently arrived at nearly identical cognitive maintenance rhythms — this indicates that such designs have become industry standard practice. The tension between automation and controllability that we discussed in our Claude Dispatch vs OpenClaw analysis directly echoes the observations in this article: the more background activity, the smarter the system, but the harder it becomes for users to understand what the system is doing. This trade-off permeates the design decisions behind every mechanism discussed below.
This article starts from the source code and examines four background mechanisms with the greatest engineering depth, unpacking their design decisions and implementation details.
Before diving into specific mechanisms, there is a cross-cutting engineering principle that needs to be stated upfront: maintaining the prompt cache is a hard constraint across all of Claude Code’s background activity, not an optimization point for any single feature. The Manus team’s context engineering practice summary also reported similar findings: in production environments, the input-to-output token ratio is approximately 100:1, and prompt cache hit efficiency directly determines the cost and speed of agent systems.
Every background agent in Claude Code runs in forked agent mode, and
the first design principle of a forked agent is: it must share exactly
the same cache key parameters as the parent process (system prompt,
tools, model, messages prefix, thinking config). Any parameter deviation
causes cache invalidation, with the cost being a full cold-start API
call. The source code documents a real lesson learned: PR #18143
attempted to set effort:'low' on a fork, which caused cache
hit rate to plummet from 92.7% to 61%, and cache writes to spike by 45x.
Only four overrides are safe: abortController (not sent to the API),
skipTranscript (purely client-side), skipCacheWrite (controls
cache_control markers without affecting the cache key), and canUseTool
(client-side permission check).
This constraint manifests in every background mechanism. The speculative execution forked agent shares the prompt cache. The memory extraction forked agent shares the prompt cache. The auto-dream forked agent shares the prompt cache. The session memory forked agent shares the prompt cache. All background agents are designed under the prompt cache constraint — parameters cannot be touched, models cannot be swapped, thinking config cannot be altered. This makes background activity extremely low-cost (the vast majority of tokens hit the cache), while also explaining why the behavioral space of these forks is constrained so narrowly.
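The four-override rule lends itself to a mechanical guard. A minimal sketch (the function and its shape are hypothetical; only the four override names come from the source):

```ts
// Hypothetical guard illustrating the "four safe overrides" rule:
// any other key changes the cache key and busts the prompt cache.
const SAFE_OVERRIDE_KEYS = new Set([
  "abortController", // never sent to the API
  "skipTranscript",  // purely client-side
  "skipCacheWrite",  // affects cache_control markers, not the cache key
  "canUseTool",      // client-side permission check
]);

function assertCacheSafeOverrides(overrides: Record<string, unknown>): void {
  for (const key of Object.keys(overrides)) {
    if (!SAFE_OVERRIDE_KEYS.has(key)) {
      throw new Error(`Override "${key}" would invalidate the prompt cache`);
    }
  }
}
```

With a guard like this, a mistake such as the one in PR #18143 — overriding effort on a fork — would fail fast instead of silently collapsing the cache hit rate.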
Speculative execution (Speculation) is the most aggressive design in
the entire background activity system. (Anthropic internal only,
USER_TYPE === 'ant'.) Its logic chain has three steps:
predict what command the user is about to type, hand this predicted
command to a forked agent for actual execution, and use a copy-on-write
overlay filesystem to isolate execution artifacts. If the user
ultimately accepts the prediction, the overlay merges into the main
filesystem immediately and the response returns almost instantly.
Prediction occurs after every model response completes.
promptSuggestion.ts is called fire-and-forget in the stop
hooks, forking a child agent that uses exactly the same prompt cache
parameters as the parent process to generate predictions. (Prompt
Suggestion itself is available in the public version, controlled by the
GrowthBook tengu_chomp_inflection experiment flag.
Speculative execution is a further behavior built on top of prompt
suggestion, restricted to internal users only.)
```ts
const result = await runForkedAgent({
  promptMessages: [createUserMessage({ content: prompt })],
  cacheSafeParams, // Don't override tools/thinking settings - busts cache
  canUseTool,
  querySource: 'prompt_suggestion',
  forkLabel: 'prompt_suggestion',
  overrides: { abortController },
  skipTranscript: true,
  skipCacheWrite: true,
})
```

Predicted content passes through a series of filters: length is limited to 2-12 words; evaluative statements ("thanks", "looks good"), Claude-style expressions ("Let me…", "I'll…"), multi-sentence outputs, and formatting markers are all removed. The goal of filtering is to retain only short commands the user might actually type themselves.
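The filter chain can be sketched roughly as follows (the heuristics and phrase lists are approximations for illustration, not the actual filters):

```ts
// Sketch of the suggestion filter chain: keep only short, plain,
// single-sentence commands a user might plausibly type.
function passesSuggestionFilters(text: string): boolean {
  const words = text.trim().split(/\s+/);
  if (words.length < 2 || words.length > 12) return false; // length gate: 2-12 words
  if (/^(thanks|looks good)/i.test(text)) return false;    // evaluative statements
  if (/^(let me|i'll|i will)/i.test(text)) return false;   // Claude-style openers
  if (/[.!?]\s+\S/.test(text)) return false;               // multi-sentence output
  if (/[*_`#]/.test(text)) return false;                   // formatting markers
  return true;
}
```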
Once a prediction passes filtering, speculation.ts
immediately launches actual speculative execution. The key design is the
copy-on-write overlay:
```ts
// Copy-on-write: copy original to overlay if not yet there
if (!writtenPathsRef.current.has(rel)) {
  const overlayFile = join(overlayPath, rel)
  await mkdir(dirname(overlayFile), { recursive: true })
  try {
    await copyFile(join(cwd, rel), overlayFile)
  } catch {
    // Original may not exist (new file creation) - that's fine
  }
  writtenPathsRef.current.add(rel)
}
input = { ...input, [pathKey]: join(overlayPath, rel) }
```

The overlay path is located at
~/.claude/tmp/speculation/<pid>/<uuid>/. When
the forked agent needs to write a file, the system first copies the
original file to the overlay directory, then redirects the write
operation to the copy in the overlay. Read operations perform a reverse
check: if the target file has been modified in the overlay, it reads
from the overlay; otherwise it reads directly from the main filesystem.
This achieves a fully isolated speculative execution environment.
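The read side of the overlay can be sketched as a simple path-resolution function (names hypothetical):

```ts
import { join } from "node:path";

// Sketch of the overlay read redirect: reads go to the overlay copy
// when the file was modified there, otherwise to the main filesystem.
function resolveReadPath(
  rel: string,
  cwd: string,
  overlayPath: string,
  writtenPaths: Set<string>,
): string {
  return writtenPaths.has(rel) ? join(overlayPath, rel) : join(cwd, rel);
}
```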
Speculative execution has explicit safety boundaries. Only three tools — Edit, Write, and NotebookEdit — are allowed to write (and writes are redirected to the overlay). Read-only tools like Read, Glob, and Grep are allowed through directly. Bash commands are only allowed if they pass read-only validation. When encountering file edits that require user confirmation (permission mode below acceptEdits), speculation immediately pauses and records a boundary. A maximum of 20 conversation turns and 100 messages is enforced.
When the user presses Tab to accept a prediction,
acceptSpeculation copies files from the overlay back to the
main filesystem one by one and injects the messages produced during
speculation into the official conversation stream. If speculation has
completed (boundary type is complete), the entire response is presented
instantly. If speculation paused midway due to hitting a safety
boundary, the system truncates to the last user message and initiates a
follow-up query to let the model continue from the breakpoint.
Even more impressive is pipelining. When the first round of speculation completes, the system immediately starts a second round of prediction in the gap while waiting for the user to accept:
```ts
// Pipeline: generate the next suggestion while we wait for the user to accept
void generatePipelinedSuggestion(
  contextRef.current,
  suggestionText,
  messagesRef.current,
  setAppState,
  abortController,
)
```

If the user accepts the first round of prediction, the system checks whether a pipelined suggestion is already available. If so, it promotes it to the new prediction and immediately launches the corresponding speculative execution. This forms a pre-computation chain: the instant the first step finishes predicting, the second step is already underway.
In theory, if the user accepts multiple predictions in succession, the response time for each approaches zero, because all computation happens during the user’s thinking gaps. This has crossed beyond the realm of completion into agentic autonomous workflow territory.
The design inspiration for Auto-Dream is clear: just as humans
consolidate daytime memories during sleep, Claude Code consolidates
multi-session context during periods of no interaction. (Available in
the public version, controlled by the GrowthBook
tengu_onyx_plover experiment flag. The user setting
autoDreamEnabled can override remote configuration.)
The entry point in the source code is autoDream.ts, with
trigger conditions following a three-level gating mechanism (gate order:
cheapest first):
1. Time: hours since lastConsolidatedAt >= minHours (one stat)
2. Sessions: transcript count with mtime > lastConsolidatedAt >= minSessions
3. Lock: no other process mid-consolidation
The default parameters are 24 hours and 5 sessions. This means dream
is only triggered when more than 24 hours have passed since the last
consolidation and at least 5 session transcripts have accumulated in the
interim. These parameters are remotely controlled by the GrowthBook
feature flag tengu_onyx_plover and can be adjusted online
without requiring a release.
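Put together, the cheapest-first gating might look like this sketch (interface and function names hypothetical; the 24-hour and 5-session defaults are from the source):

```ts
// Sketch of auto-dream's cheapest-first gating. Gate 1 is a single
// timestamp stat, gate 2 counts transcripts, gate 3 takes a lock.
interface DreamGates {
  hoursSinceLastConsolidation: number;
  newSessionCount: number;
  lockHeldByOther: boolean;
}

function shouldDream(
  g: DreamGates,
  minHours = 24,   // default from the source
  minSessions = 5, // default from the source
): boolean {
  if (g.hoursSinceLastConsolidation < minHours) return false; // time gate
  if (g.newSessionCount < minSessions) return false;          // session gate
  return !g.lockHeldByOther;                                  // lock gate
}
```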
Once triggered, the system forks a child agent and gives it a carefully designed consolidation prompt. The prompt divides the consolidation process into four phases: Orient (read existing memory files to understand current state), Gather (search transcripts for new signals), Consolidate (merge new information into memory files), and Prune and Index (update the index, clean up stale content).
Regarding transcript search, the prompt explicitly instructs the child agent to use grep for narrow searches:
```sh
grep -rn "<narrow term>" ${transcriptDir}/ --include="*.jsonl" | tail -50
```
The reason is that transcript files (JSONL) can be very large, and reading them in full would consume massive amounts of tokens. The prompt's guiding philosophy is "Don't exhaustively read transcripts. Look only for things you already suspect matter."
The child agent’s permissions are strictly limited: Bash only allows
read-only commands (ls, find, grep, cat, stat, wc, head, tail), and Edit
and Write can only operate on files within the memory directory. This is
implemented through the createAutoMemCanUseTool function,
with extractMemories and autoDream sharing the same permission
logic.
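A minimal sketch of what such shared permission logic could look like (the command allowlist matches the one quoted above; everything else is illustrative):

```ts
// Sketch of the shared extractMemories/autoDream permission logic:
// Bash is limited to a read-only allowlist, and file-writing tools
// may only touch paths under the memory directory.
const READ_ONLY_COMMANDS = new Set([
  "ls", "find", "grep", "cat", "stat", "wc", "head", "tail",
]);

function canUseTool(
  tool: string,
  input: { command?: string; file_path?: string },
  memoryDir: string,
): boolean {
  if (tool === "Bash") {
    const cmd = (input.command ?? "").trim().split(/\s+/)[0];
    return READ_ONLY_COMMANDS.has(cmd);
  }
  if (tool === "Edit" || tool === "Write") {
    return (input.file_path ?? "").startsWith(memoryDir);
  }
  return tool === "Read" || tool === "Glob" || tool === "Grep";
}
```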
After consolidation completes, if the child agent modified any memory files, the system inserts an "Improved N memories" system message into the main conversation stream to inform the user that memory updates occurred in the background. If the child agent fails, the system rolls back the lock file's mtime so the time gate will pass again on the next check, achieving automatic retry. A 10-minute scan throttle between retries avoids repeated scanning when the session gate is not met.
A notable engineering trade-off in this design: dream is always triggered during gaps in normal user interaction with Claude Code (checked during each stop hooks execution), rather than driven by an independent timer. This means if the user hasn’t used Claude Code for 48 consecutive hours, dream won’t execute until the next interaction begins. This design prioritizes resource efficiency at the cost of potential consolidation delay.
The trigger mechanism for Magic Docs is remarkably elegant.
(Anthropic internal only, USER_TYPE === 'ant'.) If the
first line of any Markdown file matches the pattern
# MAGIC DOC: <title>, it is automatically registered
as a document requiring continuous maintenance. Registration occurs in
the FileReadTool’s listener callback:
```ts
registerFileReadListener((filePath: string, content: string) => {
  const result = detectMagicDocHeader(content)
  if (result) {
    registerMagicDoc(filePath)
  }
})
```

In other words, the moment you read the file once, it gets tracked. From then on, every time the model finishes a response and the last assistant message in that turn contains no tool calls (indicating the conversation is at a natural idle point), the Magic Docs post-sampling hook updates all tracked documents one by one.
The update process uses the Sonnet model (not the Opus used in the main conversation), running as a constrained child agent with only Edit tool permissions, restricted to editing the corresponding Magic Doc file. The update prompt’s philosophy is worth quoting:
DOCUMENTATION PHILOSOPHY - READ CAREFULLY:
- BE TERSE. High signal only. No filler words or unnecessary elaboration.
- Documentation is for OVERVIEWS, ARCHITECTURE, and ENTRY POINTS - not detailed code walkthroughs
- Do NOT duplicate information that's already obvious from reading the source code
- Focus on: WHY things exist, HOW components connect, WHERE to start reading, WHAT patterns are used
Additionally, if a line of italic text immediately follows the Magic Doc header, it is parsed as document-specific instructions and passed to the update child agent with higher priority than the general rules. This means document authors can embed control over AI update behavior directly within the file.
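Header and instruction detection can be sketched with two regexes (illustrative only; the actual detectMagicDocHeader implementation may differ, for instance in how italics are recognized):

```ts
// Sketch of Magic Doc detection: the first line must be
// "# MAGIC DOC: <title>"; an optional italic line immediately after
// is treated as document-specific instructions.
function detectMagicDoc(
  content: string,
): { title: string; instructions?: string } | null {
  const lines = content.split("\n");
  const header = /^# MAGIC DOC: (.+)$/.exec(lines[0] ?? "");
  if (!header) return null;
  const italic = /^\*(.+)\*$/.exec((lines[1] ?? "").trim());
  return { title: header[1], instructions: italic?.[1] };
}
```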
The prompt also has a key constraint: “Keep the document CURRENT with the latest state of the codebase. This is NOT a changelog or history.” The update child agent is explicitly required to modify information in place and delete outdated content, rather than appending historical records. This ensures Magic Docs always reflect the current state of the codebase rather than devolving into unmaintained changelogs.
Extract Memories is the core write path for Claude Code’s memory
system. (Available in the public version, build flag
EXTRACT_MEMORIES is compiled into the public release,
controlled at runtime by isExtractModeActive() and the
auto-memory toggle isAutoMemoryEnabled().) At the end of
each query loop (when the model produces a final response with no tool
calls), handleStopHooks calls
executeExtractMemories in a fire-and-forget manner:
```ts
if (feature('EXTRACT_MEMORIES') && !toolUseContext.agentId && isExtractModeActive()) {
  void extractMemoriesModule!.executeExtractMemories(
    stopHookContext,
    toolUseContext.appendSystemMessage,
  )
}
```

The extraction agent runs in forked agent mode, sharing the parent
process’s full prompt cache. It only examines recently added messages
(tracked via a cursor UUID that marks where the last processing left
off), identifies information worth persisting, and writes it to the
~/.claude/projects/<path>/memory/ directory.
There is an elegant mutual exclusion design here. If the main agent has already written memory files during the conversation (user explicitly asked to “remember this”), the extraction agent skips this round:
```ts
if (hasMemoryWritesSince(messages, lastMemoryMessageUuid)) {
  logForDebugging(
    '[extractMemories] skipping — conversation already wrote to memory files',
  )
  // ...advance cursor past this range
  return
}
```

The main agent and the background extraction agent are mutually exclusive for the same conversation segment. This avoids duplicate writes and prevents two agents from producing conflicting memories about the same conversation.
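A simplified sketch of such a check (the message shape here is heavily reduced; the real hasMemoryWritesSince operates on Claude Code's actual message format):

```ts
// Simplified sketch of the mutual-exclusion check: scan messages after
// the cursor UUID for any Write/Edit tool call targeting the memory dir.
interface Msg {
  uuid: string;
  toolUse?: { name: string; input: { file_path?: string } };
}

function hasMemoryWritesSince(
  messages: Msg[],
  cursorUuid: string | null,
  memoryDir: string,
): boolean {
  const start = cursorUuid
    ? messages.findIndex((m) => m.uuid === cursorUuid) + 1
    : 0;
  return messages.slice(start).some((m) => {
    const t = m.toolUse;
    return (
      t !== undefined &&
      (t.name === "Write" || t.name === "Edit") &&
      (t.input.file_path ?? "").startsWith(memoryDir)
    );
  });
}
```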
Extraction frequency is controlled by two dimensions: token threshold
and tool call count. Extraction is only triggered when both conditions
are met. Additionally, a tengu_bramble_lintel feature flag
controls turn intervals, allowing further dilution of extraction
frequency.
The extraction agent’s prompt design emphasizes efficiency. It is limited to completing its work within 5 turns, with the recommended strategy being: read all files that need updating in parallel during the first turn, then write all modifications in parallel during the second turn. The prompt explicitly forbids exploratory behaviors like “reading code to verify whether a memory is correct”:
You MUST only use content from the last ~N messages to update your persistent memories.
Do not waste any turns attempting to investigate or verify that content further —
no grepping source files, no reading code to confirm a pattern exists, no git commands.
In non-interactive mode (-p mode or SDK),
print.ts explicitly waits for in-flight extraction to
complete after flushing the response before performing graceful
shutdown, ensuring memory extraction is not truncated by process exit.
This is implemented through drainPendingExtraction with a
60-second timeout.
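The wait-with-timeout pattern is a standard Promise.race; a sketch under the assumption that the pending extraction is exposed as a promise (names hypothetical):

```ts
// Sketch of drainPendingExtraction-style shutdown: wait for the
// in-flight extraction, but never longer than the timeout
// (60 seconds in the source).
function drainWithTimeout<T>(
  pending: Promise<T>,
  timeoutMs: number,
): Promise<T | "timeout"> {
  return Promise.race([
    pending,
    new Promise<"timeout">((resolve) =>
      setTimeout(() => resolve("timeout"), timeoutMs),
    ),
  ]);
}
```

In -p mode, shutdown would then await something like drainWithTimeout(extractionPromise, 60_000) after flushing the response.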
The four mechanisms above are the most complex background activities from an engineering standpoint. Beyond them, Claude Code runs a series of auxiliary background tasks:
Session Memory (available in the public version, controlled by
GrowthBook tengu_session_memory): Periodically maintains
summary notes for the current session based on token thresholds and tool
call counts, used by auto-compact and away summary. Only a Sonnet child
agent with Edit permissions is used, restricted to editing the
corresponding session memory file.
Auto-Compact (available in the public version, enabled by default,
autoCompactEnabled setting defaults to true): Automatically
compresses old messages when the context window approaches capacity.
The trigger threshold is effective context window − 13,000 tokens;
once token usage crosses it, compaction runs. It first attempts session
memory-based compression, falling back to legacy compaction on failure.
A circuit breaker stops retries after 3 consecutive failures, avoiding
repeated API calls when the context is irrecoverable.
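The trigger formula and circuit breaker combine into a small amount of state; a sketch (class name hypothetical; the 13,000-token margin and 3-failure limit are from the source):

```ts
// Sketch of the auto-compact gate: trigger when token usage crosses
// (effective context window - 13000), but stop retrying after three
// consecutive failures (circuit breaker).
class AutoCompactGate {
  private failures = 0;

  constructor(private contextWindow: number, private margin = 13_000) {}

  shouldCompact(usedTokens: number): boolean {
    if (this.failures >= 3) return false; // circuit open: stop retrying
    return usedTokens >= this.contextWindow - this.margin;
  }

  recordResult(ok: boolean): void {
    this.failures = ok ? 0 : this.failures + 1;
  }
}
```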
Cron Scheduler (available in the public version, build flag
AGENT_TRIGGERS compiled into the public release, GrowthBook
tengu_kairos_cron defaults to true as a remote
kill switch): A complete cron scheduling system that checks once per
second whether any scheduled tasks are due. It supports two task types:
file-driven (.claude/scheduled_tasks.json) and
session-only. When multiple Claude instances share the same working
directory, a lock mechanism ensures only one instance executes cron
tasks.
Prevent Sleep (available in the public version, macOS only, no
feature flag gating): On macOS, prevents system idle sleep via
caffeinate -i -t 300. The caffeinate process is restarted
every 4 minutes (with a 5-minute timeout), so even if the Node process
is killed by SIGKILL, the orphaned caffeinate will automatically exit
after 5 minutes. It uses a refcount model, with multiple concurrent
tasks sharing the same caffeinate process.
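The refcount model reduces to a counter guarding a shared resource; a sketch with the actual process spawning elided:

```ts
// Sketch of the prevent-sleep refcount model: the first acquire would
// start the shared caffeinate process, the last release would stop it.
class SleepGuard {
  private refs = 0;
  active = false;

  acquire(): void {
    if (this.refs++ === 0) this.active = true; // would spawn `caffeinate -i -t 300`
  }

  release(): void {
    if (--this.refs === 0) this.active = false; // would kill caffeinate
  }
}
```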
Away Summary (available in the public version, no feature flag gating): When the user returns after a period of inactivity, the system uses a small, fast model to generate a 1-3 sentence summary from the most recent 30 messages and session memory, telling the user "here's what we were working on while you were away."
Plugin Auto-Update (available in the public version, no feature flag gating): Checks for installed plugin updates in the background at startup. First refreshes marketplaces with autoUpdate enabled (git pull), then checks installed plugins from those marketplaces one by one. Updates are non-inplace (disk writes only) and require a restart to take effect. After updates complete, a callback notifies the REPL to display a restart prompt.
GrowthBook Feature Flag Refresh (available in the public version, no feature flag gating, external users refresh every 6 hours, internal users every 20 minutes): Periodically fetches feature flag configuration from remote. Nearly all background feature toggles, parameters, and behavioral variants are controlled through feature flags.
Background Housekeeping (available in the public version): The
startBackgroundHousekeeping function initializes Magic
Docs, Skill Improvement, Extract Memories, Auto-Dream, and Plugin
Auto-Update in a one-time setup at startup, and executes slow operations
like old version cleanup and message file cleanup 10 minutes after
launch (and only when the user has had no interaction in the last
minute). Long-running sessions in internal user environments also
perform periodic cleanup of npm cache and old versions every 24
hours.
Viewing these mechanisms together, Claude Code’s design philosophy is clear: the user’s thinking gaps are the most valuable compute resource. In the traditional REPL model, the gap between user input and AI response is pure waiting. Claude Code transforms this gap into a dense background scheduling window, running multiple parallel pipelines for speculative execution, memory consolidation, document maintenance, context management, and more.
Every pipeline follows the same engineering constraints: forked agent mode ensures prompt cache sharing with the parent process (at the cost of zero parameter deviation allowed), fire-and-forget invocation ensures background tasks don’t block user interaction, feature flag control ensures any new mechanism can be canary-released and tuned online, and permission sandboxing ensures background agents can only operate on resources within their scope of responsibility.
The central scheduling hub for all background activity is
handleStopHooks, executed at the end of each query loop.
The call order within this function represents the priority of
background activities: first save cache params (preparing the cache for
future forks), then launch prompt suggestion, extract memories, and
auto-dream in parallel. These fire-and-forget calls run on Node’s event
loop. When the user’s next input arrives, background tasks still in
progress are either cancelled via abort controller (like speculation) or
continue to run silently in the background until completion (like
extract memories).
From an engineering practice perspective, Claude Code is already a daemon with an autonomous lifecycle, not a REPL waiting for input. This shift from passive response to proactive computation may represent a direction in the evolution of AI-assisted programming tools: when collaborating with humans, AI systems should leverage every available idle window to pre-compute, organize, and optimize, so that when the human is ready, the AI’s response is already on its way.