Context Pruning

Context pruning is a model-directed mechanism for surgically removing irrelevant content from the working context during a run. Unlike compaction (which is threshold-triggered and bulk), pruning gives the model fine-grained control over what stays in the context window.

Philosophy

Context pruning saves context length (tokens in the context window), not monetary cost. The token cost of a pruned message has already been paid -- pruning cannot reclaim it. What pruning reclaims is space: room in the context window for the model to continue working without hitting the context limit.

Think of it as a researcher working through a stack of papers. The researcher freely explores tangential references, reads through lengthy tool outputs, and investigates dead ends. When a line of inquiry turns out to be irrelevant, the researcher sets those papers aside rather than keeping them on the desk. The desk has limited space; the filing cabinet does not. Pruning moves content from desk to cabinet.

This freedom to explore without anxiety about context length is the core value proposition. The model can request verbose tool outputs, try multiple approaches, and investigate broadly -- knowing it can prune the dead ends and keep only what matters for the current task.

Static vs In-Run Context

Every message in the context belongs to one of two streams:

  • user_context -- All User messages: the initial prompt, follow-ups, steering messages, and any user-injected content. These represent user intent.
  • inrun_context -- All model-generated content: Assistant messages, ToolCall content, and ToolResult messages. These are the model's working memory.

The system_prompt is separate from both streams. It is a dedicated field on AgentContext, always occupies the first position, and is never subject to pruning or compaction.

Pruning Rules

  • user_context is NEVER pruned. User intent is sacred. The model cannot discard the user's words, steering messages, or follow-up instructions.
  • inrun_context CAN be surgically pruned by the model using the PrunTool. The model decides what is no longer relevant and removes it.
  • system_prompt is never pruned. It is not part of either stream.

PrunTool Variants

phi-core provides two pruning operations, both invoked by the model as tool calls:

prun(tokens)

Silent removal. The model specifies a token budget to reclaim, and the oldest inrun_context entries (by timestamp) are removed from the working context until the budget is met.

  • Removed content is preserved in the session log -- nothing is lost permanently
  • Removed entries become invisible to the LLM on subsequent turns
  • The model sees a ToolResult confirming how many tokens were reclaimed

prun_with_memo(tokens, memo)

Removal with summary. Same as prun, but the model provides a concise memo string that replaces the pruned content in the working context.

  • Each pruned message with a memo creates a separate PrunedMemo entry at its original timestamp, preserving chronological order
  • Useful when the pruned content contained decisions or conclusions the model wants to remember
  • The memo should be concise -- a few sentences, not a reproduction of the pruned content

Model Autonomy

The model decides which variant to use and when. Typical patterns:

  • Silent prune after exploring a dead end (e.g., reading a file that turned out to be irrelevant)
  • Memo prune after a productive investigation (e.g., "Investigated auth module: uses JWT with RS256, tokens expire after 1h, refresh handled in middleware")
  • No prune when all context remains relevant to the current task

Working Context Rebuild

Each turn, the working context sent to the LLM is rebuilt from scratch by build_working_context(), merging the two streams:

  1. Collect all user_context entries with their timestamps
  2. Collect all live inrun_context entries with their timestamps
  3. For each PrunedMemo entry, create a separate User message with the memo text at the entry's original timestamp
  4. Sort all collected entries by timestamp to preserve chronological order
  5. Prepend the system_prompt

The result is a coherent conversation history where:

  • User messages are always present
  • Pruned-silent entries are invisible (the conversation flows as if they never existed)
  • Each pruned-with-memo entry appears as a separate brief summary message at its original timestamp position, preserving the chronological position of the message it replaced

Session Log Integrity

The session log (context.messages) records everything that happened during the run. Pruning never modifies the session log -- it only affects what the LLM sees in the working context.

PrunRecord

Each prune operation emits a PrunApplied event (recorded in LoopRecord.events by SessionRecorder) containing:

  • pruned_timestamps -- Vec<u64> of timestamps identifying the pruned messages
  • tokens_removed -- Total tokens reclaimed
  • messages_removed -- Number of messages pruned
  • memo -- Optional summary string (present only for prun_with_memo)

On session reload, the two context streams are reconstructed by walking LoopRecord.events to find PrunApplied events. The pruned_timestamps field identifies which messages were pruned. These messages are placed in the pruned state (PrunedSilent or PrunedMemo depending on whether memo is Some), and their memo (if any) is restored as a separate message at the correct chronological position.

Compaction Interaction

Pruning and compaction are complementary mechanisms that operate at different levels:

PruningCompaction
TriggerModel-directed (tool call)Threshold-triggered (automatic)
GranularitySurgical (specific messages)Bulk (entire middle section)
Scopeinrun_context onlyAll messages in the compaction window
Preserved inSession log + PrunRecordCompactionBlock overlay

After Compaction

When compaction fires, it summarizes a range of messages into a CompactionBlock. After compaction:

  • All surviving messages (the summary, kept-first, and kept-recent) become part of user_context -- they are treated as established context and are unprunable
  • New model-generated content after compaction starts a fresh inrun_context stream
  • The model can prune this new inrun_context as usual

This means compaction resets the pruning boundary. Content that was once prunable inrun_context, if it survives compaction, becomes permanent user_context.

Configuration

TOML

[tools]
enabled = ["bash", "read_file", "write_file", "edit_file", "search", "prun"]

Adding "prun" to the enabled tools list makes both prun and prun_with_memo available to the model. They are two operations exposed through a single tool registration.

Rust (Programmatic)

#![allow(unused)]
fn main() {
use phi_core::agents::BasicAgent;
use phi_core::provider::ModelConfig;

let agent = BasicAgent::new(ModelConfig::anthropic("claude-sonnet-4-20250514", "Sonnet", &api_key))
    .with_default_tools()
    .with_prun_tool();  // enables context pruning
}

The with_prun_tool() builder method registers the PrunTool, making both pruning variants available. It can be combined with any other tool configuration.

Context pruning works best when compaction is also configured, providing both surgical (model-directed) and bulk (automatic) context management:

[tools]
enabled = ["bash", "read_file", "write_file", "edit_file", "search", "prun"]

[compaction]
max_context_tokens = 200000
compact_at_pct = 0.85
keep_first_turns = 2
keep_recent_turns = 4

With this setup, the model can prune irrelevant exploration results as it works, and compaction provides a safety net if the context still grows too large.