Dirigent Archivist

Persistent storage for all agentic interactions in Dirigent.

Overview

The Archivist automatically archives every conversation, message, and file from your AI sessions into a local, grep-able, human-readable archive. No cloud is required: your data stays on your machine in formats you can read and search by hand.

Why Archivist?

  • Offline Access: All conversations are saved locally, accessible even when connectors are offline
  • Manual Curation: Files are in plain JSON/NDJSON/TSV - grep, edit, or analyze them with any tool
  • Knowledge Base: Build a searchable archive of all your AI interactions across projects
  • Session Lineage: Track conversation branches, splits, and continuations
  • File Deduplication: Attachments are stored once, referenced multiple times (content-addressable)
  • Archive-First: The UI reads from the local archive first and falls back to remote connectors only when needed

Quick Start

The Archivist runs automatically when you start Dirigent. The archive location is determined by the DIRIGENT_DATA_DIR environment variable (archives are stored at <data_dir>/archives/):

# Override data directory (archives at /path/to/data/archives/)
DIRIGENT_DATA_DIR=/path/to/data dx serve

That's it! Every session and message will be automatically archived.
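
The path rule is small enough to sketch in Rust. The helper names below are hypothetical and not part of the dirigent_archivist API. The sketch only joins archives/ onto a data directory and does not model Dirigent's built-in default location, which this README does not specify.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Archives live in the `archives/` subdirectory of the data directory.
/// (Hypothetical helper, for illustration only.)
fn archives_dir(data_dir: &Path) -> PathBuf {
    data_dir.join("archives")
}

/// Returns the archive location implied by DIRIGENT_DATA_DIR, or `None`
/// when the variable is unset. The fallback Dirigent uses in that case
/// is intentionally not modelled here.
fn archives_dir_from_env() -> Option<PathBuf> {
    env::var_os("DIRIGENT_DATA_DIR").map(|dir| archives_dir(Path::new(&dir)))
}
```

With DIRIGENT_DATA_DIR=/path/to/data, archives_dir_from_env() yields /path/to/data/archives.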

Archive Structure

Each archive directory is organized like this (the root is shown here as dirigent_archive/):

dirigent_archive/
├── .contexts/                       # Session data
│   └── 01936e8f-e5a7-7000-8000.../
│       ├── session.json             # Session metadata
│       └── messages.ndjson          # All messages (one JSON per line)
├── .db/
│   └── connectors/                  # Connector registry
│       ├── index.tsv                # Fast lookup table
│       └── 01936e8f-e5a7.../
│           ├── connector.json       # Connector info
│           └── sessions.ndjson      # Session ID mappings
└── .files/                          # Attachments (by SHA-256)
    └── a1b2c3d4...                  # Content-addressable storage
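
If you script against this layout, the paths can be built with a few helpers. This is a minimal sketch with hypothetical function names. It assumes attachments sit directly under .files/ named by their hex digest, as the tree suggests, which may not match how the backend actually shards them.

```rust
use std::path::{Path, PathBuf};

/// Directory holding one session's files: `<root>/.contexts/<scroll_id>/`.
fn session_dir(archive_root: &Path, scroll_id: &str) -> PathBuf {
    archive_root.join(".contexts").join(scroll_id)
}

/// Path of the session metadata file.
fn session_metadata_path(archive_root: &Path, scroll_id: &str) -> PathBuf {
    session_dir(archive_root, scroll_id).join("session.json")
}

/// Path of the append-only message log.
fn messages_path(archive_root: &Path, scroll_id: &str) -> PathBuf {
    session_dir(archive_root, scroll_id).join("messages.ndjson")
}

/// Path of an attachment, assuming a flat `.files/<sha256 hex>` layout.
fn blob_path(archive_root: &Path, sha256_hex: &str) -> PathBuf {
    archive_root.join(".files").join(sha256_hex)
}
```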

Why Hidden Directories?

The .contexts, .db, and .files directories start with . to keep them internal (like .git). In the future, you'll be able to export rendered markdown files into the archive root for easy reading.

File Formats

Session Metadata (.contexts/{id}/session.json)

{
  "version": 1,
  "scroll_id": "01936e8f-e5a7-7000-8000-000000000001",
  "created_at": "2025-01-01T12:00:00Z",
  "updated_at": "2025-01-01T12:30:00Z",
  "title": "Implement user authentication",
  "connector_uid": "01936e8f-e5a7-7000-8000-000000000002",
  "tags": ["backend", "auth"],
  "metadata": {
    "source": "OpenCode",
    "model": "claude-3-5-sonnet"
  }
}

Messages (.contexts/{id}/messages.ndjson)

Newline-delimited JSON - one message per line, append-only:

{"version":1,"message_id":"...","session":"...","role":"user","ts":"2025-01-01T12:01:00Z","content_md":"How do I implement JWT auth?","attachments":[],"metadata":{}}
{"version":1,"message_id":"...","session":"...","role":"assistant","ts":"2025-01-01T12:01:10Z","content_md":"Here's how to implement JWT authentication...","attachments":[],"metadata":{"model":"claude-3-5-sonnet"}}

IMPORTANT: Messages are appended in the order their events arrive, which is NOT necessarily chronological order. Because of streaming latency, an assistant reply is often written after a later user message. When reading programmatically, use the Archivist API, which sorts by timestamp (ts) to return the correct order. For manual inspection, sort by the ts field yourself.
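
To make the sort-on-read idea concrete, here is a minimal sketch over already-parsed messages. The Message struct is a simplified stand-in, not the crate's real type, and the sketch assumes every ts uses the same UTC format and precision as the examples above, so that comparing the strings matches comparing the times.

```rust
/// Simplified stand-in for an archived message (the real record has more fields).
#[derive(Debug, Clone, PartialEq)]
struct Message {
    ts: String, // RFC 3339 UTC timestamp, e.g. "2025-01-01T12:01:00Z"
    role: String,
    content_md: String,
}

impl Message {
    fn new(ts: &str, role: &str, content_md: &str) -> Self {
        Message {
            ts: ts.to_string(),
            role: role.to_string(),
            content_md: content_md.to_string(),
        }
    }
}

/// Returns the messages in chronological order.
/// `sort_by` is a stable sort, so messages with identical timestamps
/// keep the order in which they were appended to the file.
fn sort_on_read(mut messages: Vec<Message>) -> Vec<Message> {
    messages.sort_by(|a, b| a.ts.cmp(&b.ts));
    messages
}
```

If timestamps ever carry different offsets or fractional-second precision, parse them into a real time type before comparing instead of relying on string order.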

Connector Index (.db/connectors/index.tsv)

Tab-separated values for fast scanning:

connector_uid	type	title	client_native_id	alias_of	created_at
01936e8f...	OpenCode	OpenCode Local	opencode@http://localhost:12225		2025-01-01T12:00:00Z
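
Reading the index from code is a matter of splitting on tabs. The sketch below uses a hypothetical ConnectorRow struct and assumes the first line is always the header, every row has exactly six columns, and values never contain tabs. The README does not say how the backend escapes such values.

```rust
/// One row of index.tsv (hypothetical struct; columns follow the header above).
#[derive(Debug, PartialEq)]
struct ConnectorRow {
    connector_uid: String,
    connector_type: String,
    title: String,
    client_native_id: String,
    alias_of: Option<String>, // an empty column means "not an alias"
    created_at: String,
}

/// Parses a single data row; returns `None` if the column count is wrong.
fn parse_index_row(line: &str) -> Option<ConnectorRow> {
    let fields: Vec<&str> = line.split('\t').collect();
    if fields.len() != 6 {
        return None;
    }
    Some(ConnectorRow {
        connector_uid: fields[0].to_string(),
        connector_type: fields[1].to_string(),
        title: fields[2].to_string(),
        client_native_id: fields[3].to_string(),
        alias_of: if fields[4].is_empty() {
            None
        } else {
            Some(fields[4].to_string())
        },
        created_at: fields[5].to_string(),
    })
}

/// Parses the whole file, skipping the header line and malformed rows.
fn parse_index(tsv: &str) -> Vec<ConnectorRow> {
    tsv.lines().skip(1).filter_map(parse_index_row).collect()
}
```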

Searching Your Archive

Since everything is plain text, you can use standard Unix tools:

# Find all sessions about "authentication"
grep -r "authentication" dirigent_archive/.contexts/*/session.json

# Find messages mentioning a specific error
grep "ECONNREFUSED" dirigent_archive/.contexts/*/messages.ndjson

# List all sessions for a connector
cat dirigent_archive/.db/connectors/*/sessions.ndjson | jq .

# Get all user messages from a session (sorted by timestamp)
cat dirigent_archive/.contexts/01936e8f.../messages.ndjson | jq -s 'sort_by(.ts) | .[] | select(.role=="user")'

# View messages in chronological order
cat dirigent_archive/.contexts/01936e8f.../messages.ndjson | jq -s 'sort_by(.ts)'

Note on ordering: Lines in messages.ndjson appear in event-arrival order because the file is append-only. Always sort by ts (timestamp) when reading manually to see messages in chronological order.

Integration with Dirigent

The Archivist integrates seamlessly with Dirigent's core runtime:

  1. Automatic Archiving: Every session and message is archived in real-time as events arrive
  2. Event-Driven: No polling - listens to dirigent_core's event stream
  3. Append-Only Writes: Messages written as completion events arrive (preserves durability)
  4. Sort-on-Read: API returns messages in chronological order despite append-only file order
  5. UI Integration: Web UI reads from archive first, shows data even when connectors are offline
  6. Connector Coordination: Assigns stable UUIDs to connectors with collision detection
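
The append-only write in step 3 can be sketched with the standard library alone. This is an illustration of the technique, not the JsonlBackend implementation: append_record is a hypothetical helper, and it takes an already-serialized JSON string.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Appends one pre-serialized JSON record as a single NDJSON line.
/// Existing content is never rewritten; the file is created if missing.
fn append_record(path: &Path, json_line: &str) -> io::Result<()> {
    // NDJSON needs exactly one record per line, so reject embedded newlines.
    if json_line.contains('\n') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "record must be a single line",
        ));
    }
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    // Write the record and its terminator in one call to keep the line intact.
    file.write_all(format!("{json_line}\n").as_bytes())?;
    Ok(())
}
```

Because writers only ever add lines, a reader can sort by ts afterwards without the writer having to reorder anything on disk.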

Key Concepts

Scroll IDs

Every session gets a unique scroll_id (UUIDv7) that's independent of the connector's native session ID. This allows:

  • Sessions to move between connectors
  • Stable references even if connector data is deleted
  • Time-ordered sorting (UUIDv7 encodes timestamp)
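
A UUIDv7 stores Unix-epoch milliseconds in its first 48 bits, which is why scroll IDs sort by creation time. The sketch below pulls that value out of the textual form. The function name is hypothetical, and the crate may expose this differently or not at all.

```rust
/// Extracts the Unix-epoch millisecond timestamp from a UUIDv7 string.
/// Returns `None` if the input is not 32 hex digits or is not version 7.
fn uuid_v7_unix_millis(uuid: &str) -> Option<u64> {
    // Drop the hyphens so the ID becomes a plain run of hex digits.
    let hex: String = uuid.chars().filter(|c| *c != '-').collect();
    if hex.len() != 32 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    // The 13th hex digit carries the UUID version.
    if hex.as_bytes()[12] != b'7' {
        return None;
    }
    // The first 12 hex digits (48 bits) are the timestamp in milliseconds.
    u64::from_str_radix(&hex[..12], 16).ok()
}
```

Since the timestamp occupies the most significant bits, sorting scroll IDs as plain strings gives the same order as sorting by this value, up to ties within one millisecond.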

Session Lineage

Sessions can have parent sessions, creating a tree of related conversations:

  • Split: Fork conversation at a specific message
  • Compact: Summarized version of parent
  • Reference: Points to parent without duplication
  • Edit: Modified version of parent
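
One way to picture the lineage tree is a map from scroll ID to an optional parent link. The types below are hypothetical: the Continuation variant names mirror the four relationship kinds listed above, and the crate's real data model may differ (lineage management is still listed as future work).

```rust
use std::collections::HashMap;

/// How a session relates to its parent (names mirror the list above).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Continuation {
    Split,
    Compact,
    Reference,
    Edit,
}

/// Hypothetical in-memory view of a session's lineage link.
struct SessionNode {
    parent: Option<(String, Continuation)>,
}

/// Walks from `start` up to the root and returns the scroll IDs visited,
/// starting with `start` itself. Stops early if a cycle is detected.
fn lineage(sessions: &HashMap<String, SessionNode>, start: &str) -> Vec<String> {
    let mut chain: Vec<String> = Vec::new();
    let mut current = Some(start.to_string());
    while let Some(id) = current {
        if chain.contains(&id) {
            break; // guard against malformed, cyclic parent links
        }
        current = sessions
            .get(&id)
            .and_then(|node| node.parent.as_ref())
            .map(|(parent_id, _kind)| parent_id.clone());
        chain.push(id);
    }
    chain
}
```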

Content-Addressable Storage

Files are stored by their SHA-256 hash, so:

  • The same file uploaded twice is stored only once
  • Files can be shared across sessions without duplication
  • You can verify file integrity by hash
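
The deduplication logic is easy to see in a small in-memory model. Note the loud assumption: Rust's standard library has no SHA-256, so this sketch swaps in a 64-bit DefaultHasher digest as a stand-in. It shows the store-by-digest pattern only and is not suitable for real integrity checks. FileStore and toy_digest are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Stand-in digest. NOT SHA-256: a real store would hash with SHA-256.
fn toy_digest(bytes: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    format!("toy:{:016x}", hasher.finish())
}

/// In-memory model of a content-addressable store keyed by digest.
struct FileStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl FileStore {
    fn new() -> Self {
        FileStore { blobs: HashMap::new() }
    }

    /// Stores the bytes under their digest; identical content is kept once.
    fn store(&mut self, bytes: &[u8]) -> String {
        let file_id = toy_digest(bytes);
        self.blobs
            .entry(file_id.clone())
            .or_insert_with(|| bytes.to_vec());
        file_id
    }

    /// Re-hashes the stored bytes and checks they still match their ID.
    fn verify(&self, file_id: &str) -> bool {
        self.blobs
            .get(file_id)
            .map_or(false, |bytes| toy_digest(bytes) == file_id)
    }

    fn len(&self) -> usize {
        self.blobs.len()
    }
}
```

Storing the same bytes twice returns the same ID and leaves the store with a single entry, which is the property the archive relies on for attachments shared across sessions.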

Configuration

Environment Variables

  • DIRIGENT_DATA_DIR: Override data directory; archives are stored at <data_dir>/archives/

Example Configurations

# Use custom data directory (archives at /home/user/mydata/archives/)
DIRIGENT_DATA_DIR=/home/user/mydata dx serve

# Use global data directory
DIRIGENT_DATA_DIR=/home/user/.dirigent dx serve

# Use temporary data directory (testing)
DIRIGENT_DATA_DIR=/tmp/dirigent_test dx serve

Programmatic Access

While the Archivist runs automatically, you can also use it programmatically:

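use dirigent_archivist::Archivist;
use std::path::PathBuf;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create an archivist over a single archive directory.
    // Internally this wires up a `JsonlBackend` for the archive.
    let archivist = Archivist::new_with_single_archive(
        PathBuf::from("./dirigent_archive")
    ).await?;

    // Placeholder connector UID; real values are listed in .db/connectors/index.tsv.
    // Assumption: the UID type implements `FromStr`. Adjust this to the crate's actual type.
    let connector_uid = "01936e8f-e5a7-7000-8000-000000000002".parse()?;

    // List sessions for a connector
    let sessions = archivist.list_sessions(connector_uid).await?;

    for session in sessions {
        // Sessions without a title print an empty string.
        println!("{}: {}", session.scroll_id, session.title.unwrap_or_default());
    }

    Ok(())
}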

Archivist is a concrete struct that owns a registry of ArchiveBackend implementations keyed by archive name. In Phase 2 the only backend is JsonlBackend (file-based NDJSON/JSON/TSV). See examples/ for more detailed usage.

Performance

The Archivist is designed for human-scale workloads (thousands of sessions, millions of messages):

  • Fast Writes: Appending a message to an NDJSON file is O(1), independent of how many messages the file already holds
  • Cached Reads: Common lookups cached in memory
  • Grep-able: TSV indices can be scanned in milliseconds
  • Incremental: Only new messages are written, no full re-writes

Scalability Notes

  • Large sessions (1000+ messages) may take a few seconds to load
  • TSV indices are suitable for 100-1000 connectors
  • File deduplication saves space for repeated attachments

Querying and Curation

Future: Knowledge Overviews

The Archivist is designed to support knowledge curation workflows:

  • Export sessions as clean markdown files
  • Create summaries and overviews across sessions
  • Tag and categorize conversations
  • Build a personal knowledge base

These features are planned for future releases.

Current: Manual Curation

For now, you can manually curate your archive:

  • Edit session.json to add tags
  • Grep through messages for specific topics
  • Copy/organize sessions into project folders
  • Use jq/awk/sed to extract insights

Advanced Features

Session Splitting

Create a new conversation branch from any point in history:

// Future API (not yet implemented)
let new_session = archivist.split_session(
    session_id,
    at_message_id,
    Continuation::Split
).await?;

Attachment Storage

Files are automatically deduplicated using SHA-256:

// Store file (content-addressable)
let file_id = archivist.store_file(
    &file_data,
    "spec.pdf",
    Some("application/pdf")
).await?;

// Reference in message
let attachment = AttachmentRef {
    file_id,  // "sha256:abc123..."
    name: "spec.pdf".to_string(),
    mime_type: Some("application/pdf".to_string()),
};

Multi-Archive Support

Archivist natively manages multiple named archives via an on-disk registry. Each archive is backed by its own ArchiveBackend (currently JsonlBackend) and selected per call via an optional archive argument. This enables:

  • Separate archives per project
  • A default archive plus specialized side archives
  • Moving or copying sessions between archives

Future backends (e.g. SQLite, indexed, remote) will plug into the same trait layer without changing the coordinator API.
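
The per-call archive selection can be pictured as a name-to-backend map with a default. The sketch below is a simplified model under stated assumptions: the trait and struct reuse the names ArchiveBackend and JsonlBackend from this README, but their methods, the Registry type, and the resolve function are hypothetical and much smaller than the real trait layer.

```rust
use std::collections::HashMap;

/// Heavily simplified stand-in for the backend trait.
trait ArchiveBackend {
    fn root(&self) -> &str;
}

/// Stand-in for the file-based backend; only remembers its directory.
struct JsonlBackend {
    root: String,
}

impl ArchiveBackend for JsonlBackend {
    fn root(&self) -> &str {
        &self.root
    }
}

/// Hypothetical registry: named backends plus the name of the default one.
struct Registry {
    default_archive: String,
    backends: HashMap<String, Box<dyn ArchiveBackend>>,
}

impl Registry {
    fn new(default_archive: &str) -> Self {
        Registry {
            default_archive: default_archive.to_string(),
            backends: HashMap::new(),
        }
    }

    fn register(&mut self, name: &str, backend: Box<dyn ArchiveBackend>) {
        self.backends.insert(name.to_string(), backend);
    }

    /// `None` selects the default archive; `Some(name)` selects a named one.
    fn resolve(&self, archive: Option<&str>) -> Option<&dyn ArchiveBackend> {
        let name = archive.unwrap_or(self.default_archive.as_str());
        // `&**backend` turns `&Box<dyn ArchiveBackend>` into `&dyn ArchiveBackend`.
        self.backends.get(name).map(|backend| &**backend)
    }
}
```

Keeping callers on the trait object is what lets a different backend replace JsonlBackend for one archive without touching code that only calls resolve.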

Troubleshooting

Archive Not Created

If the archive directory doesn't appear:

  1. Check that DIRIGENT_DATA_DIR is set correctly (or that the default data directory is writable)
  2. Ensure you have write permissions on the parent directory
  3. Check logs for I/O errors

Missing Sessions

If sessions don't appear in the archive:

  1. Verify EventHandler is running
  2. Check for event subscription errors in logs
  3. Ensure connector emits SessionCreated events

Large Archive Size

If the archive grows too large:

  1. Check for duplicate files in .files/
  2. Consider archiving old sessions separately
  3. Future: Use compaction features (not yet implemented)

Development Status

Current (Phase 2 complete):

  • Automatic archival of sessions and messages
  • Event-driven integration with dirigent_core
  • File-based storage with NDJSON/JSON/TSV (JsonlBackend)
  • Content-addressable file storage
  • Multi-archive coordinator with per-archive backends
  • Trait-based backend abstraction (ArchiveBackend + sub-traits)

Future:

  • Indexed SearchBackend implementations (full-text search)
  • Additional concrete backends (SQLite, remote)
  • Session splitting and lineage management
  • Knowledge overview generation
  • Network RPC interface

Documentation

  • Developer Guide: CLAUDE.md - Package architecture and implementation details
  • Architecture: docs/building/05_archivist/vision.md - Design rationale
  • API Docs: cargo doc --package dirigent_archivist --open
  • Examples: See examples/ for working code

Contributing

The Archivist is part of the Dirigent project. See the main repository for contribution guidelines.

License

Part of the Dirigent project.