dirigence/dirigent

g4borg b03dc15371 sync from monorepo @ 2452e92e

2026-05-08 01:59:04 +02:00

Dirigent Archivist

Persistent storage for all agentic interactions in Dirigent.

Overview

The Archivist automatically archives every conversation, message, and file from your AI sessions into a local, grep-able, human-readable archive. No cloud required - your data stays on your machine in formats you can read and search manually.

Why Archivist?

Offline Access: All conversations are saved locally, accessible even when connectors are offline
Manual Curation: Files are in plain JSON/NDJSON/TSV - grep, edit, or analyze them with any tool
Knowledge Base: Build a searchable archive of all your AI interactions across projects
Session Lineage: Track conversation branches, splits, and continuations
File Deduplication: Attachments are stored once, referenced multiple times (content-addressable)
Archive-First: UI reads from local archive first, only falls back to remote connectors when needed

Quick Start

The Archivist runs automatically when you start Dirigent. The archive location is determined by the DIRIGENT_DATA_DIR environment variable (archives are stored at <data_dir>/archives/):

# Override data directory (archives at /path/to/data/archives/)
DIRIGENT_DATA_DIR=/path/to/data dx serve

That's it! Every session and message will be automatically archived.
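The path rule above (archives live under `<data_dir>/archives/`) can be sketched in a few lines of Rust. This is only an illustration: `archives_dir` is a hypothetical helper, not part of the Archivist API, and the default data directory used when the variable is unset is not covered here.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical helper: archives live in the `archives/` subdirectory
/// of the data directory.
fn archives_dir(data_dir: &Path) -> PathBuf {
    data_dir.join("archives")
}

fn main() {
    // Only the override case is shown. What Dirigent uses when the
    // variable is unset is not described here, so it is left unresolved.
    match env::var_os("DIRIGENT_DATA_DIR") {
        Some(dir) => println!("{}", archives_dir(Path::new(&dir)).display()),
        None => println!("DIRIGENT_DATA_DIR is not set"),
    }
}
```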

Archive Structure

Your archive is organized like this:

dirigent_archive/
├── .contexts/                       # Session data
│   └── 01936e8f-e5a7-7000-8000.../
│       ├── session.json             # Session metadata
│       └── messages.ndjson          # All messages (one JSON per line)
├── .db/
│   └── connectors/                  # Connector registry
│       ├── index.tsv                # Fast lookup table
│       └── 01936e8f-e5a7.../
│           ├── connector.json       # Connector info
│           └── sessions.ndjson      # Session ID mappings
└── .files/                          # Attachments (by SHA-256)
    └── a1b2c3d4...                  # Content-addressable storage

Why Hidden Directories?

The .contexts, .db, and .files directories start with . to keep them internal (like .git). In the future, you'll be able to export rendered markdown files into the archive root for easy reading.

File Formats

Session Metadata (`.contexts/{id}/session.json`)

{
  "version": 1,
  "scroll_id": "01936e8f-e5a7-7000-8000-000000000001",
  "created_at": "2025-01-01T12:00:00Z",
  "updated_at": "2025-01-01T12:30:00Z",
  "title": "Implement user authentication",
  "connector_uid": "01936e8f-e5a7-7000-8000-000000000002",
  "tags": ["backend", "auth"],
  "metadata": {
    "source": "OpenCode",
    "model": "claude-3-5-sonnet"
  }
}

Messages (`.contexts/{id}/messages.ndjson`)

Newline-delimited JSON - one message per line, append-only:

{"version":1,"message_id":"...","session":"...","role":"user","ts":"2025-01-01T12:01:00Z","content_md":"How do I implement JWT auth?","attachments":[],"metadata":{}}
{"version":1,"message_id":"...","session":"...","role":"assistant","ts":"2025-01-01T12:01:10Z","content_md":"Here's how to implement JWT authentication...","attachments":[],"metadata":{"model":"claude-3-5-sonnet"}}

IMPORTANT: Messages are appended in the order their events arrive, which is not guaranteed to be chronological. An assistant reply can appear after a later user message because of streaming latency. When reading programmatically, use the Archivist API, which sorts by timestamp (ts). For manual inspection, sort by the ts field yourself.
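If you read the file without the Archivist API, the sort-on-read step might look like the sketch below. It assumes the lines have already been parsed into a struct; `Message` and `sort_chronologically` are illustrative names, not the crate's types. It also assumes every ts uses the same UTC format shown above, which is what lets plain string comparison stand in for timestamp comparison.

```rust
/// Illustrative stand-in for one parsed line of `messages.ndjson`.
#[derive(Debug, Clone, PartialEq)]
struct Message {
    ts: String,
    role: String,
    content_md: String,
}

/// Sort messages by timestamp. Timestamps in one consistent RFC 3339 UTC
/// format compare chronologically as plain strings. The sort is stable,
/// so messages with equal `ts` keep their arrival order.
fn sort_chronologically(messages: &mut [Message]) {
    messages.sort_by(|a, b| a.ts.cmp(&b.ts));
}

fn msg(ts: &str, role: &str, content_md: &str) -> Message {
    Message {
        ts: ts.to_string(),
        role: role.to_string(),
        content_md: content_md.to_string(),
    }
}

fn main() {
    // Arrival order: the assistant reply landed after a later user message.
    let mut messages = vec![
        msg("2025-01-01T12:01:00Z", "user", "How do I implement JWT auth?"),
        msg("2025-01-01T12:02:00Z", "user", "Also, what about refresh tokens?"),
        msg("2025-01-01T12:01:10Z", "assistant", "Here's how to implement JWT authentication..."),
    ];
    sort_chronologically(&mut messages);
    for m in &messages {
        println!("{} [{}] {}", m.ts, m.role, m.content_md);
    }
}
```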

Connector Index (`.db/connectors/index.tsv`)

Tab-separated values for fast scanning:

connector_uid	type	title	client_native_id	alias_of	created_at
01936e8f...	OpenCode	OpenCode Local	opencode@http://localhost:12225		2025-01-01T12:00:00Z
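A minimal sketch of reading this index with the standard library alone. It assumes the first line is the header, that fields never contain tabs, and that an empty alias_of column means "no alias". `ConnectorRow`, `parse_row`, and `parse_index` are hypothetical names for illustration.

```rust
/// Illustrative row type mirroring the six columns of `index.tsv`.
#[derive(Debug, PartialEq)]
struct ConnectorRow {
    connector_uid: String,
    kind: String, // the `type` column (`type` is a Rust keyword)
    title: String,
    client_native_id: String,
    alias_of: Option<String>,
    created_at: String,
}

/// Parse one data line; returns None if the column count is wrong.
fn parse_row(line: &str) -> Option<ConnectorRow> {
    let cols: Vec<&str> = line.split('\t').collect();
    if cols.len() != 6 {
        return None;
    }
    Some(ConnectorRow {
        connector_uid: cols[0].to_string(),
        kind: cols[1].to_string(),
        title: cols[2].to_string(),
        client_native_id: cols[3].to_string(),
        alias_of: if cols[4].is_empty() { None } else { Some(cols[4].to_string()) },
        created_at: cols[5].to_string(),
    })
}

/// Parse the whole file, skipping the header and any malformed lines.
fn parse_index(tsv: &str) -> Vec<ConnectorRow> {
    tsv.lines().skip(1).filter_map(parse_row).collect()
}

fn main() {
    let tsv = "connector_uid\ttype\ttitle\tclient_native_id\talias_of\tcreated_at\n\
               01936e8f...\tOpenCode\tOpenCode Local\topencode@http://localhost:12225\t\t2025-01-01T12:00:00Z\n";
    for row in parse_index(tsv) {
        println!("{} ({}) -> {}", row.title, row.kind, row.client_native_id);
    }
}
```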

Searching Your Archive

Since everything is plain text, you can use standard Unix tools:

# Find all sessions about "authentication" (case-insensitive, prints matching files)
grep -il "authentication" dirigent_archive/.contexts/*/session.json

# Find messages mentioning a specific error
grep "ECONNREFUSED" dirigent_archive/.contexts/*/messages.ndjson

# List session ID mappings for all connectors
jq . dirigent_archive/.db/connectors/*/sessions.ndjson

# Get all user messages from a session (sorted by timestamp)
jq -s 'sort_by(.ts) | .[] | select(.role=="user")' dirigent_archive/.contexts/01936e8f.../messages.ndjson

# View messages in chronological order
jq -s 'sort_by(.ts)' dirigent_archive/.contexts/01936e8f.../messages.ndjson

Note on ordering: Remember that the file order is append-only (event arrival order). Always sort by ts (timestamp) when reading manually to see messages in chronological order.

Integration with Dirigent

The Archivist integrates seamlessly with Dirigent's core runtime:

Automatic Archiving: Every session and message is archived in real-time as events arrive
Event-Driven: No polling - listens to dirigent_core's event stream
Append-Only Writes: Messages written as completion events arrive (preserves durability)
Sort-on-Read: API returns messages in chronological order despite append-only file order
UI Integration: Web UI reads from archive first, shows data even when connectors are offline
Connector Coordination: Assigns stable UUIDs to connectors with collision detection
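The archive-first read path in the list above can be pictured as a simple fallback: try the local archive, and only ask the connector if nothing is found. The sketch below is an assumption about the shape of that logic, not the actual UI code. `read_archive_first` and `Source` are hypothetical names, and the two closures stand in for whatever lookups the real runtime performs.

```rust
/// Where a result came from (illustrative only).
#[derive(Debug, PartialEq)]
enum Source {
    Archive,
    Connector,
}

/// Try the local archive first; call the connector lookup only on a miss.
fn read_archive_first<T>(
    from_archive: impl FnOnce() -> Option<T>,
    from_connector: impl FnOnce() -> Option<T>,
) -> Option<(T, Source)> {
    from_archive()
        .map(|value| (value, Source::Archive))
        .or_else(|| from_connector().map(|value| (value, Source::Connector)))
}

fn main() {
    // Archive hit: the connector closure is never invoked.
    let hit = read_archive_first(
        || Some("messages from local archive"),
        || Some("messages from remote connector"),
    );
    println!("{:?}", hit);

    // Archive miss: fall back to the connector.
    let miss = read_archive_first(|| None, || Some("messages from remote connector"));
    println!("{:?}", miss);
}
```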

Key Concepts

Scroll IDs

Every session gets a unique scroll_id (UUIDv7) that's independent of the connector's native session ID. This allows:

Sessions to move between connectors
Stable references even if connector data is deleted
Time-ordered sorting (UUIDv7 encodes timestamp)
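To see why UUIDv7 sorts by time: the first 48 bits of the ID are a Unix timestamp in milliseconds, so IDs in the same canonical lowercase form sort lexically in creation order (to millisecond resolution). The following sketch extracts that timestamp using only the standard library. `uuid_v7_timestamp_ms` is a hypothetical helper, not something the crate is known to export.

```rust
/// Extract the Unix-epoch millisecond timestamp embedded in a UUIDv7 string.
/// Returns None if the input is not a 32-hex-digit UUID with version 7.
fn uuid_v7_timestamp_ms(id: &str) -> Option<u64> {
    let hex: String = id.chars().filter(|c| *c != '-').collect();
    if hex.len() != 32 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    // The version nibble is the 13th hex digit.
    if hex.as_bytes()[12] != b'7' {
        return None;
    }
    // The first 48 bits (12 hex digits) hold the timestamp.
    u64::from_str_radix(&hex[..12], 16).ok()
}

fn main() {
    let scroll_id = "01936e8f-e5a7-7000-8000-000000000001";
    match uuid_v7_timestamp_ms(scroll_id) {
        Some(ms) => println!("{} embeds timestamp {} ms since the Unix epoch", scroll_id, ms),
        None => println!("{} is not a UUIDv7", scroll_id),
    }
}
```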

Session Lineage

Sessions can have parent sessions, creating a tree of related conversations:

Split: Fork conversation at a specific message
Compact: Summarized version of parent
Reference: Points to parent without duplication
Edit: Modified version of parent
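One way to picture the lineage tree is a map from each session to an optional parent link tagged with one of the four relation kinds. The sketch below is a guess at a workable in-memory shape, not the crate's data model: `SessionNode` and `ancestors` are hypothetical, and the `Continuation` enum here is a local stand-in that only borrows the variant names listed above.

```rust
use std::collections::HashMap;

/// Local stand-in for the four relation kinds listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Continuation {
    Split,
    Compact,
    Reference,
    Edit,
}

/// Illustrative session record: only the parent link matters here.
struct SessionNode {
    /// Parent scroll ID and how this session relates to it.
    parent: Option<(String, Continuation)>,
}

/// Walk from `start` up to the root, collecting (parent_id, relation) pairs.
/// The length check stops the walk if hand-edited data contains a cycle.
fn ancestors(sessions: &HashMap<String, SessionNode>, start: &str) -> Vec<(String, Continuation)> {
    let mut chain = Vec::new();
    let mut current = start.to_string();
    while let Some(node) = sessions.get(&current) {
        match &node.parent {
            Some((parent_id, kind)) => {
                if chain.len() >= sessions.len() {
                    break;
                }
                chain.push((parent_id.clone(), *kind));
                current = parent_id.clone();
            }
            None => break,
        }
    }
    chain
}

fn main() {
    let mut sessions = HashMap::new();
    sessions.insert("root".to_string(), SessionNode { parent: None });
    sessions.insert(
        "branch".to_string(),
        SessionNode { parent: Some(("root".to_string(), Continuation::Split)) },
    );
    sessions.insert(
        "summary".to_string(),
        SessionNode { parent: Some(("branch".to_string(), Continuation::Compact)) },
    );
    for (id, kind) in ancestors(&sessions, "summary") {
        println!("{:?} of {}", kind, id);
    }
    // The remaining kinds are shown only so all four variants appear.
    println!("{:?} {:?}", Continuation::Reference, Continuation::Edit);
}
```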

Content-Addressable Storage

Files are stored by their SHA-256 hash, so:

The same file uploaded twice is stored only once
Files can be shared across sessions without duplication
You can verify file integrity by hash
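The deduplication property follows directly from using the digest as the storage key. The in-memory sketch below shows only that keying logic. It does not compute SHA-256 itself (the standard library has no SHA-256, so the digest is assumed to come from elsewhere, for example a hashing crate or `sha256sum`), and `FileStore` is a hypothetical type, not the Archivist's storage layer. The digest used in `main` is the well-known SHA-256 of the bytes `abc`.

```rust
use std::collections::HashMap;

/// Illustrative content-addressable store: blobs keyed by hex digest.
#[derive(Default)]
struct FileStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl FileStore {
    /// Store `data` under `digest`. Returns true if new bytes were written,
    /// false if the digest was already present (deduplicated).
    fn put(&mut self, digest: &str, data: &[u8]) -> bool {
        if self.blobs.contains_key(digest) {
            return false;
        }
        self.blobs.insert(digest.to_string(), data.to_vec());
        true
    }

    /// Look up a blob by digest.
    fn get(&self, digest: &str) -> Option<&[u8]> {
        self.blobs.get(digest).map(|bytes| bytes.as_slice())
    }

    /// Number of distinct blobs stored.
    fn len(&self) -> usize {
        self.blobs.len()
    }
}

fn main() {
    // SHA-256 of the three bytes "abc", computed outside this sketch.
    let digest = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";
    let mut store = FileStore::default();
    println!("first upload wrote bytes: {}", store.put(digest, b"abc"));
    println!("second upload wrote bytes: {}", store.put(digest, b"abc"));
    println!("blobs stored: {}", store.len());
    println!("lookup found: {}", store.get(digest).is_some());
}
```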

Configuration

Environment Variables

DIRIGENT_DATA_DIR: Override data directory; archives are stored at <data_dir>/archives/

Example Configurations

# Use custom data directory (archives at /home/user/mydata/archives/)
DIRIGENT_DATA_DIR=/home/user/mydata dx serve

# Use global data directory
DIRIGENT_DATA_DIR=/home/user/.dirigent dx serve

# Use temporary data directory (testing)
DIRIGENT_DATA_DIR=/tmp/dirigent_test dx serve

Programmatic Access

While the Archivist runs automatically, you can also use it programmatically:

use dirigent_archivist::Archivist;
use std::path::PathBuf;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create an archivist over a single archive directory.
    // Internally this wires up a `JsonlBackend` for the archive.
    let archivist = Archivist::new_with_single_archive(
        PathBuf::from("./dirigent_archive")
    ).await?;

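    // List sessions for a connector.
    // `connector_uid` is not defined in this snippet: supply the UID of a
    // registered connector (see .db/connectors/index.tsv).
    let sessions = archivist.list_sessions(connector_uid).await?;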

    for session in sessions {
        println!("{}: {}", session.scroll_id, session.title.unwrap_or_default());
    }

    Ok(())
}

Archivist is a concrete struct that owns a registry of ArchiveBackend implementations keyed by archive name. In Phase 2 the only backend is JsonlBackend (file-based NDJSON/JSON/TSV). See examples/ for more detailed usage.

Performance

The Archivist is designed for human-scale workloads (thousands of sessions, millions of messages):

Fast Writes: Appending to NDJSON is O(1) per message, regardless of session size
Cached Reads: Common lookups cached in memory
Grep-able: TSV indices can be scanned in milliseconds
Incremental: Only new messages are written, no full re-writes
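The append-only write pattern behind these points can be sketched with the standard library: open the file in append mode and write one line, never touching earlier content. This is a generic illustration, not the JsonlBackend implementation. `append_line` is a hypothetical helper, and the temp-file path in `main` is only for demonstration.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append one already-serialized JSON object as a single NDJSON line.
/// The file is created if missing; existing content is never rewritten.
fn append_line(path: &Path, json_line: &str) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{}", json_line)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("archivist_append_sketch.ndjson");
    append_line(&path, r#"{"version":1,"role":"user","content_md":"hello"}"#)?;
    append_line(&path, r#"{"version":1,"role":"assistant","content_md":"hi"}"#)?;
    println!("appended two lines to {}", path.display());
    Ok(())
}
```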

Scalability Notes

Large sessions (1000+ messages) may take a few seconds to load
TSV indices are suitable for 100-1000 connectors
File deduplication saves space for repeated attachments

Querying and Curation

Future: Knowledge Overviews

The Archivist is designed to support knowledge curation workflows:

Export sessions as clean markdown files
Create summaries and overviews across sessions
Tag and categorize conversations
Build a personal knowledge base

These features are planned for future releases.

Current: Manual Curation

For now, you can manually curate your archive:

Edit session.json to add tags
Grep through messages for specific topics
Copy/organize sessions into project folders
Use jq/awk/sed to extract insights

Advanced Features

Session Splitting

Create a new conversation branch from any point in history:

// Future API (not yet implemented)
let new_session = archivist.split_session(
    session_id,
    at_message_id,
    Continuation::Split
).await?;

Attachment Storage

Files are automatically deduplicated using SHA-256:

// Store file (content-addressable)
let file_id = archivist.store_file(
    &file_data,
    "spec.pdf",
    Some("application/pdf")
).await?;

// Reference in message
let attachment = AttachmentRef {
    file_id,  // "sha256:abc123..."
    name: "spec.pdf".to_string(),
    mime_type: Some("application/pdf".to_string()),
};

Multi-Archive Support

Archivist natively manages multiple named archives via an on-disk registry. Each archive is backed by its own ArchiveBackend (currently JsonlBackend) and selected per call via an optional archive argument. This enables:

Separate archives per project
A default archive plus specialized side archives
Moving or copying sessions between archives

Future backends (e.g. SQLite, indexed, remote) will plug into the same trait layer without changing the coordinator API.
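The "registry of backends keyed by archive name, selected by an optional argument" idea can be illustrated with a small trait-object map. Everything in this sketch is hypothetical: `SketchBackend`, `FileBackend`, and `Registry` are invented for illustration and deliberately do not reuse the real `ArchiveBackend` or `JsonlBackend` names, whose methods are not described here.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a storage backend trait.
trait SketchBackend {
    fn describe(&self) -> String;
}

/// Hypothetical file-based backend rooted at a directory.
struct FileBackend {
    root: String,
}

impl SketchBackend for FileBackend {
    fn describe(&self) -> String {
        format!("file-based archive at {}", self.root)
    }
}

/// Named backends plus the name used when no archive is specified.
struct Registry {
    default_archive: String,
    backends: HashMap<String, Box<dyn SketchBackend>>,
}

impl Registry {
    fn new(default_archive: &str) -> Self {
        Registry {
            default_archive: default_archive.to_string(),
            backends: HashMap::new(),
        }
    }

    fn register(&mut self, name: &str, backend: Box<dyn SketchBackend>) {
        self.backends.insert(name.to_string(), backend);
    }

    /// Resolve the optional archive argument, falling back to the default.
    fn describe(&self, archive: Option<&str>) -> Option<String> {
        let name = archive.unwrap_or(self.default_archive.as_str());
        self.backends.get(name).map(|backend| backend.describe())
    }
}

fn main() {
    let mut registry = Registry::new("default");
    registry.register("default", Box::new(FileBackend { root: "./dirigent_archive".to_string() }));
    registry.register("project-x", Box::new(FileBackend { root: "./project_x_archive".to_string() }));
    println!("{:?}", registry.describe(None));
    println!("{:?}", registry.describe(Some("project-x")));
}
```

Because callers only ever go through the trait, adding another backend type in this sketch means implementing `SketchBackend` and registering it under a new name.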

Troubleshooting

Archive Not Created

If the archive directory doesn't appear:

Check that DIRIGENT_DATA_DIR is set correctly (or that the default data directory is writable)
Ensure you have write permissions on the parent directory
Check logs for I/O errors
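To rule out a permissions problem quickly, you can probe the target directory yourself. The sketch below creates the directory if needed, writes a small file, and removes it again. `check_writable` and the `.write_probe` file name are hypothetical, and the path in `main` is only an example. Substitute your own `<data_dir>/archives/`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Create `dir` if needed, then write and delete a small probe file.
/// Any permission or I/O problem surfaces as the returned error.
fn check_writable(dir: &Path) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let probe = dir.join(".write_probe");
    fs::write(&probe, b"ok")?;
    fs::remove_file(&probe)
}

fn main() {
    // Example location only; point this at your real archive directory.
    let dir = std::env::temp_dir().join("dirigent_test").join("archives");
    match check_writable(&dir) {
        Ok(()) => println!("{} is writable", dir.display()),
        Err(err) => println!("cannot write to {}: {}", dir.display(), err),
    }
}
```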

Missing Sessions

If sessions don't appear in the archive:

Verify EventHandler is running
Check for event subscription errors in logs
Ensure connector emits SessionCreated events

Large Archive Size

If the archive grows too large:

Check .files/ for unusually large attachments (identical files are already stored only once)
Consider archiving old sessions separately
Future: Use compaction features (not yet implemented)

Development Status

Current (Phase 2 complete):

Automatic archival of sessions and messages
Event-driven integration with dirigent_core
File-based storage with NDJSON/JSON/TSV (JsonlBackend)
Content-addressable file storage
Multi-archive coordinator with per-archive backends
Trait-based backend abstraction (ArchiveBackend + sub-traits)

Future:

Indexed SearchBackend implementations (full-text search)
Additional concrete backends (SQLite, remote)
Session splitting and lineage management
Knowledge overview generation
Network RPC interface

Documentation

Developer Guide: CLAUDE.md - Package architecture and implementation details
Architecture: docs/building/05_archivist/vision.md - Design rationale
API Docs: cargo doc --package dirigent_archivist --open
Examples: See examples/ for working code

Contributing

The Archivist is part of the Dirigent project. See the main repository for contribution guidelines.

License

Part of the Dirigent project.
