dirigent/crates/dirigent_archivist/CLAUDE.md
2026-05-08 01:59:04 +02:00

Package: dirigent_archivist

Persistent storage for all agentic interactions in Dirigent.

Quick Facts

  • Type: Library
  • Main Entry: src/lib.rs
  • Dependencies: dirigent_protocol, uuid, chrono, serde, tokio, tracing, thiserror, sha2, hex, async-trait
  • Status: Complete - Production ready with comprehensive tests

Purpose

The Archivist provides file-based archival storage for all session data, messages, and attachments in Dirigent. It implements an archive-first architecture with connector API fallback, using NDJSON, JSON, and TSV formats for durability and human-readability.

Key Features

  • File-based Storage: NDJSON for messages, JSON for metadata, TSV for indices
  • Content-Addressable Files: SHA-256 based storage for attachments with automatic deduplication
  • Session Lineage: Track splits, continuations, and mutations with parent references
  • Connector Registry: Coordinate UID assignment across connectors with collision detection
  • Event Streaming: Real-time updates via EventHandler subscribing to dirigent_protocol events
  • Archive-First Design: Read from archive first, fall back to connector API when needed
  • Caching: In-memory caching of connector and session mappings for performance

Architecture

The Archivist is built on three core architectural principles:

1. Archive-First Read Strategy

The Archivist is the primary source of truth for historical data:

  • UI and APIs query the archive first
  • Only fall back to connector APIs if data is not in archive
  • This enables offline access and consistent history across restarts

2. Write-Through Event Capture (Append-Only)

The EventHandler subscribes to the global event stream from dirigent_core:

  • Captures session creation, message streaming, and tool calls in real-time
  • Uses MessageAccumulator to assemble streaming chunks into complete messages
  • Writes complete messages to archive immediately upon finalization
  • No polling required - fully event-driven
  • Append-only writes: Messages are appended as events arrive, NOT in chronological order
  • File order reflects event timing, not message timestamps

3. File-Based Storage with Sort-on-Read

All data is stored in human-readable, grep-able formats:

  • NDJSON (Newline-Delimited JSON): Incremental append-only logs for messages and mappings
  • JSON: Structured metadata for sessions and connectors
  • TSV (Tab-Separated Values): Fast indices for cross-references
  • Content-Addressed Files: Binary attachments stored by SHA-256 hash for deduplication
  • Sort-on-Read: get_messages() sorts by timestamp and message_id to ensure chronological order despite append-only writes

Backend Trait Layer (Phase 2)

The archivist uses a trait-based backend abstraction. ArchiveBackend defines the mandatory session and message primitives every backend must provide, plus as_xxx() accessors returning optional sub-traits:

  • SearchBackend — reserved for Phase 3+ indexed backends (not wired)
  • DagBackend — session lineage DAG edges
  • MetaEventsBackend — ACP connection lifecycle events
  • ConnectorRegistryBackend — per-archive connector metadata
  • SessionMappingBackend — native↔scroll session ID mapping

JsonlBackend is the Phase 2 concrete implementation (file-based NDJSON/JSON/TSV) and opts into every sub-trait except SearchBackend (content search continues to be served by ripgrep via crates/api/src/archivist/search_task.rs).

The Archivist struct (in src/coordinator/) owns a registry of backends keyed by archive name and performs orchestration (alias detection, session lineage, move/copy, DAG walks, archive lifecycle). Consumers hold Arc<Archivist> directly — the coordinator is concrete, not a trait.

See docs/plans/2026-04-18-archivist-phase2-design.md for design rationale.

Multi-Backend Registry (Phase 3)

The coordinator (Archivist) holds Vec<Arc<ArchiveRegistration>> sorted by read_priority instead of a flat HashMap<name, Arc<dyn ArchiveBackend>>. Each registration carries:

  • backend: Arc<dyn ArchiveBackend> + its declared capabilities
  • failure_mode: Required (must succeed) | BestEffort (errors log + drift health)
  • read_priority: lower = tried first for reads; also selects the default write target when no archive is named
  • write_active: participates in fanout writes
  • enabled: kill-switch without removing config
  • write_policy: Inline (default; await per call) or Queued (mpsc + batch_window + overflow policy)
  • Runtime state: last_health, last_error, consecutive_failures (all Arc<RwLock<_>>, shared with the writer task when queued)
  • Optional writer: Option<WriterHandle> (Some iff write_policy = Queued)

Backends are declared in dirigent.toml under [[archives]] and constructed at boot via Archivist::from_config(cfg, &BackendRegistry). Add a new backend type by implementing BackendFactory and registering it on the BackendRegistry before from_config.

Reads

get_session, get_messages_paged, count_messages, get_meta_events, get_children, etc. walk the registry in priority order via read_walk_per_session(scroll_id, predicate, op). The predicate filters backends by capability, and Unavailable backends are skipped. The first backend that returns Some(value) wins, and its name is cached against the scroll_id in a positive LRU (capacity 10_000). Subsequent reads for the same scroll_id try the cached backend first and fall back to the full priority walk only if it does not answer.
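The read walk described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the coordinator's actual code: Registration, Registry, and read are hypothetical names, scroll IDs are plain u64 values, and an unbounded HashMap stands in for the positive LRU.

```rust
use std::collections::HashMap;

/// Simplified health state; only `Unavailable` matters for the read walk.
#[derive(Clone, Copy, PartialEq)]
enum Health { Healthy, Degraded, Unavailable }

/// Hypothetical stand-in for a registered backend.
struct Registration {
    name: &'static str,
    read_priority: u32,
    health: Health,
    sessions: HashMap<u64, String>, // scroll_id -> session title
}

struct Registry {
    regs: Vec<Registration>,           // kept sorted by read_priority
    cache: HashMap<u64, &'static str>, // scroll_id -> backend that answered (no eviction here)
}

impl Registry {
    fn new(mut regs: Vec<Registration>) -> Self {
        regs.sort_by_key(|r| r.read_priority); // lower priority value is tried first
        Registry { regs, cache: HashMap::new() }
    }

    /// Returns (backend name, value) from the first backend that has the session.
    fn read(&mut self, scroll_id: u64) -> Option<(&'static str, String)> {
        // Fast path: try the backend that answered last time.
        if let Some(&name) = self.cache.get(&scroll_id) {
            let hit = self.regs.iter()
                .find(|r| r.name == name && r.health != Health::Unavailable)
                .and_then(|r| r.sessions.get(&scroll_id));
            if let Some(v) = hit {
                return Some((name, v.clone()));
            }
        }
        // Slow path: full walk in priority order, skipping Unavailable backends.
        for r in &self.regs {
            if r.health == Health::Unavailable {
                continue;
            }
            if let Some(v) = r.sessions.get(&scroll_id) {
                self.cache.insert(scroll_id, r.name);
                return Some((r.name, v.clone()));
            }
        }
        None
    }
}
```

In this sketch a cached backend that has become Unavailable is simply bypassed, and the walk re-caches whichever backend answers next.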

Collection-shape reads (list_sessions_paged, list_connectors, list_meta_sessions, find_meta_session_by_client) use read_walk_collection — first enabled backend that can answer wins, no cache, no aggregation across backends. Phase 3 explicitly defers cross-backend merge/dedup to a later phase.

Writes

Mutating methods (append_messages, register_session, update_session_*, append_meta_events, append_dag_edge, clear_session_messages, update_connector_fingerprint) resolve a primary (per-call archive: Some(name) override or the default-write target) and fan out to every other enabled && write_active backend that has the required capability. Capability-mismatched backends are skipped with a debug capability_skip log (never an error). Required failures propagate to the caller; BestEffort failures log + drift health.
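As a rough illustration of the fanout rules above, the sketch below writes to the primary first and then to every other eligible backend. Target, fanout, and the fail flag are hypothetical; the real coordinator is async and works on trait objects. Treating a capability-mismatched primary as a skip is an assumption of this sketch, not something the text above specifies.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum FailureMode { Required, BestEffort }

/// Hypothetical stand-in for a registered backend; `fail` simulates a write error.
struct Target {
    name: &'static str,
    failure_mode: FailureMode,
    enabled: bool,
    write_active: bool,
    has_capability: bool,
    fail: bool,
    written: Vec<String>,
}

impl Target {
    fn write(&mut self, msg: &str) -> Result<(), String> {
        if self.fail {
            return Err(format!("{}: write failed", self.name));
        }
        self.written.push(msg.to_string());
        Ok(())
    }
}

/// Writes to the primary, then fans out to every other enabled && write_active backend.
/// Returns notes for skipped backends and best-effort failures (a stand-in for logging).
fn fanout(targets: &mut [Target], primary: &str, msg: &str) -> Result<Vec<String>, String> {
    let p = targets
        .iter()
        .position(|t| t.name == primary)
        .ok_or_else(|| format!("unknown archive: {}", primary))?;
    // Primary first, then every other registration in order.
    let order: Vec<usize> = std::iter::once(p)
        .chain((0..targets.len()).filter(|&i| i != p))
        .collect();
    let mut notes = Vec::new();
    for i in order {
        let t = &mut targets[i];
        if i != p && !(t.enabled && t.write_active) {
            continue; // secondaries must be enabled and write-active
        }
        if !t.has_capability {
            notes.push(format!("capability_skip:{}", t.name)); // never an error
            continue;
        }
        if let Err(e) = t.write(msg) {
            match t.failure_mode {
                FailureMode::Required => return Err(e), // propagates to the caller
                FailureMode::BestEffort => notes.push(format!("best_effort_failure:{}", t.name)),
            }
        }
    }
    Ok(notes)
}
```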

register_connector currently does NOT fan out — alias detection + the tri-state Accepted/Aliased/Rejected return shape make replication non-trivial. Fanout for connectors is deferred; single-backend setups are unaffected.

For write_policy = Queued backends, the primary/secondary write paths enqueue a WriteOp into the backend's writer task instead of awaiting. Errors drift the backend's health but do not propagate to the caller. Coalescing merges consecutive AppendMessages/AppendMetaEvents for the same scroll_id within batch_window_ms.
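The coalescing step can be pictured as below. This is a simplified sketch: the WriteOp enum here has only two illustrative variants with made-up payloads, and the input vector is assumed to be the set of operations drained from the queue within one batch window.

```rust
/// Hypothetical, simplified write operation; the real WriteOp has more variants.
#[derive(Debug, PartialEq, Clone)]
enum WriteOp {
    AppendMessages { scroll_id: u64, messages: Vec<String> },
    RegisterSession { scroll_id: u64 },
}

/// Merges runs of consecutive AppendMessages ops that target the same session.
/// Ops separated by a different kind of op are left apart, preserving order.
fn coalesce(batch: Vec<WriteOp>) -> Vec<WriteOp> {
    let mut out: Vec<WriteOp> = Vec::new();
    for op in batch {
        if let WriteOp::AppendMessages { scroll_id, messages } = &op {
            if let Some(WriteOp::AppendMessages { scroll_id: last_id, messages: last }) = out.last_mut() {
                if *last_id == *scroll_id {
                    // Same session as the previous op: extend it instead of queuing a new one.
                    last.extend(messages.iter().cloned());
                    continue;
                }
            }
        }
        out.push(op);
    }
    out
}
```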

Cross-backend operations

  • delete_session(scroll_id, _) fans out to every enabled backend that has the session. Copies in write_active=false backends produce ArchivistError::DeleteOnReadOnlyBackend (write-active copies are still deleted); cache invalidated regardless of outcome.
  • copy_session(scroll_id, from, to) reads from from, writes to to, including DAG and meta-events when both sides have the capability. The source remains canonical (the cache is NOT rewritten).
  • move_session(scroll_id, from, to) is copy + delete-from-source. If the source-side delete fails after the copy succeeded, ArchivistError::PartialMove { copied_to, delete_error } is returned so the caller knows the session now lives in both places.

The Phase 2 connector-aware move_session(scroll_id, target_connector_uid, _) and copy_session(scroll_id, target_connector_uid, _) survived the Phase 3 rename as move_session_to_connector / copy_session_to_connector. Their bulk variant is move_sessions_to_connector.

Health

HealthStatus drifts on every coordinator call that observes a backend:

  • Successful write → Healthy; consecutive_failures reset to 0.
  • Successful read → Healthy (only rescues Degraded; does not reset the counter).
  • Write failure → Degraded { reason }; consecutive_failures += 1; after K = 5 consecutive failures drifts to Unavailable { reason }. Reads skip Unavailable backends; writes against an Unavailable Required backend fail, while writes against an Unavailable BestEffort backend are still attempted.
  • Read failure alone never drifts past Degraded; writes are the authoritative health signal.
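The drift rules above amount to a small state machine. The following sketch models them under two assumptions: the threshold is reached on the fifth consecutive write failure, and a read failure only moves a Healthy backend to Degraded. BackendHealth and its method names are hypothetical.

```rust
const UNAVAILABLE_AFTER: u32 = 5; // K from the rules above

#[derive(Debug, Clone, PartialEq)]
enum HealthStatus {
    Healthy,
    Degraded { reason: String },
    Unavailable { reason: String },
}

/// Hypothetical per-backend runtime state.
struct BackendHealth {
    status: HealthStatus,
    consecutive_failures: u32,
}

impl BackendHealth {
    fn new() -> Self {
        BackendHealth { status: HealthStatus::Healthy, consecutive_failures: 0 }
    }

    /// A successful write fully restores the backend.
    fn on_write_success(&mut self) {
        self.status = HealthStatus::Healthy;
        self.consecutive_failures = 0;
    }

    /// A successful read only rescues Degraded and leaves the counter alone.
    fn on_read_success(&mut self) {
        if matches!(self.status, HealthStatus::Degraded { .. }) {
            self.status = HealthStatus::Healthy;
        }
    }

    /// Write failures count up and eventually mark the backend Unavailable.
    fn on_write_failure(&mut self, reason: &str) {
        self.consecutive_failures += 1;
        self.status = if self.consecutive_failures >= UNAVAILABLE_AFTER {
            HealthStatus::Unavailable { reason: reason.to_string() }
        } else {
            HealthStatus::Degraded { reason: reason.to_string() }
        };
    }

    /// A read failure never drifts past Degraded.
    fn on_read_failure(&mut self, reason: &str) {
        if self.status == HealthStatus::Healthy {
            self.status = HealthStatus::Degraded { reason: reason.to_string() };
        }
    }
}
```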

list_archives_with_health() returns a Vec<ArchiveStatus> snapshot of every registration: name, type, capabilities, health, last_error, and queue_depth (for queued backends).

Lifecycle

Phase 3 is startup-only. add_archive / remove_archive / set_default_archive on the coordinator return ArchivistError::DynamicRegistryUnsupported. To change the registry, edit dirigent.toml and restart the server. Archivist::shutdown() drains queued writer tasks (sends WriteOp::Shutdown on each writer's mpsc and awaits ack); call it before process exit.

Test-only constructors Archivist::from_registrations(regs) and SessionMetadata::stub(scroll_id) live under #[cfg(any(test, feature = "test-utils"))] for integration tests that bypass the factory.

See docs/plans/2026-04-19-archivist-phase3-design.md for the full design rationale, and examples/multi_backend.rs for a runnable end-to-end example.

Module Organization

Core Modules

  • lib.rs: Public API surface and re-exports
  • types.rs: Core data structures (session metadata, message records, connector info, API types)
  • error.rs: Error types and Result alias for archivist operations

Backend Layer (backend/)

  • traits.rs: ArchiveBackend trait + 5 optional sub-traits
  • capability.rs: ArchiveCapability enum + CapabilitySet type
  • health.rs: HealthStatus enum returned by health_check
  • contract.rs: Reusable behavioral tests for any &dyn ArchiveBackend (cfg-gated)
  • mock.rs: In-memory MockBackend for coordinator unit tests (cfg-gated)

Concrete Backends (backends/)

  • jsonl/: The file-based JsonlBackend — the only Phase 2 backend. Reuses storage/ primitives for NDJSON/JSON/TSV operations.

Coordinator (coordinator/)

  • mod.rs: The Archivist struct + constructors
  • archives.rs: Archive lifecycle (add/remove/list/default)
  • connectors.rs: Connector registration + alias detection
  • sessions.rs: Session registration, metadata updates, move/copy
  • meta.rs: Meta events, DAG walks, cleanup

Storage Layer (storage/)

Low-level file I/O primitives used by JsonlBackend. All storage operations are async and use tokio.

  • paths.rs: ArchivePaths utility for consistent directory structure and path resolution
  • ndjson.rs: Newline-delimited JSON operations (read_ndjson, append_ndjson)
  • json.rs: JSON operations (read_json, write_json)
  • tsv.rs: Tab-separated value operations for connector index
  • files.rs: Content-addressable file storage with SHA-256 hashing and deduplication

Supporting Modules

  • registry.rs: Archive registry persistence (multi-archive metadata)
  • migration.rs: Single-archive → multi-archive migration path
  • session.rs: Session lineage types shared across layers
  • accumulator.rs: MessageAccumulator for assembling streaming message chunks
  • backfill.rs: Backfill helpers for importing historical sessions
  • import/: External conversation importers (e.g. Claude export)

Events

  • events.rs: EventHandler for subscribing to dirigent_protocol events and archiving them

Configuration

The Archivist archive root is determined by DirigentPaths resolution:

  • Set DIRIGENT_DATA_DIR to override the data directory; archives will be stored at <data_dir>/archives/
  • Defaults to ~/.local/share/dirigent/archives/ (or platform equivalent)

For example, to serve with a custom data directory:

DIRIGENT_DATA_DIR=/path/to/data dx serve

Archive Structure

dirigent_archive/
├── .contexts/
│   └── {scroll_id:uuidv7}/          # One directory per session
│       ├── session.json             # Session metadata
│       ├── messages.jsonl           # Incremental message log (.ndjson also supported)
│       └── lineage.json             # Session lineage info (optional)
├── .db/
│   └── connectors/
│       ├── index.tsv                # Fast connector lookup (TSV)
│       └── {connector_uid}/
│           ├── connector.json       # Connector metadata
│           └── sessions.jsonl       # Session mappings (.ndjson also supported)
└── .files/
    └── {sha256-hash}                # Content-addressable file storage

Why Hidden Directories?

The .contexts, .db, and .files directories are hidden (prefixed with .) to keep the archive root clean for future rendered outputs (like chat.md exports). This is similar to how .git hides implementation details in a codebase.

File Formats

Session Metadata (session.json)

{
  "version": 1,
  "scroll_id": "01936e8f-e5a7-7000-8000-000000000001",
  "created_at": "2025-01-01T12:00:00Z",
  "updated_at": "2025-01-01T12:30:00Z",
  "title": "Implement user authentication",
  "connector_uid": "01936e8f-e5a7-7000-8000-000000000002",
  "native_session_id": "abc123",
  "agent_id": null,
  "parent_scroll_id": null,
  "continuation": null,
  "tags": ["backend", "auth"],
  "metadata": {
    "source": "OpenCode",
    "model": "claude-3-5-sonnet"
  }
}

Messages Log (messages.jsonl)

One JSON object per line, append-only:

{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000003","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":null,"ts":"2025-01-01T12:01:00Z","role":"user","author":"alice","content_md":"How do I implement JWT auth?","attachments":[],"metadata":{}}
{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000004","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":"01936e8f-e5a7-7000-8000-000000000003","ts":"2025-01-01T12:01:10Z","role":"assistant","author":"claude","content_md":"Here's how to implement JWT authentication...","attachments":[],"metadata":{"model":"claude-3-5-sonnet"}}

IMPORTANT - Ordering: The order of lines in the message log file (messages.jsonl or messages.ndjson) reflects event arrival order, NOT chronological order. Assistant replies often arrive after subsequent user messages due to streaming latency, resulting in non-chronological file order. Always use the Archivist::get_messages() API to retrieve messages, which sorts by ts (timestamp) and message_id (UUIDv7) to guarantee chronological order.

File Format Compatibility: The archivist supports both .ndjson and .jsonl file extensions for newline-delimited JSON files. When reading, .jsonl is preferred if present, with automatic fallback to .ndjson for backward compatibility. Write operations use .jsonl (canonical format). Both formats are identical in content - the difference is purely the file extension.
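The extension fallback can be illustrated with a small path resolver. This is a sketch using only the standard library; resolve_read_path and resolve_write_path are hypothetical helper names, not functions from storage/paths.rs.

```rust
use std::path::{Path, PathBuf};

/// Picks the file to read for a log named `stem` inside `dir`:
/// `.jsonl` is preferred, `.ndjson` is the backward-compatible fallback.
fn resolve_read_path(dir: &Path, stem: &str) -> Option<PathBuf> {
    ["jsonl", "ndjson"]
        .iter()
        .map(|ext| dir.join(format!("{}.{}", stem, ext)))
        .find(|p| p.is_file())
}

/// Writes always target the canonical `.jsonl` name.
fn resolve_write_path(dir: &Path, stem: &str) -> PathBuf {
    dir.join(format!("{}.jsonl", stem))
}
```

For a session directory that holds only messages.ndjson, resolve_read_path(dir, "messages") returns the .ndjson path, while resolve_write_path still points at messages.jsonl.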

Connector Index (index.tsv)

Tab-separated values with header row:

connector_uid	type	title	client_native_id	alias_of	created_at
01936e8f-e5a7-7000-8000-000000000002	OpenCode	OpenCode Local	opencode@http://localhost:12225		2025-01-01T12:00:00Z
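A reader for this index could look roughly like the sketch below. IndexRow and parse_index are hypothetical names (the real helpers live in storage/tsv.rs and may differ); the sketch assumes exactly the six header columns shown above and treats an empty alias_of field as None.

```rust
/// Hypothetical row type mirroring the header columns shown above.
#[derive(Debug, PartialEq)]
struct IndexRow {
    connector_uid: String,
    kind: String, // the `type` column; `type` is a Rust keyword
    title: String,
    client_native_id: String,
    alias_of: Option<String>,
    created_at: String,
}

const HEADER: &str = "connector_uid\ttype\ttitle\tclient_native_id\talias_of\tcreated_at";

/// Parses the whole index, validating the header row first.
fn parse_index(tsv: &str) -> Result<Vec<IndexRow>, String> {
    let mut lines = tsv.lines();
    let header = lines.next().ok_or("empty index")?;
    if header != HEADER {
        return Err(format!("unexpected header: {}", header));
    }
    lines
        .filter(|l| !l.is_empty())
        .map(|line| {
            // Splitting on tabs keeps empty fields, so a blank alias_of stays in position.
            let f: Vec<&str> = line.split('\t').collect();
            if f.len() != 6 {
                return Err(format!("expected 6 fields, got {}", f.len()));
            }
            Ok(IndexRow {
                connector_uid: f[0].to_string(),
                kind: f[1].to_string(),
                title: f[2].to_string(),
                client_native_id: f[3].to_string(),
                alias_of: if f[4].is_empty() { None } else { Some(f[4].to_string()) },
                created_at: f[5].to_string(),
            })
        })
        .collect()
}
```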

Session Mappings (sessions.jsonl)

Maps native session IDs from connectors to scroll IDs in the archive:

{"version":1,"connector_uid":"01936e8f-e5a7-7000-8000-000000000002","native_session_id":"abc123","scroll_id":"01936e8f-e5a7-7000-8000-000000000001","created_at":"2025-01-01T12:00:00Z","alias_of":null}

Message Ordering Guarantees

The Problem: Append Order ≠ Chronological Order

In the event-driven architecture, messages are written to the message log file (messages.jsonl) as completion events arrive. Due to streaming latency:

  • User messages complete nearly instantly and are written immediately
  • Assistant messages stream over time and complete later
  • A second user message can be written before the first assistant reply completes

Example scenario:

T0: User sends "tell me a joke about snakes" (ts=18:23:36.947)
T1: Assistant starts streaming reply (ts=18:23:36.969)
T2: User sends "now one about tigers" (ts=18:23:49.429) <- completes and writes BEFORE assistant finishes
T3: Assistant finishes "snakes" reply <- writes AFTER "tigers" user message

File order in the message log file:

1. user "snakes" (18:23:36.947)
2. user "tigers" (18:23:49.429)  <- written second
3. assistant "snakes" (18:23:36.969)  <- written third, but timestamp is earlier!

The Solution: Sort-on-Read

The Archivist::get_messages() implementation sorts messages before returning:

  1. Primary sort: ts (timestamp) ascending
  2. Secondary sort: message_id (UUIDv7) ascending for stable tie-breaking

This guarantees chronological order regardless of NDJSON append order:

1. user "snakes" (18:23:36.947)
2. assistant "snakes" (18:23:36.969)
3. user "tigers" (18:23:49.429)
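The sort itself is a two-key comparison. The sketch below replays the snakes/tigers scenario with a minimal stand-in for MessageRecord; Msg and sort_on_read are hypothetical names, timestamps are reduced to milliseconds since midnight, and message IDs are plain integers instead of UUIDv7 values.

```rust
/// Minimal stand-in for MessageRecord; only the fields used for ordering.
#[derive(Debug, Clone, PartialEq)]
struct Msg {
    ts_ms: u64,       // timestamp in milliseconds (the archive stores RFC 3339 strings)
    message_id: u128, // stand-in for a UUIDv7
    label: &'static str,
}

/// Sorts by timestamp first, then by message_id as the tie-breaker.
fn sort_on_read(mut msgs: Vec<Msg>) -> Vec<Msg> {
    msgs.sort_by_key(|m| (m.ts_ms, m.message_id));
    msgs
}

/// The three records from the scenario above, in file (append) order.
fn file_order() -> Vec<Msg> {
    vec![
        Msg { ts_ms: 66_216_947, message_id: 1, label: "user snakes" },      // 18:23:36.947
        Msg { ts_ms: 66_229_429, message_id: 3, label: "user tigers" },      // 18:23:49.429
        Msg { ts_ms: 66_216_969, message_id: 2, label: "assistant snakes" }, // 18:23:36.969
    ]
}
```

Calling sort_on_read(file_order()) yields the user "snakes" message, then the assistant reply, then the user "tigers" message.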

Why This Approach?

  • Maintains durability: Append-only writes preserve crash safety
  • No migration needed: Existing archives work without rewrites
  • Simple implementation: No buffered writes or complex write-time ordering
  • Performance trade-off: Small CPU cost on read (sorting) vs. complex write-time coordination

Consumer Guidance

  • DO: Use Archivist::get_messages() to retrieve messages
  • DON'T: Read the message log file directly and assume file order = chronological order
  • UI/API: Always sort by ts then message_id for defense in depth
  • Tie-breaking: Use message_id (UUIDv7) as secondary sort for stable ordering when timestamps match

Key Types

SessionMetadata

Stores all metadata about a session including:

  • scroll_id: UUIDv7 identifier for the session
  • connector_uid: Which connector owns this session
  • native_session_id: Original session ID from the connector (optional)
  • title: Optional human-readable session title (see Title Management below)
  • parent_scroll_id: For session lineage (splits, continuations)
  • continuation: Type of continuation (SPLIT, COMPACT, REFERENCE, EDIT)
  • tags: User-defined categorization
  • metadata: Free-form JSON for connector-specific fields

Title Management

Session titles are fully supported and persist across restarts. Titles are stored in the SessionMetadata struct and saved to the session.json file.

Setting Titles:

// Update title for an existing session
archivist.update_session_metadata(
    scroll_id,
    Some("My Custom Session Title".to_string()),
    None, // model
    None  // archive
).await?;

Default Behavior:

  • New sessions can specify an initial title during registration
  • If no title is provided, sessions default to None
  • The UI typically displays "Untitled" for sessions without titles

Title Loading:

  • Titles are automatically loaded when retrieving session metadata via get_session_metadata()
  • Session lists include titles via list_sessions() and list_sessions_all()
  • Titles are part of the SessionMetadata struct returned by all session queries

UI Integration:

  • The web UI displays session titles in the session list and sidebar
  • Users can rename sessions via the "Rename" button in the session list view
  • Renaming calls api::archivist::rename_session() which uses update_session_metadata()
  • Title changes are persisted immediately and survive application restarts

MessageRecord

Represents a single message in the archive:

  • message_id: UUIDv7 identifier
  • session: scroll_id this message belongs to
  • role: "user", "assistant", or "system"
  • content_md: Message content in Markdown format
  • attachments: References to attached files
  • metadata: Free-form JSON for connector-specific fields

ConnectorRecord

Metadata about a connector:

  • connector_uid: UUIDv7 identifier
  • type: "OpenCode", "ACP", or custom
  • client_native_id: Unique identifier from client (e.g., "opencode@http://localhost:12225")
  • alias_of: If this connector is an alias of another (for deduplication)

Archivist Public API

The Archivist struct (in coordinator/) is the main public entry point for archival operations. Consumers hold Arc<Archivist> and call inherent methods — there is no Archivist trait anymore. The coordinator resolves the target backend per call (via archive: Option<String>) and delegates to ArchiveBackend methods.

Key method families (see coordinator/*.rs for full signatures):

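  • Archive lifecycle (archives.rs): add_archive, remove_archive, list_archives, set_default_archive. Since Phase 3 the registry is startup-only, so add_archive, remove_archive, and set_default_archive return ArchivistError::DynamicRegistryUnsupported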
  • Connectors (connectors.rs): register_connector with tri-state result (Accepted / Aliased / Rejected), list_connectors
  • Sessions (sessions.rs): register_session, get_session_metadata, update_session_metadata, list_sessions_paged, move_session, copy_session, resolve_session
  • Messages: append_messages, get_messages (sorts by ts then message_id for stable chronological order)
  • Meta / DAG (meta.rs): meta-event recording, session lineage DAG walks, cleanup routines

Two distinct query paths exist — do not conflate them.

List filter: Archivist::list_sessions_paged(SessionListQuery) returns a cursor-paged list of sessions, AND-filtered by title_query (substring on title), tags, model_filter (substring on metadata.model), project_id, connector_uid, and include_hidden. This is the right tool for "narrow the list of visible sessions."

Full-text search: api::search_sessions (in the api package, backed by api::archivist::search_task::SearchTask) spawns rg --json over the archive's .contexts/ tree to find messages containing text. It streams SearchExcerpts with parsed NDJSON content and supports cancellation via CancellationToken. This is the right tool for "find messages containing text."

Do not extend list_sessions_paged to do content search. Content search belongs in the ripgrep pipeline. Future improvements to content search (indexed backends, relevance scoring) are Phase 2d / Phase 3 concerns.

JsonlBackend Implementation

The Phase 2 production backend — an implementation of ArchiveBackend plus every sub-trait except SearchBackend:

  • Thread-safe: Uses RwLock for in-memory caches
  • Async: All operations use tokio for non-blocking I/O
  • Caching: In-memory caches for connector and session mappings
  • Collision Detection: Tri-state registration for connectors and sessions

Located under src/backends/jsonl/ and split by concern (backend.rs, connectors.rs, dag.rs, mapping.rs, meta.rs).

Caching Strategy

JsonlBackend maintains two in-memory caches:

  1. connector_cache: HashMap<Uuid, ConnectorRecord>

    • Populated on registration
    • Read from TSV index on startup (future enhancement)
  2. session_cache: HashMap<(Uuid, String), Uuid>

    • Maps (connector_uid, native_session_id) to scroll_id
    • Populated on registration and session resolution
    • Enables fast session lookups without disk I/O
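A read-mostly cache of this shape can be sketched with std::sync::RwLock. This is an illustration only: SessionCache and get_or_load are hypothetical names, Uuid is aliased to u128 to avoid a dependency on the uuid crate, and the load closure stands in for the on-disk lookup in sessions.jsonl.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type Uuid = u128; // stand-in for uuid::Uuid

/// Maps (connector_uid, native_session_id) to scroll_id.
#[derive(Default)]
struct SessionCache {
    map: RwLock<HashMap<(Uuid, String), Uuid>>,
}

impl SessionCache {
    /// Cache-only lookup; takes the shared read lock.
    fn get(&self, connector_uid: Uuid, native_id: &str) -> Option<Uuid> {
        let map = self.map.read().unwrap();
        map.get(&(connector_uid, native_id.to_string())).copied()
    }

    /// Returns the cached scroll_id, or runs `load` (the disk lookup) and caches its result.
    fn get_or_load(
        &self,
        connector_uid: Uuid,
        native_id: &str,
        load: impl FnOnce() -> Option<Uuid>,
    ) -> Option<Uuid> {
        if let Some(hit) = self.get(connector_uid, native_id) {
            return Some(hit);
        }
        let scroll_id = load()?;
        // The exclusive write lock is taken only on a miss.
        self.map
            .write()
            .unwrap()
            .insert((connector_uid, native_id.to_string()), scroll_id);
        Some(scroll_id)
    }
}
```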

Event Handling

The EventHandler subscribes to dirigent_protocol events and archives them in real-time:

use std::sync::Arc;

// Create archivist and event handler
let archivist = Archivist::new_with_single_archive(archive_path).await?;
let handler = EventHandler::new(Arc::new(archivist));

// Subscribe to event stream from dirigent_core
let events = event_stream.subscribe();

// Run event loop (blocking)
handler.run(events).await;

Supported Events

  • SessionCreated: Registers new sessions with the archivist
  • MessageCompleted: Writes finalized messages to the archive
  • SessionUpdate: Accumulates streaming message chunks
    • AgentMessageChunk
    • UserMessageChunk
    • AgentThoughtChunk
    • ToolCall

MessageAccumulator

Assembles streaming message chunks into complete messages:

  • Accumulates text chunks by message_id
  • Tracks thinking blocks separately
  • Stores tool calls with input/output
  • Finalizes messages on MessageCompleted event
  • Converts to MessageRecord for archival
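The accumulation steps listed above can be sketched as a map of per-message buffers. The types below (Chunk, Pending, Finalized, Accumulator) are hypothetical simplifications: tool calls are reduced to a name, and the conversion to MessageRecord is represented by a plain struct.

```rust
use std::collections::HashMap;

/// Buffer for one in-flight message.
#[derive(Default)]
struct Pending {
    text: String,
    thoughts: String,
    tool_calls: Vec<String>,
}

/// Simplified streaming chunk kinds.
enum Chunk<'a> {
    Text(&'a str),
    Thought(&'a str),
    ToolCall(&'a str),
}

/// Stand-in for the MessageRecord produced on finalization.
#[derive(Debug, PartialEq)]
struct Finalized {
    message_id: String,
    content_md: String,
    thoughts: String,
    tool_calls: Vec<String>,
}

#[derive(Default)]
struct Accumulator {
    pending: HashMap<String, Pending>,
}

impl Accumulator {
    /// Appends a chunk to the buffer for `message_id`, creating it on first use.
    fn push(&mut self, message_id: &str, chunk: Chunk<'_>) {
        let p = self.pending.entry(message_id.to_string()).or_default();
        match chunk {
            Chunk::Text(t) => p.text.push_str(t),
            Chunk::Thought(t) => p.thoughts.push_str(t), // kept apart from visible text
            Chunk::ToolCall(name) => p.tool_calls.push(name.to_string()),
        }
    }

    /// Called on MessageCompleted: removes the buffer and returns the full message.
    fn finalize(&mut self, message_id: &str) -> Option<Finalized> {
        let p = self.pending.remove(message_id)?;
        Some(Finalized {
            message_id: message_id.to_string(),
            content_md: p.text,
            thoughts: p.thoughts,
            tool_calls: p.tool_calls,
        })
    }
}
```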

Integration with dirigent_core

The Archivist integrates with dirigent_core via the global event stream:

  1. CoreRuntime emits events for all connector operations
  2. EventHandler subscribes to event stream
  3. MessageAccumulator assembles streaming chunks
  4. Archivist writes complete messages to archive

This enables:

  • Automatic archival of all sessions and messages
  • No polling required - fully event-driven
  • Consistent history across restarts
  • Offline access to historical data

Testing

The package has comprehensive test coverage across multiple dimensions:

Unit Tests

Located in each module (src/*.rs, src/storage/*.rs):

  • Type serialization/deserialization
  • UUIDv7 generation and ordering
  • Timestamp formatting (RFC 3339)
  • Storage operations (NDJSON, JSON, TSV, files)
  • Connector registration tri-state logic
  • Session registration and alias detection

Integration Tests

Located in tests/:

  • integration_tests.rs: Full Archivist + JsonlBackend lifecycle, event handler integration, multi-connector scenarios, session lineage, message accumulation
  • list_sessions_paged_test.rs, pagination_test.rs: List filter + cursor pagination coverage
  • import_claude_idempotency_test.rs: Claude export re-import idempotency

Backend Contract Tests

src/backend/contract.rs holds reusable async assertions that any &dyn ArchiveBackend must pass. JsonlBackend and MockBackend both run the contract suite; new backends added in Phase 3+ should do the same.

Examples

Located in examples/:

  • basic_usage.rs: Core archivist operations
  • event_handling.rs: EventHandler and MessageAccumulator
  • file_storage.rs: Content-addressable file storage

Run tests:

cargo test --package dirigent_archivist

Run examples:

cargo run --package dirigent_archivist --example basic_usage
cargo run --package dirigent_archivist --example event_handling
cargo run --package dirigent_archivist --example file_storage

Performance Characteristics

  • Append Operations: O(1) with sequential file writes
  • Session Lookup: O(1) on an in-memory cache hit, O(n) on a cache miss
  • Message Retrieval: O(n) NDJSON parsing plus an O(n log n) sort-on-read, where n = number of messages in the session
  • File Storage: O(1) lookup by SHA-256 hash; computing the hash is linear in the file size
  • Connector Index: O(n) TSV scan, suitable for hundreds of connectors

Scalability Considerations

  • Large Sessions: NDJSON is append-only, so reading large sessions requires parsing all lines
  • Many Sessions: TSV indices are suitable for thousands of sessions per connector
  • File Deduplication: SHA-256 hashing provides automatic deduplication across sessions
  • Concurrent Access: RwLock allows multiple concurrent readers, single writer

Error Handling

The Archivist uses thiserror for rich error types:

pub enum ArchivistError {
    IoError(std::io::Error),
    SerdeError(serde_json::Error),
    SessionUnknown(Uuid),
    CollisionInconsistent(Uuid),
    // ... etc
}

All public APIs return Result<T, ArchivistError> for explicit error handling.

Development Notes

  • All storage operations are async (using tokio)
  • Content-addressable storage uses SHA-256 hashes (hex-encoded)
  • Archive directory structure mirrors session/message hierarchy
  • UUIDv7 provides time-ordered, sortable identifiers
  • RFC 3339 UTC timestamps for all time-based fields
  • Schema versioning via version field in all records

Related Packages

  • dirigent_protocol: Shared types and protocol definitions (dependency)
  • dirigent_core: Runtime integration for SSE event capture (integration point)
  • api: Server functions for archive queries (future)
  • web: UI for archive browsing and search (future)

Phase 4: ArchiveFilter (2026-04-21)

Every ArchiveRegistration carries a filter: ArchiveFilter. The filter describes which sessions/writes the backend wants to receive. Fields:

  • include_connectors: Option<HashSet<Uuid>> — if Some, only these connector UIDs pass. None means no connector gate.
  • exclude_connectors: HashSet<Uuid> — always rejected.
  • include_tags: HashSet<String> — if non-empty, the session must carry at least one matching tag.
  • exclude_tags: HashSet<String> — any matching tag rejects.
  • include_hidden: bool — default true. If false, sessions whose metadata has "hidden": true are skipped.
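Read together, the fields above suggest a matching function along these lines. This is a sketch under assumptions: the accepts method and its signature are hypothetical, Uuid is aliased to u128, and the order of checks (exclusions win over inclusions) is a guess at the intended precedence rather than documented behaviour.

```rust
use std::collections::HashSet;

type Uuid = u128; // stand-in for uuid::Uuid

struct ArchiveFilter {
    include_connectors: Option<HashSet<Uuid>>,
    exclude_connectors: HashSet<Uuid>,
    include_tags: HashSet<String>,
    exclude_tags: HashSet<String>,
    include_hidden: bool,
}

impl Default for ArchiveFilter {
    /// The default filter is unrestricted: everything passes.
    fn default() -> Self {
        ArchiveFilter {
            include_connectors: None,
            exclude_connectors: HashSet::new(),
            include_tags: HashSet::new(),
            exclude_tags: HashSet::new(),
            include_hidden: true,
        }
    }
}

impl ArchiveFilter {
    /// Hypothetical matcher: does a session with these attributes pass the filter?
    fn accepts(&self, connector: Uuid, tags: &[&str], hidden: bool) -> bool {
        if hidden && !self.include_hidden {
            return false;
        }
        if self.exclude_connectors.contains(&connector) {
            return false;
        }
        if let Some(allowed) = &self.include_connectors {
            if !allowed.contains(&connector) {
                return false;
            }
        }
        if tags.iter().any(|t| self.exclude_tags.contains(*t)) {
            return false;
        }
        // An empty include_tags set means "no tag gate".
        self.include_tags.is_empty() || tags.iter().any(|t| self.include_tags.contains(*t))
    }
}
```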

Primary-always-writes invariant

The per-call primary (either the archive: Some(name) argument or the default write-target) is never filtered. If a caller explicitly asks to write to archive X, the filter on X is not consulted. Filters only gate secondary fanout.

Boot validator

At boot (coordinator/boot.rs), the validator rejects configurations where:

  • No write-active + enabled registration has an unrestricted filter (ArchiveFilter::default() is unrestricted). Prevents configurations that silently drop all writes.
  • An archive's filter has include_connectors = Some(empty set) — equivalent to "reject everything", which is almost certainly a config bug.
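The two rejection rules can be expressed as a short validation pass. The sketch below is illustrative only: Registration, validate, and is_unrestricted are hypothetical names, errors are plain strings, and applying the empty-set check to disabled archives as well is an assumption.

```rust
use std::collections::HashSet;

struct ArchiveFilter {
    include_connectors: Option<HashSet<u128>>,
    exclude_connectors: HashSet<u128>,
    include_tags: HashSet<String>,
    exclude_tags: HashSet<String>,
    include_hidden: bool,
}

impl ArchiveFilter {
    /// Mirrors what the default filter is described as: no gates at all.
    fn unrestricted() -> Self {
        ArchiveFilter {
            include_connectors: None,
            exclude_connectors: HashSet::new(),
            include_tags: HashSet::new(),
            exclude_tags: HashSet::new(),
            include_hidden: true,
        }
    }

    fn is_unrestricted(&self) -> bool {
        self.include_connectors.is_none()
            && self.exclude_connectors.is_empty()
            && self.include_tags.is_empty()
            && self.exclude_tags.is_empty()
            && self.include_hidden
    }
}

/// Hypothetical subset of a registration that the validator needs.
struct Registration {
    name: &'static str,
    enabled: bool,
    write_active: bool,
    filter: ArchiveFilter,
}

fn validate(regs: &[Registration]) -> Result<(), String> {
    // Rule 2: Some(empty set) rejects everything and is treated as a config bug.
    for r in regs {
        if matches!(&r.filter.include_connectors, Some(set) if set.is_empty()) {
            return Err(format!("archive '{}': include_connectors is an empty set", r.name));
        }
    }
    // Rule 1: at least one enabled, write-active archive must accept every write.
    let has_catch_all = regs
        .iter()
        .any(|r| r.enabled && r.write_active && r.filter.is_unrestricted());
    if !has_catch_all {
        return Err("no enabled, write-active archive has an unrestricted filter".to_string());
    }
    Ok(())
}
```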

See docs/plans/2026-04-21-archivist-phase4-design.md §4 for the full design rationale.

Phase 5: Importers (2026-04-21)

The import:: module centres on an Importer trait with per-source implementations under import::sources::*. Each source produces a ParsedConversation (ChatGPT) / ParsedSession (Codex) / session directory walk (Claude) and feeds the results through the common import_sessions orchestrator, which fires ImportProgressEvents on a bounded ImportProgressSink.

Importer trait

Every importer declares a config_shape() so UIs can render a dynamic form; a discover() that returns an ImportDiscovery preview; and an import() that does the actual work. All three methods are async.

The trait lives in import::trait_def. Shape types (ImportConfig, ImportTarget, ConfigField, ConfigFieldKind, ImportError) are serialisable and safe to cross the WASM boundary.

Registry

ImporterRegistry::with_defaults() registers one importer for every enabled importer-* Cargo feature. Currently: claude, chatgpt, codex. The registry is constructed at boot and stored on AppState.

Progress sink

ImportProgressSink::channel() returns a bounded mpsc pair. Non-terminal events use try_send (dropped on full); terminal events use send().await so consumers always see the final state.
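The drop-on-full versus always-deliver split can be demonstrated with the standard library's bounded channel. Note the substitution: this sketch uses std::sync::mpsc::sync_channel with a blocking send in place of an async mpsc and send().await, and ProgressSink, progress, finish, and the event variants are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Simplified progress events; `Finished` is the terminal one.
#[derive(Debug, PartialEq)]
enum ImportProgressEvent {
    SessionImported(u32),
    Finished { imported: u32 },
}

struct ProgressSink {
    tx: SyncSender<ImportProgressEvent>,
}

impl ProgressSink {
    /// Creates a bounded sink/receiver pair.
    fn channel(capacity: usize) -> (Self, Receiver<ImportProgressEvent>) {
        let (tx, rx) = sync_channel(capacity);
        (ProgressSink { tx }, rx)
    }

    /// Non-terminal: never waits; returns false if the event was dropped.
    fn progress(&self, ev: ImportProgressEvent) -> bool {
        self.tx.try_send(ev).is_ok()
    }

    /// Terminal: waits for capacity so the final state is always delivered.
    fn finish(&self, ev: ImportProgressEvent) -> bool {
        self.tx.send(ev).is_ok()
    }
}
```

With a capacity of 2 and no consumer running, a third progress event is dropped, while finish waits until a consumer has made room.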

Source crates

  • dirigent_chatgpt — parses conversations.json from the OpenAI data export.
  • dirigent_codex — parses *.jsonl session files under ~/.codex/sessions.

Both crates contain pure parser types and depend on no dirigent-specific types.

See docs/plans/2026-04-21-archivist-phase5-design.md.

Future Enhancements

  • Indexed SearchBackend implementations (tantivy/sqlite) — currently content search is ripgrep-based in the api package
  • Session splitting and lineage management (mutations.ndjson)
  • Knowledge overview generation (chat.md exports)
  • Embedding storage and search (embeds/)
  • Network RPC interface for remote archivist
  • Compaction and pruning policies
  • Additional concrete backends (e.g. SQLite, remote)

Documentation

  • Package README: ./README.md - User-facing overview
  • Architecture Docs: ../../docs/building/05_archivist/ - Design and planning
  • API Docs: Run cargo doc --package dirigent_archivist --open
  • Examples: See examples/ directory for working code samples