# Package: dirigent_archivist

Persistent storage for all agentic interactions in Dirigent.

## Quick Facts

- **Type**: Library
- **Main Entry**: src/lib.rs
- **Dependencies**: dirigent_protocol, uuid, chrono, serde, tokio, tracing, thiserror, sha2, hex, async-trait
- **Status**: Complete - Production ready with comprehensive tests

## Purpose

The Archivist provides file-based archival storage for all session data, messages, and attachments in Dirigent. It implements an archive-first architecture with connector API fallback, using NDJSON, JSON, and TSV formats for durability and human-readability.

## Key Features

- **File-based Storage**: NDJSON for messages, JSON for metadata, TSV for indices
- **Content-Addressable Files**: SHA-256 based storage for attachments with automatic deduplication
- **Session Lineage**: Track splits, continuations, and mutations with parent references
- **Connector Registry**: Coordinate UID assignment across connectors with collision detection
- **Event Streaming**: Real-time updates via EventHandler subscribing to dirigent_protocol events
- **Archive-First Design**: Read from archive first, fall back to connector API when needed
- **Caching**: In-memory caching of connector and session mappings for performance

## Architecture

The Archivist is built on three core architectural principles:

### 1. Archive-First Read Strategy

The Archivist is the primary source of truth for historical data:

- UI and APIs query the archive first
- Only fall back to connector APIs if data is not in archive
- This enables offline access and consistent history across restarts

### 2. Write-Through Event Capture (Append-Only)

The EventHandler subscribes to the global event stream from dirigent_core:

- Captures session creation, message streaming, and tool calls in real-time
- Uses MessageAccumulator to assemble streaming chunks into complete messages
- Writes complete messages to archive immediately upon finalization
- No polling required - fully event-driven
- **Append-only writes**: Messages are appended as events arrive, NOT in chronological order
- File order reflects event timing, not message timestamps

### 3. File-Based Storage with Sort-on-Read

All data is stored in human-readable, grep-able formats:

- **NDJSON** (Newline-Delimited JSON): Incremental append-only logs for messages and mappings
- **JSON**: Structured metadata for sessions and connectors
- **TSV** (Tab-Separated Values): Fast indices for cross-references
- **Content-Addressed Files**: Binary attachments stored by SHA-256 hash for deduplication
- **Sort-on-Read**: `get_messages()` sorts by timestamp and message_id to ensure chronological order despite append-only writes

## Backend Trait Layer (Phase 2)

The archivist uses a trait-based backend abstraction. `ArchiveBackend` defines the mandatory session and message primitives every backend must provide, plus `as_xxx()` accessors returning optional sub-traits:

- `SearchBackend`: reserved for Phase 3+ indexed backends (not wired)
- `DagBackend`: session lineage DAG edges
- `MetaEventsBackend`: ACP connection lifecycle events
- `ConnectorRegistryBackend`: per-archive connector metadata
- `SessionMappingBackend`: native↔scroll session ID mapping

`JsonlBackend` is the Phase 2 concrete implementation (file-based NDJSON/JSON/TSV) and opts into every sub-trait except `SearchBackend` (content search continues to be served by ripgrep via `crates/api/src/archivist/search_task.rs`).
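
The accessor pattern described above can be sketched as follows. This is a minimal, synchronous illustration rather than the crate's actual definitions: the real traits are async, and `FileBackend`, `children`, `search`, `name`, `as_dag`, `as_search`, and `children_or_empty` are hypothetical names chosen for the example.

```rust
// Sketch only: method names and `FileBackend` are assumptions for illustration;
// the real `ArchiveBackend` and its sub-traits are async and richer.

/// Optional capability: session lineage DAG edges.
trait DagBackend {
    fn children(&self, parent: &str) -> Vec<String>;
}

/// Optional capability: indexed content search.
trait SearchBackend {
    fn search(&self, query: &str) -> Vec<String>;
}

/// Mandatory primitives plus accessors for the optional sub-traits.
trait ArchiveBackend {
    fn name(&self) -> &str;

    /// A backend opts in by overriding an accessor; the default is "not supported".
    fn as_dag(&self) -> Option<&dyn DagBackend> {
        None
    }
    fn as_search(&self) -> Option<&dyn SearchBackend> {
        None
    }
}

/// A backend that supports lineage but not indexed search.
struct FileBackend {
    edges: Vec<(String, String)>, // (parent, child)
}

impl DagBackend for FileBackend {
    fn children(&self, parent: &str) -> Vec<String> {
        self.edges
            .iter()
            .filter(|(p, _)| p == parent)
            .map(|(_, c)| c.clone())
            .collect()
    }
}

impl ArchiveBackend for FileBackend {
    fn name(&self) -> &str {
        "file"
    }
    fn as_dag(&self) -> Option<&dyn DagBackend> {
        Some(self)
    }
}

/// Coordinator-style helper: use the capability if present, otherwise skip it.
fn children_or_empty(backend: &dyn ArchiveBackend, parent: &str) -> Vec<String> {
    match backend.as_dag() {
        Some(dag) => dag.children(parent),
        None => Vec::new(),
    }
}
```

Because each accessor defaults to `None`, a caller can probe any backend for a capability without downcasting, and a backend that lacks the capability needs no extra code.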

The `Archivist` struct (in `src/coordinator/`) owns a registry of backends keyed by archive name and performs orchestration (alias detection, session lineage, move/copy, DAG walks, archive lifecycle). Consumers hold `Arc<Archivist>` directly; the coordinator is concrete, not a trait.

See `docs/plans/2026-04-18-archivist-phase2-design.md` for design rationale.

## Multi-Backend Registry (Phase 3)

The coordinator (`Archivist`) holds a `Vec` of registrations sorted by `read_priority` instead of a flat `HashMap` of backends. Each registration carries:

- `backend`: an `Arc`-shared backend handle + its declared capabilities
- `failure_mode`: `Required` (must succeed) | `BestEffort` (errors log + drift health)
- `read_priority`: lower = tried first for reads; also selects the default write target when no archive is named
- `write_active`: participates in fanout writes
- `enabled`: kill-switch without removing config
- `write_policy`: `Inline` (default; `await` per call) or `Queued` (mpsc + batch_window + overflow policy)
- Runtime state: `last_health`, `last_error`, `consecutive_failures` (all `Arc`-wrapped, shared with the writer task when queued)
- Optional `writer` handle (`Some` iff `write_policy = Queued`)

Backends are declared in `dirigent.toml` under `[[archives]]` and constructed at boot via `Archivist::from_config(cfg, &BackendRegistry)`. Add a new backend type by implementing `BackendFactory` and registering it on the `BackendRegistry` before `from_config`.

### Reads

`get_session`, `get_messages_paged`, `count_messages`, `get_meta_events`, `get_children`, etc. walk the registry in priority order via `read_walk_per_session(scroll_id, predicate, op)`. The predicate capability-filters; `Unavailable` backends are skipped. The first backend that returns `Some(value)` wins and its name is cached against the `scroll_id` in a positive LRU (capacity 10_000). Subsequent reads for the same `scroll_id` short-circuit to the cached backend before falling back to the full priority walk.
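
The per-session read walk can be sketched roughly as below. This is a simplified, synchronous model under stated assumptions: `Registration`, `Health`, and `ReadWalker` are hypothetical stand-ins, capability filtering is omitted, and an unbounded `HashMap` takes the place of the bounded LRU.

```rust
use std::collections::HashMap;

// Sketch only: all type names here are hypothetical; the real coordinator is
// async, capability-filters each backend, and uses a bounded positive LRU.

#[derive(Clone, Copy, PartialEq)]
enum Health {
    Healthy,
    Unavailable,
}

struct Registration {
    name: String,
    read_priority: u32,
    enabled: bool,
    health: Health,
    sessions: HashMap<String, String>, // scroll_id -> payload
}

struct ReadWalker {
    regs: Vec<Registration>,        // kept sorted by read_priority (lower first)
    cache: HashMap<String, String>, // scroll_id -> name of the backend that answered
}

impl ReadWalker {
    fn new(mut regs: Vec<Registration>) -> Self {
        regs.sort_by_key(|r| r.read_priority);
        Self { regs, cache: HashMap::new() }
    }

    fn usable(r: &Registration) -> bool {
        r.enabled && r.health != Health::Unavailable
    }

    /// Returns (backend name, payload) from the first backend that has the session.
    fn get_session(&mut self, scroll_id: &str) -> Option<(String, String)> {
        // 1. Short-circuit to the cached backend if it still answers.
        if let Some(name) = self.cache.get(scroll_id) {
            let hit = self
                .regs
                .iter()
                .find(|r| &r.name == name && Self::usable(r))
                .and_then(|r| r.sessions.get(scroll_id));
            if let Some(v) = hit {
                return Some((name.clone(), v.clone()));
            }
        }
        // 2. Otherwise fall back to the full priority walk.
        let found = self
            .regs
            .iter()
            .filter(|r| Self::usable(r))
            .find_map(|r| r.sessions.get(scroll_id).map(|v| (r.name.clone(), v.clone())));
        if let Some((name, _)) = &found {
            self.cache.insert(scroll_id.to_string(), name.clone());
        }
        found
    }
}
```

Only hits are cached in this sketch, mirroring the "positive" cache described above: a miss always triggers a fresh walk, so a session that appears later in some backend is still found.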

Collection-shape reads (`list_sessions_paged`, `list_connectors`, `list_meta_sessions`, `find_meta_session_by_client`) use `read_walk_collection`: the first enabled backend that can answer wins, with no cache and no aggregation across backends. Phase 3 explicitly defers cross-backend merge/dedup to a later phase.

### Writes

Mutating methods (`append_messages`, `register_session`, `update_session_*`, `append_meta_events`, `append_dag_edge`, `clear_session_messages`, `update_connector_fingerprint`) resolve a primary (per-call `archive: Some(name)` override or the default-write target) and fan out to every other `enabled && write_active` backend that has the required capability. Capability-mismatched backends are skipped with a debug `capability_skip` log (never an error). `Required` failures propagate to the caller; `BestEffort` failures log + drift health.

`register_connector` currently does NOT fan out: alias detection + the tri-state `Accepted`/`Aliased`/`Rejected` return shape make replication non-trivial. Fanout for connectors is deferred; single-backend setups are unaffected.

For `write_policy = Queued` backends, the primary/secondary write paths enqueue a `WriteOp` into the backend's writer task instead of awaiting. Errors drift the backend's health but do not propagate to the caller. Coalescing merges consecutive `AppendMessages`/`AppendMetaEvents` for the same `scroll_id` within `batch_window_ms`.

### Cross-backend operations

- `delete_session(scroll_id, _)` fans out to every enabled backend that has the session. Copies in `write_active=false` backends produce `ArchivistError::DeleteOnReadOnlyBackend` (write-active copies are still deleted); cache invalidated regardless of outcome.
- `copy_session(scroll_id, from, to)` reads from `from`, writes to `to`, including DAG and meta-events when both sides have the capability. The source remains canonical (the cache is NOT rewritten).
- `move_session(scroll_id, from, to)` is `copy + delete-from-source`. If the source-side delete fails after the copy succeeded, `ArchivistError::PartialMove { copied_to, delete_error }` is returned so the caller knows the session now lives in both places.

The Phase 2 connector-aware `move_session(scroll_id, target_connector_uid, _)` and `copy_session(scroll_id, target_connector_uid, _)` survived the Phase 3 rename as `move_session_to_connector` / `copy_session_to_connector`. Their bulk variant is `move_sessions_to_connector`.

### Health

`HealthStatus` drifts on every coordinator call that observes a backend:

- Successful write → `Healthy`; `consecutive_failures` reset to 0.
- Successful read → `Healthy` (only rescues `Degraded`; does not reset the counter).
- Write failure → `Degraded { reason }`; `consecutive_failures += 1`; after K = 5 consecutive failures drifts to `Unavailable { reason }`. Reads skip `Unavailable` backends; writes against an `Unavailable` `Required` backend fail, while writes against an `Unavailable` `BestEffort` backend are still attempted.
- Read failure alone never drifts past `Degraded`; writes are the authoritative health signal.

`list_archives_with_health()` returns a `Vec` snapshot of every registration: name, type, capabilities, health, last_error, and queue_depth (for queued backends).

### Lifecycle

Phase 3 is **startup-only**. `add_archive` / `remove_archive` / `set_default_archive` on the coordinator return `ArchivistError::DynamicRegistryUnsupported`. To change the registry, edit `dirigent.toml` and restart the server.

`Archivist::shutdown()` drains queued writer tasks (sends `WriteOp::Shutdown` on each writer's mpsc and awaits ack); call it before process exit.

Test-only constructors `Archivist::from_registrations(regs)` and `SessionMetadata::stub(scroll_id)` live under `#[cfg(any(test, feature = "test-utils"))]` for integration tests that bypass the factory.
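
The drift rules from the Health section can be modelled as a small state machine. The sketch below is an illustration under assumptions: `BackendHealth` and its `on_*` methods are hypothetical names, and it treats the fifth consecutive write failure as the one that triggers `Unavailable`; only the transition rules and K = 5 come from the text.

```rust
// Sketch only: `BackendHealth` and the `on_*` methods are hypothetical.
// Assumption: the 5th consecutive write failure is the one that flips the
// status to `Unavailable`.

const UNAVAILABLE_AFTER: u32 = 5; // K consecutive write failures

#[derive(Debug, Clone, PartialEq)]
enum HealthStatus {
    Healthy,
    Degraded { reason: String },
    Unavailable { reason: String },
}

struct BackendHealth {
    status: HealthStatus,
    consecutive_failures: u32,
}

impl BackendHealth {
    fn new() -> Self {
        Self { status: HealthStatus::Healthy, consecutive_failures: 0 }
    }

    /// Successful write: healthy again, counter reset.
    fn on_write_ok(&mut self) {
        self.status = HealthStatus::Healthy;
        self.consecutive_failures = 0;
    }

    /// Successful read: rescues `Degraded` only and leaves the counter alone.
    fn on_read_ok(&mut self) {
        if matches!(self.status, HealthStatus::Degraded { .. }) {
            self.status = HealthStatus::Healthy;
        }
    }

    /// Failed write: count it, then degrade or mark unavailable.
    fn on_write_err(&mut self, reason: &str) {
        self.consecutive_failures += 1;
        let reason = reason.to_string();
        self.status = if self.consecutive_failures >= UNAVAILABLE_AFTER {
            HealthStatus::Unavailable { reason }
        } else {
            HealthStatus::Degraded { reason }
        };
    }

    /// Failed read: may degrade a healthy backend but never goes further.
    fn on_read_err(&mut self, reason: &str) {
        if self.status == HealthStatus::Healthy {
            self.status = HealthStatus::Degraded { reason: reason.to_string() };
        }
    }
}
```

One consequence worth noting: because a successful read does not reset the counter, a backend that looks `Healthy` after a read can still become `Unavailable` on its next failed write.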

See `docs/plans/2026-04-19-archivist-phase3-design.md` for the full design rationale, and `examples/multi_backend.rs` for a runnable end-to-end example.

## Module Organization

### Core Modules

- **`lib.rs`**: Public API surface and re-exports
- **`types.rs`**: Core data structures (session metadata, message records, connector info, API types)
- **`error.rs`**: Error types and Result alias for archivist operations

### Backend Layer (`backend/`)

- **`traits.rs`**: `ArchiveBackend` trait + 5 optional sub-traits
- **`capability.rs`**: `ArchiveCapability` enum + `CapabilitySet` type
- **`health.rs`**: `HealthStatus` enum returned by `health_check`
- **`contract.rs`**: Reusable behavioral tests for any `&dyn ArchiveBackend` (cfg-gated)
- **`mock.rs`**: In-memory `MockBackend` for coordinator unit tests (cfg-gated)

### Concrete Backends (`backends/`)

- **`jsonl/`**: The file-based `JsonlBackend`, the only Phase 2 backend. Reuses `storage/` primitives for NDJSON/JSON/TSV operations.

### Coordinator (`coordinator/`)

- **`mod.rs`**: The `Archivist` struct + constructors
- **`archives.rs`**: Archive lifecycle (add/remove/list/default)
- **`connectors.rs`**: Connector registration + alias detection
- **`sessions.rs`**: Session registration, metadata updates, move/copy
- **`meta.rs`**: Meta events, DAG walks, cleanup

### Storage Layer (`storage/`)

Low-level file I/O primitives used by `JsonlBackend`. All storage operations are async and use tokio.

- **`paths.rs`**: ArchivePaths utility for consistent directory structure and path resolution
- **`ndjson.rs`**: Newline-delimited JSON operations (read_ndjson, append_ndjson)
- **`json.rs`**: JSON operations (read_json, write_json)
- **`tsv.rs`**: Tab-separated value operations for connector index
- **`files.rs`**: Content-addressable file storage with SHA-256 hashing and deduplication

### Supporting Modules

- **`registry.rs`**: Archive registry persistence (multi-archive metadata)
- **`migration.rs`**: Single-archive → multi-archive migration path
- **`session.rs`**: Session lineage types shared across layers
- **`accumulator.rs`**: MessageAccumulator for assembling streaming message chunks
- **`backfill.rs`**: Backfill helpers for importing historical sessions
- **`import/`**: External conversation importers (e.g. Claude export)

### Events

- **`events.rs`**: EventHandler for subscribing to dirigent_protocol events and archiving them

## Configuration

The Archivist archive root is determined by `DirigentPaths` resolution:

- Set `DIRIGENT_DATA_DIR` to override the data directory; archives are then stored under `archives/` inside that directory
- Defaults to `~/.local/share/dirigent/archives/` (or platform equivalent)

```bash
DIRIGENT_DATA_DIR=/path/to/data dx serve
```

## Archive Structure

```
dirigent_archive/
├── .contexts/
│   └── {scroll_id:uuidv7}/          # One directory per session
│       ├── session.json             # Session metadata
│       ├── messages.jsonl           # Incremental message log (.ndjson also supported)
│       └── lineage.json             # Session lineage info (optional)
├── .db/
│   └── connectors/
│       ├── index.tsv                # Fast connector lookup (TSV)
│       └── {connector_uid}/
│           ├── connector.json       # Connector metadata
│           └── sessions.jsonl       # Session mappings (.ndjson also supported)
└── .files/
    └── {sha256-hash}                # Content-addressable file storage
```

### Why Hidden Directories?

The `.contexts`, `.db`, and `.files` directories are hidden (prefixed with `.`) to keep the archive root clean for future rendered outputs (like `chat.md` exports). This is similar to how `.git` hides implementation details in a codebase.

## File Formats

### Session Metadata (`session.json`)

```json
{
  "version": 1,
  "scroll_id": "01936e8f-e5a7-7000-8000-000000000001",
  "created_at": "2025-01-01T12:00:00Z",
  "updated_at": "2025-01-01T12:30:00Z",
  "title": "Implement user authentication",
  "connector_uid": "01936e8f-e5a7-7000-8000-000000000002",
  "native_session_id": "abc123",
  "agent_id": null,
  "parent_scroll_id": null,
  "continuation": null,
  "tags": ["backend", "auth"],
  "metadata": {
    "source": "OpenCode",
    "model": "claude-3-5-sonnet"
  }
}
```

### Messages Log (`messages.jsonl`)

One JSON object per line, **append-only**:

```jsonl
{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000003","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":null,"ts":"2025-01-01T12:01:00Z","role":"user","author":"alice","content_md":"How do I implement JWT auth?","attachments":[],"metadata":{}}
{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000004","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":"01936e8f-e5a7-7000-8000-000000000003","ts":"2025-01-01T12:01:10Z","role":"assistant","author":"claude","content_md":"Here's how to implement JWT authentication...","attachments":[],"metadata":{"model":"claude-3-5-sonnet"}}
```

**IMPORTANT - Ordering**: The order of lines in the message log file (`messages.jsonl` or `messages.ndjson`) reflects **event arrival order**, NOT chronological order. Assistant replies often arrive after subsequent user messages due to streaming latency, resulting in non-chronological file order. Always use the `Archivist::get_messages()` API to retrieve messages, which sorts by `ts` (timestamp) and `message_id` (UUIDv7) to guarantee chronological order.
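
The ordering rule can be illustrated with a small dependency-free sketch. `Msg` is a simplified stand-in for the archived record: `ts_millis` and the `u128` id are assumptions made to avoid pulling in `chrono` and `uuid`, and `sort_on_read` is a hypothetical helper, not the crate's API.

```rust
// Sketch only: `Msg` and `sort_on_read` are hypothetical. The real records
// carry RFC 3339 timestamps and UUIDv7 ids rather than integers.

#[derive(Debug, Clone, PartialEq)]
struct Msg {
    ts_millis: u64,   // stand-in for the `ts` field
    message_id: u128, // stand-in for a UUIDv7, which also sorts by creation time
    role: &'static str,
    text: &'static str,
}

/// Sort by timestamp first, then by message id as a deterministic tie-breaker.
fn sort_on_read(mut msgs: Vec<Msg>) -> Vec<Msg> {
    msgs.sort_by_key(|m| (m.ts_millis, m.message_id));
    msgs
}
```

Sorting on a `(timestamp, id)` pair keeps the result deterministic even when two messages share the same timestamp, which is why the id is used as the secondary key.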

**File Format Compatibility**: The archivist supports both `.ndjson` and `.jsonl` file extensions for newline-delimited JSON files. When reading, `.jsonl` is preferred if present, with automatic fallback to `.ndjson` for backward compatibility. Write operations use `.jsonl` (canonical format). Both formats are identical in content - the difference is purely the file extension.

### Connector Index (`index.tsv`)

Tab-separated values with header row:

```tsv
connector_uid	type	title	client_native_id	alias_of	created_at
01936e8f-e5a7-7000-8000-000000000002	OpenCode	OpenCode Local	opencode@http://localhost:12225		2025-01-01T12:00:00Z
```

### Session Mappings (`sessions.jsonl`)

Maps native session IDs from connectors to scroll IDs in the archive:

```jsonl
{"version":1,"connector_uid":"01936e8f-e5a7-7000-8000-000000000002","native_session_id":"abc123","scroll_id":"01936e8f-e5a7-7000-8000-000000000001","created_at":"2025-01-01T12:00:00Z","alias_of":null}
```

## Message Ordering Guarantees

### The Problem: Append Order ≠ Chronological Order

In the event-driven architecture, messages are written to the message log file (`messages.jsonl`) as completion events arrive. Due to streaming latency:

- User messages complete nearly instantly and are written immediately
- Assistant messages stream over time and complete later
- A second user message can be written before the first assistant reply completes

Example scenario:

```
T0: User sends "tell me a joke about snakes" (ts=18:23:36.947)
T1: Assistant starts streaming reply (ts=18:23:36.969)
T2: User sends "now one about tigers" (ts=18:23:49.429)  <- completes and writes BEFORE assistant finishes
T3: Assistant finishes "snakes" reply                     <- writes AFTER "tigers" user message
```

File order in the message log file:

```
1. user "snakes" (18:23:36.947)
2. user "tigers" (18:23:49.429)       <- written second
3. assistant "snakes" (18:23:36.969)  <- written third, but timestamp is earlier!
```

### The Solution: Sort-on-Read

The `Archivist::get_messages()` implementation sorts messages before returning:

1. **Primary sort**: `ts` (timestamp) ascending
2. **Secondary sort**: `message_id` (UUIDv7) ascending for stable tie-breaking

This guarantees chronological order regardless of NDJSON append order:

```
1. user "snakes" (18:23:36.947)
2. assistant "snakes" (18:23:36.969)
3. user "tigers" (18:23:49.429)
```

### Why This Approach?

- **Maintains durability**: Append-only writes preserve crash safety
- **No migration needed**: Existing archives work without rewrites
- **Simple implementation**: No buffered writes or complex write-time ordering
- **Performance trade-off**: Small CPU cost on read (sorting) vs. complex write-time coordination

### Consumer Guidance

- **DO**: Use `Archivist::get_messages()` to retrieve messages
- **DON'T**: Read the message log file directly and assume file order = chronological order
- **UI/API**: Always sort by `ts` then `message_id` for defense in depth
- **Tie-breaking**: Use `message_id` (UUIDv7) as secondary sort for stable ordering when timestamps match

## Key Types

### SessionMetadata

Stores all metadata about a session including:

- **scroll_id**: UUIDv7 identifier for the session
- **connector_uid**: Which connector owns this session
- **native_session_id**: Original session ID from the connector (optional)
- **title**: Optional human-readable session title (see Title Management below)
- **parent_scroll_id**: For session lineage (splits, continuations)
- **continuation**: Type of continuation (SPLIT, COMPACT, REFERENCE, EDIT)
- **tags**: User-defined categorization
- **metadata**: Free-form JSON for connector-specific fields

#### Title Management

Session titles are fully supported and persist across restarts. Titles are stored in the `SessionMetadata` struct and saved to the `session.json` file.

**Setting Titles:**

```rust
// Update title for an existing session
archivist.update_session_metadata(
    scroll_id,
    Some("My Custom Session Title".to_string()),
    None, // model
    None  // archive
).await?;
```

**Default Behavior:**

- New sessions can specify an initial title during registration
- If no title is provided, sessions default to `None`
- The UI typically displays "Untitled" for sessions without titles

**Title Loading:**

- Titles are automatically loaded when retrieving session metadata via `get_session_metadata()`
- Session lists include titles via `list_sessions()` and `list_sessions_all()`
- Titles are part of the `SessionMetadata` struct returned by all session queries

**UI Integration:**

- The web UI displays session titles in the session list and sidebar
- Users can rename sessions via the "Rename" button in the session list view
- Renaming calls `api::archivist::rename_session()` which uses `update_session_metadata()`
- Title changes are persisted immediately and survive application restarts

### MessageRecord

Represents a single message in the archive:

- **message_id**: UUIDv7 identifier
- **session**: scroll_id this message belongs to
- **role**: "user", "assistant", or "system"
- **content_md**: Message content in Markdown format
- **attachments**: References to attached files
- **metadata**: Free-form JSON for connector-specific fields

### ConnectorRecord

Metadata about a connector:

- **connector_uid**: UUIDv7 identifier
- **type**: "OpenCode", "ACP", or custom
- **client_native_id**: Unique identifier from client (e.g., "opencode@http://localhost:12225")
- **alias_of**: If this connector is an alias of another (for deduplication)

## Archivist Public API

The `Archivist` struct (in `coordinator/`) is the main public entry point for archival operations. Consumers hold `Arc<Archivist>` and call inherent methods; there is no `Archivist` trait anymore.
The coordinator resolves the target backend per call (via the optional `archive` argument) and delegates to `ArchiveBackend` methods.

Key method families (see `coordinator/*.rs` for full signatures):

- **Archive lifecycle** (`archives.rs`): `add_archive`, `remove_archive`, `list_archives`, `set_default_archive`
- **Connectors** (`connectors.rs`): `register_connector` with tri-state result (Accepted / Aliased / Rejected), `list_connectors`
- **Sessions** (`sessions.rs`): `register_session`, `get_session_metadata`, `update_session_metadata`, `list_sessions_paged`, `move_session`, `copy_session`, `resolve_session`
- **Messages**: `append_messages`, `get_messages` (sorts by `ts` then `message_id` for stable chronological order)
- **Meta / DAG** (`meta.rs`): meta-event recording, session lineage DAG walks, cleanup routines

## List Filter vs. Full-Text Search

Two distinct query paths exist; do not conflate them.

**List filter**: `Archivist::list_sessions_paged(SessionListQuery)` returns a cursor-paged list of sessions, AND-filtered by `title_query` (substring on title), `tags`, `model_filter` (substring on `metadata.model`), `project_id`, `connector_uid`, and `include_hidden`. This is the right tool for "narrow the list of visible sessions."

**Full-text search**: `api::search_sessions` (in the `api` package, backed by `api::archivist::search_task::SearchTask`) spawns `rg --json` over the archive's `.contexts/` tree to find messages containing text. It streams `SearchExcerpt`s with parsed NDJSON content and supports cancellation via `CancellationToken`. This is the right tool for "find messages containing text."

**Do not extend `list_sessions_paged` to do content search.** Content search belongs in the ripgrep pipeline. Future improvements to content search (indexed backends, relevance scoring) are Phase 2d / Phase 3 concerns.

## JsonlBackend Implementation

The Phase 2 production backend, an implementation of `ArchiveBackend` plus every sub-trait except `SearchBackend`:

- **Thread-safe**: Uses RwLock for in-memory caches
- **Async**: All operations use tokio for non-blocking I/O
- **Caching**: In-memory caches for connector and session mappings
- **Collision Detection**: Tri-state registration for connectors and sessions

Located under `src/backends/jsonl/` and split by concern (`backend.rs`, `connectors.rs`, `dag.rs`, `mapping.rs`, `meta.rs`).

### Caching Strategy

`JsonlBackend` maintains two in-memory caches:

1. **connector_cache**: HashMap
   - Populated on registration
   - Read from TSV index on startup (future enhancement)
2. **session_cache**: HashMap<(Uuid, String), Uuid>
   - Maps (connector_uid, native_session_id) to scroll_id
   - Populated on registration and session resolution
   - Enables fast session lookups without disk I/O

## Event Handling

The EventHandler subscribes to dirigent_protocol events and archives them in real-time:

```rust
// Create archivist and event handler
let archivist = Archivist::new_with_single_archive(archive_path).await?;
let handler = EventHandler::new(Arc::new(archivist));

// Subscribe to event stream from dirigent_core
let events = event_stream.subscribe();

// Run event loop (blocking)
handler.run(events).await;
```

### Supported Events

- **SessionCreated**: Registers new sessions with the archivist
- **MessageCompleted**: Writes finalized messages to the archive
- **SessionUpdate**: Accumulates streaming message chunks
  - AgentMessageChunk
  - UserMessageChunk
  - AgentThoughtChunk
  - ToolCall

### MessageAccumulator

Assembles streaming message chunks into complete messages:

- Accumulates text chunks by message_id
- Tracks thinking blocks separately
- Stores tool calls with input/output
- Finalizes messages on MessageCompleted event
- Converts to MessageRecord for archival

## Integration with dirigent_core

The Archivist integrates with dirigent_core via the global event stream:

1. **CoreRuntime** emits events for all connector operations
2. **EventHandler** subscribes to event stream
3. **MessageAccumulator** assembles streaming chunks
4. **Archivist** writes complete messages to archive

This enables:

- Automatic archival of all sessions and messages
- No polling required - fully event-driven
- Consistent history across restarts
- Offline access to historical data

## Testing

The package has comprehensive test coverage across multiple dimensions:

### Unit Tests

Located in each module (`src/*.rs`, `src/storage/*.rs`):

- Type serialization/deserialization
- UUIDv7 generation and ordering
- Timestamp formatting (RFC 3339)
- Storage operations (NDJSON, JSON, TSV, files)
- Connector registration tri-state logic
- Session registration and alias detection

### Integration Tests

Located in `tests/`:

- `integration_tests.rs`: Full `Archivist` + `JsonlBackend` lifecycle, event handler integration, multi-connector scenarios, session lineage, message accumulation
- `list_sessions_paged_test.rs`, `pagination_test.rs`: List filter + cursor pagination coverage
- `import_claude_idempotency_test.rs`: Claude export re-import idempotency

### Backend Contract Tests

`src/backend/contract.rs` holds reusable async assertions that any `&dyn ArchiveBackend` must pass. `JsonlBackend` and `MockBackend` both run the contract suite; new backends added in Phase 3+ should do the same.

### Examples

Located in `examples/`:

- `basic_usage.rs`: Core archivist operations
- `event_handling.rs`: EventHandler and MessageAccumulator
- `file_storage.rs`: Content-addressable file storage

Run tests:

```bash
cargo test --package dirigent_archivist
```

Run examples:

```bash
cargo run --package dirigent_archivist --example basic_usage
cargo run --package dirigent_archivist --example event_handling
cargo run --package dirigent_archivist --example file_storage
```

## Performance Characteristics

- **Append Operations**: O(1) with sequential file writes
- **Session Lookup**: O(1) with in-memory cache, O(n) cache miss
- **Message Retrieval**: O(n) where n = number of messages (NDJSON parsing)
- **File Storage**: O(1) content-addressable lookup with SHA-256 hashing
- **Connector Index**: O(n) TSV scan, suitable for hundreds of connectors

### Scalability Considerations

- **Large Sessions**: NDJSON is append-only, so reading large sessions requires parsing all lines
- **Many Sessions**: TSV indices are suitable for thousands of sessions per connector
- **File Deduplication**: SHA-256 hashing provides automatic deduplication across sessions
- **Concurrent Access**: RwLock allows multiple concurrent readers, single writer

## Error Handling

The Archivist uses thiserror for rich error types:

```rust
pub enum ArchivistError {
    IoError(std::io::Error),
    SerdeError(serde_json::Error),
    SessionUnknown(Uuid),
    CollisionInconsistent(Uuid),
    // ... etc
}
```

All public APIs return `Result` for explicit error handling.

## Development Notes

- All storage operations are async (using tokio)
- Content-addressable storage uses SHA-256 hashes (hex-encoded)
- Archive directory structure mirrors session/message hierarchy
- UUIDv7 provides time-ordered, sortable identifiers
- RFC 3339 UTC timestamps for all time-based fields
- Schema versioning via `version` field in all records

## Related Packages

- **dirigent_protocol**: Shared types and protocol definitions (dependency)
- **dirigent_core**: Runtime integration for SSE event capture (integration point)
- **api**: Server functions for archive queries (future)
- **web**: UI for archive browsing and search (future)

## Phase 4: `ArchiveFilter` (2026-04-21)

Every `ArchiveRegistration` carries a `filter: ArchiveFilter`. The filter describes which sessions/writes the backend wants to receive. Fields:

- `include_connectors` (optional set): if `Some`, only these connector UIDs pass. `None` means no connector gate.
- `exclude_connectors` (set): always rejected.
- `include_tags` (set): if non-empty, the session must carry at least one matching tag.
- `exclude_tags` (set): any matching tag rejects.
- `include_hidden: bool`: default `true`. If `false`, sessions whose metadata has `"hidden": true` are skipped.

### Primary-always-writes invariant

The per-call primary (either the `archive: Some(name)` argument or the default write-target) is **never** filtered. If a caller explicitly asks to write to archive X, the filter on X is not consulted. Filters only gate secondary fanout.

### Boot validator

At boot (`coordinator/boot.rs`), the validator rejects configurations where:

- No write-active + enabled registration has an **unrestricted** filter (`ArchiveFilter::default()` is unrestricted). Prevents configurations that silently drop all writes.
- An archive's filter has `include_connectors = Some(empty set)`, which is equivalent to "reject everything" and almost certainly a config bug.
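
The filter fields and the two boot checks can be sketched as follows. Several details are assumptions made for the example: connector UIDs are plain strings, `accepts`, `is_unrestricted`, and `validate` are hypothetical helpers, the order in which the gates are evaluated is a guess, and "unrestricted" is taken to mean "cannot reject anything".

```rust
use std::collections::HashSet;

// Sketch only: field names follow the list above; everything else (string
// UIDs, helper names, gate order) is an assumption for illustration.

#[derive(Clone)]
struct ArchiveFilter {
    include_connectors: Option<HashSet<String>>,
    exclude_connectors: HashSet<String>,
    include_tags: HashSet<String>,
    exclude_tags: HashSet<String>,
    include_hidden: bool,
}

impl Default for ArchiveFilter {
    fn default() -> Self {
        Self {
            include_connectors: None,
            exclude_connectors: HashSet::new(),
            include_tags: HashSet::new(),
            exclude_tags: HashSet::new(),
            include_hidden: true, // documented default
        }
    }
}

impl ArchiveFilter {
    /// Decide whether a secondary (fanout) write for this session passes.
    fn accepts(&self, connector: &str, tags: &[&str], hidden: bool) -> bool {
        if let Some(allowed) = &self.include_connectors {
            if !allowed.contains(connector) {
                return false;
            }
        }
        if self.exclude_connectors.contains(connector) {
            return false;
        }
        if tags.iter().any(|t| self.exclude_tags.contains(*t)) {
            return false;
        }
        if !self.include_tags.is_empty() && !tags.iter().any(|t| self.include_tags.contains(*t)) {
            return false;
        }
        self.include_hidden || !hidden
    }

    /// True when the filter cannot reject anything.
    fn is_unrestricted(&self) -> bool {
        self.include_connectors.is_none()
            && self.exclude_connectors.is_empty()
            && self.include_tags.is_empty()
            && self.exclude_tags.is_empty()
            && self.include_hidden
    }
}

/// Boot-time check over the filters of enabled, write-active registrations.
fn validate(filters: &[ArchiveFilter]) -> Result<(), String> {
    if filters
        .iter()
        .any(|f| matches!(&f.include_connectors, Some(s) if s.is_empty()))
    {
        return Err("include_connectors is an empty set".to_string());
    }
    if !filters.iter().any(|f| f.is_unrestricted()) {
        return Err("no unrestricted write-active archive".to_string());
    }
    Ok(())
}
```

In line with the primary-always-writes invariant, a coordinator built on this sketch would call `accepts` only for secondary fanout targets and never for the per-call primary.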

See `docs/plans/2026-04-21-archivist-phase4-design.md` §4 for the full design rationale.

## Phase 5: Importers (2026-04-21)

The `import::` module centres on an `Importer` trait with per-source implementations under `import::sources::*`. Each source produces a `ParsedConversation` (ChatGPT) / `ParsedSession` (Codex) / session directory walk (Claude) and feeds the results through the common `import_sessions` orchestrator, which fires `ImportProgressEvent`s on a bounded `ImportProgressSink`.

### `Importer` trait

Every importer declares a `config_shape()` so UIs can render a dynamic form; a `discover()` that returns an `ImportDiscovery` preview; and an `import()` that does the actual work. All three methods are async. The trait lives in `import::trait_def`. Shape types (`ImportConfig`, `ImportTarget`, `ConfigField`, `ConfigFieldKind`, `ImportError`) are serialisable and safe to cross the WASM boundary.

### Registry

`ImporterRegistry::with_defaults()` registers every enabled `importer-*` feature. Currently: `claude`, `chatgpt`, `codex`. The registry is constructed at boot and stored on `AppState`.

### Progress sink

`ImportProgressSink::channel()` returns a bounded mpsc pair. Non-terminal events use `try_send` (dropped on full); terminal events use `send().await` so consumers always see the final state.

### Source crates

- `dirigent_chatgpt`: parses `conversations.json` from the OpenAI data export.
- `dirigent_codex`: parses `*.jsonl` session files under `~/.codex/sessions`.

Both crates hold pure parser types with zero dirigent-specific types.

See `docs/plans/2026-04-21-archivist-phase5-design.md`.
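
The lossy-progress, reliable-terminal behaviour of the progress sink can be approximated with the standard library's bounded channel. This is a blocking analogue for illustration only: `ProgressEvent`, `Sink`, `progress`, and `finish` are hypothetical names, and the real sink is async and built on a bounded mpsc channel.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

// Sketch only: a blocking std-library analogue of the bounded sink. All names
// here are hypothetical and do not mirror the crate's event types.

#[derive(Debug, PartialEq)]
enum ProgressEvent {
    SessionImported(u32),    // non-terminal
    Finished { total: u32 }, // terminal
}

struct Sink {
    tx: SyncSender<ProgressEvent>,
}

impl Sink {
    fn channel(capacity: usize) -> (Sink, Receiver<ProgressEvent>) {
        let (tx, rx) = sync_channel(capacity);
        (Sink { tx }, rx)
    }

    /// Non-terminal events are lossy: dropped when the buffer is full.
    /// Returns whether the event was actually queued.
    fn progress(&self, n: u32) -> bool {
        self.tx.try_send(ProgressEvent::SessionImported(n)).is_ok()
    }

    /// Terminal events wait for room instead of being dropped, so a live
    /// receiver always observes the final state.
    fn finish(&self, total: u32) -> bool {
        self.tx.send(ProgressEvent::Finished { total }).is_ok()
    }
}
```

The split keeps a slow consumer from stalling the import on every intermediate update while still guaranteeing delivery of the one event that consumers cannot reconstruct.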

## Future Enhancements

- Indexed `SearchBackend` implementations (tantivy/sqlite); currently content search is ripgrep-based in the `api` package
- Session splitting and lineage management (mutations.ndjson)
- Knowledge overview generation (chat.md exports)
- Embedding storage and search (embeds/)
- Network RPC interface for remote archivist
- Compaction and pruning policies
- Additional concrete backends (e.g. SQLite, remote)

## Documentation

- **Package README**: `./README.md` - User-facing overview
- **Architecture Docs**: `../../docs/building/05_archivist/` - Design and planning
- **API Docs**: Run `cargo doc --package dirigent_archivist --open`
- **Examples**: See `examples/` directory for working code samples