sync from monorepo @ 2452e92e

2026-05-08 01:59:04 +02:00
commit b03dc15371
459 changed files with 129586 additions and 0 deletions
# Package: dirigent_archivist
Persistent storage for all agentic interactions in Dirigent.
## Quick Facts
- **Type**: Library
- **Main Entry**: src/lib.rs
- **Dependencies**: dirigent_protocol, uuid, chrono, serde, tokio, tracing, thiserror, sha2, hex, async-trait
- **Status**: Complete - Production ready with comprehensive tests
## Purpose
The Archivist provides file-based archival storage for all session data, messages, and attachments in Dirigent. It implements an archive-first architecture with connector API fallback, using NDJSON, JSON, and TSV formats for durability and human-readability.
## Key Features
- **File-based Storage**: NDJSON for messages, JSON for metadata, TSV for indices
- **Content-Addressable Files**: SHA-256 based storage for attachments with automatic deduplication
- **Session Lineage**: Track splits, continuations, and mutations with parent references
- **Connector Registry**: Coordinate UID assignment across connectors with collision detection
- **Event Streaming**: Real-time updates via EventHandler subscribing to dirigent_protocol events
- **Archive-First Design**: Read from archive first, fall back to connector API when needed
- **Caching**: In-memory caching of connector and session mappings for performance
## Architecture
The Archivist is built on three core architectural principles:
### 1. Archive-First Read Strategy
The Archivist is the primary source of truth for historical data:
- UI and APIs query the archive first
- Only fall back to connector APIs if data is not in archive
- This enables offline access and consistent history across restarts
### 2. Write-Through Event Capture (Append-Only)
The EventHandler subscribes to the global event stream from dirigent_core:
- Captures session creation, message streaming, and tool calls in real-time
- Uses MessageAccumulator to assemble streaming chunks into complete messages
- Writes complete messages to archive immediately upon finalization
- No polling required - fully event-driven
- **Append-only writes**: Messages are appended as events arrive, NOT in chronological order
- File order reflects event timing, not message timestamps
### 3. File-Based Storage with Sort-on-Read
All data is stored in human-readable, grep-able formats:
- **NDJSON** (Newline-Delimited JSON): Incremental append-only logs for messages and mappings
- **JSON**: Structured metadata for sessions and connectors
- **TSV** (Tab-Separated Values): Fast indices for cross-references
- **Content-Addressed Files**: Binary attachments stored by SHA-256 hash for deduplication
- **Sort-on-Read**: `get_messages()` sorts by timestamp and message_id to ensure chronological order despite append-only writes
## Backend Trait Layer (Phase 2)
The archivist uses a trait-based backend abstraction. `ArchiveBackend`
defines the mandatory session and message primitives every backend must
provide, plus `as_xxx()` accessors returning optional sub-traits:
- `SearchBackend` — reserved for Phase 3+ indexed backends (not wired)
- `DagBackend` — session lineage DAG edges
- `MetaEventsBackend` — ACP connection lifecycle events
- `ConnectorRegistryBackend` — per-archive connector metadata
- `SessionMappingBackend` — native↔scroll session ID mapping
`JsonlBackend` is the Phase 2 concrete implementation (file-based
NDJSON/JSON/TSV) and opts into every sub-trait except `SearchBackend`
(content search continues to be served by ripgrep via
`crates/api/src/archivist/search_task.rs`).
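The accessor pattern described above can be sketched as follows. This is a simplified illustration, not the crate's actual trait definitions: the method names (`edge_count`, `search`, `name`), the `FileBackend` type, and the `describe` helper are all hypothetical, and the real traits are async.

```rust
// Sketch only: trait and method names are simplified stand-ins.
trait DagBackend {
    fn edge_count(&self) -> usize;
}

trait SearchBackend {
    fn search(&self, query: &str) -> Vec<String>;
}

trait ArchiveBackend {
    fn name(&self) -> &str;
    // Optional capabilities default to "not supported".
    fn as_dag(&self) -> Option<&dyn DagBackend> {
        None
    }
    fn as_search(&self) -> Option<&dyn SearchBackend> {
        None
    }
}

// Hypothetical backend that opts into the DAG sub-trait only.
struct FileBackend {
    edges: Vec<(u32, u32)>,
}

impl DagBackend for FileBackend {
    fn edge_count(&self) -> usize {
        self.edges.len()
    }
}

impl ArchiveBackend for FileBackend {
    fn name(&self) -> &str {
        "file"
    }
    // Search stays at the default `None`.
    fn as_dag(&self) -> Option<&dyn DagBackend> {
        Some(self)
    }
}

/// Callers probe for a capability instead of downcasting.
fn describe(backend: &dyn ArchiveBackend) -> String {
    let dag = backend.as_dag().map(|d| d.edge_count());
    let search = backend.as_search().is_some();
    format!("{}: dag_edges={:?}, search={}", backend.name(), dag, search)
}
```

Default accessor bodies returning `None` mean a new backend only has to override the capabilities it actually supports.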
The `Archivist` struct (in `src/coordinator/`) owns a registry of backends
keyed by archive name and performs orchestration (alias detection, session
lineage, move/copy, DAG walks, archive lifecycle). Consumers hold
`Arc<Archivist>` directly — the coordinator is concrete, not a trait.
See `docs/plans/2026-04-18-archivist-phase2-design.md` for design rationale.
## Multi-Backend Registry (Phase 3)
The coordinator (`Archivist`) holds `Vec<Arc<ArchiveRegistration>>` sorted
by `read_priority` instead of a flat `HashMap<name, Arc<dyn ArchiveBackend>>`.
Each registration carries:
- `backend: Arc<dyn ArchiveBackend>` + its declared capabilities
- `failure_mode`: `Required` (must succeed) | `BestEffort` (errors log + drift health)
- `read_priority`: lower = tried first for reads; also selects the default
write target when no archive is named
- `write_active`: participates in fanout writes
- `enabled`: kill-switch without removing config
- `write_policy`: `Inline` (default; `await` per call) or `Queued`
(mpsc + batch_window + overflow policy)
- Runtime state: `last_health`, `last_error`, `consecutive_failures`
(all `Arc<RwLock<_>>`, shared with the writer task when queued)
- Optional `writer: Option<WriterHandle>` (Some iff `write_policy = Queued`)
Backends are declared in `dirigent.toml` under `[[archives]]` and
constructed at boot via `Archivist::from_config(cfg, &BackendRegistry)`.
Add a new backend type by implementing `BackendFactory` and registering
it on the `BackendRegistry` before `from_config`.
### Reads
`get_session`, `get_messages_paged`, `count_messages`, `get_meta_events`,
`get_children`, etc. walk the registry in priority order via
`read_walk_per_session(scroll_id, predicate, op)`. The predicate
capability-filters; `Unavailable` backends are skipped. The first backend
that returns `Some(value)` wins and its name is cached against the
`scroll_id` in a positive LRU (capacity 10_000). Subsequent reads for the
same `scroll_id` short-circuit to the cached backend before falling back
to the full priority walk.
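A minimal sketch of the per-session read walk, under simplifying assumptions: backends are modelled as in-memory maps, `scroll_id` is a `u64`, and a plain `HashMap` stands in for the positive LRU. The `Coordinator` and `Registration` shapes here are hypothetical, not the crate's types.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Health {
    Healthy,
    Unavailable,
}

struct Registration {
    name: &'static str,
    read_priority: u32,
    enabled: bool,
    health: Health,
    // Stand-in for a backend: scroll_id -> session title.
    sessions: HashMap<u64, &'static str>,
}

struct Coordinator {
    // Kept sorted by `read_priority` (lower first).
    registrations: Vec<Registration>,
    // Plain map standing in for the positive LRU cache.
    cache: HashMap<u64, &'static str>,
}

impl Coordinator {
    fn new(mut registrations: Vec<Registration>) -> Self {
        registrations.sort_by_key(|r| r.read_priority);
        Self { registrations, cache: HashMap::new() }
    }

    fn usable(r: &Registration) -> bool {
        r.enabled && r.health != Health::Unavailable
    }

    /// Returns (backend name, title) from the first backend holding the session.
    fn get_session(&mut self, scroll_id: u64) -> Option<(&'static str, &'static str)> {
        // Fast path: try the backend that answered last time.
        if let Some(&cached) = self.cache.get(&scroll_id) {
            let hit = self
                .registrations
                .iter()
                .find(|r| r.name == cached && Self::usable(r))
                .and_then(|r| r.sessions.get(&scroll_id).copied());
            if let Some(title) = hit {
                return Some((cached, title));
            }
        }
        // Slow path: full walk in priority order; first `Some` wins.
        let found = self
            .registrations
            .iter()
            .filter(|r| Self::usable(r))
            .find_map(|r| r.sessions.get(&scroll_id).map(|t| (r.name, *t)));
        if let Some((name, _)) = found {
            self.cache.insert(scroll_id, name);
        }
        found
    }
}
```

Only hits are cached, so a session that appears in a backend later is still found by the next full walk.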
Collection-shape reads (`list_sessions_paged`, `list_connectors`,
`list_meta_sessions`, `find_meta_session_by_client`) use
`read_walk_collection` — first enabled backend that can answer wins, no
cache, no aggregation across backends. Phase 3 explicitly defers
cross-backend merge/dedup to a later phase.
### Writes
Mutating methods (`append_messages`, `register_session`, `update_session_*`,
`append_meta_events`, `append_dag_edge`, `clear_session_messages`,
`update_connector_fingerprint`) resolve a primary (per-call `archive:
Some(name)` override or the default-write target) and fan out to every
other `enabled && write_active` backend that has the required capability.
Capability-mismatched backends are skipped with a debug `capability_skip`
log (never an error). `Required` failures propagate to the caller;
`BestEffort` failures log + drift health.
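The fanout rules above can be sketched like this. It is an illustration under assumptions: the `Target` and `FanoutReport` types are hypothetical, writes are simulated by a `write_ok` flag, and the primary is assumed to be attempted regardless of its `enabled`/`write_active` flags.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum FailureMode {
    Required,
    BestEffort,
}

struct Target {
    name: &'static str,
    enabled: bool,
    write_active: bool,
    has_capability: bool,
    failure_mode: FailureMode,
    // Test hook: whether the simulated write succeeds.
    write_ok: bool,
}

#[derive(Debug, Default, PartialEq)]
struct FanoutReport {
    written: Vec<&'static str>,
    capability_skipped: Vec<&'static str>,
    logged_failures: Vec<&'static str>,
}

fn fan_out(primary: &str, targets: &[Target]) -> Result<FanoutReport, String> {
    let mut report = FanoutReport::default();
    for t in targets {
        // Assumption: the primary is always attempted; every other target
        // must be enabled and write-active to take part in the fanout.
        let eligible = t.name == primary || (t.enabled && t.write_active);
        if !eligible {
            continue;
        }
        if !t.has_capability {
            // A capability mismatch is recorded, never surfaced as an error.
            report.capability_skipped.push(t.name);
            continue;
        }
        match (t.write_ok, t.failure_mode) {
            (true, _) => report.written.push(t.name),
            (false, FailureMode::Required) => {
                // Required failures propagate to the caller.
                return Err(format!("required backend '{}' failed", t.name));
            }
            // BestEffort failures are only recorded (logged) here.
            (false, FailureMode::BestEffort) => report.logged_failures.push(t.name),
        }
    }
    Ok(report)
}
```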
`register_connector` currently does NOT fan out — alias detection + the
tri-state `Accepted`/`Aliased`/`Rejected` return shape make replication
non-trivial. Fanout for connectors is deferred; single-backend setups are
unaffected.
For `write_policy = Queued` backends, the primary/secondary write paths
enqueue a `WriteOp` into the backend's writer task instead of awaiting.
Errors drift the backend's health but do not propagate to the caller.
Coalescing merges consecutive `AppendMessages`/`AppendMetaEvents` for the
same `scroll_id` within `batch_window_ms`.
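A rough sketch of the coalescing step, assuming the writer task has already collected one batch of operations for the current window. The `WriteOp` variants and their fields here are simplified stand-ins, and treating any non-append operation as a barrier is an assumption of this sketch.

```rust
#[derive(Debug, Clone, PartialEq)]
enum WriteOp {
    AppendMessages { scroll_id: u64, messages: Vec<String> },
    RegisterSession { scroll_id: u64 },
}

/// Merges consecutive appends for the same session.
/// Any other operation in between keeps the appends separate.
fn coalesce(batch: Vec<WriteOp>) -> Vec<WriteOp> {
    let mut out: Vec<WriteOp> = Vec::new();
    for op in batch {
        if let WriteOp::AppendMessages { scroll_id, messages } = &op {
            if let Some(WriteOp::AppendMessages { scroll_id: last_id, messages: last }) =
                out.last_mut()
            {
                if *last_id == *scroll_id {
                    // Same session as the previous op: extend it in place.
                    last.extend(messages.iter().cloned());
                    continue;
                }
            }
        }
        out.push(op);
    }
    out
}
```

Merging only adjacent operations preserves the relative order of writes within a session.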
### Cross-backend operations
- `delete_session(scroll_id, _)` fans out to every enabled backend that has
the session. Copies in `write_active=false` backends produce
`ArchivistError::DeleteOnReadOnlyBackend` (write-active copies are still
deleted); cache invalidated regardless of outcome.
- `copy_session(scroll_id, from, to)` reads from `from`, writes to `to`,
including DAG and meta-events when both sides have the capability. The
source remains canonical (the cache is NOT rewritten).
- `move_session(scroll_id, from, to)` is `copy + delete-from-source`. If
the source-side delete fails after the copy succeeded,
`ArchivistError::PartialMove { copied_to, delete_error }` is returned so
the caller knows the session now lives in both places.
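The `copy + delete-from-source` shape of `move_session`, including the partial-move outcome, can be sketched as below. The `Store` type, the `deletes_fail` test hook, and the `MoveError` enum are hypothetical stand-ins for the real backends and `ArchivistError`.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum MoveError {
    NotFound,
    PartialMove { copied_to: String, delete_error: String },
}

struct Store {
    sessions: HashMap<u64, String>,
    // Test hook simulating a backend that refuses deletes.
    deletes_fail: bool,
}

impl Store {
    fn delete(&mut self, id: u64) -> Result<(), String> {
        if self.deletes_fail {
            return Err("delete refused".to_string());
        }
        self.sessions.remove(&id);
        Ok(())
    }
}

fn move_session(
    id: u64,
    from: &mut Store,
    to: &mut Store,
    to_name: &str,
) -> Result<(), MoveError> {
    // Step 1: copy. Nothing has changed yet if the source lacks the session.
    let data = from.sessions.get(&id).cloned().ok_or(MoveError::NotFound)?;
    to.sessions.insert(id, data);
    // Step 2: delete from the source. On failure the copy is kept, and the
    // error tells the caller that the session now exists in both stores.
    from.delete(id).map_err(|e| MoveError::PartialMove {
        copied_to: to_name.to_string(),
        delete_error: e,
    })
}
```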
The Phase 2 connector-aware `move_session(scroll_id, target_connector_uid, _)`
and `copy_session(scroll_id, target_connector_uid, _)` were renamed in Phase 3
to `move_session_to_connector` and `copy_session_to_connector`.
Their bulk variant is `move_sessions_to_connector`.
### Health
`HealthStatus` drifts on every coordinator call that observes a backend:
- Successful write → `Healthy`; `consecutive_failures` reset to 0.
- Successful read → `Healthy` (only rescues `Degraded`; does not reset the counter).
- Write failure → `Degraded { reason }`; `consecutive_failures += 1`; after
K = 5 consecutive failures drifts to `Unavailable { reason }`. Reads skip
`Unavailable` backends; writes against an `Unavailable` `Required`
backend fail, while writes against an `Unavailable` `BestEffort` backend
are still attempted.
- Read failure alone never drifts past `Degraded`; writes are the
authoritative health signal.
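The drift rules above amount to a small state machine. The sketch below is one reading of them: the `BackendHealth` struct and its method names are hypothetical, and it assumes the transition to `Unavailable` happens on the fifth consecutive write failure.

```rust
const MAX_CONSECUTIVE_FAILURES: u32 = 5; // K in the rules above

#[derive(Debug, Clone, PartialEq)]
enum HealthStatus {
    Healthy,
    Degraded { reason: String },
    Unavailable { reason: String },
}

struct BackendHealth {
    status: HealthStatus,
    consecutive_failures: u32,
}

impl BackendHealth {
    fn new() -> Self {
        Self { status: HealthStatus::Healthy, consecutive_failures: 0 }
    }

    fn on_write_ok(&mut self) {
        self.status = HealthStatus::Healthy;
        self.consecutive_failures = 0;
    }

    fn on_read_ok(&mut self) {
        // Only rescues `Degraded`; the failure counter is left untouched.
        if matches!(self.status, HealthStatus::Degraded { .. }) {
            self.status = HealthStatus::Healthy;
        }
    }

    fn on_write_err(&mut self, reason: &str) {
        self.consecutive_failures += 1;
        let reason = reason.to_string();
        self.status = if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES {
            HealthStatus::Unavailable { reason }
        } else {
            HealthStatus::Degraded { reason }
        };
    }

    fn on_read_err(&mut self, reason: &str) {
        // Read failures never push a backend past `Degraded`.
        if !matches!(self.status, HealthStatus::Unavailable { .. }) {
            self.status = HealthStatus::Degraded { reason: reason.to_string() };
        }
    }
}
```

Because a successful read does not reset the counter, a backend that was rescued to `Healthy` by a read can still become `Unavailable` on its next failed write.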
`list_archives_with_health()` returns a `Vec<ArchiveStatus>` snapshot of
every registration: name, type, capabilities, health, last_error, and
queue_depth (for queued backends).
### Lifecycle
Phase 3 is **startup-only**. `add_archive` / `remove_archive` /
`set_default_archive` on the coordinator return
`ArchivistError::DynamicRegistryUnsupported`. To change the registry,
edit `dirigent.toml` and restart the server. `Archivist::shutdown()`
drains queued writer tasks (sends `WriteOp::Shutdown` on each writer's
mpsc and awaits ack); call it before process exit.
Test-only constructors `Archivist::from_registrations(regs)` and
`SessionMetadata::stub(scroll_id)` live under `#[cfg(any(test, feature =
"test-utils"))]` for integration tests that bypass the factory.
See `docs/plans/2026-04-19-archivist-phase3-design.md` for the full
design rationale, and `examples/multi_backend.rs` for a runnable
end-to-end example.
## Module Organization
### Core Modules
- **`lib.rs`**: Public API surface and re-exports
- **`types.rs`**: Core data structures (session metadata, message records, connector info, API types)
- **`error.rs`**: Error types and Result alias for archivist operations
### Backend Layer (`backend/`)
- **`traits.rs`**: `ArchiveBackend` trait + 5 optional sub-traits
- **`capability.rs`**: `ArchiveCapability` enum + `CapabilitySet` type
- **`health.rs`**: `HealthStatus` enum returned by `health_check`
- **`contract.rs`**: Reusable behavioral tests for any `&dyn ArchiveBackend` (cfg-gated)
- **`mock.rs`**: In-memory `MockBackend` for coordinator unit tests (cfg-gated)
### Concrete Backends (`backends/`)
- **`jsonl/`**: The file-based `JsonlBackend` — the only Phase 2 backend.
Reuses `storage/` primitives for NDJSON/JSON/TSV operations.
### Coordinator (`coordinator/`)
- **`mod.rs`**: The `Archivist` struct + constructors
- **`archives.rs`**: Archive lifecycle (add/remove/list/default)
- **`connectors.rs`**: Connector registration + alias detection
- **`sessions.rs`**: Session registration, metadata updates, move/copy
- **`meta.rs`**: Meta events, DAG walks, cleanup
### Storage Layer (`storage/`)
Low-level file I/O primitives used by `JsonlBackend`. All storage operations are async and use tokio.
- **`paths.rs`**: ArchivePaths utility for consistent directory structure and path resolution
- **`ndjson.rs`**: Newline-delimited JSON operations (read_ndjson, append_ndjson)
- **`json.rs`**: JSON operations (read_json, write_json)
- **`tsv.rs`**: Tab-separated value operations for connector index
- **`files.rs`**: Content-addressable file storage with SHA-256 hashing and deduplication
### Supporting Modules
- **`registry.rs`**: Archive registry persistence (multi-archive metadata)
- **`migration.rs`**: Single-archive → multi-archive migration path
- **`session.rs`**: Session lineage types shared across layers
- **`accumulator.rs`**: MessageAccumulator for assembling streaming message chunks
- **`backfill.rs`**: Backfill helpers for importing historical sessions
- **`import/`**: External conversation importers (e.g. Claude export)
### Events
- **`events.rs`**: EventHandler for subscribing to dirigent_protocol events and archiving them
## Configuration
The Archivist archive root is determined by `DirigentPaths` resolution:
- Set `DIRIGENT_DATA_DIR` to override the data directory; archives will be stored at `<data_dir>/archives/`
- Defaults to `~/.local/share/dirigent/archives/` (or platform equivalent)
```bash
DIRIGENT_DATA_DIR=/path/to/data dx serve
```
## Archive Structure
```
dirigent_archive/
├── .contexts/
│ └── {scroll_id:uuidv7}/ # One directory per session
│ ├── session.json # Session metadata
│ ├── messages.jsonl # Incremental message log (.ndjson also supported)
│ └── lineage.json # Session lineage info (optional)
├── .db/
│ └── connectors/
│ ├── index.tsv # Fast connector lookup (TSV)
│ └── {connector_uid}/
│ ├── connector.json # Connector metadata
│ └── sessions.jsonl # Session mappings (.ndjson also supported)
└── .files/
└── {sha256-hash} # Content-addressable file storage
```
### Why Hidden Directories?
The `.contexts`, `.db`, and `.files` directories are hidden (prefixed with `.`) to keep the archive root clean for future rendered outputs (like `chat.md` exports). This is similar to how `.git` hides implementation details in a codebase.
## File Formats
### Session Metadata (`session.json`)
```json
{
"version": 1,
"scroll_id": "01936e8f-e5a7-7000-8000-000000000001",
"created_at": "2025-01-01T12:00:00Z",
"updated_at": "2025-01-01T12:30:00Z",
"title": "Implement user authentication",
"connector_uid": "01936e8f-e5a7-7000-8000-000000000002",
"native_session_id": "abc123",
"agent_id": null,
"parent_scroll_id": null,
"continuation": null,
"tags": ["backend", "auth"],
"metadata": {
"source": "OpenCode",
"model": "claude-3-5-sonnet"
}
}
```
### Messages Log (`messages.jsonl`)
One JSON object per line, **append-only**:
```jsonl
{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000003","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":null,"ts":"2025-01-01T12:01:00Z","role":"user","author":"alice","content_md":"How do I implement JWT auth?","attachments":[],"metadata":{}}
{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000004","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":"01936e8f-e5a7-7000-8000-000000000003","ts":"2025-01-01T12:01:10Z","role":"assistant","author":"claude","content_md":"Here's how to implement JWT authentication...","attachments":[],"metadata":{"model":"claude-3-5-sonnet"}}
```
**IMPORTANT - Ordering**: The order of lines in the message log file (`messages.jsonl` or `messages.ndjson`) reflects **event arrival order**, NOT chronological order. Assistant replies often arrive after subsequent user messages due to streaming latency, resulting in non-chronological file order. Always use the `Archivist::get_messages()` API to retrieve messages, which sorts by `ts` (timestamp) and `message_id` (UUIDv7) to guarantee chronological order.
**File Format Compatibility**: The archivist supports both `.ndjson` and `.jsonl` file extensions for newline-delimited JSON files. When reading, `.jsonl` is preferred if present, with automatic fallback to `.ndjson` for backward compatibility. Write operations use `.jsonl` (canonical format). Both formats are identical in content - the difference is purely the file extension.
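The extension fallback can be sketched as a pair of path resolvers. The function names are hypothetical, and the existence check is passed in as a closure so the logic can be exercised without touching the filesystem (in real use it could be `|p| p.exists()`).

```rust
use std::path::{Path, PathBuf};

/// Picks the file to read: `.jsonl` if present, otherwise a legacy `.ndjson`.
/// When neither exists, the canonical `.jsonl` path is returned.
fn resolve_log_for_read(dir: &Path, stem: &str, exists: impl Fn(&Path) -> bool) -> PathBuf {
    let jsonl = dir.join(format!("{}.jsonl", stem));
    let ndjson = dir.join(format!("{}.ndjson", stem));
    if !exists(&jsonl) && exists(&ndjson) {
        ndjson
    } else {
        jsonl
    }
}

/// Writes always target the canonical `.jsonl` extension.
fn resolve_log_for_write(dir: &Path, stem: &str) -> PathBuf {
    dir.join(format!("{}.jsonl", stem))
}
```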
### Connector Index (`index.tsv`)
Tab-separated values with header row:
```tsv
connector_uid type title client_native_id alias_of created_at
01936e8f-e5a7-7000-8000-000000000002 OpenCode OpenCode Local opencode@http://localhost:12225 2025-01-01T12:00:00Z
```
### Session Mappings (`sessions.jsonl`)
Maps native session IDs from connectors to scroll IDs in the archive:
```jsonl
{"version":1,"connector_uid":"01936e8f-e5a7-7000-8000-000000000002","native_session_id":"abc123","scroll_id":"01936e8f-e5a7-7000-8000-000000000001","created_at":"2025-01-01T12:00:00Z","alias_of":null}
```
## Message Ordering Guarantees
### The Problem: Append Order ≠ Chronological Order
In the event-driven architecture, messages are written to the message log file (`messages.jsonl`) as completion events arrive. Due to streaming latency:
- User messages complete nearly instantly and are written immediately
- Assistant messages stream over time and complete later
- A second user message can be written before the first assistant reply completes
Example scenario:
```
T0: User sends "tell me a joke about snakes" (ts=18:23:36.947)
T1: Assistant starts streaming reply (ts=18:23:36.969)
T2: User sends "now one about tigers" (ts=18:23:49.429) <- completes and writes BEFORE assistant finishes
T3: Assistant finishes "snakes" reply <- writes AFTER "tigers" user message
```
File order in the message log file:
```
1. user "snakes" (18:23:36.947)
2. user "tigers" (18:23:49.429) <- written second
3. assistant "snakes" (18:23:36.969) <- written third, but timestamp is earlier!
```
### The Solution: Sort-on-Read
The `Archivist::get_messages()` implementation sorts messages before returning:
1. **Primary sort**: `ts` (timestamp) ascending
2. **Secondary sort**: `message_id` (UUIDv7) ascending for stable tie-breaking
This guarantees chronological order regardless of NDJSON append order:
```
1. user "snakes" (18:23:36.947)
2. assistant "snakes" (18:23:36.969)
3. user "tigers" (18:23:49.429)
```
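A minimal sketch of the sort-on-read step, using simplified stand-in types: `ts_millis` replaces the RFC 3339 `ts` field and a `u128` replaces the UUIDv7 `message_id`. The `Msg` struct and `sort_for_read` function are hypothetical, not the crate's API.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Msg {
    // Stand-in for the `ts` field (milliseconds).
    ts_millis: u64,
    // Stand-in for the UUIDv7 `message_id`.
    message_id: u128,
    label: &'static str,
}

/// Orders messages by timestamp, then by message id as a tie-breaker.
fn sort_for_read(mut messages: Vec<Msg>) -> Vec<Msg> {
    messages.sort_by_key(|m| (m.ts_millis, m.message_id));
    messages
}
```

Sorting on the `(ts, message_id)` pair gives a total order as long as message ids are unique, so repeated reads of the same file return the same sequence.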
### Why This Approach?
- **Maintains durability**: Append-only writes preserve crash safety
- **No migration needed**: Existing archives work without rewrites
- **Simple implementation**: No buffered writes or complex write-time ordering
- **Performance trade-off**: Small CPU cost on read (sorting) vs. complex write-time coordination
### Consumer Guidance
- **DO**: Use `Archivist::get_messages()` to retrieve messages
- **DON'T**: Read the message log file directly and assume file order = chronological order
- **UI/API**: Always sort by `ts` then `message_id` for defense in depth
- **Tie-breaking**: Use `message_id` (UUIDv7) as secondary sort for stable ordering when timestamps match
## Key Types
### SessionMetadata
Stores all metadata about a session including:
- **scroll_id**: UUIDv7 identifier for the session
- **connector_uid**: Which connector owns this session
- **native_session_id**: Original session ID from the connector (optional)
- **title**: Optional human-readable session title (see Title Management below)
- **parent_scroll_id**: For session lineage (splits, continuations)
- **continuation**: Type of continuation (SPLIT, COMPACT, REFERENCE, EDIT)
- **tags**: User-defined categorization
- **metadata**: Free-form JSON for connector-specific fields
#### Title Management
Session titles are fully supported and persist across restarts. Titles are stored in the `SessionMetadata` struct and saved to the `session.json` file.
**Setting Titles:**
```rust
// Update title for an existing session
archivist.update_session_metadata(
scroll_id,
Some("My Custom Session Title".to_string()),
None, // model
None // archive
).await?;
```
**Default Behavior:**
- New sessions can specify an initial title during registration
- If no title is provided, sessions default to `None`
- The UI typically displays "Untitled" for sessions without titles
**Title Loading:**
- Titles are automatically loaded when retrieving session metadata via `get_session_metadata()`
- Session lists include titles via `list_sessions()` and `list_sessions_all()`
- Titles are part of the `SessionMetadata` struct returned by all session queries
**UI Integration:**
- The web UI displays session titles in the session list and sidebar
- Users can rename sessions via the "Rename" button in the session list view
- Renaming calls `api::archivist::rename_session()` which uses `update_session_metadata()`
- Title changes are persisted immediately and survive application restarts
### MessageRecord
Represents a single message in the archive:
- **message_id**: UUIDv7 identifier
- **session**: scroll_id this message belongs to
- **role**: "user", "assistant", or "system"
- **content_md**: Message content in Markdown format
- **attachments**: References to attached files
- **metadata**: Free-form JSON for connector-specific fields
### ConnectorRecord
Metadata about a connector:
- **connector_uid**: UUIDv7 identifier
- **type**: "OpenCode", "ACP", or custom
- **client_native_id**: Unique identifier from client (e.g., "opencode@http://localhost:12225")
- **alias_of**: If this connector is an alias of another (for deduplication)
## Archivist Public API
The `Archivist` struct (in `coordinator/`) is the main public entry point
for archival operations. Consumers hold `Arc<Archivist>` and call inherent
methods — there is no `Archivist` trait anymore. The coordinator resolves
the target backend per call (via `archive: Option<String>`) and delegates
to `ArchiveBackend` methods.
Key method families (see `coordinator/*.rs` for full signatures):
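- **Archive lifecycle** (`archives.rs`): `add_archive`, `remove_archive`,
`list_archives`, `set_default_archive` (in Phase 3, `add_archive`,
`remove_archive`, and `set_default_archive` return
`ArchivistError::DynamicRegistryUnsupported`; see Lifecycle)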
- **Connectors** (`connectors.rs`): `register_connector` with tri-state
result (Accepted / Aliased / Rejected), `list_connectors`
- **Sessions** (`sessions.rs`): `register_session`, `get_session_metadata`,
`update_session_metadata`, `list_sessions_paged`, `move_session`,
`copy_session`, `resolve_session`
- **Messages**: `append_messages`, `get_messages` (sorts by `ts` then
`message_id` for stable chronological order)
- **Meta / DAG** (`meta.rs`): meta-event recording, session lineage DAG
walks, cleanup routines
## List Filter vs. Full-Text Search
Two distinct query paths exist — do not conflate them.
**List filter**: `Archivist::list_sessions_paged(SessionListQuery)` returns a
cursor-paged list of sessions, AND-filtered by `title_query` (substring on
title), `tags`, `model_filter` (substring on `metadata.model`), `project_id`,
`connector_uid`, and `include_hidden`. This is the right tool for "narrow the
list of visible sessions."
**Full-text search**: `api::search_sessions` (in the `api` package, backed by
`api::archivist::search_task::SearchTask`) spawns `rg --json` over the
archive's `.contexts/` tree to find messages containing text. It streams
`SearchExcerpt`s with parsed NDJSON content and supports cancellation via
`CancellationToken`. This is the right tool for "find messages containing
text."
**Do not extend `list_sessions_paged` to do content search.** Content search
belongs in the ripgrep pipeline. Future improvements to content search
(indexed backends, relevance scoring) are Phase 2d / Phase 3 concerns.
## JsonlBackend Implementation
The Phase 2 production backend — an implementation of `ArchiveBackend` plus
every sub-trait except `SearchBackend`:
- **Thread-safe**: Uses RwLock for in-memory caches
- **Async**: All operations use tokio for non-blocking I/O
- **Caching**: In-memory caches for connector and session mappings
- **Collision Detection**: Tri-state registration for connectors and sessions
Located under `src/backends/jsonl/` and split by concern (`backend.rs`,
`connectors.rs`, `dag.rs`, `mapping.rs`, `meta.rs`).
### Caching Strategy
`JsonlBackend` maintains two in-memory caches:
1. **connector_cache**: HashMap<Uuid, ConnectorRecord>
- Populated on registration
- Not yet loaded from the TSV index on startup (future enhancement)
2. **session_cache**: HashMap<(Uuid, String), Uuid>
- Maps (connector_uid, native_session_id) to scroll_id
- Populated on registration and session resolution
- Enables fast session lookups without disk I/O
## Event Handling
The EventHandler subscribes to dirigent_protocol events and archives them in real-time:
```rust
use std::sync::Arc;
// Create the archivist and wrap it in an event handler
let archivist = Archivist::new_with_single_archive(archive_path).await?;
let handler = EventHandler::new(Arc::new(archivist));
// Subscribe to the event stream from dirigent_core
let events = event_stream.subscribe();
// Run the event loop; this await holds the current task while the loop runs
handler.run(events).await;
```
### Supported Events
- **SessionCreated**: Registers new sessions with the archivist
- **MessageCompleted**: Writes finalized messages to the archive
- **SessionUpdate**: Accumulates streaming message chunks
- AgentMessageChunk
- UserMessageChunk
- AgentThoughtChunk
- ToolCall
### MessageAccumulator
Assembles streaming message chunks into complete messages:
- Accumulates text chunks by message_id
- Tracks thinking blocks separately
- Stores tool calls with input/output
- Finalizes messages on MessageCompleted event
- Converts to MessageRecord for archival
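The accumulation flow can be sketched as follows. This is not the crate's `MessageAccumulator`: the `Accumulator`, `Chunk`, `Pending`, and `Completed` types are hypothetical, tool calls are reduced to a name, and string message ids stand in for UUIDs.

```rust
use std::collections::HashMap;

#[derive(Default, Debug)]
struct Pending {
    text: String,
    thinking: String,
    tool_calls: Vec<String>,
}

#[derive(Debug, PartialEq)]
struct Completed {
    message_id: String,
    content_md: String,
    thinking: String,
    tool_calls: Vec<String>,
}

enum Chunk<'a> {
    Text(&'a str),
    Thought(&'a str),
    ToolCall(&'a str),
}

#[derive(Default)]
struct Accumulator {
    pending: HashMap<String, Pending>,
}

impl Accumulator {
    /// Appends a streaming chunk to the in-progress message.
    fn push(&mut self, message_id: &str, chunk: Chunk<'_>) {
        let entry = self.pending.entry(message_id.to_string()).or_default();
        match chunk {
            Chunk::Text(t) => entry.text.push_str(t),
            // Thinking is tracked separately from the visible text.
            Chunk::Thought(t) => entry.thinking.push_str(t),
            Chunk::ToolCall(name) => entry.tool_calls.push(name.to_string()),
        }
    }

    /// Called on message completion; removes and returns the assembled message.
    fn finalize(&mut self, message_id: &str) -> Option<Completed> {
        let p = self.pending.remove(message_id)?;
        Some(Completed {
            message_id: message_id.to_string(),
            content_md: p.text,
            thinking: p.thinking,
            tool_calls: p.tool_calls,
        })
    }
}
```

Keying the pending state by message id lets chunks from several in-flight messages interleave without mixing their content.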
## Integration with dirigent_core
The Archivist integrates with dirigent_core via the global event stream:
1. **CoreRuntime** emits events for all connector operations
2. **EventHandler** subscribes to event stream
3. **MessageAccumulator** assembles streaming chunks
4. **Archivist** writes complete messages to archive
This enables:
- Automatic archival of all sessions and messages
- No polling required - fully event-driven
- Consistent history across restarts
- Offline access to historical data
## Testing
The package has comprehensive test coverage across multiple dimensions:
### Unit Tests
Located in each module (`src/*.rs`, `src/storage/*.rs`):
- Type serialization/deserialization
- UUIDv7 generation and ordering
- Timestamp formatting (RFC 3339)
- Storage operations (NDJSON, JSON, TSV, files)
- Connector registration tri-state logic
- Session registration and alias detection
### Integration Tests
Located in `tests/`:
- `integration_tests.rs`: Full `Archivist` + `JsonlBackend` lifecycle, event
handler integration, multi-connector scenarios, session lineage, message
accumulation
- `list_sessions_paged_test.rs`, `pagination_test.rs`: List filter + cursor
pagination coverage
- `import_claude_idempotency_test.rs`: Claude export re-import idempotency
### Backend Contract Tests
`src/backend/contract.rs` holds reusable async assertions that any
`&dyn ArchiveBackend` must pass. `JsonlBackend` and `MockBackend` both
run the contract suite; new backends added in Phase 3+ should do the same.
### Examples
Located in `examples/`:
- `basic_usage.rs`: Core archivist operations
- `event_handling.rs`: EventHandler and MessageAccumulator
- `file_storage.rs`: Content-addressable file storage
Run tests:
```bash
cargo test --package dirigent_archivist
```
Run examples:
```bash
cargo run --package dirigent_archivist --example basic_usage
cargo run --package dirigent_archivist --example event_handling
cargo run --package dirigent_archivist --example file_storage
```
## Performance Characteristics
- **Append Operations**: O(1) with sequential file writes
- **Session Lookup**: O(1) on an in-memory cache hit, O(n) on a cache miss
- **Message Retrieval**: O(n) where n = number of messages (NDJSON parsing)
- **File Storage**: O(1) lookup by SHA-256 hash; computing the hash on write is linear in file size
- **Connector Index**: O(n) TSV scan, suitable for hundreds of connectors
### Scalability Considerations
- **Large Sessions**: NDJSON logs carry no index, so reading a large session requires parsing every line
- **Many Sessions**: TSV indices are suitable for thousands of sessions per connector
- **File Deduplication**: SHA-256 hashing provides automatic deduplication across sessions
- **Concurrent Access**: RwLock allows multiple concurrent readers, single writer
## Error Handling
The Archivist uses thiserror for rich error types:
```rust
pub enum ArchivistError {
IoError(std::io::Error),
SerdeError(serde_json::Error),
SessionUnknown(Uuid),
CollisionInconsistent(Uuid),
// ... etc
}
```
All public APIs return `Result<T, ArchivistError>` for explicit error handling.
## Development Notes
- All storage operations are async (using tokio)
- Content-addressable storage uses SHA-256 hashes (hex-encoded)
- Archive directory structure mirrors session/message hierarchy
- UUIDv7 provides time-ordered, sortable identifiers
- RFC 3339 UTC timestamps for all time-based fields
- Schema versioning via `version` field in all records
## Related Packages
- **dirigent_protocol**: Shared types and protocol definitions (dependency)
- **dirigent_core**: Runtime integration for SSE event capture (integration point)
- **api**: Server functions for archive queries (future)
- **web**: UI for archive browsing and search (future)
## Phase 4: `ArchiveFilter` (2026-04-21)
Every `ArchiveRegistration` carries a `filter: ArchiveFilter`. The filter
describes which sessions/writes the backend wants to receive. Fields:
- `include_connectors: Option<HashSet<Uuid>>` — if Some, only these
connector UIDs pass. `None` means no connector gate.
- `exclude_connectors: HashSet<Uuid>` — always rejected.
- `include_tags: HashSet<String>` — if non-empty, the session must carry
at least one matching tag.
- `exclude_tags: HashSet<String>` — any matching tag rejects.
- `include_hidden: bool` — default `true`. If `false`, sessions whose
metadata has `"hidden": true` are skipped.
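One way to read the field semantics above is as a single predicate. The sketch below keeps the documented field names but makes assumptions: `u128` stands in for `Uuid`, the `SessionView` type and `accepts` method are hypothetical, and the order in which the gates are checked is a choice of this sketch.

```rust
use std::collections::HashSet;

struct ArchiveFilter {
    include_connectors: Option<HashSet<u128>>,
    exclude_connectors: HashSet<u128>,
    include_tags: HashSet<String>,
    exclude_tags: HashSet<String>,
    include_hidden: bool,
}

impl Default for ArchiveFilter {
    // The default filter is unrestricted: no gates, hidden sessions included.
    fn default() -> Self {
        Self {
            include_connectors: None,
            exclude_connectors: HashSet::new(),
            include_tags: HashSet::new(),
            exclude_tags: HashSet::new(),
            include_hidden: true,
        }
    }
}

// Hypothetical view of the session fields the filter looks at.
struct SessionView<'a> {
    connector_uid: u128,
    tags: &'a [&'a str],
    hidden: bool,
}

impl ArchiveFilter {
    fn accepts(&self, s: &SessionView<'_>) -> bool {
        if self.exclude_connectors.contains(&s.connector_uid) {
            return false;
        }
        if let Some(allowed) = &self.include_connectors {
            if !allowed.contains(&s.connector_uid) {
                return false;
            }
        }
        if s.tags.iter().any(|t| self.exclude_tags.contains(*t)) {
            return false;
        }
        // A non-empty include list requires at least one matching tag.
        if !self.include_tags.is_empty()
            && !s.tags.iter().any(|t| self.include_tags.contains(*t))
        {
            return false;
        }
        if s.hidden && !self.include_hidden {
            return false;
        }
        true
    }
}
```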
### Primary-always-writes invariant
The per-call primary (either the `archive: Some(name)` argument or the
default write-target) is **never** filtered. If a caller explicitly asks
to write to archive X, the filter on X is not consulted. Filters only
gate secondary fanout.
### Boot validator
At boot (`coordinator/boot.rs`), the validator rejects configurations
where:
- No write-active + enabled registration has an **unrestricted** filter
(`ArchiveFilter::default()` is unrestricted). Prevents configurations
that silently drop all writes.
- An archive's filter has `include_connectors = Some(empty set)`, which is
equivalent to "reject everything" and is almost certainly a config
bug.
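The two boot checks can be sketched as a single validation pass. The `Registration` shape here is a hypothetical reduction: `other_restrictions` collapses the tag, exclude, and hidden gates into one flag, `u128` stands in for `Uuid`, and the error strings are illustrative.

```rust
struct Registration {
    name: &'static str,
    enabled: bool,
    write_active: bool,
    // `None` means no connector gate.
    include_connectors: Option<Vec<u128>>,
    // True if any tag, exclude, or hidden gate is set.
    other_restrictions: bool,
}

impl Registration {
    fn unrestricted(&self) -> bool {
        self.include_connectors.is_none() && !self.other_restrictions
    }
}

fn validate(regs: &[Registration]) -> Result<(), String> {
    // Check 1: an empty include list would reject every session.
    for r in regs {
        if matches!(&r.include_connectors, Some(set) if set.is_empty()) {
            return Err(format!(
                "archive '{}': include_connectors is empty and would reject everything",
                r.name
            ));
        }
    }
    // Check 2: at least one enabled, write-active archive must accept all writes.
    if !regs.iter().any(|r| r.enabled && r.write_active && r.unrestricted()) {
        return Err("no enabled, write-active archive has an unrestricted filter".to_string());
    }
    Ok(())
}
```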
See `docs/plans/2026-04-21-archivist-phase4-design.md` §4 for the full
design rationale.
## Phase 5: Importers (2026-04-21)
The `import::` module centres on an `Importer` trait with per-source
implementations under `import::sources::*`. Each source produces a
`ParsedConversation` (ChatGPT) / `ParsedSession` (Codex) / session
directory walk (Claude) and feeds the results through the common
`import_sessions` orchestrator, which fires `ImportProgressEvent`s on a
bounded `ImportProgressSink`.
### `Importer` trait
Every importer declares a `config_shape()` so UIs can render a dynamic
form; a `discover()` that returns an `ImportDiscovery` preview; and an
`import()` that does the actual work. All three methods are async.
The trait lives in `import::trait_def`. Shape types (`ImportConfig`,
`ImportTarget`, `ConfigField`, `ConfigFieldKind`, `ImportError`) are
serialisable and safe to cross the WASM boundary.
### Registry
`ImporterRegistry::with_defaults()` registers every enabled
`importer-*` feature. Currently: `claude`, `chatgpt`, `codex`. The
registry is constructed at boot and stored on `AppState`.
### Progress sink
`ImportProgressSink::channel()` returns a bounded mpsc pair.
Non-terminal events use `try_send` (dropped on full); terminal events
use `send().await` so consumers always see the final state.
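The drop-on-full versus always-deliver split can be sketched with the standard library's bounded channel. This is a stand-in, not the crate's sink: `std::sync::mpsc::sync_channel` and its blocking `send` replace the async channel and `send().await`, and the `ProgressSink`, `ProgressEvent`, `report`, and `finish` names are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

#[derive(Debug, PartialEq)]
enum ProgressEvent {
    Session(u32),
    Done { imported: u32 },
}

struct ProgressSink {
    tx: SyncSender<ProgressEvent>,
}

impl ProgressSink {
    fn channel(capacity: usize) -> (Self, Receiver<ProgressEvent>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx }, rx)
    }

    /// Non-terminal event: dropped if the buffer is full.
    /// Returns whether the event was queued.
    fn report(&self, event: ProgressEvent) -> bool {
        self.tx.try_send(event).is_ok()
    }

    /// Terminal event: waits for room so the consumer sees the final state.
    /// The result is ignored because a vanished consumer needs no update.
    fn finish(&self, event: ProgressEvent) {
        let _ = self.tx.send(event);
    }
}
```

Dropping intermediate events keeps a slow consumer from stalling the import, while the waiting send preserves the one event that must not be lost.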
### Source crates
- `dirigent_chatgpt` — parses `conversations.json` from the OpenAI data
export.
- `dirigent_codex` — parses `*.jsonl` session files under
`~/.codex/sessions`.
Both crates hold pure parser types with zero dirigent-specific types.
See `docs/plans/2026-04-21-archivist-phase5-design.md`.
## Future Enhancements
- Indexed `SearchBackend` implementations (tantivy/sqlite) — currently
content search is ripgrep-based in the `api` package
- Session splitting and lineage management (mutations.ndjson)
- Knowledge overview generation (chat.md exports)
- Embedding storage and search (embeds/)
- Network RPC interface for remote archivist
- Compaction and pruning policies
- Additional concrete backends (e.g. SQLite, remote)
## Documentation
- **Package README**: `./README.md` - User-facing overview
- **Architecture Docs**: `../../docs/building/05_archivist/` - Design and planning
- **API Docs**: Run `cargo doc --package dirigent_archivist --open`
- **Examples**: See `examples/` directory for working code samples