sync from monorepo @ 2452e92e
# Package: dirigent_archivist

Persistent storage for all agentic interactions in Dirigent.

## Quick Facts
- **Type**: Library
- **Main Entry**: src/lib.rs
- **Dependencies**: dirigent_protocol, uuid, chrono, serde, tokio, tracing, thiserror, sha2, hex, async-trait
- **Status**: Complete - Production ready with comprehensive tests

## Purpose

The Archivist provides file-based archival storage for all session data, messages, and attachments in Dirigent. It implements an archive-first architecture with connector API fallback, using NDJSON, JSON, and TSV formats for durability and human-readability.

## Key Features

- **File-based Storage**: NDJSON for messages, JSON for metadata, TSV for indices
- **Content-Addressable Files**: SHA-256 based storage for attachments with automatic deduplication
- **Session Lineage**: Track splits, continuations, and mutations with parent references
- **Connector Registry**: Coordinate UID assignment across connectors with collision detection
- **Event Streaming**: Real-time updates via EventHandler subscribing to dirigent_protocol events
- **Archive-First Design**: Read from archive first, fall back to connector API when needed
- **Caching**: In-memory caching of connector and session mappings for performance

## Architecture

The Archivist is built on three core architectural principles:

### 1. Archive-First Read Strategy

The Archivist is the primary source of truth for historical data:
- UI and APIs query the archive first
- Only fall back to connector APIs if data is not in archive
- This enables offline access and consistent history across restarts

### 2. Write-Through Event Capture (Append-Only)

The EventHandler subscribes to the global event stream from dirigent_core:
- Captures session creation, message streaming, and tool calls in real-time
- Uses MessageAccumulator to assemble streaming chunks into complete messages
- Writes complete messages to archive immediately upon finalization
- No polling required - fully event-driven
- **Append-only writes**: Messages are appended as events arrive, NOT in chronological order
- File order reflects event timing, not message timestamps

### 3. File-Based Storage with Sort-on-Read

All data is stored in human-readable, grep-able formats:
- **NDJSON** (Newline-Delimited JSON): Incremental append-only logs for messages and mappings
- **JSON**: Structured metadata for sessions and connectors
- **TSV** (Tab-Separated Values): Fast indices for cross-references
- **Content-Addressed Files**: Binary attachments stored by SHA-256 hash for deduplication
- **Sort-on-Read**: `get_messages()` sorts by timestamp and message_id to ensure chronological order despite append-only writes

## Backend Trait Layer (Phase 2)

The archivist uses a trait-based backend abstraction. `ArchiveBackend`
defines the mandatory session and message primitives every backend must
provide, plus `as_xxx()` accessors returning optional sub-traits:

- `SearchBackend`: reserved for Phase 3+ indexed backends (not wired)
- `DagBackend`: session lineage DAG edges
- `MetaEventsBackend`: ACP connection lifecycle events
- `ConnectorRegistryBackend`: per-archive connector metadata
- `SessionMappingBackend`: native↔scroll session ID mapping

`JsonlBackend` is the Phase 2 concrete implementation (file-based
NDJSON/JSON/TSV) and opts into every sub-trait except `SearchBackend`
(content search continues to be served by ripgrep via
`crates/api/src/archivist/search_task.rs`).

The `Archivist` struct (in `src/coordinator/`) owns a registry of backends
keyed by archive name and performs orchestration (alias detection, session
lineage, move/copy, DAG walks, archive lifecycle). Consumers hold
`Arc<Archivist>` directly; the coordinator is concrete, not a trait.

See `docs/plans/2026-04-18-archivist-phase2-design.md` for design rationale.

## Multi-Backend Registry (Phase 3)

The coordinator (`Archivist`) holds `Vec<Arc<ArchiveRegistration>>` sorted
by `read_priority` instead of a flat `HashMap<name, Arc<dyn ArchiveBackend>>`.
Each registration carries:

- `backend: Arc<dyn ArchiveBackend>` + its declared capabilities
- `failure_mode`: `Required` (must succeed) | `BestEffort` (errors log + drift health)
- `read_priority`: lower = tried first for reads; also selects the default
  write target when no archive is named
- `write_active`: participates in fanout writes
- `enabled`: kill-switch without removing config
- `write_policy`: `Inline` (default; `await` per call) or `Queued`
  (mpsc + batch_window + overflow policy)
- Runtime state: `last_health`, `last_error`, `consecutive_failures`
  (all `Arc<RwLock<_>>`, shared with the writer task when queued)
- Optional `writer: Option<WriterHandle>` (Some iff `write_policy = Queued`)

Backends are declared in `dirigent.toml` under `[[archives]]` and
constructed at boot via `Archivist::from_config(cfg, &BackendRegistry)`.
Add a new backend type by implementing `BackendFactory` and registering
it on the `BackendRegistry` before `from_config`.

### Reads

`get_session`, `get_messages_paged`, `count_messages`, `get_meta_events`,
`get_children`, etc. walk the registry in priority order via
`read_walk_per_session(scroll_id, predicate, op)`. The predicate
capability-filters; `Unavailable` backends are skipped. The first backend
that returns `Some(value)` wins and its name is cached against the
`scroll_id` in a positive LRU (capacity 10_000). Subsequent reads for the
same `scroll_id` short-circuit to the cached backend before falling back
to the full priority walk.

Collection-shape reads (`list_sessions_paged`, `list_connectors`,
`list_meta_sessions`, `find_meta_session_by_client`) use
`read_walk_collection`: first enabled backend that can answer wins, no
cache, no aggregation across backends. Phase 3 explicitly defers
cross-backend merge/dedup to a later phase.

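The per-session read walk described above can be sketched roughly as follows. This is a minimal synchronous illustration, not the crate's implementation: `Backend`, `ReadWalker`, and the `backend` helper are hypothetical names, scroll IDs are plain strings, the capability predicate is omitted, and a plain `HashMap` stands in for the positive LRU (so there is no eviction).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one registered backend.
struct Backend {
    name: &'static str,
    read_priority: u32,
    available: bool,                   // `false` models an `Unavailable` backend
    sessions: HashMap<String, String>, // scroll_id -> payload
}

struct ReadWalker {
    backends: Vec<Backend>,               // sorted by read_priority, lowest first
    cache: HashMap<String, &'static str>, // scroll_id -> backend that answered last time
}

impl ReadWalker {
    fn new(mut backends: Vec<Backend>) -> Self {
        backends.sort_by_key(|b| b.read_priority);
        Self { backends, cache: HashMap::new() }
    }

    /// Returns the answering backend's name and the value, if any backend has it.
    fn read(&mut self, scroll_id: &str) -> Option<(&'static str, String)> {
        // Fast path: try the backend remembered for this scroll_id.
        if let Some(&name) = self.cache.get(scroll_id) {
            let hit = self.backends.iter().find(|b| b.name == name && b.available);
            if let Some(value) = hit.and_then(|b| b.sessions.get(scroll_id)) {
                return Some((name, value.clone()));
            }
        }
        // Slow path: full walk in priority order, skipping unavailable backends.
        for b in self.backends.iter().filter(|b| b.available) {
            if let Some(value) = b.sessions.get(scroll_id) {
                self.cache.insert(scroll_id.to_string(), b.name);
                return Some((b.name, value.clone()));
            }
        }
        None
    }
}

/// Hypothetical helper that builds a backend holding the given scroll IDs.
fn backend(name: &'static str, read_priority: u32, available: bool, ids: &[&str]) -> Backend {
    let sessions = ids.iter().map(|id| (id.to_string(), format!("{id}@{name}"))).collect();
    Backend { name, read_priority, available, sessions }
}

fn main() {
    let mut walker = ReadWalker::new(vec![
        backend("mirror", 20, true, &["s1"]),
        backend("primary", 10, true, &[]),
        backend("down", 0, false, &["s1"]),
    ]);
    println!("{:?}", walker.read("s1")); // answered by the full walk
    println!("{:?}", walker.read("s1")); // answered via the cache
}
```

In this sketch the highest-priority backend is skipped because it is unavailable, the next one does not hold the session, and the third one answers and is remembered for later reads.
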
### Writes

Mutating methods (`append_messages`, `register_session`, `update_session_*`,
`append_meta_events`, `append_dag_edge`, `clear_session_messages`,
`update_connector_fingerprint`) resolve a primary (per-call `archive:
Some(name)` override or the default-write target) and fan out to every
other `enabled && write_active` backend that has the required capability.
Capability-mismatched backends are skipped with a debug `capability_skip`
log (never an error). `Required` failures propagate to the caller;
`BestEffort` failures log + drift health.

`register_connector` currently does NOT fan out: alias detection + the
tri-state `Accepted`/`Aliased`/`Rejected` return shape make replication
non-trivial. Fanout for connectors is deferred; single-backend setups are
unaffected.

For `write_policy = Queued` backends, the primary/secondary write paths
enqueue a `WriteOp` into the backend's writer task instead of awaiting.
Errors drift the backend's health but do not propagate to the caller.
Coalescing merges consecutive `AppendMessages`/`AppendMetaEvents` for the
same `scroll_id` within `batch_window_ms`.

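The coalescing rule for queued writes can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `WriteOp` variants are simplified (payloads are plain strings), `try_merge` and `coalesce` are hypothetical names, and the input `Vec` is taken to be the operations already collected within one `batch_window_ms`.

```rust
/// Hypothetical, simplified write operations (real payload types will differ).
#[derive(Debug, Clone, PartialEq)]
enum WriteOp {
    AppendMessages { scroll_id: String, messages: Vec<String> },
    AppendMetaEvents { scroll_id: String, events: Vec<String> },
    RegisterSession { scroll_id: String },
}

/// Folds `next` into `last` when both are the same kind of append for the same
/// scroll_id. Returns `true` if the merge happened.
fn try_merge(last: &mut WriteOp, next: &WriteOp) -> bool {
    match (last, next) {
        (
            WriteOp::AppendMessages { scroll_id: a, messages: acc },
            WriteOp::AppendMessages { scroll_id: b, messages },
        ) if *a == *b => {
            acc.extend(messages.iter().cloned());
            true
        }
        (
            WriteOp::AppendMetaEvents { scroll_id: a, events: acc },
            WriteOp::AppendMetaEvents { scroll_id: b, events },
        ) if *a == *b => {
            acc.extend(events.iter().cloned());
            true
        }
        _ => false,
    }
}

/// Merges only *consecutive* compatible appends, so relative order is preserved.
fn coalesce(batch: Vec<WriteOp>) -> Vec<WriteOp> {
    let mut out: Vec<WriteOp> = Vec::new();
    for op in batch {
        let merged = match out.last_mut() {
            Some(last) => try_merge(last, &op),
            None => false,
        };
        if !merged {
            out.push(op);
        }
    }
    out
}

fn append(scroll_id: &str, messages: &[&str]) -> WriteOp {
    WriteOp::AppendMessages {
        scroll_id: scroll_id.to_string(),
        messages: messages.iter().map(|m| m.to_string()).collect(),
    }
}

fn main() {
    let batch = vec![
        append("s1", &["a"]),
        append("s1", &["b"]),
        WriteOp::RegisterSession { scroll_id: "s2".to_string() },
        append("s1", &["c"]),
    ];
    for op in coalesce(batch) {
        println!("{op:?}");
    }
}
```

Merging only adjacent operations keeps the sketch order-preserving: an append that follows an unrelated operation is never moved ahead of it.
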
### Cross-backend operations

- `delete_session(scroll_id, _)` fans out to every enabled backend that has
  the session. Copies in `write_active=false` backends produce
  `ArchivistError::DeleteOnReadOnlyBackend` (write-active copies are still
  deleted); cache invalidated regardless of outcome.
- `copy_session(scroll_id, from, to)` reads from `from`, writes to `to`,
  including DAG and meta-events when both sides have the capability. The
  source remains canonical (the cache is NOT rewritten).
- `move_session(scroll_id, from, to)` is `copy + delete-from-source`. If
  the source-side delete fails after the copy succeeded,
  `ArchivistError::PartialMove { copied_to, delete_error }` is returned so
  the caller knows the session now lives in both places.

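The `copy + delete-from-source` shape of `move_session`, and why a partial move is reported rather than rolled back, can be sketched with in-memory stand-ins. This is only an illustration under assumptions: `Archive`, `MoveError`, and the `String` field types are hypothetical, and the real `ArchivistError::PartialMove` fields may carry different types.

```rust
use std::collections::HashMap;

/// Hypothetical error type mirroring the shape described above.
#[derive(Debug, PartialEq)]
enum MoveError {
    NotFound,
    PartialMove { copied_to: String, delete_error: String },
}

/// Hypothetical in-memory archive: scroll_id -> payload.
struct Archive {
    name: String,
    write_active: bool,
    sessions: HashMap<String, String>,
}

impl Archive {
    fn delete(&mut self, scroll_id: &str) -> Result<(), String> {
        if !self.write_active {
            return Err(format!("archive `{}` is read-only", self.name));
        }
        self.sessions.remove(scroll_id);
        Ok(())
    }
}

/// Copy first, then delete from the source. A failed delete leaves both copies
/// in place and tells the caller where the new copy lives.
fn move_session(scroll_id: &str, from: &mut Archive, to: &mut Archive) -> Result<(), MoveError> {
    let payload = from.sessions.get(scroll_id).cloned().ok_or(MoveError::NotFound)?;
    to.sessions.insert(scroll_id.to_string(), payload);
    from.delete(scroll_id).map_err(|delete_error| MoveError::PartialMove {
        copied_to: to.name.clone(),
        delete_error,
    })
}

fn archive(name: &str, write_active: bool, ids: &[&str]) -> Archive {
    Archive {
        name: name.to_string(),
        write_active,
        sessions: ids.iter().map(|id| (id.to_string(), format!("payload of {id}"))).collect(),
    }
}

fn main() {
    let mut cold = archive("cold", false, &["s1"]);
    let mut hot = archive("hot", true, &[]);
    // The source is read-only, so the copy succeeds but the delete does not.
    println!("{:?}", move_session("s1", &mut cold, &mut hot));
}
```

Copying before deleting means a failure can duplicate a session but never lose it, which is why the error names the destination.
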
The Phase 2 connector-aware `move_session(scroll_id, target_connector_uid, _)`
and `copy_session(scroll_id, target_connector_uid, _)` survived the Phase
3 rename as `move_session_to_connector` / `copy_session_to_connector`.
Their bulk variant is `move_sessions_to_connector`.

### Health

`HealthStatus` drifts on every coordinator call that observes a backend:

- Successful write → `Healthy`; `consecutive_failures` reset to 0.
- Successful read → `Healthy` (only rescues `Degraded`; does not reset the counter).
- Write failure → `Degraded { reason }`; `consecutive_failures += 1`; after
  K = 5 consecutive failures drifts to `Unavailable { reason }`. Reads skip
  `Unavailable` backends; writes against an `Unavailable` `Required`
  backend fail, while writes against an `Unavailable` `BestEffort` backend
  are still attempted.
- Read failure alone never drifts past `Degraded`; writes are the
  authoritative health signal.

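The drift rules above amount to a small state machine, sketched below. The names `BackendHealth`, `on_write_ok`, `on_write_err`, `on_read_ok`, and `on_read_err` are hypothetical. Two details are assumptions where the text is silent: the transition to `Unavailable` happens on the fifth consecutive write failure, and a read failure does not touch the counter.

```rust
#[derive(Debug, Clone, PartialEq)]
enum HealthStatus {
    Healthy,
    Degraded { reason: String },
    Unavailable { reason: String },
}

/// K from the text: consecutive write failures before a backend is `Unavailable`.
const UNAVAILABLE_AFTER: u32 = 5;

struct BackendHealth {
    status: HealthStatus,
    consecutive_failures: u32,
}

impl BackendHealth {
    fn new() -> Self {
        Self { status: HealthStatus::Healthy, consecutive_failures: 0 }
    }

    /// Successful write: always back to `Healthy`, counter reset.
    fn on_write_ok(&mut self) {
        self.status = HealthStatus::Healthy;
        self.consecutive_failures = 0;
    }

    /// Write failure: `Degraded`, or `Unavailable` once the counter reaches K.
    fn on_write_err(&mut self, reason: &str) {
        self.consecutive_failures += 1;
        self.status = if self.consecutive_failures >= UNAVAILABLE_AFTER {
            HealthStatus::Unavailable { reason: reason.to_string() }
        } else {
            HealthStatus::Degraded { reason: reason.to_string() }
        };
    }

    /// Successful read: only rescues `Degraded`; the counter is left alone.
    fn on_read_ok(&mut self) {
        if matches!(self.status, HealthStatus::Degraded { .. }) {
            self.status = HealthStatus::Healthy;
        }
    }

    /// Read failure: never drifts past `Degraded` (assumption: counter untouched).
    fn on_read_err(&mut self, reason: &str) {
        if self.status == HealthStatus::Healthy {
            self.status = HealthStatus::Degraded { reason: reason.to_string() };
        }
    }
}

fn main() {
    let mut health = BackendHealth::new();
    health.on_read_err("timeout");
    println!("{:?}", health.status);
    for _ in 0..UNAVAILABLE_AFTER {
        health.on_write_err("disk full");
    }
    println!("{:?} after {} failures", health.status, health.consecutive_failures);
    health.on_write_ok();
    println!("{:?}", health.status);
}
```

Because a successful read does not reset the counter, a backend that keeps failing writes still reaches `Unavailable` even if reads succeed in between.
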
`list_archives_with_health()` returns a `Vec<ArchiveStatus>` snapshot of
every registration: name, type, capabilities, health, last_error, and
queue_depth (for queued backends).

### Lifecycle

Phase 3 is **startup-only**. `add_archive` / `remove_archive` /
`set_default_archive` on the coordinator return
`ArchivistError::DynamicRegistryUnsupported`. To change the registry,
edit `dirigent.toml` and restart the server. `Archivist::shutdown()`
drains queued writer tasks (sends `WriteOp::Shutdown` on each writer's
mpsc and awaits ack); call it before process exit.

Test-only constructors `Archivist::from_registrations(regs)` and
`SessionMetadata::stub(scroll_id)` live under `#[cfg(any(test, feature =
"test-utils"))]` for integration tests that bypass the factory.

See `docs/plans/2026-04-19-archivist-phase3-design.md` for the full
design rationale, and `examples/multi_backend.rs` for a runnable
end-to-end example.

## Module Organization

### Core Modules

- **`lib.rs`**: Public API surface and re-exports
- **`types.rs`**: Core data structures (session metadata, message records, connector info, API types)
- **`error.rs`**: Error types and Result alias for archivist operations

### Backend Layer (`backend/`)

- **`traits.rs`**: `ArchiveBackend` trait + 5 optional sub-traits
- **`capability.rs`**: `ArchiveCapability` enum + `CapabilitySet` type
- **`health.rs`**: `HealthStatus` enum returned by `health_check`
- **`contract.rs`**: Reusable behavioral tests for any `&dyn ArchiveBackend` (cfg-gated)
- **`mock.rs`**: In-memory `MockBackend` for coordinator unit tests (cfg-gated)

### Concrete Backends (`backends/`)

- **`jsonl/`**: The file-based `JsonlBackend`, the only Phase 2 backend.
  Reuses `storage/` primitives for NDJSON/JSON/TSV operations.

### Coordinator (`coordinator/`)

- **`mod.rs`**: The `Archivist` struct + constructors
- **`archives.rs`**: Archive lifecycle (add/remove/list/default)
- **`connectors.rs`**: Connector registration + alias detection
- **`sessions.rs`**: Session registration, metadata updates, move/copy
- **`meta.rs`**: Meta events, DAG walks, cleanup

### Storage Layer (`storage/`)

Low-level file I/O primitives used by `JsonlBackend`. All storage operations are async and use tokio.

- **`paths.rs`**: ArchivePaths utility for consistent directory structure and path resolution
- **`ndjson.rs`**: Newline-delimited JSON operations (read_ndjson, append_ndjson)
- **`json.rs`**: JSON operations (read_json, write_json)
- **`tsv.rs`**: Tab-separated value operations for connector index
- **`files.rs`**: Content-addressable file storage with SHA-256 hashing and deduplication

### Supporting Modules

- **`registry.rs`**: Archive registry persistence (multi-archive metadata)
- **`migration.rs`**: Single-archive → multi-archive migration path
- **`session.rs`**: Session lineage types shared across layers
- **`accumulator.rs`**: MessageAccumulator for assembling streaming message chunks
- **`backfill.rs`**: Backfill helpers for importing historical sessions
- **`import/`**: External conversation importers (e.g. Claude export)

### Events

- **`events.rs`**: EventHandler for subscribing to dirigent_protocol events and archiving them

## Configuration

The Archivist archive root is determined by `DirigentPaths` resolution:

- Set `DIRIGENT_DATA_DIR` to override the data directory; archives will be stored at `<data_dir>/archives/`
- Defaults to `~/.local/share/dirigent/archives/` (or platform equivalent)

```bash
DIRIGENT_DATA_DIR=/path/to/data dx serve
```

## Archive Structure

```
dirigent_archive/
├── .contexts/
│   └── {scroll_id:uuidv7}/           # One directory per session
│       ├── session.json              # Session metadata
│       ├── messages.jsonl            # Incremental message log (.ndjson also supported)
│       └── lineage.json              # Session lineage info (optional)
├── .db/
│   └── connectors/
│       ├── index.tsv                 # Fast connector lookup (TSV)
│       └── {connector_uid}/
│           ├── connector.json        # Connector metadata
│           └── sessions.jsonl        # Session mappings (.ndjson also supported)
└── .files/
    └── {sha256-hash}                 # Content-addressable file storage
```

### Why Hidden Directories?

The `.contexts`, `.db`, and `.files` directories are hidden (prefixed with `.`) to keep the archive root clean for future rendered outputs (like `chat.md` exports). This is similar to how `.git` hides implementation details in a codebase.

## File Formats

### Session Metadata (`session.json`)

```json
{
  "version": 1,
  "scroll_id": "01936e8f-e5a7-7000-8000-000000000001",
  "created_at": "2025-01-01T12:00:00Z",
  "updated_at": "2025-01-01T12:30:00Z",
  "title": "Implement user authentication",
  "connector_uid": "01936e8f-e5a7-7000-8000-000000000002",
  "native_session_id": "abc123",
  "agent_id": null,
  "parent_scroll_id": null,
  "continuation": null,
  "tags": ["backend", "auth"],
  "metadata": {
    "source": "OpenCode",
    "model": "claude-3-5-sonnet"
  }
}
```

### Messages Log (`messages.jsonl`)

One JSON object per line, **append-only**:

```jsonl
{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000003","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":null,"ts":"2025-01-01T12:01:00Z","role":"user","author":"alice","content_md":"How do I implement JWT auth?","attachments":[],"metadata":{}}
{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000004","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":"01936e8f-e5a7-7000-8000-000000000003","ts":"2025-01-01T12:01:10Z","role":"assistant","author":"claude","content_md":"Here's how to implement JWT authentication...","attachments":[],"metadata":{"model":"claude-3-5-sonnet"}}
```

**IMPORTANT - Ordering**: The order of lines in the message log file (`messages.jsonl` or `messages.ndjson`) reflects **event arrival order**, NOT chronological order. Assistant replies often arrive after subsequent user messages due to streaming latency, resulting in non-chronological file order. Always use the `Archivist::get_messages()` API to retrieve messages, which sorts by `ts` (timestamp) and `message_id` (UUIDv7) to guarantee chronological order.

**File Format Compatibility**: The archivist supports both `.ndjson` and `.jsonl` file extensions for newline-delimited JSON files. When reading, `.jsonl` is preferred if present, with automatic fallback to `.ndjson` for backward compatibility. Write operations use `.jsonl` (canonical format). Both formats are identical in content - the difference is purely the file extension.

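The extension rule above (prefer `.jsonl` on read, fall back to `.ndjson`, always write `.jsonl`) can be sketched as a pair of path helpers. The function names are hypothetical and the existence check is passed in as a closure so the sketch runs without touching the file system; real code would typically pass something like `|p| p.exists()`.

```rust
use std::path::{Path, PathBuf};

/// Picks the log file to read: `.jsonl` if present, otherwise `.ndjson`.
/// `exists` is injected so the logic can be exercised without real files.
fn resolve_log_for_read(
    dir: &Path,
    stem: &str,
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    let jsonl = dir.join(format!("{stem}.jsonl"));
    if exists(&jsonl) {
        return Some(jsonl);
    }
    let ndjson = dir.join(format!("{stem}.ndjson"));
    if exists(&ndjson) {
        return Some(ndjson);
    }
    None
}

/// Writes always target the canonical `.jsonl` name.
fn log_for_write(dir: &Path, stem: &str) -> PathBuf {
    dir.join(format!("{stem}.jsonl"))
}

fn main() {
    let dir = Path::new("archive/.contexts/some-session");
    // Pretend only a legacy `.ndjson` file is on disk.
    let only_legacy = |p: &Path| p.extension().and_then(|e| e.to_str()) == Some("ndjson");
    println!("read:  {:?}", resolve_log_for_read(dir, "messages", only_legacy));
    println!("write: {:?}", log_for_write(dir, "messages"));
}
```
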
### Connector Index (`index.tsv`)

Tab-separated values with header row:

```tsv
connector_uid	type	title	client_native_id	alias_of	created_at
01936e8f-e5a7-7000-8000-000000000002	OpenCode	OpenCode Local	opencode@http://localhost:12225		2025-01-01T12:00:00Z
```

### Session Mappings (`sessions.jsonl`)

Maps native session IDs from connectors to scroll IDs in the archive:

```jsonl
{"version":1,"connector_uid":"01936e8f-e5a7-7000-8000-000000000002","native_session_id":"abc123","scroll_id":"01936e8f-e5a7-7000-8000-000000000001","created_at":"2025-01-01T12:00:00Z","alias_of":null}
```

## Message Ordering Guarantees

### The Problem: Append Order ≠ Chronological Order

In the event-driven architecture, messages are written to the message log file (`messages.jsonl`) as completion events arrive. Due to streaming latency:

- User messages complete nearly instantly and are written immediately
- Assistant messages stream over time and complete later
- A second user message can be written before the first assistant reply completes

Example scenario:
```
T0: User sends "tell me a joke about snakes" (ts=18:23:36.947)
T1: Assistant starts streaming reply (ts=18:23:36.969)
T2: User sends "now one about tigers" (ts=18:23:49.429)  <- completes and writes BEFORE assistant finishes
T3: Assistant finishes "snakes" reply  <- writes AFTER "tigers" user message
```

File order in the message log file:
```
1. user "snakes" (18:23:36.947)
2. user "tigers" (18:23:49.429)       <- written second
3. assistant "snakes" (18:23:36.969)  <- written third, but timestamp is earlier!
```

### The Solution: Sort-on-Read

The `Archivist::get_messages()` implementation sorts messages before returning:

1. **Primary sort**: `ts` (timestamp) ascending
2. **Secondary sort**: `message_id` (UUIDv7) ascending for stable tie-breaking

This guarantees chronological order regardless of NDJSON append order:
```
1. user "snakes" (18:23:36.947)
2. assistant "snakes" (18:23:36.969)
3. user "tigers" (18:23:49.429)
```

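The two-key sort can be sketched in a few lines. This is an illustration only: `Msg` is a hypothetical, trimmed-down record, `sort_for_read` is a hypothetical name, and the timestamp is assumed to be already parsed into a comparable integer (milliseconds) rather than compared as a string.

```rust
/// Hypothetical, trimmed-down message record.
#[derive(Debug, Clone, PartialEq)]
struct Msg {
    ts_millis: u64,     // assumption: `ts` already parsed to a comparable integer
    message_id: String, // UUIDv7 in canonical text form
    label: &'static str,
}

/// Sort by timestamp first, then by message_id as the tie-breaker.
fn sort_for_read(mut messages: Vec<Msg>) -> Vec<Msg> {
    messages.sort_by(|a, b| {
        a.ts_millis
            .cmp(&b.ts_millis)
            .then_with(|| a.message_id.cmp(&b.message_id))
    });
    messages
}

fn msg(ts_millis: u64, message_id: &str, label: &'static str) -> Msg {
    Msg { ts_millis, message_id: message_id.to_string(), label }
}

fn main() {
    // File (append) order from the scenario above; times are seconds.millis
    // within the minute, written as integers.
    let file_order = vec![
        msg(36_947, "id-1", "user snakes"),
        msg(49_429, "id-3", "user tigers"),
        msg(36_969, "id-2", "assistant snakes"),
    ];
    for m in sort_for_read(file_order) {
        println!("{} ({})", m.label, m.ts_millis);
    }
}
```
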
### Why This Approach?

- **Maintains durability**: Append-only writes preserve crash safety
- **No migration needed**: Existing archives work without rewrites
- **Simple implementation**: No buffered writes or complex write-time ordering
- **Performance trade-off**: Small CPU cost on read (sorting) vs. complex write-time coordination

### Consumer Guidance

- **DO**: Use `Archivist::get_messages()` to retrieve messages
- **DON'T**: Read the message log file directly and assume file order = chronological order
- **UI/API**: Always sort by `ts` then `message_id` for defense in depth
- **Tie-breaking**: Use `message_id` (UUIDv7) as secondary sort for stable ordering when timestamps match

## Key Types

### SessionMetadata

Stores all metadata about a session including:
- **scroll_id**: UUIDv7 identifier for the session
- **connector_uid**: Which connector owns this session
- **native_session_id**: Original session ID from the connector (optional)
- **title**: Optional human-readable session title (see Title Management below)
- **parent_scroll_id**: For session lineage (splits, continuations)
- **continuation**: Type of continuation (SPLIT, COMPACT, REFERENCE, EDIT)
- **tags**: User-defined categorization
- **metadata**: Free-form JSON for connector-specific fields

#### Title Management

Session titles are fully supported and persist across restarts. Titles are stored in the `SessionMetadata` struct and saved to the `session.json` file.

**Setting Titles:**
```rust
// Update title for an existing session
archivist.update_session_metadata(
    scroll_id,
    Some("My Custom Session Title".to_string()),
    None, // model
    None  // archive
).await?;
```

**Default Behavior:**
- New sessions can specify an initial title during registration
- If no title is provided, sessions default to `None`
- The UI typically displays "Untitled" for sessions without titles

**Title Loading:**
- Titles are automatically loaded when retrieving session metadata via `get_session_metadata()`
- Session lists include titles via `list_sessions()` and `list_sessions_all()`
- Titles are part of the `SessionMetadata` struct returned by all session queries

**UI Integration:**
- The web UI displays session titles in the session list and sidebar
- Users can rename sessions via the "Rename" button in the session list view
- Renaming calls `api::archivist::rename_session()` which uses `update_session_metadata()`
- Title changes are persisted immediately and survive application restarts

### MessageRecord

Represents a single message in the archive:
- **message_id**: UUIDv7 identifier
- **session**: scroll_id this message belongs to
- **role**: "user", "assistant", or "system"
- **content_md**: Message content in Markdown format
- **attachments**: References to attached files
- **metadata**: Free-form JSON for connector-specific fields

### ConnectorRecord

Metadata about a connector:
- **connector_uid**: UUIDv7 identifier
- **type**: "OpenCode", "ACP", or custom
- **client_native_id**: Unique identifier from client (e.g., "opencode@http://localhost:12225")
- **alias_of**: If this connector is an alias of another (for deduplication)

## Archivist Public API

The `Archivist` struct (in `coordinator/`) is the main public entry point
for archival operations. Consumers hold `Arc<Archivist>` and call inherent
methods; there is no `Archivist` trait anymore. The coordinator resolves
the target backend per call (via `archive: Option<String>`) and delegates
to `ArchiveBackend` methods.

Key method families (see `coordinator/*.rs` for full signatures):

- **Archive lifecycle** (`archives.rs`): `add_archive`, `remove_archive`,
  `list_archives`, `set_default_archive`. In Phase 3, `add_archive`,
  `remove_archive`, and `set_default_archive` return
  `ArchivistError::DynamicRegistryUnsupported` (see Lifecycle above).
- **Connectors** (`connectors.rs`): `register_connector` with tri-state
  result (Accepted / Aliased / Rejected), `list_connectors`
- **Sessions** (`sessions.rs`): `register_session`, `get_session_metadata`,
  `update_session_metadata`, `list_sessions_paged`, `move_session`,
  `copy_session`, `resolve_session`
- **Messages**: `append_messages`, `get_messages` (sorts by `ts` then
  `message_id` for stable chronological order)
- **Meta / DAG** (`meta.rs`): meta-event recording, session lineage DAG
  walks, cleanup routines

## List Filter vs. Full-Text Search

Two distinct query paths exist; do not conflate them.

**List filter**: `Archivist::list_sessions_paged(SessionListQuery)` returns a
cursor-paged list of sessions, AND-filtered by `title_query` (substring on
title), `tags`, `model_filter` (substring on `metadata.model`), `project_id`,
`connector_uid`, and `include_hidden`. This is the right tool for "narrow the
list of visible sessions."

**Full-text search**: `api::search_sessions` (in the `api` package, backed by
`api::archivist::search_task::SearchTask`) spawns `rg --json` over the
archive's `.contexts/` tree to find messages containing text. It streams
`SearchExcerpt`s with parsed NDJSON content and supports cancellation via
`CancellationToken`. This is the right tool for "find messages containing
text."

**Do not extend `list_sessions_paged` to do content search.** Content search
belongs in the ripgrep pipeline. Future improvements to content search
(indexed backends, relevance scoring) are Phase 2d / Phase 3 concerns.

## JsonlBackend Implementation

The Phase 2 production backend: an implementation of `ArchiveBackend` plus
every sub-trait except `SearchBackend`:

- **Thread-safe**: Uses RwLock for in-memory caches
- **Async**: All operations use tokio for non-blocking I/O
- **Caching**: In-memory caches for connector and session mappings
- **Collision Detection**: Tri-state registration for connectors and sessions

Located under `src/backends/jsonl/` and split by concern (`backend.rs`,
`connectors.rs`, `dag.rs`, `mapping.rs`, `meta.rs`).

### Caching Strategy

`JsonlBackend` maintains two in-memory caches:

1. **connector_cache**: `HashMap<Uuid, ConnectorRecord>`
   - Populated on registration
   - Read from TSV index on startup (future enhancement)

2. **session_cache**: `HashMap<(Uuid, String), Uuid>`
   - Maps (connector_uid, native_session_id) to scroll_id
   - Populated on registration and session resolution
   - Enables fast session lookups without disk I/O

## Event Handling

The EventHandler subscribes to dirigent_protocol events and archives them in real-time:

```rust
use std::sync::Arc;

// Create archivist and event handler
let archivist = Archivist::new_with_single_archive(archive_path).await?;
let handler = EventHandler::new(Arc::new(archivist));

// Subscribe to event stream from dirigent_core
let events = event_stream.subscribe();

// Run the event loop. The `.await` does not return until the loop exits,
// so spawn this on its own task if the caller needs to keep working.
handler.run(events).await;
```

### Supported Events

- **SessionCreated**: Registers new sessions with the archivist
- **MessageCompleted**: Writes finalized messages to the archive
- **SessionUpdate**: Accumulates streaming message chunks
  - AgentMessageChunk
  - UserMessageChunk
  - AgentThoughtChunk
  - ToolCall

### MessageAccumulator

Assembles streaming message chunks into complete messages:

- Accumulates text chunks by message_id
- Tracks thinking blocks separately
- Stores tool calls with input/output
- Finalizes messages on MessageCompleted event
- Converts to MessageRecord for archival

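The accumulate-then-finalize flow listed above can be sketched as follows. This is not the crate's `MessageAccumulator`: the type and method names (`Accumulator`, `push_text`, `push_thought`, `push_tool_call`, `finalize`) and the string-only payloads are assumptions made to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical in-progress message state, keyed by message_id.
#[derive(Debug, Default)]
struct Pending {
    text: String,
    thoughts: String,
    tool_calls: Vec<String>,
}

/// Hypothetical finalized record, ready to be appended to the archive.
#[derive(Debug, PartialEq)]
struct Finalized {
    message_id: String,
    content_md: String,
    thoughts: String,
    tool_calls: Vec<String>,
}

#[derive(Default)]
struct Accumulator {
    pending: HashMap<String, Pending>,
}

impl Accumulator {
    /// Appends a text chunk to the message it belongs to.
    fn push_text(&mut self, message_id: &str, chunk: &str) {
        self.pending.entry(message_id.to_string()).or_default().text.push_str(chunk);
    }

    /// Thinking chunks are tracked separately from the visible text.
    fn push_thought(&mut self, message_id: &str, chunk: &str) {
        self.pending.entry(message_id.to_string()).or_default().thoughts.push_str(chunk);
    }

    fn push_tool_call(&mut self, message_id: &str, call: &str) {
        self.pending.entry(message_id.to_string()).or_default().tool_calls.push(call.to_string());
    }

    /// Called on the completion event: removes the pending state and returns
    /// the assembled message, or `None` if nothing was accumulated.
    fn finalize(&mut self, message_id: &str) -> Option<Finalized> {
        let p = self.pending.remove(message_id)?;
        Some(Finalized {
            message_id: message_id.to_string(),
            content_md: p.text,
            thoughts: p.thoughts,
            tool_calls: p.tool_calls,
        })
    }
}

fn main() {
    let mut acc = Accumulator::default();
    acc.push_thought("m1", "user wants a joke");
    acc.push_text("m1", "Why did the snake ");
    acc.push_text("m1", "cross the road?");
    println!("{:?}", acc.finalize("m1"));
}
```

Removing the pending entry on finalize means a duplicate completion event yields `None` instead of writing the same message twice in this sketch.
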
## Integration with dirigent_core

The Archivist integrates with dirigent_core via the global event stream:

1. **CoreRuntime** emits events for all connector operations
2. **EventHandler** subscribes to event stream
3. **MessageAccumulator** assembles streaming chunks
4. **Archivist** writes complete messages to archive

This enables:
- Automatic archival of all sessions and messages
- No polling required - fully event-driven
- Consistent history across restarts
- Offline access to historical data

## Testing

The package has comprehensive test coverage across multiple dimensions:

### Unit Tests

Located in each module (`src/*.rs`, `src/storage/*.rs`):
- Type serialization/deserialization
- UUIDv7 generation and ordering
- Timestamp formatting (RFC 3339)
- Storage operations (NDJSON, JSON, TSV, files)
- Connector registration tri-state logic
- Session registration and alias detection

### Integration Tests

Located in `tests/`:
- `integration_tests.rs`: Full `Archivist` + `JsonlBackend` lifecycle, event
  handler integration, multi-connector scenarios, session lineage, message
  accumulation
- `list_sessions_paged_test.rs`, `pagination_test.rs`: List filter + cursor
  pagination coverage
- `import_claude_idempotency_test.rs`: Claude export re-import idempotency

### Backend Contract Tests

`src/backend/contract.rs` holds reusable async assertions that any
`&dyn ArchiveBackend` must pass. `JsonlBackend` and `MockBackend` both
run the contract suite; new backends added in Phase 3+ should do the same.

### Examples

Located in `examples/`:
- `basic_usage.rs`: Core archivist operations
- `event_handling.rs`: EventHandler and MessageAccumulator
- `file_storage.rs`: Content-addressable file storage

Run tests:
```bash
cargo test --package dirigent_archivist
```

Run examples:
```bash
cargo run --package dirigent_archivist --example basic_usage
cargo run --package dirigent_archivist --example event_handling
cargo run --package dirigent_archivist --example file_storage
```

## Performance Characteristics

- **Append Operations**: O(1) with sequential file writes
- **Session Lookup**: O(1) on an in-memory cache hit, O(n) on a cache miss
- **Message Retrieval**: O(n) NDJSON parsing where n = number of messages, plus an O(n log n) sort in `get_messages()`
- **File Storage**: O(1) lookup by SHA-256 hash; computing the hash itself is linear in file size
- **Connector Index**: O(n) TSV scan, suitable for hundreds of connectors

### Scalability Considerations

- **Large Sessions**: NDJSON is append-only, so reading large sessions requires parsing all lines
- **Many Sessions**: TSV indices are suitable for thousands of sessions per connector
- **File Deduplication**: SHA-256 hashing provides automatic deduplication across sessions
- **Concurrent Access**: RwLock allows either multiple concurrent readers or a single writer

## Error Handling

The Archivist uses thiserror for rich error types:

```rust
pub enum ArchivistError {
    IoError(std::io::Error),
    SerdeError(serde_json::Error),
    SessionUnknown(Uuid),
    CollisionInconsistent(Uuid),
    // ... etc
}
```

All public APIs return `Result<T, ArchivistError>` for explicit error handling.

## Development Notes

- All storage operations are async (using tokio)
- Content-addressable storage uses SHA-256 hashes (hex-encoded)
- Archive directory structure mirrors session/message hierarchy
- UUIDv7 provides time-ordered, sortable identifiers
- RFC 3339 UTC timestamps for all time-based fields
- Schema versioning via `version` field in all records

## Related Packages

- **dirigent_protocol**: Shared types and protocol definitions (dependency)
- **dirigent_core**: Runtime integration for SSE event capture (integration point)
- **api**: Server functions for archive queries (future)
- **web**: UI for archive browsing and search (future)

## Phase 4: `ArchiveFilter` (2026-04-21)

Every `ArchiveRegistration` carries a `filter: ArchiveFilter`. The filter
describes which sessions/writes the backend wants to receive. Fields:

- `include_connectors: Option<HashSet<Uuid>>`: if `Some`, only these
  connector UIDs pass. `None` means no connector gate.
- `exclude_connectors: HashSet<Uuid>`: always rejected.
- `include_tags: HashSet<String>`: if non-empty, the session must carry
  at least one matching tag.
- `exclude_tags: HashSet<String>`: any matching tag rejects the session.
- `include_hidden: bool`: default `true`. If `false`, sessions whose
  metadata has `"hidden": true` are skipped.

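The field semantics above can be read as a single predicate. The following is a minimal sketch, not the crate's implementation: the `accepts` method, its parameters, and the order of the checks are assumptions, and `u128` stands in for `uuid::Uuid` so that the example needs only the standard library.

```rust
use std::collections::HashSet;

/// Stand-in for `uuid::Uuid` so the sketch compiles with std only (assumption).
type ConnectorUid = u128;

#[derive(Debug, Clone)]
pub struct ArchiveFilter {
    pub include_connectors: Option<HashSet<ConnectorUid>>,
    pub exclude_connectors: HashSet<ConnectorUid>,
    pub include_tags: HashSet<String>,
    pub exclude_tags: HashSet<String>,
    pub include_hidden: bool,
}

impl Default for ArchiveFilter {
    /// The unrestricted filter: no gates, hidden sessions included.
    fn default() -> Self {
        Self {
            include_connectors: None,
            exclude_connectors: HashSet::new(),
            include_tags: HashSet::new(),
            exclude_tags: HashSet::new(),
            include_hidden: true,
        }
    }
}

impl ArchiveFilter {
    /// Hypothetical helper: does a session with this connector, tag set,
    /// and hidden flag pass the filter?
    pub fn accepts(&self, connector: ConnectorUid, tags: &HashSet<String>, hidden: bool) -> bool {
        // Excluded connectors are always rejected.
        if self.exclude_connectors.contains(&connector) {
            return false;
        }
        // `Some(set)` gates on membership; `None` means no connector gate.
        if let Some(allowed) = &self.include_connectors {
            if !allowed.contains(&connector) {
                return false;
            }
        }
        // A non-empty include list requires at least one matching tag.
        if !self.include_tags.is_empty() && self.include_tags.is_disjoint(tags) {
            return false;
        }
        // Any matching excluded tag rejects the session.
        if !self.exclude_tags.is_disjoint(tags) {
            return false;
        }
        // Hidden sessions pass only when `include_hidden` is true.
        if hidden && !self.include_hidden {
            return false;
        }
        true
    }
}
```

With this reading, `ArchiveFilter::default().accepts(..)` returns `true` for every input, which matches the description of the default filter as unrestricted.
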
### Primary-always-writes invariant

The per-call primary (either the `archive: Some(name)` argument or the
default write-target) is **never** filtered. If a caller explicitly asks
to write to archive X, the filter on X is not consulted. Filters only
gate secondary fanout.

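One way to picture the invariant is a target-selection function in which the primary bypasses the filter check. This is a simplified sketch under stated assumptions: `Registration`, its `accepts` field (a precomputed filter verdict for the write at hand), and `write_targets` are hypothetical names, and every non-primary registration is treated as a secondary fanout candidate.

```rust
/// Hypothetical, simplified registration. The real `ArchiveRegistration`
/// carries an `ArchiveFilter`; here its verdict is precomputed as a bool.
pub struct Registration {
    pub name: String,
    /// Whether this archive's filter accepts the write being routed.
    pub accepts: bool,
}

/// Returns the archives that should receive a write, in registration order.
pub fn write_targets<'a>(
    regs: &'a [Registration],
    explicit: Option<&str>,
    default_target: &str,
) -> Vec<&'a str> {
    // The primary is the explicit argument if given, else the default write-target.
    let primary = explicit.unwrap_or(default_target);
    regs.iter()
        // Primary: always written, filter not consulted.
        // Secondaries: written only if their filter accepts.
        .filter(|r| r.name == primary || r.accepts)
        .map(|r| r.name.as_str())
        .collect()
}
```

In this sketch, an archive whose filter rejects the session still receives the write when it is named explicitly, while the default target is then treated like any other secondary.
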
### Boot validator

At boot (`coordinator/boot.rs`), the validator rejects configurations
where:

- No registration that is both write-active and enabled has an
  **unrestricted** filter (`ArchiveFilter::default()` is unrestricted).
  This prevents configurations that silently drop all writes.
- An archive's filter has `include_connectors = Some(empty set)`, which
  is equivalent to "reject everything" and almost certainly a config
  bug.

See `docs/plans/2026-04-21-archivist-phase4-design.md` §4 for the full
design rationale.

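The two rejection rules of the boot validator can be sketched as a single validation pass. This is an illustrative approximation, not the code in `coordinator/boot.rs`: the `enabled` and `write_active` field names, the `BootError` enum, and the `validate` function are assumptions, "unrestricted" is modelled as equality with `ArchiveFilter::default()`, and `u128` stands in for `Uuid`.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
pub struct ArchiveFilter {
    pub include_connectors: Option<HashSet<u128>>, // u128 stands in for Uuid
    pub exclude_connectors: HashSet<u128>,
    pub include_tags: HashSet<String>,
    pub exclude_tags: HashSet<String>,
    pub include_hidden: bool,
}

impl Default for ArchiveFilter {
    fn default() -> Self {
        Self {
            include_connectors: None,
            exclude_connectors: HashSet::new(),
            include_tags: HashSet::new(),
            exclude_tags: HashSet::new(),
            include_hidden: true,
        }
    }
}

/// Hypothetical shape; the field names are assumptions.
pub struct ArchiveRegistration {
    pub name: String,
    pub enabled: bool,
    pub write_active: bool,
    pub filter: ArchiveFilter,
}

#[derive(Debug, PartialEq)]
pub enum BootError {
    /// No enabled, write-active archive would accept every write.
    NoUnrestrictedWriter,
    /// `include_connectors = Some(empty set)` on the named archive.
    EmptyIncludeConnectors(String),
}

pub fn validate(regs: &[ArchiveRegistration]) -> Result<(), BootError> {
    // Rule 2: an empty include set rejects everything, so treat it as a config bug.
    for r in regs {
        if matches!(&r.filter.include_connectors, Some(set) if set.is_empty()) {
            return Err(BootError::EmptyIncludeConnectors(r.name.clone()));
        }
    }
    // Rule 1: at least one enabled, write-active archive must be unrestricted.
    let has_catch_all = regs
        .iter()
        .any(|r| r.enabled && r.write_active && r.filter == ArchiveFilter::default());
    if has_catch_all {
        Ok(())
    } else {
        Err(BootError::NoUnrestrictedWriter)
    }
}
```
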
## Phase 5: Importers (2026-04-21)

The `import::` module centres on an `Importer` trait with per-source
implementations under `import::sources::*`. Each source produces its own
intermediate form (a `ParsedConversation` for ChatGPT, a `ParsedSession`
for Codex, a session directory walk for Claude) and feeds the results
through the common `import_sessions` orchestrator, which fires
`ImportProgressEvent`s on a bounded `ImportProgressSink`.

### `Importer` trait

Every importer declares a `config_shape()` so UIs can render a dynamic
form; a `discover()` that returns an `ImportDiscovery` preview; and an
`import()` that does the actual work. All three methods are async.

The trait lives in `import::trait_def`. Shape types (`ImportConfig`,
`ImportTarget`, `ConfigField`, `ConfigFieldKind`, `ImportError`) are
serialisable and safe to cross the WASM boundary.

### Registry

`ImporterRegistry::with_defaults()` registers an importer for every
enabled `importer-*` feature. Currently: `claude`, `chatgpt`, `codex`.
The registry is constructed at boot and stored on `AppState`.

### Progress sink

`ImportProgressSink::channel()` returns a bounded mpsc pair.
Non-terminal events use `try_send` and are dropped when the channel is
full; terminal events use `send().await` so consumers always see the
final state.

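The drop-versus-wait policy can be demonstrated with the standard library's bounded channel. This is a sketch of the policy only, not the crate's sink: `std::sync::mpsc::sync_channel` with a blocking `send` stands in for the async channel and `send().await`, and `ProgressSink`, `ProgressEvent`, its variants, and `emit` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical event type; the real `ImportProgressEvent` variants may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum ProgressEvent {
    Progress { done: usize, total: usize },
    Finished { imported: usize },
}

impl ProgressEvent {
    fn is_terminal(&self) -> bool {
        matches!(self, ProgressEvent::Finished { .. })
    }
}

/// Hypothetical sink wrapping the sending half of a bounded channel.
pub struct ProgressSink {
    tx: SyncSender<ProgressEvent>,
}

impl ProgressSink {
    pub fn channel(capacity: usize) -> (Self, Receiver<ProgressEvent>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx }, rx)
    }

    /// Returns true if the event was enqueued, false if it was dropped
    /// (channel full for a non-terminal event, or receiver gone).
    pub fn emit(&self, event: ProgressEvent) -> bool {
        if event.is_terminal() {
            // Terminal: wait for capacity so the final state is never lost.
            self.tx.send(event).is_ok()
        } else {
            // Non-terminal: never block the importer; drop when full.
            self.tx.try_send(event).is_ok()
        }
    }
}
```

The trade-off is that a slow consumer may miss intermediate progress updates, but it never stalls the import and still receives the final event once it drains the channel.
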
### Source crates

- `dirigent_chatgpt`: parses `conversations.json` from the OpenAI data
  export.
- `dirigent_codex`: parses `*.jsonl` session files under
  `~/.codex/sessions`.

Both crates contain only parser types and no Dirigent-specific types.

See `docs/plans/2026-04-21-archivist-phase5-design.md`.

## Future Enhancements

- Indexed `SearchBackend` implementations (tantivy/sqlite); content
  search is currently ripgrep-based in the `api` package
- Session splitting and lineage management (mutations.ndjson)
- Knowledge overview generation (chat.md exports)
- Embedding storage and search (embeds/)
- Network RPC interface for remote archivist
- Compaction and pruning policies
- Additional concrete backends (e.g. SQLite, remote)

## Documentation

- **Package README**: `./README.md` - User-facing overview
- **Architecture Docs**: `../../docs/building/05_archivist/` - Design and planning
- **API Docs**: Run `cargo doc --package dirigent_archivist --open`
- **Examples**: See `examples/` directory for working code samples