sync from monorepo @ 2452e92e

2026-05-08 01:59:04 +02:00
commit b03dc15371
459 changed files with 129586 additions and 0 deletions
@@ -0,0 +1,761 @@
+# Package: dirigent_archivist
+
+Persistent storage for all agentic interactions in Dirigent.
+
+## Quick Facts
+- **Type**: Library
+- **Main Entry**: src/lib.rs
+- **Dependencies**: dirigent_protocol, uuid, chrono, serde, tokio, tracing, thiserror, sha2, hex, async-trait
+- **Status**: Complete - Production ready with comprehensive tests
+
+## Purpose
+
+The Archivist provides file-based archival storage for all session data, messages, and attachments in Dirigent. It implements an archive-first architecture with connector API fallback, using NDJSON, JSON, and TSV formats for durability and human-readability.
+
+## Key Features
+
+- **File-based Storage**: NDJSON for messages, JSON for metadata, TSV for indices
+- **Content-Addressable Files**: SHA-256 based storage for attachments with automatic deduplication
+- **Session Lineage**: Track splits, continuations, and mutations with parent references
+- **Connector Registry**: Coordinate UID assignment across connectors with collision detection
+- **Event Streaming**: Real-time updates via EventHandler subscribing to dirigent_protocol events
+- **Archive-First Design**: Read from archive first, fall back to connector API when needed
+- **Caching**: In-memory caching of connector and session mappings for performance
+
+## Architecture
+
+The Archivist is built on three core architectural principles:
+
+### 1. Archive-First Read Strategy
+
+The Archivist is the primary source of truth for historical data:
+- UI and APIs query the archive first
+- Only fall back to connector APIs if data is not in archive
+- This enables offline access and consistent history across restarts
+
+### 2. Write-Through Event Capture (Append-Only)
+
+The EventHandler subscribes to the global event stream from dirigent_core:
+- Captures session creation, message streaming, and tool calls in real-time
+- Uses MessageAccumulator to assemble streaming chunks into complete messages
+- Writes complete messages to archive immediately upon finalization
+- No polling required - fully event-driven
+- **Append-only writes**: Messages are appended as events arrive, NOT in chronological order
+- File order reflects event timing, not message timestamps
+
+### 3. File-Based Storage with Sort-on-Read
+
+All data is stored in human-readable, grep-able formats:
+- **NDJSON** (Newline-Delimited JSON): Incremental append-only logs for messages and mappings
+- **JSON**: Structured metadata for sessions and connectors
+- **TSV** (Tab-Separated Values): Fast indices for cross-references
+- **Content-Addressed Files**: Binary attachments stored by SHA-256 hash for deduplication
+- **Sort-on-Read**: `get_messages()` sorts by timestamp and message_id to ensure chronological order despite append-only writes
+
+## Backend Trait Layer (Phase 2)
+
+The archivist uses a trait-based backend abstraction. `ArchiveBackend`
+defines the mandatory session and message primitives every backend must
+provide, plus `as_xxx()` accessors returning optional sub-traits:
+
+- `SearchBackend` — reserved for Phase 3+ indexed backends (not wired)
+- `DagBackend` — session lineage DAG edges
+- `MetaEventsBackend` — ACP connection lifecycle events
+- `ConnectorRegistryBackend` — per-archive connector metadata
+- `SessionMappingBackend` — native↔scroll session ID mapping
+
+`JsonlBackend` is the Phase 2 concrete implementation (file-based
+NDJSON/JSON/TSV) and opts into every sub-trait except `SearchBackend`
+(content search continues to be served by ripgrep via
+`crates/api/src/archivist/search_task.rs`).
+
+The `Archivist` struct (in `src/coordinator/`) owns a registry of backends
+keyed by archive name and performs orchestration (alias detection, session
+lineage, move/copy, DAG walks, archive lifecycle). Consumers hold
+`Arc<Archivist>` directly — the coordinator is concrete, not a trait.
+
+See `docs/plans/2026-04-18-archivist-phase2-design.md` for design rationale.
+
+## Multi-Backend Registry (Phase 3)
+
+The coordinator (`Archivist`) holds `Vec<Arc<ArchiveRegistration>>` sorted
+by `read_priority` instead of a flat `HashMap<name, Arc<dyn ArchiveBackend>>`.
+Each registration carries:
+
+- `backend: Arc<dyn ArchiveBackend>` + its declared capabilities
+- `failure_mode`: `Required` (must succeed) | `BestEffort` (errors are logged and drift health)
+- `read_priority`: lower = tried first for reads; also selects the default
+  write target when no archive is named
+- `write_active`: participates in fanout writes
+- `enabled`: kill-switch without removing config
+- `write_policy`: `Inline` (default; `await` per call) or `Queued`
+  (mpsc + batch_window + overflow policy)
+- Runtime state: `last_health`, `last_error`, `consecutive_failures`
+  (all `Arc<RwLock<_>>`, shared with the writer task when queued)
+- Optional `writer: Option<WriterHandle>` (Some iff `write_policy = Queued`)
+
+Backends are declared in `dirigent.toml` under `[[archives]]` and
+constructed at boot via `Archivist::from_config(cfg, &BackendRegistry)`.
+Add a new backend type by implementing `BackendFactory` and registering
+it on the `BackendRegistry` before `from_config`.
+
+### Reads
+
+`get_session`, `get_messages_paged`, `count_messages`, `get_meta_events`,
+`get_children`, etc. walk the registry in priority order via
+`read_walk_per_session(scroll_id, predicate, op)`. The predicate
+capability-filters; `Unavailable` backends are skipped. The first backend
+that returns `Some(value)` wins and its name is cached against the
+`scroll_id` in a positive LRU (capacity 10_000). Subsequent reads for the
+same `scroll_id` short-circuit to the cached backend before falling back
+to the full priority walk.
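
The priority walk with a positive cache can be sketched roughly as below. This is a minimal, synchronous illustration and not the coordinator's actual code: `Registration`, `Coordinator`, and the `u128` session IDs are hypothetical stand-ins, the capability predicate is omitted, and a plain `HashMap` replaces the capacity-bounded LRU.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified registration: an in-memory map stands in for
/// the sessions a backend has stored.
struct Registration {
    name: &'static str,
    read_priority: u32,
    available: bool, // false models an `Unavailable` backend
    sessions: HashMap<u128, String>,
}

/// Coordinator sketch. The documented cache is an LRU with capacity 10_000;
/// an unbounded HashMap keeps this example short.
struct Coordinator {
    regs: Vec<Registration>, // kept sorted by read_priority
    cache: HashMap<u128, &'static str>,
}

impl Coordinator {
    fn new(mut regs: Vec<Registration>) -> Self {
        regs.sort_by_key(|r| r.read_priority);
        Coordinator { regs, cache: HashMap::new() }
    }

    /// Returns the session value and the name of the backend that served it.
    fn get_session(&mut self, scroll_id: u128) -> Option<(String, &'static str)> {
        // Fast path: try the backend that answered last time.
        if let Some(&name) = self.cache.get(&scroll_id) {
            let hit = self
                .regs
                .iter()
                .find(|r| r.name == name && r.available)
                .and_then(|r| r.sessions.get(&scroll_id));
            if let Some(value) = hit {
                return Some((value.clone(), name));
            }
        }
        // Slow path: full walk in priority order, skipping unavailable backends.
        for r in self.regs.iter().filter(|r| r.available) {
            if let Some(value) = r.sessions.get(&scroll_id) {
                self.cache.insert(scroll_id, r.name);
                return Some((value.clone(), r.name));
            }
        }
        None
    }
}
```

Only hits are cached, so a session that no backend knows about triggers the full walk on every lookup.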
+
+Collection-shape reads (`list_sessions_paged`, `list_connectors`,
+`list_meta_sessions`, `find_meta_session_by_client`) use
+`read_walk_collection` — first enabled backend that can answer wins, no
+cache, no aggregation across backends. Phase 3 explicitly defers
+cross-backend merge/dedup to a later phase.
+
+### Writes
+
+Mutating methods (`append_messages`, `register_session`, `update_session_*`,
+`append_meta_events`, `append_dag_edge`, `clear_session_messages`,
+`update_connector_fingerprint`) resolve a primary (per-call `archive:
+Some(name)` override or the default-write target) and fan out to every
+other `enabled && write_active` backend that has the required capability.
+Capability-mismatched backends are skipped with a debug `capability_skip`
+log (never an error). `Required` failures propagate to the caller;
+`BestEffort` failures are logged and drift the backend's health.
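
A rough sketch of the fanout decision logic follows. All names here (`Reg`, `FanoutReport`, `fan_out`, the `fails` flag) are hypothetical and the writes are simulated. One assumption is labeled in the code: the capability check is applied to the primary as well, which the text above does not state.

```rust
#[derive(Clone, Copy, PartialEq)]
enum FailureMode {
    Required,
    BestEffort,
}

/// Hypothetical, simplified registration for the fanout sketch.
struct Reg {
    name: &'static str,
    enabled: bool,
    write_active: bool,
    has_capability: bool,
    failure_mode: FailureMode,
    fails: bool, // simulates a backend whose write errors
}

/// Which backends were written, skipped, or failed softly in one fanout.
#[derive(Debug, Default, PartialEq)]
struct FanoutReport {
    written: Vec<&'static str>,
    capability_skipped: Vec<&'static str>,
    best_effort_failures: Vec<&'static str>,
}

fn fan_out(regs: &[Reg], primary: &str) -> Result<FanoutReport, String> {
    let mut report = FanoutReport::default();
    for r in regs {
        let is_primary = r.name == primary;
        // Secondaries must be enabled and write-active.
        if !is_primary && !(r.enabled && r.write_active) {
            continue;
        }
        // Assumption: the capability check also applies to the primary.
        if !r.has_capability {
            report.capability_skipped.push(r.name); // a debug log in the text above
            continue;
        }
        if r.fails {
            match r.failure_mode {
                // A Required failure aborts and reaches the caller.
                FailureMode::Required => {
                    return Err(format!("required backend {} failed", r.name))
                }
                // A BestEffort failure is recorded and the fanout continues.
                FailureMode::BestEffort => report.best_effort_failures.push(r.name),
            }
        } else {
            report.written.push(r.name);
        }
    }
    Ok(report)
}
```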
+
+`register_connector` currently does NOT fan out — alias detection + the
+tri-state `Accepted`/`Aliased`/`Rejected` return shape make replication
+non-trivial. Fanout for connectors is deferred; single-backend setups are
+unaffected.
+
+For `write_policy = Queued` backends, the primary/secondary write paths
+enqueue a `WriteOp` into the backend's writer task instead of awaiting.
+Errors drift the backend's health but do not propagate to the caller.
+Coalescing merges consecutive `AppendMessages`/`AppendMetaEvents` for the
+same `scroll_id` within `batch_window_ms`.
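
Coalescing could look something like the sketch below. It assumes every op in `batch` already arrived within one `batch_window_ms`. The `WriteOp` payloads are simplified to strings, `scroll_id` is a `u128`, and the `RegisterSession` variant is a hypothetical example of a non-append op.

```rust
#[derive(Debug, Clone, PartialEq)]
enum WriteOp {
    AppendMessages { scroll_id: u128, messages: Vec<String> },
    AppendMetaEvents { scroll_id: u128, events: Vec<String> },
    RegisterSession { scroll_id: u128 }, // hypothetical non-append op
}

/// Merge each op into the previous one when both are the same append kind
/// for the same session; anything else starts a new entry.
fn coalesce(batch: Vec<WriteOp>) -> Vec<WriteOp> {
    let mut out: Vec<WriteOp> = Vec::new();
    for op in batch {
        let merged = match (out.last_mut(), &op) {
            (
                Some(WriteOp::AppendMessages { scroll_id: prev, messages }),
                WriteOp::AppendMessages { scroll_id, messages: more },
            ) if *prev == *scroll_id => {
                messages.extend(more.iter().cloned());
                true
            }
            (
                Some(WriteOp::AppendMetaEvents { scroll_id: prev, events }),
                WriteOp::AppendMetaEvents { scroll_id, events: more },
            ) if *prev == *scroll_id => {
                events.extend(more.iter().cloned());
                true
            }
            _ => false,
        };
        if !merged {
            out.push(op);
        }
    }
    out
}
```

Only adjacent ops are merged, so the relative order of writes to different sessions is preserved.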
+
+### Cross-backend operations
+
+- `delete_session(scroll_id, _)` fans out to every enabled backend that has
+  the session. Copies in `write_active=false` backends produce
+  `ArchivistError::DeleteOnReadOnlyBackend` (write-active copies are still
+  deleted); cache invalidated regardless of outcome.
+- `copy_session(scroll_id, from, to)` reads from `from`, writes to `to`,
+  including DAG and meta-events when both sides have the capability. The
+  source remains canonical (the cache is NOT rewritten).
+- `move_session(scroll_id, from, to)` is `copy + delete-from-source`. If
+  the source-side delete fails after the copy succeeded,
+  `ArchivistError::PartialMove { copied_to, delete_error }` is returned so
+  the caller knows the session now lives in both places.
+
+The Phase 2 connector-aware `move_session(scroll_id, target_connector_uid, _)`
+and `copy_session(scroll_id, target_connector_uid, _)` were renamed in Phase
+3 and survive as `move_session_to_connector` / `copy_session_to_connector`.
+Their bulk variant is `move_sessions_to_connector`.
+
+### Health
+
+`HealthStatus` drifts on every coordinator call that observes a backend:
+
+- Successful write → `Healthy`; `consecutive_failures` reset to 0.
+- Successful read → `Healthy` (only rescues `Degraded`; does not reset the counter).
+- Write failure → `Degraded { reason }`; `consecutive_failures += 1`; after
+  K = 5 consecutive failures drifts to `Unavailable { reason }`. Reads skip
+  `Unavailable` backends; writes against an `Unavailable` `Required`
+  backend fail, while writes against an `Unavailable` `BestEffort` backend
+  are still attempted.
+- Read failure alone never drifts past `Degraded`; writes are the
+  authoritative health signal.
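
The drift rules above amount to a small state machine. The sketch below is one reading of them, not the actual implementation: the `Health` wrapper and its method names are hypothetical, and "after K = 5 consecutive failures" is interpreted as "on the fifth failure".

```rust
#[derive(Debug, Clone, PartialEq)]
enum HealthStatus {
    Healthy,
    Degraded { reason: String },
    Unavailable { reason: String },
}

/// Failures needed before a backend drifts to `Unavailable` (K in the text).
const UNAVAILABLE_AFTER: u32 = 5;

/// Hypothetical holder for the per-registration runtime state.
struct Health {
    status: HealthStatus,
    consecutive_failures: u32,
}

impl Health {
    fn new() -> Self {
        Health { status: HealthStatus::Healthy, consecutive_failures: 0 }
    }

    /// A successful write fully restores the backend.
    fn on_write_ok(&mut self) {
        self.status = HealthStatus::Healthy;
        self.consecutive_failures = 0;
    }

    /// A failed write bumps the counter and may drift to `Unavailable`.
    fn on_write_err(&mut self, reason: &str) {
        self.consecutive_failures += 1;
        self.status = if self.consecutive_failures >= UNAVAILABLE_AFTER {
            HealthStatus::Unavailable { reason: reason.to_string() }
        } else {
            HealthStatus::Degraded { reason: reason.to_string() }
        };
    }

    /// A successful read only rescues `Degraded`; the counter is untouched.
    fn on_read_ok(&mut self) {
        if matches!(self.status, HealthStatus::Degraded { .. }) {
            self.status = HealthStatus::Healthy;
        }
    }

    /// A failed read degrades a healthy backend but never goes further.
    fn on_read_err(&mut self, reason: &str) {
        if self.status == HealthStatus::Healthy {
            self.status = HealthStatus::Degraded { reason: reason.to_string() };
        }
    }
}
```

Because a successful read leaves the counter alone, a backend that was rescued by a read can still reach `Unavailable` on its next failed write.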
+
+`list_archives_with_health()` returns a `Vec<ArchiveStatus>` snapshot of
+every registration: name, type, capabilities, health, last_error, and
+queue_depth (for queued backends).
+
+### Lifecycle
+
+Phase 3 is **startup-only**. `add_archive` / `remove_archive` /
+`set_default_archive` on the coordinator return
+`ArchivistError::DynamicRegistryUnsupported`. To change the registry,
+edit `dirigent.toml` and restart the server. `Archivist::shutdown()`
+drains queued writer tasks (sends `WriteOp::Shutdown` on each writer's
+mpsc and awaits ack); call it before process exit.
+
+Test-only constructors `Archivist::from_registrations(regs)` and
+`SessionMetadata::stub(scroll_id)` live under `#[cfg(any(test, feature =
+"test-utils"))]` for integration tests that bypass the factory.
+
+See `docs/plans/2026-04-19-archivist-phase3-design.md` for the full
+design rationale, and `examples/multi_backend.rs` for a runnable
+end-to-end example.
+
+## Module Organization
+
+### Core Modules
+
+- **`lib.rs`**: Public API surface and re-exports
+- **`types.rs`**: Core data structures (session metadata, message records, connector info, API types)
+- **`error.rs`**: Error types and Result alias for archivist operations
+
+### Backend Layer (`backend/`)
+
+- **`traits.rs`**: `ArchiveBackend` trait + 5 optional sub-traits
+- **`capability.rs`**: `ArchiveCapability` enum + `CapabilitySet` type
+- **`health.rs`**: `HealthStatus` enum returned by `health_check`
+- **`contract.rs`**: Reusable behavioral tests for any `&dyn ArchiveBackend` (cfg-gated)
+- **`mock.rs`**: In-memory `MockBackend` for coordinator unit tests (cfg-gated)
+
+### Concrete Backends (`backends/`)
+
+- **`jsonl/`**: The file-based `JsonlBackend` — the only Phase 2 backend.
+  Reuses `storage/` primitives for NDJSON/JSON/TSV operations.
+
+### Coordinator (`coordinator/`)
+
+- **`mod.rs`**: The `Archivist` struct + constructors
+- **`archives.rs`**: Archive lifecycle (add/remove/list/default)
+- **`connectors.rs`**: Connector registration + alias detection
+- **`sessions.rs`**: Session registration, metadata updates, move/copy
+- **`meta.rs`**: Meta events, DAG walks, cleanup
+
+### Storage Layer (`storage/`)
+
+Low-level file I/O primitives used by `JsonlBackend`. All storage operations are async and use tokio.
+
+- **`paths.rs`**: ArchivePaths utility for consistent directory structure and path resolution
+- **`ndjson.rs`**: Newline-delimited JSON operations (read_ndjson, append_ndjson)
+- **`json.rs`**: JSON operations (read_json, write_json)
+- **`tsv.rs`**: Tab-separated value operations for connector index
+- **`files.rs`**: Content-addressable file storage with SHA-256 hashing and deduplication
+
+### Supporting Modules
+
+- **`registry.rs`**: Archive registry persistence (multi-archive metadata)
+- **`migration.rs`**: Single-archive → multi-archive migration path
+- **`session.rs`**: Session lineage types shared across layers
+- **`accumulator.rs`**: MessageAccumulator for assembling streaming message chunks
+- **`backfill.rs`**: Backfill helpers for importing historical sessions
+- **`import/`**: External conversation importers (e.g. Claude export)
+
+### Events
+
+- **`events.rs`**: EventHandler for subscribing to dirigent_protocol events and archiving them
+
+## Configuration
+
+The Archivist archive root is determined by `DirigentPaths` resolution:
+
+- Set `DIRIGENT_DATA_DIR` to override the data directory; archives will be stored at `<data_dir>/archives/`
+- Defaults to `~/.local/share/dirigent/archives/` (or platform equivalent)
+
+```bash
+DIRIGENT_DATA_DIR=/path/to/data dx serve
+```
+
+## Archive Structure
+
+```
+dirigent_archive/
+├── .contexts/
+│   └── {scroll_id:uuidv7}/          # One directory per session
+│       ├── session.json             # Session metadata
+│       ├── messages.jsonl           # Incremental message log (.ndjson also supported)
+│       └── lineage.json             # Session lineage info (optional)
+├── .db/
+│   └── connectors/
+│       ├── index.tsv                # Fast connector lookup (TSV)
+│       └── {connector_uid}/
+│           ├── connector.json       # Connector metadata
+│           └── sessions.jsonl       # Session mappings (.ndjson also supported)
+└── .files/
+    └── {sha256-hash}                # Content-addressable file storage
+```
+
+### Why Hidden Directories?
+
+The `.contexts`, `.db`, and `.files` directories are hidden (prefixed with `.`) to keep the archive root clean for future rendered outputs (like `chat.md` exports). This is similar to how `.git` hides implementation details in a codebase.
+
+## File Formats
+
+### Session Metadata (`session.json`)
+
+```json
+{
+  "version": 1,
+  "scroll_id": "01936e8f-e5a7-7000-8000-000000000001",
+  "created_at": "2025-01-01T12:00:00Z",
+  "updated_at": "2025-01-01T12:30:00Z",
+  "title": "Implement user authentication",
+  "connector_uid": "01936e8f-e5a7-7000-8000-000000000002",
+  "native_session_id": "abc123",
+  "agent_id": null,
+  "parent_scroll_id": null,
+  "continuation": null,
+  "tags": ["backend", "auth"],
+  "metadata": {
+    "source": "OpenCode",
+    "model": "claude-3-5-sonnet"
+  }
+}
+```
+
+### Messages Log (`messages.jsonl`)
+
+One JSON object per line, **append-only**:
+
+```jsonl
+{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000003","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":null,"ts":"2025-01-01T12:01:00Z","role":"user","author":"alice","content_md":"How do I implement JWT auth?","attachments":[],"metadata":{}}
+{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000004","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":"01936e8f-e5a7-7000-8000-000000000003","ts":"2025-01-01T12:01:10Z","role":"assistant","author":"claude","content_md":"Here's how to implement JWT authentication...","attachments":[],"metadata":{"model":"claude-3-5-sonnet"}}
+```
+
+**IMPORTANT - Ordering**: The order of lines in the message log file (`messages.jsonl` or `messages.ndjson`) reflects **event arrival order**, NOT chronological order. Assistant replies often arrive after subsequent user messages due to streaming latency, resulting in non-chronological file order. Always use the `Archivist::get_messages()` API to retrieve messages, which sorts by `ts` (timestamp) and `message_id` (UUIDv7) to guarantee chronological order.
+
+**File Format Compatibility**: The archivist supports both `.ndjson` and `.jsonl` file extensions for newline-delimited JSON files. When reading, `.jsonl` is preferred if present, with automatic fallback to `.ndjson` for backward compatibility. Write operations use `.jsonl` (canonical format). Both formats are identical in content - the difference is purely the file extension.
+
+### Connector Index (`index.tsv`)
+
+Tab-separated values with header row:
+
+```tsv
+connector_uid	type	title	client_native_id	alias_of	created_at
+01936e8f-e5a7-7000-8000-000000000002	OpenCode	OpenCode Local	opencode@http://localhost:12225		2025-01-01T12:00:00Z
+```
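
Reading this index needs little more than splitting on tabs. The sketch below is an illustration under assumptions, not the code in `storage/tsv.rs`: `IndexRow` and `parse_index` are hypothetical names, fields are assumed to contain no tabs or newlines, and an empty `alias_of` column is read as "not an alias".

```rust
/// One row of `index.tsv`; field order follows the header shown above.
#[derive(Debug, PartialEq)]
struct IndexRow {
    connector_uid: String,
    connector_type: String,
    title: String,
    client_native_id: String,
    alias_of: Option<String>, // empty column means "not an alias"
    created_at: String,
}

/// Parse the index text, skipping the header row and blank lines.
fn parse_index(tsv: &str) -> Result<Vec<IndexRow>, String> {
    let mut rows = Vec::new();
    for (n, line) in tsv.lines().enumerate().skip(1) {
        if line.is_empty() {
            continue;
        }
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() != 6 {
            return Err(format!(
                "line {}: expected 6 columns, found {}",
                n + 1,
                cols.len()
            ));
        }
        rows.push(IndexRow {
            connector_uid: cols[0].to_string(),
            connector_type: cols[1].to_string(),
            title: cols[2].to_string(),
            client_native_id: cols[3].to_string(),
            alias_of: if cols[4].is_empty() {
                None
            } else {
                Some(cols[4].to_string())
            },
            created_at: cols[5].to_string(),
        });
    }
    Ok(rows)
}
```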
+
+### Session Mappings (`sessions.jsonl`)
+
+Maps native session IDs from connectors to scroll IDs in the archive:
+
+```jsonl
+{"version":1,"connector_uid":"01936e8f-e5a7-7000-8000-000000000002","native_session_id":"abc123","scroll_id":"01936e8f-e5a7-7000-8000-000000000001","created_at":"2025-01-01T12:00:00Z","alias_of":null}
+```
+
+## Message Ordering Guarantees
+
+### The Problem: Append Order ≠ Chronological Order
+
+In the event-driven architecture, messages are written to the message log file (`messages.jsonl`) as completion events arrive. Due to streaming latency:
+
+- User messages complete nearly instantly and are written immediately
+- Assistant messages stream over time and complete later
+- A second user message can be written before the first assistant reply completes
+
+Example scenario:
+```
+T0: User sends "tell me a joke about snakes" (ts=18:23:36.947)
+T1: Assistant starts streaming reply (ts=18:23:36.969)
+T2: User sends "now one about tigers" (ts=18:23:49.429) <- completes and writes BEFORE assistant finishes
+T3: Assistant finishes "snakes" reply <- writes AFTER "tigers" user message
+```
+
+File order in the message log file:
+```
+1. user "snakes" (18:23:36.947)
+2. user "tigers" (18:23:49.429)  <- written second
+3. assistant "snakes" (18:23:36.969)  <- written third, but timestamp is earlier!
+```
+
+### The Solution: Sort-on-Read
+
+The `Archivist::get_messages()` implementation sorts messages before returning:
+
+1. **Primary sort**: `ts` (timestamp) ascending
+2. **Secondary sort**: `message_id` (UUIDv7) ascending for stable tie-breaking
+
+This guarantees chronological order regardless of NDJSON append order:
+```
+1. user "snakes" (18:23:36.947)
+2. assistant "snakes" (18:23:36.969)
+3. user "tigers" (18:23:49.429)
+```
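
The two-key sort can be sketched as follows, using the snakes/tigers scenario as data. `Msg` is a hypothetical stand-in for the real record type: the timestamp is reduced to milliseconds since midnight and the UUIDv7 `message_id` to a small integer, so the example needs no external crates.

```rust
/// Hypothetical, simplified stand-in for a message record.
#[derive(Debug, Clone, PartialEq)]
struct Msg {
    /// Milliseconds since midnight, standing in for the RFC 3339 `ts` field.
    ts_ms: u64,
    /// Stand-in for the UUIDv7 `message_id`.
    message_id: u128,
    label: &'static str,
}

/// Sort by timestamp first, then by message ID to break ties.
fn sort_on_read(mut msgs: Vec<Msg>) -> Vec<Msg> {
    msgs.sort_by_key(|m| (m.ts_ms, m.message_id));
    msgs
}

/// Messages in file (arrival) order, as in the scenario above.
fn arrival_order() -> Vec<Msg> {
    vec![
        // 18:23:36.947
        Msg { ts_ms: 66_216_947, message_id: 1, label: "user snakes" },
        // 18:23:49.429, written before the assistant reply finished
        Msg { ts_ms: 66_229_429, message_id: 3, label: "user tigers" },
        // 18:23:36.969, written last despite the earlier timestamp
        Msg { ts_ms: 66_216_969, message_id: 2, label: "assistant snakes" },
    ]
}
```

Sorting on the tuple `(ts_ms, message_id)` gives the tie-breaking behaviour for free, because tuples compare element by element.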
+
+### Why This Approach?
+
+- **Maintains durability**: Append-only writes preserve crash safety
+- **No migration needed**: Existing archives work without rewrites
+- **Simple implementation**: No buffered writes or complex write-time ordering
+- **Performance trade-off**: Small CPU cost on read (sorting) vs. complex write-time coordination
+
+### Consumer Guidance
+
+- **DO**: Use `Archivist::get_messages()` to retrieve messages
+- **DON'T**: Read the message log file directly and assume file order = chronological order
+- **UI/API**: Always sort by `ts` then `message_id` for defense in depth
+- **Tie-breaking**: Use `message_id` (UUIDv7) as secondary sort for stable ordering when timestamps match
+
+## Key Types
+
+### SessionMetadata
+
+Stores all metadata about a session including:
+- **scroll_id**: UUIDv7 identifier for the session
+- **connector_uid**: Which connector owns this session
+- **native_session_id**: Original session ID from the connector (optional)
+- **title**: Optional human-readable session title (see Title Management below)
+- **parent_scroll_id**: For session lineage (splits, continuations)
+- **continuation**: Type of continuation (SPLIT, COMPACT, REFERENCE, EDIT)
+- **tags**: User-defined categorization
+- **metadata**: Free-form JSON for connector-specific fields
+
+#### Title Management
+
+Session titles are fully supported and persist across restarts. Titles are stored in the `SessionMetadata` struct and saved to the `session.json` file.
+
+**Setting Titles:**
+```rust
+// Update title for an existing session
+archivist.update_session_metadata(
+    scroll_id,
+    Some("My Custom Session Title".to_string()),
+    None, // model
+    None  // archive
+).await?;
+```
+
+**Default Behavior:**
+- New sessions can specify an initial title during registration
+- If no title is provided, sessions default to `None`
+- The UI typically displays "Untitled" for sessions without titles
+
+**Title Loading:**
+- Titles are automatically loaded when retrieving session metadata via `get_session_metadata()`
+- Session lists include titles via `list_sessions()` and `list_sessions_all()`
+- Titles are part of the `SessionMetadata` struct returned by all session queries
+
+**UI Integration:**
+- The web UI displays session titles in the session list and sidebar
+- Users can rename sessions via the "Rename" button in the session list view
+- Renaming calls `api::archivist::rename_session()` which uses `update_session_metadata()`
+- Title changes are persisted immediately and survive application restarts
+
+### MessageRecord
+
+Represents a single message in the archive:
+- **message_id**: UUIDv7 identifier
+- **session**: scroll_id this message belongs to
+- **role**: "user", "assistant", or "system"
+- **content_md**: Message content in Markdown format
+- **attachments**: References to attached files
+- **metadata**: Free-form JSON for connector-specific fields
+
+### ConnectorRecord
+
+Metadata about a connector:
+- **connector_uid**: UUIDv7 identifier
+- **type**: "OpenCode", "ACP", or custom
+- **client_native_id**: Unique identifier from client (e.g., "opencode@http://localhost:12225")
+- **alias_of**: If this connector is an alias of another (for deduplication)
+
+## Archivist Public API
+
+The `Archivist` struct (in `coordinator/`) is the main public entry point
+for archival operations. Consumers hold `Arc<Archivist>` and call inherent
+methods — there is no `Archivist` trait anymore. The coordinator resolves
+the target backend per call (via `archive: Option<String>`) and delegates
+to `ArchiveBackend` methods.
+
+Key method families (see `coordinator/*.rs` for full signatures):
+
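+- **Archive lifecycle** (`archives.rs`): `add_archive`, `remove_archive`,
+  `list_archives`, `set_default_archive` (in Phase 3, `add_archive`,
+  `remove_archive`, and `set_default_archive` return
+  `ArchivistError::DynamicRegistryUnsupported`; see Lifecycle above)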
+- **Connectors** (`connectors.rs`): `register_connector` with tri-state
+  result (Accepted / Aliased / Rejected), `list_connectors`
+- **Sessions** (`sessions.rs`): `register_session`, `get_session_metadata`,
+  `update_session_metadata`, `list_sessions_paged`, `move_session`,
+  `copy_session`, `resolve_session`
+- **Messages**: `append_messages`, `get_messages` (sorts by `ts` then
+  `message_id` for stable chronological order)
+- **Meta / DAG** (`meta.rs`): meta-event recording, session lineage DAG
+  walks, cleanup routines
+
+## List Filter vs. Full-Text Search
+
+Two distinct query paths exist — do not conflate them.
+
+**List filter** — `Archivist::list_sessions_paged(SessionListQuery)` returns a
+cursor-paged list of sessions, AND-filtered by `title_query` (substring on
+title), `tags`, `model_filter` (substring on `metadata.model`), `project_id`,
+`connector_uid`, and `include_hidden`. This is the right tool for "narrow the
+list of visible sessions."
+
+**Full-text search** — `api::search_sessions` (in the `api` package, backed by
+`api::archivist::search_task::SearchTask`) spawns `rg --json` over the
+archive's `.contexts/` tree to find messages containing text. It streams
+`SearchExcerpt`s with parsed NDJSON content and supports cancellation via
+`CancellationToken`. This is the right tool for "find messages containing
+text."
+
+**Do not extend `list_sessions_paged` to do content search.** Content search
+belongs in the ripgrep pipeline. Future improvements to content search
+(indexed backends, relevance scoring) are Phase 2d / Phase 3 concerns.
+
+## JsonlBackend Implementation
+
+The Phase 2 production backend — an implementation of `ArchiveBackend` plus
+every sub-trait except `SearchBackend`:
+
+- **Thread-safe**: Uses RwLock for in-memory caches
+- **Async**: All operations use tokio for non-blocking I/O
+- **Caching**: In-memory caches for connector and session mappings
+- **Collision Detection**: Tri-state registration for connectors and sessions
+
+Located under `src/backends/jsonl/` and split by concern (`backend.rs`,
+`connectors.rs`, `dag.rs`, `mapping.rs`, `meta.rs`).
+
+### Caching Strategy
+
+`JsonlBackend` maintains two in-memory caches:
+
+1. **connector_cache**: HashMap<Uuid, ConnectorRecord>
+   - Populated on registration
+   - Loading from the TSV index on startup is a future enhancement
+
+2. **session_cache**: HashMap<(Uuid, String), Uuid>
+   - Maps (connector_uid, native_session_id) to scroll_id
+   - Populated on registration and session resolution
+   - Enables fast session lookups without disk I/O
+
+## Event Handling
+
+The EventHandler subscribes to dirigent_protocol events and archives them in real-time:
+
+```rust
+// Create archivist and event handler
+let archivist = Archivist::new_with_single_archive(archive_path).await?;
+let handler = EventHandler::new(Arc::new(archivist));
+
+// Subscribe to event stream from dirigent_core
+let events = event_stream.subscribe();
+
+// Run the event loop (the await holds this task for as long as the loop runs)
+handler.run(events).await;
+```
+
+### Supported Events
+
+- **SessionCreated**: Registers new sessions with the archivist
+- **MessageCompleted**: Writes finalized messages to the archive
+- **SessionUpdate**: Accumulates streaming message chunks
+  - AgentMessageChunk
+  - UserMessageChunk
+  - AgentThoughtChunk
+  - ToolCall
+
+### MessageAccumulator
+
+Assembles streaming message chunks into complete messages:
+
+- Accumulates text chunks by message_id
+- Tracks thinking blocks separately
+- Stores tool calls with input/output
+- Finalizes messages on MessageCompleted event
+- Converts to MessageRecord for archival
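
The accumulate-then-finalize flow might be sketched like this. It is a reduced illustration only: `Accumulator`, `Pending`, and `FinalMessage` are hypothetical names, tool calls are left out, and message IDs are plain integers rather than UUIDs.

```rust
use std::collections::HashMap;

/// Partial state for a message that is still streaming.
#[derive(Debug, Default)]
struct Pending {
    text: String,
    thoughts: String,
}

/// Hypothetical finalized form, standing in for `MessageRecord`.
#[derive(Debug, PartialEq)]
struct FinalMessage {
    message_id: u128,
    content_md: String,
    thoughts: String,
}

#[derive(Default)]
struct Accumulator {
    pending: HashMap<u128, Pending>,
}

impl Accumulator {
    /// Append a text chunk to the message it belongs to.
    fn push_text(&mut self, message_id: u128, chunk: &str) {
        self.pending.entry(message_id).or_default().text.push_str(chunk);
    }

    /// Thinking blocks are tracked separately from the visible text.
    fn push_thought(&mut self, message_id: u128, chunk: &str) {
        self.pending.entry(message_id).or_default().thoughts.push_str(chunk);
    }

    /// Called on the completion event; removes and returns the message.
    fn finalize(&mut self, message_id: u128) -> Option<FinalMessage> {
        self.pending.remove(&message_id).map(|p| FinalMessage {
            message_id,
            content_md: p.text,
            thoughts: p.thoughts,
        })
    }
}
```

Keying the pending state by `message_id` lets chunks from several in-flight messages interleave without mixing their content.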
+
+## Integration with dirigent_core
+
+The Archivist integrates with dirigent_core via the global event stream:
+
+1. **CoreRuntime** emits events for all connector operations
+2. **EventHandler** subscribes to event stream
+3. **MessageAccumulator** assembles streaming chunks
+4. **Archivist** writes complete messages to archive
+
+This enables:
+- Automatic archival of all sessions and messages
+- No polling required - fully event-driven
+- Consistent history across restarts
+- Offline access to historical data
+
+## Testing
+
+The package has comprehensive test coverage across multiple dimensions:
+
+### Unit Tests
+
+Located in each module (`src/*.rs`, `src/storage/*.rs`):
+- Type serialization/deserialization
+- UUIDv7 generation and ordering
+- Timestamp formatting (RFC 3339)
+- Storage operations (NDJSON, JSON, TSV, files)
+- Connector registration tri-state logic
+- Session registration and alias detection
+
+### Integration Tests
+
+Located in `tests/`:
+- `integration_tests.rs`: Full `Archivist` + `JsonlBackend` lifecycle, event
+  handler integration, multi-connector scenarios, session lineage, message
+  accumulation
+- `list_sessions_paged_test.rs`, `pagination_test.rs`: List filter + cursor
+  pagination coverage
+- `import_claude_idempotency_test.rs`: Claude export re-import idempotency
+
+### Backend Contract Tests
+
+`src/backend/contract.rs` holds reusable async assertions that any
+`&dyn ArchiveBackend` must pass. `JsonlBackend` and `MockBackend` both
+run the contract suite; new backends added in Phase 3+ should do the same.
+
+### Examples
+
+Located in `examples/`:
+- `basic_usage.rs`: Core archivist operations
+- `event_handling.rs`: EventHandler and MessageAccumulator
+- `file_storage.rs`: Content-addressable file storage
+- `multi_backend.rs`: End-to-end multi-backend registry example
+
+Run tests:
+```bash
+cargo test --package dirigent_archivist
+```
+
+Run examples:
+```bash
+cargo run --package dirigent_archivist --example basic_usage
+cargo run --package dirigent_archivist --example event_handling
+cargo run --package dirigent_archivist --example file_storage
+```
+
+## Performance Characteristics
+
+- **Append Operations**: O(1) with sequential file writes
+- **Session Lookup**: O(1) on an in-memory cache hit, O(n) on a cache miss
+- **Message Retrieval**: O(n log n) where n = number of messages (O(n) NDJSON parsing plus the sort-on-read)
+- **File Storage**: O(1) content-addressable lookup by SHA-256 hash (computing the hash is linear in file size)
+- **Connector Index**: O(n) TSV scan, suitable for hundreds of connectors
+
+### Scalability Considerations
+
+- **Large Sessions**: NDJSON logs carry no per-line index, so reading large sessions requires parsing all lines
+- **Many Sessions**: TSV indices are suitable for thousands of sessions per connector
+- **File Deduplication**: SHA-256 hashing provides automatic deduplication across sessions
+- **Concurrent Access**: RwLock allows multiple concurrent readers, single writer
+
+## Error Handling
+
+The Archivist uses thiserror for rich error types:
+
+```rust
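+// Abridged sketch: the thiserror derive and `#[error(...)]` messages are omitted.
+pub enum ArchivistError {
+    IoError(std::io::Error),
+    SerdeError(serde_json::Error),
+    SessionUnknown(Uuid),
+    CollisionInconsistent(Uuid),
+    // ... etc. Other sections of this document also name
+    // DeleteOnReadOnlyBackend, PartialMove { copied_to, delete_error },
+    // and DynamicRegistryUnsupported.
+}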
+```
+
+All public APIs return `Result<T, ArchivistError>` for explicit error handling.
+
+## Development Notes
+
+- All storage operations are async (using tokio)
+- Content-addressable storage uses SHA-256 hashes (hex-encoded)
+- Archive directory structure mirrors session/message hierarchy
+- UUIDv7 provides time-ordered, sortable identifiers
+- RFC 3339 UTC timestamps for all time-based fields
+- Schema versioning via `version` field in all records
+
+## Related Packages
+
+- **dirigent_protocol**: Shared types and protocol definitions (dependency)
+- **dirigent_core**: Runtime integration for SSE event capture (integration point)
+- **api**: Server functions for archive queries (e.g. `search_sessions`, `archivist::rename_session`)
+- **web**: UI for archive browsing and search
+
+## Phase 4: `ArchiveFilter` (2026-04-21)
+
+Every `ArchiveRegistration` carries a `filter: ArchiveFilter`. The filter
+describes which sessions/writes the backend wants to receive. Fields:
+
+- `include_connectors: Option<HashSet<Uuid>>` — if Some, only these
+  connector UIDs pass. `None` means no connector gate.
+- `exclude_connectors: HashSet<Uuid>` — always rejected.
+- `include_tags: HashSet<String>` — if non-empty, the session must carry
+  at least one matching tag.
+- `exclude_tags: HashSet<String>` — any matching tag rejects.
+- `include_hidden: bool` — default `true`. If `false`, sessions whose
+  metadata has `"hidden": true` are skipped.
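
One way the five fields could combine into a single accept/reject decision is sketched below. The field names come from the list above. Everything else is an assumption: connector UIDs are `u128` instead of `Uuid`, the `accepts` method and its signature are hypothetical, and exclusions are evaluated before inclusions.

```rust
use std::collections::HashSet;

struct ArchiveFilter {
    include_connectors: Option<HashSet<u128>>,
    exclude_connectors: HashSet<u128>,
    include_tags: HashSet<String>,
    exclude_tags: HashSet<String>,
    include_hidden: bool,
}

impl Default for ArchiveFilter {
    /// Unrestricted: no connector or tag gates, hidden sessions included.
    fn default() -> Self {
        ArchiveFilter {
            include_connectors: None,
            exclude_connectors: HashSet::new(),
            include_tags: HashSet::new(),
            exclude_tags: HashSet::new(),
            include_hidden: true,
        }
    }
}

impl ArchiveFilter {
    /// Hypothetical check: does a session with these properties pass?
    fn accepts(&self, connector: u128, tags: &[&str], hidden: bool) -> bool {
        if self.exclude_connectors.contains(&connector) {
            return false;
        }
        if let Some(allowed) = &self.include_connectors {
            if !allowed.contains(&connector) {
                return false;
            }
        }
        if tags.iter().any(|t| self.exclude_tags.contains(*t)) {
            return false;
        }
        if !self.include_tags.is_empty()
            && !tags.iter().any(|t| self.include_tags.contains(*t))
        {
            return false;
        }
        if hidden && !self.include_hidden {
            return false;
        }
        true
    }
}
```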
+
+### Primary-always-writes invariant
+
+The per-call primary (either the `archive: Some(name)` argument or the
+default write-target) is **never** filtered. If a caller explicitly asks
+to write to archive X, the filter on X is not consulted. Filters only
+gate secondary fanout.
+
+### Boot validator
+
+At boot (`coordinator/boot.rs`), the validator rejects configurations
+where:
+
+- No write-active + enabled registration has an **unrestricted** filter
+  (`ArchiveFilter::default()` is unrestricted). Prevents configurations
+  that silently drop all writes.
+- An archive's filter has `include_connectors = Some(empty set)` —
+  equivalent to "reject everything", which is almost certainly a config
+  bug.
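
The two boot checks can be expressed as a short validation pass. This is a hedged sketch rather than the code in `coordinator/boot.rs`: `Registration`, `validate`, `unrestricted`, and `is_unrestricted` are hypothetical names, connector UIDs are `u128`, and errors are plain strings.

```rust
use std::collections::HashSet;

/// Simplified filter with the fields listed in this section.
struct ArchiveFilter {
    include_connectors: Option<HashSet<u128>>,
    exclude_connectors: HashSet<u128>,
    include_tags: HashSet<String>,
    exclude_tags: HashSet<String>,
    include_hidden: bool,
}

impl ArchiveFilter {
    /// Mirrors the documented default: no gates, hidden sessions included.
    fn unrestricted() -> Self {
        ArchiveFilter {
            include_connectors: None,
            exclude_connectors: HashSet::new(),
            include_tags: HashSet::new(),
            exclude_tags: HashSet::new(),
            include_hidden: true,
        }
    }

    fn is_unrestricted(&self) -> bool {
        self.include_connectors.is_none()
            && self.exclude_connectors.is_empty()
            && self.include_tags.is_empty()
            && self.exclude_tags.is_empty()
            && self.include_hidden
    }
}

struct Registration {
    name: &'static str,
    enabled: bool,
    write_active: bool,
    filter: ArchiveFilter,
}

fn validate(regs: &[Registration]) -> Result<(), String> {
    // Check 2: an empty include set rejects everything.
    for r in regs {
        if let Some(set) = &r.filter.include_connectors {
            if set.is_empty() {
                return Err(format!("archive `{}`: include_connectors is empty", r.name));
            }
        }
    }
    // Check 1: at least one catch-all archive must accept writes.
    let has_catch_all = regs
        .iter()
        .any(|r| r.enabled && r.write_active && r.filter.is_unrestricted());
    if has_catch_all {
        Ok(())
    } else {
        Err("no enabled, write-active archive has an unrestricted filter".to_string())
    }
}
```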
+
+See `docs/plans/2026-04-21-archivist-phase4-design.md` §4 for the full
+design rationale.
+
+## Phase 5: Importers (2026-04-21)
+
+The `import::` module centres on an `Importer` trait with per-source
+implementations under `import::sources::*`. Each source produces a
+`ParsedConversation` (ChatGPT) / `ParsedSession` (Codex) / session
+directory walk (Claude) and feeds the results through the common
+`import_sessions` orchestrator, which fires `ImportProgressEvent`s on a
+bounded `ImportProgressSink`.
+
+### `Importer` trait
+
+Every importer declares a `config_shape()` so UIs can render a dynamic
+form; a `discover()` that returns an `ImportDiscovery` preview; and an
+`import()` that does the actual work. All three methods are async.
+
+The trait lives in `import::trait_def`. Shape types (`ImportConfig`,
+`ImportTarget`, `ConfigField`, `ConfigFieldKind`, `ImportError`) are
+serialisable and safe to cross the WASM boundary.
+
+### Registry
+
+`ImporterRegistry::with_defaults()` registers every enabled
+`importer-*` feature. Currently: `claude`, `chatgpt`, `codex`. The
+registry is constructed at boot and stored on `AppState`.
+
+### Progress sink
+
+`ImportProgressSink::channel()` returns a bounded mpsc pair.
+Non-terminal events use `try_send` (dropped on full); terminal events
+use `send().await` so consumers always see the final state.
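
The drop-on-full versus always-deliver split can be illustrated with the standard library's bounded channel. This is an analogue under assumptions, not the async implementation: `std::sync::mpsc::sync_channel` stands in for the async mpsc, a blocking `send` stands in for `send().await`, `channel` takes an explicit capacity, and the event variants and the `emit` method are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical event shape; the real variants are not listed in this document.
#[derive(Debug, PartialEq)]
enum ImportProgressEvent {
    Progress { done: usize, total: usize },
    Finished { imported: usize },
}

impl ImportProgressEvent {
    fn is_terminal(&self) -> bool {
        matches!(self, ImportProgressEvent::Finished { .. })
    }
}

struct ImportProgressSink {
    tx: SyncSender<ImportProgressEvent>,
}

impl ImportProgressSink {
    /// Bounded channel pair; the capacity argument is an assumption.
    fn channel(capacity: usize) -> (Self, Receiver<ImportProgressEvent>) {
        let (tx, rx) = sync_channel(capacity);
        (ImportProgressSink { tx }, rx)
    }

    /// Returns false if the event was dropped or the receiver is gone.
    fn emit(&self, event: ImportProgressEvent) -> bool {
        if event.is_terminal() {
            // Waits for room, so the final state is never dropped.
            self.tx.send(event).is_ok()
        } else {
            // Non-terminal events are dropped when the buffer is full.
            self.tx.try_send(event).is_ok()
        }
    }
}
```

Dropping intermediate progress keeps a slow consumer from stalling the import, while the terminal event still waits for a free slot.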
+
+### Source crates
+
+- `dirigent_chatgpt` — parses `conversations.json` from the OpenAI data
+  export.
+- `dirigent_codex` — parses `*.jsonl` session files under
+  `~/.codex/sessions`.
+
+Both crates hold pure parser types with zero dirigent-specific types.
+
+See `docs/plans/2026-04-21-archivist-phase5-design.md`.
+
+## Future Enhancements
+
+- Indexed `SearchBackend` implementations (tantivy/sqlite) — currently
+  content search is ripgrep-based in the `api` package
+- Session splitting and lineage management (mutations.ndjson)
+- Knowledge overview generation (chat.md exports)
+- Embedding storage and search (embeds/)
+- Network RPC interface for remote archivist
+- Compaction and pruning policies
+- Additional concrete backends (e.g. SQLite, remote)
+
+## Documentation
+
+- **Package README**: `./README.md` - User-facing overview
+- **Architecture Docs**: `../../docs/building/05_archivist/` - Design and planning
+- **API Docs**: Run `cargo doc --package dirigent_archivist --open`
+- **Examples**: See `examples/` directory for working code samples