g4borg b03dc15371 sync from monorepo @ 2452e92e

2026-05-08 01:59:04 +02:00

Package: dirigent_archivist

Persistent storage for all agentic interactions in Dirigent.

Quick Facts

Type: Library
Main Entry: src/lib.rs
Dependencies: dirigent_protocol, uuid, chrono, serde, tokio, tracing, thiserror, sha2, hex, async-trait
Status: Complete - Production ready with comprehensive tests

Purpose

The Archivist provides file-based archival storage for all session data, messages, and attachments in Dirigent. It implements an archive-first architecture with connector API fallback, using NDJSON, JSON, and TSV formats for durability and human-readability.

Key Features

File-based Storage: NDJSON for messages, JSON for metadata, TSV for indices
Content-Addressable Files: SHA-256 based storage for attachments with automatic deduplication
Session Lineage: Track splits, continuations, and mutations with parent references
Connector Registry: Coordinate UID assignment across connectors with collision detection
Event Streaming: Real-time updates via EventHandler subscribing to dirigent_protocol events
Archive-First Design: Read from archive first, fall back to connector API when needed
Caching: In-memory caching of connector and session mappings for performance

Architecture

The Archivist is built on three core architectural principles:

1. Archive-First Read Strategy

The Archivist is the primary source of truth for historical data:

UI and APIs query the archive first
Only fall back to connector APIs if data is not in archive
This enables offline access and consistent history across restarts
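
The read strategy above can be sketched as a lookup that touches the connector only on an archive miss. This is a minimal illustration, not the crate's code: the `HashMap` stands in for the archive, and `read_archive_first` and the `connector_fallback` closure are hypothetical names.

```rust
use std::collections::HashMap;

/// Archive-first lookup: return the archived value when present and call the
/// connector fallback only on a miss. Both sides are stand-ins for illustration.
pub fn read_archive_first<F>(
    archive: &HashMap<String, String>,
    session_id: &str,
    connector_fallback: F,
) -> Option<String>
where
    F: FnOnce(&str) -> Option<String>,
{
    match archive.get(session_id) {
        // Archive hit: the connector is never contacted.
        Some(found) => Some(found.clone()),
        // Archive miss: ask the connector API.
        None => connector_fallback(session_id),
    }
}
```

Because the fallback is a closure, the connector call is lazy: it costs nothing when the archive already has the data.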

2. Write-Through Event Capture (Append-Only)

The EventHandler subscribes to the global event stream from dirigent_core:

Captures session creation, message streaming, and tool calls in real-time
Uses MessageAccumulator to assemble streaming chunks into complete messages
Writes complete messages to archive immediately upon finalization
No polling required - fully event-driven
Append-only writes: Messages are appended as events arrive, NOT in chronological order
File order reflects event timing, not message timestamps

3. File-Based Storage with Sort-on-Read

All data is stored in human-readable, grep-able formats:

NDJSON (Newline-Delimited JSON): Incremental append-only logs for messages and mappings
JSON: Structured metadata for sessions and connectors
TSV (Tab-Separated Values): Fast indices for cross-references
Content-Addressed Files: Binary attachments stored by SHA-256 hash for deduplication
Sort-on-Read: get_messages() sorts by timestamp and message_id to ensure chronological order despite append-only writes

Backend Trait Layer (Phase 2)

The archivist uses a trait-based backend abstraction. ArchiveBackend defines the mandatory session and message primitives every backend must provide, plus as_xxx() accessors returning optional sub-traits:

SearchBackend — reserved for Phase 3+ indexed backends (not wired)
DagBackend — session lineage DAG edges
MetaEventsBackend — ACP connection lifecycle events
ConnectorRegistryBackend — per-archive connector metadata
SessionMappingBackend — native↔scroll session ID mapping

JsonlBackend is the Phase 2 concrete implementation (file-based NDJSON/JSON/TSV) and opts into every sub-trait except SearchBackend (content search continues to be served by ripgrep via crates/api/src/archivist/search_task.rs).

The Archivist struct (in src/coordinator/) owns a registry of backends keyed by archive name and performs orchestration (alias detection, session lineage, move/copy, DAG walks, archive lifecycle). Consumers hold Arc<Archivist> directly — the coordinator is concrete, not a trait.

See docs/plans/2026-04-18-archivist-phase2-design.md for design rationale.

Multi-Backend Registry (Phase 3)

The coordinator (Archivist) holds Vec<Arc<ArchiveRegistration>> sorted by read_priority instead of a flat HashMap<name, Arc<dyn ArchiveBackend>>. Each registration carries:

backend: Arc<dyn ArchiveBackend> + its declared capabilities
failure_mode: Required (must succeed) | BestEffort (errors log + drift health)
read_priority: lower = tried first for reads; also selects the default write target when no archive is named
write_active: participates in fanout writes
enabled: kill-switch without removing config
write_policy: Inline (default; await per call) or Queued (mpsc + batch_window + overflow policy)
Runtime state: last_health, last_error, consecutive_failures (all Arc<RwLock<_>>, shared with the writer task when queued)
Optional writer: Option<WriterHandle> (Some iff write_policy = Queued)

Backends are declared in dirigent.toml under [[archives]] and constructed at boot via Archivist::from_config(cfg, &BackendRegistry). Add a new backend type by implementing BackendFactory and registering it on the BackendRegistry before from_config.

Reads

get_session, get_messages_paged, count_messages, get_meta_events, get_children, etc. walk the registry in priority order via read_walk_per_session(scroll_id, predicate, op). The predicate capability-filters; Unavailable backends are skipped. The first backend that returns Some(value) wins and its name is cached against the scroll_id in a positive LRU (capacity 10_000). Subsequent reads for the same scroll_id short-circuit to the cached backend before falling back to the full priority walk.
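
A rough sketch of the priority walk with a positive cache follows. It is a simplification under stated assumptions: a plain `HashMap` replaces the bounded LRU, the capability predicate is omitted, and `Backend`, `Coordinator`, and `get_title` are hypothetical names rather than the crate's types.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    Degraded,
    Unavailable,
}

/// Hypothetical in-memory backend: scroll_id -> session title.
pub struct Backend {
    pub name: &'static str,
    pub health: Health,
    pub sessions: HashMap<String, String>,
}

pub struct Coordinator {
    /// Kept sorted by read priority (index 0 is tried first).
    pub backends: Vec<Backend>,
    /// Positive cache: scroll_id -> name of the backend that answered last.
    /// (Unbounded here; the text describes an LRU with a fixed capacity.)
    pub cache: HashMap<String, &'static str>,
}

impl Coordinator {
    pub fn get_title(&mut self, scroll_id: &str) -> Option<(String, &'static str)> {
        // Fast path: try the backend that answered for this scroll_id before.
        if let Some(&cached) = self.cache.get(scroll_id) {
            let hit = self
                .backends
                .iter()
                .find(|b| b.name == cached && b.health != Health::Unavailable)
                .and_then(|b| b.sessions.get(scroll_id));
            if let Some(title) = hit {
                return Some((title.clone(), cached));
            }
        }
        // Slow path: full walk in priority order, skipping Unavailable backends.
        for backend in &self.backends {
            if backend.health == Health::Unavailable {
                continue;
            }
            if let Some(title) = backend.sessions.get(scroll_id) {
                let found = (title.clone(), backend.name);
                self.cache.insert(scroll_id.to_string(), backend.name);
                return Some(found);
            }
        }
        None
    }
}
```

Only hits are cached, so a session that appears later in a higher-priority backend is still found once the cached backend stops answering.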

Collection-shape reads (list_sessions_paged, list_connectors, list_meta_sessions, find_meta_session_by_client) use read_walk_collection — first enabled backend that can answer wins, no cache, no aggregation across backends. Phase 3 explicitly defers cross-backend merge/dedup to a later phase.

Writes

Mutating methods (append_messages, register_session, update_session_*, append_meta_events, append_dag_edge, clear_session_messages, update_connector_fingerprint) resolve a primary (per-call archive: Some(name) override or the default-write target) and fan out to every other enabled && write_active backend that has the required capability. Capability-mismatched backends are skipped with a debug capability_skip log (never an error). Required failures propagate to the caller; BestEffort failures log + drift health.
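
The fanout rules can be illustrated with a small synchronous model. Everything here is hypothetical (`Target`, `FanoutReport`, `fan_out`, and the simulated `fails` flag); it treats the primary and the secondaries uniformly and assumes a Required failure aborts the remaining fanout, which the text does not specify.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum FailureMode {
    Required,
    BestEffort,
}

/// Hypothetical view of one registration at write time.
pub struct Target {
    pub name: &'static str,
    pub enabled: bool,
    pub write_active: bool,
    pub has_capability: bool,
    pub failure_mode: FailureMode,
    /// Simulated outcome of the backend write.
    pub fails: bool,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct FanoutReport {
    pub written: Vec<&'static str>,
    pub skipped: Vec<&'static str>,
    pub degraded: Vec<&'static str>,
}

pub fn fan_out(targets: &[Target]) -> Result<FanoutReport, String> {
    let mut report = FanoutReport::default();
    for t in targets.iter().filter(|t| t.enabled && t.write_active) {
        if !t.has_capability {
            // Capability mismatch is a skip, never an error.
            report.skipped.push(t.name);
            continue;
        }
        if !t.fails {
            report.written.push(t.name);
            continue;
        }
        match t.failure_mode {
            // Required failures propagate to the caller.
            FailureMode::Required => {
                return Err(format!("required backend '{}' failed", t.name));
            }
            // BestEffort failures are recorded and the fanout continues.
            FailureMode::BestEffort => report.degraded.push(t.name),
        }
    }
    Ok(report)
}
```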

register_connector currently does NOT fan out — alias detection + the tri-state Accepted/Aliased/Rejected return shape make replication non-trivial. Fanout for connectors is deferred; single-backend setups are unaffected.

For write_policy = Queued backends, the primary/secondary write paths enqueue a WriteOp into the backend's writer task instead of awaiting. Errors drift the backend's health but do not propagate to the caller. Coalescing merges consecutive AppendMessages/AppendMetaEvents for the same scroll_id within batch_window_ms.
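
Coalescing can be pictured as a single pass over the operations drained within one batch window. The sketch below is an assumption-laden illustration: this `WriteOp` is a reduced stand-in with only two variants, messages are plain strings, and `coalesce` is a hypothetical helper rather than the writer task's code.

```rust
/// Reduced stand-in for the queued write operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum WriteOp {
    AppendMessages { scroll_id: String, messages: Vec<String> },
    RegisterSession { scroll_id: String },
}

/// Merge consecutive AppendMessages ops that target the same scroll_id.
/// The input is assumed to be the ops drained within one batch window.
pub fn coalesce(ops: Vec<WriteOp>) -> Vec<WriteOp> {
    let mut out: Vec<WriteOp> = Vec::new();
    for op in ops {
        if let WriteOp::AppendMessages { scroll_id, messages } = &op {
            if let Some(WriteOp::AppendMessages { scroll_id: prev_id, messages: prev }) =
                out.last_mut()
            {
                if *prev_id == *scroll_id {
                    // Same session as the previous op: extend it in place.
                    prev.extend(messages.iter().cloned());
                    continue;
                }
            }
        }
        // Different session or different op kind: keep as a separate op.
        out.push(op);
    }
    out
}
```

Only adjacent operations are merged, so the relative order of writes to a session is preserved.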

Cross-backend operations

delete_session(scroll_id, _) fans out to every enabled backend that has the session. Copies in write_active=false backends produce ArchivistError::DeleteOnReadOnlyBackend (write-active copies are still deleted); the cache entry is invalidated regardless of outcome.
copy_session(scroll_id, from, to) reads from from, writes to to, including DAG and meta-events when both sides have the capability. The source remains canonical (the cache is NOT rewritten).
move_session(scroll_id, from, to) is copy + delete-from-source. If the source-side delete fails after the copy succeeded, ArchivistError::PartialMove { copied_to, delete_error } is returned so the caller knows the session now lives in both places.
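
The copy-then-delete shape of a move, including the partial-failure case, can be sketched with in-memory stores. `Store` and `MoveError` are hypothetical stand-ins (the real error is `ArchivistError::PartialMove`), and the `read_only` flag simulates a failing source-side delete.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum MoveError {
    NotFound,
    /// The copy succeeded but the source-side delete failed.
    PartialMove { copied_to: String, delete_error: String },
}

/// Hypothetical in-memory backend: scroll_id -> payload.
pub struct Store {
    pub name: String,
    pub sessions: HashMap<String, String>,
    pub read_only: bool,
}

impl Store {
    fn delete(&mut self, scroll_id: &str) -> Result<(), String> {
        if self.read_only {
            return Err(format!("'{}' is read-only", self.name));
        }
        self.sessions.remove(scroll_id);
        Ok(())
    }
}

/// Move = copy to the destination, then delete from the source.
pub fn move_session(scroll_id: &str, from: &mut Store, to: &mut Store) -> Result<(), MoveError> {
    let payload = from
        .sessions
        .get(scroll_id)
        .cloned()
        .ok_or(MoveError::NotFound)?;
    to.sessions.insert(scroll_id.to_string(), payload);
    // If the delete fails, report where the copy landed so the caller
    // knows the session now exists in both stores.
    from.delete(scroll_id).map_err(|delete_error| MoveError::PartialMove {
        copied_to: to.name.clone(),
        delete_error,
    })
}
```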

The Phase 2 connector-aware move_session(scroll_id, target_connector_uid, _) and copy_session(scroll_id, target_connector_uid, _) survived the Phase 3 rename as move_session_to_connector / copy_session_to_connector. Their bulk variant is move_sessions_to_connector.

Health

HealthStatus drifts on every coordinator call that observes a backend:

Successful write → Healthy; consecutive_failures reset to 0.
Successful read → Healthy (only rescues Degraded; does not reset the counter).
Write failure → Degraded { reason }; consecutive_failures += 1; after K = 5 consecutive failures drifts to Unavailable { reason }. Reads skip Unavailable backends; writes against an Unavailable Required backend fail, while writes against an Unavailable BestEffort backend are still attempted.
Read failure alone never drifts past Degraded; writes are the authoritative health signal.
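
The drift rules read naturally as a small state machine. The sketch below is an interpretation, not the crate's code: `BackendHealth` and its method names are hypothetical, and it assumes the transition to Unavailable happens on the fifth consecutive write failure.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HealthStatus {
    Healthy,
    Degraded { reason: String },
    Unavailable { reason: String },
}

/// K from the text: consecutive write failures before Unavailable.
pub const UNAVAILABLE_AFTER: u32 = 5;

pub struct BackendHealth {
    pub status: HealthStatus,
    pub consecutive_failures: u32,
}

impl BackendHealth {
    pub fn new() -> Self {
        Self { status: HealthStatus::Healthy, consecutive_failures: 0 }
    }

    /// Successful write: fully healthy, counter reset.
    pub fn on_write_ok(&mut self) {
        self.status = HealthStatus::Healthy;
        self.consecutive_failures = 0;
    }

    /// Successful read: rescues Degraded only, counter untouched.
    pub fn on_read_ok(&mut self) {
        if matches!(self.status, HealthStatus::Degraded { .. }) {
            self.status = HealthStatus::Healthy;
        }
    }

    /// Write failure: count it and drift to Degraded or Unavailable.
    pub fn on_write_err(&mut self, reason: &str) {
        self.consecutive_failures += 1;
        let reason = reason.to_string();
        self.status = if self.consecutive_failures >= UNAVAILABLE_AFTER {
            HealthStatus::Unavailable { reason }
        } else {
            HealthStatus::Degraded { reason }
        };
    }

    /// Read failure: never drifts past Degraded.
    pub fn on_read_err(&mut self, reason: &str) {
        if self.status == HealthStatus::Healthy {
            self.status = HealthStatus::Degraded { reason: reason.to_string() };
        }
    }
}
```

Note that a successful read can hide a degraded state without clearing the failure counter, so one more write failure after four earlier ones still reaches Unavailable.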

list_archives_with_health() returns a Vec<ArchiveStatus> snapshot of every registration: name, type, capabilities, health, last_error, and queue_depth (for queued backends).

Lifecycle

Phase 3 is startup-only. add_archive / remove_archive / set_default_archive on the coordinator return ArchivistError::DynamicRegistryUnsupported. To change the registry, edit dirigent.toml and restart the server. Archivist::shutdown() drains queued writer tasks (sends WriteOp::Shutdown on each writer's mpsc and awaits ack); call it before process exit.

Test-only constructors Archivist::from_registrations(regs) and SessionMetadata::stub(scroll_id) live under #[cfg(any(test, feature = "test-utils"))] for integration tests that bypass the factory.

See docs/plans/2026-04-19-archivist-phase3-design.md for the full design rationale, and examples/multi_backend.rs for a runnable end-to-end example.

Module Organization

Core Modules

lib.rs: Public API surface and re-exports
types.rs: Core data structures (session metadata, message records, connector info, API types)
error.rs: Error types and Result alias for archivist operations

Backend Layer (`backend/`)

traits.rs: ArchiveBackend trait + 5 optional sub-traits
capability.rs: ArchiveCapability enum + CapabilitySet type
health.rs: HealthStatus enum returned by health_check
contract.rs: Reusable behavioral tests for any &dyn ArchiveBackend (cfg-gated)
mock.rs: In-memory MockBackend for coordinator unit tests (cfg-gated)

Concrete Backends (`backends/`)

jsonl/: The file-based JsonlBackend — the only Phase 2 backend. Reuses storage/ primitives for NDJSON/JSON/TSV operations.

Coordinator (`coordinator/`)

mod.rs: The Archivist struct + constructors
archives.rs: Archive lifecycle (add/remove/list/default)
connectors.rs: Connector registration + alias detection
sessions.rs: Session registration, metadata updates, move/copy
meta.rs: Meta events, DAG walks, cleanup

Storage Layer (`storage/`)

Low-level file I/O primitives used by JsonlBackend. All storage operations are async and use tokio.

paths.rs: ArchivePaths utility for consistent directory structure and path resolution
ndjson.rs: Newline-delimited JSON operations (read_ndjson, append_ndjson)
json.rs: JSON operations (read_json, write_json)
tsv.rs: Tab-separated value operations for connector index
files.rs: Content-addressable file storage with SHA-256 hashing and deduplication

Supporting Modules

registry.rs: Archive registry persistence (multi-archive metadata)
migration.rs: Single-archive → multi-archive migration path
session.rs: Session lineage types shared across layers
accumulator.rs: MessageAccumulator for assembling streaming message chunks
backfill.rs: Backfill helpers for importing historical sessions
import/: External conversation importers (e.g. Claude export)

Events

events.rs: EventHandler for subscribing to dirigent_protocol events and archiving them

Configuration

The Archivist archive root is determined by DirigentPaths resolution:

Set DIRIGENT_DATA_DIR to override the data directory; archives will be stored at <data_dir>/archives/
Defaults to ~/.local/share/dirigent/archives/ (or platform equivalent)

DIRIGENT_DATA_DIR=/path/to/data dx serve

Archive Structure

dirigent_archive/
├── .contexts/
│   └── {scroll_id:uuidv7}/          # One directory per session
│       ├── session.json             # Session metadata
│       ├── messages.jsonl           # Incremental message log (.ndjson also supported)
│       └── lineage.json             # Session lineage info (optional)
├── .db/
│   └── connectors/
│       ├── index.tsv                # Fast connector lookup (TSV)
│       └── {connector_uid}/
│           ├── connector.json       # Connector metadata
│           └── sessions.jsonl       # Session mappings (.ndjson also supported)
└── .files/
    └── {sha256-hash}                # Content-addressable file storage
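
As a rough illustration of how this layout maps to paths, the following helper builds the locations shown in the tree. `ArchiveLayout` is a hypothetical type written for this example; the crate's own `ArchivePaths` utility in storage/paths.rs may expose a different API.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical path helper mirroring the directory tree above.
pub struct ArchiveLayout {
    root: PathBuf,
}

impl ArchiveLayout {
    pub fn new(root: &Path) -> Self {
        Self { root: root.to_path_buf() }
    }

    /// .contexts/{scroll_id}/
    pub fn session_dir(&self, scroll_id: &str) -> PathBuf {
        self.root.join(".contexts").join(scroll_id)
    }

    /// .contexts/{scroll_id}/messages.jsonl
    pub fn messages_log(&self, scroll_id: &str) -> PathBuf {
        self.session_dir(scroll_id).join("messages.jsonl")
    }

    /// .db/connectors/index.tsv
    pub fn connector_index(&self) -> PathBuf {
        self.root.join(".db").join("connectors").join("index.tsv")
    }

    /// .files/{sha256-hash}
    pub fn blob(&self, sha256_hex: &str) -> PathBuf {
        self.root.join(".files").join(sha256_hex)
    }
}
```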

Why Hidden Directories?

The .contexts, .db, and .files directories are hidden (prefixed with .) to keep the archive root clean for future rendered outputs (like chat.md exports). This is similar to how .git hides implementation details in a codebase.

File Formats

Session Metadata (`session.json`)

{
  "version": 1,
  "scroll_id": "01936e8f-e5a7-7000-8000-000000000001",
  "created_at": "2025-01-01T12:00:00Z",
  "updated_at": "2025-01-01T12:30:00Z",
  "title": "Implement user authentication",
  "connector_uid": "01936e8f-e5a7-7000-8000-000000000002",
  "native_session_id": "abc123",
  "agent_id": null,
  "parent_scroll_id": null,
  "continuation": null,
  "tags": ["backend", "auth"],
  "metadata": {
    "source": "OpenCode",
    "model": "claude-3-5-sonnet"
  }
}

Messages Log (`messages.jsonl`)

One JSON object per line, append-only:

{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000003","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":null,"ts":"2025-01-01T12:01:00Z","role":"user","author":"alice","content_md":"How do I implement JWT auth?","attachments":[],"metadata":{}}
{"version":1,"message_id":"01936e8f-e5a7-7000-8000-000000000004","session":"01936e8f-e5a7-7000-8000-000000000001","parent_id":"01936e8f-e5a7-7000-8000-000000000003","ts":"2025-01-01T12:01:10Z","role":"assistant","author":"claude","content_md":"Here's how to implement JWT authentication...","attachments":[],"metadata":{"model":"claude-3-5-sonnet"}}

IMPORTANT - Ordering: The order of lines in the message log file (messages.jsonl or messages.ndjson) reflects event arrival order, NOT chronological order. Assistant replies often arrive after subsequent user messages due to streaming latency, resulting in non-chronological file order. Always use the Archivist::get_messages() API to retrieve messages, which sorts by ts (timestamp) and message_id (UUIDv7) to guarantee chronological order.

File Format Compatibility: The archivist supports both .ndjson and .jsonl file extensions for newline-delimited JSON files. When reading, .jsonl is preferred if present, with automatic fallback to .ndjson for backward compatibility. Write operations use .jsonl (canonical format). Both formats are identical in content - the difference is purely the file extension.
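
The extension preference can be sketched as a small resolver. `resolve_log_path` is a hypothetical function, and the `exists` predicate is injected so the logic can be exercised without touching the filesystem; in practice it would likely be a check such as `Path::exists`.

```rust
use std::path::{Path, PathBuf};

/// Pick the log file for reading: prefer `.jsonl`, fall back to `.ndjson`.
/// When neither exists, return the canonical `.jsonl` path, which is also
/// the path new writes would use.
pub fn resolve_log_path(dir: &Path, stem: &str, exists: impl Fn(&Path) -> bool) -> PathBuf {
    let jsonl = dir.join(format!("{stem}.jsonl"));
    if exists(&jsonl) {
        return jsonl;
    }
    let ndjson = dir.join(format!("{stem}.ndjson"));
    if exists(&ndjson) {
        return ndjson;
    }
    jsonl
}
```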

Connector Index (`index.tsv`)

Tab-separated values with header row:

connector_uid	type	title	client_native_id	alias_of	created_at
01936e8f-e5a7-7000-8000-000000000002	OpenCode	OpenCode Local	opencode@http://localhost:12225		2025-01-01T12:00:00Z
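
A minimal parser for one data row of this index might look as follows. `ConnectorRow` and `parse_index_row` are hypothetical names; the sketch keeps every column as a string, maps an empty alias_of column to `None`, and does not handle the header row or escaped tabs.

```rust
/// One row of index.tsv, with all columns kept as strings for simplicity.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConnectorRow {
    pub connector_uid: String,
    pub kind: String, // the `type` column (`type` is a Rust keyword)
    pub title: String,
    pub client_native_id: String,
    pub alias_of: Option<String>,
    pub created_at: String,
}

/// Parse a single tab-separated data row; returns None on a column mismatch.
pub fn parse_index_row(line: &str) -> Option<ConnectorRow> {
    let trimmed = line.trim_end_matches(|c| c == '\r' || c == '\n');
    let fields: Vec<&str> = trimmed.split('\t').collect();
    if fields.len() != 6 {
        return None;
    }
    Some(ConnectorRow {
        connector_uid: fields[0].to_string(),
        kind: fields[1].to_string(),
        title: fields[2].to_string(),
        client_native_id: fields[3].to_string(),
        // An empty column means the connector is not an alias.
        alias_of: if fields[4].is_empty() { None } else { Some(fields[4].to_string()) },
        created_at: fields[5].to_string(),
    })
}
```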

Session Mappings (`sessions.jsonl`)

Maps native session IDs from connectors to scroll IDs in the archive:

{"version":1,"connector_uid":"01936e8f-e5a7-7000-8000-000000000002","native_session_id":"abc123","scroll_id":"01936e8f-e5a7-7000-8000-000000000001","created_at":"2025-01-01T12:00:00Z","alias_of":null}

Message Ordering Guarantees

The Problem: Append Order ≠ Chronological Order

In the event-driven architecture, messages are written to the message log file (messages.jsonl) as completion events arrive. Due to streaming latency:

User messages complete nearly instantly and are written immediately
Assistant messages stream over time and complete later
A second user message can be written before the first assistant reply completes

Example scenario:

T0: User sends "tell me a joke about snakes" (ts=18:23:36.947)
T1: Assistant starts streaming reply (ts=18:23:36.969)
T2: User sends "now one about tigers" (ts=18:23:49.429) <- completes and writes BEFORE assistant finishes
T3: Assistant finishes "snakes" reply <- writes AFTER "tigers" user message

File order in the message log file:

1. user "snakes" (18:23:36.947)
2. user "tigers" (18:23:49.429)  <- written second
3. assistant "snakes" (18:23:36.969)  <- written third, but timestamp is earlier!

The Solution: Sort-on-Read

The Archivist::get_messages() implementation sorts messages before returning:

Primary sort: ts (timestamp) ascending
Secondary sort: message_id (UUIDv7) ascending for stable tie-breaking

This guarantees chronological order regardless of NDJSON append order:

1. user "snakes" (18:23:36.947)
2. assistant "snakes" (18:23:36.969)
3. user "tigers" (18:23:49.429)
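
The sort itself is a two-key comparison. The sketch below uses simplified stand-ins: `Msg` is a hypothetical record, `ts_millis` replaces the RFC 3339 `ts` field, and a `u128` replaces the UUIDv7 `message_id`.

```rust
/// Simplified message record. `ts_millis` and `message_id` are stand-ins for
/// the RFC 3339 `ts` and UUIDv7 `message_id` fields of the real record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Msg {
    pub ts_millis: u64,
    pub message_id: u128,
    pub role: &'static str,
}

/// Sort by timestamp first, then by message ID for deterministic tie-breaking.
pub fn sort_on_read(mut messages: Vec<Msg>) -> Vec<Msg> {
    messages.sort_by_key(|m| (m.ts_millis, m.message_id));
    messages
}
```

Feeding it the file order from the scenario above (user, user, assistant) yields user, assistant, user.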

Why This Approach?

Maintains durability: Append-only writes preserve crash safety
No migration needed: Existing archives work without rewrites
Simple implementation: No buffered writes or complex write-time ordering
Performance trade-off: Small CPU cost on read (sorting) vs. complex write-time coordination

Consumer Guidance

DO: Use Archivist::get_messages() to retrieve messages
DON'T: Read the message log file directly and assume file order = chronological order
UI/API: Always sort by ts then message_id for defense in depth
Tie-breaking: Use message_id (UUIDv7) as secondary sort for stable ordering when timestamps match

Key Types

SessionMetadata

Stores all metadata about a session including:

scroll_id: UUIDv7 identifier for the session
connector_uid: Which connector owns this session
native_session_id: Original session ID from the connector (optional)
title: Optional human-readable session title (see Title Management below)
parent_scroll_id: For session lineage (splits, continuations)
continuation: Type of continuation (SPLIT, COMPACT, REFERENCE, EDIT)
tags: User-defined categorization
metadata: Free-form JSON for connector-specific fields

Title Management

Session titles are fully supported and persist across restarts. Titles are stored in the SessionMetadata struct and saved to the session.json file.

Setting Titles:

// Update title for an existing session
archivist.update_session_metadata(
    scroll_id,
    Some("My Custom Session Title".to_string()),
    None, // model
    None  // archive
).await?;

Default Behavior:

New sessions can specify an initial title during registration
If no title is provided, sessions default to None
The UI typically displays "Untitled" for sessions without titles

Title Loading:

Titles are automatically loaded when retrieving session metadata via get_session_metadata()
Session lists include titles via list_sessions() and list_sessions_all()
Titles are part of the SessionMetadata struct returned by all session queries

UI Integration:

The web UI displays session titles in the session list and sidebar
Users can rename sessions via the "Rename" button in the session list view
Renaming calls api::archivist::rename_session() which uses update_session_metadata()
Title changes are persisted immediately and survive application restarts

MessageRecord

Represents a single message in the archive:

message_id: UUIDv7 identifier
session: scroll_id this message belongs to
role: "user", "assistant", or "system"
content_md: Message content in Markdown format
attachments: References to attached files
metadata: Free-form JSON for connector-specific fields

ConnectorRecord

Metadata about a connector:

connector_uid: UUIDv7 identifier
type: "OpenCode", "ACP", or custom
client_native_id: Unique identifier from client (e.g., "opencode@http://localhost:12225")
alias_of: If this connector is an alias of another (for deduplication)

Archivist Public API

The Archivist struct (in coordinator/) is the main public entry point for archival operations. Consumers hold Arc<Archivist> and call inherent methods — there is no Archivist trait anymore. The coordinator resolves the target backend per call (via archive: Option<String>) and delegates to ArchiveBackend methods.

Key method families (see coordinator/*.rs for full signatures):

Archive lifecycle (archives.rs): add_archive, remove_archive, list_archives, set_default_archive
Connectors (connectors.rs): register_connector with tri-state result (Accepted / Aliased / Rejected), list_connectors
Sessions (sessions.rs): register_session, get_session_metadata, update_session_metadata, list_sessions_paged, move_session, copy_session, resolve_session
Messages: append_messages, get_messages (sorts by ts then message_id for stable chronological order)
Meta / DAG (meta.rs): meta-event recording, session lineage DAG walks, cleanup routines

List Filter vs. Full-Text Search

Two distinct query paths exist — do not conflate them.

List filter — Archivist::list_sessions_paged(SessionListQuery) returns a cursor-paged list of sessions, AND-filtered by title_query (substring on title), tags, model_filter (substring on metadata.model), project_id, connector_uid, and include_hidden. This is the right tool for "narrow the list of visible sessions."

Full-text search — api::search_sessions (in the api package, backed by api::archivist::search_task::SearchTask) spawns rg --json over the archive's .contexts/ tree to find messages containing text. It streams SearchExcerpts with parsed NDJSON content and supports cancellation via CancellationToken. This is the right tool for "find messages containing text."

Do not extend list_sessions_paged to do content search. Content search belongs in the ripgrep pipeline. Future improvements to content search (indexed backends, relevance scoring) are Phase 2d / Phase 3 concerns.

JsonlBackend Implementation

The Phase 2 production backend — an implementation of ArchiveBackend plus every sub-trait except SearchBackend:

Thread-safe: Uses RwLock for in-memory caches
Async: All operations use tokio for non-blocking I/O
Caching: In-memory caches for connector and session mappings
Collision Detection: Tri-state registration for connectors and sessions

Located under src/backends/jsonl/ and split by concern (backend.rs, connectors.rs, dag.rs, mapping.rs, meta.rs).

Caching Strategy

JsonlBackend maintains two in-memory caches:

connector_cache: HashMap<Uuid, ConnectorRecord>
- Populated on registration
- Loading from the TSV index on startup is a future enhancement
session_cache: HashMap<(Uuid, String), Uuid>
- Maps (connector_uid, native_session_id) to scroll_id
- Populated on registration and session resolution
- Enables fast session lookups without disk I/O

Event Handling

The EventHandler subscribes to dirigent_protocol events and archives them in real-time:

// Create archivist and event handler
let archivist = Archivist::new_with_single_archive(archive_path).await?;
let handler = EventHandler::new(Arc::new(archivist));

// Subscribe to event stream from dirigent_core
let events = event_stream.subscribe();

// Run event loop (blocking)
handler.run(events).await;

Supported Events

SessionCreated: Registers new sessions with the archivist
MessageCompleted: Writes finalized messages to the archive
SessionUpdate: Accumulates streaming message chunks
- AgentMessageChunk
- UserMessageChunk
- AgentThoughtChunk
- ToolCall

MessageAccumulator

Assembles streaming message chunks into complete messages:

Accumulates text chunks by message_id
Tracks thinking blocks separately
Stores tool calls with input/output
Finalizes messages on MessageCompleted event
Converts to MessageRecord for archival
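
The accumulation step can be illustrated with a map keyed by message ID. This is a reduced sketch under assumptions: `Accumulator` and `Pending` are hypothetical types, tool calls are left out, and the real MessageAccumulator produces a MessageRecord rather than this struct.

```rust
use std::collections::HashMap;

/// Partial message being assembled from streamed chunks.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct Pending {
    pub text: String,
    pub thoughts: String,
}

#[derive(Debug, Default)]
pub struct Accumulator {
    pending: HashMap<String, Pending>,
}

impl Accumulator {
    /// Append a visible text chunk to the message with this ID.
    pub fn push_text(&mut self, message_id: &str, chunk: &str) {
        self.pending
            .entry(message_id.to_string())
            .or_default()
            .text
            .push_str(chunk);
    }

    /// Append a thinking chunk, tracked separately from visible text.
    pub fn push_thought(&mut self, message_id: &str, chunk: &str) {
        self.pending
            .entry(message_id.to_string())
            .or_default()
            .thoughts
            .push_str(chunk);
    }

    /// Called on completion: remove and return the assembled message.
    pub fn finalize(&mut self, message_id: &str) -> Option<Pending> {
        self.pending.remove(message_id)
    }
}
```

Removing the entry on finalize keeps memory bounded by the number of messages still streaming.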

Integration with dirigent_core

The Archivist integrates with dirigent_core via the global event stream:

CoreRuntime emits events for all connector operations
EventHandler subscribes to event stream
MessageAccumulator assembles streaming chunks
Archivist writes complete messages to archive

This enables:

Automatic archival of all sessions and messages
No polling required - fully event-driven
Consistent history across restarts
Offline access to historical data

Testing

The package has comprehensive test coverage across multiple dimensions:

Unit Tests

Located in each module (src/*.rs, src/storage/*.rs):

Type serialization/deserialization
UUIDv7 generation and ordering
Timestamp formatting (RFC 3339)
Storage operations (NDJSON, JSON, TSV, files)
Connector registration tri-state logic
Session registration and alias detection

Integration Tests

Located in tests/:

integration_tests.rs: Full Archivist + JsonlBackend lifecycle, event handler integration, multi-connector scenarios, session lineage, message accumulation
list_sessions_paged_test.rs, pagination_test.rs: List filter + cursor pagination coverage
import_claude_idempotency_test.rs: Claude export re-import idempotency

Backend Contract Tests

src/backend/contract.rs holds reusable async assertions that any &dyn ArchiveBackend must pass. JsonlBackend and MockBackend both run the contract suite; new backends added in Phase 3+ should do the same.

Examples

Located in examples/:

basic_usage.rs: Core archivist operations
event_handling.rs: EventHandler and MessageAccumulator
file_storage.rs: Content-addressable file storage

Run tests:

cargo test --package dirigent_archivist

Run examples:

cargo run --package dirigent_archivist --example basic_usage
cargo run --package dirigent_archivist --example event_handling
cargo run --package dirigent_archivist --example file_storage

Performance Characteristics

Append Operations: O(1) with sequential file writes
Session Lookup: O(1) on an in-memory cache hit, O(n) on a cache miss
Message Retrieval: O(n) where n = number of messages (NDJSON parsing)
File Storage: O(1) content-addressable lookup with SHA-256 hashing
Connector Index: O(n) TSV scan, suitable for hundreds of connectors

Scalability Considerations

Large Sessions: NDJSON logs carry no index, so reading a large session requires parsing every line
Many Sessions: TSV indices are suitable for thousands of sessions per connector
File Deduplication: SHA-256 hashing provides automatic deduplication across sessions
Concurrent Access: RwLock allows multiple concurrent readers, single writer

Error Handling

The Archivist uses thiserror for rich error types:

pub enum ArchivistError {
    IoError(std::io::Error),
    SerdeError(serde_json::Error),
    SessionUnknown(Uuid),
    CollisionInconsistent(Uuid),
    // ... etc
}

All public APIs return Result<T, ArchivistError> for explicit error handling.

Development Notes

All storage operations are async (using tokio)
Content-addressable storage uses SHA-256 hashes (hex-encoded)
Archive directory structure mirrors session/message hierarchy
UUIDv7 provides time-ordered, sortable identifiers
RFC 3339 UTC timestamps for all time-based fields
Schema versioning via version field in all records

Related Packages

dirigent_protocol: Shared types and protocol definitions (dependency)
dirigent_core: Runtime integration for SSE event capture (integration point)
api: Server functions for archive queries (future)
web: UI for archive browsing and search (future)

Phase 4: `ArchiveFilter` (2026-04-21)

Every ArchiveRegistration carries a filter: ArchiveFilter. The filter describes which sessions/writes the backend wants to receive. Fields:

include_connectors: Option<HashSet<Uuid>> — if Some, only these connector UIDs pass. None means no connector gate.
exclude_connectors: HashSet<Uuid> — always rejected.
include_tags: HashSet<String> — if non-empty, the session must carry at least one matching tag.
exclude_tags: HashSet<String> — any matching tag rejects.
include_hidden: bool — default true. If false, sessions whose metadata has "hidden": true are skipped.
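
A possible reading of these fields as a predicate is sketched below. It mirrors the field names from the list but is not the crate's implementation: connector UIDs are `u128` instead of `Uuid`, the `accepts` method is hypothetical, and giving exclusions precedence over inclusions is an assumption.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone)]
pub struct ArchiveFilter {
    pub include_connectors: Option<HashSet<u128>>,
    pub exclude_connectors: HashSet<u128>,
    pub include_tags: HashSet<String>,
    pub exclude_tags: HashSet<String>,
    pub include_hidden: bool,
}

impl Default for ArchiveFilter {
    /// The default filter is unrestricted; note include_hidden defaults to true.
    fn default() -> Self {
        Self {
            include_connectors: None,
            exclude_connectors: HashSet::new(),
            include_tags: HashSet::new(),
            exclude_tags: HashSet::new(),
            include_hidden: true,
        }
    }
}

impl ArchiveFilter {
    /// Decide whether a session passes the filter (assumed precedence:
    /// exclusions are checked before inclusions).
    pub fn accepts(&self, connector: u128, tags: &[&str], hidden: bool) -> bool {
        if self.exclude_connectors.contains(&connector) {
            return false;
        }
        if let Some(allowed) = &self.include_connectors {
            if !allowed.contains(&connector) {
                return false;
            }
        }
        if tags.iter().any(|t| self.exclude_tags.contains(*t)) {
            return false;
        }
        if !self.include_tags.is_empty() && !tags.iter().any(|t| self.include_tags.contains(*t)) {
            return false;
        }
        self.include_hidden || !hidden
    }
}
```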

Primary-always-writes invariant

The per-call primary (either the archive: Some(name) argument or the default write-target) is never filtered. If a caller explicitly asks to write to archive X, the filter on X is not consulted. Filters only gate secondary fanout.

Boot validator

At boot (coordinator/boot.rs), the validator rejects configurations where:

No write-active + enabled registration has an unrestricted filter (ArchiveFilter::default() is unrestricted). Prevents configurations that silently drop all writes.
An archive's filter has include_connectors = Some(empty set) — equivalent to "reject everything", which is almost certainly a config bug.
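
The two boot checks can be sketched as a validation pass over the configured archives. The types here are reduced, hypothetical stand-ins (`FilterCfg` carries only two of the filter fields, and "unrestricted" is approximated as "no include gates"), so treat this as an outline of the rule rather than the validator in coordinator/boot.rs.

```rust
use std::collections::HashSet;

/// Reduced filter: only the fields needed to illustrate the checks.
#[derive(Debug, Clone, Default)]
pub struct FilterCfg {
    pub include_connectors: Option<HashSet<u128>>,
    pub include_tags: HashSet<String>,
}

impl FilterCfg {
    /// Approximation of "unrestricted" for this sketch.
    pub fn is_unrestricted(&self) -> bool {
        self.include_connectors.is_none() && self.include_tags.is_empty()
    }
}

pub struct ArchiveCfg {
    pub name: String,
    pub enabled: bool,
    pub write_active: bool,
    pub filter: FilterCfg,
}

pub fn validate(archives: &[ArchiveCfg]) -> Result<(), String> {
    // Check 1: an empty include set would reject every connector.
    for a in archives {
        if matches!(&a.filter.include_connectors, Some(set) if set.is_empty()) {
            return Err(format!(
                "archive '{}': include_connectors is an empty set",
                a.name
            ));
        }
    }
    // Check 2: at least one enabled, write-active archive must accept everything.
    let has_catch_all = archives
        .iter()
        .any(|a| a.enabled && a.write_active && a.filter.is_unrestricted());
    if !has_catch_all {
        return Err("no enabled, write-active archive has an unrestricted filter".to_string());
    }
    Ok(())
}
```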

See docs/plans/2026-04-21-archivist-phase4-design.md §4 for the full design rationale.

Phase 5: Importers (2026-04-21)

The import:: module centres on an Importer trait with per-source implementations under import::sources::*. Each source produces a ParsedConversation (ChatGPT) / ParsedSession (Codex) / session directory walk (Claude) and feeds the results through the common import_sessions orchestrator, which fires ImportProgressEvents on a bounded ImportProgressSink.

`Importer` trait

Every importer declares a config_shape() so UIs can render a dynamic form; a discover() that returns an ImportDiscovery preview; and an import() that does the actual work. All three methods are async.

The trait lives in import::trait_def. Shape types (ImportConfig, ImportTarget, ConfigField, ConfigFieldKind, ImportError) are serialisable and safe to cross the WASM boundary.

Registry

ImporterRegistry::with_defaults() registers every enabled importer-* feature. Currently: claude, chatgpt, codex. The registry is constructed at boot and stored on AppState.

Progress sink

ImportProgressSink::channel() returns a bounded mpsc pair. Non-terminal events use try_send (dropped on full); terminal events use send().await so consumers always see the final state.
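
The drop-on-full versus must-deliver split can be illustrated with the standard library's bounded channel. This is an analogue, not the crate's code: it uses `std::sync::mpsc::sync_channel` with a blocking `send` in place of the async `send().await`, and `ProgressSink` and `ProgressEvent` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProgressEvent {
    /// Non-terminal: safe to drop under back-pressure.
    Imported(usize),
    /// Terminal: must always reach the consumer.
    Finished { total: usize },
}

pub struct ProgressSink {
    tx: SyncSender<ProgressEvent>,
}

impl ProgressSink {
    /// Create a bounded sink/receiver pair.
    pub fn channel(capacity: usize) -> (Self, Receiver<ProgressEvent>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx }, rx)
    }

    /// Non-terminal event: dropped if the buffer is full.
    /// Returns whether the event was queued.
    pub fn progress(&self, imported: usize) -> bool {
        self.tx.try_send(ProgressEvent::Imported(imported)).is_ok()
    }

    /// Terminal event: waits for room so the final state is never lost.
    /// Returns false only if the receiver has gone away.
    pub fn finish(&self, total: usize) -> bool {
        self.tx.send(ProgressEvent::Finished { total }).is_ok()
    }
}
```

Dropping intermediate updates keeps a slow consumer from stalling the import, while the terminal event still arrives once the consumer catches up.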

Source crates

dirigent_chatgpt — parses conversations.json from the OpenAI data export.
dirigent_codex — parses *.jsonl session files under ~/.codex/sessions.

Both crates hold pure parser types with zero dirigent-specific types.

See docs/plans/2026-04-21-archivist-phase5-design.md.

Future Enhancements

Indexed SearchBackend implementations (tantivy/sqlite) — currently content search is ripgrep-based in the api package
Session splitting and lineage management (mutations.ndjson)
Knowledge overview generation (chat.md exports)
Embedding storage and search (embeds/)
Network RPC interface for remote archivist
Compaction and pruning policies
Additional concrete backends (e.g. SQLite, remote)

Documentation

Package README: ./README.md - User-facing overview
Architecture Docs: ../../docs/building/05_archivist/ - Design and planning
API Docs: Run cargo doc --package dirigent_archivist --open
Examples: See examples/ directory for working code samples
