sync from monorepo @ 2452e92e

2026-05-08 01:59:04 +02:00
commit b03dc15371
459 changed files with 129586 additions and 0 deletions
@@ -0,0 +1,338 @@
+# Dirigent Archivist
+
+Persistent storage for all agentic interactions in Dirigent.
+
+## Overview
+
+The Archivist automatically archives every conversation, message, and file from your AI sessions into a local, grep-able, human-readable archive. No cloud required - your data stays on your machine in formats you can read and search manually.
+
+## Why Archivist?
+
+- **Offline Access**: All conversations are saved locally, accessible even when connectors are offline
+- **Manual Curation**: Files are in plain JSON/NDJSON/TSV - grep, edit, or analyze them with any tool
+- **Knowledge Base**: Build a searchable archive of all your AI interactions across projects
+- **Session Lineage**: Track conversation branches, splits, and continuations
+- **File Deduplication**: Attachments are stored once, referenced multiple times (content-addressable)
+- **Archive-First**: UI reads from local archive first, only falls back to remote connectors when needed
+
+## Quick Start
+
+The Archivist runs automatically when you start Dirigent. The archive location is determined by the `DIRIGENT_DATA_DIR` environment variable (archives are stored at `<data_dir>/archives/`):
+
+```bash
+# Override data directory (archives at /path/to/data/archives/)
+DIRIGENT_DATA_DIR=/path/to/data dx serve
+```
+
+That's it! Every session and message will be automatically archived.
+
+## Archive Structure
+
+Each archive is organized like this (`dirigent_archive/` below stands for the archive directory):
+
+```
+dirigent_archive/
+├── .contexts/                       # Session data
+│   └── 01936e8f-e5a7-7000-8000.../
+│       ├── session.json             # Session metadata
+│       └── messages.ndjson          # All messages (one JSON per line)
+├── .db/
+│   └── connectors/                  # Connector registry
+│       ├── index.tsv                # Fast lookup table
+│       └── 01936e8f-e5a7.../
+│           ├── connector.json       # Connector info
+│           └── sessions.ndjson      # Session ID mappings
+└── .files/                          # Attachments (by SHA-256)
+    └── a1b2c3d4...                  # Content-addressable storage
+```
+
+### Why Hidden Directories?
+
+The `.contexts`, `.db`, and `.files` directories start with `.` to keep them internal (like `.git`). In the future, you'll be able to export rendered markdown files into the archive root for easy reading.
+
+## File Formats
+
+### Session Metadata (`.contexts/{id}/session.json`)
+
+```json
+{
+  "version": 1,
+  "scroll_id": "01936e8f-e5a7-7000-8000-000000000001",
+  "created_at": "2025-01-01T12:00:00Z",
+  "updated_at": "2025-01-01T12:30:00Z",
+  "title": "Implement user authentication",
+  "connector_uid": "01936e8f-e5a7-7000-8000-000000000002",
+  "tags": ["backend", "auth"],
+  "metadata": {
+    "source": "OpenCode",
+    "model": "claude-3-5-sonnet"
+  }
+}
+```
+
+### Messages (`.contexts/{id}/messages.ndjson`)
+
+Newline-delimited JSON - one message per line, **append-only**:
+
+```jsonl
+{"version":1,"message_id":"...","session":"...","role":"user","ts":"2025-01-01T12:01:00Z","content_md":"How do I implement JWT auth?","attachments":[],"metadata":{}}
+{"version":1,"message_id":"...","session":"...","role":"assistant","ts":"2025-01-01T12:01:10Z","content_md":"Here's how to implement JWT authentication...","attachments":[],"metadata":{"model":"claude-3-5-sonnet"}}
+```
+
+**IMPORTANT**: Messages are appended in event-arrival order, which is not necessarily chronological order. An assistant reply can land after a later user message because of streaming latency. When reading programmatically, use the Archivist API, which sorts by timestamp (`ts`). For manual inspection, sort by the `ts` field yourself.
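+
+The sort-on-read step can be sketched in a few lines. This is a minimal illustration, not the Archivist's actual implementation: the `Message` struct and `sort_chronologically` function are hypothetical, and comparing `ts` as a plain string assumes every timestamp uses the same UTC `Z` format and precision as the examples above.
+
+```rust
+/// Hypothetical in-memory view of one line of `messages.ndjson`
+/// (only the fields needed for ordering are shown).
+#[derive(Debug, Clone, PartialEq)]
+struct Message {
+    ts: String,   // RFC 3339 timestamp, e.g. "2025-01-01T12:01:00Z"
+    role: String, // "user" or "assistant"
+}
+
+/// Sorts messages chronologically by `ts`.
+/// `sort_by` is stable, so messages with equal timestamps keep arrival order.
+fn sort_chronologically(messages: &mut [Message]) {
+    messages.sort_by(|a, b| a.ts.cmp(&b.ts));
+}
+```
+
+If timestamps ever mix UTC offsets or fractional-second precision, parse them into a real date-time type before comparing instead of relying on string order.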
+
+### Connector Index (`.db/connectors/index.tsv`)
+
+Tab-separated values for fast scanning:
+
+```tsv
+connector_uid	type	title	client_native_id	alias_of	created_at
+01936e8f...	OpenCode	OpenCode Local	opencode@http://localhost:12225		2025-01-01T12:00:00Z
+```
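+
+Reading the index programmatically needs little more than splitting on tabs. The sketch below assumes exactly the six columns shown in the header above and treats an empty `alias_of` as absent. `ConnectorRow`, `parse_index_row`, and `parse_index` are illustrative names, not part of the crate's API.
+
+```rust
+/// One row of `.db/connectors/index.tsv` (hypothetical mirror of the header).
+#[derive(Debug, PartialEq)]
+struct ConnectorRow {
+    connector_uid: String,
+    kind: String, // the `type` column (`type` is a reserved word in Rust)
+    title: String,
+    client_native_id: String,
+    alias_of: Option<String>, // empty column means "not an alias"
+    created_at: String,
+}
+
+/// Parses a single data row; returns `None` if the column count is wrong.
+fn parse_index_row(line: &str) -> Option<ConnectorRow> {
+    let cols: Vec<&str> = line.split('\t').collect();
+    if cols.len() != 6 {
+        return None;
+    }
+    Some(ConnectorRow {
+        connector_uid: cols[0].to_string(),
+        kind: cols[1].to_string(),
+        title: cols[2].to_string(),
+        client_native_id: cols[3].to_string(),
+        alias_of: if cols[4].is_empty() { None } else { Some(cols[4].to_string()) },
+        created_at: cols[5].to_string(),
+    })
+}
+
+/// Parses the whole file, skipping the header line and any malformed rows.
+fn parse_index(tsv: &str) -> Vec<ConnectorRow> {
+    tsv.lines().skip(1).filter_map(parse_index_row).collect()
+}
+```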
+
+## Searching Your Archive
+
+Since everything is plain text, you can use standard Unix tools:
+
+```bash
+# Find all sessions whose metadata mentions "authentication" (case-insensitive; prints matching files)
+grep -il "authentication" dirigent_archive/.contexts/*/session.json
+
+# Find messages mentioning a specific error
+grep "ECONNREFUSED" dirigent_archive/.contexts/*/messages.ndjson
+
+# List session ID mappings across all connectors
+jq . dirigent_archive/.db/connectors/*/sessions.ndjson
+
+# Get all user messages from a session (sorted by timestamp)
+jq -s 'sort_by(.ts) | .[] | select(.role=="user")' dirigent_archive/.contexts/01936e8f.../messages.ndjson
+
+# View messages in chronological order
+jq -s 'sort_by(.ts)' dirigent_archive/.contexts/01936e8f.../messages.ndjson
+```
+
+**Note on ordering**: File order is event-arrival order, not chronological order. Always sort by `ts` (timestamp) when reading manually.
+
+## Integration with Dirigent
+
+The Archivist integrates seamlessly with Dirigent's core runtime:
+
+1. **Automatic Archiving**: Every session and message is archived in real-time as events arrive
+2. **Event-Driven**: No polling; the Archivist listens to the `dirigent_core` event stream
+3. **Append-Only Writes**: Messages are appended as completion events arrive, so records already on disk are never rewritten
+4. **Sort-on-Read**: API returns messages in chronological order despite append-only file order
+5. **UI Integration**: Web UI reads from archive first, shows data even when connectors are offline
+6. **Connector Coordination**: Assigns stable UUIDs to connectors with collision detection
+
+## Key Concepts
+
+### Scroll IDs
+
+Every session gets a unique `scroll_id` (UUIDv7) that's independent of the connector's native session ID. This allows:
+- Sessions to move between connectors
+- Stable references even if connector data is deleted
+- Time-ordered sorting (UUIDv7 encodes timestamp)
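+
+To make the time-ordering concrete: in the standard UUIDv7 layout, the first 48 bits hold a Unix timestamp in milliseconds. The sketch below recovers it from a `scroll_id` string using only the standard library. `uuid_v7_unix_ms` is a hypothetical helper for illustration, not a function exposed by the Archivist.
+
+```rust
+/// Extracts the Unix timestamp (milliseconds) embedded in a UUIDv7 string.
+/// Returns `None` if the input is not 32 hex digits or is not version 7.
+fn uuid_v7_unix_ms(uuid: &str) -> Option<u64> {
+    // Drop the dashes so the 128-bit value is one run of 32 hex digits.
+    let hex: String = uuid.chars().filter(|c| *c != '-').collect();
+    if hex.len() != 32 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
+        return None;
+    }
+    // The version nibble is the 13th hex digit (index 12).
+    if hex.as_bytes()[12] != b'7' {
+        return None;
+    }
+    // The first 12 hex digits (48 bits) are the millisecond timestamp.
+    u64::from_str_radix(&hex[..12], 16).ok()
+}
+```
+
+Because the timestamp occupies the most significant bits, sorting `scroll_id` strings lexicographically also sorts sessions by creation time, to millisecond resolution.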
+
+### Session Lineage
+
+Sessions can have parent sessions, creating a tree of related conversations:
+- **Split**: Fork conversation at a specific message
+- **Compact**: Summarized version of parent
+- **Reference**: Points to parent without duplication
+- **Edit**: Modified version of parent
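+
+One way to picture the lineage tree is a map from each session to its optional parent plus the kind of continuation. The sketch below is purely illustrative: lineage management is listed as future work, so the `Continuation` variants simply mirror the list above, and the `ancestors` function and the map layout are assumptions rather than the crate's data model.
+
+```rust
+use std::collections::HashMap;
+
+/// How a session was derived from its parent (mirrors the list above).
+#[derive(Debug, Clone, Copy, PartialEq)]
+enum Continuation {
+    Split,
+    Compact,
+    Reference,
+    Edit,
+}
+
+/// Maps a scroll_id to its parent scroll_id and derivation kind (`None` for roots).
+type LineageMap = HashMap<String, Option<(String, Continuation)>>;
+
+/// Walks from `scroll_id` up to the root, returning each parent together with
+/// the way the child was derived from it, nearest parent first.
+fn ancestors(sessions: &LineageMap, scroll_id: &str) -> Vec<(String, Continuation)> {
+    let mut chain = Vec::new();
+    let mut current = scroll_id.to_string();
+    // Bound the walk by the map size so a corrupted (cyclic) lineage cannot loop forever.
+    for _ in 0..sessions.len() {
+        match sessions.get(&current) {
+            Some(Some((parent, kind))) => {
+                chain.push((parent.clone(), *kind));
+                current = parent.clone();
+            }
+            _ => break, // reached a root or an unknown session
+        }
+    }
+    chain
+}
+```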
+
+### Content-Addressable Storage
+
+Files are stored by their SHA-256 hash, so:
+- Same file uploaded twice uses same storage
+- Files can be shared across sessions without duplication
+- You can verify file integrity by hash
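+
+As a rough sketch of how a content address maps to disk, the helper below turns a `sha256:<hex>` file ID into a path under `.files/`. It assumes the stored file is named by the bare 64-character lowercase hex digest, which the archive tree above suggests but does not guarantee. `blob_path` is a hypothetical name, not part of the crate's API.
+
+```rust
+use std::path::{Path, PathBuf};
+
+/// Maps a file ID such as "sha256:<64 hex chars>" to `<archive_root>/.files/<hex>`.
+/// Returns `None` for other hash schemes or malformed digests.
+fn blob_path(archive_root: &Path, file_id: &str) -> Option<PathBuf> {
+    let hex = file_id.strip_prefix("sha256:")?;
+    // A SHA-256 digest is 32 bytes, i.e. 64 lowercase hex characters.
+    let valid = hex.len() == 64
+        && hex.bytes().all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b));
+    if !valid {
+        return None;
+    }
+    Some(archive_root.join(".files").join(hex))
+}
+```
+
+Validating the digest before joining it onto a path also keeps a malformed ID (for example one containing `../`) from escaping the `.files/` directory.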
+
+## Configuration
+
+### Environment Variables
+
+- `DIRIGENT_DATA_DIR`: Override data directory; archives are stored at `<data_dir>/archives/`
+
+### Example Configurations
+
+```bash
+# Use custom data directory (archives at /home/user/mydata/archives/)
+DIRIGENT_DATA_DIR=/home/user/mydata dx serve
+
+# Use global data directory
+DIRIGENT_DATA_DIR=/home/user/.dirigent dx serve
+
+# Use temporary data directory (testing)
+DIRIGENT_DATA_DIR=/tmp/dirigent_test dx serve
+```
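+
+The mapping from data directory to archive location can be sketched as follows. This is a minimal illustration under stated assumptions: `resolve_archives_dir` is a hypothetical helper, and the default data directory is not documented here, so the caller passes it in as `fallback`.
+
+```rust
+use std::ffi::OsString;
+use std::path::{Path, PathBuf};
+
+/// Returns `<data_dir>/archives`, where `data_dir` is the value of
+/// `DIRIGENT_DATA_DIR` if set, or `fallback` otherwise.
+fn resolve_archives_dir(env_value: Option<OsString>, fallback: &Path) -> PathBuf {
+    let data_dir = env_value
+        .map(PathBuf::from)
+        .unwrap_or_else(|| fallback.to_path_buf());
+    data_dir.join("archives")
+}
+```
+
+In real use the first argument would come from `std::env::var_os("DIRIGENT_DATA_DIR")`; taking it as a parameter keeps the function easy to test.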
+
+## Programmatic Access
+
+While the Archivist runs automatically, you can also use it programmatically:
+
+```rust
+use dirigent_archivist::Archivist;
+use std::path::PathBuf;
+
+#[tokio::main]
+async fn main() -> Result<(), Box<dyn std::error::Error>> {
+    // Create an archivist over a single archive directory.
+    // Internally this wires up a `JsonlBackend` for the archive.
+    let archivist = Archivist::new_with_single_archive(
+        PathBuf::from("./dirigent_archive")
+    ).await?;
+
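+    // List sessions for a connector.
+    // NOTE: `connector_uid` is a placeholder and is not defined in this snippet;
+    // obtain it from the connector registry (e.g. `.db/connectors/index.tsv`).
+    let sessions = archivist.list_sessions(connector_uid).await?;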
+
+    for session in sessions {
+        println!("{}: {}", session.scroll_id, session.title.unwrap_or_default());
+    }
+
+    Ok(())
+}
+```
+
+`Archivist` is a concrete struct that owns a registry of `ArchiveBackend`
+implementations keyed by archive name. In Phase 2 the only backend is
+`JsonlBackend` (file-based NDJSON/JSON/TSV). See `examples/` for more
+detailed usage.
+
+## Performance
+
+The Archivist is designed for human-scale workloads (thousands of sessions, millions of messages):
+
+- **Fast Writes**: Appending a message to an NDJSON file is O(1) regardless of how large the session already is
+- **Cached Reads**: Common lookups cached in memory
+- **Grep-able**: TSV indices can be scanned in milliseconds
+- **Incremental**: Only new messages are written, no full re-writes
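+
+The append-only write pattern behind these properties looks roughly like the sketch below. It is not the `JsonlBackend` implementation; `append_ndjson_line` is a hypothetical helper that shows the technique with the standard library only, taking an already-serialized JSON record.
+
+```rust
+use std::fs::OpenOptions;
+use std::io::{self, Write};
+use std::path::Path;
+
+/// Appends one pre-serialized JSON record as a single line.
+/// Existing content is never read or rewritten.
+fn append_ndjson_line(path: &Path, json_line: &str) -> io::Result<()> {
+    // NDJSON framing relies on one record per line, so reject embedded newlines.
+    if json_line.contains('\n') {
+        return Err(io::Error::new(
+            io::ErrorKind::InvalidInput,
+            "record must not contain a newline",
+        ));
+    }
+    // `append(true)` makes every write go to the end of the file.
+    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
+    file.write_all(json_line.as_bytes())?;
+    file.write_all(b"\n")?;
+    Ok(())
+}
+```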
+
+### Scalability Notes
+
+- Large sessions (1000+ messages) may take a few seconds to load
+- TSV indices are suitable for 100-1000 connectors
+- File deduplication saves space for repeated attachments
+
+## Querying and Curation
+
+### Future: Knowledge Overviews
+
+The Archivist is designed to support knowledge curation workflows:
+- Export sessions as clean markdown files
+- Create summaries and overviews across sessions
+- Tag and categorize conversations
+- Build a personal knowledge base
+
+These features are planned for future releases.
+
+### Current: Manual Curation
+
+For now, you can manually curate your archive:
+- Edit `session.json` to add tags
+- Grep through messages for specific topics
+- Copy/organize sessions into project folders
+- Use jq/awk/sed to extract insights
+
+## Advanced Features
+
+### Session Splitting
+
+Session splitting is planned but not yet implemented. The intended API creates a new conversation branch from any point in history:
+
+```rust
+// Future API (not yet implemented)
+let new_session = archivist.split_session(
+    session_id,
+    at_message_id,
+    Continuation::Split
+).await?;
+```
+
+### Attachment Storage
+
+Files are automatically deduplicated using SHA-256:
+
+```rust
+// Store file (content-addressable)
+let file_id = archivist.store_file(
+    &file_data,
+    "spec.pdf",
+    Some("application/pdf")
+).await?;
+
+// Reference in message
+let attachment = AttachmentRef {
+    file_id,  // "sha256:abc123..."
+    name: "spec.pdf".to_string(),
+    mime_type: Some("application/pdf".to_string()),
+};
+```
+
+### Multi-Archive Support
+
+`Archivist` natively manages multiple named archives via an on-disk
+registry. Each archive is backed by its own `ArchiveBackend` (currently
+`JsonlBackend`) and selected per call via an optional `archive` argument.
+This enables:
+- Separate archives per project
+- A default archive plus specialized side archives
+- Moving or copying sessions between archives
+
+Future backends (e.g. SQLite, indexed, remote) will plug into the same
+trait layer without changing the coordinator API.
+
+## Troubleshooting
+
+### Archive Not Created
+
+If the archive directory doesn't appear:
+1. Check `DIRIGENT_DATA_DIR` is set correctly (or that the default data directory is writable)
+2. Ensure write permissions on parent directory
+3. Check logs for I/O errors
+
+### Missing Sessions
+
+If sessions don't appear in archive:
+1. Verify that `EventHandler` is running
+2. Check for event subscription errors in logs
+3. Ensure connector emits `SessionCreated` events
+
+### Large Archive Size
+
+If archive grows too large:
+1. Check `.files/` for unusually large attachments (identical files are already stored only once)
+2. Consider archiving old sessions separately
+3. Future: Use compaction features (not yet implemented)
+
+## Development Status
+
+**Current** (Phase 2 complete):
+- Automatic archival of sessions and messages
+- Event-driven integration with dirigent_core
+- File-based storage with NDJSON/JSON/TSV (`JsonlBackend`)
+- Content-addressable file storage
+- Multi-archive coordinator with per-archive backends
+- Trait-based backend abstraction (`ArchiveBackend` + sub-traits)
+
+**Future**:
+- Indexed `SearchBackend` implementations (full-text search)
+- Additional concrete backends (SQLite, remote)
+- Session splitting and lineage management
+- Knowledge overview generation
+- Network RPC interface
+
+## Documentation
+
+- **Developer Guide**: `CLAUDE.md` - Package architecture and implementation details
+- **Architecture**: `docs/building/05_archivist/vision.md` - Design rationale
+- **API Docs**: `cargo doc --package dirigent_archivist --open`
+- **Examples**: See `examples/` for working code
+
+## Contributing
+
+The Archivist is part of the Dirigent project. See the main repository for contribution guidelines.
+
+## License
+
+Part of the Dirigent project.