Implementation Roadmap

This document outlines the phased implementation plan for Aranya observability.

Logging vs. Metrics Boundary

The roadmap includes two kinds of work:

Logging existing data (low effort)

Add tracing fields that already exist in scope (IDs, counts, sizes, timestamps).
Example: logging peer_device_id, effects_count, or bytes_transferred that are already computed.

Implementing new metrics (higher effort)

Requires new measurement, tracking, or state that does not exist yet.
Example: calculating bandwidth, tracking per-channel success rates, or adding new counters that persist across calls.

If a field is not already available at the logging call site, it is a new metric and requires additional code to compute, store, or pass through the call chain.
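
As a rough illustration of this boundary, the sketch below contrasts the two kinds of work. All names (`SyncResult`, `log_sync`, `SyncCounters`) are hypothetical and not taken from the Aranya codebase, and the log line is formatted by hand with the standard library rather than emitted through `tracing`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical result of one sync call. These values are assumed to be
/// already computed by the time the log statement runs.
pub struct SyncResult {
    pub peer_device_id: String,
    pub effects_count: usize,
    pub bytes_transferred: u64,
}

/// Logging existing data: only formats fields that are already in scope.
pub fn log_sync(r: &SyncResult) -> String {
    format!(
        "sync complete peer_device_id={} effects_count={} bytes_transferred={}",
        r.peer_device_id, r.effects_count, r.bytes_transferred
    )
}

/// New metric: state that persists across calls, so it needs its own
/// storage and has to be reachable from every call site that updates it.
#[derive(Default)]
pub struct SyncCounters {
    total_syncs: AtomicU64,
    total_bytes: AtomicU64,
}

impl SyncCounters {
    /// Updates the running totals after a sync.
    pub fn record(&self, r: &SyncResult) {
        self.total_syncs.fetch_add(1, Ordering::Relaxed);
        self.total_bytes.fetch_add(r.bytes_transferred, Ordering::Relaxed);
    }

    /// Returns (total syncs, total bytes) seen so far.
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.total_syncs.load(Ordering::Relaxed),
            self.total_bytes.load(Ordering::Relaxed),
        )
    }
}
```

`log_sync` needs nothing beyond the values it is handed, whereas `SyncCounters` has to be created, owned, and passed through the call chain, which is where the extra effort comes from.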

Phase Overview

Phase	Goal	Status
1	Foundation: structured logging, correlation IDs, error chains	Planned
2	Enhanced sync debugging: stall detection, topology, bundling	Planned
3	Policy & AFC: detailed error reporting, SHM logging	Planned

Phase 1: Foundation

Goal: Establish consistent structured logging with correlation IDs and device/team context.

What Gets Done

Structured JSON logging
- Configure tracing_subscriber with JSON formatter
- All logs output as consistent JSON with required fields
- Files: crates/aranya-daemon/src/main.rs
Correlation IDs
- Add correlation_id: tarpc::trace::TraceId to all RPC requests
- Thread through client → daemon → sync
- Derive from tarpc rpc.trace_id for cross-process correlation
- Edit Files: crates/aranya-daemon-api/, crates/aranya-daemon/src/api.rs, crates/aranya-client/src/client.rs, crates/aranya-client/src/util.rs, crates/aranya-client/src/team.rs, crates/aranya-client/src/device.rs
- New File: crates/aranya-daemon/src/observability.rs (to contain the correlation_id handlers)
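
One way to picture the client → daemon → sync threading is sketched below. It is a minimal, std-only illustration: `CorrelationId`, `SyncNowRequest`, and the three functions are hypothetical names, the ID is treated as an opaque 128-bit value rather than a real tarpc trace ID, and log lines are collected into a `Vec<String>` instead of going through `tracing`.

```rust
use std::fmt;

/// Hypothetical correlation ID. The roadmap derives it from the tarpc
/// trace ID; here it is just an opaque 128-bit value.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CorrelationId(pub u128);

impl fmt::Display for CorrelationId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Fixed-width hex so the same ID is easy to grep across process logs.
        write!(f, "{:032x}", self.0)
    }
}

/// Hypothetical RPC request that carries the ID from client to daemon.
pub struct SyncNowRequest {
    pub correlation_id: CorrelationId,
    pub peer: String,
}

/// Client side: attach the ID when building the request.
pub fn client_build_request(id: CorrelationId, peer: &str) -> SyncNowRequest {
    SyncNowRequest {
        correlation_id: id,
        peer: peer.to_string(),
    }
}

/// Daemon side: log with the ID, then hand the same ID to the sync layer.
pub fn daemon_handle(req: &SyncNowRequest, log: &mut Vec<String>) {
    log.push(format!(
        "daemon: sync_now correlation_id={} peer={}",
        req.correlation_id, req.peer
    ));
    sync_with_peer(req.correlation_id, &req.peer, log);
}

/// Sync layer: every log line repeats the ID it was given.
fn sync_with_peer(id: CorrelationId, peer: &str, log: &mut Vec<String>) {
    log.push(format!("sync: start correlation_id={} peer={}", id, peer));
}
```

Passing the ID as an explicit value keeps the sketch simple; in practice a span field or request context could carry it instead.
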
Error chains
- Wrap errors with .context() for full causal chain
- Add structured fields (device_id, team_id, peer_id, etc.)
- Files: All error handling locations in daemon and client
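
The idea of a full causal chain can be sketched with the standard library alone. The `ContextError` wrapper below is a hypothetical stand-in for whatever `.context()` helper the codebase uses, and `error_chain` is an illustrative helper that walks `Error::source()` to collect every layer for logging.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical wrapper that puts a context message on top of a cause,
/// similar in spirit to what a `.context()` helper produces.
#[derive(Debug)]
pub struct ContextError {
    context: String,
    source: Box<dyn Error + 'static>,
}

impl ContextError {
    pub fn new<E>(context: impl Into<String>, source: E) -> Self
    where
        E: Error + 'static,
    {
        Self {
            context: context.into(),
            source: Box::new(source),
        }
    }
}

impl fmt::Display for ContextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Only this layer's message; the cause is reached via `source()`.
        write!(f, "{}", self.context)
    }
}

impl Error for ContextError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Collects the message of every error in the chain, outermost first.
pub fn error_chain(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut messages = Vec::new();
    let mut current = Some(err);
    while let Some(e) = current {
        messages.push(e.to_string());
        current = e.source();
    }
    messages
}
```

Logging the result of `error_chain` alongside structured fields such as device_id or team_id gives one log entry that shows both where the failure surfaced and what caused it.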

Phase 2: Enhanced Sync Debugging

Goal: Provide comprehensive sync visibility through first command tracking, stall detection, and network quality metrics.

What Gets Done

First command tracking
- Log hash + max_cts of first command sent in each sync
- Compare on next sync to detect stalls
- Edit File: crates/aranya-daemon/src/sync/mod.rs
Stall detection
- Track per-peer: last_first_cmd_hash, last_first_cmd_max_cts, stall_count
- Only flag as stall when first command repeats AND new data expected
- Threshold: 3 consecutive identical first commands → WARNING
- Edit File: crates/aranya-daemon/src/sync/mod.rs
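
The stall rule above can be sketched as a small per-peer tracker. This is a minimal illustration under stated assumptions: `StallTracker` and `observe` are hypothetical names, the hash is modeled as a 32-byte array and max_cts as a `u64`, the count includes the first sighting of a command, and a repeat that arrives while no new data is expected restarts the count. None of these choices are confirmed by the roadmap.

```rust
use std::collections::HashMap;

/// Consecutive identical first commands needed before warning.
pub const STALL_THRESHOLD: u32 = 3;

/// Per-peer state named after the fields in the roadmap.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct PeerSyncState {
    pub last_first_cmd_hash: Option<[u8; 32]>, // hash type is an assumption
    pub last_first_cmd_max_cts: Option<u64>,   // integer type is an assumption
    pub stall_count: u32,
}

/// Hypothetical tracker keyed by peer identifier.
#[derive(Debug, Default)]
pub struct StallTracker {
    peers: HashMap<String, PeerSyncState>,
}

impl StallTracker {
    /// Records the first command of a sync and returns true when the
    /// peer should be flagged as stalled (a WARNING in the roadmap).
    pub fn observe(
        &mut self,
        peer: &str,
        first_cmd_hash: [u8; 32],
        first_cmd_max_cts: u64,
        new_data_expected: bool,
    ) -> bool {
        let state = self.peers.entry(peer.to_string()).or_default();
        let repeated = state.last_first_cmd_hash == Some(first_cmd_hash)
            && state.last_first_cmd_max_cts == Some(first_cmd_max_cts);

        // Count only repeats that happen while new data is expected;
        // anything else starts a fresh run of length one.
        state.stall_count = if repeated && new_data_expected {
            state.stall_count + 1
        } else {
            1
        };
        state.last_first_cmd_hash = Some(first_cmd_hash);
        state.last_first_cmd_max_cts = Some(first_cmd_max_cts);

        state.stall_count >= STALL_THRESHOLD
    }
}
```

With this shape, a peer that is simply idle never reaches the threshold, because repeats without expected data do not extend the run.
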
Network quality metrics
- Measure RTT (round-trip time) [new metric]
- Calculate bandwidth from bytes transferred [new metric]
- Track packet loss (if using QUIC) [new metric]
- Edit Files: crates/aranya-daemon/src/sync/mod.rs, crates/aranya-util/

Note: These metrics are collected per transport, and some of them do not apply to every transport type.
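
A minimal sketch of these measurements follows, using only the standard library. The names (`NetworkMetrics`, `measure_rtt`, `bandwidth`, `loss_ratio`) are hypothetical, and the QUIC packet counters are assumed to come from the transport as a (sent, lost) pair. Each metric is an `Option` so that a transport that cannot provide one reports `None` rather than a misleading zero.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-sync network metrics; `None` means "not applicable
/// to this transport" or "not measurable for this sample".
#[derive(Debug, Clone, PartialEq)]
pub struct NetworkMetrics {
    pub rtt: Option<Duration>,
    pub bandwidth_bytes_per_sec: Option<f64>,
    pub packet_loss_ratio: Option<f64>,
}

/// Times a caller-supplied round trip (for example a request/response pair).
pub fn measure_rtt<F: FnOnce()>(round_trip: F) -> Duration {
    let start = Instant::now();
    round_trip();
    start.elapsed()
}

/// Bytes per second, or `None` if no time elapsed.
pub fn bandwidth(bytes: u64, elapsed: Duration) -> Option<f64> {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 {
        Some(bytes as f64 / secs)
    } else {
        None
    }
}

/// Fraction of sent packets that were lost, or `None` if nothing was sent.
pub fn loss_ratio(sent: u64, lost: u64) -> Option<f64> {
    if sent == 0 {
        None
    } else {
        Some(lost as f64 / sent as f64)
    }
}

impl NetworkMetrics {
    /// `quic_packets` is `Some((sent, lost))` only for QUIC transports.
    pub fn from_sample(
        rtt: Option<Duration>,
        bytes: u64,
        elapsed: Duration,
        quic_packets: Option<(u64, u64)>,
    ) -> Self {
        Self {
            rtt,
            bandwidth_bytes_per_sec: bandwidth(bytes, elapsed),
            packet_loss_ratio: quic_packets.and_then(|(sent, lost)| loss_ratio(sent, lost)),
        }
    }
}
```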

Phase 3: Policy & AFC

Goal: Enhanced policy and AFC observability with detailed error reporting.

What Gets Done

Policy error reporting
- Add source file and line number to policy errors
- Show permission mismatches (required vs actual)
- Include check name that failed (e.g., “CanCreateLabels”)
- Generate source maps during compilation
- Files: crates/aranya-policy-vm/, crates/aranya-daemon/src/actions.rs
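
To make the source-map idea concrete, the sketch below maps a failing instruction index back to a source line and formats a report that names the check and the permission mismatch. Everything here is hypothetical: the `SourceMap` layout, the notion of an instruction index, and the message format are assumptions for illustration, not the policy VM's actual design.

```rust
/// Hypothetical source map produced at policy compile time: each entry is
/// (first instruction index, source line), sorted by instruction index.
pub struct SourceMap {
    pub file: String,
    pub entries: Vec<(usize, u32)>,
}

impl SourceMap {
    /// Returns the line of the last entry that starts at or before `pc`.
    pub fn line_for(&self, pc: usize) -> Option<u32> {
        // Binary search for the first entry that starts after `pc`.
        let idx = self.entries.partition_point(|&(start, _)| start <= pc);
        idx.checked_sub(1).map(|i| self.entries[i].1)
    }
}

/// Builds a report with the check name, location, and permission mismatch.
pub fn report_check_failure(
    map: &SourceMap,
    pc: usize,
    check: &str,
    required: &str,
    actual: &str,
) -> String {
    let location = match map.line_for(pc) {
        Some(line) => format!("{}:{}", map.file, line),
        // Keep the file name even when the line is unknown.
        None => format!("{}:?", map.file),
    };
    format!(
        "policy check {} failed at {}: required={} actual={}",
        check, location, required, actual
    )
}
```

Keeping the map as a sorted vector makes lookup a binary search and keeps the compile-time output small.
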
AFC SHM operation logging
- Log all key add/remove operations at DEBUG level
- Log failures with error codes and context
- Track per-channel statistics [new metrics]
- Files: crates/aranya-client/src/afc.rs
AFC failure tracking
- Detect SHM permission errors
- Detect SHM full conditions (max_keys reached)
- Log seal/open failures with crypto context
- Track failure patterns [new metrics]
- Files: crates/aranya-client/src/afc.rs
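
The per-channel statistics and failure tracking described above could take a shape like the following. This is a std-only sketch: `AfcFailure`, `ChannelStats`, `AfcStats`, and `check_capacity` are hypothetical names, channels are keyed by a plain `u64`, and the SHM-full rule is assumed to be "current key count has reached max_keys".

```rust
use std::collections::HashMap;

/// Hypothetical failure categories taken from the roadmap bullets.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum AfcFailure {
    ShmPermissionDenied,
    ShmFull,
    SealFailed,
    OpenFailed,
}

/// Counters kept for a single channel.
#[derive(Debug, Default)]
pub struct ChannelStats {
    pub keys_added: u64,
    pub keys_removed: u64,
    pub failures: HashMap<AfcFailure, u64>,
}

/// Statistics for all channels, keyed by a hypothetical numeric channel ID.
#[derive(Debug, Default)]
pub struct AfcStats {
    channels: HashMap<u64, ChannelStats>,
}

impl AfcStats {
    pub fn record_key_added(&mut self, channel: u64) {
        self.channels.entry(channel).or_default().keys_added += 1;
    }

    pub fn record_key_removed(&mut self, channel: u64) {
        self.channels.entry(channel).or_default().keys_removed += 1;
    }

    pub fn record_failure(&mut self, channel: u64, failure: AfcFailure) {
        let stats = self.channels.entry(channel).or_default();
        *stats.failures.entry(failure).or_insert(0) += 1;
    }

    /// Number of times `failure` was recorded for `channel`.
    pub fn failure_count(&self, channel: u64, failure: AfcFailure) -> u64 {
        self.channels
            .get(&channel)
            .and_then(|stats| stats.failures.get(&failure))
            .copied()
            .unwrap_or(0)
    }

    pub fn channel(&self, channel: u64) -> Option<&ChannelStats> {
        self.channels.get(&channel)
    }
}

/// Detects the SHM-full condition before attempting to add a key.
pub fn check_capacity(current_keys: usize, max_keys: usize) -> Result<(), AfcFailure> {
    if current_keys >= max_keys {
        Err(AfcFailure::ShmFull)
    } else {
        Ok(())
    }
}
```

Counting failures by category per channel is what allows failure patterns, such as repeated `ShmFull` on one channel, to be spotted from the statistics rather than from individual log lines.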
