Observability in Talon

Talon uses OpenTelemetry for traces and metrics, and zerolog for structured logs. Evidence records are stored in SQLite for compliance; traces and metrics are for operational observability.

Enabling OpenTelemetry

OTel is off by default. Enable it in one of these ways:

Flag: talon --otel run "query" or talon --otel serve
Environment: TALON_OTEL_ENABLED=true
Verbose: talon -v ... also enables OTel (for development)

Example (production-style, without turning on verbose logs):

export TALON_OTEL_ENABLED=true
talon serve

Or:

talon --otel serve

What is exported

Traces: Span hierarchy for agent runs, policy evaluation, LLM calls, evidence store, secrets, memory, and HTTP requests (when using talon serve). Spans include correlation_id, tenant_id, agent_id, and GenAI attributes where applicable.
Metrics: 25+ OTel instruments across all subsystems (see Metrics reference below). Exported via OTLP or stdout depending on configuration.
Logs: Structured JSON or console via zerolog. Key log lines include trace_id and span_id when OTel is enabled so logs can be correlated with traces in a backend.

Export destination is stdout by default. Use the OTLP exporter to send traces and metrics to a collector (e.g. Jaeger, Prometheus, Grafana). See examples/observability for a ready-made local stack.

Metrics reference

All metrics are registered via the OpenTelemetry Go SDK. The tables below group them by subsystem. Every metric follows the naming convention talon.<subsystem>.<metric> for custom metrics or GenAI Semantic Conventions for standard LLM telemetry.

LLM / GenAI

Registered by internal/llm. Emitted on every LLM call.

Metric	Type	Unit	Attributes	Description
`talon.cost.request`	Float64Histogram	`eur`	`agent`, `model`, `degraded`	Cost in EUR per LLM request.
`gen_ai.client.token.usage`	Int64Histogram	`{token}`	`gen_ai.system`, `gen_ai.request.model`, `gen_ai.token.type`	Token usage per LLM request (input/output). GenAI SemConv.
`gen_ai.client.operation.duration`	Float64Histogram	`s`	`gen_ai.system`, `gen_ai.request.model`	End-to-end LLM operation duration. GenAI SemConv.
`gen_ai.server.time_to_first_token`	Float64Histogram	`s`	`gen_ai.system`, `gen_ai.request.model`	Time from request sent to first content token (streaming). GenAI SemConv.
`gen_ai.server.time_per_output_token`	Float64Histogram	`s`	`gen_ai.system`, `gen_ai.request.model`	Time per output token after first token (streaming decode phase). GenAI SemConv.
`talon.provider.availability`	Float64Gauge	`1`	`provider`	Provider availability (1 = up, 0 = down).
`talon.provider.failover.total`	Int64Counter	`{failover}`	`original_model`, `fallback_model`, `reason`	Provider failover events (cost degradation or unavailability).

Gateway

Registered by internal/gateway. Emitted for every request through the LLM API gateway.

Metric	Type	Unit	Attributes	Description
`talon.gateway.requests.total`	Int64Counter	`{request}`	`agent`, `model`, `gen_ai.system`, `status`	Total gateway proxy requests.
`talon.gateway.errors.total`	Int64Counter	`{error}`	`error_type`	Gateway errors by type (auth, policy, provider, timeout).
`talon.data_tier.requests`	Int64Counter	`{request}`	`tier`, `agent`	Requests by data classification tier (0/1/2).
`talon.tools.governance.total`	Int64Counter	`{decision}`	`tool`, `action`	Tool governance decisions (allow, block, filter).
`talon.cache.hits`	Int64Counter	`{hit}`	`tenant_id`	Semantic cache hits (request served from cache).
`talon.cache.misses`	Int64Counter	`{miss}`	`tenant_id`	Semantic cache misses (forwarded to LLM).
`talon.shadow.violations.total`	Int64Counter	`{violation}`	`violation_type`	Shadow mode violations (would-have-blocked in enforce mode).
`talon.gateway.egress.decisions`	Int64Counter	`{decision}`	`tenant_id`, `tier`, `gen_ai.system`, `region`, `decision`	Egress policy decisions (destination × data tier), `decision` is `allow` or `deny`.
`talon.budget.utilization`	Float64Gauge	`%`	`tenant_id`, `period`	Current budget utilization as a percentage.
`talon.budget.alerts.total`	Int64Counter	`{alert}`	`tenant_id`, `threshold`	Budget threshold breach alerts.

When an egress policy is configured, the gateway request span also carries talon.egress.* attributes: agent, correlation_id, data_tier, destination_provider, destination_region, decision, and reason (machine code, empty on allow).

Routing spans

The llm.route / llm.graceful_route spans (agent runs) carry talon.data.tier and, when compliance-aware routing is active, talon.routing.sovereignty_mode, talon.provider.jurisdiction, talon.provider.region, talon.routing.rejected_count, and talon.routing.selection_reason.

Policy engine

Registered by internal/policy. Emitted on every policy evaluation (OPA).

Metric	Type	Unit	Attributes	Description
`talon.policy.evaluations.total`	Int64Counter	`{evaluation}`	`decision`, `tenant_id`, `agent_id`	Policy evaluation count by decision (allow/deny).
`talon.policy.evaluation.duration`	Float64Histogram	`ms`	`tenant_id`	Policy evaluation latency in milliseconds.

PII classifier

Registered by internal/classifier. Emitted when PII is detected or redacted.

Metric	Type	Unit	Attributes	Description
`talon.pii.detections.total`	Int64Counter	`{detection}`	`pii_type`, `direction`, `action`	PII entities detected (email, IBAN, phone, etc.).
`talon.pii.redactions.total`	Int64Counter	`{redaction}`	`pii_type`, `direction`	PII entities redacted before forwarding.

Evidence store

Registered by internal/evidence. Emitted on every evidence write or verification.

Metric	Type	Unit	Attributes	Description
`talon.evidence.records.total`	Int64Counter	`{record}`	`type`	Evidence records stored (LLM, tool, secret, memory).
`talon.evidence.signature_verifications`	Int64Counter	`{verification}`	`result`	HMAC signature verification attempts (success/failure).

Secrets vault

Registered by internal/secrets. Emitted on every secret access attempt.

Metric	Type	Unit	Attributes	Description
`talon.secrets.access.total`	Int64Counter	`{access}`	`secret_name`, `agent_id`, `outcome`	Secret access attempts (granted/denied).

Attachment scanner

Registered by internal/attachment. Emitted when prompt injection patterns are detected.

Metric	Type	Unit	Attributes	Description
`talon.injection.attempts.total`	Int64Counter	`{attempt}`	`detection_type`, `action`	Prompt injection attempts detected in attachments.

Agent memory

Registered by internal/memory. Emitted on memory operations.

Metric	Type	Unit	Attributes	Description
`memory.writes.total`	Int64Counter	`{write}`	`category`, `tenant_id`	Memory entries written.
`memory.writes.denied`	Int64Counter	`{write}`	`reason`	Memory writes denied (PII, forbidden category).
`memory.conflicts.detected`	Int64Counter	`{conflict}`	`resolution`	Memory conflicts detected during consolidation.
`memory.reads.total`	Int64Counter	`{read}`	`source`, `tenant_id`	Memory entries read (for prompt injection).
`memory.entries.count`	Int64Gauge	`{entry}`	`tenant_id`, `agent_id`	Current memory entry count per agent.
`memory.dedup.skips`	Int64Counter	`{skip}`	`tenant_id`	Duplicate memory writes suppressed by dedup window.
`memory.consolidation.noops`	Int64Counter	`{noop}`	—	Consolidation runs that found nothing to merge.
`memory.consolidation.invalidations`	Int64Counter	`{invalidation}`	—	Entries invalidated during consolidation.
`memory.consolidation.updates`	Int64Counter	`{update}`	—	Entries updated during consolidation.
`talon.memory.poisoning.blocked`	Int64Counter	`{block}`	`reason`, `agent_id`	Memory poisoning attempts blocked by governance.

Gateway dashboard metrics

In addition to OTel metrics, Talon provides a real-time runtime dashboard with an in-memory metrics collector. The collector is fed from successful evidence.Store.Store() commits (all invocation types), then periodically reconciled from evidence as an authoritative repair path.

The dashboard snapshot is available at:

HTML dashboard: GET /gateway/dashboard — single-page HTML with auto-refreshing charts.
JSON API: GET /api/v1/metrics — full snapshot for programmatic access.
SSE stream: GET /api/v1/metrics/stream — Server-Sent Events, one snapshot every 5 seconds.

The dashboard snapshot includes:

Field	Type	Description
`summary.total_requests`	int	Total requests processed.
`summary.blocked_requests`	int	Requests denied by policy.
`summary.pii_detections`	int	PII entities found across all requests.
`summary.pii_redactions`	int	PII entities redacted.
`summary.tools_filtered`	int	Tool calls filtered by governance.
`summary.total_cost_eur`	float	Cumulative cost in EUR.
`summary.avg_latency_ms`	int	Average request latency.
`summary.p99_latency_ms`	int	P99 request latency.
`summary.error_rate`	float	Error rate (0.0–1.0).
`summary.active_runs`	int	Currently executing agent runs.
`summary.pending_plans`	int	Plans awaiting human review.
`summary.approved_plans`	int	Plans approved by reviewers.
`summary.rejected_plans`	int	Plans rejected by reviewers.
`summary.modified_plans`	int	Plans approved with modifications.
`summary.dispatched_plans`	int	Approved plans already dispatched/executed.
`summary.plan_dispatch_errors`	int	Dispatched plans that recorded execution/dispatch errors.
`requests_timeline`	array	5-minute bucketed request counts.
`pii_timeline`	array	5-minute bucketed PII detection counts.
`cost_timeline`	array	5-minute bucketed cost in EUR.
`agent_stats`	array	Per-agent aggregates (requests, PII, blocked, cost, latency).
`pii_breakdown`	array	Detections broken down by PII type (email, IBAN, phone, etc.).
`model_breakdown`	array	Requests and cost broken down by LLM model.
`tool_governance`	object	Tool filtering stats (total, filtered, by risk level, anomalous agents).
`shadow_summary`	object	Shadow mode violation summary (only present in shadow mode).
`budget_status`	object	Budget utilization (daily/monthly used, limit, percentage).
`cache_stats`	object	Semantic cache performance (hits, hit rate, cost saved).
`plan_stats`	object	Plan lifecycle counters (pending/approved/rejected/modified/dispatched/failures).
`dropped_events`	int	Collector events dropped due to in-process backpressure.

See Gateway dashboard reference for full configuration, authentication, and API details.

Operational-event projection

Projection: Evidence → OperationalEvent → Metrics / UI / CLI. Metrics emit only after evidence persists, and live increments are driven by store post-commit observer notifications. Collector overflow is exposed as dropped_events in the snapshot, metrics_events_dropped in /v1/status, and OTel counter talon.metrics.events_dropped.total. Periodic reconciliation from evidence is bounded and idempotent, used to repair drift rather than define an alternate source of truth.

Surface	Endpoint / source	Scope	Parity expectation
Evidence list	`/v1/evidence`	all evidence rows in tenant window	authoritative ordering: `timestamp DESC, id DESC`
Events API	`/api/v1/events/recent`, `/api/v1/events/stream`	`terminal_plus_lifecycle_subset` (evidence-backed only)	same ordering and evidence-linked fields
Dashboard	`/dashboard` recent-events table	mirrors events API rows, with compact signal chips	reflects events API without manual refresh
Metrics snapshot	`/api/v1/metrics`	`all_activity` collector projection	consistent summary/breakdown invariants and reconciliation health
Status	`/v1/status`	reliability contract fields	exposes `metrics_events_dropped`, `events_stream_gaps`, `events_replay_misses`, `events_backlog_drops`, reconcile status

Log–trace correlation

When OTel is enabled, critical log events (e.g. agent_run_started, agent_run_completed, failed_to_generate_evidence) include trace_id and span_id. In an observability backend that ingests both logs and traces, you can jump from a log line to the corresponding trace.

Structured log fields

Logs consistently include:

correlation_id – unique per agent run
tenant_id – tenant scope
agent_id – agent name
trace_id / span_id – when OTel is enabled

Use these for filtering and correlation in your log aggregation (e.g. Elasticsearch, Loki).

HTTP server tracing

When running talon serve, the chi router uses OTel middleware. Each HTTP request (health, status, webhooks) gets a root span; agent runs triggered by webhooks appear as child spans, so you can see the full request → run → LLM/tool chain in one trace.

Local observability stack

A ready-made Docker Compose stack is provided in examples/observability/. It includes:

OpenTelemetry Collector — receives OTLP from Talon, exports to Prometheus
Prometheus — scrapes metrics from the Collector
Grafana — pre-built dashboard for all talon.* and gen_ai.* metrics

Start the stack:

cd examples/observability
docker compose up -d

Then configure Talon to export to the Collector:

export TALON_OTEL_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
talon serve --gateway

Open Grafana at http://localhost:3000 (admin/admin) to see the pre-built Talon Gateway dashboard.

See examples/observability/README.md for details.

Enabling OpenTelemetry​

What is exported​

Metrics reference​

LLM / GenAI​

Gateway​

Routing spans​

Policy engine​

PII classifier​

Evidence store​

Secrets vault​

Attachment scanner​

Agent memory​

Gateway dashboard metrics​

Operational-event projection​

Log–trace correlation​

Structured log fields​

HTTP server tracing​

Local observability stack​