Observability in Talon

Talon uses OpenTelemetry for traces and metrics, and zerolog for structured logs. Evidence records are stored in SQLite for compliance; traces and metrics are for operational observability.

Enabling OpenTelemetry

OTel is off by default. Enable it in one of these ways:

  • Flag: talon --otel run "query" or talon --otel serve
  • Environment: TALON_OTEL_ENABLED=true
  • Verbose: talon -v ... also enables OTel (for development)

Example (production-style, without turning on verbose logs):

export TALON_OTEL_ENABLED=true
talon serve

Or:

talon --otel serve

What is exported

  • Traces: Span hierarchy for agent runs, policy evaluation, LLM calls, evidence store, secrets, memory, and HTTP requests (when using talon serve). Spans include correlation_id, tenant_id, agent_id, and GenAI attributes where applicable.
  • Metrics: 25+ OTel instruments across all subsystems (see Metrics reference below). Exported via OTLP or stdout depending on configuration.
  • Logs: Structured JSON or console via zerolog. Key log lines include trace_id and span_id when OTel is enabled so logs can be correlated with traces in a backend.

The export destination is stdout by default. Use the OTLP exporter to send traces and metrics to a collector or backend (e.g. an OpenTelemetry Collector feeding Jaeger, Prometheus, or Grafana). See examples/observability for a ready-made local stack.


Metrics reference

All metrics are registered via the OpenTelemetry Go SDK. The tables below group them by subsystem. Custom metrics follow the naming convention talon.<subsystem>.<metric>, except for most agent memory metrics, which use a bare memory.* prefix. Standard LLM telemetry follows the GenAI Semantic Conventions.

LLM / GenAI

Registered by internal/llm. Emitted on every LLM call.

| Metric | Type | Unit | Attributes | Description |
|---|---|---|---|---|
| talon.cost.request | Float64Histogram | eur | agent, model, degraded | Cost in EUR per LLM request. |
| gen_ai.client.token.usage | Int64Histogram | {token} | gen_ai.system, gen_ai.request.model, gen_ai.token.type | Token usage per LLM request (input/output). GenAI SemConv. |
| gen_ai.client.operation.duration | Float64Histogram | s | gen_ai.system, gen_ai.request.model | End-to-end LLM operation duration. GenAI SemConv. |
| gen_ai.server.time_to_first_token | Float64Histogram | s | gen_ai.system, gen_ai.request.model | Time from request sent to first content token (streaming). GenAI SemConv. |
| gen_ai.server.time_per_output_token | Float64Histogram | s | gen_ai.system, gen_ai.request.model | Time per output token after first token (streaming decode phase). GenAI SemConv. |
| talon.provider.availability | Float64Gauge | 1 | provider | Provider availability (1 = up, 0 = down). |
| talon.provider.failover.total | Int64Counter | {failover} | original_model, fallback_model, reason | Provider failover events (cost degradation or unavailability). |

Gateway

Registered by internal/gateway. Emitted for every request through the LLM API gateway.

| Metric | Type | Unit | Attributes | Description |
|---|---|---|---|---|
| talon.gateway.requests.total | Int64Counter | {request} | caller, model, gen_ai.system, status | Total gateway proxy requests. |
| talon.gateway.errors.total | Int64Counter | {error} | error_type | Gateway errors by type (auth, policy, provider, timeout). |
| talon.data_tier.requests | Int64Counter | {request} | tier, caller | Requests by data classification tier (0/1/2). |
| talon.tools.governance.total | Int64Counter | {decision} | tool, action | Tool governance decisions (allow, block, filter). |
| talon.cache.hits | Int64Counter | {hit} | tenant_id | Semantic cache hits (request served from cache). |
| talon.cache.misses | Int64Counter | {miss} | tenant_id | Semantic cache misses (forwarded to LLM). |
| talon.shadow.violations.total | Int64Counter | {violation} | violation_type | Shadow mode violations (would-have-blocked in enforce mode). |
| talon.gateway.egress.decisions | Int64Counter | {decision} | tenant_id, tier, gen_ai.system, region, decision | Egress policy decisions (destination × data tier); decision is allow or deny. |
| talon.budget.utilization | Float64Gauge | % | tenant_id, period | Current budget utilization as a percentage. |
| talon.budget.alerts.total | Int64Counter | {alert} | tenant_id, threshold | Budget threshold breach alerts. |
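
Counters such as talon.cache.hits and talon.cache.misses only ever increase, so a hit rate for a time window has to be derived from the difference between two readings. The following is a minimal Python sketch of that arithmetic. It assumes cumulative counter values that you have already read from your metrics backend; the function name and the dict shape are illustrative and not part of Talon.

```python
from typing import Optional


def windowed_hit_rate(prev: dict, curr: dict) -> Optional[float]:
    """Cache hit rate between two cumulative counter readings.

    `prev` and `curr` are hypothetical dicts of the form
    {"hits": <int>, "misses": <int>} taken at two points in time.
    Returns None when no cache lookups happened in the window.
    """
    d_hits = curr["hits"] - prev["hits"]
    d_misses = curr["misses"] - prev["misses"]
    # A negative delta means the counter was reset (e.g. process restart);
    # in that case the current value is the best estimate for the window.
    if d_hits < 0 or d_misses < 0:
        d_hits, d_misses = curr["hits"], curr["misses"]
    total = d_hits + d_misses
    return d_hits / total if total else None
```

For example, going from 100 hits and 300 misses to 160 hits and 340 misses means 60 of the 100 lookups in the window were served from the cache.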

When an egress policy is configured, the gateway request span also carries talon.egress.* attributes: caller, correlation_id, data_tier, destination_provider, destination_region, decision, and reason (machine code, empty on allow).

Routing spans

The llm.route / llm.graceful_route spans (agent runs) carry talon.data.tier and, when compliance-aware routing is active, talon.routing.sovereignty_mode, talon.provider.jurisdiction, talon.provider.region, talon.routing.rejected_count, and talon.routing.selection_reason.

Policy engine

Registered by internal/policy. Emitted on every policy evaluation (OPA).

| Metric | Type | Unit | Attributes | Description |
|---|---|---|---|---|
| talon.policy.evaluations.total | Int64Counter | {evaluation} | decision, tenant_id, agent_id | Policy evaluation count by decision (allow/deny). |
| talon.policy.evaluation.duration | Float64Histogram | ms | tenant_id | Policy evaluation latency in milliseconds. |

PII classifier

Registered by internal/classifier. Emitted when PII is detected or redacted.

| Metric | Type | Unit | Attributes | Description |
|---|---|---|---|---|
| talon.pii.detections.total | Int64Counter | {detection} | pii_type, direction, action | PII entities detected (email, IBAN, phone, etc.). |
| talon.pii.redactions.total | Int64Counter | {redaction} | pii_type, direction | PII entities redacted before forwarding. |

Evidence store

Registered by internal/evidence. Emitted on every evidence write or verification.

| Metric | Type | Unit | Attributes | Description |
|---|---|---|---|---|
| talon.evidence.records.total | Int64Counter | {record} | type | Evidence records stored (LLM, tool, secret, memory). |
| talon.evidence.signature_verifications | Int64Counter | {verification} | result | HMAC signature verification attempts (success/failure). |

Secrets vault

Registered by internal/secrets. Emitted on every secret access attempt.

| Metric | Type | Unit | Attributes | Description |
|---|---|---|---|---|
| talon.secrets.access.total | Int64Counter | {access} | secret_name, agent_id, outcome | Secret access attempts (granted/denied). |

Attachment scanner

Registered by internal/attachment. Emitted when prompt injection patterns are detected.

| Metric | Type | Unit | Attributes | Description |
|---|---|---|---|---|
| talon.injection.attempts.total | Int64Counter | {attempt} | detection_type, action | Prompt injection attempts detected in attachments. |

Agent memory

Registered by internal/memory. Emitted on memory operations.

| Metric | Type | Unit | Attributes | Description |
|---|---|---|---|---|
| memory.writes.total | Int64Counter | {write} | category, tenant_id | Memory entries written. |
| memory.writes.denied | Int64Counter | {write} | reason | Memory writes denied (PII, forbidden category). |
| memory.conflicts.detected | Int64Counter | {conflict} | resolution | Memory conflicts detected during consolidation. |
| memory.reads.total | Int64Counter | {read} | source, tenant_id | Memory entries read (for injection into the prompt). |
| memory.entries.count | Int64Gauge | {entry} | tenant_id, agent_id | Current memory entry count per agent. |
| memory.dedup.skips | Int64Counter | {skip} | tenant_id | Duplicate memory writes suppressed by dedup window. |
| memory.consolidation.noops | Int64Counter | {noop} | - | Consolidation runs that found nothing to merge. |
| memory.consolidation.invalidations | Int64Counter | {invalidation} | - | Entries invalidated during consolidation. |
| memory.consolidation.updates | Int64Counter | {update} | - | Entries updated during consolidation. |
| talon.memory.poisoning.blocked | Int64Counter | {block} | reason, agent_id | Memory poisoning attempts blocked by governance. |

Gateway dashboard metrics

In addition to OTel metrics, Talon provides a real-time runtime dashboard backed by an in-memory metrics collector. The collector is fed from successful evidence.Store.Store() commits (all invocation types) and is periodically reconciled against the evidence store, which remains the authoritative source for repairing drift.

The dashboard snapshot is available at:

  • HTML dashboard: GET /gateway/dashboard — single-page HTML with auto-refreshing charts.
  • JSON API: GET /api/v1/metrics — full snapshot for programmatic access.
  • SSE stream: GET /api/v1/metrics/stream — Server-Sent Events, one snapshot every 5 seconds.
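
The SSE stream can be consumed with any client that understands Server-Sent Events framing. The following is a minimal Python sketch of a parser for that framing. It assumes each event carries one JSON snapshot in its data field, which this page does not spell out; the function name is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_sse_snapshots(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per Server-Sent Event.

    Standard SSE framing: `data:` lines accumulate, a blank line
    dispatches the event, and lines starting with `:` are comments.
    Assumption: the data payload of each event is a JSON snapshot.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line ends the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        field, _, value = line.partition(":")
        if field == "data":
            # SSE strips a single leading space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
    # An event not terminated by a blank line is incomplete and is discarded.
```

To use it against a running server, decode the HTTP response line by line (for example from urllib.request.urlopen) and pass the resulting strings to iter_sse_snapshots. The host, port, and any authentication depend on your deployment.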

The dashboard snapshot includes:

| Field | Type | Description |
|---|---|---|
| summary.total_requests | int | Total requests processed. |
| summary.blocked_requests | int | Requests denied by policy. |
| summary.pii_detections | int | PII entities found across all requests. |
| summary.pii_redactions | int | PII entities redacted. |
| summary.tools_filtered | int | Tool calls filtered by governance. |
| summary.total_cost_eur | float | Cumulative cost in EUR. |
| summary.avg_latency_ms | int | Average request latency. |
| summary.p99_latency_ms | int | P99 request latency. |
| summary.error_rate | float | Error rate (0.0–1.0). |
| summary.active_runs | int | Currently executing agent runs. |
| summary.pending_plans | int | Plans awaiting human review. |
| summary.approved_plans | int | Plans approved by reviewers. |
| summary.rejected_plans | int | Plans rejected by reviewers. |
| summary.modified_plans | int | Plans approved with modifications. |
| summary.dispatched_plans | int | Approved plans already dispatched/executed. |
| summary.plan_dispatch_errors | int | Dispatched plans that recorded execution/dispatch errors. |
| requests_timeline | array | 5-minute bucketed request counts. |
| pii_timeline | array | 5-minute bucketed PII detection counts. |
| cost_timeline | array | 5-minute bucketed cost in EUR. |
| caller_stats | array | Per-caller aggregates (requests, PII, blocked, cost, latency). |
| pii_breakdown | array | Detections broken down by PII type (email, IBAN, phone, etc.). |
| model_breakdown | array | Requests and cost broken down by LLM model. |
| tool_governance | object | Tool filtering stats (total, filtered, by risk level, anomalous agents). |
| shadow_summary | object | Shadow mode violation summary (only present in shadow mode). |
| budget_status | object | Budget utilization (daily/monthly used, limit, percentage). |
| cache_stats | object | Semantic cache performance (hits, hit rate, cost saved). |
| plan_stats | object | Plan lifecycle counters (pending/approved/rejected/modified/dispatched/failures). |
| dropped_events | int | Collector events dropped due to in-process backpressure. |
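
A few derived health indicators can be computed directly from the fields above. The following Python sketch shows one way to do that. It assumes the snapshot has already been decoded into a dict with the field names listed in the table; the function name and the returned keys are illustrative, and the sample values in the usage note are made up.

```python
def summarize_snapshot(snapshot: dict) -> dict:
    """Derive simple ratios from a decoded dashboard snapshot.

    Only uses fields documented in the snapshot table; missing
    fields fall back to zero so partial snapshots do not raise.
    """
    summary = snapshot.get("summary", {})
    total = summary.get("total_requests", 0)
    blocked = summary.get("blocked_requests", 0)
    cost = summary.get("total_cost_eur", 0.0)
    return {
        # Share of requests denied by policy.
        "blocked_ratio": blocked / total if total else 0.0,
        # Average cost in EUR per processed request.
        "cost_per_request_eur": cost / total if total else 0.0,
        # Passed through unchanged (already a 0.0-1.0 ratio).
        "error_rate": summary.get("error_rate", 0.0),
        # True when the collector dropped events under backpressure,
        # meaning live numbers may lag until reconciliation runs.
        "collector_lossy": snapshot.get("dropped_events", 0) > 0,
    }
```

With a snapshot reporting 200 total requests, 10 blocked requests, and 1.5 EUR total cost, the blocked ratio is 0.05 and the cost per request is 0.0075 EUR.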

See Gateway dashboard reference for full configuration, authentication, and API details.

Operational-event projection

Operational data is projected in one direction: Evidence → OperationalEvent → Metrics / UI / CLI. Metrics are emitted only after the evidence record has been persisted, and live increments are driven by post-commit observer notifications from the store. Collector overflow is exposed in three places: as dropped_events in the snapshot, as metrics_events_dropped in /v1/status, and as the OTel counter talon.metrics.events_dropped.total. Periodic reconciliation from evidence is bounded and idempotent; it repairs drift and does not define an alternate source of truth.

| Surface | Endpoint / source | Scope | Parity expectation |
|---|---|---|---|
| Evidence list | /v1/evidence | all evidence rows in tenant window | authoritative ordering: timestamp DESC, id DESC |
| Events API | /api/v1/events/recent, /api/v1/events/stream | terminal_plus_lifecycle_subset (evidence-backed only) | same ordering and evidence-linked fields |
| Dashboard | /dashboard recent-events table | mirrors events API rows, with compact signal chips | reflects events API without manual refresh |
| Metrics snapshot | /api/v1/metrics | all_activity collector projection | consistent summary/breakdown invariants and reconciliation health |
| Status | /v1/status | reliability contract fields | exposes metrics_events_dropped, events_stream_gaps, events_replay_misses, events_backlog_drops, reconcile status |
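
The projection pattern described in this section (notify after commit, count drops on overflow, repair by idempotent reconciliation) can be illustrated with a small model. The following Python sketch is not Talon's implementation: the class names, the bounded inbox, and the record shape are all hypothetical and exist only to show how the three properties fit together.

```python
from collections import deque


class Collector:
    """In-memory projection with a bounded inbox (hypothetical model)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.inbox = deque()
        self.applied_ids = set()
        self.total_requests = 0
        self.dropped_events = 0

    def on_commit(self, record: dict) -> None:
        """Post-commit observer: enqueue, or count a drop if the inbox is full."""
        if len(self.inbox) >= self.capacity:
            self.dropped_events += 1
            return
        self.inbox.append(record)

    def drain(self) -> None:
        """Apply all queued events to the live counters."""
        while self.inbox:
            self._apply(self.inbox.popleft())

    def _apply(self, record: dict) -> None:
        # Applying the same record twice has no effect (idempotent).
        if record["id"] in self.applied_ids:
            return
        self.applied_ids.add(record["id"])
        self.total_requests += 1

    def reconcile(self, evidence_rows: list, limit: int) -> None:
        """Bounded repair pass over the most recent `limit` evidence rows."""
        start = max(0, len(evidence_rows) - limit)
        for record in evidence_rows[start:]:
            self._apply(record)


class EvidenceStore:
    """Stand-in for the evidence store; observers run only after the write."""

    def __init__(self) -> None:
        self.rows = []
        self.observers = []

    def store(self, record: dict) -> None:
        self.rows.append(record)  # the "commit"
        for notify in self.observers:
            notify(record)  # post-commit notification
```

In this model, storing three records into a collector with capacity 2 drops one event; a later reconcile pass over the stored rows restores the count to 3, and running it again changes nothing.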

Log–trace correlation

When OTel is enabled, critical log events (e.g. agent_run_started, agent_run_completed, failed_to_generate_evidence) include trace_id and span_id. In an observability backend that ingests both logs and traces, you can jump from a log line to the corresponding trace.

Structured log fields

Logs consistently include:

  • correlation_id – unique per agent run
  • tenant_id – tenant scope
  • agent_id – agent name
  • trace_id / span_id – when OTel is enabled

Use these for filtering and correlation in your log aggregation (e.g. Elasticsearch, Loki).
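
When no log backend is at hand, the same correlation can be done on raw JSON log output. The following Python sketch collects every log entry that belongs to one trace. It assumes one JSON object per line; the function name is illustrative, and the sample lines in the usage note are made up apart from the field and event names documented on this page.

```python
import json
from typing import Iterable, List


def logs_for_trace(lines: Iterable[str], trace_id: str) -> List[dict]:
    """Return the parsed log entries whose trace_id matches."""
    matched = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip console-format or truncated lines
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            matched.append(entry)
    return matched
```

For example, given lines containing agent_run_started and agent_run_completed entries that share a trace_id, the function returns both entries in their original order, and the correlation_id on them can then be used to find related evidence records.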

HTTP server tracing

When running talon serve, the chi router uses OTel middleware. Each HTTP request (health, status, webhooks) gets a root span; agent runs triggered by webhooks appear as child spans, so you can see the full request → run → LLM/tool chain in one trace.

Local observability stack

A ready-made Docker Compose stack is provided in examples/observability/. It includes:

  • OpenTelemetry Collector — receives OTLP from Talon, exports to Prometheus
  • Prometheus — scrapes metrics from the Collector
  • Grafana — pre-built dashboard for all talon.* and gen_ai.* metrics

Start the stack:

cd examples/observability
docker compose up -d

Then configure Talon to export to the Collector:

export TALON_OTEL_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
talon serve --gateway

Open Grafana at http://localhost:3000 (admin/admin) to see the pre-built Talon Gateway dashboard.

See examples/observability/README.md for details.