What Talon Does to Your Request

A technically precise, no-marketing explanation of every step Talon's gateway performs on an HTTP request. Written for engineers who want to know exactly what happens to their traffic.

Request Lifecycle

When you send an HTTP request to POST /v1/proxy/{provider}/v1/chat/completions, Talon runs a 12-step pipeline before returning the response. The request body is forwarded to the upstream provider; Talon does not modify it in shadow mode.

Client                         Talon Gateway                    LLM Provider
  │                                  │                               │
  │  POST /v1/proxy/openai/v1/...   │                               │
  │─────────────────────────────────▶│                               │
  │                                  │  1. Route request             │
  │                                  │  2. Resolve agent (<1ms)      │
  │                                  │  3. Rate limit check (<1ms)   │
  │                                  │  4. Extract model + text      │
  │                                  │  5. PII scan input (2-5ms)    │
  │                                  │  6. Classify data tier        │
  │                                  │  7. Policy eval / OPA (1-3ms) │
  │                                  │  8. Tool governance           │
  │                                  │  9. Redact (if enforcing)     │
  │                                  │                               │
  │                                  │  10. Forward to upstream      │
  │                                  │  POST /v1/chat/completions    │
  │                                  │──────────────────────────────▶│
  │                                  │                               │
  │                                  │◀──────────────────────────────│
  │                                  │  Response                     │
  │                                  │                               │
  │                                  │  11. Response PII scan        │
  │                                  │  12. Evidence + cost tracking │
  │◀─────────────────────────────────│                               │
  │  Response (byte-identical*)      │                               │

*In shadow mode, the response body is byte-identical to what the upstream provider returned. In enforce mode with pii_action: redact, PII in the response may be replaced.

Step-by-Step Breakdown

Step 1: Route Request

Talon examines the URL path to determine which upstream provider to use. /v1/proxy/openai/v1/chat/completions routes to the OpenAI provider configured in talon.config.yaml. The provider config specifies the upstream base URL (e.g., https://api.openai.com).

Bytes read: URL path only
Bytes modified: None
Latency: <1ms (string match)
On failure: 404 if provider not configured
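As an illustration of the string match described above, here is a minimal Go sketch. The function name providerFromPath and the in-memory upstreams map are assumptions made for this example; Talon's actual router lives in internal/gateway/router.go and may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// providerFromPath splits "/v1/proxy/{provider}/..." into the provider name
// and the path to send upstream. ok is false when either part is missing.
func providerFromPath(p string) (provider, upstreamPath string, ok bool) {
	const prefix = "/v1/proxy/"
	if !strings.HasPrefix(p, prefix) {
		return "", "", false
	}
	name, tail, found := strings.Cut(strings.TrimPrefix(p, prefix), "/")
	if name == "" || !found {
		return "", "", false
	}
	return name, "/" + tail, true
}

func main() {
	// Hypothetical provider table; in Talon this comes from talon.config.yaml.
	upstreams := map[string]string{"openai": "https://api.openai.com"}

	provider, path, ok := providerFromPath("/v1/proxy/openai/v1/chat/completions")
	base, configured := upstreams[provider]
	if !ok || !configured {
		fmt.Println("404 provider not configured")
		return
	}
	fmt.Println(provider, "->", base+path)
}
```

An unknown provider segment falls through to the 404 branch, matching the failure behavior listed above.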

Step 2: Resolve Agent

Talon reads the presented agent key (Authorization: Bearer <key> or x-api-key: <key>) and matches it against the identity registry — one entry per agent.talon.yaml, each key resolved from the vault at startup. A match yields the agent identity: its name, derived tenant (key → agent → tenant_id), team, tags, and its one policy override.

There is no source-IP identification and no anonymous fallback. The only non-key path is the explicit synthetic identity injected in-process by --proxy-quickstart.

Bytes read: Authorization / x-api-key header
Bytes modified: None
Latency: <1ms (constant-time comparison per registered agent)
On failure: 401 Invalid or missing agent key — in every mode; an unknown or missing key is never forwarded, shadow mode included.
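The sketch below shows one way a constant-time registry match can look in Go. The agent struct, the helper names, and the choice to compare SHA-256 digests of the keys (so both inputs always have equal length) are assumptions of this example, not a description of Talon's internals.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"strings"
)

// agent is a hypothetical registry entry; field names are illustrative.
type agent struct {
	Name     string
	TenantID string
	keyHash  [32]byte // SHA-256 of the agent key (an assumption of this sketch)
}

func newAgent(name, tenant, key string) agent {
	return agent{Name: name, TenantID: tenant, keyHash: sha256.Sum256([]byte(key))}
}

// presentedKey prefers "Authorization: Bearer <key>" and falls back to x-api-key.
func presentedKey(authorization, xAPIKey string) string {
	if strings.HasPrefix(authorization, "Bearer ") {
		return strings.TrimPrefix(authorization, "Bearer ")
	}
	return xAPIKey
}

// resolve compares the key against every entry without exiting early,
// so timing does not reveal which entry (if any) matched.
func resolve(registry []agent, key string) (agent, bool) {
	if key == "" {
		return agent{}, false
	}
	sum := sha256.Sum256([]byte(key))
	var match agent
	found := false
	for _, a := range registry {
		if subtle.ConstantTimeCompare(sum[:], a.keyHash[:]) == 1 {
			match, found = a, true
		}
	}
	return match, found
}

func main() {
	registry := []agent{newAgent("support-bot", "acme", "key-123")}
	if a, ok := resolve(registry, presentedKey("Bearer key-123", "")); ok {
		fmt.Println("resolved", a.Name, "tenant", a.TenantID)
	}
	if _, ok := resolve(registry, presentedKey("", "")); !ok {
		fmt.Println("401 Invalid or missing agent key")
	}
}
```

There is deliberately no fallback branch: a missing or unknown key yields no identity, mirroring the "no anonymous fallback" rule above.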

Step 3: Rate Limit Check

A token-bucket rate limiter checks both global and per-agent request rates. Configured via rate_limits.global_requests_per_min and rate_limits.per_agent_requests_per_min.

Bytes read: None (uses the agent identity from step 2)
Bytes modified: None
Latency: <1ms
On failure: 429 Too Many Requests (with Retry-After header)

Step 4: Extract Model and Text

The JSON request body is parsed to extract the model name, message content, and tool call names. This is provider-aware: OpenAI uses messages[].content, Anthropic uses a different structure.

Bytes read: Full request body (JSON parse)
Bytes modified: None (the parsed body is used for scanning, the original bytes are forwarded)
Latency: 1-2ms (JSON unmarshal)
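A minimal Go sketch of the OpenAI-shaped extraction follows. The struct and function names are assumptions for illustration, and only plain-string message content is handled; content expressed as an array of parts, and Anthropic's request shape, are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatRequest covers only the fields the pipeline inspects.
type chatRequest struct {
	Model    string `json:"model"`
	Messages []struct {
		Content json.RawMessage `json:"content"`
	} `json:"messages"`
	Tools []struct {
		Function struct {
			Name string `json:"name"`
		} `json:"function"`
	} `json:"tools"`
}

// extract parses the body into a separate struct; raw itself is never
// modified, so the original bytes can still be forwarded as received.
func extract(raw []byte) (model string, texts, tools []string, err error) {
	var req chatRequest
	if err = json.Unmarshal(raw, &req); err != nil {
		return "", nil, nil, err
	}
	for _, m := range req.Messages {
		var s string
		// Only plain-string content is handled; other shapes are skipped.
		if json.Unmarshal(m.Content, &s) == nil && s != "" {
			texts = append(texts, s)
		}
	}
	for _, t := range req.Tools {
		if t.Function.Name != "" {
			tools = append(tools, t.Function.Name)
		}
	}
	return req.Model, texts, tools, nil
}

func main() {
	body := []byte(`{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hello jan@example.com"}]}`)
	model, texts, tools, err := extract(body)
	if err != nil {
		fmt.Println("bad request body:", err)
		return
	}
	fmt.Println(model, texts, len(tools))
}
```

Keeping the parsed view separate from the raw bytes is what makes the "original bytes are forwarded" guarantee straightforward.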

Step 5: PII Scan (Input)

The extracted text content is scanned for PII using regex-based recognizers (email, phone, IBAN, credit card, VAT IDs, national IDs across 27 EU member states). Each match returns a type, sensitivity level (1-3), and byte offset.

Bytes read: Extracted message content
Bytes modified: None at this stage
Latency: 2-5ms (regex matching over message text)
Evidence recorded: PII types found, count, sensitivity tiers

Validation: IBAN (MOD-97 + country-specific length), credit cards (Luhn algorithm), Dutch BSN (11-test), Polish PESEL (check digit).
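To make the checksum validation concrete, here is a Go sketch of the two best-known checks: the Luhn algorithm for card numbers and the MOD-97 check for IBANs. The function names and the minimum-digit threshold are assumptions of this example, and the country-specific IBAN length table is omitted (only the generic 15 to 34 character bound is applied).

```go
package main

import (
	"fmt"
	"strings"
)

// luhnValid walks the digits right to left, doubling every second digit.
func luhnValid(number string) bool {
	sum, digits, double := 0, 0, false
	for i := len(number) - 1; i >= 0; i-- {
		c := number[i]
		if c == ' ' || c == '-' {
			continue
		}
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		digits++
		double = !double
	}
	// The 12-digit floor is an assumption to avoid matching short numbers.
	return digits >= 12 && sum%10 == 0
}

// ibanValid moves the first four characters to the end, maps A-Z to 10-35,
// and checks that the resulting number is congruent to 1 modulo 97.
func ibanValid(iban string) bool {
	s := strings.ToUpper(strings.ReplaceAll(iban, " ", ""))
	if len(s) < 15 || len(s) > 34 {
		return false
	}
	rem := 0
	for _, r := range s[4:] + s[:4] {
		switch {
		case r >= '0' && r <= '9':
			rem = (rem*10 + int(r-'0')) % 97
		case r >= 'A' && r <= 'Z':
			rem = (rem*100 + int(r-'A') + 10) % 97
		default:
			return false
		}
	}
	return rem == 1
}

func main() {
	fmt.Println(luhnValid("4111 1111 1111 1111"))
	fmt.Println(ibanValid("GB82 WEST 1234 5698 7654 32"))
}
```

Running checksum validation after the regex match is what keeps false positives down: a 16-digit order number that fails Luhn is not reported as a credit card.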

Step 6: Classify Data Tier

Based on PII findings, the request is classified into a data tier (0 = public, 1 = internal, 2 = confidential). The highest-sensitivity PII finding determines the tier.

Bytes read: PII scan results
Bytes modified: None
Latency: <1ms (max over sensitivity scores)

Step 7: Policy Evaluation (OPA)

The policy engine (embedded OPA/Rego, no sidecar) evaluates the request against the agent's effective policy — organization baseline → the agent's one override → provider destination constraints, resolved by a single shared function (ResolveEffectivePolicy). Inputs include: model name, data tier, estimated cost, daily cost accumulator, allowed models list, destination provider and region, and the resolved egress rules (Rego inputs use agent_* names).

Checks performed:

Is the requested model in the agent's effective allowlist?
Does the estimated cost exceed the effective per-request/daily/monthly limits?
Does the data tier exceed the model's allowed tier?
Is the agent authorized for this provider (policies.allowed_providers)?
May this data tier egress to this destination (provider/region), per the configured egress rules? Denials carry the egress_tier_destination_disallowed or egress_destination_disallowed machine code, and the outcome is recorded in the egress_decision evidence section. Because this runs before Step 10, no request bytes leave Talon on an egress denial.
Bytes read: Extracted metadata (model, tier, cost estimate)
Bytes modified: None
Latency: 1-3ms (compiled Rego evaluation, no I/O)
On denial (enforce mode): Returns a provider-native error response (e.g., OpenAI-format JSON with appropriate HTTP status)
On denial (shadow mode): Logs the denial but forwards the request anyway

Step 8: Tool Governance

If the request includes function/tool calls, Talon checks them against the effective allowed/forbidden tool lists (baseline ∪ provider ∪ agent for forbidden; most-specific list for allowed). Tools matching forbidden_tools patterns (including glob patterns like admin_*) are filtered out.

Bytes read: Tool/function names from the parsed request
Bytes modified: In enforce mode, forbidden tools may be stripped from the request body before forwarding
Latency: <1ms
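The forbidden-list side of this check can be sketched in Go with the standard library's glob matcher. The function names are hypothetical, and using path.Match for patterns such as admin_* is an assumption; Talon's matcher in internal/gateway/tool_filter.go may use different glob semantics.

```go
package main

import (
	"fmt"
	"path"
)

// isForbidden reports whether name matches any glob pattern.
// Malformed patterns are ignored rather than treated as matches.
func isForbidden(name string, patterns []string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

// filterTools applies the union of all forbidden lists
// (for example baseline, provider, and agent) to the requested tools.
func filterTools(requested []string, forbiddenLists ...[]string) (kept, stripped []string) {
	var patterns []string
	for _, list := range forbiddenLists {
		patterns = append(patterns, list...)
	}
	for _, name := range requested {
		if isForbidden(name, patterns) {
			stripped = append(stripped, name)
		} else {
			kept = append(kept, name)
		}
	}
	return kept, stripped
}

func main() {
	baseline := []string{"admin_*"}
	agent := []string{"export_all"}
	kept, stripped := filterTools(
		[]string{"search", "admin_delete_user", "export_all"},
		baseline, nil, agent,
	)
	fmt.Println("kept:", kept, "stripped:", stripped)
}
```

Because forbidden lists are unioned, a tool banned at any level stays banned; an agent override cannot re-enable it.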

Step 9: Redact (Enforce Mode Only)

If the policy action is redact, PII found in step 5 is replaced in the request body before forwarding. Replacement preserves JSON structure. In shadow mode this step is skipped entirely.

Bytes read: Original request body + PII locations
Bytes modified: PII tokens replaced with [REDACTED:<type>]
Latency: <1ms
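Offset-based replacement is easy to get wrong, so here is a small Go sketch of the idea. The finding struct, the lowercase type label, and the assumption that findings do not overlap are all illustrative; the sketch operates on an extracted message string, not on Talon's actual request body handling.

```go
package main

import (
	"fmt"
	"sort"
)

// finding is a hypothetical PII match with byte offsets into the text.
type finding struct {
	Type       string
	Start, End int // half-open range [Start, End)
}

// redact replaces each finding with [REDACTED:<type>]. It works from the
// last finding to the first so earlier offsets stay valid after each edit.
// Findings are assumed not to overlap.
func redact(text string, findings []finding) string {
	sorted := append([]finding(nil), findings...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start > sorted[j].Start })
	for _, f := range sorted {
		if f.Start < 0 || f.End > len(text) || f.Start >= f.End {
			continue // skip out-of-range findings instead of panicking
		}
		text = text[:f.Start] + "[REDACTED:" + f.Type + "]" + text[f.End:]
	}
	return text
}

func main() {
	msg := "hello jan@example.com"
	fmt.Println(redact(msg, []finding{{Type: "email", Start: 6, End: 21}}))
}
```

Redacting the decoded string and re-encoding it with a JSON encoder, rather than editing raw bytes, is one straightforward way to keep the JSON structure intact.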

Step 10: Forward to Upstream

The request is forwarded to the upstream provider URL. Talon creates a new HTTP connection to the provider (it does not pass through the client's TLS session).

Non-streaming: The full response body is read, token usage is extracted from the JSON usage field, and the response is written to the client.

Streaming (SSE): Talon detects text/event-stream in the response Content-Type and enters streaming mode. SSE chunks are forwarded as received using a bufio.Scanner with a 512KB buffer. Each chunk is flushed immediately. Token usage is extracted incrementally from data: lines. Time to first token equals that of a direct connection to the provider plus the pipeline overhead on the initial request (typically under 15ms).

Headers forwarded to upstream: Content-Type, Authorization (replaced with the real provider API key from the secrets vault). Headers forwarded to client: Content-Type, X-Request-Id, rate-limit headers.

Latency: Network RTT to provider (pass-through, no additional buffering for streaming responses)

Step 11: Response PII Scan

For non-streaming responses, the LLM-generated content is extracted from the response JSON (e.g., choices[].message.content for OpenAI) and scanned for PII using the same recognizers as step 5.

For streaming responses, content is accumulated from SSE delta chunks and scanned after the stream completes.

Actions on PII detection in response (configurable):

allow: log only
warn: log with elevated severity
redact: rewrite response with PII replaced (non-streaming: JSON rewrite; streaming: buffer, redact, re-emit as SSE)
block: return 451 Unavailable For Legal Reasons
Bytes read: Response body content
Bytes modified: Only if pii_action: redact or block
Latency: 2-5ms (non-streaming); streaming scan happens after final chunk

Step 12: Evidence Generation and Cost Tracking

An evidence record is created and signed with HMAC-SHA256. The record includes:

Field	Source
`id`	Generated (`req_` + UUID prefix)
`correlation_id`	From `X-Request-Id` or generated
`timestamp`	`time.Now()`
`tenant_id`	Derived from the agent (`key → agent → tenant_id`)
`agent_id`	The resolved agent's name
`request_source_id`	The resolved agent's name
`policy_decision`	Allow/deny + reasons from step 7
`classification.input_tier`	Data tier from step 6
`classification.pii_detected`	PII types from step 5
`classification.output_pii_detected`	PII types from step 11
`execution.model_used`	Model from response
`execution.cost`	Calculated from token usage
`execution.tokens`	Input + output token counts
`execution.duration_ms`	End-to-end latency
`audit_trail.input_hash`	SHA-256 of request content
`audit_trail.output_hash`	SHA-256 of response content
`signature`	HMAC-SHA256 over all other fields

The record is written to SQLite asynchronously (1-2ms). Cost is added to the agent's daily/monthly accumulator (in-memory counter, periodically flushed).
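The signing step can be sketched in Go as follows. The record struct here carries only a few of the fields from the table, and serializing with encoding/json (with the signature field cleared) as the canonical form is an assumption of this example; Talon's actual canonicalization in internal/evidence/store.go may differ.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// record is a trimmed, hypothetical evidence record.
type record struct {
	ID        string `json:"id"`
	AgentID   string `json:"agent_id"`
	InputHash string `json:"input_hash"`
	Signature string `json:"signature,omitempty"`
}

// sign computes HMAC-SHA256 over the record with its signature cleared.
// rec is passed by value, so the caller's record is not modified.
func sign(key []byte, rec record) (string, error) {
	rec.Signature = ""
	payload, err := json.Marshal(rec)
	if err != nil {
		return "", err
	}
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	return hex.EncodeToString(mac.Sum(nil)), nil
}

// verify recomputes the signature and compares it in constant time.
func verify(key []byte, rec record) bool {
	want, err := sign(key, rec)
	if err != nil {
		return false
	}
	return hmac.Equal([]byte(want), []byte(rec.Signature))
}

func main() {
	key := []byte("demo-signing-key") // placeholder; use a vault-managed key
	digest := sha256.Sum256([]byte("hello jan@example.com"))
	rec := record{ID: "req_demo", AgentID: "support-bot", InputHash: hex.EncodeToString(digest[:])}

	sig, err := sign(key, rec)
	if err != nil {
		fmt.Println("sign failed:", err)
		return
	}
	rec.Signature = sig
	fmt.Println("verified:", verify(key, rec))
}
```

Note that the record stores a hash of the request content, not the content itself, which is consistent with prompts not being stored by default.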

Latency Budget

Step	Operation	Typical Latency	Notes
1	Route request	<1ms	String match on URL path
2	Resolve agent	<1ms	Constant-time registry match
3	Rate limit check	<1ms	Token bucket
4	Extract model + text	1-2ms	JSON unmarshal
5	PII scan (input)	2-5ms	Regex over message content
6	Classify data tier	<1ms	Max over sensitivity scores
7	Policy evaluation	1-3ms	Compiled Rego, no I/O
8	Tool governance	<1ms	List matching
9	Redact (enforce only)	<1ms	String replacement
10	Forward	Network RTT	No buffering for streams
11	Response PII scan	2-5ms	Non-streaming; streamed responses are scanned after the final chunk
12	Evidence + cost	1-2ms	Async SQLite write + HMAC
Total overhead		<15ms	Excluding network RTT

Throughput And Benchmarking

Micro-benchmarks (reproducible from a clean checkout): run make benchmarks or see Reproducible benchmarks for gateway pipeline overhead, PII scan latency, and evidence write throughput on your hardware.

End-to-end load (optional): use this harness when you need concurrent requests through a running gateway. Throughput depends on message size, PII pattern density, and upstream provider latency.

# 1) Start local proof environment
cd examples/docker-compose
docker compose up -d

# 2) Warm-up
bash ../../scripts/demo-recorder.sh

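# 3) Measure request latency/throughput (example with hey)
# Unless the gateway runs with --proxy-quickstart, add an agent key header
# (-H "Authorization: Bearer <key>"); without it every request returns 401 (see Step 2).
hey -n 200 -c 20 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hello jan@example.com"}]}' \
  http://localhost:8080/v1/proxy/openai/v1/chat/completions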

Overhead contributors, in order:

PII scanning complexity and input length
Policy evaluation breadth (number of active checks)
Response-path handling (especially redact/block modes)
Evidence write path and storage backend
Upstream provider/network latency (usually dominant)

What Talon Does NOT Do

Does not modify request bodies in shadow mode. The upstream provider receives exactly what your client sent. PII is scanned and logged but not altered.
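Does not buffer streaming responses by default. SSE chunks are forwarded to the client as they arrive from the provider, with no full-response buffering. The one exception is a response pii_action of redact, which buffers the stream so it can be redacted and re-emitted (see Step 11).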
Does not decrypt TLS to the upstream. Talon terminates the client's HTTP connection and creates a new HTTPS connection to the provider. It does not act as a TLS-intercepting proxy.
Does not store prompt or response content by default. Evidence records contain metadata (model, cost, PII types, hashes) but not the actual text. Content logging is opt-in via log_prompts: true / log_responses: true.
Does not phone home. Talon sends no telemetry, analytics, or usage data to Dativo or anywhere else. OpenTelemetry export is configured by you and points where you choose.
Does not require an internet connection for policy evaluation. OPA is embedded in the binary. Policies are evaluated locally.

Threat Boundaries

Talon is a control-plane enforcement layer, not a full security perimeter.

In scope: pre-execution policy checks, request/response governance, signed evidence chain.
Out of scope: endpoint compromise, stolen deployment credentials, provider-side data handling controls.
Operator controls required: secure key management, host hardening, least-privilege API keys, retention/access policy for evidence storage.

Streaming Behavior

SSE (Server-Sent Events) streaming works as follows:

Client sends request with "stream": true
Talon runs steps 1-9 (same as non-streaming)
Talon forwards the request to the upstream provider
Provider responds with Content-Type: text/event-stream
Each SSE chunk (data: {...}\n\n) is forwarded to the client immediately after Talon receives it, with an http.Flusher.Flush() call
Token usage is extracted from data: lines as they arrive (OpenAI includes usage in the final chunk; Anthropic uses message_start/message_delta)
After the stream completes (data: [DONE]), response PII scanning runs on the accumulated content
Evidence is generated with the full token counts

The client sees the first token at the same latency as a direct connection to the provider, plus the pipeline overhead on the initial request (typically under 15ms).

Source Code

The gateway pipeline implementation lives in these files:

File	Responsibility
`internal/gateway/gateway.go`	Main `ServeHTTP` handler (12-step pipeline)
`internal/gateway/router.go`	Provider routing from URL path
`internal/gateway/identity.go`	Agent identity registry (key → agent)
`internal/gateway/resolve.go`	Agent key resolution per request
`internal/gateway/effective.go`	Effective policy (baseline → override → provider constraints)
`internal/gateway/forward.go`	HTTP forwarding + SSE streaming
`internal/gateway/response_pii.go`	Response PII scanning
`internal/gateway/tool_filter.go`	Tool governance / filtering
`internal/gateway/ratelimit.go`	Token-bucket rate limiting
`internal/gateway/attachment.go`	Attachment extraction + injection scanning
`internal/classifier/patterns.go`	PII regex recognizers (EU pattern set)
`internal/classifier/pii.go`	PII analysis + redaction engine
`internal/evidence/store.go`	Evidence storage + HMAC signing
`internal/evidence/generator.go`	Evidence record creation
