Agent Observability (OTel Tracing)¶
Structured execution traces for every agent run, stored in PostgreSQL and visualized via a custom Django admin UI. The same spans also flow to Sentry for ops-style cross-service tracing — see Observability Stack for how this doc fits into the broader telemetry architecture.
Architecture¶
The system uses OpenTelemetry (OTel) Python SDK to instrument the agent's reasoning loop. Spans flow through a single TracerProvider to two consumers: the in-app PostgreSQL exporter (powering the admin drill-down UI) and Sentry's SentrySpanProcessor (for cross-service tracing). The OTel trace_id is the universal correlation key — it appears as trace_id on every log line and as Sentry's trace field on every event.
Agent reasoning loop (agent_core.py)
|
| OTel spans created at each step
v
BatchSpanProcessor (background thread)
|
v
PostgresSpanExporter
|
v
OTelTrace / OTelSpan tables (PostgreSQL)
|
v
Agent Trace Viewer (Django admin UI)
Span Hierarchy¶
Each agent run produces a single trace with this span tree. Names and sentry.op values follow the OpenTelemetry GenAI semantic conventions so Sentry's AI Agent UI recognizes them out of the box.
GET /get_chatbot_response/ # ASGI request (OpenTelemetryMiddleware)
|-- invoke_agent campuscore_agent # 1 per chat turn (op = gen_ai.invoke_agent)
| |-- iteration_1
| | |-- chat <model> # LLM call (op = gen_ai.chat)
| | | |-- POST # httpx → anthropic/openai (auto)
| | |-- execute_tool <name> # tool call (op = gen_ai.execute_tool)
| | | |-- SELECT documentchunk # psycopg → pgvector (auto)
| |-- iteration_2
| | |-- chat <model>
| | |-- token_management # internal helper
| |-- iteration_N ...
Legacy names (agent_run, llm_call, tool_call.<name>) persist in the database for traces written before the Phase-3 GenAI rename. The admin's _classify_span and root-span lookup handle both naming styles; no migration is required for historical data.
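The dual-naming rule above can be sketched as a small classifier. This is a minimal illustration, not the actual _classify_span: the function name classify_span_name and the returned kind labels are assumptions, and only the span names come from this page.

```python
def classify_span_name(name: str) -> str:
    """Map a span name (GenAI style or legacy style) to a coarse kind.

    Hypothetical helper: the kind labels are illustrative, not the admin's own.
    """
    if name.startswith("invoke_agent") or name == "agent_run":
        return "agent"
    if name.startswith("chat ") or name == "llm_call":
        return "llm"
    if name.startswith("execute_tool ") or name.startswith("tool_call."):
        return "tool"
    if name.startswith("iteration_"):
        return "iteration"
    return "other"  # e.g. token_management or auto-instrumented spans


kinds = [
    classify_span_name(n)
    for n in (
        "invoke_agent campuscore_agent",   # GenAI name
        "agent_run",                       # legacy name
        "chat some-model",
        "llm_call",
        "execute_tool search_knowledge",
        "tool_call.search_knowledge",
        "iteration_2",
        "token_management",
    )
]
```

Matching on prefixes rather than exact names keeps the model and tool suffixes free to vary without touching the classifier.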
Key Files¶
| File | Purpose |
|---|---|
| services/observability/__init__.py | Package exports: init_tracing, shutdown_tracing, get_tracer |
| services/observability/tracer.py | TracerProvider setup; registers AgentRunTaggerProcessor, PostgresSpanExporter, and (when DSN set) SentrySpanProcessor + SentryPropagator |
| services/observability/attributes.py | OTel GenAI semantic attribute constants (gen_ai.*) + CampusCore extensions (campuscore.*); legacy agent.* / llm.* / tool.* aliases retained for back-compat |
| services/observability/agent_run_tagger.py | Contextvar + on_start span processor that marks every span created inside AgentCore.run; used by the exporter to filter agent-only traces |
| services/observability/instrumentation.py | OTel auto-instrumentors for psycopg, httpx, and stdlib logging; pytest-gated to avoid feedback loops |
| services/observability/exporter.py | PostgresSpanExporter: filters to agent-tagged spans, groups by trace_id, bulk-creates DB rows |
| campus_core/asgi.py | Wraps the ASGI app with OpenTelemetryMiddleware so the request span is created in the async context (NOT via DjangoInstrumentor) |
| campus_core/observability/sentry_setup.py | sentry_sdk.init(instrumenter="otel", enable_logs=True, ...) + FERPA scrubber wiring |
| models.py (OTelTrace, OTelSpan) | PostgreSQL storage for traces and spans |
| apis/admin_agent_trace_apis.py | Django views for the trace viewer UI; _classify_span + _LEGACY_ATTRIBUTE_ALIASES translate GenAI keys into the template's legacy paths |
| templates/admin/agent_traces.html | Main trace viewer page |
| templates/admin/partials/admin_at_* | HTMX partials for trace list, detail, span detail |
Models¶
OTelTrace¶
One row per agent run. Denormalized summary fields for efficient listing/filtering:
- id (UUID) — OTel trace_id
- conversation (FK, nullable) — link to conversation
- user_message, ai_message (FK, nullable) — linked messages
- query, status, model
- total_duration_ms, total_input_tokens, total_output_tokens
- iteration_count, tool_call_count, tools_used (JSONField)
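To make the denormalized fields concrete, here is a rough sketch of how such a summary could be folded out of a run's spans. It is an assumption about the shape of the computation, not the exporter's actual code: summarize_trace and the dict-based span rows are hypothetical, while the attribute keys are the ones listed under Instrumentation Points.

```python
from typing import Any


def summarize_trace(spans: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold span rows into OTelTrace-style summary fields (illustrative only)."""
    summary: dict[str, Any] = {
        "total_input_tokens": 0,
        "total_output_tokens": 0,
        "iteration_count": 0,
        "tool_call_count": 0,
        "tools_used": [],
    }
    for span in spans:
        name = span["name"]
        attrs = span.get("attributes", {})
        if name.startswith("chat "):
            # Token usage is recorded on each LLM-call span.
            summary["total_input_tokens"] += attrs.get("gen_ai.usage.input_tokens", 0)
            summary["total_output_tokens"] += attrs.get("gen_ai.usage.output_tokens", 0)
        elif name.startswith("execute_tool "):
            summary["tool_call_count"] += 1
            tool = attrs.get("gen_ai.tool.name")
            if tool and tool not in summary["tools_used"]:
                summary["tools_used"].append(tool)
        elif name.startswith("iteration_"):
            summary["iteration_count"] += 1
    return summary


example_spans = [
    {"name": "invoke_agent campuscore_agent", "attributes": {}},
    {"name": "iteration_1", "attributes": {}},
    {"name": "chat some-model",
     "attributes": {"gen_ai.usage.input_tokens": 120, "gen_ai.usage.output_tokens": 30}},
    {"name": "execute_tool search_knowledge",
     "attributes": {"gen_ai.tool.name": "search_knowledge"}},
    {"name": "iteration_2", "attributes": {}},
    {"name": "chat some-model",
     "attributes": {"gen_ai.usage.input_tokens": 200, "gen_ai.usage.output_tokens": 50}},
]
summary = summarize_trace(example_spans)
```

Storing these totals on the trace row is what lets the list view filter and sort without joining against every span.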
OTelSpan¶
One row per span in the trace. Standard OTel shape:
- id (CharField) — OTel span_id hex
- trace (FK to OTelTrace)
- parent_span_id — for tree reconstruction
- name: e.g., invoke_agent campuscore_agent, iteration_1, execute_tool search_knowledge (legacy rows use agent_run, llm_call, tool_call.search_knowledge)
- attributes (JSONField), events (JSONField)
- started_at, ended_at, duration_ms
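Tree reconstruction from parent_span_id can be sketched as follows. This is a minimal illustration over plain dicts rather than the real model rows; build_span_tree and render are hypothetical names. It treats a span whose parent is not among the stored rows as a root, on the assumption that a parent (such as the ASGI request span) may never have been persisted.

```python
from collections import defaultdict


def build_span_tree(rows):
    """Return (roots, children), where children maps span id -> child rows."""
    known_ids = {row["id"] for row in rows}
    children = defaultdict(list)
    roots = []
    # Sorting by start time first keeps siblings in chronological order.
    for row in sorted(rows, key=lambda r: r["started_at"]):
        parent = row.get("parent_span_id")
        if parent in known_ids:
            children[parent].append(row)
        else:
            roots.append(row)  # no parent, or parent was not stored
    return roots, dict(children)


def render(nodes, children, depth=0):
    """Flatten the tree into indented span names, depth-first."""
    lines = []
    for row in nodes:
        lines.append("  " * depth + row["name"])
        lines.extend(render(children.get(row["id"], []), children, depth + 1))
    return lines


rows = [
    {"id": "c1", "parent_span_id": "b1", "name": "execute_tool search_knowledge", "started_at": 30},
    {"id": "a1", "parent_span_id": "req0", "name": "invoke_agent campuscore_agent", "started_at": 0},
    {"id": "c0", "parent_span_id": "b1", "name": "chat some-model", "started_at": 10},
    {"id": "b1", "parent_span_id": "a1", "name": "iteration_1", "started_at": 5},
]
roots, children = build_span_tree(rows)
tree_lines = render(roots, children)
```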
Instrumentation Points¶
Manual spans created by AgentCore.run() (GenAI conventions):
| Location | Span Name | sentry.op | Key Attributes |
|---|---|---|---|
| agent_core.run() | invoke_agent <name> | gen_ai.invoke_agent | gen_ai.prompt, gen_ai.conversation.id, gen_ai.request.model, campuscore.agent.status, token totals |
| Reasoning loop | iteration_{n} | (none) | campuscore.iteration.number, tool call count |
| LLM call | chat <model> | gen_ai.chat | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons |
| Tool execution | execute_tool <name> | gen_ai.execute_tool | gen_ai.tool.name, gen_ai.tool.input, gen_ai.tool.output, campuscore.tool.success |
| Knowledge routing | knowledge_routing | (none) | campuscore.routing.folder_count, folders |
| Token management | token_management | (none) | campuscore.token_management.strategy, message counts |
Auto-instrumented spans created during an agent run (inherit parent context from the surrounding manual span):
| Source | Span Name | Where it appears |
|---|---|---|
| PsycopgInstrumentor | SELECT main_app_documentchunk (etc.) | Children of execute_tool (search) and chat (token counting) |
| HTTPXClientInstrumentor | POST, GET | Children of chat (anthropic/openai API calls) and execute_tool (e.g., Cohere reranker) |
| OpenTelemetryMiddleware | GET /get_chatbot_response/ | Root request span (parent of invoke_agent) |
The auto-instrumented spans are kept inside agent traces because they're the actual network and DB calls the agent made — exactly what's useful for offline analysis ("why was iteration 2 slow?" → child span shows the LLM API took 8 seconds).
Admin UI¶
Accessible at /admin/agent-traces/ (staff only). Features:
- Filterable trace list: status, model, min duration, tool used, date range
- Waterfall timeline: color-coded spans proportional to duration
- Span detail panel: attributes table, events, formatted JSON
- Cross-links: conversation viewer <-> trace viewer
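As an illustration of the waterfall layout, the position and width of each bar can be expressed as percentages of the total trace duration. This is a sketch under assumptions, not the admin template's logic: waterfall_bars, offset_ms, left_pct, and width_pct are hypothetical names, and offsets are taken to be milliseconds from the start of the trace.

```python
def waterfall_bars(spans, trace_duration_ms):
    """Compute left offset and width (as percentages) for each span's bar."""
    total = max(trace_duration_ms, 1)  # guard against zero-length traces
    bars = []
    for span in spans:
        bars.append({
            "name": span["name"],
            "left_pct": round(span["offset_ms"] / total * 100, 1),
            "width_pct": round(span["duration_ms"] / total * 100, 1),
        })
    return bars


bars = waterfall_bars(
    [
        {"name": "invoke_agent campuscore_agent", "offset_ms": 0, "duration_ms": 1000},
        {"name": "chat some-model", "offset_ms": 250, "duration_ms": 500},
    ],
    trace_duration_ms=1000,
)
```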
Initialization¶
init_tracing() is called from MainAppConfig.ready() in apps.py. The BatchSpanProcessor runs on a background thread, flushing spans every 5 seconds.
Async Generator Spans¶
The agent's run() method is an async generator, so context managers (with start_as_current_span()) can't be used across yields. Instead, spans are manually managed:
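from opentelemetry import trace

tracer = trace.get_tracer(__name__)
# start_span() does not make the span current, so it can safely outlive each yield.
span = tracer.start_span("invoke_agent campuscore_agent")
ctx = trace.set_span_in_context(span)  # explicit parent context for child spans
# ... yield events ...
span.end()  # must be called explicitly; a try/finally around the loop avoids leaking the span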
Child spans are attached to this parent by passing ctx through the context= parameter of start_span().
OTel as the single source of truth (Sentry bridge)¶
OTel is the only system that creates spans for CampusCore. Sentry consumes them via SentrySpanProcessor, so the OTel trace_id is what populates Sentry's trace field — the two never diverge.
ASGI request → OpenTelemetryMiddleware (asgi.py)
│ opens root request span
▼
Django middleware chain → view → ChatService.generate_response
│
▼
AgentCore.run
│ starts "invoke_agent" child span
│ starts "chat" / "execute_tool" grand-children
▼
all spans end → BatchSpanProcessors:
├─ PostgresSpanExporter → OTelTrace/OTelSpan rows
└─ SentrySpanProcessor → Sentry transactions/spans
(uses the same trace_id)
Key wiring:
- campus_core/asgi.py wraps the inner ASGI app with OpenTelemetryMiddleware from opentelemetry-instrumentation-asgi. The Django ASGI handler then runs inside that span, so trace.get_current_span() works correctly in every downstream coroutine, including async middleware, streaming generators, and the agent's async loop. We deliberately avoid DjangoInstrumentor() because its sync-style process_request/process_response middleware loses OTel context across the sync_to_async boundary under uvicorn/ASGI.
- campus_core/observability/sentry_setup.py initializes Sentry with instrumenter="otel", telling it not to auto-create transactions. DjangoIntegration is retained but with all tracing knobs disabled (middleware_spans=False, signals_spans=False, cache_spans=False); it is kept only for the error-capture path (request body, user attribution).
- apps/main_app/services/observability/tracer.py attaches both PostgresSpanExporter (for the admin UI) and SentrySpanProcessor (for Sentry) to the same TracerProvider. It also installs SentryPropagator as the global OTel text-map so outbound HTTP requests carry sentry-trace + baggage headers, joining downstream services into the same trace.
- apps/main_app/services/observability/instrumentation.py runs PsycopgInstrumentor, HTTPXClientInstrumentor, and LoggingInstrumentor in AppConfig.ready(). It's gated off during pytest (set OTEL_INSTRUMENT_IN_TESTS=1 to override) because PsycopgInstrumentor wraps every Django ORM cursor, including the ones PostgresSpanExporter uses, creating a write-amplification feedback loop.
Log correlation¶
Every JSON log line carries trace_id, span_id, request_id, and user_id. The values come from campus_core/observability/logging_filters.py::RequestIdFilter, which reads the current OTel span via trace.get_current_span(). Because OpenTelemetryMiddleware opens the span at the outermost ASGI layer, every log emitted during a request sees the same trace_id — and that value matches Sentry's trace field for the same request.
CloudWatch Logs Insights query example:
fields @timestamp, level, logger, message, trace_id, request_id, user_id
| filter trace_id = "abc123..."
| sort @timestamp asc
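For orientation, a logging filter of the kind described above might look roughly like this. It is a stdlib-only sketch, not the real RequestIdFilter: TraceContextFilter is a hypothetical name, and the _current_ids contextvar is a stand-in for the lookup the real filter does through trace.get_current_span(). The formatting widths follow OTel's convention of a 128-bit trace_id (32 hex characters) and a 64-bit span_id (16 hex characters).

```python
import contextvars
import logging

# Stand-in for the OTel current span: (trace_id, span_id) as integers, or None.
_current_ids = contextvars.ContextVar("current_ids", default=None)


class TraceContextFilter(logging.Filter):
    """Stamp every log record with trace_id and span_id as lowercase hex."""

    def filter(self, record: logging.LogRecord) -> bool:
        ids = _current_ids.get()
        if ids is None:
            record.trace_id = ""
            record.span_id = ""
        else:
            trace_id, span_id = ids
            record.trace_id = format(trace_id, "032x")
            record.span_id = format(span_id, "016x")
        return True  # never drop the record, only annotate it


_current_ids.set((0x1234, 0xAB))
record = logging.LogRecord(
    "app", logging.INFO, "example.py", 1, "request completed", None, None
)
TraceContextFilter().filter(record)
```

Because the filter only annotates records, it can be attached to the root handler without affecting which messages are emitted.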
Why no DjangoInstrumentor¶
opentelemetry-instrumentation-django adds itself as a sync-style Django middleware (process_request/process_response). Django wraps sync middleware via asgiref.sync.sync_to_async when serving an async view, which runs the middleware in a thread with a copied contextvars context. Mutations to the OTel current-span contextvar inside that thread don't propagate back to the outer async context, so downstream async code (including streaming generators) sees no active span. We diagnosed this empirically: trace= was empty on every "request completed" log even though DjangoInstrumentor reported is_instrumented_by_opentelemetry == True. The fix was switching to OpenTelemetryMiddleware (a true ASGI middleware) wrapping the ASGI app itself.
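The context-copy behavior described above can be reproduced with the standard library alone. The sketch below is an analogy, not the Django or asgiref code path: it uses asyncio.to_thread (which runs the callable in a copy of the caller's context) as a stand-in for the sync-middleware hop, and a plain ContextVar named current_span (hypothetical) as a stand-in for OTel's current-span variable.

```python
import asyncio
import contextvars

# Stand-in for OTel's "current span" context variable.
current_span = contextvars.ContextVar("current_span", default=None)


def sync_middleware_sets_span():
    # Runs in a worker thread, inside a *copy* of the caller's context.
    current_span.set("request-span")
    return current_span.get()


async def via_thread():
    # The value set in the thread is visible there, but not back out here.
    inside = await asyncio.to_thread(sync_middleware_sets_span)
    return inside, current_span.get()


async def downstream():
    return current_span.get()


async def via_outer_layer():
    # Setting the variable in the outermost async layer makes it visible
    # to every coroutine awaited beneath it.
    token = current_span.set("request-span")
    try:
        return await downstream()
    finally:
        current_span.reset(token)


thread_result = asyncio.run(via_thread())
outer_result = asyncio.run(via_outer_layer())
```

The first result shows the symptom (the span exists in the thread but the async caller sees none); the second shows why opening the span at the outermost ASGI layer avoids it.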
Filtering: only agent spans reach the admin¶
With OTel auto-instrumentation on, every Django request produces a tree of spans (ASGI root, DB queries, outbound HTTP, etc.). Persisting all of them into OTelTrace/OTelSpan would drown the agent admin with non-agent traffic — admin views, static assets, healthchecks, the conversation list endpoint, etc.
To keep the admin focused, we tag spans created inside an agent run and the exporter filters on the tag:
| File | Role |
|---|---|
| agent_run_tagger.py | Defines a contextvar _in_agent_run and a SpanProcessor (AgentRunTaggerProcessor) that sets campuscore.in_agent_run = True on every span started while the contextvar is True. Registered as the first processor on the TracerProvider so the marker is in place before any downstream processor reads the span. |
| agent_core.py | Calls enter_agent_run() at the top of AgentCore.run (just before opening the invoke_agent span) and exit_agent_run(token) in finally. Every span started in this window (including child psycopg/httpx spans inside tool calls) inherits the marker via OTel's contextvar-driven context propagation. |
| exporter.py | First action in _persist_spans is [s for s in spans if s.attributes.get(AGENT_RUN_MARKER_ATTR)]. Untagged spans return immediately without touching the DB. |
Net effect: the admin shows exactly the spans that happened during an agent loop. The ASGI request span, DB queries for view dispatch, healthcheck spans, etc. are dropped at the exporter. Sentry is unaffected — SentrySpanProcessor runs as a sibling processor and sees every span regardless of the marker.
Test-suite gating¶
PsycopgInstrumentor wraps every Django ORM cursor — including the cursors PostgresSpanExporter uses to write spans. In tests, this creates a write-amplification feedback loop that materially slows the suite. enable_auto_instrumentation() in instrumentation.py detects pytest (via PYTEST_CURRENT_TEST env var or pytest in sys.modules) and no-ops by default. Override with OTEL_INSTRUMENT_IN_TESTS=1 if a specific test needs the auto-instrumentation surface active.
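The gating decision described above reduces to a small predicate. The sketch below is an assumption about its shape, not the body of enable_auto_instrumentation(): should_auto_instrument is a hypothetical name, and the environment and module table are passed in as parameters only so the logic can be exercised in isolation.

```python
import os
import sys


def should_auto_instrument(environ=None, modules=None):
    """Return True when auto-instrumentation should be enabled."""
    environ = os.environ if environ is None else environ
    modules = sys.modules if modules is None else modules
    # Explicit override wins, even under pytest.
    if environ.get("OTEL_INSTRUMENT_IN_TESTS") == "1":
        return True
    under_pytest = "PYTEST_CURRENT_TEST" in environ or "pytest" in modules
    return not under_pytest


normal_run = should_auto_instrument({}, {})
pytest_env = should_auto_instrument({"PYTEST_CURRENT_TEST": "test_x.py::test_y"}, {})
pytest_imported = should_auto_instrument({}, {"pytest": object()})
overridden = should_auto_instrument(
    {"PYTEST_CURRENT_TEST": "test_x.py::test_y", "OTEL_INSTRUMENT_IN_TESTS": "1"}, {}
)
```

Checking both signals matters because PYTEST_CURRENT_TEST is only set while a test is executing, whereas the pytest module is already imported during collection and app startup.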
Golden-signal metrics¶
Beyond traces, the chat/agent/retrieval code emits counters and histograms via campus_core.observability.metrics.MetricsService (CloudWatch-bound — see Observability Stack). Alarms target these names directly in infrastructure/app/monitoring.tf.
| Surface | Metric | Where emitted |
|---|---|---|
| Chat | chat.request.count, chat.request.error_count, chat.request.duration_ms | chat_service.py::generate_response |
| Chat | chat.steps_per_response, chat.tools_used_per_response | chat_service.py::generate_response (on success) |
| Agent | agent.run.count, agent.run.error_count, agent.run.status.<completed\|error\|max_iterations> | _emit_agent_metrics in agent_core.py (called from all three terminal paths) |
| Agent | agent.iteration_count, agent.tool_call_count, agent.tokens.input, agent.tokens.output | same |
| Agent | agent.no_tool_calls_count | same; fires when an agent run completed without ever calling a tool (routing-regression signal) |
| Tool | tool.<name>.success_count, tool.<name>.failure_count, tool.<name>.exception_count, tool.<name>.duration_ms | _execute_tool_traced in agent_core.py |
| Retrieval | retrieval.duration_ms, retrieval.request_count, retrieval.result_count | retrieval_service.py::retrieve |
| Retrieval | retrieval.empty_result_count, retrieval.error_count | same |
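As a rough illustration of the per-tool metric names in the table, the sketch below wraps a tool call and records one counter plus a duration sample. It is not _execute_tool_traced and does not use the real MetricsService API: InMemoryMetrics, increment, timing, and run_tool_with_metrics are hypothetical, and the split between failure_count (tool returned an unsuccessful result) and exception_count (tool raised) is an assumption.

```python
import time
from collections import Counter


class InMemoryMetrics:
    """Hypothetical recorder standing in for a real metrics backend."""

    def __init__(self):
        self.counters = Counter()
        self.timings = {}

    def increment(self, name, value=1):
        self.counters[name] += value

    def timing(self, name, ms):
        self.timings.setdefault(name, []).append(ms)


def run_tool_with_metrics(metrics, name, func):
    """Call func() and emit tool.<name>.* metrics around it."""
    started = time.perf_counter()
    try:
        ok = bool(func())
        suffix = "success_count" if ok else "failure_count"
        metrics.increment(f"tool.{name}.{suffix}")
        return ok
    except Exception:
        metrics.increment(f"tool.{name}.exception_count")
        raise
    finally:
        # Duration is recorded on every path, including exceptions.
        elapsed_ms = (time.perf_counter() - started) * 1000
        metrics.timing(f"tool.{name}.duration_ms", elapsed_ms)


metrics = InMemoryMetrics()
run_tool_with_metrics(metrics, "search_knowledge", lambda: True)
run_tool_with_metrics(metrics, "search_knowledge", lambda: False)
try:
    run_tool_with_metrics(metrics, "search_knowledge", lambda: 1 / 0)
except ZeroDivisionError:
    pass
```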
psycopg v3 + OTel¶
CampusCore uses psycopg[binary]>=3.2. The matching OTel instrumentor is opentelemetry-instrumentation-psycopg (v3 only — the v2 instrumentor is a separate package, opentelemetry-instrumentation-psycopg2). Together this means every DB query made by Django's ORM or by hand-rolled connection.cursor() blocks becomes an OTel span when auto-instrumentation is active.
If you ever need to revert to psycopg2 (you shouldn't), also swap the instrumentor name in instrumentation.py::_instrument_psycopg or those spans silently disappear from the trace tree.