Agent Observability (OTel Tracing)

Structured execution traces for every agent run, stored in PostgreSQL and visualized via a custom Django admin UI. The same spans also flow to Sentry for ops-style cross-service tracing — see Observability Stack for how this doc fits into the broader telemetry architecture.

Architecture

The system uses the OpenTelemetry (OTel) Python SDK to instrument the agent's reasoning loop. Spans flow through a single TracerProvider to two consumers: the in-app PostgreSQL exporter (powering the admin drill-down UI) and Sentry's SentrySpanProcessor (for cross-service tracing). The OTel trace_id is the universal correlation key: it appears as trace_id on every log line and as Sentry's trace field on every event.

Agent reasoning loop (agent_core.py)
  |
  |  OTel spans created at each step
  v
BatchSpanProcessor (background thread)
  |
  v
PostgresSpanExporter
  |
  v
OTelTrace / OTelSpan tables (PostgreSQL)
  |
  v
Agent Trace Viewer (Django admin UI)

Span Hierarchy

Each agent run produces a single trace with this span tree. Names and sentry.op values follow the OpenTelemetry GenAI semantic conventions so Sentry's AI Agent UI recognizes them out of the box.

GET /get_chatbot_response/                  # ASGI request (OpenTelemetryMiddleware)
  |-- invoke_agent campuscore_agent         # 1 per chat turn (op = gen_ai.invoke_agent)
  |     |-- iteration_1
  |     |     |-- chat <model>              # LLM call (op = gen_ai.chat)
  |     |     |     |-- POST                # httpx → anthropic/openai (auto)
  |     |     |-- execute_tool <name>       # tool call (op = gen_ai.execute_tool)
  |     |     |     |-- SELECT documentchunk # psycopg → pgvector (auto)
  |     |-- iteration_2
  |     |     |-- chat <model>
  |     |     |-- token_management          # internal helper
  |     |-- iteration_N ...

Legacy names (agent_run, llm_call, tool_call.<name>) persist in the database for traces written before the Phase-3 GenAI rename. The admin's _classify_span and root-span lookup handle both naming styles; no migration is required for historical data.
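The dual-name handling described above could look roughly like the sketch below. It is an illustration only: the function name classify_span and the returned labels are hypothetical, and the real _classify_span may use different rules (for example, attribute lookups rather than name prefixes).

```python
def classify_span(name: str) -> str:
    """Map a span name, GenAI-style or legacy, to a coarse kind (hypothetical labels)."""
    # GenAI names carry a suffix ("invoke_agent <agent>"); legacy names are bare.
    if name.startswith("invoke_agent ") or name == "agent_run":
        return "agent"
    if name.startswith("chat ") or name == "llm_call":
        return "llm"
    # Legacy tool spans used a dotted form: tool_call.<name>.
    if name.startswith("execute_tool ") or name.startswith("tool_call."):
        return "tool"
    if name.startswith("iteration_"):
        return "iteration"
    # Auto-instrumented spans (SELECT ..., POST, GET) and internal helpers.
    return "other"
```

Because both naming styles resolve to the same label, a viewer built this way can render old and new traces with the same template.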

Key Files

| File | Purpose |
|---|---|
| services/observability/__init__.py | Package exports: init_tracing, shutdown_tracing, get_tracer |
| services/observability/tracer.py | TracerProvider setup; registers AgentRunTaggerProcessor, PostgresSpanExporter, and (when DSN set) SentrySpanProcessor + SentryPropagator |
| services/observability/attributes.py | OTel GenAI semantic attribute constants (gen_ai.*) + CampusCore extensions (campuscore.*); legacy agent.* / llm.* / tool.* aliases retained for back-compat |
| services/observability/agent_run_tagger.py | Contextvar + on_start span processor that marks every span created inside AgentCore.run; used by the exporter to filter agent-only traces |
| services/observability/instrumentation.py | OTel auto-instrumentors for psycopg, httpx, and stdlib logging; pytest-gated to avoid feedback loops |
| services/observability/exporter.py | PostgresSpanExporter: filters to agent-tagged spans, groups by trace_id, bulk-creates DB rows |
| campus_core/asgi.py | Wraps the ASGI app with OpenTelemetryMiddleware so the request span is created in the async context (NOT via DjangoInstrumentor) |
| campus_core/observability/sentry_setup.py | sentry_sdk.init(instrumenter="otel", enable_logs=True, ...) + FERPA scrubber wiring |
| models.py (OTelTrace, OTelSpan) | PostgreSQL storage for traces and spans |
| apis/admin_agent_trace_apis.py | Django views for the trace viewer UI; _classify_span + _LEGACY_ATTRIBUTE_ALIASES translate GenAI keys into the template's legacy paths |
| templates/admin/agent_traces.html | Main trace viewer page |
| templates/admin/partials/admin_at_* | HTMX partials for trace list, detail, span detail |

Models

OTelTrace

One row per agent run. Denormalized summary fields for efficient listing/filtering:

  • id (UUID): OTel trace_id
  • conversation (FK, nullable): link to conversation
  • user_message, ai_message (FK, nullable): linked messages
  • query, status, model
  • total_duration_ms, total_input_tokens, total_output_tokens
  • iteration_count, tool_call_count, tools_used (JSONField)
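To make the denormalized fields concrete, here is a minimal sketch of how they could be derived from a batch of spans grouped by trace_id. It is an assumption, not the exporter's actual logic: spans are modelled as plain dicts, summarize_traces is a hypothetical name, and the real exporter may read the totals from attributes on the invoke_agent span instead.

```python
from collections import defaultdict


def summarize_traces(spans):
    """Group span dicts by trace_id and derive OTelTrace-style summary fields."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["trace_id"]].append(span)

    summaries = {}
    for trace_id, members in grouped.items():
        chats = [s for s in members if s["name"].startswith("chat ")]
        tools = [s for s in members if s["name"].startswith("execute_tool ")]
        summaries[trace_id] = {
            "iteration_count": sum(s["name"].startswith("iteration_") for s in members),
            "tool_call_count": len(tools),
            # Prefer the attribute; fall back to the suffix of "execute_tool <name>".
            "tools_used": sorted(
                {s["attributes"].get("gen_ai.tool.name", s["name"].split(" ", 1)[1]) for s in tools}
            ),
            "total_input_tokens": sum(
                s["attributes"].get("gen_ai.usage.input_tokens", 0) for s in chats
            ),
            "total_output_tokens": sum(
                s["attributes"].get("gen_ai.usage.output_tokens", 0) for s in chats
            ),
        }
    return summaries
```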

OTelSpan

One row per span in the trace. Standard OTel shape:

  • id (CharField): OTel span_id hex
  • trace (FK to OTelTrace)
  • parent_span_id: for tree reconstruction
  • name: e.g., invoke_agent campuscore_agent, iteration_1, execute_tool search_knowledge (rows written before the GenAI rename use agent_run, tool_call.search_knowledge, etc.)
  • attributes (JSONField), events (JSONField)
  • started_at, ended_at, duration_ms
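Tree reconstruction from parent_span_id can be sketched as follows. This is an illustration under assumptions: rows are plain dicts rather than OTelSpan instances, and build_span_tree is a hypothetical helper. One detail worth noting is that a stored span may reference a parent that was never persisted (the request span sits above invoke_agent but is filtered out by the exporter), so the sketch treats any span whose parent is absent as a root.

```python
from collections import defaultdict


def build_span_tree(spans):
    """Nest span rows by parent_span_id, ordered by start time."""
    known_ids = {s["id"] for s in spans}
    children = defaultdict(list)
    roots = []
    for span in sorted(spans, key=lambda s: s["started_at"]):
        parent_id = span.get("parent_span_id")
        if parent_id in known_ids:
            children[parent_id].append(span)
        else:
            # Parent is missing or was never stored: treat this span as a root.
            roots.append(span)

    def attach(span):
        return {
            "name": span["name"],
            "children": [attach(child) for child in children[span["id"]]],
        }

    return [attach(root) for root in roots]
```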

Instrumentation Points

Manual spans created by AgentCore.run() (GenAI conventions):

| Location | Span Name | sentry.op | Key Attributes |
|---|---|---|---|
| agent_core.run() | invoke_agent <name> | gen_ai.invoke_agent | gen_ai.prompt, gen_ai.conversation.id, gen_ai.request.model, campuscore.agent.status, token totals |
| Reasoning loop | iteration_{n} | | campuscore.iteration.number, tool call count |
| LLM call | chat <model> | gen_ai.chat | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons |
| Tool execution | execute_tool <name> | gen_ai.execute_tool | gen_ai.tool.name, gen_ai.tool.input, gen_ai.tool.output, campuscore.tool.success |
| Knowledge routing | knowledge_routing | | campuscore.routing.folder_count, folders |
| Token management | token_management | | campuscore.token_management.strategy, message counts |

Auto-instrumented spans created during an agent run (inherit parent context from the surrounding manual span):

| Source | Span Name | Where it appears |
|---|---|---|
| PsycopgInstrumentor | SELECT main_app_documentchunk (etc.) | Children of execute_tool (search) and chat (token counting) |
| HTTPXClientInstrumentor | POST, GET | Children of chat (anthropic/openai API calls) and execute_tool (e.g., Cohere reranker) |
| OpenTelemetryMiddleware | GET /get_chatbot_response/ | Root request span (parent of invoke_agent) |

The auto-instrumented spans are kept inside agent traces because they're the actual network and DB calls the agent made — exactly what's useful for offline analysis ("why was iteration 2 slow?" → child span shows the LLM API took 8 seconds).

Admin UI

Accessible at /admin/agent-traces/ (staff only). Features:

  • Filterable trace list: status, model, min duration, tool used, date range
  • Waterfall timeline: color-coded spans proportional to duration
  • Span detail panel: attributes table, events, formatted JSON
  • Cross-links: conversation viewer <-> trace viewer
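The waterfall's "proportional to duration" layout comes down to simple arithmetic: each bar's offset and width are expressed as a percentage of the whole trace. The sketch below is one hypothetical way to compute it; waterfall_layout and the started_ms key are illustrative names, not the view's actual code.

```python
def waterfall_layout(spans):
    """Return left offset and width (percent of total trace time) for each span."""
    trace_start = min(s["started_ms"] for s in spans)
    trace_end = max(s["started_ms"] + s["duration_ms"] for s in spans)
    # Guard against a zero-length trace to avoid division by zero.
    total = max(trace_end - trace_start, 1)
    return [
        {
            "name": s["name"],
            "left_pct": round(100 * (s["started_ms"] - trace_start) / total, 1),
            "width_pct": round(100 * s["duration_ms"] / total, 1),
        }
        for s in spans
    ]
```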

Initialization

init_tracing() is called from MainAppConfig.ready() in apps.py. The BatchSpanProcessor runs on a background thread, flushing spans every 5 seconds.

Async Generator Spans

The agent's run() method is an async generator, so context managers (with start_as_current_span()) can't be used across yields. Instead, spans are manually managed:

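from opentelemetry import trace

span = tracer.start_span("invoke_agent campuscore_agent")
ctx = trace.set_span_in_context(span)  # parent context for child spans
# ... yield events ...
span.end()  # ended explicitly; no context manager closes it

Child spans receive the parent context explicitly via the context= parameter, for example tracer.start_span("iteration_1", context=ctx).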

OTel as the single source of truth (Sentry bridge)

OTel is the only system that creates spans for CampusCore. Sentry consumes them via SentrySpanProcessor, so the OTel trace_id is what populates Sentry's trace field — the two never diverge.

ASGI request → OpenTelemetryMiddleware (asgi.py)
  │  opens root request span
  v
Django middleware chain → view → ChatService.generate_response
  │
  v
AgentCore.run
  │  starts "invoke_agent" child span
  │  starts "chat" / "execute_tool" grand-children
  v
all spans end → span processors on the shared TracerProvider:
  ├─ BatchSpanProcessor → PostgresSpanExporter → OTelTrace/OTelSpan rows
  └─ SentrySpanProcessor → Sentry transactions/spans
     (uses the same trace_id)

Key wiring:

  • campus_core/asgi.py wraps the inner ASGI app with OpenTelemetryMiddleware from opentelemetry-instrumentation-asgi. The Django ASGI handler then runs inside that span, so trace.get_current_span() works correctly in every downstream coroutine — including async middleware, streaming generators, and the agent's async loop. We deliberately avoid DjangoInstrumentor() because its sync-style process_request/process_response middleware loses OTel context across the sync_to_async boundary under uvicorn/ASGI.
  • campus_core/observability/sentry_setup.py initializes Sentry with instrumenter="otel", telling it not to auto-create transactions. DjangoIntegration is retained but with all tracing knobs disabled (middleware_spans=False, signals_spans=False, cache_spans=False) — kept only for the error-capture path (request body, user attribution).
  • apps/main_app/services/observability/tracer.py attaches both PostgresSpanExporter (for the admin UI) and SentrySpanProcessor (for Sentry) to the same TracerProvider. It also installs SentryPropagator as the global OTel text-map propagator, so outbound HTTP requests carry sentry-trace and baggage headers and downstream services join the same trace.
  • apps/main_app/services/observability/instrumentation.py runs PsycopgInstrumentor, HTTPXClientInstrumentor, and LoggingInstrumentor in AppConfig.ready(). It's gated off during pytest (set OTEL_INSTRUMENT_IN_TESTS=1 to override) because PsycopgInstrumentor wraps every Django ORM cursor — including the ones PostgresSpanExporter uses — creating a write-amplification feedback loop.

Log correlation

Every JSON log line carries trace_id, span_id, request_id, and user_id. The values come from campus_core/observability/logging_filters.py::RequestIdFilter, which reads the current OTel span via trace.get_current_span(). Because OpenTelemetryMiddleware opens the span at the outermost ASGI layer, every log emitted during a request sees the same trace_id — and that value matches Sentry's trace field for the same request.

CloudWatch Logs Insights query example:

fields @timestamp, level, logger, message, trace_id, request_id, user_id
| filter trace_id = "abc123..."
| sort @timestamp asc

Why no DjangoInstrumentor

opentelemetry-instrumentation-django adds itself as a sync-style Django middleware (process_request/process_response). Django wraps sync middleware via asgiref.sync.sync_to_async when serving an async view, which runs the middleware in a thread with a copied contextvars context. Mutations to the OTel current-span contextvar inside that thread don't propagate back to the outer async context, so downstream async code (including streaming generators) sees no active span. We diagnosed this empirically: trace= was empty on every "request completed" log even though DjangoInstrumentor reported is_instrumented_by_opentelemetry == True. The fix was switching to OpenTelemetryMiddleware (a true ASGI middleware) wrapping the ASGI app itself.
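The mechanism can be illustrated with the standard library alone. The sketch below does not use asgiref, Django, or OTel; it mimics the situation described above by running a sync function in a worker thread under a copied context, and contrasts that with setting the variable in the async call chain itself. All names are hypothetical.

```python
import asyncio
import contextvars

# Stand-in for OTel's "current span" context variable.
current_span = contextvars.ContextVar("current_span", default=None)


def sync_style_middleware():
    """Sets the span inside whatever context it happens to run in."""
    current_span.set("request-span")
    return current_span.get()


async def downstream_view():
    return current_span.get()


async def handle_via_thread():
    """Sync middleware executed in a thread under a *copy* of the context."""
    copied = contextvars.copy_context()
    loop = asyncio.get_running_loop()
    seen_inside = await loop.run_in_executor(None, copied.run, sync_style_middleware)
    # The mutation stayed in the copy; the async side still sees the default.
    return seen_inside, await downstream_view()


async def handle_via_async_wrapper():
    """An async wrapper sets the variable in the same context the view runs in."""
    token = current_span.set("request-span")
    try:
        return await downstream_view()
    finally:
        current_span.reset(token)
```

Running asyncio.run(handle_via_thread()) gives ("request-span", None), whereas asyncio.run(handle_via_async_wrapper()) gives "request-span". This mirrors why a wrapper at the ASGI layer keeps the span visible to downstream coroutines.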

Filtering: only agent spans reach the admin

With OTel auto-instrumentation on, every Django request produces a tree of spans (ASGI root, DB queries, outbound HTTP, etc.). Persisting all of them into OTelTrace/OTelSpan would drown the agent admin with non-agent traffic — admin views, static assets, healthchecks, the conversation list endpoint, etc.

To keep the admin focused, we tag spans created inside an agent run and the exporter filters on the tag:

| File | Role |
|---|---|
| agent_run_tagger.py | Defines a contextvar _in_agent_run and a SpanProcessor (AgentRunTaggerProcessor) that sets campuscore.in_agent_run = True on every span started while the contextvar is True. Registered as the first processor on the TracerProvider so the marker is in place before any downstream processor reads the span. |
| agent_core.py | Calls enter_agent_run() at the top of AgentCore.run (just before opening the invoke_agent span) and exit_agent_run(token) in finally. Every span started in this window (including child psycopg/httpx spans inside tool calls) inherits the marker via OTel's contextvar-driven context propagation. |
| exporter.py | First action in _persist_spans is [s for s in spans if s.attributes.get(AGENT_RUN_MARKER_ATTR)]. Untagged spans return immediately without touching the DB. |

Net effect: the admin shows exactly the spans that happened during an agent loop. The ASGI request span, DB queries for view dispatch, healthcheck spans, etc. are dropped at the exporter. Sentry is unaffected — SentrySpanProcessor runs as a sibling processor and sees every span regardless of the marker.

Test-suite gating

PsycopgInstrumentor wraps every Django ORM cursor — including the cursors PostgresSpanExporter uses to write spans. In tests, this creates a write-amplification feedback loop that materially slows the suite. enable_auto_instrumentation() in instrumentation.py detects pytest (via PYTEST_CURRENT_TEST env var or pytest in sys.modules) and no-ops by default. Override with OTEL_INSTRUMENT_IN_TESTS=1 if a specific test needs the auto-instrumentation surface active.
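The detection logic described above could be written along these lines. should_auto_instrument is a hypothetical helper, not the body of enable_auto_instrumentation(); the environ and modules parameters exist only so the decision can be exercised without touching the real process state.

```python
import os
import sys


def should_auto_instrument(environ=None, modules=None) -> bool:
    """Decide whether auto-instrumentation should be enabled."""
    environ = os.environ if environ is None else environ
    modules = sys.modules if modules is None else modules

    # An explicit opt-in wins, even under pytest.
    if environ.get("OTEL_INSTRUMENT_IN_TESTS") == "1":
        return True

    # pytest sets PYTEST_CURRENT_TEST while a test runs; the module check
    # also covers import/collection time, before that variable exists.
    under_pytest = "PYTEST_CURRENT_TEST" in environ or "pytest" in modules
    return not under_pytest
```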

Golden-signal metrics

Beyond traces, the chat/agent/retrieval code emits counters and histograms via campus_core.observability.metrics.MetricsService (CloudWatch-bound — see Observability Stack). Alarms target these names directly in infrastructure/app/monitoring.tf.

| Surface | Metric | Where emitted |
|---|---|---|
| Chat | chat.request.count, chat.request.error_count, chat.request.duration_ms | chat_service.py::generate_response |
| Chat | chat.steps_per_response, chat.tools_used_per_response | chat_service.py::generate_response (on success) |
| Agent | agent.run.count, agent.run.error_count, agent.run.status.<completed\|error\|max_iterations> | _emit_agent_metrics in agent_core.py (called from all three terminal paths) |
| Agent | agent.iteration_count, agent.tool_call_count, agent.tokens.input, agent.tokens.output | same |
| Agent | agent.no_tool_calls_count | same; fires when an agent run completed without ever calling a tool (routing-regression signal) |
| Tool | tool.<name>.success_count, tool.<name>.failure_count, tool.<name>.exception_count, tool.<name>.duration_ms | _execute_tool_traced in agent_core.py |
| Retrieval | retrieval.duration_ms, retrieval.request_count, retrieval.result_count | retrieval_service.py::retrieve |
| Retrieval | retrieval.empty_result_count, retrieval.error_count | same |
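As an illustration of how the per-tool metric names in the table might be emitted, here is a sketch built on an in-memory recorder. Everything in it is an assumption: InMemoryMetrics and its increment/timing methods are hypothetical and do not reflect the MetricsService API, and the split between failure (the tool returns an unsuccessful result) and exception (the tool raises) is one plausible reading of the metric names.

```python
import time
from collections import Counter


class InMemoryMetrics:
    """Hypothetical recorder; the real MetricsService API may differ."""

    def __init__(self):
        self.counters = Counter()
        self.timings = {}

    def increment(self, name, value=1):
        self.counters[name] += value

    def timing(self, name, ms):
        self.timings.setdefault(name, []).append(ms)


def execute_tool_with_metrics(metrics, name, fn, *args, **kwargs):
    """Run a tool callable and record success/failure/exception plus duration."""
    started = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    except Exception:
        metrics.increment(f"tool.{name}.exception_count")
        raise
    else:
        outcome = "success" if result.get("success") else "failure"
        metrics.increment(f"tool.{name}.{outcome}_count")
        return result
    finally:
        # Duration is recorded on every path, including exceptions.
        elapsed_ms = (time.perf_counter() - started) * 1000
        metrics.timing(f"tool.{name}.duration_ms", elapsed_ms)
```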

psycopg v3 + OTel

CampusCore uses psycopg[binary]>=3.2. The matching OTel instrumentor is opentelemetry-instrumentation-psycopg (v3 only — the v2 instrumentor is a separate package, opentelemetry-instrumentation-psycopg2). Together this means every DB query made by Django's ORM or by hand-rolled connection.cursor() blocks becomes an OTel span when auto-instrumentation is active.

If you ever need to revert to psycopg2 (you shouldn't), also swap the instrumentor name in instrumentation.py::_instrument_psycopg or those spans silently disappear from the trace tree.