Agent Observability (OTel Tracing)¶

Structured execution traces for every agent run, stored in PostgreSQL and visualized via a custom Django admin UI. The same spans also flow to Sentry for ops-style cross-service tracing — see Observability Stack for how this doc fits into the broader telemetry architecture.

Architecture¶

The system uses the OpenTelemetry (OTel) Python SDK to instrument the agent's reasoning loop. Spans flow through a single TracerProvider to two consumers: the in-app PostgreSQL exporter (powering the admin drill-down UI) and Sentry's SentrySpanProcessor (for cross-service tracing). The OTel trace_id is the universal correlation key: it appears as trace_id on every log line and as Sentry's trace field on every event.

Agent reasoning loop (agent_core.py)
  |
  |  OTel spans created at each step
  v
BatchSpanProcessor (background thread)
  |
  v
PostgresSpanExporter
  |
  v
OTelTrace / OTelSpan tables (PostgreSQL)
  |
  v
Agent Trace Viewer (Django admin UI)

Span Hierarchy¶

Each agent run produces a single trace with this span tree. Names and sentry.op values follow the OpenTelemetry GenAI semantic conventions so Sentry's AI Agent UI recognizes them out of the box.

GET /get_chatbot_response/                  # ASGI request (OpenTelemetryMiddleware)
  |-- invoke_agent campuscore_agent         # 1 per chat turn (op = gen_ai.invoke_agent)
  |     |-- iteration_1
  |     |     |-- chat <model>              # LLM call (op = gen_ai.chat)
  |     |     |     |-- POST                # httpx → anthropic/openai (auto)
  |     |     |-- execute_tool <name>       # tool call (op = gen_ai.execute_tool)
  |     |     |     |-- SELECT documentchunk # psycopg → pgvector (auto)
  |     |-- iteration_2
  |     |     |-- chat <model>
  |     |     |-- token_management          # internal helper
  |     |-- iteration_N ...

Legacy names (agent_run, llm_call, tool_call.<name>) persist in the database for traces written before the Phase-3 GenAI rename. The admin's _classify_span and root-span lookup handle both naming styles; no migration is required for historical data.
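Handling both naming styles can be pictured with a small classifier. This is only a sketch: `classify_span_name` and its return labels are hypothetical and are not the actual `_classify_span` implementation, which this page does not show.

```python
def classify_span_name(name: str) -> str:
    """Map a span name (GenAI style or legacy style) to a coarse kind label."""
    # GenAI-style names look like "<operation> <target>", e.g. "chat <model>".
    operation = name.split(" ", 1)[0]
    if operation == "invoke_agent" or name == "agent_run":
        return "agent"
    if operation == "chat" or name == "llm_call":
        return "llm"
    if operation == "execute_tool" or name.startswith("tool_call."):
        return "tool"
    if name.startswith("iteration_"):
        return "iteration"
    # Auto-instrumented spans (POST, SELECT ...) and helpers fall through.
    return "other"
```

Because the lookup is purely name-based, rows written before the rename need no data migration in this model.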

Key Files¶

File	Purpose
`services/observability/__init__.py`	Package exports: `init_tracing`, `shutdown_tracing`, `get_tracer`
`services/observability/tracer.py`	`TracerProvider` setup; registers `AgentRunTaggerProcessor`, `PostgresSpanExporter`, and (when DSN set) `SentrySpanProcessor` + `SentryPropagator`
`services/observability/attributes.py`	OTel GenAI semantic attribute constants (`gen_ai.*`) + CampusCore extensions (`campuscore.*`); legacy `agent.*` / `llm.*` / `tool.*` aliases retained for back-compat
`services/observability/agent_run_tagger.py`	Contextvar + `on_start` span processor that marks every span created inside `AgentCore.run` — used by the exporter to filter agent-only traces
`services/observability/instrumentation.py`	OTel auto-instrumentors for psycopg, httpx, and stdlib logging; pytest-gated to avoid feedback loops
`services/observability/exporter.py`	`PostgresSpanExporter` — filters to agent-tagged spans, groups by trace_id, bulk-creates DB rows
`campus_core/asgi.py`	Wraps the ASGI app with `OpenTelemetryMiddleware` so the request span is created in the async context (NOT via `DjangoInstrumentor`)
`campus_core/observability/sentry_setup.py`	`sentry_sdk.init(instrumenter="otel", enable_logs=True, ...)` + FERPA scrubber wiring
`models.py` (OTelTrace, OTelSpan)	PostgreSQL storage for traces and spans
`apis/admin_agent_trace_apis.py`	Django views for the trace viewer UI; `_classify_span` + `_LEGACY_ATTRIBUTE_ALIASES` translate GenAI keys into the template's legacy paths
`templates/admin/agent_traces.html`	Main trace viewer page
`templates/admin/partials/admin_at_*`	HTMX partials for trace list, detail, span detail

Models¶

OTelTrace¶

One row per agent run, with denormalized summary fields for efficient listing and filtering:

- id (UUID): OTel trace_id
- conversation (FK, nullable): link to conversation
- user_message, ai_message (FK, nullable): linked messages
- query, status, model
- total_duration_ms, total_input_tokens, total_output_tokens
- iteration_count, tool_call_count, tools_used (JSONField)

OTelSpan¶

One row per span in the trace, in the standard OTel shape:

- id (CharField): OTel span_id hex
- trace (FK to OTelTrace)
- parent_span_id: for tree reconstruction
- name: e.g., invoke_agent campuscore_agent, iteration_1, execute_tool search_knowledge (legacy rows use agent_run, tool_call.search_knowledge)
- attributes (JSONField), events (JSONField)
- started_at, ended_at, duration_ms
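Tree reconstruction from `parent_span_id` can be sketched as follows. The function name and the plain-dict row shape are assumptions for illustration; the real viewer works on `OTelSpan` rows and may differ. A row whose parent was never persisted (for example a parent span that was not stored) is treated as a root here.

```python
from collections import defaultdict


def build_span_tree(rows: list[dict]) -> list[dict]:
    """Nest flat span rows under their parents using parent_span_id."""
    known_ids = {row["id"] for row in rows}
    children = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["started_at"]):
        parent = row["parent_span_id"]
        # A parent that is not among the stored rows makes this row a root.
        children[parent if parent in known_ids else None].append(row)

    def attach(row: dict) -> dict:
        return {**row, "children": [attach(c) for c in children[row["id"]]]}

    return [attach(root) for root in children[None]]
```

Sorting by `started_at` before grouping keeps siblings in chronological order, which is what a waterfall view needs.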

Instrumentation Points¶

Manual spans created by AgentCore.run() (GenAI conventions):

Location	Span Name	`sentry.op`	Key Attributes
`agent_core.run()`	`invoke_agent <name>`	`gen_ai.invoke_agent`	`gen_ai.prompt`, `gen_ai.conversation.id`, `gen_ai.request.model`, `campuscore.agent.status`, token totals
Reasoning loop	`iteration_{n}`	—	`campuscore.iteration.number`, tool call count
LLM call	`chat <model>`	`gen_ai.chat`	`gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.response.finish_reasons`
Tool execution	`execute_tool <name>`	`gen_ai.execute_tool`	`gen_ai.tool.name`, `gen_ai.tool.input`, `gen_ai.tool.output`, `campuscore.tool.success`
Knowledge routing	`knowledge_routing`	—	`campuscore.routing.folder_count`, folders
Token management	`token_management`	—	`campuscore.token_management.strategy`, message counts

Auto-instrumented spans created during an agent run (inherit parent context from the surrounding manual span):

Source	Span Name	Where it appears
`PsycopgInstrumentor`	`SELECT main_app_documentchunk` (etc.)	Children of `execute_tool` (search) and `chat` (token counting)
`HTTPXClientInstrumentor`	`POST`, `GET`	Children of `chat` (anthropic/openai API calls) and `execute_tool` (e.g., Cohere reranker)
`OpenTelemetryMiddleware`	`GET /get_chatbot_response/`	Root request span (parent of `invoke_agent`)

The auto-instrumented spans are kept inside agent traces because they're the actual network and DB calls the agent made — exactly what's useful for offline analysis ("why was iteration 2 slow?" → child span shows the LLM API took 8 seconds).

Admin UI¶

Accessible at /admin/agent-traces/ (staff only). Features:

- Filterable trace list: status, model, min duration, tool used, date range
- Waterfall timeline: color-coded spans proportional to duration
- Span detail panel: attributes table, events, formatted JSON
- Cross-links: conversation viewer <-> trace viewer

Initialization¶

init_tracing() is called from MainAppConfig.ready() in apps.py. The BatchSpanProcessor runs on a background thread, flushing spans every 5 seconds.

Async Generator Spans¶

The agent's run() method is an async generator, so context managers (with start_as_current_span()) can't be used across yields. Instead, spans are manually managed:

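from opentelemetry import trace
tracer = trace.get_tracer(__name__)  # or the package's own get_tracer()
span = tracer.start_span("invoke_agent campuscore_agent")
ctx = trace.set_span_in_context(span)  # passed as context= to child spans
# ... yield events ...
span.end()  # call from a finally block so the span also closes on errors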

Child spans receive the parent context explicitly via the context= parameter.

OTel as the single source of truth (Sentry bridge)¶

OTel is the only system that creates spans for CampusCore. Sentry consumes them via SentrySpanProcessor, so the OTel trace_id is what populates Sentry's trace field — the two never diverge.

ASGI request → OpenTelemetryMiddleware (asgi.py)
              │  opens root request span
              ▼
              Django middleware chain → view → ChatService.generate_response
                                              │
                                              ▼
                                              AgentCore.run
                                              │  starts "invoke_agent" child span
                                              │  starts "chat" / "execute_tool" grand-children
                                              ▼
                                              all spans end → BatchSpanProcessors:
                                                ├─ PostgresSpanExporter → OTelTrace/OTelSpan rows
                                                └─ SentrySpanProcessor → Sentry transactions/spans
                                                   (uses the same trace_id)

Key wiring:

- campus_core/asgi.py wraps the inner ASGI app with OpenTelemetryMiddleware from opentelemetry-instrumentation-asgi. The Django ASGI handler then runs inside that span, so trace.get_current_span() works correctly in every downstream coroutine, including async middleware, streaming generators, and the agent's async loop. We deliberately avoid DjangoInstrumentor() because its sync-style process_request/process_response middleware loses OTel context across the sync_to_async boundary under uvicorn/ASGI.
- campus_core/observability/sentry_setup.py initializes Sentry with instrumenter="otel", which tells it not to auto-create transactions. DjangoIntegration is retained with all tracing knobs disabled (middleware_spans=False, signals_spans=False, cache_spans=False); it is kept only for the error-capture path (request body, user attribution).
- apps/main_app/services/observability/tracer.py attaches both PostgresSpanExporter (for the admin UI) and SentrySpanProcessor (for Sentry) to the same TracerProvider. It also installs SentryPropagator as the global OTel text-map propagator so outbound HTTP requests carry sentry-trace + baggage headers, joining downstream services into the same trace.
- apps/main_app/services/observability/instrumentation.py runs PsycopgInstrumentor, HTTPXClientInstrumentor, and LoggingInstrumentor in AppConfig.ready(). It's gated off during pytest (set OTEL_INSTRUMENT_IN_TESTS=1 to override) because PsycopgInstrumentor wraps every Django ORM cursor, including the ones PostgresSpanExporter uses, creating a write-amplification feedback loop.

Log correlation¶

Every JSON log line carries trace_id, span_id, request_id, and user_id. The values come from campus_core/observability/logging_filters.py::RequestIdFilter, which reads the current OTel span via trace.get_current_span(). Because OpenTelemetryMiddleware opens the span at the outermost ASGI layer, every log emitted during a request sees the same trace_id — and that value matches Sentry's trace field for the same request.

CloudWatch Logs Insights query example:

fields @timestamp, level, logger, message, trace_id, request_id, user_id
| filter trace_id = "abc123..."
| sort @timestamp asc

Why no DjangoInstrumentor¶

opentelemetry-instrumentation-django adds itself as a sync-style Django middleware (process_request/process_response). Django wraps sync middleware via asgiref.sync.sync_to_async when serving an async view, which runs the middleware in a thread with a copied contextvars context. Mutations to the OTel current-span contextvar inside that thread don't propagate back to the outer async context, so downstream async code (including streaming generators) sees no active span. We diagnosed this empirically: trace= was empty on every "request completed" log even though DjangoInstrumentor reported is_instrumented_by_opentelemetry == True. The fix was switching to OpenTelemetryMiddleware (a true ASGI middleware) wrapping the ASGI app itself.
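The context-loss mechanism can be reproduced with the standard library alone. The sketch below is a simplified model of the behaviour described above, not asgiref's or Django's actual code: a function runs in a worker thread inside a copy of the caller's context, and the value it sets never reaches the caller.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the contextvar that holds OTel's current span.
current_span = contextvars.ContextVar("current_span", default=None)


def sync_middleware() -> str:
    # What a sync-style middleware does: mark a span as current.
    current_span.set("request-span")
    return current_span.get()


def run_in_thread_with_copied_context(func):
    # The function sees (and mutates) a copy of the caller's context.
    ctx = contextvars.copy_context()
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(ctx.run, func).result()


seen_inside = run_in_thread_with_copied_context(sync_middleware)
seen_outside = current_span.get()
print(seen_inside, seen_outside)
```

Inside the thread the span is visible; back in the caller the variable still holds its default, which matches the symptom of downstream async code seeing no active span.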

Filtering: only agent spans reach the admin¶

With OTel auto-instrumentation on, every Django request produces a tree of spans (ASGI root, DB queries, outbound HTTP, etc.). Persisting all of them into OTelTrace/OTelSpan would drown the agent admin with non-agent traffic — admin views, static assets, healthchecks, the conversation list endpoint, etc.

To keep the admin focused, we tag spans created inside an agent run and the exporter filters on the tag:

File	Role
`agent_run_tagger.py`	Defines a contextvar `_in_agent_run` and a `SpanProcessor` (`AgentRunTaggerProcessor`) that sets `campuscore.in_agent_run = True` on every span started while the contextvar is True. Registered as the first processor on the `TracerProvider` so the marker is in place before any downstream processor reads the span.
`agent_core.py`	Calls `enter_agent_run()` at the top of `AgentCore.run` (just before opening the `invoke_agent` span) and `exit_agent_run(token)` in `finally`. Every span started in this window — including child psycopg/httpx spans inside tool calls — inherits the marker via OTel's contextvar-driven context propagation.
`exporter.py`	First action in `_persist_spans` is `[s for s in spans if s.attributes.get(AGENT_RUN_MARKER_ATTR)]`. If no tagged spans remain, the method returns immediately without touching the DB.

Net effect: the admin shows exactly the spans that happened during an agent loop. The ASGI request span, DB queries for view dispatch, healthcheck spans, etc. are dropped at the exporter. Sentry is unaffected — SentrySpanProcessor runs as a sibling processor and sees every span regardless of the marker.

Test-suite gating¶

PsycopgInstrumentor wraps every Django ORM cursor — including the cursors PostgresSpanExporter uses to write spans. In tests, this creates a write-amplification feedback loop that materially slows the suite. enable_auto_instrumentation() in instrumentation.py detects pytest (via PYTEST_CURRENT_TEST env var or pytest in sys.modules) and no-ops by default. Override with OTEL_INSTRUMENT_IN_TESTS=1 if a specific test needs the auto-instrumentation surface active.
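The gating decision described here can be sketched as a pure function. `should_auto_instrument` is a hypothetical name (the real entry point is `enable_auto_instrumentation()`), and the `environ` / `modules` parameters exist only so the logic can be checked without depending on the surrounding process.

```python
import os
import sys


def should_auto_instrument(environ=None, modules=None) -> bool:
    """Return False under pytest unless OTEL_INSTRUMENT_IN_TESTS=1 is set."""
    environ = os.environ if environ is None else environ
    modules = sys.modules if modules is None else modules
    if environ.get("OTEL_INSTRUMENT_IN_TESTS") == "1":
        return True  # explicit override wins
    under_pytest = "PYTEST_CURRENT_TEST" in environ or "pytest" in modules
    return not under_pytest
```

Checking both signals covers import time, when `pytest` is already in `sys.modules` but `PYTEST_CURRENT_TEST` has not been set yet.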

Golden-signal metrics¶

Beyond traces, the chat/agent/retrieval code emits counters and histograms via campus_core.observability.metrics.MetricsService (CloudWatch-bound — see Observability Stack). Alarms target these names directly in infrastructure/app/monitoring.tf.

Surface	Metric	Where emitted
Chat	`chat.request.count`, `chat.request.error_count`, `chat.request.duration_ms`	`chat_service.py::generate_response`
Chat	`chat.steps_per_response`, `chat.tools_used_per_response`	`chat_service.py::generate_response` (on success)
Agent	`agent.run.count`, `agent.run.error_count`, `agent.run.status.<completed|error|max_iterations>`	`_emit_agent_metrics` in `agent_core.py` (called from all three terminal paths)
Agent	`agent.iteration_count`, `agent.tool_call_count`, `agent.tokens.input`, `agent.tokens.output`	same
Agent	`agent.no_tool_calls_count`	same; fires when an agent run completes without ever calling a tool (routing-regression signal)
Tool	`tool.<name>.success_count`, `tool.<name>.failure_count`, `tool.<name>.exception_count`, `tool.<name>.duration_ms`	`_execute_tool_traced` in `agent_core.py`
Retrieval	`retrieval.duration_ms`, `retrieval.request_count`, `retrieval.result_count`	`retrieval_service.py::retrieve`
Retrieval	`retrieval.empty_result_count`, `retrieval.error_count`	same
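As an illustration of how the per-tool metric names in the Tool row might be produced, here is a sketch with an in-memory recorder. Everything except the metric names is an assumption: `InMemoryMetrics` is not the `MetricsService` API, `execute_tool_with_metrics` is not `_execute_tool_traced`, and the convention that a tool returns a dict with a `success` flag is invented for the example.

```python
import time
from collections import Counter


class InMemoryMetrics:
    """Hypothetical recorder; the real MetricsService API is not shown here."""

    def __init__(self):
        self.counters = Counter()
        self.timings = {}

    def increment(self, name: str) -> None:
        self.counters[name] += 1

    def timing(self, name: str, value_ms: float) -> None:
        self.timings.setdefault(name, []).append(value_ms)


def execute_tool_with_metrics(metrics, name, func, *args):
    """Run a tool and emit tool.<name>.* counters plus a duration sample."""
    started = time.perf_counter()
    try:
        result = func(*args)
    except Exception:
        metrics.increment(f"tool.{name}.exception_count")
        raise
    else:
        # Assumed convention: tools report failure via a "success" flag.
        outcome = "success_count" if result.get("success") else "failure_count"
        metrics.increment(f"tool.{name}.{outcome}")
        return result
    finally:
        elapsed_ms = (time.perf_counter() - started) * 1000
        metrics.timing(f"tool.{name}.duration_ms", elapsed_ms)
```

Separating `failure_count` (the tool ran and reported failure) from `exception_count` (the tool raised) keeps alarm thresholds for the two cases independent.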

psycopg v3 + OTel¶

CampusCore uses psycopg[binary]>=3.2. The matching OTel instrumentor is opentelemetry-instrumentation-psycopg (v3 only — the v2 instrumentor is a separate package, opentelemetry-instrumentation-psycopg2). Together this means every DB query made by Django's ORM or by hand-rolled connection.cursor() blocks becomes an OTel span when auto-instrumentation is active.

If you ever need to revert to psycopg2 (you shouldn't), also swap the instrumentor name in instrumentation.py::_instrument_psycopg or those spans silently disappear from the trace tree.