Observability Stack

The bird's-eye view of where telemetry flows in CampusCore. Use this doc to figure out where to look for a given question. For deeper architecture on the agent-tracing subsystem specifically, see Agent Observability. For the operator playbook on setting Sentry up for a new tenant, see Sentry Setup.

The four-layer model

CampusCore separates concerns across four destinations. Each owns one job; they share trace IDs so you can pivot between them on a single request.

Concern | Destination | Audience | Key signal
Errors | Sentry Issues | Engineering, support | "Something broke for a user"
Distributed traces | Sentry Performance / Insights + Sentry AI Agents | Engineering | "Why was this request slow / what did the agent do"
Logs (per-request) | CloudWatch Logs + Sentry Logs | Engineering, audit | "What happened during request X"
Metrics + alarms | CloudWatch Metrics + Alarms → Slack + email | Ops, on-call | "Is the system healthy"
Agent deep-dive | Django admin (/admin/agent-traces/) | Engineering, eval pipelines | "What exactly did the agent reason / call / return"
Vector index health | Django admin (/admin/observability/vector/) | Engineering, infra | "Is the HNSW index healthy / does it need REINDEX"
Workflow notifications | Slack workflow_runs channel via slack_utils | Engineering, on-call | "Did this scheduled / long-running workflow finish or fail"

The split is intentional: Sentry gets the surfaces that people triaging an issue care about most (errors with full context, agent traces with token usage), CloudWatch keeps the in-tenant metric/log archive that BYOC customers need, and the Django admin is the agent-debugging cockpit no SaaS replicates. Slack carries time-bounded workflow lifecycle events ("starting / completed / failed") so an engineer doesn't have to open AWS to know whether a deploy, scheduled check, or scrape finished. These messages are informational and never page on-call; Sentry does that.

Data flow

ASGI request
OpenTelemetryMiddleware  ────► opens root request span (trace_id = T)
Django middleware + view + ChatService.generate_response
  │  (every log emitted here carries trace_id = T via RequestIdFilter)
AgentCore.run
  │  enter_agent_run()  ────► AgentRunTaggerProcessor stamps every subsequent
  │                            span with campuscore.in_agent_run = True
  │  tracer.start_span("invoke_agent ...")     # gen_ai.invoke_agent
  │    └── iteration_N
  │          ├── chat <model>                  # gen_ai.chat
  │          │     └── POST  (httpx auto)       # actual LLM API call
  │          └── execute_tool <name>           # gen_ai.execute_tool
  │                └── SELECT ... (psycopg auto)
All spans end → flow through ONE TracerProvider's processors in parallel:
  ├── AgentRunTaggerProcessor   (on_start — tagging only)
  ├── BatchSpanProcessor → PostgresSpanExporter
  │     │
  │     └── Drops untagged spans, persists agent traces to OTelTrace/OTelSpan
  └── SentrySpanProcessor (when DSN set)
        └── Forwards every span to Sentry as transactions/spans
              (uses the same OTel trace_id — Sentry "trace" field = T)

Log lines emitted anywhere in the request:
  stdout (JSON) ─► CloudWatch Logs (awslogs driver, full fidelity)
                ─► Sentry Logs    (via LoggingIntegration's SentryLogsHandler)

MetricsService.record/increment/gauge:
  In-memory batch ─► CloudWatch Metrics (every 60s, via boto3 put_metric_data)
                      └── CloudWatch Alarms ─┬─► SNS ─► email
                                             └─► EventBridge ─► Slack webhook

Errors

Where they go: Sentry, captured by sentry_sdk + LoggingIntegration (event_level=ERROR). Issues are deduplicated by fingerprint, grouped across releases, and routed to Slack via Sentry's Slack integration.

Why not CloudWatch: raw logger.error(...) lines end up in CloudWatch too, but they're not deduplicated, fingerprinted, or grouped. CloudWatch Logs Insights queries work for forensics; Sentry Issues are for actionable alerting.

Wiring: campus_core/observability/sentry_setup.py::init_sentry. The LoggingIntegration(event_level=ERROR) config turns every logger.error and logger.exception call into a Sentry event, which Sentry then groups into an Issue by fingerprint.

Scrubbing: before_send hook in campus_core/observability/sentry_scrubber.py redacts PII before ship. See Security & SOC 2.
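
A minimal sketch of what a before_send scrubber can look like. The (event, hint) signature and the "return None to drop the event" convention are what sentry_sdk expects from before_send; the e-mail regex, the redaction marker, and the recursive walk are illustrative assumptions and not the contents of sentry_scrubber.py.

```python
import re
from typing import Any, Optional

# Assumption: e-mail addresses stand in for "PII"; the real scrubber may cover more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REDACTED = "[redacted-email]"


def _scrub(value: Any) -> Any:
    """Recursively redact e-mail addresses inside strings, dicts, and lists."""
    if isinstance(value, str):
        return EMAIL_RE.sub(REDACTED, value)
    if isinstance(value, dict):
        return {key: _scrub(item) for key, item in value.items()}
    if isinstance(value, list):
        return [_scrub(item) for item in value]
    return value  # numbers, booleans, None pass through untouched


def before_send(event: dict, hint: dict) -> Optional[dict]:
    """Return the scrubbed event; returning None would drop it entirely."""
    return _scrub(event)
```

The hook would be passed as before_send=before_send at SDK init. Returning a new structure instead of mutating in place keeps the function easy to unit-test.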

Traces

Single source of truth: OpenTelemetry. OTel produces spans; two consumers attach to the same TracerProvider:

  1. PostgresSpanExporter persists agent-relevant spans to OTelTrace/OTelSpan PostgreSQL tables for the Django admin at /admin/agent-traces/. Filtered by the agent-run tagger so only spans created inside AgentCore.run (and their children) are persisted.
  2. SentrySpanProcessor (when SENTRY_DSN is set) forwards every span to Sentry. Auto-instrumentation spans (psycopg, httpx, ASGI) become children of the request transaction in Sentry's UI.
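
The tag-then-filter idea behind the first consumer can be modelled with the standard library alone. This is a sketch under assumptions: Span, start_span, and export are stand-ins for the OTel span, tracer, and PostgresSpanExporter, and the use of a context variable to track "inside an agent run" is a guess at the mechanism, not a description of the real AgentRunTaggerProcessor.

```python
import contextvars
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, List

# True while an agent run is active in the current context.
_in_agent_run = contextvars.ContextVar("in_agent_run", default=False)


@contextmanager
def enter_agent_run():
    """Mark the current context as being inside an agent run."""
    token = _in_agent_run.set(True)
    try:
        yield
    finally:
        _in_agent_run.reset(token)


@dataclass
class Span:  # stand-in for an OTel span
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)


def on_start(span: Span) -> None:
    """Tagger: stamp every span that starts inside an agent run."""
    if _in_agent_run.get():
        span.attributes["campuscore.in_agent_run"] = True


def start_span(name: str) -> Span:
    span = Span(name)
    on_start(span)  # processors see the span as it starts
    return span


def export(spans: List[Span]) -> List[Span]:
    """Exporter filter: keep tagged spans, drop everything else."""
    return [s for s in spans if s.attributes.get("campuscore.in_agent_run")]


request_span = start_span("GET /chat")  # outside the agent run: untagged
with enter_agent_run():
    tool_span = start_span("execute_tool search_knowledge")  # tagged

persisted = export([request_span, tool_span])
```

Only tool_span survives the filter here, which mirrors why request-level and auto-instrumentation spans outside AgentCore.run never reach the admin tables while Sentry still receives all of them.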

Sentry is initialized with instrumenter="otel" so it does NOT auto-create transactions on Django requests — it consumes OTel spans only. This makes the OTel trace_id the universal correlation key: every Sentry event tagged trace=T corresponds to the same trace_id=T in CloudWatch logs and the agent admin.

See: Agent Observability → OTel as the single source of truth for the full bridge setup.

Logs

Two destinations, both full-fidelity:

Destination | How it gets there | Use it for
CloudWatch Logs (/ecs/campuscore-<env>) | ECS task awslogs log driver captures stdout → CloudWatch | Forensic queries, structured log analysis, log-based metric filters, archival, in-tenant retention compliance
Sentry Logs (in the same project as Issues/Traces) | LoggingIntegration's SentryLogsHandler ships every logger.info+ record when enable_logs=True at SDK init | Cross-environment log search alongside Issue and Trace correlation; convenient for "show me everything that happened during this trace"

Both carry the same fields injected by RequestIdFilter:

  • request_id: assigned per request by RequestCorrelationMiddleware
  • trace_id / span_id: read from the current OTel span (populated as long as OpenTelemetryMiddleware is wrapping the ASGI app)
  • user_id: authenticated user PK or null
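
For orientation, a logging filter of this kind can be very small. The sketch below uses only the standard library; the four context variables are hypothetical stand-ins for whatever RequestCorrelationMiddleware and the OTel current-span lookup actually provide.

```python
import contextvars
import logging

# Hypothetical sources for the injected fields. The real filter reads the
# trace and span IDs from the current OTel span, not from context variables.
request_id_var = contextvars.ContextVar("request_id", default=None)
trace_id_var = contextvars.ContextVar("trace_id", default=None)
span_id_var = contextvars.ContextVar("span_id", default=None)
user_id_var = contextvars.ContextVar("user_id", default=None)


class RequestIdFilter(logging.Filter):
    """Attach correlation fields to every record passing through a handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        record.trace_id = trace_id_var.get()
        record.span_id = span_id_var.get()
        record.user_id = user_id_var.get()  # None serializes as JSON null
        return True  # enrich only, never drop the record
```

Attaching the filter to the handler (rather than to individual loggers) is what lets third-party loggers pick up the same fields.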

CloudWatch Logs Insights query example:

fields @timestamp, level, logger, message, trace_id, user_id
| filter trace_id = "abc123..."
| sort @timestamp asc

Sentry query example: open the Logs tab → filter environment:<tenant-env> trace_id:abc123....

To pivot: find a trace_id anywhere (Sentry issue → trace_id tag; CloudWatch log line → trace_id field; agent admin → trace UUID) and paste it into any of the other surfaces.
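
When pivoting, the same 128-bit trace ID may show up as 32 hex characters in one tool and as a dashed UUID in another. The helpers below convert between the two forms. They assume the agent admin's "trace UUID" is the same 128-bit value rendered with dashes, which this page does not confirm; the function names are hypothetical.

```python
import uuid


def normalize_trace_id(value: str) -> str:
    """Return the 32-character lowercase hex form of a trace ID.

    Accepts plain hex or a dashed UUID string, in either letter case.
    """
    return uuid.UUID(value.strip()).hex


def as_uuid(value: str) -> str:
    """Return the dashed UUID rendering of the same 128-bit ID."""
    return str(uuid.UUID(value.strip()))
```

For example, as_uuid("0af7651916cd43dd8448eb211c80319c") gives "0af76519-16cd-43dd-8448-eb211c80319c", and normalize_trace_id reverses it.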

Metrics

Where they go: CloudWatch only. The custom MetricsService in campus_core/observability/metrics.py batches counters/gauges/histograms in memory and flushes to CloudWatch every 60 seconds via boto3.client("cloudwatch").put_metric_data(...). Metrics are NOT sent to Sentry — see Why not Sentry metrics below.
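
A rough model of the batch-and-flush pattern, assuming nothing about the real class beyond its increment / gauge / record surface. The sink callable stands in for the boto3 put_metric_data call so the sketch runs without AWS; the dictionary keys (MetricName, Value, Values, Unit) follow the shape put_metric_data accepts in MetricData.

```python
import threading
from collections import defaultdict
from typing import Callable, Dict, List

Sink = Callable[[List[dict]], None]


class MetricsService:
    """In-memory batcher; `sink` stands in for the CloudWatch call."""

    def __init__(self, sink: Sink) -> None:
        self._sink = sink
        self._lock = threading.Lock()
        self._counters: Dict[str, float] = defaultdict(float)
        self._gauges: Dict[str, float] = {}
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def increment(self, name: str, value: float = 1.0) -> None:
        with self._lock:
            self._counters[name] += value

    def gauge(self, name: str, value: float) -> None:
        with self._lock:
            self._gauges[name] = value  # last write wins within a flush window

    def record(self, name: str, value: float) -> None:
        with self._lock:
            self._samples[name].append(value)

    def flush(self) -> int:
        """Swap the buffers under the lock, then emit outside it."""
        with self._lock:
            counters, self._counters = self._counters, defaultdict(float)
            gauges, self._gauges = self._gauges, {}
            samples, self._samples = self._samples, defaultdict(list)
        data = [{"MetricName": n, "Value": v, "Unit": "Count"} for n, v in counters.items()]
        data += [{"MetricName": n, "Value": v} for n, v in gauges.items()]
        data += [{"MetricName": n, "Values": vs} for n, vs in samples.items()]
        if data:
            self._sink(data)
        return len(data)
```

In a deployment, a timer or background thread would call flush() on the 60-second cadence. Swapping buffers before emitting keeps request threads from blocking on the network call.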

Why CloudWatch: it follows from the BYOC model. Each tenant's metrics live in the tenant's own AWS account, alarms fire from the tenant's CloudWatch, and alerts route to the tenant's Slack channel and email. There is no central place where all tenants' metrics aggregate; that is by design, for data residency.

What's emitted: see Agent Observability → Golden-signal metrics for the full table. Highlights:

  • chat.request.duration_ms (histogram) → p95 latency alarm
  • agent.run.error_count / agent.run.count → error-rate alarm
  • retrieval.duration_ms, tool.<name>.duration_ms → drilling into where time goes

Alarm definitions + delivery: infrastructure/app/monitoring.tf defines all 16 alarms and wires their delivery. Every alarm publishes (via alarm_actions/ok_actions) to one SNS topic (email subscription, ALERT_EMAIL) and is forwarded to Slack by an EventBridge CloudWatch Alarm State Change rule → API destination → incoming webhook (SLACK_ALERT_WEBHOOK_URL). Both channels are gated on their config being set, so a tenant without them still deploys. The two thin modules modules/sns_alert_topic and modules/slack_alarm_notifier build the delivery fabric; thresholds are tunable in the locals block at the top of monitoring.tf. The custom-metric alarms filter on the DeploymentId dimension (= DEPLOYMENT_ID, set to the sanitized env name in the task), and a deploy-contract check fails the build if any alarm watches a CampusCore metric that app code no longer emits.
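
The deploy-contract check reduces to a set difference. The sketch below assumes the alarm definitions have already been extracted into (namespace, metric_name) pairs and that custom metrics live under a namespace called "CampusCore"; both the extraction step and the namespace name are assumptions, as are the function names.

```python
from typing import Iterable, Set, Tuple

APP_NAMESPACE = "CampusCore"  # hypothetical CloudWatch namespace for custom metrics


def orphaned_alarm_metrics(
    alarms: Iterable[Tuple[str, str]], emitted: Set[str]
) -> Set[str]:
    """Return custom metrics that an alarm watches but the app no longer emits.

    `alarms` holds one (namespace, metric_name) pair per alarm. AWS-owned
    namespaces such as AWS/ECS are skipped because app code never emits them.
    """
    watched = {name for namespace, name in alarms if namespace == APP_NAMESPACE}
    return watched - emitted


def check(alarms: Iterable[Tuple[str, str]], emitted: Set[str]) -> None:
    """Exit non-zero so a CI step fails when the contract is broken."""
    missing = orphaned_alarm_metrics(alarms, emitted)
    if missing:
        raise SystemExit(f"alarms watch metrics no longer emitted: {sorted(missing)}")
```

An alarm on a metric nobody emits sits in INSUFFICIENT_DATA and never fires, which is why catching the mismatch at build time is worth a dedicated check.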

Local development: the MetricsService drops metrics on the floor in non-cloud envs by default. To see them in stdout locally, set METRICS_LOG_FALLBACK_ENABLED=true in .env — but turn it back off before deploying because it emits one log per metric per flush, which floods Sentry Logs in cloud.

Agent admin (PostgreSQL deep-dive)

The Django admin at /admin/agent-traces/ is CampusCore's purpose-built tool for agent behavior analysis. It's where you go to answer questions like:

  • What did the system prompt look like for turn 3 of conversation X?
  • What query did the agent send to search_knowledge, and what came back?
  • Which iteration introduced the wrong answer?
  • How many tokens did the LLM use across the full conversation?

Why a separate UI: Sentry's AI Agents page covers operator-grade questions ("which agent runs are slow / errored"); the Django admin is for engineer-grade forensics. The trace store is also queryable from offline eval pipelines and exportable as JSONL — a SaaS would force us to scrape its API.

Storage: spans tagged campuscore.in_agent_run land in OTelTrace / OTelSpan. See Agent Observability → Filtering: only agent spans reach the admin for how the filter works.

Vector index admin (HNSW health + maintenance)

The Django admin at /admin/observability/vector/ is the engineer's view into pgvector + HNSW state. Four tabs: Overview (inventory + heap/index health), Folder Routing (discoverability heatmap), Query Metrics (OTel rollups), Maintenance (trigger evaluation + rebuild button + log history).

When to use:

  • After a big scrape or bulk import — verify dead-tuple % stayed low and recall@10 didn't move.
  • When a user reports "search just returned nothing for a query that used to work" — the Folder Routing tab's discoverability heatmap surfaces folders sitting near the 0.46 threshold.
  • When p95 search latency creeps up — the Cache hit % and the latency-vs-baseline metric tell you whether it's RAM pressure or graph degradation.
  • Before/after a REINDEX — the Maintenance tab's composition diff shows what changed since the last rebuild.

Storage: every health check and rebuild writes one row to IndexMaintenanceLog. The findings JSONField is the audit trail. See Vector Index Observability for the full architecture + the seven metrics + the EventBridge auto-rebuild schedule.

Workflow notifications (Slack)

Long-running and scheduled workflows post lifecycle messages (start / complete / fail) to a Slack channel via campus_core.shared_utils.slack_utils.notify_workflow_event. Three current call sites:

  • maintenance.run_check — daily index health check (cron-fired or manual via dashboard)
  • maintenance.run_rebuild — HNSW REINDEX
  • scrape_webpages management command — every scrape run

Slack is the surface for "did the workflow finish?" — it's informational, never an alarm. Errors that should page still flow through Sentry. The two channels of information serve different audiences and shouldn't be confused: Slack for visible progress on intentional workflows; Sentry for unexpected errors.

The helper is no-op-safe when Slack isn't configured (empty SLACK_BOT_TOKEN → None client → log + skip). To add a new call site, call notify_workflow_event(channel_key=..., header=..., body=..., fields=...); see Slack Setup → Adding more notification channels later.
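
A sketch of the no-op-safe shape, using the keyword arguments named above. Everything else is an assumption made for illustration: the client parameter, the CHANNELS mapping, the message layout, and the choice to swallow Slack errors so a failed notification cannot fail the workflow. chat_postMessage(channel=..., text=...) is the slack_sdk WebClient method; a fake client with the same method is enough to exercise the code.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

# Hypothetical mapping from channel_key to a Slack channel name.
CHANNELS: Dict[str, str] = {"workflow_runs": "#workflow_runs"}


def notify_workflow_event(
    *,
    channel_key: str,
    header: str,
    body: str,
    fields: Optional[Dict[str, str]] = None,
    client: Optional[Any] = None,  # assumption: None when SLACK_BOT_TOKEN is empty
) -> bool:
    """Post a lifecycle message; return False (and only log) when Slack is off."""
    if client is None:
        logger.info("Slack not configured, skipping notification: %s", header)
        return False
    lines = [f"*{header}*", body]
    lines += [f"{key}: {value}" for key, value in (fields or {}).items()]
    try:
        client.chat_postMessage(channel=CHANNELS[channel_key], text="\n".join(lines))
    except Exception:
        # Design choice in this sketch: a notification failure is logged, not raised.
        logger.exception("Slack notification failed: %s", header)
        return False
    return True
```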

Cost levers

When Sentry volume gets tight (the Team plan's 50k errors / 5GB logs / 5M spans limits):

Lever | Where | When to pull
SENTRY_TRACES_SAMPLE_RATE (GH var) | Deployed env var | Always pull first; drops trace volume linearly. 0.2 default; 0.05 is fine for stable prod.
SENTRY_ENABLE_LOGS (GH var) | Deployed env var | Stop shipping logs to Sentry; keep CloudWatch. Cuts log GB to 0.
before_send_log filter | sentry_scrubber.py | Drop specific loggers (allauth, django.security.csrf) without disabling Logs entirely.
Issue alert filters | Sentry UI | Stop paging on known low-value error fingerprints.
PAYG spend cap | Sentry → Settings → Subscription | Hard ceiling. Sentry stops accepting events past the cap rather than charging more.
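
As an illustration of the before_send_log lever, the sketch below drops records from the two loggers named in the table. It assumes the hook receives the log as a dict whose attributes carry the logger name under "logger.name", and that returning None discards the record; check both against the pinned SDK before relying on them.

```python
from typing import Optional

# Loggers named in the table as candidates for dropping.
DROPPED_LOGGER_PREFIXES = ("allauth", "django.security.csrf")


def before_send_log(log: dict, hint: dict) -> Optional[dict]:
    """Discard records from noisy loggers; pass everything else through."""
    # Assumption: the logger name is exposed as attributes["logger.name"].
    name = log.get("attributes", {}).get("logger.name", "")
    for prefix in DROPPED_LOGGER_PREFIXES:
        # Match the logger itself and its children, but not look-alike names.
        if name == prefix or name.startswith(prefix + "."):
            return None
    return log
```

Matching on "prefix." rather than a bare startswith keeps a logger such as allauth_custom from being dropped by accident.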

CloudWatch has its own cost shape — log ingest is the dominant line item, then alarm count. Tune via CloudWatch's Subscription Filters and log retention policies (set on the log group in infrastructure/modules/ecs_cluster/main.tf).

BYOC implications

Each tenant lives in their own AWS account. CloudWatch is therefore per-tenant by design — logs, metrics, alarms all stay in the tenant's account. Sentry follows the same model: one Sentry project per tenant (e.g., campuscore-vsu, campuscore-howard). All deployment environments for a given tenant (pilot/staging/prod) share that one project; the Sentry environment tag (auto-populated from the GitHub Environment name) differentiates them inside the project. See Sentry Setup → Project structure.

Why not Sentry metrics

Sentry's Metrics product exists (sentry_sdk.metrics.distribution / gauge / count), and the API surface is available in our pinned SDK. We deliberately don't use it:

  • BYOC fit. Metrics drive alarming. Alarming should fire from in-tenant infrastructure (CloudWatch) so it doesn't depend on a central SaaS being available. A Sentry outage shouldn't blind us to a CampusCore outage.
  • Sentry Metrics has had a bumpy product history — sunset and re-introduced. Easier to defer until the product stabilizes.
  • Sentry Insights gives derived metrics for free — p95 transaction duration, throughput, error rate, token usage from gen_ai.usage.* attributes. For the questions ops actually asks, this is enough without explicit metric emission to Sentry.

If a specific number would be more useful on a Sentry dashboard than in CloudWatch, the cleanest path is a small dual-emit shim on MetricsService. The class is intentionally narrow (just increment / gauge / record on the public surface) so adding a Sentry sink is mechanical — see the seam at _flush in metrics.py.
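
One way such a shim could look, assuming _flush hands a batch of metric dicts to a single callable sink. That shape, the fan_out name, and the sink signatures are hypothetical and not taken from metrics.py.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)

Sink = Callable[[List[dict]], None]


def fan_out(*sinks: Sink) -> Sink:
    """Combine sinks so that one flush feeds all of them.

    A failure in one sink is logged and does not stop the others, so a
    Sentry outage cannot block the CloudWatch emit that alarms depend on.
    """

    def emit(batch: List[dict]) -> None:
        for sink in sinks:
            try:
                sink(batch)
            except Exception:
                logger.exception("metric sink failed: %r", sink)

    return emit
```

The flush seam would then receive fan_out(cloudwatch_sink, sentry_sink) in place of the single CloudWatch sink, leaving the public increment / gauge / record surface unchanged.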

When you should look where

Question | First place to look
"Did a user just hit an error?" | Sentry Issues
"Why is the system slow?" | CloudWatch alarm dashboard → Sentry Performance for specific transactions
"Did the agent answer this user's question correctly?" | Django admin agent-traces by conversation ID
"What did the LLM see and produce on turn 2?" | Same: agent admin, drill into the chat span's events
"Which release introduced this regression?" | Sentry issue page → "regressed in release X"
"Is this DB query slow?" | Sentry transaction's child psycopg span, or CloudWatch RDS Performance Insights
"Are we close to a Sentry quota?" | Sentry → Settings → Subscription
"What did we deploy in the last hour?" | GitHub Actions → deploy workflow runs
"Did this morning's scheduled index check pass?" | Slack #{client}_workflow_runs (one message per run)
"Is the HNSW index degraded right now?" | /admin/observability/vector/ Maintenance tab
"When was the last successful REINDEX?" | Same: Maintenance tab → Recent events table