Vector Index Observability

Internal architecture doc for the pgvector / HNSW health dashboard and the maintenance pipeline behind it.

Operator-facing onboarding lives in Slack Setup and the index-maintenance sections of GitHub Environment Variables. Both point back to this document.


The problem this solves

Three production failures in the previous quarter all traced back to "we had no view into the state of the index":

  1. HNSW + WHERE-clause interaction silently zeroed out folder-filtered searches. Diagnosed only after ad-hoc psql forensics, fixed by setting hnsw.ef_search = 200 + hnsw.iterative_scan = 'strict_order'.
  2. Folder discoverability score sat at 0.543 against a 0.46 threshold for the VSU Employees folder. One bad query away from breaking; no monitoring on the gap.
  3. HNSW graph degradation under incremental churn. As scrape pipelines insert and delete chunks, the graph fills with stale neighbors and recall degrades — but VACUUM only cleans the heap, not the graph. REINDEX is the only fix, and we had no signal for when to run it.

Each of these would have been a 10-second click in a dashboard. Instead they were 90-minute psql sessions. This module is the dashboard.

Architecture

                              ┌────────────────────────────────────┐
                              │  /admin/observability/vector/      │
                              │                                    │
                              │  ┌──────────┐ ┌──────────┐         │
                              │  │ Overview │ │ Routing  │         │
                              │  └──────────┘ └──────────┘         │
                              │  ┌──────────┐ ┌──────────────────┐ │
                              │  │ Metrics  │ │  Maintenance     │ │
                              │  └──────────┘ │  [Run check]     │ │
                              │               │  [Rebuild HNSW]  │ │
                              │               └──────────────────┘ │
                              └────────────────────────────────────┘
       ┌──────────────────────────────────────────────────────────────────┐
       │  apps/main_app/services/observability/                            │
       │                                                                   │
       │  vector_index_stats.py   — Overview tab data (inventory + health) │
       │  folder_discoverability  — heatmap / cannibalization / routing    │
       │  recall_eval.py          — exact-vs-approx ground-truth recall    │
       │  query_metrics.py        — OTel-driven traffic / latency rollups  │
       │  maintenance.py          — metric evaluators + run_check / run_   │
       │                            rebuild / reset_stuck_rebuild          │
       │  document_probe.py       — per-row probe panel (admin change_form)│
       └──────────────────────────────────────────────────────────────────┘
                  ┌──────────────────────────────────────────────┐
                  │  IndexMaintenanceLog (DB-backed audit trail) │
                  │  kind=check | rebuild                         │
                  │  trigger=manual | <metric_name>               │
                  │  findings={metrics, composition, ...}         │
                  └──────────────────────────────────────────────┘
                       ┌──────────────────────┴──────────────────────┐
                       ▼                                             ▼
        ┌────────────────────────────────┐         ┌─────────────────────────────────┐
        │  Slack workflow_runs channel   │         │  EventBridge daily schedule     │
        │  via campus_core.shared_utils  │         │  modules/scheduled_ecs_task/    │
        │  .slack_utils                  │         │  → auto_rebuild_if_justified    │
        └────────────────────────────────┘         └─────────────────────────────────┘

Everything sits inside the Django app — no separate metrics backend, no Prometheus, no external collector. The dashboard reads from existing tables (main_app_documentchunk, main_app_sourcefile, main_app_knowledgefolder, main_app_oteltrace/main_app_otelspan) plus the standard Postgres observability surface (pg_relation_size, pg_index, pg_stat_user_tables, pg_statio_user_indexes, pg_stat_activity). Net new tables: just IndexMaintenanceLog.

The metrics

Seven metrics, each with its own evaluator in maintenance.py. All return a MetricStatus with level ∈ {ok, warn, fail, unknown}. A failing metric is what auto_rebuild_if_justified (and a human eyeballing the dashboard) uses to decide whether to REINDEX.

| Metric | Data source | Threshold | What "fail" means |
|---|---|---|---|
| recall_drop | Mean recall@10 across the probe set. Compare HNSW top-10 vs brute-force ground truth (enable_indexscan=off). | < 0.90 | The index is missing answers a sequential scan would find. Most direct retrieval-quality signal. |
| topk_drift | Per-probe overlap with frozen_top_10_ids baseline captured at probe-creation time. | ≥1 probe with < 7/10 overlap | The graph drifted away from where it was when we last said "this index is healthy." |
| churn_high | n_tup_ins + n_tup_upd + n_tup_del from pg_stat_user_tables minus the baseline at last rebuild. | ≥ 20% of rows OR ≥ 10K absolute | Lots of writes since last rebuild. Doesn't itself mean degradation (pure inserts grow but don't rot), but combined with other signals confirms accumulation. |
| bloat_high | n_dead_tup / n_live_tup from pg_stat_all_tables. | ≥ 15% dead | Dead heap tuples → dead HNSW graph nodes pgvector still walks during search → degraded recall + latency. VACUUM cleans the heap; only REINDEX rebuilds the graph. |
| cache_low | idx_blks_hit / (idx_blks_hit + idx_blks_read) from pg_statio_user_indexes. | < 90% hit | HNSW graph paging from disk. p95 latency is about to climb. May mean shared_buffers is undersized or another workload is evicting our pages. |
| latency_p95 | p95 of tool_call.search_knowledge* span durations over 24h vs the baseline 24h after last successful rebuild. | ≥ 1.5× baseline | Search is meaningfully slower than it was right after the last rebuild. Often co-occurs with cache_low or bloat_high. |
| empty_rate | Fraction of folder-scoped search calls that returned 0 results, sniffed from OTel tool.result_summary. | > 2% over 24h | Folder filtering is rejecting matches the index would otherwise return. The original VSU-Employees failure mode. |
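
As a rough sketch of how three of these thresholds could be evaluated: the MetricStatus fields and the function names below are assumptions for illustration, not the actual maintenance.py code, and the warn bands are left out because this doc doesn't specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricStatus:
    # Hypothetical shape; the real MetricStatus may carry more fields.
    name: str
    level: str  # one of: ok, warn, fail, unknown
    value: Optional[float] = None


def eval_recall_drop(mean_recall_at_10: Optional[float]) -> MetricStatus:
    """Fail when mean recall@10 falls below 0.90; unknown when no probes ran."""
    if mean_recall_at_10 is None:
        return MetricStatus("recall_drop", "unknown")
    level = "fail" if mean_recall_at_10 < 0.90 else "ok"
    return MetricStatus("recall_drop", level, mean_recall_at_10)


def eval_churn_high(writes_since_rebuild: int, total_rows: int) -> MetricStatus:
    """Fail at >= 20% of rows OR >= 10K absolute writes since the last rebuild."""
    ratio = writes_since_rebuild / total_rows if total_rows else 0.0
    failed = ratio >= 0.20 or writes_since_rebuild >= 10_000
    return MetricStatus("churn_high", "fail" if failed else "ok", float(writes_since_rebuild))


def eval_cache_low(idx_blks_hit: int, idx_blks_read: int) -> MetricStatus:
    """Fail when the index buffer hit ratio drops below 90%."""
    total = idx_blks_hit + idx_blks_read
    if total == 0:
        # No index block traffic recorded yet, so the ratio is undefined.
        return MetricStatus("cache_low", "unknown")
    ratio = idx_blks_hit / total
    return MetricStatus("cache_low", "fail" if ratio < 0.90 else "ok", ratio)
```

Returning unknown rather than ok when there is no data keeps an empty probe set or a freshly reset stats view from reading as a clean bill of health.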

Reading combinations

The dashboard intentionally surfaces all seven side-by-side because each one alone is ambiguous. Real diagnostic value comes from combinations:

  • bloat_high failing AND recall_drop failing → genuine degradation. REINDEX.
  • bloat_high failing alone, recall fine → recent churn but no observable impact yet. Optional REINDEX; consider waiting.
  • churn_high failing alone → growth or churn, but no observable impact. Look at the composition diff on the Maintenance tab to see whether it was inserts (grew) or updates/deletes (rotted).
  • cache_low failing → memory/eviction issue, not graph state. Don't REINDEX; check shared_buffers and concurrent workloads.
  • empty_rate failing → folder routing bug, not graph state. Check whether folder description embeddings are stale; REINDEX won't help.
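
A minimal sketch of these readings as a lookup, assuming the failing metrics arrive as a set of names. The function name, the return strings, and the precedence between rules are illustrative assumptions; the dashboard leaves this judgment to the operator.

```python
from typing import Set


def recommend(failing: Set[str]) -> str:
    """Map a set of failing metric names to the guidance in the list above."""
    if not failing:
        return "no action"
    # Bloat plus an observed recall drop is the one combination that clearly calls for REINDEX.
    if "bloat_high" in failing and "recall_drop" in failing:
        return "reindex"
    # These two point away from graph state, so REINDEX would not help.
    if "cache_low" in failing:
        return "check shared_buffers and concurrent workloads"
    if "empty_rate" in failing:
        return "check folder description embeddings"
    if failing == {"bloat_high"}:
        return "optional reindex"
    if failing == {"churn_high"}:
        return "inspect composition diff"
    # Anything else is a combination the list does not cover.
    return "operator judgment"
```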

The auto_rebuild_if_justified command picks the first failing metric and uses its name as the trigger value on the resulting rebuild log row. That choice is intentionally simple — operator-side judgment for the more nuanced combinations is what the dashboard exists for.

The maintenance log

Every check or rebuild writes one row to IndexMaintenanceLog. Schema:

| Column | Values | Why |
|---|---|---|
| kind | check, rebuild | Distinguishes "we measured" from "we mutated." |
| trigger | manual, recall_drop, bloat_high, churn_high, cache_low, latency_p95, empty_rate, topk_drift | "Why does this event exist?" manual = button click or CLI invocation. Metric names = a future automated job fired this rebuild because that metric failed. |
| status | in_progress, success, failed, skipped | in_progress rows exceeding ~30 min after a deploy crash become "stuck"; see the Reset section below. |
| total_rows | int | Snapshot at event time. |
| churn_rows_since_last_rebuild | int | Pulled from the churn metric's value. |
| index_size_bytes | bigint | pg_relation_size snapshot. |
| recall_at_10_before / _after | float or null | Populated only on rebuilds, only when we re-run probes both sides of the REINDEX. |
| findings | JSONB | Full metric list + composition snapshot + before/after dead-tuple stats. The audit trail. |
| error_message, notes | text | Free-form. notes is append-only; error_message captures exceptions verbatim. |

Composition snapshots

findings.composition_before (rebuild) and findings.composition (check) carry a snapshot of:

{
  "total_rows": 30987,
  "by_folder": {"vsu-employees": 1373, "another-test-folder": 115, ...},
  "by_source_type": {"web_scrape": 28000, "knowledge_folder": 1373, ...},
  "mean_chunk_length": 2464.18,
  "captured_at": "2026-05-17T01:11:42.123456+00:00"
}

The Maintenance tab diffs the latest snapshot against the previous rebuild's snapshot to show "+1,247 rows in vsu-employees, +0 elsewhere." That's the signal for "a bulk import shifted the distribution since we last rebuilt."
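
The per-folder diff can be sketched as follows. The function name is hypothetical, and the previous vsu-employees count of 126 is invented purely so the example reproduces the "+1,247" delta quoted above.

```python
from typing import Dict


def diff_by_folder(previous: Dict[str, int], latest: Dict[str, int]) -> Dict[str, int]:
    """Return per-folder row deltas (latest minus previous), omitting unchanged folders."""
    deltas = {}
    for slug in sorted(set(previous) | set(latest)):
        delta = latest.get(slug, 0) - previous.get(slug, 0)
        if delta != 0:
            deltas[slug] = delta
    return deltas


# Illustrative by_folder snapshots: previous rebuild vs latest check.
previous = {"vsu-employees": 126, "another-test-folder": 115}
latest = {"vsu-employees": 1373, "another-test-folder": 115}

print(diff_by_folder(previous, latest))  # {'vsu-employees': 1247}
```

Taking the union of keys means a folder that appears or disappears between snapshots still shows up in the diff.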

Datetime serialization

Heap stats come from psycopg as native datetime objects. Django's JSONField uses stdlib json.dumps which doesn't serialize datetimes, so we ISO-stringify before storing — see _json_safe_dead_tup() in maintenance.py. Easy to forget when adding new findings keys; this is also why composition uses model_dump(mode="json").
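
To illustrate the failure and the fix, here is a generic recursive converter. It is a sketch, not the actual _json_safe_dead_tup() helper; the json_safe name and the sample stats values are assumptions.

```python
import json
from datetime import date, datetime, timezone
from typing import Any


def json_safe(value: Any) -> Any:
    """Recursively convert datetimes and dates to ISO-8601 strings; leave other values alone."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [json_safe(item) for item in value]
    return value


# Illustrative dead-tuple stats as a database driver might return them.
stats = {
    "n_dead_tup": 412,
    "last_autovacuum": datetime(2026, 5, 17, 1, 11, 42, tzinfo=timezone.utc),
}

try:
    json.dumps(stats)
    raised = False
except TypeError:
    raised = True  # stdlib json cannot encode datetime objects

encoded = json.dumps(json_safe(stats))
```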

Commands

Three management commands, three different audiences:

check_index_health

python manage.py check_index_health [--json]

Operator-facing. Runs all seven evaluators (including the expensive recall + topk_drift), writes a kind=check log row, prints a human-readable summary, then exits with one of:

  • 0: all metrics ok or warn
  • 1: at least one metric failed (rebuild justified)
  • 2: the check itself crashed (bug or DB unavailable)

The --json flag is for cron / EventBridge: emits the findings dict as a single line on stdout. Exit codes stay the same; the JSON is supplementary.

Also fires Slack notifications via notify_workflow_event for the start + completion. The completion message header reflects the worst level seen (✓ all ok / warnings / ✗ rebuild justified).
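
The exit-code contract for check_index_health can be sketched like this. The function names are hypothetical, and treating unknown the same as ok and warn is an assumption, since this doc only defines the codes for ok, warn, fail, and a crash.

```python
from typing import Callable, Iterable


def exit_code_for(levels: Iterable[str]) -> int:
    """Return 1 if any metric failed, otherwise 0."""
    return 1 if any(level == "fail" for level in levels) else 0


def run_cli(run_check: Callable[[], Iterable[str]]) -> int:
    """Run the check and translate the outcome into the process exit code."""
    try:
        levels = list(run_check())
    except Exception:
        # The check itself crashed (bug or DB unavailable).
        return 2
    return exit_code_for(levels)
```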

rebuild_hnsw_index

python manage.py rebuild_hnsw_index [--trigger TRIGGER] [--dry-run]

Operator-facing mutating command. Pipeline:

  1. Snapshot composition + dead-tuple stats + size + pg_stat counter.
  2. Run VACUUM ANALYZE main_app_documentchunk, which cleans dead heap tuples before REINDEX so the new graph isn't born with dead nodes baked in.
  3. Run REINDEX INDEX CONCURRENTLY main_app_docidx_embed_hnsw_cosine_v2, which doesn't block reads or writes and holds a ShareUpdateExclusiveLock on the table only (DDL blocks).
  4. Re-snapshot the same stats. Persist the delta to the rebuild log row.

--dry-run skips the actual SQL but still writes the rebuild log row with status=skipped. Useful for verifying the dashboard's "rebuild button" Terraform/IAM/Slack glue without a real REINDEX.

--trigger controls the IndexMaintenanceLog.trigger value. Operators always pass manual (the default); the auto-rebuild command passes the failing metric name.
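
A minimal sketch of the SQL portion of the rebuild (steps 2 and 3) with the --dry-run behavior. The function names and the injected execute callable are assumptions. Both VACUUM and REINDEX CONCURRENTLY refuse to run inside a transaction block, so whatever backs execute needs an autocommit connection.

```python
from typing import Callable, List

TABLE = "main_app_documentchunk"
INDEX = "main_app_docidx_embed_hnsw_cosine_v2"


def rebuild_statements() -> List[str]:
    """SQL in order: clean the heap first, then rebuild the graph."""
    return [
        f"VACUUM ANALYZE {TABLE}",
        f"REINDEX INDEX CONCURRENTLY {INDEX}",
    ]


def run_rebuild_sql(execute: Callable[[str], None], dry_run: bool = False) -> str:
    """Run the statements through `execute` and return the resulting log status."""
    if dry_run:
        # No SQL is issued; the log row is still written with this status.
        return "skipped"
    try:
        for statement in rebuild_statements():
            execute(statement)
    except Exception:
        return "failed"
    return "success"
```

Injecting execute keeps the ordering and status logic testable without a database; a fake that appends to a list is enough to assert that VACUUM runs before REINDEX.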

auto_rebuild_if_justified

python manage.py auto_rebuild_if_justified [--dry-run]

Cron-target. Always invoked by EventBridge, not humans. Pipeline:

  1. Run the full check.
  2. If findings.failing_metric_names is empty, exit 0 (no rebuild, no further Slack noise beyond the check's own ✓ all ok message).
  3. If non-empty, call run_rebuild(trigger=<first failing metric>). The rebuild posts its own start + complete Slack messages.

Exit codes:

  • 0: check clean OR rebuild succeeded
  • 1: rebuild was justified but raised
  • 2: the check itself raised

The exit-code split is so the EventBridge alarm distinguishes "found a problem and recovered" from "the schedule itself broke."
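
The control flow and exit codes can be sketched as below, with the check and rebuild injected as callables so the example stands alone. The signatures are assumptions: run_check is taken to return the failing metric names in order.

```python
from typing import Callable, List


def auto_rebuild(
    run_check: Callable[[], List[str]],
    run_rebuild: Callable[[str], None],
) -> int:
    """Return 0 on clean check or successful rebuild, 1 if the rebuild raised, 2 if the check raised."""
    try:
        failing = run_check()
    except Exception:
        return 2  # the schedule itself broke
    if not failing:
        return 0  # nothing to do
    try:
        # The first failing metric becomes the trigger on the rebuild log row.
        run_rebuild(failing[0])
    except Exception:
        return 1  # rebuild was justified but did not complete
    return 0
```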

EventBridge schedule

GitHub Environment variable: ENABLE_INDEX_MAINTENANCE_SCHEDULE=true
infrastructure/app/eventbridge.tf  → module "auto_rebuild_schedule"
infrastructure/modules/scheduled_ecs_task/
  • aws_iam_role (assumed by scheduler.amazonaws.com)
  • aws_iam_role_policy (ecs:RunTask + iam:PassRole)
  • aws_scheduler_schedule (cron expression + container override)
ECS RunTask on existing campuscore-web task definition with
command override: ["python", "manage.py", "auto_rebuild_if_justified"]

Default cron is cron(0 6 * * ? *) (06:00 UTC daily), overridable via INDEX_MAINTENANCE_SCHEDULE_CRON. The schedule pins the current task-definition revision so deploys roll forward atomically.

One schedule, not two

The original design had two schedules — a standalone daily check + a separate auto-rebuild. We collapsed to one because auto_rebuild_if_justified already runs the check internally with the same Slack lifecycle. Having both fire simultaneously would just duplicate the brute-force recall pass at 06:00 UTC every morning for no behavioral benefit.

If you want a check without a possible rebuild for a specific environment, leave ENABLE_INDEX_MAINTENANCE_SCHEDULE=false and run python manage.py check_index_health from a shell on demand.

Reusing the module

modules/scheduled_ecs_task/ is generic — anything that wants to be "Django management command, daily, ECS Fargate, same image, Slack-tied" plugs in with ~15 lines:

module "my_new_schedule" {
  source = "../modules/scheduled_ecs_task"
  count  = var.enable_my_thing ? 1 : 0

  name                = "campuscore-${local.env_sanitized}-my-thing"
  schedule_expression = var.my_thing_cron
  cluster_arn         = local.base.ecs_cluster_id
  task_definition_arn = aws_ecs_task_definition.campuscore.arn
  execution_role_arn  = local.base.ecs_task_execution_role_arn
  task_role_arn       = local.base.ecs_task_role_arn
  subnet_ids          = local.base.subnet_ids
  security_group_ids  = [local.base.ecs_tasks_sg_id]
  container_name      = "campuscore-web"
  command             = ["python", "manage.py", "my_thing"]
}

Per-table autovacuum tuning

Migration 0057 sets tighter autovacuum parameters on the document-chunk table (named main_app_documentindex when 0057 ran; renamed to main_app_documentchunk in migration 0060 — the settings persist across the rename):

ALTER TABLE main_app_documentchunk SET (
  autovacuum_vacuum_scale_factor   = 0.05,   -- VACUUM at 5% dead (vs PG default 20%)
  autovacuum_vacuum_threshold      = 500,
  autovacuum_analyze_scale_factor  = 0.05,
  autovacuum_analyze_threshold     = 500
);

Why: the PG default scale_factor=0.2 means autovacuum doesn't fire until 20% of rows are dead. For HNSW that's catastrophic — recall starts degrading visibly past 10–15% dead because dead heap tuples = dead graph nodes until the next REINDEX. The tighter 5% trigger keeps the heap mostly clean between rebuilds.
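
As a worked example of the trigger point: Postgres starts an autovacuum once dead tuples exceed threshold + scale_factor × row count. The sketch below applies that formula to an illustrative 30,987-row table; the function name is hypothetical.

```python
def autovacuum_trigger(reltuples: int, threshold: int, scale_factor: float) -> float:
    """Dead-tuple count above which autovacuum will VACUUM the table."""
    return threshold + scale_factor * reltuples


rows = 30_987

# PG defaults: autovacuum_vacuum_threshold=50, autovacuum_vacuum_scale_factor=0.2
default_trigger = autovacuum_trigger(rows, threshold=50, scale_factor=0.2)

# Per-table settings from migration 0057
tuned_trigger = autovacuum_trigger(rows, threshold=500, scale_factor=0.05)

default_pct = 100 * default_trigger / rows
tuned_pct = 100 * tuned_trigger / rows
```

At this table size the default fires at roughly 6,247 dead tuples (about 20% of rows) and the tuned settings at roughly 2,049 (about 6.6%). The fixed threshold of 500 means the effective percentage sits a little above 5% on small tables and approaches 5% as the table grows.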

If a DBA looking at pg_class.reloptions (or \d+ output) wonders why this table's autovacuum settings differ from the cluster defaults in pg_settings: this is intentional and per-table only. Other tables use PG defaults.

The Reset Stuck Rebuild affordance

If the Django process dies mid-REINDEX (deploy, OOM, ECS task killed) the in-progress log row is never updated and the dashboard button stays disabled. The Maintenance tab shows a rose banner + "Reset stuck rebuild" button when:

  1. An IndexMaintenanceLog row exists with kind=rebuild and status=in_progress, AND
  2. pg_stat_activity has no active REINDEX backend on main_app_docidx_embed_hnsw_cosine_v2.

The combination means "the log thinks a rebuild is running, but Postgres says otherwise — almost certainly orphaned." Clicking the button calls maintenance.reset_stuck_rebuild() which flips the row to status=failed with a diagnostic note.

If pg_stat_activity does show a live REINDEX, the reset is refused with a RuntimeError carrying the active PID(s). That's the safety check that prevents marking a real in-flight rebuild as failed and racing it with a fresh one. To genuinely cancel a stuck rebuild, the operator must SELECT pg_cancel_backend(<pid>) from psql first; the reset then accepts.
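
The guard logic can be sketched without a database by passing in the log row and the PIDs found in pg_stat_activity. The RebuildLogRow class, the note text, and the ValueError for a non-resettable row are assumptions; only the RuntimeError on a live REINDEX is described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RebuildLogRow:
    # Hypothetical stand-in for an IndexMaintenanceLog row.
    kind: str
    status: str
    notes: List[str] = field(default_factory=list)


def reset_stuck_rebuild(row: RebuildLogRow, active_reindex_pids: List[int]) -> RebuildLogRow:
    """Flip an orphaned in-progress rebuild row to failed, refusing if a REINDEX is live."""
    if row.kind != "rebuild" or row.status != "in_progress":
        raise ValueError("no in-progress rebuild row to reset")
    if active_reindex_pids:
        # A real rebuild is still running; marking it failed would let a second one race it.
        raise RuntimeError(f"REINDEX still running on backend PID(s): {active_reindex_pids}")
    row.status = "failed"
    row.notes.append("reset: no active REINDEX found in pg_stat_activity")
    return row
```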

Operational playbook

"Rebuild HNSW is disabled and I don't know why"

Check the Maintenance tab. There are two failure modes, and only the first shows the rose banner:

  1. Rose banner: "Rebuild log row stuck, no active REINDEX" → click Reset stuck rebuild. Button re-enables.
  2. No banner, but button still disabled → check the Overview tab's Index health card for indisvalid: ✗ INVALID. A previous REINDEX CONCURRENTLY left a broken index entry. Don't blindly retry. Investigate via SELECT indexrelid::regclass, indisvalid FROM pg_index WHERE indrelid = 'main_app_documentchunk'::regclass (the pg_indexes view does not expose indisvalid) and consider dropping the broken index manually before rebuilding.

"Recall metric is at exactly the floor"

If recall_drop reads exactly 0.90 ± 0.005, you're at the boundary and one bad probe will trip it. Two paths:

  • Investigate the worst probe. Re-run check_index_health (via the dashboard or CLI), pull the worst per-probe row from findings.metrics[0], drill into why HNSW misses ground-truth IDs. Common cause: ef_search is too low for the index's current size — try bumping HNSW_EF_SEARCH from 100 to 200 in settings.
  • Add probes that cover the gap. If one probe's ground truth is unusually hard, the probe set may not be representative of real traffic. Adjust prompts/observability_probes.py.

"Slack stopped posting after a deploy"

The most common cause is a mismatch between the pre-deploy and post-deploy state of the Slack token:

  • Diagnostic log line Slack token resolution failed: settings.SLACK_BOT_TOKEN attribute PRESENT, settings value length 0, os.environ length 0 → the deploy didn't pass the token through. Check the deploy-app job's "Export Terraform variables" step output.
  • Same line but settings length N, os.environ N → token is fine. The bug is elsewhere — check NOTIFICATION_CHANNELS['workflow_runs'] is set.

Full playbook in Slack Setup → Troubleshooting.

"How do I add a new probe?"

Edit prompts/observability_probes.py. Add a Probe(text=..., expected_folder_slug=...) at the end of the PROBES list. The probe-set hash is content-addressed (used as a cache key), so the edit busts cached folder-discoverability matrices on the next page load.
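
One way a content-addressed probe-set hash could work, as a sketch: serialize the probes deterministically and hash the bytes. The Probe dataclass here is a stand-in for the real one, and the choice of SHA-256 over canonical JSON is an assumption, not necessarily what the codebase does.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Probe:
    # Stand-in for the Probe type in prompts/observability_probes.py.
    text: str
    expected_folder_slug: str
    frozen_top_10_ids: Optional[Tuple[int, ...]] = None


def probe_set_hash(probes: List[Probe]) -> str:
    """Hash the full probe content so any edit yields a different cache key."""
    payload = json.dumps(
        [asdict(probe) for probe in probes],
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative probes; the text values are made up.
PROBES = [Probe(text="who runs payroll", expected_folder_slug="vsu-employees")]
before = probe_set_hash(PROBES)
after = probe_set_hash(PROBES + [Probe(text="parking permits", expected_folder_slug="another-test-folder")])
```

Because the key is derived from content rather than a version number, there is nothing to remember to bump when a probe is added or reworded.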

frozen_top_10_ids is intentionally optional. Leave it empty on first add; once the index has been REINDEX-ed and we trust the live top-10 result for that query, capture it via a future freeze_probe_top10 management command (not yet built — file a ticket when needed). Probes without a frozen baseline contribute to recall_drop (which uses ground-truth recall) but not to topk_drift.
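
A sketch of how topk_drift might skip probes that have no frozen baseline, using the 7/10 overlap threshold from the metrics table. The function names, the dict-based inputs, and integer chunk IDs are assumptions for illustration.

```python
from typing import Dict, List, Optional, Sequence


def overlap_at_10(frozen: Sequence[int], live: Sequence[int]) -> int:
    """Count IDs shared between the frozen and live top-10 lists."""
    return len(set(frozen[:10]) & set(live[:10]))


def topk_drift_failures(
    baselines: Dict[str, Optional[List[int]]],
    live_results: Dict[str, List[int]],
) -> List[str]:
    """Return the probe texts whose live top-10 overlaps the baseline on fewer than 7 IDs."""
    drifted = []
    for probe_text, frozen in baselines.items():
        if not frozen:
            continue  # no baseline captured yet, so this probe cannot drift
        if overlap_at_10(frozen, live_results.get(probe_text, [])) < 7:
            drifted.append(probe_text)
    return drifted
```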

What we deliberately don't do

  • Use a separate embedding_history table to track drift. The original plan included one. Walking through the actual code paths showed it doesn't earn its keep: scrape re-processing deletes and re-inserts (no in-place vector update) and the admin reindex path uses a deterministic input — there's no realistic scenario today where a vector "drifts in place" we'd detect by comparing hashes. We can revisit if we ever introduce a non-deterministic embedding pipeline.
  • Wrap the Slack SDK. campus_core/shared_utils/slack_utils.py exposes get_slack_client() (returns WebClient | None) and format_slack_message() (composes Block Kit blocks). Call sites use the SDK directly. No layer of "post()" helpers; the SDK's API is the API.
  • Run two EventBridge schedules. See "One schedule, not two" above.
  • Auto-rebuild without observed-threshold calibration. The enable_index_maintenance_schedule toggle defaults to off precisely so each tenant gets ~2 weeks of IndexMaintenanceLog data before automation fires. Thresholds in maintenance.py are heuristic — phase 2 calibrates them against observed reality.