Vector Index Observability¶
Internal architecture doc for the pgvector / HNSW health dashboard and the maintenance pipeline behind it.
Operator-facing onboarding lives in Slack Setup and the index-maintenance sections of GitHub Environment Variables. Both feed into here.
The problem this solves¶
Three production failures in the previous quarter all traced back to "we had no view into the state of the index":
- HNSW + WHERE-clause interaction silently zeroed out folder-filtered searches. Diagnosed only after ad-hoc psql forensics, fixed by setting hnsw.ef_search = 200 and hnsw.iterative_scan = 'strict_order'.
- Folder discoverability score sat at 0.543 against a 0.46 threshold for the VSU Employees folder. One bad query away from breaking; no monitoring on the gap.
- HNSW graph degradation under incremental churn. As scrape pipelines insert and delete chunks, the graph fills with stale neighbors and recall degrades — but VACUUM only cleans the heap, not the graph. REINDEX is the only fix, and we had no signal for when to run it.
Each of these would have been a 10-second click in a dashboard. Instead they were 90-minute psql sessions. This module is the dashboard.
Architecture¶
┌────────────────────────────────────┐
│ /admin/observability/vector/ │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Overview │ │ Routing │ │
│ └──────────┘ └──────────┘ │
│ ┌──────────┐ ┌──────────────────┐ │
│ │ Metrics │ │ Maintenance │ │
│ └──────────┘ │ [Run check] │ │
│ │ [Rebuild HNSW] │ │
│ └──────────────────┘ │
└────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ apps/main_app/services/observability/ │
│ │
│ vector_index_stats.py — Overview tab data (inventory + health) │
│ folder_discoverability — heatmap / cannibalization / routing │
│ recall_eval.py — exact-vs-approx ground-truth recall │
│ query_metrics.py — OTel-driven traffic / latency rollups │
│ maintenance.py — metric evaluators + run_check / run_ │
│ rebuild / reset_stuck_rebuild │
│ document_probe.py — per-row probe panel (admin change_form)│
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ IndexMaintenanceLog (DB-backed audit trail) │
│ kind=check | rebuild │
│ trigger=manual | <metric_name> │
│ findings={metrics, composition, ...} │
└──────────────────────────────────────────────┘
│
┌──────────────────────┴──────────────────────┐
▼ ▼
┌────────────────────────────────┐ ┌─────────────────────────────────┐
│ Slack workflow_runs channel │ │ EventBridge daily schedule │
│ via campus_core.shared_utils │ │ modules/scheduled_ecs_task/ │
│ .slack_utils │ │ → auto_rebuild_if_justified │
└────────────────────────────────┘ └─────────────────────────────────┘
Everything sits inside the Django app — no separate metrics backend, no Prometheus, no external collector. The dashboard reads from existing tables (main_app_documentchunk, main_app_sourcefile, main_app_knowledgefolder, main_app_oteltrace/main_app_otelspan) plus the standard Postgres observability surface (pg_relation_size, pg_index, pg_stat_user_tables, pg_statio_user_indexes, pg_stat_activity). Net new tables: just IndexMaintenanceLog.
Key files¶
- services/observability/vector_index_stats.py — Pydantic-typed aggregations for the Overview tab. Read-only PG queries; nothing writes.
- services/observability/folder_discoverability.py — N folders × M probes cosine matrix. Catches the "VSU Employees scored 0.543 against a 0.46 threshold" failure mode by surfacing it visually.
- services/observability/recall_eval.py — exact-vs-approximate recall via SET LOCAL enable_indexscan = off. The brute-force scan is the mathematical ground truth; we don't need curated expected_chunk_ids.
- services/observability/maintenance.py — every metric evaluator, run_check, run_rebuild, reset_stuck_rebuild. The single source of truth for "what does failing mean."
- models.py IndexMaintenanceLog — one row per check or rebuild event. The findings JSONField is the audit trail.
- management/commands/check_index_health.py, rebuild_hnsw_index.py, auto_rebuild_if_justified.py — CLI entrypoints. Same code paths the dashboard buttons call.
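To make the "N folders × M probes cosine matrix" idea concrete, here is a minimal sketch in plain Python. It is not the code in folder_discoverability.py: the function names, the input shapes, and the 0.1 margin are assumptions made for illustration. Only the 0.46 threshold and the 0.543 score come from this doc.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def discoverability_matrix(
    folder_vecs: Dict[str, Sequence[float]],
    probe_vecs: Sequence[Sequence[float]],
) -> Dict[str, List[float]]:
    """One row per folder slug, one column per probe."""
    return {
        slug: [cosine(vec, probe) for probe in probe_vecs]
        for slug, vec in folder_vecs.items()
    }


def thin_margins(
    matrix: Dict[str, List[float]],
    expected_slugs: Sequence[str],
    threshold: float = 0.46,  # threshold quoted in this doc
    margin: float = 0.1,      # hypothetical "too close for comfort" band
) -> List[Tuple[int, str, float]]:
    """Flag probes whose expected folder clears the threshold by less than
    `margin` (scores below the threshold are flagged too)."""
    flagged = []
    for i, slug in enumerate(expected_slugs):
        score = matrix[slug][i]
        if score - threshold < margin:
            flagged.append((i, slug, round(score, 3)))
    return flagged
```

With that assumed margin, a score of 0.543 against the 0.46 threshold leaves a gap of 0.083 and would be flagged, which is the kind of near miss the heatmap is meant to make visible.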
The metrics¶
Seven metrics, each with its own evaluator in maintenance.py. All return a MetricStatus with level ∈ {ok, warn, fail, unknown}. A failing metric is what auto_rebuild_if_justified (and a human eyeballing the dashboard) uses to decide whether to REINDEX.
| Metric | Data source | Threshold | What "fail" means |
|---|---|---|---|
| recall_drop | Mean recall@10 across the probe set. Compare HNSW top-10 vs brute-force ground truth (enable_indexscan=off). | < 0.90 | The index is missing answers a sequential scan would find. Most direct retrieval-quality signal. |
| topk_drift | Per-probe overlap with frozen_top_10_ids baseline captured at probe-creation time. | ≥1 probe with < 7/10 overlap | The graph drifted away from where it was when we last said "this index is healthy." |
| churn_high | n_tup_ins + n_tup_upd + n_tup_del from pg_stat_user_tables minus the baseline at last rebuild. | ≥ 20% of rows OR ≥ 10K absolute | Lots of writes since last rebuild. Doesn't itself mean degradation — pure inserts grow but don't rot — but combined with other signals confirms accumulation. |
| bloat_high | n_dead_tup / n_live_tup from pg_stat_all_tables. | ≥ 15% dead | Dead heap tuples → dead HNSW graph nodes pgvector still walks during search → degraded recall + latency. VACUUM cleans the heap; only REINDEX rebuilds the graph. |
| cache_low | idx_blks_hit / (idx_blks_hit + idx_blks_read) from pg_statio_user_indexes. | < 90% hit | HNSW graph paging from disk. p95 latency is about to climb. May mean shared_buffers is undersized or another workload is evicting our pages. |
| latency_p95 | p95 of tool_call.search_knowledge* span durations over 24h vs the baseline 24h after last successful rebuild. | ≥ 1.5× baseline | Search is meaningfully slower than it was right after the last rebuild. Often co-occurs with cache_low or bloat_high. |
| empty_rate | Fraction of folder-scoped search calls that returned 0 results, sniffed from OTel tool.result_summary. | > 2% over 24h | Folder filtering is rejecting matches the index would otherwise return. The original VSU-Employees failure mode. |
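The two probe-based rows can be sketched as pure functions. This is an illustration under assumptions, not the evaluators in maintenance.py: the MetricStatus fields beyond level, the function names, and the input shapes are hypothetical, and because this doc does not state any warn thresholds the sketch only returns ok, fail, or unknown.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class MetricStatus:
    name: str
    level: str  # one of: ok, warn, fail, unknown
    value: Optional[float] = None


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int], k: int = 10) -> float:
    """Share of the brute-force top-k that the HNSW top-k also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def evaluate_recall_drop(
    pairs: Sequence[Tuple[Sequence[int], Sequence[int]]], floor: float = 0.90
) -> MetricStatus:
    """pairs = one (hnsw_top10, brute_force_top10) tuple per probe."""
    if not pairs:
        return MetricStatus("recall_drop", "unknown")
    mean = sum(recall_at_k(a, e) for a, e in pairs) / len(pairs)
    return MetricStatus("recall_drop", "fail" if mean < floor else "ok", mean)


def evaluate_topk_drift(
    live: Dict[str, Sequence[int]],
    frozen: Dict[str, Sequence[int]],
    min_overlap: int = 7,
) -> MetricStatus:
    """Fail if at least one probe with a frozen baseline overlaps < 7/10."""
    compared = drifted = 0
    for probe, baseline in frozen.items():
        if not baseline:  # probes without a frozen baseline are skipped
            continue
        compared += 1
        if len(set(baseline) & set(live.get(probe, []))) < min_overlap:
            drifted += 1
    if compared == 0:
        return MetricStatus("topk_drift", "unknown")
    return MetricStatus("topk_drift", "fail" if drifted else "ok", float(drifted))
```

Keeping the evaluators free of database access, with the SQL results passed in, is a choice made for this sketch so the threshold logic can be unit-tested without a live index.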
Reading combinations¶
The dashboard intentionally surfaces all seven side-by-side because each one alone is ambiguous. Real diagnostic value comes from combinations:
- bloat_high failing AND recall_drop failing → genuine degradation. REINDEX.
- bloat_high failing alone, recall fine → recent churn but no observable impact yet. Optional REINDEX; consider waiting.
- churn_high failing alone → growth or churn, but no observable impact. Look at the composition diff on the Maintenance tab to see whether it was inserts (grew) or updates/deletes (rotted).
- cache_low failing → memory/eviction issue, not graph state. Don't REINDEX; check shared_buffers and concurrent workloads.
- empty_rate failing → folder routing bug, not graph state. Check whether folder description embeddings are stale; REINDEX won't help.
The auto_rebuild_if_justified command picks the first failing metric and uses its name as the trigger value on the resulting rebuild log row. That choice is intentionally simple — operator-side judgment for the more nuanced combinations is what the dashboard exists for.
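The contrast between the simple trigger rule and the operator-side reading can be sketched as two small functions. Everything here is illustrative: the function names are hypothetical, and the evaluation order used for "first failing" is assumed to follow the metrics table, which this doc does not confirm.

```python
from typing import Dict, List, Optional

# Assumed evaluation order (the order of the metrics table above).
METRIC_ORDER = (
    "recall_drop", "topk_drift", "churn_high", "bloat_high",
    "cache_low", "latency_p95", "empty_rate",
)


def first_failing(levels: Dict[str, str]) -> Optional[str]:
    """What the auto-rebuild path would record as the trigger."""
    for name in METRIC_ORDER:
        if levels.get(name) == "fail":
            return name
    return None


def read_combination(levels: Dict[str, str]) -> List[str]:
    """Operator-style reading of the combinations listed above."""
    failing = {name for name, level in levels.items() if level == "fail"}
    notes = []
    if {"bloat_high", "recall_drop"} <= failing:
        notes.append("genuine degradation: REINDEX")
    elif "bloat_high" in failing:
        notes.append("bloat without recall impact: REINDEX optional")
    if failing == {"churn_high"}:
        notes.append("churn only: inspect the composition diff")
    if "cache_low" in failing:
        notes.append("memory/eviction issue: check shared_buffers, do not REINDEX")
    if "empty_rate" in failing:
        notes.append("folder routing bug: REINDEX will not help")
    return notes
```

For example, a check where only cache_low fails yields the trigger cache_low from first_failing, while read_combination advises against a rebuild. That gap is the nuance the dashboard leaves to a human.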
The maintenance log¶
Every check or rebuild writes one row to IndexMaintenanceLog. Schema:
| Column | Values | Why |
|---|---|---|
| kind | check \| rebuild | Distinguishes "we measured" from "we mutated." |
| trigger | manual \| recall_drop \| bloat_high \| churn_high \| cache_low \| latency_p95 \| empty_rate \| topk_drift | "Why does this event exist?" manual = button click or CLI invocation. Metric names = a future automated job fired this rebuild because that metric failed. |
| status | in_progress \| success \| failed \| skipped | in_progress rows exceeding ~30 min after a deploy crash become "stuck" — see the Reset section below. |
| total_rows | int | Snapshot at event time. |
| churn_rows_since_last_rebuild | int | Pulled from the churn metric's value. |
| index_size_bytes | bigint | pg_relation_size snapshot. |
| recall_at_10_before / _after | float \| null | Populated only on rebuilds, only when we re-run probes both sides of the REINDEX. |
| findings | JSONB | Full metric list + composition snapshot + before/after dead-tuple stats. The audit trail. |
| error_message, notes | text | Free-form. notes is append-only; error_message captures exceptions verbatim. |
Composition snapshots¶
findings.composition_before (rebuild) and findings.composition (check) carry a snapshot of:
{
"total_rows": 30987,
"by_folder": {"vsu-employees": 1373, "another-test-folder": 115, ...},
"by_source_type": {"web_scrape": 28000, "knowledge_folder": 1373, ...},
"mean_chunk_length": 2464.18,
"captured_at": "2026-05-17T01:11:42.123456+00:00"
}
The Maintenance tab diffs the latest snapshot against the previous rebuild's snapshot to show "+1,247 rows in vsu-employees, +0 elsewhere." That's the signal for "a bulk import shifted the distribution since we last rebuilt."
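The per-folder diff can be sketched with two dictionaries shaped like the snapshot above. The function names are hypothetical and the "before" counts in the usage note are invented to reproduce the +1,247 example; the real diff lives behind the Maintenance tab.

```python
from typing import Dict


def diff_by_folder(previous: dict, latest: dict) -> Dict[str, int]:
    """Row-count delta per folder slug between two composition snapshots."""
    before = previous.get("by_folder", {})
    after = latest.get("by_folder", {})
    return {
        slug: after.get(slug, 0) - before.get(slug, 0)
        for slug in sorted(set(before) | set(after))
    }


def summarize(delta: Dict[str, int]) -> str:
    """Human-readable one-liner listing only the folders that changed."""
    changed = [f"{count:+,} rows in {slug}" for slug, count in delta.items() if count]
    return ", ".join(changed) if changed else "no change"
```

Given a previous snapshot with 126 rows in vsu-employees and a latest snapshot with 1373, summarize(diff_by_folder(previous, latest)) produces "+1,247 rows in vsu-employees"; folders whose counts did not move are left out of the summary.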
Datetime serialization¶
Heap stats come from psycopg as native datetime objects. Django's JSONField uses stdlib json.dumps which doesn't serialize datetimes, so we ISO-stringify before storing — see _json_safe_dead_tup() in maintenance.py. Easy to forget when adding new findings keys; this is also why composition uses model_dump(mode="json").
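A minimal sketch of the failure and the fix, using only the standard library. The helper name json_safe is hypothetical and is not the _json_safe_dead_tup() helper itself; the stats dictionary is made up for illustration.

```python
import json
from datetime import date, datetime, timezone


def json_safe(value):
    """Recursively replace datetime/date objects with ISO-8601 strings."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {str(key): json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [json_safe(item) for item in value]
    return value


# Illustrative heap stats, shaped like a pg_stat_user_tables row.
stats = {
    "n_dead_tup": 12,
    "last_autovacuum": datetime(2026, 5, 17, 1, 11, 42, 123456, tzinfo=timezone.utc),
}

try:
    json.dumps(stats)  # plain json.dumps rejects datetime objects
    raised = False
except TypeError:
    raised = True

encoded = json.dumps(json_safe(stats))  # succeeds after ISO-stringifying
```

Converting at the point where a findings key is built, rather than relying on a custom encoder at save time, keeps the stored JSON identical regardless of which code path writes the row.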
Commands¶
Three management commands, three different audiences:
check_index_health¶
Operator-facing. Runs all seven evaluators (including the expensive recall + topk_drift), writes a kind=check log row, prints a human-readable summary, exits:
- 0 — all metrics ok or warn
- 1 — at least one metric failed (rebuild justified)
- 2 — the check itself crashed (bug or DB unavailable)
The --json flag is for cron / EventBridge: emits the findings dict as a single line on stdout. Exit codes stay the same; the JSON is supplementary.
Also fires Slack notifications via notify_workflow_event for the start + completion. The completion message header reflects the worst level seen (✓ all ok / ⚠ warnings / ✗ rebuild justified).
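The exit-code contract and the --json behaviour can be sketched as follows. The shape of the findings dict (a metrics list of name/level entries), the function names, and the treatment of unknown as non-failing are assumptions for illustration; only the exit codes and the three header strings come from this doc.

```python
import json
from typing import Callable, Dict, Tuple


def exit_code_for(levels: Dict[str, str]) -> int:
    """0 when nothing failed, 1 when at least one metric failed.
    Treating 'unknown' as non-failing is an assumption of this sketch."""
    return 1 if "fail" in levels.values() else 0


def header_for(levels: Dict[str, str]) -> str:
    """Slack completion header reflecting the worst level seen."""
    if "fail" in levels.values():
        return "✗ rebuild justified"
    if "warn" in levels.values():
        return "⚠ warnings"
    return "✓ all ok"


def run_cli(check: Callable[[], dict], as_json: bool = False) -> Tuple[int, str]:
    """Return (exit_code, stdout_text); exit code 2 means the check crashed."""
    try:
        findings = check()
    except Exception as exc:
        return 2, f"check crashed: {exc}"
    levels = {m["name"]: m["level"] for m in findings["metrics"]}
    if as_json:
        # Compact separators and no indent keep the output on a single line.
        text = json.dumps(findings, separators=(",", ":"))
    else:
        text = header_for(levels)
    return exit_code_for(levels), text
```

Because the exit code is computed the same way on both branches, a scheduler can rely on the code alone and treat the JSON line as optional detail.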
rebuild_hnsw_index¶
Operator-facing mutating command. Pipeline:
1. Snapshot composition + dead-tuple stats + size + pg_stat counter.
2. Run VACUUM ANALYZE main_app_documentchunk — cleans dead heap tuples before REINDEX so the new graph isn't born with dead nodes baked in.
3. Run REINDEX INDEX CONCURRENTLY main_app_docidx_embed_hnsw_cosine_v2. This doesn't block reads or writes; it holds only a ShareUpdateExclusiveLock on the table, so DDL against the table is what waits.
4. Re-snapshot the same stats. Persist the delta to the rebuild log row.
--dry-run skips the actual SQL but still writes the rebuild log row with status=skipped. Useful for verifying the dashboard's "rebuild button" Terraform/IAM/Slack glue without a real REINDEX.
--trigger controls the IndexMaintenanceLog.trigger value. Operators always pass manual (the default); the auto-rebuild command passes the failing metric name.
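The four-step pipeline and the two flags can be sketched with the database access injected as callables. This is a hedged outline, not run_rebuild itself: the function name, the dict standing in for the IndexMaintenanceLog row, and the decision to still take the "before" snapshot on a dry run are assumptions.

```python
from typing import Callable

TABLE = "main_app_documentchunk"
INDEX = "main_app_docidx_embed_hnsw_cosine_v2"


def run_rebuild_sketch(
    execute: Callable[[str], None],  # runs one SQL statement
    snapshot: Callable[[], dict],    # composition + dead-tuple + size stats
    trigger: str = "manual",
    dry_run: bool = False,
) -> dict:
    """Return a dict standing in for the rebuild log row."""
    row = {
        "kind": "rebuild",
        "trigger": trigger,
        "status": "in_progress",
        "findings": {"before": snapshot()},
    }
    if dry_run:
        row["status"] = "skipped"  # no VACUUM or REINDEX is issued
        return row
    try:
        # Both statements refuse to run inside a transaction block, so
        # `execute` is expected to use an autocommit connection.
        execute(f"VACUUM ANALYZE {TABLE}")
        execute(f"REINDEX INDEX CONCURRENTLY {INDEX}")
        row["findings"]["after"] = snapshot()
        row["status"] = "success"
    except Exception as exc:
        row["status"] = "failed"
        row["error_message"] = repr(exc)
    return row
```

Passing a list's append method as execute is enough to check the statement order in a test, which is roughly what a dry run verifies for the surrounding glue.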
auto_rebuild_if_justified¶
Cron-target. Always invoked by EventBridge, not humans. Pipeline:
1. Run the full check.
2. If findings.failing_metric_names is empty, exit 0 (no rebuild, no further Slack noise beyond the check's own ✓ all ok message).
3. If non-empty, call run_rebuild(trigger=<first failing metric>). The rebuild posts its own start + complete Slack messages.
Exit codes:
- 0 — check clean OR rebuild succeeded
- 1 — rebuild was justified but raised
- 2 — the check itself raised
The exit-code split is so the EventBridge alarm distinguishes "found a problem and recovered" from "the schedule itself broke."
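The control flow and its exit codes fit in a few lines once the check and the rebuild are passed in as callables. The function name and the callable signatures are hypothetical; failing_metric_names is the findings key named in the pipeline above.

```python
from typing import Callable


def auto_rebuild_sketch(
    run_check: Callable[[], dict],
    run_rebuild: Callable[..., None],
) -> int:
    """Return the process exit code for one scheduled invocation."""
    try:
        findings = run_check()
    except Exception:
        return 2  # the check itself raised
    failing = findings.get("failing_metric_names", [])
    if not failing:
        return 0  # check clean, nothing to rebuild
    try:
        run_rebuild(trigger=failing[0])  # first failing metric names the trigger
    except Exception:
        return 1  # rebuild was justified but raised
    return 0  # rebuild succeeded
```

Note that a clean check and a successful rebuild both return 0, so telling them apart afterwards relies on the log rows and Slack messages rather than the exit code.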
EventBridge schedule¶
GitHub Environment variable: ENABLE_INDEX_MAINTENANCE_SCHEDULE=true
│
▼
infrastructure/app/eventbridge.tf → module "auto_rebuild_schedule"
│
▼
infrastructure/modules/scheduled_ecs_task/
• aws_iam_role (assumed by scheduler.amazonaws.com)
• aws_iam_role_policy (ecs:RunTask + iam:PassRole)
• aws_scheduler_schedule (cron expression + container override)
│
▼
ECS RunTask on existing campuscore-web task definition with
command override: ["python", "manage.py", "auto_rebuild_if_justified"]
Default cron is cron(0 6 * * ? *) (06:00 UTC daily), overridable via INDEX_MAINTENANCE_SCHEDULE_CRON. The schedule pins the current task-definition revision so deploys roll forward atomically.
One schedule, not two¶
The original design had two schedules — a standalone daily check + a separate auto-rebuild. We collapsed to one because auto_rebuild_if_justified already runs the check internally with the same Slack lifecycle. Having both fire simultaneously would just duplicate the brute-force recall pass at 06:00 UTC every morning for no behavioral benefit.
If you want a check without a possible rebuild for a specific environment, leave ENABLE_INDEX_MAINTENANCE_SCHEDULE=false and run python manage.py check_index_health from a shell on demand.
Reusing the module¶
modules/scheduled_ecs_task/ is generic — anything that wants to be "Django management command, daily, ECS Fargate, same image, Slack-tied" plugs in with ~15 lines:
module "my_new_schedule" {
source = "../modules/scheduled_ecs_task"
count = var.enable_my_thing ? 1 : 0
name = "campuscore-${local.env_sanitized}-my-thing"
schedule_expression = var.my_thing_cron
cluster_arn = local.base.ecs_cluster_id
task_definition_arn = aws_ecs_task_definition.campuscore.arn
execution_role_arn = local.base.ecs_task_execution_role_arn
task_role_arn = local.base.ecs_task_role_arn
subnet_ids = local.base.subnet_ids
security_group_ids = [local.base.ecs_tasks_sg_id]
container_name = "campuscore-web"
command = ["python", "manage.py", "my_thing"]
}
Per-table autovacuum tuning¶
Migration 0057 sets tighter autovacuum parameters on the document-chunk table (named main_app_documentindex when 0057 ran; renamed to main_app_documentchunk in migration 0060 — the settings persist across the rename):
ALTER TABLE main_app_documentchunk SET (
autovacuum_vacuum_scale_factor = 0.05, -- VACUUM at 5% dead (vs PG default 20%)
autovacuum_vacuum_threshold = 500,
autovacuum_analyze_scale_factor = 0.05,
autovacuum_analyze_threshold = 500
);
Why: the PG default scale_factor=0.2 means autovacuum doesn't fire until 20% of rows are dead. For HNSW that's catastrophic — recall starts degrading visibly past 10–15% dead because dead heap tuples = dead graph nodes until the next REINDEX. The tighter 5% trigger keeps the heap mostly clean between rebuilds.
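As a worked example of the arithmetic: Postgres triggers an autovacuum once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × the table's row count. The sketch below plugs in the stock defaults (threshold 50, scale factor 0.2) and the values from migration 0057, using the 30,987-row total from the composition example earlier as an illustrative table size. The function name is hypothetical.

```python
def autovacuum_trigger_point(row_count: int, threshold: int, scale_factor: float) -> float:
    """Dead-tuple count above which autovacuum schedules a VACUUM."""
    return threshold + scale_factor * row_count


ROWS = 30_987  # illustrative: total_rows from the composition snapshot example

default_point = autovacuum_trigger_point(ROWS, threshold=50, scale_factor=0.2)
tuned_point = autovacuum_trigger_point(ROWS, threshold=500, scale_factor=0.05)

# Express both as a share of the table for comparison with the 15% bloat_high line.
default_share = default_point / ROWS
tuned_share = tuned_point / ROWS
```

At this table size the default fires at roughly 6,247 dead tuples (about 20% of rows) and the tuned setting at roughly 2,049 (about 6.6%, since the fixed threshold of 500 adds to the 5% scale factor). Either way the tuned trigger sits well below the 15% bloat_high threshold.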
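If a DBA wonders why this table's autovacuum settings differ from the cluster default: this is intentional and per-table only. The overrides are stored as storage parameters in pg_class.reloptions for main_app_documentchunk; pg_settings only shows the cluster-wide values. Other tables use PG defaults.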
The Reset Stuck Rebuild affordance¶
If the Django process dies mid-REINDEX (deploy, OOM, ECS task killed) the in-progress log row is never updated and the dashboard button stays disabled. The Maintenance tab shows a rose banner + "Reset stuck rebuild" button when:
- An IndexMaintenanceLog row exists with kind=rebuild and status=in_progress, AND
- pg_stat_activity has no active REINDEX backend on main_app_docidx_embed_hnsw_cosine_v2.
The combination means "the log thinks a rebuild is running, but Postgres says otherwise — almost certainly orphaned." Clicking the button calls maintenance.reset_stuck_rebuild() which flips the row to status=failed with a diagnostic note.
If pg_stat_activity does show a live REINDEX, the reset is refused with a RuntimeError carrying the active PID(s). That's the safety check that prevents marking a real in-flight rebuild as failed and racing it with a fresh one. To genuinely cancel a stuck rebuild, the operator must SELECT pg_cancel_backend(<pid>) from psql first; the reset then accepts.
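The guard logic can be sketched without a database by passing in the log row and the list of PIDs that a pg_stat_activity lookup returned. The function name, the dict standing in for the model row, the note text, and the ValueError for a row that is not an in-progress rebuild are assumptions; the RuntimeError carrying the PID(s) is the behaviour described above.

```python
from typing import List


def reset_stuck_rebuild_sketch(row: dict, active_reindex_pids: List[int]) -> dict:
    """Return an updated copy of the row, or raise if the reset is unsafe."""
    if row.get("kind") != "rebuild" or row.get("status") != "in_progress":
        raise ValueError("no in-progress rebuild row to reset")
    if active_reindex_pids:
        # Postgres still has a live REINDEX: refuse, and say which backend(s).
        raise RuntimeError(f"REINDEX still running, pid(s): {active_reindex_pids}")
    updated = dict(row, status="failed")
    # notes is append-only, so keep what was there and add a diagnostic line.
    existing = row.get("notes", "")
    note = "reset: log row was in_progress but no active REINDEX backend found"
    updated["notes"] = f"{existing}\n{note}".strip()
    return updated
```

Returning a copy instead of mutating the input is a choice made for this sketch; it keeps the refusal paths free of side effects.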
Operational playbook¶
"Rebuild HNSW is disabled and I don't know why"¶
Check the Maintenance tab. Two failure modes disable the button, and only the first shows the rose banner:
- Rose banner: "Rebuild log row stuck, no active REINDEX" → click Reset stuck rebuild. Button re-enables.
- No banner, but button still disabled → check the Overview tab's Index health card for indisvalid: ✗ INVALID. A previous REINDEX CONCURRENTLY left a broken index entry. Don't blindly retry; investigate via SELECT * FROM pg_indexes WHERE indexname = 'main_app_docidx_embed_hnsw_cosine_v2' and consider dropping the broken index manually before rebuilding.
"Recall metric is at exactly the floor"¶
If recall_drop reads within ±0.005 of the 0.90 floor, you're at the boundary and one bad probe will trip it. Two paths:
- Investigate the worst probe. Re-run check_index_health (via the dashboard or CLI), pull the worst per-probe row from findings.metrics[0], and drill into why HNSW misses ground-truth IDs. Common cause: ef_search is too low for the index's current size. Try bumping HNSW_EF_SEARCH from 100 to 200 in settings.
- Add probes that cover the gap. If one probe's ground truth is unusually hard, the probe set may not be representative of real traffic. Adjust prompts/observability_probes.py.
"Slack stopped posting after a deploy"¶
The usual cause is a mismatch between pre-deploy and post-deploy state. The diagnostic log line tells the two cases apart:
- Diagnostic log line Slack token resolution failed: settings.SLACK_BOT_TOKEN attribute PRESENT, settings value length 0, os.environ length 0 → the deploy didn't pass the token through. Check the deploy-app job's "Export Terraform variables" step output.
- Same line but settings length N, os.environ N → token is fine. The bug is elsewhere; check that NOTIFICATION_CHANNELS['workflow_runs'] is set.
Full playbook in Slack Setup → Troubleshooting.
"How do I add a new probe?"¶
Edit prompts/observability_probes.py. Add a Probe(text=..., expected_folder_slug=...) at the end of the PROBES list. The probe-set hash is content-addressed (used as a cache key), so the edit busts cached folder-discoverability matrices on the next page load.
frozen_top_10_ids is intentionally optional. Leave it empty on first add; once the index has been REINDEX-ed and we trust the live top-10 result for that query, capture it via a future freeze_probe_top10 management command (not yet built — file a ticket when needed). Probes without a frozen baseline contribute to recall_drop (which uses ground-truth recall) but not to topk_drift.
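A sketch of what a content-addressed probe-set hash could look like. The Probe fields are the ones named above; the hashing scheme (SHA-256 over a canonical JSON dump, including the frozen ids), the function names, and the probe texts are assumptions, not the scheme in the codebase.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Probe:
    text: str
    expected_folder_slug: str
    frozen_top_10_ids: Tuple[int, ...] = ()  # optional; empty until frozen


def probe_set_hash(probes: Sequence[Probe]) -> str:
    """Hash of the probe contents; any edit to the list changes the key."""
    payload = json.dumps(
        [asdict(p) for p in probes], sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def probes_for_topk_drift(probes: Sequence[Probe]) -> List[Probe]:
    """Only probes with a frozen baseline can contribute to topk_drift."""
    return [p for p in probes if p.frozen_top_10_ids]


# Hypothetical probe list; the texts and ids are made up.
PROBES = [
    Probe("who do I contact about payroll?", "vsu-employees", tuple(range(10))),
    Probe("where is the parking policy?", "another-test-folder"),
]
```

Appending a Probe to the list yields a different probe_set_hash, so anything cached under the old key is simply never looked up again, with no explicit invalidation step.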
What we deliberately don't do¶
- Use a separate embedding_history table to track drift. The original plan included one. Walking through the actual code paths showed it doesn't earn its keep: scrape re-processing deletes and re-inserts (no in-place vector update) and the admin reindex path uses a deterministic input — there's no realistic scenario today where a vector "drifts in place" we'd detect by comparing hashes. We can revisit if we ever introduce a non-deterministic embedding pipeline.
- Wrap the Slack SDK. campus_core/shared_utils/slack_utils.py exposes get_slack_client() (returns WebClient | None) and format_slack_message() (composes Block Kit blocks). Call sites use the SDK directly. No layer of "post()" helpers; the SDK's API is the API.
- Run two EventBridge schedules. See "One schedule, not two" above.
- Auto-rebuild without observed-threshold calibration. The enable_index_maintenance_schedule toggle defaults to off precisely so each tenant gets ~2 weeks of IndexMaintenanceLog data before automation fires. Thresholds in maintenance.py are heuristic — phase 2 calibrates them against observed reality.