VPC migration cutover runbook (Workstream B)
One-time, per-client maintenance-window procedure to move an environment off the AWS default VPC onto its dedicated VPC (private subnets + NAT). The Terraform change is already merged; the database restore is a script step, not Terraform - the infra code only ever describes the steady-state DB.
What gets replaced (and why there's downtime)
Moving to a new VPC is replace-forcing for every VPC-bound resource, so the apply recreates most of the base layer:
- RDS - cannot change VPCs in place. The old instance is deleted (its data preserved in a manual snapshot) and a new instance is restored from that snapshot into the new private subnet group by a script, then adopted into Terraform with terraform import. This is the source of the downtime.
- ElastiCache - recreated in the new VPC. It only holds the regenerable worker rate-budget cache, so nothing is lost.
- ALB - recreated in the new public subnets, so it gets a new DNS name. The {env}.campuscoreai.com CNAME repoints automatically (the subdomain module reads the ALB DNS name); DNS propagation adds a few minutes.
- Security groups, target group, listeners, tasks - recreated in the new VPC; tasks move to private subnets with assign_public_ip = false.
Plan a window of roughly 30-60 minutes and announce downtime.
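To see ahead of the window which resources the apply will recreate, a saved plan can be filtered for replace actions. This is a minimal sketch, assuming jq is installed and the plan was saved to a file called tfplan (an assumed name); list_replacements is an illustrative helper, not something that exists in the repo.

```shell
# List the addresses of resources that a saved Terraform plan will replace.
# Reads the JSON form of a plan on stdin (as produced by `terraform show -json`).
list_replacements() {
  jq -r '
    .resource_changes[]
    | select(.change.actions == ["delete","create"]
          or .change.actions == ["create","delete"])
    | .address'
}

# Usage (inside base/; "tfplan" is an assumed file name):
#   terraform plan -out=tfplan
#   terraform show -json tfplan | list_replacements
```

A replacement appears in the plan JSON as a delete/create pair of actions (in either order), which is why both orderings are matched.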
Why the DB restore is a script, not Terraform
Snapshot-restore logic does not belong in the steady-state module - it is a one-time data migration.
Restoring from a snapshot is also an AWS API operation (restore-db-instance-from-db-snapshot), so it needs no SQL connectivity into the now-private subnets.
The restored instance reuses the same identifier (campuscore-db-<env>) and is then imported, so Terraform manages it normally afterward with no snapshot reference in code.
A helper automates the AWS-side steps: scripts/vpc-cutover-rds.sh (snapshot, delete old, restore, wait, print the import command).
Pre-flight
- The deploy-role permission expansion (VPC/subnet/NAT/route/flow-log) must be live. It self-heals via the pipeline's CloudFormation update-stack step.
- Confirm cost: single NAT (~$32/mo) + interface endpoints (~$73/mo across 2 AZs).
- No large ingestion/scrape running.
Cutover sequence
Run from infrastructure/ with the env's AWS credentials assumed. <env> = e.g. vsu-troy-pilot.
1. Snapshot + delete the old DB (script)
The first phase of scripts/vpc-cutover-rds.sh takes a fresh manual snapshot, disables deletion protection, deletes the old instance (the manual snapshot survives), and prints the terraform state rm 'module.rds.aws_db_instance.campuscore' command to run next.
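For orientation, the AWS-side calls behind this phase might look like the sketch below. It is not the contents of scripts/vpc-cutover-rds.sh: the snapshot name, the run wrapper, and the dry-run default are assumptions made for illustration. With DRY_RUN left at 1 it only prints the commands.

```shell
set -eu

ENV_NAME="${ENV_NAME:-vsu-troy-pilot}"              # example env name from this runbook
DB_ID="campuscore-db-${ENV_NAME}"                   # identifier convention from this runbook
SNAPSHOT_ID="${SNAPSHOT_ID:-${DB_ID}-pre-cutover}"  # hypothetical snapshot name
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print only, 0 = execute

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

snapshot_and_delete() {
  # Fresh manual snapshot; manual snapshots survive instance deletion.
  run aws rds create-db-snapshot \
    --db-instance-identifier "$DB_ID" --db-snapshot-identifier "$SNAPSHOT_ID"
  run aws rds wait db-snapshot-available --db-snapshot-identifier "$SNAPSHOT_ID"
  # Deletion protection must be off before the instance can be deleted.
  run aws rds modify-db-instance --db-instance-identifier "$DB_ID" \
    --no-deletion-protection --apply-immediately
  run aws rds delete-db-instance --db-instance-identifier "$DB_ID" \
    --skip-final-snapshot
  run aws rds wait db-instance-deleted --db-instance-identifier "$DB_ID"
  echo "Next: terraform state rm 'module.rds.aws_db_instance.campuscore'"
}

snapshot_and_delete
```

Skipping the final snapshot is reasonable here only because the manual snapshot was taken and confirmed available immediately beforehand.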
2. Build the new VPC + the DB's subnet group / SG (Terraform, targeted)
With the old instance gone, the subnet group is free to move to the new VPC. Build the networking and the DB's network prerequisites first (not the instance):
cd base
terraform apply \
  -target=module.networking \
  -target=module.ecs_cluster \
  -target=module.rds.aws_db_subnet_group.default \
  -target=module.rds.aws_security_group.rds
3. Restore the snapshot into the new subnet group (script)
The restore phase of scripts/vpc-cutover-rds.sh restores the snapshot to campuscore-db-<env> in the new subnet group + RDS security group (matching the Terraform config: db.t3.large, not publicly accessible, encrypted), waits until the instance is available, and prints the exact terraform import command.
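The restore itself is a single RDS API call plus a waiter. The sketch below shows the shape of it under stated assumptions: the subnet group name, security group ID, and snapshot name are placeholders rather than values from the repo, and the real script may pass additional parameters. With DRY_RUN left at 1 it only prints the commands.

```shell
set -eu

ENV_NAME="${ENV_NAME:-vsu-troy-pilot}"                       # example env name from this runbook
DB_ID="campuscore-db-${ENV_NAME}"                            # same identifier as the old instance
SNAPSHOT_ID="${SNAPSHOT_ID:-${DB_ID}-pre-cutover}"           # hypothetical snapshot name
SUBNET_GROUP="${SUBNET_GROUP:-campuscore-db-subnets-example}" # placeholder, not a real name
RDS_SG_ID="${RDS_SG_ID:-sg-0123456789abcdef0}"               # placeholder security group ID
DRY_RUN="${DRY_RUN:-1}"                                      # 1 = print only, 0 = execute

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

restore_db() {
  # Encryption is inherited from the snapshot, so it needs no flag here.
  run aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier "$DB_ID" \
    --db-snapshot-identifier "$SNAPSHOT_ID" \
    --db-subnet-group-name "$SUBNET_GROUP" \
    --vpc-security-group-ids "$RDS_SG_ID" \
    --db-instance-class db.t3.large \
    --no-publicly-accessible
  # Block until the restored instance reports "available".
  run aws rds wait db-instance-available --db-instance-identifier "$DB_ID"
}

restore_db
```

Any parameter that differs from the Terraform config shows up as a diff after import, so the restore flags should mirror the module's arguments.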
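4. Adopt the restored instance into Terraform
From base/, run the terraform import command printed at the end of step 3 so that Terraform manages the restored instance.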
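The import command follows from two facts stated in this runbook: the resource address used in step 1's state rm, and the campuscore-db-<env> identifier. A minimal sketch that assembles it is below; import_command is an illustrative helper, and if the script's printed command differs, the script's output should be preferred.

```shell
set -eu

ENV_NAME="${ENV_NAME:-vsu-troy-pilot}"            # example env name from this runbook
TF_ADDR="module.rds.aws_db_instance.campuscore"   # address from step 1's state rm
DB_ID="campuscore-db-${ENV_NAME}"                 # restored instance identifier

# Build the import command; aws_db_instance is imported by its DB identifier.
import_command() {
  printf 'terraform import %s %s\n' "$TF_ADDR" "$DB_ID"
}

# Prints the command for review; run the printed line from base/.
import_command
```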
5. Full apply - base then app
terraform apply # reconciles RDS (expect no-op), finishes cache/ALB/SGs/endpoints
cd ../app && terraform apply # tasks roll to private subnets, DB_HOST -> restored RDS
Review each plan. The RDS instance should show no changes after import; if it shows a diff, fix the restore parameters before continuing.
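The "no changes after import" check can be made mechanical with terraform plan -detailed-exitcode, which exits 0 when there is nothing to change, 2 when a diff is present, and 1 on error. The helpers below are an illustrative sketch, not part of the repo.

```shell
# Translate the exit code of `terraform plan -detailed-exitcode` into a verdict:
# 0 = no changes, 2 = changes present, anything else = error.
plan_verdict() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "diff" ;;
    *) echo "error" ;;
  esac
}

# Plan only the RDS instance and report whether the import left a diff.
check_rds_plan() {
  rc=0
  terraform plan -detailed-exitcode \
    -target=module.rds.aws_db_instance.campuscore >/dev/null || rc=$?
  plan_verdict "$rc"
}

# Usage (from base/, after the import):
#   check_rds_plan
```

A "diff" verdict means the restore parameters should be compared with the module's arguments before the full apply.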
6. Verify
- App healthy over the ALB (UI renders, a chat query streams).
- Data intact (a known record present).
- Web/worker tasks in private subnets with no public IP.
- Alarms settle to OK; VPC Flow Logs arriving in /vpc/campuscore-<env>/flow-logs.
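Some of these checks can be run from a terminal. The sketch below assumes the app is served over HTTPS at the environment's hostname and that alarms are checked account-wide in the current region; neither is stated in this runbook. With DRY_RUN left at 1 it only prints the commands.

```shell
set -eu

ENV_NAME="${ENV_NAME:-vsu-troy-pilot}"   # example env name from this runbook
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print only, 0 = execute

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

verify_cutover() {
  # HTTP status from the public hostname (HTTPS is an assumption).
  run curl -sS -o /dev/null -w '%{http_code}\n' \
    "https://${ENV_NAME}.campuscoreai.com/"
  # Ingestion time of the most recent flow-log stream; a null result
  # means no stream exists yet.
  run aws logs describe-log-streams \
    --log-group-name "/vpc/campuscore-${ENV_NAME}/flow-logs" \
    --order-by LastEventTime --descending --limit 1 \
    --query 'logStreams[0].lastIngestionTime'
  # Names of any alarms still in the ALARM state in this account and region.
  run aws cloudwatch describe-alarms --state-value ALARM \
    --query 'MetricAlarms[].AlarmName'
}

verify_cutover
```

These commands cover only the endpoint, flow-log, and alarm checks; the UI, chat streaming, and known-record checks still need a person.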
Rollback
If verification fails within the window: the pre-cutover snapshot restores the database, and if you kept the old default-VPC stack the old ALB/endpoints are still there to fall back to. The ECS circuit breaker auto-rolls-back a failed app task set.
Decommission (follow-up, after soak)
Once verified and soaked (a day or so):
- Restore deletion_protection = true for the DB.
- Remove any retained default-VPC leftovers with a normal apply.
- Delete the cutover snapshot when you no longer want it as a rollback.
- Confirm flow logs + interface endpoints are healthy.
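The AWS-side parts of this checklist could be scripted roughly as follows. The snapshot name is the same hypothetical one used for illustration elsewhere and must be replaced with the real cutover snapshot identifier. With DRY_RUN left at 1 it only prints the commands.

```shell
set -eu

ENV_NAME="${ENV_NAME:-vsu-troy-pilot}"              # example env name from this runbook
DB_ID="campuscore-db-${ENV_NAME}"                   # identifier convention from this runbook
SNAPSHOT_ID="${SNAPSHOT_ID:-${DB_ID}-pre-cutover}"  # hypothetical snapshot name
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print only, 0 = execute

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

decommission_checks() {
  # Should report true once deletion_protection = true has been applied.
  run aws rds describe-db-instances --db-instance-identifier "$DB_ID" \
    --query 'DBInstances[0].DeletionProtection'
  # Interface endpoints and their states ("available" when healthy).
  run aws ec2 describe-vpc-endpoints \
    --filters Name=vpc-endpoint-type,Values=Interface \
    --query 'VpcEndpoints[].[ServiceName,State]' --output text
  # Irreversible: removes the rollback snapshot.
  run aws rds delete-db-snapshot --db-snapshot-identifier "$SNAPSHOT_ID"
}

decommission_checks
```

The snapshot deletion is placed last so the read-only checks can be reviewed before the rollback path is removed.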
The infra code carries no snapshot reference at any point - the restore lived entirely in the script + import.