VPC migration cutover runbook (Workstream B)

One-time, per-client maintenance-window procedure to move an environment off the AWS default VPC onto its dedicated VPC (private subnets + NAT). The Terraform change is already merged; the database restore is a script step, not Terraform - the infra code only ever describes the steady-state DB.

What gets replaced (and why there's downtime)

Moving to a new VPC forces replacement of every VPC-bound resource, so the apply recreates most of the base layer:

  • RDS - cannot change VPCs in place. The old instance is deleted (its data preserved in a manual snapshot) and a new instance is restored from that snapshot into the new private subnet group by a script, then adopted into Terraform with terraform import. This is the source of the downtime.
  • ElastiCache - recreated in the new VPC. Only holds the regenerable worker rate-budget cache; nothing is lost.
  • ALB - recreated in the new public subnets, so it gets a new DNS name. The {env}.campuscoreai.com CNAME repoints automatically (the subdomain module reads the ALB DNS); DNS propagation adds a few minutes.
  • Security groups, target group, listeners, tasks - recreated in the new VPC; tasks move to private subnets with assign_public_ip = false.
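One way to confirm the CNAME repoint after the apply is to compare the record's target with the new ALB's DNS name. The sketch below only covers the comparison, which has to ignore letter case and the trailing root dot that dig prints. The function names and `ALB_NAME` are hypothetical and not part of the repo.

```shell
#!/bin/sh
# Sketch: compare two DNS names the way a resolver would treat them.
set -eu

# Lower-case the name and drop the trailing root dot that dig prints.
normalize_dns() {
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  printf '%s\n' "${name%.}"
}

# Succeeds when both names refer to the same host after normalisation.
same_dns_name() {
  [ "$(normalize_dns "$1")" = "$(normalize_dns "$2")" ]
}

# Live usage (ALB_NAME is a hypothetical placeholder; needs dig and AWS credentials):
#   cname=$(dig +short CNAME "${ENV_NAME}.campuscoreai.com")
#   alb=$(aws elbv2 describe-load-balancers --names "$ALB_NAME" \
#           --query 'LoadBalancers[0].DNSName' --output text)
#   same_dns_name "$cname" "$alb" && echo "CNAME repointed"
```

If the names still differ a few minutes after the apply, the record may not have propagated yet, or a resolver may be serving a cached answer.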

Plan a window of roughly 30-60 minutes and announce downtime.

Why the DB restore is a script, not Terraform

Snapshot-restore logic does not belong in the steady-state module - it is a one-time data migration. Restoring from a snapshot is also an AWS API operation (restore-db-instance-from-db-snapshot), so it needs no SQL connectivity into the now-private subnets. The restored instance reuses the same identifier (campuscore-db-<env>) and is then imported, so Terraform manages it normally afterward with no snapshot reference in code.

A helper automates the AWS-side steps: scripts/vpc-cutover-rds.sh (snapshot, delete old, restore, wait, print the import command).

Pre-flight

  1. The deploy-role permission expansion (VPC/subnet/NAT/route/flow-log) must be live. It self-heals via the pipeline's CloudFormation update-stack step.
  2. Confirm the added cost: single NAT (~$32/mo) + interface endpoints (~$73/mo across 2 AZs), roughly $105/mo combined.
  3. Confirm no large ingestion/scrape job is running.

Cutover sequence

Run from infrastructure/ with the env's AWS credentials assumed. <env> is the environment name, e.g. vsu-troy-pilot.

1. Snapshot + delete the old DB (script)

scripts/vpc-cutover-rds.sh snapshot-and-delete <env>

This takes a fresh manual snapshot, disables deletion protection, deletes the old instance (the manual snapshot survives), and prints the terraform state rm 'module.rds.aws_db_instance.campuscore' command. Run that command from base/ before step 2, so the import in step 4 does not collide with the stale state entry.
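For orientation, the AWS-side calls behind this step might look roughly like the sketch below. It is not the contents of scripts/vpc-cutover-rds.sh: the snapshot naming scheme and the `RUN` dry-run switch are assumptions made for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of a snapshot-and-delete sequence (not the real helper script).
set -eu

# RUN is a hypothetical switch: "echo" (default) prints the commands,
# an empty value executes them for real.
RUN="${RUN-echo}"

snapshot_and_delete() {
  env_name="$1"
  db_id="campuscore-db-${env_name}"
  # Hypothetical snapshot name; the real script may use a different scheme.
  snap_id="${SNAPSHOT_ID:-${db_id}-pre-vpc-cutover}"

  # 1. Take a fresh manual snapshot and block until it is usable.
  $RUN aws rds create-db-snapshot \
    --db-instance-identifier "$db_id" \
    --db-snapshot-identifier "$snap_id"
  $RUN aws rds wait db-snapshot-available --db-snapshot-identifier "$snap_id"

  # 2. Deletion protection must be off before RDS accepts the delete.
  $RUN aws rds modify-db-instance \
    --db-instance-identifier "$db_id" \
    --no-deletion-protection --apply-immediately

  # 3. Delete the instance. The manual snapshot above is the copy that is
  #    kept, so no final snapshot is requested.
  $RUN aws rds delete-db-instance \
    --db-instance-identifier "$db_id" --skip-final-snapshot
  $RUN aws rds wait db-instance-deleted --db-instance-identifier "$db_id"
}

# Dry run for the example environment: prints the five aws commands.
snapshot_and_delete vsu-troy-pilot
```

Manual snapshots are not removed when the instance is deleted, which is why the delete can skip the final snapshot.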

2. Build the new VPC + the DB's subnet group / SG (Terraform, targeted)

With the old instance gone, the subnet group is free to move to the new VPC. Build the networking and the DB's network prerequisites first (not the instance):

cd base
terraform apply \
  -target=module.networking \
  -target=module.ecs_cluster \
  -target=module.rds.aws_db_subnet_group.default \
  -target=module.rds.aws_security_group.rds

3. Restore the snapshot into the new subnet group (script)

cd ..
scripts/vpc-cutover-rds.sh restore <env>

This restores the snapshot to campuscore-db-<env> in the new subnet group + RDS security group (matching the Terraform config: db.t3.large, not publicly accessible, encrypted), waits until available, and prints the exact terraform import command.
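The restore itself can be sketched as a single AWS CLI call followed by a waiter. This is an illustration under assumptions, not the helper script: the snapshot name, the subnet group name, the security group id and the `RUN` dry-run switch are all hypothetical placeholders. No encryption flag appears because an instance restored from an encrypted snapshot is itself encrypted.

```shell
#!/bin/sh
# Sketch of a restore sequence (not the real helper script).
set -eu

# RUN is a hypothetical switch: "echo" (default) prints the commands,
# an empty value executes them for real.
RUN="${RUN-echo}"

restore_from_snapshot() {
  env_name="$1"
  subnet_group="$2"   # DB subnet group created by the targeted apply
  sg_id="$3"          # RDS security group id in the new VPC
  db_id="campuscore-db-${env_name}"
  # Hypothetical snapshot name; must match whatever the snapshot step used.
  snap_id="${SNAPSHOT_ID:-${db_id}-pre-vpc-cutover}"

  # Restore under the same identifier, with parameters that match the
  # Terraform config so the post-import plan stays a no-op.
  $RUN aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier "$db_id" \
    --db-snapshot-identifier "$snap_id" \
    --db-subnet-group-name "$subnet_group" \
    --vpc-security-group-ids "$sg_id" \
    --db-instance-class db.t3.large \
    --no-publicly-accessible

  # Block until the instance is usable, then print the adoption command.
  $RUN aws rds wait db-instance-available --db-instance-identifier "$db_id"
  printf "terraform import 'module.rds.aws_db_instance.campuscore' %s\n" "$db_id"
}

# Dry run with placeholder network identifiers.
restore_from_snapshot vsu-troy-pilot example-db-subnet-group sg-0123456789abcdef0
```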

4. Adopt the restored instance into Terraform

cd base
terraform import 'module.rds.aws_db_instance.campuscore' campuscore-db-<env>
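terraform import refuses an address that is already tracked in state, so it can help to check for a leftover entry first. The guard below is a minimal sketch: it inspects the text of a `terraform state list` listing and prints which command applies. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: decide between "state rm" and "import" from a state listing.
set -eu

ADDR='module.rds.aws_db_instance.campuscore'

# Reads a `terraform state list` listing on stdin; succeeds if the given
# address appears as a whole line.
address_in_state() {
  grep -qxF "$1"
}

# $1 = state listing text, $2 = environment name.
next_step() {
  if printf '%s\n' "$1" | address_in_state "$ADDR"; then
    # Stale entry for the deleted instance: drop it before importing.
    printf "terraform state rm '%s'\n" "$ADDR"
  else
    printf "terraform import '%s' campuscore-db-%s\n" "$ADDR" "$2"
  fi
}

# Live usage, from base/:
#   next_step "$(terraform state list)" vsu-troy-pilot

# Demo with a sample listing that still contains the old instance.
sample='module.rds.aws_db_subnet_group.default
module.rds.aws_security_group.rds
module.rds.aws_db_instance.campuscore'
next_step "$sample" vsu-troy-pilot
```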

5. Full apply - base then app

terraform apply           # reconciles RDS (expect no-op), finishes cache/ALB/SGs/endpoints
cd ../app && terraform apply   # tasks roll to private subnets, DB_HOST -> restored RDS

Review each plan. The RDS instance should show no changes after import; if it shows a diff, fix the restore parameters before continuing.
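The no-op expectation can also be checked mechanically. With -detailed-exitcode, terraform plan exits 0 when there are no changes, 2 when there is a diff, and 1 on error. The wrapper below is a sketch with hypothetical function names. It supplements reading the plan and does not replace it.

```shell
#!/bin/sh
# Sketch: turn `terraform plan -detailed-exitcode` into a go/no-go verdict.
set -eu

ADDR='module.rds.aws_db_instance.campuscore'

# Map the plan exit code to a message and a matching return status.
classify_plan_exit() {
  case "$1" in
    0) echo "no-op: safe to continue" ;;
    2) echo "diff: fix the restore parameters before continuing"; return 2 ;;
    *) echo "error: terraform plan itself failed"; return 1 ;;
  esac
}

# Run from base/ after the import. Plans only the RDS instance address.
check_rds_noop() {
  rc=0
  terraform plan -detailed-exitcode -target="$ADDR" >/dev/null || rc=$?
  classify_plan_exit "$rc"
}

# Live usage:
#   check_rds_noop && terraform apply
```

The targeted plan covers the instance and its dependencies only, so the full plan from this step still needs its own review.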

6. Verify

  • App healthy over the ALB (UI renders, a chat query streams).
  • Data intact (a known record present).
  • Web/worker tasks in private subnets with no public IP.
  • Alarms settle to OK; VPC Flow Logs arriving in /vpc/campuscore-<env>/flow-logs.
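Two of these checks lend themselves to a small script: the public-IP setting on the ECS services and the flow-log group name. The sketch below keeps the AWS calls in comments and implements only the evaluation logic. The cluster and service variables (`CLUSTER`, `WEB_SERVICE`, `WORKER_SERVICE`) and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: helpers for the post-cutover verification checks.
set -eu

# Build the flow-log group name for an environment.
flow_log_group() {
  printf '/vpc/campuscore-%s/flow-logs\n' "$1"
}

# $1 = whitespace-separated assignPublicIp values, one per service.
# Succeeds only if there is at least one value and all are DISABLED.
all_public_ip_disabled() {
  [ -n "$1" ] || return 1
  for value in $1; do
    [ "$value" = "DISABLED" ] || return 1
  done
}

# Live usage (CLUSTER, WEB_SERVICE and WORKER_SERVICE are placeholders):
#   ips=$(aws ecs describe-services --cluster "$CLUSTER" \
#           --services "$WEB_SERVICE" "$WORKER_SERVICE" \
#           --query 'services[].networkConfiguration.awsvpcConfiguration.assignPublicIp' \
#           --output text)
#   all_public_ip_disabled "$ips" && echo "tasks have no public IP"
#
#   aws logs describe-log-streams \
#     --log-group-name "$(flow_log_group vsu-troy-pilot)" \
#     --order-by LastEventTime --descending --limit 1
```

The service-level setting shows what new tasks will get. Confirming that the running tasks sit in private subnets still means inspecting their network interfaces.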

Rollback

If verification fails within the window, restore the database from the pre-cutover snapshot. If you kept the old default-VPC stack, its ALB and endpoints are still there to fall back to. The ECS deployment circuit breaker automatically rolls back a failed app task set.

Decommission (follow-up, after soak)

Once verified and soaked (a day or so):

  • Restore deletion_protection = true for the DB.
  • Remove any retained default-VPC leftovers with a normal apply.
  • Delete the cutover snapshot when you no longer want it as a rollback.
  • Confirm flow logs + interface endpoints are healthy.

The infra code carries no snapshot reference at any point - the restore lived entirely in the script + import.