Refactoring Infrastructure Safely with Terraform and a Coding Agent
How pairing terraform plan with a coding agent helps DevOps engineers catch drift, prevent data loss, and evolve infrastructure safely
Introduction
I recently went through a series of infrastructure refactorings on production systems — from Cloud Run to virtual machines, then to a managed Kubernetes cluster, and finally to GKE Autopilot. Each step carried the risk of data loss, broken services, or missed snapshots.
What made this manageable was pairing with a coding agent, using terraform plan as our shared language.
The Problem: Agents Don’t Have the State
Here’s the thing about coding agents and Terraform: the agent doesn’t have your Terraform state in its context. It has no idea what’s currently running in production. When you ask it to rewrite your infrastructure, it will happily produce clean, idiomatic Terraform code that, if applied, would destroy your production database and recreate it empty.
This isn’t a bug — it’s the nature of the tool. The agent optimizes for the code it sees, not the state it can’t see. That’s why you need a method.
The Method: Plan, Analyze, Decompose
The workflow that emerged naturally is a loop:
1. Write the ideal infrastructure. Ask the agent to rewrite the Terraform to match your target architecture. Don’t worry about migration yet — just describe where you want to end up.
2. Run terraform plan. This is where reality hits. The plan reveals every breaking change — every resource that will be destroyed, every volume that will be wiped, every service that will go down.
```
  # google_container_cluster.primary will be destroyed
  - resource "google_container_cluster" "primary" {
      - name     = "my-cluster"
      - location = "us-central1"
        ...
    }

  # google_container_cluster.primary will be created
  + resource "google_container_cluster" "primary" {
      + name     = "my-cluster"
      + location = "northamerica-northeast1"
        ...
    }
```
3. Analyze the impacts with the agent. Feed the plan output back to the agent. It excels at scanning hundreds of lines and surfacing risks: “This will destroy a persistent disk,” “This database will be recreated,” “This snapshot policy will be removed before the new one is in place.” On a large refactoring, a human will skim and miss things. The agent reads every line.
4. Replan the migration strategy. Based on the impact analysis, decompose the refactoring into safe, incremental steps. Each step addresses one breaking change without causing data loss. The agent helps rewrite the Terraform to introduce moved blocks, lifecycle rules, intermediate states, or data migration steps as needed.
5. Repeat until the plan shows only intended changes.
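The workhorse of step 4 is often a moved block, which tells Terraform that a resource changed address rather than being destroyed and recreated. A minimal sketch, with hypothetical resource names:

```hcl
# Hypothetical example: the node pool was renamed during the refactoring.
# Without this block, terraform plan shows a destroy + create pair;
# with it, the plan shows a no-op or in-place update instead.
moved {
  from = google_container_node_pool.default
  to   = google_container_node_pool.primary
}
```

moved blocks are available since Terraform 1.1 and can be deleted once every state that uses the old address has been migrated.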
The Execution: One Commit Per Step
Once the migration strategy is solid, each incremental step becomes its own commit in a release branch. You apply them one by one:
- Commit A: Add snapshot policies and take backups
- Commit B: Create the new cluster alongside the old one
- Commit C: Migrate persistent volumes with moved blocks
- Commit D: Switch traffic to the new cluster
- Commit E: Decommission the old resources
Each commit gets its own terraform plan review, its own terraform apply, and its own validation. If something goes wrong at commit C, you haven’t touched the running system beyond what commits A and B safely changed.
This is the meta-plan: a sequence of small, safe Terraform applies that together accomplish the full migration without ever putting production data at risk.
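As an illustration of what a "commit A" might contain, here is a sketch of a daily snapshot policy using the Google provider. The names, region, and schedule are hypothetical:

```hcl
# Hypothetical commit A: back up the disk before anything else moves.
resource "google_compute_resource_policy" "daily_backup" {
  name   = "daily-backup"
  region = "northamerica-northeast1"

  snapshot_schedule_policy {
    schedule {
      daily_schedule {
        days_in_cycle = 1
        start_time    = "04:00"
      }
    }
    retention_policy {
      max_retention_days = 14
    }
  }
}

# Attach the policy to the disk that must survive the migration.
resource "google_compute_disk_resource_policy_attachment" "data" {
  name = google_compute_resource_policy.daily_backup.name
  disk = "data-disk" # hypothetical disk name
  zone = "northamerica-northeast1-a"
}
```

Applying this commit first means every later, riskier commit runs with a recent snapshot to fall back on.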
My Migration Journey
Cloud Run to Virtual Machines
The agent rewrote the infra for VMs. The plan showed it would destroy the Cloud Run services and their associated storage. We decomposed: first provision the VMs and migrate data, then decommission Cloud Run.
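A guard rail worth adding during a migration like this is lifecycle.prevent_destroy: any plan that would destroy the protected resource errors out instead of scheduling the destruction. A sketch with a hypothetical bucket:

```hcl
# Hypothetical: the bucket holding the data being migrated off Cloud Run.
resource "google_storage_bucket" "app_data" {
  name     = "app-data"
  location = "US"

  lifecycle {
    # terraform plan fails if this bucket would be destroyed,
    # forcing the migration to handle the data explicitly first.
    prevent_destroy = true
  }
}
```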
Virtual Machines to Managed Kubernetes
Bigger leap — individual machine definitions became node pools and persistent volume claims. The plan output was dense: dozens of creates, dozens of destroys. The agent flagged a persistent volume being destroyed instead of migrated and a snapshot schedule being removed too early. Each flag became a separate commit in the migration sequence.
Switching Kubernetes Management Model
Same orchestrator, different model. The real value here was catching drift — subtle behavioral changes that wouldn’t show up as errors in the plan but could cause service disruptions at runtime.
Why This Works
- The agent compensates for its own blind spot. It can’t see the state, but it can analyze the plan output that reflects the state. terraform plan bridges the gap.
- Exhaustive reading. On a refactoring touching hundreds of resources, the agent reads every line. Humans skim.
- Safe iteration. terraform plan changes nothing. You can run it a hundred times while the agent adjusts the code.
- Incremental execution. Separate commits in a release branch mean each step is reviewable, reversible, and independently validated.
Conclusion
Coding agents are reckless with Terraform by default — they don’t see the state, so they don’t fear destroying production resources. But paired with terraform plan and a disciplined workflow of plan-analyze-decompose, they become powerful partners for the DevOps engineer who does see the state and does fear data loss.
The method is simple: write the ideal infra, let the plan reveal the consequences, decompose the migration into safe commits, and apply them one by one. The agent handles the exhaustive analysis and code iteration. You handle the judgment calls. Together, you ship infrastructure changes that would otherwise be too risky to attempt.