In early 2024, my Terraform setup technically worked, but operationally it was inconsistent. Plans were hard to review, state ownership was fuzzy, and environment promotions felt risky. This retrospective covers the redesign that turned Terraform from “sometimes reliable” into a predictable delivery system.

Stock photo source: Pexels , image reference: photo 1181244 .
Where the old setup failed
My initial problems were structural:
- one large root module tried to represent everything,
- environment differences were hidden in conditionals,
- state files were hard to map to ownership,
- plan output was noisy and difficult to trust,
- manual hotfixes created frequent drift.
This meant changes took longer because every apply needed extra caution.
The model I switched to
I rebuilt around strict environment boundaries:
- separate root modules per environment class,
- shared reusable modules with explicit inputs,
- remote state backend with locking,
- CI-driven plan and apply gates,
- documented ownership per state file.
The key idea was to optimize for clarity, not minimum file count.
Module design changes
I standardized module contracts so every module exposed predictable interfaces:
- required inputs were explicit and typed,
- outputs were minimal and meaningful,
- provider configuration stayed in root, not inside modules,
- every module had one clear responsibility.
I removed clever abstraction layers that had made debugging difficult.
State strategy that removed ambiguity
I moved to remote state with locking and strict naming.
terraform {
backend "s3" {
bucket = "gaborl-terraform-state"
key = "prod/network/terraform.tfstate"
region = "eu-central-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
The naming convention encoded environment and domain in the key. During incidents, this made state ownership obvious.
Plan review discipline
I stopped applying from local laptops for normal changes. The workflow became:
- pull request triggers
terraform fmt,validate, and static policy checks, - CI generates and stores plan artifact,
- reviewer approves exact plan diff,
- apply executes from controlled runner.
This reduced accidental configuration drift and made approvals auditable.
Environment promotion flow
I used progressive promotion:
- dev environment first,
- staging after smoke checks,
- production only after explicit approval and rollout window.
I avoided Terraform workspaces for critical separation because explicit environment roots were easier to reason about under pressure.
Drift management
I scheduled regular drift checks via plan-only jobs.
terraform init -input=false
terraform plan -detailed-exitcode -out=tfplan
Exit code 2 triggered investigation tickets. This caught manual changes before they became release blockers.
Incident that changed my process
I once had an urgent DNS change made manually in the provider console. Two days later, a routine apply reverted it because state and real world were out of sync.
After this:
- emergency manual changes required immediate Terraform backport,
- drift check frequency was increased,
- post-incident checklist gained a state reconciliation step.
That single rule prevented repeated rollback surprises.
Policy and guardrails
I added lightweight policy checks for common risks:
- unencrypted storage resources,
- public ingress rules outside allowed CIDRs,
- missing tags and ownership metadata,
- destructive changes without manual approval.
The goal was to block obvious hazards early, not create policy noise.
Metrics I tracked
I tracked:
- plan/apply success rate,
- mean review-to-apply time,
- drift findings per month,
- emergency rollbacks caused by infra changes.
After several weeks:
- drift findings dropped,
- failed applies became rare,
- reviewers trusted plans more,
- change windows became calmer.
Mistakes I corrected
Early mistakes in the redesign:
- too many module variables with weak defaults,
- insufficient provider version pinning,
- missing documentation for resource ownership.
Fixes:
- narrowed module interfaces,
- pinned provider versions and upgrade cadence,
- added ownership metadata as mandatory tags.
Rules that stuck
- No routine apply from personal machines.
- Every emergency console change must be reconciled in Terraform.
- Environment boundaries stay explicit.
- Plan artifacts are reviewed before apply.
- Drift detection runs on schedule, not only before releases.
Final takeaway
Terraform became predictable only when workflow discipline matched code quality. Good modules alone were not enough. Clear state ownership, strict review gates, and drift governance were what made infrastructure changes safe at speed.