Terraform in Production: Modules, State Management, and CI/CD Patterns

BackendBytes Engineering Team

Key Takeaways

  • Remote state with locking prevents concurrent corruption — on Terraform 1.10+ enable native S3 locking with `use_lockfile = true`; without locking, two applies on the same state file result in race conditions, divergence, and deleted resources
  • State locking blocks the second engineer from running apply until the first completes — if you skip locking, engineer A's refactor and engineer B's stale branch apply in parallel, destroying the database
  • Directory-per-environment (`dev/`, `staging/`, `production/`) keeps changes scoped and reviewable — mistakes in dev don't acquire the production lock; state is isolated per environment
  • `lifecycle { prevent_destroy = true }` on critical resources stops accidental destruction: any plan that would destroy the resource fails with an error, so a typo in a module surfaces before apply instead of deleting data
  • Separate `terraform plan` review from `terraform apply` in CI/CD with manual approval gates — catch destructive changes on the plan diff before resources are actually deleted

The classic Terraform-deletes-production incident. A platform team applies on a Friday afternoon. The plan shows 3 resources to update. Seconds later, the production RDS instance is gone. We debugged this exact failure mode on multiple teams — the pattern repeats whenever state locking, plan review, or prevent_destroy are missing.

The Friday Afternoon Apply

The scenario usually unfolds like this: a platform team runs terraform apply on a Friday afternoon. The plan summary shows 3 resources to update, described as "a security group change." Seconds later, the production RDS instance is gone.

The mechanism is always the same — a state-file conflict[Terraform Docs]. Engineer A refactors an RDS module and pushes to main. Engineer B, working from a stale branch, runs terraform plan against the old state and applies. Terraform destroys the database, then fails to recreate it — leaving a hole where production used to be.

graph LR
    Dev[Engineer code] -->|terraform init| Backend[(Remote state<br/>S3 + use_lockfile)]
    Dev -->|terraform plan -out=tfplan| Plan[Plan file]
    Plan -->|review + approval| Gate{CI/CD gate}
    Gate -->|approved| Apply[terraform apply tfplan]
    Apply -->|writes new state| Backend
    Backend -->|releases lock| Done[Apply done]
    Gate -.->|destroy detected| Block[BLOCK + alert]
    style Block fill:#fee
    style Backend fill:#eef

The diagram above is the entire production discipline in one picture: never apply without a saved plan, never apply without a lock, never skip the gate.

TL;DR

Protect Terraform from concurrent applies and accidental destruction: enable remote state with locking[Terraform Docs], structure projects as directory-per-environment with thin root modules, design modules with opinionated defaults and input validation, and use CI/CD pipelines with plan review gates before any apply runs.

  • Remote state + locking prevents concurrent corruption and race conditions
  • Directory-per-environment structure keeps changes scoped and reviewable
  • Opinionated modules with validation catch misconfig before provisioning

Remote State and Locking

Local state is a single point of failure and an enabler of concurrent corruption. Move to a remote backend[Terraform Docs] immediately.

Pick the backend that matches your cloud + locking story:

| Backend | Locking mechanism | Encryption at rest | Best for |
|---|---|---|---|
| S3 (native lockfile) | S3 conditional write (use_lockfile = true, Terraform 1.10+) | SSE-S3 / SSE-KMS | AWS-native infra; current default, no extra resources |
| S3 + DynamoDB (legacy) | DynamoDB conditional writes (dynamodb_table, deprecated in 1.11) | SSE-S3 / SSE-KMS | Pre-1.10 setups; keep until you migrate to use_lockfile |
| GCS | Native lock object written beside the state | Google-managed / CMEK | GCP-native infra |
| Azure Storage | Native blob lease lock | Azure-managed / customer key | Azure-native infra |
| HCP Terraform / TF Cloud | First-party state locking | Vault-backed | Multi-cloud + native run pipeline + RBAC |
| Consul | Session locking | TLS in transit | Self-hosted; on-prem; air-gapped |
| Local file | None | None | Development only; never share state |

S3 Native Locking (use_lockfile)

On Terraform 1.10+ the S3 backend locks the state file itself — no DynamoDB table, no second resource to provision[Terraform Docs]. Set use_lockfile = true:

## backend.tf — Terraform 1.10+ native S3 state locking
terraform {
  backend "s3" {
    bucket       = "my-terraform-state"
    key          = "production/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}

When two engineers run terraform apply simultaneously:

  1. Engineer A writes a lock file (<key>.tflock) next to the state object using a conditional PutObject, succeeding only if the lock does not already exist.
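  2. Engineer B's apply attempts the same conditional write; S3 rejects it because the lock object already exists.
  3. By default Engineer B's run fails immediately with a lock error; with -lock-timeout (for example -lock-timeout=5m) it retries until the lock clears. terraform force-unlock is only for locks left behind by a crashed run.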
  4. Engineer A's apply completes and deletes the lock file.
  5. Engineer B can now proceed.

This works because S3 has had strong read-after-write consistency since December 2020[AWS S3 strong consistency, 2020] and added conditional writes in 2024, so a create-only-if-absent PutObject is a reliable mutex without a separate lock store. Grant the deploy role s3:GetObject, s3:PutObject, and s3:DeleteObject on the lock object path so it can acquire and release the lock.

Without locking, both applies run in parallel, each modifying the shared state file concurrently — resulting in corrupted state and divergence between the state file and actual resources.
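
The lock permissions described above might translate into a policy document along these lines. This is a minimal sketch, not the article's own policy: it assumes the bucket and key from the backend example, and the data source name `state_access` and the separate ListBucket statement are illustrative choices.

```hcl
# Hypothetical policy for the deploy role; bucket and key match the backend example above.
data "aws_iam_policy_document" "state_access" {
  # Listing the bucket lets Terraform check whether the state object exists.
  statement {
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::my-terraform-state"]
  }

  # Read and write the state object itself.
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::my-terraform-state/production/terraform.tfstate"]
  }

  # Acquire (PutObject) and release (DeleteObject) the native lock file.
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::my-terraform-state/production/terraform.tfstate.tflock"]
  }
}
```

Scoping DeleteObject to the `.tflock` path only means the deploy role can release locks but cannot delete the state object itself.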

Legacy: S3 + DynamoDB Locking (pre-1.10)

Before 1.10, S3 had no native lock, so the state file lock lived in a DynamoDB table referenced by dynamodb_table. This still works, but dynamodb_table is deprecated as of Terraform 1.11 and slated for removal in a future minor version — prefer use_lockfile for new setups and migrate existing ones. During migration you can set both use_lockfile = true and dynamodb_table so a mixed fleet of CLI versions stays safely locked.

## backend.tf — legacy DynamoDB locking (pre-1.10)
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

With the legacy backend, Engineer A acquires a lock by writing a LockID item (with timestamp and user metadata) to the DynamoDB table; Engineer B's conflicting plan or apply cannot take the lock until that item is deleted when Engineer A's run completes.
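
For the migration window mentioned above, a transitional backend could look roughly like this. It is a sketch under the assumption that the bucket and table names match the earlier examples; with both arguments set, Terraform 1.10+ takes the S3 lock and the DynamoDB lock together.

```hcl
## backend.tf: transitional locking while CLIs are being upgraded (sketch)
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    use_lockfile   = true              # native S3 lock, Terraform 1.10+
    dynamodb_table = "terraform-locks" # legacy lock; remove once every runner is on 1.10+
  }
}
```

After changing backend arguments, run terraform init -reconfigure so the new locking settings take effect.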

Setup:

## terraform/state-backend/main.tf
## One-time setup — create the state bucket
## (the DynamoDB lock table is only needed for the legacy pre-1.10 path)
 
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state"
}
 
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}
 
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
 
## Legacy only — omit this table when using use_lockfile (Terraform 1.10+).
resource "aws_dynamodb_table" "terraform_locks" {
  name             = "terraform-locks"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "LockID"
 
  attribute {
    name = "LockID"
    type = "S"
  }
}

Run this once in a separate AWS account (or in prod with manual approval) before anything else. Every root module then points to this bucket — with use_lockfile = true the bucket is all you need; the DynamoDB table is only for the legacy locking path.

Environment-Specific Keys

Each environment gets its own state file:

## environments/dev/backend.tf
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "dev/terraform.tfstate"  # Separate from staging/production
    ...
  }
}
 
## environments/production/backend.tf
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "production/terraform.tfstate"
    ...
  }
}

A mistake in dev won't acquire the production lock. State corruption is sandboxed per environment.


Directory-Per-Environment Structure

[Terraform Docs]
infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf              # Root module — calls shared modules
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf           # dev/terraform.tfstate
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf           # staging/terraform.tfstate
│   └── production/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── backend.tf           # production/terraform.tfstate
├── modules/
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── database/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── eks-cluster/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf

Each environment is a root module, the entry point for terraform init/apply. Shared logic lives in modules/: child modules that the root modules call. Root modules stay thin: compose modules, pass env-specific values, wire outputs.

The promotion flow makes the discipline explicit — every change lands in dev first, never starts in production:

graph TD
    Change[Change request] --> Dev[environments/dev/<br/>terraform apply]
    Dev -->|smoke test passes| Staging[environments/staging/<br/>terraform apply]
    Dev -->|smoke test fails| Fix[Fix module + re-test]
    Fix --> Dev
    Staging -->|integration test passes| ProdPlan[environments/production/<br/>terraform plan -out=tfplan]
    ProdPlan --> Review{Plan review +<br/>change advisory}
    Review -->|approved| ProdApply[terraform apply tfplan<br/>during change window]
    Review -->|denied| Iterate[Iterate on plan]
    Iterate --> ProdPlan
    style ProdApply fill:#dfd
    style ProdPlan fill:#ffd
    style Review fill:#ffd

Each arrow is a guardrail. Skip an arrow and the next outage gets traced back to that decision in the post-mortem[Terraform Docs].

Example environments/production/main.tf:

module "networking" {
  source = "../../modules/networking"
 
  environment         = "production"
  vpc_cidr            = "10.0.0.0/16"
  availability_zones  = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}
 
module "database" {
  source = "../../modules/database"
 
  environment         = "production"
  vpc_id              = module.networking.vpc_id
  subnet_ids          = module.networking.private_subnet_ids
  instance_class      = "db.r6g.xlarge"
  multi_az            = true
  deletion_protection = true
}
 
module "eks" {
  source = "../../modules/eks-cluster"
 
  environment     = "production"
  vpc_id          = module.networking.vpc_id
  subnet_ids      = module.networking.private_subnet_ids
  cluster_version = "1.31"
  node_groups = {
    general = {
      instance_types = ["m6i.xlarge"]
      min_size       = 3
      max_size       = 10
      desired_size   = 5
    }
  }
}

Why not workspaces? Workspaces let you manage multiple environments from a single directory via terraform workspace select prod. Fewer files, but:

  • Running a command with the wrong workspace selected applies the change to the wrong environment.
  • PR diffs don't show which environment changes affect.
  • CI/CD requires explicit workspace switching (easy to forget).

Directory-per-environment makes the blast radius obvious: a PR modifying environments/production/ is clearly high-risk.


Opinionated Module Design with Validation

A well-designed module works with zero config for the common case and exposes knobs for the rest.

Defaults and Constraints

## modules/database/variables.tf
 
variable "environment" {
  type        = string
  description = "Environment name (dev, staging, production)"
 
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be one of: dev, staging, production."
  }
}
 
variable "instance_class" {
  type        = string
  default     = "db.t4g.medium"  # Sufficient for dev/staging
  description = "RDS instance class."
 
  validation {
    condition     = can(regex("^db\\.", var.instance_class))
    error_message = "Instance class must start with 'db.' (e.g., db.t4g.medium)."
  }
}
 
variable "multi_az" {
  type        = bool
  default     = false  # Cheap for dev
  description = "Enable Multi-AZ deployment."
}
 
variable "deletion_protection" {
  type        = bool
  default     = true  # Protected by default
  description = "Prevent accidental deletion."
}
 
variable "backup_retention_period" {
  type        = number
  default     = 7
  description = "Days to retain automated backups."
}
 
variable "subnet_ids" {
  type        = list(string)
  description = "Subnet IDs for the DB subnet group."
 
  validation {
    condition     = length(var.subnet_ids) >= 2
    error_message = "At least 2 subnet IDs required."
  }
}

Safe defaults (deletion_protection = true, multi_az = false) let dev environments work out of the box. Production overrides them explicitly.
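
If you would rather enforce the production override than rely on convention, a cross-variable validation is one option. The sketch below is an assumption layered on the module above, not part of it: it needs Terraform 1.9 or later (where a validation condition may reference other variables), and the rule that production requires Multi-AZ is borrowed from the test assertions later in this article.

```hcl
# Sketch only: requires Terraform 1.9+ for cross-variable references in validation.
variable "environment" {
  type        = string
  description = "Environment name (dev, staging, production)"
}

variable "multi_az" {
  type        = bool
  default     = false
  description = "Enable Multi-AZ deployment."

  validation {
    # Passes for dev and staging; production must opt in to Multi-AZ.
    condition     = var.environment != "production" || var.multi_az
    error_message = "Production databases must set multi_az = true."
  }
}
```

With this in place, a production root module that forgets multi_az = true fails at plan time rather than provisioning a single-AZ database.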

The Core Resource Block

## modules/database/main.tf
 
resource "aws_db_instance" "main" {
  identifier             = "${var.environment}-postgres"
  engine                 = "postgres"
  engine_version         = var.engine_version
  instance_class         = var.instance_class
  allocated_storage      = var.allocated_storage
  storage_type           = "gp3"
  multi_az               = var.multi_az
  deletion_protection    = var.deletion_protection
  backup_retention_period = var.backup_retention_period
 
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.db.id]
 
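  skip_final_snapshot       = var.environment == "dev"
  final_snapshot_identifier = var.environment != "dev" ? "${var.environment}-final-${formatdate("YYYY-MM-DD-hhmm", timestamp())}" : null

  lifecycle {
    # timestamp() changes on every run; ignoring the attribute avoids a perpetual plan diff.
    ignore_changes = [final_snapshot_identifier]
  }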
 
  tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
  }
}
 
resource "aws_security_group" "db" {
  name_prefix = "${var.environment}-db-"
  vpc_id      = var.vpc_id
 
  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = var.allowed_security_group_ids
  }
 
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

One validation block catches at plan time what would otherwise become a support ticket. Five minutes of validation saves an incident.


CI/CD Pipeline with Plan Review

[Terraform Docs]

Every change must be reviewed as a plan before applying. The workflow has three jobs:

  1. plan: matrix over [dev, staging, production]. setup-terraform → init → plan -out=tfplan → upload artifact → PR comment with the plan diff.
  2. security-scan: aquasecurity/tfsec-action + bridgecrewio/checkov-action on every change, hard-fail on violations.
  3. apply: gated on github.ref == 'refs/heads/main' AND a push event. environment: production forces a manual-approval gate before the apply step runs.

## .github/workflows/terraform.yml — shape only
jobs:
  plan:          # plan + upload artifact + comment on PR
  security-scan: # tfsec + checkov, hard fail
  apply:         # download artifact, apply, manual approval
    needs: [plan, security-scan]
    environment: production

For the full workflow (artifact upload/download plumbing, matrix strategy, and PR-comment script), start from the official hashicorp/setup-terraform action, HashiCorp's maintained GitHub Action for running Terraform in CI.

Key gates:

  • Plan on every PR (reviewers see the diff).
  • Security scan blocks apply if violations are found.
  • Apply only on merge to main (not during PRs).
  • Manual approval for production (GitHub environment protection rule with required reviewers).

Production Checklist

Before deploying production Terraform:

  • Remote state with locking enabled — use_lockfile = true on Terraform 1.10+ (or the legacy DynamoDB table on older versions), encryption enabled
  • Separate state files per environment — dev won't lock production
  • Directory-per-environment structure with thin root modules
  • Input validation blocks on all variables — constraints at plan time
  • prevent_destroy on databases, load balancers, anything not easily recreated
  • moved blocks for refactoring (instead of manual terraform state mv)
  • Plan-on-PR gate — reviewers must approve diffs before apply
  • Security scanning with tfsec/Checkov on every plan
  • Drift detection running on a schedule with alerting (state != reality)
  • IAM role for CI/CD with least-privilege permissions
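
Two checklist items, prevent_destroy and moved blocks, have no example elsewhere in this article. A minimal sketch of both follows, using the state bucket from the setup section; the earlier resource name aws_s3_bucket.state is hypothetical and only there to illustrate a rename.

```hcl
# Guard the state bucket against accidental destruction.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state"

  lifecycle {
    # Any plan that would destroy this bucket fails with an error instead.
    prevent_destroy = true
  }
}

# Hypothetical rename: the bucket was previously declared as aws_s3_bucket.state.
# The moved block updates the state address without a manual `terraform state mv`.
moved {
  from = aws_s3_bucket.state
  to   = aws_s3_bucket.terraform_state
}
```

Note that prevent_destroy must be a literal value, and it only protects the resource while its block remains in the configuration; deleting the block removes the guard along with it.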

Drift detection script (run on schedule):

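#!/usr/bin/env bash
## Check for state drift: alert if actual resources diverge from state
set -uo pipefail

for env in dev staging production; do
  dir="infrastructure/environments/$env"

  # Initialize the backend; plan refreshes state in memory, so the
  # deprecated `terraform refresh` is not needed.
  terraform -chdir="$dir" init -input=false -no-color > /dev/null

  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes (drift)
  terraform -chdir="$dir" plan -detailed-exitcode -lock=false -input=false -no-color \
    > "/tmp/plan-$env.log" 2>&1

  case $? in
    0) echo "$env: state matches reality" ;;
    2) echo "$env: drift detected, investigate immediately"
       # Send alert to Slack/PagerDuty
       ;;
    *) echo "$env: plan failed, see /tmp/plan-$env.log" >&2 ;;
  esac
done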

Continuous Drift Detection

State drift is what happens when reality diverges from the state file: a teammate fixes a security group manually during an incident, an autoscaler resizes a node group, or a wide-blast IAM role gets edited in the AWS console. The next terraform apply either reverts the fix (causing a second incident) or fails noisily because the resource it expects no longer exists. Drift always wins eventually — the question is whether you find it on a Tuesday morning or during a Friday outage.

Run drift detection on a schedule and alert when anything diverges. The simplest version is a GitHub Actions cron that runs terraform plan -detailed-exitcode against every environment and posts to Slack on exit code 2 (changes detected):

## .github/workflows/drift-detection.yml
name: Drift detection
 
on:
  schedule:
    - cron: "0 6 * * *"   # 06:00 UTC daily
  workflow_dispatch: {}
 
permissions:
  contents: read
  id-token: write          # OIDC into AWS, no static keys
 
jobs:
  drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        env: [dev, staging, production]
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111122223333:role/github-tf-readonly
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.10.5   # use_lockfile in the backend needs 1.10+
      - name: Plan ${{ matrix.env }}
        id: plan
        working-directory: infrastructure/environments/${{ matrix.env }}
        run: |
          terraform init -input=false
          terraform plan -detailed-exitcode -lock=false -no-color -out=tfplan
        continue-on-error: true
      - name: Notify on drift
        # Exit code 2 marks the plan step as failed, so test the wrapper's exitcode output, not outcome.
        if: steps.plan.outputs.exitcode == '2'
        run: |
          curl -X POST -H 'Content-type: application/json' \
            --data "{\"text\":\":warning: drift in ${{ matrix.env }} — see run $GITHUB_RUN_ID\"}" \
            "$SLACK_WEBHOOK"
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_DRIFT_WEBHOOK }}

Exit code 0 means no changes, 1 is an error, 2 is "drift detected" — that contract has been stable since Terraform 0.12. Pair the schedule with -lock=false so the read-only plan never blocks an in-flight apply, and use aws-actions/configure-aws-credentials with OIDC so no long-lived secrets sit in GitHub.

OpenTofu vs Terraform Licensing

In August 2023 HashiCorp relicensed Terraform from MPL-2.0 to the Business Source License (BUSL-1.1), restricting "competitive offerings" of Terraform-as-a-service. The community forked the last MPL-2.0 code as OpenTofu, now a Linux Foundation project that ships independent releases and reads existing Terraform state files. Terraform 1.6+ is BUSL; OpenTofu 1.6+ is MPL.

For the vast majority of teams this is a non-event: both CLIs accept the same .tf files, the backends are interchangeable, and OpenTofu's own registry serves the same provider and module addresses for most of what the Terraform Registry publishes. The decision matrix: [Terraform Docs]

| Concern | Stay on Terraform | Migrate to OpenTofu |
|---|---|---|
| Embedding Terraform CLI in a paid SaaS product | Risk of BUSL violation | Safe (MPL) |
| Need state encryption client-side | Paid HCP only | Native in OpenTofu 1.7+ |
| Need provider mocking in tests | Terraform 1.7+ test framework | OpenTofu 1.8+ mock_provider |
| Buying HCP/TFC support contract | Stay on Terraform | N/A |
| Air-gapped on-prem with no commercial relationship | Either | Either |

Migration is short: install OpenTofu, then swap terraform for tofu in CI:

## Replace terraform with OpenTofu — state files are wire-compatible
curl -fsSL https://get.opentofu.org/install-opentofu.sh | sh -s -- --install-method standalone
tofu --version    # prints the installed OpenTofu version
tofu init -upgrade
tofu plan -out=tfplan && tofu apply tfplan

The risk is provider compatibility on the bleeding edge: a new release of a HashiCorp-maintained provider may reach the OpenTofu registry later than it reaches the Terraform Registry. Pin provider versions explicitly in required_providers and run a smoke-test environment before committing.
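
Pinning can be as small as the block below. This is a sketch; the constraint "~> 5.0" is a hypothetical example chosen to show the syntax, not a recommendation for a specific provider release.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # hypothetical constraint: allows any 5.x release, blocks 6.0
    }
  }
}
```

Commit the generated .terraform.lock.hcl as well, so every runner resolves the same provider build regardless of which CLI it uses.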

Module Testing with terraform test

Terraform 1.6 (October 2023) introduced a native testing framework that runs in the same process as the CLI — no Go toolchain, no real cloud spend for plan-level checks. Tests live alongside the module in tests/*.tftest.hcl and assert against plan output, applied state, or both.

## modules/database/tests/defaults.tftest.hcl
variables {
  environment        = "dev"
  vpc_id             = "vpc-12345678"
  subnet_ids         = ["subnet-aaa", "subnet-bbb"]
  allowed_security_group_ids = ["sg-app"]
}
 
run "validate_required_inputs" {
  command = plan
 
  assert {
    condition     = aws_db_instance.main.deletion_protection == true
    error_message = "Deletion protection must default to true."
  }
 
  assert {
    condition     = aws_db_instance.main.multi_az == false
    error_message = "Multi-AZ should default off for cheap dev environments."
  }
}
 
run "rejects_single_subnet" {
  command = plan
  variables {
    subnet_ids = ["subnet-only-one"]
  }
  expect_failures = [var.subnet_ids]
}
 
run "production_overrides" {
  command = plan
  variables {
    environment = "production"
    multi_az    = true
    instance_class = "db.r6g.xlarge"
  }
  assert {
    condition     = aws_db_instance.main.multi_az == true
    error_message = "Production must enable Multi-AZ."
  }
}

Run with terraform test from the module directory; the framework auto-discovers tests/, so no extra flags are needed. Run blocks in a file execute in order and share state, and a later block can read an earlier block's outputs as run.<name>.<output>. For higher-fidelity integration tests against real AWS APIs, fall back to Terratest (Go-based, asserts on real cloud state, costs real money). The rule of thumb: validate contracts and policy with terraform test, validate end-to-end behaviour with Terratest in a sandbox account.
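
Plan-level runs still configure the AWS provider, which normally means credentials on the runner. On Terraform 1.7+ a mocked provider can remove that dependency. The file below is a sketch, assuming the same module variables as the test file above; the file name mocked.tftest.hcl and the run name are hypothetical.

```hcl
## modules/database/tests/mocked.tftest.hcl (hypothetical file name)
# Requires Terraform 1.7+; the mocked provider makes no AWS API calls.
mock_provider "aws" {}

variables {
  environment                = "dev"
  vpc_id                     = "vpc-12345678"
  subnet_ids                 = ["subnet-aaa", "subnet-bbb"]
  allowed_security_group_ids = ["sg-app"]
}

run "defaults_without_credentials" {
  command = plan

  assert {
    condition     = aws_db_instance.main.deletion_protection == true
    error_message = "Deletion protection must default to true."
  }
}
```

Mocked runs are suited to checking defaults and validation rules; computed attributes come back as generated placeholder values, so assertions on them say little about real AWS behaviour.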

Cross-Account State with assume_role

Production state should not live in the production account. If a compromised production credential can also tamper with the state file, drift detection becomes a feedback loop the attacker controls. The hardened pattern stores state in a dedicated tooling/security account and uses STS AssumeRole to reach it.

## environments/production/backend.tf
terraform {
  backend "s3" {
    bucket       = "company-tf-state-prod"
    key          = "production/terraform.tfstate"
    region       = "us-east-1"
    use_lockfile = true # native locking (1.10+); lock object lives beside the state
    encrypt      = true
    kms_key_id   = "arn:aws:kms:us-east-1:444455556666:key/state-cmk"
 
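    # Top-level role_arn/session_name/external_id are deprecated since Terraform 1.6.
    assume_role = {
      role_arn     = "arn:aws:iam::444455556666:role/tf-state-writer"
      session_name = "terraform-prod"
      external_id  = "prod-state-writer-2026"
    }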
  }
}
 
provider "aws" {
  region = "us-east-1"
  assume_role {
    role_arn     = "arn:aws:iam::111122223333:role/tf-deploy"
    session_name = "terraform-prod-deploy"
    external_id  = "prod-deploy-2026"
  }
}

The external_id requirement on both roles guards against the confused deputy problem: each role's trust policy honors AssumeRole only when the caller presents the agreed-upon string. Treat the value as an identifier rather than a secret, since it sits in version control next to the role ARN. Pair this with KMS-key encryption on the state bucket using a CMK whose key policy denies decrypt to the production account; only the state-writer role in the tooling account can read state, even if the production account is fully compromised.
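
On the tooling-account side, the external ID is enforced in the role's trust policy. The following is a minimal sketch of that policy for tf-state-writer; the CI principal ARN (github-tf-ci) is hypothetical and stands in for whatever role your pipeline runs as.

```hcl
# Sketch of the trust policy for tf-state-writer in the tooling account.
# The CI principal ARN is hypothetical; substitute the role your pipeline runs as.
data "aws_iam_policy_document" "state_writer_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::111122223333:role/github-tf-ci"]
    }

    # Reject AssumeRole calls that do not present the agreed external ID.
    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = ["prod-state-writer-2026"]
    }
  }
}

resource "aws_iam_role" "state_writer" {
  name               = "tf-state-writer"
  assume_role_policy = data.aws_iam_policy_document.state_writer_trust.json
}
```

Keeping the principal list to a single CI role, rather than the whole production account, is what limits the blast radius if other production credentials leak.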


Frequently Asked Questions

How do you prevent accidental resource destruction with Terraform?

Use lifecycle { prevent_destroy = true } on critical resources, require plan review before apply in CI/CD, enable remote state with locking to prevent concurrent applies, and always run terraform plan as a separate reviewed step before terraform apply.

What is Terraform state locking and why is it important?

State locking prevents two engineers from running terraform apply simultaneously against the same state file, which can cause resource conflicts, deletions, or corrupted state. On Terraform 1.10+ the S3 backend locks natively with use_lockfile = true and needs no extra infrastructure; the older S3 + DynamoDB lock table still works, but dynamodb_table is deprecated as of 1.11. Either way, never store state locally for shared infrastructure.

How should you structure Terraform for multiple environments?

Use separate directories per environment (dev, staging, production) with shared modules. Each environment has its own state file, backend config, and variable values. Shared logic lives in reusable modules, called from each environment's root module.

How do you test Terraform modules?

Use terraform test (native validation, plan-level checks, no cost) for module contracts and validation rules. Use Terratest for integration testing — real cloud API validation, multiple module interactions, and custom Go-based assertions.
