Terraform drift detection in CI isn’t about preventing manual changes; it’s about making the inevitable manual changes visible and manageable before they break your production infrastructure.
Let’s see what this looks like in practice. Imagine you have a Terraform configuration for an AWS S3 bucket, and you’ve defined it like this:
resource "aws_s3_bucket" "example" {
  bucket = "my-unique-terraform-example-bucket-12345"
  acl    = "private"
  tags = {
    Environment = "Development"
    ManagedBy   = "Terraform"
  }
}
This code is committed to your Git repository. Now, someone on your team, perhaps under pressure, decides to quickly change the ACL of this bucket directly in the AWS console to public-read to troubleshoot an issue, forgetting it’s managed by Terraform.
Here’s how you’d integrate drift detection into your CI pipeline, typically using GitHub Actions or GitLab CI. The core command is terraform plan.
name: Terraform CI
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.5.0 # Use your desired Terraform version
          terraform_wrapper: false # Let the raw exit code of terraform reach the shell
      - name: Terraform Init
        run: terraform init
      - name: Terraform Validate
        run: terraform validate
      - name: Terraform Plan
        # -detailed-exitcode makes plan exit with 2 when changes are pending
        # (0 = no changes, 1 = error), so this step fails when there is drift.
        # Without the flag, plan exits 0 even when it proposes changes.
        run: terraform plan -detailed-exitcode -input=false
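To make the exit-code behavior concrete, here is a minimal shell sketch of a drift gate. It assumes the plan runs with -detailed-exitcode, whose documented exit codes are 0 for no changes, 1 for an error, and 2 for pending changes. The function names classify_plan_exit and drift_gate are made up for illustration and are not part of Terraform.

```shell
# Sketch of a CI drift gate built on `terraform plan -detailed-exitcode`.
# Documented exit codes for that flag: 0 = no changes, 1 = error, 2 = changes present.

# classify_plan_exit (hypothetical helper) maps a plan exit code to a label.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "changes" ;;
    *) echo "error" ;;
  esac
}

# drift_gate (hypothetical helper) runs the plan and returns non-zero unless it is clean.
drift_gate() {
  local code=0
  # Capture the exit code instead of letting a non-zero status abort the script.
  terraform plan -detailed-exitcode -input=false -no-color || code=$?
  echo "plan result: $(classify_plan_exit "$code")"
  [ "$code" -eq 0 ]
}
```

In a workflow step you would source this file and call drift_gate. Exit code 2 covers any pending change, so it signals drift only when everything on the branch has already been applied.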
When the CI pipeline runs after the manual change, terraform plan will detect that the actual state of the aws_s3_bucket.example resource in AWS no longer matches what Terraform thinks it should be.
Terraform will perform the following actions:
  # aws_s3_bucket.example will be updated in-place
  ~ resource "aws_s3_bucket" "example" {
      ~ acl  = "public-read" -> "private" # Drift: Terraform proposes undoing the manual change
        id   = "my-unique-terraform-example-bucket-12345"
        tags = {
            "Environment" = "Development"
            "ManagedBy"   = "Terraform"
        }
        # (other attributes unchanged)
    }
Plan: 0 to add, 1 to change, 0 to destroy.
The CI job fails only if the plan step runs with -detailed-exitcode: with that flag, terraform plan exits with status 2 when changes are pending, whereas a plain terraform plan exits 0 even when it proposes changes. The failure immediately alerts the team that the infrastructure managed by Terraform has diverged from what is actually deployed.
The mental model here is that Terraform maintains a state file (terraform.tfstate by default, stored locally or in a remote backend) that records what Terraform believes is deployed. When you run terraform plan, it first refreshes that record by querying your cloud provider, then compares the refreshed state against the desired state in your .tf files. Any mismatch between the live infrastructure and the state file is "drift," and the plan shows the changes needed to bring the live infrastructure back in line with the configuration.
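One way to look at drift in isolation is a refresh-only plan, which reports what changed outside Terraform without proposing changes driven by your .tf files. A minimal sketch follows; both helper names are hypothetical.

```shell
# show_drift_only (hypothetical helper): compare the state file with live
# infrastructure and report objects that changed outside Terraform.
# -refresh-only plans no changes to the infrastructure itself.
show_drift_only() {
  terraform plan -refresh-only -input=false -no-color
}

# show_pending_changes (hypothetical helper): a normal plan, which refreshes
# state and then compares it with the configuration in the .tf files.
show_pending_changes() {
  terraform plan -input=false -no-color
}
```

Running the two side by side helps separate "someone changed the bucket by hand" from "the configuration has changes that were never applied."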
The levers you control are your Terraform configuration and your CI pipeline’s triggers and steps. By running terraform plan as a gate in CI, you ensure that any deviation from the declared infrastructure is flagged before a pull request is merged or a terraform apply is approved.
Common Causes of Drift and How to Fix Them:
- Manual Changes via Cloud Provider Console/CLI:
  - Diagnosis: Run terraform plan. Look for resources marked for update (~) or destruction (-) that you didn’t intend to change. Compare the Plan: output with your current .tf files.
  - Fix:
    - Option A (Revert Manual Change): If the manual change was accidental or can be easily undone, revert it in the cloud provider console/CLI, or run terraform apply to let Terraform restore the configured value. Then re-run terraform plan to confirm the drift is gone. This is the ideal scenario.
    - Option B (Adopt the Change in Terraform): If the manual change was intentional and necessary, bring your configuration and state in line with it. terraform import is not the right tool here: the resource is already in state, so an import would fail.
      - Update the .tf file to match the live setting (e.g., set acl = "public-read" on aws_s3_bucket.example).
      - Run terraform apply -refresh-only to record the live values in the state file.
      - Run terraform plan again. It should now show no changes for that resource.
  - Why it works: terraform apply -refresh-only reads the live resource’s attributes from the cloud provider and writes them into the state file without modifying the infrastructure. Once the .tf file declares the same values, configuration, state, and reality agree.
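For the manual-change case, the two outcomes (keep the change or undo it) map to two different apply modes. This is a minimal sketch that assumes the resource is already tracked in state; accept_drift and revert_drift are hypothetical names.

```shell
# accept_drift (hypothetical helper): record the live settings in the state
# file without modifying infrastructure. Terraform asks for confirmation
# before writing the state.
accept_drift() {
  terraform apply -refresh-only
}

# revert_drift (hypothetical helper): a normal apply pushes the values from
# the .tf files back onto the live resource, undoing the manual change.
revert_drift() {
  terraform apply
}
```

After accept_drift, the .tf files still need to declare the new value; otherwise the next plan proposes changing it back.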
- Changes by Other Tools or Automation:
  - Diagnosis: Same as above: terraform plan will show the drift.
  - Fix: If another automation tool is making changes, you must decide:
    - Option A (Consolidate): Modify the other tool to stop managing the resource and let Terraform handle it. If the resource is not yet in Terraform’s state, bring it in with terraform import; if it is already tracked, update the .tf files to match and run terraform apply -refresh-only.
    - Option B (Co-manage with Caution): If co-management is unavoidable, ensure the other tool’s changes are reflected in your Terraform code before running terraform plan. This often involves manual updates to .tf files or a more complex workflow.
  - Why it works: terraform plan reports any difference between configuration, state, and reality. Drift stops recurring once only one system writes to a given attribute, or once the configuration tracks what the other system sets.
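When taking over a resource from another tool, it helps to check whether Terraform already tracks it before importing, since importing an address that is already in state fails. A minimal sketch; import_if_untracked is a hypothetical name, and it assumes a matching resource block already exists in the .tf files.

```shell
# import_if_untracked (hypothetical helper).
# Usage: import_if_untracked aws_s3_bucket.example my-unique-terraform-example-bucket-12345
import_if_untracked() {
  local addr="$1"
  local id="$2"
  # `terraform state list` prints one resource address per line.
  if terraform state list | grep -Fxq -- "$addr"; then
    echo "$addr is already tracked; skipping import"
    return 0
  fi
  # Attach the existing cloud resource to the address declared in the .tf files.
  terraform import "$addr" "$id"
}
```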
- Terraform Provider Bugs or Updates:
  - Diagnosis: Run terraform plan. If a resource shows unexpected changes after a provider update, or even without one, it might be a provider issue. Check the provider’s GitHub issues.
  - Fix:
    - Option A (Pin Provider Version): In your .tf files, specify the exact provider version being used:
      terraform {
        required_providers {
          aws = {
            source  = "hashicorp/aws"
            version = "5.0.0" # Pin to a known good version
          }
        }
      }
      Then run terraform init (add -upgrade if the dependency lock file already records a different version) and terraform plan.
    - Option B (Update Terraform Code): If the provider update requires changes to your .tf files, update them accordingly.
    - Option C (Report Bug): If it’s a genuine provider bug, report it to the provider’s maintainers.
  - Why it works: Pinning a provider version ensures that terraform plan uses the same provider logic it did previously, preventing unexpected drift due to provider changes.
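- Resource Deletion Outside Terraform:
  - Diagnosis: During its refresh, terraform plan notices that the resource is no longer present in the cloud provider and, because your .tf files still declare it, marks it for creation (+). This is a specific type of drift.
  - Fix:
    - Option A (Recreate): If the resource should exist, run terraform apply to create it again.
    - Option B (Retire): If the deletion was intentional, remove the resource block from your .tf files. If the address still lingers in state (check with terraform state list), run terraform state rm <resource_address> (e.g., terraform state rm aws_s3_bucket.example).
    - Run terraform plan. The resource should no longer appear in the plan.
  - Why it works: terraform state rm removes the resource from Terraform’s state file without touching the cloud, telling Terraform it no longer needs to manage it. With the block also gone from the configuration, there is nothing left to reconcile.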
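For the retire path, a guard around terraform state rm avoids an error when the refresh has already dropped the address from state. A minimal sketch; forget_if_tracked is a hypothetical name.

```shell
# forget_if_tracked (hypothetical helper).
# Usage: forget_if_tracked aws_s3_bucket.example
forget_if_tracked() {
  local addr="$1"
  # `terraform state list` prints one resource address per line.
  if terraform state list | grep -Fxq -- "$addr"; then
    # Removes the entry from state only; nothing is deleted in the cloud.
    terraform state rm "$addr"
  else
    echo "$addr is not tracked in state; nothing to remove"
  fi
}
```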
- Changes to Resource Attributes Not Managed by Terraform:
  - Diagnosis: terraform plan might show changes to attributes that your Terraform code doesn’t explicitly define.
  - Fix:
    - Option A (Ignore): If these are attributes you don’t care about or that are managed by another system, you can tell Terraform to ignore them:
      resource "aws_s3_bucket" "example" {
        bucket = "my-unique-terraform-example-bucket-12345"
        acl    = "private"
        tags = {
          Environment = "Development"
          ManagedBy   = "Terraform"
        }
        lifecycle {
          ignore_changes = [
            # Ignore changes to the 'acl' attribute if it's managed elsewhere
            acl,
            # Ignore changes to specific tags
            tags["ManagedBy"],
          ]
        }
      }
      Then run terraform plan.
    - Option B (Manage): If you do want Terraform to manage these attributes, add them to your .tf file and let Terraform apply them.
  - Why it works: The ignore_changes argument in the lifecycle block tells Terraform to disregard differences in the listed attributes when it plans updates, effectively silencing that specific drift.
- Incorrectly Configured Remote State Backend:
  - Diagnosis: If your remote state backend (like an S3 bucket for state) is misconfigured, inaccessible, or corrupted, terraform plan might fail or report spurious drift because it cannot read the correct state.
  - Fix:
    - Verify your remote backend configuration in main.tf (or similar).
    - Ensure the credentials/roles used by your CI runner have permission to access the backend.
    - Check the backend itself for integrity (e.g., is the S3 bucket name correct? Does it exist? Are there versioning issues if applicable?).
    - If the state file itself is corrupted, you might need to restore from a backup or, in extreme cases, re-initialize the backend and import existing resources.
  - Why it works: Terraform relies on an accurate, accessible state file to compare against. If the state file is unavailable or incorrect, it cannot perform drift detection reliably.
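A quick way to rule out backend problems before trusting a plan is to confirm that the runner can initialize the backend and read the state. A minimal sketch; check_backend is a hypothetical name, and the messages it prints are made up for illustration.

```shell
# check_backend (hypothetical helper): verify the backend is reachable and
# the state is readable before running drift detection.
check_backend() {
  # init configures the backend; -input=false makes it fail instead of prompting.
  terraform init -input=false >/dev/null || { echo "backend init failed"; return 1; }
  # state pull downloads the current state; a failure here points at access
  # or integrity problems rather than drift.
  terraform state pull >/dev/null || { echo "state not readable"; return 1; }
  echo "backend ok"
}
```

Running this as an early CI step keeps a permissions or configuration failure from being mistaken for drift.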
The next problem you’ll often hit after fixing drift is a terraform apply failure, typically because your pre-commit hooks don’t run terraform plan or because other validation issues remain in your code.