Terraform data sources allow you to fetch information about existing infrastructure that wasn’t necessarily provisioned by Terraform itself.

Let’s say you have an existing AWS EC2 instance running, and you want to reference its ID or public IP address in your Terraform configuration. You can use the aws_instance data source for this.

data "aws_instance" "existing_web_server" {
  filter {
    name   = "tag:Name"
    values = ["my-production-web-server"]
  }
}

output "web_server_public_ip" {
  value = data.aws_instance.existing_web_server.public_ip
}

When you run terraform plan, Terraform will query AWS for an instance tagged with Name=my-production-web-server. If exactly one matching instance is found, it will output that instance's public IP address; if none (or more than one) match, the data source fails with an error. This is incredibly useful for integrating Terraform into existing environments or for referencing resources managed by other tools.

Terraform data sources are essentially read-only queries into your cloud provider’s API. They abstract away the need to manually look up IDs, ARNs, or other attributes. Instead, you declare what information you need, and Terraform fetches it for you.

The general structure involves a data block, followed by the data source type (e.g., aws_instance, aws_vpc, google_compute_instance), and then a local name for that data source within your Terraform configuration. Inside the block, you use arguments to specify how to filter or identify the resource you’re interested in. These filters are specific to each data source type.

For example, if you needed to find an existing VPC by its CIDR block:

data "aws_vpc" "main_vpc" {
  cidr_block = "10.0.0.0/16"
}

output "main_vpc_id" {
  value = data.aws_vpc.main_vpc.id
}

This data.aws_vpc.main_vpc.id can then be used in other resource definitions, like creating a subnet within that specific VPC.
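For instance, a subnet resource can consume that ID directly (a brief sketch; the subnet's CIDR value here is illustrative and must fall inside the VPC's range):

resource "aws_subnet" "app" {
  vpc_id     = data.aws_vpc.main_vpc.id # fetched from the data source above
  cidr_block = "10.0.1.0/24"            # illustrative sub-range of 10.0.0.0/16
}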

The power of data sources lies in their ability to decouple your Terraform code. You can have resources managed by different teams, different Terraform configurations, or even manually provisioned resources, and still bring them into the fold of a new Terraform project. This is crucial for phased migrations or for establishing a single source of truth for infrastructure management.

Consider a scenario where you have an existing Amazon EKS cluster and want to deploy applications to it using Terraform. The Kubernetes provider itself has no data source for discovering clusters, but the AWS provider does: you'd use the aws_eks_cluster and aws_eks_cluster_auth data sources to fetch the cluster's endpoint, certificate, and a short-lived authentication token.

data "aws_eks_cluster" "production" {
  name = "prod-cluster"
}

data "aws_eks_cluster_auth" "production" {
  name = "prod-cluster"
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.production.endpoint
  token                  = data.aws_eks_cluster_auth.production.token
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.production.certificate_authority[0].data)
}

resource "kubernetes_namespace" "app_ns" {
  metadata {
    name = "my-application"
  }
}

Here, Terraform first queries AWS for the prod-cluster and then configures the Kubernetes provider with the fetched endpoint, token, and certificate authority. This allows you to manage Kubernetes resources (like namespaces or deployments) within that existing cluster.

Some data sources accept a most_recent = true argument, which is invaluable when multiple resources match your filters. The classic example is aws_ami: AMI names are versioned, so a name pattern typically matches many images, and most_recent = true tells Terraform to pick the one that was created last.

data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
}

This ensures that as new AMI versions are published, Terraform will always reference the latest matching image. Be aware that this cuts both ways: a newly published AMI changes the data source's result, which can cause Terraform to propose replacing the instance on the next plan.

The underlying mechanism for data sources is straightforward: when Terraform encounters a data block during the plan or apply phase, it makes an API call to the relevant cloud provider or service. The results of that API call are then deserialized and made available as attributes of the data source. These attributes can be referenced using the data.<type>.<name>.<attribute> syntax. If the data source cannot find a matching resource, or if there’s an issue with the API call, Terraform will halt execution with an error, indicating that the prerequisite infrastructure could not be located or accessed.

A common pitfall is forgetting that data sources are evaluated at plan time. If the infrastructure they are querying changes between terraform plan and terraform apply, and the change affects the data source’s output, you might encounter unexpected behavior. This is because the plan is based on the state of the infrastructure when the plan was run, not when the apply happens.

The next concept you’ll likely encounter is how to use data sources to dynamically generate configurations, such as iterating over a list of existing subnets to create network interfaces.
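As a preview, that pattern might look like the following sketch, which uses the aws_subnets data source to list the subnets in the VPC fetched earlier and creates one network interface per subnet (the resource names here are illustrative):

data "aws_subnets" "in_main_vpc" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main_vpc.id]
  }
}

resource "aws_network_interface" "app" {
  # for_each fans out over the list of subnet IDs returned by the data source
  for_each  = toset(data.aws_subnets.in_main_vpc.ids)
  subnet_id = each.value
}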

Want structured learning?

Take the full Terraform course →