Splunk can ingest logs from AWS CloudTrail and S3, but the most surprising thing is how easily it can become a black hole for security-relevant data if not configured meticulously.
Let’s see it in action. Imagine you’ve set up a Splunk Heavy Forwarder (HF) on an EC2 instance, and you want to pull CloudTrail logs stored in an S3 bucket.
First, you need an IAM role for your Splunk HF EC2 instance. This role needs s3:GetObject and s3:ListBucket permissions on the specific S3 bucket where CloudTrail delivers its logs.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-cloudtrail-log-bucket",
                "arn:aws:s3:::your-cloudtrail-log-bucket/*"
            ]
        }
    ]
}
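As a quick offline sanity check (not a substitute for IAM's own policy simulator), you can parse the policy and confirm that s3:ListBucket is allowed against the bucket ARN and s3:GetObject against the object ARN, since IAM evaluates the two actions against different resource types. The helper below is an illustrative sketch, not an AWS API:

```python
import json

# Parse the policy document shown above.
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-cloudtrail-log-bucket",
        "arn:aws:s3:::your-cloudtrail-log-bucket/*"
      ]
    }
  ]
}
""")

def grants(policy: dict, action: str, resource: str) -> bool:
    """True if any Allow statement lists both this action and this resource.

    Hypothetical helper for exact-match checks only; it does not handle
    wildcards, Deny statements, or condition keys the way IAM does.
    """
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        if action in stmt["Action"] and resource in stmt["Resource"]:
            return True
    return False

bucket = "arn:aws:s3:::your-cloudtrail-log-bucket"
assert grants(policy, "s3:ListBucket", bucket)        # listing is checked against the bucket ARN
assert grants(policy, "s3:GetObject", bucket + "/*")  # reads are checked against object ARNs
```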
Next, you configure Splunk to use this IAM role. On your Splunk HF, you’d create an inputs.conf stanza like this:
[aws_s3://your-cloudtrail-log-bucket]
bucket_name = your-cloudtrail-log-bucket
key_prefix = your-cloudtrail-prefix/AWSLogs/your-aws-account-id/CloudTrail/your-region/
interval = 300
aws_iam_role_arn = arn:aws:iam::your-aws-account-id:role/your-splunk-forwarder-iam-role
index = main
sourcetype = aws:cloudtrail
The key_prefix is crucial. It tells Splunk where within the bucket to look. For CloudTrail, it follows a specific pattern: [s3_prefix]/AWSLogs/[aws_account_id]/CloudTrail/[region]/. If your CloudTrail trail is configured to deliver logs to s3://my-logs/ctrail/, and your account ID is 123456789012, and you’re in us-east-1, your key_prefix would be ctrail/AWSLogs/123456789012/CloudTrail/us-east-1/.
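That path convention is mechanical enough to capture in a small helper. This is an illustrative sketch (the function name is made up, not part of the Splunk add-on), useful for double-checking the prefix you paste into inputs.conf:

```python
def cloudtrail_key_prefix(s3_prefix: str, account_id: str, region: str) -> str:
    """Build the S3 key prefix CloudTrail uses when delivering log files.

    CloudTrail appends AWSLogs/<account_id>/CloudTrail/<region>/ beneath
    any custom prefix configured on the trail.
    """
    parts = [s3_prefix.strip("/")] if s3_prefix else []
    parts += ["AWSLogs", account_id, "CloudTrail", region]
    return "/".join(parts) + "/"

print(cloudtrail_key_prefix("ctrail", "123456789012", "us-east-1"))
# ctrail/AWSLogs/123456789012/CloudTrail/us-east-1/
```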
Once configured, Splunk will poll the S3 bucket at the specified interval (here, every 5 minutes) and ingest new log files. You’ll see events appearing in your main index with the aws:cloudtrail sourcetype.
Now, let’s build the mental model. The problem this solves is centralizing and analyzing AWS activity logs for security monitoring, compliance, and operational troubleshooting. CloudTrail records API calls made in your AWS account, providing an audit trail. S3 acts as the durable storage for these logs. Splunk acts as the central SIEM, pulling these logs from S3, indexing them, and making them searchable.
The internal mechanism involves Splunk’s AWS add-on. It uses the AWS SDK to interact with S3. When you specify aws_iam_role_arn, Splunk (running on EC2) automatically assumes that role, gaining the necessary permissions to access the S3 bucket. It then lists objects within the specified key_prefix, downloads new or modified files, and sends them to the configured index with the specified sourcetype. The interval dictates how frequently Splunk checks for new files.
The sourcetype = aws:cloudtrail is critical because it tells Splunk how to parse the raw JSON log data into structured fields, making it searchable. Without it, you’d just have raw JSON blobs.
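To see why this matters: each CloudTrail log file is a single JSON document with a top-level Records array, where each element is one API call. The aws:cloudtrail sourcetype splits that array into individual Splunk events. A minimal Python sketch of that splitting step, using made-up example records:

```python
import json

# A toy CloudTrail log file: one JSON document, many API calls inside "Records".
raw = json.dumps({
    "Records": [
        {"eventName": "ConsoleLogin", "awsRegion": "us-east-1",
         "userIdentity": {"type": "IAMUser", "userName": "alice"}},
        {"eventName": "GetObject", "awsRegion": "us-east-1",
         "userIdentity": {"type": "AssumedRole"}},
    ]
})

# Without this split you would index one opaque blob; with it, each API call
# becomes a separate searchable event with extractable fields.
events = json.loads(raw)["Records"]
for e in events:
    print(e["eventName"], e["userIdentity"]["type"])
```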
The exact levers you control are:
bucket_name: The S3 bucket where logs are stored.
key_prefix: The directory structure within the bucket. This must precisely match how CloudTrail delivers logs.
interval: How often Splunk checks for new logs (in seconds). Shorter intervals mean faster ingestion but more API calls and potentially higher costs.
aws_iam_role_arn: The identity Splunk uses to access AWS resources.
index: Where Splunk stores the data.
sourcetype: How Splunk parses the data.
A common pitfall is the key_prefix. CloudTrail always appends AWSLogs/[aws_account_id]/CloudTrail/[region]/ beneath whatever custom S3 prefix (if any) you configured on the trail, and your key_prefix must include those segments. For example, if your trail delivers to s3://my-log-bucket/my-trail-delivery/, the actual path within the bucket is my-trail-delivery/AWSLogs/123456789012/CloudTrail/us-east-1/, and that full path is what your key_prefix must be. If you omit any part of it, Splunk won't find the files.
The most subtle aspect of S3 log ingestion is the handling of S3 event notifications. While CloudTrail delivers logs to S3, you can also configure S3 bucket notifications to trigger Lambda functions or SQS queues when new objects arrive. Splunk’s S3 input doesn’t inherently listen for these events; it polls S3. If you’re expecting near real-time ingestion and rely solely on polling, a long interval can lead to significant delays. However, Splunk can integrate with SQS. If your CloudTrail delivery pipeline is configured to send notifications to an SQS queue, Splunk can be configured to read directly from that SQS queue, offering a more event-driven ingestion model rather than pure polling. This bypasses the need to poll S3 for new objects and dramatically reduces ingestion latency, making the interval setting irrelevant for SQS-based inputs.
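For illustration, here is what any SQS-driven consumer does with each message (the bucket and object names below are hypothetical). An S3 event notification carries the bucket name and the URL-encoded object key of the newly created file, so the consumer can fetch exactly that object instead of repeatedly listing the prefix:

```python
import json
from urllib.parse import unquote_plus

# A toy SQS message body in the S3 event notification format.
message_body = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "my-log-bucket"},
            "object": {"key": "ctrail/AWSLogs/123456789012/CloudTrail/"
                              "us-east-1/123456789012_CloudTrail_us-east-1"
                              "_20240101T0000Z_example.json.gz"},
        },
    }]
})

for record in json.loads(message_body)["Records"]:
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
    print(f"s3://{bucket}/{key}")  # the exact object to download and ingest
```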
Once you have CloudTrail data in Splunk, the next step is often to correlate it with other AWS service logs, like VPC Flow Logs or GuardDuty findings.