Elasticsearch is actually terrible at storing logs at scale if you’re not careful.
Here’s how you can make it work, by treating it less like a database and more like a specialized, append-only time-series index.
Let’s say you’re sending logs from a fleet of services running in Kubernetes. You’ve got Fluentd collecting logs, and it’s configured to send them to Elasticsearch. Your basic Fluentd config might look something like this:
<match fluent.**>
@type null
</match>
<filter **>
@type kubernetes_metadata
@id kubernetes_metadata
de_dot true
</filter>
<match your_app.log>
@type elasticsearch
@id elasticsearch
host elasticsearch-master.logging.svc.cluster.local
port 9200
logstash_format true
logstash_prefix your_app
<buffer tag,time>
@type file
path /var/log/td-agent/buffer/your_app
# timekey is required when "time" is a buffer chunk key
timekey 60
timekey_wait 10s
flush_interval 5s
chunk_limit_size 2m
total_limit_size 512m
retry_max_times 10
retry_wait 1s
</buffer>
</match>
This is a start, but it will fall over once you push more than a few hundred logs per second.
The core issue with Elasticsearch for logs is its default behavior for indexing. It tries to be smart by dynamically mapping fields, which leads to massive index bloat and slow queries on large datasets. For logs, you want predictability and efficiency.
The first thing you need is a strict index mapping. Elasticsearch’s dynamic mapping is the enemy of predictable performance and storage costs. You need to define exactly what fields exist and what their types are. For logs, most fields are strings, numbers, or booleans. Avoid text for fields you’ll be filtering or aggregating on; use keyword.
Here’s a sample mapping for an index pattern like your_app-* (this uses the legacy _template API; on Elasticsearch 7.8+ the composable _index_template API is the modern replacement):
PUT _template/your_app_template
{
"index_patterns": ["your_app-*"],
"settings": {
"index.number_of_shards": 3,
"index.number_of_replicas": 1,
"index.refresh_interval": "30s"
},
"mappings": {
"properties": {
"@timestamp": { "type": "date" },
"message": { "type": "text" },
"level": { "type": "keyword" },
"kubernetes": {
"properties": {
"pod_name": { "type": "keyword" },
"namespace_name": { "type": "keyword" },
"container_name": { "type": "keyword" },
"labels": { "properties": {
"app": { "type": "keyword" },
"env": { "type": "keyword" }
}}
}
},
"log": { "properties": {
"source": { "type": "keyword" }
}},
"service": { "type": "keyword" },
"trace_id": { "type": "keyword" },
"span_id": { "type": "keyword" }
}
}
}
Notice how fields like level, pod_name, namespace_name, service, and any IDs are mapped as keyword. This tells Elasticsearch to treat them as exact values, which is much more efficient for filtering and aggregation than text, which is analyzed for full-text search.
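To see why this matters in practice, here is what a typical filtered search against these keyword fields might look like (a sketch; the index pattern and field values are placeholders):

```
GET your_app-*/_search
{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "error" } },
        { "term": { "kubernetes.namespace_name": "production" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```

Because level and namespace_name are keyword fields, the term clauses are exact lookups in the inverted index with no analysis step, and because they sit in a filter context they skip scoring entirely and cache well.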
Next, control your shards and replicas. Shard count should follow data volume: Elastic’s own guidance is to aim for shards of roughly 10–50 GB each, avoiding both oversized shards and a swarm of tiny ones (every shard carries cluster-state and heap overhead). Starting with index.number_of_shards: 3 and index.number_of_replicas: 1 is reasonable for a moderate ingest rate; adjust from there based on your daily volume. Replicas are great for read availability, but each one doubles your storage and indexing load. For logs, which are append-only, a single replica is usually enough.
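You can sanity-check shard sizes as data accumulates with the _cat API (assuming the your_app-* pattern from above):

```
GET _cat/shards/your_app-*?v&h=index,shard,prirep,store&s=store:desc
```

If primary shards stay far under a gigabyte, you have too many shards; if they blow well past 50 GB, add shards to the template or roll indices over sooner.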
Crucially, tune your index.refresh_interval. The default is 1s, meaning data becomes searchable almost immediately. For logs, this is often overkill and a huge performance drain. Increasing it to 30s or even 60s dramatically reduces the I/O load on Elasticsearch.
PUT _template/your_app_template
{
"index_patterns": ["your_app-*"],
"settings": {
"index.number_of_shards": 3,
"index.number_of_replicas": 1,
"index.refresh_interval": "30s" // <-- This is key
},
"mappings": {
// ... rest of mapping
}
}
Your Fluentd configuration needs to leverage this. Make sure your logstash_prefix matches your index pattern, and consider adjusting flush_interval if you’re seeing buffer issues, but the Elasticsearch settings are more critical.
<match your_app.log>
@type elasticsearch
@id elasticsearch
host elasticsearch-master.logging.svc.cluster.local
port 9200
logstash_format true
logstash_prefix your_app # <-- Matches index_patterns "your_app-*"
# Consider adding these for better resilience
request_timeout 5s
reconnect_on_error true
reload_on_failure true
<buffer tag,time>
@type file
path /var/log/td-agent/buffer/your_app
# timekey is required when "time" is a buffer chunk key
timekey 60
timekey_wait 10s
flush_interval 5s
chunk_limit_size 2m
total_limit_size 512m
retry_max_times 10
retry_wait 1s
</buffer>
</match>
Finally, implement index lifecycle management (ILM). Logs grow indefinitely. You need to automatically roll over indices, move older data to cheaper storage (like S3 via the Elasticsearch Snapshot/Restore or a dedicated log archive), and eventually delete it.
An ILM policy might look like this:
PUT _ilm/policy/log_retention_policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_age": "7d",
"max_docs": 50000000,
"max_size": "50gb"
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"set_priority": { "priority": 50 },
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"set_priority": { "priority": 0 },
"searchable_snapshot": {
"snapshot_repository": "my_s3_repository"
}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
And then apply it to your template. One caveat: the rollover action only works when writes go through an alias, so you also need to set index.lifecycle.rollover_alias and bootstrap an initial index behind that alias. (With logstash_format true, Fluentd writes directly to dated indices, which bypasses rollover; use the plugin’s ILM support or a write alias to get the full benefit.)
PUT _template/your_app_template
{
"index_patterns": ["your_app-*"],
"settings": {
// ... other settings
"index.lifecycle.name": "log_retention_policy", // <-- Apply ILM
"index.lifecycle.rollover_alias": "your_app" // <-- Required for the rollover action
},
"mappings": {
// ... mappings
}
}
This policy will automatically create new indices, optimize older ones for search, move them to cheaper tiers, and eventually delete them after 90 days.
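Once the policy is attached, you can verify which phase each index is in, and catch stuck or failed steps, with the ILM explain API:

```
GET your_app-*/_ilm/explain
```

The response lists the current phase, action, and step per index; if a step has failed, it includes step_info describing why.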
The next error you’ll hit is likely related to shard allocation or disk space if you haven’t correctly configured your ILM policies or if your indexing rate is still too high for your hardware.
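A quick way to tell whether your indexing rate is outrunning the cluster is to watch for bulk rejections on the write thread pool:

```
GET _cat/thread_pool/write?v&h=node_name,name,queue,rejected,completed
```

A steadily climbing rejected count means Elasticsearch is shedding bulk requests; Fluentd will retry them from its file buffer, but you are on borrowed time until you add capacity or slow the ingest.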