Storage costs can balloon faster than you think, but understanding how your storage is actually being used is the key to reining them in.
Let’s look at a real-world scenario. Imagine a Kubernetes cluster with several applications. One application, frontend-app, is using a persistent volume (pv-frontend) provisioned with 100Gi of capacity.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: frontend-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: standard
```
This PVC is bound to a PersistentVolume pv-frontend of the same size. However, when we check the actual usage of the underlying storage device (e.g., an EBS volume in AWS, a persistent disk in GCP, or an Azure disk), we might find it’s only consuming 20Gi of data. The remaining 80Gi is provisioned but completely unused, silently incurring costs.
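One quick, low-tech check is to compare the filesystem's view from inside the pod against the provisioned size. The sketch below parses a `df` line with the example's hypothetical figures; on a live pod you would run `kubectl exec <pod> -- df -h <mount-path>` and filter the real output the same way:

```shell
# A 'df' line as it might appear inside the frontend-app pod
# (illustrative device name and figures matching the example)
DF_LINE="/dev/nvme1n1  100G  20G  80G  20%  /data"

# Extract the Use% column and strip the percent sign
UTIL=$(echo "$DF_LINE" | awk '{ gsub(/%/, "", $5); print $5 }')
echo "Volume utilization: ${UTIL}%"
```

A utilization this far below 100% on a volume you pay for by provisioned size is the first signal worth investigating.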
The problem is that by default, most cloud providers and storage systems charge for provisioned capacity, not actual consumed capacity. Without a system to report granular usage, you’re effectively paying for empty space.
This is where storage observability comes in. It’s the practice of instrumenting and monitoring your storage infrastructure to gain deep insights into its utilization, performance, and cost drivers. The goal isn’t just to see "disk full," but to understand "which application is using N Gi of storage, and why is it provisioned at 5N Gi?"
Consider a common setup with a Kubernetes cluster using aws-ebs-csi-driver. When a PVC is created, the CSI driver requests a new EBS volume from AWS. The storage: 100Gi in the PVC definition tells AWS to create a 100Gi EBS volume. AWS then bills you for that entire 100Gi, regardless of how much data is actually written to it.
To get actual usage, we need to look beyond Kubernetes’ reported PVC size. Tools that integrate with the storage provider’s APIs or use node-level agents can report the actual data written to the underlying storage device.
Let’s say we’re using a tool like OpenCost or Kubecost. These tools can query the Kubernetes API for PVC and PV information, and then, through integrations or direct calls, query the cloud provider for the actual size of the attached EBS volume and its consumed space.
For pv-frontend, OpenCost might report:
- Provisioned Capacity: 100Gi
- Actual Usage: 20Gi
- Cost per Gi (Provisioned): $0.08
- Cost per Gi (Actual): $0.08 (this is the misleading part, as you pay for provisioned)
- Estimated Monthly Cost (Provisioned): $8.00
- Estimated Monthly Cost (Actual Data): $1.60
- Wasted Cost: $6.40
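The wasted-cost figure is simple arithmetic over the provisioned and used sizes. A sketch using the example's assumed $0.08/Gi-month rate:

```shell
PROVISIONED_GI=100   # from the PVC request
USED_GI=20           # actual data on the volume
PRICE_PER_GI=0.08    # assumed monthly rate per provisioned Gi

# Shell arithmetic is integer-only, so use awk for the dollar amounts
WASTED=$(awk -v p="$PROVISIONED_GI" -v u="$USED_GI" -v c="$PRICE_PER_GI" \
  'BEGIN { printf "%.2f", (p - u) * c }')
echo "Wasted cost: \$${WASTED}/month"
```

The same formula applied across every volume in a cluster is essentially what cost tools compute for their "wasted spend" columns.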
This immediately highlights frontend-app as a prime candidate for optimization. The next step is to investigate why the discrepancy exists. It could be due to:
- Over-provisioning: The application simply doesn’t need that much space, and the PVC was created with a generous buffer.
- Deleted Data Not Freed: Data was written and then deleted by the application, but the underlying filesystem or storage driver hasn’t yet reported the freed space back to the storage provider. This is less common with modern cloud block storage but can occur.
- Snapshots/Cloning: While snapshots have their own costs, sometimes the base volume might still reflect a larger perceived size if not properly managed. This is more about understanding total cost than wasted provisioned space.
- Application Behavior: Some applications might pre-allocate or reserve space that isn’t immediately used.
- Stale Persistent Volumes: PVs that are no longer actively used by any pods but still exist and are provisioned.
For the frontend-pvc example, the most likely cause is over-provisioning. To fix this, you’d create a new PVC with the correct, smaller size and migrate the data.
Diagnosis Command (Conceptual - actual commands vary by tool):
To get this insight, you’d typically use a storage observability tool. If using kubectl and looking at raw data (less direct for usage):
```bash
# Get PVC details
kubectl get pvc frontend-pvc -o yaml

# Get PV details
kubectl get pv pv-frontend -o yaml

# Describe the node where the pod is running (to infer the storage device, but not direct usage)
kubectl describe node <node-name>
```
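To turn raw kubectl output into a shortlist, you can filter on requested capacity. The table below is captured inline for illustration (the PVC names and sizes are hypothetical); on a live cluster you would pipe real `kubectl get pvc -A` output through the same filter:

```shell
# Illustrative 'kubectl get pvc -A' style output, captured inline
PVC_TABLE=$(cat <<'EOF'
NAMESPACE   NAME           CAPACITY
default     frontend-pvc   100Gi
default     cache-pvc      40Gi
EOF
)

# Flag PVCs provisioned at 50Gi or more as candidates for a usage audit
CANDIDATES=$(echo "$PVC_TABLE" | awk 'NR > 1 {
  size = $3; sub(/Gi/, "", size)
  if (size + 0 >= 50) print $2 " (" $3 ")"
}')
echo "Audit candidates: $CANDIDATES"
```

The 50Gi threshold is arbitrary; pick one that matches your typical workload sizes.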
A tool like OpenCost provides a dashboard showing:
```text
# Example dashboard view (not a real command)
# Volume Name | PVC Name     | Namespace | Provisioned | Used | Wasted Cost
# pv-frontend | frontend-pvc | default   | 100Gi       | 20Gi | $6.40
```
Fix for Over-provisioning:
- Identify the data: Determine the actual data size (e.g., 20Gi).
- Create a new PVC:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: frontend-pvc-new
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 25Gi   # actual usage (20Gi) plus a small buffer
  storageClassName: standard
```

- Migrate data: This typically involves:
  - Stopping the application writing to `frontend-pvc`.
  - Creating a new pod that mounts both `frontend-pvc` (read-only) and `frontend-pvc-new` (read-write).
  - Using a tool like `rsync` or `cp` to copy data from the old PVC to the new one.
  - Updating the application deployment to use `frontend-pvc-new`.
  - Deleting the old `frontend-pvc` and its associated `pv-frontend` (and the underlying EBS volume).
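The migration can be sketched as a one-off helper pod that mounts both claims. The pod name and image are assumptions, and the `rsync` flags should be adjusted for your ownership and permission requirements:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: storage-migrator        # hypothetical one-off helper pod
spec:
  restartPolicy: Never
  containers:
    - name: migrate
      image: alpine:3.19        # any small image with a shell works
      command: ["sh", "-c", "apk add --no-cache rsync && rsync -a /old/ /new/"]
      volumeMounts:
        - name: old
          mountPath: /old
          readOnly: true
        - name: new
          mountPath: /new
  volumes:
    - name: old
      persistentVolumeClaim:
        claimName: frontend-pvc
        readOnly: true
    - name: new
      persistentVolumeClaim:
        claimName: frontend-pvc-new
```

Once the pod completes, verify the copied data before repointing the deployment and deleting the old claim.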
Why it works: By creating a smaller PVC and migrating only the necessary data, you tell the cloud provider to provision a smaller underlying storage volume. You then delete the oversized volume, stopping the associated charges for the unused 80Gi.
Common Causes & Fixes:
- Cause 1: Over-provisioned PVCs: Applications request more storage than they actually need, often as a default or out of caution.
  - Diagnosis: Storage observability tool reports Provisioned > Used.
  - Fix: Resize PVCs or migrate data to smaller PVCs. `kubectl edit pvc <pvc-name>` may allow resizing if the storage class supports it; otherwise, data migration is needed. Note that cloud volume resizes can only grow a volume, so shrinking means creating a new, smaller volume and migrating.
  - Why: Reduces the provisioned capacity on the cloud provider's side, directly lowering costs.
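In-place resizing via `kubectl edit pvc` only works when the PVC's StorageClass permits expansion. A StorageClass sketch enabling it, assuming the AWS EBS CSI driver from the earlier example:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: ebs.csi.aws.com     # assumes the AWS EBS CSI driver
allowVolumeExpansion: true       # without this, PVC resize requests are rejected
```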
- Cause 2: Unused/Stale Persistent Volumes: PVs and their backing storage exist but are not attached to any active pods.
  - Diagnosis: Observability tool flags PVs with Status: Available or Status: Released that have no corresponding PVC or are not claimed. The cloud provider console shows unattached volumes.
  - Fix: Delete the unused PV (`kubectl delete pv <pv-name>`) and the corresponding cloud storage resource (e.g., the EBS volume).
  - Why: Eliminates all costs associated with storage that is provisioned but not in use.
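A cheap way to spot stale PVs is filtering `kubectl get pv` output on the STATUS column. The table below is captured inline for illustration (the PV names are hypothetical); on a live cluster, pipe the real command through the same awk filter:

```shell
# Illustrative 'kubectl get pv' output; names and sizes are hypothetical
PV_TABLE=$(cat <<'EOF'
NAME          CAPACITY   STATUS     CLAIM
pv-frontend   100Gi      Bound      default/frontend-pvc
pv-old-db     200Gi      Released   default/db-pvc
EOF
)

# Anything Released or Available is a candidate for cleanup
STALE=$(echo "$PV_TABLE" | awk '$3 == "Released" || $3 == "Available" { print $1 }')
echo "Stale PVs: $STALE"
```

Remember that deleting the PV does not always delete the cloud volume (see the reclaimPolicy discussion below), so confirm in the provider console as well.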
- Cause 3: Snapshot Overgrowth: Snapshots are meant for recovery, but retaining too many or too-large snapshots can indirectly raise underlying storage costs if they prevent storage reclamation or if the base volume is also large.
  - Diagnosis: Snapshot management tools or cloud provider consoles show numerous or large snapshots. Observability tools might show the total storage footprint including snapshots.
  - Fix: Implement a snapshot lifecycle policy to automatically delete old or unnecessary snapshots.
  - Why: Frees up space previously occupied by snapshots, which can reduce the overall storage bill.
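On Kubernetes with CSI snapshots, one piece of such a policy is the VolumeSnapshotClass's `deletionPolicy`, which controls whether the cloud-side snapshot is removed along with the VolumeSnapshot object; age-based retention is typically handled on the provider side (e.g., AWS Data Lifecycle Manager). A sketch, assuming the EBS CSI driver and a hypothetical class name:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshots            # hypothetical class name
driver: ebs.csi.aws.com
deletionPolicy: Delete           # remove the cloud snapshot when the VolumeSnapshot is deleted
```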
- Cause 4: Filesystem Inefficiencies/Holes: Data is deleted at the application level, but the filesystem or block device doesn't immediately report that space as free. This is less common with modern cloud block storage, which often handles it automatically, but it can occur with certain configurations or older systems.
  - Diagnosis: Used as reported by the filesystem on the node is significantly higher than the Used reported by the cloud provider for the same volume. Tools like `du` in the pod might show less data than the volume reports.
  - Fix: For some systems, forcing a filesystem trim (e.g., `fstrim`) or specific unmap operations might be necessary. Often, the solution is to migrate data to a new volume.
  - Why: Ensures you're only paying for space that is truly occupied by data.
- Cause 5: Inefficient Data Storage: Applications might store data in a way that is not space-efficient (e.g., uncompressed logs, redundant copies).
  - Diagnosis: High Used space for a specific application, with no clear over-provisioning; analysis of the data content itself.
  - Fix: Optimize application data storage: implement compression, deduplication, or archival policies.
  - Why: Reduces the absolute amount of data stored, thus reducing storage capacity requirements and costs.
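Compression gains on text-heavy data like logs are easy to demonstrate locally. This sketch writes a repetitive sample log and compares sizes with gzip; real-world ratios vary with log content:

```shell
# Create a sample repetitive log file (typical of uncompressed application logs)
LOG=$(mktemp)
for _ in $(seq 1 1000); do
  echo "2024-01-01T00:00:00Z INFO request served path=/api/v1/items status=200"
done > "$LOG"

RAW=$(wc -c < "$LOG")
gzip -c "$LOG" > "$LOG.gz"     # write compressed copy alongside the original
PACKED=$(wc -c < "$LOG.gz")
echo "raw=${RAW} bytes, compressed=${PACKED} bytes"
```

Highly repetitive logs often compress by an order of magnitude or more, which translates directly into smaller volume requirements.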
- Cause 6: Misconfigured StorageClass reclaimPolicy: If `reclaimPolicy` is set to `Retain` on a PV, the underlying cloud volume won't be automatically deleted when the PVC is deleted.
  - Diagnosis: The PVC is deleted, but the PV remains `Released` and the cloud volume persists.
  - Fix: Change `reclaimPolicy` to `Delete` for dynamically provisioned volumes if automatic cleanup is desired, or manually delete the PV and cloud volume.
  - Why: Ensures that storage resources are properly cleaned up and de-provisioned when no longer needed.
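The reclaim policy can be set on the StorageClass for future volumes; a sketch with a hypothetical class name, again assuming the EBS CSI driver:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-autoclean       # hypothetical class name
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete            # PVs (and their cloud volumes) are removed with the PVC
```

For a PV that already exists, the policy can be flipped in place with `kubectl patch pv pv-frontend -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'`.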
The next challenge you’ll face is optimizing network egress costs, which often go hand-in-hand with storage management.