Understanding Disk Pressure and Root Causes
Disk pressure on AKS nodes is a common issue in production environments.
While Kubernetes provides basic mechanisms such as image garbage collection, these are often insufficient to resolve real-world disk usage problems.
This post walks through how disk is actually consumed on AKS nodes, what frequently causes disk pressure, and how to systematically analyze it using a diagnostic script.
A helper script is available here:
Why Disk Pressure Happens on AKS Nodes
From a Kubernetes perspective, node disk usage is not limited to a single component.
Instead, it is shared across multiple categories collectively known as local ephemeral storage.
In practice, disk pressure is usually caused by the following:
- Container images (image cache)
- Container writable layers (overlayfs)
- Container logs
- Pod volumes (such as emptyDir)
Understanding each of these categories is critical for accurate troubleshooting.
Key Disk Usage Categories
1. Container Images
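Location: /var/lib/containerd/io.containerd.content.v1.content (pulled image blobs); unpacked image layers are stored as snapshots under /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs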
This is where container images are stored after being pulled.
Typical causes of growth:
- Frequent deployments with new image tags
- Large image sizes
- Stale images not cleaned up
Kubelet manages this area through image garbage collection using:
- imageGcHighThreshold
- imageGcLowThreshold
However, this only applies to unused images.
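The two thresholds work as a pair: kubelet starts image GC once disk usage reaches the high threshold and tries to remove unused images until usage is back at the low threshold. The following is a minimal sketch of that arithmetic, not kubelet's actual code. The function name and the sample numbers are hypothetical, and the 85/80 defaults are the upstream kubelet defaults, which may differ from what a given AKS node pool uses.

```python
def bytes_to_free(capacity: int, used: int,
                  high_pct: int = 85, low_pct: int = 80) -> int:
    """Return how much image GC would try to reclaim (0 if GC is not triggered).

    capacity and used are in the same unit (bytes, KB, ...).
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if not 0 <= low_pct < high_pct <= 100:
        raise ValueError("expected 0 <= low_pct < high_pct <= 100")

    usage_pct = used * 100 / capacity
    if usage_pct < high_pct:
        return 0  # below the high threshold: GC does not run

    # GC aims to bring usage back down to the low threshold.
    target_used = capacity * low_pct // 100
    return used - target_used


# Hypothetical node: 100 GiB disk, 90 GiB used.
print(bytes_to_free(capacity=100, used=90))  # 10
print(bytes_to_free(capacity=100, used=80))  # 0
```

The returned value is only a target. Because GC can delete unused images only, the space actually reclaimed may be smaller.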
2. Container Writable Layer (overlayfs)
Location: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
This is the writable layer for running containers.
Any file created inside a container (excluding mounted volumes) is stored here.
Typical causes:
- Applications writing logs to files instead of stdout
- Temporary or cache data inside the container filesystem
- High container churn (frequent restarts)
Important characteristics:
- Not managed by image garbage collection
- Often a major contributor to disk pressure
- Difficult to notice without explicit analysis
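Because writable-layer growth is hard to see from the outside, one way to make it visible is to list the largest files under the snapshotter directory. The sketch below is an illustration only: the helper name `largest_files` is hypothetical, and it reports apparent file size (`st_size`), which can differ from the blocks actually allocated on disk.

```python
import heapq
import os
import stat


def largest_files(root: str, limit: int = 10) -> list[tuple[int, str]]:
    """Return up to `limit` (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file disappeared or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):  # regular files only, no symlinks
                sizes.append((st.st_size, path))
    return heapq.nlargest(limit, sizes)
```

On a node this would be pointed at the overlayfs snapshotter path (prefixed with wherever the host filesystem is mounted inside the diagnostic pod). Large log or cache files showing up there usually indicate an application writing into its container filesystem.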
3. Container Logs
Locations: /var/log/containers, /var/log/pods
These are stdout/stderr logs captured by Kubernetes.
Typical causes:
- Verbose logging levels (debug/trace)
- Lack of log rotation configuration
- High request volume
Mitigation options:
- containerLogMaxSizeMB
- containerLogMaxFiles
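Taken together, containerLogMaxSizeMB and containerLogMaxFiles put a rough ceiling on how much disk stdout/stderr logs can occupy. The sketch below estimates that ceiling. The function name and the sample values are hypothetical, and the result is an approximation: rotated files may be compressed, and the active log can briefly exceed its limit between rotation checks.

```python
def max_log_footprint_mb(containers: int, max_size_mb: int, max_files: int) -> int:
    """Approximate upper bound (MB) for container logs on one node."""
    if min(containers, max_size_mb, max_files) < 0:
        raise ValueError("all arguments must be non-negative")
    # Each container keeps at most `max_files` log files of `max_size_mb` each.
    return containers * max_size_mb * max_files


# Hypothetical node running 100 containers, 50 MB per file, 5 files kept.
print(max_log_footprint_mb(containers=100, max_size_mb=50, max_files=5))  # 25000
```

Comparing this number with the node's disk size shows quickly whether the rotation settings leave enough headroom for images and volumes.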
4. Pod Volumes (emptyDir and others)
Location: /var/lib/kubelet/pods
This includes all pod-level data such as:
- emptyDir
- mounted volume data stored on node disk
Typical use cases:
- temporary files
- caching
- data sharing between containers
Typical issues:
- Applications continuously writing data without cleanup
- Batch jobs generating files
- Sidecars buffering data (e.g., log forwarders)
Unlike overlayfs, this is intentional storage defined by workload configuration.
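To see which pod is responsible for volume growth, usage can be grouped by pod UID. The sketch below assumes the usual kubelet layout, where emptyDir data lives under `<pods dir>/<pod UID>/volumes/kubernetes.io~empty-dir/<volume name>`. The function names are hypothetical, and sizes are apparent file sizes rather than allocated blocks.

```python
import os

EMPTY_DIR_PLUGIN = "kubernetes.io~empty-dir"  # kubelet's directory name for emptyDir volumes


def tree_size(root: str) -> int:
    """Sum the apparent size in bytes of all files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished while scanning
    return total


def empty_dir_usage(pods_root: str) -> dict[str, int]:
    """Map pod UID -> bytes stored in that pod's emptyDir volumes."""
    usage = {}
    for pod_uid in sorted(os.listdir(pods_root)):
        volumes = os.path.join(pods_root, pod_uid, "volumes", EMPTY_DIR_PLUGIN)
        if os.path.isdir(volumes):
            usage[pod_uid] = tree_size(volumes)
    return usage
```

Memory-backed emptyDir volumes are mounted at the same location, so they show up in this total even though they do not consume node disk. The pod UID can then be matched against `metadata.uid` of the pods scheduled on the node.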
Why Image GC Alone Is Not Enough
A common misconception is that tuning the image GC thresholds will resolve disk pressure.
This is not accurate.
Image garbage collection only affects container images (the containerd content store).
It does not address:
- logs
- overlayfs usage
- emptyDir or pod volume data
In many cases, disk pressure persists even after image cleanup because the majority of usage is outside the image layer.
Approach to Disk Usage Analysis
To properly troubleshoot disk pressure, the goal is to answer:
Which category is consuming the most disk space?
A structured approach includes:
- Measure usage per category
- Compare relative proportions
- Identify dominant contributor
- Apply targeted mitigation
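The first step, measuring usage per category, can be sketched in a few lines. This is an illustration under stated assumptions, not the diagnostic script itself: the names `CATEGORY_PATHS`, `disk_usage_kb`, and `usage_by_category` are hypothetical, the Image path points at the containerd content store (an assumption of this sketch), and `host_root` stands for wherever the host filesystem is mounted inside the pod. It counts allocated blocks, similar in spirit to `du -sk`, but skips directory entries themselves.

```python
import os

# Host paths per category. The Image path (containerd content store) is an
# assumption of this sketch; the others follow the locations described above.
CATEGORY_PATHS = {
    "Image": ["/var/lib/containerd/io.containerd.content.v1.content"],
    "Overlay": ["/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"],
    "Log": ["/var/log/containers", "/var/log/pods"],
    "Volume": ["/var/lib/kubelet/pods"],
}


def disk_usage_kb(root: str, seen: set) -> int:
    """Allocated size in KB of files under root, counting each inode once."""
    blocks = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))  # do not follow symlinks
            except OSError:
                continue
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # hard link to a file that was already counted
            seen.add(key)
            blocks += st.st_blocks  # st_blocks is in 512-byte units
    return blocks * 512 // 1024


def usage_by_category(paths=CATEGORY_PATHS, host_root: str = "/") -> dict[str, int]:
    """Return {category: usage in KB}; missing directories count as 0."""
    seen: set = set()
    return {
        category: sum(
            disk_usage_kb(os.path.join(host_root, p.lstrip("/")), seen) for p in roots
        )
        for category, roots in paths.items()
    }
```

Using `lstat` matters for the Log category: entries in /var/log/containers are symlinks into /var/log/pods, so following them would count the same log files twice.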
Diagnostic Script
To simplify this process, the following analyzer can be deployed to a specific node:
The script runs inside a privileged pod and inspects the host filesystem.
It provides:
- Per-category disk usage (images, overlay, logs, volumes)
- Percentage breakdown
- Top contributing directories
- Largest files on the node
- Classification hints
Example Output Interpretation
A typical output may look like:
Usage Summary (KB)
Image : 12,000,000
Overlay : 8,000,000
Log : 2,000,000
Volume : 500,000
TOTAL : 22,500,000
Percentage (%)
Image : 53%
Overlay : 35%
Log : 8%
Volume : 2%
How to interpret this
- Image dominant (>50%):
  - large image cache
  - stale images not cleaned
- Overlay dominant:
  - application writing data inside container filesystem
- Log dominant:
  - excessive stdout logging
- Volume dominant:
  - emptyDir or mounted workload data growing
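The percentage breakdown and the classification hint can be reproduced from the usage summary with a few lines of Python. This is a sketch using the sample numbers from the example output. The function names and the `HINTS` mapping are hypothetical, and percentages are truncated to whole numbers, which is why they add up to slightly less than 100.

```python
def percentages(usage_kb: dict[str, int]) -> dict[str, int]:
    """Return each category's share of the total, truncated to a whole percent."""
    total = sum(usage_kb.values())
    if total == 0:
        return {name: 0 for name in usage_kb}
    return {name: kb * 100 // total for name, kb in usage_kb.items()}


def dominant_category(usage_kb: dict[str, int]) -> str:
    """Return the category with the largest usage."""
    if not usage_kb:
        raise ValueError("usage_kb must not be empty")
    return max(usage_kb, key=usage_kb.get)


# Hints taken from the interpretation list above.
HINTS = {
    "Image": "large image cache or stale images not cleaned",
    "Overlay": "application writing data inside container filesystem",
    "Log": "excessive stdout logging",
    "Volume": "emptyDir or mounted workload data growing",
}

summary = {"Image": 12_000_000, "Overlay": 8_000_000, "Log": 2_000_000, "Volume": 500_000}
top = dominant_category(summary)
print(percentages(summary))  # {'Image': 53, 'Overlay': 35, 'Log': 8, 'Volume': 2}
print(f"{top} dominant: {HINTS[top]}")
```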
Troubleshooting Guidance by Category
| Category | Typical Root Cause | Recommended Action |
|---|---|---|
| Image | Stale images, large images | Adjust GC, prune images |
| Overlay | File writes inside container | Change logging pattern, cleanup temp files |
| Log | Excessive stdout logging | Tune log rotation, reduce verbosity |
| Volume | emptyDir or workload-generated files | Add lifecycle cleanup, enforce limits |
Best Practices
- Always identify the dominant disk consumer before taking action
- Do not rely solely on garbage collection
- Ensure application-level cleanup policies exist
- Configure log rotation proactively
- Monitor node disk usage continuously
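For the monitoring point, even a very small check can flag a node before kubelet starts evicting pods. The sketch below uses only the Python standard library. The function names are hypothetical, and the 10% default mirrors kubelet's upstream hard-eviction default for `nodefs.available` on Linux, which may be configured differently on a given cluster.

```python
import shutil


def available_percent(path: str = "/") -> float:
    """Return free space on the filesystem containing path, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.free * 100 / usage.total


def under_pressure(path: str = "/", min_available_pct: float = 10.0) -> bool:
    """True if free space has dropped below the given threshold."""
    return available_percent(path) < min_available_pct


if under_pressure("/"):
    print(f"low disk: {available_percent('/'):.1f}% available")
```

In practice this kind of check belongs in an existing monitoring stack with alerting. The sketch only shows the comparison that such an alert would be built on.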
Conclusion
Disk pressure in AKS is rarely caused by a single factor.
It is the result of how multiple layers in Kubernetes share the same node filesystem.
Accurate troubleshooting requires breaking down disk usage into:
- images
- overlayfs
- logs
- volumes
Using a structured approach and proper tooling allows you to identify the root cause quickly and apply the right mitigation strategy.
