AKS Node Disk Usage Analysis

Jacob_baek 2026. 6. 15. 10:46

Understanding Disk Pressure and Root Causes

Disk pressure on AKS nodes is a common issue in production environments.
While Kubernetes provides basic mechanisms such as image garbage collection, these are often insufficient to resolve real-world disk usage problems.

This post walks through how disk is actually consumed on AKS nodes, what frequently causes disk pressure, and how to systematically analyze it using a diagnostic script.

A helper script is available here:

Node Disk Analyzer

Why Disk Pressure Happens on AKS Nodes

From a Kubernetes perspective, node disk usage is not limited to a single component.
Instead, it is shared across multiple categories collectively known as local ephemeral storage.

In practice, disk pressure is usually caused by the following:

Container images (image cache)
Container writable layers (overlayfs)
Container logs
Pod volumes (such as emptyDir)

Understanding each of these categories is critical for accurate troubleshooting.

Key Disk Usage Categories

1. Container Images

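Locations: /var/lib/containerd/io.containerd.content.v1.content (compressed layer blobs), /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs (unpacked read-only layers)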

This is where container images are stored after being pulled.

Typical causes of growth:

Frequent deployments with new image tags
Large image sizes
Stale images not cleaned up

Kubelet manages this area through image garbage collection using:

imageGcHighThreshold
imageGcLowThreshold

However, garbage collection can only remove images that no container on the node is using.
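The threshold behavior can be illustrated with a short sketch. It assumes the commonly documented kubelet semantics: once disk usage reaches the high threshold, unused images are removed until usage would fall to the low threshold. The function name `image_gc_bytes_to_free` and the 85/80 values are illustrative and are not taken from the analyzer script.

```python
def image_gc_bytes_to_free(capacity_bytes, used_bytes, high_pct, low_pct):
    """Estimate how much image GC would try to reclaim.

    Returns 0 while usage is below the high threshold. Otherwise returns the
    amount needed to bring usage down to the low threshold.
    """
    used_pct = used_bytes * 100 // capacity_bytes
    if used_pct < high_pct:
        return 0
    target_used = capacity_bytes * low_pct // 100
    return max(used_bytes - target_used, 0)


# Example values only: a 100-unit disk with thresholds of 85 and 80.
print(image_gc_bytes_to_free(100, 90, 85, 80))  # above the high threshold
print(image_gc_bytes_to_free(100, 84, 85, 80))  # below the high threshold
```

This is only the target. Because in-use images cannot be removed, the space actually reclaimed may be smaller, which is one reason disk pressure can persist after GC runs.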

2. Container Writable Layer (overlayfs)

Location: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs

This is the writable layer for running containers. It lives in the same snapshotter directory as the unpacked image layers.

Any file created inside a container (excluding mounted volumes) is stored here.

Typical causes:

Applications writing logs to files instead of stdout
Temporary or cache data inside the container filesystem
High container churn (frequent restarts)

Important characteristics:

Not managed by image garbage collection
Often a major contributor to disk pressure
Difficult to notice without explicit analysis

3. Container Logs

Locations: /var/log/containers, /var/log/pods

These are stdout/stderr logs captured by Kubernetes.

Typical causes:

Verbose logging levels (debug/trace)
Lack of log rotation configuration
High request volume

Mitigation options (kubelet log rotation settings):

containerLogMaxSizeMB
containerLogMaxFiles
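To see why these two settings matter, it helps to estimate the worst-case log footprint on a node. The sketch below is a rough upper bound under the assumption that every container keeps the maximum number of files at the maximum size; the function name and the sample numbers are hypothetical, and rotated files may be compressed in practice.

```python
def max_log_footprint_mb(container_count, max_size_mb, max_files):
    """Rough upper bound on stdout/stderr log usage for one node, in MB."""
    return container_count * max_size_mb * max_files


# Hypothetical node: 100 containers, 50 MB per file, 5 files per container.
print(max_log_footprint_mb(100, 50, 5))
```

Lowering either setting shrinks the bound linearly, so it is easy to check whether a proposed configuration fits the node disk before rolling it out.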

4. Pod Volumes (emptyDir and others)

Location: /var/lib/kubelet/pods

This includes all pod-level data such as:

emptyDir
mounted volume data stored on node disk

Typical use cases:

temporary files
caching
data sharing between containers

Typical issues:

Applications continuously writing data without cleanup
Batch jobs generating files
Sidecars buffering data (e.g., log forwarders)

Unlike overlayfs, this is intentional storage defined by workload configuration.
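Because this storage is declared in the workload spec, it can also be capped there. The sketch below builds a pod manifest that sets `emptyDir.sizeLimit` and an `ephemeral-storage` limit, both standard Kubernetes fields. The pod name, image, mount path, and sizes are placeholders chosen for illustration.

```python
import json


def pod_with_limited_scratch(name, image, size_limit, ephemeral_limit):
    """Build a pod manifest whose emptyDir and ephemeral storage are capped."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "app",
                "image": image,
                # Caps the writable layer, logs, and emptyDir usage of this container.
                "resources": {"limits": {"ephemeral-storage": ephemeral_limit}},
                "volumeMounts": [{"name": "scratch", "mountPath": "/scratch"}],
            }],
            # Caps the emptyDir volume itself.
            "volumes": [{"name": "scratch", "emptyDir": {"sizeLimit": size_limit}}],
        },
    }


# Placeholder values; the JSON output can be saved and applied with kubectl.
manifest = pod_with_limited_scratch("scratch-demo", "example.invalid/app:1.0", "1Gi", "2Gi")
print(json.dumps(manifest, indent=2))
```

When a pod exceeds these limits the kubelet evicts it, so the limit turns silent disk growth into a visible, per-workload failure.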

Why Image GC Alone Is Not Enough

A common misconception is that tuning the image GC thresholds will resolve disk pressure.

This is not accurate.

Image garbage collection only affects container images (content store).

It does not address:

logs
overlayfs usage
emptyDir or pod volume data

In many cases, disk pressure persists even after image cleanup because the majority of usage is outside the image layer.

Approach to Disk Usage Analysis

To properly troubleshoot disk pressure, the goal is to answer:

Which category is consuming the most disk space?

A structured approach includes:

Measure usage per category
Compare relative proportions
Identify the dominant contributor
Apply targeted mitigation
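The first three steps can be sketched in a few lines. This is a minimal illustration, not the analyzer script: it sums apparent file sizes rather than allocated blocks, and the mapping from category to directories is passed in by the caller. The demo uses a temporary directory with made-up files so it can run anywhere.

```python
import os
import tempfile


def dir_size_bytes(path):
    """Sum the apparent size of non-directory entries under path.

    Symlinks are not followed, so links such as those under /var/log/containers
    are not counted twice. A missing path simply yields 0.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return total


def usage_breakdown(category_paths):
    """Return (usage per category, total, dominant category)."""
    usage = {cat: sum(dir_size_bytes(p) for p in paths)
             for cat, paths in category_paths.items()}
    total = sum(usage.values())
    dominant = max(usage, key=usage.get) if total else None
    return usage, total, dominant


# Demo with hypothetical files standing in for node directories.
with tempfile.TemporaryDirectory() as root:
    for rel, size in (("logs/a.log", 3000), ("volumes/cache.bin", 1000)):
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as fh:
            fh.write(b"x" * size)
    usage, total, dominant = usage_breakdown({
        "Log": [os.path.join(root, "logs")],
        "Volume": [os.path.join(root, "volumes")],
    })
    print(usage, total, dominant)
```

On a real node the mapping would point at the host paths listed earlier, typically reached through a host mount inside a privileged pod.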

Diagnostic Script

To simplify this process, the following analyzer can be deployed to a specific node:

AKS Node Disk Analyzer

The script runs inside a privileged pod and inspects the host filesystem.

It provides:

Per-category disk usage (images, overlay, logs, volumes)
Percentage breakdown
Top contributing directories
Largest files on the node
Classification hints
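As an illustration of the "largest files" item, the following sketch lists the biggest files under a directory using a bounded heap. It is an assumption about how such a report could be produced, not the script's actual implementation; `largest_files` and the demo file names are hypothetical.

```python
import heapq
import os
import tempfile


def largest_files(root, limit=10):
    """Return up to `limit` (size_bytes, path) tuples, largest first."""
    def iter_sizes():
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    yield os.lstat(path).st_size, path
                except OSError:
                    continue  # file vanished or is unreadable; skip it
    return heapq.nlargest(limit, iter_sizes())


# Demo with made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for name, size in (("small.txt", 10), ("big.bin", 5000), ("mid.log", 700)):
        with open(os.path.join(root, name), "wb") as fh:
            fh.write(b"x" * size)
    top = largest_files(root, limit=2)
    for size, path in top:
        print(size, os.path.basename(path))
```

Using `heapq.nlargest` keeps memory bounded by `limit`, which matters when walking a node filesystem with millions of files.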

Example Output Interpretation

A typical output may look like:

Usage Summary (KB)
Image   : 12,000,000
Overlay : 8,000,000
Log     : 2,000,000
Volume  : 500,000
TOTAL   : 22,500,000
Percentage (%)
Image   : 53%
Overlay : 35%
Log     : 8%
Volume  : 2%
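The percentages in this sample add up to 98 rather than 100, which is consistent with integer (truncating) division. That is an inference from the numbers, not a documented property of the script. The arithmetic can be reproduced as follows:

```python
# Values copied from the sample output above, in KB.
usage_kb = {
    "Image": 12_000_000,
    "Overlay": 8_000_000,
    "Log": 2_000_000,
    "Volume": 500_000,
}

total_kb = sum(usage_kb.values())

# Integer division truncates, e.g. 35.56% becomes 35%.
percent = {name: kb * 100 // total_kb for name, kb in usage_kb.items()}

print(total_kb)
print(percent)
```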

How to interpret this

Image dominant (>50%):
- large image cache
- stale images not cleaned
Overlay dominant:
- application writing data inside container filesystem
Log dominant:
- excessive stdout logging
Volume dominant:
- emptyDir or mounted workload data growing
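These rules map naturally onto a small lookup. The sketch below picks the category with the largest share and returns the matching hints from the list above; the function name `classify` and the choice to treat "dominant" as "largest share" are assumptions made for illustration.

```python
# Hints copied from the interpretation rules above.
HINTS = {
    "Image": ["large image cache", "stale images not cleaned"],
    "Overlay": ["application writing data inside container filesystem"],
    "Log": ["excessive stdout logging"],
    "Volume": ["emptyDir or mounted workload data growing"],
}


def classify(percent_by_category):
    """Return (dominant category, hints) for a percentage breakdown."""
    dominant = max(percent_by_category, key=percent_by_category.get)
    return dominant, HINTS.get(dominant, [])


# Percentages from the sample output.
dominant, hints = classify({"Image": 53, "Overlay": 35, "Log": 8, "Volume": 2})
print(dominant, hints)
```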

Troubleshooting Guidance by Category

Category	Typical Root Cause	Recommended Action
Image	Stale images, large images	Tune image GC thresholds, prune unused images
Overlay	File writes inside container	Log to stdout instead of files, clean up temp files
Log	Excessive stdout logging	Tune log rotation, reduce verbosity
Volume	emptyDir or workload-generated files	Add lifecycle cleanup, enforce size limits

Best Practices

Always identify the dominant disk consumer before taking action
Do not rely solely on garbage collection
Ensure application-level cleanup policies exist
Configure log rotation proactively
Monitor node disk usage continuously
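For the monitoring item, a minimal check can be built on the standard library alone. The sketch below reports the available percentage of a filesystem and flags it when it drops below a threshold. The 10% value is an example chosen to resemble the kubelet's default hard eviction threshold for node filesystem availability; `disk_status` is a hypothetical helper, and a production setup would normally rely on a metrics pipeline instead.

```python
import shutil


def disk_status(path="/", min_available_pct=10.0):
    """Report available space for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    available_pct = usage.free * 100 / usage.total
    return {
        "path": path,
        "available_pct": round(available_pct, 1),
        "under_pressure": available_pct < min_available_pct,
    }


status = disk_status("/")
print(status)
```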

Conclusion

Disk pressure in AKS is rarely caused by a single factor.
It is the result of how multiple layers in Kubernetes share the same node filesystem.

Accurate troubleshooting requires breaking down disk usage into:

images
overlayfs
logs
volumes

Using a structured approach and proper tooling allows you to identify the root cause quickly and apply the right mitigation strategy.
