AKS Node Disk Usage Analysis

Jacob_baek 2026. 6. 15. 10:46

Understanding Disk Pressure and Root Causes

Disk pressure on AKS nodes is a common issue in production environments.
While Kubernetes provides basic mechanisms such as image garbage collection, these are often insufficient to resolve real-world disk usage problems.

This post walks through how disk is actually consumed on AKS nodes, what frequently causes disk pressure, and how to systematically analyze it using a diagnostic script.

A helper script is available here:


Why Disk Pressure Happens on AKS Nodes

From a Kubernetes perspective, node disk usage is not limited to a single component.
Instead, it is shared across multiple categories collectively known as local ephemeral storage.

In practice, disk pressure is usually caused by the following:

  1. Container images (image cache)
  2. Container writable layers (overlayfs)
  3. Container logs
  4. Pod volumes (such as emptyDir)

Understanding each of these categories is critical for accurate troubleshooting.


Key Disk Usage Categories

1. Container Images

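Location: /var/lib/containerd/io.containerd.content.v1.content (compressed layer blobs) and /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs (unpacked layers)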

This is where container images are stored after being pulled.

Typical causes of growth:

  • Frequent deployments with new image tags
  • Large image sizes
  • Stale images not cleaned up

The kubelet manages this area through image garbage collection, which AKS exposes through two custom node configuration settings:

  • imageGcHighThreshold
  • imageGcLowThreshold

However, this only applies to unused images.
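
The interaction between the two thresholds can be sketched as follows. This is a simplified model of the kubelet's documented behavior (collection starts once disk usage reaches the high threshold and aims to bring usage down to the low threshold). The function name and the sample numbers are illustrative assumptions, not part of any AKS or kubelet API.

```python
def bytes_to_free(capacity: int, used: int, high_pct: int, low_pct: int) -> int:
    """Return how many bytes image GC would aim to reclaim (0 if not triggered)."""
    usage_pct = used * 100 // capacity
    if usage_pct < high_pct:
        return 0  # below the high threshold: GC does not start
    target_used = capacity * low_pct // 100
    return max(used - target_used, 0)


GIB = 1024 ** 3
# Sample values only: a 128 GiB disk with 115 GiB used, thresholds 85 / 80.
goal = bytes_to_free(128 * GIB, 115 * GIB, high_pct=85, low_pct=80)
print(f"GC goal: {goal / GIB:.1f} GiB")
```

The result is only a goal. Because in-use images cannot be removed, the space actually reclaimed may be much smaller.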


2. Container Writable Layer (overlayfs)

Location: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs

This is the writable layer for running containers.

Any file created inside a container (excluding mounted volumes) is stored here.

Typical causes:

  • Applications writing logs to files instead of stdout
  • Temporary or cache data inside the container filesystem
  • High container churn (frequent restarts)

Important characteristics:

  • Not managed by image garbage collection
  • Often a major contributor to disk pressure
  • Difficult to notice without explicit analysis

3. Container Logs

Locations: /var/log/containers, /var/log/pods

These are stdout/stderr logs captured by Kubernetes.

Typical causes:

  • Verbose logging levels (debug/trace)
  • Lack of log rotation configuration
  • High request volume

Mitigation options (kubelet log rotation settings in AKS custom node configuration):

  • containerLogMaxSizeMB
  • containerLogMaxFiles
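
To see what these two settings mean for the node as a whole, a rough ceiling can be computed. The sketch below assumes each container keeps at most containerLogMaxFiles files of up to containerLogMaxSizeMB each. The function name and the sample values are illustrative assumptions.

```python
def log_budget_mb(containers: int, max_size_mb: int, max_files: int) -> int:
    """Approximate ceiling on stdout/stderr log usage for one node, in MB."""
    return containers * max_size_mb * max_files


# Sample values only: 100 containers, 50 MB per file, 5 files per container.
print(log_budget_mb(containers=100, max_size_mb=50, max_files=5))  # 25000
```

Treat the result as an estimate. Rotated files may be compressed, and a very chatty container can overshoot the size limit between rotation checks.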

4. Pod Volumes (emptyDir and others)

Location: /var/lib/kubelet/pods

This includes all pod-level data such as:

  • emptyDir
  • mounted volume data stored on node disk

Typical use cases:

  • temporary files
  • caching
  • data sharing between containers

Typical issues:

  • Applications continuously writing data without cleanup
  • Batch jobs generating files
  • Sidecars buffering data (e.g., log forwarders)

Unlike overlayfs, this is intentional storage defined by workload configuration.


Why Image GC Alone Is Not Enough

A common misconception is that tuning the image GC thresholds (for example, lowering them so that cleanup starts earlier) will resolve disk pressure.

This is not accurate.

Image garbage collection only affects container images (the content store).

It does not address:

  • logs
  • overlayfs usage
  • emptyDir or pod volume data

In many cases, disk pressure persists even after image cleanup because the majority of usage is outside the image layer.


Approach to Disk Usage Analysis

To properly troubleshoot disk pressure, the goal is to answer:

Which category is consuming the most disk space?

A structured approach includes:

  1. Measure usage per category
  2. Compare relative proportions
  3. Identify the dominant contributor
  4. Apply targeted mitigation
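
The first three steps can be sketched in a few lines. This is a minimal illustration, not the analyzer script itself. The /host mount prefix and the category-to-path mapping in NODE_PATHS are assumptions. Image and overlay usage share one snapshotter directory, so separating them needs more than a path and is left out here.

```python
import os
import tempfile


def dir_size_kb(path: str) -> int:
    """Sum apparent file sizes under path, in KB, skipping unreadable entries."""
    total_bytes = 0
    # os.walk does not follow symlinks by default, so /var/log/containers
    # (symlinks into /var/log/pods) would not be counted twice.
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total_bytes += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable
    return total_bytes // 1024


def usage_by_category(paths: dict) -> dict:
    """Step 1: measure usage per category (KB)."""
    return {cat: sum(dir_size_kb(p) for p in dirs) for cat, dirs in paths.items()}


def dominant(usage: dict) -> str:
    """Step 3: name the category with the largest usage."""
    return max(usage, key=usage.get)


# Hypothetical mapping for a pod that mounts the node root at /host.
NODE_PATHS = {
    "Log": ["/host/var/log/pods"],
    "Volume": ["/host/var/lib/kubelet/pods"],
}

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo = {"Log": os.path.join(tmp, "log"), "Volume": os.path.join(tmp, "vol")}
    for d in demo.values():
        os.makedirs(d)
    with open(os.path.join(demo["Log"], "0.log"), "wb") as f:
        f.write(b"x" * 4096)
    with open(os.path.join(demo["Volume"], "cache.bin"), "wb") as f:
        f.write(b"x" * 2048)
    demo_usage = usage_by_category({k: [v] for k, v in demo.items()})
    print(demo_usage, "dominant:", dominant(demo_usage))
```

The sketch sums apparent file sizes and does not stop at mount points, so its numbers can differ from what du or df report on a real node.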

Diagnostic Script

To simplify this process, the following analyzer can be deployed to a specific node:

The script runs inside a privileged pod and inspects the host filesystem.

It provides:

  • Per-category disk usage (images, overlay, logs, volumes)
  • Percentage breakdown
  • Top contributing directories
  • Largest files on the node
  • Classification hints
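
As an illustration of the "largest files" item, the sketch below walks a directory tree and keeps only the top N entries with heapq.nlargest. It is an assumption about how such a report could be built, not the script's actual implementation, and the function name is hypothetical.

```python
import heapq
import os
import tempfile


def largest_files(root: str, n: int = 10) -> list:
    """Return up to n (size_bytes, path) tuples, largest first."""
    def sized_files():
        for dirpath, _dirs, files in os.walk(root, onerror=lambda err: None):
            for name in files:
                full = os.path.join(dirpath, name)
                try:
                    yield os.lstat(full).st_size, full
                except OSError:
                    continue  # file vanished or is unreadable
    # nlargest keeps at most n items in memory while consuming the generator.
    return heapq.nlargest(n, sized_files())


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name, size in [("a.log", 10), ("b.tmp", 300), ("c.bin", 20)]:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(b"x" * size)
    top_two = largest_files(tmp, n=2)
    for size, path in top_two:
        print(f"{size:>10}  {os.path.basename(path)}")
```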

Example Output Interpretation

A typical output may look like:

Usage Summary (KB)
Image   : 12,000,000
Overlay : 8,000,000
Log     : 2,000,000
Volume  : 500,000
TOTAL   : 22,500,000

Percentage (%)
Image   : 53%
Overlay : 35%
Log     : 8%
Volume  : 2%
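
The percentages follow directly from the KB figures. The sketch below assumes the values are truncated to whole numbers, which reproduces the figures shown and explains why they add up to 98 rather than 100.

```python
usage_kb = {
    "Image": 12_000_000,
    "Overlay": 8_000_000,
    "Log": 2_000_000,
    "Volume": 500_000,
}

total_kb = sum(usage_kb.values())

# Integer division truncates, e.g. 8,000,000 / 22,500,000 = 35.56% -> 35%.
percent = {name: kb * 100 // total_kb for name, kb in usage_kb.items()}

for name, pct in percent.items():
    print(f"{name:<8}: {pct}%")
```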

How to interpret this

  • Image dominant (>50%):
    • large image cache
    • stale images not cleaned
  • Overlay dominant:
    • application writing data inside container filesystem
  • Log dominant:
    • excessive stdout logging
  • Volume dominant:
    • emptyDir or mounted workload data growing
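
These rules can be turned into a small lookup. The sketch below is an assumption about how classification hints might be produced. The 50% cutoff comes from the image rule above, and applying it to every category is a simplification made here.

```python
HINTS = {
    "Image": "large image cache or stale images not cleaned",
    "Overlay": "application writing data inside the container filesystem",
    "Log": "excessive stdout logging",
    "Volume": "emptyDir or mounted workload data growing",
}


def classify(usage_kb: dict, threshold_pct: int = 50) -> str:
    """Return a one-line hint for the largest category in usage_kb."""
    total = sum(usage_kb.values())
    if total == 0:
        return "no usage recorded"
    name, kb = max(usage_kb.items(), key=lambda item: item[1])
    pct = kb * 100 // total
    hint = HINTS.get(name, "unknown category")
    if pct > threshold_pct:
        return f"{name} dominant ({pct}%): {hint}"
    return f"No category above {threshold_pct}%; largest is {name} ({pct}%): {hint}"


example = {"Image": 12_000_000, "Overlay": 8_000_000, "Log": 2_000_000, "Volume": 500_000}
print(classify(example))
```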

Troubleshooting Guidance by Category

Category | Typical Root Cause                   | Recommended Action
---------+--------------------------------------+--------------------------------------------
Image    | Stale images, large images           | Adjust GC, prune images
Overlay  | File writes inside container         | Change logging pattern, clean up temp files
Log      | Excessive stdout logging             | Tune log rotation, reduce verbosity
Volume   | emptyDir or workload-generated files | Add lifecycle cleanup, enforce limits

Best Practices

  • Always identify the dominant disk consumer before taking action
  • Do not rely solely on garbage collection
  • Ensure application-level cleanup policies exist
  • Configure log rotation proactively
  • Monitor node disk usage continuously

Conclusion

Disk pressure in AKS is rarely caused by a single factor.
It is the result of how multiple layers in Kubernetes share the same node filesystem.

Accurate troubleshooting requires breaking down disk usage into:

  • images
  • overlayfs
  • logs
  • volumes

Using a structured approach and proper tooling allows you to identify the root cause quickly and apply the right mitigation strategy.
