GuanceDB Storage Disk Space Exhaustion

Overview

This runbook provides procedures for responding to GuanceDB storage disk space exhaustion incidents. When the storage disk reaches capacity, GuanceDB switches to read-only mode, preventing new metrics data ingestion.

Detection

Primary Indicators

  • NSQ Channel Pile-up: An alert fires for a message backlog on the df_metric_guance channel
  • Zero QPS: Both kodo-x (metrics worker) and GuanceDB Metrics QPS drop to zero

Investigation Steps

1. Verify Service Status

Check monitoring dashboards to confirm:

  • GuanceDB Metrics QPS = 0
  • kodo-x (metrics worker) QPS = 0

2. Examine Application Logs

Check the guancedb-storage logs:

bash
kubectl logs guancedb-cluster-guance-storage-0 -n middleware --tail=100

Look for messages like:

"msg":"switching the storage at /storage to read-only mode, since it has less than -storage.minFreeDiskSpaceBytes=10000000 of free space"
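Rather than scanning the log by eye, the output can be filtered for this message. The following is a minimal sketch, not an official tool: the function names storage_is_readonly and check_storage_pod are hypothetical, the grep pattern is taken from the sample message above, and the pod name and namespace default to the values used in this runbook.

```shell
# Hypothetical helper: reads log lines on stdin and succeeds (exit 0)
# if any line reports the switch to read-only mode.
storage_is_readonly() {
  grep -q 'switching the storage at .* to read-only mode'
}

# Hypothetical wrapper: checks the recent log lines of one storage pod.
# Defaults are assumptions copied from the commands in this runbook.
check_storage_pod() {
  pod="${1:-guancedb-cluster-guance-storage-0}"
  ns="${2:-middleware}"
  if kubectl logs "$pod" -n "$ns" --tail=1000 | storage_is_readonly; then
    echo "$pod: read-only"
  else
    echo "$pod: no read-only message in recent logs"
  fi
}
```

Calling check_storage_pod with no arguments checks the first storage pod; pass guancedb-cluster-guance-storage-1 as the first argument to check the second one.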

3. Verify Disk Usage

Connect to the guancedb-storage pod and check disk usage:

bash
kubectl exec -it guancedb-cluster-guance-storage-0 -n middleware -- df -h /storage
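When the check needs to be scripted, the usage percentage can be extracted as a plain number. This is a sketch under assumptions: the function names are hypothetical, and it assumes the df in the container accepts -P (POSIX output format, one line per filesystem), which should be verified for the image in use.

```shell
# Hypothetical helper: reads `df -P <path>` output on stdin and prints the
# Capacity (use%) value of the first filesystem row as a bare integer.
df_use_percent() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Hypothetical wrapper: prints the /storage usage percentage of one pod.
# The pod name and namespace are the ones used elsewhere in this runbook.
storage_use_percent() {
  pod="${1:-guancedb-cluster-guance-storage-0}"
  kubectl exec "$pod" -n middleware -- df -P /storage | df_use_percent
}
```

A value at or near 100 is consistent with the read-only message in the logs.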

Resolution Steps

1. Expand PVC Capacity

You can use either kubectl edit or Rancher to perform the following actions:

  1. Find the GuanceDB storage PVCs (typically two PVCs called guance-storage-volume-guancedb-cluster-guance-storage-0 and guance-storage-volume-guancedb-cluster-guance-storage-1)
  2. Edit each PVC:
    • Modify spec.resources.requests.storage
    • Increase from current size (e.g., 20Gi → 40Gi)
  3. Apply changes
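As a non-interactive alternative to kubectl edit, the same change can be applied with kubectl patch. The sketch below makes several assumptions: the helper names are hypothetical, the PVC names are the typical ones listed above, and the PVCs are in the middleware namespace alongside the pods. Expansion only succeeds if the StorageClass has allowVolumeExpansion set to true, and a PVC request cannot be reduced afterwards.

```shell
# Hypothetical helper: prints a patch body that sets the requested PVC size.
pvc_resize_patch() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}' "$1"
}

# Hypothetical helper: patches both storage PVCs named in this runbook.
# Adjust the names if your cluster uses different ones.
resize_storage_pvcs() {
  size="$1"
  for i in 0 1; do
    kubectl patch pvc "guance-storage-volume-guancedb-cluster-guance-storage-$i" \
      -n middleware -p "$(pvc_resize_patch "$size")"
  done
}
```

For example, resize_storage_pvcs 40Gi requests 40Gi on both volumes.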

2. Verify Expansion

Check that the disk has been expanded:

bash
kubectl exec -it guancedb-cluster-guance-storage-0 -n middleware -- df -h /storage
kubectl exec -it guancedb-cluster-guance-storage-1 -n middleware -- df -h /storage

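The output should show the new disk size. If the old size still appears, inspect the PVC status and events with kubectl describe pvc before retrying.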

3. Monitor Recovery

  1. Check NSQ Queue: Verify that the df_metric_guance backlog is decreasing
  2. Monitor QPS: Confirm that both guancedb-insert and kodo-x QPS return to normal levels
  3. Verify Write Operations: Ensure new metrics data is being written successfully

Follow-up Actions

  • [ ] Set up a PVC usage alert. Open question: can we send an alert when a guance-storage PVC reaches 70% usage?
  • [ ] Consider implementing automated disk expansion policies
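
As a starting point for the usage alert, a threshold check can be scripted until a proper monitoring rule exists. This is only a sketch: pvc_usage_alert is a hypothetical function, the 70% default comes from the question in the checklist, and how the alert is actually delivered is left open.

```shell
# Hypothetical helper: reads `df -P <path>` output on stdin. If the use%
# of the first filesystem row is at or above the threshold (default 70),
# it prints an alert line and returns 1; otherwise it returns 0 silently.
pvc_usage_alert() {
  limit="${1:-70}"
  awk -v limit="$limit" 'NR == 2 {
    sub(/%/, "", $5)
    if ($5 + 0 >= limit + 0) {
      print "ALERT: " $6 " at " $5 "% (threshold " limit "%)"
      exit 1
    }
  }'
}
```

It could be fed from a scheduled job, for example by piping the output of kubectl exec guancedb-cluster-guance-storage-0 -n middleware -- df -P /storage into pvc_usage_alert, assuming the container's df accepts -P.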