GuanceDB Storage Disk Space Exhaustion
Overview
This runbook provides procedures for responding to GuanceDB storage disk space exhaustion incidents. When the storage disk reaches capacity, GuanceDB switches to read-only mode, preventing new metrics data ingestion.
Detection
Primary Indicators
- NSQ Channel Pile-up: Alert from the df_metric_guance channel
- Zero QPS: Both kodo-x (metrics worker) and GuanceDB Metrics QPS drop to zero
Investigation Steps
1. Verify Service Status
Check monitoring dashboards to confirm:
- GuanceDB Metrics QPS = 0
- kodo-x (metrics worker) QPS = 0
2. Examine Application Logs
Check the guancedb-storage logs:
bash
kubectl logs guancedb-cluster-guance-storage-0 -n middleware --tail=100
Look for messages like:
"msg":"switching the storage at /storage to read-only mode, since it has less than -storage.minFreeDiskSpaceBytes=10000000 of free space"
3. Verify Disk Usage
Connect to the guancedb-storage pod and check disk usage:
bash
kubectl exec -it guancedb-cluster-guance-storage-0 -n middleware -- df -h /storage
Resolution Steps
1. Expand PVC Capacity
You can use either kubectl edit or Rancher to perform the following actions:
- Find the GuanceDB storage PVCs (typically two PVCs called guance-storage-volume-guancedb-cluster-guance-storage-0 and guance-storage-volume-guancedb-cluster-guance-storage-1)
- Edit each PVC:
  - Modify spec.resources.requests.storage
  - Increase from current size (e.g., 20Gi → 40Gi)
- Apply changes
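As a command-line alternative to kubectl edit, the same change can be sketched with kubectl patch. This is a minimal sketch rather than a confirmed procedure for this cluster. It assumes the middleware namespace used by the other commands in this runbook, a target size of 40Gi, and a StorageClass with allowVolumeExpansion enabled (without it, Kubernetes rejects the size increase). The function names and the DRY_RUN switch are hypothetical helpers introduced here for illustration.

```bash
# Sketch: raise spec.resources.requests.storage on both storage PVCs.
# Assumptions: namespace "middleware", target size 40Gi, expandable StorageClass.
NAMESPACE="${NAMESPACE:-middleware}"
NEW_SIZE="${NEW_SIZE:-40Gi}"

# Build the JSON patch body that sets the requested storage size.
pvc_patch() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}' "$1"
}

# Patch each PVC. With DRY_RUN=1 the commands are printed instead of executed.
expand_pvcs() {
  local pvc
  for pvc in \
    guance-storage-volume-guancedb-cluster-guance-storage-0 \
    guance-storage-volume-guancedb-cluster-guance-storage-1; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "kubectl patch pvc $pvc -n $NAMESPACE -p '$(pvc_patch "$NEW_SIZE")'"
    else
      kubectl patch pvc "$pvc" -n "$NAMESPACE" -p "$(pvc_patch "$NEW_SIZE")"
    fi
  done
}
```

Running `DRY_RUN=1 expand_pvcs` first prints the two commands, so the PVC names and the new size can be reviewed before anything changes. PVC sizes can only be increased, so pick the new size deliberately.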
2. Verify Expansion
Check that the disk has been expanded:
bash
kubectl exec -it guancedb-cluster-guance-storage-0 -n middleware -- df -h /storage
kubectl exec -it guancedb-cluster-guance-storage-1 -n middleware -- df -h /storage
The output should show the new disk size.
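If df still reports the old size, it can help to compare what each PVC requests with what Kubernetes reports as its actual capacity. The sketch below reads spec.resources.requests.storage and status.capacity.storage for both PVCs. It assumes the middleware namespace, and the helper names are hypothetical. A mismatch usually means the resize has not finished yet.

```bash
# Sketch: compare requested size with reported capacity for each storage PVC.
# Assumption: PVCs live in the "middleware" namespace.
NAMESPACE="${NAMESPACE:-middleware}"

# Succeeds when "<requested> <capacity>" holds two equal, non-empty sizes.
sizes_match() {
  # Intentionally unquoted: split the argument into positional parameters.
  set -- $1
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Print the resize state of both PVCs.
check_pvc_resized() {
  local pvc sizes
  for pvc in \
    guance-storage-volume-guancedb-cluster-guance-storage-0 \
    guance-storage-volume-guancedb-cluster-guance-storage-1; do
    sizes="$(kubectl get pvc "$pvc" -n "$NAMESPACE" \
      -o jsonpath='{.spec.resources.requests.storage} {.status.capacity.storage}')"
    if sizes_match "$sizes"; then
      echo "$pvc: resized ($sizes)"
    else
      echo "$pvc: resize pending or failed ($sizes)"
    fi
  done
}
```

When a PVC stays in the pending state, `kubectl describe pvc <name> -n middleware` shows the resize conditions and events.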
3. Monitor Recovery
- Check NSQ Queue: Verify that the df_metric_guance pile is decreasing
- Monitor QPS: Confirm that both guancedb-insert and kodo-x QPS return to normal levels
- Verify Write Operations: Ensure new metrics data is being written successfully
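As one extra signal alongside the dashboard checks above, the recent storage logs can be scanned for the read-only message quoted in the investigation steps. This is a rough sketch under assumptions: it uses the middleware namespace, a 5-minute window chosen arbitrarily, and hypothetical helper names. A clean window does not prove recovery by itself, so treat it as a complement to the QPS and queue checks.

```bash
# Sketch: look for the read-only switch message in recent storage logs.
# Assumptions: namespace "middleware", 5-minute look-back window.
NAMESPACE="${NAMESPACE:-middleware}"

# Succeeds when the log text on stdin mentions read-only mode.
has_readonly_message() {
  grep -q 'read-only mode'
}

# Report, per storage pod, whether the message appeared recently.
check_recent_logs() {
  local pod
  for pod in guancedb-cluster-guance-storage-0 guancedb-cluster-guance-storage-1; do
    if kubectl logs "$pod" -n "$NAMESPACE" --since=5m | has_readonly_message; then
      echo "$pod: read-only message seen in the last 5 minutes"
    else
      echo "$pod: no read-only message in the last 5 minutes"
    fi
  done
}
```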
Follow-up Actions
- [ ] Set up a PVC usage alert. Open question: can we send an alert when a guance-storage PVC reaches 70% usage?
- [ ] Consider implementing automated disk expansion policies
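Until a proper alert rule exists in the monitoring platform, the 70% check could be approximated with a small script run on a schedule. The sketch below is illustrative only. It assumes the middleware namespace, a /storage mount in both pods, and a df inside the container that supports the POSIX -P output format. The helper names and the ALERT line format are hypothetical.

```bash
# Sketch: flag storage pods whose /storage usage has reached a threshold.
# Assumptions: namespace "middleware", threshold 70%, `df -P` available in the pod.
NAMESPACE="${NAMESPACE:-middleware}"
THRESHOLD="${THRESHOLD:-70}"

# Read `df -P` output on stdin and print the use percentage without the % sign.
df_use_percent() {
  awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeeds when the given percentage is at or above the threshold.
over_threshold() {
  [ -n "$1" ] && [ "$1" -ge "$THRESHOLD" ]
}

# Print an ALERT line for every storage pod at or above the threshold.
check_storage_usage() {
  local pod used
  for pod in guancedb-cluster-guance-storage-0 guancedb-cluster-guance-storage-1; do
    used="$(kubectl exec "$pod" -n "$NAMESPACE" -- df -P /storage | df_use_percent)"
    if over_threshold "$used"; then
      echo "ALERT $pod: /storage at ${used}% (threshold ${THRESHOLD}%)"
    fi
  done
}
```

A script like this only covers the gap. A rule evaluated by the monitoring system itself is the more durable fix, since it keeps working when nobody is running the script.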