504 Gateway Timeout
The Mystery Connection
New nodes were provisioned for the ScopeDB nodegroup. At exactly the same time, Dataway pods began responding with numerous 504 gateway timeout errors. What could possibly connect these two seemingly unrelated events?
Incident Summary
- Date: 2025-07-30
- Cluster/Environment: id1
- Status: Resolved
- Severity: P1
- Duration: 3h
- Detection Delay: 35m
Impact
User Impact
- Users experienced slow and unstable connections to id1-console.truewatch.com
- A portion of requests from customer datakit instances failed with 504 gateway timeout errors (not 503 as initially logged)
Timeline
Start Phase
- 10:45 - Provisioned new nodes in the ScopeDB-related nodegroups for the ScopeDB deployment
Detection Phase (35-minute delay)
- 11:20 - Multiple 504 gateway timeout errors appeared on the Cloudflare dashboard; synthetic test alerts triggered in the Lark group
Investigation Phase (2h 30m)
- 11:30 - Examined Dataway and Kodo logs — no application errors found, but observed increased timeout counts and client connections in Tencent CLB metrics
- 12:18 - Rolled back the ScopeDB Helm release; no improvement observed
Resolution Phase
- 13:50 - Deleted all newly created nodes from the ScopeDB nodegroup; the issue resolved immediately
End Phase
- 13:50 - Service fully restored after node deletion
Root Cause Analysis
Immediate Cause
The newly provisioned ScopeDB nodes were attached to an incorrect security group that blocked inbound access to the required NodePort. Here's what happened:
- Security Group Misconfiguration: New nodes were attached to a security group that did not allow inbound traffic on the required NodePort
- Traffic Misdirection: When data requests reached the CLB (Cloud Load Balancer), the QCloud ingress class forwarded them to the Dataway NodePort service (a sketch of this path follows this list)
- Traffic Black Hole: The NodePort service directed traffic to nodes that couldn't accept connections on the required port (the new ScopeDB nodes), causing requests to time out
- 504 Response: The timed-out requests surfaced to clients as 504 gateway timeout errors
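To make the path concrete, here is a minimal sketch of how a QCloud-class Ingress fronts a NodePort Service through a Tencent CLB. The resource names, namespace, hostname, ingress-class annotation, and ports are assumptions for illustration, not taken from the production manifests.

```yaml
# Hypothetical sketch only: QCloud ingress class exposing Dataway via a CLB.
# Names, namespace, hostname, and ports are assumed, not the production values.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dataway
  namespace: dataway                       # assumed namespace
  annotations:
    kubernetes.io/ingress.class: qcloud    # assumed annotation for the QCloud ingress class
spec:
  rules:
    - host: dataway.example.com            # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: dataway              # backed by a NodePort Service
                port:
                  number: 9528             # assumed Dataway service port
```

Because the backend is a NodePort Service, the CLB forwards traffic to a node port on the cluster nodes rather than to pod IPs directly, which is why node-level security groups matter on every node in the cluster.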
Contributing Factors
Several system behaviors compounded this issue:
QCloud CLB Health Check Bypass: The cloud load balancer's health checks bypassed security group restrictions, so health checks continued to pass even though the actual service port was blocked. This prevented early detection of the misconfiguration.
NodePort Behavior Misunderstanding: The team lacked a full understanding of how NodePort services work:
- The QCloud ingress class directly uses NodePort services as backends
- NodePort opens the specified port on every single cluster node
- The kube-proxy on each node handles traffic redirection after receiving it on the NodePort
Actual Traffic Flow:
Internet → CLB → Any Node:31792 → kube-proxy → Pod (on any node)

Even when traffic hits a node without the target pods, kube-proxy should forward it to the correct pods on other nodes, but this failed because the security group blocked the initial connection to the NodePort (see the Service sketch below).
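Below is a minimal sketch of the NodePort Service behind that flow; the names and ports are assumptions, with the node port taken from the flow above. With the default `externalTrafficPolicy: Cluster`, kube-proxy on every node, including the new ScopeDB nodes, listens on the node port and may forward connections to a Dataway pod on another node, so a blocked port on an otherwise unrelated node can black-hole requests.

```yaml
# Hypothetical sketch of the Dataway NodePort Service; names and ports are assumed.
apiVersion: v1
kind: Service
metadata:
  name: dataway
  namespace: dataway
spec:
  type: NodePort
  # Default policy: every node listens on the nodePort, and kube-proxy may
  # forward the connection to a pod running on any other node.
  externalTrafficPolicy: Cluster
  selector:
    app: dataway
  ports:
    - name: http
      port: 9528           # assumed service/container port
      targetPort: 9528
      nodePort: 31792      # the node port seen in the traffic flow above
```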
What Could Be Improved
Prevention
- Replace QCloud Ingress Controller: Migrate from the QCloud ingress class to the nginx ingress controller for more predictable behavior
- Optimize Traffic Routing: Set `externalTrafficPolicy: Local` and `local-svc-only-bind-node-with-pod: "true"` on the Service to prevent unnecessary cross-node redirection by NodePort services, which also avoids an extra network hop (as sketched below)
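A hedged sketch of what that change could look like on the same assumed Service. The short annotation key quoted above is shown here with a `service.cloud.tencent.com/` prefix, which is an assumption to verify against the Tencent TKE documentation.

```yaml
# Hypothetical sketch: keep NodePort traffic on nodes that actually run Dataway pods.
# The annotation prefix is an assumption; confirm the exact key in the TKE docs.
apiVersion: v1
kind: Service
metadata:
  name: dataway
  namespace: dataway
  annotations:
    service.cloud.tencent.com/local-svc-only-bind-node-with-pod: "true"
spec:
  type: NodePort
  externalTrafficPolicy: Local   # drop cross-node forwarding; only local pods receive traffic
  selector:
    app: dataway
  ports:
    - name: http
      port: 9528
      targetPort: 9528
      nodePort: 31792
```

With `externalTrafficPolicy: Local`, a node only accepts NodePort traffic for pods it hosts, and the annotation asks the CLB to register only those nodes as backends, so newly added nodes without Dataway pods never receive this traffic.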