504 Gateway Timeout
The Mystery Connection
New nodes were provisioned for the ScopeDB nodegroup. At exactly the same time, Dataway pods began responding with numerous 504 gateway timeout errors. What could possibly connect these two seemingly unrelated events?
Incident Summary
- Date: 2025-07-30
- Cluster/Environment: id1
- Status: Resolved
- Severity: P1
- Duration: 3h
- Detection Delay: 35m
Impact
User Impact
- Users experienced slow and unstable connections to id1-console.truewatch.com
- A portion of requests from customer datakit instances failed with 504 gateway timeout errors (not 503 as initially logged)
Timeline
Start Phase
- 10:45 - Provisioned new nodes in the ScopeDB-related nodegroups for the ScopeDB deployment
Detection Phase (35-minute delay)
- 11:20 - Multiple 504 gateway timeout errors appeared on the Cloudflare dashboard; synthetic test alerts triggered in the Lark group
Investigation Phase (2h 30m)
- 11:30 - Examined Dataway and Kodo logs — no application errors found, but observed increased timeout counts and client connections in Tencent CLB metrics
- 12:18 - Rolled back the ScopeDB Helm release; no improvement observed
Resolution Phase
- 13:50 - Deleted all newly created nodes from the ScopeDB nodegroup; the issue resolved immediately
End Phase
- 13:50 - Service fully restored after node deletion
Root Cause Analysis
Immediate Cause
The newly provisioned ScopeDB nodes were attached to an incorrect security group that blocked inbound access to the required NodePort. Here's what happened:
- Security Group Misconfiguration: New nodes were attached to a security group that did not allow inbound traffic on the required NodePort
- Traffic Misdirection: When data requests reached the CLB (Cloud Load Balancer), the QCloud ingress class forwarded them to the Dataway NodePort service (a sketch of this path follows this list)
- Traffic Black Hole: The NodePort service directed traffic to nodes that couldn't accept connections on the required port (the new ScopeDB nodes), causing requests to time out
- 504 Response: The timed-out requests surfaced to clients as 504 gateway timeout errors
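To make the path concrete, here is a minimal sketch of how a QCloud-class Ingress fronts a NodePort Service through a Tencent CLB. The resource names, namespace, hostname, ingress-class annotation, and ports are assumptions for illustration, not taken from the production manifests.

```yaml
# Hypothetical sketch only: QCloud ingress class exposing Dataway via a CLB.
# Names, namespace, hostname, and ports are assumed, not the production values.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dataway
  namespace: dataway                       # assumed namespace
  annotations:
    kubernetes.io/ingress.class: qcloud    # assumed annotation for the QCloud ingress class
spec:
  rules:
    - host: dataway.example.com            # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: dataway              # backed by a NodePort Service
                port:
                  number: 9528             # assumed Dataway service port
```

Because the backend is a NodePort Service, the CLB forwards traffic to a node port on the cluster nodes rather than to pod IPs directly, which is why node-level security groups matter on every node in the cluster.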
Contributing Factors
Several system behaviors compounded this issue:
QCloud CLB Health Check Bypass: The cloud load balancer's health checks bypassed security group restrictions, so health checks continued to pass even though the actual service port was blocked. This prevented early detection of the misconfiguration.
NodePort Behavior Misunderstanding: The team lacked a full understanding of how NodePort services work:
- The QCloud ingress class directly uses NodePort services as backends
- NodePort opens the specified port on every single cluster node
- The kube-proxy on each node handles traffic redirection after receiving it on the NodePort
Actual Traffic Flow:
Internet → CLB → Any Node:31792 → kube-proxy → Pod (on any node)

Even when traffic hits a node without the target pods, kube-proxy should forward it to the correct pods on other nodes, but this failed because the security group blocked the initial connection to the NodePort (see the Service sketch below).
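Below is a minimal sketch of the NodePort Service behind that flow; the names and ports are assumptions, with the node port taken from the flow above. With the default `externalTrafficPolicy: Cluster`, kube-proxy on every node, including the new ScopeDB nodes, listens on the node port and may forward connections to a Dataway pod on another node, so a blocked port on an otherwise unrelated node can black-hole requests.

```yaml
# Hypothetical sketch of the Dataway NodePort Service; names and ports are assumed.
apiVersion: v1
kind: Service
metadata:
  name: dataway
  namespace: dataway
spec:
  type: NodePort
  # Default policy: every node listens on the nodePort, and kube-proxy may
  # forward the connection to a pod running on any other node.
  externalTrafficPolicy: Cluster
  selector:
    app: dataway
  ports:
    - name: http
      port: 9528           # assumed service/container port
      targetPort: 9528
      nodePort: 31792      # the node port seen in the traffic flow above
```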
What Could Be Improved
Prevention
- Replace QCloud Ingress Controller: Migrate from the QCloud ingress class to the nginx ingress controller for more predictable behavior
- Optimize Traffic Routing: Set `externalTrafficPolicy: Local` and `local-svc-only-bind-node-with-pod: "true"` on the Service to prevent unnecessary cross-node redirection by NodePort services, which also avoids an extra network hop (as sketched below)
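A hedged sketch of what that change could look like on the same assumed Service. The short annotation key quoted above is shown here with a `service.cloud.tencent.com/` prefix, which is an assumption to verify against the Tencent TKE documentation.

```yaml
# Hypothetical sketch: keep NodePort traffic on nodes that actually run Dataway pods.
# The annotation prefix is an assumption; confirm the exact key in the TKE docs.
apiVersion: v1
kind: Service
metadata:
  name: dataway
  namespace: dataway
  annotations:
    service.cloud.tencent.com/local-svc-only-bind-node-with-pod: "true"
spec:
  type: NodePort
  externalTrafficPolicy: Local   # drop cross-node forwarding; only local pods receive traffic
  selector:
    app: dataway
  ports:
    - name: http
      port: 9528
      targetPort: 9528
      nodePort: 31792
```

With `externalTrafficPolicy: Local`, a node only accepts NodePort traffic for pods it hosts, and the annotation asks the CLB to register only those nodes as backends, so newly added nodes without Dataway pods never receive this traffic.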