504 Gateway Timeout

The Mystery Connection

New nodes were provisioned for the ScopeDB nodegroup. At almost exactly the same time, requests passing through Dataway began failing with numerous 504 gateway timeout errors. What could possibly connect these two seemingly unrelated events?

Incident Summary

  • Date: 2025-07-30
  • Cluster/Environment: id1
  • Status: Resolved
  • Severity: P1
  • Duration: 3h
  • Detection Delay: 35m

Impact

User Impact

  • Users experienced slow and unstable connections to id1-console.truewatch.com
  • A portion of customer datakit requests failed with 504 gateway timeout errors (not 503 as initially logged)

Timeline

Start Phase

  • 10:45 - Provisioned new nodes in ScopeDB-related nodegroups for ScopeDB deployment

Detection Phase (35-minute delay)

  • 11:20 - Multiple 504 gateway timeout errors appeared on Cloudflare dashboard; synthetic test alerts triggered in Lark group

Investigation Phase (2h 30m)

  • 11:30 - Examined Dataway and Kodo logs — no application errors found, but observed increased timeout counts and client connections in Tencent CLB metrics
  • 12:18 - Attempted rollback of ScopeDB Helm release—no improvement observed

Resolution Phase

  • 13:50 - Deleted all newly created nodes from ScopeDB nodegroup—immediate resolution

End Phase

  • 13:50 - Service fully restored after node deletion

Root Cause Analysis

Immediate Cause

The newly provisioned ScopeDB nodes were configured with an incorrect security group that blocked the required NodePort access. Here's what happened:

  1. Security Group Misconfiguration: New nodes were attached to a security group that did not allow inbound traffic on the required NodePort
  2. Traffic Misdirection: When data requests reached the CLB (Cloud Load Balancer), the QCloud ingress class redirected them to the Dataway NodePort service (see the sketch after this list)
  3. Traffic Black Hole: The NodePort service directed traffic to nodes that could not accept connections on the required port (the new ScopeDB nodes), causing requests to time out
  4. 504 Response: The timed-out requests surfaced to clients as 504 gateway timeout errors
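
To make the routing chain above concrete, here is a minimal sketch of an Ingress handled by the QCloud ingress controller fronting a NodePort Service. The resource name, host, and port are hypothetical, and the kubernetes.io/ingress.class: qcloud annotation is an assumption about how the ingress class is selected:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: dataway                             # hypothetical name
      annotations:
        kubernetes.io/ingress.class: qcloud     # assumed spelling of the QCloud ingress class
    spec:
      rules:
        - host: dataway.example.com             # placeholder host
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: dataway               # NodePort Service used as the CLB backend
                    port:
                      number: 9528              # hypothetical service port

Because the controller wires the CLB directly to every node's NodePort for the backing Service, the security group attached to each node sits squarely in the data path.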

Contributing Factors

Several system behaviors compounded this issue:

QCloud CLB Health Check Bypass: The cloud load balancer's health checks bypassed security group restrictions, so health checks continued to pass even though the actual service port was blocked. This prevented early detection of the misconfiguration.

NodePort Behavior Misunderstanding: The team lacked a full understanding of how NodePort services work (a minimal Service sketch follows the list below):

  • The QCloud ingress class directly uses NodePort services as backends
  • NodePort opens the specified port on every single cluster node
  • The kube-proxy instance on each node handles traffic redirection after receiving it on the NodePort
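
As a minimal illustration (not the actual manifest), a Dataway Service of type NodePort might look like the sketch below; the name, selector, and in-cluster port are hypothetical, and 31792 is the node port quoted in the traffic flow that follows:

    apiVersion: v1
    kind: Service
    metadata:
      name: dataway                # hypothetical name
    spec:
      type: NodePort
      selector:
        app: dataway               # hypothetical pod selector
      ports:
        - port: 9528               # hypothetical in-cluster port
          targetPort: 9528
          nodePort: 31792          # opened by kube-proxy on every node in the cluster

Because the node port is opened cluster-wide, adding any node to the cluster silently adds a new entry point for Dataway traffic, whether or not that node is meant to serve it.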

Actual Traffic Flow:

Internet → CLB → Any Node:31792 → kube-proxy → Pod (any node)

Even when traffic hits a node without the target pods, kube-proxy should forward it to the correct pods on other nodes; here that never happened because the security group blocked the initial connection to the NodePort.

What Could Be Improved

Prevention

  • Replace QCloud Ingress Controller: Migrate from the QCloud ingress class to the nginx ingress controller for more predictable backend behavior
  • Optimize Traffic Routing: Set externalTrafficPolicy: Local together with local-svc-only-bind-node-with-pod: "true" to prevent unnecessary cross-node traffic redirection by NodePort services, which also removes an extra network hop (see the sketch below)
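
As a sketch of what that change could look like (names and ports hypothetical; the full annotation key is assumed to use the service.cloud.tencent.com/ prefix, since only the short key appears in the notes above):

    apiVersion: v1
    kind: Service
    metadata:
      name: dataway                # hypothetical name
      annotations:
        # Assumed full key; asks the CLB to bind only nodes that actually run Dataway pods
        service.cloud.tencent.com/local-svc-only-bind-node-with-pod: "true"
    spec:
      type: NodePort
      externalTrafficPolicy: Local # traffic landing on a node without local pods is dropped, not forwarded
      selector:
        app: dataway               # hypothetical pod selector
      ports:
        - port: 9528
          targetPort: 9528
          nodePort: 31792

Under the Local policy, nodes without Dataway pods stop answering on the node port, so a misconfigured or irrelevant node should fail the load balancer's health checks instead of silently black-holing requests.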