SYLEN

© 2026 Systems Leadership and Engineering Network. sylen.org.

Reliability Engineering | Source: status.railway.com | May 20, 2026

Upstream Account Suspension: Inside Railway's Nine-Hour GCP Outage and Mitigation Path

A sudden Google Cloud Platform account suspension disabled Railway's control plane, API, and core GCP-hosted routing infrastructure, causing widespread "no healthy upstream" and "unconditional drop overload" errors. Railway restored service by recovering its GCP compute nodes, routing around ongoing GCP-side networking failures, and shifting traffic to its independent bare metal infrastructure.

Outage Genesis and Affected GCP Components

The incident began on May 19, 2026, at 22:29 UTC with immediate failures at the edge. Requests returned "no healthy upstream" and "unconditional drop overload" errors, while the dashboard became unreachable and authentication requests timed out. By 23:37 UTC, the engineering team had traced the root cause to a suspension of Railway's account by Google Cloud Platform (GCP). The suspension immediately cut off access to critical GCP-hosted infrastructure, including DNS Traffic routing, GCP Build Machines, the GCP Image Registry, and TCP Proxies. The impact spanned four GCP regions: US East (Virginia), US West (Oregon), EU West (Amsterdam), and Southeast Asia (Singapore).

Control Plane Recovery and GCP Network Blocks

The account suspension disabled the core control plane governing the Railway API, dashboard, and internal network routing. The platform team established contact with Google Cloud support and regained basic account access within an hour of identifying the cause, but the control plane took longer to return to a working state. At 01:34 UTC on May 20, Railway recovered its compute instances on GCP. However, underlying networking failures on GCP's side prevented these instances from communicating externally, so workloads still could not start.
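The gap described above, between an instance that is running and an instance that can actually serve traffic, can be illustrated with a minimal sketch. Everything here is hypothetical: the `Backend` type, its two flags, and the backend names are illustrative assumptions, not Railway's actual health model or routing code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str                # hypothetical identifier, not a real Railway name
    instance_running: bool   # the compute node has booted
    egress_ok: bool          # the node can reach the outside network


def is_serviceable(backend: Backend) -> bool:
    """A backend counts as healthy only if it is both up and reachable."""
    return backend.instance_running and backend.egress_ok


def pick_upstream(backends: list[Backend]) -> Backend:
    """Return the first serviceable backend, in priority order."""
    for backend in backends:
        if is_serviceable(backend):
            return backend
    # Mirrors the error text quoted in the incident report.
    raise RuntimeError("no healthy upstream")


# Assumed state after 01:34 UTC: GCP compute is back, but cannot talk outward.
backends = [
    Backend("gcp-us-east", instance_running=True, egress_ok=False),
    Backend("metal-us-east", instance_running=True, egress_ok=True),
]
chosen = pick_upstream(backends)
```

Under these assumptions the recovered GCP node is skipped despite being up, because a health signal based on instance state alone would have marked it usable while it could not serve anything.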

Bare Metal Fallback and Infrastructure Throttling

To break the impasse, the infrastructure team routed services away from the blocked GCP resources to Railway's independent bare metal infrastructure ("Railway metal"). This separate footprint consists of Metal TCP Proxies, dedicated Build Machines, and Image Registries, supported by localized Metrics and Logs systems across US East (Virginia), US West (California), EU West (Amsterdam), and Southeast Asia (Singapore). To prevent cascading failures and resource exhaustion on the bare metal build clusters during the migration, Railway temporarily paused all non-enterprise builds at 01:41 UTC, reserving capacity for enterprise deployments, which were not paused.
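A tier-based build pause of this kind can be sketched as a simple admission gate. This is an assumption-laden illustration: the `BuildGate` class, the tier strings, and the choice to defer paused builds rather than reject them are all hypothetical, since the report does not describe how Railway implemented the pause.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BuildGate:
    """Admits builds by tier while a pause flag is set (hypothetical model)."""

    pause_non_enterprise: bool = False
    deferred: deque = field(default_factory=deque)

    def submit(self, build_id: str, tier: str) -> str:
        # While paused, only enterprise builds reach the build cluster.
        if self.pause_non_enterprise and tier != "enterprise":
            self.deferred.append(build_id)
            return "deferred"
        return "admitted"

    def resume(self) -> list[str]:
        """Lift the pause and release deferred builds in arrival order."""
        self.pause_non_enterprise = False
        released = list(self.deferred)
        self.deferred.clear()
        return released


gate = BuildGate(pause_non_enterprise=True)
first = gate.submit("build-1", "hobby")        # held back during the pause
second = gate.submit("build-2", "enterprise")  # passes straight through
released = gate.resume()                       # pause lifted, backlog released
```

Deferring instead of dropping keeps the backlog visible, so the load that will arrive once the pause lifts can be estimated before reopening the pipeline.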

Recovery Phase and Post-Incident Deployment State

By 03:05 UTC, Railway reported progressive recovery across its metal workloads. The pause on non-enterprise deployments stayed in place to protect platform stability. At 04:58 UTC, deployment pipelines reopened as GCP network paths stabilized, though GCP-hosted workloads continued to experience intermittent latency and connectivity issues. Railway confirmed full service recovery at 06:14 UTC. Automated systems began redeploying workloads detected as unhealthy, and operators were advised to redeploy any remaining stalled services manually via the CLI or dashboard. The incident was declared fully resolved at 07:57 UTC.
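The timestamps reported across the incident can be turned into elapsed times with a short script. The times come from the article; the one assumption is that every timestamp after 23:37 UTC falls on May 20, which the article states explicitly only for 01:34 UTC. The helper names `at` and `elapsed` are illustrative.

```python
from datetime import datetime, timezone


def at(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in May 2026 from a day number and HH:MM."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


TIMELINE = [
    (at(19, "22:29"), "first errors observed"),
    (at(19, "23:37"), "root cause identified as GCP account suspension"),
    (at(20, "01:34"), "GCP compute instances recovered"),
    (at(20, "01:41"), "non-enterprise builds paused"),
    (at(20, "03:05"), "progressive recovery on metal workloads"),
    (at(20, "04:58"), "deployment pipelines reopened"),
    (at(20, "06:14"), "full service recovery confirmed"),
    (at(20, "07:57"), "incident declared resolved"),
]


def elapsed(start: datetime, end: datetime) -> str:
    """Format the gap between two timestamps as e.g. '3h05m'."""
    minutes = int((end - start).total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


start = TIMELINE[0][0]
for when, label in TIMELINE[1:]:
    print(f"{when:%b %d %H:%M} UTC  +{elapsed(start, when)}  {label}")
```

Run as written, this puts identification of the cause at 1h08m after the first errors, full service recovery at 7h45m, and formal resolution at 9h28m, which matches the roughly nine-hour figure in the headline.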

Read the original article at status.railway.com.