SYLEN

Infrastructure | Source: blog.railway.com | May 20, 2026

Google Cloud Suspension Triggers Cascading Multi-Cloud Outage at Railway

A mistaken automated suspension of Railway's GCP production account caused an eight-hour platform-wide outage on May 19, 2026. The service disruption cascaded to AWS and bare-metal environments due to an edge routing dependency on a GCP-hosted control plane API.

The Suspension and Immediate Control Plane Collapse

On May 19, 2026, at 22:20 UTC, Google Cloud Platform executed an automated, system-wide suspension that incorrectly disabled Railway's production account. Because this was a platform-wide automated sweep by Google affecting multiple accounts, no proactive notification was issued to Railway. The account suspension immediately terminated Railway’s GCP-hosted infrastructure, which supports its management dashboard, API, core control plane, and auxiliary burst-compute workloads. Users attempting to access the platform were met with 503 errors, specifically "no healthy upstream" and "unconditional drop overload" messages.

Route Cache Expiration and the Cascading Multi-Cloud Outage

Railway's architecture has physical network redundancy, with high-availability fiber interconnects between AWS, GCP, and Railway Metal, but it also had a critical logical dependency. The edge proxies, which handle ingress routing across all environments, rely on a control plane API hosted on GCP to update and populate their routing tables.

When the GCP instances were terminated, the edge proxies initially continued to serve traffic to active AWS and Metal workloads from locally cached routes. However, at 22:35 UTC, approximately 15 minutes after the initial suspension, these route caches began to expire. Without a reachable GCP-hosted control plane to refresh the routing tables, the edge proxies could no longer resolve routes to instances in AWS and Metal that were still healthy. This turned a localized cloud provider outage into a global, platform-wide cascade, with requests returning 404 errors across all regions.
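The cache-expiry failure described above can be illustrated with a minimal sketch. It is not Railway's implementation: the `RouteCache` class, the `fetch` callback, the `serve_stale` flag, and the 15-minute TTL are all assumptions chosen to mirror the timeline in the text. The sketch contrasts a cache that drops routes once the TTL passes with one that keeps serving the last known route while the control plane is unreachable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

TTL_SECONDS = 15 * 60  # assumption: mirrors the ~15 minute gap in the timeline


class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) control plane lookup when it cannot be reached."""


@dataclass
class RouteCache:
    fetch: Callable[[str], str]          # hypothetical control plane lookup
    ttl: float = TTL_SECONDS
    serve_stale: bool = False            # hypothetical fallback switch
    _entries: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def resolve(self, host: str, now: float) -> Optional[str]:
        entry = self._entries.get(host)
        # Fresh entry: answer from the local cache without any lookup.
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]
        try:
            upstream = self.fetch(host)
        except ControlPlaneUnavailable:
            # Expired entry and no control plane: either reuse the last
            # known route or give up (the proxy would answer with an error).
            if self.serve_stale and entry is not None:
                return entry[0]
            return None
        self._entries[host] = (upstream, now)
        return upstream


def simulate(serve_stale: bool) -> List[Optional[str]]:
    state = {"up": True}

    def fetch(host: str) -> str:
        if not state["up"]:
            raise ControlPlaneUnavailable(host)
        return "metal-host-1"  # hypothetical healthy upstream

    cache = RouteCache(fetch=fetch, serve_stale=serve_stale)
    results = [cache.resolve("app.example", now=0)]             # warm the cache
    state["up"] = False                                         # control plane lost
    results.append(cache.resolve("app.example", now=14 * 60))   # still within TTL
    results.append(cache.resolve("app.example", now=16 * 60))   # TTL has expired
    return results
```

With `serve_stale=False` the third lookup returns `None` even though the upstream itself is healthy, which is the shape of the cascade described above. With `serve_stale=True` the last known route keeps being used. Serving stale routes has its own risks (for example, routing to instances that have since moved), so this is one possible trade-off, not a recommendation drawn from the source.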

Serialized Recovery and Downstream Bottlenecks

Although Railway's GCP account access was restored at 22:29 UTC, recovering the virtualized infrastructure was slow and had to proceed in serial stages. Re-enabling the account did not automatically restart resources: compute instances remained stopped and persistent disks remained inaccessible.

  • Persistent disks were restored to a ready state by 23:54 UTC.
  • Core GCP networking and edge routing remained down until approximately 01:30 UTC on May 20, representing a significant recovery bottleneck.
  • Orchestration and build infrastructure were brought online at 01:57 UTC, with deployments temporarily paused to prevent race conditions or resource exhaustion from queued builds.
  • Individual compute hosts were incrementally recovered starting at 02:04 UTC.
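To make the length of each recovery stage easier to see, the milestones above can be converted into elapsed time since the 22:20 UTC suspension. The timestamps come from the text (the 01:30 figure is approximate); the variable and function names are illustrative only.

```python
from datetime import datetime, timedelta

SUSPENSION = datetime(2026, 5, 19, 22, 20)  # UTC, from the timeline above

MILESTONES = [
    ("persistent disks ready", datetime(2026, 5, 19, 23, 54)),
    ("core networking and edge routing restored (approx.)", datetime(2026, 5, 20, 1, 30)),
    ("orchestration and build infrastructure online", datetime(2026, 5, 20, 1, 57)),
    ("compute host recovery begins", datetime(2026, 5, 20, 2, 4)),
]


def format_elapsed(delta: timedelta) -> str:
    """Render a timedelta as hours and minutes, e.g. 1h34m."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


def elapsed_since_suspension():
    """Return (label, formatted elapsed time) for each milestone."""
    return [(label, format_elapsed(at - SUSPENSION)) for label, at in MILESTONES]


if __name__ == "__main__":
    for label, elapsed in elapsed_since_suspension():
        print(f"+{elapsed}  {label}")
```

The four milestones fall at +1h34m, +3h10m, +3h37m, and +3h44m, so most of the delay before compute recovery could begin sits between disk readiness and the return of core networking.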

During the infrastructure recovery, clearing the route caches prompted a massive burst of retried webhook and login requests. This traffic spike triggered aggressive rate-limiting from GitHub's OAuth and webhook integrations, blocking user logins and builds until the backlog drained. Additionally, the outage caused user terms-of-service acceptance records to reset.
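A retry burst like the one described above is commonly softened with exponential backoff plus random jitter, so that clients do not all retry at the same moment. The source does not say whether Railway uses or plans to use this technique, so the following is only a generic sketch. The `backoff_delay` and `retry_schedule` names and the `base` and `cap` values are assumptions.

```python
import random
from typing import List


def backoff_delay(attempt: int, rng: random.Random,
                  base: float = 1.0, cap: float = 300.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    The upper bound doubles with each attempt up to `cap`, and the actual
    delay is drawn uniformly between 0 and that bound ("full jitter").
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_schedule(attempts: int, seed: int) -> List[float]:
    """Delays for a sequence of retries, seeded so the result is reproducible."""
    rng = random.Random(seed)
    return [backoff_delay(i, rng) for i in range(attempts)]


if __name__ == "__main__":
    for i, delay in enumerate(retry_schedule(8, seed=1)):
        print(f"retry {i}: wait {delay:.1f}s")
```

Because each client draws its own random delay, a backlog of queued webhook deliveries or login attempts is spread over time instead of arriving at the rate-limited service in a single spike.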

Architectural Redesign and State Decentralization

To eliminate GCP as a single point of failure for the network, Railway is re-architecting both its data and control planes. The existing network topology operates as a physical mesh ring, but is logically centralized around the GCP-hosted control plane API for route discovery. The planned changes are:

  • Removing the control plane API dependency from the edge routing path to establish a true mesh topology where routing updates can bypass any single cloud provider.
  • Distributing high-availability database shards across AWS and Railway Metal to ensure database quorum is maintained even if an entire cloud provider's infrastructure disappears.
  • Deprecating GCP from the data plane's active hot path, reserving its compute resources strictly for secondary or failover capacity.

This structural overhaul aims to decouple core user-facing services and routing discovery from any single external infrastructure vendor.
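The quorum goal in the list above can be checked mechanically for a given replica placement. The source does not state replica counts, the quorum rule, or how providers map to failure domains, so this sketch assumes a simple majority quorum and uses hypothetical domain names.

```python
from typing import Dict


def survives_single_domain_loss(placement: Dict[str, int]) -> Dict[str, bool]:
    """For each failure domain, report whether a majority of replicas
    remains if that entire domain disappears.

    `placement` maps a failure domain name to its replica count.
    Majority quorum is an assumption, not something the source specifies.
    """
    total = sum(placement.values())
    majority = total // 2 + 1
    return {domain: (total - count) >= majority
            for domain, count in placement.items()}


# Hypothetical placements, for illustration only.
balanced = {"domain-a": 2, "domain-b": 2, "domain-c": 1}
concentrated = {"domain-a": 3, "domain-b": 1, "domain-c": 1}

if __name__ == "__main__":
    print(survives_single_domain_loss(balanced))
    print(survives_single_domain_loss(concentrated))
```

Under these assumptions the balanced placement keeps quorum after losing any one domain, while the concentrated placement loses quorum if `domain-a` fails. In this model, no single domain may hold half or more of the replicas.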

Read the original article at blog.railway.com.