SYLEN
Cloud Reliability · Source: status.cloud.google.com · June 14, 2025

Google Says a Non-Flagged Service Control Failure Triggered the June 2025 Outage

Google's incident report says a new quota-policy path in Service Control crashed on blank fields after globally replicated policy data exercised an untested code path. Recovery then slowed in larger regions because restarting tasks created a herd effect against shared dependencies, without randomized exponential backoff to spread the load.

The official failure chain

Google's report says the outage originated inside Service Control, the regional service that Google's API management and serving control planes use to apply authorization, policy, and quota checks. A new feature adding quota-policy checks had already been rolled out region by region, but the failing code path was never exercised during rollout because it only triggered under a specific policy condition.

That missing trigger condition is the first important detail. The rollout existed, but the relevant behavior was not being tested in practice.
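The containment pattern at issue can be sketched as a flag-guarded check. All names here are invented for illustration; this is not Google's code, only a minimal sketch of gating a new code path so it can be exercised and disabled independently of the binary rollout.

```python
def extra_policy_check(policy: dict) -> bool:
    # Stand-in for the new quota-policy logic; "limit" is an assumed
    # field name, not the real schema.
    return policy.get("limit", 0) > 0

def apply_quota_policy(policy: dict, new_checks_enabled: bool = False) -> bool:
    allowed = True  # existing checks would run here
    if new_checks_enabled:
        # The new path only runs where the flag is on, so a crash is
        # contained to a canary region instead of every region at once.
        allowed = allowed and extra_policy_check(policy)
    return allowed
```

The point of the guard is operational, not logical: a per-region flag gives operators a switch that works even when the new code itself is broken.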

On June 12, 2025, a policy change with unintended blank fields was inserted into the regional Spanner tables that hold policy data. Because quota metadata is globally replicated, the problematic rows propagated across regions within seconds. Once Service Control read the blank fields, the new code path hit a null pointer and the binaries entered a crash loop.
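The crash mechanism can be illustrated with a minimal sketch. The field names are invented, and Python's `TypeError` stands in for the null-pointer dereference described in the report:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuotaPolicy:
    # Hypothetical schema: either field may arrive blank (None).
    project: Optional[str]
    limit: Optional[int]

def check_quota(policy: QuotaPolicy, cost: int) -> bool:
    # Unguarded comparison: raises TypeError when limit is None,
    # the analogue of the crash described in the report.
    return cost <= policy.limit

def check_quota_guarded(policy: QuotaPolicy, cost: int) -> bool:
    # Defensive variant: malformed policy data fails open (allow)
    # instead of crash-looping the serving binary.
    if policy.limit is None:
        return True
    return cost <= policy.limit
```

Because the blank-field record replicated globally, every region's binaries read the same poison data, which is why the unguarded variant fails everywhere at once.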

Why recovery took longer than the root-cause diagnosis

Google says the root cause was identified within about ten minutes, and that the "red button" to disable the failing serving path was being rolled out within roughly twenty-five. The more interesting systems lesson is what happened next: in larger regions, restarting Service Control tasks overloaded the infrastructure they depended on. The incident report explicitly calls out missing randomized exponential backoff as part of the recovery drag.

That is the part infrastructure teams should pay attention to. A bug fix is only half the story if recovery behavior can amplify load against already stressed dependencies.
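The missing ingredient the report names is easy to state precisely. A common formulation is "full jitter": each retry waits a random time in [0, min(cap, base · 2^attempt)], so restarting tasks spread out instead of stampeding their dependencies together. A minimal sketch, with illustrative defaults:

```python
import random

def restart_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff for task restarts.

    attempt 0 waits up to base seconds, attempt 1 up to 2*base, and so
    on, capped at `cap`. The randomness is the point: without it, tasks
    that crashed together retry together.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Without the jitter, a fleet of crash-looped binaries synchronizes its retries and re-creates the herd effect on every cycle.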

What Hacker News focused on

HN commenters predictably fixated on the missing error handling and lack of feature-flag protection, with some incredulity that a failure mode this basic could make it into a production path at Google scale. Others pushed back that large systems can still fail in mundane ways and that the more useful question is how the surrounding rollout and recovery controls were designed.

The report's own forward path is telling: modularize the architecture so pieces can fail open, audit systems that consume globally replicated data, and stop manual policy pushes until the stack is remediated. For systems engineers, this is a solid example of how globally consistent control-plane metadata can turn a narrow coding mistake into a broad service event.
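The "audit systems that consume globally replicated data" item implies a concrete control: validate policy rows before they are allowed to replicate, so a malformed record never reaches every region at once. A hypothetical sketch, with an assumed schema:

```python
# Assumed required fields for illustration; not the real policy schema.
REQUIRED_FIELDS = ("project", "limit")

def validate_policy_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means safe to replicate."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = row.get(field)
        if value is None or value == "":
            problems.append(f"blank required field: {field}")
    return problems
```

A pre-replication gate like this turns a blank-field mistake into a rejected write in one region rather than a global outage.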

Read the original article at status.cloud.google.com.
