What caused the AWS – 20 October 2025 outage?

The Big Picture

On Monday, 20 October 2025, AWS experienced a significant outage starting in its US-EAST-1 (Northern Virginia) region, which cascaded across dozens of dependent services worldwide.
Many major online platforms, from gaming to banking to IoT, reported disruptions (Reuters, Tom's Guide).

Below is a breakdown of what AWS reported, how the outage unfolded, and key lessons for organisations.

How It Unfolded

Here’s a summary based on AWS’s published status updates and external reporting:

  • At 12:11 a.m. PDT, AWS first noted “increased error rates and latencies” in US-EAST-1 (Tom’s Guide).
  • By 2:01 a.m. PDT, AWS said the issue appeared linked to DNS resolution of the DynamoDB API endpoint in US-EAST-1.
  • At 3:35 a.m. PDT, AWS stated the “underlying DNS issue has been fully mitigated, and most AWS service operations are succeeding normally now.”
  • However, elevated error rates persisted for launching new EC2 instances, and services that depend on EC2 (such as RDS, ECS, and Glue) remained impaired. AWS recommended not targeting a specific Availability Zone so that EC2 could choose where to place new capacity.
  • By 8:43 a.m. PDT, AWS said it had “narrowed down the source of the network connectivity issues … the root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers.”
  • By 9:13 a.m. PDT, AWS reported additional mitigation steps and improving API recovery, though throttling for new EC2 instance launches continued.

In short: what began as a DNS resolution problem for DynamoDB escalated into a network load-balancer monitoring subsystem failure inside EC2’s internal network in US-EAST-1. This led to connectivity problems, latency spikes, and throttled instance launches across dependent services.
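
Because the first visible symptom was failed DNS resolution of a regional API endpoint, it helps to be able to tell "the name will not resolve" apart from "the service is returning errors" in your own checks. The following is a minimal Python sketch of such a probe, assuming you only care about the public regional endpoints named above; the endpoint list and output format are illustrative, not anything AWS published.

```python
import socket
import time

# Public regional endpoints; during the incident, failed resolution of the
# DynamoDB endpoint in us-east-1 was the first visible symptom.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "ec2.us-east-1.amazonaws.com",
]

def probe_dns(hostname: str) -> None:
    """Try to resolve a hostname and report the addresses or the failure."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        elapsed_ms = (time.monotonic() - start) * 1000
        addresses = sorted({info[4][0] for info in infos})
        print(f"{hostname}: {addresses} ({elapsed_ms:.0f} ms)")
    except socket.gaierror as exc:
        # A failure here mirrors what affected clients saw: the backend may be
        # healthy, but its name cannot be resolved, so nothing can reach it.
        print(f"{hostname}: DNS resolution failed ({exc})")

if __name__ == "__main__":
    for endpoint in ENDPOINTS:
        probe_dns(endpoint)
```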

Why It Caused Such Widespread Impact

Several factors made this outage far-reaching:

  • US-EAST-1 is a major AWS region: Many global workloads, IAM updates, and global DynamoDB tables rely on it (The Guardian).
  • DNS as the trigger: DNS failures can cripple service connectivity even when backend systems remain functional.
  • Load-balancer health subsystem failure: This internal AWS component monitors Network Load Balancers (NLBs); its failure disrupted routing, health checks, and instance provisioning.
  • EC2 throttling: AWS intentionally slowed new EC2 instance launches to stabilise recovery, which in turn impacted RDS, ECS, Glue, and other dependent services (a retry-configuration sketch follows this list).
  • Processing backlogs: Services like Lambda, EventBridge, and CloudTrail suffered delivery delays as AWS cleared queued events.
  • Global dependency: Because so many companies rely on AWS infrastructure, even regional instability caused worldwide effects (The Guardian).
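
A side effect of the throttling described above is that clients retrying failed calls in a tight loop only add pressure. One way to handle this from the client side, sketched below with boto3's built-in retry configuration, is to let botocore back off automatically; the retry mode and attempt count shown are illustrative choices, not values AWS prescribed for this incident.

```python
import boto3
from botocore.config import Config

# "adaptive" retry mode layers client-side rate limiting on top of exponential
# backoff, which is gentler on an API that is actively throttling requests.
retry_config = Config(
    region_name="us-east-1",
    retries={"max_attempts": 10, "mode": "adaptive"},  # illustrative values
)

ec2 = boto3.client("ec2", config=retry_config)

# Any throttling or transient errors on this call are retried with backoff by
# botocore instead of surfacing immediately to the application.
response = ec2.describe_instances(MaxResults=5)
for reservation in response.get("Reservations", []):
    for instance in reservation.get("Instances", []):
        print(instance["InstanceId"], instance["State"]["Name"])
```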

Impacted AWS Services

At the peak, 91 AWS services were listed as “Impacted,” including EC2, S3, Lambda, RDS, DynamoDB, CloudWatch, and Elastic Load Balancing.
Another 27 services were marked as “Resolved.”
Key failures included:

  • EC2 launch errors
  • DynamoDB endpoint DNS issues
  • Lambda polling delays for SQS (see the backlog-monitoring sketch after this list)
  • API connectivity issues
  • Network load-balancer subsystem failures
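
Of the failures above, the Lambda-to-SQS polling delays are the kind of backlog you can observe from your own account. Below is a small sketch that reads the age of the oldest message on a queue from CloudWatch; the queue name is a placeholder, while the namespace and metric are the standard SQS ones.

```python
from datetime import datetime, timedelta, timezone

import boto3

QUEUE_NAME = "orders-queue"  # placeholder: substitute your own queue name

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# ApproximateAgeOfOldestMessage climbing while producers are healthy is a
# strong signal that consumers (e.g. Lambda event source mappings) are stalled.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(f"{point['Timestamp']:%H:%M} UTC  oldest message age: {point['Maximum']:.0f}s")
```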

Real-World Impact

According to multiple outlets, the outage affected a wide range of services and platforms:

  • Consumer apps like Snapchat and Signal
  • Gaming (Fortnite, Roblox)
  • IoT and smart-home services (Alexa, Ring)
  • Banking and fintech platforms (Venmo, Robinhood)

(The Verge, TechRadar, Business Insider, Times of India)

Lessons for Businesses

  • Avoid single-region dependency: Critical workloads should be distributed across multiple AWS regions or even across multiple cloud providers.
  • Design for failure: Don’t assume infrastructure layers like DNS or load-balancer health monitoring will always work.
  • Enable cross-AZ scaling: AWS explicitly advised using multiple Availability Zones to allow flexibility during mitigation (see the launch sketch after this list).
  • Prepare for delayed recovery: Even after a root issue is fixed, backlogs and throttling can cause extended degradation.
  • Enhance observability: Monitor for provider-level latency and error anomalies, not just your own app metrics.
  • Communicate transparently: keeping users informed during an outage maintains trust.
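
To make the cross-AZ point concrete, the sketch below requests an EC2 instance without pinning a Placement or subnet, leaving the zone choice to EC2, in line with the guidance AWS gave during mitigation. The AMI ID and instance type are placeholders, and a default VPC is assumed.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# By omitting Placement (and SubnetId), EC2 is free to place the instance in
# any Availability Zone in the default VPC that currently has capacity, which
# is in the spirit of the guidance AWS gave while some launches were impaired.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
    InstanceType="t3.micro",          # placeholder instance type
    MinCount=1,
    MaxCount=1,
)

for instance in response["Instances"]:
    print(instance["InstanceId"], instance["Placement"]["AvailabilityZone"])
```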

Timeline of Key Events (UTC Approximate)

Time (UTC)   Event
07:11        AWS reports increased error rates and latencies
09:01        DNS resolution issue identified for the DynamoDB endpoint
10:35        DNS issue mitigated; EC2 instance launch issues persist
15:43        Root cause narrowed to the NLB health-monitoring subsystem
16:13        Additional mitigations; connectivity recovery progressing

I've archived a snapshot of the AWS Health page as of 20 October 2025 at 17:00 UTC: https://web.archive.org/web/20251020082509/https://health.aws.amazon.com/health/status