Lessons learned from AWS outage – what can you do?

What was the AWS Outage all about

The outage stemmed from multiple services in the us-east-1 region experiencing increased error rates and latencies, which impacted many AWS services simultaneously.

In particular:
The root cause appears to involve the Amazon DynamoDB API endpoint and DNS resolution issues in the us-east-1 region.

The cascade effect: although the initial fault was in one internal subsystem, many other services and workloads suffered because of inter-dependencies and the central role of the us-east-1 region.

This is not unique: past AWS regional outages have also disrupted numerous systems around the world.

Key take-aways:
Even with a major cloud provider like AWS, you are not immune to downtime. This is especially true for a single-region deployment with no disaster recovery (DR). A “single region” failure can affect large swathes of services if you rely heavily on that region.

In regulated sectors (like healthcare) the risk is higher because service interruptions can lead to patient-safety incidents, regulatory and compliance breaches, and reputational damage.

What is so special about us-east-1 region

The outage was in the us-east-1 region but, as you know, many applications hosted in other regions also had partial or full outages. So, what is special about the us-east-1 region?

us-east-1 is the first AWS region and behaves much like a global region. Many AWS global services are anchored or managed from there. AWS services like IAM, CloudFormation, S3, Route 53 and CloudFront have internal dependencies whose metadata or management planes are based in us-east-1. Thus, an outage in us-east-1 can cause a partial outage in an application running in a different region.

What you can do to make systems resilient

You need to know how critical your application is, because a resilient design comes at a cost. You can build a multi-cloud or multi-region architecture, but neither is cheap. With multi-cloud, for example, you need to pay for AWS Direct Connect and Azure ExpressRoute separately, and that is just the start. You still need to deploy to multiple clouds, which is also costly. Development cost is high as well, because you need to build and deploy for each cloud.

So, what about multi-region? Multi-region is cheaper and needs far less additional development work. Many AWS services support multi-region natively once you enable it, for example Aurora Global Database and DynamoDB global tables. You can also create replica/standby capacity in secondary regions. However, as mentioned earlier, us-east-1 acts like a global region from which many services such as IAM are managed, so multi-region alone does not remove every dependency on it.

What do you need to do? – Service-dependency mapping

You need to create a service-dependency map. Identify the services that are critical for your business and set up a business continuity plan. For each service, map its dependencies (e.g., database, authentication service, file store, third-party APIs).

For each dependency, ask: “if this one fails, what happens?” and build mitigation (e.g., caching, queueing, offline degraded mode).
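A dependency map can start as a plain data structure that answers the “if this one fails, what happens?” question. The sketch below is only an illustration: the service names and the `DEPENDENCIES` map are made up, and a real map would come from your own architecture inventory.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it directly depends on.
DEPENDENCIES = {
    "patient-portal": ["auth-service", "records-db", "file-store"],
    "appointment-api": ["auth-service", "records-db", "sms-vendor-api"],
    "auth-service": ["identity-db"],
    "records-db": [],
    "identity-db": [],
    "file-store": [],
    "sms-vendor-api": [],
}


def impacted_by(failed, dependencies):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {name: set() for name in dependencies}
    for service, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted map.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return sorted(impacted)


if __name__ == "__main__":
    for dependency in ("identity-db", "sms-vendor-api"):
        print(dependency, "->", impacted_by(dependency, DEPENDENCIES))
```

In this made-up map, losing `identity-db` takes out `auth-service` and, through it, both customer-facing services, while losing `sms-vendor-api` only touches `appointment-api`. That difference is what tells you where mitigation such as caching or a degraded mode is worth the effort.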

DNS, routing and fail-over:

The AWS event shows how DNS resolution problems can cascade. Ensure you have resiliency in DNS (multiple providers, health checks). You can use Route 53 for your DNS; AWS offers a 100% availability SLA for this service.

Use health-checks and automated traffic shifting (via load balancers, Route 53 or equivalent) to redirect traffic away from failed zones/regions.
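The logic behind health-check-driven traffic shifting can be sketched in a few lines. This is not how Route 53 or any load balancer is implemented; it is a simplified model, and the `Endpoint` class, the endpoint names and the threshold of three consecutive failures are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str           # hypothetical label, e.g. a region
    priority: int       # lower number = preferred
    failures: int = 0   # consecutive failed health checks


FAILURE_THRESHOLD = 3  # assumed: unhealthy after 3 consecutive failures


def record_check(endpoint, healthy):
    """Update the consecutive-failure counter after one health check."""
    endpoint.failures = 0 if healthy else endpoint.failures + 1


def is_healthy(endpoint):
    return endpoint.failures < FAILURE_THRESHOLD


def choose_target(endpoints):
    """Return the preferred healthy endpoint, or None if all are unhealthy."""
    healthy = [e for e in endpoints if is_healthy(e)]
    return min(healthy, key=lambda e: e.priority, default=None)


if __name__ == "__main__":
    primary = Endpoint("primary-region", priority=1)
    secondary = Endpoint("secondary-region", priority=2)
    endpoints = [primary, secondary]

    # Three failed checks in a row push traffic to the secondary.
    for _ in range(3):
        record_check(primary, healthy=False)
    print(choose_target(endpoints).name)  # secondary-region

    # One successful check resets the counter and traffic fails back.
    record_check(primary, healthy=True)
    print(choose_target(endpoints).name)  # primary-region
```

Failing back after a single successful check keeps the sketch short. In practice you may prefer to require several consecutive successes before shifting traffic back, so a flapping region does not bounce users back and forth.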

Data replication and backups:

Use cross-region replication (database replicas, file storage replication) so you have standby data in a different region. You need to ensure your backups are recent, tested, and you can restore rapidly into another region if needed.

Testing & exercises:

Regularly simulate failure of a region/service (chaos testing) to ensure your fail-over works. Measure RTO (Recovery Time Objective) and RPO (Recovery Point Objective) for each critical business service.
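To make a drill measurable, record three timestamps and compare them against your targets. The sketch below assumes you can capture when the failure was injected, when data was last replicated to the secondary region, and when service was restored; the function name, the timestamps and the targets are all hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_drill(failure_at, last_replicated_at, restored_at,
                   rto_target, rpo_target):
    """Compare one failover drill against its RTO and RPO targets."""
    # How long the service was unavailable.
    recovery_time = restored_at - failure_at
    # Window of writes that never reached the secondary region.
    data_loss_window = failure_at - last_replicated_at
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto_target,
        "rpo_met": data_loss_window <= rpo_target,
    }


if __name__ == "__main__":
    # Hypothetical timestamps from a simulated region failure.
    result = evaluate_drill(
        failure_at=datetime(2025, 1, 15, 10, 0, 0),
        last_replicated_at=datetime(2025, 1, 15, 9, 58, 30),
        restored_at=datetime(2025, 1, 15, 10, 42, 0),
        rto_target=timedelta(minutes=30),
        rpo_target=timedelta(minutes=5),
    )
    print(result["recovery_time"], result["rto_met"])     # 0:42:00 False
    print(result["data_loss_window"], result["rpo_met"])  # 0:01:30 True
```

In this example the drill meets the RPO target but misses the RTO target by twelve minutes, which is exactly the kind of gap you want to find in an exercise rather than during a real outage.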

What can you NOT Control

Sure, you can go multi-cloud, multi-region and what not, but there are still certain things that you cannot control from a business perspective. For example, let’s say your application is multi-region. If you integrate with a vendor-hosted application, your resiliency depends on that vendor’s product as well. You cannot control whether they use multi-cloud, multi-region or even multi-AZ deployments.

You are more or less relying on what the vendor tells you or what is in the contract, and it’s never easy to calculate RTO when you rely on external services.
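One way to see why vendor dependencies make RTO hard to state is to treat your own RTO as only a lower bound. The sketch below assumes all components recover in parallel, so the slowest one dominates, and it treats an undisclosed vendor RTO as “unknown” rather than guessing. The function name and all figures are hypothetical.

```python
from datetime import timedelta


def effective_rto(own_rto, dependency_rtos):
    """Lower bound on end-to-end RTO, or None if any dependency's RTO is unknown.

    Assumes all components recover in parallel, so the slowest one dominates.
    """
    values = [own_rto, *dependency_rtos.values()]
    if any(value is None for value in values):
        return None  # cannot be calculated without the vendor's figure
    return max(values)


if __name__ == "__main__":
    own = timedelta(minutes=15)

    # Vendor commits to a 4-hour RTO in the contract (made-up figure).
    print(effective_rto(own, {"vendor-app": timedelta(hours=4)}))  # 4:00:00

    # Vendor does not disclose an RTO at all.
    print(effective_rto(own, {"vendor-app": None}))  # None
```

Even with a 15-minute failover of your own, the end-to-end figure in this example is whatever the vendor can deliver, and it is simply undefined when the vendor gives you no number.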

Why Multi-Region can be a NO-GO

I see many people post on LinkedIn that you need to make your application multi-region, but the reality is that it’s not always easy, for a number of reasons.

Cost: It’s not cheap – you need budget allocated for this.
Compliance: Critical businesses like healthcare and finance have strict data sovereignty policies. AWS does not have multiple regions in every country, so multi-region is not always an option.
Complexity: Multi-region is complex. Managing infrastructure in multiple regions and then keeping it in sync is complex. You can manage it with infrastructure as code, but it’s still complex.
Automatic Failover: Don’t forget that just having multi-region is not enough. You also need to fail over safely to the secondary region, which requires proper health checks and DNS failover.

Summary

The AWS outage demonstrates that even dominant cloud providers and leading regions can suffer cascading failures spanning many services.

For critical businesses like healthcare or finance, the consequences can be more serious than just “website down” — they can affect patient care, compliance, revenue, and reputation.

To mitigate: adopt multi-region/fail-over architectures, decouple dependencies, maintain backups, exercise fail-over, monitor actively, and have a strong incident response plan.

The goal is not to assume the system “will never fail” (because failure is inevitable) but to plan for when it does, so the impact is minimised.
