
In the early hours of the morning, a major AWS outage disrupted a wide range of cloud services that power much of the internet. Businesses across industries saw degraded performance, communication issues, and in some cases, complete downtime. Outages like this serve as a powerful reminder of just how interconnected today’s technology ecosystem is — and how critical it is to plan for failure, not just success.
At 4:00 AM, our team was alerted to the event. While many platforms were struggling, our systems remained fully operational. No downtime. No customer impact.
But that doesn’t mean we sat still — our Infrastructure Team immediately jumped into action to ensure full stability, verify system health, and stay ready for anything that might follow.
What Happened
The AWS outage affected several core components across multiple regions, taking down or degrading tools used by thousands of companies worldwide.
While our production systems weren’t directly impacted, several key third-party services we use for visibility and communication were affected, including:
Datadog: Our monitoring and observability tool went down, temporarily limiting visibility into system metrics and dashboards.
Alerting System: Our automated alerting pipeline experienced partial disruption.
Intercom: Our customer support platform had intermittent availability issues.
This combination meant that while our systems were running perfectly fine, our ability to see into them and communicate externally was temporarily limited — a challenging situation for any operations or infrastructure team.
Our Infrastructure Team’s Response
As soon as the first alerts came in, our Infrastructure Team mobilized. Even though production was stable, they treated it like an active incident, prioritizing situational awareness and redundancy.
Verification of Stability: The first step was confirming that no part of our production stack was degraded. With Datadog unavailable, the team relied on direct health checks, manual queries, and API-level testing to verify uptime (see the sketch after this list).
Alternate Communication Channels: With Intercom affected, internal coordination shifted to Slack and status updates, ensuring no delay in information sharing.
Temporary Monitoring Measures: The team quickly stood up backup dashboards and manual metric collection through redundant systems.
Customer Monitoring: Support engineers were ready to respond to any potential customer issues via email and alternate channels, though none were reported.
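To make that first step more concrete, here is a minimal sketch of the kind of direct health-check loop a team can fall back on when its primary observability platform is down. The endpoint names and URLs are hypothetical and purely illustrative; our actual checks and services differ.

```python
# Minimal sketch of a manual health-check loop used when the primary
# monitoring tool is unavailable. Endpoints below are hypothetical.
import time
import requests

ENDPOINTS = {
    "api": "https://api.example.com/healthz",
    "worker-queue": "https://internal.example.com/queue/health",
    "database-proxy": "https://internal.example.com/db/health",
}

def check_endpoints() -> dict:
    """Poll each endpoint directly and record its status."""
    results = {}
    for name, url in ENDPOINTS.items():
        try:
            resp = requests.get(url, timeout=5)
            results[name] = "ok" if resp.status_code == 200 else f"degraded ({resp.status_code})"
        except requests.RequestException as exc:
            results[name] = f"unreachable ({exc.__class__.__name__})"
    return results

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), check_endpoints())
        time.sleep(30)  # poll every 30 seconds while dashboards are dark
```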
By the time AWS confirmed the issue publicly, our internal observability was partially restored and redundant alerting systems were live — a testament to preparation and fast response.
Why This Matters
Cloud outages happen. Even the most reliable providers in the world occasionally experience downtime — and when they do, the ripple effects can reach far beyond the source. For many organizations, this means customers experience disruption, data delays, or communication blackouts.
But resilience isn’t just about uptime — it’s about how quickly you can adapt when parts of your ecosystem go down. That’s why our systems are designed with redundancy, isolation, and human readiness built in. Our Infrastructure Team’s training and processes are built around one principle: When something fails, it shouldn’t take us down with it.
How We Build for Resilience
Our uptime during the AWS outage wasn’t luck — it was the result of intentional design and discipline.
Some of the principles that made the difference include:
Multi-Region Architecture: Critical services are distributed across multiple AWS regions and other providers to prevent single-region dependency.
Dependency Isolation: Third-party tools (like monitoring or customer support systems) are separated from core production paths (see the sketch after this list).
Manual Fallback Procedures: Our teams regularly practice operating without automated tools — ensuring we can still make decisions when systems go dark.
Human Readiness: Automation helps, but people make the difference. Our Infrastructure Team trains specifically for “visibility loss” scenarios — like what happened during this outage.
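As an illustration of the dependency-isolation principle, here is a minimal sketch of wrapping a non-critical third-party call so that a vendor outage never blocks a production request. The client and its send method are hypothetical stand-ins, not a real SDK.

```python
# Minimal sketch of dependency isolation: failures in a non-critical
# third-party tool fail open instead of reaching the request path.
# The wrapped client is hypothetical, used only for illustration.
import logging

logger = logging.getLogger(__name__)

class IsolatedMetrics:
    """Wrap a non-critical vendor client so its outages fail open."""

    def __init__(self, client, timeout_seconds: float = 0.5):
        self._client = client          # hypothetical monitoring SDK
        self._timeout = timeout_seconds

    def emit(self, metric: str, value: float) -> None:
        try:
            # Short timeout so a hung vendor endpoint cannot slow
            # down production requests.
            self._client.send(metric, value, timeout=self._timeout)
        except Exception:
            # Log locally and move on; the core path is unaffected.
            logger.warning("metrics emit failed; continuing without it", exc_info=True)
```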
When AWS had a bad day, our systems didn’t flinch — and our team was ready to prove why.
What We Learned
Even though we experienced no downtime, incidents like this are reminders that reliability isn’t a static achievement — it’s an ongoing commitment.
We’re taking this opportunity to make our systems, tools, and processes even stronger.
Our takeaways:
Enhanced Monitoring Redundancy: Adding independent, secondary monitoring for visibility when primary tools go down.
Cross-Provider Alert Routing: Ensuring critical alerts can be delivered through multiple services simultaneously (see the sketch after this list).
Incident Simulation and Training: Continuing to run realistic drills for tool and service outages.
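For the alert-routing takeaway, here is a minimal sketch of fanning the same alert out to several independent channels, with an email path as a last resort. The webhook URLs and addresses are hypothetical; a real setup would use each provider's own SDK and retry policy.

```python
# Minimal sketch of cross-provider alert routing: the same alert is
# sent to every channel so one provider outage cannot silence it.
# All destinations below are hypothetical placeholders.
import smtplib
from email.message import EmailMessage
import requests

WEBHOOKS = [
    "https://hooks.provider-a.example.com/alerts",
    "https://hooks.provider-b.example.com/alerts",
]
FALLBACK_EMAIL = "oncall@example.com"

def route_alert(summary: str) -> None:
    """Fan an alert out to every webhook; fall back to email if all fail."""
    delivered = False
    for url in WEBHOOKS:
        try:
            requests.post(url, json={"text": summary}, timeout=5)
            delivered = True
        except requests.RequestException:
            continue  # keep trying the remaining providers

    if not delivered:
        # Last-resort email path, independent of the webhook providers.
        msg = EmailMessage()
        msg["Subject"] = f"[ALERT] {summary}"
        msg["To"] = FALLBACK_EMAIL
        msg["From"] = "alerts@example.com"
        msg.set_content(summary)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
```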
Every incident, even one that doesn't impact customers, is a chance to improve. Our team treats that follow-up work as a top priority and is already applying these lessons so we are better prepared next time.
Final Thoughts
The AWS outage showed once again how complex and interconnected the internet has become — and how vital it is to be ready when something breaks.
We’re proud that our systems remained fully operational, but even prouder of how our team responded: calmly, quickly, and collaboratively.
Because in moments like these, resilience isn't just about servers and code; it's about people who know how to handle uncertainty with confidence and care.