AWS incident affecting Pipedream

Incident Report for Pipedream

Postmortem

Today, AWS had an outage in its us-east-1 region that affected Pipedream for much of the day. Between 15:35 UTC on Dec 7 and 0:58 UTC on Dec 8, users could not access https://pipedream.com, and workflows ran intermittently.

It's easy to blame AWS for incidents like this, and the scale of this AWS outage is certainly unprecedented. But we can do a better job of building resiliency for these events into the core service, and we own the downtime. We plan to improve this as we start 2022.

What happened

Today at 7:35am PT (15:35 UTC), we received our first alarm indicating that HTTP services and the Pipedream UI were down. Visiting https://pipedream.com returned a 502 error. We rely on AWS to run the platform, and upon investigation, it became clear that this was an AWS outage in the us-east-1 region, where most of our infrastructure runs.

Since AWS login and authentication were affected by the outage, we were unable to access the AWS Console or API to make changes. During this time, workflows were running intermittently. At 14:58 PT (22:58 UTC), we regained full access to our production services. Many AWS services were still unavailable at this point, so we started to migrate workloads off of failing services (e.g. Fargate) to our core Kubernetes cluster.

By around 16:00 PT (0:00 UTC), we started enqueuing the majority of incoming events, and shortly thereafter, workflows and event sources started processing these events. At that point, we started work to recover the Pipedream API and UI.

At 16:58 PT (0:58 UTC), service was restored to the Pipedream UI. We continued to work through a backlog of queued events for workflows and event sources, and added capacity to accommodate the increased load. At 17:57 PT (1:57 UTC), the backlog of events had been processed and the service was fully operational.

How to troubleshoot the impact to your workflows

Workflows and event sources were running intermittently throughout the day. To review the impact to your specific resources, visit your workflow and event source logs to see the events that were successfully processed.

If services that trigger your workflows deliver events via webhook, they may retry events that failed earlier in the day. Some services (like GitHub and Stripe) provide interfaces that let you see these queued events and retry them manually.
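If you export delivery records from an upstream service, one way to narrow down which events to check is to filter them by the impact window described above. The following is a minimal sketch, not a Pipedream API: the record shape (a dict with an ISO 8601 `created_at` field) and the function names are hypothetical, and the window runs from the first alarm (15:35 UTC on Dec 7) to the time the backlog was fully processed (1:57 UTC on Dec 8).

```python
from datetime import datetime, timezone

# Impact window taken from this report: first alarm to backlog fully processed.
OUTAGE_START = datetime(2021, 12, 7, 15, 35, tzinfo=timezone.utc)
OUTAGE_END = datetime(2021, 12, 8, 1, 57, tzinfo=timezone.utc)


def in_outage_window(ts: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the impact window."""
    return OUTAGE_START <= ts <= OUTAGE_END


def events_to_review(events):
    """Return the records whose 'created_at' falls inside the impact window.

    'events' is a hypothetical list of dicts exported from an upstream
    service; each has an ISO 8601 'created_at' string.
    """
    flagged = []
    for event in events:
        ts = datetime.fromisoformat(event["created_at"])
        if ts.tzinfo is None:
            # Assumption: timestamps without an offset are already in UTC.
            ts = ts.replace(tzinfo=timezone.utc)
        if in_outage_window(ts):
            flagged.append(event)
    return flagged
```

Events flagged this way are only candidates for review. Compare them against your workflow and event source logs before retrying anything, since many events were processed successfully during the window or were retried automatically by the sending service.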

If you have any questions at all or observe any lingering issues from this incident, please let us know.

Posted Dec 08, 2021 - 05:09 UTC

Resolved

This incident should be resolved.

Posted Dec 08, 2021 - 02:02 UTC

Monitoring

Access to https://pipedream.com has been restored, and workflows and the REST API are operational again. We're working through a backlog of events for event sources, and will update everyone when service has been fully restored.

Posted Dec 08, 2021 - 01:28 UTC

Identified

Some workflows have resumed processing as we bring up services. Access to https://pipedream.com is still down, but we're working to recover all services. We'll post another update when everything is online.

Posted Dec 08, 2021 - 00:45 UTC

Investigating

AWS is having a major outage in us-east-1, which is affecting Pipedream. We're investigating.

Posted Dec 07, 2021 - 16:46 UTC

This incident affected: Frontend (Pipedream), Backend (Event Sources - HTTP, Event Sources - Timer), and Public APIs (REST API).