API outage on event summary endpoints

Incident Report for Pipedream

Postmortem

On September 21st, an incident affected Pipedream’s builder and event inspector from approximately 10:00 AM UTC to 2:00 PM UTC.

In short, users could not edit or deploy workflows from the builder, and the event inspector did not show the results of incoming events. Deployed workflows were not affected by the outage, and incoming events were processed normally.

We can do a better job of building resiliency for these incidents into the core service, and we own the downtime. We wanted to share what happened and what we’re doing to address it.

What happened

On September 21st at 7:00 AM PT (14:00 UTC), our team responded to user reports that the Pipedream dashboard wasn’t functioning properly.

Upon investigation, we found that our internal API had failed to connect to Redis, one of our core data stores, after Redis Labs initiated unscheduled maintenance that cycled the servers in our Redis cluster. This changed the cluster’s IP addresses, and the Ruby client our API uses to connect to Redis failed to resolve the new hosts. Because the API could not connect to the new cluster, workflows failed to deploy and events failed to load.
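
To illustrate the failure mode, here is a minimal, language-neutral sketch in Python of a connection wrapper that drops a failed connection and looks the hostname up again before retrying. It is not the Ruby Redis client’s actual behavior or API. The class name `ReResolvingConnection`, its parameters, and the injected `resolve` and `connect` functions are all hypothetical.

```python
import socket


class ReResolvingConnection:
    """Hypothetical wrapper: re-resolve the hostname whenever a call fails."""

    def __init__(self, host, port, resolve=None, connect=None, max_attempts=3):
        self.host = host
        self.port = port
        self.max_attempts = max_attempts
        # Both hooks are injectable so the logic can be exercised without a network.
        self._resolve = resolve or socket.gethostbyname
        self._connect = connect or (
            lambda ip, port: socket.create_connection((ip, port), timeout=2)
        )
        self._conn = None

    def _open(self):
        # Resolve on every (re)connect instead of reusing a remembered IP address.
        ip = self._resolve(self.host)
        self._conn = self._connect(ip, self.port)

    def _drop(self):
        # Discard the current connection so the next attempt starts from DNS.
        if self._conn is not None:
            try:
                self._conn.close()
            except OSError:
                pass
            self._conn = None

    def call(self, operation):
        """Run operation(conn); on a network error, reconnect and try again."""
        last_error = None
        for _ in range(self.max_attempts):
            try:
                if self._conn is None:
                    self._open()
                return operation(self._conn)
            except OSError as exc:
                last_error = exc
                self._drop()
        raise ConnectionError(
            f"could not reach {self.host}:{self.port}"
        ) from last_error
```

A client that keeps using the address it resolved at startup behaves like this wrapper without the `_drop` step: once the servers behind the hostname are replaced, every call fails until the process restarts.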

At 7:18 AM PT (14:18 UTC), our team restarted the API pods on our Kubernetes cluster. After the restart, the API was able to reconnect to Redis and the incident was resolved.

Why an alarm was not raised

Most of our alarms and auto-recovery mechanisms are tied to the availability of services (e.g. is https://pipedream.com up, is the API able to receive traffic, are workflows running?).

In this particular outage, both the UI and API were available, and workflows continued running, but certain UI operations failed. The failures triggered exceptions in Sentry, our error-tracking system, but those exceptions did not raise alarms to our team.

Going forward

As developers, we understand how frustrating downtime can be, especially when the initial response takes hours. A few items came out of our investigation that we plan to tackle:

Upgrade the Ruby Redis client. Newer versions of the client appear to be more resilient to this specific issue.
Investigate a move to a different Redis cluster type that’s more resilient to changes to the underlying cluster / networking (suggested by our hosting provider, Redis Labs).
Raise better alarms on specific, high-volume exceptions like the errors during the incident.
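
As a rough illustration of the last item, the sketch below counts exceptions of each type within a sliding time window and signals once a threshold is reached, independent of whether the service itself is reachable. It is a generic Python example, not Pipedream’s alerting setup and not a Sentry API. The class name `ExceptionRateAlarm`, the threshold values, and the error name `RedisConnectionError` are hypothetical.

```python
import time
from collections import defaultdict, deque


class ExceptionRateAlarm:
    """Hypothetical sliding-window counter keyed by exception type."""

    def __init__(self, threshold, window_seconds, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._clock = clock
        self._events = defaultdict(deque)  # error type -> recent timestamps

    def record(self, error_type, now=None):
        """Record one exception; return True if the alarm should fire."""
        now = self._clock() if now is None else now
        events = self._events[error_type]
        events.append(now)
        # Forget occurrences that have fallen out of the window.
        cutoff = now - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) >= self.threshold


# Assumed policy: alert on 3 or more of the same error within 60 seconds.
alarm = ExceptionRateAlarm(threshold=3, window_seconds=60)
results = [alarm.record("RedisConnectionError", now=t) for t in (0, 10, 20)]
```

In this usage example the first two calls return `False` and the third returns `True`, because three errors of the same type fall within the 60-second window.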

We don’t take your trust for granted. If you have any questions at all or observe any lingering issues from this incident, please let us know.

Posted Sep 22, 2022 - 19:35 UTC

Resolved

This incident has been resolved.

Posted Sep 21, 2022 - 15:02 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Sep 21, 2022 - 14:25 UTC

Investigating

We have observed an unusually high number of errors on the event summaries API used to hydrate the builder on page load.

We're currently investigating.

Posted Sep 21, 2022 - 14:14 UTC

This incident affected: Public APIs (REST API).