On September 21st, an incident affected Pipedream’s builder and event inspector from approximately 10:00 AM UTC to 2:00 PM UTC.
In short, users could not edit or deploy workflows from the builder, and the event inspector did not show the results of incoming events. However, deployed workflows were not affected by the outage. Events were processed normally.
We can do a better job of building resiliency against incidents like this into the core service, and we own the downtime. We want to share what happened and what we’re doing to address it.
On September 21st at 7:00 AM PT (14:00 UTC), our team responded to user reports that the Pipedream dashboard wasn’t functioning properly.
Upon investigation, we found that our internal API had failed to connect to Redis, one of our core data stores, after Redis Labs initiated unscheduled maintenance that cycled the servers in our Redis cluster. This changed the cluster’s IP addresses, and the Ruby client our API uses to connect to Redis failed to resolve the new hosts. Because the API could not connect to the new cluster, workflows failed to deploy and events failed to load.
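To illustrate the failure mode, here is a minimal sketch of a client wrapper that drops a stale connection and looks the hostname up again on the next attempt, instead of holding on to an address that no longer exists. It is written in Python for brevity and is not the Ruby client or the fix we deployed. `ReresolvingClient`, `resolve_host`, and the `connect` callback are hypothetical names chosen for the example.

```python
import socket
from typing import Any, Callable


def resolve_host(hostname: str, port: int) -> str:
    """Ask the system resolver for the host's current IP address."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return infos[0][4][0]


class ReresolvingClient:
    """Hypothetical wrapper: re-resolves the hostname after a connection error."""

    def __init__(
        self,
        hostname: str,
        port: int,
        connect: Callable[[str, int], Any],
        resolve: Callable[[str, int], str] = resolve_host,
        max_attempts: int = 3,
    ) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._hostname = hostname
        self._port = port
        self._connect = connect
        self._resolve = resolve
        self._max_attempts = max_attempts
        self._conn: Any = None

    def _ensure_connected(self) -> Any:
        # Resolve only when there is no live connection, so a reconnect
        # always picks up the address the hostname points to right now.
        if self._conn is None:
            ip = self._resolve(self._hostname, self._port)
            self._conn = self._connect(ip, self._port)
        return self._conn

    def call(self, operation: Callable[[Any], Any]) -> Any:
        last_error: Exception = ConnectionError("no attempt made")
        for _ in range(self._max_attempts):
            try:
                return operation(self._ensure_connected())
            except ConnectionError as exc:
                last_error = exc
                # Discard the stale connection; the next attempt re-resolves.
                self._conn = None
        raise last_error
```

The design choice worth noting is that the hostname, not the resolved IP, is the long-lived piece of state. A real client would also need backoff between attempts and a respect for DNS TTLs, which this sketch leaves out.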
At 7:18 AM PT (14:18 UTC), our team restarted the API pods on our Kubernetes cluster. After the restart, the API was able to reconnect to Redis and the incident was resolved.
Most of our alarms and auto-recovery mechanisms are tied to the availability of services (e.g. is https://pipedream.com up, is the API able to receive traffic, are workflows running?).
In this particular outage, both the UI and API were available, and workflows continued running, but certain UI operations failed. Those failures triggered exceptions in Sentry, our error-tracking system, but these specific errors did not raise alarms to our team.
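One way to close this kind of gap is to alarm on error volume as well as on availability. The sketch below is a simple sliding-window counter that signals when too many errors arrive within a set period. It is an illustration only and does not use Sentry's API or reflect our alerting setup. `ErrorRateAlarm`, its threshold, and its window are hypothetical values chosen for the example.

```python
import time
from collections import deque
from typing import Callable, Deque


class ErrorRateAlarm:
    """Hypothetical alarm: fires when errors in a time window reach a threshold."""

    def __init__(
        self,
        threshold: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = threshold
        self._window = window_seconds
        self._clock = clock
        self._timestamps: Deque[float] = deque()

    def record_error(self) -> bool:
        """Record one error; return True if the alarm should fire."""
        now = self._clock()
        self._timestamps.append(now)
        # Evict errors that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] > self._window:
            self._timestamps.popleft()
        return len(self._timestamps) >= self._threshold
```

An alarm like this would fire when, for example, repeated Redis connection errors pile up, even though every availability check is still passing.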
As developers, we understand how frustrating downtime can be, especially when the initial response takes hours. A few items came out of our investigation that we plan to tackle:
We don’t take your trust for granted. If you have any questions at all or observe any lingering issues from this incident, please let us know.