Hello, world. It's early afternoon here at Crowd Supply headquarters in Portland, Oregon. As many of you noticed (and helpfully brought to our attention), the Crowd Supply site was down for most of last night. As of this morning, everything is back to humming along as it should be, and we've verified that nothing nefarious was at play: there was no breach of security, no user or payment information was compromised, and no records were lost or corrupted. In the nearly five years Crowd Supply has been a public website, this is by far the longest downtime we've ever experienced. Previous instances of downtime have been measured in seconds or minutes, not hours. Clearly, hours of downtime is unacceptable. It is disruptive to the many creators and backers who use Crowd Supply, and it bears some explanation. Read on for the gory details and what we're doing to prevent this from happening again.
Two days ago, we received notification from our host provider that the server Crowd Supply runs on would be rebooted for maintenance. This is a regular occurrence we've dealt with many times before. Unfortunately, in this particular case, the notification came amidst many other similar notifications (and spam) about the widely publicized Meltdown and Spectre vulnerabilities and we failed to even see the notification. Still, the Crowd Supply site was designed to handle unscheduled server reboots and should have rebooted cleanly with only a minute or so of downtime.
At 07:12 UTC (11:12 PM PST), the Crowd Supply server rebooted. Unfortunately, the automatic procedure for restarting a key element of the website (the systemd unit for the application server) had been manually disabled during some previous maintenance and testing. With only this critical element missing, the website was unusable, but appeared to our monitoring instrumentation to be alive and well. The problem was solved by simply restarting the application server, which we did at approximately 16:06 UTC (8:06 AM PST), just after realizing the site was down. Of course, we've also since re-enabled the automatic procedure that restarts the application server.
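For the technically curious, here is a minimal sketch of the kind of check that could catch a disabled unit before the next reboot does. It is an illustration, not our actual tooling: the unit name `app-server.service` is hypothetical, and treating only the literal state `enabled` as healthy is a simplification (systemd reports other states, such as `static`, that can also be fine).

```python
import subprocess
from typing import Callable, List


def run_systemctl(args: List[str]) -> str:
    """Run systemctl and return its trimmed stdout, e.g. 'enabled' or 'inactive'."""
    result = subprocess.run(
        ["systemctl", *args], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


def unit_problems(
    unit: str, run: Callable[[List[str]], str] = run_systemctl
) -> List[str]:
    """Return human-readable problems for a unit; an empty list means none found."""
    problems = []

    # 'is-enabled' answers: will this unit come back by itself after a reboot?
    enabled = run(["is-enabled", unit])
    if enabled != "enabled":
        problems.append(
            f"{unit} will not start at boot (is-enabled: {enabled or 'unknown'})"
        )

    # 'is-active' answers: is the unit running right now?
    active = run(["is-active", unit])
    if active != "active":
        problems.append(f"{unit} is not running (is-active: {active or 'unknown'})")

    return problems


if __name__ == "__main__":
    # Hypothetical unit name, used purely for illustration.
    for problem in unit_problems("app-server.service"):
        print("WARNING:", problem)
```

Run periodically, a check like this would flag a unit that is still running but would not survive a reboot, which is the state that goes unnoticed until the server restarts.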
What We're Doing About It
Here's what we're doing to prevent downtime in the future:
- Instrument the site with better automated monitoring that will alert our team when either full or partial downtime events are detected.
- Implement stricter procedures for testing and maintenance on the production server.
- Carry out stress tests of our system to make sure it responds appropriately to adverse conditions (like unscheduled reboots).
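To make the first item more concrete, here is a rough sketch of a probe that tells full downtime apart from partial downtime. It is a sketch under stated assumptions rather than a description of our monitoring: the URL and the `marker` string (some text that only appears when the application server has rendered the page) are hypothetical placeholders.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def fetch(url: str, timeout: float = 10.0) -> Tuple[Optional[int], str]:
    """Return (HTTP status, body), or (None, '') if nothing answered at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Something answered, but with an error status such as 502 or 503.
        return err.code, ""
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: no answer at all.
        return None, ""


def classify(status: Optional[int], body: str, marker: str) -> str:
    """Classify one probe result as 'up', 'partial downtime', or 'full downtime'."""
    if status is None:
        return "full downtime"
    # The server answered, but the page the application should have produced
    # is missing or broken: alive at the network level, unusable for people.
    if status != 200 or marker not in body:
        return "partial downtime"
    return "up"


if __name__ == "__main__":
    # Hypothetical URL and marker, used purely for illustration.
    status, body = fetch("https://example.com/")
    print(classify(status, body, marker="Example Domain"))
```

The point of the marker is that the probe only reports "up" when it sees content the application itself had to produce, so a server that answers without a working application behind it no longer looks healthy.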
We will work hard to ensure that this sort of disruption never happens again. From everyone at Crowd Supply, please accept our apologies for this disruption and our thanks for all your patience, support, and goodwill.
Co-founder & CEO