A Very Unfunny Thing Happened Over the Weekend...
Many Smartsheeters experienced degraded performance this morning, including slow sheet loading and an inability to view their data. We want you to know that we do not consider this level of performance acceptable and are taking this issue very seriously.
Before we explain what happened, we want to apologize for the work interruption. We consider job #1 to be productivity, and clearly we did not meet our own high standards for productivity this morning.
What happened: The simple version
On Saturday we took the system offline in order to perform planned maintenance and add some new features. We announced this in advance and had planned to be offline for about two hours.
Unfortunately, the update did not go as quickly as planned. When we realized that we had been offline longer than was acceptable, we changed the plan in order to come back online more quickly. In the end we were offline for about four hours on Saturday afternoon PDT.
On Sunday we were able to add two high-priority features (with no offline time), which are live now.
However, when traffic surged on Monday, our services were much slower than normal. It took us about two hours to find and fix the problem.
During the offline period, we posted live updates on all social channels (Google+, Twitter, Facebook) to inform everyone of the situation as quickly as possible.
What happened: The technical post mortem
On Saturday morning, we took the site down for a scheduled upgrade. As part of that upgrade, we applied several recommended database patches that appear to have prevented the database from coming back online. We immediately called our database vendor and worked together to bring the service back up, but once it was clear that a fix would not be immediate, we switched over to our hot backup data center and came back up on Saturday afternoon.
As the Smartsheet workload spiked this morning, a number of people were not able to log into the service. We traced the issue to an earlier networking upgrade in the backup data center that was incomplete, which left Smartsheet running with about half the expected throughput. The networking issue did not become apparent until we saw service loads spike this morning, and we were able to rectify it within about two hours.
While no data was ever at risk, and our backup data center worked well throughout, we recognize that doesn’t help much when you are waiting to get back to work. We are already making big investments in service infrastructure to enable low-to-no-downtime updates. We will now also take a hard look at what happened and reconsider the way we handle planned maintenance and service updates going forward.
Thanks very much to everyone who took the time to notify us of the issues they were experiencing. Again, we are truly sorry to anyone whose work was affected by our service issues today.
- The Smartsheet Team