While the New Year promises a lot of opportunity, the last couple of weeks have included a number of periods where our system performance hasn't been up to our standards. System uptime has remained strong, but the speed of the system has at various times been impaired, and I want to share some insight into the issues and the areas of ongoing improvement we're focusing on.
Firstly, a little bit of context - Accelo is a SaaS cloud application that runs in a "multi-tenant" mode. Our systems routinely run with well over 100 servers tuned to perform specific tasks, most of them organized into distinct, automatically scaling groups, so that as load increases in one area, that area scales up on its own. This is one of the big benefits of the cloud, and normally it is a wonderful thing. There is, however, one common part of the architecture that all auto-scaling systems need to interact with - the database, which provides a persistent and true record of everything in the system. We use multiple databases, but the most important one is MySQL, provided by Amazon Web Services' Aurora RDS.
This configuration uses multiple failover features to deliver high uptime, but at the end of the day, when a user is entering a time log or creating a new client record, the database needs to write/save that information consistently. Techniques like caching (saving results in a temporary store, which we do use) therefore aren't the silver bullet they would be in a more read-heavy cloud system. We do make extensive use of read-only database engines to maximize performance, but there are still times when - like the end of class at school or of a movie at a cinema - everyone rushing for the door at once means slower going.
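To make the trade-off concrete, here is a minimal sketch (not Accelo's actual code; the endpoint names and cache are purely illustrative) of why caching and read replicas help reads but can't absorb writes: every write must reach the single writer endpoint and invalidate any cached copies, while repeated reads can be served from a temporary store or spread across replicas.

```python
import random

WRITER_ENDPOINT = "aurora-writer.example.internal"   # hypothetical name
READER_ENDPOINTS = [                                  # hypothetical names
    "aurora-reader-1.example.internal",
    "aurora-reader-2.example.internal",
]

WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

def route(sql):
    """Return the endpoint a statement should be sent to: writes always
    go to the one writer, reads can be spread across the replicas."""
    verb = sql.lstrip().split()[0].upper()
    if verb in WRITE_VERBS:
        return WRITER_ENDPOINT
    return random.choice(READER_ENDPOINTS)

cache = {}  # naive in-process cache: query text -> result

def cached_read(sql, fetch):
    """Serve repeated reads from the temporary store; `fetch` performs
    the real database query only on a cache miss."""
    if sql not in cache:
        cache[sql] = fetch(sql)
    return cache[sql]

def invalidate():
    """A write makes cached copies stale, so they must be discarded -
    which is why write-heavy load can't hide behind the cache."""
    cache.clear()
```

The key point the sketch illustrates: no matter how many readers or cache layers are added, writer load is irreducible, so write-heavy events hit everyone at once.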
The challenges we've experienced over the last two weeks have almost all come down to the performance of the database cluster in responding to queries in a timely fashion. Here's a rundown of the various events and their causes.
The initial impact was caused a couple of weeks ago by an external security audit. Accelo takes security very seriously, and part of this is being deeply audited at least once a year by an industry-leading external security vendor. This process kicked off in mid-January and part of it involved their "bots" attempting to find problems with the Accelo application.
No security issues were found; however, the auditors' bots managed to break their own test data in such a way that specific database queries slowed down, which significantly impacted database performance for all users on two occasions. Our team quickly identified the problem and cleaned it up as soon as we got the all-clear that doing so would not interfere with the audit.
The second issue was actually caused by the same auto-scaling magic that is so beneficial elsewhere in the application. In this case, a number of large imports were being run by users, and one in particular was formatted in a way that put too much load on the database. This created contention on key tables and impacted our real-time users as well, and the automatic scaling that kicked in to handle the imports then added even more load onto the database. Our engineers identified and addressed the issue, and in addition to investigating the code, we upgraded the power of our writer database, which improved performance.
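One common way to keep a bulk import from monopolizing locks on key tables is to break it into small batches with short pauses between them, so real-time users' queries can interleave. This is a minimal sketch of that idea, not Accelo's actual import code; the batch size and pause are illustrative defaults, and `write_batch` stands in for whatever performs the real database write.

```python
import time

def import_rows(rows, write_batch, batch_size=500, pause=0.05):
    """Write import rows in small batches, pausing briefly between them.

    Each batch becomes one short transaction, so locks on key tables
    are held only briefly and interactive queries can get through
    between batches instead of queueing behind one huge write.
    """
    for start in range(0, len(rows), batch_size):
        write_batch(rows[start:start + batch_size])  # short transaction
        time.sleep(pause)                            # yield to other users
```

The trade-off is that the import itself finishes a little later, which is almost always the right call when real-time users share the same writer.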
A third issue was a user accidentally creating a project with tens of thousands of milestones and tasks. Since this normally isn't possible through the user interface - and there's no business reason to do it - we hadn't built in the protections/hardening that would stop such an eventuality. A project of that scale, stretching over so many years and then needing to be scheduled by Accelo's automation, caused a significant processing bottleneck - a gift that kept on giving every time someone loaded even so much as a single task on that massive project. We have cleaned up the data and are putting limits in place so that users can't accidentally do something like this in the future.
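The kind of limit described above is usually just a validation guard that rejects implausibly large structures before they ever reach the scheduler. A minimal sketch, assuming a hypothetical cap (the number below is illustrative, not Accelo's real limit):

```python
MAX_MILESTONES_PER_PROJECT = 500  # hypothetical cap, not Accelo's actual value

class ProjectTooLargeError(ValueError):
    """Raised when a project exceeds limits no real business need requires."""

def validate_project(milestones):
    """Reject oversized project structures up front, before the
    scheduling automation has to process them."""
    if len(milestones) > MAX_MILESTONES_PER_PROJECT:
        raise ProjectTooLargeError(
            f"{len(milestones)} milestones exceeds the limit of "
            f"{MAX_MILESTONES_PER_PROJECT}"
        )
    return milestones
```

Failing fast at creation time is far cheaper than discovering the problem later, when every page load that touches the project pays the scheduling cost again.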
In response to these performance issues, we made a number of changes to our database configuration, more than doubling capacity and re-balancing our database readers. Our previous testing suggested that these changes would improve performance, but in this case they didn't; in fact, re-balancing the readers actually made things worse for a few hours. We reverted the configuration and performance was restored.
Making changes to such central pieces of infrastructure is both a blessing and a curse. The blessing is that we can make changes like this through the cloud: in the old days, such a change would take weeks or months depending on hardware backorders and configuration time, but in the cloud it took only a few minutes. The curse is that each change causes a small outage of a few minutes - a lot better than the bad old days, but still disruptive. Because we were working on something so central and wanted to improve performance for our users as quickly as possible, we chose to make these changes without declaring a maintenance period and waiting for the weekend (as we do for planned work), which led to more user frustration even while we were working almost around the clock to improve things.
Even these small outages were unexpected, because similar changes have had a much lower impact in the past. The result was several short outages during which the entire database cluster was unavailable, where previously failover to standby servers was seamless. We now have a much more involved and interlocked procedure to avoid this going forward, and we are taking the matter up with our AWS technical team.
These periods of load and pressure - unwelcome, but not unexpected for a growing SaaS business - have had one benefit that all of our users will see for themselves in the coming weeks and months. A bit like a very low tide in an estuary exposing things that would otherwise stay hidden, these periods of high load have exposed other areas for performance improvement, which some of the most senior members of our team have been working on over the last two weeks and which will further optimize performance and system speed. Additionally, the architecture for our new List Screens (now in Beta) has been engineered with these insights and lessons built in, and their performance remained strong during the periods of higher database load as a result. When they launch out of beta in a few months, the experience will be significantly faster for the 20% of page views that are list screens, and, through their performance, better across the whole application.
Thank you for your patience and understanding through these growing pains as we make Accelo a little better every day!