As a growing startup, we experience growing pains: increased use and success of our product can lead to degraded performance from time to time. While we monitor thousands of data points across our systems and make heavy use of CloudWatch Alarms and Slack notifications to spot issues when they arise, some scaling challenges can't be solved effectively by throwing more hardware at them.
Earlier this year we reached one of those points as usage increased significantly after the holidays and more businesses were getting back to work.
Today, Accelo users create over 3 million activities, log 1.2 million hours of work and create over $140M of invoices every single month in just one of our three data centers. In addition, our integrations ingest and process over 30 million emails, 5 million calendar events and 24 million callbacks from other systems we integrate with. On top of that, Accelo's unique Trigger functionality automates over 3 million actions a month, warning managers that a project is heading over budget before it actually does, or telling account managers that their monthly retainer is running ahead of budget, so they can avoid those "too late" surprises.
All of this puts a lot of demand and load on our infrastructure, and the part of that infrastructure that is hardest to scale is our database.
We use the Aurora RDS service from Amazon Web Services (AWS) to run Accelo. A MySQL-compatible engine, Aurora gives us high availability and scalable performance, with a large database writer and a number of database readers to share the load. While this setup is in many ways best practice, we still found a number of situations earlier this year where we were butting up against performance limits, and our engineering team did some pretty impressive things behind the scenes to improve performance.
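In practice, that split means the application routes its work by endpoint: Aurora exposes a cluster (writer) endpoint for writes and a read-only endpoint that spreads connections across the readers. The sketch below is purely illustrative, assuming a hypothetical pymysql setup with placeholder hostnames and credentials, not our actual configuration.

```python
# Minimal sketch: route read-only queries to Aurora's reader endpoint and
# everything else to the cluster (writer) endpoint. Hostnames, credentials
# and the helper itself are placeholders for illustration.
import pymysql

WRITER = dict(host="example.cluster-abc123.us-east-1.rds.amazonaws.com",
              user="app", password="secret", database="accelo")
READER = dict(host="example.cluster-ro-abc123.us-east-1.rds.amazonaws.com",
              user="app", password="secret", database="accelo")

def get_connection(read_only: bool = False):
    """Read-only work goes to the reader fleet; writes go to the writer."""
    return pymysql.connect(**(READER if read_only else WRITER))
```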
Data in Accelo is much more useful with context, and often that context involves tracing a relationship from something like a Task up to a Ticket or Milestone and then Project, and then to a Contact and Client.
This process of stepping up through a relationship hierarchy is known as forming a Breadcrumb, and when you're looking at something like a timesheet or performing an export, it is one of the more computationally intensive things we ask our database to do, since you may be forming these breadcrumbs for hundreds of records, each linked to almost anything, which in turn can be linked to almost anything else.
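Conceptually, forming a single breadcrumb is a walk up the parent chain, one lookup per level. The sketch below is illustrative only, using hypothetical object types and a stubbed parent lookup rather than Accelo's real schema.

```python
# Illustrative only: walk up the relationship hierarchy one parent at a time.
# get_parent() stands in for a database lookup of an object's parent link.
def get_parent(obj_type: str, obj_id: int) -> tuple[str, int] | None:
    parents = {("task", 7): ("milestone", 3),
               ("milestone", 3): ("project", 1),
               ("project", 1): ("client", 9)}
    return parents.get((obj_type, obj_id))

def breadcrumb(obj_type: str, obj_id: int) -> list[tuple[str, int]]:
    trail = [(obj_type, obj_id)]
    parent = get_parent(obj_type, obj_id)
    while parent is not None:
        trail.append(parent)
        parent = get_parent(*parent)
    return trail

print(breadcrumb("task", 7))
# [('task', 7), ('milestone', 3), ('project', 1), ('client', 9)]
```

Do that for every record on a busy timesheet and the number of small queries adds up quickly.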
When working to improve application performance for our users in January, we realized that the queries used to build breadcrumbs, individually quick but so frequent that they added up to a large collective load, weren't as well structured as they could be. After a focused push, our engineering team improved the performance of our breadcrumbs and achieved a marked improvement in database performance and page load speeds.
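As a rough illustration of the kind of restructuring involved (the table, columns and function below are assumptions, not our actual code), records needing breadcrumbs can be grouped so that each level of the hierarchy is resolved with one batched query rather than one query per record.

```python
# Hypothetical sketch: resolve the parents of many objects of one type with a
# single IN (...) query instead of one query per object. `cursor` is any
# DB-API cursor (e.g. from PyMySQL); the table and columns are made up.
def fetch_parents_batched(cursor, object_type, object_ids):
    if not object_ids:
        return {}
    placeholders = ", ".join(["%s"] * len(object_ids))
    cursor.execute(
        "SELECT object_id, parent_type, parent_id "
        "FROM object_parent "
        f"WHERE object_type = %s AND object_id IN ({placeholders})",
        [object_type, *object_ids],
    )
    return {obj_id: (ptype, pid) for obj_id, ptype, pid in cursor.fetchall()}

# One round trip covers the next level up for every task on a timesheet:
# parents = fetch_parents_batched(cursor, "task", [7, 8, 9, 10])
```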
The following graph shows the Average Active Sessions (AAS) on one of the database readers over a four-day period that included two code releases going live. The y-axis shows the AAS, the average number of requests queued or executing over the sample interval, so it is a good proxy for the request "load". There is a significant difference between a load of 4 and 8, for example, because any additional latency in user queries immediately results in a bad user experience.
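For the curious, AAS boils down to sampling how many sessions are running or waiting at regular intervals and averaging those counts. Here is a toy illustration (the numbers are made up, and this is not how Performance Insights is actually implemented):

```python
# Toy illustration of the AAS metric: average the number of active (running
# or waiting) sessions observed at each sample point over the interval.
def average_active_sessions(samples: list[int]) -> float:
    return sum(samples) / len(samples) if samples else 0.0

# Five samples across an interval where load hovered in the mid-teens:
print(average_active_sessions([12, 18, 15, 20, 16]))  # 16.2
```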
The AAS on the left is before the first change, and hovering around 16 was indicative of "normal" performance during the business week. Again, a high AAS means queries are backing up, which makes for a bad user experience. After the two stages of code changes were pushed live, you can see the load drop by an incredible 75%!
That was on the database reader; an even larger reduction in load was possible at the writer. The breadcrumb code appeared to be quite efficient, but after tracing and profiling the many small database requests that go into a single user request, we found some unexpected behaviour and a huge opportunity to optimize writer load, too. We'll share details of this in the next section, where we focus on the database engine itself.
Aurora RDS supports MySQL versions 5.6 and 5.7, but it wasn't until late 2018 (version 2.03) that some key features like the performance schema became available. Even then, issues with binlogs kept us on MySQL 5.6, but by early 2021 we knew it was a priority to upgrade and gain the benefits of improved index hints and a completely rewritten query optimizer (the heart of read operations in a database).
Unfortunately, the promise of a completely rewritten optimizer was both exciting and scary - our testing gave us confidence that some of our well-indexed but still underperforming queries would improve, but we were worried that queries which were already doing fine might regress. Nonetheless, in an incredibly short and intense period from late February until early April, our engineering and QA teams worked hard to test and validate this critical upgrade.
The engine update significantly reduced the peak load and greatly improved the user experience during periods of high demand. It is also worth noting the dates in the graph below - we upgraded the database over the weekend (a quieter time), and while load came back up on the 23rd and 24th of March (Australian time, so Monday the 22nd and Tuesday the 23rd US time), these represented peak loads of 3 or 4, compared to routine periods above 6 (and sometimes at 20) at the end of the prior work week.
The following graph shows the writer load over a four-week period to highlight just how large an improvement this was. The blue trace shows the database writer load for about three weeks: two normal weeks of load, then a full business day of high load, before the breadcrumb-related improvements eliminated the busywork.
The load went from an average of about 8 with spikes over 30 down to an average of 1 with spikes of just 4, making for a much more responsive user experience!
What about the green and orange traces in the graph? They are from the MySQL version upgrade. The green trace represents a new writer instance with a configuration that did not pan out as well as expected; it was replaced by the configuration shown in the orange trace, which is still in production and performing well. As you can see, the orange spikes are smaller and the average is lower, which shows how improving the breadcrumb requests and upgrading the database engine combined to greatly reduce the number of requests that needed to wait before they could be written to disk.
Our Daily Timesheet screen provides a snapshot of a user's day - what they have already worked on, what they plan to work on, and the other assigned work that may not be today's order of business.
Some of this information can be a lot to gather and deliver to the browser, especially for the busiest users, who can least afford to wait. By profiling the most heavily exercised code, making small performance improvements where they count, and caching information that had been fetched from the database more often than needed, we've reduced the average response times of a couple of the main API calls on this page by around 30%, with even bigger percentage savings for busy people at the end of the week, when their timesheet has a full week of information to load.
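As a simplified illustration of the caching side of this work (the function and field names below are hypothetical, not our actual code), a value that would otherwise be fetched from the database again and again while building a response can be fetched once and reused:

```python
# Simplified sketch: cache a per-user lookup so repeated calls while building
# the timesheet response don't hit the database again.
from functools import lru_cache

def fetch_user_settings_from_db(user_id: int) -> dict:
    # Stand-in for a real database query.
    print(f"querying settings for user {user_id}")
    return {"user_id": user_id, "timezone": "UTC"}

@lru_cache(maxsize=1024)
def get_user_settings(user_id: int) -> dict:
    return fetch_user_settings_from_db(user_id)

get_user_settings(42)  # hits the "database"
get_user_settings(42)  # served from the cache
```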
An important part of the data we manage for our clients is the project and task scheduling information. We have information related to the expected amount of workload for staff, as well as how things are tracking against their due dates, and this information isn’t static; projects start, progress, and reach their completion, but sometimes they also stall briefly (such as when you're waiting on clients before you can move forward). This means every day we need to do a lot of number crunching to get the latest status, and we work hard to get this done before our clients are in the office in the morning.
Unfortunately, some clients were seeing delays in these data refreshes - in the month of April we had 42 instances of deployments that didn't finish their refresh in time for the start of the work day.
A concentrated effort of profiling and improving our code, caching data to reduce load on our database, and being smarter about how we schedule the work sped up processing in the average case by around 15%. What this 15% average improvement doesn't capture is the much greater impact on our biggest accounts - the ones with more users, more projects, more tasks and longer-running schedules. For these larger accounts, we've almost halved the processing time, doubling our performance.
When it comes to what our clients see, the great news is that since these improvements went live at the end of April, we haven't once missed our target of having every account refreshed overnight and ready in the morning.
One of the biggest challenges with Accelo's database is that we're such an expansive product. With hundreds of relational tables ensuring our clients can manage all of their client work from quote to cash, we have a fundamental challenge with having many objects related to many other objects. An example could be an email sent by someone on the accounts team thanking a client for a payment against an invoice for one of their projects. This activity is linked to the invoice, but you also want to see it represented in the project (so the project manager can see in the stream that it was sent) and on the stream for the client record. In a different context, a team member might be entering time (as an activity) on a task against a retainer period, which requires links to the task, the period, the retainer and the client.
While we've made strong use of technologies like ElasticSearch to denormalize this data, the fact that objects can be linked to many different things, and that those relationships also matter, makes things complex from a query performance point of view. Having more than one JOIN option makes a big database very, very sad, so we needed better support for these flexible relationships without bringing a big database to its knees with lots of OR statements.
Through the second half of last year, our engineers implemented a new mapping structure to give us a "Swiss Army knife" for connecting the many and varied parts of Accelo. This functionality was critical to our new DBEngine architecture, was used prominently in our new Custom Lists feature, and is central to achieving high performance. As with all things that sound easy, the challenge was in the implementation, especially handling the creation and updating of data in a consistent and reliable way.
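To give a feel for the idea (the schema and names below are illustrative, not Accelo's actual design), a mapping structure like this boils down to a single link table that records every relationship as a row, so a stream or list can be fetched through one indexed join instead of OR-ing across a pile of different foreign keys:

```python
# Illustrative only: a generic link table and the kind of query it enables.
# Table, column and index names are hypothetical.
CREATE_OBJECT_LINK = """
CREATE TABLE object_link (
    object_type VARCHAR(32) NOT NULL,   -- e.g. 'activity'
    object_id   BIGINT      NOT NULL,
    linked_type VARCHAR(32) NOT NULL,   -- e.g. 'invoice', 'project', 'client'
    linked_id   BIGINT      NOT NULL,
    PRIMARY KEY (object_type, object_id, linked_type, linked_id),
    KEY idx_linked (linked_type, linked_id)
);
"""

# Fetching a project's activity stream becomes a single indexed join, rather
# than OR-ing across activity.invoice_id / activity.project_id / etc.:
PROJECT_STREAM = """
SELECT a.*
FROM activity AS a
JOIN object_link AS l
  ON l.object_type = 'activity' AND l.object_id = a.id
WHERE l.linked_type = 'project' AND l.linked_id = %s
ORDER BY a.date_created DESC;
"""
```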
While we've been focusing here on some pretty technical things like database and code optimization, what our users really care about is the performance of the application when they're using it. We measure this as "response time": the time, in seconds, our web servers take to deliver dynamic data to our users.
The following graph shows this response time from the start of February to the end of May for all web requests made to Accelo. This response time is measured at the edge of our environment, so it summarizes how the optimizations made to the database and the code behind Accelo combine to improve responsiveness by over 25%!
While we've outlined a lot of improvements here, we're a long way from done. Our continued improvement efforts focus on both tactical optimizations and more strategic major upgrades. If you'd like to learn more about our big strategic direction in improving Accelo for our users, check out another blog post in this Performance series, where we share a look at the big-picture architecture improvements we've deployed over the last couple of months.