Over the past week (since our unplanned outage on Monday the 15th of July) we've been seeing intermittent performance issues in our live server clusters. As a fast-growing cloud-based business, we wanted to share the details of some of these challenges because we know that any downtime or slow performance has a negative impact on our users and their businesses.
[Update 1: July 25, 00:00 UTC] - after extensive monitoring and telemetry (of literally millions and millions of data packets across numerous hosts) we've found an unexpected error with IPv4 not correctly fragmenting packets larger than the 1500-byte MTU. This appears to be flooding the internal network with TCP retries - our sysadmins are researching this issue and attempting to zero in on it by reproducing and then mitigating it, and we'll hopefully have more information to share soon.
[Update 2: July 25, 17:45 UTC] - further analysis by our engineers has identified the likely problem as a kernel lockup on our virtualized servers, which is consistent with the symptoms we've been seeing. Identifying the cause of the problem (between the host kernel, the guest kernel and the qemu virtualization layer) is ongoing; however, the cumulative effect of these lockups on system performance became too significant, and we undertook some deliberate downtime to clear out all backlogs. This involved host-level restarts, which caused a period of 8 minutes of downtime while services restarted and restored normal operations. The symptoms are consistent with an issue introduced by the kernel upgrade undertaken last Monday, and since then we've seen that this sort of deliberate downtime (cold booting hosts) mitigates the symptoms of the problem for a few days, hopefully buying us time to identify the root issue around these kernel lockups before performance degrades again.
[Update 3: July 26, 06:00 UTC] - our engineers have taken the more detailed logs from the last few hours and narrowed the problem down to an issue with APIC on the guest VM kernels. In very specific (and rare, intermittent) circumstances this has been noted as a bug (see here and here), so we're hopeful that working around it will indeed solve the problem. We're about to restart the affected VMs with a change in kernel parameters and we'll be observing to see if this addresses this intermittent and inconsistent problem.
[Update 4: July 26, 22:00 UTC] - it has been 16 hours since our patch and restart were applied, and so far we have not seen a single alert of the kind we were seeing before the APIC patch was applied. This compares to almost one a minute between 17:00-18:00 UTC yesterday, for example, and while we're yet to call this completely solved (the problem has previously taken approximately 40 hours to recur after a host reboot) we'll be in a position to know with more confidence in the next 48 hours.
[Update 5: July 30, 16:45 UTC] - the stability issues related to the virtual machines and kernel lockups appear to have been resolved with the APIC patch late last week, with no recurrence of the issues appearing in our logs in more than 100 hours. Another issue has appeared, however (which may have been masked by the other issues that are now resolved), where for periods of 15-30 seconds our load balancers are not properly routing traffic back to our back-end private devices. Our engineers are looking into this next frustrating bug, and we're hoping that a kernel patch to bring everything into consistency will address this very intermittent (and less damaging) gremlin.
[Update 6: July 30, 21:45 UTC] - our engineers have been working on the problem over the last 5 hours; it is manifesting itself as a communications link breakdown between two High Availability load balancers, which means that the load balancers are confused about who's got point to take the traffic (there are multiple of them to create high availability, so if one goes down the other takes over, but the short breaks in communication are leaving them temporarily confused, like people walking down a sidewalk and then doing the dance where they both go to the same side... but a million times faster). Our engineers are continuing to work on this problem and we'll provide an update as soon as we can.
[Update 7: July 30, 22:45 UTC] - 15 minutes after our last update, the load balancers became even more problematic, with each of them thinking it was in control; unfortunately, the time between this occurring and services being restored and flushed through the network meant we had a hard outage of 40 minutes - we're very sorry for the inconvenience this undoubtedly caused. To mitigate this problem we've reverted to a single load balancing front end while our engineers work out what is going on in the communications between them. We'll update you as soon as we know more.
[Update 8: July 31, 01:30 UTC] - after our 40-minute outage, services remained inconsistent because the load balancers were constantly competing to hog/own all of the services; after an hour of trying to drop load balancer servers out of the pool and reconnect them (to avoid the overloading/greedy thing) we decided to bring down the load balancer array hard to clear the contention, confusion and conflict between the load balancers in our HA setup. Thankfully this outage lasted for less than 2 minutes and resulted in all services returning to normal (finally) by 01:30 UTC (11:30am Sydney, 9:30pm New York, 6:30pm San Francisco).
[Update 9: July 31, 16:30 UTC] - after the headaches of the last couple of weeks, things have been clear and solid since the restart of the load balancer array yesterday. We're still doing further investigative work into the root cause, but preliminary signs suggest that the load balancers (which had remained unaffected during the problems with our other hosts) suffered the same APIC-related lockup issues, causing intermittent network connection outages (in this case on the public and private interfaces), which in turn caused the load balancers to become confused as to which load balancer in the array should be "in charge" (since one would see the other disappear and think it was boss, and the one that DID actually disappear would see that it couldn't see its buddies and think, "no, I'm the boss", and when it came back, chaos would ensue).
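For the technically curious, the "who's the boss" confusion described above can be sketched in a few lines of Python. This is only an illustrative toy model, not our actual load balancer software: the `Balancer` class, the `priority` field and the rule "take over if you can't hear your peer" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Balancer:
    name: str
    priority: int  # hypothetical: the higher priority wins when both can see each other


def claims_active(me: Balancer, peer: Balancer, link_up: bool) -> bool:
    """Return True if `me` believes it should be the one taking traffic."""
    if not link_up:
        # No heartbeat from the peer: assume it has died and take over.
        return True
    return me.priority > peer.priority


def active_nodes(a: Balancer, b: Balancer, link_up: bool) -> list:
    """List the balancers that currently think they are in charge."""
    pairs = ((a, b), (b, a))
    return [me.name for me, peer in pairs if claims_active(me, peer, link_up)]


lb1 = Balancer("lb1", priority=200)
lb2 = Balancer("lb2", priority=100)

print(active_nodes(lb1, lb2, link_up=True))   # one boss
print(active_nodes(lb1, lb2, link_up=False))  # both think they are the boss
```

With the link between them healthy, exactly one balancer claims the traffic. When the link drops but both machines are still running, each one concludes the other has gone away, and both claim to be in charge at the same time.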
The reality is that it takes quite a lot of servers doing quite specialized jobs to make Accelo work; servers tuned to do things like load dynamic pages, query databases, send and receive emails, synchronize with your calendar and a multitude of other things Accelo does millions and millions of times every day for users all over the world.
Since we're dealing with computers and software, we've come to expect things to fail, so instead of relying on a single big machine to do everything, we've adopted an infrastructure model that allows us to scale "horizontally". This is geek speak for basically saying we can add more and more servers for any given role we have as we grow, and if things start to get hot in one area - like web servers handling requests and running code - we can add more of them pretty easily without having to change everything else (like database servers, which don't like change nearly as much).
Most of the time, this works really, really well. We can (and do) have servers and services fail, and while we're not yet big enough to have everything heal automatically, we are normally pretty solid and we have 24/7 monitoring which sends an SMS (followed by a phone call) to engineering if we have more than 2 minutes of downtime. Thankfully, our setup and redundancy mean we don't get many SMS messages or calls; we had uptime of 99.886% in June, for example, and the outage periods that made up the 0.114% of downtime were mostly planned upgrades we time to happen when you're asleep, skiing or at the beach.
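To put that percentage into more familiar units, here is a small worked calculation in Python. The helper name `downtime_minutes` is just something made up for this example; the only inputs are the 99.886% figure above and the fact that June has 30 days.

```python
def downtime_minutes(uptime_percent: float, days_in_month: int) -> float:
    """Convert a monthly uptime percentage into minutes of downtime."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - uptime_percent) / 100.0


june_downtime = downtime_minutes(99.886, 30)
print(f"{june_downtime:.1f} minutes of downtime in June")  # roughly 49 minutes
```

In other words, 0.114% of a 30-day month works out to a little over 49 minutes in total.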
To make sure things are secure, though, we have our collection of these specialist servers operating out of public sight on a virtual network which only they are able to see and use. All traffic comes in via our firewall and load balancing machines, which then route very specific traffic to a select group of servers tasked with certain jobs (like the mail or web servers mentioned above), and they in turn call on other servers like database servers to give them the information they need to answer your request, all in a fraction of a second. It looks a bit like the diagram below.
Unfortunately, there's still one place where things can go wrong, somewhere fairly unusual. The private network connecting all of the servers is the backbone of all the data interchange, and it is expected to work flawlessly (mainly because it has no moving parts and is built on super tried and tested technology which has been around for as long as the internet).
As you've probably guessed, it is this part of our infrastructure that is intermittently letting us down - for periods from 15 seconds through to a few minutes, the internal network is blocking access to one of the servers in our cluster. It is completely random, with the issue not appearing for a day or more and then flooding in more than a dozen times in an hour. It comes on when things are busy, as well as in the middle of the night (US time) on a Saturday when (most of) our Aussie and European clients are chilling on a Sunday and load is at its lowest.
When that access involves one of our web or front end machines you'll see a slow load time while the load balancers realize it is unavailable and route traffic to other servers (sometimes causing the remaining connected front end servers to run hotter and thus return pages more slowly). When it involves something more important like the network connection to a database or directory server, you might notice a 500 or other error message.
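As a rough illustration of why the remaining front end servers "run hotter", here is a toy Python sketch of a load balancer that only sends traffic to servers passing a health check. The server names, the `is_healthy` dictionary and the even split of traffic are assumptions made for the example, not a description of our real configuration.

```python
def healthy_backends(backends, is_healthy):
    """Return only the servers the load balancer currently considers reachable."""
    return [name for name in backends if is_healthy[name]]


def share_per_backend(backends, is_healthy):
    """Fraction of traffic each reachable server receives under an even split."""
    available = healthy_backends(backends, is_healthy)
    if not available:
        # With nothing reachable, the request cannot be served at all.
        raise RuntimeError("no healthy backends")
    return 1.0 / len(available)


web_servers = ["web1", "web2", "web3", "web4"]  # hypothetical server names
status = {name: True for name in web_servers}

print(share_per_backend(web_servers, status))  # each server takes a quarter

status["web3"] = False  # the internal network stops reaching web3
print(healthy_backends(web_servers, status))
print(share_per_backend(web_servers, status))  # each remaining server takes a third
```

Losing one of four servers moves each remaining server from a quarter of the traffic to a third, which is why pages can come back more slowly even though nothing is technically "down".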
This issue has only been affecting us for the last week, and we have a team of engineers from our upstream suppliers working on it now. It is proving very difficult to diagnose because the network drops out quickly and without warning, and the servers themselves are all online, happy and green lights when there's a problem - they're just confused/surprised as to why all of a sudden no one is talking to them.
We'll be sure to keep you up to date - including with a follow-up "victory" blog post that I honestly can't wait to write - as we work to find and solve this problem (my bet is on a faulty network interface card, but we'll see). We're really sorry for the inconvenience this issue is causing you, your colleagues and your business - we're feeling it too (since we'd otherwise be putting everything into the exciting new features we're on the cusp of releasing).
Finally, while it is tempting to make a big change in underlying infrastructure and move to another infrastructure vendor, the reality is that running off half-cocked with these sorts of big, important systems tends to create more problems than it solves. If nothing else, please know that we're focused completely on resolving this issue (we'd all like more than 4 hours sleep a night too) and of course, we'd also be only too happy to answer any questions you have - please just email support@accelo.com and we'll get back to you as quickly as we can.