At the beginning of September, our email service provider, SendGrid, blocked our ability to send email without notice or feedback for almost 30 hours. While it was the first time in the seven years we've been using SendGrid that something like this happened, this event caused significant inconvenience for our users.
During the outage, our engineering team explored a crash migration to another service provider. Given the lack of feedback from SendGrid, and their recent security issues (which appeared to be related), we didn't know how long we'd be blocked for, and thus we prioritized finding a replacement for our users. Through this period we evaluated over a dozen service providers, and narrowed down our selection set to two vendors.
A crash migration for something as critical as Accelo - which sends millions of emails a month on behalf of our clients - was not something to take lightly; it was much more desirable to be careful and thoughtful about the migration process and to undertake it in stages.
Once SendGrid restored service, we switched our posture to a thoughtful and gradual migration to another service provider to reduce risk and ensure high availability. Over the month and a half since the original outage, our team completed our vendor selection process and scoped out the migration process at a high level.
Then, on Thursday, October 15th at around 3 pm Pacific Time, SendGrid repeated the situation of September 1st. After once again getting no satisfactory response from SendGrid's support team and worrying that we'd be seeing a repeat of the 30-hour delay on deliveries from just over 6 weeks prior, our engineering team decided to undertake the crash migration.
Over the next eight hours, our infrastructure, security, and development teams undertook that crash migration to use MailGun as our replacement email service provider. This crash migration required a number of shortcuts to be taken, and some functionality - such as Email Event Tracking - was going to have to be added later. Given the urgency, our ability to test and validate the work was also going to be limited.
A little after midnight Pacific Time our team had successfully migrated a test deployment in our production infrastructure across to MailGun. We had completed initial tests, and were preparing to move production domains to MailGun to allow emails to be delivered in a timely fashion once again. Then, nine hours following their outage, SendGrid restored service.
Given the preferred approach of executing a higher quality migration - as opposed to the crash migration - and the need to coordinate some attributes like DMARC with existing clients, we have not yet made a wholesale transition from SendGrid. We do, however, have the ability to make a snap switch in production in the event of a future outage, and will continue to build out the parity of functionality between SendGrid and Mailgun over the coming weeks.
Additionally, we have identified one of the vectors that may have caused SendGrid to trigger their delivery blocks - where spam sent by an external sender to an address like support@yourcompany.com is delivered to Accelo and then relayed via SendGrid to internal users. In response, we have turned off click-tracking to help protect our customers.
We will be keeping our clients up-to-date with progress when we have completed the move to MailGun, but we felt it was important to share the steps we've already taken - and our ability to work around SendGrid's failures in the future - in the interim, now that they've gone from one outage in 7 years to two outages in 7 weeks.
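For readers curious what a "snap switch" between providers can look like in code, here is a minimal sketch. It is not our production implementation: the names Provider, FailoverMailer and ProviderUnavailable are hypothetical, and the send callables stand in for whatever vendor API client is actually used. It shows a manual switch of the active provider, plus an optional fall-through to the next provider when the active one refuses a message.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class ProviderUnavailable(Exception):
    """Raised by a provider wrapper when the vendor refuses or fails to accept mail."""


@dataclass
class Provider:
    name: str
    send: Callable[[dict], None]  # hypothetical wrapper around a vendor's API


@dataclass
class FailoverMailer:
    providers: List[Provider]  # ordered by preference
    active_index: int = 0
    log: List[str] = field(default_factory=list)

    def switch_to(self, name: str) -> None:
        """Manual snap switch: make the named provider the active one."""
        for i, provider in enumerate(self.providers):
            if provider.name == name:
                self.active_index = i
                return
        raise ValueError(f"unknown provider: {name}")

    def send(self, message: dict) -> str:
        """Try the active provider first, then the remaining ones in order."""
        n = len(self.providers)
        order = [self.providers[(self.active_index + k) % n] for k in range(n)]
        for provider in order:
            try:
                provider.send(message)
                self.log.append(f"{provider.name}: sent")
                return provider.name
            except ProviderUnavailable:
                self.log.append(f"{provider.name}: unavailable")
        raise ProviderUnavailable("all providers failed")
```

Keeping the switch explicit (switch_to) mirrors the manual decision described above, while the fall-through in send is an optional extra that a real deployment might prefer to leave off until both providers have feature parity.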
Update 1: Oct 27th @ 4pm San Francisco Time
Well, this post didn't stay in its original shape for long. This afternoon at around 13:30 Pacific Time SendGrid again had a delivery outage. This time, for the first time, they did email us, but again we have no visibility on when service will be restored. Fortunately, the work done on October 15th meant we were able to execute the crash migration and restore service to near real-time email deliverability in around 20 minutes. The MailGun platform does not yet have delivery, open and click tracking enabled and it also can't process quote signoffs in real time (these are still queuing at SendGrid and will be delivered later). We apologize for this inconvenience and will provide an update when SendGrid comes back online. Note that SPF records will not need to be adjusted as we've already updated the SPF pointers on our end, so continuing to use include:_spf.accelo.com will be all you need.
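If you'd like to confirm that your domain's SPF record already carries that include, the check is simple once you have the TXT record text (for example, copied from your DNS provider's console). The sketch below is only illustrative: spf_includes is a hypothetical helper name, it works on a record string you supply rather than performing a DNS lookup, and it does not follow nested includes or redirects.

```python
def spf_includes(record: str, domain: str) -> bool:
    """Return True if an SPF TXT record contains an include for `domain`."""
    terms = record.strip().split()
    # A valid SPF record starts with the version term "v=spf1".
    if not terms or terms[0].lower() != "v=spf1":
        return False
    wanted = f"include:{domain}".lower()
    # A mechanism may carry a qualifier prefix (+, -, ~, ?); strip it before comparing.
    return any(term.lstrip("+-~?").lower() == wanted for term in terms[1:])


if __name__ == "__main__":
    example = "v=spf1 include:_spf.accelo.com ~all"  # example record text
    print(spf_includes(example, "_spf.accelo.com"))
```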
Update 2: Oct 28th @ 12:30pm San Francisco Time
As expected, there have been a number of hiccups with the crash migration and we're still waiting for SendGrid to reinstate our account so we can iron out those hiccups and migrate properly as soon as possible. The main issue is that MailGun are re-writing the Sender email header of our emails, which results in Outlook in particular showing a strange email address. As an example, if your email address is john.smith@acme.com, then MailGun is making the Sender address john.smith=acme.com@accelo.com. Messy and very annoying. We've reached out to their support team and are waiting to hear back on a setting, not available in the UI, that could be changed to stop this behavior. We also have a code change going live in the next few hours which will specifically disable DKIM signing of emails, in the hope that the DKIM feature is what is causing the Sender address rewrite. We're also continuing to work with SendGrid to get account access restored (see Update 3).
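To make the pattern concrete, the sketch below reproduces the shape of the rewritten address from the example above and shows how such an address can be recognized and mapped back. This mirrors only the single example given here; it is not MailGun's actual algorithm, and both function names are hypothetical.

```python
from typing import Optional


def rewritten_sender(original: str, relay_domain: str) -> str:
    """Build an address of the form local=domain@relay_domain."""
    local, _, domain = original.rpartition("@")
    return f"{local}={domain}@{relay_domain}"


def recover_original(sender: str, relay_domain: str) -> Optional[str]:
    """Return the original address if `sender` matches the rewritten form, else None."""
    local, _, domain = sender.rpartition("@")
    if domain.lower() != relay_domain.lower() or "=" not in local:
        return None
    # Split on the last "=" so a local part containing "=" is kept intact.
    orig_local, _, orig_domain = local.rpartition("=")
    if not orig_local or not orig_domain:
        return None
    return f"{orig_local}@{orig_domain}"
```

A check like recover_original could be used in a log scan to count how many outgoing messages still carry the rewritten Sender form.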
Update 3: Oct 28th @ 12:45pm San Francisco Time
The cause of SendGrid's block has been identified, and unfortunately it is pretty clear at this point that they have blamed our account incorrectly for the behavior of a spammer. The short version of events is that a spammer used a trial of Accelo to send a single test email to themselves with a bad link contained in it. The spammer then copied and pasted this link and sent out their spam using their own methods (it looks like it went via SendGrid, but didn't use our SendGrid account). When victims received this spam and clicked on the link they were brought back to SendGrid's servers (using the link tracking feature), which associated the link with our account (even though our account didn't send the emails - probably an impersonation security hole that SendGrid should close on their end), and then some recipients rightly flagged these emails as being phishing/malware/bad. It looks like SendGrid then blocked our account for emails our account didn't even send - and we're still waiting for them to realize their error and reinstate access. While this clarification around the reason for putting an account on hold is much better than what we had received previously, the fact that they've blocked an account which didn't send the bad emails, and that it fell to us to work it out (with fewer tools and logs available than they have into their own systems), is pretty concerning. We will continue to provide updates as we get them.
Update 4: Oct 28th @ 11:45pm San Francisco Time
Our team have made a number of improvements to the crash migration to MailGun through the course of today. From around 6:30pm SF time we implemented a change to key signing configurations with MailGun which has resulted in far fewer emails with the unwelcome john.smith=acmeinc.com@accelo.com email sender header (and attendant ugliness in Outlook as well as potential spam warnings in Gmail). There were still some cases over the last 5 hours where these emails were being sent with incorrect headers, and our team has gotten to the bottom of these, so all emails sent via MailGun will now be using accurate Sender headers. Additionally, a number of email sending pathways - e.g., autoreplies on requests, notifications on task or milestone activation, assignment, etc. - were still being sent only via SendGrid, which meant they have been delayed while the SendGrid outage continues. As of now, however, these emails are also being sent via MailGun and delivered in a timely fashion again. As a reminder, emails that were sent via SendGrid previously but which haven't been delivered are still queued and we expect they'll be delivered when SendGrid restore service. On the SendGrid front, the last we heard from their support team was around 24 hours ago (!!!) and while we've provided all of the information to show their mistake in extensive detail, there's no news to share yet re: service restoration.
Update 5: Oct 29th @ 5:30am San Francisco Time
SendGrid have restored our account and we have switched our email service back to them (for now). Our highest engineering priority is a non-crash migration away from SendGrid, and we'll have more plans to share on this front in the next few days. We sincerely apologize to our clients for this incredibly frustrating situation: I can assure you we're 10x as furious as you are, and are highly motivated to put this episode behind us by moving away from SendGrid as quickly as we can (without causing our users any more pain by rushing the migration).