Email Service Provider SendGrid, which we use here at Accelo to deliver emails out of our application, has been grappling with a security issue for the past few days: weak or stolen passwords on some customer accounts have allowed those accounts to be used to deliver significant amounts of spam. More details are available from the excellent Brian Krebs.
Unfortunately, even though our account maintains a reputation score over 99% and uses a very strong SMTP password, SendGrid decided to block our Accelo delivery account this morning without any notice at all. Please be assured that this situation is occurring entirely outside of Accelo's infrastructure from a security perspective - all of our client data remains safe.
SendGrid's decision to suspend our account without notice, justification or response has caused emails delivered out of Accelo - such as timesheet reports, invoices and other activities - to be held up in SendGrid's servers. With this being the first day of the month, when our thousands of clients are working hard to do their monthly billing - and in the middle of a pandemic no less, when cash flow matters more than ever - the timing of SendGrid's action significantly compounds the pain caused by the incompetence of blocking an account in good standing without any warning.
We have raised numerous tickets this morning with the SendGrid support team, who have told us that they don't have the power to reinstate accounts, that it has to go through an escalation path/team, and that we should expect a callback in 6 hours. We are impatiently waiting (along with our clients) for them to correct their error and reinstate our account.
We're also in the market for a new Email Service Provider so that we aren't in this position in the future. Unfortunately, it isn't just a matter of changing emails to be delivered directly from our servers, as we make use of APIs on the SendGrid side which can't simply be "switched" to another vendor - so for this outage we're basically stuck waiting for SendGrid.
We sincerely apologize to our clients for the inconvenience caused by this vendor outage and will be updating this blog post when we have further information.
Update Sep 1, 14:45 San Francisco time, 6 hours into outage: SendGrid support finally called back, just to let us know that they were going to escalate the issue to their Compliance Team (which was the same promise from 5 hours ago). No ETA, no transparency, no demonstrable competency.
Update Sep 1, 15:30 San Francisco time, 6.75 hours into outage: SendGrid are still unable to provide any information or an ETA, but their support team have advised us to create a separate account (which makes a mockery of their claim that this is a Compliance measure). We're in the process of provisioning the additional account to allow new emails to be sent, but previously sent emails will continue to languish in the jail of SendGrid's "Compliance" team for an indeterminate period of time.
Update Sep 1, 21:30 San Francisco time, 12.75 hours into outage: SendGrid are still unable to provide any information beyond "we will escalate this", and have given no sense of an ETA. We're part way through the evaluation process for a replacement vendor (which normally takes weeks or months, but we're working on doing it in hours or days) and will continue to try to get something - anything - out of SendGrid while we move with urgency to a competent vendor. Our sincere apologies for this unbelievable situation; while technology fails us all from time to time, to have had this happen as a result of a human decision without any warning or response for more than 12 hours is truly incredible.
Update Sep 2, 07:45 San Francisco time, 23 hours into outage: SendGrid have still not responded in any way about this outage beyond L1 support saying "we've escalated it". Our engineering team have been working urgently (through the night SF time) on moving to an alternative vendor, and work continues on this track. We'll provide a further update when we have an ETA for moving vendors.
Update Sep 2, 13:10 San Francisco time, 28.5 hours into outage: while we still haven't heard anything from SendGrid, we are seeing signs of email delivery once again. We're not sure if this is an intermittent or permanent recovery. We will update here as soon as we know more, and the technical work associated with moving away from SendGrid continues.
Update Sep 8, 05:45 San Francisco time: by around 3pm last Wednesday (Sep 2nd) deliveries had caught up; we've been monitoring the situation closely and since then things have remained "normal". SendGrid did finally reply to us with some more detail about the issue on September 4th at 6:58am San Francisco time, outlining a scenario where a client of ours had been sent a "request" via email from an external/public sender which contained a phishing link, and as part of our delivery pipeline this had tripped SendGrid's automated systems. The fact that their automated systems didn't generate so much as a notice/warning to us (as their documentation says they do), that there was no information about the specifics when they did prematurely trip, and that it took almost 30 hours to correct the automated system and in total almost 70 hours to advise us at all is completely unacceptable. Since this outage last week we've been working hard on identifying a replacement vendor for SendGrid and have the field narrowed down as we complete our tests and PoC. Given the power that Accelo makes use of through an email service provider, getting the balance right between a careful and thorough evaluation with a thoughtful migration on one hand and an urgent "that outage was unacceptable and we need to move now in case they mess up again" response on the other is not easy. Given this is the only snafu of its nature in the many years we've been using SendGrid, we're opting more for the careful and thorough evaluation with a thoughtful migration, but we also feel more comfortable with being able to make a rapid move in the future should SendGrid fail us (and our clients) again.
Update Oct 15, 14:33 San Francisco time: SendGrid has again dropped the ball and our emails by deactivating our sending account without warning. Over the last month, our Engineering team has been investigating the best way to cleanly move to a replacement email service provider. SendGrid's repeated failures might force our hand to do a crash move instead. We sincerely apologize to our clients for the repeated inconvenience caused by this vendor outage and will continue to provide information as we have it.