HappyPress Outage Statement – June 2020

On Thursday 25th June and Friday 26th June, our hosting infrastructure (SMILE Cloud) suffered a service disruption that resulted in slow loading of customer data and downtime.

We have previously had an excellent track record of service uptime, and we have worked day and night to rectify this disruption. We’re heartbroken that we fell below the standards we need to meet our clients’ expectations. We’re devastated to have to publish this statement. We would like firstly to apologise and secondly to sincerely thank our customers for their patience and support whilst we address the issues.

On both Thursday and Friday, the incident occurred as a live event began on our Hijack platform, which runs on SMILE Cloud infrastructure. The outage began when a surge of traffic (event attendees) hit the web servers. Under normal traffic loads there were no problems, but problems appeared under heavier traffic. Whilst there were many moving parts in this incident, the main issue was a problem with our endpoint firewall.

We make every reasonable attempt to mitigate such outages. We have run online live events for many years without anything remotely similar to this issue. We have run events with larger attendance than we saw on Thursday and Friday, and we have run all of our events with the endpoint firewall enabled.

Unlike other components in our infrastructure, the endpoint firewall is given free rein to update itself. This means that our customers are protected in real time. A key difference between our last high-traffic event and the events where the incidents occurred is an update to the endpoint firewall on June 16, 2020.

In one of a series of tests conducted, disabling the firewall completely allowed high concurrent traffic to flow through the infrastructure without any problems. The endpoint firewall processes PHP (an underlying service) requests. We ran two load tests, both targeting a simple PHP-constructed page that is not cached by any system: a login page.

With the firewall turned on, the initial page response time is elevated, and increases quickly. Error rate increases with concurrent usage thereafter.

With the firewall turned off, the initial page response time is significantly lower. Even when it increases under load, it remains much faster. At the point of greatest difference, the response with the firewall turned on is 829% slower.
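For readers who want to see how a figure like this is worked out, here is a minimal sketch in Python. The function name and the response times are hypothetical and chosen only to illustrate the arithmetic; they are not our measured values.

```python
def percent_slower(time_on_ms: float, time_off_ms: float) -> float:
    """Return how much slower time_on_ms is than time_off_ms, as a percentage."""
    if time_off_ms <= 0:
        raise ValueError("baseline response time must be positive")
    # Extra time taken, expressed relative to the firewall-off baseline.
    return (time_on_ms - time_off_ms) * 100 / time_off_ms


# Hypothetical timings: 100 ms with the firewall off, 929 ms with it on.
slowdown = percent_slower(929, 100)
print(f"{slowdown:.0f}% slower")
```

Put another way, a response that is 829% slower takes roughly 9.3 times as long as the baseline.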

An error rate greater than 1% typically leads to system problems. In this case, it led to high load times, system timeouts and ultimately, application downtime. With the firewall turned on, we consistently noted ~10% error rates under high concurrent usage. 
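As an illustration of the threshold described above, the check might be sketched like this in Python. The function names and request counts are hypothetical; only the 1% threshold comes from this statement.

```python
ERROR_RATE_THRESHOLD = 0.01  # the 1% level referred to above


def error_rate(failed_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that failed."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return failed_requests / total_requests


def exceeds_threshold(failed_requests: int, total_requests: int,
                      threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    """Return True when the error rate is above the threshold."""
    return error_rate(failed_requests, total_requests) > threshold


# Hypothetical load-test result: 100 failures out of 1,000 requests (10%).
print(exceeds_threshold(100, 1000))
```

An error rate of around 10%, as in this hypothetical run, is roughly ten times the level at which problems typically start.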

As a result, we’ve already made changes. We have changed the amount of firewall logging data that is stored and the method used to store it. This data is now stored in a database rather than on disk, meaning that there is no load on metadata read times. On Saturday, we ran two simultaneous live events on the Hijack platform.

As a result of the changes we made, there was 0% downtime during the events that took place on Saturday.

Although our current configuration has been proven to mitigate the downtime that you have previously experienced, we are following up with our endpoint firewall provider for further technical details on their system. 

We hope that you will accept our sincerest apologies. Whilst the disruption was caused by components that are outside of our control, we are working methodically to prevent situations like this from happening again.

Yours sincerely,

Nathan and Matt,
Founders of SMILE
