Postmortem: Web Stack Outage.
ISSUE SUMMARY
At around 14:00 EAT on 1 March 2023, users could not reach our website and were instead served HTTP 500 (Internal Server Error) responses. The outage lasted until 15:30 EAT, a downtime of 1 hour 30 minutes. The impact was significant: 60% of users could not access any of our services during the downtime. The root cause was that disk space on our server had reached a critical limit, with less than 30% free space.
TIMELINE
The issue was detected at 14:20 EAT.
It was noticed following a monitoring alert sent to my email showing that the server was down.
Following the alert, debugging commenced at 14:30 EAT. Further analysis of the monitoring alert showed that the server response time was far slower than optimal. I logged into the server and continued debugging from the server side.
The bug was identified at 15:15 EAT. I fixed it myself, as I was the first to respond to the monitoring alert and am also the software engineer directly managing the site.
ROOT CAUSE AND RESOLUTION
After logging into the server, I first confirmed that our NGINX server was still running. I pinged the IP address, and the ping time was greater than 500 ms, confirming the delayed response time. I then listed and analyzed the running processes and found that the free disk space on our server was below 30%, far below what is required for optimal performance.
The issue was fixed by killing most of the processes that were consuming space and were not needed at the time, thereby freeing enough space for our server to run efficiently.
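The free-space check described above can be sketched in a few lines of Python. This is only an illustration, not the command actually run during the incident: the function name `free_space_percent` and the choice of `/` as the path to inspect are assumptions.

```python
import shutil


def free_space_percent(path="/"):
    """Return the free disk space, as a percentage, for the filesystem holding `path`."""
    # shutil.disk_usage returns a named tuple with total, used and free bytes.
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


if __name__ == "__main__":
    # "/" is an assumed mount point; replace it with the volume that serves the site.
    print(f"Free space on /: {free_space_percent('/'):.1f}%")
```

A value under 30 from this check would correspond to the critical condition seen during the outage.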
CORRECTIVE AND PREVENTIVE MEASURES
As a preventive measure, Datadog, the monitoring software we use for our server, was updated to monitor disk space and to send an alert once the available disk space falls below 60%.
At the time of this downtime, only one web server was serving our site, constituting a single point of failure (SPOF). A good preventive measure, and our recommendation, is therefore to use a load balancer to distribute traffic across multiple servers, preventing total downtime when one server is down.
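The threshold rule behind such an alert is simple enough to show in code. The sketch below is not Datadog configuration and does not use any Datadog API; it is a plain Python illustration of the rule, with the function name `disk_alert` and the message format chosen here as assumptions. Only the 60% threshold comes from the measure described above.

```python
def disk_alert(free_bytes, total_bytes, threshold_percent=60.0):
    """Return an alert message if free space is below the threshold, else None."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_percent = free_bytes / total_bytes * 100
    if free_percent < threshold_percent:
        return (
            f"ALERT: free disk space {free_percent:.1f}% "
            f"is below {threshold_percent:.1f}%"
        )
    return None  # enough free space, nothing to report


if __name__ == "__main__":
    # Hypothetical numbers: 25 GB free out of 100 GB triggers the alert.
    print(disk_alert(25 * 10**9, 100 * 10**9))
```

Alerting at 60% rather than at the 30% critical level leaves time to clean up before performance degrades.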
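To illustrate why a load balancer removes the single point of failure, here is a minimal round-robin sketch in Python that skips servers marked as down. It is a toy model under stated assumptions, not a replacement for a real load balancer: the class `RoundRobinBalancer` and the server names `web-01` and `web-02` are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that are marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Try each server at most once per call before giving up.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web-01", "web-02"])  # hypothetical server names
    lb.mark_down("web-01")  # simulate one server failing
    print(lb.next_server())  # traffic still reaches web-02
```

With two servers behind the balancer, losing one degrades capacity but does not take the site down entirely.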