Communication Timeout: A Horror Story

We cannot stress enough the importance of communication in software development projects, especially for distributed or remote teams. This may sound like a cliché, but the word covers many meanings (and implications!). In the context of software development, what we really mean by good communication is:

  • informing the rest of the team about important changes/events
  • explaining our motivation and reasoning to others, which helps them make better technical decisions of their own
  • ensuring that the understanding of all participants aligns
  • clarifying mutual expectations…

This list could go on and on. Broken communication leads to mistakes. Mistakes lead to problems. Problems lead to misery.

This is a story about one of those miserable mornings when your working day does not begin with a cup of coffee. The first message our software engineer saw that morning was from our customer:

“Our production server is offline. Can someone check it out urgently please. My client just called me on my phone.”

The engineer broke out in a cold sweat. The reported problem manifested itself as a dull “This site can’t be reached. Connection Timeout.” message, which end users were seeing instead of the platform’s web page. The platform was used by various call centres whose business hours had already started. So, unable to do their jobs, a lot of people were staring at the ERR_CONNECTION_TIMED_OUT error in frustration.

There was no obvious explanation for this situation. The engineer’s emergency troubleshooting checklist was as follows:

  1. Are there any obvious issues on the server or in its logs?
  2. Do we have the latest version of the app deployed?
  3. Does the Apache web server have permissions to access the file system?
  4. Have we run out of disk space?
  5. Have we run out of RAM?
  6. Does Apache respond via the intranet and locally?
  7. Have any Apache configuration files changed?
  8. Is there a connection to the database, and is the DB server working at all?
  9. Is the DNS server working as it should?
  10. Are our hostnames resolving to the correct IP addresses?

None of the steps above identified a root cause. The engineer tried accessing the application by its direct URL and got the same connection timeout error. He then tried accessing the application via the intranet, and this time it responded just fine. A timeout from the outside combined with a healthy response from the inside meant the application itself was working and something in between was dropping traffic, which suggested taking a closer look at the firewall rules. To the engineer’s astonishment, the firewall rules on a production server that had been running like clockwork for many months had changed. But how did this happen?
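
The difference between a timeout and a refused connection is itself a clue: a firewall that silently drops packets typically produces a timeout, while a stopped web server usually produces an immediate refusal. Below is a minimal Python sketch of that check, not the script our engineer actually used. The function names `probe_tcp` and `diagnose` and the result labels are our own assumptions for illustration.

```python
import socket


def probe_tcp(host: str, port: int = 443, timeout: float = 3.0) -> str:
    """Try to open a TCP connection and classify the outcome.

    Returns one of: "open", "dns_failure", "timeout", "refused", "error".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        # The hostname could not be resolved at all.
        return "dns_failure"
    except (socket.timeout, TimeoutError):
        # No answer within the time limit: packets are probably being dropped.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on this port.
        return "refused"
    except OSError:
        # Anything else (no route to host, network unreachable, ...).
        return "error"


def diagnose(public_result: str, internal_result: str) -> str:
    """Turn two probe results (hypothetical labels from probe_tcp) into a hint."""
    if public_result == "open":
        return "Reachable from outside: look above the network layer."
    if public_result == "timeout" and internal_result == "open":
        return ("Reachable internally but timing out externally: "
                "check firewall or security group rules.")
    if public_result == "dns_failure":
        return "Hostname does not resolve: check DNS."
    if public_result == "refused":
        return ("Connection refused: check that the web server is running "
                "and listening on the port.")
    return "Inconclusive: continue down the checklist."
```

Running `probe_tcp` against the public hostname from outside and against the private address from inside the network, then passing both results to `diagnose`, would likely have pointed us at the firewall rules much earlier that morning.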

The customer had a large distributed team. His DevOps engineer worked in a different time zone and was sound asleep when our working hours began. The DevOps engineer had been doing some maintenance on the security groups. While working on our application’s instance, he slipped up and missed the step where port 443 had to be added to the whitelist. The result:

  • all inbound traffic to the application was blocked, which caused the connection timeouts
  • one of our platform’s instances was down for more than an hour, which equates to a lot of wasted working time when multiplied by the size of the call centre this instance was serving
  • our software engineer got more grey hair
  • the poor DevOps engineer had an even more miserable morning, explaining his mistake.

This situation could have been easily avoided if the following simple actions had been taken:

  1. Run a simple smoke test once your job is done: spend a minute checking that the application’s instance is working properly.
  2. Inform the rest of the team of what you’ve changed. A simple “Hey guys, I’ve updated the firewall rules for our EC2” would have been sufficient to narrow down our frantic search for a root cause.
  3. Make sure you keep a checklist with all the required steps, and include the previous two items on it.
  4. Automate! Replace manual operations with a bash script. This does not take much effort but massively reduces the chances of an important step being missed.
  5. As an extra safety measure, add some simple health check tools to the service.
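
Items 1, 2 and 4 can be combined into one small script. The list above suggests bash; the sketch below uses Python instead, purely for illustration. It rests on several assumptions: that port 443 is the only port the application needs, that the planned allow-list is available as a plain list of port numbers, and that `missing_ports`, `smoke_test` and `change_notice` are hypothetical helper names of ours, not part of any real tool.

```python
import http.client
import urllib.request

# Assumption: HTTPS (443) is the only port the application needs open.
REQUIRED_PORTS = {443}


def missing_ports(allowed_ports, required_ports=REQUIRED_PORTS):
    """Return the required ports that are absent from the planned allow-list."""
    return sorted(set(required_ports) - set(allowed_ports))


def smoke_test(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (OSError, http.client.HTTPException):
        # Timeouts, refused connections and HTTP 4xx/5xx responses
        # (urllib raises HTTPError for those) all count as a failed check.
        return False


def change_notice(what: str, check_passed: bool) -> str:
    """Build the one-line message to post in the team channel."""
    status = "smoke test passed" if check_passed else "SMOKE TEST FAILED"
    return f"Heads-up: {what} ({status})"
```

A maintenance script could call `missing_ports` on the new rule set before applying it and refuse to continue if the result is not empty, then run `smoke_test` against the application’s URL and post the output of `change_notice` to the team. In our story, `missing_ports([22, 80])` would have returned `[443]` before any call centre noticed.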

Or follow a single universal rule: communicate!
