One Step Away from Embarrassment

The Integration Gig

This is yet another story of how an incommunicado modus operandi leads to broken builds, shattered dreams, and embarrassment.

A FinTech business had several teams working on its automated platform: a dedicated team provided by SPG, a cross-functional distributed team of external specialists, and a technical management team, which was co-located with the customer. The latter was supervised directly by the CTO and included DevOps engineers who were responsible for the infrastructure and knew all the ins and outs of the platform. The SPG team was working on an integration with one of the US banks. The challenging bit was that the API we needed to integrate with was being developed in parallel. Moreover, our client was one step ahead with the development, which put us in a situation where the bank’s API was still quite raw, and a lot of changes were regularly introduced. This, in turn, required a lot of adjustments from our side.

Our development process was based on utilising the following environments:

  • Developers’ workstations
  • An AWS-based development environment with the back-end, mainly used by the front-end developers
  • Staging / UAT
  • Production

Our main environment for testing was an AWS-based development server. There, we had all the required permissions to run our CI/CD pipelines and deploy new versions.

As the deadline of our project was approaching, end-to-end testing and a demo were scheduled with the investors to present the integration with the bank’s API. It was time to rock’n’roll!

Setting the Stage

The entire platform used a microservice-based architecture. The new integration was also implemented as a set of services. Our team raised the corresponding tickets for DevOps to prepare the staging environment and deploy the new functionality. This was required before we could perform end-to-end testing.

Unfortunately for us, the DevOps team operated in a completely different time zone. By the time we started our working day, they were off to bed. As it happened, deployment didn’t go as planned, and there were various problems that required the collaboration of both teams. The testing got delayed until a series of infrastructure-related issues were resolved. Time was running out fast.

Near the deadline, our team attempted to deploy the most recent fixes to staging and discovered that our standard approach to deployment no longer worked. As it turned out, the day before, one of the DevOps engineers had (accidentally?) removed the pipeline configuration that allowed us to deploy to staging. No notes were left for our team as to why this was done or what the alternative was. So, we had to escalate this to the DevOps team and our customer. It was night-time for them, and everyone was sound asleep. As a result, we could not finalise our deployment.

All these minor delays, the running back and forth, and the hindered communication meant that, on the day of the demo, no one had run full-cycle, end-to-end testing of the integration in the staging environment. There was no guarantee it was working.

Looking back at these events, it’s fair to admit that the overall process lacked centralised governance. We did our bit, and someone else did theirs, but no one was in place to coordinate the joint efforts.

So, five minutes before the scheduled demo with the investors, the staging environment was still not ready. By that time, the DevOps team was finally at their workstations and had noticed our cries for help. We jumped on a call, and they initiated the deployment process. The meeting with the investors kicked off at the same time. While the attendees were going through the agenda and meeting objectives, recapping the statuses, and so on, we managed to deploy the vital fix and pass the message to the presenters. By that time, they had all the UI forms filled in and ready. So, as soon as we confirmed the successful deployment, the presenter hit the submit button. Everyone, including our entire team, was watching the spinning icon intently. Once the operation appeared to complete successfully, everyone sighed with relief. But our joy was premature. The request was not executed fully.

The reason behind this failure was that the configuration for one of the essential microservices was missing an environment variable with credentials. We had raised a ticket for this ages ago, but it was not dealt with in time.
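
A failure like this can often be caught before anyone hits the submit button. Below is a minimal sketch of a pre-flight check that a service or a deployment script could run at startup. The variable names are hypothetical, since the actual configuration of the service is not part of this story.

```python
import os

# Hypothetical variable names, used for illustration only.
REQUIRED_VARS = ("BANK_API_BASE_URL", "BANK_API_CLIENT_ID", "BANK_API_CLIENT_SECRET")


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def assert_env_ready(required=REQUIRED_VARS, env=None):
    """Fail fast with a readable message if any required variable is missing."""
    missing = missing_env_vars(required, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `assert_env_ready()` as the first step of a service's startup would turn a silent half-executed request into an immediate, clearly worded deployment failure.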

We had to excuse ourselves from the meeting, prepare another release to the staging, rejoin the call, and only then successfully demonstrate the entire sequence to the investors.

Then, we moved on to presenting another flow, which required some additional input from a user. The provided input had to be validated manually, and the entire flow had to be finalised by a member of staff at the bank. To detect that finalisation, our system had to check the status of the delayed request by running a cron job. Sure enough, this cron job was also absent from the staging environment. The staging environment was nowhere near ready for this demo. Thankfully, we could fall back on a semi-manual mode, which rescued the demo and averted utter embarrassment.
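
For readers unfamiliar with this pattern, here is a rough sketch of what a single run of such a status-checking job might do. It is not our actual implementation: the status names and the `fetch_status` callable, which stands in for a call to the bank's API, are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable

# Hypothetical status names; the bank's real API may use different ones.
FINAL_STATUSES = {"approved", "rejected"}


def poll_pending_requests(
    pending_ids: Iterable[str],
    fetch_status: Callable[[str], str],
) -> Dict[str, str]:
    """Check each pending request once and return those that reached a final status."""
    finalised = {}
    for request_id in pending_ids:
        status = fetch_status(request_id)
        if status in FINAL_STATUSES:
            finalised[request_id] = status
    return finalised
```

A scheduler such as cron would invoke this at a fixed interval. Without that schedule in place, a request finalised at the bank stays "pending" on the platform side, which is exactly what happened on staging.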

A couple of days later, another session was held, where all those flows were demonstrated once again. This time, everything ran smoothly in a fully automated mode, and nothing hindered the success of the demo.

Encore: Key Lessons and Reflections

The key takeaways for everyone involved were:
  • Assign someone who will take full ownership of cross-team coordination.
  • When a demo is scheduled, make sure that all required preparation steps, including environment configuration, are properly prioritised for each responsible party.
  • Account for time zone differences, as they affect how quickly fixes can be applied.
  • Interaction between the DevOps and development teams has to be planned more efficiently. Everyone has to ensure that tasks are not only raised but also implemented and verified.
  • Make sure to run at least one successful full testing session.
