Root Cause Analysis in IT

Episode 13: Corporate Fresher’s Survival Guide – The RCA Files — Autopsy of an Outage

Tagline: “It wasn’t a glitch. It was a legacy monster… hiding in plain sight.”


🎬 Scene 1: Monday Mourning ☁️💀

It was a Monday. The kind that starts with extra-strong coffee ☕ and silent tears. The Incident that hit on Friday still echoed through the empty Slack channels. No app access. No emails. Even the internal site went poof.

Janet (ITSM Analyst):
“How did this even happen? One upgrade. ONE. And we took down the whole East Coast.”

Mike (Director of Ops):
“RCA. We need one. By EOD. No excuses.” 😬

The war room was reopened. But this time it was not for firefighting. It was for the postmortem: the Root Cause Analysis.


⚙️ Scene 2: The Tools of Truth 🛠️🔍

Janet spun her chair like a true crime detective and opened the sacred vault — ServiceNow’s Problem Module.

She whispered to herself:
“Logs. Timelines. User reports. Let’s start the autopsy.”

🕵️ Evidence Board

  • 🔧 Patch update at 10:01 PM
  • 🔥 CPU spike at 10:03 PM
  • 💣 Database hung by 10:07 PM
  • 🧑‍💻 Users screaming in Reddit threads by 10:10 PM
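For readers who want to try this at home, here is a minimal sketch of how an evidence board like Janet's could be turned into an ordered timeline. It is not ServiceNow's API or anything Janet actually ran. The `EVIDENCE` list and the `build_timeline` helper are hypothetical names invented for illustration, and the events are deliberately entered out of order to show the sorting step.

```python
from datetime import datetime

# Events copied from the evidence board. The story gives no calendar date,
# so only the clock times are parsed (hypothetical data structure).
EVIDENCE = [
    ("Database hung", "10:07 PM"),
    ("Patch update", "10:01 PM"),
    ("Users reporting on Reddit", "10:10 PM"),
    ("CPU spike", "10:03 PM"),
]


def build_timeline(events):
    """Sort events by clock time and attach minutes elapsed since the first one."""
    parsed = sorted(
        (datetime.strptime(clock, "%I:%M %p"), label) for label, clock in events
    )
    start = parsed[0][0]
    return [
        (label, int((when - start).total_seconds() // 60))
        for when, label in parsed
    ]


timeline = build_timeline(EVIDENCE)
for label, minutes in timeline:
    print(f"T+{minutes:>2} min  {label}")
```

Laying events out as offsets from the first one makes the sequence easier to read: the patch comes first and everything else follows within ten minutes, which is what points the investigation at the patch window.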

One thread led to another. Then another. Until finally… 💡
“A legacy script written in 2014?” Janet gasped.
No documentation. No owner. Just 400 lines of spaghetti code still running in prod.


🧟 Scene 3: Legacy Strikes Back

Cliff (Old Dev, Retired in 2020):
“Oh yeah… I remember writing that. It was a quick fix for an old DB bug. Didn’t know it was still running.” 🧓💻

Janet:
“Well, Cliff… it just broke the internet.” 🤦‍♀️


🪦 Scene 4: RCA Report — The Funeral Document

Janet typed like a crime novelist on deadline. The RCA Report had to be:

  • 🔹 Clear
  • 🔹 Honest
  • 🔹 Blameless (HR watching 👀)
  • 🔹 Actionable

RCA Summary

  • Incident Trigger: Unvalidated legacy script triggered during patch
  • Root Cause: Lack of ownership + No documentation
  • Impact: 4 hours of downtime, 1M+ users affected
  • Fix: Script removed, codebase audited
  • Prevention: CMDB updates, legacy ownership reassigned
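The five sections of the summary above can also be treated as a small structured record, which makes it easy to catch a report that is filed with a section left blank. The sketch below assumes nothing about any real ITSM tool: `RcaSummary` and `missing_sections` are hypothetical names, and the `draft` values are made up to show an incomplete report.

```python
from dataclasses import dataclass, fields


@dataclass
class RcaSummary:
    """The five sections from the RCA Summary (hypothetical structure)."""
    incident_trigger: str
    root_cause: str
    impact: str
    fix: str
    prevention: str


def missing_sections(report):
    """Return the names of sections that are empty or whitespace only."""
    return [
        f.name for f in fields(report)
        if not getattr(report, f.name).strip()
    ]


# Values taken from the RCA Summary in the story.
report = RcaSummary(
    incident_trigger="Unvalidated legacy script triggered during patch",
    root_cause="Lack of ownership + no documentation",
    impact="4 hours of downtime, 1M+ users affected",
    fix="Script removed, codebase audited",
    prevention="CMDB updates, legacy ownership reassigned",
)

# An invented half-finished draft, for comparison.
draft = RcaSummary("Patch ran a legacy script", "", "4 hours of downtime", "", "  ")

print(missing_sections(report))
print(missing_sections(draft))
```

A check like this only confirms that every section has some text in it. Whether the report is clear, honest, blameless, and actionable still takes a human reviewer.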

Mike:
“Good job. Now let’s bury this thing properly.” 🪦


🎭 Scene 5: Reflection in the Breakroom

Janet to herself, sipping cold coffee:
“IT isn’t just about fixing issues. It’s about learning from the ghosts of past deployments.” 👻💻


🎯 Moral of the Story

RCA isn’t about blaming — it’s about understanding the root, trimming the weeds, and making sure it never happens again. Like CSI for servers. 🕵️‍♂️💼
