It started like any other deploy.
A version bump. Nothing fancy. The kind of change that slips into prod without fanfare.
We had tests. Dev looked good. It was reviewed. Merged. ✅
And for a while… everything was fine.
Then the PagerDuty alert went off.
Then Slack lit up with:
“Checkout’s not loading.”
“The screen is blank.”
“Apple Pay’s broken.”
And that’s when my heart sank.
Because this wasn’t just a bug.
This was the entire payment flow going down.
What made it worse?
It was my first real production outage.
I’d read about outages before. I’d seen postmortems.
But nothing prepares you for what it feels like when your code takes down checkout.
When you're on an incident bridge, adrenaline kicking in, trying to act calm while your mind is racing.
If you’ve ever worked on a checkout team, you know how high the stakes are.
If you haven’t - trust me, this is the kind of breakage that makes your stomach drop.
This isn’t a deep technical write-up. It’s a story.
About what happened, what I learned, and what I wish I knew going in.
So when your moment comes (and it will), maybe you’ll feel a little more ready than I did.
🧨 Part 1: What Changed (And What Went Wrong)
For the readers who don’t know me personally or what I do to pay my bills: I work on the web side of payments, maintaining internal libraries that power checkout flows across different products.
It’s a space that lives quietly in the background when everything’s working… and turns into absolute chaos when something breaks.
Recently, we had to update the Apple Pay SDK.
The version we were using had been deprecated by the Apple team, and the new one introduced a few changes to the API contract - mostly around how inputs were passed during initialization.
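The exact details don’t matter much here, but to give a flavour, imagine a change along these lines. Every name below is invented and the shape of the change is my simplification - I’m not reproducing the real SDK’s API:

```typescript
// Hypothetical sketch only - invented names, not the real Apple Pay SDK API.
// It illustrates the kind of migration we did: positional arguments during
// initialization replaced by a single options object.

type OnAuthorized = (token: string) => void;

interface InitOptions {
  merchantId: string;
  environment: 'sandbox' | 'production';
  onAuthorized: OnAuthorized;
}

// New-style initializer, stubbed out so the sketch runs on its own.
function initPaymentSdk(options: InitOptions): void {
  console.log(`init for ${options.merchantId} (${options.environment})`);
}

const handleToken: OnAuthorized = (token) => {
  console.log('payment authorized with token', token);
};

// Before (deprecated), roughly: init('merchant.example', 'production', handleToken)
// After: one options object, which is what the version bump migrated call sites to.
initPaymentSdk({
  merchantId: 'merchant.example',
  environment: 'production',
  onAuthorized: handleToken,
});
```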
We made the necessary updates, tested everything on dev, and it all looked solid. ✅
So we merged it and shipped it.
A while later, we got alerted about a spike in Apple Pay errors.
At first, it felt like a small fire.
But as we started digging in, it quickly escalated.
We discovered that the updated SDK relied on a function that wasn’t supported in certain older browsers. That resulted in a runtime error early in the flow - so early that it broke the entire checkout experience.
No Apple Pay.
No other payment methods.
Just a blank screen.
⚠️ First Things First - Don’t Start with Debugging
Before we go into what we found, here’s one thing I learned that I’ll never forget:
When you're facing a production outage, your first instinct shouldn’t be to start debugging.
That’s the mistake I almost made.
Because when your code is involved, your brain goes:
“Okay, where’s the bug? What broke? Let’s fix it.”
But here’s the thing: debugging takes time.
And while you’re digging through logs and testing theories, real users are getting blocked.
Revenue is dropping. The company is losing money. And most importantly, people are having a bad experience.
Unless you know the fix and you're confident it's safe to ship fast, the best move is this:
👉 Revert the commit. Redeploy. Stabilize.
It might not feel heroic, but trust me - it's the most responsible thing you can do.
The goal isn’t to debug under pressure. The goal is to stop the bleeding.
And then, when the incident is over, you can figure out what actually went wrong.
And that’s exactly what we did.
We rolled back the commit.
The blank screen went away. Checkout came back. Users could pay again.
The incident bridge went quiet.
PagerDuty stopped yelling.
But the work wasn’t over.
Now it was on us to figure out exactly what broke and why.
🛠️ Part 2: The Debug Spiral - What We Found
Once things were stable again, it was time to figure out what had actually happened.
The incident bridge was done. The pressure had dropped.
But the ownership was still on us to make sure we knew exactly why checkout had broken, and how to prevent it from ever happening again.
So we started going through logs, testing browsers, and trying to reproduce the issue in a controlled way.
At first, it was confusing.
There was no visible crash.
No red banners. No clean stack trace on screen. Just… silence.
The kind that’s somehow more unsettling than an obvious error.
Eventually, we spotted it: a runtime error coming from deep inside the updated Apple Pay SDK.
It was trying to use a method that wasn’t supported in some older browser environments.
That’s when it clicked:
Of course it was happening on Safari.
Apple Pay is only supported on Safari, so if something in the SDK was going to fail, that’s exactly where it would show up.
And that’s what was happening here.
In certain versions of Safari, this one method wasn’t available.
And since the SDK tried to use it during initialization, it caused a hard crash.
Not just for Apple Pay.
But for everything.
Because the failure happened at the top level of our checkout bootup, no other payment methods loaded either.
No UI. No error message. Just a blank screen.
The kind of bug that slips past on dev.
And punches you right in prod.
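To make the failure shape concrete, here’s a minimal sketch. This is not our real code, and the missing API in it (structuredClone, which older Safari releases don’t have) is only a stand-in for whatever the SDK actually called:

```typescript
// A sketch of the failure shape - invented names, and structuredClone is just
// a stand-in for the real missing method.

function initThirdPartySdk(config: { merchantId: string }): void {
  // In a browser that never defined this global, the next line throws a
  // ReferenceError before anything else gets a chance to run.
  const safeConfig = structuredClone(config);
  console.log('wallet SDK ready for', safeConfig.merchantId);
}

function renderPaymentMethods(): void {
  console.log('rendering card, wallet and bank options…');
}

function bootCheckout(): void {
  // Everything ran at the top level of checkout bootup, so one throw here
  // meant nothing after it executed: no Apple Pay, no cards, no UI at all.
  initThirdPartySdk({ merchantId: 'merchant.example' });
  renderPaymentMethods(); // never reached on the affected Safari versions
}

bootCheckout();
```

On a modern browser this logs and renders as expected, which is presumably why dev testing and review never flagged it; on an old enough Safari, the very first SDK call throws and nothing after it runs.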
🧠 Part 3: It's Not Just About the Fix - It's About Ownership
The bug was found. The rollback was done. Users could pay again.
But I knew my job wasn’t over.
Because being a good engineer doesn’t stop at unblocking users.
It starts when you step back and ask:
What did this break mean for the user?
How much revenue did we potentially lose?
What did this outage cost the business?
Could this have been caught earlier?
Will it happen again?
We’re not just here to respond to alerts and patch code like we’re on an assembly line.
We’re not coders you can prompt like ChatGPT and expect the problem to disappear.
We’re here to think deeply, own outcomes, and prevent future damage.
So I started writing.
I documented everything:
- What we changed
- Why we changed it
- What failed (and where)
- How it slipped past QA
- What we missed in our monitoring
- And what browser/version combos were affected
Not because someone told me to, but because future-me (or future teammates) might run into something similar, and I want them to have answers faster than I did.
💡 What This Taught Me - and What I Want to Push For
This outage didn’t just teach me what went wrong.
It made me pause and ask: what would I do differently if this happened again?
So I started jotting down ideas.
Not all of them are done yet. Some are in progress.
But they’re now on my radar - and they weren’t before this incident.
Here’s what I’m exploring post-outage:
🧪 Introducing browser-specific test cases, especially for critical SDK integrations (a rough test setup sketch is at the end of this section)
🛡️ Adding guard clauses around third-party SDKs to prevent one failure from taking everything down (sketched right after this list)
📉 Revisiting alert thresholds and adding more granular monitors (especially for visual regressions and blank states)
📊 Surfacing “what changed in this release” summaries to make incident triage faster
📚 Starting a habit of writing lightweight incident summaries for internal sharing and future reference
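To show what I mean by the guard-clause idea, here’s a sketch with invented names and a stand-in SDK call - not our actual wrapper:

```typescript
// Sketch of guarding a third-party SDK - invented names, stand-in checks.
// Two layers: feature-detect before touching the SDK, and isolate its init in
// try/catch so a wallet failure degrades to "no wallet button", not "blank page".

interface PaymentMethod {
  name: string;
  mount: () => void;
}

// Stand-in for the third-party init that crashed us (the real one isn't shown here).
function initWalletSdk(): PaymentMethod {
  return { name: 'apple-pay', mount: () => console.log('Apple Pay button mounted') };
}

function tryLoadWallet(): PaymentMethod | null {
  // Guard 1: feature-detect instead of assuming newer browser APIs exist.
  // (For Apple Pay specifically, you'd also check for window.ApplePaySession.)
  if (typeof structuredClone !== 'function') {
    console.warn('Wallet skipped: required browser API is missing');
    return null;
  }
  try {
    // Guard 2: even if detection passes, never let SDK init crash the whole boot.
    return initWalletSdk();
  } catch (err) {
    console.error('Wallet init failed, continuing without it', err);
    return null;
  }
}

function bootCheckout(): void {
  const methods: PaymentMethod[] = [
    { name: 'card', mount: () => console.log('Card form mounted') },
  ];

  const wallet = tryLoadWallet();
  if (wallet) methods.push(wallet);

  // Checkout renders whatever survived - one broken integration no longer
  // takes the other payment methods down with it.
  methods.forEach((method) => method.mount());
}

bootCheckout();
```

The design goal is simple: a wallet integration failing should mean a missing wallet button, never a missing checkout.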
None of these ideas are silver bullets.
But they’re the kinds of things I now know matter, and they’re on my roadmap going forward.
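For the browser-specific tests, the shape I’m exploring is a cross-engine matrix - something like a Playwright config. This is a sketch with hypothetical paths, not our real setup:

```typescript
// playwright.config.ts - run the same checkout smoke tests on every engine,
// including WebKit, since that's where this bug lived.
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './tests/checkout', // hypothetical path
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    // The project that matters most for Apple Pay: WebKit is the closest thing
    // to Safari you can run in CI without real devices.
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

Paired with even one blank-screen smoke test - again a sketch, with a hypothetical URL and test id:

```typescript
// tests/checkout/smoke.spec.ts - the blank-screen detector we were missing.
import { test, expect } from '@playwright/test';

test('checkout renders at least one payment method', async ({ page }) => {
  await page.goto('https://checkout.example.test'); // hypothetical URL
  await expect(page.getByTestId('payment-methods')).toBeVisible();
});
```

CI WebKit isn’t the same as every older Safari release in the wild, so this alone wouldn’t have guaranteed a catch - real version coverage still needs something like a device cloud - but it would at least put the engine Apple Pay depends on in the loop for every change.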
🧘♀️ Part 4: Wrapping It All Up
If you’ve made it this far, thank you.
This wasn’t the easiest thing to write, but I wanted to put it out there because… no one tells you what your first production fire feels like.
It’s messy.
It’s stressful.
And sometimes, yes, it’s your code.
But you learn.
You learn what it really means to take ownership.
You learn to zoom out, think beyond “what broke”, and start asking “who did this impact?”
You learn that being a good engineer isn’t about writing perfect code - it’s about how you show up when things aren’t perfect.
I’m still learning. But one thing’s for sure: this incident taught me more than weeks of “normal” engineering ever could.
✅ A Checklist for When (Not If) It Happens to You
Here’s a simple list I wish I had in front of me when it all went down:
- 🔄 Don’t start debugging immediately - if the issue is real and users are impacted, revert first, then investigate
- 📣 Communicate early - even just “Looking into it” goes a long way in incident channels
- 🧪 Try reproducing the issue in controlled environments (same browser, device, etc.)
- 🔍 Check logs, network calls, error boundaries - don’t assume it’ll crash loudly
- 📉 Review your monitoring thresholds - are they tuned to catch subtle failures early?
- 📚 Document what happened, in plain language - not just for your team, but for future you
- 🧠 Think about business impact - how many users were affected, and for how long?
- ✍️ If you find patterns or takeaways, write them down - they’re gold in the next incident
In the end, I can just say this: breaking prod might not be a badge of honour - but surviving it? Definitely is.
If this story helped you - or reminded you of your own - you can always reach out.
Would love to connect with others who’ve been through the fire. 🔥
📍Twitter / X: smileguptaaa