🔁 Building a Self-Healing CI/CD Pipeline on GitLab
Auto-resume from stuck jobs. Improve resilience. Save time.
🤔 The Problem: Flaky Pipelines, Delayed Delivery
If you've worked with CI/CD in GitLab long enough, you've likely run into this:
- Pipelines hang or fail midway because a GitLab runner disconnects.
- A flaky environment fails a single job, and the entire pipeline has to be rerun.
- No checkpointing, no fallback... just re-run from the top.
These issues waste time, delay releases, and pull engineers away from focused work.
So I built something to fix it.
🔧 Introducing: GitLab Self-Healing Pipeline Framework
A fully open-source system that allows your pipelines to automatically resume from the last successful stage — without manual intervention.
👉 GitHub: gThiru/gitlab-self-healing-pipeline
🧠 How It Works
- Each stage of your GitLab pipeline records progress to .ci-progress.json (shared volume)
- A Python watchdog script checks these files on a schedule (cron or Kubernetes)
- If a pipeline is stuck or timed out, it:
  - Reads the last successful stage
  - Triggers a new pipeline with RESUME_STAGE set to the next needed stage
  - Enforces retry limits and pipeline age cutoffs
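To make the steps above concrete, here is a minimal sketch of the watchdog's decision step. It is not the project's actual code: the stage order, the 30-minute timeout, the one-directory-per-pipeline layout, and the JSON keys (`pipeline_id`, `updated_at`, `stages`) are all assumptions made for illustration.

```python
import json
import time
from pathlib import Path
from typing import Dict, Optional

# Assumed stage order; a real setup would mirror the stages in .gitlab-ci.yml.
STAGES = ["build", "test", "deploy"]
# Assumed timeout before an unfinished pipeline is treated as stuck.
STUCK_AFTER_SECONDS = 30 * 60


def next_stage_to_resume(progress: dict) -> Optional[str]:
    """Return the first stage not marked "done", or None if all are done."""
    for stage in STAGES:
        if progress.get("stages", {}).get(stage) != "done":
            return stage
    return None


def is_stuck(progress: dict, now: float) -> bool:
    """Stuck means: unfinished, and no progress update within the timeout."""
    unfinished = next_stage_to_resume(progress) is not None
    return unfinished and (now - progress["updated_at"]) > STUCK_AFTER_SECONDS


def scan(progress_dir: Path, now: Optional[float] = None) -> Dict[str, str]:
    """Map pipeline id -> stage to resume, for every stuck pipeline found."""
    now = time.time() if now is None else now
    to_resume = {}
    # Hypothetical layout: <progress_dir>/<pipeline id>/.ci-progress.json
    for path in sorted(progress_dir.glob("*/.ci-progress.json")):
        progress = json.loads(path.read_text())
        if is_stuck(progress, now):
            to_resume[str(progress["pipeline_id"])] = next_stage_to_resume(progress)
    return to_resume
```

A cron entry or Kubernetes CronJob would call `scan()` on the shared mount and pass each result on to whatever triggers the new pipeline.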
🧰 Tech Stack
- 🟣 GitLab CI/CD
- 🐍 Python for watchdog
- 📂 Shared mount (e.g. NFS, EFS) across runners
- ⏱️ Linux cron or Kubernetes CronJob
- 🔐 Environment-safe with retry guardrails
🔍 Key Features
- ✅ Pipeline resumes from last good stage
- ✅ JSON-based per-pipeline tracking
- ✅ Retry limit + max age protection
- ✅ Works in hybrid GitLab runner setups
- ✅ Dev-friendly Bash helper to update stage status
- ✅ Linux + K8s CronJob support
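The retry limit and max age protection listed above can be pictured as a small gate in front of every resume. This is a sketch under assumptions, not the framework's real implementation: `PipelineRecord`, its fields, and both limits are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Tuple

MAX_RETRIES = 3               # assumed retry limit
MAX_AGE_SECONDS = 6 * 3600    # assumed age cutoff (6 hours)


@dataclass
class PipelineRecord:
    pipeline_id: int
    created_at: float  # epoch seconds when the original pipeline started
    retries: int       # number of resumes already triggered


def may_resume(record: PipelineRecord, now: float) -> Tuple[bool, str]:
    """Return (allowed, reason) so the watchdog can log why it skipped."""
    if record.retries >= MAX_RETRIES:
        return False, "retry limit reached"
    if now - record.created_at > MAX_AGE_SECONDS:
        return False, "pipeline too old"
    return True, "ok"
```

Returning a reason alongside the decision keeps the watchdog's logs useful when a pipeline is deliberately left alone.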
🏁 Quick Example
In .gitlab-ci.yml:
rules:
  - if: '$RESUME_STAGE == "test" || $RESUME_STAGE == ""'
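The rule above only has an effect if something starts a pipeline with RESUME_STAGE set. One way a watchdog could do that is GitLab's "create a new pipeline" REST endpoint (`POST /projects/:id/pipeline`), which accepts a `variables` list. The sketch below only builds the request; the helper name, host, and token handling are assumptions, and the actual project may use a different mechanism such as trigger tokens.

```python
import json
import urllib.parse
import urllib.request


def build_resume_request(base_url, project_id, ref, resume_stage, token):
    """Build (but do not send) a request that creates a pipeline with RESUME_STAGE set."""
    # Project paths such as "group/project" must be URL-encoded in the API path.
    project = urllib.parse.quote(str(project_id), safe="")
    url = f"{base_url.rstrip('/')}/api/v4/projects/{project}/pipeline"
    payload = {
        "ref": ref,
        "variables": [{"key": "RESUME_STAGE", "value": resume_stage}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen()` would send it; the token needs permission to create pipelines on the given ref.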
In the build job:
source ./update_stage_status.sh
update_stage_status build in_progress
# ... your build steps ...
update_stage_status build done
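For readers who want to see what a helper like this might do under the hood, here is a hypothetical Python equivalent. The real helper is a Bash script whose internals are not shown here, so the JSON keys and the write-then-rename approach are assumptions chosen for illustration.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def update_stage_status(progress_file: Path, stage: str, status: str) -> dict:
    """Record `status` for `stage` and refresh the last-update timestamp."""
    progress = {"stages": {}}
    if progress_file.exists():
        progress = json.loads(progress_file.read_text())
    progress.setdefault("stages", {})[stage] = status
    progress["updated_at"] = time.time()

    # Write to a temp file in the same directory, then rename it over the
    # target, so a watchdog reading concurrently should not see a
    # half-written JSON document.
    fd, tmp_name = tempfile.mkstemp(dir=progress_file.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(progress, tmp)
    os.replace(tmp_name, progress_file)
    return progress
```

The rename is atomic on local POSIX filesystems; on network mounts such as NFS or EFS the guarantees can differ, so treat this as a starting point rather than a proven pattern.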
🚀 OSS Ready
This project is:
- ✅ Released under MIT License
- ✅ Submitted to awesome-devops
- ✅ Includes full documentation, examples, templates
- 🏁 Ready for production use in GitLab-based orgs
🌱 What's Next
I'm planning to add:
- Job-level tracking (not just stage)
- Webhook-based watchdog
- S3/GCS support as an alternative to the shared mount
- Slack/email notifications on resume
🙌 Get Involved
👉 GitHub Repo
💫 Star it if you like it
💬 Open issues or suggest ideas
🤝 Contributions are welcome!
📢 Let’s build smarter pipelines, together.
— Thirunavukkarasu Ganesan
DevOps Manager / Architect