🔁 Building a Self-Healing CI/CD Pipeline on GitLab

Auto-resume from stuck jobs. Improve resilience. Save time.


🤔 The Problem: Flaky Pipelines, Delayed Delivery

If you've worked with CI/CD in GitLab long enough, you've likely run into this:

  • Pipelines hang or fail midway because a GitLab runner disconnects.
  • A flaky environment fails a single job, and the entire pipeline has to be re-run.
  • No checkpointing, no fallback... just re-run from the top.

These issues waste time, delay releases, and break engineers' focus.

So I built something to fix it.


🔧 Introducing: GitLab Self-Healing Pipeline Framework

A fully open-source system that lets your pipelines automatically resume at the stage right after the last successful one, without manual intervention.

👉 GitHub: gThiru/gitlab-self-healing-pipeline


🧠 How It Works

  1. Each stage of your GitLab pipeline records its progress in .ci-progress.json on a shared volume.
  2. A Python watchdog script checks these files on a schedule (Linux cron or a Kubernetes CronJob).
  3. If a pipeline is stuck or timed out, it:
    • Reads the last successful stage
    • Triggers a new pipeline with RESUME_STAGE set to the next needed stage
    • Enforces retry limits and pipeline age cutoffs
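
The decision step above can be sketched in Python roughly like this. It is a minimal sketch, not the project's actual code: the JSON layout, the stage names, and the thresholds (STAGES, STUCK_AFTER_SECONDS, MAX_PIPELINE_AGE_SECONDS, MAX_RETRIES) are all assumptions made for illustration.

```python
import json
import time
from pathlib import Path

# Hypothetical stage order and limits; the real project may configure these differently.
STAGES = ["build", "test", "deploy"]
STUCK_AFTER_SECONDS = 30 * 60          # no update for 30 minutes => treat as stuck
MAX_PIPELINE_AGE_SECONDS = 24 * 3600   # never resume pipelines older than a day
MAX_RETRIES = 3


def next_stage_to_resume(progress, now):
    """Return the stage to resume from, or None if no resume is warranted.

    Assumed progress layout (not taken from the project):
    {"created_at": float, "updated_at": float, "retries": int,
     "stages": {"build": "done", "test": "in_progress"}}
    """
    if now - progress["created_at"] > MAX_PIPELINE_AGE_SECONDS:
        return None  # past the age cutoff
    if progress.get("retries", 0) >= MAX_RETRIES:
        return None  # retry limit reached
    if now - progress["updated_at"] < STUCK_AFTER_SECONDS:
        return None  # updated recently, so not considered stuck
    # The first stage that is not "done" is the one after the last success.
    for stage in STAGES:
        if progress["stages"].get(stage) != "done":
            return stage
    return None  # every stage finished


def scan(progress_dir, now=None):
    """Yield (path, stage) for each progress file that needs a resume."""
    now = time.time() if now is None else now
    for path in sorted(Path(progress_dir).glob("*.json")):
        progress = json.loads(path.read_text())
        stage = next_stage_to_resume(progress, now)
        if stage is not None:
            yield path, stage
```

Keeping the decision in a pure function that takes `now` as an argument makes the retry and age guardrails easy to unit-test without waiting for real timeouts.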

🧰 Tech Stack

  • 🟣 GitLab CI/CD
  • 🐍 Python for watchdog
  • 📂 Shared mount (e.g. NFS, EFS) across runners
  • ⏱️ Linux cron or Kubernetes CronJob
  • 🔐 Environment-safe with retry guardrails
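
To connect the Python watchdog to GitLab CI/CD, one option is GitLab's pipeline trigger endpoint (POST /projects/:id/trigger/pipeline), which accepts a trigger token, a ref, and variables. The sketch below only assumes that this is how the resume pipeline gets started; the function names, host, and project ID are hypothetical, and the project itself may use a different mechanism.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_resume_request(gitlab_url, project_id, trigger_token, ref, resume_stage):
    """Build (but do not send) a POST to GitLab's pipeline trigger endpoint."""
    url = f"{gitlab_url.rstrip('/')}/api/v4/projects/{project_id}/trigger/pipeline"
    form = urlencode({
        "token": trigger_token,                   # pipeline trigger token
        "ref": ref,                               # branch or tag to run on
        "variables[RESUME_STAGE]": resume_stage,  # read by the job rules
    }).encode()
    return Request(url, data=form, method="POST")


def trigger_resume(request, timeout=30):
    """Send the request; urlopen raises HTTPError on a 4xx/5xx response."""
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, `trigger_resume(build_resume_request("https://gitlab.example.com", 42, token, "main", "test"))` would ask GitLab to start a new pipeline on main with RESUME_STAGE set to test.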

🔍 Key Features

  • ✅ Pipeline resumes from last good stage
  • ✅ JSON-based per-pipeline tracking
  • ✅ Retry limit + max age protection
  • ✅ Works in hybrid GitLab runner setups
  • ✅ Dev-friendly Bash helper to update stage status
  • ✅ Linux + K8s CronJob support

🏁 Quick Example

In .gitlab-ci.yml:

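rules:
  # Run on a normal pipeline (RESUME_STAGE unset or empty) or when resuming at "test".
  - if: '$RESUME_STAGE == null || $RESUME_STAGE == "" || $RESUME_STAGE == "test"'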
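
If you have many stages, writing these conditions by hand gets error-prone. The Python sketch below shows one possible way to reason about them, under the assumption that a resumed pipeline should run the resume stage and everything after it. The stage list and both function names are hypothetical. The generated expression also checks `$RESUME_STAGE == null`, because GitLab treats an unset variable differently from an empty one.

```python
STAGES = ["build", "test", "deploy"]  # hypothetical stage order


def should_run(stage, resume_stage):
    """Return True if `stage` should run for the given RESUME_STAGE value.

    Raises ValueError if either name is not in STAGES.
    """
    if not resume_stage:  # normal pipeline: variable unset or empty
        return True
    return STAGES.index(stage) >= STAGES.index(resume_stage)


def rule_for(stage):
    """Render a GitLab `rules: if` expression equivalent to should_run."""
    # A stage runs when resuming at itself or at any earlier stage.
    allowed = STAGES[: STAGES.index(stage) + 1]
    clauses = ["$RESUME_STAGE == null", '$RESUME_STAGE == ""']
    clauses += [f'$RESUME_STAGE == "{name}"' for name in allowed]
    return " || ".join(clauses)
```

Under this assumption, the rule for a later stage lists every earlier stage as well, so resuming at build still runs test and deploy.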

In the build job:

source ./update_stage_status.sh
update_stage_status build in_progress
# ... your build steps ...
update_stage_status build done
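
To make the idea behind the helper concrete, here is a rough Python equivalent of what a status update might do to the progress file. This is an illustrative sketch only: the project ships a Bash helper, and the JSON fields used here (created_at, updated_at, retries, stages) are assumptions, not its documented format.

```python
import json
import os
import tempfile
import time


def update_stage_status(path, stage, status, now=None):
    """Record `status` for `stage` in the progress file at `path`."""
    now = time.time() if now is None else now
    try:
        with open(path) as handle:
            progress = json.load(handle)
    except FileNotFoundError:
        # First update for this pipeline: start a fresh record.
        progress = {"created_at": now, "retries": 0, "stages": {}}
    progress["stages"][stage] = status
    progress["updated_at"] = now
    # Write to a temp file in the same directory, then rename it into place,
    # so a reader does not see a half-written file where rename is atomic.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(progress, handle)
    os.replace(tmp_path, path)
    return progress
```

Refreshing `updated_at` on every call is what would let a watchdog tell a slow stage apart from a stuck one.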

🚀 OSS Ready

This project is:

  • ✅ Released under MIT License
  • ✅ Submitted to awesome-devops
  • ✅ Includes full documentation, examples, templates
  • 🏁 Ready for production use in GitLab-based orgs

🌱 What's Next

I'm planning to add:

  • Job-level tracking (not just stage)
  • Webhook-based watchdog
  • S3/GCS storage as an alternative to the shared mount
  • Slack/email notifications on resume

🙌 Get Involved

👉 GitHub Repo: gThiru/gitlab-self-healing-pipeline

💫 Star it if you like it

💬 Open issues or suggest ideas

🤝 Contributions are welcome!


📢 Let’s build smarter pipelines, together.

Thirunavukkarasu Ganesan

DevOps Manager / Architect