# What are DevOps & Site Reliability Engineering (SRE)?
DevOps
DevOps (Development + Operations) is a set of practices, principles, and cultural philosophies that aim to enhance collaboration and communication between software development (Dev) and IT operations (Ops) teams.
The primary goal is to automate and streamline software delivery and infrastructure changes, promoting a culture of continuous improvement and faster, more reliable releases.
SRE
Site Reliability Engineering, or SRE, is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.
It was developed at Google to create scalable and highly reliable software systems.
# A Simplified Analogy
- DevOps is like a chef in a busy restaurant, responsible for creating & crafting delicious dishes.
- SRE is the kitchen assistant, ensuring everything runs smoothly behind the scenes. The assistant sets up the kitchen, maintains the equipment, and creates workflows that make the chef's job easier.
Roles & Responsibilities of DevOps/SRE in the Analogy
1. New Recipe Development
- DevOps: Brainstorms new dishes, experiments with flavors and ingredients, and writes down the recipes (code).
- SRE: Sets up the kitchen with all the necessary tools and equipment, creates a clean and organized workspace, and ensures everything is functioning properly.
2. Preparing the Kitchen
- DevOps: Gathers ingredients (data), prepares them (cleans and formats the code), and begins the cooking process (testing and deploying the code).
- SRE: Manages inventory, ensures ingredients (data) are fresh and available, and monitors the kitchen for any potential issues.
3. Cooking & Serving
- DevOps: Continuously monitors the cooking process, makes adjustments as needed, and ensures the dish is cooked to perfection (testing and refining the code).
- SRE: Handles any unexpected hiccups or spills, keeps the kitchen clean and organized, and helps the chef deliver the finished dish to customers (users).
4. Cleaning Up
- DevOps: Analyzes the finished dish, identifies areas for improvement, and cleans up any leftovers (code or resources).
- SRE: Cleans the kitchen, puts away tools and equipment, and prepares for the next round of cooking.
The analogy shows that, by working together, DevOps and SRE ensure that the restaurant (software development) runs smoothly, delicious dishes (features) are consistently cooked and served, and customers stay satisfied.
# Pre-DevOps Era
- It was a starkly (clearly and obviously) different landscape than what we see today.
- Siloed (isolated) teams, rigid methodologies, and a lot of manual work were the norm.
- The result was slow and unreliable software delivery.
Characteristics of the pre-DevOps era?
- Waterfall Model: The dominant methodology was the Waterfall model, a linear approach where each stage had to be completed before moving on to the next. It was difficult to adapt to changes and respond quickly to new requirements.
- Siloed Teams: Developers, testers, and operations teams worked in isolation, often with very little communication or collaboration. This created a "throw it over the wall" mentality, where each team blamed the others for problems.
- Manual Processes: Most tasks, from testing to deployment, were done manually. This was time-consuming and error-prone, leading to delays and inconsistencies.
- Limited Automation: There were few tools available to automate repetitive tasks, making it difficult to scale software development.
- Slow & Unreliable Delivery: Releases were infrequent and often buggy, causing frustration for both developers and users.
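
To make the contrast with manual processes concrete, here is a minimal Python sketch of what automating a release might look like: a list of stages run in order, stopping at the first failure. The `run_pipeline` helper and the stage names are hypothetical and chosen only for illustration. Each `lambda` stands in for a real build, test, or deploy command, which this sketch does not attempt to model.

```python
from typing import Callable, List, Tuple

# A stage is a (name, action) pair; the action returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, action in stages:
        if not action():
            print(f"stage '{name}' failed; stopping pipeline")
            break
        completed.append(name)
    return completed


# Hypothetical stages: each lambda is a placeholder for a real command.
stages = [
    ("build", lambda: True),
    ("test", lambda: True),
    ("deploy", lambda: True),
]
print(run_pipeline(stages))  # prints ['build', 'test', 'deploy']
```

Because every run executes the same steps in the same order, a failed test stops the release before deployment instead of relying on someone remembering to check.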
PSN Outage: a classic example of the challenges faced in the pre-DevOps era
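- In April 2011, following an external intrusion, the Sony PlayStation Network (PSN) suffered a massive outage that lasted 23 days and affected roughly 77 million registered PSN accounts (over 100 million when related Sony Online Entertainment accounts are included).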
Reasons for such a long outage?
- Silos & Communication Gaps: Dev and Ops teams at Sony worked in separate silos with limited communication and collaboration. This led to a lack of understanding of each other's work and challenges, making it difficult to respond effectively to the evolving situation during the outage.
- Manual and Slow Processes: Deployments and infrastructure changes were performed manually, requiring significant time and effort. This slowness hampered Sony's ability to quickly assess the situation and implement necessary fixes.
- Limited Scalability & Flexibility: The PSN's infrastructure was not designed to handle the surging demand caused by the attack, leading to widespread outages and service disruptions.
- Lack of Visibility & Tracking: Sony lacked effective monitoring tools to identify and diagnose the source of the outage promptly. This delayed the response and made it difficult to determine the full scope of the attack.
- Culture of Blame & Finger-Pointing: The siloed environment and lack of communication led to blame and finger-pointing between teams, hindering collaboration and problem-solving efforts.
Consequences of such a long outage?
- Financial Losses: Sony estimated the outage cost the company approximately $170 million in lost revenue and legal settlements.
- Reputation Damage: The incident severely damaged Sony's reputation and eroded user trust in the PSN platform.
- Customer Frustration: Millions of users were frustrated by the prolonged outage and the lack of information from Sony.
Lessons learned from the PSN outage?
- The importance of breaking down silos and fostering collaboration between development and operations teams.
- The need for automated deployments and infrastructure changes to enable faster response times.
- The importance of building scalable and flexible infrastructure to handle unexpected spikes in demand.
- The necessity for implementing effective monitoring tools to gain real-time insights into system health and performance.
- The value of building a culture of shared responsibility and collaboration to prevent future incidents.
By adopting modern DevOps principles and practices, organizations can avoid similar pitfalls and ensure greater agility, reliability, and security in their operations.
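
As one illustration of the monitoring lesson above, the following Python sketch tracks the outcome of recent requests and flags when the error rate crosses a threshold. The `ErrorRateMonitor` class, its window size, and its threshold are assumptions made for this example. It is not a description of any real monitoring product or of tooling Sony used.

```python
from collections import deque


class ErrorRateMonitor:
    """Track recent request outcomes and flag a high error rate."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes = deque(maxlen=window_size)  # True means the request failed
        self.threshold = threshold

    def record(self, failed: bool) -> None:
        """Store the outcome of one request."""
        self.outcomes.append(failed)

    def error_rate(self) -> float:
        """Return the fraction of failed requests in the current window."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        """Return True when the error rate is strictly above the threshold."""
        return self.error_rate() > self.threshold


# Hypothetical traffic: 7 successful requests followed by 3 failures.
monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
for failed in [False] * 7 + [True] * 3:
    monitor.record(failed)
print(monitor.error_rate(), monitor.should_alert())  # prints 0.3 True
```

A rolling window keeps the signal focused on current behavior, so a burst of failures raises an alert quickly rather than being averaged away by hours of healthy traffic.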