DevOps Professional


1. DevOps Basics

Understanding DevOps

Definition and Core Concept

  • DevOps combines two traditional tech roles:
    • Developers: Write application code
    • Operations Engineers: Set up and manage systems running the applications
  • DevOps emerged in the late 2000s to address the disconnect between these roles

Key Characteristics

  • Collaborative approach throughout the entire service lifecycle
  • Includes all specialized roles working together:
    • Front-end developers
    • Test engineers
    • Build engineers
    • Networking engineers
    • Security engineers
    • Database administrators (DBAs)

Modern Systems Engineering Approach

  • Operations engineers use development techniques
  • Systems engineering follows software development workflow:
    • Code checked into source control
    • Build, test, and deployment processes
  • Moves away from manual system administration

Three Levels of DevOps

  1. Values
  2. Principles
  3. Practices

Benefits and Impact (2021 State of DevOps Report)

Elite Teams vs. Low-Performing Teams

  • Deployment Frequency: 973 times more frequent
  • Lead Times: 6,570 times shorter
  • Quality Metrics:
    • 3x lower change failure rate
    • 6,570 times faster recovery from issues

Organizational Benefits

  • 22% less time spent on unplanned work and rework
  • 2x more likely to achieve organizational objectives
  • Higher success in:
    • Shipping products
    • Customer satisfaction
  • 50% reduction in employee burnout

Universal Application

  • Benefits apply across:
    • Different organization sizes
    • For-profit and non-profit organizations
    • Product engineering teams
    • Internal IT organizations

What DevOps is NOT

  • Not just a renamed operations team
  • Not a single job title
  • Not one person doing everything
  • Not tied to specific tools

Important Note

“Keep in mind that a lot of people use the term DevOps without really understanding what it means. So always check what you’re hearing against the core concepts of DevOps.”

DevOps Core Values: CAMS Model

Overview

The CAMS model, developed by DevOps pioneers John Willis and Damon Edwards, represents the fundamental values of DevOps:

  • Culture
  • Automation
  • Measurement
  • Sharing

“DevOps is a human problem.” - Patrick Debois (Godfather of DevOps)

Culture (C)

Understanding Culture

  • More than superficial perks (ping pong tables, free food)
  • Driven by human behavior
  • Based on mutual understanding between team members

Historical Context

  • Traditional IT organization split:
    • Development Teams: Focus on creating applications and features
      • Emphasis on speed and innovation
    • Operations Teams: Focus on maintenance and stability
      • Responsible for servers, networks, security, and cost control

Cultural Challenges

  • Formation of silos due to differing goals
  • Communication breakdown between teams
  • Focus on team-specific goals rather than overall business outcomes
  • Solution: Change underlying behaviors and assumptions to drive cultural change

Automation (A)

Key Points

  • Often the first thing people associate with DevOps
  • Warning: Implementing automation without other values can lead to DevOps failure
  • Creates a fabric for controlling systems and applications
  • Acts as an accelerator for other DevOps benefits

Best Practices

  • Make automation the primary approach to creating solutions
  • Address manual work as it’s a source of:
    • Inefficiencies
    • Quality problems in technology value streams

Measurement (M)

Importance

  • Enables observation of systems and people
  • Helps track improvement from changes
  • Provides rational approach to technology

Common Pitfalls

  1. Measuring wrong metrics
  2. Improper incentivization
  • Focus instead on cross-organizational outcomes:
    • Mean time to restore service after outages
    • Cycle time for new feature deployment
  • Higher-level results:
    • Costs
    • Revenue
    • Employee satisfaction
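
As a minimal sketch of the two cross-organizational outcomes listed above, the Python below computes mean time to restore and mean cycle time from a few timestamps. The record layout and the sample values are invented for illustration; real numbers would come from an incident tracker and a deployment pipeline.

```python
from datetime import datetime, timedelta

# Hypothetical sample data: (outage start, service restored)
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 14, 15)),
]

# Hypothetical sample data: (work started, deployed to production)
features = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 4, 10, 0)),
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 3, 10, 0)),
]


def mean_duration(intervals):
    """Average length of a list of (start, end) intervals."""
    total = sum((end - start for start, end in intervals), timedelta())
    return total / len(intervals)


mttr = mean_duration(incidents)
cycle_time = mean_duration(features)
print(f"Mean time to restore: {mttr}")
print(f"Mean cycle time: {cycle_time}")
```

Tracking both numbers over time, rather than one team's local metric, is what lets a change be judged by its effect on the whole value stream.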

Sharing (S)

Core Elements

  • Foundation of collaboration
  • Essential for DevOps success
  • Promotes teamwork and transparency

Sharing Methods

  • Documentation
  • Pair programming
  • Peer reviews
  • Mentoring
  • Inclusive practices

Conclusion

CAMS implementation focuses on:

  1. Changing people’s behavior
  2. Using automation as an accelerator
  3. Measuring progress for improvement
  4. Fostering collaboration for better outcomes

The CAMS values serve as the foundation for specific DevOps techniques and should be embraced for successful organizational transformation.

The Three Ways of DevOps

Overview

The Three Ways are strategic principles developed by Gene Kim and Mike Orzen to implement DevOps values effectively. These principles help bring core DevOps values to life in practical ways.

First Way: Systems Thinking and Principles of Flow

Key Concepts

  • Focus on the overall outcome of the entire system
  • Avoid optimizing individual parts at the expense of overall results
  • Consider end-to-end flow as the primary value producer

Example: Performance Optimization

  • Improving one area can create unexpected bottlenecks elsewhere
  • Case Study: Adding more application servers can overwhelm database servers with connections
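
The case study comes down to simple arithmetic. The sketch below assumes each application server holds a fixed-size connection pool and the database enforces a connection limit; the pool size and the limit are illustrative numbers, not recommendations.

```python
def total_db_connections(app_servers: int, pool_size: int) -> int:
    """Connections the database sees when every pool is fully used."""
    return app_servers * pool_size


DB_MAX_CONNECTIONS = 200  # assumed database connection limit
POOL_SIZE = 20            # assumed connections per application server

for servers in (5, 10, 15):
    demand = total_db_connections(servers, POOL_SIZE)
    status = "ok" if demand <= DB_MAX_CONNECTIONS else "database overwhelmed"
    print(f"{servers} app servers -> {demand} connections: {status}")
```

Scaling the application tier looks like a local improvement, yet past a certain point it moves the bottleneck to the database, which is the kind of system-wide effect the First Way asks you to check for.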

Organizational Impact

  • Deployment team processes might look good in isolation but could compromise overall development
  • Handoffs and friction between teams often disrupt value flow
  • Success metrics should reflect system-wide outcomes

Second Way: Amplifying Feedback Loops

Definition

  • Processes that consider their own output when determining next steps
  • Focus on creating, shortening, and amplifying feedback loops between value chain components

Bug Detection Example

Three scenarios with increasing waste:

  1. Best case: Developer catches bug through desktop unit tests
  2. Medium case: QA finds bug, documents it, returns to developer
  3. Worst case: Customer finds bug → Support → Development → Product Management → Fix

Application

  • Use when creating multi-team processes
  • Important for visualizing metrics
  • Essential in designing delivery flows

Third Way: Culture of Continuous Experimentation and Learning

Core Elements

  • Create an environment that encourages learning and experimentation
  • Avoid analysis paralysis
  • Focus on practical implementation and iteration

Key Principles

  • “Working code wins”
  • “If it hurts, do it more often”
  • “Fail fast”

Implementation

  • Encourage active skill practice and mastery
  • Promote trying new approaches
  • Focus on doing rather than just discussing
  • Support sharing of new ideas

Practical Application

The Three Ways provide a framework to:

  • Implement specific processes and tools
  • Align with CAMS (Culture, Automation, Measurement, Sharing)
  • Guide decision-making in DevOps implementation

Key Questions to Consider

  1. How does this affect the whole system?
  2. Where can we build in more feedback loops?
  3. How can we facilitate experimentation and learning?

DevOps Practice Areas: The Five Pillars

Overview

Unlike Agile’s structured methodologies (like Scrum and XP), DevOps doesn’t have a strictly defined approach. However, it consists of five major practice areas that form a comprehensive implementation framework.

The Five Practice Areas

1. Culture

  • Focus on creating and maintaining a stable, safe environment
  • Key elements:
    • Learning and sharing
    • Experimentation
    • Embracing both success and failure
    • Reflects core DevOps values

2. Process

  • Foundation: Agile and lean product development techniques
  • Key practices:
    • Working in small batches
    • Limiting work in progress (WIP)
    • Incorporating feedback loops
    • Lightweight change approval processes
  • Strong correlation with IT and business success
  • Reflects the “Three Ways” in Lean and Agile frameworks
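
One way to see why limiting work in progress helps is Little's Law from queueing theory, which relates average cycle time to average WIP and average throughput in a stable system. The sketch below is only an illustration; the team throughput and WIP figures are made up.

```python
def average_cycle_time(wip_items: float, throughput_per_week: float) -> float:
    """Little's Law: average cycle time = average WIP / average throughput."""
    return wip_items / throughput_per_week


# Hypothetical team that finishes 5 items per week.
for wip in (20, 10, 5):
    weeks = average_cycle_time(wip, throughput_per_week=5)
    print(f"WIP {wip:>2} -> average cycle time {weeks:.1f} weeks")
```

With throughput held constant, halving the work in progress halves the average time each item spends in the system, which is why smaller batches and WIP limits shorten feedback loops.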

3. Infrastructure as Code

  • Technological approach using:
    • Cloud
    • Containers
    • Programmable infrastructure
  • Benefits:
    • Reproducibility
    • Self-service capabilities
    • Rapid scaling
    • Improved software delivery and operational performance

4. Continuous Delivery

  • Focuses on automation for implementing lean principles
  • Key aspects:
    • Automated testing
    • Frequent deployment of small changes
  • Benefits:
    • Increased speed
    • Improved quality
    • Better culture
    • Enhanced performance

5. Site Reliability Engineering (SRE)

  • Engineering approach to:
    • Building reliability into systems
    • Operating services with high observability
    • Implementing automation
  • Applies to both application and infrastructure levels

Important Considerations

Interdependence

  • Pillars are not effective in isolation
  • Must work together to build a solid DevOps foundation
  • Example: High software delivery performance (from continuous delivery) needs operational excellence (from SRE) to deliver business benefits

Implementation Strategy

  • Advance all five pillars iteratively
  • Avoid focusing on one pillar exclusively
  • Balance development across all areas
  • Regular assessment of organizational maturity in each pillar is recommended

“In your roadmap to DevOps maturity, you want to advance all five pillars in turn and iterate so that they can reinforce each other. Trying to completely implement one without bolstering the others will end in frustration.”

DevOps Tools Selection Guide

Core Principles

People Over Process Over Tools

“People over process over tools” - Alex Honor (Creator of Rundeck)

Correct Implementation Order:

  1. Identify responsible people and ensure they have proper skills/support
  2. Define necessary processes
  3. Select and implement appropriate tools

Common Mistake:

  • Organizations often reverse this order, focusing on:
    • Tools first
    • Processes second
    • People last (if at all)

Tool Selection Criteria

1. KISS Principle
  • Definition: Keep It Simple, Stupid
  • Rationale: Every tool requires:
    • Learning curve
    • Implementation
    • Upgrades
    • Security maintenance
    • Integration with other tools
2. Integration Requirements
  • Tools should function as a “tool chain”
  • Must operate well in dynamic environments
  • Key features:
    • Good integration capabilities
    • Ability to compose solutions
    • Automatic adaptation to changes
    • API availability

Challenges in Modern DevOps

Complexity Issues

  • Increasing complexity in tech landscape
  • Example: Cloud Native Computing Foundation’s landscape diagram
  • Recent trends show:
    • Declining quality of implementations
    • Overabundance of tools
    • Integration difficulties

Common Tool Categories

Popular tools mentioned:

  • Kubernetes
  • Terraform
  • Ansible
  • Puppet
  • Chef
  • GitHub
  • Jenkins
  • Docker
  • Linux
  • Amazon
  • Graphite
  • Artifactory

Best Practices for Tool Selection

  1. Focus on Collaboration

    • Consider how tools enhance team collaboration
    • Ensure all value stream participants can use them effectively
  2. Avoid Over-tooling

    • Resist the temptation to implement too many tools
    • Consider maintenance overhead
  3. Ensure Dynamic Compatibility

    • Tools must work with changing environments
    • Avoid static configurations
    • Prioritize API-driven solutions

Key Takeaway

“There is no such thing as the best tool. There’s only the best tool for you and your specific situation.”

2. DevOps and People: A Culture Change

The Need for DevOps Culture

Current IT Challenges

Traditional IT Department Issues

  • IT departments often face low success and satisfaction rates
  • Historical misalignment between business teams and technology teams
  • Popular media (e.g., “Office Space,” “IT Crowd,” “Silicon Valley”) reflects these real-world challenges

Internal Friction Points

  • Conflict exists between various technical teams:
    • Developers
    • Quality Assurance
    • System Administrators
    • Information Security Professionals
    • Network Administrators
    • Database Administrators (DBAs)

The Wall of Confusion

Definition

  • Represents the communication barriers between different teams
  • Creates division between groups that should share common goals

Typical Flow

  1. Business throws requirements to developers
  2. Developers throw code to testers
  3. Testers throw tested code to operations
  4. Operations throw final product to end users

Real-World Example

Server Provisioning Case Study

Traditional Process (6 weeks):

  • Negotiating specifications
  • Procurement process
  • Hardware delivery
  • Installation in data center
  • OS loading
  • Final handover

After Virtualization (4 weeks):

  • Technical process reduced to 15 minutes
  • Organizational overhead still resulted in 4-week delays due to:
    • Standards
    • Ticketing systems
    • Documentation requirements

Business Impact

Executive Perspective

  • Modern business executives are increasingly tech-savvy
  • Question why 15-minute tasks take 4 weeks
  • Concerned about:
    • Financial waste
    • Time inefficiency
    • Competitive disadvantage

Common Reactions

  • Turn to outsourcing
  • Develop shadow IT
  • Seek alternatives to central IT department

“The organizations and processes we’ve built up around IT” have created unnecessary complexity and delays, highlighting the need for a DevOps culture to bridge these gaps and improve efficiency.

Building DevOps Culture Through Communication and Trust

The Importance of Communication

  • Communication and trust are fundamental to a productive DevOps culture
  • Project success (from deployments to acquisitions) heavily depends on communication quality
  • Without proper communication and trust:
    • Technical practices may fail
    • Goals may compete
    • Misunderstandings can occur

Effective Communication Strategies

Structured Communication Channels

  • Establish dedicated channels for specific purposes:
    • File repositories for customer information
    • Chat channels for downtime incidents
    • Email aliases for software release communications

Communication Planning

  • Good communication requires intentional planning
  • Essential for:
    • Fast-moving organizations
    • High-pressure situations (e.g., outages)
  • Need clear processes defining:
    • When to communicate
    • Who to communicate with
    • How to handle business events

Organizational Types (Westrum Model)

  1. Pathological Organizations

    • Everyone looks out for their own needs
    • Limited information flow
  2. Bureaucratic Organizations

    • Focus on strictly defined roles
    • Teams defend their turf
  3. Generative Organizations

    • Mission-focused
    • Most effective information flow
    • Features high trust environment
    • Welcomes bad news as learning opportunities

Building Trust and Respect

Personal Development

  • Acknowledge that not everyone has natural social skills
  • Recommended resources for improvement:
    • “How to Win Friends and Influence People”
    • “Crucial Conversations”
    • “How to Say It At Work”

Key Principles

  1. Assume Good Faith

    • Most people try to do their best
    • Actions are based on perceived constraints
    • Misunderstandings often stem from lack of context
  2. Promote Transparency

    • Share access to:
      • Chat rooms
      • Team Wiki pages
      • Code repositories
      • Infrastructure details
      • Monitoring tools
      • Ticket trackers
  3. Break Down Barriers

    • Don’t over-restrict communication
    • Challenge unnecessary “least privilege” restrictions
    • Recognize business value in transparency

Best Practices

  • Create shared goals across teams
  • Provide visibility into different team activities
  • Be open and transparent
  • Stay curious and respectful
  • Focus on understanding others’ perspectives
  • Align goals across teams
  • Show value for others’ needs

“There’s no shortcut to building mutual trust. It develops over time.”

Real-World Example

  • Situation: Developer-Operations conflict over priorities
  • Problem: Lack of understanding about operations team’s workload
  • Solution: Implemented program to create:
    • Shared goals
    • Better visibility
    • Cross-team understanding
  • Result: Improved working relationships and effectiveness

Breaking Silos in DevOps: Enhancing Collaboration

The Wall of Confusion

Root Causes

  • Not primarily due to poor people skills of tech professionals
  • Main cause: Institutional incentivization of opposing behaviors
  • Different teams have conflicting responsibilities:
    • Development teams: Focus on new functionality and rapid changes
    • Operations teams: Maintain stability and control change

Impact of Misaligned Incentives

  • Creates harmful conflicts of interest
  • Diminishes feedback loops
  • Local optimization interferes with global optimization
  • Teams focus only on individual metrics rather than organizational success

Conway’s Law

“Systems will nearly always align themselves to your communication boundaries.”

  • Organizational boundaries act as communication boundaries
  • The First Way of DevOps emphasizes alignment around the value stream
  • Simply renaming teams to “DevOps” without structural changes is ineffective

Solutions for Breaking Silos

1. Cross-Functional Teams

  • Integrate people from different specialties to work together
  • Success Story Example:
    • Large SaaS company in Austin
    • Embedded ops engineer into dev team
    • Shared ticket backlog between dev and ops tasks
    • Results:
      • Developers gained understanding of operational requirements
      • Increased respect and collaboration
      • Shared responsibility for production service

2. Self-Service Tooling

  • Implement automated access to shared services
  • Benefits:
    • Reduces dependencies between teams
    • Increases efficiency
    • Eliminates unnecessary waiting times
    • Better alignment with specific team needs

3. Aligned Communication and Goals

  • Role Evolution Requirements:
    • Developers:
      • Take responsibility for build/deployment failures
      • Participate in on-call rotations
    • Operations/QA:
      • Shift to providing self-service platforms
      • Focus on guidance rather than direct execution

Three-Step Path to Enhanced Collaboration

  1. Reduce Separate Teams:
    • Eliminate silos
    • Create cross-disciplinary teams
  2. Implement Self-Service:
    • Virtually remove team dependencies
  3. Align Remaining Teams:
    • Promote collaboration
    • Ensure mutual support
    • Align goals across teams

Action Items

  • Evaluate organizational maturity in these areas
  • Identify specific actions for improvement
  • Plan implementation steps towards collaborative goals

Continuous Learning in DevOps: The Third Way

Core Concepts

The Third Way Fundamentals

  • Focuses on creating a culture of continuous experimentation and learning
  • Emphasizes:
    • Mastering core skills
    • Experimenting and taking risks
    • Learning through practical experience

Kaizen (改善)

  • Japanese concept meaning “change for the better”
  • Translates roughly to continuous improvement
  • Key component of Toyota Production System (TPS)
  • Introduced to Western world in 1986 through Masaaki Imai’s book
  • Adopted by major companies including:
    • Lockheed Martin
    • Pixar Animation Studios

Five Principles of Kaizen

  1. Knowing the customer
  2. Enabling smooth workflow
  3. Going to the real place (gemba)
  4. Empowering people
  5. Maintaining transparency

Gemba (現場)

  • Means “the real place” in Japanese
  • Emphasizes direct observation and involvement
  • Key practice: Go to where value is created or where problems exist
  • Avoid relying on:
    • Secondary reports
    • Metrics alone
    • Documentation
    • Assumptions

“Show up in the project meeting. Go look at the code. Go try and use the system having problems.”

Implementation Process

Kaizen Improvement Process (Kata)

Follows the cycle of:

  1. Plan: Define intentions and expected results
  2. Do: Execute the plan
  3. Check: Measure and analyze results
  4. Act: Make necessary alterations

Key characteristics:

  • Similar to scientific method
  • Focuses on small, daily improvements
  • Creates new baselines when improvements are successful
  • Builds critical thinking skills
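
The cycle above can be sketched as a small loop. This is an illustrative model only: the `pdca_cycle` function, the build-step durations, and the change being tried are all hypothetical, and a real "Check" step would use measurements taken from the running system.

```python
def pdca_cycle(baseline, candidate_change, measure):
    """Run one Plan-Do-Check-Act iteration and return the resulting baseline."""
    # Plan: state the intention and the expected result.
    expected = candidate_change["expected"]
    # Do: execute the plan (here, apply the change to a copy of the baseline).
    trial = candidate_change["apply"](baseline)
    # Check: measure the trial and compare it with the current baseline.
    before, after = measure(baseline), measure(trial)
    # Act: adopt the change as the new baseline only if it helped.
    improved = after < before
    print(f"expected: {expected}; measured {before} -> {after}; keep={improved}")
    return trial if improved else baseline


# Hypothetical build pipeline, durations in seconds.
build = {"compile": 300, "lint": 60, "duplicate_lint": 60}


def drop_duplicate_lint(steps):
    """Return a copy of the pipeline without the redundant lint step."""
    return {name: secs for name, secs in steps.items() if name != "duplicate_lint"}


def total_seconds(steps):
    return sum(steps.values())


change = {"expected": "build about 60s faster", "apply": drop_duplicate_lint}
build = pdca_cycle(build, change, measure=total_seconds)
```

Each successful iteration becomes the baseline for the next one, which is how small daily improvements accumulate.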

Practical Application

Best Practices

  • Make small iterative changes regularly
  • Implement improvements as part of daily work
  • Focus on teaching people critical thinking skills
  • Build people before building systems

Common Pitfalls to Avoid

Avoid degraded variations of the cycle, such as:

  • Plan, don’t do, hide
  • Just trying to make it to Friday
  • Waiting for the weekend instead of improving

Action Items

  • Use notebook function in course to document:
    • Potential improvement areas
    • Small, tangible next steps
    • Ideas for iterating towards DevOps
    • Progress and learning outcomes

3. DevOps and Process: The Building Blocks

DevOps and Agile: Historical Context and Framework

Origins of DevOps

  • First DevOps Discussion:
    • Occurred at Agile 2008 conference in Toronto
    • Between Patrick Debois and Andrew Clay Shafer
    • Started as an “Agile infrastructure” discussion

Key Historical Events

  1. 2008: Initial discussion at Agile conference
  2. 2009:
    • Andrew presented on Agile infrastructure at Velocity Conference
    • Patrick started “DevOps Days” conference in Belgium, coining the term “DevOps”

Understanding Software Development Lifecycle (SDLC)

Traditional Steps:

  1. Requirements gathering
  2. Design creation
  3. Implementation
  4. Testing
  5. Deployment
  6. Maintenance

Waterfall vs. Agile Approach

Waterfall Method:

  • Sequential, linear approach
  • Complete documentation before proceeding
  • “Throwing over the wall” mentality between teams
  • Results in:
    • Loss of context
    • Quality issues
    • Excessive rules and contracts
    • Finger-pointing

Agile Method:

  • Iterative approach
  • Small, frequent iterations
  • Active collaboration between teams
  • Includes end-user feedback
  • Focuses on working software

Agile Benefits (According to VersionOne’s Survey)

  • 85% increased productivity
  • 80% faster time to market
  • 81% better delivery time predictability
  • 79% enhanced software quality

Limitations of Agile

  • No mention of operations in original manifesto
  • Doesn’t address systems aspects:
    • Infrastructure building
    • Application deployment
    • Monitoring
    • Maintenance

DevOps and Agile Relationship

  • Not identical: Can be practiced independently
  • Best Practice: Implement DevOps as an extension of Agile
  • DevOps addresses the operational gaps in Agile

Historical Challenge

“In the beginning, Agile was seen as a threat by the infrastructure side of the house and IT organizations”

  • Operations teams initially struggled with Agile’s iteration speed
  • Success was found when operations teams adopted Agile principles themselves

“You can practice DevOps without Agile and vice versa. But it can, and frankly probably should be implemented as an extension of Agile for best results.”

Visible Ops Change Control Process

Introduction

  • Change is the primary cause of technical issues
    • 80% of outages are caused by changes intended to improve, patch, or upgrade systems
  • Solution: Implement controlled changes through review, testing, and scheduled rollouts

IT Service Management (ITSM) Background

  • Emerged in 1980s as IT operations scaled
  • Focuses on service delivery and support
  • Notable frameworks:
    • Microsoft Operations Framework
    • COBIT
    • ISO 20000
    • Six Sigma
    • ITIL (IT Infrastructure Library) - Most popular framework
      • Currently in 4th major version
      • Covers 34 different areas
      • Known for heavy-handed, slow processes

Traditional ITIL Change Management Issues

  • Requires extensive documentation for all changes
  • Relies on Change Advisory Board (CAB) for approval
  • Problems:
    • Too slow for modern technical organizations
    • Approval decisions made by those least qualified
    • Tends to add more process when changes fail

Visible Ops Approach

  • Introduced by Gene Kim, Kevin Behr, and George Spafford in 2004
  • Published in “The Visible Ops Handbook”
    • Condensed ITIL implementation into 4 practical steps
    • Only 112 pages vs. ITIL’s 2000+ pages
  • Focuses on lightweight, fast, scalable, repeatable change control

Key Principles of Lightweight Change Control

  1. Review and Documentation Requirements

    • All changes need review, approval, and documentation
    • Peer review by technologists close to the team
    • Risk-based escalation for complex changes
    • Example: Wireless access point installation vs. core router replacement
  2. Change Size Management

    • Keep changes as small as possible
    • Benefits:
      • Easier to review
      • Simpler to identify and fix errors
      • Better than batch releases with hundreds of changes
  3. Early Testing Implementation

    • Use continuous integration systems
    • Implement automated testing
    • Include security safeguards early in development
    • Peer review validates testing completion
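
Risk-based escalation can be sketched as a small routing rule. The `Change` fields and the thresholds below are assumptions made up for illustration; each organization would define its own risk signals.

```python
from dataclasses import dataclass


@dataclass
class Change:
    description: str
    systems_affected: int        # assumed proxy for blast radius
    touches_core_network: bool   # assumed high-risk signal


def required_approval(change: Change) -> str:
    """Return the review path for a change under a simple, made-up risk rule."""
    if change.touches_core_network or change.systems_affected > 50:
        return "peer review + escalation"
    return "peer review"


access_point = Change("Install wireless access point", 1, False)
core_router = Change("Replace core router", 400, True)

for change in (access_point, core_router):
    print(f"{change.description}: {required_approval(change)}")
```

Every change still gets reviewed and recorded; only the higher-risk ones pay the cost of a heavier approval path.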

Research Support

  • Google’s DevOps Research and Assessment (DORA) group findings:
    • Streamlined change approval processes lead to:
      • Higher performance
      • Lower burnout levels
      • Increased psychological safety

Additional Resources

  • LinkedIn Learning course: “IT Service Management Foundations Change Management” by Earnest
    • Detailed guidance on setting up lightweight change control processes

4. Infrastructure as Code

Infrastructure as Code (IaC)

Traditional Infrastructure Management

  • Historically, infrastructure was managed manually:
    • Building data centers
    • Installing physical servers
    • Loading operating systems (Windows/Linux)
    • Configuring software
    • Installing applications

Problems with Manual Management

  • Each system became highly individual (“special snowflakes”)
  • System administration was:
    • Slow
    • Error-prone
    • Hard to maintain consistency
    • Difficult to track changes

Modern Infrastructure as Code

Definition

“Infrastructure as code is provisioning and managing infrastructure through writing automation code instead of through manual processes.”

Key Concepts

  • Programmable Infrastructure:
    • Write code to configure networks
    • Set up servers
    • Attach storage
    • Configure operating systems
    • Install applications

Benefits

  • Aligns with DevOps CAMS values:
    • Culture
    • Automation
    • Measurement
    • Sharing
  • Supports lean theory by:
    • Removing waste
    • Reducing delays

Modern Systems Challenges

Complexity Factors

  • Distributed systems
  • Microservice architectures
  • Cloud infrastructure
  • Containers
  • Machine learning
  • Ephemeral (temporary) components

New Approach: “Cattle not Pets”

  • Old way: Servers were “pets” (individually crafted and maintained)
  • New way: Servers are “cattle” (managed en masse)

Best Practices

  • Adopt a development lifecycle approach
  • Combine both operational and development perspectives:
    • Operations expertise with tools
    • Developer expertise with code
  • Version control for infrastructure code
  • Automated testing and deployment
  • Consistent build and deployment processes

Benefits of IaC

  • Scalability
  • Consistency
  • Reproducibility
  • Efficiency
  • Version control
  • Automated deployment
  • Reduced human error

DevOps Infrastructure as Code: Configuration Management Overview

Core Concepts

Configuration Management Definition

  • Process for creating and maintaining systems and software in a desired state
  • In DevOps: All configuration management should be automated and code-driven

Three Main Components

  1. Provisioning

    • Making servers and computing infrastructure ready for operation
    • Includes:
      • Hardware/virtual hardware setup
      • Operating system installation
      • System services configuration
      • Network connectivity setup
  2. Deployment

    • Automated installation and upgrading of application software
    • Applies to both:
      • In-house developed software
      • Third-party products
  3. Orchestration

    • Coordinated operations across multiple systems
    • Examples:
      • Automated failover
      • Rolling deployments
      • Running runbooks across server fleets
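
A rolling deployment, one of the orchestration examples above, can be sketched as follows. The `deploy` and `healthy` callables are placeholders for whatever a real tool would run against each server; the server names and batch size are hypothetical.

```python
def rolling_deploy(servers, deploy, healthy, batch_size=2):
    """Update servers in batches, halting if a batch fails its health check.

    Returns (servers confirmed updated, whether the rollout completed).
    """
    updated = []
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for server in batch:
            deploy(server)
        if not all(healthy(server) for server in batch):
            # Stop here so the remaining servers keep serving the old version.
            return updated, False
        updated.extend(batch)
    return updated, True


fleet = ["web1", "web2", "web3", "web4", "web5"]
versions = {}

done, ok = rolling_deploy(
    fleet,
    deploy=lambda server: versions.__setitem__(server, "v2"),
    healthy=lambda server: versions.get(server) == "v2",
)
print(done, ok)
```

The coordination across servers (batching, checking, halting) is what distinguishes orchestration from configuring each machine independently.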

Key Terminology

Approach Types

  1. Imperative (Procedural)

    • Defines and executes specific commands to produce desired state
    • Example:
      1. Stop service
      2. Copy new NGINX binary
      3. Start service
      
  2. Declarative (Functional)

    • Defines desired end state
    • Tool handles convergence to that state
    • Example: “Server should run NGINX v1.24”
    • Usually builds on top of imperative systems
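
The difference between the two approaches can be sketched against a toy in-memory model of a server. Nothing here talks to a real host; the dictionary keys and the `converge` function are hypothetical stand-ins for what a configuration management tool does.

```python
server = {"nginx_version": "1.22", "nginx_running": True}  # toy model of a host


# Imperative: spell out each command in order.
def upgrade_nginx_imperative(host, version):
    host["nginx_running"] = False      # 1. stop service
    host["nginx_version"] = version    # 2. copy new NGINX binary
    host["nginx_running"] = True       # 3. start service


# Declarative: state the end result and let a converge step work out the rest.
desired = {"nginx_version": "1.24", "nginx_running": True}


def converge(host, desired_state):
    """Change only what differs and return the keys that were changed."""
    changed = [key for key, value in desired_state.items() if host.get(key) != value]
    for key in changed:
        host[key] = desired_state[key]
    return changed


print(converge(server, desired))  # first run changes nginx_version
print(converge(server, desired))  # second run finds nothing to change
```

The declarative version still performs imperative steps underneath, which matches the note that declarative tools usually build on top of imperative systems.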

Important Characteristics

Idempotent

  • Ability to execute repeatedly with same end result
  • Declarative tools typically built to be idempotent
  • Must be manually ensured in imperative approaches
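
A minimal sketch of the difference, using a list of configuration lines as a stand-in for a file on disk. Both helper functions are hypothetical and exist only to contrast the two behaviors.

```python
def ensure_line(lines, line):
    """Idempotent: add `line` only if it is missing, so repeated runs converge."""
    if line not in lines:
        lines.append(line)
    return lines


def append_line(lines, line):
    """Not idempotent: every run appends another copy."""
    lines.append(line)
    return lines


config = ["worker_processes 4;"]
for _ in range(3):
    ensure_line(config, "gzip on;")
print(config)

naive = ["worker_processes 4;"]
for _ in range(3):
    append_line(naive, "gzip on;")
print(naive)
```

The check inside `ensure_line` is the kind of guard an imperative script has to add by hand, while declarative tools typically provide it for you.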

Self-Service

  • Allows end users to initiate processes independently
  • Benefits:
    • Removes operations team from critical path
    • Increases velocity
    • Improves developer satisfaction

Drift

  • Deviation from defined configuration
  • Causes:
    • Manual changes outside tool
    • Script execution issues
  • Many tools include drift detection capabilities
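
Drift detection amounts to comparing the defined configuration with what is actually on the system. The sketch below uses plain dictionaries for both sides; the setting names and values are invented, and real tools gather the actual state by inspecting the host.

```python
def detect_drift(defined, actual):
    """Return {setting: (defined_value, actual_value)} for every difference."""
    keys = defined.keys() | actual.keys()
    return {
        key: (defined.get(key), actual.get(key))
        for key in sorted(keys)
        if defined.get(key) != actual.get(key)
    }


defined = {"nginx_version": "1.24", "max_open_files": 65536}
# Someone raised a debug flag and lowered a limit by hand, outside the tool.
actual = {"nginx_version": "1.24", "max_open_files": 1024, "debug": True}

for setting, (want, have) in detect_drift(defined, actual).items():
    print(f"{setting}: defined={want!r} actual={have!r}")
```

A setting that exists only on the live system shows up with a defined value of `None`, which is how manual changes made outside the tool become visible.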

Notes

  • Configuration management tools often overlap in functionality
  • Tool selection should consider specific use cases

Evolution of DevOps Configuration Management

Early Days (1990s)

  • Commercial IT Provisioning Tools:
    • Ghost (system cloning)
    • Enterprise suites like Tivoli and HP
    • Focus on separate dev and ops approaches

Rise of Infrastructure as Code (2000s)

Major Configuration Management Tools

  • CFEngine
  • Puppet
  • Chef

“Our Unix admin team started using CFEngine to roll out operating system configurations” (circa 2005)

Challenges

  • Lack of collaboration between teams
  • Resistance to sharing tools across different functions
  • Configuration drift issues

Golden Image vs. Foil Ball Debate (2009)

  • Luke Kanies (Puppet founder) highlighted problems with image management:
    • Image sprawl
    • Configuration drift

New Approach: Stem Cell System

  • Minimal initial server images
  • Declarative CM tools for provisioning
  • Idempotent tools for:
    • Preventing configuration drift
    • Managing updates
    • Automatic state convergence

Cloud Era Challenges

Why Automated Server Provisioning Became Essential

  • Increased virtualization
  • Dynamic server instances
  • Growth in distributed systems
  • Exponential increase in virtual servers

Orchestration Problems

Traditional CM Tool Limitations

  • 15-minute wake-up cycle
  • Individual server checks
  • Pull-based changes
  • Issues with:
    • High availability requirements
    • Coordinated database/application changes

Initial Vendor Response

“You don’t need orchestration and if you think you do, you don’t understand configuration management.”

Evolution in the 2010s

New Tools and Approaches

  1. Ansible and SaltStack:

    • Push mechanism
    • Explicit orchestration
    • Dev-friendly deployment
    • Workflow automation capabilities
  2. Hybrid Solutions:

    • Combined push deployment with idempotence
    • Integration with existing CM tools
  3. Self-Service Tools:

    • Rundeck for orchestration
    • Compliant system activities
    • On-demand initiation

Limitations of Early CM Tools

  • Limited application deployment capabilities
  • Lack of virtual infrastructure provisioning
  • Focus primarily on system administration
  • Gap in addressing broader value stream needs

Evolution of Infrastructure as Code (IaC) in DevOps

Cloud Computing Era (2010s)

  • Enabled creation of servers, storage, and networks through code
  • Shifted from manual installation to programmatic infrastructure management
  • Introduced model-driven provisioning with declarative approaches

AWS CloudFormation Example

  • Provides templates for defining cloud assets
  • Allows automatic instantiation of resources
  • Uses declarative specifications for server configurations
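
To show what such a template looks like, the sketch below builds a minimal one as a Python dictionary and prints it as JSON. The logical name `WebServer`, the AMI ID, and the instance type are placeholders chosen for illustration.

```python
import json

# Hypothetical values: the logical name "WebServer", the AMI ID, and the
# instance type are placeholders, not recommendations.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal sketch: one EC2 instance",
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-00000000000000000",
                "InstanceType": "t3.micro",
            },
        }
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

The template states what should exist. CloudFormation decides which API calls are needed to create or update it.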

Advanced IaC Solutions

Specialized Tools

  • Terraform and Pulumi:
    • Emerged as dominant solutions
    • Terraform provides a domain-specific language (HCL) for infrastructure provisioning
    • Pulumi defines infrastructure in general-purpose programming languages

Programming Language Integration

  • Python: Boto library
  • AWS CDK: Enables pure code solutions
  • Note: Idempotence is not automatic with these approaches; code that calls cloud APIs directly must ensure it explicitly

Container Revolution (Late 2010s)

Key Features

  • Reduced server dependency
  • Docker containers package applications with minimal OS dependencies
  • Streamlined development and testing cycles

Benefits for Developers

  • Bundled runtime with applications
  • Reduced runtime bugs
  • Improved development workflow

Immutable Infrastructure

Netflix Model

  • Adopted golden image approach
  • Created cloud images with baked-in applications
  • Moved away from configuration management across servers

Characteristics

  • Servers not modified after deployment
  • Replace rather than modify approach
  • Reduces configuration drift through design

Modern Container Orchestration (2020s)

Platforms

  • Kubernetes
  • Mesos

Features

  • Unified solution for:
    • Provisioning
    • Deployment
    • Orchestration
  • Template-based application and infrastructure changes
  • Automated coordination of changes

Serverless and PaaS

  • Simplifies deployment process
  • Abstracts infrastructure management
  • Note: Platform operation still requires maintenance and oversight

Future Outlook

  • Moving towards integrated toolchains
  • Focus on simplified infrastructure management
  • Continued evolution of IaC approaches

“Someone operating the platform still has to worry about it” - highlighting the ongoing need for infrastructure expertise despite automation advances.

Infrastructure as Code (IaC) Toolchain Selection Guide

Core Principles

  • Choose tools appropriate for team’s skill level
  • Start simple, scale complexity as needed
  • Plan the entire toolchain before implementation
  • Design operational environment before creation

Key Decision Points

1. Infrastructure Management

Self-Managed vs. Managed Service Options:

  • Self-managed infrastructure:
    • Digital Rebar for bare metal automation
      • Handles PXE booting, BIOS, RAID configuration
      • OS and hypervisor installation
      • Integrates with tools like Terraform

2. Infrastructure Provisioning

Three Main Approaches:

  1. Template-Driven:

    • AWS CloudFormation
    • Azure Resource Manager (ARM) templates
    • Uses JSON/YAML format
  2. Custom Language Solutions:

    • Terraform (HCL)
    • Pulumi (infrastructure defined in general-purpose languages)
    • Benefit: Works across multiple cloud providers
  3. Pure Code Approach:

    • Python boto
    • AWS CDK
    • Azure Bicep (strictly a declarative DSL that compiles to ARM templates)
    • Leverages full programming languages

3. System Management

Options:

  1. Runtime Configuration:

    • Chef
    • Puppet
    • CFEngine
  2. Configuration + Orchestration:

    • Ansible
    • Salt
  3. Image Creation:

    • HashiCorp Packer for automated image building (“baking”)
    • Dockerfiles for container images

Note: These approaches can be combined. Example: Configure base image with Chef, then bake with Packer

4. Orchestration Options

  • Configuration management tools (Ansible/Salt)
  • Platform-based (Kubernetes/Mesos)
  • External runbook automation (Rundeck)
  • Custom code solutions

5. Application Deployment Methods

  • Configuration management
  • Immutable deployments (container/system images)
  • Continuous deployment systems

6. Testing Strategy

Important Considerations:

  • Essential component of infrastructure as code
  • Utilize existing test frameworks
  • Implement both:
    • Unit testing for infrastructure code
    • Integration testing for produced infrastructure
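
Unit testing infrastructure code can be as simple as asserting on the template before anything is provisioned. The sketch below assumes a CloudFormation-style dictionary and a hypothetical team policy that every resource carries tags. Not every resource type supports tags, so a real check would need exceptions.

```python
from typing import Any, Dict, List


def resources_missing_tags(template: Dict[str, Any]) -> List[str]:
    """Return logical names of resources that declare no Tags property."""
    missing = []
    for name, resource in sorted(template.get("Resources", {}).items()):
        if not resource.get("Properties", {}).get("Tags"):
            missing.append(name)
    return missing


def test_all_resources_are_tagged() -> None:
    template = {
        "Resources": {
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {"Tags": [{"Key": "team", "Value": "web"}]},
            }
        }
    }
    assert resources_missing_tags(template) == []
```

A test like this runs in the CI build with an ordinary test framework. Integration tests then check the infrastructure that was actually produced.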

Real-World Example

Enterprise SaaS Implementation

Tools Used:

  • Terraform: Base infrastructure, network, core servers
  • Puppet: Base image configuration
  • Packer: Image baking
  • Rundeck: Orchestration and updates

Process Flow:

  1. Infrastructure building with Terraform
  2. Configuration management with Puppet
  3. Image creation with Packer
  4. Orchestration via Rundeck
  5. Continuous integration pipeline for testing

Simplified System Example

Tools Used:

  • CloudFormation: Base infrastructure
  • Docker: Container creation
  • Amazon managed container service: Orchestration

Benefits:

  • Simpler implementation
  • Less maintenance overhead
  • Cost-effective
  • Suitable for immutable deployment

5. Continuous Delivery

Continuous Delivery Overview

Key Stages in Software Development

  1. Build Stage

    • Compile and test code
    • Convert code into software
  2. Deploy Stage

    • Run the software
    • Test the software
  3. Release Stage

    • Send software to end users
    • Deploy to production environment

Traditional vs. Modern Approaches

Old Way (Traditional)

  • Application built only at major milestones
  • Large, complex integration builds
  • Long test phases
  • Late bug detection
  • Error-prone and wasteful

Modern Approach (CI/CD)

  • Continuous Integration (CI)

    • Automatic building and unit testing
    • Occurs on every source code check-in
    • Maintains application in working state
  • Continuous Delivery (CD)

    • Deploys changes to production-like test environment
    • Automated integration and acceptance testing
    • Ensures application is always release-ready
  • Continuous Deployment

    • Automatically releases to production
    • Used by major companies (Amazon, Meta, Google, Wells Fargo)
    • Can lead to 10+ deployments daily

Benefits of CI/CD

Performance Improvements

  • Decreased deployment time
  • Faster market validation
  • Rapid experimentation
  • Lower change failure rate
  • Earlier bug detection

Key Advantages

  1. Quality

    • Testing occurs earlier in process
    • Changes evaluated one by one
    • Continuous working state maintained
  2. Recovery

    • Easier to identify failure sources
    • Quick bug fix deployment
    • Better problem isolation

Real-World Impact

Performance Metrics

  • High Performers: Changes reach production in under 1 hour
  • Low Performers: Changes take 1-6 months to reach production

Case Study Example

“By overlaying our database connection growth graph with the deploys that happened that week, we could quickly figure out precisely which production deployment correlated with the increase of database connections.”

DevOps Principles

  • Follows first way of DevOps (optimizing end-to-end flow)
  • Implements second way through fast feedback loops
  • Reduces Work in Progress (WIP)
  • Minimizes risk and waste from undelivered code

Common Problems Solved

  • Eliminates panic from monthly release cycles
  • Reduces error-prone manual releases
  • Prevents finger-pointing during issues
  • Enables quick problem identification and resolution

Six Practices for Continuous Integration

Overview of CI/CD Pipeline

  • Continuous Integration, Delivery, and Deployment form a pipeline
  • Stages flow in order: build → deploy → release
  • Each stage depends on successful completion of previous stage

Continuous Integration Basics

  • Purpose: Keep software in working state at all times
  • Process:
    • Automatically triggered build on each commit
    • Builds entire codebase
    • Runs unit tests and code validation
    • Packages artifact
    • Provides build status and log

Six Key Practices

1. Fast Builds

  • Should pass the “coffee test” (approximately 5 minutes)
  • Why: Longer builds lead to:
    • Developers batching changes
    • Increased Work in Progress (WIP)
    • System problems

2. Small Commits

  • Commit smallest possible amount of code
  • Benefits:
    • Easier for team to understand
    • Simpler failure isolation

3. Fix Broken Builds Immediately

  • Build breaks are normal and expected
  • Important: Don’t leave builds broken
  • Recommended:
    • Delay meetings until build is fixed
    • Stop all work until resolution
  • Sets tone for delivery culture

4. Use Trunk-Based Development

  • Two Main Development Approaches:
    1. Branch-based development
      • Developers work on separate branches
      • Long development time
      • Problematic merges
    2. Trunk-based development
      • No long-running branches
      • Multiple small changes daily
      • Always up-to-date trunk
  • Feature Management: Use feature flags instead of branches
  • Recommendation: Choose trunk-based approach
    • Minimizes WIP
    • Ensures frequent code review
    • Reduces merge issues

5. Address Flaky Tests

  • Fix unreliable tests immediately
  • Inconsistent test results reduce trust in CI system
  • Impacts build artifact reliability

6. Build Output Requirements

  • Status: Simple pass/fail or red/green indicator
  • Log: Detailed record of tests and results
    • Aids troubleshooting
    • Supports compliance
  • Artifact: Installable application version
    • Should be uploaded and tagged with build number
    • Ensures auditability and immutability

Action Item

“Take a moment and use the course notebook to reflect and write down the next steps you could take to implement some of these six practices in a build pipeline you work with.”

Five Practices for Continuous Delivery

Core Concept

“It’s not how much you can deliver, but how little.” - Jez Humble and Dave Farley

Pipeline Structure

  1. Build Stage
  2. Deployment Stage
    • Deploy successful build artifacts to live environment
    • Environment should mirror production
    • Names may vary: CI, staging, test, or pre-production
    • Automated testing follows deployment

Five Key Techniques

1. Artifact Management

  • Create single artifact upon successful build
  • Types of artifacts:
    • RPM or Debian packages
    • MSI installers
    • Java WAR files
    • ZIP files
  • Build once, use across all environments
  • No rebuilding for different stages

2. Artifact Immutability

  • Artifacts must remain unchanged throughout pipeline
  • Access Control:
    • CI system: Write access only
    • Deployment system: Read access only
  • Benefits:
    • Builds trust between teams during debugging
    • Enables verification through checksums
    • Maintains auditability
    • Allows tracing from code version → build artifact → running system
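
Checksum verification can be sketched with the standard `hashlib` module. The assumption here is that the CI system records a SHA-256 digest when it uploads the artifact and the deployment system recomputes it before installing. The function names are illustrative.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_digest: str) -> bool:
    """True only if the artifact is byte-for-byte what the build produced."""
    return sha256_of(data) == expected_digest
```

Any change to the artifact after the build, however small, produces a different digest and fails the check.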

3. Pre-production Environment

  • Must mirror production environment as closely as possible
  • Must Include:
    • Load balancers
    • Network settings
    • Security controls
    • Production-like data
  • Enables thorough testing:
    • Acceptance testing
    • Smoke tests
    • Integration tests

4. Pipeline Control

  • System must halt pipeline on any failure
  • Stop Points:
    • Broken build → No deployment
    • Failed deployment → No release
  • Focus on overall software delivery flow, not individual productivity
  • Team should collaborate to fix issues
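
The stop points above can be sketched as a loop that refuses to continue after a failure. Stage names and the boolean-returning steps are assumptions for the example. Real CI systems express the same rule through their pipeline configuration.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order; stop at the first one that returns False.

    Returns the names of the stages that were actually executed.
    """
    executed: List[str] = []
    for name, step in stages:
        executed.append(name)
        if not step():
            break  # later stages never run after a failure
    return executed
```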

5. Idempotent Deployments

  • Multiple deployment runs should yield identical results
  • Implementation Options:
    • Immutable packaging (Docker containers)
    • Configuration management tools (Puppet, Chef)
  • Eliminates variability in pipeline
  • Builds trust in deployment process

Note

The authors recommend reading “Continuous Delivery” by Jez Humble and Dave Farley for comprehensive understanding.

The Role of QA in DevOps

Introduction

  • Continuous delivery benefits:
    • Faster deployments
    • Fewer bugs
    • Less technical debt
    • Better dev-ops collaboration

The Importance of Automated Testing

  • Key Point: Automated testing is crucial for CI/CD success
  • Manual testing:
    • Considered slow and unreliable
    • Best reserved for final acceptance testing only
  • Modern QA role:
    • QA professionals work alongside developers
    • Focus on designing and writing tests
    • Let automation handle repetitive testing tasks

Testing Types (Bottom-up Approach)

1. Unit Testing

  • Most developer-centric testing
  • Characteristics:
    • Written by developers within the codebase
    • Validates individual function behavior
    • Fastest testing method
    • Uses stubs to bypass external dependencies
    • Run locally during development
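
A short sketch of stubbing an external dependency with the standard `unittest.mock` module. The `greeting` function and its `fetch_name` parameter are invented for this example.

```python
from unittest import mock


def greeting(user_id: int, fetch_name) -> str:
    """Build a greeting; `fetch_name` is the external dependency."""
    return f"Hello, {fetch_name(user_id)}!"


def test_greeting_uses_stubbed_lookup() -> None:
    fetch_name = mock.Mock(return_value="Ada")   # stub: no database call
    assert greeting(7, fetch_name) == "Hello, Ada!"
    fetch_name.assert_called_once_with(7)
```

Because the lookup is stubbed, the test needs no database or network and runs in milliseconds.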

2. Code Hygiene

  • Checks code against language/framework best practices
  • Implemented using:
    • Linters
    • Formatters

3. Integration Testing

  • Performed in test environment
  • Tests:
    • Individual component functionality
    • Inter-component interactions
    • All dependencies included

4. Acceptance/End-to-End Testing

  • Tests complete product from user perspective
  • Often UI-level testing
  • Can be automated
  • Manual verification still valuable for final checks

Test-Driven Development (TDD) & Behavior-Driven Development (BDD)

  • Write tests before implementing code
  • Process example:
    1. Write test for desired output
    2. Test fails initially
    3. Implement functionality
    4. Test passes when implementation is correct
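
The four steps can be illustrated with the standard `unittest` module. The `slugify` function is a made-up example. The point is only the order in which the test and the implementation were written.

```python
import unittest


def slugify(title: str) -> str:
    """Step 3: the implementation, written after the test below existed."""
    return "-".join(title.lower().split())


class SlugifyTest(unittest.TestCase):
    def test_spaces_become_hyphens(self) -> None:
        # Steps 1-2: this test was written first and failed until
        # slugify() was implemented.
        self.assertEqual(slugify("Hello DevOps World"), "hello-devops-world")
```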

Handling Slow Tests

Strategies:

  1. Parallel Execution

    • Run slow tests in parallel with the rest of the pipeline
    • Do not block on their results until the final release step
  2. Scheduled Testing

    • Nightly test suites
    • Regular scheduled runs
  3. Continuous Testing

    • Run against test environment
    • Accept possibility of non-critical bugs
    • Quick fixes possible in CD environment

Additional Testing Types

  • Infrastructure testing
  • Performance testing
  • Security testing
  • Browser compatibility testing
  • Compliance testing

Key Takeaway

“Getting good at automated testing is your single most significant factor in successful continuous delivery.”

Continuous Deployment Overview

Key Differences from Continuous Delivery

When to Consider Continuous Deployment

  • Organizations may not be ready for Continuous Deployment due to:
    • Need for manual test cycles
    • Product manager sign-off requirements
    • Preference for bundled changes over frequent small updates

Prerequisites

  • Strong CI/CD foundation
  • Automated approvals and testing within pipeline
  • Manual workflow steps can be integrated (like code reviews)
  • Feature flags enable pre-deployment of code before user access

“If you stay ready, you ain’t got to get ready.” - Suga Free

Release Stage Components

Process Flow

  1. Artifact passes all tests
  2. Artifact is marked as released
  3. Deployment to production environment
  4. Trigger notifications for:
    • Compliance
    • Internal communication
    • End user communication

Production Considerations

  • Complexity: Production releases often require significant engineering work
  • Challenges:
    • Packaged software: Focus on data and configuration compatibility
    • Running services: Must handle live users and flowing data
  • Important: Test environment must mirror production deployment procedures

Production Release Patterns

Types of Deployments

  1. Rolling Deployment

    • Upgrades one system at a time
    • Allows seamless traffic shifting
  2. Blue-Green Deployment

    • Creates entirely new version
    • Switches traffic from current (Blue) to new (Green) system
    • Can involve swapping environments or creating new ones in cloud
  3. Canary Deployment

    • Upgrades single system
    • Tests under production load
    • Monitors for issues
  4. A/B Deployment

    • Uses feature flags
    • Releases features to specific user subsets
    • Useful for:
      • Canary Testing
      • Public Beta Testing
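
Releasing a feature to a subset of users is often done by hashing the user into a stable bucket. The sketch below is one common way to do this, not the method of any specific feature-flag product. The flag name and user IDs are hypothetical.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically place a user inside or outside a rollout.

    The same (flag, user) pair always lands in the same bucket, so a user
    does not flip between variants from one request to the next.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100      # bucket in the range 0..99
    return bucket < percent
```

Raising `percent` from 1 to 100 over time widens the audience gradually, and users already included stay included.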

Real-World Implementation Example: Signal Sciences

System Overview

  • Built internal tool called “Deployer” (inspired by Etsy’s Deployinator)
  • Enabled company-wide deployment capabilities
  • Five-minute deployment time from commit to production

Key Features

  • Push-button deployment to staging
  • Automated testing
  • Self-service automation
  • Feature flag implementation
  • Gradual release strategy:
    1. Internal users
    2. Early adopters
    3. All customers

Success Factors

  • Strong CI/CD foundation
  • Self-deploying capability
  • Integration of Dev and Ops workflows
  • Focus on user experience

Important Considerations

  • Deployment strategy must align with:
    • Packaging choices
    • Infrastructure as code strategy
    • Software architecture
  • Requires collaboration across teams
  • System should be opinionated with clear, standardized procedures

DevOps CI Toolchain Overview

Approach to Building a CI Toolchain

  • Traditional approach: Start from developer and work outward
  • Recommended approach: “Onion Layer” model - start from outer layer and work inward
  • Focus on end-state perspective when considering the entire toolchain

Layer 1: Deployment (Outermost Layer)

Deployment Considerations

  • Determine how software will be deployed:
    • Containers
    • System images
    • Windows installers

Deployment Types & Tools

  • A/B Deployments

    • Requires feature flagging
    • Tools:
      • LaunchDarkly
      • Split
      • Custom-built solutions
  • Rolling Deployments

    • Requires orchestration tools
    • Platform-specific options:
      • Kubernetes
      • Serverless
      • Ansible
      • Salt

Layer 2: Artifact Repository

General Solutions

  • Artifactory
  • Nexus

Specialized Solutions

  • Cloud provider container repositories
  • Language-specific repositories (e.g., bit.dev for NPM)
  • Minimal solution: Build system tagging + Amazon S3

Layer 3: Building & Testing

Testing Categories

  1. Unit Testing

    • Language-specific tools (e.g., go test for Golang)
  2. Code Hygiene & Linters

    • ESLint (JavaScript)
    • Staticcheck (Golang)
  3. Integration Testing

    • Pytest (Python)
    • TestNG (Java)
  4. Acceptance/End-to-End Testing

    • Selenium
    • Cypress.io
    • Robot Framework
    • Postman

Additional Testing Types

  • Infrastructure Testing

    • InSpec
    • ChefSpec
  • Performance Testing

    • JMeter
    • LoadRunner
  • Security Testing

    • GitHub Dependabot
    • GitGuardian
    • Dryrun Security
    • StackHawk

Layer 4: Build System

Options

  • Jenkins (Open source)

    • Pros: Community support, wide integration
    • Cons: UI navigation challenges
  • SaaS Solutions

    • CloudBees
    • CircleCI
    • GitHub Actions

Layer 5: Version Control (Innermost Layer)

  • Git-based:
    • GitHub
    • GitLab
    • Bitbucket

Specialized Version Control

  • Perforce
  • PlasticSCM (for large binary assets)

Best Practices

  1. Track Cycle Time

    • Measure time from developer system to production
    • Record and share metrics with team
    • Actively work to improve cycle time
  2. Tool Selection

    • Choose tools that reduce overall cycle time
    • Consider integration capabilities
    • Factor in team expertise and requirements

6. Site Reliability Engineering

Site Reliability Engineering (SRE) Overview

Definition and Core Concepts

  • SRE is the practical operations component of DevOps
  • Engineering Definition: Application of theoretical principles to solve real-world problems
  • Reliability Definition: System’s ability to perform intended functions correctly and consistently
  • Encompasses:
    • Availability
    • Performance
    • Security
    • Service delivery capabilities

Origins

  • Originated at Google, focusing initially on website reliability
  • Google published a free online book titled “Site Reliability Engineering”

Key Components

1. Operational Aspects

  • Monitoring production services
  • Managing systems
  • Problem resolution
  • Automation of operational processes

2. Patrick Debois’ Four Key DevOps Areas

  1. Extending delivery to production
  2. Extending feedback from operations to dev
  3. Embedding dev into operations
  4. Embedding ops into dev

3. Holistic Approach

Two main components:

  1. Building Reliability

    • Focus on constructing resilient systems
    • Emphasis on maintainability
    • Engineering for reliability from the start
  2. Operational Feedback

    • Observability practices
    • Incident response procedures
    • Production operations feedback loop

Business Impact and Metrics

SRE improves key performance indicators:

  • Change Failure Rate

    • Reduces production issues through:
      • Reliability testing
      • Deployment automation
  • Time to Restore Service

    • Improves through:
      • Enhanced problem detection
      • Operational automation
      • Disciplined processes
  • Service Level Objectives

    • Better meeting of uptime goals
    • Improved performance targets
    • Enhanced through observability and resilience

Important Notes

“You can’t just bolt on reliability once something goes live.”

  • SRE requires proactive engineering approach
  • Combines both development and operational perspectives
  • Requires continuous improvement through feedback loops

Building for Reliability: Design Theory

Core Concepts

  • Success in production largely depends on design-time decisions and software architecture
  • Focus on creating reliable applications through thoughtful planning
  • Important to understand how applications work in real systems environments

Key Resources

1. “Release It!” by Michael Nygard

  • Equivalent to “Gang of Four Design Patterns” but focused on stability
  • Key Findings:
    • Integration points are the #1 cause of architectural issues
    • Cascading failures are the biggest threat to stability in layered architecture
Example of Cascading Failure:
  • Database layer issues can lead to:
    • Exhaustion of database connection pools
    • Application server tier choking

2. Circuit Breaker Pattern

  • Purpose: Prevents cascading outages
  • Functionality:
    • Monitors integration point failures/slowness
    • Stops making calls when unusual failure rates detected
    • Works with timeouts to prevent outage spread
  • Implementation: Available through libraries like Resilience4j
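
A simplified Python sketch of the pattern follows. It counts consecutive failures rather than failure rates or slow calls, and it is not the Resilience4j API. Class, parameter, and exception names are chosen for this example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the integration point."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after      # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            self.opened_at = None           # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0                   # success closes the circuit
        return result
```

While the circuit is open, callers fail fast instead of piling up on a struggling dependency, which is what stops the failure from cascading.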

3. Twelve-Factor App (12factor.net)

  • Manifesto for service-ready software
  • Example - Factor 3 (Config):
    • Separate runtime configuration from app code
    • Store in environment variables
    • Keep configurations independent
    • Avoid environment groupings
    • Benefits: Reduces fragility and improves portability
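
A minimal sketch of reading configuration from environment variables. `DATABASE_URL` and `LOG_LEVEL` are hypothetical variable names used only for illustration.

```python
import os
from typing import Mapping


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read runtime settings from the environment, not from code.

    DATABASE_URL and LOG_LEVEL are hypothetical variable names.
    """
    return {
        "database_url": env["DATABASE_URL"],          # required: KeyError if unset
        "log_level": env.get("LOG_LEVEL", "INFO"),    # optional, with a default
    }
```

The same build artifact can then run in any environment, with only the variables changing between them.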

4. Martin Fowler’s Resources

  • Provides concise descriptions of architectural concepts:
    • Page objects
    • Serverless
    • Bimodal IT
    • DevOps topics
  • Perspective from experienced software engineer

Modern Architecture Considerations

  • Microservice architectures have multiple integration points
  • Higher likelihood of integration point failures
  • Need for robust stability patterns and solutions

Action Items

  • Schedule focused time to study these patterns
  • Evaluate patterns based on your technical ecosystem
  • Consider implementing solutions for common production failures

“Take a minute to schedule some focus time on your calendar to look more deeply into these patterns and consider which may have value in your own particular technical ecosystem today based on the kinds of failures that you see in production.”

Building for Reliability: Key Principles

Core Concept: Dev vs Ops Background

“Dev comes from school, but Ops comes from the street.”

  • Developers typically have computer science backgrounds
  • System administrators often self-taught through real-world experience
  • SRE bridges ops experience with disciplined engineering approach

Understanding System Failure

Fundamental Truths

  • All systems fail
  • Individual components fail frequently
  • Slowdowns are as threatening as complete outages
  • Systems often run in degraded mode

Swiss Cheese Model

  • System components are like stacked Swiss cheese slices
  • Problems occur when holes (failures) align
  • Multiple layers provide protection against complete failure

Richard Cook’s “How Complex Systems Fail”

Key findings:

  1. Changes introduce new forms of failure
  2. Complex systems contain latent failures
  3. Complex systems always run in degraded mode

System Availability Metrics

  • Measured in “nines” of availability
  • Examples:
    • Three nines (99.9%): 8.77 hours downtime/year
    • Five nines (99.999%): 5.26 minutes downtime/year
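
The downtime figures follow directly from the availability percentage. The calculation below assumes an average year of 365.25 days.

```python
HOURS_PER_YEAR = 365.25 * 24   # 8766 hours, averaging in leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


print(round(downtime_hours_per_year(99.9), 2))          # three nines, in hours: 8.77
print(round(downtime_hours_per_year(99.999) * 60, 2))   # five nines, in minutes: 5.26
```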

Resilience Engineering

Definition

“Resilience is the intrinsic ability of a system to maintain or regain a dynamically stable state, which allows it to continue operations after a major mishap and/or in the presence of a continuous stress.”

Key Tools and Approaches

  1. Redundancy

    • Multiple identical copies of components
    • Maintains service if one fails
  2. Load Balancing

    • Directs traffic to healthy system parts
    • Traffic shaping for optimal performance
  3. Automatic Scaling

    • Adds resources as needed
    • Eliminates need for manual server upgrades

Example: Kubernetes

  • Runs redundant copies of core services
  • Built-in health checking
  • Automatic failover
  • State replication across multiple locations

Sociotechnical Systems

Important Considerations

  • People are integral parts of the system
  • Human actions can both break and maintain system health
  • Systems are always partially broken
  • Expert intervention is necessary

SRE Best Practices

  1. Time Management

    • At least 50% of time should be spent developing tools
    • Focus on automation over manual fixes
  2. Developer Involvement

    “You write it, you run it”

    • Developers should be on-call for their code
    • Must be proficient with debugging and monitoring tools
    • Required to support services until proving stability
  3. Documentation

    • Create comprehensive runbooks
    • Document safe intervention procedures
    • Establish monitoring and control systems

Key Takeaway

Building reliable systems isn’t about achieving perfect uptime. It is about creating resilient systems that maintain functionality despite partial failures, supported by skilled practitioners who maintain and improve them.

Observability in Systems

Overview

  • Observability measures how well internal system states can be understood from external outputs
  • Goal: Understanding system state through metrics and logs to enable action and improvement
  • Supports the Three Ways principles through feedback loops

Five Key Areas of Observability

1. Synthetic Checks

  • Also known as health checks
  • Programmatic testing of service performance and uptime
  • Not based on real user traffic
  • Answers basic question: “Is it working?”
  • Can be implemented at both:
    • High-level service checks
    • Sub-component levels
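
A synthetic check can be sketched as a function that probes an endpoint and reports up or down. The `fetch_status` callable is an assumption of this example. It is injected so the check itself can be tested without a network.

```python
from typing import Callable


def check_service(fetch_status: Callable[[str], int], url: str) -> bool:
    """Return True when the endpoint answers with a 2xx status.

    `fetch_status` is injected so the check can be tested without a network;
    any exception from it (timeout, refused connection) counts as "down".
    """
    try:
        return 200 <= fetch_status(url) < 300
    except Exception:
        return False
```

In practice `fetch_status` might wrap something like `urllib.request.urlopen(url, timeout=5).status`, run on a schedule from outside the service.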

2. System and Application Metrics

System Metrics
  • Measures fundamental system resources:
    • CPU usage
    • Memory utilization
  • Time-series data, typically displayed as graphs
  • Helps determine normal functioning
Application Metrics (Custom)
  • Application-specific measurements
  • More diagnostic than system metrics
  • Examples:
    • Function call duration
    • Login counts
    • Error event frequency
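
Function call duration, for example, can be captured with a small decorator. The in-memory `durations` store and the `login` function are stand-ins for this sketch. A real service would send the values to its metrics system.

```python
import functools
import time
from collections import defaultdict
from typing import Callable, DefaultDict, List

# In-memory store for illustration; a real service would ship these
# values to its metrics backend instead.
durations: DefaultDict[str, List[float]] = defaultdict(list)


def timed(func: Callable) -> Callable:
    """Record the wall-clock duration of every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            durations[func.__name__].append(time.perf_counter() - start)
    return wrapper


@timed
def login(user: str) -> bool:   # hypothetical application function
    return bool(user)
```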

3. Performance Metrics

Application Performance Monitoring (APM)
  • Code-level performance instrumentation
  • Measures:
    • Function execution time
    • API call duration
    • Database query performance
Real User Monitoring (RUM)
  • Front-end instrumentation (e.g., JavaScript page tags)
  • Captures actual user experience
  • Provides direct insight into customer experience
Tracing
  • Tracks requests across multiple services
  • Measures duration of each component
  • Useful for complex system analysis

4. System and Application Logs

  • Provides detailed contextual information
  • Answers key questions:
    • What happened?
    • When did it happen?
    • Where did it happen?
    • What was involved?
  • Use cases:
    • Problem detection
    • Troubleshooting
    • Audit and compliance
    • Capacity planning
    • Security forensics
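
Logs answer those questions most easily when every entry records them as named fields. The sketch below emits one JSON line per event. The field names and the sample event are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def log_event(what: str, where: str, **involved: object) -> str:
    """Render one log line that records what, when, where, and what was involved."""
    record = {
        "when": datetime.now(timezone.utc).isoformat(),
        "what": what,
        "where": where,
        "involved": involved,
    }
    return json.dumps(record, sort_keys=True)
```

For example, `log_event("login_failed", "auth-service", user="ada", attempts=3)` yields a line that can be searched and aggregated by field rather than by free-text matching.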

5. Security Monitoring

  • Utilizes existing logs and metrics
  • Focuses on threat detection
  • Monitors for:
    • Indicators of compromise
    • Suspicious endpoints
    • Connections from known bad IPs
    • Bad configurations
    • Unusual behavior
  • Example alerts:
    • Login failure spikes
    • Website injection attempts
    • Malformed network requests

Best Practices

  • Analyze which monitoring types best support production systems
  • Use monitoring data to help development teams improve applications
  • Collaborate between operations and development
  • Encourage improved custom metrics and logging
  • Use production data to drive product improvements

“Monitoring isn’t just for production performance and uptime, it’s also a source of valuable information to developers about how the service is really used out in production.”

Incident Response and Retrospectives

Core Concepts of Incident Response

System Reality

  • All systems are sociotechnical systems with humans as part of their resilient operation
  • Even with excellent design, development, testing, and monitoring, systems will still experience failures
  • Getting good at responding to and remediating problems is a crucial part of the job

Key Activities for Incident Response

  1. Troubleshooting

    • Requires in-depth system knowledge
    • Need ability to diagnose and remediate problems
  2. Automation

    • Having pre-created tooling
    • Enables faster and safer information gathering
    • Supports remediation activities
  3. Communication

    • Often requires team of specialists
    • Need to keep business stakeholders informed
    • Must update end users on situation

Incident Management Process

  • Inspired by Incident Command System (ICS)
    • Originally created in 1968 to fight California wildfires
    • Now recommended by UN as international standard
  • Key aspects:
    • Incident detection and reporting
    • Participant coordination
    • Customized to the organization, team, and product

Post-Incident Analysis

Modern Approach vs Traditional

  • Moving away from traditional “root cause analysis”
  • Avoiding blame-focused investigations
  • Recognition that human error shouldn’t cause major outages
    • If it does, the system needs improvement
    • Systems should be resilient to mistakes

Effective Postmortem Principles

  1. Multiple Causes

    • No single root cause
    • Consider deficiencies at multiple levels:
      • Testing
      • Monitoring
      • Processes
  2. Blame-Free Analysis

    • Understand actions from practitioners’ point of view
    • Recognize decisions made with best available information
    • Address cognitive biases
    • Focus on system improvement
  3. Transparency

    • Open communication during incidents
    • Clear stakeholder updates
    • Honest post-incident reporting
    • Builds trust and goodwill

“Real talk moment. Organizations have performed so-called root cause analyses for decades. These are usually a thinly veiled attempt to find somebody to blame for an outage. But if someone making a mistake can cause a major outage, your system itself is terrible and not resilient and it needs to improve.”

Best Practices

  • Practice incident response regularly
  • Keep a cool head during incidents
  • Focus on system improvement rather than blame
  • Document and learn from each incident
  • Share learnings transparently

DevOps SRE Toolchain Overview

Two Main Components

1. Building for Reliability

  • Highly dependent on programming language and tech stack
  • Focus on libraries and development techniques rather than tools
  • Requires collaboration between dev and ops at design time
  • Resources available:
    • Technical books
    • Libraries (e.g., Java’s Resilience4j)

2. Operational Feedback

  • Common set of observability and incident response tools
  • Rich ecosystem of options:
    • SaaS Solutions:
      • Datadog
      • Honeycomb
      • Sumo Logic
    • Open Source Tools:
      • Nagios
      • Grafana
      • Prometheus
    • Commercial Software:
      • SolarWinds
      • Splunk

Five Key Areas of Observability

  1. Synthetic checks
  2. System and application metrics
  3. End-user performance
  4. System and application logs
  5. Security monitoring

Lean Approach to Observability Implementation

Build-Measure-Learn Cycle

  1. Build: Create minimum viable monitoring stack

    • Basic endpoint synthetic monitors
    • Basic system monitoring
    • Performance latency from logs
  2. Measure: Collect metrics from all monitoring areas

  3. Learn:

    • Analyze application stack with monitoring
    • Identify areas needing more detailed metrics
    • Evaluate effectiveness of app logs
    • Iterate and improve as needed
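
For the "performance latency from logs" item in the Build step, a minimum viable version might just parse latencies out of access logs and report percentiles. The sketch below assumes a hypothetical log line that contains a `latency_ms=<number>` field and uses the nearest-rank percentile method. Both are choices made for this example.

```python
import math


def parse_latency_ms(line):
    """Extract latency from a hypothetical log line containing 'latency_ms=<number>'."""
    for field in line.split():
        if field.startswith("latency_ms="):
            return float(field.split("=", 1)[1])
    return None


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample: nine fast requests and one slow outlier.
logs = [
    f"GET /health status=200 latency_ms={ms}"
    for ms in (12, 15, 11, 240, 14, 13, 16, 18, 17, 19)
]
latencies = [ms for ms in map(parse_latency_ms, logs) if ms is not None]
print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))
```

Reporting a high percentile next to the median shows the slow outlier that an average would hide, which helps in the Learn step when deciding where more detailed metrics are needed.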

“Monitoring you don’t use, that’s waste.”

Best Practices and Considerations

Stakeholder Access

  • Make monitoring accessible to:
    • Developers
    • Product managers
    • Business decision makers

Custom Development

  • Create custom visualizations when needed
  • Focus on making monitoring meaningful to different stakeholders

Incident Response Tools

Popular Solutions:

  • PagerDuty (SaaS)

    • Handles alerts from observability tools
    • Manages on-call scheduling
    • Provides escalation workflows
  • Other Options:

    • VictorOps
    • Opsgenie
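
To make "escalation workflows" concrete, the sketch below models an escalation policy as an ordered list of responders, each with a wait time before the alert moves on. This is a simplified model for illustration only. It does not use any vendor's API, and the responder names and wait times are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EscalationLevel:
    responder: str
    wait_minutes: int  # how long to wait for an acknowledgement before escalating


def current_responder(policy, minutes_unacknowledged):
    """Return who should be paged after an alert has gone unacknowledged this long."""
    elapsed = 0
    for level in policy:
        elapsed += level.wait_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    # Once every level has timed out, stay on the last one.
    return policy[-1].responder


# Hypothetical three-level policy.
policy = [
    EscalationLevel("primary-on-call", 15),
    EscalationLevel("secondary-on-call", 15),
    EscalationLevel("engineering-manager", 30),
]

for minutes in (0, 20, 45):
    print(minutes, "min ->", current_responder(policy, minutes))
```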

Runbook Automation Tools

  • Rundeck (Open source, commercial, and SaaS options)
  • Ansible Tower
  • StackStorm

Status Page Tools

  • Atlassian Statuspage
  • Status.io

Key Takeaways

  1. Keep solutions simple
  2. Consider team collaboration needs
  3. Iterate and improve based on actual usage
  4. Focus on specific use cases
  5. Be prepared to develop custom tooling as needed

7. Advanced Topics

Platform Engineering: The Paved Road

The Challenge of Scale

  • Organizations face difficulties in managing:
    • Infrastructure as code
    • Continuous builds
    • Incident response
    • Security and compliance
  • Key Problem: As value streams multiply, solution diversity can lead to chaos

The Automation Solution

Pioneer Companies

  • Organizations that first tackled extreme DevOps scale:
    • Netflix
    • Meta
    • Google
    • Spotify
  • These companies invested in self-service automation

The Paved Road Concept

  • Also known as the “golden path”
  • Evolution from early DevOps “wilderness trail blazing”
  • Creates an opinionated framework for standardized processes
  • Benefits:
    • Easier adoption
    • Shared improvements
    • Simplified team transitions between projects

Common Implementation Examples

  1. CI/CD Pipelines

    • Automated check-in hooks
    • Automatic test runs on pull requests
    • Automated test deployments
  2. Self-Service Platforms

    • Cloud account provisioning
    • HPC cluster setup for machine learning
    • Built-in security guidance
    • Automated compliance

Platform Engineering Evolution

Definition

“Platform engineering is the discipline of designing and building tool chains and workflows that enable self-service capabilities for software engineering organizations.”

Components

  • Development environment
  • Testing capabilities
  • Deployment automation
  • Infrastructure creation
  • Observability
  • Security
  • Runtime environment
  • Scaling
  • Service discovery

Success Factors

1. Product Management Approach

  • Platforms must serve users, not creators
  • Key principles:
    • Focus on user requirements
    • Ensure product quality
    • Market the platform internally
    • Keep usage voluntary

2. Lean Implementation

  • Avoid over-building platforms
  • Follow the progression:
    1. Blaze the trail
    2. Pave the road
    3. Build the train
  • Focus on actual user needs
  • Maintain flexibility for innovation

Warning Signs vs. Good Practices

Warning Signs

  • Centralized control focus
  • Mandatory usage
  • Optimization for central team needs
  • Excessive upfront building

Good Practices

  • Global system optimization
  • Value stream focus
  • User-centric design
  • Incremental development
  • Flexibility for innovation

Key Differentiator

The main difference between modern platforms and traditional centralized IT is the focus on:

  • User empowerment
  • Value optimization
  • Flexibility
  • Continuous improvement based on actual needs

DevSecOps: Security in the DevOps Way

Traditional Security Challenges

  • Historical tension between security and technical teams
  • Security originally handled by sysadmins and developers
  • InfoSec specialization created new silos
  • Typical staffing ratio problem:
    • 100 developers
    • 10 operations staff
    • 1 security person

Common Issues

  • Security teams have different priorities
  • Focus often compliance-oriented
  • Appears as “busy work” to development teams
  • Security teams understaffed and downstream
  • Developers care about security but lack:
    • Time
    • Clear direction from security teams

DevSecOps Introduction

“If security introduces blocking to the organization, it will be ignored, not embraced.” - Zane Lackey and Rich Smith (Etsy)

CAMS Framework with Security Lens

1. Culture
  • Security works alongside developers
  • Avoid creating blocking gates
  • Prevent value stream from routing around security
2. Automation (Shifting Left)
  • Introduce security earlier in development
  • Implement security tools in:
    • IDE
    • CI systems
  • Warning: Avoid common pitfalls
    • Don’t dump security work on developers
    • Prevent bloated build times
    • Avoid forcing developers to parse complex security tools
  • Focus on minimal impact on cycle time
3. Sharing
  • Build bridges between teams
  • Create security champions program
    • Methods to identify champions:
      • Host Capture the Flag events
      • Search code repos for security bug fixers
      • Ask for volunteers
  • Benefits:
    • Security team trains champions
    • Champions help understand team concerns
    • Improves communication between teams
4. Measurement
  • Establish security observability
  • Create joint team goals
  • Avoid FUD (Fear, Uncertainty, Doubt) approach
  • Focus on metrics that matter
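
The Automation point above (security checks in the IDE or CI with minimal impact on cycle time) can be sketched as a small scan over changed lines. The two patterns below are illustrative only and far from complete. The function name `scan_lines` and the sample input are assumptions, and the key shown is a made-up placeholder.

```python
import re

# Illustrative patterns only; a real scanner would cover far more cases.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_lines(lines):
    """Return (line_number, pattern_name) for each suspicious line."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings


# Hypothetical changed lines from a commit; the key is a made-up placeholder.
changed = [
    'region = "us-east-1"',
    'key = "AKIAABCDEFGHIJKLMNOP"',
    "-----BEGIN RSA PRIVATE KEY-----",
]
for number, name in scan_lines(changed):
    print(f"line {number}: possible {name}")
```

Scanning only the changed lines keeps the check fast, and reporting a line number with a short label avoids making developers parse a complex security report.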

Key Takeaways

  • Security is critical regardless of terminology
  • Modern approaches focus on integration
  • DevSecOps bridges gap between security and development
  • Success requires balance between security needs and development efficiency

Kubernetes and Cloud Native Overview

What is Kubernetes?

  • An open-source container orchestration system that automates:
    • Software deployment
    • Scaling
    • Management
  • Provides a platform for running containerized applications

Key Benefits

Automation and Features

  • Automates infrastructure plumbing
  • Provides standardized management features:
    • Observability
    • Service discovery
    • Health monitoring
    • Custom networking
  • Developers get built-in capabilities without additional development

Infrastructure Abstraction

  • Manages compute, networking, and storage
  • Enables multi-cloud deployment
  • Standardizes deployment across:
    • On-premise environments
    • Different cloud providers
  • Simple deployment process:
    1. Containerize application
    2. Specify redundancy requirements
    3. Deploy across cluster nodes
    4. Expose API
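
The four deployment steps above map fairly directly onto two Kubernetes objects: a Deployment (the container image plus a replica count) and a Service (the exposed endpoint). The sketch below builds both as plain dictionaries and prints them as JSON. The application name, image URL, and port are hypothetical, and a real manifest would usually carry more settings such as resource limits and health probes.

```python
import json


def deployment_manifest(name, image, replicas, port):
    """Build an apps/v1 Deployment that runs `replicas` copies of a container image."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,  # step 2: redundancy requirement
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        # step 1: the containerized application
                        {"name": name, "image": image, "ports": [{"containerPort": port}]}
                    ]
                },
            },
        },
    }


def service_manifest(name, port):
    """Build a v1 Service that exposes the Deployment's pods inside the cluster."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": {"app": name},  # step 4: route traffic to the pods
            "ports": [{"port": port, "targetPort": port}],
        },
    }


# Hypothetical application name and image.
manifests = [
    deployment_manifest("demo-api", "registry.example.com/demo-api:1.0", replicas=3, port=8080),
    service_manifest("demo-api", 8080),
]
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

With access to a cluster, the printed JSON could be piped to `kubectl apply -f -`. Step 3 (spreading the replicas across nodes) is then handled by the Kubernetes scheduler rather than by this code.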

Cloud Native Computing Foundation (CNCF)

Understanding “Cloud Native”

  • Working definition: in practice, often shorthand for “built to work with Kubernetes” (an informal reading; CNCF’s own definition is broader)
  • Not limited to cloud environments
  • Large ecosystem of tools and products
  • CNCF maintains an interactive tool landscape

Challenges and Considerations

1. Complexity

  • Highly configurable with numerous options
    • 20+ choices for network backplane alone
  • Requires integration of multiple tools
  • Complex upgrades and interoperability
  • Steep learning curve

2. Resource Requirements

  • Significant costs:
    • A basic three-server cluster can cost hundreds of dollars monthly
    • Requires dedicated administration team
  • Not suitable for lightweight management by dev teams

3. Implementation Risks

  • Can work against DevOps goals if not carefully managed
  • Potential creation of silos
  • Risk of increased waste
  • Requires systems thinking and CAMS values alignment

Best Practices

  • Start simple
  • Add complexity only when necessary
  • Ensure thorough understanding of platform behavior
  • Consider alternatives:
    • Serverless solutions
    • Lighter container orchestration
    • Can provide 80% of benefits with 20% of effort

“Kubernetes is a good tool to learn about and a very popular tool. Just make sure it’s the right tool for the job at hand.”

Chaos Engineering in DevOps

Introduction to Chaos Engineering

  • Definition: The discipline of experimenting on a system to build confidence in its capability to withstand real-world production conditions
  • Core Concept: Creating deliberate adversity for systems to test and improve resilience

Netflix’s Chaos Monkey

Origin Story

  • Netflix developed Chaos Monkey while running large cloud clusters for millions of users
  • Goal: Build systems resilient to component failures (servers, network links, etc.)

Why It Was Needed

  • Traditional CI/CD testing proved insufficient
  • Couldn’t replicate the full complexity of:
    • Thousands of interconnected components
    • Real production environment
    • Interactions with millions of users

How It Works

  • Chaos Monkey deliberately breaks components in the production system
  • Purpose: Forces technical teams to ensure system resilience
  • Resulted in continuous improvement of system resilience through controlled failures

Modern Chaos Engineering Practices

Controlled Testing Approach

  • Not random destruction, but structured experiments
  • Requires proper fault testing during development
  • Validates automatic fault remediation
  • Tests human intervention scenarios

Game Days

  • Structured activities for testing incident response
  • Either emulate faults or create real ones
  • Tests the human component of sociotechnical systems
  • Validates incident response procedures

Kubernetes Chaos Engineering Example

Testing Scenarios

  • Multiple failure points to test:
    • Server failures
    • Application container failures
    • Network issues
    • Container repository problems
    • DNS failures

Key Learnings

  • System behavior often differs from assumptions
  • Important aspects to monitor:
    • Recovery capability
    • Recovery speed
    • System behavior during reconnection
    • Impact on human operators
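
The shape of a structured chaos experiment (confirm steady state, inject a fault, observe recovery capability and speed) can be shown with a toy simulation. `ToyCluster` below is a hypothetical stand-in that restores one replica per tick. It is not a Kubernetes client, and a real experiment would measure an actual system.

```python
class ToyCluster:
    """A stand-in for an orchestrator that restores one replica per tick."""

    def __init__(self, desired_replicas):
        self.desired = desired_replicas
        self.running = desired_replicas

    def kill_replicas(self, count):
        self.running = max(0, self.running - count)

    def reconcile(self):
        if self.running < self.desired:
            self.running += 1

    def healthy(self):
        return self.running == self.desired


def run_experiment(cluster, kill_count, max_ticks=10):
    """Inject a fault, then report whether and how quickly the cluster recovers."""
    if not cluster.healthy():
        raise RuntimeError("steady state must hold before injecting a fault")
    cluster.kill_replicas(kill_count)
    for tick in range(1, max_ticks + 1):
        cluster.reconcile()
        if cluster.healthy():
            return {"recovered": True, "ticks_to_recover": tick}
    return {"recovered": False, "ticks_to_recover": None}


print(run_experiment(ToyCluster(desired_replicas=3), kill_count=2))
```

The recovery budget (`max_ticks`) acts as the hypothesis. A run that does not recover within it counts as a finding to investigate, not as a failed test.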

Benefits and Philosophy

  • Similar to automobile crash testing methodology
  • Promotes innovative thinking
  • Breaks traditional constraints
  • Supports becoming a learning organization
  • Aligns with DevOps culture and the Three Ways
  • Emphasizes learning through feedback loops

“IT systems aren’t binary, at least not above the chip level.”

Best Practices

  • Conduct thorough development testing first
  • Create structured experiments
  • Test both automated and human responses
  • Monitor and learn from results
  • Apply learnings to improve system resilience

MLOps: DevOps for Machine Learning Systems

Introduction to MLOps

  • Definition: Combination of Machine Learning and DevOps practices
  • Current Context:
    • ML has historically been used mainly by:
      • Large social media companies
      • Scientists and engineers
    • Recent boom in generative AI has led to widespread adoption
    • Business value often requires:
      • Private data handling
      • Training private ML models

Key Differences from Traditional DevOps

1. User Base Characteristics

  • Primary Users: Data scientists (vs. developers)
    • Often less familiar with computer systems
    • Work is tightly coupled with hardware
    • Requires close collaboration and empathetic support

2. Additional Components to Manage

  • Beyond traditional code and infrastructure:
    • Data versioning
    • Model management
    • Massive datasets
    • ML models (pattern-finding algorithms)
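
One simple way to think about data versioning is to derive the version ID from the data itself, so that any change to the dataset produces a new ID. The sketch below hashes the records in order. The function name `dataset_version` and the 12-character ID length are assumptions for illustration, and dedicated data-versioning tools do considerably more than this.

```python
import hashlib


def dataset_version(records):
    """Derive a version ID from dataset content, so any change yields a new ID."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return digest.hexdigest()[:12]


# Hypothetical training rows before and after a label correction.
original = ["cat,0.91", "dog,0.87"]
corrected = ["cat,0.91", "dog,0.78"]
print(dataset_version(original), dataset_version(corrected))
```

Recording this ID next to each trained model makes it possible to tell which data produced which model.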

3. Infrastructure Requirements

  • Training Workloads:
    • Intensive batch jobs
    • Run on HPC (High Performance Computing) clusters
    • Characteristics:
      • Highly optimized systems
      • Integrated compute, storage, and network
      • GPU utilization
      • Typically expensive

4. Results Tracking and Governance

  • Different from Traditional Testing:
    • AI systems don’t provide single correct answers
    • Varying output quality
    • Continuous modification for improvement
    • Rich feedback loop beyond pass/fail testing

Production Aspects

1. Inference and Training

  • Initial training followed by inference in production
  • Continuous learning from user input
  • Need to detect drift in AI predictions
  • Vector databases:
    • Large scale storage
    • Growing costs with long-term inference memory
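
Drift detection can start very simply: compare recent prediction scores with a baseline and flag a large shift. The sketch below measures how far the recent mean has moved, in units of the baseline standard deviation. The threshold of 2.0 and the sample scores are assumptions for illustration, and production systems typically use richer statistical tests.

```python
import statistics


def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    spread = statistics.stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation; choose a different drift measure")
    return abs(statistics.mean(recent) - statistics.mean(baseline)) / spread


def has_drifted(baseline, recent, threshold=2.0):
    """Flag drift when the recent mean moves more than `threshold` deviations."""
    return drift_score(baseline, recent) > threshold


# Hypothetical model confidence scores.
baseline_scores = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.67, 0.70]
steady_scores = [0.69, 0.71, 0.70, 0.72]
shifted_scores = [0.52, 0.55, 0.50, 0.53]

print("steady drifted:", has_drifted(baseline_scores, steady_scores))
print("shifted drifted:", has_drifted(baseline_scores, shifted_scores))
```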

2. Development Value Stream

Three parallel CI/CD processes for:

  1. Software
  2. Infrastructure
  3. Data and models

Key Participants

  • Developers
  • Operations teams
  • Data scientists

Success Factors

  • Application of core DevOps concepts:
    • Automation
    • Measurement
    • Continuous monitoring
    • Collaborative approach

Future Outlook

  • AI becoming a fundamental computing pattern
  • Increasing need for DevOps professionals to understand MLOps
  • Growing importance in business operations

“AI is here to stay as a fundamental computing pattern and workload, so more and more DevOps professionals will need to understand it in the future.”

AIOps: AI Integration in DevOps Work

Key Principles

  • AI doesn’t replace engineers
  • Engineers remain responsible for evaluating and testing AI outputs
  • Cannot blindly implement AI-generated code without proper validation

Why AI in DevOps?

Current Challenges

  • Too many tools and vendors
  • Inconsistent documentation
  • Complex architectures
  • Excessive specialized knowledge requirements
  • Cognitive overload from context switching

Practical Applications

1. API and Code Work

  • Assists with command-line operations
  • Creates integration scripts
  • Helps with data transformation
  • Facilitates webhook interactions

2. Natural Language Queries

  • Prompt Engineering Tips:
    • Set context (e.g., “respond as a DevOps engineer with Linux experience”)
    • Request explanations at different complexity levels
    • Frame questions effectively for better results
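
The prompt tips above amount to putting context and audience ahead of the question. A small helper makes that habit repeatable. The function name `build_prompt` and the exact wording of the template are assumptions for illustration, and the resulting string would be sent to whichever AI assistant is in use.

```python
def build_prompt(question, role, audience_level):
    """Assemble a prompt that sets context before asking the question."""
    return (
        f"Respond as {role}.\n"
        f"Explain at a level suitable for {audience_level}.\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    question="Why might a Linux server run out of inodes while disk space remains?",
    role="a DevOps engineer with Linux experience",
    audience_level="a junior engineer",
)
print(prompt)
```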

3. Code Management

  • Code refactoring
  • Documentation writing
  • Pipeline documentation
  • Language conversion

    Example: Rob Hirschfeld’s approach of converting Terraform to AWS CLI and Bash

4. Monitoring and Security

  • Enhanced detection and alerting
  • Automated remediation recommendations
  • Security testing and code review
    • Can be faster than manual review
    • Can improve accuracy, though engineers still validate the findings
    • Acts as an “automated security buddy”

Future Evolution: The Three Waves

Wave 1: Code Generation

  • AI-assisted coding
  • Automated code reviews
  • Test generation
  • Documentation
  • Tools: GitHub Copilot, Amazon CodeWhisperer

Wave 2: Systems Management

  • Advanced alerting
  • Monitoring cluster health
  • Automated runbooks
  • System explanation
    • Example: k8sgpt for Kubernetes system state explanation

Wave 3: Human Integration

  • Self-service functionality
  • Enhanced cross-team collaboration
  • Platform accessibility improvements
  • Better business integration

Conclusion

AIOps will:

  • Make DevOps more approachable
  • Improve business integration
  • Transform work methods without replacing human expertise
  • Enhance efficiency and accessibility across organizations

8. DevOps Career

DevOps Career Guide and Resources

Career Perspectives in DevOps

DevOps as a Mindset

  • DevOps is not just a job title but a mindset and suite of practices
  • Applicable across various technical roles
  • Focuses on improving technology organization results

Role-Specific Applications

For Developers:

  • Can remain developers while incorporating DevOps principles
  • Focus on building reliable applications
  • Better understanding of build and testing structures
  • Improved instrumentation for production environments

For System/IT Administrators:

  • May be titled “DevOps Engineer”
  • Key skills include:
    • Infrastructure as code
    • System reliability design
    • Observability platform implementation
    • Runbook automation

Specialized DevOps Roles:

  • Platform Engineers
  • Automation Experts
  • Build Engineers/Release Managers
  • Site Reliability Engineers (SREs)
  • Specialized positions in large organizations:
    • Incident Managers
    • Application Performance Management Teams

Beyond Technical Roles

  • Security Engineers → DevSecOps Engineers
  • Applicable to non-technical roles:
    • Sales
    • Marketing
    • Product Management
    • Executive positions

Learning Resources

Top 10 DevOps Books

  1. The DevOps Handbook
  2. Accelerate
  3. The Phoenix Project
  4. Continuous Delivery
  5. Site Reliability Engineering
  6. Infrastructure as Code
  7. Release It!
  8. The Practice of Cloud System Administration
  9. Visible Ops
  10. Lean Software Development

Online Resources

  • Weekly newsletters
  • Notable websites:
    • Martin Fowler’s articles
    • Julia Evans’ technical zines and comics

Certifications

  • Technology-specific certifications:
    • AWS Cloud
    • HashiCorp
    • Kubernetes
    • Cloud Native
  • DevOps Institute certifications
  • University-run DevOps boot camps

Conferences and Events

  • DevOps Enterprise Summit (US and Europe)
  • DevOpsDays (50+ global events in 2023)
  • All Day DevOps (24-hour online conference)

Personal Development Path

Creating Your Learning Journey

  • Consider your target role and current position
  • Design a learning path based on:
    • Current skills
    • Career goals
    • Desired specialization

Core Technical Skills

  • Operating systems
  • Programming languages
  • Cloud technologies
  • Containerization

Additional Learning Resources

  • DevOps Foundations curriculum
  • Specialized courses:
    • Lean and Agile
    • Infrastructure as Code
    • CI/CD
    • Site Reliability Engineering
    • DevSecOps
    • Observability
    • Incident Management
    • DevOps Management
    • DevOps Anti-patterns

Best Practices for Learning

  • Gain hands-on experience
  • Utilize continuous learning principles
  • Engage with feedback loops
  • Connect with DevOps community
  • Participate in course Q&A
  • Network on LinkedIn