1. DevOps Basics
2. DevOps and People: A Culture Change
3. DevOps and Process: The Building Blocks
4. Infrastructure as Code
5. Continuous Delivery
6. Site Reliability Engineering
7. Advanced Topics
8. DevOps Career
- DevOps Career Guide and Resources

1. DevOps Basics

Understanding DevOps

Definition and Core Concept

DevOps combines two traditional tech roles:
- Developers: Write application code
- Operations Engineers: Set up and manage systems running the applications
DevOps emerged in the late 2000s to address the disconnect between these roles

Key Characteristics

Collaborative approach throughout the entire service lifecycle
Includes all specialized roles working together:
- Front-end developers
- Test engineers
- Build engineers
- Networking engineers
- Security engineers
- Database administrators (DBAs)

Modern Systems Engineering Approach

Operations engineers use development techniques
Systems engineering follows software development workflow:
- Code checked into source control
- Build, test, and deployment processes
Moves away from manual system administration

Three Levels of DevOps

Values
Principles
Practices

Benefits and Impact (2021 State of DevOps Report)

Elite Teams vs. Low-Performing Teams

Deployment Frequency: 973 times more frequent
Lead Times: 6,570 times shorter
Quality Metrics:
- 3x fewer failures
- 6,570 times faster recovery from issues

Organizational Benefits

22% less time spent on unplanned work and rework
2x more likely to achieve organizational objectives
Higher success in:
- Shipping products
- Customer satisfaction
50% reduction in employee burnout

Universal Application

Benefits apply across:
- Different organization sizes
- For-profit and non-profit organizations
- Product engineering teams
- Internal IT organizations

What DevOps is NOT

Not just a renamed operations team
Not a single job title
Not one person doing everything
Not tied to specific tools

Important Note

“Keep in mind that a lot of people use the term DevOps without really understanding what it means. So always check what you’re hearing against the core concepts of DevOps.”

DevOps Core Values: CAMS Model

Overview

The CAMS model, developed by DevOps pioneers John Willis and Damon Edwards, represents the fundamental values of DevOps:

Culture
Automation
Measurement
Sharing

“DevOps is a human problem.” - Patrick Debois (Godfather of DevOps)

Culture (C)

Understanding Culture

More than superficial perks (ping pong tables, free food)
Driven by human behavior
Based on mutual understanding between team members

Historical Context

Traditional IT organization split:
- Development Teams: Focus on creating applications and features
  - Emphasis on speed and innovation
- Operations Teams: Focus on maintenance and stability
  - Responsible for servers, networks, security, and cost control

Cultural Challenges

Formation of silos due to differing goals
Communication breakdown between teams
Focus on team-specific goals rather than overall business outcomes
Solution: Change underlying behaviors and assumptions to drive cultural change

Automation (A)

Key Points

Often the first thing people associate with DevOps
Warning: Implementing automation without other values can lead to DevOps failure
Creates a fabric for controlling systems and applications
Acts as an accelerator for other DevOps benefits

Best Practices

Make automation the primary approach to creating solutions
Address manual work as it’s a source of:
- Inefficiencies
- Quality problems in technology value streams

Measurement (M)

Importance

Enables observation of systems and people
Helps track improvement from changes
Provides rational approach to technology

Common Pitfalls

Measuring wrong metrics
Improper incentivization

Recommended Metrics

Cross-organizational outcomes:
- Mean time to restore service after outages
- Cycle time for new feature deployment
Higher-level results:
- Costs
- Revenue
- Employee satisfaction
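
As a rough illustration of the two cross-organizational metrics above, the sketch below averages outage durations (mean time to restore) and feature lead times (cycle time). The helper name `mean_duration` and all sample timestamps are assumptions made up for this example, not data from the report or any real system.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_duration(intervals):
    """Average length of (start, end) datetime pairs, as a timedelta."""
    seconds = [(end - start).total_seconds() for start, end in intervals]
    return timedelta(seconds=mean(seconds))


# Hypothetical sample data, not taken from any real system.
outages = [  # (service went down, service restored)
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
]
features = [  # (work started, deployed to production)
    (datetime(2024, 1, 1), datetime(2024, 1, 4)),
    (datetime(2024, 1, 2), datetime(2024, 1, 7)),
]

mttr = mean_duration(outages)
cycle_time = mean_duration(features)
print("Mean time to restore:", mttr)
print("Mean cycle time:", cycle_time)
```

Both numbers span development and operations work, which is why they suit shared goals better than team-specific counters.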

Sharing (S)

Core Elements

Foundation of collaboration
Essential for DevOps success
Promotes teamwork and transparency

Documentation
Pair programming
Peer reviews
Mentoring
Inclusive practices

Conclusion

CAMS implementation focuses on:

Changing people’s behavior
Using automation as an accelerator
Measuring progress for improvement
Fostering collaboration for better outcomes

The CAMS values serve as the foundation for specific DevOps techniques and should be embraced for successful organizational transformation.

The Three Ways of DevOps

Overview

The Three Ways are strategic principles developed by Gene Kim and Mike Orzen to implement DevOps values effectively. These principles help bring core DevOps values to life in practical ways.

First Way: Systems Thinking and Principles of Flow

Key Concepts

Focus on the overall outcome of the entire system
Avoid optimizing individual parts at the expense of overall results
Consider end-to-end flow as the primary value producer

Example: Performance Optimization

Improving one area can create unexpected bottlenecks elsewhere
Case Study: Adding more application servers can overwhelm database servers with connections
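
The arithmetic behind this case study can be sketched in a few lines. The pool size and database connection limit below are hypothetical values chosen only to show how scaling one tier shifts the bottleneck to another.

```python
def connection_demand(app_servers, pool_size):
    """Worst-case number of database connections the app tier can open."""
    return app_servers * pool_size


def database_overloaded(app_servers, pool_size, db_max_connections):
    """True when the app tier can request more connections than the database accepts."""
    return connection_demand(app_servers, pool_size) > db_max_connections


# Hypothetical numbers chosen only to illustrate the effect.
POOL_SIZE = 20
DB_MAX_CONNECTIONS = 200

for servers in (8, 10, 12):
    demand = connection_demand(servers, POOL_SIZE)
    overloaded = database_overloaded(servers, POOL_SIZE, DB_MAX_CONNECTIONS)
    status = "overloaded" if overloaded else "ok"
    print(f"{servers} app servers -> {demand} connections ({status})")
```

Under these assumptions the application tier looks healthier with every server added, while the system as a whole fails once demand passes the database limit.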

Organizational Impact

Deployment team processes might look good in isolation but could compromise overall development
Handoffs and friction between teams often disrupt value flow
Success metrics should reflect system-wide outcomes

Second Way: Amplifying Feedback Loops

Definition

Processes that consider their own output when determining next steps
Focus on creating, shortening, and amplifying feedback loops between value chain components

Bug Detection Example

Three scenarios with increasing waste:

Best case: Developer catches bug through desktop unit tests
Medium case: QA finds bug, documents it, returns to developer
Worst case: Customer finds bug → Support → Development → Product Management → Fix

Application

Use when creating multi-team processes
Important for visualizing metrics
Essential in designing delivery flows

Third Way: Culture of Continuous Experimentation and Learning

Core Elements

Create an environment that encourages learning and experimentation
Avoid analysis paralysis
Focus on practical implementation and iteration

Key Principles

“Working code wins”
“If it hurts, do it more often”
“Fail fast”

Implementation

Encourage active skill practice and mastery
Promote trying new approaches
Focus on doing rather than just discussing
Support sharing of new ideas

Practical Application

The Three Ways provide a framework to:

Implement specific processes and tools
Align with CAMS (Culture, Automation, Measurement, Sharing)
Guide decision-making in DevOps implementation

Key Questions to Consider

How does this affect the whole system?
Where can we build in more feedback loops?
How can we facilitate experimentation and learning?

DevOps Practice Areas: The Five Pillars

Overview

Unlike Agile’s structured methodologies (like Scrum and XP), DevOps doesn’t have a strictly defined approach. However, it consists of five major practice areas that form a comprehensive implementation framework.

The Five Practice Areas

1. Culture

Focus on creating and maintaining a stable, safe environment
Key elements:
- Learning and sharing
- Experimentation
- Embracing both success and failure
- Reflects core DevOps values

2. Process

Foundation: Agile and lean product development techniques
Key practices:
- Working in small batches
- Limiting work in progress (WIP)
- Incorporating feedback loops
- Lightweight change approval processes
Strong correlation with IT and business success
Reflects the “Three Ways” in Lean and Agile frameworks

3. Infrastructure as Code

Technological approach using:
- Cloud
- Containers
- Programmable infrastructure
Benefits:
- Reproducibility
- Self-service capabilities
- Rapid scaling
- Improved software delivery and operational performance

4. Continuous Delivery

Focuses on automation for implementing lean principles
Key aspects:
- Automated testing
- Frequent deployment of small changes
Benefits:
- Increased speed
- Improved quality
- Better culture
- Enhanced performance

5. Site Reliability Engineering (SRE)

Engineering approach to:
- Building reliability into systems
- Operating services with high observability
- Implementing automation
Applies to both application and infrastructure levels

Important Considerations

Interdependence

Pillars are not effective in isolation
Must work together to build a solid DevOps foundation
Example: High software delivery performance (from continuous delivery) needs operational excellence (from SRE) to deliver business benefits

Implementation Strategy

Advance all five pillars iteratively
Avoid focusing on one pillar exclusively
Balance development across all areas
Regular assessment of organizational maturity in each pillar is recommended

“In your roadmap to DevOps maturity, you want to advance all five pillars in turn and iterate so that they can reinforce each other. Trying to completely implement one without bolstering the others will end in frustration.”

DevOps Tools Selection Guide

Core Principles

People Over Process Over Tools

“People over process over tools” - Alex Honor (Creator of Rundeck)

Correct Implementation Order:

Identify responsible people and ensure they have proper skills/support
Define necessary processes
Select and implement appropriate tools

Common Mistake:

Organizations often reverse this order, focusing on:
- Tools first
- Processes second
- People last (if at all)

Tool Selection Criteria

1. KISS Principle

Definition: Keep It Simple, Stupid
Rationale: Every tool requires:
- Learning curve
- Implementation
- Upgrades
- Security maintenance
- Integration with other tools

2. Integration Requirements

Tools should function as a “tool chain”
Must operate well in dynamic environments
Key features:
- Good integration capabilities
- Ability to compose solutions
- Automatic adaptation to changes
- API availability

Challenges in Modern DevOps

Complexity Issues

Increasing complexity in tech landscape
Example: Cloud Native Computing Foundation’s landscape diagram
Recent trends show:
- Declining quality of implementations
- Overabundance of tools
- Integration difficulties

Common Tool Categories

Popular tools mentioned:

Kubernetes
Terraform
Ansible
Puppet
Chef
GitHub
Jenkins
Docker
Linux
Amazon
Graphite
Artifactory

Best Practices for Tool Selection

Focus on Collaboration
- Consider how tools enhance team collaboration
- Ensure all value stream participants can use them effectively
Avoid Over-tooling
- Resist the temptation to implement too many tools
- Consider maintenance overhead
Ensure Dynamic Compatibility
- Tools must work with changing environments
- Avoid static configurations
- Prioritize API-driven solutions

Key Takeaway

“There is no such thing as the best tool. There’s only the best tool for you and your specific situation.”

2. DevOps and People: A Culture Change

The Need for DevOps Culture

Current IT Challenges

Traditional IT Department Issues

IT departments often face low success and satisfaction rates
Historical misalignment between business teams and technology teams
Popular media (e.g., “Office Space,” “IT Crowd,” “Silicon Valley”) reflects these real-world challenges

Internal Friction Points

Conflict exists between various technical teams:
- Developers
- Quality Assurance
- System Administrators
- Information Security Professionals
- Network Administrators
- Database Administrators (DBAs)

The Wall of Confusion

Definition

Represents the communication barriers between different teams
Creates division between groups that should share common goals

Typical Flow

Business throws requirements to developers
Developers throw code to testers
Testers throw tested code to operations
Operations throw final product to end users

Real-World Example

Server Provisioning Case Study

Traditional Process (6 weeks):

Negotiating specifications
Procurement process
Hardware delivery
Installation in data center
OS loading
Final handover

After Virtualization (4 weeks):

Technical process reduced to 15 minutes
Organizational overhead still resulted in 4-week delays due to:
- Standards
- Ticketing systems
- Documentation requirements

Business Impact

Executive Perspective

Modern business executives are increasingly tech-savvy
Question why 15-minute tasks take 4 weeks
Concerned about:
- Financial waste
- Time inefficiency
- Competitive disadvantage

Common Reactions

Turn to outsourcing
Develop shadow IT
Seek alternatives to central IT department

“The organizations and processes we’ve built up around IT” have created unnecessary complexity and delays, highlighting the need for a DevOps culture to bridge these gaps and improve efficiency.

Building DevOps Culture Through Communication and Trust

The Importance of Communication

Communication and trust are fundamental to a productive DevOps culture
Project success (from deployments to acquisitions) heavily depends on communication quality
Without proper communication and trust:
- Technical practices may fail
- Goals may compete
- Misunderstandings can occur

Effective Communication Strategies

Structured Communication Channels

Establish dedicated channels for specific purposes:
- File repositories for customer information
- Chat channels for downtime incidents
- Email aliases for software release communications

Communication Planning

Good communication requires intentional planning
Essential for:
- Fast-moving organizations
- High-pressure situations (e.g., outages)
Need clear processes defining:
- When to communicate
- Who to communicate with
- How to handle business events

Organizational Types (Westrum Model)

Pathological Organizations
- Everyone looks out for their own needs
- Limited information flow
Bureaucratic Organizations
- Focus on strictly defined roles
- Teams defend their turf
Generative Organizations
- Mission-focused
- Most effective information flow
- Features high trust environment
- Welcomes bad news as learning opportunities

Building Trust and Respect

Personal Development

Acknowledge that not everyone has natural social skills
Recommended resources for improvement:
- “How to Win Friends and Influence People”
- “Crucial Conversations”
- “How to Say It At Work”

Key Principles

Assume Good Faith
- Most people try to do their best
- Actions are based on perceived constraints
- Misunderstandings often stem from lack of context
Promote Transparency
- Share access to:
  - Chat rooms
  - Team Wiki pages
  - Code repositories
  - Infrastructure details
  - Monitoring tools
  - Ticket trackers
Break Down Barriers
- Don’t over-restrict communication
- Challenge unnecessary “least privilege” restrictions
- Recognize business value in transparency

Best Practices

Create shared goals across teams
Provide visibility into different team activities
Be open and transparent
Stay curious and respectful
Focus on understanding others’ perspectives
Align goals across teams
Show value for others’ needs

“There’s no shortcut to building mutual trust. It develops over time.”

Real-World Example

Situation: Developer-Operations conflict over priorities
Problem: Lack of understanding about operations team’s workload
Solution: Implemented program to create:
- Shared goals
- Better visibility
- Cross-team understanding
Result: Improved working relationships and effectiveness

Breaking Silos in DevOps: Enhancing Collaboration

The Wall of Confusion

Root Causes

Not primarily due to poor people skills of tech professionals
Main cause: Institutional incentivization of opposing behaviors
Different teams have conflicting responsibilities:
- Development teams: Focus on new functionality and rapid changes
- Operations teams: Maintain stability and control change

Impact of Misaligned Incentives

Creates harmful conflicts of interest
Diminishes feedback loops
Local optimization interferes with global optimization
Teams focus only on individual metrics rather than organizational success

Conway’s Law

“Systems will nearly always align themselves to your communication boundaries.”

Organizational boundaries act as communication boundaries
The First Way of DevOps emphasizes alignment around the value stream
Simply renaming teams to “DevOps” without structural changes is ineffective

Solutions for Breaking Silos

1. Cross-Functional Teams

Integrate people from different specialties to work together
Success Story Example:
- Large SaaS company in Austin
- Embedded ops engineer into dev team
- Shared ticket backlog between dev and ops tasks
- Results:
  - Developers gained understanding of operational requirements
  - Increased respect and collaboration
  - Shared responsibility for production service

2. Self-Service Tooling

Implement automated access to shared services
Benefits:
- Reduces dependencies between teams
- Increases efficiency
- Eliminates unnecessary waiting times
- Better alignment with specific team needs

3. Aligned Communication and Goals

Role Evolution Requirements:
- Developers:
  - Take responsibility for build/deployment failures
  - Participate in on-call rotations
- Operations/QA:
  - Shift to providing self-service platforms
  - Focus on guidance rather than direct execution

Three-Step Path to Enhanced Collaboration

Reduce Separate Teams:
- Eliminate silos
- Create cross-disciplinary teams
Implement Self-Service:
- Virtually remove team dependencies
Align Remaining Teams:
- Promote collaboration
- Ensure mutual support
- Align goals across teams

Action Items

Evaluate organizational maturity in these areas
Identify specific actions for improvement
Plan implementation steps towards collaborative goals

Continuous Learning in DevOps: The Third Way

Core Concepts

The Third Way Fundamentals

Focuses on creating a culture of continuous experimentation and learning
Emphasizes:
- Mastering core skills
- Experimenting and taking risks
- Learning through practical experience

Kaizen (改善)

Japanese concept meaning “change for the better”
Translates roughly to continuous improvement
Key component of Toyota Production System (TPS)
Introduced to Western world in 1986 through Masaaki Imai’s book
Adopted by major companies including:
- Lockheed Martin
- Pixar Animation Studios

Five Principles of Kaizen

Knowing the customer
Enabling smooth workflow
Going to the real place (gemba)
Empowering people
Maintaining transparency

Gemba (現場)

Means “the real place” in Japanese
Emphasizes direct observation and involvement
Key practice: Go to where value is created or where problems exist
Avoid relying on:
- Secondary reports
- Metrics alone
- Documentation
- Assumptions

“Show up in the project meeting. Go look at the code. Go try and use the system having problems.”

Implementation Process

Kaizen Improvement Process (Kata)

Follows the cycle of:

Plan: Define intentions and expected results
Do: Execute the plan
Check: Measure and analyze results
Act: Make necessary alterations

Key characteristics:

Similar to scientific method
Focuses on small, daily improvements
Creates new baselines when improvements are successful
Builds critical thinking skills

Practical Application

Best Practices

Make small iterative changes regularly
Implement improvements as part of daily work
Focus on teaching people critical thinking skills
Build people before building systems

Common Pitfalls to Avoid

Avoid degenerate variations of the Plan-Do-Check-Act cycle, such as:

Plan, don’t do, hide
Try to make it to Friday
Wait for the weekend instead of improving

Action Items

Use notebook function in course to document:
- Potential improvement areas
- Small, tangible next steps
- Ideas for iterating towards DevOps
- Progress and learning outcomes

3. DevOps and Process: The Building Blocks

DevOps and Agile: Historical Context and Framework

Origins of DevOps

First DevOps Discussion:
- Occurred at Agile 2008 conference in Toronto
- Between Patrick Debois and Andrew Clay Shafer
- Started as an “Agile infrastructure” discussion

Key Historical Events

2008: Initial discussion at Agile conference
2009:
- Andrew presented on Agile infrastructure at Velocity Conference
- Patrick started “DevOps Days” conference in Belgium, coining the term “DevOps”

Understanding Software Development Lifecycle (SDLC)

Traditional Steps:

Requirements gathering
Design creation
Implementation
Testing
Deployment
Maintenance

Waterfall vs. Agile Approach

Waterfall Method:

Sequential, linear approach
Complete documentation before proceeding
“Throwing over the wall” mentality between teams
Results in:
- Loss of context
- Quality issues
- Excessive rules and contracts
- Finger-pointing

Agile Method:

Iterative approach
Small, frequent iterations
Active collaboration between teams
Includes end-user feedback
Focuses on working software

Agile Benefits (According to VersionOne’s Survey)

85% increased productivity
80% faster time to market
81% better delivery time predictability
79% enhanced software quality

Limitations of Agile

No mention of operations in original manifesto
Doesn’t address systems aspects:
- Infrastructure building
- Application deployment
- Monitoring
- Maintenance

DevOps and Agile Relationship

Not identical: Can be practiced independently
Best Practice: Implement DevOps as an extension of Agile
DevOps addresses the operational gaps in Agile

Historical Challenge

“In the beginning, Agile was seen as a threat by the infrastructure side of the house and IT organizations”

Operations teams initially struggled with Agile’s iteration speed
Success was found when operations teams adopted Agile principles themselves

“You can practice DevOps without Agile and vice versa. But it can, and frankly probably should be implemented as an extension of Agile for best results.”

Visible Ops Change Control Process

Introduction

Change is the primary cause of technical issues
- 80% of outages are caused by changes intended to improve, patch, or upgrade systems
Solution: Implement controlled changes through review, testing, and scheduled rollouts

IT Service Management (ITSM) Background

Emerged in 1980s as IT operations scaled
Focuses on service delivery and support
Notable frameworks:
- Microsoft Operations Framework
- COBIT
- ISO 20000
- Six Sigma
- ITIL (IT Infrastructure Library) - Most popular framework
  - Currently in 4th major version
  - Covers 34 different areas
  - Known for heavy-handed, slow processes

Traditional ITIL Change Management Issues

Requires extensive documentation for all changes
Relies on Change Advisory Board (CAB) for approval
Problems:
- Too slow for modern technical organizations
- Approval decisions made by those least qualified
- Tends to add more process when changes fail

Visible Ops Approach

Introduced by Kevin Behr, Gene Kim, and George Spafford in 2004
Published in “The Visible Ops Handbook”
- Condensed ITIL implementation into 4 practical steps
- Only 112 pages vs. ITIL’s 2000+ pages
Focuses on lightweight, fast, scalable, repeatable change control

Key Principles of Lightweight Change Control

Review and Documentation Requirements
- All changes need review, approval, and documentation
- Peer review by technologists close to the team
- Risk-based escalation for complex changes
- Example: Wireless access point installation vs. core router replacement
Change Size Management
- Keep changes as small as possible
- Benefits:
  - Easier to review
  - Simpler to identify and fix errors
  - Better than batch releases with hundreds of changes
Early Testing Implementation
- Use continuous integration systems
- Implement automated testing
- Include security safeguards early in development
- Peer review validates testing completion
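
A minimal sketch of risk-based escalation, assuming a change record with just two risk signals. The `Change` fields, the threshold, and the `review_path` function are hypothetical names invented for illustration; a real team would define its own risk criteria.

```python
from dataclasses import dataclass


@dataclass
class Change:
    description: str
    systems_affected: int
    touches_core_network: bool


# Hypothetical threshold; a real team would set its own risk criteria.
MAX_SYSTEMS_FOR_PEER_REVIEW = 5


def review_path(change):
    """Route a change to peer review unless its risk calls for escalation."""
    too_broad = change.systems_affected > MAX_SYSTEMS_FOR_PEER_REVIEW
    if change.touches_core_network or too_broad:
        return "escalate"
    return "peer review"


access_point = Change("Install wireless access point", 1, False)
core_router = Change("Replace core router", 200, True)

print(access_point.description, "->", review_path(access_point))
print(core_router.description, "->", review_path(core_router))
```

Keeping the default path lightweight means most small changes are approved by people close to the work, and only the risky few wait for wider review.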

Research Support

Findings from Google’s DevOps Research and Assessment (DORA) group:
- Streamlined change approval processes lead to:
  - Higher performance
  - Lower burnout levels
  - Increased psychological safety

Additional Resources

LinkedIn Learning course: “IT Service Management Foundations Change Management” by Earnest
- Detailed guidance on setting up lightweight change control processes

4. Infrastructure as Code

Infrastructure as Code (IaC)

Traditional Infrastructure Management

Historically, infrastructure was managed manually:
- Building data centers
- Installing physical servers
- Loading operating systems (Windows/Linux)
- Configuring software
- Installing applications

Problems with Manual Management

Each system became highly individual (“special snowflakes”)
System administration was:
- Slow
- Error-prone
- Hard to maintain consistency
- Difficult to track changes

Modern Infrastructure as Code

Definition

“Infrastructure as code is provisioning and managing infrastructure through writing automation code instead of through manual processes.”

Key Concepts

Programmable Infrastructure:
- Write code to configure networks
- Set up servers
- Attach storage
- Configure operating systems
- Install applications

Benefits

Aligns with DevOps CAMS values:
- Culture
- Automation
- Measurement
- Sharing
Supports lean theory by:
- Removing waste
- Reducing delays

Modern Systems Challenges

Complexity Factors

Distributed systems
Microservice architectures
Cloud infrastructure
Containers
Machine learning
Ephemeral (temporary) components

New Approach: “Cattle not Pets”

Old way: Servers were “pets” (individually crafted and maintained)
New way: Servers are “cattle” (managed en masse)

Best Practices

Adopt a development lifecycle approach
Combine both operational and development perspectives:
- Operations expertise with tools
- Developer expertise with code
Version control for infrastructure code
Automated testing and deployment
Consistent build and deployment processes

Benefits of IaC

Scalability
Consistency
Reproducibility
Efficiency
Version control
Automated deployment
Reduced human error

DevOps Infrastructure as Code: Configuration Management Overview

Core Concepts

Configuration Management Definition

Process for creating and maintaining systems and software in a desired state
In DevOps: All configuration management should be automated and code-driven

Three Main Components

Provisioning
- Making servers and computing infrastructure ready for operation
- Includes:
  - Hardware/virtual hardware setup
  - Operating system installation
  - System services configuration
  - Network connectivity setup
Deployment
- Automated installation and upgrading of application software
- Applies to both:
  - In-house developed software
  - Third-party products
Orchestration
- Coordinated operations across multiple systems
- Examples:
  - Automated failover
  - Rolling deployments
  - Running runbooks across server fleets
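
To make the orchestration idea concrete, here is a small sketch of a rolling deployment over an in-memory fleet. The function `rolling_deploy`, the `deploy` and `is_healthy` callbacks, and the server names are all assumptions for illustration; real orchestration tools add timeouts, load balancer draining, and rollback.

```python
def rolling_deploy(servers, batch_size, deploy, is_healthy):
    """Upgrade servers one batch at a time; stop at the first unhealthy batch.

    Returns (servers confirmed healthy on the new version, overall success).
    """
    updated = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            deploy(server)
        if not all(is_healthy(server) for server in batch):
            return updated, False  # halt so the rest of the fleet keeps serving
        updated.extend(batch)
    return updated, True


# Hypothetical in-memory fleet standing in for real servers.
versions = {f"web{i}": "1.0" for i in range(1, 6)}


def deploy(server):
    versions[server] = "1.1"


def is_healthy(server):
    return versions[server] == "1.1"


updated, ok = rolling_deploy(list(versions), 2, deploy, is_healthy)
print(updated, ok)
```

The coordination across machines (batching, health gates, halting) is what separates orchestration from configuring each server on its own.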

Key Terminology

Approach Types

Imperative (Procedural)
- Defines and executes specific commands to produce desired state
- Example:
```
1. Stop service
2. Copy new NGINX binary
3. Start service
```
Declarative (Functional)
- Defines desired end state
- Tool handles convergence to that state
- Example: “Server should run NGINX v1.24”
- Usually builds on top of imperative systems
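
The contrast between the two approaches can be sketched with a server modeled as a plain dictionary. Everything here is an assumption for illustration: the function names, the dictionary keys, and the starting version "1.22" do not come from any particular tool.

```python
def imperative_upgrade(server, new_version):
    """Imperative: run a fixed list of commands in order."""
    steps = []
    server["running"] = False
    steps.append("stop service")
    server["nginx_version"] = new_version
    steps.append("copy new NGINX binary")
    server["running"] = True
    steps.append("start service")
    return steps


def converge(server, desired):
    """Declarative: compare actual and desired state, change only what differs."""
    changes = []
    for key, wanted in desired.items():
        if server.get(key) != wanted:
            server[key] = wanted
            changes.append(f"{key} -> {wanted}")
    return changes


# Hypothetical server state; "1.22" is an illustrative starting version.
server_a = {"nginx_version": "1.22", "running": True}
server_b = {"nginx_version": "1.22", "running": True}

print(imperative_upgrade(server_a, "1.24"))
print(converge(server_b, {"nginx_version": "1.24", "running": True}))
```

Both servers end in the same state, but the declarative caller only described the goal and left the choice of steps to `converge`.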

Important Characteristics

Idempotent

Ability to execute repeatedly with same end result
Declarative tools typically built to be idempotent
Must be manually ensured in imperative approaches
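
A small sketch of what idempotence means in practice, using a configuration file modeled as a list of lines. The function names and the sample setting are hypothetical.

```python
def append_line(lines, line):
    """Not idempotent: every run adds another copy of the line."""
    return lines + [line]


def ensure_line(lines, line):
    """Idempotent: adds the line only when it is missing."""
    return lines if line in lines else lines + [line]


# Hypothetical configuration content.
config = ["worker_processes 4;"]
setting = "keepalive_timeout 65;"

appended_twice = append_line(append_line(config, setting), setting)
ensured_twice = ensure_line(ensure_line(config, setting), setting)

print(len(appended_twice))  # grows with every run
print(len(ensured_twice))   # stays the same after the first run
```

The explicit check in `ensure_line` is the kind of guard an imperative script has to add by hand, whereas declarative tools usually provide it.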

Self-Service

Allows end users to initiate processes independently
Benefits:
- Removes operations team from critical path
- Increases velocity
- Improves developer satisfaction

Drift

Deviation from defined configuration
Causes:
- Manual changes outside tool
- Script execution issues
Many tools include drift detection capabilities
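
Drift detection can be thought of as a comparison between the defined configuration and what is actually found on a system. The sketch below assumes both are available as dictionaries; `detect_drift` and the sample keys are hypothetical and do not reflect how any specific tool implements it.

```python
def detect_drift(desired, actual):
    """Return {key: (desired_value, actual_value)} for every mismatch."""
    drift = {}
    for key in desired.keys() | actual.keys():
        if desired.get(key) != actual.get(key):
            drift[key] = (desired.get(key), actual.get(key))
    return drift


# Hypothetical states: someone changed the port and enabled debug by hand.
desired = {"nginx_version": "1.24", "port": 443}
actual = {"nginx_version": "1.24", "port": 8443, "debug": True}

for key, (want, found) in sorted(detect_drift(desired, actual).items()):
    print(f"{key}: defined={want!r} found={found!r}")
```

Checking the union of keys means settings added manually outside the tool are reported too, with `None` as their defined value.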

Notes

Configuration management tools often overlap in functionality
Tool selection should consider specific use cases

Evolution of DevOps Configuration Management

Early Days (1990s)

Commercial IT Provisioning Tools:
- Ghost (system cloning)
- Enterprise suites like Tivoli and HP
- Focus on separate dev and ops approaches

Rise of Infrastructure as Code (2000s)

Major Configuration Management Tools

CFEngine
Puppet
Chef

“Our Unix admin team started using CFEngine to roll out operating system configurations” (circa 2005)

Challenges

Lack of collaboration between teams
Resistance to sharing tools across different functions
Configuration drift issues

Golden Image vs. Foil Ball Debate (2009)

Luke Kanies (Puppet founder) highlighted problems with image management:
- Image sprawl
- Configuration drift

New Approach: Stem Cell System

Minimal initial server images
Declarative CM tools for provisioning
Idempotent tools for:
- Preventing configuration drift
- Managing updates
- Automatic state convergence

Cloud Era Challenges

Why Automated Server Provisioning Became Essential

Increased virtualization
Dynamic server instances
Growth in distributed systems
Exponential increase in virtual servers

Orchestration Problems

Traditional CM Tool Limitations

15-minute wake-up cycle
Individual server checks
Pull-based changes
Issues with:
- High availability requirements
- Coordinated database/application changes

Initial Vendor Response

“You don’t need orchestration and if you think you do, you don’t understand configuration management.”

Evolution in the 2010s

New Tools and Approaches

Ansible and SaltStack:
- Push mechanism
- Explicit orchestration
- Dev-friendly deployment
- Workflow automation capabilities
Hybrid Solutions:
- Combined push deployment with idempotence
- Integration with existing CM tools
Self-Service Tools:
- Rundeck for orchestration
- Compliant system activities
- On-demand initiation

Limitations of Early CM Tools

Limited application deployment capabilities
Lack of virtual infrastructure provisioning
Focus primarily on system administration
Gap in addressing broader value stream needs

Evolution of Infrastructure as Code (IaC) in DevOps

Cloud Computing Era (2010s)

Enabled creation of servers, storage, and networks through code
Shifted from manual installation to programmatic infrastructure management
Introduced model-driven provisioning with declarative approaches

AWS CloudFormation Example

Provides templates for defining cloud assets
Allows automatic instantiation of resources
Uses declarative specifications for server configurations
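
As a hedged illustration of a declarative template, the sketch below builds a minimal CloudFormation-style document in Python and prints it as JSON. The resource type `AWS::EC2::Instance` and its `ImageId` and `InstanceType` properties are part of CloudFormation; the helper `instance_template`, the logical name `WebServer`, and the placeholder image ID are assumptions for this example, and the template is not validated or submitted to AWS here.

```python
import json


def instance_template(logical_name, image_id, instance_type):
    """Build a minimal template declaring a single EC2 instance."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            logical_name: {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": image_id,
                    "InstanceType": instance_type,
                },
            }
        },
    }


# "ami-EXAMPLE" is a placeholder, not a real image ID.
template = instance_template("WebServer", "ami-EXAMPLE", "t3.micro")
print(json.dumps(template, indent=2))
```

The template states what should exist rather than how to create it; the service works out the calls needed to instantiate the resources.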

Advanced IaC Solutions

Specialized Tools

Terraform and Pulumi:
- Emerged as dominant solutions
- Terraform provides a domain-specific language (HCL) for infrastructure provisioning; Pulumi uses general-purpose programming languages

Programming Language Integration

Python: Boto library
AWS CDK: Enables pure code solutions
Note: These solutions may be less idempotent

Container Revolution (Late 2010s)

Key Features

Reduced server dependency
Docker containers package applications with minimal OS dependencies
Streamlined development and testing cycles

Benefits for Developers

Bundled runtime with applications
Reduced runtime bugs
Improved development workflow

Immutable Infrastructure

Netflix Model

Adopted golden image approach
Created cloud images with baked-in applications
Moved away from configuration management across servers

Characteristics

Servers not modified after deployment
Replace rather than modify approach
Reduces configuration drift through design

Modern Container Orchestration (2020s)

Platforms

Kubernetes
Mesos

Features

Unified solution for:
- Provisioning
- Deployment
- Orchestration
Template-based application and infrastructure changes
Automated coordination of changes

Serverless and PaaS

Simplifies deployment process
Abstracts infrastructure management
Note: Platform operation still requires maintenance and oversight

Future Outlook

Moving towards integrated toolchains
Focus on simplified infrastructure management
Continued evolution of IaC approaches

“Someone operating the platform still has to worry about it” - highlighting the ongoing need for infrastructure expertise despite automation advances.

Infrastructure as Code (IaC) Toolchain Selection Guide

Core Principles

Choose tools appropriate for team’s skill level
Start simple, scale complexity as needed
Plan the entire toolchain before implementation
Design operational environment before creation

Key Decision Points

1. Infrastructure Management

Self-Managed vs. Managed Service Options:

Self-managed infrastructure:
- Digital Rebar for bare metal automation
  - Handles PXE booting, BIOS, RAID configuration
  - OS and hypervisor installation
  - Integrates with tools like Terraform

2. Infrastructure Provisioning

Three Main Approaches:

Template-Driven:
- AWS CloudFormation
- Azure Resource Manager (ARM) templates
- Written in JSON or YAML (ARM templates use JSON)
Custom Language Solutions:
- Terraform
- Pulumi
- Benefit: Works across multiple cloud providers
Pure Code Approach:
- Python boto
- AWS CDK
- Azure Bicep
- Leverages full programming languages

3. System Management

Options:

Runtime Configuration:
- Chef
- Puppet
- CFEngine
Configuration + Orchestration:
- Ansible
- Salt
Image Creation:
- HashiCorp Packer for automated image building (“baking”)
- Dockerfiles for container images

Note: These approaches can be combined. Example: Configure base image with Chef, then bake with Packer

4. Orchestration Options

Configuration management tools (Ansible/Salt)
Platform-based (Kubernetes/Mesos)
External runbook automation (Rundeck)
Custom code solutions

5. Application Deployment Methods

Configuration management
Immutable deployments (container/system images)
Continuous deployment systems

6. Testing Strategy

Important Considerations:

Essential component of infrastructure as code
Utilize existing test frameworks
Implement both:
- Unit testing for infrastructure code
- Integration testing for produced infrastructure

Real-World Example

Enterprise SaaS Implementation

Tools Used:

Terraform: Base infrastructure, network, core servers
Puppet: Base image configuration
Packer: Image baking
Rundeck: Orchestration and updates

Process Flow:

Infrastructure building with Terraform
Configuration management with Puppet
Image creation with Packer
Orchestration via Rundeck
Continuous integration pipeline for testing

Simplified System Example

Tools Used:

CloudFormation: Base infrastructure
Docker: Container creation
Amazon managed container service: Orchestration

Benefits:

Simpler implementation
Less maintenance overhead
Cost-effective
Suitable for immutable deployment

5. Continuous Delivery

Continuous Delivery Overview

Key Stages in Software Development

Build Stage
- Compile and test code
- Convert code into software
Deploy Stage
- Run the software
- Test the software
Release Stage
- Send software to end users
- Deploy to production environment

Traditional vs. Modern Approaches

Old Way (Traditional)

Application built only at major milestones
Large, complex integration builds
Long test phases
Late bug detection
Error-prone and wasteful

Modern Approach (CI/CD)

Continuous Integration (CI)
- Automatic building and unit testing
- Occurs on every source code check-in
- Maintains application in working state
Continuous Delivery (CD)
- Deploys changes to production-like test environment
- Automated integration and acceptance testing
- Ensures application is always release-ready
Continuous Deployment
- Automatically releases to production
- Used by major companies (Amazon, Meta, Google, Wells Fargo)
- Can lead to 10+ deployments daily

Benefits of CI/CD

Performance Improvements

Decreased deployment time
Faster market validation
Rapid experimentation
Lower change failure rate
Earlier bug detection

Key Advantages

Quality
- Testing occurs earlier in process
- Changes evaluated one by one
- Continuous working state maintained
Recovery
- Easier to identify failure sources
- Quick bug fix deployment
- Better problem isolation

Real-World Impact

Performance Metrics

High Performers: Deploy changes in < 1 hour
Low Performers: Deploy changes in 1-6 months

Case Study Example

“By overlaying our database connection growth graph with the deploys that happened that week, we could quickly figure out precisely which production deployment correlated with the increase of database connections.”

DevOps Principles

Follows first way of DevOps (optimizing end-to-end flow)
Implements second way through fast feedback loops
Reduces Work in Progress (WIP)
Minimizes risk and waste from undelivered code

Common Problems Solved

Eliminates panic from monthly release cycles
Reduces error-prone manual releases
Prevents finger-pointing during issues
Enables quick problem identification and resolution

Six Practices for Continuous Integration

Overview of CI/CD Pipeline

Continuous Integration, Delivery, and Deployment form a pipeline
Each stage flows from build → deploy → release
Each stage depends on successful completion of previous stage

Continuous Integration Basics

Purpose: Keep software in working state at all times
Process:
- Automatically triggered build on each commit
- Builds entire codebase
- Runs unit tests and code validation
- Packages artifact
- Provides build status and log

Six Key Practices

1. Fast Builds

Should pass the “coffee test” (approximately 5 minutes)
Why: Longer builds lead to:
- Developers batching changes
- Increased Work in Progress (WIP)
- System problems

2. Small Commits

Commit smallest possible amount of code
Benefits:
- Easier for team to understand
- Simpler failure isolation

3. Fix Broken Builds Immediately

Build breaks are normal and expected
Important: Don’t leave builds broken
Recommended:
- Delay meetings until build is fixed
- Stop all work until resolution
Sets tone for delivery culture

4. Use Trunk-Based Development

Two Main Development Approaches:
1. Branch-based development
  - Developers work on separate branches
  - Long development time
  - Problematic merges
2. Trunk-based development
  - No long-running branches
  - Multiple small changes daily
  - Always up-to-date trunk
Feature Management: Use feature flags instead of branches
Recommendation: Choose trunk-based approach
- Minimizes WIP
- Ensures frequent code review
- Reduces merge issues

5. Address Flaky Tests

Fix unreliable tests immediately
Inconsistent test results reduce trust in CI system
Impacts build artifact reliability

6. Build Output Requirements

Status: Simple pass/fail or red/green indicator
Log: Detailed record of tests and results
- Aids troubleshooting
- Supports compliance
Artifact: Installable application version
- Should be uploaded and tagged with build number
- Ensures auditability and immutability

Action Item

“Take a moment and use the course notebook to reflect and write down the next steps you could take to implement some of these six practices in a build pipeline you work with.”

Five Practices for Continuous Delivery

Core Concept

“It’s not how much you can deliver, but how little.” - Jez Humble and Dave Farley

Pipeline Structure

Build Stage → Deployment Stage
- Deploy successful build artifacts to live environment
- Environment should mirror production
- Names may vary: CI, staging, test, or pre-production
- Automated testing follows deployment

Five Key Techniques

1. Artifact Management

Create single artifact upon successful build
Types of artifacts:
- RPM or Debian packages
- MSI installers
- Java WAR files
- ZIP files
Build once, use across all environments
No rebuilding for different stages

2. Artifact Immutability

Artifacts must remain unchanged throughout pipeline
Access Control:
- CI system: Write access only
- Deployment system: Read access only
Benefits:
- Builds trust between teams during debugging
- Enables verification through checksums
- Maintains auditability
- Allows tracing from code version → build artifact → running system
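
Checksum verification can be sketched with Python's standard hashlib module. The artifact bytes and the function names below are made up for illustration; a real pipeline would read the artifact from its repository and store the digest alongside the build number.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the artifact contents."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_digest: str) -> bool:
    """True only if the artifact is byte-for-byte what the CI system built."""
    return sha256_of(data) == expected_digest


artifact = b"example build output"
recorded = sha256_of(artifact)          # recorded once, at build time

print(verify_artifact(artifact, recorded))
print(verify_artifact(artifact + b" modified", recorded))
```

The first check passes and the second fails, which is the property that lets a team trust that the running system matches the build that was tested.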

3. Pre-production Environment

Must mirror production environment as closely as possible
Must Include:
- Load balancers
- Network settings
- Security controls
- Production-like data
Enables thorough testing:
- Acceptance testing
- Smoke tests
- Integration tests

4. Pipeline Control

System must halt pipeline on any failure
Stop Points:
- Broken build → No deployment
- Failed deployment → No release
Focus on overall software delivery flow, not individual productivity
Team should collaborate to fix issues
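
A minimal sketch of the halt-on-failure rule, assuming each stage is a function that returns True on success. The stage names and the run_pipeline helper are hypothetical and do not correspond to any particular build system.

```python
def run_pipeline(stages):
    """Run (name, step) pairs in order and stop at the first failure."""
    completed = []
    for name, step in stages:
        if not step():
            return completed, name      # halt: later stages never run
        completed.append(name)
    return completed, None


stages = [
    ("build", lambda: True),
    ("deploy", lambda: False),          # simulated failed deployment
    ("release", lambda: True),
]

completed, failed_at = run_pipeline(stages)
print(completed, failed_at)
```

Here the release stage is never attempted, because the failed deployment stops the flow.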

5. Idempotent Deployments

Multiple deployment runs should yield identical results
Implementation Options:
- Immutable packaging (Docker containers)
- Configuration management tools (Puppet, Chef)
Eliminates variability in pipeline
Builds trust in deployment process

Note

The authors recommend reading “Continuous Delivery” by Jez Humble and Dave Farley for comprehensive understanding.

The Role of QA in DevOps

Introduction

Continuous delivery benefits:
- Faster deployments
- Fewer bugs
- Less technical debt
- Better dev-ops collaboration

The Importance of Automated Testing

Key Point: Automated testing is crucial for CI/CD success
Manual testing:
- Considered slow and unreliable
- Best reserved for final acceptance testing only
Modern QA role:
- QA professionals work alongside developers
- Focus on designing and writing tests
- Let automation handle repetitive testing tasks

Testing Types (Bottom-up Approach)

1. Unit Testing

Most developer-centric testing
Characteristics:
- Written by developers within the codebase
- Validates individual function behavior
- Fastest testing method
- Uses stubs to bypass external dependencies
- Run locally during development
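
As a sketch of a unit test that uses a stub, assume a function whose external dependency is passed in as an argument. The greeting function and the stub are hypothetical; the test framework is Python's standard unittest module.

```python
import unittest


def greeting(user_id, fetch_name):
    """Build a greeting; `fetch_name` is the external dependency."""
    return f"Hello, {fetch_name(user_id)}!"


class GreetingTest(unittest.TestCase):
    def test_uses_name_from_dependency(self):
        # The stub stands in for a database or API call, so the test
        # runs quickly and needs no external system.
        def stub(user_id):
            return "Ada"

        self.assertEqual(greeting(42, stub), "Hello, Ada!")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(GreetingTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```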

2. Code Hygiene

Checks code against language/framework best practices
Implemented using:
- Linters
- Formatters

3. Integration Testing

Performed in test environment
Tests:
- Individual component functionality
- Inter-component interactions
- All dependencies included

4. Acceptance/End-to-End Testing

Tests complete product from user perspective
Often UI-level testing
Can be automated
Manual verification still valuable for final checks

Test-Driven Development (TDD) & Behavior-Driven Development (BDD)

Write tests before implementing code
Process example:
1. Write test for desired output
2. Test fails initially
3. Implement functionality
4. Test passes when implementation is correct

Handling Slow Tests

Strategies:

Parallel Execution
- Run slow tests in parallel with the rest of the pipeline
- Block only the final release on their results, not earlier stages
Scheduled Testing
- Nightly test suites
- Regular scheduled runs
Continuous Testing
- Run against test environment
- Accept possibility of non-critical bugs
- Quick fixes possible in CD environment

Additional Testing Types

Infrastructure testing
Performance testing
Security testing
Browser compatibility testing
Compliance testing

Key Takeaway

“Getting good at automated testing is your single most significant factor in successful continuous delivery.”

Continuous Deployment Overview

Key Differences from Continuous Delivery

When to Consider Continuous Deployment

Organizations may not be ready for Continuous Deployment due to:
- Need for manual test cycles
- Product manager sign-off requirements
- Preference for bundled changes over frequent small updates

Prerequisites

Strong CI/CD foundation
Automated approvals and testing within pipeline
Manual workflow steps can be integrated (like code reviews)
Feature flags enable pre-deployment of code before user access

“If you stay ready, you ain’t got to get ready.” - Suga Free

Release Stage Components

Process Flow

Artifact passes all tests
Artifact is marked as released
Deployment to production environment
Trigger notifications for:
- Compliance
- Internal communication
- End user communication

Production Considerations

Complexity: Production releases often require significant engineering work
Challenges:
- Packaged software: Focus on data and configuration compatibility
- Running services: Must handle live users and flowing data
Important: Test environment must mirror production deployment procedures

Production Release Patterns

Types of Deployments

Rolling Deployment
- Upgrades one system at a time
- Allows seamless traffic shifting
Blue-Green Deployment
- Creates entirely new version
- Switches traffic from current (Blue) to new (Green) system
- Can involve swapping environments or creating new ones in cloud
Canary Deployment
- Upgrades single system
- Tests under production load
- Monitors for issues
A/B Deployment
- Uses feature flags
- Releases features to specific user subsets
- Useful for:
  - Canary Testing
  - Public Beta Testing
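
Releasing a feature to a subset of users can be sketched with a deterministic hash of the user ID. This is one common approach and an assumption here, not the method of any specific feature flag product; the function name, flag name, and user IDs are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Place each user in a stable bucket from 0 to 99 for the given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent             # users below the cutoff see the feature


users = [f"user-{n}" for n in range(1000)]
enabled = sum(in_rollout(u, "new-checkout", 10) for u in users)
print(f"{enabled} of {len(users)} users see the feature")
```

Because the bucket depends only on the flag and user ID, a user who sees the feature at 10% still sees it when the rollout is raised to 50%.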

Real-World Implementation Example: Signal Sciences

System Overview

Built internal tool called “Deployer” (inspired by Etsy’s Deployinator)
Enabled company-wide deployment capabilities
Five-minute deployment time from commit to production

Key Features

Push-button deployment to staging
Automated testing
Self-service automation
Feature flag implementation
Gradual release strategy:
1. Internal users
2. Early adopters
3. All customers

Success Factors

Strong CI/CD foundation
Self-deploying capability
Integration of Dev and Ops workflows
Focus on user experience

Important Considerations

Deployment strategy must align with:
- Packaging choices
- Infrastructure as code strategy
- Software architecture
Requires collaboration across teams
System should be opinionated with clear, standardized procedures

DevOps CI Toolchain Overview

Approach to Building a CI Toolchain

Traditional approach: Start from developer and work outward
Recommended approach: “Onion Layer” model - start from outer layer and work inward
Focus on end-state perspective when considering the entire toolchain

Layer 1: Deployment (Outermost Layer)

Deployment Considerations

Determine how software will be deployed:
- Containers
- System images
- Windows installers

Deployment Types & Tools

A/B Deployments
- Requires feature flagging
- Tools:
  - LaunchDarkly
  - Split
  - Custom-built solutions
Rolling Deployments
- Requires orchestration tools
- Platform-specific options:
  - Kubernetes
  - Serverless
  - Ansible
  - Salt

Layer 2: Artifact Repository

General Solutions

Artifactory
Nexus

Specialized Solutions

Cloud provider container repositories
Language-specific repositories (e.g., bit.dev for NPM)
Minimal solution: Build system tagging + Amazon S3

Layer 3: Building & Testing

Testing Categories

Unit Testing
- Language-specific tools (e.g., go test for Golang)
Code Hygiene & Linters
- ESLint (JavaScript)
- Staticcheck (Golang)
Integration Testing
- Pytest (Python)
- TestNG (Java)
Acceptance/End-to-End Testing
- Selenium
- Cypress.io
- Robot Framework
- Postman

Additional Testing Types

Infrastructure Testing
- InSpec
- ChefSpec
Performance Testing
- JMeter
- LoadRunner
Security Testing
- GitHub Dependabot
- GitGuardian
- Dryrun Security
- StackHawk

Layer 4: Build System

Options

Jenkins (Open source)
- Pros: Community support, wide integration
- Cons: UI navigation challenges
SaaS Solutions
- CloudBees
- CircleCI
- GitHub Actions

Layer 5: Version Control (Innermost Layer)

Popular Options

Git-based:
- GitHub
- GitLab
- Bitbucket

Specialized Version Control

Perforce
Plastic SCM (for large binary assets)

Best Practices

Track Cycle Time
- Measure time from developer system to production
- Record and share metrics with team
- Actively work to improve cycle time
Tool Selection
- Choose tools that reduce overall cycle time
- Consider integration capabilities
- Factor in team expertise and requirements

6. Site Reliability Engineering

Site Reliability Engineering (SRE) Overview

Definition and Core Concepts

SRE is the practical operations component of DevOps
Engineering Definition: Application of theoretical principles to solve real-world problems
Reliability Definition: System’s ability to perform intended functions correctly and consistently
Encompasses:
- Availability
- Performance
- Security
- Service delivery capabilities

Origins

Originated at Google, focusing initially on website reliability
Google published a free online book titled “Site Reliability Engineering”

Key Components

1. Operational Aspects

Monitoring production services
Managing systems
Problem resolution
Automation of operational processes

2. Patrick Debois’ Four Key DevOps Areas

Extending delivery to production
Extending feedback from operations to dev
Embedding dev into operations
Embedding ops into dev

3. Holistic Approach

Two main components:

Building Reliability
- Focus on constructing resilient systems
- Emphasis on maintainability
- Engineering for reliability from the start
Operational Feedback
- Observability practices
- Incident response procedures
- Production operations feedback loop

Business Impact and Metrics

SRE improves key performance indicators:

Change Failure Rate
- Reduces production issues through:
  - Reliability testing
  - Deployment automation
Time to Restore Service
- Improves through:
  - Enhanced problem detection
  - Operational automation
  - Disciplined processes
Service Level Objectives
- Better meeting of uptime goals
- Improved performance targets
- Enhanced through observability and resilience

Important Notes

“You can’t just bolt on reliability once something goes live.”

SRE requires proactive engineering approach
Combines both development and operational perspectives
Requires continuous improvement through feedback loops

Building for Reliability: Design Theory

Core Concepts

Success in production largely depends on design-time decisions and software architecture
Focus on creating reliable applications through thoughtful planning
Important to understand how applications work in real systems environments

Key Resources

1. “Release It!” by Michael Nygard

Equivalent to “Gang of Four Design Patterns” but focused on stability
Key Findings:
- Integration points are the #1 cause of architectural issues
- Cascading failures are the biggest threat to stability in layered architecture

Example of Cascading Failure:

Database layer issues can lead to:
- Exhaustion of database connection pools
- Application server tier choking

2. Circuit Breaker Pattern

Purpose: Prevents cascading outages
Functionality:
- Monitors integration point failures/slowness
- Stops making calls when unusual failure rates detected
- Works with timeouts to prevent outage spread
Implementation: Available through libraries like Resilience4j
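
The pattern can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Resilience4j API (which is a Java library); the class name, parameters, and thresholds are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the protected function."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after = reset_after      # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            self.opened_at = None           # cool-down over: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0                   # a success closes the circuit
        return result


def failing_backend():
    raise ConnectionError("backend down")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(failing_backend)
    except ConnectionError:
        outcomes.append("failed")
    except CircuitOpenError:
        outcomes.append("skipped")
print(outcomes)
```

After two failures the third call is skipped without touching the backend, which is how the breaker keeps a slow or failing integration point from tying up the calling tier.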

3. Twelve-Factor App (12factor.net)

Manifesto for service-ready software
Example - Factor 3 (Config):
- Separate runtime configuration from app code
- Store in environment variables
- Keep configurations independent
- Avoid environment groupings
- Benefits: Reduces fragility and improves portability
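
A minimal sketch of reading configuration from environment variables. The variable names (DATABASE_URL, LOG_LEVEL, REQUEST_TIMEOUT_SECONDS) and their defaults are assumptions chosen for illustration.

```python
import os


def load_config(env=os.environ):
    """Read settings from environment variables, with explicit defaults."""
    return {
        "database_url": env.get("DATABASE_URL", "sqlite:///local.db"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "request_timeout": float(env.get("REQUEST_TIMEOUT_SECONDS", "5")),
    }


# The same code runs in every environment; only the variables differ.
print(load_config({"LOG_LEVEL": "DEBUG"}))
```

Each setting is an independent variable rather than part of a named group such as "staging" or "production", which matches the factor's advice to avoid environment groupings.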

4. Martin Fowler’s Resources

Provides concise descriptions of architectural concepts:
- Page objects
- Serverless
- Bimodal IT
- DevOps topics
Perspective from experienced software engineer

Modern Architecture Considerations

Microservice architectures have multiple integration points
Higher likelihood of integration point failures
Need for robust stability patterns and solutions

Action Items

Schedule focused time to study these patterns
Evaluate patterns based on your technical ecosystem
Consider implementing solutions for common production failures

“Take a minute to schedule some focus time on your calendar to look more deeply into these patterns and consider which may have value in your own particular technical ecosystem today based on the kinds of failures that you see in production.”

Building for Reliability: Key Principles

Core Concept: Dev vs Ops Background

“Dev comes from school, but Ops comes from the street.”

Developers typically have computer science backgrounds
System administrators often self-taught through real-world experience
SRE bridges ops experience with disciplined engineering approach

Understanding System Failure

Fundamental Truths

All systems fail
Individual components fail frequently
Slowdowns are as threatening as complete outages
Systems often run in degraded mode

Swiss Cheese Model

System components are like stacked Swiss cheese slices
Problems occur when holes (failures) align
Multiple layers provide protection against complete failure

Richard Cook’s “How Complex Systems Fail”

Key findings:

Changes introduce new forms of failure
Complex systems contain latent failures
Complex systems always run in degraded mode

System Availability Metrics

Measured in “nines” of availability
Examples:
- Three nines (99.9%): 8.77 hours downtime/year
- Five nines (99.999%): 5.26 minutes downtime/year
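
The downtime figures follow from simple arithmetic. The sketch below assumes an average year of 365.25 days (8,766 hours); the function name is hypothetical.

```python
HOURS_PER_YEAR = 365.25 * 24    # 8766 hours, averaging in leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


three_nines_hours = downtime_hours_per_year(99.9)
five_nines_minutes = downtime_hours_per_year(99.999) * 60

print(round(three_nines_hours, 2))     # 8.77
print(round(five_nines_minutes, 2))    # 5.26
```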

Resilience Engineering

Definition

“Resilience is the intrinsic ability of a system to maintain or regain a dynamically stable state, which allows it to continue operations after a major mishap and/or in the presence of a continuous stress.”

Key Tools and Approaches

Redundancy
- Multiple identical copies of components
- Maintains service if one fails
Load Balancing
- Directs traffic to healthy system parts
- Traffic shaping for optimal performance
Automatic Scaling
- Adds resources as needed
- Eliminates need for manual server upgrades

Example: Kubernetes

Runs redundant copies of core services
Built-in health checking
Automatic failover
State replication across multiple locations

Sociotechnical Systems

Important Considerations

People are integral parts of the system
Human actions can both break and maintain system health
Systems are always partially broken
Expert intervention is necessary

SRE Best Practices

Time Management
- At least 50% of time should be spent developing tools
- Focus on automation over manual fixes
Developer Involvement
- “You write it, you run it”
- Developers should be on-call for their code
- Must be proficient with debugging and monitoring tools
- Required to support services until proving stability
Documentation
- Create comprehensive runbooks
- Document safe intervention procedures
- Establish monitoring and control systems

Key Takeaway

Building reliable systems isn’t about achieving perfect uptime. It is about creating resilient systems that keep functioning despite partial failures, maintained and improved by skilled practitioners.

Observability in Systems

Overview

Observability measures how well internal system states can be understood from external outputs
Goal: Understanding system state through metrics and logs to enable action and improvement
Supports the Three Ways principles through feedback loops

Five Key Areas of Observability

1. Synthetic Checks

Also known as health checks
Programmatic testing of service performance and uptime
Not based on real user traffic
Answers basic question: “Is it working?”
Can be implemented at both:
- High-level service checks
- Sub-component levels
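
A synthetic check can be as small as one HTTP request. The sketch below uses Python's standard urllib and assumes the service exposes a health URL that returns HTTP 200 when healthy; the function name and the opener parameter (added so the check can be exercised without a network) are hypothetical.

```python
import urllib.error
import urllib.request


def check_endpoint(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and HTTP error statuses all count as "not working".
        return False
```

A scheduler would call check_endpoint at a fixed interval against the top-level service URL, and optionally against individual sub-components, and alert when it returns False.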

2. System and Application Metrics

System Metrics

Measures fundamental system resources:
- CPU usage
- Memory utilization
Time-series data, typically viewed as graphs
Helps determine normal functioning

Application Metrics (Custom)

Application-specific measurements
More diagnostic than system metrics
Examples:
- Function call duration
- Login counts
- Error event frequency
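
Custom application metrics such as function call duration can be sketched with a decorator. The in-memory durations store and the login function are stand-ins for illustration; a real service would send these values to its metrics system.

```python
import functools
import time
from collections import defaultdict

durations = defaultdict(list)   # metric name -> list of observed seconds


def timed(func):
    """Record how long each call to `func` takes, even if it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            durations[func.__name__].append(time.perf_counter() - start)
    return wrapper


@timed
def login(user):
    return f"welcome {user}"


login("ada")
login("grace")

# The list length doubles as a login count; the values are call durations.
print(len(durations["login"]), durations["login"])
```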

3. Performance Metrics

Application Performance Monitoring (APM)

Code-level performance instrumentation
Measures:
- Function execution time
- API call duration
- Database query performance

Real User Monitoring (RUM)

Front-end instrumentation (e.g., JavaScript page tags)
Captures actual user experience
Provides direct insight into customer experience

Tracing

Tracks requests across multiple services
Measures duration of each component
Useful for complex system analysis

4. System and Application Logs

Provides detailed contextual information
Answers key questions:
- What happened?
- When did it happen?
- Where did it happen?
- What was involved?
Use cases:
- Problem detection
- Troubleshooting
- Audit and compliance
- Capacity planning
- Security forensics
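
One way to make logs answer those four questions is to emit structured records. The sketch below is an assumption about format, not a standard; the field names and the log_event helper are hypothetical.

```python
import json
from datetime import datetime, timezone


def log_event(what, where, **involved):
    """Return one JSON log line covering what, when, where, and what was involved."""
    record = {
        "when": datetime.now(timezone.utc).isoformat(),
        "what": what,
        "where": where,
        "involved": involved,
    }
    return json.dumps(record, sort_keys=True)


print(log_event("login_failed", "auth-service", user="ada", source_ip="192.0.2.10"))
```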

5. Security Monitoring

Utilizes existing logs and metrics
Focuses on threat detection
Monitors for:
- Indicators of compromise
- Suspicious endpoints
- Connections from known bad IPs
- Bad configurations
- Unusual behavior
Example alerts:
- Login failure spikes
- Website injection attempts
- Malformed network requests
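
A login failure spike alert can be sketched as a comparison against a recent baseline. The factor and floor values below are arbitrary assumptions for illustration, not recommended thresholds, and the function name is hypothetical.

```python
from statistics import mean


def is_spike(recent_counts, current_count, factor=3.0, floor=5):
    """Flag `current_count` if it far exceeds the recent average."""
    baseline = max(mean(recent_counts), floor)   # floor avoids alerts on tiny numbers
    return current_count > factor * baseline


failures_per_minute = [2, 3, 1, 4, 2]

print(is_spike(failures_per_minute, 4))     # False
print(is_spike(failures_per_minute, 60))    # True
```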

Best Practices

Analyze which monitoring types best support production systems
Use monitoring data to help development teams improve applications
Collaborate between operations and development
Encourage improved custom metrics and logging
Use production data to drive product improvements

“Monitoring isn’t just for production performance and uptime, it’s also a source of valuable information to developers about how the service is really used out in production.”

Incident Response and Retrospectives

Core Concepts of Incident Response

System Reality

All systems are sociotechnical systems with humans as part of their resilient operation
Even with excellent design, development, testing, and monitoring, systems will still experience failures
Getting good at responding to and remediating problems is a crucial part of the job

Key Activities for Incident Response

Troubleshooting
- Requires in-depth system knowledge
- Need ability to diagnose and remediate problems
Automation
- Having pre-created tooling
- Enables faster and safer information gathering
- Supports remediation activities
Communication
- Often requires team of specialists
- Need to keep business stakeholders informed
- Must update end users on situation

Incident Management Process

Inspired by Incident Command System (ICS)
- Originated in 1968 for fighting California wildfires
- Now recommended by UN as international standard
Key aspects:
- Incident detection and reporting
- Participant coordination
- Custom to organization, team, and product

Post-Incident Analysis

Modern Approach vs Traditional

Moving away from traditional “root cause analysis”
Avoiding blame-focused investigations
Recognition that human error shouldn’t cause major outages
- If it does, system needs improvement
- Systems should be resilient to mistakes

Effective Postmortem Principles

Multiple Causes
- No single root cause
- Consider deficiencies at multiple levels:
  - Testing
  - Monitoring
  - Processes
Blame-Free Analysis
- Understand actions from practitioners’ point of view
- Recognize decisions made with best available information
- Address cognitive biases
- Focus on system improvement
Transparency
- Open communication during incidents
- Clear stakeholder updates
- Honest post-incident reporting
- Builds trust and goodwill

“Real talk moment. Organizations have performed so-called root cause analyses for decades. These are usually a thinly veiled attempt to find somebody to blame for an outage. But if someone making a mistake can cause a major outage, your system itself is terrible and not resilient and it needs to improve.”

Best Practices

Practice incident response regularly
Maintain cool head during incidents
Focus on system improvement rather than blame
Document and learn from each incident
Share learnings transparently

DevOps SRE Toolchain Overview

Two Main Components

1. Building for Reliability

Highly dependent on programming language and tech stack
Focus on libraries and development techniques rather than tools
Requires collaboration between dev and ops at design time
Resources available:
- Technical books
- Libraries (e.g., Java’s Resilience4j)

2. Operational Feedback

Common set of observability and incident response tools
Rich ecosystem of options:
- SaaS Solutions:
  - Datadog
  - Honeycomb
  - Sumo Logic
- Open Source Tools:
  - Nagios
  - Grafana
  - Prometheus
- Commercial Software:
  - SolarWinds
  - Splunk

Five Key Areas of Observability

Synthetic checks
System and application metrics
End-user performance
System and application logs
Security monitoring

Lean Approach to Observability Implementation

Build-Measure-Learn Cycle

Build: Create minimum viable monitoring stack
- Basic endpoint synthetic monitors
- Basic system monitoring
- Performance latency from logs
Measure: Collect metrics from all monitoring areas
Learn:
- Analyze application stack with monitoring
- Identify areas needing more detailed metrics
- Evaluate effectiveness of app logs
- Iterate and improve as needed

“Monitoring you don’t use, that’s waste.”

Best Practices and Considerations

Stakeholder Access

Make monitoring accessible to:
- Developers
- Product managers
- Business decision makers

Custom Development

Create custom visualizations when needed
Focus on making monitoring meaningful to different stakeholders

Incident Response Tools

Popular Solutions:

PagerDuty (SaaS)
- Handles alerts from observability tools
- Manages on-call scheduling
- Provides escalation workflows
Other Options:
- VictorOps
- Opsgenie

Runbook Automation Tools

Rundeck (Open source, commercial, and SaaS options)
Ansible Tower
StackStorm

Status Page Tools

Atlassian Statuspage
Status.io

Key Takeaways

Keep solutions simple
Consider team collaboration needs
Iterate and improve based on actual usage
Focus on specific use cases
Be prepared to develop custom tooling as needed

7. Advanced Topics

Platform Engineering: The Paved Road

The Challenge of Scale

Organizations face difficulties in managing:
- Infrastructure as code
- Continuous builds
- Incident response
- Security and compliance
Key Problem: As value streams multiply, solution diversity can lead to chaos

The Automation Solution

Pioneer Companies

Organizations that first tackled extreme DevOps scale:
- Netflix
- Meta
- Google
- Spotify
These companies invested in self-service automation

The Paved Road Concept

Also known as the “golden path”
Evolution from early DevOps “wilderness trail blazing”
Creates an opinionated framework for standardized processes
Benefits:
- Easier adoption
- Shared improvements
- Simplified team transitions between projects

Common Implementation Examples

CI/CD Pipelines
- Automated check-in hooks
- Automatic test runs on pull requests
- Automated test deployments
Self-Service Platforms
- Cloud account provisioning
- HPC cluster setup for machine learning
- Built-in security guidance
- Automated compliance

Platform Engineering Evolution

Definition

“Platform engineering is the discipline of designing and building tool chains and workflows that enable self-service capabilities for software engineering organizations.”

Components

Development environment
Testing capabilities
Deployment automation
Infrastructure creation
Observability
Security
Runtime environment
Scaling
Service discovery

Success Factors

1. Product Management Approach

Platforms must serve users, not creators
Key principles:
- Focus on user requirements
- Ensure product quality
- Market the platform internally
- Keep usage voluntary

2. Lean Implementation

Avoid over-building platforms
Follow the progression:
1. Blaze the trail
2. Pave the road
3. Build the train
Focus on actual user needs
Maintain flexibility for innovation

Warning Signs vs. Good Practices

Warning Signs

Centralized control focus
Mandatory usage
Optimization for central team needs
Excessive upfront building

Good Practices

Global system optimization
Value stream focus
User-centric design
Incremental development
Flexibility for innovation

Key Differentiator

The main difference between modern platforms and traditional centralized IT is the focus on:

User empowerment
Value optimization
Flexibility
Continuous improvement based on actual needs

DevSecOps: Security in the DevOps Way

Traditional Security Challenges

Historical tension between security and technical teams
Security originally handled by sysadmins and developers
InfoSec specialization created new silos
Typical staffing ratio problem:
- 100 developers
- 10 operations staff
- 1 security person

Common Issues

Security teams have different priorities
Focus is often compliance-oriented
Security work appears as “busy work” to development teams
Security teams are understaffed and sit downstream in the delivery process
Developers care about security but lack:
- Time
- Clear direction from security teams

DevSecOps Introduction

“If security introduces blocking to the organization, it will be ignored, not embraced.” - Zane Lackey and Rich Smith (Etsy)

CAMS Framework with Security Lens

1. Culture

Security works alongside developers
Avoid creating blocking gates
Prevent value stream from routing around security

2. Automation

Shifting Left Concept

Introduce security earlier in development
Implement security tools in:
- IDE
- CI systems
Warning: Avoid common pitfalls
- Don’t dump security work on developers
- Prevent bloated build times
- Avoid forcing developers to parse complex security tools
Focus on minimal impact on cycle time
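
As one illustration of a lightweight shift-left check, the sketch below scans text for two kinds of hard-coded secrets and prints one short line per finding. The patterns, the function name `scan_for_secrets`, and the sample input are assumptions made for illustration; they are not a complete or recommended rule set.

```python
import re

# Illustrative patterns only (assumptions); real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hardcoded_password": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}


def scan_for_secrets(text):
    """Return a list of (line_number, rule_name) findings for the given text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings


sample = 'db_host = "localhost"\npassword = "hunter2"\n'
for line_number, rule_name in scan_for_secrets(sample):
    # One short line per finding keeps the feedback easy for developers to act on.
    print(f"line {line_number}: possible {rule_name}")
```

A check like this runs in milliseconds, so it could sit in an IDE plugin or an early CI step without noticeably lengthening cycle time.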

3. Sharing

Build bridges between teams
Create security champions program
- Methods to identify champions:
  - Host Capture the Flag events
  - Search code repos for security bug fixers
  - Ask for volunteers
Benefits:
- Security team trains champions
- Champions help understand team concerns
- Improves communication between teams

4. Measurement

Establish security observability
Create joint team goals
Avoid FUD (Fear, Uncertainty, Doubt) approach
Focus on metrics that matter

Key Takeaways

Security is critical regardless of terminology
Modern approaches focus on integration
DevSecOps bridges gap between security and development
Success requires balance between security needs and development efficiency

Kubernetes and Cloud Native Overview

What is Kubernetes?

An open-source container orchestration system that automates:
- Software deployment
- Scaling
- Management
Provides a platform for running containerized applications

Key Benefits

Automation and Features

Automates infrastructure plumbing
Provides standardized management features:
- Observability
- Service discovery
- Health monitoring
- Custom networking
Developers get built-in capabilities without additional development

Infrastructure Abstraction

Manages compute, networking, and storage
Enables multi-cloud deployment
Standardizes deployment across:
- On-premise environments
- Different cloud providers
Simple deployment process:
1. Containerize application
2. Specify redundancy requirements
3. Deploy across cluster nodes
4. Expose API
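
The deployment steps above can be sketched as two Kubernetes manifests: a Deployment that states the redundancy requirement and a Service that exposes the application. The application name `hello-api`, the image `registry.example.com/hello-api:1.0`, and the port numbers are hypothetical values chosen for illustration.

```python
import json


def build_manifests(name, image, replicas, container_port):
    """Build a Deployment (redundancy) and a Service (exposure) as plain dicts."""
    labels = {"app": name}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,  # step 2: redundancy requirement
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,  # step 1: the containerized application
                            "ports": [{"containerPort": container_port}],
                        }
                    ]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": labels,  # step 4: route traffic to the pods above
            "ports": [{"port": 80, "targetPort": container_port}],
        },
    }
    return [deployment, service]


# Hypothetical application name, image, and port.
manifests = build_manifests("hello-api", "registry.example.com/hello-api:1.0", 3, 8080)
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Saving the printed JSON to a file and passing it to `kubectl apply -f` against a cluster would cover step 3, with Kubernetes deciding which nodes run the three replicas.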

Cloud Native Computing Foundation (CNCF)

Understanding “Cloud Native”

Working definition: in practice, the term largely means “Kubernetes add-on”
Not limited to cloud environments
Large ecosystem of tools and products
CNCF maintains an interactive tool landscape

Challenges and Considerations

1. Complexity

Highly configurable with numerous options
- 20+ choices for network backplane alone
Requires integration of multiple tools
Complex upgrades and interoperability
Steep learning curve

2. Resource Requirements

Significant costs:
- A basic three-server cluster can cost hundreds of dollars monthly
- Requires dedicated administration team
Not suitable for lightweight management by dev teams

3. Implementation Risks

Can work against DevOps goals if not carefully managed
Potential creation of silos
Risk of increased waste
Requires systems thinking and CAMS values alignment

Best Practices

Start simple
Add complexity only when necessary
Ensure thorough understanding of platform behavior
Consider alternatives:
- Serverless solutions
- Lighter container orchestration
Alternatives can provide 80% of the benefits with 20% of the effort

“Kubernetes is a good tool to learn about and a very popular tool. Just make sure it’s the right tool for the job at hand.”

Chaos Engineering in DevOps

Introduction to Chaos Engineering

Definition: The discipline of experimenting on a system to build confidence in its capability to withstand real-world production conditions
Core Concept: Creating deliberate adversity for systems to test and improve resilience

Netflix’s Chaos Monkey

Origin Story

Netflix developed Chaos Monkey while running large cloud clusters for millions of users
Goal: Build systems resilient to component failures (servers, network links, etc.)

Why It Was Needed

Traditional CI/CD testing proved insufficient
Couldn’t replicate the full complexity of:
- Thousands of interconnected components
- Real production environment
- Interactions with millions of users

How It Works

Chaos Monkey deliberately breaks components in the production system
Purpose: Forces technical teams to ensure system resilience
Resulted in continuous improvement of system resilience through controlled failures

Modern Chaos Engineering Practices

Controlled Testing Approach

Not random destruction, but structured experiments
Requires proper fault testing during development
Validates automatic fault remediation
Tests human intervention scenarios
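
A structured experiment, as opposed to random destruction, can be sketched as three steps: confirm the steady state, inject a bounded fault, and re-check the hypothesis. The following is a toy in-memory simulation; the replica names, the `run_experiment` function, and the "minimum healthy replicas" hypothesis are all assumptions for illustration and do not touch a real system.

```python
import random


def run_experiment(replicas, min_healthy, faults_to_inject, seed=0):
    """Inject a bounded number of faults and check a steady-state hypothesis."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    healthy = set(replicas)

    # 1. Verify the steady state before touching anything.
    if len(healthy) < min_healthy:
        return {"status": "aborted", "reason": "steady state not met before experiment"}

    # 2. Inject faults with a limited blast radius.
    victims = rng.sample(sorted(healthy), faults_to_inject)
    healthy -= set(victims)

    # 3. Re-check the hypothesis and record what was learned.
    held = len(healthy) >= min_healthy
    return {
        "status": "hypothesis held" if held else "hypothesis violated",
        "terminated": victims,
        "remaining": sorted(healthy),
    }


result = run_experiment(["web-1", "web-2", "web-3"], min_healthy=2, faults_to_inject=1)
print(result["status"], "after terminating", result["terminated"])
```

Aborting when the steady state is not met beforehand is a deliberate choice here: an experiment on an already degraded system would not say much about resilience.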

Game Days

Structured activities for testing incident response
Either emulate faults or create real ones
Tests the human component of sociotechnical systems
Validates incident response procedures

Kubernetes Chaos Engineering Example

Testing Scenarios

Multiple failure points to test:
- Server failures
- Application container failures
- Network issues
- Container repository problems
- DNS failures

Key Learnings

System behavior often differs from assumptions
Important aspects to monitor:
- Recovery capability
- Recovery speed
- System behavior during reconnection
- Impact on human operators
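
Recovery speed is one of the easier items above to quantify. As a minimal sketch, assuming health checks are recorded as time-ordered `(timestamp_seconds, is_healthy)` pairs (a made-up format for this example), the time from the first failed check to the next healthy one can be computed like this:

```python
def recovery_seconds(samples):
    """Return seconds from the first failed check to the next healthy one.

    `samples` is a time-ordered list of (timestamp_seconds, is_healthy) pairs.
    Returns None if no failure was seen or the system never recovered.
    """
    failure_started = None
    for timestamp, is_healthy in samples:
        if not is_healthy and failure_started is None:
            failure_started = timestamp
        elif is_healthy and failure_started is not None:
            return timestamp - failure_started
    return None


# Hypothetical health-check log: the service fails at t=10 and is healthy again at t=40.
samples = [(0, True), (10, False), (20, False), (30, False), (40, True), (50, True)]
print(recovery_seconds(samples))  # 30
```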

Benefits and Philosophy

Similar to automobile crash testing methodology
Promotes innovative thinking
Breaks traditional constraints
Supports becoming a learning organization
Aligns with DevOps culture and the Three Ways
Emphasizes learning through feedback loops

“IT systems aren’t binary, at least not above the chip level.”

Best Practices

Conduct thorough development testing first
Create structured experiments
Test both automated and human responses
Monitor and learn from results
Apply learnings to improve system resilience

MLOps: DevOps for Machine Learning Systems

Introduction to MLOps

Definition: Combination of Machine Learning and DevOps practices
Current Context:
- ML has historically been used mainly by:
  - Large social media companies
  - Scientists and engineers
- Recent boom in generative AI has led to widespread adoption
- Business value often requires:
  - Private data handling
  - Training private ML models

Key Differences from Traditional DevOps

1. User Base Characteristics

Primary Users: Data scientists (vs. developers)
- Often less familiar with computer systems
- Work is tightly coupled with hardware
- Requires close collaboration and empathetic support

2. Additional Components to Manage

Beyond traditional code and infrastructure:
- Data versioning
- Model management
- Massive datasets
- ML models (pattern-finding algorithms)
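
One simple way to think about data versioning is to derive a version ID from the data itself, so that any change to the dataset yields a new ID that can be recorded next to the model trained on it. The sketch below is an assumption-laden illustration (the `dataset_version` function and the in-memory file mapping are hypothetical), not how any particular data-versioning tool works.

```python
import hashlib


def dataset_version(files):
    """Derive a short, deterministic version ID from file names and contents.

    `files` maps a file name to its raw bytes.
    """
    digest = hashlib.sha256()
    for name in sorted(files):  # sort so insertion order does not change the ID
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        # Hash each file separately so every file contributes a fixed-length value.
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()[:12]


# Hypothetical dataset: the second version has one extra row.
v1 = dataset_version({"train.csv": b"a,b\n1,2\n"})
v2 = dataset_version({"train.csv": b"a,b\n1,2\n3,4\n"})
print(v1, v2)
```

For massive datasets the same idea would need streaming reads rather than in-memory bytes, which this sketch leaves out.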

3. Infrastructure Requirements

Training Workloads:
- Intensive batch jobs
- Run on HPC (High Performance Computing) clusters
- Characteristics:
  - Highly optimized systems
  - Integrated compute, storage, and network
  - GPU utilization
  - Typically expensive

4. Results Tracking and Governance

Different from Traditional Testing:
- AI systems don’t provide single correct answers
- Varying output quality
- Continuous modification for improvement
- Rich feedback loop beyond pass/fail testing

Production Aspects

1. Inference and Training

Initial training followed by inference in production
Continuous learning from user input
Need to detect drift in AI predictions
Vector databases:
- Large scale storage
- Growing costs with long-term inference memory
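
Drift detection can start very simply. The sketch below compares recent prediction scores against a baseline and flags drift when the mean has shifted by more than a chosen number of baseline standard deviations. This is a deliberately crude heuristic for illustration; the function names, the sample scores, and the threshold of 2.0 are assumptions, and production systems typically use more robust statistical tests.

```python
import statistics


def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    spread = statistics.stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation; cannot scale the shift")
    return abs(statistics.mean(recent) - statistics.mean(baseline)) / spread


def has_drifted(baseline, recent, threshold=2.0):
    """Flag drift when the scaled shift exceeds the (assumed) threshold."""
    return drift_score(baseline, recent) > threshold


# Hypothetical model confidence scores.
baseline = [0.50, 0.52, 0.48, 0.51, 0.49]
recent_ok = [0.51, 0.49, 0.50]
recent_bad = [0.70, 0.72, 0.68]

print(has_drifted(baseline, recent_ok))   # False
print(has_drifted(baseline, recent_bad))  # True
```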

2. Development Value Stream

Three parallel CI/CD processes for:

Software
Infrastructure
Data and models

Key Participants

Developers
Operations teams
Data scientists

Success Factors

Application of core DevOps concepts:
- Automation
- Measurement
- Continuous monitoring
- Collaborative approach

Future Outlook

AI becoming a fundamental computing pattern
Increasing need for DevOps professionals to understand MLOps
Growing importance in business operations

“AI is here to stay as a fundamental computing pattern and workload, so more and more DevOps professionals will need to understand it in the future.”

AIOps: AI Integration in DevOps Work

Key Principles

AI doesn’t replace engineers
Engineers remain responsible for evaluating and testing AI outputs
Cannot blindly implement AI-generated code without proper validation

Why AI in DevOps?

Current Challenges

Too many tools and vendors
Inconsistent documentation
Complex architectures
Excessive specialized knowledge requirements
Cognitive overload from context switching

Practical Applications

1. API and Code Work

Assists with command-line operations
Creates integration scripts
Helps with data transformation
Facilitates webhook interactions

2. Natural Language Queries

Prompt Engineering Tips:
- Set context (e.g., “respond as a DevOps engineer with Linux experience”)
- Request explanations at different complexity levels
- Frame questions effectively for better results
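
The tips above can be captured in a small helper that always sets context, states the desired depth, and keeps the question focused. The `build_prompt` function and its wording are a hypothetical sketch; the resulting string could be sent to whichever AI assistant is in use.

```python
def build_prompt(role, audience_level, question):
    """Assemble a prompt that sets context, target depth, and a focused question."""
    return (
        f"Respond as {role}.\n"
        f"Explain at a level suitable for {audience_level}.\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    role="a DevOps engineer with Linux experience",
    audience_level="a junior developer",
    question="Why would a systemd service restart in a loop?",
)
print(prompt)
```

Changing only `audience_level` is an easy way to request the same explanation at a different level of complexity.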

3. Code Management

Code refactoring
Documentation writing
Pipeline documentation
Language conversion
Example: Rob Hirschfeld’s approach of converting Terraform to AWS CLI and Bash

4. Monitoring and Security

Enhanced detection and alerting
Automated remediation recommendations
Security testing and code review
- Faster than traditional methods
- Improved accuracy
- Acts as an “automated security buddy”

Future Evolution: The Three Waves

Wave 1: Code Generation

AI-assisted coding
Automated code reviews
Test generation
Documentation
Tools: GitHub Copilot, Amazon CodeWhisperer

Wave 2: Systems Management

Advanced alerting
Monitoring cluster health
Automated runbooks
System explanation
- Example: K8sGPT for explaining Kubernetes system state

Wave 3: Human Integration

Self-service functionality
Enhanced cross-team collaboration
Platform accessibility improvements
Better business integration

Conclusion

AIOps will:

Make DevOps more approachable
Improve business integration
Transform work methods without replacing human expertise
Enhance efficiency and accessibility across organizations

8. DevOps Career

DevOps Career Guide and Resources

Career Perspectives in DevOps

DevOps as a Mindset

DevOps is not just a job title but a mindset and suite of practices
Applicable across various technical roles
Focuses on improving technology organization results

Role-Specific Applications

For Developers:

Can remain developers while incorporating DevOps principles
Focus on building reliable applications
Better understanding of build and testing structures
Improved instrumentation for production environments

For System/IT Administrators:

May be titled “DevOps Engineer”
Key skills include:
- Infrastructure as code
- System reliability design
- Observability platform implementation
- Runbook automation

Specialized DevOps Roles:

Platform Engineers
Automation Experts
Build Engineers/Release Managers
Site Reliability Engineers (SREs)
Specialized positions in large organizations:
- Incident Managers
- Application Performance Management Teams

Beyond Technical Roles

Security Engineers → DevSecOps Engineers
Applicable to non-technical roles:
- Sales
- Marketing
- Product Management
- Executive positions

Learning Resources

Top 10 DevOps Books

The DevOps Handbook
Accelerate
The Phoenix Project
Continuous Delivery
Site Reliability Engineering
Infrastructure as Code
Release It!
The Practice of Cloud System Administration
Visible Ops
Lean Software Development

Online Resources

Weekly newsletters
Notable websites:
- Martin Fowler’s articles
- Julia Evans’ technical zines and comics

Certifications

Technology-specific certifications:
- AWS Cloud
- HashiCorp
- Kubernetes
- Cloud Native
DevOps Institute certifications
University-run DevOps boot camps

Conferences and Events

DevOps Enterprise Summit (US and Europe)
DevOpsDays (50+ global events in 2023)
All Day DevOps (24-hour online conference)

Personal Development Path

Creating Your Learning Journey

Consider your target role and current position
Design a learning path based on:
- Current skills
- Career goals
- Desired specialization

Core Technical Skills

Operating systems
Programming languages
Cloud technologies
Containerization

Additional Learning Resources

DevOps Foundations curriculum
Specialized courses:
- Lean and Agile
- Infrastructure as Code
- CI/CD
- Site Reliability Engineering
- DevSecOps
- Observability
- Incident Management
- DevOps Management
- DevOps Anti-patterns

Best Practices for Learning

Gain hands-on experience
Utilize continuous learning principles
Engage with feedback loops
Connect with DevOps community
Participate in course Q&A
Network on LinkedIn

Chirag Bharambe

DevOps Professional
