DevOps Professional
- 1. DevOps Basics
- Understanding DevOps
- DevOps Core Values: CAMS Model
- The Three Ways of DevOps
- DevOps Practice Areas: The Five Pillars
- DevOps Tools Selection Guide
- 2. DevOps and People: A Culture Change
- 3. DevOps and Process: The Building Blocks
- 4. Infrastructure as a Code
- Infrastructure as Code (IaC)
- DevOps Infrastructure as Code: Configuration Management Overview
- Evolution of DevOps Configuration Management
- Evolution of Infrastructure as Code (IaC) in DevOps
- Infrastructure as Code (IaC) Toolchain Selection Guide
- 5. Continuous Delivery
- Continuous Delivery Overview
- Six Practices for Continuous Integration
- Five Practices for Continuous Delivery
- The Role of QA in DevOps
- Continuous Deployment Overview
- DevOps CI Toolchain Overview
- 6. Site Reliability Engineering
- Site Reliability Engineering (SRE) Overview
- Building for Reliability: Design Theory
- Building for Reliability: Key Principles
- Observability in Systems
- Incident Response and Retrospectives
- DevOps SRE Toolchain Overview
- 7. Advanced Topics
- Platform Engineering: The Paved Road
- DevSecOps: Security in the DevOps Way
- Kubernetes and Cloud Native Overview
- Chaos Engineering in DevOps
- MLOps: DevOps for Machine Learning Systems
- AIOps: AI Integration in DevOps Work
- 8. DevOps Career
1. DevOps Basics
Understanding DevOps
Definition and Core Concept
- DevOps combines two traditional tech roles:
- Developers: Write application code
- Operations Engineers: Set up and manage systems running the applications
- DevOps emerged in the late 2000s to address the disconnect between these roles
Key Characteristics
- Collaborative approach throughout the entire service lifecycle
- Includes all specialized roles working together:
- Front-end developers
- Test engineers
- Build engineers
- Networking engineers
- Security engineers
- Database administrators (DBAs)
Modern Systems Engineering Approach
- Operations engineers use development techniques
- Systems engineering follows software development workflow:
- Code checked into source control
- Build, test, and deployment processes
- Moves away from manual system administration
Three Levels of DevOps
- Values
- Principles
- Practices
Benefits and Impact (2021 State of DevOps Report)
Elite Teams vs. Low-Performing Teams
- Deployment Frequency: 973 times more frequent
- Lead Times: 6,570 times shorter
- Quality Metrics:
- 3x fewer failures
- 6,570 times faster recovery from issues
Organizational Benefits
- 22% less time spent on unplanned work and rework
- 2x more likely to achieve organizational objectives
- Higher success in:
- Shipping products
- Customer satisfaction
- 50% reduction in employee burnout
Universal Application
- Benefits apply across:
- Different organization sizes
- For-profit and non-profit organizations
- Product engineering teams
- Internal IT organizations
What DevOps is NOT
- Not just a renamed operations team
- Not a single job title
- Not one person doing everything
- Not tied to specific tools
Important Note
“Keep in mind that a lot of people use the term DevOps without really understanding what it means. So always check what you’re hearing against the core concepts of DevOps.”
DevOps Core Values: CAMS Model
Overview
The CAMS model, developed by DevOps pioneers John Willis and Damon Edwards, represents the fundamental values of DevOps:
- Culture
- Automation
- Measurement
- Sharing
“DevOps is a human problem.” - Patrick Debois (Godfather of DevOps)
Culture (C)
Understanding Culture
- More than superficial perks (ping pong tables, free food)
- Driven by human behavior
- Based on mutual understanding between team members
Historical Context
- Traditional IT organization split:
- Development Teams: Focus on creating applications and features
- Emphasis on speed and innovation
- Operations Teams: Focus on maintenance and stability
- Responsible for servers, networks, security, and cost control
- Development Teams: Focus on creating applications and features
Cultural Challenges
- Formation of silos due to differing goals
- Communication breakdown between teams
- Focus on team-specific goals rather than overall business outcomes
- Solution: Change underlying behaviors and assumptions to drive cultural change
Automation (A)
Key Points
- Often the first thing people associate with DevOps
- Warning: Implementing automation without other values can lead to DevOps failure
- Creates a fabric for controlling systems and applications
- Acts as an accelerator for other DevOps benefits
Best Practices
- Make automation the primary approach to creating solutions
- Address manual work as it’s a source of:
- Inefficiencies
- Quality problems in technology value streams
Measurement (M)
Importance
- Enables observation of systems and people
- Helps track improvement from changes
- Provides rational approach to technology
Common Pitfalls
- Measuring wrong metrics
- Improper incentivization
Recommended Metrics
- Cross-organizational outcomes:
- Mean time to restore service after outages
- Cycle time for new feature deployment
- Higher-level results:
- Costs
- Revenue
- Employee satisfaction
Sharing (S)
Core Elements
- Foundation of collaboration
- Essential for DevOps success
- Promotes teamwork and transparency
Sharing Methods
- Documentation
- Pair programming
- Peer reviews
- Mentoring
- Inclusive practices
Conclusion
CAMS implementation focuses on:
- Changing people’s behavior
- Using automation as an accelerator
- Measuring progress for improvement
- Fostering collaboration for better outcomes
The CAMS values serve as the foundation for specific DevOps techniques and should be embraced for successful organizational transformation.
The Three Ways of DevOps
Overview
The Three Ways are strategic principles developed by Gene Kim and Mike Orzen to implement DevOps values effectively. These principles help bring core DevOps values to life in practical ways.
First Way: Systems Thinking and Principles of Flow
Key Concepts
- Focus on the overall outcome of the entire system
- Avoid optimizing individual parts at the expense of overall results
- Consider end-to-end flow as the primary value producer
Example: Performance Optimization
- Improving one area can create unexpected bottlenecks elsewhere
- Case Study: Adding more application servers can overwhelm database servers with connections
Organizational Impact
- Deployment team processes might look good in isolation but could compromise overall development
- Handoffs and friction between teams often disrupt value flow
- Success metrics should reflect system-wide outcomes
Second Way: Amplifying Feedback Loops
Definition
- Processes that consider their own output when determining next steps
- Focus on creating, shortening, and amplifying feedback loops between value chain components
Bug Detection Example
Three scenarios with increasing waste:
- Best case: Developer catches bug through desktop unit tests
- Medium case: QA finds bug, documents it, returns to developer
- Worst case: Customer finds bug → Support → Development → Product Management → Fix
Application
- Use when creating multi-team processes
- Important for visualizing metrics
- Essential in designing delivery flows
Third Way: Culture of Continuous Experimentation and Learning
Core Elements
- Create an environment that encourages learning and experimentation
- Avoid analysis paralysis
- Focus on practical implementation and iteration
Key Principles
- “Working code wins”
- “If it hurts, do it more often”
- “Fail fast”
Implementation
- Encourage active skill practice and mastery
- Promote trying new approaches
- Focus on doing rather than just discussing
- Support sharing of new ideas
Practical Application
The Three Ways provide a framework to:
- Implement specific processes and tools
- Align with CAMS (Culture, Automation, Measurement, Sharing)
- Guide decision-making in DevOps implementation
Key Questions to Consider
- How does this affect the whole system?
- Where can we build in more feedback loops?
- How can we facilitate experimentation and learning?
DevOps Practice Areas: The Five Pillars
Overview
Unlike Agile’s structured methodologies (like Scrum and XP), DevOps doesn’t have a strictly defined approach. However, it consists of five major practice areas that form a comprehensive implementation framework.
The Five Practice Areas
1. Culture
- Focus on creating and maintaining a stable, safe environment
- Key elements:
- Learning and sharing
- Experimentation
- Embracing both success and failure
- Reflects core DevOps values
2. Process
- Foundation: Agile and lean product development techniques
- Key practices:
- Working in small batches
- Limiting work in progress (WIP)
- Incorporating feedback loops
- Lightweight change approval processes
- Strong correlation with IT and business success
- Reflects the “Three Ways” in Lean and Agile frameworks
3. Infrastructure as Code
- Technological approach using:
- Cloud
- Containers
- Programmable infrastructure
- Benefits:
- Reproducibility
- Self-service capabilities
- Rapid scaling
- Improved software delivery and operational performance
4. Continuous Delivery
- Focuses on automation for implementing lean principles
- Key aspects:
- Automated testing
- Frequent deployment of small changes
- Benefits:
- Increased speed
- Improved quality
- Better culture
- Enhanced performance
5. Site Reliability Engineering (SRE)
- Engineering approach to:
- Building reliability into systems
- Operating services with high observability
- Implementing automation
- Applies to both application and infrastructure levels
Important Considerations
Interdependence
- Pillars are not effective in isolation
- Must work together to build a solid DevOps foundation
- Example: High software delivery performance (from continuous delivery) needs operational excellence (from SRE) to deliver business benefits
Implementation Strategy
- Advance all five pillars iteratively
- Avoid focusing on one pillar exclusively
- Balance development across all areas
- Regular assessment of organizational maturity in each pillar is recommended
“In your roadmap to DevOps maturity, you want to advance all five pillars in turn and iterate so that they can reinforce each other. Trying to completely implement one without bolstering the others will end in frustration.”
DevOps Tools Selection Guide
Core Principles
People Over Process Over Tools
“People over process over tools” - Alex Honor (Creator of Rundeck)
Correct Implementation Order:
- Identify responsible people and ensure they have proper skills/support
- Define necessary processes
- Select and implement appropriate tools
Common Mistake:
- Organizations often reverse this order, focusing on:
- Tools first
- Processes second
- People last (if at all)
Tool Selection Criteria
1. KISS Principle
- Definition: Keep It Simple, Stupid
- Rationale: Every tool requires:
- Learning curve
- Implementation
- Upgrades
- Security maintenance
- Integration with other tools
2. Integration Requirements
- Tools should function as a “tool chain”
- Must operate well in dynamic environments
- Key features:
- Good integration capabilities
- Ability to compose solutions
- Automatic adaptation to changes
- API availability
Challenges in Modern DevOps
Complexity Issues
- Increasing complexity in tech landscape
- Example: Cloud Native Computing Foundation’s landscape diagram
- Recent trends show:
- Declining quality of implementations
- Overabundance of tools
- Integration difficulties
Common Tool Categories
Popular tools mentioned:
- Kubernetes
- Terraform
- Ansible
- Puppet
- Chef
- GitHub
- Jenkins
- Docker
- Linux
- Amazon
- Graphite
- Artifactory
Best Practices for Tool Selection
Focus on Collaboration
- Consider how tools enhance team collaboration
- Ensure all value stream participants can use them effectively
Avoid Over-tooling
- Resist the temptation to implement too many tools
- Consider maintenance overhead
Ensure Dynamic Compatibility
- Tools must work with changing environments
- Avoid static configurations
- Prioritize API-driven solutions
Key Takeaway
“There is no such thing as the best tool. There’s only the best tool for you and your specific situation.”
2. DevOps and People: A Culture Change
The Need for DevOps Culture
Current IT Challenges
Traditional IT Department Issues
- IT departments often face low success and satisfaction rates
- Historical misalignment between business teams and technology teams
- Popular media (e.g., “Office Space,” “IT Crowd,” “Silicon Valley”) reflects these real-world challenges
Internal Friction Points
- Conflict exists between various technical teams:
- Developers
- Quality Assurance
- System Administrators
- Information Security Professionals
- Network Administrators
- Database Administrators (DBAs)
The Wall of Confusion
Definition
- Represents the communication barriers between different teams
- Creates division between groups that should share common goals
Typical Flow
- Business throws requirements to developers
- Developers throw code to testers
- Testers throw tested code to operations
- Operations throw final product to end users
Real-World Example
Server Provisioning Case Study
Traditional Process (6 weeks):
- Negotiating specifications
- Procurement process
- Hardware delivery
- Installation in data center
- OS loading
- Final handover
After Virtualization (4 weeks):
- Technical process reduced to 15 minutes
- Organizational overhead still resulted in 4-week delays due to:
- Standards
- Ticketing systems
- Documentation requirements
Business Impact
Executive Perspective
- Modern business executives are increasingly tech-savvy
- Question why 15-minute tasks take 4 weeks
- Concerned about:
- Financial waste
- Time inefficiency
- Competitive disadvantage
Common Reactions
- Turn to outsourcing
- Develop shadow IT
- Seek alternatives to central IT department
“The organizations and processes we’ve built up around IT” have created unnecessary complexity and delays, highlighting the need for a DevOps culture to bridge these gaps and improve efficiency.
Building DevOps Culture Through Communication and Trust
The Importance of Communication
- Communication and trust are fundamental to a productive DevOps culture
- Project success (from deployments to acquisitions) heavily depends on communication quality
- Without proper communication and trust:
- Technical practices may fail
- Goals may compete
- Misunderstandings can occur
Effective Communication Strategies
Structured Communication Channels
- Establish dedicated channels for specific purposes:
- File repositories for customer information
- Chat channels for downtime incidents
- Email aliases for software release communications
Communication Planning
- Good communication requires intentional planning
- Essential for:
- Fast-moving organizations
- High-pressure situations (e.g., outages)
- Need clear processes defining:
- When to communicate
- Who to communicate with
- How to handle business events
Organizational Types (Westrum Model)
Pathological Organizations
- Everyone looks out for their own needs
- Limited information flow
Bureaucratic Organizations
- Focus on strictly defined roles
- Teams defend their turf
Generative Organizations
- Mission-focused
- Most effective information flow
- Features high trust environment
- Welcomes bad news as learning opportunities
Building Trust and Respect
Personal Development
- Acknowledge that not everyone has natural social skills
- Recommended resources for improvement:
- “How to Win Friends and Influence People”
- “Crucial Conversations”
- “How to Say It At Work”
Key Principles
Assume Good Faith
- Most people try to do their best
- Actions are based on perceived constraints
- Misunderstandings often stem from lack of context
Promote Transparency
- Share access to:
- Chat rooms
- Team Wiki pages
- Code repositories
- Infrastructure details
- Monitoring tools
- Ticket trackers
Break Down Barriers
- Don’t over-restrict communication
- Challenge unnecessary “least privilege” restrictions
- Recognize business value in transparency
Best Practices
- Create shared goals across teams
- Provide visibility into different team activities
- Be open and transparent
- Stay curious and respectful
- Focus on understanding others’ perspectives
- Align goals across teams
- Show value for others’ needs
“There’s no shortcut to building mutual trust. It develops over time.”
Real-World Example
- Situation: Developer-Operations conflict over priorities
- Problem: Lack of understanding about operations team’s workload
- Solution: Implemented program to create:
- Shared goals
- Better visibility
- Cross-team understanding
- Result: Improved working relationships and effectiveness
Breaking Silos in DevOps: Enhancing Collaboration
The Wall of Confusion
Root Causes
- Not primarily due to poor people skills of tech professionals
- Main cause: Institutional incentivization of opposing behaviors
- Different teams have conflicting responsibilities:
- Development teams: Focus on new functionality and rapid changes
- Operations teams: Maintain stability and control change
Impact of Misaligned Incentives
- Creates harmful conflicts of interest
- Diminishes feedback loops
- Local optimization interferes with global optimization
- Teams focus only on individual metrics rather than organizational success
Conway’s Law
“Systems will merely always align themselves to your communication boundaries.”
- Organizational boundaries act as communication boundaries
- First wave of DevOps emphasizes alignment around value stream
- Simply renaming teams to “DevOps” without structural changes is ineffective
Solutions for Breaking Silos
1. Cross-Functional Teams
- Integrate people from different specialties to work together
- Success Story Example:
- Large SaaS company in Austin
- Embedded ops engineer into dev team
- Shared ticket backlog between dev and ops tasks
- Results:
- Developers gained understanding of operational requirements
- Increased respect and collaboration
- Shared responsibility for production service
2. Self-Service Tooling
- Implement automated access to shared services
- Benefits:
- Reduces dependencies between teams
- Increases efficiency
- Eliminates unnecessary waiting times
- Better alignment with specific team needs
3. Aligned Communication and Goals
- Role Evolution Requirements:
- Developers:
- Take responsibility for build/deployment failures
- Participate in on-call rotations
- Operations/QA:
- Shift to providing self-service platforms
- Focus on guidance rather than direct execution
- Developers:
Three-Step Path to Enhanced Collaboration
- Reduce Separate Teams:
- Eliminate silos
- Create cross-disciplinary teams
- Implement Self-Service:
- Virtually remove team dependencies
- Align Remaining Teams:
- Promote collaboration
- Ensure mutual support
- Align goals across teams
Action Items
- Evaluate organizational maturity in these areas
- Identify specific actions for improvement
- Plan implementation steps towards collaborative goals
Continuous Learning in DevOps: The Third Way
Core Concepts
The Third Way Fundamentals
- Focuses on creating a culture of continuous experimentation and learning
- Emphasizes:
- Mastering core skills
- Experimenting and taking risks
- Learning through practical experience
Kaizen (改善)
- Japanese concept meaning “change for the better”
- Translates roughly to continuous improvement
- Key component of Toyota Production System (TPS)
- Introduced to Western world in 1986 through Masaaki Imai’s book
- Adopted by major companies including:
- Lockheed Martin
- Pixar Animation Studios
Five Principles of Kaizen
- Knowing the customer
- Enabling smooth workflow
- Going to the real place (gemba)
- Empowering people
- Maintaining transparency
Gemba (現場)
- Means “the real place” in Japanese
- Emphasizes direct observation and involvement
- Key practice: Go to where value is created or where problems exist
- Avoid relying on:
- Secondary reports
- Metrics alone
- Documentation
- Assumptions
“Show up in the project meeting. Go look at the code. Go try and use the system having problems.”
Implementation Process
Kaizen Improvement Process (Kata)
Follows the cycle of:
- Plan: Define intentions and expected results
- Do: Execute the plan
- Check: Measure and analyze results
- Act: Make necessary alterations
Key characteristics:
- Similar to scientific method
- Focuses on small, daily improvements
- Creates new baselines when improvements are successful
- Builds critical thinking skills
Practical Application
Best Practices
- Make small iterative changes regularly
- Implement improvements as part of daily work
- Focus on teaching people critical thinking skills
- Build people before building systems
Common Pitfalls to Avoid
Avoiding variations like:
- Plan, don’t do, hide
- Try to make it to Friday
- Waiting for weekend instead of improving
Action Items
- Use notebook function in course to document:
- Potential improvement areas
- Small, tangible next steps
- Ideas for iterating towards DevOps
- Progress and learning outcomes
3. DevOps and Process: The Building Blocks
DevOps and Agile: Historical Context and Framework
Origins of DevOps
- First DevOps Discussion:
- Occurred at Agile 2008 conference in Toronto
- Between Patrick Deis and Andrew Clay Schaeffer
- Started as an “Agile infrastructure” discussion
Key Historical Events
- 2008: Initial discussion at Agile conference
- 2009:
- Andrew presented on Agile infrastructure at Velocity Conference
- Patrick started “DevOps Days” conference in Belgium, coining the term “DevOps”
Understanding Software Development Lifecycle (SDLC)
Traditional Steps:
- Requirements gathering
- Design creation
- Implementation
- Testing
- Deployment
- Maintenance
Waterfall vs. Agile Approach
Waterfall Method:
- Sequential, linear approach
- Complete documentation before proceeding
- “Throwing over the wall” mentality between teams
- Results in:
- Loss of context
- Quality issues
- Excessive rules and contracts
- Finger-pointing
Agile Method:
- Iterative approach
- Small, frequent iterations
- Active collaboration between teams
- Includes end-user feedback
- Focuses on working software
Agile Benefits (According to Version One’s Survey)
- 85% increased productivity
- 80% faster time to market
- 81% better delivery time predictability
- 79% enhanced software quality
Limitations of Agile
- No mention of operations in original manifesto
- Doesn’t address systems aspects:
- Infrastructure building
- Application deployment
- Monitoring
- Maintenance
DevOps and Agile Relationship
- Not identical: Can be practiced independently
- Best Practice: Implement DevOps as an extension of Agile
- DevOps addresses the operational gaps in Agile
Historical Challenge
“In the beginning, Agile was seen as a threat by the infrastructure side of the house and IT organizations”
- Operations teams initially struggled with Agile’s iteration speed
- Success was found when operations teams adopted Agile principles themselves
DevOps and Agile: Historical Context and Framework
Origins of DevOps
- First DevOps Discussion:
- Occurred at Agile 2008 conference in Toronto
- Between Patrick Deis and Andrew Clay Schaeffer
- Started as an “Agile infrastructure” discussion
Key Historical Events
- 2008: Initial discussion at Agile conference
- 2009:
- Andrew presented on Agile infrastructure at Velocity Conference
- Patrick started “DevOps Days” conference in Belgium, coining the term “DevOps”
Understanding Software Development Lifecycle (SDLC)
Traditional Steps:
- Requirements gathering
- Design creation
- Implementation
- Testing
- Deployment
- Maintenance
Waterfall vs. Agile Approach
Waterfall Method:
- Sequential, linear approach
- Complete documentation before proceeding
- “Throwing over the wall” mentality between teams
- Results in:
- Loss of context
- Quality issues
- Excessive rules and contracts
- Finger-pointing
Agile Method:
- Iterative approach
- Small, frequent iterations
- Active collaboration between teams
- Includes end-user feedback
- Focuses on working software
Agile Benefits (According to Version One’s Survey)
- 85% increased productivity
- 80% faster time to market
- 81% better delivery time predictability
- 79% enhanced software quality
Limitations of Agile
- No mention of operations in original manifesto
- Doesn’t address systems aspects:
- Infrastructure building
- Application deployment
- Monitoring
- Maintenance
DevOps and Agile Relationship
- Not identical: Can be practiced independently
- Best Practice: Implement DevOps as an extension of Agile
- DevOps addresses the operational gaps in Agile
“You can practice DevOps without Agile and vice versa. But it can, and frankly probably should be implemented as an extension of Agile for best results.”
Historical Challenge
- Initially, Agile was seen as a threat by infrastructure teams
- Operations teams struggled with new iteration cadence
- Success was found when operations teams adopted Agile principles themselves
Visible Ops Change Control Process
Introduction
- Change is the primary cause of technical issues
- 80% of outages are caused by changes intended to improve, patch, or upgrade systems
- Solution: Implement controlled changes through review, testing, and scheduled rollouts
IT Service Management (ITSM) Background
- Emerged in 1980s as IT operations scaled
- Focuses on service delivery and support
- Notable frameworks:
- Microsoft Operations Framework
- COBIT
- ISO 20000
- Six Sigma
- ITIL (IT Infrastructure Library) - Most popular framework
- Currently in 4th major version
- Covers 34 different areas
- Known for heavy-handed, slow processes
Traditional ITIL Change Management Issues
- Requires extensive documentation for all changes
- Relies on Change Advisory Board (CAB) for approval
- Problems:
- Too slow for modern technical organizations
- Approval decisions made by those least qualified
- Tends to add more process when changes fail
Visible Ops Approach
- Introduced by Gene Kim, Kevin Bear, and Gene Spafford in 2004
- Published in “The Visible Ops Handbook”
- Condensed ITIL implementation into 4 practical steps
- Only 112 pages vs. ITIL’s 2000+ pages
- Focuses on lightweight, fast, scalable, repeatable change control
Key Principles of Lightweight Change Control
Review and Documentation Requirements
- All changes need review, approval, and documentation
- Peer review by technologists close to the team
- Risk-based escalation for complex changes
- Example: Wireless access point installation vs. core router replacement
Change Size Management
- Keep changes as small as possible
- Benefits:
- Easier to review
- Simpler to identify and fix errors
- Better than batch releases with hundreds of changes
Early Testing Implementation
- Use continuous integration systems
- Implement automated testing
- Include security safeguards early in development
- Peer review validates testing completion
Research Support
- Google DevOps Research and Assessment Group findings:
- Streamlined change approval processes lead to:
- Higher performance
- Lower burnout levels
- Increased psychological safety
- Streamlined change approval processes lead to:
Additional Resources
- LinkedIn Learning course: “IT Service Management Foundations Change Management” by Earnest
- Detailed guidance on setting up lightweight change control processes
4. Infrastructure as a Code
Infrastructure as Code (IaC)
Traditional Infrastructure Management
- Historically, infrastructure was managed manually:
- Building data centers
- Installing physical servers
- Loading operating systems (Windows/Linux)
- Configuring software
- Installing applications
Problems with Manual Management
- Each system became highly individual (“special snowflakes”)
- System administration was:
- Slow
- Error-prone
- Hard to maintain consistency
- Difficult to track changes
Modern Infrastructure as Code
Definition
“Infrastructure as code is provisioning and managing infrastructure through writing automation code instead of through manual processes.”
Key Concepts
- Programmable Infrastructure:
- Write code to configure networks
- Set up servers
- Attach storage
- Configure operating systems
- Install applications
Benefits
- Aligns with DevOps CAMS values:
- Culture
- Automation
- Measurement
- Sharing
- Supports lean theory by:
- Removing waste
- Reducing delays
Modern Systems Challenges
Complexity Factors
- Distributed systems
- Microservice architectures
- Cloud infrastructure
- Containers
- Machine learning
- Ephemeral (temporary) components
New Approach: “Cattle not Pets”
- Old way: Servers were “pets” (individually crafted and maintained)
- New way: Servers are “cattle” (managed en masse)
Best Practices
- Adopt a development lifecycle approach
- Combine both operational and development perspectives:
- Operations expertise with tools
- Developer expertise with code
- Version control for infrastructure code
- Automated testing and deployment
- Consistent build and deployment processes
Benefits of IaC
- Scalability
- Consistency
- Reproducibility
- Efficiency
- Version control
- Automated deployment
- Reduced human error
DevOps Infrastructure as Code: Configuration Management Overview
Core Concepts
Configuration Management Definition
- Process for creating and maintaining systems and software in a desired state
- In DevOps: All configuration management should be automated and code-driven
Three Main Components
Provisioning
- Making servers and computing infrastructure ready for operation
- Includes:
- Hardware/virtual hardware setup
- Operating system installation
- System services configuration
- Network connectivity setup
Deployment
- Automated installation and upgrading of application software
- Applies to both:
- In-house developed software
- Third-party products
Orchestration
- Coordinated operations across multiple systems
- Examples:
- Automated failover
- Rolling deployments
- Running runbooks across server fleets
Key Terminology
Approach Types
Imperative (Procedural)
- Defines and executes specific commands to produce desired state
- Example:
1. Stop service 2. Copy new NGINX binary 3. Start service
Declarative (Functional)
- Defines desired end state
- Tool handles convergence to that state
- Example: “Server should run NGINX v1.24”
- Usually builds on top of imperative systems
Important Characteristics
Idempotent
- Ability to execute repeatedly with same end result
- Declarative tools typically built to be idempotent
- Must be manually ensured in imperative approaches
Self-Service
- Allows end users to initiate processes independently
- Benefits:
- Removes operations team from critical path
- Increases velocity
- Improves developer satisfaction
Drift
- Deviation from defined configuration
- Causes:
- Manual changes outside tool
- Script execution issues
- Many tools include drift detection capabilities
Notes
- Configuration management tools often overlap in functionality
- Tool selection should consider specific use cases
Evolution of DevOps Configuration Management
Early Days (1990s)
- Commercial IT Provisioning Tools:
- Ghost (system cloning)
- Enterprise suites like Tivoli and HP
- Focus on separate dev and ops approaches
Rise of Infrastructure as Code (2000s)
Major Configuration Management Tools
- CFEngine
- Puppet
- Chef
“Our Unix admin team started using CFEngine to roll out operating system configurations” (circa 2005)
Challenges
- Lack of collaboration between teams
- Resistance to sharing tools across different functions
- Configuration drift issues
Golden Image vs. Foil Ball Debate (2009)
- Luke Kanies (Puppet founder) highlighted problems with image management:
- Image sprawl
- Configuration drift
New Approach: Stem Cell System
- Minimal initial server images
- Declarative CM tools for provisioning
- Idempotent tools for:
- Preventing configuration drift
- Managing updates
- Automatic state convergence
Cloud Era Challenges
Why Automated Server Provisioning Became Essential
- Increased virtualization
- Dynamic server instances
- Growth in distributed systems
- Exponential increase in virtual servers
Orchestration Problems
Traditional CM Tool Limitations
- 15-minute wake-up cycle
- Individual server checks
- Pull-based changes
- Issues with:
- High availability requirements
- Coordinated database/application changes
Initial Vendor Response
“You don’t need orchestration and if you think you do, you don’t understand configuration management.”
Evolution in the 2010s
New Tools and Approaches
Ansible and SaltStack:
- Push mechanism
- Explicit orchestration
- Dev-friendly deployment
- Workflow automation capabilities
Hybrid Solutions:
- Combined push deployment with idempotence
- Integration with existing CM tools
Self-Service Tools:
- Rundeck for orchestration
- Compliant system activities
- On-demand initiation
Limitations of Early CM Tools
- Limited application deployment capabilities
- Lack of virtual infrastructure provisioning
- Focus primarily on system administration
- Gap in addressing broader value stream needs
Evolution of Infrastructure as Code (IaC) in DevOps
Cloud Computing Era (2010s)
- Enabled creation of servers, storage, and networks through code
- Shifted from manual installation to programmatic infrastructure management
- Introduced model-driven provisioning with declarative approaches
AWS CloudFormation Example
- Provides templates for defining cloud assets
- Allows automatic instantiation of resources
- Uses declarative specifications for server configurations
Advanced IaC Solutions
Specialized Tools
- Terraform and Pulumi:
- Emerged as dominant solutions
- Provide domain-specific languages for infrastructure provisioning
Programming Language Integration
- Python: Boto library
- AWS CDK: Enables pure code solutions
- Note: These solutions may be less idempotent
Container Revolution (Late 2010s)
Key Features
- Reduced server dependency
- Docker containers package applications with minimal OS dependencies
- Streamlined development and testing cycles
Benefits for Developers
- Bundled runtime with applications
- Reduced runtime bugs
- Improved development workflow
Immutable Infrastructure
Netflix Model
- Adopted golden image approach
- Created cloud images with baked-in applications
- Moved away from configuration management across servers
Characteristics
- Servers not modified after deployment
- Replace rather than modify approach
- Reduces configuration drift through design
Modern Container Orchestration (2020s)
Platforms
- Kubernetes
- Mesos
Features
- Unified solution for:
- Provisioning
- Deployment
- Orchestration
- Template-based application and infrastructure changes
- Automated coordination of changes
Serverless and PaaS
- Simplifies deployment process
- Abstracts infrastructure management
- Note: Platform operation still requires maintenance and oversight
Future Outlook
- Moving towards integrated toolchains
- Focus on simplified infrastructure management
- Continued evolution of IaC approaches
“Someone operating the platform still has to worry about it” - highlighting the ongoing need for infrastructure expertise despite automation advances.
Infrastructure as Code (IaC) Toolchain Selection Guide
Core Principles
- Choose tools appropriate for team’s skill level
- Start simple, scale complexity as needed
- Plan the entire toolchain before implementation
- Design operational environment before creation
Key Decision Points
1. Infrastructure Management
Self-Managed vs. Managed Service Options:
- Self-managed infrastructure:
- Digital Rebar for bare metal automation
- Handles PXE booting, BIOS, RAID configuration
- OS and hypervisor installation
- Integrates with tools like Terraform
- Digital Rebar for bare metal automation
2. Infrastructure Provisioning
Three Main Approaches:
Template-Driven:
- Amazon CloudFormation
- Azure ARM templates
- Uses JSON/YAML format
Custom Language Solutions:
- Terraform
- Pulumi
- Benefit: Works across multiple cloud providers
Pure Code Approach:
- Python boto
- Amazon CDK
- Azure Bicep
- Leverages full programming languages
3. System Management
Options:
Runtime Configuration:
- Chef
- Puppet
- CFengine
Configuration + Orchestration:
- Ansible
- Salt
Image Creation:
- Hashicorp Packer for automated image building (“baking”)
- Docker files for container images
Note: These approaches can be combined. Example: Configure base image with Chef, then bake with Packer
4. Orchestration Options
- Configuration management tools (Ansible/Salt)
- Platform-based (Kubernetes/Mesos)
- External runbook automation (Rundeck)
- Custom code solutions
5. Application Deployment Methods
- Configuration management
- Immutable deployments (container/system images)
- Continuous deployment systems
6. Testing Strategy
Important Considerations:
- Essential component of infrastructure as code
- Utilize existing test frameworks
- Implement both:
- Unit testing for infrastructure code
- Integration testing for produced infrastructure
Real-World Example
Enterprise SaaS Implementation
Tools Used:
- Terraform: Base infrastructure, network, core servers
- Puppet: Base image configuration
- Packer: Image baking
- Rundeck: Orchestration and updates
Process Flow:
- Infrastructure building with Terraform
- Configuration management with Puppet
- Image creation with Packer
- Orchestration via Rundeck
- Continuous integration pipeline for testing
Simplified System Example
Tools Used:
- CloudFormation: Base infrastructure
- Docker: Container creation
- Amazon managed container service: Orchestration
Benefits:
- Simpler implementation
- Less maintenance overhead
- Cost-effective
- Suitable for immutable deployment
5. Continuous Delivery
Continuous Delivery Overview
Key Stages in Software Development
Build Stage
- Compile and test code
- Convert code into software
Deploy Stage
- Run the software
- Test the software
Release Stage
- Send software to end users
- Deploy to production environment
Traditional vs. Modern Approaches
Old Way (Traditional)
- Application built only at major milestones
- Large, complex integration builds
- Long test phases
- Late bug detection
- Error-prone and wasteful
Modern Approach (CI/CD)
Continuous Integration (CI)
- Automatic building and unit testing
- Occurs on every source code check-in
- Maintains application in working state
Continuous Delivery (CD)
- Deploys changes to production-like test environment
- Automated integration and acceptance testing
- Ensures application is always release-ready
Continuous Deployment
- Automatically releases to production
- Used by major companies (Amazon, Meta, Google, Wells Fargo)
- Can lead to 10+ deployments daily
Benefits of CI/CD
Performance Improvements
- Decreased deployment time
- Faster market validation
- Rapid experimentation
- Lower change failure rate
- Earlier bug detection
Key Advantages
Quality
- Testing occurs earlier in process
- Changes evaluated one by one
- Continuous working state maintained
Recovery
- Easier to identify failure sources
- Quick bug fix deployment
- Better problem isolation
Real-World Impact
Performance Metrics
- High Performers: Deploy changes in < 1 hour
- Low Performers: Deploy changes in 1-6 months
Case Study Example
“By overlaying our database connection growth graph with the deploys that happened that week, we could quickly figure out precisely which production deployment correlated with the increase of database connections.”
DevOps Principles
- Follows first way of DevOps (optimizing end-to-end flow)
- Implements second way through fast feedback loops
- Reduces Work in Progress (WIP)
- Minimizes risk and waste from undelivered code
Common Problems Solved
- Eliminates panic from monthly release cycles
- Reduces error-prone manual releases
- Prevents finger-pointing during issues
- Enables quick problem identification and resolution
Six Practices for Continuous Integration
Overview of CI/CD Pipeline
- Continuous Integration, Delivery, and Deployment form a pipeline
- Each stage flows from build → deploy → release
- Each stage depends on successful completion of previous stage
Continuous Integration Basics
- Purpose: Keep software in working state at all times
- Process:
- Automatically triggered build on each commit
- Builds entire codebase
- Runs unit tests and code validation
- Packages artifact
- Provides build status and log
Six Key Practices
1. Fast Builds
- Should pass the “coffee test” (approximately 5 minutes)
- Why: Longer builds lead to:
- Developers batching changes
- Increased Work in Progress (WIP)
- System problems
2. Small Commits
- Commit smallest possible amount of code
- Benefits:
- Easier for team to understand
- Simpler failure isolation
3. Fix Broken Builds Immediately
- Build breaks are normal and expected
- Important: Don’t leave builds broken
- Recommended:
- Delay meetings until build is fixed
- Stop all work until resolution
- Sets tone for delivery culture
4. Use Trunk-Based Development
- Two Main Development Approaches:
- Branch-based development
- Developers work on separate branches
- Long development time
- Problematic merges
- Trunk-based development
- No long-running branches
- Multiple small changes daily
- Always up-to-date trunk
- Branch-based development
- Feature Management: Use feature flags instead of branches
- Recommendation: Choose trunk-based approach
- Minimizes WIP
- Ensures frequent code review
- Reduces merge issues
5. Address Flaky Tests
- Fix unreliable tests immediately
- Inconsistent test results reduce trust in CI system
- Impacts build artifact reliability
6. Build Output Requirements
- Status: Simple pass/fail or red/green indicator
- Log: Detailed record of tests and results
- Aids troubleshooting
- Supports compliance
- Artifact: Installable application version
- Should be uploaded and tagged with build number
- Ensures auditability and immutability
Action Item
“Take a moment and use the course notebook to reflect and write down the next steps you could take to implement some of these six practices in a build pipeline you work with.”
Five Practices for Continuous Delivery
Core Concept
“It’s not how much you can deliver, but how little.” - Jez Humble and Dave Farley
Pipeline Structure
- Build Stage → Deployment Stage
- Deploy successful build artifacts to live environment
- Environment should mirror production
- Names may vary: CI, staging, test, or pre-production
- Automated testing follows deployment
Five Key Techniques
1. Artifact Management
- Create single artifact upon successful build
- Types of artifacts:
- RPM or Debian packages
- MSI installers
- Java WAR files
- ZIP files
- Build once, use across all environments
- No rebuilding for different stages
2. Artifact Immutability
- Artifacts must remain unchanged throughout pipeline
- Access Control:
- CI system: Write access only
- Deployment system: Read access only
- Benefits:
- Builds trust between teams during debugging
- Enables verification through checksums
- Maintains auditability
- Allows tracing from code version → build artifact → running system
3. Pre-production Environment
- Must mirror production environment as closely as possible
- Must Include:
- Load balancers
- Network settings
- Security controls
- Production-like data
- Enables thorough testing:
- Acceptance testing
- Smoke tests
- Integration tests
4. Pipeline Control
- System must halt pipeline on any failure
- Stop Points:
- Broken build → No deployment
- Failed deployment → No release
- Focus on overall software delivery flow, not individual productivity
- Team should collaborate to fix issues
5. Idempotent Deployments
- Multiple deployment runs should yield identical results
- Implementation Options:
- Immutable packaging (Docker containers)
- Configuration management tools (Puppet, Chef)
- Eliminates variability in pipeline
- Builds trust in deployment process
Note
The authors recommend reading “Continuous Delivery” by Jez Humble and Dave Farley for comprehensive understanding.
The Role of QA in DevOps
Introduction
- Continuous delivery benefits:
- Faster deployments
- Fewer bugs
- Less technical debt
- Better dev-ops collaboration
The Importance of Automated Testing
- Key Point: Automated testing is crucial for CI/CD success
- Manual testing:
- Considered slow and unreliable
- Best reserved for final acceptance testing only
- Modern QA role:
- QA professionals work alongside developers
- Focus on designing and writing tests
- Let automation handle repetitive testing tasks
Testing Types (Bottom-up Approach)
1. Unit Testing
- Most developer-centric testing
- Characteristics:
- Written by developers within the codebase
- Validates individual function behavior
- Fastest testing method
- Uses stubs to bypass external dependencies
- Run locally during development
2. Code Hygiene
- Checks code against language/framework best practices
- Implemented using:
- Linters
- Formatters
3. Integration Testing
- Performed in test environment
- Tests:
- Individual component functionality
- Inter-component interactions
- All dependencies included
4. Acceptance/End-to-End Testing
- Tests complete product from user perspective
- Often UI-level testing
- Can be automated
- Manual verification still valuable for final checks
Test-Driven Development (TDD) & Behavior-Driven Development (BDD)
- Write tests before implementing code
- Process example:
- Write test for desired output
- Test fails initially
- Implement functionality
- Test passes when implementation is correct
Handling Slow Tests
Strategies:
Parallel Execution
- Run slow tests alongside pipeline
- Don’t block until final release
Scheduled Testing
- Nightly test suites
- Regular scheduled runs
Continuous Testing
- Run against test environment
- Accept possibility of non-critical bugs
- Quick fixes possible in CD environment
Additional Testing Types
- Infrastructure testing
- Performance testing
- Security testing
- Browser compatibility testing
- Compliance testing
Key Takeaway
“Getting good at automated testing is your single most significant factor in successful continuous delivery.”
Continuous Deployment Overview
Key Differences from Continuous Delivery
When to Consider CD
- Organizations may not be ready for Continuous Deployment due to:
- Need for manual test cycles
- Product manager sign-off requirements
- Preference for bundled changes over frequent small updates
Prerequisites
- Strong CI/CD foundation
- Automated approvals and testing within pipeline
- Manual workflow steps can be integrated (like code reviews)
- Feature flags enable pre-deployment of code before user access
“If you stay ready, you ain’t got to get ready.” - Suga Free
Release Stage Components
Process Flow
- Artifact passes all tests
- Artifact is marked as released
- Deployment to production environment
- Trigger notifications for:
- Compliance
- Internal communication
- End user communication
Production Considerations
- Complexity: Production releases often require significant engineering work
- Challenges:
- Packaged software: Focus on data and configuration compatibility
- Running services: Must handle live users and flowing data
- Important: Test environment must mirror production deployment procedures
Production Release Patterns
Types of Deployments
Rolling Deployment
- Upgrades one system at a time
- Allows seamless traffic shifting
Blue-Green Deployment
- Creates entirely new version
- Switches traffic from current (Blue) to new (Green) system
- Can involve swapping environments or creating new ones in cloud
Canary Deployment
- Upgrades single system
- Tests under production load
- Monitors for issues
A/B Deployment
- Uses feature flags
- Releases features to specific user subsets
- Useful for:
- Canary Testing
- Public Beta Testing
Real-World Implementation Example: Signal Sciences
System Overview
- Built internal tool called “Deployer” (inspired by Etsy’s Deployinator)
- Enabled company-wide deployment capabilities
- Five-minute deployment time from commit to production
Key Features
- Push-button deployment to staging
- Automated testing
- Self-service automation
- Feature flag implementation
- Gradual release strategy:
- Internal users
- Early adopters
- All customers
Success Factors
- Strong CI/CD foundation
- Self-deploying capability
- Integration of DEV and OPs workflows
- Focus on user experience
Important Considerations
- Deployment strategy must align with:
- Packaging choices
- Infrastructure as code strategy
- Software architecture
- Requires collaboration across teams
- System should be opinionated with clear, standardized procedures
DevOps CI Toolchain Overview
Approach to Building a CI Toolchain
- Traditional approach: Start from developer and work outward
- Recommended approach: “Onion Layer” model - start from outer layer and work inward
- Focus on end-state perspective when considering the entire toolchain
Layer 1: Deployment (Outermost Layer)
Deployment Considerations
- Determine how software will be deployed:
- Containers
- System images
- Windows installers
Deployment Types & Tools
A/B Deployments
- Requires feature flagging
- Tools:
- LaunchDarkly
- Split
- Custom-built solutions
Rolling Deployments
- Requires orchestration tools
- Platform-specific options:
- Kubernetes
- Serverless
- Ansible
- Salt
Layer 2: Artifact Repository
General Solutions
- Artifactory
- Nexus
Specialized Solutions
- Cloud provider container repositories
- Language-specific repositories (e.g., bit.dev for NPM)
- Minimal solution: Build system tagging + Amazon S3
Layer 3: Building & Testing
Testing Categories
Unit Testing
- Language-specific tools (e.g., go test for Golang)
Code Hygiene & Linters
- ESLint (JavaScript)
- Staticcheck (Golang)
Integration Testing
- Pytest (Python)
- TestNG (Java)
Acceptance/End-to-End Testing
- Selenium
- Cypress.io
- Robot Framework
- Postman
Additional Testing Types
Infrastructure Testing
- InSpec
- ChefSpec
Performance Testing
- JMeter
- LoadRunner
Security Testing
- GitHub Dependabot
- GitGuardian
- Dryrun Security
- StackHawk
Layer 4: Build System
Options
Jenkins (Open source)
- Pros: Community support, wide integration
- Cons: UI navigation challenges
SaaS Solutions
- CloudBees
- CircleCI
- GitHub Actions
Layer 5: Version Control (Innermost Layer)
Popular Options
- Git-based:
- GitHub
- GitLab
- Bitbucket
Specialized Version Control
- Perforce
- PlasticSCM (for large binary assets)
Best Practices
Track Cycle Time
- Measure time from developer system to production
- Record and share metrics with team
- Actively work to improve cycle time
Tool Selection
- Choose tools that reduce overall cycle time
- Consider integration capabilities
- Factor in team expertise and requirements
6. Site Reliability Engineering
Site Reliability Engineering (SRE) Overview
Definition and Core Concepts
- SRE is the practical operations component of DevOps
- Engineering Definition: Application of theoretical principles to solve real-world problems
- Reliability Definition: System’s ability to perform intended functions correctly and consistently
- Encompasses:
- Availability
- Performance
- Security
- Service delivery capabilities
Origins
- Originated at Google, focusing initially on website reliability
- Google published a free online book titled “Site Reliability Engineering”
Key Components
1. Operational Aspects
- Monitoring production services
- Managing systems
- Problem resolution
- Automation of operational processes
2. Patrick Debois’ Four Key DevOps Areas
- Extending delivery to production
- Extending feedback from operations to dev
- Embedding dev into operations
- Embedding ops into dev
3. Holistic Approach
Two main components:
Building Reliability
- Focus on constructing resilient systems
- Emphasis on maintainability
- Engineering for reliability from the start
Operational Feedback
- Observability practices
- Incident response procedures
- Production operations feedback loop
Business Impact and Metrics
SRE improves key performance indicators:
Change Failure Rate
- Reduces production issues through:
- Reliability testing
- Deployment automation
- Reduces production issues through:
Time to Restore Service
- Improves through:
- Enhanced problem detection
- Operational automation
- Disciplined processes
- Improves through:
Service Level Objectives
- Better meeting of uptime goals
- Improved performance targets
- Enhanced through observability and resilience
Important Notes
“You can’t just bolt on reliability once something goes live.”
- SRE requires proactive engineering approach
- Combines both development and operational perspectives
- Requires continuous improvement through feedback loops
Building for Reliability: Design Theory
Core Concepts
- Success in production largely depends on design-time decisions and software architecture
- Focus on creating reliable applications through thoughtful planning
- Important to understand how applications work in real systems environments
Key Resources
1. “Release It!” by Michael Nygard
- Equivalent to “Gang of Four Design Patterns” but focused on stability
- Key Findings:
- Integration points are the #1 cause of architectural issues
- Cascading failures are the biggest threat to stability in layered architecture
Example of Cascading Failure:
- Database layer issues can lead to:
- Exhaustion of database connection pools
- Application server tier choking
2. Circuit Breaker Pattern
- Purpose: Prevents cascading outages
- Functionality:
- Monitors integration point failures/slowness
- Stops making calls when unusual failure rates detected
- Works with timeouts to prevent outage spread
- Implementation: Available through libraries like Resilience4j
3. Twelve-Factor App (12factor.net)
- Manifesto for service-ready software
- Example - Factor 3 (Config):
- Separate runtime configuration from app code
- Store in environment variables
- Keep configurations independent
- Avoid environment groupings
- Benefits: Reduces fragility and improves portability
4. Martin Fowler’s Resources
- Provides concise descriptions of architectural concepts:
- Page objects
- Serverless
- Bimodal IT
- DevOps topics
- Perspective from experienced software engineer
Modern Architecture Considerations
- Microservice architectures have multiple integration points
- Higher likelihood of integration point failures
- Need for robust stability patterns and solutions
Action Items
- Schedule focused time to study these patterns
- Evaluate patterns based on your technical ecosystem
- Consider implementing solutions for common production failures
“Take a minute to schedule some focus time on your calendar to look more deeply into these patterns and consider which may have value in your own particular technical ecosystem today based on the kinds of failures that you see in production.”
Building for Reliability: Key Principles
Core Concept: Dev vs Ops Background
“Dev comes from school, but Ops comes from the street.”
- Developers typically have computer science backgrounds
- System administrators often self-taught through real-world experience
- SRE bridges ops experience with disciplined engineering approach
Understanding System Failure
Fundamental Truths
- All systems fail
- Individual components fail frequently
- Slowdowns are as threatening as complete outages
- Systems often run in degraded mode
Swiss Cheese Model
- System components are like stacked Swiss cheese slices
- Problems occur when holes (failures) align
- Multiple layers provide protection against complete failure
Richard Cook’s “How Complex Systems Fail”
Key findings:
- Changes introduce new forms of failure
- Complex systems contain latent failures
- Complex systems always run in degraded mode
System Availability Metrics
- Measured in “nines” of availability
- Examples:
- Three nines (99.9%): 8.77 hours downtime/year
- Five nines (99.999%): 5.26 minutes downtime/year
Resilience Engineering
Definition
“Resilience is the intrinsic ability of a system to maintain or regain a dynamically stable state, which allows it to continue operations after a major mishap and/or in the presence of a continuous stress.”
Key Tools and Approaches
Redundancy
- Multiple identical copies of components
- Maintains service if one fails
Load Balancing
- Directs traffic to healthy system parts
- Traffic shaping for optimal performance
Automatic Scaling
- Adds resources as needed
- Eliminates need for manual server upgrades
Example: Kubernetes
- Runs redundant copies of core services
- Built-in health checking
- Automatic failover
- State replication across multiple locations
Sociotechnical Systems
Important Considerations
- People are integral parts of the system
- Human actions can both break and maintain system health
- Systems are always partially broken
- Expert intervention is necessary
SRE Best Practices
Time Management
- At least 50% of time should be spent developing tools
- Focus on automation over manual fixes
Developer Involvement
“You write it, you run it”
- Developers should be on-call for their code
- Must be proficient with debugging and monitoring tools
- Required to support services until proving stability
Documentation
- Create comprehensive runbooks
- Document safe intervention procedures
- Establish monitoring and control systems
Key Takeaway
Building reliable systems isn’t about achieving perfect uptime, but rather creating resilient systems that can maintain functionality despite partial failures and require skilled practitioners for maintenance and improvement.
Observability in Systems
Overview
- Observability measures how well internal system states can be understood from external outputs
- Goal: Understanding system state through metrics and logs to enable action and improvement
- Supports the Three Ways principles through feedback loops
Five Key Areas of Observability
1. Synthetic Checks
- Also known as health checks
- Programmatic testing of service performance and uptime
- Not based on real user traffic
- Answers basic question: “Is it working?”
- Can be implemented at both:
- High-level service checks
- Sub-component levels
2. System and Application Metrics
System Metrics
- Measures fundamental system resources:
- CPU usage
- Memory utilization
- Time-series data stored in Graft
- Helps determine normal functioning
Application Metrics (Custom)
- Application-specific measurements
- More diagnostic than system metrics
- Examples:
- Function call duration
- Login counts
- Error event frequency
3. Performance Metrics
Application Performance Monitoring (APM)
- Code-level performance instrumentation
- Measures:
- Function execution time
- API call duration
- Database query performance
Real User Monitoring (RUM)
- Front-end instrumentation (e.g., JavaScript page tags)
- Captures actual user experience
- Provides direct insight into customer experience
Tracing
- Tracks requests across multiple services
- Measures duration of each component
- Useful for complex system analysis
4. System and Application Logs
- Provides detailed contextual information
- Answers key questions:
- What happened?
- When did it happen?
- Where did it happen?
- What was involved?
- Use cases:
- Problem detection
- Troubleshooting
- Audit and compliance
- Capacity planning
- Security forensics
5. Security Monitoring
- Utilizes existing logs and metrics
- Focuses on threat detection
- Monitors for:
- Indicators of compromise
- Suspicious endpoints
- Connections from known bad IPs
- Bad configurations
- Unusual behavior
- Example alerts:
- Login failure spikes
- Website injection attempts
- Malformed network requests
Best Practices
- Analyze which monitoring types best support production systems
- Use monitoring data to help development teams improve applications
- Collaborate between operations and development
- Encourage improved custom metrics and logging
- Use production data to drive product improvements
“Monitoring isn’t just for production performance and uptime, it’s also a source of valuable information to developers about how the service is really used out in production.”
Incident Response and Retrospectives
Core Concepts of Incident Response
System Reality
- All systems are sociotechnical systems with humans as part of their resilient operation
- Even with excellent design, development, testing, and monitoring, systems will still experience failures
- Getting good at responding to and remediating problems is a crucial part of the job
Key Activities for Incident Response
Troubleshooting
- Requires in-depth system knowledge
- Need ability to diagnose and remediate problems
Automation
- Having pre-created tooling
- Enables faster and safer information gathering
- Supports remediation activities
Communication
- Often requires team of specialists
- Need to keep business stakeholders informed
- Must update end users on situation
Incident Management Process
- Inspired by Incident Command System (ICS)
- Originally created in 1968 for Northern California wildfires
- Now recommended by UN as international standard
- Key aspects:
- Incident detection and reporting
- Participant coordination
- Custom to organization, team, and product
Post-Incident Analysis
Modern Approach vs Traditional
- Moving away from traditional “root cause analysis”
- Avoiding blame-focused investigations
- Recognition that human error shouldn’t cause major outages
- If it does, system needs improvement
- Systems should be resilient to mistakes
Effective Postmortem Principles
Multiple Causes
- No single root cause
- Consider deficiencies at multiple levels:
- Testing
- Monitoring
- Processes
Blame-Free Analysis
- Understand actions from practitioners’ point of view
- Recognize decisions made with best available information
- Address cognitive biases
- Focus on system improvement
Transparency
- Open communication during incidents
- Clear stakeholder updates
- Honest post-incident reporting
- Builds trust and goodwill
“Real talk moment. Organizations have performed so-called root cause analyses for decades. These are usually a thinly veiled attempt to find somebody to blame for an outage. But if someone making a mistake can cause a major outage, your system itself is terrible and not resilient and it needs to improve.”
Best Practices
- Practice incident response regularly
- Maintain cool head during incidents
- Focus on system improvement rather than blame
- Document and learn from each incident
- Share learnings transparently
DevOps SRE Toolchain Overview
Two Main Components
1. Building for Reliability
- Highly dependent on programming language and tech stack
- Focus on libraries and development techniques rather than tools
- Requires collaboration between dev and ops at design time
- Resources available:
- Technical books
- Libraries (e.g., Java’s Resilience4j)
2. Operational Feedback
- Common set of observability and incident response tools
- Rich ecosystem of options:
- SaaS Solutions:
- Datadog
- Honeycomb
- SumoLogic
- Open Source Tools:
- Nagios
- Grafana
- Prometheus
- Commercial Software:
- Solarwinds
- Splunk
- SaaS Solutions:
Five Key Areas of Observability
- Synthetic checks
- System and application metrics
- End-user performance
- System and application logs
- Security monitoring
Lean Approach to Observability Implementation
Build-Measure-Learn Cycle
Build: Create minimum viable monitoring stack
- Basic endpoint synthetic monitors
- Basic system monitoring
- Performance latency from logs
Measure: Collect metrics from all monitoring areas
Learn:
- Analyze application stack with monitoring
- Identify areas needing more detailed metrics
- Evaluate effectiveness of app logs
- Iterate and improve as needed
“Monitoring you don’t use, that’s waste.”
Best Practices and Considerations
Stakeholder Access
- Make monitoring accessible to:
- Developers
- Product managers
- Business decision makers
Custom Development
- Create custom visualizations when needed
- Focus on making monitoring meaningful to different stakeholders
Incident Response Tools
Popular Solutions:
PagerDuty (SaaS)
- Handles alerts from observability tools
- Manages on-call scheduling
- Provides escalation workflows
Other Options:
- VictorOps
- OpsGenie
Runbook Automation Tools
- Rundeck (Open source, commercial, and SaaS options)
- Ansible Tower
- StackStorm
Status Page Tools
- Atlassian Statuspage
- Status.io
Key Takeaways
- Keep solutions simple
- Consider team collaboration needs
- Iterate and improve based on actual usage
- Focus on specific use cases
- Be prepared to develop custom tooling as needed
7. Advanced Topics
Platform Engineering: The Paved Road
The Challenge of Scale
- Organizations face difficulties in managing:
- Infrastructure as code
- Continuous builds
- Incident response
- Security and compliance
- Key Problem: As value streams multiply, solution diversity can lead to chaos
The Automation Solution
Pioneer Companies
- Organizations that first tackled extreme DevOps scale:
- Netflix
- Meta
- Spotify
- These companies invested in self-service automation
The Paved Road Concept
- Also known as the “golden path”
- Evolution from early DevOps “wilderness trail blazing”
- Creates an opinionated framework for standardized processes
- Benefits:
- Easier adoption
- Shared improvements
- Simplified team transitions between projects
Common Implementation Examples
CI/CD Pipelines
- Automated check-in hooks
- Automatic test runs on pull requests
- Automated test deployments
Self-Service Platforms
- Cloud account provisioning
- HPC cluster setup for machine learning
- Built-in security guidance
- Automated compliance
Platform Engineering Evolution
Definition
“Platform engineering is the discipline of designing and building tool chains and workflows that enable self-service capabilities for software engineering organizations.”
Components
- Development environment
- Testing capabilities
- Deployment automation
- Infrastructure creation
- Observability
- Security
- Runtime environment
- Scaling
- Service discovery
Success Factors
1. Product Management Approach
- Platforms must serve users, not creators
- Key principles:
- Focus on user requirements
- Ensure product quality
- Market the platform internally
- Keep usage voluntary
2. Lean Implementation
- Avoid over-building platforms
- Follow the progression:
- Blaze the trail
- Pave the road
- Build the train
- Focus on actual user needs
- Maintain flexibility for innovation
Warning Signs vs. Good Practices
Warning Signs
- Centralized control focus
- Mandatory usage
- Optimization for central team needs
- Excessive upfront building
Good Practices
- Global system optimization
- Value stream focus
- User-centric design
- Incremental development
- Flexibility for innovation
Key Differentiator
The main difference between modern platforms and traditional centralized IT is the focus on:
- User empowerment
- Value optimization
- Flexibility
- Continuous improvement based on actual needs
DevSecOps: Security in the DevOps Way
Traditional Security Challenges
- Historical tension between security and technical teams
- Security originally handled by sysadmins and developers
- InfoSec specialization created new silos
- Typical staffing ratio problem:
- 100 developers
- 10 operations staff
- 1 security person
Common Issues
- Security teams have different priorities
- Focus often compliance-oriented
- Appears as “busy work” to development teams
- Security teams understaffed and downstream
- Developers care about security but lack:
- Time
- Clear direction from security teams
DevSecOps Introduction
“If security introduces blocking to the organization, it will be ignored, not embraced.” - Zane Lackey and Rich Smith (Etsy)
CAMS Framework with Security Lens
1. Culture
- Security works alongside developers
- Avoid creating blocking gates
- Prevent value stream from routing around security
2. Automation
Shifting Left Concept
- Introduce security earlier in development
- Implement security tools in:
- IDE
- CI systems
- Warning: Avoid common pitfalls
- Don’t dump security work on developers
- Prevent bloated build times
- Avoid forcing developers to parse complex security tools
- Focus on minimal impact on cycle time
3. Sharing
- Build bridges between teams
- Create security champions program
- Methods to identify champions:
- Host Capture the Flag events
- Search code repos for security bug fixers
- Ask for volunteers
- Methods to identify champions:
- Benefits:
- Security team trains champions
- Champions help understand team concerns
- Improves communication between teams
4. Measurement
- Establish security observability
- Create joint team goals
- Avoid FUD (Fear, Uncertainty, Doubt) approach
- Focus on metrics that matter
Key Takeaways
- Security is critical regardless of terminology
- Modern approaches focus on integration
- DevSecOps bridges gap between security and development
- Success requires balance between security needs and development efficiency
Kubernetes and Cloud Native Overview
What is Kubernetes?
- An open-source container orchestration system that automates:
- Software deployment
- Scaling
- Management
- Provides a platform for running containerized applications
Key Benefits
Automation and Features
- Automates infrastructure plumbing
- Provides standardized management features:
- Observability
- Service discovery
- Health monitoring
- Custom networking
- Developers get built-in capabilities without additional development
Infrastructure Abstraction
- Manages compute, networking, and storage
- Enables multi-cloud deployment
- Standardizes deployment across:
- On-premise environments
- Different cloud providers
- Simple deployment process:
- Containerize application
- Specify redundancy requirements
- Deploy across cluster nodes
- Expose API
Cloud Native Computing Foundation (CNCF)
Understanding “Cloud Native”
- Definition: Essentially means “Kubernetes add-on”
- Not limited to cloud environments
- Large ecosystem of tools and products
- CNCF maintains an interactive tool landscape
Challenges and Considerations
1. Complexity
- Highly configurable with numerous options
- 20+ choices for network backplane alone
- Requires integration of multiple tools
- Complex upgrades and interoperability
- Steep learning curve
2. Resource Requirements
- Significant costs:
- Base 3 server cluster can cost hundreds of dollars monthly
- Requires dedicated administration team
- Not suitable for lightweight management by dev teams
3. Implementation Risks
- Can work against DevOps goals if not carefully managed
- Potential creation of silos
- Risk of increased waste
- Requires systems thinking and CAMS values alignment
Best Practices
- Start simple
- Add complexity only when necessary
- Ensure thorough understanding of platform behavior
- Consider alternatives:
- Serverless solutions
- Lighter container orchestration
- Can provide 80% of benefits with 20% of effort
“Kubernetes is a good tool to learn about and a very popular tool. Just make sure it’s the right tool for the job at hand.”
Chaos Engineering in DevOps
Introduction to Chaos Engineering
- Definition: The discipline of experimenting on a system to build confidence in its capability to withstand real-world production conditions
- Core Concept: Creating deliberate adversity for systems to test and improve resilience
Netflix’s Chaos Monkey
Origin Story
- Netflix developed Chaos Monkey while running large cloud clusters for millions of users
- Goal: Build systems resilient to component failures (servers, network links, etc.)
Why It Was Needed
- Traditional CICD testing proved insufficient
- Couldn’t replicate the full complexity of:
- Thousands of interconnected components
- Real production environment
- Interactions with millions of users
How It Works
- Chaos Monkey deliberately breaks components in the production system
- Purpose: Forces technical teams to ensure system resilience
- Resulted in continuous improvement of system resilience through controlled failures
Modern Chaos Engineering Practices
Controlled Testing Approach
- Not random destruction, but structured experiments
- Requires proper fault testing during development
- Validates automatic fault remediation
- Tests human intervention scenarios
Game Days
- Structured activities for testing incident response
- Either emulates or creates real faults
- Tests the human component of sociotechnical systems
- Validates incident response procedures
Kubernetes Chaos Engineering Example
Testing Scenarios
- Multiple failure points to test:
- Server failures
- Application container failures
- Network issues
- Container repository problems
- DNS failures
Key Learnings
- System behavior often differs from assumptions
- Important aspects to monitor:
- Recovery capability
- Recovery speed
- System behavior during reconnection
- Impact on human operators
Benefits and Philosophy
- Similar to automobile crash testing methodology
- Promotes innovative thinking
- Breaks traditional constraints
- Supports becoming a learning organization
- Aligns with DevOps culture and three ways
- Emphasizes learning through feedback loops
“IT systems aren’t binary, at least not above the chip level.”
Best Practices
- Conduct thorough development testing first
- Create structured experiments
- Test both automated and human responses
- Monitor and learn from results
- Apply learnings to improve system resilience
MLOps: DevOps for Machine Learning Systems
Introduction to MLOps
- Definition: Combination of Machine Learning and DevOps practices
- Current Context:
- ML has historically been used mainly by:
- Large social media companies
- Scientists and engineers
- Recent boom in generative AI has led to widespread adoption
- Business value often requires:
- Private data handling
- Training private ML models
- ML has historically been used mainly by:
Key Differences from Traditional DevOps
1. User Base Characteristics
- Primary Users: Data scientists (vs. developers)
- Often less familiar with computer systems
- Work is tightly coupled with hardware
- Requires close collaboration and empathetic support
2. Additional Components to Manage
- Beyond traditional code and infrastructure:
- Data versioning
- Model management
- Massive datasets
- ML models (pattern-finding algorithms)
3. Infrastructure Requirements
- Training Workloads:
- Intensive batch jobs
- Run on HPC (High Performance Computing) clusters
- Characteristics:
- Highly optimized systems
- Integrated compute storage and network
- GPU utilization
- Typically expensive
4. Results Tracking and Governance
- Different from Traditional Testing:
- AI systems don’t provide single correct answers
- Varying output quality
- Continuous modification for improvement
- Rich feedback loop beyond pass/fail testing
Production Aspects
1. Inference and Training
- Initial training followed by inference in production
- Continuous learning from user input
- Need to detect drift in AI predictions
- Vector databases:
- Large scale storage
- Growing costs with long-term inference memory
2. Development Value Stream
Three parallel CICD processes for:
- Software
- Infrastructure
- Data and models
Key Participants
- Developers
- Operations teams
- Data scientists
Success Factors
- Application of core DevOps concepts:
- Automation
- Measurement
- Continuous monitoring
- Collaborative approach
Future Outlook
- AI becoming a fundamental computing pattern
- Increasing need for DevOps professionals to understand MLOps
- Growing importance in business operations
“AI is here to stay as a fundamental computing pattern and workload, so more and more DevOps professionals will need to understand it in the future.”
AIOps: AI Integration in DevOps Work
Key Principles
- AI doesn’t replace engineers
- Engineers remain responsible for evaluating and testing AI outputs
- Cannot blindly implement AI-generated code without proper validation
Why AI in DevOps?
Current Challenges
- Too many tools and vendors
- Inconsistent documentation
- Complex architectures
- Excessive specialized knowledge requirements
- Cognitive overload from context switching
Practical Applications
1. API and Code Work
- Assists with command-line operations
- Creates integration scripts
- Helps with data transformation
- Facilitates webhook interactions
2. Natural Language Queries
- Prompt Engineering Tips:
- Set context (e.g., “respond as a DevOps engineer with Linux experience”)
- Request explanations at different complexity levels
- Frame questions effectively for better results
3. Code Management
- Code refactoring
- Documentation writing
- Pipeline documentation
- Language conversion
Example: Rob Hirschfeld’s approach of converting Terraform to AWS CLI and Bash
4. Monitoring and Security
- Enhanced detection and alerting
- Automated remediation recommendations
- Security testing and code review
- Faster than traditional methods
- Improved accuracy
- Acts as an “automated security buddy”
Future Evolution: The Three Waves
Wave 1: Code Generation
- AI-assisted coding
- Automated code reviews
- Test generation
- Documentation
- Tools: GitHub Copilot, AWS Code Whisperer
Wave 2: Systems Management
- Advanced alerting
- Monitoring cluster health
- Automated runbooks
- System explanation
- Example: k8sgpt for Kubernetes system state explanation
Wave 3: Human Integration
- Self-service functionality
- Enhanced cross-team collaboration
- Platform accessibility improvements
- Better business integration
Conclusion
AIOps will:
- Make DevOps more approachable
- Improve business integration
- Transform work methods without replacing human expertise
- Enhance efficiency and accessibility across organizations
8. DevOps Career
DevOps Career Guide and Resources
Career Perspectives in DevOps
DevOps as a Mindset
- DevOps is not just a job title but a mindset and suite of practices
- Applicable across various technical roles
- Focuses on improving technology organization results
Role-Specific Applications
For Developers:
- Can remain developers while incorporating DevOps principles
- Focus on building reliable applications
- Better understanding of build and testing structures
- Improved instrumentation for production environments
For System/IT Administrators:
- May be titled “DevOps Engineer”
- Key skills include:
- Infrastructure as code
- System reliability design
- Observability platform implementation
- Runbook automation
Specialized DevOps Roles:
- Platform Engineers
- Automation Experts
- Build Engineers/Release Managers
- Site Reliability Engineers (SREs)
- Specialized positions in large organizations:
- Incident Managers
- Application Performance Management Teams
Beyond Technical Roles
- Security Engineers → DevSecOps Engineers
- Applicable to non-technical roles:
- Sales
- Marketing
- Product Management
- Executive positions
Learning Resources
Top 10 DevOps Books
- DevOps Handbook
- Accelerate
- The Phoenix Project
- Continuous Delivery
- Site Reliability Engineering Book
- Infrastructure as Code
- Release It!
- The Practice of Cloud System Administration
- Visible Ops
- Lean Software Development
Online Resources
- Weekly newsletters
- Notable websites:
- Martin Fowler’s articles
- Julia Evans’ technical zines and comics
Certifications
- Technology-specific certifications:
- AWS Cloud
- HashiCorp
- Kubernetes
- Cloud Native
- DevOps Institute certifications
- University-run DevOps boot camps
Conferences and Events
- DevOps Enterprise Summit (US and Europe)
- DevOpsDays (50+ global events in 2023)
- All Day DevOps (24-hour online conference)
Personal Development Path
Creating Your Learning Journey
- Consider your target role and current position
- Design a learning path based on:
- Current skills
- Career goals
- Desired specialization
Core Technical Skills
- Operating systems
- Programming languages
- Cloud technologies
- Containerization
Additional Learning Resources
- DevOps Foundations curriculum
- Specialized courses:
- Lean and Agile
- Infrastructure as Code
- CICD
- Site Reliability Engineering
- DevSecOps
- Observability
- Incident Management
- DevOps Management
- DevOps Anti-patterns
Best Practices for Learning
- Gain hands-on experience
- Utilize continuous learning principles
- Engage with feedback loops
- Connect with DevOps community
- Participate in course Q&A
- Network on LinkedIn