Reliable Machine Learning
Applying SRE Principles to ML in Production
Paperback | English | 2022 | 1st edition | ISBN 9781098106225

Summary
Whether you're part of a small startup or a multinational corporation, this practical book shows data scientists, software and site reliability engineers, product managers, and business owners how to establish and run ML reliably, effectively, and accountably within your organization. You'll gain insight into everything from how to do model monitoring in production to how to run a well-tuned model development team in a product organization.
By applying an SRE mindset to machine learning, authors and engineering professionals Cathy Chen, Kranti Parisa, Niall Richard Murphy, D. Sculley, Todd Underwood, and featured guest authors show you how to run an efficient and reliable ML system. Whether you want to increase revenue, optimize decision making, solve problems, or understand and influence customer behavior, you'll learn how to perform day-to-day ML tasks while keeping the bigger picture in mind.
You'll examine:
- What ML is: how it functions and what it relies on
- Conceptual frameworks for understanding how ML "loops" work
- How effective productionization can make your ML systems easily monitorable, deployable, and operable
- Why ML systems make production troubleshooting more difficult, and how to compensate accordingly
- How ML, product, and production teams can communicate effectively
Table of Contents
Preface
Why We Wrote This Book
SRE as the Lens on ML
Intended Audience
How This Book Is Organized
Our Approach
Let's Knit!
Navigating This Book
About the Authors
Conventions Used in This Book
O'Reilly Online Learning
How to Contact Us
Acknowledgments
Cathy Chen
Niall Richard Murphy
Kranti Parisa
D. Sculley
Todd Underwood
1. Introduction
The ML Lifecycle
Data Collection and Analysis
ML Training Pipelines
Build and Validate Applications
Quality and Performance Evaluation
Defining and Measuring SLOs
Launch
Monitoring and Feedback Loops
Lessons from the Loop
2. Data Management Principles
Data as Liability
The Data Sensitivity of ML Pipelines
Phases of Data
Creation
Ingestion
Processing
Storage
Management
Analysis and Visualization
Data Reliability
Durability
Consistency
Version Control
Performance
Availability
Data Integrity
Security
Privacy
Policy and Compliance
Conclusion
3. Basic Introduction to Models
What Is a Model?
A Basic Model Creation Workflow
Model Architecture Versus Model Definition Versus Trained Model
Where Are the Vulnerabilities?
Training Data
Labels
Training Methods
Infrastructure and Pipelines
Platforms
Feature Generation
Upgrades and Fixes
A Set of Useful Questions to Ask About Any Model
An Example ML System
Yarn Product Click-Prediction Model
Features
Labels for Features
Model Updating
Model Serving
Common Failures
Conclusion
4. Feature and Training Data
Features
Feature Selection and Engineering
Lifecycle of a Feature
Feature Systems
Labels
Human-Generated Labels
Annotation Workforces
Measuring Human Annotation Quality
An Annotation Platform
Active Learning and AI-Assisted Labeling
Documentation and Training for Labelers
Metadata
Metadata Systems Overview
Dataset Metadata
Feature Metadata
Label Metadata
Pipeline Metadata
Data Privacy and Fairness
Privacy
Fairness
Conclusion
5. Evaluating Model Validity and Quality
Evaluating Model Validity
Evaluating Model Quality
Offline Evaluations
Evaluation Distributions
A Few Useful Metrics
Operationalizing Verification and Evaluation
Conclusion
6. Fairness, Privacy, and Ethical ML Systems
Fairness (a.k.a. Fighting Bias)
Definitions of Fairness
Reaching Fairness
Fairness as a Process Rather than an Endpoint
A Quick Legal Note
Privacy
Methods to Preserve Privacy
A Quick Legal Note
Responsible AI
Explanation
Effectiveness
Social and Cultural Appropriateness
Responsible AI Along the ML Pipeline
Use Case Brainstorming
Data Collection and Cleaning
Model Creation and Training
Model Validation and Quality Assessment
Model Deployment
Products for the Market
Conclusion
7. Training Systems
Requirements
Basic Training System Implementation
Features
Feature Store
Model Management System
Orchestration
Quality Evaluation
Monitoring
General Reliability Principles
Most Failures Will Not Be ML Failures
Models Will Be Retrained
Models Will Have Multiple Versions (at the Same Time!)
Good Models Will Become Bad
Data Will Be Unavailable
Models Should Be Improvable
Features Will Be Added and Changed
Models Can Train Too Fast
Resource Utilization Matters
Utilization != Efficiency
Outages Include Recovery
Common Training Reliability Problems
Data Sensitivity
Example Data Problem at YarnIt
Reproducibility
Example Reproducibility Problem at YarnIt
Compute Resource Capacity
Example Capacity Problem at YarnIt
Structural Reliability
Organizational Challenges
Ethics and Fairness Considerations
Conclusion
8. Serving
Key Questions for Model Serving
What Will Be the Load to Our Model?
What Are the Prediction Latency Needs of Our Model?
Where Does the Model Need to Live?
What Are the Hardware Needs for Our Model?
How Will the Serving Model Be Stored, Loaded, Versioned, and Updated?
What Will Our Feature Pipeline for Serving Look Like?
Model Serving Architectures
Offline Serving (Batch Inference)
Online Serving (Online Inference)
Model as a Service
Serving at the Edge
Choosing an Architecture
Model API Design
Testing
Serving for Accuracy or Resilience?
Scaling
Autoscaling
Caching
Disaster Recovery
Ethics and Fairness Considerations
Conclusion
9. Monitoring and Observability for Models
What Is Production Monitoring and Why Do It?
What Does It Look Like?
The Concerns That ML Brings to Monitoring
Reasons for Continual ML Observability in Production
Problems with ML Production Monitoring
Difficulties of Development Versus Serving
A Mindset Change Is Required
Best Practices for ML Model Monitoring
Generic Pre-serving Model Recommendations
Training and Retraining
Model Validation (Before Rollout)
Serving
Other Things to Consider
High-Level Recommendations for Monitoring Strategy
Conclusion
10. Continuous ML
Anatomy of a Continuous ML System
Training Examples
Training Labels
Filtering Out Bad Data
Feature Stores and Data Management
Updating the Model
Pushing Updated Models to Serving
Observations About Continuous ML Systems
External World Events May Influence Our Systems
Models Can Influence Their Own Training Data
Temporal Effects Can Arise at Several Timescales
Emergency Response Must Be Done in Real Time
New Launches Require Staged Ramp-ups and Stable Baselines
Models Must Be Managed Rather Than Shipped
Continuous Organizations
Rethinking Noncontinuous ML Systems
Conclusion
11. Incident Response
Incident Management Basics
Life of an Incident
Incident Response Roles
Anatomy of an ML-Centric Outage
Terminology Reminder: Model
Story Time
Story 1: Searching but Not Finding
Story 2: Suddenly Useless Partners
Story 3: Recommend You Find New Suppliers
ML Incident Management Principles
Guiding Principles
Model Developer or Data Scientist
Software Engineer
ML SRE or Production Engineer
Product Manager or Business Leader
Special Topics
Production Engineers and ML Engineering Versus Modeling
The Ethical On-Call Engineer Manifesto
Conclusion
12. How Product and ML Interact
Different Types of Products
Agile ML?
ML Product Development Phases
Discovery and Definition
Business Goal Setting
MVP Construction and Validation
Model and Product Development
Deployment
Support and Maintenance
Build Versus Buy
Models
Data Processing Infrastructure
End-to-End Platforms
Scoring Approach for Making the Decision
Making the Decision
Sample YarnIt Store Features Powered by ML
Showcasing Popular Yarns by Total Sales
Recommendations Based on Browsing History
Cross-selling and Upselling
Content-Based Filtering
Collaborative Filtering
Conclusion
13. Integrating ML into Your Organization
Chapter Assumptions
Leader-Based Viewpoint
Detail Matters
ML Needs to Know About the Business
The Most Important Assumption You Make
The Value of ML
Significant Organizational Risks
ML Is Not Magic
Mental (Way of Thinking) Model Inertia
Surfacing Risk Correctly in Different Cultures
Siloed Teams Don't Solve All Problems
Implementation Models
Remembering the Goal
Greenfield Versus Brownfield
ML Roles and Responsibilities
How to Hire ML Folks
Organizational Design and Incentives
Strategy
Structure
Processes
Rewards
People
A Note on Sequencing
Conclusion
14. Practical ML Org Implementation Examples
Scenario 1: A New Centralized ML Team
Background and Organizational Description
Process
Rewards
People
Default Implementation
Scenario 2: Decentralized ML Infrastructure and Expertise
Background and Organizational Description
Process
Rewards
People
Default Implementation
Scenario 3: Hybrid with Centralized Infrastructure/Decentralized Modeling
Background and Organizational Description
Process
Rewards
People
Default Implementation
Conclusion
15. Case Studies: MLOps in Practice
1. Accommodating Privacy and Data Retention Policies in ML Pipelines
Background
Problem and Resolution
Takeaways
2. Continuous ML Model Impacting Traffic
Background
Problem and Resolution
Takeaways
3. Steel Inspection
Background
Problem and Resolution
Takeaways
4. NLP MLOps: Profiling and Staging Load Test
Background
Problem and Resolution
Takeaways
5. Ad Click Prediction: Databases Versus Reality
Background
Problem and Resolution
Takeaways
6. Testing and Measuring Dependencies in ML Workflow
Background
Problem and Resolution
Takeaways
Index
About the Authors