Description
Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, 1st Edition, by Ron Kohavi, Diane Tang, and Ya Xu; ISBN-13: 978-1108724265
[PDF eBook eTextbook] – Available Instantly
- Publisher: Cambridge University Press
- Publication date: April 2, 2020
- Edition: 1st
- Language: English
- 290 pages
- ISBN-10: 1108724264
- ISBN-13: 978-1108724265
Getting numbers is easy; getting numbers you can trust is hard. This practical guide by experimentation leaders at Google, LinkedIn, and Microsoft will teach you how to accelerate innovation using trustworthy online controlled experiments, or A/B tests.
Based on practical experiences at companies that each run more than 20,000 controlled experiments a year, the authors share examples, pitfalls, and advice for students and industry professionals getting started with experiments, plus deeper dives into advanced topics for practitioners who want to improve the way they make data-driven decisions. Learn how to:
- Use the scientific method to evaluate hypotheses using controlled experiments.
- Define key metrics and ideally an Overall Evaluation Criterion (OEC).
- Test for trustworthiness of the results and alert experimenters to violated assumptions (a minimal example of one such check follows this list).
- Build a scalable platform that lowers the marginal cost of experiments close to zero.
- Avoid pitfalls like carryover effects and Twyman's law.
- Understand how statistical issues play out in practice.
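To give a flavor of what such a trustworthiness check looks like in practice, here is a minimal sketch of a sample ratio mismatch (SRM) test, one of the guardrails the book covers later (see Chapter 21 in the table of contents below). It is illustrative only: the assignment counts are hypothetical, Python with scipy is assumed, and a real experimentation platform would compute this automatically from its assignment logs.

```python
# Illustrative sample ratio mismatch (SRM) check for a designed 50/50 split.
# The user counts below are hypothetical.
from scipy.stats import chisquare

control_users = 50_126    # users assigned to control (hypothetical)
treatment_users = 49_512  # users assigned to treatment (hypothetical)

total = control_users + treatment_users
expected = [total * 0.5, total * 0.5]  # expected counts under the designed split

# Chi-squared goodness-of-fit test: a very small p-value signals that the
# observed split deviates from the design, so the results should not be
# trusted until the imbalance is explained.
stat, p_value = chisquare([control_users, treatment_users], f_exp=expected)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.4f}")
if p_value < 0.001:
    print("Possible sample ratio mismatch - investigate before interpreting results.")
```

The same pattern generalizes to any designed traffic split by adjusting the expected counts; the 0.001 cutoff here is simply an illustrative, deliberately strict threshold.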
Table of Contents:
Preface
Acknowledgments
Part I Introductory Topics for Everyone
1 Introduction and Motivation
Online Controlled Experiments Terminology
Why Experiment? Correlations, Causality, and Trustworthiness
Necessary Ingredients for Running Useful Controlled Experiments
Tenets
Tenet 1: The Organization Wants to Make Data-Driven Decisions and Has Formalized an OEC
Tenet 2: The Organization Is Willing to Invest in the Infrastructure and Tests to Run Controlled Experiments
Tenet 3: The Organization Recognizes That It Is Poor at Assessing the Value of Ideas
Improvements over Time
Google Ads Example
Bing Relevance Example
Bing Ads Example
Examples of Interesting Online Controlled Experiments
UI Example: 41 Shades of Blue
Making an Offer at the Right Time
Personalized Recommendations
Speed Matters a LOT
Malware Reduction
Backend Changes
Strategy, Tactics, and Their Relationship to Experiments
Scenario 1: You Have a Business Strategy and You Have a Product with Enough Users to Experiment
Scenario 2: You Have a Product, You Have a Strategy, but the Results Suggest That You Need to Consider Pivoting
Additional Reading
2 Running and Analyzing Experiments: An End-to-End Example
Setting up the Example
Hypothesis Testing: Establishing Statistical Significance
Designing the Experiment
Running the Experiment and Getting Data
Interpreting the Results
From Results to Decisions
3 Twyman’s Law and Experimentation Trustworthiness
Misinterpretation of the Statistical Results
Lack of Statistical Power
Misinterpreting p-values
Peeking at p-values
Multiple Hypothesis Tests
Confidence Intervals
Threats to Internal Validity
Violations of SUTVA
Survivorship Bias
Intention-to-Treat
Sample Ratio Mismatch (SRM)
Threats to External Validity
Primacy Effects
Novelty Effects
Detecting Primacy and Novelty Effects
Segment Differences
Segmented View of a Metric
Segmented View of the Treatment Effect (Heterogeneous Treatment Effect)
Analysis by Segments Impacted by Treatment Can Mislead
Simpson’s Paradox
Encourage Healthy Skepticism
4 Experimentation Platform and Culture
Experimentation Maturity Models
Leadership
Process
Build vs. Buy
Can an External Platform Provide the Functionality You Need?
What Would the Cost Be to Build Your Own?
What’s the Trajectory of Your Experimentation Needs?
Do You Need to Integrate into Your System’s Configuration and Deployment Methods?
Infrastructure and Tools
Experiment Definition, Set-up, and Management
Experiment Deployment
Experiment Instrumentation
Scaling Experimentation: Digging into Variant Assignment
Single-Layer Method
Concurrent Experiments
Experimentation Analytics
Part II Selected Topics for Everyone
5 Speed Matters: An End-to-End Case Study
Key Assumption: Local Linear Approximation
How to Measure Website Performance
The Slowdown Experiment Design
Impact of Different Page Elements Differs
Extreme Results
6 Organizational Metrics
Metrics Taxonomy
Formulating Metrics: Principles and Techniques
Evaluating Metrics
Evolving Metrics
Additional Resources
SIDEBAR: Guardrail Metrics
SIDEBAR: Gameability
7 Metrics for Experimentation and the Overall Evaluation Criterion
From Business Metrics to Metrics Appropriate for Experimentation
Combining Key Metrics into an OEC
Example: OEC for E-mail at Amazon
Example: OEC for Bing’s Search Engine
Goodhart’s Law, Campbell’s Law, and the Lucas Critique
8 Institutional Memory and Meta-Analysis
What Is Institutional Memory?
Why Is Institutional Memory Useful?
9 Ethics in Controlled Experiments
Background
Risk
Benefits
Provide Choices
Data Collection
Culture and Processes
SIDEBAR: User Identifiers
Part III Complementary and Alternative Techniques to Controlled Experiments
10 Complementary Techniques
The Space of Complementary Techniques
Logs-based Analysis
Human Evaluation
User Experience Research (UER)
Focus Groups
Surveys
External Data
Putting It All Together
11 Observational Causal Studies
When Controlled Experiments Are Not Possible
Designs for Observational Causal Studies
Interrupted Time Series
Interleaved Experiments
Regression Discontinuity Design
Instrumental Variables (IV) and Natural Experiments
Propensity Score Matching
Difference in Differences
Pitfalls
SIDEBAR: Refuted Observational Causal Studies
Part IV Advanced Topics for Building an Experimentation Platform
12 Client-Side Experiments
Differences between Server and Client Side
Difference #1: Release Process
Difference #2: Data Communication between Client and Server
Implications for Experiments
Implication #1: Anticipate Changes Early and Parameterize
Implication #2: Expect a Delayed Logging and Effective Starting Time
Implication #3: Create a Failsafe to Handle Offline or Startup Cases
Implication #4: Triggered Analysis May Need Client-Side Experiment Assignment Tracking
Implication #5: Track Important Guardrails on Device and App Level Health
Implication #6: Monitor Overall App Release through Quasi-experimental Methods
Implication #7: Watch Out for Multiple Devices/Platforms and Interactions between Them
Conclusions
13 Instrumentation
Client-Side vs. Server-Side Instrumentation
Processing Logs from Multiple Sources
Culture of Instrumentation
14 Choosing a Randomization Unit
Randomization Unit and Analysis Unit
User-level Randomization
15 Ramping Experiment Exposure: Trading Off Speed, Quality, and Risk
What Is Ramping?
SQR Ramping Framework
Four Ramp Phases
Ramp Phase One: Pre-MPR
Ramp Phase Two: MPR
Ramp Phase Three: Post-MPR
Ramp Phase Four: Long-Term Holdout or Replication
Post Final Ramp
16 Scaling Experiment Analyses
Data Processing
Data Computation
Results Summary and Visualization
Part V Advanced Topics for Analyzing Experiments
17 The Statistics behind Online Controlled Experiments
Two-Sample t-Test
p-Value and Confidence Interval
Normality Assumption
Type I/II Errors and Power
Bias
Multiple Testing
Fisher’s Meta-analysis
18 Variance Estimation and Improved Sensitivity: Pitfalls and Solutions
Common Pitfalls
Delta vs. Delta %
Ratio Metrics: When Analysis Unit Is Different from Experiment Unit
Outliers
Improving Sensitivity
Variance of Other Statistics
19 The A/A Test
Why A/A Tests?
Example 1: Analysis Unit Differs from Randomization Unit
Example 2: Optimizely Encouraged Stopping When Results Were Statistically Significant
Example 3: Browser Redirects
Example 4: Unequal Percentages
Example 5: Hardware Differences
How to Run A/A Tests
When the A/A Test Fails
20 Triggering for Improved Sensitivity
Examples of Triggering
Example 1: Intentional Partial Exposure
Example 2: Conditional Exposure
Example 3: Coverage Increase
Example 4: Coverage Change
Example 5: Counterfactual Triggering for Machine Learning Models
A Numerical Example (Kohavi, Longbotham et al. 2009)
Optimal and Conservative Triggering
Overall Treatment Effect
Example 1
Example 2
Trustworthy Triggering
Common Pitfalls
Pitfall 1: Experimenting on Tiny Segments That Are Hard to Generalize
Pitfall 2: A Triggered User Is Not Properly Triggered for the Remaining Experiment Duration
Pitfall 3: Performance Impact of Counterfactual Logging
Open Questions
Question 1: Triggering Unit
Question 2: Plotting Metrics over Time
21 Sample Ratio Mismatch and Other Trust-Related Guardrail Metrics
Sample Ratio Mismatch
Scenario 1
Scenario 2
SRM Causes
Debugging SRMs
Other Trust-Related Guardrail Metrics
22 Leakage and Interference between Variants
Examples
Direct Connections
Indirect Connections
Some Practical Solutions
Rule-of-Thumb: Ecosystem Value of an Action
Isolation
Edge-Level Analysis
Detecting and Monitoring Interference
23 Measuring Long-Term Treatment Effects
What Are Long-Term Effects?
Reasons the Treatment Effect May Differ between Short-Term and Long-Term
Why Measure Long-Term Effects?
Long-Running Experiments
Alternative Methods for Long-Running Experiments
Method #1: Cohort Analysis
Method #2: Post-Period Analysis
Method #3: Time-Staggered Treatments
Method #4: Holdback and Reverse Experiment
References
Index
Ron Kohavi is a VP and Technical Fellow at Airbnb. He was previously a Technical Fellow and Corporate VP at Microsoft. Prior to Microsoft, he was the director of data mining and personalization at Amazon.com. He has a PhD in Computer Science from Stanford University. His papers have over 40,000 citations, and three are among the top 1,000 most-cited papers in Computer Science.
Diane Tang is a Google Fellow with expertise in large-scale data analysis and infrastructure, online controlled experiments, and ads systems. She has an AB from Harvard and an MS/PhD from Stanford, and holds patents and publications in mobile networking, information visualization, experiment methodology, data infrastructure, data mining, and large-scale data.
Ya Xu heads Data Science and Experimentation at LinkedIn, where she has helped make the company one of the most highly regarded in A/B testing. Before LinkedIn, she worked at Microsoft and received a PhD in Statistics from Stanford University. She is widely regarded as one of the premier scientists, practitioners, and thought leaders in experimentation, with several patents and publications. She is also a frequent speaker at top conferences, universities, and companies across the country.