AB Testing
Workflow
- know the business goal
- define goal metrics
- Unit of diversion (randomization unit) - page views, users, or cookies
- Population - run the experiment only on the population that will be affected (for a checkout feature, use users who initiate the checkout process instead of all users)
- Sample size - based on the baseline value of the metric, the practical significance level (the smallest change worth detecting), the statistical significance level, and the power
- Duration - consider usage patterns, the business cycle, and the novelty effect. If the experiment is risky or not reversible, start with a very small share of traffic and run it for longer
- Assignment - how to split control and treatment (random? Network effect?)
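- sanity check the setup (for example, that the group sizes match the intended split), then check whether the result is statistically significant, that is, whether we can reject the null hypothesis and conclude there is a difference between control and treatment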
- launch or not – trade off
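The sample size step above can be sketched in code. This is a rough illustration, assuming the goal metric is a conversion rate compared with a two-sided two-proportion z-test; the function name and the example numbers (10% baseline, 2 percentage point practical significance level) are hypothetical and not from these notes.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline, min_effect, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-proportion z-test.

    baseline   -- conversion rate in control
    min_effect -- practical significance level (absolute change worth detecting)
    alpha      -- significance level
    power      -- desired statistical power (1 - beta)
    """
    p1 = baseline
    p2 = baseline + min_effect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / min_effect ** 2)


# Hypothetical inputs: 10% baseline conversion, detect an absolute change of 2 points.
n = sample_size_per_group(baseline=0.10, min_effect=0.02)
print(f"users needed per group: {n}")
```

Halving the practical significance level roughly quadruples the required sample, which is why that input usually drives the duration of the experiment.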
Misinterpretation of the statistical power
Lack of statistical power
- In Null Hypothesis Significance Testing (NHST) we assume there is no difference in the metric value between the control and treatment groups, and reject that assumption only when the data give strong evidence against it. An underpowered experiment can fail to reject it even when a real difference exists
- if an experiment only impacts a small subset of the population, even a large effect on those users can be diluted and go undetected in the overall metric
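A small calculation shows how strong the dilution can be. This is a sketch under simplifying assumptions: a continuous metric with the same standard deviation in every group, a two-sided z-test, and made-up numbers (5% of users affected, a lift of 1.0, standard deviation 10.0). The helper `users_per_group` is hypothetical.

```python
from statistics import NormalDist


def users_per_group(effect, std_dev, alpha=0.05, power=0.8):
    """Approximate users per group needed to detect a mean difference `effect`."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * (z * std_dev / effect) ** 2


effect_on_affected = 1.0  # hypothetical lift among affected users
std_dev = 10.0            # assumed equal for all users (a simplification)
affected_share = 0.05     # only 5% of users ever see the change

# Averaged over everyone, the lift shrinks by the affected share.
diluted_effect = affected_share * effect_on_affected

n_targeted = users_per_group(effect_on_affected, std_dev)  # analyze affected users only
n_overall = users_per_group(diluted_effect, std_dev)       # analyze all users

# Traffic that must pass through the experiment to collect n_targeted affected users.
traffic_for_targeted = n_targeted / affected_share

print(f"affected users needed per group: {n_targeted:,.0f}")
print(f"all-user sample needed per group: {n_overall:,.0f}")
print(f"traffic ratio, overall vs targeted: {n_overall / traffic_for_targeted:.0f}x")
```

Under these assumptions the required sample grows with the inverse square of the affected share, which is the argument for restricting the analysis population to users who can actually be affected.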
Misinterpretation of P-values
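- the p-value DOES NOT represent the probability that the average metric value in Control is different from the average metric value in Treatment. It is the probability of observing a difference at least as extreme as the one measured, assuming the null hypothesis is true
- in other words, a p-value by itself does not show that the two groups differ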
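One way to see what a p-value does measure is to simulate A/A experiments, where both groups share the same true conversion rate. The sketch below is illustrative only: the rate, group size, number of runs, and the pooled two-proportion z-test are assumptions chosen for the example.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(0)   # fixed seed so the run is repeatable
true_rate = 0.10         # identical in both groups: the null hypothesis holds
users_per_group = 500
n_experiments = 1000

false_alarms = 0
for _ in range(n_experiments):
    conv_a = sum(rng.random() < true_rate for _ in range(users_per_group))
    conv_b = sum(rng.random() < true_rate for _ in range(users_per_group))
    p = two_proportion_p_value(conv_a, users_per_group, conv_b, users_per_group)
    if p < 0.05:
        false_alarms += 1

false_alarm_rate = false_alarms / n_experiments
print(f"share of A/A experiments with p < 0.05: {false_alarm_rate:.3f}")
```

The share should land near the 5% significance level even though no difference exists, so a small p-value describes how surprising the data are under the null hypothesis, not how likely it is that the groups differ.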
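Peeking at p-values
- checking the p-value repeatedly and stopping at the first significant result inflates the false positive rate
- use sequential tests with always-valid p-values if results must be monitored continuously
- otherwise, fix the experiment duration in advance, such as one week, long enough to detect the minimum effect of interest, and evaluate significance only at the end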
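The cost of peeking can be checked with a simulation of A/A experiments, comparing "declare a win if any daily check is significant" against "test once at the planned end". This is a sketch with assumed numbers (14 days, 100 users per group per day, 10% conversion rate) and a plain pooled z-test; it does not implement a sequential test.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(1)   # fixed seed so the run is repeatable
true_rate = 0.10         # same in both groups, so every "win" is a false positive
users_per_day = 100      # per group
n_days = 14
n_experiments = 500

peeking_hits = 0
fixed_hits = 0
for _ in range(n_experiments):
    conv_a = conv_b = n = 0
    significant_at_some_peek = False
    p = 1.0
    for _ in range(n_days):
        conv_a += sum(rng.random() < true_rate for _ in range(users_per_day))
        conv_b += sum(rng.random() < true_rate for _ in range(users_per_day))
        n += users_per_day
        p = two_proportion_p_value(conv_a, n, conv_b, n)  # daily peek
        if p < 0.05:
            significant_at_some_peek = True
    peeking_hits += significant_at_some_peek
    fixed_hits += p < 0.05  # only the final, pre-planned look

peeking_rate = peeking_hits / n_experiments
fixed_rate = fixed_hits / n_experiments
print(f"false positive rate with daily peeking: {peeking_rate:.3f}")
print(f"false positive rate with one final test: {fixed_rate:.3f}")
```

With daily peeking the false positive rate should come out well above the nominal 5%, while the single final test stays close to it.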