AB Testing

Workflow

Know the business goal
Define goal metrics
Unit of diversion (randomization unit) - page views, users, or cookies
Population - run the experiment only on the population that will be affected (for a checkout feature, use users who initiate the checkout process instead of all users)
Sample size - based on the baseline metric value, practical significance level (minimum detectable effect), statistical significance level, and power
Duration - consider usage patterns, business cycles, and novelty effects; if the experiment is risky or not reversible, start with a very small sample and run it for longer
Assignment - how to split control and treatment (random? Network effect?)
Sanity check, then test whether the result is statistically significant enough to reject the null hypothesis and conclude that control and treatment differ
Launch or not - weigh the trade-offs
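
A rough sketch of the sample size step, assuming the goal metric is a conversion rate compared with a two-sided two-proportion z-test. The function name, the 10% baseline, and the 2-point minimum detectable effect are made-up illustration values, not numbers from these notes.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, mde, alpha=0.05, power=0.8):
    """Approximate users needed in EACH group (normal approximation).

    baseline: conversion rate in control, e.g. 0.10
    mde:      absolute minimum detectable effect, e.g. 0.02
    alpha:    significance level (two-sided)
    power:    probability of detecting an effect of size mde
    """
    p1 = baseline
    p2 = baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-group variances
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Hypothetical inputs: 10% baseline, detect an absolute lift of 2 points.
n = sample_size_per_group(baseline=0.10, mde=0.02)
print(f"users needed per group: {n}")
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why the practical significance level drives duration so strongly.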

Null Hypothesis Significance Testing: we assume no difference in metric value between the control and treatment groups (this is the null hypothesis).
If an experiment only impacts a small subset of the population, even a large effect on that small set of users can be diluted and become undetectable in the overall metric.
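
A small arithmetic sketch of the dilution, under the simplifying assumptions that unaffected users show zero effect and that the metric variance is about the same in the subset and in the whole population. The function names and the numbers are hypothetical.

```python
import math


def diluted_effect(effect_in_affected, affected_fraction):
    """Average effect over everyone when only a fraction of users is affected."""
    return effect_in_affected * affected_fraction


def analyzed_sample_inflation(affected_fraction):
    """Required sample size scales with 1 / effect**2, so analyzing all users
    instead of only the affected ones multiplies the number of analyzed users
    needed by 1 / fraction**2 (assuming equal variance)."""
    return 1 / affected_fraction ** 2


# Hypothetical case: a 2-point lift that only reaches 10% of users.
overall = diluted_effect(effect_in_affected=0.02, affected_fraction=0.10)
factor = analyzed_sample_inflation(affected_fraction=0.10)
print(f"overall lift: {overall:.4f}, sample inflation: about {factor:.0f}x")
assert math.isclose(factor, 100.0)
```

This is the reason for restricting the population to users who can actually see the change.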

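The p-value DOES NOT represent the probability that the average metric value in control is different from the average metric value in treatment.
The p-value does not say that the two groups differ; it is the probability of observing a difference at least as extreme as the measured one, assuming the null hypothesis is true.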
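
To make the definition concrete, here is a minimal sketch of a two-sided p-value for a difference in conversion rates, using a pooled two-proportion z-test. The function name and the counts are hypothetical, and the normal approximation assumes reasonably large groups.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_c, n_c, conv_t, n_t):
    """Two-sided p-value for H0: both groups share one conversion rate."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    # Under H0 both groups have the same rate, so pool them for the std error.
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    # Probability, under H0, of a |z| at least this large.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 100/1000 conversions in control, 150/1000 in treatment.
p = two_proportion_p_value(100, 1000, 150, 1000)
print(f"p-value: {p:.4f}")
```

The returned number is computed entirely under the assumption of no difference, so it describes how surprising the data are under the null, not how likely the groups are to differ.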

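Sequential tests with always-valid p-values: results can be checked at any time during the experiment without inflating the false positive rate.
Predetermined experiment duration, such as one week: fix the duration in advance so the test can detect the minimum effect of interest at the chosen significance level, and evaluate only at the end.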
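
A small simulation of why the choice between these two matters. It runs A/A experiments (no true difference) and compares stopping at the first significant fixed-horizon z-test across repeated looks against testing once at the predetermined end. This is not an implementation of always-valid p-values, only an illustration of the peeking problem they address; the function name, number of looks, batch size, and unit-variance normal metric are all assumptions.

```python
import math
import random
from statistics import NormalDist


def simulate_peeking(n_sims=300, looks=10, batch=50, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, rate when testing once at the end)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    any_look_hits = 0
    final_look_hits = 0
    for _sim in range(n_sims):
        sum_c = sum_t = 0.0
        n = 0
        rejected_at_some_look = False
        significant = False
        for _look in range(looks):
            # New batch of users per group; both groups share the same distribution.
            for _user in range(batch):
                sum_c += rng.gauss(0, 1)
                sum_t += rng.gauss(0, 1)
            n += batch
            # z-statistic for a difference in means with known unit variance.
            z = (sum_t / n - sum_c / n) / math.sqrt(2 / n)
            significant = abs(z) > z_crit
            rejected_at_some_look = rejected_at_some_look or significant
        any_look_hits += rejected_at_some_look
        final_look_hits += significant  # result at the predetermined end only
    return any_look_hits / n_sims, final_look_hits / n_sims


peek_rate, fixed_rate = simulate_peeking()
print(f"false positives with peeking: {peek_rate:.2%}, fixed duration: {fixed_rate:.2%}")
```

With a predetermined duration the false positive rate stays near alpha, while declaring a winner at the first significant look pushes it well above alpha.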