**TLDR:** The sample size per group, $N$, needed to get to significance should satisfy

\[ N \ge 2.25\frac{(1.65^2)\left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right)}{ ( \hat{p}_A - \hat{p}_B)^2} \] where $\hat{p}_{A,B} = \frac{c_{A,B}}{n}$, $c_{A,B}$ is the number of conversions in group $A$ or $B$, and $n$ is the currently observed sample size per group.

How do you use this before the experiment is launched? Ask the (hopefully unbiased) experimenter what they think the conversion rates will be for groups $A$ and $B$, and plug those values in for $\hat{p}_{A}$ and $\hat{p}_{B}$ respectively. What if they don't know, or have no good guess? Ask them what they think the *difference* between the groups will be, and use the alternate formula:
\[ N \ge 2.25\frac{1.65^2}{2\delta^2} \]
where $\delta$ is the suggested difference, and each $\hat{p}(1-\hat{p})$ term has been replaced by its maximum possible value of $1/4$, making this a conservative bound. (Keep in mind the suggested difference is probably greatly overestimated.)
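As a quick calculator, here is a minimal Python sketch of both TLDR formulas (the function names `required_n` and `required_n_from_delta` are my own):

```python
import math

def required_n(p_a, p_b, z=1.65, power_factor=2.25):
    """Per-group sample size from the TLDR formula (z = 1.65 ~ 95% one-sided)."""
    var_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil(power_factor * z**2 * var_sum / (p_a - p_b) ** 2)

def required_n_from_delta(delta, z=1.65, power_factor=2.25):
    """Conservative variant when only the expected difference delta is guessed,
    bounding each p(1 - p) term by its maximum of 1/4."""
    return math.ceil(power_factor * z**2 / (2 * delta**2))
```

For example, `required_n(0.10, 0.12)` suggests roughly three thousand samples per group, while the delta-only variant with the same gap is (by design) more pessimistic.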

### Introduction

I'm often asked to estimate the number of samples needed in an A/B test, so I've sat down and tried to work out a formula (being dissatisfied with other formulas' missing derivations). The derivation below starts off with Bayesian A/B testing, but uses frequentist methods to derive a single estimate.

### The setup:

Suppose we have two groups, \(A,B,\) in our experiment, each of equal sample size \(N\). The groups have true conversion rates \(p_A, p_B\) respectively. We use a beta-binomial model to find the posterior of \(p_{A,B}\), e.g. \(p_A \sim Beta(\alpha=1 + c_A, \beta= 1 + N - c_A)\), where \(c_A\) is the number of conversions observed. For large \(N\), this posterior is very well approximated by a Normal distribution, e.g.

\[p_A \sim Nor \left( \mu_A = \frac{\alpha}{\alpha+\beta}, \sigma_A^2 = \frac{\frac{\alpha}{\alpha+\beta}\frac{\beta}{\alpha + \beta}}{\alpha+\beta+1} \right)\]

Ultimately, we are interested in when \(Pr( p_A > p_B \;| \;c_A, c_B, N ) \ge 0.95 \Rightarrow Pr( p_A - p_B > 0 \;| \;c_A, c_B, N ) \ge 0.95\). As both \(p_B\) and \(p_A\) are Normal, denoting \(D = p_A - p_B\), then \(D\) is Normal, \(D \;| \;c_A, c_B, N \sim Nor\left( \mu = \mu_A - \mu_B, \sigma^2 = \sigma_A^2 + \sigma_B^2 \right)\). Suppose \(\mu_A > \mu_B\) so that \(\mu > 0\).
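Under these Normal approximations, the posterior probability that \(A\) beats \(B\) can be computed directly. A small Python sketch, using the \(Beta(1+c, 1+N-c)\) posterior moments above (the function name is mine):

```python
import math

def prob_a_beats_b(c_a, c_b, n):
    """Pr(p_A > p_B | c_A, c_B, N) under the Normal approximation
    to the Beta(1 + c, 1 + N - c) posteriors."""
    mu_a, mu_b = (1 + c_a) / (2 + n), (1 + c_b) / (2 + n)
    var_a = mu_a * (1 - mu_a) / (n + 3)   # alpha + beta + 1 = n + 3
    var_b = mu_b * (1 - mu_b) / (n + 3)
    mu, sigma = mu_a - mu_b, math.sqrt(var_a + var_b)
    # D = p_A - p_B ~ Normal(mu, sigma^2), so Pr(D > 0) = Phi(mu / sigma)
    return 0.5 * (1 + math.erf(mu / (sigma * math.sqrt(2))))
```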

We'd like to find a \(\sigma^2 = \sigma^2(N)\) (as \(\sigma_{A,B}^2\) are functions of the sample size \(N\)) s.t. \(Pr(D > 0 \;| \;c_A, c_B, N ) \ge 0.95\), or equivalently \(Pr(D < 0 \;| \;c_A, c_B, N ) \le 0.05\):

\[ Pr\left(\frac{D - \mu}{\sigma} < \frac{-\mu}{\sigma}\right) \le 0.05 \]

Inverting the normal CDF:

\[\frac{-\mu}{\sigma} \le -1.65\]

\[\frac{\mu^2}{1.65^2} \ge \sigma^2 = \sigma_A^2 + \sigma_B^2 \]

\(\sigma_A^2\) can be approximated by \(\frac{\hat{p}_A(1-\hat{p}_A)}{N}\), where \(\hat{p}_A = \frac{c_A}{N}\), so after rearranging:

\[ N \ge \frac{(1.65^2)\left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right)}{\mu^2} \]

where \(\mu = \hat{p}_A - \hat{p}_B\). Note that \(N\) here is the sample size *per group*. Denote

\[ N^* = \frac{(1.65^2)\left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right)}{\mu^2} \]

Let's examine this formula for a moment:

The constant is a function of the significance desired. I chose 95%, with corresponding constant \(1.65^2\), but more generally it is \(\left(\Phi^{-1}(1-p)\right)^2\). If \(p=1\), i.e. we want absolute certainty (impossible), we'll need an infinite number of samples (impossible). Good.

It obeys an inverse square-law in the distance between the conversion rates. That's cool. If the distance is 0, i.e. \(p_A = p_B\), then it is naturally impossible to determine which is larger, hence the degenerate solution (infinity).

It is a function of the variances of the posteriors. This is really interesting too. It still blows my mind that the closer \(p\) is to 0.5, the larger the posterior variance. (But see the Fisher information of a Binomial for why.)
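The constant for other confidence levels can be computed with the standard Normal inverse CDF. A one-liner in Python (function name mine):

```python
from statistics import NormalDist

def z_squared(confidence):
    """Squared z-score for a one-sided confidence level, e.g. 0.95 -> ~2.71 (= 1.65^2)."""
    return NormalDist().inv_cdf(confidence) ** 2
```

As expected, the constant grows without bound as the confidence level approaches 1.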

### Diving deeper:

It can be empirically shown that the above formula exhibits

\[E\left[ 1_{\left\{ Pr( D> 0 \;| \;C_A, C_B, N^*, p_A, p_B) \ge 0.95 \right\}} \right] = 0.5\]

i.e. there is a 50% chance that using \(N^*\) will provide significance *over all states of the world*, a desirable property when we talk about *expected* sample size. Thus the "power" (defined as the probability of correctly rejecting insignificance) of the test is 50%. Practically though, \(N^*\) should be treated as a lower bound, and values above it should be chosen. How much higher should we make it? What is a good pre-determined power? Most practitioners choose 80%, so let's use that. Empirically, it seems that multiplying \(N^*\) by 2.25 achieves about 80% power. No formal proof is given (see here for some Monte Carlo code). So really you should be using sample sizes greater than
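That 80% claim can be checked by simulation. Below is my own Monte Carlo sketch (not the code linked above): draw conversions at the true rates, apply the significance criterion from the derivation, and count how often it fires.

```python
import math
import random

def simulated_power(p_a, p_b, n, trials=500, z=1.65):
    """Fraction of simulated experiments of size n per group that reach
    Pr(D > 0) >= 0.95 under the Normal-approximation criterion."""
    hits = 0
    for _ in range(trials):
        c_a = sum(random.random() < p_a for _ in range(n))
        c_b = sum(random.random() < p_b for _ in range(n))
        mu_a, mu_b = (1 + c_a) / (2 + n), (1 + c_b) / (2 + n)
        var = mu_a * (1 - mu_a) / (n + 3) + mu_b * (1 - mu_b) / (n + 3)
        hits += (mu_a - mu_b) / math.sqrt(var) >= z
    return hits / trials
```

For instance, with \(p_A = 0.5, p_B = 0.4\), the formula with the 2.25 factor gives about 301 samples per group, and the simulated power comes out near 0.8.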

\[ 2.25N^* = 2.25\frac{(1.65^2)\left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right)}{\mu^2} \]

### Unequal groups:

Above we assumed that the two groups had equal sample size. What about the case where \(N_A \ne N_B\)? Replicating above, we get as far as:

\[\frac{\mu^2}{1.65^2} \ge \left( \frac{\hat{p}_A(1-\hat{p}_A)}{N_A}+ \frac{\hat{p}_B(1-\hat{p}_B)}{N_B}\right) \]

and so invoking the old *multiply by 1 trick*:

\[\frac{\mu^2}{1.65^2} \ge \frac{ \left( \frac{\hat{p}_A(1-\hat{p}_A)}{N_A}+ \frac{\hat{p}_B(1-\hat{p}_B)}{N_B}\right) } {\left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right)} \left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right) \]

We can call the following term

\[ \frac{ \left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right) } {\left( \frac{\hat{p}_A(1-\hat{p}_A)}{N_A}+ \frac{\hat{p}_B(1-\hat{p}_B)}{N_B}\right)} \]

the *effective sample size*. So if we assign \(N_A = N, N_B = rN_A \Rightarrow r = \frac{N_B}{N_A}\), where \(0 < r \le 1\), (so in the equal case \(r=1\)), then:

\[ \frac{ \left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right) } {\left( \frac{\hat{p}_A(1-\hat{p}_A)}{N}+ \frac{\hat{p}_B(1-\hat{p}_B)}{rN}\right)} = N \frac{ \left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right) } {\left( \hat{p}_A(1-\hat{p}_A) + \frac{\hat{p}_B(1-\hat{p}_B)}{r}\right)} \]

and we need this quantity to be larger than \(N^{*}\) from above. Notice that by symmetry we may assume \(r \le 1\) (if \(r>1\), we could relabel \(N_B = N, N_A = rN_B\), in which case \(r < 1\) again). It is easy to see that for every \(r < 1\), the effective sample size is smaller than \(N\). Thus maximum power is achieved when the sample sizes are equal (a known statistical fact).
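The effective sample size is easy to compute; a minimal Python sketch (function name mine):

```python
def effective_n(p_a, p_b, n_a, n_b):
    """Effective per-group sample size for unequal groups, per the formula above."""
    v_a, v_b = p_a * (1 - p_a), p_b * (1 - p_b)
    return (v_a + v_b) / (v_a / n_a + v_b / n_b)
```

With equal groups it returns \(N\) exactly; shrinking either group pulls it below \(N\), which is the loss of power described above.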

## Conclusion

Take all this with a grain of salt, my math might be off somewhere, but it seems to work pretty well in practice.

Other articles to enjoy:

- Multi-Armed Bandits
- Machine Learning counter-examples
- How to solve the *Price is Right's* Showdown
- An algorithm to sort "Top" Comments