Deriving formulas for the expected sample size needed in A/B tests

TLDR: The sample size per group, $N$, needed to get to significance should satisfy

\[ N \ge 2.25\frac{(1.65^2)\left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right)}{ ( \hat{p}_A - \hat{p}_B)^2} \] where $\hat{p}_{A,B} = \frac{c_{A,B}}{n}$, $c_{A,B}$ is the number of conversions in group $A,B$ and $n$ is the currently observed sample size per group.
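If you just want to plug in numbers, here's a minimal Python sketch of this formula (the function name and example rates are mine, not anything canonical):

```python
from scipy.stats import norm

def required_sample_size(p_a, p_b, confidence=0.95, power_factor=2.25):
    """Per-group sample size implied by the formula above."""
    z = norm.ppf(confidence)  # ~1.65 for 95%
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    return power_factor * z ** 2 * variance_sum / (p_a - p_b) ** 2

# e.g. a 5% baseline conversion rate vs. a hoped-for 5.5%:
print(required_sample_size(0.05, 0.055))  # roughly 24,000 per group
```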

How do you use this before the experiment is launched? Ask the (hopefully unbiased) experimenter what they think the conversion rates will be for groups $A$ and $B$, and plug those values in for $\hat{p}_A$ and $\hat{p}_B$ respectively. What if they don't know, or have no good guess? Ask them what they think the difference between the groups will be, and use the alternate formula: \[ N \ge 2.25\frac{(1.65^2)}{ 2 ( \delta )^2} \] where $\delta$ is the suggested difference. (This takes the worst case $\hat{p}(1-\hat{p}) = \frac{1}{4}$ for both groups, and keep in mind the suggested difference is probably greatly overestimated.)
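And a sketch of the delta-only variant, assuming the worst-case variance bound just mentioned:

```python
from scipy.stats import norm

def required_sample_size_from_delta(delta, confidence=0.95, power_factor=2.25):
    """Per-group sample size from a guessed difference alone,
    taking the worst case p(1-p) = 1/4 for both groups."""
    z = norm.ppf(confidence)
    return power_factor * z ** 2 / (2 * delta ** 2)

print(required_sample_size_from_delta(0.005))  # roughly 120,000 per group
```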

Introduction

I'm often asked to estimate the number of samples needed in an A/B test, so I've sat down and tried to work out a formula (being dissatisfied with other formulas' missing derivations). The derivation below starts off with Bayesian A/B testing, but uses frequentist methods to derive a single estimate.

The setup:

Suppose we have two groups, \(A,B,\) in our experiment, each of equal sample size \(N\). The groups have true conversion rates \(p_A, p_B\) respectively. We use a beta-binomial model to find the posterior of \(p_{A,B}\), e.g. \(p_A \sim Beta(\alpha=1 + c_A, \beta= 1 + N - c_A)\), where \(c_A\) is the number of conversions observed. For large \(N\), this posterior is approximately (in fact really close to being) Normally distributed, e.g.

\[p_A \sim Nor \left( \mu_A = \frac{\alpha}{\alpha+\beta},\; \sigma_A^2 = \frac{\frac{\alpha}{\alpha+\beta}\frac{\beta}{\alpha + \beta}}{\alpha+\beta+1} \right)\]
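Here's a quick way to eyeball how close the Beta posterior is to its Normal approximation, using scipy (the counts below are made up):

```python
from scipy.stats import beta, norm

N, c_A = 1000, 50  # made-up counts: 50 conversions out of 1000
a, b = 1 + c_A, 1 + N - c_A

mu = a / (a + b)
var = mu * (1 - mu) / (a + b + 1)

# compare quantiles of the exact Beta posterior and its Normal approximation
for q in (0.05, 0.5, 0.95):
    print(q, beta.ppf(q, a, b), norm.ppf(q, loc=mu, scale=var ** 0.5))
```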

Ultimately, we are interested in when \(Pr( p_A > p_B \;| \;c_A, c_B, N ) \ge 0.95\), equivalently \(Pr( p_A - p_B > 0 \;| \;c_A, c_B, N ) \ge 0.95\). As both \(p_A\) and \(p_B\) are (approximately) Normal, denoting \(D = p_A - p_B\), \(D\) is Normal too: \(D \;| \;c_A, c_B, N \sim Nor\left( \mu = \mu_A - \mu_B, \sigma^2 = \sigma_A^2 + \sigma_B^2 \right)\). Suppose \(\mu_A > \mu_B\), so that \(\mu > 0\).
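If you distrust the Normal approximation for \(D\), you can estimate \(Pr(D > 0)\) directly by sampling both posteriors; a sketch with made-up counts:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
N, c_A, c_B = 1000, 60, 45  # made-up observed counts

samples_A = beta.rvs(1 + c_A, 1 + N - c_A, size=100_000, random_state=rng)
samples_B = beta.rvs(1 + c_B, 1 + N - c_B, size=100_000, random_state=rng)

# Monte Carlo estimate of Pr(D > 0 | data)
print((samples_A - samples_B > 0).mean())
```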

We'd like to find a \(\sigma^2 = \sigma^2(N)\) (as \(\sigma_{A,B}\) are functions of the sample size \(N\)) such that \(Pr(D > 0 \;| \;c_A, c_B, N ) \ge 0.95\), i.e. \(Pr(D < 0 \;| \;c_A, c_B, N ) \le 0.05\):

\[ Pr(\frac{D - \mu}{\sigma} < \frac{-\mu}{\sigma}) \le 0.05 \]

Inverting the normal CDF:

\[\frac{-\mu}{\sigma} \le -1.65\]

\[\frac{\mu^2}{1.65^2} \ge \sigma^2 = \sigma_A^2 + \sigma_B^2 \]

\(\sigma_A^2\) can be approximated by \(\frac{\hat{p}_A(1-\hat{p}_A)}{N}\), where \(\hat{p}_A = \frac{c_A}{N}\), so after rearranging:

\[ N \ge \frac{(1.65^2)\left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right)}{\mu^2} \]

where \(\mu = \hat{p}_A - \hat{p}_B\). Note that \(N\) here is the sample size per group. Denote

\[ N^* = \frac{(1.65^2)\left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right)}{\mu^2} \]

Let's examine this formula for a moment:

  1. The constant is a function of the significance desired. I chose 95%, with corresponding constant \(1.65^2\), but more generally it is \(\left(\Phi^{-1}(1-p)\right)^2\). If \(p=1\), i.e. we want absolute certainty (impossible), we'll need an infinite number of samples (also impossible). Good.

  2. It is an inverse square law of the distance between the conversion rates. That's cool. If the distance is 0, i.e. \(p_A = p_B\), then it is naturally impossible to determine which is larger, hence we get a degenerate solution (infinity).

  3. It is a function of the variances of the posteriors. This is really interesting too. It still blows my mind that the closer \(p\) is to 0.5, the larger the posterior variance (see the Fisher Information of a Binomial for why); there's a sketch of this effect after the list.
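Here's that sketch: for a fixed lift \(\delta\), the required \(N^*\) grows as the baseline rate approaches 0.5 (the rates and \(\delta\) below are arbitrary):

```python
from scipy.stats import norm

z2 = norm.ppf(0.95) ** 2
delta = 0.01  # fixed absolute lift
for p in (0.01, 0.1, 0.3, 0.5):
    variance_sum = p * (1 - p) + (p + delta) * (1 - p - delta)
    print(p, round(z2 * variance_sum / delta ** 2))  # N* per group
```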

Diving deeper:

It can be empirically shown that the above formula exhibits

\[E\left[ 1_{Pr( D> 0 \;| \;C_A, C_B, N^*, p_A, p_B) \ge 0.95} \right] = 0.5\]

i.e. there is a 50% chance that using \(N^*\) will provide significance, averaged over all states of the world, which is a desirable property when we talk about expected sample size. Thus the "power" of the test (defined as the probability of correctly rejecting insignificance) is 50%. Practically though, \(N^*\) should be treated as a lower bound, and values above it should be chosen. How much higher should we make it? What is a good pre-determined power? Most practitioners choose 80%, so let's use that. Empirically, it seems that multiplying \(N^*\) by 2.25 achieves about 80% power; no formal proof is given (see here for some Monte Carlo code, and the rough sketch after the formula below). So really you should be using sample sizes greater than

\[ 2.25\frac{(1.65^2)\left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right)}{\mu^2} \]
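The linked code isn't reproduced here, but a rough sketch of that kind of power simulation might look like this (the true rates are made up, and the trial counts are chosen for speed, not precision):

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(1)
p_A, p_B = 0.11, 0.10  # "true" rates, known only to the simulation

# N* from the earlier formula, inflated by 2.25
N = int(2.25 * norm.ppf(0.95) ** 2
        * (p_A * (1 - p_A) + p_B * (1 - p_B)) / (p_A - p_B) ** 2)

hits, trials = 0, 500
for _ in range(trials):
    c_A = rng.binomial(N, p_A)
    c_B = rng.binomial(N, p_B)
    # posterior Pr(p_A > p_B) by sampling the two Beta posteriors
    s_A = beta.rvs(1 + c_A, 1 + N - c_A, size=20_000, random_state=rng)
    s_B = beta.rvs(1 + c_B, 1 + N - c_B, size=20_000, random_state=rng)
    hits += (s_A > s_B).mean() >= 0.95

print(hits / trials)  # should land somewhere near 0.8
```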

Unequal groups:

Above we assumed that the two groups had equal sample sizes. What about the case where \(N_A \ne N_B\)? Repeating the derivation above, we get as far as:

\[\frac{\mu^2}{1.65^2} \ge \left( \frac{\hat{p}_A(1-\hat{p}_A)}{N_A}+ \frac{\hat{p}_B(1-\hat{p}_B)}{N_B}\right) \]

and so invoking the old multiply by 1 trick:

\[\frac{\mu^2}{1.65^2} \ge \frac{ \left( \frac{\hat{p}_A(1-\hat{p}_A)}{N_A}+ \frac{\hat{p}_B(1-\hat{p}_B)}{N_B}\right) } {\left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right)} \left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right) \]

We can call the following term

\[ \frac{ \left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right) } {\left( \frac{\hat{p}_A(1-\hat{p}_A)}{N_A}+ \frac{\hat{p}_B(1-\hat{p}_B)}{N_B}\right)} \]

the effective sample size. So if we assign \(N_A = N, N_B = rN_A \Rightarrow r = \frac{N_B}{N_A}\), where \(0 < r \le 1\), (so in the equal case \(r=1\)), then:

\[ \frac{ \left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right) } {\left( \frac{\hat{p}_A(1-\hat{p}_A)}{N}+ \frac{\hat{p}_B(1-\hat{p}_B)}{rN}\right)} = N \frac{ \left(\hat{p}_A(1-\hat{p}_A)+ \hat{p}_B(1-\hat{p}_B)\right) } {\left( \hat{p}_A(1-\hat{p}_A) + \frac{\hat{p}_B(1-\hat{p}_B)}{r}\right)} \]

and we need this quantity to be larger than \(N^{*}\) from above. Notice that we may assume \(r \le 1\) by symmetry (if \(r>1\), relabel the groups as \(N_B = N, N_A = rN_B\), and then \(r < 1\) again). It is easy to see that for every \(r < 1\), the effective sample size is smaller than \(N\). Thus maximum power is achieved when the sample sizes are equal (a known statistical fact). A small helper to play with this follows.
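Here's that helper (my own naming), a direct translation of the term above:

```python
def effective_sample_size(p_a, p_b, n_a, n_b):
    """The effective-sample-size term defined above."""
    v_a, v_b = p_a * (1 - p_a), p_b * (1 - p_b)
    return (v_a + v_b) / (v_a / n_a + v_b / n_b)

# equal groups: the effective size is just N
print(effective_sample_size(0.10, 0.11, 1000, 1000))  # 1000.0
# same total traffic, unbalanced: the effective size drops
print(effective_sample_size(0.10, 0.11, 1500, 500))   # ~735
```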

Conclusion

Take all this with a grain of salt; my math might be off somewhere, but it seems to work pretty well in practice.


