
Introduction

Statistics is one of those subjects that has the potential to be the most obscure thing you will ever see. Yet, without it, you can't really do an experiment or make an observation and come up with a universally meaningful conclusion.

What does this really mean? Let's start with the idea that someone asks a simple question, like "what time is it?". The answer is "12:05". Here the question is well defined, and so is the answer. But is it really? It could be 12:05, but then maybe that came from looking at one of those electric clocks that use a small AA battery driving a small motor that turns the second, minute, and hour hands, and you read the time off the hands. Or maybe you are looking at the digital readout of some device that has an oscillator, counting seconds and displaying the time relative to some offset. Each of these methods ("experiments") is equally capable of delivering an answer ("result"), but you have to admit that one of them might be a lot closer to being correct than the other (more accurate). In addition, the digital readout gives the time as displayed, whereas by reading the hands of a clock you could easily be off by a minute (so the digital clock is more precise).

So here are two experiments, but the results might have very different accuracies and precisions. You will find that most people in science consider this added information (precision, accuracy, etc.) to be important when reporting a result.

Of course, in our world, knowing the time to less than a minute or two is usually not important. But in the world of science it is amazingly important, for a very simple reason: we often do experiments to find out something that we do not already know! In that case we would get a result, and compare it to what was already believed. For instance, if we drop two balls of different weight off the Tower of Pisa, and we want to see whether they hit the ground at the same time, we would make 2 measurements, $t_1$ and $t_2$, take the difference $\Delta t = t_1 - t_2$, and compare to what we expect, which is that $\Delta t=0$, also known as the "null" result ("null" meaning "nothing new"). What we want to be able to know, and to report in a systematic way, is the uncertainty in $t_1$ and $t_2$: $\delta t_1$ and $\delta t_2$. If we know these uncertainties, we can form the uncertainty in the difference, and report $\Delta t= x \pm y$ where $y$ comes from knowing $\delta t_1$ and $\delta t_2$. The goal here is to give you some understanding and tools so that you can do just this, for not only this simple situation but in general.
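As a minimal sketch of the bookkeeping involved, the snippet below combines two hypothetical timing measurements into a difference $\Delta t$ and its uncertainty. It assumes the two uncertainties are independent and combine in quadrature (an assumption we have not yet justified here), and the specific numbers are made up purely for illustration.

```python
import math

# Hypothetical timing measurements (seconds) and their uncertainties.
# The numbers are invented purely for illustration.
t1, dt1 = 3.42, 0.05   # time for ball 1 to hit the ground
t2, dt2 = 3.39, 0.05   # time for ball 2 to hit the ground

delta_t = t1 - t2

# Assumption: independent uncertainties combine in quadrature.
delta_t_unc = math.sqrt(dt1**2 + dt2**2)

print(f"delta_t = {delta_t:.3f} +/- {delta_t_unc:.3f} s")

# Compare to the "null" expectation delta_t = 0 by asking how many
# uncertainties away from zero the measured difference lies.
print(f"difference from null, in units of the uncertainty: {delta_t / delta_t_unc:.2f}")
```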

This is what "statistics", in the sciences, is really about: what are the uncertainties, how do you think about them, how do you report them, and how do you use them to understand the significance of a measurement, especially relative to the "null" hypothesis.


Useful Definitions


The "mean" of a set of $N$ numbers $x_i$ is the same as the "average", defined as: $$\bar x = \frac{\sum_{i=0}^N x_i}{N}\label{emean}$$ So if you have the numbers $1, 3, 6, 2, 10$, then the mean would be given by $(1+3+6+2+10)/5=22/5=4.4$. The mean is also sometimes called the "first moment".

However, the mean of a set of numbers does not tell the whole story about the set. Imagine you have the set $4, 4, 5, 5, 4$. If you calculate the mean of that set, you also get $4.4$. Yet the two sets are completely different in how the numbers are spread out: the second set is much "tighter" than the first. This motivates another measure, the "2nd moment", which measures the spread. We use the "variance", defined as the average of the squared deviations from the mean: $$\sigma^2 = \frac{\sum_{i=1}^N (x_i-\bar{x})^2}{N}\label{evar}$$ The RMS (root mean square) deviation is the square root of the variance, also called the standard deviation, and tells you something about how the values are spread about the mean: $$\sigma_{RMS}=\sqrt{\sigma^2}\label{erms}$$

It is always worthwhile to look closely at formulae like these for deeper meaning. One thing you can notice is the factor $1/N$ in both. This factor "normalizes" things, so that the mean and variance are on the same scale as the individual $x_i$ (or close anyway). But we can also write things a different way, for instance: $$\bar x = \sum_{i=1}^N \frac{1}{N}x_i\nonumber$$ This is the same as equation $\ref{emean}$, only now each $x_i$ is multiplied by its own factor of $1/N$. Seems like more work than we need to do. But let's make the following replacement: $$P_i = \frac{1}{N}\nonumber$$ and rewrite the mean as: $$\bar x = \sum_{i=1}^N P_ix_i\nonumber$$ This formula says something more than what's in equation $\ref{emean}$: it says that the first moment (the mean) is given by the sum of each $x_i$ "weighted" by some probability $P_i$ that tells you how likely it is that that particular $x_i$ is seen. So let's calculate the $P_i$ for the two sets $\{1,3,6,2,10\}$ and $\{4,4,5,5,4\}$, but one thing we want to keep in mind is the following very important point about probabilities: if you sum them up, they always add up to 1. If they don't add to 1, then they are not probabilities! Why 1? Because given a set of probabilities, if they describe all possibilities, then since something has to happen, the probabilities have to add to 1. So this formula is very important: $$\sum_{i=1}^N P_i=1\label{eprob}$$ Back to our two sets. For the first set, $P_1$ would be the probability that we see the number $1$ in the set, $P_3$ would be the probability we see the number $3$, and so on. We only have 5 numbers, and each is distinct (not repeated), so we would have $P_i \equiv P = 1/5 = 0.2$. The mean is then given by

$\bar{x}=P_1\cdot 1+P_3\cdot 3+P_6\cdot 6+P_2\cdot 2+P_{10}\cdot 10 = P\cdot (1+3+6+2+10)=0.2\cdot 22=4.4$

as before.

For the second set, we see the number $4$ occur 3 times, so $P_4=3/5=0.6$, and the number $5$ occurs twice, so $P_5=2/5=0.4$. Notice that $P_4+P_5=1$ as required. Then the mean is given by

$\bar{x}=P_4\cdot 4+P_5\cdot 5=0.6\cdot 4+0.4\cdot 5=2.4+2.0=4.4$.

Voila. This way of looking at means and variances will come in handy later.
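Here is a minimal sketch of the same bookkeeping in code: it builds the probabilities $P_i$ from the frequency of each distinct value and forms the weighted sum, reproducing the plain average. The function name weighted_mean is just an illustrative choice.

```python
from collections import Counter

def weighted_mean(values):
    """Mean computed as the sum of (probability of each distinct value) * value."""
    n = len(values)
    counts = Counter(values)                       # how often each distinct value appears
    probs = {v: c / n for v, c in counts.items()}  # P_i = (occurrences of v) / N
    assert abs(sum(probs.values()) - 1.0) < 1e-12  # probabilities must sum to 1
    return sum(p * v for v, p in probs.items())

print(weighted_mean([1, 3, 6, 2, 10]))  # 4.4
print(weighted_mean([4, 4, 5, 5, 4]))   # 4.4
```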


Significance

Imagine you have a new list of numbers, like this: $4,5,4,3,4,6,4,5$. There are 8 numbers, 4 of which are unique, so the 4 probabilities are $P_3=1/8$, $P_4=4/8$, $P_5=2/8$, and $P_6=1/8$, and of course $P_3+P_4+P_5+P_6=1$. The mean will be given by

$\bar{x} = 3\cdot (1/8) + 4\cdot (4/8)+ 5\cdot (2/8) + 6\cdot (1/8)=4.375$

and the square root of the variance (equation $\ref{evar}$) is $\sigma = 0.86$.

Now add another point to this distribution, one that is far from the mean, e.g. $x_9=15$, and recalculate the mean and variance. The new mean will be $\bar x=5.56$ and the new spread will be $\sigma = 3.44$. The mean changed by a relatively small amount ($\sim 25\%$), however $\sigma$ changed by a huge amount: from $0.86$ to $3.44$! That's a big percentage change.

If we look at the individual points one by one and form the variance using equation $\ref{evar}$, we would find the following value for each term in the sum:

Value:         4      5      4      3      4      6      4      5      15       ($\bar x=5.56$)
$\sigma^2_i$:  2.42   0.31   2.42   6.53   2.42   0.20   2.42   0.31   89.20    ($\sigma = 3.44$)
As you can see, each of the first 8 terms contribute some small amount to the variance, but nothing like the last term, which is more than 10 times bigger than all but one other term. What this tells you is that the variance is a lot more sensitive to an "outlier", than is the mean.

To consider how significant each contribution to the variance is, we should look at things on the same scale. Each term in the variance involves the square of the difference between the particular value ($x_i$) and the mean. If we instead take the square root of each of the terms, it would look like this:

Value:                 4      5      4      3      4      6      4      5      15      ($\bar x=5.56$)
$\sqrt{\sigma^2_i}$:   1.56   0.56   1.56   2.56   1.56   0.44   1.56   0.56   9.44    ($\sigma = 3.44$)
There are many good ways to characterize how much each term contributes to the variance, and that leads us to the concept of "significance". Here's one proposal that is commonly used: take each term that contributes to the variance, $(x_i-\bar{x})^2$, and divide it by the sum of all such terms, $\sum(x_i-\bar{x})^2$: $$S_i\equiv \frac{(x_i-\bar{x})^2}{\sum(x_i-\bar{x})^2}\label{esig}$$ Each of these ratios is dimensionless, tells you how much that point contributes to the variance, and they sum to 1.0. For our example, you should get the following table for $S_i$:
Value:   4       5       4       3       4       6       4       5       15
$S_i$:   0.023   0.003   0.023   0.061   0.023   0.002   0.023   0.003   0.840
As you can see, the last point (15) is very significant compared to the other points, using this measure: it contains $84\%$ of the variance right there. Again, this is just telling you that the variance is very sensitive to outliers. Hence its value.
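Here is a short sketch that reproduces the numbers in the tables above: it computes the mean, the spread, and the per-point significance $S_i$ for the nine values, so you can see the outlier at 15 carrying about $84\%$ of the variance.

```python
values = [4, 5, 4, 3, 4, 6, 4, 5, 15]

n = len(values)
mean = sum(values) / n                       # first moment
sq_dev = [(x - mean) ** 2 for x in values]   # each term (x_i - xbar)^2
variance = sum(sq_dev) / n                   # second moment about the mean
sigma = variance ** 0.5

# Significance of each point: its share of the total squared deviation.
S = [d / sum(sq_dev) for d in sq_dev]

print(f"mean = {mean:.2f}, sigma = {sigma:.2f}")
for x, s in zip(values, S):
    print(f"x = {x:2d}   S_i = {s:.3f}")
```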

Below, you will see a table that consists of an initial list of some values ($x_i$), the contribution to the variance from each one ($\sigma^2_i\equiv(x_i-\bar{x})^2$), and the mean and variance as given by equations $\ref{emean}$ and $\ref{evar}$. You can use the number widget to enter any integer and see how it changes the mean and variance. Try adding small and large numbers and see how it moves the mean by a little, but the variance by a lot.

[Interactive table: displays $\bar{x}$ and $\sqrt{\sigma^2}$ for the values entered.]

One last question: why do we call the mean the "first moment", and the variance the "second moment"? To see this, imagine you have a collection of identical weights $m$, place them all at different distances from a pivot, and calculate the torque about the pivot. The picture looks like this:

The torque $\tau$ about the pivot would be given by: $$\tau = x_1\cdot mg + x_2\cdot mg + x_3\cdot mg + x_4\cdot mg = mg\cdot (x_1+x_2+x_3+x_4)\nonumber$$
You could also place a mass that is the sum of all the other masses at some distance $\bar{x}$ from the pivot, and find $\bar{x}$ such that the torque was the same, as in the following picture:
The calculation for that torque would be:

$\tau=\bar{x}\cdot (4m)g$

Equating the two equivalent torques gives $4\bar{x}=x_1+x_2+x_3+x_4$, or $\bar{x}=(x_1+x_2+x_3+x_4)/4$, which is the same equation for how to calculate the mean. So the mean is calculated the same way you would calculate the center of mass of a group of objects. You can play the same game and calculate the moment of inertia about the mean, and you would get the variance. In mechanics, the first moment is the center of mass, and the 2nd moment is the moment of inertia, and so on. Hence the correspondence to the mean and variance in the names.
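To make the mechanics analogy explicit, the sketch below places identical masses at some (arbitrarily chosen) positions and checks two things: the balance point (center of mass) equals the mean of the positions, and the moment of inertia about that point, divided by the total mass, equals the variance defined in equation $\ref{evar}$.

```python
positions = [1.0, 2.5, 4.0, 6.5]   # distances x_i of identical masses from the pivot
m = 2.0                            # each mass (kg); the value is arbitrary for the analogy

# Center of mass = balance point = mean of the positions (masses are identical).
total_mass = m * len(positions)
x_bar = sum(m * x for x in positions) / total_mass
print(x_bar, sum(positions) / len(positions))    # the two agree

# Moment of inertia about the mean, divided by the total mass,
# equals the variance of the positions (the "second moment").
I_about_mean = sum(m * (x - x_bar) ** 2 for x in positions)
variance = sum((x - x_bar) ** 2 for x in positions) / len(positions)
print(I_about_mean / total_mass, variance)       # the two agree
```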


(Probability) Distributions


Sometimes, events (numbers) are produced from some kind of systematic process. For instance, if you randomly pulled cards out of a deck, the process limits the cards to being Ace, 2-10, J, Q, and K only. So the card you pulled on any particular draw has constraints on what it can be. Another example, an even simpler one, might be an experiment where the answer is either yes or no. Or 0 or 1, the main point being that the answer is binary (1 of 2 possible choices). What kind of question can you ask about the result of such an experiment? Not much, since the answer is either yes or no. However, what if you did the experiment $N$ times and asked how the answers distributed themselves: for instance, how many times you saw $n$ yes and $N-n$ no. Then you would ask the important question of how this compares to what we expect, which brings you to the question of how to calculate what to expect.

To understand the answer here, let's go to a simple case where the probability of a yes ($P_1$) is equal to the probability of a no ($P_0$). Since probabilities sum to 1, we would have $P_0+P_1=2P_0=1$, therefore $P_0=P_1=\half$. This would be the case, for example, if you were flipping a coin.

Let's say heads has probability $P_1$ and tails has probability $P_0$. Now, what if you flipped the coin 2 times and asked for the probability that you get 2 heads in a row (call it $P_{11}$)? The way to calculate that probability is to go through all the possible outcomes, and calculate the fraction of times that each outcome occurs. Each fraction will be the probability (they will add to 1!). So in a 2-toss experiment, we could get the following combinations: HH, HT, TH, and TT, where the first letter is the 1st flip and the 2nd letter is the 2nd flip. The fraction of times you get 2 heads is 1/4, since that combination showed up 1 in 4 times. The fraction of times you get 2 tails is also 1/4. The fraction of times you see one heads and one tails, without caring about which one came first, is 2/4. Another way to see this: of the 3 distinct outcomes, the HH and TT probabilities add to 1/2, so the remaining probability has to be 1/2.

This gets to a very important point: if order doesn't matter, and all you care about is the probability of getting a result, you have to keep track of the "combinatorics". More on this below.

As an interesting aside, imagine you did the same 2-flip experiment, and on the 1st flip you got a tails. What's the probability that on the 2nd flip you get a tails? You might think that it's 1/4, since after 2 flips the probability of seeing 2 tails is 1/4. Do the experiment and see what you get, but the answer will be that the probability of seeing a tails on the 2nd flip is 1/2, not 1/4. Why is this? Because the two flips are uncorrelated, and the probability of seeing a tails after a tails is not the same as the probability of seeing 2 tails in a row. The former is very specific: flip a tails, then ask for the probability of seeing another tails. The latter says: flip the coin twice, and ask for the probability of seeing 2 tails. So be careful with statistics. It can really cause headaches!
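A quick simulation makes the distinction concrete. This sketch assumes a fair coin and uses Python's random module; it estimates both the probability of two tails in a row and the probability of a tails given that the first flip was a tails. The seed is fixed only so the output is reproducible.

```python
import random

random.seed(1)          # fixed seed just so the output is reproducible
trials = 100_000

two_tails = 0           # both flips tails
tails_then_tails = 0    # second flip tails, among trials whose first flip was tails
first_was_tails = 0

for _ in range(trials):
    flip1 = random.choice("HT")
    flip2 = random.choice("HT")
    if flip1 == "T" and flip2 == "T":
        two_tails += 1
    if flip1 == "T":
        first_was_tails += 1
        if flip2 == "T":
            tails_then_tails += 1

print("P(two tails in a row)          ~", two_tails / trials)                  # ~0.25
print("P(tails on 2nd | tails on 1st) ~", tails_then_tails / first_was_tails)  # ~0.50
```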

Let's Make A Deal

An interesting aside: in the 1960s, there was a game show on TV called "Let's Make a Deal", starring Monty Hall as the host. What would happen during the show is that contestants who won some kind of prize would be shown 3 doors on the stage. They would be told that 2 of the doors contained something worthless (called a "zonk"), and 1 of the doors hid a grand prize like a vacation, or a new color TV (this was the 60s - color TVs were very novel!). The contestant was told that they could choose any of the doors, or stick with what they had. Of course, the probability of choosing the door with the grand prize was 1/3. Not so high. But the contestant wanted that color TV, so they would choose one of the doors. Monty would then say that to help them out, he was going to show them what's behind one of the doors (he would pick one of the zonk prizes). Of course, he wouldn't show them the door that the contestant actually picked, he would show them a different door. And then comes the interesting part: he would say that the contestant could switch their choice, or stick with what they had. The question is, what should you do to maximize the probability of getting the grand prize?

The way to answer this is to do the experiment many times, and see what happens. But you can also apply what we've learned and calculate whether it's better, on average, to switch or to stay. Here's how to think about it: before you chose the door, you had a 1/3 chance of getting the grand prize. Then you choose, and then the host opens another door. There are 2 possibilities: one, you chose correctly (1/3 chance), and so the host could have opened either of the other two zonk doors. The other possibility is that you did not choose correctly the first time (2/3 chance). In that case, the host had only 1 zonk door to show you, and so the other unchosen unopened door contains the grand prize. This says that you should always switch, and that if you do, your chances of getting the grand prize go from 1/3 to 2/3.

This answer sometimes bothers people, even intelligent people who claim to know statistics. They will say that since you made a choice, you either had the prize or you didn't, and so it's either behind the door you chose or behind the unopened unchosen door. Hence it makes no difference if you switch or not. Can you figure out where this reasoning breaks down? Anyway, here's another way to think of it. Imagine that there were 100 doors and you chose one of them, say number 49. Then the host opened 98 others, showed you the zonk prizes, and asked if you wanted to switch. Isn't it obvious that you will want to switch then? Now take the limit as 100 goes to 3.
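If you would rather let the computer settle it, here is a small Monte Carlo sketch of the game (three doors, one prize, host always opens a zonk door you did not pick); it compares the win rate for staying versus switching. The seed is arbitrary.

```python
import random

random.seed(2)
games = 100_000
wins_stay = wins_switch = 0

for _ in range(games):
    doors = [0, 1, 2]
    prize = random.choice(doors)
    pick = random.choice(doors)

    # Host opens a door that is neither the contestant's pick nor the prize.
    opened = random.choice([d for d in doors if d != pick and d != prize])

    # Switching means taking the one remaining unopened, unpicked door.
    switched = next(d for d in doors if d != pick and d != opened)

    wins_stay += (pick == prize)
    wins_switch += (switched == prize)

print("win rate if you stay:  ", wins_stay / games)    # ~1/3
print("win rate if you switch:", wins_switch / games)  # ~2/3
```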

Binomial Distribution

We want to calculate the probability that if we flip $N$ coins, we will get $n$ heads. This probability is denoted as $P(n,N)$, and is called the binomial probability. The graph of this probability as a function of $n$ is called the binomial distribution. It's one of, if not the, most important probability distributions in all of science, and it turns out that many other probability distributions can be derived from it.

To start, let's say that we have 1 coin, that heads and tails are equally probable, and that we flip it 4 times ($N=4$). Here are the 16 different possibilities for what you could see:

HHHH, HHHT, HHTH, HHTT, HTHH, HTHT, HTTH, HTTT, THHH, THHT, THTH, THTT, TTHH, TTHT, TTTH, TTTT

The binomial probability $P(n,N)$ is the probability of seeing $n$ heads after $N$ flips, which we can get by counting the number of sequences with $n$ heads and dividing by the total number of possible sequences. So the first thing we need to know is the number of sequences with $n$ heads, and to get that all we have to do is count: there are 6 combinations showing 2 heads (HHTT, HTHT, HTTH, THHT, THTH, and TTHH). This turns out to be easy to put into a formula, called the combinatoric formula: $$C(n,N)=\frac{N!}{n!(N-n)!}\nonumber$$ $C(n,N)$ is just the combinatoric function for the number of ways you can arrange $N$ things so that you have $n$ of them showing a particular value (here it's heads). The $N!$ notation means "factorial", so that for example $4!=4\times 3\times 2\times 1=24$. There are many ways to motivate this particular formula, but they all have to do with the idea of keeping track of permutations.

$C(n,N)$ tells you the number of ways you can see $n$ out of $N$, but it doesn't tell you $P(n,N)$, the probability of seeing $n$ out of $N$. For that, you have to multiply by the joint probability of each part. For instance, if there are 3 colors (R, B, G), and each one comes with a different probability $p_R$, $p_B$, and $p_G$, then the probability of seeing any particular combination, say RB, would be given by $p_R\times p_B$. If we have a system with 2 possible states (heads, tails), with probabilities $p_H$ and $p_T$ respectively, then the probability of seeing 2 heads and 2 tails in a particular order will be given by $p_H^2\times p_T^2$. Let's simplify it more: define $p\equiv p_H$, and remember that $p_T=1-p_H=1-p$. Also, if $n$ is the number of heads, then $N-n$ is the number of tails. So the probability of seeing any particular sequence with $n$ heads will be given by $P_n=p^n(1-p)^{N-n}$. This is the 2nd piece we need. The binomial probability $P(n,N)$ is then: $$P(n,N) = C(n,N)\cdot P_n = \frac{N!}{n!(N-n)!}p^n(1-p)^{N-n}\label{ebinomial}$$ For $N=4$, $n=2$, we would have $P(2,4)=\frac{4!}{2!2!}\cdot \half^2\half^2=6\cdot \frac{1}{16}=6/16$. This is the same as what we got from counting 6 combinations of our specific pattern out of 16 possible outcomes.
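The sketch below checks equation $\ref{ebinomial}$ against brute-force enumeration of all $2^N$ coin-flip sequences for $N=4$, reproducing the $6/16$ found above. It uses the standard-library functions math.comb and itertools.product.

```python
from itertools import product
from math import comb

def binomial_prob(n, N, p=0.5):
    """Equation for P(n,N): C(n,N) * p^n * (1-p)^(N-n)."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

N = 4
# Enumerate all 2^N equally likely sequences of H/T and count heads in each.
sequences = list(product("HT", repeat=N))
for n in range(N + 1):
    fraction = sum(seq.count("H") == n for seq in sequences) / len(sequences)
    print(n, fraction, binomial_prob(n, N))   # the two columns agree
```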

The graph below shows the binomial distribution (normalized to 1) for $N=10$ and $p=\half$. Changing either $N$ or $p$ will redraw it, allowing you to see how the distribution changes. Notice what happens when you make $p$ very small (or very large), and also when you let $N$ become large. Note that you can click on any of the points and read off the values for $n$ and $P(n,N)$.

[Interactive plot: binomial distribution $P(n,N)$ with adjustable $N$ and $p$, and an optional log scale.]

The mean and variance of the binomial distribution are not all that difficult to calculate. To get the mean, we start with the definition that the mean of anything is given by the sum of that thing times the probability for that thing: $$\bar{n} = \sum_{n=0}^N n\cdot P(n,N) = \sum_{n=0}^N n\cdot \frac{N!}{n!(N-n)!}p^n(1-p)^{N-n}\nonumber$$ The way to calculate this is to note that $n/n!=1/(n-1)!$, so we are going to have to deal with $n-1$. This suggests we factor $N$ and $p$ out, to get the following: $$\bar{n} = Np\sum_{n=1}^N \frac{(N-1)!}{(n-1)!(N-1-[n-1])!}p^{n-1}(1-p)^{N-1-[n-1]}\nonumber$$ This is a nice little trick. Notice that the sum now starts at $n=1$, since the $n=0$ term of $n\cdot P(n,N)$ is $0$ anyway. Now all we have to do is make the substitution $m=n-1$ and $M=N-1$ to get $$\bar{n} = Np\sum_{m=0}^M \frac{M!}{m!(M-m)!}p^m(1-p)^{M-m}\nonumber$$ Since the sum is over the binomial distribution, which is normalized to 1, the sum is 1. So the final answer is simple: $$\bar{n} = Np\label{emeanbinomial}$$ To get the variance, we do something analogous to what we just did to calculate the mean, except as in equation $\ref{evar}$, we replace $n$ by the square of the difference between $n$ and $\bar{n}$: $(n-\bar{n})^2$. Then we have to evaluate the following: $$\sigma^2 = \sum_{n=0}^N (n-\bar{n})^2\cdot P(n,N) = \sum_{n=0}^N (n-\bar{n})^2\cdot \frac{N!}{n!(N-n)!}p^n(1-p)^{N-n}\nonumber$$ We first multiply out the square of the difference: $(n-\bar{n})^2=n^2-2n\bar{n}+\bar{n}^2$, giving 3 terms: $$\sigma^2 = \sum_{n=0}^N (n^2-2n\bar{n}+\bar{n}^2)\cdot \frac{N!}{n!(N-n)!}p^n(1-p)^{N-n}\nonumber$$ Let's label them $S_1$, $S_2$, and $S_3$, where the $S_1$ term has the $n^2$, $S_2$ has the $-2n\bar{n}$ part, and $S_3$ has the $\bar{n}^2$.

For $S_3$, we can pull the $\bar{n}^2$ out of the sum, leaving just the binomial distribution, which sums to 1. So $S_3=\bar{n}^2$. For $S_2$, we can pull the factor $-2\bar{n}$ out of the sum, leaving the $n$ along with the binomial distribution. However, that is the same sum we evaluated above, yielding the same $\bar{n}$, which gives us $S_2=-2\bar{n}^2$. This gives us $S_2+S_3=-\bar{n}^2$.

The first term with the $n^2$ part is slightly more complicated, and the solution goes like this:

$$\begin{aligned}
S_1 &= \sum_{n=0}^N n^2\frac{N!}{n!(N-n)!}p^n(1-p)^{N-n} \\
&= pN\sum_{n=1}^N n\frac{(N-1)!}{(n-1)!(N-1-[n-1])!}p^{n-1}(1-p)^{N-1-[n-1]} \\
&= pN\sum_{m=0}^L (m+1)\frac{L!}{m!(L-m)!}p^m(1-p)^{L-m} \qquad (L\equiv N-1,\ m\equiv n-1) \\
&= pN\Big[\Big(\sum_{m=0}^L m\frac{L!}{m!(L-m)!}p^m(1-p)^{L-m}\Big)+1\Big] \\
&= pN(pL+1) \\
&= pN(p(N-1)+1) \\
&= pN(pN-p+1) \\
&= pN(1-p)+(pN)^2
\end{aligned}$$
Adding all 3 terms, and using $\bar{n}=pN$, gives:

$S_1+S_2+S_3=pN(1-p)+(pN)^2 -(pN)^2=pN(1-p)$

or $$\sigma^2=pN(1-p)=\bar{n}(1-p)\label{evarbinomial}$$ Equations $\ref{emeanbinomial}$ and $\ref{evarbinomial}$ are very useful, and tell you that the mean of a binomial distribution is just the total number of trials, $N$, times the probability $p$ of any one event. In the coin experiment, $p$ is the probability of getting a heads, and $N$ is the number of coin tosses. $P(n,N)$ tells you the probability that you will see $n$ heads (and therefore $N-n$ tails), $\bar{n}=pN$ tells you the average number of heads (if $p=\half$, then $\bar{n}=N/2$), and $\sigma^2=\bar{n}(1-p)$ is the variance on the number of heads (equal to $N/4$ for $p=\half$).

It is interesting that the variance $\sigma^2$ will be approximately equal to the mean $\bar{n}$ when $p\lt\lt 1$. This is going to be important when we next consider the Poisson distribution.
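As a numerical sanity check of equations $\ref{emeanbinomial}$ and $\ref{evarbinomial}$, the snippet below sums $n\,P(n,N)$ and $(n-\bar n)^2 P(n,N)$ directly and compares with $Np$ and $Np(1-p)$; the values of $N$ and $p$ are arbitrary choices.

```python
from math import comb

def P(n, N, p):
    """Binomial probability of n successes in N trials."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

N, p = 10, 0.3
mean = sum(n * P(n, N, p) for n in range(N + 1))
var = sum((n - mean) ** 2 * P(n, N, p) for n in range(N + 1))

print(mean, N * p)            # both ~3.0
print(var, N * p * (1 - p))   # both ~2.1
```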

Combinatorics

Imagine doing an experiment where you ask some number of strangers for their birthday, and make a histogram of how many people had birthdays on any particular day (where day means day and month, as a sequential number from 1 to 365). The histogram will have 365 bins, and the sum of all the entries in the bins adds up to the total number of people you queried ($N$). One would expect that all days should have the same probability $p=1/365$ of being a birthday. Of course, there could be biases that favor some days over others; for instance, one could imagine that maybe more people are conceived on holidays than non-holidays, therefore the birth dates should cluster around holidays plus 9 months. But let's assume the probability $p(n)$ of any random birthday being on day $n$ is given by $p(n)=p=1/365$.

For instance, let $N=1000$ people. If we make a plot of the number of birthdays found on any given day $n$, call it $N(n)$, it will look something like this:

The horizontal axis is the day of the year (1-365) and the vertical tells you how many times we found a person with a birthday at that day. We can see right away that it's much more likely that we see something between 1 and 3 counts than that we see something between 8 and 10 counts.

You might then want to make a histogram of the distribution of counts (how many days had 0 counts, how many had 1, etc). Here, it looks like no day had more than 10 counts, so our histogram will be from 0 to 10. It will look like this:

To turn our histogram into a probability distribution, all we have to do is divide each bin by the total number of entries, which in this case is 365 (if you sum over all the bin contents in the frequency plot, you will get 365, since every one of the 365 days had some number of birthdays: 0, 1, 2, etc). The result is shown next:

To understand the shape of this probability distribution, let's formulate the problem slowly. We are asking each person from a sample of $N$ people for a date corresponding to their birthday. Each answer has a probability $p$, and since all dates are equally likely if there is no bias, $p=1/365$. This "stream" of 1000 answers is no different from the "stream" of 1000 coin tosses, each being either heads or tails, except that for the coin tosses there are 2 possible outcomes, while for the birthday experiment there are 365.

If we want to know the number of people who have the birthday Feb 25 after $N$ samples, the answer will be that on average there will be $N/365$. That's easy, because it doesn't involve combinatorics. But if we want to know the probability that after $N$ samples a given day has exactly $n$ birthdays, that does involve combinatorics, because there are many possibilities. For example, if we sample $N=365$ people, then we would expect that the average number of birthdays on any given day would be 1.0. But that is very different from asking how many days had 2 birthdays: there are many ways we could distribute the 365 birthdays so that some day gets 2 (e.g. Jan 1 has 2 and Jan 2 has 0). So the binomial distribution, which deals with probability and combinatorics, is what we need to use here.
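Here is a sketch of the birthday experiment in code (assuming every day is equally likely and ignoring leap years): it samples $N=1000$ birthdays, counts how many fall on each day, and then histograms how many days had 0, 1, 2, ... birthdays, which is the frequency distribution discussed above. The seed is arbitrary.

```python
import random
from collections import Counter

random.seed(3)
N_people = 1000
days_in_year = 365

# Draw a random birthday (1..365) for each person, assuming all days equally likely.
birthdays = [random.randint(1, days_in_year) for _ in range(N_people)]

# counts_per_day[d] = how many people had a birthday on day d (0 if none).
counts_per_day = Counter(birthdays)
counts = [counts_per_day.get(d, 0) for d in range(1, days_in_year + 1)]

# frequency[k] = how many days had exactly k birthdays.
frequency = Counter(counts)
for k in sorted(frequency):
    print(f"{frequency[k]:3d} days had {k} birthday(s)")

# Dividing each frequency by 365 turns this into an (empirical) probability distribution.
```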

Combinatorics and Throwing Dice

Unfortunately there are characters in the world who cheat, and to cheat at dice all you have to do is "load" the dice so that you get the right face showing more often than not. If your cousin who owns a gambling establishment offers to pay you to test some dice to see if they are loaded, you will need to know what you are doing! So now let's attack the problem of calculating the expected probability distribution for situations with combinatorics, using dice.

We can start with a single die, and roll it $N$ times. If each face is equally likely to show up, then the probability $p(n)$ for seeing face $n$ appear will be independent of which face: $p(n)=1/6$. What we want to do is to calculate, for each face $n$, the expected number of appearances after $N$ rolls, $\bar{f}(n)$, and the expected spread about that number, $\delta(n)$.

Then we will do the experiment, roll the die $N$ times, and measure the frequency $f(n)$ for seeing face $n$ appear, and compare it to $\bar{f}(n)\pm\delta(n)$. This will give us an idea of whether the die is loaded or not.

When you roll the die $N$ times, you will be looking at a stream of numbers, each with 6 possible values, and you want to know the probability that you see face $n$ appear $m$ times. For example, say you roll the die 6 times ($N=6$) and ask for the probability that you see face 1 appear $m=2$ times. As usual, you can calculate these probabilities by adding up the number of different ways something can be seen, and multiplying by the probability of seeing just that thing. So for $m=2$, we could see face 1 on trial 1, and then again on trial 2, 3, 4, 5, or 6. Or, we could see some other face on trial 1, face 1 on trial 2, and then again on trial 3, 4, 5, or 6. And so on. If you add up all possible combinations, you will get 5 (face 1 first shows up on trial 1) plus 4 (trial 2), plus 3 + 2 + 1. So there are 5+4+3+2+1=15 ways to see face 1 show up twice. This is exactly what the combinatorics part of the binomial distribution calculates for you, the number of ways you can take $N=6$ trials and arrange them so that 2 of them show a particular face: $$\frac{N!}{m!(N-m)!}=\frac{6!}{2!(6-2)!}= \frac{6\cdot 5\cdot 4\cdot 3\cdot 2}{2(4\cdot 3\cdot 2)}=15\nonumber$$ To calculate the probability of seeing face $n=1$ show up $m=2$ times, we would need to multiply by the probability of seeing face 1 show up twice and any other face show up on the other 4 rolls, which would be given by $p_1^2(1-p_1)^4$. Putting this together gives you the binomial probability distribution $P_n(m,N)$, where $N$ is the number of rolls and $m$ is the number of times face $n$ appears: $$P_n(m,N)=\frac{N!}{m!(N-m)!}p_n^m(1-p_n)^{N-m}\nonumber$$ Since all faces are equally likely, we can drop the subscript $n$, and just write the probability of seeing any given face show up $m$ times in $N$ rolls as: $$P(m,N)=\frac{N!}{m!(N-m)!}p^m(1-p)^{N-m}\nonumber$$ where $p=1/6$.

Since we know that the mean and variance of a binomial distribution will be given by equations $\ref{emeanbinomial}$ and $\ref{evarbinomial}$, we can write that the expected number of appearances of each face is $\bar{f}(n)=Np=N/6$, with a spread of $\delta(n)=\sqrt{Np(1-p)}=\sqrt{5N/36}$.

All you have to do now is roll the die $N$ times, make a histogram of how many times you see each face $n$ appear, and compare that to what the binomial distribution predicts, which is that each face will have the same mean and variance after $N$ rolls.
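Here is a sketch of that loaded-die test: it rolls a (possibly biased) die $N$ times, then compares the observed count for each face with the fair-die binomial expectation $Np \pm \sqrt{Np(1-p)}$, $p=1/6$. The bias weights and the seed are invented for illustration.

```python
import random
from collections import Counter
from math import sqrt

random.seed(4)
N = 6000
p_fair = 1 / 6

# Invented example: a die "loaded" to favor face 6 slightly.
weights = [1, 1, 1, 1, 1, 1.3]
rolls = random.choices([1, 2, 3, 4, 5, 6], weights=weights, k=N)
observed = Counter(rolls)

expected = N * p_fair
sigma = sqrt(N * p_fair * (1 - p_fair))   # binomial spread for a fair die

for face in range(1, 7):
    pull = (observed[face] - expected) / sigma   # deviation in units of the spread
    print(f"face {face}: saw {observed[face]:4d}, expected {expected:.0f} +/- {sigma:.0f}, "
          f"pull = {pull:+.1f}")
```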

Poisson Distribution

The binomial distribution is the general case, and somewhat complex. Sometimes, however, the situation you are in will have a small probability $p\lt\lt 1$, a large number $N$, and you want to know the probability $P(n,N)$ where $n\lt\lt N$. For instance, imagine we were to ask $N$ people for their birthday, and tabulate the date as the $i$th day of the year (from 1 to 365). Then we would want to calculate the expected probability distribution for any particular day to have 2, 3, 4, 5, etc birthdays. For example, you ask 365 people for their birthday, and want to know the probability of there being 2 birthdays on any given day. Or 3 birthdays. And so forth. This will again involve the binomial distribution, because you are dealing with the combinatorics (the combinatorics and associated probabilities for taking $N$ things $n$ ways). But here, the probability for any given day will be quite small; if all days are equal, we should have $p=1/365$. If we took the data ($N$ people), made the histogram of how many days had 2, 3, etc ($n$) people, and compared to what we would expect, we could maybe see if there was some reason why some days are preferred or not (for instance, perhaps more birthdays 9 months from holidays). Here, the number of possibilities in the stream is 365, and $p=1/365$ is quite a small number. If you took $N$ samples, then the average for each day will be given by the mean of the binomial distribution, or $\bar{n} = Np$. So if you sampled 365 people, you would expect $\bar{n}=1$, but you might actually find that some days had 2, even maybe 3. But you would not expect that some days would have anywhere near 365! So in our case here, we have the following conditions: $p\lt\lt 1$, $N\gt\gt 1$, and $n\lt\lt N$. We can investigate how the binomial distribution behaves in this limit by doing the relevant expansions, and there are several equally complex ways to do this. One very straightforward way is to make the following approximations:
  1. For $n \lt\lt N$, the factor $\frac{N!}{(N-n)!}=N(N-1)\cdots(N-n+1)$ has $n$ factors, each of which is close to $N$. So a reasonable approximation here is $\frac{N!}{(N-n)!}\to N^n$.
  2. That leaves $P(n,N)=\frac{N^n}{n!}p^n(1-p)^{N-n}$.
  3. The factor $(1-p)^{N-n}\to (1-p)^N$ for $n\lt\lt N$.
  4. For $p\lt\lt 1$, we can expand $(1-p)^N$ using the natural log: $\ln(1-p)^N = N\ln(1-p)$, and $\ln (1-x) \sim -x - \half x^2$. So $(1-p)^N\sim e^{-Np}e^{-Np^2/2} \sim e^{-Np}$ as long as $Np^2\lt\lt 1$, or $Np\lt\lt 1/p$.
  5. So replacing $(1-p)^{N-n}\to e^{-Np}$, we have $P(n,N)=\frac{N^n}{n!}p^ne^{-Np}=\frac{(Np)^n}{n!}e^{-Np}$.
  6. Now define $\mu \equiv Np$.
A brief word on the approximation made above, that $Np = \mu \lt\lt 1/p$. The probability distribution will max out in the region where $n\sim \mu$, and will fall to zero well outside that region. So as long as we keep away from distributions where $\mu$ is close to $N$, we are ok. For the birthday example, this approximation will hold as long as the average number in any day is much less than $1/p=365$. This also tells us that $N\lt\lt 1/p^2\sim 10^5$, so it sets a limit on $N$.

With these approximations, the binomial distribution then goes to: $$P(n,N)\to \frac{\mu^ne^{-\mu}}{n!}\nonumber$$ in the limit $n/N\to 0$, $N\lt\lt 1/p^2$, and $p\lt\lt 1$, where $\mu\equiv Np$ is the mean. This is called the Poisson distribution, and is one of the more important distributions in science (up there with the binomial and the gaussian).

The reason the poisson has only 1 parameter $\mu$, instead of the 2 parameters $p$ and $N$ of the binomial, is the limiting condition: $p$ is very small, $N$ is very large, but the product $\mu=pN$ is finite. Also, since the variance of the binomial distribution is given by equation $\ref{evarbinomial}$, in the limit $p\to 0$ we have $Np(1-p)\to Np = \mu$, so the variance of the Poisson will be given by $\sigma^2 = \mu$, and there really is only 1 free parameter ($\mu$).

So the poisson distribution is: $$P(n,\mu)= \frac{\mu^ne^{-\mu}}{n!}\label{epoisson}$$ and the mean and RMS values are: $$\mu = Np\nonumber\label{mupoisson}$$ $$\sigma_{RMS}=\sqrt{\mu}\label{vpoisson}$$ For this to be a real probability distribution, we would want $\sum P(n,\mu)=1$, and this is the case (in the sum, the factor $e^{-\mu}$ factors out, leaving a sum over $\mu^n/n!$, which is the expansion for $e^{+\mu}$).
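The approximation is easy to check numerically. In the sketch below, the exact binomial probability for $N=1000$, $p=1/365$ (a birthday-style experiment) is compared with the Poisson probability at the same mean $\mu = Np$; the two columns agree closely.

```python
from math import comb, exp, factorial

N, p = 1000, 1 / 365
mu = N * p   # ~2.74 expected birthdays per day

def binomial(n):
    return comb(N, n) * p**n * (1 - p)**(N - n)

def poisson(n):
    return mu**n * exp(-mu) / factorial(n)

for n in range(8):
    print(f"n={n}:  binomial {binomial(n):.5f}   poisson {poisson(n):.5f}")
```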

The physical significance of these limits is interesting. For an experiment like the birthday experiment, it describes a situation with a very small probability $p$ but a large number of trials compared to the counts ($n\lt\lt N$). For something like a radioactivity experiment, where you count the number of disintegrations in some time window, the poisson also applies, since the limiting conditions mean that the probability for any given particle of the sample to decay is small, and the number of counts in any particular time window will be small compared to the number of particles.

In the following plot, you can change the mean value $\mu$ and see the shape of the poisson distribution $P(n,\mu)$. Notice how the distribution becomes more symmetric as $\mu$ gets larger.

[Interactive plot: poisson distribution $P(n,\mu)$ with adjustable $\mu$, and an optional log scale.]

The poisson distribution is over discrete values of $n$. However, as $\mu$ gets larger, the probability goes to 0 except in the region near $n\sim\mu$ (play with the poisson distribution above, driving $\mu$ up). To see this more explicitly, we can use Stirling's formula for the factorial: $$n!\to x! \to \sqrt{2\pi x}e^{-x}x^x\label{estirling}$$ Substituting, we have $$P(n,\mu)= \frac{\mu^ne^{-\mu}}{n!}\to\frac{\mu^ne^{-\mu}}{\sqrt{2\pi n}e^{-n}n^n} = \frac {e^{-(\mu-n)}} {\sqrt{2\pi n}} \Big(\frac {\mu}{n}\Big)^n \nonumber$$ Now that we have an expression that replaces the factorial of the integer $n$ with a smooth function, we can consider the poisson as a probability over a continuous variable $x$ instead of $n$. This will be a good approximation when $\mu \gt\gt 1$. It allows us to write the poisson distribution as $$P(x,\mu)= \frac{\mu^xe^{-\mu}}{x!}\label{epoissonx}$$ Note that even though we derived the poisson starting from the binomial, the poisson can also be derived from first principles for counting experiments where the counts are independent of each other (for instance, the number of photons from radioactive decay, or the number of people in line at the grocery store), and the average rate is constant (so you are just seeing fluctuations about the average).

To derive the poisson distribution from first principles, we define $\lambda$ as the rate for something to occur per unit time, and break up time into intervals of length $\dt$, so that after a time $t$ there will be $N$ intervals ($t=N\dt$). The probability that something happens in any interval will be given by $P=\lambda\dt$, so the probability that nothing happens in the interval will be given by $1-\lambda\dt$. To calculate the probability that nothing happens after $N$ intervals (at a time $t$ relative to $t=0$), we would just multiply the probabilities for all intervals together, and since the occurrence is independent of the interval, we would get $$P(0,t)=(1-\lambda\dt)^N\nonumber$$ If we want to take the limit as $\dt\to 0$, that means $N\to\infty$. To do this properly you use the Taylor expansion, and assume $N\gt\gt 1$, and this gives the well-known formula for the exponential: $$P(0,t) = e^{-\lambda t}\nonumber$$ The probability that the event happens at some time is what we want to calculate: $P(t)$. But we are talking about intervals, so the probability for the event to happen in a time interval between $t$ and $t+\dt$ will be given by the probability that it did not happen up until time $t$ ($e^{-\lambda t}$) times the probability that it happened in the next interval ($\lambda\dt$), or $$P(t)\dt = e^{-\lambda t}\lambda\dt\nonumber$$ Therefore, we have calculated that $P(t) = \lambda e^{-\lambda t}$, and we are almost there, since what we really want to calculate is the probability that after some time $t$, we will have seen $n$ events. This is where the combinatorics come in: you could see $n-1$ events happen before $t$ and 1 in the next interval $\dt$, or $n-2$ before and 2 in $\dt$, etc. After some algebra and logic, one can show that the probability for seeing $n$ events after a time $t$ will be given by the poisson distribution: $$P(n,t) = e^{-\lambda t}\frac{(\lambda t)^n}{n!}\nonumber$$
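To see the first-principles picture in action, the sketch below simulates a constant-rate process by breaking time into many small intervals, each with probability $\lambda\,\Delta t$ of containing an event, and histograms the number of events per window of length $t$; the result follows the poisson distribution with $\mu=\lambda t$. The rate, window length, and seed are arbitrary choices for illustration.

```python
import random
from collections import Counter
from math import exp, factorial

random.seed(5)
lam = 2.0          # rate: events per unit time
t = 1.5            # length of each counting window
dt = 0.005         # small time interval; P(event in one interval) = lam * dt
windows = 10_000

counts = []
n_intervals = int(t / dt)
for _ in range(windows):
    n_events = sum(random.random() < lam * dt for _ in range(n_intervals))
    counts.append(n_events)

mu = lam * t
hist = Counter(counts)
for n in range(9):
    simulated = hist.get(n, 0) / windows
    poisson = mu**n * exp(-mu) / factorial(n)
    print(f"n={n}:  simulated {simulated:.4f}   poisson {poisson:.4f}")
```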

Gaussian Distribution

As we just saw, the poisson distribution is the binomial distribution in the limit of $p\lt\lt 1$ and $n\lt\lt N$, with $pN=\mu$ finite. Both are distributions over discrete values, and usually have to do with predicting the results of counting experiments. Using Stirling's formula, we can turn the poisson into a distribution over a continuous variable, or we can calculate it from first principles. One can then look at poisson distributions when the mean $\mu$ is quite large, and when $n$ is large, and after taking the limits one gets the following distribution: $$P(x,\mu)=\frac{e^{-(x-\mu)^2/2\mu}}{\sqrt{2\pi\mu}}\nonumber$$ This is called the gaussian distribution, and is probably the most important and ubiquitous distribution in all of science.

Instead of deriving the gaussian starting with the binomial, or poisson, there's another way to do it, one that shows why the gaussian is so important. We start with a rather simple and interesting problem: consider a random variable $x_i$, take $N$ of them, and form the sum: $$X = \sum_{i=1}^N x_i\nonumber$$ Then we make a histogram of $X$, and what we want to know is the probability distribution for $X$.
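The sketch below runs exactly this experiment with uniform random numbers: it forms many sums $X$ of $N$ of them and prints a rough text histogram, which already looks bell-shaped even though the individual $x_i$ are flat between 0 and 1. The choices of $N$, the number of trials, the bin width, and the seed are arbitrary.

```python
import random
from collections import Counter

random.seed(6)
N = 12          # how many uniform x_i go into each sum
trials = 50_000

sums = [sum(random.random() for _ in range(N)) for _ in range(trials)]

# Crude text histogram of X: bin the sums in steps of 0.5 and print a bar per bin.
bins = Counter(round(x * 2) / 2 for x in sums)
for edge in sorted(bins):
    print(f"{edge:4.1f} {'#' * (bins[edge] // 400)}")
```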



Copyright Drew Baden, Jan 27, 2017