
Introduction

Statistics is one of those subjects that has the potential to be the most obscure thing you will ever see. Yet, without it, you can't really do an experiment or make an observation and come up with a universally meaningful conclusion.

What does this really mean? Let's start with the idea that someone asks a simple question, like "what time is it?". The answer is "12:05". Here the question is well defined, and so is the answer. But is it really? It could be 12:05, but then maybe that came from looking at one of those electric clocks that use a small AA battery that drives a small motor that drives the second, minute, and hour hands around. Then you read off the time. Or maybe you are looking at some digital readout of some device that has an oscillator, counting seconds and displaying the time relative to some offset. Each of these methods ("experiments") is equally capable of delivering an answer ("result"), but you have to admit that one of them might be a lot closer to being correct than the other (more accurate). In addition, the digital readout gives a time as displayed, whereas by reading the hands of a clock you could easily be off by a minute (so the digital clock is more precise).

So here are two experiments, but the results might have very different accuracies and precisions. You will find that most people in science tend to think that this added information (precision, accuracy, etc.) is important when reporting the result.

Of course, in our world, knowing the time to less than a minute or two is usually not important. But in the world of science it is amazingly important, for a very simple reason: we often do experiments to find out something that we do not already know! In which case we would get a result, and compare it to what was already believed. For instance, if we drop two balls of different weight off the Tower of Pisa, and we want to see if they do hit the ground at the same time, we would make 2 measurements, $t_1$ and $t_2$, take the difference $\Delta t = t_1 - t_2$, and compare to what we expect, which is that $\Delta t=0$, also known as the "null" result ("null" meaning "nothing new"). What we want to be able to know, and to report in a systematic way, is the uncertainty in $t_1$ and $t_2$: $\delta t_1$ and $\delta t_2$. If we know these uncertainties, we can form the uncertainty in the difference, and report $\Delta t= x \pm y$ where $y$ comes from knowing $\delta t_1$ and $\delta t_2$. The goal here is to give you some understanding and tools so that you can do just this, for not only this simple situation but in general.

This is what "statistics", in the sciences, is really about: what are the uncertainties, how do you think of them, how do you report them, and how do you use them to understand the significance of a measurements, especially relative to the "null" hypothesis.


Probability Distributions


Sometimes, events (numbers) are produced from some kind of systematic process. For instance, if you randomly pulled cards out of a deck, the process limits the cards to being Ace, 2-10, J, Q, and K only. So the card you pulled on any particular pull has constraints on what it can be. Another example, an even simpler one, might be an experiment where the answer is either yes or no. Or, 0 or 1, the main point being that the answer is binomial (1 of 2 possible choices). What kind of questions can you ask about the results of such an experiment? Not much, since the answer is either yes or no. However, what if you did the experiment $N$ times, and asked how the answers distributed themselves? For instance, how many times you saw $n$ yes and $N-n$ no. And then you would ask the important question of how this compares to what we expect, which brings you to the question of how to calculate what to expect.

To understand the answer here, let's go to a simple case where the probability of a yes ($P_1$) is equal to the probability of a no ($P_0$). Since probabilities sum to 1, we would have $P_0+P_1=2P_0=1$, therefore $P_0=P_1=\half$. This would be the case, for example, if you were flipping a coin.

Let's say outcome 1 is "heads" and outcome 0 is "tails". Now, what if you flipped the coin 2 times and asked for the probability that you get 2 heads in a row (call it $P_{11}$)? The way to calculate that probability is to go through all the possible outcomes, and calculate the fraction of times that each outcome shows up. Each fraction will be the probability (they will add to 1!). So in a 2-toss experiment, we could get the following combinations: HH, HT, TH, and TT, where the first letter is the 1st flip and the 2nd letter is the 2nd flip. The fraction of times you get 2 heads is 1/4, since that combination showed up 1 in 4 times. The fraction of times you get 2 tails is also 1/4. The fraction of times you saw one heads and one tails, without caring about which one came first, would be 2/4. Another way to see this: the HH and TT probabilities add to 1/2, so the remaining probability (one of each, in either order) has to be 1/2.

This gets to a very important point: if order doesn't matter, and all you care about is the probability of getting a result, you have to keep track of the "combinatorics". More on this below.

As an interesting aside, imagine you did the same 2-flip experiment, and on the 1st flip you got a tails. What's the probability that on the 2nd flip you got a tails? You might think that it's 1/4, since after 2 flips the probability of seeing 2 tails is 1/4. Do the experiment and see what you get, but the answer will be that the probability of seeing a tails on the 2nd flip is 1/2, not 1/4. Why is this? Because the two flips are uncorrelated, and the probability of seeing a tails after a tails is not the same as the probability of seeing 2 tails in a row. The former is very specific: flip a tails, then ask for the probability of seeing another tails. The latter says: flip the coin twice, what's the probability of seeing 2 tails? So be careful with statistics. It can really cause headaches!
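
If you're skeptical, it's easy to check numerically. Here is a minimal Python sketch (the number of trials and the seed are arbitrary choices) that estimates both probabilities by brute force:

```python
import random

random.seed(1)
N = 100_000
both_tails = 0     # count of TT among all 2-flip experiments
first_tails = 0    # count of experiments where the 1st flip was T
second_tails = 0   # ...and the 2nd flip was also T

for _ in range(N):
    f1 = random.choice("HT")
    f2 = random.choice("HT")
    if f1 == "T":
        first_tails += 1
        if f2 == "T":
            second_tails += 1
    if f1 == "T" and f2 == "T":
        both_tails += 1

print("P(2 tails)               =", both_tails / N)              # ~0.25
print("P(tails | 1st was tails) =", second_tails / first_tails)  # ~0.50
```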

Let's Make A Deal

An interesting aside: in the 1960s, there was a game show on TV called "Let's Make a Deal", starring Monty Hall as the host. What would happen during the show is that contestants who won some kind of prize would be shown 3 doors on the stage. They would be told that 2 of the doors contained something worthless (called a "zonk"), and 1 of the doors would hide a grand prize like a vacation, or a new color TV (this was the 60s - color TVs were very novel!). The contestant was told that they could choose any of the doors, or stick with what they had. Of course, the probability of choosing the door with the grand prize was 1/3. Not so high. But the contestant wanted that color TV, so they would choose one of the doors. Monty would then say that to help them out, he was going to show them what's behind one of the doors (he would pick one of the zonk prizes). Of course, he wouldn't show them the door that the contestant actually picked, he would show them a different door. And then comes the interesting part: he would say that the contestant could switch their choice, or stick with what they had. The question is, what should you do based on maximizing the probability of getting the grand prize?

The way to answer this is to do the experiment many times, and see what happens. But you can also apply what we've learned and calculate whether it's better, on average, to switch or to stay. Here's how to think about it: before you chose the door, you had a 1/3 chance of getting the grand prize. Then you choose, and then the host opens another door. There are 2 possibilities: one, you chose correctly (1/3 chance), and so the host could have opened either of the other two zonk doors. The other possibility is that you did not choose correctly the first time (2/3 chance). In that case, the host had only 1 zonk door to show you, and so the other unchosen unopened door contains the grand prize. This says that you should always switch, and that if you do, your chances of getting the grand prize go from 1/3 to 2/3.

This answer sometimes bothers people, even intelligent people who claim to know statistics. They will say that since you made a choice, you either had the prize or you didn't, and so it's either behind the door you chose or behind the unopened unchosen door. Hence it makes no difference if you switch or not. Can you figure out where this reasoning breaks down? Anyway, here's another way to think of it. Imagine that there were 100 doors and you chose one of them, say number 49. Then the host opened 98 others, showed you the zonks, and asked if you wanted to switch. Isn't it obvious that you will want to switch then? Now take the limit as 100 goes to 3.
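
Here is a minimal Python sketch of the game under the rules described above (the seed, the number of rounds, and the door labels are arbitrary choices); it plays many rounds and compares the two strategies:

```python
import random

random.seed(2)
rounds = 100_000
wins_stay = wins_switch = 0

for _ in range(rounds):
    prize = random.randrange(3)    # door hiding the grand prize
    choice = random.randrange(3)   # contestant's first pick
    # Host opens a zonk door that is neither the pick nor the prize
    opened = next(d for d in range(3) if d != choice and d != prize)
    # The switch strategy takes the remaining unopened door
    switched = next(d for d in range(3) if d != choice and d != opened)
    wins_stay += (choice == prize)
    wins_switch += (switched == prize)

print("P(win | stay)   =", wins_stay / rounds)     # ~1/3
print("P(win | switch) =", wins_switch / rounds)   # ~2/3
```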

Binomial Distribution

We want to calculate the probability that if we flip $N$ coins, we will get $n$ heads. This probability is denoted as $P(n,N)$, and is called the binomial probability. The graph of this probability as a function of $n$ is called the binomial distribution. It's one of, if not the, most important probability distributions in all of science, and it turns out that many other probability distributions can be derived from it.

To start, let's say that we have 1 coin, that heads and tails have equal probability, and that we flip it 4 times ($N=4$). Here are the 16 different possibilities for what you could see:

HHHH, HHHT, HHTH, HHTT, HTHH, HTHT, HTTH, HTTT, THHH, THHT, THTH, THTT, TTHH, TTHT, TTTH, TTTT

The binomial probability $P(n,N)$ is the probability of seeing $n$ heads after $N$ flips, which we can get by counting the number of outcomes with $n$ heads, and dividing by the total number of possible outcomes. So the first thing we need to know is the number of outcomes with $n$ heads, and to get that all we have to do is count: there are 6 combinations with 2 heads (HHTT, HTHT, HTTH, THHT, THTH, and TTHH). This turns out to be easy to put into a formula, called the combinatoric formula: $$C(n,N)=\frac{N!}{n!(N-n)!}\nonumber$$ $C(n,N)$ is just the combinatoric function for the number of ways you can arrange $N$ things so that you have $n$ of them showing a particular value (here it's heads). The $N!$ notation means "factorial", so that for example $4!=4\times 3\times 2\times 1=24$. There are many ways to motivate this particular formula, but they all have to do with the idea of keeping track of permutations.

$C(n,N)$ tells you the number of ways you can see $n$ out of $N$, but it doesn't tell you $P(n,N)$, the probability of seeing $n$ out of $N$. For that, you have to multiply by the joint probability of each part. For instance, if there are 3 colors (R,B,G), and each one comes with a different probability $p_R$, $p_B$, and $p_G$, then the probability of seeing any particular combination, say RB, would be given by $p_R\times p_B$. If we have a system with 2 possible states (heads, tails), each with probability $p_H$ and $p_T$ respectively, then the probability of seeing 2 heads and 2 tails will be given by $p_H^2\times p_T^2$. Let's simplify it more: define $p\equiv p_H$, and remember that $p_T=1-p_H=1-p$. Also, if $n$ is the number of heads, then $N-n$ is the number of tails. So the probability of seeing any particular combination of $n$ heads will be given by $P_n=p^n(1-p)^{N-n}$. This is the 2nd piece we need. The binomial probability $P(n,N)$ is then: $$P(n,N) = C(n,N)\cdot P_n = \frac{N!}{n!(N-n)!}p^n(1-p)^{N-n}\label{ebinomial}$$ For $N=4$, $n=2$, we would have $P(2,4)=\frac{4!}{2!2!}\cdot \half^2\half^2=6\cdot \frac{1}{16}=6/16$. This is the same as what we got from seeing 6 combinations of our specific pattern out of 16 possible outcomes.
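
As a sanity check, here is a short Python sketch that evaluates equation $\ref{ebinomial}$ and compares it to a brute-force enumeration of all $2^4$ flip sequences:

```python
from math import comb
from itertools import product

def binomial_p(n, N, p=0.5):
    """P(n,N): probability of n heads in N flips."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

# Enumerate all 2^4 = 16 sequences of 4 flips and count those with 2 heads
N = 4
count = sum(1 for seq in product("HT", repeat=N) if seq.count("H") == 2)
print(count, "of", 2**N, "sequences have 2 heads")  # 6 of 16
print("formula gives", binomial_p(2, N))            # 0.375 = 6/16
```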

The graph below shows the binomial distribution (normalized to 1) for $N=10$ and $p=\half$. Changing either $N$ or $p$ will redraw it, allowing you to see how the distribution changes. Notice what happens when you make $p$ very small (or very large), and also when you let $N$ become large. Note that you can click on any of the points and read off the values for $n$ and $P(n,N)$.

[Interactive plot: binomial distribution $P(n,N)$, with controls for $N$, $p$, and a log-scale option.]

The mean and variance of the binomial distribution are not all that difficult to calculate. To get the mean, we start with the definition that the mean of anything is given by the sum of that thing times the probability for that thing: $$\bar{n} = \sum_{n=0}^N n\cdot P(n,N) = \sum_{n=0}^N n\cdot \frac{N!}{n!(N-n)!}p^n(1-p)^{N-n}\nonumber$$ The way to calculate this is to note that $n/n!=1/(n-1)!$, so we are going to have to deal with $n-1$. This suggests we factor $N$ and $p$ out, to get the following: $$\bar{n} = Np\sum_{n=1}^N \frac{(N-1)!}{(n-1)!(N-1-[n-1])!}p^{n-1}(1-p)^{N-1-[n-1]}\nonumber$$ This is a nice little trick. Notice that the sum now starts at $n=1$, since the $n=0$ term of $n\cdot P(n,N)$ is $0$ anyway. Now all we have to do is make the substitution $m=n-1$ and $M=N-1$ to get $$\bar{n} = Np\sum_{m=0}^M \frac{M!}{m!(M-m)!}p^m(1-p)^{M-m}\nonumber$$ Since the sum is over the binomial distribution, which is normalized to 1, the sum is 1. So the final answer is simple: $$\bar{n} = Np\label{emeanbinomial}$$ To get the variance, we would do something analogous to what we just did to calculate the mean, except as in equation $\ref{evar}$, we replace $n$ by the square of the difference between $n$ and $\bar{n}$: $(n-\bar{n})^2$. Then we have to evaluate the following: $$\sigma^2 = \sum_{n=0}^N (n-\bar{n})^2\cdot P(n,N) = \sum_{n=0}^N (n-\bar{n})^2\cdot \frac{N!}{n!(N-n)!}p^n(1-p)^{N-n}\nonumber$$ We first multiply out the square of the difference: $(n-\bar{n})^2=n^2-2n\bar{n}+\bar{n}^2$, giving 3 terms: $$\sigma^2 = \sum_{n=0}^N (n^2-2n\bar{n}+\bar{n}^2)\cdot \frac{N!}{n!(N-n)!}p^n(1-p)^{N-n}\nonumber$$ Let's label them $S_1$, $S_2$, and $S_3$, where the $S_1$ term has the $n^2$, $S_2$ has the $-2n\bar{n}$ part, and $S_3$ has the $\bar{n}^2$.

For $S_3$, we can pull the $\bar{n}^2$ out of the sum, leaving just the binomial distribution, which sums to 1. So $S_3=\bar{n}^2$. For $S_2$, we can pull the factor $-2\bar{n}$ out of the sum, leaving the $n$ along with the binomial distribution. However, that is the same sum we evaluated above, yielding the same $\bar{n}$, which gives us $S_2=-2\bar{n}^2$. This gives us $S_2+S_3=-\bar{n}^2$.

The first term with the $n^2$ part is slightly more complicated, and the solution goes like this:

$$\begin{aligned}
S_1 &= \sum_{n=0}^N n^2\frac{N!}{n!(N-n)!}p^n(1-p)^{N-n}\\
&= pN\sum_{n=1}^N n\frac{(N-1)!}{(n-1)!(N-1-[n-1])!}p^{n-1}(1-p)^{N-1-[n-1]}\\
&= pN\sum_{m=0}^L (m+1)\frac{L!}{m!(L-m)!}p^m(1-p)^{L-m}\qquad(L\equiv N-1,\ m\equiv n-1)\\
&= pN\Big[\Big(\sum_{m=0}^L m\frac{L!}{m!(L-m)!}p^m(1-p)^{L-m}\Big)+1\Big]\\
&= pN(pL+1)\\
&= pN(p(N-1)+1)\\
&= pN(pN-p+1)\\
&= pN(1-p)+(pN)^2
\end{aligned}\nonumber$$
Adding all 3 terms together and using $\bar{n}=pN$ gives:

$S_1+S_2+S_3=pN(1-p)+(pN)^2 -(pN)^2=pN(1-p)$

or $$\sigma^2=pN(1-p)=\bar{n}(1-p)\label{evarbinomial}$$ Equations $\ref{emeanbinomial}$ and $\ref{evarbinomial}$ are very useful, and tell you that the mean of a binomial distribution is just the total number of trials, $N$, times the probability $p$ of any one event. In the coin experiment, $p$ is the probability of getting a heads, and $N$ is the number of coin tosses. $P(n,N)$ tells you the probability that you will see $n$ heads (and therefore $N-n$ tails), $\bar{n}=pN$ tells you the average number of heads (if $p=\half$, then $\bar{n}=N/2$), and $\sigma^2=\bar{n}(1-p)$ (which is $N/4$ for $p=\half$) is the variance on the number of heads.
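
Both results are easy to verify numerically; here is a minimal Python sketch using numpy's binomial generator (the values of $N$, $p$, the seed, and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, trials = 10, 0.5, 200_000

# n = number of heads in each of many N-flip experiments
n = rng.binomial(N, p, size=trials)
print("mean:", n.mean(), "  expected Np      =", N * p)
print("var :", n.var(),  "  expected Np(1-p) =", N * p * (1 - p))
```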

It is interesting that the variance $\sigma^2$ becomes equal to the mean $\bar{n}$ when $p\lt\lt 1$. This is going to be important when we next consider the Poisson distribution.

Combinatorics

Imagine doing an experiment where you ask some number of strangers for their birthday, and make a histogram of how many people had birthdays on any particular day (where day means day and month as a sequential number from 1 to 365). The histogram will have 365 bins, and the sum of all the entries in the bins adds up to the total number of people you queried ($N$). One would expect that all days should have the same probability $p=1/365$ of being a birthday. Of course, there could be biases that favor some days over others; for instance, one could imagine that maybe more people were conceived on holidays than non-holidays, therefore the birth dates should cluster around holidays plus 9 months. But let's assume the probability $p(n)$ of any random birthday being on day $n$ is given by $p(n)=p=1/365$.

For instance, let $N=1000$ people. If we make a plot of the number of birthdays $N(n)$ on any given day $n$, it will look something like this:

The horizontal axis is the day of the year (1-365) and the vertical tells you how many times we found a person with a birthday on that day. We can see right away that it's much more likely that we see something between 1 and 3 counts than that we see something between 8 and 10 counts.

You might then want to make a histogram of the distribution of counts (how many days had 0 counts, how many had 1, etc). Here, it looks like no day had more than 10 counts, so our histogram will be from 0 to 10. It will look like this:

To turn our histogram into a probability distribution, all we have to do is divide each bin by the total number of entries, which in this case is 365 (if you sum over all the bin contents in the Frequency plot, you will get 365, since the x axis tells you the number of days that had 0, 1, 2, etc birthdays, and every day is counted exactly once). The result is shown next:

To understand the shape of this probability distribution, let's formulate the problem slowly. We are asking each person from a sample of $N$ people for a date corresponding to their birthday. Each answer has a probability $p$, and since all dates are equally likely if there is no bias, then $p=1/365$. This "stream" of 1000 is no different than the "stream" of 1000 coin tosses, each being either heads or tails. However, the difference is that for the coin tosses, there are 2 possible outcomes, while for the birthday experiment, there are 365.

If we want to know the number of people who have the birthday Feb 25 after $N$ samples, the answer will be that on average, there will be $N/365$. That's easy, because it doesn't involve combinatorics. But if we want to know the probability that after $N$ samples, there will be only 1 day (of 365) with $n$ birthdays found, that involves combinatorics, because there are many possibilities. For example, if we sample $N=365$ people, then we would expect that the average number of birthdays on any given day would be 1.0. But that is very different from asking how many days had 2 birthdays: there are many ways to distribute the birthdays among the 365 days so that some day gets 2 (e.g. Jan 1 had 2 and Jan 2 had 0). So the binomial distribution, which deals with probability and combinatorics, is what we need to use here.
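
Here is a minimal Python sketch of the birthday experiment (the seed is an arbitrary choice): it draws $N=1000$ random birthdays, then compares the number of days that had $n$ birthdays against the binomial expectation, $365\cdot P(n,N)$:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)
N, days = 1000, 365
p = 1 / days

birthdays = rng.integers(0, days, size=N)        # one random day per person
counts = np.bincount(birthdays, minlength=days)  # birthdays on each day
freq = np.bincount(counts)                       # how many days had 0,1,2,... birthdays

for n, f in enumerate(freq):
    expected = days * comb(N, n) * p**n * (1 - p)**(N - n)
    print(f"{n} birthdays: {f} days  (binomial expects ~{expected:.1f})")
```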

Combinatorics and Throwing Dice

Unfortunately there are characters in the world who cheat, and to cheat in dice all you have to do is "load" the dice so that you get the right face showing more often than not. If your cousin who owns a gambling establishment offers to pay you to test some dice to see if they are loaded, you will need to know what you are doing! So now let's attack the problem of calculating the expected probability distribution for situations with combinatorics, using dice.

We can start with a single die, and roll it $N$ times. If each face is equally likely to show up, then the probability $p(n)$ for seeing face $n$ appear will be independent of which face: $p(n)=1/6$. What we want to do is to calculate the following quantities after $N$ rolls: the expected mean number of times $\bar{f}(n)$ that face $n$ should appear, and the expected spread $\delta(n)$ about that mean.

Then we will do the experiment, roll the die $N$ times, and measure the frequency $f(n)$ for seeing face $n$ appear, and compare it to $\bar{f}(n)\pm\delta(n)$. This will give us an idea of whether the die is loaded or not.

When you roll the die $N$ times, you will be looking at a stream of numbers, each with 6 different possible values, and you want to know the probability that face $n$ appears $m$ times. For example, say you roll the die 6 times ($M=6$ rolls) and ask for the probability that you should see face 1 appear $m=2$ times. As usual, you can calculate these probabilities by adding up the number of different ways something can be seen, and multiplying by the probability of seeing just that thing. So for $m=2$, we could see face 1 on trial 1, and then again on trial 2, 3, 4, 5, or 6. Or, we could see some other face on trial 1, face 1 on trial 2, and then again on trial 3, 4, 5, or 6. And so on. If you add up all possible combinations, you will get 5 (face 1 first shows up on trial 1) plus 4 (first on trial 2), plus 3 + 2 + 1 (first on trial 3, 4, or 5). So there are 5+4+3+2+1=15 ways to see face 1 show up twice. This is exactly what the combinatorics part of the binomial distribution calculates for you, the number of ways you can take 6 things and arrange them so you see 2 of them: $$\frac{M!}{m!(M-m)!}=\frac{6!}{2!(6-2)!}= \frac{6\cdot 5\cdot 4\cdot 3\cdot 2}{2(4\cdot 3\cdot 2)}=15\nonumber$$ To calculate the probability of seeing face $n=1$ show up $m=2$ times, we would need to multiply by the probability of seeing face 1 showing up twice, and any other face showing up 4 times, which would be given by $p_1^2(1-p_1)^4$. Putting this together gives you the binomial probability distribution $P_1(m,M)$, where $M$ is the number of rolls ($M=6$ here), and $m$ is the number of times you would see face $n=1$: $$P_1(m,M)=\frac{M!}{m!(M-m)!}p_1^m(1-p_1)^{M-m}\nonumber$$ Since all faces are equally likely, you can drop the subscript, and just write the probability of seeing any given face show up $m$ times as: $$P(m,M)=\frac{M!}{m!(M-m)!}p^m(1-p)^{M-m}\nonumber$$ where $p=1/6$.

Since we know that the mean and variance of a binomial distribution will be given by equations $\ref{emeanbinomial}$ and $\ref{evarbinomial}$, we can write that after $N$ rolls each face should appear $\bar{f}(n)=Np$ times on average, with variance $\delta^2(n)=Np(1-p)$, where $p=1/6$.

All you have to do now is roll the die $N$ times, make a histogram of how many times you see each face $n$ appear, and compare that to what the binomial distribution predicts, which is that each face will have the same mean and variance after $N$ rolls.
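
Here is a minimal sketch of such a test in Python, for a fair die (the seed and $N$ are arbitrary choices); a loaded die would show one or more faces sitting many $\sigma$ away from the binomial prediction:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 600, 1 / 6
mean, sigma = N * p, np.sqrt(N * p * (1 - p))  # binomial prediction per face

rolls = rng.integers(1, 7, size=N)             # a fair die
for face in range(1, 7):
    f = np.sum(rolls == face)
    pull = (f - mean) / sigma                  # deviation in units of sigma
    print(f"face {face}: {f} times, expected {mean:.0f} +/- {sigma:.1f} ({pull:+.1f} sigma)")
```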

Poisson Distribution

The binomial distribution is the general case, and somewhat complex. Sometimes, however, the situation you are in will have a small probability $p\lt\lt 1$, a large number $N$, and you want to know the probability $P(n,N)$ where $n\lt\lt N$. For instance, imagine we were to ask $N$ people for their birthday, and tabulate the date as the ith day of the year (from 1 to 365). Then we would want to calculate the expected probability distribution for any particular day to have 2, 3, 4, 5, etc birthdays. For example, you ask 365 people for their birthday, and want to know the probability of there being 2 birthdays on any given day. Or 3 birthdays. And so forth. This will again involve the binomial distribution, because you are dealing with the combinatorics (the combinatorics and associated probabilities for taking $N$ things $n$ ways). But here, the probability for any given day will be quite small, and if all days are equally likely, we should have $p=1/365$. If we took the data ($N$), made the histogram of how many days had 2, 3, etc ($n$) people, and compared to what you would expect, you could maybe see if there was some reason why some days would be preferred or not (for instance, perhaps more birthdays 9 months from holidays). If you took $N$ samples, then the average for each day will be given by the mean of the binomial distribution, or $\bar{n} = Np$. So if you sampled 365 people, you would expect $\bar{n}=1$, but you might actually find that some days had 2, even maybe 3. But you would not expect that some days would have anywhere near 365! So in our case here, we have the following conditions: $p\lt\lt 1$, $N\gt\gt 1$, and $n\lt\lt N$. We can investigate how the binomial distribution behaves in this limit by doing the relevant expansions, and there are several equally complex ways to do this. One very straightforward way is to make the following approximations:
  1. For $n \lt\lt N$, the factor $\frac{N!}{(N-n)!}$ will have $n$ terms ($N$, $N-1$, ..., $N-n+1$), and they will all be close to $N$. So a reasonable approximation here is $\frac{N!}{(N-n)!}\to N^n$
  2. That leaves $P(n,N)=\frac{N^n}{n!}p^n(1-p)^{N-n}$
  3. The factor $(1-p)^{N-n}\to (1-p)^N$ for $n\lt\lt N$.
  4. For $p\lt\lt 1$, we can expand $(1-p)^N$ using the natural log: we take $\ln\big[(1-p)^N\big] = N\ln(1-p)$, and expand $\ln(1-x) \sim -x - \half x^2$. So $(1-p)^N\sim e^{-Np}e^{-Np^2/2} \sim e^{-Np}$ as long as $Np^2\lt\lt 1$, or $Np\lt\lt 1/p$.
  5. So replacing $(1-p)^{N-n}\to e^{-Np}$, we have $P(n,N)=\frac{N^n}{n!}p^ne^{-Np}=\frac{(Np)^n}{n!}e^{-Np}$
  6. Now define $\mu \equiv Np$
A brief word on the approximation made above, that $Np = \mu \lt\lt 1/p$. The probability distribution will max out in the region where $n\sim \mu$, and will fall to zero well outside that region. So as long as we keep away from distributions where $\mu$ is close to $N$, we are ok. For the birthday example, this approximation will hold as long as the average number on any day is less than $1/p=365$. This also tells us that $N\lt\lt 1/p^2\sim 10^5$, so it sets a limit on $N$.

With these approximations, the binomial distribution then goes to: $$P(n,N)\to \frac{\mu^ne^{-\mu}}{n!}\nonumber$$ as $n/N\to 0$, $N\lt\lt 1/p^2$, and $p\lt\lt 1$, where $\mu\equiv Np$ is the mean. This is called the Poisson distribution, and is one of the more important distributions in science (up there with the binomial and the gaussian).

The reason the poisson has only 1 parameter $\mu$, instead of the 2 parameters $p$ and $N$ of the binomial, is the limiting condition: $p$ is very small, $N$ is very large, but the product $\mu=pN$ is finite. Also, since the variance of the binomial distribution is given by equation $\ref{evarbinomial}$, in the limit $p\to 0$ we have $Np(1-p)\to Np = \mu$, so the variance of the Poisson will be given by $\sigma^2 = \mu$, and there really is only 1 free parameter ($\mu$).

So the poisson distribution is: $$P(n,\mu)= \frac{\mu^ne^{-\mu}}{n!}\label{epoisson}$$ and the mean and RMS values are: $$\mu = Np\nonumber\label{mupoisson}$$ $$\sigma_{RMS}=\sqrt{\mu}\label{vpoisson}$$ For this to be a real distribution, we would want $\sum P(n,\mu)=1$, and this is the case (in the sum, the factor $e^{-\mu}$ factors out, leaving a sum over $\mu^n/n!$, which is the expansion for $e^{+\mu}$).
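
To see how good the limit is, here is a short Python sketch comparing the exact binomial probabilities to the poisson, using the birthday-experiment numbers ($N=1000$, $p=1/365$, so $\mu=Np\approx 2.7$):

```python
from math import comb, exp, factorial

N, p = 1000, 1 / 365
mu = N * p   # ~2.74

for n in range(6):
    binom = comb(N, n) * p**n * (1 - p)**(N - n)   # exact binomial
    poisson = mu**n * exp(-mu) / factorial(n)      # poisson limit
    print(f"n={n}:  binomial {binom:.5f}   poisson {poisson:.5f}")
```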

The physical significance of these limits is interesting. For an experiment like the birthday experiment, it describes a situation with a very small probability $p$ but a large number of trials compared to the counts of interest ($n\lt\lt N$). For something like a radioactivity experiment, where you count the number of disintegrations in some time window, the poisson also applies, since the limiting conditions mean that the probability for any given particle of the sample to decay is small, and the number of counts in any particular time window will be small as well.

In the following plot, you can change the mean value $\mu$ and see the shape of the poisson distribution $P(n,\mu)$. Notice how the distribution becomes more symmetric as $\mu$ gets larger.

[Interactive plot: poisson distribution $P(n,\mu)$, with a control for the mean $\mu$ and a log-scale option.]

The poisson distribution is over discrete values for $n$. However, as $\mu$ gets larger, the term $e^{-\mu}$ sends the probability to 0 except in the region near $n\sim\mu$ (play with the poisson distribution above, driving $\mu$ up). To see this more explicitly, we can use Stirling's formula for the factorial: $$n!\to x! \to \sqrt{2\pi x}e^{-x}x^x\label{estirling}$$ Substituting, we have $$P(n,\mu)= \frac{\mu^ne^{-\mu}}{n!}\to\frac{\mu^ne^{-\mu}}{\sqrt{2\pi n}e^{-n}n^n} = \frac {e^{-(\mu-n)}} {\sqrt{2\pi n}} \Big(\frac {\mu}{n}\Big)^n \nonumber$$ Now that we have an equation that substitutes for the integer $n$, we can consider the poisson as a probability over a continuous variable $x$ instead of $n$. This will be true when $\mu \gt\gt 1$. This allows us to write the poisson distribution as $$P(x,\mu)= \frac{\mu^xe^{-\mu}}{x!}\label{epoissonx}$$ Note that even though we derived the poisson starting with the binomial, the poisson can also be derived from first principles for counting experiments where the numbers counted are independent of each other (for instance, the number of photons from radioactive decay, or the number of people in line at the grocery store), and the average rate is constant (and so you are just seeing fluctuations about the average).

To derive the poisson distribution from first principles, we define $\lambda$ as the rate for something to occur per unit time, and break up time into intervals of length $\dt$, so that after a time $t$ there will be $N$ intervals ($t=N\dt$). The probability that something happens in any interval will be given by $P=\lambda\dt$, so the probability that nothing happens in the interval will be given by $1-\lambda\dt$. To calculate the probability that nothing happens after $N$ intervals (at a time $t$ relative to $t=0$), we would just multiply the probabilities for all intervals together, and since the occurrence is independent of the interval, we would get $$P(0,t)=(1-\lambda\dt)^N\nonumber$$ If we want to take the limit as $\dt\to 0$, that means $N\to\infty$. To do this properly you use the Taylor expansion, and assume $N\gt\gt 1$, and this gives the well-known formula for the exponential: $$P(0,t) = e^{-\lambda t}\nonumber$$ The probability that the event happens at some time is what we want to calculate: $P(t)$. But we are talking about intervals, so the probability for the event to happen in a time interval between $t$ and $t+\dt$ will be given by the probability that it did not happen up until time $t$ ($e^{-\lambda t}$) times the probability that it happened in the next interval ($\lambda\dt$), or $$P(t)\dt = e^{-\lambda t}\lambda\dt\nonumber$$ Therefore, we have calculated that $P(t) = \lambda e^{-\lambda t}$, and we are almost there, since what we really want to calculate is the probability that after some time $t+\dt$, we will have seen $n$ events. This is where the combinatorics come in: you could see $n-1$ events happen before $t$ and 1 in the next interval $\dt$, or $n-2$ before and 2 in $\dt$, etc. After some algebra and logic, one can show that the probability for seeing $n$ events after a time $t$ will be given by the poisson distribution: $$P(n,t) = e^{-\lambda t}\frac{(\lambda t)^n}{n!}\nonumber$$
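
Here is a minimal Python sketch of this first-principles picture (the rate and window choices are arbitrary): it chops time into tiny intervals, lets an event occur in each with probability $\lambda\dt$, and checks that the counts per 1-second window have mean and variance both equal to $\lambda t$, as the poisson predicts:

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 2.0                  # average rate: 2 events per second
dt = 0.001                 # tiny time interval, in seconds
intervals_per_sec = 1000   # = 1/dt
n_windows = 1000           # number of 1-second windows to simulate

# In each tiny interval an event occurs with probability lambda*dt
events = rng.random(n_windows * intervals_per_sec) < lam * dt
counts = events.reshape(n_windows, intervals_per_sec).sum(axis=1)

mu = lam * 1.0             # expected counts per 1-second window
print("mean:", counts.mean(), " (expect", mu, ")")
print("var :", counts.var(), " (poisson expects", mu, ")")
```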

Gaussian Distribution

As we just saw, the poisson distribution is the binomial distribution in the limit of $p\lt\lt 1$ and $n\lt\lt N$, but with $pN=\mu$ finite. Both are distributions over discrete values, and usually have to do with predicting the results from counting experiments. Using Stirling's rule, we can turn the poisson into a distribution over continuous variables, or we can calculate it from first principles. One can then look at poisson distributions when the mean $\mu$ is quite large, and when $n$ is large, and after taking the limits one can get the following distribution: $$P(x,\mu)=\frac{e^{-(x-\mu)^2/2\mu}}{\sqrt{2\pi\mu}}\nonumber$$ If we use the fact that for the poisson the mean is $\mu$ and the variance $\sigma^2$ (the RMS squared) is also equal to $\mu$, we can substitute $\sigma^2$ in 2 places of the above formula to get: $$P(x,\mu)=\frac{e^{-(x-\mu)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}\label{egaussian}$$ This is called the gaussian distribution, and is probably the most important and ubiquitous distribution in all of science.

Instead of deriving the gaussian starting with the binomial, or poisson, there's another way to do it, and to appreciate the importance of the gaussian. We start with a rather simple and interesting problem: consider a variable $x$, distributed in some random way. If you made a histogram of $x$, you would see how it's distributed (i.e. its probability function $P(x)$). We do not have to specify $P(x)$: it could be totally random, in which case $P(x)=$constant, or it could have some totally weird shape. The key thing is that the shape is irrelevant. Next, we generate a large number of $x_i$ according to $P(x)$. In the world of computer programming, there are always built-in functions that generate random numbers between 0 and 1, so each $x_i$ would be some random number in that interval. Next, we take all of these $x_i$ and form an average over $N$ of them: $$X = \frac{1}{N}\sum_{i=1}^N x_i\nonumber$$ For instance, if $N=10$, and the $x_i$ are random numbers "flat" over the interval between 0 and 1, then the probability of any particular $x_i$ value being seen is the same for all $x_i$, and the average $X$ would also be a number between 0 and 1. If we were to then make a histogram of $X$, we could see the probability distribution $P(X)$. The question is: what would you expect for the distribution of $X$? The answer is surprising: $P(X)$ turns out to be a gaussian!

In the histogram below, you can see the distribution of random numbers. It is "flat", but of course there are fluctuations. The more numbers you generate, the "flatter" the distribution becomes. The controls below allow you to change the number of $x_i$ generated, the range (Min to Max), and the bins in the histogram. As you increase the bins, the bin size decreases and the number per bin falls. The number of entries in each bin will fluctuate around the mean for each bin, given by $\mu = N_{tot}/m$ where $N_{tot}$ is the number of points generated and $m$ is the number of bins.

[Interactive histogram: generates random numbers flat between 0 and 1, with controls for the number to generate, the number of bins, and the number to sum; displays the average number per bin $\bar{N}$, the mean $\bar{x}$, and the RMS $\sqrt{\sigma^2}$.]

Next, we run the same random number generator, but instead of making a histogram of $x_i$, we will take an average over $N$ points, and histogram that. For instance, we could generate 10000 points, calculate $X$ as the average over every 10 points, and histogram that.

In the above, if you check the radio button "Averages" and hit "Regenerate" you will see what happens when you change from histogramming $x_i$ to $X$. As you can see, the distribution over the averages is highly peaked. This is straightforward to understand: when you calculate the average over say 10 points, then much of the time you will get something near the true average, which will be at 0.5 if $0\le x_i\lt 1$. But not always. Consider how you would get an average that is far away from 0.5, say at 0.1. In that case, you will need to get pretty lucky to generate 10 random values that are all clustered around 0.1. There are not many ways to do this, as opposed to the many ways you could generate numbers that have an average near 0.5. So the probability of getting something far from the true average falls fast as you go away from that average. And combinatorics plays a big role here: there are many more ways to generate 10 points that form an average near 0.5 than there are ways to generate 10 points to form an average near 0.1.

This effect is the main subject of what is known as the "Central Limit Theorem" (CLT). There are two amazing things about the CLT: the distribution of the averages approaches a gaussian no matter what the original distribution $P(x)$ looks like, and the width of that gaussian shrinks as the number of points $N$ in each average grows.

It can be shown that the distribution for the average approaches the gaussian as the number of points in each average ($N$) becomes large, with a variance given by: $$\sigma^2_{Gaussian} = \frac{\sigma^2_{random}}{N}\nonumber$$ where $N$ is the number of random values included in each mean that is histogrammed. In the above, if you select "Averages", then the histogram will be of the averages $X$, and the line will be a gaussian that uses the mean $\bar x$ and RMS $\sqrt{\sigma^2}$ as the 2 parameters. You can see that the difference is pretty small! It is educational to play with the parameters to get a feeling for this: try increasing the number of points to generate, and the number of points for averaging, and notice how the variance changes: you should see it get smaller and smaller as the number of points in the average gets larger and larger. This reflects the falling probability of seeing combinations of $x_i$ that are all far away from $\bar{x}$.
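
Here is a minimal Python sketch of the same numerical experiment (the totals and the seed are arbitrary choices); note how the RMS of the averages follows $\sigma_{random}/\sqrt{N}$:

```python
import numpy as np

rng = np.random.default_rng(6)
N_tot, N = 100_000, 10

x = rng.random(N_tot)               # flat between 0 and 1
X = x.reshape(-1, N).mean(axis=1)   # averages over N points each

print("sigma of x:", x.std())                       # ~ sqrt(1/12) ~ 0.289
print("sigma of averages X:", X.std())              # ~ 0.289/sqrt(10)
print("prediction sigma/sqrt(N):", x.std() / np.sqrt(N))
```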

So we see here that the gaussian distribution is not merely something derived by taking the limits for binomial distributions, but is in fact much more basic and fundamental, and is all tied up with the notion of randomness. This is what makes the gaussian so important: if you are doing an experiment and want to see if what you are measuring comes from some known process, or something new, you want to understand the random fluctuations in the known process and compare to the experimental observations. Those fluctuations will be (under most circumstances) described by gaussian (random) fluctuations. If the comparison is good, then what you are measuring is just a fluctuation. The question is, what does "good" mean?

The more you get to work with and know statistics, the more you will use the gaussian distribution. And you will get a feeling for how the gaussian quantifies fluctuations in measurements. For instance, if you have a quantity $x$ with a mean $\bar x$, and RMS $\sigma_x$, then the gaussian describes the probability of any particular measurement yielding the value $x_i$: $$ P(x_i) = \frac{1}{\sqrt{2\pi\sigma_x^2}}e^{-(x_i-\bar{x})^2/2\sigma_x^2}\nonumber$$ The implicit assumption here is that the true value of $x$ will be $\bar x$, so that when you make a measurement, what you want to know is the probability that the measured value $x_i$ will be within some finite distance from the mean $\bar x$. That means you have to integrate the probability distribution from $\bar x-x_i$ to $\bar x+x_i$.
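
For a gaussian, that integral can be written in terms of the error function: the probability of landing within $\pm k\sigma_x$ of the mean is $\mathrm{erf}(k/\sqrt{2})$. Here is a short Python sketch tabulating the standard intervals:

```python
from math import erf, sqrt

# Probability for a gaussian variable to land within +/- k sigma of the mean
for k in (1, 2, 3, 4, 5):
    inside = erf(k / sqrt(2))
    print(f"+/-{k} sigma: {inside:.7f} inside, {1 - inside:.2e} outside")
```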

In the plot below, you can see the gaussian function plotted with $\bar x=0$ and $\sigma_x=1$, so the horizontal axis goes from $-10\sigma$ to $+10\sigma$. If you click on any 2 points, the program will calculate the integral outside those points, and display it in a label above the graph. For instance, if you click on $x=-4$ and $x=+4$, the program will calculate the probability of getting any value of $x_i$ that is $\lt -4$ or $\gt +4$.


[Interactive plot: gaussian with $\bar x=0$ and $\sigma_x=1$, with a log-scale option and a symmetric-interval control for integrating outside $\pm$ a chosen point.]

$\chi^2$ and Significance

We usually want to make measurements and compare to theory. This chapter introduces techniques of comparison, and the idea of "significance".

To start, imagine you take some data and want to compare to some theory. For instance, you might make a series of experiments ($N$ measurements, each one denoted by the subscript $i$) and for each one measure the force $F_i$ with some uncertainty (aka "error bar") $\delta_i$ and the acceleration $a_i$ on some object of unknown mass. We could then use Newton's law and "fit" the data to extract the mass: $$F_i=ma_i\nonumber$$ Plotting each measurement on a scatter plot with $F_i$ vertical and $a_i$ horizontal should yield a straight line, with slope $m$. But of course, each measurement is subject to fluctuations, so it won't be a perfect straight line. It might look like this (here the uncertainties $\delta_i$ in each measured value $F_i$ are pretty much constant):

[Interactive plot: generated data with a straight-line overlay $F=m\cdot a+b$, with controls for the slope $m$ and y-intercept $b$; displays the $\chi^2$ and the number of degrees of freedom (DOF).]
The task at hand is to determine the "best" straight line that describes a linear relationship between $F$ and $a$.

Your intuition here is useful: if you were to lay a pencil down on top of the distribution of $F$ vs $a$, and eyeball the line that would best fit the data, you would probably try for the line that goes through the most points. And you would be pretty close. But to get the very best fit, what you would want to do is draw a line that minimizes the distances between the y-value of the line ($y=mx_i + b$), and the y-value of the data point ($y_i$). So if you pick a slope $m$ and intercept $b$, then you would perhaps form the following quantity: $$d_i = y_i - y = y_i - (mx_i + b)\nonumber$$ Summing over all points (all $N$ $d_i$'s) would give a measure of how well the theory (the straight line) fits the data: $$\sum_{i=1}^N d_i = \sum_{i=1}^N [y_i - (mx_i+b)]\nonumber$$ The problem with this particular choice of measuring how well the theory and data match (using $d_i$) is that in the sum, some of the time the line will be below the point ($d_i\gt 0$), and some of the time above ($d_i\lt 0$). So you could have the situation where the individual $d_i$'s are large, but by (bad) luck cancel each other out, giving a small value for the sum. We need another measure!

An alternative would be to just take the absolute value of the various $d_i$ values. However, it turns out that the technique that has the most statistical validity uses the square of the $d_i$'s, something intrinsically positive: $$\sum_{i=1}^N d_i^2 = \sum_{i=1}^N \Big(y_i - (mx_i+b)\Big)^2\nonumber$$ This is close to being the best thing to do, however how will you judge whether the value of the sum is "good" enough? And this particular technique, using $d_i^2$, does not take into account the uncertainty $\delta_i$ in the measurement of each $y_i$.

A sensible (and as it turns out, statistically valid) way to solve both problems is to first divide $d_i$ by the uncertainty $\delta_i$, forming the "residual" $r_i$: $$r_i\equiv \frac{d_i}{\delta_i}\nonumber$$ By doing that, we are asking for the vertical distance between the measured $y_i$ point at a given $x_i$, and the $y(x_i)$ value calculated from that $x_i$ and the line parameters $m$ and $b$, in units of the uncertainty in $y_i$: $r_i=d_i/\delta_i$ tells you "how many uncertainties" separate $y_i$ from $y(x_i)$. Then we form the sum over all the $r_i^2$ (squared), and call this value $\chi^2$: $$\chi^2 = \sum_{i=1}^N r_i^2 = \sum_{i=1}^N \Big(\frac{d_i}{\delta_i}\Big)^2 = \sum_{i=1}^N \Big(\frac{y_i - (mx_i+b)}{\delta_i}\Big)^2\label{echi2}$$ The procedure would then be to vary $m$ and $b$ independently, forming a new $\chi^2$ each time, and choose the final values for $m$ and $b$ that give the smallest possible $\chi^2$. This is what the "Solver" does for you in Excel. Notice that in the above plot of "Force vs Acceleration", the $\chi^2$ is given. The assumption made in the code is that all of the $\delta_i$'s are the same, and equal to around $3$, just as a guess. But in reality, you have to come up with a reasonable value for the $\delta_i$'s, and this is one of the most important parts of doing experiments: knowing your uncertainties. (Note: scientists often say "know your errors", a cute expression that actually means the same thing as "know your uncertainties".)
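
Here is a minimal Python sketch of this procedure, using made-up $F$ vs $a$ data and a constant uncertainty $\delta=3$ (both assumptions, for illustration); it scans the slope by brute force, keeping the intercept fixed at 0 for simplicity:

```python
import numpy as np

def chi2(m, b, x, y, delta):
    """Sum of squared residuals r_i = d_i/delta_i, equation (chi2)."""
    r = (y - (m * x + b)) / delta
    return np.sum(r**2)

# Made-up F-vs-a data, with the same uncertainty of 3 on every F_i
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
F = np.array([2.1, 4.3, 5.8, 8.4, 9.7])
delta = 3.0

# Brute-force scan over slope values (intercept b fixed at 0)
slopes = np.linspace(0.5, 3.5, 301)
chi2s = [chi2(m, 0.0, a, F, delta) for m in slopes]
best = slopes[np.argmin(chi2s)]
print("best slope:", best, "  chi2 at minimum:", min(chi2s))
```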

In the plot above, you can type in a non-zero value for the slope and y-intercept, the line will be displayed, and the code will tell you the $\chi^2$ and number of degrees of freedom. You can change the slope and y-intercept and see where the $\chi^2$ is minimized for the best fit. If you click on one of the blue points, it will tell you the distance between the data point $F_i$ and the value of the line $F(x_i)$.

Linear Fit, Analytically

For a linear fit, you can actually find the best slope and intercept analytically. To keep it simple and illustrate the concept, let's choose a constant value for all of the $\delta_i$'s (call it $\delta$). There are 2 parameters here, $m$ and $b$, so to calculate the values of $m$ and $b$ that minimize the $\chi^2$, we set the 1st derivatives $\partial\chi^2/\partial m$ and $\partial\chi^2/\partial b$ to 0, and solve for the best values of $m$ and $b$. The calculation goes like this:

$$\chi^2 = \sum_{i=1}^N \Big(\frac{y_i - (mx_i+b)}{\delta_i}\Big)^2 = \frac{1}{\delta^2}\sum (y_i - mx_i-b)^2\nonumber$$
Now we take the 2 derivatives and set them equal to 0: $$\frac{\partial\chi^2}{\partial m}=-\frac{2}{\delta^2}\sum x_i(y_i - mx_i-b)=0\nonumber$$ $$\frac{\partial\chi^2}{\partial b}=-\frac{2}{\delta^2}\sum (y_i - mx_i-b)=0\nonumber$$ We can drop the $-2/\delta^2$, and get 2 equations with 2 unknowns after some rearranging: $$m\sum x_i^2 + b\sum x_i= \sum x_iy_i\nonumber$$ $$m\sum x_i + b\sum = \sum y_i\nonumber$$ The $\sum$ term in the 2nd equation takes some getting used to, but if you were to write it as $\sum_{i=1}^N 1$, you can see that it is equal to $N$. So our two equations are: $$m\sum x_i^2 + b\sum x_i= \sum x_iy_i\nonumber$$ $$m\sum x_i + bN = \sum y_i\nonumber$$ With some linear algebra, you can then solve for $m$ and $b$ to get: $$m = \frac{N\sum x_iy_i - \sum x_i\sum y_i}{N\sum x_i^2 - (\sum x_i)^2}\nonumber$$ $$ b = \frac{\sum x_i^2\sum y_i - \sum x_iy_i\sum x_i} {N\sum x_i^2 - (\sum x_i)^2}\nonumber$$ These are pretty busy formulae, but they are correct. If the $\delta_i$ are not all the same, you can still carry out the derivatives just like above; it just gets a bit more complicated.
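
Here is a short Python sketch of these closed-form formulae, applied to the same made-up data as in the sketch above:

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares slope and intercept (constant delta_i)."""
    N = len(x)
    Sx, Sy = x.sum(), y.sum()
    Sxx, Sxy = (x * x).sum(), (x * y).sum()
    denom = N * Sxx - Sx**2
    m = (N * Sxy - Sx * Sy) / denom
    b = (Sxx * Sy - Sxy * Sx) / denom
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.8, 8.4, 9.7])
m, b = fit_line(x, y)
print("slope:", m, " intercept:", b)   # close to slope 2, intercept 0
```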

Once you have values for the parameters that minimize the $\chi^2$, you still need to ask whether that minimum $\chi^2$ is describing a "good" fit, or not. Note that you have taken the sum of $(d_i/\delta_i)^2$. If the fit is "good", and your uncertainties are accurate, then you would expect that the fit curve will go through most of the points within $\pm\delta_i$, more or less, so that each of the $r_i$ will be "close" to 1, on average. Of course sometimes you will have points where the fit is more than 1 $\delta$ away, and sometimes you will have points that are less than 1 $\delta$ away, but on average, a "good fit" should give you a $\chi^2$ where each of the $r_i$ is of order 1. So we can expect a good fit to have a sum $\chi^2=\sum_{i=1}^N r_i^2$ that is close to $N$.

Actually, you have to take care here, because if you are fitting for 2 parameters (as in a straight line, $m$ and $b$), then you need at least 2 points. If you have only 2 points, then you have no freedom to move the slope $m$ and y-intercept $b$ around (2 points determine a straight line), and so the $\chi^2$ for 2 points ($N=2$) will be identically 0. So the number of "degrees of freedom" will be, for a straight line fit, 2 less than the total number of points, and for a good fit you would expect that the $\chi^2$ will add up to something very close to $N_{dof}=N-2$. And so you can look at the quantity $\chi^2/N_{dof}$, and that should be close to 1. If it is, then it's a good fit. If it's not, then it's not a good fit! And if that ratio is much larger than 1, it means that either the points do not follow the fit function, or that your uncertainties $\delta_i$ are too small. If on the other hand the ratio is close to 0, it could mean that your uncertainties are too large (or that your points are too good, which should make any scientist nervous!).

It's straightforward to come up with a probability distribution for the $\chi^2$ per $N_{dof}$. However, it's a complicated function, best left to the more in-depth textbooks. Suffice it to say that one can take a value for a $\chi^2$ and $N_{dof}$, and form a probability that the ratio is consistent with random fluctuations on a "null" result. That is, if you measure $F_i$ and $a_i$ and have the correct uncertainties, you won't always get $\chi^2/N_{dof} =1$; it will fluctuate around 1, with a smaller probability the further away you get.

The CHIDIST function in Excel will calculate the "p-value". What that tells you is that if you repeated the experiment, there is a probability p that the $\chi^2/N_{dof}$ would be even larger than the value you found, just due to the uncertainties you quoted in the data. For example, if you find a p-value of $0.05$, that says that if you do the experiment 20 times, you should expect 1 of them to give a fluctuation that sends the $\chi^2/N_{dof}$ to this value or larger. If the p-value is $0.01$, then 1 out of 100 times you should see this kind of fluctuation.
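
Outside of Excel, the same number is the upper-tail probability of the $\chi^2$ distribution. For example, with scipy (assuming it is installed; the $\chi^2$ and $N_{dof}$ values are made up):

```python
from scipy import stats

chi2_value, N_dof = 15.0, 10
p = stats.chi2.sf(chi2_value, N_dof)   # upper tail, same quantity as Excel's CHIDIST
print(f"p-value: {p:.3f}")             # ~0.13: consistent with a fluctuation
```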

The p-value is important in deciding quantitatively whether or not your data agrees or disagrees with a theory. For example, suppose you do an experiment and get a p-value of $0.02$. You could conclude that it's not likely to be a fluctuation (aka "null result"), and could in fact be an indication of something new or a significant disagreement with your theory. In fact it is up to you to decide how small the p-value has to be before it indicates a significant disagreement. The most common convention is to use the 5-95\% confidence limit, or $0.05\lt p\lt 0.95$, to distinguish a "good" fit from a "bad" fit, and this is the range we use in the undergraduate physics labs. In the world of experimental particle physics, where you are looking for new particles, p$\lt 3\times 10^{-7}$ is used, corresponding to a disagreement of 5 error bars ("5$\sigma$"). Why so conservative? Because in science, if you make a claim about nature, you should be pretty sure you are not incorrect! So if your result is 5$\sigma$ away from being a fluctuation, and your uncertainties are well understood, then you are on thick ice.


Uncertainties and Error Propagation


Sometimes we make measurements on more than 1 quantity at a time, and combine them to form something that is a function of the measured quantities. For instance, we measure the width $W$ and length $L$ of a parking lot, with corresponding uncertainties $\delta W$ and $\delta L$. Then we want to combine them to form the perimeter $P=2(W+L)$. Knowing the uncertainties in $W$ ($\delta W$) and $L$ ($\delta L$), we want to know the uncertainty in $P$. (To keep the algebra below simple, we will work with the half-perimeter $P=W+L$; the overall factor of 2 changes nothing essential.)

One sensible thing to do is to say that the extreme values of $W$ are given by $W_+=W+\delta W$, and $W_-=W-\delta W$. And similarly for $L_+$ and $L_-$. Then the extreme values of $P$ would be given by: $$P_+=P+\delta P = W_++L_+ = W+\delta W + L + \delta L = W+L+(\delta W+\delta L)\nonumber$$ and $$P_-=P-\delta P = W_-+L_- = W-\delta W + L - \delta L = W+L - (\delta W+\delta L)\nonumber$$ This suggests we use $\delta P=\delta W + \delta L$.

However, there is more to the story, going back to what $\delta W$ and $\delta L$ mean. These quantities are usually given as the "$1\sigma$" uncertainties of the quantities $W$ and $L$, meaning that the measured values of $W$ and $L$ will fluctuate within the uncertainties such that $68\%$ of the time, $W$ and $L$ will be between $W\pm\delta W$ and $L\pm\delta L$. (See the above discussion on gaussian intervals.) To be consistent, we would want to have a $\delta P$ that also describes the $1\sigma$ in $P$, however $\delta P=\delta W + \delta L$ clearly does not accomplish that.

The following shows this explicitly. We generate random numbers that are normally distributed around $2$ for the length $L$, and around $4$ for the width $W$, and we use $\sigma=1$ for both distributions. We calculate $P=L+W$ and plot this as well.

[Interactive histograms of $L$, $W$, and $P=L+W$, with a table showing the mean and $\sigma$ for each.]

As you can clearly see, $\sigma_P$ does not equal the sum $\sigma_L+\sigma_W$. But this should be expected: it is not very probable that when $L$ fluctuates by the maximum, $W$ will do the same, because $W$ and $L$ are uncorrelated. That is, the fluctuation in the measurement of $L$ does not have any effect on the fluctuation of the measurement of $W$. We need a better way to form uncertainties (so-called "errors").
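
A minimal Python sketch reproduces this: the observed $\sigma_P$ comes out close to $\sqrt{\sigma_L^2+\sigma_W^2}$ (the uncertainties add "in quadrature"), not $\sigma_L+\sigma_W$:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
L = rng.normal(2.0, 1.0, n)   # length: mean 2, sigma_L = 1
W = rng.normal(4.0, 1.0, n)   # width:  mean 4, sigma_W = 1
P = L + W

print("sigma_P measured:", P.std())                       # ~1.41, not 2
print("quadrature sqrt(sL^2 + sW^2):", np.sqrt(1.0**2 + 1.0**2))
```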



Copyright Drew Baden, Feb 13, 2017