Histograms

A histogram is a very useful tool for displaying data that arrives as a set of numbers, like the ages of all the students in your class, or the heights of all the people in your town.

Let's call this data $x$ and say that $x_i$ is the $i^{th}$ data point (first, second, 16th, etc). You can take the average (call this $\bar x$) using: $$\bar x = \frac{\sum_{i=1}^N x_i}{N}$$ The sum is over all of the $x_i$ values, and you divide by the number of values $N$ to get the average.

But $\bar x$ doesn't tell you everything you need to know about how the $x_i$ are distributed. For instance, the following two sets of data have the same average: $$x_i = 0, 10, 20$$ $$y_i = 9, 10, 11$$ Clearly $x_i$ has a "wider" distribution than does $y_i$. To characterize this we need a measure that quantifies how far the members of the set are from the average, and a sensible measure is the "standard deviation", $\sigma$, defined as: $$\sigma = \sqrt{\frac{\sum_{i=1}^N (\bar x - x_i)^2}{N}}$$ For the two sets above this gives $\sigma \approx 8.2$ for $x_i$ and $\sigma \approx 0.82$ for $y_i$.

However, even knowing $\bar x$ and $\sigma$ is sometimes not enough, and you need to see how the values $x_i$ are actually distributed. But plotting $x_i$ is not like the usual plot that shows the relationship between 2 variables, like $x$ and $y$, because here there is only one variable. What we want instead is the following, and this is what a histogram does:

Make a list of ranges of $x_i$. For example, say $x_i$ represents ages of a group of people in a city. Then we construct "bins" that delineate what ranges of ages we are interested in. For instance, we might ask how many people are between 0 and 10 years old, 10 and 20, 20 and 30, and so on. So our bins would be specified by a starting value (e.g. 0), a bin width (10 years), and the number of bins (e.g. 10). That would give us a list of bins $b_i$: bin 0 low edge is 0, bin 1 low edge is 10, and so on up to bin 9 which has a low edge of 90. The bin counts $b_i$ would be stored in an array with each element set to 0 initially.
Knowing the binning details, we then go through and count how many people have an age in the first bin, which is 0 to 10. Then how many in the 2nd bin, 10 to 20, and so on up to the last bin, which is 90 to 100. We would also need an "overflow" bin to count how many people are above the last bin and, in general, an "underflow" bin for values below the first bin, although negative ages won't occur.
Be careful here: we don't want to double count people who are exactly on a bin edge. For example, if there's someone who is 30 years old, do you count them in the 20-30 bin or the 30-40 bin? So we define the condition for incrementing the count in a bin as the value being greater than or equal to the low edge and less than the upper edge, which in math symbols is the half-open interval $[a, b)$. A 30 year old therefore goes in the 30-40 bin. So we have a set of left edges $L_i$ and right edges $R_i$, and the rule is that a data point $x_j$ increments bin $i$ if $L_i \le x_j \lt R_i$. If that is satisfied, then $b_i = b_i + 1$.

Once we have the binning, and the rules for counting, then we go through all of the data, increment the appropriate bin, and then plot $b_i$ vs, usually, the low edge of bin $i$. This is a histogram.
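The counting procedure above can be sketched in a few lines of Python. This is only an illustration and not the code behind this page: the function name `fill_histogram`, its arguments, and the sample ages are all made up for the example. It applies the $[L_i, R_i)$ rule and keeps separate underflow and overflow counters.

```python
def fill_histogram(data, low, width, nbins):
    """Hypothetical helper: count how many values fall in each bin [L_i, R_i)."""
    # Left and right edges; right[i] is built from the same expression as
    # left[i + 1], so neighbouring bins share an edge with no gap.
    left = [low + i * width for i in range(nbins)]
    right = [low + (i + 1) * width for i in range(nbins)]

    counts = [0] * nbins   # the b_i, all starting at zero
    underflow = 0          # values below the first low edge
    overflow = 0           # values at or above the last upper edge

    for value in data:
        if value < left[0]:
            underflow += 1
        elif value >= right[-1]:
            overflow += 1
        else:
            for i in range(nbins):
                # Low edge included, upper edge excluded
                if left[i] <= value < right[i]:
                    counts[i] += 1
                    break
    return counts, underflow, overflow


# Made-up ages, including two people who are exactly 30
ages = [3, 10, 29, 30, 30, 45, 99, 100, 104]
counts, underflow, overflow = fill_histogram(ages, low=0, width=10, nbins=10)
print(counts)     # [1, 1, 1, 2, 1, 0, 0, 0, 0, 1]
print(overflow)   # 2
```

Both 30 year olds land in bin 3 (30 to 40) rather than bin 2, and the ages 100 and 104 go to the overflow because the last bin covers 90 up to but not including 100.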

If you then take each bin and divide the number of counts by the total number of data points, then each bin is a number between 0 and 1 and the bins sum to exactly 1. For instance, say $N$ is the number of data points (with none of them in the underflow or overflow), and $n$ is the number of bins, then $$N = \sum_{i=0}^{n-1} b_i$$ So if you define $p_i = b_i/N$, then $\sum_{i=0}^{n-1} p_i = 1$. Then you can interpret $p_i$ as the probability that a data point $x_j$ satisfies $L_i \le x_j \lt R_i$. Then when you plot $p_i$ vs, say, $L_i$, you are plotting the probability distribution of $x$. Voila!
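As a small illustration of the normalization, here is a Python sketch that turns bin counts into fractions. The name `normalize` and the sample counts are hypothetical, and the sketch assumes $N$ is the sum of the in-range bin counts, so any underflow or overflow entries are left out.

```python
def normalize(counts):
    """Hypothetical helper: divide each bin count by the total, p_i = b_i / N."""
    total = sum(counts)  # N, assuming every data point landed in some bin
    if total == 0:
        raise ValueError("cannot normalize an empty histogram")
    return [c / total for c in counts]


counts = [1, 1, 1, 2, 1, 0, 0, 0, 0, 1]   # made-up bin contents, N = 7
probs = normalize(counts)
print(probs[3])     # 2/7, the fraction of data points in bin 3
print(sum(probs))   # 1, up to floating point rounding
```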

To make it easy to produce such a histogram, you can enter the data $x_i$ in the window below. Data should be a list of numbers separated by spaces or newlines:

Next, enter the number of bins $n$, the low edge of the first bin $L_0$, and the upper edge of the last bin $R_n$. Note that the bin width $w$ is derived from the relation $n\cdot w = R_n - L_0$. Then each bin has edges $L_i = L_0 + i\cdot w$ and $R_i = L_i + w$.
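To make the edge arithmetic concrete, the following Python sketch derives the bin width and edges from the three inputs. It is an illustration only: `bin_edges` and its argument names are invented here, and it takes each low edge to be $L_0 + i\cdot w$, with bins numbered from 0.

```python
def bin_edges(n, first_low, last_high):
    """Hypothetical helper: bin width and edges from n, L_0 and the last upper edge."""
    if n <= 0:
        raise ValueError("need at least one bin")
    width = (last_high - first_low) / n                       # from n * w = R_n - L_0
    left = [first_low + i * width for i in range(n)]          # L_i = L_0 + i * w
    right = [first_low + (i + 1) * width for i in range(n)]   # R_i = L_i + w
    return width, left, right


width, left, right = bin_edges(10, 0, 100)
print(width)      # 10.0
print(left[:3])   # [0.0, 10.0, 20.0]
print(right[-1])  # 100.0
```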

You can also specify the histogram title and the title for the horizontal axis.

You can also specify the number of tick marks along the horizontal on the plot (an integer) here:

Histogram bin contents are of course subject to statistical fluctuations, which follow a Poisson distribution. That means that the uncertainty in the number of counts in each bin is given by the square root of that bin's count: $\delta b_i = \sqrt{b_i}$. To have these uncertainties (also known as "error bars") drawn, click here:
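As a rough illustration of the square-root rule, this Python sketch computes the uncertainty for a few bins. The name `poisson_errors` and the counts are invented for the example, and how the error bars are actually drawn on this page is not shown.

```python
import math


def poisson_errors(counts):
    """Hypothetical helper: uncertainty of each bin as the square root of its count."""
    return [math.sqrt(c) for c in counts]


counts = [4, 9, 0, 25]            # made-up bin contents
errors = poisson_errors(counts)
print(errors)                     # [2.0, 3.0, 0.0, 5.0]

# An error bar for bin i would span counts[i] - errors[i] to counts[i] + errors[i]
bars = [(c - e, c + e) for c, e in zip(counts, errors)]
print(bars[0])                    # (2.0, 6.0)
```

Note that the relative uncertainty shrinks as a bin fills up: 4 counts carry a 50% uncertainty, while 25 counts carry 20%.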

To see the histogram, hit this button:

Email Drew Baden for further info. (26-Apr-2022)