A histogram (also known as a distribution chart) shows the underlying frequency distribution for a collection of continuous values. The histogram is really just a special case of the column chart, with each column representing a value range (called a bin) and the height of each column representing the number of values (frequency) from the collection that fall within that bin.
The purpose of the histogram is to show how the values in a collection are distributed. If the histogram is symmetrical around a central high-frequency bin, the collection may be close to normally distributed. More often the histogram has a peak at one end (e.g. lots of low values and a few high values, as in Figure 1).
Figure 1: A histogram showing the distribution of CO2 emissions per person by country (data source: World Bank)
How many bins in a histogram
There is no hard and fast rule about the number of bins, other than to use the minimum number that adequately represents the distribution. I usually start with the Truii default of 3, then try a larger number (say 10) to see if there is a cluster of very close values which should be pointed out. I then reduce the number of bins to the minimum that still adequately represents the distribution. I usually end up with about 6-8 bins.
The basic rule for constructing the bins in a histogram is that the ranges must be continuous. That is, the end of one range is the start of the next range, so that the entire continuous range of the data is represented. Strictly speaking the ranges do not have to be of identical size. However, if they are not identical in size, the width of each column should reflect the different range of each bin. In Truii we use identical bin ranges, which are determined by dividing the range of your input data by your desired number of bins. This makes interpretation a little easier because the reader only needs to assess the height of the column to determine frequency (not the area).
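The equal-width binning described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Truii's actual implementation: the function name `equal_width_histogram` and the sample values are made up for the example.

```python
def equal_width_histogram(values, n_bins):
    """Count how many values fall in each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if n_bins < 1 or hi == lo:
        raise ValueError("need at least one bin and a non-zero data range")
    # Divide the data range by the number of bins to get the bin width.
    width = (hi - lo) / n_bins
    # Edges are shared: the end of one bin is the start of the next.
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin, not a bin of its own.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return edges, counts


# Made-up sample values, for illustration only.
sample = [0.5, 1.0, 1.5, 2.0, 2.5, 4.0, 7.0, 10.5]
edges, counts = equal_width_histogram(sample, 4)
print(edges)   # [0.5, 3.0, 5.5, 8.0, 10.5]
print(counts)  # [5, 1, 1, 1]
```

Because every bin has the same width, the counts alone describe the shape of the distribution, which is why only the column height needs to be read.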
Why would you log the bins?
In statistical analysis, many tests are based on the assumption that the underlying data is normally distributed. It is common to transform data to make the collection more closely approximate a normal distribution before applying statistical testing. Log transformation (logging each element in the data collection) is the most common transformation.
When you apply the ‘log’ mode in the Truii distribution plot, the underlying data is logged and the bins are based on the log transformed data. The resulting histogram gives you a sense of how the distribution of the log transformed data would look. Figure 1 shows a skewed distribution, with lots of countries with low CO2 emissions. In Figure 2 we have applied the ‘log’ mode: Truii log transforms the underlying data and then creates bins of equal size in the log domain.
Figure 2: A histogram with log bin sizes showing the distribution of CO2 emissions per person by country
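As a rough sketch of what binning in the log domain involves, the Python below logs each value, builds equal-width bins on the logged values, and converts the bin edges back to the original units for labelling. It assumes base-10 logs and strictly positive data. The source does not say which log base Truii uses, and the function name and sample values are hypothetical.

```python
import math


def log_bin_histogram(values, n_bins):
    """Equal-width bins in the log10 domain (assumed base; values must be > 0)."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")
    logged = [math.log10(v) for v in values]
    lo, hi = min(logged), max(logged)
    if n_bins < 1 or hi == lo:
        raise ValueError("need at least one bin and a non-zero data range")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in logged:
        # The largest logged value is placed in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Convert log-domain edges back to original units for axis labels.
    edges = [10 ** (lo + i * width) for i in range(n_bins + 1)]
    return edges, counts


# Made-up, strongly skewed sample values.
sample = [0.1, 0.2, 0.5, 2.0, 3.0, 5.0, 20.0, 100.0]
edges, counts = log_bin_histogram(sample, 3)
print(counts)  # [3, 3, 2]
```

With ordinary equal-width bins, almost all of these values would land in the first bin. In the log domain, the bins cover roughly 0.1 to 1, 1 to 10 and 10 to 100, so the low values are spread out.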
Comparing distributions between collections
You can place several collections on the same histogram, just like adding categories to a column chart or bar chart. In Figure 3 we are comparing the distribution of CO2 emissions in 1960 with those from 2010. The underlying data is the CO2 emitted per person for each country. In 1960 you can see that all countries have pretty low emissions, with no country emitting more than 20 tons per person. By 2010 the upper level of emissions has increased to 40 tons per person and the overall distribution has changed quite a bit.
Note that the Y axis in Figure 3 is ‘percentage’ rather than count. This is because the 1960 data set is slightly incomplete, so the total number of countries reported differs between the two periods and the distributions cannot be directly compared based on count alone.
Figure 3: Comparing the distribution of different collections
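The switch from count to percentage is simple arithmetic: divide each bin count by the total number of values in that collection. The sketch below uses invented bin counts (not the World Bank figures) and a hypothetical helper, `to_percentages`, to show why this makes collections of different sizes comparable. It assumes both collections are binned with the same bin edges.

```python
def to_percentages(counts):
    """Convert per-bin counts to the percentage of the collection in each bin."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


# Hypothetical bin counts for two collections of different sizes.
counts_1960 = [60, 15, 5]    # 80 countries reported
counts_2010 = [90, 60, 50]   # 200 countries reported

pct_1960 = to_percentages(counts_1960)
pct_2010 = to_percentages(counts_2010)
print(pct_1960)  # [75.0, 18.75, 6.25]
print(pct_2010)  # [45.0, 30.0, 25.0]
```

By raw count the first bin looks larger in the second collection (90 versus 60), but as a share of each collection it is smaller (45% versus 75%).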
An alternative way to compare the distribution of collections is to use a box plot (Figure 4). The box plot summarises the distribution using the 25th and 75th percentiles (the ends of the box), with the median shown as the line in the middle of the box. The whiskers conventionally extend up to 1.5 times the interquartile range (the height of the box) beyond the box, and the dots show potential outliers (values beyond the whiskers). More on box plots in this post…
Figure 4: Box plot comparing the distribution of different collections
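For readers who want to see the numbers behind a box plot, here is a minimal Python sketch of the summary statistics. It assumes the common Tukey convention, where whiskers reach the most extreme values within 1.5 times the interquartile range of the box. That may differ from how Truii draws its whiskers. The function name and sample values are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, whisker ends and potential outliers (Tukey convention assumed)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that are still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Made-up sample values with one very high value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
summary = box_plot_summary(sample)
print(summary["median"], summary["outliers"])  # 5.5 [40]
```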
Making a histogram or distribution chart in Truii