A box plot (also known as a box and whisker plot) is used to summarise the distribution of values in a collection. Box plots are particularly useful for visually comparing different collections.
The main ‘box’ part of the chart represents the range of the middle half of the values in the collection. The ‘whiskers’ – top and bottom – represent the data range. At a glance you can see whether two collections cover a similar range (do the whiskers line up?), and whether the middle half of the data lines up between collections.
What are the different parts of a box plot?
The most common approach for splitting up a collection for presentation in a box plot is to use the quartiles and range. The quartiles are the 25th, 50th (median) and 75th percentiles. These quartiles form the ‘box’ part of the box plot with the 75th and 25th percentiles forming the upper and lower bounds of the box and the 50th percentile (median) usually represented as a line through the box.
The 75th percentile (also called the upper quartile) is the value that is greater than 75% of the values in your collection. The 50th percentile (or median) is the middle value when you create a sorted list of all values and the 25th percentile (or lower quartile) is the value that is greater than 25% of the values in your collection. The box part of the box plot represents this spread of the middle half of your data (from 25th to 75th percentile).
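As a rough illustration of how the three quartiles that define the box can be computed, here is a minimal Python sketch using the standard library. The function name box_quartiles and the sample values are hypothetical, and the "inclusive" interpolation rule is an assumption; other percentile rules give slightly different results on small collections.

```python
from statistics import quantiles


def box_quartiles(values):
    """Return (lower quartile, median, upper quartile) for a collection."""
    # n=4 splits the sorted data into four equal parts, giving three cut points.
    # method="inclusive" is an assumption: it interpolates between the sorted
    # values, treating the minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, median, q3


# Hypothetical values, for illustration only.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18]
q1, median, q3 = box_quartiles(sample)
print(f"box spans {q1} to {q3}, median line at {median}")
```

The lower and upper values returned here would be the bottom and top edges of the box, and the middle value the line drawn through it.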
The whiskers that stretch out from the box represent the upper and lower extent of the data. The whiskers can be set to the extreme range (the minimum and maximum values), or more often to the most extreme values that lie within 1.5 times the interquartile range (the height of the box) of the box edges – with values beyond that range represented as points beyond the whisker.
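One widely used convention for placing whiskers (often attributed to Tukey) ends each whisker at the most extreme value lying within 1.5 times the interquartile range of the box edge, and plots anything further out as an individual point. The sketch below assumes that convention. The function name whiskers_and_outliers, the parameter k and the sample values are hypothetical, and a given charting tool may use a different rule.

```python
from statistics import quantiles


def whiskers_and_outliers(values, k=1.5):
    """Return (lower whisker, upper whisker, outliers) using the k * IQR rule."""
    data = sorted(values)
    # Quartiles define the box; "inclusive" interpolation is an assumption.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the height of the box

    # Fences mark the furthest a whisker is allowed to reach.
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr

    # Whiskers stop at the most extreme data values inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return inside[0], inside[-1], outliers


# Hypothetical values, for illustration only.
low, high, points = whiskers_and_outliers([2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 40])
print(f"whiskers at {low} and {high}, points beyond whiskers: {points}")
```

Setting the whiskers to the extreme range instead would simply use the minimum and maximum of the collection, with no separate points.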
Mean and standard deviation can also be included in the box plot. In Truii we represent the mean as a dashed horizontal line and the standard deviation as a dashed triangle around the mean.
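For readers who want to reproduce these two extra markers by hand, the following minimal sketch computes the mean and a band of one standard deviation either side of it. It assumes the sample standard deviation and a band of exactly one standard deviation; how Truii calculates and draws the marker is not specified here. The function name mean_and_spread and the sample values are hypothetical.

```python
from statistics import mean, stdev


def mean_and_spread(values):
    """Return (mean, standard deviation, (mean - sd, mean + sd))."""
    m = mean(values)
    # Assumption: sample standard deviation (divides by n - 1).
    sd = stdev(values)
    return m, sd, (m - sd, m + sd)


# Hypothetical values, for illustration only.
m, sd, band = mean_and_spread([2, 4, 4, 4, 5, 5, 7, 9])
print(f"mean {m}, standard deviation {sd:.2f}, band {band[0]:.2f} to {band[1]:.2f}")
```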
Figure 1: The average cost of research publications in Australian Universities (data from research income and publications data for 2014)
How to interpret a box plot
Figure 1 compares the average cost of publication (research income divided by number of publications) for Australian universities. The Group of Eight (blue box) are the eight most prestigious universities, and the grey box represents the publication cost for 31 other universities (see this post for the full analysis).
What can we see from the box plot (Figure 1)?
- The group of eight spend more research income per publication than the non-group of eight universities
- The spread of values (range) for the group of eight is much less than for other universities (the range between the whiskers)
- The median cost per publication for the group of eight is $78.8K, compared to $31.2K for the other universities
- The median and mean cost of publication within each group are reasonably similar, suggesting a roughly symmetric, close to normal distribution
Another way to represent the data in Figure 1 is via a histogram or distribution plot (Figure 2). A distribution plot is less effective for comparison when the collections you are comparing do not overlap much.
Figure 2: Histogram or distribution plot of the average cost of research publications in Australian Universities (data from research income and publications data for 2014)
Making a box plot in Truii