MathCS.org - Statistics

back | next

4.6 Box Plot and Skewed Distributions

By now we have a multitude of numerical descriptive statistics that describe some feature of a data set of values: mean, median, range, variance, quartiles, percentiles, ranks, etc. There are, in fact, so many different descriptors that it is going to be convenient to collect many of them in a suitable graph called the Box Plot.

The Box Plot, sometimes also called "box and whiskers plot", combines the minimum and maximum values (and therefore the range) with the quartiles into one useful graph. It consists of a horizontal line, drawn according to scale, from the minimum to the maximum data value, and a box drawn from the lower to upper quartile with a vertical line marking the median.

It might sound pretty convoluted, so to see how it works it is best to consider an example.

Example: In an earlier example we considered the following cotinine levels of 40 smokers. Draw a box plot for that data.

0 87 173 253 1 103 173 265 1 112
198 266 3 121 208 277 17 123 210 284
32 130 222 289 35 131 227 290 44 149
234 313 48 164 245 477 86 167 250 491

We already computed the lower and upper quartiles to be Q1 = 86.5 and Q3 = 251.5, respectively. It is easy to see that the minimum is 0 and the maximum is 491. A quick computation shows that the median is 170. The corresponding box plot looks therefore as follows:

You can see that the horizontal line (sometimes called the "whiskers") goes from 0 to 491 (from min to max), while the box extends from 86.5 (= Q1) to 251.5 (= Q3) with a middle vertical line at 170 (the median).

Drawing a Box Plot with Excel

Unfortunately Excel does not have a nice build-in facility to quickly create a box plot. You could of course use the formulas "max(RANGE)", "min(RANGE)" together with "PERCENTILE(RANGE, 0.25)", "PERCENTILE(RANGE, 0.75)" and "median(RANGE)" and then draw a box plot by hand. However, I found an easy-to-use Excel template that is not quite as convenient as the Data Analysis tools we've been using, but should still be pretty simple and useful.

To use the Excel Box Plot template, click on the icon below to download the file:

boxplot.xls

When you open the file, Excel will show you a worksheet with a finished box plot already, and a column on the right in green where you can enter or paste your data. Simply delete the data currently in that column and replace it with your new data to create a new plot. The box plot will update automatically.

Example: Create a box plot for the Life Expectancy by country that we considered before.

We first need to open the Life Expectancy data file - click on the icon below for the data file.

life.xls

When the spreadsheet opens up, mark all numeric data in column B (the Life Expectancy column) but not including the codlumn header and copy them to the clipboard (for example, press CTRL-C). Then open the boxplot.xls spreadsheet and position your cursor to the first data value in column M. Paste the copied data values (for example, press CTRL-V) into that column and the box plot will automatically update itself so that you should see the following picture:

For some data sets you will see some points beyond the max/min value of the whisker. Those points are outliers; they are exceptionally small or large as compared to the rest of the data. Technically these outliers are the max/min, but they would distort the box plot too much. The exact definition of an outlier will be provided below.

Note that the difference between the upper and lower quartile is called the Inter Quartile Range, or IQR. It is used to define outliers (see below).

Example: Find the IQR for the Life Expectancy data above.

We know from the above box plot that the "lower hinge" is 63.5 and the upper hinge is 76. By definition, that means that the quartiles are Q1 = 63.5 and Q3 = 76. That makes the Inter Quartile Range IQR = 76 - 63.5 = 12.5

Box Plot and Distributions

In addition to giving you a quick view of the range, the quartiles, and the median, the picture also indicates that if we were to draw a histogram for this data it would look slightly skewed to the left because the box in the box plot is a little towards the right side (yes I know, this looks like a typo but it isn't: a distribution is skewed to the left if the box is on the right side, and skewed to the right if the box is on the left side. In fact, even though the box plot does not directly contain the mean (it only shows the median) it is possible to estimate whether the mean is less than or greater than the median by looking whether the box plot is skewed to the left or to the right.

First, let's look again at histograms and define what we mean by "skewed" histograms (and distributions):

A histogram (distribution) is called

Bell-Shaped or normal

if it looks similar to a "bell curve".

Most data points fall in the middle,
there are few exceptionally small
and few exceptionally large values.

 


Normal Distribution
 

 

A histogram (distribution) is called

skewed to the right

if it looks like a bell curve with a
longer tail on the right and the
mount pushed somewhat to the
left.

Most data points fall to the left of the
middle, there are more small than large values, but there are a few extreme values on the right.

 


Skewed to the right Distribution
 

 

A histogram (distribution) is called

skewed to the left

if it looks like a bell curve with a
longer tail on the left and the
mount pushed somewhat to the
right.

Most data points fall to the right of the
middle, there are more large than small values but there are a few extreme vaues on the left.

 


Skewed to the left Distribution

You can tell the shape of the histogram (distribution) - in many cases at least - by just looking the box plot, and you can also estimate whether the mean is less than or greater than the median. Recall that the mean is impacted by especially large or small values, even if there are just a few of them, while the median is more stable with respect to exceptional values. Therefore:

As a quick way to remember skewedness:

Example: Here is some (fictitious) data in an Excel sheet for three variables named varA, varB, and varC.

distribution-data.xls

Create a box plot for the data from each variable and decide, based on that box plot, whether the distribution of values is normal, skewed to the left or skewed to the right, and estimate the value of the mean in relation to the median. Then compute the values and compare them with your conjector.

One of the data columns results in the following box plot and interpretation based on it:

Distribution is shifted to the left, the mean should be less than median (the exact numbers are: mean = 0.3319, median = 0.4124).

The other data column has the following box plot and interpretation based on it:

Distribution is shifted to the right, the mean should be greater than the median (the exact numbers are: mean = -0.3192, median = -0.4061)

The final data column has the following box plot and interpretation based on it:

Distribution is (approximately) normal, mean and median should be similar (the exact numbers are: mean = 0.013 median = 0.041)

Unfortunately I forgot to write down which of these cases correspond to varA, varB, and varC - can you figure it out -:)

Box Plot, Outliers, and Standard Deviation

We have seen that even though the box plot does not explicitely include the mean, it is possible to get an approximate idea about it by comparing it against the median and the skewness of the box plot:

 In a somewhat similar fashion you can estimate the standard deviation based on the box plot:

Both estimates work best for normal distribution, i.e. distributions that are not skewed, and the first approximation works best if they are no outliers. We will later determine additional relations between the standard deviation for normally distributed data.That reminds me: another useful application for the IQR is to define outliers:

Example: Consider the above data on cotinine levels of 40 smokers. Find the IQR and use it to estimate the standard deviation. Also, identify any outliers.

The data ranges rom 0 to 491 (from min to max), while the Q1 = 86.5 and Q3 = 251. Thus, we have two estimates for the standard deviation:

The estimate are pretty close and since the true standard deviation is 119.5, they are both pretty close to the actual value. The best part of these estimates is, however, that they are so very simple to compute and thus they give you a quick ballpark estimate for the standard deviation.

As for any outliers, they would be data values:

So there are no outliers in this case (which is one reason why the estimate of range/ 4 works prertty well).

Example: Find all outliers for the life expectancy data above.

For that data set we found that IQR = 76 - 63.5 = 12.5 and therefore outliers woukd be data values:

Thus, the three data points for Uganda (42), Cent. Afri. R (43), and Tanzania (43) are outliers below, while there are no outliers above. Note that since there are outliers, the range/4 estimate for the standard deviation should not work as well as the estimate based on the IQR. Confirm that!