Exploring data: the median and the mean, and everything in between

Here is the assignment for this blog. “Write a report on the progress being made in the Western Cape by the BirdPix section of the Virtual Museum.” The report needs to communicate to the citizen scientists who participate in the project. It needs to provide them with insights into how well the project is doing.

This map shows one aspect of progress. It provides the number of records submitted to BirdPix per quarter degree grid cell. There are lots and lots of numbers. This is not anecdote; this is real data, from which recommendations need to be made.

I could write something like this. One long paragraph could start “Grid cell 3018DA Kliprand in the far north has 32 records,” … later on it would say … “grid cell 3318CD Cape Town has 1322 records,” … and it would end by saying … “grid cell 3420CC at Cape Agulhas has 67 records.” This paragraph would fill up a few pages with utterly boring and useless text. It is just providing essentially the same information as is presented a lot more effectively in the map. What is needed is some sort of a summary of the data. I need to convey the overall picture, and not get bogged down in detail.

In general terms, the first task of any statistician is to summarize lots of numbers down to a tiny handful of numbers. The message in the data cannot be accessed by reading every number, or by simply eyeballing the data. There are just too many numbers to absorb. To extract succinct stories out of data is the role of the statistician.

Now this blog is supposed to be a tutorial, and not a full scale data analysis, so I will illustrate the ideas with a subset of the Western Cape; this map goes from Langebaan to Cape Town, and inland.

The first thing to determine is the sample size, the number of numbers. This is 23. The statistician would write n=23. Statisticians have a convention that they reserve the letter n for sample size. Woe betide any statistician who comes along and uses n for any other purpose. Mathematics is full of these little conventions and rules; if you know these secret codes, equations can often be understood far more quickly.

The sample of size 23 is big enough not to be trivial. But it is small enough to be manageable. Here are the numbers, copied row by row off the map: 20 430 2 43 13 9 12 67 57 93 64 33 16 72 258 484 44 59 1322 1022 132 54 86. These are the numbers of BirdPix records per grid cell.

The first (and obvious) thing to try is the mean, also called the average. We have known how to calculate this since we were schoolkids. Add the numbers together, and divide by the number of numbers. In this case, it is 4392/23=191.0. So the mean is 191.0 records per grid cell. We do this bit of trivial arithmetic, and move on to the next task. But, hey, let’s stop and look at this more carefully. Does 191.0 make sense? Does it really communicate what is going on in the sample? A little thought shows that the mean is doing a ghastly job of summarizing the data. Only five of the 23 values are larger than the mean: 258, 430, 484, 1022 and 1322. And the remaining 18 are smaller than the mean. The arithmetic is perfectly correct, but somehow it doesn’t make sense. The mean does not really communicate where the “middle” of the data really lies.
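The arithmetic above is easy to check by machine. Here is a minimal sketch in Python; the language and the variable names (`records`, `above_mean`, `below_mean`) are my own choices for illustration, not anything supplied by BirdPix.

```python
# The 23 BirdPix record counts, copied row by row off the map.
records = [20, 430, 2, 43, 13, 9, 12, 67, 57, 93, 64, 33,
           16, 72, 258, 484, 44, 59, 1322, 1022, 132, 54, 86]

n = len(records)      # sample size
total = sum(records)  # sum of all the numbers
mean = total / n      # the mean: add up, divide by n

# Split the sample into values above and below the mean.
above_mean = sorted(x for x in records if x > mean)
below_mean = sorted(x for x in records if x < mean)

print(n, total, round(mean, 1))  # 23 4392 191.0
print(above_mean)                # [258, 430, 484, 1022, 1322]
print(len(below_mean))           # 18
```

The lopsided split, five values above and 18 below, is the warning sign that the mean is not sitting in the "middle" of this sample.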

Statisticians have a strategy for dealing with this problem. They simply sort the data, and pick the number in the middle. When they are sorted the 23 numbers look like this: 2 9 12 13 16 20 33 43 44 54 57 59 64 67 72 86 93 132 258 430 484 1022 1322. The number in the middle is 59. There are 11 numbers which are smaller than 59 and 11 which are larger. This number in the middle has a technical name. It is called the median.

We are trying to communicate how well BirdPix is doing. In this situation, the median, 59 records per grid cell, communicates the reality far better than the mean, 191.0, does. The mean is biased, pulled upwards by the two grid cells with more than 1000 records. In contrast, the median is unfazed by “outliers”. If the largest number in the dataset were 13220 instead of 1322, the impact on the mean would be dramatic (it would change to 708.3), but there would be no impact on the median. The number in the middle remains 59. There is a technical term for this property of the median. In their jargon, statisticians say that it is robust against outliers.
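The robustness claim can be tried out directly. The sketch below uses the `mean` and `median` functions from Python's standard `statistics` module. The list `with_outlier` is the thought experiment from the paragraph above (1322 replaced by 13220), not real BirdPix data.

```python
from statistics import mean, median

records = [20, 430, 2, 43, 13, 9, 12, 67, 57, 93, 64, 33,
           16, 72, 258, 484, 44, 59, 1322, 1022, 132, 54, 86]

# Thought experiment only: pretend the largest value was 13220, not 1322.
with_outlier = [13220 if x == 1322 else x for x in records]

print(round(mean(records), 1), median(records))            # 191.0 59
print(round(mean(with_outlier), 1), median(with_outlier))  # 708.3 59
```

The mean more than triples; the median does not move.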

There is a formula for finding the “rank” of the median. Once the numbers are sorted, the median has rank (n+1)/2. With n=23 numbers the median has rank (23+1)/2 = 12. It is the 12th number in the sorted list.

This works fine when the sample size is an odd number. But if n is even, there’s a problem. Suppose n=24. Then the median has rank (24+1)/2 = 12½. The trick is to use the average of the pair of numbers in the middle of the sorted sample as the median. The median would be defined as the average of the 12th and 13th numbers in the sorted sample of 24 numbers.
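The rank rule, including the trick for even n, can be sketched in a few lines of Python. The function name `median_by_rank` is hypothetical, chosen for this illustration, and the four-number list at the end is a made-up toy example rather than BirdPix data.

```python
import math

def median_by_rank(values):
    """Median via the rank formula (n + 1) / 2 on the sorted sample."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    rank = (n + 1) / 2  # 12 for n = 23, 12.5 for n = 24
    # Ranks count from 1, Python indexes from 0, hence the "- 1".
    lower = ordered[math.floor(rank) - 1]
    upper = ordered[math.ceil(rank) - 1]
    # For odd n, lower and upper are the same number.
    return (lower + upper) / 2

records = [20, 430, 2, 43, 13, 9, 12, 67, 57, 93, 64, 33,
           16, 72, 258, 484, 44, 59, 1322, 1022, 132, 54, 86]

print(median_by_rank(records))       # 59.0
print(median_by_rank([1, 3, 5, 7]))  # 4.0 (toy even-sized sample)
```

When the rank is a whole number, rounding it down and rounding it up land on the same position, so one formula covers both the odd and the even case.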

We now go back to our sample of 23 sorted numbers: 2 9 12 13 16 20 33 43 44 54 57 59 64 67 72 86 93 132 258 430 484 1022 1322.

The smallest number is 2, the largest number is 1322, and the median, the number in the middle, is 59. A more subtle question is whether the numbers are concentrated around the median or spread out towards the two extremes. One clever way to get a handle on this is to compute the medians of the top half and the bottom half of the data. In broad brush terms, the “lower median” will be a quarter of the way from the smallest number, so it gets called the lower quartile, and the “upper median” will be a quarter of the way from the largest number, so it gets called the upper quartile.

There are various ways of doing this. The right way is to take this set of numbers as the bottom half of the data: 2 9 12 13 16 20 33 43 44 54 57 59. Note that it includes the median, 59. There are now 12 numbers. 12 is an even number. So we need to find the middle pair of numbers; they are 20 and 33. Their average is 26.5. This is the lower quartile. The top half of the data also contains 12 numbers, because the median is used again: 59 64 67 72 86 93 132 258 430 484 1022 1322. The middle pair is 93 and 132, and their average is 112.5. This is the upper quartile.
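The recipe just described, where the median is counted in both halves when n is odd, can be sketched as follows. The function name `quartiles_with_shared_median` is hypothetical. Statistical software offers several other quartile conventions, which need not give the same answers on other samples.

```python
from statistics import median

def quartiles_with_shared_median(values):
    """Lower and upper quartiles as the medians of the two halves.

    When n is odd, the overall median belongs to both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2               # 12 when n = 23
    bottom_half = ordered[:half]      # smallest `half` values
    top_half = ordered[n - half:]     # largest `half` values
    return median(bottom_half), median(top_half)

records = [20, 430, 2, 43, 13, 9, 12, 67, 57, 93, 64, 33,
           16, 72, 258, 484, 44, 59, 1322, 1022, 132, 54, 86]

lower_quartile, upper_quartile = quartiles_with_shared_median(records)
print(lower_quartile, upper_quartile)  # 26.5 112.5
```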

*The grid cell for Hopefield (3318AB) has only two records. Here is one of them. Black Stork* Ciconia nigra submitted to BirdPix by Linda and Eddie du Plessis. This record is curated at http://vmus.adu.org.za/?vm=BirdPix-16884

Our sample size of 23 is not exactly divisible by 4, so the rest of this paragraph is only approximately true. But as the sample size n gets bigger, it gets closer and closer to the truth. A quarter of the sample lies between the upper quartile and the largest number, and a quarter lies between the lower quartile and the smallest number. So that means the remaining half of the sample lies between the lower quartile and the upper quartile. Half the sample is greater than the median and half the sample is less than the median. So these five numbers, smallest number, lower quartile, median, upper quartile and largest number provide a neatly interpretable summary of the numbers in our sample. We call them the five-number summary. However large n is, this strategy crunches the sample down to just five numbers. These provide real insight. But it is rare for them to be presented in a paper and actually called the five-number summary. Usually, they are plotted in a particular style, and that graphic is universally called the box-and-whisker plot. One of the next blogs in this series is devoted to the box-and-whisker plot! And that blog will also reveal the person who invented this crazy name.

So for the small sample of n=23, a statistician would write the five-number summary using this notation: (2, 26.5, 59, 112.5, 1322). Now that you are initiated into the secrets of unpacking and interpreting this, you know: (1) all the numbers lie between 2 and 1322, (2) half the numbers lie between 26.5 and 112.5, (3) half the numbers are smaller than 59 and half the numbers are greater than 59.
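To see how closely the three interpretations hold for a sample whose size is not divisible by 4, one can compute the summary and simply count. This is a sketch only: the function name `five_number_summary` is my own, and the quartiles are taken as medians of the two halves with the overall median shared between them when n is odd.

```python
from statistics import median

def five_number_summary(values):
    """Return (smallest, lower quartile, median, upper quartile, largest)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # each half includes the median when n is odd
    return (ordered[0],
            median(ordered[:half]),
            median(ordered),
            median(ordered[n - half:]),
            ordered[-1])

records = [20, 430, 2, 43, 13, 9, 12, 67, 57, 93, 64, 33,
           16, 72, 258, 484, 44, 59, 1322, 1022, 132, 54, 86]

smallest, q1, med, q3, largest = five_number_summary(records)
print((smallest, q1, med, q3, largest))  # (2, 26.5, 59, 112.5, 1322)

# Count how many values fall where the interpretation says they should.
between_quartiles = sum(1 for x in records if q1 < x < q3)
below_median = sum(1 for x in records if x < med)
above_median = sum(1 for x in records if x > med)
print(between_quartiles, below_median, above_median)  # 11 11 11
```

With n=23, "half" turns out to be 11 values in each case, which is as close to half as an odd sample allows.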

*The grid cell for Hopefield (3318AB) has only two records. Here is the second of the two. Steppe Buzzard* Buteo buteo submitted to BirdPix by Mark Stanton. The record is curated at http://vmus.adu.org.za/?vm=BirdPix-60463

Just for the record, for the Western Cape as a whole, confining ourselves to the 200 grid cells with at least one record, the five-number summary is (1, 7, 25, 58.5, 1322). Interpretation: (1) all the numbers lie between 1 and 1322, (2) half the numbers lie between 7 and 58.5, (3) half the numbers are smaller than 25 and half the numbers are greater than 25. (And there are 62 grid cells without data!) This summary becomes interesting when it is put alongside the summaries for the other eight provinces, and that is what the box-and-whisker plot, coming to you in a blog soon, will achieve. The simple recommendation out of this analysis is that vastly more data are needed before we can claim that BirdPix has a comprehensive dataset for the Western Cape.

The mean and the median seem to be very different animals. Try this exercise. The trimmed mean is calculated by finding the mean after the smallest and largest values in the sample (2 and 1322 in our small dataset) have been eliminated. The trimmed mean is 146.1. We can repeat the process. Chop off the two largest and two smallest numbers, and find the mean of the remaining 19 numbers: the 2-trimmed mean is 107.2. Keep going. If the sample size is odd, you ultimately are left with a single number, which is the median. If the sample size is even, ultimately you reach a point where you need to take the average of two numbers, which is also the median. So the mean and the median are the two ends of a spectrum. The trimmed mean is a real strategy for describing the “middle” of the data, in situations where there are only occasional outliers.
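The exercise can be run from one end of the spectrum to the other. In this sketch the function name `trimmed_mean` and its parameter `k` (how many values to chop off each end) are my own labels. Statistical packages often specify trimming as a proportion of the sample instead.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    if k < 0 or 2 * k >= len(ordered):
        raise ValueError("k must leave at least one value in the sample")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

records = [20, 430, 2, 43, 13, 9, 12, 67, 57, 93, 64, 33,
           16, 72, 258, 484, 44, 59, 1322, 1022, 132, 54, 86]

# Largest k that still leaves one value (odd n) or two values (even n).
max_k = (len(records) - 1) // 2  # 11 for n = 23

for k in (0, 1, 2, max_k):
    print(k, round(trimmed_mean(records, k), 1))
# 0 191.0   (the ordinary mean)
# 1 146.1
# 2 107.2
# 11 59.0   (the median)
```

Looping over every k from 0 to `max_k` shows the trimmed mean sliding from the mean down to the median for this sample.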

When should you use the mean and when should you use the median? There are no strict rules. The median is always good, and it has a simple interpretation. The mean works fine if the sample has no outliers. If the mean and the median are close together, then the mean will be fine. In fact the mean is then preferred. This is because a vast amount of sophisticated statistical theory has been built up around the mean, and so it has become the dominant way to measure the “middle” of a sample. But, beware, as in the example used here, the mean is often misleading.

The blog draws heavily on the textbook IntroSTAT. If you are impatient to move faster than these blogs on “data and statistics” do, then you can download the whole book. It is an amazingly small file (1.6MB).

Karis Daniel produced the maps.