Numerical Summary Part-Ⅱ: Familiarizing Measures of Spread


In statistics, dispersion is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and interquartile range.

Published on May 08, 2021

Range IQR Standard Deviation


 In my previous post Understanding Measures of Central Tendency, we talked about the center of a data sample.

 Now we will look at how spread out the data points are around that central measure.

 In general, there are three common measures of dispersion or spread, namely
  - Range
  - IQR and
  - Standard Deviation

Range: difference between the extremes

 In statistics, the Range is simply the difference between the highest data point and the lowest data point in the sample.

 Steps to find out range
  - Arrange the given data points in ascending order
  - Identify the largest and the smallest data point in the sample.
  - Then the range of the sample is given by Range = Maximum - Minimum
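The steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function name value_range is made up for this post (chosen so it does not clash with Python's built-in range), and the two samples are the ages and test scores used in Eg-1 and Eg-2 below.

```python
def value_range(values):
    """Return the statistical range: maximum minus minimum."""
    ordered = sorted(values)                      # step 1: ascending order
    smallest, largest = ordered[0], ordered[-1]   # step 2: the two extremes
    return largest - smallest                     # step 3: Maximum - Minimum

ages = [37, 19, 31, 29, 26, 33, 21]
scores = [67, 100, 93, 81, 96]
print(value_range(ages))    # 18
print(value_range(scores))  # 33
```

Sorting is not strictly needed (max(values) - min(values) gives the same number), but it mirrors the manual steps.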

Note!

 The range function used in programming and the Range measure used in statistics are quite different. One defines a sequence of numbers and the other describes the spread among data values.

 
Eg-1 : The ages of 7 participants in a lemon and spoon competition conducted at the University of WhY stats in 2021 are as follows. Figure out how spread apart the ages are.

Participant   1   2   3   4   5   6   7
Age          37  19  31  29  26  33  21

sol:
 To find out the range let’s arrange the ages in order

Age   19  21  26  29  31  33  37

 Here the highest age is 37 years and the lowest is 19 years, so the range is given by

   Range = 37 - 19 = 18 years

 
Eg-2 : Martin scores 67, 100, 93, 81 and 96 in 5 different math tests. Find the range of his math scores.
sol:
 Arranging Martin's scores,

   67  81  93  96  100

 Here 100 and 67 are the maximum and minimum marks scored. Then the Range is

   Range = 100 - 67 = 33

Quartiles: divides into 4 parts

 In statistics, the quartiles are data points that divide the given sample into 4 equal parts.

 Steps to find out the quartile:
  - Arrange the data points in ascending order.
  - Calculate the median of the data set, which is the 2nd quartile (Q2).
  - Split the data into 2 parts: the points to the left of the median and the points to the right of it.
  - Calculate the median of each part. The median of the left part is the 1st quartile (Q1) and the median of the right part is the 3rd quartile (Q3).
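As a rough sketch, the steps above translate to Python as follows. The function name quartiles is illustrative, and the sample is the 15 exercise times used in Eg-6 further down. Note that this is the "median of the two halves" rule described in this post; statistical libraries often interpolate instead and can return slightly different quartiles for the same data.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves steps (needs at least 2 points)."""
    ordered = sorted(values)             # step 1: ascending order
    n = len(ordered)
    lower = ordered[: n // 2]            # points to the left of the median
    upper = ordered[(n + 1) // 2 :]      # points to the right (middle point skipped when n is odd)
    return median(lower), median(ordered), median(upper)

minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]
print(quartiles(minutes))  # (20, 40, 60)
```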

Figure-1 : 3 quartiles of 15 unknown data points

 
Eg-3 : The following were the hourly collections from a Salvation Army kettle at a local store one day in December: $19, $26, $25, $37, $32, $28, $22, $23, $29, $34, $39, and $31. Determine the first quartile and third quartile for the amount collected.
sol:
 First let's arrange the data points

   19  22  23  25  26  28  29  31  32  34  37  39

 Here there is an even number of data points (i.e. n = 12), so the median is the average of the 6th and 7th values

   Median = (28 + 29) / 2 = 28.5

 Now our data is split into 2 parts as shown

   19  22  23  25  26  28   [28.5]   29  31  32  34  37  39

 Computing the median of the 2 parts separately

   Lower half: 19  22  23  25  26  28      Q1 = (23 + 25) / 2 = 24
   Upper half: 29  31  32  34  37  39      Q3 = (32 + 34) / 2 = 33

 Therefore our new sequence with the quartiles would be

   19  22  23   [24]   25  26  28   [28.5]   29  31  32   [33]   34  37  39

 Hence $24 and $33 are the 1st and the 3rd quartiles respectively.

Quartiles as Percentile

 The p-th percentile is the value at or below which p percent of the data points fall.

 Percentiles divide the ordered data into 100 equal parts.

 So we can say that the quartiles Q1, Q2 and Q3 are the 25th, 50th and 75th percentiles, and the maximum is the 100th percentile.

Figure-2 : Expressing quartiles as percentiles

 

 Note: Median is the 2nd quartile and also the 50th percentile.
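To see the link between quartiles and percentiles on real numbers, here is a small sketch that counts what percentage of a sample lies at or below a given value. The helper name percent_at_or_below is made up for illustration. The data are the twelve monthly sales figures from Eg-5 further down, whose quartiles work out to 17, 26 and 42.

```python
def percent_at_or_below(values, cutoff):
    """Percentage of data points that are less than or equal to cutoff."""
    count = sum(1 for v in values if v <= cutoff)
    return 100 * count / len(values)

sales = [1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57]
for label, value in [("Q1", 17), ("Q2", 26), ("Q3", 42)]:
    print(label, percent_at_or_below(sales, value))
# Q1 25.0
# Q2 50.0
# Q3 75.0
```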

Interquartile Range(IQR): Q3-Q1

 In statistics, the Interquartile Range is the difference between the 3rd quartile and the 1st quartile of the given data sample.

 Steps to find IQR:
  - As usual arrange the data in ascending order.
  - Find out the 3 quartiles of the data points.
  - Then the IQR for the data points is given by IQR = Q3 - Q1
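A minimal Python sketch of these steps is shown below, assuming the same median-of-halves quartile rule used in this post. The function name iqr is illustrative, and the two samples are the test scores from Eg-4 below.

```python
from statistics import median

def iqr(values):
    """Interquartile range Q3 - Q1, with quartiles taken as medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])         # median of the lower half
    q3 = median(ordered[(n + 1) // 2 :])   # median of the upper half
    return q3 - q1

test_a = [69, 96, 81, 79, 65, 76, 83, 99, 89, 67, 90, 77, 85, 98, 66, 91, 77, 69, 80, 94]
test_b = [90, 72, 80, 92, 90, 97, 92, 75, 79, 68, 70, 80, 99, 95, 78, 73, 71, 68, 95, 100]
print(iqr(test_a))  # 18.0
print(iqr(test_b))  # 21.0
```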

Eg-4 : The marks scored by the students at WhY stats in two different tests are as follows. Compare them.

    Test Scores for Test A: 69, 96, 81, 79, 65, 76, 83, 99, 89, 67, 90, 77, 85, 98, 66, 91, 77, 69, 80, 94

    Test Scores for Test B: 90, 72, 80, 92, 90, 97, 92, 75, 79, 68, 70, 80, 99, 95, 78, 73, 71, 68, 95, 100
sol:
 To compare two samples of the same kind we need some summary statistics. So let's start with the median.

 Arranging the two samples in order

 Test A: 65, 66, 67, 69, 69, 76, 77, 77, 79, 80, 81, 83, 85, 89, 90, 91, 94, 96, 98, 99

 Test B: 68, 68, 70, 71, 72, 73, 75, 78, 79, 80, 80, 90, 90, 92, 92, 95, 95, 97, 99, 100

 With 20 scores each, the median is the average of the 10th and 11th values: (80 + 81) / 2 = 80.5 for Test A and (80 + 80) / 2 = 80 for Test B. So the median of Test A is 0.5 more than that of Test B.

 Does this mean that students performed worse in the 2nd test?

 Let's check this mathematically using the quartiles and the IQR.

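 In test A

 Lower half: 65  66  67  69  69  76  77  77  79  80      1st quartile (Test A) = (69 + 76) / 2 = 72.5
 Upper half: 81  83  85  89  90  91  94  96  98  99      3rd quartile (Test A) = (90 + 91) / 2 = 90.5

 So, the Interquartile Range is given by IQR = 90.5 - 72.5 = 18

 In test B

 Lower half: 68  68  70  71  72  73  75  78  79  80      1st quartile (Test B) = (72 + 73) / 2 = 72.5
 Upper half: 80  90  90  92  92  95  95  97  99  100     3rd quartile (Test B) = (92 + 95) / 2 = 93.5

 So, the Interquartile Range is given by IQR = 93.5 - 72.5 = 21

Both tests have the same 1st quartile (72.5), but the 3rd quartile rises from 90.5 in Test A to 93.5 in Test B. So although the median is almost unchanged, the upper half of the class scored higher in the 2nd test, and in that sense the students improved. The larger IQR also tells us that the Test B scores are more spread out.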

The 5 Point Summary

 The Minimum, the First Quartile, the Median, the Third Quartile and the Maximum are the five numbers often used to summarise a given data sample.

 These data points are also known as The Five Point Summary.
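As a hedged sketch, the five numbers can be collected with one small function. The name five_point_summary is illustrative, the quartiles again use the median-of-halves rule from this post, and the sample is Angela's monthly sales from Eg-5 below.

```python
from statistics import median

def five_point_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sample with at least 2 points."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])         # median of the lower half
    q3 = median(ordered[(n + 1) // 2 :])   # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

sales = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
print(five_point_summary(sales))  # (1, 17.0, 26.0, 42.0, 57)
```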

Figure 3: 5 point summary for a 15 point unknown data sample

 
Eg-5 : A year ago, Angela began working at a computer store. Her supervisor asked her to keep a record of the number of sales she made each month.

 The following data set is a list of her sales for the last 12 months. Find the 5 point summary of her sales.

 34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37

sol:
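 Arranging the values in ascending order

   1  11  15  19  20  24  28  34  37  47  50  57

 We can say that the maximum sales were 57 and the minimum were 1. With n = 12, the median is (24 + 28) / 2 = 26.

 Lower half: 1  11  15  19  20  24       1st quartile = (15 + 19) / 2 = 17
 Upper half: 28  34  37  47  50  57      3rd quartile = (37 + 47) / 2 = 42

Hence the 5 point summary is

Minimum   1st Quartile   Median   3rd Quartile   Maximum
1         17             26       42             57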

The Anatomy of a Box-Plot

 A boxplot is a standardized way of displaying the dataset based on a five-number summary.

 They were introduced by the American statistician John Tukey around 1970 and became widely known after the publication of his book Exploratory Data Analysis in 1977.

 Box plots are widely used when comparing two or more categories, samples or even populations.

Figure-4: Visualizing the various parts of a Boxplot

 
 Box plots are made of 5 key components
  - Median
  - Hinges: two hinges located at the lower and the upper quartiles denoted by Q1 and Q3, respectively.
  - Fences: two cut-off values placed 1.5 times the IQR beyond the hinges:
    Lower Extreme = Q1 - 1.5(IQR),

    Upper Extreme = Q3 + 1.5(IQR),

            where IQR denotes the inter quartile range, IQR = Q3 - Q1.

  - Whiskers: two lines that extend from the hinges towards the fences, usually ending at the most extreme data point that still lies inside them.
  - Outliers: all individual points that lie beyond the lower and upper extremes are represented as dots.
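The fence and outlier rules above can be sketched numerically as follows. The function name box_plot_stats and the dictionary keys are made up for this illustration, the multiplier k = 1.5 follows the formulas above, and the sample is the exercise data from Eg-6 below. It only computes the numbers; drawing the plot itself is left to a plotting library.

```python
from statistics import median

def box_plot_stats(values, k=1.5):
    """Compute hinges, fences and outliers with the Q1 - k*IQR / Q3 + k*IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])         # lower hinge
    q3 = median(ordered[(n + 1) // 2 :])   # upper hinge
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Points strictly outside the fences are flagged as potential outliers.
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median(ordered), "q3": q3,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "outliers": outliers}

minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]
stats = box_plot_stats(minutes)
print(stats["lower_fence"], stats["upper_fence"], stats["outliers"])  # -40.0 120.0 [300]
```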

Eg-6 : The School of WhYrus conducted an experiment on the amount of time people spend exercising every day. Here is a sample of 15 values. Interpret them.

 0 minutes, 40 minutes, 60 minutes, 30 minutes, 60 minutes, 10 minutes, 45 minutes, 30 minutes, 300 minutes, 90 minutes, 30 minutes, 120 minutes, 60 minutes, 0 minutes, 20 minutes
sol:
 First let's arrange the data in order

   0  0  10  20  30  30  30  40  45  60  60  60  90  120  300

 Now find out the median. With n = 15 it is the 8th value, so Median = 40.

 Then we will find out the middle point of each individual group separated by the median.

 Lower half: 0  0  10  20  30  30  30           Q1 = 20
 Upper half: 45  60  60  60  90  120  300       Q3 = 60

 So Interquartile range is IQR = Q3 - Q1 = 60 - 20 = 40

 Then calculate the extreme values as follows

  Lower Extreme = Q1 - 1.5(IQR) = 20 - 1.5*40 = 20 - 60 = -40

  Upper Extreme = Q3 + 1.5(IQR) = 60 + 1.5*40 = 60 + 60 = 120

 So we have one potential outlier (i.e. 300), since it is the only value above 120 and no value lies below -40.

  Now summing up all of this information into a boxplot gives us

Figure-5 : Five point summary of individuals and their exercising time

Standard Deviation

 In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

 It is nothing but the square root of variance.

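 Population Std:  σ = sqrt( Σ (x_i - μ)^2 / N )

 Sample Std:      s = sqrt( Σ (x_i - x̄)^2 / (n - 1) )

 Here μ and N are the population mean and size, while x̄ and n are the sample mean and size.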

 Steps to calculate the Standard Deviation
  - First figure out the mean value of the data points.
  - Calculate the sum of the squared differences between each data point and the mean value.
  - Now divide this sum by one less than the number of data points (i.e. n-1) for a sample, or by the number of data points itself for a whole population.
  - And finally take the square root of this to find out the standard deviation of the data points.
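The steps above can be written out as a short Python sketch. The function name sample_std is illustrative and it divides by n-1, i.e. it treats the data as a sample. Python's standard library offers statistics.stdev for the same calculation. The two lists are the City A and City B temperatures from Eg-7 below.

```python
import math

def sample_std(values):
    """Sample standard deviation following the four steps above (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n                                 # step 1: the mean
    squared_diffs = sum((x - mean) ** 2 for x in values)   # step 2: sum of squared differences
    variance = squared_diffs / (n - 1)                     # step 3: divide by n - 1
    return math.sqrt(variance)                             # step 4: square root

city_a = [95, 93, 95, 94, 96, 94, 95]
city_b = [91, 81, 95, 91, 86, 82, 78]
print(round(sample_std(city_a), 2))  # 0.98
print(round(sample_std(city_b), 2))  # 6.26
```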

Eg-7 : A reporter compares a week of high temperatures (in Fahrenheit) in two different cities. The data looks like this:

Day         City A Forecast   City B Forecast
Monday      95                91
Tuesday     93                81
Wednesday   95                95
Thursday    94                91
Friday      96                86
Saturday    94                82
Sunday      95                78

sol:
 At first he figures out that the mean temperature in city-A is 94.6 F and the mean temperature in city-B is 86.3 F.

 The means alone do not tell us how consistent the temperatures are. To infer this he calculates the spread for each city individually, using the sample standard deviation.

 That is

For city-A

   s = sqrt( 5.71 / 6 ) ≈ 0.98 F

For city-B

   s = sqrt( 235.43 / 6 ) ≈ 6.26 F

City A's temperatures vary by about 1 F around their mean, while City B's vary by more than 6 F. This supports the view that city A's forecasts are more reliable than City B's forecasts.

Why do we divide by n-1?

 A sample is usually much smaller than the population, and its deviations are measured from the sample mean rather than from the true population mean. Dividing by n then gives a biased estimate of the population variance.

 Hence by using n-1 degrees of freedom we get a statistic that is, on average, closer to the true population value.

Summarizing Reasons

 As the sample size reduces there is a huge chance of underestimating the true population variance.

I recommend you watch the Khan Academy review and try those simulations before reading this.

One of the simulations shows how the variance estimate approaches its true value when it uses n-1 degrees of freedom.

Another one tells us that for a sample size n the biased estimate (dividing by n) approaches (n-1)/n times the population variance.

Calculating the unbiased estimate would be as follows

   unbiased variance = n/(n-1) * biased variance = Σ (x_i - x̄)^2 / (n - 1)

And in the last simulation, the presenter records the sample estimate with respect to different values of a using the formula

   variance estimate = Σ (x_i - x̄)^2 / (n + a)

When we generate more and more samples the best estimate is recorded when a is close to -1. Also, we will overestimate when a is less than -1 and underestimate when a is greater than -1.
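If you want to try this idea yourself, here is a rough simulation sketch. It is not the Khan Academy code. The function name mean_variance_estimate, the choice of a standard normal population (true variance 1), the sample size of 5 and the number of trials are all assumptions made for illustration.

```python
import random

def mean_variance_estimate(sample_size, a, trials=20000, seed=0):
    """Average of sum((x - sample_mean)^2) / (n + a) over many random samples
    drawn from a standard normal population, whose true variance is 1."""
    rng = random.Random(seed)   # fixed seed so repeated runs give the same numbers
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        squared_diffs = sum((x - mean) ** 2 for x in sample)
        total += squared_diffs / (sample_size + a)
    return total / trials

for a in (-2, -1, 0, 1):
    print(a, round(mean_variance_estimate(sample_size=5, a=a), 3))
```

In theory the average should sit near (n - 1)/(n + a) times the population variance, so with n = 5 you should see a value close to 1 for a = -1, close to 0.8 for a = 0, and an overestimate for a = -2.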