Who's Yambakam - a deep dive into my external self | Numerical Summary Part-Ⅱ: Familiarizing Measures of Spread

In my previous post Understanding Measures of Central Tendency, we talked about the center of a data sample.

Now we will look at how spread out the data points are around that central measure.

In general, there are three measures of dispersion, or spread:
- Range
- Interquartile Range (IQR)
- Standard Deviation

Range: difference between the extremes

In statistics, the Range is simply the difference between the highest data point and the lowest data point in the sample.

Steps to find the range
- Arrange the given data points in ascending order.
- Identify the largest and the smallest data point in the sample.
- Then the range of the sample is given by Range = Maximum - Minimum

Note!

The range function used in programming and the Range measure used in statistics are quite different. One defines a sequence of numbers, while the other describes the spread among data values.
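To make the steps and the note above concrete, here is a minimal Python sketch. The helper name `stat_range` is my own choice for illustration, and the sample is the list of ages from Eg-1.

```python
def stat_range(values):
    """Statistical range: largest value minus smallest value."""
    if not values:
        raise ValueError("range is undefined for an empty sample")
    return max(values) - min(values)


ages = [37, 19, 31, 29, 26, 33, 21]  # ages from Eg-1
print(stat_range(ages))              # 18

# Python's built-in range() is unrelated: it produces a sequence of integers.
print(list(range(3)))                # [0, 1, 2]
```

Sorting is not strictly needed in code, since `max` and `min` find the extremes directly.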

Eg-1 : The ages of 7 participants in a lemon and spoon competition conducted at the University of WhY stats in 2021 are as follows. Figure out how spread apart the ages are.

Participant	1	2	3	4	5	6	7
Age	37	19	31	29	26	33	21

sol:
To find the range, let’s arrange the ages in order

Age	19	21	26	29	31	33	37

Here the highest age is 37 years and the lowest is 19 years, so the range is given by

$Range = \mathrm{37} - \mathrm{19} = \textbf{18}$

Eg-2 : Martin scores 67, 100, 93, 81 and 96 in 5 different math tests. Find the range of his math scores.
sol:
Arranging Martin’s scores,

67 81 93 96 100

Here 100 and 67 are the maximum and minimum marks scored. Then the Range is

$Range = \mathrm{100} - \mathrm{67} = \textbf{33}$

Quartiles: divides into 4 parts

In statistics, the quartiles are data points that divide the given sample into 4 equal parts.

Steps to find the quartiles:
- Arrange the data points in ascending order.
- Calculate the median of the data set, which is the 2nd quartile.
- Split the data into 2 parts: the values to the left of the median and the values to the right of it.
- Calculate the median of each part. These are the 1st and 3rd quartiles.
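The steps above can be sketched in a few lines of Python. This is only an illustration of the median-of-halves method described here: the function name `quartiles` and the small sample are made up for the example, and other tools (spreadsheets, NumPy, and so on) may use interpolation rules that give slightly different quartile values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)            # step 1: ascending order
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = data[:half]              # values to the left of the median
    upper = data[half + n % 2:]      # values to the right (middle value skipped when n is odd)
    return median(lower), median(data), median(upper)


sample = [7, 1, 5, 3, 9, 11, 13, 15]  # hypothetical data
print(quartiles(sample))              # (4.0, 8.0, 12.0)
```

When n is odd the middle value itself is the median, so it is left out of both halves, which matches how the halves are formed in the worked examples of this post.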

Figure-1 : 3 quartiles of 15 unknown data points

Eg-3 : The following were the hourly collections from a Salvation Army kettle at a local store one day in December: $19, $26, $25, $37, $32, $28, $22, $23, $29, $34, $39, and $31. Determine the first quartile and third quartile for the amount collected.
sol:
First let’s arrange the data points

19 22 23 25 26 28 29 31 32 34 37 39

Here there is an even number of data points (i.e. n=12). So the median is given by

$\begin{aligned} Median &= \frac{X(n/2)+X(n/2+1)}{2} \\ &= \frac{X(12/2)+X(12/2+1)}{2} \\ &= \frac{X(6)+X(7)}{2} \\ &= \frac{28+29}{2} \\ &= 28.5 \end{aligned}$

Now the median splits our data into 2 parts as shown

19 22 23 25 26 28 28.5 29 31 32 34 37 39

Computing the median of the 2 parts separately

19 22 23 25 26 28

29 31 32 34 37 39

$\begin{aligned} Median &= \frac{X(n/2)+X(n/2+1)}{2} \\ &= \frac{X(6/2)+X(6/2+1)}{2} \\ &= \frac{X(3)+X(4)}{2} \\ &= \frac{23+25}{2} \\ &= 24 \end{aligned}$

$\begin{aligned} Median &= \frac{X(n/2)+X(n/2+1)}{2} \\ &= \frac{X(6/2)+X(6/2+1)}{2} \\ &= \frac{X(3)+X(4)}{2} \\ &= \frac{32+34}{2} \\ &= 33 \end{aligned}$

Therefore our new sequence with the quartiles would be

19 22 23 24 25 26 28 28.5 29 31 32 33 34 37 39

Hence $24 and $33 are the 1st and the 3rd quartiles respectively.

Quartiles as Percentile

A percentile is the value at or below which a given percentage of the data falls. For example, 25% of the data points are equal to or below the 25th percentile.

Percentiles divide the data into 100 equal parts.

So we can say that the quartiles Q1, Q2 and Q3 are the 25th, 50th and 75th percentiles: they cut the data at 25%, 50% and 75%.

Figure-2 : Expressing quartiles as percentiles

Note: The median is the 2nd quartile and also the 50th percentile.

Interquartile Range(IQR): Q3-Q1

In statistics, the Interquartile Range is the difference between the 3rd quartile and the 1st quartile of the given data sample.

Steps to find the IQR:
- As usual, arrange the data in ascending order.
- Find the 3 quartiles of the data points.
- Then the IQR for the data points is given by IQR = Q3 - Q1

Eg-4 : The marks scored by the students at WhY stats in two different tests are as follows. Compare them.

Test Scores for Test A: 69, 96, 81, 79, 65, 76, 83, 99, 89, 67, 90, 77, 85, 98, 66, 91, 77, 69, 80, 94

Test Scores for Test B: 90, 72, 80, 92, 90, 97, 92, 75, 79, 68, 70, 80, 99, 95, 78, 73, 71, 68, 95, 100
sol:
To compare two samples with the same characteristics we need some summary statistics. So let’s start with the median.

Arranging the two samples in order

Test A	Test B
65, 66, 67, 69, 69, 76, 77, 77, 79, 80, 81, 83, 85, 89, 90, 91, 94, 96, 98, 99 $\begin{aligned} Median &= \frac{X(n/2)+X(n/2+1)}{2} \\ &= \frac{X(20/2)+X(20/2+1)}{2} \\ &= \frac{X(10)+X(11)}{2} \\ &= \frac{80+81}{2} \\ &= 80.5 \end{aligned}$	68, 68, 70, 71, 72, 73, 75, 78, 79, 80, 80, 90, 90, 92, 92, 95, 95, 97, 99, 100 $\begin{aligned} Median &= \frac{X(n/2)+X(n/2+1)}{2} \\ &= \frac{X(20/2)+X(20/2+1)}{2} \\ &= \frac{X(10)+X(11)}{2} \\ &= \frac{80+80}{2} \\ &= 80 \end{aligned}$

So the median of the Test A scores is 0.5 more than that of Test B.

Does this mean that the students performed worse in the 2nd test?

Let’s check this using the quartiles and the IQR

In test A

1st quartile(Test A)	3rd quartile(Test A)
65 66 67 69 69 76 77 77 79 80 $\begin{aligned} Median &= \frac{X(n/2)+X(n/2+1)}{2} \\ &= \frac{X(10/2)+X(10/2+1)}{2} \\ &= \frac{X(5)+X(6)}{2} \\ &= \frac{69+76}{2} \\ &= 72.5 \end{aligned}$	81 83 85 89 90 91 94 96 98 99 $\begin{aligned} Median &= \frac{X(n/2)+X(n/2+1)}{2} \\ &= \frac{X(10/2)+X(10/2+1)}{2} \\ &= \frac{X(5)+X(6)}{2} \\ &= \frac{90+91}{2} \\ &= 90.5 \end{aligned}$

So, the Interquartile Range is given by IQR = 90.5 - 72.5 = 18

In test B

1st quartile(Test B)	3rd quartile(Test B)
68 68 70 71 72 73 75 78 79 80 $\begin{aligned} Median &= \frac{X(n/2)+X(n/2+1)}{2} \\ &= \frac{X(10/2)+X(10/2+1)}{2} \\ &= \frac{X(5)+X(6)}{2} \\ &= \frac{72+73}{2} \\ &= 72.5 \end{aligned}$	80 90 90 92 92 95 95 97 99 100 $\begin{aligned} Median &= \frac{X(n/2)+X(n/2+1)}{2} \\ &= \frac{X(10/2)+X(10/2+1)}{2} \\ &= \frac{X(5)+X(6)}{2} \\ &= \frac{92+95}{2} \\ &= 93.5 \end{aligned}$

So, the Interquartile Range is given by IQR = 93.5 - 72.5 = 21

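The medians and the 1st quartiles of the two tests are almost identical, but Test B has a higher 3rd quartile (93.5 against 90.5) and therefore a larger IQR (21 against 18). So the scores in the 2nd test are more spread out, and the extra spread sits in the upper half: the stronger students scored higher in the 2nd test. Hence the students did not perform worse, and the top half improved.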
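The two IQR values from Eg-4 can be double-checked with a short Python sketch. The helper name `iqr` is my own, and it assumes the same median-of-halves quartile method used throughout this post.

```python
from statistics import median


def iqr(values):
    """Interquartile range (Q3 - Q1) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])             # median of the lower half
    q3 = median(data[half + n % 2:])     # median of the upper half
    return q3 - q1


test_a = [69, 96, 81, 79, 65, 76, 83, 99, 89, 67,
          90, 77, 85, 98, 66, 91, 77, 69, 80, 94]
test_b = [90, 72, 80, 92, 90, 97, 92, 75, 79, 68,
          70, 80, 99, 95, 78, 73, 71, 68, 95, 100]

print(iqr(test_a))  # 18.0
print(iqr(test_b))  # 21.0
```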

The 5 Point Summary

The Minimum, the First Quartile, the Median, the Third Quartile and the Maximum are the five numbers often used to summarise a given data sample.

These data points are also known as The Five Point Summary.

Figure 3: 5 point summary for a 15 point unknown data sample

Eg-5 : A year ago, Angela began working at a computer store. Her supervisor asked her to keep a record of the number of sales she made each month.

The following data set is a list of her sales for the last 12 months. Find its five point summary.

34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37

sol:
Arranging the values in ascending order

1 11 15 19 20 24 28 34 37 47 50 57

We can say that the maximum sales were 57 and the minimum were 1

$\begin{aligned} Median &= \frac{X(\frac{n}{2})+X(\frac{n}{2}+1)}{2} \\ &= \frac{X(12/2)+X(12/2+1)}{2} \\ &= \frac{X(6)+X(7)}{2} \\ &= \frac{24+28}{2} \\ &= 26 \end{aligned}$

1st Quartile	3rd Quartile
1 11 15 19 20 24 $\begin{aligned} Median &= \frac{X(\frac{n}{2})+X(\frac{n}{2}+1)}{2} \\ &= \frac{X(6/2)+X(6/2+1)}{2} \\ &= \frac{X(3)+X(4)}{2} \\ &= \frac{15+19}{2} \\ &= 17 \end{aligned}$	28 34 37 47 50 57 $\begin{aligned} Median &= \frac{X(\frac{n}{2})+X(\frac{n}{2}+1)}{2} \\ &= \frac{X(6/2)+X(6/2+1)}{2} \\ &= \frac{X(3)+X(4)}{2} \\ &= \frac{37+47}{2} \\ &= 42 \end{aligned}$

Hence the 5 point summary is

Minimum	1st Quartile	Median	3rd Quartile	Maximum
1	17	26	42	57
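As a cross-check of Eg-5, here is a minimal Python sketch that builds the five point summary for Angela's sales. The function name `five_point_summary` and the dictionary keys are my own choices, and the quartiles again assume the median-of-halves method used in this post.

```python
from statistics import median


def five_point_summary(values):
    """Minimum, Q1, median, Q3 and maximum of a sample."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return {
        "minimum": data[0],
        "q1": median(data[:half]),            # median of the lower half
        "median": median(data),
        "q3": median(data[half + n % 2:]),    # median of the upper half
        "maximum": data[-1],
    }


sales = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]  # data from Eg-5
print(five_point_summary(sales))
# {'minimum': 1, 'q1': 17.0, 'median': 26.0, 'q3': 42.0, 'maximum': 57}
```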

The Anatomy of a Box-Plot

A boxplot is a standardized way of displaying the dataset based on a five-number summary.

They were introduced by the American statistician John Tukey around 1970 and became widely known after the publication of his book Exploratory Data Analysis in 1977.

They are widely used while comparing two or more categories, samples or even populations.

Figure-4: Visualizing the various parts of a Boxplot

Box plots are made of 5 key components
- Median
- Hinges: two hinges located at the lower and the upper quartiles, denoted by Q1 and Q3 respectively.
- Fences: two fences that mark the limits beyond which a data value is treated as a potential outlier:
Lower Extreme = Q1 – 1.5(IQR),

Upper Extreme = Q3 + 1.5(IQR),

where IQR denotes the interquartile range, IQR = Q3 – Q1.

- Whiskers: two lines that extend from the hinges to the most extreme data values that still lie within the fences.
- Outliers: all individual points that lie below the lower extreme or above the upper extreme, represented as dots.

Eg-6 : The School of WhYrus conducted an experiment on the amount of time people spend exercising every day. Here is a sample of 15 values. Interpret them.

0 minutes, 40 minutes, 60 minutes, 30 minutes, 60 minutes, 10 minutes, 45 minutes, 30 minutes, 300 minutes, 90 minutes, 30 minutes, 120 minutes, 60 minutes, 0 minutes, 20 minutes
sol:
First let’s arrange the data in order

0 0 10 20 30 30 30 40 45 60 60 60 90 120 300

Now find out the median

$Median = X(\frac{n+1}{2})= X(\frac{15+1}{2})= X(8) = 40$

Then we find the middle point of each of the two groups separated by the median.

Q1	Q3
0 0 10 20 30 30 30 $\begin{aligned} Median &= X(\frac{n+1}{2}) \\ &= X(\frac{7+1}{2}) \\ &= X(4) \\ &= 20 \end{aligned}$	45 60 60 60 90 120 300 $\begin{aligned} Median &= X(\frac{n+1}{2}) \\ &= X(\frac{7+1}{2}) \\ &= X(4) \\ &= 60 \end{aligned}$

So Interquartile range is IQR = Q3- Q1 = 60 - 20 = 40

Then calculate the extreme values as follows

Lower Extreme = Q1 – 1.5(IQR) = 20 - 1.5*40 = 20 -60 = -40

Upper Extreme = Q3 + 1.5(IQR) = 60 + 1.5*40 = 60 + 60 = 120

So we have one potential outlier (i.e. 300), the only value that lies above the upper extreme of 120.
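The fence calculation in Eg-6 can be sketched in Python as follows. The function name `tukey_fences` and its parameter `k` are my own naming, and the quartiles assume the median-of-halves method used above.

```python
from statistics import median


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])             # median of the lower half
    q3 = median(data[half + n % 2:])     # median of the upper half
    spread = q3 - q1                     # the IQR
    return q1 - k * spread, q3 + k * spread


minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]  # data from Eg-6
low, high = tukey_fences(minutes)
outliers = [m for m in minutes if m < low or m > high]

print(low, high)   # -40.0 120.0
print(outliers)    # [300]
```

Note that 120 is not flagged, because it sits exactly on the upper fence rather than beyond it.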

Now summing up all of this information into a boxplot gives us

Figure-5 : Five point summary of individuals and their exercise time

Standard Deviation

In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

It is nothing but the square root of variance.

Population Std	Sample Std
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_{i} - \mu)^{2}}{N}}$	$s = \sqrt{\frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n-1}}$

Steps to calculate the sample Standard Deviation
- First find the mean value of the data points.
- Calculate the sum of the squared differences between each data point and the mean.
- Divide this sum by one less than the number of data points (i.e. n-1). This gives the sample variance.
- Finally, take the square root of the variance to get the standard deviation of the data points.
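The steps above translate almost line by line into Python. This is a sketch only: `sample_std` is a name I made up, the `readings` list is hypothetical data chosen so the arithmetic is easy to follow, and the result is compared against `statistics.stdev` from the standard library, which also uses the n-1 denominator.

```python
import math
from statistics import stdev


def sample_std(values):
    """Sample standard deviation following the steps in the text."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(values) / n                               # step 1: the mean
    squared_diffs = sum((x - mean) ** 2 for x in values) # step 2: sum of squared differences
    variance = squared_diffs / (n - 1)                   # step 3: divide by n-1
    return math.sqrt(variance)                           # step 4: square root


readings = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical data, mean = 5
print(round(sample_std(readings), 3))    # 2.138
print(math.isclose(sample_std(readings), stdev(readings)))  # True
```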

Eg-7 : A reporter compares a week of high temperatures (in Fahrenheit) in two different cities. The data looks like this:

Day	City A Forecast	City B Forecast
Monday	95	90
Tuesday	93	81
Wednesday	95	95
Thursday	94	91
Friday	96	86
Saturday	94	82
Sunday	95	78

sol:
First he finds that the mean temperature in City A is 94.6 F and the mean temperature in City B is 86.1 F.


$\begin{aligned} \bar{x} &= \frac{95+93+95+94+96+94+95}{7} \\ &= 94.6\,F \end{aligned}$	$\begin{aligned} \bar{x} &= \frac{90+81+95+91+86+82+78}{7} \\ &= 86.1\,F \end{aligned}$

The means alone do not say how consistent the temperatures are, so he calculates the amount of spread within each city individually.

That is

For city-A

$\begin{aligned} s &= \sqrt{\frac{\sum_{i=1}^{7} (x_{i} - 94.6)^{2}}{7-1}} \\ &= \sqrt{\frac{(95-94.6)^2+(93-94.6)^2+(95-94.6)^2+(94-94.6)^2+(96-94.6)^2+(94-94.6)^2+(95-94.6)^2}{6}} \\ &= \sqrt{\frac{0.16+2.56+0.16+0.36+1.96+0.36+0.16}{6}} \\ &= \sqrt{0.953} \\ &= 0.976 \end{aligned}$

For city-B

$\begin{aligned} s &= \sqrt{\frac{\sum_{i=1}^{7} (x_{i} - 86.1)^{2}}{7-1}} \\ &= \sqrt{\frac{(90-86.1)^2+(81-86.1)^2+(95-86.1)^2+(91-86.1)^2+(86-86.1)^2+(82-86.1)^2+(78-86.1)^2}{6}} \\ &= \sqrt{\frac{15.21+26.01+79.21+24.01+0.01+16.81+65.61}{6}} \\ &= \sqrt{37.81} \\ &= 6.15 \end{aligned}$

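City A’s standard deviation (0.976) is far smaller than City B’s (6.15), so City A’s temperatures stay much closer to their mean. In that sense City A’s forecasts are more consistent than City B’s.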

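Why do we divide by n-1?

A sample is usually much smaller than the population it is drawn from. If we divide by n, the sample variance becomes a biased estimate of the population variance: on average it comes out too small, because the deviations are measured from the sample mean rather than from the true population mean.

Hence by dividing by n-1 (the degrees of freedom) we get a statistic that is, on average, closer to the true population value.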

Summarizing Reasons

As the sample size shrinks, there is a high chance of underestimating the true population variance.

I recommend you watch the Khan Academy review and try those simulations before reading this.

One of the simulations shows how the unbiased sample variance approaches the true value when it is computed with n-1 degrees of freedom.

Another one tells us that, for a sample of size n, the variance computed by dividing by n approaches (n-1)/n times the population variance.

Calculating the unbiased estimate would be as follows

$\begin{aligned} \frac{n-1}{n}\times \sigma^2 &\approx \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}{n} \\ &\text{multiplying by } \frac{n}{n-1} \text{ on both sides we get} \\ \frac{n-1}{n}\times \sigma^2 \times \frac{n}{n-1} &\approx \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}{n} \times \frac{n}{n-1} \\ \sigma^2 &\approx \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}{n-1} = s^2 \end{aligned}$

In the last simulation, the variance estimate is recorded for different values of a using the formula

$Variance = \frac{sum((x[i]-mean(x))^2)}{n+a}$

As we generate more and more samples, the best estimate is recorded when a is close to -1. We overestimate the variance when a is less than -1 and underestimate it when a is greater than -1.
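If you want to try this without the Khan Academy page, here is a rough Python sketch of the same idea. It is my own simplified version, not the original simulation: the function name, the sample size of 5, the number of trials, the fixed seed and the choice of a standard normal population (true variance 1) are all assumptions made for illustration.

```python
import random


def average_variance_estimate(a, n=5, trials=20000, seed=0):
    """Average of sum((x - mean)^2) / (n + a) over many random samples.

    Samples are drawn from a normal population with true variance 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        squared_diffs = sum((x - mean) ** 2 for x in sample)
        total += squared_diffs / (n + a)
    return total / trials


# a = -2 should land above 1, a = -1 close to 1, a = 0 (dividing by n) below 1.
for a in (-2, -1, 0):
    print(a, round(average_variance_estimate(a), 3))
```

With a sample size of 5, theory says the averages should settle near 4/3, 1 and 4/5 for a = -2, -1 and 0 respectively, so only a = -1 recovers the true variance on average.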