In statistics, dispersion is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and interquartile range.
Published on May 08, 2021
Range IQR Standard Deviation
In my previous post Understanding Measures of Central Tendency, we talked about the center of a data sample.
Now we will look at how spread out the data points are around their central measure.
In general, there are three common measures of dispersion, or spread:
- Range
- IQR
- Standard Deviation
In statistics, the Range is simply the difference between the highest data point and the lowest data point in the sample.
Steps to find out range
- Arrange the given data points in ascending order
- Identify the largest and the smallest data point in the sample.
- Then the range of the sample is given by Range = Maximum - Minimum
The range function used in programming and the Range measure used in statistics are quite different. One defines a sequence of numbers, while the other describes the spread among data values.
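To make the distinction concrete, here is a minimal Python sketch. The helper name `statistical_range` is an assumption made for illustration, and the two data sets are the ones used in Eg-1 and Eg-2 of this post.

```python
# Python's built-in range() produces a sequence of integers.
print(list(range(1, 6)))  # [1, 2, 3, 4, 5]


def statistical_range(values):
    """Return the statistical range: maximum minus minimum."""
    if not values:
        raise ValueError("need at least one data point")
    return max(values) - min(values)


ages = [37, 19, 31, 29, 26, 33, 21]   # ages from Eg-1
scores = [67, 100, 93, 81, 96]        # Martin's scores from Eg-2

print(statistical_range(ages))    # 18
print(statistical_range(scores))  # 33
```

Sorting is helpful when working by hand, but `max` and `min` find the two extremes without it.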
Eg-1 : The ages of 7 participants in a lemon and spoon competition conducted at the University of WhY stats in 2021 are as follows. Figure out how spread apart the ages are.
Participant | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Age | 37 | 19 | 31 | 29 | 26 | 33 | 21 |
sol:
To find out the range let’s arrange the ages in order
Age | 19 | 21 | 26 | 29 | 31 | 33 | 37 |
Here the highest age is 37 years and the lowest is 19 years, so the range is given by Range = 37 - 19 = 18 years
Eg-2 : Martin scores 67, 100, 93, 81 and 96 in 5 different math tests. Find the range of his math scores.
sol:
Arranging Martin's scores in ascending order,
67 81 93 96 100
Here 100 and 67 are the maximum and minimum marks scored. Then the Range is Range = 100 - 67 = 33
In statistics, the quartiles are the three values that divide the ordered sample into 4 equal parts.
Steps to find out the quartile:
- Arrange the data points in ascending order.
- Calculate the median of the data set which is the 2nd quartile.
- Split the data into 2 halves: the values to the left of the median and the values to the right of it.
- Calculate the median of each half. These two medians are the 1st and the 3rd quartiles.
Figure-1 : 3 quartiles of 15 unknown data points
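The steps above can be sketched in Python as follows. This is only one convention (the median-of-halves method described in this post, with the median left out of both halves when the count is odd). The function name `quartiles` is an assumption for illustration, and the data are the ages from Eg-1.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)              # step 1: ascending order
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = data[:half]                # values left of the median
    # For an odd count the median itself is excluded from both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


ages = [37, 19, 31, 29, 26, 33, 21]
q1, q2, q3 = quartiles(ages)
print(q1, q2, q3)  # 21 29 33
```

Other tools, such as `statistics.quantiles` or `numpy.percentile`, use interpolation rules that can give slightly different quartiles for the same data.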
Eg-3 : The following were the hourly collections from a Salvation Army kettle at a local store one day in December: $19, $26, $25, $37, $32, $28, $22, $23, $29, $34, $39, and $31. Determine the first quartile and third quartile for the amount collected.
sol:
First let’s arrange the data points in ascending order
19 22 23 25 26 28 29 31 32 34 37 39
Here there is an even number of data points (i.e. n=12). So the median is the average of the two middle values: Median = (28 + 29) / 2 = 28.5
Now our data splits into 2 parts around the median, as shown
19 22 23 25 26 28 28.5 29 31 32 34 37 39
Computing the median for the 2 parts separately gives Q1 = (23 + 25) / 2 = 24 and Q3 = (32 + 34) / 2 = 33
19 22 23 25 26 28 | 29 31 32 34 37 39 |
Therefore our new sequence with the quartiles would be
19 22 23 24 25 26 28 28.5 29 31 32 33 34 37 39
Hence $24 and $33 are the 1st and the 3rd quartiles respectively.
A percentile is a value at or below which a given percentage of the data values fall.
Percentiles divide the ordered data into 100 equal parts.
So we can say that the quartiles mark the 25th, 50th and 75th percentiles, with the maximum at the 100th.
Figure-2 : Expressing quartiles as percentiles
Note: Median is the 2nd quartile and also the 50th percentile.
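The "at or below" idea in the definition can be illustrated with a small sketch. The function name `percent_at_or_below` is hypothetical, and the data are the ages from Eg-1.

```python
def percent_at_or_below(values, x):
    """Percentage of data points that are less than or equal to x."""
    if not values:
        raise ValueError("need at least one data point")
    count = sum(1 for v in values if v <= x)
    return 100 * count / len(values)


ages = [37, 19, 31, 29, 26, 33, 21]

print(percent_at_or_below(ages, 37))  # 100.0, the maximum
print(percent_at_or_below(ages, 29))  # about 57.1 (4 of the 7 ages)
```

With only 7 values, the median (29) has 4 of 7 values at or below it rather than exactly 50%. The percentile boundaries become cleaner as the sample grows.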
In statistics, the Interquartile Range (IQR) is the difference between the 3rd quartile and the 1st quartile of the given data sample.
Steps to find IQR:
- As usual arrange the data in ascending order.
- Find out the 3 quartiles of the data points.
- Then the IQR for the data points is given by IQR = Q3 - Q1
Eg-4 : The marks scored by students at WhY stats in two different tests are as follows. Compare them.
Test Scores for Test A: 69, 96, 81, 79, 65, 76, 83, 99, 89, 67, 90, 77, 85, 98, 66, 91, 77, 69, 80, 94
Test Scores for Test B: 90, 72, 80, 92, 90, 97, 92, 75, 79, 68, 70, 80, 99, 95, 78, 73, 71, 68, 95, 100
sol:
To compare two samples of the same kind, we need some summary statistics. So let’s start with the median.
Arranging the two samples in order
Test A | Test B |
---|---|
65, 66, 67, 69, 69, 76, 77, 77, 79, 80, 81, 83, 85, 89, 90, 91, 94, 96, 98, 99 | 68, 68, 70, 71, 72, 73, 75, 78, 79, 80, 80, 90, 90, 92, 92, 95, 95, 97, 99, 100 |
The median of Test A is (80 + 81) / 2 = 80.5 and the median of Test B is (80 + 80) / 2 = 80, so the median of Test A is 0.5 more than that of Test B.
Does this mean that students performed worse in the 2nd test?
Let’s check this mathematically using the IQR
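In test A
Lower half, gives the 1st quartile (Test A) | Upper half, gives the 3rd quartile (Test A) |
---|---|
65 66 67 69 69 76 77 77 79 80 | 81 83 85 89 90 91 94 96 98 99 |
Here Q1 = (69 + 76) / 2 = 72.5 and Q3 = (90 + 91) / 2 = 90.5. So, the Interquartile Range is given by IQR = 90.5 - 72.5 = 18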
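In test B
Lower half, gives the 1st quartile (Test B) | Upper half, gives the 3rd quartile (Test B) |
---|---|
68 68 70 71 72 73 75 78 79 80 | 80 90 90 92 92 95 95 97 99 100 |
Here Q1 = (72 + 73) / 2 = 72.5 and Q3 = (92 + 95) / 2 = 93.5. So, the Interquartile Range is given by IQR = 93.5 - 72.5 = 21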
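Both tests share the same 1st quartile (72.5), but the 3rd quartile of Test B is higher (93.5 against 90.5), so its larger IQR comes from higher scores in the upper half. Despite the slightly lower median, we can infer that the stronger half of the class did better in the 2nd test. In that sense they improved.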
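The medians and IQRs in Eg-4 can be checked with a short sketch. It assumes the median-of-halves quartile method used in this post. The helper names `quartiles` and `iqr` are illustrative, not from any library.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


test_a = [69, 96, 81, 79, 65, 76, 83, 99, 89, 67,
          90, 77, 85, 98, 66, 91, 77, 69, 80, 94]
test_b = [90, 72, 80, 92, 90, 97, 92, 75, 79, 68,
          70, 80, 99, 95, 78, 73, 71, 68, 95, 100]

for name, scores in (("Test A", test_a), ("Test B", test_b)):
    q1, q2, q3 = quartiles(scores)
    print(name, "Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", iqr(scores))
# Test A Q1: 72.5 median: 80.5 Q3: 90.5 IQR: 18.0
# Test B Q1: 72.5 median: 80.0 Q3: 93.5 IQR: 21.0
```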
The Minimum, the First Quartile, the Median, the Third Quartile and the Maximum are the five numbers often used to summarise a given data sample.
These data points are also known as The Five Point Summary.
Figure 3: 5 point summary for a 15 point unknown data sample
Eg-5 : A year ago, Angela began working at a computer store. Her supervisor asked her to keep a record of the number of sales she made each month.
The following data set is a list of her sales for the last 12 months. Find the 5 point summary of her sales:
34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37
sol:
Arranging the values in ascending order
1 | 11 | 15 | 19 | 20 | 24 | 28 | 34 | 37 | 47 | 50 | 57 |
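We can say that the maximum sales were 57 and the minimum were 1
1 11 15 19 20 24 | 28 34 37 47 50 57 |
The median is (24 + 28) / 2 = 26, and the two halves give Q1 = (15 + 19) / 2 = 17 and Q3 = (37 + 47) / 2 = 42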
Hence the 5 point summary is
Minimum | 1st Quartile | Median | 3rd Quartile | Maximum |
---|---|---|---|---|
1 | 17 | 26 | 42 | 57 |
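As a quick check of Eg-5, the five values can be computed with a small sketch. The function name `five_point_summary` is an assumption for illustration, and the quartiles again follow the median-of-halves method from this post.

```python
from statistics import median


def five_point_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return data[0], median(lower), median(data), median(upper), data[-1]


sales = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
print(five_point_summary(sales))  # (1, 17.0, 26.0, 42.0, 57)
```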
A boxplot is a standardized way of displaying the dataset based on a five-number summary.
Boxplots were introduced by the American statistician John Tukey around 1970 and became widely known after the publication of his book Exploratory Data Analysis in 1977.
They are widely used when comparing two or more categories, samples or even populations.
Figure-4: Visualizing the various parts of a Boxplot
Box plots are made of 5 key components
- Median
- Hinges: two hinges located at the lower and the upper quartiles denoted by Q1 and Q3, respectively.
- Fences: two limits beyond which a data point is treated as a potential outlier:
Lower Extreme = Q1 – 1.5(IQR),
Upper Extreme = Q3 + 1.5(IQR),
where IQR denotes the interquartile range, IQR = Q3 – Q1.
- Whiskers: two lines that extend from the hinges towards the fences, ending at the most extreme data points that still lie within the fences.
- Outliers: all individual points that lie beyond the lower and upper extremes, represented as dots.
Eg-6 : The School of WhYrus conducted an experiment on the amount of time people spent exercising every day. Here is a sample of 15 values. Interpret them.
0 minutes, 40 minutes, 60 minutes, 30 minutes, 60 minutes, 10 minutes, 45 minutes, 30 minutes, 300 minutes, 90 minutes, 30 minutes, 120 minutes, 60 minutes, 0 minutes, 20 minutes
sol:
First let’s arrange the data in ascending order
0 | 0 | 10 | 20 | 30 | 30 | 30 | 40 | 45 | 60 | 60 | 60 | 90 | 120 | 300 |
Now find out the median. With n=15 values, the median is the 8th value, so Median = 40
Then we find the middle point of each of the two groups separated by the median.
Q1 | Q3 |
---|---|
0 0 10 20 30 30 30 | 45 60 60 60 90 120 300 |
So the Interquartile range is IQR = Q3 - Q1 = 60 - 20 = 40
Then calculate the extreme values as follows
Lower Extreme = Q1 – 1.5(IQR) = 20 - 1.5*40 = 20 - 60 = -40
Upper Extreme = Q3 + 1.5(IQR) = 60 + 1.5*40 = 60 + 60 = 120
Only 300 lies beyond these extremes, so we have one potential outlier (i.e. 300). The value 120 sits exactly on the upper extreme, so it is not flagged.
Now summing up all of this information into a boxplot gives us
Figure-5 : Five point summary of individuals and their exercise time
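The fence calculation from Eg-6 can be sketched as below. The function name `fences` and the parameter `k` are assumptions for illustration. The multiplier 1.5 is the one used in this post, and the quartiles follow the median-of-halves method.

```python
from statistics import median


def fences(values, k=1.5):
    """Return (lower extreme, upper extreme) as Q1 - k*IQR and Q3 + k*IQR."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    # Skip the middle value when the count is odd.
    q3 = median(data[half + n % 2:])
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]

lower, upper = fences(minutes)
outliers = [m for m in minutes if m < lower or m > upper]
print(lower, upper, outliers)  # -40.0 120.0 [300]
```

Plotting libraries apply their own quartile rules, so a drawn boxplot may place the box edges slightly differently from this hand calculation.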
In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
It is nothing but the square root of variance.
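Population Std | Sample Std |
---|---|
$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$ | $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ |

Here $\mu$ and $N$ are the population mean and size, while $\bar{x}$ and $n$ are the sample mean and size.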
Steps to calculate the Standard Deviation
- First figure out the mean value of the data points.
- Calculate the sum of the squared differences between each data point and the mean value.
- Now divide this sum by one less than the number of data points (i.e. n-1).
- And finally take the square root of this to find out the standard deviation of the data points.
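The four steps above translate almost line by line into Python. The function name `sample_std` is an assumption for illustration. The result is compared with `statistics.stdev` from the standard library, which also divides by n-1. The data are Martin's scores from Eg-2.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation, following the four steps in the text."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(values) / n                                # step 1: mean
    squared_diffs = sum((x - mean) ** 2 for x in values)  # step 2: sum of squares
    variance = squared_diffs / (n - 1)                    # step 3: divide by n-1
    return math.sqrt(variance)                            # step 4: square root


scores = [67, 100, 93, 81, 96]
print(sample_std(scores))         # about 13.43
print(statistics.stdev(scores))   # same value from the standard library
```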
Eg-7 : A reporter compares a week of high temperatures (in Fahrenheit) in two different cities. The data looks like this:
Day | City A Forecast | City B Forecast |
---|---|---|
Monday | 95 | 91 |
Tuesday | 93 | 81 |
Wednesday | 95 | 95 |
Thursday | 94 | 91 |
Friday | 96 | 86 |
Saturday | 94 | 82 |
Sunday | 95 | 78 |
sol:
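At first he figures out that the mean temperature in City A is 94.6 F and the mean temperature in City B is 86.3 F.
The means alone do not show how consistent each city's temperatures are, so he calculates the spread for each city individually.
That is
For City A
$$s_A = \sqrt{\frac{\sum_{i=1}^{7}(x_i - \bar{x}_A)^2}{7-1}} = \sqrt{\frac{5.71}{6}} \approx 0.98$$
For City B
$$s_B = \sqrt{\frac{\sum_{i=1}^{7}(x_i - \bar{x}_B)^2}{7-1}} = \sqrt{\frac{235.43}{6}} \approx 6.26$$
City A's temperatures vary far less from day to day than City B's, which supports the view that City A’s forecasts are more reliable than City B’s forecasts.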
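The comparison in Eg-7 can be reproduced with the standard library. This sketch assumes the sample formula (dividing by n-1), which is what `statistics.stdev` implements. The variable names are illustrative.

```python
from statistics import mean, stdev

# Daily high temperatures (F), Monday to Sunday, from the table in Eg-7.
city_a = [95, 93, 95, 94, 96, 94, 95]
city_b = [91, 81, 95, 91, 86, 82, 78]

for name, temps in (("City A", city_a), ("City B", city_b)):
    print(f"{name}: mean = {mean(temps):.1f} F, sample std = {stdev(temps):.2f} F")
# City A: mean = 94.6 F, sample std = 0.98 F
# City B: mean = 86.3 F, sample std = 6.26 F
```

The standard deviation for City B is roughly six times larger, which is the numerical form of "more spread out".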
When a sample is small compared to the population, a sample statistic computed with the population formula becomes a biased estimate of the population parameter.
Hence, by dividing by n-1 (the degrees of freedom) instead of n, we can achieve a statistic which is closer to the true population value.
As the sample size reduces, there is a greater chance of underestimating the true population variance.
I recommend you watch the Khan Academy review and try those simulations before reading this.
One of the simulations shows how the variance estimate approaches its true value when it uses n-1 degrees of freedom.
Another one tells us that for a sample of size n, the biased estimate (dividing by n) approaches (n-1)/n times the population variance.
Calculating the unbiased estimate would be as follows
$$s^2 = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
And in the last simulation, the presenter records the sample estimate for different values of a using the formula
$$\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n + a}$$
When we generate more and more samples, the best estimate is recorded when a is close to -1. Also, we will overestimate when a is less than -1 and underestimate when a is greater than -1.
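A rough simulation in the same spirit can be written in a few lines. This is a sketch under stated assumptions, not a reproduction of the Khan Academy simulation: the population is taken to be standard normal (true variance 1), the sample size is 5, and the names `variance_with_offset` and `average_estimate` are made up for illustration.

```python
import random


def variance_with_offset(sample, a):
    """Sum of squared deviations from the sample mean, divided by n + a."""
    n = len(sample)
    m = sum(sample) / n
    return sum((x - m) ** 2 for x in sample) / (n + a)


def average_estimate(a, n=5, trials=20000, seed=0):
    """Average the estimate over many samples drawn from a N(0, 1) population."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        total += variance_with_offset(sample, a)
    return total / trials


# The true population variance is 1 in this setup.
for a in (-2, -1, 0):
    print(a, round(average_estimate(a), 3))
```

With these settings the average should land near 1 for a = -1, near (n-1)/n = 0.8 for a = 0, and above 1 for a = -2, which matches the overestimate and underestimate pattern described above.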