Scatter Diagrams and Correlation
Scatter diagrams, lines of best fit, Spearman's rank correlation, PMCC
Scatter Diagrams
A scatter diagram plots pairs of values (x, y) to investigate whether there is a relationship (correlation) between two variables.
Correlation: A measure of the strength and direction of the linear relationship between two variables.
Positive correlation: As one variable increases, the other also increases. The points slope upwards from left to right.
Negative correlation: As one variable increases, the other decreases. The points slope downwards from left to right.
No correlation: There is no clear linear relationship between the variables. The points are scattered randomly.
Outlier: A point that does not fit the general pattern of the scatter diagram. It should be identified and investigated.
Correlation is not causation: Two variables being correlated does not mean that one causes the other. A third variable (a confounding variable) may explain the relationship.
Line of Best Fit
A line of best fit (regression line) summarises the relationship between two variables and can be used to make predictions.
Drawn by eye, it is a straight line through the mean point (x̄, ȳ) with roughly equal numbers of points above and below the line.
Interpolation: Using the line of best fit to predict a value within the range of the data. This is generally reliable, provided the correlation is reasonably strong.
Extrapolation: Using the line of best fit to predict a value outside the range of the data. This is unreliable because the relationship may not continue beyond the observed values.
The line of best fit must pass through the mean point (x̄, ȳ). Always calculate and plot this point first before drawing the line.
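The mean-point rule and the difference between interpolation and extrapolation can be checked numerically. The sketch below assumes the least-squares regression line, which is one common way to compute a line of best fit rather than drawing it by eye. The function names `best_fit_line` and `predict` and the revision-hours data are made up for illustration.

```python
from statistics import mean

def best_fit_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line y = intercept + gradient * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    gradient = s_xy / s_xx
    # Choosing the intercept this way forces the line through (x_bar, y_bar).
    intercept = y_bar - gradient * x_bar
    return gradient, intercept

def predict(x, xs, gradient, intercept):
    """Return (predicted y, label), where the label says whether x lies inside the data range."""
    label = "interpolation" if min(xs) <= x <= max(xs) else "extrapolation"
    return intercept + gradient * x, label

# Made-up data: hours of revision (x) and test score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

gradient, intercept = best_fit_line(hours, scores)
print(predict(3.5, hours, gradient, intercept))  # inside 1..5, so labelled interpolation
print(predict(10, hours, gradient, intercept))   # outside 1..5, so labelled extrapolation
```

Substituting x̄ = 3 into the fitted line returns ȳ = 60 (up to rounding), which is the mean-point property described above.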
Spearman's Rank Correlation Coefficient
Spearman's Rank Correlation Coefficient (SRCC) measures the strength of the correlation between two variables using their ranks rather than their actual values.
rₛ = 1 − (6Σd²) ÷ (n(n² − 1))
Where d = difference in ranks for each pair, n = number of data pairs
- rₛ = +1: perfect positive correlation
- rₛ = −1: perfect negative correlation
- rₛ = 0: no correlation
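- 0.8 ≤ rₛ < 1: strong positive correlation
- 0.5 ≤ rₛ < 0.8: moderate positive correlation
- 0 < rₛ < 0.5: weak positive correlation
- Negative values are read the same way, with "negative" in place of "positive". These bands are a rough guide only.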
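Example: Calculate Spearman's rank for 3 pairs with d values: 1, 2, 0.
Σd² = 1² + 2² + 0² = 5 and n = 3, so rₛ = 1 − (6 × 5) ÷ (3 × (3² − 1)) = 1 − 30 ÷ 24 = −0.25, a weak negative correlation.
These d values only illustrate the arithmetic: the signed differences between two real sets of ranks always sum to zero, which no choice of signs on 1, 2 and 0 can achieve.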
If two values are equal, give them the average of the ranks they would have occupied. For example, if two values tie for 3rd and 4th place, both get rank 3.5.
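The ranking and tie rule can be sketched in code. This is a minimal illustration using the formula rₛ = 1 − (6Σd²) ÷ (n(n² − 1)). The names `average_ranks` and `spearman` and the judges' scores are made up. It ranks the smallest value as 1, although ranking from the largest gives the same rₛ as long as both variables are ranked the same way. With tied ranks this formula is only an approximation.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = ((start + 1) + (end + 1)) / 2  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rank correlation via 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

# Two values tie for 3rd and 4th place, so both receive rank 3.5.
print(average_ranks([10, 20, 30, 30, 50]))  # [1.0, 2.0, 3.5, 3.5, 5.0]

# Made-up scores from two judges for five contestants.
judge_a = [7, 5, 9, 6, 8]
judge_b = [6, 4, 9, 7, 8]
print(spearman(judge_a, judge_b))
```

For the judges' data the rank differences are 1, 0, 0, −1, 0, so Σd² = 2 and rₛ = 1 − 12 ÷ 120 = 0.9, a strong positive correlation.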
Product Moment Correlation Coefficient (PMCC)
The PMCC (Pearson's r) measures the strength of the linear correlation between two variables using the actual data values.
r takes a value between −1 and +1. r = +1 means perfect positive linear correlation; r = −1 means perfect negative linear correlation; r = 0 means no linear correlation.
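The notes above do not give a formula for the PMCC. A standard form is r = Sxy ÷ √(Sxx × Syy), where Sxy = Σ(x − x̄)(y − ȳ), Sxx = Σ(x − x̄)² and Syy = Σ(y − ȳ)². The sketch below assumes that form; the function name `pmcc` and the data are illustrative only.

```python
from math import sqrt
from statistics import mean

def pmcc(xs, ys):
    """Pearson's r computed from the actual data values as Sxy / sqrt(Sxx * Syy)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data lying exactly on straight lines.
print(pmcc([1, 2, 3, 4], [2, 4, 6, 8]))  # points on a rising line: r is +1 (up to rounding)
print(pmcc([1, 2, 3, 4], [8, 6, 4, 2]))  # points on a falling line: r is -1 (up to rounding)
```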
Use SRCC when: data is ranked, data is not normally distributed, or there are outliers.
Use PMCC when: data is continuous and approximately normally distributed with no extreme outliers.
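The effect of an outlier on the two coefficients can be seen with a small made-up data set. This sketch relies on the fact that, when there are no tied values, Spearman's coefficient equals Pearson's r applied to the ranks. The helper names `pearson` and `simple_ranks` are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson's r as Sxy / sqrt(Sxx * Syy)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def simple_ranks(values):
    """Ranks 1..n (smallest = 1); only valid when no two values are equal."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Made-up data: y rises steadily with x except for one extreme final value.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 3, 4, 5, 6, 60]

r = pearson(xs, ys)                                 # uses the actual values
r_s = pearson(simple_ranks(xs), simple_ranks(ys))   # uses the ranks only

print(round(r, 2), round(r_s, 2))
```

Here the ranks of x and y agree exactly, so the rank-based coefficient is 1, while the single extreme value pulls the PMCC down to roughly 0.7.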