Correlation

Correlation is a statistical procedure designed to measure the strength and direction of the linear relation between two variables. The most common measure of correlation is the Pearson product-moment correlation coefficient, r.
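
For readers who want to see what goes into r, here is a minimal Python sketch of the usual textbook formula (the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations). It is not how the course software is implemented; the function name `pearson_r` and the small data set are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores, invented for illustration (not the ChivQues data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3))
```

Because the numerator can be positive or negative while the denominator is always positive, the sign of r gives the direction of the relation and its size gives the strength.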

Data for Example: Gender Questionnaire
For this exercise, you'll be working with some questionnaire data I collected from 411 students. You can access the data by typing the following command in the console and pressing “Enter”:

data(ChivQues)

The questionnaire included five measures of gender role attitudes:

  1. Chivalry (chiv), a measure of the degree to which a person endorses the idea that men have more of an obligation to protect and provide for women than vice-versa.
  2. Moral Virtue (MVIRT), a measure of the degree to which a person believes women are more morally virtuous (have a better conscience, more morally "pure," etc.) than men.
  3. Sexual Virtue (SVIRT), a measure of the degree to which a person believes that women are more sexually virtuous (think about sex less often, don't think about others in sexual ways, etc.) than men.
  4. Attitudes toward Women Scale (AWS), a published measure of conservative or traditional gender role attitudes (e.g., women should not work outside the home). I gave this measure to only half of the sample.
  5. Female Agency (agency), a measure of the degree to which a person believes that women are as competent and as well-suited to positions of authority as men are.

Positive Correlations

r ranges in value from -1 to +1. A positive r indicates that high values on one variable tend to be found with high values on the other variable (and low values with low values). For example, the scatterplot below shows a correlation of r = +0.5 between the Attitudes Toward Women Scale (AWS), a measure of conservative gender roles, and a measure of the belief that women are more "Morally Virtuous" (MVIRT) than men are. You can get this scatterplot by selecting Analysis -> Correlation, putting “MVIRT” into the “Variables” box and “AWS” into the “With” box, pressing “Plots” and selecting “Scatterplots”, then clicking “OK” and “Run”.

The dots in the plot above are slightly transparent so that overlapping points show up as darker. The solid blue line in the graph above is the “line of best fit”, the line that minimizes the sum of the squared vertical distances between the data points and the line itself. The “best fit line” is a useful way of representing the linear trend in a scatterplot. It helps capture the pattern indicated by r = +0.5: higher scores on AWS are found with higher scores on MVIRT. The gray shading around the blue line represents the 95% confidence interval around the line of best fit. You can have 95% confidence that the line of best fit for the population (the “true” best fit line) is within the shaded area. Note that the shaded area does not show where 95% of the data points are; it corresponds to where the line of best fit would be drawn, not to where data are likely to appear.
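
The line of best fit is conventionally the ordinary least-squares line, which minimizes the sum of the squared vertical distances from the points to the line. The Python sketch below computes its slope and intercept for a small made-up data set and checks that a slightly different line fits worse. The function names and data are hypothetical, and this is only an illustration, not the plotting code used by the course software.

```python
def best_fit_line(x, y):
    """Return (slope, intercept) of the least-squares line predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_errors(x, y, slope, intercept):
    """Sum of squared vertical distances between the points and a given line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical data, invented for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = best_fit_line(x, y)
sse_best = sum_squared_errors(x, y, slope, intercept)
# Any other line, such as one with a steeper slope, has a larger sum of squares
sse_other = sum_squared_errors(x, y, slope + 0.2, intercept)
print(slope, intercept, sse_best, sse_other)
```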

You should also get this output:

Correlation
Pearson's product-moment correlation

                  MVIRT
AWS   cor         0.500
      95% CI      [0.388, 0.597]
      N           201
      t (df)      8.143 (199)
      p-value*    <0.001

Notes:
H0: correlation = 0
*HA: two.sided

That output includes a number of pieces of information:

cor: The top row contains the correlation coefficient, r, here equal to +0.500.
95% CI: Under that is the 95% confidence interval of the correlation. You can be 95% confident that the “true” correlation between AWS and MVIRT (defined as the correlation you would get if your sample became infinitely large) is somewhere between 0.388 and 0.597. The confidence interval tells you how precise the estimate of r = 0.5 is.
N: The number of observations on which the correlation is based.
t (df): The test statistic and degrees of freedom (N - 2) used to test whether the correlation coefficient of 0.5 is significantly different from 0. In this case, t(199) = 8.143.
p-value: The probability of obtaining a correlation at least as far from 0 as r = 0.5 (in either direction, because the test is two-sided) with an N of 201, if the null hypothesis is true (the population correlation is 0). Here that probability is less than 1 in 1,000 (p < .001).
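
The t statistic and the confidence interval can both be derived from r and N. The Python sketch below uses the standard formula t = r * sqrt(N - 2) / sqrt(1 - r^2) and a confidence interval based on the Fisher z transformation. Whether the course software computes its interval this way is an assumption, although the result agrees with the output above. The function names are made up for illustration.

```python
import math
from statistics import NormalDist

def correlation_t(r, n):
    """t statistic and degrees of freedom for testing H0: correlation = 0."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df

def correlation_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z transformation."""
    z = math.atanh(r)                      # transform r to the z scale
    se = 1 / math.sqrt(n - 3)              # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Build the interval on the z scale, then transform back to the r scale
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

t, df = correlation_t(0.5, 201)
low, high = correlation_ci(0.5, 201)
print(f"t({df}) = {t:.2f}")
print(f"95% CI [{low:.3f}, {high:.3f}]")
```

With r = 0.5 and N = 201, the interval rounds to [0.388, 0.597], and t comes out very close to the reported 8.143. A small difference in t is expected because the r entered here is rounded to three decimals.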


You can request fewer pieces of information by clicking on “Options” in the Correlation dialog and de-selecting items in the “Output” area.

APA Style
To report the results above in APA style, you could write:

There was a significant positive correlation between AWS and moral virtue, r(199) = .50, p < .001. This correlation indicates that people with traditional gender role attitudes tend to believe that women are more morally virtuous than men are.

Note that the number in parentheses is the degrees of freedom (N - 2 = 199), not the sample size.

Remember to always add a sentence providing an interpretation after you report statistical output.

Negative Correlations
The scatterplot below shows the relation between the AWS and people's beliefs in women's "Agency", in which high scores indicate a belief that women are competent and well-suited to positions of authority. You can get that plot by re-running the correlation above with “AGENCY” in place of “MVIRT”. The correlation in the scatterplot below is r = -0.8.
The scatterplot and the negative correlation indicate that high values on the AWS tend to be found with LOW scores on Agency.

Strength of the Correlation
The absolute value of r indicates how "strong" it is. The farther it is from 0, the stronger the pattern. In the two graphs above, one correlation has an absolute value of 0.5, the other has an absolute value of 0.8. Looking at the scatterplots, you can see that the pattern - the linear relation between the two variables - is stronger for the second one. A stronger correlation means that it is more accurate to describe the data in terms of a straight line. As the data become more spread out from that line, the correlation weakens (r moves closer to 0).

r-squared. One way to express the strength of a correlation is to square the r-value. r² is the proportion of the variance or "information" in one variable that can be "explained" or "predicted" from the other variable, assuming a linear relation between them. If the correlation between AWS and MVIRT is r = 0.5, then r² is 0.25, which means that 25% of the variance in people's beliefs about gender differences in moral virtue can be explained by their belief in conservative gender roles.
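
To make "variance explained" concrete, the Python sketch below computes r-squared two ways for a small made-up data set: by squaring r, and as 1 minus the ratio of the leftover (residual) variation around the least-squares line to the total variation in y. The two numbers agree. The function name and data are hypothetical and only for illustration.

```python
import math

def r_squared_two_ways(x, y):
    """Return (r**2, proportion of variance in y explained by the best-fit line)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)   # total variation in y

    r = sxy / math.sqrt(sxx * syy)

    # Least-squares line and the variation left over around it
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

    explained = 1 - residual / syy
    return r ** 2, explained

# Hypothetical data, invented for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
squared, explained = r_squared_two_ways(x, y)
print(round(squared, 3), round(explained, 3))
```
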

Curvilinear Relations
Correlation is designed to measure the linear relation between variables. A linear relation is very simple: as one variable goes up, the other tends to go up (positive correlation) or go down (negative correlation) at a steady rate. Correlation can miss a relation between variables that is non-linear (i.e., that cannot be described by a straight line). For example, look at the following data I made up describing the relation between the number of aspirin a person takes and the amount of relief that person feels:
As the number of aspirin taken increases from 1 to 5 aspirin, relief increases. However, after 5 aspirin, adding more aspirin doesn't increase relief; it decreases it. There is not a linear relation between aspirin and relief. Taking 9 aspirin is NOT better than taking 4 aspirin, as the graph above indicates.

The correlation between aspirin and relief in the example above is exactly r = 0. Imagine that someone asked what the relation was between aspirin and relief. If you relied only on the correlation coefficient, you might be tempted to answer, "no relation at all." But if you look at the scatterplot, you should reach a different conclusion. The lesson here is that you must always plot your data. If the data are curvilinear (not linear but curved), you should not rely on the correlation coefficient alone - it will not accurately capture the pattern in the data.
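
The Python sketch below shows how a perfectly regular curved pattern can produce r = 0. The numbers are hypothetical (they are not the data plotted above): relief rises up to 5 aspirin and then falls symmetrically. The correlation over the whole range is 0, even though each half taken alone shows a strong relation.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: an inverted-U pattern that peaks at 5 aspirin
aspirin = list(range(0, 11))
relief = [25 - (a - 5) ** 2 for a in aspirin]

r_all = pearson_r(aspirin, relief)               # whole range
r_rising = pearson_r(aspirin[:6], relief[:6])    # 0 to 5 aspirin
r_falling = pearson_r(aspirin[5:], relief[5:])   # 5 to 10 aspirin
print(r_all, round(r_rising, 2), round(r_falling, 2))
```

The positive trend in the first half and the negative trend in the second half cancel out, which is why the overall coefficient says nothing about a pattern that is obvious in a plot.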

So, what do you do if you detect a curvilinear relation? You can report the r-value, but make sure you also state that the scatterplot indicated a curvilinear relation and attempt to describe it. In the example above, you could say that there was a generally positive relation between number of aspirin and relief as the number of aspirin increased from 0 to 5, but that between 5 and 10 aspirin the relation became negative.

What if the correlation is non-significant and there is no discernible pattern in the scatterplot? In that case, it is best to report the correlation value, identify it as non-significant, and not provide any interpretation: "The correlation between shoe size and IQ was not significant, r(28) = .13, p = .49."

Although it is beyond the scope of this course, there are ways to test for particular non-linear relations using Analysis -> Linear Model and adding polynomial terms. For example, the pattern given above is the result of a fairly common non-linear function called a quadratic, or second-order polynomial. For now, you can just describe the non-linear pattern in words.
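
For the curious, here is a rough Python sketch of what "adding a polynomial term" means: it fits relief = b0 + b1*aspirin + b2*aspirin^2 by least squares. This is not the course software's Linear Model procedure. The data are hypothetical inverted-U values, and the function name `fit_quadratic` is made up for illustration.

```python
def fit_quadratic(x, y):
    """Least-squares coefficients (b0, b1, b2) for y = b0 + b1*x + b2*x**2."""
    # Each row holds the three predictors: constant, x, and x squared
    rows = [[1.0, a, a * a] for a in x]
    # Normal equations: (X'X) b = X'y
    lhs = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    rhs = [sum(row[i] * target for row, target in zip(rows, y))
           for i in range(3)]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(lhs[k][col]))
        lhs[col], lhs[pivot] = lhs[pivot], lhs[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for k in range(col + 1, 3):
            factor = lhs[k][col] / lhs[col][col]
            for j in range(col, 3):
                lhs[k][j] -= factor * lhs[col][j]
            rhs[k] -= factor * rhs[col]
    # Back-substitution, starting from the last coefficient
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        known = sum(lhs[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (rhs[i] - known) / lhs[i][i]
    return tuple(coef)

# Hypothetical data: relief peaks at 5 aspirin, then declines
aspirin = list(range(0, 11))
relief = [25 - (a - 5) ** 2 for a in aspirin]

b0, b1, b2 = fit_quadratic(aspirin, relief)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

A negative coefficient on the squared term (b2) is what signals an inverted-U shape: relief rises at first and then falls as the number of aspirin keeps increasing.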