Prior to beginning work on this discussion forum, read Chapter 8 in the course textbook and the Instructor Guidance for Week 5 and review the
Correlation Doesn’t Equal Causation: Crash Course Statistics #8
and
The Danger of Mixing Up Causality and Correlation: Ionica Smeets at TEDxDelft
videos. In this post, you will be challenged to look at how statistical tests, such as correlation, are commonly used and the possible limitations of such analyses. Additionally, you will need to explain statistical concepts; accurately interpret results of statistical tests; and assess assumptions, limitations, and implications associated with statistical tests.
Much has been written about the relationship between students’ SAT test scores and their families’ income. Generally speaking, there is a strong positive correlation between income and SAT scores. Consider and discuss the following questions as you respond (100 words per question):
PSY 325 Week 5 Guidance
Statistics for the Behavioral & Social Sciences
This week, you will be reading Chapters 8 and 9 of the textbook (Tanner, 2016). Other
required resources are videos by Crash Course (2018) and TEDx Talks (2012) about
correlation and causation. There are also several recommended resources shown on the
Week Five page, which you will find helpful in understanding this week’s concepts. I urge
you to visit Trochim’s (2006) web page on correlation listed there. The discussion topic is
correlation. There is also an interactive learning activity on correlation and regression. As
this is the final week of the course, there is a final exam instead of a weekly review. The
final exam has two parts. Part 1 consists of 20 multiple-choice and true/false questions
covering all of the concepts we have studied during the course. In Part 2, you will write a
critique of an assigned research study. Your instructor will post the assigned study in an
announcement. After completing the readings and activities for the week, you will be
able to:
• Describe correlations and regression analyses.
• Analyze the relationship between correlations and predictions.
• Critically review and evaluate a quantitative research article involving statistical analyses covered in the course.
Keep these objectives in mind as you read and review the required chapters and other
materials.
The major concepts we cover this week are linear correlation and regression. Correlation
is a measure of the relationship between two variables. Simple linear regression is like
correlation, but it goes a step further and develops an equation using the two variables,
with one being designated as independent and the other dependent. The regression
equation can be used to predict or forecast the value of the dependent variable given a
specific value for the independent variable.
To illustrate correlation (the relationship between two variables), we could use a scatter
plot (also called a scatter diagram or scattergram). The dependent (outcome) variable
goes on the Y (vertical) axis, and the independent variable goes on the X (horizontal) axis.
If you took algebra, you may remember plotting points from paired (X,Y) coordinates and
drawing a trend line or “line of best fit” to approximate the pattern of the dots on the
graph. If there is a linear relationship between the two variables, the points would tend
to cluster around a straight line with a slope (steepness of slant) and direction. A positive
relationship (positive slope) means that as the value of one variable increases, the value
of the other variable increases also. In this case, the line goes upward from left to right. A
negative relationship (negative slope) is indicated when the value of one variable
increases while the value of the other variable decreases. This would show on the graph
as a line going downward from left to right.
Here is a YouTube video on correlation by Michael Herman (2012):
Excel Statistics 05 – Calculating Correlations with Excel
You may also remember from algebra that the equation for a straight line is Y = mX + b,
where Y is the dependent variable (also called the criterion in regression), m is the slope
of the line, X is the independent variable (also called the predictor in regression), and b is
the Y-intercept (the point where the line crosses the vertical axis, or the value of Y when
X = 0). Linear regression is basically the same thing, except that different letters are used
and the terms are presented in a different order. Also, because we are dealing with
human situations in regression, we must add an error term to account for unknown
factors, measurement errors, and human or mechanical variation.
In a regression equation, in place of X and Y, we might use more descriptive variable
names. The error term may be represented by the lower case letter e. We calculate an
intercept (b0) and slope (b1) based on real data available to us, then we use the resulting
equation to predict what the outcome would be if we plugged in a new value for the
independent variable.
Algebraic equation:  Y = mX + b
Regression equation: Y = b0 + b1X + e
Note: In regression, the intercept term comes first on the right side of the equation.
The simple linear regression equation will yield an average (“mean”) value of the outcome
or criterion variable (Y) based on a given value of the predictor variable (X). Every
individual case with the same X value will not have exactly the same Y value, but the
values should fall into a reasonable range. The error term (e) represents the deviation of
individual cases from the mean outcome. These deviations are also called residuals.
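To make the prediction/residual distinction concrete, here is a minimal Python sketch; the coefficient and score values are invented for illustration, not taken from the textbook:

```python
# Hypothetical regression equation: Y = b0 + b1*X + e
# All numbers below are made up for illustration.
b0 = 2.0   # intercept
b1 = 0.5   # slope

def predict(x):
    """Mean predicted outcome (Y-hat) for a given predictor value X."""
    return b0 + b1 * x

y_hat = predict(10)            # 2.0 + 0.5 * 10 = 7.0

# An individual case with X = 10 might actually score 7.8;
# its residual is that case's deviation from the predicted mean.
y_actual = 7.8
residual = y_actual - y_hat    # about 0.8
```

Every case with X = 10 receives the same predicted mean (7.0); the residuals capture how far individual cases fall from it.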
Using the least squares method of estimating the intercept (b0) and slope (b1) assures that
the regression line will be positioned so that the individual data points are all as close to
the line as possible. In other words, the amount of individual deviation from the line is
minimized, so that it really is the best fitting line of all the possible lines that could be
drawn through the scatter of data points. The least squares calculations are shown step
by step in Section 9.2 of the textbook (Tanner, 2016). You should read this and try to
understand it, but you will not be required to calculate regression coefficients manually.
You can use Excel to do these calculations automatically. Instructions for doing this are
in Section 9.3 of the chapter and in the accompanying video in the electronic textbook.
In addition, here is a video showing how to do correlation and regression in VassarStats
using the examples in Section 8.4 and Section 9.3 (Murphy, 2018): Correlation and
Regression in VassarStats
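The textbook demonstrates these calculations in Excel and VassarStats. As a supplementary sketch (not part of the course materials), the same least-squares estimates can be computed by hand in Python; the data here are invented and perfectly linear, so the fit is exact:

```python
# Least-squares estimates of the intercept (b0) and slope (b1).
# Sample data are invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: sum of cross-products of deviations over sum of squared X deviations.
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
# Intercept: the least-squares line passes through (mean_x, mean_y).
b0 = mean_y - b1 * mean_x

print(b0, b1)   # 0.0 2.0 for this data
```

With real, noisy data the same two formulas still apply; the residuals are simply no longer zero.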
Last week, we learned about ANOVA and the F ratio. This procedure is also related to
correlation and regression. The F ratio which results from ANOVA is a measure of the
strength of the regression relationship. Another useful statistic is the coefficient of
determination, r2, which is the proportion of variation in the outcome variable that is
explained by the predictor variable. Finally, the square root of the coefficient of
determination, r, is the correlation coefficient. While r2 must always be a positive
number, r can be either positive or negative. The sign denotes the direction of the
relationship between the variables.
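As a small numeric sketch (the data are invented), r, r², and the sign relationship described above can be computed directly from deviation scores:

```python
# Pearson r and the coefficient of determination r^2 from deviation scores.
# The data are invented for illustration.
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # 3.0
sxx = sum((x - mean_x) ** 2 for x in xs)                        # 5.0
syy = sum((y - mean_y) ** 2 for y in ys)                        # 5.0

r = sxy / math.sqrt(sxx * syy)   # 0.6: positive, so the relationship is direct
r_squared = r ** 2               # 0.36: 36% of the variation explained

# r_squared is always non-negative; the sign of r gives the direction.
```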
We have also learned about hypothesis testing and statistical significance. When we do
regression, we can determine the statistical significance of the regression coefficients and
the relationship between them. As described at the end of Section 9.3, several assumptions
are inherent in the process. If these assumptions are not reasonable, the results of the
hypothesis test will not be reliable. The assumptions are that:
• the variables are interval or ratio scale;
• the variables are normally distributed in the population;
• the mean of all potential error terms is zero;
• the variance is constant and does not change at different levels of the independent variable; and
• the error term for one individual case does not depend on the error term for another case (called independence of observations).
The due dates for your assignments are listed on the Overview page for Week 5 as well
as in the Course Guide. If you have any questions about this week’s readings or
assignments, email your instructor or send a message through the Canvas inbox. For
assistance with mastering the course concepts, consider using the free 24/7 Tutoring on
Demand in the classroom.
References
• CrashCourse. (2018, March 14). Correlation doesn’t equal causation: Crash Course statistics #8 [Video]. YouTube. https://youtu.be/GtV-VYdNt_g
• Herman, M. (2012). Excel statistics 05 – Calculating correlations with Excel [Video]. YouTube. https://www.youtube.com/watch?v=wY4S6F2k8no
• Murphy, P. F. (2018). PSY325 correlation and regression in VassarStats [Video]. Kaltura. https://ashford.mediaspace.kaltura.com/media/PSY325+Correlation+and+Regression+in+VassarStats/1_069y0sz0
• Tanner, D. (2016). Statistics for the behavioral & social sciences (2nd ed.). Bridgepoint Education. https://content.uagc.edu
• TEDx Talks. (2012, November 5). The danger of mixing up causality and correlation: Ionica Smeets at TEDxDelft [Video].
• Trochim, W. M. K. (n.d.). Correlation. Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/statcorr.php
8
Correlation
Chapter Learning Objectives
After reading this chapter, you should be able to do the following:
1. Explain the hypothesis of association.
2. Interpret the correlation coefficient.
3. List the Pearson correlation requirements.
4. Describe what the coefficient of determination explains.
5. Explain the variables involved in the point-biserial correlation.
6. Describe the applications for the Spearman correlation.
© 2016 Bridgepoint Education, Inc. All rights reserved. Not for resale or redistribution.
Introduction
Correlation, the concept of a relationship or dependence between variables, transcends statistical analysis. Cloudy days are related to (correlated with) cooler temperatures. Natural disasters
are related to declines in the stock market. An impending test is related to the need to study, and
grinding noises in the engine compartment of a car are usually related to repair bills.
Some relationships are stronger than others, so statistical procedures have been developed
to quantify, or numerically gauge, the strength of the relationship between two variables. The
numerical indicators are called correlation coefficients, and one of the most common is the
Pearson correlation coefficient, which indicates the strength of the relationship between
interval- or ratio-scale variables. The name Pearson refers to Karl Pearson, whose impact not
just on studying correlation but on statistical analysis generally may be greater than that of
any other individual.
In the early years of the 20th century, Pearson founded the first department of statistical analysis at University College London. Under Pearson’s direction, the department attracted, among
others, William Sealy Gosset of t test fame; Ronald Fisher, who produced analysis of variance;
and Charles Spearman, for whom an alternative correlation coefficient is named, as well as an
elegant statistical procedure based on correlation called factor analysis. To put it succinctly, it is
difficult to overstate the impact that Pearson had on the evolution of statistical analysis.
A man of fierce independence, Pearson centered his education at Cambridge in religion and
philosophy rather than mathematics. As a student of religion, he sued the university over the
compulsory chapel attendance required of all undergraduates. Winning his suit brought a
change to university rules—after which Pearson chose to attend chapel. His graduate work (in
Germany) emphasized literature, and it is a testimony to his extraordinary breadth of talent
that his greatest contributions would be in statistical analysis. Pearson was a contemporary
of Einstein, who sought a grand theory that would unite all of physics. Pearson tried to do the
same with mathematics. That both men were disappointed in these efforts should not detract
from what they did accomplish. Although Pearson’s associations with his colleagues were not
always harmonious, he and the others who found an academic home in his department virtually defined modern quantitative analysis. Whether or not they realize it, almost all of those
who crunch numbers for any length of time rely on their work.
8.1 The Hypothesis of Association
Previous chapters concentrated on tests of significant difference. The z test, the t tests, analysis of variance, and the repeated-measures designs test the differences between groups. They
all fall under a general assumption referred to as the hypothesis of difference. But some
kinds of analyses do not involve questions about whether there are significant differences
between groups.
If a psychologist asks about the relationship between birth order and achievement motivation among siblings or about the connection between the amount of time children read and
their school grades, the subject of research concerns relationships rather than differences.
Those questions call for procedures connected to the hypothesis of association, and when
results are statistically significant, it means that the relationship, rather than the difference, is
unlikely to be a random occurrence.
Correlation versus Cause
Before pursuing correlation, researchers must
make a distinction between correlation and cause.
The fact that two characteristics co-vary, or vary
together, does not mean that one necessarily causes the other. Although there may be a causal
relationship, researchers usually cannot determine
one just by studying the correlation. One of the
author’s statistics professors explained the risk of
confusing correlation with cause this way: A person
drinks for three successive nights. The first night, the
drink is scotch and water, the second bourbon and
water, and on the third, vodka and water. Each morning after is accompanied by a hangover. Because the
water is common to each experience, water must be
the cause.
[Photo caption: As the classic study involving ice cream sales and burglaries shows us, it is important to make a distinction between correlation and cause.]
A classic study demonstrates, among other things, a correlation between the sale of ice cream
by vendors on city streets and burglaries in the same city. Someone rushing to judgment
about cause might wish to curb ice cream sales or check the criminal records of ice cream
vendors to reduce the number of burglaries. Such an individual does not recognize that hotter weather—and the open windows that result—probably drive both ice cream sales and
burglaries. It is not unusual for some third variable to explain an association between a first
and a second. Although correlation values provide some evidence for causation, correlation
alone is rarely sufficient to demonstrate cause.
Scatterplots
Breaking down the word correlation—co-relation—makes its meaning clear: the variables
are related. The evidence for the relationship is that the characteristics co-vary. As the level of
one variable changes, the other changes as well because both variables contain some of the
same information. The higher the correlation, the more common information they contain.
A researcher gathers verbal ability and intelligence scores for 12 subjects and presents them
in Table 8.1. Note that the first participant has a verbal ability score of 20 and an intelligence
score of 80. Scanning the two rows of data, we can see that as the values of one score increase,
so do those of the other. In other words, there appears to be a positive correlation between
the two scores. The relationship is easier to see in the scatterplot. A scatterplot is a graph
plotting the values of one variable along the horizontal axis and the other variable along the
vertical axis, using dots to indicate the intersection of each pair of values. Figure 8.1 shows
an Excel-generated scatterplot of the verbal ability/intelligence data.

Try It!: #1
How many raw scores does a single point on a scatterplot represent?
Table 8.1: Results of a study comparing verbal ability and intelligence

Participant:      1    2    3    4    5    6    7    8    9   10   11   12
Verbal ability:  20   35   42   48   55   60   63   66   72   76   78   85
Intelligence:    80   95   90  100  100  100  110  115  120  115  110  125
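The positive trend visible in Table 8.1 can be checked numerically. This is a supplementary sketch, not part of the textbook; it applies the standard deviation-score form of Pearson's r to the twelve pairs:

```python
# Pearson's r for the verbal ability and intelligence scores in Table 8.1.
import math

verbal       = [20, 35, 42, 48, 55, 60, 63, 66, 72, 76, 78, 85]
intelligence = [80, 95, 90, 100, 100, 100, 110, 115, 120, 115, 110, 125]

mx = sum(verbal) / len(verbal)
my = sum(intelligence) / len(intelligence)

sxy = sum((x - mx) * (y - my) for x, y in zip(verbal, intelligence))
sxx = sum((x - mx) ** 2 for x in verbal)
syy = sum((y - my) ** 2 for y in intelligence)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))   # a strong positive correlation
```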
Figure 8.1: The relationship between verbal ability and intelligence
In the Figure 8.1 scatterplot, intelligence scores are plotted along the vertical, or y, axis and
the verbal ability scores are plotted along the horizontal, or x, axis. Each diamond-shaped
point in the graph, then, represents an intelligence score and a verbal ability score.
The plot verifies what our cursory view of the two rows of data in the table suggested: A positive correlation exists between measures of intelligence and those of verbal ability. The general trend is from lower left to upper right. As the value of one variable increases, the value of
the other tends to do likewise. The incline is not dramatic, but the graph shows a general rise
in the data points.
Less-than-Perfect Relationships
The relationship certainly is not perfect. The fourth, fifth, and sixth participants all have the
same level of intelligence but different levels of verbal ability. The same is true of participants
8 and 10, as well as participants 7 and 11. Still, there is a general lower-left to upper-right
relationship, which might be expected. Brighter people often have more complex language
patterns, something suggested by higher verbal-ability scores.
It also is not surprising that the relationship between intelligence and verbal ability is less
than perfect. An extensive vocabulary alone is no guarantee of an unusually high intelligence
score. Perhaps the individual is just an avid reader. At the other end of the spectrum, not all
highly intelligent people excel verbally.
The exceptions point to the fact that people are very complex. Human behavior is rarely
explained by one or two variables. Although intelligence is related to verbal aptitude, so are
a number of other variables: how much the individual reads, how easily the individual is distracted, how much experience the person has had, and so on. One of the reasons researchers
calculate correlation values is to determine the level of agreement when the relationships are
not perfect, as they rarely are with people.
The issue the hypothesis of association seeks to resolve is not whether the relationship is
perfect—because it would be extremely rare if it were—but rather, whether the relationship
is statistically significant. Statistically significant correlations produce correlation values that
tend to reemerge every time new data are gathered for the variables and the strength of the
correlation is recalculated.
Although perfect correlations are rare when dealing with people, that is not necessarily the
case elsewhere. Mathematicians, for example, enjoy the stability of perfect relationships; the
formula for the area of a circle, A = πr² (where the area is found by multiplying the value of pi
by the square of the radius), works for circles of any size because a perfect relationship exists
between a circle’s radius and its area.
Still, even imperfect correlations, such as those related to human-subjects research, can be
very important. If health professionals know a correlation, even a weak one, exists between
exposure to secondhand smoke and the later development of respiratory problems, they can
warn against such exposure. In that particular instance, by the way, the research supports the
causal assumption. If educators know there is a correlation between how much homework
students do and their success on a high school exit exam, educators can encourage students to
complete more assignments. The instructors expect that pass rates will rise as a consequence.
In the case of homework and exit exam scores, however, a causal relationship is not as clear.
Perhaps people who have a higher level of academic achievement do more homework and
have higher exit exam scores. That suggests the academic achievement is the causal element
rather than the homework. Maybe the increased homework is the manifestation of that other
variable, academic achievement, or perhaps parental involvement is the causal factor—students whose parents are directly involved in their schooling do more homework and prepare
for their exit exams with greater care.
The Amount of Scatter
The amount of scatter in a scatterplot, the degree to which the points in the scatterplot stray
from a straight line, suggests weakness in the correlation. Scatterplots graphed for strong
correlations have very little scatter. The points appear to line up.
What Correlations Provide
Calculating a correlation involves quantifying the strength of the relationship between the
variables involved. Correlation values, or coefficients, range from −1.0 to +1.0. Correlation
values of either −1.0 or +1.0 indicate perfect relationships. With positive correlations, as the
value of one variable increases, so does the value of the other—more verbal reinforcement of
subjects in a test of problem-solving ability is probably associated with more effort expended
by the subject. With negative correlations, as the value of one variable increases, the other
decreases—more involvement with video-gaming while a text passage is read to subjects is
probably associated with lower retention of the details of the text passage; as the value of one
increases, the value of the other declines. A correlation of 0 indicates no relationship—fluctuations
in the value of one variable are unrelated to changes in the value of the other. Values less than
the absolute value of 1.0, but greater than 0, indicate imperfect relationships, with the strength
of the relationship declining as the value approaches 0.

Try It!: #2
If two variables are normally distributed but uncorrelated, what pattern will their data points make in a scatterplot?
Correlating two variables does not require that they both measure the same characteristic
or even that they both be gathered from the same subjects. Often, entirely different kinds
of things are correlated. The example of secondhand smoke and respiratory issues involves
two completely different variables, but the strength of the relationship between them can be
calculated nevertheless. As long as the two variables can be quantified—reduced to a number—the strength of any relationship can be determined.
Requirements for the Pearson Correlation
Researchers may employ any of several different correlation procedures. The appropriate
procedure for a particular problem is determined by characteristics such as the scale and
normality of the data involved. The Pearson correlation, for example, requires variables of
either interval or ratio scale. Nominal or ordinal scale data can be correlated as well, but they
involve other correlation procedures. In addition to interval or ratio data, the Pearson correlation also requires the following:
• In their populations, the characteristics are assumed to be normally distributed. Normal distributions can never be reflected in relatively small samples, but
researchers must have reason to believe that the samples come from populations
that are normal.
• The distributions from which the samples come must be similarly distributed.
• The two samples are assumed to be randomly selected from their populations.
• The relationship between the variables must be linear; it remains constant throughout their ranges.
Recall that normality is indicated when the standard deviation is about one-sixth of the range,
the measures of central tendency all have about the same value, and so on (Chapter 2). The
way data are distributed in the scatterplot also suggests the normality of the two variables
involved in a correlation. When both variables are normal, the points in the plot will be distributed from left to right, with the frequency of the points gradually increasing toward the
middle of the graph and then gradually decreasing to the right extreme. If the relationship is
positive (example A in Figure 8.2), the scatter is generally from lower left to upper right. If
it is negative (example B in Figure 8.2), the graph follows a pattern from upper left to lower
right. If the variables have no correlation (example C in Figure 8.2), the points fall into a circular pattern in the middle of the graph with the greatest density at the circle’s center. The
greater frequency in the middle of the circle
reflects the fact that most of the data in any
normal distribution occur near the middle of
the distribution. (The pattern in our example
does not look circular because so few data are
present.)
The similar-distribution requirement does
not mean that the standard deviations should
be the same. That is not likely to happen
unless both variables are measured along the
same range. It means that the standard deviations should account for similar proportions
of their respective ranges.
Figure 8.2: Scatterplots for positive, negative, and zero correlations
[Panel A: a positive correlation; Panel B: a negative correlation; Panel C: a zero correlation]

The strength of a correlation is affected by range attenuation. When the range of scores
in either variable is artificially abbreviated, the correlation value will be artificially low.
Range attenuation can be indicated by a standard deviation that is substantially smaller
than we know it to be in the population. If we were correlating intelligence scores with
reading comprehension, and the intelligence scores have a standard deviation of 8 points
when we know that the population standard deviation is 15 points, we can expect any resulting
correlation value to be artificially low. One of the advantages of random selection is that
random samples of a reasonable size tend to mirror their populations reasonably well. Range
restriction problems are much less likely to occur with randomly selected samples.
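Range attenuation can be demonstrated with a short sketch. The data below are invented and deterministic: y tracks x closely except for a small alternating disturbance, and sampling only a narrow band of x values visibly lowers r:

```python
# Demonstration of range attenuation with invented data.
import math

def pearson_r(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x_full = list(range(20))
y_full = [x + (2 if x % 2 == 0 else -2) for x in x_full]

# Restrict the sample to a narrow band of X values (8 through 12).
x_narrow = [x for x in x_full if 8 <= x <= 12]
y_narrow = [y for x, y in zip(x_full, y_full) if 8 <= x <= 12]

r_full = pearson_r(x_full, y_full)        # high: X varies over its full range
r_narrow = pearson_r(x_narrow, y_narrow)  # attenuated: X barely varies
print(round(r_full, 2), round(r_narrow, 2))
```

The underlying relationship is identical in both samples; only the sampled range of X differs.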
Linear and Nonlinear Correlations
When the relationship between two variables is linear, it means that the degree to which they
change in concert with each other is the same throughout their ranges; if it is low and positive, it is low and positive at low levels of both variables and at higher levels of both variables.
Some correlations, however, are not linear. Consider the correlation between anxiety and the
quality of a musician’s performance. In that instance, a little anxiety is probably a good thing.
It prompts the individual to prepare for the performance by practicing, studying the music
carefully, asking others for feedback, and so on. Without anxiety, the musician might not make
the necessary preparations. It seems likely that, at least in the early going, the quality of the
performance improves as anxiety increases.
But it is possible that if anxiety continues to increase, the individual’s performance may reach
a plateau and then begin to diminish. The musician may become so anxious that concentration is difficult and performance declines, with more anxiety actually depreciating the quality
of the music. These conditions describe a relationship that is curvilinear. It is illustrated in
Table 8.2, where anxiety is gauged as a function of someone’s increasing pulse rate in beats
per minute. The quality of the musician’s performance is represented by the judgment of a
trained observer, with higher values indicating a more virtuoso performance. If scores were
awarded every 5 minutes during a 65-minute performance, the data are as follows:
Table 8.2: Study results of anxiety versus quality of a musician’s performance

Anxiety:              52  54  58  62  64  67  72  73  75  78  82  86  88
Performance quality:   3   5   6   6   8   8   9   7   5   5   4   3   1
Figure 8.3 shows the scatterplot illustrating the relationship between the musician’s anxiety
and the quality of the musician’s performance.
Figure 8.3: The relationship between performance quality and anxiety
Initially, there is a positive relationship between anxiety and the quality of the music. The
first few pairs of data have points that rise from left to right. However, a positive relationship
becomes negative when performance begins to diminish as anxiety increases. Viewed as a
whole, the correlation is curvilinear. After performance reaches the judge’s high of 9, more
anxiety is not associated with better music.

Try It!: #3
What impact does range attenuation have on a correlation?

The scatterplot also reveals some of the danger associated with range restriction. If someone
collects data so that only the first six pairs of scores were the sample,
those scores provide very different indicators of the relationship between anxiety and performance than the last six pairs of scores. The first part of the distribution makes the relationship look linear and positive. The latter part of the data makes the relationship look linear but
negative. An accurate picture of the relationship requires data throughout the entire ranges
of the two variables.
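Because Pearson's r measures only linear association, applying it to the Table 8.2 data understates the strong but curvilinear relationship. A supplementary sketch:

```python
# Pearson's r applied to the curvilinear anxiety/performance data in
# Table 8.2. The relationship is real but not linear, so the linear
# correlation comes out weak.
import math

anxiety = [52, 54, 58, 62, 64, 67, 72, 73, 75, 78, 82, 86, 88]
quality = [3, 5, 6, 6, 8, 8, 9, 7, 5, 5, 4, 3, 1]

mx = sum(anxiety) / len(anxiety)
my = sum(quality) / len(quality)

sxy = sum((x - mx) * (y - my) for x, y in zip(anxiety, quality))
sxx = sum((x - mx) ** 2 for x in anxiety)
syy = sum((y - my) ** 2 for y in quality)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))   # modest in magnitude despite the clear inverted-U pattern
```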
Understanding Correlation Values
It is important not to confuse the sign of the correlation (1 or 2) with its strength. A correlation of 20.50 contains the same amount of information about the two variables as does a
correlation of 10.50. The sign makes a great deal of difference how the relationship is interpreted, but it has nothing to do with the strength of the relationship. With positive correlations, as the value of one variable increases so does the value of the other. When correlations
are negative, increasing values of one variable are associated with decreasing values of the
other.
Earlier we noted that different scales of data require different types of correlation procedures. The number of variables involved also dictates the need for different correlation
procedures:
• Bivariate correlations indicate the relationship between two variables. For example, the correlation between intelligence and verbal aptitude is a bivariate correlation. This chapter focuses on bivariate correlations.
• Multiple correlation gauges the relationship between one variable and a combination of others. For example, the correlation between a combined reading comprehension and vocabulary measure with an analytical-ability measure would indicate
how well reading comprehension and vocabulary ability, combined, correlate with
analytical ability.
• Canonical correlation measures the relationship between two groups of variables.
For example, determining how a combination of reading comprehension and vocabulary ability and a combination of analytical ability and problem-solving ability
relate calls for a canonical correlation.
• Partial correlation measures the relationship between two variables after neutralizing the influence of some third variable on both of the first two. For example,
a correlation of analytical ability with problem-solving ability, with the influence of
age controlled in both of the other variables, eliminates age differences as a factor
in the resulting correlation. In effect, a partial correlation would be the correlation
of analytical ability with problem-solving ability as if all subjects were the same
age.
• Semipartial correlation gauges the relationship between two variables after neutralizing the influence of a third on either of the first two. For example, a correlation of intelligence with verbal aptitude, with age differences controlled in the verbal-aptitude variable, is a semipartial correlation. Age would not be controlled in the intelligence variable. (This makes some sense since intelligence is often argued to be a stable variable across age differences in the individual.)
Only the bivariate correlations are covered here. The others are beyond the scope of this book
but are described here very simply, so that the reader has a sense of where bivariate correlations fit into the broader discussion of these procedures.
8.2 Calculating the Pearson Correlation
Formally called the Pearson product-moment correlation coefficient, the Pearson correlation,
or—because its symbol is typically a lowercase r—“Pearson’s r,” is probably the most often
calculated of any correlation value. Thumbing through statistics books and glancing at online
sources reveals several formulas. All provide the same answer, but some are easier to complete than others. Visually, at least, Formula 8.1 is probably simplest:
Formula 8.1

rxy = ∑[(zx)(zy)] / (n − 1)
Note that the r symbol has x and y subscripts. These indicate that the procedure correlates two variables designated x and y. Which variable is assigned x and which y is unimportant, since correlation does not presume that the x variable causes y, for example. Formula 8.1 indicates that if the x and y scores are transformed into z scores (Formula 3.1: z = (x − M)/s), the value of rxy (the correlation value) is the sum of the products of each participant's x and y z scores, divided by the number of participants in the data group (rather than the number of scores), minus 1.
The n − 1 signifies that this is a correlation formula for sample, rather than population, data.
It is the same adjustment for sample data made with the standard deviation calculation in
Chapter 1. Formula 8.1 can be used to calculate the correlation value of the verbal ability
and intelligence scores from the earlier example. Calculating the equivalent verbal-ability and
intelligence z values with Formula 3.1 produces the z values for the original raw scores listed
in Table 8.3.
Table 8.3: z values

Verbal ability (x)    Intelligence (y)
−1.991                −1.902
−1.212                −0.761
−0.848                −1.141
−0.537                −0.380
−0.173                −0.380
 0.087                 0.380
 0.242                 0.761
 0.398                 1.141
 0.710                 0.761
 0.917                 0.380
 1.021                −0.380
 1.385                 1.522
Here, each pair of z scores is multiplied and the products summed:

(−1.991 × −1.902) + (−1.212 × −0.761) + . . . + (1.385 × 1.522) = 10.313

This provides the numerator to be used in the formula

rxy = ∑[(zx)(zy)] / (n − 1)

Then, for the denominator, n (the number of pairs of scores) = 12, so n − 1 = 11. Therefore, substituting these values into the above equation gives

rxy = 10.313 / 11 = 0.938
With a maximum possible correlation value of 1.0, rxy = 0.938 indicates a strong relationship between verbal ability and intelligence, something that is reflected in the fact that many intelligence tests include subtests of verbal ability.
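For readers who like to check this sort of arithmetic in software, Formula 8.1 translates almost line for line into code. The following is a sketch of our own (not one of the text's procedures); the function and variable names are invented for illustration:

```python
import statistics

def pearson_r_from_z(x, y):
    """Formula 8.1: r = (sum of paired z-score products) / (n - 1)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    # Sample standard deviations (n - 1 in the denominator), as in Chapter 1
    sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)
    z_x = [(value - mean_x) / sd_x for value in x]
    z_y = [(value - mean_y) / sd_y for value in y]
    # Multiply each pair of z scores, sum the products, divide by n - 1
    return sum(a * b for a, b in zip(z_x, z_y)) / (len(x) - 1)
```

Feeding in a set of raw x and y scores reproduces the transform-multiply-sum-divide sequence just described.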
Although Formula 8.1 is visually simple, the need to transform everything into z scores before calculating rxy makes hand calculation tedious and time consuming. Formula 8.2, the formula we will use, turns out to be the formula programmed into many hand-calculators. It is visually more complex but much easier to execute:
Formula 8.2

rxy = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}
where

x = one of the scores in each pair, as in the z score formula above.
y = the other score in the pair.
n = the number of participants (the number of pairs of scores).
∑xy indicates that each pair of scores is multiplied and then the products for each pair summed. The resulting value is the "sum of the cross-products."
∑x² indicates that each x score is squared, and then the squares summed.
(∑x)² indicates that the original x scores are totaled, and then the total is squared.
∑y² indicates that each y score is squared, and then the squares summed.
(∑y)² indicates that the original y scores are totaled, and then the total is squared.
The formula is not as daunting as it appears. The
process will become familiar after a few problems.
Probably Excel or a hand-calculator with a built-in
correlation function will perform most of the statistical “heavy-lifting,” but it is helpful to prepare for
that occasional time when there is no computer and
the calculator has no correlation function.
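When no built-in correlation function is at hand, Formula 8.2 is also straightforward to script. The following Python sketch is our own illustration, not from the text; it mirrors the five sums the formula requires:

```python
from math import sqrt

def pearson_r(x, y):
    """Formula 8.2, the computational form of the Pearson correlation."""
    n = len(x)                                  # number of pairs of scores
    sum_xy = sum(a * b for a, b in zip(x, y))   # sum of the cross-products
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(a * a for a in x)              # each x squared, then summed
    sum_y2 = sum(b * b for b in y)              # each y squared, then summed
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator
```

Note that (∑x)² and ∑x² are computed separately, since confusing the two is the most common hand-calculation error with this formula.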
A Correlation Example
A researcher is duplicating a classic experiment by
psychologist E. L. Thorndike. The experiment relates
to Thorndike’s Law of Effect, which maintains that
behaviors followed by a satisfying state of affairs will
likely be repeated. In the experiment, the researcher
sets up a cage equipped with a door that opens if a
cat placed in the cage bats a string suspended inside
the cage. According to the law of effect, if batting the string is followed by something satisfying,
that behavior should occur more frequently in future trials than other behaviors. A hungry cat
is placed in the cage and food placed outside where it is inaccessible from the inside of the cage.
The data comprise the trial number and the elapsed time, in minutes, before the cat releases itself.
This experiment is repeated 10 times over as many days. Table 8.4 lists the data.
Table 8.4: Experimental results from cat behavioral study

Trial number:   1     2     3      4     5      6     7      8     9     10
Elapsed time:   5.0   5.5   4.75   4.5   4.25   3.5   2.75   2.0   1.0   0.25
Figure 8.4 shows the scatterplot for these data, which suggests that the relationship is probably negative and quite strong.
Figure 8.4: The relationship between number of trials and elapsed time
The correlation value checks both conclusions. To determine the correlation, we use Formula 8.2:

rxy = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}
The number of trials (n) = 10. The researcher can then verify that

∑xy = 137.25
∑x² = 385
(∑x)² = (55)²
∑y² = 141
(∑y)² = (33.5)²
Substituting the relevant values gives

rxy = [10(137.25) − (55)(33.5)] / √{[10(385) − (55)²][10(141) − (33.5)²]}
    = (1,372.5 − 1,842.5) / √[(3,850 − 3,025)(1,410 − 1,122.25)]
    = −470 / √(825 × 287.75)
    = −0.965
Interpreting Results
The relationship is indeed negative, and because the maximum correlation is ±1.0, the relationship is also very strong. Neither of those conclusions indicates whether the result is statistically significant, however. As with z, t, and F, significance is determined by comparing the calculated value to the table value indicated by the relevant degrees of freedom and the selected level of probability. A calculated correlation whose absolute value is at least as large as the table value is one that probably did not occur by chance. For the Pearson correlation, the values are in Table 8.5 (see also Table B.5 in Appendix B).
Like the t and F values, the correct critical value for r is determined by degrees of freedom
and by the level of probability the researcher selects. The degrees of freedom for a Pearson
correlation are the number of pairs of data, minus 2. Be careful not to confuse the number of
pairs with the number of scores.
The probability values in Table 8.5 indicate the absolute value that the calculated rxy must reach to be confident that the correlation did not occur by chance. The level of confidence in that conclusion is indicated by the columns for p = 0.1, p = 0.05, and p = 0.01. To have some practice interpreting the values, note the following:

• If a correlation were calculated for n = 7 pairs of data (which means that df = 5) and the result was rxy = ±0.669, there is 1 chance in 10, or in other words p = 0.1, that the correlation occurred by chance. A chance, or random, correlation means that if new data were collected and the rxy value calculated a second time, it would probably be less than the table value.
• If the researcher wants more assurance against a random correlation, rxy = ±0.754 (also with 5 degrees of freedom) will occur by chance just 5 times in 100 (p = 0.05), and rxy = ±0.875 will occur by chance just 1 time in 100 (p = 0.01).
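The lookup-and-compare step can be expressed as code as well. The sketch below is our own illustration: it hard-codes a few rows of Table 8.5 in a dictionary (a hypothetical helper, not a library routine) and applies the decision rule, significant only when |rxy| reaches the critical value:

```python
# A few rows of Table 8.5, keyed by degrees of freedom (pairs of data minus 2)
CRITICAL_R = {
    5:  {0.10: 0.669, 0.05: 0.754, 0.01: 0.875},
    8:  {0.10: 0.549, 0.05: 0.632, 0.01: 0.765},
    13: {0.10: 0.441, 0.05: 0.514, 0.01: 0.641},
}

def is_significant(r, n_pairs, alpha=0.05):
    """Compare |r| to the tabled critical value for df = n_pairs - 2."""
    df = n_pairs - 2  # degrees of freedom count pairs, not individual scores
    return abs(r) >= CRITICAL_R[df][alpha]
```

For example, `is_significant(-0.965, 10, 0.01)` returns True: the cat-experiment correlation clears even the p = 0.01 criterion, and the absolute value is what gets compared, so the negative sign does not matter.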
Table 8.5: The critical values of rxy

Lowest statistically significant correlation for the specified probability

Number of xy pairs (n)   df (n − 2)   p = 0.10   p = 0.05   p = 0.01
3                        1            0.988      0.997      1.000
4                        2            0.900      0.950      0.990
5                        3            0.805      0.878      0.959
6                        4            0.729      0.811      0.917
7                        5            0.669      0.754      0.875
8                        6            0.621      0.707      0.834
9                        7            0.582      0.666      0.798
10                       8            0.549      0.632      0.765
11                       9            0.521      0.602      0.735
12                       10           0.497      0.576      0.708
13                       11           0.476      0.553      0.684
14                       12           0.458      0.532      0.661
15                       13           0.441      0.514      0.641
16                       14           0.426      0.497      0.623
17                       15           0.412      0.482      0.606
18                       16           0.400      0.468      0.590
19                       17           0.389      0.456      0.575
20                       18           0.378      0.444      0.561
21                       19           0.369      0.433      0.549
22                       20           0.360      0.423      0.537
23                       21           0.352      0.413      0.526
24                       22           0.344      0.404      0.515
25                       23           0.337      0.396      0.505
Source: Brighton Webs Ltd. (2006). Critical values of correlation coefficient (R). Statistics for Energy and the Environment. Retrieved
from https://web.archive.org/web/20110117193722/http://www.brighton-webs.co.uk/tables/critical_values_r.asp
Researchers most commonly settle on p = 0.05 or 0.01. The p = 0.1 level occurs in statistical tables less often because in most research settings, a one-in-ten chance of a random correlation is too great. No one wants to conclude that a correlation is statistically significant when there is too much chance that the finding will not hold up under further investigation. In exploratory or descriptive research, when there is little prior research on which to rely, however, investigators will sometimes relax the probability to p = 0.1.
The Relationship Between Degrees of Freedom and Significance
Even with a correlation value as extreme as −0.965, checking the table for significance is important. In both the t test and ANOVA, the magnitude of the critical values declines as degrees of freedom (and sample size) increase. It is the same with correlation, but here the decline in critical values is more dramatic. Note from the table, for example, that if n = 3 (and therefore df = 1), the correlation would need to be at least rxy = 0.997 (nearly perfect) to be statistically significant. The related point is that with only three pairs of data, the potential for a random relationship that looks significant is very high. At the other extreme, if n = 25 (so that df = 23), a correlation of just rxy = 0.396 is statistically significant. That much data carries a much lower potential for an accidental (random) relationship.
The Statistical Hypotheses
The null and alternate hypotheses for correlation reflect the fact that we have moved away from the hypothesis of difference. The null hypothesis is that no relationship between the variables exists. Symbolically, it is written: H0: ρ = 0.

The symbol ρ is the Greek letter rho (as in "row" your boat) and the population equivalent of r. So the null hypothesis states that the correlation equals 0. More specifically, it means that there is no statistically significant relationship. The alternate hypothesis states that the correlation does not equal 0, that a statistically significant relationship will emerge each time data are collected and the relationship calculated: HA: ρ ≠ 0.
The Coefficient of Determination
One of our important recurring themes is the distinction between statistical significance and
practical importance. Determining practical importance was the reason for omega-squared
and eta-squared calculations for significant t test and ANOVA results, respectively.
Effect sizes take on particular importance with correlation because with large samples, relatively small correlations can be statistically significant. The effect size corresponding to the Pearson correlation is the coefficient of determination (rxy²). As the notation suggests, the coefficient of determination is the square of the correlation coefficient. Squaring the correlation indicates how much of the variance in y is explained by x (or vice versa, since correlation does not assume cause).

In the problem about number of trials and elapsed time, rxy = −0.965, so rxy² = 0.931. For that problem, the coefficient of determination is interpreted this way: the number of trials can explain about 93% of the variance in time elapsed, which would be a very important finding with implications for many kinds of performance tasks, except that the numbers were contrived.
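The computation itself is a single squaring operation; as a quick sketch with the trials figure above:

```python
r = -0.965                   # trials/elapsed-time correlation from the example
r_squared = r ** 2           # coefficient of determination (note the sign drops out)
print(round(r_squared, 3))   # 0.931: trials explain about 93% of the variance
```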
The Interpretive Value of rxy²
The coefficient of determination can also indicate how unimportant some low correlations are, even when they are statistically significant. For example, with 23 degrees of freedom, a correlation of rxy = 0.396 is statistically significant. The coefficient of determination for that
value is rxy² = 0.157. One variable in such a relationship explains just 16% of the variance in the other. The other 84% of the variability is related to other factors.
When the variables describe the behavior of people, small coefficients of determination do
not surprise us because they are part of human subjects’ complexity. Very few individual variables will explain large proportions of human behavior.
Sometimes, however, even low correlations and low rxy² values are important. If research revealed that the correlation between the age of first exposure to illegal narcotics and the development of an addiction was rxy = −0.3, that value (note the negative correlation) indicates that the younger subjects are at first exposure, the more likely they are to develop an addiction. The resulting rxy² value would be just 0.09. But even if just 9% of the variance in addiction is explained by age at first exposure, within the context of human complexity that would be considered important. Practical importance is a function of consequences.
Comparing Correlation Values
In isolation, correlation coefficients can be difficult to interpret because correlation strength does not increase or decrease in consistent increments. The change from rxy = 0.2 to rxy = 0.3 is a less dramatic increase in strength than the increase from rxy = 0.75 to rxy = 0.85, for example. Although the Pearson r requires equal interval data, in the coefficients that result, an increase in correlation strength of 0.1 reflects a very different change from 0.8 to 0.9 than it does from 0.2 to 0.3. It takes a much stronger increase in the relationship to increase by 0.1 in the upper ranges of correlation values than in the lower ranges, something suggested by the distance between tenths in this number line:

rxy = 0.1  0.2  0.3   0.4   0.5    0.6    0.7     0.8      0.9

Squaring the correlation coefficient makes the intervals consistent. A change in the coefficient of determination from 0.35 to 0.5, for example, represents the same increase in proportion of variance explained as an increase from 0.7 to 0.85, as the line suggests:

rxy² = 0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
Another Correlation Problem
A foundation interested in what prompts contributions to charitable causes retains a consultant. Noting that donation amounts appear to vary with age, the consultant gathers the data in Table 8.6 and generates the values in Problem 8.1.
Table 8.6: Data on charity donations

Donor:    1    2    3    4    5     6    7    8    9     10    11    12    13   14    15
Age:      25   27   32   32   35    38   43   45   45    47    48    52    63   65    66
Amount:   20   20   35   25   100   50   75   45   100   150   100   200   50   100   125
Problem 8.1: The Pearson correlation for contributor's age and contribution amount

Donor's age (x)   x²       Contribution amount (y)   y²        xy
25                625      20                        400       500
27                729      20                        400       540
32                1,024    35                        1,225     1,120
32                1,024    25                        625       800
35                1,225    100                       10,000    3,500
38                1,444    50                        2,500     1,900
43                1,849    75                        5,625     3,225
45                2,025    45                        2,025     2,025
45                2,025    100                       10,000    4,500
47                2,209    150                       22,500    7,050
48                2,304    100                       10,000    4,800
52                2,704    200                       40,000    10,400
63                3,969    50                        2,500     3,150
65                4,225    100                       10,000    6,500
66                4,356    125                       15,625    8,250
∑x = 663          ∑x² = 31,737   ∑y = 1,195         ∑y² = 133,425   ∑xy = 58,260
The correlation of the donor's age and the contribution amount is calculated as follows:

rxy = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}
    = [15(58,260) − (663)(1,195)] / √{[15(31,737) − (663)²][15(133,425) − (1,195)²]}
    = 81,615 / √[(36,486)(573,350)]
    = 0.564
• The critical value at p = 0.05 and 13 df (r0.05(13)) is 0.514.
• Because rxy > r0.05(13), the correlation is statistically significant.
• The coefficient of determination (rxy²) = 0.318, which indicates that age can explain about 32% of the variability in donation amount.
Problem 8.1 suggests some of the hazard in rushing to judgment about cause from correlation data. While we might be tempted to reduce the problem to "older people contribute more to charity than younger people," other factors are probably at work, not the least of which is that age likely correlates with income as well. Perhaps it is not age that explains contribution amount so much as income. The correlation value, while instructive and important, indicates only how variables co-vary, not necessarily why the variables involved vary.

Try It!: #4
What is the relationship between degrees of freedom and statistical significance in correlation?
8.3 Correlating Data When One Variable Is Dichotomous
If the consultant had asked how the donation amount and the donor's gender relate, Pearson still provides the answer, but the procedure becomes a point-biserial correlation. The word point refers to the continuous variable, the amount of money donated in this example. The word biserial refers to the other variable, which has only two levels. The required change is coding the gender variable in a way that reflects its dichotomy: as either 0 or 1. Whether females are coded 0 and males 1, or the reverse, will not affect the strength of the coefficient.
The point-biserial correlation has a number of applications. Questions about the relationship between marital status and income, between public versus private school students and
achievement, or between Republicans’ and Democrats’ optimism are all questions that could
be analyzed with point-biserial correlation.
In point-biserial correlations, which level is coded 0 and which 1 affects only the sign of the
coefficient. We will need to be careful when interpreting the result. If donors 3, 5, 6, 7, 9, 10,
11, and 14 are female, and if females are coded 1 and males 0, the research obtains the data
in Table 8.7.
Table 8.7: Data on charity donations by donor type (gender)

Donor (x):    0    0    1    0    1     1    1    0    1     1     1     0     0    1     0
Amount (y):   20   20   35   25   100   50   75   45   100   150   100   200   50   100   125
Calculating the Point-Biserial Correlation
The amounts donated (the y values) remain the same from the age/donor problem (Problem 8.1, where ∑y = 1,195 and ∑y² = 133,425). The other values must be recalculated, although that task becomes much simpler with gender (x) recoded to 1s and 0s. Table 8.8 lists those results.
Table 8.8: Point-biserial correlation results

Gender (x)   x²   Amount (y)   y²        xy
0            0    20           400       0
0            0    20           400       0
1            1    35           1,225     35
0            0    25           625       0
1            1    100          10,000    100
1            1    50           2,500     50
1            1    75           5,625     75
0            0    45           2,025     0
1            1    100          10,000    100
1            1    150          22,500    150
1            1    100          10,000    100
0            0    200          40,000    0
0            0    50           2,500     0
1            1    100          10,000    100
0            0    125          15,625    0
∑x = 8       ∑x² = 8          ∑y = 1,195   ∑y² = 133,425   ∑xy = 710

Return to Formula 8.2, in which

rxy = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}

Substituting in the values from Table 8.8 gives

rxy = [15(710) − (8)(1,195)] / √{[15(8) − (8)²][15(133,425) − (1,195)²]}
    = 0.19
Still testing at p = 0.05 and with the degrees of freedom still df = 13, from Table 8.5 the critical value is still r0.05(13) = 0.514. Therefore the statistical decision will be to fail to reject H0. The relationship between the donor's gender and the amount contributed is not statistically significant. The rxy = 0.19 result is probably a random correlation that is unlikely to reach the critical value from the table in any new analysis with new subjects.
The interpretation of the point-biserial correlation is the same as it is for conventional Pearson correlations, except that the sign of the coefficient is a function only of which variable is coded 1. If male donors had been coded with 1s, the correlation would have been negative, rxy = −0.19. Consider a few more applications for the point-biserial correlation:
• What is the relationship between whether or not a parent earned a college degree
and the child’s grades?
• How is whether or not a student is a native speaker of English related to the
student’s test score?
• What is the correlation between blue-collar/white-collar jobs and the amount of
leisure time?
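Because the point-biserial procedure is just Pearson's r with one variable coded 0/1, the computational formula needs no modification at all. The following self-contained Python sketch (our own illustration, using the Table 8.7 donation data) makes both points, the strength and the sign behavior:

```python
from math import sqrt

def pearson_r(x, y):
    """Formula 8.2; works unchanged when x is a 0/1 dichotomy."""
    n = len(x)
    num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    den = sqrt((n * sum(a * a for a in x) - sum(x) ** 2)
               * (n * sum(b * b for b in y) - sum(y) ** 2))
    return num / den

gender = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0]   # 1 = female donor
amount = [20, 20, 35, 25, 100, 50, 75, 45, 100, 150, 100, 200, 50, 100, 125]

r_pb = pearson_r(gender, amount)                          # about 0.19
# Reversing the coding (1 = male) flips only the sign, not the strength
r_reversed = pearson_r([1 - g for g in gender], amount)   # about -0.19
```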
If both variables are dichotomous, another bivariate correlation is involved. It is called the phi
coefficient, discussed in Chapter 10.
Degrees of Significance?
At rxy = 0.19 and a table value of r0.05(13) = 0.514, the correlation value is not significant. If the value had been rxy = 0.50, and this correlation value represented some relationship calculated for your senior thesis, would it be appropriate to refer to it as "almost significant" or "nearly significant"? It is not uncommon to see such qualifiers even in the published literature, but significance decisions should be treated the same way as dichotomous variables.
Only two outcomes are possible: The correlation is significant or it is not significant. To try
to make a statement about the nearness to an alternative outcome undermines the principle
behind significance testing. Only two hypotheses for significance exist, and the outcome is
couched in terms of one or the other.
8.4 The Pearson Correlation in Excel
A psychologist is interested in determining the relationship between risk-taking and success
solving novel problems. Having devised the Inventory Risk Survey Catalog (the I-RiSC), the
psychologist gauges the willingness of a group of 16-year-olds to do the unconventional and
then provides a series of word problems with which the participants are unfamiliar. Scores on
the I-RiSC and the problems for 10 participants are listed in Table 8.9.
Table 8.9: Risk-taking and problem-solving success data

I-RiSC:     2    7    4    5    1    8    7    9    3    6
Problems:   14   17   14   16   12   17   16   17   15   15
To complete the problem in Excel, it is best to set up the data in two columns. Two rows also
will work, but parallel columns are visually simpler.
1. Create a label in cell A1 for "I-RiSC" and in cell B1 for "ProbSolv," so that the I-RiSC data appear in cells A2 to A11 and the ProbSolv data appear in B2 to B11.
2. From the ribbon at the top of the page, click the Data tab, and then Data Analysis at the far right.
3. Select Correlation, which is the second option in the window.
4. In the Input Range window enter A2:B11, which indicates the cells where the data
are found. Note that the default groups the data in columns. (Change the default if
entering the data in rows.) Had the “Labels in First Row” box been checked, Excel
would have treated the first row in each column (A2 and B2 because that is what is
designated) as labels rather than data. Our adjustment for the labels was made by
indicating that the data begin in A2 rather than A1.
5. Enter a cell value below or to the right of the last data entry for the Output Range so
that the results do not overwrite the scores—either cell A12 or below, or to the right
of column B.
6. Click OK.
The results appear in a box called a correlation matrix (see Table 8.10). The intersection of
column 1 and column 2 indicates how well the data in column 1 (the Excel A column, where
I-RiSC data are located) correlate with the data in column 2 (the Excel B column, which contains the problem-solving scores).
Table 8.10: Correlation matrix

             Column 1    Column 2
Column 1     1
Column 2     0.904203    1
The result of the analysis is a Pearson correlation of rxy = 0.904. The 1s in the diagonal indicate that each variable correlates perfectly with itself (rxy = 1.0), of course. Note that the output does not indicate whether the calculated value is statistically significant, which makes a check of the critical values table necessary. Table 8.5 indicates that r0.05(8) = 0.632. The relationship between risk-taking and problem solving is statistically significant. Were these data not contrived, it would be quite important to know that about 82% (rxy² = 0.904² = 0.818) of problem-solving success is explained by whatever the I-RiSC measures, ostensibly the subject's willingness to be unconventional.
Apply It!
Investigating the Correlation between Crime and Unemployment
A law enforcement analyst is interested in any link
between crime and unemployment as a guide to allocating crime-prevention funds. Specifically, she would like
to know whether murders and property crimes correlate
with the unemployment rate.
The analyst obtains the murder and property-crime rates
for her state for the 16 years from 1990 to 2005 from
the FBI Uniform Crime Reports (rates are per 100,000
inhabitants). She then consults the Bureau of Labor Statistics for the unemployment rate in the state for the
same period. The analyst will compute the Pearson correlation between murder rate and unemployment and
then between property-crime rate and unemployment.
Table 8.11 shows the data.
Table 8.11: Murder rate, property crime, and unemployment

Year   Murder rate (per 100,000 people)   Property crime rate (per 100,000 people)   Unemployment percentage
1990   7.1                                4462                                       5.6
1991   6.7                                5092                                       6.8
1992   6.4                                4801                                       7.5
1993   6.4                                4662                                       6.9
1994   6.2                                4678                                       6.1
1995   5.7                                4460                                       5.6
1996   5.8                                4438                                       5.4
1997   5.4                                4279                                       4.9
1998   6.1                                4040                                       4.5
1999   5.5                                3852                                       4.2
2000   5.1                                3592                                       4.0
2001   4.9                                3456                                       4.7
2002   4.3                                3412                                       5.8
2003   4.2                                3289                                       6.0
2004   4.7                                3168                                       5.5
2005   5.0                                3081                                       5.1

The Excel results indicate the following:
• The correlation between murder rate and unemployment is rxy = 0.386.
• Comparing the murder rate/unemployment rate correlation to the critical value from Table 8.5 (r0.05(14) = 0.497) indicates that the calculated correlation is not statistically significant at p = 0.05.
• The analyst fails to reject the null hypothesis, ρ = 0.
• The property crime rate and unemployment correlation is rxy = 0.551.
• Comparing the calculated value to the critical value from Table 8.5 (the same r0.05(14) = 0.497, since df are unchanged) indicates that this correlation is statistically significant at p = 0.05.
• The analyst rejects the null hypothesis, ρ = 0.
• The coefficient of determination for this relationship is rxy² = 0.551² = 0.303. About 30% of the variance in the property crime rate can be explained by the unemployment rate.
Although the rxy² indicates that about 30% of property crime is explained by variations in unemployment, the analyst will want to be careful about making the conceptual leap to a causal conclusion. "Explained by" isn't the same as "caused by." To reiterate the point, perhaps something else explains both crime rate and unemployment. Perhaps underfunded public schooling prompts an unusually high dropout rate from school. The consequently undereducated population has more difficulty securing stable employment. Perhaps state budget cuts have been disproportionately imposed on police agencies, and with fewer officers on the street, crime rises. In other words, the simplest explanation might not be the most accurate. A statistically significant correlation is not where the analysis ends.
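Both of the analyst's coefficients can be reproduced from the Table 8.11 data with a few lines of Python. This is a sketch of our own, not the analyst's actual workflow:

```python
from math import sqrt

def pearson_r(x, y):
    """Formula 8.2, the computational form of the Pearson correlation."""
    n = len(x)
    num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    den = sqrt((n * sum(a * a for a in x) - sum(x) ** 2)
               * (n * sum(b * b for b in y) - sum(y) ** 2))
    return num / den

# Table 8.11, in year order 1990-2005
murder = [7.1, 6.7, 6.4, 6.4, 6.2, 5.7, 5.8, 5.4,
          6.1, 5.5, 5.1, 4.9, 4.3, 4.2, 4.7, 5.0]
property_crime = [4462, 5092, 4801, 4662, 4678, 4460, 4438, 4279,
                  4040, 3852, 3592, 3456, 3412, 3289, 3168, 3081]
unemployment = [5.6, 6.8, 7.5, 6.9, 6.1, 5.6, 5.4, 4.9,
                4.5, 4.2, 4.0, 4.7, 5.8, 6.0, 5.5, 5.1]

print(round(pearson_r(murder, unemployment), 3))          # 0.386: not significant at p = 0.05
print(round(pearson_r(property_crime, unemployment), 3))  # 0.551: significant at p = 0.05
```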
Apply It! boxes written by Shawn Murphy
8.5 Spearman’s Rho
The Pearson correlation requires that both variables be at least interval scale. The point-biserial correlation requires that one variable be at least interval scale and the other a variable with only two levels.
Neither of these correlations is helpful when the data are ordinal scale, which describes much
of the data that psychologists and other social scientists encounter. Nearly everyone who goes
to the mall or answers the telephone has been asked to take a survey, particularly if it happens to be an election year. Survey data are usually ordinal scale. It is common for the questionnaires to have a Likert-type format, where a statement is read and the respondents are
asked the degree to which they agree with the statement by selecting from a range of choices
such as:
• Strongly agree
• Agree
• Neither agree nor disagree
• Disagree
• Strongly disagree
Although surveyors commonly code the responses (strongly agree = 1, agree = 2, and so on) and then calculate means and standard deviations for all respondents, those statistics
assume that the data are at least interval scale. Survey data rarely are. The Likert types of
responses are essentially rankings. A response of “strongly agree” is more positive than
“agree” but precisely how much more is not clear. Besides, one respondent’s “disagree” may
be another respondent’s “strongly disagree.” These data are more safely treated as ordinal
scale responses.
Correlating Ordinal, or Mixed Ordinal/Interval Data
In addition to survey data, ordinal scale characterizes other common data, such as class rankings and percentile scores. Sometimes the variables investigators might wish to correlate
have mixed scales. For example, a researcher wants to correlate subjects’ income (ratio scale
data) with their optimism (usually gauged with a Likert-type survey and so ordinal scale).
Along with the ordinal variable, the income variable is often not normally distributed. The
lack of normality in both the ratio variable and the ordinal scale variable rules out a Pearson’s
correlation.
Charles Spearman, Pearson's colleague at University College London, developed a tremendously flexible correlation procedure. It accommodates any two variables, provided they fit any of the following:
• Both are ordinal scale.
• One variable is ordinal scale and one is interval or ratio scale.
• Two variables are interval or ratio scale, but one or both fail to meet the Pearson
correlation requirement for normality.
The procedure is Spearman's rho, symbolized by ρ. Spearman's rho is a nonparametric procedure, which means that it makes no assumptions about population parameters; as a result, ρ will
accommodate data when there are reasons to suspect that the data are not normally distributed. The formula, which requires that the scores for each variable be independently ranked,
is as follows:
Formula 8.3

ρ = 1 − 6Σd² / [n(n² − 1)]

where

d = the difference between the rankings for the two variables
n = the number of pairs of data

The formula's 1 and 6 are constants, used every time a Spearman's correlation is calculated.
Following are the steps to calculating a Spearman’s rho:
1. Rank the scores for both variables separately.
2. For each pair of rankings, subtract the second ranking in the pair from the first to
produce a difference score, d.
3. Square each of the d values for d².
4. Sum the d² values for Σd².
5. Solve for ρ.
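As a rough illustration, the five steps can be collected into a short Python sketch. The function names and the five-pair data set here are ours, invented for illustration; they do not come from the text:

```python
def average_ranks(values):
    """Step 1: rank values from smallest to largest; tied values
    receive the mean of the rank positions they jointly occupy."""
    ordered = sorted(values)
    return [sum(i + 1 for i, v in enumerate(ordered) if v == x) /
            ordered.count(x) for x in values]

def spearman_rho(x, y):
    """Steps 2-5: rho = 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    d = [a - b for a, b in zip(average_ranks(x), average_ranks(y))]  # step 2
    sum_d_sq = sum(v ** 2 for v in d)                                # steps 3-4
    n = len(x)
    return 1 - (6 * sum_d_sq) / (n * (n ** 2 - 1))                   # step 5

# Invented five-pair data set; with no ties the ranks equal the raw positions
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # → 0.8
```

With five untied pairs the ranks equal the raw values themselves, so the function reduces to the bare formula: Σd² = 4, and 1 − 6(4)/[5(5² − 1)] = 0.8.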
Ranking Tied Scores
The ranking procedure must follow rules. If some of the scores for one of the variables have
multiples, all must receive the same ranking. If someone were ranking the following values,
for example:
3, 5, 6, 6, 7, 8, 8, 8, 9, 10
ranking the values from smallest to largest produces the following values:
1, 2, 3.5, 3.5, 5, 7, 7, 7, 9, 10.
The smallest value, 3, was ranked “1,” the 5 was ranked “2,” and so on. The two 6s and the
three 8s were handled as follows:
• Because the two 6s occupy rankings 3 and 4, those two values are added and divided by the number of them (2), which results in 3.5 ([3 + 4] ÷ 2). After both 6s are ranked 3.5 (for places 3 and 4), the next value in the data set, 7, is ranked 5.
• The 8s are all ranked 7 ([6 + 7 + 8] ÷ 3), after which the next value, the 9, is ranked 9.
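The tie-handling rule can be checked with a small Python helper (a sketch; the function name is ours, not from the text):

```python
def rank_with_ties(values):
    """Rank from smallest to largest; tied values share the mean of
    the rank positions they would otherwise occupy."""
    ordered = sorted(values)
    return [sum(i + 1 for i, v in enumerate(ordered) if v == x) /
            ordered.count(x) for x in values]

# The section's own data: two 6s and three 8s
print(rank_with_ties([3, 5, 6, 6, 7, 8, 8, 8, 9, 10]))
# → [1.0, 2.0, 3.5, 3.5, 5.0, 7.0, 7.0, 7.0, 9.0, 10.0]
```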
An Example
Suppose the data ranked above measure emotional stability, a variable thought to correlate negatively with
stress. If those data are collected for career military service personnel assigned to combat areas, and age data
are added for 10 subjects, Table 8.12 might be the result.
Try It!: #5
Spearman's rho requires data of what scale?
Table 8.12: Emotional stability and age data

Emotional stability    Age
3                      26
6                      32
5                      25
6                      35
7                      35
8                      34
8                      37
8                      40
9                      42
10                     39
Calculations for a Spearman's rho solution, based on the information in Problem 8.2, give

ρ = 1 − 6Σd² / [n(n² − 1)] = 1 − 6(24.5) / [10(10² − 1)] = 0.852
Table 8.13 lists the critical values for Spearman’s rho (Table B.6 in Appendix B). There are no
degrees of freedom for this procedure. The correct critical value for rho is indicated by the
number of data pairs. Note that for p = 0.05 and 10 pairs, ρ.05(10) = 0.648. The relationship
between emotional stability and age among service personnel assigned to combat zones is
statistically significant; therefore, we reject H0.
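The arithmetic can be verified in a few lines of Python, using the sum of squared rank differences (24.5) from the worked example:

```python
# Spearman's rho for the emotional stability/age data: n = 10 pairs
sum_d_sq, n = 24.5, 10
rho = 1 - (6 * sum_d_sq) / (n * (n ** 2 - 1))
print(round(rho, 3))  # → 0.852, which exceeds the 0.648 critical value
```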
Table 8.13: The critical values for Spearman's rho

Number of pairs of scores    p = 0.05    p = 0.01
5                            1.000       —
6                            0.886       1.000
7                            0.786       0.929
8                            0.738       0.881
9                            0.683       0.833
10                           0.648       0.794
12                           0.591       0.777
14                           0.544       0.715
16                           0.506       0.665
18                           0.475       0.625
20                           0.450       0.591
22                           0.428       0.562
24                           0.409       0.537
26                           0.392       0.515
28                           0.377       0.496
30                           0.364       0.478

Source: University of Sussex. (n.d.). Critical values of Spearman's rho (two-tailed). Retrieved from www.sussex.ac.uk/Users/grahamh/RM1web/Rhotable.htm
Problem 8.2: The Spearman’s rho correlation: emotional stability
and age among service personnel
1. Ranking the scores produces ρ1 for emotional stability and ρ2 for age.
2. The d score is the difference between the two rankings.
3. The square of the difference score is d².
Emotional stability    Age    ρ1     ρ2     d (ρ1 − ρ2)    d²
3                      26     1      2      −1             1
6                      32     3.5    3      0.5            0.25
5                      25     2      1      1              1
6                      35     3.5    5.5    −2             4
7                      35     5      5.5    −0.5           0.25
8                      34     7      4      3              9
8                      37     7      7      0              0
8                      40     7      9      −2             4
9                      42     9      10     −1             1
10                     39     10     8      2              4

Σd² = 24.50
Apply It!
Exploring the Correlation between
Job Satisfaction and Commute Times
As part of the justification for allowing workers
to work at home part-time, the human resources
director for a large firm intends to investigate
any correlation between job satisfaction and
average commute time for employees. The
director asks ten randomly selected employees
to fill out a job-satisfaction questionnaire with
the following responses to a series of questions:
Response                         Score
• very satisfied (vs)            1
• somewhat satisfied (ss)        2
• somewhat dissatisfied (sd)     3
• very dissatisfied (vd)         4
The employees were also asked to indicate their average one-way commute time in minutes.
Recognizing that job satisfaction responses will be ordinal scale, the HR director opts for
Spearman’s rho. The data and the difference scores are shown in Table 8.14.
Table 8.14: Spearman's rho data for the correlation between job satisfaction and commute time

Commute time (minutes)    Commute rank    Job satisfaction total    Satisfaction rank    Difference    Difference squared
2                         1               10                        2                    −1            1
7                         2               14                        5                    −3            9
11                        3               10                        2                    1             1
15                        4               14                        5                    −1            1
17                        5               10                        2                    3             9
23                        6               14                        5                    1             1
28                        7               17                        7.5                  −0.5          0.25
32                        8               22                        9.5                  −1.5          2.25
36                        9               22                        9.5                  −0.5          0.25
40                        10              17                        7.5                  2.5           6.25
From the table, the sum of the squared differences is

Σd² = 1 + 9 + 1 + 1 + 9 + 1 + 0.25 + 2.25 + 0.25 + 6.25 = 31
For n = 10, the Spearman's rho formula is

ρ = 1 − 6Σd² / [n(n² − 1)] = 1 − 6(31) / [10(10² − 1)] = 0.812
For p = 0.05 and 10 pairs of data, the critical value is ρ.05(10) = 0.648. The relationship between
job satisfaction and average commute time is statistically significant. Those who commute the
least time have the highest levels of job satisfaction. Perhaps the attitudes of those who have
the lowest levels of job satisfaction—those who have the longest commutes—will improve if
they are required to commute less often because they can sometimes work from home.
Apply It! boxes written by Shawn Murphy
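As a cross-check on the Apply It! figures, the whole computation can be scripted (a Python sketch; the helper function is ours, not part of the text):

```python
def average_ranks(values):
    """Rank smallest to largest, averaging the positions of ties."""
    ordered = sorted(values)
    return [sum(i + 1 for i, v in enumerate(ordered) if v == x) /
            ordered.count(x) for x in values]

commute = [2, 7, 11, 15, 17, 23, 28, 32, 36, 40]         # minutes, one way
satisfaction = [10, 14, 10, 14, 10, 14, 17, 22, 22, 17]  # questionnaire totals

d = [a - b for a, b in zip(average_ranks(commute), average_ranks(satisfaction))]
sum_d_sq = sum(v ** 2 for v in d)
n = len(commute)
rho = 1 - (6 * sum_d_sq) / (n * (n ** 2 - 1))
print(sum_d_sq, round(rho, 3))  # → 31.0 0.812
```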
Direction of the Ranking

Try It!: #6
For 10 students, grade averages and rank in class are correlated. How will the resulting coefficient be affected if the highest ranked student is given the lowest value (1) versus the highest value (10)?

In the study of emotional stability and age for service personnel, the least stable value received the ranking of 1, and the most stable a ranking of 10, while the youngest subject received the age ranking of 1. In terms of the value of the statistic, it would not have mattered whether the rankings go from lowest to highest, or from highest to lowest, as long as both variables are ranked the same way. We could have ranked the most emotionally stable 1 and the oldest 1, and the coefficient would have come out the same. If we reversed just one of them, however, the correlation would appear to be negative.
Summary of Spearman’s Rho
Spearman’s correlation provides flexibility to the analyst. As long as some evidence of a relationship exists, correlations can be calculated for any combination of ordinal, interval, and
ratio variables. But of course so much latitude requires some sacrifice, and it is statistical
power. In the course of ranking values, the amount of difference between any two data points
is lost. When the ages of the service personnel were ranked,
• the 25-year-old was 1,
• the 26-year-old was 2,
• and the 32-year-old was 3.
Once ranked, the fact that from the first to the second ranking is a one-year difference and
from the second to the third ranking is a six-year difference is lost. Pearson’s r retains those
differences. When both correlations are calculated for the same data, their coefficients usually have little difference, but a Pearson correlation will sometimes be statistically significant when Spearman's is not. Note the comparison of critical values at p = 0.05 shown in
Table 8.15.
Table 8.15: Comparison of Pearson and Spearman critical values

No. pairs    Pearson critical value*    Spearman critical value
5            0.878                      1.000
6            0.811                      0.886
10           0.632                      0.648

*for df = number of pairs − 2
In the examples above, the value required for significance with a Spearman correlation is
higher than that required for a Pearson correlation.
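The observation that the two coefficients usually land close together can be illustrated by computing both on the same data (a Python sketch; the data are invented for illustration and not from the text):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r from raw scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

def spearman_rho(x, y):
    """Spearman's rho via ranks (ties would get averaged positions)."""
    def ranks(v):
        s = sorted(v)
        return [sum(i + 1 for i, u in enumerate(s) if u == w) / s.count(w)
                for w in v]
    d_sq = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    n = len(x)
    return 1 - (6 * d_sq) / (n * (n ** 2 - 1))

# Invented interval-scale data with a strong positive relationship
x = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]
y = [10, 12, 11, 15, 14, 18, 17, 20, 19, 23]
print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))  # → 0.972 0.952
```

On well-behaved data like this, the two coefficients differ only slightly, which mirrors the chapter's point: the choice between them matters mainly when a value sits near the significance boundary.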
Another limitation of the Spearman correlation is that we cannot square the Spearman value to determine the proportion of variance in y explained by x. Spearman's rho has no equivalent of rxy². When the data do not meet the Pearson requirements, however, the researcher has no
choice. When the data do meet the requirements, a Pearson’s r is usually preferable to Spearman’s rho.
Correlation in Research
Correlation procedures answer enough of the questions that interest researchers and consumers of research that the procedures pervade research literature. Arroyo (2015) examined the correlation between work engagement and internal self-concept. Arroyo found that
people tend to engage in the work they do to earn a living, not for the external rewards, but
for the work’s own sake; their work is intrinsically satisfying.
Ceci and Kumar (2015), meanwhile, asked whether happiness correlates with creative capacity. They found no significant correlation but did find a significant correlation between creative capacity and intrinsic motivation, suggesting that those with the greatest creative capacity are probably those who are most internally driven to create. The researchers’ approach to
quantifying happiness is also a matter of interest, since it is often a challenge to find a way to
quantify something so subjective.
Summary and Resources
Chapter Summary
Many of the questions researchers and scholars ask deal with the relationships between
variables. To accommodate them, the discussion in this chapter shifted to statistical
procedures that reflect the hypothesis of association (Objective 1). Three of the many
correlation procedures that respond to the hypothesis of association are the Pearson
correlation, the point-biserial correlation, and Spearman’s rho. In each case, possible
values range from −1.0 to +1.0, and all their coefficients are interpreted the same
way. Positive correlations indicate that as the values in one variable increase, the values
in the other also increase. Negative correlations indicate that as one increases, the
other decreases. The sign of the coefficient, however, is unrelated to its strength
(Objective 2).
The differences among the correlation procedures in this chapter are in the kinds
of variables they accommodate. The Pearson correlation requires interval or ratio
variables that are normally and similarly distributed (Objective 3). A special application of Pearson, the point-biserial correlation, requires an interval/ratio variable and a
second variable that has only two manifestations, or a dichotomously scored variable
(Objective 5). Spearman’s rho accommodates any combination of ordinal, interval, or
ratio variables (Objective 6). Because the data used in a Pearson correlation contain
more information than the rankings that make up the data for Spearman’s approach,
the Pearson value provides more information about the nature of the relationship
between the variables. This is evident in the fact that the Pearson value can be squared
to produce the coefficient of determination. The rxy² value indicates the proportion of
one variable that can be explained by changes in the other (Objective 4). Spearman
values have no equivalent of this statistic.
When two variables share information, they are correlated. The amount of one explained
by the other is what the rxy² value, the coefficient of determination, indicates. This concept provides a foundation for regression, which is the focus of Chapter 9. Regression
allows what is known of y from analyzing x to predict the value of y from a value of x.
It involves calculations and thinking with which you are already familiar, so work the
end-of-chapter problems, reread any of the sections in Chapter 8, and prepare for
Chapter 9.
Key Terms
bivariate correlations Include all procedures that test for significant relationships
between two variables.
canonical correlation Measures the relationship between two groups of variables.
coefficient of determination Indicates the
proportion of one variable in a Pearson correlation that can be explained by the other.
correlation matrix A box in which the variables involved are listed in rows as well as
in columns, and each variable is correlated
with all variables, including itself.
hypothesis of association The umbrella
term for significance tests that analyze the
correlation between or among variables.
hypothesis of difference The umbrella
term for significance tests that analyze the
differences between groups.
linear Describes a relationship between
two variables whose strength is consistent
throughout their ranges. With curvilinear
relationships, the strength and sometimes
even the nature of the relationship (positive
or negative) changes depending upon where
in the variables’ ranges they are measured.
multiple correlation Gauges the strength
of the relationship between one variable and
two or more other variables.
nonparametric Tests for data that do not
meet the usual normality requirements.
More technically, a test in which there is no
interest in population parameters.
partial correlation Measures the relationship between two variables, controlling for
the influence of a third in both of the first two.
Pearson correlation coefficient Indicates
the strength of the relationship between
interval- or ratio-scale variables.
point-biserial correlation A special application of the Pearson correlation for those
instances where one of the variables, such
as gender or marital status, has just two
manifestations.
range attenuation Occurs when a variable
is not measured throughout its entire range.
Attenuated range artificially reduces the
strength of any resulting correlation value.
scatterplot A graph representing two variables, one on the horizontal axis, the other
on the vertical axis. Each point in the graph
indicates the measure of both variables for
one individual.
semi-partial correlation Gauges the relationship between two variables, controlling
for a third in just one of the first two.
Spearman’s rho A correlation procedure
for two ordinal variables, one ordinal and
one interval/ratio variable or two interval or
ratio variables, that fail to meet Pearson correlation requirements for normality.
Review Questions
Answers to the odd-numbered questions are provided in Appendix A.
1. What values indicate the strongest and weakest values for a Pearson’s r?
2. What is the equivalent in a Pearson correlation for η²?
3. What are the requirements for calculating Pearson’s r?
4. What is “range attenuation,” and how does it affect correlation values for linear
relationships?
5. A university counselor gathers data on students’ grades and whether or not they
are employed. What statistical procedure will gauge that relationship?
6. What procedure will indicate whether there is a significant relationship
between sales representatives’ sales rank and their attitudes about the product
they sell?
7. a. What procedure will gauge the relationship between university students’ grade
averages and their scores on, for example, a statistics test?
b. What statistic will indicate the proportion of the students’ test scores that is a
function of their GPA?
8. A forensic psychologist gathers data on the average time of night juveniles go to bed
and whether or not they have an arrest record.
a. What procedure will allow the psychologist to evaluate the relationship between
those two variables?
b. What is the resulting coefficient?
c. How much of the variability in arrest records can be explained by what time the juvenile goes to bed?

Juvenile    Retire    Arrest
1           9.0       No
2           9.5       No
3           11.0      Yes
4           11.5      Yes
5           10.0      Yes
6           9.75      No
7           10.0      No
8           10.25     Yes
9. A group of consumers has just taken two surveys on (a) their attitude about
the economy and (b) their attitude about those in government. In both, higher
scores mean more optimism. The data are ordinal scale. Are the two attitudes
related?
Consumer    Economy    Government
1           15         10
2           5          4
3           16         11
4           10         8
5           11         13
6           3          4
7           12         10
8           11         8
9           10         7
10          14         9
10. A group of students has been told that reading will help them in a test of verbal
ability required by the university they wish to attend. The x variable indicates the
minutes per day spent reading. The y variable represents students’ scores on
the test.
Student    Minutes (x)    Score (y)
1          15             57
2          80             84
3          0              60
4          75             92
5          30             60
6          10             65
7          22             75
8          15             68
a. Is the relationship statistically significant?
b. How much of the variance in test scores can be explained by differences in the
amount of time spent reading?
11. A district psychologist is working with developmentally disabled students in a
special education setting and is curious about the relationship between students’
persistence on puzzle tasks (measured in the number of minutes they remain on
task) and their number of absences from class.
Student    Persist    Absent
1          12         3
2          4          3
3          15         5
4          18         7
5          12         4
6          5          1
7          8          4
8          9          3
Is the relationship between persistence and attendance statistically significant at
p = 0.05?
12. An employer wishes to analyze the relationship between stress and job performance. Stress is reflected by systolic blood pressure. Job performance is measured in
the number of sales per day.
a. What is the appropriate correlation procedure?
b. Is the relationship statistically significant?
Employee    Sales    Blood pressure
1           1        150
2           4        140
3           3        140
4           6        110
5           2        140
6           4        130
7           0        160
8           3        110
9           5        120
10          7        160
13. An industrial psychologist is determining the relationship between workers' willingness to embrace new manufacturing procedures, gauged with a dogmatism scale
(higher scores indicate greater dogmatism), and their level of job satisfaction (higher
scores indicate greater satisfaction). The satisfaction data are at least ordinal scale.
a. What is the relationship?
b. What is the null hypothesis?
c. Do you reject or fail to reject the null hypothesis?
d. What is the relationship between dogmatism and job satisfaction?
e. Is the correlation statistically significant?
Worker    Dogmatism    Satisfaction
1         8            4
2         4            5
3         3            15
4         12           2
5         5            14
6         15           1
7         7            15
8         14           3
Answers to Try It! Questions
1. A single point in a scatterplot represents two raw scores, one for x and one for y.
2. If the two variables are normally distributed but uncorrelated, their combined scatterplot will be circular with greatest density in the middle of the plot because of the
tendency for most of the data to fall in the middle of either distribution.
3. Range attenuation diminishes the strength of the correlation value in linear relationships. It produces an artificially low correlation coefficient.
4. As degrees of freedom increase, the correlation value required to reach significance
diminishes.
5. Spearman’s rho accommodates variables that have any combination of ordinal,
interval, or ratio scale.
6. The coefficient would indicate that the higher the ranking, the lower the GPA. If a
ranking of 1 is “best,” the best (highest) GPA must also receive a class ranking of 1.
Otherwise, the relationship looks negative when it is not.
9 Linear Regression
Chapter Learning Objectives
After reading this chapter, you should be able to do the following:
1. Explain the relationship between correlation and regression.
2. Describe the regression line in least-squares regression.
3. Estimate a predictor-based criterion value using regression.
4. Explain multiple regression.
Introduction
Regression is a powerful analytical tool that in its simplest form uses the relationship between
two variables to predict one from the other. People often make regression-like predictions.
The presence of clouds in the morning sky prompts us to take an umbrella to work, for example, or a phone call from unexpected guests results in extra food prepared for a meal. The concepts in this chapter follow the same thinking, except that the predictions are mathematical.
Social scientists rely on regression in virtually every advanced statistical procedure. Most of those
high-end statistical techniques—such as multivariate analysis of variance, discriminant-function analysis, and structural-equations modeling—are beyond the scope of an introductory text, but regression analysis is an essential part of the preparation for each of them. In the meantime, regression has value in its own right, as a mathematical process that uses the relationship
between two variables to predict the value of one from the value of the other.
9.1 Regression and Correlation
Chapter 8 made the point that when two variables are correlated, it is because they share
information. For example, if intelligence correlates with reading comprehension, it is because
to some degree each measures a common characteristic. The more highly they are correlated,
the greater the quantity of whatever is measured that the two characteristics have in common, which is what the coefficient of determination (rxy²) indicates. It reveals the proportion of one variable that can be explained by the other. If intelligence (x) and reading comprehension (y) are correlated at, say, rxy = 0.8, then rxy² = 0.64: 64% of whatever reading comprehension measures can be explained by variations in intelligence.
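In code, the move from a correlation to the proportion of explained variance is a single squaring, shown here with the text's rxy = 0.8 example:

```python
r_xy = 0.8            # correlation between intelligence and reading comprehension
r_sq = r_xy ** 2      # coefficient of determination
print(f"{round(r_sq * 100)}% of reading comprehension explained")
# prints "64% of reading comprehension explained"
```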
If correlated variables share information and we have information about the value of one
of those variables, we should be able to make a better-than-chance prediction of the corresponding value of the other.
• If age and height are correlated for teenagers, and we know how old a subject is, we
should be able to make a better-than-chance prediction of the individual’s height.
Conversely, if we know a teen’s height, we ought to be able to predict age.
• If education and income are correlated, and we know how many years of schooling a subject has had, we should be able to make a reasonable prediction of that person’s income.
• If the length of soldiers’ exposure to combat correlates with their manifestation of
post-traumatic stress disorder (PTSD), we can predict the severity of PTSD from the
length of combat exposure.
Regression allows us to address issues such as these mathematically. The concept is not new.
Karl Gauss, the same mathematician who defined the characteristics of the normal (Gaussian)
distribution, began developing the procedures behind regression in the early part of the 19th
century. Many others have also contributed. Collectively, their work has allowed experts in a
variety of fields to use regression procedures in their decision-making for many years.
• Economists gather data on unemployment rates, wholesale inventories, and consumer
spending in order to predict the rate at which the economy will grow. This approach is
effective because each of those variables correlates with economic expansion.
• Meteorologists use changes in barometric pressure to predict weather. Because
drops in barometric pressure are predictors of violent storms in the Great Plains
states and in the southeastern part of
the United States, meteorologists watch
particularly for dramatic drops in air
pressure.
• Sports oddsmakers rely on data such as a
team’s past performance, injuries to key
players, and the quality of the opponent to
predict game outcomes.
• Psychologists use genetic and social
factors, including the history of alcohol
abuse in a family, to predict an individual’s
predisposition to abuse drugs.
Meteorologists use regression procedures to predict the occurrence of violent storms.
Each of these scenarios is possible because of correlations between variables. Correlations
pave the way for prediction. The point is not that the people in the examples necessarily sit
down with mathematical models to calculate the probability that certain results will emerge,
but they could. In fact, a review of the professional
literature provides ample evidence that scholars perform analyses like these frequently. They are important because prediction allows those who must act to be proactive. Rather than waiting for some important condition to emerge, affected parties can anticipate its timing with some precision, and then take appropriate action. Prediction is the basis for sound decision-making.

Try It!: #1
From the standpoint of making a prediction, why does the strength of the correlation between the two variables involved matter?
The Language of Regression
Many types of regression procedures are employed; although this chapter concerns itself
with just one, the concepts and much of the language used here are common to the different approaches. In the Statistical Package for the Social Sciences (SPSS), one of the most
popular computer programs for statistical analysis, the variable to be predicted is called
the dependent variable. The variable used to make the prediction is called the independent
variable.
Those terms are common enough in statistics, but in regression discussions, the words independent and dependent run the risk of suggesting a causal relationship between the antecedent variables and that dependent variable. Although we also used this language with t tests
and ANOVA, the risk is greater in correlation discussions because the discussion begins by
assuming a relationship between the variables. To avoid this slippery slope, we will make an
adjustment in discussing regression. Rather than the terms “dependent variable” and “independent variable,” as common as those terms are elsewhere, we will refer to the variable to
be predicted in a regression procedure as the criterion variable, and the variable used to
make the prediction as the predictor variable. We adopt this language to minimize the risk
of confusing correlations with causal relationships.
This does not mean that no causal relationship exists; just such a connection may be at work.
In fact, the relationship between exposure to combat and the development of post-traumatic
stress, for example, probably is causal. The point is that the correlation alone—and correlation is the foundation for pursuing regression—is not usually sufficient by itself to establish
causality.
Although this chapter uses the terms criterion and predictor here for descriptive purposes,
some shorthand indicators are needed as well. The symbols used in regression are the same
as those used with the Pearson correlation in Chapter 8: x and y for the correlated variables.
Here, x symbolizes the predictor variable and y symbolizes the criterion variable.
Choosing the Predictor
The confusion that can occur when equating correlation with cause increases when we recognize that either variable in a significant correlation can be used to predict the other. If a correlation exists between the degree of post-traumatic stress disorder (PTSD) and the length of
exposure to combat, it means that each variable is equally related to the other. A researcher
might, for instance, predict the degree of PTSD from the length of combat exposure or predict
the converse relationship: the length of combat exposure from the degree of PTSD.
Either variable in a statistically significant correlation can predict the other. From the point
of view of the mathematics involved, which variable predicts which does not matter, although
practical considerations may dictate the predictor and the criterion. Sometimes one of the
variables will prove more elusive than the other because the data involved are more difficult
to gather. In such cases, the difficulty involved may require that the more accessible variable
becomes the predictor and that the less available variable be predicted rather than gathered.
If reading comprehension scores are significantly correlated with intelligence scores, and
someone wishes to predict the value of one from the other, to use reading scores as the predictor variable makes sense. Reading scores are more accessible than intelligence scores. Most
major intelligence tests must be administered to one subject at a time by someone trained
to use the instrument. This process makes gathering intelligence scores expensive and time-consuming. Reading tests, on the other hand, can be group administered and usually require
little training.
Other factors must be considered when determining predictors and criteria. Perhaps the
scores from the college-aptitude tests students take in high school are correlated with
the grades that students earn during their first year of college study. From the standpoint
of the correlation, scores can be predicted from grades quite as readily as grades can be predicted from scores, but it will generally be the students’ future—rather than their past—
performance that will be of interest.
Picturing Regression
Chapter 8’s scatterplot illustrated the correlation between verbal ability and intelligence. In
the graph, each point represented one subject’s scores on two variables. When variables are
highly correlated, the points reflect an inclining or declining line from left to right in the scatterplot, depending upon whether the correlation is positive or negative. Little “scatter” along
the line indicates high correlation between variables.
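The link between scatter and the size of the correlation can be sketched with two small, hypothetical point sets: one hugging a straight line, one with the same upward trend but far more spread:

```python
# Sketch (hypothetical point sets): the less the points scatter about a
# straight line, the closer |r| is to 1.
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation from the sample covariance and standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

x_vals = [1, 2, 3, 4, 5]
tight = [2.1, 3.9, 6.0, 8.1, 9.9]   # points hugging the line y = 2x
loose = [4.0, 1.5, 7.0, 3.0, 8.0]   # same upward trend, much more scatter

r_tight = pearson_r(x_vals, tight)
r_loose = pearson_r(x_vals, loose)
print(round(r_tight, 3), round(r_loose, 3))  # the tighter cloud yields the larger r
```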
© 2016 Bridgepoint Education, Inc. All rights reserved. Not for resale or redistribution.
Section 9.1
Regression and Correlation
When scatterplots are applied to regression, the predictor variable scores (x) are plotted on the horizontal axis and the criterion variable scores (y) on the vertical axis. Perhaps a researcher randomly selects a group of 20 students at the end of their first term of study at a large university and gathers two types of data from them: (a) the number of hours per week they typically study and (b) their recorded grade averages at the end of the first term. The researcher wants to determine how well the number of hours a student studies per week will predict the student's grades. This means that hours studied is the predictor variable, x, and grade average is the criterion variable, y. Table 9.1 lists the data.

Table 9.1: Study data for hours studied and grade average

Subject | Hours studied (x) | Grade average (y)
[The 20 data rows of Table 9.1 are garbled in this copy and are not reproduced.]

To plot the data manually, draw the vertical and horizontal axes of the graph. Mark equal intervals on the horizontal axis for increasing hours studied and along the vertical axis for increasing grade averages. These data can be used to create another scatterplot like the one in Chapter 8. …
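A minimal sketch of the prediction step, using hypothetical stand-in values rather than the actual Table 9.1 data: the least-squares slope and intercept are computed from the (x, y) pairs, and the resulting line then predicts a grade average from hours studied.

```python
# Sketch with hypothetical stand-in data (not the textbook's Table 9.1):
# hours studied is the predictor x, grade average the criterion y, and a
# least-squares line y-hat = a + b*x predicts y from x.
from statistics import mean

hours = [1, 3, 5, 7, 9, 11, 13, 15, 17]                  # predictor x (hypothetical)
grades = [1.5, 1.9, 2.0, 2.2, 2.4, 2.7, 2.6, 3.0, 3.1]   # criterion y (hypothetical)

mx, my = mean(hours), mean(grades)
b = sum((x - mx) * (y - my) for x, y in zip(hours, grades)) / \
    sum((x - mx) ** 2 for x in hours)   # least-squares slope
a = my - b * mx                         # intercept: line passes through (mx, my)

predicted = a + b * 10                  # predicted grade average for 10 hours/week
print(round(b, 3), round(a, 3), round(predicted, 2))
```

With these stand-in values the slope is positive, as the chapter's example expects: more weekly study hours predict a higher grade average.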