Help with Homework - Achiever Papers

Homework #4: Chapters 4.12-4.14, Due 10/3/22 by Midnight PT
Question 1:
Question 2:
In the library on a university campus, there is a sign in the elevator that indicates a weight limit of 2500
pounds. Assume the average weight of students, faculty and staff on campus is right-skewed, with a
mean of 150 pounds, and standard deviation 27 pounds. A random sample of 16 persons from the
campus is selected.
a. Describe the sampling distribution of the sample mean weight.
b. What is the probability that the average weight of the 16 people in the sample is less than 160
pounds?
c. Suppose the sample of 16 people is placed in the library elevator. What is the probability that the
total weight of the 16 persons on the elevator will exceed the weight limit of 2500 pounds?
Question 3:
Suppose individuals with a certain gene have a 0.70 probability of eventually contracting a certain
disease. Using normal approximation to the Binomial, answer the following questions:
a. If 100 individuals with the gene participate in a lifetime study, what is the distribution of the random
variable, X, describing the number of individuals who will contract the disease?
b. Suppose in the study from the problem above you found 78 of the individuals contracted the disease.
Does this seem too high? Justify your answer by finding the probability that at least 78 individuals
contract the disease.
Question 4:
Question 5: MINITAB PROBLEM
The IQ scores of a certain city follow a bell-shaped curve with a mean of 100 and variance of 225.
(a) Using Minitab, generate a random sample of size 100 from this population (do not copy and paste
the whole raw data to your submission). Draw a histogram of your sample values and calculate the
mean, standard deviation, and variance.
(c) Using the Empirical Rule, estimate the percentage of your sample values that fall within 1, 2, and 3
standard deviations of your sample mean.
(d) Draw a Normal Probability Plot of your sample values and determine whether your sample
distribution follows a normal distribution.
Stat 350A: Chapter 4.12
Part 1: Sampling Distributions
• Population: Entire collection of items or individuals you wish to study
• Sample: A subset of the population that has been selected to study or measure
• Statistics: Measurements made on sample data (Ȳ, s, π̂)
• Parameters: Measurements made on population data (μ, σ, π)
• A point estimate of a population parameter is a sample statistic that represents a feasible value
of the parameter of interest.
• An unbiased estimator is a sample statistic whose mean value is equal to the value of the
population parameter being estimated.
• The sample mean Ȳ is an unbiased estimator of the population mean μ, but Ȳ varies from sample
to sample (sampling variation).
EXAMPLE 1: Suppose we wish to estimate the mean height of Stat 350A students. The following
samples of size 2 were collected.
Random Sample    1      2      3      4      5
Height 1         74”    65”    66”    64”    63”
Height 2         76”    69”    68”    70”    73”
ȳ
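As a minimal sketch of how the blank ȳ row could be filled in, the Python snippet below averages the two heights in each of the five samples. The variable names are illustrative and not part of the course materials.

```python
from statistics import mean

# Heights (in inches) from the five random samples of size 2 in Example 1.
height_1 = [74, 65, 66, 64, 63]
height_2 = [76, 69, 68, 70, 73]

# Sample mean for each sample: the average of its two heights.
sample_means = [mean(pair) for pair in zip(height_1, height_2)]

for i, ybar in enumerate(sample_means, start=1):
    print(f"Sample {i}: ybar = {ybar}")
```

The five sample means come out to 75, 67, 67, 67, and 68 inches. They differ from sample to sample, which is the sampling variation described above.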
EXAMPLE 2: Tossing a die
(A) Tossing a single die 10,000 times.
(B) Tossing a pair of dice 10,000 times and calculating the average of each pair.
(C) Tossing twenty dice 10,000 times and calculating the averages of each toss.
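The three dice experiments can be imitated in software. The sketch below is one possible simulation, assuming a fair six-sided die and Python's built-in pseudo-random generator; the function name `simulate_dice_means` and the seed are hypothetical choices made for this illustration.

```python
import random
from statistics import mean, pstdev


def simulate_dice_means(num_dice, num_tosses=10_000, seed=0):
    """Return the average face value for each of num_tosses tosses of num_dice fair dice."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    return [
        sum(rng.randint(1, 6) for _ in range(num_dice)) / num_dice
        for _ in range(num_tosses)
    ]


# (A) one die, (B) a pair of dice, (C) twenty dice
results = {k: simulate_dice_means(k) for k in (1, 2, 20)}

for k, averages in results.items():
    print(f"{k:>2} dice: mean of averages = {mean(averages):.3f}, "
          f"SD of averages = {pstdev(averages):.3f}")
```

In each case the averages should center near 3.5, the mean of a single fair die, while their spread shrinks as more dice are averaged. A histogram of `results[20]` would look far more bell-shaped than the flat histogram of `results[1]`.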
Part 2: The Central Limit Theorem and Sampling Distribution of the Sample Mean
• The Central Limit Theorem (CLT): When a random sample of size n is drawn from any non-normal
population with mean μ and known standard deviation σ, the sample mean, Ȳ, has a sampling
distribution that is approximately normal as long as n is large enough (rule of thumb: n > 30).
• Assumptions and Conditions:
◦1) The data values must be sampled randomly.
◦2) The sampled values must be independent of one another.
◦3) Sample size should be less than 10% of the population size.
• The Sampling Distribution Model for a Sample Mean Ȳ
◦The mean of the sample averages is μ_Ȳ = μ
◦The standard deviation of the sample averages is σ_Ȳ = σ/√n
◦If a population is normal, then the sampling distribution of the sample
mean is normal: Ȳ ~ N(μ, σ²/n)
◦If a population is non-normal, then the sampling distribution of the
sample mean is approximately normal according to the CLT as long as n
is large enough: Ȳ ~ AN(μ, σ²/n)
◦The Z-score formula for the sample mean is defined as
Z = (Ȳ - μ)/(σ/√n)
EXAMPLE: Suppose the weights of men are normally distributed with a mean of 173 lbs. and
variance of 900.
(A) What is the probability that a randomly selected man weighs more than 200 lbs.?
(B) What is the probability that the mean weight of 9 randomly selected men is more than 200 lbs.?
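One way to check parts (A) and (B) numerically is sketched below. It assumes Python's standard-library `statistics.NormalDist` as a stand-in for a Z-table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu = 173            # population mean weight (lbs)
sigma = sqrt(900)   # population SD = sqrt(variance) = 30 lbs

# (A) One randomly selected man: X ~ N(173, 30^2), so Z = (200 - 173)/30 = 0.9.
p_single = 1 - NormalDist(mu, sigma).cdf(200)

# (B) Mean of n = 9 men: Ybar ~ N(173, 30^2/9), so the SD of Ybar is 30/sqrt(9) = 10
#     and Z = (200 - 173)/10 = 2.7.
n = 9
p_mean = 1 - NormalDist(mu, sigma / sqrt(n)).cdf(200)

print(f"(A) P(X > 200)    = {p_single:.4f}")
print(f"(B) P(Ybar > 200) = {p_mean:.4f}")
```

The probability drops from roughly 0.18 for a single man to roughly 0.003 for the mean of nine men, because averaging divides the standard deviation by √n.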
Part 3: More Examples
EXAMPLE 1: The times of the finishers in a 10km run are normally distributed with a mean of 61
minutes and a variance of 81. A random sample of 30 runners is selected.
(A) Describe the sampling distribution of the average 10km finishing times for this sample.
(B) Find the probability that the average time of the above sample will be more than 65 minutes.
EXAMPLE 2: A rental car company has noticed that the distribution of the number of miles
customers put on rental cars per day is right-skewed. The distribution has a mean of 60 miles and a
standard deviation of 25 miles. A random sample of 120 rental cars is selected.
(A) Describe the sampling distribution of the average number of miles per day for this sample.
(B) What is the probability that the mean number of miles driven per day for this sample is less than
54?
(C) What is the probability that the total number of miles driven per day for this sample exceeds
7400?
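Part (C) asks about a total rather than a mean. A hedged sketch of both calculations follows, assuming the CLT applies because n = 120 is large and using the fact that the total is n times the sample mean. The names `se_mean`, `total_mean`, and `total_sd` are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 60, 25, 120   # population mean, population SD, sample size
std_normal = NormalDist()    # N(0, 1)

# (A) By the CLT, Ybar ~ AN(mu, sigma^2/n); its standard deviation is sigma/sqrt(n).
se_mean = sigma / sqrt(n)

# (B) P(Ybar < 54) via the Z-score for a sample mean.
z_b = (54 - mu) / se_mean
p_b = std_normal.cdf(z_b)

# (C) Total T = n * Ybar ~ AN(n*mu, n*sigma^2), so SD(T) = sqrt(n) * sigma.
total_mean = n * mu
total_sd = sqrt(n) * sigma
z_c = (7400 - total_mean) / total_sd
p_c = 1 - std_normal.cdf(z_c)

print(f"(A) Ybar ~ AN({mu}, {se_mean:.3f}^2)")
print(f"(B) z = {z_b:.3f}, P(Ybar < 54) = {p_b:.4f}")
print(f"(C) z = {z_c:.3f}, P(T > 7400)  = {p_c:.4f}")
```

The Z-score in (C) is the same one obtained by converting the total to a mean (7400/120 ≈ 61.67 miles) and using the sampling distribution from (A), so either route gives the same probability.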
Stat 350A: Chapter 4.13
Part 1: Normal Approximation to Binomial
• Let Y ~ Bin(n, π). If n is very large, calculating Binomial probabilities by hand can be
tedious. For Binomial experiments with a large number of trials, we can instead use the
Normal distribution to approximate these Binomial probabilities.
• The Normal approximation to the Binomial distribution is an application of the Central Limit
Theorem. Why?
‣ a) Let X ~ Bin(1, π) = Bernoulli(π); then μ = π and σ = √(π(1-π)). Suppose we run n
Bernoulli trials.
‣ b) X̄ = (X1 + X2 + … + Xn)/n ~ AN(π, π(1-π)/n) if n is large enough.
‣ c) Let Y = X1 + X2 + … + Xn = nX̄; then Y ~ AN(nπ, nπ(1-π)).
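Step c) can be checked numerically. The sketch below computes the exact Binomial mean, variance, and one cumulative probability, then compares them with the approximating Normal distribution. The values n = 119 and π = 0.8 are chosen to match the restaurant example in Part 2 of these notes; the helper `binomial_pmf` is a hypothetical name written for this illustration.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Exact P(Y = k) for Y ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 119, 0.8   # illustrative values (an assumption of this sketch)
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Exact mean and variance computed from the probability mass function.
exact_mean = sum(k * pk for k, pk in enumerate(pmf))
exact_var = sum((k - exact_mean) ** 2 * pk for k, pk in enumerate(pmf))

# Normal distribution with the matching mean n*p and variance n*p*(1-p).
approx = NormalDist(n * p, sqrt(n * p * (1 - p)))

y = 100
exact_cdf = sum(pmf[: y + 1])   # exact P(Y <= 100)
normal_cdf = approx.cdf(y)      # Normal approximation, no continuity correction

print(f"mean: exact = {exact_mean:.4f}, n*p = {n * p:.4f}")
print(f"var : exact = {exact_var:.4f}, n*p*(1-p) = {n * p * (1 - p):.4f}")
print(f"P(Y <= {y}): exact = {exact_cdf:.4f}, normal = {normal_cdf:.4f}")
```

The mean and variance agree with nπ and nπ(1-π) up to rounding error, while the two cumulative probabilities are close but not identical, which is what "approximately normal" means in practice.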
• Assumptions:
◦1) Random sample.
◦2) Trials are independent.
◦3) If sampling without replacement, n < 10% of N
◦4) Rule of Thumb for sufficiently large sample size: nπ > 5 and n(1-π) > 5
• Approximating Probabilities for Binomial Random Variables
◦To estimate probabilities for a binomial random variable, Z-scores can be used.
◦P(Y < y) ≈ P(Z < z), where Z = (y - nπ)/√(nπ(1-π))
Part 2: Examples
EXAMPLE 1: Eighty percent of all patrons at a local restaurant request water with their meal.
(a) Suppose we sample 7 customers. What is the probability that at least 5 of the 7 customers
selected will request water with their meal?
(b) Suppose we now take a larger random sample of 119 customers. What is the probability that at
most 100 of this sample will request water with their meal?
EXAMPLE 2: Many toothpaste commercials claim that 3 out of 4 dentists recommend their brand of
toothpaste. A random survey of 400 dentists is taken. Assuming the commercials are correct, what is
the probability that at least 320 dentists from this sample will recommend Brand X toothpaste?
Part 3: Continuity Correction
EXAMPLE 1: Suppose 30% of the population have 20/20 vision. What is the probability of having 5
people with 20/20 vision in a sample of 20?
EXAMPLE 2: Can we use the normal approximation for the previous problem?
EXAMPLE 3: Referring to the two examples above, what is the probability that at least 8 of 20 have
20/20 vision? Use both the Binomial and a Normal approximation.
EXAMPLE 4: Referring to the examples above, what is the probability that 5 to 7 people out of 20
have 20/20 vision? Use both the Binomial and a Normal approximation.
• Summary:
◦1) If nπ < 5 or n(1 - π) < 5, Normal approximation cannot be used.
◦2) If nπ > 20 and n(1 - π) > 20, Normal approximation with no continuity correction can be
used.
◦3) If 5 < nπ < 20 and 5 < n(1 - π) < 20, Normal approximation with continuity correction can be
used.
Stat 350A: Chapter 4.14
Part 1: The Empirical Rule
• The Empirical Rule states that for any normal or approximately normal distribution, approximate
percentages under the curve can be estimated. Also referred to as the 68-95-99.7% rule, it states:
◦68% of the observations are within one standard deviation of the mean.
◦95% of the observations are within two standard deviations of the mean.
◦99.7% of the observations are within three standard deviations of the mean.
• We can use the Empirical Rule to help us assess whether a distribution follows a normal or
approximately normal distribution.
◦1) Calculate the sample mean, Ȳ, and sample standard deviation, s, of the distribution.
◦2) Calculate what percentage of the observations fall within 1, 2, and 3 standard deviations of
the mean.
◦3) Compare these percentages from 2) to 68%, 95%, and 99.7%. If they are close, then it is
reasonable to treat the distribution as normal or approximately normal.
MINITAB EXAMPLE: Let us take a sample of 100 observations from a normal population with a mean
of 10 and variance of 4.
Part 3: Histogram and Normal Probability Plot
• Normal probability plots give a visual way to determine if a distribution is normal or
approximately normal. A normal probability plot is a scatterplot of sorted data vs. normal scores.
• Normal scores are the expected values of the ordered observations in a sample of size n from
the standard normal curve, i.e., N(0, 1). They indicate where one would expect the data to fall if
sampling from a standard normal distribution.
• If the distribution is normal, the plotted points will lie close to a line. Systematic deviations
from the line indicate a non-normal distribution.
MINITAB EXAMPLE: Suppose we have a normal distribution with a mean of 3 and variance of 0.25.
(A) Take a random sample of 10 measurements from this distribution and draw a NPP.
(B) Take a random sample of 100 measurements from this distribution and draw a NPP.
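The Empirical Rule check and the normal probability plot described above can be sketched outside Minitab. The Python snippet below is only a stand-in under stated assumptions: it draws 100 values from N(3, 0.25) with a fixed, arbitrary seed, and it approximates the normal scores with Blom's formula Φ⁻¹((i - 0.375)/(n + 0.25)), which is an approximation to the expected order statistics and may not be the exact convention Minitab uses. The helper names `share_within` and `correlation` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

rng = random.Random(350)        # fixed seed (arbitrary) so the sketch is reproducible
mu, sigma = 3, sqrt(0.25)       # mean 3, variance 0.25, so SD = 0.5
sample = [rng.gauss(mu, sigma) for _ in range(100)]

ybar, s = mean(sample), stdev(sample)


def share_within(k):
    """Fraction of observations within k sample SDs of the sample mean."""
    return sum(abs(y - ybar) <= k * s for y in sample) / len(sample)


# Empirical Rule: compare these shares with 68%, 95%, and 99.7%.
shares = {k: share_within(k) for k in (1, 2, 3)}

# Approximate normal scores (Blom's formula) for the ordered observations.
n = len(sample)
std_normal = NormalDist()
normal_scores = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
ordered = sorted(sample)


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Points of the normal probability plot are (normal_scores[i], ordered[i]);
# a correlation near 1 means they lie close to a straight line.
r = correlation(normal_scores, ordered)

for k, share in shares.items():
    print(f"within {k} SD: {share:.0%}")
print(f"NPP correlation: {r:.4f}")
```

Plotting `ordered` against `normal_scores` with any charting tool would give the normal probability plot itself. The correlation is just a one-number summary of how straight that plot is.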
An Introduction to Statistical Methods & Data Analysis
Seventh Edition
R. Lyman Ott
Michael Longnecker
Texas A&M University
Australia • Brazil • Mexico • Singapore • United Kingdom • United States

Copyright 2016 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.

This is an electronic version of the print textbook. Due to electronic rights restrictions, some third party content may be suppressed. Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. The publisher reserves the right to remove content from this title at any time if subsequent rights restrictions require it. For valuable information on pricing, previous editions, changes to current editions, and alternate formats, please visit www.cengage.com/highered to search by ISBN#, author, title, or keyword for materials in your areas of interest.
Important Notice: Media content referenced within the product description or the product text may not be available in the eBook version.

An Introduction to Statistical Methods and Data Analysis, Seventh Edition
R. Lyman Ott, Michael Longnecker
Senior Product Team Manager: Richard Stratton
Content Developer: Andrew Coppola
Associate Content Developer: Spencer Arritt
Product Assistant: Kathryn Schrumpf
Marketing Manager: Julie Schuster
Content Project Manager: Cheryll Linthicum
Art Director: Vernon Boes
Manufacturing Planner: Sandee Milewski
Intellectual Property Analyst: Christina Ciaramella
Intellectual Property Project Manager: Farah Fard
Production Service and Compositor: Cenveo Publishing Services
Photo and Text Researcher: Lumina Datamatics, LTD
Copy Editor:
Illustrator: Macmillan Publishing Services/Cenveo Publishing Services
Text and Cover Designer: C. Miller
Cover Image: polygraphus/Getty Images

© 2016, 2010 Cengage Learning
WCN: 02-200-203
ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored, or used in any form or by any means graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, Web distribution, information networks, or information storage and retrieval systems, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the publisher.
For product information and technology assistance, contact us at Cengage Learning Customer & Sales Support, 1-800-354-9706.
For permission to use material from this text or product, submit all requests online at www.cengage.com/permissions. Further permissions questions can be e-mailed to permissionrequest@cengage.com
Library of Congress Control Number: 2015938496
ISBN: 978-1-305-26947-7
Cengage Learning
20 Channel Center Street
Boston, MA 02210
USA
Cengage Learning is a leading provider of customized learning solutions with employees residing in nearly 40 different countries and sales in more than 125 countries around the world. Find your local representative at www.cengage.com
Cengage Learning products are represented in Canada by Nelson Education, Ltd.
To learn more about Cengage Learning Solutions, visit www.cengage.com
Purchase any of our products at your local college store or at our preferred online store www.cengagebrain.com
Printed in the United States of America
Print Number: 01 Print Year: 2015

CONTENTS
Preface
PART 1 Introduction
CHAPTER 1 Statistics and the Scientific Method
  1.1 Introduction
  1.2 Why Study Statistics?
  1.3 Some Current Applications of Statistics
  1.4 A Note to the Student
  1.5 Summary
  1.6 Exercises
PART 2 Collecting Data
CHAPTER 2 Using Surveys and Experimental Studies to Gather Data
  2.1 Introduction and Abstract of Research Study
  2.2 Observational Studies
  2.3 Sampling Designs for Surveys
  2.4 Experimental Studies
  2.5 Designs for Experimental Studies
  2.6 Research Study: Exit Polls Versus Election Results
  2.7 Summary
  2.8 Exercises
PART 3 Summarizing Data
CHAPTER 3 Data Description
  3.1 Introduction and Abstract of Research Study
  3.2 Calculators, Computers, and Software Systems
  3.3 Describing Data on a Single Variable: Graphical Methods
  3.4 Describing Data on a Single Variable: Measures of Central Tendency
  3.5 Describing Data on a Single Variable: Measures of Variability
  3.6 The Boxplot
  3.7 Summarizing Data from More Than One Variable: Graphs and Correlation
  3.8 Research Study: Controlling for Student Background in the Assessment of Teaching
  3.9 R Instructions
  3.10 Summary and Key Formulas
  3.11 Exercises
CHAPTER 4 Probability and Probability Distributions
  4.1 Introduction and Abstract of Research Study
  4.2 Finding the Probability of an Event
  4.3 Basic Event Relations and Probability Laws
  4.4 Conditional Probability and Independence
  4.5 Bayes’ Formula
  4.6 Variables: Discrete and Continuous
  4.7 Probability Distributions for Discrete Random Variables
  4.8 Two Discrete Random Variables: The Binomial and the Poisson
  4.9 Probability Distributions for Continuous Random Variables
  4.10 A Continuous Probability Distribution: The Normal Distribution
  4.11 Random Sampling
  4.12 Sampling Distributions
  4.13 Normal Approximation to the Binomial
  4.14 Evaluating Whether or Not a Population Distribution Is Normal
  4.15 Research Study: Inferences About Performance-Enhancing Drugs Among Athletes
  4.16 R Instructions
  4.17 Summary and Key Formulas
  4.18 Exercises
PART 4 Analyzing the Data, Interpreting the Analyses, and Communicating the Results
CHAPTER 5 Inferences About Population Central Values
  5.1 Introduction and Abstract of Research Study
  5.2 Estimation of μ
  5.3 Choosing the Sample Size for Estimating μ
  5.4 A Statistical Test for μ
  5.5 Choosing the Sample Size for Testing μ
  5.6 The Level of Significance of a Statistical Test
  5.7 Inferences About μ for a Normal Population, σ Unknown
  5.8 Inferences About μ When the Population Is Nonnormal and n Is Small: Bootstrap Methods
  5.9 Inferences About the Median
  5.10 Research Study: Percentage of Calories from Fat
  5.11 Summary and Key Formulas
  5.12 Exercises
CHAPTER 6 Inferences Comparing Two Population Central Values
  6.1 Introduction and Abstract of Research Study
  6.2 Inferences About μ1 − μ2: Independent Samples
  6.3 A Nonparametric Alternative: The Wilcoxon Rank Sum Test
  6.4 Inferences About μ1 − μ2: Paired Data
  6.5 A Nonparametric Alternative: The Wilcoxon Signed-Rank Test
  6.6 Choosing Sample Sizes for Inferences About μ1 − μ2
  6.7 Research Study: Effects of an Oil Spill on Plant Growth
  6.8 Summary and Key Formulas
  6.9 Exercises
CHAPTER 7 Inferences About Population Variances
  7.1 Introduction and Abstract of Research Study
  7.2 Estimation and Tests for a Population Variance
  7.3 Estimation and Tests for Comparing Two Population Variances
  7.4 Tests for Comparing t > 2 Population Variances
  7.5 Research Study: Evaluation of Methods for Detecting E. coli
  7.6 Summary and Key Formulas
  7.7 Exercises
CHAPTER 8 Inferences About More Than Two Population Central Values
  8.1 Introduction and Abstract of Research Study
  8.2 A Statistical Test About More Than Two Population Means: An Analysis of Variance
  8.3 The Model for Observations in a Completely Randomized Design
  8.4 Checking on the AOV Conditions
  8.5 An Alternative Analysis: Transformations of the Data
  8.6 A Nonparametric Alternative: The Kruskal–Wallis Test
  8.7 Research Study: Effect of Timing on the Treatment of Port-Wine Stains with Lasers
  8.8 Summary and Key Formulas
  8.9 Exercises
CHAPTER 9 Multiple Comparisons
  9.1 Introduction and Abstract of Research Study
  9.2 Linear Contrasts
  9.3 Which Error Rate Is Controlled?
  9.4 Scheffé’s S Method
  9.5 Tukey’s W Procedure
  9.6 Dunnett’s Procedure: Comparison of Treatments to a Control
  9.7 A Nonparametric Multiple-Comparison Procedure
  9.8 Research Study: Are Interviewers’ Decisions Affected by Different Handicap Types?
  9.9 Summary and Key Formulas
  9.10 Exercises
CHAPTER 10 Categorical Data
  10.1 Introduction and Abstract of Research Study
  10.2 Inferences About a Population Proportion π
  10.3 Inferences About the Difference Between Two Population Proportions, π1 − π2
  10.4 Inferences About Several Proportions: Chi-Square Goodness-of-Fit Test
  10.5 Contingency Tables: Tests for Independence and Homogeneity
  10.6 Measuring Strength of Relation
  10.7 Odds and Odds Ratios
  10.8 Combining Sets of 2 × 2 Contingency Tables
  10.9 Research Study: Does Gender Bias Exist in the Selection of Students for Vocational Education?
  10.10 Summary and Key Formulas
  10.11 Exercises
CHAPTER 11 Linear Regression and Correlation
  11.1 Introduction and Abstract of Research Study
  11.2 Estimating Model Parameters
  11.3 Inferences About Regression Parameters
  11.4 Predicting New y-Values Using Regression
  11.5 Examining Lack of Fit in Linear Regression
  11.6 Correlation
  11.7 Research Study: Two Methods for Detecting E. coli
  11.8 Summary and Key Formulas
  11.9 Exercises
CHAPTER 12 Multiple Regression and the General Linear Model
  12.1 Introduction and Abstract of Research Study
  12.2 The General Linear Model
  12.3 Estimating Multiple Regression Coefficients
  12.4 Inferences in Multiple Regression
  12.5 Testing a Subset of Regression Coefficients
  12.6 Forecasting Using Multiple Regression
  12.7 Comparing the Slopes of Several Regression Lines
  12.8 Logistic Regression
  12.9 Some Multiple Regression Theory (Optional)
  12.10 Research Study: Evaluation of the Performance of an Electric Drill
  12.11 Summary and Key Formulas
  12.12 Exercises
CHAPTER 13 Further Regression Topics
  13.1 Introduction and Abstract of Research Study
  13.2 Selecting the Variables (Step 1)
  13.3 Formulating the Model (Step 2)
  13.4 Checking Model Assumptions (Step 3)
  13.5 Research Study: Construction Costs for Nuclear Power Plants
  13.6 Summary and Key Formulas
  13.7 Exercises
CHAPTER 14 Analysis of Variance for Completely Randomized Designs
  14.1 Introduction and Abstract of Research Study
  14.2 Completely Randomized Design with a Single Factor
  14.3 Factorial Treatment Structure
  14.4 Factorial Treatment Structures with an Unequal Number of Replications
  14.5 Estimation of Treatment Differences and Comparisons of Treatment Means
  14.6 Determining the Number of Replications
  14.7 Research Study: Development of a Low-Fat Processed Meat
  14.8 Summary and Key Formulas
  14.9 Exercises
CHAPTER 15 Analysis of Variance for Blocked Designs
  15.1 Introduction and Abstract of Research Study
  15.2 Randomized Complete Block Design
  15.3 Latin Square Design
  15.4 Factorial Treatment Structure in a Randomized Complete Block Design
  15.5 A Nonparametric Alternative: Friedman’s Test
  15.6 Research Study: Control of Leatherjackets
  15.7 Summary and Key Formulas
  15.8 Exercises
CHAPTER 16 The Analysis of Covariance
  16.1 Introduction and Abstract of Research Study
  16.2 A Completely Randomized Design with One Covariate
  16.3 The Extrapolation Problem
  16.4 Multiple Covariates and More Complicated Designs
  16.5 Research Study: Evaluation of Cool-Season Grasses for Putting Greens
  16.6 Summary
  16.7 Exercises
CHAPTER 17 Analysis of Variance for Some Fixed-, Random-, and Mixed-Effects Models
  17.1 Introduction and Abstract of Research Study
  17.2 A One-Factor Experiment with Random Treatment Effects
  17.3 Extensions of Random-Effects Models
  17.4 Mixed-Effects Models
  17.5 Rules for Obtaining Expected Mean Squares
  17.6 Nested Factors
  17.7 Research Study: Factors Affecting Pressure Drops Across Expansion Joints
  17.8 Summary
  17.9 Exercises
CHAPTER 18 Split-Plot, Repeated Measures, and Crossover Designs
  18.1 Introduction and Abstract of Research Study
  18.2 Split-Plot Designed Experiments
  18.3 Single-Factor Experiments with Repeated Measures
  18.4 Two-Factor Experiments with Repeated Measures on One of the Factors
  18.5 Crossover Designs
  18.6 Research Study: Effects of an Oil Spill on Plant Growth
  18.7 Summary
  18.8 Exercises
CHAPTER 19 Analysis of Variance for Some Unbalanced Designs
  19.1 Introduction and Abstract of Research Study
  19.2 A Randomized Block Design with One or More Missing Observations
  19.3 A Latin Square Design with Missing Data
  19.4 Balanced Incomplete Block (BIB) Designs
  19.5 Research Study: Evaluation of the Consistency of Property Assessors
  19.6 Summary and Key Formulas
  19.7 Exercises
Appendix: Statistical Tables
Answers to Selected Exercises
References
Index
PREFACE INDEX Intended Audience An Introduction to Statistical Methods and Data Analysis, Seventh Edition, provides a broad overview of statistical methods for advanced undergraduate and graduate students from a variety of disciplines. This book is intended to prepare students to solve problems encountered in research projects, to make decisions based on data in general settings both within and beyond the university setting, and finally to become critical readers of statistical analyses in research papers and in news reports. The book presumes that the students have a minimal mathematical background (high school algebra) and no prior course work in statistics. The first 11 chapters of the textbook present the material typically covered in an introductory statistics course. However, this book provides research studies and examples that connect the statistical concepts to data analysis problems that are often encountered in undergraduate capstone courses. The remaining chapters of the book cover regression modeling and design of experiments. We develop and illustrate the statistical techniques and thought processes needed to design a research study or experiment and then analyze the data collected using an intuitive and proven four-step approach. This should be especially helpful to graduate students conducting their MS thesis and PhD dissertation research. Major Features of Textbook Learning from Data In this text, we approach the study of statistics by considering a four-step process by which we can learn from data: 1. Defining the Problem 2. Collecting the Data 3. Summarizing the Data 4. Analyzing the Data, Interpreting the Analyses, and Communicating the Results Case Studies In order to demonstrate the relevance and critical nature of statistics in solving realworld problems, we introduce the major topic of each chapter using a case study. The case studies were selected from many sources to illustrate the broad applicability of statistical methodology. 
The four-step learning from data process is illustrated through the case studies. This approach will hopefully assist in overcoming Copyright 2016 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. xi xii Preface the natural initial perception held by many people that statistics is just another “math course.’’ The introduction of major topics through the use of case studies provides a focus on the central nature of applied statistics in a wide variety of research and business-related studies. These case studies will hopefully provide the reader with an enthusiasm for the broad applicability of statistics and the statistical thought process that the authors have found and used through their many years of teaching, consulting, and R & D management. The following research studies illustrate the types of studies we have used throughout the text. ●● Exit Polls Versus Election Results: A study of why the exit polls from 9 of 11 states in the 2004 presidential election predicted John Kerry as the winner when in fact President Bush won 6 of the 11 states. ●● Evaluation of the Consistency of Property Assessors: A study to determine if county property assessors differ systematically in their determination of property values. ●● Effect of Timing of the Treatment of Port-Wine Stains with Lasers: A prospective study that investigated whether treatment at a younger age would yield better results than treatment at an older age. 
●● Controlling for Student Background in the Assessment of Teaching: An examination of data used to support possible improvements to the No Child Left Behind program while maintaining the important concepts of performance standards and accountability.

Each of the research studies includes a discussion of the whys and hows of the study. We illustrate the use of the four-step learning from data process with each case study. A discussion of sample size determination, graphical displays of the data, and a summary of the necessary ingredients for a complete report of the statistical findings of the study are provided with many of the case studies.

Examples and Exercises

We have further enhanced the practical nature of statistics by using examples and exercises from journal articles, newspapers, and the authors’ many consulting experiences. These will provide the students with further evidence of the practical uses of statistics in solving problems that are relevant to their everyday lives. Many new exercises and examples have been included in this edition of the book. The number and variety of exercises will be a great asset to both the instructor and students in their study of statistics.

Topics Covered

This book can be used for either a one-semester or a two-semester course. Chapters 1 through 11 would constitute a one-semester course. The topics covered would include

Chapter 1: Statistics and the scientific method
Chapter 2: Using surveys and experimental studies to gather data
Chapters 3 & 4: Summarizing data and probability distributions
Chapters 5–7: Analyzing data: inferences about central values and variances
Chapters 8 & 9: One-way analysis of variance and multiple comparisons
Chapter 10: Analyzing data involving proportions
Chapter 11: Linear regression and correlation

The second semester of a two-semester course would then include model building and inferences in multiple regression analysis, logistic regression, design of experiments, and analysis of variance:

Chapters 11–13: Regression methods and model building: multiple regression and the general linear model, logistic regression, and building regression models with diagnostics
Chapters 14–19: Design of experiments and analysis of variance: design concepts, analysis of variance for standard designs, analysis of covariance, random and mixed effects models, split-plot designs, repeated measures designs, crossover designs, and unbalanced designs

Emphasis on Interpretation, not Computation

The book contains examples and exercises that allow the student to study how to calculate the value of statistical estimators and test statistics using the definitional form of the procedure. After the student becomes comfortable with the aspects of the data the statistical procedure is reflecting, we then emphasize the use of computer software in making computations in the analysis of larger data sets. We provide output from three major statistical packages: SAS, Minitab, and SPSS. We find that this approach provides the student with the experience of computing the value of the procedure using the definition; hence, the student learns the basics behind each procedure. In most situations beyond the statistics course, the student should be using computer software in making the computations for both expedience and quality of calculation.
In many exercises and examples, the use of the computer allows for more time to emphasize the interpretation of the results of the computations without having to expend enormous amounts of time and effort in the actual computations. In numerous examples and exercises, the importance of the following aspects of hypothesis testing is demonstrated:

1. The statement of the research hypothesis through the summarization of the researcher’s goals into a statement about population parameters.
2. The selection of the most appropriate test statistic, including sample size computations for many procedures.
3. The necessity of considering both Type I and Type II error rates (α and β) when discussing the results of a statistical test of hypotheses.
4. The importance of considering both the statistical significance and the practical significance of a test result. Thus, we illustrate the importance of estimating effect sizes and the construction of confidence intervals for population parameters.
5. The statement of the results of the statistical test in nonstatistical jargon that goes beyond the statement “reject H0” or “fail to reject H0.”

New to the Seventh Edition

●● There are instructions on the use of R code. R is a free software package that can be downloaded from http://lib.stat.cmu.edu/R/CRAN. Click your choice of platform (Linux, MacOS X, or Windows) for the precompiled binary distribution. Note the FAQs link to the left for additional information. Follow the instructions for installing the base system software (which is all you will need).
●● New examples illustrate the breadth of applications of statistics to real-world problems.
●● An alternative to the standard deviation, MAD, is provided as a measure of dispersion in a population/sample.
●● The use of bootstrapping in obtaining confidence intervals and p-values is discussed.
●● Instructions are included on how to use R code to obtain percentiles and probabilities from the following distributions: normal, binomial, Poisson, chi-squared, F, and t.
●● A nonparametric alternative to the Pearson correlation coefficient, Spearman’s rank correlation, is provided.
●● The binomial test for small sample tests of proportions is presented.
●● The McNemar test for paired count data has been added.
●● The Akaike information criterion and Bayesian information criterion for variable selection are discussed.

Additional Features Retained from Previous Editions

●● Many practical applications of statistical methods and data analysis from agriculture, business, economics, education, engineering, medicine, law, political science, psychology, environmental studies, and sociology have been included.
●● The seventh edition contains over 1,000 exercises, with nearly 400 of the exercises new.
●● Computer output from Minitab, SAS, and SPSS is provided in numerous examples. The use of computers greatly facilitates the use of more sophisticated graphical illustrations of statistical results.
●● Attention is paid to the underlying assumptions. Graphical procedures and test procedures are provided to determine if assumptions have been violated. Furthermore, in many settings, we provide alternative procedures when the conditions are not met.
●● The first chapter discusses “What Is Statistics?” It explains why students should study statistics and describes several major studies that illustrate the use of statistics in the solution of real-life problems.

Ancillaries

●● Student Solutions Manual (ISBN-10: 1-305-26948-9; ISBN-13: 978-1-305-26948-4), containing select worked solutions for problems in the textbook.
●● A Companion Website at www.cengage.com/statistics/ott, containing downloadable data sets for Excel, Minitab, SAS, SPSS, and others, plus additional resources for students and faculty.

Acknowledgments

There are many people who have made valuable, constructive suggestions for the development of the original manuscript and during the preparation of the subsequent editions. We are very appreciative of the insightful and constructive comments from the following reviewers:

Naveen Bansal, Marquette University
Kameryn Denaro, San Diego State University
Mary Gray, American University
Craig Leth-Steensen, Carleton University
Jing Qian, University of Massachusetts
Mark Riggs, Abilene Christian University
Elaine Spiller, Marquette University

We are also appreciative of the preparation assistance received from Molly Taylor and Jay Campbell and of the scheduling of the revisions by Mary Tindle, the Senior Project Manager at Cenveo Publisher Services, who made sure that the book was completed in a timely manner.
The authors of the solutions manual, Soma Roy, California Polytechnic State University, and John Draper, The Ohio State University, provided me with excellent input, which resulted in an improved set of exercises for the seventh edition. The person who assisted me to the greatest degree in the preparation of the seventh edition was Sherry Goldbecker, the copy editor. Sherry not only corrected my many grammatical errors but also rephrased many sentences, which made for a more straightforward explanation of statistical concepts. The students who use this book in their statistics classes will be most appreciative of Sherry’s many contributions.

PART 1 Introduction

CHAPTER 1 Statistics and the Scientific Method

1.1 Introduction
1.2 Why Study Statistics?
1.3 Some Current Applications of Statistics
1.4 A Note to the Student
1.5 Summary
1.6 Exercises

1.1 Introduction

Statistics is the science of designing studies or experiments, collecting data, and modeling/analyzing data for the purpose of decision making and scientific discovery when the available information is both limited and variable. That is, statistics is the science of Learning from Data.

Almost everyone, including social scientists, medical researchers, superintendents of public schools, corporate executives, market researchers, engineers, government employees, and consumers, deals with data. These data could be in the form of quarterly sales figures, percent increase in juvenile crime, contamination levels in water samples, survival rates for patients undergoing medical therapy, census figures, or information that helps determine which brand of car to purchase. In this text, we approach the study of statistics by considering the four-step process in Learning from Data: (1) defining the problem, (2) collecting the data, (3) summarizing the data, and (4) analyzing the data, interpreting the analyses, and communicating the results. Through the use of these four steps in Learning from Data, our study of statistics closely parallels the Scientific Method, which is a set of principles and procedures used by successful scientists in their pursuit of knowledge. The method involves the formulation of research goals, the design of observational studies and/or experiments, the collection of data, the modeling/analysis of the data in the context of research goals, and the testing of hypotheses. The conclusion of these steps is often the formulation of new research goals for another study. These steps are illustrated in the schematic given in Figure 1.1.
This book is divided into sections corresponding to the four-step process in Learning from Data. The relationship among these steps and the chapters of the book is shown in Table 1.1. As you can see from this table, much time is spent discussing how to analyze data using the basic methods presented in Chapters 5–19.

FIGURE 1.1 Scientific Method Schematic
[Schematic boxes: Formulate research goal (research hypotheses, models); Design study (sample size, variables, experimental units, sampling mechanism); Collect data (data management); Draw inferences (graphs, estimation, hypotheses testing, model assessment); Make decisions (written conclusions, oral presentations); Formulate new research goals (new models, new hypotheses)]

TABLE 1.1 Organization of the text

The Four-Step Process and the corresponding Chapters

1 Defining the Problem
    1 Statistics and the Scientific Method
2 Collecting the Data
    2 Using Surveys and Experimental Studies to Gather Data
3 Summarizing the Data
    3 Data Description
    4 Probability and Probability Distributions
4 Analyzing the Data, Interpreting the Analyses, and Communicating the Results
    5 Inferences about Population Central Values
    6 Inferences Comparing Two Population Central Values
    7 Inferences about Population Variances
    8 Inferences about More Than Two Population Central Values
    9 Multiple Comparisons
    10 Categorical Data
    11 Linear Regression and Correlation
    12 Multiple Regression and the General Linear Model
    13 Further Regression Topics
    14 Analysis of Variance for Completely Randomized Designs
    15 Analysis of Variance for Blocked Designs
    16 The Analysis of Covariance
    17 Analysis of Variance for Some Fixed-, Random-, and Mixed-Effects Models
    18 Split-Plot, Repeated Measures, and Crossover Designs
    19 Analysis of Variance for Some Unbalanced Designs

However, you must remember that for each data set requiring analysis, someone has defined the problem to be examined (Step 1), developed a plan for collecting data to address the problem (Step 2), and summarized the data and prepared the data for analysis (Step 3). Then following the analysis of the data, the results of the analysis must be interpreted and communicated either orally or in written form to the intended audience (Step 4).

All four steps are important in Learning from Data; in fact, unless the problem to be addressed is clearly defined and the data collection carried out properly, the interpretation of the results of the analyses may convey misleading information because the analyses were based on a data set that did not address the problem or that was incomplete and contained improper information. Throughout the text, we will try to keep you focused on the bigger picture of Learning from Data through the four-step process. Most chapters will end with a summary section that emphasizes how the material of the chapter fits into the study of statistics: Learning from Data.
To illustrate some of the above concepts, we will consider four situations in which the four steps in Learning from Data could assist in solving a real-world problem.

1. Problem: Inspection of ground beef in a large beef-processing facility. A beef-processing plant produces approximately half a million packages of ground beef per week. The government inspects packages for possible improper labeling of the packages with respect to the percent fat in the meat. The inspectors must open the ground beef package in order to determine the fat content of the ground beef. The inspection of every package would be prohibitively costly and time-consuming. An alternative approach is to select 250 packages for inspection from the daily production of 100,000 packages. The fraction of packages with improper labeling in the sample of 250 packages would then be used to estimate the fraction of packages improperly labeled in the complete day’s production. If this fraction exceeds a set specification, action is then taken against the meat processor. In later chapters, a procedure will be formulated to determine how well the sample fraction of improperly labeled packages approximates the fraction of improperly labeled packages for the whole day’s output.

2. Problem: Is there a relationship between quitting smoking and gaining weight? To investigate the claim that people who quit smoking often experience a subsequent weight gain, researchers selected a random sample of 400 participants who had successfully participated in programs to quit smoking. The individuals were weighed at the beginning of the program and again 1 year later. The average change in weight of the participants was an increase of 5 pounds. The investigators concluded that there was evidence that the claim was valid. We will develop techniques in later chapters to assess when changes are truly significant changes and not changes due to random chance.

3. Problem: What effect does nitrogen fertilizer have on wheat production? For a study of the effects of nitrogen fertilizer on wheat production, a total of 15 fields was available to the researcher. She randomly assigned three fields to each of the five nitrogen rates under investigation. The same variety of wheat was planted in all 15 fields. The fields were cultivated in the same manner until harvest, and the number of pounds of wheat per acre was then recorded for each of the 15 fields. The experimenter wanted to determine the optimal level of nitrogen to apply to any wheat field, but, of course, she was restricted to running experiments on a limited number of fields. After determining the amount of nitrogen that yielded the largest production of wheat in the study fields, the experimenter then concluded that similar results would hold for wheat fields possessing characteristics somewhat the same as the study fields. Is the experimenter justified in reaching this conclusion?

4. Problem: Determining public opinion toward a question, issue, product, or candidate. Similar applications of statistics are brought to mind by the frequent use of the New York Times/CBS News, Washington Post/ABC News, Wall Street Journal/NBC News, Harris, Gallup/Newsweek, and CNN/Time polls. How can these pollsters determine the opinions of more than 195 million Americans who are of voting age? They certainly do not contact every potential voter in the United States. Rather, they sample the opinions of a small number of potential voters, perhaps as few as 1,500, to estimate the reaction of every person of voting age in the country. The amazing result of this process is that if the selection of the voters is done in an unbiased way and voters are asked unambiguous, nonleading questions, the fraction of those persons contacted who hold a particular opinion will closely match the fraction in the total population holding that opinion at a particular time. We will supply convincing supportive evidence of this assertion in subsequent chapters.

These problems illustrate the four-step process in Learning from Data. First, there was a problem or question to be addressed. Next, for each problem a study or experiment was proposed to collect meaningful data to solve the problem. The government meat inspection agency had to decide both how many packages to inspect per day and how to select the sample of packages from the total daily output in order to obtain a valid prediction. The polling groups had to decide how many voters to sample and how to select these individuals in order to obtain information that is representative of the population of all voters. Similarly, it was necessary to carefully plan how many participants in the weight-gain study were needed and how they were to be selected from the list of all such participants. Furthermore, what variables did the researchers have to measure on each participant? Was it necessary to know each participant’s age, sex, physical fitness, and other health-related variables, or was weight the only important variable? The results of the study may not be relevant to the general population if many of the participants in the study had a particular health condition.
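The polling claim made above, that a sample of roughly 1,500 voters can closely track the opinion of a very large population, can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population fraction of 0.52, the random seed, the number of repeated polls, and the function names are hypothetical choices and do not come from the text. It also assumes simple random sampling with unambiguous questions, as the text requires.

```python
import math
import random


def simulate_poll(true_fraction, sample_size, rng):
    """Return the sample fraction holding the opinion in one simulated poll."""
    # Each respondent independently holds the opinion with probability
    # true_fraction; this approximates simple random sampling from a
    # very large population.
    hits = sum(1 for _ in range(sample_size) if rng.random() < true_fraction)
    return hits / sample_size


def margin_of_error(fraction, sample_size, z=1.96):
    """Approximate 95% margin of error for a sample fraction (normal approximation)."""
    return z * math.sqrt(fraction * (1 - fraction) / sample_size)


rng = random.Random(2016)   # fixed seed so the run is repeatable (assumed value)
TRUE_FRACTION = 0.52        # hypothetical population fraction, not from the text
SAMPLE_SIZE = 1500          # sample size mentioned in the text
NUM_POLLS = 200             # number of repeated simulated polls (assumed value)

estimates = [simulate_poll(TRUE_FRACTION, SAMPLE_SIZE, rng) for _ in range(NUM_POLLS)]
moe = margin_of_error(TRUE_FRACTION, SAMPLE_SIZE)
within = sum(abs(e - TRUE_FRACTION) <= moe for e in estimates) / NUM_POLLS

print(f"approximate margin of error: {moe:.3f}")
print(f"share of simulated polls within that margin: {within:.2f}")
```

Under these assumptions the margin of error is about 0.025, that is, roughly 2.5 percentage points. Most of the simulated polls should land within that distance of the assumed population fraction, which is the behavior the text describes informally.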
In the wheat experiment, it was important to measure both the soil characteristics of the fields and the environmental conditions, such as temperature and rainfall, to obtain results that could be generalized to fields not included in the study. The design of a study or experiment is crucial to obtaining results that can be generalized beyond the study.

Finally, having collected, summarized, and analyzed the data, it is important to report the results in unambiguous terms to interested people. For the meat inspection example, the government inspection agency and the personnel in the beef-processing plant would need to know the distribution of fat content in the daily production of ground beef. Based on this distribution, the agency could then impose fines or take other remedial actions against the production facility. Also, knowledge of this distribution would enable company production personnel to make adjustments to the process in order to obtain acceptable fat content in their ground beef packages. Therefore, the results of the statistical analyses cannot be presented in ambiguous terms; decisions must be made from a well-defined knowledge base. The results of the weight-gain study would be of vital interest to physicians who have patients participating in the smoking-cessation program. If a significant increase in weight was recorded for those individuals who had quit smoking, physicians would have to recommend diets so that the former smokers would not go from one health problem (smoking) to another (elevated blood pressure due to being overweight). It is crucial that a careful description of the participants (that is, age, sex, and other health-related information) be included in the report. In the wheat study, the experiment would provide farmers with information that would allow them to economically select the optimum amount of nitrogen required for their fields. Therefore, the report must contain information concerning the amount of moisture and types of soils present on the study fields. Otherwise, the conclusions about optimal wheat production may not pertain to farmers growing wheat under considerably different conditions.

To infer validly that the results of a study are applicable to a larger group than just the participants in the study, we must carefully define the population (see Definition 1.1) to which inferences are sought and design a study in which the sample (see Definition 1.2) has been appropriately selected from the designated population. We will discuss these issues in Chapter 2.

FIGURE 1.2 Population and sample
[Diagram: the set of all measurements is the population; the set of measurements selected from the population is the sample]

DEFINITION 1.1 A population is the set of all measurements of interest to the sample collector. (See Figure 1.2.)

DEFINITION 1.2 A sample is any subset of measurements selected from the population. (See Figure 1.2.)

1.2 Why Study Statistics?

We can think of many reasons for taking an introductory course in statistics. One reason is that you need to know how to evaluate published numerical facts. Every person is exposed to manufacturers’ claims for products; to the results of sociological, consumer, and political polls; and to the published results of scientific research. Many of these results are inferences based on sampling. Some inferences are valid; others are invalid.
Some are based on samples of adequate size; others are not. Yet all these published results bear the ring of truth. Some people (particularly statisticians) say that statistics can be made to support almost anything. Others say it is easy to lie with statistics. Both statements are true. It is easy, purposely or unwittingly, to distort the truth by using statistics when presenting the results of sampling to the uninformed. It is thus crucial that you become an informed and critical reader of data-based reports and articles.

A second reason for studying statistics is that your profession or employment may require you to interpret the results of sampling (surveys or experimentation) or to employ statistical methods of analysis to make inferences in your work. For example, practicing physicians receive large amounts of advertising describing the benefits of new drugs. These advertisements frequently display the numerical results of experiments that compare a new drug with an older one. Do such data really imply that the new drug is more effective, or is the observed difference in results due simply to random variation in the experimental measurements? Recent trends in the conduct of court trials indicate an increasing use of probability and statistical inference in evaluating the quality of evidence.
The use of statistics in the social, biological, and physical sciences is essential because all these sciences make use of observations of natural phenomena, through sample surveys or experimentation, to develop and test new theories. Statistical methods are employed in business when sample data are used to forecast sales and profit. In addition, they are used in engineering and manufacturing to monitor product quality. The sampling of accounts is a useful tool to assist accountants in conducting audits. Thus, statistics plays an important role in almost all areas of science, business, and industry; persons employed in these areas need to know the basic concepts, strengths, and limitations of statistics.

The article “What Educated Citizens Should Know About Statistics and Probability,” by J. Utts (2003), contains a number of statistical ideas that need to be understood by users of statistical methodology in order to avoid confusion in the use of their research findings. Misunderstandings of statistical results can lead to major errors by government policymakers, medical workers, and consumers of this information. The article selected a number of topics for discussion. We will summarize some of the findings in the article. A complete discussion of all these topics will be given throughout the book.

1. One of the most frequent misinterpretations of statistical findings occurs when a statistically significant relationship is established between two variables and it is then concluded that a change in the explanatory variable causes a change in the response variable. As will be discussed in the book, this conclusion can be reached only under very restrictive constraints on the experimental setting. Utts examined a recent Newsweek article discussing the relationship between the strength of religious beliefs and physical healing.
Utts’ article discussed the problems in reaching the conclusion that the stronger a patient’s religious beliefs, the more likely the patient would be cured of his or her ailment. Utts showed that there are numerous other factors involved in a patient’s health and that the conclusion that religious beliefs cause a cure cannot be validly reached.

2. A common confusion in many studies is the difference between (statistically) significant findings in a study and (practically) significant findings. This problem often occurs when large data sets are involved in a study or experiment. This type of problem will be discussed in detail throughout the book. We will use a number of examples that will illustrate how this type of confusion can be avoided by researchers when reporting the findings of their experimental results. Utts’ article illustrated this problem with a discussion of a study that found a statistically significant difference in the average heights of military recruits born in the spring and in the fall. There were 507,125 recruits in the study and the difference in average height was about 1/4 inch. So, even though there may be a difference in the actual average heights of recruits born in the spring and in the fall, the difference is so small (1/4 inch) that it is of no practical importance.

3. The size of the sample also may be a determining factor in studies in which statistical significance is not found.
A study may not have selected a sample size large enough to discover a difference between the several populations under study. In many government-sponsored studies, the researchers do not receive funding unless they are able to demonstrate that the sample sizes selected for their study are of an appropriate size to detect specified differences in populations if in fact they exist. Methods to determine appropriate sample sizes will be provided in the chapters on hypotheses testing and experimental design. 4. Surveys are ubiquitous, especially during the years in which national elections are held. In fact, market surveys are nearly as widespread as political polls. There are many sources of bias that can creep into the most reliable of surveys. The manner in which people are selected for inclusion in the survey, the way in which questions are phrased, and even the manner in which questions are posed to the subject may affect the conclusions obtained from the survey. We will discuss these issues in Chapter 2. 5. Many students find the topic of probability to be very confusing. One of these confusions involves conditional probability where the probability of an event occurring is computed under the condition that a second event has occurred with certainty. For example, a new diagnostic test for the pathogen Escherichia coli in meat is proposed to the U.S. Department of Agriculture (USDA). The USDA evaluates the test and determines that the test has both a low false positive rate and a low false negative rate. That is, it is very unlikely that the test will declare the meat contains E. coli when in fact it does not contain E. coli. Also, it is very unlikely that the test will declare the meat does not contain E. coli when in fact it does contain E. coli. Although the diagnostic test has a very low false positive rate and a very low false negative rate, the probability that E. 
coli is in fact present in the meat when the test yields a positive test result is very low for those situations in which a particular strain of E. coli occurs very infrequently. In Chapter 4, we will demonstrate how this probability can be computed in order to provide a true assessment of the performance of a diagnostic test. 6. Another concept that is often misunderstood is the role of the degree of variability in interpreting what is a “normal” occurrence of some naturally occurring event. Utts’ article provided the following example. A company was having an odor problem with its wastewater treatment plant. It attributed the problem to “abnormal” rainfall during the period in which the odor problem was occurring. A company official stated that the facility experienced 170% to 180% of its “normal” rainfall during this period, which resulted in the water in Copyright 2016 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 1.3 Some Current Applications of Statistics 9 the holding ponds t aking longer to exit for irrigation. Thus, there was more time for the pond to develop an odor. The company official did not point out that yearly rainfall in this region is extremely variable. In fact, the historical range for rainfall is between 6.1 and 37.4 inches with a median rainfall of 16.7 inches. The rainfall for the year of the odor problem was 29.7 inches, which was well within the “normal” range for rainfall. There was a confusion between the terms “average” and “normal” rainfall. The concept of natural variability is crucial to correct interpretation of statistical results. 
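The rainfall figures quoted in the example can be checked with a few lines of arithmetic. The sketch below uses only the four numbers given in the text and assumes that the official's "normal" rainfall refers to the historical median, which the text does not state explicitly. The variable names are illustrative.

```python
# Rainfall figures quoted in the text, in inches.
historical_min = 6.1
historical_max = 37.4
historical_median = 16.7
odor_year = 29.7

# Odor-year rainfall as a multiple of the median
# (assumption: the official's "normal" means the median).
ratio_to_median = odor_year / historical_median

# Does the odor-year total fall inside the historical range?
within_range = historical_min <= odor_year <= historical_max

print(f"Percent of median: {ratio_to_median:.0%}")
print(f"Within historical range: {within_range}")
```

Under that assumption the ratio falls between 170% and 180%, which matches the official's claim, while the range check shows that the same total is still inside the span of rainfall the region has historically recorded. A ratio to a typical value says nothing by itself about how unusual an observation is.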
In this example, the company official should have evaluated the percentile for an annual rainfall of 29.7 inches in order to demonstrate the abnormality of such a rainfall. We will discuss the ideas of data summaries and percentiles in Chapter 3.
The types of problems expressed above and in Utts’ article represent common and important misunderstandings that can occur when researchers use statistics in interpreting the results of their studies. We will attempt throughout the book to discuss possible misinterpretations of statistical results and how to avoid them in your data analyses. More importantly, we want the reader of this book to become a discriminating reader of statistical findings, the results of surveys, and project reports.
1.3 Some Current Applications of Statistics
Defining the Problem: Obtaining Information from Massive Data Sets
Data mining is defined to be a process by which useful information is obtained from large sets of data. Data mining uses statistical techniques to discover patterns and trends that are present in a large data set. In most data sets, important patterns would not be discovered by using traditional data exploration techniques, either because the relationships among the many variables in the data set are too complex or because the data sets are so large that they mask the relationships. The patterns and trends discovered in the analysis of the data are defined as data mining models. These models can be applied to many different situations, such as:
● Forecasting: Estimating future sales, predicting demands on a power grid, or estimating server downtime
● Assessing risk: Choosing the rates for insurance premiums, selecting best customers for a new sales campaign, determining which medical therapy is most appropriate given the physiological characteristics of the patient
● Identifying sequences: Determining customer preferences in online purchases, predicting weather events
● Grouping: Placing customers or events into clusters of related items, analyzing and predicting relationships between demographic characteristics and purchasing patterns, identifying fraud in credit card purchases
A new medical procedure referred to as gene editing has the potential to assist thousands of people suffering many different diseases. An article in the Houston Chronicle (2013) describes how data mining techniques are used to explore massive genomic databases to interpret millions of bits of data in a person’s DNA. This information is then used to identify a single defective gene, which is cut out and replaced by splicing in a correction. This area of research is referred to as biomedical informatics and is based on the premise that the human body is a data bank of incredible depth and complexity. It is predicted that by 2015, the average hospital will have approximately 450 terabytes of patient data consisting of large, complex images from CT scans, MRIs, and other imaging techniques. However, only a small fraction of the current medical data has been analyzed, thus opening huge opportunities for persons trained in data mining. In a case described in the article, a 7-year-old boy tormented by scabs, blisters, and scars was given a new lease on life by using data mining techniques to discover a single letter in his faulty genome.
Defining the Problem: Determining the Effectiveness of a New Drug Product
The development and testing of the Salk vaccine for protection against poliomyelitis (polio) provide an excellent example of how statistics can be used in solving practical problems. Most parents and children growing up before 1954 can recall the panic brought on by the outbreak of polio cases during the summer months. Although relatively few children fell victim to the disease each year, the pattern of outbreak of polio was unpredictable and caused great concern because of the possibility of paralysis or death. The fact that very few of today’s youth have even heard of polio demonstrates the great success of the vaccine and the testing program that preceded its release on the market.
It is standard practice in establishing the effectiveness of a particular drug product to conduct an experiment (often called a clinical trial) with human participants. For some clinical trials, assignments of participants are made at random, with half receiving the drug product and the other half receiving a solution or tablet that does not contain the medication (called a placebo). One statistical problem concerns the determination of the total number of participants to be included in the clinical trial.
This problem was particularly important in the testing of the Salk vaccine because data from previous years suggested that the incidence rate for polio might be less than 50 cases for every 100,000 children. Hence, a large number of participants had to be included in the clinical trial in order to detect a difference in the incidence rates for those treated with the vaccine and those receiving the placebo.
With the assistance of statisticians, it was decided that a total of 400,000 children should be included in the Salk clinical trial begun in 1954, with half of them randomly assigned the vaccine and the remaining children assigned the placebo. No other clinical trial had ever been attempted on such a large group of participants. Through a public school inoculation program, the 400,000 participants were treated and then observed over the summer to determine the number of children contracting polio. Although fewer than 200 cases of polio were reported for the 400,000 participants in the clinical trial, more than three times as many cases appeared in the group receiving the placebo. These results, together with some statistical calculations, were sufficient to indicate the effectiveness of the Salk polio vaccine. However, these conclusions would not have been possible if the statisticians and scientists had not planned for and conducted such a large clinical trial.
The development of the Salk vaccine is not an isolated example of the use of statistics in the testing and development of drug products. In recent years, the U.S. Food and Drug Administration (FDA) has placed stringent requirements on pharmaceutical firms wanting to establish the effectiveness of proposed new drug products. Thus, statistics has played an important role in the development and testing of birth control pills, rubella vaccines, chemotherapeutic agents in the treatment of cancer, and many other preparations.
Defining the Problem: Improving the Reliability of Evidence in Criminal Investigations
The National Academy of Sciences released a report (National Research Council, 2009) in which one of the more important findings was the need for applying statistical methods in the design of studies used to evaluate inferences from evidence gathered by forensic technicians. The following statement is central to the report: “Over the last two decades, advances in some forensic science disciplines, especially the use of DNA technology, have demonstrated that some areas of forensic science have great additional potential to help law enforcement identify criminals. . . . Those advances, however, also have revealed that, in some cases, substantive information and testimony based on faulty forensic science analyses may have contributed to wrongful convictions of innocent people. This fact has demonstrated the potential danger of giving undue weight to evidence and testimony derived from imperfect testing and analysis.”
There are many sources that may impact the accuracy of conclusions inferred from the crime scene evidence and presented to a jury by a forensic investigator. Statistics can play a role in improving forensic analyses. Statistical principles can be used to identify sources of variation and quantify the size of the impact that these sources of variation can have on the conclusions reached by the forensic investigator.
An illustration of the impact of an inappropriately designed study and statistical analysis on the conclusions reached from the evidence obtained at a crime scene can be found in Spiegelman et al. (2007). They demonstrate that the evidence used by the FBI crime lab to support the claim that there was not a second assassin of President John F. Kennedy was based on a faulty analysis of the data and an overstatement of the results of a method of forensic testing called Comparative Bullet Lead Analysis (CBLA). This method applies a chemical analysis to link a bullet found at a crime scene to the gun that had discharged the bullet. Based on evidence from chemical analyses of the recovered bullet fragments, the 1979 U.S. House Select Committee on Assassinations concluded that all the bullets striking President Kennedy were fired from Lee Oswald’s rifle. A new analysis of the bullets using more appropriate statistical analyses demonstrated that the evidence presented in 1979 was overstated. A case is presented for a new analysis of the assassination bullet fragments, which may shed light on whether the five bullet fragments found in the Kennedy assassination are derived from three or more bullets and not just two bullets, as was presented as the definitive evidence that Oswald was the sole shooter in the assassination of President Kennedy.
Defining the Problem: Estimating Bowhead Whale Population Size
Raftery and Zeh (1998) discuss the estimation of the population size and rate of increase in bowhead whales, Balaena mysticetus. The importance of such a study derives from the fact that bowheads were the first species of great whale for which commercial whaling was stopped; thus, their status indicates the recovery prospects of other great whales. Also, the International Whaling Commission uses these estimates to determine the aboriginal subsistence whaling quota for Alaskan Eskimos. To obtain the necessary data, researchers conducted a visual and acoustic census off Point Barrow, Alaska. The researchers then applied statistical models and estimation techniques to the data obtained in the census to determine whether the bowhead population had increased or decreased since commercial whaling was stopped. The statistical estimates showed that the bowhead population was increasing at a healthy rate, indicating that stocks of great whales that have been decimated by commercial hunting can recover after hunting is discontinued.
Defining the Problem: Ozone Exposure and Population Density
Ambient ozone pollution in urban areas is one of the nation’s most pervasive environmental problems. Whereas the decreasing stratospheric ozone layer may lead to increased instances of skin cancer, high ambient ozone intensity has been shown to cause damage to the human respiratory system as well as to agricultural crops and trees. The Houston, Texas, area has ozone concentrations that exceed the National Ambient Air Quality Standard and are rated second only to those of Los Angeles. Carroll et al. (1997) describe how to analyze the hourly ozone measurements collected in Houston from 1980 to 1993 by 9 to 12 monitoring stations. Besides the ozone level, each station recorded three meteorological variables: temperature, wind speed, and wind direction.
The statistical aspect of the project had three major goals:
1. Provide information (and/or tools to obtain such information) about the amount and pattern of missing data as well as about the quality of the ozone and the meteorological measurements.
2. Build a model of ozone intensity to predict the ozone concentration at any given location within Houston at any given time between 1980 and 1993.
3. Apply this model to estimate exposure indices that account for either a long-term exposure or a short-term high-concentration exposure; also, relate census information to different exposure indices to achieve population exposure indices.
The spatial–temporal model the researchers built provided estimates demonstrating that the highest ozone levels occurred at locations with relatively small populations of young children. Also, the model estimated that the exposure of young children to ozone decreased by approximately 20% from 1980 to 1993. An examination of the distribution of population exposure had several policy implications. In particular, it was concluded that the current placement of monitors is not ideal if one is concerned with assessing population exposure. This project involved all four components of Learning from Data: planning where the monitoring stations should be placed within the city, how often the data should be collected, and what variables should be recorded; conducting spatial–temporal graphing of the data; creating spatial–temporal models of the ozone data, meteorological data, and demographic data; and, finally, writing a report that could assist local and federal officials in formulating policy with respect to decreasing ozone levels.
Defining the Problem: Assessing Public Opinion
Public opinion, consumer preference, and election polls are commonly used to assess the opinions or preferences of a segment of the public regarding issues, products, or candidates of interest. We, the American public, are exposed to the results of these polls daily in newspapers, in magazines, on the internet, on the radio, and on television. For example, the results of polls related to the following subjects were printed in local newspapers:
● Public confidence in the potential for job growth in the coming year
● Reactions of Texas residents to the state legislature’s failure to expand Medicaid coverage
● Voters’ preferences for tea party candidates in the fall congressional elections
● Attitudes toward increasing the gasoline tax in order to increase funding for road construction and maintenance
● Product preference polls related to specific products (Toyota vs. Ford, DirecTV vs. Comcast, Dell vs. Apple, Subway vs. McDonald’s)
● Public opinion on a national immigration policy
A number of questions can be raised about polls. Suppose we consider a poll on the public’s opinion on a proposed income tax increase in the state of Michigan. What was the population of interest to the pollster? Was the pollster interested in all residents of Michigan or just those citizens who currently pay income taxes? Was the sample in fact selected from this population? If the population of interest was all persons currently paying income taxes, did the pollster make sure that all the individuals sampled were current taxpayers? What questions were asked and how were the questions phrased? Was each person asked the same question? Were the questions phrased in such a manner as to bias the responses? Can we believe the results of these polls?
Do these results “represent” how the general public currently feels about the issues raised in the polls?
Opinion and preference polls are an important, visible application of statistics for the consumer. We will discuss this topic in more detail in Chapters 2 and 10. We hope that after studying this material you will have a better understanding of how to interpret the results of these polls.
1.4 A Note to the Student
We think with words and concepts. A study of the discipline of statistics requires us to memorize new terms and concepts (as does the study of a foreign language). Commit these definitions, theorems, and concepts to memory. Also, focus on the broader concept of making sense of data. Do not let details obscure these broader characteristics of the subject. The teaching objective of this text is to identify and amplify these broader concepts of statistics.
1.5 Summary
The discipline of statistics and those who apply the tools of that discipline deal with Learning from Data. Medical researchers, social scientists, accountants, agronomists, consumers, government leaders, and professional statisticians are all involved with data collection, data summarization, data analysis, and the effective communication of the results of data analysis.
1.6 Exercises
1.1 Introduction
Bio. 1.1 Hansen (2006) describes a study to assess the migration and survival of salmon released from fish farms located in Norway. The mingling of escaped farmed salmon with wild salmon raises several concerns. First, the assessment of the abundance of wild salmon stocks will be biased if there is a presence of large numbers of farmed salmon. Second, potential interbreeding between farmed and wild salmon may result in a reduction in the health of the wild stocks. Third, diseases present in farmed salmon may be transferred to wild salmon. Two batches of farmed salmon were tagged and released in two locations, one batch of 1,996 fish in northern Norway and a second batch of 2,499 fish in southern Norway. The researchers recorded the time and location at which the fish were captured by either commercial fishermen or anglers in fresh water. Two of the most important pieces of information to be determined by the study were the distance from the point of the fish’s release to the point of its capture and the length of time it took for the fish to be captured.
a. Identify the population that is of interest to the researchers.
b. Describe the sample.
c. What characteristics of the population are of interest to the researchers?
d. If the sample measurements are used to make inferences about the population characteristics, why is a measure of reliability of the inferences important?
Env. 1.2 During 2012, Texas had listed on FracFocus, an industry fracking disclosure site, nearly 6,000 oil and gas wells in which the fracking methodology was used to extract natural gas. Fontenot et al. (2013) report on a study of 100 private water wells in or near the Barnett Shale in Texas. There were 91 private wells located within 5 km of an active gas well using fracking, 4 private wells with no gas wells located within a 14 km radius, and 5 wells outside of the Barnett Shale with no gas well located within a 60 km radius. They found that there were elevated levels of potential contaminants such as arsenic and selenium in the 91 wells closest to natural gas extraction sites compared to the 9 wells that were at least 14 km away from an active gas well using the fracking technique to extract natural gas.
a. Identify the population that is of interest to the researchers.
b. Describe the sample.
c. What characteristics of the population are of interest to the researchers?
d. If the sample measurements are used to make inferences about the population characteristics, why is a measure of reliability of the inferences important?
Soc. 1.3 In 2014, Congress cut $8.7 billion from the Supplemental Nutrition Assistance Program (SNAP), more commonly referred to as food stamps. The rationale for the decrease is that providing assistance to people will result in the next generation of citizens being more dependent on the government for support. Hoynes (2012) describes a study to evaluate this claim. The study examines 60,782 families over the time period of 1968 to 2009, which is subsequent to the introduction of the Food Stamp Program in 1961. This study examines the impact of a positive and policy-driven change in economic resources available in utero and during childhood on the economic health of individuals in adulthood. The study assembled data linking family background in early childhood to adult health and economic outcomes. The study concluded that the Food Stamp Program has effects decades after initial exposure. Specifically, access to food stamps in childhood leads to a significant reduction in the incidence of metabolic syndrome (obesity, high blood pressure, and diabetes) and, for women, an increase in economic self-sufficiency. Overall, the results suggest substantial internal and external benefits of SNAP.
a. Identify the population that is of interest to the researchers.
b. Describe the sample.
c. What characteristics of the population are of interest to the researchers?
d. If the sample measurements are used to make inferences about the population characteristics, why is a measure of reliability of the inferences important?
Med. 1.4 Of all sports, football accounts for the highest incidence of concussion in the United States due to the large number of athletes participating and the nature of the sport. While there is general agreement that concussion incidence can be reduced by making rule changes and teaching proper tackling technique, there remains debate as to whether helmet design may also reduce the incidence of concussion. Rowson et al. (2014) report on a retrospective analysis of head impact data collected between 2005 and 2010 from eight collegiate football teams. Concussion rates for players wearing two types of helmets, Riddell VSR4 and Riddell Revolution, were compared. A total of 1,281,444 head impacts were recorded, from which 64 concussions were diagnosed. The relative risk of sustaining a concussion in a Revolution helmet compared with a VSR4 helmet was 46.1%. This study illustrates that differences in the ability to reduce concussion risk exist between helmet models in football. Although helmet design may never prevent all concussions from occurring in football, evidence illustrates that it can reduce the incidence of this injury.
a. Identify the population that is of interest to the researchers.
b. Describe the sample.
c. What characteristics of the population are of interest to the researchers?
d. If the sample measurements are used to make inferences about the population characteristics, why is a measure of reliability of the inferences important?
Pol. Sci. 1.5 During the 2004 senatorial campaign in a large southwestern state, illegal immigration was a major issue. One of the candidates argued that illegal immigrants made use of educational and social services without having to pay property taxes. The other candidate pointed out that the cost of new homes in their state was 20–30% less than the national average due to the low wages received by the large number of illegal immigrants working on new home construction. A random sample of 5,500 registered voters was asked the question, “Are illegal immigrants generally a benefit or a liability to the state’s economy?” The results were as follows: 3,500 people responded “liability,” 1,500 people responded “benefit,” and 500 people responded “uncertain.”
a. What is the population of interest?
b. What is the population from which the sample was selected?
c. Does the sample adequately represent the population?
d. If a second random sample of 5,000 registered voters was selected, would the results be nearly the same as the results obtained from the initial sample of 5,500 voters? Explain your answer.
Edu. 1.6 An American history professor at a major university was interested in knowing the history literacy of college freshmen. In particular, he wanted to find what proportion of college freshmen at the university knew which country controlled the original 13 colonies prior to the American Revolution. The professor sent a questionnaire to all freshman students enrolled in HIST 101 and received responses from 318 students out of the 7,500 students who were sent the questionnaire. One of the questions was “What country controlled the original 13 colonies prior to the American Revolution?”
a. What is the population of interest to the professor?
b. What is the sampled population?
c. Is there a major difference in the two populations? Explain your answer.
d. Suppose that several lectures on the American Revolution had been given in HIST 101 prior to the students receiving the questionnaire. What possible source of bias has the professor introduced into the study relative to the population of interest?
PART 2 Collecting Data
Chapter 2 Using Surveys and Experimental Studies to Gather Data
CHAPTER 2 Using Surveys and Experimental Studies to Gather Data
2.1 Introduction and Abstract of Research Study
2.2 Observational Studies
2.3 Sampling Designs for Surveys
2.4 Experimental Studies
2.5 Designs for Experimental Studies
2.6 Research Study: Exit Polls Versus Election Results
2.7 Summary
2.8 Exercises
2.1 Introduction and Abstract of Research Study
As mentioned in Chapter 1, the first step in Learning from Data is to define the problem. The design of the data collection process is the crucial step in intelligent data gathering. The process takes a conscious, concerted effort focused on the following steps:
● Specifying the objective of the study, survey, or experiment
● Identifying the variable(s) of interest
● Choosing an appropriate design for the survey or experimental study
● Collecting the data
To specify the objective of the study, you must understand the problem being addressed. For example, the transportation department in a large city wants to assess the public’s perception of the city’s bus system in order to increase the use of buses within the city. Thus, the department needs to determine what aspects of the bus system determine whether or not a person will ride the bus. The objective of the study is to identify factors that the transportation department can alter to increase the number of people using the bus system.
To identify the variables of interest, you must examine the objective of the study. For the bus system,...
