Mini-Project 2 (due W 3/24 by 11:59pm)
Instructions:
You may work with other students to complete the assignment, but each student needs to turn in
his or her own work (e.g., it is ok to discuss how to write the code to solve a question, but it is not
ok to copy and paste each other’s code). Show all work to receive full credit. For questions that ask
you to calculate something, I need to see the full calculation, not just the answer. Note that the
“calculation” will often (almost always) be done using code, so I would need to see the code that
leads up to and produces the final answer. Similarly, for questions that require software to perform
an analysis or generate a plot, I need to see the R code that produced the results. You may attach
relevant R code as an Appendix at the end of the assignment, or include the code as part of your
answer to the question that the code supports.
You should upload your assignment solution as a single pdf to the “Assignments” section of our
course in Canvas. Click on the name of the assignment, then click the “Submit Assignment” button,
then upload the file containing your solution, then click “Submit Assignment” a final time. The
filename should be in the format LastName_FirstName_ProjectNumber.pdf. For example, if I were
submitting the assignment, I would name it poythress_jc_proj2.pdf.
Questions:
1. We will analyze a dataset containing information about the number of FEMA buyouts of
flood-prone properties from 1989 to 2016, which can be downloaded from Canvas in the file
fema_buyouts.csv. The dataset contains information at both the county and state level, but
we will focus on the county-level data. The response variable of interest is NumBuyouts. Of
interest is how certain socioeconomic and demographic factors are associated with the number
of buyouts. The covariates of interest are:
ALAND: land area in m2
AWATER: water area in m2
FDD: number of federal disaster declarations
FDD_IA: number of federal disaster declarations with individual assistance
CountyIncome: average household income
CountyEducation: proportion with high school education
CountyRace: proportion white
CountyPopulation: total population
CountyPopDens: population density
CountyLanguage: proportion proficient in English
(a) Make a scatterplot matrix of the covariates FDD–CountyLanguage. Do any of the covariates
appear correlated with one another? If so, which ones, and what effect might correlation
among the covariates have on the analysis? Do any observations have relatively large values
of a particular covariate? Should we consider transforming those covariates? If yes, why?
And which transformations should we consider?
Also make scatterplots of ln(NumBuyouts+1) vs. ln(ALAND) and ln(NumBuyouts+1) vs.
ln(AWATER). We might argue that either ALAND or AWATER should be treated as an exposure
variable, since either could serve as a proxy for the number of properties at risk of flooding.
Does the relationship between NumBuyouts and ALAND or AWATER suggest that either should
be treated as an exposure variable?
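A minimal R sketch of the plots for this part, assuming fema_buyouts.csv has been downloaded and its column names match the list above:

```r
# Sketch for part (a); column names are assumed to match the list above
fema <- read.csv("fema_buyouts.csv")

# Scatterplot matrix of the covariates FDD through CountyLanguage
pairs(fema[, c("FDD", "FDD_IA", "CountyIncome", "CountyEducation", "CountyRace",
               "CountyPopulation", "CountyPopDens", "CountyLanguage")])

# ln(NumBuyouts + 1) vs. ln(ALAND) and vs. ln(AWATER)
# (log(AWATER + 1) may be needed if any counties have zero water area)
plot(log(fema$ALAND), log(fema$NumBuyouts + 1),
     xlab = "ln(ALAND)", ylab = "ln(NumBuyouts + 1)")
plot(log(fema$AWATER), log(fema$NumBuyouts + 1),
     xlab = "ln(AWATER)", ylab = "ln(NumBuyouts + 1)")
```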
(b) For now, don’t treat ALAND or AWATER as an exposure variable (i.e., don’t use an offset).
Assume the response NumBuyouts is a Poisson random variable and fit a Poisson regression
model. Use all of the covariates ALAND, AWATER, FDD–CountyLanguage and any transformations of the covariates you wish to perform model selection to find the best model possible
for the response. You may also construct new variables from combinations of two or more
variables [e.g., AWATER/(AWATER+ALAND) would represent the proportion of the county that is
water, which may be a relevant covariate]. You may use whichever model selection algorithm
and criterion for the “best model” you prefer, but you should justify your choices.
Does the final model you selected appear to fit the data well? If your final model happened
to include either ALAND or AWATER or some function of one of them, does it suggest one or the
other should be treated as an exposure variable?
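One way this part might be started; the formula, transformations, and constructed PropWater variable below are illustrative choices, not the required model:

```r
# Illustrative starting point for part (b), not the "correct" model
fema <- read.csv("fema_buyouts.csv")
fema$PropWater <- fema$AWATER / (fema$AWATER + fema$ALAND)  # example constructed variable

full <- glm(NumBuyouts ~ log(ALAND) + log(AWATER + 1) + FDD + FDD_IA +
              CountyIncome + CountyEducation + CountyRace + log(CountyPopulation) +
              CountyPopDens + CountyLanguage + PropWater,
            family = poisson, data = fema)

reduced <- step(full, direction = "both")  # AIC-based selection, one possible choice
summary(reduced)

# A rough lack-of-fit check: residual deviance against its degrees of freedom
pchisq(deviance(reduced), df.residual(reduced), lower.tail = FALSE)
```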
(c) Make a histogram of NumBuyouts. Are there any unusual features apparent in the histogram
(where “unusual” is in the context of assuming the counts follow a Poisson distribution)? How
might the distribution of the counts be related to the lack-of-fit you may have encountered
for the models you fit in the previous part? [Hint: You may want to make a custom set
of breaks in the histogram, because the features may be difficult to see using the default
breakpoints.]
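A sketch of one custom choice of breaks; one-unit-wide bins tend to make a feature near zero visible when the default breakpoints hide it (the bin width here is an assumption, so adjust as needed):

```r
fema <- read.csv("fema_buyouts.csv")

# One-unit-wide bins centered on the integers; the default breaks may
# obscure features such as a spike at zero
hist(fema$NumBuyouts,
     breaks = seq(-0.5, max(fema$NumBuyouts) + 0.5, by = 1),
     xlab = "NumBuyouts", main = "Histogram of NumBuyouts")
```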
(d) Use the zeroinfl function from the pscl package to fit a zero-inflated Poisson (ZIP) model.
As in part (b), fit a model with many/all of the covariates and transformations of the covariates, then perform model selection to find a reduced model that includes some subset of
those covariates. The model you select for the count part of the model need not include the
same covariates as the model you select for the zero-inflated part of the model.
Some hints:
Refer to Faraway’s example of fitting a ZIP model in R.
If you fit a model, look at the summary, and see NAs for the SEs, Z-values, and P-values,
try standardizing the covariates first.
The step function appears to work for the object returned by zeroinfl, but only for
the count part of the model. You might consider removing covariates “by hand” and
using LRTs to justify their removal for the zero-inflated part of the model (like Faraway
did).
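A sketch of the workflow these hints describe, with placeholder covariates on either side of the | separator; the formulas are not the model you should end up with:

```r
library(pscl)

fema <- read.csv("fema_buyouts.csv")

# Standardizing the covariates can fix NAs in the SEs, Z-values, and P-values
covs <- c("FDD", "FDD_IA", "CountyIncome", "CountyEducation", "CountyRace",
          "CountyPopulation", "CountyPopDens", "CountyLanguage")
fema[covs] <- scale(fema[covs])

# count model | zero-inflation model; the two sides need not match
zip_fit <- zeroinfl(NumBuyouts ~ FDD + FDD_IA + CountyIncome + CountyPopDens |
                      FDD + CountyIncome,
                    data = fema)
summary(zip_fit)

# Removing a covariate from the zero part "by hand" and justifying it with a LRT
zip_small <- zeroinfl(NumBuyouts ~ FDD + FDD_IA + CountyIncome + CountyPopDens |
                        FDD,
                      data = fema)
lrt_stat <- 2 * (logLik(zip_fit) - logLik(zip_small))
pchisq(as.numeric(lrt_stat), df = 1, lower.tail = FALSE)
```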
(e) Even though the ZIP model accounts for lack-of-fit due to excess zeros, it’s still possible
that the counts are overdispersed (or that the mean structure of the model is misspecified,
even after model selection). We should somehow check for lack-of-fit before interpreting
the model or drawing any conclusions. The zeroinfl function can also fit a zero-inflated
negative binomial (ZINB) model by changing the dist argument.
Fit a ZINB model with the same set of covariates in the count and zero-inflated parts of
the model that were in the final model you selected in the previous part. Presumably most
or all of the covariates you selected in your final ZIP model were significant. Are they still
significant in the ZINB model?
We discussed comparing the Poisson vs. NB models through a LRT of the overdispersion
parameter. We could do that for the zero-inflated versions of the models as well if we could
be confident that the asymptotic distribution of the LRT statistic under the null is the
same in the regular and zero-inflated versions of the models. However, I am unsure whether
or not that is the case. Alternatively, we could use AIC to compare the ZIP and ZINB
models. Unfortunately, we would need to know 1) that no constants have been left off the log-likelihood, 2) that the AIC function counts the number of parameters of zeroinfl objects properly, and 3) that the zeroinfl function uses the MLE for the estimate of the dispersion parameter. Again, I am unsure about each of those things. However, the summary of the ZINB
model includes a Wald test for log(theta) (where theta is the overdispersion parameter),
which may not be ideal, but is at least something we can use to determine whether or not
there is evidence of overdispersion. Based on the summary of the fitted ZINB model, is there
evidence for overdispersion in the counts? That is, should we prefer the ZIP model or the
ZINB model for the purpose of drawing conclusions from the data?
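Fitting the ZINB version requires only the dist argument to change; a sketch with placeholder covariates standing in for your final ZIP model:

```r
library(pscl)

fema <- read.csv("fema_buyouts.csv")
covs <- c("FDD", "FDD_IA", "CountyIncome", "CountyPopDens")
fema[covs] <- scale(fema[covs])

# Same (placeholder) covariates in both parts as the final ZIP model;
# only the dist argument changes
zinb_fit <- zeroinfl(NumBuyouts ~ FDD + FDD_IA + CountyIncome + CountyPopDens |
                       FDD + CountyIncome,
                     dist = "negbin", data = fema)
summary(zinb_fit)  # the Wald test for Log(theta) appears in this summary
```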
(f) Suppose we decide that we want to base our conclusions on the ZINB model. If there
are covariates included in the ZINB model fit in the previous part that are not significant,
we should perform model selection once again before interpreting the model. Use whichever
procedure and criterion you prefer to perform model selection on the ZINB fit in the previous
part. Just make sure to clearly state why the criterion justifies removing a covariate from
the model, should you choose to remove any.
(g) Interpret the effect of the covariates in the final model you selected in the previous part. In
particular, how are the covariates included in the zero-inflated part of the model associated
with the odds that a county had no FEMA property buyouts from 1989 to 2016? How are
the covariates included in the count part of the model associated with the mean number of
FEMA property buyouts, among counties that had at least one FEMA buyout? Does the
association between each covariate and the number of FEMA buyouts match your intuition?
[Hint: If your final model includes standardized versions of the covariates, it’s OK to interpret
the covariate effects in loose terms. For example, if Income was included in the model, is an
increase in a county’s average income associated with an increase or decrease in the number
of FEMA buyouts? You don’t need to phrase the interpretation as “for a 1-unit increase in
standardized income, the mean number of FEMA buyouts increases/decreases by a factor
of ….” A precise quantitative interpretation of the effect of a covariate in anything but its
original units makes things messy and complicated and difficult to understand.]
2. Do Exercise 2 on page 126 of Faraway.
3. We will analyze a dataset about chronic respiratory disease, which can be downloaded from Canvas in the file respire.dat. The dataset has column names, so use the header=TRUE argument
when you read it into R.
The dataset has information on three covariates:
air: air pollution (low or high)
exposure: job exposure (yes or no)
smoking: non-smoker, ex-smoker, or current smoker
The goal is to analyze the covariates’ relationships with the response – the counts of individuals
falling into four chronic respiratory disease categories:
Level 1: no symptoms
Level 2: cough or phlegm < 3 months/year
Level 3: cough or phlegm ≥ 3 months/year
Level 4: cough or phlegm ≥ 3 months/year + shortness of breath
Thus, we have an ordinal multinomial response. Furthermore, we could argue that the categories
arise from a sequential mechanism, so that a continuation ratio logit model would be reasonable.
(a) After reading the data into R, take a look at the dataset to see how it is structured. Now use
the vglm function from the VGAM package to fit parallel and non-parallel versions of the cumulative logit, adjacent category logit, and continuation ratio logit models. Use main effects for
air, exposure, and smoking as the covariates in the models. For each type of model, use a
LRT to determine whether the non-parallel version fits better than the parallel version. Is the
preferred version of the model (parallel vs. non-parallel) consistent across the three different
types of models? How many more parameters do the non-parallel versions of the models have
vs. the parallel versions? [Hint: The model type can be changed via the family argument,
where the names are cumulative, acat, and cratio. The parallel argument controls
which version of the model is fitted. So, for example, the non-parallel version of the continuation ratio logit model would be specified with the argument family=cratio(parallel=F).
Note that we could also change the link argument, but logit is the default for all three
types, so we don’t need to adjust it.]
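A sketch of the fits described in the hint; the response column names (level1–level4) are assumptions about how respire.dat is structured, so adjust them to match the actual file:

```r
library(VGAM)

respire <- read.table("respire.dat", header = TRUE)

# level1-level4 are assumed names for the four count columns; check the file.
# Swap cratio() for cumulative() or acat() to fit the other two model types.
fit_par  <- vglm(cbind(level1, level2, level3, level4) ~ air + exposure + smoking,
                 family = cratio(parallel = TRUE), data = respire)
fit_npar <- vglm(cbind(level1, level2, level3, level4) ~ air + exposure + smoking,
                 family = cratio(parallel = FALSE), data = respire)

lrtest(fit_npar, fit_par)  # LRT of non-parallel vs. parallel
```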
(b) For each type of model (cumulative, adjacent category, and continuation ratio), choose
either the parallel version or non-parallel version to proceed with, based on the LRTs from
the previous part. Use the drop1 function to perform LRTs to determine whether any
covariates can be removed from the models. If yes, refit the models with the covariate(s)
removed.
In your final models (there should be one for each of the three types), is there evidence for
lack-of-fit? Report the values of the statistics on which you base your conclusions regarding
lack-of-fit. Why is the statistic you chose an appropriate measure of goodness-of-fit?
(c) For each of the final models chosen in the previous part, interpret the effects of the covariates
on chronic respiratory disease. Note that each type of model involves a subtly different
function of the probabilities of the respiratory disease levels, and your interpretations should
reflect that. [Hint: If you look at the summary of the fitted model, besides the reference
levels for the covariates, it also tells you which linear predictors are being modelled. Not
only should this help you interpret the effects, but you can also reverse the direction of the
probabilities by refitting the model with the reverse=T argument if that leads to a more
convenient way to interpret the effects.]
FYI, if the models you chose were the non-parallel slopes versions, the interpretations can get
quite messy and tedious. In other words, something like “current smokers have a higher odds
of more severe respiratory disease” would be too simplistic; it takes into account neither the
non-parallel slopes nor the choice of model for the probabilities. So I am expecting the detail
and nuance of the interpretation to be commensurate with the complexity of the model.
(d) Suppose we wanted to pick just one type of model among the cumulative logit, adjacent
category logit, and continuation ratio logit. How would you compare the three final models
from the previous two parts? LRTs? AIC or BIC? Log-likelihood? Some other method?
Choose a model comparison method and determine which model is preferred among the
three final models you selected in the previous two parts. Justify your choice of model
comparison method by explaining why it is appropriate (and why some of the other choices
are not appropriate).
Mini-Project 1 (due F 2/26 by 11:59pm)
Instructions:
You may work with other students to complete the assignment, but each student needs to turn in
his or her own work (e.g., it is ok to discuss how to write the code to solve a question, but it is not
ok to copy and paste each other’s code). Show all work to receive full credit. For questions that ask
you to calculate something, I need to see the full calculation, not just the answer. Note that the
“calculation” will often (almost always) be done using code, so I would need to see the code that
leads up to and produces the final answer. Similarly, for questions that require software to perform
an analysis or generate a plot, I need to see the R code that produced the results. You may attach
relevant R code as an Appendix at the end of the assignment, or include the code as part of your
answer to the question that the code supports.
You should upload your assignment solution as a single pdf to the “Assignments” section of our
course in Canvas. Click on the name of the assignment, then click the “Submit Assignment” button,
then upload the file containing your solution, then click “Submit Assignment” a final time. The
filename should be in the format LastName_FirstName_ProjectNumber.pdf. For example, if I were
submitting the assignment, I would name it poythress_jc_proj1.pdf.
Questions:
1. This question is adapted from Exercise 1 on page 46 of Faraway. We will analyze the wbca data
from the faraway package. After loading the faraway library, type ?wbca to see descriptions of
the study, the response variable and covariates, and the goals of the analysis.
(a) First we will examine the associations between the individual covariates and the response
(Class, where 0 is a malignant tumor and 1 is benign). From the perspective of interpretation, it makes more sense to model the probability of a malignant tumor rather than the
probability of a benign tumor, so create a new response variable called y where y=1 if malignant, and y=0 if benign. For each covariate, plot the sample proportions p̂ vs. the covariate
values (there should be 9 plots, since there are 9 covariates). Does the probability of a malignant tumor appear to be associated with any of the covariates? If so, for which covariates
does the association appear strongest? [Hint: When looking at the univariate associations
between the response and individual covariates, there are only 10 unique covariate classes,
because each covariate can only take integer values between 1 and 10. Thus, we can summarize the binary response as counts of 1 vs. 0 for each unique value of the covariate, and use
the counts to calculate the p̂’s. There are many ways to do this, but table(y,covariate) is
perhaps the easiest object to work with. Note that logistic regression models with multiple
covariates will have more and more unique covariate classes as the number of covariates
increases, so it makes less sense to calculate and plot the p̂’s, since most will be 0 or 1.]
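A sketch of the hint’s table(y, covariate) approach for a single covariate (Adhes is used as the example; repeat for the other eight):

```r
library(faraway)
data(wbca)

y <- ifelse(wbca$Class == 0, 1, 0)  # y = 1 for malignant (Class == 0)

# p-hat at each unique covariate value, shown here for Adhes
tab  <- table(y, wbca$Adhes)        # rows: y = 0, 1; columns: covariate values
phat <- tab["1", ] / colSums(tab)   # proportion malignant at each value
plot(as.numeric(colnames(tab)), phat, xlab = "Adhes", ylab = "phat")
```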
(b) Fit 9 different logistic regression models – one model describing the relationship between
the response and each individual covariate in the dataset. Use the test statistic or P-value corresponding to each covariate (e.g., from the Wald tests shown in the summary of the
fitted model) to order the covariates from least to most strongly associated with the response
(where a larger test statistic or smaller P-value indicates stronger association). Which three
covariates are most strongly associated with the probability of a tumor being malignant?
Is the ordering by the strength of the association consistent with what you determined by
plotting the p̂’s in the previous part?
(c) Make a scatterplot matrix of all of the covariates. Are any of the covariates correlated among
themselves? If we were to fit a logistic regression model with multiple covariates and compare
to the models with a single covariate, how might correlation among the covariates affect the
results? For example, would the covariate most strongly associated with the response on its
own be most strongly associated with the response in a model that includes all 9 covariates?
Is it possible that a covariate strongly associated with the response on its own has little to
no association with the response when other covariates are included in the model? If the
relationship between a covariate and the response changes depending on whether and which
other covariates are included in the model, how do you explain that phenomenon?
(d) Fit a logistic regression model that includes all 9 covariates. Use the step function to select
the best model according to AIC (using the default algorithm given by direction='both').
Use the summary function to display the fitted model. Which covariates are included in that
model? Is the strength and nature of the associations with the response consistent with the
individual associations determined in part (b)?
(e) Repeat the previous part, but use BIC to select the model. Do AIC and BIC agree on what
is the best model? If not, describe how the selected models differ. Mathematically, why
might BIC select a different model than AIC? [Hint: To use BIC instead of AIC, type ?step
and read what the k argument does. The documentation should also give you a hint as
to why AIC and BIC sometimes select different models, but you may want to google and
read some external sources of information, since the information in R isn’t very detailed and
Faraway discusses AIC, but not BIC (or at least, not until much later in the book).]
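A sketch of both selection runs; per the ?step documentation, the only difference is the k argument:

```r
library(faraway)
data(wbca)
wbca$y <- ifelse(wbca$Class == 0, 1, 0)

full <- glm(y ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick +
              UShap + USize, family = binomial, data = wbca)

aic_fit <- step(full, direction = "both")                       # k = 2 gives AIC
bic_fit <- step(full, direction = "both", k = log(nrow(wbca)))  # k = log(n) gives BIC
```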
(f) Create ROC curves for the models selected by AIC and BIC. Would you say the models have
low or high explanatory power? Does the explanatory power of the model selected by AIC
differ substantially from that of the model selected by BIC? [Hint: You can follow Faraway’s
example to create the curve. However, you may find it easier to install the LogisticDx
package and take a look at what the gof function does.]
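An ROC curve can also be traced by hand over a grid of thresholds, in the spirit of Faraway’s example; the model below is a placeholder, not the AIC- or BIC-selected one:

```r
library(faraway)
data(wbca)
wbca$y <- ifelse(wbca$Class == 0, 1, 0)

# Placeholder model; substitute the AIC- and BIC-selected fits here
fit  <- glm(y ~ BNucl + Thick + UShap, family = binomial, data = wbca)
phat <- predict(fit, type = "response")

# Trace sensitivity and specificity over a grid of classification thresholds
thresh <- seq(0.01, 0.99, by = 0.01)
sens <- spec <- numeric(length(thresh))
for (i in seq_along(thresh)) {
  yhat    <- as.numeric(phat > thresh[i])
  sens[i] <- mean(yhat[wbca$y == 1] == 1)  # true positive rate
  spec[i] <- mean(yhat[wbca$y == 0] == 0)  # true negative rate
}
plot(1 - spec, sens, type = "l", xlab = "1 - specificity", ylab = "sensitivity")
```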
2. This question is adapted from Exercise 2 on page 80 of Faraway. We will analyze the aflatoxin
data from the faraway package. Type ?aflatoxin to see descriptions of the study, the response
variable and covariates, and the goals of the analysis.
(a) For all parts of this question, we will model the probability of a tumor (so Y = 1 means
that a tumor is present, Y = 0 means that no tumor is present). Plot the sample logits vs.
dose, ln(dose+1), and sqrt(dose). Do any of the relationships appear linear? If so, which
is most closely linear? [Hint: Refer to the ED/LD activity for the definition of the sample
logit.]
(b) Fit three logistic regression models: one with dose as the covariate, one with ln(dose+1)
as the covariate, and one with sqrt(dose) as the covariate. Note: In R, the default choice
of base in the log function is e, so log actually takes the natural logarithm.
Plot the observed sample proportions (i.e., the p̂’s) vs. dose. Overlay the estimated regression functions from each of the three models on the scatterplot. To what extent do
the estimated regression functions differ? From which model does the estimated regression
function seem to best fit the data? [Hint: The estimated regression functions are just the
model-estimated p̂’s, calculated for a sequence of values of dose over a suitable range, which
can be obtained using the predict function (with appropriate type argument) or the ilogit
function.]
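A sketch of the three fits and the overlaid curves, using predict with the appropriate type argument as the hint suggests:

```r
library(faraway)
data(aflatoxin)

m1 <- glm(cbind(tumor, total - tumor) ~ dose,          family = binomial, data = aflatoxin)
m2 <- glm(cbind(tumor, total - tumor) ~ log(dose + 1), family = binomial, data = aflatoxin)
m3 <- glm(cbind(tumor, total - tumor) ~ sqrt(dose),    family = binomial, data = aflatoxin)

# Observed sample proportions with the three estimated regression functions
plot(aflatoxin$dose, aflatoxin$tumor / aflatoxin$total,
     xlab = "dose", ylab = "proportion with tumor")
grid_d <- data.frame(dose = seq(0, max(aflatoxin$dose), length.out = 200))
lines(grid_d$dose, predict(m1, grid_d, type = "response"), lty = 1)
lines(grid_d$dose, predict(m2, grid_d, type = "response"), lty = 2)
lines(grid_d$dose, predict(m3, grid_d, type = "response"), lty = 3)
legend("topleft", legend = c("dose", "ln(dose+1)", "sqrt(dose)"), lty = 1:3)
```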
(c) Choose an appropriate goodness-of-fit test for the three models fit in the previous part. Is
there evidence of lack-of-fit for any of the models? If so, which ones? Make sure to report
any test statistics and P-values associated with the GOF test you choose.
(d) For the remaining parts, we will focus on the model using sqrt(dose) as the covariate.
Recall that the complementary log-log link ln(− ln(1 − p)) arises by assuming a Gumbel distribution for a latent variable. Fit the model using the complementary log-log link instead of
the default logit link.
Also recall that a different version of the Gumbel distribution leads to the link − ln(− ln(p)).
If you type ?family, you will see that the binomial family has an option for the complementary log-log link, but not the − ln(− ln(p)) version. We can create a custom link for
cases in which none of the links that are built into the glm function match the link we would
like to use. Create a link for − ln(− ln(p)) and fit the model using that link. [Hint: This
stackoverflow discussion has a nice example of how to create a custom link:
https://stackoverflow.com/questions/15931403/modify-glm-function-to-adopt-user-specified-link-function-in-r.]
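A sketch of what such a custom link object might look like, following the structure described in the linked discussion; the derivative in mu.eta should be checked against your own derivation:

```r
library(faraway)
data(aflatoxin)

# Custom -ln(-ln(p)) link as a "link-glm" object
loglog <- structure(list(
  linkfun  = function(mu)  -log(-log(mu)),              # eta = -ln(-ln(p))
  linkinv  = function(eta) exp(-exp(-eta)),             # p = exp(-exp(-eta))
  mu.eta   = function(eta) exp(-eta) * exp(-exp(-eta)), # dp/deta
  valideta = function(eta) TRUE,
  name     = "loglog"
), class = "link-glm")

# Complementary log-log (built in) vs. the custom -ln(-ln(p)) link
fit_cll    <- glm(cbind(tumor, total - tumor) ~ sqrt(dose),
                  family = binomial(link = "cloglog"), data = aflatoxin)
fit_loglog <- glm(cbind(tumor, total - tumor) ~ sqrt(dose),
                  family = binomial(link = loglog), data = aflatoxin)
```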
Plot the estimated regression functions for the models using the ln(− ln(1−p)), − ln(− ln(p)),
and logit links. Does the choice of link affect the estimated regression function much? Do
the two links derived from different versions of the Gumbel distribution result in similar
estimates of the regression function? Which link seems to result in an estimate of the
regression function that best fits the data?
(e) Both in the Faraway reading and in lecture we discussed using the delta method to derive the
SE of the estimate of the EDp0 or LDp0 . The resulting SE could then be used to construct
a Wald C.I. for the EDp0 or LDp0 . Fieller’s method provides an alternative approach to
constructing the C.I. Like Wald C.I.s based on the delta method, C.I.s constructed using
Fieller’s method are approximate.
Fieller’s method applies to ratios of normally distributed R.V.s. In our application, the R.V.s
are approximately normally distributed; thus, Fieller’s method results in an approximate C.I.
To apply Fieller’s method, recall that the EDp0 (or LDp0) is the value x0 satisfying:

    logit(p0) = β0 + β1·x0
    ⟹ x0 = (logit(p0) − β0) / β1

Now define ψ* = −β̂0 − x0·β̂1. Assuming β̂ = (β̂0, β̂1)ᵀ ∼ N(β, Var[β̂]) (approximately), then

    E[ψ*] = − logit(p0)
    ⟹ E[ψ* + logit(p0)] = 0

Now let ψ = ψ* + logit(p0). Then E[ψ] = 0 and

    Var[ψ] = Var[ψ*] = (1  x0) Var[β̂] (1  x0)ᵀ.

Thus, we have that:

    ψ / sqrt(Var[ψ]) ∼ N(0, 1) (approximately)
    ⟹ P( |ψ| / sqrt(Var[ψ]) ≤ z_{1−α/2} ) = 1 − α
    ⟹ P( ψ² − z²_{1−α/2}·Var[ψ] ≤ 0 ) = 1 − α

If you expand ψ² − z²_{1−α/2}·Var[ψ] so that it is written in terms of β̂0, β̂1, logit(p0), x0, and
the elements of the matrix Var[β̂], then you will find that it is a quadratic function of x0.
Thus, you can set the function equal to zero and solve for x0 to obtain the endpoints of the
C.I. for x0.
Use Fieller’s method to find 95% C.I.s for the ED50 and ED90. Use the model fitted with
the logit link and sqrt(dose) as the covariate. How does the width of the C.I. for the ED50
compare to the width of the interval for the ED90? Can you explain why the widths might
differ substantially?
Some hints:
You can use the quadratic formula x0 = (−b ± sqrt(b² − 4ac)) / (2a) to solve the quadratic
function of x0 for x0 after you’ve identified what a, b, and c are.
To get you started on identifying a, b, and c, consider expanding Var[ψ]. Let

    Var[β̂] = [ v00  v01 ]
              [ v10  v11 ]

denote the elements of the covariance matrix. Then

    Var[ψ] = (1  x0) Var[β̂] (1  x0)ᵀ = v11·x0² + 2·v01·x0 + v00   (since v01 = v10).

Thus, v11 should appear somewhere in a, 2v01 should appear somewhere in b, and v00
should appear somewhere in c. To identify the rest of the terms in a, b, and c, expand
ψ² and multiply Var[ψ] by −z²_{1−α/2}, where z_{1−α/2} is the 1 − α/2 quantile of the standard
normal distribution.
The vcov function applied to a glm object extracts Var[β̂].
Remember that the C.I. is going to be for whatever function of dose was used in the
logistic regression model. So if log(dose), sqrt(dose), etc. was used in the model, you
will need to perform the inverse transformation on the endpoints of the C.I. to get back
to the original dose units.
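A sketch of the computation for the ED50; the quadratic coefficients below are one worked expansion following the hints, so verify them against your own derivation before using them:

```r
library(faraway)
data(aflatoxin)

fit <- glm(cbind(tumor, total - tumor) ~ sqrt(dose), family = binomial,
           data = aflatoxin)

p0 <- 0.5                         # use 0.9 for the ED90
z  <- qnorm(0.975)                # 95% interval
bh <- coef(fit)                   # (beta0_hat, beta1_hat)
V  <- vcov(fit)                   # v00 = V[1,1], v01 = V[1,2], v11 = V[2,2]
g  <- log(p0 / (1 - p0)) - bh[1]  # logit(p0) - beta0_hat

# Coefficients of the quadratic in x0 implied by psi^2 - z^2 Var[psi] = 0,
# obtained by expanding psi = g - x0*beta1_hat and Var[psi] as in the hints
a_q <- bh[2]^2 - z^2 * V[2, 2]
b_q <- -2 * (g * bh[2] + z^2 * V[1, 2])
c_q <- g^2 - z^2 * V[1, 1]

roots <- (-b_q + c(-1, 1) * sqrt(b_q^2 - 4 * a_q * c_q)) / (2 * a_q)
roots^2  # invert sqrt(dose) to report the interval in original dose units
```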
3. Do Exercise 3 on page 80 of Faraway (an analysis of the infert data). For part (a), besides
commenting on the differences between the tables, you should also comment on why we are cross-classifying the number of spontaneous and induced abortions separately for cases and controls
in the first place. That is, what is the purpose of looking at such a table? What are we hoping
to learn? [Hint: The tables for cases and controls will each be 3 × 3. Rather than looking at the
raw counts, it may be easier to see the differences between the tables if you divide by the grand
total of counts for each table (i.e., divide by 83 for the cases table, and divide by 248-83=165
for the controls table); then the cells represent proportions rather than counts.]
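A sketch of the suggested proportion tables (infert ships with base R’s datasets package, so no extra loading is assumed):

```r
# 3x3 tables of spontaneous vs. induced abortions, split by case status
cases    <- with(subset(infert, case == 1), table(spontaneous, induced))
controls <- with(subset(infert, case == 0), table(spontaneous, induced))

cases / sum(cases)        # cell proportions among the cases
controls / sum(controls)  # cell proportions among the controls
```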