library(tidyverse)

nest_data <- read_csv("data/NOMO_Nest_Data.csv") %>%
  filter(hood %in% c("ls", "up"))
nestlingpb_data <- read_csv("data/NestlingPb.csv") %>%
  filter(hood %in% c("lakeshore", "uptown"))
To lead poison a mockingbird
Background
Lead is an important contaminant in urban areas with well-known impacts on human health, but do non-human animals also face risks from lead exposure? Hitt et al. investigated this question by measuring lead levels in soil and the blood of breeding mockingbirds, as well as egg hatching and offspring development (Hitt et al. 2023). We will replicate parts of their analysis here.
Get the data
The data for this study were deposited in the Dryad Data Repository and are available here. Download the full dataset and put it in the appropriate folder of your RStudio project.
Read the northern mockingbird nest data (“NOMO_Nest_Data.csv”) and nestling lead data (“NestlingPb.csv”) into data frames called nest_data and nestlingpb_data, respectively. Filter both data frames to the Uptown and Lakeshore neighborhoods.
Hypothesis testing by randomization
Do mockingbirds in a neighborhood with higher lead concentration have less successful nests?
We’ll investigate this question using randomization hypothesis testing.
Nest success
nest_data contains columns hood and bin.status, representing the neighborhood and binary status (at least one chick fledged or not) for each monitored nest. Visualize how nest success (i.e., binary status) varied by neighborhood.
ggplot(nest_data, aes(hood, fill = factor(bin.status))) +
  geom_bar() +
  scale_fill_brewer("Nest success", palette = "Dark2")
Step 1: state the null and alternative hypotheses
H0: Neighborhood has no effect on nest success
HA: Neighborhood has an effect on nest success
Step 2: calculate the point estimate
1) What is the relevant sample statistic for our hypothesis?
Difference in proportions
2) How would you calculate it?
nest_success <- nest_data %>%
  group_by(hood) %>%
  summarize(prop = sum(bin.status) / n())
point_estimate_success <- nest_success$prop[2] - nest_success$prop[1]
point_estimate_success
Step 3: quantify the uncertainty
Use randomization to simulate the distribution of the sample statistic under the null hypothesis.
null_dist <- replicate(1000, {
  nest_success <- nest_data %>%
    mutate(hood = sample(hood, n())) %>%
    group_by(hood) %>%
    summarize(prop = sum(bin.status) / n())
  point_estimate_success <- nest_success$prop[2] - nest_success$prop[1]
  point_estimate_success
})
ggplot(tibble(null_dist), aes(null_dist)) +
  geom_histogram(bins = 20,
                 color = "cornflowerblue",
                 fill = NA) +
  geom_vline(xintercept = point_estimate_success,
             color = "firebrick")
Step 4: calculate probability of the point estimate under the null
What’s the p-value?
sum(abs(null_dist) > abs(point_estimate_success)) /
  length(null_dist)
Step 5: reject or fail to reject the null
We’re using a threshold of alpha = 0.05 (by convention). Since p > 0.05, we fail to reject the null.
Do we accept the null? NO. NO WE DO NOT.
Nestling blood lead levels
Now it’s your turn. Perform a similar analysis to investigate whether nestling blood Pb levels vary by neighborhood. Use nestlingpb_data for this part, column blood_ug_dl_pbwt.
First, visualize the blood Pb levels by neighborhood. What’s an appropriate type of visualization for this?
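A boxplot with the raw points overlaid is one reasonable choice. Here is a minimal sketch, assuming the tidyverse is loaded as above and using the hood and blood_ug_dl_pbwt columns:

# Sketch: boxplots of nestling blood Pb by neighborhood, with raw points overlaid
ggplot(nestlingpb_data, aes(hood, blood_ug_dl_pbwt)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, alpha = 0.5) +
  labs(x = "Neighborhood", y = "Blood Pb (ug/dL)")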
Step 1: state the null and alternative hypotheses
H0:
HA:
Step 2: calculate the point estimate
1) What is the relevant sample statistic for our hypothesis?
2) How would you calculate it?
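One reasonable sample statistic is the difference in mean blood Pb between the two neighborhoods. Here is a sketch of one way to compute it, mirroring the nest-success example (object names such as pb_means and point_estimate_pb are just illustrative):

# Sketch: difference in mean blood Pb between the two neighborhoods
pb_means <- nestlingpb_data %>%
  group_by(hood) %>%
  summarize(mean_pb = mean(blood_ug_dl_pbwt, na.rm = TRUE))
point_estimate_pb <- pb_means$mean_pb[2] - pb_means$mean_pb[1]
point_estimate_pb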
Step 3: quantify the uncertainty
Use randomization to simulate the distribution of the sample statistic under the null hypothesis.
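One possible approach, sketched below, shuffles the neighborhood labels exactly as in the nest-success example (this assumes the point_estimate_pb object from the sketch above):

# Sketch: shuffle neighborhood labels to simulate the null distribution
# of the difference in mean blood Pb
null_dist_pb <- replicate(1000, {
  pb_means <- nestlingpb_data %>%
    mutate(hood = sample(hood, n())) %>%
    group_by(hood) %>%
    summarize(mean_pb = mean(blood_ug_dl_pbwt, na.rm = TRUE))
  pb_means$mean_pb[2] - pb_means$mean_pb[1]
})

ggplot(tibble(null_dist_pb), aes(null_dist_pb)) +
  geom_histogram(bins = 20, color = "cornflowerblue", fill = NA) +
  geom_vline(xintercept = point_estimate_pb, color = "firebrick")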
Step 4: calculate probability of the point estimate under the null
What’s the p-value?
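Following the earlier example, a two-sided p-value could be computed like this (assuming the null_dist_pb and point_estimate_pb objects from the sketches above):

# Sketch: two-sided p-value from the randomized null distribution
sum(abs(null_dist_pb) > abs(point_estimate_pb)) / length(null_dist_pb)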
Step 5: reject or fail to reject the null
Hypothesis testing by normal approximation
Rather than using randomization to simulate the null distribution, it’s often easier to approximate it as a normal distribution.
Does nestling blood Pb level have a relationship with feather Pb level?
First, visualize the relationship between the two variables.
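A scatterplot with a fitted line is one natural choice. A sketch follows; note that feather_pb is a placeholder, so substitute the actual name of the feather Pb column in nestlingpb_data:

# Sketch: feather Pb vs. blood Pb with a linear fit
# NOTE: feather_pb is a placeholder for the actual feather Pb column name
ggplot(nestlingpb_data, aes(blood_ug_dl_pbwt, feather_pb)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Blood Pb (ug/dL)", y = "Feather Pb")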
Step 1: state the null and alternative hypotheses
H0:
HA:
Step 2: calculate the point estimate
1) What is the relevant sample statistic for our hypothesis?
2) How would you calculate it?
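One option is the slope of a linear regression of feather Pb on blood Pb, fitted with lm(). A sketch (feather_pb is again a placeholder column name, and pb_lm / point_estimate_slope are illustrative object names):

# Sketch: fit a linear model and take the blood Pb slope as the point estimate
pb_lm <- lm(feather_pb ~ blood_ug_dl_pbwt, data = nestlingpb_data)
summary(pb_lm)
point_estimate_slope <- coef(pb_lm)["blood_ug_dl_pbwt"]
point_estimate_slope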
Step 3: quantify the uncertainty
Use the standard error of the regression coefficient to visualize the distribution of the sample statistic under the null hypothesis.
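Here is a sketch of one way to do this, pulling the standard error from the summary of the pb_lm model sketched above and drawing a normal curve centered on zero:

# Sketch: normal approximation of the null distribution, centered on zero
# with standard deviation equal to the coefficient's standard error
se_slope <- summary(pb_lm)$coefficients["blood_ug_dl_pbwt", "Std. Error"]
ggplot(tibble(x = seq(-4 * se_slope, 4 * se_slope, length.out = 200)),
       aes(x, dnorm(x, mean = 0, sd = se_slope))) +
  geom_line() +
  geom_vline(xintercept = point_estimate_slope, color = "firebrick") +
  labs(x = "Slope under the null", y = "Density")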
Step 4: calculate probability of the point estimate under the null
What’s the p-value? Use pnorm() to get the probability of a point estimate at least as extreme as the one observed.
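Using the objects from the sketches above, a two-sided p-value under the normal approximation could look like this:

# Sketch: two-sided p-value under the normal approximation
2 * pnorm(-abs(point_estimate_slope), mean = 0, sd = se_slope)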
Our calculated p-value is much lower than the p-value from the summary of our linear model. In class I told you lm() uses a Student’s t-distribution instead of a normal distribution for calculating the p-value. When sample sizes are large, the normal distribution and t-distribution are virtually identical, but with only 22 complete data points our sample size is relatively small, so the t-distribution’s thicker tails yield a larger p-value. Hence the discrepancy.
Step 5: reject or fail to reject the null
Sensitivity to outliers
Revisit the visualization of the blood and feather Pb levels. One point seems to be an extreme outlier, and it seems to be exerting a strong influence on our model. Repeat the previous analysis with that point removed, then answer the question below.
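As a sketch of one way to approach this (which point counts as the outlier is an assumption here; the filter below drops the row with the largest placeholder feather_pb value, so adjust it to match the point you identified in your plot):

# Sketch: drop the apparent outlier and refit
nestlingpb_no_outlier <- nestlingpb_data %>%
  filter(feather_pb < max(feather_pb, na.rm = TRUE))
pb_lm_no_outlier <- lm(feather_pb ~ blood_ug_dl_pbwt, data = nestlingpb_no_outlier)
summary(pb_lm_no_outlier)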
Question: Of the two analyses (with and without the outlier), which had a lower p-value? Does that make it a better analysis?
Confidence intervals
For the last part of today’s lab we will construct confidence intervals around the point estimate of the blood Pb level coefficient. Here’s the plan:
- Simulate a population of nestlings, with the same relationship between blood and feather Pb levels as the observed sample.
- Draw a new sample of nestlings from the simulated population. Create a confidence interval for the point estimate of the blood Pb level coefficient in this sample.
- Repeat the process 100 times.
- ~95% of our 95% CIs should contain the population parameter.
1. Simulate the population
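A sketch of one way to do this, reusing the pb_lm fit from the earlier sketches; the population size of 10,000, the resampling of observed blood Pb values, and the normally distributed residual noise are all assumptions made for illustration:

# Sketch: simulate a population whose feather Pb depends on blood Pb with the
# same intercept, slope, and residual SD as the model fitted to the sample
set.seed(1)
n_pop <- 10000
observed_blood <- na.omit(nestlingpb_data$blood_ug_dl_pbwt)
pop_blood <- sample(observed_blood, n_pop, replace = TRUE)
pop_feather <- coef(pb_lm)[1] + coef(pb_lm)[2] * pop_blood +
  rnorm(n_pop, sd = sigma(pb_lm))
population <- tibble(blood_ug_dl_pbwt = pop_blood, feather_pb = pop_feather)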
2. Create one confidence interval
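A sketch, drawing a sample of 22 nestlings (to mirror the observed sample size, an assumption) and building a 95% CI from the coefficient estimate and its standard error:

# Sketch: draw one sample from the simulated population and build a 95% CI
# for the blood Pb slope
one_sample <- population %>% slice_sample(n = 22)
sample_lm <- lm(feather_pb ~ blood_ug_dl_pbwt, data = one_sample)
est <- unname(coef(sample_lm)["blood_ug_dl_pbwt"])
se <- summary(sample_lm)$coefficients["blood_ug_dl_pbwt", "Std. Error"]
c(lower = est - 1.96 * se, upper = est + 1.96 * se)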
3. Repeat the process
How many 95% CIs contain the population parameter?
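One way to answer this, sketched below, wraps the previous step in replicate() and counts how many of the 100 intervals contain the slope used to simulate the population:

# Sketch: repeat the sampling and CI construction 100 times, then count how
# many intervals contain the slope used to simulate the population
true_slope <- coef(pb_lm)["blood_ug_dl_pbwt"]
ci_contains <- replicate(100, {
  one_sample <- population %>% slice_sample(n = 22)
  sample_lm <- lm(feather_pb ~ blood_ug_dl_pbwt, data = one_sample)
  est <- coef(sample_lm)["blood_ug_dl_pbwt"]
  se <- summary(sample_lm)$coefficients["blood_ug_dl_pbwt", "Std. Error"]
  est - 1.96 * se < true_slope & true_slope < est + 1.96 * se
})
sum(ci_contains)  # should be roughly 95 of the 100 intervals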
Recap
- Randomization allows us to simulate the null distribution, which we can use to quantify the probability of our result if the null hypothesis is true.
  - sample() and replicate() are helpful here!
- Randomization doesn’t make assumptions about the normality of the sample statistic, but it does assume the sample is representative of the population.
- By assuming the sample statistic is normally distributed, we can use standard errors to conduct hypothesis testing.
  - R will calculate standard errors for us for most sample statistics, such as regression coefficients.
- We can also use standard errors to construct confidence intervals.
  - By simulating a population, we demonstrated ~95% of 95% CIs will contain the population parameter.