suppressMessages(library(tidyverse))
theme_set(theme_bw())
<- read_csv("drinking_water.csv", show_col_types = FALSE) drinking_water
The Color of Drinking Water
Is exposure to environmental hazards influenced by class or race and ethnicity? Switzer and Teodoro (Switzer and Teodoro 2017) argued the either-or framing of that question is misguided, and actually the interaction between class and race/ethnicity is the important question for environmental justice. Today’s lab will use logistic regression and interaction terms to investigate drinking water quality in the context of socio-economic status (SES), race, and ethnicity.
Overview
- Data exploration
- Try (and fail) to use OLS
- Fit a logistic regression model using maximum likelihood
- Fit the full logistic regression model using
glm()
Data exploration
We will use the data from Switzer and Teodoro (2017) for today’s lab. Begin by downloading the CSV file available here. These data come from multiple sources.
Water utility and violations data
Source: Safe Drinking Water Information System (SDWIS)
Time period: 2010-2013 (4 years)
Sample size: 12,972 utilities
Criteria: Local government-owned utilities serving populations of 10,000 or more
Demographic data
Source: US Census Bureau’s American Community Surveys (2010-2013)
Variables included:
Percent Hispanic population
Percent Black population
Percent of population with high school education
Percent of population with bachelor’s degree
Percent of population below poverty line
Median household income
Load the data and begin exploring.
Questions
- What do you think each row represents?
The water utility and demographic data for a single district in a single year. - What columns do you think represent:
- Drinking water health violations?
health
- Percent Black and Hispanic population in the utility district?
pctblack
andpcthisp
, respectively - Median household income in the utility district?
medianincomehouse
- Drinking water health violations?
- One column includes a count of drinking water health violations. How would you create a new column with a binary variable representing whether there were any violations?
$violations <- ifelse(drinking_water$health > 0, 1, 0) drinking_water
- Create a scatter plot of violations against race (percent Black population), ethnicity (percent Hispanic population), and SES (median household income). What visualization issues do we get with a scatter plot? How could you address that?
Points are overplotted, we can’t see the overall trend. One option: bin the data and calculate the mean.
<- drinking_water %>%
violations_by_pctblack mutate(pctblack = round(pctblack / 5) * 5) %>%
group_by(pctblack) %>%
summarize(violations = mean(violations))
ggplot(drinking_water, aes(pctblack, violations)) +
geom_point() +
geom_point(data = violations_by_pctblack, color = "red")
<- drinking_water %>%
violations_by_pcthisp mutate(pcthisp = round(pcthisp / 5) * 5) %>%
group_by(pcthisp) %>%
summarize(violations = mean(violations))
ggplot(drinking_water, aes(pcthisp, violations)) +
geom_point() +
geom_point(data = violations_by_pcthisp, color = "red")
<- drinking_water %>%
violations_by_medianincomehouse mutate(medianincomehouse = round(medianincomehouse / 1e4) * 1e4) %>%
group_by(medianincomehouse) %>%
summarize(violations = mean(violations))
ggplot(drinking_water, aes(medianincomehouse, violations)) +
geom_point() +
geom_point(data = violations_by_medianincomehouse, color = "red")
Try (and fail) to use OLS
Fit the following OLS model.
\[ \text{violations} = \beta_0 + \beta_1\text{percentHispanic} \]
<- lm(violations ~ pcthisp, drinking_water)
pcthisp_lm summary(pcthisp_lm)
Questions
- Plot the residuals. What pattern do you notice?
All the residuals fall along two parallel lines - not normal!
%>%
drinking_water select(violations, pcthisp) %>%
drop_na() %>%
mutate(resid = resid(pcthisp_lm)) %>%
ggplot(aes(pcthisp, resid)) +
geom_point()
- Plot the raw data and the predicted values for
violations
. Are there any obvious problems?
The predicted values fall between 0 and 1, so they could conceivably be interpreted as probabilities. But the raw data are far from the line.
ggplot(drinking_water, aes(pcthisp, violations)) +
geom_point() +
geom_point(data = violations_by_pcthisp, color = "red") +
geom_abline(intercept = coef(pcthisp_lm)[1],
slope = coef(pcthisp_lm)[2])
Fit a logistic regression model using maximum likelihood
Recall that logistic regression uses maximum likelihood, not OLS, to estimate coefficients. Usually we will use functions from R packages to fit these models, but I want you to do it yourself once.
Negative log likelihood
Estimating coefficients is an optimization problem: what combination yields the maximum likelihood? There are two things we can do to make our problem more tractable for optimization.
- Recall that the likelihood function involves a product. Multiplication is computationally costly (compared to addition) and multiplying small numbers is very error prone. We can avoid this problem by working with logarithms. The log of a product is the sum of the logs: \(log(a \times b) = log(a) + log(b)\). Logarithms are monotonically increasing, which means if \(a > b\) then \(log(a) > log(b)\). This useful property means we can maximize the sum of the log likelihoods (which is quick and robust to errors) instead of the product of likelihoods (which is slow and error prone).
- Optimization algorithms typically find the minimum value. They’re intended to look for valleys, not peaks. So instead of maximizing the log likelihood, we can minimize the negative log likelihood.
That seems confusing! Let’s write the code and so we can see what’s happening. Do the following:
- Write a likelihood function for the violations and percent Hispanic model. This should calculate the negative log likelihood of a set of coefficients, conditional on the data.
- Call an optimization function to find the maximum likelihood parameters.
It will help to keep the model formulation handy:
\[ \begin{align} \text{violations} &\sim Bernoulli(p) \\ logit(p) &= \beta_0 + \beta_1 \text{percentHispanic} \end{align} \]
# Inverse logit utility function
<- function(x) exp(x) / (1 + exp(x))
inv_logit
# Likelihood of the coefficients, given the data
<- function(coefficients, data) {
likelihood_fun # Calculate logit(p) based on coefficients and predictor
# Invert the logit to get p
# Use the PMF of the Bernoulli to get our log likelihoods
# Sum the negative log likelihood
}
# Use an optimization function to get the maximum likelihood coefficients
<- drop_na(drinking_water, pcthisp, violations)
drinking_water_complete <- optim(???,
coef_optim
???, data = ???)
Questions
- What were your maximum likelihood estimates for \(\hat\beta_0\) and \(\hat\beta_1\)?
# What are the maximum likelihood estimates for our coefficients?
# Hint: explore coef_optim
- What’s the predicted probability of drinking water violations for communities with 0%, 50%, and 100% Hispanic population?
Plot the predicted probability across the whole range 0-100% Hispanic.
# Create and plot predictions
- How much does the probability of a drinking water violation change when percent Hispanic population increases from 10 to 20%, 45 to 55%, and 80 to 90%?
- How would you interpret the coefficients? What do the slope and intercept mean in this context? Where is the relationship linear, and where is it non-linear?
- Create a “DEM” of the likelihood landscape for \(\beta_0\) and \(\beta_1\). Choose a range of \(\beta_0\) and \(\beta_1\) values around your best estimates, calculate the likelihood for each combination, and create a figure with \(\beta_0\) on the x-axis, \(\beta_1\) on the y-axis, and the likelihood as the fill. Add a point for \((\hat\beta_0, \hat\beta_1)\).
Bonus problem: add contours!
<- expand_grid(
likelihood_dem
???,
???%>%
) mutate(coefficients = mapply(function(b0, b1) c(beta0 = b0, beta1 = b1),
???, ???,SIMPLIFY = FALSE),
negloglik = sapply(coefficients,
???, data = ???),
likelihood = ???)
ggplot(likelihood_dem, aes(???, ???, fill = ???)) +
geom_raster() +
geom_point(x = ???, y = ???, color = "red")
Fit the full logistic regression model using glm()
Normally we won’t need to write our own likelihood functions and use optimization to find our maximum likelihood coefficients. The glm()
function in the built-in stats
package can fit all kinds of generalized linear models, which includes logistic regression.
Here’s how to fit the same model from the previous section. Notice the formula and data arguments look very similar to what we’d use for lm()
. But now we have to specify the response variable’s “family” (i.e., the type of random variable) and the link function we want to use (logit, in our case).
<- glm(violations ~ pcthisp,
pcthisp_glm family = binomial(link = "logit"),
data = drinking_water)
summary(pcthisp_glm)
Questions
- How would you fit a model that includes an interaction term between ethnicity and SES?
<- ???
interaction_glm summary(interaction_glm)
- Create a figure similar to Fig. 1 in Switzer and Teodoro (2017). Put percent Hispanic population on the x-axis, median household income on the y-axis, and make the fill the probability of a water quality violation.
<- expand_grid(
predictions pcthisp = ???,
medianincomehouse = ???
%>%
) mutate(violations = predict(???,
newdata = .,
type = ???))
# Create plot
- Interpret the predicted surface. How does SES influence the relationship between ethnicity and exposure to environmental hazards? What is the “slope” of the probability of a violation w.r.t. percent Hispanic population at low, medium, and high median household income levels?