EDS222 Week 7

This week, we will review key topics from earlier weeks and postpone new material to week 8. As you complete the following exercises, please try to identify your “muddiest” areas. Are there formulas or functions you’re using and can’t articulate why? Are there similar topics that you’re having a hard time differentiating? Make a note of these points as you go - that will help guide your studying and give me a better idea of how to clarify concepts and tools in our final weeks.

Probability

What potential values can a $Normal$ variable take? What about a $Bernoulli$ variable?
1. What’s something in the real world we could represent with a $Normal$ variable?
2. How about a $Bernoulli$ variable?
What’s the difference between a probability density function (PDF) and a probability mass function (PMF)?
1. Consider distributions from the $Normal$ and $Bernoulli$ families. Which has a PDF? Which a PMF?
2. Consider the PDF for a $Normal(\mu=0,\sigma=1)$ variable. What’s the value of the PDF when $x=0$? Does this value represent the probability of producing a 0 from this distribution? Why or why not?
3. Consider the PMF for a $Bernoulli(p=0.5)$ variable. What’s the value of the PMF when $x=0$? Does this value represent the probability of producing a 0 from this distribution? Why or why not?
Draw the PMF for a $Bernoulli(p=0.25)$ variable by hand. Then use ggplot to recreate your drawing.

What’s the probability of observing $\{ 0, 0, 1, 1 \}$ from a $Bernoulli(p = 0.5)$ variable?
1. Would that probability go up or down if the variable was a $Bernoulli(p = 0.99)$?
What’s the probability of observing 0 from a $Bernoulli(p=0.1)$ AND 1 from a $Bernoulli(p=0.9)$?
1. Extending the previous question, what’s the probability of observing $\{0, 0, 0, 1, 0\}$ from a $Bernoulli$ variable if $p$ is 0.1 for the first observation and increases by 0.1 for each subsequent observation?
Describe what each following R function returns.
1. dnorm()
2. pnorm()
3. rnorm()
4. dbinom()
5. rbinom()
Which R function would help you solve problem 4 above? Write the code below.

Link functions

Fill in the blanks below to specify a logistic regression in statistical notation.
\[ \begin{align} y &\sim ??? (???) \\ ???(???) &= ??? + ??? x \end{align} \]
1. What kind of random variable is the response?
2. What parameter does that variable take?
3. What transformation (link function) is applied to the parameter?
4. Which part of the model is linear?
What’s the formula for odds?
What’s the formula for the logit function?
The logistic function is the inverse of the logit function. Without using R or a calculator, what’s the value of $logistic(logit(0))$? How about $logistic(logit(1000))$?
1. What kinds of numbers can the logit function operate on? Conversely, what inputs would yield an undefined result? Why?
2. What about the logistic?
Imagine a “logistic” regression without a link function. I.e., $p$ is linearly related to $x$. Give an example of values for $\beta_0,\beta_1,x$ that would break your regression model. What constraint on $p$ did you break?
In your own words, describe why the logit link function is necessary for logistic regression to work.
Use ggplot to create a logistic curve. Your x-axis should cover the range -5 to 5. Use $\beta_0=0,\beta_1=1$ as your coefficients.
Answer the following questions first using your intuition, then check your answers in code.
1. How will your curve change if $\beta_0$ increases to 3?
2. How will your curve change if you flip the sign of $\beta_1$ to -1?

Likelihood

The PMF of a random discrete variable describes the probability of observing a value given the parameters. What is the likelihood function relative to that?
Is the likelihood function directly interpretable? Or rather, in what context are likelihoods meaningful?
Consider this logistic regression model from the week 6 lab.

\[ \begin{align} \text{healthViolation} &\sim Bernoulli(p) \\ logit(p) &= \beta_0 + \beta_1 \text{pctHisp} \end{align} \]

Let’s say you observe the following data.

pctHisp healthViolation

0.00 1

0.15 0

0.30 0

0.45 1

0.60 1

0.75 0

0.90 1
1. Visualize these data using ggplot.
2. Let $\beta_0=0, \beta_1=1$.
  1. Calculate $logit(p)$ across the range of $\text{pctHisp}$ values.
  2. Calculate $p$.
  3. Add $p$ to your ggplot as a curve. Make it red.
3. Let $\beta_0=-2, \beta_1=0.5$.
  1. Repeat the calculations above and add a curve for $p$, but make it blue.
4. Intuitively, which curve do you think is more likely?
Now you’re going to write a likelihood function for the above logistic model.
1. What inputs does this function need?
2. What value will it return?
Consider the following partially written code chunk.
```
inv_logit <- function(x) {
  ____
}
likelihood_fun <- function(coefs, data) {
  logit_p <- ____
  p <- inv_logit(____)
  loglik <- ____(____, size = 1, prob = ____, ____ = TRUE)
  -____(____)
}
```
1. What is the length of the vector coefs (i.e. the first parameter of the likelihood function)? What should the names of the vector be ?
2. If data is a data frame, what two columns does it need to have?
3. Why is size = 1 important? Refer to the statistical notation above.
4. In questions 3.2 and 3.3, you visualized $p$ for two different sets of coefficients. Use the likelihood function to calculate the negative log likelihood for both sets of coefficients. Interpret those values.
Now you’ll use optimization to estimate the most likely coefficients. Fill in the partially written code chunk below.
```
most_likely <- optim(
  par = c(____ = ____, ____ = ____),
  fn = ____,
  data = ____
)
```
1. What estimates did you get for $\beta_0, \beta_1$?
2. Add the predicted values of $p$ for these coefficients to your plot with the raw data. How does this curve compare to the other two you added earlier?

pctHisp	healthViolation
0.00	1
0.15	0
0.30	0
0.45	1
0.60	1
0.75	0
0.90	1

The last thing you’ll do is create a map of the negative log likelihood “landscape”. This should be raster plot with candidate $\beta_0$’s on the x-axis, $\beta_1$’s on the y-axis, and the fill should be the negative log likelihood. Add a point for the most likely coefficients. Choose your range of $\beta$’s so the most likely coefficients fall somewhere in the middle.

# This utility function will make it easier to call the likelihood function within mutate()
likelihood_fun2 <- function(beta0, beta1, data) {
  likelihood_fun(c(beta0 = beta0, beta1 = beta1), data = data)
}
# Create a grid of candidate coefficients to map out the likelihood landscape
expand_grid(
  ____ = seq(____, ____, length.out = 20),
  ____ = seq(____, ____, length.out = 20)
) %>%
  # Calculate the negative log likelihood for each pair of coefficients using mapply() and likelihood_fun2()
  mutate(____ = mapply(FUN = ____, 
                       ____, 
                       ____, 
                       MoreArgs = list(data = health_data))) %>% 
  # Create a "likelihood DEM"
  ggplot() +
  geom_raster(aes(x = ____, y = ____, fill = ____)) +
  # Most likely coefficients
  geom_point(x = ____, y = ____, color = "red") +
  # Make the fill of the likelihood raster cells look like terrain
  scale_fill_stepsn(colors = terrain.colors(6)) +
  theme_bw()

Are your most likely coefficients down in a valley or up on a peak? Why?
Pick three pairs of coefficients that fall in green, yellow/orange, and pink/white parts of the likelihood landscape. Recreate the plots from question 3 (raw data with predicted $p$ curve) using those three sets of coefficients. Qualitatively describe the shapes of the curves, where the coefficients fall in likelihood landscape, and how well the curves match the data.

--- title: "EDS222 Week 7" format: html: code-tools: true --- This week, we will review key topics from earlier weeks and postpone new material to week 8. As you complete the following exercises, please try to identify your "muddiest" areas. Are there formulas or functions you're using and can't articulate why? Are there similar topics that you're having a hard time differentiating? Make a note of these points as you go - that will help guide your studying and give me a better idea of how to clarify concepts and tools in our final weeks. ## Probability 1. What potential values can a $Normal$ variable take? What about a $Bernoulli$ variable? 1. What's something in the real world we could represent with a $Normal$ variable? 2. How about a $Bernoulli$ variable? 2. What's the difference between a probability *density* function (PDF) and a probability *mass* function (PMF)? 1. Consider distributions from the $Normal$ and $Bernoulli$ families. Which has a PDF? Which a PMF? 2. Consider the PDF for a $Normal(\mu=0,\sigma=1)$ variable. What's the value of the PDF when $x=0$? Does this value represent the probability of producing a 0 from this distribution? Why or why not? 3. Consider the PMF for a $Bernoulli(p=0.5)$ variable. What's the value of the PMF when $x=0$? Does this value represent the probability of producing a 0 from this distribution? Why or why not? 3. Draw the PMF for a $Bernoulli(p=0.25)$ variable by hand. Then use ggplot to recreate your drawing. ```{r} ``` 4. What's the probability of observing $\{ 0, 0, 1, 1 \}$ from a $Bernoulli(p = 0.5)$ variable? 1. Would that probability go up or down if the variable was a $Bernoulli(p = 0.99)$? 5. What's the probability of observing 0 from a $Bernoulli(p=0.1)$ AND 1 from a $Bernoulli(p=0.9)$? 1. Extending the previous question, what's the probability of observing $\{0, 0, 0, 1, 0\}$ from a $Bernoulli$ variable if $p$ is 0.1 for the first observation and increases by 0.1 for each subsequent observation? 6. Describe what each following R function returns. 1. `dnorm()` 2. `pnorm()` 3. `rnorm()` 4. `dbinom()` 5. `rbinom()` 7. Which R function would help you solve problem 4 above? Write the code below. ```{r} ``` ## Link functions 1. Fill in the blanks below to specify a logistic regression in statistical notation.\ $$ \begin{align} y &\sim ??? (???) \\ ???(???) &= ??? + ??? x \end{align} $$ 1. What kind of random variable is the response? 2. What parameter does that variable take? 3. What transformation (link function) is applied to the parameter? 4. Which part of the model is linear? 2. What's the formula for *odds*? 3. What's the formula for the *logit* function? 4. The *logistic* function is the inverse of the logit function. Without using R or a calculator, what's the value of $logistic(logit(0))$? How about $logistic(logit(1000))$? 1. What kinds of numbers can the logit function operate on? Conversely, what inputs would yield an undefined result? Why? 2. What about the logistic? 5. Imagine a "logistic" regression without a link function. I.e., $p$ is linearly related to $x$. Give an example of values for $\beta_0,\beta_1,x$ that would break your regression model. What constraint on $p$ did you break? 6. In your own words, describe why the logit link function is necessary for logistic regression to work. 7. Use ggplot to create a logistic curve. Your x-axis should cover the range -5 to 5. Use $\beta_0=0,\beta_1=1$ as your coefficients.\ Answer the following questions first using your intuition, then check your answers in code. 1. How will your curve change if $\beta_0$ increases to 3? 2. How will your curve change if you flip the sign of $\beta_1$ to -1? ## Likelihood 1. The PMF of a random discrete variable describes the probability of observing a value given the parameters. What is the likelihood function relative to that? 2. Is the likelihood function directly interpretable? Or rather, in what context are likelihoods meaningful? 3. Consider this logistic regression model from the week 6 lab. $$ \begin{align} \text{healthViolation} &\sim Bernoulli(p) \\ logit(p) &= \beta_0 + \beta_1 \text{pctHisp} \end{align} $$ Let's say you observe the following data. | pctHisp | healthViolation | |---------|-----------------| | 0.00 | 1 | | 0.15 | 0 | | 0.30 | 0 | | 0.45 | 1 | | 0.60 | 1 | | 0.75 | 0 | | 0.90 | 1 | 1. Visualize these data using ggplot. 2. Let $\beta_0=0, \beta_1=1$. 1. Calculate $logit(p)$ across the range of $\text{pctHisp}$ values. 2. Calculate $p$. 3. Add $p$ to your ggplot as a curve. Make it red. 3. Let $\beta_0=-2, \beta_1=0.5$. 1. Repeat the calculations above and add a curve for $p$, but make it blue. 4. Intuitively, which curve do you think is *more likely*? 4. Now you're going to write a likelihood function for the above logistic model. 1. What inputs does this function need? 2. What value will it return? 5. Consider the following partially written code chunk. ```{r} #| eval: false inv_logit <- function(x) { ____ } likelihood_fun <- function(coefs, data) { logit_p <- ____ p <- inv_logit(____) loglik <- ____(____, size = 1, prob = ____, ____ = TRUE) -____(____) } ``` 1. What is the length of the vector `coefs` (i.e. the first parameter of the likelihood function)? What should the names of the vector be ? 2. If `data` is a data frame, what two columns does it need to have? 3. Why is `size = 1` important? Refer to the statistical notation above. 4. In questions 3.2 and 3.3, you visualized $p$ for two different sets of coefficients. Use the likelihood function to calculate the negative log likelihood for both sets of coefficients. Interpret those values. 6. Now you'll use optimization to estimate the *most likely* coefficients. Fill in the partially written code chunk below. ```{r} #| eval: false most_likely <- optim( par = c(____ = ____, ____ = ____), fn = ____, data = ____ ) ``` 1. What estimates did you get for $\beta_0, \beta_1$? 2. Add the predicted values of $p$ for these coefficients to your plot with the raw data. How does this curve compare to the other two you added earlier? 7. The last thing you'll do is create a map of the negative log likelihood "landscape". This should be raster plot with _candidate_ $\beta_0$'s on the x-axis, $\beta_1$'s on the y-axis, and the fill should be the negative log likelihood. Add a point for the _most likely_ coefficients. Choose your range of $\beta$'s so the most likely coefficients fall somewhere in the middle. ```{r} #| eval: false # This utility function will make it easier to call the likelihood function within mutate() likelihood_fun2 <- function(beta0, beta1, data) { likelihood_fun(c(beta0 = beta0, beta1 = beta1), data = data) } # Create a grid of candidate coefficients to map out the likelihood landscape expand_grid( ____ = seq(____, ____, length.out = 20), ____ = seq(____, ____, length.out = 20) ) %>% # Calculate the negative log likelihood for each pair of coefficients using mapply() and likelihood_fun2() mutate(____ = mapply(FUN = ____, ____, ____, MoreArgs = list(data = health_data))) %>% # Create a "likelihood DEM" ggplot() + geom_raster(aes(x = ____, y = ____, fill = ____)) + # Most likely coefficients geom_point(x = ____, y = ____, color = "red") + # Make the fill of the likelihood raster cells look like terrain scale_fill_stepsn(colors = terrain.colors(6)) + theme_bw() ``` 1. Are your most likely coefficients down in a valley or up on a peak? Why? 2. Pick three pairs of coefficients that fall in green, yellow/orange, and pink/white parts of the likelihood landscape. Recreate the plots from question 3 (raw data with predicted $p$ curve) using those three sets of coefficients. Qualitatively describe the shapes of the curves, where the coefficients fall in likelihood landscape, and how well the curves match the data.