What is statistics? Why do we need it as environmental data scientists?
Into the syllabus weeds 🌱
Statistics: the science of collecting, manipulating, and analyzing empirical data
Statistics enables us to test hypotheses, analyze data, and draw conclusions about the world from those analyses
Otherwise, we'd be stuck at observing and forming hypotheses
... and we'd have a lot of unanswered empirical questions!
R
This course is in-person, following UCSB guidelines
If you are sick (with anything), please stay home and just let me know ahead of time. We will get you caught up with notes from classmates, extra office hours, etc.
If you test positive for COVID-19, stay home for at least 5 days and follow UCSB protocol. In this case, I will provide a personal Zoom link for you to join, but I cannot guarantee that all lecture content (e.g., writing on the whiteboard) will be perfectly visible via Zoom.
Consider a potential research question:
What is the average mercury content in swordfish in the Atlantic Ocean?
Parameter: A numerical summary of the population
Statistic: A numerical summary of the sample
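To make the distinction concrete, here is a minimal sketch in Python. It assumes a simulated population: the mercury values, the population size, and the normal distribution are all invented for illustration and are not real swordfish data.

```python
import random
import statistics

rng = random.Random(222)  # fixed seed so the sketch is reproducible

# Hypothetical population: mercury concentration (ppm) for 10,000 swordfish.
# These values are simulated, not real measurements.
population = [rng.gauss(1.0, 0.3) for _ in range(10_000)]

# Parameter: a numerical summary of the whole population.
# In practice we almost never get to compute this directly.
population_mean = statistics.mean(population)

# Statistic: the same summary, computed from a sample of the population.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.3f}")
print(f"statistic (sample mean):     {sample_mean:.3f}")
```

The two numbers will generally be close but not identical; the sample mean is our estimate of a population mean that we usually cannot observe.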
From IMS: Suppose we want to estimate time to graduation for Duke undergraduates in the last five years using a sample of recent students.
Suppose we take a random sample (i.e., every individual in the population has the same probability of being selected)
10 graduates are randomly selected from the population to be included in the sample.
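As a rough check of what "same probability of being selected" means, the sketch below repeatedly draws a simple random sample of 10 and counts how often each individual is included. The population size of 100 and the number of repeated draws are assumptions chosen for illustration; they do not come from the IMS example.

```python
import random
from collections import Counter

rng = random.Random(222)  # fixed seed so the sketch is reproducible

N = 100         # hypothetical population size (assumption)
n = 10          # sample size, as in the example above
draws = 20_000  # number of repeated samples (assumption)

population = list(range(N))  # label individuals 0..N-1
counts = Counter()
for _ in range(draws):
    # random.sample draws without replacement, each individual equally likely
    counts.update(rng.sample(population, n))

# Fraction of samples that included each individual
inclusion_rates = [counts[i] / draws for i in population]
print(f"expected inclusion rate: {n / N:.3f}")
print(f"observed range: {min(inclusion_rates):.3f} to {max(inclusion_rates):.3f}")
```

Under simple random sampling, every individual's inclusion rate should sit near n / N, with only small differences due to chance.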
Suppose we ask a nutrition major to pick a few of her friends for the sample.
Asked to pick a sample of graduates, a nutrition major might inadvertently pick a disproportionate number of graduates from health-related majors.
When a sample is not drawn randomly, your statistic is likely to be a biased estimate of the population parameter.
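The bias can be illustrated with a small simulation. This is a sketch under loudly stated assumptions: the population is made up, health-related majors are assumed to take longer to graduate on average, and the "friends" sample is assumed to consist of 80% health-related majors. None of these numbers come from IMS.

```python
import random
import statistics

rng = random.Random(222)  # fixed seed for reproducibility

# Hypothetical population of graduates: years to graduation by major group.
# Assumption for illustration only: health-related majors take longer on average.
health = [rng.gauss(4.6, 0.3) for _ in range(1_000)]
other = [rng.gauss(4.0, 0.3) for _ in range(9_000)]
population = health + other
true_mean = statistics.mean(population)  # the population parameter

def random_sample_mean(n):
    """Mean of a simple random sample: everyone is equally likely to be picked."""
    return statistics.mean(rng.sample(population, n))

def friends_sample_mean(n, health_share=0.8):
    """Mean of a non-random sample that over-represents health-related majors."""
    n_health = round(n * health_share)
    picked = rng.sample(health, n_health) + rng.sample(other, n - n_health)
    return statistics.mean(picked)

# Repeat each sampling scheme many times and average the resulting statistics.
reps = 2_000
random_avg = statistics.mean([random_sample_mean(10) for _ in range(reps)])
friends_avg = statistics.mean([friends_sample_mean(10) for _ in range(reps)])

print(f"true mean:                 {true_mean:.2f}")
print(f"average of random samples: {random_avg:.2f}")
print(f"average of friend samples: {friends_avg:.2f}")
```

Averaged over many repetitions, the random samples center on the true mean, while the friend samples are systematically off. Taking more non-random samples does not fix this; the error comes from how the sample is drawn, not from its size.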
Some other examples of biased sampling:
Systematic non-response (e.g., only people from a certain group respond to the phone survey)
Convenience sampling (e.g., biologists only take forest transects near the edge of a large forested area)
Under-represented groups may be particularly misrepresented due to improper sampling
Rolf et al., "Representation Matters"
Buolamwini et al., "Gender shades"
Nearly all statistical methods are based on assumptions of randomness. If data are not collected randomly from the population, estimates are likely to be biased.
Source: IMS
Experimental studies
Observational studies
Q: Does sunscreen lower risk of skin cancer?
All the data for homework will be on Taylor
See MEDS summer coursework for a refresher on how to access Taylor, compute on Taylor, and pull data to/from Taylor (if you are not a MEDS student please reach out to Leo and/or Kat Le for help getting access to Taylor)
All the data we use will be small enough to load and work with locally (e.g., use Cyberduck to pull data down)
But you're welcome to work on Taylor with an RStudio GUI or Workbench, etc.
You won't have write access to our class directory, but you do have your own directories on Taylor
Leo will walk you through all of this in your first Discussion section
GitHub will be used in multiple ways in this class:
My course website is built in git, so you can access any source code you might want (slide materials, lab materials, etc.) from this repo, #EDS-222-stats. But you won't ever need to interact with this repo if you don't want to.
GitHub Classroom. All homework assignments will be accessed via GH Classroom. You will pull the assignment from GH, edit and push your code by the submission deadline, and then pull again once grades are posted to see your grade and to get feedback.
Final projects. You will submit a GitHub repo alongside your final project report. We will not be grading this repository, but we expect you to keep your project code there.
Slides created via the R package xaringan.