Statistical Analysis Using R: Online Notes
Introduction
Hello! I’m Jonathan, and I want to teach you statistics. I have designed this website to accompany the University of Chicago course ADSP 31014 “Statistical Models for Data Science”, since I was unable to find a single textbook which covered all of the requisite material, and did not want my students to buy three expensive textbooks for a ten-week class.
How to use this website
The left-side chapter headings and search bar will help you to select different pages. The right-side table of contents will help you to navigate within each page.
The appendices contain notes on common probability distributions, a few proofs (which might be of interest but are not necessary to progress through the material), and a table of commonly used symbols and their meanings. For more on how to read these chapters, please see the last section below, A note on notation.
Purpose and scope
Eleventy years ago, I studied probability and statistics “for statisticians”. Probability theory is elegant and contemplative. Graduate-level work can be completed with just pencil and paper, through proofs and symbolic notation.
The tools and methods discussed in these notes were built from such theoretical work. However, the students I teach are generally headed to industry, not to research positions or Ph.D. programs. They may never need to author a proof, but they will need to draw conclusions about the world from noisy, biased, incomplete, or insufficient data, to perform their analyses within a modern-day tech stack, and to communicate their findings to decision-makers.
So I want to teach probability and statistics “for data scientists”. Data science is a tradecraft, not a body of theory, and it allows people and organizations to answer questions, solve problems, and achieve goals. Data science which does not help us to confront real problems or understand real datasets is not truly data science after all.
I will try to stay focused on that mission in these notes, which should not be confused for a true textbook: they more closely resemble a detailed set of lecture slides, paired with ready-to-use code and interactive workshops. The layout of these notes echoes the syllabus for my class: a fast-paced survey of inferential statistics and parametric modeling methods, using likelihood estimation theory as a throughline for three main sections:
- First, we’ll review some topics in univariate analysis which may already be familiar to many readers.
- Second, we’ll examine ordinary least squares (OLS) regression in more detail than readers may have seen at the college level.
- Third, we’ll abstract from OLS regression to the family of models called generalized linear models, which describe non-linear trends among non-normally distributed datasets.
Recommended textbooks
Despite the sentiments above, I retain both fondness and respect for true textbooks, and I highly suggest that readers of these notes pair them with one or more proper reference tomes. A few recommendations follow:
Introduction to Probability 2nd edition, by Joseph Blitzstein and Jessica Hwang
This book is freely available and would help you prepare if you need to spend more time with the foundational topics which precede this course. The authors cover probability and random variables in depth, but they do not cover inference/estimation or regression topics.
Practical Statistics for Data Scientists 2nd edition, by Peter Bruce, Andrew Bruce, and Peter Gedeck
This might be the most comprehensive of the books in this list. It covers many of our topics, and contains extra sections on experimental design, classification models, and machine learning models which might be useful to readers of these notes.
Foundations and Applications of Statistics: An Introduction using R 2nd edition, by Randall Pruim
This book covers most of our material, but skimps a little on generalized linear models. It focuses mostly on univariate analysis and linear regression, with some basic probability as well as a brief section on logistic regression. It also contains a lot of R code.
Foundations of Linear and Generalized Linear Models, by Alan Agresti
This book aligns well with the second and third sections of these notes (linear regression and GLMs), but does not cover basic probability or univariate inference.
How these notes were made
I assembled these notes using Quarto, a publishing system built around the Pandoc markdown language. I wrote all the code backing these notes in R, and alongside every figure or table you can find the corresponding R code.
Neither the text nor the R code in these notes was generated by AI tools: for better and worse, the opinions expressed here are my own, and I’ve described these concepts in my own voice.1 Complaints can be submitted here.
These notes began as a (clunky) Word document shared with my students across several academic quarters. Their questions, requests for clarification, and occasional corrections have all vastly improved my content, and I thank them for their help.
A note on notation
Unfortunately, no two statistical sources use exactly the same notation. Any choice I make will inevitably differ from other sources you consult. I will follow the common conventions when I can, and beg your understanding when I cannot. I will occasionally use two different options to represent the same concept, not because I wish to confuse you, but because sometimes complete consistency creates impossible formatting challenges or even greater ambiguity.
- Observations from a sample and realizations of a random variable will use lowercase Latin letters, with subscripts as needed:
\[x_1,x_2,...,x_n\]
- Random variables themselves will use uppercase Latin letters.2 Subscripts on random variables suggest relations between them, such as two predictors for the same response:
\[Y,Z;\quad X_1,X_2\]
- True parameters of a model will often use Greek lowercase letters:3
\[\mu,\sigma^2,\beta_0,\beta_1\]
- Estimates of model parameters will use a ‘hat’ accent above the original symbol or the matching lowercase Latin letter:4
\[\hat{\mu},\hat{\beta}_0;\quad s^2,b_1\]
- Vectors (including univariate samples) will use an arrow accent or boldface lowercase letters. Matrices will use bold uppercase letters (not italic):
\[\vec{u},\boldsymbol{y};\quad \mathbf{X}\]
- Elements of a matrix or data with multiple indices will use a double subscript:
\[\mathbf{X} = \begin{bmatrix} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \end{bmatrix};\qquad y_{i \bullet} = \frac{1}{n_i}\sum_j y_{i,j}\]
- By convention some terms are stylized using blackboard letters, including the set of reals and the set of integers, as well as expectation and variance:
\[\mathbb{R},\mathbb{Z}; \quad \mathbb{V}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2\]
- Script typefaces may be used for other sets, or to “re-use” a letter with more established meanings:
\[\mathcal{S},\mathcal{L},\ell \qquad (\textrm{compare with}\ S, L, l)\]
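As a small illustration of the variance identity above, here is a quick numerical check in R (the choice of distribution, its parameters, and the sample size are all arbitrary):

```r
# Simulate many draws from a Normal distribution with known variance
# sigma^2 = 4, then compute the plug-in version of V[X] = E[X^2] - E[X]^2
# using sample means in place of expectations.
set.seed(42)
x <- rnorm(1e6, mean = 3, sd = 2)

v_identity <- mean(x^2) - mean(x)^2

# With a million draws this lands very close to the true variance of 4.
print(v_identity)
```

With smaller samples the estimate is noisier, of course; this is only a sanity check on the algebraic identity, not an estimation method we would recommend in practice.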
A further list of all the symbols I use and their meanings can be found in the appendix section List of Symbols.
AI assistance was used to brainstorm case studies and examples, and to help with the layout and coding of the website itself.↩︎
The most common exception will be my use of the Greek lowercase epsilon \((\varepsilon)\) for an error term, per tradition.↩︎
With many exceptions when parameters are traditionally named otherwise, such as using ‘a’ and ‘b’ for the parameters of a Uniform distribution or ‘df’ for degrees of freedom in a t-distribution.↩︎
The hat accent is properly called a circumflex, but statisticians say ‘hat’, e.g. the estimate for ‘mu’ is ‘mu-hat’.↩︎