Coding it up in R

How to specify a linear regression in R

Let’s run a multiple (>1 predictor) OLS regression in R, and learn how to read its summaries and use its output objects.

The penguins dataset describes measurements of three species of Antarctic penguins, collected on three islands from 2007–2009. Let’s see if we can predict body mass (in grams) from bill length and flipper length (both in millimeters).

peng.lm <- lm(body_mass ~ bill_len + flipper_len, data=penguins)

Already I need to stop and tell you about several important details in this simple line of code.

  • The formula: The syntax I supplied was Y ~ X1 + X2, which R understands as “\(Y\) is a linear function of the predictors \(X_1\) and \(X_2\), and also an intercept.”

    • I don’t need to specify an intercept (that is, using the column of ones as a predictor), since it’s the default. If I wish to make the intercept explicit, I would write Y ~ 1 + X1 + X2 (the results would be identical).

    • If I don’t want an intercept, I could use the syntax Y ~ 0 + X1 + X2, or alternatively Y ~ -1 + X1 + X2 (there is no mathematical difference). These formulae are not literal equations; they simply tell R which model to run. We will discuss the implications of a no-intercept model later. (The formula variants above are illustrated in a short sketch after this list.)

    • Technically, R doesn’t recognize the terms body_mass, bill_len, or flipper_len. These aren’t defined objects that R is “aware of”. By using the data= argument, we are telling R “it’s okay that you don’t recognize these names; you’ll find objects with these names in this dataset.”

    • We can also write the formula without using the data= argument, and simply provide the objects directly to R. This method works well when your predictors and response variables are in different data frames, or not in any data frame at all (e.g. stored as separate vectors): lm(penguins$body_mass ~ penguins$bill_len + penguins$flipper_len)

  • The assignment: Notice that I didn’t ask to see the results of the lm() call, but instead stored the results as an R object. This allows me to modify it later, and to retrieve objects like the fitted values or the residuals whenever I like. It is very rare to run R models without storing them as objects.

  • The function call: I used lm() here since I am running a linear model. In the future we will use functions like glm(), glm.nb(), or zeroinfl() to run non-linear models, and their syntax will be very similar.
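
To make the formula variants above concrete, here is a short sketch using the same penguins data. The first call is identical to peng.lm, the second drops the intercept, and the third skips the data= argument entirely:

# explicit intercept: identical to the original peng.lm fit
lm(body_mass ~ 1 + bill_len + flipper_len, data=penguins)
# no intercept: forces the fitted plane through the origin
lm(body_mass ~ 0 + bill_len + flipper_len, data=penguins)
# same model, with the vectors supplied directly instead of via data=
lm(penguins$body_mass ~ penguins$bill_len + penguins$flipper_len)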

How to interpret the basic summary of a linear regression in R

The R function summary() will provide basic information about whatever you put inside of it: datasets, models, lists, etc. When we apply it to an lm() object, we see the following output:

summary(peng.lm)

Call:
lm(formula = body_mass ~ bill_len + flipper_len, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-1090.5  -285.7   -32.1   244.2  1287.5 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5736.897    307.959 -18.629   <2e-16 ***
bill_len        6.047      5.180   1.168    0.244    
flipper_len    48.145      2.011  23.939   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.1 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:   0.76, Adjusted R-squared:  0.7585 
F-statistic: 536.6 on 2 and 339 DF,  p-value: < 2.2e-16

Let’s learn how to read this from top to bottom:

  • The call: Repeats its instructions back to the user. Helps to remind you which model you’re looking at, especially if your models are being auto-generated by a function.

  • The residuals: Not very useful. Can help you to notice a large outlier or a skewed distribution (e.g. if \(|\mathrm{Q1}| \ll \mathrm{Q3}\)), but those should be examined separately.

  • The table of coefficients: Each row of this table refers to a different \(\hat{\beta}\) estimated by the model. (Intercept) refers to \(\hat{\beta}_0\) and each slope after the intercept is labeled using the name of its corresponding predictor.

    The estimates themselves are in the first column, Estimate. These are our best guesses, not the truth (which is still unknown). The second column, Std. Error, provides standard errors for our estimates, which can be used to conduct hypothesis tests and to construct confidence intervals. The third and fourth columns, t value and Pr(>|t|), report a two-sided hypothesis test that the true value of the parameter is actually 0 (a by-hand check of this test appears in the sketch after this list).

    If we can reject this null hypothesis, then we can say with some confidence that the predictor “belongs” in the model.1 On the other hand, if we can’t reject the null that \(\beta_j = 0\), then the predictor has no place in our model. After all, the following two equations are equivalent:

    \[\begin{aligned} Y &= \beta_0 + \beta_1 X_1 + 0 X_2 + \beta_3 X_3 + \varepsilon \\ Y &= \beta_0 + \beta_1 X_1 + \beta_3 X_3 + \varepsilon \end{aligned}\]

    Here we estimate that for every additional millimeter of bill length, penguins weigh an additional 6 grams, and that for every additional millimeter of flipper length, penguins weigh an additional 48 grams. Furthermore, a penguin with no bill or flippers would be expected to weigh -5736 grams.2

  • The model fit statistics: At the very bottom we learn more about how well the regression model describes the data. Importantly, we see that two rows of the original dataset were not used due to missing values in the response or the predictors — regressions cannot use a row in which either \(Y\) or any of the \(X\) predictors are missing.

    We also see that \(R^2\) is about 76%, meaning that 76% of the variation in penguin body mass can be explained by variation in penguin bill and flipper lengths. The remaining model fit metrics must wait until we learn a little more theory.
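
As a minimal sketch of what those last two columns report, we can recompute the t statistic and the two-sided p-value for flipper_len by hand from the Estimate and Std. Error columns, using the 339 residual degrees of freedom shown at the bottom of the summary:

tstat <- 48.145/2.011        # Estimate divided by Std. Error for flipper_len
tstat                        # should roughly match the t value column (about 23.94)
2*pt(-abs(tstat), df=339)    # two-sided p-value; same order of magnitude as Pr(>|t|) in the table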

How to retrieve and use regression model attributes in R

Sometimes all we need to move forward can be found in the model summary. More often, though, we will want to use objects from the regression itself. These can be found in two places: (i) in the lm() object, and (ii) in the summary() object.3

Recall that we named the penguin regression peng.lm. This object now has many attributes we can select by name:

names(peng.lm)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "na.action"     "xlevels"       "call"          "terms"        
[13] "model"        
  • The fitted values (also called \(\hat{\boldsymbol{y}}\) or the predictions) are stored as a vector in fitted.values:4
head(peng.lm$fitted.values)
       1        2        3        5        6        7 
3213.779 3456.922 3895.064 3777.003 3648.292 3212.570 
  • The residuals (also called \(\boldsymbol{e}\) or the errors) are stored as a vector in residuals.
head(peng.lm$residuals)
          1           2           3           5           6           7 
 536.220898  343.077607 -645.064115 -327.003441    1.707668  412.430396 
  • The estimated coefficients (also called \(\hat{\boldsymbol{\beta}}\) or the parameter estimates) are stored as a vector in coefficients.
peng.lm$coefficients
 (Intercept)     bill_len  flipper_len 
-5736.897161     6.047488    48.144859 
  • If you need more information about the coefficients, you can retrieve the matrix of standard errors, t-statistics, and p-values from the summary object as coefficients. Notice that the estimates are the same as the lm() coefficients; this method simply extracts more information.
summary(peng.lm)$coefficients
                Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -5736.897161 307.959128 -18.628762 7.796205e-54
bill_len        6.047488   5.179831   1.167507 2.438263e-01
flipper_len    48.144859   2.011115  23.939388 7.564660e-75
summary(peng.lm)$coefficients[1,4] #p-value for intercept
[1] 7.796205e-54
  • The estimated error variance, \(\hat{\sigma}^2\), can also be found in the summary information. Note that R actually stores \(\hat{\sigma}\), the residual standard deviation, under the name sigma (a quick consistency check appears after this list).
summary(peng.lm)$sigma
[1] 394.0678
  • And the last item for now is the \(R^2\) model fit metric, which can be found in the summary information under r.squared:
summary(peng.lm)$r.squared
[1] 0.7599577
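
As that promised consistency check, here is a minimal sketch recomputing \(\hat{\sigma}\) from attributes we have already extracted, the residuals and the residual degrees of freedom:

sqrt(sum(peng.lm$residuals^2)/peng.lm$df.residual)  # should match summary(peng.lm)$sigma, about 394.07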

How to make predictions for new values using a regression object

The fitted values provide our model’s guess as to where the training data “should” have been observed. However, we often want to learn where new \(X\)-values would be observed. As described in Multiple regression, predictions can take two forms:

  • Predictions for where new values will be observed, or

  • Predictions for the mean of the new observations.

Both of these predictions share the same point estimates; they simply differ in their standard errors, i.e. in the width of their confidence intervals.

The point estimates for new observations can be computed automatically using the predict() function or manually from the estimated coefficients as \(\mathbf{X}_{\textrm{new}} \hat{\boldsymbol{\beta}}\). Let’s build five new penguins and find their predicted body mass both ways:

newpengs <- data.frame(ones=c(1,1,1,1,1),
                       bill_len=c(35,40,45,50,55),
                       flipper_len=c(175,185,195,205,215))
predict(peng.lm,newdata=newpengs)
       1        2        3        4        5 
2900.115 3411.801 3923.487 4435.173 4946.859 
as.matrix(newpengs)%*%peng.lm$coefficients
         [,1]
[1,] 2900.115
[2,] 3411.801
[3,] 3923.487
[4,] 4435.173
[5,] 4946.859

Note a few details here: we had to change the data frame (which R regards as a type of list) into a matrix so that we could matrix-multiply it by the vector of beta-hats using the %*% operator. We also built a column of ones into the new data; this is unnecessary for the automatic calculation using predict(), but needed for the manual solution so that the intercept is included. Finally, note that we had to build a data frame with the same column names as the original dataset, so that the regression object peng.lm knows which vectors to use in which “roles”.
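
As an aside, a minimal sketch of an alternative to hand-building the column of ones: we can ask R to construct the design matrix for the new data with model.matrix(), reusing the right-hand side of the formula, and then multiply by the coefficients (Xnew is just a name chosen here):

Xnew <- model.matrix(~ bill_len + flipper_len, data=newpengs)  # intercept column added automatically
Xnew %*% peng.lm$coefficients                                  # should match predict(peng.lm, newdata=newpengs)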

The wider intervals for the possible location of new values (prediction intervals) can be found as follows:

predict.lm(peng.lm,newdata=newpengs,level=0.9,interval='prediction')
       fit      lwr      upr
1 2900.115 2245.681 3554.549
2 3411.801 2759.673 4063.930
3 3923.487 3272.035 4574.940
4 4435.173 3782.762 5087.584
5 4946.859 4291.863 5601.856

And the narrower confidence intervals for the location of the mean for new values can be found with this change:

predict.lm(peng.lm,newdata=newpengs,level=0.9,interval='confidence')
       fit      lwr      upr
1 2900.115 2823.723 2976.507
2 3411.801 3358.665 3464.938
3 3923.487 3879.417 3967.558
4 4435.173 4378.676 4491.670
5 4946.859 4865.788 5027.931
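
To see where the two widths come from, here is a minimal sketch using predict()’s se.fit option, which reports the standard error of the estimated mean for each new penguin. The narrower (confidence) interval is built from se.fit directly, while the wider (prediction) interval also adds the residual variance \(\hat{\sigma}^2\) before taking the square root:

pr <- predict(peng.lm, newdata=newpengs, se.fit=TRUE)
pr$se.fit                                      # standard errors for the mean of new observations
sqrt(pr$se.fit^2 + summary(peng.lm)$sigma^2)   # standard errors for individual new observations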

Recovering ANOVA information from a regression using R

Once we fit a regression in R, it’s easy to decompose the total variance of \(\boldsymbol{y}\) into (i) variance explained by the model and (ii) the residual variance. That information comes to us from the anova() command:

anova(peng.lm)
Analysis of Variance Table

Response: body_mass
             Df   Sum Sq  Mean Sq F value    Pr(>F)    
bill_len      1 77669072 77669072  500.16 < 2.2e-16 ***
flipper_len   1 88995501 88995501  573.09 < 2.2e-16 ***
Residuals   339 52643125   155289                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s review this table piece by piece:

  • The first two rows describe variance explained by each individual predictor, and the last row marked Residuals describes the residual variance.

  • The ANOVA tables in R are order-dependent – the variance explained by each individual predictor is only the additional variance explained by adding that predictor to a model that already contains all previously-listed predictors.5 ANOVA tables from other software tools may work differently. (The sketch after this list illustrates the order dependence.)

  • For each predictor, the first column describes the model degrees of freedom used up by that predictor.6 In the last Residuals row, the first column lists the total residual degrees of freedom \(n - k - 1\).

  • The second column lists the sum of squares contribution from each predictor toward the SSM, as well as the residual sum of squares/sum of squared errors (SSE).

  • The remaining columns are used to conduct an \(F\) test which we will study in later sections.
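
To illustrate the order dependence, here is a minimal sketch: refit the same model with the predictors listed in the opposite order and compare the ANOVA tables. The Sum Sq attributed to each predictor will generally change, while the Residuals row stays the same:

peng.lm2 <- lm(body_mass ~ flipper_len + bill_len, data=penguins)  # same model, reversed predictor order
anova(peng.lm2)  # per-predictor Sum Sq entries differ from anova(peng.lm); the Residuals row does not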

We can access all of this information from the anova() object, and we can even compare it to our manual calculations of SST, SSM, and SSE. Starting with the model sum of squares, SSM, notice that the two contributions from bill_len and flipper_len total about 166 million g\(^2\). We can recreate this total directly from the fitted values:

anova(peng.lm)[1,2] + anova(peng.lm)[2,2]
[1] 166664572
(SSM <- sum((peng.lm$fitted.values-mean(penguins$body_mass,na.rm=TRUE))^2))
[1] 166664572

The SSE or sum of squared errors can be found using the bottom row of the table:

anova(peng.lm)[3,2]
[1] 52643125
(SSE <- sum(peng.lm$residuals^2,na.rm=TRUE))
[1] 52643125

And the total sum of squares can be found either by totaling the Sum Sq column of the ANOVA table, or directly from \(\boldsymbol{y}\), or by adding SSM and SSE:

sum(anova(peng.lm)[,2])
[1] 219307697
(SST <- sum((penguins$body_mass-mean(penguins$body_mass,na.rm=TRUE))^2,na.rm=TRUE))
[1] 219307697
SSM + SSE
[1] 219307697
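
Finally, as a small sketch tying this back to the model summary, the \(R^2\) we saw earlier is simply the share of the total sum of squares that the model explains:

SSM/SST  # should match summary(peng.lm)$r.squared, about 0.76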

  1. In future sections we will add more nuance to this idea.↩︎

  2. Later, we shall learn why these estimates are flawed and unrealistic.↩︎

  3. As a general rule, any information that R prints in an output can be accessed through the code as well.↩︎

  4. Notice that the fourth fitted value is labeled 5 — a clear sign that the fourth row of the original data could not be used in the regression.↩︎

  5. Or, for the first row, a model with no other predictors.↩︎

  6. For our simple linear predictors this will always be 1; later, we shall consider categorical predictors which use more degrees of freedom.↩︎