Coding it up in R

This page will walk you through some practical “how-to”s of creating and using new features in an R regression.

Categorical predictors

In the remainder of this document, I will use what statisticians call “dummy coding” — the practice of representing the levels of a categorical variable with indicator values of 0 or 1.

There are alternative numeric representations, such as using -1 and +1, which create useful interpretations for the betas, but I will not discuss these alternatives here.1

For some examples, I will use the datasets::warpbreaks data, which counts the number of warp breaks (errors during the weaving process) on different looms where the wool comes in two types (A and B) and the loom tension is set to three different levels (L[ow], M[edium], and H[igh]).

head(warpbreaks)
  breaks wool tension
1     26    A       L
2     30    A       L
3     54    A       L
4     25    A       L
5     70    A       L
6     52    A       L

Indicator variables

A single binary predictor (e.g. TRUE or FALSE, survive or die, purchase a product or not) will generally be represented by 1/0 indicator variables. R will recognize several types:

  • String variables. In general, if you enter a character-type variable with two values as a regression predictor, R will assign the lower alphabetical value to 0 (the reference level) and the higher alphabetical value to 1:

    wool.basic <- lm(breaks~wool,data=warpbreaks)
    summary(wool.basic)$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
    (Intercept) 31.037037   2.501018 12.409761 3.612159e-17
    woolB       -5.777778   3.536974 -1.633537 1.083971e-01

    You can see that we have a beta for when wool=='B' and that R calls this predictor <variablename><levelvalue>, here “woolB”. From the summary, we learn that wool type B produced (on average) 5.8 fewer warp breaks than type A.

  • Logical variables. One simple way to code for an effect is to create a TRUE/FALSE logical variable for some binary condition. R will automatically assign the FALSE condition to 0 (the reference level), and the TRUE condition to 1:

    warpbreaks$lowtension <- warpbreaks$tension=='L'
    wool.low <- lm(breaks~lowtension,data=warpbreaks)
    summary(wool.low)$coefficients
                   Estimate Std. Error   t value     Pr(>|t|)
    (Intercept)    24.02778   1.988112 12.085724 1.009865e-16
    lowtensionTRUE 12.36111   3.443512  3.589682 7.328922e-04

    From this summary, we learn that low tension produced (on average) 12.36 more warp breaks than other types of tension (medium or high).

  • Inline formula expressions. We can also create indicator variables from expressions inside our regression calls, without having to make new variables in our dataset. For example, the expression (tension=='L') will create a vector of TRUEs and FALSEs when it is evaluated:

    wool.low2 <- lm(breaks~(tension=='L'),data=warpbreaks)
    summary(wool.low2)$coefficients
                       Estimate Std. Error   t value     Pr(>|t|)
    (Intercept)        24.02778   1.988112 12.085724 1.009865e-16
    tension == "L"TRUE 12.36111   3.443512  3.589682 7.328922e-04
  • Manually coded 1/0 indicators. Often our predictor already comes to us coded as 1s and 0s, in which case R can use the variable without any type conversion. Note that R treats a 1/0 variable as an ordinary numeric predictor, so the user must remember that its coefficient is not a generic “slope” but the effect size of the indicated condition.

    warpbreaks$lownumeric <- 0
    warpbreaks$lownumeric[warpbreaks$tension=='L'] <- 1
    wool.low3 <- lm(breaks~lownumeric,data=warpbreaks)
    summary(wool.low3)$coefficients
                Estimate Std. Error   t value     Pr(>|t|)
    (Intercept) 24.02778   1.988112 12.085724 1.009865e-16
    lownumeric  12.36111   3.443512  3.589682 7.328922e-04
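All four codings above describe the same low-tension condition, so they produce identical fits. As a quick sanity check (a sketch that refits two of the models from above), the indicator’s beta is exactly the difference between the raw group means:

```r
# Refit the inline and manually-coded versions of the low-tension model
low.inline  <- lm(breaks ~ (tension == 'L'), data = warpbreaks)
warpbreaks$lownumeric <- 1*(warpbreaks$tension == 'L')
low.numeric <- lm(breaks ~ lownumeric, data = warpbreaks)
all.equal(unname(coef(low.inline)), unname(coef(low.numeric)))  # TRUE

# The beta equals the difference between the two group means
grp <- tapply(warpbreaks$breaks, warpbreaks$tension == 'L', mean)
unname(grp["TRUE"] - grp["FALSE"])   # about 12.36, matching the beta
```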

Categorical variables with 3+ values

When describing a variable with several different values, we usually avoid TRUE/FALSE representations, but all of the other techniques described above will still work. One familiar method and one new method are described below.

  • String variables. R will convert a string variable with \(I\) levels into \(I-1\) dummy variables, using the lowest alphabetical level as the reference level. This reference level becomes part of the interpretation of the intercept and all the betas on the dummy variables represent shifts away from the reference level:

    warpbreaks$combination <- 
      paste(warpbreaks$wool,warpbreaks$tension,sep='+')
    unique(warpbreaks$combination)
    [1] "A+L" "A+M" "A+H" "B+L" "B+M" "B+H"
    summary(lm(breaks ~ combination, data=warpbreaks))$coefficients
                     Estimate Std. Error    t value     Pr(>|t|)
    (Intercept)    24.5555556   3.646761  6.7335241 1.884531e-08
    combinationA+L 20.0000000   5.157299  3.8779987 3.199282e-04
    combinationA+M -0.5555556   5.157299 -0.1077222 9.146651e-01
    combinationB+H -5.7777778   5.157299 -1.1203107 2.681556e-01
    combinationB+L  3.6666667   5.157299  0.7109664 4.805457e-01
    combinationB+M  4.2222222   5.157299  0.8186886 4.170102e-01

    Notice that we only see betas for five of the six levels; since A+H (wool A and high tension) comes first alphabetically, it becomes the reference level and disappears into the intercept.

  • Factor variables. R can (and often does) store character strings as factors, a term that here means something different from the statistical “factor” we met in Revisiting ANOVA. Factors allow R to store string data as numbers, which dramatically lowers the memory and compute requirements. Every unique string value is matched to an integer, and R stores and manipulates these integers instead of working with the strings. Whenever R needs to show the factor to a human, it re-matches the integers with their string values using a small “dictionary”.

    Factors offer better computational performance, but sometimes create unexpected results. I mentioned earlier that R treats the lowest alphabetical level of a categorical predictor as the reference level. However, a factor might associate a different character level with the number “1” (which then becomes the reference level). This is the case in our warpbreaks dataset:

    class(warpbreaks$tension)
    [1] "factor"
    levels(warpbreaks$tension)
    [1] "L" "M" "H"
    summary(lm(breaks~tension,data=warpbreaks))$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
    (Intercept)  36.38889   2.800279 12.994736 8.299362e-18
    tensionM    -10.00000   3.960193 -2.525130 1.471696e-02
    tensionH    -14.72222   3.960193 -3.717552 5.008605e-04

    You can see that although L(ow) is in the middle alphabetically, the variable has been stored as a factor in the sensible order of 1–L(ow) 2–M(edium) 3–H(igh). Therefore R treats L(ow) as the reference level, which is why only M and H are listed as predictors.
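If the default reference level is not the one you want, base R’s relevel() function lets you choose which level the intercept absorbs. A sketch, working on a copy of the data:

```r
# Make M(edium) the reference level instead of L(ow)
wb <- warpbreaks
wb$tension <- relevel(wb$tension, ref = "M")
summary(lm(breaks ~ tension, data = wb))$coefficients
# The intercept is now the group mean for M, with betas for L and H
```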

Interpreting ANOVA information about factors

Resuming our statistical use of the word ‘factor’ to mean any categorical predictor, we can now find another use for the anova function seen earlier in Coding it up in R.

anova(lm(breaks ~ wool + tension,data=warpbreaks))
Analysis of Variance Table

Response: breaks
          Df Sum Sq Mean Sq F value   Pr(>F)   
wool       1  450.7  450.67  3.3393 0.073614 . 
tension    2 2034.3 1017.13  7.5367 0.001378 **
Residuals 50 6747.9  134.96                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table above performs two tests: one for the explanatory power of wool (2 levels, requiring 1 indicator and thus 1 degree of freedom), and another for the additional explanatory power of tension (3 levels, requiring 2 indicators and thus 2 degrees of freedom).

From these tests, we learn that wool type (A or B) offers only minimal explanatory power, an amount which could be due to chance sampling error (\(p \approx 0.074\)). However, adding tension (Low, Medium, or High) delivers a significant increase in explanatory power, an amount very unlikely to arise by chance alone (\(p \approx 0.001\)). This ANOVA table has summarized the effect of adding tension in a single line, even though the variable includes three different levels and creates two different betas (one for M and one for H).

summary(lm(breaks ~ wool + tension,data=warpbreaks))$coefficients
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  39.277778   3.161783 12.422667 6.681866e-17
woolB        -5.777778   3.161783 -1.827380 7.361367e-02
tensionM    -10.000000   3.872378 -2.582393 1.278683e-02
tensionH    -14.722222   3.872378 -3.801856 3.913842e-04

The t-tests of the coefficients on M and H above only test whether each could be different from zero (i.e. whether its group mean differs from the group mean for L), not whether the entire three-level variable is useful in modeling warp breaks.
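To test the whole three-level variable at once, we can instead compare nested models with and without tension; this reproduces the F-test from the ANOVA table above:

```r
# F-test for the 2 extra degrees of freedom that tension requires
m0 <- lm(breaks ~ wool, data = warpbreaks)
m1 <- lm(breaks ~ wool + tension, data = warpbreaks)
anova(m0, m1)   # same F and p-value as the tension line in the table
```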

Interactions and other formula nuances

R uses its formula objects to create a model matrix defining the mathematical model space. However, the formula object itself is not an exact mathematical formula, just a syntax that helps the user to precisely define what terms are in and out of the model.
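We can inspect the model matrix that a formula generates with model.matrix(); a quick sketch on the warpbreaks data from earlier:

```r
# Each factor expands into dummy columns (woolB, tensionM, tensionH)
head(model.matrix(breaks ~ wool + tension, data = warpbreaks))
```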

For this section, let us consider a version of the penguins dataset which uses only those rows with no missing values. We will attempt to model body_mass as a function of other variables.

fullpengs <- penguins[complete.cases(penguins),]
head(fullpengs)
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
6  Adelie Torgersen     39.3     20.6         190      3650   male 2007
7  Adelie Torgersen     38.9     17.8         181      3625 female 2007

The general formula is formula = [response] ~ [predictors], with the ~ (tilde) operator separating the two sides.2 Consider the following formulae:

R formula                                                      Meaning
body_mass ~ 1                                                  Include only an intercept, no predictors
body_mass ~ bill_len                                           Include bill length as a predictor
body_mass ~ 0 + bill_len                                       Include bill length, but do not use an intercept
body_mass ~ bill_len + 0                                       same
body_mass ~ bill_len - 1                                       same
body_mass ~ bill_len + flipper_len                             Include both bill length and flipper length
body_mass ~ bill_len : flipper_len                             Include only a bill length-flipper length interaction term, without using either predictor alone
body_mass ~ bill_len + bill_len : flipper_len                  Include bill length and a bill length-flipper length interaction term, without using flipper length
body_mass ~ bill_len * flipper_len                             Include bill length, flipper length, and their interaction
body_mass ~ bill_len + flipper_len + bill_len : flipper_len    same
body_mass ~ bill_len + flipper_len + bill_len * flipper_len    same
body_mass ~ log(flipper_len)                                   Include log of flipper length
body_mass ~ I(flipper_len + bill_len)                          Include the actual sum of flipper length and bill length as a single predictor
body_mass ~ I((flipper_len + bill_len)^2)                      Include the squared sum of flipper length and bill length as a single predictor
body_mass ~ (flipper_len + bill_len)^2                         Include all terms inside the parentheses and all their pairwise interactions
body_mass ~ .                                                  Regress body mass on all other variables in the dataset
body_mass ~ .^2                                                Regress body mass on all other variables in the dataset and all pairwise interactions

Keep in mind,

  • The order of the terms doesn’t typically matter3

  • R is smart enough to not add the same variable(s) twice, which is why A + B + A*B and A*B create the same regression formula (A + B + A:B)

  • Use the I() notation to pass R mathematical commands that would normally be mistaken for formula syntax, such as addition (+), multiplication (*), and powers (^2)

  • Create 2-way and 3-way interactions between groups of variables by placing them in parentheses and adding ^2 or ^3 respectively

  • Include all remaining non-response fields with .
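One way to see how R has parsed a formula is to look at the columns of its model matrix. A sketch, assuming the fullpengs data frame built above:

```r
# Outside I(), ^2 expands into main effects plus pairwise interactions
colnames(model.matrix(~ (flipper_len + bill_len)^2, data = fullpengs))
# Inside I(), ^2 is ordinary arithmetic: one column, the squared sum
colnames(model.matrix(~ I((flipper_len + bill_len)^2), data = fullpengs))
```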

Additional feature engineering

We can create many useful features from the data, either directly within the formula statement or beforehand as a part of our data cleaning and feature engineering process. Using the penguins dataset above, and choosing body_mass as the response variable, consider each of the following predictors and their meanings:

head(fullpengs)
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
6  Adelie Torgersen     39.3     20.6         190      3650   male 2007
7  Adelie Torgersen     38.9     17.8         181      3625 female 2007
Variable(s)                                            Meaning
sqrt(bill_len)                                         Square root of bill length
1*(bill_len > 40)                                      Indicator for when bill length is over 40mm
bill_dep, bill_dep^2, bill_dep^3                       Polynomial trends on bill depth
bill_len/flipper_len                                   Ratio of bill length to flipper length
rank(bill_len)                                         Change bill length from a linear measure to its empirical rank (ordinal)
pnorm((bill_len-mean(bill_len))/sd(bill_len))          Change bill length from a linear measure to its percentile under a Normal assumption
paste(species, island)                                 Combine two categorical variables into one (essentially, fits an interaction effect)
1*(bill_len > tapply(bill_len,island,mean)[island])    Indicator for whether each penguin has a longer-than-average bill for its specific island
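The polynomial row above can be written directly in a formula, using I() to protect the powers from formula syntax (the poly() helper gives an equivalent fit with orthogonal columns). A sketch, assuming the fullpengs data frame from above:

```r
# Cubic trend in bill depth; I() keeps ^ as arithmetic, not formula syntax
cubic <- lm(body_mass ~ bill_dep + I(bill_dep^2) + I(bill_dep^3),
            data = fullpengs)
# lm(body_mass ~ poly(bill_dep, 3), ...) produces the same fitted values
```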

These are just a few examples of the many features that could be built from the penguins dataset. Strictly linear transformations are usually avoided except when seeking to aid interpretation, since a linear transformation of a predictor will not change the fitted values, residuals, R², or the p-values of significance tests.

However, when linear transformations are applied only to some observations and not others, they can create a novel non-linear effect. For example, in this penguin data, if we de-meaned bill length (that is, changed “bill length in mm” to “length in mm above or below the average bill length”), we would not improve the model in any way. But if we de-meaned bill length by sex, and measured whether bill lengths were large relative to other male or female penguins, that might possibly improve model fit.
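As a sketch of that last idea, base R’s ave() computes per-group means, so de-meaning by sex takes one line (again assuming the fullpengs data frame from above):

```r
# Bill length relative to the average for the penguin's own sex
fullpengs$bill_vs_sex <- fullpengs$bill_len -
  ave(fullpengs$bill_len, fullpengs$sex)
summary(lm(body_mass ~ bill_vs_sex, data = fullpengs))$coefficients
```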


  1. Those who are interested should look up “effect coding”.↩︎

  2. Some advanced R use cases explore less-common variants.↩︎

  3. One exception is that the order used here will affect the order of the anova() output, if you compute one.↩︎