Linear regression with a factor, using R
Overview
Fitting models in R is simple and can easily be automated, allowing
many different model types to be explored. This tutorial shows how to
fit a variety of different linear regression models to continuous data
from different categories. It demonstrates the R formula interface,
as well as the power and flexibility of the plyr and ggplot2 packages
for manipulating and visualising data, respectively.
Create and plot data
It is best to assemble a data frame of x and y data, to keep these
two vectors associated in a single object; subsequent fitting and
plotting of the data can then reference this data frame. The use of
'within' here performs this task without leaving copies of x and y as
separate objects, which avoids cluttering up the R workspace with
unnecessary objects.
set.seed(123) # Allow reproducible random numbers

## Create some data with 5 different categories & Gaussian random noise:
mydata <- within(data.frame(x = 1:50), {
    class <- rep(1:5, each = 10)
    y <- 0.5 * class * (1:10) + rnorm(50)
    class <- factor(class) # convert to a factor
})
Now plot the data, using the lattice package, which makes it easy to
display the separate categories within the data. Note that lattice is
a 'recommended' package, which means that it comes bundled with the
standard installation of R but is not loaded by default, so you need
to load it using the library function.
library(lattice)
xyplot(y ~ x, data = mydata, group = class,
       auto.key = list(title = "Class", columns = 5))
The legend is automatically created to match the point colours used
to identify the different categories in the class variable.
A Lattice plot of the data to be fitted, showing the 5 separate categories
You can clearly see that each category shows a roughly linear
relationship between x and y, with a different slope and intercept.
Specify & fit linear models
Now let's specify a variety of different linear models to fit to the
data, using the formula interface in R. We want to model y in terms
of x and possibly also class, so the syntax starts with "y ~ ".
Formulae can be treated as normal objects in R, so you can generate
them by manipulating character strings; this lets us avoid code
duplication by pasting the common initial part onto each of the
alternative right-hand sides:
## create a vector of model formulae:
forms <- paste("y ~ ",
               c("x",                   # same intercept & slope
                 "x + class",           # different intercept, same slope
                 "x / class",           # same intercept, different slope
                 "x * class",           # different intercept & slope
                 "class / x",           # different intercept & slope
                 "class / (1 + x) - 1"  # different intercept & slope
               ),
               sep = "")
The manual on 'An Introduction to R' has more information about R statistical model formulae.
It is useful to assign a simple name to each model, to keep track of the results later on:
names(forms) <- paste("Model", LETTERS[1:length(forms)])
forms # print out the object
                  Model A                   Model B                   Model C
                  "y ~ x"           "y ~ x + class"           "y ~ x / class"
                  Model D                   Model E                   Model F
          "y ~ x * class"           "y ~ class / x" "y ~ class / (1 + x) - 1"
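As an aside (not part of the original tutorial), the difference between these formulae can be seen by inspecting the design matrix each one generates; this small sketch uses a tiny hypothetical data frame just for illustration:

```r
## Tiny hypothetical data frame, purely to inspect design-matrix columns:
d <- data.frame(x = 1:4, class = factor(rep(1:2, each = 2)))

## Common slope, per-class intercepts:
colnames(model.matrix(~ x + class, data = d))
# "(Intercept)" "x" "class2"

## Per-class intercepts AND per-class slopes:
colnames(model.matrix(~ x * class, data = d))
# "(Intercept)" "x" "class2" "x:class2"
```

The extra "x:class2" column in the second case is what allows the slope to vary between classes.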
To fit the models to the data, we use the excellent 'plyr' package,
which makes it very easy to apply a similar action repeatedly to
subsets of data and combine the results together. In this case, we
want to run lm to fit a linear model to mydata, using a different
formula in each case. The 'llply' function reads in the formula list,
runs lm on the mydata data frame for each entry, and returns a list
of fitted lm models.
## fit each model to the data:
##install.packages("plyr") # if package not already installed
library(plyr)
models <- llply(forms, lm, data = mydata)
names(models) <- names(forms) # assign model name to each model object in list
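If you prefer not to depend on plyr, a roughly equivalent loop can be written with base R's lapply. This is a sketch (not from the original tutorial) that recreates the data so it runs standalone; like the llply call above, it relies on lm coercing each character string to a formula:

```r
## Recreate the tutorial data, so this sketch runs standalone:
set.seed(123)
mydata <- within(data.frame(x = 1:50), {
    class <- rep(1:5, each = 10)
    y <- 0.5 * class * (1:10) + rnorm(50)
    class <- factor(class)
})

## Base-R equivalent of the llply() call, for two of the formulae:
forms2 <- c("Model A" = "y ~ x", "Model D" = "y ~ x * class")
models2 <- lapply(forms2, lm, data = mydata) # lm() coerces the strings
sapply(models2, AIC) # one AIC value per fitted model (lower is better)
```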
Extract model predictions & plot vs. raw data
Now we need to add in the model predictions for each x value, and
also include the AIC, a relative measure of fit quality. This
calculation is performed using another plyr package function.
In this case, 'ldply' allows you to loop over each fitted model in
the models list and calculate some numbers. The results for each
model are added into the existing model data frame, then collected
together and returned in a single data frame:
## Use an anonymous function to add predicted data + AIC fit criterion:
predData <- ldply(models, function(m)
    within(m$model, # combine with fitted x & y data
    {
        pred <- predict(m)
        AIC <- AIC(m)
    })
)
head(predData) # print out first few rows of data frame
      .id           y x     AIC      pred class
1 Model A -0.06047565 1 289.342 0.3911885  <NA>
2 Model A  0.76982251 2 289.342 0.7133605  <NA>
3 Model A  3.05870831 3 289.342 1.0355325  <NA>
4 Model A  2.07050839 4 289.342 1.3577046  <NA>
5 Model A  2.62928774 5 289.342 1.6798766  <NA>
6 Model A  4.71506499 6 289.342 2.0020487  <NA>
'predData' also contains a column called '.id', which gives the name
of the model corresponding to each row of the data frame; this comes
from the names of the input list (i.e. 'models'). Model A is the
simplest case, in which class is not included, so class is reported
as a missing (factor) value ('<NA>') in the predicted data.
Next, add in the model formula, and create a customised label to be used as a strip title for each panel in the plot. These strip titles are ordered from best to worst fitting model, according to the AIC value (lower is better).
## add formula column to data frame & create custom panel strip title:
predData <- within(predData, {
    modelName <- .id # preserve model name
    formula <- rep(as.character(forms), each = nrow(predData) / length(forms))
    strip <- factor(paste(formula, ", AIC =", round(AIC, 1)))
    strip <- reorder(strip, AIC) # sort by AIC value
})
Now we can plot the data and overlay the best-fit model in each case,
this time using the powerful and flexible ggplot2 package.
## plot points & predictions for each model:
##install.packages("ggplot2") # if package not already installed
library(ggplot2)
ggplot(data = predData, aes(x, y, colour = class)) +
    geom_point(alpha = 0.5) + # use 50% transparent points, to handle overplotting
    geom_line(aes(y = pred)) + # show model prediction as a line
    facet_wrap( ~ strip)      # use a separate panel for each different model
Specifying 'colour = class' as an aesthetic attribute treats each different category of 'class' as a separate group of data, with a different colour for the points & lines used to represent it.
Fitted data + linear model prediction for 6 different models
Not surprisingly, the models with separate slopes and intercepts for each category provide the best fit. However, this might well not be the case if there were more noise in the data and/or smaller differences between the slopes and intercepts in each category.
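The AIC ranking can also be checked with a formal nested-model comparison. This sketch (not part of the original tutorial; it recreates the data so it runs standalone) uses anova to F-test whether the extra per-class slope terms are justified:

```r
## Recreate the tutorial data:
set.seed(123)
mydata <- within(data.frame(x = 1:50), {
    class <- rep(1:5, each = 10)
    y <- 0.5 * class * (1:10) + rnorm(50)
    class <- factor(class)
})

fit_common <- lm(y ~ x + class, data = mydata) # common slope, per-class intercepts
fit_full   <- lm(y ~ x * class, data = mydata) # per-class slopes & intercepts
anova(fit_common, fit_full) # a significant F supports the separate slopes
```

With the simulated slopes ranging from 0.5 to 2.5 against unit-variance noise, the F-test strongly favours the full model, consistent with the AIC ordering in the panel plot.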
R source code
Click for the complete R source code for this tutorial. Note that
this is automatically generated ("tangled") from the org mode source
file for this document, which adds some extra commands to specify
filenames for plots (and to subsequently close the graphics device).
Note also that if you use 'source' to read in the R code, the ggplot2
plots will not be created, as auto-printing is turned off when using
'source' (see R FAQ 7.22 for more information).
Session information
This page was created with (Emacs) org mode.
- R version
- R version 3.0.1 (2013-05-16), x86_64-unknown-linux-gnu
- locale
- LC_CTYPE=en_GB.UTF-8, LC_NUMERIC=C, LC_TIME=en_GB.UTF-8, LC_COLLATE=en_GB.UTF-8, LC_MONETARY=en_GB.UTF-8, LC_MESSAGES=en_GB.UTF-8, LC_PAPER=C, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_GB.UTF-8, LC_IDENTIFICATION=C
- attached base packages
- stats, graphics, grDevices, utils, datasets, methods, base
- other attached packages
- plyr_1.8, lattice_0.20-15, ascii_2.1, ggplot2_0.9.3.1, MASS_7.3-27, quantreg_4.98, SparseM_0.99
- loaded via a namespace (and not attached)
- colorspace_1.2-2, dichromat_2.0-0, digest_0.6.3, grid_3.0.1, gtable_0.1.2, labeling_0.2, munsell_0.4.2, proto_0.3-10, RColorBrewer_1.0-5, reshape2_1.2.2, scales_0.2.3, stringr_0.6.2
About
This R tutorial is by Alastair Sanderson Ph.D. You can contact me here.