Linear regression with a factor, using R
Overview
Fitting models in R is simple and can easily be automated, allowing
many different model types to be explored. This tutorial shows how to
fit a variety of different linear regression models to continuous data
from different categories. It demonstrates the R formula interface,
as well as the power and flexibility of the plyr and ggplot2 packages
for manipulating and visualising data, respectively.
Create and plot data
It is best to assemble a data frame of x and y data, to keep these
two vectors associated in a single object; subsequent fitting and
plotting of the data can then reference this data frame. The use of
'within' here performs this task without leaving copies of x and y as
separate objects, which avoids cluttering up the R workspace with
unnecessary objects.
set.seed(123) # Allow reproducible random numbers

## Create some data with 5 different categories & Gaussian random noise:
mydata <- within(data.frame(x = 1:50), {
    class <- rep(1:5, each = 10)
    y <- 0.5 * class * (1:10) + rnorm(50)
    class <- factor(class) # convert to a factor
})
Now plot the data, using the lattice package, which makes it easy to
display the separate categories within the data. Note that lattice is
a 'recommended' package, which means that it comes bundled with the
standard installation of R but is not loaded by default, so you need
to load it using the library function.
library(lattice)
xyplot(y ~ x, data = mydata, group = class,
       auto.key = list(title = "Class", columns = 5))
The legend is automatically created to match the point colours used
to identify the different categories in the class variable.
A Lattice plot of the data to be fitted, showing the 5 separate categories
You can clearly see that each category shows a roughly linear
relationship between x and y, with a different slope and intercept.
Specify & fit linear models
Now let's specify a variety of different linear models to fit to the
data, using the formula interface in R. We want to model y in terms
of x and possibly also class, so the syntax starts with "y ~ ".
Formulae can be treated as normal objects in R, so you can generate
them by manipulating character strings; this lets us avoid code
duplication by pasting the common initial part onto each of the
alternative right-hand sides:
## create a vector of model formulae:
forms <- paste("y ~ ",
               c("x",                   # same intercept & slope
                 "x + class",           # different intercept, same slope
                 "x / class",           # same intercept, different slope
                 "x * class",           # different intercept & slope
                 "class / x",           # different intercept & slope
                 "class / (1 + x) - 1"  # different intercept & slope
               ),
               sep = "")
The manual on 'An Introduction to R' has more information about R statistical model formulae.
It is useful to assign a simple name to each model, to keep track of the results later on:
names(forms) <- paste("Model", LETTERS[1:length(forms)])
forms # print out the object
                  Model A                   Model B                   Model C
                  "y ~ x"           "y ~ x + class"           "y ~ x / class"
                  Model D                   Model E                   Model F
          "y ~ x * class"           "y ~ class / x" "y ~ class / (1 + x) - 1"
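As an aside (not part of the original tutorial), the difference between these formulae can be seen by inspecting the design matrix each one generates; this small sketch uses a tiny hypothetical data frame just for illustration:

```r
## Tiny hypothetical data frame, purely to inspect design-matrix columns:
d <- data.frame(x = 1:4, class = factor(rep(1:2, each = 2)))

## Common slope, per-class intercepts:
colnames(model.matrix(~ x + class, data = d))
# "(Intercept)" "x" "class2"

## Per-class intercepts AND per-class slopes:
colnames(model.matrix(~ x * class, data = d))
# "(Intercept)" "x" "class2" "x:class2"
```

The extra "x:class2" column in the second case is what allows the slope to vary between classes.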
To fit the models to the data, we use the excellent 'plyr' package,
which makes it very easy to apply a similar action repeatedly to
subsets of data and combine the results together. In this case, we
want to run lm to fit a linear model to mydata, using a different
formula in each case. The 'llply' function reads in the formula list,
runs lm on the mydata data frame for each entry, and returns a list
of fitted lm models.
## fit each model to the data:
##install.packages("plyr") # if package not already installed
library(plyr)
models <- llply(forms, lm, data = mydata)
names(models) <- names(forms) # assign model name to each model object in list
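If you prefer not to depend on plyr, a roughly equivalent loop can be written with base R's lapply. This is a sketch (not from the original tutorial) that recreates the data so it runs standalone; like the llply call above, it relies on lm coercing each character string to a formula:

```r
## Recreate the tutorial data, so this sketch runs standalone:
set.seed(123)
mydata <- within(data.frame(x = 1:50), {
    class <- rep(1:5, each = 10)
    y <- 0.5 * class * (1:10) + rnorm(50)
    class <- factor(class)
})

## Base-R equivalent of the llply() call, for two of the formulae:
forms2 <- c("Model A" = "y ~ x", "Model D" = "y ~ x * class")
models2 <- lapply(forms2, lm, data = mydata) # lm() coerces the strings
sapply(models2, AIC) # one AIC value per fitted model (lower is better)
```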
Extract model predictions & plot vs. raw data
Now we need to add in the model predictions for each x value, and
also include the AIC, a relative measure of fit quality. This
calculation is performed using another plyr package function.
In this case, 'ldply' allows you to loop over each fitted model in
the models list and calculate some numbers. The results for each
model are added into the existing model data frame, then collected
together and returned in a single data frame:
## Use an anonymous function to add predicted data + AIC fit criterion:
predData <- ldply(models, function(m)
    within(m$model, # combine with fitted x & y data
    {
        pred <- predict(m)
        AIC <- AIC(m)
    })
)
head(predData) # print out first few rows of data frame
      .id           y x     AIC      pred class
1 Model A -0.06047565 1 289.342 0.3911885  <NA>
2 Model A  0.76982251 2 289.342 0.7133605  <NA>
3 Model A  3.05870831 3 289.342 1.0355325  <NA>
4 Model A  2.07050839 4 289.342 1.3577046  <NA>
5 Model A  2.62928774 5 289.342 1.6798766  <NA>
6 Model A  4.71506499 6 289.342 2.0020487  <NA>
'predData' also contains a column called '.id', which gives the name
of the model corresponding to each row of the data frame; this comes
from the names of the input list (i.e. 'models'). Model A is the
simplest case, in which class is not included, so class is reported
as a missing (factor) value ('<NA>') in the predicted data.
Next, add in the model formula, and create a customised label to be used as a strip title for each panel in the plot. These strip titles are ordered from best to worst fitting model, according to the AIC value (lower is better).
## add formula column to data frame & create custom panel strip title:
predData <- within(predData, {
    modelName <- .id # preserve model name
    formula <- rep(as.character(forms), each = nrow(predData) / length(forms))
    strip <- factor(paste(formula, ", AIC =", round(AIC, 1)))
    strip <- reorder(strip, AIC) # sort by AIC value
})
Now we can plot the data and overlay the best-fit model in each case,
this time using the powerful and flexible ggplot2 package.
## plot points & predictions for each model:
##install.packages("ggplot2") # if package not already installed
library(ggplot2)
ggplot(data = predData, aes(x, y, colour = class)) +
    geom_point(alpha = 0.5) + # use 50% transparent points, to handle overplotting
    geom_line(aes(y = pred)) + # show model prediction as a line
    facet_wrap( ~ strip)      # use a separate panel for each different model
Specifying 'colour = class' as an aesthetic attribute treats each different category of 'class' as a separate group of data, with a different colour for the points & lines used to represent it.
Fitted data + linear model prediction for 6 different models
Not surprisingly, the models with separate slopes and intercepts for each category provide the best fit. However, this might well not be the case if there were more noise in the data and/or smaller differences between the slopes and intercepts in each category.
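The AIC ranking can also be checked with a formal nested-model comparison. This sketch (not part of the original tutorial; it recreates the data so it runs standalone) uses anova to F-test whether the extra per-class slope terms are justified:

```r
## Recreate the tutorial data:
set.seed(123)
mydata <- within(data.frame(x = 1:50), {
    class <- rep(1:5, each = 10)
    y <- 0.5 * class * (1:10) + rnorm(50)
    class <- factor(class)
})

fit_common <- lm(y ~ x + class, data = mydata) # common slope, per-class intercepts
fit_full   <- lm(y ~ x * class, data = mydata) # per-class slopes & intercepts
anova(fit_common, fit_full) # a significant F supports the separate slopes
```

With the simulated slopes ranging from 0.5 to 2.5 against unit-variance noise, the F-test strongly favours the full model, consistent with the AIC ordering in the panel plot.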
R source code
Click for the complete R source code for this tutorial. Note that
this is automatically generated ("tangled") from the org mode source
file for this document, which adds some extra commands to specify
filenames for plots (and to subsequently close the graphics device).
Note also that if you use 'source' to read in the R code, the ggplot2
plots will not be created, as auto-printing is turned off when using
'source' (see R FAQ 7.22 for more information).
Session information
This page was created with (Emacs) org mode.
- R version
- R version 3.0.1 (2013-05-16), x86_64-unknown-linux-gnu
- locale
- LC_CTYPE=en_GB.UTF-8, LC_NUMERIC=C, LC_TIME=en_GB.UTF-8, LC_COLLATE=en_GB.UTF-8, LC_MONETARY=en_GB.UTF-8, LC_MESSAGES=en_GB.UTF-8, LC_PAPER=C, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_GB.UTF-8, LC_IDENTIFICATION=C
- attached base packages
- stats, graphics, grDevices, utils, datasets, methods, base
- other attached packages
- plyr_1.8, lattice_0.20-15, ascii_2.1, ggplot2_0.9.3.1, MASS_7.3-27, quantreg_4.98, SparseM_0.99
- loaded via a namespace (and not attached)
- colorspace_1.2-2, dichromat_2.0-0, digest_0.6.3, grid_3.0.1, gtable_0.1.2, labeling_0.2, munsell_0.4.2, proto_0.3-10, RColorBrewer_1.0-5, reshape2_1.2.2, scales_0.2.3, stringr_0.6.2
About
This R tutorial is by Alastair Sanderson Ph.D. You can contact me here.