Robust regression using R
Table of Contents
Overview
R provides several methods for robust regression, to handle data with outliers. This tutorial shows how to fit a data set with a large outlier, comparing the results from both standard and robust regressions. This also serves as a comparison of plotting with base graphics vs. ggplot2, and demonstrates the power of using ggplot2 to integrate analysis with visualization.
Using base graphics
Create some data and fit a linear model
It is best to assemble a data frame of x
and y
data, to keep these
two vectors associated in a single object; subsequent fitting &
plotting of the data can then reference this data frame. The use of
'within
' here performs this task without leaving copies of x
and
y
as separate objects. This approach is recommended, to avoid
cluttering up the R workspace with unnecessary objects.
set.seed(123) # allow reproducible random numbers mydata <- within(data.frame(x=1:10), y <- rnorm(x, mean=x)) fm.orig <- lm(y ~ x, data=mydata) # fitted model for original data
Now create an outlier point and refit the model:
mydata$y[2] <- 20 fm.lm <- update(fm.orig)
compare the coefficients of the two models; first the original data:
coef(summary(fm.orig))
Estimate Std. Error t value Pr(>|t|) (Intercept) 0.5254674 0.6672766 0.7874806 4.536973e-01 x 0.9180288 0.1075414 8.5365180 2.728816e-05
now the outlier case:
coef(summary(fm.lm))
Estimate Std. Error t value Pr(>|t|) (Intercept) 6.6021932 3.708910 1.7800898 0.1129350 x 0.1446273 0.597745 0.2419549 0.8149021
The outlier has clearly biased both the slope and the
intercept. Furthermore, the errors on both these parameters are
greatly inflated.
Here is the code to plot the data & best-fit models, using the
standard "base" graphics in R. Note that the 'abline
' function picks
up the 'coefficients
' component from within the fitted model object
and assumes that the first 2 values of this vector are, respectively,
the intercept & gradient of a straight line, which it then adds to
the current plot.
plot(y ~ x, data=mydata) abline(fm.orig, lty="dashed") # use a dashed line abline(fm.lm) legend("topright", inset=0.03, bty="n", legend = c("Fit without outlier", "Fit with outlier"), lty = c("dashed", "solid") )
An outlier point can seriously bias a normal linear regression
Fit robust regression model
Quantile regression
This models the median of y as a function of x, rather than modelling the mean of y as a function of x, in the case of least squares regression.
# install.packages("quantreg") # may be needed if not already installed library("quantreg") # load the required package fm.rq <- rq(y ~ x, data=mydata)
Iteratively re-weighted least squares fit
This down-weights outliers according to how far they are from the
best-fit line, and iteratively re-fits the model until convergence is
achieved (see '?MASS::rlm
' for more information).
library("MASS") # load required package (part of standard R installation) fm.rlm <- rlm(y ~ x, data=mydata)
Plot the models compared to the least squares fit
Now plot all the model lines together for comparison.
plot(y ~ x, data=mydata) abline(fm.orig, lty="dashed") # use a dashed line abline(fm.lm) abline(fm.rq, col="red") abline(fm.rlm, col="blue") legend("topright", inset=0.05, bty="n", legend = c("lm fit to original data", "lm fit", "rq fit", "rlm fit"), lty = c(2, 1, 1, 1), # 1 = "solid" ; 2 = "dashed" col = c("black", "black", "red", "blue") )
Comparison of regression methods using R base graphics
Both the robust regression models succeed in resisting the influence
of the outlier point and capturing the trend in the remaining data.
You can find out more on the CRAN taskview on Robust statistical methods for a comprehensive overview of this topic in R, as well as the 'robust' & 'robustbase' packages.
Using ggplot2
Here the above exercise is repeated with the same data, but using the ggplot2 R package to display the results and run the regressions.
Assemble data frame
An additional column ("type
") is added to the data frame, to enable
automatic labelling of the fit line in the plot legend.
set.seed(123) # allow reproducible random numbers orig <- within(data.frame(x=1:10), { type <- "orig" y <- rnorm(x, mean=x) } ) outlier <- orig outlier$y[2] <- 20 outlier$type <- "outlier"
In a single (wrapped!) line of code, you can plot the data as well as
fit and plot the regression line, with standard error confidence
bounds displayed as a semi-transparent, shaded envelope. ggplot runs
the 'lm
' regression for us, including requesting standard errors on
the predicted values, and plots the results as a line + envelope with
an appropriate legend.
By default, the fitting is applied separately to each specified
"group" of data, which is identified in this case by the different
colours & line types applied to each unique value of the "type
"
column in the supplied data frame (which is itself the concatenation
by row of 2 separate data frames using 'rbind
'):
library("ggplot2") theme_set(theme_grey(base_size = 16)) # increase default font etc. size p <- ggplot(data = rbind(outlier, orig), aes(x, y, colour=type, linetype=type)) + geom_point() # "base" plot, with points only p + geom_smooth(method = "lm") # fit & plot lm model + envelope
The outlier point produces a biased fit with a very large standard error envelope
Now we can also plot the standard error on the 'rlm
' robust
regression fit to the outlier data, which is quite similar to that
obtained when 'lm
' is fitted to the original data. Note that when
using 'rq
' from the quantreg package with 'geom_smooth
', the
plotting of the standard error on the fit does not work, so you need
to use 'se=FALSE
'.
library("MASS") # part of standard R installation p + geom_smooth(method="rlm") # fit & plot *robust* model + envelope
The robust regression closely resembles the fit to the original data without the outlier
Comparison of robust regressions
Now we can reproduce the equivalent plot as before, but using ggplot2,
which does the regressions on the fly. The plotting of standard errors
is not done here ('se=FALSE
'), to avoid cluttering the plot; this
would have to be done for 'rq
' anyway, as noted above.
library("quantreg") # may need 'install.packages("quantreg")' if package not installed ggplot(data=outlier, aes(x, y)) + # use "outlier" data frame geom_point() + # } these all geom_smooth(method="lm", aes(colour="lm"), se=FALSE) + # } automatically geom_smooth(method="rq", aes(colour="rq"), se=FALSE) + # } inherit geom_smooth(method="rlm", aes(colour="rlm"), se=FALSE) + # } "data=outlier" geom_smooth(method="lm", data=orig, aes(colour="lm fit to original data"), se=FALSE) + # use "orig" data frame labs(colour=NULL) # remove legend title
Comparison of regression methods using ggplot2
R source code
Click for the complete R source code for this tutorial. Note that this
is automatically generated ("tangled") from the org mode source file
for this document, which adds some extra commands to specify filenames
for plots (and to subsequently close the graphics device). Note that
if you use 'source
' to read in the R code, the ggplot2 plots will
not be created as auto-printing is turned off when using 'source
'
(see R FAQ 7.22 for more information).
Session information
This page was created with (Emacs) org mode.
- R version
- R version 3.0.1 (2013-05-16), x86_64-unknown-linux-gnu
- locale
- LC_CTYPE=en_GB.UTF-8, LC_NUMERIC=C, LC_TIME=en_GB.UTF-8, LC_COLLATE=en_GB.UTF-8, LC_MONETARY=en_GB.UTF-8, LC_MESSAGES=en_GB.UTF-8, LC_PAPER=C, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_GB.UTF-8, LC_IDENTIFICATION=C
- attached base packages
- stats, graphics, grDevices, utils, datasets, methods, base
- other attached packages
- plyr_1.8, lattice_0.20-15, ascii_2.1, ggplot2_0.9.3.1, MASS_7.3-27, quantreg_4.98, SparseM_0.99
- loaded via a namespace (and not attached)
- colorspace_1.2-2, dichromat_2.0-0, digest_0.6.3, grid_3.0.1, gtable_0.1.2, labeling_0.2, munsell_0.4.2, proto_0.3-10, RColorBrewer_1.0-5, reshape2_1.2.2, scales_0.2.3, stringr_0.6.2
About
This R tutorial is by Alastair Sanderson Ph.D. You can contact me here.