Robust regression using R

UP | HOME

Table of Contents

Overview

R provides several methods for robust regression, to handle data with outliers. This tutorial shows how to fit a data set with a large outlier, comparing the results from both standard and robust regressions. This also serves as a comparison of plotting with base graphics vs. ggplot2, and demonstrates the power of using ggplot2 to integrate analysis with visualization.

Using base graphics

Create some data and fit a linear model

It is best to assemble a data frame of x and y data, to keep these two vectors associated in a single object; subsequent fitting & plotting of the data can then reference this data frame. The use of 'within' here performs this task without leaving copies of x and y as separate objects. This approach is recommended, to avoid cluttering up the R workspace with unnecessary objects.

set.seed(123)                       # allow reproducible random numbers
mydata <- within(data.frame(x=1:10), y <- rnorm(x, mean=x))
fm.orig <- lm(y ~ x, data=mydata)   # fitted model for original data

Now create an outlier point and refit the model:

mydata$y[2] <- 20
fm.lm <- update(fm.orig)

compare the coefficients of the two models; first the original data:

coef(summary(fm.orig))
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 0.5254674  0.6672766 0.7874806 4.536973e-01
x           0.9180288  0.1075414 8.5365180 2.728816e-05

now the outlier case:

coef(summary(fm.lm))
             Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 6.6021932   3.708910 1.7800898 0.1129350
x           0.1446273   0.597745 0.2419549 0.8149021

The outlier has clearly biased both the slope and the intercept. Furthermore, the errors on both these parameters are greatly inflated.

Here is the code to plot the data & best-fit models, using the standard "base" graphics in R. Note that the 'abline' function picks up the 'coefficients' component from within the fitted model object and assumes that the first 2 values of this vector are, respectively, the intercept & gradient of a straight line, which it then adds to the current plot.

plot(y ~ x, data=mydata)
abline(fm.orig, lty="dashed")    # use a dashed line
abline(fm.lm)
legend("topright", inset=0.03, bty="n",
       legend = c("Fit without outlier", "Fit with outlier"),
       lty = c("dashed", "solid")
       )

R base graphics outlier lm regression

An outlier point can seriously bias a normal linear regression

Fit robust regression model

Quantile regression

This models the median of y as a function of x, rather than modelling the mean of y as a function of x, in the case of least squares regression.

# install.packages("quantreg")     # may be needed if not already installed
library("quantreg")                # load the required package
fm.rq <- rq(y ~ x, data=mydata)

Iteratively re-weighted least squares fit

This down-weights outliers according to how far they are from the best-fit line, and iteratively re-fits the model until convergence is achieved (see '?MASS::rlm' for more information).

library("MASS")                   # load required package (part of standard R installation)
fm.rlm <- rlm(y ~ x, data=mydata)

Plot the models compared to the least squares fit

Now plot all the model lines together for comparison.

plot(y ~ x, data=mydata)
abline(fm.orig, lty="dashed")    # use a dashed line
abline(fm.lm)
abline(fm.rq, col="red")
abline(fm.rlm, col="blue")
legend("topright", inset=0.05, bty="n",
       legend = c("lm fit to original data", "lm fit", "rq fit", "rlm fit"),
       lty = c(2, 1, 1, 1),      # 1 = "solid" ; 2 = "dashed"
       col = c("black", "black", "red", "blue")
       )

R base graphics outlier lm, rlm & rq regression

Comparison of regression methods using R base graphics

Both the robust regression models succeed in resisting the influence of the outlier point and capturing the trend in the remaining data.

You can find out more on the CRAN taskview on Robust statistical methods for a comprehensive overview of this topic in R, as well as the 'robust' & 'robustbase' packages.


Using ggplot2

Here the above exercise is repeated with the same data, but using the ggplot2 R package to display the results and run the regressions.

Assemble data frame

An additional column ("type") is added to the data frame, to enable automatic labelling of the fit line in the plot legend.

set.seed(123)          # allow reproducible random numbers
orig <- within(data.frame(x=1:10),
               {
                 type <- "orig"
                 y <- rnorm(x, mean=x)
               }
               )
outlier <- orig
outlier$y[2] <- 20
outlier$type <- "outlier"

In a single (wrapped!) line of code, you can plot the data as well as fit and plot the regression line, with standard error confidence bounds displayed as a semi-transparent, shaded envelope. ggplot runs the 'lm' regression for us, including requesting standard errors on the predicted values, and plots the results as a line + envelope with an appropriate legend.

By default, the fitting is applied separately to each specified "group" of data, which is identified in this case by the different colours & line types applied to each unique value of the "type" column in the supplied data frame (which is itself the concatenation by row of 2 separate data frames using 'rbind'):

library("ggplot2")
theme_set(theme_grey(base_size = 16))  # increase default font etc. size
p <- ggplot(data = rbind(outlier, orig), aes(x, y, colour=type, linetype=type)) +
  geom_point()                         # "base" plot, with points only
p + geom_smooth(method = "lm")         # fit & plot lm model + envelope

ggplot2 outlier lm regression

The outlier point produces a biased fit with a very large standard error envelope

Now we can also plot the standard error on the 'rlm' robust regression fit to the outlier data, which is quite similar to that obtained when 'lm' is fitted to the original data. Note that when using 'rq' from the quantreg package with 'geom_smooth', the plotting of the standard error on the fit does not work, so you need to use 'se=FALSE'.

library("MASS")                # part of standard R installation
p + geom_smooth(method="rlm")  # fit & plot *robust* model + envelope

ggplot2 outlier lm & rlm regression

The robust regression closely resembles the fit to the original data without the outlier

Comparison of robust regressions

Now we can reproduce the equivalent plot as before, but using ggplot2, which does the regressions on the fly. The plotting of standard errors is not done here ('se=FALSE'), to avoid cluttering the plot; this would have to be done for 'rq' anyway, as noted above.

library("quantreg")   # may need 'install.packages("quantreg")' if package not installed
ggplot(data=outlier, aes(x, y)) +                                  # use "outlier" data frame
  geom_point() +                                                   # } these all 
  geom_smooth(method="lm", aes(colour="lm"), se=FALSE) +           # }  automatically
  geom_smooth(method="rq", aes(colour="rq"), se=FALSE) +           # }  inherit 
  geom_smooth(method="rlm", aes(colour="rlm"), se=FALSE) +         # }  "data=outlier"
  geom_smooth(method="lm", data=orig, aes(colour="lm fit to original data"),
              se=FALSE) +     # use "orig" data frame
  labs(colour=NULL)           # remove legend title

ggplot2 outlier lm, rlm & rq regression

Comparison of regression methods using ggplot2


R source code

Click for the complete R source code for this tutorial. Note that this is automatically generated ("tangled") from the org mode source file for this document, which adds some extra commands to specify filenames for plots (and to subsequently close the graphics device). Note that if you use 'source' to read in the R code, the ggplot2 plots will not be created as auto-printing is turned off when using 'source' (see R FAQ 7.22 for more information).


Session information

This page was created with (Emacs) org mode.

R version
R version 3.0.1 (2013-05-16), x86_64-unknown-linux-gnu
locale
LC_CTYPE=en_GB.UTF-8, LC_NUMERIC=C, LC_TIME=en_GB.UTF-8, LC_COLLATE=en_GB.UTF-8, LC_MONETARY=en_GB.UTF-8, LC_MESSAGES=en_GB.UTF-8, LC_PAPER=C, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_GB.UTF-8, LC_IDENTIFICATION=C
attached base packages
stats, graphics, grDevices, utils, datasets, methods, base
other attached packages
plyr_1.8, lattice_0.20-15, ascii_2.1, ggplot2_0.9.3.1, MASS_7.3-27, quantreg_4.98, SparseM_0.99
loaded via a namespace (and not attached)
colorspace_1.2-2, dichromat_2.0-0, digest_0.6.3, grid_3.0.1, gtable_0.1.2, labeling_0.2, munsell_0.4.2, proto_0.3-10, RColorBrewer_1.0-5, reshape2_1.2.2, scales_0.2.3, stringr_0.6.2

About

This R tutorial is by Alastair Sanderson Ph.D. You can contact me here.