gam                   package:mgcv                   R Documentation

_G_e_n_e_r_a_l_i_z_e_d _a_d_d_i_t_i_v_e _m_o_d_e_l_s _w_i_t_h _i_n_t_e_g_r_a_t_e_d _s_m_o_o_t_h_n_e_s_s _e_s_t_i_m_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     Fits a generalized additive model (GAM) to data, the term `GAM'
     being taken to include any quadratically penalized GLM.   The
     degree of smoothness of model terms is estimated as part of
     fitting. 'gam' can also fit any GLM subject to multiple quadratic
     penalties (including  estimation of degree of penalization).
     Isotropic or scale invariant smooths of any number of variables
     are available as model terms, as are linear functionals of such
     smooths. Confidence/credible intervals are readily available for
     any quantity predicted using a fitted model. 'gam' is also
     extendable: users can add smooths.

     Smooth terms are represented using penalized regression splines
     (or similar smoothers) with smoothing parameters selected by
     GCV/UBRE/AIC or by regression splines with fixed degrees of
     freedom (mixtures of the two are permitted). Multi-dimensional
     smooths are  available using penalized thin plate regression
     splines (isotropic) or tensor product splines  (when an isotropic
     smooth is inappropriate). For an overview of the smooths available
     see 'smooth.terms'.  For more on specifying models see
     'gam.models' and 'linear.functional.terms'.  For more on model
     selection see 'gam.selection'. 

     'gam()' is not a clone of what S-PLUS provides: the major
     differences are (i) that by default estimation of the degree of
     smoothness of model terms is part of model fitting, (ii) a
     Bayesian approach to variance estimation is employed that makes
     for easier confidence interval calculation (with good coverage
     probabilities), (iii) that the model can depend on any (bounded)
     linear functional of smooth terms, (iv) that the parametric part
     of the model can be penalized, and (v) that the facilities for
     incorporating smooths of more than one variable are different:
     specifically there are no 'lo' smooths, but instead (a) 's' terms
     can have more than one argument, implying an isotropic smooth and
     (b) 'te' smooths are provided as an effective means for modelling
     smooth interactions of any number of variables via scale invariant
     tensor product smooths. If you want a clone of what S-PLUS
     provides, use 'gam' from package 'gam'.

_U_s_a_g_e:

     gam(formula,family=gaussian(),data=list(),weights=NULL,subset=NULL,
         na.action,offset=NULL,control=gam.control(),method=gam.method(),
         scale=0,knots=NULL,sp=NULL,min.sp=NULL,H=NULL,gamma=1,
         fit=TRUE,paraPen=NULL,G=NULL,in.out,...)

_A_r_g_u_m_e_n_t_s:

 formula: A GAM formula (see 'formula.gam' and also 'gam.models').
          This is exactly like the formula for a GLM except that smooth
          terms, 's' and 'te', can be added to the right hand side to
          specify that the linear predictor depends on smooth functions
          of predictors (or linear functionals of these).

  family: This is a family object specifying the distribution and link
          to use in fitting etc. See 'glm' and 'family' for more
          details. A negative binomial family is provided: see
          'negbin'. 

    data: A data frame or list containing the model response variable
          and  covariates required by the formula. By default the
          variables are taken  from 'environment(formula)': typically
          the environment from  which 'gam' is called.

 weights: prior weights on the data.

  subset: an optional vector specifying a subset of observations to be
          used in the fitting process.

na.action: a function which indicates what should happen when the data
          contain `NA's.  The default is set by the `na.action' setting
          of `options', and is `na.fail' if that is unset.  The
          ``factory-fresh'' default is `na.omit'.

  offset: Can be used to supply a model offset for use in fitting. Note
          that this offset will always be completely ignored when
          predicting, unlike an offset  included in 'formula': this
          conforms to the behaviour of 'lm' and 'glm'.

 control: A list of fit control parameters returned by  'gam.control'.

  method: A list controlling the fitting methods used. This can make a
          difference to computational speed, and, in some cases,
          reliability of convergence: see 'gam.method' for details.

   scale: If this is zero then GCV is used for all distributions except
          Poisson and binomial, where UBRE is used with the scale
          parameter assumed to be 1. If this is positive it is taken
          to be the known scale parameter/variance and UBRE is used.
          If 'scale' is negative GCV is always used, which means that
          the scale parameter will be estimated by GCV and the Pearson
          estimator. For binomial models in particular, it is probably
          worth comparing UBRE and GCV results; for ``over-dispersed
          Poisson'' GCV is probably more appropriate than UBRE.

   knots: this is an optional list containing user specified knot
          values to be used for basis construction.  For most bases the
          user simply supplies the knots to be used, which must match
          up with the 'k' value supplied (note that the number of knots
          is not always just 'k').  See 'tprs' for what happens in the
          '"tp"/"ts"' case.  Different terms can use different numbers
          of knots, unless they share a covariate. 

      sp: A vector of smoothing parameters can be provided here.
          Smoothing parameters must be supplied in the order that the
          smooth terms appear in the model  formula. Negative elements
          indicate that the parameter should be estimated, and hence a
          mixture  of fixed and estimated parameters is possible. If
          smooths share smoothing parameters then 'length(sp)'  must
          correspond to the number of underlying smoothing parameters.

  min.sp: Lower bounds can be supplied for the smoothing parameters.
          Note that if this option is used then the smoothing
          parameters 'full.sp', in the  returned object, will need to
          be added to what is supplied here to get the  smoothing
          parameters actually multiplying the penalties.
          'length(min.sp)' should  always be the same as the total
          number of penalties (so it may be longer than 'sp', if
          smooths share smoothing parameters).

       H: A user supplied fixed quadratic penalty on the parameters of
          the  GAM can be supplied, with this as its coefficient
          matrix. A common use of this term is  to add a ridge penalty
          to the parameters of the GAM in circumstances in which the
          model is close to un-identifiable on the scale of the linear
          predictor, but perfectly well defined on the response scale.

   gamma: It is sometimes useful to inflate the model degrees of 
          freedom in the GCV or UBRE/AIC score by a constant
          multiplier. This allows  such a multiplier to be supplied. 

     fit: If this argument is 'TRUE' then 'gam' sets up the model and
          fits it, but if it is 'FALSE' then the model is set up and an
          object 'G' containing what would be required to fit it is
          returned. See argument 'G'.

 paraPen: optional list specifying any penalties to be applied to
          parametric model terms.  'gam.models' explains more.

       G: Usually 'NULL', but may contain the object returned by a
          previous call to 'gam' with  'fit=FALSE', in which case all
          other arguments are ignored except for 'gamma', 'in.out',
          'control', 'method' and 'fit'.

  in.out: optional list for initializing outer iteration. If supplied
          then this must contain two elements: 'sp' should be an array
          of initialization values for all smoothing parameters (there
          must be a value for all smoothing parameters, whether fixed
          or to be estimated, but those for fixed s.p.s are not used);
          'scale' is the typical scale of the GCV/UBRE function, for
          passing to the outer optimizer.

     ...: further arguments for  passing on e.g. to 'gam.fit' (such as
          'mustart'). 

_D_e_t_a_i_l_s:

     A generalized additive model (GAM) is a generalized linear model
     (GLM) in which the linear  predictor is given by a user specified
     sum of smooth functions of the covariates plus a  conventional
     parametric component of the linear predictor. A simple example is:

                   log(E(y_i))=f_1(x_1i)+f_2(x_2i)

     where the (independent) response variables y_i~Poi, and f_1 and
     f_2 are smooth functions of covariates x_1 and  x_2. The log is an
     example of a link function. 
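
     As a minimal sketch of the model above, the following simulates
     Poisson responses whose log mean is the sum of two smooth
     functions and fits the model with 'gam'. The functions 'f1' and
     'f2', the sample size and the seed are assumptions made purely
     for illustration.

```r
library(mgcv)

set.seed(1)
n  <- 300
x1 <- runif(n)
x2 <- runif(n)

## hypothetical "true" smooth functions, chosen only for illustration
f1 <- function(x) sin(2 * pi * x)
f2 <- function(x) 2 * (x - 0.5)^2

eta <- f1(x1) + f2(x2)              # linear predictor: log(E(y_i))
y   <- rpois(n, lambda = exp(eta))  # y_i ~ Poisson with mean exp(eta_i)
d   <- data.frame(y = y, x1 = x1, x2 = x2)

## log(E(y_i)) = f_1(x_1i) + f_2(x_2i), with f_1 and f_2 estimated
b <- gam(y ~ s(x1) + s(x2), family = poisson(link = "log"), data = d)

mu_hat  <- fitted(b)                               # estimates of E(y_i)
eta_hat <- as.numeric(predict(b, type = "link"))   # estimates of log(E(y_i))
```

     Here 'exp(eta_hat)' reproduces 'mu_hat', which is the sense in
     which the log acts as the link between the linear predictor and
     the mean.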

     If absolutely any smooth functions were allowed in model fitting
     then maximum likelihood  estimation of such models would
     invariably result in complex overfitting estimates of  f_1  and
     f_2. For this reason the models are usually fit by  penalized
     likelihood  maximization, in which the model (negative log)
     likelihood is modified by the addition of  a penalty for each
     smooth function, penalizing its `wiggliness'. To control the
     tradeoff  between penalizing wiggliness and penalizing badness of
     fit each penalty is multiplied by  an associated smoothing
     parameter: how to estimate these parameters, and  how to
     practically represent the smooth functions are the main
     statistical questions  introduced by moving from GLMs to GAMs. 
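
     The tradeoff controlled by a smoothing parameter can be seen by
     fixing it at two extreme values via the 'sp' argument and
     comparing the effective degrees of freedom of the fits. This is
     a sketch only: the simulated data and the two 'sp' values are
     arbitrary assumptions.

```r
library(mgcv)

set.seed(6)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)   # illustrative data only

## tiny smoothing parameter: wiggliness is almost unpenalized
b_wiggly <- gam(y ~ s(x), sp = 1e-6)
## huge smoothing parameter: wiggliness is penalized very heavily
b_smooth <- gam(y ~ s(x), sp = 1e6)

## total effective degrees of freedom (intercept included)
edf_wiggly <- sum(b_wiggly$edf)
edf_smooth <- sum(b_smooth$edf)
c(wiggly = edf_wiggly, smooth = edf_smooth)
```

     With the heavy penalty the smooth is shrunk towards the functions
     that the penalty treats as having no wiggliness, so the fit uses
     far fewer effective degrees of freedom than the lightly penalized
     one.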

     The 'mgcv' implementation of 'gam' represents the smooth functions
     using penalized regression splines, and by default uses basis
     functions for these splines that are designed to be optimal,
     given the number of basis functions used. The smooth terms can be
     functions of any number of covariates and the user has some
     control over how smoothness of the functions is measured.

     'gam' in 'mgcv' solves the smoothing parameter estimation problem
     by using the  Generalized Cross Validation (GCV) criterion

                           n D/(n - DoF)^2

     or an Un-Biased Risk Estimator (UBRE) criterion

                         D/n + 2 s DoF/n - s

     where D is the deviance, n the number of data, s the scale
     parameter and  DoF the effective degrees of freedom of the model.
     Notice that UBRE is effectively just AIC rescaled, but is only
     used when s is known. An alternative is GACV (see
     'gam.method'). Smoothing parameters are chosen to minimize the
     GCV or UBRE/AIC score for the model, and the main computational
     challenge solved  by the 'mgcv' package is to do this efficiently
     and reliably. Various alternative numerical methods are provided:
     see 'gam.method'.
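
     The two criteria are simple functions of the deviance and the
     effective degrees of freedom. The helper names 'gcv_score' and
     'ubre_score' and the numbers below are hypothetical and serve
     only to make the formulae concrete; they are not part of 'mgcv'.

```r
## D: deviance, n: number of data, dof: effective degrees of freedom,
## s: (known) scale parameter
gcv_score  <- function(D, n, dof) n * D / (n - dof)^2
ubre_score <- function(D, n, dof, s = 1) D / n + 2 * s * dof / n - s

## two made-up candidate fits to n = 100 data: the second fits more
## closely (smaller deviance) but uses many more degrees of freedom
gcv_simple  <- gcv_score(D = 120, n = 100, dof = 5)
gcv_complex <- gcv_score(D = 110, n = 100, dof = 20)
c(simple = gcv_simple, complex = gcv_complex)

ubre_simple <- ubre_score(D = 120, n = 100, dof = 5, s = 1)
```

     In this made-up comparison the simpler fit has the lower GCV
     score, so it would be preferred: the small reduction in deviance
     does not justify the extra degrees of freedom.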

     Broadly 'gam' works by first constructing basis functions and one
     or more quadratic penalty  coefficient matrices for each smooth
     term in the model formula, obtaining a model matrix for  the
     strictly parametric part of the model formula, and combining these
     to obtain a  complete model matrix (/design matrix) and a set of
     penalty matrices for the smooth terms.  Some linear
     identifiability constraints are also obtained at this point. The
     model is  fit using 'gam.fit', a modification of 'glm.fit'. The
     GAM  penalized likelihood maximization problem is solved by
     Penalized Iteratively  Reweighted  Least Squares (P-IRLS) (see
     e.g. Wood 2000).  Smoothing parameter selection is integrated in
     one of two ways. (i) `Performance iteration' uses the fact that at
     each P-IRLS iteration a penalized weighted least squares problem
     is solved, and the smoothing parameters of that problem can be
     estimated by GCV or UBRE. Eventually, in most cases, both model
     parameter estimates and smoothing parameter estimates converge.
     (ii) Alternatively the P-IRLS scheme is iterated to convergence
     for each trial set of smoothing parameters, and GCV or UBRE scores
     are only evaluated on convergence - optimization is then `outer'
     to the P-IRLS loop: in this case the P-IRLS iteration has to be
     differentiated, to facilitate optimization, and 'gam.fit3' is used
     in place of 'gam.fit'. The default is the second method, outer
     iteration.
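
     The P-IRLS idea can be sketched in a few lines of base R for a
     Poisson model with log link and a single, fixed smoothing
     parameter. This is an illustration under stated assumptions, not
     the code used by 'gam.fit': the function 'pirls_poisson', the
     toy polynomial basis and the ridge penalty 'S' are all
     hypothetical, and no smoothing parameter selection is performed.

```r
## Sketch only: P-IRLS for a Poisson log-link model with a FIXED
## smoothing parameter 'lambda' and penalty matrix 'S'.
pirls_poisson <- function(X, y, S, lambda, tol = 1e-10, maxit = 100) {
  ## start from the intercept-only fit (assumes column 1 of X is all ones)
  beta <- c(log(mean(y)), rep(0, ncol(X) - 1))
  for (iter in seq_len(maxit)) {
    eta <- drop(X %*% beta)        # linear predictor
    mu  <- exp(eta)                # inverse of the log link
    z   <- eta + (y - mu) / mu     # working (pseudo) response
    w   <- mu                      # working weights for Poisson/log
    ## penalized weighted least squares step:
    ## minimize sum(w * (z - X beta)^2) + lambda * beta' S beta
    beta_new <- drop(solve(crossprod(X, w * X) + lambda * S,
                           crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol * (1 + max(abs(beta)))
    beta <- beta_new
    if (converged) break
  }
  beta
}

set.seed(2)
x <- runif(200)
X <- cbind(1, x, x^2)              # toy basis: intercept, linear, quadratic
y <- rpois(200, exp(1 + x))
S <- diag(c(0, 1, 1))              # ridge penalty leaving the intercept free

beta_unpen <- pirls_poisson(X, y, S, lambda = 0)    # ordinary GLM fit
beta_pen   <- pirls_poisson(X, y, S, lambda = 1e4)  # heavily penalized fit
```

     In these terms, performance iteration would re-estimate 'lambda'
     by GCV or UBRE inside the loop at each weighted least squares
     step, whereas outer iteration would run the whole loop to
     convergence for each trial 'lambda' and optimize the resulting
     score.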

     Several alternative basis-penalty types  are built in for
     representing model smooths, but alternatives can easily be added
     (see 'smooth.terms'  for an overview and 'smooth.construct' for
     how to add smooth classes). In practice the default basis is
     usually the best choice, but the choice of the basis dimension
     ('k' in the 's' and 'te' terms) is something that should be
     considered carefully (the exact value is not critical, but it is
     important not to make it restrictively small, nor very large and
     computationally costly). The basis should be chosen to be larger
     than is believed to be necessary to approximate the smooth
     function concerned. The effective degrees of freedom for the
     smooth will then be controlled by the smoothing penalty on the
     term, and (usually) selected automatically (with an upper limit
     set by 'k-1' or occasionally 'k'). Of course 'k' should not
     be made too large, or computation will be slow (or in extreme
     cases there will be more coefficients to estimate than there are
     data).
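
     The role of 'k' as an upper limit on the effective degrees of
     freedom can be checked directly. The sketch below uses data from
     'gamSim' and two arbitrary basis dimensions, 4 and 20, chosen
     only for illustration.

```r
library(mgcv)

set.seed(3)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)

b_small <- gam(y ~ s(x2, k = 4),  data = dat)  # restrictively small basis
b_big   <- gam(y ~ s(x2, k = 20), data = dat)  # generous basis

## effective degrees of freedom of the smooth (total minus the intercept)
edf_small <- sum(b_small$edf) - 1
edf_big   <- sum(b_big$edf) - 1
c(k4 = edf_small, k20 = edf_big)
```

     The first value cannot exceed 3 and the second cannot exceed 19,
     reflecting the 'k-1' limit. If the estimated value sits close to
     its limit, as is likely for 'k=4' here, that suggests 'k' is too
     small; 'gam.check' can help with such checks.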

     Note that 'gam' assumes a very inclusive definition of what counts
     as a GAM: basically any penalized GLM can be used. To this end
     'gam' allows the non-smooth model components to be penalized via
     argument 'paraPen' and allows the linear predictor to depend on
     general linear functionals of smooths, via the summation
     convention mechanism described in  'linear.functional.terms'.
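
     As a hedged sketch of penalizing a parametric component, the
     following applies a ridge penalty to the coefficients of a matrix
     term 'X' through 'paraPen'. It assumes the form described in
     'gam.models', in which 'paraPen' is a list whose elements are
     named after the terms they penalize and each contain one or more
     penalty matrices. The data, the matrix 'X' and the identity
     penalty are illustrative assumptions.

```r
library(mgcv)

set.seed(4)
n  <- 200
X  <- matrix(rnorm(n * 5), n, 5)     # five parametric covariates
x0 <- runif(n)
y  <- drop(X %*% c(1, 0.5, 0, 0, 0)) + sin(2 * pi * x0) + rnorm(n)

## ridge penalty on the 5 coefficients of term 'X'; its smoothing
## parameter is estimated along with that of s(x0)
PP <- list(X = list(diag(5)))

b_pen   <- gam(y ~ X + s(x0), paraPen = PP)  # penalized parametric part
b_unpen <- gam(y ~ X + s(x0))                # unpenalized, for comparison

## compare the coefficients of the parametric term
cbind(penalized = coef(b_pen)[2:6], unpenalized = coef(b_unpen)[2:6])
```

     Both models have the same coefficients; only the penalty applied
     to those of 'X' differs.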

     Details of the default underlying fitting methods are given in
     Wood (2004 and 2008). Some alternative methods are discussed in
     Wood (2000 and 2006).

_V_a_l_u_e:

     If 'fit=FALSE' the function returns a list 'G' of items needed to
     fit a GAM, but doesn't actually fit it. 

     Otherwise the function returns an object of class '"gam"' as
     described in 'gamObject'.

_W_A_R_N_I_N_G_S:

     You must have more unique combinations of covariates than the
     model has total parameters. (The total number of parameters is
     the sum of the basis dimensions plus the number of non-spline
     terms, less the number of spline terms.)
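
     This condition can be checked after a fit by comparing the number
     of distinct covariate combinations with the number of model
     coefficients. The sketch below uses simulated data from 'gamSim';
     the variable names 'n_unique' and 'n_par' are illustrative.

```r
library(mgcv)

set.seed(5)
dat <- gamSim(1, n = 200, dist = "normal", scale = 2)
b   <- gam(y ~ s(x0) + s(x1), data = dat)

## number of distinct (x0, x1) combinations in the data
n_unique <- nrow(unique(dat[, c("x0", "x1")]))

## total parameters: two default basis dimensions of 10, plus one
## non-spline term (the intercept), less two spline terms = 19
n_par <- length(coef(b))

n_unique > n_par   # should be TRUE for the model to be estimable
```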

     Automatic smoothing parameter selection is not likely to work well
     when  fitting models to very few response data.

     For data with many  zeroes clustered together in the covariate
     space it is quite easy to set up  GAMs which suffer from
     identifiability problems, particularly when using Poisson or
     binomial families. The problem is that with e.g. log or logit
     links, mean value zero corresponds to an infinite range on the
     linear predictor scale.

_A_u_t_h_o_r(_s):

     Simon N. Wood simon.wood@r-project.org

     Front end design inspired by the S function of the same name based
     on the work of Hastie and Tibshirani (1990). Underlying methods
     owe much to the work of Wahba (e.g. 1990) and Gu (e.g. 2002).

_R_e_f_e_r_e_n_c_e_s:

     Key References on this implementation:

     Wood, S.N. (2004) Stable and efficient multiple smoothing
     parameter estimation for generalized additive models. J. Amer.
     Statist. Ass. 99:673-686. [Default method for additive case (but
     no longer for generalized)]

     Wood, S.N. (2008) Fast stable direct fitting and smoothness
     selection for generalized additive models. J.R.Statist.Soc.B
     70(3):495-518 - [Default method for generalized additive model
     case]

     Wood, S.N. (2003) Thin plate regression splines. J.R.Statist.Soc.B
     65(1):95-114

     Wood, S.N. (2006a) Low rank scale invariant tensor product smooths
     for generalized additive mixed models. Biometrics 62(4):1025-1036

     Wood S.N. (2006b) Generalized Additive Models: An Introduction
     with R. Chapman and Hall/CRC Press.

     Wood, S.N. (2006c) On confidence intervals for generalized
     additive models based on penalized regression splines. Australian
     and New Zealand Journal of Statistics. 48(4): 445-464.

     Key References on GAMs and related models:

     Hastie (1993) in Chambers and Hastie (1993) Statistical Models in
     S. Chapman and Hall.

     Hastie and Tibshirani (1990) Generalized Additive Models. Chapman
     and Hall.

     Wahba (1990) Spline Models of Observational Data. SIAM 

     Wood, S.N. (2000)  Modelling and Smoothing Parameter Estimation
     with Multiple Quadratic Penalties. J.R.Statist.Soc.B 62(2):413-428
     [The original mgcv paper, but no longer the default methods.]

     Background References:

     Green and Silverman (1994) Nonparametric Regression and
     Generalized  Linear Models. Chapman and Hall.

     Gu and Wahba (1991) Minimizing GCV/GML scores with multiple
     smoothing parameters via the Newton method. SIAM J. Sci. Statist.
     Comput. 12:383-398

     Gu (2002) Smoothing Spline ANOVA Models, Springer.

     O'Sullivan, Yandell and Raynor (1986) Automatic smoothing of
     regression functions in generalized linear models. J. Am.
     Statist. Ass. 81:96-103

     Wood (2001) mgcv: GAMs and Generalized Ridge Regression for R. R
     News 1(2):20-25

     Wood and Augustin (2002) GAMs with integrated model selection
     using penalized regression splines and applications  to
     environmental modelling. Ecological Modelling 157:157-177

     <URL: http://www.maths.bath.ac.uk/~sw283/>

_S_e_e _A_l_s_o:

     'mgcv-package', 'gamObject', 'gam.models', 'smooth.terms',
     'linear.functional.terms', 's', 'te', 'predict.gam', 'plot.gam',
     'summary.gam', 'gam.side', 'gam.selection', 'mgcv', 'gam.control',
     'gam.check', 'negbin', 'magic', 'vis.gam'

_E_x_a_m_p_l_e_s:

     library(mgcv)
     set.seed(0) ## simulate some data... 
     dat <- gamSim(1,n=400,dist="normal",scale=2)
     b<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
     summary(b)
     plot(b,pages=1,residuals=TRUE)
     plot(b,pages=1,seWithMean=TRUE)
     ## same fit in two parts .....
     G<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),fit=FALSE,data=dat)
     b<-gam(G=G)
     print(b)

     ## set the smoothing parameter for the first term, estimate rest ...
     bp<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),sp=c(0.01,-1,-1,-1),data=dat)
     plot(bp,pages=1)
     ## alternatively...
     bp <- gam(y~s(x0,sp=.01)+s(x1)+s(x2)+s(x3),data=dat)

     ## set lower bounds on smoothing parameters ....
     bp<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),
             min.sp=c(0.001,0.01,0,10),data=dat) 
     print(b);print(bp)

     ## now a GAM with 3df regression spline term & 2 penalized terms
     b0<-gam(y~s(x0,k=4,fx=TRUE,bs="tp")+s(x1,k=12)+s(x2,k=15),data=dat)
     plot(b0,pages=1)

     ## now fit a 2-d term to x0,x1
     b1<-gam(y~s(x0,x1)+s(x2)+s(x3),data=dat)
     par(mfrow=c(2,2))
     plot(b1)


     par(mfrow=c(1,1))
     ## now simulate poisson data...
     dat <- gamSim(1,n=400,dist="poisson",scale=.25)

     b2<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=poisson,data=dat)
     plot(b2,pages=1)

     ## repeat fit using performance iteration
     gm <- gam.method(gam="perf")
     b3<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=poisson,
             data=dat,method=gm)
     plot(b3,pages=1)

     ## repeat using GACV as in Wood 2008...

     gm <- gam.method(gcv="GACV")
     b4<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=poisson,
             data=dat,method=gm,scale=-1)
     plot(b4,pages=1)

      

     ## a binary example 
     dat <- gamSim(1,n=400,dist="binary",scale=.33)

     lr.fit <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=binomial,data=dat)
     ## plot model components with truth overlaid in red
     op <- par(mfrow=c(2,2))
     fn <- c("f0","f1","f2","f3");xn <- c("x0","x1","x2","x3")
     for (k in 1:4) {
       plot(lr.fit,residuals=TRUE,select=k)
       ff <- dat[[fn[k]]];xx <- dat[[xn[k]]]
       ind <- sort.int(xx,index.return=TRUE)$ix
       lines(xx[ind],(ff-mean(ff))[ind]*.33,col=2)
     }
     par(op)
     anova(lr.fit)
     lr.fit1 <- gam(y~s(x0)+s(x1)+s(x2),family=binomial,data=dat)
     lr.fit2 <- gam(y~s(x1)+s(x2),family=binomial,data=dat)
     AIC(lr.fit,lr.fit1,lr.fit2)

     ## now a 2D smoothing example...

     eg <- gamSim(2,n=500,scale=.1)
     attach(eg)

     op <- par(mfrow=c(2,2),mar=c(4,4,1,1))

     contour(truth$x,truth$z,truth$f) ## contour truth
     b4 <- gam(y~s(x,z),data=data) ## fit model
     fit1 <- matrix(predict.gam(b4,pr,se=FALSE),40,40)
     contour(truth$x,truth$z,fit1)   ## contour fit
     persp(truth$x,truth$z,truth$f)    ## persp truth
     vis.gam(b4)                     ## persp fit
     detach(eg)
     par(op)

     ## very large dataset example with user defined knots
     par(mfrow=c(1,1))
     eg <- gamSim(2,n=10000,scale=.5)
     attach(eg)

     ind<-sample(1:10000,1000,replace=FALSE)
     b5<-gam(y~s(x,z,k=50),data=data,knots=list(x=data$x[ind],z=data$z[ind]))
     vis.gam(b5)

     ## and a pure "knot based" spline of the same data
     b6<-gam(y~s(x,z,k=100),data=data,knots=list(x= rep((1:10-0.5)/10,10),
             z=rep((1:10-0.5)/10,rep(10,10))))
     vis.gam(b6,color="heat")

     ## varying the default large dataset behaviour via `xt'
     b7 <- gam(y~s(x,z,k=50,xt=list(max.knots=1000,seed=2)),data=data)
     vis.gam(b7)
     detach(eg)

