Title: | Gamma Lasso Regression |
Description: | The gamma lasso algorithm provides regularization paths corresponding to a range of non-convex cost functions between L0 and L1 norms. As much as possible, usage for this package is analogous to that for the glmnet package (which does the same thing for penalization between L1 and L2 norms). For details see: Taddy (2017 JCGS), 'One-Step Estimator Paths for Concave Regularization', <arXiv:1308.5623>. |
Authors: | Matt Taddy <[email protected]> |
Maintainer: | Matt Taddy <[email protected]> |
License: | GPL-3 |
Version: | 1.13-8 |
Built: | 2025-01-05 02:46:09 UTC |
Source: | https://github.com/taddylab/gamlr |
Corrected AIC calculation.
AICc(object, k=2)
object |
Some model object that you can call logLik on (e.g., a gamlr or glm fit). |
k |
The usual AIC complexity penalty; defaults to 2. |
This works just like the usual AIC, but instead calculates the small-sample (or high-dimensional) corrected version from Hurvich and Tsai (1989): AICc = -2*logLHD + k*df*n/(n - df - 1), where df is the model degrees of freedom and n is the number of observations.
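A minimal sketch of the arithmetic, assuming fit is any fitted model with a valid logLik method (e.g., a glm); the function handles this bookkeeping itself:

ll <- logLik(fit)
df <- attr(ll, "df")
n <- attr(ll, "nobs")
-2*as.numeric(ll) + 2*df*n/(n - df - 1) # should match AICc(fit, k=2)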
A numeric value for every model evaluated.
Matt Taddy [email protected]
Hurvich, C. M. and C-L Tsai, 1989. "Regression and Time Series Model Selection in Small Samples", Biometrika 76.
gamlr, hockey
Cross validation for gamma lasso penalty selection.
cv.gamlr(x, y, nfold=5, foldid=NULL, verb=FALSE, cl=NULL, ...)

## S3 method for class 'cv.gamlr'
plot(x, select=TRUE, df=TRUE, ...)

## S3 method for class 'cv.gamlr'
coef(object, select=c("1se","min"), ...)

## S3 method for class 'cv.gamlr'
predict(object, newdata, select=c("1se","min"), ...)
x |
Covariates; see gamlr. |
y |
Response; see gamlr. |
nfold |
The number of cross validation folds. |
foldid |
An optional length-n vector of fold memberships for each observation. If specified, this dictates nfold. |
verb |
Whether to print progress through folds. |
cl |
A possible parallel library cluster; if not NULL, the CV folds are computed in parallel. |
... |
Arguments to gamlr. |
object |
A fitted cv.gamlr object. |
newdata |
New x data for prediction; see gamlr. |
select |
In prediction and coefficient extraction, select which "best" model to return: "min" is the segment minimizing mean OOS deviance, while the default "1se" is the largest penalty whose mean OOS deviance is within one standard error of that minimum. |
df |
Whether to add to the plot degrees of freedom along the top axis. |
Fits a gamlr regression to the full dataset, and then performs nfold cross validation to evaluate out-of-sample (OOS) performance for different penalty weights. plot.cv.gamlr can be used to plot the results: it shows mean OOS deviance with 1se error bars.
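The two selection rules can be recovered by hand from the returned cvm and cvs; a sketch, assuming cvfit is a fitted cv.gamlr object (exact tie-breaking may differ from the package internals):

segmin <- which.min(cvfit$cvm) # the "min" rule
cut1se <- cvfit$cvm[segmin] + cvfit$cvs[segmin] # one standard error above the minimum
seg1se <- min(which(cvfit$cvm <= cut1se)) # largest penalty under the cut
c(cvfit$seg.min, segmin, cvfit$seg.1se, seg1se) # should agree pairwise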
gamlr |
The full-data fitted gamlr object. |
nfold |
The number of CV folds. |
foldid |
The length-n vector of fold memberships. |
cvm |
Mean OOS deviance by gamlr$lambda. |
cvs |
The standard errors on cvm. |
seg.min |
The index of minimum cvm. |
seg.1se |
The index of the 1se rule segment (the largest penalty with cvm within one standard error of the minimum). |
lambda.min |
Penalty at minimum cvm. |
lambda.1se |
Penalty at the 1se segment. |
Matt Taddy [email protected]
Taddy (2017 JCGS), One-Step Estimator Paths for Concave Regularization, http://arxiv.org/abs/1308.5623
gamlr, hockey
n <- 100
p <- 100
xvar <- matrix(ncol=p, nrow=p)
for(i in 1:p) for(j in i:p) xvar[i,j] <- 0.5^{abs(i-j)}
x <- matrix(rnorm(p*n), nrow=n) %*% chol(xvar)
beta <- matrix((-1)^(1:p) * exp(-(1:p)/10))
mu <- x %*% beta
y <- mu + rnorm(n, sd=sd(as.vector(mu))/2)

## fit with gamma=1 concavity
cvfit <- cv.gamlr(x, y, gamma=1, verb=TRUE)

coef(cvfit)[1:3,] # 1se default
coef(cvfit, select="min")[1:3,] # min OOS deviance

predict(cvfit, x[1:2,], select="min")
predict(cvfit$gamlr, x[1:2,], select=cvfit$seg.min)

par(mfrow=c(1,2))
plot(cvfit)
plot(cvfit$gamlr)
Double/debiased Machine Learning for treatment effect estimation.
doubleML(x, d, y, nfold=2, foldid=NULL, family="gaussian", cl=NULL, ...)
x |
Covariates; see gamlr. |
d |
The matrix of treatment variables. Each column is used as a response by gamlr. |
y |
Response; see gamlr. |
nfold |
The number of cross validation folds. |
foldid |
An optional length-n vector of fold memberships for each observation. If specified, this dictates nfold. |
family |
Response model type for the treatment prediction; either "gaussian", "poisson", or "binomial". This can either be a single family shared by all columns of d, or a vector of families of length ncol(d). |
cl |
A possible parallel library cluster; if not NULL, the fold computations run in parallel. |
... |
Arguments to all the gamlr regressions. |
Performs the double ML procedure of Chernozhukov et al. (2017) to produce an unbiased estimate of the average linear treatment effects of d on y. This procedure uses gamlr to regress y and each column of d onto x. In the cross-fitting routine described in Taddy (2019), these regressions are trained on a portion of the data and the out-of-sample residuals are calculated on the left-out fold. Model selection for these residualization steps is based on the AICc selection rule. The response residuals are then regressed onto the treatment residuals using lm, and the resulting estimates and standard errors are unbiased for the treatment effects under the assumptions of Chernozhukov et al.
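A minimal sketch of the cross-fitting logic for a single treatment column and two folds, assuming x, d, y are as above and the default gaussian family (the function automates this over nfold folds and multiple treatments, with AICc selection as the predict.gamlr default):

fold <- sample(rep(1:2, length.out=length(y)))
dres <- yres <- rep(NA, length(y))
for(k in 1:2){
  train <- fold != k
  fitd <- gamlr(x[train,], d[train]) # treatment regression
  fity <- gamlr(x[train,], y[train]) # response regression
  dres[!train] <- d[!train] - drop(as.matrix(predict(fitd, x[!train,])))
  yres[!train] <- y[!train] - drop(as.matrix(predict(fity, x[!train,])))
}
summary(lm(yres ~ dres)) # treatment effect estimate and standard error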
A fitted lm object estimating the treatment effect of d on y. The lm function has been called with x=TRUE, y=TRUE, such that this object contains the residualized d as x and the residualized y as y.
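Because the residualized data are stored on the returned lm fit, standard lm tooling applies directly; for example (with dml a hypothetical doubleML fit):

dml <- doubleML(x=x, d=d, y=y) # inputs as described above
summary(dml)$coefficients # effect estimates with standard errors
head(cbind(dml$x, dml$y)) # stored residualized treatments and response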
Matt Taddy [email protected]
Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2017), "Double/debiased machine learning for treatment and structural parameters", The Econometrics Journal.
Matt Taddy, 2019. Business Data Science, McGraw-Hill
gamlr, hockey, AICc
data(hockey)
who <- which(colnames(player)=="SIDNEY_CROSBY")
s <- sample.int(nrow(player), 10000) # subsample for a fast example
doubleML(x=player[s,-who], d=player[s,who], y=goal$homegoal[s], standardize=FALSE)
Adaptive L1 penalized regression estimation.
gamlr(x, y, family=c("gaussian","binomial","poisson"),
  gamma=0, nlambda=100, lambda.start=Inf, lambda.min.ratio=0.01,
  free=NULL, standardize=TRUE, obsweight=NULL, varweight=NULL,
  prexx=(p<500), tol=1e-7, maxit=1e5, verb=FALSE, ...)

## S3 method for class 'gamlr'
plot(x, against=c("pen","dev"), col=NULL, select=TRUE, df=TRUE, ...)

## S3 method for class 'gamlr'
coef(object, select=NULL, k=2, corrected=TRUE, ...)

## S3 method for class 'gamlr'
predict(object, newdata, type = c("link", "response"), ...)

## S3 method for class 'gamlr'
logLik(object, ...)
x |
A dense matrix or sparse Matrix of covariates, with ncol(x) variables and nrow(x) equal to length(y). |
y |
A vector of response values. There is almost no argument checking, so be careful to match y with the correct x. |
family |
Response model type; either "gaussian", "poisson", or "binomial". Note that for "binomial", y is in [0,1]. |
gamma |
Penalty concavity tuning parameter; see details. Zero (default) yields the lasso, and higher values correspond to a more concave penalty. |
nlambda |
Number of regularization path segments. |
lambda.start |
Initial penalty value. Default of Inf implies the infimum lambda that returns all zero coefficients. |
lambda.min.ratio |
The smallest penalty weight (expected L1 cost) as a ratio of the path start value. Our default is always 0.01; note that this differs from glmnet, whose default depends upon the dimension of x. |
free |
Free variables: indices of the columns of x which will be unpenalized. |
standardize |
Whether to standardize the coefficients to have standard deviation of one. This is equivalent to multiplying the L1 penalty by each coefficient standard deviation. |
obsweight |
For family="gaussian": optional weights on each observation's squared-error loss. For other families, observation weights are instead set by the IRLS algorithm. |
varweight |
Multipliers on the penalty associated with each covariate coefficient. Must be non-negative. These are further multiplied by the adaptive weights omega in the gamma lasso updates (see details). |
prexx |
Only possible for family="gaussian": whether to use pre-calculated weighted variable covariances in gradient calculations. This speeds up big-n computation, but is only practical for small enough p; see the note below. |
tol |
Optimization convergence tolerance relative to the null model deviance for each inner coordinate-descent loop. This is measured against the maximum coordinate change times deviance curvature after full parameter-set update. |
maxit |
Max iterations for a single segment coordinate descent routine. |
verb |
Whether to print some output for each path segment. |
object |
A gamlr object. |
against |
Whether to plot paths against log penalty or deviance. |
select |
In plot, whether to mark the AICc-selected segment. In coef and predict, the index of the path segment to use; the NULL default applies AICc selection. |
k |
If select=NULL in coef or predict, the AIC complexity penalty used in selection; defaults to 2. |
corrected |
A flag that swaps the corrected (for high dimensional bias) AICc in for the standard AIC when select=NULL. Defaults to TRUE. |
newdata |
New x data for prediction, in the same format as the original x. |
type |
Either "link" for the linear equation, or "response" for predictions transformed to the same domain as y. |
col |
A single plot color, or a vector of length ncol(x) giving a color for each coefficient path. |
df |
Whether to add to the plot degrees of freedom along the top axis. |
... |
Extra arguments to each method. Most importantly, from predict.gamlr these are arguments to coef.gamlr. |
Finds posterior modes along a regularization path of adapted L1 penalties via coordinate descent.

Each path segment t minimizes the objective -(phi/n)*logLHD(beta_1 ... beta_p) + lambda_t*sum_j omega_jt*|beta_j|, where phi is the exponential family dispersion parameter (sigma^2 for family="gaussian", one otherwise). Weights are set as omega_jt = 1/(1 + gamma*|b_j^{t-1}|), where b_j^{t-1} is our estimate of beta_j for the previous path segment (or zero if t=1). This adaptation is what makes the penalization 'concave'; see Taddy (2017) for details.
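To make the adaptation concrete, here is the weight update in isolation (an illustrative sketch; b_prev stands in for the previous segment's coefficient estimates):

gamma <- 2
b_prev <- c(0, 0.1, 1, 5) # illustrative previous-segment estimates
omega <- 1/(1 + gamma*abs(b_prev)) # larger signals get smaller penalty weights
round(omega, 3) # 1.000 0.833 0.333 0.091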
plot.gamlr can be used to graph the results: it shows the regularization paths for the penalized beta, with degrees of freedom along the top axis and minimum AICc selection marked.

logLik.gamlr returns log likelihood along the regularization path. It is based on the deviance, and is correct only up to static constants; e.g., for a Poisson it is off by the saturated log likelihood, and for a Gaussian it is off by additive likelihood constants that do not depend on beta.
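Since logLik.gamlr returns one value per path segment, information criteria apply along the whole path; for example (with fit a gamlr object):

ll <- logLik(fit) # one log likelihood value per path segment
ic <- AICc(fit) # corrected AIC along the path
which.min(ic) # the segment that coef(fit) selects by default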
lambda |
The path of fitted prior expected L1 penalties. |
nobs |
The number of observations. |
alpha |
Intercepts. |
beta |
Regression coefficients. |
df |
Approximate degrees of freedom. |
deviance |
Fitted deviance: -2*(logLHD - logLHD_saturated). |
iter |
Number of optimization iterations by segment, broken into coordinate descent cycles and IRLS re-weightings for family!="gaussian". |
family |
The exponential family model. |
Under prexx=TRUE (requires family="gaussian"), the weighted covariances x'Vx and x'Vy, the weighted column sums of Vx, and the column means xbar will be pre-calculated. Here V is the diagonal matrix of least squares weights (obsweight, so it defaults to the identity I). Pre-calculation is not necessary (the statistics will be built by gamlr otherwise), but you have the option to compute these sufficient statistics yourself and pass them as the arguments vxx (a matrix or dspMatrix), vxy, vxsum, and xbar (all vectors), respectively. Search PREXX in gamlr.R to see the steps involved, and notice that there is very little argument checking: do so at your own risk. Note that xbar is an unweighted calculation, even if V != I. For really Big Data you can then run with x=NULL (e.g., if these statistics were calculated on distributed machines and the full design is unavailable). Beware: in this x=NULL case the deviance (and df, if gamma>0) calculations are incorrect and the selection rules will always return the smallest-lambda model.
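A sketch of the unweighted (V = I) pre-calculation, using the argument names from this note; treat the exact statistic forms as assumptions to verify against the PREXX code path in gamlr.R:

vxx <- crossprod(x) # x'Vx with V the identity
vxy <- drop(crossprod(x, y)) # x'Vy
vxsum <- colSums(x) # column sums of Vx
xbar <- colMeans(x) # unweighted column means
fit <- gamlr(x, y, prexx=TRUE, vxx=vxx, vxy=vxy, vxsum=vxsum, xbar=xbar)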
Matt Taddy [email protected]
Taddy (2017 JCGS), One-Step Estimator Paths for Concave Regularization, http://arxiv.org/abs/1308.5623
cv.gamlr, AICc, hockey
### a low-D test (highly multi-collinear)
n <- 1000
p <- 3
xvar <- matrix(0.9, nrow=p, ncol=p)
diag(xvar) <- 1
x <- matrix(rnorm(p*n), nrow=n) %*% chol(xvar)
y <- 4 + 3*x[,1] - x[,2] + rnorm(n)

## run models to extra-small lambda (1e-3 * lambda.start)
fitlasso <- gamlr(x, y, gamma=0, lambda.min.ratio=1e-3) # lasso
fitgl <- gamlr(x, y, gamma=2, lambda.min.ratio=1e-3) # small gamma
fitglbv <- gamlr(x, y, gamma=10, lambda.min.ratio=1e-3) # big gamma

par(mfrow=c(1,3))
ylim <- range(c(fitglbv$beta@x))
plot(fitlasso, ylim=ylim, col="navy")
plot(fitgl, ylim=ylim, col="maroon")
plot(fitglbv, ylim=ylim, col="darkorange")
Every NHL goal from fall 2002 through the 2014 cup finals.
The data comprise information about play configuration and the players on ice (including goalies) for every goal from the 2002-03 through 2013-14 NHL seasons. Collected using A. C. Thomas's nhlscrapr package. See the Chicago hockey analytics project at github.com/mataddy/hockey.
goal |
Info about each goal scored, including the homegoal indicator (did the home team score?) and the season. |
player |
Sparse Matrix with entries for who was on the ice for each goal: +1 for a home team player, -1 for an away team player, zero otherwise. |
team |
Sparse Matrix with indicators for each team*season interaction: +1 for home team, -1 for away team. |
config |
Special teams info. For example, S5v4 indicates a 5-on-4 powerplay: +1 if the advantage is for the home team, -1 for the away team. |
Matt Taddy, [email protected]
Gramacy, Jensen, and Taddy (2013), "Estimating Player Contribution in Hockey with Regularized Logistic Regression", Journal of Quantitative Analysis in Sports.
Gramacy, Taddy, and Tian (2015), "Hockey Player Performance via Regularized Logistic Regression", Handbook of Statistical Methods for Design and Analysis in Sports.
gamlr
## design
data(hockey)
x <- cbind(config, team, player)
y <- goal$homegoal

## fit the plus-minus regression model
## (non-player effects are unpenalized)
fit <- gamlr(x, y,
  lambda.min.ratio=0.05, nlambda=40, ## just so it runs in under 5 sec
  free=1:(ncol(config)+ncol(team)),
  standardize=FALSE, family="binomial")
plot(fit)

## look at estimated player [career] effects
B <- coef(fit)[colnames(player),]
sum(B!=0) # number of measurable effects (AICc selection)
B[order(-B)[1:10]] # 10 biggest

## convert to 2013-2014 season partial plus-minus
now <- goal$season=="20132014"
pm <- colSums(player[now,names(B)]*c(-1,1)[y[now]+1]) # traditional plus minus
ng <- colSums(abs(player[now,names(B)])) # total number of goals
# the individual effect on probability that a
# given goal is for vs against that player's team
p <- 1/(1+exp(-B))
# multiply ng*p - ng*(1-p) to get expected plus-minus
ppm <- ng*(2*p-1)

## organize the data together and print top 20
effect <- data.frame(b=round(B,3), ppm=round(ppm,3), pm=pm)
effect <- effect[order(-effect$ppm),]
print(effect[1:20,])
Set NA as the reference level for factor variables, and do imputation on missing values for numeric variables. This is useful for building model matrices for regularized regression, and for dealing with missing values, as in Taddy (2019).
naref(x, impute=FALSE, pzero=0.5)
x |
A data frame. |
impute |
Logical, whether to impute missing values in numeric columns. |
pzero |
If impute=TRUE, the proportion of zeros above which a column's missing values are imputed as zero rather than as the column mean. |
For every factor or character column in x, naref sets NA as the reference level of the factor. Columns coded as character class are first converted to factors via factor(x). If impute=TRUE then each numeric column with missing values is converted to two columns: one appended .x that contains the imputed values, and another appended .miss, a binary variable indicating whether the original value was missing. Numeric columns are returned without change if impute=FALSE or if they do not contain any missing values.
A data frame where the factor and character columns have been converted to factors with reference level NA, and, if impute=TRUE, the missing values in numeric columns have been imputed along with an added flag for missingness. See details.
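The NA reference level matters when building a design for regularized regression, since no baseline level gets absorbed into the intercept; a sketch, assuming the Matrix package's sparse.model.matrix:

library(Matrix)
df <- data.frame(a=factor(c(1,2,3,NA)), b=c(1,NA,3,4))
X <- sparse.model.matrix(~ ., data=naref(df, impute=TRUE))[,-1] # drop intercept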
Matt Taddy [email protected]
Matt Taddy, 2019. "Business Data Science", McGraw-Hill
( x <- data.frame(a=factor(c(1,2,3)), b=c(1,NA,3)) )
naref(x, impute=TRUE)