Title: Inverse Regression for Text Analysis
Description: Multinomial (inverse) regression inference for text documents and associated attributes. For details see: Taddy (2013, JASA), Multinomial Inverse Regression for Text Analysis, <arXiv:1012.2098>, and Taddy (2015, AoAS), Distributed Multinomial Regression, <arXiv:1311.6139>. A minimalist partial least squares routine is also included. Note that the topic modeling capability of earlier 'textir' is now a separate package, 'maptpx'.
Author: Matt Taddy <[email protected]>
Maintainer: Matt Taddy <[email protected]>
License: GPL-3
Version: 2.0-5
Built: 2024-10-27 05:49:40 UTC
Source: https://github.com/taddylab/textir
Phrase counts and ideology scores by speaker for members of the 109th US Congress.
These data originally appeared in Gentzkow and Shapiro (GS; 2010) and cover the text of the 2005 Congressional Record, containing all speeches in that year for members of the United States House and Senate. In particular, GS record the number of times each of 529 legislators used terms in a list of 1000 phrases (i.e., each document is a year of transcripts for a single speaker). Associated sentiments are repshare – the two-party vote share obtained by George W. Bush in the 2004 presidential election in each speaker's constituency (congressional district for representatives; state for senators) – and the speaker's first and second common-score values (from http://voteview.com). Full parsing and sentiment details are in Taddy (2013; Section 2.1).
congress109Counts: A 529 by 1000 sparse matrix of phrase counts, with one row per speaker and one column per phrase.
congress109Ideology: A data.frame of speaker attributes, including party, repshare, and the first and second common scores (cs1, cs2).
Matt Taddy, [email protected]
Gentzkow, M. and J. Shapiro (2010), What drives media slant? Evidence from U.S. daily newspapers. Econometrica 78, 35-71. The full dataset is at http://dx.doi.org/10.3886/ICPSR26242.
Taddy (2013), Multinomial Inverse Regression for Text Analysis. http://arxiv.org/abs/1012.2098
srproj, pls, dmr, we8there
data(congress109)

## Bivariate sentiment factors (roll-call vote common scores)
covars <- data.frame(gop=congress109Ideology$party=="R",
                     cscore=congress109Ideology$cs1)
covars$cscore <- covars$cscore -
    tapply(covars$cscore, covars$gop, mean)[covars$gop+1]
rownames(covars) <- rownames(congress109Ideology)

## cl=NULL implies a serial run.
## To use a parallel library fork cluster,
## uncomment the relevant lines below.
## Forking is unix only; use PSOCK for windows.
cl <- NULL
# cl <- makeCluster(detectCores(), type="FORK")

## small nlambda for a fast example
fitCS <- dmr(cl, covars, congress109Counts, gamma=1, nlambda=10)
# stopCluster(cl)

## plot the fit
par(mfrow=c(1,2))
for(j in c("estate.tax","death.tax")){
    plot(fitCS[[j]], col=c("red","green"))
    mtext(j, line=2)
}
legend("topright", bty="n", fill=c("red","green"), legend=names(covars))

## plot the IR sufficient reduction space
Z <- srproj(fitCS, congress109Counts)
par(mfrow=c(1,1))
plot(Z, pch=21, bg=c(4,3,2)[congress109Ideology$party], main="SR projections")

## two pols
Z[c(68,388),]
Correlation and standard deviation for sparse matrices.
corr(x, y)
sdev(x)
x: A dense matrix or sparse Matrix (e.g., dgCMatrix) of covariates.
y: A vector or matrix with the same number of rows as x.
corr returns the ncol(x) by ncol(y) matrix of correlations between the columns of x and y, and sdev returns the column standard deviations of x.
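For instance, on a small dense matrix these reduce to familiar base-R quantities (a minimal sketch; it assumes dense input is accepted alongside sparse Matrix objects and that sdev uses the usual n-1 denominator):

## toy check of corr/sdev against base R
x <- matrix(rnorm(20), nrow=5, ncol=4)
y <- rnorm(5)
corr(x, y)   # should match cor(x, y): columnwise correlations with y
sdev(x)      # should match apply(x, 2, sd): column standard deviations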
Matt Taddy [email protected]
pls, congress109
## some congress examples
data(congress109)
r <- corr(congress109Counts, congress109Ideology$repshare)

## 20 terms for Democrats
sort(r[,1])[1:20]

## 20 terms for Republicans
sort(r[,1], decreasing=TRUE)[1:20]

## 20 high variance terms
colnames(congress109Counts)[order(-sdev(congress109Counts))[1:20]]
A simple partial least squares procedure.
pls(x, y, K=1, scale=TRUE, verb=TRUE)
## S3 method for class 'pls'
predict(object, newdata, type="response", ...)
## S3 method for class 'pls'
summary(object, ...)
## S3 method for class 'pls'
print(x, ...)
## S3 method for class 'pls'
plot(x, K=NULL, xlab="response", ylab=NULL, ...)
x: The covariate matrix, in either dense matrix or sparse Matrix format. For the print and plot methods, x is instead a fitted pls object.
y: The response vector.
K: The number of desired PLS directions. In plotting, this can be a vector of directions to draw; otherwise all fitted directions are plotted.
scale: An indicator for whether to scale the columns of x to unit variance before fitting; usually a good idea.
verb: Whether or not to print a small progress script.
object: For predict and summary, a fitted pls object.
newdata: For predict, a matrix of new covariates with ncol(x) columns.
type: For predict, either "response" for predicted response values or "reduction" for the PLS projections themselves.
xlab: For plot, the x-axis label.
ylab: For plot, the y-axis label.
...: Additional arguments.
pls fits the partial least squares algorithm described in Taddy (2013; Appendix A.1). In particular, we obtain loadings loadings[,k] as the correlation between X and factors factors[,k], where factors[,1] is initialized at scale(as.numeric(y)) and subsequent factors are orthogonal to the k'th pls direction, an ortho-normal transformation of x%*%loadings[,k].
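To make the recursion concrete, here is a rough sketch of the first direction only, written directly from the description above (an illustration of the idea, not the package's exact implementation; the orthonormalization used for later directions is omitted):

data(congress109)

## first PLS direction, per the algorithm description above
x <- congress109Counts/rowSums(congress109Counts)  # term frequencies
y <- congress109Ideology$repshare
v <- scale(as.numeric(y))            # factors[,1]: the standardized response
phi <- corr(x, v)                    # loadings[,1]: correlation of each x column with the factor
z1 <- scale(as.numeric(x %*% phi))   # a standardized version of the first direction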
predict.pls returns predictions from the object$fwdmod forward regression for projections z = x%*%loadings - shift derived from the new covariates, or, if type="reduction", it just returns these projections.
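As a plausibility check, the type="reduction" output should be reproducible by hand from this formula; a minimal sketch, continuing with x from the sketch above (it assumes shift is a length-K vector and ignores any internal rescaling when scale=TRUE):

## reduction-type predictions, computed two ways
fit <- pls(x, congress109Ideology$repshare, K=2)
znew <- predict(fit, newdata=x[1:5,], type="reduction")
## per the formula above, this should (approximately) match:
zman <- sweep(as.matrix(x[1:5,] %*% fit$loadings), 2, fit$shift)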
summary.pls prints dimension details and a quick summary of the corresponding forward regression. plot.pls draws response versus fitted values for the least-squares fit onto the K pls directions.
Output from pls is a list with the following entries:
y: The response vector.
x: The unchanged covariate matrix.
directions: The pls directions: column k is an ortho-normal transformation of x%*%loadings[,k].
loadings: The pls loadings.
shift: Shift applied after projection to center the PLS directions.
fitted: The fitted response values.
fwdmod: The lm object from forward regression of y onto the pls directions.
predict.pls outputs either a vector of predicted response values or an nrow(newdata) by ncol(object$loadings) matrix of pls directions for each new observation. summary and plot return nothing.
Matt Taddy [email protected]
Taddy (2013), Multinomial Inverse Regression for Text Analysis. Journal of the American Statistical Association 108.
Wold, H. (1975), Soft modeling by latent variables: The nonlinear iterative partial least squares approach. In Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett.
normalize, sdev, corr, congress109
data(congress109)
x <- t( t(congress109Counts)/rowSums(congress109Counts) )
summary( fit <- pls(x, congress109Ideology$repshare, K=3) )
plot(fit, pch=21, bg=c(4,3,2)[congress109Ideology$party])
predict(fit, newdata=x[c(68,388),])
Estimation of MNIR sufficient reduction projections. Note that mnlm is just a call to dmr from the distrom package.
srproj(obj, counts, dir=1:K, ...)
mnlm(cl, covars, counts, mu=NULL, bins=NULL, verb=0, ...)
cl: A parallel library cluster, or NULL for a serial run. See the same argument in dmr.
covars: A dense matrix or sparse Matrix of observation attributes.
counts: A dense matrix or sparse Matrix of response counts.
mu: Pre-specified fixed effects for each observation in the Poisson regression linear equation. See the same argument in dmr.
bins: Number of bins into which we will attempt to collapse each column of covars. See the same argument in dmr.
verb: Whether to print some info. See the same argument in dmr.
obj: Either a fitted dmr object or a matrix of its coefficients (e.g., the output of coef).
dir: The attribute (i.e., covars column) directions for the projection; defaults to all of them.
...: Additional arguments to dmr.
These functions provide the first two steps of multinomial inverse regression (see the MNIR paper). mnlm fits multinomial logistic regression parameters under gamma lasso penalization on a factorized Poisson likelihood. The mnlm function, which remains in this package for backwards compatibility only, is just a call to the dmr function of the distrom library (see the DMR paper). For simplicity, we recommend using dmr instead of mnlm. For model selection, coefficients, prediction, and plotting see the relevant functions in help(dmr).
srproj calculates the MNIR Sufficient Reduction projection from text counts onto the attribute dimensions of interest (covars in mnlm or dmr). In particular, for a counts matrix C with row totals m, and mnlm/dmr coefficients phi_j corresponding to attribute j, z_j = C phi_j / m is the SR projection in the direction of j. The MNIR paper explains how v, your original covariates/attributes, are independent of the text counts C given the SR projections z.
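In code, the projection is just a rescaled linear combination of the counts. A minimal sketch for a one-attribute fit (assuming, as in the examples below, that coef returns the intercept in its first row):

data(we8there)

## small one-attribute fit (serial run, small nlambda for speed)
fits <- dmr(NULL, we8thereRatings[,'Overall',drop=FALSE],
            we8thereCounts, bins=5, gamma=1, nlambda=10)

## z = C phi / m, computed by hand
B <- coef(fits)                                # rows: intercept, then 'Overall'
phi <- B[2,]                                   # loadings for the single attribute
m <- rowSums(we8thereCounts)                   # row totals m
zman <- as.numeric(we8thereCounts %*% phi)/m   # should match srproj's first column
z <- srproj(fits, we8thereCounts)              # columns: direction(s), plus m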
The final step of MNIR is 'forward regression' for any element of v onto the SR projections z and the remaining elements of v. We do not provide a function for this because you are free to use whatever you want; see the MNIR and DMR papers for linear, logistic, and random forest forward regression examples.
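Continuing the sketch above, the forward step can be as simple as a linear regression of the attribute onto the projections (the we8there example below does exactly this; logistic or random forest fits follow the same pattern):

## forward regression of the attribute onto its SR projection
fwd <- lm(we8thereRatings$Overall ~ z)   # z from srproj above (includes the m column)
summary(fwd)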
Note that if you were previously using textir not for inverse regression, but rather just as fast code for multinomial logistic regression, you probably want to work directly with the gamlr (binary response) or dmr (multinomial response) packages.
srproj returns a matrix with columns corresponding to directions dir, plus an additional column m holding the row totals of counts.
mnlm returns a dmr S3 object; see help(dmr) for details.
Matt Taddy [email protected]
Taddy (2013, JASA), Multinomial Inverse Regression for Text Analysis (MNIR).
Taddy (2015, AoAS), Distributed Multinomial Regression (DMR).
Taddy (2016, JCGS), The Gamma Lasso (GL).
congress109, we8there, dmr
### Ripley's Cushing Data; see help(Cushings) ###
library(MASS)
data(Cushings)
Cushings[,1:2] <- log(Cushings[,1:2])
train <- Cushings[Cushings$Type!="u",]
newdata <- as.matrix(Cushings[Cushings$Type=="u", 1:2])

## fit, coefficients, predict, and plot
# you could replace 'mnlm' with 'dmr' here.
fit <- mnlm(NULL, covars=train[,1:2], counts=factor(train$Type))

## dmr applies corrected AICc selection by default
round(coef(fit),1)
round(predict(fit, newdata, type="response"),1)

par(mfrow=c(1,3))
for(j in c("a","b","c")){ plot(fit[[j]]); mtext(j,line=2) }

## see we8there and congress109 for MNIR and srproj examples
Term frequency, inverse document frequency weighting.
tfidf(x, normalize=TRUE)
x: A dense matrix or sparse Matrix of token counts, with one row per document and one column per term.
normalize: Whether to normalize term frequency by document totals.
A matrix of the same type as x, with each entry replaced by its tf-idf f[i,j]*log(n/d[j]), where the term frequency f[i,j] is either x[i,j] or x[i,j]/m[i] (with m[i] the total count for document i), depending on normalize; n is the number of documents; and d[j] is the number of documents containing token j.
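The weighting is easy to reproduce by hand; a minimal dense sketch of the normalize=TRUE case (illustration only; note that a token appearing in every document gets weight zero):

## manual tf-idf for a tiny dense count matrix (normalize=TRUE case)
x <- matrix(c(1,0,2,
              0,3,1), nrow=2, byrow=TRUE)   # 2 documents by 3 terms
f <- x/rowSums(x)                           # f[i,j] = x[i,j]/m[i]
d <- colSums(x > 0)                         # d[j]: number of docs containing token j
n <- nrow(x)                                # number of documents
sweep(f, 2, log(n/d), "*")                  # tf-idf, per the formula above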
Matt Taddy [email protected]
pls, we8there
data(we8there)

## 20 high-variance tf-idf terms
colnames(we8thereCounts)[order(-sdev(tfidf(we8thereCounts)))[1:20]]
Counts for 2804 bigrams in 6175 restaurant reviews from the site www.we8there.com.
The short user-submitted reviews are accompanied by a five-star rating on four specific aspects of restaurant quality - food, service, value, and atmosphere - as well as the overall experience. The reviews originally appear in Maua and Cozman (2009), and the parsing details behind these specific counts are in Taddy (MNIR; 2013).
we8thereCounts: A 6175 by 2804 sparse matrix of bigram counts, with one row per review.
we8thereRatings: A data.frame of the associated five-star ratings (food, service, value, atmosphere, and Overall).
Matt Taddy, [email protected]
Maua, D.D. and Cozman, F.G. (2009), Representing and classifying user reviews. In ENIA '09: VIII Encontro Nacional de Inteligencia Artificial, Brazil.
Taddy (2013, JASA), Multinomial Inverse Regression for Text Analysis.
Taddy (2015, AoAS), Distributed Multinomial Regression.
dmr, srproj
## some multinomial inverse regression
## we'll regress counts onto 5-star overall rating
data(we8there)

## cl=NULL implies a serial run.
## To use a parallel library fork cluster,
## uncomment the relevant lines below.
## Forking is unix only; use PSOCK for windows.
cl <- NULL
# cl <- makeCluster(detectCores(), type="FORK")

## small nlambda for a fast example
fits <- dmr(cl, we8thereRatings[,'Overall',drop=FALSE],
            we8thereCounts, bins=5, gamma=1, nlambda=10)
# stopCluster(cl)

## plot fits for a few individual terms
terms <- c("first date","chicken wing",
           "ate here", "good food",
           "food fabul","terribl servic")
par(mfrow=c(3,2))
for(j in terms){ plot(fits[[j]]); mtext(j,font=2,line=2) }

## extract coefficients
B <- coef(fits)
mean(B[2,]==0) # sparsity in loadings

## some big loadings in IR
B[2,order(B[2,])[1:10]]
B[2,order(-B[2,])[1:10]]

## do MNIR projection onto factors
z <- srproj(B, we8thereCounts)

## fit a fwd model to the factors
summary(fwd <- lm(we8thereRatings$Overall ~ z))

## truncate the fwd predictions to our known range
fwd$fitted[fwd$fitted<1] <- 1
fwd$fitted[fwd$fitted>5] <- 5

## plot the fitted rating by true rating
par(mfrow=c(1,1))
plot(fwd$fitted ~ factor(we8thereRatings$Overall),
     varwidth=TRUE, col="lightslategrey")