Title: Inverse Regression for Text Analysis
Description: Multinomial (inverse) regression inference for text documents and associated attributes. For details see: Taddy (2013, JASA), Multinomial Inverse Regression for Text Analysis, <arXiv:1012.2098>, and Taddy (2015, AoAS), Distributed Multinomial Regression, <arXiv:1311.6139>. A minimalist partial least squares routine is also included. Note that the topic modeling capability of earlier 'textir' is now a separate package, 'maptpx'.
Author: Matt Taddy <[email protected]>
Maintainer: Matt Taddy <[email protected]>
License: GPL-3
Version: 2.0-5
Built: 2024-10-27 05:49:40 UTC
Source: https://github.com/taddylab/textir
Phrase counts and ideology scores by speaker for members of the 109th US Congress.
These data originally appeared in Gentzkow and Shapiro (GS; 2010) and cover the text of the 2005 Congressional Record, containing all speeches in that year for members of the United States House and Senate. In particular, GS record the number of times each of 529 legislators used terms in a list of 1000 phrases (i.e., each document is a year of transcripts for a single speaker). Associated sentiments are repshare – the two-party vote share obtained by George W. Bush in the 2004 presidential election in each speaker's constituency (congressional district for representatives; state for senators) – and the speaker's first and second common-score values (from http://voteview.com). Full parsing and sentiment details are in Taddy (2013; Section 2.1).
congress109Counts: A 529 by 1000 sparse matrix of phrase counts, with one row per speaker and one column per phrase.
congress109Ideology: A data.frame of speaker attributes, including party, repshare, and the first and second common scores (cs1, cs2).
Matt Taddy, [email protected]
Gentzkow, M. and J. Shapiro (2010), What drives media slant? Evidence from U.S. daily newspapers. Econometrica 78, 35-71. The full dataset is at http://dx.doi.org/10.3886/ICPSR26242.
Taddy (2013), Multinomial Inverse Regression for Text Analysis. http://arxiv.org/abs/1012.2098
srproj, pls, dmr, we8there
data(congress109)

## Bivariate sentiment factors (roll-call vote common scores)
covars <- data.frame(gop=congress109Ideology$party=="R",
                     cscore=congress109Ideology$cs1)
covars$cscore <- covars$cscore -
    tapply(covars$cscore, covars$gop, mean)[covars$gop+1]
rownames(covars) <- rownames(congress109Ideology)

## cl=NULL implies a serial run.
## To use a parallel library fork cluster,
## uncomment the relevant lines below.
## Forking is unix only; use PSOCK for windows.
cl <- NULL
# cl <- makeCluster(detectCores(), type="FORK")

## small nlambda for a fast example
fitCS <- dmr(cl, covars, congress109Counts, gamma=1, nlambda=10)
# stopCluster(cl)

## plot the fit
par(mfrow=c(1,2))
for(j in c("estate.tax","death.tax")){
    plot(fitCS[[j]], col=c("red","green"))
    mtext(j, line=2)
}
legend("topright", bty="n", fill=c("red","green"), legend=names(covars))

## plot the IR sufficient reduction space
Z <- srproj(fitCS, congress109Counts)
par(mfrow=c(1,1))
plot(Z, pch=21, bg=c(4,3,2)[congress109Ideology$party], main="SR projections")

## two pols
Z[c(68,388),]
Correlation and standard deviation for sparse matrices.
corr(x, y)
sdev(x)
x: A dense matrix or sparse Matrix (e.g., dgCMatrix) of covariates.
y: A vector or matrix with the same number of rows as x.
corr returns the ncol(x) by ncol(y) matrix of correlations between the columns of x and y, and sdev returns the column standard deviations of x.
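For instance, on a small dense matrix these reduce to familiar base-R quantities (a minimal sketch; it assumes dense input is accepted alongside sparse Matrix objects and that sdev uses the usual n-1 denominator):

## toy check of corr/sdev against base R
x <- matrix(rnorm(20), nrow=5, ncol=4)
y <- rnorm(5)
corr(x, y)   # should match cor(x, y): columnwise correlations with y
sdev(x)      # should match apply(x, 2, sd): column standard deviations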
Matt Taddy [email protected]
pls, congress109
## some congress examples
data(congress109)
r <- corr(congress109Counts, congress109Ideology$repshare)

## 20 terms for Democrats
sort(r[,1])[1:20]

## 20 terms for Republicans
sort(r[,1], decreasing=TRUE)[1:20]

## 20 high variance terms
colnames(congress109Counts)[order(-sdev(congress109Counts))[1:20]]
A simple partial least squares procedure.
pls(x, y, K=1, scale=TRUE, verb=TRUE)
## S3 method for class 'pls'
predict(object, newdata, type="response", ...)
## S3 method for class 'pls'
summary(object, ...)
## S3 method for class 'pls'
print(x, ...)
## S3 method for class 'pls'
plot(x, K=NULL, xlab="response", ylab=NULL, ...)
x: The covariate matrix, in either dense matrix or sparse Matrix format. For the print and plot methods, x is instead a fitted pls object.
y: The response vector.
K: The number of desired PLS directions. In plotting, this can be a vector of directions to draw; otherwise all fitted directions are plotted.
scale: An indicator for whether to scale the columns of x to unit variance before fitting; usually a good idea.
verb: Whether or not to print a small progress script.
object: For predict and summary, a fitted pls object.
newdata: For predict, a matrix of new covariates with ncol(x) columns.
type: For predict, either "response" for predicted response values or "reduction" for the PLS projections themselves.
xlab: For plot, the x-axis label.
ylab: For plot, the y-axis label.
...: Additional arguments.
pls fits the partial least squares algorithm described in Taddy (2013; Appendix A.1). In particular, we obtain loadings loadings[,k] as the correlation between X and factors factors[,k], where factors[,1] is initialized at scale(as.numeric(y)) and subsequent factors are orthogonal to the k'th pls direction, an ortho-normal transformation of x%*%loadings[,k].
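To make the recursion concrete, here is a rough sketch of the first direction only, written directly from the description above (an illustration of the idea, not the package's exact implementation; the orthonormalization used for later directions is omitted):

data(congress109)

## first PLS direction, per the algorithm description above
x <- congress109Counts/rowSums(congress109Counts)  # term frequencies
y <- congress109Ideology$repshare
v <- scale(as.numeric(y))            # factors[,1]: the standardized response
phi <- corr(x, v)                    # loadings[,1]: correlation of each x column with the factor
z1 <- scale(as.numeric(x %*% phi))   # a standardized version of the first direction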
predict.pls returns predictions from the object$fwdmod forward regression for projections z = x%*%loadings - shift derived from the new covariates, or, if type="reduction", it just returns these projections.
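As a plausibility check, the type="reduction" output should be reproducible by hand from this formula; a minimal sketch, continuing with x from the sketch above (it assumes shift is a length-K vector and ignores any internal rescaling when scale=TRUE):

## reduction-type predictions, computed two ways
fit <- pls(x, congress109Ideology$repshare, K=2)
znew <- predict(fit, newdata=x[1:5,], type="reduction")
## per the formula above, this should (approximately) match:
zman <- sweep(as.matrix(x[1:5,] %*% fit$loadings), 2, fit$shift)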
summary.pls prints dimension details and a quick summary of the corresponding forward regression. plot.pls draws response versus fitted values for the least-squares fit onto the K pls directions.
Output from pls is a list with the following entries:
y: The response vector.
x: The unchanged covariate matrix.
directions: The pls directions: column k is an ortho-normal transformation of x%*%loadings[,k].
loadings: The pls loadings.
shift: Shift applied after projection to center the PLS directions.
fitted: The fitted response values.
fwdmod: The lm object from forward regression of y onto the pls directions.
predict.pls outputs either a vector of predicted response values or an nrow(newdata) by ncol(object$loadings) matrix of pls directions for each new observation. summary and plot return nothing.
Matt Taddy [email protected]
Taddy (2013), Multinomial Inverse Regression for Text Analysis. Journal of the American Statistical Association 108.
Wold, H. (1975), Soft modeling by latent variables: The nonlinear iterative partial least squares approach. In Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett.
normalize, sdev, corr, congress109
data(congress109)
x <- t( t(congress109Counts)/rowSums(congress109Counts) )
summary( fit <- pls(x, congress109Ideology$repshare, K=3) )
plot(fit, pch=21, bg=c(4,3,2)[congress109Ideology$party])
predict(fit, newdata=x[c(68,388),])
Estimation of MNIR sufficient reduction projections. Note that mnlm is just a call to dmr from the distrom package.
srproj(obj, counts, dir=1:K, ...)
mnlm(cl, covars, counts, mu=NULL, bins=NULL, verb=0, ...)
cl: A parallel library cluster, or NULL for a serial run. See the same argument in dmr.
covars: A dense matrix or sparse Matrix of observation attributes.
counts: A dense matrix or sparse Matrix of response counts.
mu: Pre-specified fixed effects for each observation in the Poisson regression linear equation. See the same argument in dmr.
bins: Number of bins into which we will attempt to collapse each column of covars. See the same argument in dmr.
verb: Whether to print some info. See the same argument in dmr.
obj: Either a fitted dmr object or a matrix of its coefficients (e.g., the output of coef).
dir: The attribute (i.e., covars column) directions for the projection; defaults to all of them.
...: Additional arguments to dmr.
These functions provide the first two steps of multinomial inverse regression (see the MNIR paper). mnlm fits multinomial logistic regression parameters under gamma lasso penalization on a factorized Poisson likelihood. The mnlm function, which remains in this package for backwards compatibility only, is just a call to the dmr function of the distrom library (see the DMR paper). For simplicity, we recommend using dmr instead of mnlm. For model selection, coefficients, prediction, and plotting see the relevant functions in help(dmr).
srproj calculates the MNIR Sufficient Reduction projection from text counts onto the attribute dimensions of interest (covars in mnlm or dmr). In particular, for a counts matrix C with row totals m, and mnlm/dmr coefficients phi_j corresponding to attribute j, z_j = C phi_j / m is the SR projection in the direction of j. The MNIR paper explains how v, your original covariates/attributes, are independent of the text counts C given the SR projections z.
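In code, the projection is just a rescaled linear combination of the counts. A minimal sketch for a one-attribute fit (assuming, as in the examples below, that coef returns the intercept in its first row):

data(we8there)

## small one-attribute fit (serial run, small nlambda for speed)
fits <- dmr(NULL, we8thereRatings[,'Overall',drop=FALSE],
            we8thereCounts, bins=5, gamma=1, nlambda=10)

## z = C phi / m, computed by hand
B <- coef(fits)                                # rows: intercept, then 'Overall'
phi <- B[2,]                                   # loadings for the single attribute
m <- rowSums(we8thereCounts)                   # row totals m
zman <- as.numeric(we8thereCounts %*% phi)/m   # should match srproj's first column
z <- srproj(fits, we8thereCounts)              # columns: direction(s), plus m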
The final step of MNIR is 'forward regression' for any element of v onto the SR projections z and the remaining elements of v. We do not provide a function for this because you are free to use whatever you want; see the MNIR and DMR papers for linear, logistic, and random forest forward regression examples.
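Continuing the sketch above, the forward step can be as simple as a linear regression of the attribute onto the projections (the we8there example below does exactly this; logistic or random forest fits follow the same pattern):

## forward regression of the attribute onto its SR projection
fwd <- lm(we8thereRatings$Overall ~ z)   # z from srproj above (includes the m column)
summary(fwd)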
Note that if you were previously using textir not for inverse regression, but rather just as fast code for multinomial logistic regression, you probably want to work directly with the gamlr (binary response) or dmr (multinomial response) packages.
srproj returns a matrix with columns corresponding to directions dir, plus an additional column m holding the row totals of counts.
mnlm returns a dmr S3 object; see help(dmr) for details.
Matt Taddy [email protected]
Taddy (2013, JASA), Multinomial Inverse Regression for Text Analysis (MNIR).
Taddy (2015, AoAS), Distributed Multinomial Regression (DMR).
Taddy (2016, JCGS), The Gamma Lasso (GL).
congress109, we8there, dmr
### Ripley's Cushing Data; see help(Cushings) ###
library(MASS)
data(Cushings)
Cushings[,1:2] <- log(Cushings[,1:2])
train <- Cushings[Cushings$Type!="u",]
newdata <- as.matrix(Cushings[Cushings$Type=="u", 1:2])

## fit, coefficients, predict, and plot
# you could replace 'mnlm' with 'dmr' here.
fit <- mnlm(NULL, covars=train[,1:2], counts=factor(train$Type))

## dmr applies corrected AICc selection by default
round(coef(fit),1)
round(predict(fit, newdata, type="response"),1)

par(mfrow=c(1,3))
for(j in c("a","b","c")){ plot(fit[[j]]); mtext(j,line=2) }

## see we8there and congress109 for MNIR and srproj examples
Term frequency, inverse document frequency weighting.
tfidf(x, normalize=TRUE)
x: A dense matrix or sparse Matrix of token counts, with one row per document and one column per term.
normalize: Whether to normalize term frequency by document totals.
A matrix of the same type as x, with each entry replaced by its tf-idf f[i,j]*log(n/d[j]), where the term frequency f[i,j] is either x[i,j] or x[i,j]/m[i] (with m[i] the total count for document i), depending on normalize; n is the number of documents; and d[j] is the number of documents containing token j.
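The weighting is easy to reproduce by hand; a minimal dense sketch of the normalize=TRUE case (illustration only; note that a token appearing in every document gets weight zero):

## manual tf-idf for a tiny dense count matrix (normalize=TRUE case)
x <- matrix(c(1,0,2,
              0,3,1), nrow=2, byrow=TRUE)   # 2 documents by 3 terms
f <- x/rowSums(x)                           # f[i,j] = x[i,j]/m[i]
d <- colSums(x > 0)                         # d[j]: number of docs containing token j
n <- nrow(x)                                # number of documents
sweep(f, 2, log(n/d), "*")                  # tf-idf, per the formula above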
Matt Taddy [email protected]
pls, we8there
data(we8there)

## 20 high-variance tf-idf terms
colnames(we8thereCounts)[order(-sdev(tfidf(we8thereCounts)))[1:20]]
Counts for 2804 bigrams in 6175 restaurant reviews from the site www.we8there.com.
The short user-submitted reviews are accompanied by a five-star rating on four specific aspects of restaurant quality - food, service, value, and atmosphere - as well as the overall experience. The reviews originally appear in Maua and Cozman (2009), and the parsing details behind these specific counts are in Taddy (MNIR; 2013).
we8thereCounts: A 6175 by 2804 sparse matrix of bigram counts, with one row per review.
we8thereRatings: A data.frame of the associated five-star ratings (food, service, value, atmosphere, and Overall).
Matt Taddy, [email protected]
Maua, D.D. and Cozman, F.G. (2009), Representing and classifying user reviews. In ENIA '09: VIII Encontro Nacional de Inteligencia Artificial, Brazil.
Taddy (2013, JASA), Multinomial Inverse Regression for Text Analysis.
Taddy (2015, AoAS), Distributed Multinomial Regression.
dmr, srproj
## some multinomial inverse regression
## we'll regress counts onto 5-star overall rating
data(we8there)

## cl=NULL implies a serial run.
## To use a parallel library fork cluster,
## uncomment the relevant lines below.
## Forking is unix only; use PSOCK for windows.
cl <- NULL
# cl <- makeCluster(detectCores(), type="FORK")

## small nlambda for a fast example
fits <- dmr(cl, we8thereRatings[,'Overall',drop=FALSE],
            we8thereCounts, bins=5, gamma=1, nlambda=10)
# stopCluster(cl)

## plot fits for a few individual terms
terms <- c("first date","chicken wing",
           "ate here", "good food",
           "food fabul","terribl servic")
par(mfrow=c(3,2))
for(j in terms){ plot(fits[[j]]); mtext(j,font=2,line=2) }

## extract coefficients
B <- coef(fits)
mean(B[2,]==0) # sparsity in loadings

## some big loadings in IR
B[2,order(B[2,])[1:10]]
B[2,order(-B[2,])[1:10]]

## do MNIR projection onto factors
z <- srproj(B, we8thereCounts)

## fit a fwd model to the factors
summary(fwd <- lm(we8thereRatings$Overall ~ z))

## truncate the fwd predictions to our known range
fwd$fitted[fwd$fitted<1] <- 1
fwd$fitted[fwd$fitted>5] <- 5

## plot the fitted rating by true rating
par(mfrow=c(1,1))
plot(fwd$fitted ~ factor(we8thereRatings$Overall),
     varwidth=TRUE, col="lightslategrey")