Title: | MAP Estimation of Topic Models |
---|---|
Description: | Maximum a posteriori (MAP) estimation for topic models (i.e., Latent Dirichlet Allocation) in text analysis, as described in Taddy (2012) 'On estimation and selection for topic models'. Previous versions of this code were included as part of the 'textir' package. If you want to take advantage of openmp parallelization, uncomment the relevant flags in src/MAKEVARS before compiling. |
Authors: | Matt Taddy <[email protected]> |
Maintainer: | Matt Taddy <[email protected]> |
License: | GPL-3 |
Version: | 1.9-7 |
Built: | 2024-11-04 03:17:33 UTC |
Source: | https://github.com/taddylab/maptpx |
Tools for manipulating (sparse) count matrices.
normalize(x, byrow=TRUE)
stm_tfidf(x)
x |
A |
byrow |
Whether to normalize by row or column totals. |
normalize divides the counts by row or column totals, and stm_tfidf returns a matrix of term-frequency inverse-document-frequency (tf-idf) entries, computed from the term-j frequency in document-i and the number of documents containing term-j.
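As a rough illustration of the two operations described here, the base-R sketch below normalizes a dense count matrix and applies a textbook tf-idf weighting. The helper names `normalize_sketch` and `tfidf_sketch` are hypothetical, and the `log(n / d_j)` weighting is an assumption: the package may use a different smoothing and works with sparse matrices.

```r
# Hypothetical helpers for illustration only; not the package's own code.
normalize_sketch <- function(x, byrow = TRUE) {
  x <- as.matrix(x)
  if (byrow) {
    x / rowSums(x)                   # each row sums to one
  } else {
    sweep(x, 2, colSums(x), "/")     # each column sums to one
  }
}

# Textbook tf-idf: term frequency times log(n / d_j).
# ASSUMPTION: the package's stm_tfidf may smooth d_j differently.
tfidf_sketch <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)                         # number of documents
  d <- colSums(x > 0)                  # documents containing each term
  idf <- ifelse(d > 0, log(n / d), 0)  # unseen terms get zero weight
  sweep(x, 2, idf, "*")
}

x <- matrix(1:9, ncol = 3)
normalize_sketch(x)
normalize_sketch(x, byrow = FALSE)
tfidf_sketch(matrix(c(1, 0, 2, 0, 0, 3), ncol = 2))
```

Rows or columns with zero totals would produce NaN in this sketch; real data may need a guard for that case.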
Matt Taddy [email protected]
normalize( matrix(1:9, ncol=3) )
normalize( matrix(1:9, ncol=3), byrow=FALSE )
(x <- matrix(rbinom(15,size=2,prob=.25),ncol=3))
stm_tfidf(x)
Predict function for Topic Models
## S3 method for class 'topics'
predict(object, newcounts, loglhd=FALSE, ...)
object |
An output object from the |
newcounts |
An |
loglhd |
Whether or not to calculate and return |
... |
Additional arguments to the undocumented internal |
Under the default mixed-membership topic model, this function uses sequential quadratic programming to fit topic weights for new documents. Estimates for each new document's topic weights are, conditional on object$theta, MAP in the (K-1)-dimensional logit transformed parameter space.
The output is an nrow(newcounts) by object$K matrix of document topic weights, or a list including these weights as W and the log likelihood as L.
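To make the idea of fitting weights in a (K-1)-dimensional logit space concrete, here is a minimal base-R sketch for a single new document. It is not the package's routine: it swaps in `stats::optim` with BFGS instead of sequential quadratic programming, omits the prior on the weights (so it is maximum likelihood rather than MAP), and the name `fit_weights_sketch` is hypothetical.

```r
# Hypothetical illustration: fit topic weights for one new document by
# maximizing the multinomial log likelihood over K-1 logit parameters.
# ASSUMPTIONS: optim/BFGS replaces the package's SQP, and no prior is used.
fit_weights_sketch <- function(x, theta) {
  K <- ncol(theta)                      # theta is terms-by-topics
  to_weights <- function(eta) {         # first topic is the reference category
    z <- c(0, eta)
    e <- exp(z - max(z))                # subtract max for numerical stability
    e / sum(e)
  }
  negloglhd <- function(eta) {
    q <- as.vector(theta %*% to_weights(eta))  # mixed term probabilities
    -sum(x[x > 0] * log(q[x > 0]))             # up to the multinomial constant
  }
  fit <- optim(rep(0, K - 1), negloglhd, method = "BFGS")
  list(W = to_weights(fit$par), L = -fit$value)
}

set.seed(1)
p <- 50; K <- 3
theta <- apply(matrix(rgamma(p * K, shape = 0.5), p, K), 2,
               function(v) v / sum(v))  # each column is one topic
omega_true <- c(0.7, 0.2, 0.1)
x <- as.vector(rmultinom(1, size = 500, prob = theta %*% omega_true))
fit <- fit_weights_sketch(x, theta)
round(fit$W, 2)
```

The returned list mirrors the W and L naming used above, but L here excludes the multinomial normalizing constant.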
Matt Taddy [email protected]
Taddy (2012), On Estimation and Selection for Topic Models. http://arxiv.org/abs/1109.4518
topics, plot.topics, summary.topics, congress109
## Simulate some data
omega <- t(rdir(500, rep(1/10,10)))
theta <- rdir(10, rep(1/1000,1000))
Q <- omega%*%t(theta)
counts <- matrix(ncol=1000, nrow=500)
totals <- rpois(500, 200)
for(i in 1:500){ counts[i,] <- rmultinom(1, size=totals[i], prob=Q[i,]) }

## predict omega given theta
W <- predict.topics( theta, counts )
plot(W, omega, pch=21, bg=8)
Generate random draws from a Dirichlet distribution
rdir(n, alpha)
n |
The number of observations. |
alpha |
A |
An n-column matrix containing the observations.
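One standard way to generate Dirichlet draws is to normalize independent gamma variates. The sketch below follows that construction and returns one draw per column, matching the n-column layout described above. The name `rdir_sketch` is hypothetical and this is not claimed to be how the package implements `rdir`.

```r
# Hypothetical stand-in for illustration; not the package's implementation.
# Draws Dirichlet vectors by normalizing independent gamma variates.
rdir_sketch <- function(n, alpha) {
  p <- length(alpha)
  # shape is recycled down each column, so row j always uses alpha[j]
  g <- matrix(rgamma(n * p, shape = alpha), nrow = p, ncol = n)
  sweep(g, 2, colSums(g), "/")   # column i is one draw on the simplex
}

set.seed(42)
d <- rdir_sketch(3, rep(1, 6))
dim(d)       # 6 rows (categories) by 3 columns (draws)
colSums(d)   # each column sums to one
```

With very small concentration values the gamma variates can underflow to zero, so a production version would need extra care there.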
Matt Taddy [email protected]
rdir(3,rep(1,6))
MAP estimation of Topic models
topics(counts, K, shape=NULL, initopics=NULL, tol=0.1, bf=FALSE, kill=2, ord=TRUE, verb=1, ...)
counts |
A matrix of multinomial response counts in |
K |
The number of latent topics. If |
shape |
Optional argument to specify the Dirichlet prior concentration parameter as |
initopics |
Optional start-location for |
tol |
Convergence tolerance: optimization stops, conditional on some extra checks, when the absolute posterior increase over a full parameter set update is less than |
bf |
An indicator for whether or not to calculate the Bayes factor for univariate |
kill |
For choosing from multiple |
ord |
If |
verb |
A switch for controlling printed output. |
... |
Additional arguments to the undocumented internal |
A latent topic model represents each i'th document's term-count vector $X_i$ (with $\sum_j x_{ij} = m_i$ total phrase count) as having been drawn from a mixture of K multinomials, each parameterized by topic-phrase probabilities $\theta_k$, such that
$$X_i \sim \mathrm{MN}(m_i,\ \omega_{i1}\theta_1 + \dots + \omega_{iK}\theta_K).$$
We assign a K-dimensional Dirichlet(1/K) prior to each document's topic weights $[\omega_{i1} \dots \omega_{iK}]$, and the prior on each $\theta_k$ is Dirichlet with concentration $\alpha$.
The topics function uses quasi-Newton accelerated EM, augmented with sequential quadratic programming for conditional updates, to obtain MAP estimates for the topic model parameters. We also provide Bayes factor estimation, from marginal likelihood calculations based on a Laplace approximation around the converged MAP parameter estimates. If input length(K)>1, these Bayes factors are used for model selection. Full details are in Taddy (2012).
A topics object list with entries
K |
The number of latent topics estimated. If input |
theta |
The |
omega |
The |
BF |
The log Bayes factor for each number of topics in the input |
D |
Residual dispersion: for each element of |
X |
The input count matrix, in |
Estimates are actually functions of the MAP (K-1 or p-1)-dimensional logit transformed natural exponential family parameters.
Matt Taddy [email protected]
Taddy (2012), On Estimation and Selection for Topic Models. http://arxiv.org/abs/1109.4518
plot.topics, summary.topics, predict.topics, wsjibm, congress109, we8there
## Simulation Parameters
K <- 10
n <- 100
p <- 100
omega <- t(rdir(n, rep(1/K,K)))
theta <- rdir(K, rep(1/p,p))

## Simulated counts
Q <- omega%*%t(theta)
counts <- matrix(ncol=p, nrow=n)
totals <- rpois(n, 100)
for(i in 1:n){ counts[i,] <- rmultinom(1, size=totals[i], prob=Q[i,]) }

## Bayes Factor model selection (should choose K or nearby)
summary(simselect <- topics(counts, K=K+c(-5:5)), nwrd=0)

## MAP fit for given K
summary( simfit <- topics(counts, K=K, verb=2), n=0 )

## Adjust for label switching and plot the fit (color by topic)
toplab <- rep(0,K)
for(k in 1:K){ toplab[k] <- which.min(colSums(abs(simfit$theta-theta[,k]))) }
par(mfrow=c(1,2))
tpxcols <- matrix(rainbow(K), ncol=ncol(theta), byrow=TRUE)
plot(theta,simfit$theta[,toplab], ylab="fitted values", pch=21, bg=tpxcols)
plot(omega,simfit$omega[,toplab], ylab="fitted values", pch=21, bg=tpxcols)
title("True vs Fitted Values (color by topic)", outer=TRUE, line=-2)

## The S3 method plot functions
par(mfrow=c(1,2))
plot(simfit, lgd.K=2)
plot(simfit, type="resid")
Tools for looking at the variance of document-topic weights.
topicVar(counts, theta, omega)
logit(prob)
expit(eta)
counts |
A matrix of multinomial response counts, as input to the |
theta |
A fitted topic matrix, as output from the |
omega |
A fitted document topic-weight matrix, as output from the |
prob |
A probability vector (positive and sums to one) or a matrix with probability vector rows. |
eta |
A vector of the natural exponential family parameterization for a probability vector (with first category taken as null) or a matrix with each row the NEF parameters for a single observation. |
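These functions use the natural exponential family (NEF) parametrization of a probability vector $q = [q_0 \dots q_{K-1}]$ with the first element corresponding to a 'null' category; that is, with $\eta_k = \log(q_k) - \log(q_0)$ and setting $\eta_0 = 0$, the probabilities are
$$q_k = \frac{\exp(\eta_k)}{\sum_{j=0}^{K-1} \exp(\eta_j)}.$$
Refer to Taddy (2012) for details.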
topicVar returns an array with dimensions (K-1, K-1, n), where K=ncol(omega)=ncol(theta) and n = nrow(counts) = nrow(omega), filled with the posterior covariance matrix for the NEF parametrization of each row of omega. Utility logit performs the NEF transformation and expit reverses it.
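The round trip between a probability vector and its NEF parameters can be sketched in a few lines of base R. The names `nef_logit` and `nef_expit` are hypothetical, the sketch handles a single vector only, and it assumes the transformed vector drops the null category (length K-1); the package's `logit` and `expit` may differ in these details.

```r
# Hypothetical vector-only versions for illustration; not the package's code.
# ASSUMPTION: the NEF parameters omit the null (first) category.
nef_logit <- function(prob) {
  log(prob[-1]) - log(prob[1])   # log odds against the first category
}

nef_expit <- function(eta) {
  z <- c(0, eta)                 # the null category has parameter zero
  e <- exp(z - max(z))           # subtract max for numerical stability
  e / sum(e)
}

q <- c(0.5, 0.3, 0.2)
eta <- nef_logit(q)              # K-1 = 2 free parameters
nef_expit(eta)                   # recovers q up to floating-point error
```

Taking the first category as the reference is what makes the parameter space (K-1)-dimensional: the K probabilities must sum to one, so only K-1 of them are free.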
Matt Taddy [email protected]
Taddy (2012), On Estimation and Selection for Topic Models. http://arxiv.org/abs/1109.4518
topics, predict.topics