pLSA implementation for sparse matrix

I'm trying to implement the pLSA algorithm proposed by Thomas Hofmann (1999). However, all the implementations I have found treat the input term-document matrix as dense instead of sparse. Since my input matrix is quite large and sparse, I would like to find an implementation that exploits the sparsity. Could you help me find one? Matlab or Java is preferred.
UPDATE
I have found that PennAspect
http://www.cis.upenn.edu/~ungar/Datamining/software_dist/PennAspect/index.html
does in fact implement PLSA with sparse matrix input.
The solution is simple. A 2D ragged array (an array whose rows do not all have the same length) can be used to represent the sparse matrix.

I know it's too late, but I was also searching for an answer and finally implemented it on my own. I am new to R but loved this algorithm and was advised to implement it in R. It is working perfectly with my large sparse DTM (document-term matrix, here called mat) with 10 iterations:
## PLSA algorithm
k <- 100  # number of latent topics
# Normalise a row so that it sums to 1
funnorm <- function(matrow) matrow / sum(matrow)
# P1: k x terms, random start with each row normalised
P1 <- t(apply(matrix(sample.int(46, k * dim(mat)[2], TRUE), k, dim(mat)[2]), 1, funnorm))
# P2: documents x k, random start with each row normalised
P2 <- t(apply(matrix(sample.int(46, dim(mat)[1] * k, TRUE), dim(mat)[1], k), 1, funnorm))
for (n in 1:10) {
  P3 <- P2 %*% P1            # current reconstruction of mat
  P4 <- mat / P3             # element-wise ratio of observed to reconstructed
  P5 <- P4 %*% t(P1)
  P6 <- P2 * P5
  P2new <- P6 / rowSums(P6)  # updated P2, rows renormalised
  P5 <- t(P2) %*% P4
  P6 <- P1 * P5
  P1new <- P6 / rowSums(P6)  # updated P1, rows renormalised
  P1 <- P1new
  P2 <- P2new
}
Hope it helps anybody still looking for this.
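For readers who want the sparsity exploited explicitly, here is a minimal NumPy sketch of the same kind of update that only ever touches the nonzero counts. It is an illustration under stated assumptions, not a port of PennAspect or of the R code above: the function name plsa_sparse and its arguments are hypothetical, the matrix is assumed to be given as coordinate triplets (row index, column index, count), and every document is assumed to have at least one nonzero count.

```python
import numpy as np

def plsa_sparse(rows, cols, vals, n_docs, n_terms, k, n_iter=10, seed=0):
    """EM for pLSA over the nonzero entries only (hypothetical helper).

    rows, cols, vals are the coordinate triplets of the document-term matrix.
    Returns P(z|d) with shape (n_docs, k) and P(w|z) with shape (k, n_terms).
    """
    rng = np.random.default_rng(seed)
    p_z_d = rng.random((n_docs, k))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k, n_terms))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # joint[e, z] = P(z|d_e) * P(w_e|z) for each nonzero entry e
        joint = p_z_d[rows] * p_w_z[:, cols].T
        # E-step: posterior over topics for each entry, weighted by its count
        resp = vals[:, None] * joint / joint.sum(axis=1, keepdims=True)
        # M-step: accumulate the weighted posteriors per document and per term
        acc_d = np.zeros((n_docs, k))
        np.add.at(acc_d, rows, resp)
        acc_w = np.zeros((n_terms, k))
        np.add.at(acc_w, cols, resp)
        p_z_d = acc_d / acc_d.sum(axis=1, keepdims=True)
        p_w_z = (acc_w / acc_w.sum(axis=0, keepdims=True)).T
    return p_z_d, p_w_z

# Tiny made-up example: 3 documents, 4 terms, 5 nonzero counts
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 1, 1, 2, 3])
vals = np.array([2.0, 1.0, 3.0, 1.0, 4.0])
p_z_d, p_w_z = plsa_sparse(rows, cols, vals, n_docs=3, n_terms=4, k=2)
```

Memory use here scales with the number of nonzero entries times k, rather than with documents times terms, which is the point of working from the triplets.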

Related

Generating mxn matrix from user input

I'm new to python and I am trying to work through some exercises introducing numpy. I have got stuck on this question:
Create a function that takes m, n ∈ ℕ as input and generates an m×n matrix (numpy.array) A with entries a[i,j] = j*m + i where 0 ≤ i ≤ m−1 and 0 ≤ j ≤ n−1
I have found a way of doing this more or less without numpy but any help on this would be appreciated.
I think the best way would be to build the entries with a list comprehension first and then convert the result into a numpy array
# m, n = rows, cols
np.array([[j*m + i for j in range(n)] for i in range(m)])
However, the entries are just consecutive integers running down the columns, so it can be done more easily by filling an n-by-m array row by row and transposing it
np.array(range(m*n)).reshape((n, m)).T
NumPy also has a built-in function for generating the range, which is arange
np.arange(m*n).reshape((n, m)).T
Hope it helps.
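As a quick check against the definition a[i,j] = j*m + i, here is a small self-contained sketch. The function name make_matrix is only an illustrative choice, and np.fromfunction is used as an independent way of evaluating the formula at every index.

```python
import numpy as np

def make_matrix(m, n):
    """Return an m x n array A with A[i, j] = j*m + i."""
    # Fill an (n, m) array row by row, then transpose to get shape (m, n)
    return np.arange(m * n).reshape((n, m)).T

m, n = 2, 3
A = make_matrix(m, n)

# Independent check: evaluate the formula directly on the index grid
A_check = np.fromfunction(lambda i, j: j * m + i, (m, n), dtype=int)
```

With m = 2 and n = 3 the rows of A are [0, 2, 4] and [1, 3, 5], so consecutive values run down each column.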

Multiplying matrix along one specific dimension

QUESTION
I'm looking for an elegant way to multiply two arrays along one particular dimension.
SIMILAR QUESTION
There is already a similar question on the official matlab forum, but the thread is outdated (2004).
EXAMPLE
Given M1, a [6x4x4] matrix, and M2, a [6x1] matrix, I would like to multiply (element by element) M1 with M2 along the 3rd dimension of M1 to obtain a matrix M [6x4x4].
An equivalent to:
M1 = rand(6,4,4);
M2 = rand(6,1);
for ii = 1:size(M1,2)
for jj = 1:size(M1,3)
M(:,ii,jj) = M1(:,ii,jj).*M2;
end
end
Do you know a cool way to do that? (no loop, a 1 or 2 line solution, ...)
If I'm interpreting your question correctly, you want to take each temporal slice (i.e. 1 x 1 x n) at each spatial location in M1 and element-wise multiply it with a vector M2 of size n x 1. bsxfun and permute are perfect for that situation:
M = bsxfun(@times, M1, permute(M2, [2 3 1]));
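For comparison only (the thread itself is about MATLAB), here is a hedged NumPy sketch of the loop from the question, where M2 scales M1 along its first dimension. It shows the same idea as bsxfun: reshape the vector so that its single non-trivial axis lines up with the target axis, and let broadcasting stretch it over the others.

```python
import numpy as np

rng = np.random.default_rng(0)
M1 = rng.random((6, 4, 4))
M2 = rng.random((6, 1))

# Loop version, mirroring the nested loop in the question
M_loop = np.empty_like(M1)
for ii in range(M1.shape[1]):
    for jj in range(M1.shape[2]):
        M_loop[:, ii, jj] = M1[:, ii, jj] * M2[:, 0]

# Broadcast version: shape (6, 1, 1) stretches over axes 1 and 2 of M1
M = M1 * M2.reshape(6, 1, 1)
```

To scale along a different axis instead, the vector would be reshaped so that its length sits on that axis, for example (1, 1, n) for the last one.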

Make a matrix of s columns of a dataframe with same column names

How to make a matrix P containing the proportions p1 to ps for s variables from the initial dataframe (which contains columns p1 to ps)
This is an R problem. I have a dataframe that includes variables p1
to ps as well as other variables. I want to transfer the values for
variables p1 to ps from the dataframe to a matrix P for use in other
routines. I can readily do this when I know the number of columns
s (s = 5 in the example supplied below) using the code below (test
data is in dataframe ALL_test for a five column example).
The following code reads in the example dataframe ALL_test.
ALL_test <- data.frame(
x = c(50,75,45), p1 = c(1, 0, 0), p2 = c(0, .4, .1), p3 = c(0, .2, .3),
p4 = c(0, .4, .1), p5 = c(0, 0, .5)
)
P <- with(ALL_test, cbind(p1, p2, p3, p4, p5))
colnames(P)<- c("p1","p2","p3","p4","p5")
Outputting P shows that this solution works when based
on the known value 5 of s, the number of columns I wish
to transfer to a matrix P.
I want to develop code where I supply ‘s’ that will return
s columns in the matrix P. The code that was kindly supplied in the
first response to this post gives me a list that contains the names
p1 to ps but I do not see how to use this to extract the columns
p1 to p5 from the dataframe.
I know that this is probably trivial but I cannot sort it.
I have tried (all of which just gives strings of p1 to ps)
s <- 5
nam1 <- paste("p", 1:s, sep = "", collapse = ", ")
nam1 # this returns "p1, p2, p3, p4, p5"
cat(nam1, "\n") # returns p1, p2, p3, p4, p5 but this does not work in
P <- with(ALL, cbind(cat(nam1, "\n")))
I think I see what you're trying to do... but everything you've tried just creates one string of s labels, rather than a list of length s.
How about: with(ALL, paste("p",as.character(seq(1,s)),sep=""))
Edit
With the updated question, you essentially have a data frame from which you want to take a subset of columns and build a matrix, so this is how I'd go about it (someone feel free to tell me a better way of doing this!)
Subset the data frame (using the code I posted before as the vector for an %in% test):
temp<-ALL_test[,colnames(ALL_test)%in%paste("p",as.character(seq(1,s)),sep="")]
Then create a matrix from that:
P <- data.matrix(temp)
Naturally there's nothing stopping you combining all of that into:
P <-data.matrix(ALL_test[,colnames(ALL_test)%in%paste("p",as.character(seq(1,s)),sep="")])

Apply an R function over multiple arrays, returning an array of the same size

I have two arrays of 2x2 matrices, and I'd like to apply a function over each pair of 2x2 matrices. Here's a minimal example, multiplying each matrix in A by its corresponding matrix in B:
A <- array(1:20, c(5,2,2))
B <- array(1:20, c(5,2,2))
n <- nrow(A)
# Desired output: array with dimension 5x2x2 that contains
# the product of each pair of 2x2 matrices in A and B.
C <- aperm(sapply(1:n, function(i) A[i,,]%*%B[i,,], simplify="array"), c(3,1,2))
This takes two arrays, each with 5 2x2 matrices, and multiplies each pair of 2x2 matrices together, with the desired result in C.
My current code is this ugly last line, using sapply to loop through the first array dimension and pull out each 2x2 matrix separately from A and B. And then I need to permute the array dimensions with aperm() in order to have the same ordering as the original arrays (sapply(...,simplify="array") indexes each 2x2 matrix using the third dimension rather than the first one).
Is there a nicer way to do this? I hate that ugly function(i) in there, which is really just a way of faking a for loop. And the aperm() call makes this much less readable. What I have now works fine; I'm just searching for something that feels more like idiomatic R.
mapply() will take multiple lists or vectors, but it doesn't seem to work with arrays. aaply() from plyr is also close, but it doesn't take multiple inputs. The closest I've come is to use abind() with aaply() to pack A and B into one array and work with 2 matrices at once, but this doesn't quite work (it only gets the first two entries; somewhere my indexing is off):
aaply(.data=abind(A,B,along=0), 1, function(ab) ab[1,,]%*%ab[2,,])
And this isn't exactly cleaner or clearer anyway!
I've tried to make this a minimal example, but my real use case requires a more complicated function of the matrix pairs (and I'd also love to scale this up to more than two arrays), so I'm looking for something that will generalize and scale.
D <- aaply(abind(A, B, along = 4), 1, function(x) x[,,1] %*% x[,,2])
This is a working solution using abind and aaply.
Sometimes a for loop is the easiest to follow. It also generalizes and scales:
n <- nrow(A)
C <- A
for(i in 1:n) C[i,,] <- A[i,,] %*% B[i,,]
R's infrastructure for lists is much better (it seems) than for arrays, so I could also approach it by converting the arrays into lists of matrices like this:
A <- alply(A, 1, function(a) matrix(a, ncol=2, nrow=2))
B <- alply(B, 1, function(a) matrix(a, ncol=2, nrow=2))
mapply(function(a,b) a%*%b, A, B, SIMPLIFY=FALSE)
I think this is more straightforward than what I have above, but I'd still love to hear better ideas.

Fastest way to multiply arrays of matrices in Python (numpy)

I have two arrays of 2-by-2 complex matrices, and I was wondering what would be the fastest method of multiplying them. (I want to do matrix multiplication on the elements of the matrix arrays.) At present, I have
numpy.array(list(map(lambda i: numpy.dot(m1[i], m2[i]), range(l))))
But can one do better than this?
Thanks,
v923z
numpy.einsum is the optimal solution for this problem, and it is mentioned way down toward the bottom of DaveP's reference. The code is clean, very easy to understand, and an order of magnitude faster than looping through the array and doing the multiplication one by one. Here is some sample code:
import numpy
l = 100
m1 = numpy.random.rand(l, 2, 2)
m2 = numpy.random.rand(l, 2, 2)
# Baseline: one numpy.dot per pair of matrices
m3 = numpy.array(list(map(lambda i: numpy.dot(m1[i], m2[i]), range(l))))
# einsum: sum over j for every leading index l
m3e = numpy.einsum('lij,ljk->lik', m1, m2)
%timeit numpy.array(list(map(lambda i: numpy.dot(m1[i], m2[i]), range(l))))
%timeit numpy.einsum('lij,ljk->lik', m1, m2)
print(numpy.all(m3 == m3e))
Here are the return values when run in an ipython notebook:
1000 loops, best of 3: 479 µs per loop
10000 loops, best of 3: 48.9 µs per loop
True
I think the answer you are looking for is here. Unfortunately it is a rather messy solution involving reshaping.
If m1 and m2 are 1-dimensional arrays of 2x2 complex matrices, then they essentially have shape (l,2,2). So matrix multiplication on the last two axes is equivalent to summing the product of the last axis of m1 with the second-to-last axis of m2. That's exactly what np.dot does:
np.dot(m1,m2)
Or, since you have complex matrices, perhaps you want to take the complex conjugate of m1 first. In that case, use np.vdot.
PS. If m1 is a list of 2x2 complex matrices, then perhaps see if you can rearrange your code to make m1 an array of shape (l,2,2) from the outset.
If that is not possible, a list comprehension
[np.dot(m1[i],m2[i]) for i in range(l)]
will be faster than using map with lambda, but performing l np.dots is going to be slower than doing one np.dot on two arrays of shape (l,2,2) as suggested above.
"If m1 and m2 are 1-dimensional arrays of 2x2 complex matrices, then they essentially have shape (l,2,2). So matrix multiplication on the last two axes is equivalent to summing the product of the last axis of m1 with the second-to-last axis of m2. That's exactly what np.dot does:"
But that is not what np.dot does.
a = numpy.array([numpy.diag([1, 2]), numpy.diag([2, 3]), numpy.diag([3, 4])])
produces a (3,2,2) array of 2-by-2 matrices. However, numpy.dot(a,a) creates 6 matrices, and the result's shape is (3, 2, 3, 2). That is not what I need. What I need is an array holding numpy.dot(a[0],a[0]), numpy.dot(a[1],a[1]), numpy.dot(a[2],a[2]) ...
[np.dot(m1[i],m2[i]) for i in range(l)]
should work, but I haven't yet checked whether it is faster than mapping the lambda expression.
Cheers,
v923z
EDIT: the for loop and the map run at about the same speed. It is the casting to numpy.array that consumes a lot of time, but that would have to be done for both methods, so there is no gain here.
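The shape difference described in this reply can be checked directly. The sketch below reuses the (3,2,2) array of diagonal matrices from the reply and compares numpy.dot with the pairwise products. It also uses numpy.matmul, which is not mentioned in the thread and is brought in here as an assumption about what is available: it treats the leading axis as a batch and multiplies the trailing 2x2 matrices pair by pair.

```python
import numpy as np

a = np.array([np.diag([1, 2]), np.diag([2, 3]), np.diag([3, 4])])

# np.dot pairs every matrix in the first argument with every matrix in the second
all_pairs = np.dot(a, a)

# Pairwise products, one np.dot per index
pairwise = np.array([np.dot(a[i], a[i]) for i in range(len(a))])

# Two batched alternatives that give the pairwise result in a single call
batched_einsum = np.einsum('lij,ljk->lik', a, a)
batched_matmul = np.matmul(a, a)
```

Here all_pairs has shape (3, 2, 3, 2), while the other three results have shape (3, 2, 2) and agree with each other.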
Maybe this question is too old, but I was still searching for an answer.
I tried this code:
a=np.asarray(range(1048576),dtype='complex');b=np.reshape(a//1024,(1024,1024));b=b+1J*b
%timeit c=np.dot(b,b)
%timeit d=np.einsum('ij, ki -> jk', b,b).T
The results for 'dot' are:
10 loops, best of 3: 174 ms per loop
and for 'einsum':
1 loops, best of 3: 4.51 s per loop
I have checked that c and d are the same:
(c==d).all()
True
So 'dot' is still the winner. I am still searching for a better method, but without success.
