Should I trim the dfm before or after applying tf-idf?

I used the quanteda package to create dfm and tf-idf-weighted dfm objects. I tried two ways to remove sparse features and create trimmed dfms: one by passing a sparsity argument directly to dfm(), the other by reducing sparsity with dfm_trim().
Approach 1: I first created the dfm objects from the train and test tokens, then applied dfm_tfidf() to each:
dfmat_train <- dfm(traintokens)
dfmat_train_tfidf <- dfm_tfidf(dfmat_train)
dfmat_test <- dfm(testtokens)
dfmat_test_tfidf <- dfm_tfidf(dfmat_test)
Then, I simply used dfm_trim to remove sparse features.
dfmat_train_trimmed <- dfm_trim(dfmat_train, sparsity = 0.98)
dfmat_train_trimmed_tfidf <- dfm_trim(dfmat_train_tfidf, sparsity = 0.98)
dfmat_test_trimmed <- dfm_trim(dfmat_test, sparsity = 0.98)
dfmat_test_trimmed_tfidf <- dfm_trim(dfmat_test_tfidf, sparsity = 0.98)
Approach 2 was shorter: the tf-idf weighting is done after trimming.
dfmat_train <- dfm(traintokens, sparsity = 0.98)
dfmat_train_tfidf <- dfm_tfidf(dfmat_train)
dfmat_test <- dfm(testtokens, sparsity = 0.98)
dfmat_test_tfidf <- dfm_tfidf(dfmat_test)
After training models with both of the above approaches and predicting on the test sets, Approach 1 gave identical prediction performance metrics for the tf-idf and non-tf-idf test data, with a Cohen's kappa of 1. Approach 2 gave different (tf-idf vs. non-tf-idf) but less accurate predictions. I am puzzled. Which one is the right approach?
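For context, one ordering seen in quanteda workflows (a sketch of my own, not part of the question) is to trim the training dfm first, align the test dfm to the trimmed training features with dfm_match(), and apply dfm_tfidf() last, so both sets are weighted over the same feature set:
dfmat_train <- dfm(traintokens)
dfmat_train_trimmed <- dfm_trim(dfmat_train, sparsity = 0.98)
dfmat_train_tfidf <- dfm_tfidf(dfmat_train_trimmed)
# align the test features to the trimmed training features before weighting
dfmat_test <- dfm(testtokens)
dfmat_test_matched <- dfm_match(dfmat_test, features = featnames(dfmat_train_trimmed))
dfmat_test_tfidf <- dfm_tfidf(dfmat_test_matched)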


Error converting to stm after tf-idf weighting

For several dfms, I have no problem converting them to stm/lda/topicmodels format. However, if I weight the dfms with dfm_tfidf() before converting, I get the following error:
Error in convert.dfm(users_dfm, to = "stm") : cannot convert a non-count dfm to a topic model format
Any idea why this might be? I've tried different weighting schemes for both term and document frequency (to try and make the weighted dfm a 'count' dfm), but I keep getting the error.
So, this works:
users_dfm <- dfm(users_tokens)
users_stm <- convert(users_dfm, to = "stm")
But this doesn't:
users_dfm <- dfm(users_tokens)
weighted_dfm <- dfm_tfidf(users_dfm)
users_stm <- convert(weighted_dfm, to = "stm")
Thanks!
This is because topic models require counts as input: that is the nature of the statistical distribution assumed by the latent Dirichlet allocation model. tf-idf weighting turns the dfm into a matrix of non-integer values, which are not valid input for stm (or any other topic model).
So in short, don't weight your dfm before using it with a topic model.
You should also note that conversion of a dfm to the stm format is not strictly required, since stm::stm() can take a dfm object directly as an input.
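For illustration, a minimal sketch of both routes (K and the object names are placeholders):
library(quanteda)
library(stm)
# Route 1: convert the count dfm, then pass the pieces to stm()
users_stm <- convert(users_dfm, to = "stm")
fit1 <- stm(documents = users_stm$documents, vocab = users_stm$vocab, K = 10)
# Route 2: skip the conversion and hand the count dfm to stm() directly
fit2 <- stm(documents = users_dfm, K = 10)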

Should I use Halfcomplex2Real or Complex2Complex?

Good morning, I'm trying to perform a 2D FFT as two 1-dimensional FFTs.
The problem setup is the following:
There's a matrix of complex numbers generated by an inverse FFT on an array of real numbers; let's call it arr[-nx..+nx][-nz..+nz].
Now, since the original array was made up of real numbers, I exploit the symmetry and reduce my array to be arr[0..nx][-nz..+nz].
My problem starts here, with arr[0..nx][-nz..nz] provided.
Now I need to come back to the domain of real numbers.
The question is: what kind of transform should I use in the two directions?
In x I use fftw_plan_r2r_1d(..., FFTW_HC2R, ...), the halfcomplex-to-real transform, because in that direction I've exploited the symmetry, and that's fine, I think.
But in the z direction I can't figure out whether I should use the same transform or the complex-to-complex (C2C) transform.
Which is the correct one, and why?
In case it's needed, the HC2R transform is briefly described here, on page 11.
Thank you
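For reference, the underlying identity (a standard result, not something stated in this thread): the 2-D DFT factors into 1-D passes,
X[k_x, k_z] = \sum_{n_x} e^{-2\pi i k_x n_x / N_x} \Big( \sum_{n_z} e^{-2\pi i k_z n_z / N_z} \, x[n_x, n_z] \Big),
and for real input the Hermitian symmetry X[-k_x, -k_z] = \overline{X[k_x, k_z]} couples the two indices jointly, not separately. So once the x index has been halved, the remaining data along z are general complex sequences rather than halfcomplex ones.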
"To easily retrieve a result comparable to that of fftw_plan_dft_r2c_2d(), you can chain a call to fftw_plan_dft_r2c_1d() and a call to the complex-to-complex dft fftw_plan_many_dft(). The arguments howmany and istride can easily be tuned to match the pattern of the output of fftw_plan_dft_r2c_1d(). Contrary to fftw_plan_dft_r2c_1d(), the r2r_1d(...FFTW_HR2C...) separates the real and complex component of each frequency. A second FFTW_HR2C can be applied and would be comparable to fftw_plan_dft_r2c_2d() but not exactly similar.
As quoted on the page 11 of the documentation that you judiciously linked,
'Half of these column transforms, however, are of imaginary parts, and should therefore be multiplied by I and combined with the r2hc transforms of the real columns to produce the 2d DFT amplitudes; ... Thus, ... we recommend using the ordinary r2c/c2r interface.'
Since you have an array of complex numbers, you can either use c2r transforms or unfold real/imaginary parts and try to use HC2R transforms. The former option seems the most practical.Which one might solve your issue?"
-#Francis
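To make the recommended c2r route concrete, here is a minimal C sketch (my own, with placeholder sizes, not the asker's exact layout). Note that in FFTW's r2c/c2r layout it is the last dimension that is stored halved (N1/2 + 1 complex values per row), so the arr[0..nx][-nz..+nz] array may need the halved direction placed last, or a transpose, to match:
#include <fftw3.h>
/* Sketch: transform a half-spectrum complex array (N0 x (N1/2 + 1) elements)
 * back to a real N0 x N1 array with the ordinary c2r interface.
 * Multi-dimensional c2r transforms overwrite their input by default,
 * and the output is unnormalised (scale by 1/(N0*N1) if needed). */
void inverse_fft_2d(int N0, int N1, fftw_complex *half_spectrum, double *real_out)
{
    fftw_plan plan = fftw_plan_dft_c2r_2d(N0, N1, half_spectrum, real_out,
                                          FFTW_ESTIMATE);
    fftw_execute(plan);
    fftw_destroy_plan(plan);
}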

How to iterate through two numpy arrays of different dimensions

I am working with the MNIST dataset. x_test has dimensions (10000, 784) and y_test has dimensions (10000, 10). I need to iterate through each sample of these two numpy arrays at the same time, as I need to pass them individually to model.evaluate().
I tried nditer, but it throws an error saying the operands could not be broadcast together since they have different shapes.
score = []
for x_sample, y_sample in np.nditer([x_test, y_test]):
    a = x_sample.reshape(784, 1)
    a = np.transpose(a)
    b = y_sample.reshape(10, 1)
    b = np.transpose(b)
    s = model.evaluate(a, b, verbose=0)
    score.append(s)
Assuming that what you are actually trying to do here is to get the individual loss per sample in your test set, here is a way to do it (in your approach, even if you get past the iteration part, you will have issues with model.evaluate, which was not designed for single sample pairs)...
To make the example reproducible, here I also assume we have first run the Keras MNIST CNN example for only 2 epochs; so, the shape of our data is:
x_test.shape
# (10000, 28, 28, 1)
y_test.shape
# (10000, 10)
Given that, here is a way to get the individual loss per sample:
from keras import backend as K
y_pred = model.predict(x_test)
y_test = y_test.astype('float32') # necessary, as y_pred.dtype is 'float32'
y_test_tensor = K.constant(y_test)
y_pred_tensor = K.constant(y_pred)
g = K.categorical_crossentropy(target=y_test_tensor, output=y_pred_tensor)
ce = K.eval(g) # 'ce' for cross-entropy
ce
# array([1.1563368e-05, 2.0206178e-05, 5.4946734e-04, ..., 1.7662416e-04,
# 2.4232995e-03, 1.8954457e-05], dtype=float32)
ce.shape
# (10000,)
i.e. ce now contains what the score list in your question was supposed to contain.
For confirmation, let's calculate the loss for all test samples using model.evaluate:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
# Test loss: 0.050856544668227435
and again manually, averaging the values of the ce we have just calculated:
import numpy as np
log_loss = np.sum(ce)/ce.shape[0]
log_loss
# 0.05085654296875
The two values are not exactly equal (due to the different numeric precision involved in the two ways of calculation), but they are practically equal indeed:
log_loss == score[0]
# False
np.isclose(log_loss, score[0])
# True
Now, the adaptation of this to your own case, where the shape of x_test is (10000, 784), is arguably straightforward...
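For example, assuming your own model takes the flat 784-feature vectors directly, only the prediction line changes; the per-sample loss computation stays the same:
y_pred = model.predict(x_test)   # x_test of shape (10000, 784) in your case
y_test = y_test.astype('float32')
ce = K.eval(K.categorical_crossentropy(target=K.constant(y_test),
                                       output=K.constant(y_pred)))
# ce has shape (10000,): one cross-entropy value per test sample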
You are mixing up training features and testing labels. The training set has 60,000 samples and the test set has 10,000 samples (that is, your x_test should be of dimension (10000, 784)). Make sure you have downloaded all the correct data, and don't mix up training data with testing data.
http://yann.lecun.com/exdb/mnist/

Matching and factoring across data frames in R without For loop

I have two data frames, one of survey response options (levels) and one of the coded responses. Across data frames, the columns have the same names but not necessarily the same order. Also, within the levels data frame, questions may have different numbers of response options.
levels <- data.frame(restaurant = c("TACO BELL", "CHIPOTLE", ""),
                     would_recommend = c("YES", "NO", ""),
                     satisfaction = c("VERY SATISFIED", "SATISFIED", "UNSATISFIED"))
responses <- data.frame(satisfaction = c(2, 2, 1, 1, 3, 3, 2, 2),
                        would_recommend = c(1, 2, 1, 1, 2, 2, 2, 1),
                        restaurant = c(1, 2, 1, 2, 1, 2, 1, 2))
The responses are essentially factors whose levels are the same-named column in the levels table, so I would like to convert them to factors.
I know that I can do this by:
for (i in 1:length(responses)) {
  resp_levels <- levels[, match(names(responses)[i], names(levels))]
  responses[, i] <- factor(x = resp_levels[responses[, i]], levels = resp_levels)
}
Is there a clever way to do this without a For loop?
I generally agree with #gogolews that there is nothing wrong with a for loop if it works for you, especially a simple one like yours. However, if you really want a non-loop solution, here is one using the tidyr and dplyr packages. It may be faster on a really gigantic data set, but it's hard to say for sure:
library(dplyr)
library(tidyr)
First, gather responses into a long-format data.frame and add an id variable so that we know which responses go together later. We convert question to character so we can index the levels table by name later:
new_responses <- responses %>%
  mutate(id = row_number(restaurant)) %>%
  gather(question, response, -id) %>%
  mutate(question = as.character(question))
Now use dplyr to grab the appropriate level from the levels data.frame, then spread this back out to wide form using tidyr and drop the no-longer-needed id:
responses2 <- new_responses %>%
  rowwise() %>%
  mutate(response = as.character(levels[response, question])) %>%
  spread(question, response) %>%
  select(-id)
responses2
Source: local data frame [8 x 3]
restaurant satisfaction would_recommend
1 TACO BELL SATISFIED YES
2 TACO BELL VERY SATISFIED YES
3 TACO BELL UNSATISFIED NO
4 TACO BELL SATISFIED NO
5 CHIPOTLE SATISFIED NO
6 CHIPOTLE VERY SATISFIED YES
7 CHIPOTLE UNSATISFIED NO
8 CHIPOTLE SATISFIED YES
Note that the rows won't necessarily be in the same order as in the original, but it would be possible to restore the order by using the id variable to re-sort the new data.frame.
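Another loop-free option (a base-R sketch of my own, not from the answer above) is to do the same column-wise lookup with Map(), which also keeps the original row order:
# match each response column against the same-named column of 'levels'
responses3 <- as.data.frame(Map(
  function(code, lev) {
    lev <- as.character(lev)   # guard against factor columns (pre-R 4.0 default)
    factor(lev[code], levels = lev)
  },
  responses,
  levels[names(responses)]
))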

What is the most efficient way to read a CSV file into an Accelerate (or Repa) Array?

I am interested in playing around with the Accelerate library, and I would like to perform some operations on data stored inside of a CSV file. I've read this excellent introduction to Accelerate, but I'm not sure how I can go about reading CSVs into Accelerate efficiently. I've thought about this, and the only thing I can think of is to parse the entire CSV file into one long list, and then feed the entire list into Accelerate.
My data sets will be quite large, and it doesn't seem efficient to read a 1 GB+ file into memory only to copy it somewhere else. I noticed there is a csv-enumerator package on Hackage, but I'm not sure how to use it with Accelerate's generate function. Another constraint is that it seems the dimensions of the Array, or at least the number of elements, must be known before generating an array with Accelerate.
Has anyone dealt with this kind of problem before?
Thanks!
I am not sure if this is 100% applicable to accelerate or repa, but here is one way I've handled this for Vector in the past:
-- | A hopefully-efficient sink that incrementally grows a vector from the input stream
-- (imports assumed: Data.Conduit for ConduitM/await, Control.Monad.Trans.Class for lift,
--  Control.Monad.Primitive for PrimMonad, plus qualified Data.Vector.Generic as GV
--  and Data.Vector.Generic.Mutable as GMV)
sinkVector :: (PrimMonad m, GV.Vector v a) => Int -> ConduitM a o m (Int, v a)
sinkVector by = do
    v <- lift $ GMV.new by
    go 0 v
  where
    -- i is the index of the next element to be written by go
    -- also exactly the number of elements in v so far
    go i v = do
        res <- await
        case res of
            Nothing -> do
                v' <- lift $ GV.freeze $ GMV.slice 0 i v
                return $! (i, v')
            Just x -> do
                v' <- case GMV.length v == i of
                    True  -> lift $ GMV.grow v by
                    False -> return v
                lift $ GMV.write v' i x
                go (i + 1) v'
It basically allocates 'by' empty slots and proceeds to fill them. Once it hits the ceiling, it grows the underlying vector by that amount once again. I haven't benchmarked anything, but it appears to perform OK in practice. I am curious to see if there will be other, more efficient answers here.
Hope this helps in some way. I do see there's a fromVector function in repa and perhaps that's your golden ticket in combination with this method.
I haven't tried reading CSV files into repa, but I recommend using cassava (http://hackage.haskell.org/package/cassava). IIRC I had a 1.5 GB file which I used to create my stats. With cassava, my program ran in a surprisingly small amount of memory. Here's an extended example of usage:
http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-analysis-in-haskell/
In the case of repa, if you add rows incrementally to an array (which it sounds like you want to do) then one would hope the space usage would also grow incrementally. It certainly is worth an experiment. And possibly also contacting the repa folks. Please report back on your results :-)
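For what it's worth, here is a minimal, non-incremental sketch of the cassava route (my own, assuming a headerless CSV of Doubles with a known number of columns); repa's fromUnboxed wraps the flat vector, and Data.Array.Accelerate.fromList plays a similar role on the Accelerate side:
import qualified Data.ByteString.Lazy as BL
import qualified Data.Csv             as Csv
import qualified Data.Vector          as V
import qualified Data.Vector.Unboxed  as U
import qualified Data.Array.Repa      as R

-- Decode the whole file with cassava, flatten the rows, and wrap the
-- result in a 2-D repa array.  'cols' must match the number of fields
-- per row; the row count is taken from the decoded vector.
loadCsv :: FilePath -> Int -> IO (R.Array R.U R.DIM2 Double)
loadCsv path cols = do
  bytes <- BL.readFile path
  rows  <- either fail return
             (Csv.decode Csv.NoHeader bytes
                :: Either String (V.Vector (V.Vector Double)))
  let flat = U.convert (V.concatMap id rows)
  return (R.fromUnboxed (R.Z R.:. V.length rows R.:. cols) flat)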

Resources