Error converting to stm after tf-idf weighting

For several dfms, I have no problem converting them to stm/lda/topicmodels format. However, if I weight the dfms with dfm_tfidf() before converting, I get the following error:
Error in convert.dfm(users_dfm, to = "stm") : cannot convert a non-count dfm to a topic model format
Any idea why this might be? I've tried different weighting schemes for both term and document frequency (to try to make the weighted dfm a 'count' dfm), but I keep getting the error.
So, this works:
users_dfm <- dfm(users_tokens)
users_stm <- convert(users_dfm, to = "stm")
But this doesn't:
users_dfm <- dfm(users_tokens)
weighted_dfm <- dfm_tfidf(users_dfm)
users_stm <- convert(weighted_dfm, to = "stm")
Thanks!

This is because topic models require counts as inputs: that is the nature of the statistical distribution assumed by the latent Dirichlet allocation model. tf-idf weighting turns the dfm's integer counts into non-integer values, which are not valid inputs for stm (or any other topic model).
So in short, don't weight your dfm before using it with a topic model.
You should also note that conversion of a dfm to the stm format is not strictly required, since stm::stm() can take a dfm object directly as an input.

Related

Elimination of need to retrain models in "Shapley Sampling Values"

While reading the paper "A Unified Approach to Interpreting Model Predictions" by Lundberg and Lee (https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf), on page 3 I see:
Shapley sampling values are meant to explain any model by: (1) applying sampling approximations to Equation 4, and (2) approximating the effect of removing a variable from the model by integrating over samples from the training dataset. This eliminates the need to retrain the model and allows fewer than 2^|F| differences to be computed. Since the explanation model form of Shapley sampling values is the same as that for Shapley regression values, it is also an additive feature attribution method.
My question is: how does sampling from the training dataset eliminate the need to retrain models? It is not obvious to me and I cannot think of a mathematical proof. Any reference or explanation would be greatly appreciated. My internet searches have been unsuccessful. Thank you.

VAE input data scaling

Variational autoencoders (VAEs) are quite a heavy concept in themselves. Unsurprisingly, most posts, comments, and tutorials focus on the theory and architecture, but most also fail to address the topic of data scaling. While experimenting with VAEs I came across a (to me) surprising red flag: the way the data fed into a VAE is scaled is very important, and I could not get my head around why.
To visualize the issue described below, see the notebook here: https://github.com/mobias17/VAE-Input-Scaling/blob/master/VAE%20Input%20Scaling.ipynb
Let’s assume the goal is to reconstruct a sine wave (e.g. a sound wave) by a VAE. When I feed the standardized data through the model, it is only able to approximate values between -1 and 1. Obviously, the quick answer would be to normalize the data. Still, this leads to the following questions:
1) What is the rationale for the VAE only being able to approximate values between -1 and 1? (Is it the Gaussian reparameterization? Vanishing gradients?)
2) Is there a way to overcome this boundary (model changes)?
3) What is the best practice for scaling data for a VAE? Should the data be normalized by its standard deviation?
[Figure: plot of the results, showing that the model's outputs are bounded between -1 and 1]
Variational autoencoders can approximate values in any range. The problem here is with this particular model's architecture.
The decoder of this VAE uses keras.layers.LSTM as its last layer. This layer's default activation function is tanh, and the tanh function outputs values in the range (-1, 1). This is why the model cannot generate values outside that range.
However, if we change the activation function to, say, linear, replacing
decoder_mean = LSTM(input_dim, return_sequences=True)
with
decoder_mean = LSTM(input_dim, return_sequences=True, activation=None)
the VAE can now approximate the data. This is the result I got after training for 100 epochs.
The general recommendation is to ensure that the data you are trying to approximate lies in the range of the function you are using to approximate it, either by scaling the data or choosing a more expressive function.
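As a quick numerical illustration (this sketch is mine, not from the notebook above) of why the output activation's range matters:
import numpy as np

x = np.linspace(-5, 5, 1001)  # pre-activation values

# tanh saturates: however large the pre-activation, the output stays
# strictly inside (-1, 1)
print(np.tanh(x).min(), np.tanh(x).max())

# a linear activation (activation=None in Keras) passes values through,
# so the decoder can emit targets of any magnitude
print(x.min(), x.max())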

Document classification with incomplete training set

Advice please. I have a collection of documents that all share a common attribute (e.g. the word French appears). Some of these documents have been marked as not pertinent to this collection (e.g. French kiss appears), but not all such documents are guaranteed to have been identified. What is the best method to figure out which other documents don't belong?
Assumptions
Given your example "French", I will work under the assumption that the feature is a word that appears in the document. Also, since you mention that "French kiss" is not relevant, I will further assume that in your case, a feature is a word used in a particular sense. For example, if "pool" is a feature, you may say that documents mentioning swimming pools are relevant, but those talking about pool (the sport, like snooker or billiards) are not relevant.
Note: Although word sense disambiguation (WSD) methods would work, they require too much effort and would be overkill for this purpose.
Suggestion: localized language model + bootstrapping
Think of it this way: You don't have an incomplete training set, but a smaller training set. The idea is to use this small training data to build bigger training data. This is bootstrapping.
For each occurrence of your feature in the training data, build a language model based only on the words surrounding it. You don't need to build a model for the entire document. Ideally, just the sentences containing the feature should suffice. This is what I am calling a localized language model (LLM).
Build two such LLMs from your training data (let's call it T_0): one for pertinent documents, say M1, and another for irrelevant documents, say M0. Now, to build a bigger training data, classify documents based on M1 and M0. For every new document d, if d does not contain the feature-word, it will automatically be added as a "bad" document. If d contains the feature-word, then consider a local window around this word in d (the same window size that you used to build the LLMs), and compute the perplexity of this sequence of words with M0 and M1. Classify the document as belonging to the class which gives lower perplexity.
To formalize, the pseudo-code is:
T_0 := initial training set (consisting of relevant/irrelevant documents)
D_0 := additional data to be bootstrapped
N   := number of bootstrapping iterations

for i = 0 to N-1
    T_i+1 := empty training set
    build M0 and M1 from T_i as discussed above, using a window size w
    for d in D_0
        if feature-word not in d
            add d to irrelevant documents of T_i+1
        else
            compute perplexity scores P0 and P1 corresponding to M0 and M1,
                using window size w around the feature-word in d
            if P0 < P1 - delta
                add d to irrelevant documents of T_i+1
            else if P1 < P0 - delta
                add d to relevant documents of T_i+1
            else
                do not use d in T_i+1
            end
        end
    end
    select a small random sample from the relevant and irrelevant documents
        in T_i+1, and (re)classify them manually if required
end
T_N is your final training set. In the above bootstrapping, the parameter delta needs to be determined by experiments on some held-out data (also called development data).
The manual reclassification of a small sample is done so that the noise introduced during this bootstrapping does not accumulate through all N iterations.
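To make the perplexity comparison concrete, here is a minimal Python sketch assuming add-one-smoothed unigram models over the local windows (the LLMs could equally be higher-order n-gram models; the function names and the unigram choice are illustrative, not part of the answer above):
import math
from collections import Counter

def local_window(tokens, feature, w):
    # collect the tokens within w positions of each occurrence of the feature word
    out = []
    for i, t in enumerate(tokens):
        if t == feature:
            out.extend(tokens[max(0, i - w):i + w + 1])
    return out

def train_llm(docs, feature, w):
    # localized language model: unigram counts over the local windows only
    counts = Counter()
    for tokens in docs:
        counts.update(local_window(tokens, feature, w))
    return counts, sum(counts.values())

def perplexity(tokens, llm):
    counts, total = llm
    v = len(counts) + 1  # vocabulary size, +1 for unseen words (add-one smoothing)
    log_p = sum(math.log((counts[t] + 1) / (total + v)) for t in tokens)
    return math.exp(-log_p / max(len(tokens), 1))

def classify(d, feature, w, m0, m1, delta):
    # d is a token list containing the feature word
    seq = local_window(d, feature, w)
    p0, p1 = perplexity(seq, m0), perplexity(seq, m1)
    if p0 < p1 - delta:
        return "irrelevant"
    if p1 < p0 - delta:
        return "relevant"
    return None  # too close to call; leave d out of T_i+1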
First, you should take care with how you extract features from the sample documents. Counting every word is not a good way; you may need a technique like TF-IDF to teach the classifier which words are important for classification and which are not.
Build the right dictionary. In your case, the phrase "French kiss" should be a single token instead of the sequence French + kiss; using the right technique to build the dictionary is important.
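As a concrete illustration of both points, here is a small sketch using scikit-learn's TfidfVectorizer (the library choice and the toy documents are my assumptions, not the answerer's): an n-gram range of (1, 2) lets a phrase like "french kiss" become a feature of its own.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the french countryside is lovely in spring",
    "the film opens with a french kiss scene",
]

# ngram_range=(1, 2) makes two-word phrases such as "french kiss"
# features in their own right, not just "french" plus "kiss"
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)
print(sorted(vec.get_feature_names_out()))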
The remaining errors in the samples are normal; we call such data "not linearly separable". There is a huge amount of advanced research on how to deal with this problem. For example, a support vector machine (SVM) may be what you want to use. Please note that a single-layer Rosenblatt perceptron usually performs very badly on data sets that are not linearly separable.
Some kinds of neural networks (like the Rosenblatt perceptron) can be trained on an erroneous data set and still perform better than the trainer that labeled it. Moreover, in many cases you should allow for some errors to avoid over-training.
You can label all unlabeled documents randomly, train several nets, and estimate their performance on the test set (of course, you should not include unlabeled documents in the test set). After that you can iteratively recalculate the weights of the unlabeled documents as w_i = sum_j quality(j) * w_ij, then repeat the training and the weight recalculation, and so on. Because this procedure is equivalent to introducing a new hidden layer and recalculating its weights by a Hebbian procedure, the overall process should converge if your positive and negative sets are linearly separable in some network feature space.

nominal-value inputs for Neural Network

I have a set of training data; each item in this set consists of 4 numerical values and 1 nominal value, which is the name of the method those values were calculated with. (There are 8 methods.)
I'm training a neural network with these. To get rid of the nominal value I simply assigned a value from 1 to 8 to each method, passed it to the network through one input, and used 4 other inputs for the numerical values. It is sort of working, but the result is not as amazing as I want.
So my question is: could it be because of this simple assignment of numbers to nominal values? Or maybe it is because I am mixing two different categories of inputs that are not really on the same level (numbers and method types)?
As a general note, a better way for coding nominal values would be a binary vector. In your case, in addition to the 4 continuous-valued inputs, you'd have 8 binary input neurons, where only one is activated (1) and the other 7 are inactive.
The way you did it implies an artificial relationship between the computation methods, which is almost certainly an artifact. For example, 1 and 2 are numerically (and from your network's point of view!) nearer than 1 and 8. But are methods 1 and 2 really more similar, or more closely related, than methods 1 and 8?
Since you don't provide much detail, my answer can't be very specific.
Generally speaking, neural networks tend to perform worse when nominal values are coded as numeric values, since the transformation imposes a (probably) false ordering on the variables. Mixing inputs with very different levels also tends to worsen performance.
However, given the little information provided here there is no way of telling if this is the reason that the networks performance is "not as amazing" as you want. It could just as well be the case that you don't have enough training data, or that your training data contains a lot of noise. Perhaps you need to pre-scale your data, perhaps there is an error in your network code, perhaps you have chosen ill-suited values of constants for your learning algorithm...
The reasons a neural network doesn't perform as expected are many and diverse (one of them being unreasonably high expectations). Without much more information there is no way of knowing what the problem is in your case.
Mapping categories to numerical values is not a good practice in statistics. Especially in the case of neural networks. Bear in mind that neural networks tend to map similar inputs to similar outputs. If you map category A to 1 and category B to 2 (both as inputs), the NN will try to output similar values for both categories, even if they have nothing to do with each other.
A sparser representation is preferred. If you have 4 categories, map them like this:
A -> 0001
B -> 0010
etc
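A minimal sketch of this encoding applied to the original question (8 methods plus 4 numeric inputs; the method names and the encode helper are made up for illustration):
import numpy as np

METHODS = ["method_1", "method_2", "method_3", "method_4",
           "method_5", "method_6", "method_7", "method_8"]

def encode(method, numeric_values):
    # one-hot indicator for the method, concatenated with the 4 numeric inputs
    onehot = np.zeros(len(METHODS))
    onehot[METHODS.index(method)] = 1.0
    return np.concatenate([onehot, numeric_values])

x = encode("method_3", np.array([0.12, 3.4, -1.7, 0.9]))
print(x.shape)  # (12,): 8 binary inputs + 4 continuous inputs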
Take a look at the "Subject: How should categories be encoded?" in this link:
ftp://ftp.sas.com/pub/neural/FAQ2.html#A_cat
The previous answers are right - do not map nominal values to arbitrary numeric ones. However, if the attribute has an ordinal nature ("Low", "Medium", "High", for example), you can replace the nominal values by ascending numeric values. Note that this may not be the optimal solution - there is no guarantee, for example, that "High" = 3 by the nature of your data. Instead, use one-hot bit encoding as suggested.
The reason for this is that a neural network is very similar to regression in the sense that multiple numeric values go through some kind of an aggregating function - but this happens multiple times. Each input is also multiplied by a weight.
So when you enter a numeric value, it undergoes a series of mathematical manipulations that adjust the weights in the network. So if you use numeric values for nominal data, nominal values that were mapped to nearby numbers will at best be treated as roughly the same; at worst, it can harm your model.

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working in a chemistry/biology project. We are building a web-application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance: (7.2394, 2), (7.4011, 1), (9.9367, 3), ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem when comparing functions; see, for example:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that much data in two or three milliseconds.
But if you really need to optimise this, you could construct a hash table keyed on the integer values, which would divide the job by the number of integer bins. And if the data is stored sorted by the floats, that improves the locality of matching on those: you know you can stop once you're out of tolerance, and storing the offsets of each of a number of bins would give you a position to start from.
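A minimal sketch of that bucketing idea (the variable names are illustrative):
import random
from collections import defaultdict

# reference data: (float, integer) tuples as described in the question
refs = [(random.uniform(0, 20), random.randint(1, 18)) for _ in range(1000000)]

# hash table keyed by the integer value: a query tuple only has to be
# compared against the bucket sharing its integer, dividing the work
# by roughly the number of distinct integer values
buckets = defaultdict(list)
for f, k in refs:
    buckets[k].append(f)
for floats in buckets.values():
    floats.sort()  # sorted floats allow stopping once out of tolerance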
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random

# a million reference entries: (float in [0, 20], integer in [1, 18])
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]

# this is a decorate-sort-undecorate pattern
# look for matches to (7, 9)
# obviously, you can use whatever distance expression you want
zz = [(abs(7 - x) + abs(9 - y), x, y) for x, y in r]
zz.sort()

# return the 50 best matches
print([(x, y) for a, x, y in zz[:50]])
Can't you sort the tuples and perform a binary search on the sorted array?
I assume your database is built once and for all, and the positions of the entries are not important, so you can sort the array so that the tuples are in a given order. When a tuple is entered by the user, you just look at the middle of the sorted array: if the query value is larger than the center value, you repeat the search on the upper half, otherwise on the lower half.
The worst case is O(log n).
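A sketch of that idea with Python's bisect module, assuming the reference floats are kept pre-sorted; a tolerance window is added, since an exact binary search alone would miss near matches:
import bisect

def within_tolerance(sorted_floats, query, tol):
    # indices of all reference floats within +/- tol of the query value
    lo = bisect.bisect_left(sorted_floats, query - tol)
    hi = bisect.bisect_right(sorted_floats, query + tol)
    return range(lo, hi)

refs = sorted([7.2394, 7.4011, 9.9367, 12.02, 7.41])
print(list(within_tolerance(refs, 7.40, 0.02)))  # indices of 7.4011 and 7.41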
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are still testing, and want to offer the user a choice of bin sizes: 0.1, 0.2, 0.3 or 0.4. Binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data: for example, for every bin, subtract the query integer value from the reference integer value. By summing up all the differences we get the similarity score, with the most similar reference entries producing the lowest scores.
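Here is a minimal sketch of that binning and scoring (one assumption on my part: the per-bin differences are summed as absolute values, since summing signed differences would let positive and negative errors cancel):
import numpy as np

def bin_entry(tuples, bin_size=0.2, max_value=20.0):
    # map a list of (float, int) tuples onto fixed-width bins; each bin
    # stores the integer value, and 0 means the bin is empty
    n_bins = int(max_value / bin_size)
    bins = np.zeros(n_bins, dtype=int)
    for f, k in tuples:
        bins[min(int(f / bin_size), n_bins - 1)] = k
    return bins

def similarity_score(query_bins, ref_bins):
    # lower score = more similar
    return int(np.abs(query_bins - ref_bins).sum())

q = bin_entry([(7.2394, 2), (7.4011, 1), (9.9367, 3)])
ref = bin_entry([(7.2551, 2), (7.4102, 1), (9.9012, 3)])
print(similarity_score(q, ref))  # 0: all floats fall into the same bins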
Another (simpler) search option we want to offer is one where the user only enters the float values. The integer values in both the query and the reference list are then set to 1, and we use the Hamming distance to compute the difference between the query and the reference binned values. I have previously asked about an efficient algorithm for that search.
This binning is only one way of achieving our goal, and I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here.

Resources