Is the TF-IDF generated by sklearn's TfidfVectorizer incorrect?

This is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
textRaw = [
"good boy girl",
"good good good",
"good boy",
"good girl",
"good bad girl",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(textRaw)
allWords = vectorizer.get_feature_names_out()
dense = X.todense()
XList = dense.tolist()
df = pd.DataFrame(XList, columns=allWords)
dictionary = df.T.sum(axis=1)
print(dictionary)
Output:
bad 0.772536
boy 1.561542
girl 1.913661
good 2.870128
However, good appears in every document in the corpus. Its idf should be 0, which means its Tf-idf should also be 0. Why is the Tf-idf value of good calculated by TfidfVectorizer the highest?

From the sklearn documentation (emphasis mine):
The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding "1" to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ].)
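In other words, the "+1" keeps the idf of a term that occurs in every document at 1 rather than 0; and with the default smooth_idf=True the formula becomes idf(t) = log[(1 + n) / (1 + df(t))] + 1, which also evaluates to 1 for such a term. A minimal check on the corpus above (the manual idf computation is just for illustration):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

textRaw = [
    "good boy girl",
    "good good good",
    "good boy",
    "good girl",
    "good bad girl",
]
vectorizer = TfidfVectorizer()              # smooth_idf=True, norm='l2' by default
vectorizer.fit(textRaw)

n = len(textRaw)                            # 5 documents
df_good = 5                                 # "good" occurs in every document
print(np.log((1 + n) / (1 + df_good)) + 1)  # 1.0, not 0
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_)))
# {'bad': 2.098..., 'boy': 1.693..., 'girl': 1.405..., 'good': 1.0}
So "good" never gets an idf of 0, and since it is also by far the most frequent term (it appears in every document, three times in one of them), its column sum after L2 normalisation comes out highest.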

Related

mT5 transformer, how to access encoder to compute cosine similarity

This is my method. My question is: how can I call the encoder with two sentences at a time? I have a dataset that contains pairs of sentences, and I need to compute the similarity between each pair. Can anyone help?
from sentence_transformers import SentenceTransformer
import numpy as np
from numpy import dot
from numpy.linalg import norm

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
#Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']
sentence1 = ['This is an embedding for framework generation']
#Sentences are encoded by calling
embedding = model.encode(sentence)
embedding1 = model.encode(sentence1)
e = np.squeeze(np.asarray(embedding))
e1 = np.squeeze(np.asarray(embedding1))
#calculate Cosine Similarity
cos_sim = dot(e, e1)/(norm(e)*norm(e1))
print(cos_sim)
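To run this over a whole dataset of pairs, you do not need to send both sentences to the encoder at once: encode each side of the pair (in a batch, if you like) and compute the cosine similarity row by row. A minimal sketch, assuming the pairs live in two lists called sentences_a and sentences_b (placeholder names, replace with the two columns of your dataset):
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# hypothetical pair lists; one entry per sentence pair
sentences_a = ['This framework generates embeddings for each input sentence']
sentences_b = ['This is an embedding for framework generation']

# encode each side in one batch call; result shape is (n_pairs, dim)
emb_a = model.encode(sentences_a)
emb_b = model.encode(sentences_b)

# row-wise cosine similarity, one value per pair
sims = np.sum(emb_a * emb_b, axis=1) / (np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))
for pair, sim in zip(zip(sentences_a, sentences_b), sims):
    print(pair, sim)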

Why is Pymc3 ADVI worse than MCMC in this logistic regression example?

I am aware of the mathematical differences between ADVI and MCMC, but I am trying to understand the practical implications of using one or the other. I am running a very simple logistic regression example on data I created in this way:
import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
def logistic(x, b, noise=None):
    L = x.T.dot(b)
    if noise is not None:
        L = L + noise
    return 1 / (1 + np.exp(-L))
x1 = np.linspace(-10., 10, 10000)
x2 = np.linspace(0., 20, 10000)
bias = np.ones(len(x1))
X = np.vstack([x1,x2,bias]) # Add intercept
B = [-10., 2., 1.] # Sigmoid params for X + intercept
# Noisy mean
pnoisy = logistic(X, B, noise=np.random.normal(loc=0., scale=0., size=len(x1)))
# dichotomize pnoisy -- sample 0/1 with probability pnoisy
y = np.random.binomial(1., pnoisy)
And then I run ADVI like this:
with pm.Model() as model:
    # Define priors
    intercept = pm.Normal('Intercept', 0, sd=10)
    x1_coef = pm.Normal('x1', 0, sd=10)
    x2_coef = pm.Normal('x2', 0, sd=10)
    # Define likelihood
    likelihood = pm.Bernoulli('y',
                              pm.math.sigmoid(intercept + x1_coef*X[0] + x2_coef*X[1]),
                              observed=y)
    approx = pm.fit(90000, method='advi')
Unfortunately, no matter how much I increase the sampling, ADVI does not seem to be able to recover the original betas I defined, [-10., 2., 1.], while MCMC works fine (as shown below).
Thanks for the help!
This is an interesting question! The default 'advi' in PyMC3 is mean field variational inference, which does not do a great job capturing correlations. It turns out that the model you set up has an interesting correlation structure, which can be seen with this:
import arviz as az
az.plot_pair(trace, figsize=(5, 5))
PyMC3 has a built-in convergence checker - running the optimization for too long or too short can lead to funny results:
from pymc3.variational.callbacks import CheckParametersConvergence
with model:
    fit = pm.fit(100_000, method='advi', callbacks=[CheckParametersConvergence()])
    draws = fit.sample(2_000)
This stops after about 60,000 iterations for me. Now we can inspect the correlations and see that, as expected, ADVI fit axis-aligned gaussians:
az.plot_pair(draws, figsize=(5, 5))
Finally, we can compare the fit from NUTS and (mean field) ADVI:
az.plot_forest([draws, trace])
Note that ADVI underestimates the variance, but it is fairly close on the mean of each parameter. Also, you can set method='fullrank_advi' to capture the correlations you are seeing a little better.
(note: arviz is soon to be the plotting library for PyMC3)
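For completeness, a minimal sketch of the full-rank variant, reusing the model defined above (the iteration and draw counts are just illustrative):
with model:
    fullrank_fit = pm.fit(100_000, method='fullrank_advi',
                          callbacks=[CheckParametersConvergence()])
    fullrank_draws = fullrank_fit.sample(2_000)

az.plot_pair(fullrank_draws, figsize=(5, 5))  # full-rank ADVI can model the correlations between parameters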

Effectively derive term co-occurrence matrix from Google Ngrams

I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The resulting tcm would then be used to measure a bunch of lexical statistics and serve as input into vector semantics methods (Glove, LSA, LDA).
For reference, the Google Books (v2) dataset is formatted as follows (tab-separated)
ngram year match_count volume_count
some word 1999 32 12 # example bigram
However, the problem is, of course, that these data are huge. That said, I will only need a subset of the data from certain decades (about 20 years' worth of ngrams), and I am happy with a context window of up to 2 (i.e., using the trigram corpus). I have a few ideas, but none of them seem particularly, well, good.
-Idea 1- initially was more or less this:
# preprocessing (pseudo)
for file in trigram-files:
    download $file
    filter $lines where 'year' tag matches one of years of interest
    find the frequency of each of those ngrams (match_count)
    cat those $lines * $match_count >> file2
    # (write the same line x times according to the match_count tag)
    remove $file
# tcm construction (using R)
grams <- # read lines from file2 into list
library(text2vec)
# treat lines (ngrams) as documents to avoid unrelated ngram overlap
it <- itoken(grams)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2)
tcm <- create_tcm(it, vectorizer) # nice and sparse
However, I have a hunch this might not be the best solution. The ngram data files already contain the co-occurrence data in the form of n-grams, and there is a tag that gives the frequency. I have a feeling there should be a more direct way.
-Idea 2- I was also thinking of cat'ing each filtered ngram only once into the new file (instead of replicating it match_count times), then creating an empty tcm and then looping over the whole (year-filtered) ngram dataset and record instances (using the match_count tag) where any two words co-occur to populate the tcm. But, again, the data is big, and this kind of looping would probably take ages.
-Idea 3- I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given most entries are 0s), and (if I got it right) it simply loops through everything (and I assume a Python loop over this much data would be superslow), so it seems to be more aimed at rather smaller subsets of data.
edit -Idea 4- Came across this old SO question asking about using Hadoop and Hive for a similar task, with a short answer with a broken link and a comment about MapReduce (neither of which I am familiar with, so I would not know where to start).
But I'm thinking I can't be the first one with the need to tackle such a task, given the popularity of the Ngram dataset, and the popularity of (non-word2vec) distributed semantics methods that operate on a tcm or dtm input; hence ->
...the question: what would be a more reasonable/effective way of constructing a term-term co-occurrence matrix from Google Books Ngram data? (be it a variation of the proposed ideas or something completely different; R preferred but not necessary)
I will give an idea of how you can do this, but it can be improved in several places. I deliberately wrote it in a "spaghetti style" for better readability, but it can be generalized to more than tri-grams.
library(data.table)
library(magrittr)  # for the %>% pipe
ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54))
# here we split tri-grams to obtain words
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = T) %>% simplify2array()
# vocab here is vocabulary from chunk, but you can be interested first
# to create vocabulary from whole corpus of ngrams and filter non
# interesting/rare words
vocab = unique(tokens_matrix)
# convert char matrix to integer matrix for faster downstream calculations
tokens_matrix_int = match(tokens_matrix, vocab)
dim(tokens_matrix_int) = dim(tokens_matrix)
ngram_dt[, token_1 := tokens_matrix_int[1, ]]
ngram_dt[, token_2 := tokens_matrix_int[2, ]]
ngram_dt[, token_3 := tokens_matrix_int[3, ]]
dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)]
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)]
# note here 0.5 - discount for more distant word - we follow text2vec discount of 1 / distance
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)]
dt = rbindlist(list(dt_12, dt_13, dt_23))
# "reduce" by word indices again - sum pair co-occurences which were in different tri-grams
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)]
tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt, dims = rep(length(vocab), 2), index1 = T,
giveCsparse = F, check = F, dimnames = list(vocab, vocab))
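Since R is preferred but not required, here is a rough Python equivalent of the same aggregation, just as a sketch (pandas and scipy assumed available; the two-row data frame stands in for one year-filtered trigram chunk):
import pandas as pd
from scipy.sparse import coo_matrix

chunk = pd.DataFrame({"ngram": ["as we know", "i know you"],
                      "match_count": [32, 54]})

tokens = chunk["ngram"].str.split(" ", expand=True)        # columns 0, 1, 2
vocab = pd.unique(tokens.values.ravel())
index = {w: i for i, w in enumerate(vocab)}
ids = tokens.apply(lambda col: col.map(index))             # word -> integer id

rows, cols, vals = [], [], []
for a, b, weight in [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 0.5)]:  # 1/distance discount
    rows.extend(ids[a])
    cols.extend(ids[b])
    vals.extend(chunk["match_count"] * weight)

# duplicate (i, j) pairs are summed when converting COO to CSR
tcm = coo_matrix((vals, (rows, cols)), shape=(len(vocab), len(vocab))).tocsr()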

How to get the tfidf vector for a given document

I have the following file:
id review
1 "Human machine interface for lab abc computer applications."
2 "A survey of user opinion of computer system response time."
3 "The EPS user interface management system."
4 "System and human system engineering testing of EPS."
5 "Relation of user perceived response time to error measurement."
6 "The generation of random binary unordered trees."
7 "The intersection graph of paths in trees."
8 "Graph minors IV Widths of trees and well quasi ordering."
9 "Graph minors A survey."
10 "survey is a state of art."
Each row concerns a document.
I convert these documents to a corpus and, for each word, I find its TF-IDF:
from collections import defaultdict
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
reviews = defaultdict(list)
with open("C:/Users/user/workspacePython/Tutorial/data/unlabeledTrainData.tsv", "r") as sentences_file:
    reader = csv.reader(sentences_file, delimiter='\t')
    reader.next()
    for row in reader:
        reviews[row[1]].append(row[1])

for id, review in reviews.iteritems():
    reviews[id] = " ".join(review)

corpus = []
for id, review in sorted(reviews.iteritems(), key=lambda t: id):
    corpus.append(review)

tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english')
tfidf_matrix = tf.fit_transform(corpus)
My question is: how can I get, for a given document (from the above file), its corresponding vector (row) in the tfidf_matrix?
Thank you
You have a list of documents, 1 to 10. That's 0 to 9 in array-index terms.
The variable tfidf_matrix will contain a sparse-row matrix consisting of rows (representing documents) and their normalised association with the vocabulary across the corpus (minus English stop words).
So to convert the sparse array into a more traditional matrix, you could try
npm_tfidf = tfidf_matrix.todense()
document_1_vector = npm_tfidf[0]
document_2_vector = npm_tfidf[1]
document_3_vector = npm_tfidf[2]
...
document_10_vector = npm_tfidf[9]
There are easier and better ways to extract the contents, but I suppose the part that's getting in your way is the conversion from the sparse matrix representation, which can be tricky to unpick, to the more traditional dense matrix representation.
Note also that interpreting the vectors requires being able to extract the vocabulary built during fitting - this is an ordered (alphabetical) list of tokens and is extracted from the vectorizer, not from the matrix:
vocabulary = tf.get_feature_names()  # tf.get_feature_names_out() in newer scikit-learn versions
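You can also slice a single row out of the sparse matrix directly, without densifying everything; a small sketch (0-based indices as above):
row_index = 3                                   # document 4 in the file
document_4_vector = tfidf_matrix[row_index]     # still sparse, shape (1, n_terms)
print(document_4_vector.toarray())              # dense 1 x n_terms row, if needed

# pair each non-zero weight with its term
vocabulary = tf.get_feature_names()
for term_idx, weight in zip(document_4_vector.indices, document_4_vector.data):
    print(vocabulary[term_idx], weight)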

Theano - logistic regression example weight vector becomes NaN?

I am doing a tutorial (code here) and video here (13:00 minutes in).
My only change is using the MNIST training set from a different location (and creating a one-hot encoding), but it is not working. I literally copy-pasted all the code in this example (except for the MNIST loading). Here is the code:
import theano
from theano import tensor as T
import numpy as np
from numpy import newaxis
from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

mnist = fetch_mldata("MNIST Original")
trX, teX, trY_digit, teY_digit = train_test_split(mnist.data, mnist.target, test_size=.4)

# Get one-hot encoding
enc = OneHotEncoder()
enc.fit([[n] for n in range(10)])
# sparse_to_floatX is my own helper: it densifies the sparse one-hot matrix and casts it to theano.config.floatX
trY, teY = sparse_to_floatX(enc.transform(trY_digit[:, newaxis])), sparse_to_floatX(enc.transform(teY_digit[:, newaxis]))
def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)
def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.1))
def model(X, w):
    return T.nnet.softmax(T.dot(X, w))
X = T.fmatrix()
Y = T.fmatrix()
w = init_weights((784, 10))
py_x = model(X, w)
y_pred = T.argmax(py_x, axis=1)
cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))
gradient = T.grad(cost=cost, wrt=w)
update = [[w, w - gradient * 0.05]]
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
for i in range(10):
    print w.get_value()
    cost = train(trX, trY)
    print i, predict(teX)
The weight vector updates once, and becomes all NaN on the second update. I am very new to theano, but I am looking for tips to figure this out, especially if someone has already done this tutorial.
UPDATE.
It looks like the gradient is the issue.
When I add this
the_grad = T.sum(gradient)
f_grad = theano.function(inputs=[X, Y], outputs=the_grad, allow_input_downcast=True)
print f_grad(trX, trY)
It prints NaN. This appears to be the correct usage of T.grad though.
UPDATE 2.
When I change the cost function to this:
cost = T.mean(T.sum(T.sqr(py_x - Y), axis=1), axis=0)
It works now, but I only get 70% accuracy, which is really bad.
UPDATE 3.
I downloaded the MNIST data used in the tutorial and it worked, with 92% accuracy.
I am not sure why my first MNIST data source was dying with the cross-entropy cost and then performing really poorly with the mean squared error cost function.
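One likely culprit (not confirmed in the thread, just a common cause of this exact symptom): fetch_mldata returns raw pixel values in the range 0-255, while the tutorial's own MNIST loader, if I recall correctly, scales them to [0, 1]. Unscaled inputs saturate the softmax, some predicted probabilities underflow to exactly 0, and the log inside the cross-entropy turns into NaN. A minimal sketch of the fix, continuing from the snippet above:
# scale the pixels to [0, 1] before splitting (fetch_mldata gives values in 0-255)
data = mnist.data.astype(np.float32) / 255.0
trX, teX, trY_digit, teY_digit = train_test_split(data, mnist.target, test_size=.4)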

Resources