How to get the tfidf vector for a given document - tf-idf

I have the following file:
id review
1 "Human machine interface for lab abc computer applications."
2 "A survey of user opinion of computer system response time."
3 "The EPS user interface management system."
4 "System and human system engineering testing of EPS."
5 "Relation of user perceived response time to error measurement."
6 "The generation of random binary unordered trees."
7 "The intersection graph of paths in trees."
8 "Graph minors IV Widths of trees and well quasi ordering."
9 "Graph minors A survey."
10 "survey is a state of art."
Each row corresponds to a document.
I convert these documents to a corpus and compute the TF-IDF of each word:
from collections import defaultdict
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
reviews = defaultdict(list)
with open("C:/Users/user/workspacePython/Tutorial/data/unlabeledTrainData.tsv", "r") as sentences_file:
    reader = csv.reader(sentences_file, delimiter='\t')
    next(reader)  # skip the header row
    for row in reader:
        reviews[row[0]].append(row[1])  # key by document id, not by the review text
for id, review in reviews.items():
    reviews[id] = " ".join(review)
corpus = []
for id, review in sorted(reviews.items(), key=lambda t: int(t[0])):  # keep file order
    corpus.append(review)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(corpus)
My question is: how can I get, for a given document (from the above file), its corresponding vector (row) in the tfidf_matrix?
Thank you.

You have a list of documents, 1 to 10. That's 0 to 9 in array-index terms.
The variable tfidf_matrix will contain a sparse row-format matrix whose rows represent documents and whose entries give each document's normalised association with the vocabulary of the corpus (minus English stop words).
So to convert the sparse array into a more traditional matrix, you could try
npm_tfidf = tfidf_matrix.todense()
document_1_vector = npm_tfidf[0]
document_2_vector = npm_tfidf[1]
document_3_vector = npm_tfidf[2]
...
document_10_vector = npm_tfidf[9]
There are easier and better ways to extract the contents, but I suspect the part getting in your way is the conversion between the sparse matrix representation, which can be tricky to unpick, and the more traditional dense matrix representation.
Note also that interpreting the vectors requires the vocabulary built during fitting - this is an alphabetically ordered list of tokens and can be retrieved using:
vocabulary = tf.get_feature_names()
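For example, here is a rough sketch of pulling out and interpreting a single document's row, assuming the tf and tfidf_matrix objects built above (doc_index is just an example value; in newer scikit-learn versions the vocabulary method is get_feature_names_out()):
import numpy as np

doc_index = 3                     # e.g. the 4th document, 0-based row index
row = tfidf_matrix[doc_index]     # sparse 1 x n_features row for that document
vocabulary = tf.get_feature_names()

# list the document's non-zero terms, highest tf-idf weight first
dense_row = np.asarray(row.todense()).ravel()
for j in np.argsort(dense_row)[::-1]:
    if dense_row[j] > 0:
        print(vocabulary[j], dense_row[j])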

Related

The TF-IDF generated by the TfidfVectorizer of sklearn is incorrect?

This is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
textRaw = [
    "good boy girl",
    "good good good",
    "good boy",
    "good girl",
    "good bad girl",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(textRaw)
allWords = vectorizer.get_feature_names_out()
dense = X.todense()
XList = dense.tolist()
df = pd.DataFrame(XList, columns=allWords)
dictionary = df.T.sum(axis=1)
print(dictionary)
Output:
bad 0.772536
boy 1.561542
girl 1.913661
good 2.870128
However, good appears in every document in the corpus. Its idf should be 0, which means its Tf-idf should also be 0. Why is the Tf-idf value of good calculated by TfidfVectorizer the highest?
From the sklearn documentation (emphasis mine):
The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding "1" to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ].)
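Concretely, the default here is smooth_idf=True, so the idf actually applied is log[(1 + n) / (1 + df(t))] + 1, which comes out to 1 (not 0) for a term that occurs in every document. A quick sketch to verify this against the fitted vectorizer, reusing the corpus from the question:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

textRaw = ["good boy girl", "good good good", "good boy", "good girl", "good bad girl"]
vectorizer = TfidfVectorizer()    # smooth_idf=True by default
vectorizer.fit(textRaw)

idx_good = list(vectorizer.get_feature_names_out()).index("good")
print(vectorizer.idf_[idx_good])          # 1.0
print(np.log((1 + 5) / (1 + 5)) + 1)      # the same value by hand: n = 5, df(good) = 5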

How to efficiently store 1 million words and query them by starts_with, contains, or ends_with?

How do sites like this store tens of thousands of words "containing c", or like this, "words with d and c", or even further, "unscrambling" the word like CAUDK and finding that the database has duck. Curious from an algorithms/efficiency perspective how they would accomplish this:
Would a database be used, or would the words simply be stored in memory and quickly traversed? If a database was used (and each word was a record), how would you make these sorts of queries (with PostgreSQL for example, contains, starts_with, ends_with, and unscrambles)?
I guess the easiest thing to do would be to store all words in memory (sorted?), and just traverse the whole million or less word list to find the matches? But how about the unscramble one?
Basically wondering the efficient way this would be done.
"Containing C" amounts to count(C) > 0. Unscrambling CAUDC amounts to count(C) <= 2 && count(A) <= 1 && count(U) <= 1 && count(D) <= 1. So both queries could be efficiently answered by a database with 26 indices, one for the count of each letter in the alphabet.
Here is a quick and dirty python sqlite3 demo:
from collections import defaultdict, Counter
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

alphabet = [chr(ord('A')+i) for i in range(26)]
alphabet_set = set(alphabet)

columns = ['word TEXT'] + [f'{c}_count TINYINT DEFAULT 0' for c in alphabet]
create_cmd = f'CREATE TABLE abc ({", ".join(columns)})'
cur.execute(create_cmd)
for c in alphabet:
    cur.execute(f'CREATE INDEX {c}_index ON abc ({c}_count)')

def insert(word):
    counts = Counter(word)
    columns = ['word'] + [f'{c}_count' for c in counts.keys()]
    counts = [f'"{word}"'] + [f'{n}' for n in counts.values()]
    var_str = f'({", ".join(columns)})'
    val_str = f'({", ".join(counts)})'
    insert_cmd = f'INSERT INTO abc {var_str} VALUES {val_str}'
    cur.execute(insert_cmd)

def unscramble(text):
    counts = {a: 0 for a in alphabet}
    for c in text:
        counts[c] += 1
    where_clauses = [f'{c}_count <= {n}' for (c, n) in counts.items()]
    select_cmd = f'SELECT word FROM abc WHERE {" AND ".join(where_clauses)}'
    cur.execute(select_cmd)
    return list(sorted([tup[0] for tup in cur.fetchall()]))

print('Building sqlite table...')
with open('/usr/share/dict/words') as f:
    word_set = set(line.strip().upper() for line in f)
for word in word_set:
    if all(c in alphabet_set for c in word):
        insert(word)
print('Table built!')

d = defaultdict(list)
for word in unscramble('CAUDK'):
    d[len(word)].append(word)

print("unscramble('CAUDK'):")
for n in sorted(d):
    print(' '.join(d[n]))
Output:
Building sqlite table...
Table built!
unscramble('CAUDK'):
A C D K U
AC AD AK AU CA CD CU DA DC KC UK
AUK CAD CUD
DUCK
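For completeness, the starts_with / ends_with / contains cases from the question could be answered against the same table with ordinary LIKE patterns and the per-letter count columns (a minimal sketch reusing cur and the abc table from the demo above; whether an index can serve the prefix case depends on the database's LIKE/collation rules):
def starts_with(prefix):
    cur.execute('SELECT word FROM abc WHERE word LIKE ?', (prefix.upper() + '%',))
    return [row[0] for row in cur.fetchall()]

def ends_with(suffix):
    cur.execute('SELECT word FROM abc WHERE word LIKE ?', ('%' + suffix.upper(),))
    return [row[0] for row in cur.fetchall()]

def contains(letters):
    # "containing C" amounts to C_count > 0, per the indexed-count idea above
    clauses = ' AND '.join(f'{c}_count > 0' for c in letters.upper())
    cur.execute(f'SELECT word FROM abc WHERE {clauses}')
    return [row[0] for row in cur.fetchall()]

print(starts_with('DUC')[:5])
print(contains('DC')[:5])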
I don't know for sure what they're doing, but I suggest this algorithm for contains and unscramble (and it can, I think, be trivially extended to starts with or ends with):
1. The user submits a set of letters in the form of a string. Say the user submits bdsfa.
2. The algorithm sorts the string from step (1), so the query becomes abdfs.
3. Then, to find all words containing those letters, the algorithm simply accesses the directory database/a/b/d/f/s/ and lists the words stored there. If that directory turns out to be empty, it goes one level up, to database/a/b/d/f/, and shows the results from there.
So now the question is how to index the database of millions of words, as used in step (3). The database/ directory will contain 26 directories, one for each letter a to z, each of which in turn contains 25 subdirectories, one for every letter except its parent's. E.g.:
database/a/{b,c,...,z}
database/b/{a,c,...,z}
...
database/z/{a,b,...,y}
This tree structure will be at most 26 levels deep. Each branch will have no more than 26 entries, so browsing this directory structure is scalable.
Words will be stored in the leaves of this tree. So, the word apple will be stored in database/a/e/l/p/leaf_apple. In that place, you will also find other words such as leap. More specifically:
database/
a/
e/
l/
p/
leaf_apple
leaf_leap
leaf_peal
...
This way, you can efficiently reach the subset of target words in O(log n), where n is the total number of words in your database.
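For illustration, here is a minimal in-memory sketch of the same bucketing idea, using a dict keyed by a word's sorted set of distinct letters instead of nested directories:
from collections import defaultdict

def key_of(word):
    return ''.join(sorted(set(word.lower())))

buckets = defaultdict(list)
for word in ['apple', 'leap', 'peal', 'pale', 'duck']:
    buckets[key_of(word)].append(word)

print(buckets[key_of('apple')])   # ['apple', 'leap', 'peal', 'pale'] all share the key 'aelp'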
You can further optimise this by adding additional indices. For example, there are too many words containing 'a', and the website won't display them all (at least not on the first page). Instead, the website may say that there are 500,000 words containing 'a' in total, and here are 100 examples. To obtain that count of 500,000 efficiently, the number of children at every level can be recorded during the indexing, e.g.:
database/
    {a,...,z}/
        num_children
        {a,...,z}/
            num_children
            ...
Here, num_children is just a leaf node, like leaf_WORD. All leaves are files.
Depending on the load this website has, it may not need to load this database into memory at all. It can simply leave it to the operating system to decide which portions of its file system to cache in memory as a read-time optimisation.
Personally, as a general criticism of applications, I think developers tend to reach for RAM too quickly, even when a simple file-system trick can do the job without any noticeable difference to the end user.

MatchIt - how to make matching date specific?

I'm trying to use MatchIt to create two sets of matched investment companies (treatment vs control).
I need to match the treatment companies to the control companies using only data from the 1-3 years preceding the treatment.
For example if a company received treatment in 2009, then I would want to match it using data from 2009, 2008, 2007 (My after treatment effects dummy would hold a value from 2010 onwards in this case)
I am unsure how to add this selection into my matching code, which currently looks like this:
matchit(signatory ~ totalUSD + brownUSD + country + strategy, data = panel6, method = "full")
Should I consider using the after-treatment effects dummy in some way?
Any tips for how I add this in would be greatly appreciated!
There is no straightforward way to do this in MatchIt. You can set a caliper, which requires the control companies to be within a certain number of years from a treated company, but there isn't a way to require that control companies have a year strictly before the treated company. You can perform exact matching on year so that the treated and control companies have exactly the same year using the exact argument.
Another, slightly more involved, way is to construct a distance matrix yourself and set to Inf any distances between units that are forbidden to match with each other. The first step would be estimating a propensity score, which you can do manually or using matchit(). Then you construct a distance matrix and, for each entry in the distance matrix, decide whether to set the distance to Inf. Finally, you can supply the distance matrix to the distance argument of matchit(). Here's how you would do that:
#Estimate the propensity score
ps <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
              data = panel6, method = NULL)$distance

#Create the distance matrix
dist <- optmatch::match_on(signatory ~ ps, data = panel6)

#Loop through the matrix and set disallowed matches to Inf
t <- which(panel6$signatory == 1)
u <- which(panel6$signatory != 1)
for (i in seq_along(t)) {
  for (j in seq_along(u)) {
    if (panel6$year[u[j]] > panel6$year[t[i]] || panel6$year[u[j]] < panel6$year[t[i]] - 2)
      dist[i,j] <- Inf
  }
}
#Note: can be vectorized for speed but shouldn't take long regardless

#Supply the distance matrix to matchit() and match
m <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
             data = panel6, method = "full", distance = dist)
That should work. You can verify by looking at individual groups of matched companies using match.data():
md <- match.data(m, data = panel6)
md <- md[with(md, order(subclass, signatory)),]
View(md) #assuming you're using RStudio
You should see that within subclasses, the control units are 0-2 years below the treated units.

Effectively derive term co-occurrence matrix from Google Ngrams

I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The resulting tcm would then be used to measure a bunch of lexical statistics and serve as input into vector semantics methods (Glove, LSA, LDA).
For reference, the Google Books (v2) dataset is formatted as follows (tab-separated)
ngram year match_count volume_count
some word 1999 32 12 # example bigram
However, the problem is, of course, that these data are huge. That said, I will only need a subset of the data from certain decades (about 20 years' worth of ngrams), and I am happy with a context window of up to 2 (i.e., using the trigram corpus). I have a few ideas, but none of them seem particularly, well, good.
-Idea 1- initially was more or less this:
# preprocessing (pseudo)
for file in trigram-files:
    download $file
    filter $lines where 'year' tag matches one of years of interest
    find the frequency of each of those ngrams (match_count)
    cat those $lines * $match_count >> file2
    # (write the same line x times according to the match_count tag)
    remove $file
# tcm construction (using R)
grams <- # read lines from file2 into list
library(text2vec)
# treat lines (ngrams) as documents to avoid unrelated ngram overlap
it <- itoken(grams)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2)
tcm <- create_tcm(it, vectorizer) # nice and sparse
However, I have a hunch this might not be the best solution. The ngram data files already contain the co-occurrence data in the form of n-grams, and there is a tag that gives the frequency. I have a feeling there should be a more direct way.
-Idea 2- I was also thinking of cat'ing each filtered ngram only once into the new file (instead of replicating it match_count times), then creating an empty tcm and then looping over the whole (year-filtered) ngram dataset and record instances (using the match_count tag) where any two words co-occur to populate the tcm. But, again, the data is big, and this kind of looping would probably take ages.
-Idea 3- I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given most entries are 0s), and (if I got it right) it simply loops through everything (and I assume a Python loop over this much data would be superslow), so it seems to be more aimed at rather smaller subsets of data.
edit -Idea 4- I came across this old SO question asking about using Hadoop and Hive for a similar task, with a short answer containing a broken link and a comment about MapReduce (neither of which I am familiar with, so I would not know where to start).
But I'm thinking I can't be the first one who needs to tackle such a task, given the popularity of the Ngram dataset and the popularity of (non-word2vec) distributional semantics methods that operate on a tcm or dtm input; hence ->
...the question: what would be a more reasonable/effective way of constructing a term-term co-occurrence matrix from Google Books Ngram data? (be it a variation of the proposed ideas or something completely different; R preferred but not necessary)
I will give an idea of how you can do this, though it can be improved in several places. I deliberately wrote it in a "spaghetti style" for better interpretability, but it can be generalized to more than tri-grams:
# needed for data.table() and the %>% pipe used below
library(data.table)
library(magrittr)
ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54))
# here we split tri-grams to obtain words
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = T) %>% simplify2array()
# vocab here is vocabulary from chunk, but you can be interested first
# to create vocabulary from whole corpus of ngrams and filter non
# interesting/rare words
vocab = unique(tokens_matrix)
# convert char matrix to integer matrix for faster downstream calculations
tokens_matrix_int = match(tokens_matrix, vocab)
dim(tokens_matrix_int) = dim(tokens_matrix)
ngram_dt[, token_1 := tokens_matrix_int[1, ]]
ngram_dt[, token_2 := tokens_matrix_int[2, ]]
ngram_dt[, token_3 := tokens_matrix_int[3, ]]
dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)]
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)]
# note here 0.5 - discount for more distant word - we follow text2vec discount of 1 / distance
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)]
dt = rbindlist(list(dt_12, dt_13, dt_23))
# "reduce" by word indices again - sum pair co-occurences which were in different tri-grams
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)]
tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt,
                           dims = rep(length(vocab), 2), index1 = T,
                           giveCsparse = F, check = F, dimnames = list(vocab, vocab))
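For reference, a rough Python equivalent of the same aggregation could look like the sketch below (toy data standing in for the real ngram files; the adjacent-pair counts and the 1/distance discount mirror the data.table code above):
from collections import Counter
from scipy.sparse import coo_matrix

ngrams = [("as we know", 32), ("i know you", 54)]   # (trigram, match_count)

vocab = sorted({w for gram, _ in ngrams for w in gram.split()})
index = {w: i for i, w in enumerate(vocab)}

pair_counts = Counter()
for gram, cnt in ngrams:
    a, b, c = (index[w] for w in gram.split())
    pair_counts[(a, b)] += cnt          # adjacent pairs get the full count
    pair_counts[(b, c)] += cnt
    pair_counts[(a, c)] += 0.5 * cnt    # distance-2 pair gets the 1/2 discount

rows, cols = zip(*pair_counts)
tcm = coo_matrix((list(pair_counts.values()), (rows, cols)),
                 shape=(len(vocab), len(vocab)))
print(tcm.toarray())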

MATLAB cell array indexing and looping

I'm trying to create a script that reads data from a text file, and plots the data onto a scatter plot.
For example, say the file name is prices.txt and contains:
Pens 2 4
Pencils 1.5 3
Rulers 3 3.5
Sharpeners 1 3
Highlighters 3 4
Where columns 2 and 3 are the prices of the items at two different stores.
What my script should do is read the prices, calculate (using another function) future prices for the stores, and plot these prices on a scatter plot where x is one store and y is the other. This is a silly example, I know, but it fits the description.
Don't worry too much about the other function that does the calculation; just assume it does what it's supposed to.
Basically, I've come up with the following:
pricesfile = fopen('Prices.txt');
prices = textscan(pricesfile, '%s %f %f');
fclose(pricesfile);
count = 1;
while count <= length(prices{1})
    for item = constants{1}
        name = constants{1}{count};
        store_A = prices{2}{count};
        store_B = prices{3}{count};
        (...other function goes here...)
    end
end
After doing this I'm completely stuck. My thought process behind this was to go through each item name, and create a vector that's assigned to this name with its two corresponding prices as items in the vector eg:
pens = [2 4]
pencils = [1.5 3]
etc. Then, I would somehow plot those items in the vector on a scatter plot and use the name of the vector as a label.
I'm not too sure how to carry out the rest of my code or even if what I've written will get me to the solution.
Please help and thanks in advance.
pricesfile = fopen('Prices.txt');
data = textscan(pricesfile, '%s %f %f');
fclose(pricesfile);
You were on the right track but after this (through a bit of hackery) you don't actually need a loop:
plot(repmat(data{2},1,2)', repmat(data{3},1,2)', '.')
legend(data{1})
What you DO NOT want to do is create variables named after strings. Rather, store the values in an array alongside an array of the names (which is basically what your textscan code gives you). MATLAB is very good at handling matrices/arrays.
You could also split your price array up for example:
names = data{1};
prices = [data{2:3}];
now you can perform calculations on prices quite easily like
prices_cents = prices*100;
plot(prices_cents(:,[1,1]), prices_cents(:,[2,2]))
legend(names)
Note that the [1,1] etc. above is just using indexing as a shorthand to achieve what repmat does...
