mT5 transformer, how to access encoder to compute cosine similarity - dataset

this is my method, my question is how to access the encoder be sending 2 sentences each time? because I have a dataset that contain pairs of sentences, and I need to compute the similarity between each pair.
//anyone could help?
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
#Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']
sentence1 = ['This is an embedding for framework generation']
#Sentences are encoded by calling
embedding = model.encode(sentence)
embedding1 = model.encode(sentence1)
e = np.squeeze(np.asarray(embedding))
e1 = np.squeeze(np.asarray(embedding1))
#calculate Cosine Similarity
cos_sim = dot(e, e1)/(norm(e)*norm(e1))
print(cos_sim)

Related

Effectively derive term co-occurrence matrix from Google Ngrams

I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The resulting tcm would then be used to measure a bunch of lexical statistics and serve as input into vector semantics methods (Glove, LSA, LDA).
For reference, the Google Books (v2) dataset is formatted as follows (tab-separated)
ngram year match_count volume_count
some word 1999 32 12 # example bigram
However, problem is of course, these data be superhuge. Although, I will only need a subset of the data from certain decades (about 20 years worth of ngrams), and I am happy with a context window of up to 2 (i.e., use the trigram corpus). I have a few ideas but none seem particularly, well, good.
-Idea 1- initially was more or less this:
# preprocessing (pseudo)
for file in trigram-files:
download $file
filter $lines where 'year' tag matches one of years of interest
find the frequency of each of those ngrams (match_count)
cat those $lines * $match_count >> file2
# (write the same line x times according to the match_count tag)
remove $file
# tcm construction (using R)
grams <- # read lines from file2 into list
library(text2vec)
# treat lines (ngrams) as documents to avoid unrelated ngram overlap
it <- itoken(grams)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2)
tcm <- create_tcm(it, vectorizer) # nice and sparse
However, I have a hunch this might not be the best solution. The ngram data files already contain the co-occurrence data in the form of n-grams, and there is a tag that gives the frequency. I have a feeling there should be a more direct way.
-Idea 2- I was also thinking of cat'ing each filtered ngram only once into the new file (instead of replicating it match_count times), then creating an empty tcm and then looping over the whole (year-filtered) ngram dataset and record instances (using the match_count tag) where any two words co-occur to populate the tcm. But, again, the data is big, and this kind of looping would probably take ages.
-Idea 3- I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given most entries are 0s), and (if I got it right) it simply loops through everything (and I assume a Python loop over this much data would be superslow), so it seems to be more aimed at rather smaller subsets of data.
edit -Idea 4- Came across this old SO question asking about using Hadoop and Hive for a similar task, with a a short answer with a broken link and a comment about MapReduce (none of which I am familiar with, so I would not know where to start).
But I'm thinking I can't be the first one with the need to tackle such a task, given the popularity of the Ngram dataset, and the popularity of (non-word2vec) distributed semantics methods that operate on a tcm or dtm input; hence ->
...the question: what would be a more reasonable/effective way of constructing a term-term co-occurrence matrix from Google Books Ngram data? (be it a variation of the proposed ideas of something completely different; R preferred but not necessary)
I will give an idea of how you can do this. But it can be improved in several places. I specially wrote in a "spagetti-style" for better interpretability, but it can be generalized to more than tri-grams
ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54))
# here we split tri-grams to obtain words
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = T) %>% simplify2array()
# vocab here is vocabulary from chunk, but you can be interested first
# to create vocabulary from whole corpus of ngrams and filter non
# interesting/rare words
vocab = unique(tokens_matrix)
# convert char matrix to integer matrix for faster downstream calculations
tokens_matrix_int = match(tokens_matrix, vocab)
dim(tokens_matrix_int) = dim(tokens_matrix)
ngram_dt[, token_1 := tokens_matrix_int[1, ]]
ngram_dt[, token_2 := tokens_matrix_int[2, ]]
ngram_dt[, token_3 := tokens_matrix_int[3, ]]
dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)]
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)]
# note here 0.5 - discount for more distant word - we follow text2vec discount of 1 / distance
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)]
dt = rbindlist(list(dt_12, dt_13, dt_23))
# "reduce" by word indices again - sum pair co-occurences which were in different tri-grams
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)]
tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt, dims = rep(length(vocab), 2), index1 = T,
giveCsparse = F, check = F, dimnames = list(vocab, vocab))

How to get the tfidf vector for a given document

I have the following file:
id review
1 "Human machine interface for lab abc computer applications."
2 "A survey of user opinion of computer system response time."
3 "The EPS user interface management system."
4 "System and human system engineering testing of EPS."
5 "Relation of user perceived response time to error measurement."
6 "The generation of random binary unordered trees."
7 "The intersection graph of paths in trees."
8 "Graph minors IV Widths of trees and well quasi ordering."
9 "Graph minors A survey."
10 "survey is a state of art."
Each row concern a document.
I convert these documents to a corpus and i find for each word its TFIDF:
from collections import defaultdict
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
reviews = defaultdict(list)
with open("C:/Users/user/workspacePython/Tutorial/data/unlabeledTrainData.tsv", "r") as sentences_file:
reader = csv.reader(sentences_file, delimiter='\t')
reader.next()
for row in reader:
reviews[row[1]].append(row[1])
for id, review in reviews.iteritems():
reviews[id] = " ".join(review)
corpus = []
for id, review in sorted(reviews.iteritems(), key=lambda t: id):
corpus.append(review)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english')
tfidf_matrix = tf.fit_transform(corpus)
My question is: How can i get for a given document(from the above file) its correponding vector(row) in the tfidf_matrix.
Thank you
You have a list of documents, 1 to 10. That's 0 to 9 in array-index terms.
The variable tfidx_matrix will contain a sparse-row form matrix consisting of rows (representing documents) and their normalised association with the vocabulary across the corpus (minus English stop-words).
So to convert the sparse array into a more traditional matrix, you could try
npm_tfidf = tfidf_matrix.todense()
document_1_vector = npm_tfidf[0]
document_2_vector = npm_tfidf[1]
document_3_vector = npm_tfidf[2]
...
document_10_vector = npm_tfidf[9]
There's easier and better ways to extract the contents, but I suppose the part that's getting in your way is this conversion from the sparse matrix representation, that can be tricky to unpick, and the more traditional dense matrix representation.
Note also, that interpreting the vectors will require you being able to extract the vocabulary extracted during the process - this should be in the form of an ordered (alphabetically list of tokens) and can be extracted using:
vocabulary = tfidf_matrix.get_feature_names()

Matlab: Sum corresponding values if index is within a range

I have been going crazy trying to figure a way to speed this up. Right now my current code talks ~200 sec looping over 77000 events. I was hoping someone might be able to help me speed this up because I have to do about 500 of these.
Problem:
I have arrays (both 200000x1) that correspond to Energy and Position of a hit over 77000 events. I have the range of each event separated into two arrays, event_start and event_end. First thing I do is look for the position in a specific range, then I put the correspond energy in its own array. To get what I need out of this information, I loop through each event and its corresponding start/end to sum up all the energies from each it hit. My code is below:
indx_pos = find(pos>0.7 & pos<2.0);
energy = HitEnergy(indx_pos);
for i=1:n_events
Etotal(i) = sum(energy(find(indx_pos>=event_start(i) …
& indx_pos<=event_end(i))));
end
Sample input & output:
% Sample input
% pos and energy same length
n_events = 3;
event_start = [1 3 7]';
event_end = [2 6 8]';
pos = [0.75 0.8 2.1 3.6 1.9 0.5 21.0 3.1]';
HitEnergy = [0.002 0.004 0.01 0.0005 0.08 0.1 1.7 0.007]';
% Sample Output
Etotal = 0.0060
0.0800
0
Approach #1: Generic case
One approach with bsxfun and matrix-multiplication -
mask = bsxfun(#ge,indx_pos,event_start.') & bsxfun(#le,indx_pos,event_end.')
Etotal = energy.'*mask
This could be a bit memory-hungry if indx_pos has lots of elements in it.
Approach #2: Non-overlapping start/end ranges case
One can use accumarray for this special case like so -
%// Setup ID array for use in accumarray later on
loc(numel(pos))=0; %// Fast pre-allocation scheme
valids = event_end+1<=numel(pos);
loc(event_end(valids)+1) = -1*(1:sum(valids));
loc(event_start) = loc(event_start)+(1:numel(event_end));
id = cumsum(loc);
%// Set elements as zeros in HitEnergy that do not satisfy the criteria:
%// pos>0.7 & pos<2.0
HitEnergy_select = (pos>0.7 & pos<2.0).*HitEnergy(:);
%// Discard elments in HitEnergy_select & id that have IDs as zeros
HitEnergy_select = HitEnergy_select(id~=0);
id = id(id~=0);
%// Accumulate summations as done inside the loop in the original code
Etotal = accumarray(id(:),HitEnergy_select);
The problem is that for every event you are searching the entire vector indx_pos.
Constrain your search inside the loop to only the range from event_start(i) to event_end(i):
for i = 1:n_events
I = event_start(i):event_end(i);
posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
Etotal(i) = sum(HitEnergy(I(posIIsWithinRange)));
end
You could also use a vectorized version based on run length decoding and vectorizing the notion of colon. (Download the functions coloncatrld and runLengthDecode.)
I = coloncatrld(event_start, event_end);
energy = HitEnergy(I);
eventNum = runLengthDecode(event_end - event_start+1);
posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
Etotal = accumarray(eventNum(posIIsWithinRange), energy(posIIsWithinRange), [n_events,1]);
This is similar to Divakar's Approach #2 with the addition that it should work for overlapping ranges too.

How to train a Support Vector Machine(svm) classifier with openCV with facial features?

I want to use the svm classifier for facial expression detection. I know opencv has a svm api, but I have no clue what should be the input to train the classifier. I have read many papers till now, all of them says after facial feature detection train the classifier.
so far what I did,
Face detection,
16 facial points calculation in every frame. below is an output of facial feature detection![enter image description
A vector which holds the features points pixel address
Note: I know how I can train the SVM only with positive and negative images, I saw this codehere, But I don't know how I combine the facial feature information with it.
Can anybody please help me to start the classification with svm.
a. what should be the sample input to train the classifier?
b. How do I train the classifier with this facial feature points?
Regards,
the machine learning algos in opencv all come with a similar interface. to train it, you pass a NxM Mat offeatures (N rows, each feature one row with length M) and a Nx1 Mat with the class-labels. like this:
//traindata //trainlabels
f e a t u r e 1
f e a t u r e -1
f e a t u r e 1
f e a t u r e 1
f e a t u r e -1
for the prediction, you fill a Mat with 1 row in the same way, and it will return the predicted label
so, let's say, your 16 facial points are stored in a vector, you would do like:
Mat trainData; // start empty
Mat labels;
for all facial_point_vecs:
{
for( size_t i=0; i<16; i++ )
{
trainData.push_back(point[i]);
}
labels.push_back(label); // 1 or -1
}
// now here comes the magic:
// reshape it, so it has N rows, each being a flat float, x,y,x,y,x,y,x,y... 32 element array
trainData = trainData.reshape(1, 16*2); // numpoints*2 for x,y
// we have to convert to float:
trainData.convertTo(trainData,CV_32F);
SVM svm; // params omitted for simplicity (but that's where the *real* work starts..)
svm.train( trainData, labels );
//later predict:
vector<Point> points;
Mat testData = Mat(points).reshape(1,32); // flattened to 1 row
testData.convertTo(testData ,CV_32F);
float p = svm.predict( testData );
Face gesture recognition is a widely researched problem, and the appropriate features you need to use can be found by a very thorough study of the existing literature. Once you have the feature descriptor you believe to be good, you go on to train the SVM with those. Once you have trained the SVM with optimal parameters (found through cross-validation), you start testing the SVM model on unseen data, and you report the accuracy. That, in general, is the pipeline.
Now the part about SVMs:
SVM is a binary classifier- it can differentiate between two classes (though it can be extended to multiple classes as well). OpenCV has an inbuilt module for SVM in the ML library. The SVM class has two functions to begin with: train(..) and predict(..). To train the classifier, you give as in input a very large amount of sample feature descriptors, along with their class labels (usually -1 and +1). Remember the format OpenCV supports: every training sample has to be a row-vector. And each row will have one corresponding class label in the labels vector. So if you have a descriptor of length n, and you have m such sample descriptors, your training matrix would be m x n (m rows, each of length n), and the labels vector would be of length m. There is also a SVMParams object that contains properties like SVM-type and values for parameters like C that you'll have to specify.
Once trained, you extract features from an image, convert it into a single row format, and give to predict() and it'll tell you which class it belongs to (+1 or -1).
There's also a train_auto() with similar arguments with a similar format that gives you the optimum values of the SVM parameters.
Also check this detailed SO answer to see an example.
EDIT:
Assuming you have a Feature Descriptor that returns a vector of features, the algorithm would be something like:
Mat trainingMat, labelsMat;
for each image in training database:
feature = extractFeatures( image[i] );
Mat feature_row = alignAsRow( feature );
trainingMat.push_back( feature_row );
labelsMat.push_back( -1 or 1 ); //depending upon class.
mySvmObject.train( trainingMat, labelsMat, Mat(), Mat(), mySvmParams );
I don't presume that extractFeatures() and alignAsRow() are existing functions, you might need to write them yourself.

Uniformly sampling on hyperplanes

Given the vector size N, I want to generate a vector <s1,s2, ..., sn> that s1+s2+...+sn = S.
Known 0<S<1 and si < S. Also such vectors generated should be uniformly distributed.
Any code in C that helps explain would be great!
The code here seems to do the trick, though it's rather complex.
I would probably settle for a simpler rejection-based algorithm, namely: pick an orthonormal basis in n-dimensional space starting with the hyperplane's normal vector. Transform each of the points (S,0,0,0..0), (0,S,0,0..0) into that basis and store the minimum and maximum along each of the basis vectors. Sample uniformly each component in the new basis, except for the first one (the normal vector), which is always S, then transform back to the original space and check if the constraints are satisfied. If they are not, sample again.
P.S. I think this is more of a maths question, actually, could be a good idea to ask at http://maths.stackexchange.com or http://stats.stackexchange.com
[I'll skip "hyper-" prefix for simplicity]
One of possible ideas: generate many uniformly distributed points in some enclosing volume and project them on the target part of plane.
To get uniform distribution the volume must be shaped like the part of plane but with added margins along plane normal.
To uniformly generate points in such volumewe can enclose it in a cube and reject everything outside of the volume.
select margin, let's take margin=S for simplicity (once margin is positive it affects only performance)
generate a point in cube [-M,S+M]x[-M,S+M]x[-M,S+M]
if distance to the plane is more than M, reject the point and go to #2
project the point on the plane
check that projection falls into [0,S]x[0,S]x[0,S], if not - reject and go to #2
add this point to the resulting set and go to #2 is you need more points
The problem can be mapped to that of sampling on linear polytopes for which the common approaches are Monte Carlo methods, Random Walks, and hit-and-run methods (see https://www.jmlr.org/papers/volume19/18-158/18-158.pdf for examples a short comparison). It is related to linear programming, and can be extended to manifolds.
There is also the analysis of polytopes in compositional data analysis, e.g. https://link.springer.com/content/pdf/10.1023/A:1023818214614.pdf, which provide an invertible transformation between the plane and the polytope that can be used for sampling.
If you are working on low dimensions, you can use also rejection sampling. This means you first sample on the plane containing the polytope (defined by your inequalities). This later method is easy to implement (and wasteful, of course), the GNU Octave (I let the author of the question re-implement in C) code below is an example.
The first requirement is to get vector orthogonal to the hyperplane. For a sum of N variables this is n = (1,...,1). The second requirement is a point on the plane. For your example that could be p = (S,...,S)/N.
Now any point on the plane satisfies n^T * (x - p) = 0
we assume also that x_i >= 0
With these given you compute an orthonormal basis on the plane (the nullity of the vector n) and then create random combination on that bases. Finally you map back to the original space and apply your constraints on the generated samples.
# Example in 3D
dim = 3;
S = 1;
n = ones(dim, 1); # perpendicular vector
p = S * ones(dim, 1) / dim;
# null-space of the perpendicular vector (transposed, i.e. row vector)
# this generates a basis in the plane
V = null (n.');
# These steps are just to reduce the amount of samples that are rejected
# we build a tight bounding box
bb = S * eye(dim); # each column is a corner of the constrained region
# project on the null-space
w_bb = V \ (bb - repmat(p, 1, dim));
wmin = min (w_bb(:));
wmax = max (w_bb(:));
# random combinations and map back
nsamples = 1e3;
w = wmin + (wmax - wmin) * rand(dim - 1, nsamples);
x = V * w + p;
# mask the points inside the polytope
msk = true(1, nsamples);
for i = 1:dim
msk &= (x(i,:) >= 0);
endfor
x_in = x(:, msk); # inside the polytope (your samples)
x_out = x(:, !msk); # outside the polytope
# plot the results
scatter3 (x(1,:), x(2,:), x(3,:), 8, double(msk), 'filled');
hold on
plot3(bb(1,:), bb(2,:), bb(3,:), 'xr')
axis image

Resources