Why does a tfidf object take so much space? - tf-idf

I have roughly 100,000 long articles totalling about 5 GB of text. When I run TfidfVectorizer from sklearn on them, it constructs a model of 6 GB. How is that possible? Don't we only need to store the document frequencies of those 4000 words and the words themselves? I am guessing TfidfVectorizer stores such a 4000-dimensional vector for every document. Is it possible that I have some settings wrong?

The shape of a TF-IDF matrix is (number_of_documents, number_of_unique_words), so each document gets a feature for every word in the dataset. It can get bloated for large datasets.
In your case, stored densely:
(100000 (docs) * 4000 (words) * 8 (np.float64 bytes))/1024**3 ~ 3 GB
Moreover, sklearn's TfidfVectorizer by default compensates for this by returning a sparse matrix (scipy.sparse.csr_matrix). Even for long documents the matrix tends to contain lots of zeros, so it is usually an order of magnitude smaller than the dense size. If I am correct, it should be well below 3 GB.
So the question is: do you really have only 4000 words in your model (controlled by TfidfVectorizer(max_features=4000))?
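A rough way to estimate these sizes is sketched below. It assumes 8 bytes per np.float64 value and a CSR layout with 4-byte column indices and row pointers; the figure of 500 distinct vocabulary words per document is a made-up value for illustration, not something taken from the question.

```python
# Back-of-the-envelope memory estimate for a TF-IDF matrix.
N_DOCS = 100_000
N_FEATURES = 4_000
AVG_NONZERO_PER_DOC = 500  # assumption for illustration only

def dense_bytes(n_docs, n_features, itemsize=8):
    """Bytes needed to store every cell as a float64."""
    return n_docs * n_features * itemsize

def csr_bytes(n_docs, nnz, itemsize=8, index_size=4):
    """Bytes for CSR storage: values + column indices + row pointers."""
    return nnz * (itemsize + index_size) + (n_docs + 1) * index_size

GIB = 1024 ** 3
nnz = N_DOCS * AVG_NONZERO_PER_DOC
dense_gib = dense_bytes(N_DOCS, N_FEATURES) / GIB
sparse_gib = csr_bytes(N_DOCS, nnz) / GIB
print(f"dense:  {dense_gib:.2f} GiB")
print(f"sparse: {sparse_gib:.2f} GiB")
```

Either way, this is the size of the transformed matrix, not of the fitted vectorizer, so a multi-gigabyte pickle of the vectorizer itself points to something else being stored.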
If you don't care about individual word frequencies you can decrease the vector size using PCA or other techniques.
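from sklearn.decomposition import PCA
# PCA needs a dense array, so the dense matrix has to fit in memory
dense_matrix = tf_idf_matrix.toarray()
components_number = 300
reduced_data = PCA(n_components=components_number).fit_transform(dense_matrix)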
Or you can use something like doc2vec. https://radimrehurek.com/gensim/models/doc2vec.html
Using it you'll get a matrix of shape (number_of_documents, embedding_size). The embedding size is usually between 100 and 600. You can train a doc2vec model without storing individual word vectors using the dbow_words parameter.
If you care about individual word features, the only reasonable solution that I see is to decrease the number of words.
Relevant stackoverflow posts:
----On dimensionality reduction
How do i visualize data points of tf-idf vectors for kmeans clustering?
----On using generators to train TFIDF
Sklearn TFIDF on large corpus of documents
How to get tf-idf matrix of a large size corpus, where features are pre-specified?
tf-idf on a somewhat large (65k) amount of text files
The model itself should not occupy so much space. I suppose it is possible only if you have some heavy objects in the TfidfVectorizer tokenizer or preprocessor attributes.
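import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class Tokenizer:
    def __init__(self):
        # heavy attribute: 10000 x 10000 float64 values
        self.s = np.random.uniform(0, 1, size=(10000, 10000))

    def tokenizer(self, text):
        text = text.lower().split()
        return text

tokenizer = Tokenizer()
# the bound method keeps a reference to the whole Tokenizer instance
vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenizer)
with open("vectorizer.pcl", "wb") as f:
    pickle.dump(vectorizer, f)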
This will occupy more than 700 MB after pickling.

I know there is already an answer, but here is some additional information for others to consider. When you pickle the TfidfVectorizer directly, you also save the stop_words_ attribute of the vectorizer, which is not needed once the vocabulary is established. In one of our models there were 3000 words in the vocabulary, but the saved model occupied 250 MB. Inspecting the model, we saw that 10 million stop words were also stored with it. Then we saw the following warning in the TfidfVectorizer documentation:
"The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling."
Applying that reduced our model size significantly.
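A minimal sketch of that fix follows. To keep it self-contained it uses a stand-in object instead of a real fitted TfidfVectorizer; the stand-in, its sizes, and the helper names drop_stop_words and pickled_size are all made up for illustration. The helper only touches stop_words_ if the attribute exists, since not every vectorizer object is guaranteed to carry it.

```python
import pickle
from types import SimpleNamespace

def pickled_size(obj):
    """Size in bytes of the pickled object."""
    return len(pickle.dumps(obj))

def drop_stop_words(vectorizer):
    """Clear stop_words_ (if present) so it is not written to the pickle."""
    if hasattr(vectorizer, "stop_words_"):
        vectorizer.stop_words_ = None
    return vectorizer

# Hypothetical stand-in for a fitted vectorizer: a small vocabulary and
# a much larger set of words that were cut off by max_features.
fake_vectorizer = SimpleNamespace(
    vocabulary_={f"word{i}": i for i in range(3000)},
    stop_words_={f"rare{i}" for i in range(100_000)},
)

size_before = pickled_size(fake_vectorizer)
drop_stop_words(fake_vectorizer)
size_after = pickled_size(fake_vectorizer)
print(size_before, size_after)
```

With a real vectorizer the same helper would be called right before pickle.dump; the vocabulary, and therefore transform(), is not affected.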


How to store FaceNet data efficiently?

I am using the Facenet algorithm for face recognition. I want to create an application based on it, but the problem is that the Facenet algorithm returns an array of length 128, which is the face embedding per person.
For person identification, I have to find the Euclidean distance between two persons' face embeddings, then check whether it is below a threshold. If it is, the persons are the same; if it is greater, the persons are different.
Let's say I have to find person x in a database of 10k persons. I have to calculate the distance to each and every person's embedding, which is not efficient.
Is there any way to store this face embedding efficiently and search for the person with better efficiency?
I guess reading this blog post will help others.
It is detailed and covers most aspects of implementation.
Face recognition on 330 million faces at 400 images per second
I recommend storing them in Redis or Cassandra. They will outperform relational databases.
Those key-value stores can store a multidimensional vector as a value.
You can compute embedding vectors with deepface. I have shared a sample code snippet below.
#!pip install deepface
from deepface import DeepFace
img_list = ["img1.jpg", "img2.jpg", ...]
model = DeepFace.build_model("Facenet")
for img_path in img_list:
    img_embedding = DeepFace.represent(img_path, model = model)
    #store img_embedding into the redis here
Sounds like you want a nearest neighbour search. You could have a look at the various space partitioning data structures, such as kd-trees.
First build a dictionary with the 10000 face encodings, as shown in the face_recognition sample, then store it as a pickle file. Once it is loaded in memory, it takes about a second to find the distances between face encoding X and those 10000 pre-encoded ones. Take a look at how it works; I'm operating on millions of faces this way.
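The in-memory scan described above can be sketched with NumPy as follows. The embeddings here are random stand-ins rather than real Facenet outputs, and the function name nearest is made up for the example; it is the plain brute-force variant, not a kd-tree.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in database: 10,000 random 128-dimensional "embeddings" (not real faces).
database = rng.normal(size=(10_000, 128)).astype(np.float32)

def nearest(database, query, k=5):
    """Return indices and Euclidean distances of the k closest rows."""
    distances = np.linalg.norm(database - query, axis=1)
    # argpartition finds the k smallest without sorting all distances
    candidates = np.argpartition(distances, k)[:k]
    order = candidates[np.argsort(distances[candidates])]
    return order, distances[order]

# Query: a slightly perturbed copy of entry 1234, imitating a second
# photo of the same person.
query = database[1234] + rng.normal(scale=0.01, size=128).astype(np.float32)
indices, dists = nearest(database, query)
print(indices[0], dists[0])
```

The best distance would then be compared against the identification threshold; the whole scan is a single vectorised pass over the 10,000 x 128 array.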

How to split the data into training and test sets?

One approach to split the data into two disjoint sets, one for training and one for testing, is to take the first 80% as the training set and the rest as the test set. Is there another approach to split the data into training and test sets?
** For example, I have data containing 20 attributes and 5000 objects. I would take 12 attributes and 1000 objects as my training data, and 3 of those 12 attributes as the test set. Is this method correct?
No, that's invalid. You would always use all features in all data sets. You split by "objects" (examples).
It's not clear why you are taking just 1000 objects and trying to extract a training set from that. What happened to the other 4000 you threw away?
Train on 4000 objects / 20 features. Cross-validate on 500 objects / 20 features. Evaluate performance on the remaining 500 objects/ 20 features.
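A minimal sketch of such a split by objects, using only the standard library. The function name and the fixed seed are choices made for this example; every index keeps all 20 features, only the rows are divided.

```python
import random

def three_way_split(n_objects, n_val, n_test, seed=0):
    """Shuffle object indices and cut them into train / validation / test."""
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(5000, n_val=500, n_test=500)
print(len(train_idx), len(val_idx), len(test_idx))
```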
If your training produces a classifier based on 12 features, it could be (very) hard to evaluate its performance on a test set based only on a subset of these features (your classifier is expecting 12 inputs and you'll give it only 3).
Feature/attribute selection/extraction is important if your data contains many redundant or irrelevant features. So you could identify and use only the most informative features (maybe 12 features), but your training/validation/test sets should all be based on the same features (e.g., since you mention Weka, see "Why do I get the error message 'training and test set are not compatible'?").
Staying with a training/validation/test split (holdout method), a problem you can face is that the samples might not be representative.
For example, some classes might be represented by very few instances or even by no instances at all.
A possible improvement is stratification: sampling for training and testing within classes. This ensures that each class is represented with approximately equal proportions in both subsets.
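Stratification can be sketched in a few lines: group the indices by class and split each group separately. The labels and the 20% test fraction below are invented for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps about the same share in both parts."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# Imbalanced toy data: 90 "common" objects and 10 "rare" ones.
labels = ["common"] * 90 + ["rare"] * 10
train_idx, test_idx = stratified_split(labels)
rare_in_test = sum(1 for i in test_idx if labels[i] == "rare")
print(len(train_idx), len(test_idx), rare_in_test)
```

A purely random 80/20 split of the same data could easily leave the test set with no rare objects at all.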
However, by partitioning the available data into fixed training/test set, you drastically reduce the number of samples which can be used for learning the model. An alternative is cross validation.
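A minimal sketch of k-fold cross validation on index level, assuming 5 folds over the 5000 objects from the question. Each object is used for testing exactly once and for training in the other four rounds.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(5000, k=5))
for train, test in splits:
    print(len(train), len(test))
```

The model is trained and evaluated once per fold and the k scores are averaged.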

Motivation for k-medoids

Why would one use the k-medoids algorithm rather than k-means? Is it only the fact that the number of metrics that can be used in k-means is very limited, or is there something more?
Is there an example of data for which it makes much more sense to choose the best representatives of a cluster from the data rather than from R^n?
The problem with k-means is that it is not interpretable. By interpretability I mean that the model should also be able to give the reason why it produced a certain output.
Let's take an example.
Suppose there is a food review dataset in which a review is either positive or negative, so we have k = 2, where k is the number of clusters. In k-means, the third step of the algorithm is the update step, where you update your k centroids to the mean of the points that lie in each cluster. The example we have chosen is a text problem, so you would also apply some text featurization scheme such as bag-of-words (BoW) or word2vec, and for every review you would get a corresponding vector. The centroid c_i that you get after running k-means is then the mean of the vectors in that cluster. From that centroid you cannot interpret much, or rather, I should say, nothing.
But if you apply k-medoids to the same problem, you choose your k centroids/medoids from the dataset itself. Let's say point x_5 from your dataset is chosen as the first medoid. This increases interpretability, because the medoid is now an actual review.
This is the foremost motivation for introducing k-medoids.
As for the metrics, you can apply all the metrics that you apply for k-means.
Hope this helps.
Why would we use k-medoids instead of k-means in case of (squared) Euclidean distance?
1. Technical justification
In case of relatively small data sets (as k-medoids complexity is greater) - to obtain a clustering more robust to noise and outliers.
Example 2D data showing that:
The graph on the left shows clusters obtained with K-medoids (sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means for K=2. Blue crosses are cluster centers.
The Python code used to generate green points:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed=32)
a = rng.random((6,2))*2.35 - 3*np.ones((6,2))
b = rng.random((50,2))*0.25 - 2*np.ones((50,2))
c = rng.random((100,2))*0.5 - 1.5*np.ones((100,2))
d = rng.random((7,2))*0.55
points = np.concatenate((a, b, c, d))
plt.plot(points[:,0],points[:,1],"g.", markersize=8, alpha=0.3) # green points
2. Business case justification
Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.
2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user those items (e.g. movies) that other similar people enjoyed. So we assign the user to his/her closest cluster and recommend the top movies that the cluster representative (an actual person) watched. If the cluster representative wasn't an actual person, we wouldn't have a history of actually watched movies to recommend. Each time we would additionally have to search, e.g., for the closest person in the cluster. Example data: classic MovieLens 1M Dataset
2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug with them. After clustering the patients with K-medoids, the cluster representatives are invited to the drug trial.
The difference is that in k-means the centroids (cluster centres) are calculated as the average of the vectors contained in the cluster, whereas in k-medoids the medoid (cluster centre) is a record from the dataset, namely the one with the smallest total distance to the other members of its cluster. So if you need to represent the cluster centre by an actual record of your data, use k-medoids; otherwise I would use k-means (the concept behind both algorithms is the same).
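The difference between the two kinds of centre can be seen on a single toy cluster. The points below are invented: four close together and one outlier. The medoid is computed here as the member with the smallest summed Euclidean distance to all other members, which is the usual definition.

```python
import math

def centroid(points):
    """Coordinate-wise mean; usually not a member of the data."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def medoid(points):
    """The data point with the smallest total distance to all the others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 8.0)]
center = centroid(cluster)
representative = medoid(cluster)
print(center, representative)
```

The outlier drags the centroid away from the dense group, while the medoid stays on one of the four nearby points, which also illustrates the robustness argument made earlier.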
The K-Means algorithm uses a distance function such as Euclidean Distance or Manhattan Distance, which is computed over vector-based instances. The K-Medoids algorithm instead accepts a more general (and less constrained) distance function: any pairwise distance function.
This distinction matters in contexts like Complex Data Types or relational rows, where the instances have a high number of dimensions.
High dimensionality problem
In standard clustering libraries and k-means implementations, the distance computation can spend a lot of time scanning the entire attribute vector of each instance. Take document clustering with the standard TF-IDF representation: when computing the cosine similarity, the distance function scans all the words that appear in the whole collection of documents, which in many cases means millions of entries. This is why, in this domain, some authors [1] suggest restricting the words considered to the N most frequent words of the language.
Using K-Medoids there is no need to represent and store the documents as vectors of word frequencies.
As an alternative representation of a document, it is possible to use the set of words that appear at least twice in it, with the Jaccard distance as the distance measure.
A vector representation, by contrast, is as long as the number of words in your dictionary.
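The set-based representation and the Jaccard distance can be sketched as follows. The three example documents and the whitespace tokenisation are simplifications made up for this illustration.

```python
from collections import Counter

def repeated_words(text):
    """Set of words that occur at least twice in the document."""
    counts = Counter(text.lower().split())
    return {word for word, count in counts.items() if count >= 2}

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

doc_a = repeated_words("the cat sat on the mat and the cat slept")
doc_b = repeated_words("the dog sat on the log and the dog barked")
doc_c = repeated_words("stocks fell as stocks markets closed and markets reopened")

d_ab = jaccard_distance(doc_a, doc_b)
d_ac = jaccard_distance(doc_a, doc_c)
print(d_ab, d_ac)
```

Each document is stored as a small set, and a k-medoids implementation only ever needs the pairwise distances between such sets.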
Heterogeneity and Complex Data Types.
There are many domains where it is considerably better to abstract the implementation of an instance:
Graph's nodes clustering;
Car driving behaviour, represented as GPS routes;
Complex data types allow the design of ad hoc distance measures that fit the data domain better.
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Source: https://github.com/eracle/Gap

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working in a chemistry/biology project. We are building a web-application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance (7.2394 , 2) , (7.4011, 1) , (9.9367, 3) , ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem to compare functions e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that in two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching by those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start.
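One way to read that suggestion is sketched below: a dictionary keyed by the integer value, each entry holding its floats in sorted order, searched with bisect. The function names and the tolerance of 0.05 are assumptions made for the example.

```python
import bisect
from collections import defaultdict

def build_index(pairs):
    """Group float values by their integer label and sort each group."""
    index = defaultdict(list)
    for value, label in pairs:
        index[label].append(value)
    for values in index.values():
        values.sort()
    return index

def within_tolerance(index, value, label, tolerance):
    """Stored floats with the same label within +/- tolerance of value."""
    values = index.get(label, [])
    lo = bisect.bisect_left(values, value - tolerance)
    hi = bisect.bisect_right(values, value + tolerance)
    return values[lo:hi]

reference = [(7.2394, 2), (7.4011, 1), (9.9367, 3), (7.2501, 2), (7.9000, 2)]
index = build_index(reference)
matches = within_tolerance(index, 7.24, 2, tolerance=0.05)
print(matches)
```

Only the bin for the matching integer is touched, and within it the search stops as soon as the values leave the tolerance window.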
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random
r = [(random.uniform(0,20), random.randint(1,18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7,9)
# obviously, you can use whatever distance expression you want
# (the absolute values keep the two differences from cancelling out)
zz=[(abs(7-x)+abs(9-y),x,y) for x,y in r]
zz.sort()
# return the 50 best matches
best = [(x,y) for a,x,y in zz[:50]]
Can't you sort the tuples and perform binary search on the sorted array ?
I assume your database is built once and for all, and that the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look in the middle of the sorted array. If the query value is larger than the middle value, you repeat the work on the upper half, otherwise on the lower one.
Worst case is O(log n) comparisons per lookup.
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing and want to offer the user the choice of different bin sizes. Bin sizes will be 0.1, 0.2, 0.3 or 0.4. So binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be, for all bins, to subtract the query integer value from the reference integer value. By summing up all the (absolute) differences we get the similarity score, with the most similar reference entries resulting in the lowest scores.
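A minimal sketch of this binning and scoring, assuming a bin size of 0.1 and the sum of absolute per-bin differences as the score. The helper names and the sample entries are invented; when two values fall into the same bin, this sketch simply keeps the last one.

```python
import math

def bin_entry(pairs, bin_size, max_value=20.0):
    """Turn (float, int) tuples into a fixed-length list of integers per bin."""
    n_bins = math.ceil(round(max_value / bin_size, 9))
    bins = [0] * n_bins  # 0 means "no value in this bin"
    for value, label in pairs:
        position = min(int(value / bin_size), n_bins - 1)
        bins[position] = label
    return bins

def score(query_bins, reference_bins):
    """Sum of absolute per-bin differences; lower means more similar."""
    return sum(abs(q - r) for q, r in zip(query_bins, reference_bins))

query = bin_entry([(7.2412, 2), (7.4093, 1), (9.9301, 3)], bin_size=0.1)
ref_a = bin_entry([(7.2394, 2), (7.4011, 1), (9.9367, 3)], bin_size=0.1)
ref_b = bin_entry([(7.2394, 2), (8.1500, 1), (9.9367, 3)], bin_size=0.1)
print(score(query, ref_a), score(query, ref_b))
```

One known weakness of fixed bins is that two values very close to a bin edge can land in neighbouring bins and count as a mismatch.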
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both the query and the reference list can then be set to 1. We then use the Hamming distance to compute the difference between the query and the reference binned values. I have previously asked about an efficient algorithm for that search.
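For this float-only variant, the occupied bins can be packed into the bits of a single integer, so that the Hamming distance becomes an XOR followed by a bit count. This is a sketch under the assumption of a bin size of 0.2; the names and sample values are made up.

```python
import math

def occupancy_mask(values, bin_size, max_value=20.0):
    """Encode which bins contain a value as the bits of one Python int."""
    n_bins = math.ceil(round(max_value / bin_size, 9))
    mask = 0
    for value in values:
        position = min(int(value / bin_size), n_bins - 1)
        mask |= 1 << position
    return mask

def hamming(a, b):
    """Number of bins that are occupied in exactly one of the two masks."""
    return bin(a ^ b).count("1")

query = occupancy_mask([7.25, 7.41, 9.93], bin_size=0.2)
ref_a = occupancy_mask([7.2394, 7.4011, 9.9367], bin_size=0.2)
ref_b = occupancy_mask([7.25, 8.30, 12.10], bin_size=0.2)
print(hamming(query, ref_a), hamming(query, ref_b))
```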
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here

Any strategies for assessing the trade-off between CPU loss and memory gain from compression of data held in a datastore model's TextProperty?

Are very large TextProperties a burden? Should they be compressed?
Say I have information stored in 2 attributes of type TextProperty in my datastore entities.
The strings are always the same length of 65,000 characters and have lots of repeating integers, a sample appearing as follows:
entity.pixel_idx = 0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5....etc.
entity.pixel_color = 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,...etc.
So the above could also be represented using much less storage by compression, say by storing only each integer and the length of its run ('0,8' for '0,0,0,0,0,0,0,0'), but then it takes time and CPU to compress and decompress?
Any general ideas?
Are there some tricks for testing different approaches to the problem?
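The run-length scheme described in the question ('0,8' for eight zeros) can be sketched like this. The function names are invented, and the encoding is just alternating value,count pairs in the same comma-separated style.

```python
from itertools import groupby

def rle_encode(text):
    """'0,0,0,1,1' -> '0,3,1,2' (alternating value,count pairs)."""
    parts = []
    for value, run in groupby(text.split(",")):
        parts.append(value)
        parts.append(str(len(list(run))))
    return ",".join(parts)

def rle_decode(encoded):
    """Inverse of rle_encode."""
    items = encoded.split(",")
    values = []
    for value, count in zip(items[0::2], items[1::2]):
        values.extend([value] * int(count))
    return ",".join(values)

sample = "0,0,0,0,0,0,0,0,1,1,1,5,5"
encoded = rle_encode(sample)
print(encoded, rle_decode(encoded) == sample)
```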
If all of your integers are single-digit numbers (as in your example), then you can reduce your storage space in half by simply omitting the commas.
The Short Answer
If you expect to have a lot of repetition, then compressing your data makes sense - your data is not so small (65K) and is highly repetitive => it will compress well. This will save you storage space and will reduce how long it takes to transfer the data back from the datastore when you query for it.
The Long Answer
I did a little testing starting with the short example string you provided and that same string repeated to 65000 characters (perhaps more repetitive than your actual data). This string compressed from 65K to a few hundred bytes; you may want to do some additional testing based on how well your data actually compresses.
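A local version of that experiment, without the datastore, might look like the sketch below. It repeats a sample string in the style of the question up to 65,000 characters and compresses it with zlib at level 6; real pixel data will be less regular, so the ratio here is an upper bound rather than a prediction.

```python
import zlib

# Sample in the style of the question, repeated to 65,000 characters.
sample = "0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5,"
repeats = 65000 // len(sample) + 1
raw = (sample * repeats)[:65000].encode("ascii")

compressed = zlib.compress(raw, 6)
restored = zlib.decompress(compressed)
ratio = len(raw) / len(compressed)
print(len(raw), len(compressed), round(ratio))
```

Running the same three lines on a few real entity values gives a quick idea of whether compression is worth it before touching the datastore model.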
Anyway, the test shows a significant savings when using compressed data versus uncompressed data (for just the above test where compression works really well!). In particular, for compressed data:
API time takes 10x less for a single entity (41ms versus 387ms on average)
Storage used is significantly less (so it doesn't look like GAE is doing any compression on your data).
Unexpectedly, CPU time is also lower, by roughly 30% (130ms versus 180ms when fetching 100 entities). I expected CPU time to be a little worse since the compressed data has to be uncompressed. There must be some other CPU work (like decoding the protocol buffer) which is even more CPU work for the much larger uncompressed data.
These differences mean wall clock time is also significantly faster for the compressed version (<100ms versus 426ms when fetching 100 entities).
To make it easier to take advantage of compression, I wrote a custom CompressedDataProperty which handles all of the compressing/decompressing business so you don't have to worry about it (I used it in the above tests too). You can get the source from the above link, but I've also included it here since I wrote it for this answer:
from google.appengine.ext import db
import zlib

class CompressedDataProperty(db.Property):
    """A property for storing compressed data or text.

    Example usage:

    >>> class CompressedDataModel(db.Model):
    ...     ct = CompressedDataProperty()

    You create a compressed data property, simply specifying the data or text:

    >>> model = CompressedDataModel(ct='example uses text too short to compress well')
    >>> model.ct
    'example uses text too short to compress well'
    >>> model.ct = 'green'
    >>> model.ct
    'green'
    >>> model.put() # doctest: +ELLIPSIS
    datastore_types.Key.from_path(u'CompressedDataModel', ...)
    >>> model2 = CompressedDataModel.all().get()
    >>> model2.ct
    'green'

    Compressed data is not indexed and therefore cannot be filtered on:

    >>> CompressedDataModel.gql("WHERE v = :1", 'green').count()
    0
    """
    data_type = db.Blob

    def __init__(self, level=6, *args, **kwargs):
        """Constructor.

        Args:
          level: Controls the level of zlib's compression (between 1 and 9).
        """
        super(CompressedDataProperty, self).__init__(*args, **kwargs)
        self.level = level

    def get_value_for_datastore(self, model_instance):
        value = self.__get__(model_instance, model_instance.__class__)
        if value is not None:
            return db.Blob(zlib.compress(value, self.level))

    def make_value_from_datastore(self, value):
        if value is not None:
            return zlib.decompress(value)
I think this should be pretty easy to test. Just create 2 handlers, one that compresses the data, and one that doesn't, and record how much cpu each one uses (using the appstats package for whichever language you are developing with.) You should also create 2 entity types, one for the compressed data, one for the uncompressed.
Load in a few hundred thousand or a million entities (using the task queue perhaps). Then you can check the disk space usage in the administrator's console, and see how much each entity type uses. If the data is compressed internally by app engine, you shouldn't see much difference in the space used (unless their compression is significantly better than yours) If it is not compressed, there should be a stark difference.
Of course, you may want to hold off on this type of testing until you know for sure that these entities will account for a significant portion of your quota usage and/or your page load time.
Alternatively, you could wait for Nick or Alex to pop in and they could probably tell you whether the data is compressed in the datastore or not.
