I downloaded the MNIST dataset from LIBSVM's dataset page.
All samples are like the following:
5 153:3 154:18 155:18 156:18 157:126 ...
Does anyone know what that means? 5 is the class label, but what is the 153:3 pair, for example? I also couldn't find an explanation on MNIST's own web page.
This is the way libsvm encodes (sparse) vectors. As you said, 5 is the label, and each following pair i:v says that the i-th entry of the vector is v. So you would encode a 3-dimensional vector (a,b,c) as
1:a 2:b 3:c
This is inefficient for dense vectors but a good and established format for sparse data. Since it is plain text, the storage space is not optimal, but it is good enough for most applications, and the files are easy to write and to read.
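To make the format concrete, here is a minimal Python sketch (my own illustration, not part of libsvm) that parses one such line into a label and a dense NumPy vector; the dimensionality 784 is assumed from the 28x28 MNIST images, and feature indices are taken as 1-based as in the standard libsvm format.

import numpy as np

def parse_libsvm_line(line, dim=784):
    # "5 153:3 154:18 ..." -> label 5 and a dense vector with value 3
    # at feature 153, 18 at feature 154, and zeros everywhere else
    parts = line.split()
    label = int(parts[0])
    vec = np.zeros(dim)
    for pair in parts[1:]:
        idx, val = pair.split(":")
        vec[int(idx) - 1] = float(val)   # libsvm indices are 1-based
    return label, vec

label, vec = parse_libsvm_line("5 153:3 154:18 155:18 156:18 157:126")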
Related
I am using the Facenet algorithm for face recognition. I want to build an application based on it, but the problem is that the Facenet algorithm returns an array of length 128, which is the face embedding per person.
For person identification, I have to compute the Euclidean distance between two persons' face embeddings and check whether it is below a threshold. If it is, the persons are the same; if it is greater, the persons are different.
Let's say I have to find person X in a database of 10k persons. I have to calculate the distance against each and every person's embedding, which is not efficient.
Is there any way to store these face embeddings efficiently and search for a person more efficiently?
I guess reading this blog post will help others.
It goes into detail and also covers most aspects of the implementation.
Face recognition on 330 million faces at 400 images per second
I recommend storing them in Redis or Cassandra. They will outperform relational databases.
Those key-value stores can store a multidimensional vector as a value.
You can find embedding vectors with deepface. I shared a sample code snippet below.
#!pip install deepface
from deepface import DeepFace
img_list = ["img1.jpg", "img2.jpg", ...]
model = DeepFace.build_model("Facenet")
for img_path in img_list:
    img_embedding = DeepFace.represent(img_path, model = model)
    #store img_embedding into redis here
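As a rough illustration of the Redis suggestion above (a sketch, assuming the redis-py client and a local Redis server; the embedding:<id> key scheme is made up for this example), each 128-float embedding can be serialized and written under the person's id:

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def store_embedding(person_id, embedding):
    # serialize the 128-dim embedding as JSON and key it by person id
    r.set(f"embedding:{person_id}", json.dumps(list(embedding)))

def load_embedding(person_id):
    return json.loads(r.get(f"embedding:{person_id}"))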
Sounds like you want a nearest-neighbour search. You could have a look at the various space-partitioning data structures, such as k-d trees.
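For example, a minimal sketch with SciPy's cKDTree (the db_embeddings array is a placeholder for your stored embeddings): the tree is built once and each query returns the 50 nearest entries.

import numpy as np
from scipy.spatial import cKDTree

db_embeddings = np.random.rand(10000, 128)   # placeholder for the 10k stored embeddings
tree = cKDTree(db_embeddings)                # build once, reuse for every query

query = np.random.rand(128)                  # embedding of the face to identify
distances, indices = tree.query(query, k=50) # 50 nearest neighbours by Euclidean distance

Note that k-d trees degrade in very high dimensions, so for 128-dimensional embeddings an approximate method (e.g. LSH or HNSW) may work better in practice.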
First make a dictionary with the 10,000 face encodings, as shown in the face_recognition examples, then store it as a pickle file. Once loaded in memory, it takes about a second to find the distances between face encoding X and those 10,000 pre-computed encodings. Take a look at how it works; I'm operating with millions of faces this way.
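A minimal sketch of that approach (my illustration, assuming 128-dim NumPy encodings; all_encodings, query_encoding and the file name are placeholders):

import pickle
import numpy as np

# build once: person id -> 128-dim encoding, then persist to disk
known = {pid: np.asarray(enc) for pid, enc in all_encodings.items()}
with open("encodings.pkl", "wb") as f:
    pickle.dump(known, f)

# at query time: load once, then distances to all 10k encodings are one vectorized op
with open("encodings.pkl", "rb") as f:
    known = pickle.load(f)
ids = list(known)
matrix = np.stack([known[i] for i in ids])
dists = np.linalg.norm(matrix - query_encoding, axis=1)
best = ids[int(np.argmin(dists))]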
Today I was wondering: why are RGB values stored in files at all? Wouldn't it save space if a program could take the RGB values and derive a function for each colour channel: r(x), g(x), b(x)? I would assume it would be costly at run time to evaluate the function a few million times, but the trade-off would be no file size. Clearly I'm missing something and would really appreciate an explanation of why this isn't feasible. Thanks.
In a way you are right: it is inefficient to store the RGB values for every pixel (a.k.a. a bitmap), and that's why many formats such as PNG use compression (JPEG uses lossy compression, which saves even more memory). However, you are mistaken that you can simply create a simple function that will tell you the value of every pixel at a certain point. You could of course construct such a function, say a polynomial, for an arbitrary image, but this function would have so many terms that you wouldn't really save any memory, because you still need to store the function. There is a fundamental limit to how much you can compress an image or any data, which depends on its entropy.
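A quick way to see this (a toy sketch, not a real compression scheme): interpolating n pixel values exactly with a polynomial needs a polynomial of degree n-1, i.e. n coefficients, so nothing is saved.

import numpy as np

pixels = np.array([12, 200, 37, 91, 154, 5, 243, 88])  # 8 sample values of one channel
x = np.arange(len(pixels))

# an exact polynomial fit through 8 points needs degree 7 -> 8 coefficients
coeffs = np.polyfit(x, pixels, deg=len(pixels) - 1)
print(len(coeffs))            # 8: as many numbers as we started with
print(np.polyval(coeffs, x))  # reproduces the original values (up to rounding)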
If you are interested in that sort of thing, look up "Fractal Compression" or "Lossless compression (section: Limitations)" on Wikipedia.
I want to pack a giant DNA sequence into an iOS app (about 3,000,000,000 base pairs). Each base pair can have the value A, C, T or G. Storing each base pair in one byte would give a file of 3 GB, which is way too much. :)
Now I thought of storing each base pair in two bits (four base pairs per octet), which gives a file of 750 MB. But 750 MB is still way too much, even when compressed.
Are there any better file formats for efficiently storing giant base-pair sequences on disk? Memory is not a problem, as I read in chunks.
I think you'll have to use two bits per base pair, plus implement compression as described in this paper.
"DNA sequences... are not random; they contain
repeating sections, palindromes, and other features that
could be represented by fewer bits than is required to spell
out the complete sequence in binary...
With the proposed algorithm, sequence will be compressed by 75%
irrespective of the number of repeated or non-repeated
patterns within the sequence."
DNA Compression Using Hash Based Data Structure, International Journal of Information Technology and Knowledge Management
July-December 2010, Volume 2, No. 2, pp. 383-386.
Edit: There is a program called GenCompress which claims to compress DNA sequences efficiently:
http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/
Edit: See also this question on BioStar.
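As a baseline for the two-bits-per-base idea, here is a minimal Python sketch (my own illustration; the particular bit assignment is arbitrary) that packs four bases into each byte and unpacks them again:

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = {v: k for k, v in CODE.items()}

def pack(seq):
    # four bases per byte; a real implementation would also store len(seq)
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for j, base in enumerate(seq[i:i + 4]):
            byte |= CODE[base] << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack(data, length):
    return "".join(BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length))

packed = pack("ACGTACGTAA")
assert unpack(packed, 10) == "ACGTACGTAA"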
If you don't mind having a complex solution, take a look at this paper or this paper or even this one which is more detailed.
But I think you need to specify better what you're dealing with. Some specific applications can lead to different storage. For example, the last paper I cited deals with lossy compression of DNA...
Base pairs always pair up, so you should only have to store one side of the strand. Now, I doubt that this works if there are certain mutations in the DNA (like a thymine dimer) that cause the opposite strand not to be the exact complement of the stored strand. Beyond that, I don't think you have many options other than to compress it somehow. But then again, I'm not a bioinformatics guy, so there might be some pretty sophisticated ways to store a bunch of DNA in a small space. Another idea, if it's an iOS app, is just putting a reader on the device and fetching the sequence from a web service.
Use a diff against a reference genome. From the size (3 Gbp) that you post, it looks like you want to include a full human sequence. Since sequences don't differ much from person to person, you should be able to compress massively by storing only a diff.
That could help a lot, unless your goal is to store the reference sequence itself; then you're stuck.
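A toy sketch of the diff idea (my illustration, assuming both sequences are plain strings of equal length and ignoring insertions/deletions, which a real diff would have to handle):

def diff_against_reference(reference, sample):
    # store only the positions where the sample differs from the reference
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]

def reconstruct(reference, diff):
    seq = list(reference)
    for i, b in diff:
        seq[i] = b
    return "".join(seq)

ref    = "ACGTACGTAC"
sample = "ACGAACGTGC"
d = diff_against_reference(ref, sample)   # [(3, 'A'), (8, 'G')]
assert reconstruct(ref, d) == sample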
Consider this: how many different orderings of the four letters can you get? (There are 4! = 24.)
actg = 1
atcg = 2
atgc = 3, and so on.
So you can create an array like [1, 2, 3]. Then you can go one step further: check if 1 is followed by 2, convert the pair 1,2 to a, 1,3 to b, and so on...
If I understand DNA a bit, you also cannot get arbitrary values, since A must pair with T and C with G (or something like that), which reduces your options.
So basically you can look for a sequence, give it a code, and convert that code back later...
You want to look into a 3D space-filling curve. A 3D SFC reduces 3D complexity to 1D complexity. It's a little bit like an octree or an R-tree. If you can store your full DNA in an SFC you can look for similar tiles in the tree, although an SFC is most likely to be used with lossy compression. Maybe you can use a block-sorting algorithm like the BWT (Burrows-Wheeler transform) if you know the size of the tiles, and then try an entropy coder like Huffman coding or a Golomb code?
You can use tools like MFCompress, Deliminate, or Comrad. These tools achieve an entropy of less than 2, i.e. storing each symbol takes fewer than 2 bits.
Does anybody have any pointers to a Naive Bayes classifier implementation, preferably in C? I have a 5-dimensional binary dataset. The class labels are also binary. I used a Naive Bayes classifier in Matlab with good results. However, is there any machine learning algorithm and implementation which allows me to infer the data from the class labels? In this case I want five-dimensional binary data inferred from a binary class label. A sample of the data is [1 1 0 1 0] with class 0.
As you have a binary dataset, here is a nice implementation using C:
http://users.ics.tkk.fi/jhollmen/BernoulliMix/
It is open-source software that we are currently using in our course; you can actually check how he implemented the algorithm.
And about the question you asked, here is my understanding.
What a naive Bayes classifier (NBC) does is predict P(C|X) given some data and labels. According to Bayes' theorem,
P(C|X) = \frac{P(X|C)P(C)}{P(X)}
which means that all you can do is predict the class of unknown data. Conversely, what you want to do there is P(X|C). Therefore, you can turn your model around like this,
P(X|C) = \frac{P(C|X)P(X)}{P(C)}
Accordingly, you have to assume a distribution for your data, and so on; therefore, it might not be very accurate if your assumption about the data is wrong. In your case, you have binary attributes X that you want to estimate from the class label; if you assume the attributes are independent, what you need is something like this,
P(C|X_1,X_2,X_3,X_4,X_5) \propto P(X_1|C)P(X_2|C)P(X_3|C)P(X_4|C)P(X_5|C)P(C)
which is not so easy to solve.....
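To make the P(X|C) direction concrete, here is a small Python sketch (my own illustration, not the BernoulliMix code linked above): it estimates the per-attribute Bernoulli parameters P(X_i = 1 | C) from training data and then samples a 5-dimensional binary vector for a given class.

import numpy as np

# toy training data: rows are 5-dim binary samples, y holds the binary class labels
X = np.array([[1, 1, 0, 1, 0], [1, 0, 0, 1, 0], [0, 1, 1, 0, 1], [0, 0, 1, 0, 1]])
y = np.array([0, 0, 1, 1])

def fit_bernoulli_per_class(X, y, alpha=1.0):
    # estimate P(X_i = 1 | C = c) for each class, with Laplace smoothing alpha
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return params

def sample_given_class(params, c, rng=np.random.default_rng()):
    # draw each attribute independently from Bernoulli(P(X_i = 1 | C = c))
    return (rng.random(len(params[c])) < params[c]).astype(int)

p = fit_bernoulli_per_class(X, y)
print(p[0])                      # estimated P(X_i = 1 | C = 0)
print(sample_given_class(p, 0))  # one 5-dim binary sample "inferred" from class 0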
Have a look at this package from the R project:
http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/e1071/html/naiveBayes.html
http://cran.r-project.org/web/packages/e1071/index.html
You have tagged [C]: it is possible to link R with your own C-programs.
I am working on a chemistry/biology project. We are building a web application for fast matching of the user's experimental data against predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples, each containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance: (7.2394, 2), (7.4011, 1), (9.9367, 3), ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit: moved that text to an answer below.
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem when comparing functions, e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that much data in two or three milliseconds.
But if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And if the data is stored sorted by the floats, that improves the locality of matching by those: you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start from.
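A rough sketch of that layout (my illustration; reference_entries is a placeholder for the stored (float, int) tuples): group the entries by their integer value, keep each group sorted by the float, and stop scanning a group as soon as you are out of tolerance.

from collections import defaultdict
from bisect import bisect_left

# build once: integer value -> sorted list of floats
index = defaultdict(list)
for value, tag in reference_entries:   # reference_entries: the stored (float, int) tuples
    index[tag].append(value)
for tag in index:
    index[tag].sort()

def matches(query_value, query_tag, tol):
    group = index[query_tag]           # only the bin with the right integer value
    out = []
    for v in group[bisect_left(group, query_value - tol):]:
        if v > query_value + tol:
            break                      # sorted, so we can stop early
        out.append(v)
    return out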
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there are only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of a linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high-level languages.
import random

r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]

# this is a decorate-sort-undecorate pattern
# look for matches to (7, 9)
# obviously, you can use whatever distance expression you want
zz = [(abs((7 - x) + (9 - y)), x, y) for x, y in r]
zz.sort()

# return the 50 best matches
[(x, y) for a, x, y in zz[:50]]
Can't you sort the tuples and perform a binary search on the sorted array?
I assume your database is built once and for all, and the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look at the middle of the sorted array. If the query value is larger than the centre value, you repeat the work on the upper half, otherwise on the lower one.
The worst case is O(log n).
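A minimal sketch of that idea in Python (my illustration, using the standard bisect module): with the reference floats sorted, a binary search gives the window of entries whose float value lies within a tolerance of the query, and only those need to be scored.

import bisect

# reference entries sorted by their float value: (float, int) pairs
reference = sorted([(7.2394, 2), (7.4011, 1), (9.9367, 3), (12.01, 5)])
floats = [f for f, _ in reference]

def candidates(query_value, tol):
    # binary search for the window [query - tol, query + tol]: O(log n) to locate it
    lo = bisect.bisect_left(floats, query_value - tol)
    hi = bisect.bisect_right(floats, query_value + tol)
    return reference[lo:hi]

print(candidates(7.30, 0.2))   # [(7.2394, 2), (7.4011, 1)]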
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are still testing, and want to offer the user the choice of different bin sizes: 0.1, 0.2, 0.3 or 0.4. Binning then leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be: for each bin, take the absolute difference between the query integer value and the reference integer value. Summing these differences over all bins gives the similarity score, with the most similar reference entries having the lowest scores.
Another (simpler) search option we want to offer is one where the user only enters the float values. The integer values in both the query and the reference lists are then set to 1. We then use the Hamming distance to compute the difference between the binned query and reference values. I have previously asked about an efficient algorithm for that search.
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we could use Principal Component Analysis (PCA), as described here.
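For what it's worth, a small Python sketch of the binning and scoring described above (a sketch under stated assumptions: bin size 0.1 over the 0.0-20.0 range and absolute-difference scoring; not our production code):

import numpy as np

BIN_SIZE = 0.1
N_BINS = int(20.0 / BIN_SIZE)   # 200 bins over the 0.0 - 20.0 range

def to_bins(entry):
    # entry is a list of (float, int) tuples; 0 means "no value in this bin"
    bins = np.zeros(N_BINS, dtype=int)
    for value, count in entry:
        bins[min(int(value / BIN_SIZE), N_BINS - 1)] = count
    return bins

def score(query_bins, reference_bins):
    # lower score = more similar
    return int(np.abs(query_bins - reference_bins).sum())

query = to_bins([(7.2394, 2), (7.4011, 1), (9.9367, 3)])
reference = to_bins([(7.21, 2), (7.44, 1), (9.93, 3)])
print(score(query, reference))   # 0 -- identical after binning despite small float errors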