Most efficient way to store a big DNA sequence? - file-format

I want to pack a giant DNA sequence with an iOS app (about 3,000,000,000 base pairs). Each base pair can have a value A, C, T or G. Storing each base pair in one bytes would give a file of 3 GB, which is way too much. :)
Now I though of storing each base pair in two bits (four base pairs per octet), which gives a file of 750 MB. 750 MB is still way too much, even when compressed.
Are there any better file formats for efficiently storing giant base pairs on disk? In memory is not a problem as I read in chunks.

I think you'll have to use two bits per base pair, plus implement compression as described in this paper.
"DNA sequences... are not random; they contain
repeating sections, palindromes, and other features that
could be represented by fewer bits than is required to spell
out the complete sequence in binary...
With the proposed algorithm, sequence will be compressed by 75%
irrespective of the number of repeated or non-repeated
patterns within the sequence."
DNA Compression Using Hash Based Data Structure, International Journal of Information Technology and Knowledge Management
July-December 2010, Volume 2, No. 2, pp. 383-386.
Edit: There is a program called GenCompress which claims to compress DNA sequences efficiently:
http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/
Edit: See also this question on BioStar.

If you don't mind having a complex solution, take a look at this paper or this paper or even this one which is more detailed.
But I think you need to specify better what you're dealing with. Some specifics applications can lead do diferent storage. For example, the last paper I cited deals with lossy compression of DNA...

Base pairs always pair up, so you should only have to store one side of the strand. Now, I doubt that this works if there are certain mutations in the DNA (like a di-Thiamine bond) that cause the opposite strand to not be the exact opposite of the stored strand. Beyond that, I don't think you have many options other than to compress it somehow. But, then again, I'm not a bioinformatics guy, so there might be some pretty sophisticated ways to store a bunch of DNA in a small space. Another idea if it's an iOS app is just putting a reader on the device and reading the sequence from a web service.

Use a diff from a reference genome. From the size (3Gbp) that you post, it looks like you want to include a full human sequences. Since sequences don't differ too much from person to person, you should be able to compress massively by storing only a diff.
Could help a lot. Unless your goal is to store the reference sequence itself. Then you're stuck.

consider this, how many different combinations can you get? out of 4 (i think its about 16 )
actg = 1
atcg = 2
atgc = 3 and so on, so that
you can create an array like [1,2,3] then you can go one step further,
check if 1 is follow by 2, convert 12 to a, 13 = b and so on...
if I understand DNA a bit it means that you cannot get a certain value
as a must be match with c, and t with g or something like that which reduces your options,
so basically you can look for a sequence and give it a something you can also convert back...

You want to look into a 3d space-filling curve. A 3d sfc reduces the 3d complexity to a 1d complexity. It's a little bit like n octree or a r-tree. If you can store your full dna in a sfc you can look for similar tiles in the tree although a sfc is most likely to use with lossy compression. Maybe you can use a block-sorting algorithm like the bwt if you know the size of the tiles and then try an entropy compression like a huffman compression or a golomb code?

You can use the tools like MFCompress, Deliminate,Comrad.These tools provides entropy less than 2.That is for storing each symbol it will take less than 2 bits

Related

Mathematical Idea with the purpose to store data more efficiently

dear reader
I have been thinking about how to store data efficiently since the beginning of my studies and while taking a shower, I came up with the following idea:
For example, you take a picture and convert that picture into 0(zeros) and 1(ones). Then you take this eternally long number and divide it by e.g. 10 then again by 10 and then again by 10 etc. and at the end you have a small number. Now the small number and the calculation path are stored and if someone wants to read the data, they only have to perform the inverse operation to get the result.
The idea is too good to be true --> my gut feeling tells me. But I would still like to know why this should not work?
Kind regards
Hello, dear reader
I have been thinking about how to store data efficiently since the beginning of my studies and while taking a shower, I came up with the following idea:
For example, you take a picture and convert that picture into 0(zeros) and 1(ones). Then you take this eternally long number and divide it by e.g. 10 then again by 10 and then again by 10 etc. and at the end you have a small number. Now the small number and the calculation path are stored and if someone wants to read the data, they only have to perform the inverse operation to get the result.
The idea is too good to be true --> my gut feeling tells me. But I would still like to know why this should not work?
Kind regards
Fun theorem. No bijection on the natural numbers can map every number to a smaller one. Proof by contradiction, consider F(F(1)).
Lots of ways to map numbers 1-1 to smaller numbers such that many map to smaller numbers. These are lossless compression algorithms. Most have the property that repeated application of the algorithm make the data larger, or leave it unchanged.
In your proposal, to the extent I understand it, you would have to store all the remainders of the division, which would be as large as the original data.

Fast spatial data structure for nearest neighbor search amongst non-uniformly sized hyperspheres

Given a k-dimensional continuous (euclidean) space filled with rather unpredictably moving/growing/shrinking  hyperspheres I need to repeatedly find the hypersphere whose surface is nearest to a given coordinate. If some hyperspheres are of the same distance to my coordinate, then the biggest hypersphere wins. (The total count of hyperspheres is guaranteed to stay the same over time.)
My first thought was to use a KDTree but it won't take the hyperspheres' non-uniform volumes into account.
So I looked further and found BVH (Bounding Volume Hierarchies) and BIH (Bounding Interval Hierarchies), which seem to do the trick. At least in 2-/3-dimensional space. However while finding quite a bit of info and visualizations on BVHs I could barely find anything on BIHs.
My basic requirement is a k-dimensional spatial data structure that takes volume into account and is either super fast to build (off-line) or dynamic with barely any unbalancing.
Given my requirements above, which data structure would you go with? Any other ones I didn't even mention?
Edit 1: Forgot to mention: hypershperes are allowed (actually highly expected) to overlap!
Edit 2: Looks like instead of "distance" (and "negative distance" in particular) my described metric matches the power of a point much better.
I'd expect a QuadTree/Octree/generalized to 2^K-tree for your dimensionality of K would do the trick; these recursively partition space, and presumably you can stop when a K-subcube (or K-rectangular brick if the splits aren't even) does not contain a hypersphere, or contains one or more hyperspheres such that partitioning doesn't separate any, or alternatively contains the center of just a single hypersphere (probably easier).
Inserting and deleting entities in such trees is fast, so a hypersphere changing size just causes a delete/insert pair of operations. (I suspect you can optimize this if your sphere size changes by local additional recursive partition if the sphere gets smaller, or local K-block merging if it grows).
I haven't worked with them, but you might also consider binary space partitions. These let you use binary trees instead of k-trees to partition your space. I understand that KDTrees are a special case of this.
But in any case I thought the insertion/deletion algorithms for 2^K trees and/or BSP/KDTrees was well understood and fast. So hypersphere size changes cause deletion/insertion operations but those are fast. So I don't understand your objection to KD-trees.
I think the performance of all these are asymptotically the same.
I would use the R*Tree extension for SQLite. A table would normally have 1 or 2 dimensional data. SQL queries can combine multiple tables to search in higher dimensions.
The formulation with negative distance is a little weird. Distance is positive in geometry, so there may not be much helpful theory to use.
A different formulation that uses only positive distances may be helpful. Read about hyperbolic spaces. This might help to provide ideas for other ways to describe distance.

finding a number appearing again among numbers stored in a file

Say, i have 10 billions of numbers stored in a file. How would i find the number that has already appeared once previously?
Well i can't just populate billions of number at a stretch in array and then keep a simple nested loop to check if the number has appeared previously.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N)
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.
Author Edit:
Below, #Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th, 512mb. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course one needs to take a bit "special" hash table:
Take a hashtable consisting of 2^32 bits. Since the question asks about 4-byte-integers, there are at most 2^32 different of them, i.e. one bit for each number. 2^32 bit = 512mb.
So now one has just to determine the location of the corresponding bit in the hashmap and set it. If one encounters a bit which already is set, the number occured in the sequence already.
The important question is whether you want to solve this problem efficiently, or whether you want accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use (as kbrimington suggested you use a standard hash table) the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass, determine the elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once, create a hashtable storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item iself, for example the least significant digits, let's say 20 digits (1M items).
After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
How about:
Sort input by using some algorith which allows only portion of input to be in RAM. Examples are there
Seek duplicates in output of 1st step -- you'll need space for just 2 elements of input in RAM at a time to detect repetitions.
Finding duplicates
Noting that its a 32bit integer means that you're going to have a large number of duplicates, since a 32 bit int can only represent 4.3ish billion different numbers and you have "10 billions".
If you were to use a tightly packed set you could represent whether all the possibilities are in 512 MB, which can easily fit into current RAM values. This as a start pretty easily allows you to recognise the fact if a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated you're getting into having a hashmap that contains only duplicates (using the first 500MB of the ram to tell efficiently IF it should be in the map or not). At a worst case scenario with a large spread you're not going to be able fit that into ram.
Another approach if the numbers will have an even amount of duplicates is to use a tightly packed array with 2-8 bits per value, taking about 1-4GB of RAM allowing you to count up to 255 occurrances of each number.
Its going to be a hack, but its doable.
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.
If possible range of numbers in file is not too large then you can use some bit array to indicate if some of the number in range appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With large range (like int) you need to read through the file every time. File layout may allow for more efficient lookups (i.e. binary search in case of sorted array).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4.294.967.296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur?Still, it would need 536.870.912 bytes at least. (512 MB.) It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
Had to do this a long time ago.
What i did... i sorted the numbers as much as i could (had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become..
[1,10], 12, 16, [20,50], 52, ...
Since in my case i had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets i had a very low memory useage
p.s. another way to store them
1, -9, 12, 16, 20, -30, 52,
when i had no numbers lower than zero
After that i applied various algorithms (described by other posters) here on the reduced data set
#include <stdio.h>
#include <stdlib.h>
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
unsigned x=0;
size_t *seen = malloc(1<<8*sizeof(unsigned)-3);
while (scanf("%u", &x)>0 && !BITOP(seen,x,&)) BITOP(seen,x,|=);
if (BITOP(seen,x,&)) printf("duplicate is %u\n", x);
else printf("no duplicate\n");
return 0;
}
This is a simple problem that can be solved very easily (several lines of code) and very fast (several minutes of execution) with the right tools
my personal approach would be in using MapReduce
MapReduce: Simplified Data Processing on Large Clusters
i'm sorry for not going into more details but once getting familiar with the concept of MapReduce it is going to be very clear on how to target the solution
basicly we are going to implement two simple functions
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]) if number of values > 1 than its a duplicate number
print the duplicates : number, line_index1, line_index2,...
again this approach can result in a very fast execution depending on how your MapReduce framework is set, highly scalable and very reliable, there are many diffrent implementations for MapReduce in many languages
there are several top companies presenting already built up cloud computing environments like Google, Microsoft azure, Amazon AWS, ...
or you can build your own and set a cluster with any providers offering virtual computing environments paying very low costs by the hour
good luck :)
Another more simple approach could be in using bloom filters
AdamT
Implement a BitArray such that ith index of this array will correspond to the numbers 8*i +1 to 8*(i+1) -1. ie first bit of ith number is 1 if we already had seen 8*i+1. Second bit of ith number is 1 if we already have seen 8*i + 2 and so on.
Initialize this bit array with size Integer.Max/8 and whenever you saw a number k, Set the k%8 bit of k/8 index as 1 if this bit is already 1 means you have seen this number already.

How are neural networks used when the number of inputs could be variable?

All the examples I have seen of neural networks are for a fixed set of inputs which works well for images and fixed length data. How do you deal with variable length data such sentences, queries or source code? Is there a way to encode variable length data into fixed length inputs and still get the generalization properties of neural networks?
I have been there, and I faced this problem.
The ANN was made for fixed feature vector length, and so are many other classifiers such as KNN, SVM, Bayesian, etc.
i.e. the input layer should be well defined and not varied, this is a design problem.
However, some researchers opt for adding zeros to fill the missing gap, I personally think that this is not a good solution because those zeros (unreal values) will affect the weights that the net will converge to. in addition there might be a real signal ending with zeros.
ANN is not the only classifier, there are more and even better such as the random forest. this classifier is considered the best among researchers, it uses a small number of random features, creating hundreds of decision trees using bootstrapping an bagging, this might work well, the number of the chosen features normally the sqrt of the feature vector size. those features are random. each decision tree converges to a solution, using majority rules the most likely class will chosen then.
Another solution is to use the dynamic time warping DTW, or even better to use Hidden Markov models HMM.
Another solution is the interpolation, interpolate (compensate for missing values along the small signal) all the small signals to be with the same size as the max signal, interpolation methods include and not limited to averaging, B-spline, cubic.....
Another solution is to use feature extraction method to use the best features (the most distinctive), this time make them fixed size, those method include PCA, LDA, etc.
another solution is to use feature selection (normally after feature extraction) an easy way to select the best features that give the best accuracy.
that's all for now, if non of those worked for you, please contact me.
You would usually extract features from the data and feed those to the network. It is not advisable to take just some data and feed it to net. In practice, pre-processing and choosing the right features will decide over your success and the performance of the neural net. Unfortunately, IMHO it takes experience to develop a sense for that and it's nothing one can learn from a book.
Summing up: "Garbage in, garbage out"
Some problems could be solved by a recurrent neural network.
For example, it is good for calculating parity over a sequence of inputs.
The recurrent neural network for calculating parity would have just one input feature.
The bits could be fed into it over time. Its output is also fed back to the hidden layer.
That allows to learn the parity with just two hidden units.
A normal feed-forward two-layer neural network would require 2**sequence_length hidden units to represent the parity. This limitation holds for any architecture with just 2 layers (e.g., SVM).
I guess one way to do it is to add a temporal component to the input (recurrent neural net) and stream the input to the net a chunk at a time (basically creating the neural network equivalent of a lexer and parser) this would allow the input to be quite large but would have the disadvantage that there would not necessarily be a stop symbol to seperate different sequences of input from each other (the equivalent of a period in sentances)
To use a neural net on images of different sizes, the images themselves are often cropped and up or down scaled to better fit the input of the network. I know that doesn't really answer your question but perhaps something similar would be possible with other types of input, using some sort of transformation function on the input?
i'm not entirely sure, but I'd say, use the maximum number of inputs (e.g. for words, lets say no word will be longer than 45 characters (longest word found in a dictionary according to wikipedia), and if a shorter word is encountered, set the other inputs to a whitespace character.
Or with binary data, set it to 0. the only problem with this approach is if an input filled with whitespace characters/zeros/whatever collides with a valid full length input (not so much a problem with words as it is with numbers).

Data structure for finding nearby keys with similar bitvalues

I have some data, up to a between a million and a billion records, each which is represented by a bitfield, about 64 bits per key. The bits are independent, you can imagine them basically as random bits.
If I have a test key and I want to find all values in my data with the same key, a hash table will spit those out very easily, in O(1).
What algorithm/data structure would efficiently find all records most similar to the query key? Here similar means that most bits are identical, but a minimal number are allowed to be wrong. This is traditionally measured by Hamming distance., which just counts the number of mismatched bits.
There's two ways this query might be made, one might be by specifying a mismatch rate like "give me a list of all existing keys which have less than 6 bits that differ from my query" or by simply best matches, like "give me a list of the 10,000 keys which have the lowest number of differing bits from my query."
You might be temped to run to k-nearest-neighbor algorithms, but here we're talking about independent bits, so it doesn't seem likely that structures like quadtrees are useful.
The problem can be solved by simple brute force testing a hash table for low numbers of differing bits. If we want to find all keys that differ by one bit from our query, for example, we can enumerate all 64 possible keys and test them all. But this explodes quickly, if we wanted to allow two bits of difference, then we'd have to probe 64*63=4032 times. It gets exponentially worse for higher numbers of bits.
So is there another data structure or strategy that makes this kind of query more efficient?
The database/structure can be preprocessed as much as you like, it's the query speed that should be optimized.
What you want is a BK-Tree. It's a tree that's ideally suited to indexing metric spaces (your problem is one), and supports both nearest-neighbour and distance queries. I wrote an article about it a while ago.
BK-Trees are generally described with reference to text and using levenshtein distance to build the tree, but it's straightforward to write one in terms of binary strings and hamming distance.
This sounds like a good fit for an S-Tree, which is like a hierarchical inverted file. Good resources on this topic include the following papers:
Hierarchical Bitmap Index: An Efficient and Scalable Indexing Technique for Set-Valued Attributes.
Improved Methods for Signature-Tree Construction (2000)
Quote from the first one:
The hierarchical bitmap index efficiently supports dif-
ferent classes of queries, including subset, superset and similarity queries.
Our experiments show that the hierarchical bitmap index outperforms
other set indexing techniques significantly.
These papers include references to other research that you might find useful, such as M-Trees.
Create a binary tree (specifically a trie) representing each key in your start set in the following way: The root node is the empty word, moving down the tree to the left appends a 0 and moving down the right appends a 1. The tree will only have as many leaves as your start set has elements, so the size should stay manageable.
Now you can do a recursive traversal of this tree, allowing at most n "deviations" from the query key in each recursive line of execution, until you have found all of the nodes in the start set which are within that number of deviations.
I'd go with an inverted index, like a search engine. You've basically got a fixed vocabulary of 64 words. Then similarity is measured by hamming distance, instead of cosine similarity like a search engine would want to use. Constructing the index will be slow, but you ought to be able to query it with normal search enginey speeds.
The book Introduction to Information Retrieval covers the efficient construction, storage, compression and querying of inverted indexes.
"Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", from 2008, seems to be the best result as of then. I won't try to summarize since I read it over a year ago and it's hairy. That's from a page on locality-sensitive hashing, along with an implementation of an earlier version of the scheme. For more general pointers, read up on nearest neighbor search.
This kind of question has been asked before: Fastest way to find most similar string to an input?
The database/structure can be
preprocessed as much as you like
Well...IF that is true. Then all you need is a similarity matrix of your hamming distances. Make the matrix sparse by pruning out large distances. It doesn't get any faster and not that much of a memory hog.
Well, you could insert all of the neighbor keys along with the original key. That would mean that you store (64 choose k) times as much data, for k differing bits, and it will require that you decide k beforehand. Though you could always extend k by brute force querying neighbors, and this will automatically query the neighbors of your neighbors that you inserted. This also gives you a time-space tradeoff: for example, if you accept a 64 x data blowup and 64 times slower you can get two bits of distance.
I haven't completely thought this through, but I have an idea of where I'd start.
You could divide the search space up into a number of buckets where each bucket has a bucket key and the keys in the bucket are the keys that are more similar to this bucket key than any other bucket key. To create the bucket keys, you could randomly generate 64 bit keys and discard any that are too close to any previously created bucket key, or you could work out some algorithm that generates keys that are all dissimilar enough. To find the closest key to a test key, first find the bucket key that is closest, and then test each key in the bucket. (Actually, it's possible, but not likely, for the closest key to be in another bucket - do you need to find the closest key, or would a very close key be good enough?)
If you're ok with doing it probabilistically, I think there's a good way to solve question 2. I assume you have 2^30 data and cutoff and you want to find all points within cutoff distance from test.
One_Try()
1. Generate randomly a 20-bit subset S of 64 bits
2. Ask for a list of elements that agree with test on S (about 2^10 elements)
3. Sort that list by Hamming distance from test
4. Discard the part of list after cutoff
You repeat One_Try as much as you need while merging the lists. The more tries you have, the more points you find. For example, if x is within 5 bits, you'll find it in one try with about (2/3)^5 = 13% probability. Therefore if you repeat 100 tries you find all but roughly 10^{-6} of such x. Total time: 100*(1000*log 1000).
The main advantage of this is that you're able to output answers to question 2 as you proceed, since after the first few tries you'll certainly find everything within distance not more than 3 bits, etc.
If you have many computers, you give each of them several tries, since they are perfectly parallelizable: each computer saves some hash tables in advance.
Data structures for large sets described here: Detecting Near-Duplicates for Web Crawling
or
in memory trie: Judy-arrays at sourceforge.net
Assuming you have to visit each row to test its value (or if you index on the bitfield then each index entry), then you can write the actual test quite efficiently using
A xor B
To find the difference bits, then bit-count the result, using a technique like this.
This effectively gives you the hamming distance.
Since this can compile down to tens of instructions per test, this can run pretty fast.
If you are okay with a randomized algorithm (monte carlo in this case), you can use the minhash.
If the data weren't so sparse, a graph with keys as the vertices and edges linking 'adjacent' (Hamming distance = 1) nodes would probably be very efficient time-wise. The space would be very large though, so in your case, I don't think it would be a worthwhile tradeoff.

Resources