I was trying to collect statistics of a 6D vector and plot a 1D histogram for each coordinate. I get 729000000 different copies of this vector (each 6 dimensional). For this I create an array of zeros of size 729000000x6 before I get any of the actual W's and this seems to be a problem in matlab since it says:
Error using zeros
Requested 729000000x6 (32.6GB) array exceeds maximum array size preference. Creation of arrays
greater than this limit may take a long time and cause MATLAB to become unresponsive. See array
size limit or preference panel for more information.
The reason I did this at first was because it was easy to fill W_history and then just feed it to the histogram plotter:
histogram(W_history(:,d),nbins,'Normalization','probability')
however filling W_history seemed impossible for high number of copies of W. Is there a way to do this in matlab automatically? It feels that there should be and didn't want to re-invent the wheel.
I am sure I could potentially create for each coordinate some array of counters where I count how many times a specific value of the coordinate W falls. However, implementing that and having the checks for in which bin each one should fall seemed inefficient or even unnecessary. Is this really the only solution or what do matlab experts people recommend? Is this re-inventing the wheel? Seems also inefficient if I implement it myself?
Also, I thought I could manually have matlab put thing in memory then bring them back etc (as in store W_history in disk as it fills and then put more back in disk as it fills and eventually somehow plug it in to the histogram plotter), that seemed overwork. I hope I can avoid a solution like this one. It feels a wrong solution since it should be "easy" and high level to use matlab and going down to disk and memory doesn't seem to me what matlab is intended.
Currently through the comment that was given the best solution that I have so far is using histcounts as follow:
for i=2:iter+1
%
W = get_new_W(W)
%
[W_hist_counts_current, edges2] = histcounts(W,edges);
W_hist_counts = W_hist_counts + W_hist_counts_current;
end
however, after this it seems difficult to convert W_hist_counts to pdf/probability or other values since it seems they have to be processed manually. Is there no official way to do this processing without the user having to implement the normalizations again?
did any one know how to approximate lines from grayscale image resulted from line segment detector: using opencv or C language! in the image attached you see that each finger composed of many lines, what i need to do is to make each finger consists of exactly two parallel lines (i.e. approximate small lines to fit into only one line), if any one helps me, i will appreciate that.
N.B. i'm new to stackocerflow therefore i'm not allowed to post images, so for more clarification, that's the link of the image.
http://www.2shared.com/photo/Ff7mFtV3/Optimal.html
grayscale image resulted from line segment detector (LSD)
What have you done so far? You might need some heuristics. First add all segments on a table, try calculating the inclination of each of the segments and then sorting them by this as index. Afterwards, consider all segments that have an inclination say close by 5% or something to have the exact same inclination. This will induce a partitioning in the table. You might want to draw them using different colors so that you find the perfect parameter value.
Now you need to 'merge' all segments that have the same inclination and are close together. I'd try to measure the distance between the segments (google an algorithm for that) and sort the segments of each partition according to this. Consider merging segments that are close by less than, for instance, 3% of the total image height in pixels or something (find that empirically).
Last step, merging the segments should be very easy compared to the rest.
If you really want to find the fingers, you can stop earlier and compare the groups of same inclination to check if there are two almost (by 7% or so) parallel. The 5 closest pairs of inclinations should be fingers :-)
I want to pack a giant DNA sequence with an iOS app (about 3,000,000,000 base pairs). Each base pair can have a value A, C, T or G. Storing each base pair in one bytes would give a file of 3 GB, which is way too much. :)
Now I though of storing each base pair in two bits (four base pairs per octet), which gives a file of 750 MB. 750 MB is still way too much, even when compressed.
Are there any better file formats for efficiently storing giant base pairs on disk? In memory is not a problem as I read in chunks.
I think you'll have to use two bits per base pair, plus implement compression as described in this paper.
"DNA sequences... are not random; they contain
repeating sections, palindromes, and other features that
could be represented by fewer bits than is required to spell
out the complete sequence in binary...
With the proposed algorithm, sequence will be compressed by 75%
irrespective of the number of repeated or non-repeated
patterns within the sequence."
DNA Compression Using Hash Based Data Structure, International Journal of Information Technology and Knowledge Management
July-December 2010, Volume 2, No. 2, pp. 383-386.
Edit: There is a program called GenCompress which claims to compress DNA sequences efficiently:
http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/
Edit: See also this question on BioStar.
If you don't mind having a complex solution, take a look at this paper or this paper or even this one which is more detailed.
But I think you need to specify better what you're dealing with. Some specifics applications can lead do diferent storage. For example, the last paper I cited deals with lossy compression of DNA...
Base pairs always pair up, so you should only have to store one side of the strand. Now, I doubt that this works if there are certain mutations in the DNA (like a di-Thiamine bond) that cause the opposite strand to not be the exact opposite of the stored strand. Beyond that, I don't think you have many options other than to compress it somehow. But, then again, I'm not a bioinformatics guy, so there might be some pretty sophisticated ways to store a bunch of DNA in a small space. Another idea if it's an iOS app is just putting a reader on the device and reading the sequence from a web service.
Use a diff from a reference genome. From the size (3Gbp) that you post, it looks like you want to include a full human sequences. Since sequences don't differ too much from person to person, you should be able to compress massively by storing only a diff.
Could help a lot. Unless your goal is to store the reference sequence itself. Then you're stuck.
consider this, how many different combinations can you get? out of 4 (i think its about 16 )
actg = 1
atcg = 2
atgc = 3 and so on, so that
you can create an array like [1,2,3] then you can go one step further,
check if 1 is follow by 2, convert 12 to a, 13 = b and so on...
if I understand DNA a bit it means that you cannot get a certain value
as a must be match with c, and t with g or something like that which reduces your options,
so basically you can look for a sequence and give it a something you can also convert back...
You want to look into a 3d space-filling curve. A 3d sfc reduces the 3d complexity to a 1d complexity. It's a little bit like n octree or a r-tree. If you can store your full dna in a sfc you can look for similar tiles in the tree although a sfc is most likely to use with lossy compression. Maybe you can use a block-sorting algorithm like the bwt if you know the size of the tiles and then try an entropy compression like a huffman compression or a golomb code?
You can use the tools like MFCompress, Deliminate,Comrad.These tools provides entropy less than 2.That is for storing each symbol it will take less than 2 bits
Say, i have 10 billions of numbers stored in a file. How would i find the number that has already appeared once previously?
Well i can't just populate billions of number at a stretch in array and then keep a simple nested loop to check if the number has appeared previously.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N)
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.
Author Edit:
Below, #Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th, 512mb. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course one needs to take a bit "special" hash table:
Take a hashtable consisting of 2^32 bits. Since the question asks about 4-byte-integers, there are at most 2^32 different of them, i.e. one bit for each number. 2^32 bit = 512mb.
So now one has just to determine the location of the corresponding bit in the hashmap and set it. If one encounters a bit which already is set, the number occured in the sequence already.
The important question is whether you want to solve this problem efficiently, or whether you want accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use (as kbrimington suggested you use a standard hash table) the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass, determine the elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once, create a hashtable storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item iself, for example the least significant digits, let's say 20 digits (1M items).
After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
How about:
Sort input by using some algorith which allows only portion of input to be in RAM. Examples are there
Seek duplicates in output of 1st step -- you'll need space for just 2 elements of input in RAM at a time to detect repetitions.
Finding duplicates
Noting that its a 32bit integer means that you're going to have a large number of duplicates, since a 32 bit int can only represent 4.3ish billion different numbers and you have "10 billions".
If you were to use a tightly packed set you could represent whether all the possibilities are in 512 MB, which can easily fit into current RAM values. This as a start pretty easily allows you to recognise the fact if a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated you're getting into having a hashmap that contains only duplicates (using the first 500MB of the ram to tell efficiently IF it should be in the map or not). At a worst case scenario with a large spread you're not going to be able fit that into ram.
Another approach if the numbers will have an even amount of duplicates is to use a tightly packed array with 2-8 bits per value, taking about 1-4GB of RAM allowing you to count up to 255 occurrances of each number.
Its going to be a hack, but its doable.
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.
If possible range of numbers in file is not too large then you can use some bit array to indicate if some of the number in range appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With large range (like int) you need to read through the file every time. File layout may allow for more efficient lookups (i.e. binary search in case of sorted array).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4.294.967.296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur?Still, it would need 536.870.912 bytes at least. (512 MB.) It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
Had to do this a long time ago.
What i did... i sorted the numbers as much as i could (had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become..
[1,10], 12, 16, [20,50], 52, ...
Since in my case i had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets i had a very low memory useage
p.s. another way to store them
1, -9, 12, 16, 20, -30, 52,
when i had no numbers lower than zero
After that i applied various algorithms (described by other posters) here on the reduced data set
#include <stdio.h>
#include <stdlib.h>
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
unsigned x=0;
size_t *seen = malloc(1<<8*sizeof(unsigned)-3);
while (scanf("%u", &x)>0 && !BITOP(seen,x,&)) BITOP(seen,x,|=);
if (BITOP(seen,x,&)) printf("duplicate is %u\n", x);
else printf("no duplicate\n");
return 0;
}
This is a simple problem that can be solved very easily (several lines of code) and very fast (several minutes of execution) with the right tools
my personal approach would be in using MapReduce
MapReduce: Simplified Data Processing on Large Clusters
i'm sorry for not going into more details but once getting familiar with the concept of MapReduce it is going to be very clear on how to target the solution
basicly we are going to implement two simple functions
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]) if number of values > 1 than its a duplicate number
print the duplicates : number, line_index1, line_index2,...
again this approach can result in a very fast execution depending on how your MapReduce framework is set, highly scalable and very reliable, there are many diffrent implementations for MapReduce in many languages
there are several top companies presenting already built up cloud computing environments like Google, Microsoft azure, Amazon AWS, ...
or you can build your own and set a cluster with any providers offering virtual computing environments paying very low costs by the hour
good luck :)
Another more simple approach could be in using bloom filters
AdamT
Implement a BitArray such that ith index of this array will correspond to the numbers 8*i +1 to 8*(i+1) -1. ie first bit of ith number is 1 if we already had seen 8*i+1. Second bit of ith number is 1 if we already have seen 8*i + 2 and so on.
Initialize this bit array with size Integer.Max/8 and whenever you saw a number k, Set the k%8 bit of k/8 index as 1 if this bit is already 1 means you have seen this number already.
I need to do a program that does this: given an image (5*5 pixels), I have to search how many images like that exist in another image, composed by many other images. That is, i need to search a given pattern in an image.
The language to use is C. I have to use parallel computing to search in the 4 angles (0º, 90º, 180º and 270º).
What is the best way to do that?
Seems straight forward.
Create 4 versions of the image rotated by 0°, 90°, 180°, and 270°.
Start four threads each with one version of the image.
For all positions from (0,0) to (width - 5, height - 5)
Comapare the 25 pixels of the reference image with the 25 pixels at the current position
If they are equal enough using some metric, report the finding.
Use normalized correlation to determine a match of templates.
#Daniel, Daniel's solution is good for leveraging your multiple CPUs. He doesn't mention a quality metric that would be useful and I would like to suggest one quality metric that is very common in image processing.
I suggest using normalized correlation[1] as a comparison metric because it outputs a number from -1 to +1. Where 0 is no correlation 1 would be output if the two templates were identical and -1 would be if the two templates were exactly opposite.
Once you compute the normalized correlation you can test to see if you have found the template by doing either a threshold test or a peak-to-average test[2].
[1 - footnote] How do you implement normalized correlation? It is pretty simple and only has two for loops. Once you have an implementation that is good enough you can verify your implementation by checking to see if the identical image gets you a 1.
[2 - footnote] You do the ratio of the max(array) / average(array_without_peak). Then threshold to make sure you have a good peak to average ratio.
There's no need to create the additional three versions of the image, just address them differently or use something like the class I created here. Better still, just duplicate the 5x5 matrix and rotate those instead. You can then linearly scan the image for all rotations (which is a good thing).
This problem will not scale well for parallel processing since the bottleneck is certainly accessing the image data. Having multiple threads accessing the same data will slow it down, especially if the threads get 'out of sync', i.e. one thread gets further through the image than the other threads so that the other threads end up reloading the data the first thread has discarded.
So, the solution I think will be most efficient is to create four threads that scan 5 lines of the image, one thread per rotation. A fifth thread loads the image data one line at a time and passes the line to each of the four scanning threads, waiting for all four threads to complete, i.e. load one line of image, append to five line buffer, start the four scanning threads, wait for threads to end and repeat until all image lines are read.
5 * 5 = 25
25 bits fits in an integer.
each image can be encoded as an array of 4 integers.
Iterate your larger image, (hopefully it is not too big),
pulling out all 5 * 5 sub images, convert to an array of 4 integers and compare.