Mathematical Idea with the purpose to store data more efficiently - database

dear reader
I have been thinking about how to store data efficiently since the beginning of my studies and while taking a shower, I came up with the following idea:
For example, you take a picture and convert that picture into 0(zeros) and 1(ones). Then you take this eternally long number and divide it by e.g. 10 then again by 10 and then again by 10 etc. and at the end you have a small number. Now the small number and the calculation path are stored and if someone wants to read the data, they only have to perform the inverse operation to get the result.
The idea is too good to be true --> my gut feeling tells me. But I would still like to know why this should not work?
Kind regards
Hello, dear reader
I have been thinking about how to store data efficiently since the beginning of my studies and while taking a shower, I came up with the following idea:
For example, you take a picture and convert that picture into 0(zeros) and 1(ones). Then you take this eternally long number and divide it by e.g. 10 then again by 10 and then again by 10 etc. and at the end you have a small number. Now the small number and the calculation path are stored and if someone wants to read the data, they only have to perform the inverse operation to get the result.
The idea is too good to be true --> my gut feeling tells me. But I would still like to know why this should not work?
Kind regards

Fun theorem. No bijection on the natural numbers can map every number to a smaller one. Proof by contradiction, consider F(F(1)).
Lots of ways to map numbers 1-1 to smaller numbers such that many map to smaller numbers. These are lossless compression algorithms. Most have the property that repeated application of the algorithm make the data larger, or leave it unchanged.
In your proposal, to the extent I understand it, you would have to store all the remainders of the division, which would be as large as the original data.

Related

Data Structure for a recursive function with two parameters one of which is Large the other small

Mathematician here looking for a bit of help. (If you ever need math help I'll try to reciprocate on math.stackexchange!) Sorry if this is a dup. Couldn't find it myself.
Here's the thing. I write a lot of code (mostly in C) that is extremely slow and I know it could be sped up considerably but I'm not sure what data structure to use. I went to school 20 years ago and unfortunately never got to take a computer science course. I have watched a lot of open-course videos on data structures but I'm still a bit fuddled never taking an actual class.
Mostly my functions just take integers to integers. I almost always use 64-bit numbers and I have three use cases that I'm interested in. I use the word small to mean no more than a million or two in quantity.
Case 1: Small numbers as input. Outputs are arbitrary.
Case 2: Any 64-bit values as input, but only a small number of them. Outputs are arbitrary.
Case 3: Two parameter functions with one parameter that's small in value (say less than two million), and the other parameter is Large but with only a small number of possible inputs. Outputs are arbitrary.
For Case 1, I just make an array to cache the values. Easy and fast.
For Case 2, I think I should be using a hash. I haven't yet done this but I think I could figure it out if I took the time.
Case 3 is the one I'd like help with and I'm not even sure what I need.
For a specific example take a function F(n,p) that takes large inputs n for the first parameter and a prime p for the second. The prime is at most the square root of n. so even if n is about 10^12, the primes are only up to about a million. Suppose this function is recursive or otherwise difficult to calculate (expensive) and will be called over and over with the same inputs. What might be a good data structure to use to easily create and retrieve the possible values of F(n,p) so that I don't have to recalculate it every time? Total number of possible inputs should be 10 or 20 million at most.
Help please! and Thank you in advance!
You are talking about memoizing I presume. Trying to answer without a concrete exemple...
If you have to retrieve values from a small range (the 2nd parameter), say from 0 to 10^6, and that needs to be upper fast, and... you have enough memory, you could simply declare an array of int (long...), which basically stores the output values from all input.
To make things simple, let say the value 0 means there is no-value set
long *small = calloc(MAX, sizeof(*small)); // Calloc intializes to 0
then in a function that gives the value for a small range
if (small[ input ]) return small[ input ];
....calculate
small[ input ] = value;
+/-
+ Very fast
- Memory consumption takes the whole range, [ 0, MAX-1 ].
If you need to store arbitrary input, use the many libraries available (there are so many). Use a Set structure, that tells if the items exists or no.
if (set.exists( input )) return set.get( input );
....calculate
set.set( input, value );
+/-
+ less memory usage
+ still fast (said to be O(1))
- but, not as fast as a mere array
Add to this the hashed set (...), which are faster, as in terms of probabilities, values (hashes) are better distributed.
+/-
+ less memory usage than array
+ faster than a simple Set
- but, not as fast as a mere array
- use more memory than a simple Set

Most efficient way to store a big DNA sequence?

I want to pack a giant DNA sequence with an iOS app (about 3,000,000,000 base pairs). Each base pair can have a value A, C, T or G. Storing each base pair in one bytes would give a file of 3 GB, which is way too much. :)
Now I though of storing each base pair in two bits (four base pairs per octet), which gives a file of 750 MB. 750 MB is still way too much, even when compressed.
Are there any better file formats for efficiently storing giant base pairs on disk? In memory is not a problem as I read in chunks.
I think you'll have to use two bits per base pair, plus implement compression as described in this paper.
"DNA sequences... are not random; they contain
repeating sections, palindromes, and other features that
could be represented by fewer bits than is required to spell
out the complete sequence in binary...
With the proposed algorithm, sequence will be compressed by 75%
irrespective of the number of repeated or non-repeated
patterns within the sequence."
DNA Compression Using Hash Based Data Structure, International Journal of Information Technology and Knowledge Management
July-December 2010, Volume 2, No. 2, pp. 383-386.
Edit: There is a program called GenCompress which claims to compress DNA sequences efficiently:
http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/
Edit: See also this question on BioStar.
If you don't mind having a complex solution, take a look at this paper or this paper or even this one which is more detailed.
But I think you need to specify better what you're dealing with. Some specifics applications can lead do diferent storage. For example, the last paper I cited deals with lossy compression of DNA...
Base pairs always pair up, so you should only have to store one side of the strand. Now, I doubt that this works if there are certain mutations in the DNA (like a di-Thiamine bond) that cause the opposite strand to not be the exact opposite of the stored strand. Beyond that, I don't think you have many options other than to compress it somehow. But, then again, I'm not a bioinformatics guy, so there might be some pretty sophisticated ways to store a bunch of DNA in a small space. Another idea if it's an iOS app is just putting a reader on the device and reading the sequence from a web service.
Use a diff from a reference genome. From the size (3Gbp) that you post, it looks like you want to include a full human sequences. Since sequences don't differ too much from person to person, you should be able to compress massively by storing only a diff.
Could help a lot. Unless your goal is to store the reference sequence itself. Then you're stuck.
consider this, how many different combinations can you get? out of 4 (i think its about 16 )
actg = 1
atcg = 2
atgc = 3 and so on, so that
you can create an array like [1,2,3] then you can go one step further,
check if 1 is follow by 2, convert 12 to a, 13 = b and so on...
if I understand DNA a bit it means that you cannot get a certain value
as a must be match with c, and t with g or something like that which reduces your options,
so basically you can look for a sequence and give it a something you can also convert back...
You want to look into a 3d space-filling curve. A 3d sfc reduces the 3d complexity to a 1d complexity. It's a little bit like n octree or a r-tree. If you can store your full dna in a sfc you can look for similar tiles in the tree although a sfc is most likely to use with lossy compression. Maybe you can use a block-sorting algorithm like the bwt if you know the size of the tiles and then try an entropy compression like a huffman compression or a golomb code?
You can use the tools like MFCompress, Deliminate,Comrad.These tools provides entropy less than 2.That is for storing each symbol it will take less than 2 bits

finding a number appearing again among numbers stored in a file

Say, i have 10 billions of numbers stored in a file. How would i find the number that has already appeared once previously?
Well i can't just populate billions of number at a stretch in array and then keep a simple nested loop to check if the number has appeared previously.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N)
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.
Author Edit:
Below, #Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th, 512mb. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course one needs to take a bit "special" hash table:
Take a hashtable consisting of 2^32 bits. Since the question asks about 4-byte-integers, there are at most 2^32 different of them, i.e. one bit for each number. 2^32 bit = 512mb.
So now one has just to determine the location of the corresponding bit in the hashmap and set it. If one encounters a bit which already is set, the number occured in the sequence already.
The important question is whether you want to solve this problem efficiently, or whether you want accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use (as kbrimington suggested you use a standard hash table) the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass, determine the elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once, create a hashtable storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item iself, for example the least significant digits, let's say 20 digits (1M items).
After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
How about:
Sort input by using some algorith which allows only portion of input to be in RAM. Examples are there
Seek duplicates in output of 1st step -- you'll need space for just 2 elements of input in RAM at a time to detect repetitions.
Finding duplicates
Noting that its a 32bit integer means that you're going to have a large number of duplicates, since a 32 bit int can only represent 4.3ish billion different numbers and you have "10 billions".
If you were to use a tightly packed set you could represent whether all the possibilities are in 512 MB, which can easily fit into current RAM values. This as a start pretty easily allows you to recognise the fact if a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated you're getting into having a hashmap that contains only duplicates (using the first 500MB of the ram to tell efficiently IF it should be in the map or not). At a worst case scenario with a large spread you're not going to be able fit that into ram.
Another approach if the numbers will have an even amount of duplicates is to use a tightly packed array with 2-8 bits per value, taking about 1-4GB of RAM allowing you to count up to 255 occurrances of each number.
Its going to be a hack, but its doable.
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.
If possible range of numbers in file is not too large then you can use some bit array to indicate if some of the number in range appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With large range (like int) you need to read through the file every time. File layout may allow for more efficient lookups (i.e. binary search in case of sorted array).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4.294.967.296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur?Still, it would need 536.870.912 bytes at least. (512 MB.) It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
Had to do this a long time ago.
What i did... i sorted the numbers as much as i could (had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become..
[1,10], 12, 16, [20,50], 52, ...
Since in my case i had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets i had a very low memory useage
p.s. another way to store them
1, -9, 12, 16, 20, -30, 52,
when i had no numbers lower than zero
After that i applied various algorithms (described by other posters) here on the reduced data set
#include <stdio.h>
#include <stdlib.h>
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
unsigned x=0;
size_t *seen = malloc(1<<8*sizeof(unsigned)-3);
while (scanf("%u", &x)>0 && !BITOP(seen,x,&)) BITOP(seen,x,|=);
if (BITOP(seen,x,&)) printf("duplicate is %u\n", x);
else printf("no duplicate\n");
return 0;
}
This is a simple problem that can be solved very easily (several lines of code) and very fast (several minutes of execution) with the right tools
my personal approach would be in using MapReduce
MapReduce: Simplified Data Processing on Large Clusters
i'm sorry for not going into more details but once getting familiar with the concept of MapReduce it is going to be very clear on how to target the solution
basicly we are going to implement two simple functions
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]) if number of values > 1 than its a duplicate number
print the duplicates : number, line_index1, line_index2,...
again this approach can result in a very fast execution depending on how your MapReduce framework is set, highly scalable and very reliable, there are many diffrent implementations for MapReduce in many languages
there are several top companies presenting already built up cloud computing environments like Google, Microsoft azure, Amazon AWS, ...
or you can build your own and set a cluster with any providers offering virtual computing environments paying very low costs by the hour
good luck :)
Another more simple approach could be in using bloom filters
AdamT
Implement a BitArray such that ith index of this array will correspond to the numbers 8*i +1 to 8*(i+1) -1. ie first bit of ith number is 1 if we already had seen 8*i+1. Second bit of ith number is 1 if we already have seen 8*i + 2 and so on.
Initialize this bit array with size Integer.Max/8 and whenever you saw a number k, Set the k%8 bit of k/8 index as 1 if this bit is already 1 means you have seen this number already.

large test data for knapsack problem

i am researcher student. I am searching large data for knapsack problem. I wanted test my algorithm for knapsack problem. But i couldn't find large data. I need data has 1000 item and capacity is no matter. The point is item as much as huge it's good for my algorithm. Is there any huge data available in internet. Does anybody know please guys i need urgent.
You can quite easily generate your own data. Just use a random number generator and generate lots and lots of values. To test that your algorithm gives the correct results, compare it to the results from another known working algorithm.
I have the same requirement.
Obviously only Brute force will give the optimal answer and that won't work for large problems.
However we could pitch our algorithms against each other...
To be clear, my algorithm works for 0-1 problems (i.e. 0 or 1 of each item), Integer or decimal data.
I also have a version that works for 2 dimensions (e.g. Volume and Weight vs. Value).
My file reader uses a simple CSV format (Item-name, weight, value):
X229257,9,286
X509192,11,272
X847469,5,184
X457095,4,88
etc....
If I recall correctly, I've tested mine on 1000 items too.
Regards.
PS:
I ran my algorithm again the problem on Rosette Code that Mark highlighted (thank you). I got the same result but my solution is much more scalable than the dynamic programming / LP solutions and will work on much bigger problems

Shuffling biased random numbers

While thinking about this question and conversing with the participants, the idea came up that shuffling a finite set of clearly biased random numbers makes them random because you don't know the order in which they were chosen. Is this true and if so can someone point to some resources?
EDIT: I think I might have been a little unclear. Suppose a bad random numbers generator. Take n values. These are biased(the rng is bad). Is there a way through shuffling to make the output of the rng over multiple trials statistically match the output of a known good rng?
False.
There is an easy test: Assume the bias in the original set creation algorithm is "creates sets whose arithmetic average is significantly lower than expected average". Obviously, shuffling the result of the algorithm will not change the averages and thus not remove the bias.
Also, regarding your clarification: How would you shuffle the set? Using the same bad output from the bad RNG that created the set in the first place? Or using a better RNG? Which raises the question why you don't use that directly.
It's not true. In the other question the problem is to select 30 random numbers in [1..9] with a sum of 200. After choosing about on average 20 of them randomly, you reach a point where you can't select nines anymore because this would make the total sum go over 200. Of the remaining 10 numbers, most will be ones and twos. So in the end, ones and twos are very overrepresented in the selected numbers. Shuffling doesn't change that. But it's not clear how the random distribution really should look like, so one could say this is as good a solution as any.
In general, if your "random" numbers will be biased to, say, low numbers, they will be biased that way no matter the ordering.
Just shuffling a set of numbers of already random numbers won't do anything to the probability distribution of course. That would mean false. Perhaps I misunderstand your question though?
I would say false, with a caveat:
I think there is random, and then there is 'random-enough'. For most applications that I have needed to work on, 'random-enough' was more than enough, i.e. picking a 'random' ad to display on a page from a list of 300 or so that have paid to be placed on that site.
I am sure a mathematician could prove my very basic 'random' selection criteria is not truly random at all, but in fact is predictable - for my clients, and for the users, nobody cares.
On the other hand if I was writing a video game to be used in Las Vegas where large amounts of money was at hand I'd define random differently (and may have a hard time coming up with truly random).
False
The set is finite, suppose consists of n numbers. What happens if you choose n+1 numbers? Let's also consider a basic random function as implemented in many languages which gives you a random number in [0,1). However, this number is limited to three digits after the decimal giving you a set of 1000 possible numbers (0.000 - 0.999). However in most cases you will not need to use all these 1000 numbers so this amount of randomness is more than enough.
However for some uses, you will need a better random generator than this. So it all comes down to exactly how many random numbers you are going to need, and how random you need them to be.
Addition after reading original question: in the case that you have some sort of limitation (such as in the original question in which each set of selected numbers must sum up to a certain N) you are not really selected random numbers per se, but rather choosing numbers in a random order from a given set (specifically, a permutation of numbers summing up to N).
Addition to edit: Suppose your bad number generator generated the sequence (1,1,1,2,2,2). Does the permutation (1,2,2,1,1,2) satisfy your definition of random?
Completely and utterly untrue: Shuffling doesn't remove a bias, it just conceals it from the casual observer. It's like removing your dog's fondly-laid present from your carpet by just pushing under the sofa - you really haven't solved the problem, you've just made it less conspicuous. Anyone with a nose knows that there is still a problem that needs removing.
The randomness must be applied evenly over the whole range, so here's one way (off the top of my head, lots of assumptions, yadda yadda. The point is the approach, not the code - start with everything even, then introduce your randomness in a consistent fashion until you're done. The only bias now is dependent on the values chosen for 'target' and 'numberofnumbers', which is part of the question.)
target = 200
numberofnumbers = 30
numbers = array();
for (i=0; i<numberofnumbers; i++)
numbers[i] = 9
while (sum(numbers)>target)
numbers[random(numberofnumbers)]--
False. Consider a bad random number generator producing only zeros (I said it was BAD :-) No amount of shuffling the zeros would change any property of that sequence.

Resources