I have arrays of 1024 bytes (8192 bits) which are mostly zero.
Between 0.01% and 10% of bits will be set (random, no pattern).
How could these be compressed, given the lack of structure and the relatively small size?
(My first thought was to store the distances between set bits. I need 13 bits for each distance, but at the worst case of 10% occupancy (~819 set bits) this needs 13 * 819 / 8 ≈ 1331 bytes, which is not an improvement.)
This is for ultra-low bandwidth comms, so every byte matters.
I've dealt with a similar problem in depth, but my sets are much bigger (30 million possible values with between 1 and 30 million elements in each set), so they gain much more from compression and the compression metadata is insignificant compared to the size of the data. I have never gone down to squeezing things into units smaller than uint16_t, so what I write below might not apply if you start chopping 13-bit values into pieces. It feels like it should work, but caveat emptor.
What I've found works is to employ several strategies depending on the particular data at hand. The good news is that the count of elements in each set is a very good indicator of which compression strategy will work best for that set, so the only metadata you need is the element count. In my data format the first and only metadata value (I'll be unspecific and just call it a "value"; you can squeeze things into bytes, 16-bit values or 13-bit values however you like) is the count of elements in the set; the rest is just the encoding of the set elements.
The strategies are:
Strategy 1: if very few elements are in the set, you can't do better than an array that says "1, 4711, 8140", so in this case the data is encoded as [3, 1, 4711, 8140].
Strategy 2: if almost all elements are in the set, you can just keep track of the elements that aren't. For example [8190, 17, 42].
Strategy 3: if around half of the elements are in the set, you pretty much can't do better than a bitmap, so you get [4000, {bitmap}]. This is the only case where your data ends up being longer than strictly uncompressed.
Strategy 4: if more than "a few" but many fewer than "around half" of the elements are set, I found another strategy. Divide the bits of the possible values in half. Let's say we have 2^16 possible values (it's easier to describe; it should probably work for 2^13 too). The values are divided into 256 ranges, each range covering 256 possible values. We then have an array of 256 bytes, each of which says how many values fall into its range (so byte 0 tells us how many elements are in [0,255], byte 1 gives us [256,511], etc.); immediately after follow arrays with the values in each range mod 256. The trick here is that while every element encoded as an array (strategy 1) would take 2 bytes, in this scheme each element takes only 1 byte plus 256 static bytes for the per-range counts. This means that as soon as we have more than 256 elements in the set, switching from strategy 1 to strategy 4 saves space. (A code sketch of this encoding follows the list.)
Strategy 5: strategy 4 can be refined (probably meaningless if your data is random as you mention, but my data sometimes had more patterns, so it worked for me). Since we still need 8 bits per element in the previous encoding, as soon as a range's sub-array goes over 32 elements (32 bytes, the size of a 256-bit bitmap covering that range), we can store it as a bitmap instead. This is also a good breakpoint for switching from strategies 4/5 to strategy 3: if all the sub-arrays in this strategy would just be bitmaps, then we should simply use strategy 3 (it's more complicated than that, but the breakpoints between strategies can be precomputed accurately enough that you'll end up picking the most efficient strategy nearly every time).
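Here is a minimal sketch of the strategy 4 encoding, assuming 16-bit values and a sorted input list (this is my illustration, not the poster's actual code; encode_strategy4 is a hypothetical name):

#include <stdint.h>
#include <stddef.h>

/* Strategy 4 sketch: encode a sorted list of 16-bit set elements as 256
   per-range counts followed by the low byte of each element, grouped by range.
   Output size = 256 + n bytes. Assumes no range holds more than 255 elements
   (strategy 5 would switch such a range to a bitmap anyway). */
size_t encode_strategy4(const uint16_t *elems, size_t n, uint8_t *out)
{
    uint8_t *counts = out;            /* counts[r] = number of elements in range r */
    uint8_t *tail   = out + 256;      /* low bytes, grouped by range */

    for (int r = 0; r < 256; r++)
        counts[r] = 0;

    for (size_t i = 0; i < n; i++) {
        counts[elems[i] >> 8]++;      /* high byte selects the range */
        *tail++ = (uint8_t)elems[i];  /* low byte is stored explicitly */
    }
    return (size_t)(tail - out);
}

Decoding walks the 256 counts to know how many low bytes belong to each range and reconstructs each element as (range << 8) | low_byte.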
I have only vaguely tried saving deltas between numbers in the set. Quick experiments showed that they weren't really much more efficient than the strategies I mentioned above, had unpredictable degenerate cases, but most importantly, the application I work with really likes to not have to deserialise its data, just use it raw straight from disk (mmap).
Here's my problem. I have a set of 20 objects stored in memory as an array. I want to store a second piece of data that defines an order for the objects to be displayed.
The simplest way to store the order is as an array of 20 unsigned integers, each of which is 5 bits (i.e. 0-31). The position of the object in the output list would be defined by the number stored in this array at the same index as the object in its array.
But.. I know from statistics that there are only 20! (that's 20 factorial) ways to arrange these objects.
This could be stored in 62 bits, since 2^62 > 20!
I'm currently using 100 bits to store the same information.
So my question is this: Is there a space efficient way to store ORDER as a sequence of bits?
I have some additional constraints as well. This will run on an embedded device, so I can't use any huge arrays or high-level math functions. I would need a simple iterative method.
Edit: Some clarification on why this is necessary. Say for example the objects are pictures, and they're stored in ROM (i.e. they can't be moved around). Now let's say I want to keep track of what order to display the images in, and I'm going to update that order every second. My device has 1k of storage with wear leveling, but each bit in the storage can only be written 1000 times before it becomes unreliable. If I need 1kb to store the order, then my device will only work for 1000 seconds. If I need 0.1kb, it will work for 10k seconds, and so on. Thus the device's longevity will be inversely proportional to the number of bits I need to update every cycle.
You can store the order in a single 64-bit value x:
For the first choice, 20 possibilities, compute the index as x % 20 and update x as x /= 20,
For the next choice, only 19 possibilities, compute x % 19 and update x as x /= 19.
Continue this process 17 more times and you are done.
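A minimal sketch of this mixed-radix packing in C, under the assumption that the order is given as a Lehmer code (choices[i] = index of the chosen object among those not yet placed); pack_order and unpack_order are hypothetical names:

#include <stdint.h>

#define N 20

/* Pack a Lehmer code (choices[i] in [0, N-i)) into one 64-bit value.
   20! < 2^62, so the result always fits in a uint64_t. */
uint64_t pack_order(const uint8_t choices[N])
{
    uint64_t x = 0;
    for (int i = N - 1; i >= 0; i--)          /* Horner's rule, last choice first */
        x = x * (uint64_t)(N - i) + choices[i];
    return x;
}

/* Unpack: the first choice has 20 possibilities, the next 19, and so on. */
void unpack_order(uint64_t x, uint8_t choices[N])
{
    for (int i = 0; i < N; i++) {
        choices[i] = (uint8_t)(x % (uint64_t)(N - i));
        x /= (uint64_t)(N - i);
    }
}

Converting between an actual permutation and its Lehmer code is an extra O(N^2) pass over at most 20 items, which is cheap even on an embedded device, and the packed value needs only 8 bytes of wear-levelled storage per update.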
I think I've found a partial solution to my own question. Assuming I start at the left side of the order array, for every move right there are fewer remaining possibilities for the position value. The number of possibilities is 20, 19, 18, etc. I can take advantage of this by populating the order array in a relative fashion. The first index will place a value in the order array. There are 20 possibilities, so this takes 5 bits. Placing the next value, there are only 19 positions available (still 5 bits). Proceeding through the whole array, the bits required are 5,5,5,5,4,4,4,4,4,4,4,4,3,3,3,3,2,2,1,0. So that gets me down to 69 bits, much better.
There's still some "wasted" precision in each of the values, since for example the first position can store 32 possible values even though there are only 20. I'm not sure how to deal with this, but I think it will have something to do with carrying a remainder from one calculation to the next.
I have found dozens of explanations of the basic idea of LogLog algorithms, but they all lack details about how the splitting of the hash function result works. I mean, using a single hash function is not precise, while using many functions is too expensive. How do they overcome the problem with a single hash function?
This answer is the best explanation I've found, but it still doesn't make sense to me:
They used one hash but divided it into two parts. One is called a bucket (the total number of buckets is 2^x) and the other is basically the same as our hash. It was hard for me to get what was going on, so I will give an example. Assume you have two elements and your hash function, which gives values from 0 to 2^10, produced 2 values: 344 and 387. You decided to have 16 buckets. So you have:
0101 011000 -> bucket 5 will store 1
0110 000011 -> bucket 6 will store 4
Could you explain the example above, please? You have 16 buckets because the header is 4 bits long, right? But how can you have 16 buckets with only two hashes? Do we estimate only the buckets? So the first bucket is of size 1 and the second of size 4, right? How do you merge the results?
Hash function splitting: our goal is to use many hyperloglog structures (as an example, let's say 16 hyperloglog structures, each of them using a 64-bit hash function) instead of one, in order to reduce the estimation error. An intuitive approach might be to process each of the inputs in each of these hyperloglog structures. However, in that case we would need to make sure that the hyperloglogs are independent of each other, meaning we would need a set of 16 hash functions which are independent of each other - and that's hard to find!
So we use an alternative approach. Instead of using a family of 16 independent 64-bit hash functions, we will use 16 separate hyperloglog structures, each using just a 60-bit hash function. How do we do that? Easy: we take our 64-bit hash function and just ignore the first 4 bits, producing a 60-bit hash function. What do we do with the first 4 bits? We use them to choose one of 16 "buckets" (each "bucket" is just a hyperloglog structure; note that 2^4 = 16 buckets). Now each of the inputs is assigned to exactly one of the 16 buckets, where a 60-bit hash function is used to calculate the hyperloglog value. So we have 16 hyperloglog structures, each using a 60-bit hash function. Assuming that we chose a decent hash function (meaning that the first 4 bits are uniformly distributed and that they aren't correlated with the remaining 60 bits), we now have 16 independent hyperloglog structures. We take a harmonic mean of their 16 estimates to get a much less error-prone estimate of the cardinality.
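A sketch of just the splitting step in C, under the answer's framing of 16 sub-structures with one register each (real hyperloglog implementations keep many registers per structure; hash64 here is a stand-in mixer, not a specific library function):

#include <stdint.h>

#define BUCKET_BITS 4                         /* 2^4 = 16 sub-estimators */
#define NUM_BUCKETS (1u << BUCKET_BITS)

static uint8_t max_rank[NUM_BUCKETS];         /* one register per bucket (simplified) */

static uint64_t hash64(uint64_t x)            /* splitmix64 finaliser as a stand-in hash */
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e9b5ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

void hll_add(uint64_t key)
{
    uint64_t h      = hash64(key);
    uint32_t bucket = (uint32_t)(h >> 60);    /* top 4 bits pick the bucket */
    uint64_t rest   = h << BUCKET_BITS;       /* the remaining 60 bits */
    /* rank = number of leading zeros of the 60-bit remainder, plus one */
    uint8_t  rank   = (uint8_t)(rest ? __builtin_clzll(rest) + 1 : 61);
    if (rank > max_rank[bucket])
        max_rank[bucket] = rank;
}

The estimate is then combined from the 16 max_rank values; the combining formula (a harmonic mean with a bias-correction constant) is what the HyperLogLog paper derives.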
Hope that clears it up!
The original HyperLogLog paper mentioned by OronNavon is quite theoretical. If you are looking for an explanation of the cardinality estimator without the need for complex analysis, you could have a look at the paper I am currently working on: http://oertl.github.io/hyperloglog-sketch-estimation-paper. It also presents a generalization of the original estimator that does not require any special handling for small or large cardinalities.
The raw data can be described as a fixed number of columns (on the order of a few thousand) and a large (on the order of billions) and variable number of rows. Each cell is a bit. The desired query would be something like: find all rows where bits 12, 329, 2912 and 3020 are set. Something like
for (i = 0; i < max_ents; i++)
    if ((entry[i].data & mask) == mask)   /* & binds looser than ==, so parenthesise */
        add_result(i);
In a typical case not many (e.g. 5%) bits are set in any particular row, but that's not guaranteed, there's a degree of variability.
On a higher level, the data describes a bitwise fingerprint of entries, and the data itself is a kind of search index, so maximal speed is desired. What algorithm would be good for this kind of search? At the moment I'm thinking of having a separate sparse (packed/compressed) bit vector for each column. I doubt it's optimal though.
This looks similar to "text search", in particular to that of intersecting reverse indexes. Let me go through the simplest algorithm for doing that.
First, you should create sorted lists of numbers where each bit is set. E.g., for the table of numbers:
Row 1 -> 10110
Row 2 -> 00111
Row 3 -> 11110
Row 4 -> 00011
Row 5 -> 01010
Row 6 -> 10101
you can create a reverse index:
Bit 0 is set in -> 2, 4, 6
Bit 1 is set in -> 1, 2, 3, 4, 5
Bit 2 is set in -> 1, 2, 3, 6
etc.
Now, for a query (let's say bits 0 & 1 & 2), you just have to intersect these sorted lists using a merge-sort-like algorithm. You can do this by first intersecting lists 0 and 1, giving you {2, 4}, and then intersecting that with list 2, giving you {2}.
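The two-list intersection step looks roughly like this (a sketch with plain int arrays standing in for the postings lists):

#include <stddef.h>

/* Intersect two sorted row-id lists; returns the number of ids written to out.
   out may alias a, which is handy when folding in one list after another. */
size_t intersect(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if      (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { out[k++] = a[i]; i++; j++; }   /* id present in both lists */
    }
    return k;
}

Folding in each query bit's list in turn reproduces the {2, 4} -> {2} steps above.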
Several optimizations are possible, including, but not limited to, compressing these lists, since the difference between consecutive items is typically small, doing more efficient merging etc.
But to save yourself the hassle, why not reuse work that others have already done? ;) You can readily use (it should take less than a day of coding) any open-source text search engine (I suggest Lucene) to perform this task, and it will contain several optimizations that people have built up over a long time. (Hint: treat each row as a "doc" in text-search parlance, and each bit as a "token".)
Edit (adding some of the algorithms by request of the question author):
a) Compression: One of the most effective things you can do is compress the postings lists (the sorted list corresponding to each bit position). Most algorithms take differences of consecutive entries and then compress them with some encoding (Gamma coding and varint encoding, to name a few). This shrinks the inverted list so that it either consumes less file space (thus less file I/O) or uses less memory for encoding the same set of numbers. In your case, I estimate that each posting list will contain ~5% * 1e9 = 5e7 elements. If they are uniformly distributed across 0 - 1e9, the gaps should be around 20; even if we generously overestimate and say each gap takes ~8 bytes on average, that adds up to roughly 400-500MB per list. So for 1000 lists you will need on the order of 500GB of space, which definitely has to live on disk. This in turn means that you should go for as good a compression algorithm as possible, since better compression means less file I/O and you are going to be I/O bound.
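For illustration, a typical delta + varint encoder for such a list looks something like this (a sketch; production engines use fancier codecs such as Gamma codes or block-wise varints):

#include <stdint.h>
#include <stddef.h>

/* Encode a sorted postings list as gaps, each gap written as a varint
   (7 payload bits per byte, high bit = "more bytes follow").
   Gaps around 20 then take a single byte each. Returns bytes written. */
size_t encode_postings(const uint32_t *ids, size_t n, uint8_t *out)
{
    size_t len = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t gap = ids[i] - prev;         /* delta against the previous id */
        prev = ids[i];
        while (gap >= 0x80) {
            out[len++] = (uint8_t)(gap | 0x80);
            gap >>= 7;
        }
        out[len++] = (uint8_t)gap;
    }
    return len;
}

At ~1 byte per gap this brings the 5e7-element list down to roughly 50MB, an order of magnitude below the generous uncompressed estimate above.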
b) Intersection Order: You should always intersect lists starting from the smallest, since that is guaranteed to create the smallest intermediate lists, which means fewer comparisons later with the techniques shown below.
c) Merge algorithm: Since your index almost certainly spills to disk, there is probably not much you can do at an algorithmic level. But one idea that is used is a binary-search-based procedure for merging two lists instead of the straightforward linear merge when one of the lists is much smaller than the other (this leads to O(N*log(M)) complexity instead of O(N+M), where M >> N). But for file-based indices this is almost never a good idea, since binary search makes many random accesses, which can completely wreck your disk latency, whereas the linear merge procedure is strictly sequential.
d) Skip Lists: This is another great data structure used to store sorted postings lists, which can also then support efficient "binary search" mentioned before. The key idea here is that the upper levels of the skip list can be kept in memory, and this can greatly speed up the last stages of your intersection algorithm, when you can simply search through the in-memory upper levels to get to a disk offset, and then do disk access from there. There is a point when binary search + skiplist based merge becomes more efficient than linear merge and can be found by experimentation.
e) Caching: No-brainer. If some of your terms occur frequently, cache them in memory so that you can get at them more efficiently in the future. Note that the cache can also be, e.g., a faster flash-based disk, which can give you better throughput as well as cache a significant number of the more frequent terms (32GB of memory can only hold ~64 of these lists, whereas a 256GB flash disk can hold ~512).
I am working on a simulation of poker and now I have to rank hands effectively:
Every hand is a combination of 5 cards and is represented as an uint64_t.
Every bit from 0 (Ace of Spades), 1 (Ace of Hearts) to 51 (Two of Clubs) indicates if the corresponding card is part (bit == 1) or isn't part (bit == 0) of the hand. The bits from 52 to 63 are always set to zero and don't hold any information.
I already know how I theoretically could generate a table, so that every valid hand can be mapped to a rank (represented as uint16_t) between 1 (2,3,4,5,7 - not all of the same suit) and 7462 (Royal Flush), and all the others to the rank zero.
So a naive lookup table (with the integer value of the hand as index) would have an enormous size of
2 bytes * 2^52 = 9.007 PB.
Most of this memory would be filled with zeros, because almost all uint64_t's from 0 to 2^52-1 are invalid hands and therefore have a rank equal to zero.
The valuable data occupies only
2 bytes * 52!/(47!*5!) = 5.198 MB.
What method can I use for the mapping so that I only have to save the ranks of the valid hands plus some overhead (max. 100 MB), and still don't have to do any expensive search...
It should be as fast as possible!
If you have any other ideas, you're welcome! ;)
You need only a table of size 13^5 * 2, with the extra factor of two saying whether all the cards are of the same suit. If for some reason 'hearts' outranks 'diamonds', you still need at most a table of size 13^6, with the last piece of information encoded as '0 = no flush, 1 = all spades, 2 = all hearts, etc.'.
A hash table is probably also a good and fast approach -- Creating a table from nCk(52,5) combinations doesn't take much time (compared to all possible hands). One would, however, need to store 65 bits of information for each entry to store both the key (52 bits) and the rank (13 bits).
To speed up evaluation of a hand, one first rules out illegal combinations from the mask:
if (popcount(mask) != 5) reject; afterwards one can use enough bits from e.g. crc32(mask), which has instruction-level support on the i7 architecture at least.
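As a sketch of that path, assuming a hypothetical open-addressed table built offline over all C(52,5) = 2,598,960 hands (the CRC32 intrinsic needs SSE4.2, e.g. compile with -msse4.2; the table sizes and names here are my own):

#include <stdint.h>
#include <nmmintrin.h>               /* _mm_crc32_u64 */

#define TABLE_BITS 22                /* 2^22 slots > 2,598,960 hands */
#define TABLE_SIZE (1u << TABLE_BITS)

/* Filled offline: key_of[] holds the 52-bit hand mask, rank_of[] its rank
   (1..7462); rank 0 marks an empty slot. Roughly 40 MB in total. */
static uint64_t key_of[TABLE_SIZE];
static uint16_t rank_of[TABLE_SIZE];

uint16_t hand_rank(uint64_t hand)
{
    if (__builtin_popcountll(hand) != 5)        /* not exactly 5 cards: illegal */
        return 0;

    uint32_t h = (uint32_t)_mm_crc32_u64(0, hand) & (TABLE_SIZE - 1);
    while (key_of[h] != hand) {                 /* linear probing */
        if (rank_of[h] == 0)
            return 0;                           /* empty slot: hand not in table */
        h = (h + 1) & (TABLE_SIZE - 1);
    }
    return rank_of[h];
}

At a load factor of roughly 0.6 the probe sequence stays very short on average.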
If I understand your scheme correctly, you only need to know that the Hamming weight of a particular hand is exactly 5 for it to be a valid hand. See Calculating Hamming Weight in O(1) for information on how to calculate the Hamming weight.
From there, it seems you could probably work out the rest on your own. Personally, I'd want to store the result in some persistent memory (if it's available on your platform of choice) so that subsequent runs are quicker since they don't need to generate the index table.
This is a good source: Cactus Kev's poker hand evaluator.
For a hand you can take advantage of the fact that there are only 4 suits and 13 ranks:
4 bits for the rank (0-12) and 2 bits for the suit,
so 6 bits * 5 cards is just 30 bits.
Call it 4 bytes.
There are only 2,598,960 hands,
so the total size is a little under 10 MB.
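The answer doesn't spell out how to index those 2,598,960 hands; one standard way (my addition, not part of the answer) is the combinatorial number system, applied directly to the question's 52-bit mask:

#include <stdint.h>

/* n choose k for the small values needed here */
static uint32_t choose(uint32_t n, uint32_t k)
{
    if (k > n) return 0;
    uint64_t r = 1;
    for (uint32_t i = 1; i <= k; i++)
        r = r * (n - k + i) / i;     /* exact at every step */
    return (uint32_t)r;
}

/* Map a mask with exactly 5 set bits to a dense index in [0, 2598960). */
uint32_t hand_index(uint64_t hand)
{
    uint32_t idx = 0, k = 1;
    while (hand) {
        uint32_t card = (uint32_t)__builtin_ctzll(hand);  /* lowest set card */
        idx += choose(card, k++);
        hand &= hand - 1;                                 /* clear that bit  */
    }
    return idx;
}

hand_index is a bijection onto 0..2598959, so a flat uint16_t rank table indexed by it needs just under 5.2 MB, matching the "valuable data" figure in the question.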
A simple implementation that comes to mind would be to change your scheme to a 5-digit number in base 52. The resulting table to hold all of these values would still be larger than necessary, but very simple to implement and it would easily fit into RAM on modern computers.
edit: You could also cut down even more by only storing the rank of each card and an additional flag (e.g., lowest bit) to specify if all cards are of the same suit (i.e., flush is possible). This would then be in base 13 + one bit for the ranking representation. You would presumably then need to store the specific suits of the cards separately to reconstruct the exact hand for display and such.
I would represent your hand in a different way:
There are only 4 suits (2 bits) and only 13 ranks (4 bits), for a total of 6 bits * 5 = 30, so we fit into a 32-bit int. We can also force this to always be sorted as per your ordering:
[suit 0][suit 1][suit 2][suit 3][suit 4][value 0][value 1][value 2][value 3][value 4]
Then I would use a separate hash for:
consecutive values (very small) [mask off the suits]
1 or more multiples (pair, 2 pair, full house) [mask off the suits]
suits that are all the same (very small) [mask off the values]
Then use the 3 hashes to calculate your rankings
At 5MB you will likely have enough cache misses that a bit of math and three small lookups will be faster.
Say I have 10 billion numbers stored in a file. How would I find a number that has already appeared previously?
Well, I can't just load billions of numbers into an array at once and then use a simple nested loop to check whether a number has appeared before.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N)
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number's value. Once you try to insert a value that is already present, you have found your duplicate.
Author Edit:
Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound on the number of distinct values before a duplicate is guaranteed; that is 2^32, or approximately 4 billion. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below reduces memory consumption to one bit per possible value, i.e. 512 MB. On many machines this computation is then possible without resorting to either a persistent hash or the less-performant sort-first strategy.
Longer numbers or double-precision numbers are less effective scenarios for the bit-array strategy, however.
Phimuemue Edit:
Of course one needs to use a somewhat "special" hash table:
Take a hash table consisting of 2^32 bits. Since the question asks about 4-byte integers, there are at most 2^32 different values, i.e. one bit for each possible number. 2^32 bits = 512 MB.
So now one just has to determine the location of the corresponding bit in this bitmap and set it. If one encounters a bit which is already set, the number has occurred in the sequence already.
The important question is whether you want to solve this problem efficiently, or whether you want to solve it exactly.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of a very grimy and unstable solution, there is no hope of solving this without storing a significant fraction of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to give exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use the filter to, eh, filter out elements which you can't possibly have seen before and, on a second pass with a standard hash table (as kbrimington suggested), determine the elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once, creating a hashtable that stores the number of times you encounter each item. But wait! Instead of using the item itself as a key, use a hash of the item, for example its least significant bits, say 20 bits (about 1M buckets).
After the first pass, all items whose counter is > 1 may point to a duplicated item, or be a false positive. Rescan the file, considering only items that may lead to a duplicate (by looking each item up in table one), and build a new hashtable using the real values as keys this time, storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
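A rough sketch of the two passes, assuming 32-bit numbers in a hypothetical binary file numbers.bin (the exact second-stage hash table is left out; pass 2 here only narrows down the candidates):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    enum { BUCKETS = 1 << 20 };                     /* 1M saturating counters */
    uint8_t *count = calloc(BUCKETS, 1);
    FILE *f = fopen("numbers.bin", "rb");
    uint32_t n;

    if (!count || !f) return 1;

    while (fread(&n, sizeof n, 1, f) == 1) {        /* pass 1: bucket counts */
        uint32_t b = n & (BUCKETS - 1);             /* low 20 bits as the key */
        if (count[b] < 255) count[b]++;
    }

    rewind(f);
    while (fread(&n, sizeof n, 1, f) == 1)          /* pass 2: candidates only */
        if (count[n & (BUCKETS - 1)] > 1)
            printf("candidate duplicate: %u\n", n); /* verify with an exact table */

    fclose(f);
    free(count);
    return 0;
}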
How about:
Sort the input using some algorithm that allows only a portion of the input to be in RAM at a time (external merge sort, for example).
Seek duplicates in the output of the 1st step -- you'll need space for just 2 elements of the input in RAM at a time to detect repetitions.
Finding duplicates
Noting that it's a 32-bit integer means that you're going to have a large number of duplicates, since a 32-bit int can only represent about 4.3 billion different numbers and you have "10 billion".
If you were to use a tightly packed set, you could represent all the possibilities in 512 MB, which can easily fit into current RAM sizes. As a start this pretty easily lets you recognise whether a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated, you get into having a hashmap that contains only duplicates (using the first 512 MB of RAM to tell efficiently IF a number should be in the map or not). In a worst-case scenario with a large spread you're not going to be able to fit that into RAM.
Another approach, if the numbers have a reasonably even spread of duplicates, is to use a tightly packed array with 2-8 bits per value, taking about 1-4 GB of RAM and allowing you to count up to 255 occurrences of each number (a packing sketch follows below).
It's going to be a hack, but it's doable.
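For the 2-bit end of that range, the packing could look like this (a sketch; counter values saturate at 3, and the full 32-bit range then costs 1 GiB):

#include <stdint.h>
#include <stdlib.h>

static uint8_t *counters;                     /* 4 two-bit counters per byte */

static int counters_init(void)
{
    counters = calloc((size_t)1 << 30, 1);    /* 2^32 values / 4 per byte = 1 GiB */
    return counters != NULL;
}

static unsigned counter_get(uint32_t n)
{
    return (counters[n >> 2] >> ((n & 3) * 2)) & 3;
}

static void counter_bump(uint32_t n)          /* saturating increment */
{
    if (counter_get(n) < 3)
        counters[n >> 2] += (uint8_t)(1u << ((n & 3) * 2));
}

With 8 bits per value instead, the same layout simplifies to one byte per number (4 GiB) and counts up to 255 as described above.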
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it in a hash set; if a number is already present when you try to insert it, you have found a repeat.
If the possible range of the numbers in the file is not too large, then you can use a bit array to indicate whether each number in the range has appeared.
If the range of the numbers is small enough, you can use a bit field to store whether each one is in there - initialize it with a single scan through the file. It takes one bit per possible number.
With a large range (like int) you would otherwise need to read through the file for every check. The file layout may allow for more efficient lookups (e.g. binary search in the case of a sorted array).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking at 32-bit numbers, you would need a single array of 4,294,967,296 bits. You start by setting all bits to 0, and every number in the file sets a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often each one occurs? Still, it would need at least 536,870,912 bytes (512 MB). It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there are hundreds of ways to solve it this way.
Had to do this a long time ago.
What I did... I sorted the numbers as much as I could (I had a time constraint) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become
[1,10], 12, 16, [20,50], 52, ...
Since in my case I had hundreds of numbers that were very "close" (differing by 1), out of a few million sets I had very low memory usage.
p.s. another way to store them (when I had no numbers lower than zero):
1, -9, 12, 16, 20, -30, 52, ...
After that I applied the various algorithms described by other posters here on the reduced data set.
#include <stdio.h>
#include <stdlib.h>

/* Macro is overly general but I left it 'cos it's convenient:
   BITOP(a,b,&) tests bit b of array a, BITOP(a,b,|=) sets it. */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))

int main(void)
{
    unsigned x = 0;
    int r = 0;
    /* one bit per possible unsigned value: 2^32 bits = 512 MB, zero-initialised */
    size_t *seen = calloc((size_t)1 << (8*sizeof(unsigned) - 3), 1);
    if (!seen) return 1;

    while ((r = scanf("%u", &x)) > 0 && !BITOP(seen, x, &))
        BITOP(seen, x, |=);

    if (r > 0) printf("duplicate is %u\n", x);   /* loop stopped on an already-set bit */
    else       printf("no duplicate\n");         /* loop stopped at end of input */
    return 0;
}
This is a simple problem that can be solved very easily (a few lines of code) and very fast (a few minutes of execution) with the right tools.
My personal approach would be to use MapReduce:
MapReduce: Simplified Data Processing on Large Clusters
I'm sorry for not going into more detail, but once you get familiar with the concept of MapReduce it will be very clear how to target the solution.
Basically we are going to implement two simple functions:
Map(key, value)
Reduce(key, values[])
So, all in all:
open the file and iterate through the data
for each number -> Map(number, line_index)
in the Reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]), if the number of values > 1 then it's a duplicate number
print the duplicates: number, line_index1, line_index2, ...
Again, this approach can result in very fast execution depending on how your MapReduce framework is set up; it is highly scalable and very reliable, and there are many different implementations of MapReduce in many languages.
There are several top companies offering ready-built cloud computing environments, like Google, Microsoft Azure, Amazon AWS, ...
Or you can build your own and set up a cluster with any provider offering virtual computing environments, paying very low costs by the hour.
Good luck :)
Another, simpler approach could be to use Bloom filters.
Implement a bit array such that byte i of the array corresponds to the numbers 8*i to 8*i+7, i.e. bit j of byte i is 1 if we have already seen the number 8*i+j.
Initialize this bit array with a size of 2^32 / 8 bytes, and whenever you see a number k, set bit k%8 of byte k/8. If that bit is already 1, you have seen the number before.