Optimal way of storing compressed sets - C

As the title says, I am searching for the optimal way of storing sets in memory. I am only interested in sets of bytes (arrays of integers from 0 to 255 where order is not important). It is not required that encoding/decoding be fast; the only necessary thing is that sets should require as little memory as possible.
The first method I came up with is to allocate an array of 256 bits (32 bytes) for each set, where the bit at position n tells whether n is in the set or not. The problem with this approach is that it requires the same amount of memory even if the set is mostly empty (has only a few elements).
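For reference, a minimal sketch of this bitmap representation in C (the type and function names are just illustrative):

#include <stdint.h>
#include <string.h>

typedef struct { uint8_t bits[32]; } ByteSet;   /* 256 bits, one per possible value */

static void set_clear(ByteSet *s)                     { memset(s->bits, 0, sizeof s->bits); }
static void set_add(ByteSet *s, uint8_t v)            { s->bits[v >> 3] |= (uint8_t)(1u << (v & 7)); }
static int  set_contains(const ByteSet *s, uint8_t v) { return (s->bits[v >> 3] >> (v & 7)) & 1; }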
The second approach I tried is to store sets as regular arrays. So, if a set contains n elements, it requires n + 1 bytes: the first byte represents the number of elements and the other bytes represent the elements. But, as we know, order in sets is not important, so something strongly tells me that there must be a way to improve this.
My third attempt is to enumerate all possible sets and then just store the index of the set (an integer representing its index in the list of all possible sets of bytes). But it turned out that this is absolutely equivalent to the first approach. Basically, I would still need 32 bytes to store any set, so it is not very useful.
The fourth attempt I made is based on my second approach. I noticed that if the set contains n elements it will, of course, require n + 1 bytes (using my second method). But if, for example, element k has already appeared in the set (actually in the array, because in my second attempt I store sets as arrays), then it cannot appear again. Basically, if k appears again, it must mean something different (maybe k - 1). So I did some optimizations and noticed that I can save some bytes if I encode each next element differently. For example, [3, 3, 5, 7] is interpreted as a set of 3 elements, namely {3, 4, 5} (every next element is decreased by its index), and [3, 3, 5, 6] is interpreted as {3, 4, 2} (notice that 3 and 4 already exist, so 6 is decreased by 2 and becomes 4, but 4 exists and 3 exists, so it must be 2). But how can this approach actually save bytes? I experimented and realized that I can order the elements in the array so that, in some cases, the high bit is not needed to encode an element, so I save 1 bit for roughly half of the elements, which is about n / 16 bytes saved (n / 2 elements * 1 / 8 byte).
The fifth approach I made is similar to my second approach, but it interprets the number of elements differently. If the number of elements is less than 128, it reads all the elements from the following array in memory as usual. But if the number of elements is greater than 128, it creates a full set and then just removes the elements listed in the following array in memory. On average, it saves a lot of bytes, but it is still far from optimal.
My last attempt (the sixth) is to enumerate just some sets (for example, create a list containing: the full set, the set of only even numbers, the set of only odd numbers, the set of elements less than 128, the set of elements greater than 128, etc.) and then use elements from that list and basic set operations (union, intersection, etc.) to reconstruct the original set. It requires a few bytes for each base set we use from the list, a few bits for the union or intersection operations, and of course one byte for the length of the sequence. It depends very much on the number of elements in the hardcoded base set list, and it seems hard to properly create that list and choose its elements. Anyway, something tells me that this is not a very clever approach.
But what is actually the most optimal way? Something tells me that my fourth attempt is not so bad, but can we do better? The sets I operate with have a random number of elements, so on average 128 elements per set, and I am looking for a way to allocate 128 bits (16 bytes) per set. The best I did so far is my fourth approach, which is far from that goal.
Just to mention again, speed is not important. Encoding/decoding may be extremely slow; the only important thing is that sets require as little memory as possible. When I said "in memory" I meant encoded in memory (compressed). Also, I am interested in as few bits as possible (not only bytes) because I want to store billions of sets compressed on my HDD, so it is important to calculate the average number of bits I need per set so I know how many resources I need for what I want to achieve.
P.S. If you want some code (though I really don't see why you would), I can post my C solutions for all of these approaches here. Anyway, I am not asking for code or technical details of how to implement this in a specific programming language; I am just asking for a method/algorithm for compressing sets.
Thank you in advance.

Your first method (and the third method, which is equivalent) is already optimal. It cannot be improved.
There are 2^256 possible sets of the numbers you're working with. By the pigeonhole principle, you need 2^256 numbers to identify them all, and you'll need 256 bits to represent those numbers. Any method of identifying the sets which used fewer than 256 bits would leave at least one pair (and probably many pairs) of sets sharing the same identifier.

There are 2^256 possible sets of bytes.
If all sets are equally likely, then the best you can do is to use a constant 256 bits (32 bytes) to indicate which of the 2^256 possibilities you have.
You seem not to like this idea, because you think that sets with only a few elements should take fewer bits. But if they are no more likely to occur than any other sets, then that would not be optimal.
If sets with fewer elements are more likely, then using a constant 32-bytes is not optimal, but the optimal encoding depends on the precise probability distribution of possible sets, which you haven't given. The relevant concept from information theory is "entropy": https://en.wikipedia.org/wiki/Entropy_(information_theory)
Succinctly, in an optimal encoding, the average number of bits required will be the sum over all 2^256 possible sets of Pᵢ * -log₂(Pᵢ), where each Pᵢ is the probability of having to encode a particular set (all the Pᵢ must sum to 1).
If the number of elements is the only thing that you think should affect the size of the encoding, then you can't go too far wrong with something like this:
1) Use 1 byte to write out the number of elements in the set. There are 257 possible set sizes, but you can use 0 for both 0 and 256 elements.
2) Write out the index of the set in an enumeration of all sets with that length. (If you wrote a 0, then you need 1 extra bit to indicate whether the set is empty or full.) If the set is known to have N elements, then the number of bits required for this index will be log₂(256!/(N!*(256-N)!)).
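To get a feel for how many bits step 2 needs for each size, log₂(256!/(N!*(256-N)!)) can be evaluated numerically; a rough sketch using lgamma (so the huge factorials never have to be computed exactly), with illustrative names:

#include <math.h>
#include <stdio.h>

/* log2 of the binomial coefficient C(n, k), via lgamma. */
static double log2_choose(int n, int k) {
    return (lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0)) / log(2.0);
}

int main(void) {
    for (int n = 0; n <= 256; n += 32)
        printf("N = %3d -> about %6.1f bits for the index, plus the 1-byte header\n",
               n, log2_choose(256, n));
    return 0;   /* N = 128 needs roughly 251.7 bits, i.e. just under 32 bytes */
}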

Related

Finding the pair of strings with the most identical letters in an array

Suppose I have an array of strings of different lengths.
It can be assumed that the strings have no repeating characters.
Using a brute-force algorithm, I can find the pair of strings that have the most number of identical letters (order does not matter - for example, "ABCDZFW" and "FBZ" have 3 identical letters) in n-squared time.
Is there a more efficient way to do this?
Attempt: I've tried to think of a solution using the trie data structure, but this won't work since a trie would only group together strings with similar prefixes.
I can find the pair of strings that have the most number of identical letters (order does not matter - for example, "ABCDZFW" and "FBZ" have 3 identical letters) in n-squared time.
I think you can't, as the string comparison itself is O(max(length(s1), length(s2))) on top of the O(n^2) loop for checking all pairs. However, you can optimize the comparison of the strings to some extent.
As you mentioned, the strings don't have duplicates, and I am assuming the strings consist of only uppercase letters based on your input. So each string can be at most 26 characters long.
For each string, we can use a bitmask, and for each character of a string we set the corresponding bit to 1. For example:
ABCGH
11100011 (from LSB to MSB)
Thus, we have n bit-masks for n strings.
Way #1
Now you can check all possible pairs of strings using an O(n^2) loop and compare the strings by ANDing the two corresponding masks and checking the number of set bits (Hamming weight). Obviously this is an improvement over your version because the string comparison is optimized now - only an AND operation between two 32-bit integers, which is an O(1) operation.
For example, the comparison of any two strings will be:
ABCDG
ABCEF
X1 = mask(ABCDG) => 1001111
X2 = mask(ABCEF) => 0110111
X1 AND X2 => 0000111
hamming weight(0000111) => 3 // number of set bits
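One possible sketch of Way #1 in C (assuming distinct uppercase letters as above; __builtin_popcount is the GCC/Clang builtin for counting set bits):

#include <stdio.h>

/* Build a 26-bit mask from a string of distinct uppercase letters. */
static unsigned mask_of(const char *s) {
    unsigned m = 0;
    for (; *s; s++)
        m |= 1u << (*s - 'A');
    return m;
}

int main(void) {
    const char *strs[] = { "ABCDZFW", "FBZ" };
    enum { N = 2 };
    unsigned m[N];
    for (int i = 0; i < N; i++)
        m[i] = mask_of(strs[i]);

    int best = -1;
    for (int i = 0; i < N; i++)                 /* still O(n^2) pairs ...      */
        for (int j = i + 1; j < N; j++) {       /* ... but O(1) per comparison */
            int common = __builtin_popcount(m[i] & m[j]);
            if (common > best) best = common;
        }
    printf("most shared letters: %d\n", best);  /* prints 3 for this pair */
    return 0;
}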
Way #2
Now, one observation: the AND of two 1 bits is 1. So, for every mask, we will try to maximize the Hamming weight (total number of set bits) of the AND of two strings' masks, since the pair with the most matched characters has the same bits set, and ANDing those two masks keeps those bits 1.
Now build a Trie with all masks - every node of the trie holds 0 or 1 based on whether the corresponding bit is set or not. Insert each mask from MSB to LSB. Before inserting the ith mask into the Trie (already holding i - 1 masks), we query it, trying to maximize the Hamming weight of the AND recursively by going to the same bit's branch (to make the bit 1 in the final AND value) and also to the opposite bit's branch, because in later levels you might get more set bits in that branch.
Regarding this Trie part, you can find a nice pictorial explanation in a similar thread here (that one works with XOR).
Here, in the worst case, we will need to traverse many branches of the trie to maximize the Hamming weight. In the worst case it will take around 6 * 10^6 operations (which will take ~1 sec on a typical machine), and we also need additional space for building the trie. But say the total number of strings is 10^5; then the O(n^2) algorithm will take 10^10 operations, which is too much - so the trie approach is still far better.
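If it helps, here is a rough sketch of that trie query in C: it explores both children recursively and prunes a branch as soon as the bits still to come cannot beat the best AND weight found so far (the node layout and names are just illustrative):

#include <stdlib.h>

#define BITS 26   /* one trie level per letter, inserted from MSB to LSB */

typedef struct Node { struct Node *child[2]; } Node;

static void trie_insert(Node *root, unsigned mask) {
    for (int b = BITS - 1; b >= 0; b--) {
        int bit = (mask >> b) & 1;
        if (!root->child[bit])
            root->child[bit] = calloc(1, sizeof(Node));
        root = root->child[bit];
    }
}

/* Maximum popcount(mask AND stored) over every mask stored under `node`. */
static int trie_best_and(const Node *node, unsigned mask, int level,
                         int gain, int best) {
    if (!node) return best;
    if (level < 0) return gain > best ? gain : best;
    if (gain + level + 1 <= best) return best;   /* prune: cannot beat `best` */
    int bit = (mask >> level) & 1;
    /* Same-bit branch first: a stored 1 only adds to the AND when our bit is 1. */
    best = trie_best_and(node->child[1], mask, level - 1, gain + bit, best);
    best = trie_best_and(node->child[0], mask, level - 1, gain, best);
    return best;
}

Processing the masks in order - query the trie holding masks 0..i-1 with mask i, remember the best value, then insert mask i - follows the scheme described above.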
Let me know if you're having trouble with the implementation. Unfortunately I can only help you with code if you're a C/C++ or Java guy.
Thanks @JimMischel for pointing out a major flaw. I slightly misunderstood the statement at first.

ARM NEON: Sort an array of 16 bytes

tl;dr: What is the fastest way to sort an uint8x16_t?
I need to sort many arrays of exactly 16 unsigned bytes (in descending order, which doesn't matter, of course), and I'm trying to optimize the sorting by means of ARM NEON vectorization.
And I find it to be quite a fancy puzzle, as it seems there "must" exist a short combination of NEON instructions (such as vmax/vpmax/vmin/vpmin, vzip/vuzp) that reliably results in a sorted array.
For example, if we transform a pair (A, B) of two 8-byte arrays into (vpmax(A,B), vpmin(A,B)), we obtain the same 16 values, just in a different order. If we repeat this operation four times, we reliably have the array maximum in the first cell and the array minimum in the last cell; we cannot be sure about the middle elements though.
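As a sketch only (using the standard arm_neon.h intrinsics, not a complete sorter), that step could look like this:

#include <arm_neon.h>

/* One (A, B) -> (vpmax(A, B), vpmin(A, B)) step on the two 8-byte halves.
 * Repeating it four times leaves the overall maximum in lane 0 of *a and
 * the overall minimum in lane 7 of *b; the middle lanes stay unsorted. */
static void pmax_pmin_step(uint8x8_t *a, uint8x8_t *b) {
    uint8x8_t hi = vpmax_u8(*a, *b);   /* pairwise maxima of all 16 bytes */
    uint8x8_t lo = vpmin_u8(*a, *b);   /* pairwise minima of all 16 bytes */
    *a = hi;
    *b = lo;
}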
Another example: if we first do (C,D) = (vmax(A,B), vmin(A,B)), then (E,F) = (vpmax(C,D), vpmin(C,D)), then (G,H) = vzip(E,F), we get our array split into four parts of four bytes, and in each part we already know the largest element and the smallest element. Probably the next naive step would be to deinterleave this array to have the top four bytes at the start of the array (which won't necessarily be the top 4 elements of the array, just the top bytes of their respective groups) and repeat; I'm not yet sure where that leads in the end.
Is there any known method for this particular problem or for other similar problems (for different array sizes or whatever)? Any ideas are appreciated :)

How to find the 0 in an integer array of size 100, having 99 elements equal to 1 and only one element 0, in the most efficient way

I need to find the position (or index), say i, of an integer array A of size 100 such that A[i] = 0. 99 elements of array A are 1 and only one element is 0. I want the most efficient way of solving this problem (so no one-by-one element comparison).
Others have already answered the fundamental question - you will have to check all entries, or at least, up until the point where you find the zero. This would be a worst case of 99 comparisons. (Because if the first 99 are ones then you already know that the last entry must be the zero, so you don't need to check it)
The possible flaw in these answers is the assumption that you can only check one entry at a time.
In reality we would probably use direct memory access to compare several integers at once. (e.g. if your "integer" is 32 bits, then processors with SIMD instructions could compare 128 bits at once to see if any entry in a group of 4 values contains the zero - this would make your brute force scan just under 4 times faster. Obviously the smaller the integer, the more entries you could compare at once).
But that isn't the optimal solution. If you can dictate the storage of these values, then you could store the entire "array" as binary bits (0/1 values) in just 100 bits (the easiest would be to use two 64-bit integers (128 bits) and fill the spare 28 bits with 1's) and then you could do a "binary chop" to find the data.
Essentially a "binary chop" works by chopping the data in half. One half will be all 1's, and the other half will have the zero in it. So a single comparison allows you to reject half of the values at once. (You can do a single comparison because half of your array will fit into a 64-bit long, so you can just compare it to 0xffffffffffffffff to see if it is all 1's). You then repeat on the half that contains the zero, chopping it in two again and determining which half holds the zero... and so on. This will always find the zero value in 7 comparisons - much better than comparing all 100 elements individually.
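A sketch of that chop in C, assuming entry i lives at bit i of the first 64-bit word for i < 64 and at bit i - 64 of the second word otherwise, with the 28 spare bits set to 1 as described:

#include <stdint.h>

/* Returns the index (0..99) of the single 0 bit; 7 comparisons in total. */
static int find_zero_bit(const uint64_t w[2]) {
    int base = 0;
    uint64_t v = w[0];
    if (v == UINT64_MAX) {      /* first word is all 1s, so the 0 is in the second */
        v = w[1];
        base = 64;
    }
    for (int width = 64; width > 1; width /= 2) {
        uint64_t low = (1ull << (width / 2)) - 1;   /* mask of the lower half */
        if ((v & low) == low) {                     /* lower half is all 1s...  */
            v >>= width / 2;                        /* ...so the 0 is above it  */
            base += width / 2;
        } else {
            v &= low;                               /* 0 is in the lower half   */
        }
    }
    return base;
}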
This could be further optimised because once you get down to the level of one or two bytes you could simply look up the byte/word value in a precalculated look-up table to tell you which bit is the zero. This would bring the algorithm down to 4 comparisons and one look-up (in a 64kB table), or 5 comparisons and one look-up (in a 256-byte table).
So we're down to about 5 operations in the worst case.
But if you could dictate the storage of the array, you could just "store" the array by noting down the index of the zero entry. There is no need at all to store all the individual values. This would only take 1 byte of memory to store the state, and this byte would already contain the answer, giving you a cost of just 1 operation (reading the stored value).
You cannot do better than a linear scan - unless the data is sorted or you have some extra information about it. At the very least you need to read all the data, since you have no clue where this 0 is hiding.
If it is [sorted] - just access the relevant [minimum] location.
Something tells me that the expected answer is "compare pairs":
while (a[i] == a[i+1]) i += 2;
Although it looks better than the obvious approach, it's still O(n).
Keep track of it as you insert elements while building the array. Then just access the stored value directly - O(1) with a very small constant.
Imagine 100 sea shells, under one is a pearl. There is no more information.
There is really no way to find it faster than trying to turn them all over. The computer can't do any better with the same knowledge. In other words, a linear scan is the best you can do unless you save the position of the zero earlier in the process and just use that.
More trivia than anything else, but if you happen to have a quantum computer this can be done faster than linear.
Grover's algorithm

Storing Large Integers/Values in an Embedded System

I'm developing an embedded system that can test a large number of wires (up to 360) - essentially a continuity checking system. The system works by clocking in a test vector and reading the output from the other end. The output is then compared with a stored result (which would be on an SD card) that tells what the output should have been. The test vectors are just walking ones, so there's no need to store them anywhere. The process would be roughly as follows:
Clock out test-vector (walking ones)
Read in output test-vector.
Read corresponding output test-vector from SD Card which tells what the output vector should be.
Compare the test-vectors from step 2 and 3.
Note down the errors/faults in a separate array.
Continue back to step 1 unless all wires are checked.
Output the errors/faults to the LCD.
My hardware consists of a large shift register that's clocked into the AVR microcontroller. For every test vector (which would also be 360 bits), I will need to read in 360 bits. So, for 360 wires, the total amount of data would be 360*360 bits = 16 kB or so. I already know I cannot do this in one pass (i.e. read the entire data and then compare), so it will have to be test vector by test vector.
As there are no built-in types that can hold such large numbers, I intend to use a bit array of length 360 bits. Now, my question is: how should I store this bit array in a text file?
One way is to store raw values, i.e. on each line store the raw binary digits that I read in from the shift register. So, for 8 wires, it would be 0b10011010. But this can get ugly for up to 360 wires - each line would contain 360 characters.
Another way is to store hex values - this would be just two characters for 8 bits (9A for the above) and about 90 characters for 360 bits. This would, however, require me to read in the text, line by line, and somehow convert each hex value into the bit-array representation.
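For the hex option, the conversion between a 45-byte bit array and a 90-character line is mechanical and needs no dynamic allocation; a sketch (function names are illustrative):

#define NBYTES 45   /* 360 bits */

static const char HEXDIG[] = "0123456789ABCDEF";

/* 45 raw bytes -> 90 hex characters plus a terminating NUL. */
static void bits_to_hex(const unsigned char bits[NBYTES], char hex[2 * NBYTES + 1]) {
    for (int i = 0; i < NBYTES; i++) {
        hex[2 * i]     = HEXDIG[bits[i] >> 4];
        hex[2 * i + 1] = HEXDIG[bits[i] & 0x0F];
    }
    hex[2 * NBYTES] = '\0';
}

static unsigned char nibble(char c) {
    return (unsigned char)(c <= '9' ? c - '0' : (c & ~0x20) - 'A' + 10);
}

/* 90 hex characters (one line from the file) -> 45 raw bytes. */
static void hex_to_bits(const char hex[2 * NBYTES], unsigned char bits[NBYTES]) {
    for (int i = 0; i < NBYTES; i++)
        bits[i] = (unsigned char)((nibble(hex[2 * i]) << 4) | nibble(hex[2 * i + 1]));
}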
So what's the best solution for this sort of problem? I need the solution to be completely "deterministic" - I can't have calls to malloc or such. They are a bit of a no-no in embedded systems from what I've read.
SUMMARY
I need to store large values that can't be represented by any traditional variable types. Currently I intend to store these values in a bit array. What's the best way to store these values in a text file on an SD card?
These are not integer values but rather bit maps; they have no arithmetic meaning. What you are suggesting is simply a byte array of length 360/8, and not related to "large integers" at all. However some more appropriate data structure or representation may be possible.
If the test vector is a single bit out of 360, then it is both inefficient and unnecessary to store 360 bits for each vector; a value from 0 to 359 is sufficient to unambiguously define each vector. If the correct output is also a single bit, then that could also be stored as a bit index; if not, then you could store it as a list of indices for each bit that should be set, with a sentinel value >= 360 or < 0 to indicate the end of the list. Where most vectors contain fewer than 22 set bits, this structure will be more efficient than storing a 45-byte array.
From any bit index value, you can determine the address and mask of the individual wire by:
byte_address = base_address + bit_index / 8 ;
bit_mask = 0x01 << (bit_index % 8) ;
You could either test each of the 360 bits iteratively or generate a 360 bit vector on the fly from the list of bits.
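As a sketch of the second option, assuming the expected bits are stored as 16-bit indices terminated by a sentinel (any value >= 360 would do, 0xFFFF here):

#include <stdint.h>
#include <string.h>

#define NWIRES 360
#define VEC_BYTES ((NWIRES + 7) / 8)   /* 45 bytes */
#define END_OF_LIST 0xFFFFu            /* sentinel marking the end of the index list */

/* Expand a sentinel-terminated list of set-bit indices into a full bit array. */
static void indices_to_bits(const uint16_t *idx, uint8_t bits[VEC_BYTES]) {
    memset(bits, 0, VEC_BYTES);
    for (; *idx != END_OF_LIST; idx++)
        bits[*idx / 8] |= (uint8_t)(1u << (*idx % 8));
}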
I can see no need for dynamic memory allocation in this, but whether or not it is advisable in an embedded system is largely dependent on the application and target resources. A typical AVR system has very little memory, and dynamic memory allocation carries an overhead for heap management and block alignment that you may not be able to afford. Dynamic memory allocation is not suited in situations where hard real-time deterministic timing is required. And in all cases you should have a well defined strategy or architecture for avoiding memory leak issues (repeatedly allocating memory that never gets released).

Why is it that data structures usually have a size of 2^n?

Is there a historical reason or something? I've seen quite a few times something like char foo[256]; or #define BUF_SIZE 1024. Even I mostly use 2^n sized buffers, mainly because I think it looks more elegant and that way I don't have to think of a specific number. But I'm not quite sure if that's the reason most people use them; more information would be appreciated.
There may be a number of reasons, although many people will as you say just do it out of habit.
One place where it is very useful is in the efficient implementation of circular buffers, especially on architectures where the % operator is expensive (those without a hardware divide - primarily 8-bit microcontrollers). By using a 2^n buffer in this case, the modulo is simply a case of bit-masking the upper bits, or in the case of, say, a 256-byte buffer, simply using an 8-bit index and letting it wrap around.
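For illustration, a minimal sketch of such a ring buffer (RB_SIZE and the names are made up, and no full/empty checks are shown):

#include <stdint.h>

#define RB_SIZE 64                       /* must be a power of two */
static uint8_t rb_data[RB_SIZE];
static uint8_t rb_head, rb_tail;         /* with RB_SIZE == 256, plain uint8_t
                                            indices would wrap on their own */

static void rb_put(uint8_t byte) {
    rb_data[rb_head] = byte;
    rb_head = (uint8_t)((rb_head + 1) & (RB_SIZE - 1));   /* cheap modulo */
}

static uint8_t rb_get(void) {
    uint8_t byte = rb_data[rb_tail];
    rb_tail = (uint8_t)((rb_tail + 1) & (RB_SIZE - 1));
    return byte;
}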
In other cases alignment with page boundaries, caches etc. may provide opportunities for optimisation on some architectures - but that would be very architecture specific. But it may just be that such buffers provide the compiler with optimisation possibilities, so all other things being equal, why not?
Cache lines are usually some power of 2 in size (often 32 or 64 bytes). Data that is an integral multiple of that number would be able to fit into (and fully utilize) the corresponding number of cache lines. The more data you can pack into your cache, the better the performance.. so I think people who design their structures in that way are optimizing for that.
Another reason in addition to what everyone else has mentioned is, SSE instructions take multiple elements, and the number of elements input is always some power of two. Making the buffer a power of two guarantees you won't be reading unallocated memory. This only applies if you're actually using SSE instructions though.
I think in the end though, the overwhelming reason in most cases is that programmers like powers of two.
Hash Tables, Allocation by Pages
This really helps for hash tables, because you compute the index modulo the size, and if that size is a power of two, the modulus can be computed with a simple bitwise AND (&) rather than using a much slower divide-class instruction implementing the % operator.
Looking at an old Intel i386 book, the and instruction takes 2 cycles and div takes 40 cycles. A disparity persists today due to the much greater fundamental complexity of division, even though the 1000x faster overall cycle times tend to hide the impact of even the slowest machine ops.
There was also a time when the overhead of malloc was occasionally avoided at great length. Allocations available directly from the operating system were (and still are) a specific number of pages, and so a power of two would be likely to make the most of the allocation granularity.
And, as others have noted, programmers like powers of two.
I can think of a few reasons off the top of my head:
2^n is a very common value in all kinds of sizes in computing. This is directly related to the way bits are represented in computers (2 possible values), which means variables tend to have ranges of values whose boundaries are 2^n.
Because of the point above, you'll often find the value 256 as the size of a buffer. This is because a byte can hold 256 distinct values. So, if you want to store a string together with its size, then you'll be most efficient if you store it as SIZE_BYTE+ARRAY, where the size byte tells you the size of the array. This means the array can be any size from 0 to 255 (or 1 to 256 if the size byte stores length - 1).
Many other times, sizes are chosen based on physical things (for example, the amount of memory an operating system can address is related to the size of the registers of the CPU, etc.) and these are also going to be a specific number of bits. This means the amount of memory you can use will usually be some value of 2^n (for a 32-bit system, 2^32).
There might be performance benefits or alignment issues for such values. Most processors can access a certain number of bytes at a time, so even if you have a variable whose size is, let's say, 20 bits, a 32-bit processor will still read 32 bits, no matter what. So it's often more efficient to just make the variable 32 bits. Also, some processors require variables to be aligned to a certain number of bytes (because they can't read memory from, for example, odd addresses). Of course, sometimes it's not about odd memory locations, but locations that are multiples of 4, or of 8, etc. So in these cases, it's more efficient to just make buffers that will always be aligned.
Ok, those points came out a bit jumbled. Let me know if you need further explanation, especially point 4 which IMO is the most important.
Because of the simplicity (read also cost) of base 2 arithmetic in electronics: shift left (multiply by 2), shift right (divide by 2).
In the CPU domain, lots of constructs revolve around base 2 arithmetic. Buses (control & data) that access memory structures are often aligned on powers of 2. The cost of logic implementation in electronics (e.g. a CPU) makes arithmetic in base 2 compelling.
Of course, if we had analog computers, the story would be different.
FYI: the attributes of a system sitting at layer X are a direct consequence of the attributes of the layers below it, i.e. layers < X. The reason I am stating this stems from some comments I received regarding my posting.
E.g. the properties that can be manipulated at the "compiler" level are inherited & derived from the properties of the system below it i.e. the electronics in the CPU.
I was going to use the shift argument, but couldn't think of a good reason to justify it.
One thing that is nice about a buffer that is a power of two is that circular buffer handling can use simple ands rather than divides:
#define BUFSIZE 1024
++index; // increment the index.
index &= (BUFSIZE - 1); // Make sure it stays in the buffer.
If it weren't a power of two, a divide would be necessary. In the olden days (and currently on small chips) that mattered.
It's also common for pagesizes to be powers of 2.
On linux I like to use getpagesize() when doing something like chunking a buffer and writing it to a socket or file descriptor.
It makes a nice, round number in base 2, just as 10, 100 or 1000000 are nice, round numbers in base 10.
If it weren't a power of 2 (or something close, such as 96 = 64 + 32 or 192 = 128 + 64), then you could wonder why there's the added precision. A size not rounded to base 2 can come from external constraints or programmer ignorance, and you'll want to know which one it is.
Other answers have pointed out a bunch of technical reasons as well that are valid in special cases. I won't repeat any of them here.
In hash tables, 2^n makes it easier to handle key collisions in a certain way. In general, when there is a key collision, you either make a substructure, e.g. a list, of all entries with the same hash value, or you find another free slot. You could just add 1 to the slot index until you find a free slot, but this strategy is not optimal, because it creates clusters of blocked places. A better strategy is to calculate a second hash number h2, so that gcd(n, h2) = 1; then add h2 to the slot index until you find a free slot (with wraparound). If n is a power of 2, finding an h2 that fulfills gcd(n, h2) = 1 is easy: every odd number will do.
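A sketch of that probing scheme with a power-of-two table (the table layout, hash functions and constants are only illustrative, and it assumes the table never fills up completely):

#include <stdint.h>

#define TABLE_SIZE 1024u                 /* power of two */
#define MASK (TABLE_SIZE - 1u)

/* Returns the slot where `key` lives or should be inserted, for a
 * hypothetical table where slots[] holds keys and 0 means "empty". */
static unsigned find_slot(const uint32_t slots[TABLE_SIZE], uint32_t key) {
    unsigned i  = (unsigned)(key * 2654435761u) & MASK;    /* h1: first probe  */
    unsigned h2 = (((unsigned)(key * 40503u)) & MASK) | 1u; /* h2: odd step, so */
                                                            /* gcd(h2, size)=1  */
    while (slots[i] != 0 && slots[i] != key)
        i = (i + h2) & MASK;               /* wraparound via the mask */
    return i;
}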
