Fast 3D Lut lookups - c

I'm trying to write a fast 3D lut lookup function and noticed that most luts are either 33x33x33 or 17x17x17.
Why 33 or 17? Wouldn't the math be quicker with 32 or 16 instead? So you could do some shifts instead of divides? Or maybe I'm not understanding why.
Anyone?

This paper will provide a synopsis: https://www.hpl.hp.com/techreports/98/HPL-98-95.pdf
Basically, you divide the color space into a certain number of pieces and do linear interpolation between the boundaries of those pieces. This lets the lookup table find color positions without much error while being much sparser than a full table would be.
And here's the reason: if you cut a line 2 times, you have 3 pieces.
The reason you have 17 or 33 rather than 16 or 32 is that you need the piece you are in, not the nearest position. If you bit-shift an 8-bit value (2^8 possibilities) right by 4, you get one of 16 piece indices. But since you need to linearly interpolate the position within that piece, you need both endpoints of the piece, and 16 pieces have 17 endpoints.
In short, the reason you have 17 and not 16 is that with 17 entries you can still divide the value by 16 (a fast shift), read the entry at the floored quotient and the entry after it, and then interpolate between those two values according to the remainder. That takes N+1 values in the lookup table.
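The shift-then-interpolate idea can be sketched for a single axis as follows. This is only an illustration under stated assumptions: the function name `lut1d_lookup` is made up, the input is 8 bits wide, and grid point i is assumed to sit at input value i * 16 (so the 17th entry stands for input 256). A 3D lookup would apply the same index/fraction split to each of the three channels.

```c
#include <stdint.h>

#define LUT_SIZE 17  /* 16 intervals need 17 grid points */

/* Interpolate an 8-bit input through a 17-entry table.
 * Assumption: grid point i corresponds to input value i * 16,
 * so the last entry (i = 16) stands for input 256. */
static uint8_t lut1d_lookup(const uint8_t lut[LUT_SIZE], uint8_t v)
{
    unsigned idx  = v >> 4;      /* which of the 16 intervals: 0..15 */
    unsigned frac = v & 15u;     /* position inside the interval: 0..15 */
    unsigned lo = lut[idx];
    unsigned hi = lut[idx + 1];  /* idx + 1 can be 16: the 17th entry */
    /* Weighted average with rounding; the two weights sum to 16. */
    return (uint8_t)((lo * (16u - frac) + hi * frac + 8u) >> 4);
}
```

With only 16 entries, the read of `lut[idx + 1]` would run off the end of the table for inputs in the last interval, which is the practical reason for the extra entry.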

Related

Is there a space efficient way to store and retrieve the order of a dataset?

Here's my problem. I have a set of 20 objects stored in memory as an array. I want to store a second piece of data that defines an order for the objects to be displayed.
The simplest way to store the order is as an array of 20 unsigned integers, each of which is 5 bits (aka 0-31). The position of the object in the output list would be defined by the number stored in this array at the same index as the object in its array.
But.. I know from statistics that there are only 20! (that's 20 factorial), ways to arrange these objects.
This could be stored in 62 bits, since 2^62 > 20!
I'm currently using 100 bits to store the same information.
So my question is this: Is there a space efficient way to store ORDER as a sequence of bits?
I have some additional constraints as well. This will run on an embedded device, so I can't use any huge arrays or high level math functions. I would need a simple iterative method.
Edit: Some clarification on why this is necessary. Say for example the objects are pictures, and they're stored in ROM (aka they can't be moved around). Now let's say I want to keep track of what order to display the images in, and I'm going to update that order every second. My device has 1k of storage with wear leveling, but each bit in the storage can only be written 1000 times before it becomes unreliable. If I need 1kb to store the order, then my device will only work for 1000 seconds. If I need 0.1kb, it will work for 10k seconds, and so on. Thus the device's longevity will be inversely proportional to the number of bits I need to update every cycle.
You can store the order in a single 64-bit value x:
For the first choice, 20 possibilities, compute the index as x % 20 and update x as x /= 20,
For the next choice, only 19 possibilities, compute x % 19 and update x as x /= 19.
Continue this process 17 more times and you are done.
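A minimal sketch of this modulo/divide scheme, together with the matching packing step, might look like the following. The names `perm_pack` and `perm_unpack` are made up for illustration, and the code assumes the order is given as a permutation of 0..19 where each index picks from the objects not yet used, kept in ascending order.

```c
#include <stdint.h>

#define N 20  /* number of objects; 20! < 2^62, so a uint64_t suffices */

/* Pack a permutation of 0..N-1 into one integer. */
static uint64_t perm_pack(const uint8_t perm[N])
{
    uint64_t x = 0;
    for (int i = N - 1; i >= 0; i--) {
        /* Digit i: how many of the later entries are smaller than
         * perm[i], i.e. its index among the still-unused objects. */
        unsigned d = 0;
        for (int j = i + 1; j < N; j++)
            if (perm[j] < perm[i])
                d++;
        x = x * (uint64_t)(N - i) + d;   /* radix is 20, 19, 18, ... */
    }
    return x;
}

/* Recover the permutation: x % 20, x /= 20, x % 19, x /= 19, ... */
static void perm_unpack(uint64_t x, uint8_t perm[N])
{
    uint8_t pool[N];  /* unused objects, kept in ascending order */
    for (int i = 0; i < N; i++)
        pool[i] = (uint8_t)i;
    for (int i = 0; i < N; i++) {
        unsigned remaining = (unsigned)(N - i);
        unsigned d = (unsigned)(x % remaining);
        x /= remaining;
        perm[i] = pool[d];
        /* Remove the chosen object from the pool. */
        for (unsigned j = d; j + 1 < remaining; j++)
            pool[j] = pool[j + 1];
    }
}
```

Both directions use only small fixed arrays, integer multiply, divide and modulo, which fits the "simple iterative method" constraint as long as the target can do 64-bit division in software.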
I think I've found a partial solution to my own question. Assuming I start at the left side of the order array, for every move right there are fewer remaining possibilities for the position value. The number of possibilities is 20, 19, 18, etc. I can take advantage of this by populating the order array in a relative fashion. The first index will place a value in the order array. There are 20 possibilities, so this takes 5 bits. Placing the next value, there are only 19 positions available (still 5 bits), and so on through the whole array. The bits required are now 5,5,5,5,4,4,4,4,4,4,4,4,3,3,3,3,2,2,1,0. So that gets me down to 69 bits, much better.
There's still some "wasted" precision in each of the values, since for example the first position can store 32 possible values, even though there are only 20. I'm not sure how to deal with this, but I think it will have something to do with carrying a remainder from one calculation to the next.

Compressing a sparse bit array

I have arrays of 1024 bytes (8192 bits) which are mostly zero.
Between 0.01% and 10% of bits will be set (random, no pattern).
How could these be compressed, given the lack of structure and the relatively small size?
(My first thought was to store the distances between set bits. I need 13 bits for each distance, but in the worst case of 10% occupancy, about 819 set bits, this needs roughly 13 * 819 / 8 ≈ 1331 bytes, which is not an improvement.)
This is for ultra-low bandwidth comms, so every byte matters.
I've dealt deeply with a similar problem, but my sets are much bigger (30 million possible values with between 1 and 30 million elements in each set), so they both gain much more from compression and the compression metadata is insignificant compared to the size of the data. I have never gone down to squeezing things into units smaller than uint16_t, so the things I write below might not apply if you start chopping up 13 bit values into pieces. It feels like it should work, but caveat emptor.
What I've found works is to employ several strategies that depend on the particular data we have. The good news is that the count of elements in each set is a very good indicator of which compression strategy will work best for a particular set. So all the metadata you need is a count of elements in the set. In my data format the first and only metadata value (I'll be unspecific and just call it "value", you can squeeze things in bytes, 16 bit values or 13 bit values however you feel) is the count of elements in the set, the rest is just the encoding of the set elements.
The strategies are:
1. If very few elements are in the set, you can't do better than an array that says "1, 4711, 8140", so in this case the data is encoded as: [3, 1, 4711, 8140]
2. If almost all elements are in the set, you can just keep track of elements that aren't. For example [8190, 17, 42].
3. If around half of the elements are in the set you pretty much can't do much better than a bitmap, so you get [4000, {bitmap}]. This is the only case where your data ends up being longer than strictly uncompressed.
4. If more than "a few" but many fewer than "around half" elements are set, I found another strategy. Divide the bits of your possible values in the set in half. Let's say we have 2^16 (it's easier to describe, it should probably work for 2^13) possible values. The values are divided into 256 ranges, each with 256 possible values. We then have an array of 256 bytes, where each byte describes how many values are in each range (so byte 0 tells us how many elements are in [0,255], byte 1 gives us [256,511], etc.). Immediately after follow arrays with the values in each range mod 256. The trick here is that while every element in the set encoded as an array (strategy 1) would be 2 bytes, in this scheme each element is only 1 byte, plus 256 static bytes for the counts of elements. This means that as soon as we have more than 256 elements in the set this saves us space by switching from strategy 1 to 4.
5. Strategy 4 can be refined (probably meaningless if your data is random as you mention, but my data had more patterns sometimes, so it worked for me). Since we still need 8 bits for each element in the previous encoding, as soon as a sub-array goes over 32 elements (32 bytes, the size of a 256-bit bitmap for that range), we can store it as a bitmap instead. This is also a good breakpoint for switching from strategies 4/5 to 3. If all the arrays in this strategy are just bitmaps, then we should just use strategy 3 (it's more complicated than that, but the breakpoints between strategies can be precomputed accurately enough that you'll end up picking the most likely efficient strategy each time).
I have only vaguely tried saving deltas between numbers in the set. Quick experiments showed that they weren't really much more efficient than the strategies I mentioned above, had unpredictable degenerate cases, but most importantly, the application I work with really likes to not have to deserialise its data, just use it raw straight from disk (mmap).
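The range-split encoding (256 counts followed by the low bytes) can be sketched roughly like this for 16-bit values. The function names and the exact byte layout are assumptions for illustration, not the format used in the application described above, and the sketch assumes the input is sorted and that no range holds more than 255 values so each count fits in one byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a sorted array of distinct 16-bit values.
 * Output layout: 256 count bytes (one per range of 256 values),
 * then the low byte of every value, in order.
 * Assumption: no range holds more than 255 values, so each count
 * fits in one byte. Returns bytes written, or 0 on failure. */
static size_t split_encode(const uint16_t *vals, size_t n,
                           uint8_t *out, size_t out_cap)
{
    if (out_cap < 256 + n)
        return 0;
    for (size_t i = 0; i < 256; i++)
        out[i] = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned hi = vals[i] >> 8;
        if (out[hi] == 255)
            return 0;               /* count would overflow a byte */
        out[hi]++;
        out[256 + i] = (uint8_t)(vals[i] & 0xFF);
    }
    return 256 + n;
}

/* Decode back into vals; returns the number of values recovered. */
static size_t split_decode(const uint8_t *in, uint16_t *vals)
{
    size_t n = 0;
    for (unsigned hi = 0; hi < 256; hi++)
        for (unsigned k = 0; k < in[hi]; k++, n++)
            vals[n] = (uint16_t)((hi << 8) | in[256 + n]);
    return n;
}
```

The encoded size is 256 + n bytes against 2n bytes for a plain array, which is where the break-even point of 256 elements comes from.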

How LogLog algorithm with single hash function works

I have found dozens of explanations of the basic idea of LogLog algorithms, but they all lack details about how splitting the hash function result works. I mean, using a single hash function is not precise, while using many functions is too expensive. How do they overcome the problem with a single hash function?
This answer is the best explanation I've found, but it still makes no sense to me:
They used one hash but divided it into two parts. One is called a
bucket (total number of buckets is 2^x) and another - is basically the
same as our hash. It was hard for me to get what was going on, so I
will give an example. Assume you have two elements and your hash
function which gives values from 0 to 2^10 produced 2 values: 344 and
387. You decided to have 16 buckets. So you have:
0101 011000 bucket 5 will store 1
0110 000011 bucket 6 will store 4
Could you explain the example above, please? You have 16 buckets because the header is 4 bits long, right? So how can you have 16 buckets with only two hashes? Do we only estimate per bucket? So the first bucket stores 1 and the second stores 4, right? How do you merge the results?
Hash function splitting: our goal is to use many hyperloglog structures (as an example, let's say 16 hyperloglog structures, each of them using a 64-bit hash function) instead of one, in order to reduce the estimation error. An intuitive approach might be to process each of the inputs in each of these hyperloglog structures. However, in that case we would need to make sure that the hyperloglogs are independent of each other, meaning we would need a set of 16 hash functions which are independent of each other, and that's hard to find!
So we use an alternative approach. Instead of using a family of 64-bit hash functions, we will use 16 separate hyperloglog structures, each using just a 60-bit hash function. How do we do that? Easy, we take our 64-bit hash function and just ignore the first 4 bits, producing a 60-bit hash function. What do we do with the first 4 bits? We use them to choose one of 16 "buckets" (each "bucket" is just a hyperloglog structure; note that 4 bits give 2^4 = 16 buckets). Now each of the inputs is assigned to exactly one of the 16 buckets, where a 60-bit hash function is used to calculate the hyperloglog value. So we have 16 hyperloglog structures, each using a 60-bit hash function. Assuming that we chose a decent hash function (meaning that the first 4 bits are uniformly distributed, and that they aren't correlated with the remaining 60 bits), we now have 16 independent hyperloglog structures. We take a harmonic mean of their 16 estimates to get a much less error-prone estimate of the cardinality.
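To make the split concrete, here is a small sketch using the toy numbers from the quoted example (10-bit hashes, 4 bucket bits) rather than 64-bit hashes. The names are made up for illustration. Following the quoted example, each bucket stores the largest number of leading zeros seen in the remaining 6 bits; real implementations often store the position of the first 1 bit instead, which is one larger. The final merge step (combining the buckets into one estimate) is left out.

```c
#include <stdint.h>

#define HASH_BITS   10                 /* toy hash width from the example */
#define BUCKET_BITS 4                  /* 2^4 = 16 buckets */
#define NUM_BUCKETS (1u << BUCKET_BITS)
#define REST_BITS   (HASH_BITS - BUCKET_BITS)

/* Count leading zeros of `rest` when viewed as a REST_BITS-wide field. */
static unsigned leading_zeros(uint32_t rest)
{
    unsigned n = 0;
    for (int bit = REST_BITS - 1; bit >= 0 && !((rest >> bit) & 1u); bit--)
        n++;
    return n;
}

/* Feed one hash value into the bucket array. */
static void loglog_add(uint8_t buckets[NUM_BUCKETS], uint32_t hash)
{
    unsigned bucket = (hash >> REST_BITS) & (NUM_BUCKETS - 1u);
    uint32_t rest   = hash & ((1u << REST_BITS) - 1u);
    unsigned z      = leading_zeros(rest);
    if (z > buckets[bucket])
        buckets[bucket] = (uint8_t)z;   /* keep the maximum seen */
}
```

Feeding in 344 (0101 011000) updates bucket 5 with the value 1, and 387 (0110 000011) updates bucket 6 with the value 4; the other 14 buckets simply stay at zero until some hash lands in them.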
Hope that clears it up!
The original HyperLogLog paper mentioned by OronNavon is quite theoretical. If you are looking for an explanation of the cardinality estimator without the need for complex analysis, you could have a look at the paper I am currently working on: http://oertl.github.io/hyperloglog-sketch-estimation-paper. It also presents a generalization of the original estimator that does not require any special handling for small or large cardinalities.

fast poker hand ranking

I am working on a simulation of poker and now I have to rank hands efficiently:
Every hand is a combination of 5 cards and is represented as a uint64_t.
Every bit from 0 (Ace of Spades), 1 (Ace of Hearts) to 51 (Two of Clubs) indicates if the corresponding card is part (bit == 1) or isn't part (bit == 0) of the hand. The bits from 52 to 63 are always set to zero and don't hold any information.
I already know how I could theoretically generate a table, so that every valid hand can be mapped to a rank (represented as uint16_t) between 1 (2,3,4,5,7 - not all of the same suit) and 7462 (Royal Flush) and all the others to the rank zero.
So a naive lookup table (with the integer value of the hand as index) would have an enormous size of
2 bytes * 2^52 ≈ 9.007 PB.
Most of this memory would be filled with zeros, because almost all uint64_t's from 0 to 2^52-1 are invalid hands and therefore have a rank equal to zero.
The valuable data occupies only
2 bytes * 52!/(47!*5!) = 5.198 MB.
What method can I use for the mapping so that I only have to save the ranks of the valid hands and some overhead (max. 100 MB) and still don't have to do some expensive search...
It should be as fast as possible!
If you have any other ideas, you're welcome! ;)
You need only a table of 13^5*2, with the extra level of information dictating if all the cards are of the same suit. If for some reason 'heart' outranks 'diamond', you need still at most a table with size of 13^6, as the last piece of information encodes as '0 = no pattern, 1 = all spades, 2 = all hearts, etc.'.
A hash table is probably also a good and fast approach -- Creating a table from nCk(52,5) combinations doesn't take much time (compared to all possible hands). One would, however, need to store 65 bits of information for each entry to store both the key (52 bits) and the rank (13 bits).
To speed up evaluation of the hand, one first rules out illegal combinations from the mask with a check such as if (popcount(mask) != 5). Afterwards one can use enough bits from e.g. crc32(mask) as the hash, which has instruction level support in the i7 architecture at least.
If I understand your scheme correctly, you only need to know that the hamming weight of a particular hand is exactly 5 for it to be a valid hand. See Calculating Hamming Weight in O(1) for information on how to calculate the hamming weight.
From there, it seems you could probably work out the rest on your own. Personally, I'd want to store the result in some persistent memory (if it's available on your platform of choice) so that subsequent runs are quicker since they don't need to generate the index table.
This is a good source: Cactus Kev's
For a hand you can take advantage of at most 4 of any suit
4 bits for the rank (0-12) and 2 bits for the suit
6 bits * 5 cards is just 30 bits
Call it 4 bytes
There are only 2598960 hands
Total size a little under 10 MB
A simple implementation that comes to mind would be to change your scheme to a 5-digit number in base 52. The resulting table to hold all of these values would still be larger than necessary, but very simple to implement and it would easily fit into RAM on modern computers.
edit: You could also cut down even more by only storing the rank of each card and an additional flag (e.g., lowest bit) to specify if all cards are of the same suit (i.e., flush is possible). This would then be in base 13 + one bit for the ranking representation. You would presumably then need to store the specific suits of the cards separately to reconstruct the exact hand for display and such.
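One way to sketch the "base 13 plus one flush bit" index is shown below. The function name `hand_index` and the bit layout are assumptions: bit i is taken to mean rank i / 4 (0 = Ace ... 12 = Two) and suit i % 4, which is consistent with the bit examples given in the question but is not stated there in full. The loop over 52 bits is written for clarity; a faster version would jump between set bits with a count-trailing-zeros instruction.

```c
#include <stdint.h>

#define TABLE_SIZE (13 * 13 * 13 * 13 * 13 * 2)  /* 742,586 entries */

/* Map a 52-bit hand mask to an index into a rank table, or -1 if the
 * mask does not contain exactly 5 cards.
 * Assumption: bit i encodes rank i / 4 (0 = Ace ... 12 = Two) and
 * suit i % 4. */
static int32_t hand_index(uint64_t mask)
{
    uint32_t index = 0;
    unsigned cards = 0;
    unsigned suits_seen = 0;           /* one bit per suit present */

    if (mask >> 52)
        return -1;                     /* bits 52..63 must be zero */
    for (unsigned bit = 0; bit < 52; bit++) {
        if (!((mask >> bit) & 1u))
            continue;
        if (++cards > 5)
            return -1;
        index = index * 13u + bit / 4u;   /* append a base-13 digit */
        suits_seen |= 1u << (bit % 4u);
    }
    if (cards != 5)
        return -1;
    /* Lowest bit: 1 when all five cards share one suit. */
    unsigned flush = (suits_seen & (suits_seen - 1u)) == 0;
    return (int32_t)(index * 2u + flush);
}
```

A table of `TABLE_SIZE` uint16_t ranks takes about 1.5 MB, well inside the 100 MB budget, and a lookup becomes `table[hand_index(mask)]` after checking for -1.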
I would represent your hand in a different way:
There are only 4 suits = 2 bits and only 13 ranks = 4 bits, for a total of 6 bits * 5 = 30 bits, so we fit into a 32-bit int. We can also force this to always be sorted as per your ordering:
[suit 0][suit 1][suit 2][suit 3][suit 4][value 0][value 1][value 2][value 3][value 4]
Then I would use a separate hash for:
consecutive values (very small) [mask off the suits]
1 or more multiples (pair, 2 pair, full house) [mask off the suits]
suits that are all the same (very small) [mask off the values]
Then use the 3 hashes to calculate your rankings
At 5MB you will likely have enough caching issues that will make a bit of math and three small lookups faster

generate random RGB color using rand() function

I need a function which will generate three numbers so I can use them as RGB pattern for my SVG.
While this is simple, I also need to make sure I'm not using the same color twice.
How exactly do I do that? Generate one number at a time with simple rand (seed time active) and then what? I don't want to exclude a number, but maybe the whole pattern?
I'm kind of lost here.
To be precise, by first calling of this function I will get for example 218 199 154 and by second I'll get 47 212 236 which definitely are two different colors. Any suggestions?
Also I think a struct with int r, int g, int b would be suitable for this?
Edit: Colors should be different to the human eye. Sorry for not mentioning this earlier.
You could use a set to store the generated colors.
First instantiate a new set.
Then, every time you generate a color, check whether the value is present in your set.
If the record exists, skip it and retry with a new color. If not, you can use it, but don't forget to add it to the set afterwards.
This may perform poorly if you need to generate a large number of colors.
The cheapest way to do this would be to use a Bloom filter which is very small memory wise, but leads to occasional false positives (i.e., you will think you have used a colour, but you haven't). Basically, create three random numbers between 0-255, save them however you like, hash them as a triplet and place the hash in the filter.
Also, you might want to throw away the low bits of each channel since it's probably not easy to tell #FFFFF0 versus #FFFFF2.
Here is a simple way:
1. Generate a random 32-bit integer.
2. Shift it right by 8 bits so that 24 meaningful bits remain, and store this integer value.
3. Use the first 8 bits for R, the second group of 8 bits for G,
and the remaining 8 bits for B.
For every new random number, shift it right by 8 bits and compare it with all the integer values that you stored before; if none of them matches the new one, use it for the new color (step 3).
Differentiation by the human eye is an interesting topic, because perceptual thresholds vary from one person to another. To account for it, shift the integer right by 14 bits instead, take the first 6 bits for R (pad with two 0s to get 8 bits again), the second 6 bits for G, and the last 6 bits for B. If colors with 6 bits per channel still look too similar, decrease it to 5, 4, ...
Simple Run with 4 significant bits for each channel:
My random integer is:
0101-1111-0000-1111-0000-1100-1101-0000
I shift it to the right by 20 bits (you can also use division or modulo):
0000-0000-0000-0000-0000-0101-1111-0000
store this value.
Then take the first 4 of the remaining 12 bits for R, the second 4 bits for G and the last 4 bits for B:
R: 0101
G: 1111
B: 0000
Pad them to make each of them 8 bits.
R: 0101-0000
G: 1111-0000
B: 0000-0000
Use those for your color components.
For each new random number after shifting it compare it with your stored integer values so far. If it is different, then store and use it for color.
One idea would be to use a bit vector to represent the set of colors generated. For 24-bit precision, the bit vector would be 2^24 bits long, which is 16,777,216 bits, or 2 MB. Certainly not a lot, these days, and it would be very fast to look up and insert colors.
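Combining the two ideas above (keeping only a few significant bits per channel, and a bit vector of used colors) could look roughly like this. It is a sketch under assumptions: 4 bits per channel as in the worked example, so the bit vector needs only 4096 bits; the names `struct rgb` and `next_color` are made up; and the caller is expected to seed rand() with srand() beforehand.

```c
#include <stdint.h>
#include <stdlib.h>

struct rgb { uint8_t r, g, b; };

#define BITS_PER_CHANNEL 4
#define NUM_KEYS (1u << (3 * BITS_PER_CHANNEL))   /* 4096 distinct colors */

static uint8_t seen[NUM_KEYS / 8];   /* bit vector: 1 bit per color key */
static unsigned used_count;

/* Produce a color not returned before. Returns 1 on success, 0 once all
 * NUM_KEYS quantized colors have been handed out. */
static int next_color(struct rgb *out)
{
    if (used_count == NUM_KEYS)
        return 0;
    for (;;) {
        /* rand() guarantees at least 15 random bits; take 12 of them. */
        unsigned key = ((unsigned)rand() >> 3) & (NUM_KEYS - 1u);
        if (seen[key / 8] & (1u << (key % 8)))
            continue;                          /* already used, retry */
        seen[key / 8] |= (uint8_t)(1u << (key % 8));
        used_count++;
        /* Split into three 4-bit fields and pad each to 8 bits. */
        out->r = (uint8_t)(((key >> 8) & 15u) << 4);
        out->g = (uint8_t)(((key >> 4) & 15u) << 4);
        out->b = (uint8_t)((key & 15u) << 4);
        return 1;
    }
}
```

The retry loop gets slower as the table fills up, so for a large share of the 4096 colors a shuffled list of keys would probably be a better fit than rejection sampling.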
