In RFC 4122 there is the phrase "A UUID is 128 bits long, and can guarantee uniqueness across space and time".

What does the phrase "uniqueness across space and time" mean in RFC 4122? Please explain.

It basically means that, for all realistic purposes, the ID is statistically guaranteed to be unique.
While it is not technically impossible for a UUID to be duplicated, the probability of that happening follows the birthday-problem formula, where n is the number of possible IDs and r is the number of UUIDs you want. Try plugging it into a calculator.
The calculator probably broke if you tried, because the factorials involved are enormous. For 32-hex-character UUIDs, the probability of uniqueness stays extremely close to 1 for anything below about 10^16 UUIDs. That is 10 quadrillion.
You are not going to run out of UUIDs if you are Facebook. You are not going to run out of IDs if you are the US government. Both store a massive amount of data (space) and have been generating data for a long time (time).
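As a rough, hedged illustration (not part of the answer above): the standard birthday-bound approximation p ≈ 1 - e^(-r(r-1)/(2n)) can be evaluated with plain doubles where a calculator gives up. The constants below - 2^122 random bits for a version-4 UUID and 10^16 generated IDs - are assumptions chosen for the example.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Birthday-bound approximation: p(collision) ~= 1 - exp(-r*(r-1)/(2*n)).
       Assumed constants: n = 2^122 random bits in a version-4 UUID,
       r = 1e16 (ten quadrillion) generated UUIDs. */
    double n = ldexp(1.0, 122);   /* 2^122 */
    double r = 1e16;
    double p = 1.0 - exp(-(r * (r - 1.0)) / (2.0 * n));
    printf("approximate collision probability: %g\n", p);  /* on the order of 1e-5 */
    return 0;
}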

"across space and time" describes how unlikely it is for two UUIDs to be the same.
128 bits of entropy is quite large; a collision would be about as likely as flipping 128 heads in a row with a coin, twice, or rolling fifty 6s in a row on a six-sided die, twice. To convert, divide 128 bits by the entropy per trial, which is the log base 2 of the number of outcomes:
128 UUID bits / log2(2) = 128 / 1 = 128 coin flips
128 UUID bits / log2(6) ≈ 128 / 2.585 ≈ 50 dice rolls
This duplication is called a UUID collision, and it is possible; however, the chance is extremely small and not worth worrying about.
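For completeness, a tiny sketch of the coin/dice conversion above done numerically (nothing beyond the arithmetic already shown):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Equivalent number of trials: 128 bits divided by the entropy per
       trial, log2(number of outcomes per trial). */
    printf("coin flips: %.1f\n", 128.0 / log2(2.0));  /* 128.0 */
    printf("dice rolls: %.1f\n", 128.0 / log2(6.0));  /* ~49.5 */
    return 0;
}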

Related

Compressing a sparse bit array

I have arrays of 1024 bytes (8192 bits) which are mostly zero.
Between 0.01% and 10% of bits will be set (random, no pattern).
How could these be compressed, given the lack of structure and the relatively small size?
(My first thought was to store the distances between set bits. I need 13 bits for each distance, but at worst case 10% occupancy this needs 13 * 816 / 8 = 1326 bytes, which is not an improvement.)
This is for ultra-low bandwidth comms, so every byte matters.
I've dealt deeply with a similar problem, but my sets are much bigger (30 million possible values with between 1 and 30 million elements in each set), so they both gain much more from compression and the compression metadata is insignificant compared to the size of the data. I have never gone down to squeezing things into units smaller than uint16_t, so the things I write below might not apply if you start chopping up 13 bit values into pieces. It feels like it should work, but caveat emptor.
What I've found works is to employ several strategies that depend on the particular data we have. The good news is that the count of elements in each set is a very good indicator of which compression strategy will work best for a particular set. So all the metadata you need is a count of elements in the set. In my data format the first and only metadata value (I'll be unspecific and just call it "value", you can squeeze things in bytes, 16 bit values or 13 bit values however you feel) is the count of elements in the set, the rest is just the encoding of the set elements.
The strategies are:
If very few elements are in the set, you can't do better than an array that says "1, 4711, 8140", so in this case the data is encoded as: [3, 1, 4711, 8140]
If almost all elements are in the set, you can just keep track of elements that aren't. For example [8190, 17, 42].
If around half of the elements are in the set you pretty much can't do much better than a bitmap, so you get [4000, {bitmap}], this is the only case where your data ends up being longer than strictly uncompressed.
If more than "a few" but many fewer than "around half" of the elements are set, I found another strategy. Split the bits of each possible value in half. Say we have 2^16 possible values (it's easier to describe; it should also work for 2^13): the values are divided into 256 ranges of 256 possible values each. We then have an array of 256 bytes, where each byte says how many values fall into its range (byte 0 tells us how many elements lie in [0,255], byte 1 covers [256,511], etc.). Immediately after follow arrays holding the values in each range mod 256. The trick is that while every element encoded as a plain array (strategy 1) costs 2 bytes, in this scheme each element costs only 1 byte plus a static 256 bytes for the counts. So as soon as the set has more than 256 elements, switching from strategy 1 to strategy 4 saves space.
Strategy 4 can be refined (probably meaningless if your data is random as you mention, but my data sometimes had more patterns, so it worked for me). Since we still need 8 bits for each element in the previous encoding, as soon as a sub-array grows past 32 elements (32 bytes, the size of a 256-bit bitmap), we can store it as a bitmap instead. That is also a good breakpoint for switching from strategy 4/5 to strategy 3: if all the sub-arrays would just be bitmaps, we should use strategy 3 outright (it's more complicated than that, but the breakpoints between strategies can be precomputed accurately enough that you'll pick the most efficient strategy nearly every time).
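A minimal sketch of strategy 4 as described above, under a few assumptions: 16-bit values, input sorted ascending, and a made-up function name encode_bucketed. Note the edge case flagged in the comment.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of strategy 4 for 16-bit values: 256 bucket counts
   followed by the low byte of every value, bucket by bucket. The input is
   assumed to be sorted. Caller provides at least 256 + n bytes in out.
   Note: a completely full bucket (256 values) would overflow its count
   byte; the description above glosses over that edge case. */
static size_t encode_bucketed(const uint16_t *values, size_t n, uint8_t *out)
{
    uint8_t *counts  = out;        /* counts[b] = number of values in bucket b */
    uint8_t *payload = out + 256;  /* low bytes of the values                  */

    for (int b = 0; b < 256; b++)
        counts[b] = 0;

    for (size_t i = 0; i < n; i++) {
        counts[values[i] >> 8]++;        /* high byte selects the bucket */
        payload[i] = values[i] & 0xFF;   /* low byte is all we store     */
    }
    return 256 + n;                      /* encoded size in bytes        */
}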
I have only vaguely tried saving deltas between the numbers in the set. Quick experiments showed that they weren't really much more efficient than the strategies above and had unpredictable degenerate cases; most importantly, the application I work with really likes not having to deserialise its data at all, just using it raw straight from disk (mmap).

fast poker hand ranking

I am working on a simulation of poker and now I have to rank hands effectively:
Every hand is a combination of 5 cards and is represented as an uint64_t.
Every bit from 0 (Ace of Spades), 1 (Ace of Hearts) to 51 (Two of Clubs) indicates if the corresponding card is part (bit == 1) or isn't part (bit == 0) of the hand. The bits from 52 to 63 are always set to zero and don't hold any information.
I already know how I theoretically could generate a table, so that every valid hand can be mapped to a rank (represented as uint16_t) between 1 (2,3,4,5,7, not all of the same suit) and 7462 (Royal Flush), and all the others to rank zero.
So a naive lookup table (with the integer value of the card as index) would have an enormous size of
2 bytes * 2^52 = 2^53 bytes ≈ 9.007 PB.
Most of this memory would be filled with zeros, because almost all uint64_t values from 0 to 2^52-1 are invalid hands and therefore have a rank of zero.
The valuable data occupies only
2 bytes * 52!/(47!*5!) = 5.198 MB.
What method can I use for the mapping so that I only have to store the ranks of the valid hands plus some overhead (max. 100 MB) and still don't have to do an expensive search?
It should be as fast as possible!
If you have any other ideas, you're welcome! ;)
You need only a table of 13^5 * 2 entries, with the extra factor of two saying whether all the cards are of the same suit. If for some reason 'hearts' outranks 'diamonds', you still need at most a table of size 13^6, where the last piece of information encodes '0 = no pattern, 1 = all spades, 2 = all hearts, etc.'.
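A hedged sketch of how such an index might be computed, assuming the five card ranks have already been extracted and sorted so that every ordering of the same hand lands on the same slot; rank_index is a made-up name, not code from the answer.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical index into a table of 2 * 13^5 uint16_t ranks (~1.5 MB).
   ranks[] holds the five card ranks (0-12), sorted ascending so that every
   ordering of the same hand produces the same index; flush says whether
   all five cards share a suit. */
static uint32_t rank_index(const uint8_t ranks[5], bool flush)
{
    uint32_t idx = 0;
    for (int i = 0; i < 5; i++)
        idx = idx * 13 + ranks[i];      /* base-13 digits */
    return idx + (flush ? 371293u : 0); /* 13^5 = 371293: second half of the table */
}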
A hash table is probably also a good and fast approach -- Creating a table from nCk(52,5) combinations doesn't take much time (compared to all possible hands). One would, however, need to store 65 bits of information for each entry to store both the key (52 bits) and the rank (13 bits).
To speed up evaluation of a hand, one first rules out illegal combinations from the mask:
if (popcount(mask) != 5) the hand is invalid; afterwards one can use enough bits from e.g. crc32(mask), which has instruction-level support on at least the i7 architecture.
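A sketch of that filter-then-hash step, with an assumed table size, a hypothetical rank_of table, and no collision handling; _mm_crc32_u64 needs SSE4.2 (e.g. compile with -msse4.2).

#include <stdint.h>
#include <nmmintrin.h>   /* _mm_crc32_u64, SSE4.2 */

#define TABLE_BITS 23                    /* assumed: 2^23-slot hash table */
#define TABLE_SIZE (1u << TABLE_BITS)

extern uint16_t rank_of[TABLE_SIZE];     /* hypothetical precomputed rank table */

static uint16_t lookup_rank(uint64_t mask)
{
    if (__builtin_popcountll(mask) != 5) /* not exactly five cards set (GCC/Clang builtin) */
        return 0;                        /* rank 0 = invalid hand */
    uint32_t h = (uint32_t)_mm_crc32_u64(0, mask) & (TABLE_SIZE - 1);
    /* Real code would need collision handling (open addressing, etc.);
       this sketch assumes a collision-free placement. */
    return rank_of[h];
}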
If I understand your scheme correctly, you only need to know that the Hamming weight of a particular hand is exactly 5 for it to be a valid hand. See Calculating Hamming Weight in O(1) for information on how to calculate the Hamming weight.
From there, it seems you could probably work out the rest on your own. Personally, I'd want to store the result in some persistent memory (if it's available on your platform of choice) so that subsequent runs are quicker since they don't need to generate the index table.
This is a good source
Cactus Kev's
For a hand you can take advantage of the fact that there are at most 4 suits and 13 ranks:
4 bits for the rank (0-12) and 2 bits for the suit,
so 6 bits * 5 cards is just 30 bits.
Call it 4 bytes.
There are only 2,598,960 distinct hands,
so the total size is a little under 10 MB.
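A sketch of that 30-bit packing, assuming cards arrive as separate rank (0-12) and suit (0-3) values and are pre-sorted; pack_hand is a made-up name.

#include <stdint.h>

/* Pack five cards into 30 bits: 6 bits per card = 4 bits rank + 2 bits suit.
   Cards are assumed pre-sorted so that equal hands pack to the same value. */
static uint32_t pack_hand(const uint8_t rank[5], const uint8_t suit[5])
{
    uint32_t packed = 0;
    for (int i = 0; i < 5; i++)
        packed = (packed << 6) | (uint32_t)((rank[i] << 2) | suit[i]);
    return packed;   /* fits comfortably in a 32-bit integer */
}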
A simple implementation that comes to mind would be to change your scheme to a 5-digit number in base 52. The resulting table to hold all of these values would still be larger than necessary, but very simple to implement and it would easily fit into RAM on modern computers.
edit: You could also cut down even more by only storing the rank of each card and an additional flag (e.g., lowest bit) to specify if all cards are of the same suit (i.e., flush is possible). This would then be in base 13 + one bit for the ranking representation. You would presumably then need to store the specific suits of the cards separately to reconstruct the exact hand for display and such.
I would represent your hand in a different way:
There are only 4 suits (2 bits) and only 13 card values (4 bits), for a total of 6 bits * 5 = 30, so we fit into a 32-bit int. We can also force this to always be sorted as per your ordering:
[suit 0][suit 1][suit 2][suit 3][suit 4][value 0][value 1][value 2][value 3][value 4]
Then I would use a separate hash for:
consecutive values (very small) [mask off the suits]
1 or more multiples (pair, 2 pair, full house) [mask off the suits]
suits that are all the same (very small) [mask off the values]
Then use the 3 hashes to calculate your rankings
At 5 MB you will likely have enough cache misses that a bit of math and three small lookups will be faster.

How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.
This was particularly interesting to me because it brought me back to my MySQL days, when I saw the "Cardinality" value (which until recently I always assumed was calculated, not estimated).
So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:
function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}
But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).
I've been reading this paper about how to count duplicates in a list in O(n) time and using minimal memory.
It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.
I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.
The main trick behind this algorithm is that if you are observing a stream of random integers and see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is around 2^(size of the prefix).
That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a "001", there is a higher chance that this stream has a cardinality of 8.
(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit of a binary number on most processors.)
Of course, if you observe just one integer, the chance that this value is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps the maximum length of a seen "00...1" prefix for each substream. Then it estimates the final value by taking the mean of the per-substream values.
That's the main idea of this algorithm. There are some missing details (the correction for low estimates, for example), but it's all well written in the paper. Sorry for the terrible English.
A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list. But in comparison to a straightforward way of doing it (having a set and adding elements to the set) it does this in an approximate way.
Before looking at how the HyperLogLog algorithm does this, one has to understand why you need it. The problem with the straightforward way is that it consumes O(distinct elements) of space. Why is there big-O notation here instead of just the number of distinct elements? Because elements can be of different sizes: one element could be 1, another "this is a big string". So if you have a huge list (or a huge stream of elements), it will take a lot of memory.
Probabilistic Counting
How can one get a reasonable estimate of the number of unique elements? Assume that you have a string of length m which consists of {0, 1} with equal probability. What is the probability that it starts with 0, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string starting with k zeros, you have approximately looked through 2^k elements. So this is a good starting point: having a list of elements that are evenly distributed between 0 and 2^k - 1, you can track the maximum length of the zero prefix in the binary representation, and this will give you a reasonable estimate.
The problem is that the assumption of having evenly distributed numbers from 0 to 2^k - 1 is too hard to achieve: the data we encounter is mostly not numbers, is almost never evenly distributed, and can lie between any values. But using a good hashing function you can assume that the output bits are evenly distributed, and most hashing functions have outputs between 0 and 2^k - 1 (SHA-1 gives you values between 0 and 2^160). So what we have achieved so far is that we can estimate the number of unique elements among values of at most k bits by storing only a single number of about log2(k) bits. The downside is that the estimate has a huge variance. A cool thing is that we have almost recreated the 1984 probabilistic counting paper (its estimate is a little bit smarter, but still we are close).
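A tiny sketch of that single-register estimator - deliberately the naive, high-variance version described above, not the corrected one from the 1984 paper.

#include <stdint.h>
#include <math.h>

static int max_zero_prefix = 0;   /* longest run of leading zero bits seen so far */

/* Feed the 32-bit hash of every element through this. */
static void observe(uint32_t hash)
{
    int zeros = hash ? __builtin_clz(hash) : 32;   /* GCC/Clang builtin */
    if (zeros > max_zero_prefix)
        max_zero_prefix = zeros;
}

/* Naive estimate: a longest prefix of k zeros suggests roughly 2^k distinct
   elements were seen. */
static double estimate(void)
{
    return ldexp(1.0, max_zero_prefix);   /* 2^max_zero_prefix */
}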
LogLog
Before moving further, we have to understand why our first estimate is not that great. The reason is that one random element with an unusually long prefix of zeros can spoil everything. One way to improve this is to use many hash functions, keep the max for each hash function, and average them at the end. This is an excellent idea which would improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).
They used one hash but divided it into two parts. One part is called a bucket (the total number of buckets is 2^x), and the other is treated basically the same as our hash. It was hard for me to grasp what was going on, so I will give an example. Assume you have two elements and your hash function, which gives values from 0 to 2^10, produced 2 values: 344 and 387. You decided to have 16 buckets. So you have:
0101 011000 → bucket 5 will store 1
0110 000011 → bucket 6 will store 4
By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using math skills they were able to quantify the error (which is 1.3/sqrt(number of buckets)).
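A minimal sketch of that bucket split with 32-bit hashes instead of the 10-bit toy values, following the example's convention of storing the count of leading zeros; B = 4 gives the 16 buckets used above.

#include <stdint.h>

#define B 4                    /* 2^B = 16 buckets, as in the example above */
#define M (1u << B)

static uint8_t registers[M];   /* per-bucket maximum leading-zero count */

/* Feed the 32-bit hash of every element through this. */
static void loglog_add(uint32_t hash)
{
    uint32_t bucket = hash >> (32 - B);   /* top B bits pick the bucket         */
    uint32_t rest   = hash << B;          /* remaining bits, shifted to the top */
    /* Count of leading zeros in the rest, matching the 344/387 example;
       real implementations usually store the position of the first 1-bit. */
    int rank = rest ? __builtin_clz(rest) : 32 - B;
    if (rank > registers[bucket])
        registers[bucket] = (uint8_t)rank;
}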
HyperLogLog
HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers found that if you remove the 30% biggest numbers from the buckets, you significantly improve the estimate. They also used a different algorithm for averaging the bucket values (a harmonic mean). The paper is math-heavy.
And I want to finish with a recent paper which shows an improved version of the HyperLogLog algorithm (I haven't had time to fully understand it yet, but maybe later I will improve this answer).
The intuition is that if your input is a large set of random numbers (e.g. hashed values), they should be distributed evenly over the range. Say the range is 10 bits, representing values up to 1024, and the minimum value you observe is 10. Then the cardinality can be estimated to be about 100 (since 10 × 100 ≈ 1024).
Read the paper for the real logic of course.
Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog

Use of datatype for 10-20 digit value - PostgreSQL

I'm currently developing an application that needs to store a 10 to 20 digit value in the database.
My question is: what datatype should I be using? This value is used as a primary key, and therefore the performance of the DB is important for my application. In Java I handle this value as a BigDecimal.
Quote from the manual:
numeric: up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
http://www.postgresql.org/docs/current/static/datatype-numeric.html
131072 digits should cover your needs as far as I can tell.
Edit:
To answer the question about efficiency:
The first and most important question is: what kind of data is stored in that column and how do you use it?
If it's a number then use numeric.
If it's not a number use a varchar.
Never, ever store (real) numbers in character columns!
If you need to sort by that column, you won't be satisfied with what you get if you use a character datatype (e.g. 2 will be sorted after 10).
Coming back to the efficiency question: I assume it is mostly space efficiency you are concerned about. You can calculate the space requirements for your values yourself.
The storage requirement for the numeric data type is documented as well:
The actual storage requirement is two bytes for each group of four decimal digits, plus five to eight bytes overhead
So for 20 digits this would be a maximum of 10 bytes plus the five to eight bytes overhead. So max. 18 bytes.
To store 20 digits in a varchar column you need 21 bytes.
So from a space "efficiency" point of view numeric is slightly better. But that should never influence your decision, because the choice of datatypes should be driven by the requirements of the column's content.
From a performance point of view I don't think there will be a big difference either.
Try BIGINT instead of NUMERIC. It should work.
http://www.postgresql.org/docs/current/static/datatype-numeric.html

hash function for src dest ip + port

So, I am looking at different hash functions to use for hashing a 4-tuple of IPs and ports to identify flows.
One I came across was
((size_t)(key.src.s_addr) * 59) ^
((size_t)(key.dst.s_addr)) ^
((size_t)(key.sport) << 16) ^
((size_t)(key.dport)) ^
((size_t)(key.proto));
Now, for the life of me, I cannot explain the prime used (59). Why not 31? And then why mess it up by multiplying the sport by a power of 2?
Is there a better hash function to be used for ip addresses ?
The prime number is used because when one value is multiplied by a prime number, it tends to have a higher probability of remaining unique when other similar operations are accumulated on top of it. The specific value 59 may have been chosen arbitrarily, or it may be intentional; it is hard to tell. It is possible that 59 tended to generate a better distribution of values based on the most likely inputs.
The shift by 16 is probably because ports are 16-bit values, limited to the range 0-65535. The function is moving the source port into the higher part of the bit field while leaving the destination port in the lower part. I think this is explained further in my next paragraph.
Another reason why the multiplication takes place (and this is true of the shift operation as well) is because it breaks down the associative nature of the hash function. Remember, XOR is associative so the IPs src=192.168.1.1 and dst=192.168.1.254 would hash to the same value as src=192.168.1.254 and dst=192.168.1.1 (swapped) if the multiplication were not there.
Personally I think you'd be better off reading the four IP bytes as an unsigned long which would give you a number roughly in the range 0 - 2^32-1. Then you figure out how many flows you want to have active at any one time and that would be your index table size.
Take 2000 for example. That means you want to map 2^32 numbers onto roughly 2^11 indices (to flow information). That won't work, because hashing almost never works when the table is filled to 100%, and even 90% can be difficult. Using an index table that you only fill to 50% (4000 indices) or even 25% (8000) is no big deal with today's memory sizes.
The exact size of the index table should be an odd number of locations, and preferably a prime number. This is because you'll most likely need some overflow handling for collisions (two or more IP numbers which, after hashing, point to the same location in the index table) - and you will get them. The overflow step should be another prime number, smaller than the index table size. All these prime numbers! What's with them anyway?
I'll illustrate with an example (in C):
idxsiz = prime(2000 * 2); // 50% loading
ovfjmp = prime(idxsiz/3);
...
Initially fill the index table of idxsiz positions with an UNUSED marker (-1). Have a DELETED marker (-2) ready.
Your ip number enters the system and you look for its flow record (may or may not exist):
stoppos = ip % idxsiz; /* modulo (division) just this once */
i = stoppos;
do
{
    if (index[i] == UNUSED) return NULL;
    if (index[i] != DELETED)
    {
        flowrecptr = &flow_record[index[i]];
        if (!flowrecptr->in_use) {/* hash table is broken */}
        if (flowrecptr->ip == ip) return flowrecptr;
    }
    i += ovfjmp;
    if (i >= idxsiz) i -= idxsiz;
}
while (i != stoppos);
return NULL;
The UNUSED serves as a marker that this index has never been used and that searching should stop. The DELETED serves as a marker that this index has been used but no longer. That means that searching must continue.
That was the get path. If you got NULL back from get, you need to do a put, which you begin by finding the first index position containing UNUSED or DELETED. Replace this value with the index of the first/next free row in the flow_record table. Mark that row as in_use. Put the original IP number into the ip member of the flow_record row.
This is a very simple - but very effective - way to construct a hashing mechanism. Practically every optimization added on top, in the form of special handling for when one step or another fails, will enhance the effectiveness of the hashing.
The use of prime numbers will ensure that - in the worst case where all index locations are occupied - the mechanism tests every single location. To illustrate what happens otherwise: suppose idxsiz were evenly divisible by ovfjmp - you wouldn't have much overflow handling to speak of. With 35 and 7, only locations 0, 7, 14, 21 and 28 get tested before the index wraps back to 0, where the while test stops the search.
----------------------OOPS!
I missed that you wanted the port number as well. Assuming IPv4, that means 6 bytes (4 bytes of address plus 2 bytes of port). Read this as an unsigned 64-bit integer and clear the top 16 bits / 2 bytes. Then you do the modulo calculation.
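A hedged sketch of that packing, with made-up helper names and IPv4 source address plus source port as the assumed key.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical: pack a 4-byte IPv4 address and a 2-byte port into the low
   48 bits of a 64-bit key (the top 16 bits stay zero), then reduce it with
   the modulo step described earlier. */
static uint64_t make_key(uint32_t ipv4, uint16_t port)
{
    return ((uint64_t)ipv4 << 16) | port;
}

static size_t key_to_index(uint64_t key, size_t idxsiz)
{
    return (size_t)(key % idxsiz);   /* modulo (division) just this once */
}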
Examine the output of the function for uniform distribution. If you don't like it, plug in some different primes until you get a distribution you like. Hashing can be a very dark art with no 'right' answer.
Brian Gideon pretty much sums it up; the multiplication and the shift are intended as a symmetry breaker. So this catches the hypothetical case of machine A telnetting to machine B and vice versa where they happen to choose the same ephemeral port number. Not very common, but not impossible. Much of the 5-tuple is pretty constant: the protocol comes from a very small domain, and so does one half of {address, portnum}.
Assuming a prime-sized hash table, the magic constant 59 could be replaced by any prime, IMO. The (port << 16) could also be replaced by another shift (as long as no bits fall off) or even by a (port * some_other_prime) term.
For a power-of-two sized hash table, all (minus one) members of the 5-tuple should be multiplied by a (different) prime. (In the old days, when division was expensive, that would have been an option.)
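A sketch of that last suggestion for a power-of-two table; the struct layout and the particular primes are arbitrary assumptions, not a recommendation.

#include <stdint.h>
#include <stddef.h>

struct flow_key {            /* hypothetical 5-tuple layout */
    uint32_t src, dst;       /* IPv4 addresses */
    uint16_t sport, dport;
    uint8_t  proto;
};

/* Each member (minus one) multiplied by a different small prime, then
   masked down to a power-of-two table size. */
static size_t hash_flow(const struct flow_key *k, size_t table_size_pow2)
{
    uint64_t h = (uint64_t)k->src   * 59
               ^ (uint64_t)k->dst   * 31
               ^ (uint64_t)k->sport * 43
               ^ (uint64_t)k->dport * 13
               ^ (uint64_t)k->proto;
    return (size_t)(h & (table_size_pow2 - 1));
}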

Resources