Are UUID4s distributed evenly across the md5 address space?

(I will preface this question by saying I think there is virtually no way that UUID4 would be designed not to be uniformly distributed, but I lack the math skills and crypto knowledge to prove it.)
While streaming a bunch of data to Kinesis, we are experiencing a problem where one shard, shard #4, is very hot and the other seven shards are underloaded. Kinesis distributes data across its shards by a partition key, which is a Unicode string that it converts to an MD5 hash.
Shards are sequential by default, so if you have one shard it will cover all partition keys from 0 to 2^128 - 1. We have eight shards, so the buckets are bounded at increments of 2^125. The (exclusive) end of each shard range is, in hex,
0x20000000000000000000000000000000
0x40000000000000000000000000000000
0x60000000000000000000000000000000
0x80000000000000000000000000000000
0xa0000000000000000000000000000000
0xc0000000000000000000000000000000
0xe0000000000000000000000000000000
0x100000000000000000000000000000000
We partition based on a UUID4. We had assumed that that would be evenly distributed across the above address space, but with this "hot shard" problem I'm starting to wonder. UUID4s are 128 bits, but six of those bits are reserved for fixed version and variant information, leaving 122 bits (2^122 possible values) that can be random. It's those six bits that give me pause.
Trivially, if those six bits were the six most significant bits and were fixed at zero, the largest possible value would be 2^122 - 1, which would always fall in the first bucket. But in reality those six bits aren't the most significant bits of the UUID4 space, so what effect do they have on the distribution? If I use a UUID4 as a sharding key, will my data be evenly distributed across the shards?
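For what it's worth, this is easy to check empirically. Below is a minimal sketch of my own (not from the original post) that feeds random UUID4 strings through MD5 and buckets the digests into the eight 2^125-wide ranges listed above; if UUID4 keys were skewing the hash, the counts would come out uneven.

import hashlib
import uuid
from collections import Counter

NUM_SHARDS = 8
SHARD_WIDTH = 2 ** 125  # each shard covers 2^125 hash values (2^128 / 8)

counts = Counter()
for _ in range(100_000):
    partition_key = str(uuid.uuid4())                      # the sharding key
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    counts[int(digest, 16) // SHARD_WIDTH] += 1            # which bucket it lands in

for shard in range(NUM_SHARDS):
    print(f"shard {shard}: {counts[shard]}")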

Related

Storing floats as small integers

I am working on a project in which we collect a lot of data from many sensors. These sensors, in many cases, return low-precision floats represented as 1- and 2-byte integers. These integers are mapped back to floats via some simple relation; for instance,
x_{float} = (x_{int} + 5) / 3
Each sensor will return around 70+ variables of this kind.
Currently, we expect to store a minimum of 10+ million entries per day, possibly even 100+ million entries per day. However, we only require 2 or 3 of these variables on a daily basis; the others will rarely be used (we need them for modeling purposes).
So, in order to save some space, I was considering storing these low-precision integers directly in the DB instead of the float values (with the exception of the 2-3 variables we read regularly, which will be stored as floats to avoid the constant overhead of mapping them back from ints). In theory, this should reduce the size of the database by almost half.
My question is: is this a good idea? Will it backfire when we have to map all the data back to train models?
Thanks in advance.
P.S. We are using Cassandra; I don't know if this is relevant to the question.
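For what it's worth, a rough sketch of what the int-to-float mapping looks like on read (illustration only; the relation is the example one above, and real sensors would each carry their own offset/scale pair):

def int_to_float(x_int, offset=5, scale=3):
    # Map the stored small integer back to the physical float value.
    return (x_int + offset) / scale

def float_to_int(x_float, offset=5, scale=3):
    # Inverse mapping, for re-encoding a value on write.
    return round(x_float * scale - offset)

raw_values = [123, 456, 789]                  # e.g. a batch read back from Cassandra
print([int_to_float(v) for v in raw_values])

The conversion is a cheap in-memory pass, so for model training it will almost certainly be dwarfed by the cost of reading the rows in the first place.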

Long Binary Array Compression

There is an Array of Binary Numbers and the number of elements in this array is around 10^20.
The number of "ones" in the array is around 10^10, and these numbers are randomly distributed.
Once the data is generated and saved, it won't be edited: it will remain in read-only mode in its whole life cycle.
Having this data saved, requests will be received. Each request contains the index of the array, and the response should be the value in that particular index. The indexes of these requests are not in order (they may be random).
Question is: how to encode this info saving space and at the same time have a good performance when serving requests?
My thoughts so far are:
To have an array of indexes for each of the "ones". So, I would have an array of 10^10 elements, containing indexes in the range: 0 - 10^20. Maybe not the best compression method, but, it is easy to decode.
The optimum for compression: enumerate all possible combinations (choose 10^10 positions out of the 10^20 available) and store just the "id" of this particular combination... but this could be a problem to decode, I think.
Look up "sparse array". If access speed is important, a good solution is a hash table of indices. You should allocate about 2x the space, requiring a 180 GB table. The access time would be O(1).
You could have just a 90 GB table and do a binary search for an index. The access time would be O(log n), if you're happy with that speed.
You can pack the indices more tightly (67 bits apiece, since 10^20 < 2^67), bringing the single-table approach down to less than 84 GB.
You can break it up into multiple tables. E.g. if you had eight tables, each representing the possible high three bits of the index, then each entry only needs the remaining 64 bits (8 bytes), so the tables would take 80 GB.
You can break it up further. E.g. if you have 2048 tables, each representing the high 11 bits of the index, each entry needs only 7 bytes, so the total would be 70 GB, plus some very small amount for the table of pointers to the sub-tables.
Even further, with 524288 tables, you can do six bytes per entry for 60 GB, plus the table of tables overhead. That would still be small in comparison, just megabytes.
Going up by another factor of 256 should still be a win. With about 134 million sub-tables, you could get it down to 50 GB, plus less than a GB for the table of tables, so less than 51 GB total. Then you could, for example, keep the table of tables in memory and load a sub-table into memory for each binary search. You could keep a cache of sub-tables in memory, throwing out old ones when you run out of space. Each sub-table would have, on average, only about 75 entries, so the binary search is around seven steps, after one step to find the sub-table. Most of the time will be spent getting the sub-tables into memory, assuming you don't have 64 GB of RAM. Then again, maybe you do.
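For concreteness, a minimal sketch of the sorted-indices-plus-binary-search variant (illustration only; at the real scale you would keep the packed, sorted indices in files and memory-map the sub-tables rather than build Python lists):

import bisect

def build_index(one_positions):
    # Sorted list of the indices whose bit is 1.
    return sorted(one_positions)

def lookup(sorted_ones, i):
    # Return the bit at index i via an O(log n) binary search over the "ones".
    pos = bisect.bisect_left(sorted_ones, i)
    return 1 if pos < len(sorted_ones) and sorted_ones[pos] == i else 0

ones = build_index([3, 17, 10**15])
print(lookup(ones, 17), lookup(ones, 18))   # prints: 1 0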

Space efficient map/dictionary/database with URI/URL keys

I'm looking for a space-efficient key-value mapping/dictionary/database which satisfies certain properties:
Format: The keys will be represented by http(s) URIs. The values will be variable length binary data.
Size: There will be 1-100 billion unique keys (average length 60-70 bytes). Values will initially only be a few tens of bytes but might eventually grow to tens of kilobytes in size (perhaps even more if I decide to store multiple versions). The total size of the data will be measured in terabytes or petabytes.
Hardware: The data will have to be distributed across multiple machines. This distribution should ensure that all URIs from a particular domain end up on the same machine. Furthermore, data on a machine will have to be distributed between the RAM, SSD, and HDD according to how frequently it is accessed. Data will have to be shifted around as machines are added or removed from the cluster. Replication is not needed initially, but might be useful later.
Access patterns: I need both sequential and (somewhat) random access to the data. The sequential access will be from a low-priority batch process that continually scans through the data; throughput is much more important than latency in this case. Ideally, the iteration will proceed lexicographically (i.e. dictionary order). The random accesses arise from accessing the URIs in an HTML page; I expect that most of these will point to URIs from the same domain as the page and hence will be located on the same machine, while others will be located on different machines. I anticipate needing at most 100,000 to 1,000,000 in-memory random accesses per second. The data is not static. Reads will occur one or two orders of magnitude more often than writes.
Initially, the data will be composed of 100 million to 1 billion URLs with several tens of bytes of data per URL. It will be hosted on a small number of cheap commodity servers with 10-20 GB of RAM and several TB of hard drive each. In this case, most of the space will be taken up storing the keys and indexing information. For this reason, and because I have a tight budget, I'm looking for something which will allow me to store this information in as little space as possible. In particular, I'm hoping to exploit the common prefixes shared by many of the URIs. In this way, I believe it might be possible to store the keys and index in less space than the total length of the URIs.
I've looked at several traditional data structures (e.g. hash-maps, self-balancing trees (e.g. red-black, AVL, B), tries). Only the tries (with some tricks) seem to have the potential for reducing the size of the index and keys (all the others store the keys in addition to the index). The most promising option I've thought of is to split URIs into several components (e.g. example.org/a/b/c?d=e&f=g becomes something like [example, org, a, b, c, d=e, f=g]). The various components would each index a child in subsequent levels of a tree-like structure, kind of like a filesystem. This seems profitable as a lot of URIs share the same domain and directory prefix.
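For concreteness, here is a minimal sketch of that component split feeding a nested-dict trie (my own illustration, using Python's urllib for the parsing; a real implementation would use a compact trie rather than dicts):

from urllib.parse import urlsplit

def uri_components(uri):
    # example.org/a/b/c?d=e&f=g -> ["example", "org", "a", "b", "c", "d=e", "f=g"]
    parts = urlsplit(uri)
    path = [p for p in parts.path.split("/") if p]
    query = parts.query.split("&") if parts.query else []
    return parts.netloc.split(".") + path + query

def insert(trie, uri, value):
    node = trie
    for comp in uri_components(uri):
        node = node.setdefault(comp, {})   # shared prefixes share nodes
    node[""] = value                       # empty-string key holds the payload

trie = {}
insert(trie, "http://example.org/a/b/c?d=e&f=g", b"payload")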
Unfortunately, I don't know much about the various database offerings. I understand that a lot of them use B-trees to index the data. As I understand it, the space required by the index and keys exceeds the total length of the URLs.
So, I would like to know if anyone can offer some guidance as to any data structures or databases that can exploit the redundancy in the URIs to save space. The other stuff is less important, but any help there would be appreciated too.
Thanks, and sorry for the verbosity ;)

How do I know how many partitions a DynamoDB table is spread over?

Amazon's DynamoDB is designed for guaranteed performance: a customer must provision throughput for each of its tables.
To achieve this performance, tables are transparently spread over multiple "servers", a.k.a. "partitions".
Amazon provides a "best practice" guide for dimensioning and optimizing throughput. In this guide, we are told that the provisioned throughput is evenly divided over the partitions. In other words, if requests are not evenly distributed over the partitions, only a fraction of the reserved (and paid-for) throughput will be available to the application.
In the worst case scenario, it will be:
worst_throughput = provisioned_and_paid_throughput / partitions
To estimate this "worst_throughput", I need to know the total number of partitions. Where can I find it, or how do I estimate it?
It says, "When storing data, Amazon DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based on the hash key element."
What you really want to know is the throughput of a single partition. It seems like you can test that by hammering a single key.
See this page: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
It has some simple calculations you can carry out based on the amount of read and write capacity you provision. Note that this only covers the initial partition count; as your usage of DynamoDB grows, these calculations will have less and less relevance.
A single partition can hold approximately 10 GB of data, and can support a maximum of 3,000 read capacity units or 1,000 write capacity units.
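Based on those per-partition limits, you can make a rough estimate yourself. The sketch below is just the back-of-the-envelope arithmetic implied by the numbers quoted above (the guide's exact formula may differ), and the example capacities are made up:

import math

def estimate_partitions(read_capacity_units, write_capacity_units, table_size_gb):
    # Lower-bound estimate from ~3,000 RCU, ~1,000 WCU and ~10 GB per partition.
    by_throughput = math.ceil(read_capacity_units / 3000 + write_capacity_units / 1000)
    by_size = math.ceil(table_size_gb / 10)
    return max(by_throughput, by_size)

# e.g. a 40 GB table provisioned with 7,500 RCU and 500 WCU
partitions = estimate_partitions(7500, 500, 40)
print(partitions)                                          # -> 4
print("worst-case share:", 7500 / partitions, "RCU")       # -> 1875.0 RCU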

Data Structure to store billions of integers

What is the best data structure to store millions/billions of records (assume a record contains a name and an integer) in memory (RAM)?
Best in terms of minimum search time (1st priority) and memory efficiency (2nd priority). Is it a Patricia tree? Is there anything better?
The search key is an integer (say, a 32-bit random integer), and all records are in RAM (assume enough RAM is available).
In C, on Linux.
Basically, my server program assigns a 32-bit random key to each user, and I want to store the corresponding user record so that I can search for and delete records efficiently. It can be assumed that the data structure will be well populated.
Depends.
Do you want to search on name or on integer?
Are the names all about the same size?
Are all the integers 32 bits, or some big number thingy?
Are you sure it all fits into memory? If not then you're probably limited by disk I/O and memory (or disk usage) is no concern at all any more.
Does the index (name or integer) have common prefixes, or is it uniformly distributed? A Patricia tree is only useful if there are common prefixes.
Do you look up indexes in order (gang lookup), or randomly? If everything is uniform, random and no common prefixes, a hash is already as good as it gets (which is bad).
If the index is the integer where gang lookup is used, you might look into radix trees.
My educated guess is a B-tree (but I could be wrong...):
B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when most nodes are in secondary storage such as hard drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases, balancing occurs less often, and efficiency increases. Usually this value is set such that each node takes up a full disk block or an analogous size in secondary storage. While 2-3 B-trees might be useful in main memory, and are certainly easier to explain, if the node sizes are tuned to the size of a disk block, the result might be a 257-513 B-tree (where the sizes are related to larger powers of 2).
Instead of a hash you can at least use a radix tree to get started.
For any specific problem, you can do much better than a B-tree, a hash table, or a Patricia trie. Describe the problem a bit better, and we can suggest what might work.
If you just want retrieval by an integer key, then a simple hash table is fastest. If the integers are consecutive (or almost consecutive) and unique, then a simple array (of pointers to records) is even faster.
If using a hash table, you want to pre-allocate the hash table for the expected final size so it doesn't have to rehash.
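A minimal sketch of those two cases (in Python for brevity, even though the asker is working in C):

# Case 1: keys are (nearly) consecutive and unique -> a plain array indexed by key.
records_by_slot = [None] * 1_000_000          # pre-sized, so index == key
records_by_slot[42] = {"name": "alice"}

# Case 2: keys are arbitrary random 32-bit integers -> a hash table keyed by int.
# In C you would size the table for the expected element count up front so it
# never has to rehash; Python's dict grows itself automatically.
records_by_key = {0xDEADBEEF: {"name": "bob"}}

print(records_by_slot[42], records_by_key[0xDEADBEEF])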
We can use a bitwise trie, where each node branches on a 0/1 bit of the key, to store the integer values. With this we can ensure that the depth of the tree is 32/64, so fetch time is constant, with sub-linear space complexity.
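A minimal sketch of that bitwise trie (illustration only; a dict-per-node version like this is far more memory-hungry than a packed C implementation would be):

def trie_insert(root, key, record, bits=32):
    # Branch on one bit of the key per level, most significant bit first.
    node = root
    for i in range(bits - 1, -1, -1):
        node = node.setdefault((key >> i) & 1, {})
    node["record"] = record

def trie_lookup(root, key, bits=32):
    # Every lookup walks exactly `bits` nodes, so fetch time is constant.
    node = root
    for i in range(bits - 1, -1, -1):
        node = node.get((key >> i) & 1)
        if node is None:
            return None
    return node.get("record")

root = {}
trie_insert(root, 0xCAFEBABE, {"name": "carol"})
print(trie_lookup(root, 0xCAFEBABE))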
