Space-efficient map/dictionary/database with URI/URL keys

I'm looking for a space-efficient key-value mapping/dictionary/database which satisfies certain properties:
Format: The keys will be represented by http(s) URIs. The values will be variable length binary data.
Size: There will be 1-100 billion unique keys (average length 60-70 bytes). Values will initially only be a few tens of bytes but might eventually grow to tens of kilobytes in size (perhaps even more if I decide to store multiple versions). The total size of the data will be measured in terabytes or petabytes.
Hardware: The data will have to be distributed across multiple machines. This distribution should ensure that all URIs from a particular domain end up on the same machine. Furthermore, data on a machine will have to be distributed between the RAM, SSD, and HDD according to how frequently it is accessed. Data will have to be shifted around as machines are added or removed from the cluster. Replication is not needed initially, but might be useful later.
Access patterns: I need both sequential and (somewhat) random access to the data. The sequential access will be from a low-priority batch process that continually scans through the data. Throughput is much more important than latency in this case. Ideally, the iteration will proceed lexicographically (i.e. dictionary order). The random accesses arise from accessing the URIs in an HTML page; I expect that most of these will point to URIs from the same domain as the page and hence will be located on the same machine, while others will be located on different machines. I anticipate needing at most 100,000 to 1,000,000 in-memory random accesses per second. The data is not static. Reads will occur one or two orders of magnitude more often than writes.
Initially, the data will be composed of 100 million to 1 billion URLs with several tens of bytes of data per URL. It will be hosted on a small number of cheap commodity servers with 10-20 GB of RAM and several TB of hard drives. In this case, most of the space will be taken up storing the keys and indexing information. For this reason, and because I have a tight budget, I'm looking for something which will allow me to store this information in as little space as possible. In particular, I'm hoping to exploit the common prefixes shared by many of the URIs. In this way, I believe it might be possible to store the keys and index in less space than the total length of the URIs.
I've looked at several traditional data structures (e.g. hash-maps, self-balancing trees (e.g. red-black, AVL, B), tries). Only the tries (with some tricks) seem to have the potential for reducing the size of the index and keys (all the others store the keys in addition to the index). The most promising option I've thought of is to split URIs into several components (e.g. example.org/a/b/c?d=e&f=g becomes something like [example, org, a, b, c, d=e, f=g]). The various components would each index a child in subsequent levels of a tree-like structure, kind of like a filesystem. This seems profitable as a lot of URIs share the same domain and directory prefix.
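As a rough illustration (the reversed host ordering, class name, and scheme handling here are just assumptions, not a settled design), splitting a URI into such components might look something like this:

    import java.net.URI;
    import java.util.*;

    // Sketch: split a URI into components so common prefixes (host labels,
    // path segments, query parts) can be shared in a trie-like structure.
    public class UriComponents {
        public static List<String> split(String url) {
            URI uri = URI.create(url);
            List<String> parts = new ArrayList<>();
            // Reverse the host labels so sibling domains group under "org", "example", ...
            String[] host = uri.getHost().split("\\.");
            for (int i = host.length - 1; i >= 0; i--) parts.add(host[i]);
            for (String seg : uri.getPath().split("/")) {
                if (!seg.isEmpty()) parts.add(seg);
            }
            if (uri.getQuery() != null) {
                parts.addAll(Arrays.asList(uri.getQuery().split("&")));
            }
            return parts;
        }

        public static void main(String[] args) {
            // Prints [org, example, a, b, c, d=e, f=g]
            System.out.println(split("http://example.org/a/b/c?d=e&f=g"));
        }
    }

Each component would then index a child node in the next level of the tree, so the shared domain and directory prefixes are stored only once.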
Unfortunately, I don't know much about the various database offerings. I understand that a lot of them use B-trees to index the data. As I understand it, the space required by the index and keys exceeds the total length of the URLs.
So, I would like to know if anyone can offer some guidance as to any data structures or databases that can exploit the redundancy in the URIs to save space. The other stuff is less important, but any help there would be appreciated too.
Thanks, and sorry for the verbosity ;)

Related

Cassandra - What is the reasonable maximum number of tables?

I am new to Cassandra. As I understand it, the maximum number of tables that can be stored per keyspace is Integer.MAX_VALUE. However, what are the implications from a performance perspective (speed, storage, etc.) of such a big number of tables? Is there any recommendation regarding that?
While there are legitimate use cases for having lots of tables in Cassandra, they are rare. Your use case might be one of them, but make sure that it is. Without knowing more about the problem you're trying to solve, it's obviously hard to give guidance. Many tables will require more resources, obviously. How much? That depends on the settings, and the usage.
For example, if you have a thousand tables and write to all of them at the same time there will be contention for RAM since there will be memtables for each of them, and there is a certain overhead for each memtable (how much depends on which version of Cassandra, your settings, etc.).
However, if you have a thousand tables but don't write to all of them at the same time, there will be less contention. There's still a per table overhead, but there will be more RAM to keep the active table's memtables around.
The same goes for disk IO. If you read and write to a lot of different tables at the same time the disk is going to do much more random IO.
Just having lots of tables isn't a big problem, even though there is a limit to how many you can have – you can have as many as you want provided you have enough RAM to keep the structures that keep track of them. Having lots of tables and reading and writing to them all at the same time will be a problem, though. It will require more resources than doing the same number of reads and writes to fewer tables.
In my opinion, splitting the data into multiple tables, even thousands of them, can be beneficial.
Pros:
Suppose you want to scale in the future to 10+ nodes with an RF of 2; splitting into multiple tables helps keep the data evenly distributed across the nodes, which a single table may not achieve as well.
Another point is random IO, which will be high if you read from many tables at the same time, but I don't see why this differs from having just one table: you would still be seeking to a different partition key, so there is no difference in IO.
When compaction takes place, it will have less work to do if the tables are smaller: the values from the SSTables must be loaded into memory, merged, and saved back.
Cons:
Having multiple tables will result in having multiple memtables. I think the difference added by this to the RAM is insignificant.
Also, check out these links, they helped me A LOT:
http://manuel.kiessling.net/2016/07/11/how-cassandras-inner-workings-relate-to-performance/
https://www.infoq.com/presentations/Apache-Cassandra-Anti-Patterns
Please feel free to edit my post, I am kinda new to Big Data.

For millions of objects, is it better to store in an array or a database like redis if the objects are needed in realtime?

I am developing a simulation in which there can be millions of entities that can interact with each other. At the moment, all the entities are stored in a list. Would it be better to store the objects in a database like redis instead of a list?
Note: I assumed this was being implemented in Java (force of habit). My answer is not terribly useful if it is not Java.
Making lots of assumptions about your requirements, I'd consider Redis if:
You are running into unacceptable GC pauses as a result of your millions of objects OR
The entities you create can be reused across multiple simulation runs
Java apps with giant heaps and lots of long-lived objects can run into very long GC pauses, depending on workload: the old gen fills up with all these millions of objects and they're never eligible for collection. Regardless, periodically a full collection will happen (unless you're a GC tuning master) and will have to scan these millions of objects in the old gen. This can take many seconds each time it happens, and you're frozen during that time. If this is happening and you don't like it, you could off-load all these long-lived objects to Redis, and pay the serialize/deserialize cost of accessing them rather than the GC pauses.
On the other point about reusing entities: if you're loading up a big Redis db and then dropping all its data when the simulation ends, it feels a bit wasteful. If you can re-use entities across simulation runs you might save yourself a bunch of time by persisting them in Redis.
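If you go that route, a minimal sketch of the off-loading using the Jedis client (the key scheme and the use of plain Java serialization are assumptions, not a recommendation):

    import redis.clients.jedis.Jedis;
    import java.io.*;

    // Sketch: keep long-lived entities in Redis instead of on the Java heap.
    public class EntityStore implements Closeable {
        private final Jedis jedis = new Jedis("localhost", 6379);

        public void put(String id, Serializable entity) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(entity);                 // pay the serialization cost here...
            }
            jedis.set(("entity:" + id).getBytes(), bos.toByteArray());
        }

        public Object get(String id) throws IOException, ClassNotFoundException {
            byte[] bytes = jedis.get(("entity:" + id).getBytes());
            if (bytes == null) return null;
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return ois.readObject();                 // ...and the deserialization cost here
            }
        }

        @Override
        public void close() { jedis.close(); }
    }

Each put/get then pays the serialization and socket round-trip cost instead of contributing to heap size and GC pause times.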
The best choice depends on a number of factors, including how you access data, whether it will fit in memory, and what the distribution of accesses looks like. As a broad generalization, keeping data in memory is always faster than on disk, and keeping it in-process is faster than keeping it elsewhere.
If your data fits in memory, is accessed in a manner that means you can use basic data structures like lists/arrays and hashtables efficiently, and all items are accessed roughly equally often, keeping your data in memory is probably the best option.
If your data fits in memory, but you need to access it in complex ways, you may be best choosing a datastore like redis that supports in-memory databases.
If your data doesn't fit in memory, or you have a very uneven access pattern such that evicting the least used data to disk might allow other things to be loaded, speeding up your task in general, a regular disk-based datastore may be a better choice.
A list is not necessarily the best data structure unless "interaction" is limited to the respective next or previous element. Random access (by index) is very slow on a list.
Lists rock at inserting at the front and end, and at finding the next (or previous) element, or inserting one in between. They totally blow for accessing element 164553 and then element 10657, being O(N) on random access. Thus "interact with each other" suggests that a list is a bad choice.
It very much depends on the access and allocation patterns, but a vector or deque will likely be much better suited than a list for your simulation.
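In Java terms the same trade-off shows up with LinkedList versus ArrayList; a quick, purely illustrative sketch (the sizes and timings are arbitrary and will vary):

    import java.util.*;

    // Random access by index is O(1) on an ArrayList but O(N) on a LinkedList,
    // because the linked list must walk node-by-node to the requested position.
    public class RandomAccessDemo {
        public static void main(String[] args) {
            int n = 1_000_000;
            List<Integer> array = new ArrayList<>();
            List<Integer> linked = new LinkedList<>();
            for (int i = 0; i < n; i++) { array.add(i); linked.add(i); }

            Random rnd = new Random(42);
            long t0 = System.nanoTime();
            for (int i = 0; i < 1_000; i++) array.get(rnd.nextInt(n));
            long t1 = System.nanoTime();
            for (int i = 0; i < 1_000; i++) linked.get(rnd.nextInt(n));   // walks the chain each time
            long t2 = System.nanoTime();

            System.out.printf("ArrayList:  %d us%n", (t1 - t0) / 1_000);
            System.out.printf("LinkedList: %d us%n", (t2 - t1) / 1_000);
        }
    }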
Redis is based on a hash table, which has (much!) better characteristics for random access, but it will most likely still be slower, because it has considerable overhead: you serialize the data, it goes through a socket, Redis deserializes and analyzes it, sends a reply, and you parse that reply.

B+ tree node sizing

I'm planning on writing a simple key/value store with a file architecture similar to CouchDB, i.e. an append-only b+tree.
I've read everything I can find on B+trees and also everything I can find on CouchDB's internals, but I haven't had time to work my way through the source code (being in a very different language makes it a special project in its own right).
So I have a question about the sizing of the B+tree nodes, which is: given that key length is variable, is it better to keep the nodes the same length (in bytes), or is it better to give them the same number of keys/child-pointers regardless of how big they become?
I realise that in conventional databases the B+tree nodes are kept at a fixed length in bytes (e.g. 8K) because space in the data files is managed in fixed-size pages. But in an append-only file scheme where the documents can be any length and the updated tree nodes are written after them, there seems to be no advantage to having a fixed-size node.
The goal of a b-tree is to minimize the number of disk accesses. If the file system cluster size is 4k, then the ideal size for the nodes is 4k. Also, the nodes should be properly aligned. A misaligned node will cause two clusters to be read, reducing performance.
With a log based storage scheme, choosing a 4k node size is probably the worst choice unless gaps are created in the log to improve alignment. Otherwise, 99.98% of the time one node is stored on two clusters. With a 2k node size, the odds of this happening are just under 50%. However, there's a problem with a small node size: average height of the b-tree is increased and the time spent reading a disk cluster is not fully utilized.
Larger node sizes reduce the height of the tree, but they too increase the number of disk accesses. Larger nodes also increase the overhead of maintaining the entries within the node. Imagine a b-tree where the node size is large enough to encapsulate the entire database. You have to embed a better data structure within the node itself, perhaps another b-tree?
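To make the height/fanout trade-off concrete, here is a rough back-of-the-envelope sketch; the entry size and key count are assumptions, not figures from the post:

    // Estimate B+tree height for a given node size, assuming a fixed average
    // entry size (key + child pointer). Height ~= ceil(log_fanout(entries)).
    public class BTreeHeightEstimate {
        public static void main(String[] args) {
            long entries = 1_000_000_000L;   // assumed number of keys
            int entrySize = 32;              // assumed bytes per key + pointer
            for (int nodeSize : new int[]{2 * 1024, 4 * 1024, 64 * 1024}) {
                int fanout = nodeSize / entrySize;
                int height = (int) Math.ceil(Math.log(entries) / Math.log(fanout));
                System.out.printf("node %6d B -> fanout %5d -> height %d%n",
                        nodeSize, fanout, height);
            }
        }
    }

With these assumed numbers, doubling the node size from 2k to 4k barely changes the height, while a much larger node does shave off a level at the cost of reading (and rewriting) far more data per access.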
I spent some time prototyping a b-tree implementation over an append-only log format and eventually rejected the concept altogether. To compensate for performance losses due to node/cluster misalignment, you need to have a very large cache. A more traditional storage approach can make better use of the RAM.
The final blow was when I evaluated the performance of randomly-ordered inserts. This kills performance of any disk-backed storage system, but log based formats suffer much more. A write of even the smallest entry forces several nodes to be written to the log, and the internal nodes are invalidated shortly after being written. As a result, the log rapidly fills up with garbage.
BerkeleyDB-JE (BDB-JE) is also log based, and I studied its performance characteristics too. It suffers the same problem that my prototype did -- rapid accumulation of garbage. BDB-JE has several "cleaner" threads which re-append surviving records to the log, but the random order is preserved. As a result, the newly "cleaned" log files soon fill up with garbage again. The overall performance of the system degrades to the point that the only thing running is the cleaner, and it's hogging all system resources.
Log based formats are very attractive because one can quickly implement a robust database. The Achilles heel is the cleaner, which is non-trivial. Caching strategies are also tricky to get right.

Comment post scalability: Top n per user, 1 update, heavy read

Here's the situation. Multi-million user website. Each user's page has a message section. Anyone can visit a user's page, where they can leave a message or view the last 100 messages.
Messages are short pieces of text with some extra meta-data. Every message has to be stored permanently; the only thing that must be real-time quick is the message updates and reading (people use it as chat). A count of messages will be read very often to check for changes. Periodically, it's OK to archive off the old messages (those > 100), but they must remain accessible.
Currently all in one big DB table, and contention between people reading the messages lists and sending more updates is becoming an issue.
If you had to re-architect the system, what storage mechanism / caching would you use? what kind of computer science learning can be used here? (eg collections, list access etc)
Some general thoughts, not particular to any specific technology:
Partition the data by user ID. The idea is that you can uniformly divide the user space into distinct partitions of roughly the same size. You can use an appropriate hashing function to divide users across partitions. Ultimately, each partition belongs on a separate machine, but even on different tables/databases on the same machine this will eliminate some of the contention. Partitioning limits contention and opens the door to scaling "linearly" in the future, which helps with load distribution and scale-out too.
When picking a hashing function to partition the records, look for one that minimizes the number of records that will have to be moved should partitions be added/removed.
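One common way to get that property is consistent hashing; the sketch below is only illustrative, and the node names and virtual-node count are arbitrary assumptions:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.*;

    // Consistent hashing: nodes own points on a ring; a key belongs to the
    // first node clockwise from its hash. Adding/removing a node only moves
    // the keys adjacent to that node's points, not (almost) everything.
    public class ConsistentHash {
        private static final int VNODES = 128;   // virtual nodes per server, smooths distribution
        private final SortedMap<Long, String> ring = new TreeMap<>();

        public void addNode(String node) {
            for (int i = 0; i < VNODES; i++) ring.put(hash(node + "#" + i), node);
        }

        public void removeNode(String node) {
            for (int i = 0; i < VNODES; i++) ring.remove(hash(node + "#" + i));
        }

        /** Map a user ID to the first node clockwise on the ring. */
        public String nodeFor(String userId) {
            SortedMap<Long, String> tail = ring.tailMap(hash(userId));
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }

        private static long hash(String s) {
            try {
                byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
                return h;
            } catch (Exception e) { throw new RuntimeException(e); }
        }
    }

Compared to a naive hash(userId) % nodeCount scheme, adding or removing a node here only reassigns the keys nearest that node's ring positions.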
Like many other applications, we can assume that use of the service follows a power-law curve: a few of the user pages cause most of the traffic, followed by a long tail. A caching scheme can take advantage of that. The steeper the curve, the more effective caching will be. Given the short messages, if each page shows 100 messages, and each message is 100 bytes on average, you could fit about 100,000 top pages in 1 GB of RAM cache. Those cached pages could be written lazily to the database. Out of 10 million users, 100,000 is in the ballpark for making a difference.
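A minimal sketch of such a cache, assuming a plain LRU eviction policy over per-user pages (the capacity and the List<String> page representation are assumptions):

    import java.util.*;

    // LRU cache of the hottest user pages, keyed by user ID.
    public class PageCache extends LinkedHashMap<Long, List<String>> {
        private final int capacity;

        public PageCache(int capacity) {
            super(16, 0.75f, true);          // accessOrder = true -> LRU eviction order
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, List<String>> eldest) {
            return size() > capacity;        // evict the coldest page when over capacity
        }
    }

For example, new PageCache(100_000) would keep roughly the 100,000 hottest pages, in line with the 1 GB estimate above.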
Partition the web servers, possibly using the same hashing scheme. This lets you hold separate RAM caches without contention. The potential benefit is increasing the cache size as the number of users grows.
If appropriate for your environment, one approach for ensuring new messages are eventually written to the database is to place them in a persistent message queue, right after placing them in the RAM cache. The queue suffers no contention, and helps ensure messages are not lost upon machine failure.
One simple solution could be to denormalize your data, and store pre-calculated aggregates in a separate table, e.g. a MESSAGE_COUNTS table which has a column for the user ID and a column for their message count. When the main messages table is updated, then re-calculate the aggregate.
It's just shifting the bottleneck from one place to another, but it might move it somewhere that's less of a burden.
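A hedged sketch of what that might look like with plain JDBC; only MESSAGE_COUNTS comes from the suggestion above, the messages table and column names are assumptions:

    import java.sql.*;

    // On every insert, bump the pre-calculated count in the same transaction.
    public class MessageDao {
        public void addMessage(Connection conn, long userId, String text) throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO messages (user_id, body) VALUES (?, ?)");
                 PreparedStatement bump = conn.prepareStatement(
                     "UPDATE message_counts SET message_count = message_count + 1 WHERE user_id = ?")) {
                insert.setLong(1, userId);
                insert.setString(2, text);
                insert.executeUpdate();
                bump.setLong(1, userId);
                bump.executeUpdate();
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }

Readers polling for changes then hit the small counts table instead of scanning the big messages table.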

How do databases deal with data tables that cannot fit in memory?

Suppose you have a really large table, say a few billion unordered rows, and now you want to index it for fast lookups. Or maybe you are going to bulk load it and order it on the disk with a clustered index. Obviously, when you get to a quantity of data this size you have to stop assuming that you can do things like sorting in memory (well, not without going to virtual memory and taking a massive performance hit).
Can anyone give me some clues about how databases handle large quantities of data like this under the hood? I'm guessing there are algorithms that use some form of smart disk caching to handle all the data but I don't know where to start. References would be especially welcome. Maybe an advanced databases textbook?
Multiway merge sort is the keyword for sorting huge amounts of data that don't fit in memory.
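The heart of it is the merge phase: sort manageable chunks into runs on disk, then keep one cursor per sorted run and repeatedly emit the globally smallest element. A rough sketch of that merge step (the line-per-record file format is an assumption):

    import java.io.*;
    import java.util.*;

    // Merge any number of already-sorted runs into one sorted output using a
    // priority queue of (current line, run index) pairs.
    public class MultiwayMerge {
        public static void merge(List<BufferedReader> sortedRuns, Writer out) throws IOException {
            PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> a.getKey().compareTo(b.getKey()));
            for (int i = 0; i < sortedRuns.size(); i++) {
                String line = sortedRuns.get(i).readLine();
                if (line != null) heap.add(new AbstractMap.SimpleEntry<>(line, i));
            }
            while (!heap.isEmpty()) {
                Map.Entry<String, Integer> smallest = heap.poll();
                out.write(smallest.getKey());
                out.write('\n');
                String next = sortedRuns.get(smallest.getValue()).readLine();
                if (next != null) heap.add(new AbstractMap.SimpleEntry<>(next, smallest.getValue()));
            }
            out.flush();
        }
    }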
As far as I know most indexes use some form of B-trees, which do not need to have everything in memory. You can simply put the nodes of the tree in a file, and then jump to various positions in the file. This can also be used for sorting.
Are you building a database engine?
Edit: I built a disc based database system back in the mid '90's.
Fixed size records are the easiest to work with because your file offset for locating a record can be easily calculated as a multiple of the record size. I also had some with variable record sizes.
My system needed to be optimized for reading. The data was actually stored on CD-ROM, so it was read-only. I created binary search tree files for each column I wanted to search on. I took an open source in-memory binary search tree implementation and converted it to do random access of a disc file. Sorted reads from each index file were easy and then reading each data record from the main data file according to the indexed order was also easy. I didn't need to do any in-memory sorting and the system was way faster than any of the available RDBMS systems that would run on a client machine at the time.
For fixed-record-size data, the index can just keep track of the record number. For variable-length data records, the index just needs to store the offset within the file where the record starts, and each record needs to begin with a structure that specifies its length.
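For the fixed-size case, a minimal sketch of that offset arithmetic (the record size and file layout are assumptions):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // The offset of record i is simply i * RECORD_SIZE, so an index only needs
    // to remember record numbers rather than byte offsets.
    public class FixedRecordFile implements AutoCloseable {
        private static final int RECORD_SIZE = 128;
        private final RandomAccessFile file;

        public FixedRecordFile(String path) throws IOException {
            this.file = new RandomAccessFile(path, "r");
        }

        public byte[] read(long recordNumber) throws IOException {
            byte[] record = new byte[RECORD_SIZE];
            file.seek(recordNumber * RECORD_SIZE);   // jump straight to the record
            file.readFully(record);
            return record;
        }

        @Override
        public void close() throws IOException { file.close(); }
    }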
You would have to partition your data set in some way. Spread out each partition onto a separate server's RAM. If I had a billion 32-bit ints, that's 4 GB of RAM right there. And that's only your index.
For low-cardinality data, such as gender (which has only 2 values - Male, Female), you can represent each index entry in less than a byte. Oracle uses a bitmap index in such cases.
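A toy sketch of the idea behind a bitmap index, one bit set per row and value (this is not Oracle's actual implementation, and the column values are assumptions):

    import java.util.*;

    // One BitSet per distinct column value; bit i is set when row i has that value.
    public class BitmapIndex {
        private final Map<String, BitSet> bitmaps = new HashMap<>();

        public void add(int rowId, String value) {
            bitmaps.computeIfAbsent(value, v -> new BitSet()).set(rowId);
        }

        /** Rows whose column equals the given value. */
        public BitSet rowsWith(String value) {
            return bitmaps.getOrDefault(value, new BitSet());
        }

        public static void main(String[] args) {
            BitmapIndex gender = new BitmapIndex();
            gender.add(0, "Male");
            gender.add(1, "Female");
            gender.add(2, "Male");
            System.out.println(gender.rowsWith("Male"));   // prints {0, 2}
        }
    }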
Hmm... Interesting question.
I think that most widely used database management systems rely on the operating system's memory-management mechanisms, and when physical memory runs out, the in-memory tables get paged out to swap.
