B+ tree node sizing - database

I'm planning on writing a simple key/value store with a file architecture similar to CouchDB, i.e. an append-only b+tree.
I've read everything I can find on B+trees and also everything I can find on CouchDB's internals, but I haven't had time to work my way through the source code (being in a very different language makes it a special project in its own right).
So I have a question about the sizing of B+tree nodes: given that key length is variable, is it better to keep the nodes the same length (in bytes), or is it better to give them the same number of keys/child-pointers regardless of how big they become?
I realise that in conventional databases the B+tree nodes are kept at a fixed length in bytes (e.g. 8K) because space in the data files is managed in fixed size pages. But in an append-only file scheme where the documents can be any length and the updated tree nodes are written after, there seems to be no advantage to having a fixed-size node.

The goal of a b-tree is to minimize the number of disk accesses. If the file system cluster size is 4k, then the ideal size for the nodes is 4k. Also, the nodes should be properly aligned. A misaligned node will cause two clusters to be read, reducing performance.
With a log based storage scheme, choosing a 4k node size is probably the worst choice unless gaps are created in the log to improve alignment. Otherwise, 99.98% of the time one node is stored on two clusters. With a 2k node size, the odds of this happening are just under 50%. However, there's a problem with a small node size: average height of the b-tree is increased and the time spent reading a disk cluster is not fully utilized.
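To make the alignment percentages above concrete, here is a minimal sketch of the arithmetic. It assumes that a node can start at any byte offset within a cluster with equal probability (which is roughly what happens when nodes are appended after variable-length documents). The function name and the default cluster size are just illustrative.

```python
def straddle_fraction(node_size: int, cluster_size: int = 4096) -> float:
    """Fraction of start offsets at which a node of node_size bytes
    touches more than one cluster (assumes uniformly random offsets)."""
    straddling = 0
    for offset in range(cluster_size):
        first_cluster = offset // cluster_size
        last_cluster = (offset + node_size - 1) // cluster_size
        if last_cluster > first_cluster:
            straddling += 1
    return straddling / cluster_size


if __name__ == "__main__":
    for size in (512, 1024, 2048, 4096):
        print(size, f"{straddle_fraction(size):.2%}")
```

With a 4096-byte cluster this gives 4095/4096 for a 4k node and 2047/4096 for a 2k node, which matches the "99.98%" and "just under 50%" figures.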
Larger node sizes reduce the height of the tree, but they too increase the number of disk accesses. Larger nodes also increase the overhead of maintaining the entries within the node. Imagine a b-tree where the node size is large enough to encapsulate the entire database. You have to embed a better data structure within the node itself, perhaps another b-tree?
I spent some time prototyping a b-tree implementation over an append-only log format and eventually rejected the concept altogether. To compensate for performance losses due to node/cluster misalignment, you need to have a very large cache. A more traditional storage approach can make better use of the RAM.
The final blow was when I evaluated the performance of randomly-ordered inserts. This kills performance of any disk-backed storage system, but log based formats suffer much more. A write of even the smallest entry forces several nodes to be written to the log, and the internal nodes are invalidated shortly after being written. As a result, the log rapidly fills up with garbage.
BerkeleyDB-JE (BDB-JE) is also log based, and I studied its performance characteristics too. It suffers the same problem that my prototype did -- rapid accumulation of garbage. BDB-JE has several "cleaner" threads which re-append surviving records to the log, but the random order is preserved. As a result, the new "clean" records have already created log files full of garbage. The overall performance of the system degrades to the point that the only thing running is the cleaner, and it's hogging all system resources.
Log based formats are very attractive because one can quickly implement a robust database. The Achilles heel is the cleaner, which is non-trivial. Caching strategies are also tricky to get right.

Related

Space efficient map/dictionary/database with URI/URL keys

I'm looking for a space-efficient key-value mapping/dictionary/database which satisfies certain properties:
Format: The keys will be represented by http(s) URIs. The values will be variable length binary data.
Size: There will be 1-100 billion unique keys (average length 60-70 bytes). Values will initially only be a few tens of bytes but might eventually grow to tens of kilobytes in size (perhaps even more if I decide to store multiple versions). The total size of the data will be measured in terabytes or petabytes.
Hardware: The data will have to be distributed across multiple machines. This distribution should ensure that all URIs from a particular domain end up on the same machine. Furthermore, data on a machine will have to be distributed between the RAM, SSD, and HDD according to how frequently it is accessed. Data will have to be shifted around as machines are added or removed from the cluster. Replication is not needed initially, but might be useful later.
Access patterns: I need both sequential and (somewhat) random access to the data. The sequential access will be from a low-priority batch process that continually scans through the data. Throughput is much more important than latency in this case. Ideally, the iteration will proceed lexicographically (i.e. dictionary order). The random accesses arise from accessing the URIs in an HTML page; I expect that most of these will point to URIs from the same domain as the page and hence will be located on the same machine, while others will be located on different machines. I anticipate needing at most 100,000 to 1,000,000 in-memory random accesses per second. The data is not static. Reads will occur one or two orders of magnitude more often than writes.
Initially, the data will be composed of 100 million to 1 billion urls with several tens of bytes of data per url. It will be hosted on a small number of cheap commodity servers with 10-20GBs of RAM and several TBs of hard drives. In this case, most of the space will be taken up storing the keys and indexing information. For this reason, and because I have a tight budget, I'm looking for something which will allow me to store this information in as little space as possible. In particular, I'm hoping to exploit the common prefixes shared by many of the URIs. In this way, I believe it might be possible to store the keys and index in less space than the total length of the URIs.
I've looked at several traditional data structures (e.g. hash-maps, self-balancing trees (e.g. red-black, AVL, B), tries). Only the tries (with some tricks) seem to have the potential for reducing the size of the index and keys (all the others store the keys in addition to the index). The most promising option I've thought of is to split URIs into several components (e.g. example.org/a/b/c?d=e&f=g becomes something like [example, org, a, b, c, d=e, f=g]). The various components would each index a child in subsequent levels of a tree-like structure, kind of like a filesystem. This seems profitable as a lot of URIs share the same domain and directory prefix.
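A minimal sketch of the component-splitting idea, to show the shape of the structure rather than a space-efficient implementation (nested Python dicts are anything but compact). The names `uri_components`, `UriTrie` and `count_nodes` are made up for illustration, and the scheme is simply dropped here, which is an assumption on my part.

```python
from urllib.parse import urlsplit

VALUE = object()  # sentinel key marking "a value is stored at this node"


def uri_components(uri: str) -> list[str]:
    """Split a URI into host labels, path segments and query parameters."""
    parts = urlsplit(uri if "://" in uri else "//" + uri)
    host = parts.netloc.split(".")
    path = [p for p in parts.path.split("/") if p]
    query = [q for q in parts.query.split("&") if q]
    return host + path + query


class UriTrie:
    """Trie keyed by URI components; shared prefixes are stored once."""

    def __init__(self):
        self.root = {}

    def put(self, uri: str, value: bytes) -> None:
        node = self.root
        for comp in uri_components(uri):
            node = node.setdefault(comp, {})
        node[VALUE] = value

    def get(self, uri: str):
        node = self.root
        for comp in uri_components(uri):
            node = node.get(comp)
            if node is None:
                return None
        return node.get(VALUE)


def count_nodes(node: dict) -> int:
    """Number of component nodes stored in the trie below this node."""
    return sum(1 + count_nodes(child)
               for key, child in node.items() if key is not VALUE)
```

Storing `example.org/a/b/c?d=e&f=g` and `example.org/a/b/d` creates 8 component nodes instead of the 12 components the two URIs contain, because `example`, `org`, `a` and `b` are shared.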
Unfortunately, I don't know much about the various database offerings. I understand that a lot of them use B-trees to index the data. As I understand it, the space required by the index and keys exceeds the total length of the URLs.
So, I would like to know if anyone can offer some guidance as to any data structures or databases that can exploit the redundancy in the URIs to save space. The other stuff is less important, but any help there would be appreciated too.
Thanks, and sorry for the verbosity ;)

Why do freenet keys have a maximum file (data block) size?

In Freenet, if a file is large, it is split into data blocks, and what is called a splitfile contains the keys to all these blocks. Why is this necessary?
The only possible explanation I could draw from it is that they want the possibility of a hash collision to be minimal.
NOTE: I've posted this in StackOverflow because I believe it is a programming problem of sorts
I'm not a freenet expert, but... splitting files into small chunks has several benefits:
The "burden" of large files is split across nodes. This means that even nodes that have only set a small amount of storage or bandwidth aside for freenet can help store and deliver parts of larger files.
Nodes in freenet are not guaranteed to be stable. If a node goes offline while storing or fetching a small chunk, little effort is lost and another node can be used instead without added protocol complexity.
Chunks can be stored and fetched in parallel allowing for very fast transfers despite a network of slow nodes.
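The general idea of splitting a file into blocks plus a manifest of block keys can be sketched as follows. This is not Freenet's actual format: the 32 KiB block size, the use of SHA-256 as the block key, and the function names are all assumptions for illustration.

```python
import hashlib

BLOCK_SIZE = 32 * 1024  # assumed block size, for illustration only


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Return (manifest, store): the ordered list of block keys and a
    key -> block mapping standing in for the nodes that hold the blocks."""
    store = {}
    manifest = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        key = hashlib.sha256(block).hexdigest()  # content-derived key (assumption)
        store[key] = block
        manifest.append(key)
    return manifest, store


def reassemble(manifest, store) -> bytes:
    """Fetch each block by key, in manifest order, and join them."""
    return b"".join(store[key] for key in manifest)
```

Each block can be stored on, and fetched from, a different node independently, and losing a node mid-transfer costs at most one block's worth of work.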

Sorted String Table (SSTable) or B+ Tree for a Database Index?

Using two databases to illustrate this example: CouchDB and Cassandra.
CouchDB
CouchDB uses a B+ Tree for document indexes (using a clever modification to work in their append-only environment) - more specifically, as documents are modified (insert/update/delete) they are appended to the running database file, and a full Leaf -> Node path from the B+ tree, covering all the nodes affected by the updated revision, is appended right after the document.
These piecemeal index revisions are inlined right alongside the modifications such that the full index is a union of the most recent index modifications appended at the end of the file along with additional pieces further back in the data file that are still relevant and haven't been modified yet.
Searching the B+ tree is O(logn).
Cassandra
Cassandra keeps record keys sorted, in-memory, in tables (let's think of them as arrays for this question) and writes them out as separate (sorted) sorted-string tables from time to time.
We can think of the collection of all of these tables as the "index" (from what I understand).
Cassandra is required to compact/combine these sorted-string tables from time to time, creating a more complete file representation of the index.
Searching a sorted array is O(logn).
Question
Assuming a similar level of complexity between either maintaining partial B+ tree chunks in CouchDB versus partial sorted-string indices in Cassandra and given that both provide O(logn) search time which one do you think would make a better representation of a database index and why?
I am specifically curious if there is an implementation detail about one over the other that makes it particularly attractive or if they are both a wash and you just pick whichever data structure you prefer to work with/makes more sense to the developer.
Thank you for the thoughts.
When comparing a BTree index to an SSTable index, you should consider the write complexity:
When writing randomly to a copy-on-write BTree, you will incur random reads (to do the copy of the leaf node and path). So while the writes may be sequential on disk, for datasets larger than RAM, these random reads will quickly become the bottleneck. For an SSTable-like index, no such read occurs on write - there will only be the sequential writes.
You should also consider that in the worst case, every update to a BTree could incur log_b N IOs - that is, you could end up writing 3 or 4 blocks for every key. If key size is much less than block size, this is extremely expensive. For an SSTable-like index, each write IO will contain as many fresh keys as it can, so the IO cost for each key is more like 1/B.
In practice, this makes SSTable-like indexes thousands of times faster (for random writes) than BTrees.
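A rough back-of-the-envelope version of that arithmetic, as a sketch only. It assumes a worst case where every B-tree update rewrites one block per level, ignores caching and SSTable merge costs entirely, and uses made-up numbers (a billion keys, 256 keys per block) purely as an example.

```python
import math


def btree_ios_per_key(n_keys: int, fanout: int) -> int:
    """Worst-case block writes per update: one per tree level, ~log_b N."""
    return math.ceil(math.log(n_keys, fanout))


def sstable_ios_per_key(keys_per_block: int) -> float:
    """Amortized block writes per key when each block is full of fresh keys."""
    return 1 / keys_per_block


if __name__ == "__main__":
    n, b = 10**9, 256  # hypothetical dataset size and keys per block
    btree = btree_ios_per_key(n, b)
    sstable = sstable_ios_per_key(b)
    print(f"B-tree:  {btree} IOs per key")
    print(f"SSTable: {sstable:.4f} IOs per key")
    print(f"ratio:   {btree / sstable:.0f}x")
```

With these example numbers the B-tree needs 4 block writes per key against 1/256 for the SSTable-like index, a ratio of about a thousand, before accounting for merges.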
When considering implementation details, we have found it a lot easier to implement SSTable-like indexes (almost) lock-free, whereas locking strategies for BTrees have become quite complicated.
You should also reconsider your read costs. You are correct that a BTree is O(log_b N) random IOs for random point reads, but an SSTable-like index is actually O(#sstables . log_b N). Without a decent merge scheme, #sstables is proportional to N. There are various tricks to get round this (using Bloom Filters, for instance), but these don't help with small, random range queries. This is what we found with Cassandra:
Cassandra under heavy write load
This is why Castle, our (GPL) storage engine, does merges slightly differently, and can achieve a lot better (O(log^2 N)) range queries performance with a slight trade off in write performance (O(log^2 N / B)). In practice we find it to be quicker than Cassandra's SSTable index for writes as well.
If you want to know more about this, I've given a talk about how it works:
podcast
slides
Some things that should also be mentioned about each approach:
B-trees
The read/write operations are supposed to be logarithmic O(logn). However, a single database write can lead to multiple writes in the storage system. For example, when a node is full, it has to be split and that means that there will be 2 writes for the 2 new nodes and 1 additional write for updating the parent node. You can see how that could increase if the parent node was also full.
Usually, B-trees are stored in such a way that each node has the size of a page. This creates a phenomenon called write amplification, where even if a single byte needs to be updated, a whole page is written.
Writes are usually random (not sequential), thus slower especially for magnetic disks.
SSTables
SSTables are usually used in the following approach. There is an in-memory structure, called memtable, as you described. Every once in a while, this structure is flushed to disk to an SSTable. As a result, all the writes go to the memtable, but the reads might not be in the current memtable, in which case they are searched in the persisted SSTables.
As a result, writes are O(logn). However, always bear in mind that they are done in memory, so they should be orders of magnitude faster than the logarithmic on-disk operations of B-trees. For the sake of completeness, we should mention that writes are also written to a write-ahead log for crash recovery. But, given that these are all sequential writes, they are expected to be much more efficient than the random writes of B-trees.
When served from memory (from the memtable), reads are expected to be much faster as well. But when there's a need to look in the older, disk-based SSTables, reads can potentially become considerably slower than with B-trees. There are several optimisations around that, such as the use of bloom filters to check whether an SSTable contains a value without performing disk reads.
As you mentioned, there's also a background process, called compaction, used to merge SSTables. This helps remove deleted values and prevent fragmentation, but it can cause significant write load, affecting the write throughput of the incoming operations.
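The memtable/flush/compaction cycle described above can be sketched in a few lines. This is a toy, in-memory illustration under simplifying assumptions: the "SSTables" are sorted Python lists rather than files, there is no write-ahead log, no tombstones for deletes and no bloom filters, and the class name `TinyLSM` is made up.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}            # most recent writes
        self.sstables = []            # sorted (key, value) lists, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as a new immutable sorted table."""
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest table first, so the most recent value wins.
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self) -> None:
        """Merge all tables into one, keeping only the newest value per key."""
        merged = {}
        for table in self.sstables:   # oldest first; newer values overwrite
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []
```

Note how a read may have to binary-search every table until the key is found, which is the O(#sstables . log N) read cost mentioned earlier, and how `compact` brings that back down to a single table.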
As should be evident, a comparison between these 2 approaches is much more complicated. In an extremely simplified attempt to provide a concrete comparison, I think we could say that:
SSTables provide a much better write throughput than B-trees. However, they are expected to have less stable behaviour, because of ongoing compactions. An example of this can be seen in this benchmark comparison.
B-trees are usually preferred for use-cases where transaction semantics are needed. This is because each key can be found only in a single place (in contrast to the SSTable approach, where it could exist in multiple SSTables with obsolete values in some of them) and also because one could represent a range of values as part of the tree. This means that it's easier to implement key-level and range-level locking mechanisms.
References
[1] A Performance Comparison of LevelDB and MySQL
[2] Designing Data-intensive Applications
I think fractal trees, as used by Tokutek, are a better index for a database. They offer real-world 20x to 80x improvements over b-trees.
There are excellent explanations of how fractal tree indices work here.
LSM-trees are a better structure than B-trees for a storage engine.
In a way, they convert random writes into append-only file (AOF) writes.
Here is an LSM-tree source:
https://github.com/shuttler/lsmtree

For millions of objects, is it better to store in an array or a database like redis if the objects are needed in realtime?

I am developing a simulation in which there can be millions of entities that can interact with each other. At the moment, all the entities are stored in a list. Would it be better to store the objects in a database like redis instead of a list?
Note: I assumed this was being implemented in Java (force of habit). My answer is not terribly useful if it is not Java.
Making lots of assumptions about your requirements, I'd consider Redis if:
You are running into unacceptable GC pauses as a result of your millions of objects OR
The entities you create can be reused across multiple simulation runs
Java apps with giant heaps and lots of long-lived objects can run into very long GC pauses, depending on work-load. i.e. the old gen fills up with all these millions of objects and they're never eligible for collection. Regardless, periodically a full collect will happen (unless you're a GC tuning master) and have to scan these millions of objects in the old gen. This can take many seconds each time it happens, and you're frozen during that time. If this is happening and you don't like it, you could off-load all these long-lived objects to Redis, and pay the serialize/deserialize cost of accessing them rather than the GC pauses.
On the other point about reusing entities: if you're loading up a big Redis db and then dropping all its data when the simulation ends, it feels a bit wasteful. If you can re-use entities across simulation runs you might save yourself a bunch of time by persisting them in Redis.
The best choice depends on a number of factors, including how you access data, whether it will fit in memory, and what the distribution of accesses looks like. As a broad generalization, keeping data in memory is always faster than on disk, and keeping it in-process is faster than keeping it elsewhere.
If your data fits in memory, is accessed in a manner that means you can use basic data structures like lists/arrays and hashtables efficiently, and all items are accessed roughly equally often, keeping your data in memory is probably the best option.
If your data fits in memory, but you need to access it in complex ways, you may be best choosing a datastore like redis that supports in-memory databases.
If your data doesn't fit in memory, or you have a very uneven access pattern such that evicting the least used data to disk might allow other things to be loaded, speeding up your task in general, a regular disk-based datastore may be a better choice.
A list is not necessarily the best data structure unless "interaction" is limited to the respective next or previous element. Random access (by index) is very slow on a list.
Lists rock at inserting at the front and end, at finding the next (or previous) element, and at inserting one in between. They totally blow for accessing element 164553 and then element 10657, being O(N) on random access. Thus "interact with each other" suggests that a list is a bad choice.
It very much depends on the access and allocation patterns, but a vector or deque will likely be much better suited than a list for your simulation.
Redis is based on a hash table, which has a (much!) better characteristic for random access, but it will most likely still be slower, because it has considerable overhead for you serializing the data, it going through a socket, redis unserializing and analyzing it, sending a reply, and you parsing that.

How do databases deal with data tables that cannot fit in memory?

Suppose you have a really large table, say a few billion unordered rows, and now you want to index it for fast lookups. Or maybe you are going to bulk load it and order it on the disk with a clustered index. Obviously, when you get to a quantity of data this size you have to stop assuming that you can do things like sorting in memory (well, not without going to virtual memory and taking a massive performance hit).
Can anyone give me some clues about how databases handle large quantities of data like this under the hood? I'm guessing there are algorithms that use some form of smart disk caching to handle all the data but I don't know where to start. References would be especially welcome. Maybe an advanced databases textbook?
Multiway Merge Sort is a keyword for sorting huge amounts of data that do not fit in memory.
As far as I know, most indexes use some form of B-trees, which do not need to have everything in memory. You can simply put the nodes of the tree in a file, and then jump to various positions in the file. This can also be used for sorting.
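A minimal sketch of a multiway (external) merge sort, assuming one text record per line and that only `chunk_size` records fit in memory at once. Sorted runs are spilled to temporary files and then merged with `heapq.merge`, which reads the runs lazily. The function name and parameters are illustrative.

```python
import heapq
import os
import tempfile


def _stripped(f):
    """Yield lines of an open text file without their trailing newline."""
    for line in f:
        yield line.rstrip("\n")


def external_sort(lines, chunk_size=1000):
    """Yield the given lines in sorted order using bounded memory."""
    run_paths = []
    chunk = []

    def spill():
        # Sort the in-memory chunk and write it out as one sorted run.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in chunk)
        run_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line.rstrip("\n"))
        if len(chunk) >= chunk_size:
            spill()
    if chunk:
        spill()

    files = [open(p, encoding="utf-8") for p in run_paths]
    try:
        # k-way merge: only one line per run is held in memory at a time.
        yield from heapq.merge(*(_stripped(f) for f in files))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

With a very large number of runs you would merge in several passes to limit the number of open files; this sketch does a single pass.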
Are you building a database engine?
Edit: I built a disc-based database system back in the mid-'90s.
Fixed size records are the easiest to work with because your file offset for locating a record can be easily calculated as a multiple of the record size. I also had some with variable record sizes.
My system needed to be optimized for reading. The data was actually stored on CD-ROM, so it was read-only. I created binary search tree files for each column I wanted to search on. I took an open source in-memory binary search tree implementation and converted it to do random access of a disc file. Sorted reads from each index file were easy and then reading each data record from the main data file according to the indexed order was also easy. I didn't need to do any in-memory sorting and the system was way faster than any of the available RDBMS systems that would run on a client machine at the time.
For fixed record size data, the index can just keep track of the record number. For variable length data records, the index just needs to store the offset within the file where the record starts and each record needs to begin with a structure that specifies its length.
You would have to partition your data set in some way. Spread out each partition on a separate server's RAM. If I had a billion 32-bit ints, that's 4 GB (32 gigabits) of RAM right there. And that's only your index.
For low cardinality data, such as Gender (which has only 2 values: Male, Female), you can represent each index entry in less than a byte. Oracle uses a bitmap index in such cases.
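Both addressing schemes can be sketched like this. The 16-byte record size and the 4-byte little-endian length prefix are arbitrary choices for the example, not what the original system used.

```python
import io
import struct

RECORD_SIZE = 16  # hypothetical fixed record size in bytes


def read_fixed(f, record_number: int) -> bytes:
    """Fixed-size records: the offset is just record_number * RECORD_SIZE."""
    f.seek(record_number * RECORD_SIZE)
    return f.read(RECORD_SIZE)


def append_variable(f, payload: bytes) -> int:
    """Append a length-prefixed record; return the offset to put in the index."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(struct.pack("<I", len(payload)))  # 4-byte length header
    f.write(payload)
    return offset


def read_variable(f, offset: int) -> bytes:
    """Read the length header at offset, then that many payload bytes."""
    f.seek(offset)
    (length,) = struct.unpack("<I", f.read(4))
    return f.read(length)
```

These work on any seekable binary file object, for example one opened with `open(path, "r+b")` or an `io.BytesIO` for testing.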
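As an illustration of the bitmap idea in general (not of how Oracle implements it), here is a sketch that keeps one bitmap per distinct value, with bit `i` set when row `i` has that value. Python integers are used as arbitrarily long bitmaps, and the function names are made up.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_matching(bitmap: int):
    """Decode a bitmap back into the list of matching row numbers."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


if __name__ == "__main__":
    gender = ["M", "F", "F", "M", "F"]
    active = [True, True, False, True, True]
    by_gender = build_bitmap_index(gender)
    by_active = build_bitmap_index(active)
    # Combining predicates is a single bitwise AND of two bitmaps.
    print(rows_matching(by_gender["F"] & by_active[True]))
```

Each row costs one bit per distinct value, so a two-valued column needs far less than a byte per row, and predicates on several such columns combine with cheap bitwise operations.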
Hmm... Interesting question.
I think that most widely used database management systems rely on the operating system's mechanisms for memory management, and when physical memory runs out, in-memory tables go to swap.
