Space complexity of a B-tree index in a database

A traditional B-tree implementation has O(n) space complexity [1].
So assume that in a database (regardless of implementation, just consider the general case) I have a table of 10GB of data and the index is currently 1GB. Can I assume that if the database grows to 100GB, my index size will be 10GB?

You cannot say anything "regardless of implementation."
If the index is a pure B-tree, then it theoretically should be linear in the number and size of keys being indexed, with some fudge factor for fill rates. However, it is unlikely to be a pure B-tree.

First, it might be a B+ tree or some other variant. A B+ tree would add a very small logarithmic term to the size computation; that increase is unlikely to be material.

More importantly, most implementations do not implement the theoretical B-tree operations that maintain the fill rate. For example, deletion might be implemented by merely leaving an open slot to be used by a later insert. Over a large number of operations, and with a bit of bad luck, the efficiency of the index representation can degrade, so the index might get larger. If your index on 10GB is tightly packed and your 100GB is after a year of operations, it might be larger than you expect.
In direct answer to your question: no, I do not think your assumption is safe, more because the index might change size over time than due to non-linearity in the underlying data structure.
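As a back-of-the-envelope illustration of that fill-rate effect, here is a minimal sketch with made-up numbers (the entry size and fill factors are assumptions, not any engine's real figures):

    # Hypothetical numbers only: a B-tree index is roughly linear in the
    # number of keys, divided by how full its pages are kept.
    def estimated_index_size(num_keys, entry_bytes, fill_factor):
        """Rough index size: total entry bytes divided by average page fill."""
        return num_keys * entry_bytes / fill_factor

    # 100M keys, 40-byte entries: tightly packed vs. degraded after churn.
    tight = estimated_index_size(100_000_000, 40, fill_factor=0.90)
    loose = estimated_index_size(100_000_000, 40, fill_factor=0.55)
    print(f"tightly packed: {tight / 2**30:.1f} GiB, after churn: {loose / 2**30:.1f} GiB")

Same data, same structure, yet the lower fill factor makes the index noticeably larger, which is why a linear extrapolation from today's index size can undershoot.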

Related

How to maintain B-tree invariants when elements vary in size?

For a B-tree of order m, every node except the root must contain m-1 to 2m-1 elements, where every element is at least a key and maybe also some additional data (e.g., a value). Yet each node must have some constant total size picked to give good performance on the underlying block device. So what happens if your elements are of variable size?
SQLite3 seems to have a scheme for tacking additional block-sized pieces onto its nodes, and MySQL lets you declare the size of your records (e.g., you can type your fields to be not just strings but strings under some size). What other solutions are there? And what do people think about when picking one over the other?
edit: And by the previous sentence, I mean, what do database developers think about when deciding to implement their B-trees one way over the other?
(I'm in a databases course right now, so I'm more interested in the theory and design angle than in details of particular systems.)
I know that SQL Server can have a key length of up to 900 bytes at a page size of 8192 bytes. If you actually have 900-byte keys, only 9 (or 8) rows will fit on an index's intermediate-level pages. This just means that the branching factor is lower than usual. This might violate the theoretical B-tree invariant, but that is just an academic concern which does not impede performance in a significant way. It does not change the asymptotic complexity of the algorithms involved.
In short: This is a purely academic concern.
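To make the arithmetic concrete, here is a rough sketch under simplifying assumptions (the 8-byte child pointer and the neglect of page headers are guesses, not SQL Server's actual layout):

    # Illustrative only: approximate branching factor of an intermediate
    # B-tree page, ignoring page headers and slot overhead. With 900-byte
    # keys and 8192-byte pages this lands at roughly 8-9 entries per page,
    # as noted above.
    import math

    def branching_factor(page_bytes, key_bytes, pointer_bytes=8):
        return page_bytes // (key_bytes + pointer_bytes)

    def tree_height(num_rows, fanout):
        return math.ceil(math.log(num_rows, fanout))

    fanout = branching_factor(8192, 900)
    print(fanout, tree_height(10_000_000, fanout))   # small fanout, but the height stays modest

Even with such a low fanout, the tree stays only a handful of levels deep for millions of rows, which is why the asymptotics are unaffected.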
I think this is quite a good question. Although RDBMS vendors all have slightly different implementations, the underlying theory is the same and I doubt anyone uses b-tree implementations as the determining factor in choosing a vendor.
As I understand it, the basic structure of each B-tree page contains keys and pointers. The pointers reference other pages containing more keys and pointers, with the final pointer referencing the associated data record.
How to handle variable-length keys is interesting. Perhaps others can shed some light on vendor-specific solutions.
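As a schematic illustration of that page structure, here is a minimal sketch (the field names and search logic are illustrative, not any vendor's format):

    # Interior pages hold keys and child-page pointers; leaf pages hold keys
    # and references to the data records.
    from dataclasses import dataclass, field

    @dataclass
    class Page:
        is_leaf: bool
        keys: list = field(default_factory=list)      # sorted keys
        children: list = field(default_factory=list)  # page numbers of child pages (interior only)
        records: list = field(default_factory=list)   # row pointers / record ids (leaf only)

    def search(pages, page_no, key):
        """Follow child pointers down to a leaf, then look for the key there."""
        page = pages[page_no]
        if page.is_leaf:
            return page.records[page.keys.index(key)] if key in page.keys else None
        # find the first separator key greater than the search key; descend into that child
        i = next((i for i, k in enumerate(page.keys) if key < k), len(page.keys))
        return search(pages, page.children[i], key)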

Sorted String Table (SSTable) or B+ Tree for a Database Index?

Using two databases to illustrate this example: CouchDB and Cassandra.
CouchDB
CouchDB uses a B+ tree for document indexes (using a clever modification to work in its append-only environment). More specifically, as documents are modified (insert/update/delete), they are appended to the running database file, together with a full leaf-to-root path from the B+ tree covering all the nodes affected by the updated revision, written right after the document.
These piecemeal index revisions are inlined right alongside the modifications, such that the full index is a union of the most recent index modifications appended at the end of the file along with additional pieces further back in the data file that are still relevant and haven't been modified yet.
Searching the B+ tree is O(log n).
Cassandra
Cassandra keeps record keys sorted, in memory, in tables (let's think of them as arrays for this question) and writes them out as separate sorted-string tables from time to time.
We can think of the collection of all of these tables as the "index" (from what I understand).
Cassandra is required to compact/combine these sorted-string tables from time to time, creating a more complete file representation of the index.
Searching a sorted array is O(log n).
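A minimal sketch of that read path under simplifying assumptions (plain Python lists and dicts, not Cassandra's actual memtable or SSTable format):

    # Keep an in-memory structure of recent writes, flush it to immutable
    # sorted "tables", and binary-search each table at read time.
    import bisect

    memtable = {}          # recent writes, key -> value
    sstables = []          # list of (sorted_keys, values) pairs, newest first

    def flush_memtable():
        global memtable
        keys = sorted(memtable)
        sstables.insert(0, (keys, [memtable[k] for k in keys]))
        memtable = {}

    def get(key):
        if key in memtable:
            return memtable[key]
        for keys, values in sstables:                 # newest table wins
            i = bisect.bisect_left(keys, key)         # O(log n) per table
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None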
Question
Assuming a similar level of complexity between maintaining partial B+ tree chunks in CouchDB versus partial sorted-string indices in Cassandra, and given that both provide O(log n) search time, which one do you think would make a better representation of a database index, and why?
I am specifically curious if there is an implementation detail about one over the other that makes it particularly attractive or if they are both a wash and you just pick whichever data structure you prefer to work with/makes more sense to the developer.
Thank you for the thoughts.
When comparing a BTree index to an SSTable index, you should consider the write complexity:
When writing randomly to a copy-on-write BTree, you will incur random reads (to do the copy of the leaf node and path). So while the writes may be sequential on disk, for datasets larger than RAM these random reads will quickly become the bottleneck. For an SSTable-like index, no such read occurs on write: there will only be the sequential writes.
You should also consider that in the worst case, every update to a BTree could incur log_b N IOs; that is, you could end up writing 3 or 4 blocks for every key. If the key size is much less than the block size, this is extremely expensive. For an SSTable-like index, each write IO will contain as many fresh keys as it can, so the IO cost for each key is more like 1/B.
In practice, this makes SSTable-like indexes thousands of times faster (for random writes) than BTrees.
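To put assumed numbers on that comparison (N and the per-block key count B below are made up, and caching is ignored):

    # Per-key write IO for a copy-on-write BTree (rewrite a root-to-leaf path
    # of blocks) versus an SSTable-style log (each block written carries many
    # fresh keys).
    import math

    N = 1_000_000_000        # keys in the index
    block_keys = 256         # keys per block, i.e. the branching factor B

    btree_ios_per_key = math.ceil(math.log(N, block_keys))   # blocks rewritten per update
    sstable_ios_per_key = 1 / block_keys                      # one block write covers B keys

    print(btree_ios_per_key, sstable_ios_per_key,
          btree_ios_per_key / sstable_ios_per_key)            # ratio on the order of a thousand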
When considering implementation details, we have found it a lot easier to implement SSTable-like indexes (almost) lock-free, whereas locking strategies for BTrees have become quite complicated.
You should also reconsider your read costs. You are correct that a BTree is O(log_b N) random IOs for random point reads, but an SSTable-like index is actually O(#sstables · log_b N). Without a decent merge scheme, #sstables is proportional to N. There are various tricks to get around this (using Bloom filters, for instance), but these don't help with small, random range queries. This is what we found with Cassandra:
Cassandra under heavy write load
This is why Castle, our (GPL) storage engine, does merges slightly differently, and can achieve much better range query performance (O(log^2 N)) with a slight trade-off in write performance (O(log^2 N / B)). In practice we find it to be quicker than Cassandra's SSTable index for writes as well.
If you want to know more about this, I've given a talk about how it works:
podcast
slides
Some things that should also be mentioned about each approach:
B-trees
The read/write operations are supposed to be logarithmic, O(log n). However, a single database write can lead to multiple writes in the storage system. For example, when a node is full, it has to be split, and that means that there will be 2 writes for the 2 new nodes and 1 additional write for updating the parent node. You can see how that could increase if the parent node was also full.
Usually, B-trees are stored in such a way that each node has the size of a page. This creates a phenomenon called write amplification, where even if a single byte needs to be updated, a whole page is written.
Writes are usually random (not sequential), thus slower especially for magnetic disks.
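A toy sketch of that split, with an assumed tiny page capacity and in-memory lists standing in for pages, just to show how one logical insert turns into several page writes:

    # Inserting into a full leaf writes two new half-full pages and also
    # rewrites the parent to add the new separator key.
    MAX_KEYS = 4   # assumed tiny page capacity, for illustration only

    def insert_into_leaf(parent, leaf, key):
        """Returns the number of page writes this single insert caused."""
        leaf.append(key)
        leaf.sort()
        if len(leaf) <= MAX_KEYS:
            return 1                          # one page write
        mid = len(leaf) // 2
        left, right = leaf[:mid], leaf[mid:]  # the two new half-full pages
        parent.append(right[0])               # new separator key; parent page rewritten
        parent.sort()
        # ...and if the parent was itself full, the split cascades further up.
        return 3                              # two new pages + the parent

    parent, leaf = [50], [10, 20, 30, 40]
    print(insert_into_leaf(parent, leaf, 25))   # 3 page writes for one insert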
SSTables
SSTables are usually used in the following way. There is an in-memory structure, called a memtable, as you described. Every once in a while, this structure is flushed to disk as an SSTable. As a result, all the writes go to the memtable, but the reads might not be served by the current memtable, in which case the key is looked up in the persisted SSTables.
As a result, writes are O(log n). However, always bear in mind that they are done in memory, so they should be orders of magnitude faster than the logarithmic operations on disk of B-trees. For the sake of completeness, we should mention that writes are also written to a write-ahead log for crash recovery. But, given that these are all sequential writes, they are expected to be much more efficient than the random writes of B-trees.
When served from memory (from the memtable), reads are expected to be much faster as well. But when there's a need to look in the older, disk-based SSTables, reads can potentially become quite a bit slower than with B-trees. There are several optimisations around that, such as the use of Bloom filters, to check whether an SSTable contains a value without performing disk reads.
As you mentioned, there's also a background process, called compaction, used to merge SSTables. This helps remove deleted values and prevent fragmentation, but it can cause significant write load, affecting the write throughput of the incoming operations.
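A simplified sketch of that write path and of compaction, under assumptions (a plain list as the WAL, a dict as the memtable, newest-wins merging); it is not modeled on any particular engine:

    import heapq

    TOMBSTONE = object()          # marker for deletes

    def write(wal, memtable, key, value, flush_limit=4):
        wal.append((key, value))                  # sequential append for crash recovery
        memtable[key] = value                     # in-memory (a dict here; real engines use a sorted structure)
        return len(memtable) >= flush_limit       # caller flushes to an SSTable when full

    def compact(newer, older):
        """Merge two sorted (key, value) lists; the newer table wins, tombstones drop out."""
        merged, seen = [], set()
        for key, value in heapq.merge(newer, older, key=lambda kv: kv[0]):
            if key in seen:
                continue                          # this key was already taken from the newer table
            seen.add(key)
            if value is not TOMBSTONE:
                merged.append((key, value))
        return merged

    newer = [("a", 1), ("b", TOMBSTONE)]
    older = [("a", 0), ("b", 2), ("c", 3)]
    print(compact(newer, older))                  # [('a', 1), ('c', 3)]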
As it becomes evident, a comparison between these 2 approaches is much more complicated. In an extremely simplified attempt to provide a concrete comparison, I think we could say that:
SSTables provide a much better write throughput than B-trees. However, they are expected to have less stable behaviour, because of ongoing compactions. An example of this can be seen in this benchmark comparison.
B-trees are usually preferred for use-cases where transaction semantics are needed. This is because each key can be found in only a single place (in contrast to the SSTable approach, where it could exist in multiple SSTables, with obsolete values in some of them) and also because one can represent a range of values as part of the tree. This means that it's easier to implement key-level and range-level locking.
References
[1] A Performance Comparison of LevelDB and MySQL
[2] Designing Data-intensive Applications
I think fractal trees, as used by Tokutek, are a better index for a database. They offer real-world 20x to 80x improvements over b-trees.
There are excellent explanations of how fractal tree indices work here.
LSM-trees are better than B-trees as a storage engine structure.
They convert random writes into append-only-file (AOF) style sequential writes, in a way.
Here is an LSM-tree implementation:
https://github.com/shuttler/lsmtree

Asymptotically Fast Associative Array with Low Memory Requirements

Ok, tries have been around for a while. A typical implementation should give you O(m) lookup, insert and delete operations independently of the size n of the data set, where m is the message length. However, this same implementation takes up 256 words per input byte, in the worst case.
Other data structures, notably hashing, give you expected O(m) lookup, insertion and deletion, with some implementations even providing constant time lookup. Nevertheless, in the worst case the routines either do not halt or take O(nm) time.
The question is, is there a data structure that provides O(m) lookup, insertion and deletion time while keeping a memory footprint comparable to hashing or search trees?
It might be appropriate to say that I am only interested in worst-case behaviour, both time- and space-wise.
Did you try Patricia (a.k.a. crit-bit or radix) tries? I think they solve the worst-case space issue.
There is a structure known as a suffix array. I can't remember the research in this area, but I think they've gotten pretty darn close to O(m) lookup time with this structure, and it is much more compact than your typical tree-based indexing methods.
Dan Gusfield's book is the Bible of string algorithms.
I don't think there is a reason to be worried about the worst case, for two reasons:
You'll never have more total active branches in the sum of all trie nodes than the total size of the stored data.
The only time the node size becomes an issue is if there is huge fan-out in the data you're sorting/storing. Mnemonics would be an example of that. If you're relying on the trie as a compression mechanism, then a hash table would do no better for you.
If you need to compress and you have few/no common subsequences, then you need to design a compression algorithm based on the specific shape of the data rather than based on generic assumptions about strings. For example, in the case of a fully/highly populated mnemonic data set, a data structure that tracked the "holes" in the data rather than the populated data might be more efficient.
That said, it might pay for you to avoid a fixed trie node size if you have moderate fan-out. You could make each node of the trie a hash table. Start with a small size and increase as elements are inserted. Worst-case insertion would be c * m, when every hash table had to be reorganized due to growth, where c is the number of possible characters / unique atomic elements.
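A small sketch of that idea, using Python dicts as the per-node hash tables (the "\0" end-of-key marker is an assumption that keys never contain that character):

    # A trie whose children live in small hash tables rather than fixed
    # 256-slot arrays. Lookup and insert still touch O(m) nodes for a key
    # of length m.
    def trie_insert(root, key, value):
        node = root
        for ch in key:
            node = node.setdefault(ch, {})    # per-node storage grows only as needed
        node["\0"] = value                    # "\0" marks end-of-key

    def trie_lookup(root, key):
        node = root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None
        return node.get("\0")

    root = {}
    trie_insert(root, "cat", 1)
    trie_insert(root, "car", 2)
    print(trie_lookup(root, "car"), trie_lookup(root, "cow"))   # 2 None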
In my experience there are three implementations that I think could meet your requirement:
HAT-Trie (combination between trie and hashtable)
JudyArray (compressed n-ary tree)
Double Array Tree
You can see the benchmark here. They are as fast as a hashtable, but with lower memory requirements and better worst-case behaviour.

Data Structure to store billions of integers

What is the best data structure to store millions/billions of records (assume a record contains a name and an integer) in memory (RAM)?
Best in terms of minimum search time (1st priority) and memory efficiency (2nd priority). Is it a Patricia tree? Is there anything better than that?
The search key is an integer (say a 32-bit random integer). And all records are in RAM (assuming that enough RAM is available).
In C, platform Linux..
Basically, my server program assigns a 32-bit random key to each user, and I want to store the corresponding user record so that I can search for/delete the record efficiently. It can be assumed that the data structure will be well populated.
Depends.
Do you want to search on name or on integer?
Are the names all about the same size?
Are all the integers 32 bits, or some big number thingy?
Are you sure it all fits into memory? If not then you're probably limited by disk I/O and memory (or disk usage) is no concern at all any more.
Does the index (name or integer) have common prefixes, or is it uniformly distributed? A Patricia tree is only useful if there are common prefixes.
Do you look up indexes in order (gang lookup), or randomly? If everything is uniform and random with no common prefixes, a hash is already as good as it gets (which is bad).
If the index is the integer where gang lookup is used, you might look into radix trees.
My educated guess is a B-tree (but I could be wrong...):
B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when most nodes are in secondary storage such as hard drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases, balancing occurs less often, and efficiency increases. Usually this value is set such that each node takes up a full disk block or an analogous size in secondary storage. While 2-3 B-trees might be useful in main memory, and are certainly easier to explain, if the node sizes are tuned to the size of a disk block, the result might be a 257-513 B-tree (where the sizes are related to larger powers of 2).
Instead of a hash you can at least use a radix tree to get started.
For any specific problem, you can do much better than a B-tree, a hash table, or a Patricia trie. Describe the problem a bit better, and we can suggest what might work.
If you just want retrieval by an integer key, then a simple hash table is fastest. If the integers are consecutive (or almost consecutive) and unique, then a simple array (of pointers to records) is even faster.
If using a hash table, you want to pre-allocate the hashtable at the expected final size so it doesn't have to rehash.
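A small sketch of the two options, with toy records assumed for illustration (note that Python dicts cannot be explicitly pre-sized, so the pre-allocation advice applies to hash tables whose capacity you control):

    records = [("alice", 1), ("bob", 2), ("carol", 3)]

    # Consecutive keys 0..n-1: index straight into an array of record pointers.
    by_slot = [None] * len(records)
    for key, rec in enumerate(records):
        by_slot[key] = rec
    print(by_slot[1])                      # ('bob', 2)

    # Arbitrary 32-bit keys: a hash table keyed by the integer.
    by_key = {0x1F2E3D4C: records[0], 0x0BADF00D: records[1]}
    print(by_key.get(0x0BADF00D))          # ('bob', 2)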
We can use a bitwise trie, where each node branches on a 1/0 bit, to store the integer values. With this we can ensure that the depth of the tree is 32/64, so fetch time is constant, with sub-linear space complexity.
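A minimal sketch of such a bitwise trie for 32-bit keys, using nested dicts as nodes (purely illustrative, not a memory-optimal layout):

    # Each node branches on one bit of a 32-bit key, so every lookup walks
    # exactly 32 nodes regardless of how many records are stored.
    BITS = 32

    def bit_trie_insert(root, key, record):
        node = root
        for i in range(BITS - 1, -1, -1):      # most significant bit first
            b = (key >> i) & 1
            node = node.setdefault(b, {})
        node["rec"] = record

    def bit_trie_lookup(root, key):
        node = root
        for i in range(BITS - 1, -1, -1):
            node = node.get((key >> i) & 1)
            if node is None:
                return None
        return node.get("rec")

    root = {}
    bit_trie_insert(root, 0xDEADBEEF, "alice")
    print(bit_trie_lookup(root, 0xDEADBEEF))   # alice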

Why DB indexes use balanced trees, not hashtables?

Hashtables seem to be preferable in terms of disk access. What is the real reason that indexes are usually implemented with a tree?
Sorry if it's infantile, but I did not find a straight answer on SO.
One of the common actions with data is to sort it or to search for data in a range. A tree will keep data in order, while a hash table is only useful for looking up a row and has no idea what the next row is.
So hash tables are no good for this common case (thanks to this answer for the examples):
SELECT * FROM MyTable WHERE Val BETWEEN 10000 AND 12000
or
SELECT * FROM MyTable ORDER BY x
Obviously there are cases where hash tables are better but best to deal with the main cases first.
Size: B-trees start small and perfectly formed and grow nicely to enormous sizes. Hashes have a fixed size, which can be too big (10,000 buckets for 1,000 entries) or too small (10,000 buckets for 1,000,000,000 entries) for the amount of data you have.
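A small sketch of why the ordered structure wins for that BETWEEN query (toy values assumed; bisect stands in for descending a B-tree to the start of the range):

    # Sorted keys admit a binary search to the start of the range and a short
    # scan; a hash table forces a check of every entry.
    import bisect

    sorted_vals = [9990, 10005, 10500, 11999, 12001, 15000]      # e.g. an ordered index on Val
    lo = bisect.bisect_left(sorted_vals, 10000)
    hi = bisect.bisect_right(sorted_vals, 12000)
    print(sorted_vals[lo:hi])              # [10005, 10500, 11999] -- BETWEEN 10000 AND 12000

    hashed = {v: None for v in sorted_vals}                       # a hash index on the same values
    print([v for v in hashed if 10000 <= v <= 12000])             # must touch every bucket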
Hash tables provide no benefit for this case:
SELECT * FROM MyTable WHERE Val BETWEEN 10000 AND 12000
One only has to look at MySQL's hash index implementation, associated with the MEMORY storage engine, to see its disadvantages:
They can be used with equality operators such as = but not with comparison operators such as <
The optimizer cannot use a hash index to speed up ORDER BY operations.
Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)
The optimizer cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use).
And note that the above applies to hash indexes implemented in memory, without the added consideration of disk access matters associated with indexes implemented on disk.
Disk access factors as noted by #silentbicycle would skew it in favour of the balanced-tree index even more.
Databases typically use B+ trees (a specific kind of tree), since they have better disk access properties - each node can be made the size of a filesystem block. Doing as few disk reads as possible has a greater impact on speed, since comparatively little time is spent on either chasing pointers in a tree or hashing.
Hashing is good when the data is not increasing; more technically, when N/n is constant,
where N = number of elements and n = number of hash slots.
If this is not the case, hashing doesn't give a good performance gain.
In a database the data would most probably be increasing at a significant pace, so using a hash there is not a good idea.
And yes, sorting is there too...
"In database most probably the data would be increasing a significant pace so using hash there is not a good idea."
That is an over-exaggeration of the problem. Yes hash spaces must be fixed in size (modulo solutions ala extensible hashing) and yes, their size must be managed, and yes, someone must do that job.
That said, the performance gains if you exploit hash-based physical location to its fullest potential, are enormous.

Resources