In database system textbooks, such as Database System Concepts, there is a module called the Buffer Pool / Buffer Manager / Pager / whatever. I haven't seen much detail about it, so I'm curious: how do you make it perform well under concurrency?
For example, say we have a trie index. If we do the paging inside the trie itself, without a buffer pool, multiple threads can easily load or evict leaf nodes concurrently: each thread just acquires shared locks on the nodes from the root down and an exclusive lock on the parent of the leaf it touches.
However, if you instead let the buffer pool handle the paging, then I suppose you need an exclusive lock on the buffer pool itself, and then only a single thread can load or evict pages at a time.
Actually, I have tried this in a database implementation. The old version has no buffer pool and manages paging inside the trie index; the new version has a buffer pool that does the job instead, with one big lock protecting the hash map that maps page IDs to pages. The single-threaded test is 40% faster, but with 10 concurrent threads it is 5x slower!
I guess lock-free data structures might help? But I also suspect those are hard to get right. So how do you design and implement the buffer pool? Thanks!
I solved this problem thanks to the discussion here (in Chinese, sorry). The solution is quite simple: just shard the buffer manager. Each page is delegated to a shard by hashing its page number. As long as the hash function gives a uniform distribution, the probability of multiple threads waiting on the same lock is low.
In my case, I divided the buffer manager into 128 shards and the hash function is just page_no % 128. With 10 threads, the results of a simple benchmark look quite amazing:
with Sharded Buffer Manager: 7.73s
with Buffer Manager: 123s
without Buffer Manager, i.e. the trie does the paging itself: 19.7s
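For reference, here is a minimal sketch of the sharding idea, assuming a simple hash-map-based buffer manager with one mutex per shard (names and types are illustrative, not taken from my actual code):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

// Illustrative page object; the real thing would hold the frame, pin count, dirty flag, etc.
struct Page {};

class ShardedBufferManager {
    static constexpr std::size_t kShards = 128;

    struct Shard {
        std::mutex mutex;  // protects this shard's map only
        std::unordered_map<std::uint64_t, std::shared_ptr<Page>> pages;  // page_no -> cached page
    };
    std::array<Shard, kShards> shards_;

    Shard& shard_for(std::uint64_t page_no) { return shards_[page_no % kShards]; }

public:
    std::shared_ptr<Page> fetch(std::uint64_t page_no) {
        Shard& s = shard_for(page_no);
        std::lock_guard<std::mutex> guard(s.mutex);  // threads only contend within one shard
        auto it = s.pages.find(page_no);
        if (it != s.pages.end()) return it->second;
        auto page = std::make_shared<Page>();  // placeholder: read the page from disk here
        s.pages.emplace(page_no, page);
        return page;
    }
};
```

Eviction works the same way: each shard runs its own replacement policy over its own map, so threads only block each other when their pages happen to hash to the same shard.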
BTW, MySQL seems to also take this approach (correct me if I misunderstood it): https://dev.mysql.com/doc/refman/5.7/en/innodb-multiple-buffer-pools.html
My current problem is tracking the last 250 actions of each user of an app. These need to be stored in some kind of database, with reasonably fast reads and fairly fast writes.
Until now, I have simply saved every action of every user to a database, but its size is completely blowing up. Anything older than the latest 250 messages (per user) can be discarded, but SQL-based libraries do not provide a reasonable way to do this.
Ideally, the database system would use circular buffers of non-fixed length. Pre-allocating all the space required is not an option (over 90% of users only ever perform fewer than 100 actions, so pre-allocated space would be infeasible for memory reasons). Additionally, entry sizes need to be dynamic, since allocating the maximum message length for every message would leave a lot of empty space.
I was considering writing such a database system myself, using small (256-byte), equally-sized, linked chunks of data and keeping track of the empty ones. Obviously I would prefer a ready-made solution, since writing my own database system is time-consuming and error-prone.
Is there any system that more or less does what I intend? If not, what is the best approach to writing such a database system myself?
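To make the chunk idea concrete, this is roughly the layout I have in mind (purely illustrative: made-up names, no persistence, no concurrency):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size chunks linked into per-user chains; freed chunks go onto a free list.
struct Chunk {
    static constexpr std::size_t kPayload = 256 - sizeof(std::int32_t) - sizeof(std::uint16_t);
    std::int32_t  next = -1;        // index of the next chunk in this user's chain, -1 = end
    std::uint16_t used = 0;         // bytes of payload actually used
    char          payload[kPayload];
};

class ChunkStore {
    std::vector<Chunk>        chunks_;  // in reality this would live in a file
    std::vector<std::int32_t> free_;    // indices of recycled chunks

public:
    std::int32_t allocate() {
        if (!free_.empty()) {
            std::int32_t i = free_.back();
            free_.pop_back();
            return i;
        }
        chunks_.emplace_back();
        return static_cast<std::int32_t>(chunks_.size()) - 1;
    }
    void release(std::int32_t i) { chunks_[i] = Chunk{}; free_.push_back(i); }
    Chunk& at(std::int32_t i) { return chunks_[i]; }
};
```

Each user's history would be a chain of chunk indices, and dropping actions past the 250th would just mean returning chunks to the free list.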
I have a large index of 50 million docs, all on the same machine (no sharding).
I don't have an ID that would let me update just the affected docs, so for each update I must delete the whole index, re-index everything from scratch, and only commit at the end when indexing is done.
My problem is that every few indexing runs, Solr crashes with an out-of-memory exception. I am running with 12.5 GB of memory.
From what I understand, everything is kept in memory until the commit, so I'm holding 100M docs in memory instead of 50M. Am I right?
But I cannot commit while I'm indexing, because I deleted all docs at the beginning, so I would be serving a partial index, which is bad.
Are there any known solutions for this? Can sharding solve it, or will I still have the same problem?
Is there a flag that lets me do soft commits without changing the visible index until the hard commit?
You can use master-slave replication. Dedicate one machine to do your indexing (the master Solr), and when it has finished, tell the slave to replicate the index from the master. The slave downloads the new index and only deletes the old one if the download succeeds, so it's quite safe.
http://wiki.apache.org/solr/SolrReplication
Another solution that avoids the replication set-up is to use a reverse proxy: put nginx or something similar in front of your Solr. Use one machine for indexing the new data and the other for searching, and have the reverse proxy always point at the one not currently doing any indexing.
With either of these set-ups, you can commit as often as you want.
And because it's generally a bad idea to do indexing and searching on the same machine, I would prefer the master-slave solution (not to mention that you have 50M docs).
The out-of-memory error can be addressed by giving more memory to your container's JVM; it has nothing to do with your cache.
Use better garbage-collection options, because the source of the error is the JVM's memory filling up.
Increase the number of threads, because if the thread limit for a process is reached, a new process is spawned (with the same number of threads and the same memory allocation as the previous one).
Please also describe any CPU spikes and any other caching mechanisms you are using.
One thing you can try is setting all autowarm counts to 0; it should speed up commit time.
I am working on a project that runs queries on a database where the results are larger than the available memory. I have heard of memory-pool libraries, but I'm not sure they are the best solution to this problem.
Do memory-pool libraries support writing to and reading back from disk (for a query result that needs to be parsed many times)? Are there other ways to achieve this?
P.S. I am using a MySQL database and access it through its C API.
EDIT: here's an example:
Suppose I have five tables, each with a million rows. I want to find how similar one table is to another, so I create a bloom filter for each table and then check each filter against the data in the other four tables.
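For illustration, the kind of filter I have in mind is just a bit array with k hash probes (the sizes and hash choices below are placeholders, not tuned):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal bloom filter: k positions derived from two std::hash values (double hashing),
// set on insert, all required to be set on lookup.
class BloomFilter {
    std::vector<bool> bits_;
    std::size_t k_;

    std::size_t position(const std::string& key, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = std::hash<std::string>{}(key + "#salt");  // crude second hash
        return (h1 + i * h2) % bits_.size();
    }

public:
    BloomFilter(std::size_t bits, std::size_t k) : bits_(bits, false), k_(k) {}

    void insert(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) bits_[position(key, i)] = true;
    }
    // May return false positives, never false negatives.
    bool may_contain(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[position(key, i)]) return false;
        return true;
    }
};
```

I would build one filter per table from that table's keys, then stream the other tables' rows through may_contain() to estimate the overlap, accepting some false positives.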
Extending your logical memory beyond the physical memory by using secondary storage (e.g. disks) is usually called swapping, not memory pooling. Your operating system already does it for you, and you should try letting it do its job first.
Memory pool libraries provide more speed and real-time predictability to memory allocation by using fixed-size allocation, but don't increase your actual memory.
You should restructure your program so it doesn't use so much memory. Instead of pulling the whole DB (or a large part of it) into memory, use a cursor and incrementally update the data structure your program is maintaining, or incrementally compute the metric you are querying.
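With the MySQL C API that you mention, this means fetching with mysql_use_result (which streams rows from the server as you read them) rather than mysql_store_result (which buffers the entire result set client-side). A minimal sketch, with placeholder connection details and error handling omitted:

```cpp
#include <mysql/mysql.h>  // header path may differ on your platform
#include <cstdio>

int main() {
    MYSQL* conn = mysql_init(nullptr);
    // Placeholder connection parameters.
    if (!mysql_real_connect(conn, "localhost", "user", "password", "mydb", 0, nullptr, 0))
        return 1;

    mysql_query(conn, "SELECT id, payload FROM big_table");

    // mysql_use_result streams rows one at a time instead of buffering them all.
    MYSQL_RES* res = mysql_use_result(conn);
    while (MYSQL_ROW row = mysql_fetch_row(res)) {
        // Update the running data structure (e.g. a bloom filter) here, then let the
        // row go; nothing accumulates in client memory.
        std::printf("%s\n", row[0] ? row[0] : "NULL");
    }
    mysql_free_result(res);
    mysql_close(conn);
    return 0;
}
```

The trade-off is that you have to read the whole result before issuing another query on that connection.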
EDIT: you added that you might want to run a bloom filter on the tables?
Have a look at incremental bloom filters: here
How about using Physical Address Extension (PAE)?
I've put together a simple key-value store that speaks a subset of the Redis protocol. It uses pthreads on Linux to share the hash table, and pthread rwlocks to manage access to it. I've been testing the K-V store using the Redis benchmark tool.
With a single client, I can do about 2500 SET operations a second, but only about 25 GETs per second; I'd expect it to be the other way around, so this surprises me. It scales to some extent: with 10 clients I get nearly 9000 SETs per second and around 250 GETs per second.
My GET code is pretty simple: I lock the table, find the appropriate hash table slot, and check for a matching key in the linked list there. For a GET, I use pthread_rwlock_rdlock and then pthread_rwlock_unlock when I'm done; for a SET, I use pthread_rwlock_wrlock and pthread_rwlock_unlock. SET is quite a bit more complex than GET.
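The GET path looks roughly like this (simplified, with made-up names; the real code also parses the protocol and copies the value out):

```cpp
#include <pthread.h>
#include <cstring>

struct Entry { char *key; char *value; Entry *next; };

struct Table {
    pthread_rwlock_t lock;    // one rwlock guarding the whole table
    Entry *buckets[4096];     // chained hash table
};

// GET: shared (read) lock, walk the bucket's chain, return the matching value.
char *kv_get(Table *t, const char *key, unsigned hash) {
    char *result = nullptr;
    pthread_rwlock_rdlock(&t->lock);
    for (Entry *e = t->buckets[hash % 4096]; e != nullptr; e = e->next) {
        if (std::strcmp(e->key, key) == 0) { result = e->value; break; }
    }
    pthread_rwlock_unlock(&t->lock);
    return result;
}
```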
I also made the code work on Plan 9, using shared-memory processes and its own implementation of read/write locks. There, GETs are almost as fast as SETs, instead of 100x slower. This makes me think my hash table code is probably fine; I use exactly the same hash table code on both OSes and simply use #defines to select the appropriate lock for each (the interface is the same in both cases, luckily!).
I'm not very experienced with pthreads. Can anyone help me figure out why the performance is so bad?
(Note: this isn't meant to be a high-performing K-V store, it's meant to be a naively written test application/benchmark. It handles requests in about the simplest method possible, by spinning off a new thread for every client)
I don't know rwlocks, but my experience with pthread condition variables was that with a realtime kernel the waiting threads were woken up faster. You can also tune the program's priority with the chrt command.
I'm planning on writing a simple key/value store with a file architecture similar to CouchDB, i.e. an append-only b+tree.
I've read everything I can find on B+trees and also everything I can find on CouchDB's internals, but I haven't had time to work my way through the source code (being in a very different language makes it a special project in its own right).
So I have a question about the sizing of the B+tree nodes: given that key length is variable, is it better to keep the nodes the same size in bytes, or to give them the same number of keys/child pointers regardless of how big they become?
I realise that in conventional databases the B+tree nodes are kept at a fixed size in bytes (e.g. 8K) because space in the data files is managed in fixed-size pages. But in an append-only file scheme where the documents can be any length and the updated tree nodes are appended afterwards, there seems to be no advantage to a fixed-size node.
The goal of a b-tree is to minimize the number of disk accesses. If the file system cluster size is 4k, then the ideal size for the nodes is 4k. Also, the nodes should be properly aligned. A misaligned node will cause two clusters to be read, reducing performance.
With a log-based storage scheme, choosing a 4k node size is probably the worst choice unless gaps are inserted in the log to improve alignment; otherwise, 99.98% of the time a node is stored on two clusters. With a 2k node size, the odds of this happening are just under 50%. However, there's a problem with a small node size: the average height of the b-tree increases, and the time spent reading a disk cluster is not fully utilized.
Larger node sizes reduce the height of the tree, but they also increase the number of disk accesses per node. Larger nodes also increase the overhead of maintaining the entries within the node. Imagine a b-tree whose node size is large enough to encapsulate the entire database: you would have to embed a better data structure within the node itself, perhaps another b-tree?
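As a rough back-of-the-envelope check (assuming, purely for illustration, about 32 bytes per key plus child pointer and 100 million keys), the height falls off quickly and then flattens:

```cpp
#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
    const double keys = 100e6;       // assumed number of keys in the tree
    const double entry_bytes = 32;   // assumed bytes per key + child pointer
    for (double node_bytes : {2048.0, 4096.0, 16384.0, 65536.0}) {
        double fanout = node_bytes / entry_bytes;
        double height = std::ceil(std::log(keys) / std::log(fanout));
        std::printf("%6.0f-byte nodes: fanout ~%4.0f, height ~%.0f\n",
                    node_bytes, fanout, height);
    }
    return 0;
}
```

Under those assumptions the heights come out around 5, 4, 3 and 3: past a few kilobytes per node the tree barely gets shorter, while every node read touches more clusters.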
I spent some time prototyping a b-tree implementation over an append-only log format and eventually rejected the concept altogether. To compensate for performance losses due to node/cluster misalignment, you need to have a very large cache. A more traditional storage approach can make better use of the RAM.
The final blow was when I evaluated the performance of randomly-ordered inserts. This kills the performance of any disk-backed storage system, but log-based formats suffer much more. A write of even the smallest entry forces several nodes to be written to the log, and the internal nodes are invalidated shortly after being written. As a result, the log rapidly fills up with garbage.
BerkeleyDB-JE (BDB-JE) is also log based, and I studied its performance characteristics too. It suffers from the same problem my prototype did: rapid accumulation of garbage. BDB-JE has several "cleaner" threads that re-append surviving records to the log, but the random order is preserved, so the newly "cleaned" log files quickly fill up with garbage again. The overall performance of the system degrades to the point where the only thing running is the cleaner, and it hogs all system resources.
Log-based formats are very attractive because you can quickly implement a robust database. The Achilles' heel is the cleaner, which is non-trivial. Caching strategies are also tricky to get right.