Can this size of the graph be stored in memory? - graph-databases

I am considering using Memgraph for low latency graph queries. My graph is pretty huge, with more than 100M nodes and edges. Can a graph of this size be stored in memory? How can I estimate the amount of memory needed? Is there a way to spill over to disk?

Here is a simple guide for how to calculate memory usage: https://memgraph.com/docs/memgraph/under-the-hood/storage
If you have around 100M nodes and edges, the approximate amount of memory needed would be at least 22GB. Right now there is no way to spill over to disk, but Memgraph will be adding this feature at some point in the future. Also, Memgraph's GQLAlchemy library provides an on-disk storage solution for large properties that aren't used in graph algorithms. This is useful when nodes or relationships carry metadata that isn't needed by any of the graph algorithms run inside Memgraph but can be fetched afterwards.
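As a rough illustration, a back-of-the-envelope estimate can be done like this; the per-node and per-edge byte counts below are assumptions chosen only to reproduce the ~22GB figure, so use the linked storage guide for Memgraph's actual numbers:

```python
# Back-of-the-envelope memory estimate for an in-memory graph store.
# BYTES_PER_NODE / BYTES_PER_EDGE are illustrative assumptions, not official
# Memgraph figures -- see the linked storage guide for the real constants.
BYTES_PER_NODE = 150   # assumed: node record plus index/bookkeeping overhead
BYTES_PER_EDGE = 70    # assumed: edge record plus adjacency entries

def estimate_memory_gb(num_nodes: int, num_edges: int) -> float:
    """Approximate memory footprint in GB, excluding property values."""
    return (num_nodes * BYTES_PER_NODE + num_edges * BYTES_PER_EDGE) / 1e9

print(f"{estimate_memory_gb(100_000_000, 100_000_000):.0f} GB")  # ~22 GB
```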

Related

Why isn't a trie index used in databases for string indexing?

My question is in the title: a trie seems well suited to string indexing, so why do no mainstream databases use it as an indexing strategy?
Disks and SSDs are read in blocks, and the B+Tree indexes that databases use are optimized for that structure. B+Trees minimize the average number of blocks you have to read to perform a lookup. They also allow you to update the index without changing too many blocks, and they make good use of the cache.
Tries don't have these advantages. The one advantage they do provide is compressed storage of common prefixes, but for the short strings that are usually used as DB keys, that isn't much of an advantage. Sometimes specialized index structures are built to compress common prefixes, but again they're designed around the block structure of the storage.
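To make the block argument concrete, here is a small illustrative calculation; the block, key, and pointer sizes are assumptions, not figures from any particular database:

```python
import math

# Rough fanout/height estimate for a B+Tree over fixed-size disk blocks.
# All sizes below are assumed for illustration only.
BLOCK_SIZE = 4096    # bytes per block/page
KEY_SIZE = 16        # assumed average key length in bytes
POINTER_SIZE = 8     # assumed child-pointer size in bytes

fanout = BLOCK_SIZE // (KEY_SIZE + POINTER_SIZE)      # ~170 keys per node
rows = 100_000_000
height = math.ceil(math.log(rows, fanout))            # blocks read per lookup

print(f"fanout ~{fanout}: a lookup over {rows:,} rows touches ~{height} blocks")
```

A trie, by contrast, typically descends one node per character of the key, so without careful packing into blocks a single lookup can touch far more blocks than this.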

Graph databases utilizing locality

DAG = directed acyclic graph;
roots = vertices without incoming edges.
I have a DAG larger than available RAM, so I need a disk-based graph database to work with it.
My DAG is shallow: I have billions of root nodes, but from each node only dozens of nodes are reachable.
It is also not well connected: the majority of nodes have only one incoming edge, so for any pair of root nodes the reachable subgraphs usually have very few nodes in common.
So my DAG can be thought of as a large number of small trees, only a few of which intersect.
I need to perform the following query on my DAG in bulk: given a root node, get all nodes reachable from it.
It can be thought of as a batch query: given a few thousand root nodes, return all nodes reachable from them.
As far as I know there are algorithms to improve disk storage locality for graphs. Three examples are:
http://ceur-ws.org/Vol-733/paper_pacher.pdf
http://www.cs.ox.ac.uk/dan.olteanu/papers/g-store.pdf
http://graphlab.org/files/osdi2012-kyrola-blelloch-guestrin.pdf
It also seems there are older-generation graph databases that don't exploit graph locality, for example the popular Neo4j graph database:
http://www.ibm.com/developerworks/library/os-giraph/
Neo4j relies on data access methods for graphs without considering data locality, and the processing of graphs entails mostly random data access. For large graphs that cannot be stored in memory, random disk access becomes a performance bottleneck.
My question is: are there any graph databases suited well for my workload?
Support for Win64 and the possibility of working with the database from something other than Java would be a plus.
From the task itself it doesn't seem that you need a graph database.
You can simply use some external-memory programming library, such as stxxl.
First perform a topological sort on the graph (in edge-list format). Then you only need a sequential scan until you have processed all the root nodes; the I/O complexity is bounded by that of the topological sort. Actually, you don't even need a full topological sort, you just need to identify the root nodes, which can be done with a join between the edge table and the node table in linear time.
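A minimal in-memory sketch of that idea: find the root nodes from the edge list, then collect each root's reachable set with a BFS. For a graph larger than RAM, the same two steps would be done with external-memory sorts/joins (e.g. via stxxl) rather than Python dictionaries:

```python
from collections import defaultdict, deque

def reachable_from_roots(edges, roots):
    """Return {root: set of nodes reachable from it}. In-memory sketch only;
    for a graph larger than RAM, build the adjacency data with
    external-memory sorts/joins instead of a dict."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)

    result = {}
    for root in roots:
        seen, queue = {root}, deque([root])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        result[root] = seen
    return result

# Toy example: roots are the nodes that never appear as a destination.
edges = [(1, 2), (2, 3), (4, 3), (5, 6)]
nodes = {s for s, _ in edges} | {d for _, d in edges}
roots = sorted(nodes - {d for _, d in edges})    # [1, 4, 5]
print(reachable_from_roots(edges, roots))
```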

Handling large data in a C program

I am working on a project which runs queries on a database whose results are larger than the available memory. I have heard of memory pool libraries, but I'm not sure they are the best solution to this problem.
Do memory pool libraries support writing to and reading back from disk (e.g. for the result of a query that needs to be parsed many times)? Are there other ways to achieve this?
P.S. I am using a MySQL database and its C API to access it.
EDIT: here's an example:
Suppose I have five tables, each having a million rows. I want to find how similar one table is to another, so I am creating a Bloom filter for each table and then checking each filter against the data in the other four tables.
Extending your logical memory beyond the physical memory by using secondary storage (e.g. disks) is usually called swapping, not memory pooling. Your operating system already does it for you, and you should try letting it do its job first.
Memory pool libraries provide more speed and real-time predictability to memory allocation by using fixed-size allocation, but don't increase your actual memory.
You should restructure your program to not use so much memory. Instead of pulling the whole DB (or a large part of it) into memory, you should use a cursor and incrementally update the data structure your program is maintaining, or incrementally compute the metric you are querying.
EDIT: you added that you might want to run a bloom filter on the tables?
Have a look at incremental bloom filters: here
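Putting the cursor suggestion and the Bloom-filter example together, here is a rough Python sketch (toy filter size and hash scheme, purely illustrative); with the MySQL C API the equivalent of the cursor would be fetching rows one at a time via mysql_use_result instead of buffering the whole result set:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter; bit count and hash scheme are illustrative, not tuned."""
    def __init__(self, num_bits=8_000_000, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def build_filter(row_iter):
    """Stream rows one at a time (a server-side cursor) instead of loading all."""
    bf = BloomFilter()
    for (key,) in row_iter:      # each row yields a single key column
        bf.add(key)
    return bf

# Toy usage; in practice row_iter would iterate a cursor over a million-row table.
bf = build_filter(iter([("alice",), ("bob",)]))
print("alice" in bf, "carol" in bf)   # True False (with high probability)
```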
How about Physical Address Extension (PAE)?

For millions of objects, is it better to store in an array or a database like redis if the objects are needed in realtime?

I am developing a simulation in which there can be millions of entities that can interact with each other. At the moment, all the entities are stored in a list. Would it be better to store the objects in a database like redis instead of a list?
Note: I assumed this was being implemented in Java (force of habit). My answer is not terribly useful if it is not Java.
Making lots of assumptions about your requirements, I'd consider Redis if:
You are running into unacceptable GC pauses as a result of your millions of objects OR
The entities you create can be reused across multiple simulation runs
Java apps with giant heaps and lots of long-lived objects can run into very long GC pauses, depending on work-load. i.e. the old gen fills up with all these millions of objects and they're never eligible for collection. Regardless, periodically a full collect will happen (unless you're a GC tuning master) and have to scan these millions of objects in the old gen. This can take many seconds each time it happens, and you're frozen during that time. If this is happening and you don't like it, you could off-load all these long-lived objects to Redis, and pay the serialize/deserialize cost of accessing them rather than the GC pauses.
On the other point about reusing entities: if you're loading up a big Redis db and then dropping all its data when the simulation ends, it feels a bit wasteful. If you can re-use entities across simulation runs you might save yourself a bunch of time by persisting them in Redis.
The best choice depends on a number of factors, including how you access data, whether it will fit in memory, and what the distribution of accesses looks like. As a broad generalization, keeping data in memory is always faster than on disk, and keeping it in-process is faster than keeping it elsewhere.
If your data fits in memory, is accessed in a manner that means you can use basic data structures like lists/arrays and hashtables efficiently, and all items are accessed roughly equally often, keeping your data in memory is probably the best option.
If your data fits in memory, but you need to access it in complex ways, you may be best choosing a datastore like redis that supports in-memory databases.
If your data doesn't fit in memory, or you have a very uneven access pattern such that evicting the least used data to disk might allow other things to be loaded, speeding up your task in general, a regular disk-based datastore may be a better choice.
A list is not necessarily the best data structure unless "interaction" is limited to the respective next or previous element. Random access (by index) is very slow on a list.
Lists rocket at inserting at front and end, and at finding the next (or previous) element, or inserting one in between. They totally blow for accessing element 164553 and then element 10657, being O(N) on random access. Thus "interact with each other" suggests that list is a bad choice.
It very much depends on the access and allocation patterns, but a vector or deque will likely be much better suited than a list for your simulation.
Redis is based on a hash table, which has (much!) better characteristics for random access, but it will most likely still be slower, because of the considerable overhead: you serialize the data, it goes through a socket, Redis deserializes and processes it and sends a reply, and you parse that reply.
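As a small sketch of that trade-off (this assumes the redis-py client and a Redis server on localhost; the entity layout is made up for illustration):

```python
import json
import redis  # assumes the redis-py client and a Redis server on localhost

# In-process storage: a dict keyed by entity id gives O(1) random access
# with no serialization and no network round trip.
entities = {i: {"id": i, "x": 0.0, "y": 0.0} for i in range(1_000_000)}
e = entities[164_553]                          # direct memory access

# The same lookup through Redis pays serialize -> socket -> deserialize both ways.
r = redis.Redis()
r.set("entity:164553", json.dumps(entities[164_553]))
e2 = json.loads(r.get("entity:164553"))
```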

B+ tree node sizing

I'm planning on writing a simple key/value store with a file architecture similar to CouchDB, i.e. an append-only b+tree.
I've read everything I can find on B+trees and also everything I can find on CouchDB's internals, but I haven't had time to work my way through the source code (being in a very different language makes it a special project in its own right).
So I have a question about the sizing of B+tree nodes: given that key length is variable, is it better to keep the nodes the same length (in bytes), or is it better to give them the same number of keys/child pointers regardless of how big they become?
I realise that in conventional databases the B+tree nodes are kept at a fixed length in bytes (e.g. 8K) because space in the data files is managed in fixed size pages. But in an append-only file scheme where the documents can be any length and the updated tree nodes are written after, there seems to be no advantage to having a fixed-size node.
The goal of a b-tree is to minimize the number of disk accesses. If the file system cluster size is 4k, then the ideal size for the nodes is 4k. Also, the nodes should be properly aligned. A misaligned node will cause two clusters to be read, reducing performance.
With a log based storage scheme, choosing a 4k node size is probably the worst choice unless gaps are created in the log to improve alignment. Otherwise, 99.98% of the time one node is stored on two clusters. With a 2k node size, the odds of this happening are just under 50%. However, there's a problem with a small node size: average height of the b-tree is increased and the time spent reading a disk cluster is not fully utilized.
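Those percentages follow from simple arithmetic: in an append-only log a node starts at an effectively random byte offset within a cluster, so a node of s bytes crosses a cluster boundary with probability roughly (s - 1) / cluster_size. A quick sketch (4k clusters assumed):

```python
CLUSTER = 4096  # assumed file-system cluster size in bytes

def straddle_probability(node_size, cluster=CLUSTER):
    """Chance that an append-only node written at a random byte offset
    crosses a cluster boundary (for node_size <= cluster)."""
    return (node_size - 1) / cluster

for size in (4096, 2048, 1024):
    print(f"{size:>5} B node: {straddle_probability(size):.2%} span two clusters")
```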
Larger node sizes reduce the height of the tree, but they too increase the number of disk accesses. Larger nodes also increase the overhead of maintaining the entries within the node. Imagine a b-tree where the node size is large enough to encapsulate the entire database. You have to embed a better data structure within the node itself, perhaps another b-tree?
I spent some time prototyping a b-tree implementation over an append-only log format and eventually rejected the concept altogether. To compensate for performance losses due to node/cluster misalignment, you need to have a very large cache. A more traditional storage approach can make better use of the RAM.
The final blow was when I evaluated the performance of randomly-ordered inserts. This kills performance of any disk-backed storage system, but log based formats suffer much more. A write of even the smallest entry forces several nodes to be written to the log, and the internal nodes are invalidated shortly after being written. As a result, the log rapidly fills up with garbage.
BerkeleyDB-JE (BDB-JE) is also log based, and I studied its performance characteristics too. It suffers the same problem my prototype did: rapid accumulation of garbage. BDB-JE has several "cleaner" threads which re-append surviving records to the log, but the random order is preserved, so the new "clean" log files soon fill up with garbage again. The overall performance of the system degrades to the point that the only thing running is the cleaner, and it's hogging all system resources.
Log based formats are very attractive because one can quickly implement a robust database. The Achilles heel is the cleaner, which is non-trivial. Caching strategies are also tricky to get right.

Resources