What happens once a graph exceeds the available RAM? Persistence is guaranteed through snapshots and WAL - but at some point, we will likely hit a limit on how much of the graph can be held in memory.
Is Memgraph aware of the completeness of the graph in memory? If so, does it have strategies to offload less-used paths from memory? And if that’s the case, how would it guarantee that a query returns a complete result (and has not missed offloaded segments of the graph)?
If file storage-based queries are part of the strategy, how much of a performance hit is to be expected?
Memgraph stores the entire graph inside RAM. In other words, there has to be enough memory to store the whole dataset. That was an early strategic decision because we didn’t want to sacrifice performance, and at this point there is no other option. Memgraph is durable because of snapshots and WALs, and there are memory limits in place (once the limit is reached, Memgraph stops accepting writes). A side note, but essential to mention: graph algorithms usually traverse the entire dataset multiple times, which means you have to bring the data into memory either way (Memgraph is heavily optimized for that case).
There are drawbacks to the above approach. In the great majority of cases, you can find enough RAM, but it’s not the cheapest option.
Related
I have a memory-heavy application which is supposed to run with low latency and at a constant speed, but in practice it has poor performance during the first few seconds of startup. This appears to be because the initial memory accesses trigger page faults, which have significant performance implications.
I would like to try preallocating a single large block of memory, paging it all in (via mlock() or just by touching each byte), and then using a custom malloc()/free() implementation to ensure that all further allocations are done from within this block.
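Roughly what I have in mind, as a Linux-only sketch (the 1 GiB size is arbitrary and error handling is minimal):

    /* Reserve one large block up front, pin it, and touch every page so the
     * page faults happen during startup instead of in the hot path. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define ARENA_SIZE (1UL << 30)   /* 1 GiB, chosen arbitrarily for the example */

    static void *arena;

    int main(void)
    {
        arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* mlock() pins the pages in RAM (needs RLIMIT_MEMLOCK headroom). */
        if (mlock(arena, ARENA_SIZE) != 0)
            perror("mlock");

        /* Touch every byte so the faults are paid now, not later. */
        memset(arena, 0, ARENA_SIZE);

        /* ... hand `arena` to a custom malloc()/free() implementation ... */
        return 0;
    }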
I am aware of numerous custom memory allocators (TCMalloc, Hoard, jemalloc, etc) but it is not clear to me whether they can be backed by user-provided memory, or whether they always perform their internal allocations from the OS. Does anyone have any insight or recommendations here?
To be clear, I am not looking for a memory pooling system (which would be for reusing small objects). The custom implementation of malloc()/free() should be able to perform any size allocation while limiting fragmentation of its backing store and following other best practices.
Edit based on comments: I do not expect to make the system faster - I just want to move the slow part (allocation, initial page faults) to the start of the process, and then do the real computation work once the system is 'primed'.
Thanks!
A bit late to the party.
dlmalloc is one choice that can be backed by pre-allocated memory. You can find it here. You may just need to add some extra definitions at the beginning to force it to use your pre-allocated memory rather than calling the system mmap; you can refer to the nice documentation at the beginning of the file.
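For example, dlmalloc's mspace API lets you carve allocations out of a buffer you supply. A minimal sketch, assuming malloc.c is compiled with -DONLY_MSPACES=1 and linked in (the 64 MiB size is arbitrary):

    /* Prototypes as declared in dlmalloc's malloc.c when built with MSPACES. */
    #include <stddef.h>

    typedef void *mspace;
    mspace create_mspace_with_base(void *base, size_t capacity, int locked);
    void  *mspace_malloc(mspace msp, size_t bytes);
    void   mspace_free(mspace msp, void *mem);
    size_t destroy_mspace(mspace msp);

    #define BACKING_SIZE (64UL * 1024 * 1024)   /* hypothetical 64 MiB arena */

    static char backing[BACKING_SIZE];          /* or your mlock()'d mmap block */

    int main(void)
    {
        /* Allocations from this mspace come out of `backing`; build with
         * HAVE_MORECORE=0 and HAVE_MMAP=0 if it must never fall back to the OS. */
        mspace msp = create_mspace_with_base(backing, sizeof backing, 0);

        void *p = mspace_malloc(msp, 4096);
        mspace_free(msp, p);

        destroy_mspace(msp);
        return 0;
    }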
We had an issue with one of our real-time applications. The idea was to run one of the threads every 2 ms (500 Hz). After the application had run for half an hour or so, we noticed that the thread was falling behind.
After a few discussions, people pointed to the malloc allocations in the real-time thread as the root cause.
I am wondering: is it always a good idea to avoid all dynamic memory allocations in real-time threads?
The internet has very few resources on this. If you can point me to some discussion, that would be great too.
Thanks
First step is to profile the code and make sure you understand exactly where the bottleneck is. People are often bad at guessing bottlenecks in code, and you might be surprised by the findings. You can simply instrument several parts of this routine yourself and dump min/avg/max durations at regular intervals. You want to see the worst case (max), and whether the average duration increases as time goes by.
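Something along these lines, as a rough sketch (not tied to any particular profiling framework):

    /* Time each iteration with CLOCK_MONOTONIC and keep min/avg/max so that
     * worst-case spikes and gradual drift become visible. */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    void timed_iteration(void (*work)(void))
    {
        static uint64_t min_ns = UINT64_MAX, max_ns = 0, sum_ns = 0, count = 0;

        uint64_t start = now_ns();
        work();                          /* the 2 ms routine under test */
        uint64_t dur = now_ns() - start;

        if (dur < min_ns) min_ns = dur;
        if (dur > max_ns) max_ns = dur;
        sum_ns += dur;

        if (++count % 10000 == 0)        /* dump stats every 10k iterations */
            printf("min=%llu avg=%llu max=%llu (ns)\n",
                   (unsigned long long)min_ns,
                   (unsigned long long)(sum_ns / count),
                   (unsigned long long)max_ns);
    }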
I doubt that malloc will take any significant portion of these 2ms on a reasonable microcontroller capable of running Linux; I'd say it's more likely you would run out of memory due to fragmentation, than having performance issues. If you have any other syscalls in your function, they will easily take an order of magnitude more than malloc.
But if malloc is really the problem, depending on how short-lived your objects are, how much memory you can afford to waste, and how much your requirements are known in advance, there are several approaches you can take:
General purpose allocation (malloc from your standard library, or any third party implementation): best approach if you have "more than enough" RAM, many short-lived objects, and no strict latency requirements
PROS: works for any object size out of the box, familiar interface, memory is shared dynamically, no need to "plan ahead" if memory is not an issue
CONS: slight performance penalty during allocation and/or deallocation, memory fragmentation issues when doing lots of allocations/deallocations of objects of different sizes, whether a run-time allocation will fail is less deterministic and cannot be easily mitigated at runtime
Memory pool: best approach in most cases where memory is limited, low latency is required, and the object needs to live longer than a single block scope (a minimal sketch follows this list)
PROS: allocation/deallocation time is guaranteed to be O(1) in any reasonable implementation, does not suffer from fragmentation, easier to plan its size in advance, failure to allocate at run-time is (likely) easier to mitigate
CONS: works for a single specific object size - memory is not shared with other parts of the program, and it requires planning the right size of the pool in advance, at the risk of wasting memory
Stack based (automatic-duration) objects: best for smaller, short-lived objects (single block scope)
PROS: allocation and deallocation is done automatically, allows having optimum usage of RAM for the object's lifetime, there are tools which can sometimes do a static analysis of your code and estimate the stack size
CONS: objects limited to a single block scope - cannot share objects between interrupt invocations
Individual statically allocated objects: best approach for long lived objects
PROS: no allocation whatsoever - all needed objects exist throughout the application life-cycle, no problems with allocation/deallocation
CONS: wastes memory if the objects should be short-lived
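To make the memory pool option concrete, here is a minimal single-threaded sketch (sizes are placeholders; a real-time system would also need to consider locking or per-thread pools):

    /* Fixed-size memory pool: O(1) alloc/free from a static array via an
     * intrusive free list. No fragmentation, capacity fixed at compile time. */
    #include <stddef.h>

    #define BLOCK_SIZE  64      /* size of each object slot (placeholder) */
    #define BLOCK_COUNT 128     /* capacity planned in advance (placeholder) */

    typedef union block {
        union block *next;              /* valid while the block is free */
        unsigned char data[BLOCK_SIZE]; /* payload while the block is in use */
    } block_t;

    static block_t pool[BLOCK_COUNT];
    static block_t *free_list;

    void pool_init(void)
    {
        for (size_t i = 0; i < BLOCK_COUNT - 1; i++)
            pool[i].next = &pool[i + 1];
        pool[BLOCK_COUNT - 1].next = NULL;
        free_list = &pool[0];
    }

    void *pool_alloc(void)
    {
        block_t *b = free_list;
        if (b != NULL)
            free_list = b->next;
        return b;                       /* NULL if the pool is exhausted */
    }

    void pool_free(void *p)
    {
        block_t *b = p;
        b->next = free_list;
        free_list = b;
    }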
Even if you decide to go for memory pools all over the program, make sure you add profiling/instrumentation to your code. And then leave it there forever to see how the performance changes over time.
As realtime software engineers in the aerospace industry, we see this question a lot. Even among our own engineers, we see attempts to use non-realtime programming techniques learned elsewhere, or to pull open-source code into realtime programs. Never allocate from the heap during realtime. One of our engineers created a tool that intercepts malloc and records the overhead. You can see in the numbers that you cannot predict when an allocation attempt will take a long time. Even on very high-end computers (72-core, 256 GB RAM servers) running a realtime hybrid of Linux, we record mallocs taking hundreds of milliseconds. It can cross into the kernel, so the overhead is high, and you don't know when you will get hit by the allocator's internal housekeeping, or when it decides it must request another large chunk of memory for the task from the OS.
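For illustration only (this is not the actual tool, just a sketch of the same idea): a malloc interposer built as a shared object and loaded with LD_PRELOAD can log how long each allocation takes.

    /* Sketch of a timing interposer for malloc. Build as a shared object and
     * run the target with LD_PRELOAD pointing at it. Illustrative only: a
     * production version must be careful not to allocate inside the hook. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void *(*real_malloc)(size_t);

    void *malloc(size_t size)
    {
        if (!real_malloc)
            real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        void *p = real_malloc(size);
        clock_gettime(CLOCK_MONOTONIC, &b);

        long us = (b.tv_sec - a.tv_sec) * 1000000L
                + (b.tv_nsec - a.tv_nsec) / 1000L;
        if (us > 1000)   /* flag anything slower than 1 ms */
            fprintf(stderr, "slow malloc(%zu): %ld us\n", size, us);

        return p;
    }

Compile with something like: gcc -shared -fPIC -o mallocspy.so mallocspy.c -ldl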
I want to know what exactly sequential writes and random writes are, by definition. An example would be even more helpful. I tried to google it, but didn't find much of an explanation.
Thanks
When you write two blocks that are next to each-other on disk, you have a sequential write.
When you write two blocks that are located far away from each other on disk, you have random writes.
With a spinning hard disk, the second pattern is much slower (it can be orders of magnitude slower), because the head has to be moved to the new position.
Database technology is (or has been - it may matter less with SSDs) to a large part about optimizing disk access patterns. So what you often see, for example, is trading direct updates of data in their on-disk location (random access) for writes to a transaction log (sequential access). That makes it more complicated and time-consuming to reconstruct the actual value, but makes for much faster commits (and you have checkpoints to eventually consolidate the logs that build up).
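A toy illustration of the two patterns (file names and sizes are made up):

    /* Sequential vs. random writes: appending records to a log is sequential;
     * pwrite() at scattered offsets in a large file is random. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define RECORD_SIZE 4096

    int main(void)
    {
        char record[RECORD_SIZE];
        memset(record, 'x', sizeof record);

        /* Sequential: every write lands right after the previous one. */
        int log_fd = open("txn.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        for (int i = 0; i < 1000; i++)
            write(log_fd, record, sizeof record);
        close(log_fd);

        /* Random: each write goes to an unrelated offset in a data file. */
        int data_fd = open("table.dat", O_WRONLY | O_CREAT, 0644);
        for (int i = 0; i < 1000; i++) {
            off_t offset = (off_t)(rand() % 100000) * RECORD_SIZE;
            pwrite(data_fd, record, sizeof record, offset);
        }
        close(data_fd);
        return 0;
    }

On a spinning disk the first loop keeps the head in place, while the second tends to force a seek for nearly every write.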
If not using mmap(), it seems like there should be a way to give certain files "priority", so that the only time they're swapped out is for page faults trying to bring in, e.g., executing code, or memory that was malloc()'d by some process, but never other files. One can think of situations where this could be useful. Consider search engines, which should keep their index files in cache, but which may be simultaneously writing new files (not being used for search).
There are a few ways.
The best way is with madvise(), which allows you to inform the kernel that you will need a particular range of memory soon, giving it priority over other memory. You can also use it to say that a particular range will not be needed soon, so it should be swapped out sooner.
The hack way is with mlock(), which forces a range of memory to stay in RAM. This is generally not a good idea, and should only be used in special cases. The most common case is to store passwords in RAM so that the password cannot be recovered from the swap file after the computer is powered off. I would not use mlock() for performance tuning unless I had exhausted other options.
The worst way is to constantly poke memory, forcing it to stay fresh.
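A minimal sketch of the madvise() route (the file name is a placeholder; for plain read()/write() access without mmap(), posix_fadvise() accepts similar hints):

    /* Map a file and hint the kernel about which parts matter. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("index.dat", O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        /* Ask the kernel to read this range in soon; it gets priority. */
        madvise(map, st.st_size, MADV_WILLNEED);

        /* ... search using the mapped index ... */

        /* Say the range is no longer needed, so it can be dropped first. */
        madvise(map, st.st_size, MADV_DONTNEED);

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }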
Is the order of page flushes with msync(MS_ASYNC) on Linux guaranteed to be the same as the order the pages were written in?
If it depends on circumstances, is there a way for me (full server access) to make sure they are in the same order?
Background
I'm currently using OpenLDAP Symas MDB as a persistent key/value store, and without MDB_MAPASYNC - which results in using msync(MS_ASYNC) (I looked through the source code) - the writes are so slow that even while processing data, a single core is permanently waiting on IO, sometimes at < 1 MB/s. After analyzing, the problem seems to be many small IO ops. Using MDB_MAPASYNC I can easily hit the max rate of my disk, but the documentation of MDB states that in that case the database can become corrupted. Unfortunately the code is too complex for me / I currently don't have the time to work through the whole codebase step by step to find out why this would be, and also, I don't need many of the features MDB provides (transactions, cursors, ACID compliance), so I was thinking of writing my own KV store backed by mmap, using msync(MS_ASYNC) and making sure to write in a way that an un-flushed page would only lose the last touched data, and not corrupt the database or lose any other data.
But for that I'd need an answer to my question, which I totally can't find by googling or going through Linux mailing lists, unfortunately (I've found a few mails regarding msync patches, but nothing else).
On a side note, I've looked through dozens of other available persistent KV stores and wasn't able to find a better fit for me (fast writes, easy to use, embedded (so no HTTP services or the like), deterministic speed (so no garbage collection or randomly run compaction like LevelDB), sane space requirements (so no append-only databases), variable key lengths, binary keys and data), but if you know of one which could help me out here, I'd also be very thankful.
msync(MS_ASYNC) doesn't guarantee the ordering of the stores, because the IO elevator algos operating in the background try to maximize efficiency by merging and ordering the writes to maximize the throughput to the device.
From man 2 msync:
Since Linux 2.6.19, MS_ASYNC is in fact a no-op, since the kernel properly tracks dirty pages and flushes them to storage as necessary.
Unfortunately, the only mechanism to sync a mapping with its backing storage is the blocking MS_SYNC, which also does not have any ordering guarantees (if you sync a 1 MiB region, the 256 4 KiB pages can propagate to the drive in any order -- all you know is that if msync returns, all of the 1 MiB has been synced).
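If you need a particular ordering, the only portable lever is to use the blocking call as a barrier yourself: sync the data pages first, and only afterwards write and sync the page whose contents mark the commit. A rough sketch, assuming a page-aligned MAP_SHARED mapping and an invented data-page/commit-page layout:

    /* Enforce ordering manually by using blocking msync() as a write barrier. */
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096u

    /* Round an offset down to its page start (msync needs aligned addresses). */
    static size_t page_floor(size_t off) { return off & ~(size_t)(PAGE_SIZE - 1); }

    int commit_record(char *map, const char *record, size_t len,
                      size_t data_off, size_t meta_off)
    {
        /* 1. Write the record into the data region. */
        memcpy(map + data_off, record, len);

        /* 2. Barrier: block until the data pages are on stable storage. */
        size_t start = page_floor(data_off);
        if (msync(map + start, (data_off - start) + len, MS_SYNC) != 0)
            return -1;

        /* 3. Only now update the commit/meta page that makes the record visible. */
        memcpy(map + meta_off, &data_off, sizeof data_off);

        /* 4. Second barrier for the commit page itself. */
        start = page_floor(meta_off);
        return msync(map + start, PAGE_SIZE, MS_SYNC);
    }

The price is two blocking syncs per commit, so batching several records per commit is the usual mitigation.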