How is memory managed in YugaByte DB? I understand that there are two sets of processes, yb-tserver & yb-master, but I couldn't find many other details.
Specific questions:
How much RAM does each of these processes use by default?
Is there a way to explicitly control this?
Presumably, the memory is used for caching, memtables etc. How are these components sized?
Can specific tables be pinned in memory (or say given higher priority in caches)?
Thanks in advance.
How much RAM does each of these processes use by default?
By default, the yb-tserver process assumes 85% of the node's RAM is available for its use, and the yb-master process assumes 10% of the node's RAM is available for its use. These are determined by the default settings of the gflag --default_memory_limit_to_ram_ratio (0.85 and 0.1 respectively for yb-tserver and yb-master).
Is there a way to explicitly control this?
Yes, there are 2 different options for controlling how much memory is allocated to the processes yb-master and yb-tserver:
Option A) You can set --default_memory_limit_to_ram_ratio to control what percentage of the node's RAM the process should use.
Option B) You can also specify an absolute value using --memory_limit_hard_bytes. For example, to give yb-tserver 32 GB of RAM, use:
--memory_limit_hard_bytes 34359738368
Since you start these two processes independently, you can use either option for yb-master or yb-tserver. Just make sure you don't oversubscribe the total machine memory, since both a yb-master and a yb-tserver process can be present on a single VM.
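As a back-of-the-envelope check, the arithmetic behind these flags is simple; here is an illustrative sketch in C (the 64 GiB node size is hypothetical, and only the flag semantics come from the answer above):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t GiB = 1024ULL * 1024 * 1024;
    uint64_t node_ram = 64 * GiB;               /* hypothetical node size */

    /* Option A: ratio-based limits (flag defaults 0.85 and 0.10). */
    uint64_t tserver_limit = (uint64_t)(node_ram * 0.85);
    uint64_t master_limit  = (uint64_t)(node_ram * 0.10);

    /* Option B: an absolute limit, e.g. 32 GiB for yb-tserver. */
    uint64_t tserver_hard  = 32 * GiB;          /* = 34359738368 bytes */

    printf("tserver @ ratio 0.85: %llu bytes\n", (unsigned long long)tserver_limit);
    printf("master  @ ratio 0.10: %llu bytes\n", (unsigned long long)master_limit);
    printf("tserver hard bytes  : %llu bytes\n", (unsigned long long)tserver_hard);

    /* Both processes may share one VM, so the limits must not oversubscribe RAM. */
    if (tserver_limit + master_limit > node_ram)
        printf("warning: limits oversubscribe node RAM\n");
    return 0;
}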
Presumably, the memory is used for caching, memtables etc. How are these components sized?
Yes, the primary consumers of memory are the block cache, memstores & memory needed for requests/RPCs in flight.
Block Cache:
--db_block_cache_size_percentage=50 (default)
Total memstore is the minimum of these two knobs:
--global_memstore_size_mb_max=2048
--global_memstore_size_percentage=10
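As a rough sketch of how these knobs combine (assuming the percentages are taken of the tserver's memory limit; the exact base is worth confirming in the docs):

#include <stdint.h>
#include <stdio.h>

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

int main(void) {
    const uint64_t MiB = 1024ULL * 1024;
    uint64_t tserver_limit = 32ULL * 1024 * MiB;                 /* hypothetical 32 GiB limit */

    /* Flag defaults quoted above. */
    uint64_t block_cache = tserver_limit * 50 / 100;             /* --db_block_cache_size_percentage=50 */
    uint64_t memstore    = min_u64(2048 * MiB,                   /* --global_memstore_size_mb_max=2048 */
                                   tserver_limit * 10 / 100);    /* --global_memstore_size_percentage=10 */

    printf("block cache budget: %llu MiB\n", (unsigned long long)(block_cache / MiB));
    printf("memstore budget   : %llu MiB\n", (unsigned long long)(memstore / MiB));
    return 0;
}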
Can specific tables be pinned in memory (or say given higher priority in caches)?
We do not currently (as of 1.1) have per-table pinning hints. However, the block cache already does a great job by default of keeping hot blocks in cache. We have enhanced RocksDB’s block cache to be scan-resistant; the motivation was to prevent operations such as long-running scans (e.g., due to an occasional large query or background Spark jobs) from polluting the entire cache with poor-quality data and wiping out useful/hot data.
How can we control the RSS window when mapping a large file? Let me explain what I mean.
For example, say we have a large file that exceeds RAM several times over, and we mmap it as shared memory for several processes. If we access an object whose virtual address lies in this mapped region and take a page fault, the data is read from disk. The sub-question is: will the opposite happen if we no longer use that object? If this works like an LRU, what is the size of the LRU and how can we control it? How is the page cache involved in this case?
[RSS graph]
This is the RSS graph on a test instance (2 threads, 8 GB RAM) for an 80 GB tar file. Where does this value of 3800 MB come from, and why does it stay stable while I run through the file after it has been mapped? How can I control it (or advise the kernel to control it)?
As long as you're not taking explicit action to lock the pages in memory, they should eventually be swapped back out automatically. The kernel basically uses a memory pressure heuristic to decide how much of physical memory to devote to swapped-in pages, and frequently rebalances as needed.
If you want to take a more active role in controlling this process, have a look at the madvise() system call.
This allows you to tweak the paging algorithm for your mmap, with actions like:
MADV_FREE (since Linux 4.5)
The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. ...
MADV_COLD (since Linux 5.4)
Deactivate a given range of pages. This will make the pages a more probable reclaim target should there be a memory pressure.
MADV_SEQUENTIAL
Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)
MADV_WILLNEED
Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) ...
Issuing an madvise(MADV_SEQUENTIAL) after creating the mmap might be sufficient to get acceptable behavior. If not, you could also intersperse some MADV_WILLNEED/MADV_DONTNEED access hints (and/or MADV_FREE/MADV_COLD) during the traversal as you pass groups of pages.
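A minimal sketch of that pattern, assuming a large read-only file and an arbitrary 256 MiB advisory window (error handling trimmed to the essentials):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) { perror("open/fstat"); return 1; }

    /* Shared, read-only mapping of the whole file. */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint sequential access: aggressive readahead, and pages may be freed
       soon after they are used. */
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    const size_t chunk = 256UL * 1024 * 1024;
    unsigned long sum = 0;
    for (off_t off = 0; off < st.st_size; off += chunk) {
        size_t len = (size_t)(st.st_size - off < (off_t)chunk ? st.st_size - off
                                                              : (off_t)chunk);
        for (size_t i = 0; i < len; i++)
            sum += p[off + i];                 /* touch every byte in the window */
        /* Done with this window: the kernel is free to reclaim these pages. */
        madvise(p + off, len, MADV_DONTNEED);
    }
    printf("checksum: %lu\n", sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}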
What happens once a graph exceeds the available RAM? Persistence is guaranteed through snapshots and the WAL, but at some point we will likely hit a limit on how much of the graph can be held in memory.
Is Memgraph aware of the completeness of the graph in memory? If so, does it have strategies to offload less-used paths from memory? And if that’s the case, how would it guarantee that a query returns a complete result (i.e., that it has not missed offloaded segments of the graph)?
If file storage-based queries are part of the strategy, how much of a performance hit is to be expected?
Memgraph stores the entire graph inside RAM; in other words, there has to be enough memory to store the whole dataset. That was an early strategic decision because we didn’t want to sacrifice performance, and at this point there is no other option. Memgraph is durable because of snapshots and WALs, plus there are memory limits in place (when the limit is reached, Memgraph will stop accepting writes). A side note, but essential to mention: graph algorithms usually use the entire dataset multiple times, which means you have to bring the data into memory either way (Memgraph is really optimized for that case).
There are drawbacks to the above approach. In the great majority of cases you can find enough RAM, but it’s not the cheapest option.
If not using mmap(), it seems like there should be a way to give certain files "priority", so that the only time they're swapped out is for page faults trying to bring in, e.g., executing code, or memory that was malloc()'d by some process, but never other files. One can think of situations where this could be useful. Consider search engines, which should keep their index files in cache, but which may be simultaneously writing new files (not being used for search).
There are a few ways.
The best way is with madvise(), which allows you to inform the kernel that you will need a particular range of memory soon, which gives it priority over other memory. You can also use it to say that a particular range will not be needed soon, so it should be swapped out sooner.
The hack way is with mlock(), which forces a range of memory to stay in RAM. This is generally not a good idea, and should only be used in special cases. The most common case is to store passwords in RAM so that the password cannot be recovered from the swap file after the computer is powered off. I would not use mlock() for performance tuning unless I had exhausted other options.
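For completeness, a tiny sketch of that special case (the buffer name and size are arbitrary; explicit_bzero() would be preferable to memset() where available, since the latter can be optimized away):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    static char secret[64];                    /* must never reach the swap file */

    /* Pin the pages holding the buffer in RAM.  This can fail if the process
       lacks CAP_IPC_LOCK or would exceed RLIMIT_MEMLOCK. */
    if (mlock(secret, sizeof secret) != 0) {
        perror("mlock");
        return 1;
    }

    /* ... read the password into `secret` and use it ... */

    memset(secret, 0, sizeof secret);          /* scrub before unpinning */
    munlock(secret, sizeof secret);
    return 0;
}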
The worst way is to constantly poke memory, forcing it to stay fresh.
Let's say 4 threads are running on 4 separate cores of a multicore x86 processor, and they do not share any data. Is it possible to programmatically make the 4 cores use separate, predefined portions of the shared L2 cache?
Let's use two terms, exclusive and shared caches, instead of L1, L2, L3, L4 caches, because different CPU families start to share cache at different levels. In these terms the original question becomes: is it possible to split a shared cache into parts, each of which will be used exclusively by one of the CPUs/cores? There is no clear answer; in fact, there are two answers opposite to each other.
1) First and general answer: NO.
The cache is, by design, managed in hardware. There are only a few cache control levers accessible in software, such as enabling/disabling the cache for all memory or for a defined memory region, or applying a specified cache-flushing policy (write-through/write-back). The answer is NO basically because the cache was designed to be managed in hardware, so there is no useful interface that would allow managing it gracefully in software.
2) Second answer: Yes.
In fact, the cache is designed in such a way that each cache line can hold data only from a specified set of memory lines. Because of this, if the memory manager guarantees that one CPU/core exclusively owns and uses all of the memory lines assigned to a given cache line, then it also guarantees that this cache line will be used by that CPU exclusively. It is a very tricky workaround, it has very limited benefits, and it has serious drawbacks: the memory layout becomes very fragmented, cache usage is unbalanced, memory management gets complicated, and it is very hardware-dependent (details can be found in the paper provided by "MetallicPriest").
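To make the arithmetic behind that workaround (essentially page coloring) concrete, here is a small illustration; the cache geometry below is hypothetical, and real parameters would come from CPUID or /sys/devices/system/cpu/*/cache:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical shared cache: 2 MiB, 16-way, 64-byte lines => 2048 sets.
   "Way size" = 2048 sets * 64 bytes = 128 KiB, so physical addresses that
   differ by a multiple of 128 KiB compete for the same sets. */
#define LINE_SIZE 64u
#define NUM_SETS  2048u

static unsigned cache_set(uintptr_t phys_addr) {
    return (unsigned)((phys_addr / LINE_SIZE) % NUM_SETS);
}

int main(void) {
    uintptr_t page_a = 0x10000;   /* example physical page addresses        */
    uintptr_t page_b = 0x30000;   /* 128 KiB apart => same "color" as page_a */
    uintptr_t page_c = 0x11000;   /* adjacent page => different sets         */

    printf("page_a -> set %u\n", cache_set(page_a));   /* 1024 */
    printf("page_b -> set %u\n", cache_set(page_b));   /* 1024 */
    printf("page_c -> set %u\n", cache_set(page_c));   /* 1088 */

    /* A page-coloring allocator hands each core only pages of "its" colors,
       so their lines never collide in the shared cache. */
    return 0;
}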
To sum up: it is possible in theory and almost impossible in practice.
For a process, I have set a soft limit value of 335544320 and a hard limit value of 1610612736 for the resource RLIMIT_AS. Even after setting these values, the address space of the process only goes up to a maximum of 178 MB. But I am able to see that the soft and hard limits in /proc/process_number/limits are correctly set to the values above.
I wanted to know whether RLIMIT_AS is working in my OS and would also like to know how I can test for the RLIMIT_AS feature.
CentOS 5.5(64 bit) is the operating system that I am using.
Someone please help me with this. Thank you!
All setrlimit() limits are upper limits. A process is allowed to use as many resources as it needs, as long as it stays under the soft limits. From the setrlimit() manual page:
The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability) may make arbitrary changes to either limit value.
Practically this means that the hard limit is an upper limit for both the soft limit and itself. The kernel only enforces the soft limits during the operation of a process - the hard limits are checked only when a process tries to change the resource limits.
In your case, you specify an upper limit of 320 MB for the address space and your process uses about 180 MB of that - well within its resource limits. If you want your process to grow, you need to do it in its code.
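If you just want to verify that the limit is enforced, a small test along these lines (a sketch reusing the values from the question) should show allocations failing once the address space approaches the 320 MB soft limit:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void) {
    /* Soft limit 320 MiB, hard limit 1.5 GiB, as in the question. */
    struct rlimit rl = { 335544320, 1610612736 };
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    /* Keep allocating and touching 10 MiB blocks; malloc() should start
       failing somewhat before 320 MiB, since code, stack and libraries
       also count toward the address space. */
    size_t total = 0;
    for (;;) {
        void *p = malloc(10 * 1024 * 1024);
        if (p == NULL) {
            printf("allocation failed after ~%zu MiB\n", total / (1024 * 1024));
            break;
        }
        memset(p, 0, 10 * 1024 * 1024);   /* force the pages to be mapped */
        total += 10 * 1024 * 1024;
    }
    return 0;
}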
BTW, resource limits are intended to protect the system - not to tune the behaviour of individual processes. If a process runs into one of those limits, it's often doubtful that it will be able to recover, no matter how good your fault handling is.
If you want to tune the memory usage of your process by e.g. allocating more buffers for increased performance you should do one or both of the following:
ask the user for an appropriate value. This is in my opinion the one thing that should always be possible. The user (or a system administrator) should always be able to control such things, overriding any and all guesswork from your application.
check how much memory is available and try to guess a good amount to allocate.
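On Linux with glibc, one way to implement the second point is via sysconf() (a sketch; what counts as "available" is inherently fuzzy, and _SC_AVPHYS_PAGES is a glibc extension):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long page_size  = sysconf(_SC_PAGESIZE);
    long phys_pages = sysconf(_SC_PHYS_PAGES);     /* total physical pages */
    long avail      = sysconf(_SC_AVPHYS_PAGES);   /* "available" pages    */

    long long total_mb = (long long)phys_pages * page_size / (1024 * 1024);
    long long avail_mb = (long long)avail * page_size / (1024 * 1024);

    printf("physical RAM : %lld MiB\n", total_mb);
    printf("available    : %lld MiB\n", avail_mb);

    /* One possible heuristic: never claim more than, say, a quarter of what
       currently looks free, and let the user override this via config. */
    printf("suggested buffer budget: %lld MiB\n", avail_mb / 4);
    return 0;
}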
As a side note, you can (and should) deal with 32-bit vs 64-bit at compile time. Runtime checks for something like this are error-prone and waste CPU cycles. Keep in mind, however, that the CPU "bitness" does not have any direct relation to the available memory:
32-bit systems do indeed impose a limit (usually in the 1-3 GB range) on the memory that a process can use. That does not mean that this much memory is actually available.
64-bit systems, being relatively newer, usually have more physical memory. That does not mean that a specific system actually has it or that your process should use it. For example, many people have built 64-bit home file servers with 1 GB of RAM to keep the cost down. And I know quite a few people who would be annoyed if a random process forced their DBMS to swap just because it thinks only of itself.