Avoid paging when allocating big blocks of memory in C?

I am writing an N-body simulation in C using the Barnes-Hut algorithm which requires using big blocks of memory. I am going for speed and efficiency. Is there any way to guarantee that these blocks of memory will stay in RAM and not get paged to the hard drive?
Edit: I would like to allocate as much as 2 GB; however, it is conceivable that I may end up running some simulations that need much more memory.
Edit: The solution should support Windows 7 (maybe Windows 8 when it comes out?) and Ubuntu.

There are operating system primitives that do what you want: mlock on Unix (of which Ubuntu is but one example¹), and VirtualLock on Windows. (Ignore the quibbling in the comments over the exact semantics of VirtualLock; they're irrelevant for your use case.)
The Unix primitive requires root privilege in the calling process (some systems permit locking down a small amount of memory without privilege, but you want far more than that). The Windows primitive appears not to require special privileges.
¹ "Linux is not UNIX" objection noted and ignored with prejudice.

For Linux: mlock(2) will do the job.
https://www.kernel.org/doc/man-pages/online/pages/man2/mlock.2.html
But beware that the amount of memory an unprivileged user can mlock is normally limited on standard systems; check it with ulimit -l.
The Windows version is VirtualLock. I do not know whether there is a limit or how it can be queried.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366895%28v=vs.85%29.aspx
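On the Linux side, the limit that ulimit -l reports can also be queried from inside the program; a small sketch using getrlimit(RLIMIT_MEMLOCK), which returns the limit in bytes:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("RLIMIT_MEMLOCK (soft): unlimited\n");
    else
        printf("RLIMIT_MEMLOCK (soft): %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
    return 0;
}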

Related

Profiling resident memory usage and many page faults in C++ program on linux

I am trying to figure out why the resident memory for one version of a program ("new") is much higher (5x) than for another version of the same program ("baseline"). The program is running on a Linux cluster with E5-2698 v3 CPUs and is written in C++. The baseline is a multiprocess program, and the new one is a multithreaded program; they both fundamentally run the same algorithm and computation and operate on the same input data, etc. In both, there are as many processes or threads as cores (64), with threads pinned to CPUs.
I've done a fair amount of heap profiling using both Valgrind Massif and Heaptrack, and they show that the memory allocation is the same (as it should be). The RSS for both the baseline and the new version of the program is larger than the LLC.
The machine has 64 cores (hyperthreads). For both versions, I straced relevant processes and found some interesting results. Here's the strace command I used:
strace -k -p <pid> -e trace=mmap,munmap,brk
Here are some details about the two versions:
Baseline Version:
64 processes
RES is around 13 MiB per process
using hugepages (2MB)
no malloc/free-related syscalls were made from the strace call listed above (more on this below)
top output (screenshot not reproduced here)
New Version
2 processes
32 threads per process
RES is around 2 GiB per process
using hugepages (2MB)
this version makes a fair number of memcpy calls on large buffers (25 MB) using the default memcpy implementation (which, I think, is supposed to use non-temporal stores, but I haven't verified this)
in release and profile builds, many mmap and munmap calls were generated. Curiously, none were generated in debug mode. (more on that below).
top output, same columns as baseline (screenshot not reproduced here)
Assuming I'm reading this right, the new version has 5x higher RSS in aggregate (entire node) and significantly more page faults as measured using perf stat when compared to the baseline version. When I run perf record/report on the page-faults event, it shows that all of the page faults are coming from a memset in the program. However, the baseline version has that memset as well, and there are no page faults due to it (as verified using perf record -e page-faults). One idea is that there's some other memory pressure for some reason that's causing the memset to page-fault.
So, my question is how can I understand where this large increase in resident memory is coming from? Are there performance monitor counters (i.e., perf events) that can help shed light on this? Or, is there a heaptrack- or massif-like tool that will allow me to see what is the actual data making up the RES footprint?
One of the most interesting things I noticed while poking around is the inconsistency of the mmap and munmap calls as mentioned above. The baseline version didn't generate any of those; the profile and release builds (basically, -march=native and -O3) of the new version DID issue those syscalls but the debug build of the new version DID NOT make calls to mmap and munmap (over tens of seconds of stracing). Note that the application is basically mallocing an array, doing compute, and then freeing that array -- all in an outer loop that runs many times.
It might seem that the allocator is able to easily reuse the allocated buffer from the previous outer loop iteration in some cases but not others -- although I don't understand how these things work nor how to influence them. I believe allocators have a notion of a time window after which application memory is returned to the OS. One guess is that in the optimized code (release builds), vectorized instructions are used for the computation and it makes it much faster. That may change the timing of the program such that the memory is returned to the OS; although I don't see why this isn't happening in the baseline. Maybe the threading is influencing this?
(As a shot-in-the-dark comment, I'll also say that I tried the jemalloc allocator, both with default settings and with changed settings, and I got a 30% slowdown with the new version but no change on the baseline when using jemalloc. I was a bit surprised here as my previous experience with jemalloc was that it tends to produce some speedup with multithreaded programs. I'm adding this comment in case it triggers some other thoughts.)
In general: GCC can optimize malloc+memset into calloc, which leaves pages untouched. If you only actually touch a few pages of a large allocation, that optimization not happening could account for a big difference in page faults (a small illustration follows below).
Or does the change between versions maybe let the system use transparent hugepages differently, in a way that happens to not be good for your workload?
Or maybe just different allocation / free is making your allocator hand pages back to the OS instead of keeping them in a free list. Lazy allocation means you get a soft page fault on the first access to a page after getting it from the kernel. strace to look for mmap / munmap or brk system calls.
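For the first of those possibilities, here is a tiny illustration of the pattern in question (the function name is invented, and whether the compiler actually rewrites malloc+memset into calloc depends on the compiler version and optimization flags):

#include <stdlib.h>
#include <string.h>

/* With calloc (or the optimized form of this), untouched pages of a large
 * allocation can remain copy-on-write zero pages and never fault; an
 * explicit memset that survives optimization touches every page up front. */
double *make_zeroed(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a != NULL)
        memset(a, 0, n * sizeof *a);   /* may be folded into calloc */
    return a;
}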
In your specific case, your strace testing confirms that your change led to malloc / free handing pages back to the OS instead of keeping them on a free list.
This fully explains the extra page faults. A backtrace of munmap calls could identify the guilty free calls. To fix it, see https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html / http://man7.org/linux/man-pages/man3/mallopt.3.html, especially M_MMAP_THRESHOLD (perhaps raise it to get glibc malloc not to use mmap for your arrays?). I haven't played with the parameters before. The man page mentions something about a dynamic mmap threshold.
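As a sketch only (glibc-specific, and the thresholds below are arbitrary examples rather than recommendations), the tuning would look roughly like this; note that setting M_MMAP_THRESHOLD via mallopt() also disables glibc's dynamic adjustment of that threshold:

#include <malloc.h>

int main(void)
{
    /* Serve allocations up to 64 MiB from the heap instead of mmap,
     * so free() keeps them on the free list rather than munmap'ing. */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024);
    /* Optionally keep more freed heap memory before trimming it back. */
    mallopt(M_TRIM_THRESHOLD, 128 * 1024 * 1024);

    /* ... run the allocation-heavy outer loop here ... */
    return 0;
}

The mallopt(3) man page also documents corresponding environment variables (MALLOC_MMAP_THRESHOLD_ and MALLOC_TRIM_THRESHOLD_) for trying these values without recompiling.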
It doesn't explain the extra RSS; are you sure you aren't accidentally allocating 5x the space? If you aren't, perhaps better alignment of the allocation lets the kernel use transparent hugepages where it didn't before, maybe leading to wasting up to 1.99 MiB at the end of an array instead of just under 4k? Or maybe Linux wouldn't use a hugepage if you only allocated the first couple of 4k pages past a 2M boundary.
If you're getting the page faults in memset, I assume these arrays aren't sparse and that you are touching every element.
I believe allocators have a notion of a time window after which application memory is returned to the OS
It would be possible for an allocator to check the current time every time you call free, but that's expensive so it's unlikely. It's also very unlikely that they use a signal handler or separate thread to do a periodic check of free-list size.
I think glibc just uses a size-based heuristic that it evaluates on every free. As I said, the man page mentions something about heuristics.
IMO actually tuning malloc (or finding a different malloc implementation) that's better for your situation should probably be a different question.

How long does memory written with memset() stay in memory without calling free()?

On Linux.
Hi. I'm sure there are many factors involved where the OS simply garbage-dumps memory allocated with memset() without calling free(), but I was wondering if anyone has a good estimation on this? That's really all I want to know.
There is functionality in Linux called KSM that saves memory space by combining matching data. My question revolves around detecting whether KSM is working or not by checking the write time of the data. I have already successfully tested this on a machine while running everything in one program. Now I want to load the data into memory, close the program, then open another program and test for memory duplication.
Thanks!
-Taylor
memset does not allocate memory, malloc does
the memory is not freed until a free call or the process terminates
there is no abstract machine in C; that's the design principle of the language
Let's talk about abstractions:
A C programmer writes software for a "C abstract machine". This has nothing to do with any real hardware.
The "C abstract machine" is converted into something (e.g. an executable file) that runs in some kind of "process" abstraction. This "process" abstraction has nothing to do with any real hardware (it uses "threads" and not real CPUs, "virtual memory" and not real RAM, "files" and not real disk space, ...).
The OS creates the "process" abstraction on top of a machine. For KSM (where the OS is running inside a virtual machine) this "virtual machine" abstraction has nothing to do with any real hardware.
Now; let's define "abstraction" as a deliberate lie intended to shield people from reality (and let's also define "security vulnerability" as a flaw in the lie).
To determine the relationship between "memory in the C abstraction machine" and actual physical resources (RAM chips, disk space, etc) at any point in time; you need to break through a minimum of 3 barriers deliberately designed to prevent you from knowing the relationship.

Getting as much uninitialized memory as possible

I'm trying to create a C/C++ program that dumps as much uninitialized memory as possible.
The program has to be run by a local user, i.e., in user mode.
Using malloc does not work:
Why does malloc initialize the values to 0 in gcc?
The goal is not to use this data as a seed for randomness.
Does the OS always make sure that you can't see "leftovers" from other processes?
If possible, I would like references to implementations or further explanation.
The most common multi-user operating systems (modern Windows, Linux, other Unix variants, VMS--probably all OSes with a concept of virtual memory) try to isolate processes from one another for security. If process A could read process B's leftover memory, it might get access to user data it shouldn't have, so these operating systems will clear pages of memory before they become available to a new process. You would probably have to have elevated privileges to get at uninitialized RAM, and the solution would likely depend on which operating system it was.
Embedded OSes, DOS, and ancient versions of Windows generally don't have the facilities for protecting memory. But they also don't have a concept of virtual memory or of strong process isolation. On these, just allocating memory through the usual methods (e.g., malloc) would give you uninitialized memory without you having to do anything special.
For more information on Windows, you can search for "Windows zero page thread" to learn about the OS thread whose only job is to write zeros to unused pages so that they can be doled out again. Also, Windows has a feature called SuperFetch which fills up unused RAM with files that Windows predicts you'll want to open soon. If you allocated memory and Windows decided to give you a SuperFetch page, there would be a risk that you'd see the contents of a file you don't have access to read. This is another reason why pages must be cleared before they can be allocated to a process.
You got uninitialized memory. It contains indeterminate values. In your case those values are all 0. Nothing unexpected. If you want pseudo-random numbers use a PRNG. If you want real random numbers/entropy, use a legitimate random source like your operating system's random number device (e.g. /dev/urandom) or API.
No operating system in its right mind is going to provide uninitialized memory to a process.
The closest thing you are going to find is the stack. That memory will have been initialized when mapped to the process but much of it will have been overwritten.
It's common sense. We don't need to document that 1+1=2 either.
An operating system that leaks secrets between processes would be useless for many applications. So an operating system that wants to be general purpose will isolate processes. Keeping track of which pages might contain secrets and which are safe would be too much work and too error-prone, so we assume that every page that has ever been used is dirty and contains secrets. Initializing new pages with garbage is slower than initializing them with just one value, so random garbage isn't used. The most useful value is zero (for calloc or bss, for example), so new pages are zeroed to clear them.
There's really no other way to do it.
There might be special-purpose operating systems that don't do it and do leak secrets between processes (it might be necessary for real-time requirements, for example). Some older operating systems didn't have decent memory management and privilege isolation. Also, malloc will reuse previously freed memory within the same process, so malloc is documented as returning memory that may contain uninitialized garbage. But that doesn't mean you'll ever be able to obtain uninitialized memory from another process on a general-purpose operating system.
I guess a simple rule of thumb is: if your operating system ever asks you for a password it will not give uninitialized pages to a process and since zeroing is the only reasonable way to initialize pages, they will be zeroed.

how to fix a memory size for my application in C?

I would like to allocate a fixed amount of memory for my application (developed using C). Say my application should not exceed 64 MB of memory. I should also avoid using too much CPU. How is this possible?
Regards
Marcel.
Under Unix: "ulimit -d 65536" (the data segment limit is given in kilobytes).
One fairly low-tech way to ensure you don't cross a maximum memory threshold in your application would be to define your own special malloc() function which keeps count of how much memory has been allocated, and returns a NULL pointer if the threshold has been exceeded. This would of course rely on you checking the return value of malloc() every time you call it, which is generally considered good practice anyway because there is no guarantee that malloc() will find a contiguous block of memory of the size that you requested.
This wouldn't be foolproof though, because it probably won't take into account memory padding for word alignment, so you'd probably end up reaching the 64MB memory limit long before your function reports that you have reached it.
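A rough, hypothetical sketch of such a wrapper (the names cap_malloc/cap_free and the 64 MB cap are invented for illustration; it is not thread-safe and, as noted above, it does not count allocator overhead or padding):

#include <stddef.h>
#include <stdlib.h>

#define MEM_CAP (64UL * 1024 * 1024)   /* 64 MB budget */

static size_t used_bytes;

void *cap_malloc(size_t n)
{
    if (n > MEM_CAP - used_bytes)
        return NULL;                   /* over budget: caller must check */
    /* Reserve one max_align_t-sized header to remember the size while
     * keeping the returned pointer suitably aligned. */
    max_align_t *p = malloc(sizeof *p + n);
    if (p == NULL)
        return NULL;
    *(size_t *)p = n;
    used_bytes += n;
    return p + 1;
}

void cap_free(void *ptr)
{
    if (ptr == NULL)
        return;
    max_align_t *p = (max_align_t *)ptr - 1;
    used_bytes -= *(size_t *)p;
    free(p);
}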
Also, assuming you are using Win32, there are probably APIs that you could use to get the current process size and check this within your custom malloc() function. Keep in mind that adding this checking overhead to your code will most likely cause it to use more CPU and run a lot slower than normal, which leads nicely into your next question :)
I should also avoid using too much CPU.
This is a very general question and there is no easy answer. You could write two different programs which essentially do the same thing, and one could be 100 times more CPU intensive than another one due to the algorithms that have been used. The best technique is to:
1. Set some performance benchmarks.
2. Write your program.
3. Measure to see whether it reaches your benchmarks.
4. If it doesn't reach your benchmarks, optimise and go to step 3.
You can use profiling programs to help you work out where your algorithms need to be optimised. Rational Quantify is an example of a commercial one, but there are many free profilers out there too.
If you are on a POSIX, System V-, or BSD-derived system, you can use setrlimit() with the RLIMIT_DATA resource, similar to ulimit -d.
Also take a look at the RLIMIT_CPU resource; it's probably what you need (similar to ulimit -t).
Check man setrlimit for details.
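A small sketch of those two calls with illustrative limits (64 MB of data, 60 seconds of CPU); note that exceeding RLIMIT_CPU delivers SIGXCPU, and that on older Linux kernels RLIMIT_DATA did not cover mmap-based allocations, so glibc's large mallocs could slip past it:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit mem = { 64UL * 1024 * 1024, 64UL * 1024 * 1024 };
    struct rlimit cpu = { 60, 60 };    /* seconds of CPU time */

    if (setrlimit(RLIMIT_DATA, &mem) != 0)
        perror("setrlimit(RLIMIT_DATA)");
    if (setrlimit(RLIMIT_CPU, &cpu) != 0)
        perror("setrlimit(RLIMIT_CPU)");

    /* ... application work; allocations beyond the cap now fail ... */
    return 0;
}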
For CPU, we've had a very low-priority task (lower than everything else) that does nothing but count. Then you can see how often that task gets to run, and you know if the rest of your processes are consuming too much CPU. This approach doesn't work if you want to limit your process to 10% while other processes are running, but if you want to ensure that you have 50% CPU free then it works fine.
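A rough sketch of that idle-counter idea, purely illustrative (a real version would first calibrate the count rate on an otherwise idle machine, and probably pin the task to a CPU):

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/resource.h>

static volatile sig_atomic_t stop;
static void on_alarm(int sig) { (void)sig; stop = 1; }

int main(void)
{
    setpriority(PRIO_PROCESS, 0, 19);  /* lowest conventional priority */
    signal(SIGALRM, on_alarm);
    alarm(10);                         /* count for 10 seconds */

    unsigned long long count = 0;
    while (!stop)
        count++;

    /* Compare this against the count reached on an idle machine to
     * estimate how much CPU headroom is left over. */
    printf("low-priority counter reached %llu in 10 s\n", count);
    return 0;
}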
For memory limitations you are either stuck implementing your own layer on top of malloc, or taking advantage of your OS in some way. On Unix systems ulimit is your friend. On VxWorks I bet you could probably figure out a way to take advantage of the task control block to see how much memory the application is using... if there isn't already a function for that. On Windows you could probably at least set up a monitor to report if your application does go over 64 MB.
The other question is: what do you do in response? Should your application crash if it exceeds 64 MB? Do you want this just as a guide to help you limit yourself? That might make the difference between choosing an "enforcing" approach and a "monitor and report" approach.
Hmm; good question. I can see how you could do this for memory allocated off the heap, using a custom version of malloc and free, but I don't know about enforcing it on the stack too.
Managing the CPU is harder still...
Interesting.

C/C++ memory usage API in Linux/Windows

I'd like to obtain memory usage information for both per process and system wide. In Windows, it's pretty easy. GetProcessMemoryInfo and GlobalMemoryStatusEx do these jobs greatly and very easily. For example, GetProcessMemoryInfo gives "PeakWorkingSetSize" of the given process. GlobalMemoryStatusEx returns system wide available memory.
However, I need to do it on Linux. I'm trying to find Linux system APIs that are equivalent to GetProcessMemoryInfo and GlobalMemoryStatusEx.
I found 'getrusage'. However, 'ru_maxrss' (maximum resident set size) in struct rusage is just zero; it is not implemented. Also, I have no idea how to get system-wide free memory.
As a current workaround, I'm using system("ps -p %my_pid -o vsz,rsz"); and manually logging the output to a file. But it's dirty, and the data is not convenient to process.
I'd like to know some fancy Linux APIs for this purpose.
You can see how it is done in libstatgrab.
And you can also use it (GPL)
Linux has a (modular) filesystem interface for fetching such data from the kernel, which makes it usable by nearly any language or scripting tool.
Memory can be complex. There's the program executable itself, presumably mmap()'ed in. Shared libraries. Stack utilization. Heap utilization. Portions of the software resident in RAM. Portions swapped out. Etc.
What exactly is "PeakWorkingSetSize"? It sounds like the maximum resident set size (the maximum non-swapped physical-memory RAM used by the process).
Though it could also be the total virtual memory footprint of the entire process (sum of the in-RAM and SWAPPED-out parts).
Regardless, under Linux, you can strace a process to see its kernel-level interactions. "ps" gets its data from /proc/${PID}/* files.
I suggest you cat /proc/${PID}/status. The Vm* lines are quite useful.
Specifically: VmData refers to process heap utilization. VmStk refers to process stack utilization.
If you continue using "ps", you might consider popen().
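For completeness, a small sketch that reads those Vm* lines from within the process itself rather than shelling out; VmHWM (peak resident set size) is probably the closest analogue of Windows' PeakWorkingSetSize:

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof line, f) != NULL) {
        /* The kernel reports these values in kB. */
        if (strncmp(line, "VmPeak:", 7) == 0 ||
            strncmp(line, "VmHWM:", 6) == 0 ||
            strncmp(line, "VmRSS:", 6) == 0 ||
            strncmp(line, "VmData:", 7) == 0 ||
            strncmp(line, "VmStk:", 6) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}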
I have no idea how to get system-wide free memory.
There's always /usr/bin/free
Note that Linux will make use of unused memory for buffering files and caching... Thus the +/-buffers/cache line.
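The numbers free prints come from /proc/meminfo, so a minimal sketch for reading the system-wide figures directly might look like this (MemAvailable only exists on reasonably recent kernels and is the better "free" figure, since it accounts for reclaimable cache):

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "MemTotal:", 9) == 0 ||
            strncmp(line, "MemFree:", 8) == 0 ||
            strncmp(line, "MemAvailable:", 13) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}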
