How to determine the maximum possible Threads and Blocks for a memory-heavy CUDA application?

I'm trying to find the optimal values of threads and blocks for my application. Therefore I wrote a small suite to run possible combinations of thread count, blocksize and gridsize. The task I'm working with is not parallelizable, so every thread computes its own unique problem and needs read and write access to a unique chunk of global memory for it. I also had to increase cudaLimitStackSize for my kernel to run.
I'm running into problems when I try to calculate the maximum number of threads I can run at once. My refined approach (thanks to Robert Crovella) is
threads = (freememory*0.9)/memoryperthread
where freememory is acquired from cudaMemGetInfo and memoryperthread is the global memory requirement for one thread. Even if I decrease the constant factor, I still encounter "unspecified launch failure", which I can't debug because the debugger fails with Error: Internal error reported by CUDA debugger API (error=1). The application cannot be further debugged. Depending on the settings, this error shows up at different points.
I'm also encountering a problem when I try different blocksizes. Any blocksize larger than 512 threads yields "too many resources requested for launch". As Robert Crovella pointed out, this may be a problem of my kernel occupying too many registers (63, as reported by -Xptxas="-v"). Since blocks can be spread across several multiprocessors, I sadly can't find any limitation that would suddenly be hit with a blocksize of 1024.
My code runs fine for small values of threads and blocks, but I seem to be unable to compute the maximum numbers I could run at the same time. Is there any way to properly compute those, or do I need to determine them empirically?
I know that memory heavy tasks aren't optimal for CUDA.
My device is a GTX480 with Compute Capability 2.0. For now I'm stuck with CUDA Driver Version = 6.5, CUDA Runtime Version = 5.0.
I do compile with -gencode arch=compute_20,code=sm_20 to enforce the Compute Capability.
Update: Most of the aforementioned problems went away after updating the runtime to 6.5. I will leave this post the way it is, since I mention the errors I encountered and people may stumble upon it when searching for their error. To solve the problem with large blocksizes I had to reduce the registers per thread (-maxrregcount).

threads = totalmemory/memoryperthread
If your calculation for memoryperthread is accurate, this won't work, because totalmemory is generally not all available. The amount you can actually allocate is less than this, due to CUDA runtime overhead, allocation granularity, and other factors. So that is going to fail somehow, but since you've provided no code, it's impossible to say exactly how. If you were doing all of this allocation from the host, e.g. via cudaMalloc, then I would expect an error there, not a kernel unspecified launch failure. But if you are doing in-kernel malloc or new, then it's possible that you are trying to use a returned null pointer (indicating an allocation failure, i.e. out of memory), and that would probably lead to an unspecified launch failure.
having a blocksize larger than 512 threads yields "too many resources requested for launch".
This is probably either because you are not compiling for a cc2.0 device, or because your kernel uses more registers per thread than can be supported. Either way, it is certainly a solvable problem.
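As a rough sanity check, you can compare the per-thread register count reported by -Xptxas="-v" against the device's register budget per block. A minimal host-side sketch (the 63 is the number reported for your kernel; nothing here is from your actual code) might look like this:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const int REGS_PER_THREAD = 63;   /* from -Xptxas="-v" for this kernel */
        struct cudaDeviceProp prop;

        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed\n");
            return 1;
        }

        /* prop.regsPerBlock is the register file available to a single block */
        int maxThreadsByRegs = prop.regsPerBlock / REGS_PER_THREAD;
        printf("regs per block: %d -> at most ~%d threads per block at %d regs/thread\n",
               prop.regsPerBlock, maxThreadsByRegs, REGS_PER_THREAD);
        return 0;
    }

On a cc2.0 device with 32768 registers per block, 63 registers per thread allows roughly 520 threads per block, which is consistent with 512 working and 1024 failing; reducing registers per thread with -maxrregcount (or __launch_bounds__) is what lets larger blocks launch.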
So how would one properly calculate the maximum possible threads and blocks for a kernel?
Often, global memory requirements are a function of the problem, not of the kernel size. If your global memory requirements scale up with kernel size, then there is probably some ratio that can be determined based on the "available memory" reported by cudaMemGetInfo (e.g. 90%) that should give reasonably safe operation. But in general, a program is well designed if it is tolerant of allocation failures, and you should at least be checking for these explicitly in host code and device code, rather than depending on "unspecified launch failure" to tell you that something has gone wrong. That failure could be any sort of side-effect bug triggered by memory usage, and may not be directly due to an allocation failure.
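For example, a minimal host-side sketch of that approach (MEM_PER_THREAD and the 90% factor are assumptions to tune, and the CUDA_CHECK macro is just an illustrative way of failing loudly at the point of the error instead of later in the kernel):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CUDA_CHECK(call)                                           \
        do {                                                           \
            cudaError_t e_ = (call);                                   \
            if (e_ != cudaSuccess) {                                   \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                        cudaGetErrorString(e_));                       \
                exit(EXIT_FAILURE);                                    \
            }                                                          \
        } while (0)

    #define MEM_PER_THREAD ((size_t)4096)   /* hypothetical bytes of global memory per thread */

    int main(void)
    {
        size_t freeMem = 0, totalMem = 0;
        CUDA_CHECK(cudaMemGetInfo(&freeMem, &totalMem));

        /* leave headroom for runtime overhead and allocation granularity */
        size_t threads = (size_t)(freeMem * 0.9) / MEM_PER_THREAD;
        printf("free: %zu bytes -> at most ~%zu threads' worth of buffers\n",
               freeMem, threads);

        void *d_buf = NULL;
        CUDA_CHECK(cudaMalloc(&d_buf, threads * MEM_PER_THREAD));   /* checked explicitly */
        CUDA_CHECK(cudaFree(d_buf));
        return 0;
    }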
I would suggest tracking down these issues. Debug the problem, find the source of the issue. I think the correct solution will then present itself.

Related

Profiling resident memory usage and many page faults in a C++ program on Linux

I am trying to figure out why my resident memory for one version of a program ("new") is much higher (5x) than for another version of the same program ("baseline"). The program is running on a Linux cluster with E5-2698 v3 CPUs and is written in C++. The baseline is a multiprocess program and the new one is a multithreaded program; they are both fundamentally running the same algorithm and computation and operating on the same input data. In both, there are as many processes or threads as cores (64), with threads pinned to CPUs. I've done a fair amount of heap profiling using both Valgrind Massif and Heaptrack, and they show that the memory allocation is the same (as it should be). The RSS for both the baseline and the new version of the program is larger than the LLC.
The machine has 64 cores (hyperthreads). For both versions, I straced relevant processes and found some interesting results. Here's the strace command I used:
strace -k -p <pid> -e trace=mmap,munmap,brk
Here are some details about the two versions:
Baseline Version:
64 processes
RES is around 13 MiB per process
using hugepages (2MB)
no malloc/free-related syscalls were made from the strace call listed above (more on this below)
top output
New Version
2 processes
32 threads per process
RES is around 2 GiB per process
using hugepages (2MB)
this version does a fair number of memcpy calls on large buffers (25 MB) using the default memcpy implementation (which, I think, is supposed to use non-temporal stores, but I haven't verified this)
in release and profile builds, many mmap and munmap calls were generated. Curiously, none were generated in debug mode. (more on that below).
top output (same columns as baseline)
Assuming I'm reading this right, the new version has 5x higher RSS in aggregate (entire node) and significantly more page faults as measured using perf stat when compared to the baseline version. When I run perf record/report on the page-faults event, it shows that all of the page faults are coming from a memset in the program. However, the baseline version has that memset as well, and there are no page faults due to it (as verified using perf record -e page-faults). One idea is that there's some other memory pressure for some reason that's causing the memset to page-fault.
So, my question is how can I understand where this large increase in resident memory is coming from? Are there performance monitor counters (i.e., perf events) that can help shed light on this? Or, is there a heaptrack- or massif-like tool that will allow me to see what is the actual data making up the RES footprint?
One of the most interesting things I noticed while poking around is the inconsistency of the mmap and munmap calls as mentioned above. The baseline version didn't generate any of those; the profile and release builds (basically, -march=native and -O3) of the new version DID issue those syscalls but the debug build of the new version DID NOT make calls to mmap and munmap (over tens of seconds of stracing). Note that the application is basically mallocing an array, doing compute, and then freeing that array -- all in an outer loop that runs many times.
It might seem that the allocator is able to easily reuse the allocated buffer from the previous outer-loop iteration in some cases but not others, although I don't understand how these things work or how to influence them. I believe allocators have a notion of a time window after which application memory is returned to the OS. One guess is that in the optimized code (release builds), vectorized instructions are used for the computation and make it much faster. That may change the timing of the program such that the memory is returned to the OS, although I don't see why this isn't happening in the baseline. Maybe the threading is influencing this?
(As a shot-in-the-dark comment, I'll also say that I tried the jemalloc allocator, both with default settings and with changed settings, and I got a 30% slowdown with the new version but no change on the baseline when using jemalloc. I was a bit surprised here, as my previous experience with jemalloc was that it tends to produce some speedup with multithreaded programs. I'm adding this comment in case it triggers some other thoughts.)
In general: GCC can optimize malloc+memset into calloc, which leaves pages untouched. If you only actually touch a few pages of a large allocation, that optimization not happening could account for a big difference in page faults.
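To illustrate the pattern in question (a minimal sketch, not your actual code):

    #include <stdlib.h>
    #include <string.h>

    /* GCC may fuse this malloc+memset pair into a single calloc, in which case
     * the zeroed pages are never actually touched by the process. */
    double *make_buffer(size_t n)
    {
        double *p = malloc(n * sizeof *p);
        if (p != NULL)
            memset(p, 0, n * sizeof *p);
        return p;
        /* calloc(n, sizeof(double)) is the explicit form; for large allocations
         * it typically gets already-zeroed pages from the kernel and leaves
         * them untouched until first write. */
    }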
Or does the change between versions maybe let the system use transparent hugepages differently, in a way that happens to not be good for your workload?
Or maybe just a different allocation/free pattern is making your allocator hand pages back to the OS instead of keeping them in a free list. Lazy allocation means you get a soft page fault on the first access to a page after getting it from the kernel. Use strace to look for mmap / munmap or brk system calls.
In your specific case, your strace testing confirms that your change led to malloc / free handing pages back to the OS instead of keeping them on a free list.
This fully explains the extra page faults. A backtrace of munmap calls could identify the guilty free calls. To fix it, see https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html / http://man7.org/linux/man-pages/man3/mallopt.3.html, especially M_MMAP_THRESHOLD (perhaps raise it to get glibc malloc not to use mmap for your arrays?). I haven't played with the parameters before. The man page mentions something about a dynamic mmap threshold.
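For example, a minimal sketch (the 64 MiB value is just a guess chosen to sit above the 25 MB buffers mentioned in the question):

    #include <malloc.h>

    int main(void)
    {
        /* Ask glibc not to serve allocations below 64 MiB via mmap, so the big
         * arrays come from the heap free list and are not munmap'd on every free.
         * Note: setting this explicitly also disables the dynamic threshold. */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024);

        /* ... the rest of the program allocates and frees its arrays as before ... */
        return 0;
    }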
It doesn't explain the extra RSS; are you sure you aren't accidentally allocating 5x the space? If you aren't, perhaps better alignment of the allocation lets the kernel use transparent hugepages where it didn't before, maybe leading to wasting up to 1.99 MiB at the end of an array instead of just under 4k? Or maybe Linux wouldn't use a hugepage if you only allocated the first couple of 4k pages past a 2M boundary.
If you're getting the page faults in memset, I assume these arrays aren't sparse and that you are touching every element.
I believe allocators have a notion of a time window after which application memory is returned to the OS
It would be possible for an allocator to check the current time every time you call free, but that's expensive so it's unlikely. It's also very unlikely that they use a signal handler or separate thread to do a periodic check of free-list size.
I think glibc just uses a size-based heuristic that it evaluates on every free. As I said, the man page mentions something about heuristics.
IMO, actually tuning malloc (or finding a malloc implementation that's better for your situation) should probably be a separate question.

OSDev: Why does my memory allocation function suddenly stop working in the AHCI initialization function?

After my kernel calls the AHCIInit() function inside of the ArchInit() function, I get a page fault in one of the MemAllocate() calls, and this only happens on real machines; I tried replicating it on VirtualBox, VMware and QEMU without success.
I tried debugging the code, unit testing the memory allocator, and removing everything from the kernel except the memory manager and the AHCI driver itself. The only thing I discovered is that something is corrupting the allocation blocks, making MemAllocate() page fault.
The whole kernel source is at https://github.com/CHOSTeam/CHicago-Kernel, but the main files where the problem probably occurs are:
https://github.com/CHOSTeam/CHicago-Kernel/blob/master/mm/alloc.c
https://github.com/CHOSTeam/CHicago-Kernel/blob/master/arch/x86/io/ahci.c
I expected AHCIInit() to detect and initialize all the AHCI devices and the boot to continue until it reaches the session manager or the kernel shell, but on real computers it page faults before even initializing the scheduler (so no, the problem isn't my scheduler).
If it works in emulators but doesn't work on real hardware, then the first things I'd suspect are:
bugs in physical memory management. For example, physical memory manager initialization not rounding "starting address of usable RAM area" up to a page boundary or not rounding "ending address of usable RAM area" down to a page boundary, causing a "half usable RAM and half not usable RAM" page to be allocated by the heap later (where it works on emulators because the memory map provided by firmware happens to describe areas that are nicely aligned anyway).
a bug where RAM is assumed to contain zeros but may not be (where it works on emulators because they tend to leave almost all RAM full of zeros).
a race condition (where different timing causes different behavior).
However; this is a monolithic kernel, which means that you'll continually be facing "any piece of code in kernel-space did something that caused problems for any other piece of code anywhere else"; and there's a bunch of common bugs with memory usage (e.g. accidentally writing past the end of what you allocated). For this reason I'd want better tools to help diagnose problems, especially for heap.
Specifically, for the heap I'd start with canaries (e.g. put a magic number like 0xFEEDFACE before each block of memory in the heap, and another different number after each block of memory in the heap; and then check that the magic numbers are still present and correct where convenient - e.g. when blocks are freed or resized). Then I'd write a "check_heap()" function that scans through everything checking as much as possible (the canaries, if statistics like "number of free blocks" are actually correct, etc). The idea being that (whenever you suspect something might have corrupted the heap) you can insert a call to the "check_heap()" function, and move that call around until you find out which piece of code caused the heap corruption. I'd also suggest having a "what" parameter in your "kmalloc() or equivalent" (e.g. so you can do things like myFooStructure = kmalloc("Foo Structure", sizeof(struct foo));), where the provided "what string" is stored in the allocated block's meta-data, so that later on (when you find out the heap was corrupted) you can display the "what string" associated with the block before the corruption, and so that you can (e.g.) list how many of each type of thing there currently is to help determine what is leaking memory (e.g. if the number of "Foo Structure" blocks is continually increasing). Of course these things can be (should be?) enabled/disabled by compile time options (e.g. #ifdef DEBUG_HEAP).
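A very rough sketch of what such a canary layout could look like (the names, values and layout here are illustrative only, not taken from the CHicago allocator):

    #include <stddef.h>
    #include <stdint.h>

    #define HEAP_CANARY_HEAD 0xFEEDFACEu
    #define HEAP_CANARY_TAIL 0xCAFEBABEu

    typedef struct heap_block {
        uint32_t head_canary;        /* must still equal HEAP_CANARY_HEAD */
        const char *what;            /* e.g. "Foo Structure", for leak hunting */
        size_t size;                 /* user-visible size of the block */
        struct heap_block *next;     /* so check_heap() can walk every block */
        /* user data follows, then a uint32_t tail canary after 'size' bytes */
    } heap_block_t;

    static int heap_block_ok(const heap_block_t *b)
    {
        const uint32_t *tail = (const uint32_t *)((const uint8_t *)(b + 1) + b->size);
        return b->head_canary == HEAP_CANARY_HEAD && *tail == HEAP_CANARY_TAIL;
    }

check_heap() would then walk the block list and call something like heap_block_ok() on every block, reporting the block's "what" string as soon as one fails.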
The other thing I'd recommend is self tests. These are like unit tests, but built directly into the kernel itself and always present. For example, you could write code to pound the daylights out of the heap (e.g. allocate random sized pieces of memory and fill them with something until you run out of memory, then free half of them, then allocate more until you run out of memory again, etc; while calling the "check_heap()" function between each step); where this code could/should take a "how much pounding" parameter (so you could spend a small amount of time doing the self test, or a huge amount of time doing the self test). You could also write code to pound the daylights out of the virtual memory manager, and the physical memory manager (and the scheduler, and ...). Then you could decide to always do a small amount of self testing each time the kernel boots and/or provide a special kernel parameter/option to enable "extremely thorough self test mode".
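A sketch of such a self test (MemAllocate() and check_heap() are the functions discussed above; MemFree() and prng() stand in for whatever free function and random source the kernel actually provides, so the declarations here are guesses):

    #include <stddef.h>

    extern void *MemAllocate(size_t size);   /* the kernel's allocator */
    extern void  MemFree(void *ptr);         /* assumed matching free function */
    extern void  check_heap(void);           /* consistency checker suggested above */
    extern unsigned prng(void);              /* any pseudo-random source */

    #define SELF_TEST_SLOTS 1024

    void heap_self_test(unsigned rounds)
    {
        static void *blocks[SELF_TEST_SLOTS];

        for (unsigned r = 0; r < rounds; r++) {
            unsigned n = 0;

            /* allocate random sizes until we run out of memory or slots */
            while (n < SELF_TEST_SLOTS) {
                blocks[n] = MemAllocate((prng() % 4096) + 1);
                if (blocks[n] == NULL)
                    break;
                n++;
                check_heap();
            }

            /* free every second block first, then the rest, checking as we go */
            for (unsigned i = 0; i < n; i += 2) { MemFree(blocks[i]); check_heap(); }
            for (unsigned i = 1; i < n; i += 2) { MemFree(blocks[i]); check_heap(); }
        }
    }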
Don't forget that eventually (if/when the OS is released) you'll probably have to resort to "remote debugging via email" (e.g. where someone without any programming experience, who may not know very much English, sends you an email saying "OS not work"; and you have to try to figure out what is going wrong before the end user's "amount of hassle before giving up and not caring anymore" counter is depleted).

Questions about parallelism on GPU (CUDA)

I need to give some details about what I am doing before asking my question. I hope my English and my explanations are clear and concise enough.
I am currently working on a massive parallelization of code initially written in C. The reason I was interested in CUDA is the large size of the arrays I was dealing with: the code is a simulation of fluid mechanics and I needed to launch a "time loop" with five to six successive operations on arrays as big as 3×10^9 or 19×10^9 double variables. I went through various tutorials and documentation and I finally managed to write a not-so-bad CUDA code.
Without going into the details of the code, I used relatively small 2D blocks. The number of threads is 18 or 57 (which is awkward, since my warps are not fully occupied).
The kernels are launched on a "big" 3D grid, which describes my physical geometry (the maximal desired size is 1000 values per dimension; that means I want to deal with a 3D grid of about a billion blocks).
Okay, so now my five to six kernels, which are doing the job correctly, make good use of shared memory, since global memory is read once and written once for each kernel (the size of my blocks was actually determined according to the amount of shared memory needed).
Some of my kernels are launched concurrently (called asynchronously), but most of them need to be successive. There are several memcpys from device to host, but the ratio of memcpys to kernel calls is quite low. I am mostly executing operations on my array values.
Here is my question :
If I understood correctly, all of my blocks are doing the job on the arrays at the same time. So does that mean dealing with a 10-block grid, a 100-block grid or a billion-block grid will take the same amount of time? The answer is obviously no, since the computation time is significantly longer when I am dealing with large grids. Why is that?
I am using a relatively modest NVIDIA device (NVS 5200M). I was trying to get used to CUDA before getting bigger/more efficient devices.
Since I went through all the optimization and CUDA programming advice/guides by myself, I may have completely misunderstood some points. I hope my question is not too naive...
Thanks!
If I understood correctly, all of my blocks are doing the job on the arrays at the same time.
No, they don't all run at the same time! How many thread blocks can run concurrently depends on several things, all affected by the compute capability of your device - the NVS 5200M should be cc2.1.
A CUDA-enabled GPU has an internal scheduler that manages where and when each thread block (and the warps of the blocks) will run. "Where" means on which streaming multiprocessor (SM) the block will be launched.
Every SM has a limited amount of resources - shared memory and registers, for example. The Programming Guide and the Occupancy Calculator give a good overview of these limitations.
The first limitation is that for cc2.1 an SM can run up to 8 thread blocks at the same time. Depending on your usage of registers, shared memory and so on, the number may decrease.
If I remember correctly, a cc2.1 SM consists of 48 CUDA cores, so your NVS 5200M with 96 cores should have two SMs. Let's assume that with your kernel setup, N (N <= 8) thread blocks fit on an SM at the same time. The internal scheduler launches the first blocks and queues up all the other thread blocks; when a thread block has finished its work, the next one from the queue is launched. So as long as the total number of blocks does not exceed what the SMs can hold resident at once, the kernel run time stays roughly the same; launch more blocks than that and the run time increases.
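To make that concrete, here is a toy calculation of the effect (the numbers are assumptions, not measurements of your device):

    #include <stdio.h>

    int main(void)
    {
        const int blocks_per_sm = 8;   /* cc2.1 upper limit; a heavy kernel may allow fewer */
        const int num_sms = 2;         /* assumed for an NVS 5200M */
        const int concurrent = blocks_per_sm * num_sms;

        /* run time grows in "waves" of ceil(total / concurrent), not per block */
        for (int total = 1; total <= 64; total *= 2) {
            int waves = (total + concurrent - 1) / concurrent;
            printf("%2d blocks -> about %d wave(s) of work\n", total, waves);
        }
        return 0;
    }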

GLib handle out of memory

I have a question concerning GLib.
I would like to use GLib in a server context, but I'm not sure how its memory is managed:
https://developer.gnome.org/glib/stable/glib-Memory-Allocation.html
If any call to allocate memory fails, the application is terminated. This also means that there is no need to check if the call succeeded.
If I look at the source code, when g_malloc fails it calls g_error():
g_error()
#define g_error(...)
A convenience function/macro to log an error message.
Error messages are always fatal, resulting in a call to abort() to terminate the application. [...]
But in my case, as I'm developing a server application, I don't want the application to exit; I would prefer that, like the traditional malloc function, the GLib functions return NULL or something else to indicate that an error happened.
So, my question is: is there a way to handle out-of-memory conditions?
Is GLib not recommended for server applications?
If I look at the man page for abort I can see that I can catch the signal, but that would make handling out-of-memory errors a little bit painful...
The abort() function causes abnormal program termination to occur, unless the signal SIGABRT is being caught and the signal handler does not return.
Thanks for your help!
It's very difficult to recover from lack of memory. The reason is that it can be considered a terminal state, in the sense that lack of memory will persist for some time before it goes away. Even reacting to the lack of memory (like informing the user) might require more memory, for example to build and send a message. A related problem is that there are operating systems (Linux at least) that may be overly optimistic about allocating memory. When the kernel realizes that memory is missing, it may kill the application, even if your code is handling the failures.
So either you have a much stricter grasp of your whole system than average, or you won't be able to successfully handle out of memory errors, and, in this case, it doesn't matter what the helper library is doing.
If you really want to control memory allocation while still using GLib, you have partial ways to do that. Don't use any GLib allocation function; use ones from another library instead. GLib provides functions that accept a "free function" where necessary. For example:
https://developer.gnome.org/glib/2.31/glib-Hash-Tables.html#g-hash-table-new-full
The hash table constructor accepts functions for destroying both keys and values. In your case, the data will be allocated using custom allocation functions, while the hash data structures will be allocated with glib functions.
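A small sketch of that pattern (my_strdup() and my_free() are hypothetical functions backed by your own allocator):

    #include <glib.h>

    extern gchar *my_strdup(const gchar *s);   /* copies a string with the custom allocator */
    extern void   my_free(gpointer data);      /* matching custom deallocator */

    GHashTable *make_table(void)
    {
        /* GLib allocates the table structure; keys and values are destroyed
         * through the custom free function when removed or when the table dies. */
        return g_hash_table_new_full(g_str_hash, g_str_equal,
                                     my_free,    /* key destroy function */
                                     my_free);   /* value destroy function */
    }

    /* usage: g_hash_table_insert(table, my_strdup("key"), my_strdup("value")); */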
Alternatively you could use the g_try_* functions to allocate memory; you still use the GLib allocator, but it won't abort on error. Again, this only partially solves the problem: internally, GLib will implicitly call functions that may abort, and it assumes they never return on error.
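For example, a sketch of the g_try_* route:

    #include <glib.h>

    gpointer alloc_request_buffer(gsize n)
    {
        /* g_try_malloc() returns NULL on failure instead of aborting */
        gpointer buf = g_try_malloc(n);
        if (buf == NULL)
            g_warning("could not allocate %" G_GSIZE_FORMAT " bytes", n);
        return buf;   /* the caller must be prepared for NULL */
    }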
About the general question: does it make sense for a server to crash when it's out of memory ? The obvious answer is no, but I can't estimate how theoretical this answer is. I can only expect that the server system be properly sized for its operation and reject as invalid any input that could potentially exceed its capacities, and for that, it doesn't matter which libraries it might use.
I'm probably editorializing a bit here, but the modern tendency to use virtual/logical memory (both names have been used, although "logical" is more distinct) does dramatically complicate knowing when memory is exhausted, although I think one can restore the old, real-(RAM + swap) model (I'll call this the physical model) in Linux with the following in /etc/sysctl.d/10-no-overcommit.conf:
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
This restores the ability to have the philosophy that if a program's malloc just failed, that program has a good chance of having been the actual cause of memory exhaustion, and can then back away from the construction of the current object, freeing memory along the way, possibly grumbling at the user for having asked for something crazy that needed too much RAM, and awaiting the next request. In this model, most OOM conditions resolve almost instantly - the program either copes and presumably returns RAM, or gets killed immediately on the following SEGV when it tries to use the 0 returned by malloc.
With the virtual/logical memory models that Linux tends to default to in 2013, this doesn't work, since a program won't find out that memory isn't available at malloc, but only upon attempting to access memory later, at which point the kernel finally realizes there's nowhere in RAM for it. This amounts to disaster, since any program on the system can die, rather than the one that ran the host out of RAM. One can understand why some GLib folks don't even care about trying to fix this problem, because with the logical memory model, it can't be fixed.
The original point of logical memory was to allow huge programs using more than half the memory of the host to still be able to fork and exec supporting programs. It was typically enabled only on hosts with that particular usage pattern. Now in 2013 when a home workstation can have 24+ GiB of RAM, there's really no excuse to have logical memory enabled at all 99% of the time. It should probably be disabled by default on hosts with >4 GiB of RAM at boot.
Anyway. So if you want to take the old-school physical model approach, make sure your computer has it enabled, or there's no point to testing your malloc and realloc calls.
If you are in that model, remember that GLib wasn't really guided by the same philosophy (see http://code.google.com/p/chromium/issues/detail?id=51286#c27 for just how madly astray some of them are). Any library based on GLib might be infected with the same attitude as well. However, there may be some interesting things one can do with GLib in the physical memory model by emplacing your own memory handlers with g_mem_set_vtable(), since you might be able to poke around in program globals and reduce usage in a cache or the like to free up space, then retry the underlying malloc. However, that's limited in its own way by not knowing which object was under construction at the point your special handler is invoked.
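A sketch of that vtable approach (the handlers are hypothetical, and the call must happen before any other GLib function is used):

    #include <glib.h>

    /* Hypothetical handlers: they might, for instance, try to shrink a cache
     * and retry the underlying malloc before giving up. */
    extern gpointer my_malloc(gsize n_bytes);
    extern gpointer my_realloc(gpointer mem, gsize n_bytes);
    extern void     my_free(gpointer mem);

    static GMemVTable my_vtable = {
        my_malloc,    /* malloc */
        my_realloc,   /* realloc */
        my_free,      /* free */
        NULL,         /* calloc (optional) */
        NULL,         /* try_malloc (optional) */
        NULL          /* try_realloc (optional) */
    };

    int main(void)
    {
        g_mem_set_vtable(&my_vtable);   /* must precede every other GLib call */
        /* ... the rest of the program ... */
        return 0;
    }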

How to fix a memory size for my application in C?

I would like to allocate a fixed amount of memory for my application (developed using C). Say my application should not exceed 64 MB of memory usage. I should also avoid using too much CPU. How is this possible?
Regards
Marcel.
Under unix: "ulimit -d 64M"
One fairly low-tech way to ensure you don't cross a maximum memory threshold in your application would be to define your own special malloc() function that keeps count of how much memory has been allocated and returns a NULL pointer if the threshold has been exceeded. This would of course rely on you checking the return value of malloc() every time you call it, which is generally considered good practice anyway, because there is no guarantee that malloc() will find a contiguous block of memory of the size that you requested.
This wouldn't be foolproof though, because it probably won't take into account memory padding for word alignment, so you'd probably end up reaching the 64MB memory limit long before your function reports that you have reached it.
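A sketch of that counting wrapper (single-threaded, and it ignores allocator padding and overhead as noted above; a matching budget_free() would also need to know each block's size in order to give the bytes back):

    #include <stdlib.h>

    #define MEM_BUDGET ((size_t)64 * 1024 * 1024)   /* 64 MB cap */

    static size_t used_bytes = 0;   /* running total of what has been handed out */

    void *budget_malloc(size_t n)
    {
        if (used_bytes + n > MEM_BUDGET)
            return NULL;             /* over budget: refuse, the caller must cope */

        void *p = malloc(n);
        if (p != NULL)
            used_bytes += n;
        return p;
    }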
Also, assuming you are using Win32, there are probably APIs that you could use to get the current process size and check this within your custom malloc() function. Keep in mind that adding this checking overhead to your code will most likely cause it to use more CPU and run a lot slower than normal, which leads nicely into your next question :)
I should also avoid using too much CPU.
This is a very general question and there is no easy answer. You could write two different programs which essentially do the same thing, and one could be 100 times more CPU intensive than another one due to the algorithms that have been used. The best technique is to:
Set some performance benchmarks.
Write your program.
Measure to see whether it reaches your benchmarks.
If it doesn't reach your benchmarks, optimise and go to step (3).
You can use profiling programs to help you work out where your algorithms need to be optimised. Rational Quantify is an example of a commercial one, but there are many free profilers out there too.
If you are on a POSIX, System V- or BSD-derived system, you can use setrlimit() with the RLIMIT_DATA resource - similar to ulimit -d.
Also take a look at the RLIMIT_CPU resource - it's probably what you need (similar to ulimit -t).
Check man setrlimit for details.
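Something along these lines (the limit values are just examples):

    #include <sys/resource.h>
    #include <stdio.h>

    int main(void)
    {
        struct rlimit data_lim = { 64UL * 1024 * 1024, 64UL * 1024 * 1024 };   /* bytes */
        struct rlimit cpu_lim  = { 60, 60 };                                   /* seconds */

        if (setrlimit(RLIMIT_DATA, &data_lim) != 0)
            perror("setrlimit(RLIMIT_DATA)");
        if (setrlimit(RLIMIT_CPU, &cpu_lim) != 0)
            perror("setrlimit(RLIMIT_CPU)");

        /* ... application code: past the data limit, allocations start failing;
         * past the CPU limit, the process receives SIGXCPU ... */
        return 0;
    }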
For CPU, we've had a very low-priority task (lower than everything else) that does nothing but count. Then you can see how often that task gets to run, and you know if the rest of your processes are consuming too much CPU. This approach doesn't work if you want to limit your process to 10% while other processes are running, but if you want to ensure that you have 50% CPU free then it works fine.
For memory limitations you are either stuck implementing your own layer on top of malloc, or taking advantage of your OS in some way. On Unix systems ulimit is your friend. On VxWorks I bet you could probably figure out a way to take advantage of the task control block to see how much memory the application is using... if there isn't already a function for that. On Windows you could probably at least set up a monitor to report if your application does go over 64 MB.
The other question is: what do you do in response? Should your application crash if it exceeds 64 MB? Do you want this just as a guide to help you limit yourself? That might make the difference between choosing an "enforcing" approach versus a "monitor and report" approach.
Hmm; good question. I can see how you could do this for memory allocated off the heap, using a custom version of malloc and free, but I don't know about enforcing it on the stack too.
Managing the CPU is harder still...
Interesting.
