Is there any way I could EXPLICITLY limit the number of GPU multiprocessors being used during runtime of my program? I would like to calculate how my algorithm scales up with growing number of multiprocessors.
If it helps: I am using CUDA 4.0 and device with compute capability 2.0.
Aaahhh... I know the problem. I played with it myself when writing a paper.
There is no explicit way to do it; however, you can try "hacking" it by having some of the blocks do nothing.
If you never launch more blocks than there are multiprocessors, then your work is easy - just launch even fewer blocks. Some of the SMs are guaranteed to have no work, because a block cannot split across multiple SMs.
If you launch many more blocks and you just rely on the driver to schedule them, use a different approach: launch only as many blocks as your GPU can handle at once, and when a block finishes its work, instead of terminating it, loop back to the beginning and fetch another piece of data to work on. Most likely, the performance of your program will not fall; it might even get better if you schedule your work carefully :)
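For the looping variant, a minimal sketch (a plain grid-stride loop; the per-element multiplication is just a placeholder for your real work) could look like this:

__global__ void persistent_kernel(const float *in, float *out, int num_items)
{
    // grid-stride loop: launch only as many blocks as the GPU can keep resident;
    // each thread loops back and fetches more work instead of terminating
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < num_items;
         i += gridDim.x * blockDim.x) {
        out[i] = in[i] * 2.0f;   // placeholder for the real per-element work
    }
}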
The biggest problem is when all your blocks are running on the GPU at once, but you have more than one block per SM. Then you need to launch normally, but manually "disable" some of the blocks and order other blocks to do the work for them. The problem is - which blocks do you disable to guarantee that one SM is working and another is not?
From my own experiments, the 1.3 devices (I had a GTX 285) schedule the blocks in sequence. So, if I launch 60 blocks onto 30 SMs, blocks 1-30 are scheduled onto SMs 1-30 and then 31-60 again onto SMs 1 to 30. So, by disabling blocks 5 and 35, SM number 5 is practically not doing anything.
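As a minimal illustration of the early-return trick (0-based block indices here; a real experiment would also hand the disabled blocks' data ranges to other blocks, which I omit for brevity):

__global__ void kernel_with_disabled_blocks(float *data, int n)
{
    // example: with 60 blocks on 30 SMs, disabling blocks 5 and 35 would
    // (under the scheduling behaviour observed above) leave one SM mostly idle
    if (blockIdx.x == 5 || blockIdx.x == 35)
        return;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // placeholder for the real work
}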
Note, however, that this is my private, experimental observation, made 2 years ago. It is in no way confirmed, supported, maintained or what-not by NVIDIA and may change (or may already have changed) with newer GPUs and/or drivers.
I would suggest - try playing with some simple kernels which do a lot of stupid work and see how long it takes to compute on various "enabled"/"disabled" configurations. If you are lucky, you will catch a performance drop, indicating that 2 blocks are in fact executed by a single SM.
It's an odd title, but bear with me. Some time ago, I finished writing a program in C with Visual Studio Community 2017 that made significant use of OpenSSL's secp256k1 implementation. It was 10x faster than an equivalent program in Java, so I was happy. However, today I decided to upgrade it to use the bitcoin project's optimized libsecp256k1 library. It worked out great and I got a further 7x performance boost! Changing which library is used to do the EC multiplications is the ONLY thing I changed about the software: it still reads in the same input files, computes the same things, and outputs the same results.
Input files consist of 5 million initial values, and I break that up into chunks of 50k. I then set pthreads to use 6 threads to compute those 50k, save the results, and move on to the next 50k until all 5 million are done (I've also used OpenMP with 6 threads). For some reason when running this program on my Windows 10 4-core laptop, after exactly 16 chunks, the CPU utilization drops from 75% down to 65%, and after another 10 chunks down to 55%, and so on until it's only using about 25% of my CPU by the time all 5 million inputs are calculated.
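For reference, the processing loop is structured roughly like the sketch below; process_value() and save_results() are placeholders for the actual libsecp256k1 work and result writing, not my real code:

// Rough shape of the chunked processing loop (names are placeholders):
// 5 million inputs, chunks of 50k, 6 threads per chunk.
#include <omp.h>

#define TOTAL_INPUTS 5000000
#define CHUNK_SIZE   50000

extern void process_value(int index);                 /* hypothetical per-input EC work */
extern void save_results(int chunk_start, int chunk_len);

void run_all(void)
{
    for (int start = 0; start < TOTAL_INPUTS; start += CHUNK_SIZE) {
        #pragma omp parallel for num_threads(6)
        for (int i = start; i < start + CHUNK_SIZE; i++)
            process_value(i);

        save_results(start, CHUNK_SIZE);              /* sequential between chunks */
    }
}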
The thread count (7: 1 main thread, 6 worker threads) remains the same, and the memory usage never goes over 1.5GB (the laptop has 16GB), yet the CPU utilization drops as if I'm dropping threads. My temps never go over 83C, and the all-core turbo stays at the max 3.4GHz (base 2.8GHz), so there is no thermal throttling happening here. The laptop is always plugged in and power settings are set to max performance. There are no other CPU- or memory-intensive programs running besides this one.
Even stranger is that this problem doesn't happen on either of my two Windows 7 desktops: they both hold the expected CPU utilization throughout all 5 million calculations. The old OpenSSL implementation always held the expected CPU utilization on all computers, so something's different, yet it only affects the Windows 10 laptop.
I'm sorry I don't have code to demonstrate this, and maybe another forum would be more appropriate, but since it's my code I thought I'd ask here. Anyone have any ideas what might be causing this or how to fix it?
I have to write a client-server program in C where the server can use n threads that work simultaneously to manage client requests.
To do this I use a listener socket that puts the FD of each new connection request in a list, and then the threads take one from the list when they are free.
I know that I can also use pipes for communication between threads.
Is the socket the best way? Why or why not?
Sorry for my bad English
To communicate between threads you can use sockets as well as shared memory.
For multithreading there are many libraries available on GitHub; one that I have used is the one below.
https://github.com/snikulov/prog_posix_threads/blob/master/workq.c
I tried and tested the same approach you want. It works perfectly!
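If it helps, here is a minimal sketch of the pattern you describe (a listener thread pushing accepted FDs into a mutex-protected queue that worker threads pop from); handle_client() is a placeholder for your own request handling and all error handling is stripped out:

#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define QUEUE_CAP 128

static int queue[QUEUE_CAP];
static int q_head = 0, q_len = 0;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;

static void queue_push(int fd)
{
    pthread_mutex_lock(&q_lock);
    if (q_len < QUEUE_CAP)                            /* drop if full, for brevity */
        queue[(q_head + q_len++) % QUEUE_CAP] = fd;
    pthread_cond_signal(&q_nonempty);
    pthread_mutex_unlock(&q_lock);
}

static int queue_pop(void)
{
    pthread_mutex_lock(&q_lock);
    while (q_len == 0)
        pthread_cond_wait(&q_nonempty, &q_lock);
    int fd = queue[q_head];
    q_head = (q_head + 1) % QUEUE_CAP;
    q_len--;
    pthread_mutex_unlock(&q_lock);
    return fd;
}

extern void handle_client(int fd);                    /* your request handling */

static void *listener(void *arg)                      /* accepts and enqueues FDs */
{
    int listen_fd = *(int *)arg;
    for (;;) {
        int client = accept(listen_fd, NULL, NULL);
        if (client >= 0)
            queue_push(client);
    }
    return NULL;
}

static void *worker(void *arg)                        /* one of the n worker threads */
{
    (void)arg;
    for (;;) {
        int fd = queue_pop();
        handle_client(fd);
        close(fd);
    }
    return NULL;
}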
There's one very nice resource related to socket multiplexing which I think you should stop and read after reading this answer. That resource is entitled The C10K problem, and it details numerous solutions to the problem people faced in the year 2000, of handling 10000 clients.
Of those solutions, multithreading is not the primary one. Indeed, multithreading as an optimisation should be one of your last resorts, as that optimisation will interfere with the instruments you use to diagnose other optimisations.
In general, here is how you should perform optimisations, in order to provide guaranteed justifications:
1. Use a profiler to determine the most significant bottlenecks (in your single-threaded program).
2. Perform your optimisation upon one of the more significant bottlenecks.
3. Use the profiler again, with the same set of data, to verify that your optimisation worked correctly.
You can repeat these steps ad infinitum until you decide the improvements are no longer tangible (meaning, good luck observing the differences between before and after). Following these steps will provide you with data you can show your employer, if he/she asks you what you've been doing for the last hour, so make sure you save the output of your profiler at each iteration.
Optimisations are per-machine; what this means is that an optimisation for your machine might actually be slower on another machine. For example, you may use a buffer of 4096 bytes for your machine, while the cache lines for another machine might indicate that 512 bytes is a better idea.
Hence, ideally, we should design programs and modules in such a way that their resources are minimal and can easily be scaled up, substituted and/or otherwise adjusted for other machines. This can be difficult, as it means in the buffer example above you might start off with a buffer of one byte; you'd most likely need to study finite state machines to achieve that, and using buffers of one byte might not always be technically feasible (i.e. when dealing with fields that are guaranteed to be a certain width; you should use that width as your minimum limit, and scale up from there). The reward is code that is ultra-portable and ultra-optimisable in all situations.
Keep in mind that extra threads use extra resources; we tend to assume that the stack space reserved for a thread can grow to 1MB, so 10000 sockets occupying 10000 threads (in a thread-per-socket model) would occupy about 10GB of memory! Yikes! The minimal resources method suggests that we should start off with one thread, and scale up from there, using a multithreading profiler to measure performance like in the three steps above.
I think you'll find, though, that for anything purely socket-driven, you likely won't need more than one thread, even for 10000 clients, if you study the C10K problem or use some library which has been engineered based on those findings (see your comments for one such suggestion). We're not talking about masses of number crunching, here; we're talking about socket operations, which the kernel likely processes using a single core, and so you can likely match that single core with a single thread, and avoid any context switching or thread synchronisation troubles/overheads incurred by multithreading.
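To make that concrete, here is a bare-bones sketch of single-threaded multiplexing with poll() (the names are mine); a real server would use non-blocking sockets and handle partial reads/writes, which is omitted here:

#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_FDS 1024

void serve(int listen_fd)
{
    struct pollfd fds[MAX_FDS];
    int nfds = 0;

    fds[nfds].fd = listen_fd;                  /* slot 0: the listening socket */
    fds[nfds].events = POLLIN;
    nfds++;

    for (;;) {
        poll(fds, nfds, -1);                   /* block until something is ready */

        if (fds[0].revents & POLLIN) {         /* new connection */
            int client = accept(listen_fd, NULL, NULL);
            if (client >= 0 && nfds < MAX_FDS) {
                fds[nfds].fd = client;
                fds[nfds].events = POLLIN;
                nfds++;
            }
        }

        for (int i = 1; i < nfds; i++) {       /* readable clients */
            if (fds[i].revents & POLLIN) {
                char buf[512];
                ssize_t n = read(fds[i].fd, buf, sizeof buf);
                if (n <= 0) {                  /* client closed or error */
                    close(fds[i].fd);
                    fds[i] = fds[--nfds];      /* compact the array */
                    i--;
                }
                /* else: process buf[0..n) and write a response */
            }
        }
    }
}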
I'm trying to find the optimal values of threads and blocks for my application. Therefore I wrote a small suite to run possible combinations of thread count, block size and grid size. The task I'm working on is not parallelizable, so every thread computes its own unique problem and needs read and write access to a unique chunk of global memory for it. I also had to increase cudaLimitStackSize for my kernel to run.
I'm running into problems when I try to calculate the maximum number of threads I can run at once. My refined approach (thanks to Robert Crovella) is
threads = (freememory*0.9)/memoryperthread
where freememory is acquired from cudaMemGetInfo and memoryperthread is the global memory requirement of one thread. Even if I decrease the constant factor, I still encounter "unspecified launch failure", which I can't debug because the debugger fails with Error: Internal error reported by CUDA debugger API (error=1). The application cannot be further debugged. Depending on the settings, this error shows up at different thread and block counts.
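For what it's worth, here is a minimal sketch of that sizing calculation with explicit error checking, so an allocation failure surfaces as an error code instead of a later kernel failure (the memoryperthread value is just an example):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);

    const size_t memoryperthread = 64 * 1024;              // bytes, illustrative value
    size_t threads = (size_t)(freeMem * 0.9) / memoryperthread;
    printf("free %zu MB, planning for %zu threads\n", freeMem >> 20, threads);

    void *buf = nullptr;
    cudaError_t err = cudaMalloc(&buf, threads * memoryperthread);
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;                                            // back off and retry with less
    }
    cudaFree(buf);
    return 0;
}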
I'm also encountering a problem when I try different blocksizes. Any blocksize larger than 512 threads yields "too many resources requested for launch". As Robert Crovella pointed out, this may be a problem of my kernel occupying too many registers (63, as reported by -Xptxas="-v"). Since blocks can be spread across several multiprocessors (multiProcessorCount), I sadly can't find any limitation that would suddenly be hit at a blocksize of 1024.
My code runs fine for small numbers of threads and blocks, but I seem to be unable to compute the maximum numbers I could run at the same time. Is there any way to properly compute those, or do I need to determine them empirically?
I know that memory heavy tasks aren't optimal for CUDA.
My device is a GTX480 with Compute Capability 2.0. For now I'm stuck with CUDA Driver Version = 6.5, CUDA Runtime Version = 5.0.
I do compile with -gencode arch=compute_20,code=sm_20 to enforce the Compute Capability.
Update: Most of the aforementioned problems went away after updating the runtime to 6.5. I will leave this post the way it is, since I mention the errors I encountered and people may stumble upon it when searching for their error. To solve the problem with large blocksizes I had to reduce the registers per thread (-maxrregcount).
threads = totalmemory/memoryperthread
If your calculation for memoryperthread is accurate, this won't work because totalmemory is generally not all available. The amount you can actually allocate is less than this, due to CUDA runtime overhead, allocation granularity, and other factors. So that is going to fail somehow, but since you've provided no code, it's impossible to say exactly how. If you were doing all of this allocation from the host e.g. via cudaMalloc, then I would expect an error there, not a kernel unspecified launch failure. But if you are doing in-kernel malloc or new, then it's possible that you are trying to use a returned null pointer (indicating an allocation failure - ie. out of memory) and that would probably lead to an unspecified launch failure.
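If you are using in-kernel allocation, a minimal sketch of the kind of check I mean (the kernel body is illustrative only) would be:

// Check the pointer returned by in-kernel malloc/new; dereferencing NULL after
// a failed allocation is one way to end up with an "unspecified launch failure".
// (The device heap size is set with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).)
__global__ void per_thread_alloc(int *failed, size_t bytes_per_thread)
{
    char *p = (char *)malloc(bytes_per_thread);
    if (p == NULL) {               // device heap exhausted
        atomicAdd(failed, 1);      // record the failure instead of crashing
        return;
    }
    // ... use p as this thread's scratch space ...
    free(p);
}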
having a blocksize larger than 512 threads yields "too many resources requested for launch".
This is probably either the fact that you are not compiling for a cc2.0 device or else your kernel uses more registers per thread than what can be supported. Anyway this is certainly a solvable problem.
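To illustrate with the numbers given above: a cc2.0 SM has 32768 32-bit registers, and a block must fit entirely within one SM's register file. At the reported 63 registers per thread, a 512-thread block needs 512 x 63 = 32256 registers and fits, while a 1024-thread block would need 1024 x 63 = 64512 and does not, which matches the "too many resources requested for launch" behaviour; capping registers with -maxrregcount (e.g. 32, since 1024 x 32 = 32768) lets the 1024-thread block fit.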
So how would one properly calculate the maximum possible threads and blocks for a kernel?
Often, global memory requirements are a function of the problem, not of the kernel size. If your global memory requirements scale up with kernel size, then there is probably some ratio that can be determined based on the "available memory" reported by cudaMemGetInfo (e.g. 90%) that should give reasonably safe operation. But in general, a program is well designed if it is tolerant of allocation failures, and you should at least be checking for these explicitly on host code and device code, rather than depending on "unspecified launch failure" to tell you that something has gone wrong. That could be any sort of side-effect bug triggered by memory usage, and may not be directly due to an allocation failure.
I would suggest tracking down these issues. Debug the problem, find the source of the issue. I think the correct solution will then present itself.
I hope answering my question would not require a lot of time, because it is about my understanding of this topic.
So, the question is about block and grid sizes for concurrent kernels execution.
First, let me tell you about my card: it is a GeForce GTX TITAN, and here are some of its characteristics which I think are important in this question.
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 6144 MBytes (6442123264 bytes)
(14) Multiprocessors, (192) CUDA Cores/MP: 2688 CUDA Cores
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Now, the main problem: I have a kernel (it performs sparse matrix multiplication, but that is not so important) and I want to launch it simultaneously(!) in several streams on one GPU, computing different matrix multiplications.
Please notice again the simultaneity requirement - I want all the kernels to start at one moment and finish at another (all of them!), so a solution where these kernels only partly overlap doesn't satisfy me.
It is also very important that I want to maximize the number of parallel kernels, even if we lose some performance because of it.
OK, let's assume we already have the kernel and we want to specify its grid and block sizes in the best way.
Looking at the card characteristics, we see it has 14 SMs and compute capability 3.5, which allows 32 concurrent kernels to run.
So, the conclusion I make here is that launching 28 concurrent kernels (two per each of the 14 SMs) would be the best decision. The first question - am I right here?
Now, again, we want to optimize each kernel's block and grid sizes. OK, let's look at this characteristic:
Maximum number of threads per multiprocessor: 2048
I understand it this way: if we launch a kernel with 1024 threads and 2 blocks, these two blocks will be computed simultaneously; if we launch a kernel with 1024 threads and 4 blocks, then the two pairs of blocks will be computed one after another.
So, the next conclusion I make is that launching 28 kernels, each one with 1024 threads, would also be the best solution - because this is the only way they can be executed simultaneously on each SM. The second question - am I right here? Or is there a better way to get simultaneous execution?
It would be very nice if you simply said whether I am right or not, and I would be very grateful if you explained where I am mistaken or proposed a better solution.
Thank you for reading this!
There are a number of questions on concurrent kernels already. You might search and review some of them. You must consider register usage, blocks, threads, and shared memory usage, amongst other things. Your question is not precisely answerable when you don't provide information about register usage or shared memory usage. Maximizing concurrent kernels is partly an occupancy question, so you should study that as well.
Nevertheless, you want to observe maximum concurrent kernels. As you've already pointed out, that is 32.
You have 14 SMs, each of which can have a maximum of 2048 threads. 14x2048/32 = 896 threads per kernel (ie. blocks * threads per block)
With a threadblock size of 128, that would be 7 blocks per kernel. 7 blocks * 32 kernels = 224 blocks total. When we divide this by 14 SMs we get 16 blocks per SM, which just happens to exactly match the spec limit.
So the above analysis, 32 kernels, 7 blocks per kernel, 128 threads per block, would be the extent of the analysis that could be done taking into account only the data you have provided.
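For reference, a minimal sketch of that launch configuration (32 streams, 7 blocks of 128 threads each; dummy_kernel is just a placeholder, not your sparse-matrix code) might look like this:

#include <cuda_runtime.h>

__global__ void dummy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;          // placeholder work
}

int main()
{
    const int nKernels = 32, blocks = 7, threads = 128;
    const int n = blocks * threads;

    cudaStream_t streams[nKernels];
    float *buf[nKernels];

    for (int k = 0; k < nKernels; k++) {
        cudaStreamCreate(&streams[k]);
        cudaMalloc(&buf[k], n * sizeof(float));
    }

    // launch everything back-to-back; whether the kernels actually overlap
    // depends on resource usage, the driver, WDDM batching, etc.
    for (int k = 0; k < nKernels; k++)
        dummy_kernel<<<blocks, threads, 0, streams[k]>>>(buf[k], n);

    cudaDeviceSynchronize();

    for (int k = 0; k < nKernels; k++) {
        cudaFree(buf[k]);
        cudaStreamDestroy(streams[k]);
    }
    return 0;
}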
If that does not work for you, I'd make sure I have addressed the requirements for concurrent execution and then focus on registers per thread or shared memory to see if those are limiting "occupancy" in this case.
Honestly I don't hold out much hope for you witnessing the perfect scenario you describe, but have at it. I'd enjoy being surprised. FYI, if I were trying to do something like this, I would certainly try it on linux rather than windows, especially considering your card is a GeForce card subject to WDDM limitations under windows.
Your understanding seems flawed. Statements like this:
if we launch a kernel with 1024 threads and 2 blocks, these two blocks will be computed simultaneously; if we launch a kernel with 1024 threads and 4 blocks, then the two pairs of blocks will be computed one after another
don't make sense to me. Blocks will be computed in whatever order the scheduler deems appropriate, but there is no rule that says two blocks will be computed simultaneously while four blocks will be computed two by two.
I need to give some details about what I am doing before asking my question. I hope my English and my explanations are clear and concise enough.
I am currently working on a massive parallelization of a code initially written in C. The reason I was interested in CUDA is the large size of the arrays I was dealing with: the code is a simulation of fluid mechanics and I needed to launch a "time loop" with five to six successive operations on arrays as big as 3x10^9 or 19x10^9 double variables. I went through various tutorials and documentation and I finally managed to write a not-so-bad CUDA code.
Without going into the details of the code, I used relatively small 2D blocks. The number of threads is 18 or 57 (which is awkward, since my warps are not fully occupied).
The kernels are launched on a "big" 3D grid, which describes my physical geometry (the maximal desired size is 1000 values per dimension, meaning I want to deal with a 3D grid of 1 billion blocks).
Okay, so now, my five to six kernels, which are doing the job correctly, make good use of shared memory, since global memory is read once and written once for each kernel (the size of my blocks was actually determined according to the amount of shared memory needed).
Some of my kernels are launched concurrently (called asynchronously), but most of them need to be successive. There are several memcpys from device to host, but the ratio of memcpys to kernel calls is significantly low. I am mostly executing operations on my array values.
Here is my question:
If I understood correctly, all of my blocks are doing the job on the arrays at the same time. So does that mean dealing with a 10-block grid, a 100-block grid or a billion-block grid will take the same amount of time? The answer is obviously no, since the computation time is significantly larger when I am dealing with large grids. Why is that?
I am using a relatively modest NVIDIA device (NVS 5200M). I was trying to get used to CUDA before getting bigger/more efficient devices.
Since I went through all the optimization and CUDA programming advice/guides by myself, I may have completely misunderstood some points. I hope my question is not too naive...
Thanks!
If I understood correctly, all of my blocks are doing the job on the arrays at the same time.
No, they don't all run at the same time! How many thread blocks can run concurrently depends on several things, all affected by the compute capability of your device - the NVS 5200M should be cc2.1.
A CUDA-enabled GPU has an internal scheduler that manages where and when each thread block and the warps of that block will run. "Where" means on which streaming multiprocessor (SM) the block will be launched.
Every SM has a limited amount of resources - shared memory and registers, for example. A good overview of these limitations is given in the Programming Guide or the Occupancy Calculator.
The first limitation is that for cc2.1 an SM can run up to 8 thread blocks at the same time. Depending on your usage of registers, shared memory and so on, the number will possibly decrease.
If I remember correctly, an SM of cc2.1 consists of 48 CUDA cores, so your NVS 5200M (96 cores in total) should have two SMs. Let's assume that with your kernel setup, N (N <= 8) thread blocks fit on an SM at the same time. The internal scheduler will launch the first blocks and queue up all the other thread blocks. When a thread block has finished its work, the next one from the queue will be launched. So if you launch in total anywhere between 1 and N blocks per SM, the time used by the kernel will be roughly the same; if you run the kernel with more blocks than fit concurrently, the time used will increase.
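One way to see this behaviour yourself is to time the same kernel with a growing number of blocks using CUDA events and watch where the elapsed time jumps; here is a rough sketch (busy_kernel just burns time and stands in for your real kernels):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *out)
{
    // placeholder work: every thread spins on some arithmetic
    float x = threadIdx.x;
    for (int i = 0; i < 1000000; i++)
        x = x * 1.0000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int threadsPerBlock = 256, maxBlocks = 16;
    float *out;
    cudaMalloc(&out, maxBlocks * threadsPerBlock * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int blocks = 1; blocks <= maxBlocks; blocks++) {
        cudaEventRecord(start);
        busy_kernel<<<blocks, threadsPerBlock>>>(out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%2d blocks: %.2f ms\n", blocks, ms);
    }

    cudaFree(out);
    return 0;
}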