I am doing particle simulations with self-propelled particles. My CUDA kernel updates each particle's location at every time step, so I launch the kernel from a for loop. Schematically it looks like this:
for (int i = 0; i < NumberOfTimeSteps; i++) {
    Calculate<<<1, N, sharedsize>>>(data, /* other parameters */);
    cudaMemcpy(data_cpu, data, datasize, cudaMemcpyDeviceToHost);
    WriteToFile(data_cpu);
}
So at each time step, new data is calculated based on the previously calculated data. It works fine when NumberOfTimeSteps is small, but once I set NumberOfTimeSteps > 500 (an approximate critical value), the program stops working.
I know there is a limitation on kernel execution: the driver can abort GPU calculations if a kernel runs for too long. However, in my code the execution time of a single kernel does not change with NumberOfTimeSteps.
Are there any limitations on the number of kernel calls?
Thanks
EDIT: There was another issue: I didn't close the .mat files I was writing results to, and kept opening new files each step. That eventually caused an error. I voted to close the question, since it has nothing to do with CUDA; Robert has already answered the part about CUDA kernels.
Are there any limitations on the number of kernel calls?
There is no real limit to the number of kernel calls. There is a limit to how many can be accepted asynchronously, but after this limit, additional kernel calls will simply block the CPU thread from proceeding until some queue slots open up (i.e. until some previously issued kernels complete).
If your program is failing after ~500 kernel calls, it is due to some other issue, which is impossible to diagnose based on what you have shown in your question.
If by "program stops working" you mean that you hit a WDDM timeout, then it is possible based on batched kernel calls within WDDM, that even though a single kernel call is not longer than the timeout period, back-to-back kernel calls may exceed the watchdog timeout. This really should not be happening in your case, because cudaMemcpy as you have shown it is not an asynchronous operation; it blocks the CPU thread. Therefore, you should at most have one kernel call outstanding at a time.
Related
Imagine a process/thread running from point A to point B.
I can measure how much time the code execution took by taking two gettimeofday() readings and calculating the difference (wall-clock time). However, it may happen that during the 'route' from A to B the CPU was switching to other processes, to drivers, to the kernel, and to other work it must perform to keep the system running.
Is it possible to somehow identify how much time A to B took in terms of actual process/thread execution, plus the kernel time related to that execution?
The goal of this exercise is to identify how much time the CPU was NOT executing the process/thread or its system calls because it was executing something else instead.
I am using C.
Searching the man pages backwards from time(1), I found this:
You can use getrusage:
getrusage() returns resource usage measures
It can be used for your own process (RUSAGE_SELF), its child processes (RUSAGE_CHILDREN), or the calling thread (RUSAGE_THREAD).
Amongst other values it will give you the user CPU time used and the system CPU time used.
Please see man 2 getrusage for the details. (Or use an online replacement like https://linux.die.net/man/2/getrusage)
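For illustration, here is a minimal sketch of bracketing the A-to-B region with two getrusage() calls (work() is just a placeholder for the code being measured):

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

static void work(void) { /* the A-to-B code you want to measure */ }

/* microseconds between two struct timeval values */
static long usec_diff(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_usec - a.tv_usec);
}

int main(void)
{
    struct rusage before, after;

    getrusage(RUSAGE_SELF, &before);   /* point A (or RUSAGE_THREAD on Linux for just this thread) */
    work();
    getrusage(RUSAGE_SELF, &after);    /* point B */

    printf("user CPU time:   %ld us\n", usec_diff(before.ru_utime, after.ru_utime));
    printf("system CPU time: %ld us\n", usec_diff(before.ru_stime, after.ru_stime));
    return 0;
}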
I am trying to measure the time taken by a CUDA kernel. I measure both CPU and GPU timings of it, but I'm getting a huge difference between the two.
When I profile it using the NVIDIA profiler, the kernel takes around 6 ms, which is what I want. But when I use gettimeofday() around the kernel call to get the CPU timing, the measurement is 15 ms. There are no memcpy calls there either. The kernel runs in a separate stream, and similar kernels are running in concurrent streams.
Sample code :
gettimeofday(&start, NULL);
cudaEventRecord(startGPU);
Kernel<<<abc, xyz, 0, stream>>>();
cudaDeviceSynchronize();
cudaEventRecord(stopGPU);
printf("Elapsed GPU time = ");
gettimeofday(&stop, NULL);
printf("Elapsed CPU time = ");
Results I'm getting for the above code :
Elapsed GPU time = 6ms
Elapsed CPU time = 15ms
This is odd, because only the kernel launch is present between the timers. The kernel parameters are pointers, though. Is the extra time being taken by memory copies? But I don't find any memcpys in the profile either. Any leads would be appreciated.
Basically, what you're measuring as your CPU time is the time it takes to
record the first event,
set up the kernel launch with the respective parameters,
send the necessary commands to the GPU,
launch the kernel on the GPU,
execute the kernel on the GPU,
wait for the notification that GPU execution finished to get back to the CPU, and
record the second event.
Also, note that your method of measuring CPU time does not measure just the processing time spent by your process/thread, but rather the total system time elapsed (which potentially includes processing time spent by other processes/threads while your process/thread was not necessarily even running). I have to admit that, even in light of all that, the gap between the CPU time you report and the GPU time is still larger than I would normally expect. But I'm not sure that what you show above really is your entire code. In fact, I rather doubt it, given that, e.g., the printf()s don't really print anything. So there may be some additional factors we're not aware of that would still have to be considered to fully explain your timings.
Anyways, most likely neither of the two measurements you take is actually measuring what you really wanted to measure. If you're interested in the time it takes for the kernel to run, then use CUDA events. However, if you synchronize first and only then record the end event, the span between the start and end events will cover the kernel execution, the CPU waiting for kernel execution to finish, and whatever time it may then take to record the second event and have it reach the GPU just so you can ask the GPU at what time it got it. Think of events as markers that mark a specific point in the command stream that is sent to the GPU. Most likely, you actually wanted to write this:
cudaEventRecord(startGPU, stream); // mark start of kernel execution
Kernel<<<abc, xyz, 0, stream>>>();
cudaEventRecord(stopGPU, stream); // mark end of kernel execution
cudaEventSynchronize(stopGPU); // wait for results to be available
and then use cudaEventElapsedTime() to get the time between the two events.
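Continuing that snippet, the last step might look like this:

float gpuMs = 0.0f;
cudaEventElapsedTime(&gpuMs, startGPU, stopGPU);   // milliseconds between the two markers
printf("Kernel time: %.3f ms\n", gpuMs);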
Also, note that gettimeofday() is not necessarily a reliable way of obtaining high-resolution timings. In C++, you could use, e.g., std::steady_clock or std::high_resolution_clock (I would resort to the latter only if it cannot be avoided, since it is not guaranteed to be steady; and make sure the clock period is actually sufficient for what you're trying to measure).
After debugging the same issue, I found that CUDA usually takes some time before the first kernel launch, as discussed in this forum thread: https://devtalk.nvidia.com/default/topic/1042733/extremely-slow-cuda-api-calls-/?offset=3.
The CUDA runtime API calls before the kernel amounted to 6 ms of cudaMalloc and 14 ms of cudaLaunch, which was the reason for the extra delay. The subsequent kernels, however, ran normally. cudaLaunch usually takes on the order of microseconds, so if anything goes beyond that, it definitely needs some repair.
NOTE: If you are running CUDA kernels in a while(1) loop, do the allocation only once, outside the loop; otherwise you will end up with delays just like this.
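A minimal, self-contained sketch of that pattern (sizes, launch configuration and the kernel body are illustrative):

__global__ void Kernel(float *data) { /* update data here */ }

int main()
{
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));        // one-time allocation, outside the loop

    for (int step = 0; step < 1000; step++) {
        // per-iteration cost is only the launch overhead (microseconds) plus kernel time
        Kernel<<<(N + 255) / 256, 256>>>(d_data);
        cudaDeviceSynchronize();
        // ... consume the results ...
    }

    cudaFree(d_data);
    return 0;
}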
I am developing a system with a DSP and an ARM. On the ARM there is a Linux OS. The DSP sends data to the ARM (Linux); on the Linux side there is a kernel module which reads the data received from the DSP. The kernel module wakes up to read the data using a hardware interrupt from the DSP to the ARM.
I want to write a user-space app that will read the data from kernel space (the kernel module) each time new data arrives from the DSP.
The question is:
Which is the better approach: a software interrupt from the kernel to user space, or polling from user space (reading a known memory address shared with the kernel) every 10 ms?
Knowing that:
The data from the DSP to the kernel must arrive within a very short time: 100 us.
The data from the kernel to user space can take 10 ms to 30 ms.
The data being read is fairly small: around 100 bytes.
I would create a device and have the userland program block on read. There is no need to wait 10 ms in between; this is handled efficiently by blocking.
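As a sketch, the userland side could be as simple as this (the device node name /dev/dspdata is just an assumption for whatever your kernel module exposes):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[128];                                /* messages are ~100 bytes */
    int fd = open("/dev/dspdata", O_RDONLY);      /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);    /* blocks until the driver has new data */
        if (n <= 0)
            break;
        /* process the n bytes received from the DSP */
    }
    close(fd);
    return 0;
}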
Polling in the sense of using poll (yes, I understood that's not what you meant) would work fine, but there is no reason to call two functions (first poll and then read) when one function can do the job anyway. There is no need to do it every 10 ms; you can immediately call poll again after having processed what you got from your last read.
Polling in the sense of checking a known memory location every 10 ms is not advisable. Not only is it an ugly hack and more complicated than you think (you will have to map the page containing that memory location into user space), it is also a form of busy waiting that needlessly consumes CPU, and it has an average latency of 5 ms and a worst-case latency of 10 ms, which is entirely unnecessary. The average and worst-case latency of read is approximately zero (well, not quite, but nearly so... it's as fast as waking a blocked task goes).
Interrupts (i.e. signals) are very efficient but make the program a lot more complicated/contorted compared to simply reading and blocking (you have to write a signal handler, may not use certain functions in handlers, must communicate with the main app, etc.). While technically a good solution, I would advise against them because a program need not be more complicated than necessary.
Polling has no advantage over waiting. The process still has to be scheduled and switched to and all that, and then it spends part of its time doing useless polls.
Linux runs the scheduler when returning from interrupts, so if you wake the waiting task in the in-kernel interrupt handler and it has a high priority set (you should give it real-time priority, obviously), the task will be scheduled immediately. You won't beat that with polling.
The standard interface of (character) device files is reasonably fast, so just implement blocking read, poll (which is a blocking system call, not really polling anything), and possibly asynchronous read (which uses a real-time signal), though I suspect a dedicated thread waiting in the read system call will perform better than AIO. It's easier to write, too. You should find enough examples in the kernel sources.
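For reference, the core of such a blocking read on the driver side is essentially a wait queue that the interrupt handler wakes; a rough sketch using a misc device (locking, buffering and error handling omitted, names illustrative):

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/wait.h>
#include <linux/uaccess.h>

static DECLARE_WAIT_QUEUE_HEAD(dsp_wq);
static int data_ready;              /* set when the DSP message has been copied in */
static char dsp_buf[128];           /* ~100-byte messages */
static size_t dsp_len;

/* call this from the DSP hardware interrupt handler after filling dsp_buf/dsp_len */
void dsp_data_arrived(void)
{
    data_ready = 1;
    wake_up_interruptible(&dsp_wq); /* unblocks the reader immediately */
}

static ssize_t dsp_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
{
    if (wait_event_interruptible(dsp_wq, data_ready))
        return -ERESTARTSYS;        /* interrupted by a signal */
    data_ready = 0;
    if (count > dsp_len)
        count = dsp_len;
    if (copy_to_user(buf, dsp_buf, count))
        return -EFAULT;
    return count;
}

static const struct file_operations dsp_fops = {
    .owner = THIS_MODULE,
    .read  = dsp_read,
};

static struct miscdevice dsp_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "dspdata",             /* appears as /dev/dspdata */
    .fops  = &dsp_fops,
};

module_misc_device(dsp_dev);
MODULE_LICENSE("GPL");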
You don't seem to mention any hard time constraints, so you could really go with either approach. However, as Martin James said, polling introduces some overhead to the application, which you probably don't want.
Personally, I'd go with an interrupt or event flag triggered by the kernel. While you may not have hard timing constraints, I assume you want something more deterministic rather than less. A kernel interrupt will get you closer to that.
I am wondering if it is possible to force a cache flush in C on Linux x86. I have read several answers explaining how to do this from the shell or using asm/cache.h (which would require me to write a Linux kernel module...).
I am using the PAPI library, which allows me to get very close to the exact number of clock cycles that a given block of code takes to execute. However, since I want to time some extremely short functions, I need to run them many times for accurate statistics (the timing call itself takes longer than the code within the blocks takes to execute). Because the code runs multiple times, the cache speeds up successive calls of the same block, and I would like to prevent this!
I don't know of any standard way to do this other than loading other data into the cache. My usual workaround is simply to process something large enough to "cool down" the cache, say a matrix multiplication.
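A sketch of that kind of "cool down" step between timed runs (the 64 MiB size is an assumption; pick something comfortably larger than your last-level cache):

#include <stdlib.h>

#define POLLUTE_BYTES (64u * 1024 * 1024)   /* should exceed the last-level cache size */

static void cool_down_cache(void)
{
    static unsigned char *buf;
    if (!buf)
        buf = malloc(POLLUTE_BYTES);

    /* Touch one byte per cache line so previously cached data gets evicted.
       The volatile pointer keeps the compiler from optimizing the pass away. */
    volatile unsigned char *p = buf;
    for (size_t i = 0; i < POLLUTE_BYTES; i += 64)
        p[i]++;
}

Calling this between PAPI-timed iterations makes each run start from a roughly cold cache, at the cost of a slower overall benchmark.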
I have an assignment where I am analyzing the runtime of various sorting algorithms. I have written the code, but I think it's an unfair comparison.
My code basically grabs the clock time before and after the sorting is finished to compute the elapsed time. However, what if the OS decides to interrupt more frequently during the run of a specific sorting algorithm, or decides that some other background application should be given more CPU time when its thread comes back up?
I am not a CS major so I may not be entirely correct here, but from what I've read previously I was concerned this might have an impact on the results.
I also realize that if OS scheduling is suspended and the program hangs then there might be a serious problem; I am just wondering if it is possible.
Normally, there's no real reason to do that. The scheduler will slightly increase the execution time, but if the code runs for a few seconds, the change will be tiny.
So unless you're running heavy applications on the same computer, the amount of noise this will add to your tests is negligible.
In Linux, you can use the isolcpus kernel parameter to mark CPUs that won't be used by the scheduler. You can find information here. I'm not sure what the minimum kernel version is.
If you use it, you'll need to call sched_setaffinity to put your thread on an isolated CPU, because the scheduler won't put it there on its own.
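A minimal sketch of that step (the CPU number 3 is an assumption and must match what you passed to isolcpus=):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                             /* e.g. booted with isolcpus=3 */

    /* pid 0 means the calling thread */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* run the sorting benchmark here, undisturbed by other tasks */
    return 0;
}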
It is not possible, not in user space code. Otherwise, any malicious process could steal the CPU from others.
If you want precise time accounting for your process only, I suggest using the time command. You can read about it here: What do 'real', 'user' and 'sys' mean in the output of time(1)?
Quick answer: you are most likely interested in the user time, assuming your code doesn't make heavy use of syscalls (which would be rather strange for a sorting algorithm).
On an up-to-date POSIX system (basically Linux) you can use clock_gettime with CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID, provided you make sure the process doesn't migrate between CPUs (you can set its affinity, for example).
The difference between the times returned by clock_gettime with those arguments gives the exact time the process/thread spent executing. The only pitfall, as I mentioned, is process migration, as the man page says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
This means that you don't really need to suspend all other processes just to measure the execution time of your program.
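A minimal sketch of that measurement (sort_input() is just a placeholder for the algorithm being timed):

#include <stdio.h>
#include <time.h>

static void sort_input(void) { /* the sorting code being measured */ }

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
    sort_input();
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

    /* CPU time actually consumed by this process, excluding time the OS
       spent running other processes in between */
    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("process CPU time: %.6f s\n", seconds);
    return 0;
}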