I am currently developing a Windows Kernel Driver that implements its own networking stack. While testing some base functionality of the implemented stack, I noticed that replies to pings would sometimes take noticeably longer than usual. Investigating this issue further, I found out that KeAcquireSpinLock sporadically has an execution time of up to 20 ms (instead of a few µs), even when the lock is not held by another core (I confirmed this by printing the lock value before calling KeAcquireSpinLock).
Since I had no clue why KeAcquireSpinLock takes so long, I implemented a different approach with KeAcquireSpinLockAtDpcLevel, manually raising the IRQL if required:
KIRQL oldIrql = KeGetCurrentIrql();
if (oldIrql < DISPATCH_LEVEL)
{
    KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);
}
KeAcquireSpinLockAtDpcLevel(m_lock);
// DO STH WITH SHARED RESOURCES
KeReleaseSpinLockFromDpcLevel(m_lock);
if (oldIrql < DISPATCH_LEVEL)
    KeLowerIrql(oldIrql);
I expected the above code to be functionally equivalent to KeAcquireSpinLock. However, it turned out that the runtime issue I had with KeAcquireSpinLock is gone and performance is fine with this approach.
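For reference, the plain KeAcquireSpinLock variant looks roughly like this (a sketch, using the same m_lock object as above):

KIRQL oldIrql;
KeAcquireSpinLock(m_lock, &oldIrql);   // raises to DISPATCH_LEVEL and acquires the lock
// DO STH WITH SHARED RESOURCES
KeReleaseSpinLock(m_lock, oldIrql);    // releases the lock and restores the previous IRQL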
I have searched the internet for similar problems with KeAcquireSpinLock, but it seems like I am alone with this issue. Maybe I have a bug in other sections of the driver? Can someone explain this behavior?
Note that I am not talking about Deadlocks, since KeAcquireSpinLock would always return at some point and the implementation with KeAcquireSpinLockAtDpcLevel uses the same architecture / locking object.
I'm very new to the whole OpenCL world so I'm following some beginners tutorials. I'm trying to combine this and this to compare the time required to add two arrays together on different devices. However I'm getting confusing results. Considering that the code is too long I made this GitHub Gist.
On my mac I have 1 platform with 3 devices. When I assign the j in
cl_command_queue command_queue = clCreateCommandQueue(context, device_id[j], 0, &ret);
manually to 0, it seems to run the calculation on the CPU (about 5.75 seconds). When I put 1 or 2, the calculation time drops drastically (0.01076 seconds), which I assume is because the calculation is being run on my Intel or AMD GPU. But then there are some issues:
I can adjust the j to any higher numbers and it still seems to run on GPUs.
When I put all the calculations in a loop, the time measured for all the devices is the same as calculating on the CPU (as I presume).
The times required to do the calculation for all j > 0 are suspiciously close. I wonder if they are really being run on different devices.
I clearly have no clue about OpenCL, so I would appreciate it if you could take a look at my code and let me know what my mistake(s) are and how I can solve them. Or maybe point me towards a good example which runs a calculation on different devices and compares the times.
P.S. I have also posted this question here on Reddit
Before submitting a question for an issue you are having, always remember to check for errors (specifically, in this case, that every API call returns CL_SUCCESS). The results are meaningless otherwise.
In the specific case, the problem in your code is that when getting the device IDs, you're only getting one device ID (line 60, third argument), meaning that everything else in the buffer is bogus, and results for j > 0 are meaningless.
The only surprising thing is that it doesn't crash.
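A minimal sketch of the fix, assuming variable names similar to the gist (platform_id, device_id, and ret are my placeholders): first ask how many devices exist, then fetch all of them, checking errors each time.

#include <stdlib.h>   /* for malloc */

cl_uint num_devices = 0;
cl_int ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
if (ret != CL_SUCCESS) { /* handle the error */ }

cl_device_id *device_id = malloc(num_devices * sizeof(cl_device_id));
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, num_devices, device_id, NULL);
if (ret != CL_SUCCESS) { /* handle the error */ }

/* Only j = 0 .. num_devices-1 are valid indices from here on. */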
Also, when checking runtimes, use OpenCL events, not host-side clock times. In your case you are at least timing after the clFinish, so you are ensuring that kernel execution has terminated, but you are essentially counting the time needed for all the setup rather than just the copy time.
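A hedged sketch of event-based timing; the command queue has to be created with CL_QUEUE_PROFILING_ENABLE, and kernel, global_size and local_size are placeholders:

cl_command_queue queue = clCreateCommandQueue(context, device_id[j],
                                              CL_QUEUE_PROFILING_ENABLE, &ret);

cl_event evt;
ret = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &global_size, &local_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong t_start = 0, t_end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);
double kernel_ms = (t_end - t_start) * 1e-6;   /* profiling values are in nanoseconds */
clReleaseEvent(evt);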
I am getting a lot of profiling overhead when trying to profile my code using nvvp (or with nvprof):
Overall time is 98 ms and I'm getting 85 ms of "Instrumentation" in the first kernel launch.
How can I reduce this profiling overhead or otherwise zoom-in on just the part that I'm interested in?
Background
I am running this with "Start execution with profiling enabled" unchecked and I've limited the profiling using cudaProfilerStart/cudaProfilerStop like so:
/* --- generate data etc --- */
// Call the function once to warm up the FFT plan cache
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
// Call it once for profiling
cudaProfilerStart();
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
cudaProfilerStop();
where applyConvolution() is the function that I'm profiling.
I am using CUDA Toolkit 8.0 on Ubuntu 16.04 with a GTX 1080.
As I was writing up this question, I thought I'd try messing around with the profiler settings to try and preempt some potential answer-in-comment material.
To my surprise, disabling "Enable concurrent kernel profiling" got rid of the profiler overhead completely.
But perhaps this shouldn't have been that much of a surprise:
Enable concurrent kernel profiling - This option should be selected for an application that uses CUDA streams to launch kernels that can execute concurrently. If the application uses only a single stream (and therefore cannot have concurrent kernel execution), deselecting this option may decrease profiling overhead.
(taken from http://docs.nvidia.com/cuda/profiler-users-guide/)
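If you drive the profiler from the command line rather than from nvvp, the same setting appears to be exposed as an nvprof option (I'm going from the same user's guide here, so double-check with nvprof --help for your toolkit version; ./my_app is a placeholder):

nvprof --concurrent-kernels off ./my_app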
An earlier version of the CUDA Profiler User's Guide also noted in a "Profiling Limitations" section that:
Concurrent kernel mode can add significant overhead if used on kernels that execute a large number of blocks and that have short execution durations.
Oh well. Posting this question/answer anyways in case it helps someone else avoid this annoyance.
I am seeing something similar, though it is perhaps only vaguely related. Since the above answer helped, I'll add my observations.
On profiling a Quadro GV100 there is a massive change in apparent performance for fairly simple kernels compared to Pascal-generation cards (e.g. a 1080). I too am running nvvp with profiling disabled and activating it only in the part of the code I'm interested in. Then I accidentally forgot to turn it on, and all I got was our manual event markers (using nvtxRangePush & nvtxRangePop). What do you know: a tenfold speedup. That is to say, on the Quadro GV100 there is a massive profiling overhead that is not there on earlier-generation GPUs.
Disabling concurrent profiling as you did does NOT help, but disabling the API tracing DOES.
There's still a significant overhead compared to manual nvtx though, but at least it allows some idea of kernel performance on the GV100. Larger kernels seem less affected, which is natural if it's related to fixed-cost overhead or API-tracing. The unknown that's left is why the API-tracing costs so much on the GV100 specifically, but I'm in no position to speculate, at least not yet.
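For anyone wanting to reproduce the manual-marker approach, here is a minimal sketch (runKernels() is a placeholder for the work under test; link with -lnvToolsExt):

#include <cuda_runtime.h>
#include <nvToolsExt.h>            /* NVIDIA Tools Extension (NVTX) markers */

extern void runKernels(void);      /* placeholder for the kernels being measured */

void profiledSection(void)
{
    nvtxRangePushA("convolution"); /* the named range shows up in the nvvp timeline */
    runKernels();
    cudaDeviceSynchronize();       /* make sure the range covers the full GPU work */
    nvtxRangePop();
}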
I compiled sm-specific binaries using gcc/5.4 and cuda/9.0 for the above tests, and ran RELION single-threaded for a simple test-case.
What would be the correct way to prevent a soft lockup/unresponsiveness in a long-running while loop in a C program?
(dmesg is reporting a soft lockup)
Pseudo code is like this:
while( worktodo ) {
    worktodo = doWork();
}
My code is of course way more complex, and also includes a printf statement which gets executed once a second to report progress, but the problem is, the program ceases to respond to ctrl+c at this point.
Things I've tried which do work (but I want an alternative):
doing printf every loop iteration (don't know why, but the program becomes responsive again that way (???)) - wastes a lot of performance due to unneeded printf calls (each doWork() call does not take very long)
using sleep/usleep/... - also seems like a waste of (processing-)time to me, as the whole program will already be running several hours at full speed
What I'm thinking about is some kind of process_waiting_events() function or the like, and normal signals seem to be working fine as I can use kill on a different shell to stop the program.
Additional background info: I'm using GWAN and my code is running inside the main.c "maintenance script", which seems to be running in the main thread as far as I can tell.
Thank you very much.
P.S.: Yes I did check all other threads I found regarding soft lockups, but they all seem to ask about why soft lockups occur, while I know the why and want to have a way of preventing them.
P.P.S.: Optimizing the program (making it run shorter) is not really a solution, as I'm processing a 29GB bz2 file which extracts to about 400GB xml, at the speed of about 10-40MB per second on a single thread, so even at max speed I would be bound by I/O and still have it running for several hours.
While the posted answer using threads might be an option, in reality it would just shift the problem to a different thread. My solution in the end was to use
sleep(0)
I also tested sched_yield / pthread_yield, neither of which really helped. Unfortunately I've been unable to find a good resource which documents sleep(0) on Linux, but for Windows the documentation states that using a value of 0 lets the thread yield its remaining part of the current CPU slice.
It turns out that sleep(0) is most probably relying on what is called timer slack in linux - an article about this can be found here: http://lwn.net/Articles/463357/
Another possibility is using nanosleep(&(struct timespec){0}, NULL), which does not seem to rely on timer slack: the Linux man pages for nanosleep state that if the requested interval is below clock granularity, it will be rounded up to clock granularity, which on Linux depends on CLOCK_MONOTONIC according to the man pages. Thus, a value of 0 nanoseconds is perfectly valid and should always work, as clock granularity can never be 0.
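A minimal sketch of the resulting loop (doWork() is a placeholder for the real work function):

#define _POSIX_C_SOURCE 199309L   /* for nanosleep */
#include <time.h>

extern int doWork(void);          /* placeholder for the real work function */

void processingLoop(void)
{
    int worktodo = 1;
    while (worktodo) {
        worktodo = doWork();
        /* Zero-length sleep: briefly yields to the kernel so the soft-lockup
           watchdog and signal delivery (e.g. Ctrl+C) get a chance to run. */
        nanosleep(&(struct timespec){0}, NULL);
    }
}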
Hope this helps someone else as well ;)
Your scenario is not really a soft lockup; it is a process that is busy doing something.
How about this pseudo code:
void workerThread()
{
    while(workToDo)
    {
        if(threadSignalled)
            break;
        workToDo = DoWork()
    }
}

void sighandler()
{
    signal worker thread to finish
    waitForWorkerThreadFinished;
}

void main()
{
    InstallSignalHandler;
    CreateSemaphore
    StartThread;
    waitForWorkerThreadFinished;
}
Clearly a timing issue. Using a signalling mechanism should remove the problem.
The use of printf appears to solve the problem because printf accesses the console, which is an expensive and time-consuming operation, and in your case that gives enough time for the worker to complete its work.
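A hedged, more concrete version of the pseudo code above using POSIX threads (doWork() is a placeholder; compile with -pthread):

#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <stdio.h>

extern int doWork(void);                   /* placeholder for the real work */

static volatile sig_atomic_t stopRequested = 0;

static void sighandler(int signo)
{
    (void)signo;
    stopRequested = 1;                     /* only async-signal-safe work here */
}

static void *workerThread(void *arg)
{
    (void)arg;
    int workToDo = 1;
    while (workToDo && !stopRequested)
        workToDo = doWork();
    return NULL;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sighandler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);          /* Ctrl+C just sets the flag */

    pthread_t worker;
    pthread_create(&worker, NULL, workerThread, NULL);
    pthread_join(worker, NULL);            /* wait for the worker to finish */
    puts("worker finished");
    return 0;
}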
I have some C code with several functions. I need to find out the execution time of each function. I have tried using gettimeofday and rdtsc, but I guess that because it is a multi-core system the reported time includes the switching time between the processors. I want it to be serialized. So if somebody could give me an idea of how I should calculate the time, or at least tell me about the syntax of rdtscp, I would appreciate it.
P.S. please reply as soon as possible
Thanks :)
It's a little impractical to expect nanosecond resolution.
You can't add code just to output the execution times of functions without increasing execution time. When you take the code out, the timing changes.
In practice, this kind of measurement is made by observing the CPU timing signals on an oscilloscope (or logic analyser).
If you have multiple cores, then the CPU timer won't be consistent between them. So set the thread affinity to keep the thread on one core. You also might want to use a per-process CPU-time clock to measure the time for the process or thread, using clock_gettime(CLOCK_PROCESS_CPUTIME_ID). Read the note for SMP systems in the usage for that function.
Both of these will affect the timing of the program, so perform multiple iterations of whatever you are benchmarking, and don't call the timing functions too often, to try to mitigate this.
There should be some way to set processor affinity to tell the operating system to only run that thread on a particular core.
On Windows there is a SetThreadAffinityMask system call; I imagine there is a similar function on Linux, although I don't know what it is called.
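On Linux the corresponding calls are sched_setaffinity (process/thread by ID) and pthread_setaffinity_np (for a specific pthread). A minimal sketch that pins the calling thread to one core and then measures CPU time with clock_gettime (functionUnderTest() is a placeholder, and core 0 is chosen arbitrarily here):

#define _GNU_SOURCE           /* for CPU_ZERO/CPU_SET and sched_setaffinity */
#include <sched.h>
#include <stdio.h>
#include <time.h>

extern void functionUnderTest(void);   /* placeholder */

int main(void)
{
    /* Pin this thread to core 0 so per-core timers stay consistent. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    functionUnderTest();
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("CPU time: %.0f ns\n", ns);
    return 0;
}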
You could boot your dual core system to use one core only using the following kernel parameter:
maxcpus=1
But the measured time will still include process context switching and thus depend on the activity of other processes on the system. Are you interested in the wall-clock execution time, or the CPU time needed to execute your task?
Mate, I'm not sure about this, but even if you're on a dual core, unless the program is threaded it will only run in one thread (meaning one core), so it should not involve any time switching between processors; I believe there is no such thing...
Pavium is correct; the only way to get decent timing at this resolution is with an oscilloscope and toggling GPIO pins.
Bear in mind that this is all a bit academic anyway: I suppose you are running with an operating system, etc, so there is no way to get a straight run at the hardware.
You really need to look at the reason you want this measurement. Is it a performance benchmark for some code? You could try running the code many thousands of times and getting some statistics. For this kind of approach I would recommend you read Zed Shaw's diatribe to make sure the numbers aren't fooling you.
Precise performance measuring was practically impossible until Linux kernel 2.6.31. That kernel added a new interface for accessing the performance counters of the CPU and, IMHO, corrected time accounting in the scheduler.
Unfortunately I don't have more details, but maybe it is a starting point for further research. I'm just adding this because nobody mentioned it before.
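The interface in question is the perf events subsystem, reached through the perf_event_open system call. A minimal sketch along the lines of the example in the perf_event_open man page, counting retired instructions around a region of code (functionUnderTest() is a placeholder):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Thin wrapper: glibc does not export perf_event_open directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

extern void functionUnderTest(void);   /* placeholder */

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this process, any CPU */
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    functionUnderTest();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}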
Use a struct timespec and the clock_gettime function as follows to obtain the execution time of a piece of code with nanosecond precision:
#include <stdint.h>
#include <time.h>

struct timespec start, end;
clock_gettime(CLOCK_REALTIME, &start);
/* Do something */
clock_gettime(CLOCK_REALTIME, &end);
Each timespec can be converted to a single nanosecond value as ((uint64_t)start.tv_sec * 1000000000ULL) + (uint64_t)start.tv_nsec; subtracting the converted start from the converted end gives the elapsed time in nanoseconds.
Moreover, I have used this with multithreaded code too.
Hope this helps you get the execution time in nanoseconds you are after.
I am using QueryPerformanceCounter to time some code. I was shocked when the code started reporting times that were clearly wrong. To convert the result of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
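where QPC.end and QPC.start are counter samples and QPF is the frequency. A minimal sketch of that pattern in C (the timed region is a placeholder):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER qpf, start, end;
    QueryPerformanceFrequency(&qpf);      /* counts per second */
    QueryPerformanceCounter(&start);
    /* ... code being timed ... */
    QueryPerformanceCounter(&end);
    double seconds = (double)(end.QuadPart - start.QuadPart) / (double)qpf.QuadPart;
    printf("elapsed: %.6f s\n", seconds);
    return 0;
}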
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed, as the wall-clock time of the running program did not change, although the time reported using QPC did (it dropped by a factor of 2.7/4.1).
MyComputer->Properties shows:
Intel(R) Pentium(R) 4 CPU 2.80 GHz; 4.11 GHz; 1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets, nor do I have a CPU clock that varies by itself. From what I have read, QPC and QPF are based on a timer on the PCI bus and are not affected by changes in the CPU clock. The strange thing in my situation is that the frequency reported by QPF changed to an incorrect value, and this changed frequency was also reported in MyComputer -> Properties, which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have one of those chipsets. Additionally, some dual-core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and QueryPerformanceFrequency() offer a bit better resolution, but have different issues. For example in Windows XP, all AMD Athlon X2 dual core CPUs return the PC of either of the cores "randomly" (the PC sometimes jumps a bit backwards), unless you specially install AMD dual core driver package to fix the issue. We haven't noticed any other dual+ core CPUs having similar issues (p4 dual, p4 ht, core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports a technology such as SpeedStep or Cool'n'Quiet. Wall-clock time is not affected; it uses the RTC. You should probably stop using the performance counters unless you can tolerate a few (5-50) milliseconds' worth of occasional phase adjustments, and are willing to do some math to perform that phase adjustment by continuously or periodically re-normalizing your performance counter values based on the reported performance counter frequency and on the RTC's low-resolution time (you can do this on demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs).
You can try to use the Stopwatch class from .NET; it could help with your problem since it abstracts away all this low-level stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it does not matter which processor the thread runs on. However, because of bugs in the BIOS or the Hardware Abstraction Layer (HAL), you can get different timing results on different processors. To specify processor affinity for a thread, use the ProcessThread.ProcessorAffinity method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sit idle. Randomly choose from the set of available CPUs and you have at least a fighting chance that you're not locked to an overloaded CPU.