Regarding CPU utilization - c

Considering the below piece of C code, I expected the CPU utilization to go up to 100% as the processor would try to complete the job (endless in this case) given to it. On running the executable for 5 mins, I found the CPU to go up to a max. of 48%. I am running Mac OS X 10.5.8; processor: Intel Core 2 Duo; Compiler: GCC 4.1.
int i = 10;
while(1) {
i = i * 5;
}
Could someone please explain why the CPU usage does not go up to 100%? Does the OS limit the CPU from reaching 100%?
Please note that if I added a "printf()" inside the loop the CPU hits 88%. I understand that in this case, the processor also has to write to the standard output stream hence the sharp rise in usage.
Has this got something to do with the amount of job assigned to the processor per unit time?
Regards,
Ven.

You have a multicore processor and you are in a single thread scenario, so you will use only one core full throttle ... Why do you expect the overall processor use go to 100% in a similar context ?

Run two copies of your program at the same time. These will use both cores of your "Core 2 Duo" CPU and overall CPU usage will go to 100%
Edit
if I added a "printf()" inside the loop the CPU hits 88%.
The printf send some characters to the terminal/screen. Sending information, Display and Update is handeled by code outside your exe, this is likely to be executed on another thread. But displaying a few characters does not need 100% of such a thread. That is why you see 100% for Core 1 and 76% for Core 2 which results in the overal CPU usage of 88% what you see.

Related

Scale Down the CPU frequency

Is there any way to tell the kernel that I don't need the full CPU power?
Basically, I want to do some calculation while waiting for another process. But I don't need the full CPU power for that. As the CPU load during the computation is still 100%, the frequency is high. I want to tell the kernel that I am satisfied with a lower CPU frequency in order to save energy.
Instead of calculating using the full frequency and then suspend to wait for the other process, I want to try calculating with lower frequency so that the CPU is not in a lower C state when the other process has finished and the frequency can scale back again.
This doesn't make any sense on a multi-process system, particularly not in Linux. The CPU frequency is a very basic parameter, which affects everything running on the computer - including other processes and the OS itself.
If your program would fiddle with the CPU frequency, it would not only dictate the priority of itself, but also the priority of everything in the computer, including the OS. This isn't possible to do on a desktop system, simply because it doesn't make any sense to have a single application process dictate things that not even the OS dares to meddle with.
If saving power is a priority, you should probably look for completely different alternatives than some desktop Linux solution. PC computers only care about 1) speed, 2) speed and also 3) speed.
The kind of things you ask for are common in real-time embedded systems, where CPUs have a "sleep mode", from which it can wake up to execute something, then go back to sleep. It would usually also be possible for such systems to fiddle with the internal PLL to adjust their own frequency, but such solutions are rare. The industry standard way of doing things is to perform all calculations at max speed, then revert to power-saving sleep mode.
In case of multiple cores - there is a way to specify a certain cpu core to an interrupt. In this way you can save a certain CPU to a certain process:
to find the irq number of the task use:
cat /proc/interrupts
look for your irq number.
lets say that the irp number is 99, so in order to set core #2 to handle this irq, do:
echo 2 > /proc/irq/99/smp_affinity
in this way you can save a certain core to handle your special process.
You can actually use nice() to tell the kernel that your process can live with a lower-than-normal scheduling priority. This effectively reduces the amount of time slices your process will get to use the CPU (typically in favor of other processes running at the same time).
On some more modern systems, if this will reduce the overall CPU load significantly, the CPU might eventually even decide to run on lower frequency. But you typically don't have direct influence on that decision.
Note: Depending on the system, you might be having problems to restore the original nice value (i.e. to scale up on priority again) without running with appropriate permissions.
In case your application is I/O bound and is not doing stupid things wasting CPU cycles like busy-waiting, it shouldn't be necessary to revert to reducing your nice value - Modern CPUs and operating systems should be able to detect themselves when the system is mainly idling around and step down autonomously.
modify scalling_goverance accordingly.
The "scaling_governor" feature enables setting a static frequency to the CPU.
Frequency value must be between scaling_min_freq and scaling_max_freq.
When CPU frequency governor is set to "powersave" mode, CPU is set to the lowest static frequency (within the borders of scaling_min_freq and scaling_max_freq).
Check in below path on the target
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors
and select the required scalling governanceby writing to
echo "powersaving"/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
To tune the performance, few of the files can be updated which will make changes in CPU and the frequency and also the scheduler policies.
Based on the performance analysis, and the load balancing.
Modifications can be adopted.
Check /sys/devices/system/cpu/
Example:-
root:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_governo
rs
interactive performance
root#:/sys/devices/system/cpu/cpu0/cpufreq#
root#:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_governor
interactive
root#:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_frequen
cies
400000 800000 998400 1094400 1190400 1248000 1305600
##:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_frequen
cies
400000 800000 998400 1094400 1190400 1248000 1305600

average time between kernel launch and execution?

If I understand correctly, when you launch a CUDA kernel asynchronously, it may begin execution immediately or it may wait for previous asynchronous calls (transfers, kernels, etc) to complete first. (I also understand that kernels can run concurrently in some cases, but I want to ignore that for now).
How can I find out the time between launching a kernel ("queuing") and when it actually begins execution. In fact, I really just want to know the average "queued time" for all launches in a single run of my program (generally in the tens or hundreds of thousands of kernel launches.)
I can easily calculate the average execution time per kernel with events (~500us). I tried to simulate - I dropped the results of CLOCK() every time a kernel is launched, with the idea that I could then determine how long the launch queue was when each kernel was launched. But CLOCK() does not have high enough precision (0.01s) - sometimes as many as 60 kernels appear to be launched at a single time, when of course in reality many are not.
Rather than clock use the QueryPerformanceTimer which counts based on machine clock cycles.
Code for QueryPerformanceTimer
Secondly, the profiling tool (Visual Profiler) only measures serial launches [see page 24] and [see post number 3].
Thus the best option is (1) use QueryPerformanceTimer (or the Visual Profiler) such that you get an accurate measurement of a single launch and (2) use QueryPerformanceTimer to get the timing of multiple launches and observe whether the timing results suggest that asynchronous launching took place.

How to find execution time (if possible in nanoseconds) of the sections in a C code on linux with intel dual core?

I have a C code with some functions.I need to find out the execution time of each function, I have tried using gettimeofday and rdtsc,but i guess that because its multi core system the output time provided involves the switching time between the processors. I wanted it to be serialized. So if somebody can give me an idea that how should i calulate the time or at least let me know about the syntax of rdstcp.
P.S. please reply as soon as possible
Thanks :)
It's a little impractical to expect nanosecond resolution.
You can't add code just to output the execution times of functions without increasing execution time. When you take the code out, the timing changes.
In practice, this kind of measurement is made by observing the CPU timing signals on an oscilloscope (or logic analyser).
If you have multiple cores, then the CPU timer won't be stable between them. So set the thread affinity to keep it on the one core. You also might want to use a real time timer to measure the time for the process or thread using clock_gettime(CLOCK_PROCESS_CPUTIMER_ID). Read the note for SMP systems in the usage for that function.
Both of these will effect the timing of the program, so perform multiple iterations of whatever you are benchmarking, and don't call the timing functions too often to try and mitigate this.
There should be some way to set processor affinity to tell the operating system to only run that thread on a particuar core.
In windows there is a SetThreadAffinity system call, I imagine there is a similar function in linux, although I don't know what it is called.
You could boot your dual core system to use one core only using the following kernel parameter:
maxcpus=1
But the measured time will still comprise process contest switching and thus depend on the activity on the other processes on the system. Are you interested in the execution time, or the CPU time needed to execute your task ?
Mate, I'm not sure about this, but even if you're dual core,unless the program is threaded, it will only run in 1 thread (meaning 1 core), so it should not involve the time of switching between processors, I believe there is no such thing...
Pavium is correct, the only way to get decent timing at this resolution in with an oscilloscope and toggling GPIO pins.
Bear in mind that this is all a bit academic anyway: I suppose you are running with an operating system, etc, so there is no way to get a straight run at the hardware.
You really need to look at the reason you want this measurement. Is it a performance benchmark for some code? You could try running the code many thousands of times and get some statistics. For this kind of approach I would recommend you read Zed Shaws diatribe to make sure the numbers aren't fooling you.
Precise Performance Measuring was impossible until Linux kernel 2.6.31. In this kernel a new library for accessing the performacne counters of the CPU and IMHO correcting times in the scheduler was added.
Unfortunately i don't have more details but maybe it is a starting point for more information search. I'm just adding this because nobody mentioned it before
Use the struct timespec structure & clock_gettime function as follows to obtain the time of execution of the code in nanoseconds precision
struct timespec start, end;
clock_gettime(CLOCK_REALTIME,&start);
/* Do something */
clock_gettime(CLOCK_REALTIME,&end);
It returns a value as ((((unsigned64)start.tv_sec) * ((unsigned64)(1000000000L))) + ((unsigned64)(start.tv_nsec))))
Moreover this I've used for multithreaded concepts too..
Hope this answer will be more helpful for you to get your desired execution time in nanoseconds.

Why would one CPU core run slower than the others?

I was benchmarking a large scientific application, and found it would sometimes run 10% slower given the same inputs. After much searching, I found the the slowdown only occurred when it was running on core #2 of my quad core CPU (specifically, an Intel Q6600 running at 2.4 GHz). The application is a single-threaded and spends most of its time in CPU-intensive matrix math routines.
Now that I know one core is slower than the others, I can get accurate benchmark results by setting the processor affinity to the same core for all runs. However, I still want to know why one core is slower.
I tried several simple test cases to determine the slow part of the CPU, but the test cases ran with identical times, even on slow core #2. Only the complex application showed the slowdown. Here are the test cases that I tried:
Floating point multiplication and addition:
accumulator = accumulator*1.000001 + 0.0001;
Trigonometric functions:
accumulator = sin(accumulator);
accumulator = cos(accumulator);
Integer addition:
accumulator = accumulator + 1;
Memory copy while trying to make the L2 cache miss:
int stride = 4*1024*1024 + 37; // L2 cache size + small prime number
for(long iter=0; iter<iterations; ++iter) {
for(int offset=0; offset<stride; ++offset) {
for(i=offset; i<array_size; i += stride) {
array1[i] = array2[i];
}
}
}
The Question: Why would one CPU core be slower than the others, and what part of the CPU is causing that slowdown?
EDIT: More testing showed some Heisenbug behavior. When I explicitly set the processor affinity, then my application does not slow down on core #2. However, if it chooses to run on core #2 without an explicitly set processor affinity, then the application runs about 10% slower. That explains why my simple test cases did not show the same slowdown, as they all explicitly set the processor affinity. So, it looks like there is some process that likes to live on core #2, but it gets out of the way if the processor affinity is set.
Bottom Line: If you need to have an accurate benchmark of a single-threaded program on a multicore machine, then make sure to set the processor affinity.
You may have applications that have opted to be attached to the same processor(CPU Affinity).
Operating systems would often like to run on the same processor as they can have all their data cached on the same L1 cache. If you happen to run your process on the same core that your OS is doing a lot of its work on, you could experience the effect of a slowdown in your cpu performance.
It sounds like some process is wanting to stick to the same cpu. I doubt it's a hardware issue.
It doesn't necessarily have to be your OS doing the work, some other background daemon could be doing it.
Most modern cpu's have separate throttling of each cpu core due to overheating or power saving features. You may try to turn off power-saving or improve cooling. Or maybe your cpu is bad. On my i7 I get about 2-3 degrees different core temperatures of the 8 reported cores in "sensors". At full load there is still variation.
Another possibility is that the process is being migrated from one core to another while running. I'd suggest setting the CPU affinity to the 'slow' core and see if it's just as fast that way.
Years ago, before the days of multicore, I bought myself a dual-socket Athlon MP for 'web development'. Suddenly my Plone/Zope/Python web servers slowed to a crawl. A google search turned up that the CPython interpreter has a global interpreter lock, but Python threads are backed by OS threads. OS Threads were evenly distributed among the CPUs, but only one CPU can acquire the lock at a time, thus all the other processes had to wait.
Setting Zope's CPU affinity to any CPU fixed the problem.
I've observed something similar on my Haswel laptop. The system was quiet, no X running, just the terminal. Executing the same code with different numactl --physcpubin option gave exactly the same results on all cores, except one. I changed the frequency of the cores to Turbo, to other values, nothing helped. All cores were running with expected speed, except one which was always running slower than the others. That effect survived the reboot.
I rebooted the computer and turned off HyperThreading in the BIOS. When it came back online it was fine again. I then turned on HyperThreading and it is fine till now.
Bizzare. No idea what that could be.

CPU clock frequency and thus QueryPerformanceCounter wrong?

I am using QueryPerformanceCounter to time some code. I was shocked when the code starting reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed as the wall clock time of the running program did not change although the time reported using QPC did change (it dropped by 2.7/4.1).
MyComputer->Properties shows:
Intel(R)
Pentium(R)
4 CPU 2.80 GHz; 4.11 GHz;
1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets nor to I have a CPU clock that varies itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and not affected by changes in the CPU clock. The strange thing in my situation is that the FREQUENCY reported by QPF changed to an incorrect value and this changed frequency was also reported in MyComputer -> Properties which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have those chipset. Additionally some dual core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and
QueryPerformanceFrequency() offer a
bit better resolution, but have
different issues. For example in
Windows XP, all AMD Athlon X2 dual
core CPUs return the PC of either of
the cores "randomly" (the PC sometimes
jumps a bit backwards), unless you
specially install AMD dual core driver
package to fix the issue. We haven't
noticed any other dual+ core CPUs
having similar issues (p4 dual, p4 ht,
core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected, it uses a RTC. You should probably stop using the performance counters, unless you can tolerate a few (5-50) millisecond's worth of occasional phase adjustments, and are willing to perform some math in order to perform the said phase adjustment by continuously or periodically re-normalizing your performance counter values based on the reported performance counter frequency and on RTC low-resolution time (you can do this on-demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs.)
You can try to use the Stopwatch class from .NET, it could help with your problem since it abstracts from all this low-lever stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it
does not matter which processor the
thread runs on. However, because of
bugs in the BIOS or the Hardware
Abstraction Layer (HAL), you can get
different timing results on different
processors. To specify processor
affinity for a thread, use the
ProcessThread..::.ProcessorAffinity
method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sit idle. Randomly choose from the set of available CPUs and you have at least a fighting chance that you're not locked to an overloaded CPU.

Resources