I understand the term CPU load: it indicates how much of the CPU's time is spent doing useful work. I would periodically measure how long the CPU spent executing the lowest-priority task (such as an idle task), and that would tell me how loaded the CPU was.
I have often seen system software architects determine the CPU bandwidth requirements of an application.
For example, it would be stated that an audio codec needed 130 MHz for its operation (while the CPU itself would run at, e.g., 260 MHz).
Sometimes the requirement would be stated as a program needing only a certain percentage (e.g. 10%) of the CPU. In other words, if the CPU ran at 260 MHz, the program in question would need only 26 MHz.
What is the philosophy/technique to determine computing requirements such as above?
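The two ways of stating the figure in the question are linked by simple arithmetic, which a minimal C sketch can make concrete. The function names and the tick counters are hypothetical: load is taken as the share of a measurement window not spent in the idle task, and a percentage requirement is converted to an equivalent clock rate. This only restates the numbers in the question; it is not a method for deriving the requirement itself.

```c
#include <assert.h>

/* CPU load in percent over one measurement window.
 * idle_ticks:  ticks spent in the lowest-priority (idle) task
 * total_ticks: length of the window in the same unit
 * Both counters are illustrative; an RTOS would supply its own. */
double cpu_load_percent(unsigned long idle_ticks, unsigned long total_ticks)
{
    assert(total_ticks > 0 && idle_ticks <= total_ticks);
    return 100.0 * (double)(total_ticks - idle_ticks) / (double)total_ticks;
}

/* Clock rate (in MHz) effectively consumed by a task that occupies
 * `percent` of a CPU running at `clock_mhz`. */
double required_mhz(double clock_mhz, double percent)
{
    assert(percent >= 0.0 && percent <= 100.0);
    return clock_mhz * percent / 100.0;
}
```

With the numbers from the question, required_mhz(260.0, 10.0) gives 26 MHz and required_mhz(260.0, 50.0) gives 130 MHz. A window of 1000 ticks with 300 idle ticks corresponds to a load of 70%.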
I am working on a ZedBoard, which contains a dual-core ARM Cortex-A9 processor and runs Linux. The board communicates with an external I/O device.
I have two functions written in C, which I have to run in parallel.
One function runs a while loop that continuously sends data to the external device and receives the processed data back at a memory location referenced by a pointer.
The other function reads the data from that location, creates a copy of it, and performs computationally intensive processing (such as FFT and signal alignment), which is slow.
The external device needs data at 15 million samples per second, which I am able to achieve if I run only the first function; it takes about 70% of one ARM core. When I run both functions, both ARM cores reach their limit and I am no longer able to provide the data to the external device at the required sample rate.
Is there a way to restrict the two functions to separate cores (it doesn't matter if the second function is slow, but the performance of the first function can't be compromised) and still share data between them?
I tried using OpenMP, but it did not achieve the required performance. I read about sched_setaffinity but had trouble understanding how to use it.
I have optimized each of my functions as much as I could using NEON constructs/libraries and the compiler's auto-vectorization support for ARM.
You can pin each thread to a different core with:
int sched_setaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);
From the man page:
Description
A process's CPU affinity mask determines the set of CPUs on which it is eligible to run. On a multiprocessor system, setting the CPU affinity mask can be used to obtain performance benefits. For example, by dedicating one CPU to a particular process (i.e., setting the affinity mask of that process to specify a single CPU, and setting the affinity mask of all other processes to exclude that CPU), it is possible to ensure maximum execution speed for that process. Restricting a process to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a process ceases to execute on one CPU and then recommences execution on a different CPU.
But if your code has tight data dependencies between the input and output threads, multithreading can be slower than running on a single core! This is strongly related to memory and cache behaviour, and especially on ARM to all the bridges between cores, memory, caches and external bus systems. You should experiment with priority, affinity and maybe other parameters as well.
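As a minimal sketch of how the affinity call can be used on Linux, the helpers below pin the calling thread to one CPU. The function names are hypothetical. Passing a pid of 0 to sched_setaffinity() applies the mask to the calling thread, so each thread can pin itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* must come before the includes to expose CPU_SET etc. */
#endif
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may currently run on,
 * or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    }
    return -1;
}

/* Pin the calling thread to a single CPU (pid 0 = calling thread).
 * Returns 0 on success or the errno value on failure. */
int pin_current_thread(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        return errno;
    return 0;
}
```

In the two-function setup from the question, the I/O thread could call pin_current_thread(0) and the FFT thread pin_current_thread(1) at the start of their thread functions; the core numbers are an assumption about a dual-core board. Threads of one process share an address space, so the data pointer stays visible to both. Pinning alone does not keep other processes off the I/O core.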
BTW: "15 million samples per second" plus FFT with I/O in parallel on a 1 GHz ARM running Linux. Wow! Hot stuff ;)
In a program that calls getrusage() twice to obtain the time of a task by subtraction, I once saw an assertion fail that checks that the time of the task is nonnegative. This cannot easily be reproduced, of course, although I could write a specialized program that might reproduce it more easily.
I have tried to find a guarantee that the values returned by getrusage() increase over the course of execution, but neither the man page on my system (Linux on x86-64) nor this system-independent description says so explicitly.
The behavior was observed on a physical computer, with several cores, and NTP running.
Should I report a bug against the OS I am using? Am I asking too much when I expect getrusage() to increase with time?
On many systems rusage (I presume you mean ru_utime and ru_stime) is not calculated accurately; it's just sampled once per clock tick, which is usually as slow as 100 Hz and sometimes even slower.
The primary reason for that is that many machines have clocks that are incredibly expensive to read, and you don't want to do this accounting precisely (you'd have to read the clock twice for every system call). You could easily end up spending more time reading clocks than doing anything else in programs that make many system calls.
The counters should never go backwards, though. I saw that happen many years ago on a system where the total running time of the process was tracked on context switches (which was relatively cheap), and getrusage calculated utime by taking the sampled stime and subtracting it from the total running time. The clock used in that case was the wall clock instead of a monotonic clock, and when you changed the time on the machine, the running time of processes could go back. But that was of course a bug.
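For reference, a minimal sketch of measuring CPU time by subtracting two getrusage() readings, as the question describes. The helper names are hypothetical. Reporting a negative difference as 0 is a defensive choice of this sketch, not a claim that a backwards step is acceptable. Because of the tick-based sampling described above, a short task can legitimately measure as 0.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to microseconds. */
long long timeval_to_us(struct timeval tv)
{
    return (long long)tv.tv_sec * 1000000LL + (long long)tv.tv_usec;
}

/* Total CPU time (user + system) consumed so far by the calling process,
 * in microseconds, or -1 if getrusage() fails. */
long long process_cpu_us(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return timeval_to_us(ru.ru_utime) + timeval_to_us(ru.ru_stime);
}

/* CPU time between two readings taken with process_cpu_us().
 * A negative difference is reported as 0 instead of aborting. */
long long elapsed_cpu_us(long long start_us, long long end_us)
{
    assert(start_us >= 0 && end_us >= 0);
    long long diff = end_us - start_us;
    return diff < 0 ? 0 : diff;
}
```

A caller would take one reading before the task and one after it, then pass both to elapsed_cpu_us().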
I am playing around with pthreads for the first time and have noticed something strange when running on my machine.
I have an Intel i5 with 2 physical cores and 4 virtual cores.
When running my program with 2 threads, I get roughly double the performance, yet when running with 4 threads, I get the same performance as two threads. Why is this the case?
Results with 2 threads:
real 0m9.335s
user 0m18.233s
sys 0m0.132s
Results with 4 threads:
real 0m9.427s
user 0m34.130s
sys 0m0.180s
Edit: The code is fully parallelizable and the threads are running independently without any shared resources.
Because you only really have 2 cores. Hyper-threading will not magically create 2 more cores for you. It lets the CPU run 4 threads at once, but not at the speed of four cores: the threads are still allocated to the two physical cores, and the two threads on each core share that core's execution pipeline.
The performance increase you may expect is at BEST 30%.
Keep in mind that hyperthreading is basically a way of reusing spare execution units on the CPU for a separate thread of execution. You're still working with the horsepower of two cores, it's just split four ways.
If your code is optimized such that it fully utilizes most of the available EUs, there's no spare resources left once it's running on both physical cores, so the hyperthreaded cores can't do any better.
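The timings in the question already show this. A minimal sketch with hypothetical helper names, assuming both runs performed the same total amount of work:

```c
#include <assert.h>

/* Average number of logical CPUs kept busy during a run, computed from
 * the "real", "user" and "sys" figures printed by time(1). */
double avg_busy_cpus(double real_s, double user_s, double sys_s)
{
    assert(real_s > 0.0);
    return (user_s + sys_s) / real_s;
}

/* Factor by which the total CPU time grew between two runs that did
 * the same amount of work. */
double cpu_time_ratio(double cpu_before_s, double cpu_after_s)
{
    assert(cpu_before_s > 0.0);
    return cpu_after_s / cpu_before_s;
}
```

For the 2-thread run, avg_busy_cpus(9.335, 18.233, 0.132) is roughly 1.97; for the 4-thread run, avg_busy_cpus(9.427, 34.130, 0.180) is roughly 3.64. All four logical CPUs were busy in the second run, yet the wall-clock time did not improve. The same work cost about 1.87 times as much CPU time (34.310 s versus 18.365 s), which is what one would expect if each hardware thread ran at roughly half speed while sharing a core.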
This old article from when HyperThreading (HT) was first introduced provides a lot of details on how it works (though I'm sure many improvements have been made over the last 10 years). http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf:
Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total.
However, the following sentence shows where HT can bottleneck:
Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses.
If the executing threads each keep one or more of those shared resources (such as the execution units or buses) 100% busy, then hyperthreading will not improve throughput. Since benchmarks often exercise one aspect of a system (intentionally or not), it's not surprising that one of these shared processor resources would end up being a bottleneck and prevent HT from showing a benefit.
The performance gain when using multiple threads is very difficult to determine. Hyperthreading is also "less than one extra core" in performance for sure.
Besides that, you may run into memory throughput issues, or your code may be contending over locks or the like now that you have more threads. Even if your own code is lock-free, that doesn't mean that, for example, I/O or some functions you call can run completely in parallel - there are sometimes "hidden" shared resources.
But most likely, your processor just can't go any faster.
I have written code that performs a specific task. If I run it on a different machine (with a different clock frequency), will it take a different amount of time?
Question
If my code has one printf function, then will its required number of machine cycles be fixed for all machines, or will it depend on the system?
My system frequency is 2.0 GHz; what does that mean?
The performance time of the code will depend on the frequency of the CPU, amongst many other things. All other things being equal, a faster CPU will take less time to execute the same instructions. But the number of other things that can affect the timing is vast, including O/S, compiler, memory chips, disk and so on.
If the machines have the same basic architecture, then the number of machine cycles is fixed. However, modern CPU architectures are very complex, and there could easily be variations depending on what else is running on the machine at the same time. If the machines have different chip types (even within a family such as Intel Core 2 Duo), then the results could be different. If the machines are of different architectures (Intel vs SPARC or PowerPC, say), then all bets are off.
If the 'frequency is 2.0 GHz', then it means that the main CPU clock cycles at 2.0 GHz. How many instructions are executed in that time depends on the instructions, and the parallelism (how many cores), and the CPU type, etc. The CPU frequency is separate from the bus frequency which controls how fast memory can be read (so, I'm using a 2.0 GHz CPU but the memory bus runs at 1067 MHz).
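As a worked illustration of what the clock figure means, here is a minimal sketch with hypothetical helper names. It is deliberately idealised: it treats the cycle count as a given number, whereas the paragraphs above explain why that count varies between CPU types and with whatever else the machine is doing.

```c
#include <assert.h>

/* Duration of one clock cycle in nanoseconds for a clock of `ghz` GHz. */
double cycle_time_ns(double ghz)
{
    assert(ghz > 0.0);
    return 1.0 / ghz;
}

/* Idealised run time in seconds for code that needs `cycles` clock
 * cycles on a CPU clocked at `ghz` GHz. */
double run_time_s(double cycles, double ghz)
{
    assert(ghz > 0.0 && cycles >= 0.0);
    return cycles / (ghz * 1e9);
}
```

At 2.0 GHz one cycle lasts 0.5 ns, so code that needed one million cycles (an illustrative count, not a measurement of printf) would take about 0.5 ms.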
Clock speed of a computer of course has its influence on the execution time of a program, but just stating that the processor runs at 2 GHz is absolutely not enough to determine how long exactly the program will run because there are huge differences in "efficiency" between the processor families - an Intel Core family processor will just do a lot more work per time unit than its predecessor, the Pentium 4, when both run at the same speed.
So yes, CPU speed has a serious influence on the execution time of a program but just the GHz value is absolutely not enough. That's why various benchmarks were set up, to be able to compare the work a processor can do in a time unit. These benchmarks will run a mix of instructions that can be considered a typical workload in a chosen scenario, and time how long their execution will take. Check out Whetstone and Dhrystone for some older but relatively easy to understand benchmarks.
The fact that there are tons of benchmarks only proves that it's not easy at all to obtain a comparable value on whose relevance everybody can agree, it remains a topic for debate...
The frequency of the CPU defines how much work it can do within a certain time. The code is the same on all machines (i.e. it's compiled code) so yes the frequency will affect the time it takes to run your program.
Are there any techniques for optimizing code to reduce power consumption? The architecture is ARM and the language is C.
From the ARM technical reference site:
The features of the ARM11 MPCore processor that improve energy efficiency include:
accurate branch and sub-routine return prediction, reducing the number of incorrect instruction fetch and decode operations
use of physically addressed caches, which reduces the number of cache flushes and refills, saving energy in the system
the use of MicroTLBs reduces the power consumed in translation and protection lookups each cycle
the caches use sequential access information to reduce the number of accesses to the tag RAMs and to unwanted data RAMs.
In the ARM11 MPCore processor extensive use is also made of gated clocks and gates to disable inputs to unused functional blocks. Only the logic actively in use to perform a calculation consumes any dynamic power.
Based on this information, I'd say that the processor does a lot of work for you to save power. Any power wastage would come from poorly written code that does more processing than necessary, which you wouldn't want anyway. If you're looking to save power, the overall design of your application will have more effect. Network access, screen rendering, and other power-hungry operations will be of more concern for power consumption.
Optimizing code to use less power is, effectively, just optimizing code. Regardless of whether your motives are monetary, social, political or the like, fewer CPU cycles = less energy used. What I'm trying to say is that you can probably replace "power consumption" with "execution time", as the two would be essentially directly proportional, and you may therefore have more success if you don't "scare" people off with a power-related question. I may, however, stand corrected :)
Yes. Use a profiler and see what routines are using most of the CPU. On ARM you can use some JTAG connectors, if available (I used Lauterbach both for debugging and for profiling). The main problem is generally to put your processor, when in idle, in a low-consumption state (deep sleep). If you cannot reduce the CPU percentage used by much (for example from 80% to 50%) it won't make a big difference. Depending on what operating systems you are running the options may vary.
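Under an operating system, one way application code can help the processor reach an idle state is to sleep while waiting instead of spinning in a busy loop. The sketch below is a minimal POSIX illustration: the function name, the flag (assumed to be set from a signal handler) and the polling interval are hypothetical. A blocking primitive or an interrupt-driven design would avoid the periodic wake-ups altogether, and on bare metal the equivalent would be the platform's own wait-for-interrupt mechanism.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/* Wait until *flag becomes non-zero, sleeping between checks so the
 * kernel can idle the core instead of running a busy loop.
 * Returns the number of sleeps performed before the flag was seen,
 * or -1 if the flag is still clear after max_polls sleeps. */
int wait_for_flag(const volatile sig_atomic_t *flag, long poll_ns, int max_polls)
{
    assert(flag != NULL && max_polls >= 0);
    assert(poll_ns >= 0 && poll_ns < 1000000000L);

    struct timespec delay = { .tv_sec = 0, .tv_nsec = poll_ns };
    for (int polls = 0; polls < max_polls; polls++) {
        if (*flag)
            return polls;
        nanosleep(&delay, NULL);   /* thread is descheduled while sleeping */
    }
    return *flag ? max_polls : -1;
}
```

A longer polling interval means fewer wake-ups at the cost of a slower reaction to the flag.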
The July 2010 edition of the Communications of the ACM has an article on energy-efficient algorithms which might interest you. I haven't read it yet so cannot impart any of its wisdom.
Try to stay in on-chip memory (cache) for idle loops, keep I/O to a minimum, and keep bit flipping on buses to a minimum. Non-volatile memories such as PROMs and flash consume more power to store zeros than ones. (That is why they erase to ones: the stored value is actually a zero, but the transistors invert the bit before you see it, so zeros are stored as ones and ones as zeros. It is also why they degrade to ones when they fail.) I don't know about volatile memories; DRAM uses fewer transistors per bit than SRAM (typically one plus a capacitor versus six), but it has to be refreshed.
For all of this to matter, though, you need to start with a low-power system, as the above may otherwise not be noticeable. Don't use anything from Intel, for example.
If you are not running Windows XP+ or a newer version of Linux, you could run a background thread which does nothing but HLT.
This is how programs like CPUIdle reduce power consumption/heat.
If the processor is tuned to use less power when it needs fewer cycles, then simply making your code run more efficiently is the solution. Otherwise, there's not much you can do unless the operating system exposes some sort of power management functionality.
Keep IO to a minimum.
On some ARM processors it's possible to reduce power consumption by putting the voltage regulator in standby mode.