While benchmarking BLAS matrix operations in C on a CPU with hyperthreading, I observed an almost exact doubling of the functions' runtime when hyperthreading was in use. I had expected some kind of speed improvement because of out-of-order execution or other optimizations.
I use gettimeofday to estimate the runtime. To evaluate the observation, I would like to know whether you have thoughts on the stability of gettimeofday in a hyperthreading environment (Debian Linux, 32-bit), or on my expectations (they might be wrong).
Update: I forgot to mention that I am running the benchmark application twice, setting the affinity to one hyperthreading core each. For example gemm is run twice in parallel.
I doubt whether your use of gettimeofday() explains the discrepancy, unless, possibly, you are measuring very small time intervals.
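To rule the timer out as a factor, one option is to time many repetitions with a monotonic clock instead of gettimeofday(), which reports wall-clock time and can jump if the system clock is adjusted. The sketch below is only an illustration: it swaps in POSIX clock_gettime(CLOCK_MONOTONIC), and the names elapsed_seconds, time_per_call and work are hypothetical, not taken from the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Seconds elapsed between two clock_gettime() readings. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Run `work` (a hypothetical stand-in for the routine under test)
 * `reps` times and return the mean wall-clock seconds per call. */
double time_per_call(void (*work)(void), int reps)
{
    struct timespec t0, t1;

    assert(work != NULL && reps > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_seconds(t0, t1) / reps;
}
```

Choosing reps so that the whole measurement lasts well over a millisecond keeps the timer's own resolution and call overhead small relative to the interval being measured.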
More to the point, I would not expect enabling hyperthreading to improve the performance of single-threaded BLAS computations. A single thread uses only one processor (at a time), so the additional logical processors presented by hyperthreading do not help.
A well-tuned BLAS makes good use of the CPU's data cache to reduce memory access time. That doesn't help much if the needed data are evicted from the cache, however, as is likely to happen when a different process is executed by the other logical processor of the same physical CPU. Even on a lightly-loaded system, there is probably enough work to do that the OS will have a process scheduled at all times on every available (logical) processor.
Related
I am playing around with pthreads for the first time and have noticed something strange when running on my machine.
I have an Intel i5 with 2 physical cores and 4 virtual cores.
When running my program with 2 threads, I get roughly double the performance, yet when running with 4 threads, I get the same performance as two threads. Why is this the case?
Results with 2 threads:
real 0m9.335s
user 0m18.233s
sys 0m0.132s
Results with 4 threads:
real 0m9.427s
user 0m34.130s
sys 0m0.180s
Edit: The code is fully parallelizable and the threads are running independently without any shared resources.
Because you only really have 2 cores. Hyper-threading will not magically create 2 more cores for you. It lets the CPU keep 4 threads in flight, but the two threads on each physical core share that core's execution resources, so their instructions compete for the same pipeline instead of running on separate hardware.
The performance increase you may expect is at BEST 30%.
Keep in mind that hyperthreading is basically a way of reusing spare execution units on the CPU for a separate thread of execution. You're still working with the horsepower of two cores, it's just split four ways.
If your code is optimized such that it fully utilizes most of the available EUs, there's no spare resources left once it's running on both physical cores, so the hyperthreaded cores can't do any better.
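The timings in the question can be read with a little arithmetic. As a rough sketch (the function names busy_cpus and speedup are made up for this illustration), dividing user time by real time gives the average number of logical CPUs kept busy, and dividing one run's real time by another's gives the relative speedup:

```c
#include <assert.h>

/* Average number of logical CPUs kept busy during a run:
 * total user CPU time divided by wall-clock time. */
double busy_cpus(double user_seconds, double real_seconds)
{
    assert(real_seconds > 0.0);
    return user_seconds / real_seconds;
}

/* Speedup of a run relative to a baseline run of the same job:
 * values above 1.0 mean the run finished sooner than the baseline. */
double speedup(double baseline_real_seconds, double real_seconds)
{
    assert(real_seconds > 0.0);
    return baseline_real_seconds / real_seconds;
}
```

With the numbers above, busy_cpus(18.233, 9.335) is roughly 1.95 and busy_cpus(34.130, 9.427) is roughly 3.62, while speedup(9.335, 9.427) is about 0.99. So all four logical processors were busy in the second run, yet the job finished no sooner, which is consistent with each thread progressing at about half the speed once two of them share a physical core.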
This old article from when HyperThreading (HT) was first introduced provides a lot of details on how it works (though I'm sure many improvements have been made over the last 10 years). http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf:
Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total.
However, the following sentence shows where HT can bottleneck:
Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses.
If the executing threads each keep one or more of those shared resources (such as the execution units or buses) 100% busy, then hyperthreading will not improve throughput. Since benchmarks often exercise one aspect of a system (intentionally or not), it's not surprising that one of these shared processor resources would end up being a bottleneck and prevent HT from showing a benefit.
The performance gain when using multiple threads is very difficult to determine. Hyperthreading is also "less than one extra core" in performance for sure.
Besides that, you may run into memory throughput issues, or your code may be contending over locks now that there are more threads. Even if your own code is lock-free, that doesn't mean that, for example, I/O or the functions you call are able to run completely in parallel; there are sometimes "hidden" shared resources.
But most likely, your processor just can't go any faster.
I want to test a program with various memory bus usage levels. For example, I would like to find out if my program works as expected when other processes use 50% of the memory bus.
How would I simulate this kind of disturbance?
My attempt was to run a process with multiple threads, each thread doing random reads from a big block of memory. This didn't appear to have a big impact on my program. My program has a lot of memory operations, so I would expect that a significant disturbance will be noticeable.
I want to saturate the bus but without using too many CPU cycles, so that any performance degradation will be caused only by bus contention.
Notes:
I'm using a Xeon E5645 processor, DDR3 memory
The mental model of "processes use 50% of the memory bus" is not a great one. A thread that has acquired a core and accesses memory that's not in the caches uses the memory bus.
Getting a thread to saturate the bus is simple, just use memcpy(). Copy several times the amount that fits in the last cache and warm it up by running it multiple times so there are no page faults to slow the code down.
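A minimal sketch of that memcpy() approach follows. The function name hammer_memory_bus and its parameters are hypothetical; the buffer size is left to the caller, who would pick something several times larger than the last-level cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Copy `bytes` bytes between two heap buffers `passes` times.
 * Returns the total number of bytes copied, or 0 if allocation fails.
 * Hypothetical helper: pick `bytes` well above the last-level cache size. */
size_t hammer_memory_bus(size_t bytes, int passes)
{
    volatile unsigned char sink;  /* keeps the copies observable */
    size_t copied = 0;

    assert(bytes > 0 && passes > 0);
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return 0;
    }

    /* Touch every page once so later passes take no page faults. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    for (int p = 0; p < passes; p++) {
        memcpy(dst, src, bytes);
        sink = dst[(size_t)p % bytes];  /* read back one byte per pass */
        copied += bytes;
    }
    (void)sink;

    free(src);
    free(dst);
    return copied;
}
```

Running this in a loop on a core other than the one hosting the program under test should generate steady memory traffic; how close it comes to saturating the bus depends on the machine and would need to be measured.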
My first instinct would be to set up a bunch of DMA operations to bounce data around without using the CPU too much. This all depends on what operating system you're running and what hardware. Is this an embedded system? I'd be glad to give more detail in the comments.
I'd use SSE movntps instructions to stream data, to avoid cache conflicts for the other thread on the same core. Maybe unroll that loop 16 times to minimize the number of instructions per memory transfer. While the DMA idea sounds good, the linked manual is old and written for 32-bit Linux, and your processor model makes me think you probably have a 64-bit OS, which makes me wonder how much of it is still correct. And a bug in your test code may, in the worst case, screw up your hard drive.
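The streaming-store idea can be sketched with the _mm_stream_ps intrinsic, which compilers normally emit as movntps. This is an illustration under assumptions, not a tuned implementation: stream_fill is a hypothetical name, the destination must be 16-byte aligned, the loop is not unrolled, and a plain-store fallback is included for targets without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Fill `n` floats at `dst` with `value` using non-temporal stores,
 * which write to memory without keeping the lines in the cache.
 * Assumption: `dst` is 16-byte aligned. */
void stream_fill(float *dst, size_t n, float value)
{
    size_t i = 0;

#if defined(__SSE__)
    __m128 v = _mm_set1_ps(value);

    for (; i + 4 <= n; i += 4)
        _mm_stream_ps(dst + i, v);   /* four floats per store */
    _mm_sfence();                    /* make the streamed data visible */
#endif

    /* Tail elements, or everything on targets without SSE. */
    for (; i < n; i++)
        dst[i] = value;
}
```

A buffer from aligned_alloc(16, ...) satisfies the alignment assumption. Filling a buffer much larger than the cache in a loop generates write traffic without evicting the other thread's working set as aggressively as ordinary stores would.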
I have written code that performs a specific task. If I run it on a different machine (with a different clock frequency), will it take a different amount of time?
Question
If my code has one printf function, then will its required number of machine cycles be fixed for all machines, or will it depend on the system?
My system frequency is 2.0GHz, what does it mean?
The performance time of the code will depend on the frequency of the CPU, amongst many other things. All other things being equal, a faster CPU will take less time to execute the same instructions. But the number of other things that can affect the timing is vast, including O/S, compiler, memory chips, disk and so on.
If the machines have the same basic architecture, then the number of machine cycles is fixed. However, modern CPU architectures are very complex, and there could easily be variations depending on what else is running on the machine at the same time. If the machines have different chip types (even within a family such as Intel Core 2 Duo), then the results could be different. If the machines are of different architectures (Intel vs SPARC or PowerPC, say), then all bets are off.
If the 'frequency is 2.0 GHz', then it means that the main CPU clock cycles at 2.0 GHz. How many instructions are executed in that time depends on the instructions, and the parallelism (how many cores), and the CPU type, etc. The CPU frequency is separate from the bus frequency which controls how fast memory can be read (so, I'm using a 2.0 GHz CPU but the memory bus runs at 1067 MHz).
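The relationship between clock frequency and run time can be written as a back-of-the-envelope formula: time = instructions × average cycles per instruction ÷ clock frequency. The sketch below only illustrates that arithmetic; the function name and the idea of a single average cycles-per-instruction figure are simplifying assumptions, since real CPUs vary it from moment to moment.

```c
#include <assert.h>

/* Rough run-time estimate in seconds.
 * instructions:           number of instructions executed
 * cycles_per_instruction: assumed average, depends heavily on CPU and code
 * clock_hz:               CPU clock frequency, e.g. 2.0e9 for 2.0 GHz */
double estimate_runtime_seconds(double instructions,
                                double cycles_per_instruction,
                                double clock_hz)
{
    assert(clock_hz > 0.0);
    return instructions * cycles_per_instruction / clock_hz;
}
```

For example, one billion instructions at an assumed average of 2 cycles each on a 2.0 GHz clock come to about one second. The same instruction count on a CPU that averages 1 cycle per instruction would take half as long at the same frequency, which is why the GHz figure alone says little.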
Clock speed of a computer of course has its influence on the execution time of a program, but just stating that the processor runs at 2 GHz is absolutely not enough to determine how long exactly the program will run because there are huge differences in "efficiency" between the processor families - an Intel Core family processor will just do a lot more work per time unit than its predecessor, the Pentium 4, when both run at the same speed.
So yes, CPU speed has a serious influence on the execution time of a program but just the GHz value is absolutely not enough. That's why various benchmarks were set up, to be able to compare the work a processor can do in a time unit. These benchmarks will run a mix of instructions that can be considered a typical workload in a chosen scenario, and time how long their execution will take. Check out Whetstone and Dhrystone for some older but relatively easy to understand benchmarks.
The fact that there are tons of benchmarks only proves that it's not easy at all to obtain a comparable value on whose relevance everybody can agree, it remains a topic for debate...
The frequency of the CPU defines how much work it can do within a certain time. The code is the same on all machines (i.e. it's compiled code) so yes the frequency will affect the time it takes to run your program.
Are there any techniques for optimizing code to reduce power consumption? The architecture is ARM and the language is C.
From the ARM technical reference site:
The features of the ARM11 MPCore processor that improve energy efficiency include:
accurate branch and sub-routine return prediction, reducing the number of incorrect instruction fetch and decode operations
use of physically addressed caches, which reduces the number of cache flushes and refills, saving energy in the system
the use of MicroTLBs reduces the power consumed in translation and protection lookups each cycle
the caches use sequential access information to reduce the number of accesses to the tag RAMs and to unwanted data RAMs.
In the ARM11 MPCore processor extensive use is also made of gated clocks and gates to disable inputs to unused functional blocks. Only the logic actively in use to perform a calculation consumes any dynamic power.
Based on this information, I'd say that the processor does a lot of work for you to save power. Any power wastage would come from poorly written code that does more processing than necessary, which you wouldn't want anyway. If you're looking to save power, the overall design of your application will have more effect. Network access, screen rendering, and other power-hungry operations will be of more concern for power consumption.
Optimizing code to use less power is, effectively, just optimizing code. Regardless of whether your motives are monetary, social, political or the like, fewer CPU cycles = less energy used. What I'm trying to say is that you can probably replace "power consumption" with "execution time", as they would essentially be directly proportional, and you may therefore have more success if you don't "scare" people off with a power-related question. I may, however, stand corrected :)
Yes. Use a profiler and see what routines are using most of the CPU. On ARM you can use some JTAG connectors, if available (I used Lauterbach both for debugging and for profiling). The main problem is generally to put your processor, when in idle, in a low-consumption state (deep sleep). If you cannot reduce the CPU percentage used by much (for example from 80% to 50%) it won't make a big difference. Depending on what operating systems you are running the options may vary.
The July 2010 edition of the Communications of the ACM has an article on energy-efficient algorithms which might interest you. I haven't read it yet so cannot impart any of its wisdom.
Try to stay in on-chip memory (cache) for idle loops, keep I/O to a minimum, and keep bit flipping on buses to a minimum. NV memory like PROMs and flash consumes more power to store zeros than ones (which is why they erase to ones; it is actually a zero, but the transistor(s) invert the bit before you see it, so zeros are stored as ones and ones as zeros, which is also why they degrade to ones when they fail). I don't know about volatile memories; DRAM uses far fewer transistors per bit than SRAM, but has to be refreshed.
For all of this to matter, though, you need to start with a low-power system, as the above may not be noticeable otherwise. Don't use anything from Intel, for example.
If you are not running Windows XP+ or a newer version of Linux, you could run a background thread which does nothing but HLT.
This is how programs like CPUIdle reduce power consumption/heat.
If the processor is tuned to use less power when it needs fewer cycles, then simply making your code run more efficiently is the solution. Otherwise, there's not much you can do unless the operating system exposes some sort of power management functionality.
Keep IO to a minimum.
On some ARM processors it's possible to reduce power consumption by putting the voltage regulator in standby mode.
I have written a program in C as the result of a research project. I want to compute the exact number of CPU cycles the program consumes.
Any idea how can I find that?
The valgrind tool cachegrind (valgrind --tool=cachegrind) will give you a detailed output including the number of instructions executed, cache misses and branch prediction misses. These can be accounted down to individual lines of assembler, so in principle (with knowledge of your exact architecture) you could derive precise cycle counts from this output.
Know that it will change from execution to execution, due to cache effects.
The documentation for the cachegrind tool is here.
No you can't. The concept of a 'CPU cycle' is not well defined. Modern chips can run at multiple clock rates, and different parts of them can be doing different things at different times.
The question of 'how many total pipeline steps' might in some cases be meaningful, but there is not likely to be a way to get it.
Try OProfile. It uses various hardware counters on the CPU to measure the number of instructions executed and how many cycles have passed. You can see an example of its use in the article Memory part 7: Memory performance tools.
I am not entirely sure that I know exactly what you're trying to do, but what can be done on modern x86 processors is to read the time stamp counter (TSC) before and after the block of code you're interested in. On the assembly level, this is done using the RDTSC instruction, which gives you the value of the TSC in the edx:eax register pair.
Note however that there are certain caveats to this approach, e.g. if your process starts out on CPU0 and ends up on CPU1, the result you get from RDTSC will refer to the specific processor core that executed the instruction and hence may not be comparable. (There's also the lack of instruction serialisation with RDTSC, but in this context here, I don't think that's so much of an issue.)
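A minimal sketch of the before-and-after TSC reading, assuming GCC or Clang on x86, where the __rdtsc() intrinsic from x86intrin.h wraps the RDTSC instruction. The names read_tsc and ticks_for are hypothetical, and the non-x86 branch exists only so the code still compiles elsewhere. It does not address the caveats above: there is no pinning to one core and no serialising instruction.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#define HAVE_TSC 1
#else
#define HAVE_TSC 0
#endif

/* Current time stamp counter value, or 0 where no TSC is available. */
uint64_t read_tsc(void)
{
#if HAVE_TSC
    return __rdtsc();
#else
    return 0;
#endif
}

/* TSC ticks that elapse while `fn` runs once.
 * `fn` is a hypothetical stand-in for the block of code being measured. */
uint64_t ticks_for(void (*fn)(void))
{
    uint64_t start = read_tsc();

    fn();
    return read_tsc() - start;
}
```

The result is a tick count of the counter, not necessarily core clock cycles of the program: on many modern processors the TSC runs at a constant rate regardless of the core's current frequency, and the count includes anything else that ran on the core in between.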
Sorry, but no, at least not for most practical purposes -- it's simply not possible with most normal OSes. Just for example, quite a few OSes don't do a full context switch to handle an interrupt, so the time spent servicing an interrupt can and often will appear to be time spent in whatever process was executing when the interrupt occurred.
The "not for practical purposes" would indicate the possibility of running your program under a cycle accurate simulator. These are available, but mostly for CPUs used primarily in real-time embedded systems, NOT for anything like a full-blown PC. Worse, they (generally) aren't for running anything like a full-blown OS, but for code that runs on the "bare metal."
In theory, you might be able to do something with a virtual machine running something like Windows or Linux -- but I don't know of any existing virtual machine that attempts to, and it would be decidedly non-trivial and probably have pretty serious consequences in performance as well (to put it mildly).