Does the frequency of the machine affect the execution time of my code? - c

I have written code that performs a specific task; if I now run it on a different machine (with a different clock frequency), will it take a different amount of time?
If my code has one printf call, will the number of machine cycles it requires be fixed across all machines, or will it depend on the system?
My system's frequency is 2.0 GHz; what does that mean?

The execution time of the code will depend on the frequency of the CPU, among many other things. All other things being equal, a faster CPU will take less time to execute the same instructions. But the number of other things that can affect the timing is vast, including the OS, compiler, memory chips, disk and so on.
If the machines have the same basic architecture, then the number of machine cycles is fixed. However, modern CPU architectures are very complex, and there could easily be variations depending on what else is running on the machine at the same time. If the machines have different chip types (even within a family such as Intel Core 2 Duo), then the results could be different. If the machines are of different architectures (Intel vs SPARC or PowerPC, say), then all bets are off.
If the 'frequency is 2.0 GHz', then it means that the main CPU clock cycles at 2.0 GHz. How many instructions are executed in that time depends on the instructions, and the parallelism (how many cores), and the CPU type, etc. The CPU frequency is separate from the bus frequency which controls how fast memory can be read (so, I'm using a 2.0 GHz CPU but the memory bus runs at 1067 MHz).

The clock speed of a computer of course influences the execution time of a program, but just stating that the processor runs at 2 GHz is nowhere near enough to determine exactly how long the program will run, because there are huge differences in "efficiency" between processor families: an Intel Core family processor will simply do a lot more work per unit of time than its predecessor, the Pentium 4, when both run at the same clock speed.
So yes, CPU speed has a serious influence on the execution time of a program, but the GHz value alone is not enough. That's why various benchmarks were set up to compare how much work a processor can do in a unit of time. These benchmarks run a mix of instructions that can be considered a typical workload for a chosen scenario, and time how long their execution takes. Check out Whetstone and Dhrystone for some older but relatively easy to understand benchmarks.
The fact that there are tons of benchmarks only proves that it is not at all easy to obtain a comparable value whose relevance everybody can agree on; it remains a topic for debate...

The frequency of the CPU defines how much work it can do within a certain time. The compiled code is the same on all machines, so yes, the frequency will affect the time it takes to run your program.

Related

Fork() vs number of cores

I am looking to do some data processing on some 6700 files and am using fork() to handle different sets of calculations on the data. Will getting a CPU with a higher core count allow me to run more forks? Currently I am using a quad-core with 8 threads, forking 8 times, and it takes me about an hour per file. If I had a 64-core processor and forked 64 times (splitting up the calculations), would that decrease the time by a factor of about 8?
Theoretically no, according to Amdahl's law. Probably also practically, because many resources are shared (the caches, the operating system calls, the disk, etc.), but this really depends on your algorithm. For example, if your algorithm is embarrassingly parallel and CPU-bound, then you may notice a great improvement when increasing the cores to 64.
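
For reference, Amdahl's law bounds the speedup on N cores at 1 / ((1 - p) + p/N), where p is the fraction of the work that can be parallelized. A quick back-of-the-envelope calculation (the value of p here is purely illustrative, not taken from the question) shows why eight times more cores rarely means eight times less time:

```c
#include <stdio.h>

/* Amdahl's law: upper bound on the speedup on n cores for a program
 * whose fraction p of the work is parallelizable. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.95;   /* illustrative: assume 95% of the work parallelizes */
    printf("speedup on  8 cores: %.2f\n", amdahl(p, 8));   /* ~5.9 */
    printf("speedup on 64 cores: %.2f\n", amdahl(p, 64));  /* ~15.4 */
    return 0;
}
```

Going from 8 to 64 cores in this example gives roughly 2.6x, not 8x, because the serial 5% dominates once the parallel part has been spread thin enough.
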
A note after reading the comments on the question: if your algorithm has a complexity of O(n!), it may simply be impossible to execute it in a realistic time. For example, if your input is n = 42 and your machine can do one billion operations per second, then the time required to execute your algorithm is greater than the age of the universe.
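
To make that claim concrete, here is a small sketch of the arithmetic (the one-billion-operations-per-second rate is the figure assumed in the answer above):

```c
#include <stdio.h>

int main(void)
{
    double ops = 1.0;                 /* 42! is about 1.4e51, which still fits in a double */
    for (int i = 2; i <= 42; i++)
        ops *= i;

    double seconds = ops / 1e9;       /* at 10^9 operations per second */
    double age_of_universe = 4.35e17; /* roughly 13.8 billion years, in seconds */

    printf("42! = %.3e operations\n", ops);
    printf("time needed: %.3e s (about %.1e times the age of the universe)\n",
           seconds, seconds / age_of_universe);
    return 0;
}
```
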

Time to run instructions of a for loop

I am interested in measuring a duration of 125 µs to implement a TDM (Time Division Multiplexing) scheme. However, I am not able to measure this duration with an accuracy of ±5 µs on the Linux operating system. I am using DPDK, which runs on Ubuntu and Intel hardware. If I take the time from the computer using clock_gettime(CLOCK_REALTIME), it adds the time needed to call into the kernel to get the time, which gives me an inaccurate duration.
Therefore, I dedicated a CPU core to keeping time without asking the kernel for it. To do this, I run a for loop for a maximum number of iterations (8000000) and work out how many iterations are needed to fill the 125 µs duration (i.e. (125*8000000)/timespent).
However, the problem is that this also gives inaccurate results (the results always differ, by about 1000 instructions).
Does anybody know why I am getting inaccurate results even though I am dedicating a CPU core to this?
Do you know a method of measuring a very short duration (perhaps 125 µs) without making a call into the kernel? Thanks!
You are getting inaccurate results because you are on a multitasking operating system. You cannot do this on modern computers; you can only do it on an embedded microcontroller where you control 100% of the CPU time. The operating system needs to manage your process, even if you have a dedicated CPU, and the mouse and keyboard take time as well. You would have to run the process on 'bare metal'.
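
For completeness, here is a minimal sketch of the kind of TSC-based timing loop the questioner is after (assuming an x86 CPU with a constant/invariant TSC and a GCC/Clang toolchain; the helper names are illustrative). Even with a dedicated, isolated core, interrupts, SMIs and frequency scaling can still introduce jitter larger than ±5 µs, which is the point of the answer above:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */

/* One-off calibration: estimate TSC ticks per second by comparing the
 * TSC against CLOCK_MONOTONIC over roughly 100 ms. Done once at startup,
 * so no time-related kernel/vDSO call is needed in the hot path. */
static uint64_t calibrate_tsc_hz(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();

    do {
        clock_gettime(CLOCK_MONOTONIC, &t1);
    } while ((t1.tv_sec - t0.tv_sec) * 1000000000L
             + (t1.tv_nsec - t0.tv_nsec) < 100000000L);

    uint64_t c1 = __rdtsc();
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (uint64_t)((c1 - c0) / secs);
}

/* Busy-wait for `us` microseconds using only the TSC. */
static void busy_wait_us(uint64_t us, uint64_t tsc_hz)
{
    uint64_t target = __rdtsc() + (tsc_hz / 1000000u) * us;
    while (__rdtsc() < target)
        ;   /* spin */
}

int main(void)
{
    uint64_t tsc_hz = calibrate_tsc_hz();
    printf("estimated TSC frequency: %llu Hz\n", (unsigned long long)tsc_hz);

    busy_wait_us(125, tsc_hz);   /* the 125 us TDM slot from the question */
    puts("slot elapsed");
    return 0;
}
```

The single clock_gettime() call is used only at startup to calibrate the TSC frequency; the 125 µs wait itself never enters the kernel.
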

Hyperthreading effects on gettimeofday and other time measurements

While benchmarking a CPU with hyperthreading using BLAS matrix operations in C, I observed a nearly exact doubling of the runtime of the functions when using hyperthreading. What I expected was some kind of speed improvement because of out-of-order execution or other optimizations.
I use gettimeofday to estimate the runtime (roughly as in the sketch below). To evaluate this observation, I would like to know whether you have any thoughts on the stability of gettimeofday in a hyperthreading environment (Debian Linux, 32-bit), or perhaps on my expectations (they might be wrong)?
Update: I forgot to mention that I am running the benchmark application twice, setting the affinity of each instance to one hyperthreading core. For example, gemm is run twice in parallel.
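
The questioner's actual benchmark code is not shown; the following is a minimal sketch of wall-clock timing with gettimeofday() around a stand-in workload (do_work() here is purely illustrative, standing in for the BLAS gemm call):

```c
#include <stdio.h>
#include <sys/time.h>

/* Stand-in for the BLAS gemm call being benchmarked (illustrative only). */
static void do_work(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += (double)i * 1e-9;
}

int main(void)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    do_work();
    gettimeofday(&t1, NULL);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("elapsed: %.6f s\n", elapsed);
    return 0;
}
```
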
I doubt whether your use of gettimeofday() explains the discrepancy, unless, possibly, you are measuring very small time intervals.
More to the point, I would not expect enabling hyperthreading to improve the performance of single-threaded BLAS computations. A single thread uses only one processor (at a time), so the additional logical processors presented by hyperthreading do not help.
A well-tuned BLAS makes good use of the CPU's data cache to reduce memory access time. That doesn't help much if the needed data are evicted from the cache, however, as is likely to happen when a different process is executed by the other logical processor of the same physical CPU. Even on a lightly-loaded system, there is probably enough work to do that the OS will have a process scheduled at all times on every available (logical) processor.

pthread offer no performance increase when using virtual cores

I am playing around with pthreads for the first time and have noticed something strange when running on my machine.
I have an Intel i5 with 2 physical cores and 4 virtual cores.
When running my program with 2 threads, I get roughly double the performance, yet when running with 4 threads, I get the same performance as with two threads. Why is this the case?
Results with 2 threads:
real 0m9.335s
user 0m18.233s
sys 0m0.132s
Results with 4 threads:
real 0m9.427s
user 0m34.130s
sys 0m0.180s
Edit: The code is fully parallelizable and the threads are running independently without any shared resources.
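
The questioner's code is not shown; the following is a minimal sketch of this kind of experiment, with a CPU-bound worker and no shared state (the thread count and loop bounds are illustrative). Build with -pthread:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4   /* compare 2 vs 4 on a 2-core/4-thread i5 */

/* CPU-bound worker with no shared state, mirroring the "fully
 * parallelizable" description in the question. */
static void *worker(void *arg)
{
    (void)arg;
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; i++)
        x += (double)i;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0) {
            perror("pthread_create");
            return EXIT_FAILURE;
        }

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    puts("done");
    return 0;
}
```

Timing the run with `time ./a.out` for NTHREADS set to 2 and then to 4 reproduces the kind of comparison shown in the question.
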
Because you only really have 2 cores. Hyper-threading will not magically create 2 more cores for you. Hyper-threading makes it possible to run 4 threads on the CPU but not simultaneously. It will still allocate the threads on the two physical cores and switch the threads back and forth in the execution pipeline.
The performance increase you may expect is at BEST 30%.
Keep in mind that hyperthreading is basically a way of reusing spare execution units on the CPU for a separate thread of execution. You're still working with the horsepower of two cores, it's just split four ways.
If your code is optimized such that it fully utilizes most of the available EUs, there's no spare resources left once it's running on both physical cores, so the hyperthreaded cores can't do any better.
This old article from when HyperThreading (HT) was first introduced provides a lot of details on how it works (though I'm sure many improvements have been made over the last 10 years). http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf:
Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total.
However, the following sentence shows where HT can bottleneck:
Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses.
If the threads are each keeping one or more of those shared resources (such as the execution units or buses) 100% busy, then hyperthreading will not improve throughput. Since benchmarks often exercise one aspect of a system (intentionally or not), it's not surprising that one of these shared processor resources ends up being a bottleneck and prevents HT from showing a benefit.
The performance gain when using multiple threads is very difficult to determine. Hyperthreading definitely gives you less than one extra core's worth of performance.
Besides that, you may run into memory throughput issues, or your code may be contending over locks or the like now that there are more threads. Even if your own code is lock-free, that doesn't mean that, for example, I/O or the functions you call are completely able to run in parallel; there are sometimes "hidden" shared resources.
But most likely, your processor just can't go any faster.

Measuring CPU clocks consumed by a process

I have written a program in C. It is a program created as a result of research. I want to compute the exact number of CPU cycles the program consumes. The exact number of cycles.
Any idea how can I find that?
The valgrind tool cachegrind (valgrind --tool=cachegrind) will give you a detailed output including the number of instructions executed, cache misses and branch prediction misses. These can be accounted down to individual lines of assembler, so in principle (with knowledge of your exact architecture) you could derive precise cycle counts from this output.
Know that it will change from execution to execution, due to cache effects.
The documentation for the cachegrind tool is here.
No, you can't. The concept of a 'CPU cycle' is not well defined. Modern chips can run at multiple clock rates, and different parts of them can be doing different things at different times.
The question of 'how many total pipeline steps' might in some cases be meaningful, but there is not likely to be a way to get it.
Try OProfile. It uses various hardware counters on the CPU to measure the number of instructions executed and how many cycles have passed. You can see an example of its use in the article, Memory part 7: Memory performance tools.
I am not entirely sure that I know exactly what you're trying to do, but what can be done on modern x86 processors is to read the time stamp counter (TSC) before and after the block of code you're interested in. On the assembly level, this is done using the RDTSC instruction, which gives you the value of the TSC in the edx:eax register pair.
Note however that there are certain caveats to this approach, e.g. if your process starts out on CPU0 and ends up on CPU1, the result you get from RDTSC will refer to the specific processor core that executed the instruction and hence may not be comparable. (There's also the lack of instruction serialisation with RDTSC, but in this context here, I don't think that's so much of an issue.)
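
A minimal sketch of that approach, using the __rdtsc() compiler intrinsic (GCC/Clang, x86 only) rather than hand-written assembly; the loop being measured is just a placeholder for the code of interest, and pinning the process to one core (for example with taskset) sidesteps the cross-core comparability issue mentioned above:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

int main(void)
{
    /* Illustrative workload; replace with the block of code of interest. */
    volatile double x = 0.0;

    uint64_t start = __rdtsc();
    for (int i = 0; i < 1000000; i++)
        x += (double)i;
    uint64_t end = __rdtsc();

    /* Counts TSC ticks, not instructions; results vary from run to run
     * with frequency scaling, interrupts and core migration. */
    printf("elapsed TSC ticks: %llu\n", (unsigned long long)(end - start));
    return 0;
}
```
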
Sorry, but no, at least not for most practical purposes; it's simply not possible with most normal OSes. Just for example, quite a few OSes don't do a full context switch to handle an interrupt, so the time spent servicing an interrupt can, and often will, appear to be time spent in whatever process was executing when the interrupt occurred.
The "not for practical purposes" would indicate the possibility of running your program under a cycle accurate simulator. These are available, but mostly for CPUs used primarily in real-time embedded systems, NOT for anything like a full-blown PC. Worse, they (generally) aren't for running anything like a full-blown OS, but for code that runs on the "bare metal."
In theory, you might be able to do something with a virtual machine running something like Windows or Linux -- but I don't know of any existing virtual machine that attempts to, and it would be decidedly non-trivial and probably have pretty serious consequences in performance as well (to put it mildly).

Resources