I was reading the book "ARM System Developer's Guide" (published by Elsevier) and I came across this:
The ARM instruction set differs from the pure RISC definition in several ways that make
the ARM instruction set suitable for embedded applications:
Variable cycle execution for certain instructions — Not every ARM instruction executes in a single cycle. For example, load-store-multiple instructions vary in the number of execution cycles depending upon the number of registers being transferred. The
transfer can occur on sequential memory addresses, which increases performance since
sequential memory accesses are often faster than random accesses. Code density is also
improved since multiple register transfers are common operations at the start and end
of functions.
Can anyone point out other ARM instructions that take a variable number of cycles to execute?
Cycle timings are microarchitecture dependent, so you need to check the particular implementation's Technical Reference Manual (TRM). For the Cortex-A9, for example, timing is described as being quite complicated:
The complexity of the Cortex-A9 processor makes it impossible to calculate precise timing information manually. The timing of an instruction is often affected by other concurrent instructions, memory system activity, and additional events outside the instruction flow.
However, the same document gives precise timings for data-processing, load and store, and multiplication instructions, plus some information about branch and serialization instructions.
For example, the same document shows that when shifting is involved, an AND instruction may take 1-2 extra cycles depending on the source of the shift amount, which might be a constant embedded in the instruction or a value read from a register.
Also, in addition to the book's note that load/store-multiple instructions vary with the number of registers involved, their timing also varies depending on whether the address is aligned.
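As a hedged illustration of the shift cost mentioned above (assuming an ARM32 target built with GCC; the function names are made up), the two forms of the same AND below differ only in where the shift amount comes from, and on many cores the register-shifted form costs an extra cycle or two:

/* Shift amount is an immediate encoded in the instruction: the cheaper form. */
unsigned and_shift_imm(unsigned a, unsigned b)
{
    unsigned r;
    __asm__("and %0, %1, %2, lsl #3" : "=r"(r) : "r"(a), "r"(b));
    return r;
}

/* Shift amount comes from a register: may take 1-2 extra cycles on many cores. */
unsigned and_shift_reg(unsigned a, unsigned b, unsigned s)
{
    unsigned r;
    __asm__("and %0, %1, %2, lsl %3" : "=r"(r) : "r"(a), "r"(b), "r"(s));
    return r;
}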
Going over the Eigen documentation, it's not clear whether it has been updated since the release of the A76 CPU core to take advantage of the wider SIMD it contains (2x128b vs. the previous 128b).
I am hoping someone from the development team (or an expert user) can help clarify that.
I'm not familiar with Eigen in particular, but in general one doesn't need to do much to SIMD code to take advantage of a different number of hardware execution units - especially when the CPU supports out-of-order execution, it will pick up more instructions that can be executed in parallel when there are more execution units.
If compiling e.g. SIMD intrinsics with a compiler, the compiler may be able to tune the exact scheduling of code if told to optimize specifically for that core (and if the compiler knows the scheduling characteristics for the core). Same thing for handwritten assembly code - it can be tuned and tweaked a bit for different cores' characteristics, but in most cases, it doesn't change very dramatically; more capable cores will execute it faster.
(The factor that primarily affects the bigger picture of how the code is written, which would require a proper rewrite to take advantage of, is usually the number of registers available in the instruction set - but that doesn't change with a hardware implementation with more execution units.)
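As a rough sketch of that point (not Eigen code; the function name and build flags are just assumptions), a plain NEON intrinsics loop like the one below needs no source changes to benefit from a second 128-bit pipe - an out-of-order core such as the A76 simply keeps more of these independent iterations in flight:

#include <arm_neon.h>

/* dst[i] = b[i] + a[i] * k for groups of four floats; tail elements omitted. */
void scale_add(float *dst, const float *a, const float *b, float k, int n)
{
    float32x4_t vk = vdupq_n_f32(k);
    for (int i = 0; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vmlaq_f32(vb, va, vk));
    }
}

/* Possible build command (assumption): gcc -O3 -mcpu=cortex-a76 scale_add.c
   lets the compiler schedule for that core; the same code still runs, just
   less specifically tuned, on other ARMv8 cores. */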
Given the huge impact on performance, I never wonder whether my current-day desktop CPU has branch prediction. Of course it does. But how about the various ARM offerings? Do iPhones or Android phones have branch prediction? The older Nintendo DS? How about the PowerPC-based Wii? The PS3?
Whether they have a complex prediction unit is not so important; what matters is whether they have at least some dynamic prediction, and whether they speculatively execute instructions along the expected branch path.
What is the cutoff for CPUs with branch prediction? A hand-held calculator from decades ago obviously doesn't have one, while my desktop does. But can anyone outline more clearly where one can expect dynamic branch prediction?
In case it is unclear, I am talking about the kind of prediction that adapts at runtime: the branch condition changes over time, and the predicted path changes with it.
Any CPU with a pipeline beyond a few stages requires at least some primitive branch prediction, otherwise it can stall waiting on computation results in order to decide which way to go. The Intel Atom is an in-order core, but with a fairly deep pipeline, and it therefore requires a pretty decent branch predictor.
Old ARM7 designs had only three pipeline stages. Combine that with things like branch delay slots (required on MIPS, optional on SPARC), and branch prediction isn't so useful.
Incidentally, when MIPS decided to get more performance by going beyond 4 pipeline stages, the branch delay slot became an annoyance. In the original design it was necessary, because there was no branch predictor: the instruction placed after the branch (in its delay slot) was always executed before the branch took effect. With the longer pipeline they needed a branch predictor anyway, which obviated the need for the branch delay slot, but they had to keep emulating it in order to run older code.
The problem with a branch delay slot is that it can only be filled with a useful instruction about 50% of the time. The rest of the time, you either fill it with an instruction whose result is likely to be thrown away, or you use a NO-OP.
Modern high end superscalar CPUs with long pipelines (which means almost all CPUs commonly found in desktops and servers) have quite sophisticated branch prediction these days.
Most ARM CPUs do not have branch prediction, which saves silicon and power consumption, but ARM CPUs generally have relatively short pipelines. Also the support for conditional execution of most instructions in the ARM ISA helps to reduce the number of branches required (and hence mitigates the cost of branch misprediction stalls).
Branch prediction is becoming more important and more heavily emphasized as ARM cores become more complex.
For example, the new 64-bit ARM architecture, ARMv8, drops most use of conditional execution (mainly due to instruction-encoding space restrictions with the increased number of registers) and relies on branch prediction to keep performance at acceptable levels.
Even on newer ARMv7-A devices you can see pathological cases, like the well-known unsorted-data question on SO, where the improvement from well-predicted branches is around 3x.
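For anyone who wants to reproduce that effect, here is a minimal sketch along the lines of that Stack Overflow question (array size and iteration counts are arbitrary); run it once with the data left unsorted and once after sorting it to see the branch predictor's impact:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    enum { N = 1 << 20 };
    static int data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;
    /* Optionally sort data here (e.g. with qsort) to make the branch predictable. */

    long long sum = 0;
    clock_t t0 = clock();
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)   /* ~50% taken and hard to predict when unsorted */
                sum += data[i];
    clock_t t1 = clock();

    printf("sum=%lld  time=%.3fs\n", sum, (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}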
Not so much for the ARM Cortex-A8 (though it does have some branch prediction), but I believe the Cortex-A9 is out-of-order super-scalar, with complex branch prediction.
You can expect a dynamic branch predictor in any out-of-order processor. Such processors don't just rely on pipelining: they fetch multiple instructions at a time and have multiple execution units (floating-point units, ALUs) and more registers to increase instruction throughput, so several instructions are in flight at any given moment. Branches are a problem if you want to keep all that machinery busy, so these processors rely on dynamic branch prediction to keep throughput and utilization very high.
You can expect any server, and any desktop, to have dynamic branch prediction. In the past, embedded systems like the ARM chips in current smartphones did not have branch prediction, since they had shorter pipelines and no out-of-order execution; but as Moore's law gives us more transistors per unit area, you will see more and more processors gaining these features. So, to answer your question, besides the obvious step of checking the CPU's specs, you can expect branch prediction on chips that are 32-bit or wider, have deeper pipelines, and execute out of order. The most recent chips from ARM are moving in this direction to some degree.
What does it mean to say that a function (e.g. modular multiplication, sine) is implemented in hardware as opposed to software?
Implemented in hardware means the electrical circuit (through logic gates and so on) can perform the operation directly.
For example, in the ALU the processor is physically able to add one byte to another.
Operations implemented in software are usually very complex combinations of the basic functions implemented in hardware.
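A small sketch of that idea: if the hardware only provides addition and bit shifts, multiplication can be built in software out of those basic operations (classic shift-and-add; the function name is made up):

unsigned soft_mul(unsigned a, unsigned b)
{
    unsigned result = 0;
    while (b != 0) {
        if (b & 1)        /* this bit of b contributes one copy of the current a */
            result += a;
        a <<= 1;          /* each successive bit of b is worth twice as much */
        b >>= 1;
    }
    return result;
}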
Typically, "software" is a list of instructions from a small set of precise formal instructions supported by the hardware in question. The hardware (the cpu) runs in an infinite loop executing your instruction stream stored in "memory".
When we talk about a software implementation of an algorithm, we mean that we achieve the final answer by having the CPU carry out some set of these instructions in the order put together by an outside programmer.
When we talk about a hardware implementation, we mean that the final answer is produced by intermediate steps that don't come from a formal (and comparatively inefficient) software instruction stream coded up by a programmer; instead, those intermediate steps are internal and not exposed to the outside world. Hardware implementations are thus likely to be faster because (a) they can be very particular to the algorithm being implemented, with no need to pass through well-defined states that the outside world will see, and (b) they don't have to sync up with the outside world.
Note that I am calling things like sine(x) "algorithms".
To be more specific about efficiency: software instructions, being part of a formal interface, have predefined start/stop points as they await the next clock cycle. These sync points are needed, to some extent, to allow other software instructions and other hardware to cleanly and unambiguously access these well-defined calculations. In contrast, a hardware implementation is more likely to have a larger share of its internal steps run asynchronously, meaning that they run to completion instead of stopping at many intermediate points to await a clock tick.
For example, most processors have an instruction that carries out an integer addition. The entire process of calculating the final bit values is likely done asynchronously; the stop/sync point occurs only after the result of the addition is available. In turn, a more complex algorithm than "add", done in software and containing many such additions, is necessarily carried out partly asynchronously (e.g., within each addition) but with many sync points (after each addition, jump, test, etc., the result is known). If that more complex algorithm were done entirely in hardware, it's possible it would run to completion from beginning to end entirely independent of the timing clock. No outside program instructions would be consulted during the hardware calculation of that algorithm.
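To make the contrast concrete, here is a rough sketch of a "software" sine: many separate hardware multiply and add instructions, each with its own sync point, composed by a programmer. A "hardware" sine (a dedicated FPU instruction or circuit) would perform the equivalent work internally without exposing any of these steps. The polynomial is a truncated Taylor series and is only illustrative:

double soft_sine(double x)
{
    /* sin(x) ~= x - x^3/3! + x^5/5! - x^7/7!, written in nested (Horner) form. */
    double x2 = x * x;
    return x * (1.0 - x2 / 6.0 * (1.0 - x2 / 20.0 * (1.0 - x2 / 42.0)));
}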
It means that the logic behind it is in the hardware (i.e., using AND/OR/XOR gates, etc.) rather than a software recreation of said hardware logic.
A hardware implementation typically means that a circuit was created to perform the operation in question. There is no need for a CPU or for software-level calculation. You can literally see the algorithm being performed in the wiring and architecture of the circuit itself.
Is there a way using C or assembler or maybe even C# to get an accurate measure of how long it takes to execute an ADD instruction?
Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.
On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.
On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.
That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:
.align 16
CPUID            ; executed three times first: early executions of CPUID can be slower
CPUID
CPUID
RDTSC            ; read the start timestamp into EDX:EAX (save it before the test sequence)
; sequence under test
add eax, ebx
; end of sequence under test
CPUID            ; serialize so the test sequence retires before the second read
RDTSC            ; read the end timestamp into EDX:EAX
Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a few details, of course -- at a minimum you need to:
set the registers up correctly before each CPUID
save the value in EAX:EDX after the first RDTSC
subtract the result of the first RDTSC from the result of the second
Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
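If you would rather stay in C, a rough equivalent using GCC/Clang builtins might look like the sketch below (this assumes an x86-64 target with <x86intrin.h> and <cpuid.h> available, and it still carries all the caveats above about how little the number really means):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */
#include <cpuid.h>       /* __get_cpuid */

static void serialize(void)
{
    unsigned a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);   /* CPUID acts as a serializing fence */
}

int main(void)
{
    volatile int x = 1, y = 2, z;

    serialize(); serialize(); serialize();   /* "warm up" CPUID as recommended */
    uint64_t start = __rdtsc();

    z = x + y;                               /* sequence under test */

    serialize();                             /* make sure the add has retired */
    uint64_t end = __rdtsc();

    printf("z=%d, ~%llu cycles including measurement overhead\n",
           z, (unsigned long long)(end - start));
    return 0;
}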
Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.
Then execute the same loop again, this time with the code under test in the body. The time for this loop, minus the overhead (from the empty-loop case), is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.
Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If it's a significant chunk of code, a few tens of thousands might suffice.
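A hedged sketch of that repeat-and-subtract approach in C (the iteration count and the statement under test are placeholders; the loop counters are volatile so the empty loop isn't optimized away):

#include <stdio.h>
#include <time.h>

#define ITERATIONS 10000000L

int main(void)
{
    volatile long acc = 0;

    /* Empty loop: measures just the looping overhead. */
    clock_t e0 = clock();
    for (volatile long i = 0; i < ITERATIONS; i++) { }
    clock_t e1 = clock();

    /* Same loop with the code under test in the body. */
    clock_t t0 = clock();
    for (volatile long i = 0; i < ITERATIONS; i++) { acc += 1; }
    clock_t t1 = clock();

    double overhead = (double)(e1 - e0) / CLOCKS_PER_SEC;
    double total    = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("per-iteration cost: %g seconds\n", (total - overhead) / ITERATIONS);
    return 0;
}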
In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C, if you are conversant with inline assembly. Others have posted more elegant solutions for how to get a measurement w/o the repetition, but the repetition technique is always available, for example, an embedded processor that doesn't have the nice timing instructions mentioned by others.
Note however that on modern pipelined processors, instruction-level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of a given instruction take N times as long as a single one.
Okay, the problem you are going to encounter if you are using an OS like Windows, Linux, Unix, MacOS, AmigaOS, or any of the others is that there are lots of processes already running on your machine in the background, and they will impact performance. The only real way of calculating the actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends whether you absolutely want to do this yourself, or simply want to find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and those results are available to the public: all you need to do is ask them and they'll send you a free CD-ROM (it might be a DVD) with the results. You can do it yourself, but be warned that Intel processors in particular contain many legacy instructions that are no longer desirable, let alone necessary. This will take up a lot of your time, but I can absolutely see the fun in doing it.
PS: If it's purely to help push your own machine's hardware to its theoretical maximum in a personal project, then Just Jeff's answer above is excellent for generating tidy instruction-speed averages under real-world conditions.
No, but you can calculate it from the number of clock cycles the ADD instruction requires divided by the clock rate of the CPU (for example, one cycle on a 2 GHz CPU takes 0.5 ns). Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.
That said, why do you care?
I'm writing some micro-benchmarking code for some very short operations in C. For example, one thing I'm measuring is how many cycles are needed to call an empty function depending on the number of arguments passed.
Currently, I'm timing using an RDTSC instruction before and after each operation to get the CPU's cycle count. However, I'm concerned that instructions issued before the first RDTSC may slow down the actual instructions I'm measuring. I'm also worried that the full operation may not be complete before the second RDTSC gets issued.
Does anyone know of an x86 instruction that forces all in-flight instructions to commit before any new instructions are issued? I've been told CPUID might do this, but I've been unable to find any documentation that says so.
To my knowledge, there is no instruction which specifically "drains" the pipeline, but this can easily be accomplished using a serializing instruction.
CPUID is a serializing instruction, which is exactly what you're looking for: every instruction issued before it is guaranteed to finish executing before the CPUID instruction itself executes.
So doing the following should get the desired effect:
cpuid      # serialize: everything issued before this point completes first
rdtsc      # read the start timestamp (EDX:EAX)
# stuff
cpuid      # serialize again so "stuff" completes before the second read
rdtsc      # read the end timestamp (EDX:EAX)
But, as an aside, I don't recommend that you do this. Your "stuff" can still be affected by a lot of other things outside of your control (such as CPU caches, other processes running on the system, etc.), and you'll never be able to eliminate them all. The best way to get accurate performance statistics is to perform the operation(s) you want to measure at least several million times and average the execution time over the batch.
Edit: Most instruction references for CPUID will mention its serializing properties, such as the NASM manual, Appendix B.
Edit 2: You might also want to take a look at this related question.