Suppose that a cpu reads a word that truncates an integer.
I've read that if structure padding is not enabled the CPU would have to do two reads: it has to read in the first half, then read in the second half separately, then reassemble them together to do the computation.
How does a CPU notice that an integer (for example) has been truncated?
This depends on the CPU, and on what instructions your compiler will generate. Some CPUs will happily perform unaligned loads (basically, they read the two halves and recombine them for you). Some will silently return corrupted data, and some will generate an exception and cause your program to crash immediately. Sometimes a single CPU has several instructions that can load and store data, some of which allow unaligned access and some of which don't.
The best way to find out what is happening on your CPU is to test it out. Or, look at the assembly generated by your compiler, and look up those assembly instructions in your CPU's manual to find out what it is going to do.
See this question for more information if you have an Intel or AMD CPU: What's the actual effect of successful unaligned accesses on x86?
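If you want to experiment without depending on what your CPU does with a misaligned pointer, here is a minimal sketch in C. The helper names (`is_aligned`, `load_u32`, `load_u32_le`) are made up for illustration. `load_u32` uses `memcpy`, which is well-defined for any address and which compilers typically turn into whatever load sequence the target allows; `load_u32_le` spells out the "read the pieces and recombine them" idea by hand, assuming little-endian byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: is p a multiple of n bytes? */
static int is_aligned(const void *p, size_t n)
{
    assert(n != 0);
    return ((uintptr_t)p % n) == 0;
}

/* Load a 32-bit value from any address. memcpy has no alignment
 * requirement, so this is safe even where a direct pointer cast is not. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Manual recombination from individual bytes, assuming the value is
 * stored in little-endian order. */
static uint32_t load_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Looking at the assembly your compiler emits for `load_u32` on your target is one way to see which instructions it chooses for a possibly unaligned access.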
In ARM SVE there are masked load instructions (svld1), and there are also first-faulting loads, svldff1(svptrue<>).
Questions:
Does it make sense to use svld1 with a mask as opposed to svldff1?
The behaviour of the mask in svldff1 seems confusing. Is there a practical reason to pass svldff1 a mask other than svptrue?
Is there any performance difference between svld1 and svldff1?
Both ldff1 and ld1 can be used to load a vector register. According to my informal tests on an AWS Graviton processor, I find no performance difference, in the sense that both instructions (ldff1 and ld1) seem to have roughly the same performance characteristics. However, ldff1 will read and write the first-fault register (FFR). This implies that you cannot do more than one ldff1 at any one time within an 'FFR group', since they are order-sensitive and depend crucially on the FFR.
Furthermore, the ldff1 instruction is meant to be used along with the rdffr instruction, the instruction that generates a mask indicating which loads were successful. Using the rdffr instruction will obviously add some cost. I am assuming that the instruction in question might need to run after ldff1w, thus increasing the latency by at least a cycle. Of course, then you have to do something with the mask that rdffr produces...
Obviously, there is bound to be some small overhead tied to the FFR (clearing, setting, accessing).
"Is there a practical reason to provide a not just svptrue mask for svldff1": The documentation states that the leading inactive elements (up to the fault) are predicated to zero.
I'm building a small bytecode VM that will run on a variety of platforms including exotic embedded and microcontroller environments.
Each opcode in my VM can be variable length (no more than 4 bytes, no less than 1 byte). In interpreting the opcodes, I want to create a tiny "cache" for the current opcode. However, because the VM will run on many different platforms, this is hard to do.
So, here are a few examples of expected behavior:
On an 8-bit microcontroller with an 8-bit memory bus, I'd want it to load only 1 byte, because it'd take multiple (slow) memory operations to load any more, and in theory it might only require 1 byte to execute the current opcode.
On an 8086 (16-bit), I'd want to load 2 bytes, because loading only 1 byte would basically throw away useful data that has to be read again later, but I don't want to load more than 2 bytes because that would take multiple operations.
On a 32-bit ARM processor, I'd want to load 4 bytes, because otherwise we're either throwing away data that might have to be read again, or we're doing multiple operations.
I would say this could be handled easily by just assuming that unsigned int is good enough, but on 8-bit AVR microcontrollers, int is defined as 16-bit, but the memory data bus width is only 8 bit, so 2 memory load operations would be required.
Anyway, current ideas:
using uint_fast16_t seems to work as expected on most platforms (32 bits on ARM, 16 bits on 8086, 64 bits on x86-64). However, it clearly still leaves out AVR and other 8-bit microcontrollers.
I thought using uint_fast8_t might work, but it would appear on most platforms that it's defined as being unsigned char, which definitely isn't optimal
Also, there is another problem that must be solved: unaligned memory access. On x86, this probably isn't going to be a problem (in theory it does 2 memory operations, but it's probably cached away in hardware); on ARM, however, I know that an unaligned 32-bit access could possibly cost 3 times as much as a single aligned 32-bit load. If the address is unaligned, I want to load the aligned option and get as much data as possible, but at all costs avoid another memory operation.
Is there a way to somehow do this using magical preprocessor includes or some such, or does it just require manually defining the optimum cache size before compiling for the platform?
There is no automatic way to do this using the types or information provided by the standard C headers.
Problems such as this are sometimes handled by executing and measuring sample code on the target platform and using the results to determine what code to use in practice. The samples might be executed during a build and then built into the final code or might be executed at the start of each program execution and then used for the duration of execution.
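As a rough sketch of the "define it per platform" route, the fetch width could be a build-time knob that the build system (or a measurement step like the one described above) sets for each target. Everything named here is hypothetical: the `VM_FETCH_T` macro, the `vm_fetch_t` type, and the `vm_fetch` function are illustrations, not part of any existing VM. The default of `uint_fast16_t` is just the question's own starting point; an AVR build might pass `-DVM_FETCH_T=uint8_t`. The sketch also assumes the bytecode buffer itself starts at an address aligned to the fetch width, so that aligning the program counter index aligns the address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical build-time knob: override with -DVM_FETCH_T=... per target. */
#ifndef VM_FETCH_T
#define VM_FETCH_T uint_fast16_t
#endif
typedef VM_FETCH_T vm_fetch_t;

#define VM_FETCH_BYTES (sizeof(vm_fetch_t))

/* Copy the aligned chunk that contains code[pc] into cache.
 * The byte for pc ends up at cache[pc % VM_FETCH_BYTES].
 * Returns how many valid bytes are available starting at pc,
 * so the caller knows whether a second fetch is needed. */
static size_t vm_fetch(const uint8_t *code, size_t len, size_t pc,
                       uint8_t *cache)
{
    size_t base, n;

    assert(pc < len);
    base = pc - (pc % VM_FETCH_BYTES);   /* round pc down to the fetch width */
    n = len - base;                      /* bytes left in the buffer */
    if (n > VM_FETCH_BYTES) {
        n = VM_FETCH_BYTES;
    }
    memcpy(cache, code + base, n);
    return n - (pc - base);
}
```

The interpreter would then decode from `cache` and call `vm_fetch` again only when an opcode needs more bytes than the returned count.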
I was reading this book "ARM System Developers Guide" by Elsevier and I came across this:
The ARM instruction set differs from the pure RISC definition in several ways that make
the ARM instruction set suitable for embedded applications:
Variable cycle execution for certain instructions — Not every ARM instruction executes in a single cycle. For example, load-store-multiple instructions vary in the number of execution cycles depending upon the number of registers being transferred. The
transfer can occur on sequential memory addresses, which increases performance since
sequential memory accesses are often faster than random accesses. Code density is also
improved since multiple register transfers are common operations at the start and end
of functions.
Any other ARM instructions you guys can point out which take variable cycles to execute?
Cycle timings are microarchitecture-dependent, so you need to check the particular implementation's technical reference manual (TRM). For example, for the Cortex-A9, timing is described as being quite complicated:
The complexity of the Cortex-A9 processor makes it impossible to calculate precise timing information manually. The timing of an instruction is often affected by other concurrent instructions, memory system activity, and additional events outside the instruction flow.
However, the same document does give precise timings for data-processing, load and store, and multiplication instructions, plus some information about branch and serialization instructions.
For example, the same document shows that an AND instruction may take 1-2 cycles more when a shift is involved, depending on whether the shift amount is a constant embedded in the instruction or is read from a register.
Also, in addition to the book's note that load-store-multiple timings vary with the number of registers involved, they also vary depending on whether the address is aligned.
Is there a way using C or assembler or maybe even C# to get an accurate measure of how long it takes to execute an ADD instruction?
Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.
On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.
On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.
That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:
.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
ADD eax, ebx
; end of sequence under test
CPUID
RDTSC
Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a few details, of course -- at minimum you need to:
set the registers up correctly before each CPUID
save the value in EDX:EAX after the first RDTSC
subtract the result of the first RDTSC from the result of the second
Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.
Then execute the same loop again, this time with the code under test in the body. Time for this loop, minus the overhead (from the empty loop case) is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.
Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If it's a significant chunk of code, a few tens of thousands might suffice.
In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C, if you are conversant with inline assembly. Others have posted more elegant solutions for how to get a measurement without the repetition, but the repetition technique is always available, for example on an embedded processor that doesn't have the nice timing instructions mentioned by others.
Note, however, that on modern pipelined processors, instruction-level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of a given instruction take N times as long as a single one.
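The empty-loop-versus-loaded-loop method can be sketched in plain C as follows. This is only an illustration with made-up function names, using the standard `clock()` function as a coarse timer. The loop counter and accumulator are declared `volatile` so that the compiler does not delete the loops, which also means each iteration includes loads and stores and not just the ADD, so the result is an upper bound at best.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* CPU time used so far, in seconds (coarse, but standard C). */
static double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Time a loop with an empty body: this is the looping overhead. */
static double time_empty_loop(uint64_t iters)
{
    volatile uint64_t i;   /* volatile keeps the loop from being removed */
    double start = cpu_seconds();
    for (i = 0; i < iters; i++) {
    }
    return cpu_seconds() - start;
}

/* Time the same loop with the code under test (an add) in the body. */
static double time_add_loop(uint64_t iters)
{
    volatile uint64_t i;
    volatile uint32_t acc = 0;   /* volatile forces the add to happen */
    double start = cpu_seconds();
    for (i = 0; i < iters; i++) {
        acc = acc + 1u;
    }
    return cpu_seconds() - start;
}

/* (time with body - loop overhead) / number of iterations. */
static double per_iteration(double with_body, double overhead, uint64_t iters)
{
    assert(iters > 0);
    return (with_body - overhead) / (double)iters;
}
```

A caller would compute something like `per_iteration(time_add_loop(n), time_empty_loop(n), n)` with `n` set to 10 million or more, and should expect the result to vary from run to run.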
Okay, the problem you are going to encounter if you are using an OS like Windows, Linux, Unix, MacOS, AmigaOS and all those others is that there are lots of processes already running in the background on your machine, which will impact performance. The only real way of calculating the actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends on whether you absolutely want to do this yourself, or simply want to find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and these results are available to the public. All you need to do is ask them and they'll send you a free CD-ROM (it might be a DVD - nonsense pedantry) with the results on it. You can do it yourself, but be warned that Intel processors especially contain many redundant instructions that are no longer desirable, let alone necessary. This will take up a lot of your time, but I can absolutely see the fun in doing it. PS. If it's purely to help push your own machine's hardware to its theoretical maximum in a personal project, then Just Jeff's answer above is excellent for generating tidy instruction-speed averages under real-world conditions.
No, but you can calculate it from the number of clock cycles the ADD instruction requires multiplied by the clock period of the CPU (that is, divided by its clock rate). Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.
That said, why do you care?
Is there any advantage in doing bitwise operations on word boundaries? Any CPU or memory optimization in doing so?
Actual problem:
I am trying to compute the XOR of two structures. Let's say structure-1 and structure-2 are both the same size, 10000 bytes. I leave the first few hundred bytes as they are and then start XORing 1 and 2.
Let's say I start at byte 302. This takes 4 bytes at a time and XORs them: bytes 302, 303, 304 and 305 of both structures are XORed. This cycle is repeated up to byte 10000.
Now, if I start from 304 instead, is any performance improvement expected?
Yes, there are at least two advantages for using proper alignment:
Portability. Not all processors support non-aligned numbers. For maximum portability, you should only use fully aligned numbers (i.e. an N-byte integer starts at an address that is a multiple of N).
Speed. AFAIK, even a processor that supports non-aligned numbers is still faster with aligned numbers.
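One common way to get aligned word operations even when the starting offset is arbitrary (such as 302) is to handle the first few bytes one at a time until the destination address reaches a word boundary, then proceed a word at a time, then finish any leftover bytes. The following is a minimal sketch; the function name `xor_from` is made up, and it assumes 4-byte words. `memcpy` is used for the word accesses so that the code stays well-defined even if the source buffer is not aligned the same way as the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR src into dst for byte offsets [start, len). */
static void xor_from(uint8_t *dst, const uint8_t *src, size_t start, size_t len)
{
    size_t i = start;

    assert(start <= len);

    /* Head: byte by byte until dst + i sits on a 4-byte boundary. */
    while (i < len && ((uintptr_t)(dst + i) % sizeof(uint32_t)) != 0) {
        dst[i] ^= src[i];
        i++;
    }

    /* Body: one 32-bit word at a time, on aligned destination addresses. */
    for (; i + sizeof(uint32_t) <= len; i += sizeof(uint32_t)) {
        uint32_t a, b;
        memcpy(&a, dst + i, sizeof a);
        memcpy(&b, src + i, sizeof b);
        a ^= b;
        memcpy(dst + i, &a, sizeof a);
    }

    /* Tail: whatever is left over, byte by byte. */
    for (; i < len; i++) {
        dst[i] ^= src[i];
    }
}
```

For the case in the question this would be called as `xor_from(s1, s2, 302, 10000)`, where `s1` and `s2` point at the two structures viewed as bytes. Whether it is measurably faster than a plain byte loop depends on the compiler and CPU, so it is worth profiling first.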
Premature optimization is the root of all evil
Just do it the straightforward way, then optimize it if your profiler tells you it's important.
Yes, you will go faster if you're properly aligned. You'll go even faster if you use the SSE2 vector XOR instructions, where properly aligned you'll do it 16 bytes at a time and not pollute the cache. And it's highly unlikely that optimizing this is where you should be spending your time.
Some processors only allow 4-byte operations on 32-bit word boundaries (some allow them only on halfword boundaries).
On these processors non-aligned access causes a processor exception which - depending on CPU, OS and settings - will cause a process crash or just a lot of work for the OS.
On other processors (e.g. x86) you will just get the performance hit of having to do two reads and writes (plus a bit of shifting) per operation.
See link text to see problems with ARM CPUs