Is there any advantage in doing bitwise operations on word boundaries? Any CPU or memory optimization in doing so?
Actual problem:
I am trying to XOR two structures. Let's say structure-1 and structure-2, both the same size, 10000 bytes. I leave the first few hundred bytes as they are and then start XORing the two.
Let's say I start at byte 302. The loop takes 4 bytes at a time and XORs them: bytes 302, 303, 304 and 305 of both structures get XORed, and the cycle repeats until byte 10000.
Now, if I start from 304 instead, is any performance improvement expected?
Yes, there are at least two advantages for using proper alignment:
Portability. Not all processors support unaligned accesses. For maximum portability, you should use only fully aligned numbers (i.e. an N-byte integer starting at an address that is a multiple of N).
Speed. AFAIK, even a processor that supports unaligned accesses is still faster with aligned ones.
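For example, here is a minimal sketch of the 4-bytes-at-a-time XOR from the question, assuming the starting offset and length are multiples of 4 so every 32-bit access is naturally aligned (the function and parameter names are made up):

#include <stdint.h>
#include <stddef.h>

/* XOR src into dst, one 32-bit word at a time. offset and len are assumed
 * to be multiples of 4; a production version would use memcpy instead of
 * the pointer casts to stay within C's strict-aliasing rules. */
void xor_words(uint8_t *dst, const uint8_t *src, size_t offset, size_t len)
{
    uint32_t *d = (uint32_t *)(dst + offset);
    const uint32_t *s = (const uint32_t *)(src + offset);
    for (size_t i = 0; i < len / 4; i++)
        d[i] ^= s[i];
}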
Premature optimization is the root of all evil
Just do it the straightforward way, then optimize it if your profiler tells you it's important.
Yes, you will go faster if you're properly aligned. You'll go even faster if you use the SSE2 vector XOR instructions, where properly aligned you'll do it 16 bytes at a time and not pollute the cache. And it's highly unlikely that optimizing this is where you should be spending your time.
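A hedged sketch of that SSE2 idea, assuming both buffers are 16-byte aligned and the length is a multiple of 16 (the function and parameter names are made up; a non-temporal store such as _mm_stream_si128 could replace the store to bypass the cache entirely):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

void xor_blocks_sse2(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i += 16) {
        __m128i a = _mm_load_si128((const __m128i *)(dst + i)); /* aligned 16-byte load */
        __m128i b = _mm_load_si128((const __m128i *)(src + i));
        _mm_store_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
}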
Some processors only allow 4-byte operations on 32-bit word boundaries (some allow them only on halfword boundaries).
On these processors non-aligned access causes a processor exception which - depending on CPU, OS and settings - will cause a process crash or just a lot of work for the OS.
On other processors (e.g. x86) you will just get the performance hit of having to do two reads and writes (plus a bit of shifting) per operation.
See link text for the problems this causes on ARM CPUs.
Suppose that a CPU reads a word that truncates an integer.
I've read that if structure padding is not enabled, the CPU has to do two reads: it reads the first half, then reads the second half separately, then reassembles them to do the computation.
How does a CPU notice that an integer (for example) has been truncated?
This depends on the CPU, and on what instructions your compiler will generate. Some CPUs will happily perform unaligned loads (basically, they read the two halves and recombine them for you). Some will silently return corrupted data, and some will generate an exception and cause your program to crash immediately. Sometimes a CPU will have multiple instructions that can load and store data; some allow unaligned access and some don't.
The best way to find out what is happening on your CPU is to test it out. Or, look at the assembly generated by your compiler, and look up those assembly instructions in your CPU's manual to find out what it is going to do.
See this question for more information if you have an Intel or AMD CPU: What's the actual effect of successful unaligned accesses on x86?
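To make "test it out" concrete, here is a hedged sketch; buf, safe and risky are made-up names, and the direct cast is deliberately the questionable construct under test:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>

int main(void)
{
    unsigned char buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* Portable unaligned read: memcpy lets the compiler pick safe instructions. */
    uint32_t safe;
    memcpy(&safe, buf + 1, sizeof safe);

    /* Direct unaligned access: undefined behavior in C. On some CPUs it works,
     * on others it traps or returns corrupted data, as described above. */
    uint32_t risky = *(uint32_t *)(buf + 1);

    printf("%08" PRIx32 " %08" PRIx32 "\n", safe, risky);
    return 0;
}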
I am working on embedded C firmware for Freescale Coldfire processors. After writing some code, I began to look at ways to reduce the size of the build. We are limited on space, so this is important for me to consider.
I realized I had several int32's in my code, but I only need int16's for them. To save space I tried replacing the relevant variables with int16's. When I built it, the size of the build went up by about 60 bytes.
I thought it might be how my structs were packed, so I defined how I wanted it packed, but it made no change.
#pragma pack(push, 1)
// Struct here
#pragma pack(pop)
I could kind of see it staying the same, but I can't figure out what would cause it to go up. Any thoughts here? What might be causing this?
Edit:
Yes, looks like it was simply generating extra instructions to account for 32-bit being the optimized size for the processor. I should have checked the datasheet first.
This was the extra assembly being generated:
0x00000028 0x3210 move.w (a0),d1
0x0000002A 0x48C1 ext.l d1
; Other instructions between
0x0000002E 0x3028000E move.w 14(a0),d0
0x00000032 0x48C0 ext.l d0
Your compiler probably emits code for int32's just fine; that's probably the natural int size for your architecture (is this true? Is sizeof(int) == 4?).
I am guessing that the size increase is from three places:
Making alignment work
32-bit ints probably align naturally on the stack and in other places, so normally no extra code is needed to keep the stack 4-byte aligned. If you sprinkle a bunch of 16-bit ints through your code, the compiler may have to add padding (extra adds to the frame pointer as a fix-up?). Usually one add instruction covers the frame/stack maintenance, but extra instructions may be emitted to guarantee alignment.
Translating between 16-bit and 32-bit ints
With 32-bit ints, most instructions naturally work. With smaller ints, the compiler sometimes has to emit code that chops up bits so that the semantics of the smaller type are preserved (perhaps an extra AND instruction to mask off some high-order bits, or an OR instruction to set some bits).
Going back and forth to memory
Standard loads and stores are for 32-bit ints (probably the natural size of your machine). It's possible that when it has to store only 2 bytes instead of 4, the architecture needs extra instructions to handle the non-standard int (either chopping up the int, using an unusual instruction with a longer encoding, or using bit instructions to extract the halfword).
These are all guesses. The best way to see is to look at the assembly code and see what's going on!
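As a sketch of the second and third points (the function names here are invented): on a 32-bit core, arithmetic happens in 32-bit registers, so a 16-bit result forces the compiler to add truncation/extension work, much like the move.w/ext.l pairs in the listing above.

#include <stdint.h>

/* 32-bit add: typically a single instruction. */
uint32_t add32(uint32_t a, uint32_t b) { return a + b; }

/* 16-bit add: the sum is computed in a 32-bit register, then must be
 * truncated/extended back to 16 bits, which can cost extra instructions. */
uint16_t add16(uint16_t a, uint16_t b) { return (uint16_t)(a + b); }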
To save space I tried replacing the relevant variables with int16's.
When I built it, the size of the build went up by about 60 bytes.
This doesn't make much sense to me at first glance. Using a smaller data type doesn't necessarily translate to fewer instructions. It can reduce memory use when the software runs, but not necessarily the size of the build.
The reason using smaller data types here can increase the size of your binaries is that smaller types sometimes require more instructions, or longer instructions. For example, for memory that isn't word/dword-aligned, the compiler may have to use more instructions for unaligned moves. It may have to use special instructions to extract the lower/upper words when all the general-purpose registers are larger. In that case, you might also get a slight performance hit on top of the increased binary size when using those smaller types (but lower memory use while the code is running).
There may be a number of scenarios and it's specific to both the exact compiler you are using and architecture (the assembly code will reveal the exact cause), but in short, using smaller types for variables does not necessarily mean smaller-sized builds/fewer instructions, and could easily mean the opposite.
I'm building a small bytecode VM that will run on a variety of platforms including exotic embedded and microcontroller environments.
Each opcode in my VM can be variable length (no more than 4 bytes, no less than 1 byte). When interpreting the opcodes, I want to create a tiny "cache" for the current opcode. However, because the VM will be used on many different platforms, this is hard to do.
So, here are a few examples of expected behavior:
On an 8-bit microcontroller with an 8-bit memory bus, I'd want to load only 1 byte, because loading any more would take multiple (slow) memory operations, and in theory the current opcode might need only 1 byte to execute
On an 8086 (16-bit), I'd want to load 2 bytes: loading only 1 byte would throw away useful data that would have to be read again later, but loading more than 2 bytes would take multiple operations
On a 32-bit ARM processor, I'd want to load 4 bytes, because otherwise we're either throwing away data that might have to be read again, or doing multiple operations
I would say this could be handled easily by just assuming that unsigned int is good enough, but on 8-bit AVR microcontrollers int is defined as 16 bits while the memory data bus is only 8 bits wide, so 2 memory load operations would be required.
Anyway, current ideas:
using uint_fast16_t seems to work as expected on most platforms (32 bits on ARM, 16 bits on 8086, 64 bits on x86-64). However, it clearly still leaves out AVR and other 8-bit microcontrollers.
I thought uint_fast8_t might work, but on most platforms it appears to be defined as unsigned char, which definitely isn't optimal
Also, there is another problem that must be solved as well: unaligned memory access. On x86 this probably isn't a problem (in theory each such access takes 2 memory operations, but it's probably handled away in hardware); however, on ARM I know that an unaligned 32-bit access can cost as much as 3 times a single aligned 32-bit load. If the address is unaligned, I want to do the aligned load and keep as much data as possible, but at all costs avoid another memory operation
Is there a way to somehow do this using magical preprocessor includes or some such, or does it just require manually defining the optimum cache size before compiling for the platform?
There is no automatic way to do this using the types or information provided by standard C (in headers such as <stdint.h> and so on).
Problems such as this are sometimes handled by executing and measuring sample code on the target platform and using the results to determine what code to use in practice. The samples might be executed during a build and then built into the final code or might be executed at the start of each program execution and then used for the duration of execution.
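One common (if unglamorous) compile-time approach is a table of known targets. A minimal sketch, assuming the usual compiler-specific predefined macros (__AVR__, __MSP430__, __LP64__, _WIN64 — none of these are standard C, and the typedef name is made up):

#include <stdint.h>

#if defined(__AVR__)                       /* 8-bit data bus */
typedef uint8_t  op_cache_t;
#elif defined(__MSP430__)                  /* 16-bit core */
typedef uint16_t op_cache_t;
#elif defined(__LP64__) || defined(_WIN64) /* 64-bit targets */
typedef uint64_t op_cache_t;
#else
typedef uint32_t op_cache_t;               /* reasonable 32-bit default */
#endif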
Why does ARM have only 16 registers? Is that the ideal number?
Does having more registers also increase processing time/power?
As the number of general-purpose registers becomes smaller, you need to start using the stack for variables. Using the stack requires more instructions, so code size increases. Using the stack also increases the number of memory accesses, which hurts both performance and power usage. The trade-off is that to represent more registers you need more bits in your instruction, and you need more room on the chip for the register file, which increases power requirements. You can see how differing register counts affect code size and the frequency of load/store instructions by compiling the same set of code with different numbers of registers. The results of that kind of exercise can be seen in table 1 of this paper:
Extendable Instruction Set Computing
Register Count    Program Size    Load/Store Frequency
      27              100.00                  27.90%
      16              101.62                  30.22%
       8              114.76                  44.45%
(They used 27 as a base because that is the number of GPRs available on a MIPS processor)
As you can see, there are only marginal increases in both program size and the number of loads/stores required as you drop the register count down to 16. The real penalties don't kick in until you drop down to 8 registers. I suspect the ARM designers felt that 16 registers was a kind of sweet spot when looking for the best performance per watt.
To choose one of 16 registers you need 4 bits, so it may be that this is simply the best fit for the opcode encoding (machine commands); otherwise you would have to introduce a more complex instruction set, which would lead to bigger code and so to additional costs (execution time).
Wikipedia says it has a "fixed instruction width of 32 bits to ease decoding and pipelining",
so it is a reasonable tradeoff.
32-bit ARM has 16 registers because it uses only 4 bits for encoding the register, not because 16 is the ideal number. Likewise, x86 has only 8 registers because historically 3 bits were used to encode the register so that some instructions would fit in a single byte.
That's such a limited number that both x86 and ARM doubled it when going to 64-bit, to 16 and 32 registers respectively. The old ARM instruction encoding had no spare bits left for the larger register numbers, so a trade-off had to be made: the ability to execute almost every instruction conditionally was dropped, and the 4-bit condition field was reused for the new features (that's an oversimplification; in reality the encoding is new, but you do need 3 more bits for the new registers).
Back in the 80's (IIRC) an academic paper was published that examined a number of different workloads, comparing expected performance benefits of different numbers of registers. This was at a time when RISC processors were transitioning from academic ideas to mainstream hardware, and it was important to decide what was optimal. CPUs were already pulling ahead of memory in speed, and RISC was making this worse by limiting addressing modes and having separate load and store instructions. Having more registers meant you could "cache" more data for immediate access and therefore access main memory less.
Considering only powers of two, it was found that 32 registers was optimal, although 16 wasn't terribly far behind.
ARM is unusual in that each instruction can carry a conditional execution code, avoiding tests and branches. Don't forget, many 32-register machines fix R0 to 0, so conditional tests are done by comparing against R0. I know this from experience. 20 years ago I had to program a 'Mode 7' (in SNES terminology) floor. The CPUs were the SH2 for the 32X (or rather two of them), the MIPS R3000 (PlayStation) and the 3DO's ARM; the inner loops were 19, 15 and 11 instructions respectively. If the 3DO had been running at the same speed as the other two, it would have been twice as fast. As it was, it was just a bit slower.
How do I determine the word size of my CPU? If I understand correctly, an int should be one word, right? I'm not sure whether that's true.
So would just printing sizeof(int) be enough to determine the word size of my processor?
Your assumption about sizeof(int) is untrue; see this.
Since you must know the processor, OS and compiler at compilation time, the word size can be inferred using predefined architecture/OS/compiler macros provided by the compiler.
However, while on simpler and most RISC processors the word size, bus width, register size and memory organisation are often consistently one value, this may not be true on more complex CISC and DSP architectures, with various sizes for floating-point registers, accumulators, bus width, cache width, general-purpose registers, etc.
Of course this raises the question of why you might need to know it. Generally you would use the type appropriate to the application, and trust the compiler to provide any optimisation. If optimisation is what you think you need this information for, then you would probably be better off using the C99 'fast' types. If you need to optimise a specific algorithm, implement it for a number of types and profile it.
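For instance, a tiny sketch using the C99 'fast' types, where the implementation picks the fastest type of at least the requested width for you:

#include <stdint.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
    uint_fast16_t x = 0;  /* at least 16 bits; may be wider if that is faster */
    printf("uint_fast16_t is %zu bits here\n", sizeof x * CHAR_BIT);
    return 0;
}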
an int should be one word right?
As I understand it, that depends on the data size model. For an explanation for UNIX systems, see 64-bit and Data Size Neutrality. For example, 32-bit Linux is ILP32, and 64-bit Linux is LP64. I am not sure about the differences across Windows systems and versions, other than that I believe all 32-bit Windows systems are ILP32.
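A quick way to check which model you are on is to print the three sizes yourself; the comment shows the conventional values (a minimal sketch):

#include <stdio.h>

int main(void)
{
    /* ILP32: int=4 long=4 ptr=4; LP64 (64-bit Linux/OS X): int=4 long=8 ptr=8;
     * LLP64 (64-bit Windows): int=4 long=4 ptr=8 */
    printf("int=%zu long=%zu pointer=%zu\n",
           sizeof(int), sizeof(long), sizeof(void *));
    return 0;
}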
How do I determine the word size of my CPU?
That depends. Which version of the C standard are you assuming? Which platforms are we talking about? Is this a compile-time or a run-time determination you're trying to make?
The C header file <limits.h> may define WORD_BIT and/or __WORDSIZE.
sizeof(int) is not always the "word" size of your CPU. The most important question here is why you want to know the word size... are you trying to do some kind of run-time, CPU-specific optimization?
That being said, on Windows with Intel processors, the nominal word size will be either 32 or 64 bits and you can easily figure this out:
if your program is compiled for 32 bits, then the nominal word size is 32 bits
if you have compiled a 64-bit program, then the nominal word size is 64 bits
This answer sounds trite, but it's true to first order. There are some important subtleties, though. Even though the x86 registers on a modern Intel or AMD processor are 64 bits wide, a 32-bit program can only (easily) use their 32-bit widths, even when running on a 64-bit operating system. This is true on Linux and OS X as well.
Moreover, on most modern CPUs the data bus width is wider than the standard ALU registers (EAX, EBX, ECX, etc.). This bus width can vary; some systems have 128-bit or even 192-bit wide buses.
If you are concerned about performance, then you also need to understand how the L1 and L2 data caches work. Note that some modern CPUs have an L3 cache. Caches also include a unit called the write buffer.
Make a program that does some kind of integer operation many times, like an integer version of the SAXPY algorithm. Run it for different word sizes, from 8 to 64 bits (i.e. from char to long long).
Measure the time each version spends running the algorithm. If one specific version takes noticeably less time than the others, the word size used for that version is probably the native word size of your computer. On the other hand, if several versions take more or less the same time, pick the one with the greater word size.
Note that even with this technique you can get false results: your benchmark, compiled using Turbo C and running on an 80386 processor under DOS, will report that the word size is 16 bits, just because the compiler doesn't use the 32-bit registers to perform integer arithmetic, but calls internal functions that do the 32-bit version of each arithmetic operation.
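A minimal sketch of that benchmark idea; the type under test is a macro so the same loop can be rebuilt at each width (WORD_T and N are made-up names, and clock() granularity is coarse, so N must be large):

#include <stdio.h>
#include <time.h>

#ifndef WORD_T
#define WORD_T long   /* rebuild with -DWORD_T=char, short, int, long long, ... */
#endif
#define N 100000000L

int main(void)
{
    volatile WORD_T acc = 0;  /* volatile keeps the loop from being optimized away */
    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        acc = (WORD_T)(acc + (WORD_T)i);  /* simple integer work at this width */
    printf("width %zu bytes: %.2f s\n", sizeof(WORD_T),
           (double)(clock() - t0) / CLOCKS_PER_SEC);
    return 0;
}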
"Additionally, the size of the C type long is equal to the word size, whereas the size of the int type is sometimes less than that of the word size. For example, the Alpha has a 64-bit word size. Consequently, registers, pointers, and the long type are 64 bits in length."
source: http://books.msspace.net/mirrorbooks/kerneldevelopment/0672327201/ch19lev1sec2.html
Keeping this in mind, the following program can be executed to find out the word size of the machine you're working on:

#include <stdio.h>

int main(void)
{
    long l;                            /* long is assumed to match the word size */
    short s = (short)(8 * sizeof(l));  /* size in bits, assuming 8-bit bytes */
    printf("Word size of this machine is %hi bits\n", s);
    return 0;
}
In short: There's no good way. The original idea behind the C data types was that int would be the fastest (native) integer type, long the biggest etc.
Then came operating systems that originated on one CPU and were then ported to different CPUs whose native word size was different. To maintain source code compatibility, some of the OSes broke with that definition and kept the data types at their old sizes, and added new, non-standard ones.
That said, depending on what you actually need, you might find some useful data types in stdint.h, or compiler-specific or platform-specific macros for various purposes.
To use at compile time: sizeof(void*)
Whatever your reason for wanting to know the size of the processor, it doesn't matter.
The size of the processor is the amount of data that the Arithmetic Logic Unit (ALU) of one CPU core can work on at a single point in time. A CPU core's ALU operates on the accumulator register, so the size of a CPU in bits is the size of the accumulator register in bits.
You can find the size of the accumulator in the processor's data sheet or by writing a small assembly-language program.
Note that the effective usable size of the accumulator register can change in some processors (like ARM) based on the mode of operation (Thumb and ARM modes). That means the size of the processor also changes with the mode on those processors.
It is common in many architectures for the virtual-address pointer size and the integer size to be the same as the accumulator size. This is only to take advantage of the accumulator register in different processor operations, but it is not a hard rule.
Many think of memory as an array of bytes, but the CPU has another view of it: memory granularity. Depending on the architecture, the granularity may be 2, 4, 8, 16 or even 32 bytes. Memory granularity and address alignment have a great impact on the performance, stability and correctness of software.
Consider a granularity of 4 bytes and an unaligned 4-byte read. In that case every such read (75% of them, if the address advances one byte at a time) takes two read instructions, plus two shift operations, plus a final bitwise instruction to assemble the result, which is a performance killer. Atomic operations can also be affected, as they must be indivisible. Other side effects involve caches, synchronization protocols, CPU internal bus traffic, the CPU write buffer, and more. A practical test could be run on a circular buffer to see how different the results can be.
CPUs from different manufacturers have, depending on the model, different registers that are used in general and specific operations. For example, modern CPUs have extensions with 128-bit registers. So the word size is not only about the type of operation but also about memory granularity. Word size and address alignment are beasts that must be taken care of. There are some CPUs on the market that do not take care of address alignment at all and simply ignore it if provided. And guess what happens?
As others have pointed out, what do you want to calculate this value for? There are a lot of variables.
sizeof(int) != sizeof(word). The sizes of byte, word, double word, etc. have never changed since their creation, for the sake of API compatibility, in the Windows API world at least, even though a processor's word size is the natural size an instruction can operate on. For example, in MSVC C/C++ and C#, sizeof(int) is four bytes, even in 64-bit compilation mode. MSVC C/C++ has __int64 and C# has the Int64/UInt64 (non-CLS-compliant) value types. There are also type definitions for WORD, DWORD and QWORD in the Win32 API that have never changed from two, four and eight bytes respectively, as well as UINT_PTR/INT_PTR on Win32 and UIntPtr/IntPtr in C#, which are guaranteed to be big enough to represent a memory address and a reference type respectively. AFAIK (and I could be wrong if such architectures still exist), near/far pointers no longer have to be dealt with, so if you're on C/C++/C#, sizeof(void*) and Unsafe.SizeOf<IntPtr>() should be enough to determine your maximum "word" size in a compliant, cross-platform way; if anyone can correct that, please do! Also, the sizes of intrinsic types in C/C++ are only vaguely defined.
C data type sizes - Wikipedia