Performance comparison: 64 bit and 32 bit multiplication [closed] - c

I'm using an Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz and wondering why the multiplication of 64-bit numbers is slower than that of 32-bit numbers. I've done a test run in C and it turns out it needs twice as much time.
I expected it to need the same amount of time since the CPU works with native 64 bit registers and it shouldn't matter how wide the numbers are (as long as they fit into a 64 bit register).
Can someone explain this?

There are specialized instructions in the x86-64 instruction set to express that you only want to multiply two 32-bit quantities. In one dialect of x86-64 assembly, such an instruction may look like IMUL %EBX, %ECX, as opposed to the 64-bit multiplication IMUL %RBX, %RCX.
So the processor knows that you only want to multiply 32-bit quantities. This happens often enough that the designers of the processor made sure that the internal circuitry would be optimized to provide a faster answer in this easier case, just as it is easier for you to multiply 3-digit numbers than 6-digit numbers. The difference can be seen in the timings measured by Agner Fog and described in his comprehensive assembly optimization resources.
If your compiler is targeting the older 32-bit IA-32 instruction set, then the difference between 32-bit and 64-bit multiplication is even wider. The compiler has to implement 64-bit multiplication with only instructions for 32-bit multiplication, using four of them (three if computing only the 64 least significant bits of the result).
64-bit multiplication can be about three-four times slower than 32-bit multiplication in this case.
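As an illustrative sketch (not the exact code any particular compiler emits; the helper name mul64_from_32 is made up), here is how the low 64 bits of a 64x64 product can be assembled from 32-bit halves. On an IA-32 target each of the smaller multiplications maps to one 32-bit multiply instruction:

#include <stdint.h>

/* Low 64 bits of a * b, built only from 32-bit multiplies:
   one widening 32x32 -> 64 multiply plus two 32x32 -> 32 multiplies. */
uint64_t mul64_from_32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t low   = (uint64_t)a_lo * b_lo;      /* full 64-bit partial product */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;  /* only the low 32 bits matter here */

    return low + ((uint64_t)cross << 32);
}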

One contributing factor I can think of is the width of the result.
When multiplying two 32-bit numbers, the product is at most 64 bits. But when multiplying two 64-bit numbers, the product may be up to 128 bits, which is wider than a single register.
As a similar example on the 8086 microprocessor, if you multiply 16-bit numbers instead of 8-bit numbers, the CPU has to spread the product across the AX and DX registers (if you know the assembly-language abbreviations).
So I believe handling that wider result is part of what makes your 64-bit multiplication slower.

Related

32 bit operations on 8 bit architecture [closed]

I just want to ask: is it possible to do 32-bit operations on an 8-bit architecture, and if so, how?
I've thought about this for some time, and the best idea I have is to typedef a char[N] array to get N-byte types and then implement functions such as add(char *, char *).
Thanks in advance!
(I'm asking about the 6502 processor)
You have tagged your question as "C" so this answer takes this into consideration.
Most C compilers for 8-bit systems I know have long types. You can simply use these.
Having said this, how does it work?
All common 8-bit processors have a special 1-bit flag that receives the carry/borrow from 8-bit operations. And they have addition and subtraction instructions that take this flag into account. So a 32-bit add will be translated into this sequence:
; 1st operand in R0 to R3
; 2nd operand in R4 to R7
; addition works only with A(ccumulator)
; result goes into R0 to R3
MOV A,R0
ADD A,R4
MOV R0,A
MOV A,R1
ADDC A,R5
MOV R1,A
MOV A,R2
ADDC A,R6
MOV R2,A
MOV A,R3
ADDC A,R7
MOV R3,A
Think about how you do sums on paper. There is no need to add a carry to the rightmost, least-significant digit: since there is "nothing" to its right, no carry comes in. We can interpret each 8-bit step as a one-digit operation in a number system of base 256.
For bit operations there is no need for a carry or borrow.
Another thought: what do you call an 8-bit system? One where the instructions can only handle 8 bits in parallel, or one where the data bus is just 8 bits wide?
For the latter case we can look at, for example, the 68008 processor. Internally it is a 32-bit processor, but its data bus is only 8 bits wide. Here you would simply use the 32-bit instructions. If the processor reads or writes a 32-bit value from/to memory, it generates 4 consecutive access cycles automatically.
Many (all that I know of...) CPUs have a so-called "carry flag" (1 bit), which is set when addition or subtraction causes a wrap-around. It is basically an extra bit for calculations. They also have versions of addition and subtraction that include this carry flag. So you can do (for example) a 32-bit addition by doing four 8-bit additions with carry.
Pseudocode example, little endian machine (so byte 0 of 4 byte result is the least significant byte):
carry,result[0] = opA[0] + opB[0]
carry,result[1] = opA[1] + opB[1] + carry
carry,result[2] = opA[2] + opB[2] + carry
carry,result[3] = opA[3] + opB[3] + carry
if carry == 1, the 32-bit result overflowed
The first addition instruction might be called ADD (does not include the carry, just sets it), while the following additions might be called ADC (includes the carry and sets it). Some CPUs might have just the ADC instruction and require clearing the carry flag first.
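To make the pseudocode concrete, here is a minimal C sketch of the same idea: adding two 32-bit values stored as little-endian byte arrays, one base-256 digit at a time, with the carry propagated by hand (the name add32_bytewise is illustrative, not a real library API):

#include <stdint.h>

/* Sketch of the ADD/ADC sequence in portable C.  Returns the final carry. */
uint8_t add32_bytewise(const uint8_t a[4], const uint8_t b[4], uint8_t result[4])
{
    uint8_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t sum = (uint16_t)a[i] + b[i] + carry;  /* at most 0x1FF */
        result[i] = (uint8_t)sum;                      /* low 8 bits */
        carry     = (uint8_t)(sum >> 8);               /* 0 or 1 */
    }
    return carry;   /* 1 means the 32-bit result wrapped around */
}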
If you use the standard int / long types, the compiler will automatically do the right thing. long has at least 32 bits, so there is no need to work with carry bits manually; the compiler already takes care of that. If possible, use the standard uint32_t/int32_t types for readability and portability. Examine the disassembled code to see how the compiler deals with 32-bit arithmetic.
In general, the answer to "Can I do M-bit arithmetic on a processor which has only N bits?" is "Certainly yes!"
To see why: back in school, you probably learned your addition and multiplication tables for only up to 10+10 and 10×10. Yet you have no trouble adding, subtracting, or multiplying numbers which are any number of digits long.
And, simply stated, that's how a computer can operate on numbers bigger than its bit width. If you have two 32-bit numbers, and you can only add them 8 bits at a time, the situation is almost exactly like having two 4-digit numbers which you can only add one digit at a time. In school, you learned how to add individual pairs of digits, and process the carry -- and similarly, the computer simply adds pairs of 8-bit numbers, and processes the carry. Subtraction and multiplication follow the same sorts of rules you learned in school, too. (Division, as always, can be trickier, although the long division algorithm you learned in school is often a good start for doing long computer division, too.)
It helps to have a very clear understanding of number systems with bases other than 10. I said, "If you have two 32-bit numbers, and you can only add them 8 bits at a time, the situation is almost exactly like having two 4-digit numbers which you can only add one digit at a time." Now, when you take two 32-bit numbers and add them 8 bits at a time, it turns out that you're doing arithmetic in base 256. That sounds crazy, at first: most people have never heard of base 256, and it seems like working in a base that big might be impossibly difficult. But it's actually perfectly straightforward, when you think about it.
(Just for fun, I once wrote some code to do arithmetic on arbitrarily big numbers, and it works in base 2147483648. That sounds really crazy at first -- but it's just as reasonable, and in fact it's how most arbitrary-precision libraries work. Although actually the "real" libraries probably use base 4294967296, because they're cleverer than me about processing carries, and they don't want to waste even a single bit.)
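As a sketch of the multiplication case mentioned above, here is schoolbook long multiplication in base 256 on little-endian byte arrays (illustrative code, not taken from any particular bignum library):

#include <stdint.h>
#include <string.h>

/* Multiply two little-endian n-byte numbers into a 2*n-byte product,
   exactly the way you multiply multi-digit numbers on paper. */
void mul_base256(const uint8_t *a, const uint8_t *b, uint8_t *product, int n)
{
    memset(product, 0, 2 * n);
    for (int i = 0; i < n; i++) {
        uint16_t carry = 0;
        for (int j = 0; j < n; j++) {
            /* one "digit times digit" step plus what is already there */
            uint16_t t = (uint16_t)a[i] * b[j] + product[i + j] + carry;
            product[i + j] = (uint8_t)t;   /* low digit stays in place */
            carry          = t >> 8;       /* high digit carries over */
        }
        product[i + n] += (uint8_t)carry;
    }
}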

Can an 8-bit microcontroller use 32-bit integers? (and the other way around) [duplicate]

I'm really wondering about the relation between the numbers in the datasheet and how things are done at the programming level.
As far as I know, the "8 bit" of a uC refers to the resolution of the ADC: up to 256 values at which signals can be sampled; the higher you go, the higher the precision of the sampled signal. However...
Does this affect the code? (Is everything 32-bit in the code?)
Whenever I declare an int on a 32-bit uC, am I actually using an int32 or an int8?
Can an 8 bit microcontroller use a 32 bit integer?
The short answer is yes
When a microcontroller is said to be 8 bit, it means that the internal registers are 8 bit and that the arithmetic unit operates with 8 bit numbers. So in a single instruction, you can only do 8 bit math.
However, you can still do 32-bit math, but that will require a number of instructions. For instance, you need four 8-bit registers to hold a single 32-bit value. Further, you'll have to do the math operation using 8-bit operations (i.e. multiple instructions).
For an ADD of two 32-bit ints, you'll need four 8-bit add instructions, and besides that you'll need instructions to handle the carries from the individual add instructions.
So you can do it, but it will be slow, as a single 32-bit add may require 10-20 instructions (or more; see the comment from @YannVernier).
... (and the other way around)
AFAIK most 32-bit CPUs have instructions that allow for 8-bit math as a single instruction. So doing 8-bit or 32-bit math will be equally fast (in terms of instructions required).
Whenever I declare an int in a 32bit uC am i actually using an int32? or an int8?
With a 32 bit CPU, an int will normally be 32 bit so the answer is: int32
But from a C standard point of view, it would be okay to have a 16-bit int on a 32-bit machine. So even if 32 bits is common, you'll still have to check what the size is on your specific system to be really sure.
This seems to be quite a bit more than one question. First, for the title; yes, 8-bit and 32-bit microcontrollers can typically use integers of either width. Narrower processors will require more steps to handle larger widths, and therefore be slower. Wider processors may lack support for narrower types, causing them to require extra steps as well. Either way, a typical compiler will handle the difference between 8 and 32 bits.
Peripherals such as ADCs can have their own widths; it's not uncommon for them to be a width that doesn't fit precisely in bytes, such as 10 or 12 bits. Successive-approximation ADCs also frequently offer a faster mode where fewer bits hold valid data. In such cases, requesting the fast/narrow mode would require different code from running in slow/full-width mode.
If you declare an int in a C-compliant compiler, you'll never get an 8-bit variable, because C requires int to be at least 16 bits. Many compilers have options to diverge from the standard. On 32-bit computers it frequently is 32 bits, but on a microcontroller it may well be smaller to conserve memory even if the processor is 32-bit. There are width-specific types in inttypes.h if you want to be specific.
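A quick way to check what your particular toolchain gives you is to print the sizes directly; uint32_t is exactly 32 bits wherever it exists, while plain int varies between targets:

#include <stdio.h>
#include <stdint.h>

/* Print the widths the current toolchain actually uses; handy on
   microcontroller toolchains where plain int may be only 16 bits. */
int main(void)
{
    printf("int      : %u bits\n", (unsigned)(sizeof(int)      * 8));
    printf("long     : %u bits\n", (unsigned)(sizeof(long)     * 8));
    printf("uint32_t : %u bits\n", (unsigned)(sizeof(uint32_t) * 8));
    return 0;
}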

C: Using always widest type available would result in faster execution? [closed]

Assuming that
memory occupation is not relevant
there is no need to allocate very large arrays (the biggest array has 1,000,000 elements).
the platform is 64 bit
Using always the widest type available for each variable:
int64_t for signed integers
uint64_t for unsigned integers
double for floating point numbers
would result in faster execution than using smaller types when 64 bits are not needed?
Or would execute at the same speed (or slower) and occupy more memory?
I am a game programmer, and frequently see people advocating double or floats.
Basically what I learned is: using the size the hardware expects is "usually" faster; for example, if the hardware expects a 64-bit integer to do integer math, 64-bit is faster.
The "double" camp defenders tend to use that argument only.
The problem is that for some software you might hit other hurdles, like running out of cache, running out of pipeline resources (the hardware may be able to do out-of-order or parallel execution if the data is smaller than its registers, for example), explicit parallel operations (MMX, SSE, etc...), the amount of data that must be moved around in context switches, core switches and whatnot.
So the best "solution" I found is to write some test code, run it on the target hardware, and analyze the results.
It depends. Memory occupation affects performance, so it could be the case that the 64-bit values compute faster than the 32-bit values, but because you can fit more 32-bit data on a cache line, the 32-bit data types may go faster anyway. The only way to know is to test the particular algorithm on the particular hardware you are running it on. Complicating things further are SIMD data types, which are even wider, and generally even faster as you can do 4, 8, or 16 operations at a time, assuming it is possible to efficiently implement your algorithm using them.
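As a sketch of the "just measure it" advice above, a minimal (and deliberately naive) timing comparison could look like this. The loop body and iteration count are arbitrary, and the compiler may still transform the loops, so inspect the generated assembly before trusting the numbers:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000u

static uint64_t work64(void)
{
    uint64_t acc = 1;
    for (uint32_t i = 1; i <= ITERATIONS; i++)
        acc = acc * 3 + i;              /* 64-bit multiply and add */
    return acc;
}

static uint32_t work32(void)
{
    uint32_t acc = 1;
    for (uint32_t i = 1; i <= ITERATIONS; i++)
        acc = acc * 3 + i;              /* 32-bit multiply and add */
    return acc;
}

int main(void)
{
    clock_t t0 = clock();
    volatile uint64_t r64 = work64();   /* volatile keeps the result "used" */
    clock_t t1 = clock();
    volatile uint32_t r32 = work32();
    clock_t t2 = clock();

    printf("64-bit loop: %.3f s, 32-bit loop: %.3f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    (void)r64; (void)r32;
    return 0;
}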
One example not yet mentioned: on x86-64, there is a full range of instructions for 64-bit operands: load, store, arithmetic, etc. However, in many cases, these instructions are longer.
For instance, add ebx, eax, which does a 32-bit add, is 2 bytes. To do a 64-bit add, add rbx, rax is 3 bytes.
So even if they execute at the same speed, the 64-bit instructions will occupy more memory. In particular, less of your code will fit in cache at any one time, potentially leading to more cache misses and slower speed overall. So this is another reason not to explicitly demand 64-bit integers if you don't need them.
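To make the size difference concrete, here are the bytes of one valid encoding of each instruction, written out as C arrays purely for illustration (easy to verify with an assembler and objdump):

#include <stdio.h>

/* One valid encoding of each form; the 64-bit version carries an
   extra REX.W prefix byte (0x48) in front of the same opcode. */
const unsigned char add32_bytes[] = { 0x01, 0xC3 };       /* add ebx, eax : 2 bytes */
const unsigned char add64_bytes[] = { 0x48, 0x01, 0xC3 }; /* add rbx, rax : 3 bytes */

int main(void)
{
    printf("32-bit add: %zu bytes, 64-bit add: %zu bytes\n",
           sizeof add32_bytes, sizeof add64_bytes);
    return 0;
}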

Is __int128_t arithmetic emulated by GCC, even with SSE?

I've heard that the 128-bit integer data-types like __int128_t provided by GCC are emulated and therefore slow. However, I understand that the various SSE instruction sets (SSE, SSE2, ..., AVX) introduced at least some instructions for 128-bit registers. I don't know very much about SSE or assembly / machine code, so I was wondering if someone could explain to me whether arithmetic with __int128_t is emulated or not using modern versions of GCC.
The reason I'm asking this is because I'm wondering if it makes sense to expect big differences in __int128_t performance between different versions of GCC, depending on what SSE instructions are taken advantage of.
So, what parts of __int128_t arithmetic are emulated by GCC, and what parts are implemented with SSE instructions (if any)?
I was confusing two different things in my question.
Firstly, as PaulR explained in the comments: "There are no 128 bit arithmetic operations in SSE or AVX (apart from bitwise operations)". Considering this, 128-bit arithmetic has to be emulated on modern x86-64 based processors (e.g. AMD Family 10 or Intel Core architecture). This has nothing to do with GCC.
The second part of the question is whether or not 128-bit arithmetic emulation in GCC benefits from SSE/AVX instructions or registers. As implied in PaulR's comments, there isn't much in SSE/AVX that's going to allow you to do 128-bit arithmetic more easily; most likely x86-64 instructions will be used for this. The code I'm interested in can't compile with -mno-sse, but it compiles fine with -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -mno-avx -mno-avx2 and performance isn't affected. So my code doesn't benefit from modern SSE instructions.
SSE2 through AVX instructions are available for 8-, 16-, 32-, and 64-bit integer data types. They are mostly intended to operate on packed data; for example, a 128-bit register may contain four 32-bit integers, and so on.
Although SSE/AVX/AVX-512/etc. have no 128-bit integer mode (their vector elements are strictly 64-bit max, and operations simply wrap around), as Paul R has implied, the main CPU does support limited 128-bit operations by using a pair of registers.
When multiplying two regular 64-bit numbers, MUL/IMUL can output its 128-bit result in the RDX:RAX register pair.
Conversely, when dividing, DIV/IDIV can take its input from the RDX:RAX pair to divide a 128-bit number by a 64-bit divisor (and outputs a 64-bit quotient plus a 64-bit remainder).
Of course the CPU's ALU is 64 bits wide, so, as the Intel docs imply, the extra upper 64 bits come at the cost of extra micro-ops in the microcode. This is more dramatic for division (more than 3x), which already requires lots of micro-ops to be processed.
Still, that means that under some circumstances (like using a rule of three to scale a value), it's possible for a compiler to emit regular CPU instructions and not have to do any 128-bit emulation by itself.
This has been available for a long time:
since the 80386, 32-bit CPUs could do 64-bit multiplication/division using the EDX:EAX pair
since the 8086/88, 16-bit CPUs could do 32-bit multiplication/division using the DX:AX pair
(As for addition and subtraction: thanks to the support for carry, it's completely trivial to add or subtract numbers of any arbitrary length that fits in your storage.)
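A small sketch of what this looks like from C with GCC/Clang's unsigned __int128 (the names mulhi64 and scale64 are illustrative; check gcc -O2 -S on your own target): the 64x64 to 128-bit multiply compiles to a single MUL with the result in RDX:RAX, while the 128-by-64 division usually still goes through a runtime helper such as __udivti3, because the hardware DIV would fault if the quotient did not fit in 64 bits.

#include <stdint.h>
#include <stdio.h>

/* High 64 bits of a 64x64 product: one MUL, full result in RDX:RAX. */
uint64_t mulhi64(uint64_t a, uint64_t b)
{
    unsigned __int128 product = (unsigned __int128)a * b;
    return (uint64_t)(product >> 64);
}

/* "Rule of three" scaling without intermediate overflow: value * num / den.
   The multiply is a single instruction; the division is typically a helper call. */
uint64_t scale64(uint64_t value, uint64_t num, uint64_t den)
{
    return (uint64_t)(((unsigned __int128)value * num) / den);
}

int main(void)
{
    printf("%llu %llu\n",
           (unsigned long long)mulhi64(0xFFFFFFFFFFFFFFFFull, 3),
           (unsigned long long)scale64(1000000000000ull, 355, 113));
    return 0;
}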

Why does ARM have 16 registers?

Why does ARM have only 16 registers? Is that the ideal number?
Does a larger register file also increase the processing time/power?
As the number of the general-purpose registers becomes smaller, you need to start using the stack for variables. Using the stack requires more instructions, so code size increases. Using the stack also increases the number of memory accesses, which hurts both performance and power usage. The trade off is that to represent more registers you need more bits in your instruction, and you need more room on the chip for the register file, which increases power requirements. You can see how differing register counts affects code size and the frequency of load/store instructions by compiling the same set of code with different numbers of registers. The result of that type of exercise can be seen in table 1 of this paper:
Extendable Instruction Set Computing
Register count    Program size    Load/store frequency
      27              100.00            27.90%
      16              101.62            30.22%
       8              114.76            44.45%
(They used 27 as a base because that is the number of GPRs available on a MIPS processor)
As you can see, there are only marginal increases in both program size and the number of loads/stores required as you drop the register count down to 16. The real penalties don't kick in until you drop down to 8 registers. I suspect the ARM designers felt that 16 registers was a kind of sweet spot when they were looking for the best performance per watt.
To choose one of 16 registers you need 4 bits, so it may be that this is simply the best match for the opcodes (machine instructions); otherwise you would have to introduce a more complex instruction set, which would lead to bigger code, which implies additional cost (execution time).
Wikipedia says it has a "fixed instruction width of 32 bits to ease decoding and pipelining",
so it is a reasonable trade-off.
32-bit ARM has 16 registers because it only uses 4 bits for encoding the register, not because 16 is the ideal number. Likewise, x86 has only 8 registers because historically 3 bits were used to encode the register so that some instructions fit in a byte.
That is such a limited number that both x86 and ARM doubled the count when going to 64-bit, to 16 and 32 registers respectively. The old ARM instruction encoding had no bits left for the larger register numbers, so a trade-off had to be made: the ability to execute almost every instruction conditionally was dropped and the 4-bit condition field was reused for the new features (that's an oversimplification; in reality the encoding is new, but you do need 3 more bits for the new registers).
Back in the 80's (IIRC) an academic paper was published that examined a number of different workloads, comparing expected performance benefits of different numbers of registers. This was at a time when RISC processors were transitioning from academic ideas to mainstream hardware, and it was important to decide what was optimal. CPUs were already pulling ahead of memory in speed, and RISC was making this worse by limiting addressing modes and having separate load and store instructions. Having more registers meant you could "cache" more data for immediate access and therefore access main memory less.
Considering only powers of two, it was found that 32 registers was optimal, although 16 wasn't terribly far behind.
ARM is unusual in that almost every instruction can carry a conditional-execution code, avoiding tests & branches. Don't forget, many 32-register machines fix R0 to 0, so conditional tests are done by comparing to R0. I know this from experience. 20 years ago I had to program a 'Mode 7' (in SNES terminology) floor. The CPUs were the SH2 of the 32X (or rather two of them), the MIPS R3000 (PlayStation) and the 3DO's ARM; the inner loops were 19, 15 and 11 instructions respectively. If the 3DO had been running at the same speed as the other two, it would have been twice as fast. As it was, it was just a bit slower.

Resources