32 bit operations on 8 bit architecture [closed] - c

Is it possible to perform 32-bit operations on an 8-bit architecture, and if so, how?
I thought about this for some time, and the best idea I have is to typedef char[N] to get types that are N bytes in size, and then implement functions such as add(char *, char *).
Thanks in advance!
(I'm asking about the 6502 processor)

You have tagged your question as "C" so this answer takes this into consideration.
Most C compilers for 8-bit systems I know have long types. You can simply use these.
Having said this, how does it work?
All common 8-bit processors have a special 1-bit flag that receives the carry/borrow from 8-bit operations. And they have addition and subtraction instructions that take this flag into account. So a 32-bit add will be translated into this sequence:
; 1st operand in R0 to R3
; 2nd operand in R4 to R7
; addition works only with A(ccumulator)
; result goes into R0 to R3
MOV A,R0
ADD A,R4
MOV R0,A
MOV A,R1
ADDC A,R5
MOV R1,A
MOV A,R2
ADDC A,R6
MOV R2,A
MOV A,R3
ADDC A,R7
MOV R3,A
Think about how you do sums on paper. There is no need to add a carry to the rightmost, least-significant digit: since there is "nothing" to its right, there is no carry coming in. We can interpret each 8-bit step as a one-digit operation in a number system of base 256.
For bit operations there is no need for a carry or borrow.
Another thought: what do you call an 8-bit system? One whose instructions can only handle 8 bits in parallel, or one whose data bus is only 8 bits wide?
For the latter case we can look at, for example, the 68008 processor. Internally it is a 32-bit processor, but its data bus is only 8 bits wide. Here you would use the 32-bit instructions. If the processor reads or writes a 32-bit value from/to memory, it generates 4 consecutive access cycles automatically.

Many (all that I know of...) CPUs have a so-called "carry flag" (1 bit), which is set when an addition or subtraction causes wrap-around. It is basically an extra bit for calculations. They also have versions of addition and subtraction that include this carry flag. So you can do (for example) a 32-bit addition by doing four 8-bit additions with carry.
Pseudocode example, little endian machine (so byte 0 of 4 byte result is the least significant byte):
carry,result[0] = opA[0] + opB[0]
carry,result[1] = opA[1] + opB[1] + carry
carry,result[2] = opA[2] + opB[2] + carry
carry,result[3] = opA[3] + opB[3] + carry
if carry == 1, the 32-bit result has overflowed
The first addition instruction might be called ADD (does not include the carry, just sets it), while the following additions might be called ADC (includes the carry and sets it). Some CPUs might have only an ADC instruction and require clearing the carry flag first.
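The pseudocode above can be sketched in C as follows. This is only an illustration of the carry chain, not what any particular compiler emits; the function name add32_bytes and the 4-byte little-endian array layout are assumptions made for the example.

```c
#include <stdint.h>

/* Add two 32-bit values, each stored as 4 little-endian bytes.
   Returns the final carry: 1 if the 32-bit result wrapped around.
   (add32_bytes is a hypothetical name used for illustration.) */
unsigned add32_bytes(uint8_t result[4], const uint8_t a[4], const uint8_t b[4])
{
    unsigned carry = 0;                       /* nothing carries into byte 0 */
    for (int i = 0; i < 4; i++) {
        unsigned sum = (unsigned)a[i] + b[i] + carry;  /* at most 255+255+1 */
        result[i] = (uint8_t)(sum & 0xFFu);            /* keep the low 8 bits */
        carry = sum >> 8;                              /* the 9th bit is the carry */
    }
    return carry;
}
```

Each loop iteration plays the role of one ADD/ADC instruction: it consumes the carry of the previous byte and produces the carry for the next one.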

If you use the standard int / long types, the compiler will automatically do the right thing. long has (at least) 32 bits, so there is no need to work with carry bits manually; the compiler is already capable of that. If possible, use the standard uint32_t/int32_t types for readability and portability. Examine the disassembled code to see how the compiler deals with 32-bit arithmetic.
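As a minimal sketch of that advice, the function below (add_u32 is a name invented for this example) contains nothing but ordinary C. Compiling it for an 8-bit target and looking at the generated assembly should show the multi-byte carry sequence, though the exact instructions depend on the compiler and CPU.

```c
#include <stdint.h>

/* Plain 32-bit addition; the compiler is responsible for splitting it
   into byte-wise operations on a target with 8-bit registers. */
uint32_t add_u32(uint32_t a, uint32_t b)
{
    return a + b;   /* unsigned arithmetic wraps modulo 2^32 */
}
```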

In general, the answer to "Can I do M-bit arithmetic on a processor which has only N bits?" is "Certainly yes!"
To see why: back in school, you probably learned your addition and multiplication tables for only up to 10+10 and 10×10. Yet you have no trouble adding, subtracting, or multiplying numbers which are any number of digits long.
And, simply stated, that's how a computer can operate on numbers bigger than its bit width. If you have two 32-bit numbers, and you can only add them 8 bits at a time, the situation is almost exactly like having two 4-digit numbers which you can only add one digit at a time. In school, you learned how to add individual pairs of digits, and process the carry -- and similarly, the computer simply adds pairs of 8-bit numbers, and processes the carry. Subtraction and multiplication follow the same sorts of rules you learned in school, too. (Division, as always, can be trickier, although the long division algorithm you learned in school is often a good start for doing long computer division, too.)
It helps to have a very clear understanding of number systems with bases other than 10. I said, "If you have two 32-bit numbers, and you can only add them 8 bits at a time, the situation is almost exactly like having two 4-digit numbers which you can only add one digit at a time." Now, when you take two 32-bit numbers and add them 8 bits at a time, it turns out that you're doing arithmetic in base 256. That sounds crazy, at first: most people have never heard of base 256, and it seems like working in a base that big might be impossibly difficult. But it's actually perfectly straightforward, when you think about it.
(Just for fun, I once wrote some code to do arithmetic on arbitrarily big numbers, and it works in base 2147483648. That sounds really crazy at first -- but it's just as reasonable, and in fact it's how most arbitrary-precision libraries work. Although actually the "real" libraries probably use base 4294967296, because they're cleverer than me about processing carries, and they don't want to waste even a single bit.)
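To make the base-256 idea concrete, here is a sketch of schoolbook multiplication where each "digit" is one byte. The name mul32_bytes and the little-endian byte arrays are assumptions for illustration; a real compiler runtime or bignum library will be organized differently.

```c
#include <stdint.h>
#include <string.h>

/* Schoolbook multiplication in base 256.
   a and b are 4-byte little-endian numbers; product receives 8 bytes.
   (mul32_bytes is a hypothetical name used for illustration.) */
void mul32_bytes(uint8_t product[8], const uint8_t a[4], const uint8_t b[4])
{
    memset(product, 0, 8);
    for (int i = 0; i < 4; i++) {
        unsigned carry = 0;
        for (int j = 0; j < 4; j++) {
            /* digit times digit, plus what is already in that column,
               plus the carry; the maximum is 255*255 + 255 + 255 = 65535 */
            unsigned t = (unsigned)a[i] * b[j] + product[i + j] + carry;
            product[i + j] = (uint8_t)(t & 0xFFu);
            carry = t >> 8;
        }
        product[i + 4] = (uint8_t)carry;      /* top digit of this row */
    }
}
```

This is the same procedure as multiplying on paper: one row per digit of the first number, each row shifted one column to the left, with carries propagated along the row.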

Related

Can an 8 bit microcontrollers use a 32 bit Integers? (and the other way around) [duplicate]

I'm wondering how the bit width quoted in datasheets relates to how things are done at the programming level.
As far as I know, "8 bit" in a uC is the resolution of the ADC, i.e. signals can be sampled to up to 256 values, and the more bits, the higher the precision of the sampled signal. However...
Does this affect the code? (Is everything 32-bit in the code?)
Whenever I declare an int on a 32-bit uC, am I actually using an int32 or an int8?
Can an 8 bit microcontroller use a 32 bit integer?
The short answer is yes
When a microcontroller is said to be 8 bit, it means that the internal registers are 8 bit and that the arithmetic unit operates with 8 bit numbers. So in a single instruction, you can only do 8 bit math.
However, you can still do 32-bit math; it just requires a number of instructions. For instance, you need four 8-bit registers to hold a single 32-bit value. Further, you'll have to do the math operation using 8-bit operations (i.e. multiple instructions).
For an ADD of two 32-bit ints, you'll need four 8-bit add instructions, plus whatever instructions are needed to handle the carries from the individual adds.
So you can do it, but it will be slow, as a single 32-bit add may require 10-20 instructions (or more, see the comment from @YannVernier).
... (and the other way around)
AFAIK most 32-bit CPUs have instructions that allow 8-bit math as a single instruction. So doing 8-bit or 32-bit math will be equally fast (in terms of instructions required).
Whenever I declare an int in a 32bit uC am i actually using an int32? or an int8?
With a 32 bit CPU, an int will normally be 32 bit so the answer is: int32
But from a C standard point of view, it would be okay to have a 16-bit int on a 32-bit machine. So even if 32 bits is common, you'll still have to check the size on your specific system to be really sure.
This seems to be quite a bit more than one question. First, for the title; yes, 8-bit and 32-bit microcontrollers can typically use integers of either width. Narrower processors will require more steps to handle larger widths, and therefore be slower. Wider processors may lack support for narrower types, causing them to require extra steps as well. Either way, a typical compiler will handle the difference between 8 and 32 bits.
Peripherals such as ADCs can have their own widths; it's not uncommon for them to have a width that doesn't fit precisely in bytes, such as 10 or 12 bits. Successive approximation ADCs also frequently offer a faster mode where fewer bits hold valid data. In such cases, requesting the fast/narrow mode would require different code from running in slow/full width mode.
If you declare an int in a C-compliant compiler, you'll never get an 8-bit variable, because C requires it to be at least 16 bits. Many compilers have options to diverge from the standard. On 32-bit computers it frequently is 32 bits, but on a microcontroller it may well be smaller to conserve memory even if the processor is 32-bit. There are width-specific types such as int32_t in stdint.h (which inttypes.h also includes) if you want to be specific.
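One way to "check what the size is on your specific system" is to ask the compiler. The helper below is a small sketch; int_storage_bits is a name made up for this example, and it reports storage size, which matches the value width on common implementations without padding bits.

```c
#include <limits.h>
#include <stdint.h>

/* Number of bits a plain int occupies on this compiler/target.
   (int_storage_bits is a hypothetical helper for illustration.) */
unsigned int_storage_bits(void)
{
    return (unsigned)(sizeof(int) * CHAR_BIT);
}
```

If the code depends on a particular width, declaring the variable as int16_t or int32_t removes the need for the check altogether.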

How does a processor (esp. ARM) interpret an overflowed result at a later stage in execution, after the result has been written back to memory?

Since processors follow the convention of representing numbers in 2's complement, how do they know whether the number resulting from an addition of two positive numbers is still positive and not negative?
For example, if I add two 32-bit numbers:
Let r2 contain the value 0x50192E32
Sample Code:
add r1, r2, #0x6F06410C
str r1, [r3]
Here an overflow flag is set.
Now suppose I want to use the stored result from memory in later instructions (somewhere else in the code, by which time other instructions have changed the processor's CPSR), as shown below:
ldr r5, [r3]
add r7, r5
The result of the first add instruction has a 1 in its MSB, so r5 now has a 1 in its MSB. How does the processor interpret this value? The correct result of adding two positive numbers is positive. Does it interpret the value as a negative number just because the MSB is 1? In that case we get a different result from the expected one.
For example, on a 4-bit machine:
2's complement: 4=0100 and 5=0101;
-4=1100 and -5=1011
Now 4+5=9, which is stored in a register/memory as 1001. If it is later accessed by another instruction, and the processor stores numbers in 2's complement format and checks the MSB, it thinks the value is negative 7.
If it all depends on the programmer, how does one store the correct result in a register or memory? Is there anything we can do in our code to store the correct result?
If you care about overflow conditions, then you need to check the overflow flag before the status register is overwritten by some other operation. Depending on the language involved, this may result in an exception being generated, or in the operation being retried using a longer integer type. However, many languages (C, for example) DON'T care about overflow conditions: if the result is out of range of the type, you simply get an incorrect result (in C, unsigned arithmetic silently wraps around, and signed overflow is undefined behavior). If a program written in such a language needs to detect overflow, it has to implement the check itself. For example, in the case of addition, if the operands have the same sign but the sign of the result is different, there was an overflow.
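The sign rule for addition described above can be written in C roughly like this. It is a sketch under the assumption of 32-bit two's complement values; add_i32_checked is an invented name, and the sign test is done on unsigned copies so that the check itself never overflows.

```c
#include <stdint.h>

/* Returns 1 and stores a + b in *sum if the addition fits in int32_t;
   returns 0 and leaves *sum untouched if it would overflow.
   (add_i32_checked is a hypothetical name used for illustration.) */
int add_i32_checked(int32_t a, int32_t b, int32_t *sum)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t us = ua + ub;                /* unsigned add wraps, well defined */

    /* Overflow happens exactly when both operands have the same sign bit
       and the result's sign bit differs from it. */
    if (((ua ^ us) & (ub ^ us)) >> 31)
        return 0;

    *sum = a + b;                         /* known to be in range here */
    return 1;
}
```

With the values from the question, add_i32_checked(0x50192E32, 0x6F06410C, &s) returns 0, because the true sum 0xBF1F6F3E does not fit in a signed 32-bit integer.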
I know I have covered this many times as have others.
The carry flag can be considered the unsigned overflow flag for addition; it is also the borrow flag or not-borrow flag for subtraction, depending on your architecture. The V flag is the signed overflow flag for addition (and subtraction). YOU are the only one who knows or cares whether the addition is signed or unsigned, because for addition and subtraction the resulting bits are the same either way.
It doesn't matter which flag it is, or which architecture: if you care about the result (be it the value or a flag), YOU have to make sure that you preserve that information until you need to use it. It is not the processor's job to do that, nor the instruction set's, nor the architecture's in general. This goes for the answers in the registers as it does for the flags; it is all on you, the programmer. Just preserve the state if you care. This question is like asking how to solve this:
if(a==b)
{
}
stuff;
stuff;
I want to do the if a == b thing now.
It is all on you, the programmer, to make that work: either do the compare at the time you need to use it instead of at some other time, or save the result of the compare at the time of the compare and then check the saved condition when you need it.
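The second option, saving the result of the compare, looks like this in C. The function process and the arithmetic standing in for "stuff" are purely illustrative assumptions.

```c
#include <stdbool.h>

/* Decide now, act later. (process is a hypothetical example function.) */
int process(int a, int b)
{
    bool was_equal = (a == b);   /* save the condition while it is still valid */

    a += 10;                     /* "stuff" that changes the operands, so */
    b -= 3;                      /* a == b can no longer be re-evaluated  */

    return was_equal ? a : b;    /* use the saved result, not a fresh compare */
}
```

The same applies to a flag in the status register: if it matters later, copy it somewhere it will survive.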

C: Using always widest type available would result in faster execution? [closed]

Assuming that
memory occupation is not relevant
there is no need to allocate very large arrays (the biggest array holds 1.000.000 elements).
the platform is 64 bit
Using always the widest type available for each variable:
int64_t for signed integers
uint64_t for unsigned integers
double for floating point numbers
would result in faster execution than using smaller types when 64 bits are not needed ?
Or would execute at the same speed (or slower) and occupy more memory?
I am a game programmer, and frequently see people advocating double or floats.
Basically what I learned is: using the size the hardware expects is "usually" fastest. For example, if the hardware expects 64-bit integers for integer math, 64-bit is faster.
The defenders of the "double" camp tend to use only that argument.
The problem is that for some software you might hit other hurdles, like running out of cache, running out of pipelines (maybe the hardware can do out-of-order or parallel execution if the data is smaller than what fits in its registers, for example), explicit parallel operations (MMX, SSE, etc.), the amount of data that must be moved around in context switches and core switches, and whatnot.
So the best "solution" I found is to write some test code, run it on the target hardware, and analyze the results.
It depends. Memory occupation affects performance, so it could be the case that 64-bit values compute faster than 32-bit values, but because you can fit more 32-bit data on a cache line, the 32-bit data types may go faster anyway. The only way to know is to test the particular algorithm on the particular hardware you are running it on. Complicating things further are SIMD data types, which are even wider and generally even faster, as you can do 4, 8, or 16 operations at a time, assuming it is possible to implement your algorithm efficiently using them.
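A test along those lines could start from something like the sketch below. All names here (sum_u32, sum_u64, time_sum_u32, time_sum_u64) are invented for the example, clock() only measures CPU time coarsely, and an optimizing compiler may vectorize or otherwise transform the loops, so treat the numbers as a rough indication at best.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Sum an array of 32-bit elements into a 64-bit accumulator. */
uint64_t sum_u32(const uint32_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) total += v[i];
    return total;
}

/* Same work on 64-bit elements: twice the memory traffic per element. */
uint64_t sum_u64(const uint64_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) total += v[i];
    return total;
}

/* CPU seconds for `reps` passes over v; *out keeps the work observable. */
double time_sum_u32(const uint32_t *v, size_t n, int reps, uint64_t *out)
{
    clock_t start = clock();
    uint64_t acc = 0;
    for (int r = 0; r < reps; r++) acc += sum_u32(v, n);
    *out = acc;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

double time_sum_u64(const uint64_t *v, size_t n, int reps, uint64_t *out)
{
    clock_t start = clock();
    uint64_t acc = 0;
    for (int r = 0; r < reps; r++) acc += sum_u64(v, n);
    *out = acc;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To get meaningful results, the arrays should be filled with the same values and be large enough to exceed the cache sizes of interest, and each measurement should be repeated several times.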
One example not yet mentioned: on x86-64, there is a full range of instructions for 64-bit operands: load, store, arithmetic, etc. However, in many cases, these instructions are longer.
For instance, add ebx, eax, which does a 32-bit add, is 2 bytes. To do a 64-bit add, add rbx, rax is 3 bytes.
So even if they execute at the same speed, the 64-bit instructions will occupy more memory. In particular, less of your code will fit in cache at any one time, potentially leading to more cache misses and slower speed overall. So this is another reason not to explicitly demand 64-bit integers if you don't need them.

Performance comparison: 64 bit and 32 bit multiplication [closed]

I'm using an Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz and wondering why the multiplication of 64-bit numbers is slower than that of 32-bit numbers. I've done a test run in C and it turns out it needs twice as much time.
I expected it to need the same amount of time since the CPU works with native 64 bit registers and it shouldn't matter how wide the numbers are (as long as they fit into a 64 bit register).
Can someone explain this?
There are specialized instructions in the x86-64 instruction set to express that you only want to multiply two 32-bit quantities. One instruction may look like IMUL %EBX, %ECX in a particular dialect for the x86-64 assembly, as opposed to the 64-bit multiplication IMUL %RBX, %RCX.
So the processor knows that you only want to multiply 32-bit quantities. This happens often enough that the designers of the processor made sure that the internal circuitry would be optimized to provide a faster answer in this easier case, just as it is easier for you to multiply 3-digit numbers than 6-digit numbers. The difference can be seen in the timings measured by Agner Fog and described in his comprehensive assembly optimization resources.
If your compiler is targeting the older 32-bit IA-32 instruction set, then the difference between 32-bit and 64-bit multiplication is even wider. The compiler has to implement 64-bit multiplication with only instructions for 32-bit multiplication, using four of them (three if computing only the 64 least significant bits of the result).
64-bit multiplication can be about three-four times slower than 32-bit multiplication in this case.
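The three-multiplication case (only the low 64 bits of the product) can be sketched in C as below. This illustrates the decomposition and is not the code a particular compiler generates; mul64_low is an invented name, and uint64_t is used here only as a convenient container for the 32-bit halves and their partial products.

```c
#include <stdint.h>

/* Low 64 bits of a * b, built from three 32x32-bit multiplications.
   (mul64_low is a hypothetical name used for illustration.) */
uint64_t mul64_low(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    /* 1st multiply: full 32x32 -> 64-bit product of the low halves */
    uint64_t low = (uint64_t)a_lo * b_lo;

    /* 2nd and 3rd multiplies: cross terms. They are shifted up by 32 bits,
       so only their low 32 bits can reach the 64-bit result. */
    uint32_t cross = (uint32_t)((uint64_t)a_lo * b_hi + (uint64_t)a_hi * b_lo);

    /* a_hi * b_hi would be shifted up by 64 bits and is dropped entirely. */
    return low + ((uint64_t)cross << 32);
}
```

The fourth multiplication, a_hi * b_hi, is only needed when the full 128-bit product is wanted.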
I can think of a problem that occurs here because of 64-bit multiplication.
Multiplying two 32-bit numbers gives a result of at most 64 bits. Multiplying two 64-bit numbers, however, can give a product of up to 128 bits, which no longer fits in a single 64-bit register.
A similar example is the 8086 microprocessor: multiplying two 8-bit numbers leaves the 16-bit result in AX, but multiplying two 16-bit numbers leaves the 32-bit result split across the DX and AX registers (if you know the assembly language abbreviations).
So I believe this may be what increases the calculation time and makes your 64-bit multiplication slower.

Is NEON of ARM faster for integers than floating points?

Or both floating point and integer operations are same speed? And if not so, how much faster is the integer version?
You can find information about instruction-specific scheduling for Advanced SIMD instructions for the Cortex-A8 (they don't publish it for newer cores, since timing has become quite complicated).
See "Advanced SIMD integer ALU instructions" versus "Advanced SIMD floating-point instructions".
You may need to read the explanation of how to read those tables.
To give a complete answer: in general, floating-point instructions take two cycles while instructions executed on the ALU take one cycle. On the other hand, multiplication of long long (8-byte integer) takes four cycles (from the same source), while multiplication of double takes two cycles.
In general it seems you shouldn't care about float versus integer but carefully choosing data type (float vs double, int vs long long) is more important.
It depends on which model you have, but the tendency has been for integer to have more opportunities to use the 128-bit wide data paths. This is no longer true on newer CPUs.
Of course, integer arithmetic also gives you the opportunity to increase the parallelism by using 16-bit or 8-bit operations.
As with all integer-versus-floating-point arguments, it depends on the specific problem and how much time you're willing to invest in tuning, because they can rarely run exactly the same code.
I would refer to auselen's answer for great links to all of the references, however, I found the actual cycle counts a little misleading. It is true that it can "go either way" depending on the precision that you need, but let's say that you have some parallelism in your routine and can efficiently operate on two words (SP float) at a time. Let's assume that you need the amount of precision for which floating point may be a good idea... 24 bits.
In particular when analyzing NEON performance, remember that there is a write-back delay (pipeline delay) so that you have to wait for a result to become ready if that result is required as the input to another instruction.
For fixed point you will need 32 bit ints to represent at least 24 bits of precision:
Multiply two-by-two 32-bit numbers together, and get 64-bit results. This takes two cycles and requires an extra register to store the wide result.
Shift the 64-bit numbers back to 32-bit numbers of the desired precision. This takes one cycle, and you have to wait for the write-back delay (5-6 cycles) from the multiply.
For floating point:
Multiply two-by-two 32 bit floats together. This takes one cycle.
So for this scenario, there is no way in heck that you would ever choose integer over floating point.
If you are dealing with 16 bit data, then the tradeoffs are much closer, although you may still need an extra instruction to shift the result of the multiply back to the desired precision. To achieve good performance if you are using Q15, then you can use the VQDMULH instruction on s16 data and achieve much higher performance with fewer registers than SP float.
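For readers unfamiliar with Q15, the scalar C function below sketches what a saturating, doubling, high-half multiply does to one pair of 16-bit lanes. It is a model written for illustration (q15_mul is an invented name), based on the usual description of VQDMULH as "double the product, keep the high half, saturate"; it is not NEON code and says nothing about timing.

```c
#include <stdint.h>

/* Q15 multiply: a and b represent a/32768 and b/32768.
   Models "double the product, keep the high 16 bits, saturate".
   (q15_mul is a hypothetical scalar model for illustration.) */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;      /* Q30 product, fits in 32 bits */

    /* Only (-1.0) * (-1.0) overflows once doubled: saturate to the maximum. */
    if (p == INT32_C(0x40000000))
        return INT16_MAX;

    /* (2 * p) >> 16 equals floor(p / 32768); written with division so that
       no negative value is ever right-shifted. */
    if (p >= 0)
        return (int16_t)(p / 32768);
    return (int16_t)(-((-p + 32767) / 32768));
}
```

For example, q15_mul(16384, 16384) multiplies 0.5 by 0.5 and returns 8192, which is 0.25 in Q15. The vector instruction does this for several 16-bit lanes at once, which is where the performance advantage over single-precision float comes from.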
Also, as auselen mentions, newer cores have different micro-architectures, and things always change. We are lucky that ARM actually makes their info public. For vendors that modify the microarchitecture like Apple, Qualcomm and Samsung (probably others...) the only way to know is to try it, which can be a lot of work if you are writing assembly. Still, I think the official ARM instruction timing website is probably quite useful. And I actually do think that they publish the numbers for A9, and these are mostly identical.
