How to determine if carry out occurs in C - arm

I'm writing an ARM11 emulator and now I'm trying to set the CPSR flags, which are N (negative result), Z (zero), C (carry out) and V (overflow).
This is what the spec says:
The C bit in logical operations (and, eor, orr, teq, tst and mov) will be set to the carry out from
any shift operation (i.e.the result from the barrel shifter). In arithmetic operations (add, sub, rsb
and cmp) the C bit will be set to the carry out of the bit 31 of the ALU.
My question is, how do I determine the carry out from logical and arithmetic operations?
Operations work on two uint32_t, e.g. my eor operation simply returns x ^ y, and after that I need to set the CPSR flags.
EDIT:
For addition, C is set to 1 if the addition produced a carry (unsigned overflow), it is set to 0 otherwise. For subtraction (including comparison), the bit C is set to 0 if the subtraction produced a borrow, otherwise is set to 1.

Logical operations are a bit of a red herring here - obviously eor r0, r1, r2 isn't going to produce overflow or carry. However, it's not the logical operations themselves that we care about:
The C bit in logical operations (and, eor, orr, teq, tst and mov) will be set to the carry out from any shift operation (i.e.the result from the barrel shifter).
Remember that optional shift on any data processing instruction? Given eor r0, r1, r2 lsl #3, the carry you care about is whatever r2 lsl #3 generates. However you're implementing flag-setting for shifts*, do that.
* if you're stuck on that too, I saw plenty of good ideas in a quick flick through the related questions over there -->
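As a rough illustration of the shifter carry described above, here is a minimal sketch in C. The names `shift_result`, `lsl_with_carry` and `lsr_with_carry` are hypothetical, and the sketch assumes shift amounts of 0..31 for LSL and 1..31 for LSR only; other shift types and larger amounts would follow the same "last bit shifted out" idea.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: shifted value plus the shifter carry-out. */
typedef struct {
    uint32_t value;
    bool carry;
} shift_result;

/* LSL by 0..31. A shift of 0 leaves the carry flag as it was. */
shift_result lsl_with_carry(uint32_t x, unsigned amount, bool carry_in)
{
    shift_result r;
    if (amount == 0) {
        r.value = x;
        r.carry = carry_in;
    } else {
        r.value = x << amount;
        /* The carry is the last bit shifted out of the top. */
        r.carry = ((x >> (32 - amount)) & 1u) != 0;
    }
    return r;
}

/* LSR by 1..31: the carry is the last bit shifted out of the bottom. */
shift_result lsr_with_carry(uint32_t x, unsigned amount)
{
    shift_result r;
    r.value = x >> amount;
    r.carry = ((x >> (amount - 1)) & 1u) != 0;
    return r;
}
```

For a logical instruction such as eor with a shifted operand, the emulator would then copy the `carry` field into the C flag after computing the result.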

I posit that the code snippet below is probably as close as you're going to get with standard C and logic operations to determine carry out and signed arithmetic overflow. This approach is an adaptation of how carry-lookahead circuits are generated for arbitrary word lengths in FPGAs.
The basic sequence is to first determine which bit pairs will generate a carry and which will propagate a carry. In this presentation, the initial carry-in is presumed to be zero. A mask is marched along the "generate" and "propagate" words and, together with the previous carry, determines the carry into the next iteration. At the end of the iteration, the carry flag will be set (or not) depending on the word pair bits to be added. The downside is that this loop has to be repeated every time you want to determine carry out and overflow for a given word pair; there is no such penalty in physical circuits or an FPGA.
As a bonus, it's super easy to determine an overflow flag, which indicates whether the result of the 2's complement addition is representable.
See the reference links below.
The code is for 32-bit integers, however could be adapted for longer or shorter types.
http://www.righto.com/2012/12/the-6502-overflow-flag-explained.html
https://en.wikipedia.org/wiki/Carry-lookahead_adder
#include <stdbool.h>
#include <stdint.h>

// Global carry and overflow flags
// Set by carryLookahead()
bool carry, ov;
// Determines presence of carry out and overflow from 2's complement addition
//
bool carryLookahead(int32_t f1, int32_t f2){
    uint32_t mask = 1;
    uint32_t g, p;
    unsigned char i;
    // uses & sets global carry and ov flag variables
    carry = ov = false; // initial carry and overflow flag assumed to be zero
    g = (uint32_t)f1 & (uint32_t)f2; // bit pairs that will generate carry
    p = (uint32_t)f1 | (uint32_t)f2; // bit pairs that will propagate a carry
    for (i = 0; i < 32; ++i, mask <<= 1) {
        ov = carry; // set ov to last carry
        carry = (g & mask) || ((p & mask) && carry); // use logical rather than bitwise logic to set the current carry
        ov = ov ^ carry; // ov is xor of last and current carries
    }
    return carry;
}
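For comparison, the same flags can also be derived without a loop from ordinary unsigned arithmetic. This is only a sketch under the conventions quoted in the question (C is 1 on an addition carry, and C is 1 on a subtraction with no borrow); the function names `add_carry`, `sub_carry` and `add_overflow` are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Carry out of bit 31 for a + b: the wrapped sum is smaller than an operand. */
bool add_carry(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;   /* wraps modulo 2^32 */
    return sum < a;
}

/* ARM-style C flag for a - b: 1 when no borrow occurred, 0 on borrow. */
bool sub_carry(uint32_t a, uint32_t b)
{
    return a >= b;
}

/* Signed overflow for a + b: both operands differ in sign from the result. */
bool add_overflow(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    return (((a ^ sum) & (b ^ sum)) >> 31) != 0;
}
```

Because uint32_t arithmetic wraps by definition in C, none of these helpers relies on undefined behaviour.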

Related

System Tick rollover on STM32 32bit ARM architecture

I'm having trouble understanding what happens when the 32-bit system tick on an STM32 MCU rolls over when using the ST-supplied HAL platform.
Suppose the MCU has been running until HAL_GetTick() returns its maximum of 2^32 - 1 = 0xFFFFFFFF. With a 1 ms tick, that is 4,294,967,295 / 1000 / 60 / 60 / 24 = approx 49 days, the maximum duration that can be measured.
What happens if you have a timer that is running across the rollover point?
Example code creating 100ms delay on a rollover event:
uint32_t start = HAL_GetTick(); // start = 0xFFFF FFFF (in this example)
// --> Interrupt increments systick which rolls it over to 0 at this point
while ((HAL_GetTick() - start) < 100);
So when the expression in the loop is first evaluated, HAL_GetTick() = 0x0000 0000 and start = 0xFFFF FFFF. Hence 0x0000 0000 - 0xFFFF FFFF = ? (This number doesn't exist as it's negative and we are doing unsigned arithmetic.)
However, when I run the following code on my STM32, compiled with GCC for ARM:
uint32_t a = 0xFFFFFFFFUL;
uint32_t b = 0x00000000UL;
uint32_t c = b - a;
printf("a =%lu b=%lu c=%lu\r\n", a, b, c);
The output is:
a =4294967295 b=0 c=1
The fact that c=1 is good from the point of view of the code functioning properly across the overflow but I don't understand what is actually happening here at the low level. How does 0 - 4294967295 = 1 ?? How would I calculate this on paper to show what the arithmetic logic unit inside the MCU is doing when this situation is encountered?
This is a characteristic of modular arithmetic: modulo wrapping is what happens when an unsigned integer overflows.
When working with a fixed number of digits/bits, arithmetic operations can overflow that fixed width. The overflow portion cannot be represented in the fixed number of digits/bits and is basically masked away. What is discarded is always a multiple of the modulus (2^32 for uint32_t), and the portion that stays within the fixed number of digits/bits is the remainder. That remainder stays correct (congruent to the true result modulo 2^32) after the operation that caused the overflow.
The best way to understand is to do a few operations with a pen on paper. Choose a base. Hexadecimal is great but it works for decimal, binary, and every base. Choose a fixed number of digits/bits. For uint32_t you have 8 hex digits or 32 bits. Choose two values that will overflow the fixed number of digits when you add them. Do the math on paper and include any overflow into an extra digit. Now perform the modulo operation by covering the overflow with your hand. Your CPU does this modulo operation automatically by virtue of having a fixed number of digits (i.e., uint32_t). Repeat this with different numbers and repeat with a subtraction/underflow. Eventually you'll start to trust that it works.
You do have to be careful when setting up this operation. Use unsigned types and subtract the start ticks value from the current ticks value, like is done in your example code. (Do not, for example, add the delay to start ticks and compare with the current ticks.) Raymond Chen's article, Using modular arithmetic to avoid timing overflow problems has more information.
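To make the difference between the two set-ups concrete, here is a small sketch. The helper names `delay_elapsed` and `delay_elapsed_broken` are hypothetical; `now` stands in for whatever HAL_GetTick() returns at the moment of the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* True once at least `delay` ticks have passed since `start`.
   The unsigned subtraction stays correct across a rollover. */
bool delay_elapsed(uint32_t now, uint32_t start, uint32_t delay)
{
    return (uint32_t)(now - start) >= delay;
}

/* Shown only for contrast: precomputing a deadline breaks near rollover. */
bool delay_elapsed_broken(uint32_t now, uint32_t start, uint32_t delay)
{
    uint32_t deadline = start + delay;  /* may wrap to a small number */
    return now >= deadline;
}
```

With start = 0xFFFFFFFF and a delay of 100, the first version reports the delay as elapsed only once the tick reaches 99, while the second version reports it immediately because the wrapped deadline (99) is already below the current tick.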
How does 0 - 4294967295 = 1 ?? How would I calculate this on paper to
show what the arithmetic logic unit inside the MCU is doing when this
situation is encountered?
First write it in hex like this:
0000_0000
- FFFF_FFFF
_____________
Then realize that you can add the modulus 0x1_0000_0000 to the first value (the minuend) without changing the result, because according to modular arithmetic, "0x0_0000_0000 and 0x1_0000_0000 are congruent modulo 0x1_0000_0000". Then it should become obvious that the difference is 1.
1_0000_0000
- 0_FFFF_FFFF
_____________
0_0000_0001
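The paper calculation above can be mirrored in C by doing the subtraction in a wider type and then covering the extra digit. This is just an illustrative sketch; `sub_mod32` is a made-up name, and the result is the same value that a plain uint32_t subtraction produces.

```c
#include <stdint.h>

/* Subtract b from a the "pen and paper" way: borrow the modulus 2^32
   up front, subtract in 64 bits, then keep only the low 32 bits. */
uint32_t sub_mod32(uint32_t a, uint32_t b)
{
    uint64_t wide = (UINT64_C(1) << 32) + a - b;
    return (uint32_t)(wide & 0xFFFFFFFFu);
}
```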
Nothing bad will happen. It will work the same as before the wraparound.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t start = UINT32_MAX - 20;
    uint32_t current = start;
    for(uint32_t x = 0; x < 100; x++)
    {
        printf("start = 0x%08"PRIx32" current = 0x%08"PRIx32 " current - start = %"PRIu32"\n", start, current, current-start);
        current++;
    }
}
You can see it here:
https://godbolt.org/z/jx4T4fhsW
0x00000000 - 0xffffffff will be 1 as 1 needs to be added to 0xffffffff to get 0x00000000. Same with other numbers.
BTW it is much easier to understand if you use hex numbers instead of decimals which have very limited use in programming.

Assembly Loop Through Each Bit of Register Value

I have a register $t0 that has some integer stored to it. For example, say I store the integer 1100 to it. The binary representation of this value is 0000010001001100. Of course it may extend to 32 bits for 32 bit register but that is easily done.
I am trying to implement a loop in assembly that iterates through each bit of the register value and checks if it is a 1 or 0. How would one do this?
Perhaps I misunderstand the nature of a register. It is my understanding that the register stores a 32 bit number, yes? Does that mean each bit is stored at a specific address?
I have tried using shift for the register and checking the bit but that has failed. I also looked into the lb command but this loads the byte, not the bit. So what would one do?
some basics:
most(all?) shift instructions shift out the bit into the carry flag
most(all?) CPUs have a branch command, to jump to the location on carry flag set
combining this, you could do the following:
load register1,< the value to be tested >
load register2, 0
Lrepeat:
compare register1 with 0
jump if zero Lexit
shift right register1
jump no carry Lskip
increase register2
Lskip:
jump Lrepeat
Lexit: ; when you end up here, your register2
; holds the count of bit register1 had set
some optimization still can be done:
Lrepeat:
compare register1 with 0
jump if zero Lexit
shift right register1
jump no carry Lskip <-- note: here you jump ...
increase register2
Lskip:
jump Lrepeat <-- ... to jump again!
Lexit:
=====>
Lrepeat:
compare register1 with 0
jump if zero Lexit
shift right register1
jump no carry Lrepeat <-- so you can optimize it this way
increase register2
Lskip:
jump Lrepeat
Lexit:
some CPUs have an "add carry" instruction,
e.g. 6502:
ADC register,value ; register = register + value + Carry flag
this could be used to avoid the branch (conditional jump), by adding 0 ( plus Carry of course) to register2 each loop
shift right register1
add with carry register2,0 ; may look weird, but this adds a
; 1 if the carry is set, which is
; exactly what we want here
jump Lrepeat
note that you don't need to know the register size! you just loop until the register is 0, which can save you a lot of time e.g. if your value is something like 0000 0000 0000 0000 0000 0000 0001 1001
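The same loop, written as a C sketch rather than assembly (the function name `count_set_bits` is only illustrative): it stops as soon as the remaining value is zero, and adding the low bit plays the role of the "add with carry" trick.

```c
#include <stdint.h>

/* Count the 1 bits in value, stopping early once no set bits remain. */
unsigned count_set_bits(uint32_t value)
{
    unsigned count = 0;
    while (value != 0) {
        count += value & 1u;  /* add the bit that is about to be shifted out */
        value >>= 1;
    }
    return count;
}
```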
On any processor, set up a loop to count the number of bits in a register, in this case 32. On each pass through the loop, AND the register of interest with 1. Then add the result to an accumulator, and finally shift the register. That gives you the number of set bits.
The precise instructions vary from processor to processor. To loop you normally set a label, decrement a counter, then execute an instruction with a name like branch_not_equal_to zero (BNZ, BNEQ0 something like that). The and will have a name like AND, ANDI (and immediate). The ADD might be ADD, ADC (add with carry). The shift will be something like ASR (arithmetic shift right) LSR (logical shift right), you might have to pass it 1 to say shift only one place. But all processors will allow you to read out register bits in essentially this way.
1) Let the register that you want to iterate through be denoted as R0. So, for example, the least significant bits are R0 = 1011 0101.
2) Use a second register as a mask, R1 = 0000 0001.
3) AND R1 with R0 and then right shift R0 (so the next iteration checks the next bit of R0).
4) Let R3 be a third register that increments by 1 IF the AND operation results in a 1 (that is, you run into a 1 in R0). ELSE, loop again to check the next bit in R0.
You could iterate through the entire 32 bits or a length of your choice by having a decrementing loop counter.
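The numbered steps can be sketched in C with a fixed 32-pass loop. Here each bit is stored in an output array instead of being counted, to show that every bit really is visited; `bits_of` is a hypothetical name.

```c
#include <stdint.h>

/* Visit all 32 bits of value, least significant first.
   out[i] receives bit i (0 or 1). */
void bits_of(uint32_t value, uint8_t out[32])
{
    for (int i = 0; i < 32; i++) {
        out[i] = (uint8_t)(value & 1u);  /* AND with the mask 0000 0001 */
        value >>= 1;                     /* move the next bit into place */
    }
}
```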

Understanding PowerPC rlwinm instruction

So I finally convinced myself to try and learn/use PowerPC (PPC).
Everything is going well and most information was found online.
However, when looking at some examples I came across this:
rlwinm r3, r3, 0,1,1
How would I do this in C?
I tried doing some research, but couldn't find anything that helped me out.
Thanks in advance!
rlwinm stands for "Rotate Left Word Immediate then aNd with Mask", and its correct usage is
rlwinm RA, RS, SH, MB, ME
As per the description page:
RA Specifies target general-purpose register where result of operation is stored.
RS Specifies source general-purpose register for operation.
SH Specifies shift value for operation.
MB Specifies begin value of mask for operation.
ME Specifies end value of mask for operation.
BM Specifies value of 32-bit mask.
And
If the MB value is less than the ME value + 1, then the mask bits
between and including the starting point and the end point are set to
ones. All other bits are set to zeros.
If the MB value is the same as
the ME value + 1, then all 32 mask bits are set to ones.
If the MB value is greater than the ME value + 1, then all of the mask bits
between and including the ME value +1 and the MB value -1 are set to
zeros. All other bits are set to ones.
So in your example the source and target are the same. Shift amount is 0, so no shift. And MB=ME=1, so the first case applies, such that the mask becomes all zeros with bit number 1 as 1, while numbering from MSB=0: 0x40000000.
In C we can write it as simple as
a &= 0x40000000;
assuming a is 32-bit variable.
rlwinm rotates the value of a register left by the specified number, performs an AND and stores the result in a register.
Example: rlwinm r3, r4, 5, 0, 31
r4 is the source register which is rotated by 5 and before the rotated result is placed in r3, it is also ANDed with a bit mask of only 1s since the interval between 0 and 31 is the entire 32-bit value.
Example taken from here.
For a C implementation you may want to take a look at how to rotate left and how to AND which should be trivial to build together now. Something like the following should work:
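#include <stdint.h>

uint32_t rotateLeft(uint32_t input, unsigned shift) {
    shift &= 31; // keep the count in 0..31 so that neither shift below is by 32
    return (input << shift) | (input >> ((32 - shift) & 31));
}
uint32_t rlwinm(uint32_t input, unsigned shift, uint32_t mask) {
    return rotateLeft(input, shift) & mask;
}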
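If you want to pass MB and ME directly instead of a precomputed mask, the three mask rules quoted earlier can be folded into one helper. This is a sketch only: the names `rotl32`, `ppc_mask` and `rlwinm_c` are invented here, bits are numbered with the MSB as bit 0 as in the PowerPC documentation, and SH, MB and ME are assumed to be in 0..31.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Build the 32-bit mask for MB..ME, numbering the MSB as bit 0. */
uint32_t ppc_mask(unsigned mb, unsigned me)
{
    uint32_t from_mb  = 0xFFFFFFFFu >> mb;         /* ones from bit MB downwards */
    uint32_t up_to_me = 0xFFFFFFFFu << (31 - me);  /* ones from bit 0 up to ME */
    /* MB <= ME: plain run of ones. MB > ME: the run wraps around. */
    return (mb <= me) ? (from_mb & up_to_me) : (from_mb | up_to_me);
}

uint32_t rlwinm_c(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & ppc_mask(mb, me);
}
```

For the instruction in the question, `rlwinm_c(r3, 0, 1, 1)` reduces to `r3 & 0x40000000`.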

optimized byte array shifter

I'm sure this has been asked before, but I need to implement a shift operator on a byte array of variable length. I've looked around a bit but I have not found any standard way of doing it. I came up with an implementation which works, but I'm not sure how efficient it is. Does anyone know of a standard way to shift an array, or at least have any recommendation on how to boost the performance of my implementation?
char* baLeftShift(const char* array, size_t size, signed int displacement,char* result)
{
memcpy(result,array,size);
short shiftBuffer = 0;
char carryFlag = 0;
char* byte;
if(displacement > 0)
{
for(;displacement--;)
{
for(byte=&(result[size - 1]);((unsigned int)(byte))>=((unsigned int)(result));byte--)
{
shiftBuffer = *byte;
shiftBuffer <<= 1;
*byte = ((carryFlag) | ((char)(shiftBuffer)));
carryFlag = ((char*)(&shiftBuffer))[1];
}
}
}
else
{
unsigned int offset = ((unsigned int)(result)) + size;
displacement = -displacement;
for(;displacement--;)
{
for(byte=(char*)result;((unsigned int)(byte)) < offset;byte++)
{
shiftBuffer = *byte;
shiftBuffer <<= 7;
*byte = ((carryFlag) | ((char*)(&shiftBuffer))[1]);
carryFlag = ((char)(shiftBuffer));
}
}
}
return result;
}
If I can just add to what @dwelch is saying, you could try this.
Just move the bytes to their final locations. Then you are left with a shift count such as 3, for example, if each byte still needs to be left-shifted 3 bits into the next higher byte. (This assumes in your mind's eye the bytes are laid out in ascending order from right to left.)
Then rotate each byte to the left by 3. A lookup table might be faster than individually doing an actual rotate. Then, in each byte, the 3 bits to be shifted are now in the right-hand end of the byte.
Now make a mask M, which is (1<<3)-1, which is simply the low order 3 bits turned on.
Now, in order, from high order byte to low order byte, do this:
c[i] ^= M & (c[i] ^ c[i-1])
That will copy bits to c[i] from c[i-1] under the mask M.
For the last byte, just use a 0 in place of c[i-1].
For right shifts, same idea.
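A minimal sketch of the rotate-then-merge step for the sub-byte part of a left shift follows. It assumes the whole-byte move has already been done, that c[0] is the lowest-order byte, and that k is in 0..7; the names `rotl8` and `shift_left_small` are hypothetical, and a plain rotate is used where the answer suggests a lookup table might be faster.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t rotl8(uint8_t x, unsigned k)
{
    return (uint8_t)((x << k) | (x >> (8 - k)));
}

/* Shift an n-byte value left by k bits (0..7); c[0] is least significant. */
void shift_left_small(uint8_t *c, size_t n, unsigned k)
{
    uint8_t m = (uint8_t)((1u << k) - 1u);  /* low k bits turned on */

    /* After rotating, the low k bits of each byte belong to the next higher byte. */
    for (size_t i = 0; i < n; i++)
        c[i] = rotl8(c[i], k);

    /* From high-order to low-order byte, copy the masked bits from c[i-1]. */
    for (size_t i = n; i-- > 0; ) {
        uint8_t lower = (i > 0) ? c[i - 1] : 0;  /* 0 below the lowest byte */
        c[i] ^= m & (c[i] ^ lower);
    }
}
```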
My first suggestion would be to eliminate the for loops around the displacement. You should be able to do the necessary shifts without the for(;displacement--;) loops. For displacements of magnitude greater than 7, things get a little trickier because your inner loop bounds will change and your source offset is no longer 1. i.e. your input buffer offset becomes magnitude / 8 and your shift becomes magnitude % 8.
It does look inefficient and perhaps this is what Nathan was referring to.
Assuming a char is 8 bits where this code is running, there are two things to do. First, move the whole bytes: for example, if your input array is 0x00,0x00,0x12,0x34 and you shift left 8 bits, then you get 0x00 0x12 0x34 0x00. There is no reason to do that in a loop 8 times, one bit at a time. So start by shifting the whole chars in the array by (displacement>>3) locations and pad the holes created with zeros, with some sort of for(ra=(displacement>>3);ra>3)] = array[ra]; for(ra-=(displacement>>3);ra>(7-(displacement&7))). A good compiler will precompute (displacement>>3), displacement&7 and 7-(displacement&7), and a good processor will have enough registers to keep all of those values. You might help the compiler by making separate variables for each of those items, but depending on the compiler and how you are using it, that could make it worse too.
The bottom line, though, is to time the code. Perform a thousand 1-bit shifts, then a thousand 2-bit shifts, etc., and time the whole thing; then try a different algorithm, time it the same way, and see whether the optimizations make a difference, for better or worse. If you know ahead of time that this code will only ever be used for single-bit or less-than-8-bit shifts, adjust the timing test accordingly.
Your use of the carry flag implies that you are aware that many processors have instructions specifically for chaining arbitrarily long shifts using the standard register length (one bit at a time): rotate through carry, basically. The C language does not support this directly. For chaining single-bit shifts you could consider assembler and would likely outperform the C code; at least the single-bit shifts are faster than C code can do. A hybrid is to move the bytes first and then, if the number of bits to shift (displacement&7) is maybe less than 4, use the assembler, else use a C loop. Again, the timing tests will tell you where the optimizations are.
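Putting the two steps together (whole-byte move, then one pass for the remaining 0..7 bits), a portable C sketch could look like the following. It is not the asker's function: `ba_shift_left` is a hypothetical name, it works in place on unsigned bytes, handles left shifts only, and treats a[0] as the most significant byte, matching the 0x00,0x00,0x12,0x34 example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shift a big-endian byte array left by `bits` bits, in place. */
void ba_shift_left(uint8_t *a, size_t size, size_t bits)
{
    size_t byte_shift = bits / 8;
    unsigned bit_shift = (unsigned)(bits % 8);

    if (byte_shift >= size) {          /* everything is shifted out */
        memset(a, 0, size);
        return;
    }
    if (byte_shift > 0) {              /* step 1: move whole bytes, pad with zeros */
        memmove(a, a + byte_shift, size - byte_shift);
        memset(a + size - byte_shift, 0, byte_shift);
    }
    if (bit_shift > 0) {               /* step 2: one pass for the leftover bits */
        for (size_t i = 0; i < size; i++) {
            uint8_t next = (i + 1 < size) ? a[i + 1] : 0;
            a[i] = (uint8_t)((a[i] << bit_shift) | (next >> (8 - bit_shift)));
        }
    }
}
```

Each byte is touched at most twice regardless of the shift distance, which is the point of removing the per-bit outer loop; whether it is actually faster on a given target is still something to measure.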

Large bit arrays in C

Our OS professor mentioned that for assigning a process id to a new process, the kernel incrementally searches for the first zero bit in a array of size equivalent to the maximum number of processes(~32,768 by default), where an allocated process id has 1 stored in it.
As far as I know, there is no bit data type in C. Obviously, there's something I'm missing here.
Is there any such special construct from which we can build up a bit array? How is this done exactly?
More importantly, what are the operations that can be performed on such an array?
Bit arrays are simply byte arrays where you use bitwise operators to read the individual bits.
Suppose you have a 1-byte char variable. This contains 8 bits. You can test if the lowest bit is true by performing a bitwise AND operation with the value 1, e.g.
char a = /*something*/;
if (a & 1) {
/* lowest bit is true */
}
Notice that this is a single ampersand. It is completely different from the logical AND operator &&. This works because a & 1 will "mask out" all bits except the first, and so a & 1 will be nonzero if and only if the lowest bit of a is 1. Similarly, you can check if the second lowest bit is true by ANDing it with 2, and the third by ANDing with 4, etc, for continuing powers of two.
So a 32,768-element bit array would be represented as a 4096-element byte array, where the first byte holds bits 0-7, the second byte holds bits 8-15, etc. To perform the check, the code would select the byte from the array containing the bit that it wanted to check, and then use a bitwise operation to read the bit value from the byte.
As far as what the operations are, like any other data type, you can read values and write values. I explained how to read values above, and I'll explain how to write values below, but if you're really interested in understanding bitwise operations, read the link I provided in the first sentence.
How you write a bit depends on if you want to write a 0 or a 1. To write a 1-bit into a byte a, you perform the opposite of an AND operation: an OR operation, e.g.
char a = /*something*/;
a = a | 1; /* or a |= 1 */
After this, the lowest bit of a will be set to 1 whether it was set before or not. Again, you could write this into the second position by replacing 1 with 2, or into the third with 4, and so on for powers of two.
Finally, to write a zero bit, you AND with the inverse of the position you want to write to, e.g.
char a = /*something*/;
a = a & ~1; /* or a &= ~1 */
Now, the lowest bit of a is set to 0, regardless of its previous value. This works because ~1 will have all bits other than the lowest set to 1, and the lowest set to zero. This "masks out" the lowest bit to zero, and leaves the remaining bits of a alone.
A struct can assign members bit-sizes, but that's the extent of a "bit-type" in 'C'.
struct int_sized_struct {
int foo:4;
int bar:4;
int baz:24;
};
The rest of it is done with bitwise operations. For example, searching that PID bitmap can be done with:
extern uint32_t *process_bitmap;
uint32_t *p = process_bitmap;
uint32_t bit_offset = 0;
uint32_t bit_test;
/* Scan pid bitmap 32 entries per cycle. */
while ((*p & 0xffffffff) == 0xffffffff) {
p++;
}
/* Scan the 32-bit int block that has an open slot for the open PID */
bit_test = 0x80000000;
while ((*p & bit_test) == bit_test) {
bit_test >>= 1;
bit_offset++;
}
pid = (p - process_bitmap)*32 + bit_offset; /* each uint32_t word holds 32 PIDs */
This is roughly 32x faster than doing a simple for loop scanning an array with one byte per PID. (Actually, greater than 32x, since more of the bitmap will stay in the CPU cache.)
see http://graphics.stanford.edu/~seander/bithacks.html
There is no bit type in C, but bit manipulation is fairly straightforward. Some processors have bit-specific instructions that the code below would optimize nicely for; even without them it should be pretty fast. Using an array of 32-bit words instead of bytes may or may not be faster. Inlining instead of calling functions would also help performance.
If you have the memory to burn, just use a whole byte (or a whole 32-bit number, etc.) to store each bit; this greatly improves performance at the cost of memory used.
unsigned char data[SIZE];
unsigned char get_bit ( unsigned int offset )
{
//TODO: limit check offset
if(data[offset>>3]&(1<<(offset&7))) return(1);
else return(0);
}
void set_bit ( unsigned int offset, unsigned char bit )
{
//TODO: limit check offset
if(bit) data[offset>>3]|=1<<(offset&7);
else data[offset>>3]&=~(1<<(offset&7));
}
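Using the same layout as get_bit and set_bit (bit n lives in byte n>>3 at position n&7), the "first zero bit" search from the question could be sketched as below. The function name `find_first_zero` and its signature are assumptions for illustration; it returns -1 when every bit is set.

```c
#include <stddef.h>

/* Return the index of the first 0 bit among the first nbits bits, or -1. */
long find_first_zero(const unsigned char *bits, size_t nbits)
{
    for (size_t byte = 0; byte * 8 < nbits; byte++) {
        if (bits[byte] == 0xFF)
            continue;                        /* all 8 slots taken, skip the byte */
        for (size_t bit = 0; bit < 8; bit++) {
            size_t index = byte * 8 + bit;
            if (index >= nbits)
                return -1;                   /* ran past the end of the bitmap */
            if (!(bits[byte] & (1u << bit)))
                return (long)index;
        }
    }
    return -1;
}
```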
