Consider the below setup:
typedef struct
{
    float d;
} InnerStruct;

typedef struct
{
    InnerStruct **c;
} OuterStruct;

float TestFunc(OuterStruct *b)
{
    float a = 0.0f;
    for (int i = 0; i < 8; i++)
        a += b->c[i]->d;
    return a;
}
The for loop in TestFunc exactly replicates one in another function that I'm testing.
Both loops are unrolled by gcc (4.9.2) but yield slightly different assembly after doing so.
Assembly for my test loop:

lwz r9,-0x725C(r13)
lwz r8,0x4(r9)
lwz r10,0x0(r9)
lwz r11,0x8(r9)
lwz r4,0x4(r8)
lwz r10,0x4(r10)
lwz r8,0x4(r11)
lwz r11,0x0C(r9)
efsadd r4,r4,r10
lwz r10,0x10(r9)
lwz r7,0x4(r11)
lwz r11,0x14(r9)
efsadd r4,r4,r8
lwz r8,0x4(r10)
lwz r10,0x4(r11)
lwz r11,0x18(r9)
efsadd r4,r4,r7
lwz r9,0x1C(r9)
lwz r11,0x4(r11)
lwz r9,0x4(r9)
efsadd r4,r4,r8
efsadd r4,r4,r10
efsadd r4,r4,r11
efsadd r4,r4,r9

Assembly for the original loop:

lwz r9,0x4(r3)
lwz r8,0x8(r9)
lwz r10,0x4(r9)
lwz r11,0x0C(r9)
lwz r3,0x4(r8)
lwz r10,0x4(r10)
lwz r0,0x4(r11)
lwz r11,0x10(r9)
efsadd r3,r3,r10
lwz r8,0x14(r9)
lwz r10,0x4(r11)
lwz r11,0x18(r9)
efsadd r3,r3,r0
lwz r0,0x4(r8)
lwz r8,0x0(r9)
lwz r11,0x4(r11)
efsadd r3,r3,r10
lwz r10,0x1C(r9)
lwz r9,0x4(r8)
efsadd r3,r3,r0
lwz r0,0x4(r10)
efsadd r3,r3,r11
efsadd r3,r3,r9
efsadd r3,r3,r0
The issue is that the float values these two instruction sequences produce are not exactly the same, and I can't change the original loop; I need to modify the test loop somehow so that it returns the same values. I believe the test's assembly is equivalent to just adding each element one after another, but I'm not familiar enough with assembly to tell how the differences above translate back into C. I know the unrolling is the culprit because if I add a print to the loops, they don't unroll and the results match exactly as expected.
I presume this is for unit-testing the one function with another.
In general, floating point results in C or C++ are not guaranteed to be bit-for-bit reproducible across compilers and optimization settings, and it is not usually considered legitimate to expect them to be.
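To make that concrete, here is a minimal sketch (my example, not code from the question) of why reordering a sum changes the result: float addition is not associative, and a different unrolling order is effectively a different association.

#include <stdio.h>

int main(void)
{
    /* Values chosen so the two association orders round differently. */
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float left  = (a + b) + c;  /* the big terms cancel first: 1.0 */
    float right = a + (b + c);  /* c vanishes into b's rounding: 0.0 */
    printf("%g vs %g\n", left, right);
    return 0;
}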
The Java language standard requires exact floating point results. That requirement is a constant source of complaints about Java, with accusations that making the results reproducible usually makes them less accurate and sometimes makes the code much slower too.
If you are doing your testing in C or C++ then I would suggest this approach:
Calculate the result as best you can, with both high precision and high accuracy. In this case the input data are in 32-bit float, so convert them all to 64-bit float before calculating the expected result.
If the inputs were in double (and you don't have a bigger long double type), then sort the values and add them up smallest to largest. This typically minimizes the loss of accuracy.
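A minimal sketch of such a reference computation, reusing the types from the question (ExpectedSum is a hypothetical helper; it combines both suggestions above, widening to double and summing smallest-first):

#include <stdlib.h>

static int cmp_double(const void *pa, const void *pb)
{
    double a = *(const double *)pa, b = *(const double *)pb;
    return (a > b) - (a < b);
}

/* Widen each float to double, sort ascending, then accumulate. */
static double ExpectedSum(OuterStruct *o, int n)
{
    double tmp[8]; /* n is 8 in the question's loop */
    for (int i = 0; i < n; i++)
        tmp[i] = (double)o->c[i]->d;
    qsort(tmp, (size_t)n, sizeof tmp[0], cmp_double);
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += tmp[i];
    return sum;
}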
Once you have your expected result then test that the function output matches it within some bounds.
There are two approaches to setting what accuracy you require to consider the test as a pass:
One approach is to check what the real physical meaning of the number is and what accuracy you actually require.
The other approach is to just require that the result be accurate to within a few least-significant bits of the ideal result, i.e. that the error is less than a few times the ideal result times FLT_EPSILON.
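A sketch of the corresponding check under the second approach (CloseEnough and the factor 4 are illustrative choices, not prescribed values):

#include <float.h>
#include <math.h>

/* Pass if the result is within a few least-significant bits of the
   ideal value computed in higher precision. */
static int CloseEnough(float actual, double expected)
{
    double tol = 4.0 * fabs(expected) * FLT_EPSILON;
    return fabs((double)actual - expected) <= tol;
}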
Disabling fast-math seems to fix this issue. Thanks to @njuffa for the suggestion. I was hoping to be able to design the test function around this optimization, but it doesn't seem to be possible. At least I know what the issue is now. Appreciate everyone's help on the problem!
Feel like I've been asking a lot of these questions lately lol, but assembly is still pretty foreign to me.
Using an Arduino, I have to write a function in Atmel AVR Assembly for my computer science class that calculates the sum of the 8-bit values in an array and returns it as a 16-bit integer. The function is supposed to take in an array of bytes and a byte representing the length of the array as arguments, with those arguments stored in r24 and r22, respectively, when the function is called. I am allowed to use branching instructions and such.
The code is in this format:
.global sumArray
sumArray:
//magic happens
ret
I know how to make loops and increment the counter and things like that, but I am really lost as to how to put this together. Does anyone know how to write this function in Atmel AVR Assembly? Any help would be much appreciated!
Why don't you ask the question to your compiler?
#include <stdint.h>
uint16_t sumArray(uint8_t *val, uint8_t count)
{
uint16_t sum = 0;
for (uint8_t i = 0; i < count; i++)
sum += val[i];
return sum;
}
Compiling with avr-gcc -std=c99 -mmcu=avr5 -Os -S sum8-16.c generates
the following assembly:
.global sumArray
sumArray:
mov r19, r24
movw r30, r24
ldi r24, 0
ldi r25, 0
.L2:
mov r18, r30
sub r18, r19
cp r18, r22
brsh .L5
ld r18, Z+
add r24, r18
adc r25,__zero_reg__
rjmp .L2
.L5:
ret
This may not be the most straightforward solution, but if you study
this code, you can understand how it works and, hopefully, come up with
your own version.
If you want something quick and dirty, add the two 8-bit values into an 8-bit register. If the sum is less than either input, then set a second 8-bit register to 1, otherwise 0. That's how you can do the carry.
The processor should already have something called a carry flag that you can use to this end.
With pencil and paper, how do you add two two-digit decimal numbers when you were only taught to add two single-digit numbers at a time? Take 12 + 49: you can add 2 + 9 = 11, then what do you do? (Search for the word "carry".)
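The same trick sketched in C (add8 is a hypothetical helper; in assembly the carry flag hands you this result directly):

#include <stdint.h>

/* 8-bit add with manual carry detection: unsigned addition wraps,
   so the sum is smaller than an input exactly when a carry occurred. */
static uint8_t add8(uint8_t a, uint8_t b, uint8_t *carry)
{
    uint8_t sum = (uint8_t)(a + b);
    *carry = (sum < a) ? 1 : 0;
    return sum;
}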
I'm familiar with data alignment and performance but I'm rather new to aligning code. I started programming in x86-64 assembly recently with NASM and have been comparing performance using code alignment. As far as I can tell NASM inserts nop instructions to achieve code alignment.
Here is a function I have been trying this on a Ivy Bridge system
void triad(float *x, float *y, float *z, int n, int repeat) {
    float k = 3.14159f;
    for(int r=0; r<repeat; r++) {
        for(int i=0; i<n; i++) {
            z[i] = x[i] + k*y[i];
        }
    }
}
The assembly I'm using for this is below. If I don't specify the alignment my performance compared to the peak is only about 90%. However, when I align the code before the loop as well as both inner loops to 16 bytes the performance jumps to 96%. So clearly the code alignment in this case makes a difference.
But here is the strangest part. If I align the innermost loop to 32 bytes, it makes no difference in the performance of this function; however, in another version of this function that uses intrinsics, in a separate object file I link in, the performance jumps from 90% to 95%!
I did an object dump (using objdump -d -M intel) of the version aligned to 16 bytes (I posted the result at the end of this question) and of the one aligned to 32 bytes, and they are identical! It turns out that the innermost loop is aligned to 32 bytes anyway in both object files. But there must be some difference.
I did a hex dump of each object file, and there is one byte that differs between them. The object file aligned to 16 bytes has a byte with 0x10, and the object file aligned to 32 bytes has a byte with 0x20. What exactly is going on? Why does code alignment in one object file affect the performance of a function in another object file? How do I know what the optimal value is to align my code to?
My only guess is that when the code is relocated by the loader, the 32-byte-aligned object file affects the other object file that uses intrinsics. You can find the code to test all this at Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%
The NASM code I am using:
global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
triad_avx_asm_repeat:
shl rcx, 2
add rdi, rcx
add rsi, rcx
add rdx, rcx
vbroadcastss ymm2, [rel pi]
;neg rcx
align 16
.L1:
mov rax, rcx
neg rax
align 16
.L2:
vmulps ymm1, ymm2, [rdi+rax]
vaddps ymm1, ymm1, [rsi+rax]
vmovaps [rdx+rax], ymm1
add rax, 32
jne .L2
sub r8d, 1
jnz .L1
vzeroupper
ret
Result from objdump -d -M intel test16.o. The disassembly is identical if I change align 16 to align 32 in the assembly above just before .L2. However, the object files still differ by one byte.
test16.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <pi>:
0: d0 0f ror BYTE PTR [rdi],1
2: 49 rex.WB
3: 40 90 rex xchg eax,eax
5: 90 nop
6: 90 nop
7: 90 nop
8: 90 nop
9: 90 nop
a: 90 nop
b: 90 nop
c: 90 nop
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <triad_avx_asm_repeat>:
10: 48 c1 e1 02 shl rcx,0x2
14: 48 01 cf add rdi,rcx
17: 48 01 ce add rsi,rcx
1a: 48 01 ca add rdx,rcx
1d: c4 e2 7d 18 15 da ff vbroadcastss ymm2,DWORD PTR [rip+0xffffffffffffffda] # 0 <pi>
24: ff ff
26: 90 nop
27: 90 nop
28: 90 nop
29: 90 nop
2a: 90 nop
2b: 90 nop
2c: 90 nop
2d: 90 nop
2e: 90 nop
2f: 90 nop
0000000000000030 <triad_avx_asm_repeat.L1>:
30: 48 89 c8 mov rax,rcx
33: 48 f7 d8 neg rax
36: 90 nop
37: 90 nop
38: 90 nop
39: 90 nop
3a: 90 nop
3b: 90 nop
3c: 90 nop
3d: 90 nop
3e: 90 nop
3f: 90 nop
0000000000000040 <triad_avx_asm_repeat.L2>:
40: c5 ec 59 0c 07 vmulps ymm1,ymm2,YMMWORD PTR [rdi+rax*1]
45: c5 f4 58 0c 06 vaddps ymm1,ymm1,YMMWORD PTR [rsi+rax*1]
4a: c5 fc 29 0c 02 vmovaps YMMWORD PTR [rdx+rax*1],ymm1
4f: 48 83 c0 20 add rax,0x20
53: 75 eb jne 40 <triad_avx_asm_repeat.L2>
55: 41 83 e8 01 sub r8d,0x1
59: 75 d5 jne 30 <triad_avx_asm_repeat.L1>
5b: c5 f8 77 vzeroupper
5e: c3 ret
5f: 90 nop
Ahhh, code alignment...
Some basics of code alignment..
Most intel architectures fetch 16B worth of instructions per clock.
The branch predictor has a larger window and looks at typically double that, per clock. The idea is to get ahead of the instructions fetched.
How your code is aligned will dictate which instructions you have available to decode and predict at any given clock (simple code locality argument).
Most modern intel architectures cache instructions at various levels (either at the macro instruction level before decoding, or at the micro instruction level after decoding). This eliminates the effects of code alignment, as long as you're executing out of the micro/macro cache.
Also, most modern intel architectures have some form of loop stream detector that detects loops, again, executing them out of some cache that bypasses the front end fetch mechanism.
Some intel architectures are finicky about what they can cache, and what they can't. There are often dependencies on number of instructions/uops/alignment/branches/etc. Alignment may, in some cases, affect what's cached and what's not, and you can create cases where padding can prevent or cause a loop to get cached.
To make things even more complicated, the addresses of instructions are also used by the branch predictor. They are used in several ways, including (1) as a lookup into a branch prediction buffer to predict branches, (2) as a key/value to maintain some form of global state of branch behavior for prediction purposes, (3) as a key for determining indirect branch targets, etc. Therefore, alignment can actually have a pretty huge impact on branch prediction in some cases, due to aliasing or other poor prediction.
Some architectures use instruction addresses to determine when to prefetch data, and code alignment can interfere with that, if just the right conditions exist.
Aligning loops is not always a good thing to do, depending on how the code is laid out (especially if there's control flow in the loop).
Having said all that blah blah, your issue could be any one of these. It's important to look at the disassembly of not just the object, but the executable. You want to see what the final addresses are after everything is linked. Making changes in one object could affect the alignment/addresses of instructions in another object after linking.
In some cases, it's near impossible to align your code in a way that maximizes performance, simply because so many low-level architectural behaviors are hard to control and predict (that doesn't necessarily mean this is always the case). Your best bet then is to have some default alignment strategy (say, align all function entries on 16B boundaries, and outer loops the same) so that you minimize how much your performance varies from change to change. As a general strategy, aligning function entries is good. Aligning loops that are relatively small is good, as long as you're not adding nops in your execution path.
Beyond that, I'd need more info/data to pinpoint your exact problem, but thought some of this may help.. Good luck :)
The confusing part of the effect you are seeing (the assembled code doesn't change!) is due to section alignment. When you use the ALIGN macro in NASM, it actually has two separate effects:
Add 0 or more nop instructions so that the next instruction is aligned to the specified power-of-two boundary.
Issue an implicit SECTALIGN macro call which will set the section alignment directive to alignment amount1.
The first point is the commonly understood behavior of ALIGN: it aligns the loop relative to the start of the section in the output file.
The second part is also needed however: imagine your loop was aligned to a 32 byte boundary in the assembled section, but then the runtime loader put your section, in memory, at an address aligned only to 8 bytes: this would make the in-file alignment quite pointless. To fix this, most executable formats allow each section to specify an alignment requirement, and the runtime loader/linker will be sure to load the section at a memory address which respects the requirement.
That's what the hidden SECTALIGN macro does - it ensures that your ALIGN macro works.
For your file, there is no difference in the assembled code between ALIGN 16 and ALIGN 32 because the next 16-byte boundary happens to also be the next 32-byte boundary (of course, every other 16-byte boundary is a 32-byte one, so that happens about half the time). The implicit SECTALIGN call is still different though, and that's the one byte difference you see in your hexdump. The 0x20 is decimal 32, and the 0x10 is decimal 16.
You can verify this with objdump -h <binary>. Here's an example on a binary I aligned to 32 bytes:
objdump -h loop-test.o
loop-test.o: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 0000d18a 0000000000000000 0000000000000000 00000180 2**5
CONTENTS, ALLOC, LOAD, READONLY, CODE
The 2**5 in the Algn column is the 32-byte alignment. With 16-byte alignment this changes to 2**4.
Now it should be clear what happens - aligning the first function in your example changes the section alignment, but not the assembly. When you link your program together, the linker merges the various .text sections and picks the highest alignment.
At runtime, this causes the code to be aligned to a 32-byte boundary - but that doesn't affect the first function, because it isn't alignment-sensitive. Since the linker has merged your object files into one section, the larger alignment of 32 changes the alignment of every function (and instruction) in the section, including your other method, and so it changes the performance of your other, alignment-sensitive function.
1To be precise, SECTALIGN only changes the section alignment if the current section alignment is less than the specified amount - so the final section alignment will be the same as the largest SECTALIGN directive in the section.
Hi, I have been reading this kind of stuff in various docs:
register
Tells the compiler to store the variable being declared in a CPU register.
In standard C dialects, keyword register uses the following syntax:
register data-definition;
The register type modifier tells the compiler to store the variable being declared in a CPU register (if possible), to optimize access. For example,
register int i;
Note that TIGCC will automatically store often used variables in CPU registers when the optimization is turned on, but the keyword register will force storing in registers even if the optimization is turned off. However, the request for storing data in registers may be denied, if the compiler concludes that there is not enough free registers for use at this place.
http://tigcc.ticalc.org/doc/keywords.html#register
My point is not only about register. My point is: why would a compiler store variables in memory at all? The compiler's business is just to compile and generate an object file; the actual memory allocation happens at run time. Why is the compiler involved in this at all? I mean, without running the object file, just by compiling the file itself, does memory allocation happen in C?
The compiler is generating machine code, and the machine code is used to run your program. The compiler decides what machine code it generates, and in doing so it makes the decisions about what sort of allocation will happen at runtime. It's not executing those decisions when you type gcc foo.c; later, when you run the executable, it's the code GCC generated that's running.
This means the compiler wants to generate the fastest code possible and makes as many decisions as it can at compile time; this includes how to allocate things.
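As a tiny illustration (a hypothetical example, not from the question), the allocation decision is visible in the generated assembly without ever running the program:

/* square.c -- compile with "gcc -O2 -S square.c" and read square.s:
   on x86-64, x lives entirely in a register and never touches memory.
   That decision was made at compile time, not at run time. */
int square(int x)
{
    return x * x;
}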
The compiler doesn't run the code (unless it does a few rounds for profiling and better code generation), but it has to prepare it - this includes deciding how to keep the variables your program defines: whether to use fast and efficient storage such as registers, or the slower (and more prone to side effects) memory.
Initially, your local variables would simply be assigned locations on the stack frame (except, of course, for memory you explicitly allocate dynamically). If your function defines an int, your compiler would likely grow the stack by a few additional bytes, use that memory address for storing the variable, and pass it as an operand to any operation your code performs on that variable.
However, since memory is slower (even when cached), and manipulating it places more restrictions on the CPU, at a later stage the compiler may decide to move some variables into registers. This allocation is done through a complicated algorithm that tries to select the most reused and latency-critical variables that can fit within the architecture's limited number of logical registers (while conforming to various restrictions, such as some instructions requiring an operand to be in a particular register).
There's another complication - some memory addresses may alias with external pointers in ways unknown at compilation time, in which case you cannot move them into registers. Compilers are usually a very cautious bunch, and most of them avoid dangerous optimizations (otherwise they'd need to add special checks to avoid nasty surprises).
After all that, the compiler is still polite enough to let you advise it which variable is important and critical to you, in case it missed that: by marking these with the register keyword, you're basically asking it to try to keep the variable in a register, given that enough registers are available and that no aliasing is possible.
Here's a little example: Take the following code, doing the same thing twice but with slightly different circumstances:
#include "stdio.h"
int j;
int main() {
int i;
for (i = 0; i < 100; ++i) {
printf ("i'm here to prevent the loop from being optimized\n");
}
for (j = 0; j < 100; ++j) {
printf ("me too\n");
}
}
Note that i is local, while j is global (and therefore the compiler doesn't know if anyone else might access it during the run).
Compiling in gcc with -O3 produces the following code for main:
0000000000400540 <main>:
400540: 53 push %rbx
400541: bf 88 06 40 00 mov $0x400688,%edi
400546: bb 01 00 00 00 mov $0x1,%ebx
40054b: e8 18 ff ff ff callq 400468 <puts@plt>
400550: bf 88 06 40 00 mov $0x400688,%edi
400555: 83 c3 01 add $0x1,%ebx # <-- i++
400558: e8 0b ff ff ff callq 400468 <puts@plt>
40055d: 83 fb 64 cmp $0x64,%ebx
400560: 75 ee jne 400550 <main+0x10>
400562: c7 05 80 04 10 00 00 movl $0x0,1049728(%rip) # 5009ec <j>
400569: 00 00 00
40056c: bf c0 06 40 00 mov $0x4006c0,%edi
400571: e8 f2 fe ff ff callq 400468 <puts@plt>
400576: 8b 05 70 04 10 00 mov 1049712(%rip),%eax # 5009ec <j> (loads j)
40057c: 83 c0 01 add $0x1,%eax # <-- j++
40057f: 83 f8 63 cmp $0x63,%eax
400582: 89 05 64 04 10 00 mov %eax,1049700(%rip) # 5009ec <j> (stores j back)
400588: 7e e2 jle 40056c <main+0x2c>
40058a: 5b pop %rbx
40058b: c3 retq
As you can see, the first loop counter sits in ebx, and is incremented on each iteration and compared against the limit.
The second loop, however, was the dangerous one, and gcc decided to pass the index counter through memory (loading it into eax every iteration). This example shows how much better off you can be when using registers, as well as how sometimes you can't.
The compiler needs to translate the code into machine instructions and tell the computer how to run it. That includes how to perform operations (like multiplying two numbers) and how to store the data (stack, heap, or registers).
A 100x100 array of integers, one byte each, is located at address A. Write a program segment to compute the sum of the minor diagonal, i.e.
SUM = ΣA[i,99-i], where i=0...99
This is what I have so far:
LEA A, A0
CLR.B D0
CLR.B D1
ADDA.L #99, D0
ADD.B (A0), D1
ADD.B #1, D0
BEQ Done
ADDA.L #99,A0
BRA loop
There are quite a few issues in this code, including (but not limited to):
You use 'Loop' and 'Done', but the labels are not shown in the code
You are adding 100 bytes into D1, also as a byte, so you are definitely going to overflow the result (the target of the sum should be at least 16 bits, i.e. .w or .l size)
I'm perhaps wrong but I think the 'minor diagonal' goes from the bottom left to the upper right, while your code goes from the top left to the bottom right of the array
On the performance side:
You should use the 'quick' variant of the 68000 instruction set
Decrement and branch as mentioned by JasonD is more efficient than add/beq
Considering the code was close enough to the solution, here is a variant (I did not test it; hope it works):
lea A+99*100,a0 ; Points to the first column of the last row
moveq #0,d0 ; Start with Sum=0
moveq #100-1,d1 ; 100 iterations
Loop
moveq #0,d2 ; Clear register long
move.b (a0),d2 ; Read the byte
add.l d2,d0 ; Long add
lea -99(a0),a0 ; Move one row up and one column right
dbra d1,Loop ; Decrement d1 and branch to Loop until d1 gets negative
Done
; d0 now contains the sum
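For reference, the same computation sketched in C (my version, assuming the row-major layout from the question), which is handy for checking the assembly against:

#include <stdint.h>

/* Sum of the minor diagonal A[i][99-i] of a 100x100 byte array. */
uint16_t MinorDiagonalSum(const uint8_t A[100][100])
{
    uint32_t sum = 0;
    for (int i = 0; i < 100; i++)
        sum += A[i][99 - i];
    return (uint16_t)sum; /* maximum possible sum is 100*255 = 25500 */
}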
Most code I have ever read uses an int for standard error handling (return values from functions and such). But I am wondering if there is any benefit to be had from using a uint8_t: will a compiler -- read: most C compilers on most architectures -- produce instructions using the immediate addressing mode, i.e., embed the 1-byte integer into the instruction? The key instruction I'm thinking about is the compare after a function, using uint8_t as its return type, returns.
I could be thinking about things incorrectly, as introducing a 1-byte type just causes alignment issues -- there is probably a perfectly sane reason why compilers like to pack things in 4 bytes, and this is possibly the reason everyone just uses ints -- and since this is a stack-related issue rather than the heap, there is no real overhead.
Doing the right thing is what I'm thinking about. But let's say, for the sake of argument, this is a popular cheap microprocessor for an intelligent watch, and that it is configured with 1k of memory but does have different addressing modes in its instruction set :D
Another question, to slightly specialize the discussion (x86), would be: is the literal in
uint32_t x = func(); x == 1;
and
uint8_t x = func(); x == 1;
the same type? Or will the compiler generate an 8-bit literal in the second case? If so, it may use it to generate a compare instruction which has the literal as an immediate value and the returned int as a register reference. See CMP instruction types.
Another reference for the x86 instruction set.
Here's what one particular compiler will do for the following code:
extern int foo(void);
extern void do_something(void);
extern void do_something_else(void);

void bar(void)
{
    if(foo() == 31) { //error code 31
        do_something();
    } else {
        do_something_else();
    }
}
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 08 sub $0x8,%esp
6: e8 fc ff ff ff call 7 <bar+0x7>
b: 83 f8 1f cmp $0x1f,%eax
e: 74 08 je 18 <bar+0x18>
10: c9 leave
11: e9 fc ff ff ff jmp 12 <bar+0x12>
16: 89 f6 mov %esi,%esi
18: c9 leave
19: e9 fc ff ff ff jmp 1a <bar+0x1a>
That's a 3-byte instruction for the cmp. If foo() returns a char, we get
b: 3c 1f cmp $0x1f,%al
If you're looking for efficiency, though, don't assume comparing stuff in %al is faster than comparing with %eax.
There may be very small speed differences between the different integral types on a particular architecture. But you can't rely on them: they may change if you move to different hardware, and the code may even run slower if you upgrade to newer hardware.
And if you're talking about x86, as in the example you are giving, you make a false assumption: that an immediate needs to be of type uint8_t.
Actually 8-bit immediates embedded into the instruction are of type int8_t and can be used with bytes, words, dwords and qwords, in C notation: char, short, int and long long.
So on this architecture there would be no benefit at all, neither code size nor execution speed.
You should use int or unsigned int for your calculations, and use smaller types only for compounds (structs/arrays). The reason is that int is normally defined to be the "most natural" integral type for the processor; all other types may require extra processing to work correctly. In our project, compiled with gcc on Solaris for SPARC, accesses to 8- and 16-bit variables added instructions to the code: when loading a smaller type from memory, the compiler had to make sure the upper part of the register was properly set (sign extension for signed types, zero for unsigned). This made the code longer and increased register pressure, which hurt the other optimisations.
I've got a concrete example: I declared two variables of a struct as uint8_t and got this code in SPARC asm:
if(p->BQ > p->AQ)
was translated into
ldub [%l1+165], %o5 ! <variable>.BQ,
ldub [%l1+166], %g5 ! <variable>.AQ,
and %o5, 0xff, %g4 ! <variable>.BQ, <variable>.BQ
and %g5, 0xff, %l0 ! <variable>.AQ, <variable>.AQ
cmp %g4, %l0 ! <variable>.BQ, <variable>.AQ
bleu,a,pt %icc, .LL586 !
And here is what I got when I declared the two variables as uint_t:
lduw [%l1+168], %g1 ! <variable>.BQ,
lduw [%l1+172], %g4 ! <variable>.AQ,
cmp %g1, %g4 ! <variable>.BQ, <variable>.AQ
bleu,a,pt %icc, .LL587 !
Two arithmetic operations fewer, and two more registers free for other stuff.
Processors typically like to work with their natural register size, which in C is int.
Although there are exceptions, you're thinking too much about a problem that does not exist.