I'm trying out the new arm64 instructions on iOS and I'm having a peculiar issue. I hope someone can help me with this.
In particular, this fails with 'Invalid operand for instruction':
void test()
{
register long long signed int r=0,c=0,d=0;
register signed int a=0,b=0,e=0,f=0;
// this fails
asm volatile("smaddl %0, %1, %2, %3" : "=r"(r) : "r"(a), "r"(b), "r"(c));
}
I'm not sure what I'm doing wrong. As far as I can tell, I'm following the instruction and its syntax correctly. Here's how it is defined in the docs:
"SMADDL Xd, Wn, Wm, Xa
Signed Multiply-Add Long: Xd = Xa + (Wn × Wm), treating source operands as signed."
where X denotes a 64-bit register and W denotes a 32-bit one.
Any help will be appreciated.
Thx
I was able to fix it by using the recommendation in this post:
asm volatile("smaddl %x0, %w1, %w2, %x3" : "=r"(r) : "r"(a), "r"(b), "r"(c));
This produces the following assembly:
_test: ; @test
; BB#0:
sub sp, sp, #48
movz w8, #0
movz x9, #0
stp x9, x9, [sp, #32]
str x9, [sp, #24]
stp w8, w8, [sp, #16]
stp w8, w8, [sp, #8]
ldp w10, w8, [sp, #16]
ldr x9, [sp, #32]
; InlineAsm Start
smaddl x9, w8, w10, x9
; InlineAsm End
str x9, [sp, #40]
add sp, sp, #48
ret lr
It seems you need to use 'w' to specifically mark 32-bit registers.
See also aarch64-inline-asm.c for a few more inline asm examples.
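As a cross-check on what the fixed instruction computes, here is a minimal sketch in C. The helper names smaddl_ref and smaddl_asm are made up for illustration; the inline-asm path is only compiled on AArch64, and other targets fall back to the portable expression.

```c
#include <stdint.h>

/* Portable reference for SMADDL Xd, Wn, Wm, Xa: Xd = Xa + (Wn * Wm),
   with both 32-bit sources sign-extended before the multiply. */
static int64_t smaddl_ref(int32_t n, int32_t m, int64_t a)
{
    return a + (int64_t)n * (int64_t)m;
}

/* Hypothetical wrapper: uses the instruction on AArch64 and the
   portable reference everywhere else. */
static int64_t smaddl_asm(int32_t n, int32_t m, int64_t a)
{
#if defined(__aarch64__)
    int64_t d;
    /* %x picks the 64-bit register name, %w the 32-bit one. */
    __asm__("smaddl %x0, %w1, %w2, %x3"
            : "=r"(d)
            : "r"(n), "r"(m), "r"(a));
    return d;
#else
    return smaddl_ref(n, m, a);
#endif
}
```

Calling smaddl_asm(3, -4, 10) and smaddl_ref(3, -4, 10) should give the same value on any target, which makes it easy to spot a wrong operand modifier.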
Related
I am trying to understand the assembly code for a simple program, shown below.
void f()
{
int i, x = 0;
for (i = 0; i < 10; i++)
x++;
printf("Value of x: %d\n", x);
}
and its corresponding assembly code on my machine is
00000000000007d4 <f>:
7d4: a9be7bfd stp x29, x30, [sp, #-32]!
7d8: 910003fd mov x29, sp
7dc: b9001fff str wzr, [sp, #28]
7e0: b9001bff str wzr, [sp, #24]
7e4: 14000007 b 800 <f+0x2c>
7e8: b9401fe0 ldr w0, [sp, #28]
7ec: 11000400 add w0, w0, #0x1
7f0: b9001fe0 str w0, [sp, #28]
7f4: b9401be0 ldr w0, [sp, #24]
7f8: 11000400 add w0, w0, #0x1
7fc: b9001be0 str w0, [sp, #24]
800: b9401be0 ldr w0, [sp, #24]
804: 7100241f cmp w0, #0x9
808: 54ffff0d b.le 7e8 <f+0x14>
80c: b9401fe1 ldr w1, [sp, #28]
810: 90000000 adrp x0, 0 <__abi_tag-0x278>
814: 9121c000 add x0, x0, #0x870
818: 97ffff9a bl 680 <printf@plt>
81c: d503201f nop
820: a8c27bfd ldp x29, x30, [sp], #32
824: d65f03c0 ret
I understand the loop, but lines 814 - 818 are really confusing to me. What's the purpose of adding #0x870 to x0? What does line 818 mean? And how are arguments passed to the printf() function?
I expected words like "Value of x: " to appear in the assembly code, but it seems like the compiler simply knows what to print.
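For what it's worth, the address arithmetic behind lines 810 - 814 can be sketched in C. ADRP forms a 4 KiB-aligned page address relative to the PC, and the ADD that follows supplies the offset within that page. The function names below are invented for illustration, and the idea that 0x870 is where the format string is stored (in a read-only data section rather than in the code) is an assumption about this particular binary.

```c
#include <stdint.h>

/* ADRP Xd, label: Xd = (PC with its low 12 bits cleared) + page_delta * 4096. */
static uint64_t adrp(uint64_t pc, int64_t page_delta)
{
    return (pc & ~UINT64_C(0xFFF)) + (uint64_t)(page_delta * 4096);
}

/* ADD Xd, Xn, #imm: adds the offset within the page. */
static uint64_t add_imm(uint64_t xn, uint64_t imm12)
{
    return xn + imm12;
}
```

With the values from the listing, add_imm(adrp(0x810, 0), 0x870) gives 0x870. Under the AArch64 calling convention the first integer or pointer arguments travel in x0, x1, and so on, so x0 would carry the format string pointer and w1 the value of x when bl reaches printf.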
I have the following C program:
int main() {
float number1, number2, sum=0.;
number1 = .5;
number2 = .3;
while(sum > -10000000.)
sum -= number1 + number2;
printf("%f",sum);
return 0;
}
Its corresponding assembly is as follows:
_main: ; @main
.cfi_startproc
; %bb.0:
sub sp, sp, #16 ; =16
.cfi_def_cfa_offset 16
str wzr, [sp, #12]
str wzr, [sp]
mov w8, #1056964608
str w8, [sp, #8]
mov w8, #39322
movk w8, #16025, lsl #16
str w8, [sp, #4]
LBB0_1: ; =>This Inner Loop Header: Depth=1
ldr s0, [sp]
fcvt d0, s0
adrp x8, lCPI0_0@PAGE
ldr d1, [x8, lCPI0_0@PAGEOFF]
fcmp d0, d1
b.le LBB0_3
; %bb.2: ; in Loop: Header=BB0_1 Depth=1
ldr s0, [sp, #8]
ldr s1, [sp, #4]
fadd s1, s0, s1
ldr s0, [sp]
fsub s0, s0, s1
str s0, [sp]
b LBB0_1
LBB0_3:
mov w0, #0
add sp, sp, #16 ; =16
ret
.cfi_endproc
; -- End function
.subsections_via_symbols
I want to analyse the latency of each instruction, so I'm looking for ways to obtain a program counter trace.
Desired output is as follows:
0000000000 _main: ; @main
0000000001 .cfi_startproc
0000000002 ; %bb.0:
0000000003 sub sp, sp, #16 ; =16
0000000004 .cfi_def_cfa_offset 16
0000000005 str wzr, [sp, #12]
0000000006 str wzr, [sp]
0000000007 mov w8, #1056964608
0000000008 str w8, [sp, #8]
0000000009 mov w8, #39322
0000000010 movk w8, #16025, lsl #16
0000000011 str w8, [sp, #4]
...
where the first column is the timestamp in pico/nano/microseconds.
Target system is macOS, compiler is llvm, debugger is lldb.
There is no way to precisely measure instruction time at a granularity of a few cycles (at least not on this target architecture). Thus, you cannot measure the latency of one specific instruction unless it is a very slow one. The reason is that the best instructions available for measuring time are themselves fairly slow, and the processor can execute multiple instructions per cycle, out of order (not to mention that they are pipelined). This is especially true for the M1 processor you appear to be running on. On ARM, the way to measure time seems to be to read PMCCNTR, based on this post. You certainly need to account for superscalar out-of-order execution even with such an instruction, though. The delay introduced by such an instruction depends on the target architecture, and AFAIK there is no official public information targeting the M1 on this topic (in fact, documentation on how the M1 executes instructions is pretty scarce so far).
An alternative solution is to simulate the execution of the code with LLVM-MCA, which performs a static analysis of the program to simulate the scheduling of the instructions on the target architecture. The static analysis has a big downside: the actual runtime behaviour of loops and conditional jumps is not considered.
Note that profiling non-optimized code is generally a huge waste of time, as it does not reflect the actual execution of the release version (which should be optimized). Once optimized, the code is likely bound by the dependency chain on sum. This is especially true on the M1 processor, which can execute a lot of instructions in parallel on the same (big/performance) core.
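Since per-instruction timestamps are out of reach, a coarser measurement is to time the whole loop and divide by the iteration count. The following is only a sketch under that assumption: time_loop_seconds is a made-up name, the limit is passed in as a parameter, and it uses the standard C clock() function (CPU time) rather than PMCCNTR.

```c
#include <time.h>

/* Runs the loop from the question until sum <= -limit and returns the
   CPU time in seconds spent in the whole loop; the final sum is stored
   in *out so the compiler cannot discard the loop. */
static double time_loop_seconds(float limit, float *out)
{
    float number1 = .5f, number2 = .3f, sum = 0.f;

    clock_t start = clock();
    while (sum > -limit)
        sum -= number1 + number2;
    clock_t end = clock();

    *out = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Dividing the result by the number of iterations gives an average cost per iteration, which in an optimized build is likely dominated by the dependency chain on sum rather than by any single instruction's latency.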
I am studying ARMv8.
When the following C code is converted to assembly with clang, w0 seems to be used for the return value, and w8 and w9 are used to hold the variable values.
ARM is said to have the w-series registers w0 to w30, so why are w8 and w9 used instead of w1 and w2?
int main() {
int a = 3;
int b = 5;
int c = a + b;
return c;
}
main: // @main
sub sp, sp, #16 // =16
str wzr, [sp, #12]
mov w8, #3
str w8, [sp, #8]
mov w8, #5
str w8, [sp, #4]
ldr w8, [sp, #8]
ldr w9, [sp, #4]
add w8, w8, w9
str w8, [sp]
ldr w0, [sp]
add sp, sp, #16 // =16
ret
I need to convert an integer value into a float value on a Cortex-M4 with FPU; for example:
float convert(int n) {
return (float) n;
}
armclang compiler translates this to:
push {r11, lr}
mov r11, sp
sub sp, sp, #8
str r0, [sp, #4]
ldr r0, [sp, #4]
bl __aeabi_i2f
mov sp, r11
pop {r11, lr}
bx lr
(Godbolt Link: https://godbolt.org/z/K59xGq78W)
The conversion from int to float is made by calling the library routine __aeabi_i2f which is much less efficient than using the FPU instruction VCVT.
For example, GCC makes use of VCVT:
push {r7}
sub sp, sp, #12
add r7, sp, #0
str r0, [r7, #4]
ldr r3, [r7, #4]
vmov s15, r3 # int
vcvt.f32.s32 s15, s15
vmov.f32 s0, s15
adds r7, r7, #12
mov sp, r7
ldr r7, [sp], #4
bx lr
(https://godbolt.org/z/Pdv3nEMYq)
Is there a way to tell armclang to use the VCVT instruction?
Use the option -march=armv7+fp to tell the compiler to generate code for a machine with an FPU.
Godbolt
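One way to confirm that the option took effect, short of reading the disassembly, is to look at the ACLE feature macro __ARM_FP, which the compiler defines when hardware floating point is available. The sketch below assumes the ACLE bit layout (bit 2, value 0x4, for single precision); has_hw_single_fp is a hypothetical helper name.

```c
/* Hypothetical check: reports whether the compiler believes the target
   has hardware single-precision floating point (ACLE __ARM_FP bit 2). */
static int has_hw_single_fp(void)
{
#if defined(__ARM_FP) && (__ARM_FP & 0x4)
    return 1;
#else
    return 0;
#endif
}

/* The function from the question; with an FPU enabled, the compiler
   is free to use VCVT here instead of calling __aeabi_i2f. */
float convert(int n)
{
    return (float)n;
}
```

If has_hw_single_fp() returns 0 on the Cortex-M4 build, the compiler is still targeting a machine without an FPU, and the library call is the expected result.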
I saw this code on page 120 of the ARMv8 Programmer's Guide:
foo//
SUB SP, SP, #0x30
STR W0, [SP, #0x2C]
STR W1, [SP, #0x28]
STR D0, [SP, #0x20]
STR D1, [SP, #0x18]
LDR W0, [SP, #0x2C]
STR W0, [SP, #0]
LDR W0, [SP, #0x28]
STR W0, [SP, #4]
LDR W0, [SP, #0x20]
STR W0, [SP, #8]
LDR W0, [SP, #0x18]
STR W0, [SP, #10]
LDR X9, [SP, #0x0]
STR X9, [X8, #0]
LDR X9, [SP, #8]
STR X9, [X8, #8]
LDR X9, [SP, #0x10]
STR X9, [X8, #0x10]
ADD SP, SP, #0x30
RET
bar//
STP X29, X30, [SP, #0x10]!
MOV X29, SP
SUB SP, SP, #0x20
ADD X8, SP, #8
MOV W0, WZR
ORR W1, WZR, #1
FMOV D0, #1.00000000
FMOV D1, #2.00000000
BL foo:
ADRP X8, {PC}, 0x78
ADD X8, X8, #0
LDR X9, [SP, #8]
STR X9, [X8, #0]
LDR X9, [SP, #0x10]
STR X9, [X8, #8]
LDR X9, [SP, #0x18]
STR X9, [X8, #0x10]
MOV SP, X29
LDP X20, X30, [SP], #0x10
RET
My question is about the following instruction at the end of the bar routine:
LDP X20, X30, [SP], #0x10
LDP loads a pair of registers from the stack, and as far as I understand a restriction on this instruction does not allow loading into registers that far apart.
First, how can this line be valid?
Second, why is FP loaded into x20? Isn't it supposed to be loaded into x29?
From ARM Procedure call standard :
Each frame shall link to the frame of its caller by means of a frame record of two 64-bit values on the stack
Of course, it also mentions cases in which these requirements are not mandatory, but in this case FP does not seem to be valid, because it is pointing to sp - 0x10, which holds fp and lr.
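To make the quoted rule concrete, here is a small sketch of the frame record layout in C. The struct and function names are invented for illustration; the layout follows the procedure call standard, with the caller's frame pointer at the lower address and the saved link register above it.

```c
#include <stddef.h>
#include <stdint.h>

/* One frame record: on AArch64 these are two 64-bit values, the saved
   x29 of the caller followed by the saved x30 (return address). */
struct frame_record {
    struct frame_record *prev_fp; /* link to the caller's frame record */
    uint64_t saved_lr;            /* value of LR on entry to the function */
};

/* Follows the chain from the innermost record until the NULL terminator
   and returns how many records were visited. */
static size_t frame_depth(const struct frame_record *fp)
{
    size_t depth = 0;
    while (fp != NULL) {
        depth++;
        fp = fp->prev_fp;
    }
    return depth;
}
```

With this layout, an epilogue that pairs with STP X29, X30 would be expected to reload the first slot into x29, since that slot is what the chain is followed through.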