How can I extract the first N arguments from __VA_ARGS__, assuming that N is less than or equal to the total number of arguments?
Example:
#define MY_SEQ r0, r1, r2, r3, r4, r5, r6, r7, \
r8, r9, r10, r11, r12, r13, r14, r15
#define EXTRACT_N(n, SEQ) {... magic ...}
...
EXTRACT_N(5, (MY_SEQ()));
should expand to:
{r0, r1, r2, r3, r4};
It's OK to assume that the sequence elements are of the form WHATEVER##N, where N is the index of the Nth element.
I'm looking for a nice solution to this problem that does NOT use Boost, i.e. I'd like to understand how it can be done.
I did it using an iterative approach, but I'd like to know if there is some other way to do it.
Here's how I implemented it:
#define EXTRACT_1(t0) t0
#define EXTRACT_2(t0, t1) EXTRACT_1(t0),t1
#define EXTRACT_3(t0, t1, t2) EXTRACT_2(t0, t1),t2
#define EXTRACT_4(t0, t1, t2, t3) EXTRACT_3(t0, t1, t2),t3
...
You cannot do that in the general case; the C preprocessor is not that flexible.
You might have something like
#define EXTRACT_N(N,A) EXTRACTTHEM ## N(A)
and have
#define EXTRACTTHEM1(X, ...) X
#define EXTRACTTHEM2(X,Y, ...) X,Y
etc.
(It is easy to generate an arbitrarily large, but bounded, set of such macros.)
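A minimal sketch of wiring that bounded set together with variadic macros (my own illustration with assumed names, not part of the original answer; the extra EXTRACT_N_ level makes sure N is fully expanded, e.g. if it is itself a macro, before it is pasted onto EXTRACTTHEM):
/* Sketch: bounded helper macros plus a dispatcher (names are placeholders). */
#define EXTRACTTHEM1(X, ...) X
#define EXTRACTTHEM2(X, Y, ...) X, Y
#define EXTRACTTHEM3(X, Y, Z, ...) X, Y, Z
/* ... up to the chosen bound ... */
#define EXTRACT_N_(N, ...) EXTRACTTHEM##N(__VA_ARGS__)
#define EXTRACT_N(N, ...)  EXTRACT_N_(N, __VA_ARGS__)
/* EXTRACT_N(3, MY_SEQ) expands to: r0, r1, r2 */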
There are more powerful macro processors than cpp, e.g. m4 or gpp.
You might instead consider generating your C or C++ code (using your own script or some generator) and having your build system (e.g. a Makefile) take care of producing the C code from something else.
In ARM assembly language, the ADCS instruction adds with the carry flag C and sets the condition flags.
The CMP instruction also sets the condition flags, so the carry flag gets overwritten.
How can I solve this?
This is my code; it implements a BCD adder on r0 and r1:
ldr r8, =#0
ldr r9, =#15
adds r7, r8, #0
ADDLOOP:
and r4, r0, r9
and r5, r1, r9
adcs r6, r4, r5
orr r7, r6, r7
add r8, r8, #1
mov r9, r9, lsl #4
cmp r8, #3
bgt ADDEND
bl ADDLOOP
ADDEND:
mov r0, r7
I tried to save the state of the condition flags, but I don't know how to do it.
To save/restore the Carry flag, you could create a 0/1 integer in a register (perhaps with adc reg, zeroed_reg, #0?), then on the next iteration use cmp reg, #1 or subs reg, reg, #1 to set the carry flag from it.
ARM can't materialize C as an integer 0/1 with a single instruction without any setup; compilers normally use movcs r0, #1 / movcc r0, #0 when not in a loop (Godbolt), but in a loop you'd probably want to zero a register once outside the loop instead of using two instructions predicated on carry-set / carry-clear.
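As a C-level illustration of the same idea (an assumed example, not from the answer): the carry-out of a 32-bit addition can be captured as a 0/1 value, which compilers typically turn into an adcs/adc or a predicated-move pair like the one above.
#include <stdint.h>

/* Illustrative helper: return the truncated sum and store the carry as 0/1. */
uint32_t add_with_carry_out(uint32_t a, uint32_t b, uint32_t *carry_out)
{
    uint32_t sum = a + b;
    *carry_out = (sum < a);   /* 1 if the 32-bit addition wrapped, else 0 */
    return sum;
}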
Loop without modifying C
Use teq r8, #4 / bne ADDLOOP as the loop branch, like the bottom of a do{}while(r8 != 4).
Or count down from 4 with tst r8,r8 / bne ADDLOOP, using sub r8, #1 instead of add.
TEQ updates N and Z but not the C or V flags (unless you use a shifted source operand, in which case it can update C); see the docs. Unlike cmp, it sets flags the way eors does. The eq / ne conditions work the same either way: subtraction and XOR both produce zero when the inputs are equal, and non-zero in every other case. But since teq doesn't set C or V, greater / less wouldn't be meaningful anyway.
This is what optimized BigInt code like GMP does, for example in its mpn_add_n function (source) which adds two bigint inputs (arrays of 32-bit chunks).
IDK why you were jumping forwards over a bl (branch-and-link), which sets lr as a return address. Don't do that; structure your asm loops like a do{}while() because it's more efficient, especially when the trip-count is known to be non-zero so you don't have to worry about the case of running the loop zero times.
There are cbz/cbnz instructions (docs) that jump on a register being zero or non-zero without affecting flags, but they can only jump forwards (out of the loop, past an unconditional branch). They're also only available in Thumb mode, unlike teq which was probably specifically designed to give ARM an efficient way to write BigInt loops.
BCD adding
Your algorithm has bugs; you need a base-10 carry, like 0x05 + 0x06 = 0x11, not 0x0b, in packed BCD.
And even the binary Carry flag isn't set by something like 0x0005000 + 0x0007000; there's no carry-out from the high bit, only into the next nibble. Also, adc adds the carry-in at the bottom of the register, not at the nibble your mask isolated.
So maybe you need to do something like subtract 0x000a000 from the sum (for that example shift position), because that will produce a carry-out. (ARM sets C as !borrow on subtraction, so maybe use rsb reverse-subtract or swap the operands.)
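For reference, here is a plain-C sketch of the digit-at-a-time logic the loop needs (my illustration, not code from the question or the answer): add nibble by nibble and propagate a base-10 carry.
#include <stdint.h>

/* Illustrative sketch: add two packed-BCD words one 4-bit digit at a time,
   so 0x05 + 0x06 gives 0x11 rather than 0x0b. */
static uint32_t bcd_add32(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    uint32_t carry = 0;
    for (int shift = 0; shift < 32; shift += 4) {
        uint32_t digit = ((a >> shift) & 0xF) + ((b >> shift) & 0xF) + carry;
        carry = (digit >= 10);       /* base-10 carry-out of this digit */
        if (carry)
            digit -= 10;             /* same effect as the subtract-0x0A trick */
        result |= digit << shift;
    }
    return result;   /* carry out of the most significant digit is discarded */
}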
NEON should make it possible to unpack to 8-bit elements (mask odd/even and interleave) and do all nibbles in parallel, but carry propagation is a problem; ARM doesn't have an efficient way to branch on SIMD vector conditions (unlike x86 pmovmskb). Just byte-shifting the vector and adding could generate further carries, as with 999999 + 1.
IDK if this can be cut down effectively with the same techniques hardware uses, like carry-select or carry-lookahead, but for 4-bit BCD digits with SIMD elements instead of single bits with hardware full-adders.
It's not worth doing for binary bigint because you can work in 32 or 64-bit chunks with the carry flag to help, but maybe there's something to gain when primitive hardware operations only do 4 bits at a time.
For an STM32F7, which has hardware instructions for double-precision floating point, I want to convert a uint64_t to a double.
In order to test that, I used the following code:
volatile static uint64_t m_testU64 = 45uLL * 0xFFFFFFFFuLL;
volatile static double m_testD;
#ifndef DO_NOT_USE_UL2D
m_testD = (double)m_testU64;
#else
double t = (double)(uint32_t)(m_testU64 >> 32u);
t *= 4294967296.0;
t += (double)(uint32_t)(m_testU64 & 0xFFFFFFFFu);
m_testD = t;
#endif
By default (if DO_NOT_USE_UL2D is not defined) the compiler (gcc or clang) calls the function __aeabi_ul2d(), which is fairly complex in terms of the number of executed instructions. See the assembly code here: https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/arm/ieee754-df.S#L537
For my particular example, it takes 20 instructions without even entering most of the branches.
And if DO_NOT_USE_UL2D is defined, the compiler generates the following assembly code:
movw r0, #1728 ; 0x6c0
vldr d2, [pc, #112] ; 0x303fa0
movt r0, #8192 ; 0x2000
vldr s0, [r0, #4]
ldr r1, [r0, #0]
vcvt.f64.u32 d0, s0
vldr s2, [r0]
vcvt.f64.u32 d1, s2
ldr r1, [r0, #4]
vfma.f64 d1, d0, d2
vstr d1, [r0, #8]
The code is simpler, and it is only 10 instructions.
So here are the questions (if DO_NOT_USE_UL2D is defined):
Is my code (in C) correct?
Is my code slower than the __aeabi_ul2d() function (not really important, but a bit curious)?
I have to do that, since I am not allowed to use function from libgcc (There are very good reasons for that...)
Be aware that the main purpose of this question is not performance; I am really curious about the implementation in libgcc, and I really want to know if there is something wrong in my code.
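For what it's worth, the split-and-scale conversion above can be sanity-checked against the compiler's built-in cast with a small test harness (hypothetical code, not part of the question's project): high * 2^32 is always exact, so only the final addition rounds, and with round-to-nearest the result should match the direct conversion bit for bit.
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side check of the manual uint64_t -> double conversion. */
static double u64_to_double_manual(uint64_t v)
{
    double t = (double)(uint32_t)(v >> 32);
    t *= 4294967296.0;                        /* 2^32, exact */
    t += (double)(uint32_t)(v & 0xFFFFFFFFu); /* single rounding happens here */
    return t;
}

int main(void)
{
    uint64_t tests[] = { 0u, 1u, 45uLL * 0xFFFFFFFFuLL,
                         (1uLL << 53) + 1, UINT64_MAX };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
        assert(u64_to_double_manual(tests[i]) == (double)tests[i]);
    return 0;
}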
I have a project on an armv5te platform, and I have to rewrite some functions in assembly code to use the enhanced DSP instructions.
I use the int64_t type a lot for accumulators, but I have no idea how to pass it to the ARM instruction SMULL (http://www.keil.com/support/man/docs/armasm/armasm_dom1361289902800.htm).
How can I pass the lower or upper 32 bits of a 64-bit variable to a 32-bit register? (I know that I can use an intermediate int32_t variable, but it does not look good.)
I know that the compiler would do it for me, but I just wrote this small function as an example.
int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
int64_t tmp_acc;
asm("SMULL %0, %1, %2, %3"
: "=r"(tmp_acc), "=r"(tmp_acc) // no idea how to pass tmp_acc;
: "r"(x), "r"(y)
);
return tmp_acc + acc;
}
You don't need and shouldn't use inline asm for this. The compiler can do even better than smull, and use smlal to multiply-accumulate with one instruction:
int64_t accum(int64_t acc, int32_t x, int32_t y) {
return acc + x * (int64_t)y;
}
which compiles (with gcc8.2 -O3 -mcpu=arm10e on the Godbolt compiler explorer) to this asm: (ARM10E is an ARMv5 microarchitecture I picked from Wikipedia's list)
accum:
smlal r0, r1, r3, r2 #, y, x
bx lr #
As a bonus, this pure C also compiles efficiently for AArch64.
https://gcc.gnu.org/wiki/DontUseInlineAsm
If you insist on shooting yourself in the foot and using inline asm:
(Or, in the general case with other instructions, there might be situations where you'd actually want this.)
First, beware that smull output registers aren't allowed to overlap the first input register, so you have to tell the compiler about this. An early-clobber constraint on the output operand(s) will do the trick of telling the compiler it can't have inputs in those registers. I don't see a clean way to tell the compiler that the 2nd input can be in the same register as an output.
This restriction is lifted in ARMv6 and later (see this Keil documentation) "Rn must be different from RdLo and RdHi in architectures before ARMv6", but for ARMv5 compatibility you need to make sure the compiler doesn't violate this when filling in your inline-asm template.
Optimizing compilers can optimize away a shift/OR that combines 32-bit C variables into a 64-bit C variable, when targeting a 32-bit platform. They already store 64-bit variables as a pair of registers, and in normal cases can figure out there's no actual work to be done in the asm.
So you can specify a 64-bit input or output as a pair of 32-bit variables.
#include <stdint.h>
int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
uint32_t prod_lo, prod_hi;
asm("SMULL %0, %1, %2, %3"
: "=&r" (prod_lo), "=&r"(prod_hi) // early clobber for pre-ARMv6
: "r"(x), "r"(y)
);
int64_t prod = ((int64_t)prod_hi) << 32;
prod |= prod_lo; // + here won't optimize away, but | does, with gcc
return acc + prod;
}
Unfortunately the early-clobber means we need 6 total registers, but the ARM calling convention only has 6 call-clobbered registers (r0..r3, lr, and ip (aka r12)). And one of them is LR, which has the return address so we can't lose its value. Probably not a big deal when inlined into a regular function that already saves/restores several registers.
Again from Godbolt:
# gcc -O3 output with early-clobber, valid even before ARMv6
testFunc:
str lr, [sp, #-4]! #, Save return address (link register)
SMULL ip, lr, r2, r3 # prod_lo, prod_hi, x, y
adds r0, ip, r0 #, prod, acc
adc r1, lr, r1 #, prod, acc
ldr pc, [sp], #4 # return by popping the return address into PC
# gcc -O3 output without early-clobber (&) on output constraints:
# valid only for ARMv6 and later
testFunc:
SMULL r3, r2, r2, r3 # prod_lo, prod_hi, x, y
adds r0, r3, r0 #, prod, acc
adc r1, r2, r1 #, prod, acc
bx lr #
Or you can use a "=r"(prod64) constraint and use modifiers to select which half of %0 you get. Unfortunately, gcc and clang emit less efficient asm for some reason, saving more registers (and maintaining 8-byte stack alignment). 2 instead of 1 for gcc, 4 instead of 2 for clang.
// using an int64_t directly with inline asm, with the %Q0 and %R0 operand modifiers
// Q is the low half, R is the high half.
int64_t testFunc2(int64_t acc, int32_t x, int32_t y)
{
int64_t prod; // gcc and clang seem to want more free registers this way
asm("SMULL %Q0, %R0, %1, %2"
: "=&r" (prod) // early clobber for pre-ARMv6
: "r"(x), "r"(y)
);
return acc + prod;
}
again compiled with gcc -O3 -mcpu=arm10e. (clang saves/restores 4 registers)
# gcc -O3 with the early-clobber so it's safe on ARMv5
testFunc2:
push {r4, r5} #
SMULL r4, r5, r2, r3 # prod, x, y
adds r0, r4, r0 #, prod, acc
adc r1, r5, r1 #, prod, acc
pop {r4, r5} #
bx lr #
So for some reason it seems to be more efficient to manually handle the halves of a 64-bit integer with current gcc and clang. This is obviously a missed optimization bug.
My goal is to implement a sorting algorithm in C.
I have to write C code that compiles into the fewest instructions when compiled with gcc -O0 (no optimization) on an ARM machine.
So my idea is to embed a quicksort implemented in assembly directly into the C code.
I referred to the following documents and tried to implement this.
However, I don't know how to pass intarray to my assembly function 'Quicksort' as a parameter.
Reference
1.https://en.wikibooks.org/wiki/Algorithm_Implementation/Sorting/Quicksort#ARM_Assembly
2.http://forum.falinux.com/zbxe/index.php?mid=lecture_tip&comment_srl=517498&sort_index=readed_count&order_type=asc&l=fr&page=58&document_srl=567970 (sorry for the non-English website)
I'm a newbie in assembly.
Please help me.
#include <stdio.h>
#include <stdint.h>
int Quicksort(uint32_t intarray[]);
asm(
".global Quicksort\n\
Quicksort:\n\
qsort:\n\
stmfd sp!,{r4, r6, lr} \n\
mov r6,r2 \n\
qsort_tailcall_entry:\n\
sub r7,r6,r1\n\
cmp r7,#1\n\
ldmlefd sp!,{r4,r6,pc}\n\
ldr r7,[r0,r1,asl#2]\n\
add r2,r1,#1\n\
mov r4,r6\n\
partition_loop:\n\
ldr r3,[r0, r2, asl #2]\n\
cmp r3,r7\n\
addle r2,r2, #1\n\
ble partition_test\n\
sub r4,r4, #1\n\
ldr r5,[r0, r4, asl #2]\n\
str r5,[r0, r2, asl #2]\n\
str r3,[r0, r4, asl #2]\n\
partition_test:\n\
cmp r2,r4\n\
blt partition_loop\n\
partition_finish:\n\
sub r2,r2,#1\n\
ldr r3,[r0,r2,asl #2]\n\
str r3,[r0,r1,asl #2]\n\
str r7,[r0,r2,asl #2]\n\
bl qsort\n\
mov r1,r4\n\
b qsort_tailcall_entry\n\
"
);
int main(void){
uint32_t intarray[10] = {5,2,5,1,7,5,7,2,3,8};
Quicksort(intarray);
return 0;
}
Since you mentioned that you are compiling with gcc, you could use the gcc asm extension (as the name says, it is a gcc extension and might not be compatible with other compilers). Take a look at basic asm and extended asm. Since you will probably be accessing data from your C code, I advise you to stick with the extended version, which lets you specify memory operands.
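As a minimal sketch of what that looks like (my own example with assumed names, not a rewrite of the Quicksort routine): the "=m" output operand lets gcc pick the addressing mode for an array element and tells it that element is written by the asm.
#include <stdint.h>

/* Hypothetical helper: store a value into arr[0] from extended asm. */
static inline void store_first(uint32_t *arr, uint32_t value)
{
    /* "=m" makes arr[0] a memory output operand; gcc substitutes a valid
       ARM addressing mode for %0, e.g. [r3] or [sp, #4]. */
    asm("str %1, %0"
        : "=m"(arr[0])
        : "r"(value));
}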
I'm working on writing a program running on Cortex-m3.
At first I wrote an assembly file which executes 'svc'.
svc:
svc 0
bx lr
I decided to use gcc's inline asm, so I wrote it as follows, but the svc function was not inlined.
__attribute__((naked))
int svc(int no, ...)
{
(void)no;
asm("svc 0\n\tbx lr");
}
int f() {
return svc(0,1,2);
}
------------------ generated assembly ------------------
svc:
svc 0
bx lr
f:
mov r0, #0
mov r1, #1
mov r2, #2
b svc
I guess it's not inlined because it is naked, so I dropped the naked attribute and wrote it like this:
int svc(int __no, ...)
{
register int no asm("r0") = __no;
register int ret asm("r0");
asm("svc 0" : "=r"(ret) : "r"(no));
return ret;
}
------------------ generated assembly ------------------
svc:
stmfd sp!, {r0, r1, r2, r3}
ldr r0, [sp]
add sp, sp, #16
svc 0
bx lr
f:
mov r0, #0 // missing instructions setting r1 and r2
svc 0
bx lr
Although I don't know why gcc adds some unnecessary stack operations, the svc itself is good. The problem is that svc is not inlined properly: the variadic parameters were dropped.
Is there any svc primitive in gcc? If gcc does not have one, how do I write the right one?
Have a look at the syntax that is used in core_cmFunc.h which is supplied as part of the ARM CMSIS for the Cortex-M family. Here's an example that writes a value to the Priority Mask Register:
__attribute__ ((always_inline)) static inline void __set_PRIMASK(uint32_t priMask)
{
__ASM volatile ("MSR primask, %0"::"r" (priMask));
}
However, creating a variadic function like this sounds difficult.
You can use a macro like this.
#define __svc(sNum) __asm volatile("SVC %0" ::"M" (sNum))
And use it just like any compiler-primitive function: __svc(2);.
Since it is just a macro, it will only generate the provided instruction.
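For passing arguments (the part the question's variadic version lost when inlined), a common pattern is a fixed-arity always_inline wrapper that binds each argument to the register the SVC handler expects. A minimal sketch, assuming at most three arguments and a handler that returns its result in r0 (svc3 is a made-up name, not a CMSIS function):
#include <stdint.h>

/* Hypothetical fixed-arity SVC wrapper: bind the C arguments to the exact
   registers the handler reads, so inlining cannot drop them. */
__attribute__((always_inline))
static inline uint32_t svc3(uint32_t a0, uint32_t a1, uint32_t a2)
{
    register uint32_t r0 asm("r0") = a0;
    register uint32_t r1 asm("r1") = a1;
    register uint32_t r2 asm("r2") = a2;
    asm volatile("svc 0"
                 : "+r"(r0)               /* r0 is both argument and result */
                 : "r"(r1), "r"(r2)
                 : "memory");
    return r0;
}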