ARM GCC inline assembly

I am trying the following:
int main()
{
unsigned int result = 0;
unsigned int op1 = 10, op2 = 20;
asm volatile ("uadd8 %0, %1, %2" :
"=r" (result) :
"r" (op1), "r" (op2) );
}
I want to compile this for a Cortex-A9, and I am using the ARM GNU GCC toolchain.
But I keep getting this error:
arm-none-linux-gnueabi-gcc test_2.c
Assembler messages:
Error: selected processor does not support ARM mode `uadd8 r4,r3,r2'
I also tried forcing Thumb mode by adding .code 16, but with no luck.
What is the issue here?

The reason is that the default ARM architecture your compiler targets does not implement that instruction. uadd8 is supported in Thumb mode on ARMv6T2 and ARMv7, and in ARM mode on ARMv6 and ARMv7. Hence you need to pass a suitable -march= option to gcc. For example, any of:
-march=armv6
-march=armv6t2 -mthumb
-march=armv7-a
-march=armv7-a -marm
You can check which architecture is in effect for the compilation (the default, or the one set by options) with:
arm-elf-gcc -E -dM -x c /dev/null | grep ARM_ARCH
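As a rough illustration of making the same check from inside the source, the sketch below only emits uadd8 when the compiler reports that the 32-bit SIMD instructions are available, and otherwise falls back to plain C. It assumes the toolchain defines the ACLE macro __ARM_FEATURE_SIMD32 (older compilers may not, in which case the fallback is used). The function name uadd8_bytes is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: byte-wise (modulo 256) addition of four packed bytes. */
uint32_t uadd8_bytes(uint32_t op1, uint32_t op2)
{
#if defined(__ARM_FEATURE_SIMD32)
    /* Assumption: this ACLE macro is defined when uadd8 is available for the
       architecture selected with -march= / -mcpu=. The instruction also
       updates the APSR.GE flags, which this sketch ignores. */
    uint32_t result;
    __asm__ volatile ("uadd8 %0, %1, %2"
                      : "=r" (result)
                      : "r" (op1), "r" (op2));
    return result;
#else
    /* Portable fallback: add each byte lane separately, discarding carries. */
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (op1 >> (8 * lane)) & 0xFFu;
        uint32_t b = (op2 >> (8 * lane)) & 0xFFu;
        result |= ((a + b) & 0xFFu) << (8 * lane);
    }
    return result;
#endif
}
```

With the operands from the question, uadd8_bytes(10, 20) gives 30 on either path, since no byte lane overflows.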

Related

Inline assembly not produced when linking with main() file clang

I have the following code with an inline assembly in C:
#include <stdlib.h>
#include <stdio.h>
__attribute__((noinline))
int *get_indecies(int *padding, int pad_size, int size, int alias_type);
int *get_indecies(int *padding, int pad_size, int size, int alias_type){
int *indecies = (int *)malloc(size*sizeof(int));
asm("dsb sy\n\t");
return indecies;
}
At the bottom, the inline assembly is inserted.
When I produce the object code (cross-compiling for aarch64), the inline assembly is present:
.....
1f0: d5033f9f dsb sy
1f4: aa1f03e1 mov x1, xzr
......
When I link this object file with my main file using:
clang verification.c get_indecies.o -I "/home/[name]/gem5/include" -L "/home/[name]/gem5/util/m5/build/arm64/out" -lm5 -lc -O0 -static -target aarch64-linux-gnu -o verfication-base-m5
and then I do an object dump of this to check if the assembly instruction is present using:
aarch64-linux-gnu-objdump verification-base-m5 -S > assembly.s
The inline assembly is not there. Any ideas about what is happening in the linking stage that removes this instruction? The optimisation level is set to 0, so I am not sure.
Thanks!
I recently re-ran the line below:
clang verification.c get_indecies.o -I "/home/[name]/gem5/include" -L "/home/[name]/gem5/util/m5/build/arm64/out" -lm5 -lc -O0 -static -target aarch64-linux-gnu -o verfication-base-m5
and then dumped the result with objdump, and the inline assembly was present. I am not sure what the bug was at the time, but in my case it is resolved.
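One thing that is easy to rule out in a case like this is the compiler dropping the statement. GCC and Clang treat an asm statement without output operands as implicitly volatile, and spelling out volatile plus a "memory" clobber makes that intent explicit. The sketch below is a variant of the function in the question under that assumption. The name get_indices_with_barrier and the non-AArch64 fallback using __sync_synchronize() are additions for illustration, so that the code also builds on a host machine.

```c
#include <stdlib.h>

/* Hypothetical variant of get_indecies() with an explicit barrier. */
__attribute__((noinline))
int *get_indices_with_barrier(int size)
{
    int *indices = (int *)malloc((size_t)size * sizeof(int));
#if defined(__aarch64__)
    /* volatile: the optimizer must not discard the statement.
       "memory": the compiler must not move memory accesses across it. */
    __asm__ volatile ("dsb sy" ::: "memory");
#else
    /* Assumption: a full barrier builtin is an acceptable stand-in
       when not targeting AArch64. */
    __sync_synchronize();
#endif
    return indices;
}
```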

Why does this bootloader only print 'S'?

I am writing a simple x86 bootloader.
This is the C program that I'm having trouble with: test4.c
__asm__(".code16\n");
__asm__("jmpl $0x0, $main\n");
void prints ( char* str )
{
char* pStr = str;
while( *pStr )
{
__asm__ __volatile (
"int $0x10"
:
: "a"(0x0e00 | *pStr), "b"(7)
);
pStr++;
}
}
void main ( )
{
char* str = "\n\rHello World\n\r";
char* pStr = str;
while( *pStr )
{
__asm__ __volatile (
"int $0x10"
:
: "a"(0x0e00 | *pStr)
);
pStr++;
}
prints ( str );
}
When I print the string within the main function, it works. But when I pass the string to another function that carries out the same instructions, it prints only S to the screen. So the final output looks something like this:
Hello World
S
Here is the linker file I used: test.ld
ENTRY(main);
SECTIONS
{
. = 0x7C00;
.text : AT(0x7C00)
{
*(.text);
}
.sig : AT(0x7DFE)
{
SHORT(0xaa55);
}
}
Here are the commands I used to compile the C program and to link it:
$ gcc -c -g -Os -m32 -march=i686 -ffreestanding -Wall -Werror test4.c -o test4.o
$ ld -melf_i386 -static -Ttest.ld -nostdlib --nmagic -o test4.elf test4.o
$ objcopy -O binary test4.elf test4.bin
I used the Bochs emulator to test this bootloader.
You can't do this with GCC. Ignore all the tutorials that say that you can -- they are wrong.
What's most important to keep in mind is that GCC is not a 16-bit compiler. The __asm__(".code16\n") directive does not turn it into one; it merely confuses the assembler into retargeting GCC's output from 32-bit x86 to 16-bit. This will cause strange and unexpected behavior, especially in any code using pointers.
If you want to write an x86 bootloader, you will need to:
Use a C compiler that can specifically target 16-bit x86 ("real mode"). Consider the OpenWatcom toolchain, for instance.
Become very familiar with the quirks of x86 real mode -- particularly segmentation.
Write some portions of the bootloader in assembly, particularly the startup code.
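For reference, the register values that the int $0x10 calls in the question rely on can be worked out in ordinary C. This is only a sketch of the BIOS teletype convention (AH = 0x0E selects the function, AL carries the character, BH the display page, BL the colour). The helper names are invented here, and nothing in it makes GCC's 32-bit output safe to run in real mode.

```c
#include <stdint.h>

/* Hypothetical helper: AX value for BIOS int 0x10 teletype output.
   AH = 0x0E (function number), AL = character to print. */
uint16_t teletype_ax(char c)
{
    /* Cast through unsigned char so a negative char cannot sign-extend into AH. */
    return (uint16_t)(0x0E00u | (unsigned char)c);
}

/* Hypothetical helper: BX value, BH = display page, BL = foreground colour. */
uint16_t teletype_bx(uint8_t page, uint8_t colour)
{
    return (uint16_t)(((unsigned)page << 8) | colour);
}
```

The "b"(7) operand in prints() corresponds to teletype_bx(0, 7), that is, page 0 with colour 7.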

GCC 4.8.3 generating invalid asm from intrinsics (operand size mismatch)

I'm working on migrating an MSVC application to Red Hat Linux and have run into some issues with an intrinsic that it uses. I'm using GCC 4.8.3 with the following command line.
gcc -c file.c -o file.o -O2 -g -masm=intel -m32 -mavx -mavx2 -D_MMX_ -D_LARGEFILE64_SOURCE=1 -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -pthread -fno-omit-frame-pointer -fno-tree-pre -fno-builtin-printf -Werror
The pseudo-code is something like this
#include <immintrin.h> // AVX/AVX2
void func_with_intrins(int xpos, int ypos, int h)
{
short ddx;
short ddy;
int yh;
ddx = xpos & 7;
ddy = ypos & 7;
...
for (yh = 0; yh < (h>>2); yh++)
{
if ((ddx + ddy) != 0)
{
__m128i ddxw = _mm_set1_epi16(ddx);
...
}
...
}
}
I've read that the _mm_set1_epi16 intrinsic can be expanded to a vpbroadcastw, and I've confirmed this by looking at the resulting assembly file (file.S):
vpbroadcastw xmm0, XMMWORD PTR [ebp-216]
The instruction reference defines it as:
VPBROADCASTW xmm1, xmm2/m16
VPBROADCASTB/W/D/Q is supported in both 128-bit and 256-bit wide versions.
The error GCC gives me is
Error: operand size mismatch for `vpbroadcastw'
When I change the optimization level from -O2 to -O0, the assembly is very different and doesn't use the vpbroadcastw instruction.
Since I'm using intrinsics, I'm at a loss as to what to do. Compiling the file at -O0 is not an option. If I move the __m128i ddxw declaration outside the loop, it seems to compile, but I think that would change the underlying functionality. When it is moved before the for loop, vpbroadcastw takes two xmm registers (the stack variable is first loaded into an xmm register with vmovdqa), so I suspect gcc is running out of registers and therefore tries to invoke vpbroadcastw with a stack offset as the second operand.
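On the worry that hoisting changes behaviour: as long as ddx is not modified in the elided parts of the loop, the broadcast is loop-invariant, so computing it once before the loop yields the same vector. The sketch below shows what _mm_set1_epi16 produces. broadcast16 is a name invented for this example, and the scalar branch exists only so the code also builds without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: _mm_set1_epi16, _mm_storeu_si128 */
#endif

/* Hypothetical helper: write eight copies of v to out[0..7]. */
void broadcast16(int16_t v, int16_t out[8])
{
#if defined(__SSE2__)
    __m128i vec = _mm_set1_epi16(v);        /* all eight 16-bit lanes hold v */
    _mm_storeu_si128((__m128i *)out, vec);  /* unaligned 128-bit store */
#else
    /* Scalar fallback with the same result. */
    for (int lane = 0; lane < 8; lane++)
        out[lane] = v;
#endif
}
```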

Why doesn't arm-none-eabi-gcc search for my custom _start symbol?

I am compiling the below code with "-nostdlib". My understanding was that arm-none-eabi-gcc will not use the _start in "crt0.o" but it will use the user defined _start. For this I was expecting to create a start.S file and put the _start symbol.
But if I compile the below shown code without the _start symbol defined from my side, I am not getting any warning. I was expecting "warning: cannot find entry symbol _start;"
Questions:
1) Why am I not getting the warning? Where did GCC get the _start symbol from?
2) If gcc got the _start symbol from a file somewhere, how do I ask GCC to use the _start from my start.S file instead?
$ cat test.c
int main()
{
volatile int i=0;
i = i+1;
return 0;
}
$ cat linker.ld
MEMORY
{
ram : ORIGIN = 0x8000, LENGTH = 20K
}
SECTIONS
{
.text : { *(.text*) } > ram
.bss : { *(.bss*) } > ram
}
$ arm-none-eabi-gcc -Wall -Werror -O2 -mfpu=neon-vfpv4 -mfloat-abi=hard -march=armv7-a -mtune=cortex-a7 -nostdlib -T linker.ld test.c -o test.o
$ arm-none-eabi-gcc --version
arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 4.9.3 20150529 (release) [ARM/embedded-4_9-branch revision 224288]
Compile and link with arm-none-eabi-gcc -v -Wall -Werror -O2 ... to understand what the compiler is doing (and which crt0 it is using; that crt0 probably has a _start calling your main, and _start might also be the default entry point for your linker).
Notice that -nostdlib is related to the (lack of) C standard library. Perhaps you want to compile in a freestanding environment, in which case use -ffreestanding (then main has no particular meaning, you need to define your starting function[s] yourself, and no standard C functions like malloc or printf are available, except perhaps setjmp).
Read the C99 standard n1256 draft. It explains what freestanding means in §5.1.2.1.
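To make the second question concrete, here is a minimal sketch of what a hand-written entry point usually does before calling main. Everything target-specific is an assumption: the bss_begin and bss_end symbols would have to be exported by the linker script (the one in the question does not define them), the stack pointer still has to be set up in assembly first, and the BARE_METAL_SKETCH macro is invented so that the entry point is only compiled when explicitly requested.

```c
/* Zero the byte range [start, end), as startup code typically does for .bss. */
void zero_range(unsigned char *start, unsigned char *end)
{
    while (start < end)
        *start++ = 0;
}

#if defined(BARE_METAL_SKETCH)  /* hypothetical opt-in macro */
extern unsigned char bss_begin[], bss_end[];  /* hypothetical linker-script symbols */
int main(void);

/* Entry point; the linker script would name it with ENTRY(_start). */
void _start(void)
{
    zero_range(bss_begin, bss_end);
    main();
    for (;;) { }  /* nothing to return to on bare metal */
}
#endif
```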

What gcc option enables loop unrolling for SSE intrinsics with immediate operands?

This question relates to gcc (4.6.3 Ubuntu) and its behavior in unrolling loops for SSE intrinsics with immediate operands.
An example of an intrinsic with immediate operand is _mm_blend_ps. It expects a 4-bit immediate integer which can only be a constant. However, using the -O3 option, the compiler apparently automatically unrolls loops (if the loop counter values can be determined at compile time) and produces multiple instances of the corresponding blend instruction with different immediate values.
This is a simple test code (blendsimple.c) which runs through the 16 possible values of the immediate operand of blend:
#include <stdio.h>
#include <x86intrin.h>
#define PRINT(V) \
printf("%s: ", #V); \
for (i = 3; i >= 0; i--) printf("%3g ", V[i]); \
printf("\n");
int
main()
{
__m128 a = _mm_set_ps(1, 2, 3, 4);
__m128 b = _mm_set_ps(5, 6, 7, 8);
int i;
PRINT(a);
PRINT(b);
unsigned mask;
__m128 r;
for (mask = 0; mask < 16; mask++) {
r = _mm_blend_ps(a, b, mask);
PRINT(r);
}
return 0;
}
It is possible to compile this code with
gcc -Wall -march=native -O3 -o blendsimple blendsimple.c
and the code works. Obviously the compiler unrolls the loop and inserts constants for the immediate operand.
However, if you compile the code with
gcc -Wall -march=native -O2 -o blendsimple blendsimple.c
you get the following error for the blend intrinsic:
error: the last argument must be a 4-bit immediate
Now I tried to find out which specific compiler flag is active in -O3 but not in -O2 which allows the compiler to unroll the loop, but failed. Following the gcc online docs at
https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/Overall-Options.html
I executed the following commands:
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
which lists all options enabled by -O3 but not by -O2. When I add all of the 7 listed flags in addition to -O2
gcc -Wall -march=native -O2 -fgcse-after-reload -finline-functions -fipa-cp-clone -fpredictive-commoning -ftree-loop-distribute-patterns -ftree-vectorize -funswitch-loops -o blendsimple blendsimple.c
I would expect that the behavior is exactly the same as with -O3. However, the compiler complains that "the last argument must be a 4-bit immediate".
Does anyone have an idea what the problem is? I think it would be good to know which flag is required to enable this type of loop unrolling so that it can be activated selectively using #pragma GCC optimize or by a function attribute.
(I was also surprised that -O3 obviously doesn't even enable the unroll-loops option).
I would be grateful for any help. This is for a lecture on SSE programming I give.
Edit: Thanks a lot for your comments. jtaylor seems to be right. I got my hands on two newer versions of gcc (4.7.3, 4.8.2), and 4.8.2 complains about the immediate operand regardless of the optimization level. Moreover, I later noticed that gcc 4.6.3 compiles the code with -O2 -funroll-loops, but this also fails in 4.8.2. So apparently one cannot trust this feature and should always unroll "manually" using cpp or templates, as Jason R pointed out.
I am not sure if this applies to your situation, since I am not familiar with SSE intrinsics. But generally, you can tell the compiler to optimize a specific section of code with:
#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
/* the functions to compile with loop unrolling go here (the pragma applies at file scope) */
#pragma GCC pop_options
Source: Tell gcc to specifically unroll a loop
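The manual unrolling mentioned in the question's edit can be sketched with the preprocessor: every call site passes a literal, so the immediate is a constant at any optimization level. blend_runtime and BLEND_CASE are names made up for this example, and the scalar branch is only there so the code also builds without -msse4.1.

```c
#if defined(__SSE4_1__)
#include <smmintrin.h>  /* SSE4.1: _mm_blend_ps */
#endif

/* Hypothetical helper: out[i] = (mask bit i set) ? b[i] : a[i], for i = 0..3. */
void blend_runtime(const float a[4], const float b[4], unsigned mask, float out[4])
{
#if defined(__SSE4_1__)
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 r = va;
    /* One case per immediate, so each _mm_blend_ps sees a literal constant. */
#define BLEND_CASE(N) case N: r = _mm_blend_ps(va, vb, N); break;
    switch (mask & 15u) {
        BLEND_CASE(0)  BLEND_CASE(1)  BLEND_CASE(2)  BLEND_CASE(3)
        BLEND_CASE(4)  BLEND_CASE(5)  BLEND_CASE(6)  BLEND_CASE(7)
        BLEND_CASE(8)  BLEND_CASE(9)  BLEND_CASE(10) BLEND_CASE(11)
        BLEND_CASE(12) BLEND_CASE(13) BLEND_CASE(14) BLEND_CASE(15)
    }
#undef BLEND_CASE
    _mm_storeu_ps(out, r);
#else
    /* Scalar fallback with the same lane semantics. */
    for (unsigned lane = 0; lane < 4; lane++)
        out[lane] = ((mask >> lane) & 1u) ? b[lane] : a[lane];
#endif
}
```

The loop in blendsimple.c could then call blend_runtime with the runtime mask instead of passing the loop counter to _mm_blend_ps directly.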
