I'm trying to make the compiler generate the (v)pshufd instruction (or equivalent) via auto-vectorization. It's surprisingly difficult.
For example, presuming a vector of 4 uint32 values, the transformation
A|B|C|D => A|A|C|C is supposed to be achievable with a single instruction (corresponding intrinsic: _mm_shuffle_epi32()).
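For reference, the intrinsic form I'm currently stuck with is a one-liner (the helper name is just for illustration; the immediate _MM_SHUFFLE(2,2,0,0) = 160 selects source elements 0,0,2,2):
#include <immintrin.h>
/* A|B|C|D -> A|A|C|C with a single pshufd */
static inline __m128i dup_even_lanes(__m128i v) {
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 2, 0, 0));
}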
Trying to express the same transformation using only normal operations, I can write for example:
for (i=0; i<4; i+=2)
v32x4[i+1] = v32x4[i];
The compiler seems unable to make a good transformation, generating instead a mix of scalar and vector code of more than a dozen instructions.
Unrolling manually produces an even worse outcome.
Sometimes a little detail gets in the way, preventing the compiler from translating correctly. For example, the number of elements in the array should be a clear power of 2, pointers to tables should be guaranteed not to alias, alignment should be expressed explicitly, etc.
In this case, I haven't found any such reason, and I'm still stuck with manual intrinsics to get reasonable assembly.
Is there a way to generate the (v)pshufd instruction using only normal code and relying on the compiler's auto-vectorizer?
(Update: new answer since 2019-02-07.)
It is possible to make the compiler generate the (v)pshufd
instruction, even without gcc's vector extensions which I used in a
previous answer to this question.
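For comparison, here is a sketch of the vector-extension route (gcc-only; the function name is mine). With a constant mask, gcc's generic __builtin_shuffle should compile to a single pshufd:
#include <stdint.h>
typedef int32_t v4si __attribute__((vector_size(16)));
/* A|B|C|D -> A|A|C|C via gcc vector extensions */
v4si shuff_vec_ext(v4si a) {
    const v4si mask = {0, 0, 2, 2};
    return __builtin_shuffle(a, mask);
}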
The following examples give an impression of the possibilities.
These examples are compiled with gcc 8.2 and clang 7.
Example 1
#include <stdint.h>
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem Yes */
/* gcc -m64 -O3 -march=skylake Yes */
/* clang -m64 -O3 -march=nehalem No */
/* clang -m64 -O3 -march=skylake No */
void shuff1(int32_t* restrict a, int32_t* restrict b, int32_t n){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 16); b = (int32_t*)__builtin_assume_aligned(b, 16);
for (int32_t i = 0; i < n; i=i+4) {
b[i+0] = a[i+0];
b[i+1] = a[i+0];
b[i+2] = a[i+2];
b[i+3] = a[i+2];
}
}
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem Yes */
/* gcc -m64 -O3 -march=skylake Yes */
/* clang -m64 -O3 -march=nehalem Yes */
/* clang -m64 -O3 -march=skylake Yes */
void shuff2(int32_t* restrict a, int32_t* restrict b, int32_t n){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 16); b = (int32_t*)__builtin_assume_aligned(b, 16);
for (int32_t i = 0; i < n; i=i+4) {
b[i+0] = a[i+1];
b[i+1] = a[i+2];
b[i+2] = a[i+3];
b[i+3] = a[i+0];
}
}
Surprisingly, clang only vectorizes permutations in the mathematical sense,
not general shuffles. With gcc -m64 -O3 -march=nehalem,
the main loop of shuff1 becomes:
.L3:
add edx, 1
pshufd xmm0, XMMWORD PTR [rdi+rax], 160
movaps XMMWORD PTR [rsi+rax], xmm0
add rax, 16
cmp edx, ecx
jb .L3
Example 2
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem No */
/* gcc -m64 -O3 -march=skylake No */
/* clang -m64 -O3 -march=nehalem No */
/* clang -m64 -O3 -march=skylake No */
void shuff3(int32_t* restrict a, int32_t* restrict b){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 16); b = (int32_t*)__builtin_assume_aligned(b, 16);
b[0] = a[0];
b[1] = a[0];
b[2] = a[2];
b[3] = a[2];
}
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem Yes */
/* gcc -m64 -O3 -march=skylake Yes */
/* clang -m64 -O3 -march=nehalem Yes */
/* clang -m64 -O3 -march=skylake Yes */
void shuff4(int32_t* restrict a, int32_t* restrict b){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 16); b = (int32_t*)__builtin_assume_aligned(b, 16);
b[0] = a[1];
b[1] = a[2];
b[2] = a[3];
b[3] = a[0];
}
The assembly with gcc -m64 -O3 -march=skylake:
shuff3:
mov eax, DWORD PTR [rdi]
mov DWORD PTR [rsi], eax
mov DWORD PTR [rsi+4], eax
mov eax, DWORD PTR [rdi+8]
mov DWORD PTR [rsi+8], eax
mov DWORD PTR [rsi+12], eax
ret
shuff4:
vpshufd xmm0, XMMWORD PTR [rdi], 57
vmovaps XMMWORD PTR [rsi], xmm0
ret
Again, the result of the (0,3,2,1) permutation differs essentially from the (2,2,0,0) shuffle case.
Example 3
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem Yes */
/* gcc -m64 -O3 -march=skylake Yes */
/* clang -m64 -O3 -march=nehalem No */
/* clang -m64 -O3 -march=skylake No */
void shuff5(int32_t* restrict a, int32_t* restrict b, int32_t n){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 32); b = (int32_t*)__builtin_assume_aligned(b, 32);
for (int32_t i = 0; i < n; i=i+8) {
b[i+0] = a[i+2];
b[i+1] = a[i+7];
b[i+2] = a[i+7];
b[i+3] = a[i+7];
b[i+4] = a[i+0];
b[i+5] = a[i+1];
b[i+6] = a[i+5];
b[i+7] = a[i+4];
}
}
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem Yes */
/* gcc -m64 -O3 -march=skylake Yes */
/* clang -m64 -O3 -march=nehalem No */
/* clang -m64 -O3 -march=skylake No */
void shuff6(int32_t* restrict a, int32_t* restrict b, int32_t n){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 32); b = (int32_t*)__builtin_assume_aligned(b, 32);
for (int32_t i = 0; i < n; i=i+8) {
b[i+0] = a[i+0];
b[i+1] = a[i+0];
b[i+2] = a[i+2];
b[i+3] = a[i+2];
b[i+4] = a[i+4];
b[i+5] = a[i+4];
b[i+6] = a[i+6];
b[i+7] = a[i+6];
}
}
With gcc -m64 -O3 -march=skylake the main loop of shuff5 contains the
lane-crossing vpermd shuffle instruction, which is quite impressive, I think.
Function shuff6 leads to the non-lane-crossing vpshufd ymm0, mem instruction, perfect.
Example 4
The assembly of shuff5 becomes quite messy if we replace b[i+5] = a[i+1];
with b[i+5] = 0;. Nevertheless, the loop is still vectorized. See also this Godbolt link
for all the examples discussed in this answer.
If arrays a and b are 16 (or 32) byte aligned, then we can use
a = (int32_t*)__builtin_assume_aligned(a, 16); b = (int32_t*)__builtin_assume_aligned(b, 16);
(or 32 instead of 16). This sometimes improves the assembly code generation a bit.
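If you control the allocation, that alignment promise is easy to keep, for instance with C11 aligned_alloc (a minimal sketch, function name mine; aligned_alloc requires the total size to be a multiple of the alignment, so n is assumed to be a multiple of 4 here):
#include <stdint.h>
#include <stdlib.h>
/* returns a 16-byte aligned buffer of n int32_t, or NULL */
int32_t* alloc_aligned_i32(size_t n) {
    return (int32_t*)aligned_alloc(16, n * sizeof(int32_t));
}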
Related
I'm trying to do some bare metal programming on a Raspberry Pi 3B. I still have not figured out the correct memory addresses for the UART controls, so please just ignore those. I am experiencing a strange compilation issue, though.
Here is the code I am trying to compile (not link):
void pstr(char* str) {
unsigned int* AUX_MU_IO_REG = (unsigned int*)0x7E215040;
unsigned int* AUX_MU_LSR_REG = (unsigned int*)0x7E215054;
while (*str != 0) {
while (!(*AUX_MU_LSR_REG & 0x00000020)) {
}
*AUX_MU_IO_REG = (unsigned int)((unsigned char)*str);
str++;
}
return;
}
signed int kmain(unsigned int argc, char* argv[], char* envp[]) {
char* text = "Test Output String\n";
unsigned int* AUXENB = (unsigned int*)0x7E215004;
*AUXENB = 0x00000001;
pstr(text);
return 0;
}
My addresses are incorrect and invalid, but that is not the point. For some reason, the string "Test Output String\n" is being resolved to an address in the object file.
It is being compiled with the command:
aarch64-unknown-linux-gnu-gcc -Wall -Wextra -std=c99 -O2 -march=armv8-a -mtune=cortex-a53 -mlittle-endian -ffreestanding -nostdlib -nostartfiles -Wno-unused-parameter -fno-stack-check -fno-stack-protector src/kernel/base.c -c -o src/kernel/base.o
Interestingly, it doesn't happen if I compile with "-O0".
Here is what it looks like with "-O2" using "aarch64-unknown-linux-gnu-objdump -d ./src/kernel/base.o":
0000000000000040 <kmain>:
40: d28a0a80 mov x0, #0x5054 // #20564
44: f2afc420 movk x0, #0x7e21, lsl #16
48: d28a0084 mov x4, #0x5004 // #20484
4c: f2afc424 movk x4, #0x7e21, lsl #16
50: b9400000 ldr w0, [x0]
It crashes at "ldr w0, [x0]" because the address 0x7e215054 is not valid. I just don't know why the compiler would even be putting that there. It should be a symbol referring to the data in .rodata so that it can be placed in the correct location by my linker script.
Fellow ARMists,
I'd like to narrow and saturate 2 s32 to 2 s16 with NEON code, and pack them in a GPR.
I need to conform to a certain API, so please don't discuss efficiency or design here :)
Here's the snippet:
int32x2_t stuff32 = ...;
int16x4_t stuff16 = vqmovn_s32(vcombine_s32(stuff32, stuff32));
return vget_lane_u32(stuff16, 0);
Which generates
mov v0.d[1], v0.d[0]
sqxtn v0.4h, v0.4s
fmov w0, s0
ret
Does somebody know a way to keep the type system happy and leave the second half of the d register uninitialized? I'd like to avoid inline assembly.
Thank you !
I'm not aware of any good solution using general arm_neon.h intrinsics, but at least with Clang it's possible, using Clang-specific builtins, to produce a vector where some elements are set to be undefined, so the codegen doesn't need to fill them with any particular value.
A setup that uses that would look like this:
$ cat test.c
#include <arm_neon.h>
int32_t narrow_saturate(int32x2_t stuff32) {
int16x4_t stuff16 = vqmovn_s32(__builtin_shufflevector(stuff32, stuff32, 0, 1, -1, -1));
return vget_lane_s32(vreinterpret_s32_s16(stuff16), 0);
}
$ clang -target aarch64-linux-gnu test.c -S -o - -O2
[...]
narrow_saturate:
sqxtn v0.4h, v0.4s
fmov w0, s0
ret
https://godbolt.org/z/N_NsSE
See https://clang.llvm.org/docs/LanguageExtensions.html#builtin-shufflevector for documentation on __builtin_shufflevector.
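As a quick sanity check, here is a hypothetical caller (inputs chosen to exercise the saturation; 100000 narrows to 0x7fff and -100000 to 0x8000, so the packed GPR value should print as 0x80007fff):
#include <arm_neon.h>
#include <inttypes.h>
#include <stdio.h>

int32_t narrow_saturate(int32x2_t stuff32); /* as defined above */

int main(void) {
    int32x2_t v = {100000, -100000};
    printf("0x%08" PRIx32 "\n", (uint32_t)narrow_saturate(v));
    return 0;
}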
EDIT: It also seems to be possible to achieve the same with Clang by using an uninitialized variable (although that can trigger warnings with -Wuninitialized):
$ cat test.c
#include <arm_neon.h>
int32_t narrow_saturate(int32x2_t stuff32) {
int32x2_t uninitialized;
int16x4_t stuff16 = vqmovn_s32(vcombine_s32(stuff32, uninitialized));
return vget_lane_s32(vreinterpret_s32_s16(stuff16), 0);
}
Clang produces the same as above for this (https://godbolt.org/z/TzHuon), while GCC still includes a mov v0.8b, v0.8b (https://godbolt.org/z/wZTAU9).
For comparison, explicitly zeroing the upper half leaves GCC with the same extra mov:
$ cat a.c
#include <arm_neon.h>
int32_t narrow_saturate(int32x2_t stuff32) {
int32x2_t zero = {0, 0};
int16x4_t stuff16 = vqmovn_s32(vcombine_s32(stuff32, zero));
return vget_lane_s32(vreinterpret_s32_s16(stuff16), 0);
}
$ gcc -O2 a.c -S -o-
[...]
narrow_saturate:
mov v0.8b, v0.8b
sqxtn v0.4h, v0.4s
umov w0, v0.s[0]
ret
https://godbolt.org/z/ATr4D7
I'm trying to write a basic page manager for my C kernel. The code goes like this:
#define NUM_PAGES 1024
#define PAGE_SIZE 4096
#define NULL 0
#define IMPORTANT_SEGMENT 0xC0900000
struct page {
void * addr;
int in_use;
};
struct page CORE_FILE[NUM_PAGES];
void mem_init() {
for (int i = 0; i < NUM_PAGES; i++) {
CORE_FILE[i].addr = (void*)IMPORTANT_SEGMENT+PAGE_SIZE*i;
CORE_FILE[i].in_use = 0;
}
}
void * allocate() {
for (int i = 0; i < NUM_PAGES; i++) {
if (!CORE_FILE[i].in_use) {
CORE_FILE[i].in_use = 1;
return CORE_FILE[i].addr;
}
}
return NULL;
}
int deallocate(void* p) {
for (int i = 0; i < NUM_PAGES; i++) {
if (CORE_FILE[i].addr == p && CORE_FILE[i].in_use) {
CORE_FILE[i].in_use = 0;
return 0;
}
}
return -1;
}
CORE_FILE is an array of structs, each holding an address and a flag telling whether the page is in use (I'm using contiguous addresses growing from IMPORTANT_SEGMENT = 0xC0900000).
When I call allocate() it returns the correct void*, which I cast, for example, to a char*, but when I write to the address it simply does nothing.
I have checked the address it points to with GDB and it is the correct one.
But when I examine its contents, they haven't been updated (still 0).
void kmain(void) {
mem_init();
int * addr = (int*)allocate();
*addr = 5;
}
I'm giving qemu 4 GB of RAM, executing with:
qemu-system-i386 -m 4G -kernel kernel -gdb tcp::5022
Perhaps I'm writing to non-existent memory, or maybe I'm overwriting the address contents afterwards. I don't know.
Any ideas will be appreciated.
Thank you in advance.
[edit] This is the bootloader I'm using:
bits 32
section .text
;multiboot spec
align 4
dd 0x1BADB002 ;magic
dd 0x00 ;flags
dd - (0x1BADB002 + 0x00) ;checksum. m+f+c should be zero
global start
global keyboard_handler
global read_port
global write_port
global load_idt
extern kmain ;this is defined in the c file
extern keyboard_handler_main
read_port:
mov edx, [esp + 4]
;al is the lower 8 bits of eax
in al, dx ;dx is the lower 16 bits of edx
ret
write_port:
mov edx, [esp + 4]
mov al, [esp + 4 + 4]
out dx, al
ret
load_idt:
mov edx, [esp + 4]
lidt [edx]
sti ;turn on interrupts
ret
keyboard_handler:
call keyboard_handler_main
iretd
start:
cli ;block interrupts
mov esp, stack_space
call kmain
hlt ;halt the CPU
section .bss
resb 8192; 8KB for stack
stack_space:
My link.ld
OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
SECTIONS
{
. = 0x100000;
.text : { *(.text) }
. = 0x200000;
.data : { *(.data) }
. = 0x300000;
.bss : { *(.bss) }
}
Edit2: I compile with this
nasm -f elf32 kernel.asm -o kasm.o
gcc -g -fno-stack-protector -fno-builtin -m32 -c memory.c -o memory.o
gcc -g -fno-stack-protector -fno-builtin -m32 -c kernel.c -o kc.o
ld -m elf_i386 -T link.ld -o kernel kasm.o memory.o kc.o
The problem is with protected vs. real mode: when the computer boots, it does so in 16-bit real mode, which only lets you address 1 MB of memory. Everything above that won't be suitable for reading/writing. If I change IMPORTANT_SEGMENT to 0x300000, it works.
Now I have to create and load my GDT, enable the A20 line, enable protected mode, set up the registers and jump to my code.
I encountered a weird situation where performing pointer arithmetic involving
dynamically linked symbols leads to incorrect results. I'm unsure if I'm
simply missing some linker parameters or if it's a linker bug. Can someone
explain what's wrong in the following example?
Consider the following code (lib.c) of a simple shared library:
#include <inttypes.h>
#include <stdio.h>
uintptr_t getmask()
{
return 0xffffffff;
}
int fn1()
{
return 42;
}
void fn2()
{
uintptr_t mask;
uintptr_t p;
mask = getmask();
p = (uintptr_t)fn1 & mask;
printf("mask: %08x\n", mask);
printf("fn1: %p\n", fn1);
printf("p: %08x\n", p);
}
The operation in question is the bitwise AND between the address of fn1 and
the variable mask. The application (app.c) just calls fn2 like that:
extern int fn2();
int main()
{
fn2();
return 0;
}
It leads to the following output ...
mask: ffffffff
fn1: 0x2aab43c0
p: 000003c0
... which is obviously incorrect, because the same result is expected for fn1
and p. The code runs on an AVR32 architecture and is compiled as follows:
$ avr32-linux-uclibc-gcc -Os -Wextra -Wall -c -o lib.o lib.c
$ avr32-linux-uclibc-gcc -Os -Wextra -Wall -shared -o libfoo.so lib.o
$ avr32-linux-uclibc-gcc -Os -Wextra -Wall -o app app.c -L. -lfoo
The compiler thinks it is optimal to load the variable
mask into 32-bit register 7 and to split the &-operation into two assembly
instructions with immediate operands.
$ avr32-linux-uclibc-objdump -d libfoo.so
000003ce <fn1>:
3ce: 32 ac mov r12,42
3d0: 5e fc retal r12
000003d2 <fn2>:
...
3f0: e4 17 00 00 andh r7,0x0
3f4: e0 17 03 ce andl r7,0x3ce
I assume the immediate operands of the and instructions are not relocated
to the load address of fn1 when the shared library is loaded into the
application's address space:
Is this behaviour intentional?
How can I investigate whether the problem occurs when linking the shared library or when loading the executable?
Background: This is not an academic question. OpenSSL and LibreSSL
use similar code, so changing the C source is not an option. The code runs
well on other architectures, and certainly there is a non-obvious reason for
doing bitwise operations on function pointers.
After correcting all the 'sloppiness' in the code, the result is:
#include <inttypes.h>
#include <stdio.h>
int fn1( void );
void fn2( void );
uintptr_t getmask( void );
int main( void )
{
fn2();
return 0;
}
uintptr_t getmask()
{
return 0xffffffff;
}
int fn1()
{
return 42;
}
void fn2()
{
uintptr_t mask;
uintptr_t p;
mask = getmask();
p = (uintptr_t)fn1 & mask;
printf("mask: %08x\n", (unsigned int)mask);
printf("fn1: %p\n", fn1);
printf("p: %08x\n", (unsigned int)p);
}
and the output (on my linux 64bit computer) is:
mask: ffffffff
fn1: 0x4007c1
p: 004007c1
I am currently looking into AVX intrinsics to parallelize my code.
For now I would like to write a benchmark and see how much speedup I can get.
void randomtable (uint32_t crypto[4][64])
{
int k = 1;
for (int i=0;i<4;i++)
{
k++;
for (int j=0;j<64;j++)
{ crypto[i][j]= (k+j)%64; }
}
}
int main (void)
{
uint32_t crypt0[4][64];
randomtable(crypt0);
__m256i ymm0 = _m256_load_si256(&crypt0[0][0]);
}
My problem and question is: how do I load the first 8 elements of the array into ymm0?
I am compiling with gcc -mavx -march=native -g -O0 -std=c99
Compile error: error: incompatible types when initializing type '__m256i' using type 'int'
This line has a typo and is missing a cast:
__m256i ymm0 = _m256_load_si256(&crypt0[0][0]);
It should be:
__m256i ymm0 = _mm256_load_si256((__m256i *)&crypt0[0][0]);
Note that you'll probably need AVX2 if you want to do anything further with the data (e.g. integer arithmetic), so you should compile with -mavx2.
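Also note that _mm256_load_si256 requires its source to be 32-byte aligned, and a plain stack array like crypt0 isn't guaranteed that; either declare the array with alignas(32) or use the unaligned load. A minimal AVX2 sketch along those lines (function name is mine):
#include <immintrin.h>
#include <stdint.h>
/* load 8 uint32 lanes and add 1 to each; requires -mavx2 */
__m256i load_and_increment(const uint32_t *p) {
    __m256i v = _mm256_loadu_si256((const __m256i *)p); /* no alignment assumed */
    return _mm256_add_epi32(v, _mm256_set1_epi32(1));
}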