Errors at runtime with ARM Neon

IDE: Eclipse
Using DS-5
Trying to execute C code using Neon intrinsics
ARMv8-A bare metal
Arm Compiler 6
Target CPU: Generic Armv8-A AArch64
Target FPU: Armv8 (Neon)
Summary: I'm trying to do an addition using Neon intrinsics. At runtime, DS-5 cannot access Neon; the target disconnects at this line: vector_result = vaddl_high_u8(vector_one,vector_two); (a possible cause is sketched after the code below).
Code:
#include <stdio.h>
#include "arm_neon.h"
#include <stdlib.h>
int main(){
    //add and accumulate in one array
    uint8_t one[] = {255,1,2,3,4,5,6,7};
    uint8_t two[] = {255,1,2,3,4,5,6,7};
    uint16_t result[8];
    //vectorization
    uint8x16_t vector_one,vector_two;
    uint16x8_t vector_result;
    //load
    vector_one = vld1q_u8(one);
    vector_two = vld1q_u8(two);
    //addition: result = one + two (widening: 8x16 inputs -> 16x8 result)
    vector_result = vaddl_high_u8(vector_one,vector_two);
    //store
    vst1q_u16(result,vector_result);
    //print
    for(int i = 0; i < 8; i++) {
        printf("%02d + %02d = %02d\n", one[i], two[i], result[i]);
    }
}
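One common cause of this behaviour on bare-metal ARMv8-A (a guess here, since the error log did not survive the post) is that Advanced SIMD/FP access is still trapped: until CPACR_EL1.FPEN is set, the first Neon instruction traps, which the debugger can show as losing access to the target. A minimal sketch of enabling Neon early in startup, assuming the code runs at EL1, would be:

// Sketch only: enable Advanced SIMD/FP access at EL1 before any Neon code runs.
// Assumes bare-metal execution at EL1; code running at EL2/EL3 has its own
// trap controls (CPTR_EL2 / CPTR_EL3) that may also need to be set up.
static void enable_neon_el1(void)
{
    unsigned long cpacr;
    __asm__ volatile("mrs %0, cpacr_el1" : "=r"(cpacr));
    cpacr |= (3UL << 20);                 // CPACR_EL1.FPEN = 0b11: no trapping
    __asm__ volatile("msr cpacr_el1, %0" : : "r"(cpacr));
    __asm__ volatile("isb");
}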
Error Log:

Related

Understanding the vector indexed store vsuxei32 in RISC-V vector instructions on the Spike simulator

I am trying to understand vector indexed store instructions. Here is the sample code that I tried.
The source array has elements 0xabc0, 0xabc1, 0xabc2, 0xabc3. With the indexes 3, 2, 1, 0 I expected it to print 0xabc3, 0xabc2, 0xabc1, 0xabc0, but I get
Store/AMO access fault!
But when the indexes are given as int32_t indexes[NELMS] = {0,0,0,0}; I get abc3 0 0 0 as output. I am unable to understand how the instruction picks the indexes.
#include <stdio.h>
#include <stdint.h>
#define NELMS 4
void scg(int32_t *dest, int32_t *base_addr, int32_t *offsets, int32_t elements) {
int32_t vl = 0;
asm ("vsetvli %0, %1, e32, m1\n" : "=r"(vl) : "r"(elements));
for(int32_t i=0; i<elements; i+=vl){
asm ("vle32.v v2, (%0)\n" : : "r"(base_addr+i));
asm ("vle32.v v3, (%0)\n" : : "r"(offsets));
//asm ("vse32.v v2, (%0)\n" : : "r"(dest+i));
asm ("vsoxei32.v v2, (%0), v3\n" : :"r"(dest+i));
}
}
int main() {
int elements = NELMS;
int32_t src[NELMS] = { 0xabc0,
0xabc1,
0xabc2,
0xabc3};
int32_t indexes[NELMS] = {3,2,1,0};
int32_t dst[NELMS] = {0};
scg(dst, src, indexes, elements);
for (int i = 0; i < elements; i++) {
printf("%x ", dst[i]);
}
return 0;
}
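Not part of the original post, but a likely reading of the behaviour above: in RVV, the index vector of vsoxei32.v / vsuxei32.v holds unsigned byte offsets from the base address, not element numbers. That would explain why {0,0,0,0} stores every element to dst[0] (the last write wins, leaving 0xabc3 there), while {3,2,1,0} produces misaligned, overlapping stores inside dst, which Spike may report as a fault. A sketch of the index array under that reading, if the intent is to reverse four 32-bit elements:

// Byte offsets (element index * sizeof(int32_t)), not element indices:
// the stores would land at dst+12, dst+8, dst+4 and dst+0.
int32_t indexes[NELMS] = {12, 8, 4, 0};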

Intel (x86_64) 64-bit vs 32-bit integer arithmetic performance difference [duplicate]

This question already has answers here:
Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux (3 answers)
Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs? (2 answers)
The advantages of using 32bit registers/instructions in x86-64 (2 answers)
Why does Clang do this optimization trick only from Sandy Bridge onward? (1 answer)
Closed 2 years ago.
I was testing a program and came upon a rather unexpected anomaly.
I wrote a simple program that computed prime numbers, and used the pthreads API to parallelize this workload.
After conducting some tests, I found that if I used uint64_t as the datatype for calculations and loops, the program took significantly more time to run than if I used uint32_t.
Here is the code that I ran:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>
#define UINT uint64_t
#define SIZE (1024 * 1024)
typedef struct _data
{
UINT start;
UINT len;
int t;
UINT c;
}data;
int isprime(UINT x)
{
uint8_t flag = 1;
if(x < 2)
return 0;
for(UINT i = 2;i < x/2; i++)
{
if(!(x % i ))
{
flag = 0;
break;
}
}
return flag;
}
void* calc(void *p)
{
data *a = (data*)p;
//printf("thread no. %d has start: %lu length: %lu\n",a->t,a->start,a->len);
for(UINT i = a->start; i < a->len; i++)
{
if(isprime(i))
a->c++;
}
//printf("thread no. %d found %lu primes\n", a->t,a->c);
pthread_exit(NULL);
}
int main(int argc,char **argv)
{
pthread_t *t;
data *a;
uint32_t THREAD_COUNT;
if(argc < 2)
THREAD_COUNT = 1;
else
sscanf(argv[1],"%u",&THREAD_COUNT);
t = (pthread_t*)malloc(THREAD_COUNT * sizeof(pthread_t));
a = (data*)malloc(THREAD_COUNT * sizeof(data));
printf("executing the application on %u thread(s).\n",THREAD_COUNT);
for(uint8_t i = 0; i < THREAD_COUNT; i++)
{
a[i].t = i;
a[i].start = i * (SIZE / THREAD_COUNT);
a[i].len = a[i].start + (SIZE / THREAD_COUNT);
a[i].c = 0;
}
for(uint8_t i = 0; i < THREAD_COUNT; i++)
pthread_create(&t[i],NULL,calc,(void*)&a[i]);
for(uint8_t i = 0; i < THREAD_COUNT; i++)
pthread_join(t[i],NULL);
free(a);
free(t);
return 0;
}
I switched the UINT macro between uint32_t and uint64_t, compiled and ran the program, and measured its runtime with the time command on Linux.
I found a major difference between the runtimes for uint64_t and uint32_t.
With uint32_t the program took 46 s to run, while with uint64_t it took 2 min 49 s!
I wrote a blog post about it here: https://qcentlabs.com/index.php/2021/02/01/intelx86_64-64-bit-vs-32-bit-arithmetic-big-performance-difference/
You can check out the post if you want more information.
What might be the issue behind this? Is 64-bit arithmetic slower on x86_64 than 32-bit arithmetic?
In general 64-bit arithmetic is as fast as 32-bit, ignoring things like larger operands taking up more memory and bandwidth, and the fact that on x86-64 instructions addressing the full 64-bit registers need longer encodings.
However, you have managed to hit one of the few exceptions to this rule, namely the div instruction for calculating divisions.
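To make the effect concrete (a sketch, not part of the original answer): in this program every value is below SIZE = 1024*1024, so the trial division fits comfortably in 32 bits even when UINT is uint64_t, and doing the modulo in uint32_t lets the compiler emit the faster 32-bit div:

// Sketch: the operands are known to fit in 32 bits for this workload, so the
// modulo can be done in 32-bit to avoid the slower 64-bit div instruction.
int isprime(UINT x)
{
    if (x < 2)
        return 0;
    for (UINT i = 2; i < x / 2; i++)
    {
        if (!((uint32_t)x % (uint32_t)i))
            return 0;
    }
    return 1;
}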

Is the _mm256_store_ps() function atomic when used alongside OpenMP?

I am trying to create a simple program that uses Intel's AVX technology to perform vector multiplication and addition, using OpenMP alongside it. But I am getting a segmentation fault from the call to _mm256_store_ps().
I have tried OpenMP features such as atomic and critical, in case this function is not atomic in nature and multiple cores were attempting to execute it at the same time, but it did not help.
#include<stdio.h>
#include<time.h>
#include<stdlib.h>
#include<immintrin.h>
#include<omp.h>
#define N 64
__m256 multiply_and_add_intel(__m256 a, __m256 b, __m256 c) {
return _mm256_add_ps(_mm256_mul_ps(a, b),c);
}
void multiply_and_add_intel_total_omp(const float* a, const float* b, const float* c, float* d)
{
__m256 a_intel, b_intel, c_intel, d_intel;
#pragma omp parallel for private(a_intel,b_intel,c_intel,d_intel)
for(long i=0; i<N; i=i+8) {
a_intel = _mm256_loadu_ps(&a[i]);
b_intel = _mm256_loadu_ps(&b[i]);
c_intel = _mm256_loadu_ps(&c[i]);
d_intel = multiply_and_add_intel(a_intel, b_intel, c_intel);
_mm256_store_ps(&d[i],d_intel);
}
}
int main()
{
srand(time(NULL));
float * a = (float *) malloc(sizeof(float) * N);
float * b = (float *) malloc(sizeof(float) * N);
float * c = (float *) malloc(sizeof(float) * N);
float * d_intel_avx_omp = (float *)malloc(sizeof(float) * N);
int i;
for(i=0;i<N;i++)
{
a[i] = (float)(rand()%10);
b[i] = (float)(rand()%10);
c[i] = (float)(rand()%10);
}
double time_t = omp_get_wtime();
multiply_and_add_intel_total_omp(a,b,c,d_intel_avx_omp);
time_t = omp_get_wtime() - time_t;
printf("\nTime taken to calculate with AVX2 and OMP : %0.5lf\n",time_t);
free(a);
free(b);
free(c);
free(d_intel_avx_omp);
return 0;
}
I expect to get d = a * b + c, but it shows a segmentation fault. I have tried to perform the same task without OpenMP and it works without errors. Please let me know if there is a compatibility issue or if I am missing something.
gcc version 7.3.0
Intel® Core™ i3-3110M Processor
OS Ubuntu 18.04
OpenMP 4.5; I executed the command $ echo | cpp -fopenmp -dM | grep -i open and it showed #define _OPENMP 201511
Command to compile: gcc first_int.c -mavx -fopenmp
** UPDATE **
As per the discussions and suggestions, the new code is,
float * a = (float *) aligned_alloc(N, sizeof(float) * N);
float * b = (float *) aligned_alloc(N, sizeof(float) * N);
float * c = (float *) aligned_alloc(N, sizeof(float) * N);
float * d_intel_avx_omp = (float *)aligned_alloc(N, sizeof(float) * N);
This is now working without problems.
Just a note: I was trying to compare plain calculation, AVX calculation, and AVX+OpenMP calculation. These are the results I got,
Time taken to calculate without AVX : 0.00037
Time taken to calculate with AVX : 0.00024
Time taken to calculate with AVX and OMP : 0.00019
N = 50000
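(A side note on the updated allocation above, not from the original post: C11 aligned_alloc takes the alignment as its first argument, so aligned_alloc(N, ...) only happens to work while N is a power of two that satisfies the 32-byte requirement. A sketch of the more conventional call:)

// 32-byte alignment matches __m256; the size stays a multiple of the alignment.
float *a = (float *) aligned_alloc(32, sizeof(float) * N);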
The documentation for _mm256_store_ps says:
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
You can use _mm256_storeu_ps instead for unaligned stores.
A better option is to align all your arrays on a 32-byte boundary (for 256-bit avx registers) and use aligned load and stores for maximum performance because unaligned loads/stores crossing a cache line boundary incur performance penalty.
Use std::aligned_alloc (or C11 aligned_alloc, memalign, posix_memalign, whatever you have available) instead of malloc(size), e.g.:
float* allocate_aligned(size_t n) {
constexpr size_t alignment = alignof(__m256);
return static_cast<float*>(aligned_alloc(alignment, sizeof(float) * n));
}
// ...
float* a = allocate_aligned(N);
float* b = allocate_aligned(N);
float* c = allocate_aligned(N);
float* d_intel_avx_omp = allocate_aligned(N);
In C++-17 new can allocate with alignment:
float* allocate_aligned(size_t n) {
constexpr auto alignment = std::align_val_t{alignof(__m256)};
return new(alignment) float[n];
}
Alternatively, use Vc: portable, zero-overhead C++ types for explicitly data-parallel programming that aligns heap-allocated SIMD vectors for you:
#include <cstdio>
#include <memory>
#include <chrono>
#include <Vc/Vc>
Vc::float_v random_float_v() {
alignas(Vc::VectorAlignment) float t[Vc::float_v::Size];
for(unsigned i = 0; i < Vc::float_v::Size; ++i)
t[i] = std::rand() % 10;
return Vc::float_v(t, Vc::Aligned);
}
unsigned reverse_crc32(void const* vbegin, void const* vend) {
unsigned const* begin = reinterpret_cast<unsigned const*>(vbegin);
unsigned const* end = reinterpret_cast<unsigned const*>(vend);
unsigned r = 0;
while(begin != end)
r = __builtin_ia32_crc32si(r, *--end);
return r;
}
int main() {
constexpr size_t N = 65536;
constexpr size_t M = N / Vc::float_v::Size;
std::unique_ptr<Vc::float_v[]> a(new Vc::float_v[M]);
std::unique_ptr<Vc::float_v[]> b(new Vc::float_v[M]);
std::unique_ptr<Vc::float_v[]> c(new Vc::float_v[M]);
std::unique_ptr<Vc::float_v[]> d_intel_avx_omp(new Vc::float_v[M]);
for(unsigned i = 0; i < M; ++i) {
a[i] = random_float_v();
b[i] = random_float_v();
c[i] = random_float_v();
}
auto t0 = std::chrono::high_resolution_clock::now();
for(unsigned i = 0; i < M; ++i)
d_intel_avx_omp[i] = a[i] * b[i] + c[i];
auto t1 = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count();
unsigned crc = reverse_crc32(d_intel_avx_omp.get(), d_intel_avx_omp.get() + M); // Make sure d_intel_avx_omp isn't optimized out.
std::printf("crc: %u, time: %.09f seconds\n", crc, seconds);
}
Parallel version:
#include <tbb/parallel_for.h>
// ...
auto t0 = std::chrono::high_resolution_clock::now();
tbb::parallel_for(size_t{0}, M, [&](unsigned i) {
d_intel_avx_omp[i] = a[i] * b[i] + c[i];
});
auto t1 = std::chrono::high_resolution_clock::now();
You must use aligned memory for these intrinsics. Change your malloc(...) to aligned_alloc(sizeof(float) * 8, ...) (C11).
This is completely unrelated to atomics. You are working on entirely separate pieces of data (even on different cache lines), so there is no need for any protection.
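For reference, a minimal change to the question's C code that avoids the alignment requirement entirely (a sketch; it keeps plain malloc and just swaps the store for its unaligned counterpart):

#include <immintrin.h>

#define N 64

// Sketch: same loop as in the question, but _mm256_storeu_ps has no alignment
// requirement, so it is safe with malloc'd buffers (at a small potential cost
// when a store crosses a cache-line boundary).
void multiply_and_add_unaligned(const float *a, const float *b,
                                const float *c, float *d)
{
    #pragma omp parallel for
    for (long i = 0; i < N; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_loadu_ps(&c[i]);
        _mm256_storeu_ps(&d[i], _mm256_add_ps(_mm256_mul_ps(va, vb), vc));
    }
}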

How to count number of bits set for a 128-bit integer

I want to use 128-bit unsigned integer in C. I have written the following code:
#include<stdio.h>
#include<stdlib.h>
#include<time.h>
#include<math.h>
#include <stdint.h>
#include <limits.h>
#define unt __uint128_t
#define G1 226854911280625642308916404954512140970
int countSetBits(unt n){
int count = 0;
while(n){ n &= (n-1) ; count++; }
return count;
}
int main(){
printf(" %d\n",countSetBits(G1) );
}
Although the output should be 64 (the number of set bits in G1), it comes out as 96. I use the gcc compiler. I know about GNU GMP, but for my purpose I need fast execution, so I want to avoid that library.
Because of an issue explained here, you need to assign the constant using two 64 bit values:
#include <stdio.h>
#define uint128_t __uint128_t
#define G1 ((uint128_t)12297829382473034410ULL << 64 | (uint128_t)12297829382473034410ULL)
int countSetBits(uint128_t n) {
int count = 0;
while(n) {
n &= (n - 1);
count++;
}
return count;
}
int main() {
printf(" %d\n",countSetBits(G1) );
}
Outputs:
64
Live version available in onlinegdb.
There are no 128-bit integer constants in the C language, so you need to use two 64-bit values and combine them:
#include <stdio.h>
#define unt __uint128_t
#define G1 ((((__uint128_t)0xaaaaaaaaaaaaaaaaull) << 64) + ((__uint128_t)0xaaaaaaaaaaaaaaaaull))
int countSetBits(unt n){
int count = 0;
while(n){ n &= (n-1) ; count++; }
return count;
}
int countSetBits1(unt n){
int count = 0;
while(n)
{
count += n & 1;
n >>= 1;
}
return count;
}
int main(){
printf(" %d\n",countSetBits(G1) );
printf(" %d\n",countSetBits1(G1) );
}
Since you're using one gcc extension, I assume more are okay. gcc has a family of intrinsic functions for returning the number of set bits in regular integer types. Depending on your CPU and gcc options, this will either become the appropriate instruction, or fall back to calling a library function.
Something like:
#include <stdint.h>
#include <string.h>

int bitcount_u128(unsigned __int128 n) {
uint64_t parts[2];
memcpy(parts, &n, sizeof n);
return __builtin_popcountll(parts[0]) + __builtin_popcountll(parts[1]);
}
If using an x86 processor with the popcnt instruction (which is most of those made in the last decade), compile with -mpopcnt or the appropriate -march= setting to use the hardware instruction.
Alternatively, if you're okay with limiting support to just x86 processors with popcnt, the _mm_popcnt_u64() intrinsic from <nmmintrin.h> can be used instead of __builtin_popcountll().
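A sketch of that alternative (this assumes SSE4.2 support and compiling with -msse4.2, -mpopcnt, or a suitable -march=):

#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>

// Sketch: same approach as above, but using the SSE4.2 popcnt intrinsic.
int bitcount_u128_popcnt(unsigned __int128 n) {
    uint64_t parts[2];
    memcpy(parts, &n, sizeof n);
    return (int)(_mm_popcnt_u64(parts[0]) + _mm_popcnt_u64(parts[1]));
}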

Get GCC To Use Carry Logic For Arbitrary Precision Arithmetic Without Inline Assembly?

When working with arbitrary precision arithmetic (e.g. 512-bit integers), is there any way to get GCC to use ADC and similar instructions without using inline assembly?
A first glance at GMP's source code shows that they simply have assembly implementations for every supported platform.
Here is the test code I wrote, which adds two 128-bit numbers from the command line and prints the result. (Inspired by mini-gmp's add_n):
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
int main (int argc, char **argv)
{
uint32_t a[4];
uint32_t b[4];
uint32_t c[4];
uint32_t carry = 0;
for (int i = 0; i < 4; ++i)
{
a[i] = strtoul (argv[i+1], NULL, 16);
b[i] = strtoul (argv[i+5], NULL, 16);
}
for (int i = 0; i < 4; ++i)
{
uint32_t aa = a[i];
uint32_t bb = b[i];
uint32_t r = aa + carry;
carry = (r < carry);
r += bb;
carry += (r < bb);
c[i] = r;
}
printf ("%08X%08X%08X%08X + %08X%08X%08X%08X =\n", a[3], a[2], a[1], a[0], b[3], b[2], b[1], b[0]);
printf ("%08X%08X%08X%08X\n", c[3], c[2], c[1], c[0]);
return 0;
}
GCC with -O3 -std=c99 does not produce any adc instructions, as checked with objdump. My gcc version is i686-pc-mingw32-gcc (GCC) 4.5.2.
GCC will use the carry flag if it can see that it needs to:
When adding two uint64_t values on a 32-bit machine, for example, this must result in one 32-bit ADD plus one 32-bit ADC. But apart from those cases, where the compiler is forced to use the carry, it probably cannot be persuaded to do so without assembler. It may therefore be beneficial to use the biggest integer type available, which effectively lets GCC know that the individual 'limbs' of the value belong together.
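As an aside (not in the original answer, and it needs a much newer GCC than the 4.5.2 from the question): the __builtin_add_overflow builtin exposes the carry explicitly, and recent GCC and Clang can often lower such a chain to ADD/ADC. A sketch of the limb-wise addition written that way:

#include <stdint.h>
#include <stdbool.h>

// Sketch: 128-bit addition as four 32-bit limbs; recent compilers may turn the
// __builtin_add_overflow chain into an ADD followed by ADC instructions.
static void add4_limbs(uint32_t c[4], const uint32_t a[4], const uint32_t b[4])
{
    bool carry = false;
    for (int i = 0; i < 4; ++i) {
        uint32_t t;
        bool c1 = __builtin_add_overflow(a[i], b[i], &t);
        bool c2 = __builtin_add_overflow(t, (uint32_t)carry, &c[i]);
        carry = c1 || c2;
    }
    // the final carry-out ends up in 'carry' (dropped here, as in the question)
}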
For the simple addition, another way to calculate the carry could be to look at the relevant bits in the operands, like:
uint32_t aa,bb,rr;
bool msbA, msbB, msbR, carry;
// ...
rr = aa+bb;
msbA = aa >= (1u << 31); // equivalent: (aa & (1u << 31)) != 0;
msbB = bb >= (1u << 31);
msbR = rr >= (1u << 31);
carry = (msbA && msbB) || ( !msbR && ( msbA || msbB) );
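Plugged into the question's inner loop, that carry computation would look roughly like this (a sketch):

// Sketch: same carry-propagating addition as in the question, with the carry
// derived from the most significant bits of the operands and the result.
uint32_t carry = 0;
for (int i = 0; i < 4; ++i)
{
    uint32_t aa = a[i];
    uint32_t bb = b[i];
    uint32_t rr = aa + bb + carry;
    bool msbA = aa >= (1u << 31);
    bool msbB = bb >= (1u << 31);
    bool msbR = rr >= (1u << 31);
    carry = (msbA && msbB) || (!msbR && (msbA || msbB));
    c[i] = rr;
}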
