Loading 8-bit values using NEON/ARM

I'm trying to load an array of char values into NEON registers, and then treat them as 16-bit or 32-bit integer values. So something like this...
void SubVector(short* c, const unsigned char* a, const unsigned char* b, int n)
{
    for (int i = 0; i < n; i++)
    {
        c[i] = (short)a[i] - (short)b[i];
    }
}
I'm not sure how to load the data. Should I load the 8-bit data into lanes, and then reinterpret the registers as shorts? Or load and convert? What would be the fastest way?
Does anyone have an example of how they would do this with NEON intrinsics?
Thanks!

NEON has addition and subtraction instructions that can widen values from 8->16, 16->32 or 32->64 bits. You can do 8 at a time like this:
uint8x8_t u88_a, u88_b;
uint16x8_t u168_diff;
u88_a = vld1_u8(a); // load 8 unsigned chars from a[]
u88_b = vld1_u8(b); // load 8 unsigned chars from b[]
u168_diff = vsubl_u8(u88_a, u88_b); // calculate the difference and widen to 16-bits
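For reference, a minimal sketch of the full SubVector loop built around vsubl_u8 (assuming n is a multiple of 8; a scalar tail loop would handle any remainder):

#include <arm_neon.h>

void SubVector(short* c, const unsigned char* a, const unsigned char* b, int n)
{
    for (int i = 0; i < n; i += 8)
    {
        uint8x8_t va = vld1_u8(a + i);      // 8 unsigned chars from a[]
        uint8x8_t vb = vld1_u8(b + i);      // 8 unsigned chars from b[]
        uint16x8_t diff = vsubl_u8(va, vb); // widening subtract, 8 -> 16 bits
        // negative differences already have the right two's-complement
        // bit pattern, so just reinterpret as signed and store
        vst1q_s16(c + i, vreinterpretq_s16_u16(diff));
    }
}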


How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

What I want to do is:
Multiply the input floating point number by a fixed factor.
Convert them to 8-bit signed char.
Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].
I can only target the AVX2 instruction set, so intrinsics like _mm256_cvtepi32_epi8 (AVX-512) can't be used. I would like to use _mm256_packs_epi16, but it mixes two inputs together. :(
I also wrote some code that converts 32-bit float to 16-bit int, and it does exactly what I want:
#include <immintrin.h>
#include <assert.h>

void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
    // input is actually a matrix; num_rows and width are its row and column counts
    assert(width % 16 == 0);
    int num_input_chunks = width / 16;
    __m256 avx2_quant_mult = _mm256_set_ps(quant_mult, quant_mult, quant_mult, quant_mult,
                                           quant_mult, quant_mult, quant_mult, quant_mult);
    for (int i = 0; i < num_rows; ++i) {
        const float* input_row = input + i * width;
        __m256i* output_row = output + i * num_input_chunks;
        for (int j = 0; j < num_input_chunks; ++j) {
            const float* x = input_row + j * 16;
            // Process 16 floats at once, since each __m256i can hold 16 16-bit integers.
            __m256 f_0 = _mm256_loadu_ps(x);
            __m256 f_1 = _mm256_loadu_ps(x + 8);
            __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
            __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);
            __m256i i_0 = _mm256_cvtps_epi32(m_0);
            __m256i i_1 = _mm256_cvtps_epi32(m_1);
            *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
        }
    }
}
Any help is welcome, thank you so much!
For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)
Or are you complaining about how it operates in-lane? Yes that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing for this, too.
Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.
#include <immintrin.h>

// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float* p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 24));
    __m256i ab = _mm256_packs_epi32(a, b);     // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c, d);
    __m256i abcd = _mm256_packs_epi16(ab, cd); // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done
    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));
    return lanefix;
}
(Compiles nicely on the Godbolt compiler explorer).
Call this in a loop and _mm256_store_si256 the resulting vector.
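A minimal driver sketch (the function name and the multiple-of-32 / aligned-destination assumptions are mine):

#include <stdint.h>
#include <stddef.h>

// Assumes n is a multiple of 32 floats and dst is 32-byte aligned
// (use _mm256_storeu_si256 for an unaligned destination).
void quantize_to_int8(int8_t* dst, const float* src, size_t n)
{
    for (size_t i = 0; i < n; i += 32) {
        __m256i v = pack_float_int8(src + i);
        _mm256_store_si256((__m256i*)(dst + i), v);
    }
}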
(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)
With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.
If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.
Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)
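A sketch of that single-vector version as I read the description (the helper name is hypothetical):

// extract the high 128-bit lane, then two 128-bit packs;
// the 8 int8 results land in the low 64 bits, in sequential order
__m128i pack_one_vector_int8(__m256i v)      // input: 8x int32
{
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    __m128i w  = _mm_packs_epi32(lo, hi);    // 8x int16
    return _mm_packs_epi16(w, w);            // 8x int8, duplicated into both halves
}
// store the low 8 bytes with _mm_storel_epi64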
Related:
SSE - AVX conversion from double to char
How can I convert a vector of float to short int using avx instructions?
Please check the IEEE 754 standard for the format used to store float values. Once you understand how float and double are stored in memory, converting a float or double to char becomes quite simple.
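For instance, a minimal sketch (my own illustration) of inspecting the IEEE-754 bit pattern a float is stored with:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 2125.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));   // copy the stored bytes into an integer
    printf("%f -> 0x%08X\n", f, bits); // prints 2125.000000 -> 0x4504D000
    return 0;
}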

Hex array to Float

I have an array of
unsigned char array_a[4] = {0x00, 0x00, 0x08, 0x4D};
unsigned char val[4];
float test;
what I want to do is combine all the elements and store the result in val to make it 0x0000084D, and convert it to float, which is 2125.
I tried memcpy:
memcpy(val, array_a, 4);
val[4] = '\0';
but it still doesn't work.
First, 0x0000084D is the big-endian representation of the integer value 2125, not an IEEE float.
Second, there is no need to copy to another char array (and your code accesses the 5th element, out of bounds, in an attempt to "nul-terminate" the array; that part makes no sense).
To convert this array to an integer on your host, copy it into a standardized 32-bit integer first, then convert it according to the endianness of your machine (otherwise you'd get a bad value on a little-endian machine):
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h> /* ntohl() */

unsigned char array_a[4] = {0x00, 0x00, 0x08, 0x4D};
uint32_t the_int;
memcpy(&the_int, array_a, sizeof(uint32_t));
the_int = ntohl(the_int);
printf("%u\n", the_int);
Or, without any external conversion libraries, using bit shifting, which makes it endian-independent:
uint32_t the_int = 0;
int i;
for (i = 0; i < sizeof(uint32_t); i++)
{
    the_int <<= 8;
    the_int += array_a[i];
}
You get 2125 all right; now you can assign it to a float if you like:
float test = the_int;

How to load 4 unsigned chars and convert them to signed shorts with NEON?

I want to load 4 unsigned chars (8 bit) from the memory and widen them to signed shorts (16 bit). How can I do that with NEON intrinsics?
From the list of NEON intrinsics, I can see only load options for 8 unsigned chars at a time. But I want to do it for just 4 unsigned chars. Is this even possible?
I think it will look similar to this:
#include <arm_neon.h>

inline int16x4_t LoadAndConvert4(const uint8_t* p)
{
    return vreinterpret_s16_u16(vget_low_u16(vmovl_u8(
        vreinterpret_u8_u32(vdup_n_u32(*(uint32_t*)p)))));
}
Or step by step:
inline int16x4_t LoadAndConvert4(const uint8_t* p)
{
    uint32_t u32 = *(uint32_t*)p;
    uint32x2_t u32x2 = vdup_n_u32(u32);
    uint8x8_t u8x8 = vreinterpret_u8_u32(u32x2);
    uint16x8_t u16x8 = vmovl_u8(u8x8);
    uint16x4_t u16x4 = vget_low_u16(u16x8);
    return vreinterpret_s16_u16(u16x4);
}
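Note that *(uint32_t*)p assumes p is suitably aligned. A variant (the name is mine) that loads the 4 bytes through memcpy instead, which compilers typically fold into a single load:

#include <arm_neon.h>
#include <string.h>

inline int16x4_t LoadAndConvert4Safe(const uint8_t* p)
{
    uint32_t u32;
    memcpy(&u32, p, sizeof(u32)); // safe unaligned 4-byte load
    uint8x8_t u8x8 = vreinterpret_u8_u32(vdup_n_u32(u32));
    return vreinterpret_s16_u16(vget_low_u16(vmovl_u8(u8x8)));
}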

How can I split integers into bytes without using arithmetic in C?

I am implementing the four basic arithmetic functions (add, subtract, multiply, divide) in C.
The basic structure of these functions, as I imagined it, is:
the program gets two operands from the user using scanf,
and then splits these values into bytes and computes!
I've completed addition and subtraction,
but I forgot that I shouldn't use arithmetic operators,
so when splitting an integer into single bytes,
I wrote code like
while (quotient != 0) {
    bin[i] = quotient % 2;
    quotient = quotient / 2;
    i++;
}
but since those are arithmetic operators that I shouldn't use,
I have to rewrite the splitting part,
and I really have no idea how to split an integer into single bytes without using % or /.
To access the bytes of a variable, type punning can be used.
According to standard C (C99 and C11), only unsigned char guarantees that this operation is performed in a safe way.
This could be done in the following way:
typedef unsigned int myint_t;

myint_t x = 1234;
union {
    myint_t val;
    unsigned char byte[sizeof(myint_t)];
} u;
Now, you can of course access the bytes of x in this way:
u.val = x;
for (int j = 0; j < sizeof(myint_t); j++)
    printf("%d ", u.byte[j]);
However, as WhozCrag has pointed out, there are issues with endianness.
It cannot be assumed that the bytes are in a determined order.
So, before doing any computation with bytes, your program needs to check how the endianness works.
#include <limits.h> /* To use UCHAR_MAX */

unsigned long int ByteFactor = 1u + UCHAR_MAX; /* 256 almost everywhere */

u.val = 0;
for (int j = sizeof(myint_t) - 1; j >= 0; j--)
    u.val = u.val * ByteFactor + j;
Now, when you print the values of u.byte[], you will see the order in which the bytes are arranged for the type myint_t.
The least significant byte will have value 0.
I assume 32-bit integers (if that's not the case, just change the sizes); there are several approaches:
BYTE pointer
#include <stdio.h>

int x;               // your integer or whatever else data type
BYTE *p = (BYTE*)&x;

x = 0x11223344;
printf("%x\n", p[0]);
printf("%x\n", p[1]);
printf("%x\n", p[2]);
printf("%x\n", p[3]);
just get the address of your data as a BYTE pointer
and access the bytes directly via a 1D array
union
#include <stdio.h>

union
{
    int x;     // your integer or whatever else data type
    BYTE p[4];
} a;

a.x = 0x11223344;
printf("%x\n", a.p[0]);
printf("%x\n", a.p[1]);
printf("%x\n", a.p[2]);
printf("%x\n", a.p[3]);
and access the bytes directly via a 1D array
[notes]
if you do not have BYTE defined, then replace it with unsigned char
with the ALU you can use not only %,/ but also >>,&, which is much faster but still uses arithmetic
depending on the platform endianness, the output can be 11,22,33,44 or 44,33,22,11, so you need to keep that in mind (especially for code used on multiple platforms)
you need to handle the sign of the number; for unsigned integers there is no problem,
but for signed ones C uses 2's complement, so it is better to separate the sign before splitting, like:
int s;
if (x < 0) { s = -1; x = -x; } else s = +1;
// now split ...
[edit2] logical/bit operations
x<<n, x>>n - bit shift of x left or right by n bits
x&y - bitwise logical AND (performs AND on each bit separately)
so when you have, for example, a 32-bit unsigned int (called DWORD), you can split it into BYTEs like this:
DWORD x;             // input: 32-bit unsigned int
BYTE a0, a1, a2, a3; // output: BYTEs, a0 is the least significant, a3 the most significant

x = 0x11223344;
a0 = DWORD((x      ) & 255); // should be 0x44
a1 = DWORD((x >>  8) & 255); // should be 0x33
a2 = DWORD((x >> 16) & 255); // should be 0x22
a3 = DWORD((x >> 24) & 255); // should be 0x11
this approach is not affected by endianness
but it uses the ALU
the point is to shift the bits you want into positions 0..7 and mask out the rest
the &255 and the DWORD() cast are not needed on all compilers, but some do weird things without them, especially on signed variables like char or int
x>>n is the same as x/(pow(2,n))=x/(1<<n)
x&((1<<n)-1) is the same as x%(pow(2,n))=x%(1<<n)
so (x>>8)=x/256 and (x&255)=x%256
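For completeness, a sketch of the inverse (my addition): reassembling the DWORD from the four BYTEs, endian-independent for the same reason:

// rebuild the value from its bytes; y == 0x11223344
DWORD y = ((DWORD)a3 << 24) | ((DWORD)a2 << 16)
        | ((DWORD)a1 <<  8) |  (DWORD)a0;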

unsigned char to unsigned char array of 8 original bits

I am trying to take a given unsigned char and store the 8 bit value in an unsigned char array of size 8 (1 bit per array index).
So given the unsigned char A
I'd like to create an unsigned char array containing 0 1 0 0 0 0 0 1 (one number per index).
What would be the most effective way to achieve this? Happy Thanksgiving btw!!
The fastest (not sure if that's what you meant by "effective") way of doing this is probably something like
void char2bits1(unsigned char c, unsigned char *bits) {
    int i;
    for (i = sizeof(unsigned char) * 8; i; c >>= 1) bits[--i] = c & 1;
}
The function takes the char to convert as the first argument and fills the array bits with the corresponding bit pattern. It runs in 2.6 ns on my laptop. It assumes 8-bit bytes, but not how many bytes long a char is, and does not require the input array to be zero-initialized beforehand.
I didn't expect this to be the fastest approach. My first attempt looked like this:
void char2bits2(unsigned char c, unsigned char *bits) {
    for (; c; ++bits, c >>= 1) *bits = c & 1;
}
I thought this would be faster by avoiding array lookups, by looping in the natural order (at the cost of producing the bits in the opposite order of what was requested), and by stopping as soon as c is zero (so the bits array would need to be zero-initialized before calling the function). But to my surprise, this version had a running time of 5.2 ns, double that of the version above.
Investigating the corresponding assembly revealed that the difference was loop unrolling, which was being performed in the former case but not the latter. So this is an illustration of how modern compilers and modern CPUs often have surprising performance characteristics.
Edit: If you actually want the unsigned chars in the result to be the chars '0' and '1', use this modified version:
void char2bits3(unsigned char c, unsigned char *bits) {
    int i;
    for (i = sizeof(unsigned char) * 8; i; c >>= 1) bits[--i] = '0' + (c & 1);
}
You could use bit operators as recommended.
#include <stdio.h>

int main() {
    unsigned char input_data = 8;
    unsigned char array[8] = {0};
    int idx = sizeof(array) - 1;
    while (input_data > 0) {
        array[idx--] = input_data & 1;
        input_data /= 2; // or input_data >>= 1;
    }
    for (unsigned long i = 0; i < sizeof(array); i++) {
        printf("%d, ", array[i]);
    }
}
Take the value, right shift it and mask it to keep only the lower bit. Add the value of the lower bit to the character '0' so that you get either '0' or '1' and write it into the array:
unsigned char val = 65;
unsigned char valArr[8 + 1] = {0};
for (int loop = 0; loop < 8; loop++)
    valArr[7 - loop] = '0' + ((val >> loop) & 1);
printf("val = %s", valArr);
