How to load uint8_t *src to uint16x8_t - arm

How to load uint8_t *src to uint16x8_t?
For example, we can only do the following:
uint8_t *src;
--->
uint8x8_t mysrc = vld1_u8(src);
It seems that I can't use vreinterpret_*() or (uint16x8_t)mysrc to turn mysrc into a uint16x8_t. Is that right?

Load the 8 first values as 8-bit values:
uint8x8_t mysrc8x8 = vld1_u8(src);
Then use the "move long" instruction (vmovl_u8) to widen these values to 16 bits by zero-extension, so the upper 8 bits of each lane become zero:
uint16x8_t mysrc16x8 = vmovl_u8(mysrc8x8);
Assuming that after some operations on these values you obtain your output myoutput16x8 in a uint16x8_t format and want to convert it back to uint8x8_t, you can use the vmovn_u16 instruction, bearing in mind that it will truncate any values bigger than 255:
uint8x8_t myoutput8x8 = vmovn_u16(myoutput16x8);
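Putting the pieces together, a minimal round-trip sketch might look like this (widen_then_narrow is just an illustrative name, and the vaddq_u16 is a placeholder for whatever 16-bit arithmetic you actually want to do):
#include <arm_neon.h>

void widen_then_narrow(uint8_t* dst, const uint8_t* src)
{
    uint8x8_t in8 = vld1_u8(src);            // load 8 bytes
    uint16x8_t wide = vmovl_u8(in8);         // zero-extend to 16 bits
    wide = vaddq_u16(wide, vdupq_n_u16(1));  // placeholder 16-bit work
    uint8x8_t out8 = vmovn_u16(wide);        // narrow back, truncating
    vst1_u8(dst, out8);                      // store 8 bytes
}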
Hope this helps!

Related

How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

What I want to do is:
Multiply the input floating point number by a fixed factor.
Convert them to 8-bit signed char.
Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].
I can only use the AVX2 instruction set, so intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16 but it mixes two inputs together. :(
I also wrote some code that converts 32-bit float to 16-bit int, and it works exactly as I want.
void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
    // input is a matrix actually; num_rows and width are the number of rows and columns of the matrix
    assert(width % 16 == 0);
    int num_input_chunks = width / 16;
    __m256 avx2_quant_mult = _mm256_set_ps(quant_mult, quant_mult, quant_mult, quant_mult,
                                           quant_mult, quant_mult, quant_mult, quant_mult);
    for (int i = 0; i < num_rows; ++i) {
        const float* input_row = input + i * width;
        __m256i* output_row = output + i * num_input_chunks;
        for (int j = 0; j < num_input_chunks; ++j) {
            const float* x = input_row + j * 16;
            // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.
            __m256 f_0 = _mm256_loadu_ps(x);
            __m256 f_1 = _mm256_loadu_ps(x + 8);
            __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
            __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);
            __m256i i_0 = _mm256_cvtps_epi32(m_0);
            __m256i i_1 = _mm256_cvtps_epi32(m_1);
            *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
        }
    }
}
Any help is welcome, thank you so much!
For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)
Or are you complaining about how it operates in-lane? Yes that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing for this, too.
Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.
#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float* p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 24));
    __m256i ab = _mm256_packs_epi32(a, b);     // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c, d);
    __m256i abcd = _mm256_packs_epi16(ab, cd); // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done
    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}
(Compiles nicely on the Godbolt compiler explorer).
Call this in a loop and _mm256_store_si256 the resulting vector.
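For example, a driver loop might look like this (a minimal sketch, assuming n is a multiple of 32 and out is 32-byte aligned; pack_all is a made-up name):
// n floats in, n int8_t out; n % 32 == 0 assumed
void pack_all(const float* in, __m256i* out, int n)
{
    for (int i = 0; i < n / 32; ++i) {
        _mm256_store_si256(out + i, pack_float_int8(in + 32 * i));
    }
}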
(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)
With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.
If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.
Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)
Related:
SSE - AVX conversion from double to char
How can I convert a vector of float to short int using avx instructions?
Please check the IEEE 754 standard for how float values are stored. Once you understand how float and double are laid out in memory, you will know how to convert a float or double to char; it is quite simple.

Shifting bit values in C

Say I have the following code:
uint32_t fillThisNum(int16_t a, int16_t b, int16_t c){
    uint32_t x = 0;
    uint16_t temp_a = 0, temp_b = 0, temp_c = 0;
    temp_a = a << 24;
    temp_b = b << 4;
    temp_c = c << 4;
    x = temp_a|temp_b|temp_c;
    return x;
}
Essentially what I'm trying to do is fill the 32-bit number with bit information that I can extract at a later time to perform different operations.
Parameter a would hold the first 24 bits of "data", b would hold the next 4 bits of "data" and c would hold the final 4 bits of "data".
I have a couple questions:
Do the parameters have to be the same bit length as the function type, and must they be unsigned?
Can I assign an unsigned int to a signed int? (i.e. uint32_t a = int32_t b;)
Can I fill a 32-bit number with the 16-bit parameters so long as they don't exceed the length of the 32-bit return value?
Any advice/tips/hints would be much appreciated, thank you.
A correct way to write this code is:
uint32_t fillThisNum(uint32_t a, uint32_t b, uint32_t c)
{
    // mask out the bits we are not interested in
    a &= 0xFFFFFF; // save lowest 24 bits
    b &= 0xF;      // save lowest 4 bits
    c &= 0xF;      // save lowest 4 bits
    // arrange a,b,c within a 32-bit unit so that they do not overlap
    return (a << 8) + (b << 4) + c;
}
By using an unsigned type for the parameters, you avoid any issues with signed arithmetic overflow, sign extension, etc.
It's OK to pass signed values as arguments when calling the function, those values will be converted to unsigned.
By using uint32_t as the parameter type then you avoid having to declare any temporary variables or worry about type width when doing your casting. It makes it easier for you to write clear code, this way.
You don't have to do it this way but this is a simple way to make sure you don't make any mistakes.
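Extracting the fields later is just the reverse shifts and masks; a sketch, given the packed value x returned above:
uint32_t a = x >> 8;         // top 24 bits
uint32_t b = (x >> 4) & 0xF; // next 4 bits
uint32_t c = x & 0xF;        // lowest 4 bits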
Do the parameters have to be the same bit length as the function type, and must they be unsigned?
No, the arguments and the return value can be different types.
Can I assign an unsigned int to a signed int? (i.e. uint32_t a = int32_t b;)
Yes, the value will be converted from a signed to an unsigned value. The bits in "b" will stay the same, so while "b" is in 2's complement, "a" will be a positive 32-bit number.
So, for example, let int8_t c = -127. If you perform an assignment uint8_t d = c, then "d" will be 129.
Can I fill a 32-bit number with the 16-bit parameters so long they don't exceed the length of the 32-bit return value.
If by that, you mean the way that you did in your code:
x = temp_a|temp_b|temp_c;
Yes, that is fine, with the caveat that #chux mentioned: you can't shift an n-bit value more than n bits. If you wanted to set bits more significant than bit 15 in x, a way to do this would be to set up one of the temp masks with a 32-bit value instead of a 16-bit one.
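For example, widening to 32 bits before shifting keeps the high bits (a sketch of the fix, not the asker's exact code):
uint32_t temp = (uint32_t)(uint16_t)a; // widen first, so the shift happens in 32 bits
x |= temp << 16;                       // bits above 15 can now be set safely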

Convert a uint16_t to char[2] to be sent over socket (unix)

I know that there are things out there roughly on this.. But my brain's hurting and I can't find anything to make this work...
I am trying to send a 16-bit unsigned integer over a unix socket. To do so I need to convert a uint16_t into two chars, then read them in on the other end of the connection and convert them back into either an unsigned int or a uint16_t; at that point it doesn't matter if it uses 2 bytes or 4 bytes (I'm running 64-bit, that's why I can't use unsigned int :)
I'm doing this in C btw
Thanks
Why not just break it up into bytes with mask and shift?
uint16_t value = 12345;
char lo = value & 0xFF;
char hi = value >> 8;
(edit)
On the other end, you assemble with the reverse:
uint16_t value = (uint16_t)(unsigned char)lo | ((uint16_t)(unsigned char)hi << 8);
The casts through unsigned char matter if plain char is signed: without them, a byte with its high bit set would sign-extend during integer promotion and corrupt the upper bits.
char* pUint16 = (char*)&u16;
i.e. cast the address of the uint16_t.
char c16[2];
uint16_t ui16 = 0xdead;
memcpy( c16, &ui16, 2 );
c16 now contains the 2 bytes of ui16. At the far end you can simply reverse the process.
char* pC16 = /*blah*/
uint16_t ui16;
memcpy( &ui16, pC16, 2 );
Interestingly, although there is a call to memcpy, nearly every compiler will optimise it out because it's of a fixed size.
As Steven sudt points out, you may get problems with big-endianness. To get round this you can use the htons (host-to-network short) function.
uint16_t ui16correct = htons( 0xdead );
and at the far end use ntohs (network-to-host short)
uint16_t ui16correct = ntohs( ui16 );
On a little-endian machine this will convert the short to big-endian and then at the far end convert back from big-endian. On a big-endian machine the 2 functions do nothing.
Of course if you know that the architecture of both machines on the network use the same endian-ness then you can avoid this step.
Look up ntohl and htonl for handling 32-bit integers. Some platforms also provide ntohll and htonll for 64 bits, though those are not standard everywhere.
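An end-to-end sketch of the htons/ntohs approach might look like this (error handling omitted; sockfd is assumed to be a connected socket):
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

void send_u16(int sockfd, uint16_t value)
{
    uint16_t net = htons(value); // host to network byte order
    char buf[2];
    memcpy(buf, &net, 2);
    write(sockfd, buf, 2);       // error handling omitted
}

uint16_t recv_u16(int sockfd)
{
    char buf[2];
    uint16_t net;
    read(sockfd, buf, 2);        // assumes both bytes arrive
    memcpy(&net, buf, 2);
    return ntohs(net);           // back to host byte order
}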
Sounds like you need to use the bit mask and shift operators.
To split up a 16-bit number into two 8-bit numbers:
you mask the lower 8 bits using the bitwise AND operator (& in C) so that the upper 8 bits all become 0, and then assign that result to one char.
you shift the upper 8 bits to the right using the right shift operator (>> in C) so that the lower 8 bits are all pushed out of the integer, leaving only the top 8 bits, and assign that to another char.
Then when you send these two chars over the connection, you do the reverse: you shift what used to be the top 8 bits to the left by 8 bits, and then use bitwise OR to combine that with the other 8 bits.
Basically you are sending 2 bytes over the socket; that's all the socket needs to know, regardless of endianness, signedness and so on... just decompose your uint16 into 2 bytes and send them over the socket.
char byte0 = u16 & 0xFF;
char byte1 = u16 >> 8;
At the other end, do the conversion in the opposite way.
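For instance, a sketch that goes through unsigned char so a set high bit doesn't sign-extend if plain char is signed:
uint16_t u16 = (uint16_t)(unsigned char)byte0
             | ((uint16_t)(unsigned char)byte1 << 8);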

Convert 8 bit sse register to 16 bit shorts

I have a __m128i register of 8-bit values with the content:
{-4,10,10,10,10,10,10,-4,-4,10,10,10,10,10,10,-4}
Now I want to convert it to eight 16-bit values in a __m128i register. It should look like:
{-4,10,10,10,10,10,10,-4}
How is this possible with as few instructions as possible?
I want to use SSSE3 at most.
Assuming you just want the first 8 values out of the 16 and are going to ignore the other 8 (the example data you give is somewhat ambiguous) then you can do it with SSE2 like this:
v = _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 8);
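This works because _mm_unpacklo_epi8(v, v) duplicates each of the low 8 bytes into both halves of a 16-bit lane, and the arithmetic right shift by 8 then replicates the sign bit, i.e. it sign-extends, which is what you want for values like -4.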
You can do it this way with one SSE2 instruction (ignoring initialization), but note that unpacking against zero zero-extends, so this variant is only correct if your byte values are unsigned:
__m128i const zero = _mm_setzero_si128(); // (if you're in a loop, pull this out)
__m128i v;                                // assume v holds your 16 input bytes
v = _mm_unpacklo_epi8(v, zero);           // zero-extend the low 8 bytes to 16 bits

Loading 8-bit values using NEON/ARM

I'm trying to load an array of char values into NEON registers, and then treat them as 16-bit or 32-bit integer values. So something like this...
void SubVector(short* c, const unsigned char* a, const unsigned char* b, int n)
{
    for(int i = 0; i < n; i++)
    {
        c[i] = (short)a[i] - (short)b[i];
    }
}
I'm not sure how to load the data. Should I load the 8-bit data into lanes, and then reinterpret the registers as shorts? Or load and convert? What would be the fastest way?
Does anyone have a example on how they would do this with NEON intrinsics?
Thanks!
NEON has addition and subtraction instructions that can widen values from 8->16, 16->32 or 32->64 bits. You can do 8 at a time like this:
uint8x8_t u88_a, u88_b;
uint16x8_t u168_diff;
u88_a = vld1_u8(a); // load 8 unsigned chars from a[]
u88_b = vld1_u8(b); // load 8 unsigned chars from b[]
u168_diff = vsubl_u8(u88_a, u88_b); // calculate the difference and widen to 16-bits
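Plugging that into your SubVector, a sketch might look like this (assuming n is a multiple of 8; the unsigned 16-bit difference is reinterpreted as signed, which gives the correct result because vsubl_u8 widens both operands before subtracting):
#include <arm_neon.h>

void SubVector(short* c, const unsigned char* a, const unsigned char* b, int n)
{
    for (int i = 0; i < n; i += 8)                     // n % 8 == 0 assumed
    {
        uint8x8_t va = vld1_u8(a + i);                 // 8 unsigned chars from a[]
        uint8x8_t vb = vld1_u8(b + i);                 // 8 unsigned chars from b[]
        uint16x8_t diff = vsubl_u8(va, vb);            // widening subtract
        vst1q_s16(c + i, vreinterpretq_s16_u16(diff)); // store 8 shorts
    }
}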
