How to mix two bitmaps 80:20 with AVX2? - c

I have 2 bitmaps. I want to mix them in 80:20 proportions, so I simply multiply the pixel values by 0.8 and 0.2. The code works fine written in C (as a for loop), but using AVX2 instructions results in a bad output image.
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

#define ARRSIZE 5992826

void main(void){
    FILE *bmp   = fopen("filepath1", "rb"),
         *bmpp  = fopen("filepath2", "rb"),
         *write = fopen("output", "wb");

    unsigned char *a = aligned_alloc(32, ARRSIZE),
                  *b = aligned_alloc(32, ARRSIZE),
                  *c = aligned_alloc(32, ARRSIZE);

    fread(c, 1, 122, bmp);
    rewind(bmp);
    fread(a, 1, ARRSIZE, bmp);
    fread(b, 1, ARRSIZE, bmpp);

    __m256i mm_a, mm_b;
    __m256d mm_two   = _mm256_set1_pd(2),
            mm_eight = _mm256_set1_pd(8);
    __m256d mm_c, mm_d,
            mm_ten = _mm256_set1_pd(10.0);

    int i = 122;
    for(; i < ARRSIZE; i+=32){
        // c[i] = ((a[i] * 0.8) + (b[i] * 0.2));
        mm_a = _mm256_loadu_si256((__m256i *)&(a[i]));
        mm_b = _mm256_loadu_si256((__m256i *)&(b[i]));

        mm_c = _mm256_div_pd((__m256d)mm_a, mm_ten);
        mm_d = _mm256_div_pd((__m256d)mm_b, mm_ten);

        mm_a = (__m256i)_mm256_floor_pd(_mm256_mul_pd(mm_c, mm_eight));
        mm_b = (__m256i)_mm256_floor_pd(_mm256_mul_pd(mm_d, mm_two));

        mm_a = _mm256_add_epi8(mm_a, mm_b);

        _mm256_storeu_si256((__m256i *)&(c[i]), mm_a);
    }
    fwrite(c, 1, ARRSIZE, write);

    fclose(bmp);
    fclose(bmpp);
    fclose(write);

    free(a);
    free(b);
    free(c);
}

A problem with the code you had is that casting between vector types is not a value-preserving conversion; it is a reinterpretation. So (__m256d)mm_a actually means "take these 32 bytes and interpret them as 4 doubles". That can be OK, but if the data is packed RGB888 then reinterpreting it as doubles is not useful.
Proper conversions could be used, but using floating point arithmetic (especially double precision) for this is overkill. Using smaller types lets more of them fit in a vector, which is usually faster since more items can be worked on per instruction.
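Purely to illustrate the difference (a hedged sketch of mine, not a recommendation, since floating point is overkill here): a value-preserving path from 4 pixel bytes to 4 doubles needs explicit widening conversions rather than a cast.
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Value-preserving: widen 4 bytes -> 4 x int32 -> 4 x double.
// (__m256d)mm_a, by contrast, just relabels the same 32 bytes as 4 doubles.
static inline __m256d bytes_to_doubles(const unsigned char *p) {
    int32_t tmp;
    memcpy(&tmp, p, 4);                        // grab 4 pixel bytes
    __m128i bytes = _mm_cvtsi32_si128(tmp);    // put them in the low lane of a vector
    __m128i ints  = _mm_cvtepu8_epi32(bytes);  // zero-extend to 4 x int32 (SSE4.1)
    return _mm256_cvtepi32_pd(ints);           // convert to 4 x double (AVX)
}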
Also, the 122-byte header should not be put into the aligned arrays; its presence there immediately misaligns the actual pixel data. It can be written to the output file separately.
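A minimal sketch of that, reusing the question's file handles and buffers (my assumption: both inputs carry the same 122-byte header; error checking omitted):
unsigned char header[122];
fread(header, 1, sizeof header, bmp);        // read the header once
fseek(bmpp, sizeof header, SEEK_SET);        // skip the second input's header too
fwrite(header, 1, sizeof header, write);     // copy it straight to the output
fread(a, 1, ARRSIZE - sizeof header, bmp);   // pixel data only, starting aligned at a[0]
fread(b, 1, ARRSIZE - sizeof header, bmpp);
// the blend loop then runs from i = 0 and its result is written after the header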
For example, one strategy for this is to widen to 16 bit, use _mm256_mulhi_epu16 to scale by approximately 80% and approximately 20%, add the results with _mm256_add_epi16, then narrow to 8 bit again. The unpacking to 16 bit and the later packing back to 8 bit work a bit strangely with 256-bit vectors; think of them as 2x the 128-bit operation side by side. To prevent premature truncation, the 8-bit source data can be unpacked with a free shift left by 8, putting each data byte in the high byte of its word. That way the multiply-high produces 16-bit intermediate results instead of truncating to 8 bit immediately, so the rounding can happen after the addition, which is more proper (it does cost an extra shift, and optionally an add). For example like this (not tested):
// needs <stdint.h> for uint16_t
const uint16_t scale_a = (uint16_t)(0x10000 * 0.8);
const uint16_t scale_b = (uint16_t)(0x10000 - scale_a);
__m256i roundoffset = _mm256_set1_epi16(0x80);
__m256i zero = _mm256_setzero_si256();
for(int i = 0; i < ARRSIZE; i += 32) {
    // c[i] = ((a[i] * 0.8) + (b[i] * 0.2));
    // as computed: c[i] = ((((a[i] << 8) * scale_a) >> 16) + (((b[i] << 8) * scale_b) >> 16) + 0x80) >> 8;
    __m256i raw_a = _mm256_loadu_si256((__m256i *)&(a[i]));
    __m256i raw_b = _mm256_loadu_si256((__m256i *)&(b[i]));
    // unpack bytes into the high byte of each word (free shift left by 8)
    __m256i data_al = _mm256_unpacklo_epi8(zero, raw_a);
    __m256i data_bl = _mm256_unpacklo_epi8(zero, raw_b);
    __m256i data_ah = _mm256_unpackhi_epi8(zero, raw_a);
    __m256i data_bh = _mm256_unpackhi_epi8(zero, raw_b);
    // multiply-high keeps 16-bit intermediates: ((x << 8) * scale) >> 16 == (x * scale) >> 8
    __m256i scaled_al = _mm256_mulhi_epu16(data_al, _mm256_set1_epi16(scale_a));
    __m256i scaled_bl = _mm256_mulhi_epu16(data_bl, _mm256_set1_epi16(scale_b));
    __m256i scaled_ah = _mm256_mulhi_epu16(data_ah, _mm256_set1_epi16(scale_a));
    __m256i scaled_bh = _mm256_mulhi_epu16(data_bh, _mm256_set1_epi16(scale_b));
    __m256i suml = _mm256_add_epi16(scaled_al, scaled_bl);
    __m256i sumh = _mm256_add_epi16(scaled_ah, scaled_bh);
    // round to nearest and drop the remaining 8 fraction bits
    __m256i roundedl = _mm256_srli_epi16(_mm256_add_epi16(suml, roundoffset), 8);
    __m256i roundedh = _mm256_srli_epi16(_mm256_add_epi16(sumh, roundoffset), 8);
    __m256i packed = _mm256_packus_epi16(roundedl, roundedh);
    _mm256_storeu_si256((__m256i *)&(c[i]), packed);
}
There are quite a lot of shuffle operations in it, which limit the throughput to one iteration every 5 cycles (in the absence of other limiters); that is roughly 1 pixel of output per cycle.
A different strategy could be to use _mm256_maddubs_epi16, with a lower-precision approximation of the blend factors. It treats its second operand as signed bytes and does signed saturation, so this time only a 7-bit approximation of the scales fits. Since it operates on 8-bit data there is less unpacking, but there is still some, since it requires the data from both images to be interleaved. Maybe like this (also not tested):
const uint8_t scale_a = (uint8_t)(0x80 * 0.8);
const uint8_t scale_b = (uint8_t)(0x80 - scale_a);
__m256i scale = _mm256_set1_epi16((scale_b << 8) | scale_a);
__m256i roundoffset = _mm256_set1_epi16(0x40); // half of the 7-bit scale step
for(int i = 0; i < ARRSIZE; i += 32) {
    // c[i] = ((a[i] * 0.8) + (b[i] * 0.2));
    // as computed: c[i] = ((a[i] * scale_a) + (b[i] * scale_b) + 0x40) >> 7;
    __m256i raw_a = _mm256_loadu_si256((__m256i *)&(a[i]));
    __m256i raw_b = _mm256_loadu_si256((__m256i *)&(b[i]));
    // interleave the two images so maddubs can pair a[i] with b[i]
    __m256i data_l = _mm256_unpacklo_epi8(raw_a, raw_b);
    __m256i data_h = _mm256_unpackhi_epi8(raw_a, raw_b);
    // a[i]*scale_a + b[i]*scale_b as 16-bit sums
    __m256i blended_l = _mm256_maddubs_epi16(data_l, scale);
    __m256i blended_h = _mm256_maddubs_epi16(data_h, scale);
    __m256i roundedl = _mm256_srli_epi16(_mm256_add_epi16(blended_l, roundoffset), 7);
    __m256i roundedh = _mm256_srli_epi16(_mm256_add_epi16(blended_h, roundoffset), 7);
    __m256i packed = _mm256_packus_epi16(roundedl, roundedh);
    _mm256_storeu_si256((__m256i *)&(c[i]), packed);
}
With only 3 shuffles, perhaps the throughput could reach 1 iteration per 3 cycles, which would be almost 1.8 pixels per cycle.
Hopefully there are better ways to do it. Neither of these approaches is close to maxing out on multiplications, which seems like it should be the goal. I don't know how to get there though.
Another strategy is using several rounds of averaging to get close to the desired ratio, but close is not that close: the code below lands at 13/16 : 3/16 (0.8125 : 0.1875) rather than 80:20. Maybe something like this (not tested):
for(int i = 0; i < ARRSIZE; i += 32) {
    // c[i] = round_somehow((a[i] * 0.8125) + (b[i] * 0.1875));
    __m256i raw_a = _mm256_loadu_si256((__m256i *)&(a[i]));
    __m256i raw_b = _mm256_loadu_si256((__m256i *)&(b[i]));
    __m256i mixed_8_8  = _mm256_avg_epu8(raw_a, raw_b);
    __m256i mixed_12_4 = _mm256_avg_epu8(raw_a, mixed_8_8);
    __m256i mixed_14_2 = _mm256_avg_epu8(raw_a, mixed_12_4);
    __m256i mixed_13_3 = _mm256_avg_epu8(mixed_12_4, mixed_14_2);
    _mm256_storeu_si256((__m256i *)&(c[i]), mixed_13_3);
}
But _mm256_avg_epu8 rounds up, and maybe it's bad to stack that so many times. There is no "avg round down" instruction, but avg_down(a, b) == ~avg_up(~a, ~b). That does not result in a huge mess of complements because most of them cancel each other. If there is still a rounding-up step, it makes sense to leave it for the last operation; always rounding down saves a XOR though. Maybe something like this (not tested):
__m256i ones = _mm256_set1_epi8(-1);
for(int i = 0; i < ARRSIZE; i += 32) {
    // c[i] = round_somehow((a[i] * 0.8125) + (b[i] * 0.1875));
    __m256i raw_a = _mm256_loadu_si256((__m256i *)&(a[i]));
    __m256i raw_b = _mm256_loadu_si256((__m256i *)&(b[i]));
    __m256i inv_a = _mm256_xor_si256(ones, raw_a);
    __m256i inv_b = _mm256_xor_si256(ones, raw_b);
    // averaging the complements rounds down in terms of the original values
    __m256i mixed_8_8  = _mm256_avg_epu8(inv_a, inv_b);
    __m256i mixed_12_4 = _mm256_avg_epu8(inv_a, mixed_8_8);
    __m256i mixed_14_2 = _mm256_avg_epu8(inv_a, mixed_12_4);
    // complement back for the final (rounding-up) average
    __m256i mixed_13_3 = _mm256_avg_epu8(_mm256_xor_si256(mixed_12_4, ones),
                                         _mm256_xor_si256(mixed_14_2, ones));
    _mm256_storeu_si256((__m256i *)&(c[i]), mixed_13_3);
}

Related

How to do 1024-bit operations using arrays of uint64_t

I am trying to find a way to compute values that are of type uint1024_t (unsigned 1024-bit integer), by defining the 5 basic operations: plus, minus, times, divide, modulus.
The way that I can do that is by creating a structure that will have the following prototype:
typedef struct {
    uint64_t chunk[16];
} uint1024_t;
Now since it is complicated to wrap my head around such operations with uint64_t as block size, I have first written some code for manipulating uint8_t. Here is what I came up with:
#define UINT8_HI(x) (x >> 4)
#define UINT8_LO(x) (((1 << 4) - 1) & x)

void uint8_add(uint8_t a, uint8_t b, uint8_t *res, int i) {
    uint8_t s0, s1, s2;
    uint8_t x = UINT8_LO(a) + UINT8_LO(b);
    s0 = UINT8_LO(x);
    x = UINT8_HI(a) + UINT8_HI(b) + UINT8_HI(x);
    s1 = UINT8_LO(x);
    s2 = UINT8_HI(x);
    uint8_t result = s0 + (s1 << 4);
    uint8_t carry = s2;
    res[1 + i] = result;
    res[0 + i] = carry;
}

void uint8_multiply(uint8_t a, uint8_t b, uint8_t *res, int i) {
    uint8_t s0, s1, s2, s3;
    uint8_t x = UINT8_LO(a) * UINT8_LO(b);
    s0 = UINT8_LO(x);
    x = UINT8_HI(a) * UINT8_LO(b) + UINT8_HI(x);
    s1 = UINT8_LO(x);
    s2 = UINT8_HI(x);
    x = s1 + UINT8_LO(a) * UINT8_HI(b);
    s1 = UINT8_LO(x);
    x = s2 + UINT8_HI(a) * UINT8_HI(b) + UINT8_HI(x);
    s2 = UINT8_LO(x);
    s3 = UINT8_HI(x);
    uint8_t result = s1 << 4 | s0;
    uint8_t carry = s3 << 4 | s2;
    res[1 + i] = result;
    res[0 + i] = carry;
}
And it seems to work just fine, however I am unable to define the same operations for division, subtraction and modulus...
Furthermore, I just can't seem to see how to apply the same principle to my custom uint1024_t structure, even though it is pretty much identical apart from a few more lines of code to manage overflows.
I would really appreciate some help in implementing the 5 basic operations for my structure.
EDIT:
I have answered below with my implementation for resolving this problem.
find a way to compute ... the 5 basic operations: plus, minus, times, divide, modulus.
If uint1024_t used uint32_t, it would be easier.
I would recommend a chunk type of 1) half the width of the widest type (uintmax_t), or 2) unsigned, whichever is smaller, e.g. 32-bit.
(Also consider something other than uintN_t to avoid collisions with future versions of C.)
typedef struct {
    uint32_t chunk[1024/32];
} u1024;
Example of some untested code to give OP an idea of how using uint32_t simplifies the task.
void u1024_mult(u1024 *product, const u1024 *a, const u1024 *b) {
    memset(product, 0, sizeof product[0]);
    unsigned n = sizeof product->chunk / sizeof product->chunk[0];
    for (unsigned ai = 0; ai < n; ai++) {
        uint64_t acc = 0;
        uint32_t m = a->chunk[ai];
        for (unsigned bi = 0; ai + bi < n; bi++) {
            acc += (uint64_t) m * b->chunk[bi] + product->chunk[ai + bi];
            product->chunk[ai + bi] = (uint32_t) acc;
            acc >>= 32;
        }
    }
}
+, - are quite similar to the above.
/, % could be combined into one routine that computes the quotient and remainder together.
It is not that hard to post those functions here as it really is the same as grade school math, but instead of base 10, base 2^32. I am against posting it though as it is a fun exercise to do oneself.
I hope the * sample code above inspires rather than answers.
There are some problems with your implementation for uint8_t arrays:
you did not parenthesize the macro arguments in the expansion. This is very error prone as it may cause unexpected operator precedence problems if the arguments are expressions. You should write:
#define UINT8_HI(x) ((x) >> 4)
#define UINT8_LO(x) (((1 << 4) - 1) & (x))
storing the array elements with the most significant part first is counterintuitive. Multi-precision arithmetic usually represents large values as arrays with the least significant part first.
for a small type such as uint8_t, there is no need to split it into halves as larger types are available. Furthermore, you must propagate the carry from the previous addition. Here is a much simpler implementation for the addition:
void uint8_add(uint8_t a, uint8_t b, uint8_t *res, int i) {
    uint16_t result = a + b + res[i + 0];  // add previous carry
    res[i + 0] = (uint8_t)result;
    res[i + 1] = (uint8_t)(result >> 8);   // assuming res has at least i+1 elements and is initialized to 0
}
for the multiplication, you must add the result of multiplying each part of each number to the appropriately chosen parts of the result number, propagating the carry to the higher parts.
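A hedged sketch of that (mine, not part of the answer), using the little-endian limb order recommended above: each partial product a[i]*b[j] is added into res[i+j] and the carry ripples upward.
#include <stdint.h>

// res must have at least 2*n elements and be zero-initialized by the caller.
void uint8_array_multiply(const uint8_t *a, const uint8_t *b, uint8_t *res, int n) {
    for (int i = 0; i < n; i++) {
        uint16_t carry = 0;
        for (int j = 0; j < n; j++) {
            uint16_t t = (uint16_t)(a[i] * b[j]) + res[i + j] + carry;
            res[i + j] = (uint8_t)t;   // low byte stays in this limb
            carry = t >> 8;            // high byte carries into the next limb
        }
        res[i + n] = (uint8_t)carry;   // final carry of this row
    }
}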
Division is more difficult to implement efficiently. I recommend you study an open source multi-precision package such as QuickJS' libbf.c.
To transpose this to arrays of uint64_t, you can use unsigned 128-bit integer types if available on your platform (64-bit gcc and clang support such types; MSVC does not).
Here is a simple implementation for the addition and multiplication:
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define NB_CHUNK 16

typedef __uint128_t uint128_t;

typedef struct {
    uint64_t chunk[NB_CHUNK];
} uint1024_t;

void uint1024_add(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
    uint128_t result = 0;
    for (size_t i = 0; i < NB_CHUNK; i++) {
        result += (uint128_t)a->chunk[i] + b->chunk[i];
        dest->chunk[i] = (uint64_t)result;
        result >>= CHAR_BIT * sizeof(uint64_t);
    }
}

void uint1024_multiply(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
    for (size_t i = 0; i < NB_CHUNK; i++)
        dest->chunk[i] = 0;
    for (size_t i = 0; i < NB_CHUNK; i++) {
        uint128_t result = 0;
        for (size_t j = 0, k = i; k < NB_CHUNK; j++, k++) {
            result += (uint128_t)a->chunk[i] * b->chunk[j] + dest->chunk[k];
            dest->chunk[k] = (uint64_t)result;
            result >>= CHAR_BIT * sizeof(uint64_t);
        }
    }
}
If 128-bit integers are not available, your 1024-bit type could be implemented as an array of 32-bit integers. Here is a flexible implementation with selectable types for the array elements and the intermediary result:
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#if 1  // if platform has 128 bit integers
typedef uint64_t type1;
typedef __uint128_t type2;
#else
typedef uint32_t type1;
typedef uint64_t type2;
#endif

#define TYPE1_BITS (CHAR_BIT * sizeof(type1))
#define NB_CHUNK (1024 / TYPE1_BITS)

typedef struct uint1024_t {
    type1 chunk[NB_CHUNK];
} uint1024_t;

void uint1024_add(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
    type2 result = 0;
    for (size_t i = 0; i < NB_CHUNK; i++) {
        result += (type2)a->chunk[i] + b->chunk[i];
        dest->chunk[i] = (type1)result;
        result >>= TYPE1_BITS;
    }
}

void uint1024_multiply(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
    for (size_t i = 0; i < NB_CHUNK; i++)
        dest->chunk[i] = 0;
    for (size_t i = 0; i < NB_CHUNK; i++) {
        type2 result = 0;
        for (size_t j = 0, k = i; k < NB_CHUNK; j++, k++) {
            result += (type2)a->chunk[i] * b->chunk[j] + dest->chunk[k];
            dest->chunk[k] = (type1)result;
            result >>= TYPE1_BITS;
        }
    }
}
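A hedged usage sketch (my addition), appended to the code above, that multiplies two small values and dumps the non-zero chunks:
#include <stdio.h>

int main(void) {
    uint1024_t a = {{0}}, b = {{0}}, p;
    a.chunk[0] = (type1)-1;   // 2^64 - 1 (or 2^32 - 1 in the 32-bit configuration)
    b.chunk[0] = 0x100;
    uint1024_multiply(&p, &a, &b);
    for (size_t i = 0; i < NB_CHUNK; i++)
        if (p.chunk[i])
            printf("chunk[%zu] = %llx\n", i, (unsigned long long)p.chunk[i]);
    return 0;
}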

64-bit multiply element by element; clamp m256i_i64 when bigger than long long maxValue

union sseUnion
{
    int64_t position[4];
    btSimdFloat4 mVec256;
};

// vector operator * : multiply element by element
__m256i mul64_haswell_mul(__m256i a, __m256i b) {
    // instruction does not exist. Split into 32-bit multiplies
    __m256i bswap   = _mm256_shuffle_epi32(b, 0xB1);        // swap H<->L
    __m256i prodlh  = _mm256_mullo_epi32(a, bswap);         // 32 bit L*H products
    __m256i zero    = _mm256_setzero_si256();               // 0
    __m256i prodlh2 = _mm256_hadd_epi32(prodlh, zero);      // a0Lb0H+a0Hb0L,a1Lb1H+a1Hb1L,0,0
    __m256i prodlh3 = _mm256_shuffle_epi32(prodlh2, 0x73);  // 0, a0Lb0H+a0Hb0L, 0, a1Lb1H+a1Hb1L
    __m256i prodll  = _mm256_mul_epu32(a, b);               // a0Lb0L,a1Lb1L, 64 bit unsigned products
    __m256i prod    = _mm256_add_epi64(prodll, prodlh3);    // a0Lb0L+(a0Lb0H+a0Hb0L)<<32, a1Lb1L+(a1Lb1H+a1Hb1L)<<32
    return prod;
}

int main()
{
    sseUnion _sseUnion;
    _sseUnion.mVec256 = _mm256_set_epi64x(1000000, 1000000, 1000000, 1000000);
    sseUnion a2;
    a2.mVec256 = _mm256_setr_epi64x(401000000, 401000000, 401000000, 401000000);
    a2.mVec256 = _mm256_add_epi64(_sseUnion.mVec256, a2.mVec256);
    a2.mVec256 = mul64_haswell_mul(_sseUnion.mVec256, a2.mVec256);
    a2.mVec256 = mul64_haswell_mul(_sseUnion.mVec256, a2.mVec256);
    printf("%d", a2.mVec256.m256i_i64[0]);
}
When a2.position[0..3] is bigger than the int64_t max value I get a wrong value, because its real value is 14618374452099416064. I just want to clamp it to the int64_t max value; what can I do?
__m256i vindex = _mm256_set_epi64x(0, 0, 0, 0);
int64_t overFlowValue = 0x8000000000000000;
int64_t maxValue = 0x7FFFFFFFFFFFFFFF;
__m256i mask = _mm256_set_epi64x(overFlowValue, overFlowValue, overFlowValue, overFlowValue);
__m256i max  = _mm256_set_epi64x(maxValue, maxValue, maxValue, maxValue);
__m256i signa = _mm256_and_si256(a, mask);
__m256i signb = _mm256_and_si256(b, mask);
__m256i absA = _mm256_sub_epi64(_mm256_xor_si256(a, signa), signa);
__m256i absB = _mm256_sub_epi64(_mm256_xor_si256(b, signb), signb);
__m256i prod = mul64_mul(absA, absB);
__m256i resultSign = _mm256_and_si256(prod, mask);
__m256i result = _mm256_mask_i64gather_epi64(prod, max.m256i_i64, vindex, resultSign, 1);
__m256i resultSign1 = _mm256_xor_si256(signa, signb);
__m256i result1 = _mm256_sub_epi64(_mm256_xor_si256(result, resultSign1), resultSign1);

__m256i mul64_mul(__m256i a, __m256i b)
{
    __m256i bswap   = _mm256_shuffle_epi32(a, 0xB1);        // swap H<->L
    __m256i prodlh  = _mm256_mullo_epi32(b, bswap);         // 32 bit L*H products
    __m256i zero    = _mm256_setzero_si256();               // 0
    __m256i prodlh2 = _mm256_hadd_epi32(prodlh, zero);      // a0Lb0H+a0Hb0L,a1Lb1H+a1Hb1L,0,0
    __m256i prodlh3 = _mm256_shuffle_epi32(prodlh2, 0x73);  // 0, a0Lb0H+a0Hb0L, 0, a1Lb1H+a1Hb1L
    __m256i prodll  = _mm256_mul_epu32(a, b);               // a0Lb0L,a1Lb1L, 64 bit unsigned products
    __m256i prod    = _mm256_add_epi64(prodll, prodlh3);    // a0Lb0L+(a0Lb0H+a0Hb0L)<<32, a1Lb1L+(a1Lb1H+a1Hb1L)<<32
    return prod;
}
and I get the right result this way.
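Side note (my hedged alternative, not the answer's method): under the same assumption that an overflow only ever shows up as bit 63 of the unsigned product, the select against maxValue can also be written without the gather trick:
// all-ones in the lanes whose bit 63 is set, i.e. the lanes treated as overflowed above
__m256i overflowed = _mm256_cmpgt_epi64(_mm256_setzero_si256(), prod);
// pick maxValue in those lanes, the product elsewhere (replaces the masked gather)
__m256i clamped = _mm256_blendv_epi8(prod, max, overflowed);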

Fastest way to partially load an array of uint8_t or uint16_t into _m256i register and fill remaining bits with 1s without AVX512

Basically what I am trying to do is load an array of either uint8_t or uint16_t that is smaller than an __m256i register into an __m256i register, and fill all the bits in the destination __m256i that are not filled by the array with 1s.
An example of what I want with AVX512 would be:
#define ARR_SIZE_EPI8 (some_constant_value < 32)

// partial load for uint8_t
__m256i partial_load_epi8(uint8_t * arr) {
    __m256i ones = _mm256_set1_epi64x(-1);
    return _mm256_mask_loadu_epi8(ones, (1 << ARR_SIZE_EPI8) - 1, arr);
}

#define ARR_SIZE_EPI16 (some_constant_value < 16)

// partial load for uint16_t
__m256i partial_load_epi16(uint16_t * arr) {
    __m256i ones = _mm256_set1_epi64x(-1);
    return _mm256_mask_loadu_epi16(ones, (1 << ARR_SIZE_EPI16) - 1, arr);
}
Using only AVX2 if ARR_SIZE * sizeof(T) % sizeof(int) == 0 I can use:
__m256i partial_load_epi16_avx2(uint16_t * arr) {
    __m256i mask_vec = _mm256_set_epi32( /* proper values for ARR_SIZE_EPI16 elements */ );
    __m256i fill_vec = _mm256_set_epi16( /* 1s until ARR_SIZE_EPI16 * sizeof(uint16_t) */ );
    __m256i load_vec = _mm256_maskload_epi32((int32_t *)arr, mask_vec);
    return _mm256_or_si256(load_vec, fill_vec);
}
This uses a fair amount of .rodata but doesn't seem prohibitively expensive. On the other hand, when ARR_SIZE * sizeof(T) % sizeof(int) != 0, i.e. with uint16_t and an odd ARR_SIZE_EPI16, the best I've been able to come up with is:
__m256i partial_load_epi16_avx2_not_aligned(uint16_t * arr) {
    __m256i mask_vec = _mm256_set_epi32( /* proper values for ARR_SIZE_EPI16 elements */ );
    uint32_t tmp = 0xffff0000 | arr[ARR_SIZE_EPI16];
    __m256i fill_vec = _mm256_set_epi32( /* 1s until ARR_SIZE_EPI16 * sizeof(uint16_t) / sizeof(int32_t) */, tmp, /* 0s */ );
    __m256i load_vec = _mm256_maskload_epi32((int32_t *)arr, mask_vec);
    return _mm256_or_si256(load_vec, fill_vec);
}
// or
// or
__m256i partial_load_epi16_avx_not_aligned(uint16_t * arr) {
    __m256i fill_v = _mm256_set1_epi64x(-1);
    __m256i pload  = _mm256_maskload_epi32((int32_t *)arr, _mm256_set_epi32( /* Assume proper mask */ ));
    fill_v = _mm256_insert_epi16(fill_v, arr[ARR_SIZE_EPI16], ARR_SIZE_EPI16);
    return _mm256_blend_epi32(fill_v, pload, (1 << ((ARR_SIZE_EPI16 / 2) - 1)));
}
Which adds a vextracti128, vpinsrw and vinserti128. I'm wondering if there is a better approach that doesn't have so much overhead.
Thank you!
Edit:
The memory will be provided by the user and I cannot make any assumptions about whether memory before the start of arr or after arr + ARR_SIZE is accessible.
Use case: implementing sorting networks. The instructions to implement a sorting network for a power-of-2 size are often significantly more efficient than for a non-power-of-2 size (especially for byte / 2-byte values), so what I am trying to do is load the user array and then pad it with the max value (just handling the unsigned case for now) so that I can round the sorting network size up to the next power of 2.
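For what it's worth, one safe fallback that respects that constraint (a hedged sketch of mine, not from the question) is to go through a small stack buffer pre-filled with 1s; it costs a copy, but it never reads outside arr:
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Never touches memory outside arr[0..n-1]; the unread elements come out as 0xFFFF.
static inline __m256i partial_load_epi16_copy(const uint16_t *arr, size_t n /* <= 16 */) {
    uint16_t buf[16];
    memset(buf, 0xFF, sizeof(buf));          // fill with 1s (0xFFFF per element)
    memcpy(buf, arr, n * sizeof(uint16_t));  // copy only the valid elements
    return _mm256_loadu_si256((const __m256i *)buf);
}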
Edit: VPBLENDD and VPBLENDVB ARE NOT REPLACEMENTS FOR VMOVDQU
edit:
Interestingly enough the best solution I have found is to inline vpblendvb with the array as operand 3. DO NOT DO THIS
Edit2:
Test program to see if vpblendd and vpblendvb cause extra pagefaults.
#include <immintrin.h>
#include <stdint.h>
#include <sys/mman.h>
#include <utility>

#define N 5

template<uint32_t... e>
constexpr __m256i inline __attribute__((always_inline))
load_N_kernel2(std::integer_sequence<uint32_t, e...> _e) {
    return _mm256_set_epi8(e...);
}

template<uint32_t... e>
constexpr __m256i inline __attribute__((always_inline))
load_N_kernel(std::integer_sequence<uint32_t, e...> _e) {
    return load_N_kernel2(
        std::integer_sequence<uint32_t, ((((31 - e) / 4) < N) << 7)...>{});
}

constexpr __m256i inline __attribute__((always_inline)) load_N() {
    return load_N_kernel(std::make_integer_sequence<uint32_t, 32>{});
}

__m256i __attribute__((noinline)) mask_load(uint32_t * arr) {
    __m256i tmp;
    return _mm256_mask_loadu_epi32(tmp, (1 << N) - 1, arr);
}

__m256i __attribute__((noinline)) blend_load(uint32_t * arr) {
    __m256i tmp;
    asm volatile("vpblendd %[m], (%[arr]), %[tmp], %[tmp]\n\t"
                 : [ tmp ] "=x"(tmp)
                 : [ arr ] "r"(arr), [ m ] "i"(((1 << N) - 1))
                 :);
    return tmp;
}

__m256i __attribute__((noinline)) blend_load_epi8(uint32_t * arr) {
    __m256i tmp  = _mm256_set1_epi8(uint8_t(0xff));
    __m256i mask = load_N();
    asm volatile("vpblendvb %[mask], (%[arr]), %[tmp], %[tmp]\n\t"
                 : [ tmp ] "+x"(tmp)
                 : [ arr ] "r"(arr), [ mask ] "x"(mask)
                 :);
    return tmp;
}

void __attribute__((noinline)) mask_store(uint32_t * arr, __m256i v) {
    return _mm256_mask_storeu_epi32(arr, (1 << N) - 1, v);
}

#define NPAGES (1000)
#define END_OF_PAGE (1024 - N)

#ifndef LOAD_METHOD
#define LOAD_METHOD blend_load
#endif

int
main() {
    uint32_t * addr = (uint32_t *)
        mmap(NULL, NPAGES * 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    for(uint32_t i = 0; i < NPAGES; i += 2) {
        mask_store(addr + 1024 * i + END_OF_PAGE, LOAD_METHOD(addr + END_OF_PAGE));
    }
}
Ran:
$> perf stat -e page-faults,page-faults ./partial_load
The result is the same with LOAD_METHOD set to blend_load, mask_load or blend_load_epi8:
Performance counter stats for './partial_load':
548 page-faults
548 page-faults
0.002155974 seconds time elapsed
0.000000000 seconds user
0.002276000 seconds sys
Edit3:
Note: this was compiled with clang, which does not use vpblendd to implement _mm256_mask_loadu_epi32.
Here is assembly of the function:
0000000000401130 <_Z9mask_loadPj>:
401130: b0 1f mov $0x1f,%al
401132: c5 fb 92 c8 kmovd %eax,%k1
401136: 62 f1 7e a9 6f 07 vmovdqu32 (%rdi),%ymm0{%k1}{z}
40113c: c3 retq
40113d: 0f 1f 00 nopl (%rax)

How to convert 16-bit unsigned short to 8-bit unsigned char using scaling efficiently?

I'm trying to convert 16-bit unsigned short data to 8-bit unsigned char using some scaling function. Currently I'm doing this by converting to float, scaling down, and then saturating to 8 bit. Is there a more efficient way to do this?
int _tmain(int argc, _TCHAR* argv[])
{
    float Scale=255.0/65535.0;
    USHORT sArr[8]={512,1024,2048,4096,8192,16384,32768,65535};
    BYTE bArr[8],bArrSSE[8];

    //Desired Conventional Method
    for (int i = 0; i < 8; i++)
    {
        bArr[i]=(BYTE)(sArr[i]*Scale);
    }

    __m128 vf_scale = _mm_set1_ps(Scale),
           vf_Round = _mm_set1_ps(0.5),
           vf_zero  = _mm_setzero_ps();
    __m128i vi_zero = _mm_setzero_si128();
    __m128i vi_src  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&sArr[0]));

    __m128 vf_Src_Lo=_mm_cvtepi32_ps(_mm_unpacklo_epi16(vi_src, _mm_set1_epi16(0)));
    __m128 vf_Src_Hi=_mm_cvtepi32_ps(_mm_unpackhi_epi16(vi_src, _mm_set1_epi16(0)));
    __m128 vf_Mul_Lo=_mm_sub_ps(_mm_mul_ps(vf_Src_Lo,vf_scale),vf_Round);
    __m128 vf_Mul_Hi=_mm_sub_ps(_mm_mul_ps(vf_Src_Hi,vf_scale),vf_Round);

    __m128i v_dst_i = _mm_packus_epi16(_mm_packs_epi32(_mm_cvtps_epi32(vf_Mul_Lo), _mm_cvtps_epi32(vf_Mul_Hi)), vi_zero);
    _mm_storel_epi64((__m128i *)(&bArrSSE[0]), v_dst_i);

    for (int i = 0; i < 8; i++)
    {
        printf("ushort[%d]= %d * %f = %.3f ,\tuChar[%d]= %d,\t SSE uChar[%d]= %d \n",i,sArr[i],Scale,(float)(sArr[i]*Scale),i,bArr[i],i,bArrSSE[i]);
    }
    return 0;
}
Please note that the scaling factor may need to be set to other values, e.g. 255.0/512.0, 255.0/1024.0 or 255.0/2048.0, so any solution should not be hard-coded for 255.0/65535.0.
If the ratio in your code is fixed, you can perform the scaling with the following algorithm:
Shift the high byte of each word into the lower one.
E.g. 0x200 -> 0x2, 0xff80 -> 0xff
Add an offset of -1 if the low byte was less than 0x80.
E.g. 0x200 -> Offset -1, 0xff80 -> Offset 0
The first part is easily achieved with _mm_srli_epi16
The second one is trickier, but it basically consists in taking bit 7 (the highest bit of the lower byte) of each word, replicating it all over the word and then negating it.
I used another approach: I created a vector of words valued -1 by comparing a vector with itself for equality.
Then I isolated bit 7 of each source word and added it to the -1 words.
#include <stdio.h>
#include <emmintrin.h>

int main(int argc, char* argv[])
{
    float Scale=255.0/65535.0;
    unsigned short sArr[8]={512,1024,2048,4096,8192,16384,32768,65535};
    unsigned char bArr[8], bArrSSE[16];

    //Desired Conventional Method
    for (int i = 0; i < 8; i++)
    {
        bArr[i]=(unsigned char)(sArr[i]*Scale);
    }

    //Values to be converted
    __m128i vi_src = _mm_loadu_si128((__m128i const*)sArr);

    //This computes 8 words (16-bit) that are
    // -1 if the low byte of the relative word in vi_src is less than 0x80
    //  0 if the low byte of the relative word in vi_src is >= than 0x80
    __m128i vi_off = _mm_cmpeq_epi8(vi_src, vi_src); //Set all words to -1

    //Add bit 7 of each word in vi_src to each -1 word
    vi_off = _mm_add_epi16(vi_off, _mm_srli_epi16(_mm_slli_epi16(vi_src, 8), 15));

    //Shift each vi_src word right by 8 (move high byte into low byte)
    vi_src = _mm_srli_epi16 (vi_src, 8);

    //Add the offsets
    vi_src = _mm_add_epi16(vi_src, vi_off);

    //Pack the words into bytes
    vi_src = _mm_packus_epi16(vi_src, vi_src);

    _mm_storeu_si128((__m128i *)bArrSSE, vi_src);

    for (int i = 0; i < 8; i++)
    {
        printf("%02x %02x\n", bArr[i],bArrSSE[i]);
    }
    return 0;
}
Here is an implementation and test harness using _mm_mulhi_epu16 to perform a fixed point scaling operation.
scale_ref is your original scalar code, scale_1 is the floating point SSE implementation from your (currently deleted) answer, and scale_2 is my fixed point implementation.
I've factored out the various implementations into separate functions and also added a size parameter and a loop, so that they can be used for any size array (although currently n must be a multiple of 8 for the SSE implementations).
There is a compile-time flag, ROUND, which controls whether the fixed point implementation truncates (like your scalar code) or rounds (to nearest). Truncation is slightly faster.
Also note that scale is a run-time parameter, currently hard-coded to 255 (equivalent to 255.0/65535.0) in the test harness below, but it can be any reasonable value.
#include <stdio.h>
#include <stdint.h>
#include <limits.h>
#include <emmintrin.h>

#define ROUND 1 // use rounding rather than truncation

typedef uint16_t USHORT;
typedef uint8_t BYTE;

static void scale_ref(const USHORT *src, BYTE *dest, const USHORT scale, const size_t n)
{
    const float kScale = (float)scale / (float)USHRT_MAX;

    for (size_t i = 0; i < n; i++)
    {
        dest[i] = src[i] * kScale;
    }
}

static void scale_1(const USHORT *src, BYTE *dest, const USHORT scale, const size_t n)
{
    const float kScale = (float)scale / (float)USHRT_MAX;
    __m128 vf_Scale = _mm_set1_ps(kScale);
    __m128 vf_Round = _mm_set1_ps(0.5f);
    __m128i vi_zero = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 8)
    {
        __m128i vi_src = _mm_loadu_si128((__m128i *)&src[i]);
        __m128 vf_Src_Lo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(vi_src, _mm_set1_epi16(0)));
        __m128 vf_Src_Hi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(vi_src, _mm_set1_epi16(0)));
        __m128 vf_Mul_Lo = _mm_mul_ps(vf_Src_Lo, vf_Scale);
        __m128 vf_Mul_Hi = _mm_mul_ps(vf_Src_Hi, vf_Scale);
        //Convert -ive to +ive Value
        vf_Mul_Lo = _mm_max_ps(_mm_sub_ps(vf_Round, vf_Mul_Lo), vf_Mul_Lo);
        vf_Mul_Hi = _mm_max_ps(_mm_sub_ps(vf_Round, vf_Mul_Hi), vf_Mul_Hi);
        __m128i v_dst_i = _mm_packus_epi16(_mm_packs_epi32(_mm_cvtps_epi32(vf_Mul_Lo), _mm_cvtps_epi32(vf_Mul_Hi)), vi_zero);
        _mm_storel_epi64((__m128i *)&dest[i], v_dst_i);
    }
}

static void scale_2(const USHORT *src, BYTE *dest, const USHORT scale, const size_t n)
{
    const __m128i vk_scale = _mm_set1_epi16(scale);
#if ROUND
    const __m128i vk_round = _mm_set1_epi16(scale / 2);
#endif

    for (size_t i = 0; i < n; i += 8)
    {
        __m128i v = _mm_loadu_si128((__m128i *)&src[i]);
#if ROUND
        v = _mm_adds_epu16(v, vk_round);
#endif
        v = _mm_mulhi_epu16(v, vk_scale);
        v = _mm_packus_epi16(v, v);
        _mm_storel_epi64((__m128i *)&dest[i], v);
    }
}

int main(int argc, char* argv[])
{
    const size_t n = 8;
    const USHORT scale = 255;
    USHORT src[n] = { 512, 1024, 2048, 4096, 8192, 16384, 32768, 65535 };
    BYTE dest_ref[n], dest_1[n], dest_2[n];

    scale_ref(src, dest_ref, scale, n);
    scale_1(src, dest_1, scale, n);
    scale_2(src, dest_2, scale, n);

    for (size_t i = 0; i < n; i++)
    {
        printf("src = %u, ref = %u, test_1 = %u, test_2 = %u\n", src[i], dest_ref[i], dest_1[i], dest_2[i]);
    }

    return 0;
}
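A hedged usage note (my addition): since scale_2 effectively computes (x * scale) >> 16, the other ratios mentioned in the question map to scale values of roughly ratio * 65535, for example:
// Assumed mapping: pass scale ~ ratio * 65535 (the same convention scale_ref uses).
const USHORT scale_255_65535 = 255;                                       // 255.0/65535.0
const USHORT scale_255_2048  = (USHORT)(255.0 / 2048.0 * 65535.0 + 0.5);  // ~8160
const USHORT scale_255_1024  = (USHORT)(255.0 / 1024.0 * 65535.0 + 0.5);  // ~16320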
Ok found the solution with reference to this.
Here is my Solution:
int _tmain(int argc, _TCHAR* argv[])
{
    float Scale=255.0/65535.0;
    USHORT sArr[8]={512,1024,2048,4096,8192,16384,32768,65535};
    BYTE bArr[8],bArrSSE[8];

    //Desired Conventional Method
    for (int i = 0; i < 8; i++)
    {
        bArr[i]=(BYTE)(sArr[i]*Scale);
    }

    __m128 vf_scale = _mm_set1_ps(Scale),
           vf_zero  = _mm_setzero_ps();
    __m128i vi_zero = _mm_setzero_si128();
    __m128i vi_src  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&sArr[0]));

    __m128 vf_Src_Lo=_mm_cvtepi32_ps(_mm_unpacklo_epi16(vi_src, _mm_set1_epi16(0)));
    __m128 vf_Src_Hi=_mm_cvtepi32_ps(_mm_unpackhi_epi16(vi_src, _mm_set1_epi16(0)));
    __m128 vf_Mul_Lo=_mm_mul_ps(vf_Src_Lo,vf_scale);
    __m128 vf_Mul_Hi=_mm_mul_ps(vf_Src_Hi,vf_scale);

    //Convert -ive to +ive Value
    vf_Mul_Lo=_mm_max_ps(_mm_sub_ps(vf_zero, vf_Mul_Lo), vf_Mul_Lo);
    vf_Mul_Hi=_mm_max_ps(_mm_sub_ps(vf_zero, vf_Mul_Hi), vf_Mul_Hi);

    __m128i v_dst_i = _mm_packus_epi16(_mm_packs_epi32(_mm_cvtps_epi32(vf_Mul_Lo), _mm_cvtps_epi32(vf_Mul_Hi)), vi_zero);
    _mm_storel_epi64((__m128i *)(&bArrSSE[0]), v_dst_i);

    for (int i = 0; i < 8; i++)
    {
        printf("ushort[%d]= %d * %f = %.3f ,\tuChar[%d]= %d,\t SSE uChar[%d]= %d \n",i,sArr[i],Scale,(float)(sArr[i]*Scale),i,bArr[i],i,bArrSSE[i]);
    }
    return 0;
}

OpenCV FAST corner detection SSE implementation walkthrough

Could someone help me understand the SSE implementation of the FAST corner detection in OpenCV? I understand the algorithm but not the implementation. Could somebody walk me through the code?
The code is long, so thank you in advance.
I am using OpenCV 2.4.11 and the code goes like this:
__m128i delta = _mm_set1_epi8(-128);
__m128i t = _mm_set1_epi8((char)threshold);
__m128i m0, m1;
__m128i v0 = _mm_loadu_si128((const __m128i*)ptr);
I think the following has something to do with threshold checking, but I can't understand the use of delta:
__m128i v1 = _mm_xor_si128(_mm_subs_epu8(v0, t), delta);
v0 = _mm_xor_si128(_mm_adds_epu8(v0, t), delta);
Now it checks the neighboring 4 pixels, but again, what is the use of delta?
__m128i x0 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[0])), delta);
__m128i x1 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[4])), delta);
__m128i x2 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[8])), delta);
__m128i x3 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[12])), delta);
m0 = _mm_and_si128(_mm_cmpgt_epi8(x0, v0), _mm_cmpgt_epi8(x1, v0));
m1 = _mm_and_si128(_mm_cmpgt_epi8(v1, x0), _mm_cmpgt_epi8(v1, x1));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x1, v0), _mm_cmpgt_epi8(x2, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x1), _mm_cmpgt_epi8(v1, x2)));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x2, v0), _mm_cmpgt_epi8(x3, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x2), _mm_cmpgt_epi8(v1, x3)));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x3, v0), _mm_cmpgt_epi8(x0, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x3), _mm_cmpgt_epi8(v1, x0)));
m0 = _mm_or_si128(m0, m1);
Here it checks the continuity of the neighboring pixels. (Right?)
int mask = _mm_movemask_epi8(m0);
if( mask == 0 )
    continue;
This is another puzzle for me. Why shifting 8 bytes to the left? I assume the mask tells me the location of the corner candidate, but why 8 bytes?
if( (mask & 255) == 0 )
{
    j -= 8;
    ptr -= 8;
    continue;
}
I gave up at this point...
__m128i c0 = _mm_setzero_si128(), c1 = c0, max0 = c0, max1 = c0;
for( k = 0; k < N; k++ )
{
    __m128i x = _mm_xor_si128(_mm_loadu_si128((const __m128i*)(ptr + pixel[k])), delta);
    m0 = _mm_cmpgt_epi8(x, v0);
    m1 = _mm_cmpgt_epi8(v1, x);

    c0 = _mm_and_si128(_mm_sub_epi8(c0, m0), m0);
    c1 = _mm_and_si128(_mm_sub_epi8(c1, m1), m1);

    max0 = _mm_max_epu8(max0, c0);
    max1 = _mm_max_epu8(max1, c1);
}
max0 = _mm_max_epu8(max0, max1);
int m = _mm_movemask_epi8(_mm_cmpgt_epi8(max0, K16));

for( k = 0; m > 0 && k < 16; k++, m >>= 1 )
    if(m & 1)
    {
        cornerpos[ncorners++] = j+k;
        if(nonmax_suppression)
            curr[j+k] = (uchar)cornerScore<patternSize>(ptr+k, pixel, threshold);
    }
As harold said, delta is used to make an unsigned comparison.
Let's describe this implementation step by step:
__m128i x0 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[0])), delta);
__m128i x1 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[4])), delta);
__m128i x2 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[8])), delta);
__m128i x3 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[12])), delta);
m0 = _mm_and_si128(_mm_cmpgt_epi8(x0, v0), _mm_cmpgt_epi8(x1, v0));
m1 = _mm_and_si128(_mm_cmpgt_epi8(v1, x0), _mm_cmpgt_epi8(v1, x1));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x1, v0), _mm_cmpgt_epi8(x2, v0)));
......
Here it's not checking 4 neighboring pixels. It checks 4 points on the circle spaced a quarter of it apart: pixel[0], pixel[4], pixel[8] and pixel[12].
Here they check that the "corner condition" is true for an adjacent pair of these 4 points, because if it is not, there cannot be a long enough run of neighboring circle pixels satisfying the "corner condition", so it's not a corner pixel. If the mask is zero it means that none of the pixels in the vector can be a corner, so they advance by 16 pixels.
int mask = _mm_movemask_epi8(m0);
if( mask == 0 )
continue;
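A hedged scalar model of this quick-reject step (my sketch, not OpenCV code), for one candidate pixel:
#include <stdint.h>

// x[0..3] are the circle pixels pixel[0], pixel[4], pixel[8], pixel[12];
// hi = p + threshold, lo = p - threshold. A pixel can only be a corner if some
// adjacent pair among these four points is entirely brighter than hi or entirely
// darker than lo; otherwise it is rejected without looking at the other 12 pixels.
static int may_be_corner(const uint8_t x[4], int hi, int lo) {
    for (int k = 0; k < 4; k++) {
        int a = x[k], b = x[(k + 1) & 3];
        if ((a > hi && b > hi) || (a < lo && b < lo))
            return 1;  // survives the quick test
    }
    return 0;          // cannot be a corner
}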
If the mask is not zero, but the "corner condition" is not true for the first 8 pixels, they advance by only 8 pixels (instead of 16), so the remaining pixels are checked on the next iteration.
if( (mask & 255) == 0 )
{
    j -= 8;
    ptr -= 8;
    continue;
}
And the final step. Here they count, in the c0 counter, the number of consecutive circle pixels that are greater than the center pixel plus threshold, and, in the c1 counter, those that are less than the center pixel minus threshold.
Here is how the masks for those conditions are generated:
__m128i x = _mm_xor_si128(_mm_loadu_si128((const __m128i*)(ptr + pixel[k])), delta);
m0 = _mm_cmpgt_epi8(x, v0);
m1 = _mm_cmpgt_epi8(v1, x);
Note that if the condition is true for an element of the vector, its value is set to 0xFF, which is -1 since we treat it as a signed char.
c0 = _mm_and_si128(_mm_sub_epi8(c0, m0), m0);
c1 = _mm_and_si128(_mm_sub_epi8(c1, m1), m1);
If an element of the mask is -1, the corresponding counter c0 or c1 is incremented because of the subtraction (for example c0 - (-1)). But if it is equal to zero, the counter is reset to zero by the _mm_and_si128.
Then they store the maximum value of the counters:
max0 = _mm_max_epu8(max0, c0);
max1 = _mm_max_epu8(max1, c1);
So they store the maximum number of consecutive neighboring pixels that satisfy the "corner condition".
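A hedged scalar model of one byte lane of this counting trick (my sketch, not OpenCV code):
#include <stdint.h>

// mask[k] is -1 when circle pixel k passes the test, 0 otherwise (what _mm_cmpgt_epi8
// produces per byte). c = (c - m) & m increments while m stays -1 and resets on 0,
// so max ends up holding the longest run of consecutive passing pixels.
static int longest_run(const int8_t *mask, int n) {
    uint8_t c = 0, max = 0;
    for (int k = 0; k < n; k++) {
        c = (uint8_t)((c - mask[k]) & mask[k]);
        if (c > max) max = c;
    }
    return max;
}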
Here they determine which pixels are actually corners and which are not:
max0 = _mm_max_epu8(max0, max1);
int m = _mm_movemask_epi8(_mm_cmpgt_epi8(max0, K16));
for( k = 0; m > 0 && k < 16; k++, m >>= 1 )
    if(m & 1)
    {
        cornerpos[ncorners++] = j+k;
        if(nonmax_suppression)
            curr[j+k] = (uchar)cornerScore<patternSize>(ptr+k, pixel, threshold);
    }
I hope it will help. I'm sorry for my bad English.
delta is a mask in which only the sign bits are set. They use it because they want an unsigned greater-than comparison, but only a signed comparison instruction is available.
Adding 128 (or subtracting it, because -128 == 128) and xoring with it do the same (if you're working with bytes), because
a + b == (a ^ b) + ((a & b) << 1)
and if b only has the top bit set, the ((a & b) << 1) term must be zero (a & b can have the top bit set, but it's shifted out).
Then as you can see in the diagram below, subtracting 128 "shifts" the entire range down in such a way that a signed comparison will give the same result as an unsigned comparison would have given on the original range.
|0 ... 127 ... 255| unsigned
|-128 ... 0 ... 127| signed
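A hedged illustration of the same trick in isolation (my sketch, not from OpenCV): an unsigned byte greater-than built from the signed compare by flipping the sign bits of both operands first:
#include <emmintrin.h>

// a > b as unsigned bytes, using only SSE2's signed compare.
static inline __m128i cmpgt_epu8(__m128i a, __m128i b) {
    const __m128i sign = _mm_set1_epi8((char)0x80);  // the same role delta plays above
    return _mm_cmpgt_epi8(_mm_xor_si128(a, sign), _mm_xor_si128(b, sign));
}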
I don't know about the rest, I hope someone else can answer that.
