I am porting an application from Altivec to Neon.
I see a lot of intrinsics in Altivec which return scalar values.
Do we have any such intrinsics on ARM ?
For instance vec_all_gt
There are no intrinsics that give scalar comparison results. This is because the common pattern for SIMD comparisons is to use branchless lane-masking and conditional selects to multiplex results, not branch-based control flow.
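For instance, a per-lane conditional select (a minimal sketch; the helper name is ours, not a standard intrinsic) replaces what would otherwise be a scalar branch:
#include <arm_neon.h>

// Branchless per-lane select: c[i] = (a[i] >= b[i]) ? x[i] : y[i] for all four lanes.
static inline float32x4_t select_ge(float32x4_t a, float32x4_t b,
                                    float32x4_t x, float32x4_t y)
{
    uint32x4_t mask = vcgeq_f32(a, b);   // all-ones in lanes where a >= b, else zero
    return vbslq_f32(mask, x, y);        // take x where mask bits are set, otherwise y
}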
You can build them if you need them though ...
// Compare two vectors of floats, e.g. greater-than-or-equal
uint32x4_t compare = vcgeq_f32(a, b);
// Shift each lane down so only its LSB holds the compare result (1 = true, 0 = false)
uint32x4_t tmp = vshrq_n_u32(compare, 31);
// Shift the per-lane bits up so lane 0 -> bit 0, lane 1 -> bit 1, etc.
static const int32_t shifta[4] = { 0, 1, 2, 3 };
const int32x4_t shift = vld1q_s32(shifta);
tmp = vshlq_u32(tmp, shift);
// Horizontal add across the vector to merge the result into a scalar mask (vaddvq_u32 is AArch64-only)
return vaddvq_u32(tmp);
... at which point you can define any() (mask is non-zero) and all() (mask is 0xF) comparisons if you need branchy logic.
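A minimal sketch of those wrappers (AArch64, since vaddvq_u32 is A64-only; the function names are ours, not standard intrinsics):
#include <arm_neon.h>

// Pack the >= comparison of two float vectors into a 4-bit scalar mask: lane n -> bit n.
static inline uint32_t vcge_mask(float32x4_t a, float32x4_t b)
{
    uint32x4_t bits = vshrq_n_u32(vcgeq_f32(a, b), 31);   // 1 in each lane that compared true
    static const int32_t shifta[4] = { 0, 1, 2, 3 };
    const int32x4_t shift = vld1q_s32(shifta);
    return vaddvq_u32(vshlq_u32(bits, shift));            // merge lane bits into one scalar
}

static inline int any_ge(float32x4_t a, float32x4_t b) { return vcge_mask(a, b) != 0; }
static inline int all_ge(float32x4_t a, float32x4_t b) { return vcge_mask(a, b) == 0xF; }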
Related
I would like to take the result of an 8-bit vertical SIMD comparison between 256-bit vectors and pack the bits into the lowest byte of each 32-bit element for a vpshufb lookup on the lowest bytes. This isn't terribly difficult with AVX-512 (replace the & with a masked move if using 512-bit vectors):
__m256i cmp_8_into_32(__m256i a, __m256i b) {
    return _mm256_popcnt_epi32(_mm256_cmpeq_epi8(a, b)
                               & _mm256_set1_epi32(0xff0f0301 /* can be any order */));
}
That's three uops and, assuming perfect scheduling, a throughput of 1 according to uops.info - not bad. Alas, vpopcntd isn't in AVX2. What's the optimal way to do this operation there? The best I can think of is to mask the pairs of bits at indices 7,8 and 15,16, then perform two constant-amount vpsrld and a vpor. So that's 6 uops, with a throughput of about 2.5. Not bad, but I wonder if there's something better.
Following chtz's comment (thanks!), I realize it's actually fairly easy:
__m256i cmp_8_into_32_1(__m256i a, __m256i b) {
    const __m256i weights = _mm256_set1_epi32(0x08040201);
    const __m256i all_n1  = _mm256_set1_epi16(-0x1);
    __m256i cmp    = _mm256_cmpeq_epi8(a, b);
    __m256i hsum16 = _mm256_maddubs_epi16(weights, cmp);
    return _mm256_madd_epi16(hsum16, all_n1);
}
Peter Cordes's suggestion saved an additional vpand. The two multiply–add instructions both run on either port 0 or 1, so this has the same throughput as the original popcount-based solution, although with a latency of about 11 instead of 5.
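As a quick sanity check of the packing behavior (a sketch; the test inputs are made up, and cmp_8_into_32_1 is the function above, assumed to be in scope):
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t a[32], b[32];
    for (int i = 0; i < 32; ++i)
    {
        a[i] = (uint8_t)i;
        b[i] = (i % 4 == 0) ? (uint8_t)i : 0xEE;   // only byte 0 of each dword matches
    }
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    uint32_t out[8];
    _mm256_storeu_si256((__m256i *)out, cmp_8_into_32_1(va, vb));
    for (int i = 0; i < 8; ++i)
        printf("%u ", out[i]);   // expected: 1 1 1 1 1 1 1 1 (only bit 0 set in each dword)
    printf("\n");
    return 0;
}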
An alternative that uses only one multiply:
__m256i cmp_8_into_32(__m256i a, __m256i b) {
    __m256i cmp  = _mm256_cmpeq_epi8(a, b);
    __m256i msk  = _mm256_and_si256(cmp, _mm256_set1_epi32(0x08040201));
    __m256i hsum = _mm256_madd_epi16(msk, _mm256_set1_epi8(1));
    return _mm256_srli_epi16(hsum, 8);
}
A 32-bit multiply (_mm256_mullo_epi32) is not used because it is slow.
If the results are not needed "in-lane" then one could use a _mm256_packs_epi16 immediately after the comparison to process twice as much data at once. If you don't need all of the possible states (say we don't care about lowest byte matches) then you could do 4x as much per instruction. If the results from the vpshufb lookup are getting gathered together then there may be other possible optimizations...
I have a question about using 128-bit registers to gain speed in my code. Consider the following C/C++ code: I define two unsigned long long ints a and b and give them some values.
unsigned long long int a = 4369, b = 56481;
Then, I want to compute
a & b;
Here a is represented in the computer as the 64-bit number 4369 = 1000100010001, and likewise b = 56481 = 1101110010100001. I compute a & b, which is still a 64-bit number, given by the bit-by-bit logical AND of a and b:
a & b = 1000000000001
My question is the following: do computers have a 128-bit register where I could do the operation above, but with 128-bit integers rather than 64-bit integers, in the same amount of time? To be clearer: I would like to gain a factor of two in speed by using 128-bit numbers rather than 64-bit numbers, i.e. compute 128 ANDs rather than 64 ANDs (one AND for every bit) in the same time. If this is possible, do you have a code example? I have heard that the SSE registers might do this, but I am not sure.
Yes, SSE2 has a 128-bit bitwise AND - you can use it via intrinsics in C or C++, e.g.
#include "emmintrin.h" // SSE2 intrinsics
__m128i v0, v1, v2; // 128 bit variables
v2 = _mm_and_si128(v0, v1); // bitwise AND
or you can use it directly in assembler - the instruction is PAND.
You can even do a 256-bit AND on Haswell and later CPUs, which have AVX2:
#include "immintrin.h" // AVX2 intrinsics
__m256i v0, v1, v2; // 256 bit variables
v2 = _mm256_and_si256(v0, v1); // bitwise AND
The corresponding instruction in this case is VPAND.
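For completeness, here is a minimal self-contained sketch (SSE2 only; the high halves are arbitrary example values, while the low halves reuse the numbers from the question):
#include <emmintrin.h>   // SSE2
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    // Two 128-bit values, each assembled from a high and a low 64-bit half
    __m128i v0 = _mm_set_epi64x(0x0F0F0F0F0F0F0F0FLL, 4369LL);
    __m128i v1 = _mm_set_epi64x(0x00FF00FF00FF00FFLL, 56481LL);
    __m128i v2 = _mm_and_si128(v0, v1);              // one instruction ANDs all 128 bits

    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v2);
    printf("low  = %llu\n",  (unsigned long long)out[0]);  // 4369 & 56481 = 4097
    printf("high = %#llx\n", (unsigned long long)out[1]);  // 0xf000f000f000f
    return 0;
}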
The code below converts a row from an 8-bit palettized format to 32-bit RGBA.
Before trying to implement it, I would like to know whether the code below is even suited to being optimized with Direct-Math or, alternatively, ARM NEON intrinsics or inline assembly. My first look at the documentation did not reveal anything that would cover the table-lookup part.
void CopyPixels(BYTE *pDst, BYTE *pSrc, int width,
                const BYTE mask, Color *pColorTable)
{
    if (width)
    {
        do
        {
            BYTE b = *pSrc++;
            if (b != mask)
            {
                // Translate to 32-bit RGB value if not masked
                const Color *pColor = pColorTable + b;
                pDst[0] = pColor->Blue;
                pDst[1] = pColor->Green;
                pDst[2] = pColor->Red;
                pDst[3] = 0xFF;
            }
            // Skip to next pixel
            pDst += 4;
        }
        while (--width);
    }
}
You will need a LUT of size 256 * 4 bytes = 1024 bytes.
This kind of job is not well suited for SIMD at all (except on Intel's new Haswell core).
NEON can handle LUTs of at most 32 bytes with VTBL and VTBX, and those are more or less meant to work in conjunction with CLZ results as starting values for Newton-Raphson iterations.
I agree with Jake that this isn't a great vector-processor problem, and it may be handled more efficiently by the ARM main pipeline. That doesn't mean you couldn't optimize it in assembly (just plain ARMv7) for drastically improved results.
In particular, a simple improvement would be to construct your lookup table so that it can be used with a word-sized copy. This involves making sure the Color struct follows the 32-RGBA format, including having the fourth byte 0xFF as part of the lookup, so that you can do a single word copy per pixel. This can be a significant performance boost with no assembly required, since it is a single memory fetch rather than three (plus a constant assignment).
void CopyPixels(RGBA32Color *pDst, BYTE const *pSrc, int width,
                const BYTE mask, RGBA32Color const *pColorTable)
{
    if (width)
    {
        do
        {
            BYTE b = *pSrc++;
            if (b != mask)
            {
                // Translate to 32-bit RGB value if not masked
                *pDst = pColorTable[b];
            }
            // Skip to next pixel
            pDst++;
        }
        while (--width);
    }
}
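For reference, one possible layout for the RGBA32Color type assumed above (the name comes from the snippet; the key point is that the bytes sit in memory in B, G, R, A order, with the alpha already set to 0xFF in every table entry):
typedef struct RGBA32Color
{
    BYTE Blue;    // pDst[0] in the original byte-wise copy
    BYTE Green;   // pDst[1]
    BYTE Red;     // pDst[2]
    BYTE Alpha;   // pDst[3], pre-filled with 0xFF when the table is built
} RGBA32Color;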
I have developed a Mandelbrot generator for Windows which I have just converted to use SSE intrinsics. To detect the end of the iterations, in normal arithmetic I do a greater-than compare and break out. Doing this in SSE I can compare the whole vector using _mm_cmpgt_pd/_mm_cmpgt_ps, but this writes a new 128-bit vector with all 1s in the lanes where the condition holds.
My question is: is there a more efficient way of detecting all 1s than checking the two packed 64-bit ints? Or, if it is more efficient to detect all 0s, I could compare for less-than instead. Here is what I currently have:
__m128d CompareResult = _mm_cmpgt_pd(Magnitude, EarlyOut);
const __m128i Tmp = *reinterpret_cast< __m128i* >( &CompareResult );
if ( Tmp.m128i_u64[ 0 ] == -1 && Tmp.m128i_u64[ 1 ] == -1 )
{
    break;
}
The reason I want to find a better way is that I don't like the cast, but also because, according to VTune, over 30% of my iteration time is spent in this last line. I know a lot of that will be in the branch itself, but I assume I can reduce it with a better way of detecting the 0s or 1s.
Thanks
Assuming you're testing the result of a compare, you can just extract the MS bit of each byte as a 16-bit int and test that, e.g.
int mask = _mm_movemask_epi8(_mm_castpd_si128(CompareResult));
if (mask == 0xffff)
{
    // compare results are all "true"
}
Note that this is one example of a more general technique for SIMD predicates in SSE, i.e.
mask == 0xffff // all "true"
mask == 0x0000 // all "false"
mask != 0xffff // any "false"
mask != 0x0000 // any "true"
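Wrapped into helpers, the early-out in the Mandelbrot loop could look like this (a sketch; the function names are ours, not a library API):
#include <emmintrin.h>   // SSE2

// Scalar predicates over a double-precision compare result, built on the movemask pattern above.
static inline int all_true_pd(__m128d cmp) { return _mm_movemask_epi8(_mm_castpd_si128(cmp)) == 0xffff; }
static inline int any_true_pd(__m128d cmp) { return _mm_movemask_epi8(_mm_castpd_si128(cmp)) != 0x0000; }

// ... so the break condition becomes:
// if (all_true_pd(CompareResult)) break;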
The SSE shift instructions I have found can only shift by the same amount on all the elements:
_mm_sll_epi32()
_mm_slli_epi32()
These shift all elements, but by the same shift amount.
Is there a way to apply different shifts to the different elements? Something like this:
__m128i a, b;
r0:= a0 << b0;
r1:= a1 << b1;
r2:= a2 << b2;
r3:= a3 << b3;
There exists the _mm_shl_epi32() intrinsic that does exactly that.
http://msdn.microsoft.com/en-us/library/gg445138.aspx
However, it requires the XOP instruction set. Only AMD Bulldozer and Interlagos processors or later have this instruction. It is not available on any Intel processor.
If you want to do it without XOP instructions, you will need to do it the hard way: Pull them out and do them one by one.
Without XOP instructions, you can do this with SSE4.1 using the following intrinsics:
_mm_insert_epi32()
_mm_extract_epi32()
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse41_reg_ins_ext.htm
Those will let you extract parts of a 128-bit register into regular registers to do the shift and put them back.
If you go with the latter method, it'll be horrifically inefficient. That's why _mm_shl_epi32() exists in the first place.
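For illustration, a sketch of that extract/insert approach (SSE4.1; the function name is made up, and the shift counts are assumed to be in the range 0-31):
#include <smmintrin.h>   // SSE4.1

// Per-element variable left shift without XOP: pull each 32-bit lane and its
// shift count into scalar registers, shift there, and insert the result back.
static inline __m128i shl_epi32_var(__m128i a, __m128i counts)
{
    __m128i r = a;
    r = _mm_insert_epi32(r, _mm_extract_epi32(a, 0) << _mm_extract_epi32(counts, 0), 0);
    r = _mm_insert_epi32(r, _mm_extract_epi32(a, 1) << _mm_extract_epi32(counts, 1), 1);
    r = _mm_insert_epi32(r, _mm_extract_epi32(a, 2) << _mm_extract_epi32(counts, 2), 2);
    r = _mm_insert_epi32(r, _mm_extract_epi32(a, 3) << _mm_extract_epi32(counts, 3), 3);
    return r;
}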
Without XOP, your options are limited. If you can control the format of the shift-count argument, then you can use _mm_mullo_epi16, since multiplying by a power of two is the same as shifting by that power.
For example, if you want to shift the 8 16-bit elements in an SSE register by <0, 1, 2, 3, 4, 5, 6, 7>, you can multiply by 2 raised to those shift counts, i.e. by <1, 2, 4, 8, 16, 32, 64, 128>.
In some circumstances this can substitute for _mm_shl_epi32(a, b):
_mm_mullo_epi32(a, pow2);   // SSE4.1, where pow2 holds (1 << b) for each element
Generally speaking, this requires b to be a constant, because I don't know of an efficient way to calculate (1 << b) per element using older SSE instructions.
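As a concrete sketch of the multiply trick for the fixed 16-bit pattern described above (SSE2 only; the helper name is illustrative):
#include <emmintrin.h>   // SSE2

// Shift the 8 16-bit lanes of a left by <0, 1, 2, 3, 4, 5, 6, 7> respectively
// by multiplying with the precomputed powers of two <1, 2, 4, 8, 16, 32, 64, 128>.
static inline __m128i shl_epi16_by_lane_index(__m128i a)
{
    const __m128i pow2 = _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128);
    return _mm_mullo_epi16(a, pow2);   // a[i] * 2^i keeps the low 16 bits, i.e. a[i] << i
}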