Suppose I have an array:
uint8_t arr[256];
and an element
__m128i x
containing 16 bytes,
x_1, x_2, ... x_16
I would like to efficiently fill a new __m128i element
__m128i y
with values from arr depending on the values in x, such that:
y_1 = arr[x_1]
y_2 = arr[x_2]
.
.
.
y_16 = arr[x_16]
A command to achieve this would essentially be loading a register from a non-contiguous set of memory locations. I have a painfully vague memory of having seen documentation of such a command, but can't find it now. Does it exist? Thanks in advance for your help.
This kind of capability in SIMD architectures is known as load/store scatter/gather. Unfortunately SSE does not have it. Future SIMD architectures from Intel may have this - the ill-fated Larrabee processor was one case in point. For now though you will just need to design your data structures in such a way that this kind of functionality is not needed.
Note that you can achieve the equivalent effect by using e.g. _mm_set_epi8:
y = _mm_set_epi8(arr[x_16], arr[x_15], arr[x_14], ..., arr[x_1]);
although of course this will just generate a bunch of scalar code to load your y vector. This is fine if you are doing this kind of operation outside any performance-critical loops, e.g. as part of initialisation prior to looping, but inside a loop it is likely to be a performance-killer.
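Equivalently, here is a minimal sketch of that scalar fallback as a helper function (the function name is mine, not a standard intrinsic): spill the 16 indices to a temporary array, do 16 scalar table lookups, and reload the result as a vector:

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* y[i] = arr[x[i]] for each of the 16 bytes, via scalar lookups. */
static __m128i lookup_bytes(__m128i x, const uint8_t arr[256])
{
    uint8_t idx[16], out[16];
    _mm_storeu_si128((__m128i *)idx, x);   /* spill the 16 indices    */
    for (int i = 0; i < 16; i++)
        out[i] = arr[idx[i]];              /* 16 scalar table lookups */
    return _mm_loadu_si128((const __m128i *)out);
}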
Related
I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images).
I tried AVX, more modern SSE versions, etc., but eventually decided SSE2 seems sufficient and has the most support for individual bytes - please correct me if I'm wrong.
I have two questions: first, what's the right way to load 128-bit registers? I think I'm supposed to pass the load intrinsics data aligned to multiples of 128, but will that work with 2D array code like this:
greys = aligned_alloc(16, xres * sizeof(int8_t*));
for (uint32_t x = 0; x < xres; x++)
{
    greys[x] = aligned_alloc(16, yres * sizeof(int8_t*));
}
(The code above assumes xres and yres are the same, and are powers of two). Does this turn into a linear, unbroken block in memory? Could I then, as I loop, just keep passing addresses (incrementing them by 128) to the SSE2 load intrinsics? Or does something different need to be done for 2D arrays like this one?
My second question: once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i? Looking through the Intel Intrinsics Guide, instructions that convert a vector type to a scalar one are rare. The closest I've found is int _mm_movemask_epi8 (__m128i a), but I don't quite understand how to use it.
Oh, and a third question - I assumed _mm_load_si128 only loads signed bytes? And I couldn't find any other byte-loading function, so I guess you're just supposed to subtract 128 from each and account for it later?
I know these are basic questions for SIMD experts, but I hope this one will be useful to beginners like me. And if you think my whole approach to the application is wrong, or I'd be better off with more modern SIMD extensions, I'd love to know. I'd just like to humbly warn I've never worked with assembly and all this bit-twiddling stuff requires a lot of explication if it's to help me.
Nevertheless, I'm grateful for any clarification available.
In case it makes a difference: I'm targeting a low-power i7 Skylake architecture. But it'd be nice to have the application run on much older machines too (hence SSE2).
Least obvious question first:
once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i
Extract the low 64 bits to an integer with int64_t _mm_cvtsi128_si64x(__m128i), or the low 32 bits with int _mm_cvtsi128_si32 (__m128i a).
If you want other parts of the vector, not the low element, your options are as follows (there's a short code sketch of them after this list):
Shuffle the vector to create a new __m128i with the data you want in the low element, and use the cvt intrinsics (MOVD or MOVQ in asm).
Use SSE2 int _mm_extract_epi16 (__m128i a, int imm8), or the similar SSE4.1 instructions for other element sizes, such as _mm_extract_epi64(v, 1). These (PEXTRB/W/D/Q) are not the fastest instructions, but if you only need one high element, they're about equivalent to a separate shuffle and MOVD, just with smaller machine code.
_mm_store_si128 to an aligned temporary array and access the members: compilers will often optimize this into just a shuffle or pextr* instruction if you compile with -msse4.1 or -march=haswell or whatever. "print a __m128i variable" shows an example, including Godbolt compiler output of _mm_store_si128 into an alignas(16) uint64_t tmp[2].
Or use union { __m128i v; int64_t i64[2]; } or something. Union-based type punning is legal in C99, but only as an extension in C++. This compiles the same as a tmp array, and is generally not easier to read.
An alternative to the union that would also work in C++ would be memcpy(&my_int64_local, 8 + (char*)my_vector, 8); to extract the high half, but that seems more complicated and less clear, and more likely to be something a compiler wouldn't "see through". Compilers are usually pretty good about optimizing away small fixed-size memcpy when it's an entire variable, but this is just half of the variable.
If the whole high half of a vector can go directly into memory unmodified (instead of being needed in an integer register), a smart compiler might optimize to use MOVHPS to store the high half of a __m128i with the above union stuff.
Or you can use _mm_storeh_pi((__m64*)dst, _mm_castsi128_ps(vec)). That only requires SSE1, and is more efficient than SSE4.1 pextrq on most CPUs. But don't do this for a scalar integer you're about to use again right away; if SSE4.1 isn't available, the compiler will likely do an actual MOVHPS store and an integer reload, which usually isn't optimal. (And some compilers, like MSVC, don't optimize intrinsics.)
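Here is a rough sketch of those options applied to grabbing the high 64 bits (the function name is mine; it assumes a 64-bit target, and SSE4.1 for the PEXTRQ variant):

#include <smmintrin.h>  /* SSE4.1; also pulls in the SSE2 headers */
#include <stdalign.h>
#include <stdint.h>

int64_t high_half_demo(__m128i v)
{
    /* Option 1: shuffle the high qword down, then MOVQ it out. */
    int64_t a = _mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));

    /* Option 2 (SSE4.1): PEXTRQ the high qword directly. */
    int64_t b = _mm_extract_epi64(v, 1);

    /* Option 3: store the whole vector, then index the halves. */
    alignas(16) int64_t tmp[2];
    _mm_store_si128((__m128i *)tmp, v);
    int64_t c = tmp[1];

    /* Option 4 (SSE1): store only the high half with MOVHPS. */
    int64_t d;
    _mm_storeh_pi((__m64 *)&d, _mm_castsi128_ps(v));

    return a ^ b ^ c ^ d;   /* all four are the same value */
}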
Does this turn into a linear, unbroken block in memory?
No, it's an array of pointers to separate blocks of memory, introducing an extra level of indirection vs. a proper 2D array. Don't do that.
Make one large allocation, and do the index calculation yourself (using array[x*yres + y]).
And yes, load data from it with _mm_load_si128, or loadu if you need to load from an offset.
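A minimal sketch of that layout, with placeholder names (rows stay 16-byte aligned as long as yres is a multiple of 16, which holds for the powers of two mentioned in the question):

#include <emmintrin.h>
#include <stdint.h>
#include <stdlib.h>

void process(uint32_t xres, uint32_t yres)
{
    /* One flat allocation; element (x, y) lives at greys[x * yres + y]. */
    uint8_t *greys = aligned_alloc(16, (size_t)xres * yres);
    if (!greys) return;

    for (uint32_t x = 0; x < xres; x++)
        for (uint32_t y = 0; y < yres; y += 16) {   /* 16 bytes per vector */
            __m128i v = _mm_load_si128((const __m128i *)&greys[x * yres + y]);
            (void)v;   /* ... process v, then _mm_store_si128 it back ... */
        }
    free(greys);
}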
assumed _mm_load_si128 only loads signed bytes
Signed or unsigned isn't an inherent property of a byte; it's only how you interpret the bits. You use the same load intrinsic for loading two 64-bit elements, or a 128-bit bitmap.
Use intrinsics that are appropriate for your data. It's a little bit like assembly language: everything is just bytes, and the machine will do what you tell it with your bytes. It's up to you to choose a sequence of instructions / intrinsics that produces meaningful results.
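For example, for the absolute-difference-of-greyscale-bytes task from the question, SSE2 already has unsigned byte min/max, so you can get a per-byte absolute difference with no signedness tricks at all (a small sketch):

#include <emmintrin.h>  /* SSE2 */

/* |a - b| for 16 unsigned bytes: max - min never underflows. */
static inline __m128i absdiff_u8(__m128i a, __m128i b)
{
    return _mm_sub_epi8(_mm_max_epu8(a, b), _mm_min_epu8(a, b));
}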
The integer load intrinsics take __m128i* pointer args, so you have to use _mm_load_si128( (const __m128i*) my_int_pointer ) or similar. This looks like pointer aliasing (e.g. reading an array of int through a short *), which is Undefined Behaviour in C and C++. However, this is how Intel says you're supposed to do it, so any compiler that implements Intel's intrinsics is required to make this work correctly. gcc does so by defining __m128i with __attribute__((may_alias)).
See also Loading data for GCC's vector extensions which points out that you can use Intel intrinsics for GNU C native vector extensions, and shows how to load/store.
To learn more about SIMD with SSE, there are some links in the sse tag wiki, including some intro / tutorial links.
The x86 tag wiki has some good x86 asm / performance links.
I've been searching for a while now, but can't seem to find anything useful in the documentation or on SO. This question didn't really help me out, since it makes references to modifying the assembly and I am writing in C.
I have some code making indirect accesses that I want to vectorize.
for (i = 0; i < LENGTH; ++i) {
foo[bar[i]] *= 2;
}
Since I have the indices I want to double inside bar, I was wondering if there was a way to load those indices of foo into a vector register and then I could apply my math and store it back to the same indices.
Something like the following. The load and store instructions I just made up because I couldn't find anything like them in the AVX or SSE documentation. I think I read somewhere that AVX2 has similar functions, but the processor I'm working with doesn't support AVX2.
for (i = 0; i < LENGTH; i += 8) {
    // For simplicity, I'm leaving out any pointer type casting
    __m256 ymm0 = _mm256_load_indirect(bar+i);
    __m256 ymm1 = _mm256_set1_epi32(2); // Set up vector of just 2's
    __m256 ymm2 = _mm256_mul_ps(ymm0, ymm1);
    _mm256_store_indirect(ymm2, bar+i);
}
Are there any instructions in AVX or SSE that will allow me to load a vector register with an array of indices from a different array? Or any "hacky" ways around it if there isn't an explicit function?
(I'm writing an answer to this old question as I think it may help others.)
Short answer
No. There are no scatter/gather instructions in the SSE and AVX instruction sets.
Longer answer
Scatter/gather instructions are expensive to implement (in terms of complexity and silicon real estate) because the scatter/gather mechanism needs to be deeply intertwined with the cache memory controller. I believe this is the reason that this functionality was missing from SSE/AVX.
For newer instruction sets the situation is different. In AVX2 you have
VGATHERDPD, VGATHERDPS, VGATHERQPD, VGATHERQPS for floating point gather (intrinsics here)
VPGATHERDD, VPGATHERQD, VPGATHERDQ, VPGATHERQQ for integer gather (intrinsics here)
In AVX-512 we got
VSCATTERDPD, VSCATTERDPS, VSCATTERQPD, VSCATTERQPS for floating point scatter (intrinsics here)
VPSCATTERDD, VPSCATTERQD, VPSCATTERDQ, VPSCATTERQQ for integer scatter (intrinsics here)
However, it is still a question whether using scatter/gather for such a simple operation would actually pay off.
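For illustration, here is roughly what the loop from the question looks like with an AVX2 gather (a sketch with my own function name, assuming foo holds 32-bit ints). Note that there is still no scatter before AVX-512, so the stores stay scalar, and it assumes the 8 indices within each group are distinct (otherwise the result diverges from the sequential scalar loop):

#include <immintrin.h>
#include <stddef.h>

void double_indexed(int *foo, const int *bar, size_t length)
{
    size_t i = 0;
    for (; i + 8 <= length; i += 8) {
        __m256i idx  = _mm256_loadu_si256((const __m256i *)(bar + i));
        __m256i vals = _mm256_i32gather_epi32(foo, idx, 4);  /* scale = 4 bytes */
        vals = _mm256_add_epi32(vals, vals);                 /* *= 2 */

        int tmp[8];
        _mm256_storeu_si256((__m256i *)tmp, vals);
        for (int j = 0; j < 8; j++)      /* no scatter: store one at a time */
            foo[bar[i + j]] = tmp[j];
    }
    for (; i < length; i++)              /* scalar tail */
        foo[bar[i]] *= 2;
}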
I am new to AArch64 Advanced SIMD (NEON) and I want to port AArch32 code to AArch64. In AArch32, if I wanted to access the lower or higher half of a register, I simply used Dn instead of Qn. For example, if I wanted to access the lower 64 bits of Q12, I simply referred to D24. However, I cannot figure out how I can access half of a Vn register in AArch64.
I would like to access the higher half of a Vn register. So, if I write Vn.2S, I assume it gives me the lower half of the register. Is that correct? If yes, how can I access the higher half then?
I tried accessing the halves as well. As per the manual, I believe there is no way to address the slots individually: V0, D0 and S0 all refer to the same (low-order) data. In ARM32, by contrast, Q0 consists of D0 and D1, and D0 in turn consists of S0 and S1.
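The same asymmetry shows up in the C intrinsics, by the way (a small sketch): vget_low_* is free because the low half is just Dn, while vget_high_* costs an extra instruction on AArch64:

#include <arm_neon.h>

uint32x2_t halves_demo(uint32x4_t q)
{
    uint32x2_t lo = vget_low_u32(q);   /* low half: just Dn, no code emitted */
    uint32x2_t hi = vget_high_u32(q);  /* high half: needs a DUP/EXT         */
    return vadd_u32(lo, hi);
}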
I have successfully used pointers to select either the upper or lower half of an Arm Neon vector.
uint32x4_t vector = { 1, 2, 3, 4 };
uint32x2_t *upperhalf = (uint32x2_t *) &vector[2];
uint32x2_t *lowerhalf = (uint32x2_t *) &vector[0];
*lowerhalf = *upperhalf;
printf("%u", vector[0]);
Prints out 3. This effectively tells the compiler to target one of the two double registers that make up the quad register. It does not necessarily mean it will be reading or writing to memory when doing this; instead, it sees you want to target the double register directly.
This works with GCC 8, maybe older releases also. Clang 7 gave a "targeting vector..." error message. I have not been able to use the pointer to target individual indexes in the double register; however, using it as a regular vector of the datatype it is cast to, either as source or destination, has always worked. Below is another example, byte-swapping the vector half using the pointer.
*lowerhalf = vreinterpret_u32_u8(vrev32_u8(vreinterpret_u8_u32(*lowerhalf)));
It is not good practise to target odd indexes, as those halves straddle the underlying registers. I have not tried to see what that does, but it will likely shuffle data through temporary register lanes to complete the operation. Using pointers in this way has also worked when vectors are members of a struct.
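If you need something Clang also accepts, the same byte swap can be expressed with vget/vcombine instead of pointers (a sketch; it may cost an extra move compared to the pointer trick):

#include <arm_neon.h>

uint32x4_t swap_low_bytes(uint32x4_t v)
{
    uint32x2_t lo = vget_low_u32(v);
    lo = vreinterpret_u32_u8(vrev32_u8(vreinterpret_u8_u32(lo)));
    return vcombine_u32(lo, vget_high_u32(v));   /* reassemble the quad */
}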
Where do x86-64's SSE instructions (vector instructions) outperform the normal instructions? What I'm seeing is that the frequent loads and stores required for executing SSE instructions nullify any gain we get from the vector calculation. So could someone give me an example of SSE code that performs better than normal code?
It's maybe because I am passing each parameter separately, like this...
__m128i a = _mm_set_epi32(pa[0], pa[1], pa[2], pa[3]);
__m128i b = _mm_set_epi32(pb[0], pb[1], pb[2], pb[3]);
__m128i res = _mm_add_epi32(a, b);
for( i = 0; i < 4; i++ )
    po[i] = res.m128i_i32[i];
Isn't there a way I can pass all 4 integers at one go, I mean pass the whole 128 bits of pa at one go? And assign res.m128i_i32 to po at one go?
Summarizing comments into an answer:
You have basically fallen into the same trap that catches most first-timers. Basically there are two problems in your example:
You are misusing _mm_set_epi32().
You have a very low computation/load-store ratio. (1 to 3 in your example)
_mm_set_epi32() is a very expensive intrinsic. Although it's convenient to use, it doesn't compile to a single instruction. Some compilers (such as VS2010) can generate very poorly performing code when using _mm_set_epi32().
Instead, since you are loading contiguous blocks of memory, you should use _mm_load_si128(). That requires that the pointer is aligned to 16 bytes. If you can't guarantee this alignment, you can use _mm_loadu_si128() - but with a performance penalty. Ideally, you should properly align your data so that you don't need to resort to using _mm_loadu_si128().
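Here is a sketch of the snippet from the question rewritten that way (assuming pa, pb and po are 16-byte aligned; otherwise use the loadu/storeu variants):

#include <emmintrin.h>

/* Same computation as in the question: po[0..3] = pa[0..3] + pb[0..3]. */
void add4(const int *pa, const int *pb, int *po)
{
    __m128i a   = _mm_load_si128((const __m128i *)pa);
    __m128i b   = _mm_load_si128((const __m128i *)pb);
    __m128i res = _mm_add_epi32(a, b);
    _mm_store_si128((__m128i *)po, res);   /* all four ints at one go */
}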
To be truly efficient with SSE, you'll also want to maximize your computation/load-store ratio. A target that I shoot for is 3 - 4 arithmetic instructions per memory access. This is a fairly high ratio. Typically you have to refactor the code or redesign the algorithm to increase it. Combining passes over the data is a common approach.
Loop unrolling is often necessary to maximize performance when you have large loop bodies with long dependency chains.
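As a generic illustration (not from the linked questions), unrolling a reduction with two independent accumulators hides the latency of the addition's dependency chain:

#include <emmintrin.h>
#include <stddef.h>

/* Sum n floats (n assumed to be a multiple of 8) with two accumulators. */
float sum_floats(const float *p, size_t n)
{
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(p + i));      /* chain 1 */
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(p + i + 4));  /* chain 2 */
    }
    __m128 acc = _mm_add_ps(acc0, acc1);                   /* combine chains */
    acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));        /* horizontal sum */
    acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 1));
    return _mm_cvtss_f32(acc);
}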
Some examples of SO questions that successfully use SSE to achieve speedup.
C code loop performance (non-vectorized)
C code loop performance [continued] (vectorized)
How do I achieve the theoretical maximum of 4 FLOPs per cycle? (contrived example for achieving peak processor performance)
I want to implement (what abstractly represents) a two-dimensional 4x4 matrix. All the code I write for matrix multiplication et cetera will be entirely "unrolled" as it were - that is to say, I will not be using loops to access and write data entries in the matrix.
My question is: In C, would it be faster to use a struct as such:
typedef struct {
    double e0, e1, e2, e3, e4, ..., e15;
} My4x4Matrix;
Or would this be faster:
typedef double My4x4Matrix[16];
Given that I will be accessing each matrix element individually as such:
My4x4Matrix a,b,c;
// (Some initialization of a and b.)
...
c.e0=a.e0+b.e0;
c.e1=a.e1+b.e1;
...
Or
My4x4Matrix a,b,c;
// (Some initialization of a and b.)
...
c[0]=a[0]+b[0];
c[1]=a[1]+b[1];
...
Or are they exactly the same speed?
Any decent compiler will generate the exact same code, byte-for-byte. However, using arrays allows you a lot more flexibility; when accessing the matrix elements, you can choose whether you want to access fixed locations or address positions with variables.
I also highly question your choice to "unwind" (unroll?) all the operations by hand. Any good compiler can fully unroll loops with a constant number of iterations for you, and can perhaps even generate SIMD code and/or optimally schedule the order of instructions. You'll have a hard time doing better by hand, and you'll end up with code that's hideous to read. The fact that you asked this question suggests to me that you're probably not sufficiently experienced to do better than even a naive optimizing compiler.
Struct elements (fields) can only be accessed by their names explicitly specified in the program's source, which means that every time you access a field the actual field must be selected and hardcoded at compile time. If you wanted to implement the same thing with arrays, that would mean that you would use explicit constant compile-time array indices (as in your example). In this case the performance of the two will be exactly the same and the code generated will be exactly the same (excluding from consideration "malicious" compilers).
However, note that arrays provide you with an extra degree of freedom: if necessary, you can select array elements by a run-time index. This is something that's not possible with structs. Only you know whether it matters to you.
On the other hand, note also that arrays in C are not copyable by assignment, which means that you'll be forced to use memcpy to copy your array-based My4x4Matrix. With the struct-based version, normal language-level copying will work. With arrays, this issue can be worked around by wrapping the actual array in a struct, as the sketch below shows.
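A small sketch of that workaround, reusing the type name from the question:

typedef struct { double e[16]; } My4x4Matrix;

void copy_demo(void)
{
    My4x4Matrix a = {{0}}, b;
    b = a;           /* plain assignment works: the struct is copyable */
    b.e[5] += 1.0;   /* and the inner array is still run-time indexable */
}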
I guess both are the same speed. The difference between a struct and an array is just its meaning (in human terms). Both will be compiled as memory addresses.
I would say the best way is to create a test to try it yourself. Results may vary based on system environments and compilers.