I am using SSE2 with gcc 4.4.3. In my program, I need to access, say, the lowest 8 bits (bits 0-7) of a 128-bit SIMD register. Please suggest a way to retrieve those 8 bits quickly.
I tried _mm_movepi64_pi64 and _mm_extract_epi16, both of which give similar performance in my program. I also tried a union approach: union { __m128i a1; int a2[4]; }. Though it gave good results in a test case, in my program this approach was not very good.
Any ideas? (Which of the three ways mentioned above should I use?)
_mm_movepi64_pi64 moves from XMM to MMX registers. There's no way it's the right choice, unless you want to do some more SIMD in MMX registers, and your code runs out of XMM regs.
If you want the bits as an array index or something, they have to be in a GP register, in which case you want SSE4.1 _mm_extract_epi8.
If you need to stick to SSE2, this should be the fastest way to get byte 5 of xmm0:
pextrw eax, xmm0, 2
movzx eax, ah
So this expression should hopefully get the compiler to emit something like that:
(uint8_t)(_mm_extract_epi16(var, n/2) >> ((n%2) * 8))
Less efficient would be a byte shift, _mm_bsrli_si128 (psrldq), to put the byte you want into the low byte of an xmm reg, then movd (_mm_extract_epi16(var, 0) emits movd, not pextrw r32, xmm, 0, fortunately). This way you don't have to do anything extra if the byte you want is an odd-numbered byte that pextrw would leave in the high 8 bits of its result. There's still no easy way to use this with an index that isn't a compile-time constant.
Storing 16B to memory and loading the element you want should be fairly good. (That's what you'll probably get with the union approach, unless the compiler optimizes it to a pextrw.) The compiler will use a 16B-aligned location on the stack, so store->load forwarding should work fine and the latency will be low. If you need two separate elements in two separate integer variables, this is probably the best choice, maybe beating multiple pextrw instructions.
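Putting those two SSE2-only options together, here is a minimal sketch (the helper names are mine, not from the question): the pextrw-based expression for a compile-time-constant index, and a store/reload helper for a runtime-variable index.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Compile-time-constant index n (0..15): pextrw plus a scalar shift/mask. */
#define EXTRACT_BYTE(vec, n) \
    ((uint8_t)(_mm_extract_epi16((vec), (n) / 2) >> (((n) % 2) * 8)))

/* Runtime-variable index: store all 16 bytes to the stack and reload one.
   Store->load forwarding keeps the latency low. */
static inline uint8_t extract_byte_var(__m128i vec, unsigned idx)
{
    union { __m128i v; uint8_t b[16]; } u;
    u.v = vec;
    return u.b[idx & 15];
}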
Looking through the Intel intrinsics guide, I saw the _mm_slli_si128 intrinsic. Going by the naming pattern, the meaning should be clear: "shift 128-bit register left by a fixed number of bits", but it is not. In actuality it shifts by a fixed number of bytes, which makes it exactly the same as _mm_bslli_si128.
Is this an oversight? Shouldn't it be shifting by bits like _mm_slli_epi32 or _mm_slli_epi64?
If not, in which situation should I use this over _mm_bslli_si128?
Is there an assembly instruction which does this correctly?
What is the best way of emulating this with smaller shifts?
1) That's not an oversight. That instruction indeed shifts by bytes, i.e. multiples of 8 bits.
2) It doesn't matter; _mm_slli_si128 and _mm_bslli_si128 are equivalent, both compile into the pslldq SSE2 instruction.
As for the emulation, I'd do it like the code below, assuming you have C++17. If you're writing C++14, replace if constexpr with a normal if, and add a message to the static_assert.
#include <emmintrin.h> // SSE2 intrinsics

template<int i>
inline __m128i shiftLeftBits( __m128i vec )
{
    static_assert( i >= 0 && i < 128 );
    // Handle a couple of trivial cases
    if constexpr( 0 == i )
        return vec;
    if constexpr( 0 == ( i % 8 ) )
        return _mm_slli_si128( vec, i / 8 );
    if constexpr( i > 64 )
    {
        // Shifting by more than 64 bits: the lower half will be all zeros
        vec = _mm_slli_si128( vec, 8 );
        return _mm_slli_epi64( vec, i - 64 );
    }
    else
    {
        // Shifting by less than 64 bits:
        // need to propagate a few bits across the 64-bit lanes.
        __m128i low = _mm_slli_si128( vec, 8 );
        __m128i high = _mm_slli_epi64( vec, i );
        low = _mm_srli_epi64( low, 64 - i );
        return _mm_or_si128( low, high );
    }
}
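A quick usage example (the values are arbitrary, just for illustration); a 3-bit shift takes the general path below 64:

// Shift a whole 128-bit register left by 3 bits.
__m128i v = _mm_set_epi64x( 0x0123456789ABCDEFLL, 0x0FEDCBA987654321LL );
__m128i r = shiftLeftBits<3>( v );   // compiles to pslldq + psllq + psrlq + por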
TL:DR: They're synonyms; the bslli name is newer, introduced around the same time as new AVX-512 intrinsics (sometime before 2015, long after SSE2 _mm_slli_si128 was in widespread usage). I find it clearer and would recommend it for new development.
SSE/AVX2/AVX-512 do not have bit-shifts with element sizes wider than 64. (Nor any other bit-granularity operation like add, except pure-vertical bitwise boolean stuff, which is really 128 fully separate operations, not one big wide one. Or, for AVX-512 masking and broadcast-load purposes, it can be treated in dword or qword chunks, like _mm512_xor_epi32 / vpxord.)
You have to emulate it somehow, which can be fairly efficient for compile-time-constant counts: you can pick between strategies according to c >= 64, with the special case of c % 8 == 0 reducing to a byte-shift. Existing SO Q&As cover that, or see @Soonts' answer on this question.
Runtime-variable counts would suck; you'd have to branch or do both ways and blend, unlike for element bit-shifts where _mm_sll_epi64(v, _mm_cvtsi32_si128(i)) can compile to movd / psllq xmm, xmm. Unfortunately, hardware variable-count versions of byte-shuffle/shift instructions don't exist, only for the bit-shift versions.
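For contrast, here is a minimal sketch (the wrapper name is mine) of the runtime-variable 64-bit-element shift that the hardware does support:

#include <emmintrin.h>

// Runtime-variable shift of each 64-bit element; compiles to movd + psllq.
static inline __m128i sll_epi64_var(__m128i v, int count)
{
    return _mm_sll_epi64(v, _mm_cvtsi32_si128(count));
}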
bslli / bsrli are new, clearer intrinsic names for the same asm instructions
The b names are supported in current versions of all 4 major compilers for x86 (Godbolt), and I'd recommend them for new development unless you need backwards compat with crusty old compilers, or for some reason you like the old name that doesn't bother to distinguish this from different operations. (e.g. familiarity; if you don't want people to have to look up this newfangled name in the manual.)
gcc since 4.8
clang since 3.7
ICC since ICC13 or earlier, Godbolt doesn't have any older
MSVC since 19.14 or earlier, Godbolt doesn't have any older
If you check the intrinsics guide, _mm_slli_si128 is listed as an intrinsic for PSLLDQ, which is a byte shift. This is not a bug, just Intel's idea of a joke, or whatever process they used to choose names for intrinsics back in the SSE2 days. (There are only 2 hard problems in computer science: cache invalidation and naming things).
Asm mnemonics also use the same pattern of not making the byte-shuffle one look different from the bit-shifts. psllw xmm, 1 / pslld / psllq / pslldq. Again, you just have to know that 128-bit size is special, and must be a byte shuffle not a bit-shift, because x86 never has that. (Or you have to check the manual.)
The asm manual entry for pslldq in turn lists intrinsics for forms of it, interestingly only using the b name for the __m512i AVX-512BW version. When SSE2 and AVX2 were new, _mm_slli_si128 and _mm256_slli_si256 were the only names available, I think. Certainly the bslli name post-dates the SSE2 intrinsics.
(Note that the si256 and si512 versions are just 2 or 4 copies of the 16-byte operation, not shifting bytes across 128-bit lanes; something a few other Q&As have asked for. This often makes AVX2 versions of shuffles like this and palignr a lot less useful than they'd otherwise be: either not worth using at all, or needing extra shuffles on top of it.)
I think this new bslli name was introduced when AVX-512 was new. Intel invented some new names for other intrinsics around that time, and the AVX-512 load/store intrinsics take void* instead of __m512i*, which is a major improvement to the amount of noise in code, especially for C where implicit conversion to void* is allowed. (Creating a misaligned __m512i* is not actually a problem in C terms, but you couldn't deref it normally, so it's a weird-looking thing to do.) So there was cleanup work happening on intrinsic naming then, and I think this was part of it.
(AVX-512 also gave Intel the chance to introduce some fairly bad names, like _mm_loadu_epi32(const void*) - you'd guess that's a strict-aliasing-safe way to do a 32-bit movd load, right? No, unfortunately, it's an intrinsic for vmovdqu32 xmm, [mem] with no masking. It's just _mm_loadu_si128 with a different C type for the pointer arg. It's there for consistency with the naming pattern of _mm_maskz_loadu_epi32. It would be nice to have void* load / store intrinsics for __m128i and __m256i, but if they have misleading names like that (esp. when you aren't using the mask/maskz versions in nearby code), I'll just stick to those cumbersome _mm256_loadu_si256( (const __m256i*)(arr + i) ) casts for the old intrinsic, because I love typing 256 three times. >.<)
I wish asm was more maintainable (or that intrinsics just used asm mnemonics) because it's much more concise; Intel generally does a good job naming their mnemonics.
It somewhat but not entirely helps to note the difference between epi16/32/64 and si128: EPI = Extended (SSE instead of MMX) Packed Integer. (Packed implying multiple SIMD elements). si128 means a whole 128-bit integer vector.
There's no way to infer from the name that you aren't just doing the same thing to a single 128-bit integer, instead of packed elements. You just have to know that there are no bit-granularity things that ever cross 64-bit boundaries, only SIMD shuffles (which work in terms of bytes). This avoids the combinatorial explosion of building a really wide barrel shifter, or of carry propagation at such a long distance for a 128-bit add, or whatever.
Today's question is fairly short. Consider the following toy C program shuffle.c for reversing two packed doubles in register xmm0:
#include <stdio.h>

int main () {
    double x[2] = {0.0, 1.0};
    asm volatile (
        "movupd (%[x]), %%xmm0\n\t"
        "shufpd $1, %%xmm0, %%xmm0\n\t"    /* method 1 */
        //"pshufd $78, %%xmm0, %%xmm0\n\t" /* method 2 */
        "movupd %%xmm0, (%[x])\n\t"
        :
        : [x] "r" (x)
        : "xmm0", "memory");
    printf("x[0] = %.2f, x[1] = %.2f\n", x[0], x[1]);
    return 0;
}
After a dry run, gcc -msse3 -o shuffle shuffle.c && ./shuffle, both methods/instructions return the correct result x[0] = 1.00, x[1] = 0.00. This page says that shufpd has a latency of 6 cycles, while the Intel intrinsics guide says that pshufd has a latency of only 1 cycle. This sounds like a strong preference for pshufd. However, that instruction is really for packed integers. When using it on packed doubles, will there be any penalty associated with the "wrong type"?
As a similar question, I also heard that the instruction movaps is 1 byte smaller than movapd, and that they do the same thing: read 128 bits from a 16-byte-aligned address. So can we always use the former for moves (between XMMs) / loads (from memory) / stores (to memory)? This seems crazy. I think there must be some reason to reject this. Can someone give me an explanation? Thank you.
You'll always get correct results, but it can matter for performance.
Prefer FP shuffles for FP data that will be an input to FP math instructions (like addps or vfma..., as opposed to insns like xorps).
This avoids any extra bypass-delay latency on some microarchitectures, including potentially current Intel chips. See Agner Fog's microarchitecture guide. AMD Bulldozer-family does all shuffles in the vector-integer domain, so there's a bypass delay whichever shuffle you use.
If it saves instructions, it can be worth it to use an integer shuffle anyway. (But usually it's the other way around, where you want to use shufps to combine data from two integer vectors. That's fine in even more cases, and mostly a problem only on Nehalem, IIRC.)
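As a sketch of that second case (the function name is mine), using shufps on integer vectors just needs the cast intrinsics, which generate no instructions:

#include <emmintrin.h>   // SSE2 (pulls in SSE for shufps)

// Result = { a0, a1, b0, b1 }: the low 64 bits of each input, combined
// with one shufps.
static inline __m128i combine_low_halves(__m128i a, __m128i b)
{
    __m128 res = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                                _MM_SHUFFLE(1, 0, 1, 0));
    return _mm_castps_si128(res);
}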
http://x86.renejeschke.de/html/file_module_x86_id_293.html lists the latency for CPUID 0F3n/0F2n CPUs, i.e. Pentium4 (family 0xF model 2 (Northwood) / model 3 (Prescott)). Those numbers are obviously totally irrelevant, and don't even match Agner Fog's P4 table for shufpd.
Intel's intrinsics guide sometimes has numbers that don't match experimental testing, either. See Agner Fog's instruction tables for good latency/throughput numbers, and microarch guides to understand the details.
movaps vs. movapd: No existing microarchitectures care which you use. It would be possible for someone in the future to design an x86 CPU that kept double vectors separate from float vectors internally, but for now the only distinction has been int vs. FP.
Always prefer the ps instruction when the behaviour is identical (xorps over xorpd, movhps over movhpd).
Some compilers (maybe both gcc and clang, I forget) will compile a _mm_store_si128 integer vector store to movaps, because there's no performance downside on any existing hardware, and it's one byte shorter.
IIRC, there's also no perf downside to loading integer vector data with movaps / movups, but I'm less sure about that.
There is a perf downside to using the wrong mov instruction for a reg-reg move, though. movdqa xmm1, xmm2 between two FP instructions is bad on Nehalem.
re: your inline asm:
It doesn't need to be volatile, and you could drop the "memory" clobber if you used a 16 byte struct or something as a "+m" input/output operand. Or a "+x" vector-register operand for an __m128d variable.
You'll probably get better results from intrinsics than from inline asm, unless you write whole loops in inline asm or stand-alone functions.
See the x86 tag wiki for a link to my inline asm guide.
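For what it's worth, the whole shuffle.c experiment reduces to a single FP-shuffle intrinsic; a minimal sketch (variable names mine):

#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d v = _mm_setr_pd(0.0, 1.0);
    // FP shuffle (shufpd) swapping the two doubles, keeping the data
    // in the FP domain.
    v = _mm_shuffle_pd(v, v, 1);
    double x[2];
    _mm_storeu_pd(x, v);
    printf("x[0] = %.2f, x[1] = %.2f\n", x[0], x[1]);
    return 0;
}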
I have a 128-bit variable filled with 4 separate integers: [1,2,3,4]. I want to shift right, so I can get [2,3,4,0]. What's the fastest way to do this?
My current code:
__m128 v1;
v1 = (__m128)_mm_srli_si128( _mm_castps_si128(v1) , 4 );
This succeeds in doing the shift, but I am trying to go for speed and cache optimization (i.e. as few variables as possible). Is there any way to improve this code to avoid casting to and from __m128i?
thanks
Don't worry about it. __m128 and __m128i are two different ways of interpreting the contents of an XMM register, so the cast disappears in compilation. My compiler (clang on Mac OS 10.9) compiles the whole thing down to a single instruction as it stands:
psrldq $0x4, %xmm0
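Spelling out both casts with the cast intrinsics (the function name is mine) changes nothing in the generated code; it still compiles to that single psrldq:

#include <emmintrin.h>

// [1,2,3,4] -> [2,3,4,0]: shift the whole register right by one 32-bit
// element. The casts only change the type; they emit no instructions.
static inline __m128 shift_elements_right(__m128 v)
{
    return _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(v), 4));
}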
How much faster is the following assembler code:
shl ax, 1
Versus the following C code:
num = num * 2;
How can I even find out?
Your assembly variant might be faster, might be slower. What made you think that it is necessarily faster?
On the x86 platform, there are quite a few ways to multiply something by 2. I would expect a compiler to do add ax, ax, which is intuitively more efficient than your shl because it doesn't involve a potentially stored constant ('1' in your case).
Also, for quite a long time, on a x86 platform the preferred way of multiplying things by constants was not a shift, but rather a lea operation (when possible). In the above example that would be lea eax, [eax*2]. (Multiplication by 3 would be done through lea eax, [eax*2+eax])
The belief in shift operations being somehow "faster" is a nice old story for newbies, which has virtually no relevance today. And, as usual, most of the time your compiler (if it is up-to-date) has much better knowledge about the underlying hardware platform than people with naive love for shift operations.
Is this, by any chance, an academic question? I assume you understand it is in the general category of "getting a haircut to lose weight".
If you are using GCC, ask to see the generated assembly with option -S. You may find it's the same as your assembler instruction.
To answer the original question, on Out-Of-Order processors instruction speed is measured by throughput and latency, and you would measure both using the rdtsc assembly instruction. But someone else did it for you for a lot of processors, so you don't need to bother. PDF
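If you do want to measure it yourself anyway, a rough sketch using the __rdtsc() intrinsic (ignoring turbo/frequency and serialization issues, with an arbitrary iteration count) looks like this:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc()

int main(void)
{
    // volatile keeps the loop from being optimized away, so this times the
    // whole load/multiply/store chain, not just the multiply itself.
    volatile unsigned num = 123;
    const int iters = 100000000;

    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++)
        num = num * 2;
    uint64_t t1 = __rdtsc();

    printf("~%.2f TSC ticks per iteration\n", (double)(t1 - t0) / iters);
    return 0;
}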
In most circumstances, it won't make a difference. Multiplication is fast on nearly all modern hardware. In particular, it is usually fast enough that unless you have meticulously hand-optimized code, the pipeline will hide the entirety of the latency and you will see no speed difference at all between the two cases.
You may be able to measure a performance difference on multiplies and shifts when you execute them in isolation, but there will typically not be any difference in the context of the rest of your compiled code. (As I noted, this may not hold true if the code is meticulously optimized).
Now, that said, shifts are still generally faster than multiplies, and almost any reasonable compiler will map a fixed power-of-two multiply into a shift, anyway (assuming that the semantics are actually equivalent on the target architecture).
Edit: one more thing you may want to try if you really care about this is x+x. I know of at least one architecture on which this can actually be faster than shifting, depending on the surrounding context.
If you have a decent compiler, it will produce the same or similar code. The best way is to disassemble and check the generated code.
The answer depends, as you've been able to see here, upon many things. What the compiler will do with your C code depends on a lot of things. If we're talking x86-32 the following should be generally applicable.
At the basic level, your C code indicates a memory variable, which would require at least one instruction to multiply by two (shl mem,1), and in such a simple case the C code will be slower.
If num is a local variable the compiler may decide to put it in a register (if it is used often enough and/or the function is small enough) and then you will have your "shl reg,1" instruction - maybe.
What instruction is fastest has everything to do with how instructions are implemented in the processor. Shl may not be the best choice since it affects the C and Z flags, which slows it down. A few years ago the recommendation was lea reg,[reg+reg] (all regs the same) because lea didn't affect any flags, and there were variants such as these (using the eax register on the x86-32 platform as an example):
lea eax,[eax+eax] ; *2
lea eax,[eax+eax*2] ; *3
lea eax,[eax+eax*4] ; *5
lea eax,[eax+eax*8] ; *9
I don't know what the norm is today, but your compiler probably does.
As for measuring, search for information here on the rdtsc instruction, which is the hands-down best alternative as it counts actual clock cycles.
Put them in a loop with a counter that goes so high that it runs for at least a second in the fastest case. Use your favorite timing mechanism to see how long each takes.
The assembly case should be done with inline assembly in the same C program as you use for the pure C test. Otherwise, you're not comparing apples to apples.
By the way, I think you should add a third test:
num <<= 1;
The question then is whether that does the same thing as the assembly version.
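A minimal sketch of such an apples-to-apples test (assuming gcc-style inline asm on x86; the values are arbitrary) might look like this. Comparing the compiler output with -S is usually more telling than timing it:

#include <stdio.h>

int main(void)
{
    unsigned a = 21, b = 21, c = 21;

    a = a * 2;                                  // C multiply
    b <<= 1;                                    // C shift
    asm ("shl $1, %0" : "+r" (c) : : "cc");     // inline-asm shift (AT&T syntax)

    printf("%u %u %u\n", a, b, c);              // all three print 42
    return 0;
}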
If, for your target platform, shifting left is the quickest way to multiply a number by two, then the chances are your compiler will do that when compiling the code. Look at the disassembly to check.
So, for that one line, it's probably exactly the same speed. However, as you're unlikely to have a function containing just that one line, you might well find the compiler would defer the shift until the value is used, or otherwise mix it up with surrounding code, making it less clear cut. A good optimizing compiler will generally do a good job of beating poor to average hand written assembly.
If the up-to-date compiler (VC9) were really doing a good job, it would outperform VC6 by a wide margin, and that doesn't happen; this is why I still prefer VC6 for some code, which runs faster than code compiled with MinGW at -O3 or VC9 with /Ox.
1) On a 32-bit CPU, is it faster to access an array of 32 boolean values or to access the 32 bits within one word? (Assume we want to check the value of the Nth element and can use either a bit-mask (Nth bit is set) or the integer N as an array index.)
It seems to me that the array would be faster because all common computer architectures natively work at the word level (32 bits, 64 bits, etc., processed in parallel) and accessing the sub-word bits takes extra work.
I know different compilers will represent things differently, but it seems that the underlying hardware architecture would dictate the answer. Or does the answer depend on the language and compiler?
And,
2) Is the speed answer reversed if this array represents a state that I pass between client and server?
This question came to mind when reading question "How use bit/bit-operator to control object state?"
P.S. Yes, I could write code to test this myself, but then the SO community wouldn't get to play along!
Bear in mind that a theoretically faster solution that doesn't fit into a cache line might be slower than a theoretically slower one that does, depending on a whole host of things. If this is actually something that needs to be fast, as determined by profiling, test both ways and see. If it doesn't, do whatever looks like cleaner code, which is probably the array.
It depends on the compiler and the access patterns and the platform. Raymond Chen has an excellent cost-benefit analysis: http://blogs.msdn.com/oldnewthing/archive/2008/11/26/9143050.aspx .
Even on non-x86 platforms the use of bits can be prohibitive, as at least one PPC platform out there uses microcoded instructions to perform a variable shift, which can do nasty things to other hardware threads.
So it can be a win, but you need to understand the context in which it will be good and bad. (Which is a general thing anyway.)
For question #1: Yes, on most 32-bit platforms, an array of boolean values should be faster, because you will just be loading each 32-bit-aligned value in the array and testing it against 0. If you use a single word, you will have all that work plus the overhead of bit-fiddling.
For question #2: Again, yes, since sending data over a network is significantly slower than operating on data in the CPU and main memory, the overhead of sending even one word will strongly outweigh any performance gain or loss you get by aligning words or bit fiddling.
This is the code generated by 0 != (value & (1 << index)) to test a bit:
00401000 mov eax,1
00401005 shl eax,cl
00401007 and eax,1
And this by values[index] to test a bool[]:
00401000 movzx eax,byte ptr [ecx+eax]
I can't figure out how to put a loop around it that doesn't get optimized away, so I'll vote for bool[].
If you are going to check more than one value at a time, doing it in parallel will obviously be faster. If you're only checking one value, it's probably the same.
If you need a better answer than that, write some tests and get back to us.
I think a byte array is probably better than a full-word array for simple random access.
It will give better cache locality than using the full word size, and I don't think byte access is any slower on most/all common architectures.
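For illustration, a minimal sketch (names are mine) of the two layouts being compared, one byte per flag versus all flags packed into one word:

#include <stdint.h>

// One byte per flag: 32 bytes total, and each test is typically a single
// byte load (movzx).
uint8_t byte_flags[32];
static inline int test_byte(int n)   { return byte_flags[n]; }

// All 32 flags packed into one word: 4 bytes total, but each test needs a
// shift and a mask on top of the load.
uint32_t packed_flags;
static inline int test_packed(int n) { return (packed_flags >> n) & 1u; }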