The SSE shift instructions I have found can only shift by the same amount on all the elements:
_mm_sll_epi32()
_mm_slli_epi32()
These shift all elements, but by the same shift amount.
Is there a way to apply different shifts to the different elements? Something like this:
__m128i a, b;
r0:= a0 << b0;
r1:= a1 << b1;
r2:= a2 << b2;
r3:= a3 << b3;
There exists the _mm_shl_epi32() intrinsic that does exactly that.
http://msdn.microsoft.com/en-us/library/gg445138.aspx
However, it requires the XOP instruction set. Only AMD Bulldozer and Interlagos processors or later have this instruction. It is not available on any Intel processor.
If you want to do it without XOP instructions, you will need to do it the hard way: Pull them out and do them one by one.
Without XOP instructions, you can do this with SSE4.1 using the following intrinsics:
_mm_insert_epi32()
_mm_extract_epi32()
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse41_reg_ins_ext.htm
Those will let you extract parts of a 128-bit register into regular registers to do the shift and put them back.
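As a sketch of that approach (assuming SSE4.1 is available; the helper name shl_epi32_var is made up for illustration):

```c
#if defined(__GNUC__) && !defined(__SSE4_1__)
#pragma GCC target("sse4.1")  /* enable SSE4.1 codegen for this file */
#endif
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Hypothetical helper: shift each 32-bit lane of a left by the
   corresponding count in b, via extract/shift/insert. */
static __m128i shl_epi32_var(__m128i a, __m128i b)
{
    __m128i r = a;
    r = _mm_insert_epi32(r, _mm_extract_epi32(a, 0) << _mm_extract_epi32(b, 0), 0);
    r = _mm_insert_epi32(r, _mm_extract_epi32(a, 1) << _mm_extract_epi32(b, 1), 1);
    r = _mm_insert_epi32(r, _mm_extract_epi32(a, 2) << _mm_extract_epi32(b, 2), 2);
    r = _mm_insert_epi32(r, _mm_extract_epi32(a, 3) << _mm_extract_epi32(b, 3), 3);
    return r;
}
```

Every lane makes a round trip through a general-purpose register here, which is exactly why this is slow.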
If you go with the latter method, it'll be horrifically inefficient. That's why _mm_shl_epi32() exists in the first place.
Without XOP, your options are limited. If you can control the format of the shift-count argument, then you can use _mm_mullo_epi16, since multiplying by a power of two is the same as shifting by that power.
For example, if you want to shift your 8 16-bit elements in an SSE register by <0, 1, 2, 3, 4, 5, 6, 7>, you can multiply by 2 raised to those shift counts, i.e., by <1, 2, 4, 8, 16, 32, 64, 128>.
In some circumstances, this can substitute for _mm_shl_epi32(a, b):
_mm_mullo_epi16(a, 1 << b);
Generally speaking, this requires b to have a constant value - I don't know of an efficient way to calculate (1 << b) using older SSE instructions.
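For the constant-shift case, a minimal SSE2 sketch of the multiply trick (the function name is mine):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Shift eight 16-bit lanes left by <0,1,2,3,4,5,6,7> by multiplying
   each lane by the matching power of two (1,2,4,...,128). */
static __m128i shl_epi16_by_index(__m128i a)
{
    const __m128i pow2 = _mm_set_epi16(128, 64, 32, 16, 8, 4, 2, 1);
    return _mm_mullo_epi16(a, pow2);
}
```

Note that _mm_set_epi16 lists elements from the highest lane down, so lane 0 is multiplied by 1 and lane 7 by 128.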
Related
Assume I have a number and I want to interpret every other bit as a new number, e.g.
uint16_t a = 0b1111111000000001;
uint16_t mask = 0xAAAA; // 0b1010101010101010
I now want to be able to get every other bit packed into two 8 bit variables, like
uint8_t b = a & mask ... // = 0b11110000
uint8_t c = a & ~mask ... // = 0b11100001
Is there an efficient way of accomplishing this? I know that I can loop and shift but I am going to do this for a lot of numbers. Even better if I can get both b and c at the same time.
You can precompute some tables if you want to avoid too much shifting.
I'll describe it for a & mask; the a & ~mask case is identical.
First, compute a & mask to drop the 1's in the unused positions of a.
Suppose you have a = a1 0 a2 0 a3 0 a4 0. You want to get the number a1 a2 a3 a4. There are not many possibilities.
You can have a precomputed vector V of short integers and associate for each entry the corresponding value.
For example, v[0b10100010] will be 13, if the mask is 0b10101010.
If the precomputed table is not too large it will stay in the L1 cache, so lookups will be very fast - for example, if you split your number into groups of 8 or 16 bits.
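A minimal sketch of the table approach for the mask 0b10101010 repeated over 16 bits (the names pack_table and pack_odd_bits are mine):

```c
#include <stdint.h>

/* 256-entry table mapping a masked byte a1 0 a2 0 a3 0 a4 0
   (mask 0b10101010) to the packed nibble a1 a2 a3 a4. */
static uint8_t pack_table[256];

static void init_table(void)
{
    for (int v = 0; v < 256; v++) {
        uint8_t packed = 0;
        for (int bit = 7; bit >= 1; bit -= 2)  /* odd positions 7,5,3,1 */
            packed = (uint8_t)((packed << 1) | ((v >> bit) & 1));
        pack_table[v] = packed;
    }
}

/* Pack the odd-position bits of a 16-bit value into one byte
   using two table lookups; the even-position case is symmetric. */
static uint8_t pack_odd_bits(uint16_t a)
{
    a &= 0xAAAA;  /* drop the bits at the unused positions */
    return (uint8_t)((pack_table[a >> 8] << 4) | pack_table[a & 0xFF]);
}
```

With this mask, pack_table[0b10100010] is 13, matching the example above.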
In my embedded code I need to add offsets 0x100, 0x200, 0x300 etc. (overall number of offsets is fixed, say 64) to initial register address. Is it possible to optimize it with bit shifting? I know that multiplication by 2 is left bit-shifting by 2, but I can't get my head around addition operation.
You can't replace addition with bit shifting; only multiplication and division by powers of two can be replaced. This is because a left shift (<<) by n multiplies x by 2^n. So
(1 << 3) == 1 * pow(2, 3) == 8
A right shift (>>) by n divides x by 2^n.
Doing bit shifting instead of the equivalent multiplication or division by a power of two can be faster, and is sometimes used in games, where performance is critical - although modern compilers usually make that substitution for you.
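Concretely for the question's offsets: since 0x100 * i equals i << 8, the multiply half can become a shift, but the addition to the base address stays an addition (reg_addr is a made-up name for illustration):

```c
#include <stdint.h>

/* Address of the i-th register block: the multiply by 0x100
   becomes a shift by 8; the addition remains an addition. */
static uint32_t reg_addr(uint32_t base, uint32_t i)
{
    return base + (i << 8);  /* same value as base + 0x100 * i */
}
```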
When using the vmlaq_s16 intrinsic (the VMLA.I16 instruction), the result is a set of 8 16-bit integers. The multiplies inside the instruction, however, would need their results stored in 32-bit integers to be protected from overflow.
On Intel processors with SSE2, _mm_madd_epi16 keeps everything within one 128-bit register (8 16-bit inputs producing 4 32-bit results) by multiplying and then adding pairs of consecutive elements of the vectors, i.e.
r0 := (a0 * b0) + (a1 * b1)
r1 := (a2 * b2) + (a3 * b3)
r2 := (a4 * b4) + (a5 * b5)
r3 := (a6 * b6) + (a7 * b7)
Where r0,r1,r2,r3 are all 32-bit, and a0-a7, b0-b7 are all 16-bit elements.
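That pairwise behaviour can be demonstrated with a small SSE2 snippet (illustrative only; madd_demo is a made-up name):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Multiply eight 16-bit lanes pairwise and add into four 32-bit lanes,
   matching the r0..r3 description above. */
static void madd_demo(int32_t out[4])
{
    __m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);  /* a7..a0 */
    __m128i b = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);  /* b7..b0 */
    __m128i r = _mm_madd_epi16(a, b);  /* {1*1+2*2, 3*3+4*4, 5*5+6*6, 7*7+8*8} */
    _mm_storeu_si128((__m128i *)out, r);
}
```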
Is there a trick that I'm missing with the vmlaq_s16 instruction that would allow me to still be able to process 8 16-bit elements at once and have results that don't overflow? Or is it the fact that this instruction is just provided for operands that are inherently in the 4-bit range (highly doubtful)?
Thanks!
EDIT: So I just thought about the fact that if vmlaq_s16 sets the overflow register flag(s?) for each of the elements in the result, then it's easy to count the overflows and recover the result.
EDIT 2: For everyone's reference, here's how to load 8 elements and pipeline two long multiply-adds on a 128-bit register with intrinsics (proof-of-concept code that compiles with VS2012 for the ARM target):
signed short vector1[] = {1, 2, 3, 4, 5, 6, 7, 8};
signed short vector2[] = {1, 2, 3, 4, 5, 6, 7, 8};
int16x8_t v1; // = vdupq_n_s16(0);
int16x8_t v2; // = vdupq_n_s16(0);
v1 = vld1q_s16(vector1);
v2 = vld1q_s16(vector2);
int32x4_t sum = vdupq_n_s32(0);
sum = vmlal_s16(sum, v1.s.low64, v2.s.low64);
sum = vmlal_s16(sum, v1.s.high64, v2.s.high64);
printf("sum: %d\n", sum.n128_i32[0]);
These aren't directly equivalent operations - VMLA multiplies two vectors then adds the result elementwise to a 3rd vector, unlike the self-contained half-elementwise-half-horizontal craziness of Intel's PMADDWD. Since that 3rd vector is a regular operand it has to exist in a register, thus there's no room for a 256-bit accumulator.
If you don't want to risk overflow by using VMLA to do 8x16 * 8x16 + 8x16, the alternative is to use VMLAL to do 4x16 * 4x16 + 4x32. The obvious suggestion would be to pipeline pairs of instructions to process 8x16 vectors into two 4x32 accumulators then add them together at the end, but I'll admit I'm not too familiar with intrinsics so I don't know how difficult they would make that (compared to assembly, where you can exploit the fact that "64-bit vectors" and "128-bit vectors" are simply interchangeable views of the same register file).
I have a question about using 128-bit registers to gain speed in a code. Consider the following C/C++ code: I define two unsigned long long ints a and b, and give them some values.
unsigned long long int a = 4369, b = 56481;
Then, I want to compute
a & b;
Here a is represented in the computer as the 64-bit number 4369 = 1000100010001, and likewise b = 56481 = 1101110010100001, and I compute a & b, which is still a 64-bit number given by the bit-by-bit logical AND between a and b:
a & b = 1000000000001
My question is the following: do computers have a 128-bit register where I could do the operation above with 128-bit integers rather than 64-bit integers, in the same computer time? To be clearer: I would like to gain a factor of two in speed by using 128-bit numbers rather than 64-bit numbers, i.e. compute 128 ANDs rather than 64 ANDs (one AND for every bit) in the same time. If this is possible, do you have a code example? I have heard that the SSE registers might do this, but I am not sure.
Yes, SSE2 has a 128-bit bitwise AND - you can use it via intrinsics in C or C++, e.g.
#include "emmintrin.h" // SSE2 intrinsics
__m128i v0, v1, v2; // 128 bit variables
v2 = _mm_and_si128(v0, v1); // bitwise AND
or you can use it directly in assembler - the instruction is PAND.
You can even do a 256 bit AND on Haswell and later CPUs which have AVX2:
#include "immintrin.h" // AVX2 intrinsics
__m256i v0, v1, v2; // 256 bit variables
v2 = _mm256_and_si256(v0, v1); // bitwise AND
The corresponding instruction in this case is VPAND.
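As a sketch using the question's own values (placed in the low 64-bit halves of the registers; and_demo is a made-up name):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* One 128-bit AND; here only the low halves carry data. */
static uint64_t and_demo(void)
{
    __m128i va = _mm_set_epi64x(0, 4369);   /* 0b1000100010001 */
    __m128i vb = _mm_set_epi64x(0, 56481);  /* 0b1101110010100001 */
    __m128i vr = _mm_and_si128(va, vb);
    return (uint64_t)_mm_cvtsi128_si64(vr); /* low 64 bits: 0b1000000000001 */
}
```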
I want to flip all the bits of a byte and also reverse their order. For example:
Input: 01011111
Output: 00000101
I know I can use ~ to flip a number, but I don't know good ways to reverse it. And I'm not sure whether they can be done together.
Does anyone have any ideas?
For this sort of thing I'd advise you to go to the fantastic bit twiddling hacks webpage. Here's one of the solutions from that page:
Reverse the bits in a byte with 3 operations (64-bit multiply and modulus division):
unsigned char b; // reverse this (8-bit) byte
b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;
The multiply operation creates five separate copies of the 8-bit byte pattern, fanned out into a 64-bit value. The AND operation selects the bits that are in the correct (reversed) positions, relative to each 10-bit group of bits. The multiply and AND operations together copy the bits from the original byte so that each appears in only one of the 10-bit sets. The reversed positions of the bits from the original byte coincide with their relative positions within any 10-bit set. The last step, modulus division by 2^10 - 1, has the effect of summing each set of 10 bits (positions 0-9, 10-19, 20-29, ...) in the 64-bit value. The sets do not overlap, so the addition steps underlying the modulus division behave like OR operations.
This method was attributed to Rich Schroeppel in the Programming Hacks section of Beeler, M., Gosper, R. W., and Schroeppel, R. HAKMEM. MIT AI Memo 239, Feb. 29, 1972.
And here's a different solution that doesn't use 64-bit integers:
Reverse the bits in a byte with 7 operations (no 64-bit):
b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16;
Make sure you assign or cast the result to an unsigned char to remove garbage in the higher bits. Devised by Sean Anderson, July 13, 2001. Typo spotted and correction supplied by Mike Keith, January 3, 2002.
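To answer the "can they be done together" part: yes - invert first, then reverse. A sketch combining ~ with the 3-operation trick quoted above (the function name is mine):

```c
#include <stdint.h>

/* Flip all bits, then reverse their order using the
   64-bit multiply-and-modulus trick. */
static uint8_t flip_and_reverse(uint8_t b)
{
    b = (uint8_t)~b;  /* flip */
    return (uint8_t)(((uint64_t)b * 0x0202020202ULL & 0x010884422010ULL) % 1023);
}
```

On the example input 01011111 this yields 00000101.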