Minimum of 4 SP values in __m128 - c

Suppose you have a __m128 variable holding 4 SP values, and you want the minimum one. Is there an intrinsic function available for this, or anything other than a naive linear comparison among the values?
Right now my solution is the following (suppose the input __m128 variable is x):
x = _mm_min_ps(x, _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(x), 4)));
min = _mm_cvtss_f32(_mm_min_ss(x, _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(x), 8))));
Which is quite horrible, but it works (btw, is there anything like _mm_srli_si128 but for the __m128 type?)

There is no single instruction/intrinsic but you can do it with two shuffles and two mins:
__m128 _mm_hmin_ps(__m128 v)
{
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)));
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    return v;
}
The output vector will contain the min of all the elements in the input vector, replicated throughout the output vector.

Paul R's answer is great! (@Paul R - if you read this, thank you!) I just wanted to try to explain how it actually works for anyone new to SSE stuff like me. Of course I might be wrong somewhere, so any corrections are welcome!
How does _mm_shuffle_ps work?
First of all, SSE registers have indexes that go in reverse to what you might expect, like this:
[6, 9, 8, 5] // values
  3  2  1  0 // indexes
This order of indexing makes vector left-shifts move data from low to high indices, just like left-shifting the bits in an integer. The most-significant element is at the left.
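If you want to see the ordering for yourself, here's a tiny scratch program (mine, not from the answer) that builds a vector and stores it to memory:
#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    // _mm_set_ps takes its arguments high-to-low: (e3, e2, e1, e0)
    __m128 v = _mm_set_ps(6.0f, 9.0f, 8.0f, 5.0f);
    float out[4];
    _mm_storeu_ps(out, v);  // element 0 lands at the lowest address
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // prints: 5 8 9 6
    return 0;
}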
_mm_shuffle_ps can mix the contents of two registers:
// __m128 a : (a3, a2, a1, a0)
// __m128 b : (b3, b2, b1, b0)
__m128 two_from_a_and_two_from_b = _mm_shuffle_ps(b, a, _MM_SHUFFLE(3, 2, 1, 0));
// _MM_SHUFFLE(3, 2, ...) : the first two indexes select from the second operand (a)
// _MM_SHUFFLE(..., 1, 0) : the last two indexes select from the first operand (b)
// two_from_a_and_two_from_b : (a3, a2, b1, b0)
Here, we only want to shuffle the values of one register, not two. We can do that by passing v as both parameters, like this (you can see this in Paul R's function):
// __m128 v : (v3, v2, v1, v0)
__m128 v_rotated_left_by_1 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3));
// v_rotated_left_by_1 : (v2, v1, v0, v3) // i.e. move all elements left by 1 with wraparound
I'm going to wrap it in a macro for readability though:
#define mm_shuffle_one(v, pattern) _mm_shuffle_ps(v, v, pattern)
(It can't be a function because the pattern argument to _mm_shuffle_ps must be constant at compile time.)
Here's a slightly modified version of the actual function – I added intermediate names for readability, as the compiler optimizes them out anyway:
inline __m128 _mm_hmin_ps(__m128 v)
{
    __m128 v_rotated_left_by_1 = mm_shuffle_one(v, _MM_SHUFFLE(2, 1, 0, 3));
    __m128 v2 = _mm_min_ps(v, v_rotated_left_by_1);
    __m128 v2_rotated_left_by_2 = mm_shuffle_one(v2, _MM_SHUFFLE(1, 0, 3, 2));
    __m128 v3 = _mm_min_ps(v2, v2_rotated_left_by_2);
    return v3;
}
Why are we shuffling the elements the way we are? And how do we find the smallest of four elements with just two min operations?
I had some trouble following how you can min 4 floats with just two vectorized min operations, but I understood it when I manually followed which values are min'd together, step by step. (Though it's likely more fun to do it on your own than read it)
Say we've got v:
[7,6,9,5] v
First, we min the values of v and v_rotated_left_by_1:
[7,6,9,5] v
 3 2 1 0    // (just the indexes of the elements)
[6,9,5,7] v_rotated_left_by_1
 2 1 0 3    // (the indexes refer to v; we rotated it left by 1, so the indexes rotated too)
--------- min
[6,6,5,5] v2
 3 2 1 0    // (explained
 2 1 0 3    //  below )
Each column under an element of v2 tracks which indexes of v were min'd together to get that element.
So, going column-wise left to right:
v2[3] == 6 == min(v[3], v[2])
v2[2] == 6 == min(v[2], v[1])
v2[1] == 5 == min(v[1], v[0])
v2[0] == 5 == min(v[0], v[3])
Now the second min:
[6,6,5,5] v2
 3 2 1 0
 2 1 0 3
[5,5,6,6] v2_rotated_left_by_2
 1 0 3 2
 0 3 2 1
--------- min
[5,5,5,5] v3
 3 2 1 0
 2 1 0 3
 1 0 3 2
 0 3 2 1
Voila! Each column under v3 contains (3,2,1,0) - each element of v3 has been min'd with all the elements of v - so each element contains the minimum of the whole vector v.
After using the function, you can extract the minimum value with float _mm_cvtss_f32(__m128):
__m128 min_vector = _mm_hmin_ps(my_vector);
float minval = _mm_cvtss_f32(min_vector);
***
This is just a tangential thought, but what I found interesting is that this approach could be extended to sequences of arbitrary length, rotating the result of the previous step by 1, 2, 4, 8, ..., for a total of ceil(log2(len(v))) steps (I think).
That's cool from a theoretical perspective - if you can compare two sequences element-wise simultaneously, you can find the minimum/maximum1 of a sequence in logarithmic time!
1 This extends to all horizontal folds/reductions, like sum. Same shuffles, different vertical operation.
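For instance, swapping _mm_min_ps for _mm_add_ps in the exact same two-shuffle pattern gives a horizontal sum broadcast to every element (my own sketch, following the structure of the function above):
__m128 _mm_hsum_ps(__m128 v)
{
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)));
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    return v;
}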
However, AVX (256-bit vectors) makes 128-bit boundaries special, and harder to shuffle across. If you only want a scalar result, extract the high half so every step narrows the vector width in half. (Like in Fastest way to do horizontal float vector sum on x86, which has more efficient shuffles than 2x shufps for 128-bit vectors, avoiding some movaps instructions when compiling without AVX.)
But if you want the result broadcast to every element like in @Paul R's answer, you'd want to do in-lane shuffles (i.e. rotate within the 4 elements of every lane), then swap halves, or rotate 128-bit lanes.
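As a sketch of that narrowing approach for a scalar result (my code, SSE1 only, assuming the same conventions as above):
float hmin_ps_scalar(__m128 v)
{
    __m128 hi = _mm_movehl_ps(v, v);   // (v3, v2, v3, v2): high half moved down
    __m128 m  = _mm_min_ps(v, hi);     // two candidates left in elements 0 and 1
    __m128 s  = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)); // element 1 down to 0
    return _mm_cvtss_f32(_mm_min_ss(m, s));
}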

Related

Comparison with zero using NEON instructions

I have the below code
if (value == 0)
{
    value = 1;
}
Using NEON vectorized instructions I need to perform the above. How do I compare a NEON register with 0 for equality, 4 elements at a time, and change each element to 1 if it is zero?
If you want to check if any element of a vector is non-zero and branch on that, you can use the min/max-across-lanes instructions.
if (vmaxvq_u32(value) == 0) { // Max value across quad vector, equals zero?
    value = vmovq_n_u32(1);   // Set all lanes to 1
}
For double vectors
if (vmaxv_u32(value) == 0) {  // Max value across double vector, equals zero?
    value = vmov_n_u32(1);    // Set all lanes to 1
}
Notice the only difference is the 'q', which indicates a quad 128-bit vector, as opposed to a 64-bit double vector without it. The compiler will use a mov instruction to transfer the value from a NEON register to an ARM general-purpose register to do the comparison.
Assuming integer data, then thanks to NEON having specific "compare against zero" instructions, and the bitwise way comparison results work, there's a really cheeky way to do this using just one spare register. In generalised pseudo-assembly:
VCEQ.type mask, data, #0    # Generate bitmask vector with all bits set in elements
                            # corresponding to zero elements in the data
VSUB.type data, data, mask  # Interpret "mask" as a vector of 0s and -1s; this
                            # increments just the zero elements of "data"
                            # (thanks to two's complement wraparound)
This trick doesn't work for floating-point data, as the bit patterns for nonzero values are more complicated, and neither does it work if the replacement value is to be anything other than 1 (or -1). In those cases you would need to construct a separate vector containing the appropriate replacement elements and do a conditional select using the comparison mask, as per @Ermlg's answer.
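For reference, the subtract trick above might look like this in intrinsics (a sketch under the same integer-data assumption; the function name is mine):
#include <arm_neon.h>

uint32x4_t replace_zeros_with_one(uint32x4_t value)
{
    uint32x4_t mask = vceqq_u32(value, vdupq_n_u32(0)); // all-ones lanes where value == 0
    return vsubq_u32(value, mask);                      // x - (-1) increments the zero lanes
}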
Maybe it will look something like this:
uint32x4_t value = {7, 0, 0, 3};
uint32x4_t zero = {0, 0, 0, 0};
uint32x4_t one = {1, 1, 1, 1};
uint32x4_t mask = vceqq_u32(value, zero);
value = vbslq_u32(mask, one, value);
To get more information see here.

Efficient comparison of small integer vectors

I have small vectors. Each of them is made of 10 integers which are between 0 and 15. This means that every element in a vector can be written using 4 bits. Hence I can concatenate my vector elements and store the whole vector in a single long type (in C, C++, Java...).
Vector v1 dominates vector v2 if for each i in 0,...,9, v1[i] >= v2[i]
I want to write a method compare(long v1, long v2) that returns 0 if neither vector dominates the other, 1 if the first one dominates, and -1 if the second one dominates.
Is there any efficient way to implement compare, other than extracting every component and doing 10 ordinary integer comparisons?
EDIT
If v1 is exactly the same as v2, returning either 1 or -1 is fine.
It's possible to do this using bit-manipulation. Space your values out so that each takes up 5 bits, with 4 bits for the value and an empty 0 in the most significant position as a kind of spacing bit.
Placing a spacing bit between each value stops borrows/carries from propagating between adjacent values and means you can do certain SIMD-like arithmetic operations on the vector just by using regular integer addition or subtraction. We can use subtraction to do a vector comparison.
To do the test, set all the spacing bits to 1 in the first vector and then subtract the second one. If the value in the 4 bits below a spacing bit is greater in the second vector, the subtraction will borrow from the spacing bit and clear it in the result; if not, it will remain a one (the first value is greater than or equal to the second). If the first vector dominates the second, all the spacing bits will be one after the subtraction.
Simple demonstration using ints:
#define SPACING_BITS ((1<<4)|(1<<9)|(1<<14)|(1<<19))

int createVector(int v0, int v1, int v2, int v3)
{
    return v0 | (v1 << 5) | (v2 << 10) | (v3 << 15);
}

int vectorDominates(int vectorA, int vectorB)
{
    // returns 1 if vectorA dominates vectorB:
    return (((vectorA | SPACING_BITS) - vectorB) & SPACING_BITS) == SPACING_BITS;
}

int compare(int vectorA, int vectorB)
{
    if (vectorDominates(vectorA, vectorB))
        return 1;
    else if (vectorDominates(vectorB, vectorA))
        return -1;
    return 0;
}
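A quick usage sketch (my own; note that equal vectors dominate each other, which the question explicitly allows):
#include <stdio.h>

int main(void)
{
    int a = createVector(3, 7, 2, 9); // a >= b in every position
    int b = createVector(1, 7, 0, 4);
    printf("%d %d %d\n", compare(a, b), compare(b, a), compare(b, b)); // 1 -1 1
    return 0;
}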
You can extend it to use 64 bit values using 50 bits to store the 10 values. You can also inline the calls to vectorDominates in the compare function.
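The 64-bit extension might look like this (a sketch; the constant places a spacing bit above each of the ten 4-bit fields, i.e. at bit positions 4, 9, ..., 49):
#include <stdint.h>

#define SPACING_BITS_64 UINT64_C(0x0002108421084210)

int vectorDominates64(uint64_t a, uint64_t b)
{
    return (((a | SPACING_BITS_64) - b) & SPACING_BITS_64) == SPACING_BITS_64;
}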
Well, in C you can likely leverage vectorization to do this. I don't think it's directly possible to compare on 4-bit operands, so you're going to have to re-pack (either on the fly, or by keeping your data in a more suitable format) to 8 bits before doing the comparison. Since 10 * 8 = 80 is more than 64, you're going to need 128-bit vector instructions.
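For example, once the values are repacked to one byte each (padding the unused lanes of both vectors with equal values, e.g. 0), the dominance test could be a couple of SSE2 intrinsics; this is a sketch with names of my own choosing:
#include <emmintrin.h>
#include <stdint.h>

int dominates_bytes(const uint8_t v1[16], const uint8_t v2[16])
{
    __m128i a  = _mm_loadu_si128((const __m128i *)v1);
    __m128i b  = _mm_loadu_si128((const __m128i *)v2);
    __m128i eq = _mm_cmpeq_epi8(_mm_max_epu8(a, b), a); // 0xFF where v1[i] >= v2[i]
    return _mm_movemask_epi8(eq) == 0xFFFF;             // true if it holds in all 16 lanes
}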
Not sure if Java VMs support that yet, but this question suggests that JNI is the answer, i.e. call C code from Java.

Algorithm to determine the maximum sum array by complementing bit mask

This question posed by my co-worker bamboozled me. I cannot even come up with a clean brute-force solution. To state the problem:
Given an array of size n containing non-negative integers, k = [10, 40, 1, 200, 5000, ..., n], a bit mask of size n, i.e. mask = 1001....b_n where |mask| = n, and an integer S giving the number of contiguous bits that can be complemented (e.g. S = 3), find a configuration of mask that yields the maximum sum array.
The complement size S is used to pick S contiguous bits from the bit mask and replace them by their complement.
For example, if mask = 100001 with S = 2 you could
Change mask to 010001 by applying the complement at the MSB
You can iteratively keep on complementing at any bit position in the mask till you find the maximum sum array.
Here is what I've come up with:
Find all the 2^n bit mask configurations then apply them to find the maximum sum array
Given the initial mask configuration see if there exists a path to the maximum sum array configuration found in step 1.
Again mine is an exponential solution. Any efficient ones are appreciated.
Start off with the trivial observation that you would never apply your given bitmask G, which simply consists of S 1s, more than once on the same stretch of your original mask M: bitwise xor is commutative and associative, allowing you to reorder the operations as you please, and xor'ing a bitmask with itself gives all 0s.
Given a bitmask B of length S, and an integral index ind in [0,n), let BestSum(ind, B) be the best possible sum that can be obtained on [ind:n) slice of your input array k when M'[ind, ind + S) = B, where M' is the final state of your mask after performing all the operations. Let us write B = b.B', where b is the MSB and consider the two possibilities for b:
b = M[ind] : In this case, you will not apply G at M[ind] and hence BestSum(ind, B) = b*k[ind] + max(BestSum(ind + 1, B'.0), BestSum(ind + 1, B'.1)).
b != M[ind] : In this case, you will apply G at M[ind] and hence BestSum(ind, B) = b*k[ind] + max(BestSum(ind + 1, (~B').0), BestSum(ind + 1, (~B').1)).
This, along with the boundary conditions, gives you a DP with runtime O(n*2^S). The best solution would be max over all BestSum(0, B).
Note that we have brushed all reachability issues under the carpet of "boundary conditions". Let us address that now - if, for a given ind and B, there is no final configuration M' such that M'[ind, ind + S) = B, define BestSum(ind, B) = -inf. That ensures that the only cases where you need to answer unreachability are indeed at the boundary, i.e. ind = n - S. The only values of (n-S, B) that are reachable are (n-S, M[n-S:n)) and (n-S, M[n-S:n) ^ G), thus handling the boundary with ease.
Would the following work?
Use DFS to expand a tree with all the possibilities (do one flip at each depth). The recursion's ending conditions are:
We reached a state where all mask bits are 1.
We keep coming back to the same position, which means we can never reach a state where all mask bits are 1. (I am not sure how exactly we can detect this though.)

Split arrays of natural numbers according to a requirement

I have two arrays {Ai} and {Bi} of natural numbers. The sums of all elements are equal.
I need to split each element of the two arrays into three natural numbers:
Ai = A1i + A2i + A3i
Bi = B1i + B2i + B3i
such that the sum of all elements of A1 is equal to the sum of all elements of B1 and the same for all the other pairs.
The important part I initially forgot about:
Each element from A1j, A2j, A3j should be between Aj/3-2 and Aj/3+2 or at least equal to one of these numbers
Each element from B1j, B2j, B3j should be between Bj/3-2 and Bj/3+2 or at least equal to one of these numbers
So the elements of arrays must be split in almost equal parts
I look for some more elegant solution than just calculating all possible variant for both arrays.
It should be possible to divide them so that the sums of A1, A2 and A3 are each near a third of the sum of A, and the same for B. It would be easy to make every value an exact third, but that's not possible with natural numbers. So we have to floor the results (trivial) and distribute the remainders uniformly over the three arrays (manageable).
I don't know whether it's the only solution, but it works in O(n), and my intuition says it will hold your invariants (though I didn't prove it):
n = 3
for j = 0 to n
    A[j] = {}
x = 0                       // rotating pointer for the next subarray
for i in A
    part = floor(A[i] / n)
    rest = A[i] mod n
    for j = 0 to n
        A[j][i] = part
    // distribute the rest over the subarrays, rotating the pointer
    for j = 0 to rest
        A[x][i]++
        x = (x + 1) mod n
/* Do the same for B */
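In C, the first variant might look like this (a sketch with my own naming, assuming arrays of non-negative ints):
void split3(const int *a, int len, int *a1, int *a2, int *a3)
{
    int *parts[3] = { a1, a2, a3 };
    int x = 0;                                // rotating pointer for the next subarray
    for (int i = 0; i < len; i++) {
        int part = a[i] / 3, rest = a[i] % 3;
        for (int j = 0; j < 3; j++) parts[j][i] = part;
        for (int j = 0; j < rest; j++) {      // distribute the remainder, rotating x
            parts[x][i]++;
            x = (x + 1) % 3;
        }
    }
}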
One could also formulate the loop without the division, only distributing the single units (1) of an A[i] over the A[x][i]s:
n = 3
for j = 0 to n
    A[j] = {}
    for k = 0 to |A|
        A[j][k] = 0
x = 0                       // rotating pointer for the next subarray
for i in A
    // distribute A[i] one unit at a time over the subarrays, rotating the pointer
    for j = 0 to A[i]
        A[x][i]++
        x = (x + 1) mod n
You should look up the principle of dynamic programming.
In this case, it seems to be similar to some coin change problems.
As for finding A1_i, A2_i, A3_i you should do it recursively:
def find_numbers(n, a, arr):
    if arr[n] not empty:
        return
    if n == 0:
        arr[n].append(a)
        return
    if a.size() > 2:
        return
    t = n
    for each element of a:
        t -= element
    for i = 0 to t:
        find_numbers(n, append(a, i), arr)
We use arr so that we do not compute the possible combinations for each number multiple times. If you look at the call tree, after a while this function will return the combinations from arr instead of computing them again.
In your main call:
arr = []
for each n in A:
    find_numbers(n, [], arr)
for each n in B:
    find_numbers(n, [], arr)
Now you have all the combinations for each n in arr[n].
I know it is a subpart of the problem, but finding the right combinations for each A_i, B_i from arr is something really similar to this. It is very important to read the links I gave you so that you understand the underlying theory.
I add the stipulation that A1, A2, and A3 must be calculated from A without knowledge of B, and, similarly, B1, B2, and B3 must be calculated without knowledge of A.
The requirement that each A1i, A2i, A3i must be in [Ai/3–2, Ai/3+2] implies that the sums of the elements of A1, A2, and A3 must each be roughly one-third that of A. The stipulation compels us to define this completely.
We will construct the arrays in any serial order (e.g., from element 0 to the last element). As we do so, we will ensure the arrays remain nearly balanced.
Let x be the next element of A to be processed. Let a be round(x/3). To account for x, we must append values totaling x = 3•a + r to the arrays A1, A2, and A3, where r is –1, 0, or +1.
Let d be sum(A1) – sum(A)/3, where the sums are of the elements processed so far. Initially, d is zero, since no elements have been processed. By design, we will ensure d is –2/3, 0, or +2/3 at each step.
Append three values as shown below to A1, A2, and A3, respectively:
If r is –1 and d is –2/3, append a+1, a–1, a–1. This changes d to +2/3.
If r is –1 and d is 0, append a–1, a, a. This changes d to –2/3.
If r is –1 and d is +2/3, append a–1, a, a. This changes d to 0.
If r is 0, append a, a, a. This leaves d unchanged.
If r is +1 and d is –2/3, append a+1, a, a. This changes d to 0.
If r is +1 and d is 0, append a+1, a, a. This changes d to +2/3.
If r is +1 and d is +2/3, append a–1, a+1, a+1. This changes d to –2/3.
At the end, the sums of A1, A2, and A3 are uniquely determined by the sum of A modulo three. The sum of A1 is (sum(A)–2)/3, sum(A)/3, or (sum(A)+2)/3 according to whether the sum of A is congruent to –1, 0, or +1 modulo three, respectively.
Completing the demonstration:
In any case, a–1, a, or a+1 is appended to an array. a is round(x/3), so it differs from x/3 by less than 1, so a–1, a, and a+1 each differ from x/3 by less than 2, satisfying the constraint that the values must be in [Ai/3–2, Ai/3+2].
When B1, B2, and B3 are prepared in the same way as shown above for A1, A2, and A3, their sums are determined by the sum of B. Since the sum of A equals the sum of B, the sums of A1, A2, and A3 equal the sums of B1, B2, and B3, respectively.
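A compact C sketch of this construction (naming is mine; a = round(x/3) is computed as (x+1)/3, which is exact for non-negative x, and d is tracked in thirds as d3 = 3d):
void split_balanced(const int *A, int n, int A1[], int A2[], int A3[])
{
    int d3 = 0;  // 3 * (sum(A1) - sum(A)/3); stays in {-2, 0, +2}
    for (int i = 0; i < n; i++) {
        int a = (A[i] + 1) / 3;               // round(x/3)
        int r = A[i] - 3 * a;                 // -1, 0, or +1
        int v1 = a, v2 = a, v3 = a;
        if (r == -1) {
            if (d3 == -2) { v1 = a + 1; v2 = a - 1; v3 = a - 1; d3 = +2; }
            else          { v1 = a - 1;                         d3 -= 2; }
        } else if (r == +1) {
            if (d3 == +2) { v1 = a - 1; v2 = a + 1; v3 = a + 1; d3 = -2; }
            else          { v1 = a + 1;                         d3 += 2; }
        }
        A1[i] = v1; A2[i] = v2; A3[i] = v3;
    }
}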

Fast multiplication of k x k boolean matrices, where 8 <= k <= 16

I want to find as fast a way as possible of multiplying two small boolean matrices, where small means 8x8, 9x9, ..., 16x16. This routine will be used a lot, so it needs to be very efficient, so please don't suggest that the straightforward solution should be fast enough.
For the special cases 8x8, and 16x16 I already have fairly efficient implementations, based on the solution found here, where we treat the entire matrix as an uint64_t or uint64_t[4] respectively. On my machine this is roughly 70-80 times faster than the straightforward implementation.
However, in the case of 8 < k < 16, I don't really know how I can leverage any reasonable representation in order to enable such clever tricks as above.
So basically, I'm open for any suggestions using any kind of representation (of the matrices) and function signature. You may assume that this targets either a 32-bit or 64-bit architecture (pick what best suits your suggestion)
Given two 4x4 matrices a= 0010,0100,1111,0001, b=1100,0001,0100,0100, one could first calculate the transpose b' = 1000,1011,0000,0100.
Then the resulting matrix M(i,j) = (a × b mod 2)(i,j) == popcount(a[i] & b'[j]) & 1; // or parity
From that one can notice that the complexity only grows as n^2, as long as the bit vector fits in a computer word.
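As a one-liner sketch of that product bit (GCC/Clang builtin assumed), with rows of a and rows of the transpose b' held as 16-bit masks:
#include <stdint.h>

static inline int gf2_dot(uint16_t a_row, uint16_t bt_row)
{
    return __builtin_popcount(a_row & bt_row) & 1; // parity of the AND = product bit
}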
This can be sped up for 8x8 matrices at least, provided that some special permutation and bit-selection operations are available. One can iterate exactly N times with NxN bits in a vector (so 16x16 is pretty much the limit).
Each step consists of accumulating, i.e. Result(n+1) = Result(n) XOR (A(n) .& B(n)), where Result(0) = 0, A(n) is A <<< n ('<<<' being columnwise rotation of elements), and where B(n) copies diagonal elements from the matrix B:
        a b c          a e i          d h c          g b f
  B  =  d e f   B(0) = a e i   B(1) = d h c   B(2) = g b f
        g h i          a e i          d h c          g b f
And after thinking it a bit further, a better option is to '^^^' (rotate row-wise) matrix B and select A(n) as column-copied diagonals from A:
        a b c          a a a          b b b          c c c
  A  =  d e f   A(0) = e e e   A(1) = f f f   A(2) = d d d
        g h i          i i i          g g g          h h h
EDIT: To benefit later readers, I'd propose the full solution for W <= 16 bit matrix multiplications in portable C.
#include <stdint.h>

#define W 16   // matrix dimension, assumed here; W was left undefined in the original

void matrix_mul_gf2(uint16_t *a, uint16_t *b, uint16_t *c)
{
    // these arrays can be read in two successive xmm registers or in a single ymm
    uint16_t D[16];       // temporary
    uint16_t C[16] = {0}; // result
    uint16_t B[16];
    uint16_t A[16];
    int i, j;
    uint16_t top_row;
    // Preprocess B (while reading from input)
    // -- "un-tilt" the diagonal to bit position 0x8000
    for (i = 0; i < W; i++) B[i] = (b[i] << i) | (b[i] >> (W - i));
    for (i = 0; i < W; i++) A[i] = a[i];  // just read in matrix 'a'
    // Loop W times
    // Can be parallelized 4x with MMX, 8x with XMM and 16x with YMM instructions
    for (j = 0; j < W; j++) {
        for (i = 0; i < W; i++) D[i] = ((int16_t)B[i]) >> 15; // copy sign bit to rows
        for (i = 0; i < W; i++) B[i] <<= 1;                   // prepare B for next round
        for (i = 0; i < W; i++) C[i] ^= A[i] & D[i];          // add the partial product
        top_row = A[0];
        for (i = 0; i < W - 1; i++) A[i] = A[i + 1];
        A[W - 1] = top_row;
    }
    for (i = 0; i < W; i++) c[i] = C[i];  // return result
}
How about padding it out to the next "clever" (e.g. 8 or 16) size, with all '1' on the diagonal?
Depending on your application, storing both the matrix and its transpose together might help. You will save a lot of time that otherwise would be used to transpose during matrix multiplications, at the expense of some memory and some more operations.
There is a faster method for multiplying 8x8 matrices using 64-bit multiplication along with some simple bit trickery, which works for either GF(2) or boolean algebra.
Assuming the three matrices are packed in 8 consecutive rows of 8 bits each inside a 64-bit int, we can use multiplication to scatter the bits and do the job in just one for loop:
uint64_t mul8x8(uint64_t A, uint64_t B) {
    const uint64_t ROW = 0x00000000000000FF;
    const uint64_t COL = 0x0101010101010101;
    uint64_t C = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t p = COL & (A >> i);       // column i of A spread as 0/1 bytes
        uint64_t r = ROW & (B >> i * 8);   // row i of B in the low byte
        C |= (p * r);                      // use ^ for GF(2) instead
    }
    return C;
}
The code for 16x16 is straightforward if you can afford blocking the rows for improved efficiency.
This trick is also used extensively in high-performance linear algebra libraries, and consists in partitioning the matrix into N/M x N/M blocks of MxM submatrices, with M = 2^m chosen to maximize locality in cache. The usual way to deal with N % M != 0 is to pad rows and columns with 0s so one can use the same algorithm for all block multiplications.
We can apply the same ideas to boolean matrices of variable dimension 8 <= N <= 16 as long as we can afford to have the matrices represented internally in a row-blocking format. We just assume the matrix is 16x16 and the last 16-N rows and columns are filled with 0s:
void mul16x16(uint64_t C[2][2], const uint64_t A[2][2], const uint64_t B[2][2]) {
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            C[i][j] = mul8x8(A[i][0], B[0][j])
                    | mul8x8(A[i][1], B[1][j]); // once again, use ^ instead for GF(2)
}
Notice we have done a 16x16 matrix multiplication in just 8x8 = 64 integer products (eight calls to mul8x8, each doing eight products) and some bit operations.
Also, mul8x8 can be much improved with modern SSE/AVX vector instructions. In theory it is possible to perform all 8 products in parallel with one AVX512 instruction (we still need to scatter the data to the ZMM register first) and then reduce horizontally using log2(8) = 3 instructions.
