Multiply-subtract in SSE - c

I am vectorizing a piece of code and at some point I have the following setup:
register m128 a = { 99,99,99,99,99,99,99,99 }
register m128 b = { 100,50,119,30,99,40,50,20 }
I am currently packing shorts in these registers, which is why I have 8 values per register. What I would like to do is subtract the i'th element in b with the corresponding value in a if the i'th value of b is greater than or equal to the value in a (In this case, a is filled with the constant 99 ). To this end, I first use a greater than or equal to operation between b and a, which yields, for this example:
register m128 c = { 1,0,1,0,1,0,0,0 }
To complete the operation, I'd like to use the multiply-and-subtract, i.e. to store in b the operation b -= a*c. The result would then be:
b = { 1,50,20,30,0,40,50,20 }
Is there any operation that does such a thing? What I found were fused operations for Haswell, but I am currently working on Sandy Bridge. Also, if someone has a better idea to do this, please let me know (e.g. I could do a logical subtract: if 1 in c then I subtract, nothing otherwise).

You essentially want an SSE version of this code, right?
if (b >= a)
t = b-a
else
t = b
b = t
Since we want to avoid conditionals in the SSE version, we can get rid of the control flow like this (note that the mask is inverted):
uint16_t mask = (b>=a)-1;
uint16_t tmp = b-a;
uint16_t d = (b & mask) | (tmp & ~mask);
b = d;
I've checked the _mm_cmpgt_epi16 intrinsic and it has a nice property: in each lane it returns either 0x0000 for false or 0xFFFF for true, instead of a single bit 0 or 1 (thereby eliminating the need for the first subtraction). It is a signed, strictly-greater-than comparison, so to get b >= a we compute the a > b mask and select with it inverted. Therefore our SSE version might look like this:
__m128i lt = _mm_cmpgt_epi16 (a, b); // 0xFFFF where b < a
__m128i tmp = _mm_sub_epi16 (b, a);
__m128i d = _mm_or_si128 (_mm_and_si128 (lt, b), _mm_andnot_si128 (lt, tmp));
EDIT: harold has mentioned a far less complicated answer. The above solution might be helpful if you need to modify the else part of the if/else.
uint16_t mask = ~( (b>=a)-1 )
uint16_t tmp = a & mask
b = b - tmp
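The SSE code will be (the comparison is inverted because SSE2 only offers signed greater-than and less-than, not greater-or-equal):
__m128i lt = _mm_cmplt_epi16 (b, a); // 0xFFFF where b < a
__m128i t = _mm_sub_epi16 (b, _mm_andnot_si128 (lt, a)); // subtract a only where b >= a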
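For reference, here is a minimal self-contained sketch of the masked subtraction using only SSE2 intrinsics. The helper names cond_sub_epi16 and cond_sub_8 are made up for illustration, and the lanes are assumed to hold signed 16-bit values.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: in each signed 16-bit lane, compute b - a when b >= a,
   and leave b unchanged otherwise. */
static __m128i cond_sub_epi16(__m128i b, __m128i a)
{
    /* 0xFFFF in lanes where b < a, 0x0000 elsewhere. */
    __m128i lt = _mm_cmplt_epi16(b, a);
    /* (~lt) & a keeps a only in lanes where b >= a. */
    __m128i sub = _mm_andnot_si128(lt, a);
    return _mm_sub_epi16(b, sub);
}

/* Hypothetical convenience wrapper working on plain arrays of eight shorts. */
static void cond_sub_8(int16_t b[8], const int16_t a[8])
{
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    _mm_storeu_si128((__m128i *)b, cond_sub_epi16(vb, va));
}
```

Calling cond_sub_8 on the arrays from the question should leave b holding the values the question asks for, including the lane where b equals a.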

Another alternative, if your inputs are unsigned, you can calculate
b = min(b, b-a);
This works, because if a>b then b-a wraps around and is guaranteed to result in a bigger value than b. For a<=b you will always get a value between 0 and b inclusive.
b = _mm_min_epu16(b, _mm_sub_epi16(b,a));
The required _mm_min_epu16 requires SSE4.1 or later (_mm_min_epu8 would require only SSE2).
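To see why the wrap-around argument holds, here is a scalar sketch of the same idea on unsigned 16-bit values. The function name cond_sub_min_u16 is hypothetical; it models one lane at a time rather than using the SSE4.1 intrinsic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of b = min(b, b - a) on unsigned 16-bit lanes. */
static void cond_sub_min_u16(uint16_t *b, const uint16_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* The cast makes the subtraction wrap modulo 2^16 when a[i] > b[i],
           so the difference is then larger than b[i]. */
        uint16_t diff = (uint16_t)(b[i] - a[i]);
        b[i] = diff < b[i] ? diff : b[i];
    }
}
```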

You can copy b to c, subtract a from c, perform an arithmetic shift right by 15 positions in the 16 bit values, complement the value of c, mask c with a, and finally subtract c from b.
I'm not familiar with the intrinsics syntax, but the steps are:
register m128 c = b;
c -= a;
c >>= 15;
c = ~c;
c &= a;
b -= c;
here is an alternative with fewer steps:
register m128 c = compare_ge(b, a);
c = -c;
c &= a;
b -= c;
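The first sequence of steps (subtract, arithmetic shift, complement, mask, subtract) might translate to SSE2 intrinsics roughly as follows. This is a sketch under the assumption that b - a does not overflow a signed 16-bit lane; the function name cond_sub_shift_epi16 is made up.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper following the shift-based steps lane by lane. */
static __m128i cond_sub_shift_epi16(__m128i b, __m128i a)
{
    __m128i c = _mm_sub_epi16(b, a);  /* c = b - a                           */
    c = _mm_srai_epi16(c, 15);        /* all ones where b - a is negative    */
    c = _mm_andnot_si128(c, a);       /* (~c) & a: a where b >= a, else 0    */
    return _mm_sub_epi16(b, c);       /* b -= c                              */
}
```

The complement and the mask are folded into a single _mm_andnot_si128, which computes (~c) & a.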

Related

How to swap n bits from 2 numbers?

So currently (I don't know if the width is relevant) I have two 128-bit integers between which I want to swap shift bits, where shift is part of a structure:
unsigned char shift : 7;
So I get the relevant bits like this:
__uint128_t rawdata = ((*pfirst << shift) >> shift), rawdatasecond = ((*psecond << shift) >> shift);
And then I swap them like this:
*pfirst = *pfirst >> 127 - shift << 127 - shift | rawdatasecond;
*psecond = *psecond >> 127 - shift << 127 - shift | rawdata;
But I feel like I'm missing something - or am I (not sure how to test either)?
Note:
__uint128_t *pfirst, *psecond; //pointer to some integers
As I understand the task, you would like to swap bits 0, ..., n-1 for two integers. This can be done as follows:
static inline void swapbits(__uint128_t *a, __uint128_t *b, unsigned n)
{
if(n <= 8 * sizeof(*a))
{
__uint128_t mask = (n < 8 *sizeof(*a)) ? (((__uint128_t)1) << n) - 1 : ~(__uint128_t)0;
__uint128_t bits_a = *a & mask; // Get the bits from a
__uint128_t bits_b = *b & mask; // Get the bits from b
*a = (*a & ~mask) | bits_b; // Set bits in a
*b = (*b & ~mask) | bits_a; // Set bits in b
}
}
Testing is a matter of calling the function with different combinations of values for a, b and n and checking that the result is as expected.
Clearly it is not feasible to test all combinations so a number of representative cases must be defined. It is not an exact science, but think about middle and corner cases: n=0, n=1, n=64, n=127, n=128 and a and b having different bit patterns around the left-most and right-most positions as well as around the n'th position.
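One way to set up such a test is to compare the mask-based function against a slow bit-by-bit reference over the corner cases listed above. The sketch below assumes a compiler that provides __uint128_t (GCC or Clang); swapbits_ref and check_swapbits are hypothetical names, and the bit patterns are just examples.

```c
#include <stddef.h>

typedef __uint128_t u128;

/* Mask-based swap of bits 0..n-1, as in the answer above. */
static void swapbits(u128 *a, u128 *b, unsigned n)
{
    if (n <= 8 * sizeof(*a)) {
        u128 mask = (n < 8 * sizeof(*a)) ? (((u128)1) << n) - 1 : ~(u128)0;
        u128 bits_a = *a & mask;
        u128 bits_b = *b & mask;
        *a = (*a & ~mask) | bits_b;
        *b = (*b & ~mask) | bits_a;
    }
}

/* Hypothetical reference: swap one bit at a time. */
static void swapbits_ref(u128 *a, u128 *b, unsigned n)
{
    for (unsigned i = 0; i < n && i < 128; i++) {
        u128 bit = (u128)1 << i;
        if ((*a & bit) != (*b & bit)) {
            *a ^= bit;
            *b ^= bit;
        }
    }
}

/* Returns the number of mismatches over the chosen corner cases. */
static int check_swapbits(void)
{
    static const unsigned ns[] = {0, 1, 64, 127, 128};
    const u128 pats[] = {
        0,
        ~(u128)0,
        ((u128)0xAAAAAAAAAAAAAAAAULL << 64) | 0x5555555555555555ULL,
        ((u128)1 << 127) | 1
    };
    int failures = 0;
    for (size_t i = 0; i < sizeof ns / sizeof ns[0]; i++)
        for (size_t p = 0; p < 4; p++)
            for (size_t q = 0; q < 4; q++) {
                u128 a = pats[p], b = pats[q];
                u128 ra = pats[p], rb = pats[q];
                swapbits(&a, &b, ns[i]);
                swapbits_ref(&ra, &rb, ns[i]);
                if (a != ra || b != rb)
                    failures++;
            }
    return failures;
}
```

A return value of 0 from check_swapbits means the two implementations agreed on every case tried.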
static void SwapHighNBits(__uint128_t *a, __uint128_t *b, size_t n)
{
// Calculate number of uninvolved bits.
size_t NU = 128 - n;
// Calculate bitwise difference and remove uninvolved bits.
__uint128_t d = (*a ^ *b) >> NU << NU;
// Apply difference.
*a ^= d;
*b ^= d;
}
A good test suite would include all combinations of bits in the critical positions: The high bit, the lowest bit swapped, the highest bit not swapped, and the lowest bit. That is four bits in each of two operands, so eight bits total, 256 combinations for a specific value of n. Then test values of n from 2 to 126. Other bits can be filled randomly.
For n = 1, the high bit and the lowest bit swapped are identical, so make separate test code for that, or write the common test code carefully to cover that. Similarly, for n = 127, the highest bit not swapped and the lowest bit are identical, and, for n = 128, there are no bits not swapped.
If defined behavior is desired for n = 0 or n > 128, add test cases for those. Note that the code above does not support n = 0, as NU will be 128, and shifting a 128-bit value by 128 positions is not defined by the C standard. Of course, for n = 0, one can simply return without making any changes.
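A guarded variant along those lines could look like the sketch below. The name SwapHighNBitsSafe is hypothetical, and clamping n > 128 to 128 is an assumption about the desired behavior, not something the original code specifies.

```c
#include <stddef.h>

/* Hypothetical guarded variant of SwapHighNBits. */
static void SwapHighNBitsSafe(__uint128_t *a, __uint128_t *b, size_t n)
{
    if (n == 0)
        return;          /* nothing to swap; also avoids a shift by 128 */
    if (n > 128)
        n = 128;         /* assumption: treat oversized n as "swap everything" */

    size_t NU = 128 - n;                      /* number of uninvolved bits */
    __uint128_t d = (*a ^ *b) >> NU << NU;    /* difference in the high n bits */
    *a ^= d;
    *b ^= d;
}
```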

C - Saturating Signed Integer Multiplication with Bitwise Operators

Alright, so the assignment I have to do is to multiply a signed integer by 2 and return the value. If the value overflows then saturate it by returning Tmin or Tmax instead. The challenge is using only these logical operators (! ~ & ^ | + << >>) with no (if statements, loops, etc.) and only allowed a maximum of 20 logical operators.
Now my thought process to tackle this problem was first to find the limits. So I divided Tmin/max by 2 to get the boundaries. Here's what I have:
Negative
This and higher works:
1100000...
This and lower doesn't:
1011111...
If it doesn't work I need to return this:
100000...
Positive
This and lower works:
0011111...
This and higher doesn't:
0100000...
If it doesn't work I need to return this:
011111...
Otherwise I have to return:
2 * x;
(the integers are 32-bit by the way)
I see that the first two bits are important in determining whether or not the problem should return 2*x or the limits. For example an XOR would do, since if the first two bits are the same then 2*x should be returned, otherwise the limits should be returned. Another if statement is then needed for the sign of the integer: if it is negative Tmin needs to be returned, otherwise Tmax needs to be.
Now my question is, how do you do this without using if statements? xD Or a better question is the way I am planning this out going to work or even feasible under the constraints? Or even better question is whether there is an easier way to solve this, and if so how? Any help would be greatly appreciated!
a = (x>>31); // fills the integer with the sign bit
b = (x<<1) >> 31; // fills the integer with the MSB
x <<= 1; // multiplies by 2
x ^= (a^b)&(x^b^0x80000000); // saturate
So how does this work. The first two lines use the arithmetic right shift to fill the whole integer with a selected bit.
The last line is basically the "if statement". If a==b then the right hand side evaluates to 0 and none of the bits in x are flipped. Otherwise it must be the case that a==~b and the right hand side evaluates to x^b^0x80000000.
After the statement is applied x will equal x^x^b^0x80000000 => b^0x80000000 which is exactly the saturation value.
Edit:
Here it is in the context of an actual program.
#include <stdio.h>
int main(void){
int i = 0xFFFF;
while(i<<=1){
int a = i >> 31;
int b = (i << 1) >> 31;
int x = i << 1;
x ^= (a^b) & (x ^ b ^ 0x80000000);
printf("%d, %d\n", i, x);
}
}
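The same logic can also be packaged as a function that avoids shifting signed values, which the C standard leaves undefined or implementation-defined in some cases. The sketch below is a portable restatement, not a solution to the assignment: it uses unsigned negation and memcpy, which are outside the allowed operator list, and the name sat_double is made up.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 2*x, saturated to INT32_MIN / INT32_MAX. */
static int32_t sat_double(int32_t x)
{
    uint32_t ux;
    memcpy(&ux, &x, sizeof ux);              /* reinterpret the bits as unsigned */

    uint32_t a = 0u - (ux >> 31);            /* all ones if the sign bit is set */
    uint32_t b = 0u - ((ux >> 30) & 1u);     /* all ones if bit 30 is set */
    uint32_t r = ux << 1;                    /* multiply by 2, wrapping */

    /* When a != b the doubling overflowed: replace r with b ^ 0x80000000. */
    r ^= (a ^ b) & (r ^ b ^ 0x80000000u);

    int32_t out;
    memcpy(&out, &r, sizeof out);
    return out;
}
```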
You have a very good starting point. One possible solution is to look at the first two bits.
abxx xxxx
Multiplication by 2 is equivalent to a left shift. So our result would be
bxxx xxx0
We see if b = 1 then we have to apply our special logic. The result in such a case would be
accc cccc
where c = ~a. Thus if we started with bitmasks
m1 = 0bbb bbbb
m2 = b000 0000
m3 = aaaa aaaa & bbbb bbbb
then when b = 1,
x << 1; // gives 1xxx xxxx
x |= m1; // gives 1111 1111
x ^= m2; // gives 0111 1111
x ^= m3; // gives accc cccc (flips bits for initially negative values)
Clearly when b = 0 none of our special logic happens. It's straightforward to get these bitmasks in just a few operations. Disclaimer: I haven't tested this.

How can I check if a value has even parity of bits or odd?

A value has even parity if it has an even number of '1' bits. A value has an odd parity if it has an odd number of '1' bits. For example, 0110 has even parity, and 1110 has odd parity.
I have to return 1 if x has even parity.
int has_even_parity(unsigned int x) {
return
}
x ^= x >> 16;
x ^= x >> 8;
x ^= x >> 4;
x ^= x >> 2;
x ^= x >> 1;
return (~x) & 1;
Assuming you know ints are 32 bits.
Let's see how this works. To keep it simple, let's use an 8 bit integer, for which we can skip the first two shift/XORs. Let's label the bits a through h. If we look at our number we see:
( a b c d e f g h )
The first operation is x ^= x >> 4 (remember we're skipping the first two operations since we're only dealing with an 8-bit integer in this example). Let's write the new values of each bit by combining the letters that are XOR'd together (for example, ab means the bit has the value a xor b).
( a b c d e f g h )
xor
( 0 0 0 0 a b c d )
The result is the following bits:
( a b c d ae bf cg dh )
The next operation is x ^= x >> 2:
( a b c d ae bf cg dh )
xor
( 0 0 a b c d ae bf )
The result is the following bits:
( a b ac bd ace bdf aceg bdfh )
Notice how we are beginning to accumulate all the bits on the right-hand side.
The next operation is x ^= x >> 1:
( a b ac bd ace bdf aceg bdfh )
xor
( 0 a b ac bd ace bdf aceg )
The result is the following bits:
( a ab abc abcd abcde abcdef abcdefg abcdefgh )
We have accumulated all the bits in the original word, XOR'd together, in the least-significant bit. So this bit is now zero if and only if there were an even number of 1 bits in the input word (even parity). The same process works on 32-bit integers (but requires those two additional shifts that we skipped in this demonstration).
The final line of code flips every bit (~x) and then strips off all but the least-significant bit (& 1). The result, then, is 1 if the parity of the input word was even, or zero otherwise.
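Put together as complete functions, the 8-bit walkthrough and the 32-bit version might look like this. The names has_even_parity8 and has_even_parity32 are chosen here for illustration; fixed-width types from stdint.h are used so the bit widths are explicit.

```c
#include <stdint.h>

/* 8-bit version: the three folding steps traced in the walkthrough. */
static int has_even_parity8(uint8_t x)
{
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (~x) & 1;   /* bit 0 now holds the XOR of all eight input bits */
}

/* 32-bit version: two extra folding steps come first. */
static int has_even_parity32(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (int)((~x) & 1u);
}
```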
GCC has built-in functions for this:
Built-in Function: int __builtin_parity (unsigned int x)
Returns the parity of x, i.e. the number of 1-bits in x modulo 2.
and similar functions for unsigned long and unsigned long long.
I.e. this function behaves like has_odd_parity. Invert the value for has_even_parity.
These should be the fastest alternative on GCC. Of course its use is not portable as such, but you can use it in your implementation, guarded by a macro for example.
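A guarded use of the builtin could be sketched as follows. The function name has_even_parity_builtin and the simple loop used as a fallback are assumptions made for this example.

```c
/* Hypothetical wrapper: uses the GCC/Clang builtin when available,
   and a plain bit loop otherwise. */
static int has_even_parity_builtin(unsigned int x)
{
#if defined(__GNUC__) || defined(__clang__)
    /* __builtin_parity returns 1 for an odd number of 1-bits. */
    return !__builtin_parity(x);
#else
    int odd = 0;
    while (x) {
        odd ^= (int)(x & 1u);
        x >>= 1;
    }
    return !odd;
#endif
}
```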
The following answer was unashamedly lifted directly from Bit Twiddling Hacks By Sean Eron Anderson, seander#cs.stanford.edu
Compute parity of word with a multiply
The following method computes the parity of the 32-bit value in only 8 operations using a multiply.
unsigned int v; // 32-bit word
v ^= v >> 1;
v ^= v >> 2;
v = (v & 0x11111111U) * 0x11111111U;
return (v >> 28) & 1;
Also for 64-bits, 8 operations are still enough.
unsigned long long v; // 64-bit word
v ^= v >> 1;
v ^= v >> 2;
v = (v & 0x1111111111111111UL) * 0x1111111111111111UL;
return (v >> 60) & 1;
Andrew Shapira came up with this and sent it to me on Sept. 2, 2007.
Try:
int has_even_parity(unsigned int x){
unsigned int count = 0, i, b = 1;
for(i = 0; i < 32; i++){
if( x & (b << i) ){count++;}
}
if( (count % 2) ){return 0;}
return 1;
}
To generalise TypeIA's answer for any architecture:
int has_even_parity(unsigned int x)
{
unsigned char shift = 1;
while (shift < (sizeof(x)*8))
{
x ^= (x >> shift);
shift <<= 1;
}
return !(x & 0x1);
}
The main idea is this. Unset the rightmost '1' bit by using x & ( x - 1 ). Let’s say x = 13(1101) and the operation of x & ( x - 1 ) is 1101 & 1100 which is 1100, notice that the rightmost set bit is converted to 0.
Now x is 1100. The operation of x & ( x - 1 ), i.e., 1100 & 1011 is 1000. Notice that the original x is 1101 and after two operations of x & (x - 1) the x is 1000, i.e., two set bits are removed after two operations. If after an odd number of operations, the x becomes zero, then it's an odd parity, else it's an even parity.
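The idea described above can be sketched like this, toggling a flag once per cleared bit. The function name has_even_parity_clear is made up for the example.

```c
/* Hypothetical helper: clears the rightmost set bit until x is zero,
   toggling a flag on every step. */
static int has_even_parity_clear(unsigned int x)
{
    int odd = 0;
    while (x != 0) {
        x &= x - 1;   /* unset the rightmost '1' bit */
        odd ^= 1;     /* one more set bit has been seen */
    }
    return !odd;      /* 1 for even parity, 0 for odd parity */
}
```

The loop runs once per set bit, so for x = 13 (1101) it runs three times and the function reports odd parity.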
Here's a one line #define that does the trick for a char:
#define PARITY(x) ((~(x ^= (x ^= (x ^= x >> 4) >> 2) >> 1)) & 1) /* even parity */
int main()
{
char x=3;
printf("parity = %d\n", PARITY(x));
}
It's portable as heck and easily modified to work with bigger words (16, 32 bit). It's important to note also, using a #define speeds the code up, each function call requires time to push the stack and allocate memory. Code size doesn't suffer, especially if it's implemented only a few times in your code - the function call might take up as much object code as the XORs.
Admittedly, the same efficiencies may be obtained by using the inline function version of this, inline char parity(char x) {return PARITY(x);} (GCC) or __inline char parity(char x) {return PARITY(x);} (MSVC). Presuming you keep the one line define.
int parity_check(unsigned x) {
int parity = 0;
while(x != 0) {
parity ^= x;
x >>= 1;
}
return (parity & 0x1);
}
In case the end result is supposed to be a piece of code that can work (be compiled) with a C program then I suggest the following:
.code
; bool CheckParity(size_t Result)
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
END
This is a piece of code I'm using to check the parity of calculated results in a 64-bit C program compiled using MSVC. You can obviously port it to 32 bit or other compilers.
This has the advantage of being much faster than using C and it also leverages the CPU's functionality.
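What this example does is take as input a parameter (passed in RCX - __fastcall calling convention). It adds 0 to it, which sets the CPU's parity flag according to the result, and then sets RAX to 1 if the parity flag is set and to 0 otherwise. Note that on x86 the parity flag reflects only the least-significant byte of the result, so this checks the parity of the low 8 bits.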

fixed point multiplication without 64 bit temporary

Hi I'm implementing some fixed point math stuff for embedded systems and I'm trying to do the multiplication of two 16.16 fixed point numbers without creating a 64bit temporary. So far here is the code I came up with that generates the least instructions.
int multiply(int x, int y){
int result;
long long temp = x;
temp *= y;
temp >>= 16;
result = temp;
return result;
}
The problem with this code is that it uses a temporary 64-bit integer, which seems to generate bad assembly code. I'm trying to make a system that uses two 32-bit integers instead of a 64-bit one. Anyone know how to do this?
Think of your numbers as each composed of two large "digits."
A.B
x C.D
The "base" of the digits is the 2^bit_width, i.e., 2^16, or 65536.
So, the product is
D*B + D*A*65536 + C*B*65536 + C*A*65536*65536
However, to get the product shifted right by 16, you need to divide all these terms by 65536, so
D*B/65536 + D*A + C*B + C*A*65536
In C:
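uint32_t a = (uint32_t)x >> 16; // high "digit" of x
uint32_t b = (uint32_t)x & 0xffff; // low "digit" of x
uint32_t c = (uint32_t)y >> 16; // high "digit" of y
uint32_t d = (uint32_t)y & 0xffff; // low "digit" of y
// 32-bit unsigned operands keep d * b from overflowing a promoted int
return ((d * b) >> 16) + (d * a) + (c * b) + ((c * a) << 16);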
The signed version is a bit more complicated; it is often easiest to perform the arithmetic on the absolute values of x and y and then fix the sign (unless you overflow, which you can check for rather tediously).
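As a rough sketch of that approach, the unsigned multiply can be wrapped so that it works on magnitudes and the sign is restored afterwards. The names fix_mul_u and fix_mul_s are hypothetical. Overflow is not checked, and negative results are truncated toward zero, which can differ by one unit in the last place from the 64-bit version with its right shift.

```c
#include <stdint.h>

/* Unsigned 16.16 multiply built from four 16x16 partial products. */
static uint32_t fix_mul_u(uint32_t x, uint32_t y)
{
    uint32_t a = x >> 16, b = x & 0xffffu;   /* x = a.b */
    uint32_t c = y >> 16, d = y & 0xffffu;   /* y = c.d */
    return ((d * b) >> 16) + (d * a) + (c * b) + ((c * a) << 16);
}

/* Hypothetical signed wrapper: multiply magnitudes, then fix the sign.
   Assumes the result fits in 16.16; no overflow check is done. */
static int32_t fix_mul_s(int32_t x, int32_t y)
{
    int negative = (x < 0) != (y < 0);
    uint32_t ux = x < 0 ? 0u - (uint32_t)x : (uint32_t)x;
    uint32_t uy = y < 0 ? 0u - (uint32_t)y : (uint32_t)y;
    uint32_t p = fix_mul_u(ux, uy);
    return negative ? -(int32_t)p : (int32_t)p;
}
```

For example, 1.5 is 0x18000 and 2.0 is 0x20000 in 16.16, and their product 3.0 is 0x30000.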

How to simulate a 4-bit binary adder in C

My professor assigned the class to write a C program to simulate a 32-bit adder using basic adders. I know a 32-bit adder is made up of 8 X 4-bit adders. However, I am unsure even how to simulate a 4-bit adder in C. I need to implement a 4-bit binary ripple carry adder, a 4-bit binary look-ahead carry generator, and a 4-bit look-ahead carry adder. From the truth table of a full adder and a Karnaugh map, I obtained the functions of the Sum and Carry Out outputs. For Sum I received A xor B xor Carry In. For the Carry out function, I received (A*B) + (Carry in(A xor B)). Now I am unsure where to go. I'm pretty sure I need to manipulate the integers at the bit level using bitwise operators (I have basic knowledge of bitwise operators although I have never implemented them outside of paper and pencil).
How do I break the integers up to obtain the A, B and Carry In inputs for the functions? How do I obtain the Sum and Carry Out outputs? How do I string the full adders together to obtain a 4-bit adder?
Thank you for the help!
Well, for a simple solution, we can take the half adder and full adder circuit diagrams from Wikipedia and abstract them a bit:
#include<stdio.h>
typedef char bit;
bit carry = 0;
bit halfadd( bit A, bit B ){
carry = A & B;
return A ^ B;
}
bit fulladd( bit A, bit B ){
bit xor = A ^ B;
bit ret = carry ^ xor;
carry = (carry & xor) | (A & B);
return ret;
}
void fillNum( int num, bit *array ){
int i;
for( i = 0; i < 32; ++ i ){
array[i] = ( num >> i ) & 1;
}
}
int main(){
bit num1[32] = {0}, num2[32] = {0};
int A = 64620926, B = 1531529858;
fillNum( A, num1 );
fillNum( B, num2 );
int r = 0;
bit tmp = halfadd( num1[0], num2[0] );
putchar( tmp ? '1' : '0' );
r = tmp;
int i;
for( i = 1; i < 32; ++i ){
tmp = fulladd( num1[i], num2[i] );
r += tmp << i;
putchar( tmp ? '1' : '0' );
}
putchar( carry ? '1' : '0' );
printf("\n%d\n\n%d + %d = %d", r, A, B, A+B);
return 0;
}
That will output the added value with the LSB first, but it demonstrates the basic principle. This works according to Ideone. Just apply a similar approach to handling logic circuitry when simulating 4-bit adders.
If you don't want to read the integers to an array first, you can always use
#define GETBIT(num, bit) (((num) >> (bit)) & 1)
For safety, you can put it into a function instead, if you want.
If I were doing this, I would simulate a 4 bit adder with a Lookup Table. In this case it would be a 256 entry table that could be setup like a 16 x 16 array of values.
unsigned short outputs[16][16];
multOut = outputs[inA][inB];
You will have to initialize your array, but that should be pretty simple.
Use the 5th bit of each value in the array as your carry out bit.
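Here is a minimal sketch of the lookup-table idea, chained into a 32-bit adder. The carry-in handling (a second lookup that adds the carry to the 4-bit sum) and the names init_outputs, add4_lut and add32_lut are assumptions made for this example, since the table itself has no carry-in input.

```c
#include <stdint.h>

static unsigned short outputs[16][16];

/* Fill the table: bits 0-3 hold the sum, bit 4 (the 5th bit) the carry out. */
static void init_outputs(void)
{
    for (unsigned a = 0; a < 16; a++)
        for (unsigned b = 0; b < 16; b++)
            outputs[a][b] = (unsigned short)(a + b);
}

/* Hypothetical 4-bit adder with carry in, built from two table lookups. */
static unsigned add4_lut(unsigned inA, unsigned inB, unsigned carryIn,
                         unsigned *carryOut)
{
    unsigned first = outputs[inA & 0xF][inB & 0xF];
    unsigned second = outputs[first & 0xF][carryIn & 1];
    /* At most one of the two lookups can produce a carry. */
    *carryOut = ((first >> 4) | (second >> 4)) & 1;
    return second & 0xF;
}

/* Eight 4-bit adders chained through the carry, least significant first. */
static uint32_t add32_lut(uint32_t x, uint32_t y)
{
    uint32_t sum = 0;
    unsigned carry = 0;
    for (int i = 0; i < 8; i++) {
        unsigned nibble = add4_lut((x >> (4 * i)) & 0xF,
                                   (y >> (4 * i)) & 0xF, carry, &carry);
        sum |= (uint32_t)nibble << (4 * i);
    }
    return sum;
}
```

init_outputs must be called once before the adders are used.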
To start you will need to break your larger integers into individual bits. Shifts and masks operate on the value rather than on its byte layout in memory, so this works the same regardless of the endianness of your system. An array of bit masks would help:
int bit[]={
1<<0, //least significant bit
1<<1,
1<<2,
1<<3
};
So then to get the first bit of a number you would do
leastSignificantBitOfA = A&bit[0];
From there you could either use some shared array to store outputs or maybe make a simple structure like:
struct fullAdderReturn{
int sum;
int carryOut;
};
struct fullAdderReturn oneBitAdder(int a, int b, int carryIn)
{
struct fullAdderReturn output;
output.sum = a ^ b ^ carryIn; // sum is 1 when an odd number of inputs are 1
output.carryOut = (a&b) | (a&carryIn) | (b&carryIn);
return output;
}
I put together a simple 2-bit ripple adder here: http://ideone.com/NRoQMS. Hopefully it gives you some ideas.
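To show how the full adders string together, here is a sketch of a 4-bit ripple-carry adder and a 32-bit adder made of eight of them. It uses the Sum and Carry Out functions from the question; the struct and function names are made up for the example, and the look-ahead variants are not covered.

```c
#include <stdint.h>

struct adder_out {
    unsigned sum;
    unsigned carry;
};

/* One-bit full adder: Sum = A xor B xor Cin, Cout = A*B + Cin*(A xor B). */
static struct adder_out full_adder(unsigned a, unsigned b, unsigned cin)
{
    struct adder_out o;
    o.sum = a ^ b ^ cin;
    o.carry = (a & b) | (cin & (a ^ b));
    return o;
}

/* 4-bit ripple-carry adder: each stage's carry out feeds the next stage. */
static struct adder_out ripple_add4(unsigned a, unsigned b, unsigned cin)
{
    struct adder_out r = {0, cin & 1};
    for (int i = 0; i < 4; i++) {
        struct adder_out bit = full_adder((a >> i) & 1, (b >> i) & 1, r.carry);
        r.sum |= bit.sum << i;
        r.carry = bit.carry;
    }
    return r;
}

/* 32-bit adder built from eight 4-bit adders, least significant nibble first. */
static uint32_t ripple_add32(uint32_t x, uint32_t y)
{
    uint32_t sum = 0;
    unsigned carry = 0;
    for (int i = 0; i < 8; i++) {
        struct adder_out n = ripple_add4((x >> (4 * i)) & 0xF,
                                         (y >> (4 * i)) & 0xF, carry);
        sum |= (uint32_t)n.sum << (4 * i);
        carry = n.carry;
    }
    return sum;
}
```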

Resources