I would like to create an SSE register with values that I can store in an array of integers, from another SSE register which contains flags 0xFFFF and zeros. For example:
__m128i regComp = _mm_cmpgt_epi16(regA, regB);
For the sake of argument, let's assume that regComp was loaded with { 0, 0xFFFF, 0, 0xFFFF }. I would like to convert this into, say, { 0, 80, 0, 80 }.
What I had in mind was to create an array of integers initialized to 80 and load it into a register regC, then do a _mm_and_si128 between regC and regComp and store the result in regD. However, this does not do the trick, which led me to think that I do not understand the positive flags in SSE registers. Could someone answer the question with a brief explanation of why my solution does not work?
short valA[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
short valB[16] = { 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10 };
short ones[16] = { 1 };
short final[16];
__m128i vA, vB, vOnes, vRes, vRes2;
vOnes = _mm_load_si128((__m128i *)&(ones)[0] );
for( i=0 ; i < 16 ;i+=8){
vA = _mm_load_si128((__m128i *)&(valA)[i] );
vB = _mm_load_si128((__m128i *)&(valB)[i] );
vRes = _mm_cmpgt_epi16(vA,vB);
vRes2 = _mm_and_si128(vRes,vOnes);
_mm_storeu_si128((__m128i *)&(final)[i], vRes2);
}
You only set the first element of array ones to 1 (the rest of the array is initialised to 0).
I suggest you get rid of the array ones altogether and then change this line:
vOnes = _mm_load_si128((__m128i *)&(ones)[0] );
to:
vOnes = _mm_set1_epi16(1);
Probably a better solution though, if you just want to convert SIMD TRUE (0xffff) results to 1, would be to use a shift:
for (i = 0; i < 16; i += 8) {
    vA = _mm_loadu_si128((__m128i *)&valA[i]);
    vB = _mm_loadu_si128((__m128i *)&valB[i]);
    vRes = _mm_cmpgt_epi16(vA, vB); // generate 0xffff/0x0000 results
    vRes = _mm_srli_epi16(vRes, 15); // convert to 1/0 results
    _mm_storeu_si128((__m128i *)&final[i], vRes); // store the shifted result
}
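For a value other than 1, such as the 80 from the question, the shift no longer applies, but the AND with a broadcast constant still does. Here is a minimal sketch, assuming SSE2 and an element count that is a multiple of 8; the function name select_value_where_greater and its parameters are made up for illustration:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Writes `value` to out[i] where a[i] > b[i], and 0 elsewhere.
   Assumes n is a multiple of 8 (eight 16-bit lanes per register). */
void select_value_where_greater(const short *a, const short *b,
                                short *out, size_t n, short value)
{
    const __m128i vValue = _mm_set1_epi16(value);  /* value in every lane */
    for (size_t i = 0; i < n; i += 8) {
        __m128i vA = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i vB = _mm_loadu_si128((const __m128i *)&b[i]);
        __m128i vMask = _mm_cmpgt_epi16(vA, vB);   /* 0xFFFF or 0x0000 per lane */
        __m128i vRes = _mm_and_si128(vMask, vValue);
        _mm_storeu_si128((__m128i *)&out[i], vRes);
    }
}
```

This works because a lane of all ones ANDed with any constant yields that constant, while a lane of zeros yields zero.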
Try this for loading 1:
vOnes = _mm_set1_epi16(1);
This is shorter than creating a constant array.
Be careful: providing fewer initializers than the array size in C++ initializes the remaining elements to zero. This was your error, not the SSE part.
Don't forget the debugger; modern ones display SSE variables properly.
For example, with an input ymm vector x and bit index i I want an output vector with only the ith bit kept and everything else zeroed.
With AVX512 k registers, I could write the following, but AVX2 and below doesn't have k registers, so what do you think is the best way to do it?
__m512i m512i_maskBit(__m512i x, unsigned i) {
__mmask8 m = _cvtu32_mask8(1u << i / 64);
__m512i vm = _mm512_maskz_set1_epi64(m, 1ull << i % 64);
return _mm512_and_si512(x, vm);
}
Here is an approach using variable shifts (just creating the mask):
__m256i create_mask(unsigned i) {
__m256i ii = _mm256_set1_epi32(i);
ii = _mm256_sub_epi32(ii,_mm256_setr_epi32(0,32,64,96,128,160,192,224));
__m256i mask = _mm256_sllv_epi32(_mm256_set1_epi32(1), ii);
return mask;
}
_mm256_sllv_epi32 (vpsllvd) was introduced with AVX2 and shifts each 32-bit element by a variable number of bits. If the (unsigned) shift amount is bigger than 31 (which includes negative signed values), the corresponding result is 0.
Godbolt link with small test code: https://godbolt.org/z/a5xfqTcGs
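To make the per-lane behaviour concrete, here is a scalar model of the same mask computation. It is only a sketch for checking results, not a replacement for the intrinsic version; the name create_mask_scalar and the representation of the 256-bit mask as eight uint32_t lanes are assumptions for illustration:

```c
#include <stdint.h>

/* Scalar model of create_mask: out[k] is 32-bit lane k of the 256-bit mask. */
void create_mask_scalar(unsigned i, uint32_t out[8])
{
    for (unsigned k = 0; k < 8; k++) {
        /* Unsigned subtraction wraps, so lanes above the target lane
           get a huge count, and lanes below it get a count of 32 or more. */
        uint32_t count = (uint32_t)i - 32u * k;
        /* vpsllvd yields 0 for counts above 31; a C shift needs the explicit check. */
        out[k] = (count > 31u) ? 0u : (UINT32_C(1) << count);
    }
}
```

Only the lane where the count lands in 0..31 receives a set bit, so any index of 256 or more produces an all-zero mask.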
How about the simplest approach:
__m256i m256i_create_mask(unsigned i) {
// Get the required bit in every byte of the vector
__m256i vm = _mm256_broadcastb_epi8(_mm_cvtsi32_si128(1u << (i & 7u)));
// Mask off the bytes that are outside the index
__m256i vi = _mm256_broadcastb_epi8(_mm_cvtsi32_si128(i >> 3u));
__m256i vm1 = _mm256_cmpeq_epi8(vi,
_mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31));
return _mm256_and_si256(vm, vm1);
}
Here’s another approach. Not sure it’s necessarily better, it depends on CPU model and surrounding code, but it might be.
// A buffer to load vectors with a single bit set in one lane
alignas( 64 ) static const std::array<int, 16> s_oneBuffer =
{
0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0
};
__m256i maskSingleBit( __m256i x, uint32_t bitIndex )
{
// Load `1` into a single 32-bit lane of the vector
// The buffer is aligned to 64 bytes and fits in a single cache line, so the unaligned load never crosses a cache line.
__m256i one = _mm256_loadu_si256( ( const __m256i* )( ( s_oneBuffer.data() + 8 ) - ( bitIndex / 32 ) ) );
// Left shift to move the `1` into the correct location
__m128i shift = _mm_cvtsi32_si128( bitIndex % 32 );
__m256i bit = _mm256_sll_epi32( one, shift );
// Bitwise AND with the value
return _mm256_and_si256( x, bit );
}
I want to achieve something like the result of strncmp, but not as complicated.
I tried to read the https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcmp-avx2.S.html source code, but I failed to understand it.
Suppose we have two 256-bit vectors.
How can I compare them byte by byte (8-bit comparison) to get a result like strncmp?
I know there is a library, but I want to understand the basics.
How can it return a -1, 0, 1 result with _mm256_cmpeq_epi8 and _mm256_min_epu8?
I would do it like this.
inline int compareBytes( __m256i a, __m256i b )
{
// Compare for both a <= b and a >= b
__m256i min = _mm256_min_epu8( a, b );
__m256i le = _mm256_cmpeq_epi8( a, min );
__m256i ge = _mm256_cmpeq_epi8( b, min );
// Reverse bytes within 16-byte lanes
const __m128i rev16 = _mm_set_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
const __m256i rev32 = _mm256_broadcastsi128_si256( rev16 );
le = _mm256_shuffle_epi8( le, rev32 );
ge = _mm256_shuffle_epi8( ge, rev32 );
// Move the masks to scalar registers
uint32_t lessMask = (uint32_t)_mm256_movemask_epi8( le );
uint32_t greaterMask = (uint32_t)_mm256_movemask_epi8( ge );
// Flip high/low 16-bit pieces in the masks.
// Apparently, modern compilers are smart enough to emit ROR instructions for that code
lessMask = ( lessMask >> 16 ) | ( lessMask << 16 );
greaterMask = ( greaterMask >> 16 ) | ( greaterMask << 16 );
// Produce the desired result
if( lessMask > greaterMask )
return -1;
else if( lessMask < greaterMask )
return +1;
else
return 0;
}
This method works because integer comparison essentially searches for the most significant bit that differs, and the comparison result is determined by that bit. Because we reversed the order of the bytes being tested, the first byte in the vectors corresponds to the most significant bit in the masks. For this reason, ( lessMask > greaterMask ) evaluates to true when ( a < b ) holds for the first differing byte in the source vectors.
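The same reasoning can be checked with a scalar model that builds the two masks directly, with byte 0 mapped to bit 31. This is a sketch for understanding and testing only; the name compare_bytes_scalar is made up, and it assumes two 32-byte blocks compared as unsigned bytes:

```c
#include <stdint.h>

/* Scalar model of the mask trick for two 32-byte blocks.
   Byte 0 maps to bit 31, so the first differing byte dominates. */
int compare_bytes_scalar(const uint8_t a[32], const uint8_t b[32])
{
    uint32_t lessMask = 0, greaterMask = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t bit = UINT32_C(1) << (31 - i);
        if (a[i] <= b[i]) lessMask |= bit;     /* same role as the `le` mask */
        if (a[i] >= b[i]) greaterMask |= bit;  /* same role as the `ge` mask */
    }
    if (lessMask > greaterMask) return -1;
    if (lessMask < greaterMask) return +1;
    return 0;
}
```

Equal bytes set the same bit in both masks, so only the differing bytes decide the comparison, and the earliest one carries the highest weight.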
The problem is that I want to open an order when my indicator gives a signal. How can I do that?
I have been trying to do it with iCustom(), but the result is not satisfactory.
I tried to use GlobalVariableSet() in the indicator and GlobalVariableGet() in the EA, but it did not work properly.
Please help.
The syntax is:
double iCustom(
string symbol, // symbol
int timeframe, // timeframe
string name, // path/name of the custom indicator compiled program
... // custom indicator input parameters (if necessary)
int mode, // line index
int shift // shift
);
Here is an example using the custom Alligator indicator (which should be available by default as Alligator.mq4 in the MT platform).
double Alligator[3];
Alligator[0] = iCustom(NULL, 0, "Alligator", 13, 8, 8, 5, 5, 3, 0, 0);
Alligator[1] = iCustom(NULL, 0, "Alligator", 13, 8, 8, 5, 5, 3, 1, 0);
Alligator[2] = iCustom(NULL, 0, "Alligator", 13, 8, 8, 5, 5, 3, 2, 0);
where 13, 8, 8, 5, 5, 3 are the corresponding input parameters of the custom Alligator, as defined in the indicator itself:
//---- input parameters
input int InpJawsPeriod=13; // Jaws Period
input int InpJawsShift=8; // Jaws Shift
input int InpTeethPeriod=8; // Teeth Period
input int InpTeethShift=5; // Teeth Shift
input int InpLipsPeriod=5; // Lips Period
input int InpLipsShift=3; // Lips Shift
and mode is the corresponding line index as defined in the indicator by:
SetIndexBuffer(0, ExtBlueBuffer);
SetIndexBuffer(1, ExtRedBuffer);
SetIndexBuffer(2, ExtLimeBuffer);
The syntax is:
double signal = iCustom(NULL, 0, "MyCustomIndicatorName",
...parameters it takes in...,
...the buffer index you want from the custom indicator...,
...shift in bars);
Let's say you wrote a custom moving average indicator called "myMA" and it takes in a period only as one of its extern variables. This indicator calculates a simple moving average based on the period that the user supplies and on the close of each bar. This indicator stores its calculated values in an array MAValues[] that gets assigned to an index like this: SetIndexBuffer(0, MAValues);
To get the moving average of the current bar with period 200 then, you would write:
double ma_current_bar = iCustom(NULL, 0, "myMA", 200, 0, 0);
Then once you have this value you can check it against some trading criteria you determine, and open an order when it is met. For example if you wanted to open a long position if the moving average of the current bar equals the current Ask price, you would write:
if (ma_current_bar == Ask){
OrderSend(Symbol(), OP_BUY, 1, Ask, *max slippage*, *sl*, *tp*, NULL, 0, 0, GREEN);
}
This is just example code; do NOT use it in a live EA.
This may be a slightly theoretical question. I have a char array of bytes containing network packets. I want to check for the occurrence of a particular pair of bits ('01' or '10') every 66 bits. That is to say, once I locate the first pair of bits, I have to skip 66 bits and check for the presence of the same pair of bits again. I am trying to implement this with masks and shifts, and it is getting complicated. I want to know if someone can suggest a better way to do the same thing.
The code I have written so far looks something like this. It is not complete though.
test_sync_bits(char *rec, int len)
{
uint8_t target_byte = 0;
int offset = 0;
int save_offset = 0;
uint8_t *pload = (uint8_t*)(rec + 24);
uint8_t seed_mask = 0xc0;
uint8_t seed_shift = 6;
uint8_t value = 0;
uint8_t found_sync = 0;
const uint8_t sync_bit_spacing = 66;
/*hunt for the first '10' or '01' combination.*/
target_byte = *(uint8_t*)(pload + offset);
/*Get all combinations of two bits from target byte.*/
while(seed_shift)
{
value = ((target_byte & seed_mask) >> seed_shift);
if((value == 0x01) || (value == 0x10))
{
save_offset = offset;
found_sync = 1;
break;
}
else
{
seed_mask = (seed_mask >> 2) ;
seed_shift-=2;
}
}
offset = offset + 8;
seed_shift = (seed_shift - 4) > 0 ? (seed_shift - 4) : (seed_shift + 8 - 4);
seed_mask = (seed_mask >> (6 - seed_shift));
}
Another idea I came up with was to use a structure defined below
typedef struct
{
int remainder_bits;
int extra_bits;
int extra_byte;
}remainder_bits_extra_bits_map_t;
static remainder_bits_extra_bits_map_t sync_bit_check [] =
{
{6, 4, 0},
{5, 5, 0},
{4, 6, 0},
{3, 7, 0},
{2, 8, 0},
{1, 1, 1},
{0, 2, 1},
};
Is my approach correct? Can anyone suggest any improvements for the same?
Lookup Table Idea
There are only 256 possible bytes. That is few enough that you can construct a lookup table of all the possible bit combinations that can happen in one byte.
The lookup table value could record the bit position of the pattern and it could also have special values that mark possible continuation start or continuation finish values.
Edit:
I decided that continuation values would be silly. Instead, to check for a pattern that straddles a byte boundary, shift the byte and OR in the bit from the other byte, or manually check the end bits at each byte. Maybe ((bytes[i] & 0x01) | (bytes[i+1] & 0x80)) == 0x80 (for '01') and ((bytes[i] & 0x01) | (bytes[i+1] & 0x80)) == 0x01 (for '10') would work for you.
You didn't say, so I am also assuming that you are looking for the first match in any byte. If you are looking for every match and then checking for the end pattern at +66 bits, that's a different problem.
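The byte-boundary check can be wrapped in a small helper. This is a sketch under the assumption that bits are read most significant bit first, so the last bit of one byte is its bit 0 and the first bit of the next byte is its bit 7; the name straddling_pair is made up for illustration:

```c
#include <stdint.h>

/* Examines the two-bit pair formed by the last bit of `first` and the
   first bit of `second`. Returns 0x1 for '01', 0x2 for '10', and 0 for
   '00' or '11'. */
unsigned straddling_pair(uint8_t first, uint8_t second)
{
    unsigned pair = ((first & 0x01u) << 1) | (unsigned)(second >> 7);
    return (pair == 0x1u || pair == 0x2u) ? pair : 0u;
}
```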
To create the lookup table, I would write a program to do it for me. It could be in your favorite script language or it could be in C. The program would write a file that looked something like:
/* each value is the bit position of a possible pattern OR'd with a pattern ID bit. */
/* 0 is no match */
#define P_01 0x00
#define P_10 0x10
const char byte_lookup[256] = {
/* 0: 0000_0000, 0000_0001, 0000_0010, 0000_0011 */
0, 2|P_01, 3|P_01, 3|P_01,
/* 4: 0000_0100, 0000_0101, 0000_0110, 0000_0111, */
4|P_01, 4|P_01, 4|P_01, 4|P_01,
/* 8: 0000_1000, 0000_1001, 0000_1010, 0000_1011, */
5|P_01, 5|P_01, 5|P_01, 5|P_01,
};
Tedious. That's why I would write a program to write it for me.
This is a variation of the classic de-blocking problem that often comes up when reading from a stream. That is, data comes in discrete units that don't match up to the unit size that you wish to scan. The challenges in this are 1) buffering (which doesn't affect you because you have access to the whole array) and 2) managing all of the state (as you found out). A good approach is to write a consumer function that acts something like fread() and fseek() which maintains its own state. It returns the requested data you're interested in, aligned properly to the buffers you give it.
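A consumer of this kind could look like the sketch below. It assumes bits are numbered most significant bit first within each byte, that the caller knows the total number of valid bits, and that spacing is greater than zero; the names get_bits and count_sync_headers are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Reads n bits (1..32) starting at bit_offset, MSB-first within each byte.
   The caller must ensure bit_offset + n does not exceed the buffer. */
uint32_t get_bits(const uint8_t *buf, size_t bit_offset, unsigned n)
{
    uint32_t value = 0;
    for (unsigned k = 0; k < n; k++) {
        size_t pos = bit_offset + k;
        unsigned bit = (buf[pos / 8] >> (7 - pos % 8)) & 1u;
        value = (value << 1) | bit;
    }
    return value;
}

/* Counts consecutive valid sync headers ('01' or '10') found at
   start, start + spacing, start + 2 * spacing, ... within total_bits.
   Assumes spacing > 0. */
size_t count_sync_headers(const uint8_t *buf, size_t total_bits,
                          size_t start, size_t spacing)
{
    size_t count = 0;
    for (size_t pos = start; pos + 2 <= total_bits; pos += spacing) {
        uint32_t pair = get_bits(buf, pos, 2);
        if (pair != 0x1u && pair != 0x2u)
            break;  /* '00' or '11': not a sync header */
        count++;
    }
    return count;
}
```

With the bit arithmetic confined to get_bits, the caller works purely in bit offsets (for example a spacing of 66) and never has to track masks and shifts across byte boundaries.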
I cannot figure out what is wrong. I spent a few hours trying to debug this. I am compiling with gcc -m32 source.c -o source
How else can I approach this when debugging? Right now, I am isolating the code in many different ways, and everything works the way I expect, but it works the wrong way when I put it all together.
This program takes an input and then looks for the highest position with the 1 bit.
I removed my code for now.
In bitsearch, you store num in eax and a special value in edx in order to perform the check. check tests whether the highest bit is set (indicating a negative number) and exits if that is the case.
The andl instruction in check stores the result of the operation in the second operand (eax), so the result overwrites num.
Then in zero you use edx to perform your computation. edx still contains the special value from the start of the function, so your result will always be wrong.
Finally, at the end of zero you go back to check, but the check is unnecessary there; you should loop back to zero instead.
Does the bit-search need to be implemented in assembly? A simple for loop can accomplish the same task, and is much more readable:
int num = 10;
int maxFound = -1;
for (int numShifts = 0; numShifts < 32 && num != 0; numShifts++) {
if ((num & 1) == 1) {
maxFound = numShifts;
}
num = num >> 1;
}
//the last position that had a 1 will be in maxFound
There's a neat bit-fiddling trick: x & -x isolates the last 1-bit. The following C program uses a lookup table based on de Bruijn sequences to compute the number of trailing (!) zeros of a number in constant (!) time:
unsigned int x; // find the number of trailing zeros in 32-bit x
int r; // result goes here
int table[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = table[((uint32_t)((x & -x) * 0x077CB531U)) >> 27];
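Doing this in assembly language (which I stopped learning by the age of 16) should be no problem. To find the highest set bit instead, reverse the bits in num and apply the technique described above: the trailing-zero count of the reversed value is the number of leading zeros of num, so the highest set position is 31 minus that count.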
I wrote a paper about the trick described above, but unfortunately it's not available on the web. If you're interested, I can send it to you (or anyone else who's interested) by email.
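Putting the two steps together, here is a sketch of the reverse-then-count approach for the highest set bit. The function names are made up, the bit reversal uses the usual swap-halves pattern, and returning -1 for a zero input is an assumption chosen to match the loop version's convention:

```c
#include <stdint.h>

/* Number of trailing zeros via the de Bruijn multiply; x must be nonzero. */
int trailing_zeros(uint32_t x)
{
    static const int table[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    uint32_t lowest = x & (0u - x);  /* isolate the lowest set bit */
    return table[(uint32_t)(lowest * UINT32_C(0x077CB531)) >> 27];
}

/* Reverses the 32 bits of x by swapping progressively larger groups. */
uint32_t reverse_bits(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Position (0..31) of the highest set bit, or -1 if x is zero. */
int highest_set_bit(uint32_t x)
{
    if (x == 0)
        return -1;
    return 31 - trailing_zeros(reverse_bits(x));
}
```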
My assembly knowledge is a little rusty, but it seems to me like bitsearch is overly complicated. How about just shifting the number to the right and counting the times you need to do that until it's zero?