I would like to create an SSE register with values that I can store in an array of integers, from another SSE register which contains flags 0xFFFF and zeros. For example:
__m128i regComp = _mm_cmpgt_epi16(regA, regB);
For the sake of argument, lets assume that regComp was loaded with { 0, 0xFFFF, 0, 0xFFFF }. I would like to convert this into say { 0, 80, 0, 80 }.
What I had in mind was to create an array of integers, initialized to 80 and load them to a register regC. Then, do a _mm_and_si128 bewteen regC and regComp and store the result in regD. However, this does not do the trick, which led me to think that I do not understand the positive flags in SSE registers. Could someone answer the question with a brief explanation why my solution does not work?
short valA[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
short valB[16] = { 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10 };
short ones[16] = { 1 };
short final[16];
__m128i vA, vB, vOnes, vRes, vRes2;
vOnes = _mm_load_si128((__m128i *)&(ones)[0] );
for( i=0 ; i < 16 ;i+=8){
vA = _mm_load_si128((__m128i *)&(valA)[i] );
vB = _mm_load_si128((__m128i *)&(valB)[i] );
vRes = _mm_cmpgt_epi16(vA,vB);
vRes2 = _mm_and_si128(vRes,vOnes);
_mm_storeu_si128((__m128i *)&(final)[i], vRes2);
}
You only set the first element of array ones to 1 (the rest of the array is initialised to 0).
I suggest you get rid of the array ones altogether and then change this line:
vOnes = _mm_load_si128((__m128i *)&(ones)[0] );
to:
vOnes = _mm_set1_epi16(1);
Probably a better solution though, if you just want to convert SIMD TRUE (0xffff) results to 1, would be to use a shift:
for (i = 0; i < 16; i += 8) {
vA = _mm_loadu_si128((__m128i *)&pA[i]);
vB = _mm_loadu_si128((__m128i *)&pB[i]);
vRes = _mm_cmpgt_epi16(vA, vB); // generate 0xffff/0x0000 results
vRes = _mm_srli_epi16(vRes, 15); // convert to 1/0 results
_mm_storeu_si128((__m128i *)&final[i], vRes2);
}
Try this for loading 1:
vOnes = _mm_set1_epi16(1);
This is shorter than creating a constant array.
Be careful, providing less array values than array size in C++ initializes the other values to zero. This was your error, and not the SSE part.
Don't forget the debugger, modern ones display SSE variables properly.
I writing a fast "8 bit reverse"-routine for an avr-project with an ATmega2560 processor.
I'm using
GNU C (WinAVR 20100110) version 4.3.3 (avr) / compiled by GNU C version 3.4.5 (mingw-vista special r3), GMP version 4.2.3, MPFR version 2.4.1.
First I created a global lookup-table of reversed bytes (size: 0x100):
uint8_t BitReverseTable[]
__attribute__((__progmem__, aligned(0x100))) = {
0x00,0x80,0x40,0xC0,0x20,0xA0,0x60,0xE0,
0x10,0x90,0x50,0xD0,0x30,0xB0,0x70,0xF0,
[...]
0x1F,0x9F,0x5F,0xDF,0x3F,0xBF,0x7F,0xFF
};
This works as expected. That is the macro I intend to use, which should cost me only 5 cylces:
#define BITREVERSE(x) (__extension__({ \
register uint8_t b=(uint8_t)x; \
__asm__ __volatile__ ( \
"ldi r31, hi8(table)" "\n\t" \
"mov r30, ioRegister" "\n\t" \
"lpm ioRegister, z" "\n\t" \
:[ioRegister] "+r" (b) \
:[table] "g" (BitReverseTable) \
:"r30", "r31" \
); \
}))
The code to get it compiled (or not).
int main() /// Test for bitreverse
{
BITREVERSE(25);
return 0;
}
That's the error I get from the compiler:
c:/winavr-20100110/bin/../lib/gcc/avr/4.3.3/../../../../avr/bin/as.exe -mmcu=atmega2560 -o bitreverse.o C:\Users\xxx\AppData\Local\Temp/ccCefE75.s
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s: Assembler messages:
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s:349: Error: constant value required
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s:350: Error: constant value required
I guess the problem is here:
:[table] "g" (BitReverseTable) \
From my point of view BitReverseTable is the memory position of the array, which is fixed and known at compile time. Therefor it is constant.
Maybe I need to cast BitReverseTable into something (i tried anything I could think of). Maybe I need another constraint ("g" was my last test). I'm sure I used anything possible and impossible.
I coded an assembler version, which works fine, but instead of being an inline assembly code, this is a proper function which adds another 6 cycles (for call and ret).
Any advice or suggestions are very welcome!
Full source of bitreverse.c on pastebin.
Verbose compiler output also on pastebin
The following does seem to work on avr-gcc (GCC) 4.8.2, but it does have a distinct hacky aftertaste to me.
Edited to fix the issues pointed out by the OP (Thomas) in the comments:
The high byte of Z register is r31 (I had r30 and r31 swapped)
Newer AVR's like ATmega2560 support also lpm r,Z (older AVRs only lpm r0,Z)
Thanks for the fixes, Thomas! I do have an ATmega2560 board, but I prefer Teensies (in part because of the native USB), so I only compile-tested the code, didn't run it to verify. I should have mentioned that; apologies.
const unsigned char reverse_bits_table[256] __attribute__((progmem, aligned (256))) = {
0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208, 48, 176, 112, 240,
8, 136, 72, 200, 40, 168, 104, 232, 24, 152, 88, 216, 56, 184, 120, 248,
4, 132, 68, 196, 36, 164, 100, 228, 20, 148, 84, 212, 52, 180, 116, 244,
12, 140, 76, 204, 44, 172, 108, 236, 28, 156, 92, 220, 60, 188, 124, 252,
2, 130, 66, 194, 34, 162, 98, 226, 18, 146, 82, 210, 50, 178, 114, 242,
10, 138, 74, 202, 42, 170, 106, 234, 26, 154, 90, 218, 58, 186, 122, 250,
6, 134, 70, 198, 38, 166, 102, 230, 22, 150, 86, 214, 54, 182, 118, 246,
14, 142, 78, 206, 46, 174, 110, 238, 30, 158, 94, 222, 62, 190, 126, 254,
1, 129, 65, 193, 33, 161, 97, 225, 17, 145, 81, 209, 49, 177, 113, 241,
9, 137, 73, 201, 41, 169, 105, 233, 25, 153, 89, 217, 57, 185, 121, 249,
5, 133, 69, 197, 37, 165, 101, 229, 21, 149, 85, 213, 53, 181, 117, 245,
13, 141, 77, 205, 45, 173, 109, 237, 29, 157, 93, 221, 61, 189, 125, 253,
3, 131, 67, 195, 35, 163, 99, 227, 19, 147, 83, 211, 51, 179, 115, 243,
11, 139, 75, 203, 43, 171, 107, 235, 27, 155, 91, 219, 59, 187, 123, 251,
7, 135, 71, 199, 39, 167, 103, 231, 23, 151, 87, 215, 55, 183, 119, 247,
15, 143, 79, 207, 47, 175, 111, 239, 31, 159, 95, 223, 63, 191, 127, 255,
};
#define USING_REVERSE_BITS \
register unsigned char r31 asm("r31"); \
asm volatile ( "ldi r31,hi8(reverse_bits_table)\n\t" : [r31] "=d" (r31) )
#define REVERSE_BITS(v) \
({ register unsigned char r30 asm("r30") = v; \
register unsigned char ret; \
asm volatile ( "lpm %[ret],Z\n\t" : [ret] "=r" (ret) : [r30] "d" (r30), [r31] "d" (r31) ); \
ret; })
unsigned char reverse_bits(const unsigned char value)
{
USING_REVERSE_BITS;
return REVERSE_BITS(value);
}
void reverse_bits_in(unsigned char *string, unsigned char length)
{
USING_REVERSE_BITS;
while (length-->0) {
*string = REVERSE_BITS(*string);
string++;
}
}
For older AVRs that only support lpm r0,Z, use
#define REVERSE_BITS(v) \
({ register unsigned char r30 asm("r30") = v; \
register unsigned char ret asm("r0"); \
asm volatile ( "lpm %[ret],Z\n\t" : [ret] "=t" (ret) : [r30] "d" (r30), [r31] "d" (r31) ); \
ret; })
The idea is that we use a local reg var r31, to keep the high byte of the Z register pair. The USING_REVERSE_BITS; macro defines it in the current scope, using inline assembly for two purposes: to avoid an unnecessary load of the low part of the table address into a register, and to make sure GCC knows we have stored a value into it (because it is an output operand) without having any way of knowing what the value should be, thus hopefully retaining it throughout the scope.
The REVERSE_BITS() macro yields the result, telling the compiler it needs the argument in register r30, and the table address high byte set by USING_REVERSE_BITS; in r31.
Sounds a bit complicated, but that's just because I don't know how to explain it better. It really is quite simple.
Compiling the above with avr-gcc-4.8.2 -O2 -fomit-frame-pointer -mmcu=atmega2560 -S yields the assembly source. (I do recommend using -O2 -fomit-frame-pointer.)
Omitting comments and the normal directives:
.text
reverse_bits:
ldi r31,hi8(reverse_bits_table)
mov r30,r24
lpm r24,Z
ret
reverse_bits_in:
mov r26,r24
mov r27,r25
ldi r31,hi8(reverse_bits_table)
ldi r24,lo8(-1)
add r24,r22
tst r22
breq .L2
.L8:
ld r30,X
lpm r30,Z
st X+,r30
subi r24,1
brcc .L8
.L2:
ret
.section .progmem.data,"a",#progbits
.p2align 8
reverse_bits_table:
.byte 0
.byte -128
; Rest of data omitted for brevity
In case you are wondering, on ATmega2560 GCC puts the first 8-bit parameter and the 8-bit function result both in register r24.
The first function is optimal, as far as I can tell. (On older AVRs that only support lpm r0,Z, you get an added move to copy the result from r0 to r24.)
For the second function, the setup part might not be exactly optimal (for one, you could do the tst r22 breq .L2 first thing to speed up the zero-length-array check), but I'm not sure if I could write a faster/shorter one myself; it's certainly acceptable to me.
The loop in the second function looks optimal to me. The way it uses r30 I found strange and scary at first, but then I realized it makes perfect sense -- fewer registers used, and there is no harm in reusing r30 this way (even if it is low part of Z register too), because it will be loaded with a new value from string at the start of the next iteration.
Note that in my previous edit, I mentioned that swapping the order of the function parameters yielded better code, but with Thomas's additions, that is no longer the case. The registers change, that's it.
If you are sure you always supply a larger-than-zero length, using
void reverse_bits_in(unsigned char *string, unsigned char length)
{
USING_REVERSE_BITS;
do {
*string = REVERSE_BITS(*string);
string++;
} while (--length);
}
yields
reverse_bits_in:
mov r26,r24 ; 1 cycle
mov r27,r25 ; 1 cycle
ldi r31,hi8(reverse_bits_table) ; 2 cycles
.L4:
ld r30,X ; 2 cycles
lpm r30,Z ; 3 cycles
st X+,r30 ; 2 cycles
subi r22,lo8(-(-1)) ; 1 cycle
brne .L4 ; 2 cycles
ret ; 4 cycles
which starts to look downright impressive to me: ten cycles per byte, four cycles for setup, and three cycles cleanup (brne takes just one cycle if no jump). The cycle counts I listed off the top of my head, so there are likely small errors in 'em (a cycle here or there). r26:r27 is X, and the first pointer parameter to the function is supplied in r24:r25, with length in r22.
The reverse_bits_table is in the correct section, and correctly aligned. (.p2align 8 does align to 256 bytes; it specifies an alignment where the low 8 bits are zero.)
Although GCC is notorious for superfluous register moves, I really like the code it generates above. Sure, there is always room for finessing; for the important code sequences I recommend trying different variants, even changing the order of function parameters (or declaring loop variables in local scopes), and so on, then compile using -S to see the generated code. The AVR instruction timings are simple, so it is pretty easy to compare code sequences, to see if one is clearly better. I like to remove the directives and comments first; it makes it easier to read the assembly.
The reason for the hacky aftertaste is that the GCC documentation explicitly says that "Defining such a register variable does not reserve the register; it remains available for other uses in places where flow control determines the variable's value is not live", and I just don't trust that this means the same to the GCC developers as it means to me. Even if it did right now, it might not in the future; there is no standard GCC developers ought to adhere to here, since this is a GCC-specific feature.
On the other hand, I do only rely on documented GCC behaviour, above, and although "hacky", it does generate efficient assembly from straightforward C code.
Personally, I would recommend recompiling the above test code, and looking at the generated assembly (perhaps use sed to strip out the comments and labels, and compare to a known good version?), whenever you update avr-gcc.
Questions?
I cannot figure out what is wrong. I spent a few hours trying to debug this. I am compiling with gcc -m32 source.c -o source
How else can I approach this when debugging? Right now, I am isolating the code in many different ways and everything is working the way I expect but its working the wrong way when I have it all together.
This program takes an input and then looks for the highest position with the 1 bit.
I removed my code for now.
in bitsearch, you are storing num in eax, you store a special value in edx in order to perform check. check is testing if the highest bit is set (indicating a negative number), and exits if its the case...
the andl instruction in check stores the result of the operation inside the second operand (eax), so the result overwrites num.
then in zero you are using edx to perform your computation... edx contains the special value of the start of the function, so your result will always be wrong.
now at the end of zero, you are going back to check, but the check is unnecessary here, you should loop back to zeroinstead...
Does the bit-search need to be implemented in assembly? A simple for loop can accomplish the same task, and is much more readable:
int num = 10;
int maxFound = -1;
for (int numShifts = 0; numShifts < 32 && num != 0; numShifts++) {
if ((num & 1) == 1) {
maxFound = numShifts;
}
num = num >> 1;
}
//the last position that had a 1 will be in maxFound
There's a neat bit-fiddling trick: x & -x isolates the last 1-bit. The following C program uses a lookup table based on de Bruijn sequences to compute the number of trailing (!) zeros of a number in constant (!) time:
unsigned int x; // find the number of trailing zeros in 32-bit x
int r; // result goes here
int table[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = table[((uint32_t)((x & -x) * 0x077CB531U)) >> 27];
Doing this in assembly language (which I stopped learning by the age of 16) should be no problem. Now all you have to do is to reverse the bits in num and apply the technique described above.
I wrote a paper about the trick described above, but unfortunately it's not available on the web. If you're interested, I can send it to you (or anyone else who's interested) by email.
My assembly knowledge is a little rusty, but it seems to me like bitsearch is overly complicated. How about just rotating the number to the right and counting the times you need to do that until it's zero?