Basically, I have 8 pieces of data, 2 bits each (4 states), being stored in the 16 LSBs of a 32-bit integer. I want to reverse the order of the data pieces to do some pattern matching.
I am given a reference integer and 8 candidates, and I need to match one of the candidates to the reference. However, the matching candidate may be transformed in some predictable way.
If the reference data is in the form [0,1,2,3,4,5,6,7], then the possible matches can be in one of these 8 forms:
[0,1,2,3,4,5,6,7], [0,7,6,5,4,3,2,1]
[6,7,0,1,2,3,4,5], [2,1,0,7,6,5,4,3]
[4,5,6,7,0,1,2,3], [4,3,2,1,0,7,6,5]
[2,3,4,5,6,7,0,1], [6,5,4,3,2,1,0,7]
The pattern is that the data is always in order, but can be reversed and rotated.
I am implementing this in C and MIPS. I have both working, but they seem bulky. My current approach is to mask each piece from the original, shift it to its new position, and OR it with the new variable (initialized to 0).
In C I hard-coded everything:
int ref = 4941; // reference value, original order [1,3,0,1,3,0,1,0], (encoded as 0b0001001101001101)
int rev = 0;
rev |= ((ref & 0x0003) << 14) | ((ref & 0x000C) << 10) | ((ref & 0x0030) << 6) | ((ref & 0x00C0) << 2); // move bottom 8 bits to top
rev |= ((ref & 0xC000) >> 14) | ((ref & 0x3000) >> 10) | ((ref & 0x0C00) >> 6) | ((ref & 0x0300) >> 2); // move top 8 bits to bottom
// rev = 29124 reversed order [0,1,0,3,1,0,3,1], (0b0111000111000100)
I implemented a loop in MIPS to try to reduce the static instruction count:
lw $01, Reference($00) # load reference value
addi $04, $00, 4 # initialize $04 as Loop counter
addi $05, $00, 14 # initialize $05 to hold shift value
addi $06, $00, 3 # initialize $06 to hold mask (one piece of data)
add $02, $00, $00 # initialize result register $02 to 0
# Reverse the order of data in Reference and store it in $02
Loop: addi $04, $04, -1 # decrement Loop counter
and $03, $01, $06 # mask out one piece ($03 = Reference & $06)
sllv $03, $03, $05 # shift piece to new position ($03 <<= $05)
or $02, $02, $03 # put piece into $02 ($02 |= $03)
sllv $06, $06, $05 # shift mask for next piece
and $03, $01, $06 # mask out next piece ($03 = Reference & $06)
srlv $03, $03, $05 # shift piece to new position ($03 >>= $05)
or $02, $02, $03 # put new piece into $02 ($02 |= $03)
srlv $06, $06, $05 # shift mask back
addi $05, $05, -4 # decrease shift amount by 4
sll $06, $06, 2 # shift mask for next loop
bne $04, $00, Loop # keep looping while $04 != 0
Is there a way to implement this that is simpler or at least fewer instructions?
To reverse your 2-bit pieces, you can use the following code.
static int rev(int v) {
    // swap adjacent pairs of 2-bit pieces
    v = ((v >> 2) & 0x3333) | ((v & 0x3333) << 2);
    // swap adjacent nibbles
    v = ((v >> 4) & 0x0f0f) | ((v & 0x0f0f) << 4);
    // swap bytes
    v = ((v >> 8) & 0x00ff) | ((v & 0x00ff) << 8);
    return v;
}
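A quick check against the question's example values (my addition, not part of the original answer):
#include <assert.h>

int main(void)
{
    assert(rev(4941) == 29124); // [1,3,0,1,3,0,1,0] -> [0,1,0,3,1,0,3,1]
    return 0;
}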
MIPS implementation is 15 instructions.
rev: # value to reverse in $01
# uses $02 reg
srl $02, $01, 2
andi $02, $02, 0x3333
andi $01, $01, 0x3333
sll $01, $01, 2
or $01, $01, $02
srl $02, $01, 4
andi $02, $02, 0x0f0f
andi $01, $01, 0x0f0f
sll $01, $01, 4
or $01, $01, $02
srl $02, $01, 8
andi $02, $02, 0xff
andi $01, $01, 0xff
sll $01, $01, 8
or $01, $01, $02
# result in $01
Note that you can simultaneously reverse 2x16 bits by just doubling the constants (and even 4x16 on 64-bit machines). But I am not sure that is useful in your case.
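For reference, the doubled-constant 2x16-bit variant would look like this (my sketch, untested); each 16-bit half is reversed independently because none of the shifts cross a halfword boundary under these masks:
static unsigned rev2x16(unsigned v) {
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);
    v = ((v >> 4) & 0x0f0f0f0fu) | ((v & 0x0f0f0f0fu) << 4);
    v = ((v >> 8) & 0x00ff00ffu) | ((v & 0x00ff00ffu) << 8);
    return v;
}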
Note: be careful with hand-written optimized assembly; these optimizations are very processor-specific, so resort to them only if you really struggle with your compiler's code generation in a tight loop.
You can improve pipeline usage (if you code in C, the compiler does it for you) and use the delay slot of the bne instruction. This will improve your instruction-level parallelism.
Assume something like a MIPS processor with one delay slot and the classic 5-stage pipeline (Instruction Fetch, Decode, Execute, Memory, Writeback).
This pipeline introduces read-after-write (RAW) hazards on data dependences; most of them here were on the $3 register.
A RAW hazard causes your pipeline to stall.
# Reverse the order of data in Reference and store it in $02
Loop: and $03, $01, $06 # mask out one piece ($03 = Reference & $06)
addi $04, $04, -1 # decrement Loop counter (RaW on $3)
sllv $03, $03, $05 # shift piece to new position ($03 <<= $05)
sllv $06, $06, $05 # shift mask for next piece
or $02, $02, $03 # put piece into $02 ($02 |= $03)
and $03, $01, $06 # mask out next piece (#03 = Reference & $06)
srlv $06, $06, $05 # shift mask back
srlv $03, $03, $05 # shift piece to new position ($03 >>= $05)
addi $05, $05, -4 # decrease shift amount by 4
or $02, $02, $03 # put new piece into $02 ($02 |= $03)
bne $04, $00, Loop # keep looping while $04 != 0
sll $06, $06, 2 # shift mask for next loop
If you have a superscalar processor, the solution needs some changes.
For a very simple and effective approach, use a 256-byte lookup table and perform 2 lookups:
extern unsigned char const xtable[256];
unsigned int ref = 4941;
unsigned int rev = (xtable[ref & 0xFF] << 8) | xtable[ref >> 8];
The xtable array can be initialized statically via a set of macros:
/* S(x): reverse the four 2-bit pieces within one byte */
#define S(x) ((((x) & 0x03) << 6) | (((x) & 0x0C) << 2) | \
              (((x) & 0x30) >> 2) | (((x) & 0xC0) >> 6))
#define X8(m,n) m((n)+0), m((n)+1), m((n)+2), m((n)+3), \
m((n)+4), m((n)+5), m((n)+6), m((n)+7)
#define X32(m,n) X8(m,(n)), X8(m,(n)+8), X8(m,(n)+16), X8(m,(n)+24)
unsigned char const xtable[256] = {
X32(S, 0), X32(S, 32), X32(S, 64), X32(S, 96),
X32(S, 128), X32(S, 160), X32(S, 192), X32(S, 224),
};
#undef S
#undef X8
#undef X32
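Alternatively (my sketch, not from the original answer), if you'd rather avoid the macro trick, a non-const table can be filled once at startup:
unsigned char xtable[256]; // non-const variant of the table above

void init_xtable(void)
{
    for (int i = 0; i < 256; i++)
        xtable[i] = (unsigned char)(((i & 0x03) << 6) | ((i & 0x0C) << 2) |
                                    ((i & 0x30) >> 2) | ((i & 0xC0) >> 6));
}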
If space is not expensive, you could use a single lookup into a 128K-byte table, which you would compute at startup time or generate with a script and include at compile time, but it is somewhat wasteful and not cache-friendly.
Related
I'm searching for a faster way to perform my required special extract-and-combine operation, described below:
+-------+-------+-------+-------+-------+-------+-------+-------+
| BIT 7 | BIT 6 | BIT 5 | BIT 4 | BIT 3 | BIT 2 | BIT 1 | BIT 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| D1 | D0 | C1 | C0 | B1 | B0 | A1 | A0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
A = A0 OR A1
B = B0 OR B1
C = C0 OR C1
D = D0 OR D1
+-------+-------+-------+-------+-------+-------+-------+-------+
| BIT 7 | BIT 6 | BIT 5 | BIT 4 | BIT 3 | BIT 2 | BIT 1 | BIT 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | D | C | B | A |
+-------+-------+-------+-------+-------+-------+-------+-------+
For the sake of simplicity, the above is only an 8-bit example; the same applies for 16-bit values. It should be implemented as fast as possible on a dsPIC33F microcontroller.
The easy way in C is:
PairFlags |= (ChannelFlags & 0x0003) ? 0x0001 : 0;
PairFlags |= (ChannelFlags & 0x000C) ? 0x0002 : 0;
PairFlags |= (ChannelFlags & 0x0030) ? 0x0004 : 0;
PairFlags |= (ChannelFlags & 0x00C0) ? 0x0008 : 0;
PairFlags |= (ChannelFlags & 0x0300) ? 0x0010 : 0;
PairFlags |= (ChannelFlags & 0x0C00) ? 0x0020 : 0;
PairFlags |= (ChannelFlags & 0x3000) ? 0x0040 : 0;
PairFlags |= (ChannelFlags & 0xC000) ? 0x0080 : 0;
This will produce approx. 40 instructions (with -O3), which corresponds to 1µs in my case.
The amount of instruction cycles should be reduced if possible. Is there a faster way either in C or inline assembly?
The following should work for reducing a 16-bit value to 8 bits (with each bit of output formed by ORing a pair of bits of input):
// Set even bits to bits in pair ORed together, and odd bits to 0...
PairFlags = (ChannelFlags | (ChannelFlags >> 1)) & 0x5555; // '0h0g0f0e0d0c0b0a'
// Compress the '00' or '01' bit pairs down to single '0' or '1' bits...
PairFlags = (PairFlags ^ (PairFlags >> 1)) & 0x3333; // '00hg00fe00dc00ba'
PairFlags = (PairFlags ^ (PairFlags >> 2)) & 0x0F0F; // '0000hgfe0000dcba'
PairFlags = (PairFlags ^ (PairFlags >> 4)) & 0x00FF; // '00000000hgfedcba'
Note: The ^ can be replaced by | in the above for the same result.
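Wrapped into a function with a quick self-test (my addition; pair_or is just an illustrative name):
#include <assert.h>
#include <stdint.h>

static uint8_t pair_or(uint16_t ChannelFlags)
{
    uint16_t PairFlags;
    PairFlags = (ChannelFlags | (ChannelFlags >> 1)) & 0x5555;
    PairFlags = (PairFlags ^ (PairFlags >> 1)) & 0x3333;
    PairFlags = (PairFlags ^ (PairFlags >> 2)) & 0x0F0F;
    PairFlags = (PairFlags ^ (PairFlags >> 4)) & 0x00FF;
    return (uint8_t)PairFlags;
}

int main(void)
{
    assert(pair_or(0x0000) == 0x00);
    assert(pair_or(0xFFFF) == 0xFF);
    assert(pair_or(0x4001) == 0x81); // pair 7 = 01, pair 0 = 01
    return 0;
}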
Assuming I got everything right (not tested), this seems to generate good, branch-free code at least on gcc and clang for x86 (-O3):
uint8_t convert (uint8_t ChannelFlags)
{
return ( ((ChannelFlags & A1A0)!=0) << A_POS ) |
( ((ChannelFlags & B1B0)!=0) << B_POS ) |
( ((ChannelFlags & C1C0)!=0) << C_POS ) |
( ((ChannelFlags & D1D0)!=0) << D_POS ) ;
}
This masks out each individual bit pair, then checks against zero to end up with 1 or 0 in a temporary int. That value is shifted into position in the result, before everything is finally ORed together. Full code:
#include <stdint.h>
#define A1A0 (3u << 0)
#define B1B0 (3u << 2)
#define C1C0 (3u << 4)
#define D1D0 (3u << 6)
#define A_POS 0
#define B_POS 1
#define C_POS 2
#define D_POS 3
uint8_t convert (uint8_t ChannelFlags)
{
return ( ((ChannelFlags & A1A0)!=0) << A_POS ) |
( ((ChannelFlags & B1B0)!=0) << B_POS ) |
( ((ChannelFlags & C1C0)!=0) << C_POS ) |
( ((ChannelFlags & D1D0)!=0) << D_POS ) ;
}
clang's x86 disassembly gives 18 instructions, branch-free:
convert: # #convert
test dil, 3
setne al
test dil, 12
setne cl
add cl, cl
or cl, al
test dil, 48
setne al
shl al, 2
or al, cl
mov ecx, edi
shr cl, 7
shr dil, 6
and dil, 1
or dil, cl
shl dil, 3
or al, dil
ret
Not sure if it's more efficient, but instead of using a ternary, why not use only bitwise operations and just offset with the bit-shift operator?
PairFlags |= (((ChannelFlags >> 0) | (ChannelFlags >> 1)) & 1) << 0;
PairFlags |= (((ChannelFlags >> 2) | (ChannelFlags >> 3)) & 1) << 1;
PairFlags |= (((ChannelFlags >> 4) | (ChannelFlags >> 5)) & 1) << 2;
//...
Here is an idea.
Observe one thing here:
A = A0 OR A1
B = B0 OR B1
C = C0 OR C1
D = D0 OR D1
You have four OR operations, and you can perform all of them in one expression:
PairFlags = PairFlags | (PairFlags >> 1);
Now your bits are aligned like this:
[D1][D1 or D0][D0 or C1][C1 or C0][C0 or B1][B1 or B0][B0 or A1][A1 or A0]
So you just need to extract bits 0, 2, 4, 6 to get the result.
Bit 0 is already OK.
Bit 1 should be set to bit 2.
Bit 2 should be set to bit 4.
Bit 3 should be set to bit 6.
The final code is something like this:
PairFlags = PairFlags | (PairFlags >> 1);
PairFlags = (PairFlags & 1) | ((PairFlags & 4) >> 1) | ((PairFlags & 16) >> 2) | ((PairFlags & 64) >> 3);
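The question also mentions 16-bit inputs; the same idea extends directly (my sketch): after the single OR, pick out bits 0, 2, ..., 14, using the fact that shifting t right by i moves bit 2i down to bit i:
#include <stdint.h>

static uint8_t pairs16(uint16_t ChannelFlags)
{
    uint16_t t = ChannelFlags | (ChannelFlags >> 1);
    return (uint8_t)(( t       & 0x01) | ((t >> 1) & 0x02) | ((t >> 2) & 0x04)
                   | ((t >> 3) & 0x08) | ((t >> 4) & 0x10) | ((t >> 5) & 0x20)
                   | ((t >> 6) & 0x40) | ((t >> 7) & 0x80));
}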
Problem:
I have a sequence of bits with indices 7 6 5 4 3 2 1 0, and I want to swizzle them in the following way:
7 6 5 4 3 2 1 0 = 7 6 5 4 3 2 1 0
_____| | | | | | | |_____
| ___| | | | | |___ |
| | _| | | |_ | |
| | | | | | | |
v v v v v v v v
_ 3 _ 2 _ 1 _ 0 7 _ 6 _ 5 _ 4 _
|___________________|
|
v
7 3 6 2 5 1 4 0
i.e. I want to interleave the bits of the low and high nibbles from a byte.
Naive solution:
I can achieve this behavior in C in the following way:
int output =
((input & (1 << 0)) << 0) |
((input & (1 << 1)) << 1) |
((input & (1 << 2)) << 2) |
((input & (1 << 3)) << 3) |
((input & (1 << 4)) >> 3) |
((input & (1 << 5)) >> 2) |
((input & (1 << 6)) >> 1) |
((input & (1 << 7)) >> 0);
However it's obviously very clunky.
Striving for a more elegant solution:
I was wondering whether there is something I could do to achieve this behavior faster, in fewer machine instructions. Using SSE, for example?
Some context for the curious:
I use this for packing 2d signed integer vector coordinates into a 1d value that conserves proximity when dealing with memory and caching. The idea is similar to some texture layouts optimization used by some GPUs on mobile devices.
(i ^ 0xAAAAAAAA) - 0xAAAAAAAA converts from 1d integer to 1d signed integer with this power of two proximity I was talking about.
(x + 0xAAAAAAAA) ^ 0xAAAAAAAA is just the reverse operation, going from 1d signed integer to a 1d integer, still with the same properties.
To have it become 2d and keep the proximity property, I want to alternate the x and y bits.
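For concreteness, here is how the naive swizzle composes with the packing (my sketch; swizzle8 and pack2d are illustrative names, not from the post):
static unsigned swizzle8(unsigned input)
{
    return ((input & 0x01) << 0) | ((input & 0x02) << 1) |
           ((input & 0x04) << 2) | ((input & 0x08) << 3) |
           ((input & 0x10) >> 3) | ((input & 0x20) >> 2) |
           ((input & 0x40) >> 1) | ((input & 0x80) >> 0);
}

static unsigned pack2d(unsigned x, unsigned y) // x, y in 0..15
{
    return swizzle8((y << 4) | x); // result bits: y3 x3 y2 x2 y1 x1 y0 x0
}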
So you want to interleave the bits of the low and high nibbles in each byte? For scalar code a 256-byte lookup table (LUT) is probably your best bet.
For x86 SIMD, SSSE3 pshufb (_mm_shuffle_epi8) can act as a LUT performing 16 nibble->byte lookups in parallel. Use it to unpack each nibble to a byte.
#include <tmmintrin.h> // SSSE3

__m128i interleave_high_low_nibbles(__m128i v) {
    // 16-entry LUT: nibble dcba -> byte 0d0c0b0a
    const __m128i lut_unpack_bits_low = _mm_setr_epi8(
        0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85);
    // dcba -> d0c0b0a0: the same LUT with every entry doubled
    const __m128i lut_unpack_bits_high = _mm_slli_epi32(lut_unpack_bits_low, 1);
    // ANDing is required because pshufb uses the high bit to zero that element.
    // 8-bit element shifts aren't available, so we also have to mask after shifting.
    __m128i lo = _mm_and_si128(v, _mm_set1_epi8(0x0f));
    __m128i hi = _mm_and_si128(_mm_srli_epi32(v, 4), _mm_set1_epi8(0x0f));
    lo = _mm_shuffle_epi8(lut_unpack_bits_low, lo);
    hi = _mm_shuffle_epi8(lut_unpack_bits_high, hi);
    return _mm_or_si128(lo, hi);
}
This is not faster than a memory LUT for a single byte, but it does 16 bytes in parallel. pshufb is a single-uop instruction on x86 CPUs made in the last decade. (Slow on first-gen Core 2 and K8.)
Having separate lo/hi LUT vectors means that setup can be hoisted out of a loop; otherwise we'd need to shift one LUT result before ORing together.
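Usage would look something like this (my addition; uses the include from the function above plus <stdint.h>):
void swizzle_block(uint8_t dst[16], const uint8_t src[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)src); // unaligned load
    v = interleave_high_low_nibbles(v);
    _mm_storeu_si128((__m128i *)dst, v);               // unaligned store
}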
I need help to understand what is happening in this declaration:
#define LDA(m) (LDA_OP << 5 | ((m) & 0x001f))
Thank you
y << x is a left shift of y by x.
x & y is a bitwise and of x and y.
So the left-shift operator is like multiplying by 10 in base 10, except in base 2 you multiply by 2. For example:
In base 10:
300 * 10 = 3000
In base 2:
0b0001 * 2 = 0b0010 = 0b0001 << 1
With a << b you "push" the number a, b places to the left.
As for the OR operator (|): take two bits; if either or both of them are 1, the result is 1.
For example:
0b0010 | 0b0001 = 0b0011
0b0010 | 0b0010 = 0b0010
So LDA(m) shifts LDA_OP into the upper bits and ORs in the low 5 bits of m: it packs an opcode and a 5-bit operand into a single word.
If you have problems with these operators, just try working the same numbers in binary.
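Putting it together for the macro in question, with a made-up opcode (LDA_OP is not shown in the original code, so the value 0b101 below is purely illustrative):
#define LDA_OP 0x5u // hypothetical: 0b101; the real value isn't shown
#define LDA(m) (LDA_OP << 5 | ((m) & 0x001f))

// LDA(0x47) evaluates as:
//   0b101 << 5    = 0b10100000  (opcode placed in the upper bits)
//   0x47 & 0x001f = 0b00000111  (operand clipped to its low 5 bits)
//   OR'd together = 0b10100111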
All of the bit-shuffling algorithms I've found deal with 16-bit or 32-bit values, which means that even if I use only the first 25 bits of an int, the shuffle will move bits outside that range. This function is in the inner loop of a CPU-intensive process, so I'd prefer it to be as fast as possible. I've tried modifying the code of the Hacker's Delight 32-bit shuffle algorithm
x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;
but am having difficulty, partly because I'm not sure where the masks come from. I tried shifting the number and re-shuffling, but so far the results are all for naught. Any help would be GREATLY appreciated!
(I am using C but I can convert an algorithm from another language)
First, for the sake of evenness, we can extend the problem to a 26-bit shuffle by remembering that bit 25 will appear at the end of the interleaved list, so we can trim it off after the interleaving operation without affecting the positions of the other bits.
Now we want to interleave the first and second sets of 13 bits; but we only have an algorithm to interleave the first and second sets of 16 bits.
A straightforward approach might be to just move the high and low parts of x into more workable positions before applying the standard algorithm:
x = (x & 0x1ffe000) << 3 | x & 0x00001fff;
x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;
The zeroes at the top of each half will be interleaved and appear at the top of the result.
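Wrapping the steps above into a function with a couple of spot checks (my addition, a sketch rather than tested production code):
#include <assert.h>
#include <stdint.h>

static uint32_t interleave13(uint32_t x)
{
    x = (x & 0x1ffe000) << 3 | (x & 0x00001fff);
    x = (x & 0x0000FF00) << 8 | ((x >> 8) & 0x0000FF00) | (x & 0xFF0000FF);
    x = (x & 0x00F000F0) << 4 | ((x >> 4) & 0x00F000F0) | (x & 0xF00FF00F);
    x = (x & 0x0C0C0C0C) << 2 | ((x >> 2) & 0x0C0C0C0C) | (x & 0xC3C3C3C3);
    x = (x & 0x22222222) << 1 | ((x >> 1) & 0x22222222) | (x & 0x99999999);
    return x;
}

int main(void)
{
    assert(interleave13(1u) == 1u);             // low-half bit 0  -> result bit 0
    assert(interleave13(1u << 13) == 2u);       // high-half bit 0 -> result bit 1
    assert(interleave13(1u << 24) == 1u << 23); // high-half bit 11 -> result bit 23
    return 0;
}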
~ & ^ | + << >> are the only operations I can use
Before I continue: this is a homework question, and I've been stuck on it for a really long time.
My original approach: I thought that !x could be done with two's complement, doing something with its additive inverse. I know an XOR is probably in here, but I'm really at a loss how to approach this.
For the record: I also cannot use conditionals, loops, ==, etc., only the (bitwise) operations I mentioned above.
For example:
!0 = 1
!1 = 0
!anything besides 0 = 0
Assuming a 32-bit unsigned int:
(((x>>1) | (x&1)) + ~0U) >> 31
should do the trick
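Spelled out as a function with the reasoning in comments (my addition):
// 1 if x == 0, else 0 -- using only the allowed operators on a 32-bit unsigned.
unsigned bang_u(unsigned x)
{
    // (x >> 1) | (x & 1) is zero iff x is zero, and never exceeds
    // 0x7FFFFFFF, so bit 31 starts out clear. Adding ~0U (i.e. -1)
    // borrows all the way into bit 31 exactly when the value was zero.
    // The final shift extracts that bit.
    return (((x >> 1) | (x & 1)) + ~0U) >> 31;
}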
Assuming x is signed, and we need to return 0 for any nonzero number and 1 for zero:
A right shift of a signed integer is an arithmetic shift in most implementations (i.e. the sign bit is copied over). Therefore, right-shift x by 31, and right-shift its negation by 31. One of those two will be a negative number, so shifted right by 31 it will be 0xFFFFFFFF (and if x = 0, both shifts produce 0x0, which is what you want). You don't know whether x or its negation is the negative one, so just OR them together. Then add 1 and you're good.
implementation:
int bang(int x) {
return ((x >> 31) | ((~x + 1) >> 31)) + 1;
}
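A quick sanity check (my addition). Note that >> on a negative int is implementation-defined (and ~x + 1 overflows for INT_MIN), but both behave as expected on common compilers such as gcc and clang:
#include <assert.h>

int main(void)
{
    assert(bang(0) == 1);
    assert(bang(1) == 0);
    assert(bang(-1) == 0);
    assert(bang(-2147483647 - 1) == 0); // INT_MIN: both x and -x shift to -1
    return 0;
}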
The following code copies any 1 bit to all positions. This maps all non-zeroes to 0xFFFFFFFF == -1, while leaving 0 at 0. Then it adds 1, mapping -1 to 0 and 0 to 1.
x = x | x << 1 | x >> 1;
x = x | x << 2 | x >> 2;
x = x | x << 4 | x >> 4;
x = x | x << 8 | x >> 8;
x = x | x << 16 | x >> 16;
x = x + 1;
For a 32-bit signed integer x:
// Set the bottom bit if any bit set.
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x ^= 1; // Toggle the bottom bit - now 0 if any bit set.
x &= 1; // Clear the unwanted bits to leave 0 or 1.
Assuming e.g. an 8-bit unsigned type:
~(((x >> 0) & 1)
| ((x >> 1) & 1)
| ((x >> 2) & 1)
...
| ((x >> 7) & 1)) & 1
Note that plain ~x & 1 on its own does not work as !x: it only tests the low bit, so it yields 1 for every even x (e.g. ~2 & 1 == 1, while !2 == 0). It is only valid as a final step once all bits have been OR-folded into bit 0, as in the previous answers.