Why does packing 4 integers into a 32-bit integer give different results on Nextion and Teensy (Arduino-compatible)?

I'm controlling a Teensy 3.5 with a Nextion touchscreen. On the Nextion, the following code packs four 8-bit integers into a 32-bit integer:
sys0=vaShift_24.val<<8|vaShift_16.val<<8|vaShift_8.val<<8|vaShift_0.val
Using the same shift amount (8) everywhere gives a different result on the Teensy; the following, however, generates the same result:
combinedValue = (v24 << 24) | (v16 << 16) | (v08 << 8) | (v00);
I'm curious why these shifts behave differently.
Nextion documentation: https://nextion.tech/instruction-set/
//Nextion:
vaShift_24.val=5
vaShift_16.val=4
vaShift_8.val=1
vaShift_0.val=51
sys0=vaShift_24.val<<8|vaShift_16.val<<8|vaShift_8.val<<8|vaShift_0.val
//Result is 84148531
//Teensy, Arduino, C++:
uint32_t value24 = 5;
uint32_t value16 = 4;
uint32_t value8 = 1;
uint32_t value0 = 51;
uint32_t packedValue = (value24 << 24) | (value16 << 16) | (value8 << 8) | (value0);
Serial.print("24 to 0: ");
Serial.println(packedValue);
packedValue = (value24 << 8) | (value16 << 8) | (value8 << 8) | (value0);
Serial.print("8: ");
Serial.println(packedValue);
//Result:
//24 to 0: 84148531
//8: 1331

The problem seems to be in this line:
sys0=vaShift_24.val<<8|vaShift_16.val<<8|vaShift_8.val<<8|vaShift_0.val
You are shifting by 8 in every position. Presumably you want:
sys0 = vaShift_24.val << 24 | vaShift_16.val << 16 | vaShift_8.val << 8 | vaShift_0.val
Now the result of packing the bytes 5, 4, 1, and 51 should be, in hex,
0x05040133.
If you instead see
0x33010405,
it means you also have a byte-order issue. But probably not.
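As for why the repeated <<8 works on the Nextion side: per the Nextion instruction set documentation, expressions are evaluated strictly left to right with no operator precedence, so each <<8 shifts the entire value accumulated so far. In C/C++, << binds tighter than |, so each operand is shifted independently. A minimal C sketch of both parenthesizations (my illustration, not from the original answer):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    // Nextion-style strict left-to-right evaluation, written with explicit
    // parentheses: each <<8 shifts the whole accumulated value.
    uint32_t nextionStyle = ((((5u << 8) | 4u) << 8 | 1u) << 8) | 51u;
    // C/C++ precedence: << binds tighter than |, so each term is shifted
    // independently and the results overlap in the low bytes.
    uint32_t cStyle = (5u << 8) | (4u << 8) | (1u << 8) | 51u;
    printf("%u\n", nextionStyle); // 84148531 (0x05040133)
    printf("%u\n", cStyle);       // 1331 (0x0533)
    return 0;
}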

Related

Faster way for extracting and combining bits from UINT16 to UINT8

I'm searching for a faster way to do the special extract-and-combine operation I need, as described below:
+-------+-------+-------+-------+-------+-------+-------+-------+
| BIT 7 | BIT 6 | BIT 5 | BIT 4 | BIT 3 | BIT 2 | BIT 1 | BIT 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  D1   |  D0   |  C1   |  C0   |  B1   |  B0   |  A1   |  A0   |
+-------+-------+-------+-------+-------+-------+-------+-------+
A = A0 OR A1
B = B0 OR B1
C = C0 OR C1
D = D0 OR D1
+-------+-------+-------+-------+-------+-------+-------+-------+
| BIT 7 | BIT 6 | BIT 5 | BIT 4 | BIT 3 | BIT 2 | BIT 1 | BIT 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |   D   |   C   |   B   |   A   |
+-------+-------+-------+-------+-------+-------+-------+-------+
For the sake of simplicity the above is only an 8-bit example; the same applies to 16-bit values. It should be implemented as fast as possible on a dsPIC33F microcontroller.
The easy way in C is:
PairFlags |= (ChannelFlags & 0x0003) ? 0x0001 : 0;
PairFlags |= (ChannelFlags & 0x000C) ? 0x0002 : 0;
PairFlags |= (ChannelFlags & 0x0030) ? 0x0004 : 0;
PairFlags |= (ChannelFlags & 0x00C0) ? 0x0008 : 0;
PairFlags |= (ChannelFlags & 0x0300) ? 0x0010 : 0;
PairFlags |= (ChannelFlags & 0x0C00) ? 0x0020 : 0;
PairFlags |= (ChannelFlags & 0x3000) ? 0x0040 : 0;
PairFlags |= (ChannelFlags & 0xC000) ? 0x0080 : 0;
This produces approx. 40 instructions (with -O3), which corresponds to 1 µs in my case.
The number of instruction cycles should be reduced if possible. Is there a faster way, either in C or inline assembly?
The following should work for reducing a 16-bit value to 8 bits (with each bit of output formed by ORing a pair of bits of input):
// Set even bits to bits in pair ORed together, and odd bits to 0...
PairFlags = (ChannelFlags | (ChannelFlags >> 1)) & 0x5555; // '0h0g0f0e0d0c0b0a'
// Compress the '00' or '01' bit pairs down to single '0' or '1' bits...
PairFlags = (PairFlags ^ (PairFlags >> 1)) & 0x3333; // '00hg00fe00dc00ba'
PairFlags = (PairFlags ^ (PairFlags >> 2)) & 0x0F0F; // '0000hgfe0000dcba'
PairFlags = (PairFlags ^ (PairFlags >> 4)) & 0x00FF; // '00000000hgfedcba'
Note: The ^ can be replaced by | in the above for the same result.
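As a quick exhaustive sanity check of that sequence against the question's pair-OR definition (my own test scaffold, with hypothetical names):
#include <stdint.h>
#include <assert.h>

uint8_t pack_pairs(uint16_t ChannelFlags)
{
    uint16_t PairFlags;
    PairFlags = (ChannelFlags | (ChannelFlags >> 1)) & 0x5555; // '0h0g0f0e0d0c0b0a'
    PairFlags = (PairFlags ^ (PairFlags >> 1)) & 0x3333;       // '00hg00fe00dc00ba'
    PairFlags = (PairFlags ^ (PairFlags >> 2)) & 0x0F0F;       // '0000hgfe0000dcba'
    PairFlags = (PairFlags ^ (PairFlags >> 4)) & 0x00FF;       // '00000000hgfedcba'
    return (uint8_t)PairFlags;
}

int main(void)
{
    for (uint32_t f = 0; f <= 0xFFFF; f++) {
        uint8_t expected = 0;
        for (int pair = 0; pair < 8; pair++)   // reference: OR each adjacent bit pair
            if (f & (3u << (2 * pair)))
                expected |= (uint8_t)(1u << pair);
        assert(pack_pairs((uint16_t)f) == expected);
    }
    return 0;
}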
Assuming I got everything right (not tested), this seems to generate good, branch-free code at least on gcc and clang for x86 (-O3):
uint8_t convert (uint8_t ChannelFlags)
{
return ( ((ChannelFlags & A1A0)!=0) << A_POS ) |
       ( ((ChannelFlags & B1B0)!=0) << B_POS ) |
       ( ((ChannelFlags & C1C0)!=0) << C_POS ) |
       ( ((ChannelFlags & D1D0)!=0) << D_POS ) ;
}
This masks out each individual bit pair, then checks it against zero to end up with 1 or 0 in a temporary int. That value is shifted into position in the result, before everything is finally ORed together. Full code:
#include <stdint.h>
#define A1A0 (3u << 0)
#define B1B0 (3u << 2)
#define C1C0 (3u << 4)
#define D1D0 (3u << 6)
#define A_POS 0
#define B_POS 1
#define C_POS 2
#define D_POS 3
uint8_t convert (uint8_t ChannelFlags)
{
return ( ((ChannelFlags & A1A0)!=0) << A_POS ) |
       ( ((ChannelFlags & B1B0)!=0) << B_POS ) |
       ( ((ChannelFlags & C1C0)!=0) << C_POS ) |
       ( ((ChannelFlags & D1D0)!=0) << D_POS ) ;
}
clang's x86 disassembly gives 18 instructions, branch-free:
convert:                                # @convert
        test    dil, 3
        setne   al
        test    dil, 12
        setne   cl
        add     cl, cl
        or      cl, al
        test    dil, 48
        setne   al
        shl     al, 2
        or      al, cl
        mov     ecx, edi
        shr     cl, 7
        shr     dil, 6
        and     dil, 1
        or      dil, cl
        shl     dil, 3
        or      al, dil
        ret
Not sure if it's more efficient, but instead of using a ternary, why not use only bitwise operations, and just move each pair into place with the shift operator?
PairFlags  = ((ChannelFlags >> 0) | (ChannelFlags >> 1)) & 0b1;
PairFlags |= (((ChannelFlags >> 2) | (ChannelFlags >> 3)) & 0b1) << 1;
PairFlags |= (((ChannelFlags >> 4) | (ChannelFlags >> 5)) & 0b1) << 2;
//...
Here is an idea.
Observe one thing here:
A = A0 OR A1
B = B0 OR B1
C = C0 OR C1
D = D0 OR D1
You have 4 OR operations. You can perform all of them in one instruction:
PairFlags = (PairFlags | (PairFlags >> 1))
Now your bits are aligned like this:
[D1][D1 or D0][D0 or C1][C1 or C0][C0 or B1][B1 or B0][B0 or A1][A1 or A0]
So you just need to extract bits 0, 2, 4, 6 to get the result.
Bit 0 is already OK.
Bit 1 should be set to bit 2.
Bit 2 should be set to bit 4.
Bit 3 should be set to bit 6.
The final code is something like this:
PairFlags = (PairFlags | (PairFlags >> 1))
PairFlags = (PairFlags&1) | ((PairFlags&4)>>1) | ((PairFlags&16)>>2) | ((PairFlags&64)>>3)
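Packaged as a function with a spot check (my own wrapper around the two lines above; the name pair_or is made up):
#include <stdint.h>

uint8_t pair_or(uint8_t flags)
{
    flags = flags | (flags >> 1);   /* bit 2k now holds pair k ORed together */
    return (flags & 1) | ((flags & 4) >> 1) | ((flags & 16) >> 2) | ((flags & 64) >> 3);
}

/* Example: 0b11000001 has pairs D=11, C=00, B=00, A=01 -> pair_or gives 0b1001. */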

Binary Interleaving, Binary Swizzling, Alternating Bits

Problem:
I have a sequence of bits with indices 7 6 5 4 3 2 1 0 and I want to swizzle them the following way:
input bit order:   7 6 5 4 3 2 1 0
low nibble  3 2 1 0 -> even positions: _ 3 _ 2 _ 1 _ 0
high nibble 7 6 5 4 -> odd positions:  7 _ 6 _ 5 _ 4 _
output bit order:  7 3 6 2 5 1 4 0
i.e. I want to interleave the bits of the low and high nibbles from a byte.
Naive solution:
I can achieve this behavior in C in the following way:
int output = ((input & (1 << 0)) << 0) |
             ((input & (1 << 1)) << 1) |
             ((input & (1 << 2)) << 2) |
             ((input & (1 << 3)) << 3) |
             ((input & (1 << 4)) >> 3) |
             ((input & (1 << 5)) >> 2) |
             ((input & (1 << 6)) >> 1) |
             ((input & (1 << 7)) >> 0);
However, it's obviously very clunky.
Striving for a more elegant solution:
I was wondering if there were something I could do to achieve this behavior faster, in fewer machine instructions. Using SSE, for example?
Some context for curious people:
I use this for packing 2D signed integer vector coordinates into a 1D value that preserves proximity when dealing with memory and caching. The idea is similar to some texture-layout optimizations used by some GPUs on mobile devices.
(i ^ 0xAAAAAAAA) - 0xAAAAAAAA converts from 1d integer to 1d signed integer with this power of two proximity I was talking about.
(x + 0xAAAAAAAA) ^ 0xAAAAAAAA is just the reverse operation, going from 1d signed integer to a 1d integer, still with the same properties.
To have it become 2d and keep the proximity property, I want to alternate the x and y bits.
So you want to interleave the bits of the low and high nibbles in each byte? For scalar code a 256-byte lookup table (LUT) is probably your best bet.
For x86 SIMD, SSSE3 pshufb (_mm_shuffle_epi8) can be used as a parallel LUT, doing 16 nibble->byte lookups in parallel. Use it to unpack a nibble to a byte.
#include <immintrin.h>

__m128i interleave_high_low_nibbles(__m128i v) {
    // dcba -> 0d0c0b0a: full 16-entry table expanded from that pattern
    const __m128i lut_unpack_bits_low = _mm_setr_epi8(
        0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15,
        0x40, 0x41, 0x44, 0x45, 0x50, 0x51, 0x54, 0x55);
    // dcba -> d0c0b0a0
    const __m128i lut_unpack_bits_high = _mm_slli_epi32(lut_unpack_bits_low, 1);
    // ANDing is required because pshufb uses the high bit to zero that element.
    // 8-bit element shifts aren't available, so we also have to mask after shifting.
    __m128i lo = _mm_and_si128(v, _mm_set1_epi8(0x0f));
    __m128i hi = _mm_and_si128(_mm_srli_epi32(v, 4), _mm_set1_epi8(0x0f));
    lo = _mm_shuffle_epi8(lut_unpack_bits_low, lo);
    hi = _mm_shuffle_epi8(lut_unpack_bits_high, hi);
    return _mm_or_si128(lo, hi);
}
This is not faster than a memory LUT for a single byte, but it does 16 bytes in parallel. pshufb is a single-uop instruction on x86 CPUs made in the last decade. (Slow on first-gen Core 2 and K8.)
Having separate lo/hi LUT vectors means that setup can be hoisted out of a loop; otherwise we'd need to shift one LUT result before ORing together.
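For the scalar LUT route mentioned above, a sketch of what that could look like (my illustration; the table is filled once using the question's naive expression):
#include <stdint.h>

static uint8_t interleave_lut[256];

static void init_interleave_lut(void)
{
    for (int input = 0; input < 256; input++) {
        interleave_lut[input] = (uint8_t)(
            ((input & (1 << 0)) << 0) | ((input & (1 << 1)) << 1) |
            ((input & (1 << 2)) << 2) | ((input & (1 << 3)) << 3) |
            ((input & (1 << 4)) >> 3) | ((input & (1 << 5)) >> 2) |
            ((input & (1 << 6)) >> 1) | ((input & (1 << 7)) >> 0));
    }
}

/* After init, each byte becomes a single lookup: out = interleave_lut[in]; */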

How do I concatenate 4 bytes?

I have 4 bytes:
buffer_RX[3] = 0x70;
buffer_RX[4] = 0xb4;
buffer_RX[5] = 0xc5;
buffer_RX[6] = 0x5a;
I want to concatenate them in order to get the representation 0x70b4c55a.
I already did this: plaintext[1] = (buffer_RX[3] << 8) | buffer_RX[4];
This is the result that I have: 70b4
plaintext[1] = (buffer_RX[3] << 8) | (buffer_RX[4] << 8) | (buffer_RX[5] << 8) | buffer_RX[6];
It doesn't work.
Please, I need help.
This is one way to do it:
plaintext[1] = (buffer_RX[3] << 24) |
               (buffer_RX[4] << 16) |
               (buffer_RX[5] << 8)  |
                buffer_RX[6];
You have to understand that each byte is 8 bits, and that 4 bytes is therefore 32 bits. The first example works because you are using only 16 bits and two bytes; to work with 32 bits and four bytes you must shift each byte by 24, 16, 8, and 0 bits respectively in order to place them into the correct positions.
plaintext[1] = (buffer_RX[3] << 24) |
               (buffer_RX[4] << 16) |
               (buffer_RX[5] << 8)  |
                buffer_RX[6];
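One caveat worth noting (my addition, not from the answers): if the buffer holds uint8_t, each element is promoted to signed int before the shift, and shifting a byte value of 0x80 or above left by 24 overflows int, which is undefined behavior in C. Casting to an unsigned 32-bit type first sidesteps that; a sketch:
#include <stdint.h>

/* Pack four bytes, most significant first, into one 32-bit value. */
uint32_t pack4(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) |
           ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |
            (uint32_t)b[3];
}

/* pack4((uint8_t[]){0x70, 0xb4, 0xc5, 0x5a}) == 0x70b4c55a */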

k&r exercise 2-6 "setbits"

I've seen the answer here: http://clc-wiki.net/wiki/K%26R2_solutions:Chapter_2:Exercise_6
and I've tested the first one, but in this part:
x = 29638;
y = 999;
p = 10;
n = 8;
return (x & ((~0 << (p + 1)) | (~(~0 << (p + 1 - n)))));
on paper it gives me 6, but in the program it returns 28678...
in this part:
 111001111000110
&000100000000111
in the result, the left-most three bits would have to be 1's, like in x, but the bitwise operator & says:
The output of bitwise AND is 1 if the corresponding bits of both operands are 1. If either bit of an operand is 0, the corresponding result bit is evaluated to 0.
so why does it return the number with those 3 bits set to 1?
Here we go, one step at a time (using 16-bit numbers). We start with:
(x & ((~0 << (p + 1)) | (~(~0 << (p + 1 - n)))))
Substituting in numbers (in decimal):
(29638 & ((~0 << (10 + 1)) | (~(~0 << (10 + 1 - 8)))))
Totalling up the bit shift amounts gives:
(29638 & ((~0 << 11) | (~(~0 << 3))))
Rewriting numbers as binary and applying the ~0s...
(0111001111000110 & ((1111111111111111 << 1011) | (~(1111111111111111 << 0011))))
After performing the shifts we get:
(0111001111000110 & (1111100000000000 | (~ 1111111111111000)))
Applying the other bitwise-NOT (~):
(0111001111000110 & (1111100000000000 | 0000000000000111))
And the bitwise-OR (|):
0111001111000110 & 1111100000000111
And finally the bitwise-AND (&):
0111000000000110
So we then have binary 0111000000000110, which is 2 + 4 + 4096 + 8192 + 16384, which is 28678.
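The same arithmetic can be checked mechanically (my own test program, not part of the original answer; it prints the mask and the final value):
#include <stdio.h>

int main(void)
{
    unsigned x = 29638, p = 10, n = 8;
    unsigned mask = (~0u << (p + 1)) | ~(~0u << (p + 1 - n));
    printf("mask   = 0x%04x\n", mask & 0xFFFF); /* 0xf807 = 1111100000000111 */
    printf("result = %u\n", x & mask);          /* 28678 */
    return 0;
}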

I am looking for an algorithm to shuffle the first 25 bits of a (32-bit) int

All of the bit-shuffling algorithms I've found deal with 16-bit or 32-bit values, which means that even if I use only the first 25 bits of an int, the shuffle will put bits outside that range. This function is in the inner loop of a CPU-intensive process, so I'd prefer it to be as fast as possible. I've tried modifying the code of the Hacker's Delight 32-bit shuffle algorithm
x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;
but am having difficulty, partly because I'm not sure where the masks come from. I tried shifting the number and re-shuffling, but so far the results are all for naught. Any help would be GREATLY appreciated!
(I am using C but I can convert an algorithm from another language)
First, for the sake of evenness, we can extend the problem to a 26-bit shuffle by remembering that bit 25 will appear at the end of the interleaved list, so we can trim it off after the interleaving operation without affecting the positions of the other bits.
Now we want to interleave the first and second sets of 13 bits; but we only have an algorithm to interleave the first and second sets of 16 bits.
A straightforward approach might be to just move the high and low parts of x into more workable positions before applying the standard algorithm:
x = (x & 0x1ffe000) << 3 | x & 0x00001fff;
x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;
The zeroes at the top of each half will be interleaved and appear at the top of the result.
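To double-check the masks, the whole thing can be compared against a brute-force reference over all 25-bit inputs (my own test sketch, assuming the intended semantics: bits 0-12 go to even output positions, bits 13-24 to odd ones):
#include <stdint.h>
#include <assert.h>

uint32_t shuffle25(uint32_t x)   /* x must fit in 25 bits */
{
    x = (x & 0x1ffe000) << 3 | x & 0x00001fff;
    x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
    x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
    x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
    x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;
    return x;
}

uint32_t shuffle25_ref(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 13; i++) r |= ((x >> i) & 1u) << (2 * i);            /* low 13 -> even bits */
    for (int i = 0; i < 12; i++) r |= ((x >> (13 + i)) & 1u) << (2 * i + 1); /* high 12 -> odd bits */
    return r;
}

int main(void)
{
    for (uint32_t x = 0; x < (1u << 25); x++)
        assert(shuffle25(x) == shuffle25_ref(x));
    return 0;
}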
