This is a variant of the question Fast search of some nibbles in two ints at same offset (C, microoptimisation), with a different task:
The task is to find a predefined nibble in an int32 and replace it with another nibble. For example, the nibble to search for is 0x5 and the nibble to replace it with is 0xe:
input:  0x3d542753
            ^   ^
output: 0x3dE427E3
There can be other pairs of search and replace nibbles (known at compile time).
I checked my program: this part is one of the hottest spots (gprof proven; 75% of the time is spent in this function), and it is called very many times (gcov proven). Actually it is the 3rd or 4th loop of nested loops, with an estimated run count of (n^3)*(2^n) for n = 18..24.
My current code is slow (I rewrote it as a function, but it is code from a loop):
static inline __attribute__((always_inline)) uint32_t nibble_replace(uint32_t A)
{
    int i;
    uint32_t mask = 0xf;
    uint32_t search = 0x5;
    uint32_t replace = 0xe;
    for (i = 0; i < 8; i++) {
        if ((A & mask) == search)
            A = (A & ~mask)   // clean i-th nibble
                | replace;    // and replace it
        mask <<= 4; search <<= 4; replace <<= 4;
    }
    return A;
}
Is it possible to rewrite this function (and the equivalent macro) in a parallel way, using some bit-logic magic? The magic is something like (t-0x11111111) & ~t & 0x88888888, possibly usable with SSE*. Check the accepted answer of the linked question to get a feeling for the kind of magic needed.
My compiler is gcc 4.5.2 and the CPU is an Intel Core 2 Solo, in 32-bit mode (x86) or (in the near future) in 64-bit mode (x86-64).
This seemed like a fun question, so I wrote a solution without looking at the other answers. It appears to be about 4.9x as fast on my system, and also slightly faster (~25%) than DigitalRoss's solution.
static inline uint32_t nibble_replace_2(uint32_t x)
{
uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
uint32_t y = (~(ONES * SEARCH)) ^ x;
y &= y >> 2;
y &= y >> 1;
y &= ONES;
y *= 15; /* This is faster than y |= y << 1; y |= y << 2; */
return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}
I would explain how it works, but... I think explaining it spoils the fun.
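As a quick sanity check, a minimal test harness against the example pair from the question:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* the pair from the question: search 0x5, replace 0xE */
    assert(nibble_replace_2(0x3d542753) == 0x3de427e3);
    /* a word with no 0x5 nibbles must pass through unchanged */
    assert(nibble_replace_2(0x12346789) == 0x12346789);
    puts("ok");
    return 0;
}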
Note on SIMD: This kind of stuff is very, very easy to vectorize. You don't even have to know how to use SSE or MMX. Here is how I vectorized it:
static void nibble_replace_n(uint32_t *restrict p, uint32_t n)
{
uint32_t i;
for (i = 0; i < n; ++i) {
uint32_t x = p[i];
uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
uint32_t y = (~(ONES * SEARCH)) ^ x;
y &= y >> 2;
y &= y >> 1;
y &= ONES;
y *= 15;
p[i] = x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}
}
Using GCC, this function will automatically be converted to SSE code at -O3, assuming proper use of the -march flag. You can pass -ftree-vectorizer-verbose=2 to GCC to ask it to print out which loops are vectorized, e.g.:
$ gcc -std=gnu99 -march=native -O3 -Wall -Wextra -o opt opt.c
opt.c:66: note: LOOP VECTORIZED.
Automatic vectorization gave me an extra speed gain of about 64%, and I didn't even have to reach for the processor manual.
Edit: I noticed an additional 48% speedup by changing the types in the auto-vectorized version from uint32_t to uint16_t. This brings the total speedup to about 12x over the original. Changing to uint8_t causes vectorization to fail. I suspect there's some significant extra speed to be found with hand assembly, if it's that important.
Edit 2: Changed *= 7 to *= 15; this invalidates the earlier speed tests.
Edit 3: Here's a change that is obvious in retrospect:
static inline uint32_t nibble_replace_2(uint32_t x)
{
uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
uint32_t y = (~(ONES * SEARCH)) ^ x;
y &= y >> 2;
y &= y >> 1;
y &= ONES;
return x ^ (y * (SEARCH ^ REPLACE));
}
I am trying to transmit values between architectures by creating a uint8_t[] buffer and then sending that. To ensure they are transmitted correctly, the spec is to convert all values to little-endian as they go into the buffer.
I read this article here, which discusses how to convert from one endianness to the other, and here, where it discusses how to check the endianness of the system.
I am curious whether there is a method to read bytes from a uint64 or other value in little-endian order regardless of whether the system is big- or little-endian (i.e., through some sequence of bitwise operations)?
Or is the only method to first check the endianness of the system, and then, if big, explicitly convert to little?
That's actually quite easy -- you just use shifts to convert between the 'native' format (whatever that is) and little-endian:
/* put a 32-bit value into a buffer in little-endian order (4 bytes) */
void put32(uint8_t *buf, uint32_t val) {
buf[0] = val;
buf[1] = val >> 8;
buf[2] = val >> 16;
buf[3] = val >> 24;
}
/* get a 32-bit value from a buffer (little-endian) */
uint32_t get32(uint8_t *buf) {
return (uint32_t)buf[0] + ((uint32_t)buf[1] << 8) +
((uint32_t)buf[2] << 16) + ((uint32_t)buf[3] << 24);
}
If you put a value into a buffer, transmit it as a byte stream to another machine, and then get the value from the received buffer, the two machines will have the same 32-bit value regardless of whether they have the same or different native byte ordering. The casts are needed because the default promotions will just convert to int, which might be smaller than a uint32_t, in which case the shifts could be out of range.
Be careful if your buffers are char rather than uint8_t (char might or might not be signed) -- you need to mask in that case:
uint32_t get32(char *buf) {
return ((uint32_t)buf[0] & 0xff) + (((uint32_t)buf[1] & 0xff) << 8) +
(((uint32_t)buf[2] & 0xff) << 16) + (((uint32_t)buf[3] & 0xff) << 24);
}
You can always serialize a uint64_t value to an array of uint8_t in little-endian order as simply:
uint64_t source = ...;
uint8_t target[8];
target[0] = source;
target[1] = source >> 8;
target[2] = source >> 16;
target[3] = source >> 24;
target[4] = source >> 32;
target[5] = source >> 40;
target[6] = source >> 48;
target[7] = source >> 56;
or
for (int i = 0; i < sizeof (uint64_t); i++) {
target[i] = source >> i * 8;
}
and this will work anywhere where uint64_t and uint8_t exist.
Notice that this assumes that the source value is unsigned. Bit-shifting negative signed values will cause all sorts of headaches and you just don't want to do that.
Deserialization is a bit more complex if reading a byte at a time in order:
uint8_t source[8] = ...;
uint64_t target = 0;
for (int i = 0; i < sizeof (uint64_t); i ++) {
target |= (uint64_t)source[i] << i * 8;
}
The cast to (uint64_t) is absolutely necessary, because the operands of << will undergo integer promotions, and uint8_t would always be converted to a signed int - and "funny" things will happen when you shift a set bit into the sign bit of a signed int.
If you write this into a function
#include <inttypes.h>
void serialize(uint64_t source, uint8_t *target) {
target[0] = source;
target[1] = source >> 8;
target[2] = source >> 16;
target[3] = source >> 24;
target[4] = source >> 32;
target[5] = source >> 40;
target[6] = source >> 48;
target[7] = source >> 56;
}
and compile it for x86-64 using GCC 11 with -O3, the function will be compiled to:
serialize:
movq %rdi, (%rsi)
ret
which just moves the 64-bit value of source into the target array as-is. If you reverse the indices (7 ... 0; big-endian), GCC will be clever enough to recognize that too and will compile it (with -O3) to:
serialize:
bswap %rdi
movq %rdi, (%rsi)
ret
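The matching deserializer can be written the same way (an untested sketch following the loop shown earlier); GCC can typically recognize this byte-assembly idiom as a single 64-bit load as well:

uint64_t deserialize(const uint8_t *source) {
    uint64_t target = 0;
    for (int i = 0; i < 8; i++) {
        target |= (uint64_t)source[i] << i * 8;  /* the cast matters, as discussed above */
    }
    return target;
}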
Most standardized network protocols specify numbers in big-endian format. In fact, big-endian is also referred to as network byte order, and there are functions specifically for translating integers of various sizes between host and network byte order.
These functions are htons and ntohs for 16-bit values and htonl and ntohl for 32-bit values. However, there is no equivalent for 64-bit values, and you're using little-endian for the network protocol, so these won't help you.
You can still, however, translate between the host byte order and the network byte order (little-endian in this case) without knowing the host order, by bit-shifting the relevant values into or out of the host numbers.
For example, to convert a 32-bit value from host to little-endian and back to host:
uint32_t src_value = *some value*;
uint8_t buf[sizeof(uint32_t)];
int i;
for (i=0; i<sizeof(uint32_t); i++) {
buf[i] = (src_value >> (8 * i)) & 0xff;
}
uint32_t dest_value = 0;
for (i=0; i<sizeof(uint32_t); i++) {
dest_value |= (uint32_t)buf[i] << (8 * i);
}
For two systems that must communicate, you specify an "intercommunication byte order". Then you have functions that convert between that and the native byte order of each architecture.
There are three approaches to this problem. In order of efficiency:
Compile time detection of endianness
Run time detection of endianness
Endian agnostic code (corresponding to "sequence of bitwise operations" in your question).
Compile time detection of endianness
On architectures whose byte order is the same as the intercomm byte order, these functions do no transformation, but by using them, the same code becomes portable between systems.
Such functions may already exist on your target platform, for example:
Linux's endian.h be64toh() et al.
POSIX htonl, htons, ntohl, ntohs
Windows' winsock.h (same as POSIX, but adds 64-bit htonll() and ntohll())
Where they don't exist, creating them with cross-platform support is trivial. For example:
uint16_t intercom_to_host_16( uint16_t intercom_word )
{
#if __BIG_ENDIAN__
return intercom_word ;
#else
return intercom_word >> 8 | intercom_word << 8 ;
#endif
}
Here I have assumed that the intercom order is big-endian; that makes the function compatible with network byte order per ntohs() et al. The macro __BIG_ENDIAN__ is predefined on most compilers. If not, simply define it as a command-line macro when compiling, e.g. -D__BIG_ENDIAN__.
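A 32-bit counterpart would follow the same pattern (a sketch, keeping the big-endian intercom assumption):

uint32_t intercom_to_host_32( uint32_t intercom_word )
{
#if __BIG_ENDIAN__
    return intercom_word ;
#else
    return intercom_word >> 24 |
           (intercom_word >> 8 & 0x0000ff00u) |
           (intercom_word << 8 & 0x00ff0000u) |
           intercom_word << 24 ;
#endif
}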
Run time detection of endianness
It is possible to detect endianness at runtime with minimal overhead:
uint16_t intercom_to_host_16( uint16_t intercom_word )
{
static const union
{
uint16_t word ;
uint8_t bytes[2] ;
} test = {.word = 0xff00u } ;
return test.bytes[0] == 0xffu ?
intercom_word :
intercom_word >> 8 | intercom_word << 8 ;
}
Of course you might wrap the test in a function for use in similar functions for other word sizes:
#include <stdbool.h>
bool isBigEndian()
{
static const union
{
uint16_t word ;
uint8_t bytes[2] ;
} test = {.word = 0xff00u } ;
return test.bytes[0] == 0xffu ;
}
Then simply have:
uint16_t intercom_to_host_16( uint16_t intercom_word )
{
return isBigEndian() ? intercom_word :
intercom_word >> 8 | intercom_word << 8 ;
}
Endian agnostic code
It is entirely possible to use endian-agnostic code, but in that case all participants in the communication or file processing bear the software overhead even when their native byte order already matches the intercom byte order.
uint16_t intercom_to_host_16( uint16_t intercom_word )
{
    uint8_t host_word [2] = { intercom_word >> 8,      // high byte of the big-endian intercom value
                              intercom_word & 0xff } ; // low byte
    return *(uint16_t*)host_word ;
}
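If the pointer cast is a concern (alignment and strict aliasing), a memcpy expresses the same trick in a well-defined way; a sketch under the same assumptions:

#include <string.h>

uint16_t intercom_to_host_16( uint16_t intercom_word )
{
    uint8_t host_word [2] = { intercom_word >> 8,
                              intercom_word & 0xff } ;
    uint16_t result ;
    memcpy( &result, host_word, sizeof result ) ;  /* compilers optimize this to a plain load */
    return result ;
}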
I have a set of C-like snippets provided that describe a CRC algorithm, and this article that explains how to transform a serial implementation into a parallel one, which I need to implement in Verilog.
I tried using multiple online code generators, both serial and parallel (although a serial one would not work in the final solution), and also tried working from the article, but got nothing similar to what these snippets generate.
I should say I'm more or less exclusively a hardware engineer and my understanding of C is rudimentary. I have also never worked with a CRC other than a straightforward shift-register implementation. I can see the polynomial and the initial value from what I have, but that is more or less it.
The serial implementation uses an augmented message. Should I also create the parallel one for a message 6 bits wider, with zeros appended to it?
I do not understand too well how the final value crc6 is generated. crcValue is produced by the CalcCrc function for the final zeros of the augmented message, then its top bit is written to its place in crc6 and removed before the value is fed to the function again. Why is that? When working through the algorithm to get the matrices for the parallel implementation, should I take crc6 as my final result rather than the last value of crcValue?
Regardless of how crc6 is obtained, the CRC-check snippet only runs everything through the function. How does that work?
Here are the code snippets:
const unsigned crc6Polynom = 0x03;  // x**6 + x + 1
unsigned CalcCrc(unsigned crcValue, unsigned thisbit) {
    unsigned m = crcValue & crc6Polynom;
    while (m > 0) {
        thisbit ^= (m & 1);
        m >>= 1;
    }
    return (((thisbit << 6) | crcValue) >> 1);
}
// obtain CRC6 for sending (6 bit)
unsigned GetCrc(unsigned crcValue) {
unsigned crc6 = 0;
for (i = 0; i < 6; i++) {
crcValue = CalcCrc(crcValue, 0);
crc6 |= (crcValue & 0x20) | (crc6 >> 1);
crcValue &= 0x1F; // remove output bit
}
return (crc6);
}
// Calculate CRC6
unsigned crcValue = 0x3F;
for (i = 1; i < nDataBits; i++) { // Startbit excluded
unsigned thisBit = (unsigned)((telegram >> i) & 0x1);
crcValue = CalcCrc(crcValue, thisBit);
}
/* now send telegram + GetCrc(crcValue) */
// Check CRC6
unsigned crcValue = 0x3F;
for (i = 1; i < nDataBits+6; i++) { // No startbit, but with CRC
unsigned thisBit = (unsigned)((telegram >> i) & 0x1);
crcValue = CalcCrc(crcValue, thisBit);
}
if (crcValue != 0) { /* put error handler here */ }
Thanks in advance for any advice; I'm really stuck here.
XORing bits of the data stream can be done in parallel because only the least significant bit is used for feedback (in this case), and the order of the data-stream XOR operations doesn't affect the result.
Whether the hardware would need a parallel version depends on how a data stream is handled. The hardware could calculate the CRC one bit at a time during transmission or reception. If the hardware is staged to work with 6 bit characters, then a parallel version would make sense.
Since the snippets use a right shift for the CRC, it would seem that data for each 6 bit character is transmitted and received least significant bit first, to allow for hardware that could calculate CRC 1 bit at a time as it's transmitted or received. After all 6 bit data characters are transmitted, then the 6 bit CRC is transmitted (also least significant bit first).
The snippets seem wrong. My guess at what they should be:
/* calculate crc6 1 bit at a time */
const unsigned crc6Polynom =0x43; /* x**6 + x + 1 */
unsigned CalcCrc(unsigned crcValue, unsigned thisbit) {
crcValue ^= thisbit;
if(crcValue&1)
crcValue ^= crc6Polynom;
crcValue >>= 1;
return crcValue;
}
Example for passing 6 bits at a time. A 64-entry by 6-bit table lookup could be used to replace the for loop (see the sketch after the code).
/* calculate 6 bits at a time */
unsigned CalcCrc6(unsigned crcValue, unsigned sixbits) {
int i;
crcValue ^= sixbits;
for(i = 0; i < 6; i++){
if(crcValue&1)
crcValue ^= crc6Polynom;
crcValue >>= 1;
}
return crcValue;
}
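The 64-entry table could be generated by running each possible 6-bit register value through CalcCrc6 once; a sketch of the idea (untested; crcTable is a hypothetical name):

/* build the lookup table once */
unsigned crcTable[64];

void InitCrcTable(void) {
    unsigned c;
    for (c = 0; c < 64; c++)
        crcTable[c] = CalcCrc6(c, 0);  /* the 6 shift/XOR steps folded into one entry */
}

/* then one 6-bit character per step: */
/* crcValue = crcTable[(crcValue ^ sixbits) & 0x3f]; */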
Assume that telegram contains 31 bits, 1 start bit + 30 data bits (five 6 bit characters):
/* code to calculate crc 6 bits at a time: */
unsigned crcValue = 0x3F;
int i;
telegram >>= 1; /* skip start bit */
for (i = 0; i < 5; i++) {
crcValue = CalcCrc6(crcValue, telegram & 0x3f);
telegram >>= 6;
}
I have run into a technical issue that makes me want to improve a previous implementation. The situation is:
I have 5 GPIO pins, and I need to use these pins as a hardware identifier. For example:
pin1: LOW
pin2: LOW
pin3: LOW
pin4: LOW
pin5: LOW
identifies one of my HW variants, so there can be many combinations. In the previous design, the developer used if-else to implement this, like:
if(PIN1 == LOW && ... && ......&& PIN5 ==LOW)
{
HWID = variant1;
}
else if( ... )
{
}
...
else
{
}
but I think this is not good because there will be more than 200 variants and the code will become too long, so I want to change it to a mask. The idea is to treat these five pins as a five-bit register. I can predict which variant to assign according to the GPIO status (this is already defined by the hardware team, who provide a variant list with the configuration of all these GPIO pins), so the code may look like this:
enum {
    variant_1   = 0x00, // GPIO config 1
    ...
    variant_243 = 0xF3  // GPIO config 243
};
Then I can first read the status of these five GPIO pins and compare it against some mask to see whether they are equal or not.
Question
However, a GPIO pin has three states, namely LOW, HIGH and OPEN. Is there any good calculation method for building a three-state mask?
You have 5 pins of 3 states each. You can approach representing this in a few ways.
First, imagine using this sort of framework:
#define LOW (0)
#define HIGH (1)
#define OPEN (2)
uint16_t config = PIN_CONFIG(pin1, pin2, pin3, pin4, pin5);
if(config == PIN_CONFIG(LOW, HIGH, OPEN, LOW, LOW))
{
// do something
}
switch(config) {
case PIN_CONFIG(LOW, HIGH, OPEN, LOW, HIGH):
// do something;
break;
}
#define CONFIG_MAX PIN_CONFIG(OPEN, OPEN, OPEN, OPEN, OPEN)
uint32_t hardware_ids[CONFIG_MAX + 1] = {0};
// init your hardware ids
hardware_ids[PIN_CONFIG(LOW, HIGH, HIGH, LOW, LOW)] = 0xF315;
hardware_ids[PIN_CONFIG(LOW, LOW, HIGH, LOW, LOW)] = 0xF225;
// look up a HWID
uint32_t hwid = hardware_ids[config];
This code is just the sort of stuff you'd like to do with pin configurations. The only bit left to implement is PIN_CONFIG.
Approach 1
The first approach is to keep using it as a bitfield, but instead of 1 bit per pin you use 2 bits to represent each pin state. I think this is the cleanest, even though you're "wasting" half a bit for each pin.
#define PIN_CLAMP(x) ((x) & 0x03)
#define PIN_CONFIG(p1, p2, p3, p4, p5) \
    (PIN_CLAMP(p1) |                   \
     (PIN_CLAMP(p2) << 2) |            \
     (PIN_CLAMP(p3) << 4) |            \
     (PIN_CLAMP(p4) << 6) |            \
     (PIN_CLAMP(p5) << 8))
This is kind of nice because it leaves room for a "Don't care" or "Invalid" value if you are going to do searches later.
Approach 2
Alternatively, you can use arithmetic to do it, making sure you use the minimum amount of bits necessary. That is, ~1.5 bits to encode 3 values. As expected, this goes from 0 up to 242 for a total of 3^5=243 states.
Without knowing anything else about your situation I believe this is the smallest complete encoding of your pin states.
(Practically, you have to use 8 bits to encode 243 values, so it comes out a bit above 1.5 bits per pin.)
#define PIN_CLAMP(x) ((x) % 3) /* note this should really assert */
#define PIN_CONFIG(p1, p2, p3, p4, p5) \
    (PIN_CLAMP(p1) +                   \
     (PIN_CLAMP(p2) * 3) +             \
     (PIN_CLAMP(p3) * 9) +             \
     (PIN_CLAMP(p4) * 27) +            \
     (PIN_CLAMP(p5) * 81))
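Going the other way with this base-3 encoding is a division and a modulo; a small decoding helper (untested sketch):

/* decode one pin (0..4) from a base-3 packed config: 0=LOW, 1=HIGH, 2=OPEN */
static unsigned pin_decode(unsigned config, unsigned pin_number)
{
    static const unsigned pow3[] = { 1, 3, 9, 27, 81 };
    return (config / pow3[pin_number]) % 3;
}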
Approach 1.1
If you don't like preprocessor stuff, you could use functions a bit like this:
enum PinLevel { low = 0, high, open };

void set_pin(uint32_t *config, uint8_t pin_number, enum PinLevel value) {
    int shift = pin_number * 2;        // 2 bits per pin
    uint32_t mask = 0x03 << shift;     // 2 bits set, moved to the right spot
    *config &= ~mask;
    *config |= (((uint32_t)value) << shift) & mask;
}

enum PinLevel get_pin(uint32_t config, uint8_t pin_number) {
    int shift = pin_number * 2;        // 2 bits per pin
    return (enum PinLevel)((config >> shift) & 0x03);
}
This follows the first (2 bits per value) approach.
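Typical usage might look like:

uint32_t config = 0;
set_pin(&config, 0, high);              /* pin 0 -> HIGH */
set_pin(&config, 2, open);              /* pin 2 -> OPEN */
enum PinLevel p2 = get_pin(config, 2);  /* p2 == open */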
Approach 1.2
YET ANOTHER WAY using C's cool bitfield syntax:
struct pins {
uint16_t pin1 : 2;
uint16_t pin2 : 2;
uint16_t pin3 : 2;
uint16_t pin4 : 2;
uint16_t pin5 : 2;
};
typedef union pinconfig_ {
struct pins pins;
uint16_t value;
} pinconfig;
pinconfig input;
input.value = 0; // don't forget to init the members unless static
input.pins.pin1 = HIGH;
input.pins.pin2 = LOW;
printf("%d", input.value);
input.value = 0x0003;
printf("%d", input.pins.pin1);
The union lets you view the bitfield as a number and vice versa.
(note: all code completely untested)
This is my suggestion to solve the problem:
#include<stdio.h>
#define LOW 0
#define HIGH 1
#define OPEN 2
#define MAXGPIO 5
int main()
{
int gpio[MAXGPIO] = { LOW, LOW, OPEN, HIGH, OPEN };
int mask = 0;
for (int i = 0; i < MAXGPIO; i++)
mask = mask << 2 | gpio[i];
printf("Masked: %d\n", mask);
printf("Unmasked:\n");
for (int i = 0; i < MAXGPIO; i++)
printf("GPIO %d = %d\n", i + 1, (mask >> (2*(MAXGPIO-1-i))) & 0x03);
return 0;
}
A little explanation about the code.
Masking
I am using 2 bits to store each GPIO value. The combinations are:
00: LOW
01: HIGH
10: OPEN
11: invalid
I iterate over the array gpio (where I have the acquired values) and build up the mask variable, shifting left 2 bits and applying an OR operation for each pin.
Unmasking
To get the initial values back I perform the opposite operation, shifting right by 2*(MAXGPIO-1-i) bits and masking with 0x03.
I mask with 0x03 because those are the bits I am interested in.
This is the result of the program
$ cc -Wall test.c -o test;./test
Masked: 38
Unmasked:
GPIO 1 = 0
GPIO 2 = 0
GPIO 3 = 2
GPIO 4 = 1
GPIO 5 = 2
Hope this helps
I've been struggling with the intrinsics. In particular, I don't get the same results using the standard CRC calculation and the supposedly equivalent Intel intrinsics. I'd like to move on to _mm_crc32_u16 and _mm_crc32_u32, but if I can't get the 8-bit operation to work there's no point.
static UINT32 g_ui32CRC32Table[256] =
{
0x00000000L, 0x77073096L, 0xEE0E612CL, 0x990951BAL,
0x076DC419L, 0x706AF48FL, 0xE963A535L, 0x9E6495A3L,
0x0EDB8832L, 0x79DCB8A4L, 0xE0D5E91EL, 0x97D2D988L,
....
// Your basic 32-bit CRC calculator
// NOTE: this code cannot be changed
UINT32 CalcCRC32(unsigned char *pucBuff, int iLen)
{
UINT32 crc = 0xFFFFFFFF;
for (int x = 0; x < iLen; x++)
{
crc = g_ui32CRC32Table[(crc ^ *pucBuff++) & 0xFFL] ^ (crc >> 8);
}
return crc ^ 0xFFFFFFFF;
}
UINT32 CalcCRC32_Intrinsic(unsigned char *pucBuff, int iLen)
{
UINT32 crc = 0xFFFFFFFF;
for (int x = 0; x < iLen; x++)
{
crc = _mm_crc32_u8(crc, *pucBuff++);
}
return crc ^ 0xFFFFFFFF;
}
That table is for a different CRC polynomial than the one used by the Intel instruction. The table is for the Ethernet/ZIP/etc. CRC, often referred to as CRC-32. The Intel instruction uses the iSCSI (Castagnoli) polynomial, for the CRC often referred to as CRC-32C.
This short example code can calculate either, by uncommenting the desired polynomial:
#include <stddef.h>
#include <stdint.h>
/* CRC-32 (Ethernet, ZIP, etc.) polynomial in reversed bit order. */
#define POLY 0xedb88320
/* CRC-32C (iSCSI) polynomial in reversed bit order. */
/* #define POLY 0x82f63b78 */
/* Compute CRC of buf[0..len-1] with initial CRC crc. This permits the
computation of a CRC by feeding this routine a chunk of the input data at a
time. The value of crc for the first chunk should be zero. */
uint32_t crc32c(uint32_t crc, const unsigned char *buf, size_t len)
{
int k;
crc = ~crc;
while (len--) {
crc ^= *buf++;
for (k = 0; k < 8; k++)
crc = crc & 1 ? (crc >> 1) ^ POLY : crc >> 1;
}
return ~crc;
}
You can use this code to generate a replacement table for your code by simply computing the CRC-32C of each of the one-byte messages 0, 1, 2, ..., 255.
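A quick way to confirm which CRC you have is the standard nine-byte check message "123456789"; the expected outputs below are the published check values for these two polynomials:

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned char msg[] = "123456789";
    /* with POLY 0xedb88320 (CRC-32)  this prints cbf43926 */
    /* with POLY 0x82f63b78 (CRC-32C) this prints e3069283 */
    printf("%08" PRIx32 "\n", crc32c(0, msg, strlen((const char *)msg)));
    return 0;
}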
FWIW, I've obtained SW code that demonstrably matches the Intel crc32c instruction, but it uses a different polynomial: 0x82f63b78. The function definitely doesn't match any of the iSCSI test examples here: https://www.rfc-editor.org/rfc/rfc3720#appendix-B.4
What's frustrating in all this is that every implementation I've tried for CRC-32C comes out with different hashes from all the others. Is there a true piece of reference code out there?
How can I switch the 0th and 3rd bits of each nibble in an integer using only bit operations (no control structures)? What kind of masks do I need to create in order to solve this problem? Any help would be appreciated. For example, 8 (1000) becomes 1 (0001).
/*
* SwitchBits(0) = 0
* SwitchBits(8) = 1
* SwitchBits(0x812) = 0x182
* SwitchBits(0x12345678) = 0x82a4c6e1
* Legal Operations: ! ~ & ^ | + << >>
*/
int SwitchBits(int n) {
}
Code:
#include <stdio.h>
#include <inttypes.h>
static uint32_t SwitchBits(uint32_t n)
{
uint32_t bit0_mask = 0x11111111;
uint32_t bit3_mask = 0x88888888;
uint32_t v_bit0 = n & bit0_mask;
uint32_t v_bit3 = n & bit3_mask;
n &= ~(bit0_mask | bit3_mask);
n |= (v_bit0 << 3) | (v_bit3 >> 3);
return n;
}
int main(void)
{
uint32_t i_values[] = { 0, 8, 0x812, 0x12345678, 0x9ABCDEF0 };
uint32_t o_values[] = { 0, 1, 0x182, 0x82A4C6E1, 0x93B5D7F0 };
enum { N_VALUES = sizeof(o_values) / sizeof(o_values[0]) };
for (int i = 0; i < N_VALUES; i++)
{
printf("0x%.8" PRIX32 " => 0x%.8" PRIX32 " (vs 0x%.8" PRIX32 ")\n",
i_values[i], SwitchBits(i_values[i]), o_values[i]);
}
return 0;
}
Output:
0x00000000 => 0x00000000 (vs 0x00000000)
0x00000008 => 0x00000001 (vs 0x00000001)
0x00000812 => 0x00000182 (vs 0x00000182)
0x12345678 => 0x82A4C6E1 (vs 0x82A4C6E1)
0x9ABCDEF0 => 0x93B5D7F0 (vs 0x93B5D7F0)
Note the use of uint32_t to avoid undefined behaviour with sign bits in signed integers.
To obtain a bit, you can mask it out using AND. To get the lowest bit, for example:
x & 0x01
Think about how AND works: both bits must be set. Since we're ANDing with 1, all bits except the first must be 0, because they're 0 in 0x01. The lowest bit will be either 0 or 1, depending on what's in x; said differently, the lowest bit will be the lowest bit in x, which is what we want. Visually:
x = abcd
AND 1 = 0001
--------
000d
(where abcd represent the bits in those slots; we don't know what they are)
To move it to bit 3's position, just shift it:
(x & 0x01) << 3
Visually, again:
x & 0x01 = 000d
<< 3
-----------
d000
To add it in, first, we need to clear out that spot in x for our bit. We use AND again:
x & ~0x08
Here, we invert 0x08 (which is 1000 in binary): this means all bits except bit 3 are set, and when we AND that with x, we get x except for that bit.
Visually,
0x08 = 1000
(invert)
-----------
0111
AND x = abcd
------------
0bcd
Combine with OR:
(x & ~0x08) | ((x & 0x01) << 3)
Visually,
x & ~0x08 = 0bcd
| ((x & 0x01) << 3) = d000
--------------------------
dbcd
Now, this only moves bit 0 to bit 3, and just overwrites bit 3. We still need to do bit 3 → 0. That's simply another:
(x & 0x08) >> 3
And we need to clear out its spot:
x & ~0x01
We can combine the two clearing pieces:
x & ~0x09
And then:
(x & ~0x09) | ((x & 0x01) << 3) | ((x & 0x08) >> 3)
That of course handles only the lowest nibble. I'll leave the others as an exercise.
Try the code below. You need to know the bitwise operators and the correct positions to place the bits, as well as the basic properties of masking, shifting and toggling.
#include<stdio.h>
#define BITS_SWAP(x) x = (((x & 0x88888888) >> 3) | ((x & 0x11111111) << 3)) | (x & ~(0x88888888 | 0x11111111))
int main()
{
int data=0;
printf("enter the data in hex=0x");
scanf("%x",&data);
printf("bits=%x",BITS_SWAP(data));
return 0;
}
Output:
vinay#vinay-VirtualBox:~/c_skill$ ./a.out
enter the data in hex=0x1
bits=8
vinay#vinay-VirtualBox:~/c_skill$ ./a.out
enter the data in hex=0x812
bits=182
vinay#vinay-VirtualBox:~/c_skill$ ./a.out
enter the data in hex=0x12345678
bits=82a4c6e1
vinay#vinay-VirtualBox:~/c_skill$
Try this variant of the xor swap:
uint32_t switch_bits(uint32_t a){
static const uint32_t mask = 0x11111111;
a ^= (a & mask) << 3;
a ^= (a >> 3) & mask;
a ^= (a & mask) << 3;
return a;
}
Move the low bits to the high bits and mask out the resulting bits.
Move the high bits to the low bits and mask out the resulting bits.
Mask out all bits that have not been moved.
Combine the results with ORs.
Code:
unsigned SwitchBits(unsigned n) {
return ((n << 3) & 0x88888888) | ((n >> 3) & 0x11111111) | (n & 0x66666666);
}
Alternatively, if you would like to be very clever, it can be done with two fewer operations, though this may not actually be faster due to the dependencies between instructions.
Move the high bits to align with the low bits
XOR them, recording a 0 in the low bit if the high and low bits are the same, and a 1 if they are different.
From this, mask out only the low bit of each nibble.
From this, multiply by 9; this will keep the low bit as-is and also copy it to the high bit.
Finally, XOR with the original value. In the case that the high and low bits are the same, no change correctly occurs; in the case they are different, they are effectively exchanged.
Code:
unsigned SwitchBits(unsigned n) {
return ((((n >> 3) ^ n) & 0x11111111) * 0x9) ^ n;
}
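A quick test harness checking either variant against the examples from the question:

#include <assert.h>
#include <stdio.h>

int main(void) {
    assert(SwitchBits(0) == 0);
    assert(SwitchBits(8) == 1);
    assert(SwitchBits(0x812) == 0x182);
    assert(SwitchBits(0x12345678) == 0x82a4c6e1);
    puts("all tests passed");
    return 0;
}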