I'm lost on bit-shifting operations. I'm trying to reverse the byte order of 32-bit ints. From what I've managed to look up online I only got this far, but I can't work out why it's not working:
int32_t swapped = 0; // Assign num to the tmp
for(int i = 0; i < 32; i++)
{
swapped |= num & 1; // putting the set bits of num
swapped >>= 1; //shift the swapped Right side
num <<= 1; //shift the swapped left side
}
And I'm printing like this
num = swapped;
for (size_t i = 0; i < 32; i++)
{
printf("%d",(num >> i));
}
Your code looks like it's attempting to swap bits, not bytes. If you want to swap bytes, then the 'complete' method would be:
int32_t swapped = ((num >> 24) & 0x000000FF) |
((num >> 8) & 0x0000FF00) |
((num << 8) & 0x00FF0000) |
((num << 24) & 0xFF000000);
I say 'complete', because the last bitwise-and can be omitted, and the first bitwise-and can be omitted if num is unsigned.
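As a rough sketch of the unsigned case mentioned above: with a uint32_t argument the right shift by 24 fills with zeros and the left shift by 24 discards the upper bytes, so only the two middle masks remain. The function name bswap32 is just an illustrative choice, not something from the question.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit unsigned value.
   (bswap32 is a hypothetical helper name.) */
static uint32_t bswap32(uint32_t num) {
    return (num >> 24) |                  /* byte 3 -> byte 0, zero-filled   */
           ((num >> 8) & 0x0000FF00u) |   /* byte 2 -> byte 1                */
           ((num << 8) & 0x00FF0000u) |   /* byte 1 -> byte 2                */
           (num << 24);                   /* byte 0 -> byte 3, rest shifted out */
}
```

For instance, bswap32(0x12345678) gives 0x78563412, and applying it twice returns the original value.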
If you want to swap the bits in a 32bit number, your loop should probably max out at 16 (if it's 32, the first 16 steps will swap the bits, the next 16 steps will swap them back again).
int32_t swapped = 0;
for(int i = 0; i < 16; ++i)
{
// the masks for the two bits (hi and lo) we will be swapping
// shift a '1' to the correct bit location based on the index 'i'
uint32_t hi_mask = 1u << (31 - i); // 1u: shifting a signed 1 into bit 31 is undefined
uint32_t lo_mask = 1u << i;
// use bitwise and to mask out the original bits in the number
uint32_t hi_bit = num & hi_mask;
uint32_t lo_bit = num & lo_mask;
// shift the bits so they switch places
uint32_t new_lo_bit = hi_bit >> (31 - i);
uint32_t new_hi_bit = lo_bit << (31 - i);
// use bitwise-or to combine back into an int
swapped |= new_lo_bit;
swapped |= new_hi_bit;
}
Code written for readability - there are faster ways to reverse the bits in a 32bit number. As for printing:
for (size_t i = 0; i < 32; i++)
{
bool bit = (num >> (31 - i)) & 0x1;
printf(bit ? "1" : "0");
}
So I have a design which incorporates CRC32C checksums to ensure data hasn't been damaged. I decided to use CRC32C because I can have both a software version and a hardware-accelerated version if the computer the software runs on supports SSE 4.2
I'm going by Intel's developer manual (vol 2A), which seems to provide the algorithm behind the crc32 instruction. However, I'm having little luck. Intel's developer guide says the following:
BIT_REFLECT32: DEST[31-0] = SRC[0-31]
MOD2: Remainder from Polynomial division modulus 2
TEMP1[31-0] <- BIT_REFLECT(SRC[31-0])
TEMP2[31-0] <- BIT_REFLECT(DEST[31-0])
TEMP3[63-0] <- TEMP1[31-0] << 32
TEMP4[63-0] <- TEMP2[31-0] << 32
TEMP5[63-0] <- TEMP3[63-0] XOR TEMP4[63-0]
TEMP6[31-0] <- TEMP5[63-0] MOD2 0x11EDC6F41
DEST[31-0] <- BIT_REFLECT(TEMP6[31-0])
Now, as far as I can tell, I've done everything up to the line starting TEMP6 correctly, but I think I may be either misunderstanding the polynomial division, or implementing it incorrectly. If my understanding is correct, 1 / 1 mod 2 = 1, 0 / 1 mod 2 = 0, and both divides-by-zero are undefined.
What I don't understand is how binary division with 64-bit and 33-bit operands will work. If SRC is 0x00000000, and DEST is 0xFFFFFFFF, TEMP5[63-32] will be all set bits, while TEMP5[31-0] will be all unset bits.
If I was to use the bits from TEMP5 as the numerator, there would be 30 divisions by zero as the polynomial 11EDC6F41 is only 33 bits long (and so converting it to a 64-bit unsigned integer leaves the top 30 bits unset), and so the denominator is unset for 30 bits.
However, if I was to use the polynomial as the numerator, the bottom 32 bits of TEMP5 are unset, resulting in divides by zero there, and the top 30 bits of the result would be zero, as the top 30 bits of the numerator would be zero, as 0 / 1 mod 2 = 0.
Am I misunderstanding how this works? Just plain missing something? Or has Intel left out some crucial step in their documentation?
The reason I went to Intel's developer guide for what appeared to be the algorithm they used is that they used a 33-bit polynomial, and I wanted to make the outputs identical, which didn't happen when I used the 32-bit polynomial 1EDC6F41 (shown below).
uint32_t poly = 0x1EDC6F41, sres, crcTable[256], data = 0x00000000;
for (n = 0; n < 256; n++) {
sres = n;
for (k = 0; k < 8; k++)
sres = (sres & 1) == 1 ? poly ^ (sres >> 1) : (sres >> 1);
crcTable[n] = sres;
}
sres = 0xFFFFFFFF;
for (n = 0; n < 4; n++) {
sres = crcTable[(sres ^ data) & 0xFF] ^ (sres >> 8);
}
The above code produces 4138093821 as an output, and the crc32 opcode produces 2346497208 using the input 0x00000000.
Sorry if this is badly written or incomprehensible in places, it is rather late for me.
Here are both software and hardware versions of CRC-32C. The software version is optimized to process eight bytes at a time. The hardware version is optimized to run three crc32q instructions effectively in parallel on a single core, since the throughput of that instruction is one cycle, but the latency is three cycles.
crc32c.c:
/* crc32c.c -- compute CRC-32C using the Intel crc32 instruction
* Copyright (C) 2013, 2021 Mark Adler
* Version 1.2 5 Jun 2021 Mark Adler
*/
/*
This software is provided 'as-is', without any express or implied
warranty. In no event will the author be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Mark Adler
madler@alumni.caltech.edu
*/
/* Version History:
1.0 10 Feb 2013 First version
1.1 31 May 2021 Correct register constraints on assembly instructions
Include pre-computed tables to avoid use of pthreads
Return zero for the CRC when buf is NULL, as initial value
1.2 5 Jun 2021 Make tables constant
*/
// Use hardware CRC instruction on Intel SSE 4.2 processors. This computes a
// CRC-32C, *not* the CRC-32 used by Ethernet and zip, gzip, etc. A software
// version is provided as a fall-back, as well as for speed comparisons.
#include <stddef.h>
#include <stdint.h>
// Tables for CRC word-wise calculation, definitions of LONG and SHORT, and CRC
// shifts by LONG and SHORT bytes.
#include "crc32c.h"
// Table-driven software version as a fall-back. This is about 15 times slower
// than using the hardware instructions. This assumes little-endian integers,
// as is the case on Intel processors that the assembler code here is for.
static uint32_t crc32c_sw(uint32_t crc, void const *buf, size_t len) {
if (buf == NULL)
return 0;
unsigned char const *data = buf;
while (len && ((uintptr_t)data & 7) != 0) {
crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
len--;
}
size_t n = len >> 3;
for (size_t i = 0; i < n; i++) {
uint64_t word = crc ^ ((uint64_t const *)data)[i];
crc = crc32c_table[7][word & 0xff] ^
crc32c_table[6][(word >> 8) & 0xff] ^
crc32c_table[5][(word >> 16) & 0xff] ^
crc32c_table[4][(word >> 24) & 0xff] ^
crc32c_table[3][(word >> 32) & 0xff] ^
crc32c_table[2][(word >> 40) & 0xff] ^
crc32c_table[1][(word >> 48) & 0xff] ^
crc32c_table[0][word >> 56];
}
data += n << 3;
len &= 7;
while (len) {
len--;
crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
}
return crc;
}
// Apply the zeros operator table to crc.
static uint32_t crc32c_shift(uint32_t const zeros[][256], uint32_t crc) {
return zeros[0][crc & 0xff] ^ zeros[1][(crc >> 8) & 0xff] ^
zeros[2][(crc >> 16) & 0xff] ^ zeros[3][crc >> 24];
}
// Compute CRC-32C using the Intel hardware instruction. Three crc32q
// instructions are run in parallel on a single core. This gives a
// factor-of-three speedup over a single crc32q instruction, since the
// throughput of that instruction is one cycle, but the latency is three
// cycles.
static uint32_t crc32c_hw(uint32_t crc, void const *buf, size_t len) {
if (buf == NULL)
return 0;
// Pre-process the crc.
uint64_t crc0 = crc ^ 0xffffffff;
// Compute the crc for up to seven leading bytes, bringing the data pointer
// to an eight-byte boundary.
unsigned char const *next = buf;
while (len && ((uintptr_t)next & 7) != 0) {
__asm__("crc32b\t" "(%1), %0"
: "+r"(crc0)
: "r"(next), "m"(*next));
next++;
len--;
}
// Compute the crc on sets of LONG*3 bytes, making use of three ALUs in
// parallel on a single core.
while (len >= LONG*3) {
uint64_t crc1 = 0;
uint64_t crc2 = 0;
unsigned char const *end = next + LONG;
do {
__asm__("crc32q\t" "(%3), %0\n\t"
"crc32q\t" LONGx1 "(%3), %1\n\t"
"crc32q\t" LONGx2 "(%3), %2"
: "+r"(crc0), "+r"(crc1), "+r"(crc2)
: "r"(next), "m"(*next));
next += 8;
} while (next < end);
crc0 = crc32c_shift(crc32c_long, crc0) ^ crc1;
crc0 = crc32c_shift(crc32c_long, crc0) ^ crc2;
next += LONG*2;
len -= LONG*3;
}
// Do the same thing, but now on SHORT*3 blocks for the remaining data less
// than a LONG*3 block.
while (len >= SHORT*3) {
uint64_t crc1 = 0;
uint64_t crc2 = 0;
unsigned char const *end = next + SHORT;
do {
__asm__("crc32q\t" "(%3), %0\n\t"
"crc32q\t" SHORTx1 "(%3), %1\n\t"
"crc32q\t" SHORTx2 "(%3), %2"
: "+r"(crc0), "+r"(crc1), "+r"(crc2)
: "r"(next), "m"(*next));
next += 8;
} while (next < end);
crc0 = crc32c_shift(crc32c_short, crc0) ^ crc1;
crc0 = crc32c_shift(crc32c_short, crc0) ^ crc2;
next += SHORT*2;
len -= SHORT*3;
}
// Compute the crc on the remaining eight-byte units less than a SHORT*3
// block.
unsigned char const *end = next + (len - (len & 7));
while (next < end) {
__asm__("crc32q\t" "(%1), %0"
: "+r"(crc0)
: "r"(next), "m"(*next));
next += 8;
}
len &= 7;
// Compute the crc for up to seven trailing bytes.
while (len) {
__asm__("crc32b\t" "(%1), %0"
: "+r"(crc0)
: "r"(next), "m"(*next));
next++;
len--;
}
// Return the crc, post-processed.
return ~(uint32_t)crc0;
}
// Check for SSE 4.2. SSE 4.2 was first supported in Nehalem processors
// introduced in November, 2008. This does not check for the existence of the
// cpuid instruction itself, which was introduced on the 486SL in 1992, so this
// will fail on earlier x86 processors. cpuid works on all Pentium and later
// processors.
#define SSE42(have) \
do { \
uint32_t eax, ecx; \
eax = 1; \
__asm__("cpuid" \
: "=c"(ecx) \
: "a"(eax) \
: "%ebx", "%edx"); \
(have) = (ecx >> 20) & 1; \
} while (0)
// Compute a CRC-32C. If the crc32 instruction is available, use the hardware
// version. Otherwise, use the software version.
uint32_t crc32c(uint32_t crc, void const *buf, size_t len) {
int sse42;
SSE42(sse42);
return sse42 ? crc32c_hw(crc, buf, len) : crc32c_sw(crc, buf, len);
}
Code to generate crc32c.h (stackoverflow won't let me post the tables themselves, due to a 30,000 character limit in an answer):
// Generate crc32c.h for crc32c.c.
#include <stdio.h>
#include <stdint.h>
#define LONG 8192
#define SHORT 256
// Print a 2-D table of four-byte constants in hex.
static void print_table(uint32_t *tab, size_t rows, size_t cols, char *name) {
printf("static uint32_t const %s[][%zu] = {\n", name, cols);
size_t end = rows * cols;
size_t k = 0;
for (;;) {
fputs(" {", stdout);
size_t n = 0, j = 0;
for (;;) {
printf("0x%08x", tab[k + n]);
if (++n == cols)
break;
putchar(',');
if (++j == 6) {
fputs("\n ", stdout);
j = 0;
}
putchar(' ');
}
k += cols;
if (k == end)
break;
puts("},");
}
puts("}\n};");
}
/* CRC-32C (iSCSI) polynomial in reversed bit order. */
#define POLY 0x82f63b78
static void crc32c_word_table(void) {
uint32_t table[8][256];
// Generate byte-wise table.
for (unsigned n = 0; n < 256; n++) {
uint32_t crc = ~n;
for (unsigned k = 0; k < 8; k++)
crc = crc & 1 ? (crc >> 1) ^ POLY : crc >> 1;
table[0][n] = ~crc;
}
// Use byte-wise table to generate word-wise table.
for (unsigned n = 0; n < 256; n++) {
uint32_t crc = ~table[0][n];
for (unsigned k = 1; k < 8; k++) {
crc = table[0][crc & 0xff] ^ (crc >> 8);
table[k][n] = ~crc;
}
}
// Print table.
print_table(table[0], 8, 256, "crc32c_table");
}
// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial. For speed, this requires that a not be zero.
static uint32_t multmodp(uint32_t a, uint32_t b) {
uint32_t prod = 0;
for (;;) {
if (a & 0x80000000) {
prod ^= b;
if ((a & 0x7fffffff) == 0)
break;
}
a <<= 1;
b = b & 1 ? (b >> 1) ^ POLY : b >> 1;
}
return prod;
}
/* Take a length and build four lookup tables for applying the zeros operator
for that length, byte-by-byte, on the operand. */
static void crc32c_zero_table(size_t len, char *name) {
// Generate operator for len zeros.
uint32_t op = 0x80000000; // 1 (x^0)
uint32_t sq = op >> 4; // x^4
while (len) {
sq = multmodp(sq, sq); // x^2^(k+3), k == len bit position
if (len & 1)
op = multmodp(sq, op);
len >>= 1;
}
// Generate table to update each byte of a CRC using op.
uint32_t table[4][256];
for (unsigned n = 0; n < 256; n++) {
table[0][n] = multmodp(op, n);
table[1][n] = multmodp(op, n << 8);
table[2][n] = multmodp(op, n << 16);
table[3][n] = multmodp(op, n << 24);
}
// Print the table to stdout.
print_table(table[0], 4, 256, name);
}
int main(void) {
puts(
"// crc32c.h\n"
"// Tables and constants for crc32c.c software and hardware calculations.\n"
"\n"
"// Table for a 64-bits-at-a-time software CRC-32C calculation. This table\n"
"// has built into it the pre and post bit inversion of the CRC."
);
crc32c_word_table();
puts(
"\n// Block sizes for three-way parallel crc computation. LONG and SHORT\n"
"// must both be powers of two. The associated string constants must be set\n"
"// accordingly, for use in constructing the assembler instructions."
);
printf("#define LONG %d\n", LONG);
printf("#define LONGx1 \"%d\"\n", LONG);
printf("#define LONGx2 \"%d\"\n", 2 * LONG);
printf("#define SHORT %d\n", SHORT);
printf("#define SHORTx1 \"%d\"\n", SHORT);
printf("#define SHORTx2 \"%d\"\n", 2 * SHORT);
puts(
"\n// Table to shift a CRC-32C by LONG bytes."
);
crc32c_zero_table(8192, "crc32c_long");
puts(
"\n// Table to shift a CRC-32C by SHORT bytes."
);
crc32c_zero_table(256, "crc32c_short");
return 0;
}
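For anyone still puzzling over the MOD2 line in the pseudo-code quoted in the question, here is a minimal sketch of what it means. It is not division bit by bit with "divide by zero" cases: it is long division where subtraction is XOR, so wherever the dividend has a leading 1 the 33-bit polynomial is lined up under it and XORed in. The function names below are hypothetical, and crc32_step_u32 is only my reading of the manual's pseudo-code for a 32-bit source. The second function is the usual reflected, bit-at-a-time form with the same crc-in/crc-out convention as crc32c() above (pass 0 to start).

```c
#include <stddef.h>
#include <stdint.h>

/* BIT_REFLECT32 from the pseudo-code: bit i of the input becomes bit 31-i. */
static uint32_t bit_reflect32(uint32_t v) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        r |= ((v >> i) & 1u) << (31 - i);
    return r;
}

/* MOD2: remainder of XOR-based long division by the 33-bit polynomial. */
static uint32_t mod2(uint64_t dividend) {
    const uint64_t poly = 0x11EDC6F41ull;
    for (int bit = 63; bit >= 32; bit--)
        if (dividend & (1ull << bit))
            dividend ^= poly << (bit - 32);  /* cancel the leading 1 */
    return (uint32_t)dividend;               /* bits 63..32 are now zero */
}

/* One crc32 step on a 32-bit source, following the pseudo-code line by line.
   No pre- or post-inversion is applied here. */
static uint32_t crc32_step_u32(uint32_t dest, uint32_t src) {
    uint64_t temp3 = (uint64_t)bit_reflect32(src) << 32;
    uint64_t temp4 = (uint64_t)bit_reflect32(dest) << 32;
    return bit_reflect32(mod2(temp3 ^ temp4));
}

/* Reflected bit-at-a-time CRC-32C; 0x82F63B78 is 0x1EDC6F41 bit-reversed. */
static uint32_t crc32c_bitwise(uint32_t crc, const void *buf, size_t len) {
    const unsigned char *p = buf;
    crc = ~crc;                              /* pre-process */
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    }
    return ~crc;                             /* post-process */
}
```

The standard check value for CRC-32C is 0xE3069283 for the nine ASCII bytes "123456789", which crc32c_bitwise(0, "123456789", 9) should reproduce. One 32-bit step of the literal version corresponds to feeding the same four bytes, in little-endian order, to the reflected version.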
Mark Adler's answer is correct and complete, but those seeking a quick and easy way to integrate CRC-32C in their application might find it a little difficult to adapt the code, especially if they are using Windows and .NET.
I've created a library that implements CRC-32C using either the hardware or the software method, depending on the available hardware. It's available as a NuGet package for C++ and .NET. It's open source, of course.
Besides packaging Mark Adler's code above, I've found a simple way to improve the throughput of the software fallback by 50%. On my computer, the library now achieves 2 GB/s in software and over 20 GB/s in hardware. For those curious, here's the optimized software implementation:
static uint32_t append_table(uint32_t crci, buffer input, size_t length)
{
buffer next = input;
#ifdef _M_X64
uint64_t crc;
#else
uint32_t crc;
#endif
crc = crci ^ 0xffffffff;
#ifdef _M_X64
while (length && ((uintptr_t)next & 7) != 0)
{
crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
--length;
}
while (length >= 16)
{
crc ^= *(uint64_t *)next;
uint64_t high = *(uint64_t *)(next + 8);
crc = table[15][crc & 0xff]
^ table[14][(crc >> 8) & 0xff]
^ table[13][(crc >> 16) & 0xff]
^ table[12][(crc >> 24) & 0xff]
^ table[11][(crc >> 32) & 0xff]
^ table[10][(crc >> 40) & 0xff]
^ table[9][(crc >> 48) & 0xff]
^ table[8][crc >> 56]
^ table[7][high & 0xff]
^ table[6][(high >> 8) & 0xff]
^ table[5][(high >> 16) & 0xff]
^ table[4][(high >> 24) & 0xff]
^ table[3][(high >> 32) & 0xff]
^ table[2][(high >> 40) & 0xff]
^ table[1][(high >> 48) & 0xff]
^ table[0][high >> 56];
next += 16;
length -= 16;
}
#else
while (length && ((uintptr_t)next & 3) != 0)
{
crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
--length;
}
while (length >= 12)
{
crc ^= *(uint32_t *)next;
uint32_t high = *(uint32_t *)(next + 4);
uint32_t high2 = *(uint32_t *)(next + 8);
crc = table[11][crc & 0xff]
^ table[10][(crc >> 8) & 0xff]
^ table[9][(crc >> 16) & 0xff]
^ table[8][crc >> 24]
^ table[7][high & 0xff]
^ table[6][(high >> 8) & 0xff]
^ table[5][(high >> 16) & 0xff]
^ table[4][high >> 24]
^ table[3][high2 & 0xff]
^ table[2][(high2 >> 8) & 0xff]
^ table[1][(high2 >> 16) & 0xff]
^ table[0][high2 >> 24];
next += 12;
length -= 12;
}
#endif
while (length)
{
crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
--length;
}
return (uint32_t)crc ^ 0xffffffff;
}
As you can see, it merely crunches a larger block at a time. It needs a larger lookup table, but it's still cache-friendly. The table is generated the same way, only with more rows.
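A minimal sketch of how such a 16-row table could be built, assuming (as append_table suggests, since it does the 0xffffffff XOR itself) that the entries carry no built-in inversion. The names ROWS, build_table and crc32c_bytewise are illustrative, not taken from the library.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY 0x82f63b78u   /* CRC-32C polynomial, reversed bit order */
#define ROWS 16            /* 16 for the 64-bit path, 12 would do for 32-bit */

static uint32_t table[ROWS][256];

/* Row 0 is the ordinary byte-wise table; row k is row k-1 advanced by
   one more zero byte. */
static void build_table(void) {
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = n;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ POLY : crc >> 1;
        table[0][n] = crc;
    }
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = table[0][n];
        for (int k = 1; k < ROWS; k++) {
            crc = table[0][crc & 0xff] ^ (crc >> 8);
            table[k][n] = crc;
        }
    }
}

/* Byte-at-a-time reference using row 0 only, handy for checking the table. */
static uint32_t crc32c_bytewise(uint32_t crci, const void *buf, size_t len) {
    const unsigned char *p = buf;
    uint32_t crc = crci ^ 0xffffffffu;
    while (len--)
        crc = table[0][(crc ^ *p++) & 0xff] ^ (crc >> 8);
    return crc ^ 0xffffffffu;
}
```

Call build_table() once before use. The byte-wise helper gives a slow reference to compare the 16-byte loop against.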
One extra thing I explored is the use of the PCLMULQDQ instruction to get hardware acceleration on AMD processors. I've managed to port Intel's CRC patch for zlib (also available on GitHub) to the CRC-32C polynomial, except for the magic constant 0x9db42487. If anyone is able to decipher that one, please let me know. After supersaw7's excellent explanation on reddit, I have also ported the elusive 0x9db42487 constant, and I just need to find some time to polish and test it.
First of all, Intel's CRC32 instruction computes CRC-32C, which uses a different polynomial from the regular CRC32 (see the Wikipedia CRC32 entry).
To use Intel's hardware acceleration for CRC32C with gcc, you can:
Inline assembly language in C code via the asm statement
Use the intrinsics _mm_crc32_u8, _mm_crc32_u16, _mm_crc32_u32 or _mm_crc32_u64. See the Intel Intrinsics Guide for a description of these; it is written for Intel's compiler icc, but gcc also implements them.
This is how you would do it with _mm_crc32_u8, which takes one byte at a time; using _mm_crc32_u64 would improve performance further since it takes 8 bytes at a time.
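#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h> /* SSE4.2 intrinsics, including _mm_crc32_u8 */
uint32_t sse42_crc32(const uint8_t *bytes, size_t len)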
{
uint32_t hash = 0;
size_t i = 0;
for (i=0;i<len;i++) {
hash = _mm_crc32_u8(hash, bytes[i]);
}
return hash;
}
To compile this you need to pass -msse4.2 in CFLAGS, as in gcc -g -msse4.2 test.c; otherwise it will complain about an undefined reference to _mm_crc32_u8.
If you want to fall back to a plain C implementation when the instruction is not available on the platform where the executable is running, you can use GCC's ifunc attribute, like this:
uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
/* use _mm_crc32_u* here */
}
uint32_t default_crc32(const uint8_t *bytes, size_t len)
{
/* pure C implementation */
}
/* this will be called at load time to decide which function really use */
/* sse42_crc32 if SSE 4.2 is supported */
/* default_crc32 if not */
static void * resolve_crc32(void) {
__builtin_cpu_init();
if (__builtin_cpu_supports("sse4.2")) return sse42_crc32;
return default_crc32;
}
/* crc32() implementation will be resolved at load time to either */
/* sse42_crc32() or default_crc32() */
uint32_t crc32(const uint8_t *bytes, size_t len) __attribute__ ((ifunc ("resolve_crc32")));
I compare various algorithms here: https://github.com/htot/crc32c
The fastest algorithm has been taken from Intel's crc_iscsi_v_pcl.asm assembly code (which is available in a modified form in the Linux kernel), using a C wrapper (crcintelasm.cc) included in this project.
To be able to run this code on 32-bit platforms, it has first been ported to C (crc32intelc) where possible; a small amount of inline assembly is still required. Certain parts of the code depend on the bitness: crc32q is not available on 32 bits, and neither is movq. These are put in macros (crc32intel.h) with alternative code for 32-bit platforms.
Suppose I have two 4-bit values, ABCD and abcd. How can I interleave them so the result is AaBbCcDd, using bitwise operators? Example in pseudo-C:
nibble a = 0b1001;
nibble b = 0b1100;
char c = foo(a,b);
print_bits(c);
// output: 0b11010010
Note: 4 bits is just for illustration, I want to do this with two 32bit ints.
This is called the perfect shuffle operation, and it's discussed at length in the Bible Of Bit Bashing, Hacker's Delight by Henry Warren, section 7-2 "Shuffling Bits."
Assuming x is a 32-bit integer with a in its high-order 16 bits and b in its low-order 16 bits:
unsigned int x = (a << 16) | b; /* put a and b in place */
the following straightforward C-like code accomplishes the perfect shuffle:
x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;
He also gives an alternative form which is faster on some CPUs, and (I think) a little more clear and extensible:
unsigned int t; /* an intermediate, temporary variable */
t = (x ^ (x >> 8)) & 0x0000FF00; x = x ^ t ^ (t << 8);
t = (x ^ (x >> 4)) & 0x00F000F0; x = x ^ t ^ (t << 4);
t = (x ^ (x >> 2)) & 0x0C0C0C0C; x = x ^ t ^ (t << 2);
t = (x ^ (x >> 1)) & 0x22222222; x = x ^ t ^ (t << 1);
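To tie this back to the question's example, here is a small sketch that wraps the second form in a function taking two 16-bit halves. The name interleave16 is my own. Each line swaps the two middle quarters of progressively smaller blocks (bytes, nibbles, bit pairs, single bits).

```c
#include <stdint.h>

/* Interleave the bits of a and b so that a's bits land in the odd
   positions and b's in the even ones (AaBbCcDd...).
   (interleave16 is a hypothetical helper name.) */
static uint32_t interleave16(uint16_t a, uint16_t b) {
    uint32_t x = ((uint32_t)a << 16) | b;   /* put a and b in place */
    uint32_t t;
    t = (x ^ (x >> 8)) & 0x0000FF00u; x = x ^ t ^ (t << 8);
    t = (x ^ (x >> 4)) & 0x00F000F0u; x = x ^ t ^ (t << 4);
    t = (x ^ (x >> 2)) & 0x0C0C0C0Cu; x = x ^ t ^ (t << 2);
    t = (x ^ (x >> 1)) & 0x22222222u; x = x ^ t ^ (t << 1);
    return x;
}
```

With the 4-bit values from the question, interleave16(0x9, 0xC) yields 0xD2, which is 0b11010010.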
I see you have edited your question to ask for a 64-bit result from two 32-bit inputs. I'd have to think about how to extend Warren's technique. I think it wouldn't be too hard, but I'd have to give it some thought. If someone else wanted to start here and give a 64-bit version, I'd be happy to upvote them.
EDITED FOR 64 BITS
I extended the second solution to 64 bits in a straightforward way. First I doubled the length of each of the constants. Then I added a line at the beginning to swap adjacent double-bytes and intermix them. In the following 4 lines, which are pretty much the same as the 32-bit version, the first line swaps adjacent bytes and intermixes, the second line drops down to nibbles, the third line to double-bits, and the last line to single bits.
unsigned long long int t; /* an intermediate, temporary variable */
t = (x ^ (x >> 16)) & 0x00000000FFFF0000ull; x = x ^ t ^ (t << 16);
t = (x ^ (x >> 8)) & 0x0000FF000000FF00ull; x = x ^ t ^ (t << 8);
t = (x ^ (x >> 4)) & 0x00F000F000F000F0ull; x = x ^ t ^ (t << 4);
t = (x ^ (x >> 2)) & 0x0C0C0C0C0C0C0C0Cull; x = x ^ t ^ (t << 2);
t = (x ^ (x >> 1)) & 0x2222222222222222ull; x = x ^ t ^ (t << 1);
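The snippet leaves out how x is formed from the two 32-bit inputs. A sketch of the whole thing as a function, assuming a goes in the high half as in the 32-bit version (interleave32 is an illustrative name):

```c
#include <stdint.h>

/* 64-bit perfect shuffle of two 32-bit inputs: a's bits end up in the odd
   bit positions of the result, b's in the even ones.
   (interleave32 is a hypothetical helper name.) */
static uint64_t interleave32(uint32_t a, uint32_t b) {
    uint64_t x = ((uint64_t)a << 32) | b;   /* widen before shifting */
    uint64_t t;
    t = (x ^ (x >> 16)) & 0x00000000FFFF0000ull; x = x ^ t ^ (t << 16);
    t = (x ^ (x >> 8))  & 0x0000FF000000FF00ull; x = x ^ t ^ (t << 8);
    t = (x ^ (x >> 4))  & 0x00F000F000F000F0ull; x = x ^ t ^ (t << 4);
    t = (x ^ (x >> 2))  & 0x0C0C0C0C0C0C0C0Cull; x = x ^ t ^ (t << 2);
    t = (x ^ (x >> 1))  & 0x2222222222222222ull; x = x ^ t ^ (t << 1);
    return x;
}
```

The cast before the shift by 32 matters: shifting a 32-bit value by its full width is undefined in C.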
From Stanford "Bit Twiddling Hacks" page:
https://graphics.stanford.edu/~seander/bithacks.html#InterleaveTableObvious
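uint32_t x = /*...*/, y = /*...*/; // x fills the even bit positions, y the odd ones
uint64_t z = 0;
for (int i = 0; i < (int)(sizeof(x) * CHAR_BIT); i++) // unroll for more speed...
{
// widen to 64 bits before shifting, otherwise the upper bits are lost
z |= (uint64_t)(x & (1U << i)) << i | (uint64_t)(y & (1U << i)) << (i + 1);
}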
Take a look at the page; it proposes different and faster algorithms to achieve the same result.
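One of the faster approaches on that page is the "binary magic numbers" method, shown there for 16-bit inputs. A sketch of it widened to 32-bit inputs might look like the following. The widening and the names spread_bits and interleave_magic are my own assumptions, not code from the page.

```c
#include <stdint.h>

/* Spread the 32 bits of v out so that bit i moves to bit 2*i and the
   odd positions are zero. */
static uint64_t spread_bits(uint32_t v) {
    uint64_t x = v;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFull;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFull;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x << 2))  & 0x3333333333333333ull;
    x = (x | (x << 1))  & 0x5555555555555555ull;
    return x;
}

/* x goes to the even bit positions and y to the odd ones, matching the
   loop version above. */
static uint64_t interleave_magic(uint32_t x, uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

To get the AaBbCcDd order from the question, pass the lowercase value first: interleave_magic(0xC, 0x9) gives 0xD2.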
Like so:
#include <limits.h>
typedef unsigned int half;
typedef unsigned long long full;
full mix_bits(half a,half b)
{
full result = 0;
for (int i=0; i<sizeof(half)*CHAR_BIT; i++)
// cast to full before shifting: shifting a half by up to 2*i+1 bits would overflow it
result |= ((full)((a>>i)&1)<<(2*i+1))|((full)((b>>i)&1)<<(2*i+0));
return result;
}
Here is a loop-based solution that is hopefully more readable than some of the others already here.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
uint64_t interleave(uint32_t a, uint32_t b) {
uint64_t result = 0;
int i;
for (i = 0; i < 31; i++) {
result |= (a >> (31 - i)) & 1;
result <<= 1;
result |= (b >> (31 - i)) & 1;
result <<= 1;
}
// Skip the last left shift.
result |= (a >> (31 - i)) & 1;
result <<= 1;
result |= (b >> (31 - i)) & 1;
return result;
}
void printBits(uint64_t a) {
int i;
for (i = 0; i < 64; i++)
printf("%d", (int)((a >> (63 - i)) & 1)); // cast: %lu is not portable for uint64_t
puts("");
}
int main(){
uint32_t a = 0x9;
uint32_t b = 0x6;
uint64_t c = interleave(a,b);
printBits(a);
printBits(b);
printBits(c);
}
I have used the two operations from the post "How do you set, clear, and toggle a single bit?": setting a bit at a particular index and checking the bit at a particular index.
The following code is implemented using these two operations only.
unsigned int a = 0x9; // 0b1001
unsigned int b = 0xC; // 0b1100
unsigned long long c = 0; // needs 64 bits to hold both 32-bit inputs
int index; // index of the bit to set in c
int i;
// Set bits in c, starting from the highest index.
for(i=31;i>=0;i--)
{
index=2*i+1; // the i-th bit of a goes to this index in c
// Check a
if(a&(1u<<i)) // is the i-th bit set in a?
c|=1ULL<<index; // set the bit in c at index
index--; // the i-th bit of b goes one position lower
// Check b
if(b&(1u<<i)) // is the i-th bit set in b?
c|=1ULL<<index; // set the bit in c at index
}
printf("%llu",c);
Output: 210 which is 0b11010010
Like this:
input: 10010011
(10->01->00->11)
output: 11000110
(11->00->01->10)
input: 11010001
(11->01->00->01)
output: 01000111
(01->00->01->11)
Does anyone have any ideas about that?
Fewer operations than lserni's algorithm:
uint32_t reverseByTwo(uint32_t value) {
value = ((value & 0x33333333) << 2) | ((value >> 2) & 0x33333333); // swap adjacent pairs
value = ((value & 0x0F0F0F0F) << 4) | ((value >> 4) & 0x0F0F0F0F); // swap nibbles
value = ((value & 0x00FF00FF) << 8) | ((value >> 8) & 0x00FF00FF); // swap bytes
value = ((value & 0x0000FFFF) << 16) | ((value >> 16) & 0x0000FFFF);
return value;
}
For 64-bit values, just add another swap for the 32-bit halves; for smaller types, leave out the last few swaps.
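As a sketch of those two variations (the function names are made up for illustration): the 8-bit version keeps only the pair and nibble swaps, and the 64-bit version appends a swap of the 32-bit halves. Note that the pair-swap mask has to cover both nibbles of every byte, i.e. 0x33 repeated.

```c
#include <stdint.h>

/* Reverse the order of the four 2-bit pairs in a byte. */
static uint8_t reverseByTwo8(uint8_t v) {
    v = (uint8_t)(((v & 0x33u) << 2) | ((v >> 2) & 0x33u)); /* swap adjacent pairs */
    v = (uint8_t)(((v & 0x0Fu) << 4) | ((v >> 4) & 0x0Fu)); /* swap nibbles */
    return v;
}

/* Reverse the order of the thirty-two 2-bit pairs in a 64-bit value. */
static uint64_t reverseByTwo64(uint64_t v) {
    v = ((v & 0x3333333333333333ull) << 2)  | ((v >> 2)  & 0x3333333333333333ull);
    v = ((v & 0x0F0F0F0F0F0F0F0Full) << 4)  | ((v >> 4)  & 0x0F0F0F0F0F0F0F0Full);
    v = ((v & 0x00FF00FF00FF00FFull) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFull);
    v = ((v & 0x0000FFFF0000FFFFull) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFull);
    v = (v << 32) | (v >> 32);              /* swap the 32-bit halves */
    return v;
}
```

With the examples from the question, reverseByTwo8(0x93) gives 0xC6 (10010011 -> 11000110) and reverseByTwo8(0xD1) gives 0x47 (11010001 -> 01000111).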
Weird request. I'd do it like this:
uint32_t reverseByTwo(uint32_t value)
{
int i;
uint32_t new_value = 0;
for (i = 0; i < 16; i++)
{
new_value <<= 2;
new_value |= (value & 0x3);
value >>= 2;
}
return new_value;
}
At each iteration, the two LSB of value are placed in the two LSB of new_value, which is shifted to the left.
For an eight-bit value,
uint8_t reverseByTwo(uint8_t value)
{
int i;
uint32_t new_value = 0;
for (i = 0; i < 4; i++)
{
new_value <<= 2;
new_value |= (value & 0x3);
value >>= 2;
}
return new_value;
}
If performance is at a premium, you can manually unroll the loop (GCC should do this by itself, but sometimes doesn't bother) and declare the function as inline.
new_value = 0;
// new_value <<= 2; // First time not necessary
new_value |= (value & 0x3);
value >>= 2;
new_value <<= 2;
new_value |= (value & 0x3);
value >>= 2;
new_value <<= 2;
new_value |= (value & 0x3);
value >>= 2;
new_value <<= 2;
new_value |= (value & 0x3);
// value >>= 2;
return new_value;
The fastest possible way to transform the bits in a single byte (char) into another single byte is to build yourself an array:
unsigned char rev[256];
rev[0] = 0; /* 00000000 -> 00000000 */
...
rev[147] = 198; /* 10010011 -> 11000110 */
...
rev[198] = 147; /* 11000110 -> 10010011 */
...
rev[255] = 255; /* 11111111 -> 11111111 */
To convert a number x to its bit-reversed form, just write rev[x]. If you have multiple bytes to convert, such as in a 4-byte int, just look up the 4 bytes in the rev table.
You'll need to convert binary to another base (here, I use decimal) when writing this code, because standard C before C23 doesn't have binary constants (which would be ten times more useful than octal constants).
You could also put the values into the initializer, but you'll have to count positions to make sure everything is in the right place. (Maybe write a little program to do it!)
unsigned char rev[256] = {0, ..., 198, ..., 147, ..., 255};
Fill in the ... with all the other numbers in the right places.
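The "little program" could be as small as the sketch below. It fills the table at run time and can also print it as an initializer to paste into source. The function names are hypothetical, and the pair-reversal formula is the shift-and-mask one, used here only to generate the entries.

```c
#include <stdio.h>

static unsigned char rev[256];

/* Fill rev[] so that rev[x] is x with its four 2-bit pairs in reverse order. */
static void build_rev(void) {
    for (unsigned x = 0; x < 256; x++)
        rev[x] = (unsigned char)(((x & 0x03u) << 6) | ((x & 0x0Cu) << 2) |
                                 ((x & 0x30u) >> 2) | ((x & 0xC0u) >> 6));
}

/* Print the filled table as a C initializer, eight values per line. */
static void print_rev_initializer(void) {
    printf("unsigned char rev[256] = {\n");
    for (unsigned x = 0; x < 256; x++)
        printf("%3u,%s", (unsigned)rev[x], (x % 8 == 7) ? "\n" : " ");
    printf("};\n");
}
```

Call build_rev() first and then print_rev_initializer(). The trailing comma after the last value is allowed in a C initializer.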
$x = (($x & 0x33333333) << 2) | (($x & 0xCCCCCCCC) >> 2);
$x = (($x & 0x0F0F0F0F) << 4) | (($x & 0xF0F0F0F0) >> 4);
$x = (($x & 0x00FF00FF) << 8) | (($x & 0xFF00FF00) >> 8);
$x = (($x & 0x0000FFFF) << 16) | (($x & 0xFFFF0000) >> 16);
To give credit where due, this algorithm came from "Hacker's Delight" by Henry S. Warren, Jr. The only difference is that the algorithm in the book didn't reverse by pairs; it just reversed the bits.
#include <limits.h>
#define INT_BITS (sizeof(int) * CHAR_BIT)
unsigned reverse_bits(unsigned x) {
#define PAIR(i) ((((x) >> (i*2)) & 3) << (INT_BITS - (i+1)*2))
unsigned result = 0;
unsigned i;
for(i = 0; i < INT_BITS/2; i++) {
result |= PAIR(i);
}
return result;
}
Though this will do it for an unsigned int, whereas you might want to replace int and unsigned there with char and unsigned char.
If you want the fastest possible:
output =
((input & 0xc0) >> 6) |
((input & 0x30) >> 2) |
((input & 0xc) << 2) |
((input & 0x3) << 6);
You have bits (ab cd ef gh) and want (gh ef cd ab).
If you multiply by 0x101 and store the result in a 16-bit int, you get (ab cd ef gh ab cd ef gh).
The bit pairs you need then appear in that number in two groups:
(00 00 ef 00 ab 00 00 00),
(00 00 00 gh 00 cd 00 00)
So you just have to shift and mask appropriately:
unsigned char swap_bit_pairs(unsigned char b)
{
unsigned int a = 0x101*b;
return ((a >> 6) & 0x33) | ((a >> 2) & 0xCC);
}
So it's possible in 6 operations
EDIT: oups! I wrote 0x66 instead of 0xCC
How do I reverse the bits using bitwise operators in C?
Eg:
i/p: 10010101
o/p: 10101001
If it's just 8 bits:
u_char in = 0x95;
u_char out = 0;
for (int i = 0; i < 8; ++i) {
out <<= 1;
out |= (in & 0x01);
in >>= 1;
}
Or for bonus points:
u_char in = 0x95;
u_char out = in;
out = (out & 0xaa) >> 1 | (out & 0x55) << 1;
out = (out & 0xcc) >> 2 | (out & 0x33) << 2;
out = (out & 0xf0) >> 4 | (out & 0x0f) << 4;
figuring out how the last one works is an exercise for the reader ;-)
Knuth has a section on bit reversal in The Art of Computer Programming, Vol. 4A, under bitwise tricks and techniques.
To reverse the bits of a 32-bit number in divide-and-conquer fashion, he uses magic constants (shown here 16 bits wide; the 32-bit constants repeat the same patterns):
u0 = 0101010101010101, from -1/(2+1)
u1 = 0011001100110011, from -1/(4+1)
u2 = 0000111100001111, from -1/(16+1)
u3 = 0000000011111111, from -1/(256+1)
Method credited to Henry Warren Jr., Hacker's Delight.
unsigned int u0 = 0x55555555;
x = (((x >> 1) & u0) | ((x & u0) << 1));
unsigned int u1 = 0x33333333;
x = (((x >> 2) & u1) | ((x & u1) << 2));
unsigned int u2 = 0x0f0f0f0f;
x = (((x >> 4) & u2) | ((x & u2) << 4));
unsigned int u3 = 0x00ff00ff;
x = (((x >> 8) & u3) | ((x & u3) << 8));
x = (x >> 16) | (x << 16); // reversed; x must be a 32-bit unsigned type so the left shift wraps modulo 2^32
The 16 and 8 bit cases are left as an exercise to the reader.
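For readers who would rather not do the exercise, one possible sketch of the 16-bit and 8-bit cases follows, using the same constants truncated to the narrower width. The function names are illustrative.

```c
#include <stdint.h>

/* Reverse all 16 bits: swap neighbours, then pairs, nibbles, bytes. */
static uint16_t reverse16(uint16_t x) {
    x = (uint16_t)(((x >> 1) & 0x5555u) | ((x & 0x5555u) << 1));
    x = (uint16_t)(((x >> 2) & 0x3333u) | ((x & 0x3333u) << 2));
    x = (uint16_t)(((x >> 4) & 0x0F0Fu) | ((x & 0x0F0Fu) << 4));
    x = (uint16_t)((x >> 8) | (x << 8));
    return x;
}

/* Reverse all 8 bits: swap neighbours, then pairs, then nibbles. */
static uint8_t reverse8(uint8_t x) {
    x = (uint8_t)(((x >> 1) & 0x55u) | ((x & 0x55u) << 1));
    x = (uint8_t)(((x >> 2) & 0x33u) | ((x & 0x33u) << 2));
    x = (uint8_t)((x >> 4) | (x << 4));
    return x;
}
```

The casts are there because the narrow operands are promoted to int before the shifts, so the result has to be truncated back. With the question's example, reverse8(0x95) gives 0xA9 (10010101 -> 10101001).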
Well, this might not be the most elegant solution but it is a solution:
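unsigned int reverseBits(unsigned int x) {
unsigned int res = 0;
int len = sizeof(x) * 8; // number of bits to reverse
int i, shift;
unsigned int mask; // unsigned: a signed mask would sign-extend when bit 31 is shifted right
for(i = 0; i < len; i++) {
mask = 1u << i; // which bit we are at
shift = len - 2*i - 1; // distance from bit i to its mirrored position
mask &= x; // isolate bit i of the input
mask = (shift > 0) ? mask << shift : mask >> -shift;
res |= mask; // place the bit at its mirrored position
}
return res;
}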
Tested it on a sheet of paper and it seemed to work :D
Edit: Yeah, this is indeed very complicated. I don't know why, but I wanted to find a solution without touching the input, so this came to my head.