I am trying to convert a uint16_t input to a uint32_t bit mask. One bit in the input toggles two bits in the output bit mask. Here is an example converting a 4-bit input to an 8-bit bit mask:
Input Output
ABCDb -> AABB CCDDb
A,B,C,D are individual bits
Example outputs:
0000b -> 0000 0000b
0001b -> 0000 0011b
0010b -> 0000 1100b
0011b -> 0000 1111b
....
1100b -> 1111 0000b
1101b -> 1111 0011b
1110b -> 1111 1100b
1111b -> 1111 1111b
Is there a bithack-y way to achieve this behavior?
Interleaving bits by Binary Magic Numbers contained the clue:
uint32_t expand_bits(uint16_t bits)
{
uint32_t x = bits;
x = (x | (x << 8)) & 0x00FF00FF;
x = (x | (x << 4)) & 0x0F0F0F0F;
x = (x | (x << 2)) & 0x33333333;
x = (x | (x << 1)) & 0x55555555;
return x | (x << 1);
}
The first four steps successively interleave the source bits with zero bits in groups of 8, 4, 2, and 1 bits; for the 4-bit example ABCD this yields 00AB00CD after the groups-of-2 step and 0A0B0C0D after the groups-of-1 step. The last step then duplicates each even bit (containing an original source bit) into the neighboring odd bit, thereby achieving the desired bit arrangement.
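To make the intermediate values concrete, here is a small trace of those steps (my own sketch, not part of the original answer), checking each stage for the input 0xFF00 with asserts:
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0xFF00;                              /* source bits in the upper byte */
    x = (x | (x << 8)) & 0x00FF00FF;  assert(x == 0x00FF0000);
    x = (x | (x << 4)) & 0x0F0F0F0F;  assert(x == 0x0F0F0000);
    x = (x | (x << 2)) & 0x33333333;  assert(x == 0x33330000);
    x = (x | (x << 1)) & 0x55555555;  assert(x == 0x55550000);
    assert((x | (x << 1)) == 0xFFFF0000);             /* each source bit duplicated into its neighbor */
    return 0;
}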
A number of variants are possible. The last step can also be coded as x + (x << 1) or 3 * x. The | operators in the first four steps can be replaced by ^ operators. The masks can also be modified as some bits are naturally zero and don't need to be cleared. On some processors short masks may be incorporated into machine instructions as immediates, reducing the effort for constructing and / or loading the mask constants. It may also be advantageous to increase instruction-level parallelism for out-of-order processors and optimize for those with shift-add or integer-multiply-add instructions. One code variant incorporating various of these ideas is:
uint32_t expand_bits (uint16_t bits)
{
uint32_t x = bits;
x = (x ^ (x << 8)) & ~0x0000FF00;
x = (x ^ (x << 4)) & ~0x00F000F0;
x = x ^ (x << 2);
x = ((x & 0x22222222) << 1) + (x & 0x11111111);
x = (x << 1) + x;
return x;
}
The easiest way to map a 4-bit input to an 8-bit output is with a 16 entry table. So then it's just a matter of extracting 4 bits at a time from the uint16_t, doing a table lookup, and inserting the 8-bit value into the output.
uint32_t expandBits( uint16_t input )
{
static const uint32_t table[16] = {
0x00, 0x03, 0x0c, 0x0f,
0x30, 0x33, 0x3c, 0x3f,
0xc0, 0xc3, 0xcc, 0xcf,
0xf0, 0xf3, 0xfc, 0xff
};
uint32_t output;
output = table[(input >> 12) & 0xf] << 24;
output |= table[(input >> 8) & 0xf] << 16;
output |= table[(input >> 4) & 0xf] << 8;
output |= table[ input & 0xf];
return output;
}
This provides a decent compromise between performance and readability. It doesn't have quite the performance of cmaster's over-the-top lookup solution, but it's certainly more understandable than thndrwrks' magical mystery solution. As such, it provides a technique that can be applied to a much larger variety of problems, i.e. use a small lookup table to solve a larger problem.
In case you want to get some estimate of relative speeds, some community wiki test code. Adjust as needed.
#include <stdio.h>
#include <stdint.h>
#include <time.h>

void f_cmp(uint32_t (*f1)(uint16_t x), uint32_t (*f2)(uint16_t x)) {
uint16_t x = 0;
do {
uint32_t y1 = (*f1)(x);
uint32_t y2 = (*f2)(x);
if (y1 != y2) {
printf("%4x %8lX %8lX\n", x, (unsigned long) y1, (unsigned long) y2);
}
} while (x++ != 0xFFFF);
}
void f_time(uint32_t (*f1)(uint16_t x)) {
f_cmp(expand_bits, f1);
clock_t t1 = clock();
volatile uint32_t y1 = 0;
unsigned n = 1000;
for (unsigned i = 0; i < n; i++) {
uint16_t x = 0;
do {
y1 += (*f1)(x);
} while (x++ != 0xFFFF);
}
clock_t t2 = clock();
printf("%6llu %6llu: %.6f %lX\n", (unsigned long long) t1,
(unsigned long long) t2, 1.0 * (t2 - t1) / CLOCKS_PER_SEC / n,
(unsigned long) y1);
fflush(stdout);
}
int main(void) {
f_time(expand_bits);
f_time(expandBits);
f_time(remask);
f_time(javey);
f_time(thndrwrks_expand);
// now in the other order
f_time(thndrwrks_expand);
f_time(javey);
f_time(remask);
f_time(expandBits);
f_time(expand_bits);
return 0;
}
Results
0 280: 0.000280 FE0C0000 // fast
280 702: 0.000422 FE0C0000
702 1872: 0.001170 FE0C0000
1872 3026: 0.001154 FE0C0000
3026 4399: 0.001373 FE0C0000 // slow
4399 5740: 0.001341 FE0C0000
5740 6879: 0.001139 FE0C0000
6879 8034: 0.001155 FE0C0000
8034 8470: 0.000436 FE0C0000
8486 8751: 0.000265 FE0C0000
Here's a working implementation:
uint32_t remask(uint16_t x)
{
uint32_t i;
uint32_t result = 0;
for (i=0;i<16;i++) {
uint32_t mask = (uint32_t)x & (1U << i);
result |= mask << (i);
result |= mask << (i+1);
}
return result;
}
On each iteration of the loop, the bit in question from the uint16_t is masked out and stored.
That bit is then shifted by its bit position and ORed into the result, then shifted again by its bit position plus 1 and ORed into the result.
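As a quick check (my own test, assuming the remask() above is in scope), the values from the question's table come out as expected:
#include <assert.h>
#include <stdint.h>

int main(void)
{
    assert(remask(0x0000) == 0x00000000);
    assert(remask(0x000D) == 0x000000F3);   /* 1101b -> 1111 0011b from the question */
    assert(remask(0xFFFF) == 0xFFFFFFFF);
    return 0;
}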
If your concern is performance and simplicity, you are likely best off with a big lookup table (64k entries of 4 bytes each). With that, you can pretty much use any algorithm you like to generate the table; a lookup is then just a single memory access.
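A minimal sketch of that idea (the table, generator loop, and names are mine, not from this answer): build the table once with any convenient method, then each conversion is a single indexed load.
#include <stdint.h>

static uint32_t bigTable[65536];

static void initBigTable(void)
{
    for (uint32_t v = 0; v < 65536; v++) {
        uint32_t r = 0;
        for (unsigned i = 0; i < 16; i++)
            if (v & (1u << i))
                r |= 3u << (2 * i);     /* one input bit -> two adjacent output bits */
        bigTable[v] = r;
    }
}

static inline uint32_t expandBitsBig(uint16_t input)
{
    return bigTable[input];             /* single memory access per conversion */
}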
If that table is too big for your liking, you can split it. For instance, you can use an 8-bit lookup table with 256 entries of 2 bytes each. With that you can perform the entire operation with just two lookups. A bonus is that this approach allows for type-punning tricks to avoid the hassle of splitting the address with bit operations:
//Implementation defined behavior ahead:
//Works correctly for both little and big endian machines,
//however, results will be wrong on a PDP11...
#include <assert.h>
#include <stdint.h>
#include <string.h>

uint32_t getMask(uint16_t input) {
assert(sizeof(uint16_t) == 2);
assert(sizeof(uint32_t) == 4);
static const uint16_t lookupTable[256] = { 0x0000, 0x0003, 0x000c, 0x000f, ... };
unsigned char* inputBytes = (unsigned char*)&input; //legal because we type-pun to char, but the order of the bytes is implementation defined
char outputBytes[4];
uint16_t* outputShorts = (uint16_t*)outputBytes; //legal because we type-pun from char, but the order of the shorts is implementation defined
outputShorts[0] = lookupTable[inputBytes[0]];
outputShorts[1] = lookupTable[inputBytes[1]];
uint32_t output;
memcpy(&output, outputBytes, 4); //can't type-pun directly from uint16 to uint32_t due to strict aliasing rules
return output;
}
The code above works around strict aliasing rules by casting only to/from char, which is an explicit exception to the strict aliasing rules. It also works around the effects of little/big-endian byte order by building the result in the same order as the input was split. However, it still exposes implementation defined behavior: A machine with a byte order of 1, 0, 3, 2, or other middle endian orders, will silently produce wrong results (there have actually been such CPUs like the PDP11...).
Of course, you can split the lookup table even further, but I doubt that would do you any good.
A simple loop. Maybe not bit-hacky enough?
uint32_t thndrwrks_expand(uint16_t x) {
uint32_t mask = 3;
uint32_t y = 0;
while (x) {
if (x&1) y |= mask;
x >>= 1;
mask <<= 2;
}
return y;
}
Tried another that is twice as fast, yet still about 655/272 as slow as expand_bits(). It appears to be the fastest 16-iteration loop solution.
uint32_t thndrwrks_expand(uint16_t x) {
uint32_t y = 0;
for (uint16_t mask = 0x8000; mask; mask >>= 1) {
y <<= 1;
y |= x&mask;
}
y *= 3;
return y;
}
Try this, where input16 is the uint16_t input mask:
uint32_t input32 = (uint32_t) input16;
uint32_t result = 0;
uint32_t i;
for(i=0; i<16; i++)
{
uint32_t bit_at_i = (input32 & (((uint32_t)1) << i)) >> i;
result |= ((bit_at_i << (i*2)) | (bit_at_i << ((i*2)+1)));
}
// result is now the 32 bit expanded mask
My solution is meant to run on mainstream x86 PCs and be simple and generic. I did not write this to compete for the fastest and/or shortest implementation. It is just another way to solve the problem submitted by OP.
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#define BITS_TO_EXPAND (4U)
#define BUFF_SIZE (256U) /* renamed: SIZE_MAX is a standard macro in <stdint.h> */
static bool expand_uint(unsigned int *toexpand,unsigned int *expanded);
int main(void)
{
unsigned int in = 12;
unsigned int out = 0;
bool success;
char buff[BUFF_SIZE];
success = expand_uint(&in,&out);
if(false == success)
{
(void) puts("Error: expand_uint failed");
return EXIT_FAILURE;
}
(void) snprintf(buff, (size_t) BUFF_SIZE, "%u expanded is %u\n", in, out);
(void) fputs(buff,stdout);
return EXIT_SUCCESS;
}
/*
** It expands an unsigned int so that every bit in a nibble is copied twice
** in the resultant number. It returns true on success, false otherwise.
*/
static bool expand_uint(unsigned int *toexpand,unsigned int *expanded)
{
unsigned int i;
unsigned int shifts = 0;
unsigned int mask;
if(NULL == toexpand || NULL == expanded)
{
return false;
}
*expanded = 0;
for(i = 0; i < BITS_TO_EXPAND; i++)
{
mask = (*toexpand >> i) & 1;
*expanded |= (mask << shifts);
++shifts;
*expanded |= (mask << shifts);
++shifts;
}
return true;
}
Related
I'm currently working to create a function which accepts two 4 byte unsigned integers, and returns an 8 byte unsigned long. I've tried to base my work off of the methods depicted by this research but all my attempts have been unsuccessful. The specific inputs I am working with are: 0x12345678 and 0xdeadbeef, and the result I'm looking for is 0x12de34ad56be78ef. This is my work so far:
unsigned long interleave(uint32_t x, uint32_t y){
uint64_t result = 0;
int shift = 33;
for(int i = 64; i > 0; i-=16){
shift -= 8;
//printf("%d\n", i);
//printf("%d\n", shift);
result |= (x & i) << shift;
result |= (y & i) << (shift-1);
}
}
However, this function keeps returning 0xfffffffe which is incorrect. I am printing and verifying these values using:
printf("0x%x\n", z);
and the input is initialized like so:
uint32_t x = 0x12345678;
uint32_t y = 0xdeadbeef;
Any help on this topic would be greatly appreciated, C has been a very difficult language for me, and bitwise operations even more so.
This can be done based on interleaving bits, but skipping some steps so it only interleaves bytes. Same idea: first spread out the bytes in a couple of steps, then combine them.
Here is the plan, illustrated with my amazing freehand drawing skills:
In C (not tested):
// step 1, moving the top two bytes
uint64_t a = (((uint64_t)x & 0xFFFF0000) << 16) | (x & 0xFFFF);
// step 2, moving bytes 5 and 1 up into positions 6 and 2
a = ((a & 0x0000FF000000FF00) << 8) | (a & 0x000000FF000000FF);
// same thing with y
uint64_t b = (((uint64_t)y & 0xFFFF0000) << 16) | (y & 0xFFFF);
b = ((b & 0x0000FF000000FF00) << 8) | (b & 0x000000FF000000FF);
// merge them
uint64_t result = (a << 8) | b;
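A quick sanity check of those steps against the value asked for in the question (the wrapper name interleave_bytes and the test are mine):
#include <assert.h>
#include <stdint.h>

static uint64_t interleave_bytes(uint32_t x, uint32_t y)
{
    uint64_t a = (((uint64_t)x & 0xFFFF0000) << 16) | (x & 0xFFFF);
    a = ((a & 0x0000FF000000FF00) << 8) | (a & 0x000000FF000000FF);
    uint64_t b = (((uint64_t)y & 0xFFFF0000) << 16) | (y & 0xFFFF);
    b = ((b & 0x0000FF000000FF00) << 8) | (b & 0x000000FF000000FF);
    return (a << 8) | b;
}

int main(void)
{
    assert(interleave_bytes(0x12345678, 0xdeadbeef) == 0x12de34ad56be78efULL);
    return 0;
}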
Using SSSE3 PSHUFB has been suggested; it would work, but there is an instruction that can do a byte-wise interleave in one go, punpcklbw. So all we really need to do is get the values into and out of vector registers, and that single instruction will then take care of it.
Not tested:
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_unpacklo_epi8 is punpcklbw */

uint64_t interleave(uint32_t x, uint32_t y) {
__m128i xvec = _mm_cvtsi32_si128(x);
__m128i yvec = _mm_cvtsi32_si128(y);
__m128i interleaved = _mm_unpacklo_epi8(yvec, xvec);
return _mm_cvtsi128_si64(interleaved);
}
With bit-shifting and bitwise operations (endianness independent):
uint64_t interleave(uint32_t x, uint32_t y){
uint64_t result = 0;
for(uint8_t i = 0; i < 4; i ++){
result |= ((x & (0xFFull << (8*i))) << (8*(i+1)));
result |= ((y & (0xFFull << (8*i))) << (8*i));
}
return result;
}
With pointers (endianness dependent):
uint64_t interleave(uint32_t x, uint32_t y){
uint64_t result = 0;
uint8_t * x_ptr = (uint8_t *)&x;
uint8_t * y_ptr = (uint8_t *)&y;
uint8_t * r_ptr = (uint8_t *)&result;
for(uint8_t i = 0; i < 4; i++){
*(r_ptr++) = y_ptr[i];
*(r_ptr++) = x_ptr[i];
}
return result;
}
Note: this solution assumes little-endian byte order
You could do it like this:
uint64_t interleave(uint32_t x, uint32_t y)
{
uint64_t z;
unsigned char *a = (unsigned char *)&x; // 1
unsigned char *b = (unsigned char *)&y; // 1
unsigned char *c = (unsigned char *)&z;
c[0] = a[0];
c[1] = b[0];
c[2] = a[1];
c[3] = b[1];
c[4] = a[2];
c[5] = b[2];
c[6] = a[3];
c[7] = b[3];
return z;
}
Interchange a and b on the lines marked 1 depending on ordering requirement.
A version with shifts, where the LSB of y is always the LSB of the output as in your example, is:
uint64_t interleave(uint32_t x, uint32_t y)
{
return
(y & 0xFFull)
| (x & 0xFFull) << 8
| (y & 0xFF00ull) << 8
| (x & 0xFF00ull) << 16
| (y & 0xFF0000ull) << 16
| (x & 0xFF0000ull) << 24
| (y & 0xFF000000ull) << 24
| (x & 0xFF000000ull) << 32;
}
The compilers I tried don't seem to do a good job of optimizing either version so if this is a performance critical situation then maybe the inline assembly suggestion from comments is the way to go.
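For reference, a small test (mine, assuming the shift-based interleave() just above is in scope) against the expected value from the question:
#include <assert.h>
#include <stdint.h>

int main(void)
{
    assert(interleave(0x12345678u, 0xdeadbeefu) == 0x12de34ad56be78efULL);
    return 0;
}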
Use union punning. It is easy for the compiler to optimize.
#include <stdio.h>
#include <stdint.h>
#include <string.h>
typedef union
{
uint64_t u64;
struct
{
union
{
uint32_t a32;
uint8_t a8[4];
};
union
{
uint32_t b32;
uint8_t b8[4];
};
};
uint8_t u8[8];
}data_64;
uint64_t interleave(uint32_t a, uint32_t b)
{
data_64 in , out;
in.a32 = a;
in.b32 = b;
for(size_t index = 0; index < sizeof(a); index ++)
{
out.u8[index * 2 + 1] = in.a8[index];
out.u8[index * 2 ] = in.b8[index];
}
return out.u64;
}
int main(void)
{
printf("%llx\n", interleave(0x12345678U, 0xdeadbeefU)) ;
}
I'm stuck on XORing a 32-bit integer with itself. I'm supposed to XOR the 4 8-bit portions of the integer. I understand how it works, but without storing the integer anywhere, I don't get how to do this.
I've thought it over and I'm thinking of using binary left shift and right shift operators to separate the 32 bit integer into 4 parts to XOR them. For example, if I were to use an 8-bit integer, I would do something like this:
int a = <some integer here>
(a << 4) ^ (a >> 4)
So far, it isn't working the way I thought it would work.
Here's a part of my code:
else if (choice == 2) {
int bits = 8;
printf("Enter an integer for checksum calculation: ");
scanf("%d", &in);
printf("Integer: %d, ", in);
int x = in, i;
int mask = 1 << sizeof(int) * bits - 1;
printf("Bit representation: ");
for (i = 1; i <= sizeof(int) * bits; i++) {
if (x & mask)
putchar('1');
else
putchar('0');
x <<= 1;
if (! (i % 8)) {
putchar(' ');
}
}
printf("\n");
}
Here's an example of an output:
What type of display do you want?
Enter 1 for character parity, 2 for integer checksum: 2
Enter an integer for checksum calculation: 1024
Integer: 1024, Bit representation: 00000000 00000000 00000100 00000000
Checksum of the number is: 4, Bit representation: 00000100
To accumulate the XOR of 8-bit values, you simply shift and XOR each part of the value. Conceptually it's this:
uint32_t checksum = ( (a >> 24) ^ (a >> 16) ^ (a >> 8) ^ a ) & 0xff;
However, since XOR can be done in any order, you can do the same with fewer operations:
uint32_t checksum = (a >> 16) ^ a;
checksum = ((checksum >> 8) ^ checksum) & 0xff;
If you're doing this over many values, you can extend this idea by only condensing the value at the very end. This is quite similar to how parallel commutative operations are done in larger registers with technologies like SIMD (and indeed, compilers with SIMD support should be able to optimize the following code to make it much faster):
uint32_t simple_checksum( uint32_t *v, size_t count )
{
uint32_t checksum = 0;
uint32_t *end = v + count;
for( ; v != end; v++ )
{
checksum ^= *v; /* accumulate XOR of each 32-bit value */
}
checksum ^= (checksum >> 16); /* XOR high and low words into low word */
checksum ^= (checksum >> 8 ); /* XOR each byte of low word into low byte */
return checksum & 0xff; /* everything from bits 8-31 is rubbish */
}
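A usage sketch of my own, tying this back to the question's example (checksum of 1024 is 4), assuming simple_checksum() above is in scope:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t v = 1024;
    printf("%u\n", (unsigned) simple_checksum(&v, 1));   /* prints 4 */
    return 0;
}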
In general, XORing a number with itself gives the value 0, so you could just as easily set the variable to 0.
0100101^0100101=0
This is a result of the Karnaugh map for the XOR operation providing a 0 when both bits are one, or both are zero.
I have an unsigned char *Buffer that contains 4 bytes, but only 28 of them are relevant to me.
I am looking to create a function that will do a circular shift of the 28 bits while ignoring the remaining 4 bits.
For example, I have the following within *Buffer
1111000011001100101010100000
Say I want to left circular shift by 1 bit of the 28 bits, making it
1110000110011001010101010000
I have looked around and I can't figure out how to get the shift, ignore the last 4 bits, and have the ability to shift either 1, 2, 3, or 4 bits depending on a variable set earlier in the program.
Any help with this would be smashing! Thanks in advance.
Only 1 bit at a time, but this does a 28 bit circular shift
uint32_t csl28(uint32_t value) {
uint32_t overflow_mask = 0x08000000;
uint32_t value_mask = 0x07FFFFFF;
return ((value & value_mask) << 1) | ((value & overflow_mask) >> 27);
}
uint32_t csr28(uint32_t value) {
uint32_t overflow_mask = 0x00000001;
uint32_t value_mask = 0x0FFFFFFE;
return ((value & value_mask) >> 1) | ((value & overflow_mask) << 27);
}
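A tiny check of the single-bit rotates above (test values are my own):
#include <assert.h>
#include <stdint.h>

int main(void)
{
    assert(csl28(0x08000001) == 0x00000003);   /* bit 27 wraps around to bit 0 */
    assert(csr28(0x00000003) == 0x08000001);   /* and rotating right undoes it */
    return 0;
}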
Another version, based on this article. This shifts an arbitrary number of bits (count) within an arbitrarily wide bit field (width). To left shift a value 5 bits in a 23-bit-wide field: rotl32(value, 5, 23);
#include <stdint.h>
#include <limits.h>

uint32_t rotl32 (uint32_t value, uint32_t count, uint32_t width) {
    uint32_t value_mask = ((uint32_t)~0) >> (CHAR_BIT * sizeof(value) - width);
    value &= value_mask;          // ignore anything above the bit field
    count %= width;               // width need not be a power of two
    if (count == 0) return value;
    return value_mask & ((value << count) | (value >> (width - count)));
}

uint32_t rotr32 (uint32_t value, uint32_t count, uint32_t width) {
    uint32_t value_mask = ((uint32_t)~0) >> (CHAR_BIT * sizeof(value) - width);
    value &= value_mask;
    count %= width;
    if (count == 0) return value;
    return value_mask & ((value >> count) | (value << (width - count)));
}
The above functions assume the value is stored in the low order bits of "value"
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>

/* rotl32() and rotr32() from above go here */

const char *uint32_to_binary(uint32_t x)
{
static char b[33];
b[0] = '\0';
uint32_t z;
for (z = 0x80000000; z > 0; z >>= 1)
{
strcat(b, ((x & z) == z) ? "1" : "0");
}
return b;
}
uint32_t reverse(uint32_t value)
{
return (value & 0x000000FF) << 24 | (value & 0x0000FF00) << 8 |
(value & 0x00FF0000) >> 8 | (value & 0xFF000000) >> 24;
}
int is_big_endian(void)
{
union {
uint32_t i;
char c[4];
} bint = {0x01020304};
return bint.c[0] == 1;
}
int main(int argc, char** argv) {
char b[] = { 0x98, 0x02, 0xCA, 0xF0 };
char *buffer = b;
//uint32_t num = 0x01234567;
uint32_t num = *((uint32_t *)buffer);
if (!is_big_endian()) {
num = reverse(*((uint32_t *)buffer));
}
num >>= 4;
printf("%x\n", num);
for(int i=0;i<5;i++) {
printf("%s\n", uint32_to_binary(num));
num = rotl32(num, 3, 28);
}
for(int i=0;i<5;i++) {
//printf("%08x\n", num);
printf("%s\n", uint32_to_binary(num));
num = rotr32(num, 3, 28);
}
unsigned char out[4];
memset(out, 0, sizeof(unsigned char) * 4);
num <<= 4;
if (!is_big_endian()) {
num = reverse(num);
}
*((uint32_t*)out) = num;
printf("[ ");
for (int i=0;i<4;i++) {
printf("%s0x%02x", i?", ":"", out[i] );
}
printf(" ]\n");
}
First you mask off the four most significant bits:
*(buffer + 3) &= 0x0F;
Then you can perform the circular shift of the remaining 28 bits by x bits.
Note: This will work on little-endian architectures (x86 PCs and most microcontrollers).
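Putting that together, a minimal sketch (my own, assuming a little-endian host and the csl28() rotate shown earlier is in scope) that rotates the 28-bit value held in the buffer left by one bit:
#include <stdint.h>
#include <string.h>

void rotate_buffer_left1(unsigned char *buffer)
{
    uint32_t v;
    memcpy(&v, buffer, 4);     /* little-endian load of the 32-bit value */
    v &= 0x0FFFFFFF;           /* keep only the 28 relevant bits */
    v = csl28(v);              /* 28-bit circular left shift by one */
    memcpy(buffer, &v, 4);     /* store back; the top nibble is now zero */
}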
[...] that contains 4 bytes, but only 28 of them [...]
We got it, but...
I guess that you mis-typed the second number of your example. Or do you "ignore" 4 bits from left and right, so you're actually interested in 24 bits? Anyway:
Use the same principle as in
Circular shift in c.
You need to convert your Buffer to a 32-bit arithmetic type first. Maybe uint32_t is what you need?
Where did Buffer get its value? You may need to think about endianness.
I have the following function, which counts the number of binary digits in an unsigned 32-bit integer.
uint32_t L(uint32_t in)
{
uint32_t rc = 0;
while (in)
{
rc++;
in >>= 1;
}
return(rc);
}
Could anyone please tell me which approach I should take in the case of a signed 32-bit integer? Implementing two's complement is an option. If you have any better approach, please let me know.
What about:
uint32_t count_bits(int32_t in)
{
uint32_t unsigned_in = (uint32_t) in;
uint32_t rc = 0;
while (unsigned_in)
{
rc++;
unsigned_in >>= 1;
}
return(rc);
}
Just convert the signed int into an unsigned one and do the same thing as before.
BTW: I guess you know that - unless your processor has a special instruction for it and you have access to it - one of the fastest implementations of counting the bits is:
int count_bits(unsigned x) {
x = x - ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0f0f0f0f;
x = x + (x >> 8);
x = x + (x >> 16);
return x & 0x0000003f;
}
It's not the fastest though...
Just reuse the function you defined as is:
int32_t bla = /* ... */;
uint32_t count;
count = L(bla);
You can cast bla to uint32_t (i.e., L((uint32_t) bla);) to make the conversion explicit, but it's not required by C.
If you are using gcc, it already provides fast implementations of functions to count bits and you can use them:
int __builtin_popcount (unsigned int x);
int __builtin_popcountl (unsigned long);
int __builtin_popcountll (unsigned long long);
http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
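Usage is straightforward (a small sketch of my own):
#include <stdio.h>

int main(void)
{
    unsigned x = 0x12345678u;
    printf("%d\n", __builtin_popcount(x));   /* prints 13, the number of set bits */
    return 0;
}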
Your negative number always shows 32 because the most significant bit of a negative signed integer is 1. A UInt4 of 1000 = 8, but an Int4 of 1000 = -8, an Int4 of 1001 = -7, an Int4 of 1010 = -6, etc...
Since the first bit in an Int32 is meaningful rather than just a bit of padding, you cannot really ignore it.
I want to implement bitwise cyclic shift of a 64 bit integer.
ROT(a,b) will move bit at position i to position i+b. (a is the 64 bit integer)
However, my avr processor is an 8-bit processor. Thus, to express a, I have to use
uint8_t x[8].
x[0] is the 8 most significant bits of a.
x[7] is the 8 least significant bits of a.
Can any one help to implement ROT(a,b) in term of array x?
Thank you
It makes no functional difference if the underlying processor is 64-bit, 8-bit or 1-bit. If the compiler is compliant, you are good to go. Use uint64_t. Code does not "have to use uint8_t" just because the processor is an 8-bit one.
uint64_t ROT(uint64_t a, unsigned b) {
return (a << (b & 63)) | (a >> ((64 - b) & 63));
}
Extra () added for explicitness.
& 63 (or %64 if you like that style) added to ensure only the 6 LSBits of b contribute to the shift. Any higher bits simply imply multiple "revolutions" of a circular shift.
((64 - b) & 63) could be simplified to (-b & 63).
--
But if OP still wants "implement ROT(a,b) in terms of array uint8_t x[8]":
#include <stdint.h>
// circular left shift. MSByte in a[0].
void ROT(uint8_t *a, unsigned b) {
uint8_t dest[8];
b &= 63;
// byte shift
unsigned byte_shift = b / 8;
for (unsigned i = 0; i < 8; i++) {
dest[i] = a[(i + byte_shift) & 7];
}
b &= 7; // b %= 8; form bit shift;
unsigned acc = dest[0] << b;
for (unsigned i = 8; i-- > 0;) {
acc >>= 8;
acc |= (unsigned) dest[i] << b;
a[i] = (uint8_t) acc;
}
}
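A cross-check I find useful (my own test, not part of the answer): compare the array version above against a plain uint64_t rotate for one value, with a[0] holding the most significant byte as the function expects.
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint64_t v = 0x0123456789ABCDEFULL;
    uint8_t a[8];
    for (int i = 0; i < 8; i++)
        a[i] = (uint8_t)(v >> (8 * (7 - i)));    /* a[0] = MSByte */

    ROT(a, 20);                                  /* rotate the array form left by 20 bits */

    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        r = (r << 8) | a[i];

    assert(r == ((v << 20) | (v >> 44)));        /* matches a 64-bit circular left shift by 20 */
    return 0;
}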
@vlad_tepesch suggested a solution that emphasizes the AVR's 8-bit nature. This is an untested attempt.
void ROT(uint8_t *a, uint8_t b) {
uint8_t dest[8];
b &= 63; // Could be eliminated as following code only uses the 6 LSBits.
// byte shift
uint8_t byte_shift = b / 8u;
for (uint8_t i = 0; i < 8u; i++) {
dest[i] = a[(i + byte_shift) & 7u];
}
b &= 7u; // b %= 8u; form bit shift;
uint16_t acc = dest[0] << b;
for (unsigned i = 8u; i-- > 0;) {
acc >>= 8u;
acc |= (uint8_t) dest[i] << b;
a[i] = (uint8_t) acc;
}
}
Why not leave the work to the compiler and just implement a function?
uint64_t rotL(uint64_t v, uint8_t r){
    return (v >> ((64 - r) & 63)) | (v << r);  // & 63 avoids an undefined 64-bit shift when r == 0
}
I take it the x[i] are 8 bits each.
To rotate left n times: give each bit a single overall position i that combines the array index (x[0] -> x[7]) and the bit position within the element; after the rotation, the bit at overall position i ends up at Y((i+n)/8, (i+n) & 7).
This will handle rotations up to 63; for any number > 63, you just mod it.