Internet Checksum function: shift left 8 bits or not? (C)

This is my implementation of the Internet Checksum (RFC 1071):
static unsigned short
compute_checksum(unsigned short *addr, unsigned int count) {
    register unsigned long sum = 0;
    while (count > 1) {
        sum += *addr++;
        count -= 2;
    }
    // if any bytes are left over, pad and add
    if (count > 0) {
        sum += *(unsigned char *)addr;  // shift left 8 bits or not?
    }
    // fold sum to 16 bits: add carry to result
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    // one's complement
    sum = ~sum;
    return ((unsigned short)sum);
}
When we hit the odd trailing byte, why don't we need to shift it left by 8 bits, like this? The RFC doesn't shift it either. Why not? I would have thought this was the correct version:
sum += (*(unsigned char *)addr << 8) & 0xFF00;

The code you posted from the RFC is correct for a little-endian machine. On a big-endian machine, your shifted solution would be necessary.
With an odd number of bytes, a (theoretical) 0 byte is added on to the end of the sequence. So the last byte XX should be treated as a short with byte sequence XX 00, which needs to be handled differently depending on the endianness of your machine.
Here's one way to handle it correctly for either endianness:
if (count > 0) {
    unsigned char temp[2];
    temp[0] = *(unsigned char *) addr;
    temp[1] = 0;
    sum += *(unsigned short *) temp;
}
For those of you who don't believe the RFC code is wrong, I refer you to this Linux source, where it is clear that the little-endian case and the big-endian case must be treated differently in the way I described. The Linux code is a little more complicated because it handles unaligned buffers.

A short is at least 16 bits; there's no guarantee that it's not 32 or 64 bits.
You should be using uint16_t from <stdint.h>.
You do kind of need to shift; the last add would be
{
    uint16_t tmp = 0;
    memcpy(&tmp, addr, 1);  // copy the odd byte into tmp's low-address byte
    sum += tmp;
}
which preserves the alignment, so on a little-endian machine that value is not shifted, but on a big-endian machine it is, compared to its alignment as a char. A masking trick with a similar effect:
sum += (*addr & *((uint16_t *)"\xff"));
but that code may not work unless you have some trick to word-align the string literal.
Be aware that the result is in network byte order, so if you need it in host byte order, use the ntohs() function to convert it.
foo = compute_checksum(blah, blah_size);
printf("the internet checksum is %04x\n", (int)ntohs(foo));

Related

Bitwise operation in C language (0x80, 0xFF, << )

I have a problem understanding this code. What I know is that we have passed code into an assembler that has converted it into byte code. Now I have a virtual machine that is supposed to read this code. This function is supposed to read the first byte code instruction. I don't understand what is happening in this code. I guess we are trying to read the byte code, but I don't understand how it is done.
static int32_t bytecode_to_int32(const uint8_t *bytecode, size_t size)
{
    int32_t result;
    t_bool  sign;
    int     i;

    result = 0;
    sign = (t_bool)(bytecode[0] & 0x80);
    i = 0;
    while (size)
    {
        if (sign)
            result += ((bytecode[size - 1] ^ 0xFF) << (i++ * 8));
        else
            result += bytecode[size - 1] << (i++ * 8);
        size--;
    }
    if (sign)
        result = ~(result);
    return (result);
}
This code is somewhat badly written, with lots of operations on a single line, and therefore contains various potential bugs. It looks brittle.
bytecode[0] & 0x80 Simply reads the MSB sign bit, assuming it's 2's complement or similar, then converts it to a boolean.
The loop iterates backwards from most significant byte to least significant.
If the sign was negative, the code will perform an XOR of the data byte with 0xFF. Basically inverting all bits in the data. The result of the XOR is an int.
The data byte (or the result of the above XOR) is then bit shifted i * 8 bits to the left. The data is always implicitly promoted to int, so in case i * 8 happens to give a result larger than INT_MAX, there's a fat undefined behavior bug here. It would be much safer practice to cast to uint32_t before the shift, carry out the shift, then convert to a signed type afterwards.
The resulting int is converted to int32_t - these could be the same type or different types depending on system.
i is incremented by 1, size is decremented by 1.
If sign was negative, the int32_t is inverted to some 2's complement negative number that's sign extended and all the data bits are inverted once more. Except all zeros that got shifted in with the left shift are also replaced by ones. If this is intentional or not, I cannot tell. So for example if you started with something like 0x0081 you now have something like 0xFFFF01FF. How that format makes sense, I have no idea.
My take is that the bytecode[size - 1] ^ 0xFF (which is equivalent to ~) was made to toggle the data bits, so that they would later toggle back to their original values when ~ is called later. A programmer has to document such tricks with comments, if they are anything close to competent.
Anyway, don't use this code. If the intention was merely to swap the byte order (endianness) of a 4-byte integer, then this code must be rewritten from scratch.
That's properly done as:
static int32_t big32_to_little32 (const uint8_t* bytes)
{
    uint32_t result = (uint32_t)bytes[0] << 24 |
                      (uint32_t)bytes[1] << 16 |
                      (uint32_t)bytes[2] <<  8 |
                      (uint32_t)bytes[3] <<  0 ;
    return (int32_t)result;
}
Anything more complicated than the above is highly questionable code. We need not worry about signs being a special case, the above code preserves the original signedness format.
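For example, a quick usage sketch (assuming a big-endian byte stream):

uint8_t bytes[4] = { 0xFF, 0xFF, 0xFF, 0xFE };
int32_t v = big32_to_little32(bytes);   // 0xFFFFFFFE, i.e. -2, on any host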
So A ^ 0xFF toggles the bits set in A: if you have 10101100 XORed with 11111111, it becomes 01010011. I am not sure why they didn't use ~ here; ^ is the XOR operator, so you are XORing with 0xFF.
The << is a bit shift "up", or left. In other words, A << 1 is equivalent to multiplying A by 2.
The >> moves down, so it is equivalent to bit shifting right, or dividing by 2.
The ~ inverts the bits in a byte.
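A quick standalone demonstration of those operators (the values match the example above):

#include <stdio.h>

int main(void)
{
    unsigned char a = 0xAC;             /* 10101100 */
    printf("%02X\n", a ^ 0xFF);         /* 53: 01010011, same as (unsigned char)~a */
    printf("%02X\n", (a << 1) & 0xFF);  /* 58: shifted left, i.e. a * 2 (mod 256) */
    printf("%02X\n", a >> 1);           /* 56: shifted right, i.e. a / 2 */
    return 0;
}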
Note that it's better to initialise variables at declaration; it costs no additional processing whatsoever to do it that way.
sign = (t_bool)(bytecode[0] & 0x80); the sign of the number is stored in the 8th bit (position 7 counting from 0), which is where the 0x80 comes from. So it's literally checking whether the sign bit is set in the first byte of bytecode, and if so, it stores that in the sign variable.
Essentially, if the number is unsigned, it copies the bytes from bytecode into result one byte at a time.
If the data is signed, it flips the bits, then copies the bytes, and when it's done copying, it flips the bits back.
Personally, with this kind of thing, I prefer to get the data, put it in network byte order with htons(), memcpy it to an allocated array, and store it in an endian-agnostic way; then, when I retrieve the data, I use ntohs() to convert it back to the format used by the computer. htons() and ntohs() are standard POSIX functions and are used in networking and platform-agnostic data formatting/storage/communication all the time.
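For instance, here is a sketch of that storage pattern using the 32-bit variants htonl()/ntohl() from the POSIX header <arpa/inet.h>:

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* store a 32-bit value in an endian-agnostic (network order) buffer */
void store_u32(uint8_t *buf, uint32_t v)
{
    uint32_t be = htonl(v);
    memcpy(buf, &be, sizeof be);
}

/* read it back in host byte order */
uint32_t load_u32(const uint8_t *buf)
{
    uint32_t be;
    memcpy(&be, buf, sizeof be);
    return ntohl(be);
}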
This function is a very naive version of a function that converts from big-endian to little-endian.
The parameter size is not needed, as it works only with 4-byte data.
It can be achieved much more easily with union punning (which also allows compilers to optimize it, in this case down to a single instruction):
#include <stdio.h>
#include <stdint.h>

#define SWAP(a,b,t) do { t c = (a); (a) = (b); (b) = c; } while(0)

int32_t my_bytecode_to_int32(const uint8_t *bytecode)
{
    union
    {
        int32_t i32;
        uint8_t b8[4];
    } i32;

    i32.b8[3] = *bytecode++;
    i32.b8[2] = *bytecode++;
    i32.b8[1] = *bytecode++;
    i32.b8[0] = *bytecode++;
    return i32.i32;
}
int main()
{
    union {
        int32_t i32;
        uint8_t b8[4];
    } i32;

    i32.i32 = -4567;
    SWAP(i32.b8[0], i32.b8[3], uint8_t);
    SWAP(i32.b8[1], i32.b8[2], uint8_t);
    printf("%d\n", bytecode_to_int32(i32.b8, 4));

    i32.i32 = -34;
    SWAP(i32.b8[0], i32.b8[3], uint8_t);
    SWAP(i32.b8[1], i32.b8[2], uint8_t);
    printf("%d\n", my_bytecode_to_int32(i32.b8));
}
https://godbolt.org/z/rb6Na5
If the purpose of the code is to sign-extend a 1-, 2-, 3-, or 4-byte sequence in network/big-endian byte order to a signed 32-bit int value, it's doing things the hard way and reimplementing the wheel along the way.
This can be broken down into a three-step process: convert the proper number of bytes to a 32-bit integer value, sign-extend that value out to 32 bits, and account for the big-endian/network byte order of the input.
The "wheel" being reimplemented in this case is the POSIX-standard ntohl() function, which converts a 32-bit unsigned integer value in big-endian/network byte order to the local host's native byte order. Note, though, that assembling the value with explicit shifts (as below) already produces a host-order value, so no separate swap is needed afterwards.
The first step I'd do is to convert 1, 2, 3, or 4 bytes into a uint32_t:
#include <stdint.h>
#include <limits.h>
#include <errno.h>

// convert the `size` number of bytes starting at the `bytecode` address
// to a uint32_t value
static uint32_t bytecode_to_uint32( const uint8_t *bytecode, size_t size )
{
    uint32_t result = 0;

    switch ( size )
    {
        case 4:
            result = ( uint32_t ) bytecode[ size - 4 ] << 24;
            // fall through
        case 3:
            result += ( uint32_t ) bytecode[ size - 3 ] << 16;
            // fall through
        case 2:
            result += ( uint32_t ) bytecode[ size - 2 ] << 8;
            // fall through
        case 1:
            result += bytecode[ size - 1 ];
            break;
        default:
            // error handling here
            break;
    }
    return ( result );
}
Then, sign-extend it (borrowing from this answer):
static uint32_t sign_extend_uint32( uint32_t in, size_t size )
{
    if ( size == 4 )
    {
        return ( in );
    }

    // being pedantic here - the existence of `[u]int32_t` pretty
    // much ensures 8 bits/byte
    size_t bits = size * CHAR_BIT;

    uint32_t m = 1U << ( bits - 1 );
    uint32_t result = ( in ^ m ) - m;
    return ( result );
}
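As a quick worked example, sign-extending the single byte 0x80 (-128 as a signed 8-bit value):

// bits = 8, so m = 0x80
// ( 0x80 ^ 0x80 ) - 0x80 wraps to 0xFFFFFF80, i.e. -128 in 32 bits
uint32_t x = sign_extend_uint32( 0x80u, 1 );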
Put it all together:
static int32_t bytecode_to_int32( const uint8_t *bytecode, size_t size )
{
    uint32_t result = bytecode_to_uint32( bytecode, size );
    result = sign_extend_uint32( result, size );

    // no ntohl() needed here - assembling the value with shifts in
    // bytecode_to_uint32() already produced the host's byte order

    // converting uint32_t here to signed int32_t
    // can be subject to implementation-defined
    // behavior
    return ( result );
}
Note that the conversion from uint32_t to int32_t implicitly performed by the return statement in the above code can result in implementation-defined behavior, as there can be uint32_t values that can not be mapped to int32_t values. See this answer.
Any decent compiler should optimize that well into inline functions.
I personally think this also needs much better error handling/input validation.
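One portable way to sidestep that implementation-defined conversion (a sketch of my own, not from the linked answer) is to copy the representation instead; the exact-width int32_t is guaranteed to be two's complement:

#include <stdint.h>
#include <string.h>

static int32_t u32_bits_to_i32( uint32_t u )
{
    int32_t i;
    memcpy( &i, &u, sizeof i );  // reinterpret the bits; no value conversion occurs
    return i;
}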

CRC-15 giving wrong values

I am trying to create a CRC-15 check in C, and the output is never correct for each line of the file. I am trying to output the CRC for each line cumulatively next to each line. I use #define POLYNOMIAL 0xA053 for the divisor and the text for the dividend. I need to represent numbers as 32-bit unsigned integers. I have tried printing out the hex values to keep track and flipping different shifts around, but I just can't seem to figure it out! I have a feeling it has something to do with the way I am padding things. Is there a flaw in my logic?
The CRC is to be represented in four hexadecimal digits; printed, that sequence will have four leading 0's, so it will look like 0000xxxx, where the x's are the hexadecimal digits. The polynomial I use is 0xA053.
I thought about using a temp variable and doing 4 16-bit chunks of code per line for every XOR; however, I'm not quite sure how I could use shifts to accomplish this, so I settled for a checksum of the letters on the line and then XORing that to try to calculate the CRC code.
I am testing my code using the following input, padding with . until the string is of length 504, because that is the pad character required by the assignment:
"This is the lesson: never give in, never give in, never, never, never, never - in nothing, great or small, large or petty - never give in except to convictions of honor and good sense. Never yield to force; never yield to the apparently overwhelming might of the enemy."
The CRC of the first 64-char line ("This is the lesson: never give in, never give in, never, never,") is supposed to be 000015fa, and I am getting bfe6ec00.
My logic:
In crcCalculation I add each character to a 32-bit unsigned integer, and after 64 characters (the length of one line) I send it into the XOR function.
If the top bit is not 1, I shift the number to the left by one, causing 0s to pad the right, and loop around again.
If the top bit is 1, I XOR the dividend with the divisor and then shift the dividend to the left by one.
After all calculations are done, I return the dividend shifted right by four (to add four zeros to the front) to the calculation function.
Add the result to the running total of the result.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <ctype.h>

#define POLYNOMIAL 0xA053

uint32_t XOR(uint32_t divisor, uint32_t dividend);

void crcCalculation(char *text, int length)
{
    int i;
    uint32_t dividend = atoi(text);
    uint32_t result;
    uint32_t sumText = 0;

    // Calculate CRC
    printf("\nCRC 15 calculation progress:\n");
    i = length;

    // padding
    if (i < 504)
    {
        for (; i != 504; i++)
        {
            // printf("i is %d\n", i);
            text[i] = '.';
        }
    }

    // Try calculating the first line of the CRC by summing the values,
    // then calculating, then adding in the next line
    for (i = 0; i < 504; i++)
    {
        if (i % 64 == 0 && i != 0)
        {
            result = XOR(POLYNOMIAL, sumText);
            printf(" - %x\n", result);
        }
        sumText += (uint32_t)text[i];
        printf("%c", text[i]);
    }
    printf("\n\nCRC15 result : %x\n", result);
}

uint32_t XOR(uint32_t divisor, uint32_t dividend)
{
    uint32_t divRemainder = dividend;
    uint32_t currentBit;

    // Note: 4 16-bit chunks
    for (currentBit = 32; currentBit > 0; --currentBit)
    {
        // if top bit is 1
        if (divRemainder & 0x80)
        {
            //divRemainder = (divRemainder << 1) ^ divisor;
            divRemainder ^= divisor;
            printf("%x %x\n", divRemainder, divisor);
        }
        // else
        //     divisor = divisor >> 1;
        divRemainder = (divRemainder << 1);
    }
    //return divRemainder; have tried shifting right and left; want to add 4 zeros to the front, so >>
    //return divRemainder >> 4;
    return divRemainder >> 4;
}
The first issue I see is the top-bit check; it should be:
if (divRemainder & 0x8000)
The question doesn't state if the CRC is bit reflected (xor data into low order bits of CRC, right shift for cycle) or not (xor data into high order bits of CRC, left shift for cycle), so I can't offer help for the rest of the code.
The question doesn't state the initial value of CRC (0x0000 or 0x7fff), or if the CRC is post complemented.
The logic for a conventional CRC is:
xor a byte of data into the CRC (upper or lower bits)
cycle the CRC 8 times (or do a table lookup)
After generating the CRC for an entire message, the CRC can be appended to the message. If a CRC is generated for a message with the appended CRC and there are no errors, the CRC will be zero (or a constant value if the CRC is post complemented).
Here is a typical CRC16, extracted from: <www8.cs.umu.se/~isak/snippets/crc-16.c>
#define POLY 0x8408
/*
// This is the CCITT CRC 16 polynomial X^16 + X^12 + X^5 + 1.
// This works out to be 0x1021, but the way the algorithm works
// lets us use 0x8408 (the reverse of the bit pattern). The high
// bit is always assumed to be set, thus we only use 16 bits to
// represent the 17 bit value.
*/

unsigned short crc16(char *data_p, unsigned short length)
{
    unsigned char i;
    unsigned int data;
    unsigned int crc = 0xffff;

    if (length == 0)
        return (~crc);

    do
    {
        for (i = 0, data = (unsigned int)0xff & *data_p++;
             i < 8;
             i++, data >>= 1)
        {
            if ((crc & 0x0001) ^ (data & 0x0001))
                crc = (crc >> 1) ^ POLY;
            else
                crc >>= 1;
        }
    } while (--length);

    crc = ~crc;
    data = crc;
    crc = (crc << 8) | (data >> 8 & 0xff);

    return (crc);
}
Since you want to calculate a CRC15 rather than a CRC16, the logic will be more complex, as it cannot work with whole bytes; there will be a lot of bit shifting and ANDing to extract the desired 15 bits.
Note: the OP did not mention whether the initial value of the CRC is 0x0000 or 0x7FFF, nor whether the result is to be complemented, nor certain other criteria, so this posted code can only be a guide.
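For illustration, here is a minimal sketch of a non-reflected (left-shifting) bitwise CRC over a 15-bit register using the OP's polynomial; the initial value (0 here) and the absence of reflection and post-complement are guesses, since the exact spec isn't given:

#include <stdint.h>
#include <stddef.h>

#define POLYNOMIAL 0xA053u  // note: bit 15 falls outside a 15-bit register and is masked away

uint32_t crc15(const unsigned char *data, size_t len)
{
    uint32_t crc = 0;                   // assumed initial value
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 7;  // align the byte with bits 14..7
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x4000)           // top bit of the 15-bit register
                crc = (crc << 1) ^ POLYNOMIAL;
            else
                crc <<= 1;
            crc &= 0x7FFF;              // keep the register at 15 bits
        }
    }
    return crc;                         // printed with %08x it shows at least four leading 0s
}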

fetch 32bit instruction from binary file in C

I need to read 32-bit instructions from a binary file.
So what I have right now is:
unsigned char buffer[4];
fread(buffer, sizeof(buffer), 1, file);
which will put 4 bytes in an array.
How should I approach connecting those 4 bytes together in order to process the 32-bit instruction later?
Or should I even start in a different way and not use fread?
My weird method right now is to create an array of ints of size 32 and then fill it with bits from the buffer array.
The answer depends on how the 32-bit integer is stored in the binary file. (I'll assume that the integer is unsigned, because it really is an id, and use the type uint32_t from <stdint.h>.)
Native byte order: The data was written out as an integer on this machine. Just read the integer with fread:
uint32_t op;
fread(&op, sizeof(op), 1, file);
Rationale: fread reads the raw representation of the integer into memory. The matching fwrite does the reverse: it writes the raw representation to the file. If you don't need to exchange the file between platforms, this is a good method to store and read data.
Little-endian byte order: The data is stored as four bytes, least significant byte first:
uint32_t op = 0u;
op |= (uint32_t)getc(file);       // 0x000000AA
op |= (uint32_t)getc(file) << 8;  // 0x0000BBaa
op |= (uint32_t)getc(file) << 16; // 0x00CCbbaa
op |= (uint32_t)getc(file) << 24; // 0xDDccbbaa
Rationale: getc reads a char and returns an integer between 0 and 255. (The case where the stream runs out and getc returns the negative value EOF is not considered here for brevity, viz laziness; a checked variant is sketched below.) Build your integer by shifting each byte you read by multiples of 8 and OR them with the existing value. The comments sketch how it works: the capital letters are the byte just read, the lower-case letters were already there, and zeros have not yet been assigned.
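Here, for completeness, is that variant with the EOF check (a sketch; the name read_u32_le is mine):

#include <stdint.h>
#include <stdio.h>

// returns 0 on success, -1 on a short read
static int read_u32_le(FILE *file, uint32_t *out)
{
    uint32_t op = 0u;
    for (int shift = 0; shift < 32; shift += 8) {
        int c = getc(file);
        if (c == EOF)
            return -1;
        op |= (uint32_t)c << shift;
    }
    *out = op;
    return 0;
}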
Big-endian byte order: The data is stored as four bytes, least significant byte last:
uint32_t op = 0u;
op |= (uint32_t)getc(file) << 24; // 0xAA000000
op |= (uint32_t)getc(file) << 16; // 0xaaBB0000
op |= (uint32_t)getc(file) << 8;  // 0xaabbCC00
op |= (uint32_t)getc(file);       // 0xaabbccDD
Rationale: Pretty much the same as above, only that you shift the bytes in another order.
You can imagine little-endian and big-endian as writing the number one hundred twenty-three (CXXIII) as either 321 or 123. The bit shifting is similar to shifting decimal digits when dividing or multiplying by powers of 10, only that you shift by 8 bits to multiply by 2^8 = 256 here.
Add
unsigned int instruction;
memcpy(&instruction, buffer, 4);
to your code. This will copy the 4 bytes of buffer into a single 32-bit variable. Hence you will get the 4 bytes connected :)
If you know that the int in the file is the same endian as the machine the program's running on, then you can read straight into the int. No need for a char buffer.
unsigned int instruction;
fread(&instruction,sizeof(instruction),1,file);
If you know the endianness of the int in the file, but not the machine the program's running on, then you'll need to add and shift the bytes together.
unsigned char buffer[4];
unsigned int instruction;
fread(buffer, sizeof(buffer), 1, file);
//big-endian
instruction = ((unsigned int)buffer[0]<<24) + (buffer[1]<<16) + (buffer[2]<<8) + buffer[3];
//little-endian
instruction = ((unsigned int)buffer[3]<<24) + (buffer[2]<<16) + (buffer[1]<<8) + buffer[0];
Another way to think of this is that it's a positional number system in base 256. So, just like you combine digits in base 10:
257
= 2*100 + 5*10 + 7
= 2*10^2 + 5*10^1 + 7*10^0
you can also combine the bytes using Horner's rule:
//big-endian
instruction = ((buffer[0]*256 + buffer[1])*256 + buffer[2])*256 + buffer[3];
//little-endian
instruction = ((buffer[3]*256 + buffer[2])*256 + buffer[1])*256 + buffer[0];
@luser droog
There are two bugs in your code.
The size of the variable "instruction" may not be 4 bytes: for example, Turbo C assumes sizeof(int) to be 2. Obviously, your program fails in this case. But what is much more important, and not so obvious: your program will also fail if sizeof(int) is more than 4 bytes! To understand this, consider the following example:
#include <stdio.h>

int main()
{
    const unsigned char a[4] = {0x21, 0x43, 0x65, 0x87};
    const unsigned char* p = a;  /* not &a, which is a pointer to the whole array */
    unsigned long x = (((((p[3] << 8) + p[2]) << 8) + p[1]) << 8) + p[0];
    printf("%08lX\n", x);
    return 0;
}
This program prints "FFFFFFFF87654321" under amd64, because an unsigned char value becomes a SIGNED INT when it is used (integer promotion)! So changing the type of the variable "instruction" from "int" to "long" does not solve the problem.
The only way is to write something like:
unsigned long instruction = 0;
const unsigned char* p = buffer + 3;
for (int i = 0; i < 4; i++, p--) {
    instruction <<= 8;
    instruction += *p;
}

What is the fastest way to calculate the sum of arbitrarily large binary numbers

I can't seem to find any good literature about this. I have a BigBinaryNumber structure (two's complement with a virtual sign bit) like this:
typedef unsigned char byte;

enum Sign {NEGATIVE = (-1), ZERO = 0, POSITIVE = 1};
typedef enum Sign Sign;

struct BigBinaryNumber
{
    byte *number;
    Sign signum;
    unsigned int size;
};
typedef struct BigBinaryNumber BigBinaryNumber;
I could just go for the elementary-school approach (i.e. summing individual bytes and using the carry for subsequent sums) or perhaps work with a fixed-size look-up table.
Is there any good literature about the fastest method for binary summation?
The fastest method for adding numbers is your processor's existing add instruction. So long as you've got the number laid out sensibly in memory (e.g., you don't have the bit order backwards or anything), it should be pretty straightforward to load 32 bits at a time from each number, add them together natively, and get the carry:
uint32_t *word_1   = (uint32_t *)number1.number + offset;
uint32_t *word_2   = (uint32_t *)number2.number + offset;
uint32_t *word_tgt = (uint32_t *)dest.number + offset;

uint64_t sum = (uint64_t)*word_1 + *word_2 + carry; // note the type: widen before adding!
*word_tgt = (uint32_t)sum;                          // truncate to the low 32 bits
carry = sum >> 32;
Note that you might have to add some special cases for dealing with the last byte in the number (or make sure that *number always has a multiple of 4 bytes allocated).
If you're using a 64-bit CPU, you may be able to extend this to work with uint64_t. There's no uint128_t for the overflow, though, so you might have to use some trickery to get the carry bit.
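For that carry trickery with 64-bit words, one portable option (a sketch; the name add64_carry is mine) is to detect wraparound by comparing the result against an operand:

#include <stdint.h>

// adds a + b + *carry (carry in {0,1}); leaves the carry-out in *carry
static uint64_t add64_carry(uint64_t a, uint64_t b, unsigned *carry)
{
    uint64_t sum = a + b + *carry;           // wraps modulo 2^64
    // with carry-in 1, the add overflowed iff sum <= a;
    // with carry-in 0, it overflowed iff sum < a
    *carry = *carry ? (sum <= a) : (sum < a);
    return sum;
}

GCC and Clang also provide __builtin_add_overflow, which can be chained to the same effect.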
The "trick" is to use the native (or maybe larger) integer size.
#duskwuff is on the money to walk through number, multiple bytes at a time.
As with all "What is the fastest way ...", candidate solutions should be profiled.
Follows is a one type solution so one could use the largest type as well, the native type or any 1 type. e.g. uintmax_t or unsigned. The carry is partiality handled via code and carry generation depends on testing if addition will overflow.
#include <limits.h>  // UINT_MAX

typedef unsigned MyInt;
#define MyInt_MAX UINT_MAX

void Add(BigBinaryNumber *a, BigBinaryNumber *b, BigBinaryNumber *sum) {
    // Assume same size for a, b, sum.
    // Assume memory allocated for sum.
    // Assume a->size is a multiple of sizeof(MyInt).
    // Assume a->number endian is little and matches platform endian.
    // Assume a->number alignment matches MyInt alignment.
    unsigned int size = a->size;
    MyInt *ap = (MyInt *)a->number;
    MyInt *bp = (MyInt *)b->number;
    MyInt *sump = (MyInt *)sum->number;
    int carry = 0;

    while (size > 0) {
        size -= sizeof(MyInt);
        if (carry) {
            // carry must stay 1 when *bp == MyInt_MAX: *bp + 1 wraps to 0
            if (*bp != MyInt_MAX && *ap <= (MyInt_MAX - 1 - *bp)) {
                carry = 0;
            }
            *sump++ = *ap++ + *bp++ + 1;
        }
        else {
            if (*ap > (MyInt_MAX - *bp)) {
                carry = 1;
            }
            *sump++ = *ap++ + *bp++;
        }
    } // end while

    // Integer overflow/underflow handling not shown,
    // but depends on carry and the signs of a, b.
    // Two's complement sign considerations not shown.
}

Bitwise memmove

What is the best way to implement a bitwise memmove? The method should take additional destination and source bit offsets, and the count should be in bits too.
I saw that ARM provides a non-standard _membitmove, which does exactly what I need, but I couldn't find its source.
BIND's bitset includes isc_bitstring_copy, but it's not efficient.
I'm aware that the C standard library doesn't provide such a method, but I also couldn't find any third-party code providing a similar method.
Assuming "best" means "easiest", you can copy bits one by one. Conceptually, an address of a bit is an object (struct) that has a pointer to a byte in memory and an index of a bit in the byte.
struct pointer_to_bit
{
    uint8_t* p;
    int b;
};

void membitmovebl(
    void *dest,
    const void *src,
    int dest_offset,
    int src_offset,
    size_t nbits)
{
    // Create pointers to bits
    struct pointer_to_bit d = {dest, dest_offset};
    struct pointer_to_bit s = {(uint8_t*)src, src_offset};

    // Bring the bit offsets into range (0...7)
    d.p += d.b / 8; // replace division by right-shift if bit offset can be negative
    d.b %= 8;       // replace "%= 8" by "&= 7" if bit offset can be negative
    s.p += s.b / 8;
    s.b %= 8;

    // Determine whether it's OK to loop forward
    if (d.p < s.p || (d.p == s.p && d.b <= s.b))
    {
        // Copy bits one by one
        for (size_t i = 0; i < nbits; i++)
        {
            // Read 1 bit
            int bit = (*s.p >> s.b) & 1;

            // Write 1 bit
            *d.p &= ~(1 << d.b);
            *d.p |= bit << d.b;

            // Advance pointers
            if (++s.b == 8)
            {
                s.b = 0;
                ++s.p;
            }
            if (++d.b == 8)
            {
                d.b = 0;
                ++d.p;
            }
        }
    }
    else
    {
        // Copy stuff backwards - essentially the same code with ++ replaced by --
    }
}
If you want to write a version optimized for speed, you will have to do copying by bytes (or, better, words), unroll loops, and handle a number of special cases (memmove does that; you will have to do more because your function is more complicated).
P.S. Oh, seeing that you call isc_bitstring_copy inefficient, you probably want the speed optimization. You can use the following idea:
Start copying bits individually until the destination is byte-aligned (d.b == 0). Then, it is easy to copy 8 bits at once, doing some bit twiddling. Do this until there are less than 8 bits left to copy; then continue copying bits one by one.
// Copy 8 bits from s to d and advance pointers
*d.p = *s.p++ >> s.b;
*d.p++ |= *s.p << (8 - s.b);
P.P.S. Oh, and seeing your comment on what you are going to use the code for, you don't really need to implement all the versions (byte/halfword/word, big/little-endian); you only want the easiest one, the one working with words (uint32_t).
Here is a partial implementation (not tested). There are obvious efficiency and usability improvements.
Copy n bytes from src to dest (not overlapping src), shifting the bits at dest rightwards by bit bits, 0 <= bit <= 7. This assumes that the least significant bits are at the right of the bytes.
#include <string.h>

void memcpy_with_bitshift(unsigned char *dest, unsigned char *src, size_t n, int bit)
{
    size_t i;

    memcpy(dest, src, n);
    for (i = 0; i < n; i++) {
        dest[i] >>= bit;
    }
    for (i = 0; i < n; i++) {
        dest[i+1] |= (src[i] << (8 - bit));  // note: writes dest[n], so dest needs room for n+1 bytes
    }
}
Some improvements to be made:
Don't overwrite the first bit bits at the beginning of dest.
Merge the loops (a merged-loop sketch follows this list).
Have a way to copy a number of bits not divisible by 8.
Fix for chars wider than 8 bits.
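For the "merge loops" item, a quick sketch (mine, untested; it still overwrites the leading bits of dest[0] and drops the bits shifted out of the last byte, so the other improvements remain open):

#include <stddef.h>

void memcpy_with_bitshift_merged(unsigned char *dest, const unsigned char *src,
                                 size_t n, int bit)
{
    unsigned char carry = 0;  // bits shifted out of the previous source byte

    for (size_t i = 0; i < n; i++) {
        dest[i] = (unsigned char)((src[i] >> bit) | carry);
        carry = (unsigned char)(src[i] << (8 - bit));
    }
}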
