Bitwise XOR in C using 64bit instead of 8bits - c

I am looking for an efficient way to XOR two byte arrays.
The arrays are defined as unsigned char *.
I think that XORing them as uint64_t would be much faster. Is that true?
How can I efficiently convert unsigned char * to uint64_t *, preferably inside the XOR loop? And how should I pad the last bytes if the array length % 8 isn't 0?
Here is my current code, which XORs the byte arrays one byte (unsigned char) at a time:
unsigned char *bitwise_xor(const unsigned char *A_Bytes_Array, const unsigned char *B_Bytes_Array, const size_t length) {
unsigned char *XOR_Bytes_Array;
// allocate XORed bytes array
XOR_Bytes_Array = malloc(sizeof(unsigned char) * length);
// perform bitwise XOR operation on bytes arrays A and B
for(int i=0; i < length; i++)
XOR_Bytes_Array[i] = (unsigned char)(A_Bytes_Array[i] ^ B_Bytes_Array[i]);
return XOR_Bytes_Array;
}
OK, in the meantime I have tried to do it this way. My byte arrays are rather large (RGBA bitmaps, 4*1440*900?).
static uint64_t next64bitsFromBytesArray(const unsigned char *bytesArray, const int i) {
uint64_t next64bits = (uint64_t) bytesArray[i+7] | ((uint64_t) bytesArray[i+6] << 8) | ((uint64_t) bytesArray[i+5] << 16) | ((uint64_t) bytesArray[i+4] << 24) | ((uint64_t) bytesArray[i+3] << 32) | ((uint64_t) bytesArray[i+2] << 40) | ((uint64_t) bytesArray[i+1] << 48) | ((uint64_t)bytesArray[i] << 56);
return next64bits;
}
unsigned char *bitwise_xor64(const unsigned char *A_Bytes_Array, const unsigned char *B_Bytes_Array, const size_t length) {
unsigned char *XOR_Bytes_Array;
// allocate XORed bytes array
XOR_Bytes_Array = malloc(sizeof(unsigned char) * length);
// perform bitwise XOR operation on bytes arrays A and B using uint64_t
for(int i=0; i<length; i+=8) {
uint64_t A_Bytes = next64bitsFromBytesArray(A_Bytes_Array, i);
uint64_t B_Bytes = next64bitsFromBytesArray(B_Bytes_Array, i);
uint64_t XOR_Bytes = A_Bytes ^ B_Bytes;
memcpy(XOR_Bytes_Array + i, &XOR_Bytes, 8);
}
return XOR_Bytes_Array;
}
UPDATE: (2nd approach to this problem)
unsigned char *bitwise_xor64(const unsigned char *A_Bytes_Array, const unsigned char *B_Bytes_Array, const size_t length) {
const uint64_t *aBytes = (const uint64_t *) A_Bytes_Array;
const uint64_t *bBytes = (const uint64_t *) B_Bytes_Array;
unsigned char *xorBytes = malloc(sizeof(unsigned char)*length);
for(int i = 0, j=0; i < length; i +=8) {
uint64_t aXORbBytes = aBytes[j] ^ bBytes[j];
//printf("a XOR b = 0x%" PRIx64 "\n", aXORbBytes);
memcpy(xorBytes + i, &aXORbBytes, 8);
j++;
}
return xorBytes;
}

So I did an experiment:
#include <stdlib.h>
#include <stdint.h>
#ifndef TYPE
#define TYPE uint64_t
#endif
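TYPE *
xor(const void *va, const void *vb, size_t l)
{
    const TYPE *a = va;
    const TYPE *b = vb;
    TYPE *r = malloc(l);
    size_t i;
    if (r == NULL)
        return NULL;
    /* index r instead of advancing it, so the caller gets the start of the buffer */
    for (i = 0; i < l / sizeof(TYPE); i++) {
        r[i] = a[i] ^ b[i];
    }
    return r;
}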
Compiled both for uint64_t and uint8_t with clang with basic optimizations. In both cases the compiler vectorized the hell out of this. The difference was that the uint8_t version had code to handle when l wasn't a multiple of 8. So if we add code to handle the size not being a multiple of 8, you'll probably end up with equivalent generated code. Also, the 64 bit version unrolled the loop a few times and had code to handle that, so for big enough arrays you might gain a few percent here. On the other hand, on big enough arrays you'll be memory-bound and the xor operation won't matter a bit.
Are you sure your compiler won't deal with this? This is a kind of micro-optimization that makes sense only when you're measuring things and then you wouldn't need to ask which one is faster, you'd know.
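For the padding part of the question, here is a minimal sketch of one way to process eight bytes at a time without casting unsigned char * to uint64_t *. The function name xor_buffers is hypothetical and not from the question. It copies each group of eight bytes into a uint64_t with memcpy, which sidesteps alignment and aliasing concerns, and handles the last length % 8 bytes with a plain byte loop, so no padding is needed:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: XOR two buffers of equal length, eight bytes at a
 * time, with a byte-wise loop for the tail. Returns NULL on allocation
 * failure; the caller frees the result. */
unsigned char *xor_buffers(const unsigned char *a, const unsigned char *b,
                           size_t length)
{
    unsigned char *out = malloc(length ? length : 1);
    if (out == NULL)
        return NULL;

    size_t i = 0;
    /* Fixed-size memcpy calls are typically compiled to plain loads/stores. */
    for (; i + sizeof(uint64_t) <= length; i += sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);
        memcpy(&wb, b + i, sizeof wb);
        wa ^= wb;
        memcpy(out + i, &wa, sizeof wa);
    }
    /* Remaining 0..7 bytes: XOR them one by one. */
    for (; i < length; i++)
        out[i] = (unsigned char)(a[i] ^ b[i]);
    return out;
}
```

Because both loads and the store use the same native byte order, the result is byte-for-byte identical to the unsigned char loop on any platform. Whether it is actually faster than the simple loop still has to be measured.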

Related

How to do 1024-bit operations using arrays of uint64_t

I am trying to find a way to compute values that are of type uint1024_t (unsigned 1024-bit integer), by defining the 5 basic operations: plus, minus, times, divide, modulus.
The way that I can do that is by creating a structure that will have the following prototype:
typedef struct {
uint64_t chunk[16];
} uint1024_t;
Now since it is complicated to wrap my head around such operations with uint64_t as block size, I have first written some code for manipulating uint8_t. Here is what I came up with:
#define UINT8_HI(x) (x >> 4)
#define UINT8_LO(x) (((1 << 4) - 1) & x)
void uint8_add(uint8_t a, uint8_t b, uint8_t *res, int i) {
uint8_t s0, s1, s2;
uint8_t x = UINT8_LO(a) + UINT8_LO(b);
s0 = UINT8_LO(x);
x = UINT8_HI(a) + UINT8_HI(b) + UINT8_HI(x);
s1 = UINT8_LO(x);
s2 = UINT8_HI(x);
uint8_t result = s0 + (s1 << 4);
uint8_t carry = s2;
res[1 + i] = result;
res[0 + i] = carry;
}
void uint8_multiply(uint8_t a, uint8_t b, uint8_t *res, int i) {
uint8_t s0, s1, s2, s3;
uint8_t x = UINT8_LO(a) * UINT8_LO(b);
s0 = UINT8_LO(x);
x = UINT8_HI(a) * UINT8_LO(b) + UINT8_HI(x);
s1 = UINT8_LO(x);
s2 = UINT8_HI(x);
x = s1 + UINT8_LO(a) * UINT8_HI(b);
s1 = UINT8_LO(x);
x = s2 + UINT8_HI(a) * UINT8_HI(b) + UINT8_HI(x);
s2 = UINT8_LO(x);
s3 = UINT8_HI(x);
uint8_t result = s1 << 4 | s0;
uint8_t carry = s3 << 4 | s2;
res[1 + i] = result;
res[0 + i] = carry;
}
And it seems to work just fine, however I am unable to define the same operations for division, subtraction and modulus...
Furthermore, I can't see how to apply the same principle to my custom uint1024_t structure, even though it should be nearly identical, with a few more lines of code to manage overflows.
I would really appreciate some help in implementing the 5 basic operations for my structure.
EDIT:
I have answered below with my implementation for resolving this problem.
find a way to compute ... the 5 basic operations: plus, minus, times, divide, modulus.
If uint1024_t used uint32_t, it would be easier.
I would recommend 1) half the width of the widest type uintmax_t, or 2) unsigned, whichever is smaller. E.g. 32-bit.
(Also consider something other than uintN_t to avoid collisions with future versions of C.)
typedef struct {
uint32_t chunk[1024/32];
} u1024;
Example of some untested code to give OP an idea of how using uint32_t simplifies the task.
void u1024_mult(u1024 *product, const u1024 *a, const u1024 *b) {
memset(product, 0, sizeof product[0]);
unsigned n = sizeof product->chunk / sizeof product->chunk[0];
for (unsigned ai = 0; ai < n; ai++) {
uint64_t acc = 0;
uint32_t m = a->chunk[ai];
for (unsigned bi = 0; ai + bi < n; bi++) {
acc += (uint64_t) m * b->chunk[bi] + product->chunk[ai + bi];
product->chunk[ai + bi] = (uint32_t) acc;
acc >>= 32;
}
}
}
+, - are quite similar to the above.
/, % could be combined into one routine that computes the quotient and remainder together.
These functions are not hard to post here, as it really is the same as grade-school math, only in base 2^32 instead of base 10. I am against posting them, though, as it is a fun exercise to do oneself.
I hope the * sample code above inspires rather than answers.
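To illustrate the remark that + and - follow the same pattern, here is a small sketch of addition and subtraction with 32-bit limbs. The names U1024_LIMBS, u1024_add and u1024_sub are assumptions made for this example, and chunk[0] is taken to be the least significant limb, as in the multiplication code above:

```c
#include <stdint.h>

#define U1024_LIMBS (1024 / 32)

typedef struct {
    uint32_t chunk[U1024_LIMBS]; /* chunk[0] is the least significant limb */
} u1024;

/* sum = a + b (mod 2^1024); returns the final carry (0 or 1). */
uint32_t u1024_add(u1024 *sum, const u1024 *a, const u1024 *b)
{
    uint64_t acc = 0;
    for (unsigned i = 0; i < U1024_LIMBS; i++) {
        acc += (uint64_t)a->chunk[i] + b->chunk[i];
        sum->chunk[i] = (uint32_t)acc; /* low 32 bits are the digit */
        acc >>= 32;                    /* high bits are the carry */
    }
    return (uint32_t)acc;
}

/* diff = a - b (mod 2^1024); returns the final borrow (0 or 1). */
uint32_t u1024_sub(u1024 *diff, const u1024 *a, const u1024 *b)
{
    uint32_t borrow = 0;
    for (unsigned i = 0; i < U1024_LIMBS; i++) {
        uint64_t t = (uint64_t)a->chunk[i] - b->chunk[i] - borrow;
        diff->chunk[i] = (uint32_t)t;
        borrow = (uint32_t)(t >> 63); /* top bit set means we wrapped below zero */
    }
    return borrow;
}
```

A nonzero return value signals overflow (for addition) or a negative result (for subtraction), which is also a convenient way to compare two values.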
There are some problems with your implementation for uint8_t arrays:
you did not parenthesize the macro arguments in the expansion. This is very error prone as it may cause unexpected operator precedence problems if the arguments are expressions. You should write:
#define UINT8_HI(x) ((x) >> 4)
#define UINT8_LO(x) (((1 << 4) - 1) & (x))
storing the array elements with the most significant part first is counter intuitive. Multi-precision arithmetics usually represents the large values as arrays with the least significant part first.
for a small type such as uint8_t, there is no need to split it into halves as larger types are available. Furthermore, you must propagate the carry from the previous addition. Here is a much simpler implementation for the addition:
void uint8_add(uint8_t a, uint8_t b, uint8_t *res, int i) {
uint16_t result = a + b + res[i + 0]; // add previous carry
res[i + 0] = (uint8_t)result;
res[i + 1] = (uint8_t)(result >> 8); // assuming res has at least i+2 elements and is initialized to 0
}
for the multiplication, you must add the result of multiplying each part of each number to the appropriately chosen parts of the result number, propagating the carry to the higher parts.
Division is more difficult to implement efficiently. I recommend you study an open source multi-precision package such as QuickJS' libbf.c.
To transpose this to arrays of uint64_t, you can use unsigned 128-bit integer types if available on your platform (gcc and clang provide __uint128_t as an extension on 64-bit targets).
Here is a simple implementation for the addition and multiplication:
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#define NB_CHUNK 16
typedef __uint128_t uint128_t;
typedef struct {
uint64_t chunk[NB_CHUNK];
} uint1024_t;
void uint1024_add(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
uint128_t result = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
result += (uint128_t)a->chunk[i] + b->chunk[i];
dest->chunk[i] = (uint64_t)result;
result >>= CHAR_BIT * sizeof(uint64_t);
}
}
void uint1024_multiply(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
for (size_t i = 0; i < NB_CHUNK; i++)
dest->chunk[i] = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
uint128_t result = 0;
for (size_t j = 0, k = i; k < NB_CHUNK; j++, k++) {
result += (uint128_t)a->chunk[i] * b->chunk[j] + dest->chunk[k];
dest->chunk[k] = (uint64_t)result;
result >>= CHAR_BIT * sizeof(uint64_t);
}
}
}
If 128-bit integers are not available, your 1024-bit type could be implemented as an array of 32-bit integers. Here is a flexible implementation with selectable types for the array elements and the intermediary result:
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#if 1 // if platform has 128 bit integers
typedef uint64_t type1;
typedef __uint128_t type2;
#else
typedef uint32_t type1;
typedef uint64_t type2;
#endif
#define TYPE1_BITS (CHAR_BIT * sizeof(type1))
#define NB_CHUNK (1024 / TYPE1_BITS)
typedef struct uint1024_t {
type1 chunk[NB_CHUNK];
} uint1024_t;
void uint1024_add(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
type2 result = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
result += (type2)a->chunk[i] + b->chunk[i];
dest->chunk[i] = (type1)result;
result >>= TYPE1_BITS;
}
}
void uint1024_multiply(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
for (size_t i = 0; i < NB_CHUNK; i++)
dest->chunk[i] = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
type2 result = 0;
for (size_t j = 0, k = i; k < NB_CHUNK; j++, k++) {
result += (type2)a->chunk[i] * b->chunk[j] + dest->chunk[k];
dest->chunk[k] = (type1)result;
result >>= TYPE1_BITS;
}
}
}
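Full multi-limb division is beyond a short example, but the simplest case, dividing by a single 32-bit value, may help as a starting point (it is also what is needed to print the number in decimal). This is only a sketch under assumptions: it uses 32-bit limbs stored least significant first, and the names big1024 and big1024_divmod_small are made up for this illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define LIMBS (1024 / 32)

typedef struct {
    uint32_t chunk[LIMBS]; /* least significant limb first */
} big1024;

/* quot = a / d; returns a % d. The divisor d must be nonzero.
 * This is schoolbook short division in base 2^32, working from the most
 * significant limb down and carrying the remainder into the next limb. */
uint32_t big1024_divmod_small(big1024 *quot, const big1024 *a, uint32_t d)
{
    uint64_t rem = 0;
    for (size_t i = LIMBS; i-- > 0; ) {
        uint64_t cur = (rem << 32) | a->chunk[i]; /* rem < d, so cur < d * 2^32 */
        quot->chunk[i] = (uint32_t)(cur / d);     /* therefore fits in 32 bits */
        rem = cur % d;
    }
    return (uint32_t)rem;
}
```

Calling it repeatedly with d = 10 (or 1000000000 for nine digits at a time) yields the decimal digits from least to most significant.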

reading big-endian files in little-endian system

I have a data file that I need to read in C. It is comprised of alternating 16-bit integers stored in binary form, and I need only the first column (i.e., every other entry starting at 0).
I have a simple python script that reads the files accurately:
import numpy as np
fname = '[filename]'
columntypes = np.dtype([('curr_pA', '>i2'),('volts', '>i2')])
test = np.memmap(fname, dtype=columntypes,mode='r')['curr_pA']
I want to port this to C. Because my machine is natively little-endian I need to manually perform the byte swap. Here's what I have done:
void swapByteOrder_int16(double *current, int16_t *rawsignal, int64_t length)
{
int64_t i;
for (i=0; i<length; i++)
{
current[i] = ((rawsignal[2*i] << 8) | ((rawsignal[2*i] >> 8) & 0xFF));
}
}
int64_t read_current_int16(FILE *input, double *current, int16_t *rawsignal, int64_t position, int64_t length)
{
int64_t test;
int64_t read = 0;
if (fseeko64(input,(off64_t) position*2*sizeof(int16_t),SEEK_SET))
{
return 0;
}
test = fread(rawsignal, sizeof(int16_t), 2*length, input);
read = test/2;
if (test != 2*length)
{
perror("End of file reached");
}
swapByteOrder_int16(current, rawsignal, length);
return read;
}
In the read_current_int16 function I use fread to read a large chunk of data (both columns) into rawsignal array. I then call swapByteOrder_int16 to pick off every other value, and swap its bytes around. I then cast the result to double and store it in current.
It doesn't work. I get garbage as the output in the C code. I think I've been staring at it for too long and can no longer see my own errors. Can anyone spot anything glaringly wrong?
Perform the endian swap as unsigned math and then assign to double.
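void swapByteOrder_int16(double *current, const int16_t *rawsignal, size_t length) {
    for (size_t i = 0; i < length; i++) {
        uint16_t u = (uint16_t)rawsignal[2*i];  // work on the unsigned 16-bit pattern
        u = (uint16_t)((u << 8) | (u >> 8));    // swap the two bytes
        current[i] = (int16_t)u;                // reinterpret as signed, then widen to double
    }
}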
I prefer this mask and shift combination:
current[i] = (int16_t)(((rawsignal[2*i] & 0x00ff) << 8) | ((rawsignal[2*i] >> 8) & 0x00ff));
As suggested by several people, doing the shifts as unsigned does the trick. I am answering this with my implementation just for the sake of completeness since I tweaked it a little from the accepted answer:
union int16bits
{
    uint16_t bits;
    int16_t currentval;
};
void swapByteOrder_int16(double *current, uint16_t *rawsignal, int64_t length)
{
    union int16bits bitval;
    int64_t i;
    for (i=0; i<length; i++)
    {
        bitval.bits = rawsignal[2*i];
        bitval.bits = (uint16_t)((bitval.bits << 8) | (bitval.bits >> 8));
        current[i] = (double) bitval.currentval;
    }
}
Swapping bytes with unsigned types will make things much easier:
void swapByteOrder_int16(double *current, void const *rawsignal_, size_t length)
{
uint16_t const *rawsignal = rawsignal_;
size_t i;
for (i=0; i<length; i++)
{
uint16_t tmp = rawsignal[2*i];
tmp = ((tmp >> 8) & 0xffu) | ((tmp << 8) & 0xff00u);
current[i] = (int16_t)(tmp);
}
}
NOTE: when rawsignal is not aligned, you have to memcpy() it.
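One way to avoid both the alignment concern and any dependence on the host byte order is to read the file into a plain byte buffer and assemble each value from its two bytes. The sketch below assumes the layout described in the question (pairs of big-endian 16-bit values, 4 bytes per pair); the function name decode_current_be16 is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the first big-endian int16 of each
 * (current, volts) pair from a raw byte buffer holding `length` pairs. */
void decode_current_be16(double *current, const unsigned char *raw, size_t length)
{
    for (size_t i = 0; i < length; i++) {
        const unsigned char *p = raw + 4 * i;        /* start of pair i */
        uint16_t u = (uint16_t)((p[0] << 8) | p[1]); /* high byte comes first */
        /* Map the 16-bit pattern to a signed value without relying on
         * implementation-defined narrowing conversions. */
        int32_t v = (u & 0x8000u) ? (int32_t)u - 65536 : (int32_t)u;
        current[i] = (double)v;
    }
}
```

Since the bytes are combined arithmetically, the same code gives the same result on little-endian and big-endian hosts, and no separate swap step is needed.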

Extracting 3 bytes to a number

What is the FASTEST way, using bit operators to return the number, represented with 3 different unsigned char variables ?
unsigned char byte1 = 200;
unsigned char byte2 = 40;
unsigned char byte3 = 33;
unsigned long number = byte1 + byte2 * 256 + byte3 * 256 * 256;
is the slowest way possible.
Just shift each one into place, and OR them together:
#include <stdint.h>
#include <stdio.h>
int main(void)
{
    uint8_t a = 0xAB, b = 0xCD, c = 0xEF;
    /*
     * 'a' must be first cast to uint32_t because of the implicit conversion
     * to int, which is only guaranteed to be at least 16 bits.
     * (Thanks Matt McNabb and Tim Čas.)
     */
    uint32_t i = ((uint32_t)a << 16) | ((uint32_t)b << 8) | c;
    printf("0x%lX\n", (unsigned long)i);
    return 0;
}
Do note however, that almost any modern compiler will replace a multiplication by a power of two with a bit-shift of the appropriate amount.
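As a quick check of that equivalence, here is a sketch that puts the shift-and-OR form next to the arithmetic form from the question. The function names are made up for this example, and the byte order follows the question (the first byte is the least significant):

```c
#include <stdint.h>

/* Shift-and-OR form; lo is the least significant byte, as in the question. */
uint32_t combine3_shift(uint8_t lo, uint8_t mid, uint8_t hi)
{
    return (uint32_t)lo | ((uint32_t)mid << 8) | ((uint32_t)hi << 16);
}

/* Arithmetic form from the question, with explicit 32-bit unsigned math. */
uint32_t combine3_mul(uint8_t lo, uint8_t mid, uint8_t hi)
{
    return (uint32_t)lo + (uint32_t)mid * 256u + (uint32_t)hi * 256u * 256u;
}
```

Both functions return the same value for every input, so the choice between them is mostly about readability; comparing the generated assembly is the reliable way to see whether a given compiler treats them identically.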
The fastest way would be the direct memory writing, assuming you know the endian of your system (here the assumption is little endian):
unsigned char byte1 = 200;
unsigned char byte2 = 40;
unsigned char byte3 = 33;
unsigned long number = 0;
((unsigned char*)&number)[0] = byte1;
((unsigned char*)&number)[1] = byte2;
((unsigned char*)&number)[2] = byte3;
Or, if you don't mind doing some exercise, you can do something like:
union
{
unsigned long ulongVal;
unsigned char chars[4]; // In case your long is 32bits
} a;
and then by assigning:
a.chars[0] = byte1;
a.chars[1] = byte2;
a.chars[2] = byte3;
a.chars[3] = 0;
you will read the final value from a.ulongVal. This will spare extra memory operations.

get16bits macro in hash function

I was looking at hash functions the other day and came across a website that had an example of one. Most of the code was easy to grasp, however this macro function I can't really wrap my head around.
Could someone breakdown what's going on here?
#define get16bits(d) ((((uint32_t)(((const uint8_t *)(d))[1])) << 8) +(uint32_t)(((const uint8_t *)(d))[0]))
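Basically, it reads the two bytes at address d and combines them as a little-endian 16-bit value. On a little-endian machine, that is the lower 16 bits of the 32-bit integer that d points to.
Let's break it down: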
#define get16bits(d) ((((uint32_t)(((const uint8_t *)(d))[1])) << 8) +(uint32_t)(((const uint8_t *)(d))[0]))
uint32_t a = 0x12345678;
uint16_t b = get16bits(&a); // b == 0x00005678
First, we must pass the address of a to get16bits(), or it will not work.
((uint32_t)(((const uint8_t *)(d))[1])) << 8
This treats the 32-bit integer as an array of 8-bit integers and retrieves the second one (index 1).
It then shifts that value left by 8 bits and adds the lowest 8 bits (index 0) to it:
+ (uint32_t)(((const uint8_t *)(d))[0]))
In our example, on a little-endian machine, it will be:
const uint8_t *tmp = (const uint8_t *)&a;
uint32_t result;
result = tmp[1] << 8; // 0x00005600
result += tmp[0]; //tmp[0] == 0x78
// result is now 0x00005678
The macro is more or less equivalent to:
static uint32_t get16bits(SOMETYPE *d)
{
unsigned char temp[ sizeof *d];
uint32_t val;
memcpy(temp, d, sizeof *d);
val = (temp[1] << 8)
+ temp[0];
return val;
}
, but the macro argument has no type, and the function argument does.
Another way would be to actually cast:
static uint32_t get16bits(SOMETYPE *d)
{
unsigned char *cp = (unsigned char*) d;
uint32_t val;
val = (cp[1] << 8)
+ cp[0];
return val;
}
, which also shows the weakness: by indexing with 1, the code assumes that sizeof (*d) is at least 2.
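To make the byte order explicit, here is a small sketch that applies the macro to a two-byte array instead of an integer, so the result does not depend on the host's endianness. The function get16bits_fn is a hypothetical function form of the same read, added only for comparison:

```c
#include <stdint.h>

#define get16bits(d) ((((uint32_t)(((const uint8_t *)(d))[1])) << 8) \
                      + (uint32_t)(((const uint8_t *)(d))[0]))

/* Hypothetical function form: byte 0 is the low half, byte 1 the high half. */
uint32_t get16bits_fn(const void *d)
{
    const uint8_t *p = d;
    return ((uint32_t)p[1] << 8) + p[0];
}
```

Given the bytes 0x78, 0x56 in that order, both forms produce 0x5678, which is why the macro is best described as a little-endian 16-bit load from an arbitrary address.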

Converting Char array to Long in C

This question may look silly, but please guide me.
I have a function to convert long data to char array
void ConvertLongToChar(char *pSrc, char *pDest)
{
pDest[0] = pSrc[0];
pDest[1] = pSrc[1];
pDest[2] = pSrc[2];
pDest[3] = pSrc[3];
}
And I call the above function like this
long lTemp = (long) (fRxPower * 1000);
ConvertLongToChar ((char *)&lTemp, pBuffer);
Which works fine.
I need a similar function to reverse the procedure. Convert char array to long.
I cannot use atol or similar functions.
You can do:
union {
unsigned char c[4];
long l;
} conv;
conv.l = 0xABC;
and access c[0] c[1] c[2] c[3]. This is good as it wastes no memory and is very fast because there is no shifting or any assignment besides the initial one and it works both ways.
Leaving the burden of matching the endianness with your other function to you, here's one way:
unsigned long int l = (unsigned long)pdest[0] | ((unsigned long)pdest[1] << 8) | ((unsigned long)pdest[2] << 16) | ((unsigned long)pdest[3] << 24);
Just to be safe, here's the corresponding other direction:
unsigned char pdest[4];
unsigned long int l;
pdest[0] = l & 0xFF;
pdest[1] = (l >> 8) & 0xFF;
pdest[2] = (l >> 16) & 0xFF;
pdest[3] = (l >> 24) & 0xFF;
Going from char[4] to long and back is entirely reversible; going from long to char[4] and back is reversible for values up to 2^32-1.
Note that all this is only well-defined for unsigned types.
(My example is little endian if you read pdest from left to right.)
Addendum: I'm also assuming that CHAR_BIT == 8. In general, substitute multiples of 8 by multiples of CHAR_BIT in the code.
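The two directions can be wrapped into a pair of helpers. This is a sketch only: it uses uint32_t rather than long so the width is fixed at four bytes, assumes CHAR_BIT == 8 as noted above, and the function names are invented for the example:

```c
#include <stdint.h>

/* Store v into out[0..3], least significant byte first. */
void u32_to_bytes_le(uint32_t v, unsigned char out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (unsigned char)((v >> (8 * i)) & 0xFFu);
}

/* Rebuild the value from in[0..3], least significant byte first. */
uint32_t bytes_le_to_u32(const unsigned char in[4])
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v |= (uint32_t)in[i] << (8 * i); /* cast first so the shift is unsigned */
    return v;
}
```

Because the byte order is fixed by the code rather than by the machine, a buffer written on one platform can be read back on another with the same result.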
A simple way would be to use memcpy:
char * buffer = ...;
long l;
memcpy(&l, buffer, sizeof(long));
That does not take endianness into account, however, so beware if you have to share data between multiple computers.
If you mean to treat sizeof (long) bytes memory as a single long, then you should do the below:
char char_arr[sizeof(long)];
long l;
memcpy (&l, char_arr, sizeof (long));
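This can also be done by assembling the long from its bytes with shifts and ORs, as below. The bytes are converted to unsigned values first so that sign extension and signed overflow cannot corrupt the result.
unsigned long u = 0;
u |= (unsigned long)(unsigned char)char_arr[0];
u |= (unsigned long)(unsigned char)char_arr[1] << 8;
u |= (unsigned long)(unsigned char)char_arr[2] << 16;
u |= (unsigned long)(unsigned char)char_arr[3] << 24;
l = (long)u;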
If you mean to convert "1234\0" string into 1234L then you should
l = strtol (char_arr, NULL, 10); /* to interpret the base as decimal */
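For the text-to-number case, strtol can also report bad input, which a bare call with NULL as the second argument hides. Here is a minimal sketch of a checked wrapper; the name parse_long and its return convention are assumptions for this example:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical wrapper: returns 0 on success and stores the value in *out;
 * returns -1 if the text is empty, has trailing characters, or is out of
 * range for long. */
int parse_long(const char *text, long *out)
{
    char *end;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE)
        return -1;
    *out = value;
    return 0;
}
```

The end pointer shows where parsing stopped, so comparing it with the start of the string detects empty input, and checking that it points at the terminating null detects trailing junk.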
Does this work:
#include <stdio.h>
#include <string.h>
long ConvertCharToLong(const char *pSrc) {
    size_t i = 1;
    long result = pSrc[0] - '0';
    while (i < strlen(pSrc)) {
        result = result * 10 + (pSrc[i] - '0');
        ++i;
    }
    return result;
}
int main(void) {
    const char *str = "34878";
    printf("The answer is %ld", ConvertCharToLong(str)); /* %ld matches the long return type */
    return 0;
}
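This is dirty (it relies on the array being suitably aligned and ignores strict-aliasing rules), but it works on common platforms: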
unsigned char myCharArray[8];
// Put some data in myCharArray here...
long long integer = *((long long*) myCharArray);
char charArray[8]; //ideally, zero initialise
unsigned long long int combined = *(unsigned long long int *) &charArray[0];
Be wary of strings that are null terminated, as you will end up copying any bytes beyond the null terminator into combined; thus in the above assignment, charArray needs to be fully zero-initialised for a "clean" conversion.
Just found this having tried more than one of the above to no avail :=( :
char * vIn = "0";
long vOut = strtol(vIn,NULL,10);
Worked perfectly for me.
To give credit where it is due, this is where I found it:
https://www.convertdatatypes.com/Convert-char-Array-to-long-in-C.html

Resources