fetch 32bit instruction from binary file in C - c

I need to read 32bit instructions from a binary file.
so what i have right now is:
unsigned char buffer[4];
fread(buffer,sizeof(buffer),1,file);
which will put 4 bytes in an array
how should I approach that to connect those 4 bytes together in order to process 32bit instruction later?
Or should I even start in a different way and not use fread?
my weird method right now is to create an array of ints of size 32 and the fill it with bits from buffer array

The answer depends on how the 32-bit integer is stored in the binary file. (I'll assume that the integer is unsigned, because it really is an id, and use the type uint32_t from <stdint.h>.)
Native byte order The data was written out as integer on this machine. Just read the integer with fread:
uint32_t op;
fread(&op, sizeof(op), 1, file);
Rationale: fread read the raw representation of the integer into memory. The matching fwrite does the reverse: It writes the raw representation to thze file. If you don't need to exchange the file between platforms, this is a good method to store and read data.
Little-endian byte order The data is stored as four bytes, least significant byte first:
uint32_t op = 0u;
op |= getc(file); // 0x000000AA
op |= getc(file) << 8; // 0x0000BBaa
op |= getc(file) << 16; // 0x00CCbbaa
op |= getc(file) << 24; // 0xDDccbbaa
Rationale: getc reads a char and returns an integer between 0 and 255. (The case where the stream runs out and getc returns the negative value EOF is not considered here for brevity, viz laziness.) Build your integer by shifting each byte you read by multiples of 8 and or them with the existing value. The comments sketch how it works. The capital letters are being read, the lower-case letters were already there. Zeros have not yet been assigned.
Big-endian byte order The data is stored as four bytes, least significant byte last:
uint32_t op = 0u;
op |= getc(file) << 24; // 0xAA000000
op |= getc(file) << 16; // 0xaaBB0000
op |= getc(file) << 8; // 0xaabbCC00
op |= getc(file); // 0xaabbccDD
Rationale: Pretty much the same as above, only that you shift the bytes in another order.
You can imagine little-endian and big-endian as writing the number one hundred and twenty tree (CXXIII) as either 321 or 123. The bit-shifting is similar to shifting decimal digtis when dividing by or multiplying with powers of 10, only that you shift my 8 bits to multiply with 2^8 = 256 here.

Add
unsigned int instruction;
memcpy(&instruction,buffer,4);
to your code. This will copy the 4 bytes of buffer to a single 32-bit variable. Hence you will get connected 4 bytes :)

If you know that the int in the file is the same endian as the machine the program's running on, then you can read straight into the int. No need for a char buffer.
unsigned int instruction;
fread(&instruction,sizeof(instruction),1,file);
If you know the endianness of the int in the file, but not the machine the program's running on, then you'll need to add and shift the bytes together.
unsigned char buffer[4];
unsigned int instruction;
fread(buffer,sizeof(buffer),1,file);
//big-endian
instruction = (buffer[0]<<24) + (buffer[1]<<16) + (buffer[2]<<8) + buffer[3];
//little-endian
instruction = (buffer[3]<<24) + (buffer[2]<<16) + (buffer[1]<<8) + buffer[0];
Another way to think of this is that it's a positional number system in base-256. So just like you combine digits in a base-10.
257
= 2*100 + 5*10 + 7
= 2*10^2 + 5*10^1 + 7*10^0
So you can also combine them using Horner's rule.
//big-endian
instruction = ((((buffer[0]*256) + buffer[1]*256) + buffer[2]*256) + buffer[3]);
//little-endian
instruction = ((((buffer[3]*256) + buffer[2]*256) + buffer[1]*256) + buffer[0]);

#luser droog
There are two bugs in your code.
The size of the variable "instruction" must not be 4 bytes: for example, Turbo C assumes sizeof(int) to be 2. Obviously, your program fails in this case. But, what is much more important and not so obvious: your program will also fail in case sizeof(int) be more than 4 bytes! To understand this, consider the following example:
int main()
{ const unsigned char a[4] = {0x21,0x43,0x65,0x87};
const unsigned char* p = &a;
unsigned long x = (((((p[3] << 8) + p[2]) << 8) + p[1]) << 8) + p[0];
printf("%08lX\n", x);
return 0;
}
This program prints "FFFFFFFF87654321" under amd64, because an unsigned char variable becomes SIGNED INT when it is used! So, changing the type of the variable "instruction" from "int" to "long" does not solve the problem.
The only way is to write something like:
unsigned long instruction;
instruction = 0;
for (int i = 0, unsigned char* p = buffer + 3; i < 4; i++, p--) {
instruction <<= 8;
instruction += *p;
}

Related

Bitwise operation in C language (0x80, 0xFF, << )

I have a problem understanding this code. What I know is that we have passed a code into a assembler that has converted code into "byte code". Now I have a Virtual machine that is supposed to read this code. This function is supposed to read the first byte code instruction. I don't understand what is happening in this code. I guess we are trying to read this byte code but don't understand how it is done.
static int32_t bytecode_to_int32(const uint8_t *bytecode, size_t size)
{
int32_t result;
t_bool sign;
int i;
result = 0;
sign = (t_bool)(bytecode[0] & 0x80);
i = 0;
while (size)
{
if (sign)
result += ((bytecode[size - 1] ^ 0xFF) << (i++ * 8));
else
result += bytecode[size - 1] << (i++ * 8);
size--;
}
if (sign)
result = ~(result);
return (result);
}
This code is somewhat badly written, lots of operations on a single line and therefore containing various potential bugs. It looks brittle.
bytecode[0] & 0x80 Simply reads the MSB sign bit, assuming it's 2's complement or similar, then converts it to a boolean.
The loop iterates backwards from most significant byte to least significant.
If the sign was negative, the code will perform an XOR of the data byte with 0xFF. Basically inverting all bits in the data. The result of the XOR is an int.
The data byte (or the result of the above XOR) is then bit shifted i * 8 bits to the left. The data is always implicitly promoted to int, so in case i * 8 happens to give a result larger than INT_MAX, there's a fat undefined behavior bug here. It would be much safer practice to cast to uint32_t before the shift, carry out the shift, then convert to a signed type afterwards.
The resulting int is converted to int32_t - these could be the same type or different types depending on system.
i is incremented by 1, size is decremented by 1.
If sign was negative, the int32_t is inverted to some 2's complement negative number that's sign extended and all the data bits are inverted once more. Except all zeros that got shifted in with the left shift are also replaced by ones. If this is intentional or not, I cannot tell. So for example if you started with something like 0x0081 you now have something like 0xFFFF01FF. How that format makes sense, I have no idea.
My take is that the bytecode[size - 1] ^ 0xFF (which is equivalent to ~) was made to toggle the data bits, so that they would later toggle back to their original values when ~ is called later. A programmer has to document such tricks with comments, if they are anything close to competent.
Anyway, don't use this code. If the intention was merely to swap the byte order (endianess) of a 4 byte integer, then this code must be rewritten from scratch.
That's properly done as:
static int32_t big32_to_little32 (const uint8_t* bytes)
{
uint32_t result = (uint32_t)bytes[0] << 24 |
(uint32_t)bytes[1] << 16 |
(uint32_t)bytes[2] << 8 |
(uint32_t)bytes[3] << 0 ;
return (int32_t)result;
}
Anything more complicated than the above is highly questionable code. We need not worry about signs being a special case, the above code preserves the original signedness format.
So the A^0xFF toggles the bits set in A, so if you have 10101100 xored with 11111111.. it will become 01010011. I am not sure why they didn't use ~ here. The ^ is a xor operator, so you are xoring with 0xFF.
The << is a bitshift "up" or left. In other words, A<<1 is equivalent to multiplying A by 2.
the >> moves down so is equivalent to bitshifting right, or dividing by 2.
The ~ inverts the bits in a byte.
Note it's better to initialise variables at declaration it costs no additional processing whatsoever to do it that way.
sign = (t_bool)(bytecode[0] & 0x80); the sign in the number is stored in the 8th bit (or position 7 counting from 0), which is where the 0x80 is coming from. So it's literally checking if the signed bit is set in the first byte of bytecode, and if so then it stores it in the sign variable.
Essentially if it's unsigned then it's copying the bytes from from bytecode into result one byte at a time.
If the data is signed then it flips the bits then copies the bytes, then when it's done copying, it flips the bits back.
Personally with this kind of thing i prefer to get the data, stick in htons() format (network byte order) and then memcpy it to an allocated array, store it in a endian agnostic way, then when i retrieve the data i use ntohs() to convert it back to the format used by the computer. htons() and ntohs() are standard C functions and are used in networking and platform agnostic data formatting / storage / communication all the time.
This function is a very naive version of the function which converts form the big endian to little endian.
The parameter size is not needed as it works only with the 4 bytes data.
It can be much easier archived by the union punning (and it allows compilers to optimize it - in this case to the simple instruction):
#define SWAP(a,b,t) do{t c = (a); (a) = (b); (b) = c;}while(0)
int32_t my_bytecode_to_int32(const uint8_t *bytecode)
{
union
{
int32_t i32;
uint8_t b8[4];
}i32;
uint8_t b;
i32.b8[3] = *bytecode++;
i32.b8[2] = *bytecode++;
i32.b8[1] = *bytecode++;
i32.b8[0] = *bytecode++;
return i32.i32;
}
int main()
{
union {
int32_t i32;
uint8_t b8[4];
}i32;
uint8_t b;
i32.i32 = -4567;
SWAP(i32.b8[0], i32.b8[3], uint8_t);
SWAP(i32.b8[1], i32.b8[2], uint8_t);
printf("%d\n", bytecode_to_int32(i32.b8, 4));
i32.i32 = -34;
SWAP(i32.b8[0], i32.b8[3], uint8_t);
SWAP(i32.b8[1], i32.b8[2], uint8_t);
printf("%d\n", my_bytecode_to_int32(i32.b8));
}
https://godbolt.org/z/rb6Na5
If the purpose of the code is to sign-extend a 1-, 2-, 3-, or 4-byte sequence in network/big-endian byte order to a signed 32-bit int value, it's doing things the hard way and reimplementing the wheel along the way.
This can be broken down into a three-step process: convert the proper number of bytes to a 32-bit integer value, sign-extend bytes out to 32 bits, then convert that 32-bit value from big-endian to the host's byte order.
The "wheel" being reimplemented in this case is the the POSIX-standard ntohl() function that converts a 32-bit unsigned integer value in big-endian/network byte order to the local host's native byte order.
The first step I'd do is to convert 1, 2, 3, or 4 bytes into a uint32_t:
#include <stdint.h>
#include <limits.h>
#include <arpa/inet.h>
#include <errno.h>
// convert the `size` number of bytes starting at the `bytecode` address
// to a uint32_t value
static uint32_t bytecode_to_uint32( const uint8_t *bytecode, size_t size )
{
uint32_t result = 0;
switch ( size )
{
case 4:
result = bytecode[ 0 ] << 24;
case 3:
result += bytecode[ 1 ] << 16;
case 2:
result += bytecode[ 2 ] << 8;
case 1:
result += bytecode[ 3 ];
break;
default:
// error handling here
break;
}
return( result );
}
Then, sign-extend it (borrowing from this answer):
static uint32_t sign_extend_uint32( uint32_t in, size_t size );
{
if ( size == 4 )
{
return( in );
}
// being pedantic here - the existence of `[u]int32_t` pretty
// much ensures 8 bits/byte
size_t bits = size * CHAR_BIT;
uint32_t m = 1U << ( bits - 1 );
uint32_t result = ( in ^ m ) - m;
return ( result );
}
Put it all together:
static int32_t bytecode_to_int32( const uint8_t *bytecode, size_t size )
{
uint32_t result = bytecode_to_uint32( bytecode, size );
result = sign_extend_uint32( result, size );
// set endianness from network/big-endian to
// whatever this host's endianness is
result = ntohl( result );
// converting uint32_t here to signed int32_t
// can be subject to implementation-defined
// behavior
return( result );
}
Note that the conversion from uint32_t to int32_t implicitly performed by the return statement in the above code can result in implemenation-defined behavior as there can be uint32_t values that can not be mapped to int32_t values. See this answer.
Any decent compiler should optimize that well into inline functions.
I personally think this also needs much better error handling/input validation.

C - Writing integers to binary file using only 3 bytes

I have a programm in C that writes a frequency table to a binary file.
The frequency table is an array filled with structs that contains an int and a char.
So I have to write an unsigned int counter and an unsigned char character to the file (multiple times).
I know that an integer normally uses 4 bytes however I know that the int counter can never be bigger than 2^24-1.
So I could use 4 bytes to write the counter and the character to the file => 3 bytes for counter and 1 byte for the character. This would also be easy to read.
Is there an easy way to do this in C without using special libraries?
Yes, there is a very easy way of doing it in C. You can combine a char, which is one byte on all platforms, with an int of up to 24 bits in size by shifting the char by 24 bits to the left:
uint32_t toWrite = (myChar << 24) | myCount;
When you read the data back, perform the opposite operation:
uint32_t fromFile;
uint32_t myCount = fromFile & 0xFFFFFF;
char myChar = (fromFile >> 24) & 0xFF;

converting 2 byte hex numbers back to a decimal

The code below, produces the following output on a serial console
[42][25][f][27][0][0].
My question is - if just had the serial output - how would you figure out that the number was 9999? How does the maths work? I think it has something to do with little endian?
int a = 9999;
buf[0] = 'B';
buf[1] = '%';
buf[2] = a&0xff;
buf[3] = (a>>8)&0xff;
buf[4] = (a>>16)&0xff;
buf[5] = (a>>24)&0xff;
Endianness determines how numbers are stored in memory, not how arithmetic is performed on it. Since the C code you provided only uses integer arithmetic (i.e. does not deal with pointers and memory access), the resulting data will be the same whatever the endianness is.
To serialize your number, you extract every byte (&0xff) of your number by applying bit shifting (respectively 0, 8, 16 and 24 bits); e.g. 0xAABBCCDD >> 8 becomes 0xAABBCC, and the binary AND operation &0xff discards the upper bytes to keep the least significant one, in case of the example it is 0xCC.
To undo that operation, you have to take the bytes and AND them together, applying bit shifts in the opposite direction. Parsing i would use the following code:
int a = buf[2] & (buf[3] << 8) & (buf[4] << 16) & (buf[5] << 24);
There is no need to cast any of the operands here as using bitwise operators in C implies integer promotion (ISO/IEC 9899§6.3.1.1), and your resulting variable type is int — that is, assuming buf is an array of an unsigned 8-bit integer type.
Note this assumes the emitter of the serialized data also has a 32-bit int length, and uses the same signed number representation (often two's complement).
The hexi decimal number system is a base 16 numeral system.
This means that it consists of 16 different symbols and has The weight of 16.
What this means is that instead of only having 10 symbols(0-9)for illustrating numbers as in the decimal (base 10 ) system, you have 16 0-f.
Where the symbols a=10,b=11,c=12,d=13,e=14 and f=15 in The decimal system.
The weight part of the system means that the frist symbol in a hex digit has the weight of 1 which is also the case in the decimal system.
But if the number is represented by more than one digit, that weight is increased by 16.
So in hex:
f = 16^0*15 in decimal.
ff = (16^1*15)+15 in decimal.
fff=(16^2*15)+(16^1*15)+15 in decimal
n...fff = ((16^n-1)*m)+...+(16^2*15)+(16^1*15)+15 in decimal where m represents a given hex symbol.
This number system is widely used in electrical Engineering and hardware near software development, because it allows one to group bits in groups of fours and thereby illustrate large binary numbers more compact.
knowing how data are embedded into message, you can extrapolate them converting in the right format.
#include <stdio.h>
#include <stdint.h>
int main(void)
{
int32_t a = 9999;
unsigned char buf[6];
buf[0] = 'B';
buf[1] = '%';
buf[2] = a&0xff;
buf[3] = (a>>8)&0xff;
buf[4] = (a>>16)&0xff;
buf[5] = (a>>24)&0xff;
uint32_t res= buf[2] + ((buf[3] & 0xFFFFFFFFu) << 8) + ((buf[4] & 0xFFFFFFFFu) << 16) + ((buf[5] & 0xFFFFFFFFu) << 24);
printf("Converted from chars: %d\n", res);
return 0;
}
As data are pushed into buffer, you can position they back at the correct location in an int variable.
I think it has something to do with little endian?
Endianness doesn't matter because shift operations are made on the host platform considering its architecture.

Copy 6 byte array to long long integer variable

I have read from memory a 6 byte unsigned char array.
The endianess is Big Endian here.
Now I want to assign the value that is stored in the array to an integer variable. I assume this has to be long long since it must contain up to 6 bytes.
At the moment I am assigning it this way:
unsigned char aFoo[6];
long long nBar;
// read values to aFoo[]...
// aFoo[0]: 0x00
// aFoo[1]: 0x00
// aFoo[2]: 0x00
// aFoo[3]: 0x00
// aFoo[4]: 0x26
// aFoo[5]: 0x8e
nBar = (aFoo[0] << 64) + (aFoo[1] << 32) +(aFoo[2] << 24) + (aFoo[3] << 16) + (aFoo[4] << 8) + (aFoo[5]);
A memcpy approach would be neat, but when I do this
memcpy(&nBar, &aFoo, 6);
the 6 bytes are being copied to the long long from the start and thus have padding zeros at the end.
Is there a better way than my assignment with the shifting?
What you want to accomplish is called de-serialisation or de-marshalling.
For values that wide, using a loop is a good idea, unless you really need the max. speed and your compiler does not vectorise loops:
uint8_t array[6];
...
uint64_t value = 0;
uint8_t *p = array;
for ( int i = (sizeof(array) - 1) * 8 ; i >= 0 ; i -= 8 )
value |= (uint64_t)*p++ << i;
// left-align
value <<= 64 - (sizeof(array) * 8);
Note using stdint.h types and sizeof(uint8_t) cannot differ from1`. Only these are guaranteed to have the expected bit-widths. Also use unsigned integers when shifting values. Right shifting certain values is implementation defined, while left shifting invokes undefined behaviour.
Iff you need a signed value, just
int64_t final_value = (int64_t)value;
after the shifting. This is still implementation defined, but all modern implementations (and likely the older) just copy the value without modifications. A modern compiler likely will optimize this, so there is no penalty.
The declarations can be moved, of course. I just put them before where they are used for completeness.
You might try
nBar = 0;
memcpy((unsigned char*)&nBar + 2, aFoo, 6);
No & needed before an array name caz' it's already an address.
The correct way to do what you need is to use an union:
#include <stdio.h>
typedef union {
struct {
char padding[2];
char aFoo[6];
} chars;
long long nBar;
} Combined;
int main ()
{
Combined x;
// reset the content of "x"
x.nBar = 0; // or memset(&x, 0, sizeof(x));
// put values directly in x.chars.aFoo[]...
x.chars.aFoo[0] = 0x00;
x.chars.aFoo[1] = 0x00;
x.chars.aFoo[2] = 0x00;
x.chars.aFoo[3] = 0x00;
x.chars.aFoo[4] = 0x26;
x.chars.aFoo[5] = 0x8e;
printf("nBar: %llx\n", x.nBar);
return 0;
}
The advantage: the code is more clear, there is no need to juggle with bits, shifts, masks etc.
However, you have to be aware that, for speed optimization and hardware reasons, the compiler might squeeze padding bytes into the struct, leading to aFoo not sharing the desired bytes of nBar. This minor disadvantage can be solved by telling the computer to align the members of the union at byte-boundaries (as opposed to the default which is the alignment at word-boundaries, the word being 32-bit or 64-bit, depending on the hardware architecture).
This used to be achieved using a #pragma directive and its exact syntax depends on the compiler you use.
Since C11/C++11, the alignas() specifier became the standard way to specify the alignment of struct/union members (given your compiler already supports it).

Converting little endian to big endian using Bitshift Operators

I am working on endianess. My little endian program works, and gives the correct output. But I am not able to get my way around big endian. Below is the what I have so far.
I know i have to use bit shift and i dont think i am doing a good job at it. I tried asking my TA's and prof but they are not much help.
I have been following this link (convert big endian to little endian in C [without using provided func]) to understand more but cannot still make it work. Thank you for the help.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
FILE* input;
FILE* output;
input = fopen(argv[1],"r");
output = fopen(argv[2],"w");
int value,value2;
int i;
int zipcode, population;
while(fscanf(input,"%d %d\n",&zipcode, &population)!= EOF)
{
for(i = 0; i<4; i++)
{
population = ((population >> 4)|(population << 4));
}
fwrite(&population, sizeof(int), 1, output);
}
fclose(input);
fclose(output);
return 0;
}
I'm answering not to give you the answer but to help you solve it yourself.
First ask yourself this: how many bits are in a byte? (hint: 8) Next, how many bytes are in an int? (hint: probably 4) Picture this 32-bit integer in memory:
+--------+
0x|12345678|
+--------+
Now picture it on a little-endian machine, byte-wise. It would look like this:
+--+--+--+--+
0x|78|56|34|12|
+--+--+--+--+
What shift operations are required to get the bytes into the correct spot?
Remember, when you use a bitwise operator like >>, you are operating on bits. So 1 << 24 would be the integer value 1 converted into the processor's opposite endianness.
"little-endian" and "big-endian" refer to the order of bytes (we can assume 8 bits here) in a binary representation. When referring to machines, it's about the order of the bytes in memory: on big-endian machines, the address of an int will point to its highest-order byte, while on a little-endian machine the address of an int will refer to its lowest-order byte.
When referring to binary files (or pipes or transmission protocols etc.), however, it refers to the order of the bytes in the file: a "little-endian representation" will have the lowest-order byte first and the highest-order byte last.
How does one obtain the lowest-order byte of an int? That's the low 8 bits, so it's (n & 0xFF) (or ((n >> 0) & 0xFF), the usefulness of which you will see below).
The next lowest-order byte is ((n >> 8) & 0xFF).
The next lowest-order byte is ((n >> 16) & 0xFF) ... or (((n >> 8) >> 8) & 0xFF).
And so on.
So you can peal off bytes from n in a loop and output them one byte at a time ... you can use fwrite for that but it's simpler just to use putchar or putc.
You say that your teacher requires you to use fwrite. There are two ways to do that: 1) use fwrite(&n, 1, 1, filePtr) in a loop as described above. 2) Use the loop to reorder your int value by storing the bytes in the desired order in a char array rather than outputting them, then use fwrite to write it out. The latter is probably what your teacher has in mind.
Note that, if you just use fwrite to output your int it will work ... if you're running on a little-endian machine, where the bytes of the int are already stored in the right order. But the bytes will be backwards if running on a big-endian machine.
The problem with most answers to this question is portability. I've provided a portable answer here, but this recieved relatively little positive feedback. Note that C defines undefined behavior as: behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements.
The answer I'll give here won't assume that int is 16 bits in width; It'll give you an idea of how to represent "larger int" values. It's the same concept, but uses a dynamic loop rather than two fputcs.
Declare an array of sizeof int unsigned chars: unsigned char big_endian[sizeof int];
Separate the sign and the absolute value.
int sign = value < 0;
value = sign ? -value : value;
Loop from sizeof int to 0, writing the least significant bytes:
size_t foo = sizeof int;
do {
big_endian[--foo] = value % (UCHAR_MAX + 1);
value /= (UCHAR_MAX + 1);
} while (foo > 0);
Now insert the sign: foo[0] |= sign << (CHAR_BIT - 1);
Simple, yeh? Little endian is equally simple. Just reverse the order of the loop to go from 0 to sizeof int, instead of from sizeof int to 0:
size_t foo = 0;
do {
big_endian[foo++] = value % (UCHAR_MAX + 1);
value /= (UCHAR_MAX + 1);
} while (foo < sizeof int);
The portable methods make more sense, because they're well defined.

Resources