So to provide context, my system is little endian and the file that I am reading from is big endian (MIDI format, for those who are interested). I am supposed to read a variety of data from the file, including unsigned integers (8 bit, 16 bit, and 32 bit), chars, and booleans.
So far I know that reading unsigned integers will be an issue with fread() because I would have to convert them from big endian to little endian. My first question, perhaps a stupid one to some, is: do I need to convert chars and booleans as well?
My second question is regarding the entire file format. Since the file is in a different endian system, do I need to read the file from the end towards the beginning (since the MSB and LSB positions will be different)? Or do I need to read in the values from the start to the end, like I would normally, and then convert those to little endian?
Thanks for taking the time to read my post and for any answers that I might receive!
Endianness only reverses the byte order inside words of a certain length, usually 2, 4, or 8 bytes. If you're reading a one-byte value such as a char or a bool, then endianness has no effect. However, if you're reading any value wider than one byte, such as an integer, then endianness matters. You can still use fread, since endianness has nothing to do with file reading; just make sure to convert from big endian to little endian afterwards.
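For instance, here is a minimal sketch (assuming fixed-width types from <stdint.h> and a hypothetical file name) of reading a big-endian 32-bit value with fread and then swapping its byte order on a little-endian host:

#include <stdint.h>
#include <stdio.h>

/* Reverse the byte order of a 32-bit value with shifts and masks. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

int main(void)
{
    FILE *f = fopen("song.mid", "rb");   /* hypothetical file name */
    if (!f) return 1;

    uint32_t length_be;
    if (fread(&length_be, sizeof length_be, 1, f) == 1) {
        /* The file stores the value big endian, so swap it on a little-endian host. */
        uint32_t length = swap32(length_be);
        printf("value: %u\n", (unsigned)length);
    }
    fclose(f);
    return 0;
}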
When you read external data that isn't just a sequence of characters, you read it as a sequence of bytes, and construct the actual data that you want from it.
If you expect a signed 16-bit number, followed by an unsigned 8-bit number, followed by an unsigned 32-bit number, you write one function that reads two bytes and returns them converted to a signed 16-bit number, one that reads one byte and returns it as an unsigned 8-bit number, and one that reads four bytes and returns them converted to an unsigned 32-bit number. Construct the 16- and 32-bit numbers using bit shifting.
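As a rough sketch (the function names are just for illustration), such readers might look like this, assembling big-endian values byte by byte with shifts:

#include <stdint.h>
#include <stdio.h>

/* Read one unsigned byte; a single byte has no endianness concern. */
uint8_t read_u8(FILE *f)
{
    return (uint8_t)fgetc(f);
}

/* Read a big-endian unsigned 16-bit value; cast the result to int16_t
   if a signed 16-bit number is expected. */
uint16_t read_u16_be(FILE *f)
{
    uint16_t hi = read_u8(f);
    uint16_t lo = read_u8(f);
    return (uint16_t)((hi << 8) | lo);
}

/* Read a big-endian unsigned 32-bit value. */
uint32_t read_u32_be(FILE *f)
{
    uint32_t b0 = read_u8(f);
    uint32_t b1 = read_u8(f);
    uint32_t b2 = read_u8(f);
    uint32_t b3 = read_u8(f);
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
}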
I am a bit confused about how you would approach this problem:
Consider the decimal number 1027. This value is stored as a 16-bit two's complement number at addresses 124 and 125 on a little endian machine which has an addressable cell size of one byte. What values (in hexadecimal) are in each of these addresses:
124:
125:
I know that a little endian machine orders its bytes from the least significant byte at the lowest address to the most significant byte at the highest address. But beyond that I am unsure how to apply that concept and how to order the bytes into the addresses.
Here's some simple Python code to convert that integer to little-endian hexadecimal representation:
# convert the integer (1027) to hex using 2 bytes and little-endian byteorder
(1027).to_bytes(length=2, byteorder='little').hex()
This gives 0304. So, the first byte (03) is in address 124 and the second one (04) occupies the next address - 125.
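If you would rather verify this in C, here is a small sketch (assuming a little-endian machine with 8-bit bytes) that prints the two bytes of the 16-bit value 1027 in address order:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int16_t value = 1027;                 /* 0x0403 as a 16-bit two's complement number */
    unsigned char bytes[2];

    memcpy(bytes, &value, sizeof value);  /* capture the in-memory byte order */

    /* On a little-endian machine this prints 03 04: address 124 would hold
       0x03 and address 125 would hold 0x04. */
    printf("%02X %02X\n", bytes[0], bytes[1]);
    return 0;
}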
"Little endian" and "big endian" relate to how the machine multiplexes bytes from memory into registers of the CPU.
With each byte it gets, it increments the address counter, but does it place these bytes from left-to-right or right-to-left into the register?
So a multi-byte value that gets loaded into a machine register (or an integer) can be stored in reverse order in memory. Even with modern CPUs and wide data buses the concept has remained, and in some CPUs the bytes get swapped inside the CPU.
Assume I have this generic function that swaps two variables:
void swap(void *v1, void *v2, int size){
    char buffer[size];            /* variable-length scratch buffer */
    memcpy(buffer, v1, size);     /* save the first object's bytes */
    memcpy(v1, v2, size);         /* copy the second object into the first */
    memcpy(v2, buffer, size);     /* copy the saved bytes into the second */
}
It works fine, but I was wondering in what cases this might break. One case that comes to mind is when we have two different data types and the size specified is not enough to capture the bigger data. for example:
int x = 4444;
short y = 5;
swap(&x, &y, sizeof(short));
I'd expect that when I run this it would give an incorrect result, because memcpy would work with only 2 bytes (rather than 4) and part of the data would be lost or changed when dealing with x.
Surprisingly though, when I run it, it gives the correct answer on both my Windows 7 and Ubuntu operating systems. I know that Ubuntu and Windows differ in endianness, but apparently that doesn't affect either of the two systems.
I want to know why the generic function works fine in this case.
To understand this fully you have to understand the C standard and the specifics of your machine and compiler. Starting with the C standard, here are some relevant snippets [the standard I'm using is WG14/N1256], summarized a little:
The object representation for a signed integer consists of value bits, padding bits, and a sign bit. [section 6.2.6.2.2]
These bits are stored in a contiguous sequence of bytes. [section 6.2.6.1]
If there are N value bits, they represent powers of two from 2^0 to 2^{N-1}. [section 6.2.6.2]
The sign bit can have one of three meanings, one of which is that it has value -2^N (two's complement). [section 6.2.6.2.2]
When you copy bytes from a short to an int, you're copying the value bits, padding bits and the sign bit of the short to bits of the int, but not necessarily preserving the meaning of the bits. Somewhat surprisingly, the standard allows this except it doesn't guarantee that the int you get will be valid if your target implementation has so-called "trap representations" and you're unlucky enough to generate one.
In practice, you've found on your machine and your compiler:
a short is represented by 2 bytes of 8 bits each.
The sign bit is bit 7 of the second byte.
The value bits, in ascending order of value, are bits 0-7 of byte 0 and bits 0-6 of byte 1.
There are no padding bits.
an int is represented by 4 bytes of 8 bits each.
The sign bit is bit 7 of the fourth byte.
The value bits, in ascending order of value, are bits 0-7 of byte 0, 0-7 of byte 1, 0-7 of byte 2, and 0-6 of byte 3.
There are no padding bits.
You would also find out that both representations use two's complement.
In pictures (where SS is the sign bit, and the numbers N correspond to a bit that has value 2^N):
short:
07-06-05-04-03-02-01-00 | SS-14-13-12-11-10-09-08
int:
07-06-05-04-03-02-01-00 | 15-14-13-12-11-10-09-08 | 23-22-21-20-19-18-17-16 | SS-30-29-28-27-26-25-24
You can see from this that if you copy the bytes of a short to the first two bytes of a zero int, you'll get the same value if the sign bit is zero (that is, the number is positive) because the value bits correspond exactly. As a corollary, you can also predict you'll get a different value if you start with a negative-valued short since the sign bit of the short has value -2^15 but the corresponding bit in the int has value 2^15.
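To see that prediction concretely, here is a small test (assuming the little-endian, two's-complement layout described above) that runs the question's swap with sizeof(short) on a positive and then a negative short:

#include <stdio.h>
#include <string.h>

void swap(void *v1, void *v2, int size)
{
    char buffer[size];
    memcpy(buffer, v1, size);
    memcpy(v1, v2, size);
    memcpy(v2, buffer, size);
}

int main(void)
{
    int   x = 4444;
    short y = 5;
    swap(&x, &y, sizeof(short));
    printf("positive: x = %d, y = %d\n", x, y);   /* looks correct: 5 and 4444 */

    int   a = 4444;
    short b = -5;
    swap(&a, &b, sizeof(short));
    printf("negative: a = %d, b = %d\n", a, b);   /* a becomes 65531, not -5 */

    return 0;
}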
The representation you've found on your machine is often summarized as "two's complement, little-endian", but the C standard provides a lot more flexibility in representations than that description suggests (even allowing a byte to have more than 8 bits), which is why portable code usually avoids relying on bit/byte representations of integral types.
As has already been pointed out in the comments, the systems you are using are little-endian (least significant byte at the lowest address). Given that, the memcpy only swaps the short with the least significant two bytes of the int, which is why the result looks correct here.
You might enjoy looking at Bit Twiddling Hacks for 'generic' ways to do swap operations.
I need to store a string of 8 chars (they're all digits) in a compressed form.
As I understand it, each char uses 8 bits (1 byte), and since I only use digits I can get away with 4 bits (2^4 = 16 combinations), so in each unsigned char I can store two digits instead of one. Thus I need 4 bytes to store 8 digits instead of 8 bytes.
Until here am I right or wrong?
Now how am I storing this data in a string of 4 unsigned chars? I'm not looking for an explicit answer just a kick start to understand the motivation.
There are three obvious ways to store eight decimal digits in four eight-bit values.
One is to reduce each decimal digit to four bits and to store two four-bit values in eight bits.
Another is to combine each pair of decimal digits to make a number from 0 to 99 and store that number in eight bits.
Another is to combine all eight decimal digits to make a number from 0 to 99999999 and store that in 32 bits, treating the four eight-bit values as one 32-bit integer.
To decide between these, consider what operations you need to perform to encode the value (what arithmetic or bit operations are needed to combine two digits to make the encoded value) and what operations you need to perform to decode the value (given eight bits, how do you get the digits out of them?).
To evaluate this problem, you should know about the basic arithmetic operations and the bit operations such as bit-wise AND and OR, shifting bits, using “masks” with AND operations, and so on. It may also help to know that division and remainder are usually more time-consuming operations than other arithmetic and bit operations on modern computers.
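A minimal sketch of the first approach (two decimal digits per byte, one per 4-bit nibble); the function names are only illustrative:

#include <stdio.h>

/* Pack 8 ASCII digits into 4 bytes, one digit per 4-bit nibble. */
void pack_digits(const char digits[8], unsigned char packed[4])
{
    for (int i = 0; i < 4; i++) {
        unsigned char hi = (unsigned char)(digits[2 * i]     - '0');  /* first digit of the pair  */
        unsigned char lo = (unsigned char)(digits[2 * i + 1] - '0');  /* second digit of the pair */
        packed[i] = (unsigned char)((hi << 4) | lo);
    }
}

/* Unpack 4 bytes back into 8 ASCII digits. */
void unpack_digits(const unsigned char packed[4], char digits[8])
{
    for (int i = 0; i < 4; i++) {
        digits[2 * i]     = (char)('0' + (packed[i] >> 4));
        digits[2 * i + 1] = (char)('0' + (packed[i] & 0x0F));
    }
}

int main(void)
{
    unsigned char packed[4];
    char out[9] = {0};

    pack_digits("12345678", packed);
    unpack_digits(packed, out);
    printf("%s\n", out);   /* prints 12345678 */
    return 0;
}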
I would prefer you use an unsigned int, as suggested by harold in the comments. With unsigned char[4] you may need one additional char for the terminating '\0' character.
Use shifting, as you yourself suggested, for the conversion between the digit characters and the packed unsigned int.
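A sketch of that suggestion (4 bits per digit shifted into a single unsigned int, assuming unsigned int is at least 32 bits wide); the names are only illustrative:

#include <stdio.h>

/* Pack 8 decimal digits into one unsigned int, 4 bits per digit. */
unsigned int pack8(const char digits[8])
{
    unsigned int packed = 0;
    for (int i = 0; i < 8; i++)
        packed = (packed << 4) | (unsigned int)(digits[i] - '0');
    return packed;
}

/* Unpack the value back into 8 ASCII digits plus a terminator. */
void unpack8(unsigned int packed, char out[9])
{
    for (int i = 7; i >= 0; i--) {
        out[i] = (char)('0' + (packed & 0x0F));
        packed >>= 4;
    }
    out[8] = '\0';
}

int main(void)
{
    char out[9];
    unsigned int v = pack8("20240131");
    unpack8(v, out);
    printf("0x%08X -> %s\n", v, out);   /* 0x20240131 -> 20240131 */
    return 0;
}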
I'm trying to read a binary file into a C# struct. The file was created by a C program, and the following code produces 2 of the bytes in each 50+ byte row.
unsigned short nDayTimeBitStuffed = atoi( LPCTSTR( strInput) );
unsigned short nDayOfYear = (0x01FF & nDayTimeBitStuffed);
unsigned short nTimeOfDay = (0x01F & (nDayTimeBitStuffed >> 9) );
The binary values in the file are 00000001 and 00000100.
The expected values are 1 and 2, so I think some bit ordering/swapping is going on, but I'm not sure.
Any help would be greatly appreciated.
Thanks!
The answer is 'it depends' - most notably on the machine, and also on how the data is written to the file. Consider:
unsigned short x = 0x0102;
write(fd, &x, sizeof(x));
On some machines (Intel), the low-order byte (0x02) will be written before the high-order byte (0x01); on others (PPC, SPARC), the high-order byte will be written before the low-order one.
So, from a little-endian (Intel) machine, you'd see the bytes:
0x02 0x01
But from a big-endian (PPC) machine, you'd see the bytes:
0x01 0x02
Your bytes appear to be 0x01 and 0x04; read as a little-endian 16-bit value they give 0x0401, which the masks in your code decode to a day of 1 and a time of 2, exactly your expected values.
The C code you show doesn't write anything. The value in nDayOfYear is the bottom 9 bits of the input value; the nTimeOfDay appears to be the next 5 bits (so 14 of the 16 bits are used).
For example, if the value in strInput is 12141 decimal, 0x2F6D, then the value in nDayOfYear would be 365 (0x16D) and the value in nTimeOfDay would be 23 (0x17).
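A quick check of that arithmetic, using the masks from the question (written here as a standalone C snippet; the input value is just the example above, not anything from your file):

#include <stdio.h>

int main(void)
{
    unsigned short nDayTimeBitStuffed = 0x2F6D;   /* 12141 decimal */

    unsigned short nDayOfYear = (unsigned short)(0x01FF & nDayTimeBitStuffed);
    unsigned short nTimeOfDay = (unsigned short)(0x001F & (nDayTimeBitStuffed >> 9));

    printf("day of year: %u\n", nDayOfYear);   /* 365 */
    printf("time of day: %u\n", nTimeOfDay);   /* 23  */
    return 0;
}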
It is a funny storage order; you can't simply compare two packed values as integers. If you packed the day of year into the more significant portion of the value and the time into the less significant portion, then you could compare packed values as plain integers and get the correct ordering.
The expected file contents are very much related to the processor and compiler used to create the file, if it's binary.
I'm assuming a Windows machine here, which uses 2 bytes for a short and puts them in little endian order.
Your comments don't make much sense either. If it's two bytes then it should be using two chars, not shorts. The range of the first is going to be 1-365, so it definitely needs more than a single byte to represent. I'm going to assume you want the first 4 bytes, not the first 2.
This means that the first byte will be bits 0-7 of the DayOfYear, the second byte will be bits 8-15 of the DayOfYear, the third byte will be bits 0-7 of the TimeOfDay, and the fourth byte will be bits 8-15 of the TimeOfDay.
I have a special unsigned long (32 bits) and I need to convert the endianness of it bit by bit - my long represents several things all smooshed together into one piece of binary.
How do I do it?
Endianness is a word-level concept: the bytes of a multi-byte value are stored either most-significant byte first (big endian) or least-significant byte first (little endian). Data transferred over a network is typically big endian (so-called network byte order). Data stored in memory on a machine can be in either order; several other architectures are big endian, but the Intel x86 architecture is so ubiquitous that little-endian data is what you'll most often see in memory.
Anyhow, the point of all that is that endianness is a very specific concept that only applies at the byte level, not the bit level. If ntohs(), ntohl(), htons(), and htonl() don't do what you want then what you're dealing with isn't endianness per se.
If you need to reverse the individual bits of your unsigned long or do anything else complicated like that, please post more information about what exactly you need to do.
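For reference, a small example of the byte-order functions mentioned above; on POSIX systems they are declared in <arpa/inet.h> (on Windows, in <winsock2.h>):

#include <arpa/inet.h>   /* htonl / ntohl: host <-> network (big-endian) byte order */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t host = 0x12345678u;
    uint32_t net  = htonl(host);   /* convert host order to network order */
    uint32_t back = ntohl(net);    /* and back again */

    printf("host: 0x%08X  network: 0x%08X  back: 0x%08X\n",
           (unsigned)host, (unsigned)net, (unsigned)back);
    return 0;
}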
Be careful to understand the meaning of 'endianness'. It refers to the order of bytes within the data, not bits within a byte. You may only need a function like htonl or ntohl to convert your dword.
If you truly want to reverse the order of all bits in the 32b data type, you could write an iterative algorithm to mask and shift each bit into the appropriate reflected position.
A simple endianness conversion function for an unsigned long value could look like the following:
typedef union {
    unsigned long u32;       /* assumes unsigned long is 4 bytes wide */
    unsigned char u8[4];
} U32_U8;

unsigned long SwapEndian(unsigned long u)
{
    U32_U8 source;
    U32_U8 dest;

    source.u32 = u;
    dest.u8[0] = source.u8[3];   /* reverse the byte order */
    dest.u8[1] = source.u8[2];
    dest.u8[2] = source.u8[1];
    dest.u8[3] = source.u8[0];
    return dest.u32;
}
To invert the bit order of an integer, you can shift the bits out of the source in one direction and shift them into the destination in the opposite direction.
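A sketch of that idea for a 32-bit value (assuming, as the question says, that the unsigned long holds 32 bits of interest):

#include <stdio.h>

/* Reverse all 32 bits: shift bits out of the source's low end and
   shift them into the result's low end, which mirrors their positions. */
unsigned long reverse_bits32(unsigned long value)
{
    unsigned long reflected = 0;
    for (int i = 0; i < 32; i++) {
        reflected = (reflected << 1) | (value & 1UL);
        value >>= 1;
    }
    return reflected & 0xFFFFFFFFUL;
}

int main(void)
{
    unsigned long v = 0x00000001UL;
    printf("0x%08lX -> 0x%08lX\n", v, reverse_bits32(v));   /* 0x00000001 -> 0x80000000 */
    return 0;
}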