bit endianness and portability of C binary files - c

In C, I have a char array that I use to store data at the bit level. I store these arrays to files, then read them in machines with different architectures. My question is if the order of the bits will be guaranteed consistent? For example, if I store "10010011" to the first byte, will the adjacent 1's always be read to be in the 2^0 and 2^1 positions, or could they end up interpreted as the 2^7 and 2^6 bits?
EDIT: I want to clarify this question a little for people who read this page later. Byte endianness is the order of bytes in a multibyte object, but my concern is with the bits in a given byte. When a byte is stored to disc, it is stored as a sequence of (usually) 8 bits. I'm no hardware expert, but it has to come down to that somehow. So, my concern is if the way the byte is stored is such that any machine will read the original unsigned char value, or if what is 3 to one machine will be 192 to another. I am concerned the bits will end up shuffled somehow. Apparently, this is not a concern, according to the answer I selected as well as one of the comments below. Thanks.

the simple answer:
The bits will still be in the correct order.
However, if you perform any format conversion beyond %c, for instance %d, the endianness of the reading architecture will determine the byte order. The bits within each byte will still be the same.

Endianness is about the order of bytes, not bits. So 00001101 on a little-endian machine will be the same on a big-endian machine. However, there is something you should know about bit order on different machines: the layout of bit-fields (for example, inside a union) is implementation-defined and can differ between compilers and architectures. If you are going to use a union with bit-fields, read up on how endianness affects bit-field packing.

The concept you are asking about is known as bit numbering or bit endianness, and system architectures are referred to as using least-significant-bit (LSB) or most-significant-bit (MSB) ordering.
As far as I know, the reference is always with respect to the 0-th or first bit position.
With respect to a single 8-bit byte or octet, it will be portable: the value of the byte will consistently be considered to be 0x93 (147 decimal). This assumes you are writing the bit string in LSB 0 notation, with the 0-th bit as the rightmost bit (the norm for little-endian processors), as is typically done by users of left-to-right natural languages such as English.
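As a minimal sketch of why this is portable: if the bits of a char array are only ever touched through shifts and masks, the numeric value of each byte is what gets written and read back, regardless of the machine. The helper names set_bit and get_bit are hypothetical, and the code assumes 8-bit bytes and numbers bit 0 as the 2^0 bit of buf[0].

```c
#include <stddef.h>

/* Set (value != 0) or clear (value == 0) bit `pos` of the bit array `buf`.
   Assumption: 8-bit bytes; bit 0 is the 2^0 bit of buf[0]. */
void set_bit(unsigned char *buf, size_t pos, int value)
{
    unsigned char mask = (unsigned char)(1u << (pos % 8));
    if (value)
        buf[pos / 8] |= mask;
    else
        buf[pos / 8] &= (unsigned char)~mask;
}

/* Return bit `pos` of the bit array `buf` as 0 or 1. */
int get_bit(const unsigned char *buf, size_t pos)
{
    return (buf[pos / 8] >> (pos % 8)) & 1;
}
```

Setting bits 0, 1, 4 and 7 of a zeroed byte gives the value 0x93, and that value is what any machine with 8-bit bytes reads back from the file.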

Related

no big endian and little endian in string?

We know that machines differ in byte ordering: some store an object in memory ordered from least significant byte to most, while others store it from most to least, e.g. a hexadecimal value of 0x01234567.
So if we write a C program that prints each byte from its memory address, big-endian and little-endian machines produce different results.
But for strings, the same result would be obtained on any system using ASCII as its character code, independent of the byte ordering and word size conventions. As a consequence, text data is more platform-independent than binary data.
So my question is: why do we differentiate between big endian and little endian for binary data, when we could make it the same as text data, which is platform-independent? What's the point of having big-endian and little-endian machines just for binary data?
Array elements are always addressed from low to high, regardless of endianness conventions.
ASCII and UTF-8 strings are arrays of char, which is not a multibyte type and is not affected by endianness conventions.
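To make the difference concrete, here is a small sketch that dumps the bytes of any object from the lowest address upward. The name dump_bytes is hypothetical. On an ASCII system a char string dumps identically everywhere, while a multibyte integer dumps in an order that depends on the host.

```c
#include <stdio.h>
#include <stddef.h>

/* Write the bytes of an object, lowest address first, as hex into `out`.
   `out` must have room for at least 3 * n characters (1 if n is 0). */
void dump_bytes(const void *obj, size_t n, char *out)
{
    const unsigned char *p = obj;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (i > 0)
            *out++ = ' ';
        out += sprintf(out, "%02x", p[i]);
    }
}
```

Dumping the five characters of "12345" yields 31 32 33 34 35 on any ASCII machine. Dumping a 32-bit 0x01234567 yields 67 45 23 01 on a little-endian host and 01 23 45 67 on a big-endian one.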
"Wide" strings, where each character is represented by wchar_t or another multibyte type, will be affected, but only for the individual elements, not the string as a whole.
So my question is: why do we differentiate between big endian and little endian for binary data, when we could make it the same as text data, which is platform-independent? What's the point of having big-endian and little-endian machines just for binary data?
In short: we already do: for example, a file format specification will dictate if a 32-bit integer should be serialized in big-endian or little-endian order. Similarly, network protocols will dictate the byte-order of multi-byte values (which is why htons is a thing).
However, if we're only concerned with the in-memory representation of binary data (and not serialized binary data), then it makes sense to store values using only the fastest representation, i.e. the byte order natively preferred by the CPU and ISA. For x86 and x64 this is little-endian, while for 68k, and traditionally for MIPS and similar ISAs, the preferred order is big-endian. Most non-x86 ISAs, ARM included, now support both big-endian and little-endian modes, and ARM is usually run little-endian.
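A minimal sketch of what "the format dictates the byte order" looks like in C: the value is split into bytes with shifts, so the result does not depend on the host's native order. The function names put_u32_be and get_u32_be are hypothetical, and big-endian is chosen only as an example of a format's rule.

```c
#include <stdint.h>

/* Serialize `v` into 4 bytes, most significant byte first. */
void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)((v >> 24) & 0xFFu);
    out[1] = (unsigned char)((v >> 16) & 0xFFu);
    out[2] = (unsigned char)((v >> 8) & 0xFFu);
    out[3] = (unsigned char)(v & 0xFFu);
}

/* Rebuild the value from 4 bytes stored most significant byte first. */
uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

Because only arithmetic on values is used, the same source produces the same file bytes on little-endian and big-endian hosts. Functions such as htons do the equivalent job for network byte order.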
But for strings, the same result would be obtained on any system using ASCII as its character code, independent of the byte ordering and word size conventions. As a consequence, text data is more platform-independent than binary data.
So my question is: why do we differentiate between big endian and little endian for binary data, when we could make it the same as text data, which is platform-independent?
In short:
ASCII Strings are not integers.
Integers are not ASCII strings.
You're basically asking why we don't represent integers in Base-10 big-endian format: we don't because Base-10 is difficult for digital computers to work with (digital computers work in Base-2). The closest thing we have to what you're describing is binary-coded decimal (BCD), and the reason computers today don't normally use it is that it's slow and inefficient. Only 4 bits are needed to represent one Base-10 digit in Base-2, so you could "pack" two Base-10 digits into a single byte, but that can be slow because CPUs are generally fastest on word-sized (or at least byte-sized) values, not nibble-sized (half-byte) values. And this still doesn't solve the big-endian vs. little-endian problem: BCD values could be stored in either BE or LE order, and even char-based strings could be stored in reverse order without it affecting how they're processed!
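For illustration only, a sketch of packed BCD with two decimal digits per byte. The names bcd_pack and bcd_unpack are hypothetical, and the choice of the high nibble for the tens digit is an assumption; a format could equally use the opposite order.

```c
/* Pack two decimal digits (each 0..9) into one byte, tens in the high nibble. */
unsigned char bcd_pack(unsigned tens, unsigned ones)
{
    return (unsigned char)(((tens & 0x0Fu) << 4) | (ones & 0x0Fu));
}

/* Convert a packed BCD byte back to its numeric value (0..99). */
unsigned bcd_unpack(unsigned char b)
{
    return (unsigned)(b >> 4) * 10u + (unsigned)(b & 0x0Fu);
}
```

The value 42 is stored as the byte 0x42, so arithmetic on it needs an unpack step first, which is part of why plain binary is preferred.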

how do I know if the case is true?

Let's say we are given a byte of binary data. How can you know what that data represents?
Is it true that you can't really know what the data represents, because you need to know whether the byte is represented in base 2, whether it is unsigned, signed, etc.?
Or can you know what it represents, since binary is base 2?
I am sorry to say that a byte of data has nothing to do with its supposed representation.
You state that because it's a byte, it's a binary representation. This is purely an assumption.
It depends on the intention of whoever stored the data.
It might represent anything. As #nos told you, it really depends on the convention the writer used to store it.
You may have a two's complement number, a signed value on 7 bits, an unsigned value on 8 bits, an octal representation (or a partial representation), or a mask (each group of bits within the byte may describe something totally different from another). It could also be a representation of a special coding, etc.
This is truly unlimited.
In order to interpret it properly you need to know the underlying convention (a spec). #fede1024 told you about files that use special characters so that you can double-check against the convention.
One more thing: bear in mind that even binary data can be stored in natural order or in reverse order; that's endianness. So when you examine a number stored in at least 2 bytes, you have to know whether the most significant byte is stored first or second in memory. If you misinterpret this, you won't understand the underlying piece of data. Endianness is normally fixed for a given platform.
Base-2 and binary refer to the same thing. Typically, you do need to know whether the byte is signed or unsigned at least (in C). As for what the data represents - well, "it depends". Whether you want to interpret it as a single byte, as a character (or not), etc. With multi-byte data, you often also have to take endianness (ordering of the bytes into larger words) into account.
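As a small sketch of the signed/unsigned point: the same 8 bits give different numbers depending on the type used to read them. The helper name as_signed is hypothetical. It copies the bits with memcpy rather than casting, since int8_t (where it exists) is defined as two's complement.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the same 8 bits as a signed two's complement value. */
int8_t as_signed(uint8_t byte)
{
    int8_t s;
    memcpy(&s, &byte, sizeof s);
    return s;
}
```

The byte 0x93 reads as 147 when treated as unsigned and as -109 when treated as signed; nothing in the byte itself says which one was meant.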
Some file formats start with a magic number; for example, all PNG files start with 89 50 4E 47 0D 0A 1A 0A. That said, if you have a general binary file without any kind of magic number, you can only guess at its contents.
You can try to open it with a hexadecimal editor, but there is no automatic way to understand what the data represents.
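Checking for a magic number is a byte-by-byte comparison, so it can be sketched as follows. The function name has_png_signature is hypothetical; the eight bytes are the ones quoted above.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the buffer starts with the 8-byte PNG signature, else 0. */
int has_png_signature(const unsigned char *buf, size_t len)
{
    static const unsigned char sig[8] = {
        0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
    };
    return len >= sizeof sig && memcmp(buf, sig, sizeof sig) == 0;
}
```

A match only says the file claims to be a PNG; the rest still has to be parsed according to the format's specification.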
You know it's base 2 since it's a byte of binary data, as you said. To see if it is true, in C everything but 0 is true. If it's 0, then it's false.

big-endian && little -endian?

Can any one tell me what this statement means:
"Specify the endianness of the object files. This only affects disassembly. This can be useful when disassembling a file format which does not describe endianness information, such as S-records. "
This article explains really well what endianness is and how to program in an endian independent way.
Different environments store numbers in different ways,
Big endian environments store information with the most significant byte first.
Little endian environments store information with the least significant byte first.
Nowadays most high-level frameworks take care of all this for you; however, if you're programming at a lower level then this will be important.
Have a peek at the Wikipedia entry for it; it's really not as bad as some others.
I don't understand all of the question, but endianness is used to describe which way round numbers are stored.
For example, the number 256 is too big for 1 byte, so it is represented in 2 bytes as a 1 in one of the bytes and a 0 in the other, representing 1 * 256 plus 0 * units.
Which way round these bytes are stored is the endianness.
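The 256 example can be sketched in C by writing the two bytes out in each order. The function names store_u16_be and store_u16_le are hypothetical.

```c
#include <stdint.h>

/* Big-endian: most significant byte first. */
void store_u16_be(unsigned char out[2], uint16_t v)
{
    out[0] = (unsigned char)(v >> 8);
    out[1] = (unsigned char)(v & 0xFFu);
}

/* Little-endian: least significant byte first. */
void store_u16_le(unsigned char out[2], uint16_t v)
{
    out[0] = (unsigned char)(v & 0xFFu);
    out[1] = (unsigned char)(v >> 8);
}
```

For 256 the big-endian bytes are 01 00 and the little-endian bytes are 00 01. A disassembler reading a format that does not record this has to be told which of the two to assume.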

How do I work with bit data in C

In class I've been tasked with writing a C program that decompresses a text file and prints out the characters it contains. Each character in the file is represented by 2 bits (4 possible characters).
I've recently been informed that a byte is not necessarily 8 bits on all systems, and a char is not necessarily 1 byte. This then makes me wonder how on earth I'm supposed to know how many bits got loaded from a file when I loaded 1 byte. Also how am I supposed to keep the loaded data in memory when there are no data types that can guarantee a set amount of bits.
How do I work with bit data in C?
A byte is not necessarily 8 bits. That much is certainly true. A char, on the other hand, is defined to be a byte - C does not differentiate between the two things.
However, the systems you will write for will almost certainly have 8-bit bytes. Bytes of other sizes are essentially nonexistent outside of really, really old systems and certain embedded systems.
If you have to write your code to work for multiple platforms, and one or more of those have differently sized chars, then you write code specifically to handle that platform - using e.g. CHAR_BIT to determine how many bits each byte contains.
Given that this is for a class, assume 8-bit bytes, unless told otherwise. The point is not going to be extreme platform independence, the point is to teach you something about bit fiddling (or possibly bit fields, but that depends on what you've covered in class).
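One low-cost way to state the 8-bit assumption, sketched here, is to have the compiler check CHAR_BIT from limits.h so the build fails on a platform where it does not hold.

```c
#include <limits.h>

/* Refuse to compile on a platform where a byte is not 8 bits wide. */
#if CHAR_BIT != 8
#error "This code assumes 8-bit bytes"
#endif
```

This documents the assumption in the source instead of leaving it implicit.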
This then makes me wonder how on earth I'm supposed to know how many
bits got loaded from a file when I loaded 1 byte.
You'll be hard pressed to find a platform where a byte is not 8 bits (though, as noted above, CHAR_BIT can be used to verify that). Also, clarify the portability requirements with your instructor or state your assumptions.
Usually bits are extracted using shifts and bitwise operations, e.g. (x & 3) gives the rightmost 2 bits of x and ((x>>2) & 3) gives the next two bits. Pick the right data type for the platforms you are targeting or, as others say, use something like uint8_t if it is available for your compiler.
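Putting the shift-and-mask idea together, a decoder for 2-bit symbols might look like the sketch below. The function name decode_2bit and the alphabet parameter are hypothetical. It assumes 8-bit bytes and that the first symbol sits in the two lowest bits of each byte; the assignment may define the order differently.

```c
#include <stddef.h>

/* Decode `count` two-bit codes from `in` into `out` (needs count + 1 chars).
   Assumption: symbols are packed starting at the lowest bits of each byte. */
void decode_2bit(const unsigned char *in, size_t count,
                 const char alphabet[4], char *out)
{
    for (size_t i = 0; i < count; i++) {
        unsigned shift = (unsigned)(i % 4) * 2;      /* 0, 2, 4 or 6 */
        out[i] = alphabet[(in[i / 4] >> shift) & 3]; /* pick one 2-bit code */
    }
    out[count] = '\0';
}
```

With the alphabet "ACGT", the single byte 0xE4 (binary 11100100) decodes to the four characters A, C, G, T.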
Also see:
Type to use to represent a byte in ANSI (C89/90) C?
I would recommend not using bit fields. Also see here:
When is it worthwhile to use bit fields?
You can use bit fields in C. They let you explicitly specify the number of bits in each part of the field, if you are truly concerned about width. This page gives a discussion: http://msdn.microsoft.com/en-us/library/yszfawxh(v=vs.80).aspx
As an example, check out the ieee754.h for usage in the context of implementing IEEE754 floats

retrieve cobol s9(2) COMP onto C variable

I need to retrieve data from a COBOL variable of the type: "PIC S9(2) COMP" onto a C variable of the type "int".
It's stored using two bytes of a string, so I receive it as a couple of chars.
I know COBOL stores decimal data onto a "S9(2) COMP" in binary format, so It would be a great help letting me know any algorithm or way to convert it safely.
Any kind of help & suggestion will be welcome.
Update:
Finally we decided to change the picture of the variable to 9(3) in the COBOL part of the implementation, because of the endianness problem.
Thanks to all of you for the answers.
You should be able to treat that as a short, or a 16-bit two's complement integer. You will need to check the endianness, though, depending upon which platform originated the field.
The format of S9(2) COMP will depend on the COBOL compiler. Most COBOL compilers I know will store it as a 2-byte big-endian integer (high byte first; Intel processors store the low byte first). Exceptions include:
MicroFocus (depends on compile parameters) - normally 1 byte integer
RM-Cobol : Has its own "Binary Format".
If it is a 2 byte integer
On Big-Endian machine (IBM Mainframe, Power etc) it should be a 2-Byte integer
On Little-Endian (Intel) you need to swap the bytes around.
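For the common case described above, the conversion can be sketched without depending on the host's byte order. The function name comp2_to_int is hypothetical, and the code assumes the field really is a 2-byte big-endian two's complement integer; that has to be confirmed for the COBOL compiler in use.

```c
#include <stdint.h>

/* Assumption: `bytes` holds a 2-byte big-endian two's complement integer. */
int comp2_to_int(const unsigned char bytes[2])
{
    uint16_t raw = (uint16_t)(((unsigned)bytes[0] << 8) | bytes[1]);

    /* Sign-extend by hand: values with the top bit set are negative. */
    if (raw & 0x8000u)
        return (int)((long)raw - 65536L);
    return (int)raw;
}
```

Because the bytes are combined with shifts, the same code gives the same result on Intel and on big-endian machines, so no explicit byte swap is needed.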
The S9(2) COMP indicates a left-aligned (bit 0 at the left of the word), not explicitly synchronised, signed numeric field, probably in an internal or pseudo-binary format, that can hold 2 digits; a bit in the word (or in a different memory location) may be used to indicate whether the value is positive or negative. A COBOL program can have a specific method for storing a COMP data item which may not be directly compatible with C. It appears that you need to access and test the sign indicator in a relevant way, and you may have to access each of the 2 bytes (2 of 8 bits) or characters (2 of 6 bits), check whether they are big-endian or little-endian, and put them into a signed integer field in C. A lot will depend on the architecture and compilers of the computers involved in creating and reading the data field, and it can be more complicated if 2 computer types are involved.

Resources