no big endian and little endian in strings? - C

We know that machines differ in byte ordering: some store a multi-byte object in memory ordered from the least significant byte to the most significant, while others store it from most to least significant, e.g. for a hexadecimal value of 0x01234567.
So if we write a C program that prints each byte starting from the object's memory address, big-endian and little-endian machines produce different results.
But for strings, the same result would be obtained on any system using ASCII as its character code, independent of the byte ordering and word size conventions. As a consequence, text data is more platform-independent than binary data.
So my question is: why do we distinguish big endian and little endian for binary data? We could make it the same as text data, which is platform-independent. What is the point of having big-endian and little-endian machines just for binary data?

Array elements are always addressed from low to high, regardless of endianness conventions.
ASCII and UTF-8 strings are arrays of char, which is not a multibyte type and is not affected by endianness conventions.
"Wide" strings, where each character is represented by wchar_t or another multibyte type, will be affected, but only for the individual elements, not the string as a whole.

So my question is: why do we distinguish big endian and little endian for binary data? We could make it the same as text data, which is platform-independent. What is the point of having big-endian and little-endian machines just for binary data?
In short: we already do. For example, a file format specification will dictate whether a 32-bit integer should be serialized in big-endian or little-endian order. Similarly, network protocols dictate the byte order of multi-byte values (which is why htons is a thing).
However, if we're only concerned with the in-memory representation of binary data (and not serialized binary data), then it makes sense to store values using the fastest representation, i.e. the byte order natively preferred by the CPU and ISA. For x86 and x64 this is little-endian; 68k and SPARC, for example, are big-endian, while ARM and MIPS are bi-endian (most non-x86 ISAs now support both big-endian and little-endian modes, and ARM in practice almost always runs little-endian).
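For example, a sketch of serializing a 32-bit value in network (big-endian) byte order using the standard htonl/ntohl conversions; the helper names are just for illustration:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl, ntohl (POSIX) */

/* Serialize a 32-bit value into a buffer in network (big-endian)
   byte order, regardless of the host's native byte order. */
void put_u32_be(unsigned char *buf, uint32_t value)
{
    uint32_t be = htonl(value);
    memcpy(buf, &be, sizeof be);
}

/* Read it back on any machine. */
uint32_t get_u32_be(const unsigned char *buf)
{
    uint32_t be;
    memcpy(&be, buf, sizeof be);
    return ntohl(be);
}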
But for strings, the same result would be obtained on any system using ASCII as its character code, independent of the byte ordering and word size conventions. As a consequence, text data is more platform-independent than binary data.
So my question is: why do we distinguish big endian and little endian for binary data? We could make it the same as text data, which is platform-independent.
In short:
ASCII Strings are not integers.
Integers are not ASCII strings.
You're basically asking why we don't represent integer numbers in a base-10, big-endian format. We don't because base-10 is difficult for digital computers to work with (digital computers work in base-2). The closest thing we have to what you're describing is binary-coded decimal (BCD), and the reason computers today don't normally use it is that it's slow and inefficient: only 4 bits are needed to represent a base-10 digit in base-2, so you could "pack" two base-10 digits into a single byte, but that can be slow because CPUs are generally fastest on word-sized (and at least byte-sized) values, not nibble-sized (half-byte) values. And it still doesn't solve the big-endian vs. little-endian problem: BCD values could be stored in either BE or LE order, and even char-based strings could be stored in reverse order without it affecting how they're processed.
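For illustration only, a minimal sketch of packed BCD -- two decimal digits per byte; note that even here, which digit goes in the high nibble is itself a convention, analogous to the endianness question:

#include <assert.h>

/* Pack a value 0..99 as two BCD digits in one byte:
   tens digit in the high nibble, ones digit in the low nibble. */
unsigned char bcd_pack(unsigned value)
{
    assert(value <= 99);
    return (unsigned char)(((value / 10) << 4) | (value % 10));
}

/* Recover the decimal value from the packed byte. */
unsigned bcd_unpack(unsigned char bcd)
{
    return (unsigned)((bcd >> 4) * 10u + (bcd & 0x0Fu));
}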

Related

Endianness for all C data types on Windows OS

I am trying to document the different data types and how they are stored in memory in C. I know how many bytes each data type takes up, but I would also like to know the endianness of every data type. This is specifically for Windows.
Endianness is (usually) a function of the hardware, not the OS or the language, so all multi-byte types should have the same endianness.
Emphasis on should.
For x86 and x86_64 (on which Windows primarily runs), all multi-byte types are little-endian.
But there are always going to be some oddball platforms. The DEC VAX was little-endian except for floating-point types, which were stored in a combined big- and little-endian order. From Kapps & Stafford [1]:
The VAX was designed in part to be compatible with the PDP-11 computer. The PDP-11 is a 16-bit machine, and 32-bit and 64-bit floating point numbers were stored as sequences of 16-bit words with the most significant part coming first. This was unfortunate for the VAX, because the VAX almost universally places the least significant part first. Floating-point numbers are the main exception to this rule. As a consequence, when an F_floating number is stored in a longword, we have to reverse the first 16 bits with the last 16 bits.
IOW, each 16-bit word was big-endian (byte order 01), but the sequence of 16-bit words was little-endian, so the byte order of a 32-bit F_float was 2301.
As for type sizes...
C does not specify sizes for the "traditional" scalar types like int, long, float, double, etc. It specifies a minimum range of values that each type must be able to represent. A char must be able to represent all characters in the basic execution character set, meaning it must be at least 8 bits wide, but it may be wider (9-bit bytes and 36-bit words are a thing, or at least used to be). An int must be able to represent values in at least the range -32767..32767, meaning it must be at least 16 bits wide.
[1] Kapps, Charles A. and Robert L. Stafford, VAX Assembly Language and Architecture, Prindle, Weber & Schmidt, 1985.
I know how many bytes each data type takes up
That's impossible to know in general. C doesn't define exact sizes for its types; it only provides minimums. For example, an int need only be 16 bits in size, but it's common for it to be larger. This can vary between compilers and operating systems.
I would like to know the endianness of every data type.
C doesn't define this either. It's going to depend on the hardware of the system. For example, x86 and x86_64 are little-endian. ARM is bi-endian, though in practice it almost always runs little-endian. And some old systems use byte orders that differ from both of these.
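A small sketch that probes the byte order at run time and prints the sizes your compiler actually uses -- the output depends on the platform, which is exactly the point:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Look at the first (lowest-addressed) byte of a known value. */
    uint32_t probe = 0x01020304;
    unsigned char first = *(unsigned char *)&probe;

    if (first == 0x04)
        printf("little-endian\n");
    else if (first == 0x01)
        printf("big-endian\n");
    else
        printf("other/mixed byte order\n");

    /* Sizes are implementation-defined; only minimums are guaranteed. */
    printf("short: %zu, int: %zu, long: %zu, long long: %zu bytes\n",
           sizeof(short), sizeof(int), sizeof(long), sizeof(long long));
    printf("float: %zu, double: %zu bytes\n",
           sizeof(float), sizeof(double));
    return 0;
}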

Is C Endian neutral?

Is C endian-neutral?
Ok, another way of asking this question.
I am currently translating a lot of code from C to Matlab on the same platform (PC). Do I need to care about endianness?
Both are, I believe, endian-neutral languages: Matlab pretty surely, C I'm not so sure about.
By the same token I am also translating C to Python.
So my question is: has anybody, in their experience translating from C to another endian-neutral language, met an unexpected problem with big/little endianness?
Obviously we are only speaking about the core language. In this case I mean C99.
First, some background and clarification:
As I mentioned in a comment to the original question, byte order is often confused with bit order. Endianness refers to byte order only. Bit order is only relevant in documentation and when data is sent via some serial connection.
In arithmetic, in base B (with 2 ≤ B ∈ ℕ), the i-th digit D_i has value D_i · B^i. The least significant integral digit corresponds to i = 0, i.e. D_0. For binary, B = 2. For the ordinary decimal numbers most humans prefer, B = 10.
(This works for all reals, not just integers. The most significant fractional digit, the first digit on the other side of the decimal point, is D_-1, with more negative i's indicating less significant digits.)
Because 'bit' is a portmanteau of 'binary digit', we thus have a natural way of labeling bits, with bit 0 referring to the least significant (integer) bit (corresponding to value 1), bit 1 referring to the next one in significance (corresponding to value 2), and so on.
Some documentation for hardware using big-endian byte order insists on labeling the most significant bit in a word as "bit 0" (with bit numbers increasing from left to right -- contrary to most numeric representations, where digits grow more significant from right to left). This is just a labeling convention, as this convention does not follow the arithmetic rules. In fact, you need to know the width (number of bits) in that word, to even calculate the actual numeric value of such "bit 0"s.
Is C endian-neutral?
Yes, C (as in ISO C89, C99, and C11) is neutral with regards to byte order. The standards do not define any byte order; it is up to the implementation to decide. In practice, the compiler chooses the byte order suitable for the target architecture at compile time.
In theory, integer and floating-point types may very well have different byte order.
POSIX.1 adds networking support to C. Certain fields in network-related structures are defined to be in network byte order, most significant byte first. POSIX.1 provides htons(), htonl(), ntohs(), and ntohl() byteorder functions to convert from host to network byte order and vice versa.
In addition to network byte order (which is often called big-endian), little-endian byte order (least significant byte first) is also very common, for example on Intel/AMD architectures. The PDP-endian byte order (where four-byte values are stored second-most significant byte first, followed by the most significant byte, followed by the least significant, followed by the second-least significant byte) is nowadays rare.
Finally, C has been implemented on a large number of architectures, with byte orders covering all three mentioned above, without any byte order issues. That should be practical proof enough.
I am currently translating a lot of code from C to Matlab [or Python] on the same platform (PC). Do I need to care about endianness?
No, I don't see any reason for you to care about endianness when porting code between C, Matlab, Python, or just about any high-level language.
However:
Language being endian-neutral does not mean you don't need to care about endianness in your programs. Data byte order matters. It boils down to how your programs transfer -- read and write -- data; be that via in-memory structures (using shared memory, or between different programming languages via library bindings), to/from files, via network connections, or via pipes from/to other programs.
If your programs transfer data in some text-based format, then all you need to worry about is that format, and possibly the character set used -- I prefer UTF-8 (see utf8everywhere.org).
If your programs transfer data in binary, then you must understand that in binary, multi-byte values always have some specific byte order. It can be network byte order (or big-endian), little-endian, or native byte order for the current architecture. Just because your programming language is endian-neutral, does not mean you get to ignore the storage byte order.
For example, Matlab and Octave fread() support a fifth parameter that specifies the byte order used: native, ieee-be (IEEE big-endian), or ieee-le (IEEE little-endian). Python struct module pack and unpack functions default to native byte order and C alignment (padding), but you can use < or > as the first character in the format string to indicate little-endian or big-endian/network-endian byte order with no padding.
It is very common for C code to store binary data in native byte order. However, some C code does not. I prefer to store in native byte order, but also store known prototype values for each different basic numeric type, so that readers can trivially detect if they need to permute the byte order to interpret the code correctly. There are also various libraries and formats like NetCDF that may be utilized for creating portable binary data files.
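For a fixed on-disk byte order, one common approach in C is to compose the bytes explicitly with shifts, so the same code works on any host; a minimal sketch for a little-endian file layout (the function names are just for illustration):

#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value in a fixed little-endian layout,
   independent of the host byte order. */
int write_u32_le(FILE *f, uint32_t v)
{
    unsigned char b[4] = {
        (unsigned char)(v & 0xFF),
        (unsigned char)((v >> 8) & 0xFF),
        (unsigned char)((v >> 16) & 0xFF),
        (unsigned char)((v >> 24) & 0xFF),
    };
    return fwrite(b, 1, 4, f) == 4 ? 0 : -1;
}

/* Read it back, again independent of the host byte order. */
int read_u32_le(FILE *f, uint32_t *out)
{
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4)
        return -1;
    *out = (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
    return 0;
}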
The most important thing is to understand what the C code does, first.
I don't see why someone would want to port code from C to Matlab or Python, unless the C code was really poor to begin with -- in which case I'd just rewrite the logic, not port the existing code.
Have you met an unexpected problem with big/little endianness?
No, never when porting code between high-level languages.
Yes, when storing/retrieving binary data between different systems.
While not related to endianness, for multi-dimensional data, it is important to remember that Fortran and Matlab (and OpenGL matrices) use column-major order (each column being consecutive in memory), while C uses row-major order (each row being consecutive in memory).

Does zlib's "uncompress" preserve the data's original endianness, or does it do an endian conversion?

I am working with legacy C++ code that accesses two-byte integer data compressed in a sqlite database. The code uses zlib's uncompress function to extract the data, which comes out on my little-endian machine as little-endian values.
To allow for the possibility that this code may be ported to big-endian machines, I need to know if the data will always decompress in little-endian order, or if (instead) zlib will somehow do the conversion.
This is the only applicable tidbit I've been able to find (from zlib's FAQ on their site):
Will zlib work on a big-endian or little-endian architecture, and can I exchange compressed data between them?
Yes and yes.
Doesn't really answer my question... I'm prepared to handle the endian conversion if needed. Is it safe to assume that the original input data endianness is what you get back out, regardless of the platform on which you run uncompress? (I don't have access to a big-endian machine at present on which to test this myself).
zlib compresses and decompresses a stream of bytes losslessly. So whatever endianness went in is exactly what comes out. This is entirely independent of the endianness of the compressing and decompressing machines.
The FAQ entry refers to the fact that the code was written to be insensitive to the endianness of the architecture that the code is compiled for and run on.
RFC1950 specifically states how zlib's own meta-data multi-byte values are stored:
Within a computer, a number may occupy multiple bytes. All multi-byte numbers in the format described here are stored with the MOST-significant byte first (at the lower memory address). For example, the decimal number 520 is stored as:
    0        1
+--------+--------+
|00000010|00001000|
+--------+--------+
 ^        ^
 |        |
 |        + less significant byte = 8
 + more significant byte = 2 x 256
So operations regarding multi-byte values for internal use of zlib must take endianness into account (which is what FAQ #26 answered).
The compressed data itself will be unchanged, because zlib compresses and decompresses with a granularity of bytes, and not larger units.
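A minimal sketch of that round trip using zlib's compress()/uncompress() convenience functions (buffer sizes chosen just for the example):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* Some 16-bit values, copied into a byte buffer in the host's
       native byte order. */
    uint16_t values[4] = { 1, 2, 515, 0xABCD };
    unsigned char packed[sizeof values];
    memcpy(packed, values, sizeof values);

    unsigned char comp[128];
    uLongf comp_len = sizeof comp;
    if (compress(comp, &comp_len, packed, sizeof packed) != Z_OK)
        return 1;

    unsigned char out[sizeof packed];
    uLongf out_len = sizeof out;
    if (uncompress(out, &out_len, comp, comp_len) != Z_OK)
        return 1;

    /* Byte-for-byte identical to what went in, on any machine.
       Interpreting the bytes as integers is where endianness
       matters, and that is up to the reader of the data. */
    printf("round-trip %s\n",
           memcmp(packed, out, sizeof packed) == 0 ? "ok" : "FAILED");
    return 0;
}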

bit endianness and portability of C binary files

In C, I have a char array that I use to store data at the bit level. I store these arrays to files, then read them on machines with different architectures. My question is whether the order of the bits is guaranteed to be consistent. For example, if I store "10010011" to the first byte, will the adjacent 1's always be read as being in the 2^0 and 2^1 positions, or could they end up interpreted as the 2^7 and 2^6 bits?
EDIT: I want to clarify this question a little for people who read this page later. Byte endianness is the order of bytes in a multibyte object, but my concern is with the bits in a given byte. When a byte is stored to disc, it is stored as a sequence of (usually) 8 bits. I'm no hardware expert, but it has to come down to that somehow. So, my concern is if the way the byte is stored is such that any machine will read the original unsigned char value, or if what is 3 to one machine will be 192 to another. I am concerned the bits will end up shuffled somehow. Apparently, this is not a concern, according to the answer I selected as well as one of the comments below. Thanks.
the simple answer:
The bits will still be in the correct order.
However, if you perform any format conversion beyond %c, for instance %d, then the endianness of the reading architecture will determine the byte order. The bits within each byte will still be the same.
Endianness is about byte order, not bit order. So 00001101 on a little-endian machine will be the same on a big-endian machine. However, there is something you should know about bit order on different machines: the packing of bit-fields can differ. If you are going to use bit-fields (for example inside a union), read up on how endianness affects bit-field packing.
The concept you are trying to ask about is known as bit-numbering or bit endianness and system architectures are referred to as least-or most- significant bit (MSB, LSB) ordering.
As far as I know, the reference is always with respect to the 0-th or first bit position.
With respect to a single 8-bit byte or octet, it will be portable, such that the value of the byte will be consistently considered to be 0x93 (147 decimal). That assumes you are writing the bit string as an LSB-0 representation, with bit 0 being the rightmost bit (the norm for a little-endian processor), as typically done by users of left-to-right natural languages such as English.
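A sketch of the kind of bit-level storage the question describes, using shifts and masks on an unsigned char; fputc/fgetc move whole bytes, and the shift arithmetic defines which bit is bit 0, so the value reads back as 0x93 on any machine:

#include <stdio.h>

int main(void)
{
    /* Build "10010011" (0x93) by setting individual bits. */
    unsigned char b = 0;
    b |= 1u << 0;   /* bit 0, value 1   */
    b |= 1u << 1;   /* bit 1, value 2   */
    b |= 1u << 4;   /* bit 4, value 16  */
    b |= 1u << 7;   /* bit 7, value 128: b is now 0x93 (147) */

    FILE *f = fopen("bits.dat", "wb+");
    if (!f) return 1;
    fputc(b, f);
    rewind(f);
    int read_back = fgetc(f);
    fclose(f);

    printf("wrote 0x%02x, read 0x%02x\n", (unsigned)b, (unsigned)read_back);
    return 0;
}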

retrieve COBOL S9(2) COMP into a C variable

I need to retrieve data from a COBOL variable of the type "PIC S9(2) COMP" into a C variable of the type "int".
It's stored using two bytes of a string, so I receive it as a couple of chars.
I know COBOL stores decimal data in an S9(2) COMP field in binary format, so it would be a great help to let me know any algorithm or way to convert it safely.
Any kind of help & suggestion will be welcome.
Update:
Finally we decided to change the picture of the variable to 9(3) in the COBOL part of the implementation, because of the endianness problem.
Thanks to all of you for the answers.
You should be able to treat that as a short, i.e. a 16-bit two's complement integer. You will need to check for endianness, though, depending on which platform originated the field.
The format of S9(2) COMP will depend on the COBOL compiler. Most COBOL compilers I know of will store it as a 2-byte big-endian integer (high byte first; Intel processors store the low byte first). Exceptions include:
Micro Focus (depends on compile parameters) - normally a 1-byte integer
RM-COBOL: has its own "Binary Format".
If it is a 2-byte integer:
On a big-endian machine (IBM mainframe, Power, etc.) you can use the 2-byte integer as-is.
On a little-endian machine (Intel) you need to swap the bytes around, as in the sketch below.
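Assuming the common 2-byte big-endian two's complement layout, a sketch of converting the two received chars in C (swap buf[0] and buf[1] if the originating platform turns out to be little-endian; the function name is just for illustration):

#include <stdint.h>

/* Interpret two received bytes as a big-endian, 16-bit,
   two's complement COMP field (the usual mainframe layout). */
int comp2_to_int(const unsigned char buf[2])
{
    uint16_t raw = (uint16_t)(((unsigned)buf[0] << 8) | buf[1]);
    return (int)(int16_t)raw;   /* sign-extend the 16-bit value */
}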
The S9(2) COMP indicates a left-aligned (bit 0 at the left of the word), not explicitly synchronised, signed numeric field, probably in an internal or pseudo-binary format, that can hold 2 digits; possibly a bit in the word (or in a different memory location) is used to indicate whether the value is positive or negative. A COBOL program can have a specific method for storing a COMP data item which may not be directly compatible with C. It appears that you need to access and test the sign indicator in a relevant way, and you may have to access each of the 2 bytes (2 of 8 bits) or characters (2 of 6 bits), check whether they are big-endian or little-endian, and put them into a signed integer field in C. A lot will depend on the architectures and compilers of the computers involved in creating and reading the data field, and it can be more complicated if 2 computer types are involved.
