Does zlib's "uncompress" preserve the data's original endianness, or does it do an endian conversion? - zlib

I am working with legacy C++ code that accesses two-byte integer data compressed in a sqlite database. The code uses zlib's uncompress function to extract the data, which comes out on my little-endian machine as little-endian values.
To allow for the possibility that this code may be ported to big-endian machines, I need to know if the data will always decompress in little-endian order, or if (instead) zlib will somehow do the conversion.
This is the only applicable tidbit I've been able to find (from zlib's FAQ on their site):
Will zlib work on a big-endian or little-endian architecture, and can I exchange compressed data between them?
Yes and yes.
Doesn't really answer my question... I'm prepared to handle the endian conversion if needed. Is it safe to assume that the original input data endianness is what you get back out, regardless of the platform on which you run uncompress? (I don't have access to a big-endian machine at present on which to test this myself).

zlib compresses and decompresses a stream of bytes losslessly. So whatever endianness went in is exactly what comes out. This is entirely regardless of the endianness of the compressing and decompressing machines.
The FAQ entry refers to the fact that the code was written to be insensitive to the endianness of the architecture that the code is compiled for and runs on.
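For example, if the two-byte values were written little-endian before compression, a portable way to read them back on any host is to assemble each value from individual bytes rather than casting the decompressed buffer to a 16-bit pointer. A minimal sketch (the function name and parameters are made up here, not part of zlib's API):

    #include <stddef.h>
    #include <stdint.h>

    /* Decode 16-bit values that were stored little-endian in the
       decompressed byte stream. Works the same on big- and little-endian
       hosts because it never reinterprets the buffer as uint16_t*. */
    void decode_le16(const unsigned char *buf, size_t count, uint16_t *out)
    {
        for (size_t i = 0; i < count; i++) {
            out[i] = (uint16_t)(buf[2 * i] | (buf[2 * i + 1] << 8));
        }
    }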

RFC1950 specifically states how zlib's own meta-data multi-byte values are stored:
Within a computer, a number may occupy multiple bytes. All multi-byte numbers in the format described here are stored with the MOST-significant byte first (at the lower memory address). For example, the decimal number 520 is stored as:
        0        1
    +--------+--------+
    |00000010|00001000|
    +--------+--------+
     ^        ^
     |        |
     |        + less significant byte = 8
     + more significant byte = 2 x 256
So operations regarding multi-byte values for internal use of zlib must take endianness into account (which is what FAQ #26 answered).
The compressed data itself will be unchanged, because zlib compresses and decompresses with a granularity of bytes, and not larger units.
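As an illustration of that stated byte order: the Adler-32 checksum occupies the last four bytes of a zlib stream, most significant byte first (per RFC 1950), so reading it by hand would look roughly like the sketch below. In practice zlib verifies the checksum for you; this only illustrates the byte order, and it assumes the stream is at least 4 bytes long.

    #include <stddef.h>
    #include <stdint.h>

    /* Read the big-endian Adler-32 value from the last 4 bytes of a
       complete zlib stream (RFC 1950 stores it MSB first). */
    uint32_t read_adler32(const unsigned char *stream, size_t stream_len)
    {
        const unsigned char *p = stream + stream_len - 4;
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
             | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }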

Related

no big endian and little endian in string?

We know that machines with different byte orderings store an object in memory either from the least significant byte to the most significant, or from the most to the least - e.g. for a hexadecimal value of 0x01234567.
So if we write a C program that prints each byte starting from the object's memory address, big-endian and little-endian machines produce different results.
But for strings, the same result would be obtained on any system using ASCII as its character code, independent of the byte ordering and word size conventions. As a consequence, text data is more platform-independent than binary data.
So my question is, why do we distinguish big endian and little endian for binary data? Couldn't we make it the same as text data, which is platform-independent? What's the point of having big-endian and little-endian machines just for binary data?
Array elements are always addressed from low to high, regardless of endianness conventions.
ASCII and UTF-8 strings are arrays of char, which is not a multibyte type and is not affected by endianness conventions.
"Wide" strings, where each character is represented by wchar_t or another multibyte type, will be affected, but only for the individual elements, not the string as a whole.
So my question is, why do we distinguish big endian and little endian for binary data? Couldn't we make it the same as text data, which is platform-independent? What's the point of having big-endian and little-endian machines just for binary data?
In short: we already do: for example, a file format specification will dictate if a 32-bit integer should be serialized in big-endian or little-endian order. Similarly, network protocols will dictate the byte-order of multi-byte values (which is why htons is a thing).
However if we're only concerned with the in-memory representation of binary data (and not serialized binary data) then it makes sense to store values using the fastest representation - i.e. the byte order natively preferred by the CPU and ISA. For x86 and x64 this is little-endian; 68k is big-endian, while ARM and MIPS are bi-endian (they support both orders), with ARM in practice almost always running little-endian today.
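For example, a serializer for a file format or network protocol can pin the byte order explicitly with shifts, so the stored layout is identical no matter which CPU produced it. A sketch with made-up function names:

    #include <stdint.h>

    /* Write a 32-bit value in big-endian (network) order, regardless of
       the host's native byte order. */
    void put_u32_be(unsigned char out[4], uint32_t v)
    {
        out[0] = (unsigned char)(v >> 24);
        out[1] = (unsigned char)(v >> 16);
        out[2] = (unsigned char)(v >> 8);
        out[3] = (unsigned char)v;
    }

    /* The little-endian counterpart, as another format might require. */
    void put_u32_le(unsigned char out[4], uint32_t v)
    {
        out[0] = (unsigned char)v;
        out[1] = (unsigned char)(v >> 8);
        out[2] = (unsigned char)(v >> 16);
        out[3] = (unsigned char)(v >> 24);
    }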
But for strings, the same result would be obtained on any system using ASCII as its character code, independent of the byte ordering and word size conventions. As a consequence, text data is more platform-independent than binary data.
So my question is, why do we distinguish big endian and little endian for binary data? Couldn't we make it the same as text data, which is platform-independent?
In short:
ASCII Strings are not integers.
Integers are not ASCII strings.
You're basically asking why we don't represent integer numbers in a Base-10, big-endian format. We don't, because Base-10 is difficult for digital computers to work with (digital computers work in Base-2). The closest thing to what you're describing is binary-coded decimal (BCD), and computers today don't normally use it because it's slow and inefficient: only 4 bits are needed to represent a Base-10 digit in Base-2, so you could "pack" two Base-10 digits into a single byte, but that tends to be slow because CPUs are generally fastest on word-sized (and at least byte-sized) values, not nibble-sized (half-byte) values. And this still doesn't solve the big-endian vs. little-endian problem: BCD values could still be stored in either BE or LE order, and even char-based strings could be stored in reverse order without it affecting how they're processed!
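To make the packed-BCD aside concrete, here is a sketch of packing two decimal digits into one byte (the function name is made up, and note that the byte order of a multi-byte BCD number would still have to be chosen):

    #include <stdint.h>

    /* Pack two decimal digits (each 0-9) into one byte: the more
       significant digit goes in the high nibble. 42 becomes 0x42. */
    uint8_t pack_bcd(unsigned tens, unsigned ones)
    {
        return (uint8_t)((tens << 4) | (ones & 0x0Fu));
    }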

Is C Endian neutral?

Is C endian-neutral?
Ok, another way of asking this question.
I am currently translating a lot of code from C to Matlab on the same platform (PC). Do I need to care about endianness?
Both are supposedly endian-neutral languages, though I'm not so sure about C and pretty sure about Matlab.
By the same token I am also translating C to Python.
So my question is: has anybody, in their experience translating from C to another endian-neutral language, met an unexpected problem with big/little endianness?
Obviously we are only speaking about the core language. In this case I mentioned C99.
First, some background and clarification:
As I mentioned in a comment to the original question, byte order is often confused with bit order. Endianness refers to byte order only. Bit order is only relevant in documentation and when data is sent via some serial connection.
In arithmetic, in base B (and 2 ≤ B ∈ ℕ), the i'th digit D_i has value D_i × B^i. The least significant integral digit corresponds to i = 0, i.e. D_0. For binary, B = 2. For ordinary decimal numbers most humans prefer, B = 10.
(This works for all reals, not just integers. The most significant fractional digit, the first digit on the other side of the decimal point, is D_-1, with more negative i's indicating less significant digits.)
Because 'bit' is a portmanteau of 'binary digit', we thus have a natural way of labeling bits, with bit 0 referring to the least significant (integer) bit (corresponding to value 1), bit 1 referring to the next one in significance (corresponding to value 2), and so on.
Some documentation for hardware using big-endian byte order insists on labeling the most significant bit in a word as "bit 0" (with bit numbers increasing from left to right -- contrary to most numeric representations, where digits grow more significant from right to left). This is just a labeling convention that does not follow the arithmetic rules: in fact, you need to know the width (number of bits) of the word to even calculate the actual numeric value of such a "bit 0".
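As a concrete example of that dependence on width: under MSB-first labeling, "bit 0" of a 32-bit word has value 2^31, while "bit 0" of a 16-bit word has value 2^15. A trivial sketch (the function name is made up; it assumes k < width ≤ 64):

    #include <stdint.h>

    /* Numeric weight of "bit k" when the documentation labels the MOST
       significant bit of a 'width'-bit word as bit 0. */
    uint64_t msb0_bit_value(unsigned k, unsigned width)
    {
        return (uint64_t)1 << (width - 1u - k);
    }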
Is C endian-neutral?
Yes, C (as in ISO C89, C99, and C11) is neutral with regards to byte order. The standards do not define any byte order; it is up to the implementation to decide. In practice, the compiler chooses the byte order suitable for the target architecture at compile time.
In theory, integer and floating-point types may very well have different byte order.
POSIX.1 adds networking support to C. Certain fields in network-related structures are defined to be in network byte order, most significant byte first. POSIX.1 provides htons(), htonl(), ntohs(), and ntohl() byteorder functions to convert from host to network byte order and vice versa.
In addition to network byte order (which is often called big-endian), little-endian byte order (least significant byte first) is also very common, for example on Intel/AMD architectures. The PDP-endian byte order (where four-byte values are stored second-most significant byte first, followed by the most significant byte, followed by the least significant, followed by the second-least significant byte) is nowadays rare.
Finally, C has been implemented on a large number of architectures, with byte orders covering all three mentioned above, without any byte order issues. That should be practical proof enough.
I am currently translating a lot of code from C to Matlab [or Python] on the same platform (PC). Do I need to care about endianness?
No, I don't see any reason for you to care about endianness when porting code between C, Matlab, Python, or just about any high-level language.
However:
Language being endian-neutral does not mean you don't need to care about endianness in your programs. Data byte order matters. It boils down to how your programs transfer -- read and write -- data; be that via in-memory structures (using shared memory, or between different programming languages via library bindings), to/from files, via network connections, or via pipes from/to other programs.
If your programs transfer data in some text-based format, then all you need to worry about is that format, and possibly the character set used -- I prefer UTF-8 (see utf8everywhere.org).
If your programs transfer data in binary, then you must understand that in binary, multi-byte values always have some specific byte order. It can be network byte order (or big-endian), little-endian, or native byte order for the current architecture. Just because your programming language is endian-neutral, does not mean you get to ignore the storage byte order.
For example, Matlab and Octave fread() support a fifth parameter that specifies the byte order used: native, ieee-be (IEEE big-endian), or ieee-le (IEEE little-endian). Python struct module pack and unpack functions default to native byte order and C alignment (padding), but you can use < or > as the first character in the format string to indicate little-endian or big-endian/network-endian byte order with no padding.
It is very common for C code to store binary data in native byte order. However, some C code does not. I prefer to store in native byte order, but also store known prototype values for each different basic numeric type, so that readers can trivially detect if they need to permute the byte order to interpret the code correctly. There are also various libraries and formats like NetCDF that may be utilized for creating portable binary data files.
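One way to implement the "known prototype values" idea mentioned above might look like the sketch below (the magic constant and function names are made up): write a fixed multi-byte constant at the start of the file in native order, so a reader can tell whether it needs to byte-swap the rest.

    #include <stdint.h>
    #include <stdio.h>

    #define PROTO_U32 0x01020304u   /* hypothetical prototype value */

    /* Writer: store the prototype in native byte order. */
    int write_header(FILE *f)
    {
        uint32_t proto = PROTO_U32;
        return fwrite(&proto, sizeof proto, 1, f) == 1 ? 0 : -1;
    }

    /* Reader: decide whether the rest of the file needs byte swapping. */
    int needs_swap(FILE *f)
    {
        uint32_t proto;
        if (fread(&proto, sizeof proto, 1, f) != 1)
            return -1;                    /* read error */
        if (proto == PROTO_U32)
            return 0;                     /* same byte order as the writer */
        if (proto == 0x04030201u)
            return 1;                     /* byte-reversed: swap needed */
        return -1;                        /* unknown layout */
    }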
The most important thing is to understand what the C code does, first.
I don't see why someone would want to port code from C to Matlab or Python, unless the C code was really poor to begin with -- in which case I'd just rewrite the logic, not port the existing code.
Have you met an unexpected problem with big/little endianness?
No, never when porting code between high-level languages.
Yes, when storing/retrieving binary data between different systems.
While not related to endianness, for multi-dimensional data, it is important to remember that Fortran and Matlab (and OpenGL matrices) use column-major order (each column being consecutive in memory), while C uses row-major order (each row being consecutive in memory).

big-endian && little-endian?

Can any one tell me what this statement means:
"Specify the endianness of the object files. This only affects disassembly. This can be useful when disassembling a file format which does not describe endianness information, such as S-records. "
This article explains really well what endianness is and how to program in an endian independent way.
Different environments store numbers in different ways,
Big endian environments store information with the most significant byte first.
Little endian environments store information with the least significant byte first.
Nowadays most high-level frameworks take care of all this for you; however, if you're programming at a lower level then this will be important.
Have a peek at the Wikipedia entry for it, it's really not as bad as some others.
I don't understand all of the question, but endianness is used to describe the order in which the bytes of a number are stored.
For example, the number 256 is too big for 1 byte, so it is represented in 2 bytes: a 1 in one byte and a 0 in the other, representing 1 x 256 plus 0 x 1.
The order in which those two bytes are stored is the endianness.
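As a tiny sketch of that (assuming a 16-bit integer type), copying the two bytes of 256 into a char array shows which byte your machine stores first:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        uint16_t n = 256;            /* 0x0100: one "256s" byte and a zero "units" byte */
        unsigned char b[2];
        memcpy(b, &n, sizeof n);
        /* Prints 00 01 on a little-endian host, 01 00 on a big-endian host. */
        printf("%02x %02x\n", b[0], b[1]);
        return 0;
    }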

endianness and integer variable

In C I am a little worried about the concept of endianness. Suppose I declare an integer (2 bytes) on a little-endian machine as
int a = 1;
and it is stored as:
    Address    Value
    1000       1
    1001       0
On big endian it should be stored vice-versa. Now if I do &a then I should get 1000 on little endian and 1001 on big endian.
If this is true then if I store int a=1 then I should get a=1 on little endian whereas 2^15 on big endian. Is this correct?
It should not matter to you how the data is represented as long as you don't transfer it between platforms (or access it through assembly code).
If you only use standard C, it's hidden from you and you shouldn't bother yourself with it. If you pass data around between unknown machines (for example you have a client and a server application that communicate over a network), use the htons/htonl and ntohs/ntohl functions, which convert between local and network byte order, to avoid problems.
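For example, for a 16-bit field sent over a socket, the POSIX byte-order helpers from <arpa/inet.h> would be used on each side. A minimal sketch (the wrapper names are made up):

    #include <stdint.h>
    #include <arpa/inet.h>   /* htons, ntohs (POSIX) */

    /* Sender: convert a 16-bit value to network byte order before sending. */
    uint16_t to_wire(uint16_t host_value)
    {
        return htons(host_value);
    }

    /* Receiver: convert it back to the local byte order after receiving. */
    uint16_t from_wire(uint16_t wire_value)
    {
        return ntohs(wire_value);
    }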
If you access memory directly, then the problem would not only be endian but also packing, so be careful with it.
In both little endian and big endian, the address of the "first" byte is returned. Here by "first" I mean 1000 in your example and not "first" in the sense of most-significant or least significant byte.
Now if I do &a then I should get 1000 on little endian and 1001 on big endian.
No; on both machines you will get the address 1000.
If this is true then if I store int a=1 then I should get a=1 on little endian whereas 2^15 on big endian. Is this correct?
No; because the preliminary condition is false. Also, if you assign 1 to a variable and then read the value back again, you get the result 1 back. Anything else would be a major cause of confusion. (There are occasions when this (reading back the stored value) is not guaranteed - for example, if multiple threads could be modifying the variable, or if it is an I/O register on an embedded system. However, you'd be aware of these if you needed to know about it.)
For the most part, you do not need to worry about endianness. There are a few places where you do. The socket networking functions are one such; writing binary data to a file which will be transferred between machines of different endianness is another. Pretty much all the rest of the time, you don't have to worry about it.
This is correct, but it's not only an issue in C; it affects any program that reads and writes binary data. If the data stays on a single computer, then it will be fine.
Also, if you are reading from or writing to text files, this won't be an issue.

retrieve COBOL S9(2) COMP into a C variable

I need to retrieve data from a COBOL variable of type "PIC S9(2) COMP" into a C variable of type "int".
It's stored using two bytes of a string, so I receive it as a couple of chars.
I know COBOL stores decimal data in an "S9(2) COMP" field in binary format, so it would be a great help if you could point me to any algorithm or safe way to convert it.
Any kind of help & suggestion will be welcome.
Update:
Finally we decided to change the picture of the variable to 9(3) in the COBOL part of the implementation, because of the endianness problem.
Thanks to all of you for the answers.
You should be able to treat that as a short, i.e. a 16-bit two's complement integer. You will need to check for endianness though, depending upon which platform originated the field.
The format of S9(2) COMP will depend on the COBOL compiler. Most COBOL compilers I know will store it as a 2-byte big-endian integer (high byte first; Intel processors store the low byte first). Exceptions include:
Micro Focus (depends on compile parameters) - normally a 1-byte integer
RM-Cobol: has its own "Binary Format".
If it is a 2-byte integer:
On a big-endian machine (IBM mainframe, Power, etc.) you can use it directly as a 2-byte integer.
On a little-endian machine (Intel) you need to swap the bytes around, as in the sketch below.
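A sketch of that conversion (the function and parameter names are made up; it assumes the two received chars hold a big-endian 16-bit two's complement value):

    /* Convert a 2-byte big-endian COMP field, received as two chars,
       into a native C int, on either kind of host. */
    int comp2_to_int(const unsigned char bytes[2])
    {
        long v = ((long)bytes[0] << 8) | bytes[1];   /* 0..65535, high byte first */
        if (v >= 0x8000L)                            /* negative in 16-bit two's complement */
            v -= 0x10000L;
        return (int)v;                               /* -32768..32767 fits in an int */
    }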
The S9(2) COMP indicates a left-aligned (bit 0 at the left of the word), not explicitly synchronised, signed numeric field, probably in an internal or pseudo-binary format, that can hold 2 digits; possibly a bit in the word (or in a different memory location) is used to indicate whether the value is positive or negative. A COBOL program can have a specific method for storing a COMP data item which may not be directly compatible with C. It appears that you need to access and test the sign indicator in a relevant way, and you may have to access each of the 2 bytes (2 of 8 bits) or characters (2 of 6 bits), check whether they are big-endian or little-endian, and put them into a signed integer field in C. A lot will depend on the architecture and compilers of the computers involved in creating and reading the data field, and it can be more complicated if 2 computer types are involved.
