big-endian && little -endian? - disassembly

Can any one tell me what this statement means:
"Specify the endianness of the object files. This only affects disassembly. This can be useful when disassembling a file format which does not describe endianness information, such as S-records. "

This article explains really well what endianness is and how to program in an endian independent way.

Different environments store numbers in different ways,
Big endian environments store information with the most significant byte first.
Little endian environments store information with the least significant byte first.
Nowadays most high level frameworks take care of all this for you, however if your programming at a lower level then this will be important.
Have a peek at the wikipedia entry for it, its really not as bad as some others

I don't understand all the question, but endianess is used to describe which way round numbers are stored.
For example, the number 256 is too big for 1 byte so it is represented in 2 bytes as a 1 in one of the bytes and a 0 in the other, representing 1 * 256 plus 0 * units.
The way round these bytes are stored is endianess

Related

Is C Endian neutral?

Is C endian-neutral?
Ok, another way of asking this question.
I am currently translating a lot of code from C to Matlab on the same platform (PC). Do I need to care about endianess?
Both are endian-neutral languages but C (not so sure), Matlab (pretty sure).
By the same token I am also translating C to Python.
So my question, has anybody in his experience, (translating from C to another endian-neutral language) met an unexpected problem with big/little endianness.
Obviously we are only speaking about the core language. In this case I mentionned C99.
First, some background and clarification:
As I mentioned in a comment to the original question, byte order is often confused with bit order. Endianness refers to byte order only. Bit order is only relevant in documentation and when data is sent via some serial connection.
In arithmetic, in base B (and 2 ≤ B ∈ ℕ), the i'th digit Di has value Di Bi. The least significant integral digit corresponds to i=0, i.e. D0. For binary, B = 2. For ordinary decimal numbers most humans prefer, B = 10.
(This works for all reals, not just integers. Most significant fractional digit, the first digit on the other side of the decimal point, is D-1, with more negative i's indicating less significant digits.)
Because 'bit' is a portmanteau of 'binary digit', we thus have a natural way of labeling bits, with bit 0 referring to the least significant (integer) bit (corresponding to value 1), bit 1 referring to the next one in significance (corresponding to value 2), and so on.
Some documentation for hardware using big-endian byte order insists on labeling the most significant bit in a word as "bit 0" (with bit numbers increasing from left to right -- contrary to most numeric representations, where digits grow more significant from right to left). This is just a labeling convention, as this convention does not follow the arithmetic rules. In fact, you need to know the width (number of bits) in that word, to even calculate the actual numeric value of such "bit 0"s.
Is C endian-neutral?
Yes, C (as in ISO C89, C99, and C11) is neutral with regards to byte order. The standards do not define any byte order; it is up to the implementation to decide. In practice, the compiler chooses the byte order suitable for the target architecture at compile time.
In theory, integer and floating-point types may very well have different byte order.
POSIX.1 adds networking support to C. Certain fields in network-related structures are defined to be in network byte order, most significant byte first. POSIX.1 provides htons(), htonl(), ntohs(), and ntohl() byteorder functions to convert from host to network byte order and vice versa.
In addition to network byte order (which is often called big-endian), little-endian byte order (least significant byte first) is also very common, for example on Intel/AMD architectures. The PDP-endian byte order (where four-byte values are stored second-most significant byte first, followed by the most significant byte, followed by the least significant, followed by the second-least significant byte) is nowadays rare.
Finally, C has been implemented on a large number of architectures, with byte orders covering all three mentioned above, without any byte order issues. That should be practical proof enough.
I am currently translating a lot of code from C to Matlab [or Python] on the same platform (PC). Do I need to care about endianess?
No, I don't see any reason for you to care about endianness when porting code between C, Matlab, Python, or just about any high-level language.
However:
Language being endian-neutral does not mean you don't need to care about endianness in your programs. Data byte order matters. It boils down to how your programs transfer -- read and write -- data; be that via in-memory structures (using shared memory, or between different programming languages via library bindings), to/from files, via network connections, or via pipes from/to other programs.
If your programs transfer data in some text-based format, then all you need to worry about is that format, and possibly the character set used -- I prefer UTF-8 (see utf8everywhere.org.
If your programs transfer data in binary, then you must understand that in binary, multi-byte values always have some specific byte order. It can be network byte order (or big-endian), little-endian, or native byte order for the current architecture. Just because your programming language is endian-neutral, does not mean you get to ignore the storage byte order.
For example, Matlab and Octave fread() support a fifth parameter that specifies the byte order used: native, ieee-be (IEEE big-endian), or ieee-le (IEEE little-endian). Python struct module pack and unpack functions default to native byte order and C alignment (padding), but you can use < or > as the first character in the format string to indicate little-endian or big-endian/network-endian byte order with no padding.
It is very common for C code to store binary data in native byte order. However, some C code does not. I prefer to store in native byte order, but also store known prototype values for each different basic numeric type, so that readers can trivially detect if they need to permute the byte order to interpret the code correctly. There are also various libraries and formats like NetCDF that may be utilized for creating portable binary data files.
The most important thing is to understand what the C code does, first.
I don't see why someone would want to port code from C to Matlab or Python, unless the C code was really poor to begin with -- in which case I'd just rewrite the logic, not port the existing code.
Have you met an unexpected problem with big/little endianness?
No, never when porting code between high-level languages.
Yes, when storing/retrieving binary data between different systems.
While not related to endianness, for multi-dimensional data, it is important to remember that Fortran and Matlab (and OpenGL matrices) use column-major order (each column being consecutive in memory), while C uses row-major order (each row being consecutive in memory).

bit endianness and portability of C binary files

In C, I have a char array that I use to store data at the bit level. I store these arrays to files, then read them in machines with different architectures. My question is if the order of the bits will be guaranteed consistent? For example, if I store "10010011" to the first byte, will the adjacent 1's always be read to be in the 2^0 and 2^1 positions, or could they end up interpreted as the 2^7 and 2^6 bits?
EDIT: I want to clarify this question a little for people who read this page later. Byte endianness is the order of bytes in a multibyte object, but my concern is with the bits in a given byte. When a byte is stored to disc, it is stored as a sequence of (usually) 8 bits. I'm no hardware expert, but it has to come down to that somehow. So, my concern is if the way the byte is stored is such that any machine will read the original unsigned char value, or if what is 3 to one machine will be 192 to another. I am concerned the bits will end up shuffled somehow. Apparently, this is not a concern, according to the answer I selected as well as one of the comments below. Thanks.
the simple answer:
The bits will still be in the correct order.
However, if performing any format conversion beyond %c, for instance %d, then the endianness of the reading architecture will determine the byte order The bits within each byte will still be the same.
Endianness is about bytes' order not bits. So 00001101 in a little-endian machine will be the same in a big-endian machine. However there is something you should now about bits' order in different machines. Bits' order change in unions. If you are going to use union, read this to figure out how endianness effects bitfield packing.
The concept you are trying to ask about is known as bit-numbering or bit endianness and system architectures are referred to as least-or most- significant bit (MSB, LSB) ordering.
As far as I know the reference is always with respects to the 0-th or first bit position.
With respect to a single 8-bit byte or octet, it will be portable, such that the value of the byte will be consistently considered to be 0x93 (147 decimal). Assuming you are writing the bit string as a LSB representation with the 0-th bit is the rightmost bit (the norm for little endian processor), as typically done by users of left to right natural languages such as English.

how do I know if the case is true?

Let say if we are given a byte of binary data, how can you know what that data represents?
Is it true that you cant really know what the data represents because you need to know whether the one byte of binary data is represented in base 2, if it unsigned, signed, etc.
or is it that you can know what it represents since binary is base 2?
I am sorry to tell that a byte of data has nothing to do with it's supposed representation.
You state that because it's a byte, it's a binary representation. this is purely assumption.
It depends on the intention of the guy who store the very data.
It might represent anything. As #nos told you, it really depend on the convention the setter used to store it.
You may have a complementary to 2 number, a signed byte on 7 bit, un unsigned on 8 bits, an octal representation (or a partial representation) or a mask (each group of byte within the byte may describer something totally different than another). It could also be a representation of a special coding. Etc.
This is truly unlimited.
In order to properly interpret it you need to know the underlying convention (a spec). #fede1024 told you about files, which use special character so that you can double check with the convention.
One more thing… Bear in mind that even binary data can be stored in natural order or in reverse order: that's endianness. So when you examine a number store in at least 2 bytes, you have to know whether the most significant byte is stored first or sec on din memory. If you misinterpret this, you won't understand the underlying piece of data. Endianness is a constant for a given processor.
Base-2 and binary refer to the same thing. Typically, you do need to know whether the byte is signed or unsigned at least (in C). As for what the data represents - well, "it depends". Whether you want to interpret it as a single byte, as a character (or not), etc. With multi-byte data, you often also have to take endianness (ordering of the bytes into larger words) into account.
Some files format start with a magic number, for example all PNG files starts with 89 50 4E 47 0D 0A 1A 0A. That said, if you have a general binary file without any kind of magic number, you can just guess about his contents.
You can try to open it with an hexadecimal editor, but there is no automatic way to understand what the data represents.
You know it's base 2 since it's a byte of binary data, as you said. To see if it is true, in C everything but 0 is true. If it's 0, then it's false.

How do I work with bit data in C

In class I've been tasked with writing a C program that decompresses a text file and prints out the characters it contains. Each character in the file is represented by 2 bits (4 possible characters).
I've recently been informed that a byte is not necessarily 8 bits on all systems, and a char is not necessarily 1 byte. This then makes me wonder how on earth I'm supposed to know how many bits got loaded from a file when I loaded 1 byte. Also how am I supposed to keep the loaded data in memory when there are no data types that can guarantee a set amount of bits.
How do I work with bit data in C?
A byte is not necessarily 8 bits. That much is certainly true. A char, on the other hand, is defined to be a byte - C does not differentiate between the two things.
However, the systems you will write for will almost certainly have 8-bit bytes. Bytes of different sizes are essentially non-existant outside of really, really old systems, or certain embedded systems.
If you have to write your code to work for multiple platforms, and one or more of those have differently sized chars, then you write code specifically to handle that platform - using e.g. CHAR_BIT to determine how many bits each byte contains.
Given that this is for a class, assume 8-bit bytes, unless told otherwise. The point is not going to be extreme platform independence, the point is to teach you something about bit fiddling (or possibly bit fields, but that depends on what you've covered in class).
This then makes me wonder how on earth I'm supposed to know how many
bits got loaded from a file when I loaded 1 byte.
You'll be hard pressed to find a platform where a byte is not 8 bits. (though as noted above CHAR_BIT can be used to verify that). Also clarify the portability requirements with your instructor or state your assumptions.
Usually bits are extracted using shifts and bitwise operations, e.g. (x & 3) are the rightmost 2 bits of x. ((x>>2) & 3) are the next two bits. Pick the right data type for the platforms you are targettiing or as others say use something like uint8_t if available for your compiler.
Also see:
Type to use to represent a byte in ANSI (C89/90) C?
I would recommend not using bit fields. Also see here:
When is it worthwhile to use bit fields?
You can use bit fields in C. These indices explicitly let you specify the number of bits in each part of the field, if you are truly concerned about width. This page gives a discussion: http://msdn.microsoft.com/en-us/library/yszfawxh(v=vs.80).aspx
As an example, check out the ieee754.h for usage in the context of implementing IEEE754 floats

Why is that data structures usually have a size of 2^n?

Is there a historical reason or something ? I've seen quite a few times something like char foo[256]; or #define BUF_SIZE 1024. Even I do mostly only use 2n sized buffers, mostly because I think it looks more elegant and that way I don't have to think of a specific number. But I'm not quite sure if that's the reason most people use them, more information would be appreciated.
There may be a number of reasons, although many people will as you say just do it out of habit.
One place where it is very useful is in the efficient implementation of circular buffers, especially on architectures where the % operator is expensive (those without a hardware divide - primarily 8 bit micro-controllers). By using a 2^n buffer in this case, the modulo, is simply a case of bit-masking the upper bits, or in the case of say a 256 byte buffer, simply using an 8-bit index and letting it wraparound.
In other cases alignment with page boundaries, caches etc. may provide opportunities for optimisation on some architectures - but that would be very architecture specific. But it may just be that such buffers provide the compiler with optimisation possibilities, so all other things being equal, why not?
Cache lines are usually some multiple of 2 (often 32 or 64). Data that is an integral multiple of that number would be able to fit into (and fully utilize) the corresponding number of cache lines. The more data you can pack into your cache, the better the performance.. so I think people who design their structures in that way are optimizing for that.
Another reason in addition to what everyone else has mentioned is, SSE instructions take multiple elements, and the number of elements input is always some power of two. Making the buffer a power of two guarantees you won't be reading unallocated memory. This only applies if you're actually using SSE instructions though.
I think in the end though, the overwhelming reason in most cases is that programmers like powers of two.
Hash Tables, Allocation by Pages
This really helps for hash tables, because you compute the index modulo the size, and if that size is a power of two, the modulus can be computed with a simple bitwise-and or & rather than using a much slower divide-class instruction implementing the % operator.
Looking at an old Intel i386 book, and is 2 cycles and div is 40 cycles. A disparity persists today due to the much greater fundamental complexity of division, even though the 1000x faster overall cycle times tend to hide the impact of even the slowest machine ops.
There was also a time when malloc overhead was occasionally avoided at great length. Allocation's available directly from the operating system would be (still are) a specific number of pages, and so a power of two would be likely to make the most use of the allocation granularity.
And, as others have noted, programmers like powers of two.
I can think of a few reasons off the top of my head:
2^n is a very common value in all of computer sizes. This is directly related to the way bits are represented in computers (2 possible values), which means variables tend to have ranges of values whose boundaries are 2^n.
Because of the point above, you'll often find the value 256 as the size of the buffer. This is because it is the largest number that can be stored in a byte. So, if you want to store a string together with a size of the string, then you'll be most efficient if you store it as: SIZE_BYTE+ARRAY, where the size byte tells you the size of the array. This means the array can be any size from 1 to 256.
Many other times, sizes are chosen based on physical things (for example, the size of the memory an operating system can choose from is related to the size of the registers of the CPU etc) and these are also going to be a specific amount of bits. Meaning, the amount of memory you can use will usually be some value of 2^n (for a 32bit system, 2^32).
There might be performance benefits/alignment issues for such values. Most processors can access a certain amount of bytes at a time, so even if you have a variable whose size is let's say) 20 bits, a 32 bit processor will still read 32 bits, no matter what. So it's often times more efficient to just make the variable 32 bits. Also, some processors require variables to be aligned to a certain amount of bytes (because they can't read memory from, for example, addresses in the memory that are odd). Of course, sometimes it's not about odd memory locations, but locations that are multiples of 4, or 6 of 8, etc. So in these cases, it's more efficient to just make buffers that will always be aligned.
Ok, those points came out a bit jumbled. Let me know if you need further explanation, especially point 4 which IMO is the most important.
Because of the simplicity (read also cost) of base 2 arithmetic in electronics: shift left (multiply by 2), shift right (divide by 2).
In the CPU domain, lots of constructs revolve around base 2 arithmetic. Busses (control & data) to access memory structure are often aligned on power 2. The cost of logic implementation in electronics (e.g. CPU) makes for arithmetics in base 2 compelling.
Of course, if we had analog computers, the story would be different.
FYI: the attributes of a system sitting at layer X is a direct consequence of the server layer attributes of the system sitting below i.e. layer < x. The reason I am stating this stems from some comments I received with regards to my posting.
E.g. the properties that can be manipulated at the "compiler" level are inherited & derived from the properties of the system below it i.e. the electronics in the CPU.
I was going to use the shift argument, but could think of a good reason to justify it.
One thing that is nice about a buffer that is a power of two is that circular buffer handling can use simple ands rather than divides:
#define BUFSIZE 1024
++index; // increment the index.
index &= BUFSIZE; // Make sure it stays in the buffer.
If it weren't a power of two, a divide would be necessary. In the olden days (and currently on small chips) that mattered.
It's also common for pagesizes to be powers of 2.
On linux I like to use getpagesize() when doing something like chunking a buffer and writing it to a socket or file descriptor.
It's makes a nice, round number in base 2. Just as 10, 100 or 1000000 are nice, round numbers in base 10.
If it wasn't a power of 2 (or something close such as 96=64+32 or 192=128+64), then you could wonder why there's the added precision. Not base 2 rounded size can come from external constraints or programmer ignorance. You'll want to know which one it is.
Other answers have pointed out a bunch of technical reasons as well that are valid in special cases. I won't repeat any of them here.
In hash tables, 2^n makes it easier to handle key collissions in a certain way. In general, when there is a key collission, you either make a substructure, e.g. a list, of all entries with the same hash value; or you find another free slot. You could just add 1 to the slot index until you find a free slot; but this strategy is not optimal, because it creates clusters of blocked places. A better strategy is to calculate a second hash number h2, so that gcd(n,h2)=1; then add h2 to the slot index until you find a free slot (with wrap around). If n is a power of 2, finding a h2 that fulfills gcd(n,h2)=1 is easy, every odd number will do.

Resources