C, Little and Big Endian confusion

C, Little and Big Endian confusion - c

I try to understand C programming memory Bytes order, but I'm confuse.
I try my app with some value on this site for my output verification : www.yolinux.com/TUTORIALS/Endian-Byte-Order.html
For the 64bits value I use in my C program:
volatile long long ll = (long long)1099511892096;
__mingw_printf("\tlong long, %u Bytes, %u bits,\t%lld to %lli, %lli, 0x%016llX\n", sizeof(long long), sizeof(long long)*8, LLONG_MIN, LLONG_MAX , ll, ll);
void printBits(size_t const size, void const * const ptr)
{
unsigned char *b = (unsigned char*) ptr;
unsigned char byte;
int i, j;
printf("\t");
for (i=size-1;i>=0;i--)
{
for (j=7;j>=0;j--)
{
byte = b[i] & (1<<j);
byte >>= j;
printf("%u", byte);
}
printf(" ");
}
puts("");
}
Out
long long, 8 Bytes, 64 bits, -9223372036854775808 to 9223372036854775807, 1099511892096, 0x0000010000040880
80 08 04 00 00 01 00 00 (Little-Endian)
10000000 00001000 00000100 00000000 00000000 00000001 00000000 00000000
00 00 01 00 00 04 08 80 (Big-Endian)
00000000 00000000 00000001 00000000 00000000 00000100 00001000 10000000
Tests
0x8008040000010000, 1000000000001000000001000000000000000000000000010000000000000000 // online website hex2bin conv.
1000000000001000000001000000000000000000000000010000000000000000 // my C app
0x8008040000010000, 1000010000001000000001000000000000000100000000010000000000000000 // yolinux.com
0x0000010000040880, 0000000000000000000000010000000000000000000001000000100010000000 //online website hex2bin conv., 1099511892096 ! OK
0000000000000000000000010000000000000000000001000000100010000000 // my C app, 1099511892096 ! OK
[Convert]::ToInt64("0000000000000000000000010000000000000000000001000000100010000000", 2) // using powershell for other verif., 1099511892096 ! OK
0x0000010000040880, 0000000000000000000000010000010000000000000001000000100010000100 // yolinux.com, 1116691761284 (from powershell bin conv.) ! BAD !
Problem
yolinux.com website announce 0x0000010000040880 for BIG ENDIAN ! But my computer use LITTLE ENDIAN I think (Intel proc.)
and I get same value 0x0000010000040880 from my C app and from another website hex2bin converter.
__mingw_printf(...0x%016llX...,...ll) also print 0x0000010000040880 as you can see.
Following yolinux website I have inverted my "(Little-Endian)" and "(Big-Endian)" labels in my output for the moment.
Also, the sign bit must be 0 for a positive number it's the case on my result but also yolinux result.(can not help me to be sure.)
If I correctly understand Endianness only Bytes are swapped not bits and my groups of bits seems to be correctly inverted.
It is simply an error on yolinux.com or is I missing a step about 64-bit numbers and C programming?

When you print some "multi-byte" integer using printf (and the correct format specifier) it doesn't matter whether the system is little or big endian. The result will be the same.
The difference between little and big endian is the order that multi-byte types are stored in memory. But once data is read from memory into the core processor, there is no difference.
This code shows how an integer (4 bytes) is placed in memory on my machine.
#include <stdio.h>
int main()
{
unsigned int u = 0x12345678;
printf("size of int is %zu\n", sizeof u);
printf("DEC: u=%u\n", u);
printf("HEX: u=0x%x\n", u);
printf("memory order:\n");
unsigned char * p = (unsigned char *)&u;
for(int i=0; i < sizeof u; ++i) printf("address %p holds %x\n", (void*)&p[i], p[i]);
return 0;
}
Output:
size of int is 4
DEC: u=305419896
HEX: u=0x12345678
memory order:
address 0x7ffddf2c263c holds 78
address 0x7ffddf2c263d holds 56
address 0x7ffddf2c263e holds 34
address 0x7ffddf2c263f holds 12
So I can see that I'm on a little endian machine as the LSB (least significant byte, i.e. 78) is stored on the lowest address.
Executing the same program on a big endian machine would (assuming same address) show:
size of int is 4
DEC: u=305419896
HEX: u=0x12345678
memory order:
address 0x7ffddf2c263c holds 12
address 0x7ffddf2c263d holds 34
address 0x7ffddf2c263e holds 56
address 0x7ffddf2c263f holds 78
Now it is the MSB (most significant byte, i.e. 12) that are stored on the lowest address.
The important thing to understand is that this only relates to "how multi-byte type are stored in memory". Once the integer is read from memory into a register inside the core, the register will hold the integer in the form 0x12345678 on both little and big endian machines.

There is only a single way to represent an integer in decimal, binary or hexadecimal format. For example, number 43981 is equal to 0xABCD when written as hexadecimal, or 0b1010101111001101 in binary. Any other value (0xCDAB, 0xDCBA or similar) represents a different number.
The way your compiler and cpu choose to store this value internally is irrelevant as far as C standard is concerned; the value could be stored as a 36-bit one's complement if you're particularly unlucky, as long as all operations mandated by the standard have equivalent effects.
You will rarely have to inspect your internal data representation when programming. Practically the only time when you care about endiannes is when working on a communication protocol, because then the binary format of the data must be precisely defined, but even then your code will not be different regardless of the architecture:
// input value is big endian, this is defined
// by the communication protocol
uint32_t parse_comm_value(const char * ptr)
{
// but bit shifts in C have the same
// meaning regardless of the endianness
// of your architecture
uint32_t result = 0;
result |= (*ptr++) << 24;
result |= (*ptr++) << 16;
result |= (*ptr++) << 8;
result |= (*ptr++);
return result;
}
Tl;dr calling a standard function like printf("0x%llx", number); always prints the correct value using the specified format. Inspecting the contents of memory by reading individual bytes gives you the representation of the data on your architecture.

Related

How are ints stored in C

I've been trying to understand how data is stored in C but I'm getting confused. I have this code:
int main(){
int a;
char *x;
x = (char *) &a;
x[0] = 0;
x[1] = 3;
printf("%d\n", a);
return 0;
}
I've been messing around with x[0] & x[1], trying to figure out how they work, but I just can't. For example x[1] = 3 outputs 768. Why?
I understand that there are 4 bytes (each holding 8 bits) in an int, and x[1] points to the 2nd byte. But I don't understand how making that second byte equal to 3, means a = 768.
I can visualise this in binary format:
byte 1: 00000000
byte 2: 00000011
byte 3: 00000000
byte 4: 00000000
But where does the 3 come into play? how does doing byte 2 = 3, make it 00000011 or 768.
Additional question: If I was asked to store 545 in memory. What would a[0] and a[1] = ?
I know the layout in binary is:
byte 1: 00100001
byte 2: 00000010
byte 3: 00000000
byte 4: 00000000

It is not specific to C, it is how your computer is storing the data.
There are two different methods called endianess.
Little-endian: the least significant byte is stored first.
Example: 0x11223344 will be stored as 0x44 0x33 0x22 0x11
Big-endian: the least significant byte is stored last.
Example: 0x11223344 will be stored as 0x11 0x22 0x33 0x44
Most modern computers use the little-endian system.
Additional question: If I was asked to store 545 in memory
545 in hex is 0x221 so the first byte will be 0x21 and the second one 0x02 as your computer is little-endian.
Why do I use hex numbers? Because every two digits represent exactly one byte in memory.
I've been messing around with x[0] & x[1], trying to figure out how
they work, but I just can't. For example x[1] = 3 outputs 768. Why?
768 in hex is 0x300. So the byte representation is 0x00 0x03 0x00 0x00

Warning: by casting the address of an int to a char *, the compiler is defenseless trying to maintain order. Casting is the programmer telling the compiler "I know what I am doing." Use it will care.
Another way to refer to the same region of memory in two different modes is to use a union. Here the compiler will allocate the space required that is addressable as either an int or an array of signed char.
This might be a simpler way to experiment with setting/clearing certain bits as you come to understand how the architecture of your computer stores multi-byte datatypes.
See other responses for hints about "endian-ness".
#include <stdio.h>
int main( void ) {
union {
int i;
char c[4];
} x;
x.i = 0;
x.c[1] = 3;
printf( "%02x %02x %02x %02x %08x %d\n", x.c[0], x.c[1], x.c[2], x.c[3], x.i, x.i );
x.i = 545;
printf( "%02x %02x %02x %02x %08x %d\n", x.c[0], x.c[1], x.c[2], x.c[3], x.i, x.i );
return 0;
}
00 03 00 00 00000300 768
21 02 00 00 00000221 545

Converting struct into short int array

Hello I have a following structure:
struct TestStruct{
unsigned char a;
unsigned char b;
unsigned char c;
unsigned char d;
};
struct TestStruct test;
test.a = 0x01;
test.b = 0x02;
test.c = 0x01;
test.d = 0x02;
unsigned short int *ptr = (unsigned short int *)&test;
printf("%04x\n", *ptr++);
printf("%04x\n", *ptr++);
I want to get values 0x0102 but actually I get 0x0201. How can figure it out without reordering fields in struct? I want to keep it because I am creating IP header from scratch (for learning purpose) and for better readability I want to have the same ordering with RFC documentation.
Thanks in advance.

In computers, there is a concept of endianess. In short, when storing a multi-byte field, you must choose between storing the most significant byte first (big-endian), or the least significant byte first (little-endian). This difference is sometimes called byte-order by RFC documents.
If you are implementing code that speaks cross-endianess, you will need to be cognizant of which format values are read in. The header byteswap.h is supplied to swap between formats in the most efficient ways. Consider the following example program:
#include <stdio.h>
#include <byteswap.h>
int main(void) {
unsigned int x = 0x01020304;
unsigned char * arr = (unsigned char *)&x;
printf("int: %08x\n", x);
printf("raw: %02x %02x %02x %02x\n", arr[0], arr[1], arr[2], arr[3]);
x = __bswap_32(x);
printf("swapped\n");
printf("int: %08x\n", x);
printf("raw: %02x %02x %02x %02x\n", arr[0], arr[1], arr[2], arr[3]);
}
On my computer, it outputs:
int: 01020304
raw: 04 03 02 01
swapped
int: 04030201
raw: 01 02 03 04
This shows that my computer is little endian. For the integer 0x01020304, it stores the byte 0x04 in the smaller memory address.
For specifically network usage, linux provides headers that convert from network-host. These have the benefit of already 'knowing' what your internal order is, and handling the conversion for you. For example, here's an old snippet I wrote that parses headers of ARP-packets:
recvfrom(socket->fd, buffer, ETHER_FRAME_MAX_SIZE, 0, NULL, NULL);
frame->type = ntohs(frame->type);
frame->htype = ntohs(frame->htype);
frame->ptype = ntohs(frame->ptype);
frame->oper = ntohs(frame->oper);
This snippet converts the shorts in the struct into the correct host byte order, using the ntohs (which is short for network-to-host-short) provided by arpa/inet.h.

Your implementation assumes that your machine is big-endian, which is usually not true on modern machines.
Big endian machines store multibyte values with the least significant byte in the highest address and the most significant byte in the lowest address, while little endian machines (which tend to be more common these days) do the exact opposite, storing the least significant byte in the lowest address and the most significant byte in the highest address. For instance this is how each architecture would represent the 4-byte value 0x01020304 if it were to be stored at memory addresses 0x10-0x13.
Endianness
Byte 0x10
Byte 0x11
Byte 0x12
Byte 0x13
Big
0x01
0x02
0x03
0x04
Little
0x04
0x03
0x02
0x01
The C-standard forces your compiler to place the elements in your struct in the order that they are defined, so when you fill the struct and then use type-punning to interpret the memory location as a 2-byte int instead of (effectively) an array of 1-byte ints, the computer will assume which byte is most significant and which is less significant based on its own endianness.
To manually force the computer to recognize a multi-byte value as the endianness you expect, you need to use bit-shifting to move each byte into its proper place, for instance, using your struct as an example:
unsigned short fixedEndianness = ((unsigned short)test.a << 8) | (unsigned short)test.b;
...which will work on any architecture

why does a integer type need to be little-endian?

I am curious about little-endian
and I know that computers almost have little-endian method.
So, I praticed through a program and the source is below.
int main(){
int flag = 31337;
char c[10] = "abcde";
int flag2 = 31337;
return 0;
}
when I saw the stack via gdb,
I noticed that there were 0x00007a69 0x00007a69 .... ... ... .. .... ...
0x62610000 0x00656463 .. ...
So, I have two questions.
For one thing,
how can the value of char c[10] be under the flag?
I expected there were the value of flag2 in the top of stack and the value of char c[10] under the flag2 and the value of flag under the char c[10].
like this
7a69
"abcde"
7a69
Second,
I expected the value were stored in the way of little-endian.
As a result, the value of "abcde" was stored '6564636261'
However, the value of 31337 wasn't stored via little-endian.
It was just '7a69'.
I thought it should be '697a'
why doesn't integer type conform little-endian?

There is some confusion in your understanding of endianness, stack and compilers.
First, the locations of variables in the stack may not have anything to do with the code written. The compiler is free to move them around how it wants, unless it is a part of a struct, for example. Usually they try to make as efficient use of memory as possible, so this is needed. For example having char, int, char, int would require 16 bytes (on a 32bit machine), whereas int, int, char, char would require only 12 bytes.
Second, there is no "endianness" in char arrays. They are just that: arrays of values. If you put "abcde" there, the values have to be in that order. If you would use for example UTF16 then endianness would come into play, since then one part of the codeword (not necessarily one character) would require two bytes (on a normal 8-bit machine). These would be stored depending on endianness.
Decimal value 31337 is 0x007a69 in 32bit hexadecimal. If you ask a debugger to show it, it will show it as such whatever the endianness. The only way to see how it is in memory is to dump it as bytes. Then it would be 0x69 0x7a 0x00 0x00 in little endian.
Also, even though little endian is very popular, it's mainly because x86 hardware is popular. Many processors have used big endian (SPARC, PowerPC, MIPS amongst others) order and some (like older ARM processors) could run in either one, depending on the requirements.
There is also a term "network byte order", which actually is big endian. This relates to times before little endian machines became most popular.

Integer byte order is an arbitrary processor design decision. Why for example do you appear to be uncomfortable with little-endian? What makes big-endian a better choice?
Well probably because you are a human used to reading numbers from left-to-right; but the machine hardly cares.
There is in fact a reasonable argument that it is intuitive for the least-significant-byte to be placed in the lowest order address; but again, only from a human intuition point-of-view.

GDB shows you 0x62610000 0x00656463 because it is interpreting data (...abcde...) as 32bit words on a little endian system.
It could be either way, but the reasonable default is to use native endianness.
Data in memory is just a sequence of bytes. If you tell it to show it as a sequence (array) of short ints, it changes what it displays. Many debuggers have advanced memory view features to show memory content in various interpretations, including string, int (hex), int (decimal), float, and many more.

You got a few excellent answers already.
Here is a little code to help you understand how variables are laid out in memory, either using little-endian or big-endian:
#include <stdio.h>
void show_var(char* varname, unsigned char *ptr, size_t size) {
int i;
printf ("%s:\n", varname);
for (i=0; i<size; i++) {
printf("pos %d = %2.2x\n", i, *ptr++);
}
printf("--------\n");
}
int main() {
int flag = 31337;
char c[10] = "abcde";
show_var("flag", (unsigned char*)&flag, sizeof(flag));
show_var("c", (unsigned char*)c, sizeof(c));
}
On my Intel i5 Linux machine it produces:
flag:
pos 0 = 69
pos 1 = 7a
pos 2 = 00
pos 3 = 00
--------
c:
pos 0 = 61
pos 1 = 62
pos 2 = 63
pos 3 = 64
pos 4 = 65
pos 5 = 00
pos 6 = 00
pos 7 = 00
pos 8 = 00
pos 9 = 00
--------

Output Explanation of this program in C?

I have this program in C:
int main(int argc, char *argv[])
{
int i=300;
char *ptr = &i;
*++ptr=2;
printf("%d",i);
return 0;
}
The output is 556 on little endian.
I tried to understand the output. Here is my explanation.
Question is Will the answer remains the same in the big endian machine?
i = 300;
=> i = 100101100 //in binary in word format => B B Hb 0001 00101100 where B = Byte and Hb = Half Byte
(A)=> in memory (assuming it is Little endian))
0x12345678 - 1100 - 0010 ( Is this correct for little endian)
0x12345679 - 0001 - 0000
0x1234567a - 0000 - 0000
0x1234567b - 0000 - 0000
0x1234567c - Location of next intezer(location of ptr++ or ptr + 1 where ptr is an intezer pointer as ptr is of type int => on doing ++ptr it will increment by 4 byte(size of int))
when
(B)we do char *ptr = &i;
ptr will become of type char => on doing ++ptr it will increment by 1 byte(size of char)
so on doing ++ptr it will jump to location -> 0x12345679 (which has 0001 - 0000)
now we are doing
++ptr = 2
=> 0x12345679 will be overwritten by 2 => 0x12345679 will have 00*10** - 0000 instead of 000*1* - 0000
so the new memory content will look like this :
(C)
0x12345678 - 1100 - 0010
0x12345679 - 0010 - 0000
0x1234567a - 0000 - 0000
0x1234567b - 0000 - 0000
which is equivalent to => B B Hb 0010 00101100 where B = Byte and Hb = Half Byte
Is my reasoning correct?Is there any other short method for this?
Rgds,
Softy

In a little-endian 32-bit system, the int 300 (0x012c) is typically(*) stored as 4 sequential bytes, lowest first: 2C 01 00 00. When you increment the char pointer that was formerly the int pointer &i, you're pointing at the second byte of that sequence, and setting it to 2 makes the sequence 2C 02 00 00 -- which, when turned back into an int, is 0x22c or 556.
(As for your understanding of the bit sequence...it seems a bit off. Endianness affects byte order in memory, as the byte is the smallest addressable unit. The bits within the byte don't get reversed; the low-order byte will be 2C (00101100) whether the system is little-endian or big-endian. (Even if the system did reverse the bits of a byte, it'd reverse them again to present them to you as a number, so you wouldn't notice a difference.) The big difference is where that byte appears in the sequence. The only places where bit order matters, is in hardware and drivers and such where you can receive less than a byte at a time.)
In a big-endian system, the int is typically(*) represented by the byte sequence 00 00 01 2C (differing from the little-endian representation solely in the byte order -- highest byte comes first). You're still modifying the second byte of the sequence, though...making 00 02 01 2C, which as an int is 0x02012c or 131372.
(*) Lots of things come into play here, including two's complement (which almost all systems use these days...but C doesn't require it), the value of sizeof(int), alignment/padding, and whether the system is truly big- or little-endian or a half-assed implementation of it. This is a big part of why mucking around with the bytes of a bigger type so often leads to undefined or implementation-specific behavior.

This is implementation defined. The internal representation of an int is not known according to the standard, so what you're doing is not portable. See section 6.2.6.2 in the C standard.
However, as most implementations use two's complement representation of signed ints, the endianness will affect the result as described in cHaos answer.

This is your int:
int i = 300;
And this is what the memory contains at &i: 2c 01 00 00
With the next instruction you assign the address of i to ptr, and then you move to the next byte with ++ptr and change its value to 2:
char *ptr = &i;
*++ptr = 2;
So now the memory contains: 2c 02 00 00 (i.e. 556).
The difference is that in a big-endian system in the address of i you would have seen 00 00 01 2C, and after the change: 00 02 01 2C.
Even if the internal rappresentation of an int is implementation-defined:
For signed integer types, the bits of the object representation shall
be divided into three groups: value bits, padding bits, and the sign
bit. There need not be any padding bits; signed char shall not have
any padding bits. There shall be exactly one sign bit. Each bit that
is a value bit shall have the same value as the same bit in the object
representation of the corresponding unsigned type (if there are M
value bits in the signed type and N in the unsigned type, then M ≤ N).
If the sign bit is zero, it shall not affect the resulting value. If
the sign bit is one, the value shall be modified in one of the
following ways: — the corresponding value with sign bit 0 is negated
(sign and magnitude); — the sign bit has the value −(2M) (two’s
complement); — the sign bit has the value −(2M − 1) (ones’
complement). Which of these applies is implementation-defined, as
is whether the value with sign bit 1 and all value bits zero (for the
first two), or with sign bit and all value bits 1 (for ones’
complement), is a trap representation or a normal value. In the case
of sign and magnitude and ones’ complement, if this representation is
a normal value it is called a negative zero.

I like experiments and that's the reason for having the PowerPC G5.
stacktest.c:
int main(int argc, char *argv[])
{
int i=300;
char *ptr = &i;
*++ptr=2;
/* Added the Hex dump */
printf("%d or %x\n",i, i);
return 0;
}
Build command:
powerpc-apple-darwin9-gcc-4.2.1 -o stacktest stacktest.c
Output:
131372 or 2012c
Resume: the cHao's answer is complete and in case you're in doubt here is the experimental evidence.

int to char casting

int i = 259; /* 03010000 in Little Endian ; 00000103 in Big Endian */
char c = (char)i; /* returns 03 in both Little and Big Endian?? */
In my computer it assigns 03 to char c and I have Little Endian, but I don't know if the char casting reads the least significant byte or reads the byte pointed by the i variable.

Endianness doesn't actually change anything here. It doesn't try to store one of the bytes (MSB, LSB etc).
If char is unsigned it will wrap around. Assuming 8-bit char 259 % 256 = 3
If char is signed the result is implementation defined. Thank you pmg: 6.3.1.3/3 in the C99 Standard

Since you're casting from a larger integer type to a smaller one, it takes the least significant part regardless of endianness. If you were casting pointers instead, though, it would take the byte at the address, which would depend on endianness.
So c = (char)i assigns the least-significant byte to c, but c = *((char *)(&i)) would assign the first byte at the address of i to c, which would be the same thing on little-endian systems only.

If you want to test for little/big endian, you can use a union:
int isBigEndian (void)
{
union foo {
size_t i;
char cp[sizeof(size_t)];
} u;
u.i = 1;
return *u.cp != 1;
}
It works because in little endian, it would look like 01 00 ... 00, but in big endian, it would be 00 ... 00 01 (the ... is made up of zeros). So if the first byte is 0, the test returns true. Otherwise it returns false. Beware, however, that there also exist mixed endian machines that store data differently (some can switch endianness; others just store the data differently). The PDP-11 stored a 32-bit int as two 16-bit words, except the order of the words was reversed (e.g. 0x01234567 was 4567 0123).

When casting from int(4 bytes) to char(1 byte), it will preserve the last 1 byte.
Eg:
int x = 0x3F1; // 0x3F1 = 0000 0011 1111 0001
char y = (char)x; // 1111 0001 --> -15 in decimal (with Two's complement)
char z = (unsigned char)x; // 1111 0001 --> 241 in decimal

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight