How are ints stored in C?

I've been trying to understand how data is stored in C but I'm getting confused. I have this code:
#include <stdio.h>

int main(){
    int a;
    char *x;
    x = (char *) &a;
    x[0] = 0;
    x[1] = 3;
    printf("%d\n", a);
    return 0;
}
I've been messing around with x[0] & x[1], trying to figure out how they work, but I just can't. For example x[1] = 3 outputs 768. Why?
I understand that there are 4 bytes (each holding 8 bits) in an int, and that x[1] refers to the 2nd byte. But I don't understand how making that second byte equal to 3 makes a = 768.
I can visualise this in binary format:
byte 1: 00000000
byte 2: 00000011
byte 3: 00000000
byte 4: 00000000
But where does the 3 come into play? How does setting byte 2 to 3 make it 00000011, i.e. 768?
Additional question: if I was asked to store 545 in memory, what would x[0] and x[1] equal?
I know the layout in binary is:
byte 1: 00100001
byte 2: 00000010
byte 3: 00000000
byte 4: 00000000

It is not specific to C; it is how your computer stores the data.
There are two different byte orderings, known as endianness:
Little-endian: the least significant byte is stored first.
Example: 0x11223344 will be stored as 0x44 0x33 0x22 0x11
Big-endian: the least significant byte is stored last.
Example: 0x11223344 will be stored as 0x11 0x22 0x33 0x44
Most modern computers use the little-endian system.
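If you want to observe this on your own machine, here is a minimal sketch (my own addition, assuming a 4-byte int) that prints the bytes of 0x11223344 in the order they sit in memory:

#include <stdio.h>

int main(void)
{
    unsigned int v = 0x11223344;            /* assumes a 4-byte int */
    unsigned char *p = (unsigned char *)&v; /* view its bytes in memory order */

    /* little-endian machines print: 44 33 22 11 */
    /* big-endian machines print:    11 22 33 44 */
    for (size_t i = 0; i < sizeof v; i++)
        printf("%02x ", p[i]);
    printf("\n");
    return 0;
}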
Additional question: If I was asked to store 545 in memory
545 in hex is 0x221 so the first byte will be 0x21 and the second one 0x02 as your computer is little-endian.
Why do I use hex numbers? Because every two digits represent exactly one byte in memory.
I've been messing around with x[0] & x[1], trying to figure out how
they work, but I just can't. For example x[1] = 3 outputs 768. Why?
768 in hex is 0x300. So the byte representation is 0x00 0x03 0x00 0x00

Warning: by casting the address of an int to a char *, you take away the compiler's ability to keep things straight for you. A cast is the programmer telling the compiler "I know what I am doing." Use it with care.
Another way to refer to the same region of memory in two different modes is to use a union. Here the compiler will allocate the space required, addressable either as an int or as an array of char.
This might be a simpler way to experiment with setting/clearing certain bits as you come to understand how the architecture of your computer stores multi-byte datatypes.
See other responses for hints about "endian-ness".
#include <stdio.h>

int main( void ) {
    union {
        int i;
        char c[4];
    } x;

    x.i = 0;
    x.c[1] = 3;
    printf( "%02x %02x %02x %02x %08x %d\n", x.c[0], x.c[1], x.c[2], x.c[3], x.i, x.i );

    x.i = 545;
    printf( "%02x %02x %02x %02x %08x %d\n", x.c[0], x.c[1], x.c[2], x.c[3], x.i, x.i );

    return 0;
}
00 03 00 00 00000300 768
21 02 00 00 00000221 545

Related

Converting struct into short int array

Hello, I have the following structure:
#include <stdio.h>

struct TestStruct {
    unsigned char a;
    unsigned char b;
    unsigned char c;
    unsigned char d;
};

int main(void) {
    struct TestStruct test;
    test.a = 0x01;
    test.b = 0x02;
    test.c = 0x01;
    test.d = 0x02;

    unsigned short int *ptr = (unsigned short int *)&test;
    printf("%04x\n", *ptr++);
    printf("%04x\n", *ptr++);
    return 0;
}
I want to get the value 0x0102, but I actually get 0x0201. How can I fix this without reordering the fields in the struct? I want to keep the order because I am creating an IP header from scratch (for learning purposes), and for better readability I want the same field ordering as the RFC documentation.
Thanks in advance.
In computers, there is a concept of endianness. In short, when storing a multi-byte field, you must choose between storing the most significant byte first (big-endian) or the least significant byte first (little-endian). This difference is sometimes called byte order in RFC documents.
If you are implementing code that has to work across endiannesses, you will need to be aware of which format values are read in. The (non-standard, glibc) header byteswap.h provides functions to swap between formats efficiently. Consider the following example program:
#include <stdio.h>
#include <byteswap.h>

int main(void) {
    unsigned int x = 0x01020304;
    unsigned char * arr = (unsigned char *)&x;

    printf("int: %08x\n", x);
    printf("raw: %02x %02x %02x %02x\n", arr[0], arr[1], arr[2], arr[3]);

    x = __bswap_32(x);

    printf("swapped\n");
    printf("int: %08x\n", x);
    printf("raw: %02x %02x %02x %02x\n", arr[0], arr[1], arr[2], arr[3]);
}
On my computer, it outputs:
int: 01020304
raw: 04 03 02 01
swapped
int: 04030201
raw: 01 02 03 04
This shows that my computer is little endian. For the integer 0x01020304, it stores the least significant byte, 0x04, at the lowest memory address.
Specifically for network usage, Linux provides headers that convert between network and host byte order. These have the benefit of already 'knowing' what your internal order is and handling the conversion for you. For example, here's an old snippet I wrote that parses the headers of ARP packets:
recvfrom(socket->fd, buffer, ETHER_FRAME_MAX_SIZE, 0, NULL, NULL);
frame->type = ntohs(frame->type);
frame->htype = ntohs(frame->htype);
frame->ptype = ntohs(frame->ptype);
frame->oper = ntohs(frame->oper);
This snippet converts the shorts in the struct into the correct host byte order using ntohs (short for network-to-host-short), provided by arpa/inet.h.
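As a minimal self-contained sketch (my own addition, assuming a POSIX system that provides arpa/inet.h), this shows the round trip between host and network byte order for a 16-bit value:

#include <stdio.h>
#include <arpa/inet.h>   /* htons, ntohs */

int main(void)
{
    unsigned short host = 0x1234;
    unsigned short net  = htons(host);  /* host -> network (big-endian) */
    unsigned short back = ntohs(net);   /* network -> host */

    /* On a little-endian host this prints: host=1234 net=3412 back=1234 */
    /* On a big-endian host all three values are 1234.                   */
    printf("host=%04x net=%04x back=%04x\n", host, net, back);
    return 0;
}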
Your implementation assumes that your machine is big-endian, which is usually not true on modern machines.
Big endian machines store multibyte values with the least significant byte in the highest address and the most significant byte in the lowest address, while little endian machines (which tend to be more common these days) do the exact opposite, storing the least significant byte in the lowest address and the most significant byte in the highest address. For instance this is how each architecture would represent the 4-byte value 0x01020304 if it were to be stored at memory addresses 0x10-0x13.
Endianness   Byte 0x10   Byte 0x11   Byte 0x12   Byte 0x13
Big          0x01        0x02        0x03        0x04
Little       0x04        0x03        0x02        0x01
The C standard forces your compiler to place the elements in your struct in the order that they are defined, so when you fill the struct and then use type-punning to interpret the memory location as a 2-byte int instead of (effectively) an array of 1-byte ints, the computer will assume which byte is most significant and which is least significant based on its own endianness.
To manually force the computer to recognize a multi-byte value as the endianness you expect, you need to use bit-shifting to move each byte into its proper place, for instance, using your struct as an example:
unsigned short fixedEndianness = ((unsigned short)test.a << 8) | (unsigned short)test.b;
...which will work on any architecture
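Applied to the whole struct from the question, a sketch along these lines (my own illustration, not the asker's code) builds both 16-bit values explicitly, so the output is 0102 0102 regardless of the host's byte order:

#include <stdio.h>

struct TestStruct {
    unsigned char a;
    unsigned char b;
    unsigned char c;
    unsigned char d;
};

int main(void)
{
    struct TestStruct test = { 0x01, 0x02, 0x01, 0x02 };

    /* Assemble each 16-bit value with the first field as the high byte,
       independent of how the machine stores multi-byte integers. */
    unsigned short first  = ((unsigned short)test.a << 8) | (unsigned short)test.b;
    unsigned short second = ((unsigned short)test.c << 8) | (unsigned short)test.d;

    printf("%04x\n", first);   /* 0102 */
    printf("%04x\n", second);  /* 0102 */
    return 0;
}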

C, Little and Big Endian confusion

I'm trying to understand byte order in C memory, but I'm confused.
I tested my app against some values on this site to verify my output: www.yolinux.com/TUTORIALS/Endian-Byte-Order.html
For the 64-bit value I use in my C program:
volatile long long ll = (long long)1099511892096;
__mingw_printf("\tlong long, %u Bytes, %u bits,\t%lld to %lli, %lli, 0x%016llX\n", sizeof(long long), sizeof(long long)*8, LLONG_MIN, LLONG_MAX , ll, ll);
void printBits(size_t const size, void const * const ptr)
{
    unsigned char *b = (unsigned char*) ptr;
    unsigned char byte;
    int i, j;

    printf("\t");
    for (i = size - 1; i >= 0; i--)
    {
        for (j = 7; j >= 0; j--)
        {
            byte = b[i] & (1 << j);
            byte >>= j;
            printf("%u", byte);
        }
        printf(" ");
    }
    puts("");
}
Out
long long, 8 Bytes, 64 bits, -9223372036854775808 to 9223372036854775807, 1099511892096, 0x0000010000040880
80 08 04 00 00 01 00 00 (Little-Endian)
10000000 00001000 00000100 00000000 00000000 00000001 00000000 00000000
00 00 01 00 00 04 08 80 (Big-Endian)
00000000 00000000 00000001 00000000 00000000 00000100 00001000 10000000
Tests
0x8008040000010000, 1000000000001000000001000000000000000000000000010000000000000000 // online website hex2bin conv.
1000000000001000000001000000000000000000000000010000000000000000 // my C app
0x8008040000010000, 1000010000001000000001000000000000000100000000010000000000000000 // yolinux.com
0x0000010000040880, 0000000000000000000000010000000000000000000001000000100010000000 //online website hex2bin conv., 1099511892096 ! OK
0000000000000000000000010000000000000000000001000000100010000000 // my C app, 1099511892096 ! OK
[Convert]::ToInt64("0000000000000000000000010000000000000000000001000000100010000000", 2) // using powershell for other verif., 1099511892096 ! OK
0x0000010000040880, 0000000000000000000000010000010000000000000001000000100010000100 // yolinux.com, 1116691761284 (from powershell bin conv.) ! BAD !
Problem
The yolinux.com website announces 0x0000010000040880 for BIG ENDIAN! But my computer uses LITTLE ENDIAN, I think (Intel proc.),
and I get the same value 0x0000010000040880 from my C app and from another website's hex2bin converter.
__mingw_printf(...0x%016llX...,...ll) also prints 0x0000010000040880, as you can see.
Following the yolinux website, I have swapped my "(Little-Endian)" and "(Big-Endian)" labels in my output for the moment.
Also, the sign bit must be 0 for a positive number; that is the case in my result, but also in the yolinux result (so it cannot help me be sure).
If I understand endianness correctly, only bytes are swapped, not bits, and my groups of bits seem to be correctly reversed.
Is it simply an error on yolinux.com, or am I missing a step about 64-bit numbers and C programming?
When you print some "multi-byte" integer using printf (and the correct format specifier) it doesn't matter whether the system is little or big endian. The result will be the same.
The difference between little and big endian is the order that multi-byte types are stored in memory. But once data is read from memory into the core processor, there is no difference.
This code shows how an integer (4 bytes) is placed in memory on my machine.
#include <stdio.h>

int main()
{
    unsigned int u = 0x12345678;

    printf("size of int is %zu\n", sizeof u);
    printf("DEC: u=%u\n", u);
    printf("HEX: u=0x%x\n", u);
    printf("memory order:\n");

    unsigned char * p = (unsigned char *)&u;
    for (int i = 0; i < sizeof u; ++i)
        printf("address %p holds %x\n", (void*)&p[i], p[i]);

    return 0;
}
Output:
size of int is 4
DEC: u=305419896
HEX: u=0x12345678
memory order:
address 0x7ffddf2c263c holds 78
address 0x7ffddf2c263d holds 56
address 0x7ffddf2c263e holds 34
address 0x7ffddf2c263f holds 12
So I can see that I'm on a little endian machine as the LSB (least significant byte, i.e. 78) is stored on the lowest address.
Executing the same program on a big endian machine would (assuming same address) show:
size of int is 4
DEC: u=305419896
HEX: u=0x12345678
memory order:
address 0x7ffddf2c263c holds 12
address 0x7ffddf2c263d holds 34
address 0x7ffddf2c263e holds 56
address 0x7ffddf2c263f holds 78
Now it is the MSB (most significant byte, i.e. 12) that is stored at the lowest address.
The important thing to understand is that this only relates to "how multi-byte type are stored in memory". Once the integer is read from memory into a register inside the core, the register will hold the integer in the form 0x12345678 on both little and big endian machines.
There is only a single way to represent an integer in decimal, binary or hexadecimal format. For example, number 43981 is equal to 0xABCD when written as hexadecimal, or 0b1010101111001101 in binary. Any other value (0xCDAB, 0xDCBA or similar) represents a different number.
The way your compiler and CPU choose to store this value internally is irrelevant as far as the C standard is concerned; the value could be stored as a 36-bit one's complement number if you're particularly unlucky, as long as all operations mandated by the standard have equivalent effects.
You will rarely have to inspect your internal data representation when programming. Practically the only time you care about endianness is when working on a communication protocol, because then the binary format of the data must be precisely defined; but even then your code will not be different regardless of the architecture:
#include <stdint.h>

// input value is big endian, this is defined
// by the communication protocol
uint32_t parse_comm_value(const char * ptr)
{
    // but bit shifts in C have the same
    // meaning regardless of the endianness
    // of your architecture
    uint32_t result = 0;

    // cast each byte to unsigned char first so that a negative
    // char value cannot sign-extend and corrupt the result
    result |= (uint32_t)(unsigned char)(*ptr++) << 24;
    result |= (uint32_t)(unsigned char)(*ptr++) << 16;
    result |= (uint32_t)(unsigned char)(*ptr++) << 8;
    result |= (uint32_t)(unsigned char)(*ptr++);
    return result;
}
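As a quick usage sketch (my own addition, assuming the parse_comm_value function above is in scope), feeding it four bytes in network order yields the same value on any host:

#include <stdio.h>
#include <stdint.h>

uint32_t parse_comm_value(const char * ptr);  /* the function defined above */

int main(void)
{
    /* four bytes as they would arrive on the wire, big-endian */
    const char wire[4] = { 0x01, 0x02, 0x03, 0x04 };

    uint32_t value = parse_comm_value(wire);
    printf("0x%08x\n", (unsigned)value);      /* prints 0x01020304 on any host */
    return 0;
}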
Tl;dr calling a standard function like printf("0x%llx", number); always prints the correct value using the specified format. Inspecting the contents of memory by reading individual bytes gives you the representation of the data on your architecture.

Unexpected output in the C code with union

I don't understand the output in the following C code:
#include <stdio.h>

int main()
{
    union U
    {
        int i;
        char s[3];
    } u;

    u.i = 0x3132;
    printf("%s", u.s);
    return 0;
}
Initial memory is 32 bits and is the binary value of 0x3132 which is
0000 0000 0000 0000 0011 0001 0011 0010.
If the last three bytes of 0x3132 are the value of s (without leading zeroes), then s[0]=0011,s[1]=0001,s[2]=0011.
This gives the values of s=0011 0001 0011=787.
Question: Why the output is 21 and not 787?
The value 0x3132 is represented in memory as 0x32, 0x31, 0x00, 0x00, because the byte order is little-endian.
The printf call prints out the string represented by the union member s. The string is printed out byte by byte: first 0x32 and then 0x31, which are the ASCII values of the characters '2' and '1'. Then the printing stops, as the third element is the null character 0x00.
Note that the representation of int is implementation defined and may not consist of 4 bytes and may have padding. Thus the member of the union s may not represent a string, in which case calling printf with the %s specifier will cause undefined behavior.
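If you just want to inspect the representation without relying on the array being a valid string, a minimal sketch (my own addition, assuming a 4-byte int) is to print the bytes individually:

#include <stdio.h>

int main(void)
{
    union {
        int i;
        unsigned char b[sizeof(int)];
    } u;

    u.i = 0x3132;

    /* prints 32 31 00 00 on a little-endian machine */
    for (size_t k = 0; k < sizeof u.b; k++)
        printf("%02x ", u.b[k]);
    printf("\n");
    return 0;
}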
First, see this code sample:
#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>

int main()
{
    union {
        int32_t  i32;
        uint32_t u32;
        int16_t  i16[2];
        uint16_t u16[2];
        int8_t   i8[4];
        uint8_t  u8[4];
    } u;

    u.u8[3] = 52;
    u.u8[2] = 51;
    u.u8[1] = 50;
    u.u8[0] = 49;

    printf(" %d %d %d %d \n", u.u8[3], u.u8[2], u.u8[1], u.u8[0]); // 52 51 50 49
    printf(" %x %x %x %x \n", u.u8[3], u.u8[2], u.u8[1], u.u8[0]); // 34 33 32 31
    printf(" 0x%x \n", u.i32); // 0x34333231

    return 0;
}
The union here just lets you access the memory of u in 6 different ways.
you may use u.i32 to read or write as int32_t or
you may use u.u32 to read or write as uint32_t or
you may use u.i16[0] or u.i16[1] to read or write as int16_t or
you may use u.u16[0] or u.u16[1] to read or write as uint16_t or
or like this to write as uint8_t:
u.u8[3] = 52;
u.u8[2] = 51;
u.u8[1] = 50;
u.u8[0] = 49;
and read like this as uint8_t:
printf(" %d %d %d %d \n", u.u8[3], u.u8[2], u.u8[1], u.u8[0]);
then output is:
52 51 50 49
and read as int32_t:
printf(" 0x%x \n", u.i32);
then output is:
0x34333231
So, as you see in this sample code, a union shares one memory location among many names/types.
In your sample code, u.i = 0x3132; writes 0x3132 into the memory of u.i, laid out according to the endianness of your system, which is little-endian here. You then ask for printf("%s", u.s); u.s is an array of char, which decays to a pointer to its first element, so this printf("%s", u.s); reads u.s[0] and prints it to stdout, then reads u.s[1] and prints it, and so on, until one of the u.s[i] bytes is zero.
That is what your code is doing; if none of u.s[0], u.s[1], u.s[2], u.s[3] were zero, memory outside your union would be read until either a zero byte is found or a memory fault occurs.
It means that you machine is little-endian, so the bytes are stored in the opposite order, like this:
32 31 00 00
So: s[0] = 0x32, s[1] = 0x31, s[2] = 0x00.
Even if, in theory, printing an array of chars with "%s" like this is undefined behaviour, in practice it works: it prints 0x32 (character '2'), 0x31 (character '1') and then it stops at 0x00.
if you write your code like this:
#include <stdio.h>

int main( void )
{
    union U
    {
        int i;
        char s[3];
    } u;

    u.i = 0x3132;
    printf("%s", u.s);
    printf( "%8x\n", (unsigned)u.i );
}
Then you would see that u.i contains 0x3132, which is actually stored in memory as the bytes 0x32 0x31 0x00 0x00 due to endianness.
The second call to printf() therefore prints 3132, right-justified in a field of 8 characters, as you would expect,
and since the ASCII character '1' is 0x31, the character '2' is 0x32, and the first 0x00 byte stops the %s conversion, the first printf() outputs 21.

Need clarification about unsigned char * in C

Given the code:
...
int x = 123;
...
unsigned char * xx = (char *) & x;
...
I have xx[0] = 123, xx[1] = 0, xx[2] = 0, etc.
Can someone explain what is happening here? I don't have a great understanding of pointers in general, so the simpler the better.
Thanks
You're accessing the bytes (chars) of a little-endian int in sequence. The number 123 in an int on a little-endian system will usually be stored as {123,0,0,0}. If your number had been 783 (256 * 3 + 15), it would be stored as {15,3,0,0}.
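A minimal sketch to confirm that {15,3,0,0} layout (my own addition, assuming a little-endian machine with a 4-byte int):

#include <stdio.h>

int main(void)
{
    int x = 783;   /* 256 * 3 + 15 */
    unsigned char *xx = (unsigned char *)&x;

    /* on a little-endian machine this prints: 15 3 0 0 */
    printf("%d %d %d %d\n", xx[0], xx[1], xx[2], xx[3]);
    return 0;
}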
I'll try to explain all the pieces in ASCII pictures.
int x = 123;
Here, x is the symbol representing a location of type int. Type int typically uses 4 bytes of memory on both 32-bit and 64-bit machines, although the exact size is compiler and platform dependent. For this discussion, let's assume 32 bits (4 bytes).
Memory on x86 is managed "little endian", meaning that when a number spans multiple bytes (its value needs more than a single byte: > 255 unsigned, or > 127 signed), it is stored with the least significant byte at the lowest address. If your number were the hexadecimal value 0x12345678, it would be stored as:
x: 78 <-- address that `x` represents
56 <-- x addr + 1 byte
34 <-- x addr + 2 bytes
12 <-- x addr + 3 bytes
Your number, decimal 123, is 7B hex, or 0000007B (all 4 bytes shown), so would look like:
x: 7B <-- address that `x` represents
00 <-- x addr + 1 byte
00 <-- x addr + 2 bytes
00 <-- x addr + 3 bytes
To make this clearer, let's make up a memory address for x, say, 0x00001000. Then the byte locations would have the following values:
Address Value
x: 00001000 7B
00001001 00
00001002 00
00001003 00
Now you have:
unsigned char * xx = (char *) & x;
Which defines a pointer to an unsigned char (an 8-bit, or 1-byte, unsigned value, ranging 0-255) whose value is the address of your integer x. In other words, the value contained in xx is 0x00001000.
xx: 00
10
00
00
The ampersand (&) indicates you want the address of x. And, technically, the declaration isn't correct. It really should be cast properly as:
unsigned char * xx = (unsigned char *) & x;
So now you have a pointer, or address, stored in the variable xx. That address points to x:
Address Value
x: 00001000 7B <-- xx points HERE (xx has the value 0x00001000)
00001001 00
00001002 00
00001003 00
The value of xx[0] is what xx points to offset by 0 bytes. It's offset by bytes because the type of xx is a pointer to an unsigned char which is one byte. Therefore, each offset count from xx is by the size of that type. The value of xx[1] is just one byte higher in memory, which is the value 00. And so on. Pictorially:
Address Value
x: 00001000 7B <-- xx[0], or the value at `xx` + 0
00001001 00 <-- xx[1], or the value at `xx` + 1
00001002 00 <-- xx[2], or the value at `xx` + 2
00001003 00 <-- xx[3], or the value at `xx` + 3
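To see those offsets concretely, here is a small sketch (my own addition, assuming a 4-byte int) that prints each address and the byte stored there, much like the pictures above:

#include <stdio.h>

int main(void)
{
    int x = 123;
    unsigned char *xx = (unsigned char *)&x;

    /* each line corresponds to one row of the picture above;
       the addresses will differ on your machine */
    for (int i = 0; i < (int)sizeof x; i++)
        printf("xx[%d] at %p holds %02X\n", i, (void *)&xx[i], xx[i]);
    return 0;
}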
Yeah, you're doing something you shouldn't be doing...
That said... One part of the result is that you're working on a little-endian processor. The int x = 123; statement allocates 4 bytes on the stack and initializes them with the value 123. Since the machine is little-endian, the memory looks like 123, 0, 0, 0; if it were big-endian, it would be 0, 0, 0, 123. Your char pointer points to the first byte of the memory where x is stored.
unsigned char * xx = (char *) & x;
You take the address of x, tell the compiler to treat it as a pointer to char, and assign that to xx, which is a pointer to unsigned char. The cast to (char *) just keeps the compiler happy.
Now if you inspect the memory xx points to, what you see depends on the machine: the so-called little-endian or big-endian way of storing integers. x86 is little-endian and stores the bytes of an integer in reverse. So storing 0x00000123 stores the bytes 0x23 0x01 0x00 0x00 (and your decimal 123, which is 0x7B, is stored as 0x7B 0x00 0x00 0x00), which is what you see when inspecting the location xx points to as characters.

How to interpret *( (char*)&a )

I've seen that one way to determine the endianness of the platform is this program, but I don't understand it:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int a = 1;
    if( *( (char*)&a ) == 1) printf("Little Endian\n");
    else printf("Big Endian\n");
    system("PAUSE");
    return 0;
}
What does the test do?
An int is almost always larger than a byte and often tracks the word size of the architecture. For example, a 32-bit architecture will likely have 32-bit ints. So given typical 32 bit ints, the layout of the 4 bytes might be:
00000000 00000000 00000000 00000001
or with the least significant byte first:
00000001 00000000 00000000 00000000
A char is one byte, so if we cast this address to a char* and dereference it, we'll get the first byte above: either
00000000
or
00000001
So by examining the first byte, we can determine the endianness of the architecture.
This would only work on platforms where sizeof(int) > 1. As an example, we'll assume it's 2, and that a char is 8 bits.
Basically, with little-endian, the number 1 as a 16-bit integer looks like this:
00000001 00000000
But with big-endian, it's:
00000000 00000001
So first the code sets a = 1, and then this:
*( (char*)&a ) == 1)
takes the address of a, treats it as a pointer to a char, and dereferences it. So:
If a contains a little-endian integer, you're going to get the 00000001 section, which is 1 when interpreted as a char
If a contains a big-endian integer, you're going to get 00000000 instead. The check for == 1 will fail, and the code will assume the platform is big-endian.
You could improve this code by using int16_t and int8_t instead of int and char. Or better yet, just check if htons(1) != 1.
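As a sketch of that first suggestion (my own addition, using the fixed-width types from <stdint.h>):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int16_t a = 1;

    /* look at the first byte of a value whose size is known to be 2 bytes */
    if (*(int8_t *)&a == 1)
        printf("Little Endian\n");
    else
        printf("Big Endian\n");
    return 0;
}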
You can look at an integer as an array of 4 bytes (on most platforms). A little endian integer will have the values 01 00 00 00 and a big endian 00 00 00 01.
By doing &a you get the address of the first element of that array.
The expression (char*)&a casts it to the address of a single byte.
And finally *( (char*)&a ) gets the value contained by that address.
take the address of a
cast it to char*
dereference this char*, this will give you the first byte of the int
check its value - if it's 1, then it's little endian. Otherwise - big.
Assume sizeof(int) == 4, then:
|........||........||........||........| <- 4bytes, 8 bits each for the int a
| byte#1 || byte#2 || byte#3 || byte#4 |
When step 1, 2 and 3 are executed, *( (char*)&a ) will give you the first byte, | byte#1 |.
Then, by checking the value of byte#1 you can understand if it's big or little endian.
The program just reinterprets the space taken up by an int as an array of chars and assumes that 1 as an int will be stored as a series of bytes, the lowest order of which will be a byte of value 1, the rest being 0.
So if the lowest order byte occurs first, then the platform is little endian, else its big endian.
These assumptions may not work on every single platform in existence.
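An alternative sketch that avoids pointer casts entirely (my own variation, not from the answer above) is to copy the object representation into a byte array with memcpy:

#include <stdio.h>
#include <string.h>

int main(void)
{
    int a = 1;
    unsigned char bytes[sizeof a];

    /* memcpy hands us the byte-by-byte representation of a */
    memcpy(bytes, &a, sizeof a);

    if (bytes[0] == 1)
        printf("Little Endian\n");
    else
        printf("Big Endian\n");
    return 0;
}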
a = 00000000 00000000 00000000 00000001
    ^                          ^
    |                          |
    &a if big endian           &a if little endian

    00000000                   00000001
    ^                          ^
    |                          |
    (char*)&a for BE           (char*)&a for LE

    *(char*)&a = 0 for BE      *(char*)&a = 1 for LE
Here's how it breaks down:
a -- given the variable a
&a -- take its address; type of the expression is int *
(char *)&a -- cast the pointer expression from type int * to type char *
*((char *)&a) -- dereference the pointer expression
*((char *)&a) == 1 -- and compare it to 1
Basically, the cast (char *)&a converts the type of the expression &a from a pointer to int to a pointer to char; when we apply the dereference operator to the result, it gives us the value stored in the first byte of a.
*( (char*)&a )
In big-endian, the data for an int with value 1 (4 bytes in size) will be arranged in memory as follows (from lower address to higher address):
00000000 -->Address 0x100
00000000 -->Address 0x101
00000000 -->Address 0x102
00000001 -->Address 0x103
While in little-endian it is:
00000001 -->Address 0x100
00000000 -->Address 0x101
00000000 -->Address 0x102
00000000 -->Address 0x103
Analyzing the above cast:
&a = 0x100, so *((char*)0x100) reads one byte at a time (instead of the 4 bytes that would be loaded for an int), i.e. the single byte stored at address 0x100.
So *( (char*)&a ) == 1 becomes (*0x100 == 1), that is 1 == 1 on this machine, which is true, implying it is little-endian.
