Testing the endianness of a machine - c

Here is the program I used:
int hex = 0x23456789;
char * val = &hex;
printf("%p\n",hex);
printf("%p %p %p %p\n",*val,*(val+1),*(val+2),*(val+3));
Here is my output:
0x23456789
0xffffff89 0x67 0x45 0x23
I am working on a 64 bit CPU with a 64 bit OS. This shows my machine is little endian. Why is the first byte 0xffffff89? Why the ff's?

Firstly, you should be using %x since those aren't pointers.
The %x specifier expects an (unsigned) int. Because you are passing a value of type char, which is a signed type on your platform, the value is promoted to an int and sign-extended in the process.
http://en.wikipedia.org/wiki/Sign_extension
That essentially means it takes the most significant bit and copies it into all of the higher bits. So 0x89 (0b10001001), whose highest bit is 1, becomes 0xFFFFFF89.
The proper solution is to specify a length modifier. You can get more info here: Printf Placeholders. Essentially, between the '%' and the 'x' you can put extra characters; 'hh' means that you are passing a char-sized value.
int hex = 0x23456789;
char *val = (char*)&hex;
printf("%x\n",hex);
printf("%hhx %hhx %hhx %hhx\n", val[0], val[1], val[2], val[3]);

char is a signed type here, and it gets promoted to int when passed as a variadic argument. This promotion causes sign extension: 0x89 is a negative value for a char, so it is sign-extended to 0xffffff89. This does not happen for the other values because they don't exceed CHAR_MAX, which is 127 (0x7f) on most machines. The behavior only looks confusing because you are using the wrong format specifier.
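A minimal sketch of that difference, assuming plain char is signed, 8 bits wide, and int is 32 bits:
#include <stdio.h>

int main(void)
{
    char neg = (char)0x89; /* exceeds CHAR_MAX, so the stored value is negative */
    char pos = 0x67;       /* within CHAR_MAX, stays positive                   */
    /* both are promoted to int before printf sees them; only neg is sign-extended */
    printf("%x %x\n", neg, pos); /* typically prints: ffffff89 67 */
    return 0;
}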

%p asks printf to format the argument as an address, but you are actually passing a value (*val).
On a 64-bit machine pointer addresses are 64-bit, so printf is adding the ff's to pad the field.

As @Martin Beckett said, %p asks printf to print a pointer, which is roughly equivalent to %#x or %#lx (the exact format depends on your OS).
This means printf expects an int or a long (again, depending on the OS), but you are only supplying a char, so the value is converted up to the appropriate type.
When you convert a smaller signed number to a bigger signed type, sign extension is performed in order to preserve the value. In the case of 0x89 this happens because the sign bit is set, so the upper bytes become 0xff and get printed because they are significant.
In the case of 0x67, 0x45, 0x23 sign extension does not happen because the sign bit is not set, and so the upper bytes are 0s and thus not printed.
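As a sketch of the difference between the two cases (same assumptions: signed 8-bit char, 32-bit int), compare the sign-extended and zero-extended views of the same byte:
#include <stdio.h>

int main(void)
{
    char c = (char)0x89;                   /* sign bit set */
    printf("%x\n", (int)c);                /* sign-extended: ffffff89 */
    printf("%x\n", (int)(unsigned char)c); /* zero-extended: 89       */
    return 0;
}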

I test the endian-ness with the condition ((char)((int)511) == (char)255). True means little, false means big.
I have tested this on a few separate systems, both little and big, using gcc with optimizations off and to max. In every test I have done I have gotten correct results.
You could put that condition in an if in your application before it needs to do endian-critical operations. If you only want to guarantee you are using the right endianness for your entire application, you could instead use a static assertion method such as the following:
extern char ASSERTION__LITTLE_ENDIAN[((char)((int)511) == (char)255)?1:-1];
That line at global scope will cause a compile error, and hence refuse to compile, if the system is not little endian. If there is no error, it compiles as if the line didn't exist. I find the error message pretty descriptive:
error: size of array 'ASSERTION__LITTLE_ENDIAN' is negative
Now if, like me, you're paranoid about your compiler optimizing the actual check away, you can do the following:
int endian;
{
    int i = 255;
    char *c = (char *)&i;
    endian = (c[0] == (char)255);
}
if(endian) // if endian is little
Which compacts nicely in to this macro:
#define isLittleEndian(e) int e; { int i = 255; char *c = (char *)&i; e = (c[0] == (char)255); }
isLittleEndian(endian);
if(endian) // if endian is little
Or if you use GCC, you can get away with:
#define isLittleEndian ({ int i = 255; char *c = (char *)&i; (c[0] == (char)255); })
if(isLittleEndian) // if endian is little

Related

How does the hardware know whether a variable is positive or negative?

I'm sorry if this question is too basic...I just have not found the answer to it anywhere.
Say I declare a C variable like this:
unsigned int var = 241;
In this case the var is unsigned so my intention is for it to have decimal value 241.
Alternatively I could declare it like this:
signed int var = -15;
In this case I declared it as signed integer so, as per my understanding it should have the decimal value -15.
However, in both cases I assume the var will be represented in memory (hardware) like this:
1111 0001.
So how does the processor know, at the lowest level which is in the hardware that I intended to declare this as 241 or -15?
I'm aware of the two's complement notation that is used to represent negative numbers and such but, I assume in hardware the processor only sees a sequence of ones and zeroes and then does some operation with it by switching the states of some ICs. How does the processor know whether to interpret the sequence of bits in standard binary(for unsigned) or 2's complement(for signed)?
Also another somewhat unrelated questions:
In C I can do this:
unsigned int var = -15;
printf("The var is: %d ", var);
This will as expected print -15.
Why, when I do this:
signed int var = 0xF1; //or 0b11110001
printf("The var is: %d ", var);
I get 241 instead of -15? Since I declared it as signed and in two's complement 0xF1 is -15 why am I getting the value 241 which is the equivalent of 0xF1 in standard binary?
Why does the compiler let me do stuff like:
unsigned int var = -15;
Shouldn't it throw an error telling me I can't assign negative values to a variable which I have declared as unsigned?
Thank you and I apologize for my many and perhaps basic questions, there is so much I do not know :D.
The hardware does not know.
The compiler knows.
The compiler knows because you said so here: signed int var = -15; "This, dear compiler, is a variable which can be negative, and I initialize it to a negative value."
Here you said differently: unsigned int var = 241; "This, dear compiler, is a variable which cannot be negative, and I initialize it to a positive value."
The compiler will keep that in mind for anything you later do with the variable and its values. It will turn all corresponding code into the set of machine instructions that makes the hardware behave accordingly. So the hardware ends up doing things appropriate to negative or non-negative values; not because it knows, but because it never gets a choice.
An interesting aspect of "corresponding instructions" (as pointed out by Peter Cordes in a comment below) is the fact that for the special (but very widely used) case of two's complement representation of negative values, the instructions are actually identical for both (which is an important advantage of two's complement).
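A small sketch of that point, assuming a 32-bit int and two's complement: the same addition produces the same bit pattern whether the operands are declared signed or unsigned, which is why one add instruction serves both.
#include <stdio.h>

int main(void)
{
    int          s = -15;
    unsigned int u = (unsigned int)-15;       /* same bit pattern: 0xfffffff1 */

    printf("%08x\n", (unsigned int)(s + 20)); /* 00000005 */
    printf("%08x\n", u + 20u);                /* 00000005 */
    return 0;
}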
If the two values were char (signed or not), then their internal representation (8-bit pattern) would be the same in memory or register.
The only difference would be in the instructions the compiler emits when dealing with such values.
For example, if these values are stored in variables declared signed or unsigned in C, then a comparison between such values would make the compiler generate a signed or unsigned specific comparison instruction at assembly level.
But in your example you use ints.
Assuming that on your platform these ints use four bytes, then the two constants you gave are not identical when it comes to their 32-bit pattern.
The higher bits take the sign of the value into account and are filled with 0s or 1s up to 32 bits (see the sequences of 0 or f below).
Note that assigning a negative value to an unsigned int produces a warning at compilation if you use the proper compiler flags (-Wconversion for example).
In his comment below, @PeterCordes reminds us that such an assignment is legal in C, and useful in some situations; whether or not to use compiler flags to detect such cases is only a matter of personal choice.
However, assigning -15U instead of -15 makes explicit the intention to consider the constant as unsigned (despite the minus sign), and does not trigger the warning.
int i1=-15;
int i2=0xF1;
int i3=241;
printf("%.8x %d\n", i1, i1); // fffffff1 -15
printf("%.8x %d\n", i2, i2); // 000000f1 241
printf("%.8x %d\n", i3, i3); // 000000f1 241
unsigned int u1=-15; // warning: unsigned conversion from ‘int’ to ‘unsigned int’ changes value from ‘-15’ to ‘4294967281’
unsigned int u2=0xF1;
unsigned int u3=241;
printf("%.8x %u\n", u1, u1); // fffffff1 4294967281
printf("%.8x %u\n", u2, u2); // 000000f1 241
printf("%.8x %u\n", u3, u3); // 000000f1 241

Printing actual bit representation of integers in C

I wanted to print the actual bit representation of integers in C. These are the two approaches that I found.
First:
union int_char {
    int val;
    unsigned char c[sizeof(int)];
} data;

data.val = n1;
// printf("Integer: %p\nFirst char: %p\nLast char: %p\n", &data.f, &data.c[0], &data.c[sizeof(int)-1]);
for(int i = 0; i < sizeof(int); i++)
    printf("%.2x", data.c[i]);
printf("\n");
Second:
for(int i = 0; i < 8*sizeof(int); i++) {
    int j = 8 * sizeof(int) - 1 - i;
    printf("%d", (val >> j) & 1);
}
printf("\n");
For the second approach, the outputs are 00000002 and 02000000. I also tried the other numbers and it seems that the bytes are swapped in the two. Which one is correct?
Welcome to the exotic world of endian-ness.
Because we write numbers most significant digit first, you might imagine the most significant byte is stored at the lower address.
The electrical engineers who build computers are more imaginative.
Sometimes they store the most significant byte first, but on your platform it's the least significant byte that comes first.
There are even platforms where it's all a bit mixed up - but you'll rarely encounter those in practice.
So we talk about big-endian and little-endian for the most part. It's a joke about Gulliver's Travels where there's a pointless war about which end of a boiled egg to start at. Which is itself a satire of some disputes in the Christian Church. But I digress.
Because your first snippet looks at the value as a series of bytes, it encounters them in the platform's byte order.
But because >> is defined as operating on the bits of the value, it works 'logically', without regard to the in-memory representation.
C is right not to define the byte order: hardware that didn't match whichever model C chose would be burdened with the overhead of endlessly and pointlessly shuffling bytes around.
There sadly isn't a built-in identifier telling you what the model is, though code that detects it can be found.
It will become relevant to you if (a) as above you want to break integer types down into bytes and manipulate them, or (b) you receive files from other platforms containing multi-byte structures.
Unicode offers something called a BOM (Byte Order Mark) in UTF-16 and UTF-32.
In fact a good reason (among many) for using UTF-8 is that the problem goes away, because each code unit is a single byte.
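If you do need to detect it at run time, a short check along these lines is usually enough (a sketch; it is essentially the same idea as the union used in the question above):
#include <stdio.h>

int main(void)
{
    union {
        unsigned int  u;
        unsigned char c[sizeof(unsigned int)];
    } probe = { 1u };

    /* if the 1 lands in the first byte, the least significant byte comes first */
    printf(probe.c[0] == 1 ? "little-endian\n" : "big-endian\n");
    return 0;
}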
Footnote:
It's been pointed out quite fairly in the comments that I haven't told the whole story.
The C language specification admits more than one representation of integers, and of signed integers in particular: sign-magnitude, two's complement and one's complement.
It also permits 'padding bits' that don't represent part of the value.
So in principle along with tackling endian-ness we need to consider representation.
In principle. All modern computers use two's complement, and extant machines that use anything else are very rare; unless you have a genuine requirement to support such platforms, I recommend assuming you're on a two's-complement system.
The correct hex representation as a string is 00000002, just as if you had declared the integer in hex notation:
int n = 0x00000002; //n=2
or as you would get when printing the integer as hex, as in:
printf("%08x", n);
But when printing the integer's bytes one after the other, you must also consider the endianness, which is the byte order of multi-byte integers:
In a big-endian system (some UNIX systems use it) the 4 bytes are ordered in memory as:
00 00 00 02
while in a little-endian system (used by most current platforms) the bytes are ordered in memory as:
02 00 00 00
The first prints the bytes that represent the integer in the order they appear in memory. Platforms with different endianness will print different results, as they store integers in different ways.
The second prints the bits that make up the integer value, most significant bit first. This result is independent of endianness. The result is also independent of how the >> operator is implemented for signed ints, as it does not look at the bits that may be influenced by the implementation.
The second is a better match to the question "Printing actual bit representation of integers in C", although there is a lot of ambiguity in that phrase.
It depends on your definition of "correct".
The first one will print the data exactly as it's laid out in memory, so I bet that's the one you're getting the maybe unexpected 02000000 for. *) IMHO, that's the correct one. It could be done more simply by aliasing with unsigned char * directly (char pointers are always allowed to alias any other pointers; in fact, accessing representations is a use case for char pointers mentioned in the standard):
int x = 2;
unsigned char *rep = (unsigned char *)&x;
for (int i = 0; i < sizeof x; ++i) printf("0x%hhx ", rep[i]);
The second one will print only the value bits **) and take them in order from the most significant to the least significant. I wouldn't call it correct because it also assumes that bytes have 8 bits, and because the shifting used is implementation-defined for negative numbers. ***) Furthermore, just ignoring padding bits doesn't seem correct either if you really want to see the representation.
edit: As commented by Gerhardh meanwhile, this second code doesn't print byte by byte but bit by bit. So, the output you claim to see isn't possible. Still, it's the same principle, it only prints value bits and starts at the most significant one.
*) You're on a "little endian" machine. On these machines, the least significant byte is stored first in memory. Read more about Endianness on wikipedia.
**) Representations of types in C may also have padding bits. Some types aren't allowed to include padding (like char), but int is allowed to have them. This second option doesn't alias to char, so the padding bits remain invisible.
***) A correct version of this code (for printing all the value bits) must a) correctly determine the number of value bits (8 * sizeof(int) is wrong because bytes (char) can have more than 8 bits; even CHAR_BIT * sizeof(int) is wrong, because that would also count padding bits if present) and b) avoid the implementation-defined shifting behavior by first converting to unsigned. It could look for example like this:
#define IMAX_BITS(m) ((m) /((m)%0x3fffffffL+1) /0x3fffffffL %0x3fffffffL *30 \
+ (m)%0x3fffffffL /((m)%31+1)/31%31*5 + 4-12/((m)%31+3))
#include <stdio.h>

int main(void)
{
    int x = 2;
    for (unsigned mask = 1U << (IMAX_BITS((unsigned)-1) - 1); mask; mask >>= 1)
    {
        putchar((unsigned) x & mask ? '1' : '0');
    }
    puts("");
}
See this answer for an explanation of this strange macro.

Copying integer to char buffer in C

uint32_t after = 0xe1ca95ee;
char new_buf[4];
memcpy(new_buf, &after, 4);
printf("%x\n", *new_buf); // I want to print the content of new_buf
I want to copy the content of after to new_buf. But the result is confusing. printf gives me ffffffee. It looks like an address. I have already dereferenced new_buf.
According to the comments, I can't use memcpy or strncpy to do this task. But why? memcpy and strncpy are only designed to handle char *? But the content of after is in memory.
PS: I know I should use sprintf or snprintf. If you can explain why memcpy and strncpy is not for this case, I appreciate it.
This is the problem right here:
printf("%x\n", *new_buf);
This gives you what you asked for: it prints the char at location new_buf using the %x format. That location already contains 0xee (after your successful memcpy, least significant byte first, since you're most probably on an Intel machine, i.e. little endian), but it is printed as 0xffffffee (a negative number) because it's a char and not an unsigned char, and of course 0xee is a negative byte (> 0x7F, highest bit set).
You should use instead:
printf("%x\n", *((unsigned int*)new_buf));
...edited below...
Or rather:
printf("%x\n", *((uint32_t*)new_buf));
If you do this:
int i;
for (i = 0; i < 4; i++) printf("%x\n", new_buf[i]);
You can see it prints
ffffffee
ffffff95
ffffffca
ffffffe1
So your bytes are all there. As pointed out by @George André, they are signed bytes, so you see fs padded onto the front: the values are negative, and each one is printed as a full int, so ee represented in 32 bits is ffffffee. You are probably on a little-endian machine, so the least significant byte ee is stored at the lowest memory address, which is why you get the "last" byte of your number when dereferencing new_buf. The other part is answered already by others: you must declare new_buf as unsigned, or cast while printing.
uint32_t after = 0xe1ca95ee;
char new_buf[4];
memcpy(new_buf, &after, 4);
printf("%x\n", *((unsigned char*)new_buf));
Or alternatively
uint32_t after = 0xe1ca95ee;
unsigned char new_buf[4];
memcpy(new_buf, &after, 4);
printf("%x\n", *new_buf);
If you're trying to copy an integer and print it to stdout, as an integer, in base 16:
char new_buf[4];
...
printf("%x\n", *new_buf);
Whatever you stored in new_buf, its type is still char[4]. So, the type of *new_buf is char (it's identical to new_buf[0]).
So, you're getting the first char of your integer (which may be the high or low byte, depending on platform), having it automatically promoted to an integer, and then printing that as an unsigned int in base 16.
memcpy has indeed copied your value into the array, but if you want to print it, use
printf("%x\n", *(uint32_t *)new_buf);
or
printf("%02x%02x%02x%02x\n", new_buf[0], new_buf[1], new_buf[2], new_buf[3]);
(but note in the latter case your byte order may be reversed, depending on platform).
If you're trying to create a char array containing a base-16 string representation of your number:
Don't use memcpy, that doesn't convert from the integer to its string representation.
Try
uint32_t after = 0xe1ca95ee;
char new_buf[1 + 2*sizeof(after)];
snprintf(new_buf, sizeof(new_buf), "%x", after);
printf("original %x formatted as '%s'\n", after, new_buf);
(the buffer is sized to give 2 chars per octet, plus one for the nul-terminator).

What does casting char* do to a reference of an int? (Using C)

In my course for intro to operating systems, our task is to determine if a system is big or little endian. There's plenty of results I've found on how to do it, and I've done my best to reconstruct my own version of a code. I suspect it's not the best way of doing it, but it seems to work:
#include <stdio.h>
int main() {
int a = 0x1234;
unsigned char *start = (unsigned char*) &a;
int len = sizeof( int );
if( start[0] > start[ len - 1 ] ) {
//biggest in front (Little Endian)
printf("1");
} else if( start[0] < start[ len - 1 ] ) {
//smallest in front (Big Endian)
printf("0");
} else {
//unable to determine with set value
printf( "Please try a different integer (non-zero). " );
}
}
I've seen this line of code (or some version of) in almost all answers I've seen:
unsigned char *start = (unsigned char*) &a;
What is happening here? I understand casting in general, but what happens if you cast an int to a char pointer? I know:
unsigned int *p = &a;
assigns the memory address of a to p, and that can you affect the value of a through dereferencing p. But I'm totally lost with what's happening with the char and more importantly, not sure why my code works.
Thanks for helping me with my first SO post. :)
When you cast between pointers of different types, the result is generally implementation-defined (it depends on the system and the compiler). There are no guarantees that you can access the pointer or that it is correctly aligned, etc.
But for the special case when you cast to a pointer to character, the standard actually guarantees that you get a pointer to the lowest addressed byte of the object (C11 6.3.2.3 §7).
So the compiler will implement the code you have posted in such a way that you get a pointer to the least significant byte of the int. As we can tell from your code, that byte may contain different values depending on endianess.
If you have a 16-bit CPU, the char pointer will point at memory containing 0x12 in case of big endian, or 0x34 in case of little endian.
For a 32-bit CPU, the int would contain 0x00001234, so you would get 0x00 in case of big endian and 0x34 in case of little endian.
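A sketch that makes this visible (assuming a 32-bit int; the exact output depends on your platform) is to dump every byte of a in address order:
#include <stdio.h>

int main(void)
{
    int a = 0x1234;
    unsigned char *start = (unsigned char *)&a;

    /* prints 34 12 00 00 on a little-endian machine, 00 00 12 34 on a big-endian one */
    for (size_t i = 0; i < sizeof a; i++)
        printf("%02x ", start[i]);
    printf("\n");
    return 0;
}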
If you dereference an int pointer you get sizeof(int) bytes of data (4 bytes with a typical compiler such as gcc). But if you want only one byte, cast that pointer to a character pointer and dereference it; you will get one byte of data. Casting tells the compiler how many bytes to read, instead of the original data type's size.
Values stored in memory are a set of '1's and '0's which by themselves do not mean anything. Datatypes are used for recognizing and interpreting what the values mean. So lets say, at a particular memory location, the data stored is the following set of bits ad infinitum: 01001010 ..... By itself this data is meaningless.
A pointer (other than a void pointer) contains 2 pieces of information. It contains the starting position of a set of bytes, and the way in which the set of bits are to be interpreted. For details, you can see: http://en.wikipedia.org/wiki/C_data_types and references therein.
So if you have
a char *c,
a short int *i,
and a float *f
which all look at the bits mentioned above: c, i, and f hold the same address, but *c takes the first 8 bits and interprets them in a certain way. So you can do things like printf("The value is %c", *c). On the other hand, *i takes the first 16 bits and interprets them in a certain way; in this case, it is meaningful to say printf("The value is %d", *i). Again, for *f, printf("The value is %f", *f) is meaningful.
The real differences come when you do math with these. For example,
c++ advances the pointer by 1 byte,
i++ advances it by 2 bytes (the size of a short here),
and f++ advances it by 4 bytes (the typical size of a float).
More importantly, for
(*c)++, (*i)++, and (*f)++ the algorithm used for doing the addition is totally different.
In your question, when you cast from one pointer type to another, you already know that the algorithm you are going to use for manipulating the bits at that location will be easier if you interpret those bits as unsigned char rather than unsigned int. The same operators +, -, etc. will act differently depending on what data type they are looking at. If you have worked on physics problems where a coordinate transformation made the solution very simple, this is the closest analogue to that operation: you are transforming one problem into another that is easier to solve.
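A sketch of that arithmetic difference (the byte counts assume the typical sizes above; they are really sizeof(char), sizeof(short) and sizeof(float)):
#include <stdio.h>

int main(void)
{
    char  c[2];
    short i[2];
    float f[2];

    /* "+ 1" moves each pointer forward by the size of the pointed-to type */
    printf("char  + 1 advances %zu byte(s)\n", (size_t)((char *)(c + 1) - (char *)c));
    printf("short + 1 advances %zu byte(s)\n", (size_t)((char *)(i + 1) - (char *)i));
    printf("float + 1 advances %zu byte(s)\n", (size_t)((char *)(f + 1) - (char *)f));
    return 0;
}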

Defined behavior, passing character to printf("%02X"

I recently came across this question, where the OP was having issues printing the hexadecimal value of a variable. I believe the problem can be summed by the following code:
#include <stdio.h>
int main() {
char signedChar = 0xf0;
printf("Signed\n”);
printf(“Raw: %02X\n”, signedChar);
printf(“Masked: %02X\n”, signedChar &0xff);
printf(“Cast: %02X\n", (unsigned char)signedChar);
return 0;
}
This gives the following output:
Signed
Raw: FFFFFFF0
Masked: F0
Cast: F0
The format string used for each of the prints is %02X, which I’ve always interpreted as ‘print the supplied int as a hexadecimal value with at least two digits’.
The first case passes signedChar as a parameter and prints out the wrong value (because the other three bytes of the int have all of their bits set).
The second case gets around this problem, by applying a bit mask (0xFF) against the value to remove all but the least significant byte, where the char is stored. Should this work? Surely: signedChar == signedChar & 0xFF?
The third case gets around the problem by casting the character to an unsigned char (which seems to clear the top three bytes?).
For each of the three cases above, can anybody tell me if the behavior defined? How/Where?
I don't think this behavior is completely defined by the C standard. After all, it depends on the binary representation of signed values. I will just describe how it's likely to work.
printf("Raw: %02X\n", signedChar);
(char)0xf0, which can be written as (char)-16, is converted to (int)-16; its hex representation is 0xfffffff0.
printf("Masked: %02X\n", signedChar & 0xff);
0xff is of type int, so before calculating &, signedChar is converted to (int)-16.
((int)-16) & ((int)0xff) == (int)0x000000f0.
printf("Cast: %02X\n", (unsigned char)signedChar);
(unsigned char)0xf0, which can be written as (unsigned char)240, is promoted to (int)240; as hex, that is 0x000000f0.
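Putting that together, a sketch of the two usual ways to keep the printed value at F0 (this assumes a typical signed 8-bit char, as in the question):
#include <stdio.h>

int main(void)
{
    char signedChar = (char)0xf0;

    printf("%02X\n", (unsigned char)signedChar); /* F0: cast to unsigned char gives 240 before promotion */
    printf("%02hhX\n", signedChar);              /* F0: hh makes printf convert the argument back to unsigned char */
    return 0;
}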
