How was an array of char stored? - c

Here is something weird I found:
When I have a char* s of three elements, and assigned it to be "21",
The printed short int value of s appears to be 12594, which is same to 0010001 0010010 in binary, and 49 50 for separate char. But according to the ASCII chart, the value of '2' is 50 and '1' is 49.
when I shift the char to right, *(short*)s >>= 8, the result is agreed with (1.), which is '1' or 49. But after I assigned the char *s = '1', the printed string of s also appears to be "1", which I earlier thought it will become "11".
I am kind of confused about how bits stored in a char now, hope someone can explain this.
Following is the code I use:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
printf("%lu,%lu\n",sizeof(char), sizeof(short));
char* s = malloc(sizeof(char)*3);
*s = '2', *(s+1) = '1', *(s+2) = '\0';
printf("%s\n",s);
printf("%d\n",*(short int*)s);
*(short*)s >>= 8;
printf("%s\n",s);
printf("%d\n",*(short int*)s);
*s = '1';
printf("%s\n",s);
return 0;
}
And the output is:
1,2
21
12594
1
49
1
This program is compiled on macOS with gcc.

You need some understanding of the concept of "endianess" here, that values can be represented as "little endian" and "big endian".
I am going to skip the discussion of how legal it is, about involved undefined bahaviour.
(Here is however a relevant link, provided by Lundin, credits:
What is the strict aliasing rule?)
But lets look at a pair of byte in memory, of which the lower-addressed contains a 50 and the higher addressed contains a 49:
50 49
You introduce them exactly this way, by explicitly setting lower byte and higher byte (via char type).
Then you read them, forcing the compiler to consider it a short, which is a two byte sized type on your system.
Compilers and hardware can be created with different "opinions" on what is a good representation of two byte values in two cosecutive bytes. It is called "endianess".
Two compilers, both of which are perfectly standard-conforming can act like this:
The short to be returned is
take the value from lower address, multiply it by 256, add the value from higher address
take the value from the higher address, multiply it by 256, add the value from the lower address
They do not actually do so, it is a much more efficient mechanism implemented in hardware, but the point is that even the implementation in hardware implicity does this or that.

You are re-interpreting representations by aliasing types in a way that is not allowed by the standard: you can process a short value as if it were a char array, but not the opposite. Doing that can cause weird errors with optimizing compilers that could assume that the value has never been initialized, or could optimize out a full branch of code that contains Undefined Behaviour.
Then the answer to your question is called endianess. In a big endian representation, the most significant byte has the lowest address (258 or 0x102 will be represented as the 2 byte 0x01, 0x02 in that order) while in little endian representation the least significant byte has the lowest address (0x102 is represented as 0x02, 0x01 in that order).
Your system happens to be a little endian one.

Related

Printing actual bit representation of integers in C

I wanted to print the actual bit representation of integers in C. These are the two approaches that I found.
First:
union int_char {
int val;
unsigned char c[sizeof(int)];
} data;
data.val = n1;
// printf("Integer: %p\nFirst char: %p\nLast char: %p\n", &data.f, &data.c[0], &data.c[sizeof(int)-1]);
for(int i = 0; i < sizeof(int); i++)
printf("%.2x", data.c[i]);
printf("\n");
Second:
for(int i = 0; i < 8*sizeof(int); i++) {
int j = 8 * sizeof(int) - 1 - i;
printf("%d", (val >> j) & 1);
}
printf("\n");
For the second approach, the outputs are 00000002 and 02000000. I also tried the other numbers and it seems that the bytes are swapped in the two. Which one is correct?
Welcome to the exotic world of endian-ness.
Because we write numbers most significant digit first, you might imagine the most significant byte is stored at the lower address.
The electrical engineers who build computers are more imaginative.
Someimes they store the most significant byte first but on your platform it's the least significant.
There are even platforms where it's all a bit mixed up - but you'll rarely encounter those in practice.
So we talk about big-endian and little-endian for the most part. It's a joke about Gulliver's Travels where there's a pointless war about which end of a boiled egg to start at. Which is itself a satire of some disputes in the Christian Church. But I digress.
Because your first snippet looks at the value as a series of bytes it encounters then in endian order.
But because the >> is defined as operating on bits it is implemented to work 'logically' without regard to implementation.
It's right of C to not define the byte order because hardware not supporting the model C chose would be burdened with an overhead of shuffling bytes around endlessly and pointlessly.
There sadly isn't a built-in identifier telling you what the model is - though code that does can be found.
It will become relevant to you if (a) as above you want to breakdown integer types into bytes and manipulate them or (b) you receive files for other platforms containing multi-byte structures.
Unicode offers something called a BOM (Byte Order Marker) in UTF-16 and UTF-32.
In fact a good reason (among many) for using UTF-8 is the problem goes away. Because each component is a single byte.
Footnote:
It's been pointed out quite fairly in the comments that I haven't told the whole story.
The C language specification admits more than one representation of integers and particularly signed integers. Specifically signed-magnitude, twos-complement and ones-complement.
It also permits 'padding bits' that don't represent part of the value.
So in principle along with tackling endian-ness we need to consider representation.
In principle. All modern computers use twos complement and extant machines that use anything else are very rare and unless you have a genuine requirement to support such platforms, I recommend assuming you're on a twos-complement system.
The correct Hex representation as string is 00000002 as if you declare the integer with hex represetation.
int n = 0x00000002; //n=2
or as you where get when printing integer as hex like in:
printf("%08x", n);
But when printing integer bytes 1 byte after the other, you also must consider the endianess, which is the byte order in multi-byte integers:
In big endian system (some UNIX system use it) the 4 bytes will be ordered in memory as:
00 00 00 02
While in little endian system (most of OS) the bytes will be ordered in memory as:
02 00 00 00
The first prints the bytes that represent the integer in the order they appear in memory. Platforms with different endian will print different results as they store integers in different ways.
The second prints the bits that make up the integer value most significant bit first. This result is independent of endian. The result is also independent of how the >> operator is implemented for signed ints as it does not look at the bits that may be influenced by the implementation.
The second is a better match to the question "Printing actual bit representation of integers in C". Although there is a lot of ambiguity.
It depends on your definition of "correct".
The first one will print the data exactly like it's laid out in memory, so I bet that's the one you're getting the maybe unexpected 02000000 for. *) IMHO, that's the correct one. It could be done simpler by just aliasing with unsigned char * directly (char pointers are always allowed to alias any other pointers, in fact, accessing representations is a usecase for char pointers mentioned in the standard):
int x = 2;
unsigned char *rep = (unsigned char *)&x;
for (int i = 0; i < sizeof x; ++i) printf("0x%hhx ", rep[i]);
The second one will print only the value bits **) and take them in the order from the most significant byte to the least significant one. I wouldn't call it correct because it also assumes that bytes have 8 bits, and because the shifting used is implementation-defined for negative numbers. ***) Furthermore, just ignoring padding bits doesn't seem correct either if you really want to see the representation.
edit: As commented by Gerhardh meanwhile, this second code doesn't print byte by byte but bit by bit. So, the output you claim to see isn't possible. Still, it's the same principle, it only prints value bits and starts at the most significant one.
*) You're on a "little endian" machine. On these machines, the least significant byte is stored first in memory. Read more about Endianness on wikipedia.
**) Representations of types in C may also have padding bits. Some types aren't allowed to include padding (like char), but int is allowed to have them. This second option doesn't alias to char, so the padding bits remain invisible.
***) A correct version of this code (for printing all the value bits) must a) correctly determine the number of value bits (8 * sizeof int is wrong because bytes (char) can have more then 8 bits, even CHAR_BIT * sizeof int is wrong, because this would also count padding bits if present) and b) avoid the implementation-defined shifting behavior by first converting to unsigned. It could look for example like this:
#define IMAX_BITS(m) ((m) /((m)%0x3fffffffL+1) /0x3fffffffL %0x3fffffffL *30 \
+ (m)%0x3fffffffL /((m)%31+1)/31%31*5 + 4-12/((m)%31+3))
int main(void)
{
int x = 2;
for (unsigned mask = 1U << (IMAX_BITS((unsigned)-1) - 1); mask; mask >>= 1)
{
putchar((unsigned) x & mask ? '1' : '0');
}
puts("");
}
See this answer for an explanation of this strange macro.

taking integer value in character pointer

int main()
{
int i=21;
char *p;
p=(char*)&i;
printf("%d",*p);
getch();
return 0;
}
printf statement gave me perfect answer but I think it shouldn't have as 'p' is a character pointer it will be able to save its base address but int takes up two spaces, *p shouldn't be able to give me integer value as it will point to address let say X but int is stored in two bytes so value need to be collected from X and X+1 address but I ran this code and gave me the value , or do I have the wrong insight on this ?
p=(char*)&i;
This points p to the lowest address in i. Whether that is the address of the low order byte or the high order byte depends on the endianness of your system. (It could even be an internal byte ... PDP-11's are little-endian but longs (32 bits) were stored with the high order 16-bit word first, so the byte order was 2,3,0,1.) Likely you're running on a little-endian machine (x86's are) so it points to the low order byte.
*p
Given little-endianness, this fetches the low order byte of i, which is (char)21, and then does the default conversion to an int, giving (int)21, and prints 21. If i contained a value > 255, you would get the "wrong" result. Also if it contained a value > 127 and < 256 and char is signed on your system -- it would print a negative value.
Since the result depends on the endianness of the machine and is implementation-defined and thus is not portable, you should not do this sort of thing unless your specific goal is to determine the endianness of your machine. Beginning programmers should spend a lot less time trying to understand why bad code sometimes "works" and instead learn how to write good code. A general rule (with plenty of exceptions): code with casts is bad code.

What does casting char* do to a reference of an int? (Using C)

In my course for intro to operating systems, our task is to determine if a system is big or little endian. There's plenty of results I've found on how to do it, and I've done my best to reconstruct my own version of a code. I suspect it's not the best way of doing it, but it seems to work:
#include <stdio.h>
int main() {
int a = 0x1234;
unsigned char *start = (unsigned char*) &a;
int len = sizeof( int );
if( start[0] > start[ len - 1 ] ) {
//biggest in front (Little Endian)
printf("1");
} else if( start[0] < start[ len - 1 ] ) {
//smallest in front (Big Endian)
printf("0");
} else {
//unable to determine with set value
printf( "Please try a different integer (non-zero). " );
}
}
I've seen this line of code (or some version of) in almost all answers I've seen:
unsigned char *start = (unsigned char*) &a;
What is happening here? I understand casting in general, but what happens if you cast an int to a char pointer? I know:
unsigned int *p = &a;
assigns the memory address of a to p, and that can you affect the value of a through dereferencing p. But I'm totally lost with what's happening with the char and more importantly, not sure why my code works.
Thanks for helping me with my first SO post. :)
When you cast between pointers of different types, the result is generally implementation-defined (it depends on the system and the compiler). There are no guarantees that you can access the pointer or that it correctly aligned etc.
But for the special case when you cast to a pointer to character, the standard actually guarantees that you get a pointer to the lowest addressed byte of the object (C11 6.3.2.3 §7).
So the compiler will implement the code you have posted in such a way that you get a pointer to the least significant byte of the int. As we can tell from your code, that byte may contain different values depending on endianess.
If you have a 16-bit CPU, the char pointer will point at memory containing 0x12 in case of big endian, or 0x34 in case of little endian.
For a 32-bit CPU, the int would contain 0x00001234, so you would get 0x00 in case of big endian and 0x34 in case of little endian.
If you de reference an integer pointer you will get 4 bytes of data(depends on compiler,assuming gcc). But if you want only one byte then cast that pointer to a character pointer and de reference it. You will get one byte of data. Casting means you are saying to compiler that read so many bytes instead of original data type byte size.
Values stored in memory are a set of '1's and '0's which by themselves do not mean anything. Datatypes are used for recognizing and interpreting what the values mean. So lets say, at a particular memory location, the data stored is the following set of bits ad infinitum: 01001010 ..... By itself this data is meaningless.
A pointer (other than a void pointer) contains 2 pieces of information. It contains the starting position of a set of bytes, and the way in which the set of bits are to be interpreted. For details, you can see: http://en.wikipedia.org/wiki/C_data_types and references therein.
So if you have
a char *c,
an short int *i,
and a float *f
which look at the bits mentioned above, c, i, and f are the same, but *c takes the first 8 bits and interprets it in a certain way. So you can do things like printf('The character is %c', *c). On the other hand, *i takes the first 16 bits and interprets it in a certain way. In this case, it will be meaningful to say, printf('The character is %d', *i). Again, for *f, printf('The character is %f', *f) is meaningful.
The real differences come when you do math with these. For example,
c++ advances the pointer by 1 byte,
i++ advanced it by 4 bytes,
and f++ advances it by 8 bytes.
More importantly, for
(*c)++, (*i)++, and (*f)++ the algorithm used for doing the addition is totally different.
In your question, when you do a casting from one pointer to another, you already know that the algorithm you are going to use for manipulating the bits present at that location will be easier if you interpret those bits as an unsigned char rather than an unsigned int. The same operatord +, -, etc will act differently depending upon what datatype the operators are looking at. If you have worked in Physics problems wherein doing a coordinate transformation has made the solution very simple, then this is the closest analog to that operation. You are transforming one problem into another that is easier to solve.

How long can a char be?

Why does int a = 'adf'; compile and run in C?
The literal 'adf' is a multi-byte character constant. Its value is platform dependent. Don't use it.
For example, one some platform a 32-bit unsigned integer could take the value 0x00616466, and on another it could be 0x66646100, and on yet another it could be 0x84860081...
This, as Kerrek said, is a multi-byte character constant. It works because each character takes up 8 bits. 'adf' is 3 characters, which is 24 bits. An int is usually large enough to contain this.
But all of the above is platform dependent, and could be different from architecture to architecture. This kind of thing is still used in ancient Apple code, can't quite remember where, although file creator codes ring a bell.
Note the difference in syntax between " and '.
char *x = "this is a string. The value assigned to x is a pointer to the string in memory"
char y = '!' // the value assigned to y is the numerical character value of the character '!'
char z = 'asd' // the value of z is the numerical value of the 'string' data, which can in theory be expressed as an int if it's short enough
It works just because "adf" is 3 ASCII characters and thus 3 bytes long and your platform is a 24 bit or larger system. It would fail on a 16bit system for instance.
Its also worth remembering that although sizeof(char) will always return 1, dependending on platform and compiler more than 1 byte of memory space could be assigned to a char hence for
struct st
{
int a;
char c;
};
when you:
sizeof(st) a number of 32 bit systems will return 8. This is because the system will pad out the single byte for char c to 4 bytes.
ASCII. Every character has a numerical value. Halfway through this tutorial is a description if you need more information http://en.wikibooks.org/wiki/C_Programming/Variables
Edit_______________________________________
char letter2 = 97; /* in ASCII, 97 = 'a' */
This is considered by some to be extremely bad practice, if we are using it to store a character, not a small number, in that if someone reads your code, most readers are forced to look up what character corresponds with the number 97 in the encoding scheme. In the end, letter1 and letter2 store both the same thing – the letter "a", but the first method is clearer, easier to debug, and much more straightforward.
One important thing to mention is that characters for numerals are represented differently from their corresponding number, i.e. '1' is not equal to 1.

C convert from int to char

I have a simple code
char t = (char)(3000);
Then value of t is -72. The hex value of 3000 is 0xBB8. I couldn't understand why the value of t is -72.
Thanks for your answers.
I don't know about Mac. So my result is -72. As I know, MAC is using Big Endian, so does it affect the result? I dont have any MAC computer to test so I want to know from MAC people.
The hex value of 3000 is 0xBB8.
And so the hex value of the char (which, by the way, appears to be signed on your compiler) is 0xB8.
If it were unsigned, 0xB8 would be 184. But since it's signed, its actual value is 256 less, i.e. -72.
If you want to know why this is, read about two's complement notation.
A char is 8 bits (which can only represent a 0-255 range). Trying to cast 3000 to a char is... impossible impossible, at least for what you are intending.
This is happening because 3000 is too big a value and causes an overflow. Char is generally from -128 to 127 signed, or 0 to 255 unsigned, but it can change depending upon the implementation.
char is an integral type with certain range of representable values. int is also an integral type with certain range of representable values. Normally, range of int is [much] wider than that of char. When you try to squeeze into a char an int value that doesn't fit into the range of char, the value will not "fit", of course. The actual result is implementation-defined.
In your case 3000 is an int value that doesn't' fit into the range of char on your implementation. So, you won't get 3000 as the result. If you really want to know why it specifically came out as -72 - consult the documentation that came with your implementation.
As specified, the 16-bit hex value of 3000 is 0x0BB8. Although implementation specific, from your posted results this is likely stored in memory in 8-bit pairs as B8 0B (some architectures would store it as 0B B8. This is known as endianness.)
char, on the other hand, is probably not a 16-bit type. Again, this is implementation specific, but from your posted results it appears to be 8-bits, which is not uncommon.
So while your program has allocated 8-bits of memory for your value, you're storing twice as much information in that memory. When your program retrieves this value later, it will only be pulling the first stored octet, in this case B8. The 0B will be ignored, and may cause problems later down the line if it ended up overwriting something important. This is known as a buffer overflow, which is very bad.
Assuming two's complement (technically implementation specific, but a reasonable assumption), the hex value of B8 translates to either -72 or 184 in decimal, depending on whether your dealing with a signed or unsigned type. Since you didn't specify either, your compiler will go with it's default. Yet again, this is implementation specific, and it appears your compiler goes with signed char.
Therefore, you get -72. But don't expect the same results on any other system.
A char is (typically) just 8 bits, so you cant store values as large as 3000 (which would require at least 11 12 bits). So if you trie to store 3000 in a byte, it will just wrap.
Since 3000 is 0xBBA, it requires two bytes, one 0x0B and one which is 0xBA. If you try to store it in a single byte, you will just get one of them (0xBA). And since a byte is (typically) signed, that is -72.
char is used to hold a single character, and you're trying to store a 4-digit int in one. Perhaps you meant to use an array of chars, or string (char t[4] in this case).
To convert an int to a string (untested):
#include <stdlib.h>
int main() {
int num = 3000;
char numString[4];
itoa(num, buf, 10);
}
oh, i get it, it's overflow, it's like char is only from -256 to 256 or something like that i'm not sure, like if you have a var which type's max limit is 256 and you add 1 to it, than it becomes -256 and so on

Resources