Endianness -- why do chars put in an Int16 print backwards? - c

The following C code, compiled and run in Xcode:
UInt16 chars = 'ab';
printf("\nchars: %2.2s", (char*)&chars);
prints 'ba', rather than 'ab'.
Why?

That particular implementation seems to store multi-character constants in little-endian format. In the constant 'ab' the character 'b' is the least significant byte (the little end) and the character 'a' is the most significant byte. If you viewed chars as an array, it'd be chars[0] = 'b' and chars[1] = 'a', and thus would be treated by printf as "ba".
Also, I'm not sure how accurate you consider Wikipedia, but regarding C syntax it has this section:
Multi-character constants (e.g. 'xy') are valid, although rarely
useful — they let one store several characters in an integer (e.g. 4
ASCII characters can fit in a 32-bit integer, 8 in a 64-bit one).
Since the order in which the characters are packed into one int is not
specified, portable use of multi-character constants is difficult.
So it appears the 'ab' multi-character constant format should be avoided in general.
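To see this concretely, here is a minimal sketch (assuming an implementation that, like the questioner's, packs 'ab' with 'a' in the high byte; the packing of multi-character constants is implementation-defined) that inspects the two bytes of the constant:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* 'ab' is a multi-character constant; how it is packed into the int is
       implementation-defined. This sketch assumes the common packing
       ('a' << 8) | 'b', as on the questioner's implementation. */
    uint16_t chars = 'ab';
    unsigned char *p = (unsigned char *)&chars;

    /* On a little-endian machine the least significant byte comes first,
       so p[0] is 'b' and p[1] is 'a'. */
    printf("p[0] = %c, p[1] = %c\n", p[0], p[1]);
    return 0;
}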

It depends on the system you're compiling and running your program on.
Obviously, on your system the short value is stored in memory as 0x6261 (ba): the little-endian way.
When you ask to decode a string, printf reads the value you have stored in memory byte by byte, which is actually 'b', then 'a'. Hence your result.

Multicharacter character literals are implementation-defined:
C99 6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined."
gcc and icl print ba on Windows 7. tcc prints a and drops the second letter altogether...

The answer to your question can be found in your tags: Endianness. On a little endian machine the least significant byte is stored first. This is a convention and does not affect efficiency at all.
Of course, this means that you cannot simply cast it to a character string: a character string has no notion of significant bytes, just a sequence, so the characters come out in the wrong order.
If you want to view the bytes within your variable, I suggest using a debugger that can read the actual bytes.
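If a debugger isn't handy, a small sketch like the following can also reveal the byte order; it simply stores a value with distinct bytes and prints them in memory order:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Store a value whose bytes are all distinct, then look at the first byte. */
    uint32_t value = 0x01020304;
    const unsigned char *bytes = (const unsigned char *)&value;

    if (bytes[0] == 0x04)
        puts("little-endian: least significant byte stored first");
    else if (bytes[0] == 0x01)
        puts("big-endian: most significant byte stored first");

    for (size_t i = 0; i < sizeof value; i++)
        printf("%02X ", bytes[i]);   /* e.g. "04 03 02 01" on x86 */
    putchar('\n');
    return 0;
}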

Related

How do I compare single multibyte character constants cross-platform in C?

In my previous post I found a solution to do this using C++ strings, but I wonder if there would be a solution using char's in C as well.
My current solution uses str.compare() and size() of a character string as seen in my previous post.
Now, since I only use one (multibyte) character in the std::string, would it be possible to achieve the same using a char?
For example, if( str[i] == '¶' )? How do I achieve that using char's?
(edit: made a typo on SO in the comparison operator, as pointed out in the comments)
How do I compare single multibyte character constants cross-platform in C?
You seem to mean an integer character constant expressed using a single multibyte character. The first thing to recognize, then, is that in C, integer character constants (examples: 'c', '¶') have type int, not char. The primary relevant section of C17 is paragraph 6.4.4.4/10:
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
(Emphasis added.)
Note well that "implementation defined" implies limited portability from the get-go. Even if we rule out implementations defining perverse behavior, we still have alternatives such as
the implementation rejects integer character constants containing multibyte source characters; or
the implementation rejects integer character constants that do not map to a single-byte execution character; or
the implementation maps source multibyte characters via a bytewise identity mapping, regardless of the byte sequence's significance in the execution character set.
That is not an exhaustive list.
You can certainly compare integer character constants with each other, but if they map to multibyte execution characters then you cannot usefully compare them to individual chars.
Inasmuch as your intended application appears to be to locate individual multibyte characters in a C string, the most natural thing to do appears to be to implement a C analog of your C++ approach, using the standard strstr() function. Example:
#include <stdio.h>
#include <string.h>

int main(void) {
    char str[] = "Some string ¶ some text ¶ to see";
    char char_to_compare[] = "¶";
    int char_size = sizeof(char_to_compare) - 1; // don't count the string terminator

    for (char *location = strstr(str, char_to_compare);
         location;
         location = strstr(location + char_size, char_to_compare)) {
        puts("Found!");
    }
    return 0;
}
That will do the right thing in many cases, but it still might be wrong for some characters in some execution character encodings, such as those encodings featuring multiple shift states.
If you want robust handling for characters outside the basic execution character set, then you would be well advised to take control of the in-memory encoding, and to perform appropriate conversions to, operations on, and conversions from that encoding. This is largely what ICU does, for example.
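As a rough illustration of taking control of the encoding without reaching for ICU, here is a minimal sketch using the standard wide-character facilities (it assumes the locale taken from the environment can decode the source string, and that '¶' is representable as a single wchar_t):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void) {
    /* Adopt the environment's locale so mbstowcs() knows the multibyte encoding. */
    setlocale(LC_ALL, "");

    const char *str = "Some string ¶ some text ¶ to see";
    wchar_t wbuf[128];

    /* Convert the whole multibyte string into wide characters, one character
       per wchar_t element (assuming it fits and the conversion succeeds). */
    size_t n = mbstowcs(wbuf, str, sizeof wbuf / sizeof wbuf[0]);
    if (n == (size_t)-1)
        return 1;   /* invalid multibyte sequence for the current locale */

    for (size_t i = 0; i < n; i++) {
        if (wbuf[i] == L'¶')   /* a single-element comparison is now meaningful */
            puts("Found!");
    }
    return 0;
}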
I believe you meant something like this:
char a = '¶';
char b = '¶';
if (a == b) /*do something*/;
The above may or may not work. If the value of '¶' is bigger than the char range, it will overflow, causing a and b to store a different value than that of '¶'. Regardless of which value they hold, they may well both end up holding the same value.
Remember, the char type is simply a single-byte-wide (typically 8-bit) integer, so in order to work with multibyte characters and avoid overflow you just have to use a wider integer type (short, int, long...).
short a = '¶';
short b = '¶';
if (a == b) /*do something*/;
From personal experience, I've also noticed that sometimes your environment may try to use a different character encoding than what you need. For example, trying to print the 'á' character will actually produce something else.
unsigned char x = 'á';
putchar(x); //actually prints character 'ß' in console.
putchar(160); //will print 'á'.
This happens because the console uses an Extended ASCII encoding, while my coding environment actually uses Unicode, parsing a value of 225 for 'á' instead of the value of 160 that I want.

Are multi-character character constants valid in C? Maybe in MS VC?

While reviewing some WINAPI code intended to compile in MS Visual C++, I found the following (simplified):
char buf[4];
// buf gets filled ...
switch ((buf[0] << 8) + buf[1]) {
  case 'CT':
    /* ... */
  case 'SY':
    /* ... */
  default:
    break;
}
Assuming 16 bit chars, I can understand why the shift of buf[0] and addition of buf[1]. What I don't gather is how the comparisons in the case clauses are intended to work.
I don't have access to Visual C++ and, of course, those yield multi-character character constant [-Wmultichar] warnings on gcc/MingW.
This is a non-portable way of storing more than one char in one int. The comparison then happens on the int values, as usual.
Note: think of the final int value as the concatenation of the ASCII values of the individual chars.
Following the wiki article, (emphasis mine)
[...] Multi-character constants (e.g. 'xy') are valid, although rarely useful — they let one store several characters in an integer (e.g. 4 ASCII characters can fit in a 32-bit integer, 8 in a 64-bit one). Since the order in which the characters are packed into an int is not specified, portable use of multi-character constants is difficult.
Related, C11, chapter §6.4.4.4/p10
An integer character constant has type int. The value of an integer character constant
containing a single character that maps to a single-byte execution character is the
numerical value of the representation of the mapped character interpreted as an integer.
The value of an integer character constant containing more than one character (e.g.,
'ab'), or containing a character or escape sequence that does not map to a single-byte
execution character, is implementation-defined. [....]
Yes, they are valid; their type is int and their value is implementation-defined.
From C11 draft, 6.4.4.4p10:
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. The
value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined.
(emphasis added)
GCC is being cautious, and warns to let you know in case you have used it unintentionally.
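If you want to keep the readable two-letter case labels without relying on the implementation-defined packing of 'CT', one option (a sketch, assuming the original code intends buf[0] in the high byte) is to spell the packed value out with shifts on both sides of the switch:

#include <stdio.h>

int main(void) {
    char buf[4] = { 'C', 'T' };   /* pretend buf was filled from elsewhere */

    /* Pack the two bytes explicitly: buf[0] in the high byte, buf[1] in the low byte. */
    int key = ((unsigned char)buf[0] << 8) + (unsigned char)buf[1];

    switch (key) {
    case ('C' << 8) + 'T':   /* the portable spelling of the intent behind case 'CT': */
        puts("CT");
        break;
    case ('S' << 8) + 'Y':
        puts("SY");
        break;
    default:
        break;
    }
    return 0;
}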

Difference between binary zeros and ASCII character zero

gcc (GCC) 4.8.1
c89
Hello,
I was reading a book about pointers. And using this code as a sample:
memset(buffer, 0, sizeof buffer);
will fill the buffer with binary zeros and not the character zero.
I am just wondering what is the difference between the binary and the character zero. I thought it was the same thing.
I know that textual data is human readable characters and binary data is non-printable characters. Correct me if I am wrong.
What would be a good example of binary data?
As an additional example: if you want to write data to a file, you should use fprintf for strings (textual data) and fwrite for binary data.
Many thanks for any suggestions,
The quick answer is that the character '0' is represented in binary data by the ASCII number 48. That means, when you want the character '0', the file actually has these bits in it: 00110000. Similarly, the printable character '1' has a decimal value of 49, and is represented by the byte 00110001. ('A' is 65, and is represented as 01000001, while 'a' is 97, and is represented as 01100001.)
If you want the null terminator at the end of the string, '\0', that actually has a 0 decimal value, and so would be a byte of all zeroes: 00000000. This is truly a 0 value. To the compiler, there is no difference between
memset(buffer, 0, sizeof buffer);
and
memset(buffer, '\0', sizeof buffer);
The only difference is a semantic one to us. '\0' tells us that we're dealing with a character, while 0 simply tells us we're dealing with a number.
It would help you tremendously to check out an ascii table.
fprintf outputs data using ASCII and outputs strings. fwrite writes pure binary data. If you fprintf(fp, "0"), it will put the value 48 in fp, while if you fwrite(fd, 0) it will put the actual value of 0 in the file. (Note: my usage of fprintf and fwrite here is obviously not proper usage, but it shows the point.)
Note: My answer refers to ASCII because it's one of the oldest, best known character sets, but as Eric Postpichil mentions in the comments, the C standard isn't bound to ASCII. (In fact, while it does occasionally give examples using ASCII, the standard seems to go out of its way to never assume that ASCII will be the character set used.). fprintf outputs using the execution character set of your compiled program.
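A minimal sketch of the difference (the value 48 assumes an ASCII execution character set):

#include <stdio.h>
#include <string.h>

int main(void) {
    char a[4], b[4];

    memset(a, 0, sizeof a);     /* binary zeros: every byte is 00000000 */
    memset(b, '0', sizeof b);   /* character zeros: every byte is 48 (00110000 in ASCII) */

    printf("a[0] = %d, b[0] = %d\n", a[0], b[0]);   /* prints: a[0] = 0, b[0] = 48 */

    /* a is a valid (empty) string because its first byte is the null terminator;
       b contains no zero byte at all, so it is not a terminated string. */
    return 0;
}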
If you are asking about the difference between '0' and 0, these two are completely different:
Binary zero corresponds to a non-printable character \0 (also called the null character), with the code of zero. This character serves as null terminator in C string:
5.2.1.2 A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
ASCII character zero '0' is printable (not surprisingly, producing a character zero when printed) and has a decimal code of 48.
Binary zero: 0
Character zero: '0', which in ASCII is 48.
binary data: the raw data that the cpu gets to play with, bit after bit, the stream of 0s and 1s (usually organized in groups of 8, aka Bytes, or multiples of 8)
character data: bytes interpreted as characters. Conventions like ASCII give the rules how a specific bit sequence should be displayed by a terminal, a printer, ...
for example, the binary data (bit sequence ) 00110000 should be displayed as 0
If I remember correctly, the unsigned integer datatypes have a direct match between the binary value of the stored bits and the interpreted value (ignoring strangeness like endianness ^^).
On a higher level, for example when talking about FTP transfer, the distinction is made between:
the data should be interpreted as (multi)byte characters, aka text (this includes non-character signs like a line break)
the data is a big bit/bytestream, that can't be broken down in smaller human readable bits, for example an image or a compiled executable
In the system every character has a code, and the ASCII code of the zero character is 0x30 (hex).
To fill this buffer with the zero character you must write:
memset(buffer, 0x30, sizeof buffer)

Char - ASCII relation

A char in the C programming language is a fixed-size byte entity designed specifically to be large enough to store a character value from an encoding such as ASCII.
But to what extent are the integer values relating to ASCII encoding interchangeable with the char characters? Is there any way to refer to 'A' as 65 (decimal)?
getchar() returns an integer - presumably this relates directly to such values? Also, if I am not mistaken, it is possible in certain contexts to increment chars ... such that (roughly speaking) '?'+1 == '#'.
Or is such encoding not guaranteed to be ASCII? Does it depend entirely upon the particular environment? Is such manipulation of chars impractical or impossible in C?
Edit: Relevant: C comparison char and int
I am answering just the question about incrementing characters, since the other issues are addressed in other answers.
The C standard guarantees that '0' to '9' are consecutive, so you can increment a digit character (except '9') and get the next digit character, or do other arithmetic with them (C 1999 5.2.1 3).
The relationships between other characters are not guaranteed by the C standard, so you would need documentation from your specific C implementation (primarily the compiler) regarding this.
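A short example of what that guarantee buys you:

#include <stdio.h>

int main(void) {
    char digit = '3';

    printf("%c\n", digit + 1);     /* prints 4: '0'..'9' are guaranteed consecutive */
    printf("%d\n", digit - '0');   /* prints 3: digit character converted to its numeric value */
    return 0;
}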
But to what extent are the integer values relating to ASCII encoding interchangeable with the char characters? Is there any way to refer to 'A' as 65 (decimal)?
In fact, you can't do anything else. char is just an integral type, and if you write
char ch = 'A';
then (assuming ASCII), ch will merely hold the integer value 65 - presenting it to the user is a different problem.
Or is such encoding not guaranteed to be ASCII?
No, it isn't. C doesn't rely on any specific character encoding.
Does it depend entirely upon the particular environment?
Yes, pretty much.
Is such manipulation of chars impractical or impossible in C?
No, you just have to be careful and know the standard quite well - then you'll be safe.
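A small sketch of that interchangeability (the printed values assume an ASCII-based implementation, which the standard does not guarantee):

#include <stdio.h>

int main(void) {
    char ch = 'A';

    printf("%d\n", ch);   /* prints 65 on an ASCII-based implementation */
    printf("%c\n", 65);   /* prints A on an ASCII-based implementation */

    if (ch == 65)         /* works here only because this implementation uses ASCII */
        puts("'A' is encoded as 65 on this implementation");
    return 0;
}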
Character literals like 'A' have type int; they are completely interchangeable with their integer value. However, that integer value is not mandated by the C standard; it might be ASCII (and is for the vast majority of common implementations) but need not be; it is implementation-defined. The mapping of integer values for characters does have one guarantee given by the Standard: the values of the decimal digits are contiguous (i.e., '1' - '0' == 1, ... '9' - '0' == 9).
Where the source code has 'A', the compiled object will just have the byte value instead. That's why it is allowed to do arithmetic with bytes (the type of 'A' is char, i.e. byte).
Of course, a character encoding (more accurately, a code page) must be applied to get that byte value, and that codepage would serve as the "native" encoding of the compiler for hard-coded strings and char values.
Loosely, you could think of char and string literals in C source as essentially being macros. On an ASCII system the "macro" 'A' would resolve to (char) 65, and on an EBCDIC system to (char) 193. Similarly, C strings compile down to zero-terminated arrays of chars (bytes). This logic affects the symbol table also, since the symbols are taken from the source in its native encoding.
So no, ASCII is not the only possibility for the encoding of literals in source code. But due to the restriction of single-quoted characters being chars, there is a guarantee that UTF-16 or other multi-byte encodings are excluded.

What does \x mean in C/C++?

Example:
char arr[] = "\xeb\x2a";
BTW, are the following the same?
"\xeb\x2a" vs. '\xeb\x2a'
\x indicates a hexadecimal character escape. It's used to specify characters that aren't typeable (like a null '\x00').
And "\xeb\x2a" is a literal string (type is char *, 3 bytes, null-terminated), and '\xeb\x2a' is a character constant (type is int, 2 bytes, not null-terminated, and is just another way to write 0xEB2A or 60202 or 0165452). Not the same :)
As others have said, the \x is an escape sequence that starts a "hexadecimal-escape-sequence".
Some further details from the C99 standard:
When used inside a set of single-quotes (') the characters are part of an "integer character constant" which is (6.4.4.4/2 "Character constants"):
a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'.
and
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.
So the sequence in your example of '\xeb\x2a' is an implementation defined value. It's likely to be the int value 0xeb2a or 0x2aeb depending on whether the target platform is big-endian or little-endian, but you'd have to look at your compiler's documentation to know for certain.
When used inside a set of double-quotes (") the characters specified by the hex-escape-sequence are part of a null-terminated string literal.
From the C99 standard 6.4.5/3 "String literals":
The same considerations apply to each element of the sequence in a character string literal or a wide string literal as if it were in an integer character constant or a wide character constant, except that the single-quote ' is representable either by itself or by the escape sequence \', but the double-quote " shall be represented by the escape sequence \".
Additional info:
In my opinion, you should avoid using 'multi-character' constants. There are only a few situations where they provide any value over using a regular, old int constant. For example, '\xeb\x2a' could be more portably specified as 0xeb2a or 0x2aeb, depending on what value you really wanted.
One area that I've found multi-character constants to be of some use is to come up with clever enum values that can be recognized in a debugger or memory dump:
enum CommandId {
    CMD_ID_READ  = 'read',
    CMD_ID_WRITE = 'writ',
    CMD_ID_DEL   = 'del ',
    CMD_ID_FOO   = 'foo '
};
There are few portability problems with the above (other than platforms that have small ints or warnings that might be spewed). Whether the characters end up in the enum values in little- or big-endian form, the code will still work (unless you're doing something else unholy with the enum values). If the characters end up in the value using an endianness that wasn't what you expected, it might make the values less easy to read in a debugger, but the 'correctness' isn't affected.
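For instance, here is a rough sketch of how such an enum value shows up when you dump its bytes, which is the effect the trick relies on (the exact byte order is implementation-defined):

#include <stdio.h>
#include <string.h>

enum CommandId { CMD_ID_READ = 'read' };   /* multi-character constant, -Wmultichar territory */

int main(void) {
    enum CommandId id = CMD_ID_READ;
    unsigned char bytes[sizeof id];

    /* Copy the object representation so we can see how the four characters
       were packed; the order depends on packing and endianness. */
    memcpy(bytes, &id, sizeof id);

    for (size_t i = 0; i < sizeof id; i++)
        putchar(bytes[i]);   /* typically prints "read" or "daer" */
    putchar('\n');
    return 0;
}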
When you say:
BTW, are these the same:
"\xeb\x2a" vs '\xeb\x2a'
They are in fact not. The first creates a character string literal, terminated with a zero byte, containing the two characters whose hex representation you provide. The second creates an integer constant.
It's an escape sequence indicating that the characters that follow give a character code in hexadecimal.
http://www.austincc.edu/rickster/COSC1320/handouts/escchar.htm
The \x means it's a hex character escape. So \xeb would mean character eb in hex, or 235 in decimal. See http://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx for more information.
As for the second, no, they are not the same. The double-quotes, ", means it's a string of characters, a null-terminated character array, whereas a single quote, ', means it's a single character, the byte that character represents.
\x allows you to specify the character by its hexadecimal code.
This allows you to specify characters that are normally not printable (some of which have special escape sequences predefined, such as '\n' = newline, '\t' = tab, '\b' = backspace).
A useful website is here.
And I quote:
x Unsigned hexadecimal integer
That way, your \xeb is like 235 in decimal.
