Special char Literals - c

I want to assign a char with a character literal, but it's a special character, say 255 or 13. I know that I can assign my char with a literal int that will be converted to a char: char a = 13; I also know that Microsoft will let me use the hex code as a char literal: char a = '\xd';
I want to know if there's a way to do this that gcc supports also.

Writing something like
char ch = 13;
is mostly portable; it does what you expect on any platform where the value 13 means the same thing as on yours, which is every system that uses the ASCII character set (most systems today).
There may be platforms on which 13 can mean something else. However, using '\r' instead should always be portable, no matter the character encoding system.
Using other values, which do not have character-literal equivalents, is not portable. And using values above 127 is even less portable, since then you're outside the ASCII table and into the extended range, where the characters depend on the locale settings of the system. For example, western European and eastern European language settings will most likely assign different characters to the 128 to 255 range.
If you want a byte that holds plain binary data rather than letters, then instead of char you might want to use e.g. uint8_t, to tell other readers of your code that the variable holds binary data, not text.

The hexadecimal escape sequence is not specific to Microsoft. It's part of both C and C++: http://en.cppreference.com/w/cpp/language/escape
Meaning that this assignment of a hexadecimal value to a char is cross-platform code:
char a = '\xD';
The question already demonstrates assigning a decimal number to a char:
char a = 13;
And octal numbers can be assigned as well, using only the backslash escape:
char a = '\023';
Incidentally, '\0' is commonly used in C and C++ to represent the null character (independent of platform). '\0' is not a special character that gets escaped, though; it's actually invoking the octal escape sequence.


How do I compare single multibyte character constants cross-platform in C?

In my previous post I found a solution to do this using C++ strings, but I wonder if there would be a solution using char's in C as well.
My current solution uses str.compare() and size() of a character string as seen in my previous post.
Now, since I only use one (multibyte) character in the std::string, would it be possible to achieve the same using a char?
For example, if( str[i] == '¶' )? How do I achieve that using char's?
(edit: fixed a typo on SO in the comparison operator, as pointed out in the comments)
How do I compare single multibyte character constants cross-platform in C?
You seem to mean an integer character constant expressed using a single multibyte character. The first thing to recognize, then, is that in C, integer character constants (examples: 'c', '¶') have type int, not char. The primary relevant section of C17 is paragraph 6.4.4.4/10:
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
(Emphasis added.)
Note well that "implementation defined" implies limited portability from the get-go. Even if we rule out implementations defining perverse behavior, we still have alternatives such as
the implementation rejects integer character constants containing multibyte source characters; or
the implementation rejects integer character constants that do not map to a single-byte execution character; or
the implementation maps source multibyte characters via a bytewise identity mapping, regardless of the byte sequence's significance in the execution character set.
That is not an exhaustive list.
You can certainly compare integer character constants with each other, but if they map to multibyte execution characters then you cannot usefully compare them to individual chars.
Inasmuch as your intended application appears to be to locate individual multibyte characters in a C string, the most natural thing to do appears to be to implement a C analog of your C++ approach, using the standard strstr() function. Example:
char str[] = "Some string ¶ some text ¶ to see";
char char_to_compare[] = "¶";
int char_size = sizeof(char_to_compare) - 1; // don't count the string terminator
for (char *location = strstr(str, char_to_compare);
     location;
     location = strstr(location + char_size, char_to_compare)) {
    puts("Found!");
}
That will do the right thing in many cases, but it still might be wrong for some characters in some execution character encodings, such as those encodings featuring multiple shift states.
If you want robust handling for characters outside the basic execution character set, then you would be well advised to take control of the in-memory encoding, and to perform appropriate conversions to, operations on, and conversions from that encoding. This is largely what ICU does, for example.
I believe you meant something like this:
char a = '¶';
char b = '¶';
if (a == b) /*do something*/;
The above may or may not work. If the value of '¶' is bigger than the char range, then it will overflow, causing a and b to store a different value than that of '¶'. Regardless of which value they end up holding, though, they may well both have the same value.
Remember, the char type is simply a narrow (usually 8-bit) integer, so in order to work with multibyte characters and avoid overflow you just have to use a wider integer type (short, int, long...).
short a = '¶';
short b = '¶';
if (a == b) /*do something*/;
From personal experience, I've also noticed that sometimes your environment may use a different character encoding than the one you need. For example, trying to print the 'á' character may actually produce something else.
unsigned char x = 'á';
putchar(x); //actually prints character 'ß' in console.
putchar(160); //will print 'á'.
This happens because the console uses an Extended ASCII encoding, while my coding environment actually uses Unicode, parsing a value of 225 for 'á' instead of the value of 160 that I want.

Internals binary saving of C chars

Hey, I stumbled across something pretty weird while programming. I tried to transform a UTF-8 character into a hexadecimal byte representation like 0x89 or 0xff.
char test[3] = "ü";
for (int x = 0; x < 3; x++) {
    printf("%x\n", test[x]);
}
And I get the following output:
ffffffc3
ffffffbc
0
I know that C uses one byte of data for every char, and therefore if I want to store a non-ASCII char like "ü", it counts as 2 chars.
Transforming ASCII chars is no problem, but once I get to non-ASCII chars (from German to Chinese), instead of getting outputs like 0xc3 and 0xbc, C adds 0xFFFFFF00 to them.
I know that I can just do something like & 0xFF to fix that weird representation, but I can't wrap my head around why it keeps happening in the first place.
C allows type char to behave either as a signed type or as an unsigned type, as the C implementation chooses. You are observing the effect of it being a signed type, which is pretty common. When the char value of test[x] is passed to printf, it is promoted to type int, in value-preserving manner. When the value is negative, that involves sign-extension, whose effect is exactly what you describe. To avoid that, add an explicit cast to unsigned char:
printf("%x\n", (unsigned char) test[x]);
Note also that C itself does not require any particular characters outside the 7-bit ASCII range to be supported in source code, and it does not specify the execution-time encoding with which ordinary string contents are encoded. It is not safe to assume UTF-8 will be the execution character set, nor to assume that all compilers will accept UTF-8 source code, or will default to assuming that encoding even if they do support it.
The encoding of source code is a matter you need to sort out with your implementation, but if you are using at least C11 then you can ensure execution-time UTF-8 encoding for specific string literals by using UTF-8 literals, which are prefixed with u8:
char test[3] = u8"ü";
Be aware also that UTF-8 code sequences can be up to four bytes long, and most of the characters in the basic multilingual plane require 3. The safest way to declare your array, then, would be to let the compiler figure out the needed size:
// better
char test[] = u8"ü";
... and then to use sizeof to determine the size chosen:
for (size_t x = 0; x < sizeof(test); x++) {
// ...

Purpose of using octal for ASCII

Why would a C programmer use escape sequences (oct/hex) for ASCII values rather than decimal?
Follow up: does this have to do with either performance or portability?
Example:
char c = '\075';
You use octal or hexadecimal because there isn't a way to specify decimal codes inside a character literal or string literal. Octal was prevalent in PDP-11 code. These days, it probably makes more sense to use hexadecimal, though '\0' is more compact than '\x0' (so use '\0' when you null terminate a string, etc.).
Also, beware that "\x0ABad choice" doesn't have the meaning you might expect, whereas "\012007 wins" probably does. (The difference is that a hex escape runs on until it comes across a non-hex digit, whereas octal escapes stop after 3 digits at most. To get the expected result, you'd need "\x0A" "Bad choice" using 'adjacent string literal concatenation'.)
And this has nothing to do with performance and very little to do with portability. Writing '\x41' or '\101' instead of 'A' is a way of decreasing the portability and readability of your code. You should only consider using escape sequences when there isn't a better way to represent the character.
No, it does not have anything to do with performance or portability. It is just one convenient way to define character literals and to use them in string literals, especially for non-printable characters.
It has nothing to do with performance nor portability. In fact, you don't need any codes at all, instead of this:
char c = 65;
You can simply write:
char c = 'A';
But some characters are not so easy to type, e.g. ASCII SOH, so you might write:
char c = 1; // SOH
Or any other form, hexadecimal, octal, depending on your preference.
It has nothing to do with performance nor with portability. It is simply that the ASCII character set (as are its derivatives, up through Unicode) is organized in bytes and bits. For example, the first 32 characters are the control characters, and 32 = 040 = 0x20; the ASCII code of 'A' is 65 = 0101 = 0x41 and that of 'a' is 97 = 0141 = 0x61; the ASCII code of '0' is 48 = 060 = 0x30.
I don't know about you, but for me 0x30 and 0x41 are easier to remember and use in manual operations than 48 and 65.
By the way, a byte represents exactly all values between 0 and 255, that is, between 0 and 0xFF...
I didn't know this works.
But I immediately got a pretty useful idea for it.
Imagine you have a low-memory environment and have to use a permission system like the Unix folder permissions.
Let's say there are 3 groups, and for each group 2 different options which can be allowed or denied.
0 means neither option,
1 means the first option allowed,
2 means the second option allowed and
3 means both allowed.
To store the permissions you could do it like:
char* bar = "213"; // first group has the second option allowed, the second group has the first option allowed, and the third has full access.
But that takes four bytes of storage for the information.
Of course you could just convert that to decimal notation. But that's less readable.
But now that I know this...
doing:
char bar = '\213'; // octal escapes take at most 3 digits, so no leading zero here
is pretty readable and also saves memory!
I love it :D

Char - ASCII relation

A char in the C programming language is a fixed-size byte entity designed specifically to be large enough to store a character value from an encoding such as ASCII.
But to what extent are the integer values relating to ASCII encoding interchangeable with the char characters? Is there any way to refer to 'A' as 65 (decimal)?
getchar() returns an integer - presumably this relates directly to such values? Also, if I am not mistaken, it is possible in certain contexts to increment chars ... such that (roughly speaking) '?'+1 == '#'.
Or is such encoding not guaranteed to be ASCII? Does it depend entirely upon the particular environment? Is such manipulation of chars impractical or impossible in C?
Edit: Relevant: C comparison char and int
I am answering just the question about incrementing characters, since the other issues are addressed in other answers.
The C standard guarantees that '0' to '9' are consecutive, so you can increment a digit character (except '9') and get the next digit character, or do other arithmetic with them (C 1999 5.2.1 3).
The relationships between other characters are not guaranteed by the C standard, so you would need documentation from your specific C implementation (primarily the compiler) regarding this.
But to what extent are the integer values relating to ASCII encoding interchangeable with the char characters? Is there any way to refer to 'A' as 65 (decimal)?
In fact, you can't do anything else. char is just an integral type, and if you write
char ch = 'A';
then (assuming ASCII), ch will merely hold the integer value 65 - presenting it to the user is a different problem.
Or is such encoding not guaranteed to be ASCII?
No, it isn't. C doesn't rely on any specific character encoding.
Does it depend entirely upon the particular environment?
Yes, pretty much.
Is such manipulation of chars impractical or impossible in C?
No, you just have to be careful and know the standard quite well - then you'll be safe.
Character literals like 'A' have type int; they are completely interchangeable with their integer value. However, that integer value is not mandated by the C standard; it might be ASCII (and is, for the vast majority of common implementations) but need not be; it is implementation-defined. The mapping of integer values for characters does have one guarantee given by the Standard: the values of the decimal digits are contiguous (i.e., '1' - '0' == 1, ..., '9' - '0' == 9).
Where the source code has 'A', the compiled object will just have the byte value instead. That's why you can do arithmetic with character constants (note that in C the type of 'A' is actually int, holding the byte's value).
Of course, a character encoding (more accurately, a code page) must be applied to get that byte value, and that codepage would serve as the "native" encoding of the compiler for hard-coded strings and char values.
Loosely, you could think of char and string literals in C source as essentially being macros. On an ASCII system the "macro" 'A' would resolve to (char) 65, and on an EBCDIC system to (char) 193. Similarly, C strings compile down to zero-terminated arrays of chars (bytes). This logic affects the symbol table also, since the symbols are taken from the source in its native encoding.
So no, ASCII is not the only possibility for the encoding of literals in source code. But because a single-quoted character constant must map to a single byte value, multi-byte encodings such as UTF-16 are excluded.

character type int

A character constant has type int in C.
Now suppose my machine's local character set is Windows Latin-1 ( http://www.ascii-code.com/), which is a 256-character set, so every char between single quotes, like 'x', is mapped to an int value between 0 and 255, right?
Suppose plain char is signed on my machine and consider the following code:
char ch = 'â';
if (ch == 'â')
{
    printf("ok");
}
Because of the integer promotion, ch will be promoted to a negative quantity of type int (because its leading bit is one), while 'â' is mapped to a positive quantity, so ok will not be printed.
But I'm sure I'm missing something; can you help?
Your C implementation has a notion of an execution character set. Moreover, if your program source code is read from a file (as it always is), the compiler has (or should have) a notion of a source character set. For example, in GCC you can tune those parameters on the command line. The combination of those two settings determines the integral value that is assigned to your literal â.
Actually, the initial assignment will not work as expected:
char ch = 'â';
There's an overflow here, and gcc will warn about it. Technically, this is undefined behavior, although for the very common single-byte char type, the behavior is predictable enough -- it's a simple integer overflow. Depending on your default character set, that's a multibyte character; I get decimal 50082 if I print it as an integer on my machine.
Furthermore, the comparison is invalid, again because char is too small to hold the value being compared, and again, a good compiler will warn about it.
ISO C defines wchar_t, a type wide enough to hold extended (i.e., non-ASCII) characters, along with wide character versions of many library functions. Code that must deal with non-ASCII text should use this wide character type as a matter of course.
In a case where char is signed:
When processing char ch = 'â', the compiler converts 'â' to 0xFFFFFFE2 and stores the low byte 0xE2 in ch. There is no overflow problem; the value is simply truncated and interpreted as signed.
When processing if (ch == 'â'), the compiler sign-extends ch (0xE2) back to int (0xFFFFFFE2) and compares it to 'â' (also 0xFFFFFFE2), so the condition will be true.
