Hey, I stumbled upon something pretty weird while programming. I tried to transform a UTF-8 character into its hexadecimal byte representation, like 0x89 or 0xff.
char test[3] = "ü";
for (int x = 0; x < 3; x++) {
    printf("%x\n", test[x]);
}
And I get the following output:
ffffffc3
ffffffbc
0
I know that C uses one byte of data for every char, and therefore if I want to store a character like "ü" it counts as 2 chars.
Transforming ASCII characters is no problem, but once I get to non-ASCII characters (from German to Chinese), instead of getting outputs like 0xc3 and 0xbc, C adds 0xFFFFFF00 to them.
I know that I can just do something like & 0xFF to fix that weird representation, but I can't wrap my head around why it keeps happening in the first place.
C allows type char to behave either as a signed type or as an unsigned type, as the C implementation chooses. You are observing the effect of it being a signed type, which is pretty common. When the char value of test[x] is passed to printf, it is promoted to type int in a value-preserving manner. When the value is negative, that involves sign extension, whose effect is exactly what you describe. To avoid that, add an explicit cast to unsigned char:
printf("%x\n", (unsigned char) test[x]);
Note also that C itself does not require any particular characters outside the 7-bit ASCII range to be supported in source code, and it does not specify the execution-time encoding with which ordinary string contents are encoded. It is not safe to assume UTF-8 will be the execution character set, nor to assume that all compilers will accept UTF-8 source code, or will default to assuming that encoding even if they do support it.
The encoding of source code is a matter you need to sort out with your implementation, but if you are using at least C11 then you can ensure execution-time UTF-8 encoding for specific string literals by using UTF-8 literals, which are prefixed with u8:
char test[3] = u8"ü";
Be aware also that UTF-8 code sequences can be up to four bytes long, and most of the characters in the basic multilingual plane require 3. The safest way to declare your array, then, would be to let the compiler figure out the needed size:
// better
char test[] = u8"ü";
... and then to use sizeof to determine the size chosen:
for (int x = 0; x < sizeof(test); x++) {
    // ...
}
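Putting those pieces together, a minimal complete sketch (assuming a C11 compiler and a UTF-8 execution encoding for the u8 literal) might look like this:
#include <stdio.h>
int main(void) {
    char test[] = u8"ü";                          /* the compiler sizes the array */
    for (size_t x = 0; x < sizeof(test); x++) {
        printf("%x\n", (unsigned char) test[x]);  /* prints c3, bc, 0 */
    }
    return 0;
}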
Related
In my previous post I found a solution to do this using C++ strings, but I wonder if there would be a solution using chars in C as well.
My current solution uses str.compare() and size() of a character string as seen in my previous post.
Now, since I only use one (multibyte) character in the std::string, would it be possible to achieve the same using a char?
For example, if( str[i] == '¶' )? How do I achieve that using chars?
(edit: fixed a typo in the comparison operator, as pointed out in the comments)
How do I compare single multibyte character constants cross-platform in C?
You seem to mean an integer character constant expressed using a single multibyte character. The first thing to recognize, then, is that in C, integer character constants (examples: 'c', '¶') have type int, not char. The primary relevant section of C17 is paragraph 6.4.4.4/10:
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
(Emphasis added.)
Note well that "implementation defined" implies limited portability from the get-go. Even if we rule out implementations defining perverse behavior, we still have alternatives such as
the implementation rejects integer character constants containing multibyte source characters; or
the implementation rejects integer character constants that do not map to a single-byte execution character; or
the implementation maps source multibyte characters via a bytewise identity mapping, regardless of the byte sequence's significance in the execution character set.
That is not an exhaustive list.
You can certainly compare integer character constants with each other, but if they map to multibyte execution characters then you cannot usefully compare them to individual chars.
Inasmuch as your intended application appears to be to locate individual multibyte characters in a C string, the most natural thing to do is to implement a C analog of your C++ approach, using the standard strstr() function. Example:
char str[] = "Some string ¶ some text ¶ to see";
char char_to_compare[] = "¶";
int char_size = sizeof(char_to_compare) - 1; // don't count the string terminator
for (char *location = strstr(str, char_to_compare);
     location;
     location = strstr(location + char_size, char_to_compare)) {
    puts("Found!");
}
That will do the right thing in many cases, but it still might be wrong for some characters in some execution character encodings, such as those encodings featuring multiple shift states.
If you want robust handling for characters outside the basic execution character set, then you would be well advised to take control of the in-memory encoding, and to perform appropriate conversions to, operations on, and conversions from that encoding. This is largely what ICU does, for example.
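For illustration only, here is a sketch of that idea using just the standard <wchar.h> conversion routines rather than ICU. It assumes the locale's multibyte encoding matches the string's encoding and that wchar_t values correspond to Unicode code points, which is common but not guaranteed:
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
int main(void) {
    setlocale(LC_ALL, "");                    /* adopt the environment's locale and encoding */
    const char *str = "Some string ¶ some text ¶ to see";
    const char *p = str;
    size_t remaining = strlen(str) + 1;       /* include the terminating '\0' */
    mbstate_t state;
    memset(&state, 0, sizeof state);
    for (;;) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, p, remaining, &state);
        if (len == 0 || len == (size_t) -1 || len == (size_t) -2)
            break;                            /* end of string, or a decoding error */
        if (wc == L'\u00B6')                  /* U+00B6 PILCROW SIGN */
            puts("Found!");
        p += len;
        remaining -= len;
    }
    return 0;
}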
I believe you meant something like this:
char a = '¶';
char b = '¶';
if (a == b) /*do something*/;
The above may or may not work: if the value of '¶' is bigger than the char range, it will overflow, causing a and b to store a different value than that of '¶'. Regardless of which value they hold, though, they may well both end up holding the same value.
Remember, the char type is simply a single-byte-wide (8-bit) integer, so in order to work with multibyte characters and avoid overflow you just have to use a wider integer type (short, int, long...).
short a = '¶';
short b = '¶';
if (a == b) /*do something*/;
From personal experience, I've also noticed that sometimes your environment may try to use a different character encoding than what you need. For example, trying to print the 'á' character will actually produce something else.
unsigned char x = 'á';
putchar(x); //actually prints character 'ß' in console.
putchar(160); //will print 'á'.
This happens because the console uses an Extended ASCII encoding, while my coding environment actually uses Unicode, parsing a value of 225 for 'á' instead of the value of 160 that I want.
In pure and portable C: I am having trouble casting what has to be a variable, from a variable, a char *nib, to an int hex. The idea is that I have char *nib = "ab"; (or "0xab", or anything that directly represents two characters as a char *), and I then cast it to an integer and write it to a file to get a one-to-one write. So I start with char *nib = "0xab";, write it out, presumably as an int, and in a hex dump or hex editor the result is just ab.
I've been able to do this by directly declaring a constant... but then the nib is always static.
This has to be one-to-one, starting with a two-character string or nib. Not converting anything, purely casting.
So can you write it directly to a file without converting it? Three lookup tables seems like a bit much for a value of that length.
There is no way to cast 2 characters (2 bytes) into one byte, because a cast does not change the binary representation of the value.
The closest you can get to casting a string that looks like hex to a value that shows something similar is to use the characters 0-15 via escape sequences, like char *nib = "\x0A\x0B";, cast that to a 2-byte value with *((short *) nib) (0x0A0B in this case), and store that to a file. (I'm not sure there is a portable 2-byte integer type: short is often 2 bytes wide but does not have to be.) Unfortunately, I don't think there is a portable way to store a 2-byte integer value to a file anyway, as different architectures may have different byte order.
Writing the string value character by character is likely the safest approach. Or convert the string to an int the usual way and use your own read/write code for integers to ensure portability.
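As a sketch of that last suggestion, here is one way to turn a two-digit hex string into a single byte with strtol() and write it out yourself (the file name is just a placeholder):
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    const char *nib = "ab";                                      /* two hex digits as text */
    unsigned char byte = (unsigned char) strtol(nib, NULL, 16);  /* value 0xAB */
    FILE *f = fopen("out.bin", "wb");                            /* placeholder file name */
    if (f == NULL)
        return 1;
    fputc(byte, f);                                              /* writes the single byte 0xAB */
    fclose(f);
    return 0;
}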
I have recently been reading The C Programming Language by Kernighan.
There is an example which defines a variable as type int but uses getchar() to store a value in it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think about is ASCII and UNICODE.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char variable, it will silently be promoted to int anyway. And that includes character constants like 'A'.
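A quick way to see this in C (C++ differs here): a character constant such as 'A' already has type int, so its size matches int rather than char:
#include <stdio.h>
int main(void) {
    printf("%zu %zu %zu\n", sizeof 'A', sizeof(int), sizeof(char));  /* typically "4 4 1" */
    return 0;
}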
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
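The canonical read loop therefore keeps the result in an int and only narrows to char after the EOF test, roughly like this:
#include <stdio.h>
int main(void) {
    int c;                        /* int, so EOF stays distinguishable from byte values */
    while ((c = getchar()) != EOF) {
        char ch = (char) c;       /* safe to narrow once we know it is not EOF */
        putchar(ch);
    }
    return 0;
}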
Storing an int into a char has implementation-defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem that could have been avoided by mandating the char type to be unsigned, but the C Standard allowed for the many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
getchar is an old C standard function, and the philosophy back then was closer to how the language gets translated to assembly than to type correctness and readability. Keep in mind that compilers were not optimizing code as much as they are today. In C, int is the default return type (i.e. if you don't have a declaration of a function in C, compilers will assume that it returns int), and returning a value is done using a register; therefore returning a char instead of an int actually generates additional implicit code to mask out the extra bytes of your value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>. Here is somebody else's limits.h.
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (that can be worked with by hardware); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number '1'. (For historical reasons, lots of control characters came first, so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining UC letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
If memory serves me right, you can't do arithmetic on a char. But, if you call it an int, you can. So, to convert all LC letters to UC, you can do something like:
char letter;
/* ... */
if (letter >= 'a' && letter <= 'z') {   /* letter is lower case (ASCII) */
    letter = (int) letter - 32;         /* 'a' - 32 == 'A' in ASCII */
}
Some (or most) C compilers would complain if you did not reinterpret the var as an int before adding/subtracting.
But, in the end, the type 'char' is really just another integer type, since ASCII assigns a unique integer to each letter.
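As a side note, a small sketch using the standard toupper() from <ctype.h>, which performs the same conversion without relying on ASCII's layout:
#include <ctype.h>
#include <stdio.h>
int main(void) {
    char letter = 'g';
    /* toupper() handles the case conversion; the unsigned char cast keeps
       the argument within the range the ctype functions require. */
    letter = (char) toupper((unsigned char) letter);
    printf("%c\n", letter);   /* prints G */
    return 0;
}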
I want to assign a char from a character literal, but it's a special character, say 255 or 13.
I know that I can assign my char with a literal int that will be converted to a char: char a = 13;
I also know that Microsoft will let me use the hex code as a character literal: char a = '\xd';
I want to know if there's a way to do this that gcc supports also.
Writing something like
char ch = 13;
is mostly portable, at least to platforms on which the value 13 means the same thing as on your platform (which is all systems that use the ASCII character set, and that is indeed most systems today).
There may be platforms on which 13 can mean something else. However, using '\r' instead should always be portable, no matter the character encoding system.
Using other values, which do not have character-literal equivalents, is not portable. And using values above 127 is even less portable, since then you're outside the ASCII table and into the extended ASCII table, where the letters can depend on the locale settings of the system. For example, Western European and Eastern European language settings will most likely have different characters in the 128 to 255 range.
If you want a byte that contains just some binary data and not a letter, then instead of using char you might want to use e.g. uint8_t, to tell other readers of your code that you're not using the variable for letters but for binary data.
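A tiny sketch of what that might look like (the buffer name is made up):
#include <stdint.h>
#include <stdio.h>
int main(void) {
    uint8_t payload[4] = { 0x0D, 0xFF, 0x00, 0x7F };   /* clearly raw bytes, not text */
    for (size_t i = 0; i < sizeof payload; i++)
        printf("%02x ", payload[i]);
    printf("\n");
    return 0;
}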
The hexadecimal escape sequence is not specific to Microsoft. It's part of C/C++: http://en.cppreference.com/w/cpp/language/escape
Meaning that to assign a hexadecimal number to a char, this is cross-platform code:
char a = '\xD';
The question already demonstrates assigning a decimal number to a char:
char a = 13;
And octal numbers can be assigned as well, using just the escape sequence:
char a = '\023';
Incidentally, '\0' is commonly used in C/C++ to represent the null character (independent of platform). '\0' is not a special escaped character; it is actually invoking the octal escape sequence.
I read that C does not define whether a char is signed or unsigned, and the GCC page says that it can be signed on x86 and unsigned on PowerPC and ARM.
OK, I'm writing a program with GLIB that defines char as gchar (nothing more than that, just a way of standardizing).
My question is, what about UTF-8? Does it use more than one block (byte) of memory?
Say that I have a variable
unsigned char *string = "My string with UTF8 enconding ~> çã";
See, if I declare my variable as
unsigned
will I have only 127 values (so my program will have to store more blocks of memory), or do the UTF-8 values become negative too?
Sorry if I can't explain it correctly, but I think it is a bit complex.
NOTE:
Thanks for all the answers.
I don't understand how it is interpreted normally.
I think that, as with ASCII, if I have a signed and an unsigned char in my program, the strings have different values, and that leads to confusion; imagine that with UTF-8, then.
I've had a couple of requests to explain a comment I made.
The fact that a char type can default to either a signed or unsigned type can be significant when you're comparing characters and expect a certain ordering. In particular, UTF8 uses the high bit (assuming that char is an 8-bit type, which is true in the vast majority of platforms) to indicate that a character code point requires more than one byte to be represented.
A quick and dirty example of the problem:
#include <stdio.h>
int main( void)
{
    signed char flag = 0xf0;      /* -16 on a typical 8-bit signed char */
    unsigned char uflag = 0xf0;   /* 240 */
    if (flag < (signed char) 'z') {
        printf( "flag is smaller than 'z'\n");
    }
    else {
        printf( "flag is larger than 'z'\n");
    }
    if (uflag < (unsigned char) 'z') {
        printf( "uflag is smaller than 'z'\n");
    }
    else {
        printf( "uflag is larger than 'z'\n");
    }
    return 0;
}
On most projects that I work on, the unadorned char type is typically avoided in favor of a typedef that explicitly specifies an unsigned char. Something like uint8_t from stdint.h, or
typedef unsigned char u8;
Generally, dealing with an unsigned char type seems to work well and cause few problems; the one area where I have seen occasional problems is when using something of that type to control a loop:
while (uchar_var-- >= 0) {
    // infinite loop: an unsigned value is always >= 0
}
Two things:
Whether a char type is signed or unsigned won't affect your ability to translate UTF8-encoded-strings to and from whatever display string type you're using (WCHAR or whatnot). Don't worry about it, in other words: the UTF8 bytes are just bytes, and whatever you're using as an encoder/decoder will do the right thing.
Some of your confusion may be that you're trying to do this:
unsigned char *string = "This is a UTF8 string";
Don't do this; you're mixing different concepts. A UTF-8 encoded string is just a sequence of bytes. C string literals (as above) were not really designed to represent this; they're designed to represent "ASCII-encoded" strings. Although in some cases (like mine here) they end up being the same thing, in your example in the question they may not. And certainly in other cases they won't be. Load your Unicode strings from an external resource. In general I'd be wary of embedding non-ASCII characters in a .c source file; even if the compiler knows what to do with them, other software in your toolchain may not.
Using unsigned char has its pros and cons. The biggest benefits are that you don't get sign extension or other funny features such as signed overflow that would produce unexpected results from calculations. Unsigned char is also compatible with <cctype> macros/functions such as isalpha(ch) (all these require values in unsigned char range). On the other hand, all I/O functions require char*, requiring you to cast whenever you do I/O.
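A sketch of the cast this paragraph alludes to, using C's <ctype.h> (the counterpart of <cctype>); whether the byte counts as alphabetic depends on the locale:
#include <ctype.h>
#include <stdio.h>
int main(void) {
    char ch = (char) 0xE4;                 /* 'ä' in Latin-1; negative if char is signed */
    /* The ctype functions require a value in unsigned char range (or EOF),
       so convert through unsigned char before calling them. */
    if (isalpha((unsigned char) ch))
        puts("alphabetic in this locale");
    else
        puts("not alphabetic in this locale");
    return 0;
}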
As for UTF-8, storing it in signed or unsigned arrays is fine but you have to be careful with those string literals as there is little guarantee about them being valid UTF-8. C++0x adds UTF-8 string literals to avoid possible issues and I would expect the next C standard to adopt those as well.
In general you should be fine, though, as long as you make sure that your source code files are always UTF-8 encoded.
Signed / unsigned affects only arithmetic operations. If char is unsigned then the higher values will be positive; in the signed case they will be negative. But the range is still the same size.
Not really, unsigned / signed does not specify how many values a variable can hold. It specifies how they are interpreted.
So, an unsigned char has the same number of values as a signed char, except that one has negative numbers and the other doesn't. It is still 8 bits (if we assume that a char holds 8 bits; I'm not sure it does everywhere).
It makes no differences when using a char* as a string. The only time signed/unsigned would make a difference is if you would be interpreting it as a number, like for arithmetic or if you were to print it as an integer.
UTF-8 characters cannot be assumed to fit in one byte. UTF-8 characters can be 1-4 bytes wide. So a char, wchar_t, signed or unsigned, would not be sufficient for assuming one unit can always store one UTF-8 character.
Most platforms (such as PHP, .NET, etc.) have you build strings normally (such as char[] in C) and you use a library to convert between encodings and parse characters out of the string.
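As an illustration of the "one char is not one character" point, a small sketch that counts UTF-8 code points by skipping continuation bytes; it assumes the input is already valid UTF-8:
#include <stdio.h>
/* Count code points in a UTF-8 string by counting every byte that is
   NOT a continuation byte (10xxxxxx). */
static size_t utf8_codepoint_count(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
int main(void) {
    printf("%zu\n", utf8_codepoint_count("abc\xC3\xA7"));   /* "abcç" -> 4 */
    return 0;
}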
As to your question:
think: if I have a signed or unsigned ARRAY of chars, can it make my program run wrong? – drigoSkalWalker
Yes. Mine did. Here's a simple runnable excerpt from my app that comes out totally wrong if you use ordinary signed chars.
Try running it after changing all the chars in the parameters to unsigned, like this:
int is_valid(unsigned char c);
it should then work properly.
#include <stdio.h>
int is_valid(char c);
int main() {
    char ch = 0xFE;
    int ans = is_valid(ch);
    printf("%d", ans);
}
int is_valid(char c) {
    if ((c == 0xFF) || (c == 0xFE)) {
        printf("NOT valid\n");
        return 0;
    }
    else {
        printf("valid\n");
        return 1;
    }
}
What it does is check whether the char is a valid byte within UTF-8.
0xFF and 0xFE are NOT valid bytes in UTF-8.
Imagine the problem if the function accepts them as valid bytes.
What happens is this:
0xFE = 11111110 = 254
If you save this in an ordinary char (that is, a signed one), the leftmost, most significant bit makes it negative. But which negative number is it?
You find out by flipping the bits and adding one (two's complement):
11111110   (the bits of 0xFE)
00000001   (the bits flipped)
00000001 + 00000001 = 00000010 = 2
And remember, the sign bit made it negative, so it becomes -2.
So (-2 == 0xFE) in the function of course isn't true.
The same goes for (-2 == 0xFF).
So a function that checks for invalid bytes ends up treating invalid bytes as if they were OK :-o
Two other reasons I can think of to stick to unsigned when dealing with UTF-8 are:
If you need to do some bit-shifting to the right, there can be trouble, because you might end up shifting in 1s from the left if you use signed chars.
UTF-8 and Unicode only use positive numbers, so... why don't you as well? Keep it simple :)