Hi, I am interested in those chars which are representable in the ASCII table. For that reason I am doing the following:
int t(char c) { return (int) c; }
...
if(!(t(d)>255)) { dostuff(); }
So I am interested only in chars representable in the ASCII table, which I assume should be less than 256 after conversion to int. Am I right? Thanks!
Usually (not always) a char is 8-bits so all chars would typically have a value of less than 256. So your test would always succeed.
Also, ASCII only goes up to 127, not 255. The characters after that are not standard ASCII, and can vary depending on code pages.
If you are dealing with international characters you should probably use wide characters instead of char.
Use the library:
#include <ctype.h>
...
if (isascii(d)) { dostuff(); }
Two caveats:
The C standard does not decide if char is by default signed or unsigned. If your compiler treated char as signed by default the cast to int could result in negative values instead of the values from 128 to 255 (and this is assuming that your chars are 8-bit, too). Perhaps it's better to use unsigned char if you want to be sure this range will be converted the way you expect.
Technically ASCII is from 0 to 127, everything above is some kind of extension.
char is an integral type in C. You can do the check directly:
char c;
/* assign to c */
if (c >= 0 && c <= 127) {
/* in ASCII range */
}
I am assuming you don't want to use isascii() (it's not in the C standard, although it is POSIX).
Also, you can check if CHAR_MAX is equal to 127. If it is, you don't need the comparison with 127, since c will not exceed it by definition. Similarly, if CHAR_MIN is 0, then you don't need the comparison with 0. Both CHAR_MIN and CHAR_MAX are defined in limits.h.
I think you're thinking about an integer value overflowing a char, and therefore convert it to an int. But, that doesn't help with overflow since the damage has already been done.
Size of char is always 1 byte (as per standard). For all practical matters this means that a char var cannot have a value bigger than 255. (though there are systems, where a byte has more than 8 bits and thus a char value can be bigger, but these are rare nowadays)
An additional caveat is that plain char may be either signed or unsigned, so it can be in the -128 to 127 range or the 0 to 255 range. (assuming 8 bits per byte, of course :-))
Meanwhile, the ASCII table is 7-bit, which means it covers the range of 0 to 127. So if you are interested in only ASCII symbols, you can just check if the value of your char var is in that range. No need to cast for the comparison.
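The range check these answers describe could be sketched as follows. The helper name is_ascii_char is made up for illustration; the conversion to unsigned char is there so the test behaves the same whether plain char is signed or unsigned.

```c
/* Hypothetical helper, not part of any standard library. */
int is_ascii_char(char c)
{
    /* Converting to unsigned char is well defined and maps every negative
       value to something above 127, so one comparison is enough. */
    unsigned char u = (unsigned char)c;
    return u <= 127;
}
```

With this, the original condition could be written as if (is_ascii_char(d)) { dostuff(); }.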
Related
int a = 0x11223344;
char b = (char)a;
I am new to programming and learning C. Why do I get value of b here as D?
If I want to store an integer into a char type variable, which byte of the integer will be stored?
This is not fully defined by the C standard.
In the particular situation you tried, what likely happened is that the low eight bits of 0x11223344 were stored in b, producing 0x44 (68 in decimal) in b, and printing that prints “D” because your system uses ASCII character codes, and 68 is the ASCII code for “D”.
However, you should be wary of something like this working, because it is contingent on several things, and variations are possible.
First, the C standard allows char to be signed or unsigned. It also allows char to be any width that is eight bits or greater. In most C implementations today, it is eight bits.
Second, the conversion from int to char depends on whether char is signed or unsigned and may not be fully defined by the C standard.
If char is unsigned, then the conversion is defined to wrap modulo M+1, where M is the largest value representable in char. Effectively, this is the same as taking the low byte of the value. If the unsigned char has eight bits, its M is 255, so M+1 is 256.
If char is signed and the value is out of range of the char type, the conversion is implementation-defined: It may either trap or produce an implementation-defined value. Your C implementation may wrap conversions to signed integer types similarly to how it wraps conversions to unsigned types, but another reasonable behavior is to “clamp” out-of-range values to the limits of the type, CHAR_MIN and CHAR_MAX. For example, converting −8000 to char could yield the minimum, −128, while converting 0x11223344 to char could yield the maximum, +127.
Third, the C standard does not require implementations to use ASCII. It is very common to use ASCII. (Usually, the character encoding is not just ASCII, because ASCII covers only values from 0 to 127. C implementations often use some extension beyond ASCII for values from 128 to 255.)
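If the goal is simply to get the low byte of an int in a fully defined way, going through unsigned char avoids the implementation-defined case. A minimal sketch, assuming CHAR_BIT is 8 and int is at least 32 bits; low_byte is a hypothetical name:

```c
/* Hypothetical helper: reduce an int to the range of unsigned char.
   Conversion to an unsigned type is defined as reduction modulo
   UCHAR_MAX + 1, so nothing here is implementation-defined. */
unsigned char low_byte(int value)
{
    return (unsigned char)value;
}
```

Under those assumptions, low_byte(0x11223344) yields 0x44, which is 'D' only on systems that use ASCII.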
I have a code like this:
#include <stdio.h>
int main()
{
char a=20,b=30;
char c=a*b;
printf("%c\n",c);
return 0;
}
The output of this program is X .
How is this output possible if a*b=600 which overflows as char values lies between -128 and 127 ?
Whether char is signed or unsigned is implementation defined. Either way, it is an integer type.
Anyway, the multiplication is done as int due to integer promotions and the result is converted to char.
If the value does not fit into the "smaller" type, how this is done is implementation-defined for a signed char. By far most (if not all) implementations simply cut off the upper bits.
For an unsigned char, the standard actually requires (briefly) cutting off the upper bits.
So:
(int)20 * (int)30 -> (int)600 -> (char)(600 % 256) -> 88 == 'X'
(Assuming 8 bit char).
See the link and its surrounding paragraphs for more details.
Note: If you enable compiler warnings (as always recommended), you should get a truncation warning for the assignment. This can be avoided by an explicit cast (only if you are really sure about all implications). The gcc option is -Wconversion.
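The two steps described in this answer (promotion to int, then narrowing) can be separated in a small sketch. The function names are invented for illustration, and the narrowing uses unsigned char so that the result is defined by the standard rather than by the implementation. It assumes CHAR_BIT is 8.

```c
/* Step 1: both operands are promoted to int before the multiplication,
   so the product itself is not truncated. */
int product_as_int(char a, char b)
{
    return a * b;
}

/* Step 2: converting the int result to unsigned char reduces it modulo
   UCHAR_MAX + 1 (256 when CHAR_BIT is 8). */
unsigned char product_as_byte(char a, char b)
{
    return (unsigned char)(a * b);
}
```

For a = 20 and b = 30, the first function gives 600 and the second gives 88, which is the code for 'X' on ASCII systems.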
First off, the behavior is implementation-defined here. A char may be either unsigned char or signed char, so it may be able to hold 0 to 255 or -128 to 127, assuming CHAR_BIT == 8.
600 in decimal is 0x258. What happens is the least significant eight bits are stored, the value is 0x58 a.k.a. X in ASCII.
This code will cause undefined behavior if char is signed.
I thought overflow of signed integer is undefined behavior, but conversion to smaller type is implementation-defined.
quote from N1256 6.3.1.3 Signed and unsigned integers:
3 Otherwise, the new type is signed and the value cannot be represented in it; either the
result is implementation-defined or an implementation-defined signal is raised.
If the value is simply truncated to 8 bits, (20 * 30) & 0xff == 0x58 and 0x58 is the ASCII code for X. So, if your system does this and uses ASCII, the output will be X.
First, looks like you have unsigned char with a range from 0 to 255.
You're right about the overflow.
600 - 256 - 256 = 88
This is just an ASCII code of 'X'.
So, where can unsigned char be useful?
If I understood right, unsigned char can represent numbers from -128 to 127. But every encoding table uses positive numbers. So, unsigned char can't be used for representing characters. Am I right?
No, unsigned char is 0 to 255.
It can be useful in representing binary data (a single byte), although, like any primitive data type, the possibilities are endless.
First of all, what you are describing is signed char; unsigned char ranges from 0 to 255.
To answer your questions about negative valued character, you are right that character encoding is done using positive values.
On a different view, just think of signed and unsigned char as integer representation.
Unsigned char is used to represent bytes. If you need just one byte of memory in a variable, you use unsigned char and assign an integer to it.
For example, uint8_t is used to represent bytes, but it is nothing more than that.
A signed char can represent number from -128 to +127
and unsigned char is from 0 to 255.
Although unsigned is more convenient in many use cases,
everything binary-related can be done with signed too:
0=0, 1=1 ... 127=127, -128=128, -127=129, -126=130 ... -1=255
Such conversions happen automatically (or, better to say,
it's just a different interpretation).
("binary-related" means that a mathematical -2 * 2 would be possible too with unsigned,
but makes even less sense)
Regarding So, where can unsigned char be useful?
Here perhaps?: (a very simple example to test for ASCII digit)
#include <stdbool.h>

/* Returns true if c is an ASCII digit. */
bool isDigit(unsigned char c)
{
if((c >= '0') && (c <= '9')) return true;
return false;
}
Because the argument type is unsigned char, the input is guaranteed to be a single non-negative byte value (there are 128 ASCII codes; with Extended ASCII, there are 256 possible values). So, in this function, all that remains is to test the input value against specific criteria (in this case, whether it is a digit). There is no need for the function to test for negative numbers. A plain char, where it is signed, can hold all 128 ASCII characters but not the extended values from 128 to 255. The size of unsigned char is also significant in that it is only 1 byte, as opposed to 4 bytes (typically, but not always) for, say, an int.
Perhaps I'm overthinking this, as it seems like it should be a lot easier. I want to take a value of type int, such as is returned by fgetc(), and record it in a char buffer if it is not an end-of-file code. E.g.:
char buf;
int c = fgetc(stdin);
if (c < 0) {
/* handle end-of-file */
} else {
buf = (char) c; /* not quite right */
}
However, if the platform has signed default chars then the value returned by fgetc() may be outside the range of char, in which case casting or assigning it to (signed) char produces implementation-defined behavior (right?). Surely, though, there is tons of code out there that does exactly the equivalent of the example. Is it all relying on implementation-defined behavior and/or assuming 7-bit data?
It looks to me like if I want to be certain that the behavior of my code is defined by C to be what I want, then I need to do something like this:
buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
I think that produces defined, correct behavior whether default chars are signed or unsigned, and regardless even of the size of char. Is that right? And is it really needful to do that to ensure portability?
fgetc() returns either an unsigned char value converted to int, or EOF. EOF is always < 0. Whether the system's char is signed or unsigned makes no difference.
C11dr 7.21.7.1 2
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).
The concern I have about the following is that it looks to be 2's complement dependent and implies that the ranges of unsigned char and char are equally wide. Both of these assumptions are certainly nearly always true today.
buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
[Edit per OP comment]
Let's assume fgetc() returns no more distinct characters than fit in the range CHAR_MIN to CHAR_MAX; then (c - (UCHAR_MAX + 1)) would be more portable if replaced with (c - CHAR_MAX + CHAR_MIN). We do not know that (c - (UCHAR_MAX + 1)) is in range when c is CHAR_MAX + 1.
A system could exist that has a signed char range of -127 to +127 and an unsigned char range of 0 to 255 (5.2.4.2.1), but as fgetc() gets a character, it seems it would either have to treat everything as unsigned char or have already limited itself to the smaller signed char range before converting to unsigned char and returning that value to the user. OTOH, if fgetc() returned 256 different characters, conversion to a narrower-ranged signed char would not be portable regardless of formula.
Practically, it's simple - the obvious cast to char always works.
But you're asking about portability...
I can't see how a real portable solution could work.
This is because the guaranteed range of char is -127 to 127, which is only 255 different values. So how could you translate the 256 possible return values of fgetc (excluding EOF), to a char, without losing information?
The best I can think of is to use unsigned char and avoid char.
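One way to follow that suggestion is to keep the buffer as unsigned char and never go through plain char at all. The helper below is a sketch with a made-up name (store_byte); it assumes UCHAR_MAX <= INT_MAX, so that EOF cannot collide with a valid byte value.

```c
#include <stdio.h>

/* Hypothetical helper: store a result of fgetc() as an unsigned char.
   Returns 1 if a byte was stored, 0 if c was EOF. */
int store_byte(int c, unsigned char *out)
{
    if (c == EOF) {
        return 0;
    }
    /* c is in 0..UCHAR_MAX here, so the conversion loses nothing. */
    *out = (unsigned char)c;
    return 1;
}
```

A caller would pass the return value of fgetc() directly, for example store_byte(fgetc(stdin), &buf).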
With thanks to those who responded, and having now read relevant portions of the C99 standard, I have come to agree with the somewhat surprising conclusion that storing an arbitrary non-EOF value returned by fgetc() as type char without loss of fidelity is not guaranteed to be possible. In large part, that arises from the possibility that char cannot represent as many distinct values as unsigned char.
For their part, the stdio functions guarantee that if data are written to a (binary) stream and subsequently read back, then the read back data will compare equal to the original data. That turns out to have much narrower implications than I at first thought, but it does mean that fputs() must output a distinct value for each distinct char it successfully outputs, and that whatever conversion fgets() applies to store input bytes as type char must accurately reverse the conversion, if any, by which fputs() would produce the input byte as its output. As far as I can tell, however, fputs() and fgets() are permitted to fail on any input they don't like, so it is not certain that fputs() maps every possible char value to an unsigned char.
Moreover, although fputs() and fgets() operate as if by performing sequences of fputc() and fgetc() calls, respectively, it is not specified what conversions they might perform between char values in memory and the underlying unsigned char values on the stream. If a platform's fputs() uses standard integer conversion for that purpose, however, then the correct back-conversion is as I proposed:
int c = fgetc(stream);
char buf;
if (c >= 0) buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
That arises directly from the integer conversion rules, which specify that integer values are converted to unsigned types by adding or subtracting the integer multiple of <target type>_MAX + 1 needed to bring the result into the range of the target type, supported by the constraints on representation of integer types. Its correctness for that purpose does not depend on the specific representation of char values or on whether char is treated as signed or unsigned.
However, if char cannot represent as many distinct values as unsigned char, or if there are char values that fputs() refuses to output (e.g. negative ones), then there are possible values of c that could not have resulted from a char conversion in the first place. No back-conversion argument is applicable to such bytes, and there may not even be a meaningful sense of char values corresponding to them. In any case, whether the given conversion is the correct reverse-conversion for data written by fputs() seems to be implementation-defined. It is certainly implementation-defined whether buf = (char) c will have the same effect, though it does on very many systems.
Overall, I am struck by just how many details of C I/O behavior are implementation defined. That was an eye-opener for me.
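The back-conversion proposed above can be wrapped in a helper and paired with the forward conversion to check that values round-trip. This is only a sketch: the function names are invented, and it assumes char can represent as many distinct values as unsigned char (true on two's complement systems), which is exactly the condition the preceding paragraphs say the standard does not guarantee.

```c
#include <limits.h>

/* Hypothetical helper wrapping the back-conversion from the text.
   c must be in 0..UCHAR_MAX, i.e. a non-EOF result of fgetc(). */
char byte_to_char(int c)
{
    return (char)((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
}

/* The forward direction: the standard integer conversion to unsigned char. */
int char_to_byte(char ch)
{
    return (unsigned char)ch;
}
```

Under that assumption, char_to_byte(byte_to_char(c)) gives back c for every c from 0 to UCHAR_MAX, whether plain char is signed or unsigned.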
Best way to portably assign the result of fgetc() to a char in C
C2X is on the way
A sub-problem is saving an unsigned char value into a char, which may be signed. With 2's complement, that is not a problem.*1
On non-2's complement machines with signed char that do not support -0 *2, that is a problem. (I know of no such machines.)
In any case, with C2X, support for non-2's complement encoding is planned to be dropped, so as time goes on, we can eventually ignore non-2's complement issues and confidently use
int c = fgetc(stdin);
...
char buf = (c > CHAR_MAX) ? (char)(c - (UCHAR_MAX + 1)) : (char)c;
UCHAR_MAX > INT_MAX??
A 2nd portability issue not discussed is when UCHAR_MAX > INT_MAX, e.g. when all integer types are 64-bit. Some graphics processors have used a common size for all integer types.
On such unicorn machines, if (c < 0) is insufficient. Could use:
int c = fgetc(stdin);
#if UCHAR_MAX <= INT_MAX
if (c < 0) {
#else
if (c == EOF && (feof(stdin) || ferror(stdin))) {
#endif
...
Pedantically, ferror(stdin) could be true due to a prior input function and not this one which returned UCHAR_MAX, but let us not go into that rabbit-hole.
*1 In the case of int to signed char with c > CHAR_MAX, "Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." applies. With 2's complement, this overwhelmingly maps [128...255] to [-128...-1].
*2 With non-2's complement and -0 support, the common mapping is that the least significant 8 bits remain the same. This does make for 2 zeros, yet proper handling of strings in <string.h> uses "For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value)." So -0 is not a null character, as that char is accessed as a non-zero unsigned char.
Given that signed and unsigned ints use the same registers, etc., and just interpret bit patterns differently, and C chars are basically just 8-bit ints, what's the difference between signed and unsigned chars in C? I understand that the signedness of char is implementation defined, and I simply can't understand how it could ever make a difference, at least when char is used to hold strings instead of to do math.
It won't make a difference for strings. But in C you can use a char to do math, when it will make a difference.
In fact, when working in constrained memory environments, like embedded 8 bit applications a char will often be used to do math, and then it makes a big difference. This is because there is no byte type by default in C.
In terms of the values they represent:
unsigned char:
spans the value range 0..255 (00000000..11111111)
values overflow around low edge as:
0 - 1 = 255 (00000000 - 00000001 = 11111111)
values overflow around high edge as:
255 + 1 = 0 (11111111 + 00000001 = 00000000)
bitwise right shift operator (>>) does a logical shift:
10000000 >> 1 = 01000000 (128 / 2 = 64)
signed char:
spans the value range -128..127 (10000000..01111111)
values overflow around low edge as:
-128 - 1 = 127 (10000000 - 00000001 = 01111111)
values overflow around high edge as:
127 + 1 = -128 (01111111 + 00000001 = 10000000)
bitwise right shift operator (>>) does an arithmetic shift:
10000000 >> 1 = 11000000 (-128 / 2 = -64)
I included the binary representations to show that the value wrapping behaviour is pure, consistent binary arithmetic and has nothing to do with a char being signed/unsigned (except for right shifts).
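The unsigned half of that table is the part the C standard fully defines, and it can be checked with a few small functions. The names below are made up for this sketch. Note that the arithmetic itself happens in int after promotion; it is the conversion back to unsigned char that wraps.

```c
#include <limits.h>

/* Subtract one, wrapping from 0 to UCHAR_MAX on conversion back. */
unsigned char dec_wrap(unsigned char v)
{
    return (unsigned char)(v - 1);
}

/* Add one, wrapping from UCHAR_MAX to 0 on conversion back. */
unsigned char inc_wrap(unsigned char v)
{
    return (unsigned char)(v + 1);
}

/* Right shift of a non-negative value always shifts in zero bits. */
unsigned char shift_right_one(unsigned char v)
{
    return (unsigned char)(v >> 1);
}
```

The signed half is less certain: converting an out-of-range int back to signed char is implementation-defined, and so is right-shifting a negative value.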
Update
Some implementation-specific behaviour mentioned in the comments:
char != signed char. The type "char" without "signed" or "unsigned" is implementation-defined, which means that it can act like a signed or unsigned type.
Signed integer overflow leads to undefined behavior where a program can do anything, including dumping core or overrunning a buffer.
#include <stdio.h>
int main(int argc, char** argv)
{
char a = 'A';
char b = 0xFF;
signed char sa = 'A';
signed char sb = 0xFF;
unsigned char ua = 'A';
unsigned char ub = 0xFF;
printf("a > b: %s\n", a > b ? "true" : "false");
printf("sa > sb: %s\n", sa > sb ? "true" : "false");
printf("ua > ub: %s\n", ua > ub ? "true" : "false");
return 0;
}
[root]# ./a.out
a > b: true
sa > sb: true
ua > ub: false
It's important when sorting strings.
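For sorting, one way to avoid depending on the signedness of plain char is to compare the bytes as unsigned char, which is also how the standard specifies that strcmp() orders characters. A minimal sketch, with compare_bytes as a hypothetical name:

```c
/* Hypothetical helper: compare two strings byte by byte as unsigned char. */
int compare_bytes(const char *a, const char *b)
{
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;

    /* Advance while the bytes match and the end of a is not reached. */
    while (*ua != 0 && *ua == *ub) {
        ua++;
        ub++;
    }
    /* Returns a negative, zero, or positive value, like strcmp(). */
    return (*ua > *ub) - (*ua < *ub);
}
```

With this ordering, "A" sorts before "\xFF" regardless of whether the plain-char comparison a > b in the program above prints true or false.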
There are a couple of differences. Most importantly, if you overflow the valid range of a char by assigning it a too big or too small integer, and char is signed, the resulting value is implementation-defined, or a signal may even be raised (in C), as for all signed types. Contrast that with the case when you assign something too big or too small to an unsigned char: the value wraps around, and you get precisely defined semantics. For example, assigning -1 to an unsigned char gives you UCHAR_MAX. So whenever you have a byte, as in a number from 0 to 2^CHAR_BIT - 1, you should really use unsigned char to store it.
The sign also makes a difference when passing to vararg functions:
char c = getSomeCharacter(); // returns 0..255
printf("%d\n", c);
Assume the value assigned to c is too big for char to represent, and the machine uses two's complement. Many implementations handle assigning a too-big value to a char by leaving the bit pattern unchanged. If an int is able to represent all values of char (which it is for most implementations), then the char is promoted to int before being passed to printf. So, if char is signed, the value passed would be negative; promoting to int retains that sign, and you will get a negative result. However, if char is unsigned, then the value is unsigned, and promoting to an int will yield a positive int. If you use unsigned char, you get precisely defined behavior for both the assignment to the variable and the passing to printf, which will then print something positive.
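If the variable has to stay a plain char, converting it to unsigned char at the point of the call has the same effect on what gets printed. The sketch below uses snprintf so the result can be inspected; format_byte is an invented name, and it assumes int can represent every unsigned char value so that %d matches the promoted argument.

```c
#include <stdio.h>

/* Hypothetical helper: format a char as its byte value 0..UCHAR_MAX.
   Converting to unsigned char first means the value promoted to int
   for the variable argument list is never negative. */
int format_byte(char *out, size_t cap, char c)
{
    return snprintf(out, cap, "%d", (unsigned char)c);
}
```

On an 8-bit two's complement system, a char holding the bit pattern 0xFF is formatted as 255 this way, where a plain printf("%d\n", c) would print -1 if char is signed.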
Note that a char, unsigned and signed char all are at least 8 bits wide. There is no requirement that char is exactly 8 bits wide. However, for most systems that's true, but for some, you will find they use 32bit chars. A byte in C and C++ is defined to have the size of char, so a byte in C also is not always exactly 8 bits.
Another difference is that in C, an unsigned char must have no padding bits. That is, if you find CHAR_BIT is 8, then an unsigned char's values must range from 0 to 2^CHAR_BIT-1. The same is true for char if it's unsigned. For signed char, you can't assume anything about the range of values: even if you know how your compiler implements the sign (two's complement or the other options), there may be unused padding bits in it. In C++, there are no padding bits for all three character types.
"What does it mean for a char to be signed?"
Traditionally, the ASCII character set consists of 7-bit character encodings. (As opposed to the 8-bit EBCDIC.)
When the C language was designed and implemented this was a significant issue. (For various reasons like data transmission over serial modem devices.) The extra bit has uses like parity.
A "signed character" happens to be perfect for this representation.
Binary data, OTOH, is simply taking the value of each 8-bit "chunk" of data, thus no sign is needed.
Arithmetic on bytes is important for computer graphics (where 8-bit values are often used to store colors). Aside from that, I can think of two main cases where char sign matters:
converting to a larger int
comparison functions
The nasty thing is, these won't bite you if all your string data is 7-bit. However, it promises to be an unending source of obscure bugs if you're trying to make your C/C++ program 8-bit clean.
Signedness works pretty much the same way in chars as it does in other integral types. As you've noted, chars are really just one-byte integers. (Not necessarily 8-bit, though! There's a difference; a byte might be bigger than 8 bits on some platforms, and chars are rather tied to bytes due to the definitions of char and sizeof(char). The CHAR_BIT macro, defined in <limits.h> or C++'s <climits>, will tell you how many bits are in a char.).
As for why you'd want a character with a sign: in C and C++, there is no standard type called byte. To the compiler, chars are bytes and vice versa, and it doesn't distinguish between them. Sometimes, though, you want to -- sometimes you want that char to be a one-byte number, and in those cases (particularly how small a range a byte can have), you also typically care whether the number is signed or not. I've personally used signedness (or unsignedness) to say that a certain char is a (numeric) "byte" rather than a character, and that it's going to be used numerically. Without a specified signedness, that char really is a character, and is intended to be used as text.
I used to do that, rather. Now the newer versions of C and C++ have (u?)int_least8_t (currently typedef'd in <stdint.h> or <cstdint>), which are more explicitly numeric (though they'll typically just be typedefs for signed and unsigned char types anyway).
The only situation I can imagine this being an issue is if you choose to do math on chars. It's perfectly legal to write the following code.
char a = (char)42;
char b = (char)120;
char c = a + b;
Depending on the signedness of char, c could be one of two values. If chars are unsigned then c will be (char)162. If they are signed then the sum is out of range, as the max value for a signed char is 127 (assuming 8 bits), and the result of the conversion is implementation-defined. I'm guessing most implementations would just return (char)-94.
One thing about signed chars is that you can test c >= ' ' (space) and be sure it's a normal printable ASCII char (apart from DEL, 127). Of course, it's not portable, so not very useful.