C Compatibility Between Integers and Characters - c

How does C handle converting between integers and characters? Say you've declared an integer variable and ask the user for a number but they input a string instead. What would happen?

The user input is treated as a string that needs to be converted to an int using atoi or another conversion function. atoi will return 0 if the string cannot be interpreted as a number because it contains letters or other non-numeric characters.
You can read a bit more at the atoi documentation on MSDN - http://msdn.microsoft.com/en-us/library/yd5xkb5c(VS.80).aspx
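A minimal sketch of why that return value of 0 is ambiguous (the input strings are just illustrative):
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("%d\n", atoi("hello")); // prints 0 because the conversion failed
    printf("%d\n", atoi("0"));     // prints 0 because the input really is zero
    return 0;
}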

Uh?
You always input a string. Then you parse/convert this string to a number, with various ways of handling the possible errors (overflow, incorrect characters, etc.), such as asking again or falling back to a default value.

Another thing to note is that in C, characters and integers are "compatible" to some degree. Any character can be assigned to an int. The reverse also works, but you'll lose information if the integer value doesn't fit into a char.
char foo = 'a';            // the ASCII value of lowercase 'a' is 97
int bar = foo;             // bar now contains the value 97
bar = 255;                 // 255 is 0x000000ff in hexadecimal
foo = bar;                 // foo now contains -1 (0xff) if char is signed
unsigned char foo2 = foo;  // foo2 now contains 255 (0xff)
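Wrapped up as a runnable sketch (assuming an ASCII execution character set and a signed 8-bit char, which is typical but not guaranteed):
#include <stdio.h>

int main(void) {
    char foo = 'a';            // 97 in ASCII
    int bar = foo;             // value-preserving: bar == 97
    bar = 255;                 // 0x000000ff
    foo = bar;                 // typically -1 (0xff) when char is signed
    unsigned char foo2 = foo;  // 255 (0xff)
    printf("%d %d %d %d\n", 'a', bar, foo, foo2); // 97 255 -1 255
    return 0;
}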

As other people have noted, the data is normally entered as a string -- the only question is which function is used for doing the reading. If you're using a GUI, the function may already deal with conversion to integer and report errors in an appropriate manner. If you're working with Standard C, it is generally easier to read the value into a string (perhaps with fgets()) and then convert it. Although atoi() can be used, it is seldom the best choice; the trouble is determining whether the conversion succeeded (and produced zero because the user entered a legitimate representation of zero) or not.
Generally, use strtol() or one of its relatives (strtoul(), strtoll(), strtoull()); for converting floating point numbers, use strtod() or a similar function. The advantage of the integer conversion routines include:
optional base selection (for example, base 10, or base 16 - hex, or base 8 - octal, or any of the above using standard C conventions: 007 for octal, 0x07 for hex, 7 for decimal);
optional error detection (by knowing where the conversion stopped).
The place I go for many of these function specifications (when I don't look at my copy of the actual C standard) is the POSIX web site (which includes C99 functions). It is Unix-centric rather than Windows-centric.
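A hedged sketch of the fgets()-then-strtol() approach described above (the buffer size and messages are arbitrary choices):
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char line[64];
    printf("Enter a number: ");
    if (fgets(line, sizeof line, stdin) == NULL)
        return 1;                        // EOF or read error

    errno = 0;
    char *end;
    long value = strtol(line, &end, 0);  // base 0 accepts 7, 007 and 0x07
    if (end == line)
        puts("no digits were found");
    else if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        puts("value out of range for int");
    else
        printf("got %d\n", (int)value);
    return 0;
}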

The program would crash; you need to call the atoi function.

Related

What does atof stand for?

In C, atof = a-to-f(loat) converts a string into a double-precision float. I am wondering what the a part of atof stands for.
atof is a function in the C programming language that converts a string into a floating-point numerical representation. atof stands for ASCII to float. It is included in the C standard library header file stdlib.h. Its prototype is as follows:
double atof (const char *str);
The str argument points to a string, represented by an array of characters, containing the character representation of a floating point value. If the string is not a valid textual representation of a double, atof will silently fail, returning zero (0.0) in that case. [1]
Note that while atoi and atol return types corresponding to their names (atoi returns an int and atol returns a long), atof does not return a float; it returns a double.
A related function is sscanf. This function extracts values from strings and its return argument is the number of valid values it managed to extract (so, unlike atof, sscanf can be used to test if a string starts with a valid number).
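A small sketch contrasting the two behaviours (the input strings are made up for illustration):
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    double d;
    printf("%f\n", atof("oops"));     // 0.000000 -- indistinguishable from a real zero
    if (sscanf("oops", "%lf", &d) == 1)
        printf("parsed %f\n", d);
    else
        puts("not a number");         // sscanf's return value reveals the failure
    return 0;
}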
To best answer what the a stands for, go back to the early 1970s, when the cost of memory approached a dollar per byte.
Even if a originally stood for ASCII, atof() did not, and still does not, have to convert ASCII into a double, as the implementation may use an alternate character encoding. With EBCDIC or PETSCII, one could think of a as alpha and write code for atof() per that non-ASCII encoding.

How do I compare single multibyte character constants cross-platform in C?

In my previous post I found a solution to do this using C++ strings, but I wonder if there would be a solution using char's in C as well.
My current solution uses str.compare() and size() of a character string as seen in my previous post.
Now, since I only use one (multibyte) character in the std::string, would it be possible to achieve the same using a char?
For example, if( str[i] == '¶' )? How do I achieve that using char's?
(edit: fixed a typo in the comparison operator, as pointed out in the comments)
How do I compare single multibyte character constants cross-platform in C?
You seem to mean an integer character constant expressed using a single multibyte character. The first thing to recognize, then, is that in C, integer character constants (examples: 'c', '¶') have type int, not char. The primary relevant section of C17 is paragraph 6.4.4.4/10:
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
(Emphasis added.)
Note well that "implementation defined" implies limited portability from the get-go. Even if we rule out implementations defining perverse behavior, we still have alternatives such as
the implementation rejects integer character constants containing multibyte source characters; or
the implementation rejects integer character constants that do not map to a single-byte execution character; or
the implementation maps source multibyte characters via a bytewise identity mapping, regardless of the byte sequence's significance in the execution character set.
That is not an exhaustive list.
You can certainly compare integer character constants with each other, but if they map to multibyte execution characters then you cannot usefully compare them to individual chars.
Inasmuch as your intended application appears to be to locate individual multibyte characters in a C string, the most natural thing to do appears to be to implement a C analog of your C++ approach, using the standard strstr() function. Example:
char str[] = "Some string ¶ some text ¶ to see";
char char_to_compare[] = "¶";
int char_size = sizeof(char_to_compare) - 1; // don't count the string terminator
for (char *location = strstr(str, char_to_compare);
     location;
     location = strstr(location + char_size, char_to_compare)) {
    puts("Found!");
}
That will do the right thing in many cases, but it still might be wrong for some characters in some execution character encodings, such as those encodings featuring multiple shift states.
If you want robust handling for characters outside the basic execution character set, then you would be well advised to take control of the in-memory encoding, and to perform appropriate conversions to, operations on, and conversions from that encoding. This is largely what ICU does, for example.
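If you do want to compare decoded characters rather than byte sequences, one possible sketch (not what the answer above does) is to decode with the standard mbrtowc() function and compare wchar_t values; it assumes the locale selected at run time matches the encoding of str:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");               // use the environment's locale (assumed UTF-8 here)

    const char *str = "Some string ¶ some text ¶ to see";
    const char *needle = "¶";

    wchar_t target;
    mbstate_t st = {0};
    if (mbrtowc(&target, needle, strlen(needle), &st) == (size_t)-1)
        return EXIT_FAILURE;             // needle is not valid in the current locale

    const char *p = str;
    size_t left = strlen(str);
    memset(&st, 0, sizeof st);
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, p, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            break;                       // invalid or incomplete sequence
        if (n == 0)
            n = 1;                       // embedded null byte
        if (wc == target)
            puts("Found!");
        p += n;
        left -= n;
    }
    return 0;
}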
I believe you meant something like this:
char a = '¶';
char b = '¶';
if (a == b) /*do something*/;
The above may or may not work: if the value of '¶' is bigger than the char range, it won't fit, so a and b will store a different value than that of '¶'. Regardless of which value they do hold, they may well both hold the same value, so the comparison can still come out true.
Remember, the char type is simply a one-byte-wide (typically 8-bit) integer, so in order to work with multibyte characters and avoid truncation you just have to use a wider integer type (short, int, long...).
short a = '¶';
short b = '¶';
if (a == b) /*do something*/;
From personal experience, I've also noticed that sometimes your environment may use a different character encoding than the one you need. For example, trying to print the 'á' character will actually produce something else.
unsigned char x = 'á';
putchar(x); //actually prints character 'ß' in console.
putchar(160); //will print 'á'.
This happens because the console uses an Extended ASCII encoding, while my coding environment actually uses Unicode, parsing a value of 225 for 'á' instead of the value of 160 that I want.

Internals binary saving of C chars

Hey, I stumbled upon something pretty weird while programming. I tried to transform a UTF-8 char into a hexadecimal byte representation like 0x89 or 0xff.
char test[3] = "ü";
for (int x = 0; x < 3; x++) {
    printf("%x\n", test[x]);
}
And I get the following output:
ffffffc3
ffffffbc
0
I know that C uses one byte of data for every char, and therefore if I want to store a special char like "ü" it counts as 2 chars.
Transforming ASCII chars is no problem, but once I get to non-ASCII chars (from German to Chinese), instead of getting outputs like 0xc3 and 0xbc, C adds 0xFFFFFF00 to them.
I know that I can just do something like & 0xFF to fix that weird representation, but I can't wrap my head around why it keeps happening in the first place.
C allows type char to behave either as a signed type or as an unsigned type, as the C implementation chooses. You are observing the effect of it being a signed type, which is pretty common. When the char value of test[x] is passed to printf, it is promoted to type int, in a value-preserving manner. When the value is negative, that involves sign extension, whose effect is exactly what you describe. To avoid that, add an explicit cast to unsigned char:
printf("%x\n", (unsigned char) test[x]);
Note also that C itself does not require any particular characters outside the 7-bit ASCII range to be supported in source code, and it does not specify the execution-time encoding with which ordinary string contents are encoded. It is not safe to assume UTF-8 will be the execution character set, nor to assume that all compilers will accept UTF-8 source code, or will default to assuming that encoding even if they do support it.
The encoding of source code is a matter you need to sort out with your implementation, but if you are using at least C11 then you can ensure execution-time UTF-8 encoding for specific string literals by using UTF-8 literals, which are prefixed with u8:
char test[3] = u8"ü";
Be aware also that UTF-8 code sequences can be up to four bytes long, and most of the characters in the basic multilingual plane require 3. The safest way to declare your array, then, would be to let the compiler figure out the needed size:
// better
char test[] = u8"ü";
... and then to use sizeof to determine the size chosen:
for (int x = 0; x < sizeof(test); x++) {
    // ...
}
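Putting those pieces together, a sketch of the corrected program (it assumes a compiler that accepts u8 literals, i.e. at least C11):
#include <stdio.h>

int main(void) {
    char test[] = u8"ü";                        // compiler sizes the array; UTF-8 guaranteed
    for (size_t x = 0; x < sizeof(test); x++) {
        printf("%x\n", (unsigned char)test[x]); // cast avoids the sign extension
    }
    return 0;                                   // prints c3, bc, 0
}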

output of negative integer to %u format specifier

Consider the following code
char c=125;
c+=10;
printf("%d",c); //outputs -121 which is understood.
printf("%u",c); // outputs 4294967175.
printf("%u",-121); // outputs 4294967175
%d accepts negative numbers, therefore the output is -121 in the first case.
The output in cases 2 and 3 is 4294967175. I don't understand why.
Do 2^32 - 121 = 4294967175.
printf interprets the data you provide according to the % conversion specifiers:
%d: signed integer, value from -2^31 to 2^31 - 1
%u: unsigned integer, value from 0 to 2^32 - 1
In binary, both integer values (-121 and 4294967175) are (of course) identical:
`0xFFFFFF87`
See Two's complement
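A quick sketch of that identity (assuming 32-bit int, as the answer does):
#include <stdio.h>

int main(void) {
    int i = -121;
    unsigned int u = (unsigned int)i; // conversion adds 2^32: 4294967175
    printf("%d\n", i);                // -121
    printf("%u\n", u);                // 4294967175
    printf("%x\n", u);                // ffffff87
    return 0;
}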
printf is a function with variadic arguments. In such a case, the "default argument promotions" are applied to the arguments before the function is called. In your case, c is first converted from char to int and then passed to printf. The conversion does not depend on the corresponding '%' specifier in the format. The value of this int parameter can be interpreted as 4294967175 or -121 depending on signedness. The corresponding parts in the C standard are:
6.5.2.2 Function calls
6 - ... If the expression that denotes the called function has a type that does not include a prototype, the integer promotions are performed on each argument, and arguments that have type float are promoted to double. These are called the default argument promotions.
7 - If the expression that denotes the called function has a type that does include a prototype, the arguments are implicitly converted, as if by assignment, to the types of the corresponding parameters, taking the type of each parameter to be the unqualified version of its declared type. The ellipsis notation in a function prototype declarator causes argument type conversion to stop after the last declared parameter. The default argument promotions are performed on trailing arguments.
If char is signed in your compiler (which is the most likely case) and is 8 bits wide (extremely likely), then c += 10 produces the value 135, which does not fit in a char. Strictly, the addition is performed in int, so it does not overflow; it is the conversion of the out-of-range result back to a signed char that is implementation-defined (it may even raise an implementation-defined signal). Either way, you cannot portably reason about the result you're getting.
If char is unsigned (not very likely on most PC platforms), then see the other answers.
printf uses something called variadic arguments. If you do a little research about them, you'll find out that a function that uses them does not know the types of the arguments you're passing to it. Therefore there must be a way to tell the function how it must interpret the input, and you do that with the format specifiers.
In your particular case, c is an 8-bit signed integer. Therefore, if you store the literal -121 in it, it will hold 10000111. Then, by the integer promotion mechanism, it is converted to an int: 11111111111111111111111110000111.
With "%d" you tell printf to interpret 11111111111111111111111110000111 as a signed integer, therefore you have -121 as output. However, with "%u" you're telling printf that 11111111111111111111111110000111 is an unsigned integer, therefore it will output 4294967175.
EDIT: As stated in the comments, the behaviour here is actually not fully pinned down by C. That's because there is more than one way to encode negative numbers (sign and magnitude, one's complement, ...), and some other aspects (such as endianness, if I'm not wrong) influence this result. So the result is implementation-defined, and you may get a different output than 4294967175. But the main concepts I explained, the different interpretations of the same string of bits and the loss of type information with variadic arguments, still hold.
Try to convert the number into base 10, first as a pure (unsigned) binary number, then knowing that it's stored in 32-bit two's complement... you get two different results. But if I do not tell you which interpretation you need to use, that binary string could represent anything (a 4-char ASCII string, a number, a small 8-bit 2x2 image, your safe combination, ...).
EDIT: you can think of "%<format_string>" as a sort of "extension" for that string of bits. You know, when you create a file, you usually give it an extension, which is actually part of the filename, to remember in which format/encoding the file has been stored. Suppose you have your favorite song saved as a song.ogg file on your PC. If you rename the file to song.txt, song.odt, song.pdf, song, or song.awkwardextension, that does not change the content of the file. But if you try to open it with the program usually associated with .txt or .whatever, it reads the bytes in the file, and when it tries to interpret the sequences of bytes it may fail (that's why if you open song.ogg with Emacs or Vim or whatever text editor you get something that looks like garbage, if you open it with, for instance, GIMP, GIMP cannot read it, and if you open it with VLC you listen to your favorite song). The extension is just a reminder for you: it reminds you how to interpret that sequence of bits. As printf has no knowledge for that interpretation, you need to provide it one, and if you tell printf that a signed integer is actually unsigned, well, it's like opening song.ogg with Emacs...

C Language: Why int variable can store char?

I am currently reading The C Programming Language by Kernighan.
There is an example which defines a variable as int type but uses getchar() to store a value in it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think about is ASCII and UNICODE.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char value, it will silently be promoted to int anyway. And that includes character constants like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
Storing an int into a char has implementation-defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem that could have been avoided by mandating the char type to be unsigned, but the C Standard allowed for the many existing implementations where the char type was signed. It would take a vicious implementation to produce unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
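For reference, a sketch of the classic idiom built on this, keeping the value in an int until EOF has been ruled out:
#include <stdio.h>

int main(void) {
    int c;                      // int, not char, so EOF stays distinguishable
    while ((c = getchar()) != EOF) {
        putchar(c);             // safe to treat as a character now
    }
    return 0;
}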
getchar is an old C standard function, and the philosophy back then was closer to how the language gets translated to assembly than to type correctness and readability. Keep in mind that compilers were not optimizing code as much as they do today. In C, int is the default return type (i.e. if you don't have a declaration of a function in C, compilers will assume that it returns int), and returning a value is done using a register - therefore returning a char instead of an int may actually generate additional implicit code to mask out the extra bytes of the value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>. Here is somebody else's limits.h.
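A tiny sketch that prints those limits for your own implementation:
#include <limits.h>
#include <stdio.h>

int main(void) {
    printf("CHAR_BIT = %d\n", CHAR_BIT);
    printf("CHAR_MIN = %d\n", CHAR_MIN);
    printf("CHAR_MAX = %d\n", CHAR_MAX);
    printf("INT_MIN  = %d\n", INT_MIN);
    printf("INT_MAX  = %d\n", INT_MAX);
    return 0;
}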
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (that can be worked with by hardware); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number 1. (For historical reasons, lots of control characters came first - so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining UC letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
If memory serves me right, you can't do arithmetic on a char. But, if you call it an int, you can. So, to convert all LC letters to UC, you can do something like:
char letter;
....
if (letter >= 'a' && letter <= 'z') { /* letter is lower-case */
    letter = (int) letter - 32;
}
Some (or most) C compilers would complain if you did not reinterpret the var as an int before adding/subtracting.
But, in the end, the type 'char' really is just another integer type, since ASCII assigns a unique integer to each letter.
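One small sketch that makes the relationship visible: in C (unlike C++), a character constant such as 'a' already has type int:
#include <stdio.h>

int main(void) {
    printf("sizeof(char) = %zu\n", sizeof(char)); // always 1
    printf("sizeof('a')  = %zu\n", sizeof('a'));  // equals sizeof(int) in C
    printf("sizeof(int)  = %zu\n", sizeof(int));  // typically 4
    return 0;
}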
