Why is the compiler treating a character as an integer? - C

I have a small snippet of code.
When I run this on my DevC++ gnu compiler it shows the following output:
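#include <stdio.h>

int main(void)
{
    char b = 'a';
    printf("%zu,", sizeof('a'));  /* sizeof yields size_t, so use %zu */
    printf("%zu", sizeof(b));
    getchar();                    /* standard replacement for getch() */
    return 0;
}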
OUTPUT: 4,1
Why is 'a' treated as an integer, whereas b is treated as only a char?

Because character literals are of type int and not char in C.
So sizeof 'a' == sizeof (int).
Note that in C++, a character literal is of type char and so sizeof 'a' == sizeof (char).

That's just the way it is in C. That's just how the language was originally defined. As for why... Back then virtually everything in C was an int, unless there was a very good reason to make it something else. So, historically character constants in C have type int.
Note, by the way, that in C nomenclature 'a' is called a character constant, not a literal. The C standard uses the word "literal" only for string literals and, since C99, compound literals.

In C, a character literal has type int.
In C++, a character literal that contains only one character has type char, which is an integral type.
In both C and C++, a wide character literal has type wchar_t, and a multicharacter literal has type int.

From IBM XL C/C++ documentation
A character literal contains a sequence of characters or escape
sequences enclosed in single quotation mark symbols, for example 'c'.
A character literal may be prefixed with the letter L, for example
L'c'. A character literal without the L prefix is an ordinary
character literal or a narrow character literal. A character literal
with the L prefix is a wide character literal. An ordinary character
literal that contains more than one character or escape sequence
(excluding single quotes ('), backslashes (\) or new-line characters)
is a multicharacter literal.
Character literals have the following form:
             .---------------------.
             V                     |
>>-+---+--'----+-character-------+-+--'------------------------><
   '-L-'       '-escape_sequence-'
At least one character or escape sequence must appear in the character
literal. The characters can be from the source program character set,
excluding the single quotation mark, backslash and new-line symbols. A
character literal must appear on a single logical source line.
In C, a character literal has type int.

Related

Escape sequence in C in string

In this code:
int main()
{
char str[]= "geeks\nforgeeks";
char *ptr1, *ptr2;
ptr1 = &str[3];
ptr2 = str + 5;
printf("%c", ++*str - --*ptr1 + *ptr2 + 2);
printf("%s", str);
getchar();
return 0;
}
Why does the compiler interpret \n as an escape sequence and not as two characters viz, \ and n?
On the other hand, this program does not comment out hello.
int main()
{
char str[]= "geeks/*hello*/geeks";
printf ("%s",str);
return 0;
}
Why does the compiler interpret \n as an escape sequence and not as
two characters viz, \ and n?
By the definition of the escape sequence.
The C Standard (5.2.2 Character display semantics)
2 Alphabetic escape sequences representing nongraphic characters in
the execution character set are intended to produce actions on display
devices as follows:
//...
\n (new line) Moves the active position to the initial position
of the next line.
If you want to have two separate characters \ and n then you should write for example
char str[]= "geeks\\nforgeeks";
Now there are two separate characters one of which is represented by the escape sequence '\\' and other by the symbol 'n'.
As for your second question:
On the other hand, this program does not comment out hello.
char str[]= "geeks/*hello*/geeks";
Within a string literal, the symbols /* and */ do not form a comment. They are simply elements of the string literal.
The C Standard (6.4.9 Comments)
1 Except within a character constant, a string literal, or a
comment, the characters /* introduce a comment. The contents of such a
comment are examined only to identify multibyte characters and to find
the characters */ that terminate it.
But that is inside a string, so how does the compiler know?
The quotes below are sufficient to understand what an escape sequence is.
From Wikipedia
In C, all escape sequences consist of two or more characters, the first of which is the backslash, \ (called the "Escape character");
Also from C11 Standard
In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.

Why might a string literal not be a string?

I'm struggling with this part in the C standard about string literals, especially the second part of it:
"In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. 80)"
"80) A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
Source: ISO/IEC 9899:2018 (C18), §6.4.5/6, Page 51
I don't understand the explanation - "because a null character can be embedded in it by a \0 escape sequence.".
To look at the referenced section §7.1.1., regarding the definition of a "string", it is stated:
"A string is a contiguous sequence of characters terminated by and including the first null character."
Source: ISO/IEC 9899:2018 (C18), §7.1.1/1, Page 132
I wondered whether the emphasis lies on the "can": that a string literal does not have to include an embedded null character, while a string must.
But then I ask myself: how can a string literal be used as a string if it has no terminating null character to mark the end of the string (as string-handling functions require)?
I'm totally drawing a blank at the moment.
Note: I'm aware that a string literal is typically stored in read-only memory and must not be modified, and that a string is a generic term for a sequence of characters terminated by NUL, which may or may not be mutable.
Thus, my question is not: "What is the difference between a string and a string literal?"
My Question is:
Why/How can a string-literal not be a string?
and, according to my concerns, so far:
Is it true, that a string literal can have the NUL byte omitted?
I wanted to ask this question myself, but shortly before posting it I figured it out. My confusion was caused by the slightly misleading wording of the quote.
I decided not to delete the question's draft, as it could be useful for future readers, and to provide a Q&A instead.
Feel free to comment and hint.
Related stuff:
What is the difference between char s[] and char *s?
What is the type of string literals in C and C++?
Are string literals const?
"Life-time" of a string literal in C
You're overthinking it.
"A string is a contiguous sequence of characters terminated by and including the first null character."
Source: ISO/IEC 9899:2018 (C18), §7.1.1/1, Page 132
says that a "string" only extends up to the first null character. Characters that may exist after the null are not part of the string. However
"80) A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
makes it clear that a string literal may contain an embedded null. If it does, the string literal AS A WHOLE is not a string; the string is just the prefix of the string literal up to and including the first null.
Let's take a look at the definition of the term "string literal" in the same section of C18, §6.4.5/3:
"A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz"."
According to that, a string literal consists only of the characters enclosed in quotation marks, the bare string content. It does not have an appended \0. The null byte is appended later, during translation, as stated in §6.4.5/6:
"In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. 80)"
Let's look at an example:
"foo" is a string literal, but not a string because "foo" does not contain an embedded null character.
"foo\0" is a string literal and a string because the literal itself contains a null character at the end of the character sequence.
Note that you don't need to explicitly insert a null character at the end of a string literal to make it a string. As already said, one is implicitly appended during program translation.
That means
const char *s = "foo";
yields the same string as
const char *s = "foo\0";
although the second array is one byte longer, because a terminating null is still appended after the explicit one.
I admit that the sentence:
"A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
is a little confusing and illogical in the context. It would be better phrased:
"A string literal might not be a string (see 7.1.1), because a null character might not (OR is not required to) be embedded in it by a \0 escape sequence."
or alternatively:
"A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
As @EricPostpischil pointed out in his comment, the meaning of the footnote is probably quite different.
It means that if a string literal contains a null character somewhere other than at the end, the literal as a whole is not a string, because a string ends at its first null character.
For example:
The string literal
"foo\0bar"
is not a string, as it contains the first null character embedded inside of the string literal, but not at the end of it.

Multi character constant warning for escaped \t

If I write putchar('\\t'); while trying to print "\t" instead of an actual tab, I get the multi character constant warning. On the other hand, if I write putchar('\\'); I get no warning. Upon looking in the ASCII table, there is no character '\\', only '\'. So why is there no warning? Why is '\\' one character but '\\t' is more than one? Can a backslash only be used to escape one following character?
You cannot print \ and t with one putchar invocation, since putchar writes exactly one character to the standard output. Use two calls:
putchar('\\');
putchar('t');
Another option would be to use the string "\\t" with fputs:
fputs("\\t", stdout);
There is no warning for '\\' because that is how you write the character constant for the character \. On ASCII this is synonymous with '\134' and '\x5c'.
From C11 6.4.4.4 paragraphs 2 and 4:
2
An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'. [...] With a few exceptions detailed later, the elements of the sequence are any members of the source character set; they are mapped in an implementation-defined manner to members of the execution character set.
[...]
4
The double-quote " and question-mark ? are representable either by themselves or by the escape sequences \" and \?, respectively, but the single-quote ' and the backslash \ shall be represented, respectively, by the escape sequences \' and \\.
The reason you get a warning for '\\t' is that its value is wholly implementation-defined. C11 J.3.4 lists the following as implementation-defined behaviour:
The value of an integer character constant containing more than one character or containing a character or escape sequence that does not map to a single-byte execution character (6.4.4.4).
Since '\\' contains an escape sequence that maps to a single-byte execution character \, there are no implementation-defined pitfalls and nothing to warn about; but '\\t' contains two characters, \ and t, and it would not do what you want portably.
\\ is one character, t is one character, so that is clearly two characters.
\\ is an escape sequence, just like \t; it means \.
If you want to print the two characters \ and t, you clearly need either two calls to putchar() or a function that takes a string argument "\\t".
https://en.wikipedia.org/wiki/Escape_sequences_in_C#Table_of_escape_sequences

character string literal and string literal in standard?

I am confused by these four terms:
character string literal
character constants
string literal
multibyte character sequence
And reading this quote in C Standard:
A character string literal need not be a string (see 7.1.1), because
a null character may be embedded in it by a \0 escape sequence.
What is meant by the first part ?
A string-literal is
either a character string literal, e.g. "abc";
or UTF-8 string literal, e.g. u8"abc";
or wide string literal, e.g. L"abc".
From the standard (emphasis mine):
A character string literal is a sequence of zero or more multibyte characters enclosed in
double-quotes, as in "xyz". A UTF−8 string literal is the same, except prefixed by u8.
A wide string literal is the same, except prefixed by the letter L, u, or U.....
In translation phase 7, a byte or code of value zero is appended to each multibyte
character sequence that results from a string literal or literals. 78)
78) A string literal need not be a string (see 7.1.1), because a null character may be embedded in it by a
\0 escape sequence.
A string is a contiguous sequence of characters terminated by and including the first null
character.
So a string literal may have \0 also in the middle or even at the beginning, for instance "a\0b" or "\0ab". I think this is what the footnote is saying.
A character constant is a c-char-sequence (usually a single character) in single quotes, with a possible prefix L/u/U.
An integer character constant is a sequence of one or more multibyte characters enclosed
in single-quotes, as in 'x'. A wide character constant is the same, except prefixed by the
letter L, u, or U.
So the terminology is not very symmetric, IMO. E.g. wide character constant is a particular case of character constant. However both character string literal and wide string literal belong to string literals.

How does C compare a character variable against a string?

The following code is completely OK in C but not in C++. In the following code the if condition is always false. How does C compare a character variable against a string?
int main()
{
char ch='a';
if(ch=="a")
printf("confusion");
return 0;
}
The following code is completely ok in C
No, Not at all.
In your code
if(ch=="a")
is essentially trying to compare the value of ch with the base address of the string literal "a". This is meaningless and useless.
What you want here, is to use single quotes (') to denote a char literal, like
if(ch == 'a')
NOTE 1:
To elaborate on the difference between single quotes for char literals and double quotes for string literals:
For char literal, C11, chapter §6.4.4.4
An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'
and, for string literal, chapter §6.4.5
A character string literal is a sequence of zero or more multibyte characters enclosed in
double-quotes, as in "xyz".
NOTE 2:
That said, the recommended signature of main() is int main(void).
I wouldn't say the code is okay in either language.
'a' is a single character. It is actually a small integer, having as its value the value of the given character in the machine's character set (almost invariably ASCII). So 'a' has the value 97, as you can see by running
char c = 'a';
printf("%d\n", c);
"a", on the other hand, is a string. It is an array of characters, terminated by a null character. In C, arrays are almost always referred to by pointers to their first element, so in this case the string constant "a" acts like a pointer to an array of two characters, 'a' and the terminating '\0'. You could see that by running
char *str = "a";
printf("%d %d\n", str[0], str[1]);
This will print
97 0
Now, we don't know where in memory the compiler will choose to put our string, so we don't know what the value of the pointer will be, but it's safe to say that it will never be equal to 97. So the comparison if(ch=="a") will always be false.
When you need to compare a character and a string, you have two choices. You can compare the character to the first character of the string:
if(c == str[0])
printf("they are equal\n");
else printf("confusion\n");
Or you can construct a string from the character, and compare that. In C, that might look like this:
char tmpstr[2];
tmpstr[0] = c;
tmpstr[1] = '\0';
if(strcmp(tmpstr, str) == 0)
printf("they are equal\n");
else printf("confusion\n");
That's the answer for C. There's a different, more powerful string type in C++, so things would be different in that language.
There is a difference between 'a' (a character) and "a" (a string of two characters, a and \0). The comparison ch=="a" will evaluate to false because in this expression "a" is converted to a pointer to its first element, and that address is a pointer value, not a character code.
Change it to
if(ch=='a')
