strlen - the length of the string is sometimes increased by 1 - c

I'm doing some C puzzle questions. In most cases, I am able to find the right answer, but with that one I am having problems. I know the right answer by using the compiler, but I don't know the reason.
Have a look at the code:
char c[] = "abc\012\0x34";
What would strlen(c) return, using a Standard C compiler?
My compiler returns 4 when what I expected was 3.
I thought strlen() would search for the first occurrence of the null character, but somehow the result is one more than I expected.
Any idea why?

Let's write
char c[] = "abc\012\0x34";
with single characters:
char c[] = { 'a', 'b', 'c', '\012', '\0', 'x', '3', '4', '\0' };
The first \0 you see is not a null character on its own: it is the start of the octal escape sequence \012, which extends over the following octal digits.
Octal escape sequences are specified in section 6.4.4.4 of the standard (N1570 draft):
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
They consist of a backslash followed by one, two, or three octal digits. In paragraph 7 of that section, the extent of octal and hexadecimal escape sequences is given:
7 Each octal or hexadecimal escape sequence is the longest sequence of characters that can
constitute the escape sequence.
Note that while the length of an octal escape sequence is limited to at most three octal digits (thus "\123456" consists of five characters, { '\123', '4', '5', '6', '\0' }), hexadecimal escape sequences have unlimited length:
hexadecimal-escape-sequence:
\x hexadecimal-digit
hexadecimal-escape-sequence hexadecimal-digit
and thus "\x123456789abcdef" consists of only two characters ({ '\x123456789abcdef', '\0' }).

Related

Printing octal literal in printf

What is the proper way to print an octal literal? For example, the following works for the hex digit \x but not for the octal \0:
printf("\x66 \0102\n");
f 2
How can this be done?
An octal escape sequence consists of one to three digits; a four-digit sequence like \0102 is not supported. It is treated as two characters: \010 and 2.
What you want may be printf("\x66 \102\n");. This will print f B if ASCII is used.
Quote from N1570 6.4.4.4 Character constants:
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
The octal digits that follow the backslash in an octal escape sequence are taken to be part
of the construction of a single character for an integer character constant or of a single
wide character for a wide character constant. The numerical value of the octal integer so
formed specifies the value of the desired character or wide character.

char array initialization with octal constant

I saw a comment that said initialization of a char array with "\001" would put a nul as the first character. I have seen where \0 does set a nul.
The unedited comment:
char input[SIZE] = ""; is sufficient initialization. while ( '\001' == input[0]) doesn't do what you think it is doing if you have initialized input[SIZE] = "\001"; (which creates an empty-string with the nul-character as the 1st character.)
This program
#include <stdio.h>
#define SIZE 8
int main(void) {
    char input[SIZE] = "\001";
    if ('\001' == input[0]) { // also tried 1 == input[0]
        printf("octal 1\n\n");
    }
    else {
        printf("empty string\n");
    }
    return 0;
}
running on Linux, compiled with gcc, outputs:
octal 1
so the first character is 1 rather than '\0'.
Is this the standard behavior or just something with Linux and gcc? Why does it not set a nul?
The behavior of the code you present is as required by the standard. In both string literals and integer character constants, octal escapes may contain one, two, or three digits, and the C standard specifies that
Each octal [...] escape sequence is the longest sequence of
characters that can constitute the escape sequence.
(C2011, 6.4.4.4/7)
In this context it is additionally relevant that \0 is an octal escape sequence, not a special, independent code for the null character. The wider context of the above quotation will make that clear.
In the string literal "\001", the backslash is followed by three octal digits, and an octal escape can have three digits, therefore the escape sequence consists of the backslash and all three digits. The first character of the resulting string is the one with integer value 1.
If for some reason you wanted a string literal consisting of a null character followed by the decimal digits 0 and 1, then you could either express the null with a full three-digit escape,
"\00001"
or split it up like so:
"\0" "01"
C will join adjacent string literals to produce the wanted result.
I saw a comment that said initialization of a char array with "\001" would put a nul as the first character.
That comment was in error.
From 6.4.4.1 Integer constants, paragraph 3, emphasis mine:
An octal constant consists of the prefix 0 optionally followed by a sequence of the digits 0 through 7 only.
But what we are looking at here is not an integer constant at all. What we have here is, actually, an octal escape sequence. And that is defined as follows (in 6.4.4.4 Character constants):
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
The definition -- both for integer constants as well as character constants -- is "greedy", as elaborated by paragraph 7:
Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.
That means, if the first octal digit is followed by something that could be an octal digit, that next character is considered an octal digit belonging to that constant (to a maximum of three in the case of character constants -- not so for integer constants!).
Hence, your "\001" is, indeed, a character with the value 1.
Note that, while octal character constants run up to three characters maximum (making such a constant quite safe to use if padded with leading zeroes as necessary to get a length of three digits), hexadecimal character constants run as long as there are hexadecimal digits (potentially overflowing the char type they are meant to initialize).
See http://c0x.coding-guidelines.com/6.4.4.4.html
Octal sequence is defined as:
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
and item 873:
The octal digits that follow the backslash in an octal escape sequence
are taken to be part of the construction of a single character for an
integer character constant or of a single wide character for a wide
character constant.
also item 877:
Each octal or hexadecimal escape sequence is the longest sequence of
characters that can constitute the escape sequence.
Therefore the behaviour is correct. "\001" should not have null byte at position 0.

What do the characters starting with '\' and followed by a number e.g '\234' mean?

I have been looking at the source of an app when I came across these characters e.g '\233', '\234', '\235' and when I print them, I get garbage characters.
\233 is the character with the octal code 233.
In decimal this is 2×8² + 3×8 + 3 = 155
The meaning depends on the character set being used. Codes beyond 127 are not defined in 7-bit ASCII.
As advertised by DevSolar:
http://rootdirectory.de/chrome/site/encoding.html might be helpful
They are octal-escape-sequences, which are used to represent specific byte values in a character constant or string literal.
C11, 6.4.4.4 Character constants:
character-constant:
' c-char-sequence '
L' c-char-sequence '
u' c-char-sequence '
U' c-char-sequence '
c-char-sequence:
c-char
c-char-sequence c-char
c-char:
any member of the source character set except the single-quote ', backslash \, or new-line character
escape-sequence
escape-sequence:
simple-escape-sequence
octal-escape-sequence
hexadecimal-escape-sequence
universal-character-name
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
An octal escape sequence is defined as a backslash followed by one to three octal digits (0-7).
To avoid getting a following decimal digit interpreted as part of the octal sequence, it is common practice to pad an octal escape sequence with leading zeroes. As opposed to octal integer constants, though, a leading zero is not required.
Note that the semantic meaning of such an escape sequence depends on the context. I could write "Fu\303\237", and it could mean "Fuß" (in UTF-8) or "FuÃŸ" (in CP-1252), depending on which encoding I assume the string to be in. What I cannot do, portably, is write either of those strings in the source directly, because the interpretation of any character not in the source character set (i.e., ASCII-7 without dollar, at-sign, and backtick) is implementation-defined. While most compilers today can be made to interpret string literals as UTF-8, octal escape sequences are the portable way.
FWIW, there are also hexadecimal escape sequences; however they are not as well-defined: They greedily gobble as many "hex digits" as they can get, even beyond what a char can hold; so if the next character in the string literal is one of [0-9a-fA-F], you have no way of "terminating" the hex escape before that (1); this is why octal sequences are preferred by some.
(1): As M.M pointed out, you could split your string literal in two ("\xAB" "CD").
As for what the various character values could stand for, in which encoding, I recommend a good code table. This one I whipped up myself, as I could not find any existing one listing all the information I needed in one page.
It's an escape sequence, for octal values. The syntax is \nnn.
You can read more about escape sequences in c here.
Garbage is printed because 233 in octal is 155 in decimal, 234 is 156, and 235 is 157. None of these represents an ASCII character.
That notation is an octal escape sequence, which specifies the value of a character constant as an octal number.
Quoting C11, chapter §6.4.4.4, Character constants
The single-quote ', the double-quote ", the question-mark ?, the backslash \, and
arbitrary integer values are representable according to the following table of escape
sequences:
...
octal character \octal digits
and, regarding the values,
The octal digits that follow the backslash in an octal escape sequence are taken to be part
of the construction of a single character for an integer character constant or of a single
wide character for a wide character constant. The numerical value of the octal integer so
formed specifies the value of the desired character or wide character.

Understanding output of printf containing backslash (\012)

Can you please help me to understand the output of this simple code:
const char str[10] = "55\01234";
printf("%s", str);
The output is:
55
34
The character sequence \012 inside the string is interpreted as an octal escape sequence. The value 012 interpreted as octal is 10 in decimal, which is the line feed (\n) character in ASCII.
From the Wikipedia page:
An octal escape sequence consists of \ followed by one, two, or three octal digits. The octal escape sequence ends when it either contains three octal digits already, or the next character is not an octal digit.
Since your sequence contains three valid octal digits, that's how it's going to be parsed. It doesn't continue with the 3 from 34, since that would be a fourth digit and only three digits are supported.
So you could write your string as "55\n34", which is more clearly what you're seeing and which would be more portable since it's no longer hard-coding the newline but instead letting the compiler generate something suitable.
\012 is an escape sequence which represents octal code of symbol:
012 = 10 = 0xa = LINE FEED (in ASCII)
So your string looks like 55[LINE FEED]34.
LINE FEED character is interpreted as newline sequence on many platforms. That is why you see two strings on a terminal.
\012 is an octal escape sequence; as others have already stated, it denotes the newline character in ASCII.
(As chux correctly commented, the character may be different if ASCII isn't the character set in use. Either way, in this notation it is an octal escape sequence.)
This is what the standard intends; C99 (ISO/IEC 9899) says in
6.4.4.4 Character constants
[...]
3 The single-quote ', the double-quote ", the question-mark ?, the backslash \, and
arbitrary integer values are representable according to the following table of escape
sequences:
single quote '           \'
double quote "           \"
question mark ?          \?
backslash \              \\
octal character          \octal digits
hexadecimal character    \x hexadecimal digits
And the range it gets bound to:
Constraints
9 The value of an octal or hexadecimal escape sequence shall be in the range of
representable values for the type unsigned char for an integer character constant, or
the unsigned type corresponding to wchar_t for a wide character constant.

Size of escaped characters in C

Why does the following program output 5?
#include <stdio.h>
int main(void)
{
    char str[] = "S\065AB";
    printf("\n%zu", sizeof(str)); /* sizeof yields a size_t, so use %zu */
}
Short answer: See David Heffernan's answer.
Long answer:
§ 6.4.4.4 of the C(99) standard specifies "character constants", which (among others) include simple escape sequences (e.g. '\n', '\\'), octal escape sequences (e.g. '\0'), hexadecimal escape sequences (e.g. '\x0f'), and universal character names (e.g. '\u0112').
The backslash in your example introduces one of these escape sequences. The following octal digit ([0-7]) makes it an octal escape sequence (hex would be '\x', universal would be '\u', a simple escape sequence would be '\['"?\abfnrtv]').
That octal escape sequence is terminated once three octal digits are consumed, or a non-octal-digit is encountered.
I.e., '\065' is equivalent to '\x35' or (decimal) 53, which is (coincidentally) '5' on the ASCII table - a single character, anyway.
It's the size of the array which has five elements: S, \065, A, B, \0

Resources