I've been going through strlen and sizeof for strings (char arrays) and I don't quite get one thing.
I have the following code:
#include <stdio.h>
#include <string.h>

int main(void) {
    char str[] = "gdb\0eahr";
    printf("sizeof=%zu\n", sizeof(str));  /* size_t is printed with %zu, not %u */
    printf("strlen=%zu\n", strlen(str));
    return 0;
}
The output of the code is:
sizeof=9
strlen=3
At first I was pretty sure that 2 separate characters \ followed by 0 wouldn't actually act as a NUL (\0) but I managed to figure that it does.
The thing is that I have no idea why sizeof shows 9 and not 10.
Since sizeof counts the number of bytes the object occupies, why doesn't it count a byte for the \?
In a following example:
char str[]="abc";
printf("sizeof=%u\n",sizeof(str));
That would print 4 because of the NUL terminating the array, so why isn't the \ counted?
In a character or string constant, the \ character marks the beginning of an escape sequence, used to represent character values for which there isn't a symbol in the source character set. For example, the escape sequence \n represents the newline character, \b represents the backspace character, \0 represents the zero-valued character (which is also the string terminator), etc.
In the string literal "gdb\0eahr", the escape sequence \0 maps to a single 0-valued character; the actual contents of str are {'g', 'd', 'b', 0, 'e', 'a', 'h', 'r', 0}.
It seems you already have the answer:
At first I was pretty sure that 2 separate characters "\" followed by "0" wouldn't actually act as a NULL "\0" but I managed to figure that it does.
The sequence \0 is an octal escape sequence for the byte 0. So while there are two characters in the source code to denote it, it translates to a single byte in the string.
So you have 7 alphabetic characters, a null byte in the middle, and a null byte at the end. That's 9 bytes.
Why should char str[]="gdb\0eahr"; be 10 bytes with sizeof operator? It is 9 bytes because there are 8 string elements + trailing zero.
\0 is only 1 character, not 2. \'s purpose is to escape characters, therefore you might see some of these: \t, \n, \\ and others.
strlen returns 3 because the string terminator is at position str[3].
A single \ acts as the escape character and is not itself counted in the string's size. If you want a literal \ in your string, you have to write it twice, as \\; that pair then produces one printable \ character.
The C compiler scans text strings as part of compiling the source code, and during the scan any escape sequences are turned into single characters. The backslash symbol (\) indicates the start of an escape sequence.
There are several formats for escape sequences. The most basic is a backslash followed by one of several special letters. These two characters are then translated into a single character. Some of these are:
'\n' is turned into a line feed character (0x0A or decimal 10)
'\t' is turned into a tab character (0x09 or decimal 9)
'\r' is turned into a carriage return character (0x0D or decimal 13)
'\\' is turned into a backslash character (0x5C)
This escape sequence idea was used back in the old days so that when a line of text was printed to a teletype machine or printer or a CRT terminal, the programmer could use these and other special command code characters to set where the next character would be printed or to cause the device to do some physical action like ring a bell or feed the paper to the next line.
The escape character also allowed you to embed a double quote (") or a single quote (') into a text string so that you could print text that contain quote marks.
In addition to the above special sequences of backslash followed by a letter, there is also a way to specify any character by using a backslash followed by one to three octal digits (0 through 7). So you can specify a line feed character either with '\n' or with '\12', where 12 is the octal representation of the hexadecimal value A or the decimal value 10.
Then the ability to use a hexadecimal escape sequence was introduced with the backslash followed by the letter x followed by one or more hexadecimal digits. So you can write a line feed character with '\n' or '\12' or '\xa'.
See also Escape sequences in C in Wikipedia.
During my GDB debugging sessions, I've noticed that null terminator characters, denoting the end of a string, and shown as \0 in C files, show up as \000 in GDB when displaying the value of a variable storing such a character.
(gdb) print buffer[10]
$2 = 0 '\000'
Can anyone tell me why that is?
GDB seems to always use 3 octal digits to display character escapes, and for a good reason. Consider the following string:
const char *str = "\1\2\3\4\5";
then
(gdb) p str
$1 = 0x555555556004 "\001\002\003\004\005"
This is because the C standard says that an octal escape sequence consists of at most 3 octal digits. Thus if you write:
"\0a"
it means string literal of two characters - null followed by a. But if you write
"\01"
it means a string literal of one character: ASCII code 1, the Start of Heading control character. In fact, the shortest way to write ASCII null followed by the digit 1 (i.e. ASCII code 49) in a string literal is "\0001". The other possibilities are "\0" "1" using string concatenation, separate escapes "\0\61", or hex escapes \x..., none of which is shorter.
So by always using 3 octal digits, GDB can produce consistent output for strings - such that when copied to a C program will result in the same string during runtime. Furthermore the output routine is simpler because it does not need to consider the following character.
The notation '\0' is a character constant written as an octal escape sequence.
An octal escape sequence may contain at most three octal digits.
In C (and similar languages), a string is declared for example as "abc". Another example is "ab\"c". I have a file which contains these strings. That is, the file's contents are "abc" or "ab\"c" etc. Any literal string that can be defined in a .c file can be defined in the file I'm reading.
These strings can be malformed. E.g. "abc (no closing quotes). What is the best way to write a parser to make sure the string in the file is a valid C literal string? (so that if I copy the file contents and paste them after char* str =, the resulting expression will be accepted by the compiler when at the top of a function)
The strings are each in a separate line.
Alternatively, you can think of this as wanting to parse lines that declare literal string variables. Imagine I'm grepping a big file and use char\* .* = (.*);$ and want to make sure the part in the parenthesis will not cause compilation errors;
The grammar for C string literals is given in C 2018 6.4.5. Supposing you want to parse only plain strings, not those with encoding prefixes such as u in u"xyz", then the grammar for a string-literal is " s-char-sequence(opt) ", where “opt” means optional and s-char-sequence is one or more s-char tokens. An s-char is any member of the source character set except ", \ or the new-line character or is an escape-sequence.
The source character set includes at least the Latin alphabet (26 letters A-Z) in uppercase and lowercase, the ten digits, space, horizontal tab, vertical tab, form feed, and these characters:
!"#%&'()*+,-./:;<=>?[\]^_{|}~
However, a C implementation may include other characters in its source character set. Therefore, any character found in the string other than ", \, or the new-line character must be accepted as potentially valid in some C implementation.
An escape-sequence is defined in 6.4.4.4 1 to be one of:
\ followed by one of ', ", ?, \, a, b, f, n, r, t, or v,
\ followed by one to three octal digits, or
\x followed by one or more hexadecimal digits, or
a universal-character-name.
Paragraph 7 says:
Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.
A universal-character-name is defined in 6.4.3 to be \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits. Paragraph 2 limits these:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
This part of the C grammar looks fairly simple to parse:
A string literal must start with a ".
If the next character is anything other than ", \, or a new-line character, then accept it.
If the next character is \ and it is followed by one of the single characters listed above, accept it and the following character.
If the next character is \ and it is followed by one to three octal digits, accept it and up to three octal digits.
If the next two characters are \x and are followed by a hexadecimal digit, accept them and all the hexadecimal digits that follow.
If the next two characters are \u and are followed by four hexadecimal digits, accept those six characters. However, if the value is one of those prohibited in the constraint above, this is not a valid C string literal.
If the next two characters are \U and are followed by eight hexadecimal digits, accept those ten characters. However, if the value is one of those prohibited in the constraint above, this is not a valid C string literal.
Repeat the above until the next character is not accepted.
If the next character is not ", this is not a valid C string literal.
If the next character is ", accept it.
If that is the end of the line read from the file, it is a valid C string literal. Otherwise, it is not.
I have the following C code:
#include <stdio.h>
#include <strings.h>
int main(void){
char * str = "\012\0345";
char testArr[8] = {'\0','1','2','\0','3','4','5','\0'};
printf("%s\n",str);
printf("**%s**",testArr);
return 0;
}
I'm having trouble understanding the results. I have searched, but I still don't see why a null character at the start of a string and another in the middle would cause only "5" to display. Also, when I assign each character to the array testArr and then display that array, the result is different, even though the string and the array should hold the same characters. With the string str, does the code display "5" because the null characters overwrite what is in memory?
Also, with the array I created using the same characters, nothing displays of the data contained in array testArr. Is it that once the first null is encountered for some reason everything else is ignored? If so, why doesn't the same behavior occur with string str which contains the same characters?
An octal escape sequence is \ followed by one to three octal digits, per C 2018 6.4.4.4 1. Per 6.4.4.4 7: “Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.” So, when the compiler sees "\012\0345", it interprets it as the sequence \012 (which is ten), the sequence \034 (which is twenty-eight), and the character 5.
To represent the string you intended, you could use "\00012\000345". Since an octal escape sequence stops at three digits, this is interpreted as the sequence \000, the characters 1 and 2, the sequence \000, and the characters 3, 4, and 5. (A null terminating character will also be appended automatically.)
When you printed "\012\0345", the characters with codes ten and twenty-eight were printed but had no visible effect. (Your C implementation likely uses ASCII, in which case they are control characters. \012 is new-line, so it should have caused a line advance, but you probably did not notice that. \034 is a file-separator control character, which likely has no effect when printed to a regular terminal display.)
When you printed testArr, the null character in the first position ended the string.
The idea is to try to create escape sequences the 'human' way. For example, I use two characters to create '\n': the '\' and the 'n'.
What I'm thinking about is char array[3]={'\\','n','\0'};
so I can change 'n' character and still use it as an escape sequence.
When I printf(array) it now prints:
\n
and I'd like it to go to next line.
For example:
what if I wanted to check manually what every letter in alphabet does when used as escape sequence with a loop?
for (char c = 'a'; c <= 'z'; c++)
{
    /* create escape sequence with the letter c */
    /* print that escape sequence and see what it does */
}
It's not an assignment and has no practical use (at least not yet); it's just a theoretical question that I couldn't find an answer to anywhere, nor figure out myself.
The escape sequence represents a single character and is evaluated at compile time. You cannot have a literal string interpreted as an escape sequence at run time.
For example '\n' is a newline (or line-feed character - 0x0A in ASCII)
Note that:
char array[3]={'\\','n','\0'};
is equivalent to:
char array[3] = "\\n" ;
so perhaps unsurprisingly when you printf(array) it prints \n - that is what you have asked it to do.
Undefined escape sequences draw a compiler diagnostic (some compilers reject them; others, such as GCC, only warn), so you might simply write:
char c = '\a';
c = '\b';
... // etc.
and see which lines the compiler baulks at. However that is not the complete story because some escape sequences require operands, for example \x on its own has no meaning, whereas \xab is the character represented by 0xab (171 decimal). Others are not even letters. Most are related to white-space, and their effect may be dependent on the terminal or console capabilities of the execution platform. So a naive investigation may not generate the results you seek, because it does not account for the language semantics or platform capabilities.
All supported escape sequences are in fact well defined - you'll find few surprises except perhaps those related to platform capabilities (for example if your target has no means to generate a beep, \a will have no useful effect):
\a Beep
\b Backspace
\f Form-feed
\n Newline
\r Carriage return
\t Horizontal tab
\v Vertical tab
\\ Backslash
\' Single quotation mark
\" Double quotation mark
\0 ASCII 0x00 (null terminator)
\ooo Octal representation
\xdd Hexadecimal representation
What about writing your own printf()?
In it you can check for a '\' followed by an 'n', print only the characters from the start of the array up to that pair, and then call printf("\n");
Regards
Consider this line of text:
First line of text.
If a character array string is used to load the first TEN characters in the array it will output as:
First lin'\0'
First contains 5 letters, lin contains 3 letters. Where are the other two characters being used?
Is \0 considered two characters?
Or is the space between the words considered a character, thus '\0' is one character?
Yes, space is a character. In ASCII encoding it has code number 32.
The space between the two words has ASCII code 0x20 (040 in octal, or 32 in decimal); it occupies one byte.
The null at the end of the string, ASCII code 0x00 (0 in both octal and decimal) occupies the other byte.
Note that the space bar is simply the key on the keyboard that generates a space character when typed.
'\0' is the null-terminator, it is literally the value zero in all implementations.
'\0' is considered a single character because the backslash \ means to escape a character. '\0' and '0' thus are both single characters, but mean very different things.
Note that space is represented by a different ASCII value.
Note that C has no "\s" escape sequence; a space is written literally, as " " in a string or ' ' as a character.