Handling Backslash Escape Sequences in C

I was looking at an example in my course slides, which didn't come with much explanation.
char getchar_escaped(void)
{
char c;
if ((c = getchar()) != '\\') return c;
switch ((c = getchar())) {
case '\\':
return '\\';
case 'n':
return '\n';
default:
return c;
}
}
What exactly is happening in this code? How is this dealing with newlines and double slashes?

In C character string literals and single-character constants, there are a number of 'special' characters that cannot be readily represented in source code text. Examples are the newline character, the nul (terminator) character and the carriage-return.
The language allows us coders to include such characters by using escape sequences - which are entered using the backslash character (\) followed by a suitably-descriptive 'ordinary' character. So, we can specify the newline character using an 'escaped' "n", like this: char NewLine = '\n'; similarly, the nul and carriage-return characters are represented by \0 and \r, respectively.
However, this convention causes a problem when we actually want to specify the backslash character itself! So, in order to do so, we specify an escape sequence where the second character is also a backslash; thus, the code char BackSlash = '\\'; assigns to BackSlash the value (probably ASCII, but not necessarily so) of the backslash itself.
In your code, the test after the first c = getchar() checks for an input backslash character, which, if found, signals the start of one of these "escape sequences". If it isn't found, we can simply return the actual character input. If we do detect the start of an escape sequence, we need to check the next character: if this is an "n" (case 'n':) we return the newline character (return '\n';); if it is another backslash (case '\\':) we return the backslash itself (return '\\';).
Other standard escape sequences aren't detected in your code, but it would be trivial to add further checks for these.
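As a rough illustration of adding further checks, here is a minimal sketch. The names unescape_char and getchar_escaped_ext are made up for this example, the set of sequences handled (\n, \t, \r, \0) is just a sample, and returning a lone trailing backslash unchanged is an assumption rather than anything the slides specify. It also uses int instead of char so that EOF can be told apart from real characters.

```c
#include <stdio.h>

/* Maps the character that follows a backslash to the character it stands for.
   Only a handful of sequences are covered; anything else is returned as-is,
   which matches the default: branch of the original function. */
int unescape_char(int c)
{
    switch (c) {
    case 'n': return '\n';
    case 't': return '\t';
    case 'r': return '\r';
    case '0': return '\0';
    default:  return c;      /* includes '\\', '\'' and '"' */
    }
}

/* Like getchar_escaped(), but returns int so that EOF can be reported. */
int getchar_escaped_ext(void)
{
    int c = getchar();
    if (c != '\\')
        return c;            /* ordinary character, or EOF */
    c = getchar();           /* character after the backslash */
    if (c == EOF)
        return '\\';         /* assumption: a lone trailing backslash stands for itself */
    return unescape_char(c);
}
```

Keeping the mapping in its own function means new sequences only need one extra case line.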
Please feel free to ask for further clarification and/or explanation.

\ has a special meaning: it usually alters the meaning of the next character. For example, \n means newline, which is an actual ASCII character. But since \ means "alter the next character", how can you write a literal \ character? By altering it with another \, that is, by writing \\. This means "take the literal \ character".
char getchar_escaped(void)
{
char c;
// read a char from the input; if it is not the '\' character, return it
if ((c = getchar()) != '\\') return c;
switch ((c = getchar())) { // read in another character
case '\\': return '\\'; // if it is a '\' character, return '\'
case 'n': return '\n'; // if it is an 'n', return the newline character: '\n'
default: // otherwise
return c; // just return the character that was read
}
}

There are two distinct uses of backslash escape sequences in the code you posted.
C uses backslash escape sequences as part of the grammar of the C language to represent certain character values in a character constant or a string literal. In a character constant or string literal, the sequence \\ represents a single backslash character, and the sequence \n represents a single newline character. There are several more of these backslash escape sequences in the C language. Consult a C reference for details.
The program's getchar_escaped function is reading characters from standard input and applying its own backslash escape rules that happen to match those of the C language itself as far as the sequences \\ and \n are concerned. If it is not currently reading a backslash escape sequence and it reads a backslash character, it reads the next character and returns a character corresponding to the backslash sequence (e.g. returning a newline character if the character following the backslash is n). (In fact, n is the only character that it doesn't map to an identical character. The special case for handling a backslash followed by a backslash is redundant.)

Related

Escape sequence in C in string

In this code:
int main()
{
char str[]= "geeks\nforgeeks";
char *ptr1, *ptr2;
ptr1 = &str[3];
ptr2 = str + 5;
printf("%c", ++*str - --*ptr1 + *ptr2 + 2);
printf("%s", str);
getchar();
return 0;
}
Why does the compiler interpret \n as an escape sequence and not as two characters viz, \ and n?
On the other hand, this program does not comment out hello.
int main()
{
char str[]= "geeks/*hello*/geeks";
printf ("%s",str);
return 0;
}
Why does the compiler interpret \n as an escape sequence and not as
two characters viz, \ and n?
By the definition of the escape sequence.
The C Standard (5.2.2 Character display semantics)
2 Alphabetic escape sequences representing nongraphic characters in
the execution character set are intended to produce actions on display
devices as follows:
//...
\n (new line) Moves the active position to the initial position
of the next line.
If you want to have two separate characters \ and n then you should write for example
char str[]= "geeks\\nforgeeks";
Now there are two separate characters: a backslash, written as the escape sequence \\, and the ordinary letter n.
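The difference between the two spellings can be checked with a small sketch like the following. The array names and the has_newline helper are invented for this example.

```c
#include <string.h>

/* One escape sequence: \n becomes a single newline character. */
const char with_newline[] = "geeks\nforgeeks";

/* \\ becomes a single backslash; the n after it is an ordinary letter. */
const char with_backslash[] = "geeks\\nforgeeks";

/* Returns 1 if s contains a newline character, 0 otherwise. */
int has_newline(const char *s)
{
    return strchr(s, '\n') != NULL;
}
```

Here strlen(with_newline) is 14 and strlen(with_backslash) is 15, and only the first array contains a newline character.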
As for your second question:
On the other hand, this program does not comment out hello.
char str[]= "geeks/*hello*/geeks";
Within a string literal, the symbols /* and */ do not form a comment. They are elements of the string literal.
The C Standard (6.4.9 Comments)
1 Except within a character constant, a string literal, or a
comment, the characters /* introduce a comment. The contents of such a
comment are examined only to identify multibyte characters and to find
the characters */ that terminate it.
But that is inside a string, so how does the compiler know?
The quotes below should be enough to understand what an escape sequence is.
From Wikipedia
In C, all escape sequences consist of two or more characters, the first of which is the backslash, \ (called the "Escape character");
Also from C11 Standard
In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.

Is there a way to escape the C string null terminator character?

I have a simple question that I couldn't find the answer for by googling -- is it possible to escape the C language's string null terminator '\0', so that we can include it in a string?
Note that a "string" with an embedded NUL is no longer a string. You cannot safely use it as an argument to functions declared in <string.h>, for example
char embeddednul[] = "zero\0one\0two\0"; // embeddednul[12] = embeddednul[13] = 0
printf("len: %zu\n", strlen(embeddednul)); // 4, not 13: strlen stops at the first 0
char tmp[1000] = {0};
strcpy(tmp, embeddednul); // copies 'z', 'e', 'r', 'o', and 0
char *p = embeddednul;
while (*p) {
while (*p) putchar(*p++); // prints zero
putchar('\n'); // then one
p++; // then two
}
Use the \ to escape the \ like so
printf("\\0");
You can embed a null character in a string.
But there is no way to escape the null character.
To "escape" something means to remove its usual or its special interpretation, or to give it some other interpretation.
In C string and character constants, the backslash character \ gives a special meaning to the character following it. For example, n is normally an ordinary letter, but \n is the newline character. 0 is normally an ordinary digit character, but \0 is the character with the numeric value 0, and is the NUL or end-of-string character.
So if you write something like
char *twostrings = "two\0strings";
you successfully embed a null character into the string. But you haven't "escaped the null character" — you have in fact escaped the zero, to turn it into a null character.
Now, what if you wanted to "escape the null character", that is, to remove its special meaning as the end-of-string terminator? The answer is no, there is no way to do that. A null character is always treated as the end of a string by any normal C function that deals with strings. Given the twostrings variable initialized as above, if you write
printf("%zu\n", strlen(twostrings));
you're going to get 3, because strlen stops at the first null character it finds. If you write
char onestring[10];
strcpy(onestring, twostrings);
printf("%s\n", onestring);
you're going to get "two", because strcpy stops at the first null character it finds. You'll get the same thing if you write
printf("%s\n", twostrings);
If you tried to "escape the null character" by writing
twostrings = "two\\0strings";
what would actually happen is that you would escape the backslash, removing its special meaning. You'd get an actual backslash character and an actual 0 character in the string, with no extra null character at all.
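To see the difference between the two literals without relying on the string functions, one can look at the raw bytes of the array. The following is only a sketch; count_nul_bytes is a made-up helper name.

```c
#include <stddef.h>

/* Counts the zero-valued bytes in the first n bytes of buf.
   Unlike strlen(), it does not stop at the first one. */
size_t count_nul_bytes(const char *buf, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] == '\0')
            count++;
    return count;
}
```

Called as count_nul_bytes("two\0strings", sizeof "two\0strings") it returns 2 (the embedded null plus the implicit terminator), while for "two\\0strings" it returns 1, because that literal holds a backslash and a digit 0 rather than a null character.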

sizeof(string) not including a "\" sign

I've been going through strlen and sizeof for strings (char arrays) and I don't quite get one thing.
I have the following code:
int main() {
char str[]="gdb\0eahr";
printf("sizeof=%u\n",sizeof(str));
printf("strlen=%u\n",strlen(str));
return 0;
}
the output of the code is:
sizeof=9
strlen=3
At first I was pretty sure that 2 separate characters \ followed by 0 wouldn't actually act as a NUL (\0) but I managed to figure that it does.
The thing is that I have no idea why sizeof shows 9 and not 10.
Since sizeof counts the amount of used bytes by the data type why doesn't it count the byte for the \?
In a following example:
char str[]="abc";
printf("sizeof=%u\n",sizeof(str));
that would print out "4" because of the NUL value terminating the array so why is \ being not counted?
In a character or string constant, the \ character marks the beginning of an escape sequence, used to represent character values for which there isn't a symbol in the source character set. For example, the escape sequence \n represents the newline character, \b represents the backspace character, \0 represents the zero-valued character (which is also the string terminator), etc.
In the string literal "gdb\0eahr", the escape sequence \0 maps to a single 0-valued character; the actual contents of str are {'g', 'd', 'b', 0, 'e', 'a', 'h', 'r', 0}.
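One way to confirm those contents is to format every byte of the array, terminators included, as hexadecimal. This is a sketch only: format_bytes is a hypothetical helper, and the specific values mentioned below assume an ASCII execution character set.

```c
#include <stdio.h>
#include <stddef.h>

/* Writes the n bytes of buf into out as space-separated two-digit hex values.
   Returns the number of characters written, or -1 if out is too small. */
int format_bytes(const char *buf, size_t n, char *out, size_t outsz)
{
    size_t used = 0;
    if (outsz == 0)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(out + used, outsz - used, "%s%02x",
                         i ? " " : "", (unsigned)(unsigned char)buf[i]);
        if (w < 0 || (size_t)w >= outsz - used)
            return -1;       /* output buffer too small */
        used += (size_t)w;
    }
    return (int)used;
}
```

For char str[] = "gdb\0eahr"; calling format_bytes(str, sizeof str, out, sizeof out) with a large enough out yields 67 64 62 00 65 61 68 72 00 on an ASCII system: nine bytes, with no byte for the backslash.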
It seems you already have the answer:
At first I was pretty sure that 2 separate characters "\" followed by "0" wouldn't actually act as a NULL "\0" but I managed to figure that it does.
The sequence \0 is an octal escape sequence for the byte 0. So while there are two characters in the source code to denote this, it is translated to a single byte in the string.
So you have 7 alphabetic characters, a null byte in the middle, and a null byte at the end. That's 9 bytes.
Why should char str[]="gdb\0eahr"; be 10 bytes with sizeof operator? It is 9 bytes because there are 8 string elements + trailing zero.
\0 is only 1 character, not 2. \'s purpose is to escape characters, therefore you might see some of these: \t, \n, \\ and others.
Strlen returns 3 because you have string termination at position str[3].
A single \ acts as the escape character and does not count toward the string's size on its own. If you want a literal \ in your string, you have to write it twice, as \\, which is then stored as a single \ character.
The C compiler scans text strings as a part of compiling the source code and during the scan any special, escape sequences of characters are turned into a single character. The symbol backslash (\) is used to indicate the start of an escape sequence.
There are several formats for escape sequences. The most basic is a backslash followed by one of several special letters. These two characters are then translated into a single character. Some of these are:
'\n' is turned into a line feed character (0x0A or decimal 10)
'\t' is turned into a tab character (0x09 or decimal 9)
'\r' is turned into a carriage return character (0x0D or decimal 13)
'\\' is turned into a backslash character (0x5C)
This escape sequence idea was used back in the old days so that when a line of text was printed to a teletype machine or printer or a CRT terminal, the programmer could use these and other special command code characters to set where the next character would be printed or to cause the device to do some physical action like ring a bell or feed the paper to the next line.
The escape character also allowed you to embed a double quote (") or a single quote (') into a text string so that you could print text that contain quote marks.
In addition to the above special sequences of backslash followed by a letter there was also a way to specify any character by using a backslash followed by one up to three octal digits (0 through 7). So you could specify a line feed character by either using '\n' or you could use '\12' where 12 is the octal representation of the hexadecimal value A or the decimal value 10.
Then the ability to use a hexadecimal escape sequence was introduced with the backslash followed by the letter x followed by one or more hexadecimal digits. So you can write a line feed character with '\n' or '\12' or '\xa'.
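The octal rule (one to three digits, each 0 through 7) can be mimicked at run time with a few lines. This is just an illustration of the rule, not how any particular compiler implements it, and parse_octal_digits is a made-up name.

```c
/* Reads up to three octal digits from s, as an octal escape would.
   Stores the resulting value in *value and returns how many digits were used
   (0 if s does not start with an octal digit). */
int parse_octal_digits(const char *s, int *value)
{
    int n = 0;
    int v = 0;
    while (n < 3 && s[n] >= '0' && s[n] <= '7') {
        v = v * 8 + (s[n] - '0');
        n++;
    }
    *value = v;
    return n;
}
```

Given "12" it uses two digits and stores 10, the same value as '\12' and '\xa'. Given "1234" it stops after three digits, which mirrors how "\1234" in a literal is the character with octal value 123 followed by the digit 4.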
See also Escape sequences in C in Wikipedia.

What is the difference between '\0' and '\n' in the C programming language?

What is the difference between the '\0' character and the '\n' character in the C programming language?
'\0' is the NUL character (ASCII 0), which is also the string terminator (C strings are NUL-terminated): if you have a string like "this is\0a string", the part after the '\0' will be ignored by the string functions (even though it is still present in the generated code).
'\n' is a newline (ASCII 10). It is noteworthy that in some circumstances, this newline character can actually be transformed. For example, on Windows, where the newline in files is indicated by the "\r\n" sequence (two bytes: ASCII 13, carriage return, followed by ASCII 10, line feed), if you write to a file (e.g. using fprintf()) a string containing a '\n' character, it will be automatically converted to a "\r\n" sequence if the file is open in text mode (which is generally the default).
'\0' is a null: this terminates a string. '\n' is a newline
'\0' is the NUL character, which indicates the end of a string in C (printf("%s") will stop printing at the first occurrence of \0 in the string).
'\n' is a newline, which will simply make the text continue on the next line when printing.
\0 is the null byte, used to terminate strings.
\n is the newline character, 10 in ASCII, used (on Unix) to separate lines.
'\0' is a character constant that is written as an octal-escape-sequence. Its value is 0. It is not the same as '0'. The latter has the value 48 in ASCII or 240 in EBCDIC.
'\n' is a character constant that is written as a simple-escape-sequence and denotes the newline character. Its value is 10 in ASCII.

Concatenating two characters to create escape sequence

Idea is to try to create escape sequences 'human' way. For example, I use two characters to create '\n', the '\' and 'n'.
What I'm thinking about is char array[3]={'\\','n','\0'};
so I can change 'n' character and still use it as an escape sequence.
When I printf(array) it now prints:
\n
and I'd like it to go to next line.
For example:
what if I wanted to check manually what every letter in alphabet does when used as escape sequence with a loop?
for (char ch = 'a'; ch <= 'z'; ch++)
{
/* create escape sequence with that letter */
/* print that escape sequence and see what it does */
}
It's not an assignment,has no practical use (at least not yet), but just a theoretical question that I couldn't find answer anywhere, nor figure it out myself.
The escape sequence represents a single character and is evaluated at compile time. You cannot have a literal string interpreted as an escape sequence at run time.
For example '\n' is a newline (or line-feed character - 0x0A in ASCII)
Note that:
char array[3]={'\\','n','\0'};
is equivalent to:
char array[3] = "\\n" ;
so perhaps unsurprisingly when you printf(array) it prints \n - that is what you have asked it to do.
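Since the compiler will not do this at run time, the translation has to be written by hand. A minimal sketch of a lookup for the letter-based sequences follows; simple_escape_value is a hypothetical name, and only the seven alphabetic simple escapes are covered.

```c
#include <string.h>

/* Returns the character that the simple escape sequence backslash + letter
   denotes, or -1 if the letter does not form one of the simple sequences
   handled here (\x and the octal forms need operands and are left out). */
int simple_escape_value(char letter)
{
    static const char letters[] = "abfnrtv";
    static const char values[]  = "\a\b\f\n\r\t\v";
    const char *p;

    if (letter == '\0')
        return -1;           /* strchr would otherwise match the terminator */
    p = strchr(letters, letter);
    return p ? values[p - letters] : -1;
}
```

A loop such as for (char ch = 'a'; ch <= 'z'; ch++) can then call simple_escape_value(ch) and print the result for every letter that does not return -1.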
Unrecognised escape sequences will normally draw a compiler diagnostic (some compilers only warn), so you might simply write:
char c1 = '\a' ;
char c2 = '\b' ;
... // etc.
and see which lines the compiler baulks at. However that is not the complete story because some escape sequences require operands, for example \x on its own has no meaning, whereas \xab is the character represented by 0xab (171 decimal). Others are not even letters. Most are related to white-space, and their effect may be dependent on the terminal or console capabilities of the execution platform. So a naive investigation may not generate the results you seek, because it does not account for the language semantics or platform capabilities.
All supported escape sequences are in fact well defined - you'll find few surprises except perhaps those related to platform capabilities (for example if your target has no means to generate a beep, \a will have no useful effect):
\a Beep
\b Backspace
\f Form-feed
\n Newline
\r Carriage return
\t Horizontal tab
\v Vertical tab
\\ Backslash
\' Single quotation mark
\" Double quotation mark
\0 ASCII 0x00 (null terminator)
\ooo Octal representation
\xdd Hexadecimal representation
What about writing your own printf()?
There you can check for a '\' followed by an 'n', print only the characters from array[0] up to that '\' 'n' pair, and then emit a real newline with printf("\n");
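One possible shape for that idea is to translate the text into a second buffer first and print the result afterwards. This is a sketch under assumptions: unescape_newlines is an invented name, only backslash + n and backslash + backslash are translated, and the caller is expected to supply a large enough destination.

```c
#include <stddef.h>

/* Copies src to dst, replacing the two characters backslash + n with one
   newline and backslash + backslash with one backslash. Other characters,
   including a backslash followed by anything else, are copied unchanged.
   dst must have room for strlen(src) + 1 bytes. Returns the output length. */
size_t unescape_newlines(const char *src, char *dst)
{
    size_t out = 0;
    for (size_t i = 0; src[i] != '\0'; i++) {
        if (src[i] == '\\' && src[i + 1] == 'n') {
            dst[out++] = '\n';
            i++;                 /* skip the 'n' */
        } else if (src[i] == '\\' && src[i + 1] == '\\') {
            dst[out++] = '\\';
            i++;                 /* skip the second backslash */
        } else {
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0';
    return out;
}
```

With the array from the question, {'\\','n','\0'}, the destination ends up holding a single newline character, which printf("%s", dst) then prints as a line break.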
