C program to store special characters in strings - c

Basically, I can't figure this out, I want my C program to store the entire plaintext of a batch program then insert in a file and then run.
I finished my program, but holding the contents is my problem. How do I insert the code in a string and make it ignore ALL special characters like %s \ etc?

You have to escape special characters with a \, you can escape backslash itself with another backslash (i.e. \\).

As Ian previously mentioned, you can escape characters that aren't allowed in normal C strings with \; for instance, newline becomes \n, double-quote becomes \", and backslash becomes \\.
If you're unable or unwilling to do this for whatever reason, then you may be out of luck if you're solution must be in C. However, if you're willing to switch to C++, then you can use raw strings:
const char* s1 = R"foo(
Hello
World
)foo";
This is equivalent to
const char* s2 = "\nHello\nWorld\n";
A raw string must begin with R" followed by an arbitrary delimiter (made of any source character but parentheses, backslash and spaces; can be empty; and at most 16 characters long), then (, and must end with ) followed by the delimiter and ". The delimiter must be chosen such that the termination substring (), delimiter, ") does not appear within the string.

Related

Different behavior for \" in C

I have a strange problem when using string function in C.
Currently I have a function that sends string to UART port.
When I give to it a string like
char buf[32];
strcpy(buf, "AT+CPMS=\"SM");
strcat(buf, "\"");
uart0_putstr(buf);
//or
uart0_putstr("AT+CPMS=SM"); //not a valid AT command, but without quotes just for test
it works well and sends string to UART. But when I use such call:
char buf[32];
strcpy(buf, "AT+CPMS=\"SM\"");
uart0_putstr(buf);
//or
uart0_putstr("AT+CPMS=\"SM\"");
it doesn't print to UART anything.
Maybe you can explain me what the difference between strings in first and second/third cases?
First the C language part:
String literals: All C string literals include an implicit null byte at the end; the C string literal "123" defines a 4 byte array with the values 49,50,51,0. The null byte is always there even if it is never mentioned and enables strlen, strcat etc. to find the end of the string. The suggestion strcpy(buf, "AT+CPMS=\"SM\"\0"); is nonsensical: The character array produced by "AT+CPMS=\"SM\"\0" now ends in two consecutive zero bytes; strcpy will stop at the first one already. "" is a 1 byte array whose single element has the value 0. There is no need to append another 0 byte.
strcat, strcpy: Both functions always add a null byte at the end of the string. There is no need to add a second one.
Escaping: As you know, a C string literal consists of characters book-ended by double quotes: "abc". This makes it impossible to have simple double quotes as part of the string because that would end the string. We have to "escape" them. The C language uses the backslash to give certain characters a special meaning, or, in this case, suppress the special meaning. The entire combination of backslash and subsequent source code character are transformed into a single character, or byte, in the compiled program. The combination \n is transformed into a single byte with the value 13 (usually interpreted as a newline by output devices), \r is 10 (usually carriage return), and \" is transformed into the byte 34, usually printed as the " glyph. The string Between the arrows is a double quote: ->"<- must be coded as "Between the arrows is a double quote: ->\"<-" in C. The middle double quote doesn't end the string literal because it is "escaped".
Then the UART part: The internet makes me believe that the command you want to send over the UART looks like AT+CPMS="SM", followed by a carriage return. The corresponding C string literal would be "AT+CPMS=\"SM\"\r".
The page I linked also inserts a delay between sending commands. Sending too quickly may cause errors that appear only sometimes.
The things to note are :
The AT command syntax probably demands that SM be surrounded by quotes on both sides.
Additionally, the protocol probably demands that a command end in a carriage return.
This ...
char buf[32];
strcpy(buf, "AT+CPMS=\"SM");
strcat(buf, "\"");
... produces the same contents in buf as this ...
char buf[32];
strcpy(buf, "AT+CPMS=\"SM\"");
... does, up to and including the string terminator at index 12. I fully expect an immediately following call to ...
uart0_putstr(buf);
... to have the same effect in each case unless uart0_putstr() looks at bytes past the terminator or its behavior is sensitive to factors other than its argument.
If it does look past the terminator, however, then that might explain not only a difference between those two, but also a difference with ...
uart0_putstr("AT+CPMS=\"SM\"");
... because in this last case, looking past the string terminator would overrun the bounds of the array, producing undefined behavior.
Thanks all. Finally It was resolved with adding NULL char to the end of string.

Why does C not recognize strings across multiple lines?

(I am very new to C.)
Visual newlines seem to be unimportant in C.
For instance:
int i; int j;
is same as
int i;
int j;
and
int k = 0 ;
is same as
int
k
=
0
;
so why is
"hello
hello"
not the same as
"hello hello"
It is because a line that contains a starting quote character and not an ending quote character was more likely a typing mistake or other error than an attempt to write a string across multiple lines, so the decision was made that string literals would not span source lines, unless deliberately indicated with \ at the end of a line.
Further, when such an error occurs, the compile would be faced with reading possibly thousands of lines of code before determining there was no closing quote character (end of file was reached) or finding what was intended as an opening quote character for some other string literal and then attempting to parse the contents of that string literal as C code. In addition to burdening early compilers with limited compute resources, this could result in confusing error messages in a part of the source code far removed from the missing quote character.
This choice is effected in C 2018 6.4.5 1, which says that a string literal is " s-char-sequenceopt ", where s-char-sequence is any member of the character set except the quote character, a backslash, or a new-line character (and a string literal may also have an encoding prefix, a u8, u, U, or L before the first ").
Strings can be continued over newlines by putting a backslash immediately before the newline:
"hello \
hello"
Or (better), using string concatenation:
"hello "
"hello"
Note that the space has been carefully preserved so that these are equivalent to "hello hello" except for the line numbering in the file after the appearance.
The backslash-newline line elimination is done very early in the translation process — in phase 2 of the conceptual translation phases.
Note that there is no stripping leading blanks or anything. If you write:
printf("Some long string with maybe an integer %d in it\
and some more data on the next line\n", i);
Then the string has a sequence of (at least) 8 blanks in it between in it and and some. The count of 8 assumes that the printf() statement is aligned in the left margin; if it is indented, you'd need to add the extra white space corresponding to the indentation.
1- using double quotes for each string :
char *str = "hello "
"hello" ;
** One problem with this approach is that we need to escape specially characters such as quotation mark " itself.
2- Using - \ :
char *str = "hello \
hello" ;
** This form is a lot easier to write, we don't need to write quotation marks for each line.
We can think of a C program as a series of tokens: groups of characters that can't be split up without changing their meaning. Identifiers and keywords are tokens. So are operators like + and -, punctuation marks such as the comma
and semicolon, and string literals.
For example, the line
int i; int j;
consists of 6 tokens: int, i, ;, int, j and ;. Most of the time, and particularly in this case, the amount of space (space, tab and newline characters) is not critical. That's why the compiler will treat
int i
;int
j;
The same.
Writing
"Hello
Hello"
Is like writing
un signed
and hope that the compiler treat it as
unsigned
Just like space is not allowed between a keyword, newline character is not allowed in a string literal token. But it can be included using the newline escape '\n' when needed.
To write strings across lines use string concatenation method
"Hello"
"Hello"
Although the above method is recommended, you can also use a backslash
"Hello \
Hello"
With the backslash method, beware of the beginning space in a new line. The string will include everything in that line until it finds a closing quote or another backslash.

\\\ prints just one backslash?

In C, on mentioning three backslashes, like so:
#include <stdio.h>
int main() {
printf(" \\\ ");
}
prints out just one backslash in the output. Why and how does this work?
That sequence is:
A space
A double backslash, which encodes a single backslash in the runtime string
A backslash followed by a space, which is not a standard escape sequence and should give you a diagnostic
The C11 draft says (in note 77):
The semantics of these characters were discussed in 5.2.2. If any other
character follows a backslash, the result is not a token and a diagnostic is
required.
On godbolt.org I got:
<source>:8:14: warning: unknown escape sequence '\ ' [-Wunknown-escape-sequence]
So you seem to be using a non-conforming compiler, which chooses to implement undefined backslash sequences by just letting the character through.
That is printing:
space
slash
escaped space
The 3rd slash is being interpreted as "slash space"
C11; 6.4.4.4 Character constants:
The double-quote " and question-mark ? are representable either by themselves or by the
escape sequences \" and \?, respectively, but the single-quote ' and the backslash \
shall be represented, respectively, by the escape sequences \' and \\.
So, To represent a single backslash, it’s necessary to place double backslashes \\ in the source code. To print two \\ you need four backslash \\\\. In your code extra \ is a space character, which isn't valid.
It is indeed a very simple operation in c. A \ is just a escape sequences. Hence below statement will print two slash.
printf(" \\\\ ");
For example some characters in c are represented with a slash like end of a line character \n or end of a string character \0 etc. But if you want to print such a character as it is what will you do? Hence you need to add a escape sequence character in front of it:
printf("\\n"); // will print \n
But
printf("\n"); // will print end of character hence you don't see anything in output

sizeof(string) not including a "\" sign

I've been going through strlen and sizeof for strings (char arrays) and I don't quite get one thing.
I have the following code:
int main() {
char str[]="gdb\0eahr";
printf("sizeof=%u\n",sizeof(str));
printf("strlen=%u\n",strlen(str));
return 0;
}
the output of the code is:
sizeof=9
strlen=3
At first I was pretty sure that 2 separate characters \ followed by 0 wouldn't actually act as a NUL (\0) but I managed to figure that it does.
The thing is that I have no idea why sizeof shows 9 and not 10.
Since sizeof counts the amount of used bytes by the data type why doesn't it count the byte for the \?
In a following example:
char str[]="abc";
printf("sizeof=%u\n",sizeof(str));
that would print out "4" because of the NUL value terminating the array so why is \ being not counted?
In a character or string constant, the \ character marks the beginning of an escape sequence, used to represent character values for which there isn't a symbol in the source character set. For example, the escape sequence \n represents the newline character, \b represents the backspace character, \0 represents the zero-valued character (which is also the string terminator), etc.
In the string literal "gdb\0eahr", the escape sequence \0 maps to a single 0-valued character; the actual contents of str are {'g', 'd', 'b', 0, 'e', 'a', 'h', 'r', 0}.
It seems you already have the answer:
At first I was pretty sure that 2 separate characters "\" followed by "0" wouldn't actually act as a NULL "\0" but I managed to figure that it does.
The sequence \0 is an octal escape sequence for the byte 0. So while there are two characters in the code to denote this, it translated to a single byte in the string.
So you have 7 alphabetic characters, a null byte in the middle, and a null byte at the end. That's 9 bytes.
Why should char str[]="gdb\0eahr"; be 10 bytes with sizeof operator? It is 9 bytes because there are 8 string elements + trailing zero.
\0 is only 1 character, not 2. \'s purpose is to escape characters, therefore you might see some of these: \t, \n, \\ and others.
Strlen returns 3 because you have string termination at position str[3].
Single sequence of \ acts as escape character and is not part of string size. If you want to literally use \ in your string, you have to write it twice in sequence like \\, then this is single char of \ printable char.
The C compiler scans text strings as a part of compiling the source code and during the scan any special, escape sequences of characters are turned into a single character. The symbol backslash (\) is used to indicate the start of an escape sequence.
There are several formats for escape sequences. The most basic is a backslash followed by one of several special letters. These two characters are then translated into a single character. Some of these are:
'\n' is turned into a line feed character (0x0A or decimal 10)
'\t' is turned into a tab character (0x09 or decimal 9)
'\r' is turned into a carriage return character (0x0D or decimal 13)
'\\' is turned into a backslash character (0x5C)
This escape sequence idea was used back in the old days so that when a line of text was printed to a teletype machine or printer or a CRT terminal, the programmer could use these and other special command code characters to set where the next character would be printed or to cause the device to do some physical action like ring a bell or feed the paper to the next line.
The escape character also allowed you to embed a double quote (") or a single quote (') into a text string so that you could print text that contain quote marks.
In addition to the above special sequences of backslash followed by a letter there was also a way to specify any character by using a backslash followed by one up to three octal digits (0 through 7). So you could specify a line feed character by either using '\n' or you could use '\12' where 12 is the octal representation of the hexadecimal value A or the decimal value 10.
Then the ability to use a hexadecimal escape sequence was introduced with the backslash followed by the letter x followed by one or more hexadecimal digits. So you can write a line feed character with '\n' or '\12' or '\xa'.
See also Escape sequences in C in Wikipedia.

Concatenating two characters to create escape sequence

Idea is to try to create escape sequences 'human' way. For example, I use two characters to create '\n', the '\' and 'n'.
What I'm thinking about is char array[3]={'\\','n','\0'};
so I can change 'n' character and still use it as an escape sequence.
When I printf(array) it now prints:
\n
and I'd like it to go to next line.
For example:
what if I wanted to check manually what every letter in alphabet does when used as escape sequence with a loop?
for(char='a';char<='z';char++)
{
/* create escape sequence with that letter */
/* print that escape sequence and see what it does */
}
It's not an assignment,has no practical use (at least not yet), but just a theoretical question that I couldn't find answer anywhere, nor figure it out myself.
The escape sequence represents a single character and is evaluated at compile time. You cannot have a literal string interpreted as an escape sequence at run time.
For example '\n' is a newline (or line-feed character - 0x0A in ASCII)
Note that:
char array[3]={'\\','n','\0'};
is equivalent to:
char array[3] = "\\n" ;
so perhaps unsurprisingly when you printf(array) it prints \n - that is what you have asked it to do.
Undefined escape sequences simply won't compile, so you might simply:
char = '\a' ;
char = '\b' ;
... // etc.
and see which lines the compiler baulks at. However that is not the complete story because some escape sequences require operands, for example \x on its own has no meaning, whereas \xab is the character represented by 0xab (171 decimal). Others are not even letters. Most are related to white-space, and their effect may be dependent on the terminal or console capabilities of the execution platform. So a naive investigation may not generate the results you seek, because it does not account for the language semantics or platform capabilities.
All supported escape sequences are in fact well defined - you'll find few surprises except perhaps those related to platform capabilities (for example if your target has no means to generate a beep, \a will have no useful effect):
\a Beep
\b Backspace
\f Form-feed
\n Newline
\r Carriage return
\t Horizontal tab
\v Vertical tab
\\ Backslash
\' Single quotation mark
\" Double quotation mark
\0 ASCII 0x00 (null terminator)
\ooo Octal representation
\xdd Hexadecimal representation
What about writing your own printf()?
Where you can check for a '\' followed by a 'n' and than only print from char[0] to '\''n'. Finally add "printf("\n");
mfg

Resources