#include <stdio.h>
char *prg = "char *prg = %c%s%c;main(){printf(prg,34,prg,34);} " ;
void main (){
printf(prg,34,prg,34);
}
Reason behind the following output
char *prg = "char *prg = %c%s%c;main(){printf(prg,34,prg,34);} ";main(){printf(prg,34,prg,34);}
the content in *prg Is printed but in place of "%c%s%c" the content is replaced, why is it embedded
That's the way printf() works - in the format (the first prg argument), each conversion specification (%c, %s) converts a subsequent argument (34, prg) according to the conversion specifier (c, s) and writes the result to the standard output.
That program is a classical example of a program that prints its source code to stdout. Ascii character 34 is the ascii literal for " which is needed to be able to print the delimiters of a C string without incurring in an infinite recursive problem. There are some characters that cannot be used literally, as they are transformed by the compiler, one of these is ", which dissapears when compiled in a string literal, others are the escaped char literals \n, \t,... depending on that, convert into a different char literal. This is the reason the source code must be all in one line (control chars transform themselves by the compiler), no #include ... statements must be allowed (because it ends in a newline), and other things like this.
Compile it and rearrange it (you have modified it, for clarity, on posting) so you can output the exact same source code as you give the compiler.
Note
The program code, to mimic exactly its source form in the output, must be written as:
char*p="char*p=%c%s%c;main(){printf(p,34,p,34);}";main(){printf(p,34,p,34);}
without a trailing newline at the end of the line.
If you substitute 34 by '\"', and p by its value, you'll get:
...
printf(
"char*p=%c%s%c;main(){printf(p,34,p,34);}",
'\"',
"char *p=%c%s%c;main(){printf(p,34,p,34);}",
'\"');
that will form the original string, putting the right delimiters at the proper positions. (note: the %c and %s in the third parameter are not further expanded)
Note2
This program depends on ASCII encoding for ". It should'n work on EBCDIC encodings (you must use the encoding for " instead of the number 34)
Related
I have a strange problem when using string function in C.
Currently I have a function that sends string to UART port.
When I give to it a string like
char buf[32];
strcpy(buf, "AT+CPMS=\"SM");
strcat(buf, "\"");
uart0_putstr(buf);
//or
uart0_putstr("AT+CPMS=SM"); //not a valid AT command, but without quotes just for test
it works well and sends string to UART. But when I use such call:
char buf[32];
strcpy(buf, "AT+CPMS=\"SM\"");
uart0_putstr(buf);
//or
uart0_putstr("AT+CPMS=\"SM\"");
it doesn't print to UART anything.
Maybe you can explain me what the difference between strings in first and second/third cases?
First the C language part:
String literals: All C string literals include an implicit null byte at the end; the C string literal "123" defines a 4 byte array with the values 49,50,51,0. The null byte is always there even if it is never mentioned and enables strlen, strcat etc. to find the end of the string. The suggestion strcpy(buf, "AT+CPMS=\"SM\"\0"); is nonsensical: The character array produced by "AT+CPMS=\"SM\"\0" now ends in two consecutive zero bytes; strcpy will stop at the first one already. "" is a 1 byte array whose single element has the value 0. There is no need to append another 0 byte.
strcat, strcpy: Both functions always add a null byte at the end of the string. There is no need to add a second one.
Escaping: As you know, a C string literal consists of characters book-ended by double quotes: "abc". This makes it impossible to have simple double quotes as part of the string because that would end the string. We have to "escape" them. The C language uses the backslash to give certain characters a special meaning, or, in this case, suppress the special meaning. The entire combination of backslash and subsequent source code character are transformed into a single character, or byte, in the compiled program. The combination \n is transformed into a single byte with the value 13 (usually interpreted as a newline by output devices), \r is 10 (usually carriage return), and \" is transformed into the byte 34, usually printed as the " glyph. The string Between the arrows is a double quote: ->"<- must be coded as "Between the arrows is a double quote: ->\"<-" in C. The middle double quote doesn't end the string literal because it is "escaped".
Then the UART part: The internet makes me believe that the command you want to send over the UART looks like AT+CPMS="SM", followed by a carriage return. The corresponding C string literal would be "AT+CPMS=\"SM\"\r".
The page I linked also inserts a delay between sending commands. Sending too quickly may cause errors that appear only sometimes.
The things to note are :
The AT command syntax probably demands that SM be surrounded by quotes on both sides.
Additionally, the protocol probably demands that a command end in a carriage return.
This ...
char buf[32];
strcpy(buf, "AT+CPMS=\"SM");
strcat(buf, "\"");
... produces the same contents in buf as this ...
char buf[32];
strcpy(buf, "AT+CPMS=\"SM\"");
... does, up to and including the string terminator at index 12. I fully expect an immediately following call to ...
uart0_putstr(buf);
... to have the same effect in each case unless uart0_putstr() looks at bytes past the terminator or its behavior is sensitive to factors other than its argument.
If it does look past the terminator, however, then that might explain not only a difference between those two, but also a difference with ...
uart0_putstr("AT+CPMS=\"SM\"");
... because in this last case, looking past the string terminator would overrun the bounds of the array, producing undefined behavior.
Thanks all. Finally It was resolved with adding NULL char to the end of string.
I am trying to read non-printable characters from a text file, print out the characters' ASCII code, and finally write these non-printable characters into an output file.
However, I have noticed that for every non-printable character I read, there is always an extra non-printable character existing in front of what I really want to read.
For example, the character I want to read is "§".
And when I print out its ASCII code in my program, instead of printing just "167", it prints out "194 167".
I looked it up in the debugger and saw "§" in the char array. But I don't have  anywhere in my input file.
screenshot of debugger
And after I write the non-printable character into my output file, I have noticed that it is also just "§", not "§".
There is an extra character being attached to every single non-printable character I read. Why is this happening? How do I get rid of it?
Thanks!
Code as follows:
case 1:
mode = 1;
FILE *fp;
fp = fopen ("input2.txt", "r");
int charCount = 0;
while(!feof(fp)) {
original_message[charCount] = fgetc(fp);
charCount++;
}
original_message[charCount - 1] = '\0';
fclose(fp);
k = strlen(original_message);//split the original message into k input symbols
printf("k: \n%lld\n", k);
printf("ASCII code:\n");
for (int i = 0; i < k; i++)
{
ASCII = original_message[i];
printf("%d ", ASCII);
}
C's getchar (and getc and fgetc) functions are designed to read individual bytes. They won't directly handle "wide" or "multibyte" characters such as occur in the UTF-8 encoding of Unicode.
But there are other functions which are specifically designed to deal with those extended characters. In particular, if you wish, you can replace your call to fgetc(fp) with fgetwc(fp), and then you should be able to start reading characters like § as themselves.
You will have to #include <wchar.h> to get the prototype for fgetwc. And you may have to add the call
setlocale(LC_CTYPE, "");
at the top of your program to synchronize your program's character set "locale" with that of your operating system.
Not your original code, but I wrote this little program:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main()
{
wchar_t c;
setlocale(LC_CTYPE, "");
while((c = fgetwc(stdin)) != EOF)
printf("%lc %d\n", c, c);
}
When I type "A", it prints A 65.
When I type "§", it prints § 167.
When I type "Ƶ", it prints Ƶ 437.
When I type "†", it prints † 8224.
Now, with all that said, reading wide characters using functions like fgetwc isn't the only or necessarily even the best way of dealing with extended characters. In your case, it carries a number of additional consequences:
Your original_message array is going to have to be an array of wchar_t, not an array of char.
Your original_message array isn't going to be an ordinary C string — it's a "wide character string". So you can't call strlen on it; you're going to have to call wcslen.
Similarly, you can't print it using %s, or its characters using %c. You'll have to remember to use %ls or %lc.
So although you can convert your entire program to use "wide" strings and "w" functions everywhere, it's a ton of work. In many cases, and despite anomalies like the one you asked about, it's much easier to use UTF-8 everywhere, since it tends to Just Work. In particular, as long as you don't have to pick a string apart and work with its individual characters, or compute the on-screen display length of a string (in "characters") using strlen, you can just use plain C strings everywhere, and let the magic of UTF-8 sequences take care of any non-ASCII characters your users happen to enter.
Perhaps I'm misinterpreting my results, but:
#include <stdio.h>
int
main(void)
{
char buf[32] = "";
int x;
x = scanf("%31[^\0]", buf);
printf("x = %d, buf=%s", x, buf);
}
$ printf 'foo\n\0bar' | ./a.out
x = 1, buf=foo
Since the string literal "%31[^\0]" contains an embedded null, it seems that it should be treated the same as "%31[^", and the compiler should complain that the [ is unmatched. Indeed, if you replace the string literal, clang gives:
warning: no closing ']' for '%[' in scanf format string [-Wformat]
Why does it work to embed a null character in the string literal passed to scanf?
-- EDIT --
The above is undefined behavior and merely happens to "work".
First of all, Clang totally fails to output any meaningful diagnostics here, whereas GCC knows exactly what is happening - so yet again GCC 1 - 0 Clang.
And as for the format string - well, it doesn't work. The format argument to scanf is a string. The string ends at terminating null, i.e. the format string you're giving to scanf is
scanf("%31[^", buf);
On my computer, compiling the program gives
% gcc scanf.c
scanf.c: In function ‘main’:
scanf.c:8:20: warning: no closing ‘]’ for ‘%[’ format [-Wformat=]
8 | x = scanf("%31[^\0]", buf);
| ^
scanf.c:8:21: warning: embedded ‘\0’ in format [-Wformat-contains-nul]
8 | x = scanf("%31[^\0]", buf);
| ^~
The scanset must have the closing right bracket ], otherwise the conversion specifier is invalid. If conversion specifier is invalid, the behaviour is undefined.
And, on my computer running it,
% printf 'foo\n\0bar' | ./a.out
x = 0, buf=
Q.E.D.
This is a rather strange situation. I think there are a couple of things going on.
First of all, a string in C ends by definition at the first \0. You can always scoff at this rule, for example by writing a string literal with an explicit \0 in the middle of it. When you do, though, the characters after the \0 are mostly invisible. Very few standard library functions are able to see them, because of course just about everything that interprets a C string will stop at the first \0 it finds.
However: the string you pass as the first argument to scanf is typically parsed twice -- and by "parsed" I mean actually interpreted as a scanf format string possibly containing special % sequences. It's always going to be parsed at run time, by the actual copy of scanf in your C run-time library. But it's typically also parsed by the compiler, at compile time, so that the compiler can warn you if the % sequences don't match the actual arguments you call it with. (The run-time library code for scanf, of course, is unable to perform this checking.)
Now, of course, there's a pretty significant potential problem here: what if the parsing performed by the compiler is in some way different than the parsing performed by the actual scanf code in the run-time library? That might lead to confusing results.
And, to my considerable surprise, it looks like the scanf format parsing code in compilers can (and in some cases does) do something special and unexpected. clang doesn't (it doesn't complain about the malformed string at all), but gcc says both "no closing ‘]’ for ‘%[’ format" and "embedded ‘\0’ in format". So it's noticing.
This is possible (though still surprising) because the compiler, at least, can see the whole string literal, and is in a position to notice that the null character is an explicit one inserted by the programmer, not the more usual implicit one appended by the compiler. And indeed the warning "embedded ‘\0’ in format" emitted by gcc proves that gcc, at least, is quite definitely written to accommodate this possibility. (See the footnote below for a bit more on the compiler's ability to "see" the whole string literal.)
But the second question is, why does it (seem to) work at runtime? What is the actual scanf code in the C library doing?
That code, at least, has no way of knowing that the \0 was explicit and that there are "real" characters following it. That code simply must stop at the first \0 that it finds. So it's operating as if the format string was
"%31[^"
That's a malformed format string, of course. The run-time library code isn't required to do anything reasonable. But my copy, like yours, is able to read the full string "foo". What's up with that?
My guess is that after seeing the % and the [ and the ^, and deciding that it's going to scan characters not matching some set, it's perfectly willing to, in effect, infer the missing ], and sail on matching characters from the scanset, which ends up having no excluded characters.
I tested this by trying the variant
x = scanf("%31[^\0o]", buf);
This also matched and printed "foo", not "f".
Obviously things are nothing like guaranteed to work like this, of course. #AnttiHaapala has already posted an answer showing that his C RTL declines to scan "foo" with the malformed scan string at all.
Footnote:
Most of the time, an embedded in \0 in a string truly, prematurely ends it. Most of the time, everything following the \0 is effectively invisible, because at run time, every piece of string interpreting code will stop at the first \0 it finds, with no way to know whether it was one explicitly inserted by the programmer or implicitly appended by the compiler. But as we've seen, the compiler can tell the difference, because the compiler (obviously) can see the entire string literal, exactly as entered by the programmer. Here's proof:
char str1[] = "Hello, world!";
char str2[] = "Hello\0world!";
printf("sizeof(str1) = %zu, strlen(str1) = %zu\n", sizeof(str1), strlen(str1));
printf("sizeof(str2) = %zu, strlen(str2) = %zu\n", sizeof(str2), strlen(str2));
Normally, sizeof on a string literal gives you a number one bigger than strlen. But this code prints:
sizeof(str1) = 14, strlen(str1) = 13
sizeof(str2) = 13, strlen(str2) = 5
Just for fun I also tried:
char str3[5] = "Hello";
This time, though, strlen gave a larger number:
sizeof(str3) = 5, strlen(str3) = 10
I was mildly lucky. str3 has no trailing \0, neither one inserted by me nor appended by the compiler, so strlen sails off the end, and could easily have counted hundreds or thousands of characters before finding a random \0 somewhere in memory, or crashing.
Why can a null character be embedded in a conversion specifier for scanf?
A null character cannot directly be specified as part of a scanset as in "%31[^\0]" as the parsing of the string ends with the first null character.
"%31[^\0]" is parsed by scanf() as if it was "%31[^". As it is an invalid scanf()
specifier, UB will likely follow. A compiler may provide diagnostics on more than what scanf() sees.
A null character can be part of a scanset as in "%31[^\n]". This will read in all characters, including the null character, other than '\n'.
In the unusual case of reading null chracters, to determine the number of characters read scanned, use "%n".
int n = 0;
scanf("%31[^\n]%n", buf, &n);
scanf("%*1[\n]"); // Consume any 1 trailing \n
if (n) {
printf("First part of buf=%s, %d characters read ", buf, n);
}
I just came across this quine question, but no one really went into how it works: C/C++ program that prints its own source code as its output
char*s="char*s=%c%s%c;main(){printf(s,34,s,34);}";main(){printf(s,34,s,34);}
What I especially don't understand is the following has the same output even though I changed the ints:
char*s="char*s=%c%s%c;main(){printf(s,34,s,34);}";main(){printf(s,5,s,11);}
It still prints the 34s! Can someone walk me through this step by step?
Let's start by formatting the code to span multiple lines. This breaks the fact that it's a quine, but makes it easier to see what's happening:
char* s = "char*s=%c%s%c;main(){printf(s,34,s,34);}";
main() {
printf(s, 34, s, 34);
}
Essentially, this is a declaration of a string s that's a printf-formatted string, followed by a declaration of a function main that invokes printf on four arguments. (This definition of main uses the old-fashioned "implicit int" rule in C where functions are assumed to have int as a return type unless specified otherwise. I believe this is currently deprecated in C and know for certain that this is not legal C++ code.)
So what exactly is this printf call doing? Well, it might help to note that 34 is the ASCII code for a double-quote, so the line
printf(s, 34, s, 34);
is essentially
printf(s, '"', s, '"');
This means "print the string s with arguments ", s, and "." So what's s? It's shown here:
char* s = "char*s=%c%s%c;main(){printf(s,34,s,34);}";
This follows a common self-reference trick. Ignoring the %c%s%c part, this is basically a string representation of the rest of the program. The %c%s%c part occurs at the point where it becomes self-referential.
So what happens if you call printf(s, '"', s, '"')? This will fill in the placeholder %c%s%c with "char*s=%c%s%c;main(){printf(s,34,s,34);}", which is the string contents of the string s. Combined with the rest of the string s, this therefore proints
char*s="char*s=%c%s%c;main(){printf(s,34,s,34);}";main(){printf(s,34,s,34);";
which is the source code of the program. I think this is kinda neat - the closest English translation of a general Quine program I know is "print this string, the second time in quotes" (try it - see what happens!), and this basically does exactly that.
You were asking why changing the numbers to 5 and 11 didn't change that 34 was being printed. That's correct! The string literal s has 34 hard-coded into it, so changing the 5 and 11 in the call to printf won't change that. What it will do is no longer print the quotation marks around the inside of the string s and instead print non-printing characters.
I thought I understand printf, but I guess not. I have:
char sTemp[100];
sprintf(sTemp, "%%%02x", (unsigned)c);
I think that c is an unsigned char and I think a linefeed, but for some reason, what I get coming out is
0x0.000000000000ap-1022
If I make the 'x' in the format string an 'X', then an 'X' appears in the output string.
I completely misinterpreted the results of my experiments in the first version of this answer; apologies all around.
The result of that sprintf() call when c is '\n' is this string:
"%0a"
I believe you are then doing:
printf(sTemp);
Which is the same as:
printf("%0a");
Which is a valid format string for hexadecimal float output. You aren't passing a float variable, however, so printf() pulls whatever happens to be on the stack nearby and uses that as the value to format.
Instead, do:
printf( "%s", sTemp );
and you should see your expected "%0a".
Note that clang, and probably other compilers, give you a warning when you use printf(sTemp):
so.c:9:12: warning: format string is not a string literal (potentially
insecure) [-Wformat-security]
Because of precisely this sort of thing: memory on the stack is accessed that wasn't supposed to be.