I've wanted to learn more about the C standard library, so I decided to implement printf using only the putchar function. I'd barely started when something odd happened. All I'd done was write a loop to print a verbatim copy of the format string, and then I realized that all the escape sequences, (\n, \t, etc.) had already been parsed and "properly" output.
Here's the minimal code:
#include <stdio.h>

int my_printf(const char* s){
    size_t i;
    char c;
    for (i = 0; (c = *(s + i)) != '\0'; ++i){
        putchar(c);
    }
    return 0;
}

int main(void){
    my_printf("Here\t1\n0\n");
    return 0;
}
I was expecting a literal Here\t1\n0\n to be output, but I instead got:
Here 1
0
Any idea why this is happening? My first thought was that the compiler I'm using (gcc) was trying to help by pre-analyzing the format string, but that seems odd, since it would break any char array. So, does anyone know why this is happening? And is this behavior defined in the standard? Thank you for any help!
Edit: As mafso stated in their answer, the replacements are done at compile time, and it is standard. Section 5.1.1.2, paragraph 1, item 5 (translation phase 5) of the standard has the actual text.
The escape sequences are replaced at compile time, not at runtime by printf. "\n" is a string starting with a literal new-line character (it’s the only way to put a literal new-line character into a string).
The only parts of the string that printf interprets are conversion specifications, which always start with a % sign (and the 0-terminator, of course); every other character is printed literally. You don't need to do anything further.
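To see that the replacement happens before the program ever runs, here is a minimal sketch that just measures two string literals. The function names escaped_length and literal_length are made up for this illustration.

```c
#include <string.h>

/* The compiler has already turned the escape sequence \n into a
   single new-line character, so this literal holds one character. */
size_t escaped_length(void)
{
    return strlen("\n");    /* one new-line character */
}

/* Doubling the backslash asks for a real backslash followed by 'n',
   so this literal holds two characters. */
size_t literal_length(void)
{
    return strlen("\\n");   /* '\\' and 'n' */
}
```

So if you actually want the text Here\t1\n0\n to show up on screen, the source has to say my_printf("Here\\t1\\n0\\n").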
I am trying to read non-printable characters from a text file, print out the characters' ASCII code, and finally write these non-printable characters into an output file.
However, I have noticed that for every non-printable character I read, there is always an extra non-printable character existing in front of what I really want to read.
For example, the character I want to read is "§".
And when I print out its ASCII code in my program, instead of printing just "167", it prints out "194 167".
I looked it up in the debugger and saw "Â§" in the char array. But I don't have "Â" anywhere in my input file.
screenshot of debugger
And after I write the non-printable character into my output file, I have noticed that it is also "Â§", not just "§".
There is an extra character being attached to every single non-printable character I read. Why is this happening? How do I get rid of it?
Thanks!
Code as follows:
case 1:
mode = 1;
FILE *fp;
fp = fopen ("input2.txt", "r");
int charCount = 0;
while(!feof(fp)) {
original_message[charCount] = fgetc(fp);
charCount++;
}
original_message[charCount - 1] = '\0';
fclose(fp);
k = strlen(original_message);//split the original message into k input symbols
printf("k: \n%lld\n", k);
printf("ASCII code:\n");
for (int i = 0; i < k; i++)
{
ASCII = original_message[i];
printf("%d ", ASCII);
}
C's getchar (and getc and fgetc) functions are designed to read individual bytes. They won't directly handle "wide" or "multibyte" characters such as occur in the UTF-8 encoding of Unicode.
But there are other functions which are specifically designed to deal with those extended characters. In particular, if you wish, you can replace your call to fgetc(fp) with fgetwc(fp), and then you should be able to start reading characters like § as themselves.
You will have to #include <wchar.h> to get the prototype for fgetwc. And you may have to add the call
setlocale(LC_CTYPE, "");
at the top of your program to synchronize your program's character set "locale" with that of your operating system.
Not your original code, but I wrote this little program:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    wint_t c;   /* fgetwc returns wint_t, which can also hold WEOF */
    setlocale(LC_CTYPE, "");
    while((c = fgetwc(stdin)) != WEOF)
        printf("%lc %d\n", c, (int)c);
}
When I type "A", it prints A 65.
When I type "§", it prints § 167.
When I type "Ƶ", it prints Ƶ 437.
When I type "†", it prints † 8224.
Now, with all that said, reading wide characters using functions like fgetwc isn't the only or necessarily even the best way of dealing with extended characters. In your case, it carries a number of additional consequences:
Your original_message array is going to have to be an array of wchar_t, not an array of char.
Your original_message array isn't going to be an ordinary C string — it's a "wide character string". So you can't call strlen on it; you're going to have to call wcslen.
Similarly, you can't print it using %s, or its characters using %c. You'll have to remember to use %ls or %lc.
So although you can convert your entire program to use "wide" strings and "w" functions everywhere, it's a ton of work. In many cases, and despite anomalies like the one you asked about, it's much easier to use UTF-8 everywhere, since it tends to Just Work. In particular, as long as you don't have to pick a string apart and work with its individual characters, or compute the on-screen display length of a string (in "characters") using strlen, you can just use plain C strings everywhere, and let the magic of UTF-8 sequences take care of any non-ASCII characters your users happen to enter.
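If you do stay with plain char strings and only need to know how many characters (code points) a UTF-8 string holds, a rough sketch is to skip the continuation bytes, which always look like 10xxxxxx. The name utf8_count is hypothetical, and the function assumes the input is valid UTF-8; it does no validation.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a UTF-8 string.
   Assumption: s is valid UTF-8. Every byte that is NOT a continuation
   byte (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the "§" from the question, which is stored as the two bytes 194 167 (0xC2 0xA7), strlen reports 2 while utf8_count reports 1.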
To investigate how C deals with UTF-8 / Unicode characters, I did this little experiment.
It's not that I'm trying to solve anything particular at the moment, but I know that Java deals with the whole encoding situation in a transparent way to the coder and I was wondering how C, that is a lot lower level, treats its characters.
The following test seems to indicate that C is entirely ignorant about encoding concerns, and that it's just up to the display device to know how to interpret the sequence of chars when showing them on screen. The later tests (when printing the characters surrounded by _) seem particularly telling.
#include <stdio.h>
#include <string.h>
int main() {
char str[] = "João"; // ã does not belong to the standard
// (or extended) ASCII characters
printf("number of chars = %d\n", (int)strlen(str)); // 5
int len = 0;
while (str[len] != '\0')
len++;
printf("number of bytes = %d\n", len); // 5
for (int i = 0; i < len; i++)
printf("%c", str[i]);
puts("");
// "João"
for (int i = 0; i < len; i++)
printf("_%c_", str[i]);
puts("");
// _J__o__�__�__o_ -> wow!!!
str[2] = 'X'; // let's change this special character
// and see what happens
for (int i = 0; i < len; i++)
printf("%c", str[i]);
puts("");
// JoX�o
for (int i = 0; i < len; i++)
printf("_%c_", str[i]);
puts("");
// _J__o__X__�__o_
}
I know how ASCII and UTF-8 work; what I'm really unsure about is at what moment the characters get interpreted as "compound" characters, as it seems that C just treats them as dumb bytes. What's really the science behind this?
The printing isn't a function of C, but of the display context, whatever that is. For a terminal there are UTF-8 decoding functions which map the raw character data into the character to be shown on screen using a particular font. A similar sort of display logic happens in graphical applications, though with even more complexity relating to proportional font widths, ligatures, hyphenation, and numerous other typographical concerns.
Internally this is often done by decoding UTF-8 into some intermediate form first, like UTF-16 or UTF-32, for look-up purposes. In extremely simple terms, each character in a font has a Unicode identifier. In practice this is a lot more complicated as there is room for character variants, and multiple characters may be represented by a singular character in a font, like "fi" and "ff" ligatures. Accented characters like "ç" may be a combination of characters, as allowed by Unicode. That's where things like Zalgo text come about: you can often stack a truly ridiculous number of Unicode "combining characters" together into a single output character.
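As a rough illustration of that decoding step, here is a sketch that turns one UTF-8 sequence into a single code point (the UTF-32 value). The name utf8_decode is made up for this example, and the sketch is deliberately incomplete: it checks the lead and continuation bytes but does not reject overlong encodings or surrogate values, which a real decoder should.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the UTF-8 sequence starting at s into *cp.
   Returns the number of bytes consumed, or 0 if the lead byte or a
   continuation byte is malformed. Sketch only: overlong forms and
   surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;

    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid byte */

    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)  /* also stops at the terminating '\0' */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}
```

Feeding it the bytes 0xC2 0xA7 gives code point 167 (§), and 0xE2 0x80 0xA0 gives 8224 (†).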
Typography is a complex world with complex libraries required to render properly.
You can handle UTF-8 data in C, but mostly with the help of additional libraries. The byte-oriented string functions that C ships with in the Standard Library don't understand it: to them it's just a series of bytes, and they assume a byte is equivalent to a character for the purposes of length. That is, strlen and such work with bytes as a unit, not characters.
C++, as an example, has much better support for this distinction between byte and character. Other languages have even better support, with languages like Swift having exceptional support for UTF-8 specifically and Unicode in general.
printf("_%c_", str[i]); prints the character associated with each str[i] - one at a time.
The value of char str[i] is converted to an int when passed to a ... (variadic) function. The int value is then converted to unsigned char as directed by "%c", "and the resulting character is written".
char str[] = "João"; does not necessarily specify a UTF-8 sequence. That is an implementation detail. A specified way is to use char str[] = u8"João"; (available since C11).
printf() does not specify a direct way to print UTF-8 strings.
I was trying to remove spaces of a string after scanning it. The program has compiled perfectly but it is not showing any output and the output screen just keeps getting shut down after scanning the string. Is there a logical error or is there some problem with my compiler(I am using a devc++ compiler btw).
Any kind of help would be appreciated
int main()
{
char str1[100];
scanf("%s",&str1);
int len = strlen(str1);
int m;
for (m=0;m<=len;){
if (&str1[m]==" "){
m++;
}
else {
printf("%c",&str1[m]);
}
m++;
}
return 0;
}
Edit : sorry for the error of m=1, I was just checking in my compiler whether that works or not and just happened to paste that code
Your code contains a lot of issues, and the behaviour you describe is very likely not because of a bug in the compiler :-)
Some of the issues:
Use scanf("%s",str1) instead of scanf("%s",&str1). Since str1 is defined as a character array, it automatically decays to a pointer to char as required.
Note that scanf("%s",str1) will never read in any white space because "%s" is defined as skipping leading white spaces and stops reading in when detecting the first white space.
In for (m=1;m<=len;) together with str1[m], note that an array in C uses zero-based indices, i.e. str1[0]..str1[len-1], such that m <= len goes past the last character of the string. Use for (m=0;m<len;) instead.
Expression &str1[m]==" " is correct from a type perspective, but semantically nonsense. You are comparing the memory address of the mth character with the memory address of a string literal " ". Use str1[m]==' ' instead and note the single quotes denoting a character value rather than double quotes denoting a string literal.
Statement printf("%c",&str1[m]) passes the memory address of a character to printf rather than the expected character value itself. Use printf("%c",str1[m]) instead.
Hope I found everything. Correct these things, turn on compiler warnings, and try to get ahead. In case you face further troubles, don't hesitate to ask again.
Hope it helps a bit and good luck in experiencing C language :-)
There are many issues:
You cannot read a string with spaces using scanf("%s") , use fgets instead (see comments).
scanf("%s", &str1) is wrong anyway, it should be scanf("%s", str1);, str1 being already the address of the string.
for (m = 0; m <= len;) is wrong, it should be for (m = 0; m < len;), because otherwise the last character you will check is the NUL string terminator.
if (&str1[m]==" ") is wrong, you should write if (str1[m]==' '). " " does not denote a space character but a string literal; you need ' ' instead.
printf("%c", &str1[m]); is wrong, you want to print a char so you need str1[m] (without the &).
You should remove both m++ and put the increment into the for statement: for (m = 0; m < len; m++). That makes the code clearer.
And possibly a few more problems.
And BTW your attempt doesn't remove the spaces from the string, it merely displays the string skipping spaces.
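If the goal really is to remove the spaces from the string itself rather than just skip them while printing, one common approach is to compact the string in place with a read index and a write index. This is only a sketch; remove_spaces is a name invented for this example.

```c
#include <stddef.h>

/* Remove every ' ' from s in place and return the new length.
   'read' walks the original characters; 'write' marks where the next
   kept character goes. write never overtakes read, so nothing that
   still has to be read gets overwritten. */
size_t remove_spaces(char *s)
{
    size_t read, write = 0;

    for (read = 0; s[read] != '\0'; ++read) {
        if (s[read] != ' ')
            s[write++] = s[read];
    }
    s[write] = '\0';    /* terminate the shortened string */
    return write;
}
```

After reading a line with fgets(str1, sizeof str1, stdin), calling remove_spaces(str1) leaves str1 without spaces, and it can then be printed with "%s".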
There are a number of smaller errors here that are adding up.
First, check the bounds on your for loop. You're iterating from index 1 to index strlen(str1), inclusive. That's a reasonable thing to try, but remember that in C, string indices start from 0 and the last character is at index strlen(str1) - 1; index strlen(str1) holds the terminating null character. How might you adjust the loop to handle this?
Second, take a look at this line:
if (&str1[m] == " ") {
Here, you're attempting to check whether the given character is a space character. However, this doesn't do what you think it does. The left-hand side of this expression, &str1[m], is the memory address of the mth character of the string str1. That should make you pause for a second, since if you want to compare the contents of memory at a given location, you shouldn't be looking at the address of that memory location. In fact, the true meaning of this line of code is "if the address of character m in the array is equal to the address of a string literal containing a single space, then ...," which isn't what you want.
I suspect you may have started off by writing out this line first:
if (str1[m] == " ") {
This is closer to what you want, but still not quite right. Here, the left-hand side of the expression correctly says "the mth character of str1," and its type is char. The right-hand side, however, is a string literal. In C, there's a difference between a character (a single glyph) and a string (a sequence of characters), so this comparison isn't allowed. To fix this, change the line to read
if (str1[m] == ' ') {
Here, by using single quotes rather than double quotes, the right-hand side is treated as "the space character" rather than "a string containing a space." The types now match.
There are some other details about this code that need some cleanup. For instance, look at how you're printing out each character. Is that the right way to use printf with a character? Think about the if statement we discussed above and see if you can tinker with your code. Similarly, look at how you're reading a string. And there may be even more little issues here and there, but most of them are variations on these existing themes.
But I hope this helps get you in the right direction!
The for loop should start from 0 and run while m is less than the length (not less than or equal).
The string compare is wrong. It should be a char compare against ' ', with no &.
Finding a space should not do anything; only non-space characters should be output. You increment m twice.
& on the %c output passes the address, not the value.
From memory, scanf stops on whitespace anyway, so this needs fgets.
#include <stdio.h>
#include <string.h>

int main()
{
    char str1[100];
    scanf("%99s",str1);   /* width limit keeps scanf inside str1 */
    int len = strlen(str1);
    int m;
    for (m=0;m<len;++m){
        if (str1[m]!=' '){
            printf("%c",str1[m]);
        }
    }
    return 0;
}
There are a few mistakes in your logic.
scanf terminates a string when it encounters any space or new line, so use fgets as said by others.
& in C represents an address. Since an array decays to a pointer to its first element in C, you should not use & on the array name when reading a string from stdin. scanf can be used like scanf("%s",str1) or scanf("%s",&str1[0]).
While incrementing your index, put m++ inside the else branch.
Array indexing in C starts from 0, not 1.
So after these changes the code will become something like
#include <stdio.h>
#include <string.h>

int main()
{
    char str1[100];
    fgets(str1, sizeof str1 , stdin);
    int len = strlen(str1);
    int m=0;
    while(m < len){
        if (str1[m] == ' '){
            m++;
        }
        else {
            printf("%c",str1[m]);
            m++;
        }
    }
    return 0;
}
This piece of code is acting a bit strange to my taste. Please, anyone care to explain why? And how to force '\n' to be interpreted as a special char?
beco#raposa:~/tmp/user/foo/bar$ ./interpretastring.x "2nd\nstr"
1st
str
2nd\nstr
beco#raposa:~/tmp/user/foo/bar$ cat interpretastring.c
#include <stdio.h>
int main(int argc, char **argv)
{
char *s="1st\nstr";
printf("%s\n", s);
printf("%s\n", argv[1]);
return 0;
}
Bottom line, the intention is for the 2nd string to be printed on two lines, just like the first. This program is a simplification. The real program has problems reading from a file using fgets (not an OS argument passed in argv like here), but I think solving it here will also solve it there.
It seems the shell doesn't recognize and convert the "escape sequence". Use a shell or a quoting syntax that supports the \n escape sequence (in bash, for example, $'2nd\nstr' passes a real newline to the program).
For all purposes, this just takes care of \n; no other characters get special treatment.
This answer here does the job with lower complexity. It does not change the two characters into one single special \n. It just changes <\><n> to <space><newline>. That's fine. It would be better if the C Standard Library had a function to interpret special chars in a string (as I know it has for RegExp, for instance).
#include <string.h>

/* change '\\n' into ' \n' */
void changebarn(char *nt)
{
    while(nt!=NULL)
        if((nt=strchr(nt,'\\')))
            if(*++nt=='n')
            {
                *nt='\n';
                *(nt-1)=' ';
            }
}
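As far as I know, the C Standard Library has no function that interprets escape sequences at run time, so if you want the two characters turned into one real newline (rather than a space plus a newline), you have to write it yourself. Below is a rough sketch of that idea; unescape is a made-up name, and it only handles \n, \t and \\, copying anything else through unchanged.

```c
#include <stddef.h>

/* Rewrite the two-character sequences \n, \t and \\ in s as the single
   characters they stand for, in place. Unknown sequences and a trailing
   backslash are copied unchanged. Returns the new length.
   The result is never longer than the input, so no extra space is needed. */
size_t unescape(char *s)
{
    size_t r = 0, w = 0;

    while (s[r] != '\0') {
        if (s[r] == '\\' && s[r + 1] != '\0') {
            switch (s[r + 1]) {
            case 'n':  s[w++] = '\n'; r += 2; continue;
            case 't':  s[w++] = '\t'; r += 2; continue;
            case '\\': s[w++] = '\\'; r += 2; continue;
            default:   break;   /* unknown escape: copy the backslash as is */
            }
        }
        s[w++] = s[r++];
    }
    s[w] = '\0';
    return w;
}
```

In the program from the question this could be called as unescape(argv[1]) before the printf, or on the buffer filled by fgets in the real program.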
I am reading the book "The C Programming Language" by Brian Kernighan and Dennis Ritchie (2nd edition, published by PHI). In section 1.1, Getting Started, of the first chapter, A Tutorial Introduction, on page 7, they say that one must use \n in the printf() argument, otherwise the C compiler will produce an error message. But when I compiled the program without \n in printf(), it went fine. I did not see any error message. I am using Dev-C portable with the "MinGW GCC 4.6.2 32-bit" compiler.
Why do I not get the error message?
Here is the passage in question, from page 7 of the second edition of K&R:
You must use \n to include a newline character in the printf argument; if you try something like
printf("hello, world
");
the C compiler will produce an error message.
This means that you can't embed a literal newline in a quoted string.
Either one of the lines below, however, is fine:
printf("hello, world"); /* does not print a newline */
printf("hello, world\n"); /* prints a newline */
All the text above is saying is that you can't have a quoted string that spans multiple lines in the source code.
You can also escape a newline with a backslash. The C preprocessor will remove the backslash and newline, so the following two statements are equivalent:
printf("hello, world\
");
printf("hello, world");
And if you have a lot of text, you can put multiple quoted strings next to each other, or separated by whitespace, and the compiler will join them for you:
printf("hello, world\n"
"this is a second line of text\n"
"but you still need to include backslash-n to break each line\n");
You don't get a compile-time error message because there is no error.
In the first article they say that one must use \n in the printf() argument, otherwise the C compiler will produce an error message.
Can you cite (by section and/or page number) where that statement appears? I seriously do not believe that K&R (you're using the second edition, right?) says that. If it did say that, it would be an error in the book.
Update: What the book says, quite correctly, is that a newline in a string literal is represented by the two-character sequence \n, not by an actual newline character. A string literal must be on a single logical source line; something like
printf("hello
world");
is a syntax error. This applies to all string literals, whether they're printf format strings or not.
An actual newline in a string literal is an error. A \n sequence that represents a newline is optional; its lack is not an error, but a printf format string should usually end with a \n.
There is no requirement for a printf call to include the \n character, and I've never seen a compiler complain about a printf that lacks a \n.
There is an issue here, but it's not a compile-time error.
Some examples:
printf("No newline");
This is a perfectly legal call. It prints the specified string on standard output without a newline character.
printf("hello%c", '\n');
There's no \n in the format string, but it prints hello followed by a newline. Again, this is perfectly legal.
The actual issue is that you should (almost) always print a newline at the very end of your output. This complete program:
#include <stdio.h>
int main(void) {
printf("hello");
return 0;
}
is legal, but its behavior may be undefined in some implementations. The relevant rule is in the standard, section 7.21.2 paragraph 2 (the quote is from the N1570 draft):
A text stream is an ordered sequence of characters composed into
lines, each line consisting of zero or more characters plus a
terminating new-line character. Whether the last line requires a
terminating new-line character is implementation-defined.
Whether that terminating newline character is required or not, it's (almost always) a very good idea to end your output with a newline. If I run it on my system, I get the string hello immediately followed by my shell prompt on the same line. It's not illegal, but it's inconvenient and ugly.
But that applies only at the very end of the program's output. This program is perfectly valid and has well defined behavior:
#include <stdio.h>
int main(void) {
printf("hello");
putchar('\n');
return 0;
}
Still, the easiest and most reliable way to produce clean output is for each printf call to print exactly one line, which ends with exactly one '\n' character. This isn't a universal rule; sometimes it's convenient to print a line a piece at a time, or to print two or more lines in a single printf.
Very often, if you don't end your printf format string with a \n, some of the output stays in the stdout buffer, and you need to call fflush to get all the output shown.
This means that if you don't get all the expected output you should add fflush at appropriate places (e.g. before calls to fork).
But you won't get a compiler message in such a case, because it is not an error (it may be a mistake many beginners make). If you really wanted, you could customize your compiler (e.g. with MELT if using a recent GCC compiler) to get the warning. I believe it is not worth the effort (because there are legitimate calls to printf without any \n....)
An example of legitimate printf calls without newlines would be if you coded a (recursive) function to output an expression from its AST; you certainly should not emit a newline after each token.
See documentation of printf(3), fflush(3), stdio(3), setvbuf(3) etc...
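As a small sketch of the flushing advice, here is a helper that writes a prompt without a trailing newline and then flushes stdout so the text shows up right away. The name show_prompt is hypothetical; fputs and fflush are the standard functions.

```c
#include <stdio.h>

/* Write text to stdout without a newline, then flush so it is visible
   immediately even if stdout is buffered.
   Returns 0 on success and EOF on failure. */
int show_prompt(const char *text)
{
    if (fputs(text, stdout) == EOF)
        return EOF;
    return fflush(stdout);
}
```

A call like show_prompt("Enter a value: ") before reading input keeps the prompt from sitting in the buffer.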