This question references Reflections on Trusting Trust, figure 2.
Take a look at this snippet of code, from figure 2:
...
c = next( );
if(c != '\\')
    return(c);
c = next( );
if (c != '\\')
    return('\\');
if (c == 'n')
    return('\n');
It says:
This is an amazing piece of code. It "knows" in a completely portable way what character code is compiled for a new line in any character set. The act of knowing then allows it to recompile itself, thus perpetuating the knowledge.
I would like to read the rest of the paper. Can someone explain how the above code is recompiling itself? I'm not sure I understand how this snippet of code relates to the code in "Stage 1":
(source: bell-labs.com)
The stage 2 example is very interesting because it is an extra level of indirection with a self replicating program.
What he means is that, because this compiler code is itself written in C, it is completely portable: it detects a literal \n in the source and returns the character code for \n without the source ever stating what that code is. The actual value comes from the compiler that compiled this code for the system.
The paper goes on to show a very interesting Trojan horse in the compiler. If you use this same technique to make the compiler insert a bug into any program, and then remove the bug from the compiler's source code, the compiler will still compile the bug into the supposedly bug-free compiler.
It is a bit confusing but essentially it is about multiple levels of indirection.
What this piece of code does is translate escape sequences, which is part of the job of a C compiler.
c = next( );
if(c != '\\')
    return(c);
Here, if c is not '\\' (the backslash character \), it is not the start of an escape sequence, so c itself is returned.
If it is a backslash, it is the start of an escape sequence.
c = next( );
if (c == '\\')
    return('\\');
if (c == 'n')
    return('\n');
There is a typo in your question: the test is if (c == '\\'), not if (c != '\\'). This piece of code goes on to examine the character following the \. If it is another \, the whole escape sequence is \\, so a single backslash is returned. The same goes for \n.
The description of that code, from Ken Thompson's paper is: (emphasis added)
Figure 2 is an idealization of the code in the C compiler that interprets the character escape sequence.
So you're looking at part of a C compiler. The C compiler is written in C, so it will be used to compile itself (or, more accurately, the next version of itself). Hence the statement that the code is able to "recompile itself".
Exercise 1-9 in The C Programming Language by Dennis Ritchie and Brian Kernighan, second edition:
Write a program to copy its input to its output, replacing each string of one or more blanks by a single blank.
Given that I'm using the book as a reference, I know only the C principles which have been discussed in the book up to Exercise 1-9, that is, variable assignments, while- and for-loops, if-statements, symbolic constants, character I/O via getchar() and putchar(), escape sequences and the printf() function. Maybe I'm forgetting something, but that's most of it. (I'm at page 20, where the exercise is at.)
Here's my (not working) code:
#include <stdio.h>
main()
{
    int c;
    while ((c = getchar()) != EOF) { // As long as EOF is not met, repeat the following…
        if(c == ' ') { // If the input character is a blank, proceed (otherwise skip)…
            putchar(c); // Output the blank space which was just inputed…
            while(c == ' ') { // As long as more spaces keep coming in, don't do anything (proceed when another character comes along)…
                ;
            }
        }
        else { // When a character other than a blank is inputed, output that character…
            putchar(c);
        } // Now retest the master while-loop condition (EOF not met) and proceed…
    }
}
What I'm getting as a result is a working input-to-output program, that keeps on inputting and stops outputting the moment a blank is typed. (An exception to this is if the blank is removed with a backspace before entering a new line in the console.)
For example, the input abcde\nabcde abcde\nabcde will yield the output abcde, omitting the second and third lines, given that a blank is contained in the former. Here I am obviously using \n to represent an inputted new line (normally using the Enter key).
What have I done wrong, and what could I do to fix this issue? I know there are several working models of this program spread all over the internet, but I'm wondering why this one (which is my creation) in particular doesn't work. Again, do note that my knowledge of C is mostly limited to the first twenty pages of the book whose details are provided below.
Specs:
I'm running Eclipse version 2021-12 (4.22.0) on Debian GNU/Linux 11 (bullseye). I downloaded the pre-compiled Eclipse version from the official Eclipse.org website.
References:
Kernighan, B.W. and Ritchie, D.M. (1988). The C programming language / ANSI C Version. Englewood Cliffs, N.J.: Prentice Hall.
while(c == ' ') is an infinite loop: nothing inside it changes c, so once a blank has been read the condition stays true forever.
Instead, remember the previous character, and if both the previous character and the current character are blanks, skip the current one.
I am a noob teaching myself to program in C using The C Programming Language, Second Edition (by K&R). In Chapter 1, Section 1.5.1 (File Copying), the authors touch very briefly on operator precedence when comparing values, underscoring the importance of using parentheses, in this case to ensure that the assignment to the variable c is made before the comparison is evaluated. They assert that:
c = getchar() != EOF
is equivalent to
c = (getchar() != EOF)
Which "has the undesired effect of setting c to 0 or 1, depending on whether or not the call of getchar encountered end of file"
The authors then pose Exercise 1-6: Verify that the expression getchar() != EOF is 0 or 1.
Based on the author's previous assertion, this seemed almost trivial so I created this code:
#include <stdio.h>
main()
{
    int c;
    while (c = (getchar() != EOF))
        putchar(c);
}
Unfortunately, when I run the program, it simply outputs whatever characters I type in the command window rather than the expected string of 1 or 0 if EOF is encountered.
While I am a noob, I think I get the logic that the authors are trying to teach, and yet I cannot demonstrate this simple task. In this case, shouldn't the variable c take on the value that the comparison expression evaluated to, rather than whatever character getchar() happens to fetch, particularly because of the location of the parentheses? If c is indeed taking on the value of the comparison, putchar() should only output 0 or 1, and yet, as formulated, it outputs what I type in the command window. What am I doing wrong? What do I not understand? Could it be my compiler? I am coding in Visual Studio 2017 Community edition on Windows 10 on x64 architecture. I have Tiny C Compiler but have not tried executing from the command prompt with TCC yet.
When you run the program, the characters that you see don't come from your program. It's the echo function of the console (or terminal) that shows whatever characters you have typed (and you can even erase them before you hit Enter). Your program only outputs characters with ASCII code 0 or 1, both of which are invisible.
If you change putchar(c) to printf("%d", c) you'll be able to see a sequence of 1s. No zero will appear because when c becomes zero, the loop stops and it won't be printed.
The characters '0' and '1' have the ASCII codes 48 and 49, respectively (your terminal may use another encoding). If you want to output the digit 0, use character notation, as shown below. You can also try putchar(48), but don't rely on this: you'll later find out that using magic numbers in your program is strongly discouraged.
putchar('0');
        ^ ^
The assertion that
c = getchar() != EOF;
is equivalent to
c = (getchar() != EOF);
is because of operator precedence. The operator != (inequality) has higher precedence than = (assignment), so it is evaluated before the assignment.
Finally, it is extremely rare for anyone to actually want that. What is almost always intended is this:
while ( (c = getchar()) != EOF )
With c = (getchar() != EOF), one input character is read and then compared with EOF. The result is 1 if it is not EOF, and that result is what gets assigned to c. The value of an assignment expression is the value being assigned, so the loop is entered and putchar prints the character whose ASCII value is 1. That character is non-printable, so you don't see anything.
Once getchar returns EOF, the comparison yields 0 and the loop ends. So the only thing ever printed is the character with ASCII value 1, and you don't see even that, because it is one of the ASCII control characters rather than a printable character.
You also said that with c = 0 or c = 1 it should print 0 or 1. Try this simple code:
int c= 68;
putchar(c);
Check the output and you will get the idea of what happens when we print with putchar: the character whose ASCII code is 68 (D) is printed, not the number 68.
The right way to do it would be ((c = getchar()) != EOF).
Originally I mentioned that on some machines it prints some funny characters. How non-printable characters are displayed depends on the character set in use. A non-standard (non-Unicode) encoding might assign a visible glyph to code 1, contrary to the idea of a non-printable character.
Alright, so I'm doing exercise 8 in K&R second edition. My own attempt printed nothing but the newline count: the counts for tabs and blanks stayed 0 despite the loops meant to count them. (I later found out that I had used the wrong character for a blank, which is just a blank space, but it still failed to count '\t' correctly.) When I looked up the answer, I found this:
#include <stdio.h>
int main(void)
{
    int blanks, tabs, newlines;
    int c;
    int done = 0;
    int lastchar = 0;
    blanks = 0;
    tabs = 0;
    newlines = 0;
    while(done == 0)
    {
        c = getchar();
        if(c == ' ')
            ++blanks;
        if(c == '\t')
            ++tabs;
        if(c == '\n')
            ++newlines;
        if(c == EOF)
        {
            if(lastchar != '\n')
            {
                ++newlines;
            }
            done = 1;
        }
        lastchar = c;
    }
    printf("Blanks: %d\nTabs: %d\nLines: %d\n", blanks, tabs, newlines);
    return 0;
}
Now this works fine. K&R is interesting in that it uses ideas not taught to you in the actual text, for instance I tried to run my "while" loop with multiple IFs the same way this one does, except my WHILE loop ran only when getchar was != EOF. I want to know why it didn't work that way.
I found that what they did, creating the int done and assigning it 1 instead of 0 at the end of the program, was a much better idea, but mine still ran somewhat correctly. (Sorry, I don't have my own original code this time.)
Where I am stumped is what is the purpose of main(void) and return 0;? Before starting this book I found criticism on this but readers claimed it was only in the 1st edition. Here I find that the 2nd edition doesn't teach that but then puts it in the solutions text.
Also, what is the purpose of the int "lastchar"? If getchar(c) is the input and lastchar is always defined as 0, then how could lastchar possibly be changed by any input whatsoever to make it meaningful to the program at all by running a loop to count newlines with it? I see that lastchar is defined as 'c' at the end of the program, but how does that pertain to it being called previously?
Sorry if any of my questions are complicated. Please just answer whatever you can and let me know if you need any further clarification. Just to reiterate I'm very curious why the program can't run a while loop using getchar(c) != EOF, with the same IF statements. Rather than using while done == 0. I feel as if it could be a little shorter/concise (definitely can't say simpler) that way.
Where I am stumped is what is the purpose of main(void) and return 0;?
In standard C programs, main() should return an int, and 0 indicates successful program completion. The void in main(void) states that main takes no parameters. One could argue that main should have two parameters (the command-line argument count and an array of arguments), but if your program doesn't make use of arguments then they aren't necessary.
Also, what is the purpose of the int "lastchar"?
At the end of each pass through the while loop, the program stores a copy of the current character in the lastchar variable. As you can see in the EOF-handling code, it uses lastchar to determine whether the input text ended in a partial line, that is, a final line without a terminating newline.
I'm very curious why the program can't run a while loop using getchar(c) != EOF, with the same IF statements.
You could code it that way, but the conditional for the while can appear confusing to someone who doesn't have a lot of experience with C: while ((c = getchar()) != EOF). You would also have to move the if (lastchar != '\n') ++newlines; to just outside of the while loop.
Maybe you should make that change to the program and compare its output to the original for various types of input (empty file, file ending with a newline, file not ending with a newline). Do both programs show the same output? If not, why? Does the modified version still seem more concise? Which would be easier to make changes to in the future?
Many decisions go into a choice of how to structure a program. Even one as simple as this K&R example.
I'm new to programming and I can't seem to get my head around why the following happens in my code, which is:
#include <stdio.h>
/*copy input to output; 1st version */
main()
{
    int c;
    c = getchar();
    while (c != EOF) {
        putchar(c);
        c = getchar();
    }
}
So after doing some reading, I've gathered the following:
Nothing executes until I hit Enter as getchar() is a holding function.
Before I hit Enter, all my keystrokes are stored in a buffer
When getchar() is called, it simply looks at the first value in the buffer, returns that value, and then removes that value from the buffer.
My question is that when I remove the first c = getchar() the resulting piece of code has exactly the same functionality as the original code, albeit before I type anything a smiley face symbol immediately appears on the screen. Why does this happen? Is it because putchar(c) doesn't hold up the code, and tries to display c, which isn't yet defined, hence it outputs some random symbol? I'm using Code::Blocks if that helps.
The function you listed will simply echo back to you every character you type at it. It is true that the I/O is "buffered". It is the keyboard input driver of the operating system that is doing this buffering. While it's buffering keys you press, it echoes each key back at you. When you press a newline the driver passes the buffered characters along to your program and getchar then sees them.
As written, the function should work fine:
c = getchar(); // get (buffer) the first char
while (c != EOF) { // while the user has not typed ^D (EOF)
    putchar(c); // put the character retrieved
    c = getchar(); // get the next character
}
Because of the keyboard driver buffering, it will only echo back every time you press a newline or you exit with ^D (EOF).
The smiley face is coming from what @YuHao described: you might be missing the first getchar in what you ran, so putchar is echoing junk. Probably a small value such as 1 or 2, which a console using code page 437 draws as a smiley face.
If you omit the first getchar(), the code will look like this:
int c;
while (c != EOF) {
    putchar(c);
    c = getchar();
}
Here, c is uninitialized, so the first call to putchar(c) outputs a garbage value. That's where the smiley face comes from.
"I'm new to programming"
You're not advised to learn programming using C (and the difficulties you're going through are because of the C language). For example, my first computer science classes were in Pascal. Other universities may use Scheme or Lisp, or even structured natural languages, to teach programming. MIT's online classes are given in Python.
C is not a language you would want to use in the first months of programming. The specific reason in your case is that the language allowed you to use the value of an uninitialized variable.
When you declare the integer variable "c", it gets an implicitly reserved space on the program stack, but without any meaningful value: it's "trash", whatever value was already in memory at that time. The C language requires the programmer to know that a variable must be assigned a value before it is used. Removing the first getchar results in uses before assignment in the while condition (c != EOF) and in putchar(c), both before c has any meaningful value.
Consider the same code rewritten in python:
import sys

c = sys.stdin.read(1)
while c != '':
    sys.stdout.write(c)
    c = sys.stdin.read(1)
Remove the initial read and you get the following error:
hdante@aielwaste:/tmp$ python3 1.py
Traceback (most recent call last):
  File "1.py", line 3, in <module>
    while c != '':
NameError: name 'c' is not defined
That's a NameError: using the variable before assigning to it is reported as an error by the language.
For more information, try an online course, for example:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-00-introduction-to-computer-science-and-programming-fall-2008/video-lectures/
About uninitialized values:
http://en.wikipedia.org/wiki/Uninitialized_variable
I am working through K&R, The C Programming Language, and I am stuck on Exercise 1-8.
The exercise itself:
Write a program to replace each tab by the three-character sequence >, backspace, -, which prints as →, and each backspace by the similar ←. This makes tabs and backspaces visible.
As I understand it, the exercise asks me to show arrows in place of tabs and backspaces. But I can't figure out how to combine two characters into one in C.
Here is the program itself:
#include <stdio.h>
main ()
{
    int c;
    while ((c=getchar()) !=EOF)
    {
        if (c == '\t')
            printf(">->->\b");
        if (c == '\b')
            printf("<-<-<-\b");
        if (c !='\t')
            if (c !='\b')
                putchar(c);
    }
    getchar();
}
So where is my mistake? Can you help me here?
The sequence desired is
>\b-
Note that this may not work on modern terminal emulators, since most do not support overprinted characters. The original idea was to mimic the old typewriter technique of printing a character, backing the head up by one character, and striking another character over top of the previous one.
If your terminal supports UTF-8, you can substitute the '→' Unicode glyph (U+2192 RIGHTWARDS ARROW), which is encoded in UTF-8 as
\xe2\x86\x92
Similarly, '←' (U+2190) is
\xe2\x86\x90