Bug in K&R second edition? - c

Here's something that looks like a bug to me, however I am confused that my observation does not seem to pop up anywhere else on the internet, given the age and popularity of the book. Or maybe I am just bad at searching or it is not a bug at all.
I am talking about the "print out the longest input line" program from chapter one. Here's the code:
#include <stdio.h>
#define MAXLINE 1000 /* maximum input line length */
int getline(char line[], int maxline);
void copy(char to[], char from[]);
/* print the longest input line */
main()
{
int len; /* current line length */
int max; /* maximum length seen so far */
char line[MAXLINE]; /* current input line */
char longest[MAXLINE]; /* longest line saved here */
max = 0;
while ((len = getline(line, MAXLINE)) > 0)
if (len > max) {
max = len;
copy(longest, line);
}
if (max > 0) /* there was a line */
printf("%s", longest);
return 0;
}
/* getline: read a line into s, return length */
int getline(char s[],int lim)
{
int c, i;
for (i=0; i < lim-1 && (c=getchar())!=EOF && c!='\n'; ++i)
s[i] = c;
if (c == '\n') {
s[i] = c;
++i;
}
s[i] = '\0';
return i;
}
/* copy: copy 'from' into 'to'; assume to is big enough */
void copy(char to[], char from[])
{
int i;
i = 0;
while ((to[i] = from[i]) != '\0')
++i;
}
Now, it seems to me that it should be lim-2 as opposed to lim-1 in the for condition of getline. Otherwise, when the input is exactly of maximum length, that is 999 characters followed by '\n', getline will index into s[MAXLINE], which is out of bounds, and also all kinds of horrible things might happen when copy is called and from[] does not end with a '\0'.

I think you're confused somewhere. This loop condition:
for (i=0; i < lim-1 && (c=getchar())!=EOF && c!='\n'; ++i)
Ensures that i is never greater than lim - 2 inside the loop body, so in the maximum-length case, i is lim-1 after the loop exits and the null character is stored into that last position.

In the case of 999 non-\n characters followed by a \n, c never equals \n. When the for loop exits, c is equal to the last non-newline character.
So c != '\n' and doesn't enter the block that does i++ so i never goes out of bounds.

When the input is 999 characters followed by \n, lim is 1000 and lim-1 is 999, so the loop test i < lim-1 becomes false once i reaches 999, after s[998] has been stored and before the \n is read. While i < 999, c == '\n' is never true, so the if block is skipped and the terminating '\0' is written to s[999], not s[1000].

Related

Program adds random characters to string in C

I am in the process of learning C, and I stumbled upon a bug, and I can't seem to get my head around it.
This code is supposed to accept multiple lines of text (each followed by Enter), compare the lines to each other, stop comparing upon hitting EOF (i.e. Ctrl-Z), choose the longest string and print it out to the console.
It does that, but upon printing the text out, it sometimes adds some additional characters to the output.
This is the code:
int get_line(char s[]);
void copy(char from[], char to[]);
main()
{
int len;
char line[1000]; //the array which is used to compare
char longest[1000]; //the array in which the longest line will be stored
int max = 1; //the maximum length of a line
while ((len = get_line(line)) > 0) //checks if the length of a line in more than 0
if (len > max)
{
max = len;
copy(line, longest);
}
if (max > 0)
printf("%s", longest); //prints out the longest line, if it exists
}
int get_line(char s[])
{
int c, i;
for (i = 0; (c = getchar()) != EOF && c != '\n'; ++i) //reads the line until ^Z of Enter
s[i] = c;
return (i);
}
copy(char from[], char to[])
{
for (int i = 0; from[i] != '\n'; ++i)
to[i] = from [i];
}
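In C, a string is an array of chars terminated by '\0'; this is called a null-terminated string.
When you call printf("%s", longest), it prints characters starting at longest[0] and keeps going until it reaches a '\0'. Your get_line never stores one, and copy stops at a '\n' that get_line never stored either, so longest holds leftover garbage after the text you typed.
To fix this, the array needs room for the string plus one extra element, and the terminator goes right after the last character: s[i] = '\0' at the end of get_line. Then make copy stop at '\0' (and copy it too) instead of '\n'.
That terminator tells printf() where the string ends.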

K&R - Section 1.9 - Character Arrays (concerning the getline function)

My question is concerning some code from section 1.9-character arrays in the Kernighan and Ritchie book. The code is as follows:
#include <stdio.h>
#define MAXLINE 1000 /* maximum input line size */
int getline(char line[], int maxline);
void copy(char to[], char from[]);
/* print longest input line */
main()
{
int len; /* current line length */
int max; /* maximum length seen so far */
char line[MAXLINE]; /* current input line */
char longest[MAXLINE]; /* longest line saved here */
max = 0;
while ((len = getline(line, MAXLINE)) > 0)
if (len > max) {
max = len;
copy(longest, line);
}
if (max > 0) /* there was a line */
printf("%s", longest);
return 0;
}
/* get line: read a line into s, return length */
int getline(char s[], int lim) //
{
int c, i;
for (i=0; i<lim-1 && ((c=getchar()) != EOF) && c != '\n' ; ++i)
s[i] = c;
if (c == '\n') {
s[i] = c;
++i;
}
s[i] = '\0';
return i;
}
/* copy : copy 'from' into 'to'; assume to is big enough */
void copy(char to[], char from[])
{
int i;
i = 0;
while ((to[i] = from[i]) != '\0')
++i;
}
My question is about the getline function. Now looking at this following output from my command line as a reference:
me#laptop
$ characterarray.exe
aaaaa
a
aaaaaa
^Z
aaaaaa
me#laptop
$
When I type in the first character which is 'a', does the character 'a' go through the for loop, in the getline function, and initialize s[0] - s[998] = a ? And my second part of the question is once the program leaves the for loop and goes to the
s[i] = '\0'
return i;
wouldn't it be initialized s[998] = '\0' and the return integer is 998? I've spent over an hour staring at this problem and I can't seem to grasp what is going on.
Character 'a' does not go through the for loop. Look here:
for (i=0; i<lim-1 && ((c=getchar()) != EOF) && c != '\n' ; ++i)
c = getchar() is where your program stops and waits for user input, not just once but on every pass through the loop. It is part of the loop's continuation condition, so getchar() is called each time the condition is checked. The loop handles one char at a time: each char is stored at the current end of the array (indicated by i), and the loop ends when a newline is read, EOF is reached, or the size limit is hit. This is very compact, old-school C that many people find hard to read today. It can be broken into the following parts:
i = 0;
while (i < lim - 1) {
    c = getchar();
    if (c == EOF || c == '\n') break;
    s[i] = c;
    ++i;
}
s[i] = '\0';
return i;
(also I don't think a newline symbol is needed to be included in the string, so I omitted the
if (c == '\n') {
s[i] = c;
++i;
}
part.)
Here is how the input is working in the for-loop.
Before the for-loop, the array s is empty-ish: it is full of whatever garbage is in that memory. If you type a single a and hit Enter, getline sees 2 characters, 'a' and '\n'. For 'a', the variable c becomes 'a' from getchar() in the for-loop condition, and that gets saved in the i spot (which is 0) of s. So, the array s is now
s[0] = 'a' and the rest of s has random garbage in it.
Then c becomes '\n' from getchar(). This stops the for-loop because of the c != '\n' check. The if-statement has s[i], where i is 1, become '\n' and i bumps up to 2.
Now s is
s[0] = 'a'
s[1] = '\n'
getline finishes up by making s[2] be '\0', which is the end of string character, and returns i.
Your end result is
s[0] = 'a'
s[1] = '\n'
s[2] = '\0'
and i, your length, is 2.

Broken code on K&R?

On page 29 of 'The C Programming Language' (second edition) by K&R I've read a procedure that I think is broken. Since I'm a beginner I expect I'm wrong, though I cannot explain why.
Here's the code:
#include <stdio.h>
#define MAXLINE 1000 // Maximum input line size
int get1line(char line[], int maxline);
void copy(char to[], char from[]);
// Print longest input line
int
main()
{
int len; // Current line length
int max; // Maximum length seen so far
char line[MAXLINE]; // Current input line
char longest[MAXLINE]; // Longest line saved here
max = 0;
while ((len = get1line(line, MAXLINE)) > 0)
if (len > max) {
max = len;
copy(longest, line);
}
if (max > 0) // There was a line to read
printf("Longest string read is: %s", longest);
return 0;
}
// `get1line()` : save a line from stdin into `s`, return its length
int
get1line(char s[], int lim)
{
int c, i;
for (i = 0; i < lim -1 && (c = getchar()) != EOF && c != '\n'; ++i)
s[i] = c;
if (c == '\n') {
s[i] = c;
++i;
}
s[i] = '\0';
return i;
}
// `copy()` : copy `from` into `to`; assuming
// `to` is big enough.
void
copy(char to[], char from[])
{
int i;
i = 0;
while ((to[i] = from[i]) != '\0')
++i;
}
What puzzles me is this: we call get1line, and suppose that at the end of the for-loop i is lim-1. Then the following if-statement will bump i to lim, causing the next instruction (the one that puts the null character at the end of the string) to corrupt the stack (since s[lim] is not allocated in that case).
Is the code broken?
Summary: It's not possible to exit the loop with both i == lim-1 and c == '\n', so the case you are worried about never arises.
In detail: We can rewrite the for loop (while preserving its meaning) to make the order of events clear.
i = 0;
for (;;) {
    if (i >= lim-1) break; /* (1) */
    c = getchar();
    if (c == EOF) break; /* (2) */
    if (c == '\n') break; /* (3) */
    s[i] = c;
    ++i;
}
At loop exit (1) it can't be the case that c == '\n' because if that were the case then the loop would have exited at (3) the previous time around.*
At loop exits (2) and (3) it can't be the case that i == lim-1 because if that were the case then the loop would have exited at (1).
* This depends on lim being at least 2, so that there was in fact a previous time around the loop. The program only ever calls get1line with lim equal to MAXLINE, so this is always the case.**
** You could make the function safe when lim is less than 2 by initializing c to a value other than '\n' before the loop begins. But if you are concerned about this possibility, then you might also want to be concerned about the possibility that lim is INT_MIN, so that lim-1 results in undefined behaviour due to integer overflow.
The code is wrong if lim == 0 because it uses c uninitialised and adds the \0.
It is also wrong if lim == 1 because it uses c uninitialised. Calling the function with lim < 2 is not very useful, but it should not fail like this.
If lim > 1 then the function is OK
for (i = 0; i < lim -1 && (c = getchar()) != EOF && c != '\n'; ++i)
s[i] = c;
The loop exits either if i == lim-1 or if c == EOF or if c == '\n'.
If the first condition is true (i == lim-1), then the last condition is definitely not true (unless lim < 2, as noted above).
If the first condition is false (i < lim-1), then even if the loop exits with c == \n, we know that there is space in the buffer because we know that i < lim-1.

Writing in the location outside of array

I've just started learning programming. This is my first post. I'm reading a book "C Programming Language" by Kernighan and Ritchie, and I came across an example that I don't understand (section 1.9, p 30).
This program takes text as input, determines the longest line, and prints it.
Char array line[MAXLINE] is declared, where MAXLINE is 1000. This should mean that the last element of this array has index of MAXLINE-1, which is 999.
However, if you look at function getline, which is being passed line[] array as an argument (and MAXLINE as lim), it appears that if user input is a line longer than MAXLINE, i will be incremented until i = lim, that is, i = MAXLINE. Therefore, the statement line[i] = '\0' will be line[MAXLINE] = '\0'.
This looks wrong to me - how can we write to the line[MAXLINE] location, if the size of line[] is MAXLINE. Wouldn't it be writing into the location outside of the array?
The only explanation I can come up with is that when declaring char array[size], C language actually creates char array[size+1] array, where the last element is reserved for the NULL character. If so, this is pretty confusing, and isn't mentioned in the book. Can anyone confirm this, or explain what's going on?
#include <stdio.h>
#define MAXLINE 1000 /* maximum input line length */
int getline(char line[], int maxline);
void copy(char to[], char from[]);
/* print the longest input line */
main()
{
int len; /* current line length */
int max; /* maximum length seen so far */
char line[MAXLINE]; /* current input line */
char longest[MAXLINE]; /* longest line saved here */
max = 0;
while ((len = getline(line, MAXLINE)) > 0)
if (len > max) {
max = len;
copy(longest, line);
}
if (max > 0) /* there was a line */
printf("%s", longest);
return 0;
}
/* getline: read a line into s, return length */
int getline(char s[],int lim)
{
int c, i;
for (i=0; i < lim-1 && (c=getchar())!=EOF && c!='\n'; ++i)
s[i] = c;
if (c == '\n') {
s[i] = c;
++i;
}
s[i] = '\0';
return i;
}
/* copy: copy 'from' into 'to'; assume to is big enough */
void copy(char to[], char from[])
{
int i;
i = 0;
while ((to[i] = from[i]) != '\0')
++i;
}
This for loop appears to be doing the reading in getline:
for (i=0; i < lim-1 && (c=getchar())!=EOF && c!='\n'; ++i)
s[i] = c;
It looks like i is incremented until it reaches lim - 1, not lim (where lim here is equal to MAXLINE in the case you were talking about). Hence, if the line is longer than MAXLINE, it stops after reading MAXLINE-1 characters, and tacks on the '\0' at the end like you expect.
If you look at this line, you can see that the loop body stops at index lim-2, two positions before lim, because of the test i < lim-1:
for (i=0; i < lim-1 && (c=getchar())!=EOF && c!='\n'; ++i)
If the next char read is a \n it is appended, so the 0 byte lands exactly on the last element, s[lim-1], when the line (newline included) is exactly one byte shorter than the limit. That is correct, because the 0 byte has to fit within the limit too.
No, I think it is clean.
Note that since the book was written, POSIX has standardized a getline() function with a completely different interface; this can cause some grief, but it is fixable by renaming the function from K&R.
The code is:
int getline(char s[],int lim)
{
int c, i;
for (i = 0; i < lim-1 && (c=getchar()) != EOF && c != '\n'; ++i)
s[i] = c;
if (c == '\n') {
s[i] = c;
++i;
}
s[i] = '\0';
return i;
}
Let's consider 2 cases:
998 characters followed by newline.
999 characters followed by newline.
In the first case, when the character before the newline is read, i is 997, which is less than 999 (lim-1), so the getchar() is executed, the character is neither EOF nor newline, and s[997] is assigned, and i is incremented to 998. Since i is still less than 999, the newline is read, and the loop is terminated. Because c is the newline, s[998] is given the newline and i is incremented to 999. Then the assignment s[i] = '\0'; writes to element 999, which is safe.
The analysis in the second case is similar. When the character before the newline is read, i is 998, which is less than 999, so getchar() is executed, the character is neither EOF nor newline, so s[998] is assigned, and i is incremented to 999. Since i is no longer less than 999, the loop exits without reading the newline; since c is not a newline, the body of the if after the loop is not executed; then the null is written to s[999], which is safe.
If EOF is detected before the newline (so the file doesn't end with a newline and technically isn't a text file according to the C standard), the loop is safely broken without overflowing the buffer.
Is there a case that isn't covered?
This is called testing the boundary conditions. It is important to test just below a limit (to make sure it works OK) and at the limit (to ensure it handles that OK). Most of the time, the algorithm doesn't need more than one test just below and one test at the limit; sometimes, if the algorithm handles several numbers either side of a limit (e.g. average of 3 cells), then you have to do more testing at the upper boundary. Lower boundary testing is also important — testing for 0, 1, 2, ... is very valuable.
general answer
reading/writing outside of allocated memory is undefined behaviour.
In many cases it will lead to the dreaded Segmentation fault.
In some cases you might get away due to sheer luck (e.g. because the actual memory you have accessed is physically/logically existing and not used otherwise).
the simple answer is: do not do this!! protect your code against accessing out-of-bounds memory.
C never does any magic, like allocating n+1 bytes when you only asked it to allocate n bytes.
as for your specific example
for (i=0; i < lim-1 /* ... */ ; ++i)
this will not really increment i up to lim, as the condition makes sure that i is smaller than lim-1, so as soon as it reaches lim-1 (which is still a valid index within s[]) it will stop the for-loop.

Embracing/or not ++i in one part of a code. K&R longest line example

int getline(char s[], int lim)
{
int c, i;
for(i=0; i<lim-1 && (c=getchar())!=EOF && c!='\n'; ++i)
s[i] = c;
if(c=='\n'){
s[i] = c;
++i;
}
s[i] = '\0';
return i;
}
This example is from K&R book on C, chapter 1.9 on arrays. What I do not understand is why do we have to embrace ++i inside if statement? Writing it outside should do the same work.
if(c=='\n')
s[i] = c;
++i;
s[i] = '\0'
return 0;
}
With the braces around ++i the program works as intended, but in the second case (which in my opinion should do the same thing, which is why I edited that part) it doesn't. I ran it through the debugger and watched i, which in both cases was correctly calculated and returned. But the program still won't work without the braces around ++i. I don't get my output from the printf statement, and Ctrl+D just won't work in the terminal or in XTerm (through Code::Blocks). I can't figure out why. Any hint please? Am I missing some logical step? Here is the complete code:
//Program that reads lines and prints the longest
/*----------------------------------------------------------------------------*/
#include <stdio.h>
#define MAXLINE 1000
int getline(char currentline[], int maxlinelenght);
void copy(char saveto[], char copyfrom[]);
/*----------------------------------------------------------------------------*/
int main(void)
{
int len, max;
char line[MAXLINE], longest[MAXLINE];
max = 0;
while( (len = getline(line, MAXLINE)) > 0 )
if(len > max){
max = len;
copy(longest, line);
}
if(max > 0)
printf("StrLength:%d\nString:%s", max, longest);
return 0;
}
/*----------------------------------------------------------------------------*/
int getline(char s[], int lim)
{
int c, i;
for(i=0; i<lim-1 && (c=getchar())!=EOF && c!='\n'; ++i)
s[i] = c;
if(c=='\n'){
s[i] = c;
++i;
}
s[i] = '\0';
return i;
}
/*----------------------------------------------------------------------------*/
void copy(char to[], char from[])
{
int i;
i = 0;
while( (to[i]=from[i]) != '\0')
++i;
}
/*----------------------------------------------------------------------------*/
Unless the loop stopped because i reached lim-1, the line
if(c == '\n')
is equivalent to
if(c != EOF)
Does that help explain why the embracing occurs?
There is a logic there:
if(c=='\n'){
s[i] = c;
++i;
}
The logic is that only when you have read a newline do you need to increment i once more, to keep space for the \0 character. If you put ++i outside the if block, i is always increased by 1, even when there was no newline in the input. In that case i was already incremented in the for loop, so there is already space for \0, and incrementing i again is wrong. You can print the value of i to see how it works.
The index specified by i is the location where the terminating null should be placed when there is no more input for the line. The location just before the index i contains the last valid character in the string.
Keep in mind that the loop that reads data from stdin can terminate for reasons other than reading a \n character.
If you had this construct:
if(c=='\n')
s[i] = c;
++i;
then if the last character read from stdin wasn't a newline you would increment the index by one without writing anything into the location specified by the pre-incremented value of i. You would be effectively adding an unspecified character to the result.
Worse(?), if the for loop terminated because of the i<lim-1 condition you would end up writing the terminating null character after the specified end of the array, resulting in undefined behavior (memory corruption).
The ++i is inside the if statement because we do not want to increment i if we are not placing the \n character in the current index; that would result in leaving an index in between the last character of the input and the \0 at the end of the character string.
The for loop can exit for three reasons, which fall into two cases:
1. The character limit is reached or EOF is encountered
2. A newline is encountered
In the first case we only need to store the null character in s. Since i already points to the position after the last valid character read, there is no need to increment i.
In the second case i also points to the position after the last valid character read, so we store the newline at that position and then increment i to make room for the null character.
That's why we increment i in the second case but not in the first.
if(c=='\n'){
s[i] = c;
++i;
}
