In K&R, the following code is proposed to count the lines, words and characters in the input. Exercise 1.11 asks:
How would you test the word count program? What kinds of input are
most likely to uncover bugs if there are any?
The only answer I see to these questions is testing the code on some input that contains several lines, words and tabs.
Can you see any other way to test this code?
#include <stdio.h>

#define IN  1 /* inside a word */
#define OUT 0 /* outside a word */

/* count lines, words and characters in input */
int main(void)
{
    int c, nl, nw, nc, state;

    state = OUT;
    nl = nw = nc = 0;
    while ((c = getchar()) != EOF) {
        ++nc;
        if (c == '\n')
            ++nl;
        if (c == ' ' || c == '\n' || c == '\t')
            state = OUT;
        else if (state == OUT) {
            state = IN;
            ++nw;
        }
    }
    printf("%d %d %d\n", nl, nw, nc);
    return 0;
}
Test the program using all of the following types of inputs:
An empty file.
A file with only new lines and no words.
A file with very long words, all on one line.
A file with very long words, on many lines.
A file containing special characters: the program might produce invalid output, but it should not crash.
Test the program with "N" blank lines inserted at random locations throughout the document.
Test the program with "N" blank lines inserted at the beginning of the document.
Test the program with "N" blank lines inserted at the end of the document.
Test the program with both one character words and long words, including hyphenated words with these inputs:
A file with only one space separating each word.
A file with one space or "N" spaces separating each word.
A file with only one tab separating each word.
A file with one tab or "N" tabs separating each word.
A file with only one space OR tab separating each word.
A file with one space or "N" spaces OR tabs separating each word.
Test the program with single quotes and double quotes, with and without spaces between the words and the quotes, and with nested levels of quotes.
Also:
Make sure the program doesn't count unintended characters as a word or part of a word. For example, make sure a carriage return, which is a legal MS-DOS character, is not counted as a word if it is included at the end of a line.
Create the largest possible file for which space was designated for this application, and make sure that the program does not crash, that other applications are NOT impacted, and that the output is correct.
Create the largest possible file for which space was designated for this application, containing only spaces, newlines and tabs, except for words at the end of the file, and make sure that the program does not crash, that other applications are NOT impacted, and that the output is correct.
Create the largest possible file for which space was designated for this application, containing only spaces, newlines and tabs, except for words at the beginning of the file, and make sure that the program does not crash, that other applications are NOT impacted, and that the output is correct.
Create the largest possible file for which space was designated for this application, containing only one very long word: the word count reported by the program should be 1.
Have the program write a debugging file that contains a printf for each while, if, and else statement. Make sure that the tests cause all of the printf statements to be reached. In other words, there shouldn't be any parts of the code that remain unused at the end of the testing.
If the output doesn't match the output of the wc program, there should be a good reason for the difference.
The idea behind the question is to illustrate the concept of "white box" testing. Look at every "choice point" in your program, and see how you can exercise the logic behind it to uncover the "corner cases":
To exercise the while loop, feed it input that has no data (i.e. EOF comes right away)
Feed the program a file with a single line and no \n before EOF to exercise the line counting if
Feed the program a file with one or more lines composed entirely of whitespace characters
Feed the program a file with the last \n missing, and see if the last word gets counted
Feed the program a file with single-character words to exercise the logic of switching between IN and OUT
Exercise 1-9 in The C Programming Language by Dennis Ritchie and Brian Kernighan, second edition:
Write a program to copy its input to its output, replacing each string of one or more blanks by a single blank.
Given that I'm using the book as a reference, I know only the C principles which have been discussed in the book up to Exercise 1-9, that is, variable assignments, while- and for-loops, if-statements, symbolic constants, character I/O via getchar() and putchar(), escape sequences and the printf() function. Maybe I'm forgetting something, but that's most of it. (I'm at page 20, where the exercise is at.)
Here's my (not working) code:
#include <stdio.h>

main()
{
    int c;

    while ((c = getchar()) != EOF) { // As long as EOF is not met, repeat the following…
        if(c == ' ') { // If the input character is a blank, proceed (otherwise skip)…
            putchar(c); // Output the blank space which was just inputed…
            while(c == ' ') { // As long as more spaces keep coming in, don't do anything (proceed when another character comes along)…
                ;
            }
        }
        else { // When a character other than a blank is inputed, output that character…
            putchar(c);
        } // Now retest the master while-loop condition (EOF not met) and proceed…
    }
}
What I'm getting as a result is a working input-to-output program, that keeps on inputting and stops outputting the moment a blank is typed. (An exception to this is if the blank is removed with a backspace before entering a new line in the console.)
For example, the input abcde\nabcde abcde\nabcde will yield the output abcde, omitting the second and third lines, given that a blank is contained in the former. Here I am obviously using \n to represent an inputted new line (normally using the Enter key).
What have I done wrong, and what could I do to fix this issue? I know there are several working models of this program spread all over the internet, but I'm wondering why this one (which is my creation) in particular doesn't work. Again, do note that my knowledge of C is mostly limited to the first twenty pages of the book whose details are provided below.
Specs:
I'm running Eclipse version 2021-12 (4.22.0) on Debian GNU/Linux 11 (bullseye). I downloaded the pre-compiled Eclipse version from the official Eclipse.org website.
References:
Kernighan, B.W. and Ritchie, D.M. (1988). The C programming language / ANSI C Version. Englewood Cliffs, N.J.: Prentice Hall.
while(c == ' ') is an infinite loop: nothing inside it reads a new character, so c never changes and the condition stays true forever.
Instead, remember the previous character. If both the previous character and the current character are blanks, skip the current one.
Here is the part of the program that I am struggling with: "You will need some kind of loop to read through the entries in a text file. For the first 4 fields in the text file, you will know that you are at the end of the field when you read a comma. For the last field in the text file, you will know you are at the end of the field when you read a newline."
I wrote the program using functions from string.h, but was challenged to write the same program without string.h, and I am stuck on the loop part.
I know my loop is incorrect, but I am having trouble figuring out the correct loop to use. Any tips will be helpful.
do
{
    ch = fgetc(fout);
    if (ch == EOF)
    {
        break;
    }
    field[x] = ch; // loads all the characters into the array
    printf("%c", field[x]); // printing
    x++;
} while (1); // infinite loop
fclose(fout);
One way to solve this would be to make a 2-dimensional array, that is, an array of fields, and have an if in the loop that increments the variable that says which field array to use whenever there is a comma. You would end up with 5 char arrays that each hold one field, and you'd have a \n in the last one, which you then need to remove.
Hope this helps
I would like to write a lottery program in C, that reads the chosen numbers of former weeks into an array. I have got a text file in which there are 5 columns that are separated with tabulators. My questions would be the following:
What should I separate the columns with? (e.g. a comma, a semicolon, a tabulator or something else)
Should I include a kind of EOF in the last row? (e.g. -1, "EOF") Is there any accepted or "official" convention to do this?
Which function should I use for reading the numbers? Is there any proper or "accepted" way of reading data from text files?
I once wrote a C program for a "Who Wants to Be a Billionaire" game. In that one I used a function that read each line into an array that was big enough to hold a whole line. After that I separated its data into variables like this:
line: "text1";"text2";"text3";"text4"endline (-> line loaded into a buffer array)
text1 -> answer1 (until reaching the semicolon)
text2 -> answer2 (until reaching the semicolon)
text3 -> answer3 (until reaching the semicolon)
text4 -> answer4 (until reaching the end of the line)
endline -> start over, that is read a new line and separate its contents into variables.
It worked properly, but I don't know if it was good enough for a programmer. (btw I'm not a programmer yet, I study Computer Science at a university)
All answers and advice are welcome. Thanks in advance for your kind help!
The scanf() family of functions don't care about newlines, so if you want to process lines, you need to read the lines first and then process the lines with sscanf(). The scanf() family of functions also treats white space — blanks, tabs, newlines, etc. — interchangeably. Using tabs as separators is fine, but blanks will work too. Clearly, if you're reading and processing a line at a time, newlines won't really factor into the scanning.
int lottery[100][5];
int line;
char buffer[4096];

/* Test line < 100 first so that no line is read and then discarded. */
for (line = 0; line < 100 && fgets(buffer, sizeof(buffer), stdin) != 0; line++)
{
    if (sscanf(buffer, "%d %d %d %d %d", &lottery[line][0], &lottery[line][1],
               &lottery[line][2], &lottery[line][3], &lottery[line][4]) != 5)
    {
        fprintf(stderr, "Faulty line: [%s]\n", buffer);
        break;
    }
}
This stops on EOF, too many lines, and a faulty line (one which doesn't start with 5 numbers; you can check their values etc in the loop if you want to — but what are the tests you need to run?). If you want to validate the white space separators, you have to work harder.
Maybe you want to test for nothing but spaces and newlines after the 5 numbers; that's a bit trickier (it can be done; look up the %n conversion specification in sscanf()).
I would like to count the number of lines in an ASCII text file.
I thought the best way to do this would be by counting the newlines in the file:
for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) { /* Count line endings. */
    if (c == '\n') ++lines;
}
However, I'm not sure if this would account for the last line on both MS Windows and Linux. That is, if my text file finishes as below, without an explicit newline, is there one encoded there anyway, or should I add an extra ++lines; after the for loop?
cat
dog
Then what about if there is an explicit newline at the end of the file? Or do I just need to test for this case by keeping track of the previously read value?
If there is no newline, one won't be generated. C tells you exactly what's there.
Text files are always expected to end with a line feed. There's no canonical way of handling files that don't.
Here's how some tools choose to deal with characters after the last line feed:
wc doesn't count it as a line (so you have good precedent for that)
Vim marks the file as [noeol], and saves the file without a trailing line feed
GNU sed treats the file as if it had a last line feed
sh's read exits with error, but still returns the data
Since behaviour is pretty much undefined, you can just do whatever's convenient or useful to you.
First, there will not be any implicitly encoded newline at the end of the last line. The only way there will be a newline is if the software or person that produced the file put it there. Putting it there is generally considered good practice, however.
The ultimate answer for what you should report as the line count depends on the convention that you need to follow for the software or people that will be using this line count, and probably what you can assume about the behavior of the input source as well.
Most command-line tools will terminate their output with a newline character. In this case, the sensible answer may be to report the number of newline characters as the number of actual lines.
On the other hand, when a text editor is displaying a file, you will see that the line numbering in the margin (if supported) contains a number for the last line whether it is empty or not. This is in part to tell the user that there is a blank line there, but if you want to count the number of lines displayed in the margin, it is one plus the number of newline characters in the file. It is typical for some coders to not terminate their last lines with a newline character (sometimes due to sloppiness), so in this case this convention would actually be the right answer.
I'm not sure any other conventions make much sense. For example, if you choose not to count the last line unless it is non-empty, then what counts as non-empty? The file ending after newline? What if there is whitespace on that line? What if there are several empty lines at the end of the file?
If you're going to use this method, you could always keep a separate counter for how many characters you have read on the current line. If the count at the end is greater than 0, then you know there is content on the last line that wasn't counted.
int letters = 0;
for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) { /* Count line endings. */
    letters++; // Increase count on character
    if (c == '\n')
    {
        ++lines;
        letters = 0; // Set back to 0 after new line
    }
}
if (letters > 0)
{
    ++lines;
}
Your concern is real: the last line in the file may be missing the final end-of-line marker. The end-of-line marker is a single '\n' on Linux and a CR LF pair on Windows, which the C runtime converts automatically into a '\n' when the file is opened in text mode.
You can simplify your code and handle the special case of the last line missing a linefeed this way:
int c, last = '\n', lines = 0;
while ((c = getc(fp)) != EOF) { /* Count line endings. */
    if (c == '\n')
        lines += 1;
    last = c;
}
if (last != '\n')
    lines += 1;
Since you are concerned with speed, using getc instead of fgetc will help on platforms where it is defined as a macro that handles the stream structures directly and calls a function only to refill the buffer, every BUFSIZ characters or so, unless the stream is unbuffered.
How about this:
Create a flag for yourself to keep track of any non \n characters following a \n that is reset when c=='\n'.
After the EOF, check to see if the flag is true and increment if yes.
bool more_chars = false; /* needs #include <stdbool.h> */
for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) { /* Count line endings. */
    if (c == '\n') {
        more_chars = false;
        ++lines;
    } else more_chars = true;
}
if (more_chars) lines++;
Windows and UNIX/Linux style line breaks make no difference here. On either system a text file may or may not have a newline at the end of the last line.
If you always add 1 to the line count, this effectively counts the empty line at the end of the file when there is a newline at the end (i.e., file "foo\n" will count as having two lines: "foo" and ""). This may be an entirely reasonable solution, depending on how you want to define a line.
Another definition of a "line" is that it always ends in a newline, i.e., the file "foo\nbar" would only have one line ("foo") by this definition. This definition is used by wc.
Of course you could keep track of whether the newline was the last character in file and only add 1 to the count in case it wasn't. Then a "line" would be defined as either ending in a newline or being non-empty at the end of the file, which sounds quite complex to me.
This question is in K&R, exercise 1.9. I wrote the following code:
#include<stdio.h>
main()
{
    int c,i=0,n=0;
    while((c=getchar())!=EOF)
    {
        if(c!=' '||c!='\t')
        {
            i=0;
            putchar(c);
        }
        else if(c==' '||c=='\t')
        {
            i++;
        }
        if((c+1)!=' '||(c+1)!='\t')
            n=i;
        if(n!=0)
        {
            c=' ';
            putchar(c);
        }
    }
}
But I could not get the desired output. I am using gcc on Ubuntu. When I enter something like hello\t\ta as input, my output is hello__a, i.e. the tabs are replaced by the same number of spaces, and when I enter hello__a, my output is the same as the input.
Please help me with it or suggest me something new to get the desired output.
Instead of giving you the full working program, I prefer to guide you in the right direction.
First of all, c+1 does not mean "next character in the input". It only adds 1 to the value of c, which effectively converts c to the next character in the ASCII table.
For example, if c is 'a', c+1 means 'b', which is the next character in the ASCII table, and if c is ' ' (a single space), which has code 32 in the table, c+1 is '!', which has code 33.
Well, to get the next character, you need to read it! In the same way you read the first character. The best way to achieve this, is to always hold the previous read character, and check that with the currently read character.
So you need two variables, for example c and pc. You read the character and store it in c. At first, pc is '\0'. If the read character is not space or tab, you write it to the output. If it is tab, you change it to space. And if it is space, you check the previous character (pc). If it is not space, print c. At the end of the loop, you should store the value of c into pc, which means you are holding the previous character in pc.
I guess I told you the complete solution!
The problem is: you want to check the NEXT character, but you check the current character's value incremented by one.
The approach is slightly wrong. Here is a hint: keep the last character as state. If the newly entered character is a space and the last character was also a space, don't output anything; simply go back round the loop and wait for the next character.
If the current character is not a space, output it and update the state.