Count lines in ASCII file using C - c

I would like to count the number of lines in an ASCII text file.
I thought the best way to do this would be by counting the newlines in the file:
for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) { /* Count line endings. */
    if (c == '\n') ++lines;
}
However, I'm not sure if this would account for the last line on both MS Windows and Linux. That is, if my text file finishes as below, without an explicit newline, is there one encoded there anyway, or should I add an extra ++lines; after the for loop?
cat
dog
Then what about if there is an explicit newline at the end of the file? Or do I just need to test for this case by keeping track of the previously read value?

If there is no newline, one won't be generated. C tells you exactly what's there.

Text files are always expected to end with a line feed. There's no canonical way of handling files that don't.
Here's how some tools choose to deal with characters after the last line feed:
wc doesn't count it as a line (so you have good precedent for that)
Vim marks the file as [noeol], and saves the file without a trailing line feed
GNU sed treats the file as if it had a last line feed
sh's read exits with error, but still returns the data
Since behaviour is pretty much undefined, you can just do whatever's convenient or useful to you.

First, there will not be any implicitly encoded newline at the end of the last line. The only way there will be a newline is if the software or person that produced the file put it there. Putting it there is generally considered good practice, however.
The ultimate answer for what you should report as the line count depends on the convention that you need to follow for the software or people that will be using this line count, and probably what you can assume about the behavior of the input source as well.
Most command-line tools will terminate their output with a newline character. In this case, the sensible answer may be to report the number of newline characters as the number of actual lines.
On the other hand, when a text editor is displaying a file, you will see that the line numbering in the margin (if supported) contains a number for the last line whether it is empty or not. This is in part to tell the user that there is a blank line there, but if you want to count the number of lines displayed in the margin, it is one plus the number of newline characters in the file. It is typical for some coders to not terminate their last lines with a newline character (sometimes due to sloppiness), so in this case this convention would actually be the right answer.
I'm not sure any other conventions make much sense. For example, if you choose not to count the last line unless it is non-empty, then what counts as non-empty? The file ending after newline? What if there is whitespace on that line? What if there are several empty lines at the end of the file?

If you're going to use this method, you could always keep a separate counter for how many characters you have read on the current line. If that count is greater than 0 at the end, then you know there is text on the last line that wasn't counted.
int letters = 0;
for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) { /* Count line endings. */
    letters++; // Increase count on every character
    if (c == '\n')
    {
        ++lines;
        letters = 0; // Set back to 0 after new line
    }
}
if (letters > 0)
{
    ++lines; // Last line had no trailing newline
}

Your concern is real: the last line in the file may be missing the final end-of-line marker. The end-of-line marker is a single '\n' on Linux and a CR LF pair on Windows, which the C runtime converts automatically into a '\n' when the file is read in text mode.
You can simplify your code and handle the special case of the last line missing a linefeed this way:
int c, last = '\n', lines = 0;
while ((c = getc(fp)) != EOF) { /* Count line endings. */
    if (c == '\n')
        lines += 1;
    last = c;
}
if (last != '\n')
    lines += 1;
Since you are concerned with speed, using getc instead of fgetc will help on platforms where it is defined as a macro that handles the stream structures directly and calls a function only to refill the buffer, every BUFSIZ characters or so, unless the stream is unbuffered.

How about this:
Create a flag to keep track of any non-'\n' characters following a '\n', and reset it whenever c == '\n'.
After the EOF, check whether the flag is true and increment the count if so.
bool more_chars = false;
for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) { /* Count line endings. */
    if (c == '\n') {
        more_chars = false;
        ++lines;
    } else more_chars = true;
}
if (more_chars) lines++;

Windows and UNIX/Linux style line breaks make no difference here. On either system a text file may or may not have a newline at the end of the last line.
If you always add 1 to the line count, this effectively counts the empty line at the end of the file when there is a newline at the end (i.e., file "foo\n" will count as having two lines: "foo" and ""). This may be an entirely reasonable solution, depending on how you want to define a line.
Another definition of a "line" is that it always ends in a newline, i.e., the file "foo\nbar" would only have one line ("foo") by this definition. This definition is used by wc.
Of course you could keep track of whether the newline was the last character in file and only add 1 to the count in case it wasn't. Then a "line" would be defined as either ending in a newline or being non-empty at the end of the file, which sounds quite complex to me.
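To make the three definitions above concrete, here is a minimal sketch that computes all of them in one pass. The struct and function names (line_counts, count_lines) are made up for this illustration and are not from any of the answers.

```c
#include <stdio.h>

/* Three possible line counts for the same stream (illustrative names). */
struct line_counts {
    long newlines;        /* wc convention: number of '\n' characters */
    long newlines_plus_1; /* "always add 1" convention */
    long with_partial;    /* add 1 only if the stream ends mid-line */
};

/* Reads fp to EOF and fills in all three counts. */
struct line_counts count_lines(FILE *fp)
{
    struct line_counts n = {0, 0, 0};
    int c;
    int last = '\n';      /* so an empty stream has no partial line */

    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            n.newlines++;
        last = c;
    }
    n.newlines_plus_1 = n.newlines + 1;
    n.with_partial = n.newlines + (last != '\n');
    return n;
}
```

For a file containing "foo\n" the three fields disagree only in newlines_plus_1, and for "foo\nbar" only in newlines, so which field you report is exactly the choice of definition discussed above.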

Related

How to know if the file end with a new line character or not

I'm trying to append a line of the form "1 :1 :1 :1" to the end of a file. The file may or may not already have a newline character at the end, and I have to handle both cases, so I came up with the following solution:
Go to the end of the file and move backward by 1 character (the length of the newline character on Linux, I believe), then read that character. If it isn't a newline character, insert one and then insert the whole line; otherwise just insert the line. Here is that solution translated to C:
int insert_element(char filename[]){
elements *elem;
FILE *p,*test;
size_t size = 0;
char *buff=NULL;
char c='\n';
if((p = fopen(filename,"a"))!=NULL){
if(test = fopen(filename,"a")){
fseek(test,-1,SEEK_END );
c= getc(test);
if(c!='\n'){
fprintf(test,"\n");
}
}
fclose(test);
p = fopen(filename,"a");
fseek(p,0,SEEK_END);
elem=(elements *)malloc(sizeof(elements));
fflush(stdin);
printf("\ninput the ID\n");
scanf("%d",&elem->id);
printf("input the adress \n");
scanf("%s",elem->adr);
printf("innput the type \n");
scanf("%s",elem->type);
printf("intput the mark \n");
scanf("%s",elem->mark);
fprintf(p,"%d :%s :%s :%s",elem->id,elem->adr,elem->type,elem->mark);
free(elem);
fflush(stdin);
fclose(p);
return 1;
}else{
printf("\nRrror while opening the file !\n");
return 0;
}
}
As you may notice, the whole program depends on the length of the newline character (1 character, "\n"), so I wonder if there is a better way, in other words one that works on all OSes.
It seems you already understand the basics of appending to a file, so we just have to figure out whether the file already ends with a newline.
In a perfect world, you'd jump to the end of the file, back up one character, read that character, and see if it matches '\n'. Something like this:
FILE *f = fopen(filename, "r");
fseek(f, -1, SEEK_END); /* this is a problem */
int c = fgetc(f);
fclose(f);
if (c != '\n') {
/* we need to append a newline before the new content */
}
Though this will likely work on Posix systems, it won't work on many others. The problem is rooted in the many different ways systems separate and/or terminate lines in text files. In C and C++, '\n' is a special value that tells the text mode output routines to do whatever needs to be done to insert a line break. Likewise, the text mode input routines will translate each line break to '\n' as it returns the data read.
On Posix systems (e.g., Linux), a line break is indicated by a line feed character (LF) which occupies a single byte in UTF-8 encoded text. So the compiler just defines '\n' to be a line feed character, and then the input and output routines don't have to do anything special in text mode.
On some older systems (like old MacOS and Amiga) a line break might be represented by a carriage return character (CR). Many IBM mainframes used different character encodings, called EBCDIC, that don't have direct mappings for LF or CR, but they do have a special control character called next line (NL). There were even systems (like VMS, IIRC) that didn't use a stream model for text files but instead used variable length records to represent each line, so the line breaks themselves were implicit rather than marked by a specific control character.
Most of those are challenges you won't face on modern systems. Unicode added more line break conventions, but very little software supports them in a general way.
The remaining major line break convention is the combination CR+LF. What makes CR+LF challenging is that it's two control characters, but the C i/o functions have to make them appear to the programmer as though they are the single character '\n'. That's not a big deal with streaming text in or out. But it makes seeking within a file hard to define. And that brings us back to the problematic line:
fseek(f, -1, SEEK_END);
What does it mean to back up "one character" from the end on a system where line breaks are indicated by a two character sequence like CR+LF? Do we really want the i/o system to have to possibly scan the entire file in order for fseek (and ftell) to figure out how to make sense of the offset?
The C standards people punted. In text mode, the offset argument for fseek can only be 0 or a value returned by a previous call to ftell. So the problematic call, with a negative offset, isn't valid. (On Posix systems, the invalid call to fseek will likely work, but the standard doesn't require it to.)
Also note that Posix defines LF as a line terminator rather than a separator, so a non-empty text file that doesn't end with a '\n' should be uncommon (though it does happen).
For a more portable solution, we have two choices:
Read the entire file in text mode, remembering whether the most recent character you read was '\n'.
This option is hugely inefficient, so unless you're going to do this only occasionally or only with short files, we can rule that out.
Open the file in binary mode, seek backwards a few bytes from the end, and then read to the end, remembering whether the last thing you read was a valid line break sequence.
This might be a problem if our fseek doesn't support the SEEK_END origin when the file is opened in binary mode. Yep, the C standard says supporting that is optional. However, most implementations do support it, so we'll keep this option open.
Since the file will be read in binary mode, the input routines aren't going to convert the platform's line break sequence to '\n'. We'll need a state machine to detect line break sequences that are more than one byte long.
Let's make the simplifying assumption that a line break is either LF or CR+LF. In the latter case, we don't care about the CR, so we can simply back up one byte from the end and test whether it's LF.
Oh, and we have to figure out what to do with an empty file.
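#include <stdbool.h>
#include <stdio.h>
bool NeedsLineBreak(const char *filename) {
    const int LINE_FEED = '\x0A';
    FILE *f = fopen(filename, "rb"); /* binary mode */
    if (f == NULL) return false;
    /* An empty file needs no line break before the new content. */
    const bool empty_file = fseek(f, 0, SEEK_END) == 0 && ftell(f) == 0;
    /* Otherwise one is needed unless the last byte is already a line feed. */
    const bool result = !empty_file &&
        (fseek(f, -1, SEEK_END) == 0 && fgetc(f) != LINE_FEED);
    fclose(f);
    return result;
}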
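For completeness, here is a rough sketch of how the check could be combined with the actual append. It rests on the same simplifying assumption as above (a line break is LF or CR+LF), and the names needs_newline and append_line are invented for this example.

```c
#include <stdio.h>

/* Returns 1 if the file is non-empty and its last byte is not LF,
   0 otherwise (including when the file cannot be opened). */
int needs_newline(const char *filename)
{
    FILE *f = fopen(filename, "rb");   /* binary mode: no translation */
    int need = 0;

    if (f == NULL)
        return 0;
    if (fseek(f, 0, SEEK_END) == 0 && ftell(f) > 0 &&
        fseek(f, -1, SEEK_END) == 0)
        need = (getc(f) != 0x0A);
    fclose(f);
    return need;
}

/* Appends text as its own line; returns 0 on success, -1 on failure. */
int append_line(const char *filename, const char *text)
{
    int need = needs_newline(filename);
    FILE *f = fopen(filename, "a");    /* text mode: "\n" is written as
                                          the platform's line break */
    if (f == NULL)
        return -1;
    if (need)
        fputs("\n", f);
    fprintf(f, "%s\n", text);
    return fclose(f) == 0 ? 0 : -1;
}
```

The check is done in binary mode and the write in text mode on purpose: the read has to see the raw last byte, while the write should let the runtime produce whatever line break the platform uses.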

how to scan line in c program not from file

How to scan total line from user input with c program?
I tried scanf("%99[^\n]",st), but it does not work when I scan something before this scan statement. It works if this is the first scan statement.
How to scan total line from user input with c program?
There are many ways to read a line of input, and your usage of the word scan suggests you're already focused on the scanf() function for the job. This is unfortunate, because, although you can (to some extent) achieve what you want with scanf(), it's definitely not the best tool for reading a line.
As already stated in the comments, your scanf() format string will stop at a newline, so the next scanf() will first find that newline and it can't match [^\n] (which means anything except newline). As a newline is just another whitespace character, adding a blank in front of your conversion will silently eat it up ;)
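Here is a small sketch of that effect. It uses sscanf() on a fixed string instead of scanf() on stdin so the behaviour is repeatable; the function name second_read and the sample input are made up for the demonstration.

```c
#include <stdio.h>

/* Simulates two consecutive reads of the input "first\nsecond line\n".
   Returns the result of the second conversion: 1 if it stored a line,
   0 if it matched nothing, -1 if the first read unexpectedly failed.
   skip_ws selects whether the second format starts with a blank. */
int second_read(int skip_ws, char out[100])
{
    const char *input = "first\nsecond line\n";
    char first[100];
    int used = 0;

    /* The first read stops in front of the '\n'; %n records how far it got. */
    if (sscanf(input, "%99[^\n]%n", first, &used) != 1)
        return -1;

    /* The second read continues where the first one stopped,
       just as a second scanf() on stdin would. */
    if (skip_ws)
        return sscanf(input + used, " %99[^\n]", out);
    return sscanf(input + used, "%99[^\n]", out);
}
```

Without the blank, the second conversion sees the leftover '\n' first and matches nothing; with the blank, the newline is skipped and the rest of the line is stored.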
But now for the better solution: Assuming you only want to use standard C functions, there's already one function for exactly the job of reading a line: fgets(). The following code snippet should explain its usage:
char line[1024];
char *str = fgets(line, 1024, stdin); // read from the standard input
if (!str)
{
// couldn't read input for some reason, handle error here
exit(1); // <- for example
}
// fgets includes the newline character that ends the line, but if the line
// is longer than 1022 characters, it will stop early here (it will never
// write more bytes than the second parameter you pass). Often you don't
// want that newline character, and the following line overwrites it with
// 0 (which is "end of string") **only** if it was there:
line[strcspn(line, "\n")] = 0;
Note that you might want to check for the newline character with strchr() instead, so you actually know whether you have the whole line or maybe your input buffer was too small. In the latter case, you might want to call fgets() again.
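A possible shape for that strchr() check, as a sketch only: the helper name read_line and its return convention are my own invention, and instead of calling fgets() again it simply throws away the rest of an over-long line.

```c
#include <stdio.h>
#include <string.h>

/* Reads one line from fp into buf (capacity size, at least 2).
   Returns  1 if the whole line fit (any newline is removed),
            0 if the line was too long (the excess is discarded),
           -1 on end of file or read error. */
int read_line(FILE *fp, char *buf, size_t size)
{
    char *nl;
    int c;
    long discarded = 0;

    if (fgets(buf, (int)size, fp) == NULL)
        return -1;

    nl = strchr(buf, '\n');
    if (nl != NULL) {
        *nl = '\0';          /* whole line fits: strip the newline */
        return 1;
    }

    /* No newline in the buffer: either the buffer filled up or the
       input ended. Skip to the end of the line and count what is lost. */
    while ((c = getc(fp)) != EOF && c != '\n')
        discarded++;
    return discarded == 0 ? 1 : 0;
}
```

Passing stdin as fp gives the behaviour asked about in the question; taking a FILE * just makes the helper usable on files as well.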
How to scan total line from user input with c program?
scanf("%99[^\n]",st) reads a line, almost.
With the C Standard Library a line is
A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a terminating new-line character is implementation-defined. C11dr §7.21.2 2
scanf("%99[^\n]",st) fails to read the end of the line, the '\n'.
That is why on the 2nd call, the '\n' remains in stdin to be read and scanf("%99[^\n]",st) will not read it.
There are ways to use scanf("%99[^\n]",st);, or a variation of it as a step in reading user input, yet they suffer from 1) Not handling a blank line "\n" correctly 2) Missing rare input errors 3) Long line issues and other nuances.
The preferred portable solution is to use fgets(). Loop example:
#define LINE_MAX_LENGTH 200
char buf[LINE_MAX_LENGTH + 1 + 1]; // +1 for long lines detection, +1 for \0
while (fgets(buf, sizeof buf, stdin)) {
    size_t eol = strcspn(buf, "\n"); // see the ** note below
    buf[eol] = '\0'; // trim potential \n
    if (eol >= LINE_MAX_LENGTH) {
        // IMO, user input exceeding a sane generous threshold is a potential hack
        fprintf(stderr, "Line too long\n");
        // TBD : Handle excessive long line
    }
    // Use `buf[]`
}
Many platforms support getline() to read a line.
Shortcomings: it is not part of standard C, and it allows a hacker to overwhelm system resources with insanely long lines.
In C, there is not a great solution. What is best depends on the various coding goals.
** I prefer size_t eol = strcspn(buf, "\n\r"); to read lines in a *nix environment that may end with "\r\n".
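For platforms that do have it, a getline() loop might look roughly like the sketch below. This assumes a POSIX.1-2008 system; the function longest_line is just an example task I picked, and note that it does nothing to limit how much memory a very long line can consume, which is the shortcoming mentioned above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Returns the length of the longest line in fp, not counting the
   terminating newline. Returns 0 for an empty stream. */
long longest_line(FILE *fp)
{
    char *line = NULL;    /* getline() allocates and grows this buffer */
    size_t cap = 0;
    ssize_t len;
    long longest = 0;

    while ((len = getline(&line, &cap, fp)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            len--;        /* do not count the newline itself */
        if (len > longest)
            longest = len;
    }
    free(line);           /* the buffer is ours to free, even after failure */
    return longest;
}
```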
scanf() should never be used for user input. The best way to get input from the user is with fgets().
Read more: http://sekrit.de/webdocs/c/beginners-guide-away-from-scanf.html
char str[1024];
char *alline = fgets(str, 1024, stdin);
scanf("%[^'\n']s",alline);
I think the correct solution should be like this. It worked for me.
Hope it helps.

Searching for strings that are NULL terminated within a file where they are not NULL terminated

I am writing a program that opens two files for reading: the first file contains 20 names which I store in an array of the form Names[0] = John\0. The second file is a large text file that contains many occurrences of each of the 20 names.
I need my program to scan the entirety of the second file and each time it finds one of the names, a variable Count is incremented, so that on completion of the program the total number of all the names appearing in the text is stored in Count.
Here is my loop which searches for and counts the number of name occurrences:
char LineOfText[85];
char *TempName;
while(fgets(LineOfText, sizeof(LineOfText), fpn)){
for(a = 0; a<NumOfNames; a++){
TempName = strstr(LineOfText, Names[a]);
if(TempName != NULL){
Count++;
}
}
}
No matter what I do, this loop doesn't work as I would expect it to, but I have discovered what is wrong (I think!). My problem is that each name in the array is NULL terminated, but when a name appears in the text file it is not NULL terminated, unless it occurs as the last word of a line. Therefore, this while loop is only counting the number of times any of the names appear at the end of a line, rather than the number of appearances of any of the names anywhere in the text file. How can I adjust this loop to combat this problem?
Thank you for any advice in advance.
The issue here is probably your use of fgets, which does not trim the newline from the line it reads.
If you are creating your names array by reading lines with fgets, then all the names will be terminated with a newline character. The lines in the file being read with fgets will also be terminated with a newline character, so the names will only match at the end of the lines.
strstr does not compare the NUL byte which terminates the pattern string, for obvious reasons. If it did, it would only match suffix strings, which would make it a very different function.
Also, you will only find a maximum of one instance of each name in each line. If you think that a name might appear more than once in the same line, you should replace:
TempName = strstr(LineOfText, Names[a]);
if(TempName != NULL){
Count++;
}
with something like:
for (TempName = LineOfText;
     (TempName = strstr(TempName, Names[a])) != NULL;
     ++Count, ++TempName) {
}
For reference, here is the definition of fgets from the C standard (emphasis added):
The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.
This is different from gets, which does not retain the new-line character.
I think the NULL termination of the names array is not an issue (See strstr function reference). The strstr function is not going to compare the terminator. You do have the possibility of missing additional names on each line. See my adjustment below for an example of how you could count multiple names on each line.
char LineOfText[85];
char *TempName;
while(fgets(LineOfText, sizeof(LineOfText), fpn)){
for(a = 0; a<NumOfNames; a++){
TempName = strstr(LineOfText, Names[a]);
/* Iterate through line for multiple occurrences of each name */
while(TempName != NULL){
Count++;
/* Get next occurrence of name on line. fgets is going to
leave a newline at the end of the LineOfText string so
unless some of your names contain a newline, it shouldn't
move past the end of the buffer */
TempName = strstr(TempName + 1, Names[a]);
}
}
}
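The inner loop can also be pulled out into a small helper, which makes it easy to check on its own. This is only a sketch, and the name count_occurrences is mine; it counts overlapping matches too, because it resumes one character past the start of each match.

```c
#include <string.h>

/* Counts how many times name occurs in text, including overlapping
   occurrences. An empty name is treated as never matching. */
int count_occurrences(const char *text, const char *name)
{
    int count = 0;
    const char *p = text;

    if (*name == '\0')
        return 0;
    while ((p = strstr(p, name)) != NULL) {
        count++;
        p++;              /* resume one character past the match start */
    }
    return count;
}
```

Keep in mind that strstr() matches substrings, so "John" is also found inside "Johnny". If names were loaded with fgets(), strip their trailing newline first or they will only match at the end of a line.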

Word count debugging

In K&R, the following code is proposed to count words, lines and characters in input. Exercise 1.11 asks:
How would you test the word count program? What kinds of input are
most likely to uncover bugs if there are any?
The only answer I see to these questions is testing the code on some input that contains several lines, words and tabs.
Can you see any other way to test this code?
#include <stdio.h>
#define IN 1 /* inside a word */
#define OUT 0 /* outside a word */
/* count lines, words and characters in input */
main(){
int c, n1, nw, nc, state;
state = OUT;
n1 = nw = nc = 0;
while ((c = getchar()) != EOF){
++nc;
if (c == '\n')
++n1;
if (c == ' ' || c == '\n' || c == '\t')
state = OUT;
else if (state == OUT){
state = IN;
++ nw;
}
}
printf("%d %d %d\n",n1,nw,nc);
}
Test the program using all of the following types of inputs:
An empty file.
A file with only new lines and no words.
A file with very long words, all on one line.
A file with very long words, on many lines.
The program might produce invalid output, but should not crash if given special characters.
Test the program with "N" blank lines inserted at random locations throughout the document.
Test the program with "N" blank lines inserted at the beginning of the document.
Test the program with "N" blank lines inserted at the end of the document.
Test the program with both one character words and long words, including hyphenated words with these inputs:
A file with only one space separating each word.
A file with one space or "N" spaces separating each word.
A file with only one tab separating each word.
A file with one tab or "N" tabs separating each word.
A file with only one space OR tab separating each word.
A file with one space or "N" spaces OR tabs separating each word.
Test the program with single quotes and double quotes, with and without spaces between the words and the quotes, and with nested levels of quotes.
Also:
Make sure the program doesn't count unintended characters as a word or part of a word. For example, make sure a carriage return, which is a legal MS-DOS character, is not counted as a word if it is included at the end of a line.
Create the largest possible file for which space was designated for this application, and make sure that the program does not crash, that other applications are NOT impacted, and that the output is correct.
Create the largest possible file for which space was designated for this application, containing only spaces, newlines and tabs, except for words at the end of the file, and make sure that the program does not crash, that other applications are NOT impacted, and that the output is correct.
Create the largest possible file for which space was designated for this application, containing only spaces, newlines and tabs, except for words at the beginning of the file, and make sure that the program does not crash, that other applications are NOT impacted, and that the output is correct.
Create the largest possible file for which space was designated for this application, containing only one very long word: the output of the program should be 1.
Have the program write a debugging file that contains a printf for each while, if, and else statement. Make sure that the tests cause all of the printf statements to be reached. In other words, there shouldn't be any parts of the code that remain unused at the end of the testing.
There should be a good reason the output doesn't match the output of the wc program.
The idea behind the question is to illustrate the concept of "white box" testing. Look at every "choice point" in your program, and see how you can exercise the logic behind it to uncover the "corner cases":
To exercise the while loop, feed it input that has no data (i.e. EOF comes right away)
Feed the program a file with a single line and no \n before EOF to exercise the line counting if
Feed the program a file with one or more lines composed entirely of whitespace characters
Feed the program a file with the last \n missing, and see if the last word gets counted
Feed the program a file with single-character words to exercise the logic of switching between IN and OUT
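One way to run such cases quickly is to move the counting logic into a function that works on a string, so each test input is a one-liner. The sketch below does that; wc_string and struct counts are names I made up, and the state machine is meant to be the same as in the K&R program, only reading from memory instead of getchar().

```c
#include <stdio.h>

#define IN  1   /* inside a word */
#define OUT 0   /* outside a word */

struct counts { long lines, words, chars; };

/* Same logic as the K&R loop, applied to a NUL-terminated string. */
struct counts wc_string(const char *s)
{
    struct counts n = {0, 0, 0};
    int state = OUT;

    for (; *s != '\0'; s++) {
        n.chars++;
        if (*s == '\n')
            n.lines++;
        if (*s == ' ' || *s == '\n' || *s == '\t')
            state = OUT;
        else if (state == OUT) {
            state = IN;
            n.words++;
        }
    }
    return n;
}
```

With this in place, wc_string("") covers the "EOF right away" case, wc_string("last word") covers the missing final \n, and wc_string("\r\n") shows that a lone carriage return is counted as a word by this logic.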

Reading input from file in C

Okay so I have a file of input that I calculate the amount of words and characters in each line with success.
When I get to the end of the line using the code below it exits the loop and only reads in the first line. How do I move on to the next line of input to continue the program?
EDIT: I must parse each line separately so I cant use EOF
while( (c = getchar()) != '\n')
Change '\n' to EOF. You're reading until the end of the line when you want to read until the end of the file (EOF is a macro in stdio.h; it is the value getchar() returns when there is no more input, not a character stored in the file).
Disclaimer: I make no claims about the security of the method.
'\n' is the line feed (newline) character, so the loop will terminate when the end of the first line is reached. The end of the file is signalled by getchar() returning EOF. cstdio (or stdio.h), which declares the getchar() function, defines the EOF constant, so just change the while line to
while( (c = getchar()) != EOF)
From the man page: "reads the next character from stream and returns it as an unsigned char cast to an int, or EOF on end of file or error." EOF is a macro (often -1) for the return of this and related functions that indicates end of file. You want to check whether this is what you're getting back. Note that getc returns an int, but the valid character values are unsigned chars converted to int. Watch out if c is declared as a char: it cannot reliably hold EOF.
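Since the question says each line has to be parsed separately, one option is to keep stopping at '\n' but also stop at EOF, and call that once per line. A rough sketch follows; the function name next_line_stats is invented, and it reads from any FILE * (pass stdin to get the getchar() behaviour).

```c
#include <stdio.h>

/* Reads one line from fp and reports its word and character counts
   (the newline itself is not counted).
   Returns 1 if a line was read, 0 if the stream was already at EOF. */
int next_line_stats(FILE *fp, long *words, long *chars)
{
    int c;                /* int, not char, so EOF stays distinguishable */
    int in_word = 0;
    long seen = 0;

    *words = 0;
    *chars = 0;
    while ((c = getc(fp)) != EOF) {
        seen++;
        if (c == '\n')
            return 1;     /* end of this line; the next call continues */
        ++*chars;
        if (c == ' ' || c == '\t')
            in_word = 0;
        else if (!in_word) {
            in_word = 1;
            ++*words;
        }
    }
    return seen > 0;      /* a final line without '\n' still counts */
}
```

A caller would then loop with something like while (next_line_stats(stdin, &w, &ch)) and handle each line's counts inside the loop body.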
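Well, in a file with Windows-style line endings, a line break is actually a combination of two bytes: byte 13 (CR) followed by byte 10 (LF). In text mode the C runtime on Windows translates the pair to a single '\n', so this only matters when the input arrives untranslated.
You could try something like,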
int c2=getchar(),c1;
while(1)
{
c1=c2;
c2=getchar();
if(c1==EOF)
break;
if(c1==(char)13 && c2==(char)10)
break;
/*use c1 as the input character*/
}
This should test whether two consecutive input characters form the proper pair (13, 10).

Resources