We have a program that will take a file as input, and then count the lines in that file, but without counting the empty lines.
There is already a post on Stack Overflow with this question, but the answer there doesn't cover my case.
Let's take a simple example.
File:
I am John\n
I am 22 years old\n
I live in England\n
If the last '\n' didn't exist, the counting would be easy. We already have a function that does this:
/* Reads a file and returns the number of lines in this file. */
uint32_t countLines(FILE *file) {
    uint32_t lines = 0;
    int32_t c;

    while (EOF != (c = fgetc(file))) {
        if (c == '\n') {
            ++lines;
        }
    }
    /* Reset the file pointer to the start of the file */
    rewind(file);
    return lines;
}
This function, when taking as input the file above, counted 4 lines. But I only want 3 lines.
I tried to fix this in many ways.
First I tried reading each line with fgets and comparing it with the string "\0"; if a line was just "\0" and nothing else, I thought that would solve the problem.
I also tried some other approaches, but I can't find one that really works.
What I basically want is to check the last character in the file (excluding '\0') and see whether it is '\n'. If it is, subtract 1 from the number of lines counted by the original function. I don't really know how to do this, though. Are there any easier ways to do it?
I would appreciate any type of help.
Thanks.
You can fix this quite efficiently by also keeping track of the last character read.
This works because an empty line has the property that the character preceding its '\n' must itself have been a '\n'.
/* Reads a file and returns the number of non-empty lines in this file. */
uint32_t countLines(FILE *file) {
    uint32_t lines = 0;
    int32_t c;
    int32_t last = '\n';

    while (EOF != (c = fgetc(file))) {
        if (c == '\n' && last != '\n') {
            ++lines;
        }
        last = c;
    }
    /* Reset the file pointer to the start of the file */
    rewind(file);
    return lines;
}
Here is a slightly better algorithm.
#include <stdio.h>

// Reads a file and returns the number of lines in it, ignoring empty lines
unsigned int countLines(FILE *file)
{
    unsigned int lines = 0;
    int c = '\0';
    int pc = '\n';

    while (c = fgetc(file), c != EOF)
    {
        if (c == '\n' && pc != '\n')
            lines++;
        pc = c;
    }
    if (pc != '\n')
        lines++;
    return lines;
}
Only the first newline in any sequence of newlines is counted, since all but the first newline indicate blank lines.
Note that if the file does not end with a '\n' newline character, any characters encountered (beyond the last newline) are considered a partial last line. This means that reading a file with no newlines at all returns 1.
Reading an empty file will return 0.
Reading a file ending with a single newline will return 1.
(I removed the rewind() since it is not necessary.)
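For reference, here is a minimal sketch of how the second version might be driven from main() (the file name input.txt and the error handling are illustrative assumptions, not part of the original answers):
#include <stdio.h>

unsigned int countLines(FILE *file);  /* the version defined above */

int main(void)
{
    FILE *file = fopen("input.txt", "r");  /* placeholder file name */

    if (file == NULL) {
        perror("fopen");
        return 1;
    }
    printf("Non-empty lines: %u\n", countLines(file));
    fclose(file);
    return 0;
}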
First, detect lines that consist only of whitespace. Let's create a function to do that.
#include <ctype.h>
#include <stdbool.h>

bool stringIsOnlyWhitespace(const char *line) {
    for (int i = 0; line[i] != '\0'; ++i)
        if (!isspace((unsigned char)line[i]))
            return false;
    return true;
}
Now that we have a test function, let's build a loop around it.
while (fgets(line, sizeof line, fp)) {
    if (!stringIsOnlyWhitespace(line))
        notemptyline++;
}
printf("\n The number of nonempty lines is: %d\n", notemptyline);
The source is Bill Lynch; I've changed it a little.
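For completeness, here is a sketch that puts the two pieces together into a runnable program (the buffer size and the file name input.txt are assumptions for illustration):
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

bool stringIsOnlyWhitespace(const char *line) {
    for (int i = 0; line[i] != '\0'; ++i)
        if (!isspace((unsigned char)line[i]))
            return false;
    return true;
}

int main(void) {
    char line[1024];                     /* assumed maximum line length */
    int notemptyline = 0;
    FILE *fp = fopen("input.txt", "r");  /* placeholder file name */

    if (fp == NULL)
        return 1;
    while (fgets(line, sizeof line, fp)) {
        if (!stringIsOnlyWhitespace(line))
            notemptyline++;
    }
    fclose(fp);
    printf("\n The number of nonempty lines is: %d\n", notemptyline);
    return 0;
}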
I think your approach using fgets() is totally fine. Try something like this:
char line[200];

while (fgets(line, 200, file) != NULL) {
    if (strlen(line) > 1) {   /* more than just the '\n': a non-empty line */
        lines++;
    }
}
If you don't know about the length of the lines in your files, you may want to check if line actually contains a whole line.
Edit:
Of course this depends on how you define what an empty line is. If you define a line with only whitespaces as empty, the above code will not work, because strlen() includes whitespaces.
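If lines containing only whitespace should also be treated as empty, one possible alternative is to skip over leading whitespace with strspn() and see whether anything is left. A sketch (the helper name countNonBlankLines and the whitespace set are my own choices, not from the original answer):
#include <stdio.h>
#include <string.h>

/* Counts lines that contain at least one non-whitespace character.
   Assumes no line is longer than 199 characters. */
unsigned int countNonBlankLines(FILE *file) {
    char line[200];
    unsigned int lines = 0;

    while (fgets(line, sizeof line, file) != NULL) {
        /* strspn() returns the length of the leading whitespace run;
           if it reaches the terminator, the whole line is blank. */
        if (line[strspn(line, " \t\r\n")] != '\0')
            lines++;
    }
    return lines;
}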
Empty lines should also be removed if they are duplicates. A line containing escape sequences (like \t) is different from an empty line. The code below deletes too many lines, or sometimes leaves duplicates. How can I fix this?
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char a[6000];
    char b[6000];
    int test = 0;

    fgets(a, 6000, stdin);
    while (fgets(b, 6000, stdin) != NULL) {
        for (int i = 0; i < 6000; i++) {
            if (a[i] != b[i]) {
                test = 1;
            }
        }
        if (test == 0) {
            fgets(b, 6000, stdin);
        } else {
            printf("%s", a);
        }
        int j = 0;
        while (j < 6000) {
            a[j] = b[j];
            j++;
        }
        test = 0;
    }
    return 0;
}
Your logic is mostly sound. You are on the right track with your train of thought:
1. Read a line into previous (a).
2. Read another line into current (b).
3. If previous and current have the same contents, go to step 2.
4. Print previous.
5. Move current to previous.
6. Go to step 2.
This still has some problems, however.
Unnecessary line-read
To start, consider this bit of code:
while (fgets(b, 6000, stdin) != NULL) {
    ...
    if (test == 0) {
        fgets(b, 6000, stdin);
    }
    else {
        printf("%s", a);
    }
    ...
}
If a and b have the same contents (test==0), you use an unchecked fgets to read a line again, but then yet another line is read when the loop condition fgets(b,6000,stdin)!=NULL is evaluated. The problem is that you're mostly ignoring the line you just read, meaning you're moving an unknown line from b to a. Since the loop already reads another line and checks for failure appropriately, just let the loop read the line, and invert the if statement's equality test so that a is printed when test!=0.
Where's the last line?
Your logic also will not print the last line. Consider a file with 1 line. You read it, then fgets in the loop condition attempts to read another line, which fails because you're at the end of the file. There is no print statement outside the loop, so you never print the line.
Now what about a file with 2 lines that differ? You read the first line, then the last line, see they're different, and print the first line. Then you overwrite the first line's buffer with the last line. You fail to read another line because there aren't any more, and the last line is, again, not printed.
You can fix this by replacing the first (unchecked) fgets with a[0] = 0. That makes the first byte of a a null byte, which means the end of the string. It won't compare equal to a line you read, so test==1, meaning a will be printed. Since there is no string in a to print, nothing is printed. Things then continue as normal, with the contents of b being moved into a and another line being read.
Unique last line problem
This leaves one problem: the last line won't be printed if it's not a duplicate. To fix this, just print b instead of a.
The final recipe
1. Assign 0 to the first byte of previous (a[0]).
2. Read a line into current (b).
3. If previous and current have the same contents, go to step 2.
4. Print current.
5. Move current to previous.
6. Go to step 2.
As you can see, it's not much different from your existing logic; only steps 1 and 4 differ. It also ensures that all fgets calls are checked. If there are no lines in a file, nothing is printed. If there is only 1 line in a file, it is printed. If 2 lines differ, both are printed. If 2 lines are the same, the first is printed.
Optional: optimizations
Instead of checking all 6000 bytes, you only need to check up to the first null byte in either string, since fgets automatically adds one to mark the end of the string.
Faster still would be to add a break statement inside the if statement of your for loop. If a single byte doesn't match, the entire line is not a duplicate, so you can stop comparing early—a lot faster if only byte 10 differs in two 1000-byte lines!
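Putting the recipe and the optimizations together, a sketch of the corrected program might look like this (this is my illustration of the steps above, using strcmp/strcpy in place of the byte-by-byte loops, not the original poster's code; strcmp stops at the first difference and never looks past the terminating null byte):
#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[6000];
    char b[6000];

    a[0] = '\0';                                  /* step 1: previous starts empty */
    while (fgets(b, sizeof b, stdin) != NULL) {   /* step 2: read current */
        if (strcmp(a, b) != 0) {                  /* step 3: compare */
            printf("%s", b);                      /* step 4: print current */
        }
        strcpy(a, b);                             /* step 5: move current to previous */
    }                                             /* step 6: loop */
    return 0;
}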
#include <stdio.h>
#include <string.h>

int main(void)
{
    char buff[2][6000];
    unsigned count = 0;
    char *prev = NULL, *this = buff[count % 2];

    while (fgets(this, sizeof buff[0], stdin)) {
        if (!prev || strcmp(prev, this)) { // first line, or different from the previous one
            fputs(this, stdout);
            prev = this;
            count++;
            this = buff[count % 2];
        }
    }
    fprintf(stderr, "Number of lines written: %u\n", count);
    return 0;
}
There are a few problems in your code, like:
for (int i = 0; i < 6000; i++) {
    if (a[i] != b[i]) {
        test = 1;
    }
}
In this loop, the whole buffer is compared character by character every time, even after a[i]!=b[i] is found for some value of i. You should probably break out of the loop after test=1.
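For example, the comparison loop with the early exit might look like this (a sketch using the variables from the question):
test = 0;
for (int i = 0; i < 6000; i++) {
    if (a[i] != b[i]) {
        test = 1;
        break;   /* no need to look at the remaining bytes */
    }
}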
Your logic will also not work for a file with just 1 line, as you do not print the line outside the loop.
Another problem is the fixed-length buffer of 6000 chars.
Maybe you can use getline to solve your problem. You can do:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    char *line = NULL;
    char *comparewith = NULL;
    int notduplicate;
    size_t len = 0;
    ssize_t read;

    while ((read = getline(&line, &len, stdin)) != -1) {
        notduplicate = (comparewith == NULL) || (strcmp(line, comparewith) != 0);
        if (notduplicate) {
            printf("%s", line);   /* getline() keeps the '\n', so don't add another */
            if (comparewith != NULL)
                free(comparewith);
            comparewith = line;
            line = NULL;
            len = 0;              /* so the next getline() allocates a fresh buffer */
        }
    }
    if (line)
        free(line);
    if (comparewith)
        free(comparewith);
    return 0;
}
An important point to note:
getline() is not in the C standard library. It was originally a GNU extension and was standardized in POSIX.1-2008, so this code may not be portable. To make it portable, you'll need to roll your own getline(), something like this.
Here is a much simpler solution that has no limitation on line length:
#include <stdio.h>

int main(void) {
    int c, last1 = 0, last2 = 0;

    while ((c = getchar()) != EOF) {
        if (c != '\n' || last1 != '\n' || last2 != '\n')
            putchar(c);
        last2 = last1;
        last1 = c;
    }
    return 0;
}
The code skips any newline beyond the second in a run of consecutive newline characters, hence it removes duplicate blank lines.
I have a text file which includes thousands of strings, each separated by a space " ".
How can I count how many strings there are?
You don't need the strtok() as you only need to count the number of space characters.
while (fgets(line, sizeof line, myfile) != NULL) {
    for (size_t i = 0; line[i]; i++) {
        if (line[i] == ' ') totalStrings++;
    }
}
If you want to consider any whitespace character then you can use isspace() function.
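For instance, here is a sketch of an isspace()-based counter that also avoids over-counting runs of whitespace (the helper name countWordsInLine and the state tracking are my additions, not part of the original answer):
#include <ctype.h>
#include <stddef.h>

/* Counts whitespace-separated words in one line: a word starts wherever a
   non-whitespace character follows whitespace (or the start of the line),
   so runs of spaces and tabs are not counted twice. */
unsigned int countWordsInLine(const char *line)
{
    unsigned int words = 0;
    int inWord = 0;

    for (size_t i = 0; line[i] != '\0'; i++) {
        if (isspace((unsigned char)line[i])) {
            inWord = 0;
        } else if (!inWord) {
            inWord = 1;
            words++;
        }
    }
    return words;
}
Inside the fgets loop above you would then do something like totalStrings += countWordsInLine(line);.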
You can read character by character as well without using an array:
int ch;

while ((ch = fgetc(myfile)) != EOF) {
    if (ch == ' ') totalStrings++;
}
But I don't see why you want to avoid using an array as it would probably be more efficient (reading more chars at a time rather than reading one byte at a time).
The fgets() function will read an entire line from the file (you need to know the maximum possible size of that line). Then, you can use strtok() from <string.h> to parse the string and count the words.
Using fgetc(), you can count the spaces.
Take note that spaces at the beginning of the line are counted as well, so the count only works out when the line happens to start with a space. Otherwise it won't give accurate results, because the first string has no space before it and so is never counted.
To solve that, we first check the first character and increment the string counter if it is an alphabetic character.
int str_count = 0;
int c;

// first char
if (isalpha(c = fgetc(myfile)))
    str_count++;
else
    ungetc(c, myfile);
Then, we loop through the rest of the contents.
Checking whether an alphabetic character follows a space verifies that there is another string after the space; otherwise a space at the end of the line would be counted as well, giving an inaccurate result.
do
{
    c = fgetc(myfile);
    if (c == EOF)
        break;
    if (isspace(c)) {
        if (isalpha(c = fgetc(myfile))) {
            str_count++;
            ungetc(c, myfile);
        } else if (c == '\n') { // for multiple newlines
            str_count++;
        }
    }
} while (1);
Tested on a Lorem Ipsum generator of 1500 words:
http://pastebin.com/w6EiSHbx
I was trying to take a full line input in C. Initially I did,
char line[100]; // assume no line is longer than 100 letters.
scanf("%s", line);
Ignoring security flaws and buffer overflows, I knew this could never take more than a word as input. So I modified it again:
scanf("%[^\n]", line);
This, of course, couldn't take more than a line of input. The following code, however, was running into an infinite loop:
while(fscanf(stdin, "%[^\n]", line) != EOF)
{
    printf("%s\n", line);
}
This was because the \n was never consumed; the scan would repeatedly stop at the same point, and line kept the same value. So I rewrote the code as:
while(fscanf(stdin, "%[^\n]\n", line) != EOF)
{
    printf("%s\n", line);
}
This code worked impeccably (or so I thought) for input from a file. But for input from stdin it produced cryptic, weird behavior: only after the second line was input would the first line be printed. I'm unable to understand what is really happening.
All I am doing is this: note down the string until you encounter a \n, store it in line, and then consume the \n from the input buffer. Now print this line and get ready for the next line from the input. Or am I mistaken?
At the time of posting this question, however, I found a better alternative:
while(fscanf(stdin, "%[^\n]%*c", line) != EOF)
{
    printf("%s\n", line);
}
This works flawlessly for all cases. But my question still remains. How come this code,
while(fscanf(stdin, "%[^\n]\n", line) != EOF)
{
    printf("%s\n", line);
}
worked for input from a file, but causes issues for input from standard input?
Use fgets(). @FredK
char buf[N];
while (fgets(buf, sizeof buf, stdin)) {
    // crop potential \n if desired.
    buf[strcspn(buf, "\n")] = '\0';
    ...
}
There are too many issues with trying to use scanf() for user input; they render it prone to misuse or code attacks.
// Leaves trailing \n in stdin
scanf("%[^\n]", line)
// Does nothing if line begins with \n. \n remains in stdin
// As return value not checked, use of line may be UB.
// If some text read, consumes \n and then all following whitespace: ' ' \n \t etc.
// Then does not return until a non-white-space is entered.
// As stdin is usually buffered, this implies 2 lines of user input.
// Fails to limit input.
scanf("%[^\n]\n", line)
// Does nothing if line begins with \n. \n remains in stdin
// Consumes 1 char after `line`, even if next character is not a \n
scanf("%99[^\n]%*c", line)
Checking against EOF is usually the wrong check. @Weather Vane The following, when \n is the first character entered, returns 0 and line is not populated. As 0 != EOF, the code goes on to use an uninitialized line, leading to UB.
while(fscanf(stdin, "%[^\n]%*c", line) != EOF)
Consider entering "1234\n" with the following. A likely infinite loop: the first fscanf() reads "123", tosses the "4", and the next fscanf() call gets stuck on the \n.
while(fscanf(stdin, "%3[^\n]%*c", line) != EOF)
When checking the results of *scanf(), check against what you want, not against one of the values you do not want. (But even the following has other troubles)
while(fscanf(stdin, "%[^\n]%*c", line) == 1)
About the closest scanf() to read a line:
char buf[100];
buf[0] = 0;
int cnt = scanf("%99[^\n]", buf);
if (cnt == EOF) Handle_EndOfFile();
// Consume \n if next stdin char is a \n
scanf("%*1[\n]");
// Use buf;
while(fscanf(stdin, "%[^\n]%*c", line) != EOF)
worked for input from a file, but causes issues for input from standard input?
Posting sample code and an input/data file would be useful. With the modest amount of code posted, here are some potential reasons:
line overrun is UB
Input begins with \n leading to UB
File or stdin not both opened in same mode. \r not translated in one.
Note: The following fails when a line is exactly 100 characters, so meeting the assumption can still lead to UB.
char line[100]; // assume no line is longer than 100 letters.
scanf("%s", line);
Personally, I think fgets() is badly designed. When I read a line, I want to read it whole, regardless of its length (short of filling up all RAM). fgets() can't do that in one go: if there is a long line, you have to call it multiple times until it reaches the newline. The glibc-specific getline() is more convenient in this regard. Here is a function that mimics GNU's getline():
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
long my_getline(char **buf, long *m_buf, FILE *fp)
{
    long tot = 0, max = 0;
    char *p;

    if (*m_buf == 0) { // empty buffer; allocate
        *m_buf = 16;   // initial size; could be larger
        *buf = (char*)malloc(*m_buf); // FIXME: check NULL
    }
    for (p = *buf, max = *m_buf;;) {
        long l, old_m;

        if (fgets(p, max, fp) == NULL)
            return tot? tot : EOF; // reach end-of-file
        for (l = 0; l < max; ++l)
            if (p[l] == '\n') break;
        if (l < max) { // a complete line
            tot += l, p[l] = 0;
            break;
        }
        old_m = *m_buf;
        *m_buf <<= 1; // incomplete line; double the buffer
        *buf = (char*)realloc(*buf, *m_buf); // check NULL
        max = (*m_buf) - old_m;
        p = (*buf) + old_m - 1; // point to the end of partial line
    }
    return tot;
}

int main(int argc, char *argv[])
{
    long l, m_buf = 0;
    char *buf = 0;

    while ((l = my_getline(&buf, &m_buf, stdin)) != EOF)
        puts(buf);
    free(buf);
    return 0;
}
I usually use my own readline() function. I wrote this my_getline() a moment ago. It has not been thoroughly tested. Please use with caution.
I want a program that counts the lines in a text file using a function. It used to work, but it always returns 0 now.
What am I doing wrong?
#include <stdio.h>

int couLineF(FILE* fp) { // count lines in file
    int count = 0, ch;

    while ((ch = fgetc(fp)) != EOF) {
        if (ch == (int)"\n") count++;
    }
    rewind(fp);
    return count;
}

int main() {
    FILE *fp = fopen("book.txt", "r");
    int lines;

    if (fp) {
        lines = couLineF(fp);
        printf("number of lines is : %d", lines);
    }
    return 0;
}
Another question: are there any other ways to get the number of lines in a text file?
Your problem is here:
if(ch == (int)"\n" )
You are casting the address of "\n", a string literal, into an int and comparing it with ch. This doesn't make any sense.
Replace it with
if(ch == '\n' )
to fix it. This checks whether ch is a newline character. (Use single quotes (') to denote a character and double quotes (") for a string.)
Other problems are (a corrected sketch follows this list):
Not closing the file using fclose if fopen was successful.
Your program won't count the last line if it doesn't end with \n.
There is absolutely no reason to use rewind(fp) as you never use the FILE pointer again.
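A sketch incorporating those fixes (counting a final line that lacks a trailing '\n', closing the file, and dropping the rewind()) might look like this:
#include <stdio.h>

/* Counts lines in a file; a last line without a terminating '\n' is also counted. */
int couLineF(FILE *fp)
{
    int count = 0, ch, prev = '\n';

    while ((ch = fgetc(fp)) != EOF) {
        if (ch == '\n') count++;
        prev = ch;
    }
    if (prev != '\n')       /* last line had no terminating newline */
        count++;
    return count;
}

int main(void)
{
    FILE *fp = fopen("book.txt", "r");

    if (fp) {
        printf("number of lines is : %d\n", couLineF(fp));
        fclose(fp);         /* close the file once we are done with it */
    }
    return 0;
}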
I have an input file I need to extract words from. The words can only contain letters and numbers, so anything else will be treated as a delimiter. I tried fscanf, fgets+sscanf, and strtok, but nothing seems to work.
while (!feof(file))
{
    fscanf(file, "%s", string);
    printf("%s\n", string);
}
The above clearly doesn't work because it doesn't use any delimiters, so I replaced the line with this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
So I used fgets to read the first line and use sscanf:
sscanf(line, "%[A-z]%n", word, &len);
line += len;
This one doesn't work either, because whatever I try I can't move the pointer to the right place. I tried strtok, but I can't find out how to set the delimiters:
while (p != NULL) {
    printf("%s\n", p);
    p = strtok(NULL, " ");
}
This one obviously takes the blank character as a delimiter, but I have literally hundreds of delimiters.
Am I missing something here? Extracting words from a file seemed like a simple concept at first, but nothing I try really works.
Consider building a minimal lexer. When in the state word, it would remain there as long as it sees letters and numbers, and switch to the state delimiter when it encounters something else. In the state delimiter it would do the exact opposite.
Here's an example of a simple state machine which might be helpful. For the sake of brevity it works only with digits. echo "2341,452(42 555" | ./main will print each number in a separate line. It's not a lexer but the idea of switching between states is quite similar.
#include <stdio.h>
#include <string.h>

int main() {
    static const int WORD = 1, DELIM = 2, BUFLEN = 1024;
    int state = WORD, ptr = 0, c;
    char buffer[BUFLEN], *digits = "1234567890";

    while ((c = getchar()) != EOF) {
        if (strchr(digits, c)) {
            if (WORD == state) {
                buffer[ptr++] = c;
            } else {
                buffer[0] = c;
                ptr = 1;
            }
            state = WORD;
        } else {
            if (WORD == state) {
                buffer[ptr] = '\0';
                printf("%s\n", buffer);
            }
            state = DELIM;
        }
    }
    return 0;
}
If the number of states increases, you can consider replacing the if statements that check the current state with switch blocks. Performance can be increased by replacing getchar with reading a whole block of input into a temporary buffer and iterating through it.
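As a sketch of that block-reading idea (the buffer sizes and the use of fread() here are my own illustration, not part of the original answer):
#include <stdio.h>
#include <string.h>

int main(void) {
    static const int WORD = 1, DELIM = 2;
    char chunk[4096], buffer[1024];
    const char *digits = "1234567890";
    int state = DELIM, ptr = 0;
    size_t n;

    /* Read the input in blocks instead of one character at a time. */
    while ((n = fread(chunk, 1, sizeof chunk, stdin)) > 0) {
        for (size_t i = 0; i < n; i++) {
            char c = chunk[i];
            if (c != '\0' && strchr(digits, c)) {
                if (state != WORD)
                    ptr = 0;                        /* a new number starts here */
                if (ptr < (int)sizeof buffer - 1)   /* guard against overlong numbers */
                    buffer[ptr++] = c;
                state = WORD;
            } else {
                if (state == WORD) {
                    buffer[ptr] = '\0';
                    printf("%s\n", buffer);
                }
                state = DELIM;
            }
        }
    }
    if (state == WORD) {                            /* flush a number that runs to end of input */
        buffer[ptr] = '\0';
        printf("%s\n", buffer);
    }
    return 0;
}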
If you have to deal with a more complex input file format, you can use lexical analyser generators such as flex. They can do the job of defining state transitions and other parts of lexer generation for you.
Several points:
First of all, do not use feof(file) as your loop condition; feof won't return true until after you attempt to read past the end of the file, so your loop will execute once too often.
Second, you mentioned this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
That's not quite what's happening; if the next character in the stream doesn't match the format specifier, scanf returns without having read anything, and string is unmodified.
Here's a simple, if inelegant, method: it reads one character at a time from the input file, checks to see if it's either an alpha or a digit, and if it is, adds it to a string.
#include <stdio.h>
#include <ctype.h>

int get_next_word(FILE *file, char *word, size_t wordSize)
{
    size_t i = 0;
    int c;

    /**
     * Skip over any non-alphanumeric characters
     */
    while ((c = fgetc(file)) != EOF && !isalnum(c))
        ; // empty loop

    if (c != EOF)
        word[i++] = c;

    /**
     * Read up to the next non-alphanumeric character and
     * store it to word
     */
    while ((c = fgetc(file)) != EOF && i < (wordSize - 1) && isalnum(c))
    {
        word[i++] = c;
    }
    word[i] = 0;

    return c != EOF;
}
int main(void)
{
    char word[SIZE]; // where SIZE is large enough to handle expected inputs
    FILE *file;
    ...
    while (get_next_word(file, word, sizeof word))
        // do something with word
    ...
}
I would use:
FILE *file;
char string[200];

while (fscanf(file, "%*[^A-Za-z]"), fscanf(file, "%199[a-zA-Z]", string) > 0) {
    /* do something with string... */
}
This skips over non-letters and then reads a string of up to 199 letters. The only oddness is that if you have any 'words' that are longer than 199 letters they'll be split up into multiple words, but you need the limit to avoid a buffer overflow...
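For completeness, a minimal sketch that wraps this loop in a full program (the file name words.txt is just a placeholder):
#include <stdio.h>

int main(void)
{
    char string[200];
    FILE *file = fopen("words.txt", "r");   /* placeholder file name */

    if (file == NULL)
        return 1;
    /* Skip non-letters, then read up to 199 letters at a time. */
    while (fscanf(file, "%*[^A-Za-z]"), fscanf(file, "%199[a-zA-Z]", string) > 0) {
        printf("%s\n", string);
    }
    fclose(file);
    return 0;
}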
What are your delimiters? The second argument to strtok should be a string containing your delimiters, and the first should be a pointer to your string the first time round then NULL afterwards:
char *p = strtok(line, ","); // assuming a , delimiter

while (p)
{
    printf("%s\n", p);
    p = strtok(NULL, ",");
}