How can I count specific words from a file? - c

I am trying to create a C program that reads a file and counts specific words.
I tried this code, but I don't get any result:
#include<stdio.h>
#include<stdlib.h>
void main
{
File *fp = fopen("file.txt","r+");
int count =0;
char ch[10];
while((fgetc(fp)!=NULL)
{
while((fgetc(fp)!=NULL)
{
if((fgets(ch,3,fp))=="the" || (fgets(ch,3,fp))=="and")
count++;
}
}
printf("%d",count);
}

As you're acquiring data in blocks of 3 at a time, you're assuming that the two words "the" and "and" are aligned on 3 character boundaries. That will not, in general, be the case.
You also need to use strncmp to compare the strings.
As a first step, I'd read the file line by line and search each line for the words you want.
I'm also unsure what your intention is behind the two nested while loops.
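The line-by-line approach suggested above could look roughly like this. It is only a sketch: the helper names count_targets_in_line and count_targets_in_file and the whitespace-only delimiter set are assumptions, not part of the question, and punctuation attached to a word (such as "the,") is not stripped.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: counts how many whitespace-separated tokens in
   `line` are exactly "the" or "and". strtok modifies `line` in place. */
int count_targets_in_line(char *line)
{
    int count = 0;
    for (char *tok = strtok(line, " \t\r\n"); tok != NULL;
         tok = strtok(NULL, " \t\r\n"))
    {
        /* strcmp returns 0 when the two strings have the same contents */
        if (strcmp(tok, "the") == 0 || strcmp(tok, "and") == 0)
            count++;
    }
    return count;
}

/* Reads `fp` line by line with fgets and sums the per-line counts.
   Lines longer than the buffer are processed in pieces, so a word that
   straddles a piece boundary would be missed. */
int count_targets_in_file(FILE *fp)
{
    char line[1024];
    int total = 0;
    while (fgets(line, sizeof line, fp) != NULL)
        total += count_targets_in_line(line);
    return total;
}
```

The file itself would be opened with fopen("file.txt", "r") and checked against NULL before being passed to count_targets_in_file.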

You can't compare string pointers with the equality operator; you have to use the strcmp function.
There are also other problems with the code you have. For one, fgetc does not return NULL on errors or at end of file, but EOF. Otherwise it returns the character read from the file.
Also, your two fgets in the condition will cause reading of two "lines" (though each "line" you read will only be two characters) from the file.

fgets(ch, 3, fp) reads at most 2 characters and then appends the null terminator; if you want to read 3 characters plus the null terminator, you need fgets(ch, 4, fp) instead. Also, you need to use strcmp to compare strings.
Also, what are all those while loops for?
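To see the size rule in action, here is a small sketch. The helper chars_stored_by_fgets is hypothetical and exists only for this demonstration; it uses the standard tmpfile function so that no real file is needed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes `text` to a temporary file, rewinds it,
   and returns how many characters a single fgets call with the given
   `size` stores (not counting the terminator). Returns 0 on failure.
   `size` must be between 2 and 16 for this demonstration. */
size_t chars_stored_by_fgets(const char *text, int size)
{
    char buf[16];
    size_t n = 0;
    FILE *fp;

    if (size < 2 || size > (int)sizeof buf)
        return 0;
    fp = tmpfile();
    if (fp == NULL)
        return 0;
    fputs(text, fp);
    rewind(fp);
    if (fgets(buf, size, fp) != NULL)
        n = strlen(buf); /* fgets stores at most size - 1 characters */
    fclose(fp);
    return n;
}
```

Called with the text "the and", a size of 3 stores only "th", while a size of 4 stores "the".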

if((fgets(ch,3,fp))=="the" || (fgets(ch,3,fp))=="and")
The above line cannot work as written.
fgets(ch,3,fp) reads from the file into ch, but you cannot compare the result using ==.
What I would do is use strcmp and give size 4 in fgets (never forget the \0).

You have to use strcmp() to compare two strings, not the == operator.

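Just off the top of my head (perhaps not the optimal way, but should be pretty easy to read and understand):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define WHITE_SPACE(c) ((c)==' ' || (c)=='\r' || (c)=='\n' || (c)=='\t')
int CountWords(const char* fileName,int numOfWords,const char* words[])
{
    int count = 0;
    FILE* fp = fopen(fileName,"r");
    if (fp == NULL)
        return -1; // could not open the file
    fseek(fp,0,SEEK_END);
    long size = ftell(fp);
    fseek(fp,0,SEEK_SET);
    char* buf = size < 0 ? NULL : malloc(size > 0 ? size : 1);
    if (buf == NULL)
    {
        fclose(fp);
        return -1; // could not determine the size or allocate the buffer
    }
    size = (long)fread(buf,1,size,fp); // keep the number of bytes actually read
    fclose(fp);
    for (long i=0,j; i<size; i=j+1)
    {
        // find the end of the current word
        for (j=i; j<size; j++)
        {
            if (WHITE_SPACE(buf[j]))
                break;
        }
        // compare the word buf[i..j) against every target word
        for (int n=0; n<numOfWords; n++)
        {
            size_t len = strlen(words[n]);
            if (len == (size_t)(j-i) && !memcmp(buf+i,words[n],len))
                count++;
        }
    }
    free(buf);
    return count;
}
Please note, however, that I have not compiled nor tested it (as I said above, "off the top of my head")...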

Take a look at string-matching algorithms.
You can also find example implementations of Boyer-Moore on GitHub.
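As a much simpler starting point than Boyer-Moore, the standard strstr function already performs substring search. The sketch below is not Boyer-Moore; the helper name count_occurrences is an assumption, and because it matches substrings rather than whole words, "the" inside "other" is counted too.

```c
#include <string.h>

/* Hypothetical helper: counts non-overlapping occurrences of `pattern`
   in `text` using the standard strstr function. */
int count_occurrences(const char *text, const char *pattern)
{
    size_t plen = strlen(pattern);
    int count = 0;

    if (plen == 0)
        return 0; /* an empty pattern would match everywhere */

    /* after each hit, resume the search just past the matched text */
    for (const char *p = strstr(text, pattern); p != NULL;
         p = strstr(p + plen, pattern))
        count++;
    return count;
}
```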

The line
if((fgets(ch,3,fp))=="the" || (fgets(ch,3,fp))=="and")
has a couple of problems:
You can't compare string values with the == operator; you need to use the strcmp library function;
You're not comparing the same input to "the" and "and"; when the first comparison fails, you're reading the next 3 characters from input;
Life will be easier if you abstract out the input and comparison operations; at a high level, it would look something like this:
#define MAX_WORD_LENGTH 10 // or however big it needs to be
...
char word[MAX_WORD_LENGTH + 1];
...
while ( getNextWord( word, sizeof word, fp )) // will loop until getNextWord
{ // returns false (error or EOF)
if ( match( word ) )
count++;
}
The getNextWord function handles all the input; it will read characters from the input stream until it recognizes a "word" or until there's no room left in the input buffer. In this particular case, we'll assume that a "word" is simply any sequence of non-whitespace characters (meaning punctuation will be counted as part of a word). If you want to be able to recognize punctuation as well, this gets a bit harder; for example, a ' may be a quoting character ('hello'), in which case it should not be part of the word, or it may be part of a contraction or a possessive (it's, Joe's), in which case it should be part of the word.
#include <ctype.h>
...
int getNextWord( char *target, size_t targetSize, FILE *fp )
{
size_t i = 0;
int c;
/**
* Read the next character from the input stream, skipping
* over any leading whitespace. We'll add each non-whitespace
* character to the target buffer until we see trailing
* whitespace or EOF.
*/
// check the buffer limit first so that no character is read and then dropped
while ( i < targetSize - 1 && (c = fgetc( fp )) != EOF )
{
if ( isspace( c ) )
{
if ( i == 0 )
continue;
else
break;
}
else
{
target[i++] = c;
}
}
target[i] = 0; // add 0 terminator to string
return i > 0; // if i == 0, then we did not successfully read a word
}
The match function simply compares the input word to a list of target words, and returns "true" (1) if it sees a match. In this case, we create a list of target words with a terminating NULL entry; we just walk down the list, comparing each element to our input. If we reach the NULL entry, we didn't find a match.
#include <string.h>
...
int match( const char *word )
{
const char *targets[] = {"and", "the", NULL};
const char **t = targets; // t walks the array of string pointers
while ( *t && strcmp( *t, word ))
t++;
return *t != NULL; // evaluates to true if we match either "the" or "and"
}
Note that this comparison is case-sensitive; "The" will not compare equal to "the". If you want a case-insensitive comparison, you'll have to make a copy of the input string and convert it all to lowercase, and compare that copy to the target:
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
...
int match( const char *word )
{
    const char *targets[] = {"and", "the", NULL};
    const char **t = targets;
    char *wcopy = malloc( strlen( word ) + 1 );
    if ( wcopy )
    {
        const char *w = word;
        char *c = wcopy;
        while ( *w )
            *c++ = tolower( (unsigned char) *w++ );
        *c = 0; // terminate the lowercase copy
    }
    else
    {
        fprintf( stderr, "malloc failure in match: fatal error, exiting\n" );
        exit( EXIT_FAILURE );
    }
    while ( *t && strcmp( *t, wcopy ))
        t++;
    free( wcopy );
    return *t != NULL; // evaluates to true if we match either "the" or "and"
}
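If the allocation is unwanted, the copy can be avoided by comparing the two strings character by character while ignoring case. The following is a sketch under that assumption; equals_ignore_case is a hypothetical helper name, not a standard function.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if `a` and `b` are equal when letter
   case is ignored, 0 otherwise. The casts to unsigned char keep
   tolower well defined for characters with negative char values. */
int equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0'; /* both must end together */
}
```

Inside match, the strcmp call in the loop condition could then be replaced with a negated call to this helper, and no copy of the input word would be needed.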

Related

What is the best way to match a string to specified format?

The format that I want to match the string to is "from:<%s>" or "FROM:<%s>". The %s can be any length of characters representing an email address.
I have been using sscanf(input, "%*[fromFROM:<]%[#:-,.A-Za-z0-9]>", output). But it doesn't catch the case where the last ">" is missing. Is there a clean way to check if the input string is correctly formatted?
You can't directly tell whether trailing literal characters in a format string are matched; there's no direct way for sscanf() to report their absence. However, there are a couple of tricks that'll do the job:
Option 1:
int n = 0;
if (sscanf(input, "%*[fromFROM:<]%[#:-,.A-Za-z0-9]>%n", email, &n) != 1)
…error…
else if (n == 0)
…missing >…
Option 2:
char c = '\0';
if (sscanf(input, "%*[fromFROM:<]%[#:-,.A-Za-z0-9]%c", email, &c) != 2)
…error — malformed prefix or > missing…
else if (c != '>')
…error — something other than > after email address…
Note that the 'from' scan-set will match ROFF or MorfROM or <FROM:morf as a prefix to the email address. That's probably too generous. Indeed, it would match the from:<foofoomoo part of from:<foofoomoo#example.com>, so the address captured would be just #example.com. That is a much more serious problem, especially as you throw the whole of the matched material away. You should probably capture the value and be more specific:
char c = '\0';
char from[5];
if (sscanf(input, "%4[fromFROM]:<%[#:-,.A-Za-z0-9]%c", from, email, &c) != 3)
…error…
else if (strcasecmp(from, "FROM") != 0)
…not from…
else if (c != '>')
…missing >…
or you can compare using strcmp() with from and FROM if that's what you want. The options here are legion. Be aware that strcasecmp() is a POSIX-specific function; Microsoft provides the equivalent _stricmp().
Use "%n". It records the offset of the scan of input[], if scanning got that far.
Use it to:
Detect that the scan succeeded, including the final >.
Detect extra junk after the >.
A check of the return value of sscanf() is not needed.
Also use a width limit.
char output[100];
int n = 0;
// sscanf(input, "%*[fromFROM:<]%[#:-,.A-Za-z0-9]>", output);
sscanf(input, "%*[fromFROM]:<%99[#:-,.A-Za-z0-9]>%n", output, &n);
//                            ^^ width           ^^
if (n == 0 || input[n] != '\0') {
    puts("Error, scan incomplete or extra junk");
} else {
    puts("Success");
}
If trailing white-space, like a '\n', is OK, use " %n".
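Wrapped in a function, the "%n" check could look like the sketch below. The name parse_from and the 100-byte buffer requirement are assumptions made for illustration. The '-' is moved to the end of the scan set so that it is taken literally, and ranges such as A-Z, while widely supported, are implementation-defined. As in the snippet above, the prefix scan set accepts any run of the letters f, r, o, m in either case, so this is more permissive than a strict from/FROM check.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 and fills `email` (at least 100 bytes)
   if `input` consists of a prefix of letters from "fromFROM", then ":<",
   an address, and a closing '>' with nothing after it. Returns 0
   otherwise, in which case `email` may hold a partial result. */
int parse_from(const char *input, char *email)
{
    int n = 0;

    /* %n is only reached, and n only set, if the literal '>' matched */
    sscanf(input, "%*[fromFROM]:<%99[#:,.A-Za-z0-9-]>%n", email, &n);

    /* n == 0: the scan stopped early; input[n] != '\0': trailing junk */
    return n != 0 && input[n] == '\0';
}
```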
Regarding the first part of the string, if you want to accept only FROM:< or from:< , then you can simply use the function strncmp with both possibilities. Note, however, that this means that for example From:< will not be accepted. In your question, you implied that this is how you want your program to behave, but I'm not sure if this really is the case.
Generally, I wouldn't recommend using the function sscanf for such a complex task, because that function is not very flexible. Also, in ISO C, it is not guaranteed that character ranges are supported when using the %[] format specifier (although most common platforms probably do support it). Therefore, I would recommend checking the individual parts of the string "manually":
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <stdbool.h>
bool is_valid_string( const char *line )
{
const char *p;
//verify that string starts with "from:<" or "FROM:<"
if (
strncmp( line, "from:<", 6 ) != 0
&&
strncmp( line, "FROM:<", 6 ) != 0
)
{
return false;
}
//verify that there are no invalid characters before the `>`
for ( p = line + 6; *p != '>'; p++ )
{
if ( *p == '\0' )
return false;
if ( isalpha( (unsigned char)*p ) )
continue;
if ( isdigit( (unsigned char)*p ) )
continue;
if ( strchr( "#:-,.", *p) != NULL )
continue;
return false;
}
//jump past the '>' character
p++;
//verify that we are now at the end of the string
if ( *p != '\0' )
return false;
return true;
}
int main( void )
{
char line[200];
//read one line of input
if ( fgets( line, sizeof line, stdin ) == NULL )
{
printf( "Input failure!\n" );
exit( EXIT_FAILURE );
}
//remove newline character
line[strcspn(line,"\n")] = '\0';
//call function and print result
if ( is_valid_string ( line ) )
printf( "VALID\n" );
else
printf( "INVALID\n" );
}
This program has the following output:
This is an invalid string.
INVALID
from:<john.doe#example.com
INVALID
from:<john.doe#example.com>
VALID
FROM:<john.doe#example.com
INVALID
FROM:<john.doe#example.com>
VALID
FROM:<john.doe#example!!!!.com>
INVALID
FROM:<john.doe#example.com>invalid
INVALID

Put char into array by using pointer in c

I have a problem with a task I was given: we must use a pointer to put values read from the keyboard into an array and then print that array.
I tried to write it, but I don't know why it is wrong. I define my array, then read each value and store it in the array.
#include <stdio.h>
#include <stdlib.h>
#define N 10000 // Maximum array size
int main ()
{
char keyboardArray[N];
char *r;
r = keyboardArray;
while( (*r++ = getchar()) != EOF );
printf("You write %s", r);
return 0;
}
You have several problems:
At the end of the loop, r points to the end of the string, not the beginning. So printing r won't print the string that was entered. You should print the keyboardArray rather than r.
You're never adding a null terminator to the string, so you can't use the %s format operator.
getchar() returns int, not char -- this is needed to be able to distinguish EOF from ordinary characters. So you need to read into a different variable before storing into the array.
int main ()
{
char keyboardArray[N];
char *r;
int c;
r = keyboardArray;
while( (c = getchar()) != EOF ) {
*r++ = c;
}
*r = '\0'; // Add null terminator
printf("You write %s\n", keyboardArray);
}
Note that this will read until EOF, so the user will have to type a special character like Control-d (on Unix) or Control-z (on Windows) to end the input. You might want to check for newline as well, so they can enter a single line:
while ((c = getchar()) != EOF && c != '\n') {
I think that in any case you need an intermediate variable that will accept a read character.
Also you need to append the entered sequence of characters with the terminating zero.
For example
#include <stdio.h>
#define N 10000 // Maximum array size
int main( void )
{
char keyboardArray[N];
char *r = keyboardArray;
for ( int c;
r + 1 < keyboardArray + N && ( c = getchar() ) != EOF && c != '\n';
++r )
{
*r = c;
}
*r = '\0';
printf( "You write %s\n", keyboardArray );
}

Reading a file in C

I have an input file I need to extract words from. The words can only contain letters and numbers so anything else will be treated as a delimiter. I tried fscanf,fgets+sscanf and strtok but nothing seems to work.
while(!feof(file))
{
fscanf(file,"%s",string);
printf("%s\n",string);
}
Above one clearly doesn't work because it doesn't use any delimiters so I replaced the line with this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
So I used fgets to read the first line and use sscanf:
sscanf(line,"%[A-z]%n,word,len);
line+=len;
This one doesn't work either because whatever I try I can't move the pointer to the right place. I tried strtok but I can't find how to set the delimiters:
while(p != NULL) {
printf("%s\n", p);
p = strtok(NULL, " ");
This one obviously takes the blank character as a delimiter, but I have literally hundreds of delimiters.
Am I missing something here? Extracting words from a file seemed a simple concept at first, but nothing I try really works.
Consider building a minimal lexer. When in state word it would remain in it as long as it sees letters and numbers. It would switch to state delimiter when encountering something else. Then it could do an exact opposite in the state delimiter.
Here's an example of a simple state machine which might be helpful. For the sake of brevity it works only with digits. echo "2341,452(42 555" | ./main will print each number in a separate line. It's not a lexer but the idea of switching between states is quite similar.
#include <stdio.h>
#include <string.h>
int main() {
static const int WORD = 1, DELIM = 2, BUFLEN = 1024;
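int c; // int rather than char, so that EOF can be told apart from data
int state = DELIM, ptr = 0; // start in DELIM so a leading delimiter prints nothing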
char buffer[BUFLEN], *digits = "1234567890";
while ((c = getchar()) != EOF) {
if (strchr(digits, c)) {
if (WORD == state) {
buffer[ptr++] = c;
} else {
buffer[0] = c;
ptr = 1;
}
state = WORD;
} else {
if (WORD == state) {
buffer[ptr] = '\0';
printf("%s\n", buffer);
}
state = DELIM;
}
}
return 0;
}
If the number of states increases you can consider replacing if statements checking the current state with switch blocks. The performance can be increased by replacing getchar with reading a whole block of the input to a temporary buffer and iterating through it.
If you have to deal with a more complex input file format, you can use a lexical analyzer generator such as flex. It can take care of defining the state transitions and the other parts of lexer generation for you.
Several points:
First of all, do not use feof(file) as your loop condition; feof won't return true until after you attempt to read past the end of the file, so your loop will execute once too often.
Second, you mentioned this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
That's not quite what's happening; if the next character in the stream doesn't match the format specifier, scanf returns without having read anything, and string is unmodified.
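For the first point, the usual fix is to drive the loop with the return value of the read call instead of feof. A minimal sketch, assuming whitespace-separated tokens and a 100-byte buffer (the helper name count_tokens is hypothetical):

```c
#include <stdio.h>

/* Hypothetical helper: counts whitespace-separated tokens in `file`.
   The loop is driven by fscanf's return value, which is 1 only when a
   token was actually stored, so there is no extra pass at end of file. */
int count_tokens(FILE *file)
{
    char string[100];
    int count = 0;

    /* %99s limits each token to the buffer size minus the terminator */
    while (fscanf(file, "%99s", string) == 1)
        count++;
    return count;
}
```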
Here's a simple, if inelegant, method: it reads one character at a time from the input file, checks to see if it's either an alpha or a digit, and if it is, adds it to a string.
#include <stdio.h>
#include <ctype.h>
int get_next_word(FILE *file, char *word, size_t wordSize)
{
size_t i = 0;
int c;
/**
* Skip over any non-alphanumeric characters
*/
while ((c = fgetc(file)) != EOF && !isalnum(c))
; // empty loop
if (c != EOF)
word[i++] = c;
/**
* Read up to the next non-alphanumeric character and
* store it to word
*/
while (i < (wordSize - 1) && (c = fgetc(file)) != EOF && isalnum(c))
{
word[i++] = c;
}
word[i] = 0;
return i > 0; // false only when EOF was reached before any word was read
}
int main(void)
{
char word[SIZE]; // where SIZE is large enough to handle expected inputs
FILE *file;
...
while (get_next_word(file, word, sizeof word))
// do something with word
...
}
I would use:
FILE *file;
char string[200];
while(fscanf(file, "%*[^A-Za-z]"), fscanf(file, "%199[a-zA-Z]", string) > 0) {
/* do something with string... */
}
This skips over non-letters and then reads a string of up to 199 letters. The only oddness is that if you have any 'words' that are longer than 199 letters they'll be split up into multiple words, but you need the limit to avoid a buffer overflow...
What are your delimiters? The second argument to strtok should be a string containing your delimiters, and the first should be a pointer to your string the first time round then NULL afterwards:
char *p = strtok(line, ","); // assuming a , delimiter
while (p)
{
    printf("%s\n", p); // print before fetching the next token, so NULL is never printed
    p = strtok(NULL, ",");
}
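Since the question treats everything except letters and digits as a delimiter, the delimiter string does not have to be typed out by hand; it can be generated. The sketch below assumes single-byte characters and the default "C" locale, and the helper names build_delims and print_words are hypothetical.

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: fills `delims` (at least UCHAR_MAX + 1 bytes)
   with every non-zero character that is not a letter or digit. */
void build_delims(char *delims)
{
    size_t n = 0;
    for (int c = 1; c <= UCHAR_MAX; c++)
        if (!isalnum(c))
            delims[n++] = (char)c;
    delims[n] = '\0';
}

/* Splits `line` in place and prints one word per line; returns the
   number of words found. */
int print_words(char *line)
{
    char delims[UCHAR_MAX + 1];
    int count = 0;

    build_delims(delims);
    for (char *p = strtok(line, delims); p != NULL;
         p = strtok(NULL, delims))
    {
        printf("%s\n", p);
        count++;
    }
    return count;
}
```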

Building a basic shell, more specifically using execvp()

In my program I am taking user input and parsing it into a 2d char array. The array is declared as:
char parsedText[10][255] = {{""},{""},{""},{""},{""},
{""},{""},{""},{""},{""}};
and I am using fgets to grab the user input and parsing it with sscanf. This all works as I think it should.
After this I want to pass parsedText into execvp, parsedText[0] should contain the path and if any arguments are supplied then they should be in parsedText[1] thru parsedText[10].
What is wrong with execvp(parsedText[0], parsedText[1])?
One thing probably worth mentioning is that if I only supply a command such as "ls" without any arguments it appears to work just fine.
Here is my code:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include "308shell.h"
int main( int argc, char *argv[] )
{
char prompt[40] = "308sh";
char text[40] = "";
char parsedText[10][40] = {{""},{""},{""},{""},{""},
{""},{""},{""},{""},{""}};
// Check for arguments to change the prompt.
if(argc >= 3){
if(!(strcmp(argv[1], "-p"))){
strcpy(prompt, argv[2]);
}
}
strcat(prompt, "> ");
while(1){
// Display the prompt.
fputs(prompt, stdout);
fflush(stdout);
// Grab user input and parse it into parsedText.
mygetline(text, sizeof text);
parseInput(text, parsedText);
// Check if the user wants to exit.
if(!(strcmp(parsedText[0], "exit"))){
break;
}
execvp(parsedText[0], parsedText[1]);
printf("%s\n%s\n", parsedText[0], parsedText[1]);
}
return 0;
}
char *mygetline(char *line, int size)
{
if ( fgets(line, size, stdin) )
{
char *newline = strchr(line, '\n'); /* check for trailing '\n' */
if ( newline )
{
*newline = '\0'; /* overwrite the '\n' with a terminating null */
}
}
return line;
}
char *parseInput(char *text, char parsedText[][40]){
char *ptr = text;
char field [ 40 ];
int n;
int count = 0;
while (*ptr != '\0') {
int items_read = sscanf(ptr, "%s%n", field, &n);
strcpy(parsedText[count++], field);
field[0]='\0';
if (items_read == 1)
ptr += n; /* advance the pointer by the number of characters read */
if ( *ptr != ' ' ) {
strcpy(parsedText[count], field);
break; /* didn't find an expected delimiter, done? */
}
++ptr; /* skip the delimiter */
}
}
execvp takes a pointer to a pointer (char **), not a pointer to an array. It's supposed to be a pointer to the first element of an array of char * pointers, terminated by a null pointer.
Edit: Here's one (not very good) way to make an array of pointers suitable for execvp:
char argbuf[10][256] = {{0}};
char *args[10] = { argbuf[0], argbuf[1], argbuf[2], /* ... */ };
Of course in the real world your arguments probably come from a command line string the user entered, and they probably have at least one character (e.g. a space) between them, so a much better approach would be to either modify the original string in-place, or make a duplicate of it and then modify the duplicate, adding null terminators after each argument and setting up args[i] to point to the right offset into the string.
You could instead do a lot of dynamic allocation (malloc) every step of the way, but then you have to write code to handle every possible point of failure. :-)
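The in-place approach described above might be sketched as follows. The helper name split_args and the choice of spaces and tabs as separators are assumptions; quoting and escaping, which a real shell would need, are not handled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits `line` in place on spaces and tabs and
   stores pointers to the pieces in `args`, followed by a null pointer.
   `max` is the capacity of `args`, including the terminating NULL.
   Returns the number of arguments stored. */
int split_args(char *line, char *args[], int max)
{
    int n = 0;

    if (max < 1)
        return 0;
    for (char *tok = strtok(line, " \t"); tok != NULL && n < max - 1;
         tok = strtok(NULL, " \t"))
        args[n++] = tok;
    args[n] = NULL; /* execvp requires a null-terminated array */
    return n;
}
```

After split_args(text, args, 10) with char *args[10], the call would be execvp(args[0], args). A shell normally makes that call in a child process created with fork(), because a successful execvp never returns.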

Parsing text in C

I have a file like this:
...
words 13
more words 21
even more words 4
...
(General format is a string of non-digits, then a space, then any number of digits and a newline)
and I'd like to parse every line, putting the words into one field of the structure, and the number into the other. Right now I am using an ugly hack of reading the line while the chars are not numbers, then reading the rest. I believe there's a clearer way.
Edit: You can use pNum-buf to get the length of the alphabetical part of the string, and use strncpy() to copy that into another buffer. Be sure to add a '\0' to the end of the destination buffer. I would insert this code before the pNum++.
int len = pNum-buf;
strncpy(newBuf, buf, len); // copy exactly the len characters before the space
newBuf[len] = '\0';
You could read the entire line into a buffer and then use:
char *pNum;
if ((pNum = strrchr(buf, ' ')) != NULL) {
pNum++;
}
to get a pointer to the number field.
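Put together as a function, the strrchr idea could look like this sketch. The name split_line is hypothetical, the line is modified in place, and the number is converted with strtol so that a malformed last field can be detected.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: splits `buf` at its last space. On success the
   text before that space stays in `buf` (now terminated there), the
   number after it is stored in `*value`, and 1 is returned. Returns 0
   if there is no space or the last field is not a complete number. */
int split_line(char *buf, long *value)
{
    char *pNum = strrchr(buf, ' ');
    char *end;

    if (pNum == NULL)
        return 0;
    *value = strtol(pNum + 1, &end, 10);
    if (end == pNum + 1 || (*end != '\0' && *end != '\n'))
        return 0;  /* no digits, or junk after the number */
    *pNum = '\0';  /* cut the words off from the number */
    return 1;
}
```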
fscanf(file, "%s %d", word, &value);
This gets the values directly into a string and an integer, and copes with variations in whitespace and numerical formats, etc.
Edit
Oops, I forgot that you had spaces between the words.
In that case, I'd do the following. (Note that it truncates the original text in 'line'.)
// Scan to find the last space in the line
char *p = line;
char *lastSpace = NULL;
while (*p != '\0')
{
    if (*p == ' ')
        lastSpace = p;
    p++;
}
if (lastSpace == NULL)
    return("parse error");
// Replace the last space in the line with a NUL
*lastSpace = '\0';
// Advance past the NUL to the first character of the number field
lastSpace++;
char *word = line;
int number = atoi(lastSpace);
You can solve this using stdlib functions, but the above is likely to be more efficient as you're only searching for the characters you are interested in.
Given the description, I think I'd use a variant of this (now tested) C99 code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
struct word_number
{
char word[128];
long number;
};
int read_word_number(FILE *fp, struct word_number *wnp)
{
char buffer[140];
if (fgets(buffer, sizeof(buffer), fp) == 0)
return EOF;
size_t len = strlen(buffer);
if (buffer[len-1] != '\n') // Error if line too long to fit
return EOF;
buffer[--len] = '\0';
char *num = &buffer[len-1];
while (num > buffer && !isspace((unsigned char)*num))
num--;
if (num == buffer) // No space in input data
return EOF;
char *end;
wnp->number = strtol(num+1, &end, 0);
if (*end != '\0') // Invalid number as last word on line
return EOF;
*num = '\0';
if ((size_t)(num - buffer) >= sizeof(wnp->word)) // Non-number part too long
return EOF;
memcpy(wnp->word, buffer, num - buffer + 1); // + 1 copies the terminating '\0' too
return(0);
}
int main(void)
{
struct word_number wn;
while (read_word_number(stdin, &wn) != EOF)
printf("Word <<%s>> Number %ld\n", wn.word, wn.number);
return(0);
}
You could improve the error reporting by returning different values for different problems.
You could make it work with dynamically allocated memory for the word portion of the lines.
You could make it work with longer lines than I allow.
You could scan backwards over digits instead of non-spaces - but this allows the user to write "abc 0x123" and the hex value is handled correctly.
You might prefer to ensure there are no digits in the word part; this code does not care.
You could try using strtok() to tokenize each line, and then check whether each token is a number or a word (a fairly trivial check once you have the token string - just look at the first character of the token).
Assuming that the number is immediately followed by '\n', you can read each line into a char buffer, use sscanf("%d") on the entire line to get the number, and then calculate the number of characters that this number takes up at the end of the text string.
Depending on how complex your strings become you may want to use the PCRE library. At least that way you can compile a perl'ish regular expression to split your lines. It may be overkill though.
Given the description, here's what I'd do: read each line as a single string using fgets() (making sure the target buffer is large enough), then split the line using strtok(). To determine if each token is a word or a number, I'd use strtol() to attempt the conversion and check the error condition. Example:
#include <stdlib.h>
#include <stdio.h>
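#include <string.h>
#define MAX_LINE_LENGTH 256 // assumed limits; adjust them to your data
#define MAX_STR_SIZE 64
#define MAX_NUM_STRINGS 16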
/**
* Read the next line from the file, splitting the tokens into
* multiple strings and a single integer. Assumes input lines
* never exceed MAX_LINE_LENGTH and each individual string never
* exceeds MAX_STR_SIZE. Otherwise things get a little more
* interesting. Also assumes that the integer is the last
* thing on each line.
*/
int getNextLine(FILE *in, char (*strs)[MAX_STR_SIZE], int *numStrings, int *value)
{
char buffer[MAX_LINE_LENGTH];
int rval = 1;
if (fgets(buffer, sizeof buffer, in))
{
char *token = strtok(buffer, " ");
*numStrings = 0;
while (token)
{
char *chk;
*value = (int) strtol(token, &chk, 10);
if (*chk != 0 && *chk != '\n')
{
strcpy(strs[(*numStrings)++], token);
}
token = strtok(NULL, " ");
}
}
else
{
/**
* fgets() hit either EOF or error; either way return 0
*/
rval = 0;
}
return rval;
}
/**
* sample main
*/
int main(void)
{
FILE *input;
char strings[MAX_NUM_STRINGS][MAX_STR_SIZE];
int numStrings;
int value;
input = fopen("datafile.txt", "r");
if (input)
{
while (getNextLine(input, strings, &numStrings, &value))
{
/**
* Do something with strings and value here
*/
}
fclose(input);
}
return 0;
}

Resources