How to split text from file into words? - c

I'm trying to get the text from a file and split it into words by removing spaces and other symbols. This is part of my code for handling the file:
void addtext(char wordarray[M][N])
{
FILE *fileinput;
char word[N];
char filename[N];
char *pch;
int i=0;
printf("Input the name of the text file(ex \"test.txt\"):\n");
scanf("%19s", filename);
if ((fileinput = fopen(filename, "rt"))==NULL)
{
printf("cannot open file\n");
exit(EXIT_FAILURE);
}
fflush(stdin);
while (fgets(word, N, fileinput)!= NULL)
{
pch = strtok (word," ,'.`:-?");
while (pch != NULL)
{
strcpy(wordarray[i++], pch);
pch = strtok (NULL, " ,'.`:-?");
}
}
fclose(fileinput);
wordarray[i][0]='\0';
return ;
}
But here is the issue. When the text input from the file is:
Alice was beginning to get very tired of sitting by her sister on the bank.
Then the output when I try to print it is this:
Alice
was
beginning
to
get
very
tired
of
sitting
by
her
s
ister
on
the
bank
As you can see, the word "sister" is split into 2. This happens quite a few times when adding a bigger text file. What am I missing?

If you count the characters, you'll see that the s is the 57th character. 57 is 3 times 19, and 19 is the number of characters read in each cycle: the buffer holds 20, and fgets reserves the last position for the null terminator.
Because you are reading the line in batches of 19 characters, it is cut at every multiple of 19 characters, and the rest is read by the next fgets call in the loop.
The first two times you were lucky that the cut landed on a word boundary: character 19 is the end of beginning, and character 38 is the space after tired. The third cut landed in the middle of sister, so it was split into two words.
Two possible fixes:
Replace:
while (fgets(word, N, fileinput)!= NULL)
With:
while (fscanf(fileinput, "%19s", word) == 1)
This works provided that no word in the file is longer than 19 characters, which is the case here.
Make word large enough to hold the whole line:
char word[80];
80 should be enough for the sample line.
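A minimal sketch of the first fix, under the same assumption that no word is longer than 19 characters. The function name split_words and the capacity M = 100 are made up for illustration; fscanf with %19s stops only at whitespace, so a short word is never cut in the middle, and strtok still strips the punctuation:

```c
#include <stdio.h>
#include <string.h>

#define N 20   /* buffer size, as in the question */
#define M 100  /* hypothetical capacity of the word array */

/* Reads whitespace-separated tokens of at most N-1 characters from `in`,
   strips punctuation with strtok and stores the resulting words.
   Returns the number of words stored. */
int split_words(FILE *in, char words[M][N])
{
    char token[N];
    int count = 0;

    while (count < M && fscanf(in, "%19s", token) == 1) {
        char *piece = strtok(token, " ,'.`:-?");
        while (piece != NULL && count < M) {
            strcpy(words[count++], piece);   /* piece is at most N-1 chars */
            piece = strtok(NULL, " ,'.`:-?");
        }
    }
    return count;
}
```

A token longer than 19 characters would still be split by fscanf, so the larger-buffer fix is the safer choice for arbitrary text.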

What am I missing?
You are missing that a single fgets call reads at most N-1 characters from the file. Consequently, the buffer word may contain only the first part of a word. For instance, in your case the s from sister was read by one fgets call and the remaining part, ister, was read by the next one. As a result, your code treated sister as two words.
So you need to add code that checks whether the end of the buffer is a whole word or only part of a word.
To start with, you can increase N, but to make it work in general you must add code that checks the end of the word buffer.
Also notice that long words may require more than two fgets calls.
As a simple alternative to fgets and strtok, consider fread and simple char-by-char parsing of the input.
Below is a simple, low-performance example of how it can be done.
#include <stdio.h>
/* Parser states used by addtext */
enum { LOOK_FOR_WORD, READING_WORD };
int isdelim(char c)
{
if (c == '\n') return 1;
if (c == ' ') return 1;
if (c == '.') return 1;
return 0;
}
void addtext(void)
{
FILE *fileinput;
char *filename = "test.txt";
if ((fileinput = fopen(filename, "rt"))==NULL)
{
printf("cannot open file\n");
return;
}
char c;
int state = LOOK_FOR_WORD;
while (fread(&c, 1, 1, fileinput) == 1)
{
if (state == LOOK_FOR_WORD)
{
if (isdelim(c))
{
// Nothing to do.. keep looking for next word
}
else
{
// A new word starts
putchar(c);
state = READING_WORD;
}
}
else
{
if (isdelim(c))
{
// Current word ended
putchar('\n');
state = LOOK_FOR_WORD;
}
else
{
// Current word continues
putchar(c);
}
}
}
fclose(fileinput);
return ;
}
To keep the code simple, it prints the words using putchar instead of saving them in an array, but that is quite easy to change.
Further, the code only reads one char at a time from the file. Again, it's quite easy to change the code to read bigger chunks from the file.
Likewise, you can add more delimiters to isdelim as you like (and improve the implementation).
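For what it's worth, here is one way the char-by-char approach could save the words instead of printing them. It is only a sketch: collect_words, MAX_WORDS and MAX_LEN are hypothetical names and limits, fgetc is swapped in for the one-byte fread, and the length of the word in progress doubles as the state (0 means "looking for a word"):

```c
#include <stdio.h>

#define MAX_WORDS 100  /* hypothetical limits, not from the question */
#define MAX_LEN   32

static int is_delim(int c)
{
    return c == '\n' || c == ' ' || c == '.' || c == ',';
}

/* Copies each word read from `in` into `words` and returns how many
   were stored. Characters beyond MAX_LEN-1 in a single word are dropped. */
int collect_words(FILE *in, char words[MAX_WORDS][MAX_LEN])
{
    int count = 0;  /* completed words */
    int len = 0;    /* length of the word in progress; 0 = looking for a word */
    int c;

    while (count < MAX_WORDS && (c = fgetc(in)) != EOF) {
        if (!is_delim(c)) {
            if (len < MAX_LEN - 1)
                words[count][len++] = (char)c;  /* word starts or continues */
        } else if (len > 0) {
            words[count++][len] = '\0';         /* current word ended */
            len = 0;
        }
    }
    if (len > 0 && count < MAX_WORDS)           /* input ended inside a word */
        words[count++][len] = '\0';
    return count;
}
```

Because the input is consumed one character at a time, no word can be cut at a buffer boundary, whatever its position in the line.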

Related

Check multiple files with "strstr" and "fopen" in C

Today I decided to learn to code for the first time in my life. I decided to learn C. I have created a small program that checks a txt file for a specific value. If it finds that value then it will tell you that that specific value has been found.
What I would like to do is run multiple files through this program. I want it to scan all files in a folder for a specific string and display which files contain that string (basically a file index).
I just started today and I'm 15 years old so I don't know if my assumptions are correct on how this can be done and I'm sorry if it may sound stupid but I have been thinking of maybe creating a thread for every directory I put into this program and each thread individually runs that code on the single file and then it displays all the directories in which the string can be found.
I have been looking into threading but I don't quite understand it. Here's the working code for one file at a time. Does anyone know how to make this work as I want it?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
//searches for this string in a txt file
char searchforthis[200];
//file name to display at output
char ch, file_name[200];
FILE *fp;
//Asks for full directory of txt file (example: C:\users\...) and reads that file.
//fp is content of file
printf("Enter name of a file you wish to check:\n");
gets(file_name);
fp = fopen(file_name, "r"); // read mode
//If there's no data inside the file it displays following error message
if (fp == NULL)
{
perror("Error while opening the file.\n");
exit(EXIT_FAILURE);
}
//asks for string (what has to be searched)
printf("Enter what you want to search: \n");
scanf("%s", searchforthis);
char* p;
// Find first occurrence of searchforthis in fp
p = strstr(searchforthis, fp);
// Prints the result
if (p) {
printf("This Value was found in following file:\n%s", file_name);
} else
printf("This Value has not been found.\n");
fclose(fp);
return 0;
}
This line,
p = strstr(searchforthis, fp);
is wrong. strstr() is declared as char *strstr(const char *haystack, const char *needle); it takes two strings, not a file pointer.
Forget about gets(); it is prone to buffer overflow. For reference, see: Why is the gets function so dangerous that it should not be used?
Your scanf("%s", ...) is just as dangerous as gets(), because you don't limit the number of characters to be read. Instead, you could write it as,
scanf("%199s", searchforthis); /* 199 characters + \0 to mark the end of the string */
Also check the return value of scanf(), in case an input error occurs. The final code should look like this,
if (scanf("%199s", searchforthis) != 1)
{
exit(EXIT_FAILURE);
}
It is even better to use fgets() for this, though keep in mind that fgets() also stores the newline character in the buffer, so you will have to strip it manually.
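A small sketch of that fgets() approach. The helper name read_line is made up for this example; strcspn() returns the index of the first newline (or of the terminating \0 if there is none), so the assignment is safe either way:

```c
#include <stdio.h>
#include <string.h>

/* Reads one line from `in` into buf (at most size-1 characters) and
   removes the trailing newline if there is one.
   Returns 1 on success, 0 on end of file or read error. */
int read_line(FILE *in, char *buf, size_t size)
{
    if (fgets(buf, (int)size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0';
    return 1;
}
```

With the buffers from the question, a call could look like read_line(stdin, searchforthis, sizeof searchforthis). Unlike scanf("%199s", ...), this also accepts search strings that contain spaces.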
To actually perform checks on the file, you have to read the file line by line, by using a function like, fgets() or fscanf(), or POSIX getline() and then use strstr() on each line to determine if you have a match or not, something like this should work,
char *p;
char buff[500];
int flag = 0, lines = 1;
while (fgets(buff, sizeof(buff), fp) != NULL)
{
size_t len = strlen(buff); /* get the length of the string */
if (len > 0 && buff[len - 1] == '\n') /* check if the last character is the newline character */
{
buff[len - 1] = '\0'; /* place \0 in the place of \n */
}
p = strstr(buff, searchforthis);
if (p != NULL)
{
/* match - set flag to 1 */
flag = 1;
break;
}
}
if (flag == 0)
{
printf("This Value has not been found.\n");
}
else
{
printf("This Value was found in following file:\n%s", file_name);
}
flag is used to determine whether or not searchforthis exists in the file.
Side note, if the line contains more than 499 characters, you will need a larger buffer, or a different function, consider getline() for that case, or even a custom one reading character by character.
If you want to do this for multiple files, you have to place the whole process in a loop. For example,
for (int i = 0; i < 5; i++) /* this will execute 5 times */
{
printf("Enter name of a file you wish to check:\n");
...
}
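To make the loop idea more concrete, here is a rough sketch that wraps the search in a function and calls it once per file name. The names file_contains and report_matches are hypothetical, and the 500-byte line buffer carries the same limitation as above. No threads are needed for this; the files are simply checked one after another:

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if any line of the file contains needle, 0 if none does,
   and -1 if the file cannot be opened. */
int file_contains(const char *path, const char *needle)
{
    char buff[500];
    int found = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (!found && fgets(buff, sizeof(buff), fp) != NULL) {
        if (strstr(buff, needle) != NULL)
            found = 1;
    }
    fclose(fp);
    return found;
}

/* Prints every path in paths[0..count-1] whose file contains needle.
   Returns the number of matching files. */
int report_matches(const char *const paths[], int count, const char *needle)
{
    int matches = 0;

    for (int i = 0; i < count; i++) {
        if (file_contains(paths[i], needle) == 1) {
            printf("%s\n", paths[i]);
            matches++;
        }
    }
    return matches;
}
```

The list of paths could come from the command line, for example report_matches((const char *const *)(argv + 1), argc - 1, searchforthis). Listing the files of a folder is not part of standard C and needs a platform API such as POSIX opendir()/readdir().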

How to extract data from a file starting from the 2nd line in C

I'm very new to this language, can you help me?
Instead of making the user input col, row, and direction (with scanf), I want to extract the data from a file (format below).
I do not want to extract the first line of the file (5,6); I only want to extract the remaining lines.
Below is code that reads a file (using command line arguments), but it reads the first line as well and only prints the lines. I do not want to print the lines but to extract the data from them, instead of making the user input it.
File format:
colrow direction(starting from 2nd line)
5,6
A0 H
D0 V
C1 V
A4 H
F0 v
code of scanf
yourcolumn = getchar();
col = charToNum(yourcolumn); //function to input column
printf("enter row");
scanf("%d",&row);
printf("h: horizontally or v: vertically?\n");
scanf(" %c",&direction);
Code for extracting data from file:
#include <stdio.h>
int main(int argc, char* argv[])
{
char const* const fileName = argv[1]; /* should check that argc > 1 */
FILE* file = fopen(fileName, "r"); /* should check the result */
char line[256];
while (fgets(line, sizeof(line), file)) {
/* note that fgets don't strip the terminating \n, checking its
presence would allow to handle lines longer that sizeof(line) */
printf("%s", line);
}
/* may check feof here to make a difference between eof and io failure -- network
timeout for instance */
fclose(file);
return 0;
}
Since you are reading line by line, I suggest you restructure your file reading to match your logic
while (EOF != fscanf(file, "%[^\n]\n", line)) {
printf("> %s\n", line);
}
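This is one way to read every line, one at a time. You can look up the caveats of using fscanf and how to adjust the code to read safely without overflowing your line buffer: an unbounded %[^\n] writes as many characters as the line contains, whereas a width such as %255[^\n] limits it to a 256-byte buffer.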
Then, if you want to skip the first line, your code could look like this
if (EOF != fscanf(file, "%[^\n]\n", line)) {
// skip the first line
}
while (EOF != fscanf(file, "%[^\n]\n", line)) {
printf("> %s\n", line);
}
And your processing logic will look a lot like your mental process.
Yes, you could use a line counter and only process a line once the counter is high enough, but it is generally better to avoid introducing variables you can live without. One extra variable doesn't make the code much harder to reason about, but after you've repeated that "extra variable" rationale five or six times, the code becomes harder to maintain and harder to reason about. By the time you hit twenty or more extra variables, the odds of changing the code quickly without breaking it are much lower.
Read the first line also with fgets() into a string and then scan the string for row, direction.
char line[256];
if (fgets(line, sizeof(line), file)) {
if (sscanf(line, "%d %c", &row, &direction) != 2) {
printf("Invalid first line '%s'\n", line);
} else {
while (fgets(line, sizeof(line), file)) {
printf("%s", line);
}
}
}
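Putting the pieces together for the file format in the question, a sketch could look like the following. The struct placement and the function read_placements are invented for this example; the first line is read and thrown away, and every later line such as "A0 H" is parsed with sscanf:

```c
#include <stdio.h>

/* Hypothetical record for one "A0 H" style line. */
struct placement {
    char column;     /* e.g. 'A' */
    int row;         /* e.g. 0 */
    char direction;  /* 'H' or 'V', exactly as written in the file */
};

/* Skips the first line of `file`, then parses up to `max` further lines.
   Returns the number of placements stored; malformed lines are skipped. */
int read_placements(FILE *file, struct placement *out, int max)
{
    char line[256];
    int count = 0;

    if (fgets(line, sizeof(line), file) == NULL)  /* discard the "5,6" line */
        return 0;
    while (count < max && fgets(line, sizeof(line), file) != NULL) {
        struct placement p;
        if (sscanf(line, " %c%d %c", &p.column, &p.row, &p.direction) == 3)
            out[count++] = p;
    }
    return count;
}
```

The column character can then be passed to charToNum() from the question in place of the getchar() result, and row and direction replace the two scanf calls.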

Reading lines from text file to structs C

I am trying to read lines from a list into my structs, and it is almost working. I am not really sure what the problem is, but the last line of the text file won't show up when I print the structs, and I do not think the words are placed right...
void loadFile(char fileName[], Song *arr, int nrOf) {
FILE *input = fopen(fileName, "r");
if (input == NULL) {
printf("Error, the file could not load!");
} else {
fscanf(input, "%d", &nrOf);
fscanf(input, "%*[^\n]\n", NULL);
for (int i = 0; i < nrOf; i++) {
fgets(arr[i].song, sizeof(arr[i].song), input);
fgets(arr[i].artist, sizeof(arr[i].artist), input);
fgets(arr[i].year, sizeof(arr[i].year), input);
}
for (int i = 0; i < nrOf; i++) {
printf("%s", arr[i].song);
printf("%s", arr[i].artist);
printf("%s", arr[i].year);
}
rewind(input);
printf("The file is now ready.\n");
}
fclose(input);
}
The text file starts with a number on the first line to keep track of how many songs there are in the list. I therefore tried with this:
fscanf(input, "%d", &nrOf);
fscanf(input, "%*[^\n]\n", NULL);
to be able to skip the first line after nrOf got the number.
EDIT:
Here is the struct:
typedef struct Song {
char song[20];
char artist[20];
char year[5];
} Song;
Here is the text file:
4
Mr Tambourine Man
Bob Dylan
1965
Dead Ringer for Love
Meat Loaf
1981
Euphoria
Loreen
2012
Love Me Now
John Legend
2016
And the struct is dynamic allocated:
Song *arr;
arr = malloc(sizeof(Song));
There is a combination of reasons why the last line(s) do not print.
The main reason is that the last line(s) were never read.
fclose() should not be called in any execution path where the file failed to open.
There is no need to call rewind() when the next statement is fclose().
The calls to printf() for the fields in the Song array are output one right after another, so wherever a field has no trailing newline the output runs together on one long line. Hopefully the terminal wraps automatically after so many columns of output, but that cannot be depended upon.
When outputting an error message, it is best to output it to stderr, not stdout. The function perror() does that and also outputs the reason the OS thinks the error occurred (it references errno to select which error message to output).
The following is the key problem:
If the input file contains one song field per line, then the year field will either contain a trailing newline or the newline will not have been read. If the newline was not read, the next call to fgets(), which was trying to read the song title, receives only a newline, and all following fields (of all songs) are progressively further off.
I suggest that, after reading a song's fields, you use a loop to clear out any remaining characters in the input line, similar to:
int ch;
while ((ch = fgetc(input)) != EOF && ch != '\n');
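As an illustration only, here is how the whole load could be restructured along those lines. The function names read_field and load_songs are invented, and the field sizes are deliberately larger than in the question, because "Dead Ringer for Love" plus a newline and the terminator does not fit in char song[20]. The array is also allocated for all nrOf songs rather than for a single Song:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Song {
    char song[64];    /* assumed sizes, larger than in the question */
    char artist[64];
    char year[8];
} Song;

/* Reads one line into dst (capacity cap) and drops the newline.
   An over-long line is truncated and its remainder discarded. */
static int read_field(FILE *in, char *dst, size_t cap)
{
    if (fgets(dst, (int)cap, in) == NULL)
        return 0;
    size_t len = strcspn(dst, "\n");
    if (dst[len] == '\n') {
        dst[len] = '\0';
    } else {
        int ch;
        while ((ch = fgetc(in)) != EOF && ch != '\n')
            ;  /* skip the rest of the line */
    }
    return 1;
}

/* Returns a malloc'd array of *count songs, or NULL on failure. */
Song *load_songs(FILE *in, int *count)
{
    char first[32];
    int n;

    if (!read_field(in, first, sizeof(first)) ||
        sscanf(first, "%d", &n) != 1 || n <= 0)
        return NULL;
    Song *arr = malloc((size_t)n * sizeof(*arr));
    if (arr == NULL)
        return NULL;
    for (int i = 0; i < n; i++) {
        if (!read_field(in, arr[i].song, sizeof(arr[i].song)) ||
            !read_field(in, arr[i].artist, sizeof(arr[i].artist)) ||
            !read_field(in, arr[i].year, sizeof(arr[i].year))) {
            free(arr);
            return NULL;
        }
    }
    *count = n;
    return arr;
}
```

Reading the count line with the same line reader avoids mixing fscanf with fgets, which is where the stray newline after the 4 came from. The caller is responsible for calling free() on the returned array.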

Reading lines ahead in a file (In C)

I have a file that looks like this:
This is the first line in the file
This is the third line in the file
There is a blank line in the file (on line 2). I want to read the file line by line (which I do using fgets), but I also want to read ahead just to check whether there is a blank line in the file.
However, my while fgets loop has a break statement in it, because my function is only supposed to read the file one line at a time per function call.
So if I call the function:
func(file);
It would read the first line, then break.
If I called it again, it would read the second line then break, etc.
Because I have to implement it this way, it's hard to read ahead, is there any way I can accomplish this?
This is my code:
int main(void) {
FILE * file;
if(file == NULL){perror("test.txt"); return EXIT_FAILURE;}
readALine(file);
}
void readALine(FILE * file) {
char buffer[1000];
while(fgets(buffer,sizeof(buffer),file) != NULL) {
//Read lines ahead to check if there is a line
//which is blank
break; //only read a line each FUNCTION CALL
}
}
So to clarify, if I WAS reading the entire file at once (Only one function call) it would go like this (Which is easy to implement).
int main(void) {
FILE * file = fopen("test.txt","r");
if(file == NULL){perror("test.txt"); return EXIT_FAILURE;}
readALine(file);
}
void readALine(FILE * file) {
char buffer[1000];
while(fgets(buffer,sizeof(buffer),file) != NULL) {
if(isspace(buffer[0])) {
printf("Blank line found\n");
}
}
}
But since I'm reading the file line by line, one line per function call, the second piece of code above wouldn't work (I break after each line read, which I can't change).
Is there a way I could use fseek to accomplish this?
A while loop ending in an unconditional break is an if statement, so I don't really see why you are using a while loop. I'm also assuming you are not worried about a single line being longer than 1000 chars.
The continue statement jumps to the next iteration of the loop and checks the condition again.
void readALine(FILE * file) {
char buffer[1000];
while(fgets(buffer,sizeof(buffer),file) != NULL) {
if(!isspace(buffer[0])) { //note the not operator
//I'm guessing isspace checks for a newline character since otherwise this will be true also for lines beginning with space
continue; //run the same loop again
}
break;
}
//buffer contains the next line except for empty ones here...
}
You can "read ahead" by simply storing your position in the file (with position = ftell(your_file)), then read the line, if this is a blank line do whatever you have to do, and finally go back to the position you were (with fseek(your_file, position, SEEK_SET)).
Hope this helps !
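In case it is useful, the ftell/fseek idea could be sketched like this. The function name next_line_is_blank is made up, and the sketch assumes that a blank line is exactly "\n" and that lines fit in 1000 characters:

```c
#include <stdio.h>

/* Returns 1 if the next unread line of `file` is blank (just a newline),
   0 if it is not, and -1 at end of file or on error.
   The file position is restored, so the next fgets sees the same line. */
int next_line_is_blank(FILE *file)
{
    char buffer[1000];
    long position = ftell(file);   /* remember where we are */
    int result;

    if (position == -1L)
        return -1;
    if (fgets(buffer, sizeof(buffer), file) == NULL)
        result = -1;
    else
        result = (buffer[0] == '\n');
    if (fseek(file, position, SEEK_SET) != 0)   /* go back */
        return -1;
    return result;
}
```

readALine can call this after its own fgets to find out whether the following line is blank, without changing what the next call will read.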
The while loop in readALine reads lines until the end of the file. So it will skip blank lines, and all other lines.
You can return from within the loop if you've found a non-blank line:
while(fgets(buffer,sizeof(buffer),file) != NULL) {
if (buffer[0] != '\n')
return;
}
If you also want to skip lines that consist of nothing but spaces, you can write a function that does that check:
bool isNothingButWhitespace(char *s) {
while (*s == ' ' || *s == '\n')
s++;
return *s == '\0';
}
This will find the first character that's not whitespace. If it's the string terminator '\0', it returns true (the string was nothing but whitespace); otherwise it returns false (some non-whitespace character was found).
If the while loop in readALine completes due to it reaching the end of file, you need some way to signal that back to the caller. I recommend setting buffer[0] = '\0'.

How do I extract a specific numbered line from a text file? (C)

I am trying to write a function that prints a specific line from a text file based on the number given. For example, let's say the file contains the following:
1 hello1 one
2 hello2 two
3 hello3 three
If the number given is '3', the function will output "hello3 three". If the number given is '1', the function output will be "hello1 one".
I am very new to C but here is my logic so far.
I imagine first thing is first, I need to find the character 'number' inside the file. Then what? How do I go about writing the line out without including the number? How do I even find the 'number'? I am sure it's very simple but I have no idea how to do this. Here is what I have so far:
void readNumberedLine(char *number)
{
int size = 1024;
char *buffer = malloc(size);
char *line;
FILE *fp;
fp = fopen("xxxxx.txt", "r");
while(fp != NULL && fgets(buffer, sizeof(buffer), fp) != NULL)
{
if(line = strstr(buffer, number))
//here is where I am confused as to what to do.
}
if (fp != NULL)
{
fclose(fp);
}
}
Any help at all would be greatly appreciated.
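From what you are saying, you are looking for lines tagged with a number at the beginning of the line. In that case you want a function that reads the line starting with a given tag:
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
/* result must have room for at least 1024 characters */
bool readTaggedLine(char* filename, char* tag, char* result)
{
FILE *f;
size_t taglen = strlen(tag);
f = fopen(filename, "r");
if(f == NULL) return false;
while(fgets(result, 1024, f))
{
/* the tag must be followed by a space, so that "1" does not match "10 ..." */
if(strncmp(tag, result, taglen)==0 && result[taglen]==' ')
{
/* source and destination overlap, so use memmove rather than strcpy */
memmove(result, result+taglen+1, strlen(result+taglen+1)+1);
fclose(f);
return true;
}
}
fclose(f);
return false;
}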
then use it like
char result[3000];
if(readTaggedLine("blah.txt", "3", result))
{
printf("%s\r\n", result);
}
else
{
printf("Could not find the desired line\r\n");
}
I would try the following.
Approach 1:
Read and throw away (n - 1) lines
// Consider using readline(), see reference below
line = readline() // one more time
return line
Approach 2:
Read block by block and count newline characters ('\n').
Keep reading and throwing away for the first (n - 1) '\n's
Read characters till next '\n' and accumulate them into line
return line
readline(): Reading one line at a time in C
P.S. The following is a shell solution; it may be used to unit test the C program.
# Display the 42nd line of file foo
$ head --lines 42 foo | tail -1
# (head displays lines 1-42, and tail displays the last of them)
You can use an additional variable to record how many lines you have read. Then, in the while loop, compare that variable with your input value; if they are equal, output the buffer.
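A sketch of that counter approach, assuming that "line number" means the position of the line in the file and that no line is longer than the buffer. The function name read_nth_line is hypothetical:

```c
#include <stdio.h>
#include <string.h>

/* Copies line number `wanted` (1-based) of `fp` into out, without the
   trailing newline. Returns 1 if the line exists, 0 otherwise.
   Assumes no line is longer than size-1 characters. */
int read_nth_line(FILE *fp, int wanted, char *out, size_t size)
{
    int current = 0;  /* how many lines have been read so far */

    while (fgets(out, (int)size, fp) != NULL) {
        current++;
        if (current == wanted) {
            out[strcspn(out, "\n")] = '\0';
            return 1;
        }
    }
    return 0;
}
```

For the sample file, asking for line 3 yields "3 hello3 three"; to print it without the leading number, the caller can skip past the first space, for example with strchr(out, ' ').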

Resources