C extract words from a txt file except spaces and punctuations


I'm trying to extract the words from a .txt file which contains the following sentence
Quando avevo cinqve anni, mia made mi perpeteva sempre che la felicita e la chiave della vita. Quando andai a squola mi domandrono come vuolessi essere da grande. Io scrissi: selice. Mi dissero che non avevo capito il corpito, e io dissi loro che non avevano capito la wita.
The problem is that the array I use to store the words also ends up with empty words, which always come after one of the following characters: ',' '.' ':'
I know that "empty words" or "empty chars" don't really make sense, but please try the code with the text above and you'll see what I mean.
Meanwhile, I'm trying to understand how to use sscanf with the scanset modifier sscanf(buffer, "%[^.,:]"); which should let me store strings while excluding the . and , and : characters. However, I don't know what I should write inside %[^] to also exclude the space character ' ', which always gets saved.
The code is the following
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
static void load_array(const char* file_name){
char buffer[2048];
char a[100][100];
int buf_size = 2048;
FILE *fp;
int j = 0, c = 0;
printf("\nLoading data from file...\n");
fp = fopen(file_name,"r");
if(fp == NULL){
fprintf(stderr,"main: unable to open the file");
exit(EXIT_FAILURE);
}
fgets(buffer,buf_size,fp);
//here i store each word in an array of strings when I encounter
//an unwanted char I save the word into the next element of the
//array
for(int i = 0; i < strlen(buffer); i++) {
if((buffer[i] >= 'a' && buffer[i] <= 'z') || (buffer[i] >= 'A' && buffer[i] <= 'Z')) {
a[j][c++] = buffer[i];
} else {
j++;
c = 0;
continue;
}
}
//this print is used only to see the words in the array of strings
for(int i = 0; i < 100; i++)
printf("%s %d\n", a[i], i);
fclose(fp);
printf("\nData loaded\n");
}
//Here I pass the file_name from command line
int main(int argc, char const *argv[]) {
if(argc < 2) {
printf("Usage: ordered_array_main <file_name>\n");
exit(EXIT_FAILURE);
}
load_array(argv[1]);
}
I know that I should store only as many words as needed rather than 100 every time. I'll deal with that later; right now I want to fix the issue with the empty words.
Compilation and execution
gcc -o testloadfile testloadfile.c
./testloadfile "correctme.txt"

You could instead try to use strtok:
fgets(buffer,buf_size,fp);
for (char* tok = strtok(buffer,".,: \n"); tok != NULL; tok = strtok(NULL,".,: \n"))
{
printf("%s\n", tok);
}
Note that strtok returns NULL when there are no more tokens, so the loop must test the pointer itself rather than *tok. If you want to store what strtok returns, you need to copy the contents that tok points to (for example with strdup or malloc+strcpy), because strtok modifies the buffer passed as its first argument as it parses, and the tokens point into that buffer.
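As a minimal sketch of the copying step described above, the tokens can be copied into a fixed-size array like the question's a[100][100]. The function name split_words and the MAX_WORDS/MAX_LEN limits are assumptions made for illustration, not part of the original answer.

```c
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 100   /* assumed capacity, mirrors the question's a[100][100] */
#define MAX_LEN   100   /* assumed maximum word length, including the '\0' */

/* Split `buffer` in place on spaces and punctuation and copy each token
 * into `words`. Returns the number of words stored.
 * Note: strtok writes '\0' bytes into `buffer`. */
size_t split_words(char *buffer, char words[][MAX_LEN], size_t max_words)
{
    const char *delims = " .,:\n";
    size_t count = 0;

    for (char *tok = strtok(buffer, delims);
         tok != NULL && count < max_words;
         tok = strtok(NULL, delims)) {
        /* snprintf always null-terminates and truncates overlong tokens */
        snprintf(words[count], MAX_LEN, "%s", tok);
        count++;
    }
    return count;
}
```

Because strtok treats a run of consecutive delimiters as a single separator, sequences such as ", " or ". " do not produce empty entries.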

You forgot to add the final '\0' to each row of a, and your algorithm has several flaws, such as incrementing j every time a non-letter appears. What if you have ", "? You increment twice instead of once, which leaves an empty row.
One "easy" way is to use strtok, as Anders K. showed you.
fgets(buffer,buf_size,fp);
for (char* tok = strtok(buffer,".,:"); tok != NULL; tok = strtok(NULL,".,:")) {
printf("%s\n", tok);
}
The "problem" with that function is that you have to specify every delimiter, so you have to add ' ' (space), '\t' (tab), and so on.
Since you only want a "word" in the sense of "contains only letters, lowercase or uppercase", you can do the following:
#include <ctype.h>
#include <stdio.h>
int main(void)
{
char line[] = "Hello ! What a beautiful day, isn't it ?";
char *beginWord = NULL;
for (size_t i = 0; line[i]; ++i) {
if (isalpha((unsigned char)line[i])) { // upper or lower letter ==> valid character for a word
if (!beginWord) {
// We found the beginning of a word
beginWord = line + i;
}
} else {
if (beginWord) {
// We found the end of a word
char tmp = line[i];
line[i] = '\0';
printf("'%s'\n", beginWord);
line[i] = tmp;
beginWord = NULL;
}
}
}
return (0);
}
Note how "isn't" is split into "isn" and "t", since ' is not an accepted character for your words.
The algorithm is pretty simple: we loop over the string, and if the current character is a valid letter and beginWord == NULL, it is the beginning of a word. If it is not a valid letter and beginWord != NULL, it is the end of a word. This way any number of non-letter characters can appear between two words and each word is still detected cleanly. Note that, as written, a word that runs up to the very end of the string is only printed if a non-letter follows it.
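The same begin/end detection can also fill an array of words, which is what the question was ultimately after. The following is only a sketch: extract_words and WORD_LEN are hypothetical names, and it treats the terminating '\0' as a word boundary so that a word at the very end of the line is not lost.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define WORD_LEN 100   /* assumed maximum word length, including the '\0' */

/* Copy every run of letters in `line` into `words`; returns the word count.
 * `line` itself is not modified. */
size_t extract_words(const char *line, char words[][WORD_LEN], size_t max_words)
{
    size_t count = 0;
    const char *begin = NULL;              /* start of the current word, if any */

    for (const char *p = line; ; ++p) {
        if (*p != '\0' && isalpha((unsigned char)*p)) {
            if (begin == NULL)
                begin = p;                 /* beginning of a word */
        } else {
            if (begin != NULL && count < max_words) {
                size_t len = (size_t)(p - begin);
                if (len >= WORD_LEN)
                    len = WORD_LEN - 1;    /* truncate overlong words */
                memcpy(words[count], begin, len);
                words[count][len] = '\0';
                count++;
            }
            begin = NULL;
            if (*p == '\0')
                break;                     /* end of string also ends a word */
        }
    }
    return count;
}
```

Copying with an explicit length avoids temporarily overwriting characters in the input, so the function also works on string literals and other read-only buffers.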

Related

How to read specific words from a file?

I have a file that contains words and their synonyms each on a separate line.
I am writing this code that should read the file line by line then display it starting from the second word which is the synonym.
I used the variable count in the first loop in order to be able to count the number of synonyms of each word because the number of synonyms differs from one to another. Moreover I used the condition synonyms[i]==',' because each synonym is separate by a comma.
The purpose of me writing such code is to put them in a binary search tree in order to have a full dictionary.
The code doesn't contain any error yet it is not working.
I have tried to each the loop but that didn't work too.
Sample input from the file:
abruptly - dead, short, suddenly
acquittance - release
adder - common, vipera
Sample expected output:
dead short suddenly
acquittance release
common vipera
Here is the code:
void LoadFile(FILE *fp){
int count;
int i;
char synonyms[50];
char word[50];
while(fgets(synonyms,50,fp)!=NULL){
for (i=0;i<strlen(synonyms);i++)
if (synonyms[i]==',' || synonyms[i]=='\n')
count++;
}
while(fscanf(fp,"%s",word)==1){
for(i=1;i<strlen(synonyms);i++){
( fscanf(fp,"%s",synonyms)==1);
printf("%s",synonyms);
}
}
}
int main(){
char fn[]="C:/Users/CLICK ONCE/Desktop/Semester 4/i2206/Project/Synonyms.txt";
FILE *fp;
fp=fopen(fn,"rt");
if (fp==NULL){
printf("Cannot open this file");
}
else{
LoadFile(fp);
}
return 0;
}

Here is my solution. I have split the work into functions for readability. The actual parsing is done in the parse function. That function takes into account hyphenated compound words such as seventy-two. The word and its synonyms must be separated by a hyphen preceded by at least one space.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
// Trim leading and trailing space characters.
// Warning: string is modified
char* trim(char* s) {
char* p = s;
size_t l = strlen(p);
while (l > 0 && isspace((unsigned char)p[l - 1])) p[--l] = 0; // l > 0 keeps an empty string from reading p[-1]
while (*p && isspace((unsigned char)*p)) ++p, --l;
memmove(s, p, l + 1);
return s;
}
// Warning: string is modified
int parse(char* line)
{
char* token;
char* p;
char* word;
if (line == NULL) {
printf("Missing input line\n");
return 0;
}
// first find the word delimiter: a hyphen preceded by a space
p = line;
while (1) {
p = strchr(p, '-');
if (p == NULL) {
printf("Missing hyphen\n");
return 0;
}
if ((p > line) && (p[-1] == ' ')) {
// We found a hyphen preceded by a space
*p = 0; // Replace by nul character (end of string)
break;
}
p++; // Skip hyphen inside hyphenated word
}
word = trim(line);
printf("%s ", word);
// Next find synonyms delimited by a comma
char delim[] = ", ";
token = strtok(p + 1, delim);
while (token != NULL) {
printf("%s ", token);
token = strtok(NULL, delim);
}
printf("\n");
return 1;
}
int LoadFile(FILE* fp)
{
if (fp == NULL) {
printf("File not open\n");
return 0;
}
int ret = 1;
char str[1024]; // Longest allowed line
while (fgets(str, sizeof(str), fp) != NULL) {
str[strcspn(str, "\r\n")] = 0; // Remove ending \n
ret &= parse(str);
}
return ret;
}
int main(int argc, char *argv[])
{
FILE* fp;
char* fn = "Synonyms.txt";
fp = fopen(fn, "rt");
if (fp == NULL) {
perror(fn);
return 1;
}
int ret = LoadFile(fp);
fclose(fp);
return ret ? 0 : 1; // LoadFile returns 1 on success; an exit status of 0 means success
}

I think the biggest conceptual misunderstanding demonstrated in the code is a failure to understand how fgets and fscanf work.
Consider the following lines of code:
while(fgets(synonyms,50,fp)!=NULL){
...
while(fscanf(fp,"%49s",word)==1){
for(i=1;i<strlen(synonyms);i++){
fscanf(fp,"%49s",synonyms);
printf("%s",synonyms);
}
}
}
The fgets reads one line of the input. (Unless there is an input line that is greater than 49 characters long (48 + a newline), in which case fgets will only read the first 49 characters. The code should check for that condition and handle it.) The next fscanf then reads a word from the next line of input. The first line is effectively being discarded! If the input is formatted as expected, the 2nd scanf will read a single - into synonyms. This makes strlen(synonyms) evaluate to 1, so the for loop terminates. The while scanf loop then reads another word, and since synonyms still contains a string of length 1, the for loop is never entered. while scanf then proceeds to read the rest of the file. The next call to fgets returns NULL (since the fscanf loop has read to the end of the file) so the while/fgets loop terminates after 1 iteration.
I believe the intention was for the scanfs inside the while/fgets to operate on the line read by fgets. To do that, all the fscanf calls should be replaced by sscanf.
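A minimal sketch of that fgets-then-sscanf approach is shown below, assuming the "word - syn1, syn2" layout from the sample input. The function name parse_synonyms and the SYN_LEN limit are hypothetical; the %n directive, which stores how many characters have been consumed so far, is used to step through the line.

```c
#include <stdio.h>
#include <string.h>

#define SYN_LEN 50   /* assumed buffer size, as in the question */

/* Parse one line of the form "word - syn1, syn2, ...".
 * Returns the number of synonyms stored, or -1 if the " - " separator
 * is missing. Intended for a line previously read with fgets. */
int parse_synonyms(const char *line, char word[SYN_LEN],
                   char syns[][SYN_LEN], int max_syns)
{
    int offset = 0;

    /* %49s reads the headword, " - " matches the separator, and %n
     * records the position; offset stays 0 if the hyphen is absent. */
    if (sscanf(line, "%49s - %n", word, &offset) != 1 || offset == 0)
        return -1;

    const char *p = line + offset;
    int count = 0;
    int used = 0;

    /* " %49[^,\n]" skips leading blanks, then reads up to a comma or newline */
    while (count < max_syns &&
           sscanf(p, " %49[^,\n]%n", syns[count], &used) == 1) {
        p += used;
        count++;
        if (*p != ',')
            break;      /* no more synonyms on this line */
        p++;            /* step over the comma */
    }
    return count;
}
```

A caller would read each line with fgets(line, sizeof line, fp) and pass it to parse_synonyms, so every conversion operates on the line already in memory rather than consuming more of the file.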

C - Program not detecting blank spaces

I want my program to read a file containing words separated by blank spaces and then prints words one by one. This is what I did:
char *phrase = (char *)malloc(LONGMAX * sizeof(char));
char *mot = (char *)malloc(TAILLE * sizeof(char));
FILE *fp = NULL;
fp = fopen("mots.txt", "r");
if (fp == NULL) {
printf("err ");
} else {
fgets(phrase, LONGMAX, fp);
while (phrase[i] != '\0') {
if (phrase[i] != " ") {
mot[m] = phrase[i];
i++;
m++;
} else {
printf("%s\n", phrase[i]);
mot = "";
}
}
}
but it isn't printing anything! Am I doing something wrong? Thanks!

The i in the following:
while (phrase[i]!='\0'){
Should be initialized to 0 before being used, then incremented as you iterate through the string.
You have not shown where/how it is created.
Also in this line,
if(phrase[i]!=" "){
the code is comparing a char: (phrase[i]) with a string: ( " " )
// char string
if(phrase[i] != " " ){
change it to:
// char char
if(phrase[i] != ' '){
//or better yet, include all whitespace (isspace is declared in <ctype.h>):
if(!isspace((unsigned char)phrase[i])) {
There is no error checking in the following, but it is basically your code with modifications. Read comments for explanation on edits to fgets() usage, casting return of malloc(), how and when to terminate the output buffer mot, etc.:
This performs the following: read a file containing words separated by blank spaces and then prints words one by one.
int main(void)
{
int i = 0;
int m = 0;
char* phrase=malloc(LONGMAX);//sizeof(char) always == 1
if(phrase)//test to make sure memory created
{
char* mot=malloc(TAILLE);//no need to cast the return of malloc in C
if(mot)//test to make sure memory created
{
FILE* fp=NULL;
fp=fopen("_in.txt","r");
if(fp)//test to make sure fopen worked
{//shortcut of what you had :) (left off the print err)
while (fgets(phrase,LONGMAX,fp))//fgets return NULL when no more to read.
{
i = 0;//restart at the beginning of each new line
while(phrase[i] != '\0')//test for end of last line read
{
// if(phrase[i] == ' ')//see a space, terminate word and write to stdout
if(isspace((unsigned char)phrase[i]))//see ANY white space, terminate and write to stdout
{
mot[m]=0;//null terminate
if(strlen(mot) > 0) printf("%s\n",mot);
i++;//move to next char in phrase.
m=0;//reset to capture next word
}
else
{
mot[m] = phrase[i];//copy next char into mot
m++;//increment both buffers
i++;// "
}
}
mot[m]=0;//null terminate after while loop
}
//per comment about last word. Print it out here.
mot[m]=0;
if(m > 0) printf("%s\n",mot);//only if a word is still pending
fclose(fp);
}
free(mot);
}
free(phrase);
}
return 0;
}

phrase[i]!=" "
You compare character (phrase[i]) and string (" "). If you want to compare phrase[i] with space character, use ' ' instead.
If you want to compare string, use strcmp.
printf("%s\n",phrase[i]);
Here, you use %s for printing the string, but phrase[i] is a character.
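Do not use mot=""; to reset a string in C: it makes mot point at a string literal instead of changing the contents of the buffer, and the malloc'd memory is leaked. Use strcpy instead:
strcpy(mot, "");
If you want to print a line word by word, you can use strtok to split the string on the space character.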
fgets(phrase,LONGMAX,fp);
char * token = strtok(phrase, " ");
while(token != NULL) {
printf("%s \n", token);
token = strtok(NULL, " ");
}
Off topic: your program will only read one line of the file because you call fgets only once. If your file contains many lines, you should call fgets in a loop.
while(fgets(phrase,LONGMAX,fp)) {
// do something with the phrase string.
// strtok for example.
char * token = strtok(phrase, " ");
while(token != NULL) {
printf("%s \n", token);
token = strtok(NULL, " ");
}
}

Your program has multiple problems:
the test for end of file is incorrect: you should just compare the return value of fgets() with NULL.
the test for spaces is incorrect: phrase[i] != " " is a type mismatch as you are comparing a character with a pointer. You should use isspace() from <ctype.h>
Here is a much simpler alternative that reads one byte at a time, with neither a line buffer nor a word buffer:
#include <ctype.h>
#include <stdio.h>
int main() {
int inword = 0;
int c;
while ((c = getchar()) != EOF) {
if (isspace(c)) {
if (inword) {
putchar('\n');
inword = 0;
}
} else {
putchar(c);
inword = 1;
}
}
if (inword) {
putchar('\n');
}
return 0;
}

How to prevent char arrays from being overwritten from for loops when using strcat

Whenever a word from wordlist passes as a valid word, strcat(code,wordlist[i]) is called to append the word to code.
So if "am" is entered on the first line, code=am.
Or if abhcgmsopa bqcedpwon abmnpc abcdponm dfajbbmmn cabnmo is entered on the first line, the three corresponding valid words are appended.
However, on the second line the values in code get overwritten and extra characters appear, even though code is initialized outside the while loop and strcat should append to the end of code. Then, when the while loop ends, code is replaced entirely by "xq", where x was the first letter put into code and q is from "quitting".
code isn't reinitialized or changed apart from what is appended to it.
How can I prevent this?
Thanks
*Edit: I defined some stack functions before the main but edited it out here to minimize the code
int main(int argc, char const *argv[])
{
char input[300];
char code[]="";
int ci;
/* set up an infinite loop */
while (1)
{
//break;
/* get line of input from standard input */
printf ("\nEnter input to check or q to quit\n");
fgets(input, 300, stdin);
/* remove the newline character from the input */
int i = 0;
while (input[i] != '\n' && input[i] != '\0')
{
i++;
}
input[i] = '\0';
/* check if user enter q or Q to quit program */
if ( (strcmp (input, "q") == 0) || (strcmp (input, "Q") == 0) )
break;
/*Start tokenizing the input into words separated by space
We use strtok() function from string.h*/
/*The tokenized words are added to an array of words*/
char delim[] = " ";
char *ptr = strtok(input, delim);
int j = 0 ;
char *wordlist[300];
while (ptr != NULL)
{
wordlist[j++] = ptr;
ptr = strtok(NULL, delim);
}
/*Run the algorithm to decode the message*/
//j=words in line;i=i-th word we are evaluating
//k=k-th letter in i-th word
stack1 st;
for(int i=0;i<j;i++){
//stack1 st;
init(&st);
for(int k=0;k<strlen(wordlist[i]);k++){
if((int)wordlist[i][k]<101 && (int)wordlist[i][k]>96){ //check if this letter is a/b/c/d with ascii
push(&st,&wordlist[i][k]);
printf("%c added\n",st.ptr[st.inUse-1]);
}
else{
if(wordlist[i][k]==top(&st)+12){ //check if letter is m/n/o/p corresponding to a/b/c/d from top()
pop(&st);
}
}
}
if(is_empty(&st)){
printf("%s is valid\n",wordlist[i]);
strcat(code,wordlist[i]);
strcat(code," ");
}
else{
printf("%s is invalid\n",wordlist[i]);
clear(&st);
}
printf("code:%s\n",code);
}
printf("code after loop: %s",code);
}
printf("code: %s\n",code);
for(int i=0;i<300;i++){
if ((int)code[i]<101 && (int)code[i]>96){
printf("%c",code[i]);
}
if(!((int)code[i]<96+26 && (int)code[i]>96)){
printf(" ");
}
}
printf("code:%s",code);
printf ("\nGoodbye\n");
return 0;
}

The problem is that your code variable is an array of 1 character! This line:
char code[]="";
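declares it as an empty string (no characters) plus a null terminator, so every strcat writes past the end of the array. That is undefined behavior, which is why other data appears to be overwritten.
You need to declare it as an array big enough to hold the longest possible answer, including the null terminator. If this is, say, 500, then use this: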
char code[500]="";
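Even with a larger array, strcat does no bounds checking, so a long session could still overflow it. One possible guard, sketched here with a hypothetical helper named append_bounded, is to check the remaining space before appending:

```c
#include <stddef.h>
#include <string.h>

/* Append `src` to the string in `dst` only if the result, including the
 * terminating '\0', fits in `dst_size` bytes.
 * Returns 1 on success and 0 (leaving `dst` unchanged) if there is no room. */
int append_bounded(char *dst, size_t dst_size, const char *src)
{
    size_t used = strlen(dst);
    size_t need = strlen(src);

    if (used + need + 1 > dst_size)
        return 0;

    memcpy(dst + used, src, need + 1);   /* copies the '\0' as well */
    return 1;
}
```

In the question's loop, the two strcat calls could then become append_bounded(code, sizeof code, wordlist[i]) and append_bounded(code, sizeof code, " "), with the return value checked to report a full buffer.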

C Reading a file of digits separated by commas

I am trying to read in a file that contains numbers separated by commas and store them in an array without the commas.
For example: processes.txt contains
0,1,3
1,0,5
2,9,8
3,10,6
And an array called numbers should look like:
0 1 3 1 0 5 2 9 8 3 10 6
The code I had so far is:
FILE *fp1;
char c; //declaration of characters
fp1=fopen(argv[1],"r"); //opening the file
int list[300];
c=fgetc(fp1); //taking character from fp1 pointer or file
int i=0,number,num=0;
while(c!=EOF){ //iterate until end of file
if (isdigit(c)){ //if it is digit
sscanf(&c,"%d",&number); //changing character to number (c)
num=(num*10)+number;
}
else if (c==',' || c=='\n') { //if it is new line or ,then it will store the number in list
list[i]=num;
num=0;
i++;
}
c=fgetc(fp1);
}
But this is having problems if it is a double digit. Does anyone have a better solution? Thank you!

For the data shown with no space before the commas, you could simply use:
while (fscanf(fp1, "%d,", &num) == 1 && i < 300)
list[i++] = num;
This will read the comma after the number if there is one, silently ignoring when there isn't one. If there might be white space before the commas in the data, add a blank before the comma in the format string. The test on i prevents you writing outside the bounds of the list array. The ++ operator comes into its own here.
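As a sketch of the variant that tolerates white space before the comma, the loop can be wrapped in a small function. The name read_ints is an assumption for illustration; the bounds test is placed first so that a number is never consumed from the file without being stored.

```c
#include <stdio.h>

/* Read integers separated by commas and/or white space from `fp` into
 * `list`, storing at most `max` values. Returns how many were stored. */
size_t read_ints(FILE *fp, int *list, size_t max)
{
    size_t n = 0;
    int num;

    /* "%d" skips leading white space (including newlines); the blank in
     * the format skips white space before an optional comma. */
    while (n < max && fscanf(fp, "%d ,", &num) == 1)
        list[n++] = num;

    return n;
}
```

With the sample processes.txt this stores the twelve values in order, and multi-digit entries such as 10 are read as a single number because %d consumes all of the digits.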

First, fgetc returns an int, so c needs to be an int.
Other than that, I would use a slightly different approach. I admit that it is slightly overcomplicated. However, this approach may be useful if you have several different types of fields that require different actions, as in a parser. For your specific problem, I recommend Jonathan Leffler's answer.
int c=fgetc(f);
while(c!=EOF && i<300) {
if(isdigit(c)) {
fseek(f, -1, SEEK_CUR);
if(fscanf(f, "%d", &list[i++]) != 1) {
// Handle error
}
}
c=fgetc(f);
}
Here I don't care about commas and newlines. I take ANYTHING other than a digit as a separator. What I do is basically this:
read next byte
if byte is digit:
back one byte in the file
read number, regardless of length
else continue
The added condition i<300 is for security reasons. If you really want to check that nothing else than commas and newlines (I did not get the impression that you found that important) you could easily just add an else if (c == ... to handle the error.
Note that you should always check the return value for functions like sscanf, fscanf, scanf etc. Actually, you should also do that for fseek. In this situation it's not as important since this code is very unlikely to fail for that reason, so I left it out for readability. But in production code you SHOULD check it.
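For illustration, here is one way the same idea might look with the return values checked. This sketch swaps fseek for ungetc, which pushes the single character just read back onto the stream; that substitution and the function name read_numbers are assumptions, not part of the answer above.

```c
#include <ctype.h>
#include <stdio.h>

/* Treat anything that is not a digit as a separator and read each run of
 * digits as one number. Returns how many numbers were stored in `list`. */
size_t read_numbers(FILE *f, int *list, size_t max)
{
    size_t n = 0;
    int c;                       /* int, so that EOF can be represented */

    while (n < max && (c = fgetc(f)) != EOF) {
        if (isdigit(c)) {
            if (ungetc(c, f) == EOF)
                break;           /* could not push the digit back */
            if (fscanf(f, "%d", &list[n]) != 1)
                break;           /* conversion failed */
            n++;
        }
    }
    return n;
}
```

As in the original approach, a minus sign counts as a separator here, so negative numbers would need extra handling.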

My solution is to read the whole line first and then parse it with strtok_r with comma as a delimiter. If you want portable code you should use strtok instead.
A naive implementation of readline would be something like this:
static char *readline(FILE *file)
{
char *line = malloc(sizeof(char));
if (line == NULL) return NULL; // allocation failed
int index = 0;
int c = fgetc(file);
if (c == EOF) {
free(line);
return NULL;
}
while (c != EOF && c != '\n') {
line[index++] = c;
char *l = realloc(line, (index + 1) * sizeof(char));
if (l == NULL) {
free(line);
return NULL;
}
line = l;
c = fgetc(file);
}
line[index] = '\0';
return line;
}
Then you just need to parse the whole line with strtok_r, so you would end with something like this:
int main(int argc, char **argv)
{
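if (argc < 2) {
return 1; // no file name given
}
FILE *file = fopen(argv[1], "r"); // plain "r" is portable; the "e" flag is a glibc extension
int list[300];
if (file == NULL) {
return 1;
}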
char *line;
int numc = 0;
while((line = readline(file)) != NULL) {
char *saveptr;
// Get the first token
char *tok = strtok_r(line, ",", &saveptr);
// Now start parsing the whole line
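while (tok != NULL && numc < 300) { // never write past the end of list
// Convert the token to a long if possible
errno = 0; // strtol only sets errno on failure, so clear it first (needs <errno.h>)
long num = strtol(tok, NULL, 0);
if (errno != 0) {
// Handle no value conversion
// ...
// ...
}
list[numc++] = (int) num;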
// Get next token
tok = strtok_r(NULL, ",", &saveptr);
}
free(line);
}
fclose(file);
return 0;
}
And for printing the whole list just use a for loop:
for (int i = 0; i < numc; i++) {
printf("%d ", list[i]);
}
printf("\n");
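Checking errno alone does not catch a token that contains no digits at all, because strtol simply returns 0 in that case. A stricter conversion can use the end pointer, as in this sketch; parse_int is a hypothetical helper and forces base 10 rather than the base 0 used above.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Convert `tok` to an int. Returns 1 and stores the value in `*out` on
 * success; returns 0 if the token has no digits, has trailing junk, or
 * does not fit in an int. */
int parse_int(const char *tok, int *out)
{
    char *end;

    errno = 0;
    long v = strtol(tok, &end, 10);

    if (end == tok)
        return 0;                                /* no digits consumed */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;                                /* out of range */

    /* allow trailing white space such as a stray '\r' or '\n' */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        end++;
    if (*end != '\0')
        return 0;                                /* trailing junk */

    *out = (int)v;
    return 1;
}
```

Inside the token loop, the assignment could then be written as if (parse_int(tok, &list[numc])) numc++; so that bad tokens are skipped instead of being stored as 0.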

ANSI C strcmp() function never returning 0, where am I going wrong?

C isn't a language I know well, so I'm out of my comfort zone (learning C), and I have run into an issue that I can't currently figure out.
I am trying to read from a text file one word at a time and compare it to a word that I have passed into the function as a pointer.
I am currently reading it from the file one character at a time and storing those characters in a new char array until it hits a space, then comparing that char array to the original word stored in the pointer (stored where it's pointing to, anyway).
When I do a printf to check if both arrays are the same they are, they both equal "Hello". At first I thought maybe it's because my char array doesn't have an end terminator but I tried adding one but still nothing is seeming to work.
My code is below and I would appreciate any help. Again C isn't my strong area.
If I enter "Hello", the result will be > 0 by the way, so I think it's because the fgets() call on stdin is also storing the Enter key (the newline) or something of that sort. I am not sure of a better way to grab the string though.
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
int partA(char*);
main()
{
// Array to store my string
char myWord[81];
// myword = pointer to my char array to store. 80 = the size (maximum). stdin = standard input from my keyboard.
fgets(myWord, 80, stdin);
partA(myWord);
}
int partA(char *word)
{
// points to file.
FILE *readFile;
fopen_s(&readFile, "readThisFile.txt", "r");
char character;
char newWord[50];
int i = 0;
while ((character = fgetc(readFile)) != EOF)
{
if (character == ' ')
{
newWord[i] = '\0';
int sameWord = strcmp(word, newWord);
printf("Word: %s", word);
printf("newWord: %s", newWord);
if (sameWord == 0)
printf(" These words are the same.");
if (sameWord > 0)
printf(" sameWord > 0.");
if (sameWord < 0)
printf(" sameWord < 0.");
printf("\n");
i = 0;
}
if (character != ' ')
{
newWord[i] = character;
i++;
}
printf("%c", character);
}
fclose(readFile);
return 1;
}
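The suspicion about the Enter key is consistent with how fgets behaves: it keeps the trailing newline in the buffer, so the stored string is "Hello\n" rather than "Hello". A minimal sketch of stripping it before calling strcmp follows; the helper name read_word_line is an assumption for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Read one line from `in` into `buf` and remove the trailing newline
 * (and a carriage return, if present). Returns `buf`, or NULL at end of
 * input or on a read error. */
char *read_word_line(char *buf, size_t size, FILE *in)
{
    if (fgets(buf, (int)size, in) == NULL)
        return NULL;

    /* strcspn gives the index of the first '\r' or '\n', or the string
     * length if neither is present, so this is always in bounds */
    buf[strcspn(buf, "\r\n")] = '\0';
    return buf;
}
```

In the program above, fgets(myWord, 80, stdin) could be replaced by read_word_line(myWord, sizeof myWord, stdin). Separately, fgetc returns an int, so character should be declared as int for the comparison with EOF to be reliable.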
