Searching a particular string in a large file - c

I am writing a C program that searches a large .txt file for a specific string, counts the occurrences, and prints the count. Something has gone wrong, though, because my program's output differs from that of two text editors. According to the text editors, the word I am searching for, "make", occurs 3000 times in the file, but my program reports only 2970.
I cannot find the problem in my program. So I am curious: how does a text editor search for a specific string so accurately? How do people implement that? Can anyone show me some code in C?
To make things clear: it is a large .txt file, 20M or so, containing lots of characters, so I think it's not a good idea to read it into memory all at once. I implemented my program by splitting the file into pieces and then scanning each piece. However, it fails somehow.
Maybe I should put the code here. Wait a minute please.
The code is kind of long, 70 lines or so. I have put it on my GitHub; if you are interested, please help. https://github.com/walkerlala/searchText
Note that the only relevant files are wordCount.c and testfile.txt. wordCount.c goes like this:
#include<stdio.h>
#include<stdlib.h>
#include<stdbool.h>
char arr[51];
int flag=0;
int flag2=0;
int flag3=0;
int flag4=0;
int pieceCount(FILE*);
int main()
{
//the file in which I want to search the word is testfile.txt
//I have formatted the file so that it contain no newlins any more
FILE* fs=fopen("testfile.txt","r");
int n=pieceCount(fs);
printf("%d\n",n);
rewind(fs); //refresh the file...
static bool endOfPiece1=false,endOfPiece2=false,endOfPiece3=false;
bool begOfPiece1,begOfPiece2,begOfPiece3;
for(int start=0;start<n;++start){
fgets(arr,sizeof(arr),fs);
for(int i=0;i<=46;++i){
if((arr[i]=='M'||arr[i]=='m')&&(arr[i+1]=='A'||arr[i+1]=='a')&&(arr[i+2]=='K'||arr[i+2]=='k')&&(arr[i+3]=='E'||arr[i+3]=='e')){
flag+=1;
//continue;
}
}
//check the border
begOfPiece1=((arr[1]=='e'||arr[1]=='E'));
if(begOfPiece1==true&&endOfPiece1==true)
flag2+=1;
endOfPiece1=((arr[47]=='m'||arr[47]=='M')&&(arr[48]=='a'||arr[48]=='A')&&(arr[49]=='k'||arr[49]=='K'));
begOfPiece2=((arr[1]=='k'||arr[1]=='K')&&(arr[2]=='e'||arr[2]=='E'));
if(begOfPiece2==true&&endOfPiece2==true)
flag3+=1;
endOfPiece2=((arr[48]=='m'||arr[48]=='M')&&(arr[49]=='a'||arr[49]=='A'));
begOfPiece3=((arr[1]=='a'||arr[1]=='A')&&(arr[2]=='k'||arr[2]=='K')&&(arr[3]=='e'||arr[3]=='E'));
if(begOfPiece3==true&&endOfPiece3==true)
flag4+=1;
endOfPiece3=(arr[49]=='m'||arr[49]=='M');
}
printf("%d\n%d\n%d\n%d\n",flag,flag2,flag3,flag4);
getchar();
return 0;
}
//the function counts how many pieces have I split the file into
int pieceCount(FILE* file){
static int count=0;
char arr2[51]={'\0'};
while(fgets(arr2,sizeof(arr),file)){
count+=1;
continue;
}
return count;
}

You can do this quite simply just by having a rolling buffer. You don't need to break the file into sections.
#include <stdio.h>
#include <string.h>
int main(void) {
char buff [4]; // word buffer
int count = 0; // occurrences
FILE* fs=fopen("test.txt","r"); // open the file
if (fs != NULL) { // if the file opened
if (4 == fread(buff, 1, 4, fs)) { // fill the buffer
do { // if it worked
if (strnicmp(buff, "make", 4) == 0) // check for target word
count++; // tally
memmove(buff, buff+1, 3); // shift the buffer down
} while (1 == fread(buff+3, 1, 1, fs)); // fill the last position
} // end of file
fclose(fs); // close the file
}
printf("%d\n", count); // report the result
return 0;
}
For simplicity I stopped short of making the search word "softer" and allocating the correct buffer and various sizes, since that wasn't in the question. And I have to leave something for OP to do.
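As a rough illustration of the step left to the OP, here is a minimal sketch of the same rolling-buffer idea with the search word passed in as a parameter. The names count_occurrences, equal_ignore_case and MAX_NEEDLE are hypothetical, and the comparison is written with tolower from standard C because strnicmp is not part of the C standard library (POSIX systems offer strncasecmp instead).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_NEEDLE 64  /* assumed upper bound on the search word length */

/* Case-insensitive comparison of n bytes, using only standard C. */
static int equal_ignore_case(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Counts (possibly overlapping) occurrences of needle in the stream. */
long count_occurrences(FILE *fs, const char *needle)
{
    size_t len = strlen(needle);
    char window[MAX_NEEDLE];
    long count = 0;
    int c;

    assert(len > 0 && len <= MAX_NEEDLE);

    /* Fill the window; a stream shorter than the needle cannot match. */
    if (fread(window, 1, len, fs) != len)
        return 0;

    for (;;) {
        if (equal_ignore_case(window, needle, len))
            count++;
        if ((c = fgetc(fs)) == EOF)
            break;
        memmove(window, window + 1, len - 1);  /* slide by one byte */
        window[len - 1] = (char)c;
    }
    return count;
}
```

Like the code above, this counts substrings, so "remake" contributes one match for "make". A text editor set to whole-word search would report a different number.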

Related

unsorted double linked list corrupted Aborted (core dumped)

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include <time.h>
typedef char string [50];
typedef struct adresse{
int zip_code;
string city;
}
adresse Fr[27381996];// I will upload data to this table from the file
int main (){
int i=-1, line_nb=1;
int j=1;
double timing=0.0;//This is for calculating the timing
//These upcoming lines are for opening the file
FILE * France_adr;
France_adr=fopen("france.csv","r");
char adr[300];
for(line_nb=1;line_nb<10;line_nb++)
{
time_t begin= time(NULL);//Here I start the timing
while((!feof(France_adr))&& (i<line_nb))
{
fgets(adr,300,France_adr);
char* s = strdup(adr);
char* val = strsep(&s,","); /* This is because in the file
there are some data that I
don't want to use and they are separated with a comma*/
while(val!=NULL)
{
val=strsep(&s,",");
// This is also to sort the data I want to get into my table
if(j==2 && i!=-1)
{
Fr[i].zip_code=atoi(val);
printf("%d | ",Fr[i].zip_code);
}
j++;
}
i++;
j=1;
printf("\n");
}
fclose(France_adr);
printf("\n\n");
time_t end = time(NULL);
duree += (double)(1000*(end-begin));
//This section is for writing the timing and number of lines into a new file
FILE * donnee_t;
donnee_t=fopen("Affichage_Donnee_Courbe.csv","a");
fprintf(donnee_t,"\n %d,%f",i,duree);
fclose(donnee_t);
i=0;
}
return 0;
}
I am working on a project where I have to load a huge amount of data from a file. What I have to do is display those lines of data on the terminal, measure how long the display takes, and eventually plot a curve showing how the time evolves with the number of lines displayed (so I write the number of lines displayed and the time it took to another .csv file). Since the file has 27 million lines of data, I thought of using a loop over how many lines I want to display each time, but the terminal shows me this error. I hope I have explained the problem well, and I hope you can help.

How do you open a FILE with the user input and put it into a string in C

So I have to write a program that prompts the user to enter the name of a file, using a pointer to an array created in main, and then open it. On a separate function I have to take a user defined string to a file opened in main and return the number of lines in the file based on how many strings it reads in a loop and returns that value to the caller.
So for my first function this is what I have.
void getFileName(char* array1[MAX_WIDTH])
{
FILE* data;
char userIn[MAX_WIDTH];
printf("Enter filename: ");
fgets(userIn, MAX_WIDTH, stdin);
userIn[strlen(userIn) - 1] = 0;
data = fopen(userIn, "r");
fclose(data);
return;
}
For my second function I have this.
int getLineCount(FILE* data, int max)
{
int i = 0;
char *array1[MAX_WIDTH];
if(data != NULL)
{
while(fgets(*array1, MAX_WIDTH, data) != NULL)
{
i+=1;
}
}
printf("%d", i);
return i;
}
And in my main I have this.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAX_WIDTH 144
void getFileName(char* array1[MAX_WIDTH]);
int getLineCount(FILE* data, int max);
int main(void)
{
char *array1[MAX_WIDTH];
FILE* data = fopen(*array1, "r");
int max;
getFileName(array1);
getLineCount(data, max);
return 0;
}
My text file is this.
larry snedden 123 mocking bird lane
sponge bob 321 bikini bottom beach
mary fleece 978 pasture road
hairy whodunit 456 get out of here now lane
My issue is that every time I run this I get 0 back, and I don't think that's what I'm supposed to get. Also, I have no idea why my second function needs int max, but my teacher said I needed it, so if anyone can explain that, that'd be great. I really don't know what I'm doing wrong. I'll appreciate any help I can get.
There were a number of issues with the posted code. I've fixed the problems with the code and left some comments describing what I did. I do think that this code could benefit by some restructuring and renaming (e.g. array1 doesn't tell you what the purpose of the variable is). The getLineCount() function is broken for lines that exceed MAX_WIDTH and ought to be rewritten to count actual lines, not just calls to fgets.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAX_WIDTH 144
/**
* Gets a handle to the FILE to be processed.
* - Renamed to indicate what the function does
* - removed unnecessary parameter, and added return of FILE*
* - removed the fclose() call
* - added rudimentary error handling.
**/
FILE *getFile()
{
char userIn[MAX_WIDTH+1];
printf("Enter filename: ");
if (fgets(userIn, MAX_WIDTH, stdin) == NULL) { // no input at all
return NULL;
}
userIn[strcspn(userIn, "\n")] = 0; // chop off newline, if there is one.
FILE *data = fopen(userIn, "r");
if (data == NULL) {
perror(userIn);
}
return data;
}
/**
* - removed the unnecessary 'max' parameter
* - removed null check of FILE *, since this is now checked elsewhere.
* - adjusted size of array1 for safety.
**/
int getLineCount(FILE* data)
{
int i = 0;
char array1[MAX_WIDTH+1];
while(fgets(array1, MAX_WIDTH, data) != NULL)
{
i+=1;
}
return i;
}
/**
* - removed unnecessary array1 variable
* - removed fopen of uninitialized char array.
* - added some rudimentary error handling.
*/
int main(void)
{
FILE *data = getFile();
if (data != NULL) {
int lc = getLineCount(data);
fclose(data);
printf("%d\n", lc);
return 0;
}
return 1;
}
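The remark above that getLineCount() ought to count actual lines rather than calls to fgets can be sketched as follows. This is only one possible approach, and the name count_lines is hypothetical: it counts newline characters, so a line longer than MAX_WIDTH is still counted once, and a final line with no trailing newline is counted too.

```c
#include <assert.h>
#include <stdio.h>

/* Counts lines by counting '\n'; an unterminated final line also counts. */
long count_lines(FILE *data)
{
    long lines = 0;
    int c, last = '\n';

    assert(data != NULL);
    while ((c = fgetc(data)) != EOF) {
        if (c == '\n')
            lines++;
        last = c;
    }
    if (last != '\n')
        lines++;   /* the file ended without a newline */
    return lines;
}
```

Reading byte by byte through fgetc is still buffered by the stdio layer, so this should not be noticeably slower than fgets for files of this size.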
There are several things I think you should fix first:
getFileName should help you get the file name (as the name says), so in that function you shouldn't have both array1 and userIn (as a matter of fact, array1 is not even used in the function, so it can be eliminated altogether). The parameter and the file name should be 'the same'.
data is a local FILE pointer, which means that once you exit the function you lose it. My recommendation is to make it global, or to pass it in as an argument from main. Also, do not close it one line after you open it.
I guess getLineCount is fine, but it is usually good practice to return the value and printf it in main.
The max that is passed to the second function may be there to give you the maximum size of a line.
Summing up, your getFileName should return the file name, so userIn is what should be handed back through that parameter. The file should be opened IN THE MAIN FUNCTION, after you get the name of the file, and closed at the end, after everything you do with the file.
Hopefully it helps you! Keep us tuned with your progress.
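A minimal sketch of the arrangement described above, assuming getFileName is allowed to take the caller's buffer and its size. The extra FILE *in parameter is a hypothetical addition that makes the function usable with something other than stdin; it is not part of the original assignment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_WIDTH 144

/* Reads one line from in into name and strips the newline.
   Returns 1 if a non-empty name was read, 0 otherwise. */
int getFileName(FILE *in, char *name, size_t size)
{
    assert(name != NULL && size > 0);
    if (fgets(name, (int)size, in) == NULL)
        return 0;
    name[strcspn(name, "\n")] = '\0';
    return name[0] != '\0';
}
```

In main this would be called with a plain char name[MAX_WIDTH] buffer, for example getFileName(stdin, name, sizeof name), followed by fopen(name, "r"), the call to getLineCount, and finally fclose.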

Unexpected Output - Storing into 2D array in c

I am reading data from a number of files, each containing a list of words. I am trying to display the number of words in each file, but I am running into issues. For example, when I run my code, I receive the output as shown below.
Almost every amount is correctly displayed with the exception of two files, each containing word counts in the thousands. Every other file only has three digits worth of words, and they seem just fine.
I can only guess what this problem could be (not enough space allocated somewhere?) and I do not know how to solve it. I apologize if this is all poorly worded. My brain is fried and I am struggling. Any help would be appreciated.
I've tried to keep my example code as brief as possible. I've cut out a lot of error checking and other tasks related to the full program. I've also added comments where I can. Thanks.
StopWords.c
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>
typedef struct
{
char stopwords[2000][60];
int wordcount;
} LangData;
typedef struct
{
int languageCount;
LangData languages[];
} AllData;
main(int argc, char **argv)
{
//Initialize data structures and open path directory
int langCount = 0;
DIR *d;
struct dirent *ep;
d = opendir(argv[1]);
//Count the number of language files in the directory
while(readdir(d))
langCount++;
//Account for "." and ".." in directory
//langCount = langCount - 2 THIS MAKES SENSE RIGHT?
langCount = langCount + 1; //The program crashes if I don't do this, which doesn't make sense to me.
//Allocate space in AllData for languageCount
AllData *data = malloc(sizeof(AllData) + sizeof(LangData)*langCount); //Unsure? Seems to work.
//Reset the directory in preparation for reading data
rewinddir(d);
//Copy all words into respective arrays.
char word[60];
int i = 0;
int k = 0;
int j = 0;
while((ep = readdir(d)) != NULL) //Probably could've used for loops to make this cleaner. Oh well.
{
if (!strcmp(ep->d_name, ".") || !strcmp(ep->d_name, ".."))
{
//Filtering "." and ".."
}
else
{
FILE *entry;
//Get string for path (i should make this a function)
char fullpath[100];
strcpy(fullpath, path);
strcat(fullpath, "\\");
strcat(fullpath, ep->d_name);
entry = fopen(fullpath, "r");
//Read all words from file
while(fgets(word, 60, entry) != NULL)
{
j = 0;
//Store each word one character at a time (better way?)
while(word[j] != '\0') //Check for end of word
{
data->languages[i].stopwords[k][j] = word[j];
j++; //Move onto next character
}
k++; //Move onto next word
data->languages[i].wordcount++;
}
//Display number of words in file
printf("%d\n", data->languages[i].wordcount);
i++; //Increment index in preparation for next language file.
fclose(entry);
}
}
}
Output
256 //czech.txt: Correct
101 //danish.txt: Correct
101 //dutch.txt: Correct
547 //english.txt: Correct
1835363006 //finnish.txt: Should be 1337. Of course it's 1337.
436 //french.txt: Correct
576 //german.txt: Correct
737 //hungarian.txt: Correct
683853 //icelandic.txt: Should be 1000.
399 //italian.txt: Correct
172 //norwegian.txt: Correct
269 //polish.txt: Correct
437 //portugese.txt: Correct
282 //romanian.txt: Correct
472 //spanish.txt: Correct
386 //swedish.txt: Correct
209 //turkish.txt: Correct
Do the files have more than 2000 words? You have only allocated space for 2000 words, so once your program tries to copy word 2001 it will write outside the memory allocated for that array, possibly into the space allocated for "wordcount".
I also want to point out that fgets returns a string up to the end of the line or at most n-1 characters (59 in your case), whichever comes first. This will work fine if there is only one word per line in the files you are reading; otherwise you will have to locate the spaces within the string and count words from there.
If you are simply trying to get a word count, there is no need to store all the words in an array in the first place. Assuming one word per line, the following should work just as well:
char word[60];
while(fgets(word, 60, entry) != NULL)
{
data->languages[i].wordcount++;
}
fgets reference- http://www.cplusplus.com/reference/cstdio/
Update
I took another look and you might want to try allocating data as follows:
typedef struct
{
char stopwords[2000][60];
int wordcount;
} LangData;
typedef struct
{
int languageCount;
LangData *languages;
} AllData;
AllData *data = malloc(sizeof(AllData));
data->languages = malloc(sizeof(LangData)*langCount);
This way memory is being specifically allocated for the languages array.
I agree that langCount = langCount - 2 makes sense. What error are you getting?
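To illustrate the out-of-bounds concern, here is a minimal sketch of a per-file loader that cannot write past the 2000 slots. It rests on two observations about the posted code: the index k is declared outside the file loop and never reset, and wordcount is never set to zero after malloc. Whether these fully explain the output is an assumption. The name load_words and the MAX_WORDS and MAX_LEN macros are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 2000
#define MAX_LEN   60

typedef struct {
    char stopwords[MAX_WORDS][MAX_LEN];
    int wordcount;
} LangData;

/* Reads one word per line into lang and returns how many were stored. */
int load_words(FILE *entry, LangData *lang)
{
    char word[MAX_LEN];

    assert(entry != NULL && lang != NULL);
    lang->wordcount = 0;                       /* start from a known value */
    while (lang->wordcount < MAX_WORDS &&
           fgets(word, sizeof word, entry) != NULL) {
        word[strcspn(word, "\n")] = '\0';      /* drop the newline */
        strcpy(lang->stopwords[lang->wordcount], word);  /* copies the '\0' too */
        lang->wordcount++;
    }
    return lang->wordcount;
}
```

Because the word index is the structure's own wordcount, it starts at zero for every file and stops at MAX_WORDS. A line longer than 59 characters would be split across several entries, so this still assumes short, one-word lines.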

Frequency of each word of text file. Error while allocating memory?

Good evening everyone!
I have started messing around with strings and pointers in C.
I want to write a program that reads a text file, then calculates the frequency of each word and prints it.
My variables are:
FILE *fp;
char *words[N] //N defined 100
int i=0, y=0;
int *freq;
int freq_count=0;;
int word_number=0;
The code part:
for(i=0;i<word_counter;i++){
while(y<word_counter){
if(strcmp(words[i],words[y]==0){
freq1++;
} y++;
}
if(i==0){
freq=(int*)malloc(sizeof(int));
strcpy(freq, freq1); freq1=0;
}
else{
freq=(int*)realloc(freq, (i+1)*sizeof(int));
strcpy(freq, freq1); freq1=0;
}
y=0;
}
I get several errors running this...What is wrong?
Take into consideration that in words[N] i have put each word of the text by itself in each cell.
Thank you all in advance.
Maybe another fixed-size array is not what you want, but it is still better than calling realloc and testing a condition inside the loop.
int freq[N];
for(i=0;i<word_counter;i++){
freq1 = 0;
for(y=0;y<word_counter;y++){
if(strcmp(words[i],words[y])==0)
freq1++;
}
freq[i] = freq1;
}
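For reference, the same double loop can be packaged as a self-contained function. This is a sketch under the assumption that every words[i] points to a valid, null-terminated string; the name count_frequencies is hypothetical. It also shows why the strcpy(freq, freq1) calls in the question cannot work: strcpy copies character strings, whereas a count is stored with a plain assignment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* For each i, stores in freq[i] how many entries of words[] equal words[i]. */
void count_frequencies(char *const words[], size_t n, int freq[])
{
    assert(words != NULL && freq != NULL);
    for (size_t i = 0; i < n; i++) {
        int matches = 0;
        for (size_t y = 0; y < n; y++) {
            if (strcmp(words[i], words[y]) == 0)
                matches++;
        }
        freq[i] = matches;   /* plain assignment; strcpy is for strings */
    }
}
```

Every occurrence of a repeated word gets the same count, so a word that appears three times is reported three times unless duplicates are skipped when printing.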

creating a bitmap image from existing bitmap, in C

I am writing a C program, which will retrieve the information (header information, pixel information) from a bitmap image, and use that information to create another bitmap image (the new image will obviously be same as the original).
The problem is that, in some cases, extra bytes get added (on their own) to the new image, due to which the image is not formed properly.
In another case, some bytes get missing in the new image, due to which image formation itself fails.
(This happens while writing the pixel information. the bitmap header information gets written properly to the new file.)
I have debugged the code but I couldn't find out what is causing this.
I'll be glad if somebody could tell me what the error is.
//creating a bitmap file
#include<stdio.h>
#include<conio.h>
#include<stdlib.h>
#include<math.h>
long extract(FILE *,long ,int );
long extract(FILE *fp1,long offset,int size)
{
unsigned char *ptr;
unsigned char temp='0';
long value=0L;
int i;
//to initialize the ptr
ptr=&temp;
//sets the file pointer at specific position i.e. after the offset
fseek(fp1,offset,SEEK_SET);
//now fgetcing (size) values starting from the offset
for(i=1;i<=size;i++)
{
fread(ptr,sizeof(char),1,fp1);
value=(long)(value+(*ptr)*(pow(256,(i-1)))); //combining the values one after another in a single variable
}
return value;
}
int main()
{
int row,col;
int i,j,k;
int dataoffset,offset;
char magicnum[2];
FILE *fp1,*fp4;
clrscr();
if((fp1=fopen("stripes.bmp","rb"))==NULL)
{
printf("\a\nCant open the image.\nSystem is exiting.");
exit(0);
}
if((fp4=fopen("op.bmp","a"))==NULL)
{
printf("\n\aError while creating a file.\nSystem is exiting ..... ");
exit(0);
}
fputc((int)extract(fp1,0L,1),fp4);
fputc((int)extract(fp1,1L,1),fp4);
fputc((int)extract(fp1,2L,1),fp4);
fputc((int)extract(fp1,3L,1),fp4);
fputc((int)extract(fp1,4L,1),fp4);
fputc((int)extract(fp1,5L,1),fp4);
fputc((int)extract(fp1,6L,1),fp4);
fputc((int)extract(fp1,7L,1),fp4);
fputc((int)extract(fp1,8L,1),fp4);
fputc((int)extract(fp1,9L,1),fp4);
fputc((int)extract(fp1,10L,1),fp4);
fputc((int)extract(fp1,11L,1),fp4);
fputc((int)extract(fp1,12L,1),fp4);
fputc((int)extract(fp1,13L,1),fp4);
fputc((int)extract(fp1,14L,1),fp4);
fputc((int)extract(fp1,15L,1),fp4);
fputc((int)extract(fp1,16L,1),fp4);
fputc((int)extract(fp1,17L,1),fp4);
fputc((int)extract(fp1,18L,1),fp4);
fputc((int)extract(fp1,19L,1),fp4);
fputc((int)extract(fp1,20L,1),fp4);
fputc((int)extract(fp1,21L,1),fp4);
fputc((int)extract(fp1,22L,1),fp4);
fputc((int)extract(fp1,23L,1),fp4);
fputc((int)extract(fp1,24L,1),fp4);
fputc((int)extract(fp1,25L,1),fp4);
fputc((int)extract(fp1,26L,1),fp4);
fputc((int)extract(fp1,27L,1),fp4);
fputc((int)extract(fp1,28L,1),fp4);
fputc((int)extract(fp1,29L,1),fp4);
fputc((int)extract(fp1,30L,1),fp4);
fputc((int)extract(fp1,31L,1),fp4);
fputc((int)extract(fp1,32L,1),fp4);
fputc((int)extract(fp1,33L,1),fp4);
fputc((int)extract(fp1,34L,1),fp4);
fputc((int)extract(fp1,35L,1),fp4);
fputc((int)extract(fp1,36L,1),fp4);
fputc((int)extract(fp1,37L,1),fp4);
fputc((int)extract(fp1,38L,1),fp4);
fputc((int)extract(fp1,39L,1),fp4);
fputc((int)extract(fp1,40L,1),fp4);
fputc((int)extract(fp1,41L,1),fp4);
fputc((int)extract(fp1,42L,1),fp4);
fputc((int)extract(fp1,43L,1),fp4);
fputc((int)extract(fp1,44L,1),fp4);
fputc((int)extract(fp1,45L,1),fp4);
fputc((int)extract(fp1,46L,1),fp4);
fputc((int)extract(fp1,47L,1),fp4);
fputc((int)extract(fp1,48L,1),fp4);
fputc((int)extract(fp1,49L,1),fp4);
fputc((int)extract(fp1,50L,1),fp4);
fputc((int)extract(fp1,51L,1),fp4);
fputc((int)extract(fp1,52L,1),fp4);
fputc((int)extract(fp1,53L,1),fp4);
//setting the file pointer at the beginning
rewind(fp1);
/*CHECKING WHETHER THE FILE IS IN BMP FORMAT OR NOT, WE CHECK THE MAGIC NUMBER OF THE FILE, MAGIC NUMBER'S OFFSET IS 0 i.e. IT'S STORED AT THE FRONT OF THE IMAGE, AND THE SIZE IS 2*/
//at first extracting the magic number
for(i=0;i<2;i++)
{
magicnum[i]=(char)extract(fp1,i,1);
}
//now checking
if((magicnum[0]=='B') && (magicnum[1]=='M'))
;
else
{
printf("\aThe image is not a bitmap image.\nSystem is exiting ... ");
exit(0);
}
//storing the header information
//get the starting position or offset of the data(pixel)
dataoffset=(int)extract(fp1,10,4);
//get the number of rows
row=(int)extract(fp1,22,4);
//get the number of columns
col=(int)extract(fp1,18,4);
//storing the data
offset=dataoffset;
for(j=0;j<col;j++)
{
for(k=0;k<row;k++)
{
for(i=0;i<=2;i++)
{
fputc((int)extract(fp1,offset++,1),fp4);
}
}
}
fcloseall();
return 0;
}
Make sure you open the output file in binary mode as well.
If you don't do that, the byte value corresponding to '\n' may be expanded to carriage return and line feed.
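As a minimal sketch of what binary mode buys you, the function below copies a stream byte for byte. It assumes the caller opened the input with "rb" and the output with "wb"; note that the "a" mode in the question appends to any op.bmp left over from an earlier run, which would also add extra bytes. The name copy_stream is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Copies every byte from in to out; returns bytes copied or -1 on error. */
long copy_stream(FILE *in, FILE *out)
{
    unsigned char buf[4096];
    size_t got;
    long total = 0;

    assert(in != NULL && out != NULL);
    while ((got = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, got, out) != got)
            return -1;
        total += (long)got;
    }
    return ferror(in) ? -1 : total;
}
```

Copying up to end of file also avoids computing the size of the pixel data from the header, which is easy to get wrong because BMP rows are padded to a multiple of four bytes.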
Consider this line:
value=(long)(value+(*ptr)*(pow(256,(i-1))));
pow is a floating-point function returning a double. This means that (*ptr) is implicitly converted to double, and the whole expression (value+(*ptr)*(pow(256,(i-1)))) is a double. It can therefore be larger than 2147483647, the largest value a long can hold on platforms where long is 32 bits, and converting an out-of-range double to long is undefined behavior. See what happens in this example:
#include <stdio.h>
int main(int argc, char **argv) {
int i;
for (i = 0; i < 10; i++) {
double d = 2147483647.0 + i;
printf("double=%f long=%ld\n", d, (long)d);
}
return 0;
}
Here is the output when I run it on my system:
double=2147483647.000000 long=2147483647
double=2147483648.000000 long=-2147483648
double=2147483649.000000 long=-2147483648
double=2147483650.000000 long=-2147483648
double=2147483651.000000 long=-2147483648
double=2147483652.000000 long=-2147483648
double=2147483653.000000 long=-2147483648
double=2147483654.000000 long=-2147483648
double=2147483655.000000 long=-2147483648
double=2147483656.000000 long=-2147483648
One way to fix it would be to change it to unsigned long instead.
Personally I'd use 1 << (8*(i-1)) instead of pow to avoid messing with floating point. There are lots of other things I'd do very differently too, but that is probably out of scope for this question (it might be a question for the code review site).
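A minimal sketch of the shift-based variant, assuming the fields are little-endian (as BMP header fields are) and at most four bytes wide. The name extract_le is hypothetical. The byte is converted to unsigned long before shifting, because a plain int expression such as 255 * (1 << 24) would itself overflow where int is 32 bits.

```c
#include <assert.h>
#include <stdio.h>

/* Reads size bytes (1..4) at offset as a little-endian unsigned value. */
unsigned long extract_le(FILE *fp, long offset, int size)
{
    unsigned long value = 0;

    assert(fp != NULL && size >= 1 && size <= 4);
    if (fseek(fp, offset, SEEK_SET) != 0)
        return 0;
    for (int i = 0; i < size; i++) {
        int byte = fgetc(fp);
        if (byte == EOF)
            break;                               /* short read: keep what we have */
        value |= (unsigned long)byte << (8 * i); /* byte i is worth 256^i */
    }
    return value;
}
```

Returning 0 on a failed seek is a simplification; a caller that needs to tell a real zero from an error would want a separate status value.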

Resources