Unexpected Output - Storing into 2D array in C

I am reading data from a number of files, each containing a list of words. I am trying to display the number of words in each file, but I am running into issues. For example, when I run my code, I receive the output as shown below.
Almost every count is displayed correctly, with the exception of two files whose word counts are in the thousands. Every other file has only a three-digit word count, and those seem just fine.
I can only guess what this problem could be (not enough space allocated somewhere?) and I do not know how to solve it. I apologize if this is all poorly worded. My brain is fried and I am struggling. Any help would be appreciated.
I've tried to keep my example code as brief as possible. I've cut out a lot of error checking and other tasks related to the full program. I've also added comments where I can. Thanks.
StopWords.c
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>
typedef struct
{
char stopwords[2000][60];
int wordcount;
} LangData;
typedef struct
{
int languageCount;
LangData languages[];
} AllData;
int main(int argc, char **argv)
{
//Initialize data structures and open path directory
int langCount = 0;
DIR *d;
struct dirent *ep;
d = opendir(argv[1]);
//Count the number of language files in the directory
while(readdir(d))
langCount++;
//Account for "." and ".." in directory
//langCount = langCount - 2 THIS MAKES SENSE RIGHT?
langCount = langCount + 1; //The program crashes if I don't do this, which doesn't make sense to me.
//Allocate space in AllData for languageCount
AllData *data = malloc(sizeof(AllData) + sizeof(LangData)*langCount); //Unsure? Seems to work.
//Reset the directory in preparation for reading data
rewinddir(d);
//Copy all words into respective arrays.
char word[60];
int i = 0;
int k = 0;
int j = 0;
while((ep = readdir(d)) != NULL) //Probably could've used for loops to make this cleaner. Oh well.
{
if (!strcmp(ep->d_name, ".") || !strcmp(ep->d_name, ".."))
{
//Filtering "." and ".."
}
else
{
FILE *entry;
//Get string for path (i should make this a function)
char fullpath[100];
strcpy(fullpath, argv[1]);
strcat(fullpath, "\\");
strcat(fullpath, ep->d_name);
entry = fopen(fullpath, "r");
//Read all words from file
while(fgets(word, 60, entry) != NULL)
{
j = 0;
//Store each word one character at a time (better way?)
while(word[j] != '\0') //Check for end of word
{
data->languages[i].stopwords[k][j] = word[j];
j++; //Move onto next character
}
k++; //Move onto next word
data->languages[i].wordcount++;
}
//Display number of words in file
printf("%d\n", data->languages[i].wordcount);
i++; //Increment index in preparation for next language file.
fclose(entry);
}
}
}
Output
256 //czech.txt: Correct
101 //danish.txt: Correct
101 //dutch.txt: Correct
547 //english.txt: Correct
1835363006 //finnish.txt: Should be 1337. Of course it's 1337.
436 //french.txt: Correct
576 //german.txt: Correct
737 //hungarian.txt: Correct
683853 //icelandic.txt: Should be 1000.
399 //italian.txt: Correct
172 //norwegian.txt: Correct
269 //polish.txt: Correct
437 //portugese.txt: Correct
282 //romanian.txt: Correct
472 //spanish.txt: Correct
386 //swedish.txt: Correct
209 //turkish.txt: Correct

Do the files have more than 2000 words? You have only allocated space for 2000 words so once your program tries to copy over word 2001 it will be doing it outside of the memory allocated for that array, possibly into the space allocated for "wordcount".
Also I want to point out that fgets returns a string up to the end of the line or at most n-1 characters (59 in your case), whichever comes first. This will work fine if there is only one word per line in the files you are reading from; otherwise you will have to locate the spaces within the string and count words from there.
If you are simply trying to get a word count, then there is no need to store all the words in an array in the first place. Assuming one word per line, the following should work just as well:
char word[60];
while(fgets(word, 60, entry) != NULL)
{
data->languages[i].wordcount++;
}
fgets reference- http://www.cplusplus.com/reference/cstdio/
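If the files can have more than one word per line, a rough sketch of a helper (my own assumption, not part of the original question) that tokenizes each line on whitespace could be used instead. It assumes no single word exceeds 255 characters:
#include <stdio.h>
#include <string.h>
/* Hypothetical helper: counts whitespace-separated words in an open file. */
static int count_words(FILE *entry)
{
    char line[256];
    int count = 0;
    while (fgets(line, sizeof line, entry) != NULL)
    {
        /* strtok splits the line on spaces, tabs and newlines */
        for (char *tok = strtok(line, " \t\n"); tok != NULL; tok = strtok(NULL, " \t\n"))
            count++;
    }
    return count;
}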
Update
I took another look and you might want to try allocating data as follows:
typedef struct
{
char stopwords[2000][60];
int wordcount;
} LangData;
typedef struct
{
int languageCount;
LangData *languages;
} AllData;
AllData *data = malloc(sizeof(AllData));
data->languages = malloc(sizeof(LangData)*langCount);
This way memory is being specifically allocated for the languages array.
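With this two-step approach, remember that both blocks also need to be released separately when you are done (a small sketch):
free(data->languages);
free(data);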
I agree that langCount = langCount - 2 makes sense. What error are you getting?

Related

How do you open a FILE with the user input and put it into a string in C

So I have to write a program that prompts the user to enter the name of a file, using a pointer to an array created in main, and then opens it. In a separate function I have to take the file opened in main and return the number of lines in it, based on how many strings it reads in a loop, and return that value to the caller.
So for my first function this is what I have.
void getFileName(char* array1[MAX_WIDTH])
{
FILE* data;
char userIn[MAX_WIDTH];
printf("Enter filename: ");
fgets(userIn, MAX_WIDTH, stdin);
userIn[strlen(userIn) - 1] = 0;
data = fopen(userIn, "r");
fclose(data);
return;
}
For my second function I have this.
int getLineCount(FILE* data, int max)
{
int i = 0;
char *array1[MAX_WIDTH];
if(data != NULL)
{
while(fgets(*array1, MAX_WIDTH, data) != NULL)
{
i+=1;
}
}
printf("%d", i);
return i;
}
And in my main I have this.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAX_WIDTH 144
void getFileName(char* array1[MAX_WIDTH]);
int getLineCount(FILE* data, int max);
int main(void)
{
char *array1[MAX_WIDTH];
FILE* data = fopen(*array1, "r");
int max;
getFileName(array1);
getLineCount(data, max);
return 0;
}
My text file is this.
larry snedden 123 mocking bird lane
sponge bob 321 bikini bottom beach
mary fleece 978 pasture road
hairy whodunit 456 get out of here now lane
My issue is that every time I run this I keep getting a 0 in return, and I don't think that's what I'm supposed to be getting back. Also, in my second function I have no idea why I need int max in there, but my teacher said I needed it, so if anyone can explain that, that'd be great. I really don't know what I'm doing wrong. I'll appreciate any help I can get.
There were a number of issues with the posted code. I've fixed the problems with the code and left some comments describing what I did. I do think that this code could benefit by some restructuring and renaming (e.g. array1 doesn't tell you what the purpose of the variable is). The getLineCount() function is broken for lines that exceed MAX_WIDTH and ought to be rewritten to count actual lines, not just calls to fgets.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAX_WIDTH 144
/**
* Gets a handle to the FILE to be processed.
* - Renamed to indicate what the function does
* - removed unnecessary parameter, and added return of FILE*
* - removed the fclose() call
* - added rudimentary error handling.
**/
FILE *getFile()
{
char userIn[MAX_WIDTH+1];
printf("Enter filename: ");
fgets(userIn, MAX_WIDTH, stdin);
userIn[strlen(userIn) - 1] = 0; // chop off newline.
FILE *data = fopen(userIn, "r");
if (data == NULL) {
perror(userIn);
}
return data;
}
/**
* - removed the unnecessary 'max' parameter
* - removed null check of FILE *, since this is now checked elsewhere.
* - adjusted size of array1 for safety.
**/
int getLineCount(FILE* data)
{
int i = 0;
char array1[MAX_WIDTH+1];
while(fgets(array1, MAX_WIDTH, data) != NULL)
{
i+=1;
}
return i;
}
/**
* - removed unnecessary array1 variable
* - removed fopen of uninitialized char array.
* - added some rudimentary error handling.
*/
int main(void)
{
FILE *data = getFile();
if (data != NULL) {
int lc = getLineCount(data);
fclose(data);
printf("%d\n", lc);
return 0;
}
return 1;
}
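As noted above, counting fgets calls over-counts lines longer than MAX_WIDTH. A possible rewrite (a sketch, not a requirement of the assignment) counts newline characters instead, so long lines are still counted once and a final line without a trailing newline is also counted:
int getLineCount(FILE *data)
{
    int lines = 0;
    int c, last = '\n';
    while ((c = fgetc(data)) != EOF) {
        if (c == '\n')
            lines++;            /* one newline = one completed line */
        last = c;
    }
    if (last != '\n')
        lines++;                /* count a trailing partial line */
    return lines;
}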
There are several things I think you should repair at first:
getFileName should help you get the file name (as its name says), so in that function you shouldn't have both array1 and userIn (in fact array1 is not even used in the function, so it can be eliminated altogether). The parameter and the file name should be the same thing.
data is a local FILE pointer, which means you lose it once you exit the function. My recommendation is to make it global, or pass it as an argument from main. Also, do not close it one line after you open it.
I guess getLineCount is fine, but it is usually good practice to return the value and print it in main.
The max that is passed to the second function is probably meant to help you with the maximum size of a line.
Summing up: getFileName should return the file name, so userIn is what that parameter should receive. The file should be opened IN THE MAIN FUNCTION and closed after everything you do related to the file, so at the end. Also, open the file only after you get its name. A rough sketch of that structure follows below.
Hopefully this helps! Keep us posted on your progress.
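A minimal sketch of that structure (name obtained first, file opened and closed in main, the max parameter kept only because the assignment requires it) might look like this; the details are my own assumptions, not a definitive solution:
#include <stdio.h>
#include <string.h>
#define MAX_WIDTH 144
/* fills 'name' with the user-supplied file name, newline stripped */
void getFileName(char name[MAX_WIDTH])
{
    printf("Enter filename: ");
    if (fgets(name, MAX_WIDTH, stdin) != NULL)
        name[strcspn(name, "\n")] = '\0';
}
int getLineCount(FILE *data, int max)
{
    char line[MAX_WIDTH];
    int count = 0;
    (void)max; /* required by the assignment; presumably the maximum line size */
    while (fgets(line, sizeof line, data) != NULL)
        count++;
    return count;
}
int main(void)
{
    char name[MAX_WIDTH];
    getFileName(name);
    FILE *data = fopen(name, "r");   /* opened in main, as suggested */
    if (data == NULL) {
        perror(name);
        return 1;
    }
    printf("%d\n", getLineCount(data, MAX_WIDTH));
    fclose(data);                    /* closed in main, after all use of the file */
    return 0;
}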

Two-dimensional char array too large exit code 139

Hey guys, I'm attempting to read in workersinfo.txt and store it into a two-dimensional char array. The file is around 4,000,000 lines with around 100 characters per line. I want to store each file line in the array. Unfortunately, I get exit code 139 (a segmentation fault; I assume there is not enough memory). I'm aware I have to use malloc() and free(), but I've tried a couple of things and I haven't been able to make them work. Eventually I have to sort the array by ID number, but I'm stuck on declaring the array.
The file looks something like this:
First Name, Last Name,Age, ID
Carlos,Lopez,,10568
Brad, Patterson,,20586
Zack, Morris,42,05689
This is my code so far:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
FILE *ptr_file;
char workers[4000000][1000];
ptr_file =fopen("workersinfo.txt","r");
if (!ptr_file)
perror("Error");
int i = 0;
while (fgets(workers[i],1000, ptr_file)!=NULL){
i++;
}
int n;
for(n = 0; n < 4000000; n++)
{
printf("%s", workers[n]);
}
fclose(ptr_file);
return 0;
}
The stack memory is limited. As you pointed out in your question, you MUST use malloc to allocate such a big (need I say HUGE) chunk of memory, as the stack cannot contain it.
You can use ulimit to review your system's limits (usually including the stack size limit).
On my Mac the limit is 8MB. After running ulimit -a I get:
...
stack size (kbytes, -s) 8192
...
Or, test the limit in code with getrlimit (declared in <sys/resource.h>):
struct rlimit rlim;
getrlimit(RLIMIT_STACK, &rlim);
// rlim.rlim_cur is the current (soft) stack size limit
I truly recommend you process each database entry separately.
As mentioned in the comments, assigning the memory as static memory would, in most implementations, circumvent the stack.
Still, IMHO, allocating 400MB of memory (or 4GB, depending on which part of your question I look at) is bad form unless totally required - especially for a single function.
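For reference, here is a minimal sketch of the heap-allocated variant (file name and sizes taken from the question; the static-storage alternative would simply declare the array at file scope instead). Whether an allocation this large succeeds depends on the machine:
#include <stdio.h>
#include <stdlib.h>
#define MAX_LINES 4000000
#define LINE_LEN  1000
int main(void)
{
    /* allocate the big 2D array on the heap instead of the stack */
    char (*workers)[LINE_LEN] = malloc(sizeof *workers * MAX_LINES);
    if (workers == NULL) {
        perror("malloc");
        return 1;
    }
    FILE *ptr_file = fopen("workersinfo.txt", "r");
    if (!ptr_file) {
        perror("workersinfo.txt");
        free(workers);
        return 1;
    }
    int i = 0;
    while (i < MAX_LINES && fgets(workers[i], LINE_LEN, ptr_file) != NULL)
        i++;
    for (int n = 0; n < i; n++)      /* print only the lines actually read */
        printf("%s", workers[n]);
    fclose(ptr_file);
    free(workers);
    return 0;
}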
Follow-up Q1: How to deal with each DB entry separately
I hope I'm not doing your homework or anything... but I doubt your homework would include an assignment to load 400Mb of data to the computer's memory... so... to answer the question in your comment:
The following sketch of single-entry processing isn't perfect - it's limited to 1KB of data per entry (which I thought to be more than enough for such simple data).
Also, I didn't allow for UTF-8 encoding or anything like that (I followed the assumption that English would be used).
As you can see from the code, we read each line separately and perform error checks to check that the data is valid.
To sort the file by ID, you might consider either reading two lines at a time and swapping them (this would be a slow sort), or creating a sorted node tree with the ID data and the position of the line in the file (get the position before reading the line). Once you have built the sorted tree, you can write the data out in order...
... The binary tree might get a bit big. Did you look up sorting algorithms?
#include <stdio.h>
// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
char* last_name; // a pointer to the last name
char* age; // a pointer to the age - could probably be an int
char* id; // a pointer to the ID
char first_name[1024]; // the actual buffer...
// I unified the first name and the buffer since the first name is first.
};
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 1 on success or 0 on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
if (!fgets(line->first_name, 1024, fp))
return 0;
// parse data and review for possible overflow.
// first, zero out data
int pos = 0;
line->age = NULL;
line->id = NULL;
line->last_name = NULL;
// read each byte, looking for the EOL marker and the ',' separators
while (pos < 1024) {
if (line->first_name[pos] == ',') {
// we encountered a divider. we should handle it.
// if the ID field's location is already known, we have an excess comma.
if (line->id) {
fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
return 0;
}
// replace the comma with 0 (separate the strings)
line->first_name[pos] = 0;
if (line->age)
line->id = line->first_name + pos + 1;
else if (line->last_name)
line->age = line->first_name + pos + 1;
else
line->last_name = line->first_name + pos + 1;
} else if (line->first_name[pos] == '\n') {
// we encountered a terminator. we should handle it.
if (line->id) {
// if we have the id string's position (the start marker), this is a
// valid entry and we should process the data.
line->first_name[pos] = 0;
return 1;
} else {
// we reached an EOL without enough ',' separators, this is an invalid
// line.
fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
return 0;
}
}
pos++;
}
// we ran through all the data but there was no EOL marker...
fprintf(stderr,
"Parsing error, invalid data (data overflow or data too large).\n");
return 0;
}
// the main program
int main(int argc, char const* argv[]) {
// open file
FILE* ptr_file;
ptr_file = fopen("workersinfo.txt", "r");
if (!ptr_file) {
perror("File Error");
return 1;
}
struct DBEntry line;
while (read_db_line(ptr_file, &line)) {
// do what you want with the data... print it?
printf(
"First name:\t%s\n"
"Last name:\t%s\n"
"Age:\t\t%s\n"
"ID:\t\t%s\n"
"--------\n",
line.first_name, line.last_name, line.age, line.id);
}
// close file
fclose(ptr_file);
return 0;
}
Follow-up Q2: Sorting an array of 400MB-4GB of data
IMHO, 400MB is already touching on the issues related to big data. For example, running a bubble sort on your database would be agonizing as far as performance goes (unless it's a one-time task, where performance might not matter).
Creating an array of DBEntry objects will eventually give you a larger memory footprint than the actual data...
This will not be the optimal way to sort large data.
The correct approach will depend on your sorting algorithm. Wikipedia has a decent primer on sorting algorithms.
Since we are handling a large amount of data, there are a few things to consider:
It would make sense to partition the work, so different threads/processes sort a different section of the data.
We will need to minimize IO to the hard drive (as it will slow the sorting significantly and prevent parallel processing on the same machine/disk).
One possible approach is to create a heap for a heap sort, but only storing a priority value and storing the original position in the file.
Another option would probably be to employ a divide and conquer algorithm, such as quicksort, again, only sorting a computed sort value and the entry's position in the original file.
Either way, writing a decent sorting method will be a complicated task, probably involving threading, forking, tempfiles or other techniques.
Here's a simplified demo code... it is far from optimized, but it demonstrates the idea of the binary sort-tree that holds the sorting value and the position of the data in the file.
Be aware that using this code will be both relatively slow (although not that slow) and memory intensive...
On the other hand, it will require about 24 bytes per entry. For 4 million entries, that's 96MB - somewhat better than 400MB and definitely better than the 4GB.
#include <stdlib.h>
#include <stdio.h>
// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
char* last_name; // a pointer to the last name
char* age; // a pointer to the age - could probably be an int
char* id; // a pointer to the ID
char first_name[1024]; // the actual buffer...
// I unified the first name and the buffer since the first name is first.
};
// this might be a sorting node for a sorted bin-tree:
struct SortNode {
struct SortNode* next; // a pointer to the next node
fpos_t position; // the DB entry's position in the file
long value; // The computed sorting value
}* top_sorting_node = NULL;
// this function will free all the memory used by the global Sorting tree
void clear_sort_heap(void) {
struct SortNode* node;
// as long as there is a first node...
while ((node = top_sorting_node)) {
// step forward.
top_sorting_node = top_sorting_node->next;
// free the original first node's memory
free(node);
}
}
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 0 on success or non-zero on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
if (!fgets(line->first_name, 1024, fp))
return -1;
// parse data and review for possible overflow.
// first, zero out data
int pos = 0;
line->age = NULL;
line->id = NULL;
line->last_name = NULL;
// read each byte, looking for the EOL marker and the ',' separators
while (pos < 1024) {
if (line->first_name[pos] == ',') {
// we encountered a divider. we should handle it.
// if the ID field's location is already known, we have an excess comma.
if (line->id) {
fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
clear_sort_heap();
exit(2);
}
// replace the comma with 0 (separate the strings)
line->first_name[pos] = 0;
if (line->age)
line->id = line->first_name + pos + 1;
else if (line->last_name)
line->age = line->first_name + pos + 1;
else
line->last_name = line->first_name + pos + 1;
} else if (line->first_name[pos] == '\n') {
// we encountered a terminator. we should handle it.
if (line->id) {
// if we have the id string's position (the start marker), this is a
// valid entry and we should process the data.
line->first_name[pos] = 0;
return 0;
} else {
// we reached an EOL without enough ',' separators, this is an invalid
// line.
fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
clear_sort_heap();
exit(1);
}
}
pos++;
}
// we ran through all the data but there was no EOL marker...
fprintf(stderr,
"Parsing error, invalid data (data overflow or data too large).\n");
return -1;
}
// read and sort a single line from the database.
// return 0 if there was no data to sort. return 1 if data was read and sorted.
int sort_line(FILE* fp) {
// allocate the memory for the node - use calloc to get zeroed-out data
struct SortNode* node = calloc(1, sizeof(*node));
// store the position on file
fgetpos(fp, &node->position);
// use a stack allocated DBEntry for processing
struct DBEntry line;
// check that the read succeeded (read_db_line will return -1 on error)
if (read_db_line(fp, &line)) {
// free the node's memory
free(node);
// return no data (0)
return 0;
}
// compute sorting value - I'll assume all IDs are numbers up to long size.
sscanf(line.id, "%ld", &node->value);
// heap sort?
// This is a questionable sort algorithm... or a questionable implementation.
// Also, I'll be using pointers to pointers, so it might be a headache to read
// (it's a headache to write, too...) ;-)
struct SortNode** tmp = &top_sorting_node;
// walk up the list until we encounter a value larger than ours,
// or until the list is finished.
while (*tmp && (*tmp)->value <= node->value)
tmp = &((*tmp)->next);
// update the node's `next` value.
node->next = *tmp;
// inject the new node into the tree at the position we found
*tmp = node;
// return 1 (data was read and sorted)
return 1;
}
// writes the next line in the sorting
int write_line(FILE* to, FILE* from) {
struct SortNode* node = top_sorting_node;
if (!node) // are we done? top_sorting_node == NULL ?
return 0; // return 0 - no data to write
// step top_sorting_node forward
top_sorting_node = top_sorting_node->next;
// read data from one file to the other
fsetpos(from, &node->position);
char* buffer = NULL;
ssize_t length;
size_t buff_size = 0;
length = getline(&buffer, &buff_size, from);
if (length <= 0) {
perror("Line Copy Error - Couldn't read data");
return 0;
}
fwrite(buffer, 1, length, to);
free(buffer); // getline allocates memory that we're in charge of freeing.
return 1;
}
// the main program
int main(int argc, char const* argv[]) {
// open file
FILE *fp_read, *fp_write;
fp_read = fopen("workersinfo.txt", "r");
fp_write = fopen("sorted_workersinfo.txt", "w+");
if (!fp_read) {
perror("File Error");
goto cleanup;
}
if (!fp_write) {
perror("File Error");
goto cleanup;
}
printf("\nSorting");
while (sort_line(fp_read))
printf(".");
// write all sorted data to a new file
printf("\n\nWriting sorted data");
while (write_line(fp_write, fp_read))
printf(".");
// clean up - close files and make sure the sorting tree is cleared
cleanup:
printf("\n");
if (fp_read) fclose(fp_read);
if (fp_write) fclose(fp_write);
clear_sort_heap();
return 0;
}

Searching a particular string in a large file

I am making a program in C which can search for a specific string in a large .txt file, count the occurrences and then print the count. But it seems that something has gone wrong, because the output of my program is different from that of two text editors. According to the text editors, there are 3000 occurrences in total (in this case I search for the word "make") in that .txt file, but the output of my program is just 2970.
I cannot find the problem in my program. So I am curious about how a text editor can search for a specific string so accurately. How do people implement that? Can anyone show me some code in C?
To make things clear: it is a large .txt file, 20MB or so, containing lots of characters, so I think it's not a good idea to read it into memory all at once. I have implemented my program by splitting the file into pieces and then scanning each of them. However, it fails somewhere.
Maybe I should put the code here. Wait a minute please.
The code is kinda long, 70 lines or so. I have put it on my GitHub; if you have any interest, please help: https://github.com/walkerlala/searchText
Note that the only related files are wordCount.c and testfile.txt. wordCount.c goes like this:
#include<stdio.h>
#include<stdlib.h>
#include<stdbool.h>
char arr[51];
int flag=0;
int flag2=0;
int flag3=0;
int flag4=0;
int pieceCount(FILE*);
int main()
{
//the file in which I want to search the word is testfile.txt
//I have formatted the file so that it contain no newlins any more
FILE* fs=fopen("testfile.txt","r");
int n=pieceCount(fs);
printf("%d\n",n);
rewind(fs); //refresh the file...
static bool endOfPiece1=false,endOfPiece2=false,endOfPiece3=false;
bool begOfPiece1,begOfPiece2,begOfPiece3;
for(int start=0;start<n;++start){
fgets(arr,sizeof(arr),fs);
for(int i=0;i<=46;++i){
if((arr[i]=='M'||arr[i]=='m')&&(arr[i+1]=='A'||arr[i+1]=='a')&&(arr[i+2]=='K'||arr[i+2]=='k')&&(arr[i+3]=='E'||arr[i+3]=='e')){
flag+=1;
//continue;
}
}
//check the border
begOfPiece1=((arr[1]=='e'||arr[1]=='E'));
if(begOfPiece1==true&&endOfPiece1==true)
flag2+=1;
endOfPiece1=((arr[47]=='m'||arr[47]=='M')&&(arr[48]=='a'||arr[48]=='A')&&(arr[49]=='k'||arr[49]=='K'));
begOfPiece2=((arr[1]=='k'||arr[1]=='K')&&(arr[2]=='e'||arr[2]=='E'));
if(begOfPiece2==true&&endOfPiece2==true)
flag3+=1;
endOfPiece2=((arr[48]=='m'||arr[48]=='M')&&(arr[49]=='a'||arr[49]=='A'));
begOfPiece3=((arr[1]=='a'||arr[1]=='A')&&(arr[2]=='k'||arr[2]=='K')&&(arr[3]=='e'||arr[3]=='E'));
if(begOfPiece3==true&&endOfPiece3==true)
flag4+=1;
endOfPiece3=(arr[49]=='m'||arr[49]=='M');
}
printf("%d\n%d\n%d\n%d\n",flag,flag2,flag3,flag4);
getchar();
return 0;
}
//the function counts how many pieces have I split the file into
int pieceCount(FILE* file){
static int count=0;
char arr2[51]={'\0'};
while(fgets(arr2,sizeof(arr),file)){
count+=1;
continue;
}
return count;
}
You can do this quite simply just by having a rolling buffer. You don't need to break the file into sections.
#include <stdio.h>
#include <string.h>
int main(void) {
char buff [4]; // word buffer
int count = 0; // occurrences
FILE* fs=fopen("test.txt","r"); // open the file
if (fs != NULL) { // if the file opened
if (4 == fread(buff, 1, 4, fs)) { // fill the buffer
do { // if it worked
if (strnicmp(buff, "make", 4) == 0) // check for target word
count++; // tally
memmove(buff, buff+1, 3); // shift the buffer down
} while (1 == fread(buff+3, 1, 1, fs)); // fill the last position
} // end of file
fclose(fs); // close the file
}
printf("%d\n", count); // report the result
return 0;
}
For simplicity I stopped short of making the search word "softer" and allocating the correct buffer and various sizes, since that wasn't in the question. And I have to leave something for OP to do.
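One note on the snippet above: strnicmp is not part of standard C (it is a common Windows/compiler extension); on POSIX systems the equivalent is strncasecmp from <strings.h>. If neither is available, a small portable stand-in (my own sketch) could be:
#include <ctype.h>
#include <stddef.h>
/* case-insensitive comparison of the first n bytes */
static int ci_ncmp(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int d = tolower((unsigned char)a[i]) - tolower((unsigned char)b[i]);
        if (d != 0 || a[i] == '\0')
            return d;
    }
    return 0;
}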

LZW encoding for large file

I am building an LZW encoding algorithm which uses a dictionary and hashing so it can quickly look up working words already stored in the dictionary.
The algorithm gives proper results when run on smaller files (circa a few hundred symbols), but it crashes on larger files, especially on files that contain fewer distinct symbols - the worst case is a file consisting of only one symbol, say 'y', where it crashes while the dictionary is not even close to being full. When the large input file consists of more than one symbol, the dictionary gets close to full, approximately 90%, but then it crashes as well.
Considering the structure of my algorithm, I am not quite sure what is causing it to crash in general, or why it crashes so soon when a large file of just one symbol is given.
It must be something about hashing (first time doing it, so it might have some bugs).
The hash function I am using can be found here, and from what I have tested, it gives good results: oat_hash
The LZW encoding algorithm is based on this link, with a slight change: it only works until the dictionary is full: LZW encoder
Let's get into code:
Note: oat_hash is changed so it returns value % CAPACITY, so every index falls within DICTIONARY
// Globals
#define CAPACITY 100000
char *DICTIONARY[CAPACITY];
unsigned short CODES[CAPACITY]; // CODES and DICTIONARY are linked via index: word from dictionary on index i, has its code in CODES on index i
int position = 0;
int code_counter = 0;
void encode(FILE *input, FILE *output){
int succ1 = fseek(input, 0, SEEK_SET);
if(succ1 != 0) printf("Error: file not open!");
int succ2 = fseek(output, 0, SEEK_SET);
if(succ2 != 0) printf("Error: file not open!");
//1. Working word = next symbol from the input
char *working_word = malloc(2048*sizeof(char));
char new_symbol = getc(input);
working_word[0] = new_symbol;
working_word[1] = '\0';
//2. WHILE(there are more symbols on the input) DO
//3. NewSymbol = next symbol from the input
while((new_symbol = getc(input)) != EOF){
char *workingWord_and_newSymbol= NULL;
char newSymbol[2];
newSymbol[0] = new_symbol;
newSymbol[1] = '\0';
workingWord_and_newSymbol = working_word_and_new_symbol(working_word, newSymbol);
int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));
//4. IF(WorkingWord + NewSymbol) is already in the dictionary THEN
if(DICTIONARY[index] != NULL){
// 5. WorkingWord += NewSymbol
working_word = working_word_and_new_symbol(working_word, newSymbol);
}
//6. ELSE
else{
//7. OUTPUT: code for WorkingWord
int idx = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[idx]);
//8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
if(!dictionary_full()){
DICTIONARY[index] = workingWord_and_newSymbol;
CODES[index] = code_counter + 1;
code_counter += 1;
working_word = strdup(newSymbol);
}else break;
}
//10. END IF
}
//11. END WHILE
//12. OUTPUT: code for WorkingWord
int index = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[index]);
free(working_word);
}
int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));
And later
int idx = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[idx]);
//8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
if(!dictionary_full()){
DICTIONARY[index] = workingWord_and_newSymbol;
CODES[index] = code_counter + 1;
code_counter += 1;
working_word = strdup(newSymbol);
}else break;
idx and index are unbounded, and you use them to access a bounded array, so you are accessing memory out of range. Here's a suggestion, though it may skew the distribution; if your hash range is much larger than CAPACITY it shouldn't be a problem. You also have another problem, which was mentioned: collisions. You need to handle them, but that's a different problem (a rough sketch follows after the code below).
int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol)) % CAPACITY;
// and
int idx = oat_hash(working_word, strlen(working_word)) % CAPACITY;
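Collision handling itself is a separate topic, but as a rough, hypothetical sketch (reusing the question's oat_hash, DICTIONARY and CAPACITY, and assuming the table never becomes completely full), open addressing with a key comparison could look like this:
#include <string.h>
/* returns the slot where 'key' is stored, or the first empty slot
   where it could be inserted (linear probing) */
static int find_slot(char *key)
{
    int index = oat_hash(key, strlen(key)) % CAPACITY;
    while (DICTIONARY[index] != NULL && strcmp(DICTIONARY[index], key) != 0)
        index = (index + 1) % CAPACITY;   /* probe the next slot */
    return index;
}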
LZW compression is certainly used to construct binary files and normally is capable of reading binary files.
The following code is problematic as it relies on new_symbol never being a \0.
newSymbol[0] = new_symbol; newSymbol[1] = '\0';
strlen(workingWord_and_newSymbol)
strdup(newSymbol)
This needs a rewrite to work with arrays of bytes rather than strings.
fopen() was not shown. Ensure the input is opened in binary mode: input = fopen(..., "rb");
@Wumpus Q. Wumbley is correct: use int for the result of getc(), otherwise EOF cannot be detected reliably.
Minor:
new_symbol and newSymbol are confusing.
Consider:
// char *working_word = malloc(2048*sizeof(char));
#define WORKING_WORD_N (2048)
char *working_word = malloc(WORKING_WORD_N*sizeof(*working_word));
// or
char *working_word = malloc(WORKING_WORD_N);

Segmentation fault while convering large number of xml files into text

#include<stdio.h>
#include<stdlib.h>
#include<dirent.h>
#include<string.h>
int main()
{
FILE *fin,*fout;
char dest[80]="/home/vivs/InexCorpusText/";
char file[30];
DIR *dir;
char c,state='1';
int len;
struct dirent *ent;
if((dir=opendir("/home/vivs/InexCorpus"))!=NULL)
{
while((ent=readdir(dir))!=NULL)
{
if(strcmp(ent->d_name,".") &&
strcmp(ent->d_name,"..") &&
strcmp(ent->d_name,".directory"))
{
len=strlen(ent->d_name);
strcpy(file,ent->d_name);
file[len-3]=file[len-1]='t';
file[len-2]='x';
//strcat(source,ent->d_name);
strcat(dest,file);
printf("%s\t%s\n",ent->d_name,dest);
fin=fopen(ent->d_name,"r");
fout=fopen(dest,"w");
while((c=fgetc(fin))!=EOF)
{
if(c=='<')
{
fputc(' ',fout);
state='0';
}
else if(c=='>')
state='1';
else if(state=='1')
{
if(c!='\n')
fputc(c,fout);
if(c=='.')
{
c=fgetc(fin);
if(c==' '||c=='\n'||c=='<')
{
fputc('\n',fout);
ungetc(c,fin);
}
else fputc(c,fout);
}
}
}
}
close(fin);
close(fout);
strcpy(dest,"/home/vivs/InexCorpusText/");
}
closedir(dir);
}
else
{
printf("Error in opening directory\n");
}
return 0;
}
I was trying to convert XML files to text. This code simply removes tags and nothing else.
When I execute this code for around 300 files, it doesn't show any error, but when the number goes to 500 or more I receive a segmentation fault after processing around 300 files.
At least one reason 'right from the start':
Here is the struct dirent declaration from the man page:
On Linux, the dirent structure is defined as follows:
struct dirent {
ino_t d_ino; /* inode number */
off_t d_off; /* offset to the next dirent */
unsigned short d_reclen; /* length of this record */
unsigned char d_type; /* type of file; not supported
by all file system types */
char d_name[256]; /* filename */
};
You are in trouble with any name longer than 29 characters. A memory overwrite occurs because file has only 30 bytes (one of which is reserved for the '\0' terminator):
char file[30];
...
strcpy(file,ent->d_name);
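A small sketch of one way to guard the copy (sizing the buffer for any directory entry name via NAME_MAX from <limits.h>, and skipping anything that still does not fit):
#include <limits.h>
char file[NAME_MAX + 1];   /* large enough for any directory entry name */
...
if (strlen(ent->d_name) >= sizeof file) {
    fprintf(stderr, "Skipping, name too long: %s\n", ent->d_name);
    continue;
}
strcpy(file, ent->d_name);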
There are two structures within XML that you do not appear to account for.
Attribute contents can contain unescaped > characters, which could throw off your count. See http://www.w3.org/TR/REC-xml/#NT-AttValue.
CDATA sections can contain both < and > characters as literal text, as long as they do not appear as part of the closing ]]> string. See http://www.w3.org/TR/REC-xml/#NT-CharData. This could seriously throw off your logic.
Why don't you look in your files to see if any contain the text CDATA?
You might want to consider using xsltproc or libxslt; a very simple XSLT transform would give you exactly what you want. See Extract part of an XML file as plain text using XSLT for such a transform engine.
OK, another problematic place:
len=strlen(ent->d_name);
....
file[len-3]=file[len-1]='t';
file[len-2]='x';
Because d_name could have fewer than 3 characters, this could again lead to a memory overwrite.
You should be careful with functions like strlen() and always validate their result.
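For example, the extension rewrite could be guarded along these lines (a sketch, assuming the input files end in .xml):
len = strlen(ent->d_name);
/* only rewrite the extension if the name actually ends in ".xml" */
if (len >= 4 && strcmp(ent->d_name + len - 4, ".xml") == 0) {
    strcpy(file, ent->d_name);
    file[len - 3] = file[len - 1] = 't';
    file[len - 2] = 'x';
} else {
    continue;   /* not an .xml file, skip it */
}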

Resources