Hey guys, I'm attempting to read in workersinfo.txt and store it in a two-dimensional char array. The file has around 4,000,000 lines with around 100 characters per line, and I want to store each line of the file in the array. Unfortunately, I get exit code 139 (not enough memory). I'm aware I have to use malloc() and free(), but I've tried a couple of things and haven't been able to make them work. Eventually I have to sort the array by ID number, but I'm stuck on declaring the array.
The file looks something like this:
First Name, Last Name,Age, ID
Carlos,Lopez,,10568
Brad, Patterson,,20586
Zack, Morris,42,05689
This is my code so far:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
FILE *ptr_file;
char workers[4000000][1000];
ptr_file =fopen("workersinfo.txt","r");
if (!ptr_file)
perror("Error");
int i = 0;
while (fgets(workers[i],1000, ptr_file)!=NULL){
i++;
}
int n;
for(n = 0; n < 4000000; n++)
{
printf("%s", workers[n]);
}
fclose(ptr_file);
return 0;
}
The stack memory is limited. As you pointed out in your question, you MUST use malloc to allocate such a big (need I say HUGE) chunk of memory, as the stack cannot contain it.
You can use ulimit to review the limits of your system (usually including the stack size limit).
On my Mac, the limit is 8MB. After running ulimit -a I get:
...
stack size (kbytes, -s) 8192
...
Or, test the limit using:
#include <sys/resource.h>
struct rlimit rlim;
getrlimit(RLIMIT_STACK, &rlim);
rlim.rlim_cur // the stack limit
I truly recommend you process each database entry separately.
As mentioned in the comments, assigning the memory as static memory would, in most implementations, circumvent the stack.
Still, IMHO, allocating 400MB of memory (or 4GB, depending on which part of your question I look at) is bad form unless totally required, especially for a single function.
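If the whole file really does have to sit in memory, the malloc() route the question mentions could look roughly like the sketch below. It is only an illustration under a few assumptions: read_all_lines and free_lines are hypothetical helper names (not from the original code), lines are assumed to be shorter than 1024 bytes, and the array of pointers is grown with realloc() as needed. Each line is stored in a buffer sized to fit it, so 4,000,000 lines of about 100 characters cost roughly 400MB plus pointer overhead rather than 4GB.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads every line of fp into heap memory.
 * Returns an array of heap-allocated strings and stores the number of
 * lines in *count. Returns NULL only if the first allocation fails;
 * if memory runs out later, the lines read so far are returned. */
char **read_all_lines(FILE *fp, size_t *count)
{
    size_t cap = 1024, n = 0;
    char **lines = malloc(cap * sizeof *lines);
    char buf[1024]; /* assumption: no line is longer than 1023 bytes */

    *count = 0;
    if (!lines)
        return NULL;
    while (fgets(buf, sizeof buf, fp)) {
        if (n == cap) {
            /* double the pointer array when it is full */
            char **tmp = realloc(lines, cap * 2 * sizeof *lines);
            if (!tmp)
                break;
            lines = tmp;
            cap *= 2;
        }
        size_t len = strlen(buf) + 1; /* include the terminating '\0' */
        lines[n] = malloc(len);
        if (!lines[n])
            break;
        memcpy(lines[n], buf, len);
        n++;
    }
    *count = n;
    return lines;
}

/* Releases everything read_all_lines allocated. */
void free_lines(char **lines, size_t count)
{
    for (size_t i = 0; i < count; i++)
        free(lines[i]);
    free(lines);
}
```

A caller would open the file, call read_all_lines(fp, &count), use lines[0] through lines[count - 1], and finish with free_lines(lines, count).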
Follow-up Q1: How to deal with each DB entry separately
I hope I'm not doing your homework or anything... but I doubt your homework would include an assignment to load 400MB of data into the computer's memory... so... to answer the question in your comment:
The following sketch of single-entry processing isn't perfect: it's limited to 1KB of data per entry (which I thought to be more than enough for such simple data).
Also, I didn't allow for UTF-8 encoding or anything like that (I followed the assumption that English would be used).
As you can see from the code, we read each line separately and perform error checks to verify that the data is valid.
To sort the file by ID, you might consider either comparing and sorting two lines at a time (this would be a slow sort), or creating a sorted node tree with the ID data and the position of the line in the file (get the position before reading the line). Once you have sorted the binary tree, you can sort the data...
... The binary tree might get a bit big. Did you look up sorting algorithms?
#include <stdio.h>
// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
char* last_name; // a pointer to the last name
char* age; // a pointer to the age - could probably be an int
char* id; // a pointer to the ID
char first_name[1024]; // the actual buffer...
// I unified the first name and the buffer since the first name is first.
};
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 1 on success or 0 on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
if (!fgets(line->first_name, 1024, fp))
return 0;
// parse data and review for possible overflow.
// first, zero out data
int pos = 0;
line->age = NULL;
line->id = NULL;
line->last_name = NULL;
// read each byte, looking for the EOL marker and the ',' separators
while (pos < 1024) {
if (line->first_name[pos] == ',') {
// we encountered a divider. we should handle it.
// if the ID field's location is already known, we have an excess comma.
if (line->id) {
fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
return 0;
}
// replace the comma with 0 (separate the strings)
line->first_name[pos] = 0;
if (line->age)
line->id = line->first_name + pos + 1;
else if (line->last_name)
line->age = line->first_name + pos + 1;
else
line->last_name = line->first_name + pos + 1;
} else if (line->first_name[pos] == '\n') {
// we encountered a terminator. we should handle it.
if (line->id) {
// if we have the id string's position (the start marker), this is a
// valid entry and we should process the data.
line->first_name[pos] = 0;
return 1;
} else {
// we reached an EOL without enough ',' separators, this is an invalid
// line.
fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
return 0;
}
}
pos++;
}
// we ran through all the data but there was no EOL marker...
fprintf(stderr,
"Parsing error, invalid data (data overflow or data too large).\n");
return 0;
}
// the main program
int main(int argc, char const* argv[]) {
// open file
FILE* ptr_file;
ptr_file = fopen("workersinfo.txt", "r");
if (!ptr_file) {
perror("File Error");
return 1; // don't keep going with a NULL FILE pointer
}
struct DBEntry line;
while (read_db_line(ptr_file, &line)) {
// do what you want with the data... print it?
printf(
"First name:\t%s\n"
"Last name:\t%s\n"
"Age:\t\t%s\n"
"ID:\t\t%s\n"
"--------\n",
line.first_name, line.last_name, line.age, line.id);
}
// close file
fclose(ptr_file);
return 0;
}
Followup Q2: Sorting array for 400MB-4GB of data
IMHO, 400MB is already touching on the issues related to big data. For example, implementing a bubble sort on your database should be agonizing as far as performance goes (unless it's a single time task, where performance might not matter).
Creating an array of DBEntry objects will eventually get you a larger memory footprint than the actual data.
This will not be the optimal way to sort large data.
The correct approach will depend on your sorting algorithm. Wikipedia has a decent primer on sorting algorithms.
Since we are handling a large amount of data, there are a few things to consider:
It would make sense to partition the work, so different threads/processes sort a different section of the data.
We will need to minimize IO to the hard drive (as it will slow the sorting significantly and prevent parallel processing on the same machine/disk).
One possible approach is to create a heap for a heap sort, but only storing a priority value and storing the original position in the file.
Another option would probably be to employ a divide and conquer algorithm, such as quicksort, again, only sorting a computed sort value and the entry's position in the original file.
Either way, writing a decent sorting method will be a complicated task, probably involving threading, forking, tempfiles or other techniques.
Here's some simplified demo code... it is far from optimized, but it demonstrates the idea of a sorted structure (in this code, a simple sorted linked list rather than a true binary tree) that holds the sorting value and the position of the data in the file.
Be aware that using this code will be both relatively slow (although not that slow) and memory intensive...
On the other hand, it will require about 24 bytes per entry. For 4 million entries, that's 96MB, somewhat better than 400MB and definitely better than 4GB.
#include <stdlib.h>
#include <stdio.h>
// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
char* last_name; // a pointer to the last name
char* age; // a pointer to the age - could probably be an int
char* id; // a pointer to the ID
char first_name[1024]; // the actual buffer...
// I unified the first name and the buffer since the first name is first.
};
// this might be a sorting node (this demo keeps a sorted singly linked list):
struct SortNode {
struct SortNode* next; // a pointer to the next node
fpos_t position; // the DB entry's position in the file
long value; // The computed sorting value
}* top_sorting_node = NULL;
// this function will free all the memory used by the global Sorting tree
void clear_sort_heap(void) {
struct SortNode* node;
// as long as there is a first node...
while ((node = top_sorting_node)) {
// step forward.
top_sorting_node = top_sorting_node->next;
// free the original first node's memory
free(node);
}
}
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 0 on success or -1 on failure (including end of file).
int read_db_line(FILE* fp, struct DBEntry* line) {
if (!fgets(line->first_name, 1024, fp))
return -1;
// parse data and review for possible overflow.
// first, zero out data
int pos = 0;
line->age = NULL;
line->id = NULL;
line->last_name = NULL;
// read each byte, looking for the EOL marker and the ',' separators
while (pos < 1024) {
if (line->first_name[pos] == ',') {
// we encountered a divider. we should handle it.
// if the ID field's location is already known, we have an excess comma.
if (line->id) {
fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
clear_sort_heap();
exit(2);
}
// replace the comma with 0 (separate the strings)
line->first_name[pos] = 0;
if (line->age)
line->id = line->first_name + pos + 1;
else if (line->last_name)
line->age = line->first_name + pos + 1;
else
line->last_name = line->first_name + pos + 1;
} else if (line->first_name[pos] == '\n') {
// we encountered a terminator. we should handle it.
if (line->id) {
// if we have the id string's position (the start marker), this is a
// valid entry and we should process the data.
line->first_name[pos] = 0;
return 0;
} else {
// we reached an EOL without enough ',' separators, this is an invalid
// line.
fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
clear_sort_heap();
exit(1);
}
}
pos++;
}
// we ran through all the data but there was no EOL marker...
fprintf(stderr,
"Parsing error, invalid data (data overflow or data too large).\n");
return -1; // report failure: returning 0 would tell the caller the line parsed
}
// read and sort a single line from the database.
// return 0 if there was no data to sort. return 1 if data was read and sorted.
int sort_line(FILE* fp) {
// allocate the memory for the node - use calloc for zero-out data
struct SortNode* node = calloc(sizeof(*node), 1);
// store the position on file
fgetpos(fp, &node->position);
// use a stack allocated DBEntry for processing
struct DBEntry line;
// check that the read succeeded (read_db_line will return -1 on error)
if (read_db_line(fp, &line)) {
// free the node's memory
free(node);
// return no data (0)
return 0;
}
// compute sorting value - I'll assume all IDs are numbers up to long size.
sscanf(line.id, "%ld", &node->value);
// heap sort?
// This is a questionable sort algorithm... or a questionable implementation.
// Also, I'll be using pointers to pointers, so it might be a headache to read
// (it's a headache to write, too...) ;-)
struct SortNode** tmp = &top_sorting_node;
// move up the list until we encounter a value larger than ours,
// OR until the list is finished.
while (*tmp && (*tmp)->value <= node->value)
tmp = &((*tmp)->next);
// update the node's `next` value.
node->next = *tmp;
// inject the new node into the tree at the position we found
*tmp = node;
// return 1 (data was read and sorted)
return 1;
}
// writes the next line in the sorting
int write_line(FILE* to, FILE* from) {
struct SortNode* node = top_sorting_node;
if (!node) // are we done? top_sorting_node == NULL ?
return 0; // return 0 - no data to write
// step top_sorting_node forward
top_sorting_node = top_sorting_node->next;
// read data from one file to the other
fsetpos(from, &node->position);
char* buffer = NULL;
ssize_t length;
size_t buff_size = 0;
length = getline(&buffer, &buff_size, from);
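if (length <= 0) {
perror("Line Copy Error - Couldn't read data");
free(buffer);
free(node);
return 0;
}
fwrite(buffer, 1, length, to);
free(buffer); // getline allocates memory that we're in charge of freeing.
free(node); // the node was unlinked from the list above, so release it here.
return 1;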
}
// the main program
int main(int argc, char const* argv[]) {
// open file
FILE *fp_read, *fp_write;
fp_read = fopen("workersinfo.txt", "r");
fp_write = fopen("sorted_workersinfo.txt", "w+");
if (!fp_read) {
perror("File Error");
goto cleanup;
}
if (!fp_write) {
perror("File Error");
goto cleanup;
}
printf("\nSorting");
while (sort_line(fp_read))
printf(".");
// write all sorted data to a new file
printf("\n\nWriting sorted data");
while (write_line(fp_write, fp_read))
printf(".");
// clean up - close files and make sure the sorting tree is cleared
cleanup:
printf("\n");
// fclose(NULL) is undefined behavior, so only close what was opened
if (fp_read)
fclose(fp_read);
if (fp_write)
fclose(fp_write);
clear_sort_heap();
return 0;
}
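The demo above inserts every node by walking the list from the top, which can become quadratic for millions of entries. As an alternative, the "sort only the key and the file position" idea can be sketched with the standard qsort(). This is a rough illustration under stated assumptions, not a tuned implementation: IndexEntry, build_index, sort_index and write_sorted are hypothetical names, the ID is assumed to be the last comma-separated field and to fit in a long, lines are assumed to be shorter than 1024 bytes, and a header line (if present) parses as ID 0 unless the caller skips it first. On a typical 64-bit system each entry takes 16 bytes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One index record per line: the numeric ID and where the line starts. */
struct IndexEntry {
    long id;
    long offset;
};

/* qsort comparator: ascending by id. */
static int compare_entries(const void *a, const void *b)
{
    const struct IndexEntry *x = a;
    const struct IndexEntry *y = b;
    return (x->id > y->id) - (x->id < y->id);
}

/* Scans fp once and returns a malloc'd index (NULL on allocation failure). */
struct IndexEntry *build_index(FILE *fp, size_t *count)
{
    size_t cap = 1024, n = 0;
    struct IndexEntry *idx = malloc(cap * sizeof *idx);
    char line[1024];

    *count = 0;
    if (!idx)
        return NULL;
    for (;;) {
        long pos = ftell(fp); /* remember the position before reading */
        if (!fgets(line, sizeof line, fp))
            break;
        char *comma = strrchr(line, ','); /* the ID follows the last comma */
        if (!comma)
            continue; /* skip lines without any separator */
        if (n == cap) {
            struct IndexEntry *tmp = realloc(idx, cap * 2 * sizeof *idx);
            if (!tmp) {
                free(idx);
                return NULL;
            }
            idx = tmp;
            cap *= 2;
        }
        idx[n].id = strtol(comma + 1, NULL, 10);
        idx[n].offset = pos;
        n++;
    }
    *count = n;
    return idx;
}

/* Sorts the index in place by ID. */
void sort_index(struct IndexEntry *idx, size_t count)
{
    qsort(idx, count, sizeof *idx, compare_entries);
}

/* Copies lines from `from` to `to` in index order; returns 0 on success. */
int write_sorted(FILE *from, FILE *to, const struct IndexEntry *idx,
                 size_t count)
{
    char line[1024];
    for (size_t i = 0; i < count; i++) {
        if (fseek(from, idx[i].offset, SEEK_SET) != 0 ||
            !fgets(line, sizeof line, from))
            return -1;
        fputs(line, to);
    }
    return 0;
}
```

The seeks in write_sorted still jump around the input file, so the disk access pattern is no better than in the demo; only the in-memory sorting step changes.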
Related
I'm trying to modify the NodeMCU Lua file.list function in /app/modules/file.c
to return one large string of filenames separated by newline characters. Currently it returns an array, which is very memory consuming; I'm also stripping the file size.
Here is what I have done (after examining how some other functions return strings):
static int file_list( lua_State* L )
{
char temp[32];
unsigned st = luaL_optinteger( L, 1, 1 ); // start offset
unsigned tf = luaL_optinteger( L, 2, 100000 ); // how much files to list
tf=tf+st;
vfs_dir *dir;
if (dir = vfs_opendir("")) {
lua_newtable( L );
struct vfs_stat stat;
int i=1;
int ii=0;
while (vfs_readdir(dir, &stat) == VFS_RES_OK) {
if (i<st)
{
i++;
continue;
}
if (i>=tf)
{
break;
}
strcpy (temp,stat.name);
strcat (temp,"\n");
lua_pushstring( L, temp );
i++;
ii++;
}
vfs_closedir(dir);
return ii;
}
return 0;
}
It does not work as expected. If I request more than 40 files (in one call after the device reboots), I see output like this:
....
3fff0d10 already freed
3fff11b8 already freed
3fff0cc8 already freed
3fff1f88 already freed
.....
and the device restarts. But if I request 30 files and then increase the count in steps of 30 each time, I manage to get 400 files at one time.
=file.list(1,30)
=file.list(1,60)
=file.list(1,90)
That way it works. If I directly do:
=file.list(1,60)
it does not work. I also noticed that memory is allocated and not freed after the function finishes, but it is not allocated again when the same command is executed, so it is not a memory leak; perhaps some data just stays on the stack.
I'm trying to add strings to a Binary Search Tree using a recursive insert method (the usual for BSTs, IIRC) so I can later print them out using recursion as well.
Trouble is, I've been getting segmentation faults I don't really understand. Related code follows (this block of code is from my main function):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
// Stores the size of the C-strings we will use;
// Standardized to 100 (assignment specifications say
// ALL strings will be no more than 100 characters long)
// Please note that I defined this as a preprocessor
// directive because using the const keyword makes it
// impossible to define the size of the C-String array
// (C doesn't allow for static array struct members whose
// size is given as a variable)
#define STRING_SIZE 100
// The flags for case sensitivity
// and an output file
int cflag = 0, oflag = 0;
// These are intended to represent the boolean
// values true and false (there's no native bool
// data type in C, apparently)
const int TRUE = 1;
const int FALSE = 0;
// Type alias for the bool type
typedef int bool;
// This is the BST struct. A BST is basically just
// a Node with at most two children (left and right)
// and a data element.
typedef struct BST
{
struct BST *left;
struct BST *right;
char *key;
int counter;
} BST;
// ----- FUNCTION PROTOTYPES -----
void insert(BST **root, char *key);
int caseSenStrCmp(char *str1, char *str2);
int caseInsenStrCmp(char *str1, char *str2);
bool existsInTree(BST *root, char *key, int cflag);
void inOrderPrint(BST *root, int oflag, FILE *outFile);
void deallocateTree(BST *root);
int main(int argc, char **argv) {
extern char *optarg;
extern int optind;
int c, err = 0;
// Holds the current line in the file/user-provided string.
char currentLine[STRING_SIZE];
// This will store the input/output file
// directories
char fileDirectory[STRING_SIZE];
static char usage[] = "Usage: %s [-c] [-o output_file_name] [input_file_name]\n";
while ((c = getopt(argc, argv, "co:")) != -1)
switch (c)
{
case 'c':
cflag = 1;
break;
case 'o':
oflag = 1;
// If an output file name
// was entered, copy it
// to fileDirectory
if (argv[optind] != NULL)
{
strcpy(fileDirectory, argv[optind]);
}
break;
case '?':
err = 1;
break;
default:
err = 1;
break;
}
if (err)
{
// Generic error message
printf("ERROR: Invalid input.\n");
fprintf(stderr, usage, argv[0]);
exit(1);
}
// --- BST SORT CODE STARTS HERE ---
printf("This is BEFORE setting root to NULL\n");
// This is the BST. As the assignment instructions
// specify, it is initially set to NULL
BST *root = NULL;
// Pointer to the mode the files
// will be opened in. Starts as
// "w" since we're opening the output file
// first
printf("This is AFTER setting root to NULL\n");
char *mode = (char*)malloc(sizeof(char*));
strcpy(mode, "w");
printf("Wrote w to mode pointer");
// Pointer to the output file
FILE *outFile;
// Attempt to open output file
outFile = fopen(fileDirectory, mode);
printf("Opened outfile \n");
// Now update mode and fileDirectory so
// we can open the INPUT file
strcpy(mode, "r");
printf("Wrote r to mode\n");
// Check if we have an input file name
// If argv[optind] isn't NULL, that means we have
// an input file name, so copy it into fileDirectory
if (argv[optind] != NULL)
{
strcpy(fileDirectory, argv[optind]);
}
printf("Wrote input file name to fileDirectory.\n");
// Pointer to the input file
FILE *inFile;
// Attempt to open the input file
//printf("%d", inFile = fopen(fileDirectory, mode));
printf("Opened input file\n");
// If the input file was opened successfully, process it
if (inFile != NULL)
{
// Process the file while EOF isn't
// returned
while (!feof(inFile))
{
// Get a single line (one string)
//fgets(currentLine, STRING_SIZE, inFile);
printf("Wrote to currentLine; it now contains: %s\n", currentLine);
// Check whether the line is an empty line
if (*currentLine != '\n')
{
// If the string isn't an empty line, call
// the insert function
printf("currentLine wasn't the NULL CHAR");
insert(&root, currentLine);
}
}
// At this point, we're done processing
// the input file, so close it
fclose(inFile);
}
// Otherwise, process user input from standard input
else
{
do
{
printf("Please enter a string (or blank line to exit): ");
// Scanf takes user's input from stdin. Note the use
// of the regex [^\n], allowing the scanf statement
// to read input until the newline character is encountered
// (which happens when the user is done writing their string
// and presses the Enter key)
scanf("%[^\n]s", currentLine);
// Call the insert function on the line
// provided
insert(&root, currentLine);
} while (caseSenStrCmp(currentLine, "") != 0);
}
// At this point, we've read all the input, so
// perform in-order traversal and print all the
// strings as per assignment specification
inOrderPrint(root, oflag, outFile);
// We're done, so reclaim the tree
deallocateTree(root);
}
// ===== AUXILIARY METHODS ======
// Creates a new branch for the BST and returns a
// pointer to it. Will be called by the insert()
// function. Intended to keep the main() function
// as clutter-free as possible.
BST* createBranch(char *keyVal)
{
// Create the new branch to be inserted into
// the tree
BST* newBranch = (BST*)malloc(sizeof(BST));
// Allocate memory for newBranch's C-string
newBranch->key = (char*)malloc(STRING_SIZE * sizeof(char));
// Copy the user-provided string into newBranch's
// key field
strcpy(newBranch->key, keyVal);
// Set newBranch's counter value to 1. This
// will be incremented if/when other instances
// of the key are inserted into the tree
newBranch->counter = 1;
// Set newBranch's child branches to null
newBranch->left = NULL;
newBranch->right = NULL;
// Return the newly created branch
return newBranch;
}
// Adds items to the BST. Includes functionality
// to verify whether an item already exists in the tree
// Note that we pass the tree's root to the insert function
// as a POINTER TO A POINTER so that changes made to it
// affect the actual memory location that was passed in
// rather than just the local pointer
void insert(BST **root, char *key)
{
printf("We made it to the insert function!");
// Check if the current branch is empty
if (*root == NULL)
{
// If it is, create a new
// branch here and insert it
// This will also initialize the
// entire tree when the first element
// is inserted (i.e. when the tree is
// empty)
*root = createBranch(key);
}
// If the tree ISN'T empty, check whether
// the element we're trying to insert
// into the tree is already in it
// If it is, don't insert anything (the
// existsInTree function takes care of
// incrementing the counter associated
// with the provided string)
if (!existsInTree(*root, key, cflag))
{
// If it isn't, check if the case sensitivity
// flag is set; if it is, perform the
// checks using case-sensitive string
// comparison function
if (cflag) {
// Is the string provided (key) is
// greater than the string stored
// at the current branch?
if (caseSenStrCmp((*root)->key, key))
{
// If so, recursively call the
// insert() function on root's
// right child (that is, insert into
// the right side of the tree)
// Note that we pass the ADDRESS
// of root's right branch, since
// the insert function takes a
// pointer to a pointer to a BST
// as an argument
insert(&((*root)->right), key);
}
// If not, the key passed in is either less than
// or equal to the current branch's key,
// so recursively call the insert()
// function on root's LEFT child (that is,
// insert into the left side of the tree)
else
{
insert(&((*root)->left), key);
}
}
// If it isn't, perform the checks using
// the case-INsensitive string comparison
// function
else {
// The logic here is exactly the same
// as the comparisons above, except
// it uses the case-insensitive comparison
// function
if (caseInsenStrCmp((*root)->key, key))
{
insert(&((*root)->right), key);
}
else
{
insert(&((*root)->left), key);
}
}
}
}
// CASE SENSITIVE STRING COMPARISON function. Returns:
// -1 if str1 is lexicographically less than str2
// 0 if str1 is lexicographically equal to str2
// 1 if str1 is lexicographically greater than str2
I'm using getopt to parse options that the user enters. I've been doing a little bit of basic debugging using printf statements just to see how far I get into the code before it crashes, and I've more or less narrowed down the cause. It seems to be this part here:
do
{
printf("Please enter a string (or blank line to exit): ");
// Scanf takes user's input from stdin. Note the use
// of the regex [^\n], allowing the scanf statement
// to read input until the newline character is encountered
// (which happens when the user is done writing their string
// and presses the Enter key)
scanf("%[^\n]s", currentLine);
// Call the insert function on the line
// provided
insert(&root, currentLine);
} while (caseSenStrCmp(currentLine, "\n") != 0);
Or rather, calls to the insert function in general, since the printf statement I put at the beginning of the insert function ("We made it to the insert function!") gets printed over and over again until the program finally crashes with a segmentation fault, which probably means the problem is infinite recursion?
If so, I don't understand why it's happening. I initialized the root node to NULL at the beginning of main, so it should go directly into the insert function's *root == NULL case, at least on its first call.
Does it maybe have something to do with the way I pass root as a pointer to a pointer (BST **root in parameter list of insert function)? Am I improperly recursing, i.e. is this statement (and others similar to it)
insert(&((*root)->right), key);
incorrect somehow? This was my first guess, but I don't see how that would cause infinite recursion - if anything, it should fail without recursing at all if that was the case? Either way, it doesn't explain why infinite recursion happens when root is NULL (i.e. on the first call to insert, wherein I pass in &root - a pointer to the root pointer - to the insert function).
I'm really stuck on this one. At first I thought it might have something to do with the way I was copying strings to currentLine, since the line
if(*currentLine != '\0')
in the while (!feof(inFile)) loop also crashes the program with a segmentation fault, but even when I commented that whole part out just to test the rest of the code I ended up with the infinite recursion problem.
Any help at all is appreciated here, I've been trying to fix this for over 5 hours to no avail. I legitimately don't know what to do.
**EDIT: Since a lot of the comments involved questions regarding the way I declared other variables and such in the rest of the code, I've decided to include the entirety of my code, at least until the insert() function, which is where the problem (presumably) is. I only omitted things to try and keep the code to a minimum - I'm sure nobody likes to read through large blocks of code.
Barmar:
Regarding fopen() and fgets(): these were commented out so that inFile would remain NULL and the relevant conditional check would fail, since that part of the code also fails with a segmentation fault
createBranch() does initialize both the left and right children of the node it creates to NULL (as can be seen above)
currentLine is declared as an array with a static size.
#coderredoc:
My understanding of it is that it reads from standard input until it encounters the newline character (i.e. user hits the enter button), which isn't recorded as part of the string
I can already see where you were going with this! The conditional for my do/while loop was set to check for the newline character, so the loop would never have terminated. That's absolutely my fault; it was a carryover from a previous implementation of that block that I forgot to change.
I did change it after you pointed it out (see new code above), but unfortunately it didn't fix the problem (I'm guessing it's because of the infinite recursion happening inside the insert() function - once it gets called the first time, it never returns and just crashes with a segfault).
**
I managed to figure it out - turns out the problem was with the insert() function after all. I rewrote it (and the rest of the code that was relevant) to use a regular pointer rather than a pointer to a pointer:
BST* insert(BST* root, char *key)
{
// If branch is null, call createBranch
// to make one, then return it
if (root == NULL)
{
return createBranch(key);
}
// Otherwise, check whether the key
// already exists in the tree
if (!existsInTree(root, key, cflag))
{
// If it doesn't, check whether
// the case sensitivity flag is set
if (cflag)
{
// If it is, use the case-sensitive
// string comparison function to
// decide where to insert the key
if (caseSenStrCmp(root->key, key) < 0) // -1 means root->key sorts before key
{
// If the key provided is greater
// than the string stored at the
// current branch, insert into
// right child
root->right = insert(root->right, key);
}
else
{
// Otherwise, insert into left child
root->left = insert(root->left, key);
}
}
// If it isn't, use the case-INsensitive string
// comparison function to decide where to insert
else
{
// Same logic as before. If the key
// provided is greater, insert into
// current branch's right child
if (caseInsenStrCmp(root->key, key) < 0) // -1 means root->key sorts before key
{
root->right = insert(root->right, key);
}
// Otherwise, insert into the left child
else
{
root->left = insert(root ->left, key);
}
}
}
// Return the root pointer
return root;
}
Which immediately solved the infinite recursion/seg fault issue. It did reveal a few other minor semantic errors (most of which were probably made in frustration as I desperately tried to fix this problem without rewriting the insert function), but I've been taking care of those bit by bit.
I've now got a new problem (albeit probably a simpler one than this) which I'll make a separate thread for, since it's not related to segmentation faults.
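For comparison, the pointer-returning insert described above can be sketched in a self-contained form. This is an illustration only, with assumptions spelled out: it uses the standard strcmp() in place of the caseSenStrCmp/caseInsenStrCmp helpers (whose bodies are not shown in the post), it folds the "already exists" check into the same comparison instead of calling existsInTree, and Node, tree_insert and tree_free are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Node {
    struct Node *left;
    struct Node *right;
    char *key;
    int counter; /* how many times this key was inserted */
} Node;

/* Inserts key into the tree rooted at root and returns the (possibly new)
 * root of that subtree. A duplicate key only bumps the counter. */
Node *tree_insert(Node *root, const char *key)
{
    if (root == NULL) {
        /* empty spot found: create the node and return right away */
        Node *node = malloc(sizeof *node);
        if (node == NULL)
            return NULL;
        node->key = malloc(strlen(key) + 1);
        if (node->key == NULL) {
            free(node);
            return NULL;
        }
        strcpy(node->key, key);
        node->counter = 1;
        node->left = NULL;
        node->right = NULL;
        return node;
    }
    int cmp = strcmp(key, root->key);
    if (cmp == 0)
        root->counter++; /* key already in the tree */
    else if (cmp < 0)
        root->left = tree_insert(root->left, key);
    else
        root->right = tree_insert(root->right, key);
    return root;
}

/* Frees every node and its key (post-order). */
void tree_free(Node *root)
{
    if (root == NULL)
        return;
    tree_free(root->left);
    tree_free(root->right);
    free(root->key);
    free(root);
}
```

The early return in the NULL case is the part that matters for the recursion: once a node is created, the call stops there instead of falling through to further comparisons.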
I am reading data from a number of files, each containing a list of words. I am trying to display the number of words in each file, but I am running into issues. For example, when I run my code, I receive the output as shown below.
Almost every amount is correctly displayed with the exception of two files, each containing word counts in the thousands. Every other file only has three digits worth of words, and they seem just fine.
I can only guess what this problem could be (not enough space allocated somewhere?) and I do not know how to solve it. I apologize if this is all poorly worded. My brain is fried and I am struggling. Any help would be appreciated.
I've tried to keep my example code as brief as possible. I've cut out a lot of error checking and other tasks related to the full program. I've also added comments where I can. Thanks.
StopWords.c
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>
typedef struct
{
char stopwords[2000][60];
int wordcount;
} LangData;
typedef struct
{
int languageCount;
LangData languages[];
} AllData;
main(int argc, char **argv)
{
//Initialize data structures and open path directory
int langCount = 0;
DIR *d;
struct dirent *ep;
d = opendir(argv[1]);
//Count the number of language files in the directory
while(readdir(d))
langCount++;
//Account for "." and ".." in directory
//langCount = langCount - 2 THIS MAKES SENSE RIGHT?
langCount = langCount + 1; //The program crashes if I don't do this, which doesn't make sense to me.
//Allocate space in AllData for languageCount
AllData *data = malloc(sizeof(AllData) + sizeof(LangData)*langCount); //Unsure? Seems to work.
//Reset the directory in preparation for reading data
rewinddir(d);
//Copy all words into respective arrays.
char word[60];
int i = 0;
int k = 0;
int j = 0;
while((ep = readdir(d)) != NULL) //Probably could've used for loops to make this cleaner. Oh well.
{
if (!strcmp(ep->d_name, ".") || !strcmp(ep->d_name, ".."))
{
//Filtering "." and ".."
}
else
{
FILE *entry;
//Get string for path (i should make this a function)
char fullpath[100];
strcpy(fullpath, path);
strcat(fullpath, "\\");
strcat(fullpath, ep->d_name);
entry = fopen(fullpath, "r");
//Read all words from file
while(fgets(word, 60, entry) != NULL)
{
j = 0;
//Store each word one character at a time (better way?)
while(word[j] != '\0') //Check for end of word
{
data->languages[i].stopwords[k][j] = word[j];
j++; //Move onto next character
}
k++; //Move onto next word
data->languages[i].wordcount++;
}
//Display number of words in file
printf("%d\n", data->languages[i].wordcount);
i++; //Increment index in preparation for next language file.
fclose(entry);
}
}
}
Output
256 //czech.txt: Correct
101 //danish.txt: Correct
101 //dutch.txt: Correct
547 //english.txt: Correct
1835363006 //finnish.txt: Should be 1337. Of course it's 1337.
436 //french.txt: Correct
576 //german.txt: Correct
737 //hungarian.txt: Correct
683853 //icelandic.txt: Should be 1000.
399 //italian.txt: Correct
172 //norwegian.txt: Correct
269 //polish.txt: Correct
437 //portugese.txt: Correct
282 //romanian.txt: Correct
472 //spanish.txt: Correct
386 //swedish.txt: Correct
209 //turkish.txt: Correct
Do the files have more than 2000 words? You have only allocated space for 2000 words so once your program tries to copy over word 2001 it will be doing it outside of the memory allocated for that array, possibly into the space allocated for "wordcount".
Also, I want to point out that fgets returns a string up to the end of the line or at most n - 1 characters (59 in your case), whichever comes first. This will work fine if there is only one word per line in the files you are reading from; otherwise you will have to locate spaces within the string and count words from there.
If you are simply trying to get a word count, then there is no need to store all the words in an array in the first place. Assuming one word per line, the following should work just as well:
char word[60];
while(fgets(word, 60, entry) != NULL)
{
data->languages[i].wordcount++;
}
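If the words do need to be stored, the loop can be bounded so it never writes past the end of the array. The following is a sketch under the assumption of one word per line and the 2000 x 60 layout from the question; load_words, MAX_WORDS and MAX_LEN are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 2000   /* matches stopwords[2000][60] in the question */
#define MAX_LEN   60

/* Hypothetical helper: reads one word per line into words[], stopping at
   max_words so nothing is written past the end of the array.
   Returns the number of words stored. */
size_t load_words(FILE *entry, char words[][MAX_LEN], size_t max_words)
{
    assert(entry != NULL && words != NULL);
    size_t n = 0;

    /* fgets NUL-terminates each row, so no manual character copy is needed */
    while (n < max_words && fgets(words[n], MAX_LEN, entry) != NULL) {
        words[n][strcspn(words[n], "\r\n")] = '\0';  /* drop the line ending */
        n++;
    }
    return n;
}
```

A call such as load_words(entry, data->languages[i].stopwords, MAX_WORDS) would replace both inner loops, and its return value can be assigned to wordcount directly.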
fgets reference: http://www.cplusplus.com/reference/cstdio/
Update
I took another look and you might want to try allocating data as follows:
typedef struct
{
char stopwords[2000][60];
int wordcount;
} LangData;
typedef struct
{
int languageCount;
LangData *languages;
} AllData;
AllData *data = malloc(sizeof(AllData));
data->languages = malloc(sizeof(LangData)*langCount);
This way memory is being specifically allocated for the languages array.
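One thing to keep in mind with this approach: malloc does not initialize the memory it returns, so wordcount holds an indeterminate value unless it is set to 0 before the first increment. That may be one source of the implausible counts in the output above, though I can't confirm it without seeing the rest of the program. A sketch that uses calloc to zero the array and checks both allocations (alloc_all_data and free_all_data are hypothetical helper names):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    char stopwords[2000][60];
    int wordcount;
} LangData;

typedef struct {
    int languageCount;
    LangData *languages;
} AllData;

/* Hypothetical helper: allocates AllData with langCount zeroed entries.
   Returns NULL if either allocation fails. */
AllData *alloc_all_data(int langCount)
{
    assert(langCount > 0);
    AllData *data = malloc(sizeof *data);
    if (data == NULL)
        return NULL;

    /* calloc zero-fills, so every wordcount starts at 0 */
    data->languages = calloc((size_t)langCount, sizeof *data->languages);
    if (data->languages == NULL) {
        free(data);
        return NULL;
    }
    data->languageCount = langCount;
    return data;
}

/* Releases everything alloc_all_data() allocated. */
void free_all_data(AllData *data)
{
    if (data != NULL) {
        free(data->languages);
        free(data);
    }
}
```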
I agree that langCount = langCount - 2 makes sense. What error are you getting?
The purpose of this code is to read the following text files (d.txt, e.txt, f.txt) and perform the required actions so that the alphabet is written to output.txt in the correct order. The code appears to work, since output.txt contains the correct result, but there is a problem with a test I added using printf (it's at the end of the newfile function). To run it, I pass d.txt and output.txt as arguments.
It should print
top->prev points to file :d
top->prev points to file :e
but instead it prints the following, and I can't find the reason:
top->prev points to file :d
top->prev points to file :f
d.txt:
abc
#include e.txt
mno
e.txt:
def
#include f.txt
jkl
f.txt:
ghi
code:
%{
#include <stdio.h>
#include <stdlib.h>
struct yyfilebuffer{
YY_BUFFER_STATE bs;
struct yyfilebuffer *prev;
FILE *f;
char *filename;
}*top;
int i;
char temporal[7];
void newfile(char *filename);
void popfile();
void create();
%}
%s INC
%option noyywrap
%%
"#include " {BEGIN INC;}
<INC>.*$ {for(i=1;i<strlen(yytext)-2;i++)
{
temporal[i-1]=yytext[i];
}
newfile(temporal);
BEGIN INITIAL;
}
<<EOF>> {popfile();
BEGIN INITIAL;
}
%%
void main(int argc,int **argv)
{
if ( argc < 3 )
{
printf("\nUsage yybuferstate <filenamein> <filenameout>");
exit(1);
}
else
{
create();
newfile(argv[1]);
yyout = fopen(argv[2], "w");
yylex();
}
system("pause");
}
void create()
{
top = NULL;
}
void newfile(char *filename)
{
struct yyfilebuffer *newptr;
if(top == NULL)
{
newptr = malloc(1*sizeof(struct yyfilebuffer));
newptr->prev = NULL;
newptr->filename = filename;
newptr->f = fopen(filename,"r");
newptr->bs = yy_create_buffer(newptr->f, YY_BUF_SIZE);
top = newptr;
yy_switch_to_buffer(top->bs);
}
else
{
newptr = malloc(1*sizeof(struct yyfilebuffer));
newptr->prev = top;
newptr->filename = filename;
newptr->f = fopen(filename,"r");
newptr->bs = yy_create_buffer(newptr->f, YY_BUF_SIZE);
top = newptr;
yy_switch_to_buffer(top->bs); //edw
}
if(top->prev != NULL)
{
printf("top->prev points to file : %s\n",top->prev->filename);
}
}
void popfile()
{
struct yyfilebuffer *temp;
temp = NULL;
if(top->prev == NULL)
{
printf("\n Error : Trying to pop from empty stack");
exit(1);
}
else
{
temp = top;
top = temp->prev;
yy_switch_to_buffer(top->bs);
system("pause");
}
}
You need to think about how you manage memory, remembering that C does not really have a string type in the way you might be used to from other languages.
You define a global variable:
char temporal[7];
(which has an odd name, since globals are anything but temporary), and then fill in its value in your lexer:
for(i=1;i<strlen(yytext)-2;i++) {
temporal[i-1]=yytext[i];
}
There are at least three problems with the above code:
temporal only has room for a six-character filename, but nowhere do you check to make sure that yyleng is not greater than 6. If it is, you will overwrite random memory. (The flex-generated scanner sets yyleng to the length of the token whose starting address is yytext. So you might as well use that value instead of computing strlen(yytext), which involves a scan over the text.)
You never null-terminate temporal. That's OK the first time, because it has static lifetime and will therefore be filled with zeros at program initialization. But the second and subsequent times you are counting on the new filename to not be shorter than the previous one; otherwise, you'll end up with part of the previous name at the end of the new name.
You could have made much better use of the standard C library. Although for reasons I will note below, this does not solve the problem you observe, it would have been better to use the following instead of the loop, after checking that yyleng is not too big:
memcpy(temporal, yytext + 1, yyleng - 2); /* Copy the filename */
temporal[yyleng - 2] = '\0'; /* NUL-terminate the copy */
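The length check mentioned above could be folded into a small helper so the copy and the check cannot drift apart. This is only a sketch; copy_between_delims is a hypothetical name, and it assumes the token starts and ends with a one-character delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies text[1 .. len-2] (the part between the
   delimiters) into dst and NUL-terminates it.
   Returns 0 on success, -1 if the token is too short or dst is too small. */
int copy_between_delims(char *dst, size_t dst_size, const char *text, size_t len)
{
    assert(dst != NULL && text != NULL);
    if (len < 2)
        return -1;                 /* no room for two delimiters */
    size_t n = len - 2;            /* characters between the delimiters */
    if (n + 1 > dst_size)
        return -1;                 /* would overflow dst (need room for NUL) */
    memcpy(dst, text + 1, n);
    dst[n] = '\0';
    return 0;
}
```

In the scanner action this would be called as copy_between_delims(temporal, sizeof temporal, yytext, (size_t)yyleng), with a nonzero result reported as an error instead of overwriting memory.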
Once you make the copy in temporal, you give that to newfile:
newfile(temporal);
And in newfile, what we see is:
newptr->filename = filename;
That does not copy filename. The call to newfile passed the address of temporal as an argument, so within newfile, the value of the parameter filename is the address of temporal. You then store that address in newptr->filename, so newptr->filename is also the address of temporal.
But, as noted above, temporal is not temporary. It is a global variable whose lifetime is the entire lifetime of the program. So the next time your lexical scanner encounters an include directive, it will put it into temporal, overwriting the previous contents. So what then happens to the filename member in the yyfilebuffer structure? Answer: nothing. It still points to the same place, temporal, but the contents of that place have changed. So when you later print out the contents of the string pointed to by that filename field, you'll get a different string from the one which happened to be in temporal when you first created that yyfilebuffer structure.
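The effect can be reproduced without flex. In the sketch below, shared_name stands in for temporal and struct entry for yyfilebuffer; both names and aliasing_demo are made up for the illustration.

```c
#include <assert.h>
#include <string.h>

struct entry {
    const char *filename;   /* stores an address, not a copy of the text */
};

static char shared_name[16];  /* stands in for the global `temporal` */

/* Hypothetical demo: returns 1 if the first entry ends up showing the
   second name, because both entries point at the same buffer. */
int aliasing_demo(void)
{
    struct entry first, second;

    strcpy(shared_name, "e.txt");
    first.filename = shared_name;    /* points at the shared buffer */

    strcpy(shared_name, "f.txt");    /* overwrites what `first` points at */
    second.filename = shared_name;

    return strcmp(first.filename, "f.txt") == 0 &&
           first.filename == second.filename;
}
```

Here first.filename was set while the buffer held "e.txt", yet it reads back as "f.txt" afterwards, which matches the printf output in the question.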
On the whole, you'll find it easier to manage memory if newfile and popfile "own" the memory in the filebuffer stack. That means that newfile should make a copy of its argument into freshly-allocated storage, and popfile should free that storage, since it is no longer needed. If newfile makes a copy, then it is not necessary for the lexical-scanner action which calls newfile to make a copy; it is only necessary for it to make sure that the string is correctly NUL-terminated when it calls newfile.
In short, the code might look like this:
/* Changed parameter to const, since we are not modifying its contents */
void newfile(const char *filename) {
/* Eliminated this check as obviously unnecessary: if(top == NULL) */
struct yyfilebuffer *newptr = malloc(sizeof(struct yyfilebuffer));
newptr->prev = top;
// Here we copy filename. Since I suspect that you are on Windows,
// I'll write it out in full. Normally, I'd use strdup.
newptr->filename = malloc(strlen(filename) + 1);
strcpy(newptr->filename, filename);
newptr->f = fopen(filename,"r");
newptr->bs = yy_create_buffer(newptr->f, YY_BUF_SIZE);
top = newptr;
yy_switch_to_buffer(top->bs); //edw
if(top->prev != NULL) {
printf("top->prev points to file : %s\n",top->prev->filename);
}
}
void popfile() {
if(top->prev == NULL) {
fprintf(stderr, "Error : Trying to pop from empty stack\n");
exit(1);
}
struct yyfilebuffer *temp = top;
top = temp->prev;
/* Reclaim memory */
free(temp->filename);
free(temp);
yy_switch_to_buffer(top->bs);
system("pause");
}
Now that newfile makes its own copy of the string passed to it, the scanner action no longer needs to make a copy. Since the action clearly indicates that you expect the argument to the #include to be something like a C #include directive (surrounded either by "..." or <...>), it is better to make that explicit:
<INC>\".+\"$|"<".+">"$ {
/* NUL-terminate the filename by overwriting the trailing delimiter (" or >) */
yytext[yyleng - 1] = '\0';
newfile(yytext + 1);
BEGIN INITIAL;
}
I am building an LZW encoding algorithm that uses a dictionary with hashing, so that working words already stored in the dictionary can be looked up quickly.
The algorithm gives correct results on smaller files (roughly a few hundred symbols) but crashes on larger ones, especially files with few distinct symbols. The worst case is a file consisting of a single symbol, say 'y': it crashes when the dictionary is not even close to full. When the large input file contains more than one symbol, the dictionary gets to approximately 90% full, and then it crashes as well.
Given the structure of my algorithm, I am not sure what causes the crash in general, or why it crashes so early on a large file of just one symbol.
It must be something about the hashing (this is my first time doing it, so it may have bugs).
The hash function I am using can be found here, and from my testing it gives good results: oat_hash
The LZW encoding algorithm is based on this link, with one slight change: it runs only until the dictionary is full: LZW encoder
Let's get into the code:
Note: oat_hash is changed so that it returns value % CAPACITY, so every index is within DICTIONARY
// Globals
#define CAPACITY 100000
char *DICTIONARY[CAPACITY];
unsigned short CODES[CAPACITY]; // CODES and DICTIONARY are linked via index: word from dictionary on index i, has its code in CODES on index i
int position = 0;
int code_counter = 0;
void encode(FILE *input, FILE *output){
int succ1 = fseek(input, 0, SEEK_SET);
if(succ1 != 0) printf("Error: file not open!");
int succ2 = fseek(output, 0, SEEK_SET);
if(succ2 != 0) printf("Error: file not open!");
//1. Working word = next symbol from the input
char *working_word = malloc(2048*sizeof(char));
char new_symbol = getc(input);
working_word[0] = new_symbol;
working_word[1] = '\0';
//2. WHILE(there are more symbols on the input) DO
//3. NewSymbol = next symbol from the input
while((new_symbol = getc(input)) != EOF){
char *workingWord_and_newSymbol= NULL;
char newSymbol[2];
newSymbol[0] = new_symbol;
newSymbol[1] = '\0';
workingWord_and_newSymbol = working_word_and_new_symbol(working_word, newSymbol);
int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));
//4. IF(WorkingWord + NewSymbol) is already in the dictionary THEN
if(DICTIONARY[index] != NULL){
// 5. WorkingWord += NewSymbol
working_word = working_word_and_new_symbol(working_word, newSymbol);
}
//6. ELSE
else{
//7. OUTPUT: code for WorkingWord
int idx = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[idx]);
//8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
if(!dictionary_full()){
DICTIONARY[index] = workingWord_and_newSymbol;
CODES[index] = code_counter + 1;
code_counter += 1;
working_word = strdup(newSymbol);
}else break;
}
//10. END IF
}
//11. END WHILE
//12. OUTPUT: code for WorkingWord
int index = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[index]);
free(working_word);
}
int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));
And later
int idx = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[idx]);
//8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
if(!dictionary_full()){
DICTIONARY[index] = workingWord_and_newSymbol;
CODES[index] = code_counter + 1;
code_counter += 1;
working_word = strdup(newSymbol);
}else break;
idx and index are unbounded, and you use them to access a bounded array, so you're accessing memory out of range. Here's a suggestion, though it may skew the distribution; if your hash range is much larger than CAPACITY, that shouldn't be a problem. You also have another problem, which was already mentioned: collisions. You need to handle them, but that's a separate issue.
int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol)) % CAPACITY;
// and
int idx = oat_hash(working_word, strlen(working_word)) % CAPACITY;
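As for the collisions: the question's code treats any non-NULL slot as "this word is already in the dictionary", so two different words that hash to the same index are confused with each other. One common way to handle this is open addressing with linear probing, comparing the stored key on every probe. The sketch below assumes NUL-terminated keys as in the question; TABLE_CAPACITY, hash_index, find_slot, dict_lookup and dict_insert are hypothetical names, and the hash is a stand-in one-at-a-time implementation rather than the asker's exact oat_hash.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_CAPACITY 1024   /* illustrative; the question uses 100000 */

static char *table_keys[TABLE_CAPACITY];
static unsigned short table_codes[TABLE_CAPACITY];

/* Stand-in hash (Jenkins one-at-a-time), reduced to a table index. */
static size_t hash_index(const char *key)
{
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
        h += *p;
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h % TABLE_CAPACITY;
}

/* Returns the slot holding key, or the first empty slot where it would go.
   Returns -1 if the table is full and key is absent. */
static long find_slot(const char *key)
{
    assert(key != NULL);
    size_t start = hash_index(key);
    for (size_t step = 0; step < TABLE_CAPACITY; step++) {
        size_t i = (start + step) % TABLE_CAPACITY;
        if (table_keys[i] == NULL || strcmp(table_keys[i], key) == 0)
            return (long)i;
    }
    return -1;
}

/* Looks up key; returns 1 and stores its code in *code if present. */
int dict_lookup(const char *key, unsigned short *code)
{
    long i = find_slot(key);
    if (i < 0 || table_keys[i] == NULL)
        return 0;
    *code = table_codes[i];
    return 1;
}

/* Inserts key with code; returns 0 on success, -1 if full or out of memory. */
int dict_insert(const char *key, unsigned short code)
{
    long i = find_slot(key);
    if (i < 0)
        return -1;
    if (table_keys[i] == NULL) {
        char *copy = malloc(strlen(key) + 1);  /* the table owns its keys */
        if (copy == NULL)
            return -1;
        strcpy(copy, key);
        table_keys[i] = copy;
    }
    table_codes[i] = code;
    return 0;
}
```

With this shape, step 4 of the encoder becomes a dict_lookup call on WorkingWord + NewSymbol instead of a NULL check on a single slot.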
LZW compression is certainly used to construct binary files and normally is capable of reading binary files.
The following code is problematic as it relies on new_symbol never being a \0.
newSymbol[0] = new_symbol; newSymbol[1] = '\0';
strlen(workingWord_and_newSymbol)
strdup(newSymbol)
Needs re-write to work with arrays of bytes rather than strings.
fopen() was not shown. Ensure the file is opened in binary mode: input = fopen(..., "rb");
@Wumpus Q. Wumbley is correct, use int newSymbol.
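To illustrate both points, here is a small sketch that reads raw bytes with an int variable, so that EOF stays distinguishable from a 0xFF byte and a 0 byte is treated as ordinary data. read_bytes is a hypothetical helper, not part of the question's code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads up to cap bytes from input into buf and
   returns how many were read. Zero bytes are ordinary data here. */
size_t read_bytes(FILE *input, unsigned char *buf, size_t cap)
{
    assert(input != NULL && buf != NULL);
    size_t n = 0;
    int c;                          /* int, so EOF is distinct from byte 0xFF */

    while (n < cap && (c = getc(input)) != EOF)
        buf[n++] = (unsigned char)c;
    return n;
}
```

Because the length is returned explicitly, the caller never needs strlen, which is what breaks when the input contains a '\0'.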
Minor:
new_symbol and newSymbol are confusing.
Consider:
// char *working_word = malloc(2048*sizeof(char));
#define WORKING_WORD_N (2048)
char *working_word = malloc(WORKING_WORD_N*sizeof(*working_word));
// or
char *working_word = malloc(WORKING_WORD_N);