LZW encoding for large files - C

I am building an LZW encoding algorithm that uses a dictionary with hashing, so lookups for working words already stored in the dictionary are fast.
The algorithm gives correct results when run on smaller files (circa a few hundred symbols), but it crashes on larger files, especially files that contain fewer distinct symbols. For example, it performs worst on a file consisting of only one symbol, say 'y': it crashes when the dictionary is not even close to full. However, when a large input file consists of more than one symbol, the dictionary gets close to full, approximately 90%, but then it crashes anyway.
Considering the structure of my algorithm, I am not quite sure what is causing it to crash in general, or why it crashes so soon when a large file of just one symbol is given.
It must be something about the hashing (this is my first time doing it, so it might have some bugs).
The hash function I am using can be found here, and from what I have tested, it gives good results: oat_hash
The LZW encoding algorithm is based on this link, with the slight change that it only runs while the dictionary is not full: LZW encoder
Let's get into code:
Note: oat_hash is changed so that it returns value % CAPACITY, so every index falls within DICTIONARY
// Globals
#define CAPACITY 100000
char *DICTIONARY[CAPACITY];
unsigned short CODES[CAPACITY]; // CODES and DICTIONARY are linked via index: the word in DICTIONARY at index i has its code in CODES at index i
int position = 0;
int code_counter = 0;

void encode(FILE *input, FILE *output){
    int succ1 = fseek(input, 0, SEEK_SET);
    if(succ1 != 0) printf("Error: file not open!");

    int succ2 = fseek(output, 0, SEEK_SET);
    if(succ2 != 0) printf("Error: file not open!");

    //1. Working word = next symbol from the input
    char *working_word = malloc(2048*sizeof(char));
    char new_symbol = getc(input);
    working_word[0] = new_symbol;
    working_word[1] = '\0';

    //2. WHILE(there are more symbols on the input) DO
    //3. NewSymbol = next symbol from the input
    while((new_symbol = getc(input)) != EOF){
        char *workingWord_and_newSymbol = NULL;
        char newSymbol[2];
        newSymbol[0] = new_symbol;
        newSymbol[1] = '\0';

        workingWord_and_newSymbol = working_word_and_new_symbol(working_word, newSymbol);
        int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));

        //4. IF(WorkingWord + NewSymbol) is already in the dictionary THEN
        if(DICTIONARY[index] != NULL){
            // 5. WorkingWord += NewSymbol
            working_word = working_word_and_new_symbol(working_word, newSymbol);
        }
        //6. ELSE
        else{
            //7. OUTPUT: code for WorkingWord
            int idx = oat_hash(working_word, strlen(working_word));
            fprintf(output, "%u", CODES[idx]);

            //8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
            if(!dictionary_full()){
                DICTIONARY[index] = workingWord_and_newSymbol;
                CODES[index] = code_counter + 1;
                code_counter += 1;
                working_word = strdup(newSymbol);
            }else break;
        }
        //10. END IF
    }
    //11. END WHILE

    //12. OUTPUT: code for WorkingWord
    int index = oat_hash(working_word, strlen(working_word));
    fprintf(output, "%u", CODES[index]);
    free(working_word);
}

int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));
And later
int idx = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[idx]);
//8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
if(!dictionary_full()){
    DICTIONARY[index] = workingWord_and_newSymbol;
    CODES[index] = code_counter + 1;
    code_counter += 1;
    working_word = strdup(newSymbol);
}else break;
idx and index are unbounded, and you use them to access a bounded array, so you are accessing memory out of range. Here's a suggestion, though it may skew the distribution; if your hash range is much larger than CAPACITY, it shouldn't be a problem. But you also have another problem, which was mentioned: collisions. You need to handle them, but that's a different problem; a sketch of one way to probe past them follows the snippet below.
int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol)) % CAPACITY;
// and
int idx = oat_hash(working_word, strlen(working_word)) % CAPACITY;
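A minimal sketch of one way to handle those collisions, using linear probing over the question's globals (oat_hash, DICTIONARY, and CAPACITY are assumed from above; the probe also assumes the table is never completely full, which the dictionary_full() check guards):

#include <string.h>

// Sketch: walk forward from the hashed slot until we find the exact
// word or an empty slot, so two different words can no longer
// silently share one index.
int find_slot(const char *word)
{
    int index = oat_hash(word, strlen(word)) % CAPACITY;
    while (DICTIONARY[index] != NULL &&
           strcmp(DICTIONARY[index], word) != 0)
        index = (index + 1) % CAPACITY; // collision: try the next slot
    return index;
}

The membership test then becomes DICTIONARY[find_slot(workingWord_and_newSymbol)] != NULL, and the insert stores into that same returned slot, so a mere hash collision no longer looks like a dictionary hit.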

LZW compression is certainly used to construct binary files, and it normally needs to be capable of reading binary files as well.
The following code is problematic, as it relies on new_symbol never being a '\0'.
newSymbol[0] = new_symbol; newSymbol[1] = '\0';
strlen(workingWord_and_newSymbol)
strdup(newSymbol)
This needs a rewrite to work with arrays of bytes (and explicit lengths) rather than strings.
fopen() was not shown. Ensure the input is opened in binary: input = fopen(..., "rb");
@Wumpus Q. Wumbley is correct: use an int for new_symbol.
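A minimal sketch of a binary-safe read loop under that advice, assuming the file is opened with "rb"; getc()'s result is kept in an int so EOF can be distinguished from a valid 0x00 or 0xFF byte, and the length is tracked explicitly instead of with strlen():

#include <stdio.h>

// Sketch: read raw bytes with an explicit length, no '\0' reliance.
size_t read_bytes(FILE *input, unsigned char *buf, size_t cap)
{
    int c;                                // int, not char, to hold EOF
    size_t len = 0;
    while (len < cap && (c = getc(input)) != EOF)
        buf[len++] = (unsigned char)c;    // 0x00 is just another byte
    return len;                           // caller uses len, not strlen
}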
Minor:
new_symbol and newSymbol are confusing.
Consider:
// char *working_word = malloc(2048*sizeof(char));
#define WORKING_WORD_N (2048)
char *working_word = malloc(WORKING_WORD_N*sizeof(*working_word));
// or
char *working_word = malloc(WORKING_WORD_N);


C Program to count keywords from a keyword text file in a fake resume and display the result

EDIT: I think the problem was that I put my 2 text files on the desktop. I moved them to the same place as the source file and now the file opens. But the program still cannot run; this time the line:
coK = 0;
shows "exception thrown".
// end EDIT
I have an assignment at school to write a C program that creates 2 text files. One file stores 25 keywords, and the other stores the fake resume. The problem is, my program cannot read my keywords.txt file. Can anyone help me? Thank you so much.
#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//*****************************************************
// MAIN FUNCTION
int main()
{
    //File pointer for the resume.txt file
    FILE *fpR;
    //File pointer for the keywords.txt file, opened in read mode
    FILE *fpK = fopen("keywords.txt", "r");
    //To store a character extracted from the keywords.txt file
    char cK;
    //To store a character extracted from the resume.txt file
    char cR;
    //To store a word extracted from the keywords.txt file
    char wordK[50];
    //To store a word extracted from the resume.txt file
    char wordR[50];
    //To store the keywords
    char keywords[10][50];
    //To store the keyword counters, initialized to zero
    int keywordsCount[10] = { 0, 0, 0, 0 };
    int coK, coR, r, r1;
    coK = coR = r = r1 = 0;

    //If the file could not be opened, display an error message
    if (fpK == NULL)
    {
        printf("Could not open files");
        exit(0);
    }//End of if

    //Extract a character from keywords.txt into cK and loop till end of file
    while ((cK = fgetc(fpK)) != EOF)
    {
        //Check whether the character is a comma
        if (cK != ',')
        {
            //Store the character in wordK at index coK
            wordK[coK] = cK;
            //Increase the counter coK by one
            coK++;
        }//End of if
        //If it is a comma
        else
        {
            //Store a null character
            wordK[coK] = '\0';
            //Copy wordK to keywords at index r and increase the counter r by one
            strcpy(keywords[r++], wordK);
            //Reinitialize the counter to zero for the next word
            coK = 0;
            fpR = fopen("resume.txt", "r");
            //Extract a character from resume.txt into cR and loop till end of file
            while ((cR = fgetc(fpR)) != EOF)
            {
                //Check whether the character is a space
                if (cR != ' ')
                {
                    //Store the character in wordR at index coR
                    wordR[coR] = cR;
                    //Increase the counter coR by one
                    coR++;
                }//End of if
                else
                {
                    //Store a null character
                    wordR[coR] = '\0';
                    //Reinitialize the counter to zero for the next word
                    coR = 0;
                    //Compare the word from keywords.txt with the word from resume.txt
                    if (strcmp(wordK, wordR) == 0)
                    {
                        //If both words are the same, increase keywordsCount at index r1 by one
                        keywordsCount[r1] += 1;
                    }//End of if
                }//End of else
            }//End of inner while loop
            //Increase the counter by one
            r1++;
            //Close the resume file
            fclose(fpR);
        }//End of else
    }//End of outer while loop
    //Close the keywords file
    fclose(fpK);

    //Display the result
    printf("\n Result \n");
    for (r = 0; r < r1; r++)
        printf("\n Keyword: %s %d time available", keywords[r], keywordsCount[r]);
}//End of main
I think the problem is the text files, isn't it?
The name of my 1st test file is "keywords.txt", and its content is:
Java, CSS, HTML, XHTML, MySQL, College, University, Design, Development, Security, Skills, Tools, C, Programming, Linux, Scripting, Network, Windows, NT
The name of my 2nd test file is "resume.txt", and its content is:
Junior Web developer able to build a Web presence from the ground up -- from concept, navigation, layout, and programming to UX and SEO. Skilled at writing well-designed, testable, and efficient code using current best practices in Web development. Fast learner, hard worker, and team player who is proficient in an array of scripting languages and multimedia Web tools. (Something like this).
I don't see any problem with these 2 files. But my program still cannot open the file, and the output keeps showing "Could not open files".
while ((cK = fgetc(fpK)) != EOF)
If you check the documentation, you can see that fgetc returns an int. But since cK is a char, you force a conversion to char, which can change its value. You then compare the possibly changed value to EOF, which is not correct. You need to compare the value that fgetc returns to EOF directly, since fgetc returns EOF on end of file.
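A minimal sketch of the corrected loop (same fpK handle as above); fgetc's result stays in an int for the EOF comparison and is narrowed to char only afterwards:

#include <stdio.h>

// Sketch: compare fgetc()'s int result to EOF before narrowing it.
void read_keywords(FILE *fpK)
{
    int ch;                     // int, so EOF is representable
    while ((ch = fgetc(fpK)) != EOF)
    {
        char cK = (char)ch;     // safe to narrow after the EOF check
        // ... process cK exactly as the original loop does ...
    }
}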

Unexpected Output - Storing into 2D array in C

I am reading data from a number of files, each containing a list of words. I am trying to display the number of words in each file, but I am running into issues. For example, when I run my code, I receive the output as shown below.
Almost every amount is correctly displayed with the exception of two files, each containing word counts in the thousands. Every other file only has three digits worth of words, and they seem just fine.
I can only guess what this problem could be (not enough space allocated somewhere?) and I do not know how to solve it. I apologize if this is all poorly worded. My brain is fried and I am struggling. Any help would be appreciated.
I've tried to keep my example code as brief as possible. I've cut out a lot of error checking and other tasks related to the full program. I've also added comments where I can. Thanks.
StopWords.c
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

typedef struct
{
    char stopwords[2000][60];
    int wordcount;
} LangData;

typedef struct
{
    int languageCount;
    LangData languages[];
} AllData;

int main(int argc, char **argv)
{
    //Initialize data structures and open path directory
    int langCount = 0;
    DIR *d;
    struct dirent *ep;
    d = opendir(argv[1]);

    //Count the number of language files in the directory
    while(readdir(d))
        langCount++;

    //Account for "." and ".." in directory
    //langCount = langCount - 2 THIS MAKES SENSE RIGHT?
    langCount = langCount + 1; //The program crashes if I don't do this, which doesn't make sense to me.

    //Allocate space in AllData for languageCount
    AllData *data = malloc(sizeof(AllData) + sizeof(LangData)*langCount); //Unsure? Seems to work.

    //Reset the directory in preparation for reading data
    rewinddir(d);

    //Copy all words into respective arrays.
    char word[60];
    int i = 0;
    int k = 0;
    int j = 0;
    while((ep = readdir(d)) != NULL) //Probably could've used for loops to make this cleaner. Oh well.
    {
        if (!strcmp(ep->d_name, ".") || !strcmp(ep->d_name, ".."))
        {
            //Filtering "." and ".."
        }
        else
        {
            FILE *entry;
            //Get string for path (I should make this a function)
            char fullpath[100];
            strcpy(fullpath, argv[1]); //directory path passed on the command line
            strcat(fullpath, "\\");
            strcat(fullpath, ep->d_name);
            entry = fopen(fullpath, "r");

            //Read all words from file
            while(fgets(word, 60, entry) != NULL)
            {
                j = 0;
                //Store each word one character at a time (better way?)
                while(word[j] != '\0') //Check for end of word
                {
                    data->languages[i].stopwords[k][j] = word[j];
                    j++; //Move onto next character
                }
                k++; //Move onto next word
                data->languages[i].wordcount++;
            }
            //Display number of words in file
            printf("%d\n", data->languages[i].wordcount);
            i++; //Increment index in preparation for next language file.
            fclose(entry);
        }
    }
}
Output
256 //czech.txt: Correct
101 //danish.txt: Correct
101 //dutch.txt: Correct
547 //english.txt: Correct
1835363006 //finnish.txt: Should be 1337. Of course it's 1337.
436 //french.txt: Correct
576 //german.txt: Correct
737 //hungarian.txt: Correct
683853 //icelandic.txt: Should be 1000.
399 //italian.txt: Correct
172 //norwegian.txt: Correct
269 //polish.txt: Correct
437 //portugese.txt: Correct
282 //romanian.txt: Correct
472 //spanish.txt: Correct
386 //swedish.txt: Correct
209 //turkish.txt: Correct
Do the files have more than 2000 words? You have only allocated space for 2000 words, so once your program tries to copy word 2001 it will be writing outside the memory allocated for that array, possibly into the space allocated for wordcount. A guarded copy is sketched below.
Also, I want to point out that fgets returns a string up to the end of the line or at most n characters (60 in your case), whichever comes first. This will work fine if there is only one word per line in the files you are reading from; otherwise you will have to locate spaces within the string and count words from there.
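A hedged sketch of guarding that copy, using the question's variables (k, i, word, data); it simply skips words beyond the 2000-slot capacity, though you could also grow the allocation instead:

// Sketch: refuse to copy past the 2000 stopword slots (string.h needed).
if (k < 2000)
{
    strncpy(data->languages[i].stopwords[k], word, 59);
    data->languages[i].stopwords[k][59] = '\0'; // always terminate
    k++;
    data->languages[i].wordcount++;
}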
If you are simply trying to get a word count, then there is no need to store all the words in an array in the first place. Assuming one word per line, the following should work just as well:
char word[60];
while(fgets(word, 60, entry) != NULL)
{
    data->languages[i].wordcount++;
}
fgets reference- http://www.cplusplus.com/reference/cstdio/
Update
I took another look and you might want to try allocating data as follows:
typedef struct
{
    char stopwords[2000][60];
    int wordcount;
} LangData;

typedef struct
{
    int languageCount;
    LangData *languages;
} AllData;

AllData *data = malloc(sizeof(AllData));
data->languages = malloc(sizeof(LangData)*langCount);
This way memory is being specifically allocated for the languages array.
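A related sketch, assuming the pointer-based structs above: calloc zero-initializes each wordcount (malloc leaves them as garbage, which would add a random starting value to every count), and both blocks must eventually be freed, the inner one first:

// Sketch: zeroed two-step allocation and the matching cleanup.
AllData *data = calloc(1, sizeof(AllData));
data->languages = calloc(langCount, sizeof(LangData)); // counts start at 0
data->languageCount = langCount;

/* ... read the files as before ... */

free(data->languages); // free the inner block first
free(data);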
I agree that langCount = langCount - 2 makes sense. What error are you getting?

Two-dimensional char array too large: exit code 139

Hey guys, I'm attempting to read in workersinfo.txt and store it in a two-dimensional char array. The file is around 4,000,000 lines with around 100 characters per line. I want to store each file line in the array. Unfortunately, I get exit code 139 (not enough memory). I'm aware I have to use malloc() and free(), but I've tried a couple of things and I haven't been able to make them work. Eventually I have to sort the array by ID number, but I'm stuck on declaring the array.
The file looks something like this:
First Name, Last Name,Age, ID
Carlos,Lopez,,10568
Brad, Patterson,,20586
Zack, Morris,42,05689
This is my code so far:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *ptr_file;
    char workers[4000000][1000];

    ptr_file = fopen("workersinfo.txt","r");
    if (!ptr_file)
        perror("Error");

    int i = 0;
    while (fgets(workers[i],1000, ptr_file)!=NULL){
        i++;
    }

    int n;
    for(n = 0; n < 4000000; n++)
    {
        printf("%s", workers[n]);
    }

    fclose(ptr_file);
    return 0;
}
The Stack memory is limited. As you pointed out in your question, you MUST use malloc to allocate such a big (need I say HUGE) chunk of memory, as the stack cannot contain it.
You can use ulimit to review your system's limits (usually including the stack size limit).
On my Mac, the limit is 8Mb. After running ulimit -a I get:
...
stack size (kbytes, -s) 8192
...
Or, test the limit programmatically using:

#include <sys/resource.h>

struct rlimit rlim;
getrlimit(RLIMIT_STACK, &rlim);
// rlim.rlim_cur is the current (soft) stack limit in bytes
I truly recommend you process each database entry separately.
As mentioned in the comments, assigning the memory as static memory would, in most implementations, circumvent the stack.
Still, IMHO, allocating 400MB of memory (or 4GB, depending on which part of your question I look at) is bad form unless totally required, especially for a single function; a heap-allocation sketch follows.
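For completeness, a minimal sketch of moving the question's array off the stack with one heap allocation; note this still requests roughly 4 GB (4,000,000 rows of 1000 bytes), so it needs a 64-bit system, the NULL check matters, and the per-entry processing below is usually the better route:

#include <stdio.h>
#include <stdlib.h>

#define ROWS 4000000
#define COLS 1000

int main(void) {
    // pointer to an array of COLS chars: one heap block, 2D indexing
    char (*workers)[COLS] = malloc(sizeof *workers * ROWS);
    if (!workers) {
        perror("malloc"); // ~4 GB may simply not be available
        return 1;
    }
    // ... fgets(workers[i], COLS, ptr_file) as in the question ...
    free(workers);
    return 0;
}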
Follow-up Q1: How to deal with each DB entry separately
I hope I'm not doing your homework or anything... but I doubt your homework would include an assignment to load 400MB of data into the computer's memory... so... to answer the question in your comment:
The following sketch of single-entry processing isn't perfect; it's limited to 1KB of data per entry (which I thought to be more than enough for such simple data).
Also, I didn't allow for UTF-8 encoding or anything like that (I followed the assumption that English would be used).
As you can see from the code, we read each line separately and perform error checks to check that the data is valid.
To sort the file by ID, you might consider either reading two lines at a time (this would be a slow sort) and sorting them, or creating a sorted node tree with the ID data and the position of the line in the file (get the position before reading the line). Once you have sorted the binary tree, you can sort the data...
... The binary tree might get a bit big. Did you look up sorting algorithms?
#include <stdio.h>

// assuming this is the file structure:
//
//     First Name, Last Name,Age, ID
//     Carlos,Lopez,,10568
//     Brad, Patterson,,20586
//     Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
    char* last_name;       // a pointer to the last name
    char* age;             // a pointer to the age - could probably be an int
    char* id;              // a pointer to the ID
    char first_name[1024]; // the actual buffer...
    // I unified the first name and the buffer since the first name is first.
};

// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 1 on success or 0 on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
    if (!fgets(line->first_name, 1024, fp))
        return 0;
    // parse data and review for possible overflow.
    // first, zero out data
    int pos = 0;
    line->age = NULL;
    line->id = NULL;
    line->last_name = NULL;
    // read each byte, looking for the EOL marker and the ',' separators
    while (pos < 1024) {
        if (line->first_name[pos] == ',') {
            // we encountered a divider. we should handle it.
            // if the ID field's location is already known, we have an excess comma.
            if (line->id) {
                fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
                return 0;
            }
            // replace the comma with 0 (separate the strings)
            line->first_name[pos] = 0;
            if (line->age)
                line->id = line->first_name + pos + 1;
            else if (line->last_name)
                line->age = line->first_name + pos + 1;
            else
                line->last_name = line->first_name + pos + 1;
        } else if (line->first_name[pos] == '\n') {
            // we encountered a terminator. we should handle it.
            if (line->id) {
                // if we have the id string's position (the start marker), this is a
                // valid entry and we should process the data.
                line->first_name[pos] = 0;
                return 1;
            } else {
                // we reached an EOL without enough ',' separators, this is an
                // invalid line.
                fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
                return 0;
            }
        }
        pos++;
    }
    // we ran through all the data but there was no EOL marker...
    fprintf(stderr,
            "Parsing error, invalid data (data overflow or data too large).\n");
    return 0;
}
// the main program
int main(int argc, char const* argv[]) {
    // open file
    FILE* ptr_file;
    ptr_file = fopen("workersinfo.txt", "r");
    if (!ptr_file)
        perror("File Error");

    struct DBEntry line;
    while (read_db_line(ptr_file, &line)) {
        // do what you want with the data... print it?
        printf(
            "First name:\t%s\n"
            "Last name:\t%s\n"
            "Age:\t\t%s\n"
            "ID:\t\t%s\n"
            "--------\n",
            line.first_name, line.last_name, line.age, line.id);
    }
    // close file
    fclose(ptr_file);
    return 0;
}
Follow-up Q2: Sorting an array of 400MB-4GB of data
IMHO, 400MB is already touching on the issues related to big data. For example, implementing a bubble sort on your database would be agonizing as far as performance goes (unless it's a one-time task, where performance might not matter).
Creating an array of DBEntry objects will eventually give you a larger memory footprint than the actual data.
This will not be the optimal way to sort large data.
The correct approach will depend on your sorting algorithm. Wikipedia has a decent primer on sorting algorithms.
Since we are handling a large amount of data, there are a few things to consider:
It would make sense to partition the work, so different threads/processes sort different sections of the data.
We will need to minimize IO to the hard drive (as it will slow the sorting significantly and prevent parallel processing on the same machine/disk).
One possible approach is to create a heap for a heap sort, but only storing a priority value and the original position in the file.
Another option would probably be to employ a divide-and-conquer algorithm, such as quicksort, again only sorting a computed sort value and the entry's position in the original file.
Either way, writing a decent sorting method will be a complicated task, probably involving threading, forking, temp files or other techniques.
Here's simplified demo code... it is far from optimized, but it demonstrates the idea of the binary sort-tree that holds the sorting value and the position of the data in the file.
Be aware that using this code will be both relatively slow (although not that slow) and memory intensive...
On the other hand, it will require about 24 bytes per entry. For 4 million entries, that's 96MB, somewhat better than 400MB and definitely better than the 4GB.
#include <stdlib.h>
#include <stdio.h>

// assuming this is the file structure:
//
//     First Name, Last Name,Age, ID
//     Carlos,Lopez,,10568
//     Brad, Patterson,,20586
//     Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
    char* last_name;       // a pointer to the last name
    char* age;             // a pointer to the age - could probably be an int
    char* id;              // a pointer to the ID
    char first_name[1024]; // the actual buffer...
    // I unified the first name and the buffer since the first name is first.
};

// this might be a sorting node for a sorted bin-tree:
struct SortNode {
    struct SortNode* next; // a pointer to the next node
    fpos_t position;       // the DB entry's position in the file
    long value;            // The computed sorting value
}* top_sorting_node = NULL;

// this function will free all the memory used by the global sorting tree
void clear_sort_heap(void) {
    struct SortNode* node;
    // as long as there is a first node...
    while ((node = top_sorting_node)) {
        // step forward.
        top_sorting_node = top_sorting_node->next;
        // free the original first node's memory
        free(node);
    }
}
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 0 on success or -1 on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
    if (!fgets(line->first_name, 1024, fp))
        return -1;
    // parse data and review for possible overflow.
    // first, zero out data
    int pos = 0;
    line->age = NULL;
    line->id = NULL;
    line->last_name = NULL;
    // read each byte, looking for the EOL marker and the ',' separators
    while (pos < 1024) {
        if (line->first_name[pos] == ',') {
            // we encountered a divider. we should handle it.
            // if the ID field's location is already known, we have an excess comma.
            if (line->id) {
                fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
                clear_sort_heap();
                exit(2);
            }
            // replace the comma with 0 (separate the strings)
            line->first_name[pos] = 0;
            if (line->age)
                line->id = line->first_name + pos + 1;
            else if (line->last_name)
                line->age = line->first_name + pos + 1;
            else
                line->last_name = line->first_name + pos + 1;
        } else if (line->first_name[pos] == '\n') {
            // we encountered a terminator. we should handle it.
            if (line->id) {
                // if we have the id string's position (the start marker), this is a
                // valid entry and we should process the data.
                line->first_name[pos] = 0;
                return 0;
            } else {
                // we reached an EOL without enough ',' separators, this is an
                // invalid line.
                fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
                clear_sort_heap();
                exit(1);
            }
        }
        pos++;
    }
    // we ran through all the data but there was no EOL marker...
    fprintf(stderr,
            "Parsing error, invalid data (data overflow or data too large).\n");
    return -1;
}
// read and sort a single line from the database.
// return 0 if there was no data to sort. return 1 if data was read and sorted.
int sort_line(FILE* fp) {
    // allocate the memory for the node - use calloc for zeroed-out data
    struct SortNode* node = calloc(1, sizeof(*node));
    // store the position in the file
    fgetpos(fp, &node->position);
    // use a stack-allocated DBEntry for processing
    struct DBEntry line;
    // check that the read succeeded (read_db_line returns -1 on error)
    if (read_db_line(fp, &line)) {
        // free the node's memory
        free(node);
        // return no data (0)
        return 0;
    }
    // compute sorting value - I'll assume all IDs are numbers up to long size.
    sscanf(line.id, "%ld", &node->value);
    // heap sort?
    // This is a questionable sort algorithm... or a questionable implementation.
    // Also, I'll be using pointers to pointers, so it might be a headache to read
    // (it's a headache to write, too...) ;-)
    struct SortNode** tmp = &top_sorting_node;
    // move up the list until we encounter something we're smaller than,
    // OR until the list is finished.
    while (*tmp && (*tmp)->value <= node->value)
        tmp = &((*tmp)->next);
    // update the node's `next` value.
    node->next = *tmp;
    // inject the new node at the position we found
    *tmp = node;
    // return 1 (data was read and sorted)
    return 1;
}
// writes the next line in the sorted order
int write_line(FILE* to, FILE* from) {
    struct SortNode* node = top_sorting_node;
    if (!node) // are we done? top_sorting_node == NULL ?
        return 0; // return 0 - no data to write
    // step top_sorting_node forward
    top_sorting_node = top_sorting_node->next;
    // read data from one file to the other
    fsetpos(from, &node->position);
    char* buffer = NULL;
    ssize_t length;
    size_t buff_size = 0;
    length = getline(&buffer, &buff_size, from);
    if (length <= 0) {
        perror("Line Copy Error - Couldn't read data");
        free(node); // the popped node is off the list; free it here
        return 0;
    }
    fwrite(buffer, 1, length, to);
    free(buffer); // getline allocates memory that we're in charge of freeing.
    free(node);   // done with this node as well
    return 1;
}
// the main program
int main(int argc, char const* argv[]) {
    // open files
    FILE *fp_read, *fp_write;
    fp_read = fopen("workersinfo.txt", "r");
    fp_write = fopen("sorted_workersinfo.txt", "w+");
    if (!fp_read) {
        perror("File Error");
        goto cleanup;
    }
    if (!fp_write) {
        perror("File Error");
        goto cleanup;
    }

    printf("\nSorting");
    while (sort_line(fp_read))
        printf(".");

    // write all sorted data to a new file
    printf("\n\nWriting sorted data");
    while (write_line(fp_write, fp_read))
        printf(".");

    // clean up - close files and make sure the sorting tree is cleared
cleanup:
    printf("\n");
    if (fp_read)
        fclose(fp_read);
    if (fp_write)
        fclose(fp_write);
    clear_sort_heap();
    return 0;
}

fread returns no data in my buffer in spite of saying it read 4096 bytes

I'm porting some C code that loads sprites from files containing multiple bitmaps. Basically the code fopens the file, fgetcs some header info, then freads the bitmap data. I can see that the fgetcs are returning proper data, but the outcome of the fread is null. Here's the code - fname does exist, the path is correct, fil is non-zero, num is the number of sprites in the file (encoded into the header, little-endian), pak is an array of sprites, sprite is a typedef of width, height and bits, and new_sprite inits one for you.
FILE *fil;
uint8 *buffu;
uint8 read;
int32 x,num;
int32 w,h,c;

fil = fopen(fname, "rb");
if (!fil) return NULL;

num = fgetc(fil);
num += fgetc(fil)*256;
if (num > max) max = num;

for (x=0;x<max;x++) {
    // header
    w=fgetc(fil);
    w+=fgetc(fil)*256;
    h=fgetc(fil);
    h+=fgetc(fil)*256;
    fgetc(fil); // stuff we don't use
    fgetc(fil);
    fgetc(fil);
    fgetc(fil);
    // body
    buffu = (uint8*)malloc(w * h);
    read=fread(buffu,1,w*h,fil);
    pak->spr[x]=new_sprite(w,h);
    memcpy(pak->spr[x]->data, buffu, w*h);
    // done
    free(buffu);
}
I've stepped through this code line by line, and I can see that w and h are getting set up properly, and read=4096, which is the right number of bytes. However, buffer is "" after the fread, so of course memcpy does nothing useful and my pak is filled with empty sprites.
My apologies for what is surely a totally noob question, but I normally use Cocoa so this pure-C file handling is new to me. I looked all over for examples of fread, and they all look like the one here - which apparently works fine on Win32.
Since fgetc seems to work, you could try this as a test
int each;
int byte;

//body
buffu = malloc(w * h);
for (each = 0; each < w*h; each++) {
    byte = fgetc(fil);
    if (byte == EOF) {
        printf("End of file\n");
        break;
    }
    buffu[each] = (uint8)byte;
    printf("byte: %d each: %d\n", byte, each);
}

pak->spr[x]=new_sprite(w,h);
memcpy(pak->spr[x]->data, buffu, w*h);
// done
You say:
However, buffer is "" after the fread, so of course memcpy does nothing useful
But that is not true at all. memcpy() is not a string function, it will copy the requested number of bytes. Every time. If that isn't "useful", then something else is wrong.
Your buffer, when treated as a string (which it is not; it's a bunch of binary data), will look like an empty string if the first byte happens to be 0. The remaining 4095 bytes can be anything; to C's string-printing functions it will look "empty".
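A quick way to verify this (a sketch; buffu and the 4096 size come from the question) is to dump the first bytes in hex instead of relying on string output:

#include <stdio.h>

// Sketch: print a buffer prefix as hex so a leading 0x00 byte is visible.
void dump_prefix(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        printf("%02x ", buf[i]); // a first byte of 00 explains the "" look
    printf("\n");
}

// usage right after the fread: dump_prefix(buffu, 16);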

fgetc not starting at beginning of file - C [duplicate]

This question already has an answer here: fgetc not starting at beginning of large txt file (1 answer). Closed 9 years ago.
Problem solved here: fgetc not starting at beginning of large txt file
I am working in C, and fgetc isn't getting chars from the beginning of the file. It seems to start somewhere random within the file, after a \n. The goal of this function is to modify the array productsPrinted: if "More Data Needed" or "Hidden non listed" is encountered, the position in the array, productsPrinted[newLineCount], will be changed to 0. Any help is appreciated.
Update: It works on smaller files, but doesn't start at the beginning of the larger, 617 KB, file.
Function calls up to the category step:
findNoPics(image, productsPrinted);
findVisible(visible, productsPrinted);
removeCategories(category, productsPrinted);
example input from fgetc():
Category\n
Diagnostic & Testing /Scan Tools\n
Diagnostic & Testing /Scan Tools\n
Hidden non listed\n
Diagnostic & Testing /Scan Tools\n
Diagnostic & Testing /Scan Tools\n
Hand Tools/Open Stock\n
Hand Tools/Sockets and Drive Sets\n
More Data Needed\n
Hand Tools/Open Stock\n
Hand Tools/Open Stock\n
Hand Tools/Open Stock\n
Shop Supplies & Equip/Tool Storage\n
Hidden non listed\n
Shop Supplies & Equip/Heaters\n
Code:
void removeCategories(FILE *category, int *prodPrinted){
    char more[17] = { '\0' }, hidden[18] = { '\0' };
    int newLineCount = 0, i, ch = 'a', fix = 0;

    while ((ch = fgetc(category)) != EOF){ //if fgetc is outside while, it works//
        more[15] = hidden[16] = ch;
        printf("%c", ch);

        /*shift chars in each list <- one*/
        for (i = 0; i < 17; i++){
            if (i < 17){
                hidden[i] = hidden[i + 1];
            }
            if (i < 16){
                more[i] = more[i + 1];
            }
        }

        if (strcmp(more, "More Data Needed") == 0 || strcmp(hidden, "Hidden non listed") == 0){
            prodPrinted[newLineCount] = 0;
            /*printf("%c", more[0]);*/
        }

        if (ch == '\n'){
            newLineCount++;
        }
    }
}
Let computers do the counting. You have not null-terminated your strings properly. If the fixed strings (mdn and hnl here) are initialized without their null terminators, string comparisons using them are undefined.
Given this sample data:
Example 1
More Data Needed
Hidden non listed
Example 2
Keeping lines short.
But as they get longer, the overwrite is worse...or is it?
Hidden More Data Needed in a longer line.
Lines containing "Hidden non listed" are zapped.
Example 3
This version of the program:
#include <stdio.h>
#include <string.h>

static
void removeCategories(FILE *category, int *prodPrinted)
{
    char more[17] = { '\0' };
    char hidden[18] = { '\0' };
    char mdn[17] = { "More Data Needed" };
    char hnl[18] = { "Hidden non listed" };
    int newLineCount = 0, i, ch = '\0';

    do
    {
        /*shift chars in each list <- one*/
        for (i = 0; i < 18; i++)
        {
            if (i < 17)
                hidden[i] = hidden[i + 1];
            if (i < 16)
                more[i] = more[i + 1];
        }
        more[15] = hidden[16] = ch = fgetc(category);
        if (ch == EOF)
            break;
        printf("%c", ch); /*testing here, starts rndmly in file*/
        //printf("<<%c>> ", ch);
        //printf("more <<%s>> hidden <<%s>>\n", more, hidden);
        if (strcmp(more, mdn) == 0 || strcmp(hidden, hnl) == 0)
        {
            prodPrinted[newLineCount] = 0;
        }
        if (ch == '\n')
        {
            newLineCount++;
        }
    } while (ch != EOF);
}

int main(void)
{
    int prod[10];
    for (int i = 0; i < 10; i++)
        prod[i] = 37;
    removeCategories(stdin, prod);
    for (int i = 0; i < 10; i++)
        printf("%d: %d\n", i, prod[i]);
    return 0;
}
produces this output:
Example 1
More Data Needed
Hidden non listed
Example 2
Keeping lines short.
But as they get longer, the overwrite is worse...or is it?
Hidden More Data Needed in a longer line.
Lines containing "Hidden non listed" are zapped.
Example 3
0: 37
1: 0
2: 0
3: 37
4: 37
5: 37
6: 0
7: 0
8: 37
9: 37
You should check which mode you opened the file in, and do some error checking to make sure you got the right return value.
You can refer to man fopen to see how each mode positions the stream.
The fopen() function opens the file whose name is the string pointed to by path and associates a stream with it.
The argument mode points to a string beginning with one of the following sequences (additional characters may follow these sequences):
r   Open text file for reading. The stream is positioned at the beginning of the file.
r+  Open for reading and writing. The stream is positioned at the beginning of the file.
w   Truncate file to zero length or create text file for writing. The stream is positioned at the beginning of the file.
w+  Open for reading and writing. The file is created if it does not exist, otherwise it is truncated. The stream is positioned at the beginning of the file.
a   Open for appending (writing at end of file). The file is created if it does not exist. The stream is positioned at the end of the file.
a+  Open for reading and appending (writing at end of file). The file is created if it does not exist. The initial file position for reading is at the beginning of the file, but output is always appended to the end of the file.
Note also that the file you operate on should not be larger than 2 GB, or there may be problems.
You can use fseek to set the file position indicator explicitly.
And you can use a debugger to watch these variables and see why there are random values; I think debugging is more efficient than trace output.
Maybe you can try rewinding the file pointer at the beginning of your function.
rewind(category);
Most likely another function is reading from the same file. If this solves your problem, it would be better to find which other function (or previous call to this function) is reading from the same file and make sure rewinding the pointer won't break something else.
EDIT:
And just to be sure, maybe you could change the double assignment into two separate statements. Based on this post, your problem might also be caused by a compiler optimization of that line. I haven't checked the standard, but according to the best answer there, the behavior in C and C++ might be undefined, hence your strange results. Good luck.
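For illustration, a sketch of the loop with the chained assignment split up (names taken from the code above; this mirrors the question's original while loop):

int ch;
while ((ch = fgetc(category)) != EOF) { // read and test for EOF first
    more[15] = (char)ch;                // then store into each buffer
    hidden[16] = (char)ch;              // in its own, sequenced statement
    /* ... shifting and comparisons as before ... */
}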
