Hash table sorting and execution time - c

I wrote a program that counts word frequencies using a hash table, but I don't know how to sort the result.
I use a struct to store each word and its count.
My hash function uses the modulo operator, and the hash table resolves collisions with linked lists.
1. How do I sort the entries by frequency?
2. Why is the printed execution time always zero, even though I have run it many times? What am I doing wrong?
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <ctype.h>
#define HASHSIZE 29989
#define FACTOR 31
#define VOCABULARYSIZE 30
typedef struct HashNode HashNode;
struct HashNode{
char* voc;//vocabulary
int freq;//frequency
struct HashNode *next;//pointed to the same hashcode
//but actually are different numbers
};
HashNode *HashTable[HASHSIZE] = {NULL,0,NULL};//an array of pointers
unsigned int HashCode(const char *pVoc){//generate hashcode
unsigned int index = 0;
int n = strlen(pVoc);
int i = 0;
for(; i < n; i ++)
index = FACTOR*index + pVoc[i];
return index % HASHSIZE;
}
void InsertVocabulary(const char *pVoc){//insert vocabulary to hash table
HashNode *ptr;
unsigned int index = HashCode(pVoc);
for(ptr = HashTable[index]; ptr != NULL; ptr = ptr -> next){//search if already exist
if(!strcmp (pVoc, ptr -> voc)){
(ptr->freq)++;
return;
}
}
ptr = (HashNode*)malloc(sizeof(HashNode));//if doesn't exist, create it
ptr -> freq = 1;
ptr -> voc = (char*)malloc(strlen(pVoc)+1);
strcpy(ptr -> voc, pVoc);
ptr -> next = HashTable[index];
HashTable[index] = ptr;
}
void ReadVocabularyTOHashTable(const char *path){
FILE *pFile;
char buffer[VOCABULARYSIZE];
pFile = fopen(path, "r");//open file for read
if(pFile == NULL)
perror("Fail to Read!\n");//error message
char ch;
int i =0;
do{
ch = fgetc(pFile);
if(isalpha(ch))
buffer[i++] = tolower(ch);//all convert to lowercase
else{
buffer[i] = '\0';//c-style string
i = 0;
if(!isalpha(buffer[0]))
continue;//blank line
else //printf("%s\n",buffer);
InsertVocabulary(buffer);
}
}while(ch != EOF);
fclose(pFile);
}
void WriteVocabularyTOHashTable(const char *path){
FILE *pFile;
pFile = fopen(path, "w");
if(pFile == NULL)
perror("Fail to Write\n");
int i = 0;
for(; i < HASHSIZE; i++){
HashNode *ptr = HashTable[i];
for(; ptr != NULL; ptr = ptr -> next){
fprintf(pFile, "Vocabulary:%s,Count:%d\n", ptr -> voc, ptr -> freq);
if(ptr -> next == NULL)
fprintf(pFile,"\n");
}
}
fclose(pFile);
}
int main(void){
time_t start, end;
time(&start);
ReadVocabularyTOHashTable("test.txt");
WriteVocabularyTOHashTable("result.txt");
time(&end);
double diff = difftime(end,start);
printf("%.21f seconds.\n", diff);
system("pause");
return 0;
}

This is an answer to your first question, sorting by frequency. Every hash node in your table is a distinct vocabulary entry. Some hash to the same code (thus your collision chains), but in the end you have one HashNode for every unique entry. To sort them by frequency with minimal disturbance to your existing code, you can use qsort() on a list of pointers (or any other sort of your choice) with relative ease.
Note: the most efficient way to do this would be to maintain a sorted linked-list during vocab-insert, and you may want to consider that. This code assumes you already have a hash table populated and need to get the frequencies out in sorted order of highest to lowest.
First, keep a running tally of all unique insertions. Declare a file-scope integer counter named gVocabCount, initialized to 0, and increment it in your allocation branch:
gVocabCount++; // increment with each unique entry.
ptr = (HashNode*)malloc(sizeof(HashNode));//if doesn't exist, create it
ptr -> freq = 1;
ptr -> voc = (char*)malloc(strlen(pVoc)+1);
strcpy(ptr -> voc, pVoc);
ptr -> next = HashTable[index];
HashTable[index] = ptr;
Next, allocate a list of pointers to HashNodes as large as your total unique vocab-count. Then walk your entire hash table, including collision chains, and put each node into a slot in this list. When the walk finishes, idx should equal gVocabCount; if it does not, something went wrong:
HashNode **nodeList = malloc(gVocabCount * sizeof(HashNode*));
int i;
int idx = 0;
for (i=0;i<HASHSIZE;++i)
{
HashNode* p = HashTable[i];
while (p)
{
nodeList[idx++] = p;
p = p->next;
}
}
So now we have a list of all unique node pointers. We need a comparison function to send to qsort(). We want the items with the largest numbers to be at the head of the list.
int compare_nodeptr(const void* left, const void* right) // qsort() expects const void* parameters
{
return (*(HashNode* const*)right)->freq - (*(HashNode* const*)left)->freq;
}
And finally, fire qsort() to sort your pointer list.
qsort(nodeList, gVocabCount, sizeof(HashNode*), compare_nodeptr);
The nodeList array of HashNode pointers will have all of your nodes sorted in descending frequency:
for (i=0; i<gVocabCount; ++i)
printf("Vocabulary:%s,Count:%d\n", nodeList[i]->voc, nodeList[i]->freq);
Finally, don't forget to free the list:
free(nodeList);
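The steps above can be put together in a small self-contained sketch. Node and sort_by_freq are hypothetical names used only for illustration (Node stands in for HashNode without the chain pointer), and the comparator uses comparisons instead of subtraction, which is an optional variation rather than something the code above requires:

```c
#include <stdlib.h>

/* Hypothetical stand-in for the question's HashNode (chain pointer omitted). */
typedef struct {
    const char *voc;
    int freq;
} Node;

/* Each qsort element is a Node*, so the arguments point at Node* values.
   A positive result places left after right, giving descending order. */
static int by_freq_desc(const void *left, const void *right)
{
    const Node *l = *(const Node *const *)left;
    const Node *r = *(const Node *const *)right;
    return (r->freq > l->freq) - (r->freq < l->freq);
}

/* Sorts an array of n Node pointers in place, highest frequency first. */
static void sort_by_freq(Node **list, size_t n)
{
    qsort(list, n, sizeof *list, by_freq_desc);
}
```

Only the pointers move during the sort; the nodes themselves, and therefore the hash table and its chains, stay where they are.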
As I said at the beginning, the most efficient way to do this would be to use a sorted linked list: when an entry's count is incremented, pull it out and run an insertion-sort step to slip it back into the right place (by definition, all new entries can go to the end). In the end that list will look virtually identical to what the above code would create, like-count order notwithstanding; i.e. if a->freq = 5 and b->freq = 5, either a-b or b-a can happen.
Hope this helps.
EDIT: Updated to show OP an idea of what the Write function that outputs sorted data may look like:
static int compare_nodeptr(const void* left, const void* right)
{
return (*(const HashNode**)right)->freq - (*(const HashNode**)left)->freq;
}
void WriteVocabularyTOHashTable(const char *path)
{
HashNode **nodeList = NULL;
size_t i=0;
size_t idx = 0;
FILE* pFile = fopen(path, "w");
if(pFile == NULL)
{
perror("Fail to Write\n");
return;
}
nodeList = malloc(gVocabCount * sizeof(HashNode*));
for (i=0,idx=0;i<HASHSIZE;++i)
{
HashNode* p = HashTable[i];
while (p)
{
nodeList[idx++] = p;
p = p->next;
}
}
// send to qsort()
qsort(nodeList, idx, sizeof(HashNode*), compare_nodeptr);
for(i=0; i < idx; i++)
fprintf(pFile, "Vocabulary:%s,Count:%d\n", nodeList[i]->voc, nodeList[i]->freq);
fflush(pFile);
fclose(pFile);
free(nodeList);
}
Something like that, anyway. From the OP's test file, these are the top few lines of output:
Vocabulary:the, Count:912
Vocabulary:of, Count:414
Vocabulary:to, Count:396
Vocabulary:a, Count:388
Vocabulary:that, Count:260
Vocabulary:in, Count:258
Vocabulary:and, Count:221
Vocabulary:is, Count:220
Vocabulary:it, Count:215
Vocabulary:unix, Count:176
Vocabulary:for, Count:142
Vocabulary:as, Count:121
Vocabulary:on, Count:111
Vocabulary:you, Count:107
Vocabulary:user, Count:102
Vocabulary:s, Count:102
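On the second question (the execution time always printing as zero), a likely explanation is that time() typically has one-second resolution, so difftime(end, start) is 0 for any run that finishes within the same second. A minimal sketch using clock() instead, which measures processor time at a finer granularity; cpu_seconds and busy_work are hypothetical names, not part of the original program:

```c
#include <time.h>

/* Hypothetical workload standing in for the read/write calls in main(). */
static void busy_work(void)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000UL; ++i)
        sum += i;
}

/* Returns the CPU seconds consumed by work(), measured with clock(). */
static double cpu_seconds(void (*work)(void))
{
    clock_t start = clock();
    work();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

In the question's main, the calls to ReadVocabularyTOHashTable and WriteVocabularyTOHashTable would sit between the two clock() calls, and a format such as %.6f prints the result sensibly. Note that clock() reports CPU time, not wall-clock time.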

Related

Illegal instruction 4 when placing a function outside int main

I've just begun learning the C language and I ran into an issue with one of my programs.
I am getting an error: "Illegal instruction 4" when executing: ./dictionary large.txt
Large.txt is a file with 143091 alphabetically sorted words, with each word starting on a new line. I am trying to load all of them into a hash table and return true if all the words are loaded successfully.
This code works for me if the code in bool load() is within int main and load() is non-existent. However, once I place it inside the load() function and call it from main, I get an error.
I would appreciate help on this, as there are not many threads on Illegal instruction.
This is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <stdbool.h>
// Maximum length for a word
// (e.g., pneumonoultramicroscopicsilicovolcanoconiosis)
#define LENGTH 45
// Number of letters in the english alphabet
#define ALPHABET_LENGTH 26
// Default dictionary
#define DICTIONARY "large.txt"
// Represents a node in a hash table
typedef struct node
{
char word[LENGTH + 1];
struct node *next;
} node;
// Number of buckets in hash table
const unsigned int N = ALPHABET_LENGTH;
// Hash table
node *table[N];
// Load function
bool load(char *dictionary);
// Hash function
int hash(char *word);
int main(int argc, char *argv[])
{
// Check for correct number of args
if (argc != 2 && argc != 3)
{
printf("Usage: ./speller [DICTIONARY] text\n");
exit(1);
}
// Determine which dictionary to use
char *dictionary = (argc == 3) ? argv[1] : DICTIONARY;
bool loaded = load(dictionary);
// TODO: free hashtable from memory
return 0;
}
bool load(char *dictionary)
{
// Open dictionary for reading
FILE *file = fopen(dictionary, "r");
if (file == NULL)
{
printf("Error 2: could not open %s. Please call customer service.\n", dictionary);
exit(2);
}
// Initialize array to NULL
for (int i = 0; i < N; i++)
table[i] = NULL;
// Declare and initialize variables
unsigned int char_count = 0;
unsigned int word_count = 0;
char char_buffer;
char word_buffer[LENGTH + 1];
int hash_code = 0;
int previous_hash_code = 0;
// Declare pointers
struct node *first_item;
struct node *current_item;
struct node *new_item;
// Is true the first time the while loop is ran to be able to distinguish between hash_code and previous_hash_code after one loop
bool first_loop = true;
// Count the number of words in dictionary
while (fread(&char_buffer, sizeof(char), 1, file))
{
// Builds the word_buffer by scanning characters
if (char_buffer != '\n')
{
word_buffer[char_count] = char_buffer;
char_count++;
}
else
{
// Increases word count each time char_buffer == '\n'
word_count += 1;
// Calls the hash function and stores its value in hash_code
hash_code = hash(&word_buffer[0]);
// Creates and initializes first node in a given table index
if (hash_code != previous_hash_code || first_loop == true)
{
first_item = table[hash_code] = (struct node *)malloc(sizeof(node));
if (first_item == NULL)
{
printf("Error 3: memory not allocated. Please call customer service.\n");
return false;
}
current_item = first_item;
strcpy(current_item->word, word_buffer);
current_item->next = NULL;
}
else
{
new_item = current_item->next = (struct node *)malloc(sizeof(node));
if (new_item == NULL)
{
printf("Error 4: memory not allocated. Please call customer service.\n");
return false;
}
current_item = new_item;
strcpy(current_item->word, word_buffer);
current_item->next = NULL;
}
// Fills word buffer elements with '\0'
for (int i = 0; i < char_count; i++)
{
word_buffer[i] = '\0';
}
// Signals the first loop has finished.
first_loop = false;
// Clears character buffer to keep track of next word
char_count = 0;
// Keeps track if a new table index should be initialized
previous_hash_code = hash_code;
}
}
return true;
}
// Hash in order of: 'a' is 0 and 'z' is 25
int hash(char *word_buffer)
{
int hash = word_buffer[0] - 97;
return hash;
}
Thank you in advance!
Chris
You should use node *table[ALPHABET_LENGTH]; for the table declaration instead of node *table[N];
There is a difference between constant macros and const variables in C: a macro can be used in a constant expression, such as the bound of a global array in your use case, whereas a const variable cannot.
The compiler you say you are using, gcc, with no compiler flags, issues an error message for this declaration:
error: variably modified 'table' at file scope
You can read more about these differences and use cases in "static const" vs "#define" vs "enum". It covers more subjects, such as static and enum, but it is a nice read for grasping the differences between these concepts.
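A minimal sketch of the two forms that do work as file-scope array bounds in C. BUCKET_COUNT, table_macro, table_enum and bucket_count are hypothetical names chosen for this illustration; the node struct mirrors the one in the question:

```c
#include <stddef.h>

#define LENGTH 45
#define ALPHABET_LENGTH 26               /* macro: expands to a constant expression */
enum { BUCKET_COUNT = ALPHABET_LENGTH }; /* enum constant: also a constant expression */

typedef struct node {
    char word[LENGTH + 1];
    struct node *next;
} node;

/* Both bounds are constant expressions, so file-scope arrays are allowed.
   File-scope objects are zero-initialized, so every bucket starts as NULL. */
static node *table_macro[ALPHABET_LENGTH];
static node *table_enum[BUCKET_COUNT];

/* By contrast, this pair is what gcc rejects at file scope in C:
   const unsigned int N = ALPHABET_LENGTH;
   node *table[N];
*/

/* Number of buckets, derived from the array itself. */
static size_t bucket_count(void)
{
    return sizeof table_macro / sizeof table_macro[0];
}
```

Because the arrays are zero-initialized at file scope, the explicit loop that sets every bucket to NULL in load() becomes optional with either form.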

Issue with hashtable in c

So I have an assignment to create a program in C that reads a couple of sentences (a 140 MB file) and, based on the second input, which is a number, returns the Nth most common word. My idea was to build a hash table with linear probing: every time I get a new element I hash it based on its position and on djb2, and if there is a collision I rehash. After that, I apply Quicksort based on the occurrence count and finally access the result by index.
I am having issues finishing up a hash table with linear probing in C. I am pretty sure I have finished it, but every time I run it I get a heap buffer overflow in lldb. I tried to spot the issue but I still cannot figure it out.
Am I running out of stack memory? The file is relatively small to consume so much memory.
I used AddressSanitizer and got a heap-buffer-overflow on insertion.
I don't think I am touching memory outside the allocated region, but I am not 100% sure.
Any idea what has gone wrong? This is the table.c implementation, and below it you can see the definition of the struct.
Here is a more detailed message from AddressSanitizer:
thread #1: tid = 0x148b44, 0x0000000100166b20 libclang_rt.asan_osx_dynamic.dylib`__asan::AsanDie(), queue = 'com.apple.main-thread', stop reason = Heap buffer overflow
{
"access_size": 1,
"access_type": 1,
"address": 105690555220216,
"description": "heap-buffer-overflow",
"instrumentation_class": "AddressSanitizer",
"pc": 4294981434,
"stop_type": "fatal_error"
}
table.c :
#include "table.h"
#include "entities.h"
static inline entry_t* entryInit(const char* const value){
unsigned int len = strlen(value);
entry_t* entry = malloc(sizeof(entry));
entry->value = malloc(sizeof(char*) * len);
strncpy(entry->value, value, strlen(value));
entry->exists = 1;
entry->occurence = 1;
return entry;
}
table_t* tableInit(const unsigned int size){
table_t* table = malloc(sizeof(table_t));
table->entries = malloc(size*sizeof(entry_t));
table->seed = getPrime();
table->size = size;
table->usedEntries = 0U;
return table;
}
//okay, there is definitely an issue here
table_t* tableResize(table_t* table, const unsigned int newSize){
//most likely wont happen but if there is an overflow then we have a problem
if(table->size > newSize) return NULL;
//create a temp array of the realloced array, then do changes there
entry_t* temp = calloc(newSize,sizeof(entry_t));
table->size = newSize;
//temp pointer to an entry
entry_t *tptr = NULL;
unsigned int pos = 0;
unsigned int index = 0;
while(pos != table->size){
tptr = &table->entries[pos];
if(tptr->exists == 1){
index = hashString(table->seed, tptr->value, table->size, pos);
temp[index] = *entryInit(tptr->value);
temp[index].occurence = tptr->occurence;
break;
}
else pos++;
}
table->entries = temp;
//TODO: change table destroy to free the previous array from the table
free(temp);
return table;
}
//insert works fine, it is efficient enough to add something in the table
unsigned int tableInsert(table_t* table,const char* const value){
//decide when to resize, might create a large enough array to bloat the memory?
if(table->usedEntries >(unsigned int)(2*(table->size/3))) table = tableResize(table, table->size*2);
entry_t* entry = NULL;
unsigned int index;
auto int position = 0;
while(position != table->size){
//calculate the hash of our string as a function of the current position on the table
index = hashString(table->seed,value,table->size, position);
entry = &table->entries[index];
if(entry->exists == 0){
*entry = *entryInit(value);
table->usedEntries++;
return index;
} else if (entry->exists == 1 && strcmp(entry->value, value) == 0){
entry->occurence++;
return index;
} else{
position++;
}
}
}
//there might be an issue here
static inline void tableDestroy(const table_t* const table){
entry_t* entry = NULL;
for (auto int i = 0; i < table->size; ++i){
entry =&table->entries[i];
//printf("Value: %s Occurence: %d Exists: %d \n",entry->value, entry->occurence, entry->exists );
if(&table->entries[i] !=NULL)free(&table->entries[i]);
}
free(table);
}
entities.h :
#pragma once
typedef struct __attribute__((packed)) __entry {
char *value;
unsigned int exists : 1;
unsigned int occurence;
} entry_t;
typedef struct __table {
int size;
int usedEntries;
entry_t *entries;
unsigned int seed;
} table_t;
Here is how I read from a file and process the text:
void readFromFile(const char* const fileName, table_t* table){
FILE *fp = fopen(fileName, "r");
if(!fp) fprintf(stderr,"error reading file. \n");
char word[15];//long enough to hold the biggest word in the text?
int position = 0;
char ch;
while((ch = fgetc(fp))!= EOF){
//discard all the ascii chars that are not letters
if(!(ch >= 65 && ch <= 90) && !(ch >= 97 && ch <= 122)){
word[position]= '\0';
if(word[0] == NULL)continue;
tableInsert(table, word);
position = 0;
continue;
}
else word[position++] = ch;
}
}
Any suggestions what is wrong with my code?
I believe resize might have an issue and I am not properly deleting yet because I have had a lot of problems with the memory management.
Thanks in advance!
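One thing that stands out in entryInit, offered as a candidate rather than a confirmed diagnosis: malloc(sizeof(entry)) measures the pointer variable, not the struct it points to, and strncpy with exactly strlen(value) bytes leaves the copy without a terminating '\0'. A minimal sketch of how the allocation could be sized instead; entry_new is a hypothetical name, and the struct here is a simplified, unpacked copy of entry_t:

```c
#include <stdlib.h>
#include <string.h>

/* Simplified copy of the question's entry_t (packed attribute omitted). */
typedef struct {
    char *value;
    unsigned int exists : 1;
    unsigned int occurence;
} entry_t;

/* Allocates an entry that owns a NUL-terminated copy of value.
   Returns NULL if either allocation fails. */
static entry_t *entry_new(const char *value)
{
    size_t len = strlen(value);
    entry_t *entry = malloc(sizeof *entry);   /* size of the struct, not of a pointer */
    if (entry == NULL)
        return NULL;
    entry->value = malloc(len + 1);           /* +1 for the terminator */
    if (entry->value == NULL) {
        free(entry);
        return NULL;
    }
    memcpy(entry->value, value, len + 1);     /* copies the '\0' as well */
    entry->exists = 1;
    entry->occurence = 1;
    return entry;
}
```

Writing sizeof *entry ties the allocation size to the pointed-to type, so it stays correct even if the struct changes later.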

C : Linked list value lost during execution

I am actually trying to implement a breadth-first-search algorithm in C, as an input I take any graph from a file and store all nodes and vertices in a structure.
Then I create an adjacency matrix, and run through all columns, push encountered nodes on a stack, pop them and so on till I have all the paths.
Still I have problems storing those paths in a linked list, and by problem I mean that sometimes, from some specific cases, I lose the last value of my stored path (one int array per link), which is quite surprising as it occurs only on paths of length 5 (I cannot test all lengths but up to 12 it seems OK).
It's weird, because these values are lost at function exit (I tried debugging with LLDB: inside the function that creates the link the last byte exists, but once I leave the function it does not), and not every time (about 1 execution out of 10 works fine).
To me this is a malloc issue, so I checked every single malloc of my program in order to solve (unsuccessfully) the problem. Checked all the variables and all seems fine, except for this 5 length case (I assume my program has a 'defect' that is only apparent in this case, but why ?).
I would gladly accept some help, as I just ran out of things to check.
Here is the code of the main BFS function :
void bfs(t_lemin *e)
{
t_path *save;
//set needed variables
set_bfs_base_var(e);
save = e->p;
while (paths_remain(e))
{
//Special Start-End case
if (e->map[e->nb_start][e->nb_end] == 1)
{
create_single(e);
break ;
}
e->x = e->nb_start;
reset_tab(e);
while (e->x != e->nb_end)
{
e->y = 0;
while (e->y < e->nb_rooms)
{
if (e->map[e->x][e->y] == 1 && !e->visited[e->y])
//push_on_stack the nodes
push_stack(e);
e->y++;
}
//go_to first elem on stack
e->x = e->stack[0];
if (e->x == e->nb_end || is_stack_empty(e->stack, e->nb_rooms - 1))
break ;
e->visited[e->x] = 1;
//set_it as visited than pop it
pop_stack(e, e->nb_rooms);
}
if (is_stack_empty(e->stack, e->nb_rooms - 1))
break ;
e->find_new[add_path(e)] = 1;
discover_more_paths(e, save);
}
print_paths(e, save);
e->p = save;
}
And here the 2 functions that stores the paths in a linked list :
void create_path(t_lemin *e, int *pa, int len)
{
int j;
j = 1;
//create_new_node if required
if (e->p->path)
{
if (!(e->p->next = malloc(sizeof(t_path))))
return ;
e->p = e->p->next;
}
//create_the_array_for_path_storing
e->p->path = malloc(sizeof(int) * len + 2);
e->p->next = NULL;
e->p->size_path = len + 2;
//copy_in_it
while (--len >= 0)
{
e->p->path[j++] = pa[len];
}
//copy_end_and_start_at_end_and_start
e->p->path[e->p->size_path - 1] = e->nb_end;
e->p->path[0] = e->nb_start;
e->nb_paths++;
}
int add_path(t_lemin *e)
{
int i;
int save;
int *path;
int next_path;
i = 0;
if (!(path = malloc(sizeof(int) * e->nb_rooms)))
exit(-1);
save = e->nb_end;
//in_order_to_save_the_path_i store the previous value of each node so I can find the path by iterating backward
next_path = -1;
while (e->prev[save] != e->nb_start)
{
path[i] = e->prev[save];
save = e->prev[save];
next_path = next_path == -1 && get_nb_links(e, path[i])
> 2 ? path[i] : -1;
i++;
}
//path_contains all values of the path except for start and end
save = i;
while (i < e->nb_rooms)
{
path[i] = 0;
i++;
}
create_path(e, path, save);
i = next_path == -1 ? path[0] : next_path;
//ft_printf("to_block : %d\n", i);
return (next_path == -1 ? path[0] : next_path);
}
If needed here is a clone of the entire repository, the issue can be seen running the program with maptest in the main directory : https://github.com/Caribou123/bfs_agesp.git
Make && ./lem_in < maptest
All paths must end with the last room, whereas in this case the value of the last room becomes 0. So the program outputs "start->room1->room2->....->start", as the index value of start is 0.
Here is a look at my 'e', the main structure. (it's quite huge, don't be scared) :
typedef struct s_lemin
{
int x;
int y;
char *av;
int nb_ants;
int st;
int nd;
int nb_rooms;
int nb_paths;
int max_sizep;
int nb_links;
int nb_start;
int nb_end;
int **map;
int *stack;
int *visited;
int *prev;
int *find_new;
int maxy;
int maxx;
int minx;
int conti;
int miny;
char ***saa;
struct s_rooms *r;
struct s_ants *a;
struct s_rooms **table_r;
struct s_links *l;
struct s_hash **h;
struct s_rooms *start;
struct s_rooms *end;
struct s_info *i;
struct s_path *p;
struct s_path *select_p;
} t_lemin;
Thank you in advance for your help, and sorry if it's some stupid malloc that I somehow missed.
Artiom
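One candidate worth checking in create_path, offered as a sketch rather than a confirmed diagnosis: malloc(sizeof(int) * len + 2) requests room for len ints plus two bytes, while size_path is len + 2 ints, so the write to path[size_path - 1] can land past the end of the allocation. Allocator rounding might explain why only some lengths misbehave. The helper below, build_path, is a hypothetical standalone version of the copy step with the size parenthesized:

```c
#include <stdlib.h>

/* Builds the array: start, pa[len-1], ..., pa[0], end.
   Returns a freshly allocated array of len + 2 ints, or NULL on failure. */
static int *build_path(const int *pa, int len, int start, int end)
{
    int size = len + 2;
    int *path = malloc(sizeof *path * (size_t)size);   /* (len + 2) ints */
    if (path == NULL)
        return NULL;
    path[0] = start;
    for (int i = 0; i < len; ++i)
        path[1 + i] = pa[len - 1 - i];                 /* reverse copy, as in create_path */
    path[size - 1] = end;
    return path;
}
```

The only substantive difference from the allocation in create_path is the parenthesization: the element count is computed first and then multiplied by the element size.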

Is it correct to malloc an array of unknown size like this?

I want to make an array of unknown size. Is it correct to do it like this?
int *array,var,i=0;
FILE *fp;
fopen=("/home/inputFile.txt","r");
fscanf(fp,"%d",&var);
while(fp!=NULL)
{
if(var>0)
{
array=malloc(sizeof(int));
array[i++]=var
}
fscanf(fp,"%d",&var);
}
This is absurdly false, full of memory leaks and undefined behaviors.
However, it's not that far from one of the right ways, the linked list way:
struct linked_int
{
int value;
struct linked_int* pNext;
};
struct linked_int *pHead=NULL;
struct linked_int **ppTail = &pHead;
int* array = NULL;
int cpt=0;
/*Read file, building linked list*/
FILE *fp = fopen("/home/inputFile.txt","r");
if(fp != NULL)
{
int var;
while(fscanf(fp,"%d",&var)==1)
{
if(var>0)
{
struct linked_int *pNew = malloc(sizeof(struct linked_int));
pNew->value = var;
pNew->pNext = NULL;
/*Append at the tail of the list*/
*ppTail = pNew;
ppTail = &(pNew->pNext);
cpt++;
}
}
fclose(fp);
}
/*Copy from the linked list to an array*/
array = malloc(sizeof(int) * cpt);
if(array != NULL)
{
int i;
struct linked_int const *pCur = pHead;
for(i=0 ; i<cpt ; i++)
{
array[i] = pCur->value;
pCur = pCur->pNext;
}
}
/*Free the linked list*/
while(pHead != NULL)
{
struct linked_int *pDelete = pHead;
pHead = pHead->pNext;
free(pDelete);
}
ppTail = &pHead;
Other ways:
Another right way is the realloc way, which consists in re-allocating the array with an ever expanding size (usually with a geometric growth, i.e. multiplying the array size by a number such as 1.5 every time). A wrong way to do so is to add 1 to the array size every time.
It goes something like this:
int arrayCapacity=0, numberOfItems=0;
int* array = NULL;
int var;
while(fscanf(fp, "%d", &var)==1)
{
if(numberOfItems >= arrayCapacity)
{
/*Need to resize array before inserting*/
const int MIN_CAPACITY = 4;
const double GROWTH_RATE = 1.5;
int newCapacity = arrayCapacity<MIN_CAPACITY ? MIN_CAPACITY : (int)(arrayCapacity*GROWTH_RATE);
int* tmp = realloc(array, newCapacity*sizeof(int));
if(tmp==NULL)
{
/*FAIL: can't make the array bigger!*/
}
else
{
/*Successfully resized the array.*/
array = tmp;
arrayCapacity = newCapacity;
}
}
if(numberOfItems >= arrayCapacity)
{
puts("Cannot add, array is full and can't be enlarged.");
break;
}
else
{
array[numberOfItems] = var;
numberOfItems++;
}
}
/*Now we have our array with all integers in it*/
The obvious result is that in this code, there can be unused space in the array. This isn't a problem.
sizeof(int) usually evaluates to 4 (a few compilers/settings give 2 or 8). So each malloc(sizeof(int)) in your code allocates room for a single int, typically 4 bytes.
If you want an array of unknown size, it could be worth taking a look at STL containers like std::vector in C++ (because it manages allocations and resizes behind the scenes). If you plan to stick with plain C, you may be interested in the TSTL2CL library: http://sourceforge.net/projects/tstl2cl
The basic point is that a C array has a fixed size once it is created; it does not grow dynamically.

Seg. Fault in Hash Table ADT - C

Edit:
Hash.c is updated with revisions from the comments, but I am still getting a seg fault. I must be missing something in what you are saying.
I have created a hash table ADT in C, but I am encountering a segmentation fault when I try to call a function (find_hash) in the ADT.
I have posted all 3 files that I created (parse.c, hash.c, and hash.h) so you can see all of the variables. We are reading from the file gettysburg.txt, which is also attached.
The seg fault is occurring in parse.c when I call find_hash. I cannot figure out for the life of me what is going on here. If you need any more information I can surely provide it.
Sorry for the long amount of code; I have just been completely stumped for a week now on this. Thanks in advance.
The way I run the program is first:
gcc -o parse parse.c hash.c
then: cat gettysburg.txt | parse
Parse.c
#include <stdio.h>
#include <ctype.h>
#include <string.h>
#include "hash.h"
#define WORD_SIZE 40
#define DICTIONARY_SIZE 1000
#define TRUE 1
#define FALSE 0
void lower_case_word(char *);
void dump_dictionary(Phash_table );
/*Hash and compare functions*/
int hash_func(char *);
int cmp_func(void *, void *);
typedef struct user_data_ {
char word[WORD_SIZE];
int freq_counter;
} user_data, *Puser_data;
int main(void)
{
char c, word1[WORD_SIZE];
int char_index = 0, dictionary_size = 0, num_words = 0, i;
int total=0, largest=0;
float average = 0.0;
Phash_table t; //Pointer to main hash_table
int (*Phash_func)(char *)=NULL; //Function Pointers
int (*Pcmp_func)(void *, void *)=NULL;
Puser_data data_node; //pointer to hash table above
user_data * find;
printf("Parsing input ...\n");
Phash_func = hash_func; //Assigning Function pointers
Pcmp_func = cmp_func;
t = new_hash(1000,Phash_func,Pcmp_func);
// Read in characters until end is reached
while ((c = getchar()) != EOF) {
if ((c == ' ') || (c == ',') || (c == '.') || (c == '!') || (c == '"') ||
(c == ':') || (c == '\n')) {
// End of a word
if (char_index) {
// Word is not empty
word1[char_index] = '\0';
lower_case_word(word1);
data_node = (Puser_data)malloc(sizeof(user_data));
strcpy(data_node->word,word1);
printf("%s\n", data_node->word);
//!!!!!!SEG FAULT HERE!!!!!!
if (!((user_data *)find_hash(t, data_node->word))){ //SEG FAULT!!!!
insert_hash(t,word1,(void *)data_node);
}
char_index = 0;
num_words++;
}
} else {
// Continue assembling word
word1[char_index++] = c;
}
}
printf("There were %d words; %d unique words.\n", num_words,
dictionary_size);
dump_dictionary(t); //???
}
void lower_case_word(char *w){
int i = 0;
while (w[i] != '\0') {
w[i] = tolower(w[i]);
i++;
}
}
void dump_dictionary(Phash_table t){ //???
int i;
user_data *cur, *cur2;
stat_hash(t, &(t->total), &(t->largest), &(t->average)); //Call to stat hash
printf("Number of unique words: %d\n", t->total);
printf("Largest Bucket: %d\n", t->largest);
printf("Average Bucket: %f\n", t->average);
cur = start_hash_walk(t);
printf("%s: %d\n", cur->word, cur->freq_counter);
for (i = 0; i < t->total; i++)
cur2 = next_hash_walk(t);
printf("%s: %d\n", cur2->word, cur2->freq_counter);
}
int hash_func(char *string){
int i, sum=0, temp, index;
for(i=0; i < strlen(string);i++){
sum += (int)string[i];
}
index = sum % 1000;
return (index);
}
/*array1 and array2 point to the user defined data struct defined above*/
int cmp_func(void *array1, void *array2){
user_data *cur1= array1;
user_data *cur2= array2;//(user_data *)array2;
if(cur1->freq_counter < cur2->freq_counter){
return(-1);}
else{ if(cur1->freq_counter > cur2->freq_counter){
return(1);}
else return(0);}
}
hash.c
#include "hash.h"
Phash_table new_hash (int size, int(*hash_func)(char*), int(*cmp_func)(void*, void*)){
int i;
Phash_table t;
t = (Phash_table)malloc(sizeof(hash_table)); //creates the main hash table
t->buckets = (hash_entry **)malloc(sizeof(hash_entry *)*size); //creates the hash table of "size" buckets
t->size = size; //Holds the number of buckets
t->hash_func = hash_func; //assigning the pointer to the function in the user's program
t->cmp_func = cmp_func; // " "
t->total=0;
t->largest=0;
t->average=0;
t->sorted_array = NULL;
t->index=0;
t->sort_num=0;
for(i=0;i<size;i++){ //Sets all buckets in hash table to NULL
t->buckets[i] = NULL;}
return(t);
}
void free_hash(Phash_table table){
int i;
hash_entry *cur;
for(i = 0; i<(table->size);i++){
if(table->buckets[i] != NULL){
for(cur=table->buckets[i]; cur->next != NULL; cur=cur->next){
free(cur->key); //Freeing memory for key and data
free(cur->data);
}
free(table->buckets[i]); //free the whole bucket
}}
free(table->sorted_array);
free(table);
}
void insert_hash(Phash_table table, char *key, void *data){
Phash_entry new_node; //pointer to a new node of type hash_entry
int index;
new_node = (Phash_entry)malloc(sizeof(hash_entry));
new_node->key = (char *)malloc(sizeof(char)*(strlen(key)+1)); //creates the key array based on the length of the string-based key
new_node->data = data; //stores the user's data into the node
strcpy(new_node->key,key); //copies the key into the node
//calling the hash function in the user's program
index = table->hash_func(key); //index will hold the hash table value for where the new node will be placed
table->buckets[index] = new_node; //Assigns the pointer at the index value to the new node
table->total++; //increment the total (total # of buckets)
}
void *find_hash(Phash_table table, char *key){
int i;
hash_entry *cur;
printf("Inside find_hash\n"); //REMOVE
for(i = 0;i<table->size;i++){
if(table->buckets[i]!=NULL){
for(cur = table->buckets[i]; cur->next != NULL; cur = cur->next){
if(strcmp(table->buckets[i]->key, key) == 0)
return((table->buckets[i]->data));} //returns the data to the user if the key values match
} //otherwise return NULL, if no match was found.
}
return NULL;
}
void stat_hash(Phash_table table, int *total, int *largest, float *average){
int node_num[table->size]; //creates an array, same size as table->size(# of buckets)
int i,j, count = 0;
int largest_buck = 0;
hash_entry *cur;
for(i = 0; i < table->size; i ++){
if(table->buckets[i] != NULL){
for(cur=table->buckets[i]; cur->next!=NULL; cur = cur->next){
count ++;}
node_num[i] = count;
count = 0;}
}
for(j = 0; j < table->size; j ++){
if(node_num[j] > largest_buck)
largest_buck = node_num[j];}
*total = table->total;
*largest = largest_buck;
*average = (table->total) / (table->size);
}
void *start_hash_walk(Phash_table table){
Phash_table temp = table;
int i, j, k;
hash_entry *cur; //CHANGE IF NEEDED to HASH_TABLE *
if(table->sorted_array != NULL) free(table->sorted_array);
table->sorted_array = (void**)malloc(sizeof(void*)*(table->total));
for(i = 0; i < table->total; i++){
if(table->buckets[i]!=NULL){
for(cur=table->buckets[i]; cur->next != NULL; cur=cur->next){
table->sorted_array[i] = table->buckets[i]->data;
}}
}
for(j = (table->total) - 1; j > 0; j --) {
for(k = 1; k <= j; k ++){
if(table->cmp_func(table->sorted_array[k-1], table->sorted_array[k]) == 1){
temp -> buckets[0]-> data = table->sorted_array[k-1];
table->sorted_array[k-1] = table->sorted_array[k];
table->sorted_array[k] = temp->buckets[0] -> data;
}
}
}
return table->sorted_array[table->sort_num];
}
void *next_hash_walk(Phash_table table){
table->sort_num ++;
return table->sorted_array[table->sort_num];
}
hash.h
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct hash_entry_ { //Linked List
void *data; //Generic pointer
char *key; //String-based key value
struct hash_entry_ *next; //Self-Referencing pointer
} hash_entry, *Phash_entry;
typedef struct hash_table_ {
hash_entry **buckets; //Pointer to a pointer to a Linked List of type hash_entry
int (*hash_func)(char *);
int (*cmp_func)(void *, void *);
int size;
void **sorted_array; //Array used to sort each hash entry
int index;
int total;
int largest;
float average;
int sort_num;
} hash_table, *Phash_table;
Phash_table new_hash(int size, int (*hash_func)(char *), int (*cmp_func)(void *, void *));
void free_hash(Phash_table table);
void insert_hash(Phash_table table, char *key, void *data);
void *find_hash(Phash_table table, char *key);
void stat_hash(Phash_table table, int *total, int *largest, float *average);
void *start_hash_walk(Phash_table table);
void *next_hash_walk(Phash_table table);
Gettysburg.txt
Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war. . .testing whether that nation, or any nation so conceived and so dedicated. . . can long endure. We are met on a great battlefield of that war.
We have come to dedicate a portion of that field as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we cannot dedicate. . .we cannot consecrate. . . we cannot hallow this ground. The brave men, living and dead, who struggled here have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember, what we say here, but it can never forget what they did here.
It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us. . .that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion. . . that we here highly resolve that these dead shall not have died in vain. . . that this nation, under God, shall have a new birth of freedom. . . and that government of the people. . .by the people. . .for the people. . . shall not perish from the earth.
One of several possible problems with this code is loops like:
for(table->buckets[i];
table->buckets[i]->next != NULL;
table->buckets[i] = table->buckets[i]->next)
...
The initializing part of the for loop (table->buckets[i]) has no effect. If i is 0 and table->buckets[0] == NULL, then the condition on this loop (table->buckets[i]->next != NULL) will dereference a null pointer and crash.
That's where your code seemed to be crashing on my box, at least. When I changed several of your loops to:
if (table->buckets[i] != NULL) {
for(;
table->buckets[i]->next != NULL;
table->buckets[i] = table->buckets[i]->next)
...
}
...it kept crashing, but in a different place. Maybe that will help get you unstuck?
Edit: another potential problem is that those for loops are destructive. When you call find_hash, do you really want all of those buckets to be modified?
I'd suggest using something like:
hash_entry *cur;
// ...
if (table->buckets[i] != NULL) {
for (cur = table->buckets[i]; cur->next != NULL; cur = cur->next) {
// ...
}
}
When I do that and comment out your dump_dictionary function, your code runs without crashing.
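A minimal sketch of the non-destructive traversal idea, assuming a hypothetical node type in place of hash_entry (the names node, chain_length and chain_length_demo are made up for illustration). It uses cur != NULL as the loop condition, so an empty chain needs no separate check and the last node is visited too:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal node type, standing in for hash_entry. */
typedef struct node {
    int value;
    struct node *next;
} node;

/* Counts the nodes in a chain without modifying it.
   Only the local cursor `cur` advances; the caller's head pointer
   is never reassigned, so the chain is intact afterwards. */
size_t chain_length(const node *head) {
    size_t n = 0;
    const node *cur;
    for (cur = head; cur != NULL; cur = cur->next) {
        n++;
    }
    return n;
}

/* Small usage example: walking the chain twice gives the same answer. */
void chain_length_demo(void) {
    node c = {3, NULL};
    node b = {2, &c};
    node a = {1, &b};
    assert(chain_length(&a) == 3);
    assert(chain_length(&a) == 3); /* still 3: the first walk changed nothing */
    assert(chain_length(NULL) == 0);
}
```

Had the loop advanced the head pointer itself, as the original loops advance table->buckets[i], the second call would see a shorter chain.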
Hmm,
here's hash.c
#include "hash.h"
Phash_table new_hash (int size, int(*hash_func)(char*), int(*cmp_func)(void*, void*)){
int i;
Phash_table t;
t = (Phash_table)calloc(1, sizeof(hash_table)); //creates the main hash table
t->buckets = (hash_entry **)calloc(size, sizeof(hash_entry *)); //creates the hash table of "size" buckets
t->size = size; //Holds the number of buckets
t->hash_func = hash_func; //assigning the pointer to the function in the user's program
t->cmp_func = cmp_func; // " "
t->total=0;
t->largest=0;
t->average=0;
for(i=0;i<size;i++){ //Sets all buckets in hash table to NULL (calloc has already zeroed them)
t->buckets[i] = NULL;}
return(t);
}
void free_hash(Phash_table table){
int i;
hash_entry *cur, *next;
for(i = 0; i<(table->size);i++){
for(cur = table->buckets[i]; cur != NULL; cur = next){
next = cur->next; //save the link before the node is freed
free(cur->key); //Freeing memory for key and data
free(cur->data);
free(cur); //free the node itself
}
}
free(table->buckets); //free the bucket array
free(table->sorted_array);
free(table);
}
void insert_hash(Phash_table table, char *key, void *data){
Phash_entry new_node; //pointer to a new node of type hash_entry
int index;
new_node = (Phash_entry)calloc(1,sizeof(hash_entry));
new_node->key = (char *)malloc(sizeof(char)*(strlen(key)+1)); //creates the key array based on the length of the string-based key
new_node->data = data; //stores the user's data into the node
strcpy(new_node->key,key); //copies the key into the node
//calling the hash function in the user's program
index = table->hash_func(key); //index will hold the hash table value for where the new node will be placed
table->buckets[index] = new_node; //Assigns the pointer at the index value to the new node
table->total++; //increment the total (total # of buckets)
}
void *find_hash(Phash_table table, char *key){
int i;
hash_entry *cur;
printf("Inside find_hash\n"); //REMOVE
for(i = 0;i<table->size;i++){
if(table->buckets[i]!=NULL){
for (cur = table->buckets[i]; cur != NULL; cur = cur->next){
//for(table->buckets[i]; table->buckets[i]->next != NULL; table->buckets[i] = table->buckets[i]->next){
if(strcmp(cur->key, key) == 0)
return((cur->data));} //returns the data to the user if the key values match
} //otherwise return NULL, if no match was found.
}
return NULL;
}
void stat_hash(Phash_table table, int *total, int *largest, float *average){
int node_num[table->size];
int i,j, count = 0;
int largest_buck = 0;
hash_entry *cur;
for(i = 0; i < table->size; i ++)
{
if(table->buckets[i]!=NULL)
for (cur = table->buckets[i]; cur != NULL; cur = cur->next){
//for(table->buckets[i]; table->buckets[i]->next != NULL; table->buckets[i] = table->buckets[i]->next){
count ++;}
node_num[i] = count;
count = 0;
}
for(j = 0; j < table->size; j ++){
if(node_num[j] > largest_buck)
largest_buck = node_num[j];}
*total = table->total;
*largest = largest_buck;
*average = (table->total) /(float) (table->size); //oook: i think you want a fp average
}
void *start_hash_walk(Phash_table table){
void* temp = 0; //oook: this was another way of overwriting your input table
int i, j, k;
int l=0; //oook: new counter for elements in your sorted_array
hash_entry *cur;
if(table->sorted_array !=NULL) free(table->sorted_array);
table->sorted_array = (void**)calloc((table->total), sizeof(void*));
for(i = 0; i < table->size; i ++){
//for(i = 0; i < table->total; i++){ //oook: i don't think you meant total ;)
if(table->buckets[i]!=NULL)
for (cur = table->buckets[i]; cur != NULL; cur = cur->next){
//for(table->buckets[i]; table->buckets[i]->next != NULL; table->buckets[i] = table->buckets[i]->next){
table->sorted_array[l++] = cur->data;
}
}
//oook: sanity check/assert on expected values
if (l != table->total)
{
printf("oook: l[%d] != table->total[%d]\n",l,table->total);
}
for(j = (l) - 1; j > 0; j --) {
for(k = 1; k <= j; k ++){
if (table->sorted_array[k-1] && table->sorted_array[k])
{
if(table->cmp_func(table->sorted_array[k-1], table->sorted_array[k]) == 1){
temp = table->sorted_array[k-1]; //ook. changed temp to void* see assignment
table->sorted_array[k-1] = table->sorted_array[k];
table->sorted_array[k] = temp;
}
}
else
printf("if (table->sorted_array[k-1] && table->sorted_array[k])\n");
}
}
return table->sorted_array[table->sort_num];
}
void *next_hash_walk(Phash_table table){
/*oook: this was blowing up since you were incrementing past the size of sorted_array..
NB: *you **need** to implement some bounds checking here or you will end up with more seg-faults!!*/
//table->sort_num++
return table->sorted_array[table->sort_num++];
}
here's parse.c
#include <stdio.h>
#include <ctype.h>
#include <string.h>
#include <assert.h> //oook: added so you can assert ;)
#include "hash.h"
#define WORD_SIZE 40
#define DICTIONARY_SIZE 1000
#define TRUE 1
#define FALSE 0
void lower_case_word(char *);
void dump_dictionary(Phash_table );
/*Hash and compare functions*/
int hash_func(char *);
int cmp_func(void *, void *);
typedef struct user_data_ {
char word[WORD_SIZE];
int freq_counter;
} user_data, *Puser_data;
int main(void)
{
int c; //int, not char, so that the comparison with EOF is reliable
char word1[WORD_SIZE];
int char_index = 0, dictionary_size = 0, num_words = 0, i;
int total=0, largest=0;
float average = 0.0;
Phash_table t; //Pointer to main hash_table
int (*Phash_func)(char *)=NULL; //Function Pointers
int (*Pcmp_func)(void *, void *)=NULL;
Puser_data data_node; //pointer to hash table above
user_data * find;
printf("Parsing input ...\n");
Phash_func = hash_func; //Assigning Function pointers
Pcmp_func = cmp_func;
t = new_hash(1000,Phash_func,Pcmp_func);
// Read in characters until end is reached
while ((c = getchar()) != EOF) {
if ((c == ' ') || (c == ',') || (c == '.') || (c == '!') || (c == '"') ||
(c == ':') || (c == '\n')) {
// End of a word
if (char_index) {
// Word is not empty
word1[char_index] = '\0';
lower_case_word(word1);
data_node = (Puser_data)calloc(1,sizeof(user_data));
strcpy(data_node->word,word1);
printf("%s\n", data_node->word);
//!!!!!!SEG FAULT HERE!!!!!!
if (!((user_data *)find_hash(t, data_node->word))){ //SEG FAULT!!!!
dictionary_size++;
insert_hash(t,word1,(void *)data_node);
}
char_index = 0;
num_words++;
}
} else {
// Continue assembling word
word1[char_index++] = c;
}
}
printf("There were %d words; %d unique words.\n", num_words,
dictionary_size);
dump_dictionary(t); //???
}
void lower_case_word(char *w){
int i = 0;
while (w[i] != '\0') {
w[i] = tolower(w[i]);
i++;
}
}
void dump_dictionary(Phash_table t){ //???
int i;
user_data *cur, *cur2;
stat_hash(t, &(t->total), &(t->largest), &(t->average)); //Call to stat hash
printf("Number of unique words: %d\n", t->total);
printf("Largest Bucket: %d\n", t->largest);
printf("Average Bucket: %f\n", t->average);
cur = start_hash_walk(t);
if (!cur) //ook: do test or assert for null values
{
printf("oook: null== (cur = start_hash_walk)\n");
exit(-1);
}
printf("%s: %d\n", cur->word, cur->freq_counter);
for (i = 0; i < t->total; i++)
{//oook: i think you needed these braces
cur2 = next_hash_walk(t);
if (!cur2) //ook: do test or assert for null values
{
printf("oook: null== (cur2 = next_hash_walk(t) at i[%d])\n",i);
}
else
printf("%s: %d\n", cur2->word, cur2->freq_counter);
}//oook: i think you needed these braces
}
int hash_func(char *string){
int i, sum=0, temp, index;
for(i=0; i < strlen(string);i++){
sum += (int)string[i];
}
index = sum % 1000;
return (index);
}
/*array1 and array2 point to the user defined data struct defined above*/
int cmp_func(void *array1, void *array2){
user_data *cur1= array1;
user_data *cur2= array2;//(user_data *)array2;
/* ooook: do assert on programmatic errors.
this function *requires non-null inputs. */
assert(cur1 && cur2);
if(cur1->freq_counter < cur2->freq_counter){
return(-1);}
else{ if(cur1->freq_counter > cur2->freq_counter){
return(1);}
else return(0);}
}
Follow the //oook comments.
Explanation:
There were one or two places where this was going to blow up.
The quick fix, and the answer to your question, is in parse.c, around line 100:
cur = start_hash_walk(t);
printf("%s: %d\n", cur->word, cur->freq_counter);
Checking that cur is not NULL before calling printf fixes your immediate seg-fault.
But why would cur be NULL? Because of this bad boy:
void *start_hash_walk(Phash_table table)
Your hash_func(char *string) can (and does) return non-unique values. That is of course OK, except that you have not yet implemented your linked-list chains: insert_hash overwrites whatever is already in the bucket while still incrementing table->total. Hence you end up with table->sorted_array containing fewer than table->total elements, or you would if you were iterating over all table->size buckets ;)
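As a rough sketch of what chaining could look like (not your hash.h API: entry, bucket_of, chained_insert and NBUCKETS are hypothetical names, and an int count stands in for the void *data payload), colliding keys are linked into the same bucket instead of replacing each other:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 8 /* hypothetical table size for this sketch */

typedef struct entry {
    char *key;
    int count;
    struct entry *next;
} entry;

/* Same idea as hash_func in parse.c: sum of the characters modulo table size. */
static unsigned bucket_of(const char *key) {
    unsigned sum = 0;
    while (*key) sum += (unsigned char)*key++;
    return sum % NBUCKETS;
}

/* Inserts key, or bumps its count if it is already present.
   Returns the entry, or NULL if allocation fails. */
entry *chained_insert(entry *buckets[], const char *key) {
    unsigned i;
    entry *cur;
    assert(buckets != NULL && key != NULL);
    i = bucket_of(key);
    for (cur = buckets[i]; cur != NULL; cur = cur->next) {
        if (strcmp(cur->key, key) == 0) {
            cur->count++;
            return cur;
        }
    }
    cur = malloc(sizeof *cur);
    if (cur == NULL) return NULL;
    cur->key = malloc(strlen(key) + 1);
    if (cur->key == NULL) { free(cur); return NULL; }
    strcpy(cur->key, key);
    cur->count = 1;
    cur->next = buckets[i]; /* the old head (possibly NULL) becomes second */
    buckets[i] = cur;
    return cur;
}
```

With a character-sum hash, "ab" and "ba" land in the same bucket; with this sketch both survive, and the newer one sits at the head of the chain.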
There are one or two other issues.
For now I hacked Nate Kohl's for(cur=table->buckets[i]; cur->next != NULL; cur=cur->next) further, to be for(cur=table->buckets[i]; cur != NULL; cur=cur->next), since you have no chains. But this is *your* TODO, so enough said about that.
Finally, note that in next_hash_walk(Phash_table table) you have:
table->sort_num++;
return table->sorted_array[table->sort_num];
Ouch! Do check those array bounds!
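One way the bounds check might look, sketched against a hypothetical cursor struct rather than your hash_table (walk, walk_start and walk_next are made-up names; items, count and pos play the roles of table->sorted_array, the number of filled slots, and table->sort_num):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor over an array of already-sorted pointers. */
typedef struct walk {
    void **items; /* the sorted pointers */
    int count;    /* how many slots of items are filled */
    int pos;      /* index of the item returned most recently */
} walk;

/* Resets the cursor and returns the first item, or NULL if there are none. */
void *walk_start(walk *w) {
    assert(w != NULL);
    w->pos = 0;
    return (w->count > 0) ? w->items[0] : NULL;
}

/* Advances the cursor; returns NULL instead of reading past the end. */
void *walk_next(walk *w) {
    assert(w != NULL);
    if (w->pos + 1 >= w->count) {
        w->pos = w->count; /* stay parked at the end */
        return NULL;
    }
    return w->items[++w->pos];
}
```

A caller can then loop with for (p = walk_start(&w); p != NULL; p = walk_next(&w)). This assumes no stored item is itself NULL, since NULL doubles as the end marker.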
Notes
1) If your function isn't designed to change its input, then make the input const. That way the compiler may well tell you when you're inadvertently trashing something.
2) Do bounds checking on your array indices.
3) Do test/assert for NULL pointers before attempting to use them.
4) Do unit test each of your functions; never write too much code before compiling & testing.
5) Use minimal test data; craft it such that it limit-tests your code & attempts to break it in cunning ways.
6) Do initialise your data structures!
7) Never use Egyptian braces! {
only joking ;)
}
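A small sketch of notes 1 to 3, using hypothetical names (item, total_freq, freq_at) rather than anything from the code above. One caveat worth knowing: const in C is shallow. A const hash_table *table parameter would stop table->buckets = ... but would still allow table->buckets[i] = ..., because only the struct's own members become const, not what they point to.

```c
#include <assert.h>
#include <stddef.h>

typedef struct item {
    int freq;
    struct item *next;
} item;

/* Note 1: `const item *` promises this function will not write through
   `head`. A line such as `head->next = NULL;` in here would be rejected
   at compile time. */
int total_freq(const item *head) {
    int sum = 0;
    const item *cur;
    for (cur = head; cur != NULL; cur = cur->next) {
        sum += cur->freq;
    }
    return sum;
}

/* Notes 2 and 3: assert on a NULL array (a programming error) and
   return -1 rather than indexing past the end. */
int freq_at(const int *freqs, size_t len, size_t i) {
    assert(freqs != NULL);
    if (i >= len) return -1;
    return freqs[i];
}
```

Returning -1 as an out-of-range marker only works here because frequencies are never negative; a different sentinel or an out-parameter would be needed otherwise.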
PS Good job so far ~> pointers are tricky little things! & a well asked question with all the necessary details so +1 and gl ;)
(//oook: maybe add a homework tag)

Resources