Hash tables, same record on each hash table cell - c

There is a word dictionary in a text file. I want to hash all these words. I wrote some code, but there is a problem: the last word ends up in every hash table slot.
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp;
    char word[100];
    char *hash[569];
    int i;
    for(i=0;i<569;i++)
        hash[i]="NULL";          /* mark every slot as empty */
    int m=569;
    int z=569;
    int mm=568;
    char w;
    int key;
    int j;
    int hash1;
    int hash2;
    fp=fopen("smallDictionary.txt","r");
    int counter=0;
    while(fscanf(fp,"%s",word)!=EOF)
    {
        j=0;
        counter++;
        for(i=key=0;i<strlen(word);i++)
            key+=(word[i]-'a')*26*i;
        hash1=key%m;
        hash2=1+(key%mm);
        /* double hashing: probe until a free slot is found */
        while(hash[(hash1+j*hash2)%m]!="NULL")
        {
            j++;
        }
        hash[(hash1+j*hash2)%m] = word;
    }
    for(i=0;i<569;i++)
        printf("%s ",hash[i]);
    fclose(fp);
}
Here is the result on the console: as can be seen, the last word of the dictionary, "your", is repeated in every slot.
Dictionary content:
a about above absolutely acceptable add adjacent after algorithm all
along also an analyses and any anyone are argument as assignment
assume at available be been below bird body but by can cannot
capitalization case center chain chaining changing characters check
checker checkers checking choose class code collision collisions
command compilation compile compiled complexity conform consist
containing contains convenience convert correcting corrections create
created cross deal debugging december decide deducted deleted
departmental dictionary discover discussed divides document
documentation due dynamically e each easiest encountered enough error
errors etc even exactly example executable expand experience factor
fair fall file files find first following font for forth found four
friday from function functions g gain general generate generated
generating geoff get gird given good graders grows guide guidelines
had hair handle has hash hashing have head header help helped hold
homework hour how i if in include including incorrect information
input insert inserted insertions instructions into is ispell it
keeping known kuenning last length less letter letters like line
linear list load long longer longest look looked low lower maintained
many match may messages method midnight might misspelled mistake mode
more most must my name named names necessary need never next no note
number of on once one only options or original other otherwise output
overview page pair pedantic points policies possibility possible
prefer primary probing problem produce produces professor program
purpose quack quadratic quick read reason reasonably refer reference
rehashing reinitialized report resubmit rid same save separate
separated seriously should similar simple simplify single size so
something specifications specified spell spelling standard statistics
string strong submission submit submitted successfully suggest
suggested support suppose table tech th than that the them then there
these this those three through thursday times title to together tooth
track traditionally transposed trial try turing udder under understand
unlike up use used useful using usual variant variants version wall
ward warning was way we when whenever which while whitespace who why
wild will wind wire wiry with word words works write written wrong you
your

hash[(hash1+j*hash2)%m] = word;
That just assigns the address of word to the hash entry, which is always the same. At the end of the loop, the contents of word will obviously be the last thing that was read by fscanf. So you need something like:
hash[(hash1+j*hash2)%m] = strdup(word);
strdup allocates heap memory so don't forget to free it when the hash table is no longer needed.

All pointers in the hash table point to the same variable: the word array. Whatever string you last read into word is the string that will be printed for every entry in your output.
You need to allocate memory for each word, either by making hash an array of arrays and copying each word into the secondary array, or by using e.g. strdup to allocate the string dynamically.
More graphically, it can be thought of like this:
+---------+
| hash[x] | --\
+---------+    >---> +------+
| hash[y] | --/      | word |
+---------+          +------+
| hash[z] | -------> +--------+
+---------+          | "NULL" |
                     +--------+
Here both entries x and y point to word, while entry z points to the string literal "NULL" that you initialized all entries to.
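Putting the two answers together, the insertion in the question's loop could be written like this. This is only a sketch using the question's own variables; it also compares the "NULL" sentinel with strcmp, since comparing string literals by pointer is not guaranteed to work.

while (strcmp(hash[(hash1 + j*hash2) % m], "NULL") != 0)
    j++;
hash[(hash1 + j*hash2) % m] = strdup(word);   /* private copy, not the shared buffer */

/* later, when the table is no longer needed: */
for (i = 0; i < 569; i++)
    if (strcmp(hash[i], "NULL") != 0)
        free(hash[i]);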

Related

Dynamically indexing an array in C

Is it possible to create arrays based on their index, as in
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y] = someNr;
dynamically/on the fly, without creating foo[0...3][0...4]?
If not, is there a data structure that allows me to do something similar to this in C?
No.
As written, your code makes no sense at all. You need foo to be declared somewhere, and then you can index into it with foo[x][y] = someNr;. But you can't just make foo spring into existence, which is what it looks like you are trying to do.
Either create foo with the correct sizes (only you can say what they are), for example int foo[16][16];, or use a different data structure.
In C++ you could use a map<pair<int, int>, int>.
Variable Length Arrays
Even if x and y were replaced by constants, you could not initialize the array using the notation shown. You'd need to use:
int fixed[3][4] = { someNr };
or similar (extra braces, perhaps; more values perhaps). You can, however, declare/define variable length arrays (VLA), but you cannot initialize them at all. So, you could write:
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y];
for (int i = 0; i < x; i++)
{
for (int j = 0; j < y; j++)
foo[i][j] = someNr + i * (x + 1) + j;
}
Obviously, you can't use x and y as indexes without writing (or reading) outside the bounds of the array. The onus is on you to ensure that there is enough space on the stack for the values chosen as the limits on the arrays (it won't be a problem at 3x4; it might be at 300x400 though, and will be at 3000x4000). You can also use dynamic allocation of VLAs to handle bigger matrices.
VLA support is mandatory in C99, optional in C11 and C18, and non-existent in strict C90.
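For instance, a heap-allocated VLA-style matrix avoids the stack-size concern. This is only a sketch, reusing the x and y from above:

#include <stdlib.h>

int main(void) {
    int x = 4, y = 5;
    int (*foo)[y] = malloc(x * sizeof *foo);   /* x rows of y ints on the heap */
    if (foo == NULL)
        return 1;
    foo[2][3] = 123;                           /* indexed like a normal 2-D array */
    free(foo);
    return 0;
}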
Sparse arrays
If what you want is 'sparse array support', there is no built-in facility in C that will assist you. You have to devise (or find) code that will handle that for you. It can certainly be done; Fortran programmers used to have to do it quite often in the bad old days when megabytes of memory were a luxury and MIPS meant millions of instructions per second and people were happy when their computer could do double-digit MIPS (and the Fortran 90 standard was still years in the future).
You'll need to devise a structure and a set of functions to handle the sparse array. You will probably need to decide whether you have values in every row, or whether you only record the data in some rows. You'll need a function to assign a value to a cell, and another to retrieve the value from a cell. You'll need to think what the value is when there is no explicit entry. (The thinking probably isn't hard. The default value is usually zero, but an infinity or a NaN (not a number) might be appropriate, depending on context.) You'd also need a function to allocate the base structure (would you specify the maximum sizes?) and another to release it.
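As one illustration of that shape, here is a minimal (and deliberately naive) sketch of such a structure and its accessors, using a linked list of occupied cells and a default value for everything else. All names here are hypothetical:

#include <stdlib.h>

struct cell {
    int row, col;
    int value;
    struct cell *next;            /* simple linked list of occupied cells */
};

struct sparse {
    struct cell *cells;
    int default_value;            /* value reported for cells never written */
};

/* Assign a value to a cell, creating it if it has no explicit entry yet. */
static void sparse_set(struct sparse *s, int row, int col, int value)
{
    for (struct cell *c = s->cells; c != NULL; c = c->next)
        if (c->row == row && c->col == col) { c->value = value; return; }
    struct cell *c = malloc(sizeof *c);
    if (c == NULL) return;        /* minimal error handling for the sketch */
    c->row = row; c->col = col; c->value = value;
    c->next = s->cells;
    s->cells = c;
}

/* Retrieve a cell's value, falling back to the default when absent. */
static int sparse_get(const struct sparse *s, int row, int col)
{
    for (const struct cell *c = s->cells; c != NULL; c = c->next)
        if (c->row == row && c->col == col) return c->value;
    return s->default_value;
}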
The most efficient way to create a dynamic index of an array is to create an empty array of the same data type as the array you want to index.
Let's imagine we are using integers for the sake of simplicity. You can then stretch the concept to any other data type.
The ideal index depth will depend on the length of the data to index and will be somewhere close to the length of the data.
Let's say you have 1 million 64 bit integers in the array to index.
First of all you should order the data and eliminate duplicates. That's easy to achieve by using qsort() (the C standard library's quicksort function) and a duplicate-removal function such as
uint64_t remove_dupes(char **unord_arr, char **ord_arr, uint64_t arr_size)
{
    /* assumes unord_arr is already sorted; copies unique entries to ord_arr */
    uint64_t i, j = 0;
    for (i = 1; i < arr_size; i++)
    {
        if (strcmp(unord_arr[i], unord_arr[i-1]) != 0) {
            strcpy(ord_arr[j], unord_arr[i-1]);
            j++;
        }
        if (i == arr_size-1) {
            strcpy(ord_arr[j], unord_arr[i]);
            j++;
        }
    }
    return j;
}
Adapt the code above to your needs; you should free() the unordered array once its contents have been copied into the ordered array. The function above is very fast, but it will return zero entries when the array to order contains only one element; that's probably something you can live with.
Once the data is ordered and unique, create an index with a length close to that of the data. It does not need to be an exact length, although sticking to powers of 10 will make everything easier in the case of integers.
uint64_t* idx = calloc(pow(10, indexdepth), sizeof(uint64_t));
This will create an empty index array.
Then populate the index. Traverse the array to be indexed just once, and every time you detect a change in the leading significant figures (as many of them as the index depth), record the position where that new prefix was first detected.
If you choose an indexdepth of 2 you will have 10² = 100 possible values in your index, typically going from 0 to 99.
When you detect that some number starts with 10 (e.g. 103456), you add an entry to the index. Let's say that 103456 was detected at position 733; your index entry would be:
index[10] = 733;
The next entry, beginning with 11, goes in the next index slot. Let's say the first number beginning with 11 is found at position 2023:
index[11] = 2023;
And so on.
When you later need to find some number in your original array storing 1 million entries, you don't have to iterate the whole array. You just need to look up, in your index, where the first number sharing the target's leading two significant digits is stored. Entry index[10] tells you where the first number starting with 10 is stored. You can then iterate forward until you find your match.
In my example I employed a small index, thus the average number of iterations that you will need to perform will be 1000000/100 = 10000
If you enlarge your index to somewhere close the length of the data the number of iterations will tend to 1, making any search blazing fast.
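A minimal sketch of that scheme in C, assuming indexdepth == 2 and a sorted, duplicate-free array of 64-bit values. The helper names here are mine, not from the answer:

#include <stdint.h>
#include <stdlib.h>

static uint64_t two_digit_prefix(uint64_t v)
{
    while (v >= 100)
        v /= 10;                  /* keep the two leading significant figures */
    return v;
}

static uint64_t *build_index(const uint64_t *data, uint64_t n)
{
    /* 10^2 slots; slots for prefixes that never occur stay 0, so a lookup
       for them simply starts scanning from the beginning of the data. */
    uint64_t *idx = calloc(100, sizeof *idx);
    uint64_t seen = UINT64_MAX;
    if (idx == NULL)
        return NULL;
    for (uint64_t pos = 0; pos < n; pos++) {
        uint64_t prefix = two_digit_prefix(data[pos]);
        if (prefix != seen) {     /* first position where this prefix appears */
            idx[prefix] = pos;
            seen = prefix;
        }
    }
    return idx;
}

/* Scan forward from the indexed position; the data is sorted, so we can stop
   as soon as we pass the target.  Returns the position, or -1 if not found. */
static long long find_value(const uint64_t *data, uint64_t n,
                            const uint64_t *idx, uint64_t target)
{
    for (uint64_t pos = idx[two_digit_prefix(target)];
         pos < n && data[pos] <= target; pos++)
        if (data[pos] == target)
            return (long long)pos;
    return -1;
}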
What I like to do is to create some simple algorithm that tells me what's the ideal depth of the index after knowing the type and length of the data to index.
Please note that in the example I have posed, 64-bit numbers are indexed by their first index-depth significant figures, so 10 and 100001 will be stored in the same index segment. That's not a problem on its own; nonetheless, each master has his small book of secrets. Treating numbers as fixed-length hexadecimal strings can help keep a strict numerical order.
You don't have to change the base, though: you could consider 10 to be 0000010 to keep it in the 00 index segment and keep base-10 numbers ordered. Using different numerical bases is nonetheless trivial in C, which is of great help for this task.
As you make the index depth larger, the number of entries per index segment is reduced.
Please do note that programming, especially at a lower level like C, consists in great part of understanding the trade-off between CPU cycles and memory use.
Creating the proposed index is a way to reduce the number of CPU cycles required to locate a value, at the cost of using more memory as the index becomes larger. This is nonetheless the way to go nowadays, as massive amounts of memory are cheap.
As SSD speeds get closer to that of RAM, using files to store indexes is worth taking into account. Nevertheless, modern OSs tend to load into RAM as much as they can, so using files would end up performing similarly.

(C) Recursive strcpy() that takes only 1 parameter

Let me be clear from the get go, this is not a dupe, I'll explain how.
So, I tasked myself to write a function that imitates strcpy but with 2 conditions:
it needs to be recursive
it must take a single parameter (which is the original string)
The function should return a pointer to the newly copied string. So this is what I've tried so far:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char * my_strcpy(char *original);

int main(void) {
    char *string = my_strcpy("alpine");
    printf("string = <%s>\n", string);
    return 0;
}

char * my_strcpy(char *original){
    char *string = (char *)malloc(10);
    if(*original == '\0') {
        return string;
    }
    *string++ = *original;
    my_strcpy(original + 1);
}
The problem is somewhat obvious: string gets malloc-ed every time my_strcpy() is called. One of the solutions I could think of would be to allocate memory for string only the first time the function is called. Since I'm allowed to have only 1 parameter, the only thing I could think of was to check the call stack, but I don't know whether that's allowed, and it does feel like cheating.
Is there a logical solution to this problem?
You wrote it as tail recursive, but I think without making the function non-reentrant your only option is going to be to make the function head recursive and repeatedly call realloc on the return value of the recursive call to expand it, then add in one character. This has the same problem that just calling strlen to do the allocation has: it does something linear in the length of the input string in every recursive call and turns out to be an implicitly n-squared algorithm (0.5*n*(n+1)). You can improve it by making the amortized time complexity better, by expanding the string by a factor and only growing it when the existing buffer is full, but it's still not great.
There's a reason you wouldn't use recursion for this task (which you probably know): stack depth will be equal to input string length, and a whole stack frame pushed and a call instruction for every character copied is a lot of overhead. Even so, you wouldn't do it recursively with a single argument if you were really going to do it recursively: you'd make a single-argument function that declares some locals and calls a recursive function with multiple arguments.
Even with the realloc trick, it'll be difficult or impossible to count the characters in the original as you go so that you can call realloc appropriately, remembering that other stdlib "str*" functions are off limits because they'll likely make your whole function n-squared, which I assumed we were trying to avoid.
Ugly tricks like verifying that the string is as long as a pointer and replacing the first few characters with a pointer by memcpy could be used, making the base case for the recursion more complicated, but, um, yuck.
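For what it's worth, here is a minimal sketch of that head-recursive realloc variant. It ignores the "no other str* helpers" aspiration and accepts the quadratic cost described above (strlen plus the memmove make every call linear in the length of the tail):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *my_strcpy(const char *original)
{
    if (*original == '\0') {
        char *copy = malloc(1);
        if (copy != NULL)
            copy[0] = '\0';
        return copy;
    }
    char *tail = my_strcpy(original + 1);       /* copy the tail first */
    if (tail == NULL)
        return NULL;
    size_t tail_len = strlen(tail);
    char *copy = realloc(tail, tail_len + 2);   /* grow by one character */
    if (copy == NULL) {
        free(tail);
        return NULL;
    }
    memmove(copy + 1, copy, tail_len + 1);      /* shift tail (and NUL) right */
    copy[0] = *original;                        /* prepend the current char */
    return copy;
}

int main(void)
{
    char *s = my_strcpy("alpine");
    if (s != NULL)
        printf("string = <%s>\n", s);
    free(s);
    return 0;
}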
Recursion is a technique for analysing problems. That is, you start with the problem and think about what the recursive structure of a solution might be. You don't start with a recursive structure and then attempt to shoe-horn your problem willy-nilly into it.
In other words, it's good to practice recursive analysis, but the task you have set yourself -- to force the solution to have the form of a one-parameter function -- is not the way to do that. If you start contemplating global or static variables, or extracting runtime context by breaking into the call stack, you have a pretty good hint that you have not yet found the appropriate recursive analysis.
That's not to say that there is not an elegant recursive solution to your problem. There is one, but before we get to it, we might want to abstract away a detail of the problem in order to provide some motivation.
Clearly, if we have a contiguous data structure already in memory, making a contiguous copy is not challenging. If we don't know how big it is, we can do two traverses: one to find its size, after which we can allocate the needed memory, and another one to do the copy. Both of those tasks are simple loops, which is one form of recursion.
The essential nature of a recursive solution is to think about how to step from a problem to a (slightly) simpler or smaller problem. Or, more commonly, a small number of smaller or simpler problems.
That's the nature of one of the most classic recursive problems: sorting a sequence of numbers. The basic structure: divide the sequence into two roughly equal parts; sort each part (the recursive step) and put the results back together so that the combination is sorted. That basic outline has at least two interesting (and very different) manifestations:
Divide the sequence arbitrarily into two almost equal parts, either by putting alternate elements in alternate parts or by putting the first half in one part and the rest in the other part. (The first one will work nicely if we don't know in advance how big the sequence is.) To put the sorted parts together, we have to interleave ("merge") them. (This is mergesort.)
Divide the sequence into two ranges by estimating the middle value and putting all smaller values into one part and all larger values into the other part. To put the sorted parts together, we just concatenate them. (This is quicksort.)
In both cases, we also need to use fact that a single-element sequence is (trivially) sorted so no more processing needs to be done. If we divide a sequence into two parts often enough, ensuring that neither part is empty, we must eventually reach a part containing one element. (If we manage to do the division accurately, that will happen quite soon.)
It's interesting to implement these two strategies using singly-linked lists, so that the lengths really are not easily known. Both can be implemented this way, and the implementations reveal something important about the nature of sorting.
But let's get back to the much simpler problem at hand, copying a sequence into a newly-allocated contiguous array. To make the problem more interesting, we won't assume that the sequence is already stored contiguously, nor that we can traverse it twice.
To start, we need to find the length of the sequence, which we can do by observing that an empty sequence has length zero and any other sequence has one more element than the subsequence starting after the first element (the "tail" of the sequence.)
Length(seq):
If seq is empty, return 0.
Else, return 1 + Length(Tail(seq))
Now, suppose we have allocated storage for the copy. We can copy by observing that an empty sequence is fully copied, and any other sequence can be copied by placing the first element into the allocated storage and then copying the tail of the sequence into the storage starting at the second position (and this procedure logically takes two arguments):
Copy(destination, seq):
If seq is not empty:
Put Head(seq) into the location destination
Call Copy (destination+1, Tail(seq))
But we can't just put those two procedures together, because that would traverse the sequence twice, which we said we couldn't do. So we need to somehow nest these algorithms.
To do that, we have to start by passing the accumulated length down through the recursion, so that we can use it to allocate the storage once we know how big the object is. Then, on the way back, we need to copy the element we counted on the way down:
Copy(seq, length):
If seq is not empty:
Set item to its first element (that is, Head(seq))
Set destination to Copy(Tail(seq), length + 1)
Store item at location destination - 1
Return destination - 1
Otherwise: (seq is empty)
Set destination to Allocate(length)
# (see important note below)
Return destination + length
To correctly start the recursion, we need to pass in 0 as the initial length. It's bad style to force the user to insert "magic numbers", so we would normally wrap the function with a single-argument driver:
Strdup(seq):
Return Copy (seq, 0)
Important Note: if this were written in C using strings, we would need to NUL-terminate the copy. That means allocating length+1 bytes, rather than length, and then storing 0 at destination+length.
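For reference, a direct C rendering of that pseudocode might look like the following. The names copy_rec and my_strdup are illustrative, and the NUL-termination adjustment from the note is included:

#include <stdio.h>
#include <stdlib.h>

static char *copy_rec(const char *seq, size_t length)
{
    if (*seq != '\0') {
        char item = *seq;                            /* Head(seq) */
        char *destination = copy_rec(seq + 1, length + 1);
        if (destination == NULL)
            return NULL;
        *--destination = item;                       /* store on the way back up */
        return destination;
    }
    char *destination = malloc(length + 1);          /* we now know the length */
    if (destination == NULL)
        return NULL;
    destination[length] = '\0';                      /* NUL-terminate the copy */
    return destination + length;                     /* just past the last slot */
}

char *my_strdup(const char *seq)
{
    return copy_rec(seq, 0);                         /* single-argument driver */
}

int main(void)
{
    char *s = my_strdup("alpine");
    if (s != NULL)
        printf("string = <%s>\n", s);
    free(s);
    return 0;
}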
You didn't say we couldn't use strcat.
So here is a logical (although somewhat useless) answer that uses recursion to do nothing other than chop off the last character and add it back on again.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char * my_strcpy(char *original);

int main(void) {
    char *string = my_strcpy("alpine");
    printf("string = <%s>\n", string);
    return 0;
}

char * my_strcpy(char *original){
    if(*original == '\0') {
        return original;
    }
    int len = strlen(original);
    char *string = (char *)malloc(len+1);
    char *result = (char *)malloc(len+1);
    string[0] = result[0] = '\0';
    strcat (string, original);
    len--;
    char store[2] = {string[len] , '\0'}; // save last char
    string[len] = '\0';                   // cut it off
    strcat (result, my_strcpy(string));
    strcat (result, store);               // add it back
    return result;
}

Implementing a Hash Table using C

I am trying to read a file of names, and hash those names to a spot based on the value of their letters. I have succeeded in getting the value for each name in the file, but I am having trouble creating a hash table. Can I just use a regular array and write a function to put the words at their value's index?
while (fgets(name,100, ptr_file)!=NULL) //while file isn't empty
{
    fscanf(ptr_file, "%s", &name);  //get the first name
    printf("%s ",name);             //prints the name
    int length = strlen(name);      // gets the length of the name
    int i;
    for (i =0; i <length; i++)
    //for the length of the string add each letter's value up
    {
        value = value + name [i];
        value=value%50;
    }
    printf("value= %1d\n", value);
}
No, you can't, because you'll have collisions, and you'll thus have to account for multiple values mapping to a single hash. Generally, your hashing is not really a wise implementation -- why do you limit the range of values to 50? Is memory really that scarce that you can't have a dictionary bigger than 50 pointers?
I recommend using an existing C string hash table implementation, like this one from 2001.
In general, you will have hash collisions and you will need to find a way to handle that. The two main ways of handling that situation are as follows:
If the hash table entry is taken, you use the next entry, and iterate until you find an empty entry. When looking things up, you need to do similarly. This is nice and simple for adding things to the hash table, but makes removing things tricky.
Use hash buckets. Each hash table entry is a linked list of entries with the same hash value. First you find your hash bucket, then you find the entry on the list. When adding a new entry, you merely add it to the list.
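For illustration, here is a minimal sketch of the first approach (open addressing with linear probing). The table size and function names are hypothetical:

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024

static char *table[TABLE_SIZE];         /* NULL means "slot is empty" */

/* Insert a copy of name, probing forward from its hash until a free slot
   (or an existing copy of the same name) is found. */
static int insert(unsigned hash, const char *name)
{
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        unsigned slot = (hash + i) % TABLE_SIZE;
        if (table[slot] == NULL) {
            table[slot] = strdup(name);
            return table[slot] != NULL;
        }
        if (strcmp(table[slot], name) == 0)
            return 1;                   /* already present */
    }
    return 0;                           /* table is full */
}

/* Look a name up the same way: stop at the first empty slot. */
static int contains(unsigned hash, const char *name)
{
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        unsigned slot = (hash + i) % TABLE_SIZE;
        if (table[slot] == NULL)
            return 0;
        if (strcmp(table[slot], name) == 0)
            return 1;
    }
    return 0;
}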
This is my demo program, which reads strings (words) from stdin and inserts them into a hash table. Then (at EOF) it iterates over the hash table, computes the number of words and of distinct words, and prints the most frequent word:
http://olegh.cc.st/src/words.c.txt
The hash table uses a double hashing algorithm.

Comparing strings in two files in C;

I'm new to the C language, so I'll appreciate any help :D
I need to compare the words given in the first file (" Albert\n Martin\n Bob ") with the words in the second file (" Albert\n Randy\n Martin\n Ohio ").
Whenever a word from the first file also appears in the second file, I need to write the word "Language" to the output; every word with no counterpart in the second file should be written as-is.
Something like this:
Language
Language
Bob
needs to end up in my third file.
I tried to come up with some ideas, but they don't work.
Thanks for every answer in advance.
First, you need to open a stream to read the files.
If you need to do this in C, then you may use the strcmp function. It allows you to compare two strings.
For example:
int strcmp(const char *s1, const char *s2);
I'd open all three files to begin with (both input files and the output file). If you can't open all of them then you can't do anything useful (other than display an error message or something); and there's no point wasting CPU time only to find out that (for e.g.) you can't open the output file later. This can also help to reduce race conditions (e.g. second file changes while you're processing the first file).
Next, start processing the first file. Break it into words/tokens as you read it, and for each word/token calculate a hash value. Then use the hash value and the word/token itself to check if the new word/token is a duplicate of a previous (already known) word/token. If it's not a duplicate, allocate some memory and create a new entry for the word/token and insert the entry onto the linked list that corresponds to the hash.
Finally, process the second file. This is similar to how you processed the first file (break it into words/tokens, calculate the hash, use the hash to find out if the word/token is known), except if the word/token isn't known you write it to the output file, and if it is known you write " language" to the output file instead.
If you're not familiar with hash tables, they're fairly easy. For a simple method (not necessarily the best method) of calculating the hash value for ASCII/text, you could do something like:
hash = 0;
while(*src != 0) {
    hash = hash ^ (hash << 5) ^ *src;
    src++;
}
hash = hash % HASH_SIZE;
Then you have an array of linked lists, like "INDEX_ENTRY *index[HASH_SIZE]" that contains a pointer to the first entry for each linked list (or NULL if the linked list for the hash is empty).
To search, use the hash to find the first entry of the correct linked list then do "strcmp()" on each entry in the linked list. An example might look something like this:
INDEX_ENTRY *find_entry(uint32_t hash, char *new_word) {
    INDEX_ENTRY *entry;

    entry = index[hash];
    while(entry != NULL) {
        if(strcmp(new_word, entry->word) == 0) return entry;
        entry = entry->next;
    }
    return NULL;
}
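The answer doesn't show the INDEX_ENTRY type itself; a minimal shape that matches the code above, plus a hypothetical companion for inserting words that aren't known yet, might be:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 1024

typedef struct index_entry {
    char *word;                  /* the stored word (the table owns its own copy) */
    struct index_entry *next;    /* next entry in this hash's linked list */
} INDEX_ENTRY;

static INDEX_ENTRY *index[HASH_SIZE];

/* Companion to find_entry(): add a word that isn't in the table yet. */
INDEX_ENTRY *add_entry(uint32_t hash, const char *new_word) {
    INDEX_ENTRY *entry = malloc(sizeof *entry);
    if (entry == NULL)
        return NULL;
    entry->word = strdup(new_word);
    if (entry->word == NULL) {
        free(entry);
        return NULL;
    }
    entry->next = index[hash];   /* push onto the front of the bucket's list */
    index[hash] = entry;
    return entry;
}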
The idea of all this is to improve performance. For example, if both files have 1024 words then (without a hash table) you'd need to do "strcmp()" 1024*1024 times; but if you use a hash table with "#define HASH_SIZE 1024" you'll probably reduce that to about 2000 times (and end up with much faster code). Larger values of HASH_SIZE increase the amount of memory you use a little (and reduce the chance of different words having the same hash).
Don't forget to close your files when you're finished with them. Freeing the memory you used is a good idea if you do something else after this (but if you don't do anything after this then it's faster and easier to "exit()" and let the OS cleanup).

choosing the element with more duplicates on an array in plain C

This question brings me back to my college days, but since I haven't coded since those days (more than 20 years ago) I am a bit rusty.
Basically I have an array of 256 elements. There might be 1 element on the array, 14 or 256. This array contains the usernames of people requesting data from the system. I am trying to count the duplicates on the list so I can give priority to the user with the most requests. So, if I have a list such as:
{john, john, paul, james, john, david, charles, charles, paul, john}
I would choose John because it appeared 4 times.
I can iterate the array and copy the elements to another and start counting but it gets complicated after a while. As I said, I am very rusty.
I am sure there is an easy way to do this. Any ideas? Code would be very helpful here.
Thank you!
EDIT:
The buffer is declared as:
static WCHAR userbuffer[150][128];
There could be up to 150 users and each username is up to 128 chars long.
1 - Sort your array.
2 - Set maxcount = 0.
3 - Iterate over the array and count until you reach the NEXT username.
4 - If count > maxcount, set maxcount to count and save the name as the candidate.
5 - After the loop finishes, pick the candidate.
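A sketch of those steps in C, assuming the question's WCHAR is wchar_t, and using hypothetical helper names:

#include <stdlib.h>
#include <wchar.h>

#define NAME_LEN 128

/* qsort comparator: each element is one row (one name) of the array. */
static int cmp_names(const void *a, const void *b)
{
    return wcscmp((const wchar_t *)a, (const wchar_t *)b);
}

/* Sorts the names in place, then scans for the longest run of equal names.
   Returns the index of (one copy of) the most frequent name. */
static size_t most_frequent(wchar_t names[][NAME_LEN], size_t n)
{
    size_t best = 0, bestcount = 0;              /* maxcount = 0 */
    qsort(names, n, sizeof names[0], cmp_names); /* 1 - sort the array */
    for (size_t i = 0; i < n; ) {                /* 3 - count each run */
        size_t j = i;
        while (j < n && wcscmp(names[j], names[i]) == 0)
            j++;
        if (j - i > bestcount) {                 /* 4 - keep the best candidate */
            bestcount = j - i;
            best = i;
        }
        i = j;
    }
    return best;                                 /* 5 - pick the candidate */
}

With the question's buffer, most_frequent(userbuffer, number_of_names_in_use) would return the row of the winning user; note that it reorders the buffer.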
Here's how I would solve it:
First, define a structure to hold user names and frequency counts and make an array of them with the same number of elements as your userbuffer array (150 in your example).
struct user {
    WCHAR name[128];
    int count;
} users[150];
Now, loop through userbuffer and for each entry, check and see if you have an entry in users that has the same name. If you find a match, increment the count for that entry in users. If you don't find a match, copy the user's name from userbuffer into a blank entry in users and set that user's count to 1.
After you have processed the entire userbuffer array, you can use the count values from your users array to see who has the most requests. You can either iterate through it manually to find the max or use qsort.
It's not the most efficient algorithm, but it's direct and doesn't do anything too clever or fancy.
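A sketch of that loop, assuming the users array above, that WCHAR is wchar_t, and a hypothetical nrequests saying how many rows of userbuffer are filled:

#include <wchar.h>

/* Count how many times each name occurs, filling users[] as described above.
   Returns the number of distinct names found. */
static int count_users(wchar_t userbuffer[][128], int nrequests, struct user *users)
{
    int nusers = 0;
    for (int i = 0; i < nrequests; i++) {
        int j;
        for (j = 0; j < nusers; j++) {
            if (wcscmp(users[j].name, userbuffer[i]) == 0) {
                users[j].count++;            /* known name: bump its count */
                break;
            }
        }
        if (j == nusers) {                   /* new name: start an entry */
            wcscpy(users[nusers].name, userbuffer[i]);
            users[nusers].count = 1;
            nusers++;
        }
    }
    return nusers;
}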
Edit:
To qsort the users array, you will need to define a simple function that qsort can use to sort by. For this example, something like this should work:
static int compare_users(const void *p1, const void *p2) {
    struct user *u1 = (struct user*)p1;
    struct user *u2 = (struct user*)p2;
    if (u1->count > u2->count)
        return 1;
    if (u1->count < u2->count)
        return -1;
    return 0;
}
Now, you can pass this function to qsort like:
qsort(users, 150, sizeof(struct user), compare_users);
What did you mean by "there might be 1 element on the array, 14 or 256"? Are 14 and 256 element counts?
If you can change the array definition, I think the best way would be to define
a structure with two fields, username and numberOfRequests. When a user makes a request, if the username already exists in the list, numberOfRequests is increased. If it is the first time the user is requesting, the username should be added to the list with numberOfRequests set to 1.
If you cannot change the array definition, one option is to sort the array with quicksort or another algorithm. It will be easy to count the number of requests after that.
However, maybe there is a library with this functionality, but I doubt you can find something in the standard library!
Using std::map as proposed by AraK, you can store your names without duplicates and associate each name with a value.
A quick example:
std::map<string, long> names;
names["john"]++;
names["david"]++;
You can then search which key has the biggest value.
