Implementing a Hash Table using C - c

I am trying to read a file of names, and hash those names to a spot based on the value of their letters. I have succeeded in getting the value for each name in the file, but I am having trouble creating a hash table. Can I just use a regular array and write a function to put the words at their value's index?
while (fscanf(ptr_file, "%99s", name) == 1) // read one name at a time until the input runs out
{
    printf("%s ", name); // prints the name
    int length = strlen(name); // gets the length of the name
    int value = 0; // start the sum from zero for every name
    int i;
    for (i = 0; i < length; i++)
    // for the length of the string add each letter's value up
    {
        value = value + name[i];
        value = value % 50;
    }
    printf("value= %d\n", value);
}

No, you can't, because you'll have collisions, so you have to account for multiple values mapping to a single hash. More generally, your hashing scheme is not a wise choice: why do you limit the range of values to 50? Is memory really so scarce that you can't afford a dictionary bigger than 50 pointers?
I recommend using an existing C string hash table implementation, like this one from 2001.

In general, you will have hash collisions, and you will need a way to handle them. The two main approaches are as follows:
If the hash table entry is taken, you use the next entry, and keep going until you find an empty one (open addressing with linear probing). Lookups have to follow the same sequence. This makes adding things to the hash table nice and simple, but it makes removing things tricky.
Use hash buckets (separate chaining). Each hash table entry is a linked list of the entries with the same hash value. First you find your hash bucket, then you find the entry on the list. When adding a new entry, you merely add it to the list.
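The bucket approach can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: it keeps the question's "sum of letters modulo 50" hash, and the names TABLE_SIZE, struct node, hash_name, table_find and table_insert are made up for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 50  /* assumed: same range as the "% 50" in the question */

struct node {
    char *name;         /* private copy of the stored name */
    struct node *next;  /* next entry in the same bucket */
};

/* Sum of the letters, reduced to a bucket index (the question's hash). */
static unsigned hash_name(const char *name)
{
    unsigned value = 0;
    while (*name != '\0')
        value = (value + (unsigned char)*name++) % TABLE_SIZE;
    return value;
}

/* Returns the stored node for name, or NULL if it is not in the table. */
static struct node *table_find(struct node *table[], const char *name)
{
    struct node *n;
    for (n = table[hash_name(name)]; n != NULL; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n;
    return NULL;
}

/* Adds name at the head of its bucket unless it is already present.
   Returns 0 on success, -1 if memory could not be allocated. */
static int table_insert(struct node *table[], const char *name)
{
    unsigned h = hash_name(name);
    struct node *n;
    if (table_find(table, name) != NULL)
        return 0;
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->name = malloc(strlen(name) + 1);
    if (n->name == NULL) {
        free(n);
        return -1;
    }
    strcpy(n->name, name);
    n->next = table[h];
    table[h] = n;
    return 0;
}
```

The table itself is just `struct node *table[TABLE_SIZE] = {0};`. Names that are anagrams of each other (such as "amy" and "may") always share a bucket with this hash, which is exactly the collision case the list handles.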

This is my demo program, which reads strings (words) from stdin and
stores them in a hash table. Then, at EOF, it iterates over the hash table, computes the number of words and of distinct words, and prints the most frequent word:
http://olegh.cc.st/src/words.c.txt
The hash table uses a double hashing algorithm.

Related

Comparing nodes of a list with a string array

If I have an array that contains some chars, such as [a,b,c], and another array that contains the respective frequency of each char, such as [2,1,1], I would like to go through a linked list whose nodes each hold a string and see whether those strings also contain the chars from my original array with the same frequency.
My approach
I was thinking I need
one loop that starts at index 0 of the original array, and another loop inside it that checks every node for that char. If my temp pointer reaches NULL, all of the nodes have it; if not, they don't, and I move on to the next char. However, I am not sure how to implement this approach, as I am very new to C. I was also wondering whether it is possible to do this in O(N) time, as my approach would be O(N^2).
Sample output (I apologize for the confusion):
if you have 3 nodes, each with a char array, containing "nba", "tba" and "rba",
the output should be b a, since both of them appear an equal number of times in each node.
So you start both your char array and your frequency array at index 0, and then check all the nodes for strings that contain the character with the same frequency. I presume you use some kind of function that returns the frequency of a particular char in a string.
Also, your problem requires you to go through all of the nodes for every char, hence O(N^2) is implied.
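A minimal sketch of that nested check, assuming a singly linked list of C strings. The names struct str_node, count_char and all_nodes_match are hypothetical, since the question does not show its own types.

```c
#include <stddef.h>

/* Assumed node layout: each node points at one NUL-terminated string. */
struct str_node {
    const char *text;
    struct str_node *next;
};

/* Counts how many times c occurs in s. */
static int count_char(const char *s, char c)
{
    int n = 0;
    for (; *s != '\0'; s++)
        if (*s == c)
            n++;
    return n;
}

/* Returns 1 if every node's string contains c exactly freq times, else 0. */
static int all_nodes_match(const struct str_node *head, char c, int freq)
{
    for (; head != NULL; head = head->next)
        if (count_char(head->text, c) != freq)
            return 0;
    return 1;
}
```

The outer loop from the question then calls all_nodes_match once per index of the char array and prints the chars for which it returns 1. The total work is roughly the number of chars times the combined length of all strings.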

Data structure for playing notes in MIDI synthesizer

I'm working on a hardware virtual analog synthesizer using C, and I'm trying to come up with an efficient data structure to manage the dynamic assignment of synthesizer voices in response to incoming MIDI messages.
I have a structure type which holds the data for a single synthesizer voice (pitch, low frequency oscillator, ADSR settings, etc.) and I have a "NoteOn" callback function which is executed when a MIDI "Note On" message is decoded. This function needs to take an "idle" voice from an "idle" pool, modify some settings in the structure, and assign it to a "playing" pool that the main synth engine works with to generate audio samples. Then, when a "Note Off" message is received, the voice with a note value corresponding to the one in the "Note Off" message needs to be selected from the "playing" pool, have its data structure modified again, and eventually returned to the "idle" pool (depending on envelope/ADSR settings.)
I tried an implementation using linked lists for both pools, but my implementation seemed rather cumbersome and slow. I'd like this process to be as quick as possible, to maintain playing responsiveness. Any suggestions?
If a linked list is too slow, the usual answer is to implement a hash table. There are many, many possible variations of the data structure and algorithm. I'll just describe open addressing with a single hash function, because that's the variation I'm most familiar with.
So with open addressing, the table is just an array (separate chaining has an array, too, but each element is a linked list). We want the array to be, at most, about half-full for performance reasons. And at maximum capacity, the filled table will still have one empty slot, because this simplifies the algorithm.
We also need a hash function which accepts the type of the key values, and returns integers. It's very difficult to predict how the hash function will behave with respect to clustered keys and overall performance. So just make sure it's an isolated function that can easily be changed later. It can be as simple as shifting-around all the bytes and adding them together.
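int hash (const void *key, int key_length, int table_size)
{
    const unsigned char *bytes = key;
    unsigned ret = 0;
    int i;
    for (i = 0; i < key_length; i++)
    {
        ret += (unsigned)bytes[i] << (i % 8); /* bound the shift so it stays defined for long keys */
    }
    return (int)(ret % (unsigned)table_size); /* unsigned arithmetic avoids abs(INT_MIN) and signed overflow */
}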
The table-lookup function uses the hash function to decide where to start looking in the array. If the key isn't found there (determined by doing a memcmp() on the actual search key and the key stored at that position in the table), it looks at each successive key, wrapping from the end of the array back to the beginning, and declares failure if it finds an empty table element.
/* Assumes key_value_pair has a key_type member named "key", and that a key of 0 marks an empty slot. */
#define RETURN_TABLE_I_IF_EQUAL_KEY_OR_EMPTY \
    if (memcmp(&table[i].key, &key, sizeof key) == 0 || table[i].key == 0) \
        return table + i;
key_value_pair *hash_lookup(key_value_pair *table, int table_size, key_type key)
{
    int h, i;
    h = hash(&key, sizeof key, table_size);
    for (i = h; i < table_size; i++) /* from the home slot to the end of the array */
        RETURN_TABLE_I_IF_EQUAL_KEY_OR_EMPTY
    for (i = 0; i < h; i++) /* wrap around to the beginning */
        RETURN_TABLE_I_IF_EQUAL_KEY_OR_EMPTY
    return NULL; /* every slot holds some other key: the table is overfull */
}
We'll need one more function in front of this to handle a few quirks. The lookup can return a NULL pointer, which indicates not only that the key has not been found, but also that the table itself is overfull. "Overfull" really means "completely full", since we decided earlier that a "full" table should still have one empty element. This means that the two for loops should never both run to completion: the search should end at an empty table position, which signals that the key is absent. With an overfull table, the lookup has to scan the entire table before discovering that the key is not present, thus losing much of the performance advantage of using a hash at all.
The lookup function can also return a valid pointer to an empty slot. This is also a failure to find the value, but not an error. If the key/value pair is being added for the first time, this will be the slot to store it in.
Or it could return a pointer to the desired table element. And this will be faster than a linear search, be it an array or linked list.
Deleting a key from the table requires us to fill-in the vacated position in the sequence. There are a couple of options.
If you're not worried about the table running out of space (it's set really large, and the lifetime and usage can be controlled), you can overwrite the entry with a deleted special key, distinct from an empty key.
Or, if you want to reclaim the space, too, you'll need to look up the key, and then scan the rest of the "chain" (the sequence of keys up to the next empty slot, including wrap-around) and move the last key with a matching hash into the key-to-delete's position. Then overwrite this moved key/value's position with the empty key. .... oops! This process must be repeated for this last matching key until we're actually clearing the very last key in the chain. (I need to go fix this in my implementation right now!....)
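The first deletion option (a special "deleted" key) might look roughly like the sketch below. It assumes non-negative integer keys such as MIDI note numbers, a fixed table size, and a value that is an index into some voice array. All names and sizes here are illustrative, not taken from a real synth.

```c
#include <stddef.h>

#define SLOT_EMPTY   (-1)    /* slot has never been used */
#define SLOT_DELETED (-2)    /* slot was used, then removed (a tombstone) */
#define VOICE_TABLE_SIZE 16  /* hypothetical capacity; keep it at most half full */

struct slot {
    int key;    /* assumed non-negative, e.g. a MIDI note number 0..127 */
    int value;  /* e.g. an index into an array of voices */
};

static void table_init(struct slot *t)
{
    int i;
    for (i = 0; i < VOICE_TABLE_SIZE; i++)
        t[i].key = SLOT_EMPTY;
}

/* Returns the slot holding key, or NULL if absent.
   Probing skips tombstones and stops at the first never-used slot. */
static struct slot *table_lookup(struct slot *t, int key)
{
    int h = key % VOICE_TABLE_SIZE, n;
    for (n = 0; n < VOICE_TABLE_SIZE; n++) {
        struct slot *s = &t[(h + n) % VOICE_TABLE_SIZE];
        if (s->key == key)
            return s;
        if (s->key == SLOT_EMPTY)
            return NULL;
    }
    return NULL;
}

/* Inserts or updates key. Returns 0 on success, -1 if no slot is free. */
static int table_put(struct slot *t, int key, int value)
{
    struct slot *s = table_lookup(t, key);
    int h = key % VOICE_TABLE_SIZE, n;
    if (s != NULL) {
        s->value = value;
        return 0;
    }
    for (n = 0; n < VOICE_TABLE_SIZE; n++) {
        s = &t[(h + n) % VOICE_TABLE_SIZE];
        if (s->key == SLOT_EMPTY || s->key == SLOT_DELETED) {
            s->key = key;      /* tombstones are reused by later insertions */
            s->value = value;
            return 0;
        }
    }
    return -1;
}

/* Marks key as deleted. Returns 1 if it was present, 0 otherwise. */
static int table_remove(struct slot *t, int key)
{
    struct slot *s = table_lookup(t, key);
    if (s == NULL)
        return 0;
    s->key = SLOT_DELETED;
    return 1;
}
```

Because a removed slot is marked rather than emptied, keys stored further along the same probe sequence stay reachable. The cost is that lookups get slower as tombstones pile up, which is why this option suits tables that are large relative to their load.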

Comparing strings in two files in C;

I'm new to C, so I'll appreciate any help :D
I need to compare the words in the first file ( " Albert\n Martin\n Bob" ) with the words in the second file ( " Albert\n Randy\n Martin\n Ohio" ).
Whenever a word appears in both, I need to write the word " Language " to the output file, and print every word that has no match in the second file as it is.
Something like this:
Language
Language
Bob
needs to be in my third file.
I tried to come up with some ideas, but they don't work.
Thanks in advance for every answer.
First, you need to open a stream to read the files.
If you need to do this in C, then you may use the strcmp function. It compares two strings and returns 0 when they are equal.
For example:
int strcmp(const char *s1, const char *s2);
I'd open all three files to begin with (both input files and the output file). If you can't open all of them then you can't do anything useful (other than display an error message or something); and there's no point wasting CPU time only to find out that (for e.g.) you can't open the output file later. This can also help to reduce race conditions (e.g. second file changes while you're processing the first file).
Next, start processing the first file. Break it into words/tokens as you read it, and for each word/token calculate a hash value. Then use the hash value and the word/token itself to check if the new word/token is a duplicate of a previous (already known) word/token. If it's not a duplicate, allocate some memory and create a new entry for the word/token and insert the entry onto the linked list that corresponds to the hash.
Finally, process the second file. This is similar to how you processed the first file (break it into words/tokens, calculate the hash, use the hash to find out if the word/token is known), except that if the word/token isn't known you write it to the output file, and if it is known you write "Language" to the output file instead.
If you're not familiar with hash tables, they're fairly easy. For a simple method (not necessarily the best method) of calculating the hash value for ASCII/text you could do something like:
hash = 0;
while(*src != 0) {
hash = hash ^ (hash << 5) ^ *src;
src++;
}
hash = hash % HASH_SIZE;
Then you have an array of linked lists, like "INDEX_ENTRY *index[HASH_SIZE]" that contains a pointer to the first entry for each linked list (or NULL if the linked list for the hash is empty).
To search, use the hash to find the first entry of the correct linked list then do "strcmp()" on each entry in the linked list. An example might look something like this:
INDEX_ENTRY *find_entry(uint32_t hash, char *new_word) {
INDEX_ENTRY *entry;
entry = index[hash];
while(entry != NULL) {
if(strcmp(new_word, entry->word) == 0) return entry;
entry = entry->next;
}
return NULL;
}
The idea of all this is to improve performance. For example, if both files have 1024 words then (without a hash table) you'd need to do "strcmp()" 1024*1024 times; but if you use a hash table with "#define HASH_SIZE 1024" you'll probably reduce that to about 2000 times (and end up with much faster code). Larger values of HASH_SIZE increase the amount of memory you use a little (and reduce the chance of different words having the same hash).
Don't forget to close your files when you're finished with them. Freeing the memory you used is a good idea if you do something else after this (but if you don't do anything after this then it's faster and easier to "exit()" and let the OS cleanup).
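Put together, the pieces might look like the sketch below. It works on words already in memory rather than on files, to keep the example short. Note one deliberate difference: to reproduce the sample output in the question (Language, Language, Bob), it indexes the words of the second file and walks the first; swap the roles to get the order described in this answer. The names word_index, hash_word, add_word, is_known and write_result are made up for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 1024

typedef struct index_entry {
    const char *word;           /* not copied: the caller keeps the string alive */
    struct index_entry *next;
} INDEX_ENTRY;

static INDEX_ENTRY *word_index[HASH_SIZE];  /* one list head per hash value */

/* The hash from the answer, using unsigned arithmetic. */
static unsigned hash_word(const char *src)
{
    unsigned hash = 0;
    while (*src != 0) {
        hash = hash ^ (hash << 5) ^ (unsigned char)*src;
        src++;
    }
    return hash % HASH_SIZE;
}

/* Returns 1 if word has been added to the index, 0 otherwise. */
static int is_known(const char *word)
{
    INDEX_ENTRY *e;
    for (e = word_index[hash_word(word)]; e != NULL; e = e->next)
        if (strcmp(word, e->word) == 0)
            return 1;
    return 0;
}

/* Adds word unless it is a duplicate. Returns 0 on success, -1 on allocation failure. */
static int add_word(const char *word)
{
    unsigned h = hash_word(word);
    INDEX_ENTRY *e;
    if (is_known(word))
        return 0;
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->word = word;
    e->next = word_index[h];
    word_index[h] = e;
    return 0;
}

/* Writes "Language" for every word found in the index, and the word itself otherwise. */
static void write_result(FILE *out, const char *const words[], size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        fprintf(out, "%s\n", is_known(words[i]) ? "Language" : words[i]);
}
```

After calling add_word for each word of the second file, write_result(stdout, first_words, count) prints one line per word of the first file.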

(Algorithm) Find if two unsorted arrays have any common elements in O(n) time without sorting?

We have two unsorted arrays and each array has a length of n. These arrays contain random integers in the range 0 to n^100. How to find if these two arrays have any common elements in O(n)/linear time? Sorting is not allowed.
Hashtable will save you. Really, it's like a swiss knife for algorithms.
Just put in it all values from the first array and then check if any value from the second array is present.
You have not defined the model of computation. Assuming you can only read O(1) bits in O(1) time (anything else would be a rather exotic model of computation), there can be no algorithm solving the problem in O(n) worst case time complexity.
Proof Sketch:
Each number in the input takes O(log(n ^ 100)) = O(100 log n) = O(log n) bits. The entire input is therefore O(n log n) bits, which cannot be read in O(n) time. Any O(n) algorithm therefore cannot read the entire input, and hence cannot react when the unread bits matter.
Answering Neil:
Since you know at start what is your N (two arrays of size N), you can create a hash with array size of 2*N*some_ratio (for example: some_ratio= 1.5). With this size, almost all simple hash functions will provide you good spread of the entities.
You can also implement find_or_insert, which searches for an existing item or inserts a new one in the same action; this reduces the number of hash function and comparison calls. (c++ stl find_or_insert is not good enough since it doesn't tell you whether the item was there before or not.)
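A rough sketch of such a table in C, with linear probing and a combined find-or-insert. It assumes the values fit in uint64_t, which is a simplification (numbers up to n^100 generally do not), and the struct, the function names and the multiplicative hash constant are all choices made for this example.

```c
#include <stdint.h>
#include <stdlib.h>

struct int_set {
    uint64_t *keys;
    unsigned char *used;  /* 1 if the slot holds a key */
    size_t capacity;
};

/* Allocates about 3 slots per expected element (2 * n * 1.5), so an empty slot always exists. */
static int set_init(struct int_set *s, size_t n)
{
    s->capacity = 3 * n + 1;
    s->keys = calloc(s->capacity, sizeof *s->keys);
    s->used = calloc(s->capacity, 1);
    if (s->keys == NULL || s->used == NULL) {
        free(s->keys);
        free(s->used);
        return -1;
    }
    return 0;
}

/* Returns 1 if key was already present. Otherwise returns 0 and,
   when insert is non-zero, stores key in the slot where the probe ended. */
static int set_find_or_insert(struct int_set *s, uint64_t key, int insert)
{
    size_t i = (size_t)((key * 0x9E3779B97F4A7C15ULL) >> 32) % s->capacity;
    while (s->used[i]) {
        if (s->keys[i] == key)
            return 1;
        i = (i + 1) % s->capacity;
    }
    if (insert) {
        s->keys[i] = key;
        s->used[i] = 1;
    }
    return 0;
}

/* Returns 1 if a and b share a value, 0 if not, -1 on allocation failure. */
static int have_common(const uint64_t *a, const uint64_t *b, size_t n)
{
    struct int_set s;
    size_t i;
    int found = 0;
    if (set_init(&s, n) != 0)
        return -1;
    for (i = 0; i < n; i++)
        set_find_or_insert(&s, a[i], 1);
    for (i = 0; i < n && !found; i++)
        found = set_find_or_insert(&s, b[i], 0);
    free(s.keys);
    free(s.used);
    return found;
}
```

The single probe loop serves both passes: the first pass inserts, the second only searches, and the return value says whether the item was there before.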
Linearity Test
Using Mathematica hash function and arbitrary length integers.
Tested until n=2^20, generating random numbers till (2^20)^100 = (approx 10^602)
Just in case ... the program is:
k = {};
For[t = 1, t < 21, t++,
i = 2^t;
Clear[a, b];
Table[a[RandomInteger[i^100]] = 1, {i}];
b = Table[RandomInteger[i^100], {i}];
Contains = False;
AppendTo[k,
{i, First@Timing@For[j = 2, j <= i, j++,
Contains = Contains || (NumericQ[a[b[[j]]]]);
]}]];
ListLinePlot[k, PlotRange -> All, AxesLabel -> {"n", "Time(secs)"}]
Put the elements of the first array in a hash table, and check for existence while scanning the second array. This gives you a solution in O(N) average case.
If you want a truly O(N) worst-case solution, then instead of a hash table use a linear array covering the range 0-n^100. Note that you need just a single bit per entry.
If storage is not important, then scratch the hash table in favor of an array long enough to be indexed by every possible value. Flag an entry as true when you come across that number in the first array. In the pass through the second array, if you find any entry already set to true, you have your answer. O(n).
Define largeArray(n)
// First pass
for(element i in firstArray)
    largeArray[i] = true;
// Second pass
Define hasFound = false;
for(element i in secondArray)
    if(largeArray[i] == true) {
        hasFound = true;
        break;
    }
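In C, the flag array with one bit per possible value could be sketched as follows. The function name and the max_value parameter are assumptions for illustration. Allocating and clearing the flags takes time proportional to max_value, so this is only linear in n when the largest value is itself O(n).

```c
#include <limits.h>
#include <stdlib.h>

/* Returns 1 if the arrays share a value, 0 if not, -1 on allocation failure.
   Every element of both arrays must be <= max_value. */
static int have_common_flags(const unsigned *first, const unsigned *second,
                             size_t n, unsigned max_value)
{
    size_t bits_per_word = sizeof(unsigned) * CHAR_BIT;
    size_t words = (size_t)max_value / bits_per_word + 1;
    unsigned *flags = calloc(words, sizeof *flags);  /* all bits start at 0 */
    size_t i;
    int found = 0;
    if (flags == NULL)
        return -1;
    /* First pass: set the bit for every value in the first array. */
    for (i = 0; i < n; i++)
        flags[first[i] / bits_per_word] |= 1u << (first[i] % bits_per_word);
    /* Second pass: stop at the first value whose bit is already set. */
    for (i = 0; i < n && !found; i++)
        found = (flags[second[i] / bits_per_word] >> (second[i] % bits_per_word)) & 1u;
    free(flags);
    return found;
}
```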
Have you tried a counting sort? It is simple to implement, and when the values are bounded by O(n) it uses O(n) space and has Θ(n) time complexity.
Based on the ideas posted so far: we can store the integers of one array in a hash map. The hash map holds only unique integer values, since duplicates are ignored, so at most one entry per distinct integer has to fit in RAM.
Here is the implementation in Perl language.
#!/usr/bin/perl
use strict;
use warnings;
sub find_common_elements { # prints the elements common to two unsorted arrays
    my ($arr1, $arr2) = @_; # two array references, passed as function arguments
    my $hash = {}; # hash map to store the values of the first array
    # building the hash map takes O(n)
    foreach my $ele (@$arr1) {
        $hash->{$ele} = 1; # key is the integer, value is a true flag; duplicate elements overwrite the same entry
        # duplicates are ignored, so the map holds at most one entry per distinct integer
    }
    # O(n) to traverse the second array and find the common elements
    foreach my $ele2 (@$arr2) {
        # a hash lookup is O(1) on average, however many entries the map holds
        if (exists $hash->{$ele2}) {
            print "\n $ele2 is common in both array \n";
        }
    }
}
I hope it helps.

choosing the element with more duplicates on an array in plain C

This question brings me back to my college days, but since I haven't coded since those days (more than 20 years ago) I am a bit rusty.
Basically I have an array of 256 elements. there might be 1 element on the array, 14 or 256. This array contains the usernames of people requesting data from the system. I am trying to count the duplicates on the list so I can give the priority to the user with most requests. So, if I have a list such as:
{john, john, paul, james, john, david, charles, charles, paul, john}
I would choose John because it appeared 4 times.
I can iterate the array and copy the elements to another and start counting but it gets complicated after a while. As I said, I am very rusty.
I am sure there is an easy way to do this. Any ideas? Code would be very helpful here.
Thank you!
EDIT:
The buffer is declared as:
static WCHAR userbuffer[150][128];
There could be up to 150 users and each username is up to 128 chars long.
1 - sort your array.
2 - set maxcount = 0;
3 - iterate over the array and count until you reach the next username.
4 - if count > maxcount then set maxcount to count and save name as a candidate.
5 - after loop finished, pickup the candidate.
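The five steps above can be sketched in C like this. The example uses plain char names of a fixed, made-up width (NAME_LEN) rather than the WCHAR buffer from the question, and the function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 32  /* hypothetical fixed width of each name slot */

/* qsort comparator: each element is one char[NAME_LEN] row. */
static int compare_names(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/* Sorts names in place, then returns the name with the longest run of
   equal neighbours, or NULL if count is 0. */
static const char *most_frequent(char names[][NAME_LEN], size_t count)
{
    const char *best = NULL;
    size_t best_run = 0, run = 0, i;
    qsort(names, count, sizeof names[0], compare_names);
    for (i = 0; i < count; i++) {
        if (i > 0 && strcmp(names[i], names[i - 1]) == 0)
            run++;        /* same name as the previous row */
        else
            run = 1;      /* a new name starts a new run */
        if (run > best_run) {
            best_run = run;
            best = names[i];
        }
    }
    return best;
}
```

Sorting costs O(n log n) comparisons, after which a single pass is enough because equal names end up next to each other.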
Here's how I would solve it:
First, define a structure to hold user names and frequency counts and make an array of them with the same number of elements as your userbuffer array (150 in your example).
struct user {
WCHAR name[128];
int count;
} users[150];
Now, loop through userbuffer and for each entry, check and see if you have an entry in users that has the same name. If you find a match, increment the count for that entry in users. If you don't find a match, copy the user's name from userbuffer into a blank entry in users and set that user's count to 1.
After you have processed the entire userbuffer array, you can use the count values from your users array to see who has the most requests. You can either iterate through it manually to find the max or use qsort.
It's not the most efficient algorithm, but it's direct and doesn't do anything too clever or fancy.
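The tallying loop described above might look like the following sketch. It uses standard wchar_t in place of the Windows WCHAR typedef so that it compiles anywhere; that substitution and the function names tally_users and busiest_user are assumptions made for the example.

```c
#include <stddef.h>
#include <wchar.h>

#define MAX_USERS 150
#define NAME_LEN  128

struct user {
    wchar_t name[NAME_LEN];
    int count;
};

/* Tallies the n names in buffer into users[], which must have room for n entries.
   Returns the number of distinct names found. */
static size_t tally_users(wchar_t buffer[][NAME_LEN], size_t n, struct user *users)
{
    size_t distinct = 0, i, j;
    for (i = 0; i < n; i++) {
        /* Look for an existing entry with the same name. */
        for (j = 0; j < distinct; j++)
            if (wcscmp(users[j].name, buffer[i]) == 0)
                break;
        if (j == distinct) {
            /* No match: start a new entry in the next blank slot. */
            wcscpy(users[distinct].name, buffer[i]);
            users[distinct].count = 0;
            distinct++;
        }
        users[j].count++;
    }
    return distinct;
}

/* Returns the entry with the highest count, or NULL if there are no entries. */
static const struct user *busiest_user(const struct user *users, size_t distinct)
{
    const struct user *best = NULL;
    size_t i;
    for (i = 0; i < distinct; i++)
        if (best == NULL || users[i].count > best->count)
            best = &users[i];
    return best;
}
```

With at most 150 entries the quadratic search is cheap, and the linear scan in busiest_user avoids sorting when only the top user is needed.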
Edit:
To qsort the users array, you will need to define a simple function that qsort can use to sort by. For this example, something like this should work:
static int compare_users(const void *p1, const void *p2) {
    /* Sorts in ascending order of count, so the busiest user ends up last. */
    const struct user *u1 = (const struct user*)p1;
    const struct user *u2 = (const struct user*)p2;
    if (u1->count > u2->count)
        return 1;
    if (u1->count < u2->count)
        return -1;
    return 0;
}
Now, you can pass this function to qsort like:
qsort(users, 150, sizeof(struct user), compare_users);
What did you mean by "there might be 1 element on the array, 14 or 256"? Are 14 and 256 numbers of elements?
If you can change the array definition, I think the best way is to define
a structure with two fields, username and numberOfRequests. When a user makes a request and the username already exists in the list, numberOfRequests is incremented. If it is the first time the user makes a request, the username is added to the list with numberOfRequests set to 1.
If you cannot change the array definition, one option is to sort the array with quicksort or another algorithm. It will be easy to count the number of requests after that.
Maybe there is a library with this functionality, but I doubt that you can find something in the Standard Library!
Using std::map as proposed by AraK, you can store your names without duplicates, and associate each name with a value.
A quick example :
std::map<string, long> names;
names["john"]++;
names["david"]++;
You can then search which key has the biggest value.
