Comparing strings in two files in C

I'm new to C, so I'll appreciate any help :D
I need to compare the words in the first file ("Albert\n Martin\n Bob") with the words in the second file ("Albert\n Randy\n Martin\n Ohio").
Whenever a word from the first file also appears in the second file, I need to write the word "Language" to a third file; every word with no counterpart in the second file should be written as-is.
Something like this:
Language
Language
Bob
needs to end up in my third file.
I tried to come up with some ideas, but they don't work :P
Thanks in advance for every answer.

First, you need to open a stream to read each file.
To compare the words in C, you can use the strcmp function from <string.h>, which compares two strings and returns 0 when they are equal.
For example:
int strcmp(const char *s1, const char *s2);

I'd open all three files to begin with (both input files and the output file). If you can't open all of them then you can't do anything useful (other than display an error message or something), and there's no point wasting CPU time only to find out later that, for example, you can't open the output file. This also helps reduce race conditions (e.g. the second file changing while you're processing the first file).
Next, build an index of the second file. Break it into words/tokens as you read it, and for each word/token calculate a hash value. Then use the hash value and the word/token itself to check whether the new word/token is a duplicate of a previous (already known) word/token. If it's not a duplicate, allocate some memory, create a new entry for the word/token, and insert the entry into the linked list that corresponds to the hash.
Finally, process the first file. This is similar to how you processed the second file (break it into words/tokens, calculate the hash, use the hash to find out whether the word/token is known), except now you write to the output file: if the word/token is known (it also appears in the second file) you write "Language", and if it isn't known you write the word/token itself. For the example above that produces "Language", "Language", "Bob".
If you're not familiar with hash tables, they're fairly easy. For a simple method (not necessarily the best method) of calculating the hash value for ASCII/text you could do something like:
hash = 0;
while(*src != 0) {
    hash = hash ^ (hash << 5) ^ *src;
    src++;
}
hash = hash % HASH_SIZE;
Then you have an array of linked lists, like "INDEX_ENTRY *index[HASH_SIZE]" that contains a pointer to the first entry for each linked list (or NULL if the linked list for the hash is empty).
To search, use the hash to find the first entry of the correct linked list then do "strcmp()" on each entry in the linked list. An example might look something like this:
INDEX_ENTRY *find_entry(uint32_t hash, char *new_word) {
    INDEX_ENTRY *entry;

    entry = index[hash];
    while(entry != NULL) {
        if(strcmp(new_word, entry->word) == 0) return entry;
        entry = entry->next;
    }
    return NULL;
}
The idea of all this is to improve performance. For example, if both files have 1024 words then (without a hash table) you'd need to do "strcmp()" 1024*1024 times; but if you use a hash table with "#define HASH_SIZE 1024" you'll probably reduce that to about 2000 times (and end up with much faster code). Larger values of HASH_SIZE increase the amount of memory you use a little (and reduce the chance of different words having the same hash).
Don't forget to close your files when you're finished with them. Freeing the memory you used is a good idea if you do something else after this (but if you don't do anything after this then it's faster and easier to "exit()" and let the OS cleanup).

Related

Hash tables, same record on each hash table cell

There is a word dictionary in a text file. I want to hash all of these words. I wrote some code, but there is a problem: the last word ends up in every hash table slot.
main()
{
    FILE *fp;
    char word[100];
    char *hash[569];
    int i;
    for(i=0;i<569;i++)
        hash[i]="NULL";
    int m=569;
    int z =569;
    int mm=568;
    char w;
    int key;
    int j;
    int hash1;
    int hash2;
    fp=fopen("smallDictionary.txt","r");
    int counter =0;
    while(fscanf(fp,"%s",word)!=EOF)
    {
        j=0;
        counter++;
        for(i=key=0;i<strlen(word);i++)
            key+=(word[i]-'a')*26*i;
        hash1=key%m;
        hash2=1+(key%mm);
        while(hash[(hash1+j*hash2)%m]!="NULL")
        {
            j++;
        }
        hash[(hash1+j*hash2)%m] = word;
    }
    for(i=0;i<569;i++)
        printf("%s ",hash[i]);
    fclose(fp);
}
And the result on the console: as can be seen, the last word of the dictionary, "your", repeats in every slot.
Dictionary content :
a about above absolutely acceptable add adjacent after algorithm all
along also an analyses and any anyone are argument as assignment
assume at available be been below bird body but by can cannot
capitalization case center chain chaining changing characters check
checker checkers checking choose class code collision collisions
command compilation compile compiled complexity conform consist
containing contains convenience convert correcting corrections create
created cross deal debugging december decide deducted deleted
departmental dictionary discover discussed divides document
documentation due dynamically e each easiest encountered enough error
errors etc even exactly example executable expand experience factor
fair fall file files find first following font for forth found four
friday from function functions g gain general generate generated
generating geoff get gird given good graders grows guide guidelines
had hair handle has hash hashing have head header help helped hold
homework hour how i if in include including incorrect information
input insert inserted insertions instructions into is ispell it
keeping known kuenning last length less letter letters like line
linear list load long longer longest look looked low lower maintained
many match may messages method midnight might misspelled mistake mode
more most must my name named names necessary need never next no note
number of on once one only options or original other otherwise output
overview page pair pedantic points policies possibility possible
prefer primary probing problem produce produces professor program
purpose quack quadratic quick read reason reasonably refer reference
rehashing reinitialized report resubmit rid same save separate
separated seriously should similar simple simplify single size so
something specifications specified spell spelling standard statistics
string strong submission submit submitted successfully suggest
suggested support suppose table tech th than that the them then there
these this those three through thursday times title to together tooth
track traditionally transposed trial try turing udder under understand
unlike up use used useful using usual variant variants version wall
ward warning was way we when whenever which while whitespace who why
wild will wind wire wiry with word words works write written wrong you
your
hash[(hash1+j*hash2)%m] = word;
That just assigns the address of word to the hash entry, which is always the same address. And at the end of the loop the contents of word will obviously be the last thing read by fscanf. So you need something like:
hash[(hash1+j*hash2)%m] = strdup(word);
strdup allocates heap memory so don't forget to free it when the hash table is no longer needed.
All the pointers in the hash table point to the same variable: the word array. Whatever string you last put in word is the string that will be printed in your output.
You need to allocate memory for each word, either by making hash an array of arrays and copying each word into its row, or by using e.g. strdup to allocate each string dynamically.
More graphically, it could be thought of as this:
+---------+
| hash[x] | -\     +------+
+---------+   >--> | word |
| hash[y] | -/     +------+
+---------+        +--------+
| hash[z] | -----> | "NULL" |
+---------+        +--------+
Here both entries x and y point to word, while entry z points to the string literal "NULL" that you initialized all entries to.

Implementing a Hash Table using C

I am trying to read a file of names, and hash those names to a spot based on the value of their letters. I have succeeded in getting the value for each name in the file, but I am having trouble creating a hash table. Can I just use a regular array and write a function to put the words at their value's index?
while (fgets(name,100, ptr_file)!=NULL) //while file isn't empty
{
    fscanf(ptr_file, "%s", &name); //get the first name
    printf("%s ",name); //prints the name
    int length = strlen(name); // gets the length of the name
    int i;
    for (i =0; i <length; i++)
    //for the length of the string add each letter's value up
    {
        value = value + name [i];
        value=value%50;
    }
    printf("value= %1d\n", value);
}
No, you can't, because you'll have collisions, and you'll thus have to account for multiple values mapping to a single hash. In general, your hashing is not a wise implementation -- why limit the range of values to 50? Is memory really so sparse that you can't afford a dictionary bigger than 50 pointers?
I recommend using an existing C string hash table implementation, like this one from 2001.
In general, you will have hash collisions and you will need to find a way to handle that. The two main ways of handling that situation are as follows:
If the hash table entry is taken, you use the next entry, and iterate until you find an empty entry. When looking things up, you need to do similarly. This is nice and simple for adding things to the hash table, but makes removing things tricky.
Use hash buckets. Each hash table entry is a linked list of entries with the same hash value. First you find your hash bucket, then you find the entry on the list. When adding a new entry, you merely add it to the list.
This is my demo program, which reads strings (words) from stdin and inserts them into a hash table. At EOF, it iterates over the hash table, computes the number of words and of distinct words, and prints the most frequent word:
http://olegh.cc.st/src/words.c.txt
The hash table uses a double hashing algorithm.

Why do we need unsigned char for Huffman tree code

I am trying to create a Huffman tree. The exercise I read is very strange to me; it is as follows:
Given the following data structure:
struct huffman
{
    unsigned char sym; /* symbol */
    struct huffman *left, *right; /* left and right subtrees */
};
write a program that takes the name of a binary file as sole argument,
builds the Huffman tree of that file assuming that atoms (elementary
symbols) are 8-bit unsigned characters, and prints the tree as well as
the dictionary.
allocations must be done using nothing else than
malloc(), and sorting can be done using qsort().
What confuses me is that, to write a program that creates a Huffman tree, we only need to do the following:
Take a frequency array (which could be Farray[]={.......}).
Sort it and merge the two smallest nodes into a tree, repeating until only one final node (the root) remains.
Now the question is: why and where do we need the unsigned char data? (What kind of unsigned char data does this question want? I think the frequencies alone are enough to display a Huffman tree.)
If you purely want to display the shape of the tree, then yes, you just need to build it. However, for it to be of any use whatsoever you need to know what original symbol each node represents.
Imagine your input symbols are [ABCD]. An imaginary Huffman tree/dictionary might look like this:
          ( )
         /   \            A = 1
       ( )   (A)          B = 00
      /   \               C = 010
    (B)   ( )             D = 011
         /   \
       (C)   (D)
If you don't store sym, it looks like this:
          ( )
         /   \            A = ?
       ( )   ( )          B = ?
      /   \               C = ?
    ( )   ( )             D = ?
         /   \
       ( )   ( )
Not very useful, is it?
Edit 2: The missing step in the plan is step 0: build the frequency array from the file (somehow I missed that you don't need to actually encode the file too). This isn't part of the actual Huffman algorithm itself and I couldn't find a decent example to link to, so here's a rough idea:
FILE *input = fopen("inputfile", "rb");
int freq[256] = {0};
int c;

while ((c = fgetc(input)) != EOF)
    freq[c]++;
fclose(input);

/* do Huffman algorithm */
...
Now, that still needs improving since it neither uses malloc() nor takes a filename as an argument, but it's not my homework ;)
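Building the tree from that frequency array, using only malloc() and qsort() as the assignment demands, might be sketched like this. Note the freq field is an addition for building; the question's struct only has sym and the child pointers, so a real solution would need to track frequencies some other way:

```c
#include <stdlib.h>

struct huffman {
    unsigned char sym;              /* symbol */
    long freq;                      /* added for building; not in the original struct */
    struct huffman *left, *right;   /* left and right subtrees */
};

/* qsort comparator for an array of node pointers, ascending by frequency */
static int cmp_freq(const void *a, const void *b) {
    const struct huffman *x = *(struct huffman * const *)a;
    const struct huffman *y = *(struct huffman * const *)b;
    return (x->freq > y->freq) - (x->freq < y->freq);
}

/* Build the Huffman tree from a 256-entry frequency array; returns the root. */
struct huffman *build(const long freq[256]) {
    struct huffman *nodes[256];
    int n = 0;

    for (int i = 0; i < 256; i++) {     /* one leaf per symbol that occurs */
        if (freq[i] == 0)
            continue;
        struct huffman *leaf = malloc(sizeof *leaf);
        leaf->sym = (unsigned char)i;   /* this is where sym is needed */
        leaf->freq = freq[i];
        leaf->left = leaf->right = NULL;
        nodes[n++] = leaf;
    }
    while (n > 1) {                     /* merge the two smallest until one root remains */
        qsort(nodes, n, sizeof nodes[0], cmp_freq);
        struct huffman *parent = malloc(sizeof *parent);
        parent->left = nodes[0];
        parent->right = nodes[1];
        parent->freq = nodes[0]->freq + nodes[1]->freq;
        parent->sym = 0;                /* internal node: sym unused */
        nodes[0] = parent;
        nodes[1] = nodes[n - 1];        /* fill the hole with the last node */
        n--;
    }
    return n ? nodes[0] : NULL;
}
```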
It's been a while since I did this, but I think the generated "dictionary" is required to encode data, while the "tree" is used to decode it. Of course, you can always build one from the other.
While decoding, you traverse the tree (left/right, according to successive input bits), and when you hit a terminal node (null pointer) then the 'sym' in the node is the output value.
Usually data compression is divided into two big steps; given a stream of data:
evaluate the probability that a given symbol will appear in the stream, in other words how frequently each symbol appears in the dataset;
once you have studied the occurrences and built a table associating each symbol with a probability, encode the symbols according to their probability. To achieve this you create a dictionary where the original symbol is often replaced with another symbol that is much smaller in size, especially for symbols that appear frequently in the dataset; the dictionary keeps track of these substitutions for both the encoding and decoding phases. Huffman gives you an algorithm to automate this process and get a fairly good result.
In practice it's a little more complicated than that, because trees are involved, but the main purpose is always to build the dictionary.
There is a complete tutorial here.

Best way to match(search) a URL in a list in C (implementing whitelist or blacklist)?

I'm writing a proxy server. It applies different rules to websites that match in lists. For example, we can block List A and use another proxy to fetch content for List B.
For example, List A:
.google.com
blogger.com
sourceforge.net
ytimg.com
http://media-cache-*.pinterest.com/*
images-amazon.com
*.amazonaws.com
twitter.com
fbcdn.net
google-analytics.com
staticflickr.com
List B:
ytimg.com
youtube.com
Currently, the match function is:
struct proxy_t *
match_list(char *url) {
    // 2KB line should be enough
    char buf[2048];
    int pos = 0, size;

    struct acllist *al = config->acl_h;
    struct acl *node = al->data;
    while (node != NULL) { // iterate list
        pos = 0; // position in list file
        size = strlen(node->data); // node->data holds a URL list
        while (1) { // iterate each line in list
            readline(buf, node->data, &pos, size);
            if (buf[0] == 0) break;
            if (strcasestr(url, buf) != NULL
                || !fnmatch(buf, url, FNM_CASEFOLD)) {
                return node->proxy;
            }
        }
        node = node->next;
    }
    printf("Not Matched\n");
    return config->default_proxy;
}
That is, iterate the two list files, read line by line, use strcasestr and fnmatch to match a single URL.
It works fine. But if the lists grow, say to 10,000 lines per list across 5 lists, I suppose this won't be an efficient solution, since it is an O(N) algorithm.
I'm thinking about adding a hit counter to each match line. By ordering the match lines it may reduce the average search length. Like this:
.google.com|150
blogger.com|76
sourceforge.net|43
ytimg.com|22
Are there any other ideas?
There are two ways you could go to improve performance.
1
The first way is to order the URL lists so you can optimize searching in them.
Quicksort is typically the fastest general-purpose sort; bubble sort is easier to implement.
Then you can use binary search to search the list.
Binary search has logarithmic complexity while your loop is linear, so it will be significantly faster on large lists.
2
If your lists of URLs are static, you can use a special tool called flex, which lets you match a string just by reading it.
It also means that when one of your URL lists is updated, you have to write new parsing code or create a code generator.
This is a much more effective way of matching than any kind of sorting, because it only needs N steps, where N is the length of the URL being matched; it doesn't matter how long your list is, as long as you can write a correct scanner for the inputs.

File Descriptors and System Calls

I am merging k sorted streams using the read and write system calls.
After reading the first integer from each file and sorting them, the file containing the smallest element should be accessed again.
I am not sure how to do this. I thought I could use a structure like:
struct filePointer {
    int ptr;
    int num;
} fptr[5];
Can someone tell me how to do this?
Thanks
Although reading integers one by one is not an efficient way of doing this, I will try to describe the solution you are looking for. This is not a real implementation, just the idea.
Your structure should be like this:
struct filePointer {
    FILE * fp;
    int num;
} fptr[k]; /* I assume k is constant, known at compile time */
You need a priority queue ( http://en.wikipedia.org/wiki/Priority_queue ) whose priorities are determined according to num.
First read the first numbers from all files and insert them to priority queue (pq).
Then while pq is not empty, pop the first element which holds the smallest integer compared to other elements in the pq.
Write the integer that first element holds to file.
Using the file pointer (fp), try to read a new integer from that input file.
If EOF (end of file), then do nothing;
else insert a new element into the pq with num set to the value just read.
When the loop is finished, close all files and you will have a new file that contains all the elements from the input files and it will be sorted.
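The loop above might be sketched like this; for simplicity the "priority queue" is just a linear scan for the minimum (fine for small k), and buffered fscanf() stands in for the raw read() calls:

```c
#include <stdio.h>

#define K 3   /* number of input streams; hypothetical */

struct filePointer {
    FILE *fp;     /* the stream this entry reads from */
    int num;      /* the last integer read from it */
    int valid;    /* 0 once the stream hits EOF */
};

/* Merge K sorted integer streams into out. */
void merge(struct filePointer fptr[K], FILE *out) {
    /* prime: read the first number of each stream */
    for (int i = 0; i < K; i++)
        fptr[i].valid = (fscanf(fptr[i].fp, "%d", &fptr[i].num) == 1);

    for (;;) {
        int min = -1;
        for (int i = 0; i < K; i++)   /* linear stand-in for a pq pop */
            if (fptr[i].valid && (min < 0 || fptr[i].num < fptr[min].num))
                min = i;
        if (min < 0)
            break;                    /* all streams exhausted */
        fprintf(out, "%d\n", fptr[min].num);
        /* refill from the stream we just consumed */
        fptr[min].valid = (fscanf(fptr[min].fp, "%d", &fptr[min].num) == 1);
    }
}
```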
I hope this helps.
