Why do we need unsigned char for Huffman tree code - c

I am trying to create a Huffman tree. The question I read seems strange to me; it is as follows:
Given the following data structure:
struct huffman
{
    unsigned char sym;             /* symbol */
    struct huffman *left, *right;  /* left and right subtrees */
};
write a program that takes the name of a binary file as its sole argument,
builds the Huffman tree of that file assuming that atoms (elementary
symbols) are 8-bit unsigned characters, and prints the tree as well as
the dictionary.
Allocations must be done using nothing other than
malloc(), and sorting can be done using qsort().
Here is what confuses me: to write a program that creates a Huffman tree, we just need to do the following:
Take a frequency array (which could be Farray[] = {...}).
Sort it and keep combining the two smallest nodes into a new node until only one node (the root) is left.
Now the question is: why and where do we need that unsigned char data? (What kind of unsigned char data does this question want? I think the frequencies alone are enough to display a Huffman tree.)

If you purely want to display the shape of the tree, then yes, you just need to build it. However, for it to be of any use whatsoever you need to know what original symbol each node represents.
Imagine your input symbols are [ABCD]. An imaginary Huffman tree/dictionary might look like this:
        ( )
       /   \          A = 1
     ( )   (A)        B = 00
    /   \             C = 010
  (B)   ( )           D = 011
       /   \
     (C)   (D)
If you don't store sym, it looks like this:
        ( )
       /   \          A = ?
     ( )   ( )        B = ?
    /   \             C = ?
  ( )   ( )           D = ?
       /   \
     ( )   ( )
Not very useful, that, is it?
Edit 2: The missing step in the plan is step 0: build the frequency array from the file (somehow I missed that you don't need to actually encode the file too). This isn't part of the actual Huffman algorithm itself and I couldn't find a decent example to link to, so here's a rough idea:
FILE *input = fopen("inputfile", "rb");
int freq[256] = {0};
int c;

while ((c = fgetc(input)) != EOF)
    freq[c]++;
fclose(input);

/* do Huffman algorithm */
...
Now, that still needs improving since it neither uses malloc() nor takes a filename as an argument, but it's not my homework ;)
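For reference, here is a rough, untested sketch of the node-combining step under the stated constraints (malloc() and qsort() only). The item wrapper that carries the frequency alongside each node is an assumption, since the given struct has no frequency field; all names are illustrative and error checking is omitted:

#include <stdlib.h>

struct huffman {                       /* as given in the question */
    unsigned char sym;
    struct huffman *left, *right;
};

struct item {                          /* hypothetical wrapper used only while building */
    struct huffman *node;
    long freq;
};

/* sort descending by frequency so the two smallest items end up last */
static int cmp_item(const void *a, const void *b)
{
    const struct item *x = a, *y = b;
    return (y->freq > x->freq) - (y->freq < x->freq);
}

/* build the tree from the freq[] table computed above; returns the root */
struct huffman *build_tree(const int freq[256])
{
    struct item items[256];
    int n = 0;

    for (int i = 0; i < 256; i++) {
        if (freq[i] == 0)
            continue;
        items[n].node = malloc(sizeof *items[n].node);
        items[n].node->sym = (unsigned char)i;
        items[n].node->left = items[n].node->right = NULL;
        items[n].freq = freq[i];
        n++;
    }

    while (n > 1) {
        qsort(items, n, sizeof items[0], cmp_item);

        /* merge the two least frequent items (now at the end of the array) */
        struct huffman *parent = malloc(sizeof *parent);
        parent->sym = 0;
        parent->left = items[n - 2].node;
        parent->right = items[n - 1].node;

        items[n - 2].node = parent;
        items[n - 2].freq += items[n - 1].freq;
        n--;
    }

    return n == 1 ? items[0].node : NULL;
}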

It's been a while since I did this, but I think the generated "dictionary" is required to encode data, while the "tree" is used to decode it. Of course, you can always build one from the other.
While decoding, you traverse the tree (left/right, according to successive input bits), and when you hit a leaf node (one with no children), the 'sym' in that node is the output value.
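As a rough, untested illustration of that decode walk (nextbit() is a placeholder for however you pull the next bit from the input stream):

/* decode one symbol from the bit stream by walking the tree */
unsigned char decode_symbol(const struct huffman *root, int (*nextbit)(void))
{
    const struct huffman *node = root;

    /* an internal node always has both children; a leaf has none */
    while (node->left != NULL && node->right != NULL)
        node = nextbit() ? node->right : node->left;

    return node->sym;
}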

Usually data compression is divided into two big steps; given a stream of data:
evaluate the probability that a given symbol will appear in the stream; in other words, evaluate how frequently each symbol appears in the dataset
once you have studied the occurrences and created your table of symbols with their associated probabilities, you need to encode the symbols according to their probability. To achieve this you create a dictionary where the original symbol is often replaced with another symbol that is much smaller in size, especially for symbols that are used frequently in the dataset; the dictionary keeps track of these substitutions for both the encoding and decoding phases. Huffman gives you an algorithm to automate this process and get a fairly good result.
In practice it's a little bit more complicated than this, because trees are involved, but the main purpose is always to build the dictionary.
There is a complete tutorial here.

Related

Searching a string in a binary tree in c

I am a student of computer science, and I had an exam last week in C.
One of the questions was to search a specific word (string) in a binary tree, and count how many times it appears.
Every node in the tree contains a letter.
For example, if the word is "mom", and the tree looks like the attached image, the function should return 2.
Note that if there is a word like "momom", the function should count the "mom" only one time.
I have not been able to solve this question. Can you help?
      a
     / \
    b   m
   /   / \
  v   o   o
     / \   \
    m   t   m
Basically, because the tree in your image does not appear to be ordered or balanced, you would have to search every branch until either you hit a match or you hit a leaf. Once you hit a match, you can ignore all the branches underneath because they're irrelevant. But beyond that, you don't know the depth of the tree, so you can't end a search prematurely based on depth.
So, your algorithm would be something to the effect of:
// returns the number of matches
// matchMask is a bitmap of the string sublengths that match so far...
int search(const char *substr, int substrlen, uint32_t matchMask, node_t *node) {
    uint32_t newMatchMask = 0;
    int bit;

    ASSERT(substrlen < (sizeof(matchMask)*8));

    if (node == NULL) {
        // hit a leaf, stop and return 0
        return 0;
    }

    // extend every partial match that is still alive on this path
    while ((bit = LSB(matchMask)) != -1) {
        if (node->ch == substr[bit+1])
            newMatchMask |= (1 << (bit+1));
        matchMask &= ~(1u << bit);   // clear the bit we just examined
    }

    // a new match can also start at this node
    if (node->ch == substr[0])
        newMatchMask |= 1;

    if (newMatchMask & (1 << (substrlen - 1))) {
        // found a full match, don't bother recursing
        return 1;
    } else {
        return
            search(substr, substrlen, newMatchMask, node->left) +
            search(substr, substrlen, newMatchMask, node->right);
    }
}
Note that I had to do some funky bitmap stuff there to keep track of the depths matched so far, as you can match a partial substring along the way. LSB is assumed to be a least-significant-bit macro that returns -1 if no bits are set. Also, this is not tested, so there might be an off-by-one error in the bit masking, but the idea is still there.
-- EDIT --
oops, forgot to stop recursing if your node is blank... Fixing
You want to enumerate all words in the tree and check, at the end of each word, whether you have a match using strstr().
The keywords to search for would be tree walking and depth-first traversal.
The semantics of your tree structure are confused. To clarify your question, you should enumerate all words present in the tree by hand, then write a function that walks the tree and prints the same list. The final step is easy: instead of printing the words, check whether each one matches with strstr() and count the matching words.
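A rough, untested sketch of that walk might look like this (the node_t type with ch/left/right is assumed from the code above, and the path buffer size is arbitrary):

#include <string.h>

#define MAX_DEPTH 256   /* assumed maximum tree depth */

/* walk the tree depth-first, building the word along the current path;
   at each leaf, count the word once if it contains the target string */
static int count_words(const node_t *node, const char *target,
                       char *path, int depth)
{
    if (node == NULL || depth + 1 >= MAX_DEPTH)
        return 0;

    path[depth] = node->ch;
    path[depth + 1] = '\0';

    if (node->left == NULL && node->right == NULL)
        return strstr(path, target) != NULL;   /* "momom" still counts once */

    return count_words(node->left, target, path, depth + 1)
         + count_words(node->right, target, path, depth + 1);
}

It would be called as count_words(root, "mom", buf, 0) with a char buf[MAX_DEPTH].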

Fastest algorithm to figure out if an array has at least one duplicate

I have a quite peculiar case here. I have a file containing several million entries and want to find out if there exists at least one duplicate. The language here isn't of great importance, but C seems like a reasonable choice for speed. Now, what I want to know is what kind of approach to take to this? Speed is the primary goal here. Naturally, we want to stop looking as soon as one duplicate is found, that's clear, but when the data comes in, I don't know anything about how it's sorted. I just know it's a file of strings, separated by newline. Now keep in mind, all I want to find out is if a duplicate exists. Now, I have found a lot of SO questions regarding finding all duplicates in an array, but most of them go the easy and comprehensive way, rather than the fastest.
Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Please note that this is not strictly theoretical. It will be tested on a machine (Intel i7 with 8GB RAM), so I do have to take into consideration the time of making a string comparison etc. Which is why I'm also wondering if it could be faster to split the strings in two, and first compare the integer part, as an int comparison will be quicker, and then the string part? Of course, that will also require me to split the string and cast the second half to an int, which might be slower...
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).
If X ∈ {A..Z} and N ∈ {0..9}, that gives 26^3 * 10^3 = 17,576,000 possible values ... a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take ~2 MB for this.
Here you go: a python script to generate all possible 17 million keys:
import itertools
from string import ascii_uppercase

for prefix in itertools.product(ascii_uppercase, repeat=3):
    for numeric in range(1000):
        print "%s%03d" % (''.join(prefix), numeric)
and a simple C bitset filter:
#include <limits.h>

/* convert number of bits into number of bytes */
int filterByteSize(int max) {
    return (max + CHAR_BIT - 1) / CHAR_BIT;
}

/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
    int byteIndex = value / CHAR_BIT;
    unsigned char mask = 1 << (value % CHAR_BIT);
    unsigned char byte = filter[byteIndex];
    filter[byteIndex] = byte | mask;
    return byte & mask;
}
which for your purposes you'd use like so:
#include <stdlib.h>

/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
    int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
    return calloc(filterByteSize(maxKey), 1);
}

/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
    int alpha = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
    int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
    int key = numeric + 1000 * alpha;
    return filterTestAndSet(filter, key);
}

#include <stdio.h>

int main() {
    unsigned char *filter = allocMyFilter();
    char key[8]; /* 6 chars + newline + nul */

    while (fgets(key, sizeof(key), stdin)) {
        if (testAndSetMyKey(filter, key)) {
            printf("collision: %s\n", key);
            return 1;
        }
    }
    return 0;
}
This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:
useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt
real 0m0.474s
user 0m0.436s
sys 0m0.036s
useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000
real 0m0.467s
user 0m0.452s
sys 0m0.016s
admittedly the input file is cached in memory for these runs.
The reasonable answer is to keep the algorithm with the smallest complexity. I encourage you to use a hash table to keep track of inserted elements; the final algorithm complexity is O(n), because searching a hash table is theoretically O(1). In your case I suggest running the algorithm while reading the file.
public static bool ThereAreDuplicates(string[] inputs)
{
    var hashTable = new Hashtable();
    foreach (var input in inputs)
    {
        if (hashTable[input] != null)
            return true;

        hashTable.Add(input, string.Empty);
    }
    return false;
}
A fast but memory-inefficient solution would use
// Entries are AAA####
char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */] = { 0 };  // or calloc() this
char buffer[100];

while (fgets(buffer, sizeof buffer, istream)) {
    unsigned long index = strtoul(buffer, NULL, 36);
    if (found[index]++) {
        Dupe_found();
        break;
    }
}
The trouble with the post is that it wants "Fastest algorithm", but does not detail memory concerns and its relative importance to speed. So speed must be king and the above wastes little time. It does meet the "stop looking as soon as one duplicate is found" requirement.
Depending on how many different values there can be, you have some options:
Sort the whole array and then look for adjacent repeating elements; complexity is O(n log n), but it can be done in place, so memory is O(1). (A sketch of this option is shown after this list.)
Build a set of all elements. Depending on the chosen set implementation this can be O(n) (hash set) or O(n log n) (binary tree), but it will cost you some memory.
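As a rough sketch of the first option for this particular input (fixed-length 6-character keys are assumed, and the names are illustrative):

#include <stdlib.h>
#include <string.h>

/* compare two fixed-length keys for qsort() */
static int cmp_key(const void *a, const void *b)
{
    return memcmp(a, b, 6);
}

/* returns 1 if the array of n 6-character keys contains a duplicate */
int has_duplicate(char (*keys)[6], size_t n)
{
    qsort(keys, n, sizeof keys[0], cmp_key);

    /* after sorting, any duplicate must be adjacent */
    for (size_t i = 1; i < n; i++)
        if (memcmp(keys[i - 1], keys[i], 6) == 0)
            return 1;

    return 0;
}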
The fastest way to find out if an array contains at least one duplicate is to use a bitmap, multiple CPUs and an (atomic or not) "test and set bit" instruction (e.g. lock bts on 80x86).
The general idea is to divide the array into "total elements / number of CPUs" sized pieces and give each piece to a different CPU. Each CPU processes its piece of the array by calculating an integer and doing the atomic "test and set bit" for the bit corresponding to that integer.
However, the problem with this approach is that you're modifying something that all CPUs are using (the bitmap). A better idea is to give each CPU a range of integers (e.g. CPU number N does all integers from "(max - min) * N / CPUs" to "(max - min) * (N+1) / CPUs"). This means that all CPUs read from the entire array, but each CPU only modifies its own private piece of the bitmap. This avoids some performance problems involved with cache coherency protocols ("read for ownership of cache line") and also avoids the need for atomic instructions.
The next step beyond that is to look at how you're converting your "3 characters and 3 digits" strings into an integer. Ideally, this can be done using SIMD, which would require that the array is in "structure of arrays" format (and not the more likely "array of structures" format). Also note that you can convert the strings to integers first (in an "each CPU does a subset of the strings" way) to avoid the need for each CPU to convert each string, and to pack more into each cache line.
Since you have several million entries I think the best algorithm would be counting sort. Counting sort does exactly what you asked: it sorts an array by counting how many times each element occurs. So you could write a function that applies the counting idea to the array:
#include <stdlib.h>

/* returns 1 as soon as a duplicate is found while counting, 0 otherwise */
int counting_sort(int a[], int n, int max)
{
    int *count = calloc(max + 1, sizeof *count);
    int i, dup = 0;

    for (i = 0; i < n; ++i) {
        count[a[i]]++;
        if (count[a[i]] >= 2) { dup = 1; break; }
    }

    free(count);
    return dup;
}
where you should first find the max element (in O(n)). The asymptotic time complexity of counting sort is O(max(n, M)), where M is the largest value found in the array. Because you have several million entries, if M is on the order of a few million this will work in O(n) (the counting pass alone may finish earlier, but finding M already takes O(n)). If you also know that M can never be greater than a few million, then you can be sure this gives O(n) and not just O(max(n, M)).
You can see counting sort visualization to understand it better, here:
https://www.cs.usfca.edu/~galles/visualization/CountingSort.html
Note that the above function doesn't implement counting sort exactly; it stops as soon as it finds a duplicate, which is even more efficient, since you only want to know whether there is one.

Best way to match(search) a URL in a list in C (implementing whitelist or blacklist)?

I'm writing a proxy server. It applies different rules to websites that match entries in lists. For example, we can block List A and use another proxy to fetch content for List B.
For example, List A:
.google.com
blogger.com
sourceforge.net
ytimg.com
http://media-cache-*.pinterest.com/*
images-amazon.com
*.amazonaws.com
twitter.com
fbcdn.net
google-analytics.com
staticflickr.com
List B:
ytimg.com
youtube.com
Currently, the match function is:
struct proxy_t *
match_list(char *url) {
    // 2KB line should be enough
    char buf[2048];
    int pos = 0, size;

    struct acllist *al = config->acl_h;
    struct acl *node = al->data;

    while (node != NULL) { // iterate list
        pos = 0; // position in list file
        size = strlen(node->data); // node->data holds a URL list

        while (1) { // iterate each line in list
            readline(buf, node->data, &pos, size);
            if (buf[0] == 0) break;
            if (strcasestr(url, buf) != NULL
                || !fnmatch(buf, url, FNM_CASEFOLD)) {
                return node->proxy;
            }
        }
        node = node->next;
    }
    printf("Not Matched\n");
    return config->default_proxy;
}
That is, iterate the two list files, read line by line, use strcasestr and fnmatch to match a single URL.
It works fine. But if the lists get larger and more numerous, say 10,000 lines per list and 5 lists, I suppose it won't be an efficient solution, since it is an O(N) algorithm.
I'm thinking about adding a hit counter to each match line. By ordering the match lines it may reduce the average search length. Like this:
.google.com|150
blogger.com|76
sourceforge.net|43
ytimg.com|22
Are there any other ideas on this?
There are two ways you could go to improve performance.
1
The first way is to order the URL lists in some way, so you can optimize searching in them.
Quicksort is one of the fastest general-purpose sorting algorithms.
Bubble sort is easier to implement.
Then you can use binary search to look entries up in the list.
Binary search has logarithmic complexity while your loop is linear, so it will be significantly faster on large lists; a rough sketch follows.
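For the exact-match entries (not the wildcard patterns), a sketch using qsort() and bsearch() might look like this; it assumes the list has already been loaded into an array of strings, and the names are illustrative:

#include <stdlib.h>
#include <strings.h>   /* strcasecmp() */

/* qsort()/bsearch() comparator for an array of char * */
static int cmp_str(const void *a, const void *b)
{
    const char * const *x = a;
    const char * const *y = b;
    return strcasecmp(*x, *y);
}

/* sort once after loading the list */
void sort_list(char **list, size_t n)
{
    qsort(list, n, sizeof *list, cmp_str);
}

/* returns non-zero if host is an exact entry in the sorted list */
int in_list(char **list, size_t n, const char *host)
{
    return bsearch(&host, list, n, sizeof *list, cmp_str) != NULL;
}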
2
If your lists of URLs are static, you can use a special tool called flex, which enables you to match a string just by reading it.
This also means that when one of your URL lists is updated, you have to write new parsing code or create a code generator.
This is a much more effective way of matching than any kind of sorting, because it only needs N steps, where N is the length of the URL you are matching; it doesn't matter how long your list is, as long as you can write a correct scanner for the inputs.

How to write a bitstream

I'm thinking about writing some data into a bit stream using C. There are two ways that come to mind. One is to concatenate variable bit-length symbols into a contiguous bit sequence, but that way my decoder will probably have a hard time separating the symbols from the continuous bit stream. The other way is to allocate the same number of bits for each symbol, so the decoder can easily recover the original data, but that may waste bits, since the symbols have different values, which in turn leaves many bits in the stream as zero (wasted bits, I guess).
Any hint what I should do?
I'm new to programming. Any help will be appreciated.
Sounds like you're trying to do something similar to a Huffman compression scheme? I would just go byte-by-byte (char) and keep track of the offset within the byte where I read off the last symbol.
Assuming none of your symbols would be bigger than a char, it would look something like this:
struct bitstream {
    char *data;
    int data_size;           // size of 'data' array
    int last_bit_offset;     // last bit in the stream
    int current_data_offset; // position in 'data', i.e. data[current_data_offset] is the current reading/writing byte
    int current_bit_offset;  // which bit we are currently reading/writing
};

char decodeNextSymbol(struct bitstream *bs) {
}

int encodeNextSymbol(struct bitstream *bs, char symbol) {
}
The matching code for decodeNextSymbol and encodeNextSymbol would have to use the C bitwise operations, for example '&' (bitwise AND) and '|' (bitwise OR). I would then come up with a list of all my symbols, starting with the shortest first, and do a while loop that matches the shortest symbol. For example, if one of your symbols is '101' and the stream is '1011101', it would match the first '101' and then continue matching against the rest of the stream, '1101'. You would also have to handle the case where your symbol values overflow from one byte to the next.
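A rough, untested sketch of the single-bit read/write helpers on which encodeNextSymbol/decodeNextSymbol could be built might look like this (error handling and the end-of-stream check are left out, and writeBit assumes bs->data was zero-initialized and is large enough):

/* append one bit (0 or 1) to the stream */
void writeBit(struct bitstream *bs, int bit)
{
    if (bit)
        bs->data[bs->current_data_offset] |= (char)(1 << bs->current_bit_offset);

    bs->current_bit_offset++;
    if (bs->current_bit_offset == 8) {   /* byte is full, move to the next one */
        bs->current_bit_offset = 0;
        bs->current_data_offset++;
    }
    bs->last_bit_offset++;
}

/* read the next bit from the stream, returning 0 or 1 */
int readBit(struct bitstream *bs)
{
    int bit = (bs->data[bs->current_data_offset] >> bs->current_bit_offset) & 1;

    bs->current_bit_offset++;
    if (bs->current_bit_offset == 8) {
        bs->current_bit_offset = 0;
        bs->current_data_offset++;
    }
    return bit;
}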

Comparing strings in two files in C

I'm new to the C language, so I'll appreciate any help :D
I need to compare the given words in the first file ("Albert\n Martin\n Bob") with the words in the second file ("Albert\n Randy\n Martin\n Ohio").
Whenever they're the same, I need to put the word "Language" in the output file, and print every word from the first file that has no counterpart in the second file.
Something like this:
Language
Language
Bob
needs to be in my third file.
I tried to come up with some ideas, but they don't work :P
Thanks in advance for every answer.
First, you need to open a stream to read the files.
If you need to do this in C, then you may use the strcmp function. It allows you to compare two strings.
For example:
int strcmp(const char *s1, const char *s2);
I'd open all three files to begin with (both input files and the output file). If you can't open all of them then you can't do anything useful (other than display an error message or something); and there's no point wasting CPU time only to find out that (for e.g.) you can't open the output file later. This can also help to reduce race conditions (e.g. second file changes while you're processing the first file).
Next, start processing the first file. Break it into words/tokens as you read it, and for each word/token calculate a hash value. Then use the hash value and the word/token itself to check if the new word/token is a duplicate of a previous (already known) word/token. If it's not a duplicate, allocate some memory and create a new entry for the word/token and insert the entry onto the linked list that corresponds to the hash.
Finally, process the second file. This is similar to how you processed the first file (break it into words/tokens, calculate the hash, use the hash to find out if the word/token is known), except if the word/token isn't known you write it to the output file, and if it is known you write " language" to the output file instead.
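As a rough sketch of that reading loop (the 64-character word limit and the is_known() helper are assumptions, not part of any particular library):

#include <stdio.h>

/* hypothetical: looks the word up in the hash table built from the first file */
int is_known(const char *word);

void process_second_file(FILE *in, FILE *out)
{
    char word[64];

    /* read whitespace-separated words one at a time */
    while (fscanf(in, "%63s", word) == 1) {
        if (is_known(word))               /* seen in the first file? */
            fprintf(out, "Language\n");
        else
            fprintf(out, "%s\n", word);
    }
}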
If you're not familiar with hash tables, they're fairly easy. For a simple method (not necessary the best method) of calculating the hash value for ASCII/text you could do something like:
hash = 0;
while(*src != 0) {
    hash = hash ^ (hash << 5) ^ *src;
    src++;
}
hash = hash % HASH_SIZE;
Then you have an array of linked lists, like "INDEX_ENTRY *index[HASH_SIZE]" that contains a pointer to the first entry for each linked list (or NULL if the linked list for the hash is empty).
To search, use the hash to find the first entry of the correct linked list then do "strcmp()" on each entry in the linked list. An example might look something like this:
INDEX_ENTRY *find_entry(uint32_t hash, char *new_word) {
    INDEX_ENTRY *entry;

    entry = index[hash];
    while(entry != NULL) {
        if(strcmp(new_word, entry->word) == 0) return entry;
        entry = entry->next;
    }
    return NULL;
}
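For completeness, a minimal sketch of the entry type and the insertion step described above (the fixed-size word field is an assumption; the word could also be malloc()'d separately):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 1024
#define MAX_WORD  64

typedef struct index_entry {
    struct index_entry *next;   /* next entry in this hash bucket */
    char word[MAX_WORD];        /* the word itself */
} INDEX_ENTRY;

INDEX_ENTRY *index[HASH_SIZE];

/* insert a new word at the head of its bucket's linked list */
INDEX_ENTRY *add_entry(uint32_t hash, const char *new_word) {
    INDEX_ENTRY *entry = malloc(sizeof *entry);

    if (entry == NULL) return NULL;
    strncpy(entry->word, new_word, MAX_WORD - 1);
    entry->word[MAX_WORD - 1] = '\0';
    entry->next = index[hash];
    index[hash] = entry;
    return entry;
}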
The idea of all this is to improve performance. For example, if both files have 1024 words then (without a hash table) you'd need to do "strcmp()" 1024*1024 times; but if you use a hash table with "#define HASH_SIZE 1024" you'll probably reduce that to about 2000 times (and end up with much faster code). Larger values of HASH_SIZE increase the amount of memory you use a little (and reduce the chance of different words having the same hash).
Don't forget to close your files when you're finished with them. Freeing the memory you used is a good idea if you do something else after this (but if you don't do anything after this, then it's faster and easier to call exit() and let the OS clean up).
