Counting unique words in a file? Good Linear Search alternative? - c

I'm using a naive approach for this problem: I'm putting the words in a linked list and just doing a linear search through it. But it's taking too much time on large files.
I was thinking of using a binary search tree, but I don't know whether it works well with strings. I've also heard of skip lists, but haven't really learned them yet.
And I have to use the C language...

You can put all of the words into a trie and then count the number of words it contains after you have processed the whole file.
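A minimal sketch of that idea, assuming words are lowercase ASCII (the names here are mine, not from any particular library): insert returns 1 only for words not seen before, so the caller can keep a running count of uniques.

#include <stdlib.h>

#define ALPHA 26                 /* assuming lowercase a-z only */

struct trie {
    struct trie *child[ALPHA];
    int is_word;                 /* 1 if a word ends at this node */
};

/* returns 1 if the word was new, 0 if it was already present */
static int trie_insert(struct trie *root, const char *w) {
    for (; *w; w++) {
        int i = *w - 'a';
        if (!root->child[i])
            root->child[i] = calloc(1, sizeof *root->child[i]);
        root = root->child[i];
    }
    if (root->is_word)
        return 0;
    root->is_word = 1;           /* first time we see this word */
    return 1;
}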

Binary Search Trees work fine for strings.
If you don't care about having the words in sorted order, you can just use a hash table.

You're counting the number of unique words in the file?
Why don't you construct a simple hash table? That way, for each word in your list, you add it into the hash table. Any duplicates are discarded, since they are already in the table - and at the end you can just count the number of elements in the data structure (by storing a counter and incrementing it each time you add to the table).
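A rough sketch of such a table with chaining (the bucket count and the djb2 hash are arbitrary choices of mine, and strdup() is POSIX rather than standard C):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096            /* arbitrary; size to the expected vocabulary */

struct entry { char *word; struct entry *next; };
static struct entry *buckets[NBUCKETS];
static size_t unique_count;

static unsigned long hash(const char *s) {   /* djb2 */
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* add word if unseen; duplicates fall out of the bucket scan */
static void add_word(const char *w) {
    unsigned long b = hash(w) % NBUCKETS;
    for (struct entry *e = buckets[b]; e; e = e->next)
        if (strcmp(e->word, w) == 0)
            return;                          /* already in the table */
    struct entry *e = malloc(sizeof *e);
    e->word = strdup(w);
    e->next = buckets[b];
    buckets[b] = e;
    unique_count++;                          /* one more unique word */
}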

The first upgrade to your algorithm could be keeping the list sorted, so your linear search could be faster (you only search until you find an element greater than yours), but this is still a naive solution.
The best approaches are binary search trees and, even better, a prefix tree (or trie, already mentioned in another answer).
In "The C Programming Language" by K&R you have exactly the example of what you are looking for.
The first example of self-referential structures (section 6.5) is a binary search tree used for counting the occurrences of every word in a string. (You don't even need the counts :P)
The structure is something like this:
struct tnode {
    char *word;
    struct tnode *left;
    struct tnode *right;
};
In the book you can see the whole example of what you want to do.
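Adapted from that example (a sketch from memory, tweaked to count unique words rather than occurrences):

#include <stdlib.h>
#include <string.h>

/* insert w into the tree rooted at p; bump *nunique for unseen words */
struct tnode *addtree(struct tnode *p, const char *w, int *nunique) {
    int cond;
    if (p == NULL) {                     /* word not seen before */
        p = malloc(sizeof *p);
        p->word = strdup(w);
        p->left = p->right = NULL;
        (*nunique)++;
    } else if ((cond = strcmp(w, p->word)) < 0)
        p->left = addtree(p->left, w, nunique);
    else if (cond > 0)
        p->right = addtree(p->right, w, nunique);
    /* cond == 0: duplicate word, nothing to do */
    return p;
}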
Binary search trees work well with any type of data that admits an ordering, and they will beat a linear search through a list.
Sorry for my poor English, and correct me if I got something wrong; I'm very new to C :p
EDIT: I can't add comments to other answers, but I have read a comment from the OP saying "The list isn't sorted so I can't use binary search". It makes no sense to use binary search on a linked list. Why? Binary search is efficient when access to a random element is fast, as in an array. In a doubly linked list, your worst-case access is n/2 steps. You can add extra pointers into the list (shortcuts to key elements), but it is a bad solution (that idea, pushed to its conclusion, is essentially a skip list).

I'm putting the words in a linked list and just doing a linear search through it.
If, to check whether word W is present, you go through the whole list, then it's certainly slow: O(n^2) overall, where n is the size of the list.
The simplest fix is probably a hash table. It's easy to implement yourself (unlike some tree structures), and even C should have some libraries for it. You'll get O(n) complexity.
EDIT: some C hash table implementations:
http://en.wikipedia.org/wiki/Hash_table#Independent_packages

If you're on a UNIX system, then you could use the bsearch() (standard C, but it needs a sorted array) or hsearch() (POSIX hash table) family of functions instead of a linear search.
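Counting uniques with hsearch() might look like this (a sketch; note that hsearch uses one process-wide table and the keys must stay allocated for the table's lifetime):

#include <search.h>
#include <stdlib.h>
#include <string.h>

/* count the unique strings among words[0..n-1] using the POSIX hash table */
size_t count_unique_hsearch(char **words, size_t n) {
    size_t unique = 0;
    ENTRY item;
    hcreate(2 * n);                       /* table should exceed the element count */
    for (size_t i = 0; i < n; i++) {
        item.key = words[i];
        if (hsearch(item, FIND) == NULL) {
            item.key = strdup(words[i]);  /* the table keeps this pointer */
            item.data = NULL;
            hsearch(item, ENTER);
            unique++;
        }
    }
    hdestroy();
    return unique;
}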

If you need something simple and easily available, then see man tsearch for a simple binary search tree. But note that this is a plain binary search tree, not a balanced one.
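A sketch of that, counting unique words with tfind()/tsearch() (the balancing caveat above applies; strdup() is POSIX):

#include <search.h>
#include <stdlib.h>
#include <string.h>

static int cmp(const void *a, const void *b) {
    return strcmp(a, b);                        /* keys are plain C strings */
}

size_t count_unique_tsearch(char **words, size_t n) {
    void *root = NULL;
    size_t unique = 0;
    for (size_t i = 0; i < n; i++)
        if (tfind(words[i], &root, cmp) == NULL) {
            tsearch(strdup(words[i]), &root, cmp);  /* new word: insert a copy */
            unique++;
        }
    return unique;
}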
Depending on the number of unique words, a plain C array + realloc() + qsort() + bsearch() might be an option too (see the sketch below). That's what I use when I need a no-frills faster-than-linear search in plain portable C. (Otherwise, if possible, I opt for C++ and std::map/std::set.)
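A sketch of that approach (all names are mine): collect pointers into a growable array, qsort() once, then answer membership queries with bsearch():

#include <stdlib.h>
#include <string.h>

static int pstrcmp(const void *a, const void *b) {
    return strcmp(*(char *const *)a, *(char *const *)b);
}

static char **words;
static size_t nwords, cap;

void collect(char *w) {                   /* grow the array geometrically */
    if (nwords == cap) {
        cap = cap ? cap * 2 : 64;
        words = realloc(words, cap * sizeof *words);
    }
    words[nwords++] = w;
}

/* after collecting everything:
 *     qsort(words, nwords, sizeof *words, pstrcmp);
 * and then a membership test is just:
 *     char *key = "foo";
 *     int found = bsearch(&key, words, nwords, sizeof *words, pstrcmp) != NULL;
 */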
More advanced options are often platform-specific (e.g. glib on Linux).
P.S. Another structure that is very easy to implement is a hash table. It is less efficient for strings, but it can very quickly be made blazing fast by throwing memory at the problem.

Related

Putting words from a file into a binary tree (C)

I'm currently learning C at uni, but for some reason it's just so difficult for me. I couldn't find a simple step-by-step guide, and everything on the internet is so complex and comes without much explanation.
I'm supposed to write this program:
'Using the binary tree and list write a program that reads a text file and prints to the output file all the words in alphabetical order along with the line numbers in which the word occurs.'
And I just don't know how to start. I can open files and run the program from the command line, but I have no idea how to create a binary tree, get the words from a file and put them there, and then create a list inside the binary tree. All the examples I've found are so different that I don't know how to adapt them so they would work for me.
Could anyone help? Even a few lines of code that would guide me in the right direction would help so much!
For starters, it's a binary search tree (a special kind of binary tree) that is required for the given problem.
A binary search tree is a binary tree populated with comparable objects, like numbers. Given two numbers x and y, the following three boolean conditions can be answered without any ambiguity:
x greater than y
x less than y
x equal to y
Now, a binary search tree is built upon the above boolean conditions. The analogy here is that words are also comparable, which is what determines their order in a typical Oxford dictionary. For example, apple < box, and therefore apple comes before box in alphabetical order.
How to get alphabetical order of words?
Once you have populated your tree, a simple inorder traversal will do the rest, listing the words in alphabetical order. Just remember to also keep the line numbers, which can be stored per word at the same time you are building your tree and retrieved later while printing the words in order.
Take the code as an exercise.
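If you want a hedged starting point (all names here are mine, not from any official solution), each node can carry the word plus a linked list of the line numbers it appears on:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct linenode {                    /* one occurrence of the word */
    int line;
    struct linenode *next;
};

struct tnode {
    char *word;
    struct linenode *lines;          /* line numbers, in order of appearance */
    struct tnode *left, *right;
};

static void add_line(struct tnode *p, int line) {
    struct linenode *n = malloc(sizeof *n), **pp = &p->lines;
    n->line = line;
    n->next = NULL;
    while (*pp)                      /* naive append keeps the list in order */
        pp = &(*pp)->next;
    *pp = n;
}

struct tnode *insert(struct tnode *p, const char *w, int line) {
    if (p == NULL) {                 /* new word: make a node */
        p = malloc(sizeof *p);
        p->word = strdup(w);
        p->lines = NULL;
        p->left = p->right = NULL;
        add_line(p, line);
        return p;
    }
    int cond = strcmp(w, p->word);
    if (cond < 0)
        p->left = insert(p->left, w, line);
    else if (cond > 0)
        p->right = insert(p->right, w, line);
    else
        add_line(p, line);           /* same word on another line */
    return p;
}

void print_inorder(const struct tnode *p, FILE *out) {
    if (!p)
        return;
    print_inorder(p->left, out);     /* left subtree first: alphabetical order */
    fprintf(out, "%s:", p->word);
    for (const struct linenode *l = p->lines; l; l = l->next)
        fprintf(out, " %d", l->line);
    fputc('\n', out);
    print_inorder(p->right, out);
}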

Efficient algorithm to search a buffer for any string from a list

I am looking for an efficient search algorithm that, for a given set of strings, searches a large buffer for any one match from the set.
Currently I know a few efficient single-string algorithms (I have used Knuth-Morris-Pratt before), but I don't know if they really help.
Here is what I am actually doing:
I have around 6-10 predefined strings, each around 200-300 characters (actually bytes, since I'm processing binary data)
The input is a large, sometimes multi-megabyte buffer
I would like to process the buffer, and when I have a match, I would like to stop the search
I have looked for multiple-string searching algorithms using a finite set of predefined patterns, but they all seem to revolve around matching ALL of the predefined strings in the buffer.
This post: Fast algorithm for searching for substrings in a string, suggested using the Aho-Corasick or the Rabin-Karp algorithm.
I thought that since I only need one match, I could find other methods similar to the mentioned algorithms, where the constraints given by the problem improve the performance.
Aho-Corasick is a good choice here. After building the automaton, the input string is traversed from left to right, so it is possible to stop immediately after the first match is found. The time complexity is O(sum of the lengths of all patterns + the position of the first occurrence). This is optimal, because it is not possible to find the first match without reading all the patterns and all the bytes of the buffer up to the first occurrence.
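A compact sketch of such an automaton in C, over the full byte alphabet since the data is binary (the names are mine; each node holds 256 pointers, roughly 2 KB, which is fine for 6-10 patterns of a few hundred bytes):

#include <stdlib.h>

#define ALPHA 256                     /* full byte alphabet for binary data */

struct acnode {
    struct acnode *next[ALPHA];       /* transition edges */
    struct acnode *fail;              /* failure link */
    int match;                        /* nonzero if some pattern ends here */
};

static size_t nnodes;

static struct acnode *ac_new(void) {
    nnodes++;
    return calloc(1, sizeof(struct acnode));
}

static void ac_insert(struct acnode *root, const unsigned char *pat, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (!root->next[pat[i]])
            root->next[pat[i]] = ac_new();
        root = root->next[pat[i]];
    }
    root->match = 1;
}

/* breadth-first pass: fill failure links and make every edge total */
static void ac_build(struct acnode *root) {
    struct acnode **q = malloc(nnodes * sizeof *q);
    size_t head = 0, tail = 0;
    root->fail = root;
    for (int c = 0; c < ALPHA; c++) {
        if (root->next[c]) {
            root->next[c]->fail = root;
            q[tail++] = root->next[c];
        } else
            root->next[c] = root;     /* missing edges at the root loop back */
    }
    while (head < tail) {
        struct acnode *u = q[head++];
        if (u->fail->match)
            u->match = 1;             /* inherit matches along failure links */
        for (int c = 0; c < ALPHA; c++) {
            if (u->next[c]) {
                u->next[c]->fail = u->fail->next[c];
                q[tail++] = u->next[c];
            } else
                u->next[c] = u->fail->next[c];
        }
    }
    free(q);
}

/* returns the offset just past the first match, or -1 if there is none */
static long ac_first_match(struct acnode *root, const unsigned char *buf, size_t n) {
    struct acnode *cur = root;
    for (size_t i = 0; i < n; i++) {
        cur = cur->next[buf[i]];
        if (cur->match)
            return (long)i + 1;       /* stop at the very first hit */
    }
    return -1;
}

Usage: create the root with ac_new(), ac_insert() each pattern, call ac_build() once, then run ac_first_match() over the buffer.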

storing strings in an array in a compact way [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 4.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring; if so, note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
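For what it's worth, the overlap computation that the greedy heuristic needs is easy to sketch naively in C (quadratic per pair, which fits the stated O(n^2) budget):

#include <string.h>

/* length of the longest suffix of a that is also a prefix of b */
static size_t overlap(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    size_t max = la < lb ? la : lb;
    for (size_t k = max; k > 0; k--)
        if (memcmp(a + la - k, b, k) == 0)
            return k;
    return 0;
}

/* e.g. overlap("ragdoll", "dollhouse") == 4 ("doll"), so the greedy merge
 * of that pair yields "ragdollhouse" */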
I think you can use a radix tree. It costs some memory because of the pointers to leaves and parents, but matching strings is easy (O(k), where k is the length of the longest string).
My first thought here is: use a data structure to determine the common prefixes and suffixes of your strings. Then sort the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is no "definitive" algorithm.
I did a lab back in college where we were tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
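Of the three stages, the move-to-front transform is the quickest to sketch in C; a minimal byte-alphabet version might look like this (the names are mine):

#include <string.h>

/* move-to-front encode: out[i] is the position of in[i] in a table that is
 * reordered as symbols are seen, so runs of equal symbols become runs of 0s */
static void mtf_encode(const unsigned char *in, size_t n, unsigned char *out) {
    unsigned char table[256];
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;      /* identity order to start */
    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i], j = 0;
        while (table[j] != c)
            j++;
        out[i] = j;                       /* emit the symbol's current index */
        memmove(table + 1, table, j);     /* shift the prefix right by one */
        table[0] = c;                     /* and move the symbol to the front */
    }
}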
Refine step 3.
Look through the current list and see whether any word in the list starts with a suffix of the current word. (You might want to require the suffix to be longer than some length - longer than 1, for example.)
If yes, then prepend the remaining (non-overlapping) prefix of the current word to the existing word, and adjust all existing references appropriately (slow!)
If no, add word to end of list as in current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. An enormous amount of effort has already gone into compression algorithms, so why not take one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what you want to do.
Do you want a data structure that lets you store the strings in a memory-conscious manner, while keeping operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a Patricia trie or a String B-tree.
In the second case, you can just adopt some index compression technique, like this:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress them like this:
0aaa
3b
2sd
1baco
2ad
The number is the length of the largest common prefix with the preceding string.
You can tweak this scheme, for example by planning a "restart" of the common prefix after every K words, for fast reconstruction.
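Decoding such a front-coded list is one strcpy per entry; here is a tiny sketch that reconstructs exactly the example above:

#include <stdio.h>
#include <string.h>

int main(void) {
    /* the encoded entries from above: (shared-prefix length, suffix) */
    struct { int shared; const char *suffix; } enc[] = {
        {0, "aaa"}, {3, "b"}, {2, "sd"}, {1, "baco"}, {2, "ad"}
    };
    char word[64] = "";
    for (size_t i = 0; i < sizeof enc / sizeof enc[0]; i++) {
        /* keep `shared` chars of the previous word, append the suffix */
        strcpy(word + enc[i].shared, enc[i].suffix);
        puts(word);   /* prints aaa, aaab, aasd, abaco, abad */
    }
    return 0;
}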

trie tree implementation in C for efficient lookup of IP addresses [duplicate]

I am implementing Patricia tries for IP prefix lookup. I could get the code working for complete key matches, but I am facing problems with prefix search when there are keys which are prefixes of other keys, like:
1.2.3.0
1.2.0.0
Can anyone help me with the algorithm for prefix searches in the above case?
Should I consider these as keys of separate lengths (i.e., /24 and /16)?
Take a look at Net-Patricia. This is an implementation of a Patricia trie to look up IP addresses. The interface is Perl, but the underlying code is in C. Here is a link, but many CPAN archives should have it:
http://cpansearch.perl.org/src/PHILIPP/Net-Patricia-1.15_07/libpatricia/patricia.c
If you use this trie for storing IP numbers as elements of a fixed length, then it is definitely not the right way. The point here is that a Patricia trie is especially useful for storing variable-length data.
If you store parts of IP numbers, as prefixes of variable length, then a Patricia trie is a good choice.
In this case, yes, your keys should be of different lengths.
Let's say you are storing the prefix "192.168", in binary 0xC0 0xA8; you add this as the first key.
Then, when searching for an IP like 192.168.1.1, you can get the information that your trie contains 192.168, which is a prefix of what you are looking for.
All you have to do is to store the "common part" while traversing the trie.
This is a minor addition to this implementation. Just make sure that while going down the trie you store the common part somewhere in the parameters of the recursive function.
For good understanding of Patricia trie I would suggest reading Robert Sedgewick's Algorithms book which is a great source of knowledge.
EDIT: There is one problem when storing C strings in a Patricia trie. The trie is designed to store binary data, but you are interested only in getting whole bytes.
Make sure you store the common part of the prefix only if its size in bits is a multiple of 8.
For a wrong example: you have the key 0xC0 0xA5 in your tree and you are looking for 0xC0 0xA6.
The traversal will stop when the common part is "0xC0 0xA", but you are interested in taking only "0xC0". So make sure to store common bytes, not bits.
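In code, the byte-boundary rule from the EDIT boils down to something like this (a hypothetical helper of mine, not part of Net-Patricia):

#include <stddef.h>

/* length, in whole bytes, of the common prefix of two byte strings */
static size_t common_bytes(const unsigned char *a, const unsigned char *b, size_t len) {
    size_t i = 0;
    while (i < len && a[i] == b[i])
        i++;
    /* 0xC0 0xA5 vs 0xC0 0xA6 -> 1: keep "0xC0", never the partial second byte */
    return i;
}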
There's a fairly-readable C implementation in the test code for LLVM: https://llvm.org/svn/llvm-project/test-suite/trunk/MultiSource/Benchmarks/MiBench/network-patricia/

How do I find out if a value is in an array in C?

What's the equivalent of PHP's in_array or C#/.NET's IList.Contains in C? I'm a C n00b, but I know very little is built-in. If I have to do this myself, what's the best way? Iterate over the array and test each value?
There is no way to tell if a given value is in a C array without inspecting the array yourself. If your array is unsorted, you have to look at every element. If you know some more details about the contents of the array, choose your favourite search algorithm and start looking!
Linear search is trivial to implement yourself. Binary search isn't much harder, but it's part of the standard C library already. Check out man bsearch.
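Both options in miniature (the function names are mine):

#include <stdlib.h>

/* linear search: works on any array, sorted or not */
static int contains(const int *a, size_t n, int value) {
    for (size_t i = 0; i < n; i++)
        if (a[i] == value)
            return 1;
    return 0;
}

/* bsearch: requires the array to be sorted ascending */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* usage:
 *     int sorted[] = {1, 3, 7};
 *     int key = 3;
 *     int found = bsearch(&key, sorted, 3, sizeof *sorted, cmp_int) != NULL;
 */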
