Putting words from a file into a binary tree (C)

I'm currently learning C in Uni but for some reason it's just so difficult for me. I couldn't find a simple step by step guide and everything that's on the internet is just so complex and without much explanation.
I'm supposed to write this program:
'Using the binary tree and list write a program that reads a text file and prints to the output file all the words in alphabetical order along with the line numbers in which the word occurs.'
And I just don't know how to start it. I can open files, run it from the command line, but I have no idea how to create a binary tree, get the words from a file and put them there, and then create a list inside the binary tree. All the examples I've found are so different that I don't know how to rewrite them so they would work for me.
Could anyone help? Even a few lines of code that would guide me in the right direction would help so much!

For starters, it's a binary search tree (a special kind of binary tree) that is required for the given problem.
A binary search tree is a binary tree that is populated with comparable objects, like numbers. Meaning that, given two numbers x and y, the following three boolean conditions can be answered without any ambiguity:
x greater than y
x less than y
x equal to y
Now, a binary search tree is built upon the above boolean conditions. The analogy here is that words are also comparable, which is what decides their order in a typical Oxford dictionary. For example, apple < box, and therefore apple comes before box in alphabetical order.
How to get alphabetical order of words?
Once you have populated your tree, a simple in-order traversal will do the rest, that is, list the words in alphabetical order. Just remember to also keep the line numbers for each word; they can be stored at the same time you are building the tree and retrieved later while printing the words in order.
Writing the actual code is left as an exercise.
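That said, here is a minimal sketch of the data structures (not a complete program) to get you started. It assumes the words have already been read and tokenized, uses the POSIX strdup(), and leaves out all error handling; the names insert and print_inorder are just illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One node per distinct word; each node keeps a linked list of the line
   numbers on which the word occurs. */
struct linenode {
    int line;
    struct linenode *next;
};

struct tnode {
    char *word;
    struct linenode *lines;
    struct tnode *left, *right;
};

/* Insert a word seen on a given line. A duplicate word just gets another
   line-number entry on the existing node. */
struct tnode *insert(struct tnode *root, const char *word, int line) {
    if (root == NULL) {
        root = malloc(sizeof *root);
        root->word = strdup(word);          /* strdup is POSIX */
        root->lines = NULL;
        root->left = root->right = NULL;
    } else {
        int cmp = strcmp(word, root->word);
        if (cmp < 0) { root->left = insert(root->left, word, line); return root; }
        if (cmp > 0) { root->right = insert(root->right, word, line); return root; }
        /* cmp == 0: fall through and record one more line number */
    }
    struct linenode *ln = malloc(sizeof *ln);
    ln->line = line;
    ln->next = root->lines;                 /* note: stored newest-first */
    root->lines = ln;
    return root;
}

/* In-order traversal prints the words in alphabetical order. */
void print_inorder(const struct tnode *root, FILE *out) {
    if (root == NULL)
        return;
    print_inorder(root->left, out);
    fprintf(out, "%s:", root->word);
    for (const struct linenode *ln = root->lines; ln != NULL; ln = ln->next)
        fprintf(out, " %d", ln->line);
    fputc('\n', out);
    print_inorder(root->right, out);
}
```

Your main loop would then read words from the input file, track the current line number, call insert for each word, and finally call print_inorder on the output file.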

Related

Reconstruct version control from set of files

I am looking for an approach to the following task:
given a set of files that are highly similar (I am using fuzzy hashing here), I would like to know if there is an algorithm that allows me to label those files with a version number. The output should be the sequential order in which those files were generated.
The reason is I have to re-organize data of a team who were not familiar with version control.
Thank you
A fairly simple approach (I hope) would be to try and convert this into some kind of graph problem.
Let's say every file is a node with edges between every two files.
The weight of an edge between two nodes would be, for instance, the number of differing lines between the files (or some other function).
What you do next is find a non-cyclic path that traverses all files with the minimum cost; something like this, if you know the first file and the last.
You could add an empty file and the latest version you have as your start and end nodes.
I'm guessing this won't give you the exact result, but it'll probably give you a good starting point.
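As a rough illustration only: the greedy nearest-neighbour walk below is a heuristic, not the true minimum-cost path, and NFILES and the distance matrix are placeholders you would fill in from your diffs.

```c
#include <stdbool.h>

#define NFILES 5   /* hypothetical: number of files, including the empty "start" file */

/* Greedy nearest-neighbour ordering over a precomputed distance matrix.
   dist[i][j] could be, say, the number of differing lines between files i and j. */
void order_versions(const int dist[NFILES][NFILES], int start, int order[NFILES]) {
    bool used[NFILES] = { false };
    int current = start;
    used[current] = true;
    order[0] = current;

    for (int step = 1; step < NFILES; step++) {
        int best = -1;
        for (int j = 0; j < NFILES; j++)
            if (!used[j] && (best == -1 || dist[current][j] < dist[current][best]))
                best = j;
        used[best] = true;
        order[step] = best;
        current = best;
    }
}
```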
Hope this is helpful.

Extract records from compiled search program, C

Does anybody have an idea on how to extract all information from a compiled, record search program?
I think the program works by using a binary search. It was compiled and the database was in the program. The only way to see the records is to make a correct search.
Is there some way that I can bruteforce the program and extract all information?
The record is searched by an ID, which starts with 1 and is 10 digits long [ 1xxxxxxxxx ].
If you want to try, 1112700303 will work but I don't have the other numbers.
I've tried some Decompiler but I have no idea what I'm doing.
The program can be downloaded from here:
https://docs.google.com/file/d/0B9fwDRGBsrxBT3FiSFdaTnJZcUk/edit
Your help is appreciated, as it will increase my knowledge and help me learn something new here :D
Tough question. Is there no way to get hold of the source code (ask the author, search for the program name, ...)?
On Unix/Linux, the program strings extracts printable strings from a binary file. Doing that on x86 executables gives a long list of strings that are just instructions which happen to be ASCII strings, names of functions used by the program, and other junk. Somewhere it lists the initialized text data for the program (printf(3) formats, constant strings used), which in this case shows a bunch of names that look Arabic, and some directory names. Perhaps searching for those could help.
This can probably be achieved by using Snowman. It might not get the exact source code you are looking for, but enough to extract all the data you need, such as the constant strings.

storing strings in an array in a compact way [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 4.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
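(For reference, here is a minimal sketch of the greedy approach described above. It assumes everything fits in one fixed buffer and only prints the recorded positions rather than storing them; it is meant to illustrate the idea, not to be production code.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sort comparator: longest string first. */
static int by_length_desc(const void *a, const void *b) {
    size_t la = strlen(*(const char * const *)a);
    size_t lb = strlen(*(const char * const *)b);
    return (la < lb) - (la > lb);
}

int main(void) {
    char *words[] = { "doll", "dollhouse", "house" };
    size_t n = sizeof words / sizeof words[0];
    char buffer[256] = "";

    qsort(words, n, sizeof words[0], by_length_desc);

    for (size_t i = 0; i < n; i++) {
        char *found = strstr(buffer, words[i]);
        size_t pos = found ? (size_t)(found - buffer) : strlen(buffer);
        if (!found)
            strcat(buffer, words[i]);   /* not a substring yet: append to the buffer */
        printf("%-10s start=%zu len=%zu\n", words[i], pos, strlen(words[i]));
    }
    printf("buffer: %s\n", buffer);     /* prints: dollhouse */
    return 0;
}
```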
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
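As a sketch of the core building block for that greedy heuristic, here is a naive pairwise overlap check; the full merge loop, the substring-removal step, and the suffix-tree/radix-tree speedups are left out.

```c
#include <stdio.h>
#include <string.h>

/* Length of the longest suffix of a that is also a prefix of b.
   Used to decide which pair of strings to merge next in the greedy heuristic. */
static size_t overlap(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    size_t max = la < lb ? la : lb;
    for (size_t k = max; k > 0; k--)
        if (strncmp(a + la - k, b, k) == 0)
            return k;
    return 0;
}

int main(void) {
    printf("%zu\n", overlap("ragdoll", "dollhouse"));   /* prints 4: "doll" */
    return 0;
}
```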
I think you can use a Radix Tree. It costs some memory because of the pointers to leaves and parents, but it is easy to match up strings: O(k), where k is the length of the longest string.
My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is not a "definitive" algorithm.
I did a lab back in college where we were tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into runs of identical letters (hint: there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
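Just to make one of those steps concrete, here is a rough sketch of the move-to-front transform on its own, assuming plain byte input; the BWT and Huffman stages are more involved and omitted.

```c
#include <string.h>

/* Move-to-front transform over the byte alphabet 0..255.
   Each input byte is replaced by its current index in the table,
   and that byte is then moved to the front of the table. */
void mtf_encode(const unsigned char *in, int n, unsigned char *out) {
    unsigned char table[256];
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;

    for (int i = 0; i < n; i++) {
        unsigned char c = in[i];
        int idx = 0;
        while (table[idx] != c)
            idx++;
        out[i] = (unsigned char)idx;
        /* move c to the front of the table */
        memmove(table + 1, table, (size_t)idx);
        table[0] = c;
    }
}
```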
Refine step 3.
Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
If yes, then add the distinct prefix to this word as a prefix to the existing word, and adjust all existing references appropriately (slow!)
If no, add word to end of list as in current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. There has already gone an enormous amount of manpower into compression algorithms, why not take one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what you want to do.
Do you want a data structure that lets you store the strings in a memory-conscious manner while keeping operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression technique, like this:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress them like this:
0aaa
3b
2sd
1baco
3d
The number is the length of the largest common prefix with the preceding string.
You can tweak that scheme, e.g. forcing a "restart" of the common prefix every K words, for fast reconstruction.
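A minimal sketch of that prefix ("front coding") scheme, using the example list above; the restart-every-K-words tweak is not shown.

```c
#include <stdio.h>

/* Front coding: each word is stored as the length of the prefix it shares
   with the previous word, followed by the remaining suffix. */
void front_encode(const char *words[], int n) {
    const char *prev = "";
    for (int i = 0; i < n; i++) {
        int k = 0;
        while (prev[k] && words[i][k] && prev[k] == words[i][k])
            k++;
        printf("%d%s\n", k, words[i] + k);
        prev = words[i];
    }
}

int main(void) {
    const char *words[] = { "aaa", "aaab", "aasd", "abaco", "abad" };
    front_encode(words, 5);   /* prints: 0aaa, 3b, 2sd, 1baco, 3d */
    return 0;
}
```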

Reading a text line using a given pattern

The question is quite simple, and I hope the answer is simple too :)
Let's say you are given a pattern, and you have to read a text file where every line is composed of a first number, which indicates how many "patterns" there are in the line, followed by the pattern elements separated by one space.
For example, given the pattern
key value
a valid line of the text file could be
3 10 "apple" 15 "orange" 17 "melon"
If the number N of repetitions is fixed, I'd use something like
fscanf(inFile,"%d %s",&n,str);
but is there a function that allows me to give the number of repetitions as a parameter, or should I scan each line and extract the values I'm interested in, using substr and atoi?
The "trivial" way is obvious, I'm looking for something more "professional" and effective.
Use fscanf() in a loop: first extract the number of repetitions N, then loop N times extracting your pattern.
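A minimal sketch of that loop, assuming each value is a quoted token without embedded spaces; the quotes are kept as-is here, and stripping them, as well as all error recovery, is left out.

```c
#include <stdio.h>

/* Read one line of the form: N key1 "value1" key2 "value2" ...
   Returns N on success, -1 on end of file or a malformed line. */
int read_line(FILE *in) {
    int n;
    if (fscanf(in, "%d", &n) != 1)
        return -1;

    for (int i = 0; i < n; i++) {
        int key;
        char value[64];
        if (fscanf(in, "%d %63s", &key, value) != 2)
            return -1;
        printf("key=%d value=%s\n", key, value);   /* value still includes the quotes */
    }
    return n;
}
```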
If you're looking for something more professional or sophisticated, you might want to move away from the standard C library and towards a regex or parsing library, or something mentioned here: http://www.and.org/vstr/comparison. While I won't go so far as to say you can't do string processing easily or well in C, it's not a strong point of the core language.

Counting unique words in a file? Good Linear Search alternative?

I'm using a naive approach to this problem: I'm putting the words in a linked list and just doing a linear search through it. But it's taking too much time on large files.
I was thinking of using a Binary Search Tree, but I don't know if it works well with strings. I've also heard of Skip Lists, but haven't really learned them yet.
And also I have to use the C language...
You can put all of the words into a trie and then count the number of words after you have processed the whole file.
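A rough sketch of the trie idea, assuming lowercase a-z words only; input sanitization and freeing the nodes are omitted.

```c
#include <stdlib.h>

#define ALPHABET 26   /* assuming lowercase a-z only */

struct trienode {
    struct trienode *child[ALPHABET];
    int is_word;      /* 1 if a word ends at this node */
};

/* Insert a word; returns 1 if it was new, 0 if it was already present.
   Usage sketch: struct trienode root = {0}; unique += trie_insert(&root, "apple"); */
int trie_insert(struct trienode *root, const char *word) {
    for (; *word; word++) {
        int c = *word - 'a';
        if (root->child[c] == NULL)
            root->child[c] = calloc(1, sizeof(struct trienode));
        root = root->child[c];
    }
    if (root->is_word)
        return 0;
    root->is_word = 1;
    return 1;
}
```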
Binary Search Trees work fine for strings.
If you don't care about having the words in sorted order, you can just use a hash table.
You're counting the number of unique words in the file?
Why don't you construct a simple hash table? This way, for each word in your list, add it into the hash table. Any duplicates will be discarded, since they would already be in the hash table; finally, you can just count the number of elements in the data structure (by storing a counter and incrementing it each time you add to the table).
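A minimal sketch of that hash-table approach with chaining; strdup is POSIX, the table size is arbitrary, and error handling is omitted.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096   /* arbitrary; pick based on the expected number of words */

struct entry {
    char *word;
    struct entry *next;
};

static struct entry *buckets[NBUCKETS];
static int unique_count = 0;

/* djb2-style string hash, reduced to a bucket index. */
static unsigned long hash(const char *s) {
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Add a word; the counter is bumped only if the word was not seen before. */
void add_word(const char *word) {
    unsigned long h = hash(word);
    for (struct entry *e = buckets[h]; e; e = e->next)
        if (strcmp(e->word, word) == 0)
            return;                      /* duplicate: discard */

    struct entry *e = malloc(sizeof *e);
    e->word = strdup(word);              /* strdup is POSIX */
    e->next = buckets[h];
    buckets[h] = e;
    unique_count++;
}
```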
The first upgrade to your algorithm could be keeping the list sorted, so your linear search could be faster (you only search until you find an element greater than yours), but this is still a naive solution.
The best approaches are Binary Search Trees and, even better, a prefix tree (or trie, already mentioned in another answer).
In "The C Programming Language" From K&R you have the exact example of what you are looking for.
The first example of "autoreferenced data structs" (6.5) is a binary search tree used for counting the ocurrences of every word in a string. (You don't need to count :P)
the structure is something like this:
struct tnode {
    char *word;
    struct tnode *left;
    struct tnode *right;
};
In the book you can see the whole example of what you want to do.
Binary Search Trees work well with any type of data that can be ordered, and will be better than a linear search in a list.
Sorry for my poor English, and correct me if I was wrong about something I've said, I'm still quite new to C :p
EDIT: I can't add comments to other answers, but I have read a comment from the OP saying "The list isn't sorted so I can't use binary search". It makes no sense to use binary search on a linked list. Why? Binary search is efficient when access to a random element is fast, like in an array. In a doubly linked list, your worst-case access will be n/2. You could put extra pointers into the list (pointing at key elements), but that is a bad solution.
I'm putting the words in a linked list and just doing a linear search through it.
If, to check whether word W is present, you go through the whole list, then it's surely slow: O(n^2) overall, where n is the size of the list.
The simplest way is probably a hash table. It's easy to implement yourself (unlike some tree structures), and even C should have some libraries for that. You'll get O(n) complexity.
Edit: some C hash table implementations:
http://en.wikipedia.org/wiki/Hash_table#Independent_packages
If you're on a UNIX system, then you could use the bsearch() or hsearch() family of functions instead of a linear search.
If you need something simple and easily available, see man tsearch for a simple binary search tree. But this is a plain binary search tree, not a balanced one.
Depending on the number of unique words, a plain C array + realloc() + qsort() + bsearch() might be an option too. That's what I use when I need no-frills faster-than-linear search in plain portable C. (Otherwise, if possible, I opt for C++ and std::map/std::set.)
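For what it's worth, a small sketch of the plain-array variant (a fixed array here instead of realloc, just to show the qsort()/bsearch() part):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort/bsearch pass pointers to the elements, which here are char*,
   hence the double indirection in the comparator. */
static int cmp_str(const void *a, const void *b) {
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

int main(void) {
    char *words[] = { "pear", "apple", "orange", "apple" };
    size_t n = sizeof words / sizeof words[0];

    qsort(words, n, sizeof words[0], cmp_str);

    /* after sorting, duplicates are adjacent, so counting unique words is one pass */
    size_t unique = 0;
    for (size_t i = 0; i < n; i++)
        if (i == 0 || strcmp(words[i], words[i - 1]) != 0)
            unique++;
    printf("unique words: %zu\n", unique);   /* prints 3 */

    /* membership test via bsearch */
    char *key = "orange";
    char **found = bsearch(&key, words, n, sizeof words[0], cmp_str);
    printf("%s %s\n", key, found ? "found" : "not found");
    return 0;
}
```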
More advanced options are often platform specific (e.g. glib on Linux).
P.S. Another very easy to implement structure is a hash. Less efficient for strings but very easy to implement. Can be very quickly made blazing fast by throwing memory at the problem.