Help with this ragged array/substring program in C? - c

So, I have a ragged array (called names) of length 200. Each pointer in the array points to a string that is no longer than 50 characters and contains no white space. I also have a string given through user input called inname, with length 50; inname will be one of the strings stored in names. I need to find a way to go through the strings in my ragged array and find the string with the largest substring overlap with inname, excluding inname itself, as that will be in the file. If no string has an overlap, then we print out "no recommendation".
I've been melting trying to puzzle this out for many hours now, any help? O:) So basically, the program finds the name in the array with the largest substring overlap with inname.
Will edit to provide additional info if you need it.

You should probably start on the smaller problem of determining the size of the overlap between inname and a single string.
Wikipedia goes over some algorithms for solving the longest common substring problem (including pseudocode!)

This won't lead you to the most efficient way to find overlap (dynamic programming is one way--there are other crazy methods like suffix trees), but it should get you started:
First, think about how you would find the length of the overlap with the beginning of both strings aligned. For instance, find the longest overlap between these two:
programming
ungrammatical
In this case, only a single m overlaps--a length of 1.
Then think about how you would "shift" the strings and look for the overlap when they are aligned differently. (Don't actually change the strings: just change how you loop through them to compare them.) What is the overlap between these two?
programming
 ungrammatical
Think about how to look at all possible alignments. If you keep track of the longest overlap you find, you have the longest overlap between two particular strings.
After that, move on to checking all the different strings. Keep track of the one with the best match, and again, once you have looked at all of them, you have your answer.
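The two steps above could be sketched roughly like this. It is only a sketch under some assumptions: the function names longest_overlap and best_match are made up for illustration, names is assumed to be an array of char pointers, and ties go to the first name found. It uses the simple shift-and-compare idea from this answer, not dynamic programming or suffix trees.

```c
#include <stddef.h>
#include <string.h>

/* Longest run of equal characters with a[0] lined up against b[0]. */
static size_t best_run(const char *a, const char *b)
{
    size_t best = 0, run = 0;
    for (size_t i = 0; a[i] != '\0' && b[i] != '\0'; i++) {
        run = (a[i] == b[i]) ? run + 1 : 0;
        if (run > best)
            best = run;
    }
    return best;
}

/* Try every alignment of the two strings and keep the longest run.
   The strings themselves are never modified; only the start offsets move. */
size_t longest_overlap(const char *a, const char *b)
{
    size_t best = 0;
    for (size_t s = 0; a[s] != '\0'; s++) {   /* b slides to the right under a */
        size_t r = best_run(a + s, b);
        if (r > best)
            best = r;
    }
    for (size_t s = 0; b[s] != '\0'; s++) {   /* a slides to the right under b */
        size_t r = best_run(a, b + s);
        if (r > best)
            best = r;
    }
    return best;
}

/* Index of the name with the largest overlap with inname, skipping inname
   itself. Returns -1 when no name shares even one character. */
int best_match(char *names[], int count, const char *inname)
{
    int best_index = -1;
    size_t best_len = 0;
    for (int i = 0; i < count; i++) {
        if (strcmp(names[i], inname) == 0)
            continue;                          /* exclude inname itself */
        size_t len = longest_overlap(names[i], inname);
        if (len > best_len) {
            best_len = len;
            best_index = i;
        }
    }
    return best_index;
}
```

If best_match returns -1 you would print "no recommendation"; otherwise names[index] is the recommendation. With strings of at most 50 characters and 200 names, the brute-force cost is small.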

Related

How to compare two strings and return the number of words that are the same?

I am writing a program in C and have not found an efficient way to make this comparison. If someone could help me I would be very grateful.
EXAMPLE:
W1: Big house with white walls
W2: house walls
return: 2
When thinking how to solve the problem, it helps to break it down into steps that you can see clearly how to proceed, and how they progress toward the solution. The usual approach for something like this would be
make a copy of each string
for each copy, make an array of char* pointers that point into the copied string, which in turn is parsed into "words" (by putting '\0' at each non-word character)
run qsort on each array of pointers
Then, having two sorted arrays of pointers to words, you can write a loop using strcmp to check the words for equality. I suggest strcmp because, since the arrays are sorted, the sign of its result tells you which array to advance when a word is present in one array but missing from the other.
The copy/parse/sort part would naturally be a function, given a string and returning the array of pointers. The caller should free that (and the chopped-up string to which it points).
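Here is a rough sketch of that copy/parse/sort/compare approach. The names split_sorted and count_common_words are made up for illustration, a "word" is assumed to be a run of alphanumeric characters, and the comparison is case-sensitive. A word repeated in both strings is counted once per matching pair.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

static int cmp_words(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Chop copy into words in place (non-word characters become '\0') and
   return a malloc'd, sorted array of pointers into copy. */
static char **split_sorted(char *copy, size_t *count)
{
    size_t cap = strlen(copy) / 2 + 1;   /* upper bound on the number of words */
    char **words = malloc(cap * sizeof *words);
    size_t n = 0;
    int in_word = 0;

    *count = 0;
    if (words == NULL)
        return NULL;
    for (char *p = copy; *p != '\0'; p++) {
        if (isalnum((unsigned char)*p)) {
            if (!in_word) {
                words[n++] = p;          /* first character of a new word */
                in_word = 1;
            }
        } else {
            *p = '\0';                   /* terminate the previous word */
            in_word = 0;
        }
    }
    qsort(words, n, sizeof *words, cmp_words);
    *count = n;
    return words;
}

/* Number of words the two strings have in common, or -1 if malloc fails. */
int count_common_words(const char *s1, const char *s2)
{
    char *c1 = malloc(strlen(s1) + 1);
    char *c2 = malloc(strlen(s2) + 1);
    char **w1 = NULL, **w2 = NULL;
    size_t n1 = 0, n2 = 0, i = 0, j = 0;
    int common = -1;

    if (c1 != NULL && c2 != NULL) {
        strcpy(c1, s1);
        strcpy(c2, s2);
        w1 = split_sorted(c1, &n1);
        w2 = split_sorted(c2, &n2);
    }
    if (w1 != NULL && w2 != NULL) {
        common = 0;
        while (i < n1 && j < n2) {       /* walk both sorted arrays together */
            int c = strcmp(w1[i], w2[j]);
            if (c < 0)
                i++;                     /* w1[i] is missing from the second array */
            else if (c > 0)
                j++;                     /* w2[j] is missing from the first array */
            else {
                common++;
                i++;
                j++;
            }
        }
    }
    free(w1);
    free(w2);
    free(c1);
    free(c2);
    return common;
}
```

Here the caller-side cleanup is folded into count_common_words, so nothing allocated escapes the function.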

How to sort this specific string array by using the following set of rules(?):

Yes, I'm a newbie and my uncle has challenged me to use the function:
void sortStrings(char str[], const char* delim){...}
to sort the given char array str in a way that every char from delim that shows up in str will separate a group of chars in str thus making them the words you need to sort by hex value. In the process I also need to replace those word separating chars from delim with ';'.
The rules are: I may only use the library <stdio.h> and I can't use malloc/realloc.
Apparently this should be done with an O notation of n^2 (n being the amount of words in str, not chars)
and here's an example of an input and output:
input:
char str[] = "aaa*test,hello.world*abcd.zzz";
sortStrings(str, ",.*");
output: str is now: "aaa;abcd;hello;test;world;zzz"
Well, I've managed it now eventually, the bubble sort thing kinda helped so tyvm for that :)
note: I'll leave this thread here in case anyone wants to take this challenge on himself? It ain't easy I promise :P If you think I should delete it or add the finished code then just ask(Please don't deduct my rep even more ><)
You are off to a good start. After your second for loop you have
replaced all delimiters with semicolon
you know how many words there are, size
you know how many letters each word has, letters
One observation is that you have allocated letters to be 1000 entries. That seems like enough, but is it really? How do you know how many words there are in str? You don't. And you can't use malloc to allocate dynamically, so perhaps you need to look for an algorithm that doesn't require that lookup table? You need an algorithm that works in place:
http://en.wikipedia.org/wiki/In-place_algorithm
What's next? You need to sort the words. There are many sorting algorithms. You want something simple, and are allowed complexity O(n^2). Here's a list of sorting algorithms:
http://en.wikipedia.org/wiki/Sorting_algorithm
Notice that in the table, under 'other notes' it tells you that some algorithms are "Tiny code size", that sounds good. Sort the table first by 'other notes' then by 'Average' complexity (click on the triangle in the column header). You now have two algorithms that use a method 'exchanging' (that means in place), with Tiny code size and average complexity O(n^2), these wikipedia links explain how they work, and include pseudocode to get you started:
Bubble sort
Gnome sort
Take your pick and try your hand at it.
Hint: you probably want a subroutine that exchanges (swaps) two consecutive words if the first word compares greater than the second word. This can be done in place.
Suppose you have
abcd;aaa
The first word is greater than the second word, you need to detect that, and swap the words so you end up with
aaa;abcd
Here's a diagram that will give you the general idea.
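For anyone who wants to see the hint spelled out, below is one possible sketch of the whole thing, not the asker's finished code. It assumes a bubble sort over adjacent words and swaps two neighbouring words in place with the reversal trick (reverse the whole pair, then reverse each word). The helper names reverse, is_delim and compare_words are made up, and it calls no library functions and uses no lookup table.

```c
/* Reverse s[lo..hi], both ends inclusive. */
static void reverse(char s[], int lo, int hi)
{
    while (lo < hi) {
        char t = s[lo];
        s[lo] = s[hi];
        s[hi] = t;
        lo++;
        hi--;
    }
}

static int is_delim(char c, const char *delim)
{
    for (; *delim != '\0'; delim++)
        if (*delim == c)
            return 1;
    return 0;
}

/* Compare the words starting at a and b; a word ends at ';' or '\0'. */
static int compare_words(const char *a, const char *b)
{
    while (*a != ';' && *a != '\0' && *a == *b) {
        a++;
        b++;
    }
    int ca = (*a == ';' || *a == '\0') ? 0 : (unsigned char)*a;
    int cb = (*b == ';' || *b == '\0') ? 0 : (unsigned char)*b;
    return ca - cb;
}

void sortStrings(char str[], const char *delim)
{
    int len = 0, swapped = 1;

    /* Replace every delimiter with ';' and measure the string. */
    for (; str[len] != '\0'; len++)
        if (is_delim(str[len], delim))
            str[len] = ';';

    while (swapped) {                    /* one bubble-sort pass per iteration */
        int start = 0;                   /* start of the first word of the pair */
        swapped = 0;
        while (start < len) {
            int mid = start;             /* will index the ';' between the pair */
            while (mid < len && str[mid] != ';')
                mid++;
            if (mid >= len)
                break;                   /* no second word left */
            int end = mid + 1;           /* will be one past the second word */
            while (end < len && str[end] != ';')
                end++;
            if (compare_words(str + start, str + mid + 1) > 0) {
                /* "abcd;aaa" -> "aaa;dcba" -> "aaa;abcd" */
                int second_len = end - mid - 1;
                reverse(str, start, end - 1);
                reverse(str, start, start + second_len - 1);
                reverse(str, start + second_len + 1, end - 1);
                swapped = 1;
                start += second_len + 1; /* the larger word moved right; follow it */
            } else {
                start = mid + 1;
            }
        }
    }
}
```

The number of word comparisons is O(n^2) in the number of words; each swap additionally costs time proportional to the length of the two words involved.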

storing strings in an array in a compact way [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 4.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
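To make the greedy merge concrete, here is a brute-force sketch, assuming a small word list held in fixed-size buffers. MAX_LEN, suffix_prefix_overlap and greedy_superstring are names invented for this example; it uses plain strstr and memcmp rather than suffix or radix trees, and it does not record the position of each word (you could recover those afterwards with strstr on the final buffer).

```c
#include <stddef.h>
#include <string.h>

#define MAX_LEN 256   /* assumed capacity of each working buffer */

/* Length of the longest suffix of a that is also a prefix of b. */
static size_t suffix_prefix_overlap(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t k = la < lb ? la : lb;
    for (; k > 0; k--)
        if (memcmp(a + la - k, b, k) == 0)
            return k;
    return 0;
}

/* Remove pool[j] by moving the last string into its slot. */
static void remove_at(char pool[][MAX_LEN], size_t *n, size_t j)
{
    (*n)--;
    if (j != *n)
        strcpy(pool[j], pool[*n]);
}

/* Repeatedly drop contained strings and merge the pair with the longest
   overlap. The result ends up in pool[0]. Returns 0 on success, -1 if a
   merged string would not fit in MAX_LEN. */
int greedy_superstring(char pool[][MAX_LEN], size_t n)
{
    while (n > 1) {
        size_t bi = 0, bj = 1, best = 0;
        int contained = 0;

        for (size_t i = 0; i < n && !contained; i++) {
            for (size_t j = 0; j < n; j++) {
                if (i == j)
                    continue;
                if (strstr(pool[i], pool[j]) != NULL) {
                    bi = i;              /* pool[j] lies entirely inside pool[i] */
                    bj = j;
                    contained = 1;
                    break;
                }
                size_t k = suffix_prefix_overlap(pool[i], pool[j]);
                if (k > best) {
                    best = k;
                    bi = i;
                    bj = j;
                }
            }
        }
        if (!contained) {
            if (strlen(pool[bi]) + strlen(pool[bj]) - best >= MAX_LEN)
                return -1;
            strcat(pool[bi], pool[bj] + best);   /* append the non-overlapping tail */
        }
        remove_at(pool, &n, bj);
    }
    return 0;
}
```

On the list doll, dollhouse, house, ragdoll this yields ragdollhouse. Each round rescans all pairs, so it is far too slow for tens of thousands of words as written; it is meant only to show the idea.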
I think you can use a Radix Tree. It costs some memory because of pointers to leaves and parents, but it is easy to match up strings (O(k), where k is the longest string size).
My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words under consideration of these prefixes and postfixes. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is not a "definitive" algorithm.
I did a lab back in college where we were tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
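As an illustration of the middle step only, a move-to-front encoder can be sketched as below. This is not the code from that lab; the function name mtf_encode is made up, and the dynamic list is assumed to start as all 256 byte values in order.

```c
#include <stddef.h>
#include <string.h>

/* Move-to-front: replace each byte by its current index in a list of all
   byte values, then move that byte to the front of the list. */
void mtf_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    unsigned char table[256];

    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        size_t pos = 0;
        while (table[pos] != c)          /* always found: table is a permutation */
            pos++;
        out[i] = (unsigned char)pos;
        memmove(table + 1, table, pos);  /* shift the entries before pos right */
        table[0] = c;
    }
}
```

Runs of identical letters (which BWT tends to produce) turn into runs of zeros, which is what makes the following Huffman step effective.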
Refine step 3.
Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
If yes, then add the distinct prefix to this word as a prefix to the existing word, and adjust all existing references appropriately (slow!)
If no, add word to end of list as in current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. An enormous amount of effort has already gone into compression algorithms, so why not take one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what you want to do.
Do you want a data structure that lets you store the strings in a memory-conscious manner while still allowing operations like search in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression technique, like this:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress it like this:
0aaa
3b
2sd
1baco
3d
The number is the length of the largest common prefix with the preceding string.
You can tweak that scheme, e.g. by planning a "restart" of the common prefix after every K words, for fast reconstruction.
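A minimal sketch of that prefix scheme (often called front coding) could look like this. The function names are made up for the example, the word list is assumed to be sorted already, and the words are assumed not to start with a digit, since otherwise the count and the suffix run together.

```c
#include <stdio.h>
#include <string.h>

/* Number of leading characters that a and b share. */
static size_t common_prefix(const char *a, const char *b)
{
    size_t k = 0;
    while (a[k] != '\0' && a[k] == b[k])
        k++;
    return k;
}

/* Write "<shared prefix length><remaining suffix>" for word, relative to
   the preceding word prev, into out. Returns what snprintf returns. */
int front_code_entry(const char *prev, const char *word,
                     char *out, size_t out_size)
{
    size_t k = common_prefix(prev, word);
    return snprintf(out, out_size, "%zu%s", k, word + k);
}
```

Call it once per word with the previous word as prev, using "" for the first word. To implement the "restart" idea you would pass "" as prev again every K words.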

iterating string in C, word by word

I just started learning C. What I am trying to do right now is this: I have two strings in which the words are separated by white space, and I have to return the number of matching words in both strings. So, is there any function in C where I can take each word and compare it to every other word in another string? If not, any idea on how I can do that?
Break up the first string into words. You can do this in any number of ways, from looping through the character array and inserting \0 at each space to using strtok.
For each word found, go through the other string using strstr, which checks whether a string occurs in there. Just check the return value from strstr; if it is != NULL, it found it. Note that strstr also matches inside longer words (e.g. "cat" inside "category"), so check the characters around the match if you need whole words.
I'd not use strtok but stick with pointer arithmetic, length comparison, and memcmp to compare strings of equal length.
There are two problems here:
1) splitting each string into words
The strtok() function can split a string into words.
It is a meaningful exercise to imagine how you might write your own equivalent to strtok.
The rosetta project shows both a strtok and a custom method approach to precisely this problem.
I would naturally write my own parser, as it's the kind of code that appeals to me. It could be a fun exercise for you.
2) finding those words in one string that are also in another
If you iterate over each word in one string for each word in another, it has O(n*n) complexity.
If you index the words in one string it will take just O(n) which is substantially quicker (if your input is large enough to make this interesting). It is worth imagining how you might build a hashtable of the words in one string so that you can look for the words in the other.
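As a starting point, here is a sketch of the simple O(n*n) version, written without strtok so neither string is modified. The names contains_word and count_matching_words are made up, words are assumed to be separated by spaces, tabs or newlines, and a word that appears twice in the first string is counted twice.

```c
#include <stddef.h>
#include <string.h>

#define SPACES " \t\n"

/* Return 1 if the len characters at word occur as a whole word in text. */
static int contains_word(const char *text, const char *word, size_t len)
{
    const char *p = text;
    while (*p != '\0') {
        p += strspn(p, SPACES);          /* skip white space */
        size_t n = strcspn(p, SPACES);   /* length of the next word */
        if (n > 0 && n == len && memcmp(p, word, len) == 0)
            return 1;
        p += n;
    }
    return 0;
}

/* Count the words of s1 that also appear as whole words in s2. */
int count_matching_words(const char *s1, const char *s2)
{
    int count = 0;
    const char *p = s1;
    while (*p != '\0') {
        p += strspn(p, SPACES);
        size_t n = strcspn(p, SPACES);
        if (n > 0 && contains_word(s2, p, n))
            count++;
        p += n;
    }
    return count;
}
```

Replacing contains_word with a lookup in a hash table built from s2 is what would bring the cost down to roughly O(n).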

When would you use strings instead of characters?

When is it appropriate to use strings instead of characters? What about vice-versa?
Strings and characters represent fundamentally different concepts.
A character is a single, indivisible unit representing some sort of glyph. When working with a character, you are guaranteed to have a single character, no more or no less. Functions that work with characters are best suited for cases where you know this to be true. For example, if you were writing "Hangman" and wanted to process a user's guess, it would make sense for the function that processes the guess to take a character rather than a string, since you know for a fact that the input to that function should always be a single letter.
A string is a composite type formed by taking zero or more characters and putting them together. Strings are typically used to represent text, which can have an arbitrary length, including having zero length. Functions that work on strings are best suited for cases where the input is known to be made of letters, but it's unclear how many letters there are going to be.
One other option is to use a fixed-length array of characters, which is ideal for the situation where you know that you have exactly k characters for some k. This does not come up very much, but it's another option.
In short, use characters when you know that you need to work on a piece of text that is just one glyph long. Use strings when you don't know the length of the input in advance. Use fixed-sized arrays when you know for a fact that the input has some particular length.
