Word ranking efficiency - permutation

I am not sure how to solve this problem within the constraints.
Consider a "word" as any sequence of capital letters A-Z (not limited to just "dictionary words"). For any word with at least two different letters, there are other words composed of the same letters but in a different order (for instance, STATIONARILY/ANTIROYALIST, which happen to both be dictionary words; for our purposes "AAIILNORSTTY" is also a "word" composed of the same letters as these two).

We can then assign a number to every word, based on where it falls in an alphabetically sorted list of all words made up of the same set of letters. One way to do this would be to generate the entire list of words and find the desired one, but this would be slow if the word is long.

Write a program which takes a word as a command line argument and prints to standard output its number. Do not use the method above of generating the entire list. Your program should be able to accept any word 25 letters or less in length (possibly with some letters repeated), and should use no more than 1 GB of memory and take no more than 500 milliseconds to run. Any answer we check will fit in a 64-bit integer.
Sample words, with their rank:
ABAB = 2
AAAB = 1
BAAA = 4
QUESTION = 24572
BOOKKEEPER = 10743
examples:
AAAB - 1
AABA - 2
ABAA - 3
BAAA - 4
AABB - 1
ABAB - 2
ABBA - 3
BAAB - 4
BABA - 5
BBAA - 6
I thought about doing a binary search for the word over all the possible words built from its characters (ranks 1 through permutations(word)), but I think that would take too long; even O(log N) might be too slow.
I found this solution but I am a bit confused and need a bit of help understanding it:
Consider the n-letter word { x1, x2, ... , xn }. My solution is based on the idea that the word number will be the sum of two quantities:
The number of combinations starting with letters lower in the alphabet than x1, and
how far we are into the arrangements that start with x1.
The trick is that the second quantity happens to be the word number of the word { x2, ... , xn }. This suggests a recursive implementation.
Getting the first quantity is a little complicated:
Let uniqLowers = { u1, u2, ... , um } = all the unique letters lower than x1
For each uj, count the number of permutations starting with uj.
Add all those up.

The solution says that the answer consists of two numbers. Look at the following picture describing the words that can be made from the word QUESTION:
EINOQSTU (first word lexicographically, rank 1)
...
...
... (last word before Q, rank A)
QEINOSTU
....
....
QUESTION (our given word, rank x)
...
The phrase "how far we are into the arrangements that start with x1" is the quantity (x-A); call it B. The key point is that B is exactly equal to the word rank of "UESTION", which is our original word with the first letter cut off. This is the same question asked of a subset of our input, suggesting a recursive solution.
It then remains to find A, the number of permutations beginning with letters that come before Q. So A = the number of words beginning with one of {E, I, N, O}.


Minimal number of swaps and inserts/removes of digits to transform number to another number

So I have the following task:
Given two numbers N and K. We can perform
add digit at the end of the number.
remove digit from the end of the number.
swap two digits in a number.
I need to write a function that return the minimal number of operations to be performed to transform N to K.
f(1234, 4326) = 4 // 1234 (swap)-> 4231 (swap)-> 4321 (remove)-> 432 (insert)->4326
The task is given as homework to "Introduction to programming" course, so no dynamic programming (which was my solution), no data structures (as stacks, queues) are allowed.
The teacher said that there is a simple solution with no extra memory.
Any ideas?

Why does this C code output a data which makes an interesting wave-like histogram?

Suppose that I use dictionary.txt (a dictionary of words) from here, and this C code produces data like this:
#include <stdio.h>
int main ( void )
{
    FILE *dictionary = fopen ( "dictionary.txt", "r" );
    char entry[46];
    unsigned value = 0;
    while ( fscanf ( dictionary, "%s", entry ) != EOF ) {
        for ( int i = 0; entry[i]; ++i )
            value += entry[i];
        printf ( "%u\n", value );
        value = 0;
    }
    fclose ( dictionary );
    return 0;
}
What this code does is, for every word in the dictionary it produces a value which is the sum of all ASCII values of each letter.
After compiling, I make a data.txt that collects all values on the terminal:
$ ./dictionary > data.txt
And on MATLAB:
fileID = fopen ( 'data.txt', 'r' );
formatSpec = '%u';
A = fscanf ( fileID, formatSpec );
fclose(fileID);
X = min(A):max(A);
hist ( A, X )
Which outputs a histogram that looks like this:
What you are plotting is a histogram of added-up ASCII values for white-space-delimited strings in a text file. Going by the name, I assume it is a dictionary of English words.
English is mainly written in lower-case letters, which have ASCII codes from 97 (a) to 122 (z), on average about 110 (disregarding letter frequencies). The histogram shows peaks at distances of also about 110. The different peaks therefore correspond to words of different lengths. Discernible peaks seem to correspond to word lengths from 1 to about 21 letters (update: 4 to 12 letters), with the most common word length being 8 letters.
The shape of the single peaks is roughly that of a normal distribution, which can be explained by the fact that different letters are "randomly" selected from the range of about 'a' to 'z'. Though these selections are surely not independent and identically distributed, still an effect like that described by the central limit theorem seems to take place.
Update after the question was edited: The file is indeed a list of English words, one per line, all written in lower-case letters. Some further analysis:
The ASCII values of the letters of all words taken together have the following absolute frequencies in the dictionary:
All occurring values are between 97 (a) and 122 (z). The mean value is 107.5 and the standard deviation is 6.89.
The frequencies of word lengths in the dictionary are:
My interpretation above that the most common word length is 8 turns out to be correct, but the range of word lengths is actually 4 to 12. The updated histogram in the question is consistent with that.
Now, if we simulate 4-, 8- and 12-letter words by randomly drawing from the pool of all letters, and then plot ASCII-sum histograms, this is the result:
It demonstrates that words of different lengths lead to peaks not just of different means (here: 430, 860, and 1290) but also of different widths (here: standard deviation 13.78, 19.49, and 23.91) and correspondingly of different heights.
The picture you see in your histogram is therefore a combination of the word length histogram and the sum-of-ASCII-values for different word lengths.
Suppose the dictionary is all lower-case.
1-letter words have a sum in the range 97-122.
2-letter words have a sum in the range 194-244.
3-letter words have a sum in the range 291-366.
4-letter words have a sum in the range 388-488.
5-letter words have a sum in the range 485-610.
Only now do the ranges start to overlap, and even then only for words such as zzzz and aaaaa. The in-between values are fewer or non-existent, unless words begin with a capital letter or perhaps contain a hyphen. So the frequency peaks will be at multiples of about 110, as already said.
14-letter words have a sum in the range 1358-1708.
15-letter words have a sum in the range 1455-1830.
If each word length's histogram has a roughly normal distribution, you can see how this shows through in the combined histogram.

Algorithm to find specific char/letter in array of strings?

So I'm trying to come up with an algorithm to find words containing a specific char/letter in an array of strings.
If I want words with the vowel e, I would get the words apple and hello from the set below.
{apple, bird, hello}
To find words with a specific character, will I need to look through every character of each word in the array?
Is there a clever way maybe by sorting the list and then searching somehow?
Also, what would be the running time of this algorithm?
Will it be considered as O(n) or O(n*m)?
Where n is the number of words in the dictionary and m is the length of each word in the array.
In order to find words with a specific character you need to read that character at least once. Thus you must read each character of each word once, giving a runtime of O(n*m), where n is the number of words and m is the average word length. So yes, you need to look at each character of each word.
Now if you're going to be doing lots of queries for different letters, you can do a single pass over all words and map each character to the words it is a part of, i.e. apple goes into the sets for a, p, l, e. Then you would have 26 sets, each holding all words containing that character ('a': [apple], 'b': [bird], 'c': [], ... 'l': [apple, hello], ...). As the number of queries increases relative to the size of your word set, you end up with an amortized lookup time of O(1), though you still have O(n*m) initialization complexity.

C programming: function that keeps track of word lengths & counts?

I'm having trouble articulating what the function is supposed to do, so I think I will just show you guys an example. Say my program opens and scans a text file, which contains the following:
"The cat chased after the rooster to no avail."
Basically the function I'm trying to write is supposed to print out how many 1 letter words there are (if any), how many 2 letter words there are, how many 3 letter words, etc.
"
Length Count
2 2
3 3
5 2
6 1
7 1
"
Here's my attempt:
int word_length(FILE *fp, char file[80], int count) // count is how many total words there are; I already found this in main()
{
    printf("Length\n");
    int i = 0, j = 0;
    while(j < count)
    {
        for(i = 0; i < count; i++)
        {
            if(strlen(file[i] = i)
                printf("%d\n", i);
        } // I intended for the for loop to print the lengths
        ++i;
        printf("Count\n");
        while() // How do you print the counts in this case?
    }
}
I think the way I set up the loops causes words of the same length to be printed twice...so it'd look something like this, which is wrong. So how should I set up the loops?
"Length Count
2 1
2 2
"
This sounds like homework, so I will not write code for you, but I will give you some clues.
To hold several values you will need an array. The element with index i will contain the counter for words of length i.
Find a way to identify the boundaries of words (space, period, beginning of line, etc.). Then count the number of characters between boundaries.
Increase the relevant counter (see tip 1). Repeat.
Some details: you actually want to map one thing to another, namely the length of a word to the number of such words. For mapping there is a special data type, usually called a hash (table) or dictionary. But in your case an array can work perfectly well as a map, because your keys are uniform and contiguous (1, 2, ... up to some maximum word length).
You can't use a single int to count all of that. You need an array; at position 0 you keep track of how many 1-letter words there are, at position 1 you accumulate 2-letter words, and so on.

Generating Strings

I am creating a distributed password cracker which will use a brute-force technique, so I need every combination of strings.
For the sake of distribution, the server will give each client a range of strings, like from "aaaa" to "bxyz". I am assuming the string length will be four. So I need to check every string between these two bounds.
I am trying to generate these strings in C. I am trying to work out the logic for this but failing; I also searched on Google, but with no benefit. Any ideas?
EDIT
Sorry brothers, I would like to edit it.
I want combinations of strings within a range, let's say between aaaa and aazz; that would be strings like aaaa aaab aaac aaad ..... aazx aazy aazz. My character space is just upper- and lower-case English letters, which is 52 characters. I want to check every combination of 4 characters, but the server will distribute ranges of strings among its clients. My question was: if one client gets the range between aaaa and aazz, how will I generate the strings between just these bounds?
If your strings can contain anything from the ASCII table, you'll have an upper limit of 256 = 2^8 characters.
Since your strings are 4 characters long, you'll have 2^8 * 2^8 * 2^8 * 2^8,
or (2^8)^4 = 2^32 combinations.
Simply split the range of numbers and start the combinations in each machine.
You'll probably be interested in this: Calculating Nth permutation step?
Edit:
Considering your edit, your space of combinations would be 52^4 = 7,311,616 combinations.
Then you simply need to divide these "tasks" among the machines: 7,311,616 / n = r, with r being the number of permutations computed by each machine; the last machine may compute r + (7,311,616 % n) combinations.
Since you know the amount of combinations to build in each machine, you'll have to execute the following, in each machine:
function check_permutations(begin, end, chars) {
    for (i = begin; i < end; i++) {
        nth_perm = nth_permutation(chars, i);
        check_permutation(nth_perm); // your function of verification
    }
}
The function nth_permutation() is not hard to derive, and I'm quite sure you can get it in the link I've posted.
After this, you would simply start a process with such a function as check_permutations, giving the begin, end, and the vector of characters chars.
You can generate a tree containing all the permutations. E.g., like in this pseudocode:
strings(root, len)
    if (len == 0) return
    for (c = 'a' to 'z')
        root->next[c] = c
        strings(&root->next[c], len - 1)
Invoke by strings(root, 4).
After that you can traverse the tree to get all the permutations.
