How to Sort By using J - arrays

I'm used to the sort-by operation that many languages provide: it takes some comparator and sorts by it.
What I want to do is sort the following words first by length and then by alphabetical order. Help me please.
I didn't find anything about this in the Phrases or the Dictionary on jsoftware.com, apart from sorting and grading numerical values.
words=: >;:'CLOUD USB NETWORK LAN SERVER FIREWIRE CLIENT PEER'
] alpha=: a. {~ (i.26) + a.i.'A'
ABCDEFGHIJKLMNOPQRSTUVWXYZ
;/ words/: alpha i. words
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│CLIENT  │CLOUD   │FIREWIRE│LAN     │NETWORK │PEER    │SERVER  │USB     │
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
My first crazy idea is to shift each word to right boundary of the array, e.g.
ABC
DEFG
XY
Then, for whitespace, assign the extreme rank (in the y argument of the sorting primitive). And then shift each word back :D. It would be highly inefficient, but I can't see another J-way.
Update
Here is the Wolfram Language code for my problem:
StringSplit @ "CLOUD USB NETWORK LAN SERVER FIREWIRE CLIENT PEER"
~SortBy~
(Reverse @ ComposeList[{StringLength}, #] &)
If I want to prioritize longer words, I just prepend Minus @* to StringLength (i.e. Minus @* StringLength).
Basically my sorting order here is {{5, "CLOUD"}, {3, "USB"}, {7, "NETWORK"}, ...}.
I can make the same array in J using (,.~ #&.>) applied to boxed words, but how do I use the sorting primitives then? Maybe this is the right first step? I'm still not sure, but it sounds much better than my first guess :).

As requested, I have promoted answers suggested in the comments to the main body of the answer.
First of all, I don't think it is a good idea to apply > to the list of words, because that will add fill, which destroys information about the length of each word. So start with
words=: ;:'CLOUD USB NETWORK LAN SERVER FIREWIRE CLIENT PEER'
Then you need a function foo that turns each word into a value that will sort in the order you wish using /: or \:. Then
words /: foo words
will do the trick. Or, depending on whether you think hooks are pretty or ugly, you could express that as
(/: foo) words
My original suggestion for foo used #. to express each word as a single number:
foo=: (128&#.)@(a.&i.)@>
foo words
18145864388 1403330 345441182148939 1253582 2870553715410 39730390339053893 2322657797972 168911570
(/: foo) words
┌───┬───┬────┬─────┬──────┬──────┬───────┬────────┐
│LAN│USB│PEER│CLOUD│CLIENT│SERVER│NETWORK│FIREWIRE│
└───┴───┴────┴─────┴──────┴──────┴───────┴────────┘
In a comment, Danylo Dubinin pointed out that instead of encoding the word, we can simply catenate the word length to the front of the index vector and sort using that:
(/: (# , a.&i.)@>) words
┌───┬───┬────┬─────┬──────┬──────┬───────┬────────┐
│LAN│USB│PEER│CLOUD│CLIENT│SERVER│NETWORK│FIREWIRE│
└───┴───┴────┴─────┴──────┴──────┴───────┴────────┘
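For readers coming from languages with a built-in sort-by, the same length-then-alphabetical ordering can be sketched in Python with a composite key, mirroring the idea of prefixing each word's length as in the `(# , a.&i.)` variant:

```python
words = "CLOUD USB NETWORK LAN SERVER FIREWIRE CLIENT PEER".split()

# Sort by (length, word): tuples compare element by element,
# so length dominates and ties fall back to alphabetical order.
print(sorted(words, key=lambda w: (len(w), w)))
# For longest-first ordering, negate the length: key=lambda w: (-len(w), w)
```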

Permutations of a Word on a Page

Forgive me for the lack of official phrasing; this is a problem given orally in class, as opposed to being written in a problem set. Using the English alphabet with no spaces, commas, periods, etc (and thus only working with twenty-six letters possible), how many possible orderings are there of a string of fifty characters that contain the combination "Johndoe" at some location in the set?
Edit: was a little quick to answer, and overlooked something very obvious. Please see the new answer below
This is more suited for something like the math or stats Stack Exchange. Having said that, there are 26^(50-7)*(50-7+1) combinations. To see why, ask yourself: how many 50-letter strings over the 26 letters exist? Now, we reduce this set by adding the restriction that a 7-letter contiguous word must exist within any candidate string. This has the effect of "fixing" 7 letters and making them unable to vary. However, we can place this 7-letter string anywhere, and there are 44 positions to place it ("Johndoe" at position 0, "Johndoe" at position 1, all the way to position 43, since "Johndoe" will not fit starting at position 44).
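As a quick sanity check of the arithmetic (positions 0 through 43 make 50 − 7 + 1 = 44 placements; as the edit note above hints, this count still overcounts strings where the pattern fits in more than one place):

```python
# 44 possible starting positions for the fixed 7-letter block in a
# 50-character string; the remaining 43 characters each take any of 26 letters.
positions = 50 - 7 + 1      # 44
free = 26 ** (50 - 7)       # 26^43
print(positions * free)
```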

Caesar Cipher crack using C language

I am writing a program to decrypt text using the Caesar cipher algorithm.
So far my code works and gets all possible decrypted results, but I have to show just the correct one. How can I do this?
Below is the code to get all decrypted strings.
For my input, the answer should be "3 hello world".
#include <stdio.h>
#include <string.h>

int main(void)
{
    char input[] = "gourz#roohk";   /* "hello world" shifted by 3, stored reversed */
    for (int key = 1; key < 26; key++)
    {
        printf("%i ", key);
        /* the ciphertext is stored reversed, so walk it backwards */
        for (int i = strlen(input) - 1; i >= 0; i--)
        {
            printf("%c", input[i] - key);
        }
        printf("\n");
    }
    return 0;
}
Recall that a Caesar Cipher has only 25 possible shifts. Also, for text of non-trivial length, it's highly likely that only one shift will make the input make sense. One possible approach, then, is to see if the result of the shift makes sense; if it does, then it's probably the correct shift (e.g. compare words against a dictionary to see if they're "real" words; not sure if you've done web services yet, but there are free dictionary APIs available).
Consider the following text: 3 uryyb jbeyq. Some possible shifts of this:
3 gdkkn vnqkc (12)
3 xubbe mehbt (3)
3 hello world (13)
3 jgnnq yqtnf (15)
Etc.
As you can see, only the shift of 13 makes this text contain "real" words, so the correct shift is probably 13.
Another possible solution (albeit more complicated) is through frequency analysis (i.e. see if the resulting text has the same - or similar - statistical characteristics as English). For example, in English the most frequent letter is "e," so the correct shift will likely have "e" as the most frequent letter. By way of example, the first paragraph of this answer contains 48 instances of the letter "e", but if you shift it by 15 letters, it only has 8:
Gtrpaa iwpi p Rpthpg Rxewtg wph dcan 25 edhhxqat hwxuih. Pahd, udg
itmi du cdc-igxkxpa atcviw, xi'h wxvwan axztan iwpi dcan dct hwxui
lxaa bpzt iwt xceji bpzt htcht. Dct edhhxqat peegdprw, iwtc, xh id htt
xu iwt gthjai du iwt hwxui bpzth htcht; xu xi sdth, iwtc xi'h egdqpqan
iwt rdggtri hwxui (t.v. rdbepgt ldgsh pvpxchi p sxrixdcpgn id htt xu
iwtn'gt "gtpa" ldgsh; cdi hjgt xu ndj'kt sdct ltq htgkxrth nti, qji
iwtgt pgt ugtt sxrixdcpgn PEXh pkpxapqat).
The key word here is "likely" - it's not at all statistically certain (especially for shorter texts) and it's possible to write text that's resistant to that technique to some degree (e.g. through deliberate misspellings, lipograms, etc.). Note that I actually have an example of an exception above - "3 xubbe mehbt" has more instances of the letter "e" than "3 hello world" even though the second one is clearly the correct shift - so you probably want to apply several statistical tests to increase your confidence (especially for shorter texts).
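A minimal Python sketch of the brute-force-plus-scoring idea (the scoring rule here, counting letters from a rough "common English letters" set, is an illustrative assumption, not a tuned statistical test):

```python
# Rough set of frequent English letters used as a crude "looks English" score.
COMMON = set("etaoinshrdlu")

def shift(text, k):
    # Shift lowercase letters back by k; leave digits and spaces untouched.
    return "".join(
        chr((ord(c) - ord('a') - k) % 26 + ord('a')) if c.islower() else c
        for c in text)

def crack(ciphertext):
    # Try all 25 shifts and keep the one whose letters look most English.
    candidates = [(k, shift(ciphertext, k)) for k in range(1, 26)]
    return max(candidates, key=lambda kv: sum(c in COMMON for c in kv[1]))

print(crack("3 uryyb jbeyq"))
```

On this ROT13 example the crude score happens to pick shift 13, but as noted above a single statistic can tie or mislead on short texts, so several tests (or a dictionary check) are safer.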
Hello. A faster way to attack a Caesar cipher is the frequency-analysis attack: count how often each letter appears in your text and compare that against the most frequent letters in English, listed at this link
( https://www3.nd.edu/~busiforc/handouts/cryptography/letterfrequencies.html )
Then, by applying this table to the letters, you can get the text. Or use this link, which is Python code for letter frequency on GitHub: (https://github.com/tombusby/understanding-cryptography-exercises/blob/master/Chapter-01/ex1.2.py)
Brute force is the last-resort answer because it is more complex than frequency analysis. (Strictly, the 26! search space belongs to a general substitution cipher, where each letter you pin down shrinks the space by one; a plain Caesar cipher has only 25 possible shifts.)
If you want to use your code, you can make a file of the most popular strings in English and search it after every decryption, but that costs a lot of time, so letter frequency is better.

What is a good approach to check if an item is in a very big hashset?

I have a hashset that cannot be entirely loaded into memory. Let's say it has three parts, A, B, and C; each one can be loaded into memory, but not all at the same time.
I also have random entries coming in from time to time, and I can hardly tell which part each one could potentially belong to. So one approach would be to load A first and make a check, then B, then C. But the next entry could belong to B, so I have to unload C, then load A, then B... Hopefully this makes sense.
This would clearly be very slow, so I wonder: is there a better way to do it? (Using a database is not an option.)
I take it that you don't currently use any criterion to route a data entry to A or to B; in other words, A, B, and C are just the result of dividing the whole data into three equal parts. Am I right? If so, I recommend you add a criterion when adding a new entry to your set. For example, if your entries are numbers, put those that start with 0-3 into A, those that start with 4-6 into B, and those that start with 7-9 into C. When you search for something, you know a priori that you have to search in A, or in B, or in C. If your entries are words, the same solution applies, but now the criterion is the first letter; maybe here it is better to use not 3 sets but 26, the size of the English alphabet.
Please note that you still have to keep one of the sets in memory at a time. But you see the advantage: you do at most one load/unload operation, and you don't need to check all the sets, because you know which of them can actually contain your value. This idea is widely used in databases as partitioning. If your sets store neither numbers nor words but some complex objects, you can still invent a simple criterion.
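A minimal sketch of that first-letter partitioning in Python (the names are illustrative; in practice each partition would live in its own file and be loaded on demand, while here plain dicts stand in for the parts):

```python
from collections import defaultdict

def partition_key(word):
    # Criterion: first letter decides which part an entry belongs to.
    return word[0].lower()

# Build the partitioned "hashset"; on disk these would be separate files.
parts = defaultdict(set)
for w in ["apple", "banana", "avocado", "cherry"]:
    parts[partition_key(w)].add(w)

def contains(word):
    # Only the one relevant partition needs to be loaded and searched.
    return word in parts[partition_key(word)]

print(contains("banana"), contains("mango"))  # → True False
```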

Find the minimum number of needed transformations given input and target string

Given that I have an input string, for example: aab
And I am given a target string, for example: bababa
And then I am given a set of transformation rules. For example:
ab -> bba
b -> ba
How could I write, in C, an algorithm that finds the minimum number of transformations that need to be applied to the input string to get the target string?
In this example, for example, the number would be 3. Because we would do:
1 - Apply rule 1 (abba)
2 - Apply rule 1 again (bbaba)
3 - Apply rule 2 (bababa)
It could happen that given an input and a target, there is no solution and that should be noticed too.
I am pretty much lost on strategies for doing this. Creating an automaton comes to mind, but I am not sure how I would apply it in this situation. I think this is an interesting problem and I have been researching online, but all I can find is how to apply transformations given rules, not how to ensure the count is a minimum.
EDIT: As one of the answers suggested, we could do a graph starting from the initial string and create nodes that are the result of applying transformations to the previous node. However, this brings some problems, from my point of view:
Imagine that I have a transformation that looks like this a --> ab. And my initial string is 'a'. And my output string is 'c'. So, I keep doing transformations (growing the graph) like this:
a -> ab
ab -> abb
abb -> abbb
...
How would I know when I need to stop building the graph?
Say I have the following string aaaa, and I have a transformation rule like aa->b. How would I create the new nodes? I mean, how would I find the substrings in that input string and remember them?
I don't think there is an efficient solution for this. I think you have to do breadth-first search. By doing that, you will know, as soon as you have a solution, that it is a shortest solution.
EDIT:
Image: modify string breadth first
Every layer is made from the previous one by applying every possible rule to every possible substring. For example, the b->ba rule can be applied to abba once for each b. It is important to apply only a single rule and then remember the resulting string (e.g. ababa and abbaa) in a list. You have to have each layer completely in a list in your program before you start the next layer (= breadth first).
EDIT 2:
You write that you now have an output c. For this you obviously need a rule XX->c. So say you have the rule aaa->c. Now in layer 2 you will have the string aaa, which came from some a->aa rules. You will then apply a->aa again and get aaaa; that is OK. Since you go breadth first, you will THEN apply the aaa->c rule to aaa, and layer 3 now consists of aaaa, c, and others. You do not continue modifying aaaa, because that would go to layer 4; you already found the target c in layer 3, so you can stop.
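The layer-by-layer search described here can be sketched in Python as a plain BFS with a visited set; the length cap is an assumption added to keep unreachable targets (like the a -> ab / 'c' example) from looping forever when no rule shrinks the string:

```python
from collections import deque

def min_transformations(start, target, rules):
    """Minimum number of rule applications to turn start into target,
    or None if the target is not found within the length bound."""
    # Heuristic cap: stop expanding strings much longer than the target.
    limit = max(len(target), len(start)) * 2
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        s, depth = queue.popleft()
        if s == target:
            return depth
        for lhs, rhs in rules:
            # Apply each rule at every position where its left side occurs.
            i = s.find(lhs)
            while i != -1:
                nxt = s[:i] + rhs + s[i + len(lhs):]
                if nxt not in seen and len(nxt) <= limit:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
                i = s.find(lhs, i + 1)
    return None

print(min_transformations("aab", "bababa", [("ab", "bba"), ("b", "ba")]))  # → 3
```

Because both rules here grow the string by exactly one character, the answer must be len("bababa") − len("aab") = 3 applications, which the BFS confirms.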
EDIT 3:
You now ask whether you can decide, for an arbitrary set of rules, when to stop layering. In general it is impossible; this is related to the Halting problem: https://en.wikipedia.org/wiki/Halting_problem
BUT for specific rule sets you can tell whether you can ever reach the output:
Example 1: if the target contains an atom that no rule can produce (your 'c' example), you can stop immediately.
Example 2: if your rules all either increase the string's length or keep it the same (no rules that decrease the length), you can stop as soon as every string in a layer is longer than the target.
Example 3: you can drop certain rules if you find by algorithm that they are cyclic.
Other examples exist.

How do spell checkers work? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers. Closed last year.
I need to implement a spell checker in C. Basically, I need all the standard operations... I need to be able to spell check a block of text, make word suggestions and dynamically add new words to the index.
I'd kind of like to write this myself, though I really don't know where to begin.
Read up on Tree Traversal. The basic concept is as follows:
Read a dictionary file into memory (this file contains the entire list of correctly spelled words that are possible/common for a given language). You can download free dictionary files online, such as Oracle's example dictionary.
Parse this dictionary file into a search tree to make the actual text search as efficient as possible. I won't describe all of the dirty details of this type of tree structure, but the tree will be made up of nodes which have (up to) 26 links to child nodes (one for each letter), plus a flag to indicate whether or not the current node is the end of a valid word.
Loop through all of the words in your document, and check each one against the search tree. If you reach a node in the tree where the next letter in the word is not a valid child of the current node, the word is not in the dictionary. Also, if you reach the end of your word, and the "valid end of word" flag is not set on that node, the word is not in the dictionary.
If a word is not found in the dictionary, inform the user. At this stage, you can also suggest alternate spellings, but that gets a tad more complicated. You will have to loop through each character in the word, substituting alternate characters and test each of them against the search tree. There are probably more efficient algorithms for finding the recommended words, but I don't know what they are.
A really short example:
Dictionary: apex apple appoint appointed
Tree: (* indicates valid end of word)
A -> P -> E -> X*
     \-> P -> L -> E*
          \-> O -> I -> N -> T* -> E -> D*
update: Thank you to Curt Sampson for pointing out that this data structure is called a Patricia Tree
Document: apple appint ape
Results:
"apple" will be found in the tree, so it is considered correct.
"appint" will be flagged as incorrect. Traversing the tree, you will follow A -> P -> P, but the second P does not have an I child node, so the search fails.
"ape" will also fail, since the E node in A -> P -> E does not have the "valid end of word" flag set.
edit: For more details on spelling suggestions, look into Levenshtein Distance, which measures the smallest number of changes that must be made to convert one string into another. The best suggestions would be the dictionary words with the smallest Levenshtein Distance to the incorrectly spelled word.
Given you don't know where to begin, I'd suggest using an existing solution. See, for example, aspell
(LGPL licensed). If you really have to implement it yourself, please tell us why.
One should look at prefixes and suffixes.
suddenly = sudden + ly.
By removing the "ly"s you can get away with storing just the root word.
Likewise preallocate = pre + allocate.
And lovingly = love + ing + ly
gets a bit more complex, as the English rules for "ing" get invoked.
There is also the possibility of using some sort of hashing function to map a root word
to a specific bit in a large bit map, as a constant-time method of determining if the root word is spelled correctly.
You can get even more complex by trying to provide an alternate list of possible correct spellings to a misspelled word. You might research the soundex algorithm to get some ideas.
I would advise prototyping with a small set of words. Do a lot of testing, then scale up.
It is a wonderful educational problem.
Splitting a word into root and suffix is known as stemming; the classic method is the "Porter Stemming Algorithm". It's a good way of fitting an English dictionary into an amazingly small amount of memory.
It's also useful for search, so "spell checker" will also find "spelling check" and "spell checking".
I've done this in class
You should consider the Python Natural Language Toolkit, NLTK, which is made specifically to handle this.
It also allows you to create text interpreters such as chatbots.
The OpenOffice spell checker Hunspell can be a good starting point. Here is the homepage:
Hunspell at SourceForge
E James gives a great answer for how to tell if a word is valid. How likely misspellings are ranked probably depends on the spell checker.
One such method, and the one I would use, is Levenshtein string similarity, which looks at how many letters must be added, removed, or swapped in a word in order to make another word.
Say you spelled Country as Contry. The Levenshtein distance would be 1, since you have to add only 1 letter to transform contry into country.
You could then loop through all possible correct spellings of words (there are only about 171,000 English words, and 3,000 of those account for 95% of text), determine those with the lowest Levenshtein distance, and then return the top X words that are most similar to the misspelled word.
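A standard two-row dynamic-programming sketch of the distance described above:

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("contry", "country"))  # → 1 (one inserted letter)
```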
There's a great Python package called FuzzyWuzzy which implements this efficiently and generates a percentage similarity between two words or sentences based on this formula.