I was thinking of creating a chatbot using something like Markov chains, but I'm not entirely sure how to get it to work. From what I understand, you build a table from training data that maps a given word to the words which follow it. Is it possible to attach some sort of probability or counter while training the bot? Is that even a good idea?
The second part of the problem is keywords. Assuming I can already identify keywords from user input, how do I generate a sentence which uses that keyword? I don't always want to start the sentence with the keyword, so how do I seed the Markov chain?
I made a Markov chain chatbot for IRC in Python a few years back and can shed some light on how I did it. The text generated does not necessarily make any sense, but it can be really fun to read. Let's break it down into steps. Assume you have a fixed input, a text file (you can use chat text, lyrics, or just your imagination).
Loop through the text and build a dictionary (a key-value container), putting every pair of words in as a key with the word that follows as the value.
For example: if you have the text "a b c a b k" you start with "a b" as key and "c" as value, then "b c" as key and "a" as value, and so on. The value should be a list or any collection holding 0..many items, as you can have more than one value for a given pair of words. In the example above "a b" occurs twice, followed first by "c" and then at the end by "k". So in the end you will have a dictionary/hash looking like this: {'a b': ['c','k'], 'b c': ['a'], 'c a': ['b']}
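A minimal Python sketch of this training step (the function name and layout are mine, not the original bot's code):

    # Map each pair of words to the list of words seen following that pair.
    def train(text):
        words = text.split()
        table = {}
        for i in range(len(words) - 2):
            key = words[i] + ' ' + words[i + 1]
            table.setdefault(key, []).append(words[i + 2])
        return table

    print(train('a b c a b k'))
    # {'a b': ['c', 'k'], 'b c': ['a'], 'c a': ['b']}

Note that because the value is a plain list, a word that follows a pair twice appears in the list twice, so a random choice from the list is already frequency-weighted. That is one simple answer to the "probability or counter" part of the question.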
Now you have the structure you need for building your funky text. You can choose to start with a random key or from a fixed place. Given the structure we have, we can start by saving "a b", then randomly taking a following word from its values, "c" or "k". Say "k" is chosen: the saved text is now "a b k". Then you move one step to the right, which in our case is "b k", and save a random value for that pair if there is one; in our case there is none, so you break out of the loop (or you can decide other things, like starting over again). When the loop is done you print your saved text string.
The bigger the input, the more values you will have for your keys (pairs of words), giving a "smarter" bot, so you can "train" your bot by adding more text (perhaps chat input?). If you have a book as input, you can construct some nice random sentences. Please note that you don't have to take only one word that follows a pair as a value; you can take 2 or 10. The difference is that your text will appear more accurate if you use longer building blocks. Start with a pair as a key and the following word as a value.
So you see that you basically have two steps: first build a structure, then randomly choose a key to start with, print a random value of that key, and continue until you have no value or some other condition is met. If you want, you can "seed" a pair of words from chat input that exists in your key-value structure to get a start. It's up to your imagination how to start your chain.
Example with real words:
"hi my name is Al and i live in a box that i like very much and i can live in there as long as i want"
"hi my" -> ["name"]
"my name" -> ["is"]
"name is" -> ["Al"]
"is Al" -> ["and"]
........
"and i" -> ["live", "can"]
........
"i can" -> ["live"]
......
Now construct a loop:
Pick a random key, say "hi my", and randomly choose a value; there is only one here, so it's "name"
(SAVING "hi my name").
Now move one step to the right taking "my name" as the next key and pick a random value... "is"
(SAVING "hi my name is").
Now move and take "name is" ... "Al"
(SAVING "hi my name is AL").
Now take "is Al" ... "and"
(SAVING "hi my name is Al and").
...
When you come to "and i" you will randomly choose a value, lets say "can", then the word "i can" is made etc... when you come to your stop condition or that you have no values print the constructed string in our case:
"hi my name is Al and i can live in there as long as i want"
If a key has more values, you can jump to different keys. The more values you have, the more combinations there are, and the more random and fun the text will be.
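A sketch of this generation loop in Python, using the sentence above. The optional seed parameter shows one possible way to seed the chain from a keyword (pick a start key containing it); that part is my assumption about how you might do the seeding, not the only way:

    import random

    text = ('hi my name is Al and i live in a box that i like '
            'very much and i can live in there as long as i want')
    words = text.split()
    table = {}
    for i in range(len(words) - 2):
        table.setdefault(words[i] + ' ' + words[i + 1], []).append(words[i + 2])

    def generate(table, seed=None):
        # Start from a key containing the keyword if one is given and found;
        # otherwise fall back to a completely random key.
        keys = [k for k in table if seed and seed in k.split()] or list(table)
        key = random.choice(keys)
        out = key.split()
        while key in table and len(out) < 50:   # length cap as an extra stop condition
            out.append(random.choice(table[key]))
            key = out[-2] + ' ' + out[-1]
        return ' '.join(out)

    print(generate(table))
    print(generate(table, seed='live'))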
The bot chooses a random word from your input and generates a response by choosing another random word that has been seen to be a successor to its held word. It then repeats the process by finding a successor to that word in turn and carrying on iteratively until it thinks it’s said enough. It reaches that conclusion by stopping at a word that was prior to a punctuation mark in the training text. It then returns to input mode again to let you respond, and so on.
It isn't very realistic, but I hereby challenge anyone to do better in 71 lines of code! This is a great challenge for any budding Pythonists, and I just wish I could open the challenge to a wider audience than the small number of visitors I get to this blog. To code a bot that is always guaranteed to be grammatical must surely take several hundred lines; I simplified hugely by trying to think of the simplest rule that gives the computer a mere stab at having something to say.
Its responses are rather impressionistic, to say the least! Also, you have to put what you say in single quotes.
I used War and Peace for my "corpus", which took a couple of hours for the training run; use a shorter file if you are impatient…
Here is the trainer:
#lukebot-trainer.py
import pickle

b = open('war&peace.txt')
text = []
for line in b:
    for word in line.split():
        text.append(word)
b.close()

textset = list(set(text))
follow = {}
for l in range(len(textset)):
    working = []
    check = textset[l]
    for w in range(len(text) - 1):
        if check == text[w] and text[w][-1] not in '(),.?!':
            working.append(str(text[w + 1]))
    follow[check] = working

a = open('lexicon-luke', 'wb')
pickle.dump(follow, a, 2)
a.close()
Here is the bot:
#lukebot.py
import pickle, random

a = open('lexicon-luke', 'rb')
successorlist = pickle.load(a)
a.close()

def nextword(a):
    if a in successorlist:
        return random.choice(successorlist[a])
    else:
        return 'the'

speech = ''
while speech != 'quit':
    speech = raw_input('>')
    s = random.choice(speech.split())
    response = ''
    while True:
        neword = nextword(s)
        response += ' ' + neword
        s = neword
        if neword[-1] in ',?!.':
            break
    print response
You tend to get an uncanny feeling when it says something that seems partially to make sense.
You could do it like this:
Make an order-1 Markov chain generator, using words and not letters.
Every time someone posts something, what they posted is added to the bot's database.
The bot would also record when it joined the chat and how long the first poster waited before posting (in multiples of 10 seconds), and then how long that same person waited before posting again (in multiples of 10 seconds).
This second part would be used to decide when the bot posts: it joins the chat and waits an amount of time drawn from a table of "how many 10-second intervals passed before a person posted after joining the chat", then keeps posting using a similar table of "how long a person took to think about and write a reply after a post they had spent X seconds on".
Related
I'm trying to turn a returned string into an array with certain characters taken out so I can add it into a vertical list.
Input:
"\n\n1. Separate clothes into whites, lights and darks.\n2. Fill washing machine with cold water and select the appropriate cycle for the type of clothing.\n3. Add the appropriate laundry detergent, fabric softener and other laundry additives.\n4. Place clothes in the washing machine and press start."
Note that the first one has two line breaks before the number (\n\n1.) and the rest have only one line break before the number (\n2.)
My end goal is to have some type of regex, or replace occurrences, that deletes every numbered step, so that it returns ["Separate clothes into whites, lights and darks.", "Fill washing machine with cold water and select the appropriate cycle for the type of clothing.", etc etc], which I can then put in a list. My problem is that my regex or syntax isn't working; the one I use is "\n([+-]?(?=.\d|\d)(?:\d+)?(?:.?\d*))(?:eE)? "
I've gotten it to work perfectly in Python but am new to swift so please have mercy.
Thanks in advance!
One way is to use split to divide the text into an array. Unfortunately I couldn't get it to work using only split, so I first used replacingOccurrences(of:with:options:) to modify the values to split on:
let rows = text.replacingOccurrences(of: "\n\\d\\. ", with: "\n", options: [.regularExpression])
.split(separator: "\n")
Maybe someone can make the regex work directly in split.
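Since the question mentions already having this working in Python, here is roughly what that version might look like for comparison (the exact pattern is my guess, not the asker's code):

    import re

    s = ("\n\n1. Separate clothes into whites, lights and darks."
         "\n2. Fill washing machine with cold water and select the appropriate cycle for the type of clothing."
         "\n3. Add the appropriate laundry detergent, fabric softener and other laundry additives."
         "\n4. Place clothes in the washing machine and press start.")

    # Split on "one or more newlines, then digits and a dot", then drop the
    # empty leading piece produced by the match at the start of the string.
    steps = [part for part in re.split(r'\n+\d+\.\s*', s) if part]
    print(steps)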
I am writing a program to decrypt text using the Caesar cipher algorithm.
So far my code works and gets all possible decrypted results, but I have to show just the correct one. How can I do this?
Below is the code to get all decrypted strings.
For my code, the answer should be "3 hello world".
#include <stdio.h>
#include <string.h>

int main(void)
{
    char input[] = "gourz#roohk";
    for (int key = 1; key < 26; key++)
    {
        printf("%i ", key);                /* show which shift is being tried */
        for (int i = strlen(input) - 1; i >= 0; i--)
        {
            printf("%c", input[i] - key);  /* undo the shift, reading right to left */
        }
        printf("\n");
    }
    return 0;
}
Recall that a Caesar Cipher has only 25 possible shifts. Also, for text of non-trivial length, it's highly likely that only one shift will make the input make sense. One possible approach, then, is to see if the result of the shift makes sense; if it does, then it's probably the correct shift (e.g. compare words against a dictionary to see if they're "real" words; not sure if you've done web services yet, but there are free dictionary APIs available).
Consider the following text: 3 uryyb jbeyq. Some possible shifts of this:
3 gdkkn vnqkc (12)
3 xubbe mehbt (3)
3 hello world (13)
3 jgnnq yqtnf (15)
Etc.
As you can see, only the shift of 13 makes this text contain "real" words, so the correct shift is probably 13.
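A rough Python sketch of this dictionary check; the tiny word set and the shift helper are mine, standing in for a real dictionary (or dictionary API):

    import string

    # Try every shift and score each candidate by how many of its words
    # appear in a word list (a toy stand-in for a real dictionary).
    WORDS = {'hello', 'world', 'the', 'and', 'you', 'see'}

    def shift(text, k):
        return ''.join(
            chr((ord(c) - ord('a') + k) % 26 + ord('a'))
            if c in string.ascii_lowercase else c
            for c in text)

    ciphertext = '3 uryyb jbeyq'
    best = max(range(26),
               key=lambda k: sum(w in WORDS for w in shift(ciphertext, k).split()))
    print(best, shift(ciphertext, best))   # prints: 13 3 hello world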
Another possible solution (albeit more complicated) is through frequency analysis (i.e. see if the resulting text has the same - or similar - statistical characteristics as English). For example, in English the most frequent letter is "e," so the correct shift will likely have "e" as the most frequent letter. By way of example, the first paragraph of this answer contains 48 instances of the letter "e", but if you shift it by 15 letters, it only has 8:
Gtrpaa iwpi p Rpthpg Rxewtg wph dcan 25 edhhxqat hwxuih. Pahd, udg
itmi du cdc-igxkxpa atcviw, xi'h wxvwan axztan iwpi dcan dct hwxui
lxaa bpzt iwt xceji bpzt htcht. Dct edhhxqat peegdprw, iwtc, xh id htt
xu iwt gthjai du iwt hwxui bpzth htcht; xu xi sdth, iwtc xi'h egdqpqan
iwt rdggtri hwxui (t.v. rdbepgt ldgsh pvpxchi p sxrixdcpgn id htt xu
iwtn'gt "gtpa" ldgsh; cdi hjgt xu ndj'kt sdct ltq htgkxrth nti, qji
iwtgt pgt ugtt sxrixdcpgn PEXh pkpxapqat).
The key word here is "likely" - it's not at all statistically certain (especially for shorter texts) and it's possible to write text that's resistant to that technique to some degree (e.g. through deliberate misspellings, lipograms, etc.). Note that I actually have an example of an exception above - "3 xubbe mehbt" has more instances of the letter "e" than "3 hello world" even though the second one is clearly the correct shift - so you probably want to apply several statistical tests to increase your confidence (especially for shorter texts).
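A sketch of the frequency-analysis idea in Python: assume the most frequent letter of the ciphertext corresponds to plaintext "e" and derive the shift from that, with all the caveats about short texts noted above:

    from collections import Counter

    def shift(text, k):
        # Shift every lowercase letter by k, leaving everything else alone.
        return ''.join(chr((ord(c) - ord('a') + k) % 26 + ord('a'))
                       if c.islower() else c for c in text.lower())

    def guess_shift(ciphertext):
        # Assume the most common ciphertext letter maps to plaintext 'e'.
        letters = [c for c in ciphertext.lower() if c.isalpha()]
        top = Counter(letters).most_common(1)[0][0]
        return (ord(top) - ord('e')) % 26

    sample = 'the quick brown fox jumps over the lazy dog every evening'
    print(guess_shift(shift(sample, 15)))   # likely 15, since 'e' dominates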
Hello. A faster way to attack a Caesar cipher is the frequency analysis attack, where you count how often each letter appears in your text and compare those counts against the most common letters in English, listed at this link:
( https://www3.nd.edu/~busiforc/handouts/cryptography/letterfrequencies.html )
Then, by applying this table to the letters, you can get the text back, or use this link to a Python letter-frequency example on GitHub (https://github.com/tombusby/understanding-cryptography-exercises/blob/master/Chapter-01/ex1.2.py).
Brute force is the last-resort answer because it has higher complexity than frequency analysis. For a general substitution cipher the brute-force key space is 26!, which means that each letter you recover shrinks the remaining search space by one factor.
If you want to keep your current code, you can make a file of the most popular strings in English and search it every time you decrypt, but that costs a lot of time, so letter frequency is better.
I've got a database holding musical works, and an example of a title could be "I See A Soul".
As it is right now, I'm indexing musical works using a field which is configured with a LengthFilterFactory to filter out words less than 2 characters and more than 255 characters. This, of course, filters out "I" and "A" in "I See A Soul" so the resulting indexed document would hold the title "See Soul". Consequently, this doesn't produce the desired results as users can't search for "I See A Soul". So, I'm removing the LengthFilterFactory.
However, I'm curious: in what situations would it be a good idea to strip out words of certain lengths?
The point is that you can apply the same filter to the query too.
So if the user searches for "I see a soul" or "see a soul" or "u see w soul", they will still find the same result.
Another idea: if you have a requirement that does not allow the user to search until they have typed at least 3 letters (like an autocomplete function, for example), you may not want to index words shorter than 3 letters, as they will never be searched against anyway.
I want a quick and dirty way of determining what language the user is writing in. I know that there is a Google API which will detect the difference between French and Spanish (even though they both use mostly the same alphabet), but I don't want the latency. Essentially, with the Latin alphabet there is a lot of ambiguity as to which language is being used. Other alphabets don't have this problem. For example, if there is a character using hiragana (part of the Japanese writing system), there is no confusion as to the language. Therefore, I don't need to ask Google.
Therefore, I would like to be able to do something simple like say that שלום uses the Hebrew alphabet and こんにちは uses Japanese characters. How do I get that alphabet string?
"Bonjour", "Hello", etc. should return "Latin" or "English" (Then I'll ask Google for the real language). "こんにちは" should return "Hiragana" or "Japanese". "שלום" should return "Hebrew".
I'd suggest looking at the Unicode "Script" property. The latest database can be found here.
For a quick and dirty implementation, I'd try scanning all of the characters in the target text and looking up the script name for each one. Pick whichever script has the most characters.
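Python's standard library doesn't expose the Script property directly, but as a quick and dirty approximation the first word of each character's Unicode name is usually the script, which is enough for this purpose. A sketch of the majority-vote idea:

    import unicodedata
    from collections import Counter

    def guess_script(text):
        # Count the first word of each alphabetic character's Unicode name
        # (e.g. 'HIRAGANA LETTER KO' -> 'HIRAGANA') and pick the majority.
        counts = Counter()
        for ch in text:
            if ch.isalpha():
                name = unicodedata.name(ch, '')
                if name:
                    counts[name.split(' ')[0]] += 1
        return counts.most_common(1)[0][0] if counts else None

    print(guess_script('Bonjour'))     # LATIN
    print(guess_script('こんにちは'))  # HIRAGANA
    print(guess_script('שלום'))        # HEBREW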
Use an N-gram model and give it a sufficiently large set of training data. A full example describing this technique is to be found at this page, among others:
http://phpir.com/language-detection-with-n-grams/
Although the article assumes you are implementing in PHP, and by "language" it means something like English, Italian, etc., the description may be implemented in C if you require this. Instead of using "language" as in English etc. for the training, just use your notion of "alphabet". For example, look at all of your "Latin alphabet" strings together and consider their n-grams for n=2:
Bonjour: "Bo", "on", "nj", "jo", "ou", "ur"
Hello: "He", "el", "ll", "lo"
With enough training data, you will discover dominant combinations that are likely for all Latin text, for example, perhaps "Bo" and "el" are quite probable for text written in the "Latin alphabet". Likewise, these combinations are probably quite rare in text that is written in the "Hiragana alphabet". Similar discoveries will be made with any other alphabet classification for which you can provide sufficient training data.
This technique is closely related to Markov chains (an n-gram model is essentially an order n-1 Markov chain), so searching for those keywords will give more ideas for implementation. For "quick and dirty" I would use n=2 and gather just enough training data such that the least common letter from each alphabet is encountered at least once, e.g. at least one 'z' and at least one 'ぅ' (small hiragana u).
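A minimal Python sketch of the n=2 version: build a set of bigrams per alphabet from training strings, then classify input by overlap. The tiny training sets here are placeholders; real training data would be much larger:

    def bigrams(s):
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    # Toy per-alphabet profiles built from a few training strings each.
    profiles = {
        'Latin':    bigrams('bonjour hello world guten tag'),
        'Hiragana': bigrams('こんにちは ありがとう さようなら'),
    }

    def classify(text):
        grams = bigrams(text)
        return max(profiles, key=lambda name: len(grams & profiles[name]))

    print(classify('yellow'))      # Latin
    print(classify('こんばんは'))  # Hiragana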
EDIT:
For a simpler solution than N-grams, use only basic statistical tests -- min, max and average -- to compare your Input (a string given by the user) with an Alphabet (a string of all characters in one of the alphabets you are interested in).
Step 1. Place all the numerical values of the Alphabet (e.g. utf8 codes) in an array. For example, if the Alphabet to be tested against is "Basic Latin", make an array DEF := {32, 33, 34, ..., 122}.
Step 2. Place all the numerical values of the Input into an array, for example, make an array INP := {73, 102, 32, ...}.
Step 3. Calculate a score for the input based on INP and DEF. If INP really comes from the same alphabet as DEF, then I would expect the following statements to be true:
min(INP) >= min(DEF)
max(INP) <= max(DEF)
avg(INP) - avg(DEF) < EPS, where EPS is a suitable constant
If all statements are true, the score should be close to 1.0. If all are false, the score should be close to 0.0. After this "Score" routine is defined, all that's left is to repeat it for each alphabet you are interested in and choose the one which gives the highest score for a given Input.
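A sketch of that Score routine in Python; the EPS value and the equal weighting of the three tests are arbitrary choices for illustration:

    # Three crude statistical tests, equally weighted; EPS is arbitrary.
    def score(inp, alphabet, eps=15.0):
        INP = [ord(c) for c in inp]
        DEF = [ord(c) for c in alphabet]
        tests = [
            min(INP) >= min(DEF),
            max(INP) <= max(DEF),
            abs(sum(INP) / len(INP) - sum(DEF) / len(DEF)) < eps,
        ]
        return sum(tests) / len(tests)

    basic_latin = ''.join(chr(i) for i in range(32, 123))        # the DEF example above
    hebrew = ''.join(chr(i) for i in range(0x05D0, 0x05EB))      # alef..tav
    print(score('Hello', basic_latin), score('Hello', hebrew))   # higher for Latin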
I'm storing a list of keywords that have been used throughout all searches on a site, and I'm getting a lot of random strings in the keywords field. Here's a sample of the data that I'm getting back:
fRNPRXiPtjDrfTDKH
boom
Mule deer
gVXOFEzRWi
cbFXZcCoSiKcmrvs
Owner Financed ,owner Financed
I'm trying to find a way in SQL or ColdFusion to figure out if something has valid English words, or if it's a random set of characters. I've tried doing some digging into n-gram analysis, but can't seem to come up with any useful solutions that I can run directly on my servers.
UPDATE: The code is now on jsFiddle: http://jsfiddle.net/ybanrab/s6Bs5/1/. It may be interesting to paste in a page of news copy as the corpus and then paste in your test data.
I'd suggest trying to analyse the probabilities of the individual characters following each other. Below is an example I've written in JavaScript, but it ought to translate to T-SQL or ColdFusion pretty easily.
The idea is that you feed in good phrases (the corpus) and analyse the frequency of letters following other letters. If you feed it "this thin the" you'll get something like this:
{
t:{h:3},
h:{i:2,e:1},
i:{s:1,n:1},
s:{},
n:{}
}
You'll get most accuracy by feeding in hand-picked known good inputs from the data you're analysing, but you may also get good results by feeding in plain english. In the example below I'm computing this, but you can obviously store this once you're happy with it.
You then run the sample string against the probabilities to give it a score. This version ignores case, word starting letter, length etc, but you could use them as well if you want.
You then just need to decide on a threshold score and filter like that.
I'm fairly sure this kind of analysis has a name, but my google-fu is weak today.
You can paste the code below into a script block to get an idea of how well (or not) it works.
var corpus=["boom","Mule Deer", "Owner Financed ,owner Financed", "This is a valid String","The quick brown fox jumped over the lazy dog"];
var probs={};
var previous=undefined;
//Compute the probability of one letter following another
corpus.forEach(function(phrase){
    phrase.split(" ").forEach(function(word){
        word.toLowerCase().split("").forEach(function(chr){
            //set up an entry in the probabilities table
            if(!probs[chr]){
                probs[chr]={};
            }
            //If this isn't the first letter in the word, record this letter as following the previous one
            if(previous){
                if(!probs[previous][chr]){
                    probs[previous][chr]=0;
                }
                probs[previous][chr]++;
            }
            //keep track of the previous character
            previous=chr;
        });
        //reset previous as we're moving onto a different word
        previous=undefined;
    })
});

function calculateProbability(suspect){
    var score=0;
    var previous=undefined;
    suspect.toLowerCase().split("").forEach(function(chr){
        if(previous && probs[previous] && probs[previous][chr]){
            //Add the score if there is one, otherwise zero
            score+=probs[previous][chr];
        }
        previous=chr;
    });
    return score/suspect.length;
}

console.log(calculateProbability("boom"));
console.log(calculateProbability("Mood"));
console.log(calculateProbability("Broom"));
console.log(calculateProbability("sajkdkas dak"));
The best thing to do is to check your words against frequency lists: dictionaries won't work because they don't contain grammatical inflections, proper nouns, compounds, and a whole load of other stuff that's valid.
The problem with naive checking against n-gram data is that there is a lot of noise in the lower-frequency words. The easiest thing to do, which should give you the correct answer in the overwhelming majority of cases, is to truncate a list of frequency-counted words from somewhere suitably large (Google n-grams, Wikipedia, etc.) at the top 50,000 or 100,000 words. Adjust the threshold as appropriate to get the results you're looking for, but then you can just check if any/all of your query terms appear in this list.
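A sketch of that check in Python; 'wordlist.txt' is a placeholder for whatever frequency-sorted list you truncate (one word per line, most frequent first), and the 50% threshold is an arbitrary choice:

    from itertools import islice

    # Keep only the top 50,000 entries of a frequency-sorted word list.
    with open('wordlist.txt') as f:
        common = {line.strip().lower() for line in islice(f, 50000)}

    def looks_like_words(query, threshold=0.5):
        terms = query.lower().split()
        return bool(terms) and sum(t in common for t in terms) / len(terms) >= threshold

    print(looks_like_words('Mule deer'))           # True, if both words make the list
    print(looks_like_words('fRNPRXiPtjDrfTDKH'))   # False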
If you want to know if the query is grammatical, or sensible as a unit rather than its constituent parts, that's a whole other question of course.
There are some non-dictionary-words that can be valid searches (e.g. gethostbyname is a valid and meaningful search here on SO, but not a dictionary word). On the other hand, there are dictionary words that have absolutely nothing to do with your website.
Instead of trying to guess what is a word and what isn't, you could simply check if the search query produced a non-empty result. Those with empty results must be completely off-topic or gibberish.
It sounds like you are looking for a Bayesian Filter.