How do I delete certain patterns in a string?

How do I delete certain patterns in a string? - arrays

I'm trying to turn a returned string into an array with certain characters taken out so I can add it into a vertical list.
Input:
"\n\n1. Separate clothes into whites, lights and darks.\n2. Fill washing machine with cold water and select the appropriate cycle for the type of clothing.\n3. Add the appropriate laundry detergent, fabric softener and other laundry additives.\n4. Place clothes in the washing machine and press start."
Note that in the first one it has two linebreaks with a number (\n\n1.) and in the rest it only has one line break with a number (\n2.)
My end goal is to have some type of regex, or replace occurences for every numbered step to delete. I want it to return as ["Separate clothes into whites, lights and darks.", "Fill washing machine with cold water and select the appropriate cycle for the type of clothing.", etc etc] then to put it in a list. My problem is my regex or syntax isn't working, the one I use is "\n([+-]?(?=.\d|\d)(?:\d+)?(?:.?\d*))(?:eE)? "
I've gotten it to work perfectly in Python but am new to swift so please have mercy.
Thanks in advance!

One way is to use split to divide the text into an array. Unfortunately I couldn't get it tot work with only using split so I first used replacingOccurrences(of:with:options) to modify the values to split on
let rows = text.replacingOccurrences(of: "\n\\d\\. ", with: "\n", options: [.regularExpression])
.split(separator: "\n")
Maybe someone can make the regex work directly in split

Related

Ive got a pipe that consists of 5 pieces, each including 5 properties

Inlet -> front -> middle -> rear -> outlet
Those five properties have a value anything between 4 - 40. Now i want to calculate a specific match for each of those values that is either a full 10 or a 5 when a single property is summed from each pipe piece. There might be hundreds of different pipe pieces all with different properties.
So if i have all 5 pieces and when summed, their properties go like 54,51,23,71,37. That is not good and not what im looking.
Instead 55,50,25,70,40. That would be perfect.
My trouble is there are so many of the pieces that it would be insane to do the miss'matching manually, and new ones come up frequently.
I have manually inserted about 100 of these already into SQLite, but should be easy to convert into any excel or other database formats, so answer can be related to anything like mysql or googlesheets.
I need the calculation that takes every piece in account and results either in "no match" or tells me the id of each piece that is required for a match and if multiple matches are available, it separates them.
Edit: Even just the math needed to do this kind of calculation would be a lot of help here, not much of a math guy myself. I guess there should be a reference piece i need to use and then that gets checked against every possible scenario.

If the value you want to verify is in A1, use: =ROUND(A1/5,0)*5
If the pipes may not be shorter than the given values, use =CEILING(A1,5)

How many trucks came empty but bought something

I think this is the hardest to date I have had to crack - so hard I had a hard time finding a good headline.
So we have a site where trucks come and buy say Gravel, or sand or other building materials.
Sometimes they also unload demolition waste first.
I need to find out a couple of things
how many trucks (and from what companys) came empty
if they came empty what did they buy from us.
what companys are sending full trucks and what are sending empty trucks.
a tope 10 of materials they will drive to us from to buy even when coming empty to our facility.
a list of all the order numbers that they drove to us til fill and came with empty trucks. ( I have distances linked to order numbers, so now I can estimate the value of our products)
The data I have available:
I have a full data set of when what customer buys what and / or pay to deliver.
E.G.:
I can see the parts I need to split the data into I think it should be something like this
find all unique licence plates
somehow map if they bought materials within 30 minutes of
offloading demolition waste (most trucks will come between 2 and 10
times per day)
Present all this data (on a normal day we have about 800 trucks = 2000 lines since they weigh in, weigh out, and then some buy something = 2 more weigh lines)
I can easily find unique licence plates per day (either by formula or by Excel function Data/delete doublets,
but after that I have no clue where to start.
I think I need some sheets in between, where I somehow mark if a material was bought from an "empty truck" and I need a counter for that .. somehow...
Any help on how to get started is appreciated.

It seems like the best way to start is with a helper column (in the following exampes, I have chosen "Column M") to flag whether the truck arrived empty.
In the helper column, you can use something similar to the following formula.
{=IF(ISBLANK(B2),0,IF(C2="In",0,IF(B2=$B$2:$B$13,IF($C$2:$C$13="In",IF($A$2:$A$13>(A2-TIME(0,30,0)),0,1),1),1)))}
This is an array formula, which means you have to press ctrl+shift+enter after pasting it in the cell. Then you can copy that cell down the column.
Just to explain, the first if statement knows the truck is not arriving empty if Column C is 'In'. The second if statement creates an array and tests to see if other the same truck appears in other rows. The third if statement checks to see if the same truck checked 'In' in the matching rows, and the fourth if statement verifies if the time they checked in was less than thirty minutes ago. You can adjust the length by editing the TIME(0,30,0) function. The format is TIME(hours,minuites,seconds). Unless the truck matches all three of the second, third and fourth if statements, it is marked as coming empty.
Once you have this helper column, just about all of your tasks are quite simple.
1a: How many trucks came empty? Sum Column M
1b: How many trucks from what company? Create a unique list of companies. Then create a COUNTIFS formula based on Column M = 1 and Column K = Company. For example, if C32 had Company B then the formula =COUNTIFS($M$2:$M$13,1,$K$2:$K$13,C32) would return 2
1c: How many times did a truck come empty? Similar to 1b, create a unique list of License Plates, then use a COUNTIFS based on Column M = 1 and Column B = License Plate.
2: Similar to 1b, just use a unique list of products tested against Column F
3: Similar to 1b, just create a second column, next to the first that uses =COUNTIFS($M$2:$M$13,0,$K$2:$K$13,C53,$C$2:$C$13,"In") Which tests that Column M reports the truck did not come empty, that matches the company in Column K and that the truck came 'In' so you don't double count the same truck when it goes 'out'
4: Just sort list created by number 2. You can highlight the range, right-click and select "Sort" > "Custom Sort", then select the column you want to sort on and largest to smallest.
5: There are a couple of different ways, you could do this. The formula
{=TEXTJOIN(", ",TRUE,IF($M$2:$M$13=1,$J$2:$J$13,""))}
(again, entered as an array formula)
would create a comma separated list of order numbers. An alternative if you want a column of order numbers (but would only work if they are actually numbers), is to paste the formula {=MAX(IF($M$2:$M$13=1,$J$2:$J$13,))} in the first row of the column (in my example, its O2) and then {=MAX(IF($M$2:$M$13=1,IF($J$2:$J$13<O2,$J$2:$J$13,)))} in the row below (change the reference to O2 if you pasted it in a different spot)(again, note that both of these are array formulas). Then copy and paste the second formula down the column. When order numbers of trucks that came in empty are exhausted, the formula will report 0.

Separate Real Words from Random Strings

I'm storing a list of keywords that have been used throughout all searches on a site, and I'm getting a lot of random strings in the keywords field. Here's a sample of the data that I'm getting back:
fRNPRXiPtjDrfTDKH
boom
Mule deer
gVXOFEzRWi
cbFXZcCoSiKcmrvs
Owner Financed ,owner Financed
I'm trying to find a way in SQL or ColdFusion to figure out if something has valid English words, or if it's a random set of characters. I've tried doing some digging for n-gram analysis, but can't seem to come up with any useful solutions that I can run directly on my servers.

UPDATE: The code is now on jsFiddle: http://jsfiddle.net/ybanrab/s6Bs5/1/ it may be interesting to copy and paste a page of news copy and paste in your test data
I'd suggest trying to analyse the probabilities of the individual characters following each other. Below is an example in JavaScript I've written but that ought to translate to T-SQL or ColdFusion pretty easily.
The idea is that you feed in good phrases (the corpus) and analyse the frequency of letters following other letters. If you feed it "this thin the" you'll get something like this:
{
t:{h:3},
h:{i:2,e:1},
i:{s:1,n:1},
s:{},
n:{}
}
You'll get most accuracy by feeding in hand-picked known good inputs from the data you're analysing, but you may also get good results by feeding in plain english. In the example below I'm computing this, but you can obviously store this once you're happy with it.
You then run the sample string against the probabilities to give it a score. This version ignores case, word starting letter, length etc, but you could use them as well if you want.
You then just need to decide on a threshold score and filter like that.
I'm fairly sure this kind of analysis has a name, but my google-fu is weak today.
You can paste the code below into a script block to get an idea of how well (or not) it works.
var corpus=["boom","Mule Deer", "Owner Financed ,owner Financed", "This is a valid String","The quick brown fox jumped over the lazy dog"];
var probs={};
var previous=undefined;
//Compute the probability of one letter following another
corpus.forEach(function(phrase){
phrase.split(" ").forEach(function(word){
word.toLowerCase().split("").forEach(function(chr){
//set up an entry in the probabilities table
if(!probs[chr]){
probs[chr]={};
}
//If this isn't the first letter in the word, record this letter as following the previous one
if(previous){
if(!probs[previous][chr]){
probs[previous][chr]=0;
}
probs[previous][chr]++;
}
//keep track of the previous character
previous=chr;
});
//reset previous as we're moving onto a different word
previous=undefined;
})
});
function calculateProbability(suspect){
var score=0;
var previous=undefined;
suspect.toLowerCase().split("").forEach(function(chr){
if(previous && probs[previous] && probs[previous][chr]){
//Add the score if there is one, otherwise zero
score+=probs[previous][chr];
}
previous=chr;
});
return score/suspect.length;
}
console.log(calculateProbability("boom"));
console.log(calculateProbability("Mood"));
console.log(calculateProbability("Broom"));
console.log(calculateProbability("sajkdkas dak"));

The best thing to do is to check your words against frequency lists: dictionaries won't work because they don't contain grammatical inflections, proper nouns, compounds, and a whole load of other stuff that's valid.
The problem with naive checking against n-gram data is there is a lot of noise in the lower frequency words. The easiest thing to do which should give you the correct answer in the overwhelming majority of cases is to truncate a list of frequency counted words from somewhere suitably large (Google n-gram, Wikipedia, etc) at the top 50,000 or 100,000 words. Adjust the threshold as appropriate to get the results you're looking for, but then you can just check if any/all of your query terms appear in this list.
If you want to know if the query is grammatical, or sensible as a unit rather than its constituent parts, that's a whole other question of course.

There are some non-dictionary-words that can be valid searches (e.g. gethostbyname is a valid and meaningful search here on SO, but not a dictionary word). On the other hand, there are dictionary words that have absolutely nothing to do with your website.
Instead of trying to guess what is a word and what isn't, you could simply check if the search query produced a non-empty result. Those with empty results must be complete off-topic or gibberish.

It sounds like you are looking for a
Bayesian Filter

Is there a way to rank the difficulty of pronunciation of a word?

I'm trying to build a collection English words that are difficult to pronounce.
I was wondering if there is an algorithm of some kind or a theory, that can be used to show how difficult a word is to pronounce.
Does this appear to you as something that can be computed?
As this seems to be a very subjective thing, let me make it more objective, let's say hardest words to pronounce by text to speech technologies.

One approach would be to build a list with two versions of each word. One the correct spelling, and the other being the word spelled using the simplest of phonetic spelling. Apply a distance function on the two words (like Levenshtein distance http://en.wikipedia.org/wiki/Levenshtein_distance). The greater the distance between the two words, the harder the word would be to pronounce.

Great problem! Off the top of my head you could create a system which contains all the letters from the phonetic alphabet and with connected weights betweens every combination based on difficulty (highly specific so may need multiple people testing and take averages etc) then have a list of all words from the English dictionary stored on disk and call a script which cycles through each entry and performs web scraping on wikipedia for the phonetic spelling and ranks their difficulty. This could take into consideration the length of the word as well as the difficulty between joining phonetics then order the list based on the difficulty.
Thats what I would try and do :P

To a certain extent...
Speech programs for example use a system of phonetics to try and pronounce words.
For example, "grasp" would be split into:
Gr-A-Sp
However, for foreign words (or words that don't follow this pattern), exception lists have to be kept e.g. Yacht

Suggestion
Fortunately Pronunciation as a process is dependent on a two factors these include
the phones making up the words and the location of vowels and semi vowels i.e
/a/,/ae/,/e/,/i/,/o/,/u/,/w/,/j/...
length of the word.
the first relates to the mechanics of phone sound production as the velum, cheeks tongue have to be altered to produce various sounds related to individual phones i.e nasal etc. this makes some words more difficult to pronounce as the movement required may be a lot. Refer to books about phonetics to find positions of pronouncing each phone.
Algorithm
a weighted spanning tree with weight being the difficulty of pronouncing two consecutive phones i.e l and r or /sh/ and /s/
good luck.

How do Markov Chain Chatbots work?

I was thinking of creating a chatbot using something like markov chains, but I'm not entirely sure how to get it to work. From what I understand, you create a table from data with a given word and then words which follow. Is it possible to attach any sort of probability or counter while training the bot? Is that even a good idea?
The second part of the problem is with keywords. Assuming I can already identify keywords from user input, how do I generate a sentence which uses that keyword? I don't always want to start the sentence with the keyword, so how do I seed the markov chain?

I made a Markov chain chatbot for IRC in Python a few years back and can shed some light how I did it. The text generated does not necessarily make any sense, but it can be really fun to read. Lets break it down in steps. Assuming you have a fixed input, a text file, (you can use input from chat text or lyrics or just use your imagination)
Loop through the text and make a Dictionary, meaning key - value container. And put all pair of words as keys and the word following as a value.
For example: If you have a text "a b c a b k" you start with "a b" as key and "c" as value, then "b c" and "a" as value... the value should be a list or any collection holding 0..many 'items' as you can have more than one value for a given pair of words. In the example above you will have "a b" two times followed fist by "c" then in the end by "k". So in the end you will have a dictionary/hash looking like this: {'a b': ['c','k'], 'b c': ['a'], 'c a': ['b']}
Now you have the needed structure for building your funky text. You can choose to start with a random key or a fixed place! So given the structure we have we can start by saving "a b" then randomly taking a following word from the value, c or k, so the first save in the loop, "a b k" (if "k" was the random value chosen) then you continue by moving one step to the right which in our case is "b k" and save a random value for that pair if you have, in our case no, so you break out of the loop (or you can decide other stuff like start over again). When to loop is done you print your saved text string.
The bigger the input, the more values you will have for you keys (pair of words) and will then have a "smarter bot" so you can "train" your bot by adding more text (perhaps chat input?). If you have a book as input, you can construct some nice random sentences. Please note that you don't have to take only one word that follows a pair as a value, you can take 2 or 10. The difference is that your text will appear more accurate if you use "longer" building blocks. Start with a pair as a key and the following word as a value.
So you see that you basically can have two steps, first make a structure where you randomly choose a key to start with then take that key and print a random value of that key and continue till you do not have a value or some other condition. If you want you can "seed" a pair of words from a chat input from your key-value structure to have a start. Its up to your imagination how to start your chain.
Example with real words:
"hi my name is Al and i live in a box that i like very much and i can live in there as long as i want"
"hi my" -> ["name"]
"my name" -> ["is"]
"name is" -> ["Al"]
"is Al" -> ["and"]
........
"and i" -> ["live", "can"]
........
"i can" -> ["live"]
......
Now construct a loop:
Pick a random key, say "hi my" and randomly choose a value, only one here so its "name"
(SAVING "hi my name").
Now move one step to the right taking "my name" as the next key and pick a random value... "is"
(SAVING "hi my name is").
Now move and take "name is" ... "Al"
(SAVING "hi my name is AL").
Now take "is Al" ... "and"
(SAVING "hi my name is Al and").
...
When you come to "and i" you will randomly choose a value, lets say "can", then the word "i can" is made etc... when you come to your stop condition or that you have no values print the constructed string in our case:
"hi my name is Al and i can live in there as long as i want"
If you have more values you can jump to any keys. The more values the more combinations you have and the more random and fun the text will be.

The bot chooses a random word from your input and generates a response by choosing another random word that has been seen to be a successor to its held word. It then repeats the process by finding a successor to that word in turn and carrying on iteratively until it thinks it’s said enough. It reaches that conclusion by stopping at a word that was prior to a punctuation mark in the training text. It then returns to input mode again to let you respond, and so on.
It isn’t very realistic but I hereby challenge anyone to do better in 71 lines of code !! This is a great challenge for any budding Pythonists, and I just wish I could open the challenge to a wider audience than the small number of visitors I get to this blog. To code a bot that is always guaranteed to be grammatical must surely be closer to several hundred lines, I simplified hugely by just trying to think of the simplest rule to give the computer a mere stab at having something to say.
Its responses are rather impressionistic to say the least ! Also you have to put what you say in single quotes.
I used War and Peace for my “corpus” which took a couple of hours for the training run, use a shorter file if you are impatient…
here is the trainer
#lukebot-trainer.py
import pickle
b=open('war&peace.txt')
text=[]
for line in b:
for word in line.split():
text.append (word)
b.close()
textset=list(set(text))
follow={}
for l in range(len(textset)):
working=[]
check=textset[l]
for w in range(len(text)-1):
if check==text[w] and text[w][-1] not in '(),.?!':
working.append(str(text[w+1]))
follow[check]=working
a=open('lexicon-luke','wb')
pickle.dump(follow,a,2)
a.close()
Here is the bot:
#lukebot.py
import pickle,random
a=open('lexicon-luke','rb')
successorlist=pickle.load(a)
a.close()
def nextword(a):
if a in successorlist:
return random.choice(successorlist[a])
else:
return 'the'
speech=''
while speech!='quit':
speech=raw_input('>')
s=random.choice(speech.split())
response=''
while True:
neword=nextword(s)
response+=' '+neword
s=neword
if neword[-1] in ',?!.':
break
print response
You tend to get an uncanny feeling when it says something that seems partially to make sense.

You could do like this:
Make a order 1 markov chain generator, using words and not letters.
Everytime someone post something, what he posted is added to bot database.
Also bot would save when he gone to chat and when a guy posted the first post (in multiples of 10 seconds), then he would save the amount of time this same guy waited to post again (in multiples of 10 seconds)...
This second part would be used to see when the guy will post, so he join the chat and after some amount of time based on a table with "after how many 10 seconds the a guy posted after joining the chat", then he would continue to post with the same table thinking "how was the amount of time used to write the the post that was posted after a post that he used X seconds to think about and write"

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight