Parsing a number out from a string - winforms

I have something like this:
Cat = "I cost £3,000"
And some are like this:
AnotherCat = "that other cat is just £10 more expensive than mine £2990"
I want to parse out just 3000 from Cat and 2990 from AnotherCat.

Your question is more complicated than just parsing a number from a string. If you only want to extract all the numbers from a string, you can use a regex like this:
var result = Regex.Matches(stringValue, @"\d+");
but if you want to understand which number is the price of something, as opposed to, say, a count of cats, that is a much harder problem, related to natural language processing and machine learning.
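For this particular input there is an extra wrinkle: the comma in "£3,000" means `\d+` alone would split it into 3 and 000. A rough sketch in Python (the function name is hypothetical) that keeps comma-grouped digits together and strips the commas afterwards:

```python
import re

def extract_prices(text):
    # Capture digit runs that follow a pound sign, allowing thousands
    # separators, then strip the commas before converting to int.
    return [int(m.replace(",", "")) for m in re.findall(r"£([\d,]+)", text)]

print(extract_prices('I cost £3,000'))  # [3000]
print(extract_prices('that other cat is just £10 more expensive than mine £2990'))  # [10, 2990]
```

Note that the second string still yields both 10 and 2990; deciding which of those is the actual price is exactly the harder problem described above.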

Related

How would I create a DCG to parse queries about a database?

I'm just playing around with learning Prolog and am trying to build a database of information and then use natural English to query about relationships. How would I go about doing this?
Example:
%facts
male(bob).
female(sarah).
male(timmy).
female(mandy).
parent(bob, timmy).
parent(bob, mandy).
parent(sarah, timmy).
parent(sarah, mandy).
spouse(bob, sarah).
spouse(sarah, bob).
%some basic rules
married(X,Y) :- spouse(X,Y).
married(X,Y) :- spouse(Y,X).
child(C,X) :- parent(X,C).
Now, I want to ask some "who" questions, i.e., "who is the parent of timmy".
I read something about DCGs, but, can anyone point me to some good resources or get me going in the right direction?
Thank you!
First, I would also like to ask a "who" question: In a fact like
parent(X, Y).
WHO is actually the parent? Is it X or is it Y?
To make this clearer, I strongly recommend you use a naming convention that makes clear what each argument means. For example:
parent_child(X, Y).
Now, to translate "informal" questions to Prolog goals that reason over your data, consider for example the following DCG:
question(Query, Person) --> [who,is], question_(Query, Person).
question_(parent_child(Parent,Child), Parent) --> [the,parent,of,Child].
question_(parent_child(Parent,Child), Child) --> [the,child,of,Parent].
question_(married(Person0,Person), Person) --> [married,to,Person0].
question_(spouse(Person0,Person), Person) --> [the,spouse,of,Person0].
Here, I am assuming you have already converted given sentences to tokens, which I represent as Prolog atoms.
In order to conveniently work with DCGs, I strongly recommend the following declaration:
:- set_prolog_flag(double_quotes, chars).
This lets you write for example:
?- Xs = "abc".
Xs = [a, b, c].
Thus, it becomes very convenient to work with such programs. I leave converting such lists of characters to tokens as an exercise for you.
Once you have such tokens, you can use the DCG to relate lists of such tokens to Prolog queries over your program.
For example:
?- phrase(question(Query, Person), [who,is,the,parent,of,timmy]).
Query = parent_child(Person, timmy) .
Another example:
?- phrase(question(Query, Person), [who,is,the,spouse,of,sarah]).
Query = spouse(sarah, Person).
To actually run such queries, simply post them as goals. For example:
?- phrase(question(Query, Person), [who,is,married,to,bob]),
Query.
Query = married(bob, sarah),
Person = sarah .
This shows that such a conversion is quite straightforward in Prolog.
As @GuyCoder already mentioned in the comments, make sure to check out the dcg tag for more information about DCG notation!

How to store keywords for an article

In SOLR, I have a document that has id, words (indexed) and raw_text fields. I want to search just the words field, in this way:
The words field holds the infinitives (or, say, keywords) of an article. For parsing and lemmatization (stemming) I use another tool, so that's not the point of the question.
E.g.: for these two articles(texts) words would be:
1 Yesterday I didn't go to work, because it was holiday.
words: yesterday go work because holiday
2 Tommorrow I am going to work in the morning and in the evening I am going shopping.
words: tommorrow go work morning evening go shop
3 words: go tommorrow work
In a search for "go" I want 2 retrieved first (more relevant) because it has more "go"-s than 1. I also want to use longer queries with a bunch of words and have the retrieved articles contain most of them, most often.
E.g.: a search for "go tommorrow work" would return 2 as more relevant than 3, because there are two "go"-s in 2 contrary to only one in 3.
So the question: how should I store words, multiValued or just single? What field type should be used?
Thank you!
(Single-valued) text would suit you.
The text field type comes with tokenization, stemming and stop-word analyzers.
Stemming uses heuristics to derive the root of a word. Among other things, it would find the roots of your article words even when they are stored in infinitive form :-)
Trying it out for your samples (with a few additions):
Original: Yesterday [yesterday's] I didn't go to work [working, workable], because it was holiday [holidays].
Stemmed: Yesterdai yesterdai s I didn t go to work work workabl becaus it wa holidai holidai
Original: Tommorrow I am going [go,going,gone] to work in the morning [mornings] and in the evening I am going shopping [shoppers, shops].
Stemmed: Tommorrow I am go go go gone to work in the morn morn and in the even I am go shop shopper shop
Because it uses heuristics, "workable" does not share a root with "work" and "gone" doesn't share a root with "go". But it's a trade-off that is much simpler and faster while hardly diminishing result quality.
And "didn't" and "I" are stop words according to this list, so they are automatically eliminated.
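Concretely, the text field type wires these analyzers together in schema.xml roughly like this (a sketch only; the exact filter classes and attributes vary between Solr versions):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Declare your words field with this type (single-valued is fine), and the same analysis runs at both index and query time.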
If you ever observe unacceptable results too often, take the trouble to implement WordNet. It has lemmas, parts of speech and other natural-language goodies.

How to debug GQL queries in GAE?

I have some custom user model, and I count the number of users with name Joe:
c = UserModel.all().filter('name =', 'Joe').count()
Even though I know there is a Joe in the datastore, some mistake makes c == 0.
This is the problem I'm dealing with; however, the bigger problem is that I don't know how to debug it.
I would like to get some query and visualise it somehow, so that I can understand what is there and why Joe is not there:
v = magically_visualise_contents_of(UserModel.all().filter('name =','Joe'))
handler.response.out.write(v)
Try running the query directly in the Datastore Viewer using GQL.
That usually helps identify minor issues, for example:
SELECT * FROM UserModel WHERE name = 'Joe'
Also, one common mistake with string matching is whitespace characters in the data, like "Joe ".
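One quick way to expose such characters is to print repr() of the stored values rather than the values themselves (a sketch with a hypothetical helper; feed it the name values from your fetched entities):

```python
def show_raw(values):
    # repr() makes trailing spaces and other invisible
    # characters visible in the output.
    return [repr(v) for v in values]

# Sample values as they might actually sit in the datastore:
print(show_raw(['Joe', 'Joe ', 'joe']))  # ["'Joe'", "'Joe '", "'joe'"]
```

A trailing space or a case difference that an equality filter silently misses shows up immediately in that output.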

App engine - easy text search

I was hoping to implement an easy but effective text search for App Engine that I could use until official text search capabilities for App Engine are released. I see there are libraries out there, but it's always a hassle to install something new. I'm wondering if this is a valid strategy:
1) Break each property that needs to be text-searchable into a set(list) of text fragments
2) Save record with these lists added
3) When searching, just use equality filters on the list properties
For example, if I had a record:
{
firstName="Jon";
lastName="Doe";
}
I could save a property like this:
{
firstName="Jon";
lastName="Doe";
// not case sensitive:
firstNameSearchable=["j","o","n","jo","on","jon"];
lastNameSearchable=["d","o","e","do","oe","doe"];
}
Then to search, I could do this and expect it to return the above record:
//pseudo-code:
SELECT person
WHERE firstNameSearchable=="jo" AND
lastNameSearchable=="oe"
Is this how text searches are implemented? How do you keep the index from getting out of control, especially if you have a paragraph or something? Is there some other compression strategy that is usually used? I suppose if I just want something simple, this might work, but it's nice to know the problems that I might run into.
Update:::
Ok, so it turns out this concept is probably legitimate. This blog post also refers to it: http://googleappengine.blogspot.com/2010/04/making-your-app-searchable-using-self.html
Note: the source code in the blog post above does not work with the current version of Lucene. I installed the older version (2.9.3) as a quick fix, since Google is supposed to come out with its own text search for App Engine soon enough anyway.
The solution suggested in the response below is a nice quick fix, but due to Bigtable's limitations it only works if you are querying on one field, because you can only use inequality operators on one property in a query:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")
If you want to query on more than one property, you can save indexes for each property. In my case, I'm using this for some auto-suggest functionality on small text fields, not for actual word and phrase matching in a document (you can use the blog post's implementation above for that). It turns out this is pretty simple and I don't really need a library for it.
Also, I anticipate that if someone is searching for "Larry", they'll start by typing "La..." rather than starting in the middle of the word with "arry". So if the property is a person's name or something similar, the index only needs the substrings starting with the first letter: the index for "Larry" would just be {"l", "la", "lar", "larr", "larry"}.
I did something different for data like phone numbers, where you may want to search starting from the beginning or from middle digits. In this case, I just stored the entire set of substrings of length 3 or more, so the phone number "123-456-7890" would be: {"123", "234", "345", ..... "123456789", "234567890", "1234567890"}, a total of (10*(10+1)/2) - (10+9) = 36 indexes... actually what I did was a little more complex in order to remove some unlikely-to-be-used substrings, but you get the idea.
Then your query would be:
(Pseudo-code)
SELECT * FROM Person WHERE
firstNameSearchIndex == "lar" AND
phonenumberSearchIndex == "1234"
The way App Engine works, an equality filter on a list property matches if the filter value equals any element of the list, so matching any stored substring counts as a match.
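The two index-building schemes described above can be sketched like this (helper names are hypothetical):

```python
def prefix_index(name):
    # "Larry" -> {"l", "la", "lar", "larr", "larry"}: prefixes only,
    # since users type a name starting from its first letter.
    name = name.lower()
    return {name[:i] for i in range(1, len(name) + 1)}

def substring_index(digits, min_len=3):
    # All substrings of length >= min_len, so a phone number can be
    # matched from the beginning or from middle digits.
    return {digits[i:j]
            for i in range(len(digits))
            for j in range(i + min_len, len(digits) + 1)}

print(sorted(prefix_index("Larry")))       # ['l', 'la', 'lar', 'larr', 'larry']
print(len(substring_index("1234567890")))  # 36
```

Store the resulting set in a list property and the equality filters in the pseudo-query above will match against any of its elements.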
In practice, this won't scale. For a string of n characters, you need on the order of n*(n+1)/2 index entries to cover every substring: over 125,000 for a 500-character string, far beyond the datastore's per-entity index limits. You will die of old age before your entity finishes writing to the datastore.
Implementations like search.SearchableModel create one index entry per word, which is a bit more realistic. You can't search for arbitrary substrings, but there is a trick that lets you match prefixes:
From the docs:
db.GqlQuery("SELECT * FROM MyModel
WHERE prop >= :1 AND prop < :2",
"abc", u"abc" + u"\ufffd")
This matches every MyModel entity with
a string property prop that begins
with the characters abc. The unicode
string u"\ufffd" represents the
largest possible Unicode character.
When the property values are sorted in
an index, the values that fall in this
range are all of the values that begin
with the given prefix.

Searching through descriptions

There's a movie whose name I can't remember. It's about a carnival or amusement park with a horror house and a bunch of teens who are murdered one by one by something with a clown's mask. I saw this movie and its sequel about 20 years ago, but can't remember it exactly. (And I also forgot its title.) As a result, I started wondering about how to solve something technical.
Assume that I have a database with the story plot and other data of each and every movie published. (Something like the IMDb.) And I would have an edit field where a user can just enter a description in plain text. The system would then start analysing this text to find the movie(s) that would qualify to this description.
For example (different movie), I enter this in the edit field: "Some movie about an Egyptian king who attacks a bunch of indians on horseback, but he's badly wounded and his horse dies while he lost this battle."
The system should then report the movie "Alexander" from 2004 as answer, but possibly a few more. (Even allowing a few errors in the description.)
To create such a system where a description gets analysed to find a matching record by searching through descriptions, what techniques should I need for something as complex as that? Not that I want to build something like that right now, but more out of curiosity if I ever want to pick up some interesting new project.
(I wanted to award extra points for those who recognise the movie I've mentioned in the beginning. But one Google-attempt later and I found it myself!)
Btw, it's not the search engine itself that interests me, but analysing the description to get to something a search engine will understand! With the example movie, it's human logic that helped me to find the title. (And it's annoying that this movie isn't for sale in the Netherlands.) Human logic will always be a requirement but it's about analysing the user input, which is in the form of a story or description, with possible errors.
You should check out document classification.
A few document classification techniques
Naive Bayes classifier
tf–idf
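As a rough illustration of the tf-idf route (a toy sketch, not a production retrieval engine; the plot texts are made up):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: a list of token lists; returns one {term: weight} dict per doc.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    # Cosine similarity between two sparse {term: weight} vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

plots = [
    "egyptian king attacks indians on horseback wounded horse dies battle".split(),
    "teens murdered one by one at carnival horror house by clown".split(),
]
query = "egyptian king horse battle".split()
vectors = tfidf_vectors(plots + [query])  # idf computed over plots + query
scores = [cosine(v, vectors[-1]) for v in vectors[:-1]]
print(scores.index(max(scores)))  # 0: the Alexander-like plot wins
```

A real system would add tokenization, stemming and fuzzy matching on top, but ranking summaries by tf-idf cosine similarity to the user's description is the core of the idea.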
From what I can tell from your own comments, Google is the technique to be used. ;-) But, honestly, I think more or less any search engine would do.
Edit: heh, you removed your comment, but I do remember you mentioned Google as the one deserving extra points.
Edit+: well, you mentioned Google again, but I don't want to remove my first edit. ;-)
Pure speculation: would something trivial work, such as taking every word of more than 4 letters in the description ("Egyptian", "Indian", "horse", "battle", etc.) and fuzzy-matching against a database of such summaries? Perhaps with some normalisation, e.g. king == leader == emperor?
Hmmm ... young man, girlfriend, swimming pool, mother, wedding: does that get us to The Graduate? Well, I guess with a small amount of specifics ("Robinson") it might.
You can do lots of interesting stuff with the IMDb keyword search:
http://akas.imdb.com/keyword/carnival/clown/murder/
You can specify multiple keywords, and it suggests movies and further keywords that appear in a similar context to the ones you gave.
The data contained in IMDb is publicly available for non-commercial use and can be downloaded as text files. You could build a database from it.

Resources