Too much maxent in text-processing? - solr

My name is Frederic. I am neither a professional developer nor a native English speaker, so please forgive me for that.
I am a medical physician working as an in-hospital clinician, and I have been passionate about NLP for quite some time. I'm currently writing a thesis for an MD-PhD in medical informatics; my subject is information retrieval over patient documentation for a better clinical workflow.
I have a fairly well-defined strategy for text pre-processing and later indexing through Solr. I have been able to implement a maxent classifier that works great! It works so well, in fact, that it calls the other text-processing steps into question (I now kind of want to use it everywhere, which feels totally wrong). I need your insight to make up my mind on that point.
Medical texts come in very different types (specialized consultations, operative documents, prescriptions, notes, etc.). However, they often follow structuring rules that are quite general. For example, admission forms always present the physical examination as a list, plus a few paragraphs of narrative text.
I first wanted to use a maxent classifier to classify chunks of text under main headings (patient info, discipline, physical, discharge summary, discharge prescriptions and so on). That seems to work great! But after a while, I realized that most classification errors came from the text not being segmented properly in the first place, so the maxent could not do its job correctly. I segment paragraphs with a hand-built decision tree that takes into account new lines and spacing, plus some characteristics of the text preceding or following the separation markers (e.g. titles are often all-caps, and you can tell "real new lines" from "decorative new lines" by the presence of a period at the end of the preceding paragraph and an upper-case first letter in the following one, and so on).
Well... maxent works well for this task too. But now I find myself training a third maxent, because it works so well, to tell real periods from other periods (as in numbers or abbreviations)... and there is also dash classification.
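For readers who want to picture the hand-built rules described above, here is a minimal sketch in Python. It is not the author's decision tree: the function names (`is_title`, `is_real_break`, `segment`) are hypothetical, and treating "!" and "?" like a period is an added assumption. It only encodes the two cues mentioned in the text (all-caps titles, and a period followed by an upper-case letter).

```python
def is_title(line):
    """A line counts as a title when all of its letters are upper-case."""
    letters = [c for c in line if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def is_real_break(prev_line, next_line):
    """Guess whether the line break between two lines ends a paragraph."""
    prev, nxt = prev_line.strip(), next_line.strip()
    if not prev or not nxt:
        return True  # blank line or end of text always separates
    if is_title(prev) or is_title(nxt):
        return True  # titles stand on their own
    # Assumption: "!" and "?" are treated like a period.
    ends_sentence = prev.endswith((".", "!", "?"))
    starts_upper = nxt[0].isupper()
    return ends_sentence and starts_upper


def segment(text):
    """Join "decorative" line breaks and split on "real" ones."""
    lines = text.splitlines()
    paragraphs, current = [], []
    for i, line in enumerate(lines):
        if line.strip():
            current.append(line.strip())
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if current and is_real_break(line, nxt):
            paragraphs.append(" ".join(current))
            current = []
    return paragraphs
```

A rule set like this is cheap to run and easy to inspect, which makes it a useful baseline against which to measure whether a trained classifier is actually worth its cost for the segmentation step.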
If I listened to myself (and hopefully I don't), I would use maxent for pretty much every text-processing task that requires a bit of classification. But that seems very wrong for many reasons: training time, memory usage, etc.
So could you please give me your advice: for which of these tasks does maxent make the most sense, and are there alternatives that I have not taken into account?
The main tasks in question right now are:
paragraph segmentation: using maxent to determine whether a line break is a real one or should be omitted
paragraph classification: once segmented, assign a subheading to each paragraph
word (or token) normalisation: real periods vs. in-word periods (end of sentence vs. abbreviation, for example), dashes and apostrophes (I work in French, so these characters are quite problematic, as in "va-t'en" or "jusqu'à")
Sorry for the long text. I figured no code was really useful for this question, but I can post some if it helps you help me ;)
Thanks in advance, and cheers,

Related

short text syntactic classification

I am a newbie at machine learning and data mining. Here's the problem: I currently have one input variable, a short text made up of non-standard nouns, and I want to classify it into a target category. About 40% of the entire dataset is available as training data; the remaining 60% we would like to classify as accurately as possible. Below are some input values, across multiple observations, that are assigned the 'LEAD_GENERATION_REPRESENTATIVE' title.
"Business Development Representative MFG"
"Business Development Director Retail-KK"
"Branch Staff"
"Account Development Rep"
"New Business Rep"
"Hong Kong Cloud"
"Lead Gen, New Business Development"
"Strategic Alliances EMEA"
"ENG-BDE"
I think the above gives an idea of what I mean by non-standard nouns. I can see a few meaningful tokens here, like 'development', 'lead' and 'rep'. Others seem random, without any semantics, but they may appear multiple times in the data. Another thing is that some tokens, like 'rep' and 'account', can appear in multiple categories. I think that will make weighting/similarity a challenging task.
My first question is "is it worth automating this kind of classification?"
Second: "is it a good problem for learning machine learning classification?" There are only 30k such entries and a handful of target categories. I could find someone to do it manually, which would also be more accurate.
Here's my take on this problem so far:
Full-text engine: use something like Solr to build an index and query rules that draw matches based on tokens (words, phrases, synonyms, acronyms, descriptions). I can get someone to define a detailed taxonomy for each category. Use boosting and a pluggable scoring library.
Machine learning:
Naive Bayes classification
Decision tree
SVM
I have tried Solr for this, with a reverse lookup though, since I don't have a taxonomy available at the moment. It seems I can get about 80% true positives (I'll have to dig further into the confusion matrix to reduce false positives). My query is a bunch of boolean terms and phrases with proximity and boosts, plus negations to reduce errors. I'm afraid this approach may overfit and won't scale.
I am aware that people usually try multiple modeling techniques to find out which one works best, or derive a combination of techniques. I want to understand this problem from a feasibility and complexity point of view. If it's too broad a question, please just comment on the feasibility of a solution.

Strategies for UK Postal Address Matching

I have 2 tables of UK postal addresses (around 300000 rows each) and need to match one set to the other in order to return, for each address, a unique ID contained in the first set.
The problem is there's a lot of variation in the formats of the addresses and in the spellings.
I've written a lot of T-SQL scripts to pick off the easy matches (exact postcode + house number + street name, etc.) but there are many unmatched records left that are proving difficult to handle. I might end up having as many SQL scripts as there are exceptions!
I've looked at the Levenshtein function and at ranking word for word, but these methods are unreliable and problematic too.
Does anyone have any experience of doing similar work and what was your approach & success rate?
Thank you!
I agree with the commenters that this is largely a business rule thing rather than a programming question, but for what it's worth...
I had a somewhat similar problem with a catalogue many years ago. Entries weren't always consistent in the way we'd hoped, different editions came up weirdly and with a wide variety of variations. All had to be linked.
What I did in the end was a fuzzy matcher. Broke the item down into components. Normalised the data where I could - removing spaces from fields that didn't always have them and could live without them for example. Worked out the distance between near misses - bar and car being 1 apart, for example. I stemmed words - see http://snowball.tartarus.org/algorithms/english/stemmer.html for more info. Think I even played with SQL Server's SOUNDEX matching.
I then went through and scripted the job to produce a list of candidate matches. Anything above a certain level got presented to an administrator, who was shown what the program thought was the best match along with other likely matches. They picked the one that looked best, ticked it and went on to the next one.
At the start of the list everyone thought the job was far too huge to be manageable. They then started going through it, and found it was much faster than they thought and much easier than they'd feared to stay on top of the new data as it came in.
The script to do it all programmatically will never be perfect, and will end up being nearly as long as the source list with as many objections as it'll generate. Don't try to automate it perfectly; automate the easy stuff, put a human in the loop for the uncertain cases. Much easier and safer.
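The workflow described in this answer (normalise, measure the distance between near misses, then hand the closest candidates to a person) could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the `max_distance` threshold and the dictionary of reference addresses are all hypothetical, and the original solution may have looked quite different.

```python
import re


def normalise(address):
    """Lower-case, drop punctuation and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", address.lower())
    return " ".join(cleaned.split())


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def candidates(query, reference, max_distance=3, limit=3):
    """Return up to `limit` (distance, id, address) tuples for human review.

    `reference` maps a unique ID to an address string (hypothetical layout).
    """
    q = normalise(query)
    scored = []
    for ref_id, address in reference.items():
        d = edit_distance(q, normalise(address))
        if d <= max_distance:
            scored.append((d, ref_id, address))
    return sorted(scored)[:limit]
```

Comparing every row of one table against every row of the other is quadratic, so with tables of this size it would probably make sense to compare only addresses that already share something reliable, such as the postcode, before computing distances.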

Best way to store large searchable text files

I am developing an online Bible search program. The Bible is a pretty large book, taking up nearly 5MB of space in plain text. I am planning on implementing an API in the program as well allowing other websites to include their own Bible search widgets and programs without having to develop the search queries or storing Bibles on their own servers.
With this in mind, I am going to expect that eventually I will have a moderate flow of queries passing through the program. Also, for those not familiar with the Bible, it has 2 methods of formatting the text. It can contain both red text and italics. I need a way to store the Scriptures along with the red letter and italics formatting but allowing the search queries to ignore the formatting.
It also needs to be fast and as efficient (memory and CPU usage) as possible. Any storage format will be considered (MySQL, JSON or XML text files, etc.) as long as the querying can be done ignoring the formatting. File size and count don't really matter, so splitting up the books or even chapters into separate files is fine by me.
One more important thing to keep in mind though, is that I want to have some form of search method that can search across multiple verses. So a search for "but have everlasting life for God sent not his Son" would return John 3:16,17. Thanks for all ideas!
There are a bunch of different open source document search engines which are made for precisely what you're trying to do. Solr, Elastic Search, Xapian, Whoosh, Haystack (made for Django) and others. There are other posts on S.O. and elsewhere that go into the benefits of using one vs another, but your requirements are simple enough that any of them will be more than fine (and easily scale with very minimal effort should your project take off, which is always nice to know). So look at their examples and see which one looks most intuitive to you - Solr is arguably the most popular and it's the only one I've worked with, but Elastic Search uses the same popular Lucene backend and is apparently much easier to get up and running, so I would start there.
As for the actual implementation, you'll want to index each verse as a separate "document" if the single verse (or just verse number) is what you want to return. The search engine handles the ranking of the results based on relevancy (usually using a tf/idf algorithm, in case you're interested).
The way I'd handle the italics and red text is to include some kind of markup in the text (i.e. wrap the phrase in single asterisks for italics, double asterisks for red) and then tell the analyzer to ignore those characters - there may be a simpler way in the framework you end up choosing, though, so take that with a grain of salt. The queries spanning multiple verses requirement is more complicated, but the answer will probably involve indexing each whole chapter as a document instead of (or maybe in addition to? I'd have to think about it more) each verse.
A word of caution - if you're not familiar with search indexing, even something designed to be plug-and-play like Elastic Search will probably still require some time and effort to set up, so if you absolutely need to get this up and running quickly and you're already familiar with MySQL I suppose it could work (it does do fulltext search). But it's certainly not the best tool for the job, so if this is a project that you're invested in you will thank yourself later if you put in a little bit of work to learn one of these search frameworks. It may be overkill in terms of the amount of text you're dealing with, as others have pointed out, but it will be extremely flexible in how you can search on that text which seems to be what you want. For instance, adding other requirements later on would be very straightforward (for instance, you could let people limit their search to only matches in the red text).
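The markup idea from this answer (single asterisks for italics, double asterisks for red text, both ignored at search time) can be sketched independently of any search engine. The helper names and the sample sentence below are made up for illustration; a real analyzer configuration in Solr or Elastic Search would look different.

```python
import re

# Any run of asterisks is treated as formatting markup.
MARKUP = re.compile(r"\*+")
# Double asterisks delimit red-letter text in this hypothetical convention.
RED = re.compile(r"\*\*(.+?)\*\*")


def searchable(text):
    """Return the text with all formatting markers removed."""
    return MARKUP.sub("", text)


def red_spans(text):
    """Return the phrases marked as red text, with inner markup removed."""
    return [searchable(m.group(1)) for m in RED.finditer(text)]


sample = "He said, **follow me**, and *they* went."
print(searchable(sample))
print(red_spans(sample))
```

Storing the marked-up string and deriving the plain one at index time keeps a single source of truth, and the extracted red spans could feed a separate field if searches limited to red text are wanted later.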
I didn't know the Bible had formatting. What is it used for? If it is for the verses, I'd suggest you store every verse in a database. In a highly normalized form, you have a table of books, a table of chapters and a table of verses. Each verse consists of a verse number and a verse text.
Now, I think the chapters don't have titles, so they are actually just a number as well. In that case it is silly to store them separately, so you have just your table of books and a table of verses, in which each verse has a chapter number, a verse number and a verse text. I assume that text is plain text, isn't it?
If the verse is plain text, you can easily make it searchable by storing it in MySQL and create a FULLTEXT index for it. That way, you can search quite efficiently and even use wildcards and such.
If the verse was to have formatting, you could choose to create two columns, one with the plain text for searching, and one with the formatted text for display, but I doubt you would need this.
PS: 5 MB of text is really nothing. If you have a dedicated program, you could keep it in memory in a single string and use strpos or a similar function to find a text. What language, database and platform are you using?
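As a rough illustration of the in-memory approach from the PS, the sketch below joins the plain text of all verses into one string and maps a match position back to the verses it spans, which also covers phrases that cross verse boundaries. The class name, the `(reference, text)` input layout and the sample data are assumptions, not part of the original answer.

```python
import bisect
import re


def normalise(text):
    """Lower-case, drop punctuation and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


class VerseIndex:
    def __init__(self, verses):
        """`verses` is a list of (reference, text) pairs in reading order."""
        self.refs, self.starts, parts = [], [], []
        offset = 0
        for ref, text in verses:
            plain = normalise(text)
            self.refs.append(ref)
            self.starts.append(offset)  # where this verse begins in the blob
            parts.append(plain)
            offset += len(plain) + 1    # +1 for the joining space
        self.blob = " ".join(parts)

    def search(self, phrase):
        """Return the references covered by the first match of `phrase`.

        Plain substring search: it can also match in the middle of a word.
        """
        needle = normalise(phrase)
        pos = self.blob.find(needle) if needle else -1
        if pos < 0:
            return []
        first = bisect.bisect_right(self.starts, pos) - 1
        last = bisect.bisect_right(self.starts, pos + len(needle) - 1) - 1
        return self.refs[first:last + 1]


index = VerseIndex([
    ("A 1:1", "The quick brown fox,"),
    ("A 1:2", "jumps over the lazy dog."),
])
print(index.search("brown fox jumps"))
```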

Using Markov chains (or something similar) to produce an IRC-bot

I tried Google and found little that I could understand.
I understand Markov chains at a very basic level: a Markov chain is a mathematical model whose state changes depend only on previous input... so sort of an FSM with weighted random chances instead of different criteria?
I've heard that you can use them to generate semi-intelligent nonsense, given sentences of existing words to use as a dictionary of sorts.
I can't think of search terms to find this, so can anyone link me or explain how I could produce something that gives a semi-intelligent answer? (If you asked it about pie, it would not start going on about the Vietnam War it had heard about.)
I plan on:
Having this bot idle in IRC channels for a bit
Stripping any usernames out of the strings and storing them as sentences or whatever
Over time, using this as the basis for the above.
Yes, a Markov chain is a finite-state machine with probabilistic state transitions. To generate random text with a simple, first-order Markov chain:
Collect bigram (adjacent word pair) statistics from a corpus (collection of text).
Make a Markov chain with one state per word. Reserve a special state for end-of-text.
The probability of jumping from state/word x to y is the probability of the word y immediately following x, estimated from relative bigram frequencies in the training corpus.
Start with a random word x (perhaps determined by how often that word occurs as the first word of a sentence in the corpus). Then pick a state/word y to jump to randomly, taking into account the probability of y following x (the state transition probability). Repeat until you hit end-of-text.
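The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, `None` is chosen arbitrarily as the end-of-text state, and the `max_words` cap is an added safety net against cycles.

```python
import random
from collections import defaultdict

END = None  # special end-of-text state


def train(sentences):
    """Collect bigram counts: transitions[x][y] = times y followed x."""
    transitions = defaultdict(lambda: defaultdict(int))
    starts = defaultdict(int)  # how often each word opens a sentence
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        starts[words[0]] += 1
        for x, y in zip(words, words[1:] + [END]):
            transitions[x][y] += 1
    return starts, transitions


def weighted_choice(counts, rng):
    """Pick a key with probability proportional to its count."""
    items = list(counts.items())
    return rng.choices([w for w, _ in items],
                       weights=[c for _, c in items])[0]


def generate(starts, transitions, rng=random, max_words=50):
    """Walk the chain from a sentence-initial word until end-of-text."""
    word = weighted_choice(starts, rng)
    output = []
    while word is not END and len(output) < max_words:
        output.append(word)
        word = weighted_choice(transitions[word], rng)
    return " ".join(output)


starts, transitions = train(["i like pie", "i like tea", "pie is good"])
print(generate(starts, transitions))
```

Sampling from raw counts is equivalent to sampling from the relative bigram frequencies, so no explicit normalisation into probabilities is needed.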
If you want to get something semi-intelligent out of this, then your best shot is to train it on lots of carefully collected texts. The "lots" part makes it produce proper sentences (or plausible IRC speak) with high probability; the "carefully collected" part means you control what it talks about. Introducing higher-order Markov chains also helps in both areas, but takes more storage to store the necessary statistics. You may also look into things like statistical smoothing.
However, having your IRC bot actually respond to what is said to it takes a lot more than Markov chains. It may be done by doing text categorization (aka topic spotting) on what is said, then picking a domain-specific Markov chain for text generation. Naïve Bayes is a popular model for topic spotting.
Kernighan and Pike, in The Practice of Programming, explore various implementation strategies for Markov chain algorithms. These, and natural language generation in general, are covered in great depth by Jurafsky and Martin in Speech and Language Processing.
You want to look for Ian Barber Text Generation ( phpir.com ). Unfortunately the site is down or offline. I have a copy of his text and I want to send it to you.
It seems to me you are trying multiple things at the same time:
extracting words/sentences by idling in IRC
building a knowledge base
listening to some chat, parsing keywords
generating sentences related to those keywords
Those are basically very different tasks. Markov models are often used for machine learning. I don't see much learning in your tasks though.
larsmans' answer shows how you generate sentences from word-based Markov models. You can also train the weights to favor those word pairs that other IRC users used. But nonetheless this will not generate keyword-related sentences, because building/refining a Markov model is not the same as "driving" it.
You might try hidden Markov models (HMM) where the visible output is the keywords and the hidden states are made from those word pairs. You could then favor sentences more appropriate to specific keywords dynamically.

AI program to generate paragraph pattern

Is there any software, service or AI program that can rebuild an English paragraph using a different set of vocabulary, grammar rules, etc.?
To illustrate, if the source paragraph is
“Gwalior is a good tourist place near
to Jhansi. Jhansi is very famous due
their queen Rani Laxmi Bai
(Manikandana)”
Could any software generate its own version or pattern of it, like
“Rani Laxmi Bai (Manikandana) was the
queen of Jhansi which is nearer to a
good tourist palace Gwalior.”
Or something else. I know that 100% correctness is not possible without human intervention.
This guy wrote a JavaScript app that generates corporate bullshit ready for distribution (He's also got a great buzzword bingo generator). It's not AI, it just simply follows linguistic rules. From what I understand of your question, you don't need AI, you could learn a lot from just studying what this guy did. He seeds the program with nouns, verbs, adjectives, adverbs, etc and generates text that your eyes can parse (it's grammatical but it doesn't necessarily make sense). If you're looking for something to write your thesis paper, you have a lot more looking to do.
From your question, it looks like you're also looking for a program to parse English and generate the seed data for the aforementioned generator. Abiword uses such a grammar parser for grammar checking. I haven't looked at it in much depth, but I figure you could easily use it to list the parts of speech contained in a section of text. If you used this program to generate the seed data, you could pump the output directly into the other program to generate more text.
The Python NLTK library does natural language parsing, including building parse trees which record whether a word is a verb, a noun, its tense, etc. Perhaps you could take these trees and reorganize them according to some simple rules you come up with and verify. I don't think you would need too many rules before the results of your program are very different from the source document. Some example rules:
Replace words with synonyms
active voice to passive voice and vice-versa (The hunter saw the deer -> the deer was seen by the hunter)
http://www.nltk.org/
Rapid Rewrite is a software that can do what you want: http://www.rapidrewriter.com/?hop=qushy It's not free though, and the website is terrible.
Here's another one - same story
http://thebestspinner.com/?id=eprocent
watch their video and tell me that's not what you are looking for...
Here are a few links to various programs to alter written text. One of them should be able to provide you with some tips on how to implement what you're looking for.
http://www.worldlingo.com/ma/enwiki/en/Jive_filter
http://bytes.com/topic/python/answers/476939-filters-like-old-skool-jive-fudd-valspeak-text-transformation-python
http://www.rinkworks.com/dialect/
I disagree that NLP is not the path you need to follow.
However, if you don't want to go the NLP route, you could generate some good sounding sentences without using NLP, by training a custom language model using n-grams to build a fourth or fifth order model. You would then use statistical probability to generate your sentences.
Once you have your model, you randomly pick a starting word (from the set of known sentence-starting words, or words that begin with a capital letter), and then use conditional probability to pick the next word.
An easy example of this is in this article: Wordmills are coming...
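To make the n-gram idea concrete, here is a small sketch of an order-n word model. It is an illustration under assumptions rather than a recommended design: the function names are hypothetical, and the toy example uses order 2 only because a fourth- or fifth-order model needs far more training text than fits here.

```python
import random
from collections import defaultdict


def build_model(words, order=2):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        model[context].append(words[i + order])
    return model


def generate(model, rng, length=20):
    """Generate up to `length` words, starting from a capitalised context."""
    if not model:
        return ""
    # Assumption: a capitalised first word marks a plausible sentence start.
    starts = [c for c in model if c[0][:1].isupper()] or list(model)
    context = rng.choice(starts)
    output = list(context)
    while len(output) < length and context in model:
        # A uniform pick from the observed followers samples each next word
        # with its conditional relative frequency given the context.
        output.append(rng.choice(model[context]))
        context = tuple(output[-len(context):])
    return " ".join(output)


corpus = "The cat sat on the mat . The cat ate the fish .".split()
model = build_model(corpus, order=2)
print(generate(model, random.Random()))
```

Raising `order` makes the output more fluent but also more likely to reproduce the training text verbatim, which is why ample training material matters.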
Of course, you would need ample training material in order to accomplish this, as training on a single paragraph would not work well for the way you want to rephrase a paragraph. And detecting nouns, verbs, etc. in your sample paragraph without NLP techniques (which would require well-trained models as well), then rearranging them into an opposite sentence structure, would be more effort than just using NLP in the first place.
What you are trying to do is perform entity extraction, and also location awareness. Not only that, but relationships between entities and locations. A very tall order if you are not going to use any NLP.

Resources