Making Solr understand English - solr

I'm trying to set up Solr so that it understands English. For example, I've indexed our company website (www.biginfolabs.com), but it could be any other website or our own data.
If I enter English-like queries, I should get a one-word answer, just as Google does. Example queries:
Where is India located.
who is the father of Obama.
What I have tried so far:
Integrated UIMA and Mahout with Solr (person name and city name extraction is done).
I read the book "Taming Text" and implemented https://github.com/tamingtext/book, but did not get what I want.
Can anyone please tell me how to move further? It can be anything; our team is ready to do it.

This task is called Named Entity Recognition. You can look up this tutorial to see how they use Solr for extracting atomic elements in text and classifying them into predefined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc., and then learning a model to answer queries.
Also have a look at Stanford NLP for more ideas on algorithms that you can use.

Related

Solr multilingual search

I'm currently working on a project where we have indexed text content in Solr. Each piece of content is written in one specific language (we have 4 different European languages), but we would like to add a feature: if the primary search (the search text entered by the user) doesn't return many results, we also look for documents in other languages. Thus we would somehow need to translate the query.
Our starting point is that we can have a mapping list of translated words commonly used in the field of the project.
One solution that came to me was to use the synonym search feature. But this might not provide the best results.
Do people have pointers to existing modules that could help us achieve this multilingual search feature? Or design ideas we could investigate?
Thanks
It seems like multi-lingual search is not a unique problem.
Please take a look
http://lucene.472066.n3.nabble.com/Multilingual-Search-td484201.html
and
Solr index and search multilingual data
Those two links suggest having a dedicated field for each language, but you can also have a field that states the language and add a filter query (&fq=) for the language you have detected from the user query. This is a more scalable solution, I think.
One option would be to translate your terms at index time. This could probably be done at the Solr level, or even before Solr at the application level. You would then store the translated texts in different fields, so you would have fields like:
text_en: "Hello",
text_fi: "Hei"
Then you can just query text_en:Hello and it would match.
And if you want to score primary-language matches higher, you could have a primary_language field and boost documents where it matches the search language.
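A minimal sketch of the two approaches above, expressed as Solr request parameters. The field names text_en and primary_language come from the answer; the fields text and language in the first function, the boost factor, and the helper names are assumptions made for illustration. The second function assumes the edismax query parser is used so that qf and bq apply.

```python
from urllib.parse import urlencode

def language_filter_params(user_query, detected_lang):
    """One shared text field plus a language field, restricted via a filter query.

    Assumes a hypothetical schema with a 'language' field on every document.
    """
    return {
        "q": user_query,
        "fq": f"language:{detected_lang}",
    }

def per_language_field_params(user_query, search_lang, boost=2.0):
    """Search the per-language field and boost documents whose primary
    language matches the search language (boost value is arbitrary)."""
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": f"text_{search_lang}",
        "bq": f"primary_language:{search_lang}^{boost}",
    }

# Build the query strings that would be appended to the select handler URL.
print(urlencode(language_filter_params("hello", "en")))
print(urlencode(per_language_field_params("hello", "en")))
```

The filter query only narrows the result set and does not affect scoring, whereas the boost query keeps documents in other languages in the results but ranks them lower.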

short text syntactic classification

I am a newbie at machine learning and data mining. Here's the problem: I currently have one input variable, a short text made up of non-standard nouns, and I want to classify it into a target category. About 40% of the entire dataset is labeled training data. We would like to classify the remaining 60% as accurately as possible. The following are some input values, across multiple observations, that are assigned the 'LEAD_GENERATION_REPRESENTATIVE' title.
"Business Development Representative MFG"
"Business Development Director Retail-KK"
"Branch Staff"
"Account Development Rep"
"New Business Rep"
"Hong Kong Cloud"
"Lead Gen, New Business Development"
"Strategic Alliances EMEA"
"ENG-BDE"
I think the above gives an idea of what I mean by non-standard nouns. I can see a few meaningful tokens here, like 'development', 'lead', 'rep'. Others seem random, without any semantics, but they may appear multiple times in the data. Another thing is that some tokens, like 'rep' and 'account', can appear in multiple categories. I think that will make weighting/similarity a challenging task.
My first question is: "is it worth automating this kind of classification?"
Second: "is it a good problem for learning machine learning classification?" There are only 30k such entries and a handful of target categories. I can find someone to do that manually, which will also be more accurate.
Here's my take on this problem so far:
Full-text engine: use something like Solr to build an index and query rules that draw matches based on tokens: words, phrases, synonyms, acronyms, descriptions. I can get someone to define a detailed taxonomy for each category. Use boosting and a pluggable scoring library.
Machine learning:
Naive Bayes classification
Decision tree
SVM
I have tried out Solr for this, with reverse lookup though, since I don't have a taxonomy available at the moment. It seems I can get about 80% true positives (I'll have to dig more into the confusion matrix to reduce false positives). My query is a bunch of boolean terms and phrases with proximity and boosts, plus negations to reduce errors. I'm afraid this approach may overfit and won't scale.
I am aware that people usually try multiple modeling techniques to find which one works best, or derive a combination of techniques. I want to understand this problem from a feasibility and complexity point of view. If it's too broad a question, please just comment on the feasibility of a solution.
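To make the Naive Bayes option concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over title tokens, written with the standard library only. The LEAD_GENERATION_REPRESENTATIVE titles are taken from the question; the SOFTWARE_ENGINEER category and its titles are invented purely so that there is a second class to compare against. It is an illustration of the technique, not a tuned model.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(title):
    """Lowercase the title and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", title.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()               # label -> number of titles
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.vocab = set()

    def fit(self, titles, labels):
        for title, label in zip(titles, labels):
            self.class_docs[label] += 1
            for tok in tokenize(title):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, title):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_tokens = sum(self.token_counts[label].values())
            for tok in tokenize(title):
                count = self.token_counts[label][tok]
                score += math.log((count + 1) / (total_tokens + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

LEAD = "LEAD_GENERATION_REPRESENTATIVE"
ENG = "SOFTWARE_ENGINEER"  # hypothetical second category, not from the question

training = [
    ("Business Development Representative MFG", LEAD),
    ("Account Development Rep", LEAD),
    ("New Business Rep", LEAD),
    ("Lead Gen, New Business Development", LEAD),
    ("Senior Software Engineer", ENG),    # made-up example
    ("Software Developer Backend", ENG),  # made-up example
    ("QA Engineer", ENG),                 # made-up example
]

model = NaiveBayes().fit([t for t, _ in training], [l for _, l in training])
print(model.predict("Business Development Rep"))
print(model.predict("Software Engineer"))
```

Tokens such as 'rep' or 'account' that occur in several categories simply contribute similar likelihoods to each of them, so they carry little weight in the decision; that is one reason this model is a reasonable first baseline for such data.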

How to get a trained Watson natural language classifier to NOT pick up a class?

When using the nice demo at http://watson-on-classifier.mybluemix.net, you sometimes get the answer "Sorry, I don't understand the question. Please try to rephrase it." when your question is not related to any of the supported themes.
I don't understand how to do this using the Watson natural language classifier: it seems to me that whatever the input, it chooses one of the classes it has been trained for... How do you reject some inputs as "does not match any of the classes with enough confidence"?
Thanks for your help.
Roughly speaking, what NLC does behind the scenes (I guess) is to try to correlate one statement with another based on concepts parsed from the input text and calculated using some ontology, so it can find synonyms or concepts that are "kind of" or "part of" other concepts.
So, for an entry to be rejected, I can see 3 possible cases:
the entry has no correlation to any of the data used in the classifier because the concepts are too far from the concepts of the training data, in the ontology
the entry has equal correlation to more than one category, so the system can't tell if it belongs to one or another
the entry has correlation with one category, but the confidence level is too low, so it does not satisfy some threshold defined by the system
NLC will always return answers in order of confidence. The demo has been set up so that if intents fall below a certain level of confidence, it will not return an answer.
This threshold is defined by the person writing the application.
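A minimal sketch of such an application-side check. It assumes the classifier result has already been parsed into a list of dicts with class_name and confidence keys; the threshold and margin values are arbitrary placeholders to be tuned on real data, and the function name is made up for illustration.

```python
def pick_class(classes, min_confidence=0.7, min_margin=0.1):
    """Return the top class name, or None when the input should be rejected.

    classes: list of {"class_name": str, "confidence": float} (assumed shape).
    """
    if not classes:
        return None
    ranked = sorted(classes, key=lambda c: c["confidence"], reverse=True)
    top = ranked[0]
    # Reject when even the best class is not confident enough.
    if top["confidence"] < min_confidence:
        return None
    # Reject when the two best classes are too close to tell apart.
    if len(ranked) > 1 and top["confidence"] - ranked[1]["confidence"] < min_margin:
        return None
    return top["class_name"]

answer = pick_class([
    {"class_name": "weather", "confidence": 0.40},
    {"class_name": "sports", "confidence": 0.35},
])
if answer is None:
    print("Sorry, I don't understand the question. Please try to rephrase it.")
```

The margin check covers the case where an entry correlates about equally with more than one category; the confidence check covers the case where the best match is simply too weak.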

Automatic product classification and query weighting

I'm facing ranking problems using Solr and I'm stuck.
Given an e-commerce site, for the query "ipad" I obtain:
ipad case for ipad 2
ipad case
ipad connection kit
ipad 32gb wifi
This is a problem, since we want to rank the main products (the products themselves) first, and tf/idf ranks the accessories first due to descriptions like "ipad case compatible with ipad, ipad2, ipad3, ipad retina, ipad mini, etc".
Furthermore, using the categories we have no way of determining whether an item is an accessory or a product.
I wonder if using automatic classification would help. Any other solution that improves this ranking (like Named Entity Recognition) would be appreciated.
Could you provide tagged data?
If you have >50k items, a Naive Bayes with a bigram language model trained on the product name will catch almost all accessories with 99% accuracy. I guess you can train such a Naive Bayes with Mahout; however, product names have a pretty limited number of bigrams, so this can be trained easily and fast even on a smartphone nowadays.
This is a typical mechanical turk task, shouldn't be that expensive to tag a few items. However if you insist on some semi-supervised algorithm, I found Iterative similarity aggregation pretty useful.
The main idea is that you give a few tokens like "case"/"power adapter" and it iteratively finds new tokens that are indicators of spam (in your case, accessories) because they appear in the same context.
Here is the paper, but I have written a blogpost about this as well which sums up the intention in plain language. This paper also mentions the same "let the user find the right item" paradigm that Sean has proposed, so both can be used in conjunction.
Oh, and if you need some advice on machine learning with Lucene & Solr, I can recommend the talk by my friend Tommaso Teofili at ApacheCon Europe this year. You can find the slides on SlideShare. There is also a YouTube video of the talk out there, just search for it ;)
TF/IDF is just going to rank based on the words in the query vs words in the title as you have found. That sounds like it is not the right definition of "good result" and that you want to favor products over accessories.
Of course you can simply attach heuristics to patch the problem. For example, consider the title as a set of words, not a multiset, so the appearance of "iPad" several times makes no difference. Or just boost the score of items that you know are products. These aren't learning per se, but they are simple, directly reflect your business knowledge, and probably have some positive effect.
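The two heuristics can be sketched as a small re-ranking step outside the search engine. The is_product flag, the boost factor, and the function names are assumptions for illustration: the flag stands in for whatever business knowledge tells you an item is a main product.

```python
import re

def distinct_matches(query, title):
    """Count query terms present in the title, treating the title as a set
    so that repeating "ipad" several times makes no difference."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    title_terms = set(re.findall(r"\w+", title.lower()))
    return len(query_terms & title_terms)

def rerank(query, items, product_boost=2.0):
    """Sort items by set-based match count, boosting known products."""
    def score(item):
        boost = product_boost if item["is_product"] else 1.0
        return distinct_matches(query, item["title"]) * boost
    # sorted() is stable, so ties keep their original order.
    return sorted(items, key=score, reverse=True)

items = [
    {"title": "ipad case for ipad 2", "is_product": False},
    {"title": "ipad case", "is_product": False},
    {"title": "ipad connection kit", "is_product": False},
    {"title": "ipad 32gb wifi", "is_product": True},  # hypothetical flag
]

for item in rerank("ipad", items):
    print(item["title"])
```

With these inputs the product moves to the top, and the accessories keep their original relative order because they all match the single query term exactly once.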
If you want to learn here, you probably need to use the one best source of knowledge about what the best results are: your users. You know what they click in response to each query. You can learn a term-item model that associates search terms to items clicked. You can view that as many types of problem -- actually a latent-factor recommender model could work well there.
Have a look at Ted's slides on how to use a recommender as a "search engine": http://www.slideshare.net/tdunning/search-as-recommendation
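As a much simpler stand-in for the term-item model described above (plain click counting, not a latent-factor recommender), the idea can be sketched like this. The click log format, a list of (query, clicked item) pairs, and the function names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_click_model(click_log):
    """Count, for every search term, how often each item was clicked."""
    model = defaultdict(Counter)
    for query, item in click_log:
        for term in query.lower().split():
            model[term][item] += 1
    return model

def rank_by_clicks(model, query, candidates):
    """Order candidate items by the clicks associated with the query terms."""
    terms = query.lower().split()
    def score(item):
        return sum(model[term][item] for term in terms if term in model)
    return sorted(candidates, key=score, reverse=True)

# Hypothetical click log: users searching "ipad" mostly click the tablet.
click_log = [
    ("ipad", "ipad 32gb wifi"),
    ("ipad", "ipad 32gb wifi"),
    ("ipad case", "ipad case"),
]

model = build_click_model(click_log)
print(rank_by_clicks(model, "ipad", ["ipad case", "ipad 32gb wifi"]))
```

A real system would need to handle sparse queries and position bias in clicks, which is where a proper recommender model earns its keep.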

Tips on how to improve full text search for search engine

I'm developing: http://www.buscatiendas.com.mx
I've seen people entering queries with lots of typos.
What kind of search could I implement so that similar words are found?
Something more or less like what Google does would be neat.
I'm using SQL Server Full Text search.
Why don't you have Google/Bing index it for you and just use that via the site: feature they provide?
If that is not an option, you might have to have a 'spell checker' of your own (either implement it yourself or use an existing one), trained on the data you have. Note that spell checking is not deterministic (e.g. for "latel", is it "label" or "later"?). You can only make a 'best' guess based on the actual data you have on your site.
There are probabilistic models where you can 'train' your spell guesser/checker to come up with a 'best' guess.
The following page seems pretty useful. It has a description on how to write one yourself, and also has good links (including a survey paper) and links to implementations in different languages:
http://norvig.com/spell-correct.html.
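A condensed sketch in the spirit of that page: word frequencies are learned from your own site text, and a misspelled query word is replaced by the most frequent known word within one edit. It only considers a single edit and the letters a-z, which is an assumption made to keep the example short; a Spanish-language site would need accented letters in the alphabet as well. The tiny corpus is made up.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"  # assumption: extend for accented letters

def train(text):
    """Count how often each word occurs in the site's own text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the best guess for word given the trained counts."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing close enough: leave the word unchanged
    return max(candidates, key=counts.get)

# Made-up corpus in which "later" is more frequent than "label".
counts = train("label later later printer")
print(correct("latel", counts))
```

Both "label" and "later" are one edit away from "latel"; the guess falls on whichever is more common in the training text, which is exactly the 'best guess based on your data' behaviour described above.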
There are two ways to solve this:
Buy a 3rd party product, like a Google Search Appliance, or one of Microsoft's search servers.
Log all queries and have someone review them, making a table which links the bad queries to what they should be. (It's possible you could buy a component library which does this, much like a spelling checker.)
If you want to roll your own, first you need to filter out noise words before you even start searching, because they may just impose unnecessary load on your database. Should "a good book" be the same as searching for "the good book" or "his good book" or "good and bad reviews on a book"? Obviously, "a", "the", "an", "and", etc. do not qualify at all as "useful" search keywords. Once you have the noise filtered out, you start the real searching. Again, you should consider database performance: is it wise to search a dynamic database or a pre-processed database? Figure out a way to filter the noise words out of the search data too.
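A minimal sketch of the noise-word filtering step, applied to the query before it reaches the database. The word list contains only the examples named above; a real list would be longer and language-specific, and the function name is made up.

```python
NOISE_WORDS = {"a", "an", "and", "the"}  # only the examples named above

def strip_noise(query):
    """Lowercase the query and drop noise words, keeping the word order."""
    return [term for term in query.lower().split() if term not in NOISE_WORDS]

print(strip_noise("a good book"))
print(strip_noise("the good book"))
```

Both queries reduce to the same two keywords, so they can be served by the same lookup.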
