Singular/plural keyword search not working - solr

I am facing a problem with singular and plural keyword search.
For example, if I search for "men", it should return documents containing "men" and also "man". However, it is not working.

The easiest way is to use a SynonymFilter with those terms that you're aware of - the hard part is thinking of every alternative.
While you usually use stemming to reduce words to a common stem, this problem is known as lemmatization - where you're interested in the different inflected forms of a word, and not the common stem.
For Solr your best bet is probably to go for something like the Solr Lemmatizer by Nicholas Ding.
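To illustrate the synonym-based workaround, here is a minimal schema sketch; the field type name and the synonyms.txt entry are just examples, and newer Solr versions use SynonymGraphFilterFactory in place of SynonymFilterFactory:

```xml
<!-- schema.xml sketch: expand query terms via a hand-maintained synonym list -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- synonyms.txt would contain a line such as: man,men -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

The hard part, as noted above, is that the synonym list only covers the alternatives you thought of.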

Related

Is it possible to use only one field to index traditional and simplified Chinese?

The official Solr documentation suggests keeping Simplified and Traditional Chinese separate by using different tokenizers. I wonder if people use the ICU Transform Filter to do Traditional <-> Simplified conversion and are then able to have one unique field for both variants of Chinese.
At the same time, this conversion seems to be a really hard task, and it doesn't appear to be a solved problem.
The simple question is what is the recommended way of indexing traditional and simplified Chinese in Solr? It would be really convenient to have a unique field for both, but I couldn't find a good success case for that.
The truth is, it is possible. This video shows how you could create a field with as many languages as possible, but it looks tricky.
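For what it's worth, a single-field setup along the lines discussed above might look like this sketch, which normalizes Traditional characters to Simplified at both index and query time. It requires Solr's analysis-extras contrib (the ICU jars), and the field type name is illustrative:

```xml
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- "Traditional-Simplified" is a standard ICU transliterator ID -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs on both sides, a Traditional query and a Simplified query end up matching the same normalized tokens in the one field.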

HebMorph with solr: how to use stopwords

I am developing an application that supports indexing & searching of multi-language texts, including Hebrew, using the Solr engine.
After lots of searching, I found that HebMorph is the best plugin to use for the Hebrew language.
My problem is that the behavior of HebMorph with Hebrew stopwords seems to be different from Solr's:
With Solr (any language): when I search for a stopword, the returned results don't include any of the stopwords existing in the query.
Whereas when I search for Hebrew terms (after plugging HebMorph into Solr following this link), the returned results include all stopwords existing in the query.
1) Is this the normal behavior for HebMorph? If yes, how can I alter it? If no, what should I change?
2) Since HebMorph doesn't support synonyms (their documentation lists it as future work), is there a way to use synonyms for Hebrew the way Solr supports them for other languages (i.e., by adding the proper filter in the configuration and pointing it at a synonyms file)?
Thanks in advance for your help.
I'm the author of HebMorph.
Stopwords are indeed supported, but you need to filter them out before the lemmatizer kicks in. Assuming a recent version of HebMorph, your stopwords filter needs to come right after the tokenizer, which means it also needs to handle בחל"מ prefix letters attached to the stopwords.
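As an ordering sketch only (the tokenizer and lemmatizer class names below are placeholders, not HebMorph's real class names - take those from your HebMorph distribution), the stopword filter would sit between the tokenizer and the lemmatizer:

```xml
<fieldType name="text_he" class="solr.TextField">
  <analyzer>
    <!-- placeholder class name; use your HebMorph tokenizer here -->
    <tokenizer class="com.example.HebrewTokenizerFactory"/>
    <!-- stopwords removed before the lemmatizer expands tokens -->
    <filter class="solr.StopFilterFactory" words="stopwords_he.txt" ignoreCase="true"/>
    <!-- placeholder class name; use your HebMorph lemmatizer filter here -->
    <filter class="com.example.HebrewLemmaFilterFactory"/>
  </analyzer>
</fieldType>
```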
The general advice nowadays, for all languages, is NOT to drop stopwords - at least not in indexing, so I'd recommend not applying a stop-words filter here either.
With regards to synonyms - the root issue is that the HebMorph lemmatizer at times expands a word to multiple lemmas, which makes applying synonyms a bit more challenging. With the (relatively) new graph-based analyzers this is now possible, so we will likely implement that too, and Lucene's synonym filters will be supported OOTB.
In the commercial version there is already a way to customize word lists and override dictionary definitions, which is useful in an ambiguous language like Hebrew. Many use this as their way of creating synonyms.

Sitecore AdvancedDatabaseCrawler advantages/benefits

I tried using Sitecore.Search namespace and it seems to do basic stuff. I am now evaluating AdvancedDatabaseCrawler module by Alex Shyba. What are some of the advantages of using this module instead of writing my own crawler and search functions?
Thanks
Advantages:
You don't have to write anything.
It handles a lot of the code you need to write to even query Sitecore, e.g. basic search, basic search with field-level sorting, field-level searches, relation searches (GUID matches for lookup fields), multi-field searches, numeric range and date range searches, etc.
It handles combined searches with logical operators.
You can access the code.
This video shows samples of the code and front-end running various search types.
Disadvantages:
None that I can think of, because if you find an issue or a way to extend it, you have full access to the code and can amend it to your needs. I've done this before by implementing the GetHashCode() and Equals() methods for the SkinnyItem class.
First of all, the "old" way of accessing the Lucene index was very simple, but unfortunately it has been deprecated since Sitecore 6.5.
The "new" way of accessing the Lucene index is very complex as the possibilities are endless. Alex Shyba's implementation is the missing part that makes it sensible to use the "new" way.
Take a look at this blog post: http://briancaos.wordpress.com/2011/10/12/using-the-sitecore-open-source-advanceddatabasecrawler-lucene-indexer/
It's a 3 part description on how to configure the AdvancedDatabaseCrawler, how to make a simple search and how to make a multi field search. Without Alex's AdvancedDatabaseCrawler, these tasks would take almost 100 lines of code. With the AdvancedDatabaseCrawler, it takes only 7 lines of code.
So if you are in need of an index solution, this is the solution to use.

Is there any lucene/solr spell checker which can handle space insertions/removal typos?

As far as I know, almost all of them do spell checking based on a single query term and are unable to make changes to the whole input query to increase coverage over the corpora. I found one in LingPipe, but it is very expensive... http://alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html
So my question is: what is the best Apache alternative to a LingPipe-like spell checker?
The spellcheckers in lucene treat whitespace like any other character. So in general you can feed them your query logs or whatever, and spellcheck/autocomplete full queries.
For Lucene this should just work; for Solr you need to ensure the QueryConverter doesn't split up your terms... see https://issues.apache.org/jira/browse/SOLR-3143
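As a sketch of what that might look like in solrconfig.xml (the component and field names here are illustrative; SuggestQueryConverter is the converter added under SOLR-3143 that passes the input through without splitting it):

```xml
<!-- suggester fed from a field containing full logged queries -->
<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup</str>
    <str name="field">logged_query</str>
  </lst>
</searchComponent>

<!-- keep whole queries (including whitespace) intact instead of splitting into terms -->
<queryConverter name="queryConverter"
                class="org.apache.solr.spelling.SuggestQueryConverter"/>
```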
On the other hand, these suggesters currently work on the whole input, so if you want to suggest queries that have never been searched before, instead you want something that maybe only takes the last N words of context similar to http://googleblog.blogspot.com/2011/04/more-predictions-in-autocomplete.html.
I'm hoping we will provide that style of suggester soon also as an alternative, possibly under https://issues.apache.org/jira/browse/LUCENE-3842.
But keep in mind, that's not suitable for all purposes, so I think it's likely just going to be an option. For example, if you are doing e-commerce there is no sense in suggesting products you don't sell :)

Using the trailing 's while indexing in Solr

I am trying to implement a sane way to search using Solr, but I am getting stuck at a particular place. I am indexing a bunch of company names; let's say one of them is Lowe's. Now when someone types lowes, I want a result to show up, but I am unable to get this functionality working. Does anyone know how to get this working?
The problem is, if you manage to configure your analyzers to do it one way (i.e., searching lowes and matching Lowe's), you'll most probably break the other way (i.e., searching lowe's and getting Lowe's).
One quick workaround that doesn't need black magic with your schema is fuzzy searching. Try searching for lowes~.
One possible solution might be to add them to the synonyms text file. The WordDelimiterFilterFactory documentation also mentions a way to treat the trailing 's by removing it, but that is probably not what you want.
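For reference, here is one schema sketch of the WordDelimiterFilterFactory route (the attribute values are ones I'd try, not a verified recipe): splitting on the apostrophe with catenateWords="1" produces a combined lowes token at index time, so a query for lowes can match Lowe's, provided the same analyzer runs at index and query time:

```xml
<fieldType name="text_names" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "Lowe's" -> parts "Lowe", "s", plus catenated "Lowes";
         stemEnglishPossessive="0" keeps the 's part so catenation sees it -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" stemEnglishPossessive="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```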
