Solr: stop words entirely stop the search, should just be disregarded

I have datasets in CKAN with titles like "Application for funding". A search for "Application for funding" returns results, but a search for "Application for" returns none. Is there a way to have stop words disregarded while the other words are still used for the search? "Application for" should search as though just "Application" had been entered. It's not clear which combination of configuration settings will accomplish this.
I haven't tinkered much with the Solr configuration for fear of breaking it in a significant way, so I'm hoping someone will know which specific setting or parameter is causing this. Is there a proper way to configure StopFilterFactory so that stop words are ignored while the words in the query that are NOT stop words pass through?

Related

Querytype=Full and searching for stop words returns no results

We are using Azure Cognitive Search with the full query syntax. When searching for something like "the document", we create a query like this (a simplified example):
(Title:the OR Contents:the) AND (Title:document OR Contents:document)
(we need to split up the query for unrelated reasons)
The problem is that "the" could be a stop word in the language we are searching in (we search in several languages), which causes the entire query to fail. We would like either to leave stop words out when generating queries like this, or to have the search engine simply return true for the clauses that contain only a stop word.
I figure the latter is not possible (or is it?). Is there a way to look up the stop words for specific language analyzers so we can exclude them ourselves? Or is there a way to alter our query so that it handles stop words better?
If you want to strip stop words from your search query, the only thing I can think of is calling the analyzer with the search query and checking the returned tokens.
In this example you would call the en.microsoft analyzer with the search query "the document".
The returned tokens contain only "document", so you know the analyzer treats "the" as a stop word. When searching in multiple languages, though, you may need to call several analyzers and strip the stop words for each language.
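The idea can be sketched in Python roughly as follows. This is only an illustration under assumptions: it assumes the analyzer returns a list of tokens that each carry a `startOffset` into the original text, and the `analyze` callable, `fake_en_analyzer`, and both helper names are hypothetical. A real version would replace the fake analyzer with a call to the service.

```python
import re
from typing import Callable, Dict, List, Sequence

Token = Dict[str, object]


def non_stop_terms(query: str, analyze: Callable[[str], List[Token]]) -> List[str]:
    """Keep only the words of `query` for which the analyzer emitted a token."""
    # Assumption: each token reports where it started in the original text.
    kept_starts = {int(tok["startOffset"]) for tok in analyze(query)}
    return [m.group(0) for m in re.finditer(r"\S+", query) if m.start() in kept_starts]


def build_query(terms: Sequence[str], fields: Sequence[str] = ("Title", "Contents")) -> str:
    """Build one (field:term OR field:term) clause per term, joined with AND."""
    clauses = ["(" + " OR ".join(f"{f}:{t}" for f in fields) + ")" for t in terms]
    return " AND ".join(clauses)


def fake_en_analyzer(text: str) -> List[Token]:
    """Hypothetical stand-in for an English analyzer: drops a few stop words."""
    stop_words = {"the", "a", "of"}
    return [
        {"token": m.group(0).lower(), "startOffset": m.start(), "endOffset": m.end()}
        for m in re.finditer(r"\S+", text)
        if m.group(0).lower() not in stop_words
    ]


terms = non_stop_terms("the document", fake_en_analyzer)
print(build_query(terms))  # (Title:document OR Contents:document)
```

For several languages, the same function could be called once per analyzer and a word kept only if every analyzer kept it; whether that is the right policy depends on how the index fields are analyzed.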

Why solr.SnowballPorterFilterFactory cuts last letter of search term if protword file is empty?

I have a Solr schema that uses solr.SnowballPorterFilterFactory. When I run admin/analysis, I see that for the query "iphone" I get "iphon" after SnowballPorterFilterFactory, even though the file specified in the schema (protwords_ro.txt) is empty.
When I remove the filter, the term text remains "iphone". Since my protwords_ro.txt file is empty I don't really need that filter right now, but I was wondering why this is happening.
Actually, this filter is for stemming.
In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form
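So, for example, for the word "resume" this filter will give "resum".
Also,
The Snowball stemmers rely on algorithms and are considered fairly aggressive
I think this is why you got "iphon" even though your text file is empty. The protwords file only lists words that should be protected from stemming, so an empty file protects nothing and every term is still stemmed.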
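To make the trailing-"e" behaviour concrete, here is a small Python sketch of one rule from the classic Porter algorithm (step 5a, which drops a final "e" from sufficiently long stems). It is an illustration only: the question does not say which Snowball language is configured, so treating the English Porter rule as representative is an assumption, and the function names are made up for this example.

```python
def _is_consonant(word: str, i: int) -> bool:
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not _is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-to-consonant transitions, i.e. m in the form [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = _is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def ends_cvc(stem: str) -> bool:
    """True if the stem ends consonant-vowel-consonant and the last letter is not w, x or y."""
    if len(stem) < 3:
        return False
    n = len(stem)
    return (
        _is_consonant(stem, n - 3)
        and not _is_consonant(stem, n - 2)
        and _is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def step_5a(word: str) -> str:
    """Drop a final 'e' when the remaining stem is long enough."""
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word


print(step_5a("iphone"))  # iphon
print(step_5a("resume"))  # resum
print(step_5a("rate"))    # rate
```

The rule is purely mechanical and knows nothing about brand names, which is why a word has to be listed in the protected-words file if it should be left alone.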

Sunspot/Solr: word concatenation

I'm using Solr with the Sunspot Ruby gem. It works great, but I'm noticing that sometimes users will get poor search results because they have concatenated their search terms (e.g. 'foolproof') where the document text was 'fool proof'. Or vice-versa.
I was going to try and address this by creating a set of alternate match fields by manually concatenating the words from the source documents together. This seems kind of hackish, and implementing the other side (breaking up user concatenations into words) is not obvious.
Is there a way to do this properly in Solr/Sunspot?
Did you have a look at the Solr SpellCheckComponent?
http://wiki.apache.org/solr/SpellCheckComponent
For example, there is a WordBreakSolrSpellChecker, which may provide valid suggestions in such a case.
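As a rough illustration of what a word-break suggester does, the following Python sketch splits or joins terms against a vocabulary of indexed words. It is not Solr's implementation; the `vocabulary` set and both function names are hypothetical and only show the general idea.

```python
from typing import List, Sequence, Set, Tuple


def break_word(word: str, vocabulary: Set[str], min_len: int = 2) -> List[Tuple[str, str]]:
    """Return every two-way split of `word` where both halves are known terms."""
    return [
        (word[:i], word[i:])
        for i in range(min_len, len(word) - min_len + 1)
        if word[:i] in vocabulary and word[i:] in vocabulary
    ]


def combine_words(words: Sequence[str], vocabulary: Set[str]) -> List[str]:
    """Return concatenations of adjacent words that are themselves known terms."""
    return [a + b for a, b in zip(words, words[1:]) if a + b in vocabulary]


indexed_terms = {"fool", "proof", "foo", "foolproof"}
print(break_word("foolproof", indexed_terms))          # [('fool', 'proof')]
print(combine_words(["fool", "proof"], indexed_terms))  # ['foolproof']
```

Suggestions like these can then be offered to the user or used to rerun the query, which avoids maintaining manually concatenated copies of every field.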

ElisionFilter before WordDelimiterFilter

On this Solr documentation page I see the following comment:
Note: Its probably best to use the ElisionFilter before
WordDelimiterFilter. This will prevent very slow phrase queries.
http://wiki.apache.org/solr/LanguageAnalysis#French
Can someone please explain to me why it could lead to slow phrase queries?
Actually, my WordDelimiterFilter configuration works fine and I don't think I need the ElisionFilter, since it's somehow already covered by the WordDelimiterFilter configuration.
I just wonder what the impact on performance is...
Based on SOLR-1938, if ElisionFilter comes before WordDelimiterFilter, then l'avion generates only one token: avion. Without ElisionFilter, depending on your WordDelimiterFilter settings, it could generate more than one token, such as
l, avion, lavion
Since WordDelimiterFilter generates avion anyway, it looks as though ElisionFilter were already included in it.
I guess the comment about slow phrase queries means that, without ElisionFilter, a search for l'avion has to match more than one token.
Update: This post nails the problem: http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance where it says:
"What we discovered is that the word “l’art” was being searched as a phrase query “l art”. Phrase queries are much slower than Boolean queries because the search engine has to read the positions index for the words in the phrase into memory and because there is more processing involved."
So I would guess the problem shows up for a search in double quotes like "l'avion".
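The difference in token output can be mimicked with a small Python sketch. Both functions are simplified stand-ins written for this illustration, not the real filters: the article list is an assumed subset, and the splitter only breaks on non-word characters.

```python
import re
from typing import List

# Assumed subset of French elided articles; the real filter's list is configurable.
ARTICLES = {"l", "d", "j", "m", "t", "s", "n", "c", "qu"}


def elision(token: str) -> str:
    """Strip a leading elided article such as l' from a token."""
    head, sep, tail = token.replace("\u2019", "'").partition("'")
    if sep and tail and head.lower() in ARTICLES:
        return tail
    return token


def word_delimiter(token: str) -> List[str]:
    """Split a token on any non-word character, dropping empty parts."""
    return [part for part in re.split(r"[\W_]+", token) if part]


with_elision = word_delimiter(elision("l'avion"))
without_elision = word_delimiter("l'avion")
print(with_elision)     # ['avion']
print(without_elision)  # ['l', 'avion']
```

With one token the query is a plain term lookup; with two tokens from a single input word, the query parser may turn them into a phrase query, which is the slow case described in the quoted post.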

When stemming is enabled, searching for the root word gives no hits

I have indexed a site with solr. It works very well if stemming is not enabled. Using stemming, however, solr does not return any hits when searching for the root of a word. I use Swedish stemming.
For example, without stemming, searching for "support" gives hits. With stemming, searching for "support" gives no hits, although searching for "supporten" returns hits that match "support".
By debugging the query, I can see that it stems the word "support" to "suppor" (which is incorrect, by the way, but that should not matter). However, even with the word stemmed to "suppor", I want it to search for matches with the original query word as well.
I'd appreciate any help on this!
AFAIK, there is no way to keep the original word when stemming.
I assume that you are using solr.SnowballPorterFilterFactory. The Snowball algorithm is too aggressive.
You could try a Hunspell stemmer or maybe solr.SwedishLightStemFilterFactory.
A workaround is to reformat your query into "support support*" or "support support~", where * is wildcard matching and ~ is fuzzy matching in Lucene syntax. I know you didn't mention a need for wildcard or fuzzy search, but I found that in these cases stemming is not applied to the wildcard or fuzzy term, so "support" is preserved. Stemming is still applied to the first word, so results for both forms are returned, if any. As an added benefit, fuzzy search makes the query more tolerant of typos.
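The query rewrite in that workaround could be done on the client with a helper along these lines. This is a minimal Python sketch; the function name and the `mode` parameter are made up for illustration, and it does not escape Lucene special characters in the user's input.

```python
def with_unstemmed_variant(term: str, mode: str = "wildcard") -> str:
    """Return the term followed by a wildcard or fuzzy copy of itself."""
    suffix = {"wildcard": "*", "fuzzy": "~"}[mode]
    return f"{term} {term}{suffix}"


print(with_unstemmed_variant("support"))           # support support*
print(with_unstemmed_variant("support", "fuzzy"))  # support support~
```

Note that the wildcard form also matches longer words that merely start with the term, so it can return more results than an exact match on the original word would.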
