On this Solr documentation page I see the following comment:
Note: Its probably best to use the ElisionFilter before
WordDelimiterFilter. This will prevent very slow phrase queries.
http://wiki.apache.org/solr/LanguageAnalysis#French
Can someone explain me why it could lead to slow phrase queries please?
Actually my WordDelimiterFilter configuration works file and I don't think I need the ElisionFilter since it's somehow already included in the WordDelimiterFilter configuration.
I just wonder what is the impact on performances...
Based on SOLR-1938, if you have ElisionFilter before WordDelimiterFilter, then l'avion will generate only one token avion. But if ElisionFilter is not there, then depending on the settings of your WordDelimiterFilter, it could generate more than 1 token like
l, avion, lavion
Since avion is anyway generated by the WordDelimiterFilter, you perceive it as though the ElisionFilter is already included in there.
I guess the comment about the slow phrase queries means that if l'avion is searched for, then it will search for more than one token if ElisionFilter is not there.
Update: This post nails the problem: http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance where it says What we discovered is that the word “l’art” was being searched as a phrase query “l art”. Phrase queries are much slower than Boolean queries because the search engine has to read the positions index for the words in the phrase into memory and because there is more processing involved.
so I would guess the problem is for a search in double quotes like "l'avion"
Related
I have successfully implemented a Czech lemmatizer for Lucene. I'm testing it with Solr and it woks nice at the index time. But it doesn't work so well when used for queries, because the query parser doesn't provide any context (words before or after) to the lemmatizer.
For example the phrase pila vodu is analyzed differently at index time than at query time. It uses the ambiguous word pila, which could mean pila (saw e.g. chainsaw) or pít (the past tense of the verb "to drink").
pila vodu ->
Index time: pít voda
Query time: pila voda
.. so the word pila is not found and not highlighted in a document snippet.
This behaviour is documented at the solr wiki (quoted bellow) and I can confirm it by debugging my code (only isolated strings "pila" and "vodu" are passed to the lemmatizer).
... The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" seperately, ...
So my question is:
Is it possible to somehow change, configure or adapt the query parser so the lemmatizer would see the whole query string, or at least some context of individual words? I would like to have a solution also for different solr query parsers like dismax or edismax.
I know that there is no such issue with phrase queries like "pila vodu" (quotes), but then I would lose the documents without the exact phrase (e.g. documents with "pila víno" or even "pila dobrou vodu").
Edit - trying to explain / answer following question (thank you #femtoRgon):
If the two terms aren't a phrase, and so don't necessarily come together, then why would they be analyzed in context to one another?
For sure it would be better to analyze only terms coming together. For example at the indexing time, the lemmatizer detects sentences in the input text and it analyzes together only words from a single sentence. But how to achieve a similar thing at the query time? Is implementing my own query parser the only option? I quite like the pf2 and pf3 options of the edismax parser, would I have to implement them again in case of my own parser?
The idea behind is in fact a bit deeper because the lemmatizer is doing word-sense-disambiguation even for words that has the same lexical base. For example the word bow has about 7 different senses in English (see at wikipedia) and the lemmatizer is distinguishing such senses. So I would like to exploit this potential to make searches more precise -- to return only documents containing the word bow in the concrete sense required by the query. So my question could be extended to: How to get the correct <lemma;sense>-pair for a query term? The lemmatizer is very often able to assign the correct sense if the word is presented in its common context, but it has no chance when there is no context.
Finally, I implemented my own query parser.
It wasn't that difficult thanks to the edismax sources as a guide and a reference implementation. I could easily compare my parser results with the results of edismax...
Solution :
First, I analyze the whole query string together. This gives me the list of "tokens".
There is a little clash with stop words - it is not that easy to get tokens for stop words as they are omitted by the analyzer, but you can detect them from PositionIncrementAttribute.
From "tokens" I construct the query in the same way as edismax do (e.g. creating all 2-token and/or 3-token phrase queries combined in DisjunctionMaxQuery instances).
I'm using Solr 5 and need to remove stop words to prevent over-matching and avoid bloating the index with high IDF terms. However, the corpus includes a lot part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to get removed by the stopword filter factory as they need to be searchable.
The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?
Thanks!
Only you can decide whether this is the right approach. If you can integrate POS tagger in and it gives you useful results - that's good.
But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or, you could copyField your content to another (store=false) field and process it without WordDelimiterFilterFactory all together. Then you search over both copies of your data, possibly with different boost for different fields.
I am trying to set up Solr but encountered the problem mentioned in the title. I just downloaded Solr and used the built-in example. When I used a query with words occurred in the example documents, such as "ipod". Solr worked properly. However, when I added some words that are not in these documents, such as "what". Solr does not return anything. For me, it is weird since the relevance scores should be computed to query terms separately and added up. Non-existing query term should not affect the ranking (even though the coord norm is affected, thus the scores of documents will change).
Could anyone tell me what might be the issue? Thanks.
There are several ways of configuring how you want this behavior. I'll assume that you're using the edismax query handler for these examples, although some of these also apply to the standard lucene query parser.
The reason for not always wanting "ipod what" to retrieve the same subset sa "ipod" is that you'll get a poor result set and user experience for terms that are more general than "ipod" (i.e. searching for "microsoft windows" will not be perceived as a good search result if you're showing only general hits for anything about windows - it's usually better to say "we didn't find anything" in those cases). It all depends on your use case.
First, you can do it yourself, by applying either AND or OR between terms to get the exact kind of matching you're looking for.
You can use q.op to configure wether each term should be AND-ed together (all required) or OR-ed together (any one is sufficient). This overrides the (now deprecated) value from <solrQueryParser defaultOperator=".."/> in schema.xml.
For (e)dismax, there's the mm parameter, which allows you do more specific, but in a general way, handling of how you want matches to be performed. mm allows you to say "at least 50% of the terms should match" or "if there's only two terms, both should match, but any over that should be optional" or "match everything up to four, and 75% after that".
I'm trying to query a text_general field named body for times like 9:15, 9:15pm, 9:15p, etc. I tried both of the following queries via the REST API without success:
q=body:9\:15* gives me no hits, missing docs that mention 9:15
q=body:"9:15"* gives me all docs, including docs that have nothing resembling 9:15
Debugging in Chrome, I enter these directly in the browser. I've also tried encodeURIComponent on the values to make sure the content isn't lost in HTTP translation. Same outcome either way.
I'm guessing there's a simple answer here and my mental model of how Solr queries work is just broken.
In cases like that I often do 2 things:
Turn Solr query debug on, so I can see that really goes into query. You will see extra node at the end of response.
&debug=query
Examine field analyser with Analysis tool. (url based on Solr's example core)
http://localhost:8983/solr/#/collection1/analysis?analysis.fieldvalue=9%3A30pm&analysis.query=9%3A30&analysis.fieldtype=text_general&verbose_output=0
Both methods should tell you exactly what is going wrong with your query. In second one you can check how matching work without reindexing anything.
Your time string gets tokenized following the Unicode standard annex UAX#29.
So the colon should be stripped out.
I think if you check you will see that your results should contain either 9 or 15.
I have indexed a site with solr. It works very well if stemming is not enabled. Using stemming, however, solr does not return any hits when searching for the root of a word. I use Swedish stemming.
For example, searching for support gives hits if not using stemming. Using stemming, searching for support gives no hits. Though, searching for supporten returns hits that match support.
By debugging the query, I can see that it stems the word support to suppor (which is incorrect by the way, but that should not matter). However, having the word stemmed to suppor, I want it to search for matches with the the original query word as well.
I'd appreciate any help on this!
Afaik, there is no way to keep the original word when stemming...
I assume that you are using solr.SnowballPorterFilterFactory. Snowball algorithm is too aggressive.
You should try a Hunspell stemmer or maybe solr.SwedishLightStemFilterFactory.
A workaround you can do is to reformat your query into "support support*" or "support support~". * is wildcard matching and ~ is fuzzy matching using Lucene syntax. I know you didn't mention the need to do wildcard and fuzzy search, but I found under these circumstances, the stemming on query will not take effect, so "support" is preserved. And stemming will still be effective on the first word, so both results will be returned if any. Plus, fuzzy search will help reduce the tolerance of typos in users' queries, so it's an added benefit.