I have successfully implemented a Czech lemmatizer for Lucene. I'm testing it with Solr and it works nicely at index time. But it doesn't work so well when used for queries, because the query parser doesn't provide any context (the words before or after) to the lemmatizer.
For example, the phrase pila vodu is analyzed differently at index time than at query time. It contains the ambiguous word pila, which can be the noun pila (a saw, e.g. a chainsaw) or a past-tense form of the verb pít ("to drink").
pila vodu ->
Index time: pít voda
Query time: pila voda
... so the word pila is not found and is not highlighted in the document snippet.
This behaviour is documented on the Solr wiki (quoted below), and I can confirm it by debugging my code (only the isolated strings "pila" and "vodu" are passed to the lemmatizer).
... The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" seperately, ...
So my question is:
Is it possible to somehow change, configure or adapt the query parser so that the lemmatizer sees the whole query string, or at least some context around individual words? I would also like a solution that works with other Solr query parsers such as dismax or edismax.
I know that there is no such issue with phrase queries like "pila vodu" (quotes), but then I would lose the documents without the exact phrase (e.g. documents with "pila víno" or even "pila dobrou vodu").
Edit - trying to explain / answer following question (thank you #femtoRgon):
If the two terms aren't a phrase, and so don't necessarily come together, then why would they be analyzed in context to one another?
For sure, it would be better to analyze together only terms that belong together. For example, at index time the lemmatizer detects sentences in the input text and analyzes together only words from a single sentence. But how can I achieve something similar at query time? Is implementing my own query parser the only option? I quite like the pf2 and pf3 options of the edismax parser; would I have to implement them again in my own parser?
The idea behind this is in fact a bit deeper, because the lemmatizer performs word-sense disambiguation even for words that have the same lexical base. For example, the word bow has about 7 different senses in English (see Wikipedia), and the lemmatizer distinguishes between them. I would like to exploit this to make searches more precise -- to return only documents containing the word bow in the specific sense required by the query. So my question could be extended to: how do I get the correct <lemma;sense> pair for a query term? The lemmatizer can very often assign the correct sense if the word appears in its usual context, but it has no chance when there is no context.
Finally, I implemented my own query parser.
It wasn't that difficult thanks to the edismax sources as a guide and a reference implementation. I could easily compare my parser results with the results of edismax...
Solution:
First, I analyze the whole query string together. This gives me the list of "tokens".
There is a small complication with stop words: the analyzer omits them, so you do not get tokens for them directly, but you can detect them from PositionIncrementAttribute.
From "tokens" I construct the query in the same way as edismax does (e.g. creating all 2-token and/or 3-token phrase queries combined in DisjunctionMaxQuery instances).
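The steps above can be sketched roughly as follows. This is only an illustration in Python, not the actual parser (which is Java code built on the Lucene API): the (term, position increment) pairs stand in for what the analyzer emits, the function names are made up for this example, and it assumes every increment is at least 1 (no stacked synonyms).

```python
def restore_positions(tokens):
    """Rebuild the position slots of the query from (term, increment) pairs.

    An increment greater than 1 means the analyzer dropped something
    (typically a stop word) just before this term; the hole is kept as None.
    """
    slots = []
    for term, increment in tokens:
        slots.extend([None] * (increment - 1))  # one hole per skipped position
        slots.append(term)
    return slots


def phrase_shingles(slots, size):
    """Return every run of `size` adjacent terms as a quoted phrase.

    Windows that contain a hole are skipped, so a phrase never spans
    a removed stop word.
    """
    phrases = []
    for start in range(len(slots) - size + 1):
        window = slots[start:start + size]
        if None not in window:
            phrases.append('"' + " ".join(window) + '"')
    return phrases


# Hypothetical analyzer output: "water" has increment 2, so one token
# (a stop word) was removed in front of it.
slots = restore_positions([("drink", 1), ("water", 2), ("glass", 1)])
two_token_phrases = phrase_shingles(slots, 2)
three_token_phrases = phrase_shingles(slots, 3)
```

Skipping windows with a hole is one possible design choice; whether a phrase should be allowed to span a removed stop word depends on how the index-side analyzer records positions.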
Related
We are using Azure Cognitive Search with the full query syntax. When searching for something like "the document", we create a query like this (this is a simplified example):
(Title:the OR Contents:the) AND (Title:document OR Contents:document)
(we need to split up the query for unrelated reasons)
The problem is that "the" could be a stop word in the language we are searching in (we search in several languages), causing the entire query to fail. We would like to be able to ignore stop words when generating queries like this, or have the search engine simply return true for the parts of the query that search for a stop word.
I figure the latter is not possible (or is it?). Might there be a way to query the stop words for specific language analyzers so that we can exclude them ourselves? Or is there a way to alter our query so that it handles stop words better?
If you want to strip stop words from your search query, the only thing I can think of is calling the analyzer with the search query and checking the returned tokens.
In this example you would call the en.microsoft analyzer with the search query "the document".
The tokens returned only contain "document", so you know "the" is considered a stop word by the analyzer. But when searching multiple languages you might need to call multiple analyzers and strip stop words for all those languages.
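A rough sketch of that idea in Python, under a few assumptions: the token list has already been fetched from the analyzer, and each token carries a "startOffset" into the original text. The function names and the sample token list below are made up for illustration. Matching by offset rather than by token text matters because a language analyzer may return a changed form of the word.

```python
def strip_stop_words(query, tokens):
    """Keep the whitespace-separated terms of `query` that produced a token.

    `tokens` is a list of dicts with at least a "startOffset" key,
    as assumed for the analyzer response.
    """
    kept = []
    search_from = 0
    for term in query.split():
        start = query.index(term, search_from)
        end = start + len(term)
        search_from = end
        # A term survives if some token starts inside its character range.
        if any(start <= token["startOffset"] < end for token in tokens):
            kept.append(term)
    return kept


def build_query(terms, fields=("Title", "Contents")):
    """Build one (field:term OR field:term) group per term, joined by AND."""
    groups = ["(" + " OR ".join(f"{field}:{term}" for field in fields) + ")"
              for term in terms]
    return " AND ".join(groups)


# Hypothetical analyzer result for "the document": only "document" survives.
sample_tokens = [
    {"token": "document", "startOffset": 4, "endOffset": 12, "position": 1},
]
terms = strip_stop_words("the document", sample_tokens)
query = build_query(terms)
```

For several languages you would repeat the analyzer call per language and decide how to combine the results, for example by building one query per language field.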
I am trying to build a field in my Solr Schema which will be able to join words together at query time and then search for this new joined word in the index.
Let's say I have the word "bluetooth" in my index and I want it to come up in the results when I search for "blue tooth".
So far I have been unsuccessful with various combinations of ShingleFilterFactory and PositionFilterFactory, together with the keyword, standard and whitespace tokenizers.
I'm hoping someone might be able to point me in the right direction to solve this!
Your goal seems a little unusual to me. But for your specific use case the following char filter can be used:
"solr.PatternReplaceCharFilterFactory"
"pattern"="[\\W]"
"replacement"=""
It will replace "blue tooth" with "bluetooth". You can also apply that field analysis at query time only.
But let me point out that tokenization is usually used instead of concatenation. In that case, consider the WordDelimiterFilter, which can split "BlueTooth" into "blue" and "tooth" based on case changes.
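To see what the two approaches do to the text, here is a small Python imitation using plain regular expressions. It only mimics the effect and is not Solr code; the function names are made up, and the lowercasing in the second function assumes a lowercase filter follows the word delimiter filter in the analyzer chain.

```python
import re


def join_words(text):
    """Mimic the char filter: delete every non-word character."""
    return re.sub(r"\W", "", text)


def split_on_case(text):
    """Mimic splitting on case changes, then lowercase each part."""
    parts = re.findall(r"[A-Z][a-z]*|[a-z]+", text)
    return [part.lower() for part in parts]


joined = join_words("blue tooth")
split = split_on_case("BlueTooth")
```

Note that the first approach removes all whitespace, so a longer query such as "blue tooth speaker" would also collapse into a single string.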
We are using Sunspot-solr 4.0. When I update the synonyms file, nothing changes in the search results. Do I really need to re-index after making changes in synonyms.txt, or is there some other trick for updating the synonyms file that I am missing?
That depends on when you're expanding the synonyms. If you're expanding at query time, the updates will be visible without any reindexing, but if you're expanding at index time (which is the recommended way), you'll have to reindex to get the new synonyms included in the index.
The reasoning behind recommending expansion at index time compared to query time is described in the old wiki:
This is because there are two potential issues that can arrise at query time:
The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" seperately, and will not know that they match a synonym.
Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would not match the simple case of "seabiscuit" occuring in a document
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Televesion and expand="true"
Many thousands of documents containing the term "text:TV"
A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television) and the lower docFreq for text:Television will give the documents that match "Television" a much higher score then docs that match "TV" comparably -- which may be somewhat counter intuitive to the client. Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
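The idf effect in the quoted scenario can be checked with a little arithmetic. The sketch below uses the classic TF-IDF formula idf = 1 + ln(numDocs / (docFreq + 1)) from Lucene's older default similarity; the corpus size and document counts are made-up numbers chosen to match "many thousands" and "a few hundred", and it assumes no document contains both terms.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic TF-IDF inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 100_000      # hypothetical corpus size
DF_TV = 5_000           # documents containing "TV"
DF_TELEVISION = 300     # documents containing "Television"

# Query-time expansion: each term keeps its own document frequency,
# so the rarer term gets the larger idf and therefore the higher score.
idf_tv = classic_idf(DF_TV, NUM_DOCS)
idf_television = classic_idf(DF_TELEVISION, NUM_DOCS)

# Index-time expansion: every document that had either term now has both,
# so both terms share one document frequency and one idf.
df_expanded = DF_TV + DF_TELEVISION
idf_expanded = classic_idf(df_expanded, NUM_DOCS)
```

Newer Lucene versions default to BM25, whose idf formula differs, but the direction of the effect (rarer term, higher weight) is the same.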
There's a really detailed explanation of what's actually happening behind the scenes in Better synonym handling in Solr.
As long as you're aware of these issues and the trade-off, doing query time synonyms could work fine - but you'll have to test it against your queries and what you expect the results to be - and be aware of the pitfalls.
I'm using Solr 5 and need to remove stop words to prevent over-matching and to avoid bloating the index with high-frequency (low-IDF) terms. However, the corpus includes a lot of part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to be removed by the stop word filter factory, as they need to be searchable.
The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?
Thanks!
Only you can decide whether this is the right approach. If you can integrate a POS tagger and it gives you useful results, that's good.
But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or you could copyField your content to another (stored="false") field and process it without WordDelimiterFilterFactory altogether. Then you search over both copies of your data, possibly with a different boost for each field.
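Searching over both copies with different boosts could look roughly like this on the client side. It is only a sketch: the field names content and content_exact and the boost values are hypothetical, and it assumes the edismax parser with its qf parameter is used for the request.

```python
from urllib.parse import urlencode


def build_params(user_query, field_boosts):
    """Build edismax request parameters that search several fields at once."""
    qf = " ".join(f"{name}^{boost}" for name, boost in field_boosts.items())
    return {"defType": "edismax", "q": user_query, "qf": qf}


# "content" is the normally analyzed field, "content_exact" the copy
# without the word delimiter and stop word filters (both names made up).
params = build_params("123-OR-A", {"content": 1.0, "content_exact": 3.0})
query_string = urlencode(params)
```

The resulting query string can be appended to the select handler URL of your core.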
Is there a way to specify a set of terms that are more important when performing a search?
For example, in the following question:
"This morning my printer ran out of paper"
Terms such as "printer" or "paper" are far more important than the rest, and I don't know if there is a way to list these terms to indicate that, in the global knowledge, they'd have more weight than the rest of words.
For specific documents you can use QueryElevationComponent, which uses a special XML file in which you list the query terms and the doc ids that should be returned for them.
Not exactly what you need, I know.
And regarding your comment about users not caring what's underneath, you control the final query. Or, in the worst case, you can modify it after you receive it at Solr server side.
Similar: Lucene term boosting with sunspot-rails
When you build the query, you can define which values you search for and how much weight each of them has in the search.
This can be done in many ways:
Setting the boost
A boost is set by appending "^" and a number to a term, e.g. printer^2.
Using plus operator
If you put the + operator in front of a term in your query, that term becomes required: only documents that match that field value are shown in the results.
For a better understanding of Solr, it is best to get familiar with the Lucene query syntax. Refer to this link to get more info.
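Both options can be applied when the query string is built, before it is sent to Solr. A minimal sketch in Python, where the set of important terms, the boost value and the function names are assumptions made for this example:

```python
def boost_terms(question, important, boost=3):
    """Append ^boost to every word that is in the `important` set."""
    parts = []
    for word in question.split():
        if word.lower() in important:
            parts.append(f"{word}^{boost}")
        else:
            parts.append(word)
    return " ".join(parts)


def require_terms(question, important):
    """Prefix every word in the `important` set with + to make it required."""
    parts = []
    for word in question.split():
        parts.append("+" + word if word.lower() in important else word)
    return " ".join(parts)


IMPORTANT = {"printer", "paper"}  # hypothetical list of key terms
QUESTION = "This morning my printer ran out of paper"

boosted = boost_terms(QUESTION, IMPORTANT)
required = require_terms(QUESTION, IMPORTANT)
```

Boosting only changes the ranking, while + also filters out documents that lack the term, so the two are not interchangeable.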