Solr: how to get top keywords without common words

I am using Solr and I would like to get the top 10 keywords across my whole dataset, without common words (like "I", "go", "the"...).
I used "facet.excludeTerms", but there are too many common words to list them all in the query.
For now, I use the facet parameters in my query:
http://localhost:8983/solr/<my_core>/select?facet=true&facet.field=content&facet.limit=10&facet.mincount=1&facet.excludeTerms=I,go,the&q=content:(%2A)&rows=1
My dataset can contain data in many different languages (English, French, Spanish...), so I can't use the OpenNLPTokenizer in my schema: it is language-specific, and I don't know in advance which language the incoming data will be in.
I'm also trying something with tf-idf, but with no luck so far:
http://localhost:8983/solr/<my_core>/select?q=*:*&fl=score,idf(content,'covid'),idf(content,'and'),tf(content,'covid'),tf(content,'and')&debugQuery=true
I don't understand what the idf values mean:
"covid" gets 5.2 -> interesting word - OK
"and" gets 7.3 -> common word - KO
There is no really big difference between the two values, so how can I use them?
And all the tf values are 0 :(.
Any ideas, please?

Related

Avoiding keyword stuffing in SOLR

I'm looking for a way to limit (or eliminate) the effect of "keyword stuffing" in Solr. (We're currently running a Solr 6.2.0 server.)
I've tried setting omitTermFreqAndPositions="true", but when I do that, some queries throw phrase query errors (specifically queries with search terms such as G1966B - likely due to word splitting and such). I could go down the road of disabling the word splitting and try to avoid the phrase query errors, but this is simply going to mess up more things than I'm trying to fix.
Does anyone have any suggestions on how to limit the effect of multiple keyword matches in a single field?
Example: If we have a description field with something like this:
BrandX 1200 Series G1924B LC/MSD SL XBC System.
This BrandX 1200 Series G1924B ( G 1924 B , G1924 B , G 1924B ) LC/MSD SL XBC System is in excellent condition.
When someone does a search for "G1924B" I would like to avoid scoring this document higher just because it happens to have G1924B (or a variation of that) in there several times.
In theory someone could repeat the keyword many times in their description to try to trick the system into ranking their search results higher.
Any suggestions?
Thanks!
This turns out to be a more frequent requirement than you might initially think.
If you remove both term freq and positions, you lose phrase search capability.
I would recommend writing a custom similarity that ignores TF (term frequency).
At the moment the default BM25 similarity takes TF into consideration.
You can just extend that class and adjust the score calculation to treat TF as a constant.
e.g. org.apache.lucene.search.similarities.BM25Similarity.BM25DocScorer#score
[1] org.apache.lucene.search.similarities.BM25Similarity
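If writing Java is not an option, a configuration-only alternative (not mentioned above, but with a similar effect) is to keep BM25 and set its k1 parameter to 0, which makes the term-frequency factor constant. A minimal schema.xml sketch, assuming a dedicated field type for the stuffing-prone field:

<!-- the global similarity must be SchemaSimilarityFactory to allow per-field-type overrides -->
<similarity class="solr.SchemaSimilarityFactory"/>

<fieldType name="text_no_tf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- k1=0 removes term frequency from the BM25 score: repeating
       "G1924B" ten times then scores the same as mentioning it once -->
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">0.0</float>
  </similarity>
</fieldType>

Term frequencies and positions stay in the index, so phrase queries keep working; only the scoring changes.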

Using solr shingle filter at query time

I am trying to build a field in my Solr Schema which will be able to join words together at query time and then search for this new joined word in the index.
Let's say I have the word "bluetooth" in my index and I want it to come up in results when I search for "blue tooth".
So far I have been unsuccessful with varying combinations of ShingleFilterFactory and PositionFilterFactory, as well as the keyword, standard and whitespace tokenizers.
I'm hoping someone might be able to point me in the right direction to solve this!
Your goal looks a little obscure and strange to me, but for your specific use case the following char filter can be used:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\W]" replacement=""/>
It will replace "blue tooth" with "bluetooth", and you can apply that field analysis at query time only.
But let me tell you that usually tokenization is used instead of concatenation. Let me also offer you another filter, WordDelimiterFilter, which can split "BlueTooth" into "blue" and "tooth" based on case changes. A sketch of the concatenating field type follows.
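Putting the pieces together, a query-time-only concatenating field type might look like this (untested; the field type name is made up):

<fieldType name="text_joined" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- strip non-word characters before tokenizing: "blue tooth" -> "bluetooth" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\W]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Note that the char filter collapses the entire query string into a single token, so this only makes sense on a dedicated field used for this kind of matching.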

Solr/Lucene query lemmatization with context

I have successfully implemented a Czech lemmatizer for Lucene. I'm testing it with Solr and it works nicely at index time. But it doesn't work so well when used for queries, because the query parser doesn't provide any context (words before or after) to the lemmatizer.
For example, the phrase pila vodu is analyzed differently at index time than at query time. It contains the ambiguous word pila, which could mean pila (a saw, e.g. a chainsaw) or pít (as the past tense of the verb "to drink").
pila vodu ->
Index time: pít voda
Query time: pila voda
.. so the word pila is not found and not highlighted in a document snippet.
This behaviour is documented on the Solr wiki (quoted below), and I can confirm it by debugging my code (only the isolated strings "pila" and "vodu" are passed to the lemmatizer).
... The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" separately, ...
So my question is:
Is it possible to somehow change, configure or adapt the query parser so that the lemmatizer sees the whole query string, or at least some context around individual words? I would also like a solution for the other Solr query parsers like dismax or edismax.
I know that there is no such issue with phrase queries like "pila vodu" (quotes), but then I would lose the documents without the exact phrase (e.g. documents with "pila víno" or even "pila dobrou vodu").
Edit: trying to explain / answer the following question (thank you @femtoRgon):
If the two terms aren't a phrase, and so don't necessarily come together, then why would they be analyzed in context to one another?
Certainly it would be better to analyze only terms that actually appear together. For example, at indexing time the lemmatizer detects sentences in the input text and analyzes together only words from a single sentence. But how can I achieve something similar at query time? Is implementing my own query parser the only option? I quite like the pf2 and pf3 options of the edismax parser; would I have to implement them again in my own parser?
The underlying idea is in fact a bit deeper, because the lemmatizer performs word-sense disambiguation even for words that have the same lexical base. For example, the word bow has about 7 different senses in English (see Wikipedia) and the lemmatizer distinguishes such senses. So I would like to exploit this potential to make searches more precise -- to return only documents containing the word bow in the concrete sense required by the query. So my question could be extended to: how do I get the correct <lemma;sense> pair for a query term? The lemmatizer is very often able to assign the correct sense if the word is presented in its common context, but it has no chance when there is no context.
Finally, I implemented my own query parser.
It wasn't that difficult, thanks to the edismax sources serving as a guide and reference implementation. I could easily compare my parser's results with the results of edismax...
Solution:
First, I analyze the whole query string together. This gives me the list of "tokens".
There is a small catch with stop words: it is not that easy to get tokens for stop words, as they are omitted by the analyzer, but you can detect them from the PositionIncrementAttribute.
From the "tokens" I construct the query in the same way edismax does (e.g. creating all 2-token and/or 3-token phrase queries combined in DisjunctionMaxQuery instances).
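For reference, a parser like this gets wired into Solr through solrconfig.xml and selected with defType; the class name below is just a placeholder for my implementation:

<!-- solrconfig.xml; the plugin class name is hypothetical -->
<queryParser name="lemmadismax" class="com.example.LemmatizingEDisMaxQParserPlugin"/>

Queries then pass defType=lemmadismax (or use the {!lemmadismax} local-params syntax).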

How to only remove stopwords when they are not nouns?

I'm using Solr 5 and need to remove stop words to prevent over-matching and to avoid bloating the index with very common, low-IDF terms. However, the corpus includes a lot of part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to be removed by the stopword filter factory, as they need to be searchable.
The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?
Thanks!
Only you can decide whether this is the right approach. If you can integrate a POS tagger and it gives you useful results, that's good.
But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or, you could copyField your content to another (stored=false) field and process it without WordDelimiterFilterFactory altogether. Then you search over both copies of your data, possibly with different boosts for different fields.
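A minimal schema.xml sketch of that copyField idea (all field and type names here are made up; text_general is assumed to be the stock type):

<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="content_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="content" dest="content_exact"/>

<!-- text_exact: no WordDelimiterFilterFactory and no stopword filter,
     so tokens like "123-OR-A" and "A" survive intact -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With edismax you could then search both fields with different weights, e.g. qf=content content_exact^2.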

How to combine Prefix and Fuzzy Search in Solr 4.0

The Solr syntax for fuzzy search is:
q~n, where q is the query term and n is the Levenshtein distance (e.g. 1-3).
The syntax for prefix search is:
q*, where q is a query term and the * indicates a wildcard.
Combining both like q~n* (even with n=1) has the side effect that nearly everything matches (for a reason that I still need to find out).
Combining both like q*~n (even with n=1) has the side effect that the query behaves as a prefix search only.
In our use case we need to offer suggestions based on historical queries stored in the index. That also seems to be what Google does when you type a misspelled term, and it is a great solution for suggestions.
The problem is, we can either offer suggestions which start with the same prefix, or ones within a defined Levenshtein distance <= 3, which is impracticable when it comes to long terms.
Now, I know that there is a similar question asked 3 years ago, where the solution says it isn't possible to express in Solr syntax and the whole case does not make any particular sense. But in my opinion it does make sense, and a combination would be a perfect solution to practical problems.
Not a tested solution, but did you think of using q* OR q~1? For example: name:S* OR name:S~1.
A larger example: name:Samson~3 OR name:Samson* returned: <str name="name">Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133</str>
I have not tried this specifically, but it looks like you might be able to do what you want with the ComplexPhraseQueryParser.
It looks like the ComplexPhraseQueryParser is slated to be distributed with 4.8, but for now you can get the plugin (there are install instructions in the zip files) from Solr's Jira: https://issues.apache.org/jira/browse/SOLR-1604
There is some discussion of using distance with it here: http://lucene.472066.n3.nabble.com/ComplexPhraseQueryParser-and-wildcards-td2742244.html
I would expect with the ComplexPhraseQueryParser you could do a query like "q*"~n.
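For what it's worth (untested, in the same speculative spirit as the answer above): from Solr 4.8 on the parser ships with Solr and is registered by default under the name complexphrase, so the experiment would look something like

q={!complexphrase}name:"Samson*"~2

Whether the ~2 in that position is actually applied as an edit distance, rather than phrase slop, is exactly what would need verifying.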
