Avoiding keyword stuffing in SOLR


I'm looking for a way to limit the effect (or eliminate it) of "keyword stuffing" in SOLR. (We're currently running a SOLR 6.2.0 server).
I've tried setting omitTermFreqAndPositions="true", but when I do that, some queries throw phrase query errors (specifically queries with search terms such as G1966B - likely due to word splitting and such). I could go down the road of disabling the word splitting and try to avoid the phrase query errors, but this is simply going to mess up more things than I'm trying to fix.
Does anyone have any suggestions on how to limit the effect of multiple keyword matches in a single field?
Example: If we have a description field with something like this:
BrandX 1200 Series G1924B LC/MSD SL XBC System.
This BrandX 1200 Series G1924B ( G 1924 B , G1924 B , G 1924B ) LC/MSD SL XBC System is in excellent condition.
When someone does a search for "G1924B" I would like to avoid scoring this document higher just because it happens to have G1924B (or a variation of that) in there several times.
In theory someone could repeat the keyword many times in their description to try to trick the system into ranking their search results higher.
Any suggestions?
Thanks!

This turns out to be a more common requirement than you might expect.
If you remove both term frequencies and positions, you lose phrase search capability.
I would recommend writing a custom similarity that ignores TF (term frequency).
At the moment, the default BM25 similarity takes TF into account.
You can copy that class and adjust the scoring calculation so that TF is treated as a constant.
e.g.
org.apache.lucene.search.similarities.BM25Similarity.BM25DocScorer#score
[1] org.apache.lucene.search.similarities.BM25Similarity
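
To illustrate why a constant TF removes the benefit of keyword stuffing, here is a minimal Python sketch of the textbook BM25 term score. It is not the Lucene source: the function name, the ignore_tf flag and the sample numbers are assumptions made for illustration, and k1=1.2, b=0.75 are the commonly used defaults.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs,
                    k1=1.2, b=0.75, ignore_tf=False):
    """Score a single term for a single document, BM25-style.

    With ignore_tf=True any positive term frequency is treated as 1,
    so repeating a keyword cannot raise the score.
    """
    if tf <= 0:
        return 0.0
    if ignore_tf:
        tf = 1  # constant TF: the term is either present or not
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + length_norm)

# Hypothetical numbers: same document length, one vs. eight occurrences.
honest = bm25_term_score(1, 20, 20, 5, 1000, ignore_tf=True)
stuffed = bm25_term_score(8, 20, 20, 5, 1000, ignore_tf=True)
```

With ignore_tf=True both documents get the same score for the term. Document length still influences the result, so a real custom similarity would need to decide whether to keep length normalization.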

Related

Indexing and searching words and word-parts

I just indexed a bunch of text data from our products DB. My goal is evaluating Apache Solr for production use.
This is a document example:
{
"shape":"Geometric",
"color":"MATTE BLACK",
"gender":"unisex",
"model":"CLUBMASTER RX 5154",
"sales":10,
"lens":"rugged",
"material":"plastic",
"brand":"Ray-Ban"
}
The most important thing in our search app is fuzzy matching, because inaccurate search terms are very frequent.
So, I'm a little disappointed with results found by Solr.
For example:
clubmaster -> many results
club master -> no results
Why?!
ray ban -> many results
rayban -> no results
I also tried putting ~1 or even ~2 after my term, with no luck!
All fields are indexed using the predefined '*_txt_en' dynamic field.

You can't just run a serious production setup without customizing schema/solrconfig to fit your specific needs. From what I can guess, you would get the results you want by:
copy your text fields into different versions with different analysis, for example:
one as a string type, hard to match
one field that is using EdgeNgram to match prefixes.
another with WordDelimiterFilterFactory to match ray-ban/rayban
...
using edismax as the query parser
edismax has many things you can tweak, but the most important is to search on all the fields above and weight them differently: the less analysis, the more weight
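
A minimal Python sketch of what such a request could look like. The field names (model_exact, model_wdf, model_prefix) and the boost values are hypothetical; they stand for copies of the same source field with different analysis, which you would define in your own schema.

```python
from urllib.parse import urlencode

def build_edismax_params(user_query):
    # Hypothetical field names: each is a copy of the same source text
    # analysed differently. Less analysis gets a higher weight.
    weighted_fields = [
        ("model_exact", 10),  # string type, exact match only
        ("model_wdf", 4),     # WordDelimiterFilterFactory variant
        ("model_prefix", 2),  # EdgeNGram prefix variant
    ]
    qf = " ".join(f"{name}^{boost}" for name, boost in weighted_fields)
    return {"defType": "edismax", "q": user_query, "qf": qf}

params = build_edismax_params("ray ban")
query_string = urlencode(params)  # append to the /select URL of your core
```

The weights are a starting point only and would need tuning against real queries.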

No results returned by Solr when the query contains a word that is not in the collection

I am trying to set up Solr but have run into the problem mentioned in the title. I just downloaded Solr and used the built-in example. When I query with words that occur in the example documents, such as "ipod", Solr works properly. However, when I add words that are not in these documents, such as "what", Solr does not return anything. This seems odd to me, since the relevance scores should be computed for each query term separately and added up. A non-existent query term should not affect the ranking (even though the coord norm is affected, so the document scores will change).
Could anyone tell me what might be the issue? Thanks.

There are several ways to configure this behavior. I'll assume that you're using the edismax query parser for these examples, although some of these also apply to the standard lucene query parser.
The reason for not always wanting "ipod what" to retrieve the same subset as "ipod" is that you'll get a poor result set and user experience for terms that are more general than "ipod" (e.g. searching for "microsoft windows" will not be perceived as a good search result if you're showing only general hits for anything about windows - it's usually better to say "we didn't find anything" in those cases). It all depends on your use case.
First, you can do it yourself, by applying either AND or OR between terms to get the exact kind of matching you're looking for.
You can use q.op to configure whether the terms should be AND-ed together (all required) or OR-ed together (any one is sufficient). This overrides the (now deprecated) value from <solrQueryParser defaultOperator=".."/> in schema.xml.
For (e)dismax, there's the mm parameter, which lets you specify in more detail, but still in a general way, how matches should be performed. mm allows you to say "at least 50% of the terms should match" or "if there's only two terms, both should match, but any over that should be optional" or "match everything up to four, and 75% after that".
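
As a rough Python sketch, the options above translate into request parameters like these. The helper name is made up for illustration; "50%" and "4<75%" are mm specifications for the first and last cases quoted above.

```python
from urllib.parse import urlencode

def edismax_params(user_query, mm=None, q_op=None):
    """Build edismax request parameters with optional mm / q.op settings."""
    params = {"defType": "edismax", "q": user_query}
    if mm is not None:
        params["mm"] = mm
    if q_op is not None:
        params["q.op"] = q_op
    return params

# At least 50% of the terms must match (the percentage is rounded down).
half = edismax_params("ipod what", mm="50%")
# Up to four terms: all required; beyond four: 75% required.
up_to_four = edismax_params("ipod what", mm="4<75%")
# Every term required.
all_required = edismax_params("ipod what", q_op="AND")

encoded = urlencode(up_to_four)  # URL-encodes the "<" and "%" characters
```

With mm="50%", a two-term query such as "ipod what" only needs one term to match, so documents containing "ipod" are returned again.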

How to combine Prefix and Fuzzy Search in Solr 4.0

The solr syntax for fuzzy search is:
q~n where q is the query term and n is the Levenshtein Distance (e.g. 1-3).
The syntax for prefix search is:
q* where q is a query term and the * indicates a wildcard.
Combining both as q~n* (even with n=1) has the side effect that nearly everything matches
(for a reason that I still need to find out).
Combining both as q*~n (even with n=1) has the side effect that the query behaves as a prefix search only.
In our use case we need to offer suggestions based on historical queries stored in the index. That also seems to be what Google does when you type in a misspelled term, and it is a great solution for suggestions.
The problem is that we can either offer suggestions which start with the same index or ones within a defined Levenshtein distance <= 3, which is impractical when it comes to long terms.
Now, I know that a similar question was asked 3 years ago, where the answer says it isn't possible to express this in Solr syntax and that the whole case does not make any particular sense, but in my opinion it does make sense, and a combination would be a perfect solution to practical problems.

Not a tested solution, but have you thought of using q* OR q~1? For example: name:S* OR name:S~1
Larger example: name:Samson~3 OR name:Samson* returned: <str name="name">Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133</str></doc>
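
A small Python sketch of building that combined query string. The function name is hypothetical, and it assumes a plain alphanumeric term so that no query-syntax escaping is needed. As far as I know, Lucene 4.x caps the fuzzy edit distance at 2, so larger values may effectively be treated as 2.

```python
def prefix_or_fuzzy(field, term, max_edits=1):
    """Return a query matching either the prefix or a fuzzy variant of term."""
    if not term.isalnum():
        # Escaping of special query characters is not handled in this sketch.
        raise ValueError("term must be plain alphanumeric")
    if max_edits < 1:
        raise ValueError("max_edits must be at least 1")
    return f"{field}:{term}* OR {field}:{term}~{max_edits}"

query = prefix_or_fuzzy("name", "Samson", max_edits=3)
```

The OR form returns the union of both result sets, which is not the same as a fuzzy match on the prefix itself.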

I have not tried this specifically, but it looks like you might be able to do what you want with the ComplexPhraseQueryParser.
It looks like the ComplexPhraseQueryParser is slated to be distributed with 4.8, but for now you can get the plugin (there are install instructions in the zip files) from Solr's Jira. https://issues.apache.org/jira/browse/SOLR-1604
There is some discussion using distance here. http://lucene.472066.n3.nabble.com/ComplexPhraseQueryParser-and-wildcards-td2742244.html
I would expect with the ComplexPhraseQueryParser you could do a query like "q*"~n.

How can I limit my Solr search to an arbitrary set of 100,000 documents?

I've got an 11,000,000-document index. Most documents have a unique ID called "flrid", plus a different ID called "solrid" that is Solr's PK. For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search and it will be rare enough to call it "never" that any two searches will have the same set of FLRIDs to limit on.
What we're doing right now is, roughly:
q=title:dogs AND
(flrid:(123 125 139 .... 34823) OR
flrid:(34837 ... 59091) OR
... OR
flrid:(101294813 ... 103049934))
Each of those parenthesized groups can contain 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limitations on the number of terms that can be ORed together.
The problem with this approach (besides that it's clunky) is that it seems to scale at O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it to be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.
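
For reference, the subgrouping described above can be sketched in Python as follows. This only reproduces the current approach, not a faster alternative; the function name and the chunk_size parameter are made up for illustration.

```python
def build_subset_query(main_query, flrids, chunk_size=1000):
    """AND the main query with OR-ed groups of at most chunk_size FLRIDs."""
    ids = list(flrids)
    if not ids:
        raise ValueError("at least one FLRID is required")
    groups = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        groups.append("flrid:(" + " ".join(str(i) for i in chunk) + ")")
    return f"{main_query} AND ({' OR '.join(groups)})"

# Tiny chunk size so the grouping is visible.
q = build_subset_query("title:dogs", [123, 125, 139, 34823], chunk_size=2)
```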
How can we do this better?
Things we've tried or considered:
Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.
What we're hoping for:
An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
Have Solr do big ORs as a set operation not as (what we assume is) a naive one-at-a-time matching.
A way to create a match vector that gets passed to the query, because stringing fqs together in the query seems to be a suboptimal way to do it.
I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.
solr search within subset defined by list of keys
Searching within a subset of data - Solr
http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

Terms Prevalence in SolR searches

Is there a way to specify a set of terms that are more important when performing a search?
For example, in the following question:
"This morning my printer ran out of paper"
Terms such as "printer" or "paper" are far more important than the rest, and I don't know if there is a way to list these terms to indicate that, in the global knowledge, they'd have more weight than the rest of words.

For specific documents you can use QueryElevationComponent, which uses a special XML file in which you list, for specific query terms, the doc ids you want to appear.
Not exactly what you need, I know.
And regarding your comment about users not caring what's underneath, you control the final query. Or, in the worst case, you can modify it after you receive it at Solr server side.
Similar: Lucene term boosting with sunspot-rails

When you build the query, you can define which values matter and how much weight each of them carries in the search.
This can be done in several ways:
Setting the boost
The boost can be set by appending "^" and a boost factor to a term, e.g. printer^2.
Using the plus operator
If you put the + operator in front of a term, that term is required: only documents that match that field value are shown in the result.
For a better understanding of Solr, it is best to get familiar with the Lucene query syntax. Refer to this link to get more info.
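
A short Python sketch of applying a boost to a known list of important terms before sending the query. The function name and the boost factor of 2 are assumptions for illustration; the list of important terms would come from your own domain knowledge.

```python
def boost_terms(question, important, boost=2):
    """Append ^boost to every word that appears in the important set."""
    important_lc = {w.lower() for w in important}
    return " ".join(
        f"{w}^{boost}" if w.lower() in important_lc else w
        for w in question.split()
    )

q = boost_terms("This morning my printer ran out of paper",
                {"printer", "paper"})
```

Here q becomes a query in which "printer" and "paper" carry twice the weight of the other words.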
