solr edismax search for words containing substring

Using eDisMax with Solr 5.2.1 to search for a string, when I set the q parameter to that string, Solr only matches fields containing that string as a whole word. For example,
q=bc123 will match "aa-bc123" but not "aabc123". If I add the * character before or after the phrase, then a matching term must have leading or trailing characters accordingly. For example, q=*bc123* will match "abc123a" but will not match "bc123".
The question is: what query string will match words containing the search words, with or without trailing/leading characters?
Please note:
There are multiple fields to match, which are defined using the qf parameter
qf=field1^4 field2^3 field3^2 ...
The search may contain multiple words, e.g. for q=abc def I want to match documents that contain both a word containing "abc" and a word containing "def" (as with q.op=AND).
I have tried fuzzy search, but I got a varying degree of false positives or omitted results, depending on the threshold.

You can use an NGramFilter to achieve this. It will split the terms into multiple tokens, where each token will be a substring of the original token.
The filter is only required at index time (at query time the plain query tokens will then match the indexed substrings directly).
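As an illustration, a field type along these lines could be added to schema.xml; the type name and gram sizes are assumptions you would tune to your data, and the fields listed in qf would need to use this type:
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit every substring of each token between 3 and 15 characters, so q=bc123 can match "aabc123" -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- no NGramFilter on the query side: the plain tokens match the indexed substrings -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>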

Related

Solr does not retrieve results for partial string match in shards

In Solr I have 3 cores:
Unicore
Core_1
Core_2
UniCore is the common core which has all the fields of Core_1 & Core_2.
I'm getting results for the below query for the string "50000912":
http://localhost:8983/solr/UniCore/select?q=*text:"50000912"*&wt=json&indent=true&shards=http://localhost:8983/solr/Core_1,http://localhost:8983/solr/Core_2
output :
"response":{"numFound":4,"start":0,"maxScore":10.04167,"docs":[
But if I pass "5000091" instead of "50000912", removing the "2" at the end of the string, I get zero results:
output :
"response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
With the same query, which technically should return more results. Am I missing something, or is it a bug? Can anyone please correct me?
Just for reference, this is one of the resulting documents from Core_2:
"response":{"numFound":4,"start":0,"maxScore":10.04167,"docs":[
{
"Storageloc_For_EP":"2500",
"Material_Number":"50000912-001",
"Maximum_LotSize":"0",
"Totrepl_Leadtime":"3",
"Prodstor_Location":"2000",
"Country_Of_Origin":"CN",
"Planned_Deliv_Time":"1",
"Planning_Time_Fence":"0",
"Plant":"5515",
"GR_Processing_Time":"1",
"Minimum_LotSize":"7920",
"Rounding_Value":"720",
"Service_Level_Days":"0",
"id":"2716447",
"Fixed_LotSize":"0",
"Procurement_Type":"F",
"Automatic_PO":"X",
"SchedMargin_Key":"005",
"Service_Level_Qty":"0",
"MRP_Type":"ZB",
"Profit_Center":"B2019",
"_version_":1531317575416283139,
"[shard]":"http://localhost:8983/solr/Core_2",
"score":10.04167},
{
Solr won't do any substring matches unless you've used a NgramFilter (which will generate multiple tokens for each substring of your original token).
My guess is you've indexed the content into a standard text field, which means that it'll tokenize on -. That means that what's being stored in the index for that document's Material_Number is 50000912 and 001. Solr will only give hits when the query-side and index-side tokens match.
You have a few options: you can either add an EdgeNGramFilter, which will generate a separate token for each prefix of the string, or, since this is a numeric value, you can use a string field (and not a tokenized field, unless it uses the KeywordTokenizer) and a wildcard at the end: q=Material_Number:5000091* will give you any document that has a token starting with 5000091.
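As a rough sketch of the EdgeNGramFilter option (the type name and gram sizes are assumptions, not something given in the question), the index-time analyzer generates the prefixes while the query-time analyzer stays plain:
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- for the token 50000912 this emits 500, 5000, 50000, ..., 50000912 -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
With the field indexed through such a type, a query for 5000091 would then match, because 5000091 is one of the indexed prefixes of 50000912.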

How to config solr that use Synonym base on KeywordTokenizerFactory

For example, I have the synonym "AAA" => "AVANT AT ALJUNIED".
If I search for AAA*BBB,
I want to get "AVANT AT ALJUNIEDBBB".
I was using StandardTokenizerFactory, but it always breaks the field data into lexical units and then ignores the relative position of the search words.
Alternatively, I tried to use StandardTokenizerFactory or other filters like WordDelimiterFilterFactory to split the word on *. It doesn't work.
You can't - synonyms work with tokens, and the KeywordTokenizer keeps the whole string as a single token. So you can't expand just one part of the string when indexing if you're using the KeywordTokenizer.
In addition, the SynonymFilter isn't MultiTermAware, so it isn't invoked at query time when doing a wildcard search - so you can't expand synonyms for parts of the string there either, regardless of which tokenizer you're using.
This is probably a good case for preprocessing the string and doing the replacements before sending it to Solr, or, if the number of replacements is small, for having filters do pattern replacements inside the strings when indexing, so that both versions end up indexed (see the sketch below).
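As a rough illustration of the second option (the field names are hypothetical, the mapping is the one from the question), a PatternReplaceCharFilterFactory in front of the KeywordTokenizer rewrites the raw value before it becomes a single token, while a copyField keeps the original, unexpanded value indexed in a separate field:
<fieldType name="string_expanded" class="solr.TextField">
  <analyzer>
    <!-- rewrite the abbreviation inside the raw string before it is turned into one token -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="AAA" replacement="AVANT AT ALJUNIED"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- hypothetical field names: "name" keeps the raw value, "name_expanded" the rewritten one -->
<copyField source="name" dest="name_expanded"/>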

Disable boolean query in Solr for edismax

How do I disable boolean operators in edismax for solr?
The following query, Edismax -The Extended DisMax Query Parser, should not exclude results mentioning "the" (given that stop words are not used), yet the -The part gets interpreted as a boolean exclusion.
I don't believe that Solr has an option to deactivate boolean operators. (Though I could be unaware of it - Solr is huge!)
My standard practice is to modify user-entered queries before passing them along to Solr. If punctuation isn't relevant in your search structure anyway, you could simply remove the hyphen, replace it with a space, or if you want to preserve the structure of hyphenated terms for your Solr analyzers to play with, you might selectively replace the specific pattern " -" with a single space " ", and so leave regular hyphenated expressions alone.
If you're not sure that the hyphen is irrelevant data in your search, you could instead replace it with a sentinel character or sequence of characters that will pass cleanly through your query parser and field analysis, but you would probably want to do the same thing to the input data going into the search index so the two sentinel values can match within Solr.

Solrnet facet returning spaces

I'm using Solrnet to return search results and am also requesting the facets, in particular categories which is a multi-valued field.
The problem I'm coming up against is that the category "house products" is being returned as two separate facets because of the space.
Is there a way of ensuring this is returned as a single facet value, or should I be escaping the value when it is added to the index?
If tokens are being generated for house products, then you are using text analysis for the field.
Text fields are not recommended for faceting.
You won't get the desired behavior, as text fields are tokenized and filtered, leading to multiple tokens being generated, which is what you see in the facets returned in the response.
Use a copy field to copy the field to a String field to be able to facet on it without splitting the words.
From the Solr Faceting Overview:
Because faceting fields are often specified to serve two purposes, human-readable text and drill-down query value, they are frequently indexed differently from fields used for searching and sorting:
They are often not tokenized into separate words
They are often not mapped into lower case
Human-readable punctuation is often not removed (other than double-quotes)
There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for value retrieval.
Try to use string fields; they are good enough for this without any overhead.
Faceting works on tokens, so if you have a field that is tokenized into many words, it will split the facet too.
I suggest you create another field of type string used only for faceting.
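A minimal sketch of that approach in schema.xml (the field names here are hypothetical; in SolrNet you would then request facets on the _facet field instead of the analysed one):
<!-- analysed field used for searching -->
<field name="category" type="text_general" indexed="true" stored="true" multiValued="true"/>
<!-- untokenized copy used only for faceting, so "house products" stays a single facet value -->
<field name="category_facet" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="category" dest="category_facet"/>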

Solr comma separated field - facet search

I have a field in my Solr index which holds comma-separated values like "area1,area2,area3,area4". There are also documents where the value is just a single value, like "area6".
Now I want to facet over all of these values.
Example (this is what I want):
area1:10
area2:4297
area3:54
area4:65
area6:87
This is what I get:
area1,area2,area3,area4: 7462
area6: 87
Does Solr offer a solution for this problem, or must I separate the different values on my own?
At index time you need to split the data into tokens on the , character. You can use the PatternTokenizerFactory with , as the pattern; this would split your text whenever it finds a ,.
The field in your schema.xml should be multivalued. A sketch of such a setup follows below.
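A minimal sketch of such a field definition (the type and field names are made up for illustration):
<fieldType name="comma_delimited" class="solr.TextField">
  <analyzer>
    <!-- split "area1,area2,area3,area4" into one token per value -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
<field name="areas" type="comma_delimited" indexed="true" stored="true" multiValued="true"/>
Faceting on this field (facet.field=areas) would then return counts per individual value, e.g. area1 and area2 counted separately.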
