Solr does not retrieve results for partial string match in shards

In Solr I have 3 cores:
UniCore
Core_1
Core_2
UniCore is the common core, which has all the fields of Core_1 and Core_2.
I'm getting results for the query below for the string "50000912":
http://localhost:8983/solr/UniCore/select?q=text:"50000912"&wt=json&indent=true&shards=http://localhost:8983/solr/Core_1,http://localhost:8983/solr/Core_2
output :
"response":{"numFound":4,"start":0,"maxScore":10.04167,"docs":[
But if I pass "5000091" instead of "50000912", removing the "2" at the end of the string, I get zero results.
output :
"response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
This is with the same query, which technically should return more results. Am I missing something, or is it a bug? Can anyone please correct me?
Just for reference, this is one of my resulting documents from Core_2:
"response":{"numFound":4,"start":0,"maxScore":10.04167,"docs":[
{
"Storageloc_For_EP":"2500",
"Material_Number":"50000912-001",
"Maximum_LotSize":"0",
"Totrepl_Leadtime":"3",
"Prodstor_Location":"2000",
"Country_Of_Origin":"CN",
"Planned_Deliv_Time":"1",
"Planning_Time_Fence":"0",
"Plant":"5515",
"GR_Processing_Time":"1",
"Minimum_LotSize":"7920",
"Rounding_Value":"720",
"Service_Level_Days":"0",
"id":"2716447",
"Fixed_LotSize":"0",
"Procurement_Type":"F",
"Automatic_PO":"X",
"SchedMargin_Key":"005",
"Service_Level_Qty":"0",
"MRP_Type":"ZB",
"Profit_Center":"B2019",
"_version_":1531317575416283139,
"[shard]":"http://localhost:8983/solr/Core_2",
"score":10.04167},
{

Solr won't do any substring matches unless you've used an NGramFilter (which generates multiple tokens, one for each substring of the original token).
My guess is that you've indexed the content into a standard text field, which means it'll tokenize on -. So what's being stored in the index for that document's Material_Number is 50000912 and 001. Solr will only give hits when the query-side and index-side tokens match.
You have a few options: you can add an EdgeNGramFilter, which generates a separate token for each combination of characters from the start of the string, or, since this is a numeric value, you can use a string field (or at least a non-tokenized field, i.e. one using the KeywordTokenizer) with a wildcard at the end: q=Material_Number:5000091* will give you any document that has a token starting with 5000091.
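As a sketch, an EdgeNGram-based field type could look like this in the schema (the type name and gram sizes are illustrative, not from the asker's setup):

```xml
<!-- Hypothetical field type: emits prefix tokens (50, 500, 5000, ...) at
     index time so a partial prefix like 5000091 matches directly; the query
     analyzer deliberately omits the filter so the query terms aren't
     ngrammed as well. -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field of this type, a plain q=Material_Number:5000091 (no wildcard) would match, since the prefix itself is indexed as a token.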

Related

Solr query not working as expected when it contains the `#` character

I have a field called email_txt of type text_general that holds a list of emails of the form abc#xyz.com,
and I'm trying to create a query that will only search the username and disregard the domain.
My query looks something like this:
email_txt:*abc*#*
This produces 0 results. I expect to receive results where the username contains abc, like abcdefg#xyz.com, fooabc#xyzbuzz.com, barabcefg#fizzxyz.com, abc#fizz.com. And yes, I am confident that I have data of that type, it doesn't work even if I try email_txt:*#*.
If I try something like:
email_txt:*abc*
It works, and produces multiple results, including the desired ones from above, but also cases where the domain contains abc, like fizz#helpmeabc.com, which is not desired.
I've had a look at the documentation (just in case I'm going crazy) and it confirms that # is not a special character. Even so, I have tried to escape it like this (just in case, I am going crazy):
email_txt:*abc*\#*
still, 0 results
Now the actual question. Is # a special character? If so, how can it be escaped, if not what am I doing wrong in the query? I genuinely can't tell if there is a flaw in my logic, or if there is something that I am missing.
Note: I'm using Solr version 6.3.0; the doc linked is for 6.6 (the closest available).
When you're using the StandardTokenizer (which the default field types text_general, text_en, etc. use by default), the content will be split into tokens when the # sign occurs. That means that for your example, there are actually two or three tokens being stored: (fizz and helpmeabc.com) or (fizz, helpmeabc and com).
A wildcard match is applied against the tokens by themselves (unless you're using the complex phrase query parser), with no tokenization or filtering taking place (except for multi-term-aware filters such as the lowercase filter).
The effect is that your query, *abc*#*, attempts to match a token containing #, but since the indexing process splits on # and separates the tokens on that character, no token contains # - thus giving you no hits.
You can use the string field type or a KeywordTokenizer paired with filters such as the lower case filter, etc. to get the original input more or less as a complete token instead.
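As a hedged sketch, such a keyword-tokenized field type could be defined like this (the type name is made up for illustration):

```xml
<!-- Hypothetical "string-like" type: the whole value stays one token, so a
     wildcard such as email_txt:abc*#* can match across the # sign. The
     LowerCaseFilter is multi-term aware, so wildcard queries are lowercased
     to match the index side. -->
<fieldType name="string_lower" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Against a field of this type, abc*#* would match abcdefg#xyz.com but not fizz#helpmeabc.com, since the token starts with the username.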

SOLR - Searching record based on SOLR field in passed string

I have a CSV string field, say "field1", in Solr which can have a value like 1,5,7.
Now, I want to get this record back if I pass the values:
1,5,6,7
OR
1,5,7,10
OR
1,5,7
Basically any of these inputs should return me this record from SOLR.
Is there any way to achieve this? I am open to a schema change if it helps.
The Standard Tokenizer (used in text fields like text_general) will not split on commas if there is no space in between characters.
That means that "1,2,3" will be indexed as a single token ("1,2,3") but it will index "1, 2, 3" as three tokens ("1", "2", "3").
If you can make sure there will be a space after the comma, both in the value that you are indexing and in the value that you use in your search query, you might be able to achieve what you want by indexing your field as text_general.
You can use the Analysis Screen in Solr to see how your value will be indexed and searched and see if any of the built-in field types gives you what you want.
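If you can't guarantee a space after each comma, one possible direction (a sketch, not from the answer above; the type name is made up) is a field type that tokenizes on the comma itself using PatternTokenizerFactory:

```xml
<!-- Illustrative type: splits "1,5,7" (with or without spaces after the
     commas) into the tokens 1, 5 and 7, so any individual value in the
     query can match the indexed record. -->
<fieldType name="text_csv" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
  </analyzer>
</fieldType>
```

With both the indexed value and the query tokenized this way, a query such as field1:(1 5 6 7) with the default OR operator would match the record containing 1,5,7.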

Solr query giving wrong results searching multi-word (separated by space) string

I have indexed the following document in Solr, where app_name is a multi-word string, e.g. "Fire inspection":
{
"app_name":"Fire inspection",
"appversion":1,
"id":"app_1397_version_2417",
"icon":"/images/media/default_icons/app.png",
"type":"app",
"app_id":1397,
"account_id":556,
"app_description":"fire inspection app",
"_version_":1599441252925833216}
If I execute the following Solr query, Solr returns the wrong response.
Query:
http://localhost:8983/solr/AxoSolrCollectionLocal/select?fq=app_name:*fire P*&q=*:*
I'm searching for records whose app_name contains "fire P", but I'm getting a response whose app_name contains "fire inspection". Here, the string 'Fire P' does not match the record below, but it is still returned by Solr.
Response:
{
"app_name":"Fire inspection",
"appversion":1,
"id":"app_1397_version_2417",
"icon":"/images/media/default_icons/app.png",
"type":"app",
"app_id":1397,
"account_id":556,
"app_description":"fire inspection app",
"_version_":1599441252925833216}
Can someone please help me with a Solr query (like a LIKE query in SQL) which will check for a substring, where spaces don't matter?
Your help is greatly appreciated.
First: your query does not mean what you think it means. app_name:*fire P* means "search for anything ending in fire in the field app_name, and/or anything starting with p in the default search field". Since you haven't prefixed the second value with a field name, the default search field is used.
If you want to search for a substring match inside a field like that (i.e. something that contains "fire P" as a substring inside the value), the field has to be a string field - or a field with a keyword tokenizer - so the field retains its actual value and isn't processed / filtered / tokenized further. If it's being tokenized, those tokens (i.e. fire, inspection, etc.) are stored separately. You'll have to escape any spaces properly and query a single field (i.e. app_name:fire\ P), and depending on the use case, performance may take a hit unless you have the ReversedWildcardFilter enabled as well.
However, you can probably also use the ComplexPhraseQueryParser to get support for wildcards in phrase queries:
{!complexphrase inOrder=true}app_name:"*fire P*"
should work, as long as the case in the query matches the indexed tokens (wildcards disable many filters, so the query text usually has to match the tokens exactly as they're stored).

How to config solr that use Synonym base on KeywordTokenizerFactory

Synonym, e.g.: "AAA" => "AVANT AT ALJUNIED"
If I search AAA*BBB,
I want to get AVANT AT ALJUNIEDBBB.
I used StandardTokenizerFactory, but it always breaks the field data into lexical units and then ignores the relative position of the search words.
Alternatively, I tried to use StandardTokenizerFactory or another filter like WordDelimiterFilterFactory to split the words on *. It doesn't work.
You can't - synonyms work with tokens, and the KeywordTokenizer keeps the whole string as a single token. So you can't expand just one part of the string when indexing if you're using the KeywordTokenizer.
In addition, the SynonymFilter isn't MultiTermAware, so it's not invoked at query time when doing a wildcard search - so you can't expand synonyms for parts of the string there either, regardless of which tokenizer you're using.
This is probably a good case for preprocessing the string and doing the replacements before sending it to Solr, or, if the number of replacements is small, having filters do pattern replacements inside the strings when indexing, so that both versions get indexed.
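The filter-based direction could be sketched like this (the type name is illustrative; note that a char filter replaces the text rather than duplicating it, so to keep both forms you'd copyField the value into a second, unmodified field as well):

```xml
<!-- Illustrative index-time replacement: rewrites AAA to its expansion
     before tokenization, so a wildcard query against the expanded form can
     match without any query-time synonym handling. -->
<fieldType name="string_expanded" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="AAA" replacement="AVANT AT ALJUNIED"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```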

solr edismax search for words containing substring

Using eDisMax with Solr 5.2.1 to search for a string: when I set the q parameter to that string, Solr only matches fields containing that string as a whole word. For example,
q=bc123 will match "aa-bc123" but not "aabc123". If I add the * character before or after the phrase, then to match, there must be trailing and leading characters. For example, q=*bc123* will match "abc123a" but will not match "bc123".
The question is: what query string will match words containing the search words, with or without trailing/leading characters?
Please note:
There are multiple fields to match, which are defined using the qf parameter
qf=field1^4 field2^3 field2^2 ...
The search may contain multiple words, e.g. for q=abc def I want fields that contain both a word containing "abc" and a word containing "def", as with q.op=AND
I have tried to use fuzzy search, but I have gotten a varying degree of false positives or omitted results, depending on the threshold.
You can use an NGramFilter to achieve this. It will split the terms into multiple tokens, where each token will be a substring of the original token.
The filter is only required when indexing (when querying, the tokens should match directly).
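A field type along those lines could be sketched like this (the type name and gram sizes are illustrative assumptions):

```xml
<!-- Illustrative type: the NGramFilter runs only in the index analyzer, so
     "aabc123" is stored as substring tokens (aab, abc, bc1, ..., bc123, ...);
     the query analyzer leaves terms whole, so q=bc123 matches one of the
     stored substring tokens directly, without wildcards. -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This works with eDisMax's qf boosts and q.op=AND unchanged, since each query term is matched as a plain term against the ngrammed index.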
