Issues with searching special characters in Solr

I'm using Solr 6.1.0.
When I use defType=edismax with debug output enabled (debugQuery=true), I found that a search for "r&d" actually ends up searching for just the character "r".
http://localhost:8983/solr/collection1/highlight?q="r&d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r",
"querystring":"\"r",
"parsedquery":"(+DisjunctionMaxQuery((text:r)))/no_coord",
"parsedquery_toString":"+(text:r)"
Searching with an escape character is of no help either.
http://localhost:8983/solr/collection1/highlight?q="r\&d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r\\",
"querystring":"\"r\\",
"parsedquery":"(+DisjunctionMaxQuery((text:r)))/no_coord",
"parsedquery_toString":"+(text:r)",
But if I use other symbols, like "r*d", the search works fine.
http://localhost:8983/solr/collection1/highlight?q="r*d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r*d\"",
"querystring":"\"r*d\"",
"parsedquery":"(+DisjunctionMaxQuery((text:\"r d\")))/no_coord",
"parsedquery_toString":"+(text:\"r d\")",
What could be the reason behind this?
Regards,
Edwin

First: if you're using the URL exactly as you've pasted it, & is the separator between different arguments in the URL, and it has to be properly urlencoded when it belongs to an argument's value rather than acting as an argument separator.
q=text:"foo&bar"&fl=..
is parsed as
q=text:"foo
bar"
fl=..
Your Solr library usually handles this for you transparently. text%3A%22r%26d%22 is the urlencoded version of text:"r&d".
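For example, curl can do the urlencoding for you with -G and --data-urlencode (a minimal sketch, reusing the handler and parameters from the question above):
curl -G 'http://localhost:8983/solr/collection1/highlight' \
  --data-urlencode 'q="r&d"' \
  --data-urlencode 'debugQuery=true' \
  --data-urlencode 'defType=edismax'
With -G, curl appends each --data-urlencode value to the URL as a properly encoded query parameter, so q becomes q=%22r%26d%22.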
Secondly, any further parsing will depend on the analysis chain and tokenizer for the field you're searching. This determines which characters are kept and how the text is tokenized (split into separate tokens) before the tokens are matched between the querying text and the indexed text.

Which analyzer are you using for your field? Try an analyzer that doesn't tokenize the field much, like one based on KeywordTokenizerFactory.
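For reference, a minimal schema sketch of such a field type (the name text_keyword and the added lowercase filter are illustrative, not from the question):
<fieldType name="text_keyword" class="solr.TextField">
  <analyzer>
    <!-- KeywordTokenizerFactory keeps the whole input, "r&d" included, as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>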

Related

Azure Search Define Custom Analyzer

I'm defining the Index schema. One of the fields is "InvoiceNumber", which can be something like "459", "00459", or "P00459".
I want the text "00459", while indexing, to be tokenized into 2 tokens: "459" and the original "00459".
And the text "P00459" to be tokenized into 3 tokens: "459", "00459", and the original "P00459".
Is there a way to define a custom analyzer for this?
Configuring a pattern_capture token filter with an appropriate regex can produce multiple tokens based on the same text while preserving the original text.
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers
https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
This is the example from the latter link
"(https?://([a-zA-Z-_0-9.]+))" when matched against the string "http://www.foo.com/index" would return the tokens "https://www.foo.com" and "www.foo.com".

regex with OR condition not working in angularjs [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes a character set; () denotes a logical grouping.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines. To match a literal dot, you have to escape it with a backslash (like \.).
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?
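To illustrate, a quick sketch in plain JavaScript (the sample URLs here are made up):
// Corrected pattern: a group instead of a character class, no redundant {1}, escaped dots
var re = /.*baidu\.com.*[\/?].*(wd|word|qw)=/;
console.log(re.test('https://www.baidu.com/s?wd=solr'));   // true
console.log(re.test('https://www.baidu.com/s?word=solr')); // true
console.log(re.test('https://www.baidu.com/s?q=solr'));    // false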

Solr highlighting does not work with multiple fields hl.fl when dynamic field is present

I have a dynamic text field bar_* in my index and want Solr to return highlightings for that field. So what I run is:
q=gold&hl=true&hl.fl=bar_*
It works as expected BUT in case I add some more fields to hl.fl it stops working. E.g.
q=gold&hl=true&hl.fl=bar_*,foo
Notes:
bar_* and foo fields are in the index/schema and there is no error here.
just rewriting request as q=gold&hl=true&hl.fl=bar_*&hl.fl=foo or q=gold&hl=true&hl.fl=bar_* foo does NOT help.
I didn't find any bugs in Solr JIRA on that topic.
Does anyone have an idea how to beat this? The possible workarounds that I see are:
Use hl.fl=*. But this one is not good for performance.
Explicitly specify all possible fields names for my dynamic field. But I don't like that at all.
I don't know which version is used, but it seems this was a bug in previous Solr versions; I can confirm that in Solr 7.3 this works as expected.
curl -X GET \
'http://localhost:8983/solr/test/select?q=x_ggg:Test1%20OR%20bar_x:Test2&hl=true&hl.fl=%2A_ggg,foo,bar_%2A' \
-H 'cache-control: no-cache'
The more correct way is to use hl.fl=bar_*,foo,*_ggg (with , or space as the delimiter).
This helps you avoid a long debugging session: if you later remove the asterisk from your hl.fl parameter, highlighting by those fields stops working, since the value is no longer processed as a pattern.
Here are the spots in the Solr 7.3 sources where this behavior can be traced:
Solr calls org.apache.solr.highlight.SolrHighlighter#getHighlightFields
Before the fields are processed, the value is split on , or space here:
org.apache.solr.util.SolrPluginUtils#split
private final static Pattern splitList=Pattern.compile(",| ");
/** Split a value that may contain a comma, space of bar separated list. */
public static String[] split(String value){
return splitList.split(value.trim(), 0);
}
The results of the split go to the method org.apache.solr.highlight.SolrHighlighter#expandWildcardsInHighlightFields.
The expected contract is also mentioned in the docs: https://lucene.apache.org/solr/guide/7_3/highlighting.html
hl.fl
Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for which Solr should generate highlighted snippets.
A wildcard of * (asterisk) can be used to match field globs, such as text_* or even * to highlight on all fields where highlighting is possible. When using *, consider adding hl.requireFieldMatch=true.
When not defined, the defaults defined for the df query parameter will be used.
Try
q=gold&hl=true&hl.fl=bar_*&hl.fl=foo
After digging into the Solr sources (org.apache.solr.highlight.SolrHighlighter#getHighlightFields) I have found a workaround for this. As it appears, Solr interprets the hl.fl content as a regular expression pattern. So I've specified hl.fl as:
hl.fl=bar_*|foo
I.e. using | instead of comma. That worked perfectly for me.
Btw, I have found no documentation of this anywhere on the internet.
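For completeness, the workaround as a full request (the collection name is a placeholder; field names are the ones used in this thread):
curl -G 'http://localhost:8983/solr/collection1/select' \
  --data-urlencode 'q=gold' \
  --data-urlencode 'hl=true' \
  --data-urlencode 'hl.fl=bar_*|foo'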

Unicode/special characters in help_text for Django form?

I am trying to add a special character (specifically an ndash) to a Model field's help_text. I'm using it in the Form output, so I tried what seemed intuitive for the HTML:
help_text='2 &ndash; 30 characters'
Then I tried:
help_text='2 \2013 30 characters'
Still no luck. Thoughts?
Django escapes all HTML by default. Try wrapping your string in mark_safe.
You almost had it on your second try. First you need to declare the string as Unicode by prefacing it with a u. Second, you wrote the codepoint wrong. It needs a preface as well; like \u.
help_text=u'2\u201330 characters'
Now it will work and has the added benefit of not polluting the string with HTML character entities. Remember that field value could be used elsewhere, not just in the Form display output. This tip is universal for using Unicode characters in Python.
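For context, a minimal sketch of what that looks like on a model field (the model and field names here are hypothetical):
# Hypothetical model; only help_text matters here.
# The u prefix is required on Python 2 and harmless on Python 3.3+.
from django.db import models

class Profile(models.Model):
    display_name = models.CharField(
        max_length=30,
        help_text=u'2\u201330 characters',  # \u2013 is the en dash
    )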
Further reading:
Unicode literals in Python, which mentions other codepoint prefaces (\x and \U)
PEP 263 has simple instructions for using actual raw Unicode characters in a source file.

The proper Solr Tokenizer to tokenize text while preserving special characters

Which tokenizer is appropriate to do this:
input: "This-something is something."
output: ["] [This] [-] [something] [is] [something] [.] ["]
I tried solr.WordDelimiterFilterFactory, but it removes all the special characters. I also tried solr.KeepWordFilterFactory with all the special characters in keepwords.txt, but that doesn't work either.
Any suggestions? I am on Solr 3.4.
I don't think there is an out-of-the-box tokenizer for your specific requirement.
You can create a new one specific to the requirements and easily have Solr use it.
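That said, before writing a custom tokenizer, it may be worth trying solr.PatternTokenizerFactory with group="0", which emits each regex match as a token; a sketch, untested on 3.4:
<fieldType name="text_keep_punct" class="solr.TextField">
  <analyzer>
    <!-- group="0" turns every match into a token: a run of word characters,
         or any single character that is neither a word character nor whitespace -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\w+|[^\w\s]" group="0"/>
  </analyzer>
</fieldType>
On the sample input "This-something is something." this pattern would yield ["] [This] [-] [something] [is] [something] [.] ["].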
