The proper Solr Tokenizer to tokenize text while preserving special characters

The proper Solr Tokenizer to tokenize text while preserving special characters - solr

Which tokenizer is appropriate to do this:
input: "This-something is something."
output: ["] [This] [-] [something] [is] [something] [.] ["]
I tried with solr.WordDelimiterFilterFactory, but this removes all the special characters. Also tried solr.KeepWordFilterFactory, with all the special characters in keepwords.txt. But this doesn't work either.
Any suggestions? I am on Solr 3.4.

Don't think there is an out of the box Tokenizer for your specific requirement.
You can create a new one specific to the requirements and easily have Solr use it.

Related

Azure Search Define Custom Analyzer

I'm defining the Index schema. One of the field is "InvoiceNumber" which it can be something like "459" or "00459" or "P00459".
I want the text "00459" while indexing tokenize to 2 tokens "459" and the original "00459".
And the text "P00459", tokenize to 3 tokens "459", "00459" and the original "P00459".
Is there a way to define the custom analyzer for this?

configuring pattern_capture token filter with appropriate regex is able to produce multiple tokens based on the same text while preserving the original text.
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers
https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
This is the example from the latter link
"(https?://([a-zA-Z-_0-9.]+))" when matched against the string "http://www.foo.com/index" would return the tokens "https://www.foo.com" and "www.foo.com".

Solr highlighting does not work with multiple fields hl.fl when dynamic field is present

I have a dynamic text field bar_* in my index and want Solr to return highlightings for that field. So what I run is:
q=gold&hl=true&hl.fl=bar_*
It works as expected BUT in case I add some more fields to hl.fl it stops working. E.g.
q=gold&hl=true&hl.fl=bar_*,foo
Notes:
bar_* and foo fields are in the index/schema and there is no error here.
just rewriting request as q=gold&hl=true&hl.fl=bar_*&hl.fl=foo or q=gold&hl=true&hl.fl=bar_* foo does NOT help.
I didn't find any bugs in Solr JIRA on that topic.
Does anyone have an idea how to bit this. The possible workarounds that I see are:
Use hl.fl=*. But this one is not good for performance.
Explicitly specify all possible fields names for my dynamic field. But I don't like that at all.

I don't know what version is used, but it seems like this was a bug of previous Solr versions, I can confirm that in Solr 7.3 this works as expected.
curl -X GET \
'http://localhost:8983/solr/test/select?q=x_ggg:Test1%20OR%20bar_x:Test2&hl=true&hl.fl=%2A_ggg,foo,bar_%2A' \
-H 'cache-control: no-cache'
The more correct way is to do: hl.fl=bar_*,foo,*_ggg (use , or space as delimiter).
This helps to avoid long time debugging when you remove asterisk from your hl.fl parameter and highlighting by fields stops working, since this field not processed as regex anymore.
Here is spots in sources of Solr 7.3, where we can trace this behavior:
Solr calls org.apache.solr.highlight.SolrHighlighter#getHighlightFields
Before processing field, value splited by , or space here:
org.apache.solr.util.SolrPluginUtils#split
private final static Pattern splitList=Pattern.compile(",| ");
/** Split a value that may contain a comma, space of bar separated list. */
public static String[] split(String value){
return splitList.split(value.trim(), 0);
}
Results of split goes to method org.apache.solr.highlight.SolrHighlighter#expandWildcardsInHighlightFields.
In doc also mentioned expected contract https://lucene.apache.org/solr/guide/7_3/highlighting.html
hl.fl
Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for which Solr should generate highlighted snippets.
A wildcard of * (asterisk) can be used to match field globs, such as text_* or even * to highlight on all fields where highlighting is possible. When using *, consider adding hl.requireFieldMatch=true.
When not defined, the defaults defined for the df query parameter will be used.

try
q=gold&hl=true&hl.fl=bar_*&hl.fl=foo

After digging into Solr sources (org.apache.solr.highlight.SolrHighlighter#getHighlightFields) I have found a workaround for this. As appears Solr interprets hl.fl content as a regular expression pattern. So I've specified hl.fl as:
hl.fl=bar_*|foo
I.e. using | instead of comma. That worked perfectly for me.
Btw, I have found no documentation of this in the internet.

Is there a way to have a syntax that highlight single specific words in vim? [duplicate]

I want to search for multiple strings in Vim/gVim and have them highlighted in different colours. Is there a way of doing this with out-the-box Vim or with a plug-in?

There are two simple ways to highlight multiple words in vim editor.
Go to search mode i.e. type '/' and then type \v followed by the words you want to search separated by '|' (pipe).
E.g.: /\vword1|word2|word3
Go to search mode and type the words you want to search separated by '\|'.
E.g.: /word1\|word2\|word3
Basically the first way puts you in the regular expression mode so that you do not need to put any extra back slashes before every pipe or other delimiters used for searching.

This can be done manually, without any script, for two search patterns.
:match Search /pattern/
:match Search /<CTRL-R>/ # highlight the current search pattern
Search is the name of the highlight group, use the completion to select another group to highlight with a different color.
:match <TAB>
:match <TAB> # completion will list all highlight group
This an be handy when you cannot use your own vim configuration.
:match none # clear the match pattern to stop highlighting

For searching multiple strings in vim you can do like:
/search1\|search2
This works, and will highlight both search1 and search2, but with same color.
You have to do this in vim editor.

Try "Highlight multiple words", which uses matchadd().

Yes, out-of-the-box you can use matchadd().
To add a highlight, eg. for trailing whitespace:
:highlight ExtraWhitespace ctermbg=grey guibg=grey
:call matchadd('ExtraWhitespace', '\s\+$', 11)
To view all matches:
:echo getmatches()
To remove matches use matchdelete(). Eg.:
:call matchdelete(7)

:%s /red\|green\|blue/
I am not sure about how to keep different colors for different keyword though. Thanks.

MultipleSearch : Highlight multiple searches at the same time, each with a different color.
http://www.vim.org/scripts/script.php?script_id=479
:Search <pattern1> //will highlight all occurences of <pattern1> in the current buffer.
A subsequent :Search <pattern2> will highlight all occurences of <pattern2> in the current buffer.

My Mark plugin can highlight several words in different colors simultaneously, like the built-in search. It comes with many mappings and commands, allows to persist the patterns, and supports multiple color palettes.

MultipleSearch2 is another script which is integrated with vim's search:
http://www.vim.org/scripts/script.php?script_id=1183

I prefer highlight plugin, simple and enough, can highlight different words with differently colors automatically.
http://www.vim.org/scripts/script.php?script_id=1599

Issues with searching special characters in Solr

I'm using Solr 6.1.0
When I use defType=edismax, and using debug mode by setting debug=True, I found that the search for "r&d" is actually done to search on just the character "r".
http://localhost:8983/solr/collection1/highlight?q="r&d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r",
"querystring":"\"r",
"parsedquery":"(+DisjunctionMaxQuery((text:r)))/no_coord",
"parsedquery_toString":"+(text:r)"
Even if I search with escape character, it is of no help.
http://localhost:8983/solr/collection1/highlight?q="r\&d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r\\",
"querystring":"\"r\\",
"parsedquery":"(+DisjunctionMaxQuery((text:r)))/no_coord",
"parsedquery_toString":"+(text:r)",
But if I'm using other symbols like "r*d", then the search is ok.
http://localhost:8983/solr/collection1/highlight?q="r*d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r*d\"",
"querystring":"\"r*d\"",
"parsedquery":"(+DisjunctionMaxQuery((text:\"r d\")))/no_coord",
"parsedquery_toString":"+(text:\"r d\")",
What could be the reason behind this?
Regards,
Edwin

First - if you're using the URL as you've pasted, & is the separator between different arguments in the URL, and have to be properly urlencoded if it belongs to an argument, and is not an argument separator.
q=text:"foo&bar"&fl=..
is parsed as
q=text:"foo
bar"
fl=..
Your Solr library usually handles this for you transparently. text%3A%22r%26d%22 is the urlencoded version of text:"r&d".
Secondly, any further parsing will depend on the analysis chain and tokenizer for the field you're searching. This determines which characters are kept and how the text is tokenized (split into separate tokens) before the tokens are matched between the querying text and the indexed text.

What Analyzer are you using for your field . Better try a Analyzer that doesn't tokenize your field much like KeyWordTokenizerFactory.

Unicode/special characters in help_text for Django form?

I am trying to add a special character (specifically the ndash) to a Model field's help_text. I'm using it in the Form output so I tried what seemed intuitive for the HTML:
help_text='2 – 30 characters'
Then I tried:
help_text='2 \2013 30 characters'
Still no luck. Thoughts?

django escapes all html by default. try wrapping your string in mark_safe

You almost had it on your second try. First you need to declare the string as Unicode by prefacing it with a u. Second, you wrote the codepoint wrong. It needs a preface as well; like \u.
help_text=u'2\u201330 characters'
Now it will work and has the added benefit of not polluting the string with HTML character entities. Remember that field value could be used elsewhere, not just in the Form display output. This tip is universal for using Unicode characters in Python.
Further reading:
Unicode literals in Python, which mentions other codepoint prefaces (\x and \U)
PEP263 has simple instructions for using actual raw Unicode characters in a source file.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

The proper Solr Tokenizer to tokenize text while preserving special characters - solr

Don't think there is an out of the box Tokenizer for your specific requirement. You can create a new one specific to the requirements and easily have Solr use it.

Related

Azure Search Define Custom Analyzer

Solr highlighting does not work with multiple fields hl.fl when dynamic field is present

Is there a way to have a syntax that highlight single specific words in vim? [duplicate]

Issues with searching special characters in Solr

Unicode/special characters in help_text for Django form?

Categories

Resources