What is the naming ruleset that should be followed for collections, or in other words, what are the unacceptable characters in a collection name? (For example, in Elasticsearch you can't use certain symbols such as ",:?" when naming indices.)
Just found an answer in solr docs:
Throughout Solr there are limitations on the allowable characters in
collection names. Any characters other than ASCII alphanumeric
characters (A-Za-z0-9), hyphen (-) or underscore (_) are replaced with
an underscore when calculating the collection name for a category.
Update:
Collection names must consist entirely of periods, underscores, hyphens, and alphanumerics, and must not start with a hyphen.
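The rule above can be sketched as a small validation check. This is a minimal illustration, not an official Solr API; the helper name and regex are my own, built from the stated rule (ASCII alphanumerics, periods, underscores, hyphens; no leading hyphen):

```python
import re

# Hypothetical helper: checks a collection name against the rule quoted from
# the Solr docs -- only ASCII alphanumerics, periods, underscores, and
# hyphens are allowed, and the name must not start with a hyphen.
VALID_NAME = re.compile(r'^[A-Za-z0-9._][A-Za-z0-9._-]*$')

def is_valid_collection_name(name: str) -> bool:
    return bool(VALID_NAME.match(name))

print(is_valid_collection_name("my_collection-1"))      # True
print(is_valid_collection_name("-starts-with-hyphen"))  # False
print(is_valid_collection_name("bad:name"))             # False
```

Note that Solr itself does not reject such names everywhere; as quoted above, in some code paths it silently replaces disallowed characters with underscores, so validating up front avoids surprises.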
I have an Azure Search index with a bunch of text entries. I've observed that if the index contains an entry like "AI's" (with the Unicode apostrophe character 8217), searching for the word "AI" fails to return the result. The indexer should handle punctuation, including Unicode variants: searching for "John" should return an item that contains "John's." Please confirm whether this is a known bug and, if so, when it will be fixed.
Expecting to find "AI's" when I search for "AI" (where the apostrophe is a Unicode character 8217). The item is not returned as one would expect.
Can you confirm which analyzer you are using in your index? We support many analyzers that break down your search terms and document terms into different tokens. For example, if your content is in English, you could use the en.microsoft analyzer, which should split your "AI's" term into two tokens -> "AI" and "AI's".
More info on analyzers here ->
https://learn.microsoft.com/en-us/azure/search/search-analyzers
and here
https://learn.microsoft.com/en-us/azure/search/index-add-language-analyzers
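For illustration, an analyzer is assigned per field in the index definition. This is a hedged sketch of one field from a hypothetical index JSON (the field name "content" is my own example, not from the question):

```json
{
  "name": "content",
  "type": "Edm.String",
  "searchable": true,
  "analyzer": "en.microsoft"
}
```

Note that changing the analyzer on an existing field generally requires rebuilding the index, since already-indexed tokens were produced by the old analyzer.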
I'm running Apache Solr 6.6.5. When a user searches for "ETCS" (a specialized technical term), all documents that contain the word "etc" are matched. But I only want to match documents that really contain "ETCS". Solr should never even index "etc", since it is such a common word, and the stemmer should never reduce "etcs" to "etc" (plural stemming).
I added "etc" to stopwords.txt:
# Contains words which shouldn't be indexed for fulltext fields, e.g., because
# they're too common. For documentation of the format, see
# http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
# (Lines starting with a pound character # are ignored.)
etc
I added "etc" to protwords.txt:
#-----------------------------------------------------------------------
# This file blocks words from being operated on by the stemmer and word delimiter.
&
<
>
'
"
etc
That helps to not match documents that contain "etc", but documents containing "etc.", "etc," or similar are still matched.
So I could add even more variants to protwords.txt:
&
<
>
'
"
etc
etc.
etc..
etc...
etc,
But that will always be incomplete. How can I tell the stemmer to treat "etc" as a tokenized word with arbitrary non-word characters around it?
My schema.xml: https://gist.github.com/klausi/f59ee47a9b14b915f5bb44bd6cf1c945
1.)
I added "etc" to protwords.txt:
You should add "etcs" to protwords.txt to protect the term "etcs" from being stemmed.
2.)
So I could add even more variants to protwords.txt:
Add all variations of the words you want to remove from the index to stopwords.txt, not protwords.txt.
3.) Check what field type you are using. Maybe you can tune it a bit there.
//Edit: adding a link to your schema.xml will not help as long as you do not explain which field you are using.
4.) Don't forget to restart Solr and (if needed) reindex.
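As a sketch of how the pieces fit together, an analyzer chain for a text field could look like the following. The field type name is illustrative (not from the poster's schema); the key points are that the StopFilter runs after the tokenizer, so "etc.", "etc," and similar have already been reduced to the bare token "etc" when the stopword check happens, and that the KeywordMarkerFilter consults protwords.txt before the stemmer runs, protecting "etcs" from being stemmed to "etc":

```xml
<fieldType name="text_en_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords are checked on the bare tokens, after punctuation is gone -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- terms in protwords.txt are marked as keywords and skipped by the stemmer -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The exact behavior depends on the field type actually in use, which is why point 3 above matters.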
After changing splitOnNumerics="0" I can search for words that mix numbers and normal characters, such as "90s", "omega30", etc., but it still doesn't work with special characters like "80"" or "40)", even when I escape them: 80\", 40\), etc. Do you have any idea?
A synonym, e.g.: "AAA" => "AVANT AT ALJUNIED"
If I search for AAA*BBB,
I want to get AVANT AT ALJUNIEDBBB.
I used StandardTokenizerFactory, but it always breaks the field data into lexical units and then ignores the relative positions of the search words.
Alternatively, I tried to use StandardTokenizerFactory or other filters like WordDelimiterFilterFactory to split the word on *. It doesn't work.
You can't - synonyms work with tokens, and KeywordTokenizer keeps the whole string as a single token. So you can't expand just one part of the string when indexing if you're using KT.
In addition the SynonymFilter isn't MultiTermAware, so it's not invoked on query time when doing a wildcard search - so you can't expand synonyms for parts of the string there, regardless of which tokenizer you're using.
This is probably a good case for preprocessing the string and doing the replacements before sending it to Solr, or, if the number of replacements is small, having filters do pattern replacements inside the strings when indexing so that both versions are indexed.
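The in-Solr variant could be sketched with a char filter, which runs before the tokenizer and can therefore rewrite part of a string that KeywordTokenizer would otherwise keep whole. The field type name is illustrative; the pattern and replacement mirror the synonym from the question:

```xml
<fieldType name="string_expanded" class="solr.TextField">
  <analyzer type="index">
    <!-- rewrites "AAA" inside the raw string before tokenization -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="AAA" replacement="AVANT AT ALJUNIED"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Note this indexes only the expanded form; to keep both the original and the expanded version searchable, you would copyField the source into a second field without the char filter.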
Using eDisMax with Solr 5.2.1 to search for a string: when I set the q parameter to that string, Solr only matches fields containing that string as a whole word. For example,
q=bc123 will match "aa-bc123" but not "aabc123". If I add the * character before or after the phrase, then to match, there must be trailing and leading characters. For example, q=*bc123* will match "abc123a" but will not match "bc123".
The question is: what query string will match words containing the search words, with or without trailing/leading characters?
Please note:
There are multiple fields to match, which are defined using the qf parameter
qf=field1^4 field2^3 field2^2 ...
The search may contain multiple words; e.g., for q=abc def I want fields that contain both a word containing "abc" and a word containing "def", as with q.op=AND
I have tried to use fuzzy search, but I have gotten a varying degree of false positives or omitted results, depending on the threshold.
You can use an NGramFilter to achieve this. It will split the terms into multiple tokens, where each token will be a substring of the original token.
The filter is only required when indexing (when querying, the tokens should match directly).
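A field type along these lines could look like the following sketch (the field type name and gram sizes are illustrative choices, not prescribed values). The NGramFilter appears only in the index analyzer; the query analyzer leaves the search term whole, so it matches one of the indexed substrings directly:

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index every substring of length 3..15 of each token -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Be aware that n-gramming inflates the index considerably and can increase recall at the cost of precision, so it is usually confined to a dedicated field referenced via qf with a lower boost.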