Is it possible to index a text field so that currency symbols are treated as separate tokens?
For example in a text field I have this:
"16 €"
and I need to build an index with these entries:
16
€
In order to be able to search for "€" and find the document.
Now I'm using StandardTokenizer and it discards currency symbols.
A possible solution could be to use a more "trivial" tokenizer such as the WhitespaceTokenizer, but I think it would tokenize other text worse.
Note that the problem is not how to index currencies; this is a trivial example, and the field could contain arbitrary text.
One possible solution, albeit not very pretty, is to replace the euro sign with something the tokenizer you've chosen will leave alone. You can use a MappingCharFilterFactory to replace the euro sign with a string like EUROSIGN, and then replace it again after tokenization.
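A rough sketch of how that could look in the schema (the field type name and mapping file name are made up):

<fieldType name="text_currency" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- rewrites "€" to the placeholder EUROSIGN before the tokenizer runs -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-currency.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

with mapping-currency.txt containing a single rule:

"€" => " EUROSIGN "

Since the same analyzer applies at query time, a search for "€" is rewritten to EUROSIGN as well and matches the indexed token.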
Unless you're able to formally express exactly how you want your tokenizer to work, you'll have to go with one of the preset versions that are suitable for most content to give usable search results. If you have a more specific rule set, writing your own tokenizer in Java is an option.
Related
I want to search for products in the documents both with and without whitespace, like "base ball" and "baseball".
If someone searches for "baseball", the results should include the records for both "baseball" and "base ball".
I am not able to do that, and I also do not want to use synonyms for it.
I have used the filter class "WordDelimiterFilterFactory". To get those results I currently use entries like sunglass for sun glass and keychain for key chain in the synonyms file,
but there will be many more words like this, so it has been difficult to find every word whose meaning stays the same after splitting.
So I am looking for a solution where I don't have to use synonyms to get the desired result.
I've tried setting catenateWords='1' to get that result, but it did not produce the match either.
This is not possible without adding synonyms. You should add "base ball" as a synonym of "baseball".
The WordDelimiterFilterFactory is deprecated.
Even if you use WordDelimiterGraphFilterFactory, it's not possible.
generateWordParts: it splits words at case changes, like BaseBall, but that's not your case.
catenateWords: it also won't work in your case, since your word has no special character or hyphen at which to split and rejoin; e.g. wi-fi becomes wifi.
So your data should contain the separate words to be indexed. That means if you don't want to use synonyms, you have to index both baseball and base ball; only then will you be able to search on both words.
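For reference, a field type using the graph filter might look roughly like this (the type name is made up); even with generateWordParts and catenateWords enabled, a single lowercase token like baseball passes through unchanged:

<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splits on case changes and letter/digit boundaries and can glue the parts back together,
         but it has no dictionary, so "baseball" is never split into "base" and "ball" -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <!-- graph filters on the index-time analyzer should be followed by a flatten step -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>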
I posted a document with the field value "Pineapple upside down cake." I want to get hits for pineapple, pine*, *side, pi?????le, upside down, etc. I chose text_en, which finds neither *side nor pi?????le.
What out of the box field type will give me hits for all the above?
I'm using Solr 7.6.
If you want to retain all the tokens as is (as I commented on your previous question about this, the text_en type contains a stemmer), use a field type with just a WhitespaceTokenizer and a LowercaseFilter. You'll have to define this field yourself.
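A minimal sketch of such a self-defined field type (the name is just an example):

<fieldType name="text_ws_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only and lower-case; no stemming, so tokens stay intact -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>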
I'm guessing you can use text_general to get a decent enough answer (it uses the StandardTokenizer, so it'll split on a few more cases than just whitespace).
The reason is that wildcard searches happen without most of the analysis taking place (it's impossible to do proper stemming, splitting, etc. when you don't have the complete token), so any wildcard search is matched against the list of tokens generated after index-time processing.
Let's say that one of my fields in the index contains the word entrepreneurial. When I search for the word entrepreneur I don't get that document. But entrepreneur* does.
Is there a mode/parameter in which queries match documents that have words containing a token from the search text?
Another example would be finding a doc that has Matthew when you're looking for Matt.
Thanks
We don't currently have a mode where all input terms are treated as prefixes. You have a few options depending on what exactly you are looking for:
Set the target searchable field to a language specific analyzer. This is the nicest option from the linguistics perspective. When you do this, if appropriate for the language we'll do stemming which helps with things such as "run" versus "running". It won't help with your specific sample of "entrepreneurial" but generally speaking this helps significantly with recall.
Split the search input before sending it to search and add "*" to every term. Depending on your target language this is relatively easy (i.e. if there are spaces) or very hard. Note that prefixes don't mix well with stemming unless you take them into account and search for both, e.g. something like search=aa bb -> (aa | aa*) (bb | bb*); see the example after this list.
Lean on suggestions. This is more of a different angle that may or may not match your scenario. Search suggestions are good at partial/prefix matching and they'll help users land on the right terms. You can read more about this here.
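To make the second option concrete (the second term is invented), a client-side rewrite of a two-term query could send something like:

search=(entrepreneur | entrepreneur*) (advice | advice*)

i.e. each whitespace-separated term is OR-ed with its own prefix form before the request is sent.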
Perhaps this page might be of interest:
https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
search=[string]
Optional. The text to search for. All searchable fields are searched by default unless searchFields is specified. When searching searchable fields, the search text itself is tokenized, so multiple terms can be separated by white space (e.g.: search=hello world). To match any term, use * (this can be useful for boolean filter queries). Omitting this parameter has the same effect as setting it to *. See Simple query syntax in Azure Search for specifics on the search syntax.
I'm indexing some data, including filenames. If I search for toto.pdf, a token "pdf" is created for the search.
What I want is, given the indexed filename:
MySupercool123girlfriend.jpg
to be able to search for it with:
supercool
supercool123
123
girlfriend
jpg
So at index time it's pretty easy to use WordDelimiterFilterFactory so that tokens like these are created:
my
supercool
mysupercool
mysupercool123
supercool123
123
girlfriend
jpg
girlfriend.jpg
etc...
The problem is that at search time, I don't really know what I should do.
If I use WordDelimiterFilterFactory at search time, MySupercool123girlfriend.jpg would match even toto.jpg, because in both cases a token jpg is created.
toto.jpg should not be in the result list at all, so having both results with the appropriate one merely scoring higher is not a solution for me.
Do you have any recommendations for indexing and searching filenames?
For this specific example of yours, i.e. if the search is for MySupercool123girlfriend.jpg and you want it to return only documents that contain the entire string, you can keep a copyField, say named filename_str, whose fieldType is string. String matches will ensure that you get an exact match. This could be a first-level "exact match" search.
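A minimal sketch of that copyField setup could look like this (field names are just examples):

<!-- untokenized copy of the filename for exact matches -->
<field name="filename_str" type="string" indexed="true" stored="false"/>
<copyField source="filename" dest="filename_str"/>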
However, I am guessing that you would want a search for 123girlfriend.jpg to return the document containing MySupercool123girlfriend.jpg. You can do a second-level search for this. Beginning with Solr 4.0 you can do a regex search like
q=filename_str:/.*123girlfriend.jpg/
(This regex query should also work for filename field itself, if you are using preserveOriginal=1 in WordDelimiterFilterFactory at index time.)
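For illustration, an index-time analyzer with preserveOriginal might be configured roughly like this (the field type name is made up; newer Solr versions use WordDelimiterGraphFilterFactory instead, which accepts the same options):

<fieldType name="text_filename" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal="1" keeps MySupercool123girlfriend.jpg as a token
         alongside the generated sub-words (my, supercool, 123, girlfriend, jpg, ...) -->
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
            generateWordParts="1" generateNumberParts="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>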
Otherwise you can do a leading wildcard search, which works in earlier Solr versions too.
If you also want MySupercool.jpg to match MySupercool123girlfriend.jpg, then I guess you would have to manually do the work of the WordDelimiterFilterFactory and construct a regex query like
q=filename_str:/.*My.*Supercool.*.jpg/
Another issue is that jpg is going to match lot of documents, so you may want to split the filename and the extension and keep them as separate fields.
Can you come up with a DisMax mm parameter that is meaningful for your use case?
See http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
E.g.
mm=100% and "MySupercool123girlfriend.jpg" would match only filenames that have all ["my", "supercool", "123", "girlfriend", "jpg"] terms in them
You can find a less strict expression that still gives relevant results. See http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/util/doc-files/min-should-match.html
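As a hypothetical illustration (the field name filename and the rest of the request are made up), such a DisMax query could be sent as:

q=MySupercool123girlfriend.jpg&defType=dismax&qf=filename&mm=100%25

Here mm=100%25 is just the URL-encoded form of mm=100%; with the word-delimiter analysis described above, the query is split into my, supercool, 123, girlfriend and jpg, and every one of those terms has to match.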
In Apache Solr, why do we always need to prefer a string field over a text field if both serve the purpose?
How do string and text affect things like index size, index reads, and index creation?
The field types as defined by default in the Solr schema are vastly different.
String stores a word/sentence as an exact string without performing tokenization etc. It is commonly useful for exact matches, e.g. for faceting.
Text typically performs tokenization and secondary processing (such as lower-casing etc.). It is useful for all scenarios where we want to match part of a sentence.
If the sample "This is a sample sentence" is indexed into both fields, we must search for exactly the text This is a sample sentence to get a hit from the string field, while it may suffice to search for sample (or even samples with stemming enabled) to get a hit from the text field.
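As a sketch, the two flavours might be declared side by side like this (field names invented; text_general is one of the stock tokenized field types):

<!-- exact, untokenized value: good for faceting, sorting and exact matches -->
<field name="title_exact" type="string" indexed="true" stored="true"/>
<!-- tokenized and lower-cased: good for free-text search on parts of the sentence -->
<field name="title" type="text_general" indexed="true" stored="true"/>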
Adding to Johan Sjöberg's good answer:
You can sort on a string field but not on a text field.