I have a DB where I store strings of hex colors like f9f8f7 or aaaaaa.
If I search for a color, strange things happen:
if I search aaaaaa I get the one and only result that contains aaaaaa,
but if I search f9f8f7 I get more results that are not pertinent...
It seems that, since f9f8f7 has letters and numbers together, Solr tries to split it up and search for the single parts.
How do I prevent this?
You define your field to not split on numbers inside words. You can use the WhitespaceTokenizer in your field definition instead of the one you're currently using. The whitespace tokenizer will only break words on whitespace instead of using a range of other splitting points (which will depend on your tokenizer and on whether you have a WordDelimiter(Graph)Filter active in your analysis chain).
You can test out exactly how your strings are being processed for a specific field by going to "Analysis" under the collection / core in your Solr Admin.
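A minimal sketch of such a field type; the name hex_color and the lowercase filter are assumptions, not taken from your current schema:

<fieldType name="hex_color" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split only on whitespace, so f9f8f7 stays one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- optional: normalize case so F9F8F7 and f9f8f7 match -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

If the field only ever holds a single color code, a plain string field would also give you exact matches.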
Related
I want to search for products in the document with whitespace and without whitespace, like "base ball" and "baseball":
if someone searches for "baseball", the result should fetch the records for both "baseball" and "base ball".
I am not able to do that, and I also do not want to use synonyms for it.
I have used the filter class WordDelimiterFilterFactory; to get such results I use keywords like sunglass for sun glass and keychain for key chain in the synonyms file.
But there will be many more words like this, so it's been difficult to find all the words whose meaning stays the same even after splitting.
So I am looking for a solution where I don't have to use synonyms to get the desired result.
I've tried setting catenateWords='1' to get that result, but it also did not match.
This is not possible without adding synonyms. You should add base ball as a synonym for baseball.
The WordDelimiterFilterFactory is deprecated.
Even if you use WordDelimiterGraphFilterFactory, it's not possible:
generateWordParts: it splits words at camel case, like BaseBall, but that's not your case.
catenateWords: it also won't work in your case, as your word has no special character or hyphen to join on; e.g. wi-fi becomes wifi.
So your data should contain the separate words at index time. If you don't want to use synonyms, you have to index both baseball and base ball; only then will you be able to search on these words.
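If you do end up going the synonym route, a minimal sketch looks like this (the file name and field type name are illustrative). With multi-word synonyms like base ball, the expansion is usually applied at query time only:

synonyms.txt:
baseball, base ball
sunglass, sun glass
keychain, key chain

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- expands a query for baseball into both forms -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>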
Is it possible to index a text field considering currency symbols as separate tokens?
For example in a text field I have this:
"16 €"
and I need to build an index with these entries:
16
€
In order to search for "€" and finding the document.
Now I'm using StandardTokenizer and it discards currency symbols.
A possible solution could be using a more "trivial" tokenizer such as the WhitespaceTokenizer, but I think it will produce worse tokenization on other text.
Note that the problem is not how to index currencies; this is a trivial example, but the field could contain arbitrary text.
One possible solution, albeit not very pretty, is to replace the euro sign with something the tokenizer you've chosen will leave alone. You can use a MappingCharFilterFactory to replace the euro sign with a string like EUROSIGN before tokenization, and then map it back again afterwards.
Unless you're able to formally express exactly how you want your tokenizer to work, you'll have to go with one of the preset versions that are suitable for most content to give usable search results. If you have a more specific rule set, writing your own tokenizer in Java is an option.
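A rough sketch of that approach; the mapping file name mapping-currency.txt and the EUROSIGN placeholder are made up for this example:

mapping-currency.txt:
"€" => " EUROSIGN "

<fieldType name="text_currency" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- char filters run before the tokenizer, so StandardTokenizer sees a plain word -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-currency.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Since the same analyzer runs at query time, a query for € goes through the same mapping and matches the indexed token, so you may not even need to map it back.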
I am currently working with the Solr spellcheck feature. I am faced with the problem of not being able to find the original frequency for the input when it contains whitespace.
For example,
spellcheck.q=aple returns the origFreq for the word 'aple'.
However, when I input text with spaces, like bank of amarica, I am not getting the frequency of the whole phrase. Instead it gives each individual word's frequency. The suggestion for this is given via the collation in Solr.
Is there a way to get the hits for input entered with spaces, in this case bank of amarica?
Solr handles multiple words differently depending on the setting of sp.query.extendedResults. If it's false, words with spaces are treated as a single token; if it's true, they are tokenized and treated as separate words. So try changing the core configuration. If that is not the cause, post your config file.
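If you are using the standard SpellCheckComponent, one way to get a hit count for the whole corrected phrase is collation with extended results (these are the stock spellcheck parameters):

q=bank of amarica&spellcheck=true&spellcheck.q=bank of amarica&spellcheck.collate=true&spellcheck.maxCollationTries=5&spellcheck.collateExtendedResults=true

Each collation in the response then carries a hits value: the number of documents the corrected phrase as a whole would return.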
I'm indexing some data, including filenames.
What I want is, given the indexed filename:
MySupercool123girlfriend.jpg
and to be able to search for it with:
supercool
supercool123
123
girlfriend
jpg
So at index time it's pretty easy to use the WordDelimiterFilterFactory so that tokens like these are created:
my
supercool
mysupercool
mysupercool123
supercool123
123
girlfriend
jpg
girlfriend.jpg
etc...
The problem is that at search time, I don't really know what I should do.
If I use WordDelimiterFilterFactory at search time, MySupercool123girlfriend.jpg would match even toto.jpg, because in both cases a token jpg is created.
toto.jpg should not be in the result list at all, so having both results with the appropriate one scoring better is not a solution for me.
Do you have any recommendation for indexing and searching filenames?
For this specific example of yours, i.e. if the search is for MySupercool123girlfriend.jpg and you want it to return only documents that contain the entire string, you can keep a copyField, say named filename_str, whose fieldType is string. String matching will ensure you get an exact match. This could be a first-level "exact match" search.
However, I am guessing that you would also want a search for 123girlfriend.jpg to return the document containing MySupercool123girlfriend.jpg. You can do a second-level search for this. Beginning with Solr 4.0 you can do a regex search like
q=filename_str:/.*123girlfriend\.jpg/
(This regex query should also work for filename field itself, if you are using preserveOriginal=1 in WordDelimiterFilterFactory at index time.)
Else you can do a leading wild-card search, which works in earlier Solr versions too.
If you also want MySupercool.jpg to match MySupercool123girlfriend.jpg, then I guess you would have to manually do the work of the WordDelimiterFilterFactory and construct a regex query like
q=filename_str:/.*My.*Supercool.*\.jpg/
Another issue is that jpg is going to match a lot of documents, so you may want to split the filename and the extension and keep them as separate fields.
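A sketch of this two-field setup; the names and the WordDelimiterFilterFactory options shown are one plausible combination, not the asker's exact config:

<fieldType name="text_filename" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal=1 keeps the full filename as a token too -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" preserveOriginal="1"
            splitOnCaseChange="1" splitOnNumerics="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="filename" type="text_filename" indexed="true" stored="true"/>
<field name="filename_str" type="string" indexed="true" stored="false"/>
<copyField source="filename" dest="filename_str"/>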
Can you come up with a DisMax mm parameter that is meaningful for your use case?
See http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
E.g.
mm=100% and "MySupercool123girlfriend.jpg" would match only filenames that have all ["my", "supercool", "123", "girlfriend", "jpg"] terms in them
You may be able to find a less strict expression that still gives relevant results. See http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/util/doc-files/min-should-match.html
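For instance, assuming the tokenized field is called filename, a request could look like this (the percent sign in mm is URL-encoded as %25):

q=MySupercool123girlfriend.jpg&defType=dismax&qf=filename&mm=100%25

With mm=100%, toto.jpg cannot match MySupercool123girlfriend.jpg, since it contains only one of the five required terms.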
I want to use the Solr KeepWordFilterFactory but I'm not finding the appropriate tokenizer for it. The use case is: I have a string, say hi i am coming, bla-bla go out. From that string I want to keep words like hi i, coming, bla-bla, etc. So which tokenizer should I use with the filter factory so that I am able to get any such combination in facets? I tried different tokenizers but am not getting the exact result. I am using Solr 4.0. Is there any tokenizer that tokenizes based on the keepwords used?
What are your 'rules' for tokenization (splitting long text into individual tokens)? The example above seems to imply that sometimes you have single-word tokens and sometimes multi-word ones ("hi i"). The multi-word case is problematic here, but you might be able to handle it by combining a ShingleFilterFactory, which gives you multi-word tokens as well as the original ones, with a KeepWord filter that keeps only the items you want.
I am not sure whether the KeepWord filter deals correctly with multi-word strings. If it does not, you may want to use a special separator character during the shingle process and then regex-filter it back to a space as the last step.
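A rough sketch of that chain, with an assumed keepwords.txt; the underscore separator is the trick mentioned above, and note that StandardTokenizer splits bla-bla on the hyphen, so it surfaces as the shingle bla_bla:

<fieldType name="text_keep" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit single words plus two-word shingles joined with "_" -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator="_"/>
    <!-- keep only the wanted entries -->
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
    <!-- turn the separator back into a space for faceting -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="_" replacement=" "/>
  </analyzer>
</fieldType>

keepwords.txt (entries in underscore form, since KeepWordFilter runs before the replacement):
hi_i
coming
bla_bla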