I am currently working with the Solr spellcheck feature. My problem is that I cannot find the original frequency (origFreq) for the input when it contains whitespace.
For example,
spellcheck.q=aple returns me origFreq for the word 'aple'
However, when I input text with spaces, like bank of amarica, I do not get the frequency of the whole phrase. Instead Solr gives each individual word's frequency. The suggestion for this is given via the collation in Solr.
Is there a way to get the hit count of an input that contains spaces, in this case bank of amarica?
Solr handles multiple words differently depending on the setting of sp.query.extendedResults. If it's false, words with spaces are treated as a single token; if it's true, they are tokenized and treated as separate words. So try changing the core configuration. If that doesn't help, post your config file.
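If the goal is the hit count for the whole corrected phrase, the collation feature can report it. As a sketch (standard SpellCheckComponent parameter names; the handler path and the value of maxCollationTries are assumptions, adjust to your setup), enabling extended collation results makes Solr include a hit count for each collation:

```
/select?q=*:*
  &spellcheck=true
  &spellcheck.q=bank%20of%20amarica
  &spellcheck.collate=true
  &spellcheck.maxCollationTries=5
  &spellcheck.collateExtendedResults=true
```

With spellcheck.collateExtendedResults=true and maxCollationTries greater than zero, each collation in the response carries a hits field showing how many documents the corrected phrase as a whole would match.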
Related
I have a DB where i store strings of hex colors like f9f8f7 or aaaaaa
If I search for a color, strange things happen:
if I search aaaaaa I get the one and only result that contains aaaaaa
but if I search f9f8f7 I get more results that are not pertinent...
it seems that since f9f8f7 has letters and numbers together, Solr tries to split it and search for the single parts..
How do I prevent this?
Define your field so that it does not split on numbers inside words. You can use the WhitespaceTokenizer in your field definition instead of the one you're currently using. The whitespace tokenizer will only break words on whitespace, instead of using a range of other splitting points (which depend on your tokenizer and on whether you have a WordDelimiter(Graph)Filter active in your analysis chain).
You can test out exactly how your strings are being processed for a specific field by going to "Analysis" under the collection / core in your Solr Admin.
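For the hex-color case, a field type built on the whitespace tokenizer could look like the sketch below (the fieldType name hexcolor is made up; adjust it to your schema):

```xml
<fieldType name="hexcolor" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- splits only on whitespace, so f9f8f7 stays one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```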
We have used StandardTokenizerFactory in Solr, but we face an issue when searching without a special character.
For example, the content contains "What’s the Score?" (with a special character), but when we search "Whats the Score" we don't get the proper result. That
means searching the title with and without the special character should both work.
Please suggest which filter we need to use to satisfy both conditions.
If you have a recent version of Solr, try adding solr.WordDelimiterGraphFilterFactory with catenateWords=1 to your analyzer chain.
Starting from What's, this should create three tokens: What, s, and Whats.
I'm not sure whether ' is in the list of characters the filter uses to concatenate words; in any case, you can add it using the parameter types="characters.txt".
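A sketch of such an analyzer chain (the surrounding tokenizer and filters are illustrative; see the WordDelimiterGraphFilterFactory documentation for the full parameter list):

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- catenateWords="1" joins the split parts back together: What's -> What, s, Whats -->
  <filter class="solr.WordDelimiterGraphFilterFactory" catenateWords="1"/>
  <!-- required at index time after a graph-producing filter -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```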
I have a Solr schema that uses solr.SnowballPorterFilterFactory. When I do admin/analysis,
I see that for the query "iphone", after SnowballPorterFilterFactory, I get "iphon", even though the file specified in the schema (protwords_ro.txt) is empty.
I have removed the filter and the term text remains "iphone". Since my protwords_ro.txt file is empty, I don't really need the filter right now, but I was wondering why this is happening.
Actually, this filter is for stemming.
In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form
So, for example, for the word resume this filter will give resum, etc.
Also,
The Snowball stemmers rely on algorithms and are considered fairly aggressive
I think this is why you got iphon, even though your text file is empty.
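If you do want to stop specific terms from being stemmed, the filter's protected attribute points at the file of protected words; adding iphone there (one word per line) would keep it intact. A sketch matching the schema described above (the language value is an assumption based on the _ro suffix):

```xml
<filter class="solr.SnowballPorterFilterFactory" language="Romanian" protected="protwords_ro.txt"/>
```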
I want to use the Solr KeepWordFilterFactory but can't find an appropriate tokenizer for it. The use case: I have a string, say hi i am coming, bla-bla go out. From this string I want to keep words like hi i, coming, bla-bla, etc. Which tokenizer should I use with the filter factory so that I can get any such combination in facets? I have tried different tokenizers but am not getting the exact result. I am using Solr 4.0. Is there a tokenizer that tokenizes based on the keep words used?
What are your 'rules' for tokenization (splitting long text into individual tokens)? The example above seems to imply that you sometimes have single-word tokens and sometimes a multi-word one ("hi i"). The multi-word case is problematic here, but you might be able to handle it by combining ShingleFilterFactory, which gives you multi-word tokens as well as the original ones, with a filter that keeps only the items you want.
I am not sure whether the KeepWord filter deals correctly with multi-word strings. If it does not, you may want to use a special separator character during the shingle process and then regex-filter it back to a space as the last step.
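A sketch of the combined chain (the file name keepwords.txt and the shingle size are assumptions): ShingleFilterFactory emits two-word shingles alongside the original tokens, and KeepWordFilterFactory then discards everything not on the list:

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- outputUnigrams="true" keeps single words as well as two-word shingles -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  <!-- keepwords.txt would list entries such as: hi i, coming, bla-bla -->
  <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
</analyzer>
```

ShingleFilterFactory joins shingle parts with a space by default (its tokenSeparator parameter), so the multi-word entries in keepwords.txt must use the same separator.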
In Apache Solr, why should we prefer a string field over a text field if both serve the purpose?
How do string and text affect things like index size, index reads, and index creation?
The two field types, as defined in the default Solr schema, are vastly different.
String stores a word/sentence as an exact string without performing tokenization etc. It is commonly useful for storing exact matches, e.g., for faceting.
Text typically performs tokenization and secondary processing (such as lower-casing etc.). It is useful for all scenarios where we want to match part of a sentence.
If the sample sentence "This is a sample sentence" is indexed into both fields, we must search for exactly the text This is a sample sentence to get a hit from the string field, while it may suffice to search for sample (or even samples, with stemming enabled) to get a hit from the text field.
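As a sketch, the two flavors might be declared like this in the schema (the field names are made up; text_general is the stock tokenized type in recent default schemas):

```xml
<!-- exact-match field: the whole sentence is indexed as one term -->
<field name="title_s" type="string" indexed="true" stored="true"/>
<!-- tokenized field: individual words are searchable -->
<field name="title_t" type="text_general" indexed="true" stored="true"/>
```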
Adding to Johan Sjöberg's good answer:
You can sort on a string field, but not on a (tokenized) text field.