I would like to provide results for words that are severely misspelled. Do you have any suggestions on how I can do that in Solr 5? The built-in solr.DirectSolrSpellChecker doesn't seem to be very flexible.
Thanks for any help you can provide.
You may want to consider an analyzer chain that applies a phonetic mapping or some other transformation that reduces spellings to a more general representation. DoubleMetaphone is one example, but there are many different ones depending on the possible reasons the words are being misspelled.
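As a rough sketch (the field type name and analyzer chain here are assumptions, not a recommended configuration), a phonetic field could look like this; with inject="true" the original tokens are kept alongside their Double Metaphone codes, so badly misspelled queries can still match on the phonetic form:

<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- encode each token with Double Metaphone; inject=true also keeps the original token -->
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="4"/>
  </analyzer>
</fieldType>

You could then copyField your text into a field of this type and point the spellchecker (or your query) at it.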
I use Tika to index documents. I then want to get the whole paragraph, from paragraph start to paragraph end, that contains the keywords. I tried to use HighlightFragsize but it does not work. For example, there is a document like the one below:
When I was very small, my parents took me to many places, because they wanted me to learn more about the world. Thanks to them, I
witnessed the variety of the world and a lot of beautiful scenery.
But no matter where I go, in my heart, the place with the most
beautiful scenery is my hometown.
There are two paragraphs above. If I search for 'my parents', I hope I can get the whole paragraph "When I was very small, my parents....... a lot of beautiful scenery", not only part of this paragraph. I used HighlightFragsize to limit the sentence, but the result is not what I want. Please help. Thanks in advance.
You haven't provided a lot of information to go on, but I'm assuming that you're using a highlighter, so here are a couple of things you should check:
1. The field that holds your parsed data - is it stored? Can you see the entire contents?
2. If (1), is the text longer than 51200 characters? The default highlighter configuration has a setting, hl.maxAnalyzedChars, that is set to 51200. This means that the highlighter will not process more than 51200 characters from a highlighted field in a matched document while looking for highlights. If this is the case, increase this value until you get the desired results (an example request is shown below).
Highlighting on extremely large fields may incur a significant performance penalty which you should be mindful of before choosing a configuration.
See this for more details.
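If that character limit turns out to be the problem, you can raise it per request rather than editing solrconfig.xml; the field name and limit here are only placeholders:

q=my+parents&hl=true&hl.fl=my_field&hl.maxAnalyzedChars=500000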
UPDATE
I don't think there's any parameter called HighlightFragsize, but there is one called hl.fragsize, which can do what you want when set to zero.
Try the following query and see if it works for you:
q=my+parents&hl=true&hl.fl=my_field&hl.fragsize=0
Additionally, you should, in any case, be mindful of the first 2 points I posted above.
UPDATE 2
I don't think there's a direct way to do what you're looking for. You could possibly split your field into a multivalued field, with each paragraph stored as a separate value.
You can then use hl.preserveMulti, hl.maxMultiValuedToExamine and hl.maxMultiValuedToMatch to achieve what you need, as in the sketch below.
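A rough sketch of that setup (the field name and limits are assumptions): declare the paragraph field as multivalued, index one paragraph per value, and ask the highlighter to return whole values:

<field name="paragraphs" type="text_general" indexed="true" stored="true" multiValued="true"/>

q=my+parents&hl=true&hl.fl=paragraphs&hl.fragsize=0&hl.preserveMulti=true&hl.maxMultiValuedToExamine=1000&hl.maxMultiValuedToMatch=10

With hl.fragsize=0 each matching value is returned in full, so a hit on 'my parents' should come back as the complete first paragraph.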
Solr 4.3.0
I want to find the larger documents.
I'm trying to build some test data for testing memory usage, but I keep getting the smaller documents. So, if I could add a document-size clause to my query, it would help me find more suitable documents.
I'm not aware of such a possibility; most likely there is no built-in support for it.
One possible approach I could see: add the size of the document to a separate field during indexing, which you can later filter or sort on.
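For instance (the field name and the use of an *_i dynamic field are assumptions that depend on your schema), you could compute the length client-side when sending the document and then query against it:

[{"id": "doc1", "content": "...full extracted text...", "content_length_i": 123456}]

q=*:*&fq=content_length_i:[100000 TO *]&sort=content_length_i desc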
Another possible option is to use the TermVectorComponent, which can return term vectors for matched documents and give some sense of "how big" a document is. Not easy and simple, though. A request for this kind of output is sketched below.
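A sketch of such a request, assuming the TermVectorComponent is enabled on the request handler you query and that the field is called content; the response lists tf/df statistics per term, which hint at the size of each matched document:

q=id:123&tv=true&tv.fl=content&tv.tf=true&tv.df=true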
A third possible option (kudos to MatsLindh for the idea): use the sort function norm() on a specific field. There are some limitations:
You need to use a classic (TF-IDF) similarity
The field you're sorting on must have norms (omitNorms must not be enabled)
Example of the sort parameter: sort=norm(field_name) desc
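Put together, a request could look like this (the field name content is an assumption; the field must have been indexed with norms under a classic similarity):

q=*:*&fl=id&sort=norm(content) desc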
For a Solr search, I want to treat some results differently (where the field "is_promoted" is set to "1") to give them a better ranking. After the "normal" query is performed, the order of the results should be rearranged so that approximately 30% of the results in a given range (say, the first 100 results) are "promoted results". The ordering of the results should otherwise be preserved.
I thought it would be a good idea to solve this by writing a custom Solr plugin. So I tried writing a SearchComponent, but it seems like you can't change the ordering of search results after they have passed through the QueryComponent (since they are cached)?
One could have written some kind of custom sort function (or a function query?) but the challenge is that the algorithm needs to know about the score/ordering of the other surrounding results. A simple increase in the score won't do the trick.
Any suggestions on how this should be implemented?
Just answered this question on the Solr users list. The RankQuery feature in Solr 4.9 is designed to solve this type of problem. You can read about RankQueries here: http://heliosearch.org/solrs-new-rankquery-feature/
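As a rough illustration of the query-time hook RankQuery provides, the ReRank query parser that ships with Solr 4.9 (itself built on the RankQuery API) can already boost documents matching a secondary query within the top N results; the exact "30% promoted" interleaving would still need a custom RankQuery implementation. The field and parameter values below are assumptions:

q=your+normal+query&rq={!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=5}&rqq=is_promoted:1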
Let's say we have two words with easily confused spellings. Let's say we take the terms:
derailer (a device used to prevent fouling of a rail track)
vs.
derailleur (a device used for changing gear ratios on a bicycle)
Now for some reason we have both terms in our spelling suggestions. As a result a search for one or the other will never yield spelling suggestions, despite it being likely that you'll get poor (or no) results if you meant the other term.
So the question is: how can I convince Solr to give me spelling suggestions if you search for one or the other, and what controls do I have to ensure that not every search results in showing spelling suggestions?
Can Solr give you a nearest match when comparing "fingerprint"-type data stored in the Solr datastore? For example:
eJyFk0uyJSEIBbcEyEeWAwj7X8JzfDvKnuTAJIojWACwGB4QeM
HWCw0vLHlB8IWeF6hf4PNC2QunX3inWvDCO9WsF7heGHrhvYV3qvPEu-
87s9ELLi_8J9VzknReEH1h-BOKRULBwyZiEulgQZZr5a6OS8tqCo00cd
p86ymhoxZrbtQdgUxQvX5sIlF_2gUGQUDbM_ZoC28DDkpKNCHVkKCgpd
OHf-wweX9adQycnWtUoDjABumQwbJOXSZNur08Ew4ra8lxnMNuveIem6
LVLQKsIRLAe4gbj5Uxl96RpdOQ_Noz7f5pObz3_WqvEytYVsa6P707Jz
j4Oa7BVgpbKX5tS_qntcB9G--1tc7ZDU1HamuDI6q07vNpQTFx22avyR
Can it find this record if it was presented with something extremely similar? And can it provide back a confidence score?
One straightforward approach could be to use a fuzzy search and pick the first hit (by score); you then need to check whether that hit is a good match or not, and by testing you may find some good rules of thumb.
I'm not sure whether performance would be an issue with such long tokens, though. Use Lucene 4.0, where fuzzy-search performance is much improved.
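A sketch of such a query, assuming the fingerprint is indexed as a single token (for example with KeywordTokenizerFactory) in a field called fingerprint; special characters in the real fingerprint may need escaping, and the ~2 suffix allows at most two edits (the maximum Lucene's fuzzy query supports), so fingerprints differing in more than two characters will not match this way:

q=fingerprint:YOUR_FINGERPRINT_VALUE~2&fl=id,score&rows=1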
You may try experimenting with the NGram filter factory, picking a min/max gram size that is consistent with a matching/similar fingerprint.
If you keep a tight range between minGramSize and maxGramSize, you can match documents with a similar fingerprint without having to wade through false positives; a sketch of such a field type follows.
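A minimal sketch of such a field type; the name and gram sizes are assumptions to be tuned against your data:

<fieldType name="fingerprint_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- split the single fingerprint token into overlapping grams -->
    <filter class="solr.NGramFilterFactory" minGramSize="8" maxGramSize="10"/>
  </analyzer>
</fieldType>

The more grams a query shares with a stored fingerprint, the higher that document scores, so the score of the top hit can serve as a rough confidence measure.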