I am trying to implement case-insensitive search on my title and content
fields, but to no avail. I have tried the following:
Adding <filter class="solr.LowerCaseFilterFactory" /> to the 'text_general' field type in schema.xml / managed-schema.xml, in both the 'index' and 'query' analyzers.
My title and content fields are both of type 'text_general'.
I tried searching the following:
*abc* : No results containing 'ABC' appear.
*ABC* : Only results containing 'ABC' appear.
This clearly shows that the lowercase filters are not working. The debug results of the first query are pasted below.
Below is also a screenshot of the title field analysis for a sample text. The output seems OK, but search does not work as expected. Is this a search query issue?
Thanks in advance for any help!
No, it doesn't clearly show that the lowercase filtering doesn't work. What you're experiencing is that most filters and tokenizers aren't applied when you do a wildcard search, since they can't be applied cleanly when they don't have the whole term to work with.
If you want to perform a wildcarded, lowercased search, the solution is to lowercase or otherwise process the field value before actually indexing it, and to use only a tokenizer to split the text as necessary (LowerCaseTokenizer seems to be the only one that is a MultiTermAwareComponent). Otherwise, if you don't want to perform any tokenization or splitting of the string, use a string field.
You can do this either in your own code that sends content to Solr or in an update processor.
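As a minimal sketch of the "in your own code" option: the snippet below lowercases the searchable fields before a document is sent to Solr and lowercases the user's input before building the wildcard query. The helper names `lowercase_fields` and `wildcard_query` are hypothetical, the field names come from the question, and special query characters in the user input are not escaped here.

```python
def lowercase_fields(doc, fields=("title", "content")):
    """Return a copy of doc with the given text fields lowercased."""
    out = dict(doc)
    for name in fields:
        value = out.get(name)
        if isinstance(value, str):
            out[name] = value.lower()
    return out


def wildcard_query(field, user_input):
    """Build a lowercased *term* wildcard query for a single field."""
    term = user_input.strip().lower()
    return f"{field}:*{term}*"


doc = {"id": "1", "title": "ABC Guide", "content": "All about ABC"}
indexed = lowercase_fields(doc)         # what would be sent to Solr
query = wildcard_query("title", "ABC")  # what would be sent as q
```

Because this overwrites the field value, the original casing would have to be kept in a separate field if it is needed for display.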
I'm trying to make a React InstantSearch that lets you search phone numbers. They need to be displayed in this format: "(123)456-7890", but I want to be able to search with either "(123)456-7890" or "1234567890".
I thought I could just store it in the index formatted and then the typo tolerance would take care of non-formatted queries. But I get no results with the query "1234567890". It apparently has to do with the fact that the formatting splits the number into three words and the query is just one word. Bizarrely, this means that adding the parentheses doesn't get you more matching characters on the search, but leaving them out can cause the query not to match at all.
I then tried just storing it as non-formatted (only digits) in the index. This time, both the formatted and unformatted queries got a match. But when typing it in digit-by-digit, the result disappears when I get to "(123)", only reappearing when I get to "(123)456-7". It seems like a frustrating and bizarre user experience to be typing exactly the number the result shows and having it disappear.
I've tried adding the parens to the optional words setting, but that didn't seem to have any effect. I think if I could get Algolia to ignore the parens and dash instead of replacing them with a space, this whole thing wouldn't be a problem. Is there a way to accomplish that? Maybe it's best to find a way to filter the query before it gets sent to Algolia? How should I go about that?
Store 1234567890 in an attribute named phoneNoFormat and (123)456-7890 in an attribute named phoneFormat. Include both in searchableAttributes. On the display side, look in the _highlightResult field to see which attribute matched and render the highlighted result for that attribute. With default typo tolerance each of these queries will match and correctly highlight either one or both of the attributes.
1234567890
123-456-7890
(123)456-7890
(123) 456-7890
(123)4567890
(123) 4567890
Since you're using React InstantSearch, you will need to make your own Hits component, where you can change the attribute name used to display the result on a per-hit basis. Thankfully this is not too complicated. Just see the documentation for connectHits.
When you are looping through the hits, look at the _highlightResult property of each hit to see which of the two attributes matched. Then, when you create the <Highlight /> component, set the attributeName property to the right attribute. So you have this:
<Highlight attributeName='phoneFormat' hit={hit}/>
Or this:
<Highlight attributeName='phoneNoFormat' hit={hit}/>
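The selection step can be sketched independently of React. The snippet below (Python rather than JSX, to keep the logic self-contained) picks the attribute to highlight by checking each attribute's `matchLevel` inside `_highlightResult`, which Algolia reports as "none" when that attribute did not match. The helper name `matched_attribute` and the sample hit are illustrative assumptions.

```python
def matched_attribute(hit, candidates=("phoneFormat", "phoneNoFormat")):
    """Return the first candidate attribute whose highlight shows a match."""
    highlight = hit.get("_highlightResult", {})
    for name in candidates:
        if highlight.get(name, {}).get("matchLevel", "none") != "none":
            return name
    # Nothing matched on these attributes: fall back to the formatted one.
    return candidates[0]


# Made-up hit for the query "1234567890": only the unformatted attribute matched.
hit = {
    "phoneFormat": "(123)456-7890",
    "phoneNoFormat": "1234567890",
    "_highlightResult": {
        "phoneFormat": {"value": "(123)456-7890", "matchLevel": "none"},
        "phoneNoFormat": {"value": "<em>1234567890</em>", "matchLevel": "full"},
    },
}
attribute_name = matched_attribute(hit)
```

The returned name is what would be passed as attributeName to the Highlight component for that hit.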
I use Solr 5 for searching in large (text) documents. For each search result, I display a fragment containing the highlighted search match. This works nicely using Solr's Standard Highlighter. Yet I found that if several matches are found close to each other, they will be merged into one fragment, even with hl.mergeContiguous=false. The parameters are:
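import org.apache.solr.client.solrj.SolrQuery;

SolrQuery query = new SolrQuery();
query.setQuery(rawQuery);
query.set("defType", "lucene");
query.setRows(1000);
// Highlight the "content" field: up to 20 snippets of about 200 characters each.
query.setHighlight(true);
query.setHighlightFragsize(200);
query.setHighlightSnippets(20);
query.setParam("hl.fl", "content");
query.setParam("hl.maxAnalyzedChars", "-1");
// Ask the highlighter not to merge adjacent fragments.
query.setParam("hl.mergeContiguous", false);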
Example: I use a bible translation for testing, just because of its length. Searching for beast yields (among many others)
...7:8 Of clean beasts, and of beasts that are not clean, and of birds, and of everything that creeps upon the ground, 7:9 there went in two and two to Noah into...
I would rather have this fragment twice, because it contains two occurrences of the search term. Manually duplicating the fragment in this case appears clumsy to me. Am I missing a query parameter, or do I need a custom BoundaryScanner to achieve this?
You could use the regex-based fragmenter (hl.fragmenter=regex): prepare a regex based on your terms and attach it to the request as hl.regex.pattern. Also look at the related hl.regex.slop and hl.regex.maxAnalyzedChars parameters if you want to try this.
Or you can reduce the fragment size for the standard highlighter: set hl.fragsize to a value small enough that two occurrences of your terms are unlikely to fall within one fragment.
BoundaryScanner works with the FastVectorHighlighter only, and can be an option if no out-of-the-box parameter works.
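The first two suggestions can be sketched as request parameters. This is only an illustration of how the parameters fit together: the helper name `highlight_params`, the regex pattern, and the chosen sizes are assumptions, and whether either variant actually separates neighbouring matches would need to be checked against your own data.

```python
from urllib.parse import urlencode


def highlight_params(query, pattern=None, fragsize=200, slop=0.5):
    """Build Solr highlighting parameters for the 'content' field."""
    params = {
        "q": query,
        "hl": "true",
        "hl.fl": "content",
        "hl.snippets": 20,
        "hl.fragsize": fragsize,
        "hl.mergeContiguous": "false",
    }
    if pattern is not None:
        # Switch from the default gap fragmenter to the regex fragmenter.
        params["hl.fragmenter"] = "regex"
        params["hl.regex.pattern"] = pattern
        params["hl.regex.slop"] = slop
    return params


# Option 1: regex fragmenter with an illustrative pattern (an assumption).
regex_params = highlight_params("beast", pattern=r"[-\w ,/\n]{20,100}")
# Option 2: default fragmenter with a much smaller fragment size.
small_params = highlight_params("beast", fragsize=40)

# Query string that could be appended to a /select request.
query_string = urlencode(small_params)
```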
My text is like this: The searched word WildCard shall be partially highlighted
I search using wildcard expression "wild*".
What I need is for the highlight snippet to be [tag]Wild[/tag]Card. What I got was [tag]WildCard[/tag], and I spent lots of time researching it, but could not find an answer.
This behavior can be seen on linkedin.com, where you type other people's names at the top right corner.
Once this is figured out, I will have a follow-up question.
Thanks,
I am not sure if you can achieve what you want directly in Solr. The obvious solution is to parse the returned doc yourself, searching for [tag]WildCard[/tag], and work out which part of the term you need to highlight.
This is not possible with Solr. What you need to do is change the characters Solr uses to highlight found words to something you can easily remove (maybe an html comment) and then build a highlight yourself. You are likely already storing your query in a variable, so just scan the search return docs and highlight all instances of your query.
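A minimal sketch of that client-side approach, assuming the query is a simple trailing-wildcard term such as wild* and that the text has already been retrieved without (or stripped of) Solr's own markers. The helper name `highlight_prefix` is hypothetical.

```python
import re


def highlight_prefix(text, prefix, pre="[tag]", post="[/tag]"):
    """Wrap only the prefix part of each word that starts with prefix."""
    pattern = re.compile(r"\b(" + re.escape(prefix) + r")", re.IGNORECASE)
    # A function is used as the replacement so that pre/post are inserted
    # literally and the original casing of the match is preserved.
    return pattern.sub(lambda m: pre + m.group(1) + post, text)


raw_query = "wild*"
prefix = raw_query.rstrip("*")  # drop the trailing wildcard
text = "The searched word WildCard shall be partially highlighted"
highlighted = highlight_prefix(text, prefix)
```

This only covers prefix queries; patterns with a leading or embedded wildcard would need a different regular expression.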
I'm very new to Solr,
and I would really like a step-by-step guide to making my Solr search results look like Google's.
To give you an idea, when you search 'PHP' in http://wiki.apache.org/solr/FindPage , the word 'php' shows up in bold .. This is the same result I want to have.
Showing only a part of the document, even if the PDF is very large.
You can use highlighting to show a matching snippet in the results.
http://wiki.apache.org/solr/HighlightingParameters
By default, it will wrap matching words with <em> tags, but you can change this by setting the hl.simple.pre/hl.simple.post parameters.
You may be looking at the wrong part of the returned data. Try looking at the 'highlighting' component of the returned data structure (i.e. don't look at the response docs). This should give you the snippets you want.
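To illustrate where the snippets live, here is a small sketch that pairs each returned document with its entry in the separate 'highlighting' section of a JSON response. The sample response is made up, the helper name `snippets_by_doc` is hypothetical, and the unique key field is assumed to be called id.

```python
def snippets_by_doc(solr_response, field):
    """Map each returned document id to its highlighted snippets for field."""
    highlighting = solr_response.get("highlighting", {})
    result = {}
    for doc in solr_response.get("response", {}).get("docs", []):
        doc_id = doc["id"]
        # Documents without a highlight for this field get an empty list.
        result[doc_id] = highlighting.get(doc_id, {}).get(field, [])
    return result


# Made-up response in the shape Solr returns with wt=json and hl=true.
sample = {
    "response": {"numFound": 2, "docs": [{"id": "doc1"}, {"id": "doc2"}]},
    "highlighting": {
        "doc1": {"content": ["An introduction to <em>PHP</em> scripting"]},
        "doc2": {},
    },
}
snippets = snippets_by_doc(sample, "content")
```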
When I search for the word "fish" I get back a list of documents containing that word and variants of that word. If I turn on highlighting I might see a snippet that looks like this:
The law requires that anyone <em>fishing</em> in public lakes...
I would like to show the user the above snippet, which works just fine by the way, but I would also like to show the user a complete list of words that would also have been highlighted had I shown all snippets.
For example I would like to be able to show the user the following:
Section 18.32A - Hunting and Fishing
...The law requires that anyone <em>fishing</em> in public lakes...
Document also contains: Fish, Fishing, Fisherman
Is there a way to get that list of words other than having Solr highlight the entire document and then parsing the document myself, looking for em tags and building a list of highlighted words?
I would investigate the fragment size (hl.fragsize), synonyms (synonyms.txt), or stemming (which can help with variations of a word) to find a solution. You can set fish, fishing, fished to all mean the same in the synonyms file. Ensure you understand how the expand option works and whether you want the search to replace each term with the others. Also ensure you know whether to apply the synonym file at index time or at query time. Do not use synonyms at both index and query time. There is also a switch to enable multiple matches in highlighting.
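If none of these settings yields the list directly, the fallback mentioned in the question is still fairly small. Below is a sketch that collects the distinct highlighted words from whatever snippets come back (for example with hl.fragsize=0, so the whole field is highlighted). The helper name `highlighted_words` and the sample snippets are assumptions.

```python
import re


def highlighted_words(snippets, pre="<em>", post="</em>"):
    """Return the distinct highlighted words, compared case-insensitively."""
    pattern = re.compile(re.escape(pre) + r"(.*?)" + re.escape(post), re.DOTALL)
    seen = {}
    for snippet in snippets:
        for word in pattern.findall(snippet):
            # Keep the first spelling seen for each case-folded word.
            seen.setdefault(word.lower(), word)
    return sorted(seen.values(), key=str.lower)


sample_snippets = [
    "The law requires that anyone <em>fishing</em> in public lakes",
    "Section 18.32A - Hunting and <em>Fishing</em>",
    "A licensed <em>fisherman</em> may <em>fish</em> all year",
]
words = highlighted_words(sample_snippets)
```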