Show all occurrences of query while highlighting in solr 1.4 - solr

I have a Solr 1.4 setup with a text field holding ebook data. The request parameters are:
"hl.fragsize":"0",
"indent":"1",
"hl.simple.pre":"{{{",
"hl.fl":"body_eng",
"hl.maxAnalyzedChars":"-1",
"wt":"json",
"hl":"true",
"rows":"1",
"fl":"ia,body_length,page_count",
"q":"ia:talesofpunjabtol00stee AND PUNJAB",
"q.op":"AND",
"f.body_eng.hl.snippets":"428",
"hl.simple.post":"}}}",
"hl.usePhraseHighlighter":"true"}},
However, the results show only 20 highlighted occurrences of the word PUNJAB.
I tried "f.body_eng.hl.snippets":"428", but even that isn't working.
body_eng is a big text field, and highlighting only works up to a certain length. I have tried other words as well; in every example, highlighting stops at around 54K characters.
What could be the reason?

First of all: 1.4 is a very old version of Solr. I'm not sure if per-field values were supported at that time (Highlighting itself was introduced with Solr 1.3). The default highlighter was changed in 3.1.
You should, however, be able to highlight all occurrences in a field by supplying a large value for hl.maxAnalyzedChars (not sure if -1 will do what you want). Another option to try is to set both hl.maxAnalyzedChars and hl.fragsize to the same large value (not 0).
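As a rough sketch of that second option, here is how the request from the question could be rebuilt with both limits set to the same large value. The base URL and the limit of 10,000,000 are assumptions picked for illustration; everything else comes from the question's parameters.

```python
from urllib.parse import urlencode

# Assumption: any value comfortably larger than the longest body_eng field.
LARGE_LIMIT = 10_000_000

# Hypothetical endpoint; adjust host, port and path to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

params = {
    "q": "ia:talesofpunjabtol00stee AND PUNJAB",
    "q.op": "AND",
    "rows": 1,
    "fl": "ia,body_length,page_count",
    "wt": "json",
    "hl": "true",
    "hl.fl": "body_eng",
    # Same large value for both, instead of -1 and 0.
    "hl.maxAnalyzedChars": LARGE_LIMIT,
    "hl.fragsize": LARGE_LIMIT,
    "f.body_eng.hl.snippets": 428,
    "hl.simple.pre": "{{{",
    "hl.simple.post": "}}}",
}

url = SOLR_SELECT_URL + "?" + urlencode(params)
print(url)
```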
If you're still unable to get it to work, test it on a more recent version of Solr to see if it's an issue that has already been fixed.

So, after a lot of asking around, it's working now.
The query params are correct; the schema was causing the problem. The change was:
<filter class="solr.SnowballPorterFilterFactory" language="English" />
was replaced with
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

Related

What is the easiest way to implement SVD algorithm for my searched results on Solr?

I created my own core on http://localhost:8983/solr and added some documents so I could query them. But when I query for something like "dog", I want documents that contain "pooch" to be returned too. So I want to implement the SVD algorithm to improve my results.
I am new to search engines. All I know is that I can use Mahout to implement SVD, but that seems a little difficult because I would have to install Maven, Hadoop and Mahout.
Any suggestions will be appreciated.
You can use SynonymGraphFilterFactory
This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a replacement for the Synonym Filter, which produces incorrect graphs for multi-token synonyms.
If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of one another like the Synonym Filter.
Create a file, e.g. mysynonyms.txt, in the directory your_collection/conf/ and list the synonyms using the => sign:
pooch,pup,fido => dog
huge,ginormous,humungous => large
An example schema would be:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
<filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
Source : https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions
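To make the effect of the => rules concrete, here is a toy model of what such a mapping does to a token stream. It is only a simplified sketch (single-word terms, no graph handling), not Solr's implementation, and the function names are made up for illustration.

```python
def parse_synonyms(lines):
    """Parse Solr-style synonym rules into {token: [replacement tokens]}.

    Simplified: single-word terms only.
    """
    rules = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: every term on the left is replaced by the right side.
            left, right = line.split("=>", 1)
            targets = [t.strip() for t in right.split(",") if t.strip()]
            for term in left.split(","):
                if term.strip():
                    rules[term.strip()] = targets
        else:
            # Equivalent terms: each term expands to the whole group.
            group = [t.strip() for t in line.split(",") if t.strip()]
            for term in group:
                rules[term] = group
    return rules


def apply_synonyms(tokens, rules):
    """Replace each token by its mapped terms, leaving unmapped tokens alone."""
    out = []
    for token in tokens:
        out.extend(rules.get(token, [token]))
    return out


rules = parse_synonyms([
    "pooch,pup,fido => dog",
    "huge,ginormous,humungous => large",
])
print(apply_synonyms(["a", "huge", "pooch"], rules))  # ['a', 'large', 'dog']
print(apply_synonyms(["dog"], rules))                 # ['dog']
```

Because "pooch" is rewritten to "dog" in both the index and query analyzers, a query for "dog" and a document containing "pooch" end up with the same term.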
There is another way to augment your index with terms not in the content. Synonyms are good, as @ashraful says. But there are 2 other problems you will run into:
words used but not in the synonym list
behavioral search: using other user behavior as a hint to what they are looking for
These require you to augment the index with terms learned from 1) other searches, and 2) user behavior. Mahout's Correlated Cross Occurrence algorithm can help with both. You can set it up to find terms that lead to people reading an item and (if you have something like purchase or other preference data) conversion items that correlate with items in the index. In the second case you would add user conversions to the search query to personalize the results.
A blog about the technique here: http://actionml.com/blog/personalized_search
The page on Mahout docs here: http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
You should also look at word2vec, which will (given the right training data) find that "dog" and "pooch" are synonyms regardless of the synonym list because it is learned from the data. I'm not sure how you add word2vec to Solr but it is integrated into Fusion, the closed source product of Lucid.

Synonyms are not working ibm watson retrieve and rank

This is my synonyms.txt
file system => filesystem
file set => fileset
version , release
latest, new
content, information
I have changed synonyms.txt, but the synonyms are not working. Also, how do I define synonyms that contain spaces?
eg.
foo bar => foobar
The field type "watson_text_en" used in Retrieve and Rank doesn't include a synonym filter by default, so you need to update your schema.xml to add one. In schema.xml, in the section that defines this field type, add <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> to the analyzer's filter list.
Depending on your requirements, you can add it to the index analyzer, the query analyzer, or both, which tells Solr whether to apply it at index time and/or query time. Adding it to "index" requires reindexing for the change to take effect, while adding it to "query" does not. Also, the filters run in the order you list them, so you can choose where to put this filter to have it run before or after certain other filters. For example, if you put it before solr.LowerCaseFilterFactory, it's better to set ignoreCase="true", because it will run before everything is transformed to lower case.
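The ordering point can be illustrated with a small stand-in for an analyzer chain. This is only a toy model with made-up function names, not how Solr implements its filters; it just shows why a case-sensitive synonym step placed before lowercasing misses capitalized input.

```python
def synonym_filter(tokens, synonyms, ignore_case):
    """Expand tokens found in the synonym table (keys are lower case)."""
    out = []
    for token in tokens:
        key = token.lower() if ignore_case else token
        out.extend(synonyms.get(key, [token]))
    return out


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


# Taken from the "version , release" line of the synonyms.txt in the question.
SYNONYMS = {
    "version": ["version", "release"],
    "release": ["version", "release"],
}

tokens = ["Latest", "Version"]

# Synonym filter first, case-sensitive: "Version" is not found in the table.
case_sensitive = lowercase_filter(synonym_filter(tokens, SYNONYMS, ignore_case=False))

# Synonym filter first, with ignore_case: the synonym is applied.
case_insensitive = lowercase_filter(synonym_filter(tokens, SYNONYMS, ignore_case=True))

# Lowercase first: the synonym is applied even without ignore_case.
after_lowercase = synonym_filter(lowercase_filter(tokens), SYNONYMS, ignore_case=False)

print(case_sensitive)    # ['latest', 'version']
print(case_insensitive)  # ['latest', 'version', 'release']
print(after_lowercase)   # ['latest', 'version', 'release']
```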
Just to note regarding adding the filter into 'Query' - according to the Solr docs, http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory this is a Very Bad Thing to Do.

SOLR eDISMAX product search

I'm new to SOLR and am implementing it to search our product catalog. I'm creating ngrams and edge ngrams on the brand name, display name and category fields.
I'm using edismax and have qf defined as displayname_nge displayname_ng category_nge category_ng brandname_nge brandname_ng.
When I search for 'vitamin c' (without the quotes) I get all of the vitamins. If I surround it with quotes then I only get vitamin c. The problem is that I can't always surround the query string with quotes because a person might enter 'chewable vitamin c', or 'vendor x vitamin c'. I've tried the mm parameter without luck. I've also tried applying different boost levels and still not getting the expected results.
Any suggestions would be greatly appreciated. Thank you
Was there a reason for using only ngrams fields for searching? I'm not sure this is the problem in your case, but you may want to look at your ngrams analysis configuration in schema.xml. One from one of my indexes looks like this:
<fieldType name="ngram" class="solr.TextField" >
<analyzer type="index">
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
</fieldType>
Though you can see this is actually using the safer EdgeNGramFilterFactory, the important thing to note here is minGramSize="2". This means that during the indexing process only grams of at least two characters will be created. The word 'c'? That doesn't get any grams at all. While you could set minGramSize="1" and rebuild your index, single character grams are a very bad idea, as your search for 'c' would match against any document with a word that starts with 'c' (or contains the letter 'c' with NGramFilterFactory).
If you're currently using NGrams with minGramSize="2", a search for 'ca' would find any documents with any words containing the letters 'ca' consecutively in that order. This may not be exactly what you want, either.
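If it helps to see the grams themselves, here is a small sketch that mimics what the two filter types produce for a single token. It is an approximation written for illustration (the helper names are mine), not Solr code, and the sample words are made up.

```python
def edge_ngrams(word, min_size, max_size):
    """Prefixes of word, similar to front-side edge n-grams."""
    return [word[:n] for n in range(min_size, min(max_size, len(word)) + 1)]


def ngrams(word, min_size, max_size):
    """Every substring of word whose length is between min_size and max_size."""
    return [
        word[i:i + n]
        for n in range(min_size, min(max_size, len(word)) + 1)
        for i in range(len(word) - n + 1)
    ]


print(edge_ngrams("vitamin", 2, 15))  # ['vi', 'vit', 'vita', 'vitam', 'vitami', 'vitamin']
print(edge_ngrams("c", 2, 15))        # []  (too short for minGramSize=2)

# With minGramSize=1, 'c' matches any word that starts with 'c'.
print("c" in edge_ngrams("calcium", 1, 15))  # True

# With plain n-grams and minGramSize=2, 'ca' matches inside other words.
print("ca" in ngrams("duplicate", 2, 15))    # True
```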
My top suggestion would be to drop the ngrams in favor of a more vanilla Text field. Whether you want to keep the edge-ngrams around for better truncation support is up to you, but I suspect you'll have better luck if the Text field is at least in the mix.
You could also take a look at this question on StackOverflow: "Can I protect short words from an n-gram filter in Solr?" if you want to pursue the ngrams further.
Also, you should consider using Solr's built-in analysis tool to figure out where your searches are failing. You choose a field or fieldType, and provide values for what was entered into the index and what is being searched. It will show you how the analysis works against both values so you can see how each string is broken down and why it does or doesn't create matching tokens. The URL for the tool depends on whether you're in a multi-core environment, but if you go to Solr's web interface you should be able to find the Analysis link on the left.
Update:
Now that I have a little more detail from you and am thinking about it again, the results you're getting are very explainable.
With minGramSize="1", your unquoted search for 'vitamin c' is looking for records with the word 'vitamin' (or a longer word containing 'vitamin'), and the word 'c' (or a longer word containing 'c'). Since most records are likely to have a 'c' somewhere, this is hardly a limiting factor and your results will be very close to or exactly the same as your results for just the word 'vitamin'.
In the quoted search for 'vitamin c', the 'c' now has to appear in a word immediately following vitamin, making it a much more useful search, but still not great. You should be able to test this by finding records that have a word following vitamin that isn't a vitamin designation. For example, a record mentioning "vitamin tablets" should be found when searching for "vitamin b" (because there's a 'b' in "tablets"), and a record mentioning "vitamin chart" or "vitamin deficiency" should be found when searching for "vitamin c".
The upshot of this is that I strongly recommend having a set of fields for searching separate from your fields for autocomplete. The NGrams with minGramSize="1" are just not going to give you reasonable results for the actual search step.
Another option is to use the edismax 'mm' parameter, where you can specify the percentage of query terms that must match. With 100%, every term has to match, which gives you precise results; 75% will give you the broader list of vitamins. You can set the percentage programmatically according to your needs.
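A minimal sketch of setting mm per request is below. The qf value is copied from the question; the helper name is made up, and how you pick the percentage for a given query is left to you.

```python
from urllib.parse import urlencode


def edismax_params(user_query, mm):
    """Build edismax request parameters with a minimum-should-match value."""
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": ("displayname_nge displayname_ng category_nge category_ng "
               "brandname_nge brandname_ng"),
        "mm": mm,
    }


strict = edismax_params("chewable vitamin c", "100%")  # every term must match
loose = edismax_params("chewable vitamin c", "75%")    # some terms may be missing

print(urlencode(strict))
print(urlencode(loose))
```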
You may consider rewriting the query keywords this way: "'vitamin c' vitamin c". In that case, records matching the phrase 'vitamin c' can get a higher score than those matching 'vitamin' and 'c' separately. Your search results will still return all matching records. Please see if this helps, and feel free to comment.
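The rewrite can be done before the query is sent to Solr. A minimal sketch, with a hypothetical helper name, that also strips any double quotes the user typed so the generated phrase stays well-formed:

```python
def with_exact_phrase(user_query):
    """Return the query as a quoted phrase followed by the original terms."""
    # Remove double quotes and collapse whitespace in the user's input.
    cleaned = " ".join(user_query.replace('"', " ").split())
    return f'"{cleaned}" {cleaned}'


print(with_exact_phrase("vitamin c"))           # "vitamin c" vitamin c
print(with_exact_phrase("chewable vitamin c"))  # "chewable vitamin c" chewable vitamin c
```

This assumes the clauses are optional, so the phrase only adds to the score; with mm set to 100% the phrase clause would, as far as I can tell, become required as well.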

Solr british and american spelling

Search for 'globali*z*ation' only returns search results for 'globalization' but doesn't include any results for 'globali*s*ation' and vice versa.
I'm looking into the solr.HunspellStemFilterFactory filter (available in Solr 3.5).
<filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic,en_US.dic" affix="en_GB.aff,en_US.aff" ignoreCase="true" />
Before upgrading from Solr 3.4 to 3.6.1 I was wondering if Hunspell filter is the way to go?
Thanks
If stemming doesn't solve this for you, you could always use a SynonymFilterFactory to normalize both spellings into one. I guess a dictionary containing US/UK spelling variations wouldn't be hard to come by.
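As a sketch of that idea, a list of spelling pairs could be turned into synonym rules along these lines. The three pairs are placeholders I made up; a real list would come from whatever UK/US dictionary you find.

```python
# Placeholder pairs for illustration only: British spelling -> American spelling.
SPELLING_PAIRS = {
    "globalisation": "globalization",
    "colour": "color",
    "organise": "organize",
}


def synonym_lines(pairs):
    """Format spelling pairs as explicit synonym mappings, one per line."""
    return [f"{british} => {american}" for british, american in sorted(pairs.items())]


lines = synonym_lines(SPELLING_PAIRS)
print("\n".join(lines))
```

The resulting lines would go into the synonyms file. Presumably the filter should then be applied in both the index and query analyzers, so that either spelling ends up as the same term on both sides.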

How to use SynonymFilterFactory in Solr?

I'm trying to execute synonym filtering at query time so that if I search for X, results for Y also show up.
I go to where Solr is being run, edit the .txt file and add X, Y on a new line.
This does not work. I check the schema and I see:
<analyzer type="query">
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
What am I missing?
EDIT
Assessing configuration files
tomcat6/Catalina/localhost seems to point to the correct location
<Context docBase="/data/solr/solr.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
<Environment name="solr/home" type="java.lang.String" value="/data/solr" override="true" />
</Context>
Also, in the Solr admin I see this. What does cwd mean?
cwd=/usr/share/tomcat6 SolrHome=/data/solr/
Use the SynonymFilterFactory only at index time, not query time. There are some subtle but well-understood problems with synonyms at query time.
See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
After you move synonyms to the index analyzer chain, check that they are working with the Analysis page in the admin UI.
The answer from @Walter Underwood is good, but incomplete.
Whether you use the SynonymFilterFactory at index or query time depends on your default operator.
So, let's say we have a synonym file with this entry:
5,five
If your default operator is OR (which is the default default operator), then you should set up your synonyms on the query filter. This way a query for "5" will be passed to the backend as a query for "5" OR "five", say, and the backend would respond appropriately. At the same time, you can make changes to your synonym file without reindexing, and your index is smaller since it doesn't have to have so many tokens.
However, if you change the default operator to AND, you should set up your synonyms on the index filter instead. If you don't, a query for "5" would go to the backend as "5" AND "five", and it would not match the documents that it's expected to. This, alas, makes the index bigger, and also means new synonyms require complete reindexes.
Note: The documentation for this is currently wrong, leaving out all these details.
