Solr Facet Search-Spell check - solr

I'm using Solr facet search on a database column. It successfully returns the data:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="tags">
<int name="lol">58</int>
<int name="scienc">58</int>
<int name="photo">34</int>
<int name="axiom">27</int>
<int name="geniu">14</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
I want to make sure that only complete words are counted. In the above example you can see counts for 'scienc' and 'geniu' that should be for 'science' and 'genius'. How can I achieve this? Can I incorporate a spell-checking feature?

This probably has to do with the underlying fieldType that you have associated with your tags field. The field value is most likely being stemmed or having other analyzers associated with it. I would suggest one of two things:
Remove the stemming and/or other processing to prevent the words from appearing as partial.
(Recommended) Create a separate field tags_facet with type="string" in your schema.xml and use a copyField directive to copy the values fed into your original tags field. Then facet on this new tags_facet field.

Use the copyField feature of Solr to copy the original field to one with a string fieldType. If the values are a set of words, instead of string, you could use a whitespace tokenised fieldtype (without ngrams of course.)
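The copyField approach suggested in both answers might look roughly like this in schema.xml. This is a minimal sketch: the name tags_facet comes from the first answer, while the indexed, stored and multiValued values are assumptions that depend on how your tags field is defined.

```xml
<!-- Unanalyzed copy of tags, used only for faceting (attribute values are assumptions) -->
<field name="tags_facet" type="string" indexed="true" stored="false" multiValued="true"/>
<!-- Copies the raw input of tags before any stemming or other analysis is applied -->
<copyField source="tags" dest="tags_facet"/>
```

After reindexing, facet with facet.field=tags_facet. A string field keeps each value as a single term, so if one value holds several words, the whitespace-tokenized type mentioned in the second answer is probably the better fit.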

Related

Can we give boost to fields through solr config file?

We currently specify the boost in every query. Is it possible to set the boost for a field in the Solr config itself?
In the requestHandler config:
<requestHandler name="/select" class="solr.SearchHandler">
....
<lst name="appends">
<str name="qf">my_col^1</str>
<!--str name="qf">my_col^boost_val</str-->
<!--str name="bq">my_col2^boost_val</str-->
</lst>
....
It is possible to boost fields individually in Solr.
The qf (Query Fields) parameter takes a list of fields, each of which can be assigned a boost factor that increases or decreases that field's importance in the query.
Below is a sample solrconfig.xml entry.
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">dismax</str>
<str name="qf">title^10 content^5</str>
</lst>
</requestHandler>
In the above, qf assigns the title field a boost of 10 and the content field a boost of 5.
NOTE: The qf (Query Fields) parameter can't be used with the standard query parser. You can use it with the dismax or edismax query parser.

How to boost repeated values in a multiValue field on Solr

I have some repeated data (identical strings) in a multiValued field in my Solr index and I want to boost documents by the number of matches in that field. For example:
doc1 : { locales : ['en_US', 'de_DE', 'fr_FR', 'en_US'] }
doc2 : { locales : ['en_US'] }
When I run the query q=locales:en_US I would like to see doc1 at the top because it has two "en_US" values. What is the proper way to boost this kind of data?
Should I use a special tokenizer?
Solr version is: 4.5
Disclaimer
To use either of the following solutions, you will need to make one of the following changes:
Create a copyField for locales:
<field name="locales" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- No need to store(stored="false") locales_text as it will only be used for searching/sorting/boosting -->
<field name="locales_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="locales" dest="locales_text"/>
Change the type of locales to "text_general" (the type is provided in the standard solr collection1)
First solution (Ordering):
Results can be ordered by some function. So we can order by number of occurrences (termfreq function) in field:
If copyField is used, then sort query will be: termfreq(locales_text,'en_US') DESC
If locales is of text_general type, then sort query will be: termfreq(locales,'en_US') DESC
Example response for copyField option (the result is the same for text_general type):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="fl">*,score</str>
<str name="sort">termfreq(locales_text,'en_US') DESC</str>
<str name="indent">true</str>
<str name="q">locales:en_US</str>
<str name="_">1383598933337</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="0.5945348">
<doc>
<arr name="locales">
<str>en_US</str>
<str>de_DE</str>
<str>fr_FR</str>
<str>en_US</str>
</arr>
<str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
<long name="_version_">1450808563062538240</long>
<float name="score">0.4203996</float></doc>
<doc>
<arr name="locales">
<str>en_US</str>
</arr>
<str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
<long name="_version_">1450808391856291840</long>
<float name="score">0.5945348</float></doc>
</result>
</response>
You can also use fl=*,termfreq(locales_text,'en_US') to see the number of matches.
One thing to keep in mind: it is an ordering function, not a boost function. If you would rather boost the score based on multiple matches, you will probably be more interested in the second solution.
I included the score in the results to demonstrate what #arun was talking about. You can see that the scores differ (probably due to field length). It was quite unexpected (for me) that for a multivalued string field the score is the same.
Second solution (Boosting):
If copyField is used, then the query will be : {!boost b=termfreq(locales_text,'en_US')}locales:en_US
If locales is of text_general type, then the query will be: {!boost b=termfreq(locales,'en_US')}locales:en_US
Example response for copyField option (the result is the same for text_general type):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="lowercaseOperators">true</str>
<str name="fl">*,score</str>
<str name="indent">true</str>
<str name="q">{!boost b=termfreq(locales_text,'en_US')}locales:en_US</str>
<str name="_">1383599910386</str>
<str name="stopwords">true</str>
<str name="wt">xml</str>
<str name="defType">edismax</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="1.1890696">
<doc>
<arr name="locales">
<str>en_US</str>
<str>de_DE</str>
<str>fr_FR</str>
<str>en_US</str>
</arr>
<str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
<long name="_version_">1450808563062538240</long>
<float name="score">1.1890696</float></doc>
<doc>
<arr name="locales">
<str>en_US</str>
</arr>
<str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
<long name="_version_">1450808391856291840</long>
<float name="score">0.5945348</float></doc>
</result>
</response>
You can see that the score changed significantly. The first document scores twice as high as the second (because there were two matches, each scored as 0.5945348).
Third solution (omitNorms=true)
Based on the answer from #arun I figured that there is also a third option.
If you convert your field to (for example) text_general AND set omitNorms=true for that field, it should have the same result.
The default standard request handler in Solr does not use only the term frequency to compute scores. Along with term frequency, it also uses the length of the field. See the Lucene scoring algorithm, where it says:
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.
Since doc2 has a shorter field it might have scored higher. Check the score for the results with fl=*,score in your query. To know how Solr arrived at the score, use fl=*,score&wt=xml&debugQuery=on (then right click on your browser and view page-source to see a properly indented score calculation). I believe you will see the lengthNorm contributing to a lower score for doc1.
To keep the length of the field from contributing to the score, you need to disable length normalization: set omitNorms=true for that field. (Ref: http://wiki.apache.org/solr/SchemaXml) Then see what the scores are.
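Disabling norms is a schema change. A sketch of what it might look like, assuming the locales_text copy field from the earlier answer is the field being queried and scored:

```xml
<!-- omitNorms="true" drops length normalization (and index-time boosts) for this field -->
<field name="locales_text" type="text_general" indexed="true" stored="false" multiValued="true" omitNorms="true"/>
```

The documents have to be reindexed for the change to take effect, and the query has to target this field (for example q=locales_text:en_US) for its scores to be affected.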

Solr: Facet one field with two outputs

I'm using Solr for indexing products and organising them into several categories. Each document has a taxon_names multi value field, where the categories are stored as human readable strings for a product.
Now I want to fetch all the categories from Solr and display them with clickable links to the user, without hitting the database again. At index time, I get the permalinks for every category from the MySQL database, which is stored as a multi value field taxon_permalinks. For generating the links to the products, I need the human readable format of the category and its permalink (otherwise you would have such ugly URLs in your browser, when just using the plain human readable name of the category, e.g. %20 for space).
When I do a facet search with http://localhost:8982/solr/default/select?q=*%3A*&rows=0&wt=xml&facet=true&facet.field=taxon_names, I get a list of human readable taxons with their counts. Based on this list, I want to create the links, so that I don't have to hit the database again.
So, is it possible to retrieve the matching permalinks from Solr for the different categories? For example, I get an XML like this:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<result name="response" numFound="6580" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="taxon_names">
<int name="Books">2831</int>
<int name="Music">984</int>
...
</lst>
</lst>
</lst>
</response>
And inside the taxon_names array I would need the name of the permalink.
Maybe it's possible by defining a custom field type in the config XMLs. But for this, I don't have enough experience with Solr.
From your description, it appears that you are storing the permalinks in the taxon_permalink field and that the values in that field correspond to the category names in the taxon_names field. Solr allows you to facet on multiple fields, so you can facet on both fields and walk the two facet results, grabbing the display name from the taxon_names facet values and the permalink from the taxon_permalink facet values.
Query:
http://localhost:8982/solr/default/select?q=*%3A*&rows=0&wt=xml
&facet=true&facet.field=taxon_names&facet.field=taxon_permalink
Your output should then look similar to the following:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<result name="response" numFound="6580" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="taxon_names">
<int name="Books">2831</int>
<int name="Music">984</int>
...
</lst>
<lst name="taxon_permalink">
<int name="permalink1">2831</int>
<int name="permalink2">984</int>
...
</lst>
</lst>
</lst>
</response>

Solr Grouping with multifield facets

I want to know if this is possible using solr query:
Two columns to consider: location1, location2
I want to facet on both columns.
The query below works:
http://localhost:8983/solr/select/?q=*:*&version=2.2&rows=0&facet=true&facet.field=location1&facet.field=location2
Response:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">13</int>
</lst>
<result name="response" numFound="7789" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="location1">
<int name="Chicago">100</int>
<int name="NewYork">50</int>
<int name="Washington">30</int>
</lst>
<lst name="location2">
<int name="Washington">200</int>
<int name="Philadelphia">100</int>
<int name="Chicago">50</int>
</lst>
</lst>
</lst>
</response>
What I need is to combine location1 and location2 and get the following results:
Washington: 230
Chicago: 150
Philadelphia: 100
NewYork: 50
Currently we do it at the service layer. But can this be done using result grouping in Solr? As I understand it, result grouping gives an aggregate of all the data but does not do a facet-type aggregate.
You need to store both location1 and location2 in a single multi-valued field, say locations. Then you can issue this facet query to get what you want:
q=*:*&rows=0&facet=true&facet.field=locations
Solr does not support Grouping on Multivalued fields.
Support for grouping on a multi-valued field has not yet been implemented.
You can probably create a new field at indexing time with a combined value and use that field for faceting.
EDIT:
Use a copy field to copy the contents of both fields into a single field and facet on it. This needs only schema changes and a reindex of the data.
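The schema change both answers describe could be sketched as follows. The field name locations follows the first answer; the type and the other attribute values are assumptions.

```xml
<!-- Combined multi-valued field used only for faceting (type and attributes are assumptions) -->
<field name="locations" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="location1" dest="locations"/>
<copyField source="location2" dest="locations"/>
```

After reindexing, facet.field=locations returns one combined count per location. Facet counts are per document, so a document that has the same value in both source fields is counted once for that value rather than twice.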

Solr not searching (dynamically created) fields

I have imported docs into Solr that have fields dynamically created from a pattern (mostly *_s). In the back-end (/solr/admin), I can see that they exist: the aggregate stats, like term frequency, appear correctly. They are all listed as indexed & stored.
However, they do not appear in queries, even when I search across all fields, for example:
/solr/select/?indent=on&q=myterms&fl=*
This problem seems similar to SOLR not searching on certain fields, and I tried the solution there, which was:
If you want your standard query handler to search against all your fields you can change it in your solrconfig.xml (I always add a second query handler instead of modifying "standard"). The fl field is the list of fields you want to search against. It's a comma-separated list or *.
I made that change to the standard solrconfig.xml, but still get no results.
I tried creating a very simple doc:
{'id':5, 'name':'foo'}
And this query returns that doc:
/solr/select/?indent=on&q=foo&fl=*
The whole results of a query with no results read:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="echoParams">all</str>
<str name="h1">true</str>
<str name="defType">dismax</str>
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">Foo</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>
Is the defType of your "standard" query handler dismax? If not, then it won't work. As the answer to the question you linked says, you have to use dismax to search in multiple fields. If you do not want to use dismax and still want to search in many fields at once, you have to use the copy fields feature at index time to gather all the fields you want to search on into one field, and then make that field your default field.
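For the dismax route, a separate handler in solrconfig.xml might look like the sketch below. The handler name /search_all and the field names title_s and author_s are hypothetical; as far as I know qf needs explicit field names rather than a pattern such as *_s, so each dynamic field you want searched has to be listed.

```xml
<!-- Hypothetical handler name and field names; list your own fields in qf -->
<requestHandler name="/search_all" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Fields searched by default; a boost such as title_s^2 is optional -->
    <str name="qf">name title_s author_s</str>
  </lst>
</requestHandler>
```

If the list of dynamic fields is open-ended, the copyField approach is likely easier to maintain than keeping qf up to date.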
Since you're using _s you can copy those fields to "text" in solr/collection1/conf/schema.xml like this:
<copyField source="*_s" dest="text" maxChars="3000"/>
It's a slight variation on the solution at "Why do dynamic fields not act like normal fields (specifically when querying and displaying in Hue) in solr?", which was to uncomment this *_t line:
<!-- Above, multiple source fields are copied to the [text] field.
Another way to map multiple source fields to the same
destination field is to use the dynamic field syntax.
copyField also supports a maxChars to copy setting. -->
<!-- <copyField source="*_t" dest="text" maxChars="3000"/> -->
This made my dynamic fields searchable with:
curl "http://localhost:8983/solr/collection1/select?q=foo"
Here's where the "catchall" text field is described:
<!-- catchall field, containing all other searchable text fields (implemented
via copyField further on in this schema -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
See also http://wiki.apache.org/solr/SchemaXml#Copy_Fields
I see you're using the query "Foo" while the name value is "foo". You might want to check whether your schema lowercases terms at both index and query time for the fieldType you are using for name.
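A field type that lowercases at both index and query time could look like the sketch below. The type name text_lower is hypothetical, and the field definition assumes name is a single-valued stored field; compare it with whatever type your name field currently uses.

```xml
<!-- Hypothetical type name; a single analyzer element applies to both indexing and querying -->
<fieldType name="text_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name" type="text_lower" indexed="true" stored="true"/>
```

With a type like this, "Foo" and "foo" are reduced to the same indexed term, so either spelling in the query should match after a reindex.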

Resources