Solr: set relevancy score in solrconfig

I'm using Solr 4.4 and I want to search by relevancy for exact-match words. I have 10 fields and I use
copy fields to achieve this, and it is mostly working fine.
My problem is that exact-match results should appear higher in the order.
Also, how can I set the score?
schema.xml
<field name="field8" type="text_search" indexed="true" stored="true"/>
<field name="description" type="text_search" indexed="true" stored="true"/>
<field name="keywords" type="text_search" indexed="true" stored="true"/>
<copyField source="field8" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="keywords" dest="text"/>
solrconfig.xml
<requestHandler name="/browse" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<!-- Query settings -->
<str name="defType">edismax</str>
<str name="qf">
field8 description keyword ^10.0
</str>
<str name="df">text</str>
<str name="mm">100%</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>
........
........
........

Phrase Fields pf
Once the list of matching documents has been identified using the fq
and qf parameters, the pf parameter can be used to "boost" the score
of documents in cases where all of the terms in the q parameter appear
in close proximity.
For example, if you search for Apache Solr Lucene with pf set to title:
q=Apache Solr Lucene
& qf=title name
& pf=title
<!--Debug-->
<str name="parsedquery_toString">
+((name:apache | title:apache) (name:solr | title:solr) (name:lucene | title:lucene)) (title:"apache solr lucene")
</str>
Now look at the debug response: the query searches for each keyword individually, but also for the whole search string as a phrase. This boosts every result that contains the search string as a phrase.
P.S. Again, pf only affects the boost score, not which results are returned.
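As a rough illustration of the parameters discussed above, the following Python sketch builds an edismax request that sets both qf and pf. The host, the core name (collection1) and the /select handler are assumptions made for the example only; the field names are the ones used in the answer.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, core and handler to your setup.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

def build_edismax_url(query, query_fields, phrase_fields):
    """Build a select URL where qf decides matching and pf boosts phrase matches."""
    params = {
        "defType": "edismax",
        "q": query,
        "qf": " ".join(query_fields),   # fields searched term by term
        "pf": " ".join(phrase_fields),  # fields where the whole phrase earns a boost
        "fl": "*,score",
        "debugQuery": "on",             # shows parsedquery_toString in the response
    }
    return SOLR_SELECT + "?" + urlencode(params)

url = build_edismax_url("Apache Solr Lucene", ["title", "name"], ["title"])
print(url)
```

Sending this URL to a running Solr instance and inspecting the debug section should show both the per-term clauses and the extra phrase clause.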

Related

How to boost repeated values in a multiValue field on Solr

I have some repeated values (identical strings) in a multiValued field in my Solr index, and I want to boost documents by the number of matches in that field. For example:
doc1 : { locales : ['en_US', 'de_DE', 'fr_FR', 'en_US'] }
doc2 : { locales : ['en_US'] }
When I run the query q=locales:en_US, I would like to see doc1 at the top because it has two "en_US" values. What is the proper way to boost this kind of data?
Should I use a special tokenizer?
Solr version is: 4.5
Disclaimer
To use either of the following solutions, you will need to make one of the following changes:
Create a copyField for locales:
<field name="locales" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- No need to store(stored="false") locales_text as it will only be used for searching/sorting/boosting -->
<field name="locales_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="locales" dest="locales_text"/>
Change the type of locales to "text_general" (the type is provided in the standard solr collection1)
First solution (Ordering):
Results can be ordered by some function. So we can order by number of occurrences (termfreq function) in field:
If copyField is used, then sort query will be: termfreq(locales_text,'en_US') DESC
If locales is of text_general type, then sort query will be: termfreq(locales,'en_US') DESC
Example response for copyField option (the result is the same for text_general type):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="fl">*,score</str>
<str name="sort">termfreq(locales_text,'en_US') DESC</str>
<str name="indent">true</str>
<str name="q">locales:en_US</str>
<str name="_">1383598933337</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="0.5945348">
<doc>
<arr name="locales">
<str>en_US</str>
<str>de_DE</str>
<str>fr_FR</str>
<str>en_US</str>
</arr>
<str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
<long name="_version_">1450808563062538240</long>
<float name="score">0.4203996</float></doc>
<doc>
<arr name="locales">
<str>en_US</str>
</arr>
<str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
<long name="_version_">1450808391856291840</long>
<float name="score">0.5945348</float></doc>
</result>
</response>
You can also use fl=*,termfreq(locales_text,'en_US') to see the number of matches.
One thing to keep in mind: this is an ordering function, not a boost function. If you would rather boost the score based on multiple matches, you will probably be more interested in the second solution.
I included the score in the results to demonstrate what @arun was talking about. You can see that the scores differ (probably due to field length). It was quite unexpected (for me) that for a multivalued string field it is the same.
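The request parameters for this ordering approach can be put together as in the Python sketch below. It only builds the parameters and does not contact Solr; the alias name freq for the pseudo-field is an illustrative choice, not something from the answer.

```python
from urllib.parse import urlencode

def termfreq_sort_params(field, term):
    """Build params that order results by how often `term` occurs in `field`."""
    func = "termfreq({},'{}')".format(field, term)
    return {
        "q": "locales:" + term,
        "sort": func + " DESC",          # ordering only, the score is untouched
        "fl": "*,score,freq:" + func,    # 'freq' is a hypothetical alias for the count
    }

params = termfreq_sort_params("locales_text", "en_US")
print(urlencode(params))
```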
Second solution (Boosting):
If copyField is used, then the query will be : {!boost b=termfreq(locales_text,'en_US')}locales:en_US
If locales is of text_general type, then the query will be: {!boost b=termfreq(locales,'en_US')}locales:en_US
Example response for copyField option (the result is the same for text_general type):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="lowercaseOperators">true</str>
<str name="fl">*,score</str>
<str name="indent">true</str>
<str name="q">{!boost b=termfreq(locales_text,'en_US')}locales:en_US</str>
<str name="_">1383599910386</str>
<str name="stopwords">true</str>
<str name="wt">xml</str>
<str name="defType">edismax</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="1.1890696">
<doc>
<arr name="locales">
<str>en_US</str>
<str>de_DE</str>
<str>fr_FR</str>
<str>en_US</str>
</arr>
<str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
<long name="_version_">1450808563062538240</long>
<float name="score">1.1890696</float></doc>
<doc>
<arr name="locales">
<str>en_US</str>
</arr>
<str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
<long name="_version_">1450808391856291840</long>
<float name="score">0.5945348</float></doc>
</result>
</response>
You can see that the score changed significantly. The first document scores twice as high as the second (because there were two matches, each scored as 0.5945348).
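The arithmetic behind these two scores can be checked with a small sketch. It assumes that the boost query parser multiplies the score of the wrapped query by the value of the boost function; the base score and the term frequencies are taken from the example response.

```python
import math

# Score of locales:en_US for a single match, taken from the example response.
BASE_SCORE = 0.5945348

def boosted_score(base_score, term_freq):
    """Multiplicative boost: final score = base score * boost function value."""
    return base_score * term_freq

doc1 = boosted_score(BASE_SCORE, 2)  # 'en_US' occurs twice in doc1
doc2 = boosted_score(BASE_SCORE, 1)  # 'en_US' occurs once in doc2
print(doc1, doc2)
```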
Third solution (omitNorms=true)
Based on the answer from @arun, I figured that there is also a third option.
If you convert your field to (for example) text_general AND set omitNorms=true for that field, it should have the same result.
The default standard request handler in Solr does not use only the term frequency to compute scores. Along with the term frequency, it also uses the length of the field. See the Lucene scoring algorithm, where it says:
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.
Since doc2 has a shorter field it might have scored higher. Check the score for the results with fl=*,score in your query. To know how Solr arrived at the score, use fl=*,score&wt=xml&debugQuery=on (then right click on your browser and view page-source to see a properly indented score calculation). I believe you will see the lengthNorm contributing to a lower score for doc1.
To have length of the field not contribute to the score, you need to disable it. Set omitNorms=true for that field. (Ref: http://wiki.apache.org/solr/SchemaXml) Then see what the scores are.
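To make the effect of the length norm concrete, here is a simplified sketch of the two per-document factors in Lucene's classic TF-IDF similarity: tf = sqrt(term frequency) and lengthNorm = 1/sqrt(number of tokens in the field). It ignores idf, query normalization and the lossy encoding of stored norms, and it assumes that doc1's field holds four tokens and doc2's field holds one, so the numbers are illustrative only.

```python
import math

def tf(freq):
    """Classic similarity: term frequency factor grows with sqrt(freq)."""
    return math.sqrt(freq)

def length_norm(num_tokens):
    """Classic similarity: shorter fields get a larger norm."""
    return 1.0 / math.sqrt(num_tokens)

def field_factor(freq, num_tokens, omit_norms=False):
    """Combine tf with the length norm, or skip the norm when omit_norms is set."""
    norm = 1.0 if omit_norms else length_norm(num_tokens)
    return tf(freq) * norm

# doc1: 'en_US' twice among four values; doc2: 'en_US' once among one value.
doc1_with_norms = field_factor(freq=2, num_tokens=4)
doc2_with_norms = field_factor(freq=1, num_tokens=1)
doc1_without_norms = field_factor(freq=2, num_tokens=4, omit_norms=True)
doc2_without_norms = field_factor(freq=1, num_tokens=1, omit_norms=True)

print(doc1_with_norms, doc2_with_norms)        # doc2 ranks first
print(doc1_without_norms, doc2_without_norms)  # doc1 ranks first
```

With norms enabled the shorter doc2 wins; with omitNorms=true only the term frequency remains and doc1 ranks first. Under these assumptions the factor for doc1 is about 0.707 of the factor for doc2, which is consistent with the ratio between the scores 0.4203996 and 0.5945348 in the first example response.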

How to avoid splitting of field values in faceted search in solr

In a faceted search, the fields in the doc element of the search result hold multi-word string values, but in the facet counts every value is a single word.
The following is a sample Solr search result:
<result>
<doc>
<str name="fieldA">abc1 efg1 ijk1</str>
<str name="fieldA">abc2 efg2 ijk2</str>
<str name="fieldA">abc3 efg3 ijk3</str>
<arr name="fieldD">
<str>abc1 efg1 ijk1</str>
<str>abc2 efg2 ijk2</str>
<str>abc3 efg3 ijk3</str>
</arr>
</doc>
</result>
<lst name="facet_counts">
<lst name="facet_queries">
<int name="fieldB:ab">some_number</int>
</lst>
<lst name="facet_fields">
<lst name="fieldA">
<int name="abc1">1</int> I want <int name="abc1 efg1 ijk1">1</int>
<int name="efg1">1</int>
<int name="ijk1">1</int>
</lst>
</lst>
</lst>
schema.xml defines the fields fieldA, fieldB, fieldC and fieldD as follows:
<field name="fieldA" type="text_general" stored="true" indexed="true"/>
<field name="fieldB" type="text_general" stored="true" indexed="true"/>
<field name="fieldC" type="text_general" stored="true" indexed="true"/>
<field name="fieldD" type="text_general" stored="true" indexed="true"/>
and
<copyField source="fieldA" dest="fieldD"/>
<copyField source="fieldB" dest="fieldD"/>
<copyField source="fieldC" dest="fieldD"/>
I want the facet values to be multi-word strings, just like the field values. Please suggest how to do this.
You have to change the type of your field from type="text_general" to type="string" for the facet search.
If you can't do that for this field, you can create a new string field (it could be filled by a copyField) and then facet on that one.
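The difference can be illustrated without Solr. Facet values come from the indexed terms, so a tokenized field such as text_general yields one facet value per word, while a string field indexes the whole value as a single term. The sketch below imitates this with lowercasing and whitespace splitting, which is only an approximation of the real analyzer.

```python
from collections import Counter

# Field values of a single document, taken from the sample result.
values = ["abc1 efg1 ijk1", "abc2 efg2 ijk2", "abc3 efg3 ijk3"]

def facet_counts_tokenized(field_values):
    """Approximate a tokenized field: every word becomes its own term."""
    return Counter(token for value in field_values for token in value.lower().split())

def facet_counts_string(field_values):
    """Approximate a string field: the whole value is indexed as one term."""
    return Counter(field_values)

tokenized = facet_counts_tokenized(values)
as_string = facet_counts_string(values)
print(sorted(tokenized))  # single words, like the facet output in the question
print(sorted(as_string))  # full multi-word values, the desired facet output
```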

Solr schema field

I've made a schema for solr and I don't know the name of every field from the document I want to add, so I defined a dynamicField like this:
<dynamicField name="*" type="text_general" indexed="true" stored="true" />
Right now I'm testing and I don't get an error when importing for undefined fields in the document, but when I try to query for *:something (anything other than "*") I don't get any results back.
My question is how can I define a catch all field, is there any right way to do this? Or am I under the wrong impression that a query for *:something would normally search in all the documents and all the fields for "something"?
The query `*:something` cannot return anything from Solr, no matter what kind of field you are using, dynamicField or not.
If I understand your question correctly, you want a dynamicField to store all fields and want to query all of them later.
Here is my solution.
First, define a default_search field for searching:
<field name="default_search" type="text" indexed="true" stored="true" multiValued="true"/>
And then copy all fields into the default_search field.
<copyField source="*" dest="default_search" />
Finally, you can make a query for all fields like this:
http://host/core/select/?q=something
or
http://host/core/select/?q=default_search:something
AFAIK *:something does not query all the fields. It looks for a field named *.
I get the below error when attempting to do a query for *:test
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">9</int>
<lst name="params">
<str name="wt">xml</str>
<str name="q">*:test</str>
</lst>
</lst>
<lst name="error">
<str name="msg">undefined field *</str>
<int name="code">400</int>
</lst>
</response>
You would need to define a catchall field using copyField in your schema.xml.
I would recommend not using a simple wildcard for dynamic fields. Instead something like this:
<dynamicField name="*_text" type="text_general" indexed="true" stored="true" />
and then have a catch-all field
<field name="CatchAll" type="text_general" indexed="true" stored="true" multiValued="true" />
You can then define a copyField as below to support queries such as q=something. The destination is declared multiValued because more than one *_text field may be copied into it.
<copyField source="*_text" dest="CatchAll" />
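What the wildcard copyField does at index time can be imitated in a few lines of Python. This is a toy model for illustration, not Solr code: the sample document and its field names (title_text, body_text, price) are invented, and glob matching via fnmatch stands in for Solr's own pattern handling.

```python
from fnmatch import fnmatchcase

def apply_copy_field(doc, source_pattern, dest):
    """Return a copy of doc with every field matching source_pattern copied into dest."""
    copied = [value for name, value in doc.items() if fnmatchcase(name, source_pattern)]
    result = dict(doc)
    result[dest] = copied  # several sources may land here, hence a list
    return result

# Hypothetical document; only the *_text fields should reach the catch-all field.
doc = {"id": "1", "title_text": "Solr in Action", "body_text": "catch-all fields", "price": 10}
indexed = apply_copy_field(doc, "*_text", "CatchAll")
print(indexed["CatchAll"])
```

Because two source fields end up in CatchAll here, the destination has to accept multiple values; in Solr that means declaring it with multiValued="true".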

solr spatial search with distance to search results

I'm able to return all results within a specific radius from geolocation point A, but I want to return the distance of each search result to point A.
I was reading this: http://wiki.apache.org/solr/SpatialSearch
I have this Solr query:
http://localhost:8983/solr/tt/select/?indent=on&facet=true&fq={!geofilt}&pt=51.4416420,5.4697225&sfield=geolocation&d=20&sort=geodist()%20asc&q=*:*&start=0&rows=10&fl=_dist_:geodist(),id,title,lat,lng,geolocation,location&facet.mincount=1
And this in my schema.xml
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<field name="geolocation" type="location" indexed="true" stored="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
This is one of the results:
<doc>
<str name="geolocation">51.4231086,5.474830699999984</str>
<str name="id">122</str>
<str name="lat">51.4231086</str>
<str name="lng">5.474830699999984</str>
<str name="title">Eindhoven Museum</str>
</doc>
However, with my current query string, I don't see a distance field in the document.
What am I missing?

Solr - Get the sum of all "filemetadata.filesize" field for a given user

I'm building some kind of file storage software.
The files' metadata is indexed with fields like filesize and userId.
What I'd like to do is compute the space used by a user.
For example, if I have these documents:
documentId = 1 | userId = 1 | fileSize = 10
documentId = 2 | userId = 2 | fileSize = 5
documentId = 3 | userId = 1 | fileSize = 3
I'd like to run a query so that for userId=1 I retrieve a result of 13MB (10+3).
I have seen that we can run FunctionQuery but it doesn't seem to do what I'm looking for.
Same for the FieldCollapsing which doesn't permit to run aggregation functions on the grouped results.
I have tested the StatsComponent as well but it doesn't seem to work for unknown reasons.
My schema contains:
<field name="FileSize" type="integer" indexed="false" stored="true" required="true" />
<field name="OtherField" type="sfloat" indexed="true" stored="true" required="false" />
<field name="OtherField2" type="integer" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="OtherField3" type="integer" indexed="true" stored="true" required="false" multiValued="false"/>
And when I perform the query
http://mysolr:8414/solr/mycore/select/?q=docId:123
&rows=0
&stats=true
&stats.field=FileSize
&stats.field=OtherField
&stats.field=OtherField2
&stats.field=OtherField3
I get back the result:
<lst name="stats">
<lst name="stats_fields">
<null name="FileSize"/>
<lst name="OtherField">
<double name="min">6.0</double>
<double name="max">6.0</double>
<long name="count">1</long>
<long name="missing">0</long>
<double name="sum">6.0</double>
<double name="sumOfSquares">36.0</double>
<double name="mean">6.0</double>
<double name="stddev">0.0</double>
<lst name="facets"/>
</lst>
<lst name="OtherField2">
<double name="min">0.0</double>
<double name="max">0.0</double>
<long name="count">1</long>
<long name="missing">0</long>
<double name="sum">0.0</double>
<double name="sumOfSquares">0.0</double>
<double name="mean">0.0</double>
<double name="stddev">0.0</double>
<lst name="facets"/>
</lst>
<null name="OtherField3"/>
</lst>
</lst>
As you can see, I'm asking for stats for a single doc (which isn't really useful but helps with debugging; without the q=docId:123 it doesn't return a better result anyway).
This document has FileSize set to 15.
I use Solr 4.1.
Can someone please explain to me why I can get stats for the fields OtherField and OtherField2, but not for the fields FileSize and OtherField3? I don't see the problem at all...
Good news: writing this question helped me find the solution. I use a legacy schema and didn't notice that the FileSize field had indexed="false".
Setting this attribute to true makes the StatsComponent return stats for that field!
However, for the field OtherField3, which has exactly the same definition as OtherField2, I have no answer.
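With FileSize indexed, the per-user total asked for at the start could be requested roughly as sketched below. The sketch only builds the parameters; it assumes that a userId field exists and is indexed, which the schema excerpt does not show. The second half recomputes the expected sum locally from the sample documents in the question.

```python
from urllib.parse import urlencode

def stats_params(user_id, size_field="FileSize"):
    """Params asking the StatsComponent for statistics over one user's documents."""
    return {
        "q": "userId:{}".format(user_id),  # assumes userId is an indexed field
        "rows": 0,                          # only the stats are needed, not the docs
        "stats": "true",
        "stats.field": size_field,
    }

# Local cross-check using the sample documents from the question.
docs = [
    {"documentId": 1, "userId": 1, "fileSize": 10},
    {"documentId": 2, "userId": 2, "fileSize": 5},
    {"documentId": 3, "userId": 1, "fileSize": 3},
]

def used_space(documents, user_id):
    """Sum the file sizes of all documents that belong to user_id."""
    return sum(d["fileSize"] for d in documents if d["userId"] == user_id)

print(urlencode(stats_params(1)))
print(used_space(docs, 1))
```

In the stats section of the response, the sum entry for FileSize would then carry the total for that user.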
