Solr edismax relevancy sorting multiple fields - solr

I use the edismax query parser to handle user queries against our Solr 4.4 server.
I'm getting the correct results, but I need help with prioritization.
For example, if I search for q=ideapad miix 310:
1) All exact matches are returned, and this works fine. If a result
contains only ideapad rather than the full phrase, it should get the lowest priority.
2) Results should be prioritized by field in this order:
field8, keywords, product, marketing, description. Here too, a match on ideapad
alone should have the lowest priority.
My bq:
bq:text:"ideapad miix 310"^20000 OR (text:"miix"^12000 -text:ideapad^-20 -text:thinkpad^-20 -text:ideacentre^-20 -text:thinkcentre^-20 text:"310"^1000 -text:ideapad^-20 -text:thinkpad^-20 -text:ideacentre^-20 -text:thinkcentre^-20)
URL
http://localhost:8983/solr/collection1/select?q=ideapad+miix+310&defType=edismax&bq=text%3A%22ideapad+miix+310%22%5E20000++OR+(text%3A%22miix%22%5E12000+-text%3Aideapad%5E-20+-text%3Athinkpad%5E-20+-text%3Aideacentre%5E-20+-text%3Athinkcentre%5E-20+text%3A%22310%22%5E1000+-text%3Aideapad%5E-20+-text%3Athinkpad%5E-20+-text%3Aideacentre%5E-20+-text%3Athinkcentre%5E-20)
I use the catch-all field "text", into which each field (field8, keywords, etc.) is copied, and I boost each of those fields.
<field name="field8" type="text_search" indexed="true" stored="true" omitNorms="true"/>
<field name="description" type="text_search" indexed="true" stored="true" omitNorms="true"/>
<field name="keywords" type="commaDelimited" indexed="true" stored="true" omitNorms="true"/>
<field name="product" type="commaDelimited" indexed="true" stored="true" omitNorms="true" omitPositions="true" omitTermFreqAndPositions="true"/>
<field name="marketing" type="commaDelimited_s" indexed="true" stored="true" omitNorms="true" omitPositions="true" omitTermFreqAndPositions="true"/>
<copyField source="field8" dest="text"/>
<copyField source="field8" dest="text"/>
In my solrconfig.xml I have boosted the fields for edismax:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="defType">edismax</str>
<str name="qf">
text^100 field8^90 keywords^80 product^70 marketing^60 description^10
</str>
<str name="pf">
text^100 field8^90 keywords^80 product^70 marketing^60 description^10
</str>
</lst>
</requestHandler>
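A minimal sketch of how the field priority from the question could be expressed through qf and pf at request time. It is not a confirmed fix: leaving the catch-all text field out of qf (so it cannot outrank the specific fields) is an assumption, the boost values are just the ones from the config above, and the helper names are hypothetical. Phrase matching needs position data, so the sketch leaves product and marketing (indexed with omitTermFreqAndPositions="true") out of pf.

```python
from urllib.parse import urlencode

# Boosts in the priority order given in the question (values taken from the config above).
QF_BOOSTS = {
    "field8": 90,
    "keywords": 80,
    "product": 70,
    "marketing": 60,
    "description": 10,
}

# Fields indexed without positions cannot take part in phrase matching.
NO_POSITIONS = {"product", "marketing"}


def boost_string(boosts):
    """Render {"field8": 90, ...} as 'field8^90 ...' for qf/pf."""
    return " ".join(f"{field}^{boost}" for field, boost in boosts.items())


def build_edismax_params(user_query):
    """Build edismax request parameters (hypothetical helper)."""
    pf_boosts = {f: b for f, b in QF_BOOSTS.items() if f not in NO_POSITIONS}
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": boost_string(QF_BOOSTS),
        "pf": boost_string(pf_boosts),
    }


params = build_edismax_params("ideapad miix 310")
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(url)
```

Adding debugQuery=true to the request shows how each field contributed to a document's score, which helps when tuning the boosts.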

Related

Does SOLR cell in any way limit the amount of characters imported into a solr.TextField?

I'm indexing with Solr Cell a large HTML page using a curl command with a Windows command prompt like so:
curl http://localhost:8987/solr/myexample/update/extract -d @test.html -H 'Content-type:html'
I have found that text is missing from my fields when I query them (query?q=*:*&q.op=OR&indent=true) in the Solr admin UI.
Example: my HTML page has a number of lorem ipsum <p> tags, and near the end there is another <p> tag containing Hello world. That last paragraph does not show up in the Solr admin UI.
I found the following on the old wiki.
Large individual fields.
It is possible to store megabytes of text in one record. These fields are clumsy to work with. By default the number of characters stored is clipped.
It does not explain how to prevent the text from being clipped, and I am not sure clipping is even the cause, because the field is cut off long before it holds a megabyte of data.
schema.xml
<field name="main" type="text_general" indexed="true" stored="true"/>
<field name="div" type="text_general" indexed="true" stored="true"/>
<field name="doc_id" type="string" uninvertible="true" indexed="true" stored="true"/>
<field name="date_pub" type="pdate" uninvertible="true" indexed="true" stored="true"/>
<field name="p" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<field name="_text_" type="text_general" indexed="true" stored="true" multiValued="true"/>
<copyField source="*" dest="_text_"/>
solrconfig.xml
<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="fmap.content">content</str>
<str name="capture">div</str>
<str name="fmap.div">div</str>
<str name="capture">h1</str>
<str name="fmap.h1">h1</str>
<str name="capture">h2</str>
<str name="fmap.h2">h2_t</str>
<str name="capture">p</str>
<str name="fmap.p">p</str>
</lst>
</requestHandler>
Solr Version: 8.10.1
Solr Cell doesn't seem to limit the number of characters. The culprit, for reasons I don't understand, was the curl command I was using:
curl http://localhost:8987/solr/myexample/update/extract -d @test.html -H 'Content-type:html'
Solution: The following command pulls all the text without truncating any of the text (replace paths with wherever your post.jar and HTML file are):
java -jar -Dc=myexample -Dauto example\exampledocs\post.jar example\exampledocs\sample.html
Note that these are Windows commands for the Command Prompt.
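One thing that is known about curl: when -d reads from a file, it strips carriage returns and newlines, while --data-binary sends the file unchanged. Whether that explains the missing text here is not confirmed. As an alternative to curl, the sketch below posts the raw bytes of the file to the extract handler from Python. The function names are hypothetical, and the text/html content type and the commit=true parameter are assumptions about what this setup needs.

```python
import urllib.request


def build_extract_request(base_url, html_bytes):
    """Build a POST to Solr Cell that carries the HTML bytes unmodified."""
    return urllib.request.Request(
        base_url + "/update/extract?commit=true",
        data=html_bytes,
        headers={"Content-Type": "text/html"},
    )


def post_html(base_url, path):
    """Read the file in binary mode and send it; returns the HTTP status."""
    with open(path, "rb") as handle:
        request = build_extract_request(base_url, handle.read())
    with urllib.request.urlopen(request) as response:
        return response.status


# Building the request needs no running server, so it can be inspected first.
request = build_extract_request(
    "http://localhost:8987/solr/myexample",
    b"<p>lorem ipsum</p>\r\n<p>Hello world</p>\n",
)
print(request.get_method(), request.full_url)
```

Calling post_html("http://localhost:8987/solr/myexample", "test.html") would then send the file, line endings included.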

Solr spellchecking randomly working

I've got a problem with the spell checker integrated in solr.
I have (for now) two cores, configured with the same solrconfig.xml (with the right settings for the spellchecker) and slightly different schema.xml files (with the same spellchecker configuration).
The problem is that the spellchecker works perfectly for one core but not for the other.
For the non-working core, Solr Admin shows that the field "spelling" (the field the spellchecker uses) is indexed but not stored.
Any idea?
I don't think I will be able to post xml files, as they don't belong to me.
Thanks everybody
EDIT:
solrconfig.xml
<requestHandler name="/select" class="solr.SearchHandler">
...
</requestHandler>
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="classname">solr.IndexBasedSpellChecker</str>
<!-- field to use -->
<str name="field">spelling</str>
<!-- buildOnCommit|buildOnOptimize -->
<str name="buildOnCommit">true</str>
<!-- $solr.solr.home/data/spellchecker-->
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="accuracy">0.7</str>
<float name="thresholdTokenFrequency">.0001</float>
</lst>
</searchComponent>
schema.xml (working)
<schema name="docs" version="1.5">
...
<field name="fooCore1" type="text" indexed="true" stored="true" multiValued="false" />
<!-- Spellcheck -->
<field name="spelling" type="text" indexed="true" stored="true" multiValued="false" />
<copyField source="fooCore1" dest="spelling" />
...
...
<solrQueryParser defaultOperator="OR"/>
</schema>
schema.xml (not working)
<schema name="docs" version="1.5">
...
<field name="fooFoo" type="text" indexed="true" stored="true" multiValued="false" />
<copyField source="fooFoo" dest="fooCore" maxChars="300000" />
<!-- Spellcheck -->
<field name="fooCore2" type="text" indexed="true" stored="true" multiValued="false" />
<copyField source="fooCore2" dest="spelling" maxChars="300000" />
...
</schema>
All fields in the second schema except spelling are stored and indexed with their values.
I even tried creating a third core, but that does not work either.
It seems that the destination of one copyField cannot be the source of another copyField.
Changing the source in the non-working schema from a copyField destination to a regular field solved the problem.
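To spot this situation in a schema, a small check along these lines could help. It is only a sketch: the function name is hypothetical, the sample schema is an illustrative variant of the non-working one (not the original file), and wildcard sources such as source="*" are not handled.

```python
import xml.etree.ElementTree as ET


def find_chained_copyfields(schema_xml):
    """Return (source, dest) rules whose source is itself a copyField dest."""
    root = ET.fromstring(schema_xml)
    rules = [(cf.get("source"), cf.get("dest")) for cf in root.iter("copyField")]
    destinations = {dest for _, dest in rules}
    return [(source, dest) for source, dest in rules if source in destinations]


# Illustrative schema: fooCore is filled by one copyField and read by another.
SCHEMA = """<schema name="docs" version="1.5">
  <field name="fooFoo" type="text" indexed="true" stored="true" />
  <field name="fooCore" type="text" indexed="true" stored="true" />
  <field name="spelling" type="text" indexed="true" stored="true" />
  <copyField source="fooFoo" dest="fooCore" maxChars="300000" />
  <copyField source="fooCore" dest="spelling" maxChars="300000" />
</schema>"""

chained = find_chained_copyfields(SCHEMA)
print(chained)
```

Any rule it reports would need its source changed to the original field (here fooFoo) for the destination to receive data.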

Need help to decide between the type of spellchecker to use in solr?

I have a list of cities in a MySQL database that is hooked up to a UI for autocompletion. I am currently using solr-5.3.0. Data is imported through scheduled delta imports. I have the following questions:
I want to add spell checking to this feature. I tried using:
DirectSolrSpellChecker
IndexBasedSpellChecker
FileBasedSpellChecker
Of these three, only FileBasedSpellChecker gives
suggestions that all exist in the database. For example, when searching for
cologne I got results like
{
"responseHeader":{
"status":0,
"QTime":4,
"params":{
"q":"searchfield:kolakata",
"indent":"true",
"spellcheck":"true",
"wt":"json"}},
"response":{"numFound":0,"start":0,"docs":[]
},
"spellcheck":{
"suggestions":[
"cologne",{
"numFound":4,
"startOffset":12,
"endOffset":19,
"suggestion":["Cologne",
"Bologna",
"Cogne",
"Bastogne"]}],
"collations":[
"collation","searchfield:Cologne"]}}
These cities are accurate and exist in the database/file.
But when I use the other two, I get results like
{
"responseHeader":{
"status":0,
"QTime":4,
"params":{
"q":"searchfield:kolakata",
"indent":"true",
"spellcheck":"true",
"wt":"json"}},
"response":{"numFound":0,"start":0,"docs":[]
},
"spellcheck":{
"suggestions":[
"cologne",{
"numFound":4,
"startOffset":12,
"endOffset":19,
"suggestion":["Cologne",
"Cologn",
"Colognei"]}],
"collations":[
"collation","searchfield:Cologne"]}}
Some of these are not cities present in my database.
Although FileBasedSpellChecker gives satisfactory results, I
am a little apprehensive about using it, because I would need to
update the file manually every time a city is added or removed.
Also, using FileBasedSpellChecker is generally not
advised.
I also need to make the suggestions searchable. Currently
I use the doc returned in
"responseHeader":{"response":{"docs":[<some-format>]}}
to search for results in that city. I now want the suggester to
return its results in the same <some-format> instead of plain
strings, so that it integrates properly with the UI.
One minor change requested is to sort the suggestions in ascending
order of edit (Levenshtein) distance. This is not a hard requirement
and is negotiable.
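The ordering requirement could also be met on the client side, after Solr returns the suggestions. The following is a rough sketch under that assumption; the function names are hypothetical, and the comparison is case-insensitive by choice.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def sort_suggestions(term, suggestions):
    """Order suggestions by ascending edit distance from the typed term."""
    lowered = term.lower()
    return sorted(suggestions, key=lambda s: levenshtein(lowered, s.lower()))


ordered = sort_suggestions("cologne", ["Bastogne", "Bologna", "Cogne", "Cologne"])
print(ordered)
```

Because sorted is stable, suggestions at the same distance keep the order in which Solr returned them.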
edit
My solrconfig looks like this:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">searchfield</str>
<str name="spellcheck">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.dictionary">file</str>
<str name="spellcheck.maxCollationTries">5</str>
<str name="spellcheck.count">5</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
and
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">text_ngram</str>
<lst name="spellchecker">
<str name="name">file</str>
<str name="classname">solr.FileBasedSpellChecker</str>
<str name="sourceLocation">spellings.txt</str>
<str name="spellcheckIndexDir">./spellchecker</str>
</lst>
</searchComponent>
schema looks like this:
<field name="name" type="string" indexed="true" stored="true" multiValued="false" />
<field name="latlng" type="location" indexed="true" stored="true" multiValued="false" />
<field name="citycode" type="string" indexed="true" stored="true" multiValued="false" />
<field name="country" type="string" indexed="true" stored="true" multiValued="false" />
<field name="searchscore" type="float" indexed="true" stored="true" multiValued="false" />
<field name="searchfield" type="text_ngram" indexed="true" stored="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />
<defaultSearchField>searchfield</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
<copyField source="name" dest="searchfield"/>

Configuring Solr to use UUID as a key

I am trying to configure Solr 4 to work with UUIDs, and so far I have been unsuccessful.
From reading the documentation I have seen two different ways to configure schema.xml to work with UUIDs (neither works).
For both I need to add:
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
option 1:
add:
<field name="id" type="uuid" indexed="true" stored="true" default="NEW" multiValued="false"/>
and make sure to remove the line
<uniqueKey>id</uniqueKey>
option 2
add:
<field name="id" type="uuid" indexed="true" stored="true" required="true" multiValued="false" />
Neither option works correctly; both return
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error initializing QueryElevationComponent.
I also tried adding the following configuration to the solrconfig.xml file:
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">uniqueKey</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Thanks,
Shimon
After some work here is the solution:
In schema.xml, add (or edit) the id field:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
In solrconfig.xml, update the chain and add the chain to the handlers (example: for /update/extract):
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
<str name="update.chain">uuid</str>
</lst>
</requestHandler>
You may want to remove the QueryElevationComponent if you are not using it.
QueryElevationComponent requires a unique key to be defined, and that key must be a string field (this is tracked in JIRA).
However, this was fixed in Solr 4.0 alpha, so it depends on which Solr version you are using.
This limitation is documented in the Solr wiki.
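For intuition, the effect of the uuid chain is that a document arriving without an id gets a freshly generated UUID string, while a document that already has one is left alone. The sketch below emulates that behavior on the client side; it is not Solr code, and the ensure_id helper is hypothetical.

```python
import uuid


def ensure_id(doc, field_name="id"):
    """Return the doc with a random UUID in field_name if it has no value there."""
    if doc.get(field_name):
        return doc
    # Copy instead of mutating the caller's dict.
    return {**doc, field_name: str(uuid.uuid4())}


with_generated_id = ensure_id({"title": "sample page"})
with_existing_id = ensure_id({"id": "Excel3", "title": "sample page"})
print(with_generated_id["id"], with_existing_id["id"])
```

Since the value is stored as a plain string, it fits the string-typed id field from the solution above.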

How do I get solr to return results from all indices?

I am starting to integrate with Solr and have run across what I perceive as an issue. I uploaded a simple spreadsheet using the Java API (here is an excerpt:
- Document, id, value
- Excel3, name, steelers
- Excel3, subject, pirates
- Excel3, description, penguins
- Excel3, comments, panthers
- Excel3, author, panthers
)
Using this I used the first column as the "document name", second column as the field in the document to index, and the third column as the indexed data. All of these fields already existed in schema.xml, but here is how they are set up:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
Here is where my problem comes in. A search for, say, steelers returns the document, but a search for penguins, or for the values in many of the other fields, returns nothing. However, if I search for description:penguins, the result comes back as expected.
Can anyone please help me understand why the part before the : is required for some fields, but not others?
example searches:
solr/select?indent=on&q=penguins&wt=xml ----Doesn't return any results
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="indent">on</str>
<str name="q">penguins</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>
solr/select?indent=on&q=description:penguins&wt=xml
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">18</int>
<lst name="params">
<str name="indent">on</str>
<str name="q">description:penguins</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="author">panthers</str>
<str name="comments">panthers</str>
<str name="description">penguins</str>
<str name="id">Excel3</str>
<str name="name">steelers</str>
<str name="subject">pirates</str>
</doc>
</result>
</response>
The default query parser will query the default field, which can be specified in the schema.xml as seen here: http://wiki.apache.org/solr/SchemaXml#The_Default_Search_Field
I think @Frank Famer's comment about using the DisMax parser is the real solution to this problem. That said, here are two workarounds I've seen in practice:
1. Create an additional field that is indexed but not stored, copy into it (via copyField) the values of all the fields you want to search, and then specify that field as the default. It would look something like this in your schema.xml file:
<field name="myhugedefaultfield" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="name" dest="myhugedefaultfield"/>
<copyField source="subject" dest="myhugedefaultfield"/>
<copyField source="description" dest="myhugedefaultfield"/>
<defaultSearchField>myhugedefaultfield</defaultSearchField>
2. Rewrite the user-entered query, turning a query for penguins into a query for (name:penguins) OR (subject:penguins) OR (description:penguins).
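The second workaround can be sketched as a small client-side rewrite. The helper name is hypothetical, and wrapping the term in quotes (so that multi-word input is searched as a phrase) is an assumption that goes slightly beyond the unquoted form shown above.

```python
def expand_to_fields(term, fields):
    """Turn a bare term into an OR of fielded clauses, one per field."""
    # Escape backslashes and double quotes so the term stays inside its quotes.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return " OR ".join(f'({field}:"{escaped}")' for field in fields)


query = expand_to_fields("penguins", ["name", "subject", "description"])
print(query)
```

The resulting string is passed as the q parameter in place of the bare term.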

Resources