Solr Boosting on custom fields - solr

Solr experts,
I am trying to change the score of a few of my custom fields through index-time field boosting, but the score does not seem to change. Please help.
These are my custom fields in schema.xml:
<field name="doc_id" type="string" indexed="true" stored="true" omitNorms="false"/>
<field name="doc_name" type="text_autocomplete" indexed="true" stored="true" multiValued="true" omitNorms="false"/>
<field name="doc_author" type="text_autocomplete" indexed="true" stored="true" multiValued="true" />
<field name="modifieddate" type="text_autocomplete" indexed="true" stored="true" multiValued="true"/>
<field name="doc_content" type="text_autocomplete" indexed="true" stored="true" multiValued="true" omitNorms="false"/>
<field name="doc_title" type="text_autocomplete" indexed="true" stored="true" multiValued="true"/>
<field name="doc_description" type="text_autocomplete" indexed="true" stored="true" multiValued="true" omitNorms="false"/>
I posted the following 3 test docs using SolrJ:
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="doc_id">7781</field>
<field name="doc_name" boost="1.5">Cat</field>
<field name="doc_author">Nsd80</field>
<field name="modifieddate">11 30</field>
<field name="doc_content" boost="1.5">Cat life history. Cat life cycle. Cat Foods</field>
<field name="doc_title">Titled</field>
<field name="doc_description" boost="1.5">Cat related details</field>
</doc>
<doc>
<field name="doc_id">7782</field>
<field name="doc_name" boost="2.5">Dog</field>
<field name="doc_author">Nsd80</field>
<field name="modifieddate">11 30</field>
<field name="doc_content" boost="2.5">Dog life history. Dog life cycle. Dog Foods</field>
<field name="doc_title">Titled</field>
<field name="doc_description" boost="2.5">Dog details</field>
</doc>
<doc>
<field name="doc_id">7783</field>
<field name="doc_name" boost="2.7">Cow</field>
<field name="doc_author">Nsd80</field>
<field name="modifieddate">11 30</field>
<field name="doc_content" boost="2.7">Cow life history. Cow life cycle. Cow Foods</field>
<field name="doc_title">Titled</field>
<field name="doc_description" boost="2.7">Cow lifecycle</field>
</doc>
</add>
When I query to find the scores as below,
localhost:8983/solr/select/?q=doc_id:*&fl=*,score
it shows a score of 1.0 for all three docs.
I then tried to boost them with
localhost:8983/solr/select?defType=edismax&q=doc_description:*Cow*^195
but that does not seem to work either:
<arr name="doc_description">
<str>Dog details</str>
</arr>
<long name="_version_">1479948142366425088</long>
<float name="score">1.0</float>
</doc>
<doc>
I also tried elevation with
localhost:8983/solr/elevate?q=doc_id:7781&enableElevation=true&fl=doc_id,score,[elevated]
but the document was not elevated:
<result name="response" numFound="1" start="0" maxScore="7.23343">
<doc>
<str name="doc_id">7781</str>
<float name="score">7.23343</float>
<bool name="[elevated]">false</bool>
</doc>
</result>
</response>
My requirement is just to boost the docs so that they score higher and I can retrieve them by score. As my XML docs show, I tried field boosting at index time, then tried to boost the docs using edismax (mentioned here), and also tried elevation.
Can anyone help me with a detailed example?
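A possible explanation, stated as an assumption rather than a confirmed diagnosis: wildcard queries such as doc_id:* and doc_description:*Cow* are scored as constant-score queries, so every match gets the same score and index-time field boosts never show up. The sketch below builds an edismax request for a plain term with per-field qf weights instead. The search term and the qf weights are illustrative values, and whether the term matches depends on how text_autocomplete is analyzed.

```python
from urllib.parse import urlencode

# Base URL taken from the question; adjust host and core for your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_boosted_query(term, field_boosts):
    """Return a select URL that searches `term` across weighted fields."""
    params = {
        "defType": "edismax",
        # Plain term, no wildcards, so normal relevance scoring applies.
        "q": term,
        # Query-time weights per field, e.g. "doc_description^5 doc_content^2".
        "qf": " ".join(f"{field}^{boost}" for field, boost in field_boosts.items()),
        "fl": "doc_id,score",
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


# Hypothetical weights chosen only for illustration.
url = build_boosted_query("cow", {"doc_description": 5, "doc_content": 2, "doc_name": 1})
print(url)
```

Comparing the scores returned for this URL with those for the wildcard query should show whether the constant score is the cause.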

Related

Solr: how to index file content into multiple fields?

Solr version:
7.3.0
I want to index a file and store the extracted text in multiple fields (a word-split field and a bi-gram field) for search flexibility.
I wrote the configset below, but it does not work: Solr indexes the text only into content_text or only into content_text_bi, whichever fmap.content entry is defined first.
solrconfig.xml
...
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">content_text</str>
<str name="fmap.content">content_text_bi</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>
...
schema.xml
...
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- docValues are enabled by default for long type so we don't need to index the version field -->
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="content_text" type="text_ja" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<field name="content_text_bi" type="text_ja_bi" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<field name="filepath" type="string" indexed="true" stored="true" />
<field name="filename" type="string" indexed="true" stored="true" />
<field name="storage_id" type="pint" indexed="true" stored="true" />
...
How can I make this work the way I want?
I solved it by using copyField in schema.xml.
1. Add this line to schema.xml:
<copyField source="content_text" dest="content_text_bi" />
2. Remove this line from solrconfig.xml:
<str name="fmap.content">content_text_bi</str>
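To illustrate why this works, here is a toy model of copyField in Python. It is only a sketch, not how Solr is implemented: the raw value of the source field is copied to the destination before analysis, and each field then applies its own analyzer (text_ja or text_ja_bi here). The function name and the sample document are made up for the example.

```python
def apply_copy_fields(doc, copy_fields):
    """Copy raw source values to destination fields, as copyField does."""
    result = dict(doc)
    for source, dest in copy_fields:
        # Values are always copied from the original document, never chained.
        if source in doc:
            result[dest] = doc[source]
    return result


# Hypothetical extracted document: fmap.content put the text in content_text.
extracted = {"id": "file-1", "content_text": "extracted body text"}
indexed = apply_copy_fields(extracted, [("content_text", "content_text_bi")])
print(indexed["content_text_bi"])
```

A single fmap.content entry feeds content_text, and the copy step fans the same text out to content_text_bi.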

langid UpdateRequestProcessor only mapping first field

I am trying to use Solr's langid UpdateRequestProcessor. Here is the config:
<updateRequestProcessorChain name="languages">
<processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
<lst name="invariants">
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
<str name="langid.whitelist">en,fr</str>
<str name="langid.fallback">en</str>
<str name="langid.langField">detectedlang</str>
<bool name="langid.map">true</bool>
<bool name="langid.map.keepOrig">false</bool>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
My fields look like this:
<fields>
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<!-- raw fields from sql db -->
<field name="expertise_id" type="int" indexed="true" stored="true" />
<field name="person_id" type="int" indexed="true" stored="true" />
<field name="mod_date" type="date" indexed="true" stored="true" />
<field name="lang" type="string" indexed="true" stored="true" />
<field name="focus" type="text_general" indexed="true" stored="true" />
<field name="expertise" type="text_general" indexed="true" stored="true" />
<field name="platforms" type="text_general" indexed="true" stored="true" />
<field name="partners" type="text_general" indexed="true" stored="true" />
<field name="participation" type="text_general" indexed="true" stored="true" />
<field name="additional" type="text_general" indexed="true" stored="true" />
<field name="tag" type="text_general" termVectors="true" multiValued="true" />
<field name="facet_tag" type="string" stored="false" indexed="false" docValues="true" multiValued="true" default=""/>
<!-- language detected by solr -->
<field name="detectedlang" type="string" indexed="true" stored="true" />
<!-- defined locale fields -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" />
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" />
<copyField source="tag" target="facet_tag"/>
</fields>
When I run an update or a dataimport I know that the "languages" update chain is used because focus is mapped to focus_en and detectedlang is set. However, none of the other fields in langid.fl are mapped. Why?
An example update query:
{
"additional": "here is some other information about me.",
"expertise_id": "10000",
"id": "foo_10000",
"focus": "this is my new focus. It is very exciting. When I am done I expect to be super experienced."
}
And here is the result of a query for expertise_id=10000. Note that additional has not been moved to additional_en:
"response":{"numFound":1,"start":0,"docs":[
{
"additional":"here is some other information about me.",
"expertise_id":10000,
"id":"foo_10000",
"detectedlang":"en",
"focus_en":"this is my new focus. It is very exciting. When I am done I expect to be super experienced.",
"_version_":1447088846110982144}]
}
Turns out that the problem is a syntax error. This line:
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
must be
<str name="langid.fl">focus,expertise,platforms,partners,participation,additional</str>
The docs state that the field list should be comma- or space-separated. Evidently, combining commas and spaces breaks the parsing (though that works fine in other Solr contexts, such as fl in a requestHandler, which langid.fl is supposedly modelled on). I tried the space-separated syntax as well, but it did not fix my issue.
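A small Python illustration of the symptom, assuming (this is a guess, not something confirmed from the Solr source) that the processor splits the list on commas without trimming whitespace. Only the first name survives intact, which matches the observation that only focus was mapped. The helper parse_field_list is hypothetical.

```python
def parse_field_list(raw):
    """Split on commas and strip surrounding whitespace from each name."""
    return [name.strip() for name in raw.split(",") if name.strip()]


raw = "focus, expertise, platforms, partners, participation, additional"

# Naive split keeps the leading space, so " expertise" is not a known field.
naive = raw.split(",")
print(naive[:2])  # ['focus', ' expertise']

# Trimming each entry recovers the intended field names.
print(parse_field_list(raw)[:2])  # ['focus', 'expertise']
```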
I hope this helps someone.

Solr More Like This (MLT) not returning results

I'm currently looking to implement More Like This functionality based on a number of fields in my index.
My current configuration is as follows:
Haystack | PySolr | Solr
For this piece I'm using PySolr and passing the parameters to the more_like_this function. The response finds the document but not any related results. Why is that?
Here is the URL I hit:
http://localhost:8080/solr/mlt?q=django_id:12123412&mlt.fl=industry_ids,loc_state,amount,sector_id&mlt.interestingTerms=details
Here is my response from Solr:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">24</int>
</lst>
<result name="match" numFound="1" start="0">
<doc>...</doc>
</result>
<result name="response" numFound="0" start="0"/>
<lst name="interestingTerms"/>
</response>
solrconfig.xml
<!-- More Like This -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
</requestHandler>
schema.xml
<field name="award_amount" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="estatus" type="slong" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="loc_state" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="orgtype_id" type="string" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="sector_id" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="industry_ids" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" />
<field name="award_amount_exact" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="sector_id_exact" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="amount_exact" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true"/>
Any help would be appreciated!
Your text fields must use a text type, which analyzes them to make them searchable. String fields are indexed and queried verbatim as a single token, so they only match exactly, which makes them of little use for MLT.
See copyField if you ever want to store the same data as both text and string (for example, for faceting).
I see you also intend to find the numbers closest to your query. MLT is not the right tool for that; you want to compose a function query instead. See "SolR : More Like This on number fields".
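As a rough sketch of the function-query idea, the Python below builds a request that sorts documents by the absolute difference between award_amount and a target value. It assumes a Solr version that supports sorting by function. The /select path, the target value 50000, and the helper name are illustrative assumptions, not taken from the question.

```python
from urllib.parse import urlencode


def closest_amount_url(base_url, field, target, rows=10):
    """Return a URL that sorts documents by numeric distance to `target`."""
    params = {
        "q": "*:*",
        # abs(sub(field, target)) is the distance; the smallest comes first.
        "sort": f"abs(sub({field},{target})) asc",
        "fl": f"django_id,{field}",
        "rows": rows,
    }
    return f"{base_url}?{urlencode(params)}"


# Host and port follow the question; the handler path is an assumption.
url = closest_amount_url("http://localhost:8080/solr/select", "award_amount", 50000)
print(url)
```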

Tika Solr Metadata mapping ignore document title

I have the following config file for solr:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="lowernames">true</str>
<str name="fmap.content">content</str>
<str name="fmap.application_name">type</str>
<str name="fmap.content_type">mime</str>
<str name="fmap.stream_size">size</str>
<str name="uprefix">ignored_</str>
<str name="captureAttr">false</str>
</lst>
</requestHandler>
and this is my schema:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="access_type" type="string" indexed="true" stored="false"/>
<field name="access_restriction" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="string" indexed="true" stored="true" multiValued="true" />
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_en_splitting" indexed="true" stored="true"/>
<field name="created" type="date" indexed="true" stored="true"/>
<field name="createdby" type="string" indexed="true" stored="true"/>
<field name="modified" type="date" indexed="true" stored="true"/>
<field name="modifiedby" type="string" indexed="true" stored="true"/>
<field name="source" type="string" indexed="true" stored="true" />
<field name="version" type="string" indexed="true" stored="true" />
<field name="resourcelink" type="string" indexed="true" stored="true" />
<field name="downloadlink" type="string" indexed="true" stored="true" />
<field name="type" type="string" indexed="true" stored="true" />
<field name="mime" type="string" indexed="true" stored="true" />
<field name="size" type="string" indexed="true" stored="true" />
I want to set the title myself. But Tika keeps setting its own title (that's why I set multiValued="true" temporarily), which I find strange because I have to manually map stuff like stream_size and content_type.
What solution is possible to this issue?
I'd like Tika to override the title I assign, like this:
I have 3 documents. For one of them Tika doesn't extract a title, so the title I pass in literal.title should be used. When Tika does extract a title, I want it to override the one I passed in literal.title. Is this possible?
I worked on the same issue some time ago and hit a wall as well :(
I let Tika take "title" and use literal.other_title_like_field to store the proper title.
This is not the best solution, but it worked for me.
For those who are still struggling with this problem, I solved it by adding
<str name="fmap.title">ignored_</str>
in my ExtractingRequestHandler defaults.
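For reference, a minimal Python sketch of an extract request that combines literal.title with the fmap.title mapping. Here the mapping is passed as a request parameter instead of a handler default, which should have the same effect for that one request. The host, port, document id and title are placeholder assumptions.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, title):
    """Return an /update/extract URL that keeps only the caller's title."""
    params = {
        "literal.id": doc_id,
        # Title supplied by the caller.
        "literal.title": title,
        # Send Tika's extracted title to the ignored_ field instead of title.
        "fmap.title": "ignored_",
        "commit": "true",
    }
    return f"{base_url}?{urlencode(params)}"


# Placeholder values for illustration only.
url = build_extract_url("http://localhost:8983/solr/update/extract", "doc-1", "My own title")
print(url)
```

With this mapping the title field holds only the literal value, so it no longer needs to be multiValued.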

Indexing office formats with a custom field type schema

We have the following Solr (3.4) schema for indexing html/text documents:
<fields>
<field name="text" type="text" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="title" type="text" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="created" type="date" indexed="true"
stored="true" required="true" multiValued="false"
omitNorms="false"/>
<field name="modified" type="date" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="filesize" type="integer" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="mimetype" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="id" type="string" indexed="true"
stored="true" required="true" multiValued="false"
omitNorms="false"/>
<field name="tag" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="relpath" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<dynamicField name="tika_*" type="ignored" />
</fields>
The configurations are auto-generated from templates from the solrinstance recipe for zc.buildout.
Now we need to import/index PDF/Office files etc. into Solr for fulltext indexing.
The generated requestHandler for the extraction is:
<requestHandler name="/update/extract"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.text">tika_content</str>
<str name="lowernames">false</str>
<str name="uprefix">tika_</str>
</lst>
</requestHandler>
But after uploading a PDF file through curl I cannot find any indication that it
has been indexed (no changes in the document stats etc.).
What is the trick here?
[Update]
I am using
curl "http://localhost:8983/solr/update/extract?literal.id=2&commit=true&fmap.content=text" -F "myfile=@1.pdf"
to upload a PDF file. Adding fmap.content=text seems to do the desired mapping (overriding the generated configuration).
This seems to have solved the problem.
fmap is the field mapping for the content that Tika generates.
The Tika handler extracts the content of the uploaded document and assigns it to a field named content.
<str name="fmap.content">text</str> maps that content field to the text field defined in the schema.
Since the text field is defined in your schema, this works.
However, <str name="fmap.text">tika_content</str> maps a field named text, which I don't think Tika generates, to tika_content, which is not defined as a field in the schema, so it would not produce any matches.
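The mapping described above can be modelled in a few lines of Python. This is a simplified sketch, not Solr's actual code: it applies fmap first, then adds uprefix to any resulting name that is not an explicit schema field, and it does not model dynamic fields or lowernames. The function name and the sample Tika fields are invented for the example.

```python
def map_tika_fields(tika_fields, fmap, schema_fields, uprefix=None):
    """Apply fmap renames, then prefix names missing from the schema."""
    mapped = {}
    for name, value in tika_fields.items():
        target = fmap.get(name, name)
        # Unknown fields get the prefix, e.g. content -> tika_content.
        if target not in schema_fields and uprefix is not None:
            target = uprefix + target
        mapped[target] = value
    return mapped


schema = {"text", "title", "id"}
tika = {"content": "extracted body", "Author": "someone"}

# Generated config: fmap.text has nothing to rename, content gets prefixed.
generated = map_tika_fields(tika, {"text": "tika_content"}, schema, uprefix="tika_")
print(generated)

# With fmap.content=text added, the body lands in the text field.
fixed = map_tika_fields(tika, {"text": "tika_content", "content": "text"}, schema, uprefix="tika_")
print(fixed)
```

In the first case the body ends up under tika_content, which the schema's tika_* dynamic field ignores; in the second it reaches the indexed text field.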
