Tika Solr Metadata mapping ignore document title - solr

I have the following config file for solr:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="lowernames">true</str>
<str name="fmap.content">content</str>
<str name="fmap.application_name">type</str>
<str name="fmap.content_type">mime</str>
<str name="fmap.stream_size">size</str>
<str name="uprefix">ignored_</str>
<str name="captureAttr">false</str>
</lst>
</requestHandler>
and this is my schema:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="access_type" type="string" indexed="true" stored="false"/>
<field name="access_restriction" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="string" indexed="true" stored="true" multiValued="true" />
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_en_splitting" indexed="true" stored="true"/>
<field name="created" type="date" indexed="true" stored="true"/>
<field name="createdby" type="string" indexed="true" stored="true"/>
<field name="modified" type="date" indexed="true" stored="true"/>
<field name="modifiedby" type="string" indexed="true" stored="true"/>
<field name="source" type="string" indexed="true" stored="true" />
<field name="version" type="string" indexed="true" stored="true" />
<field name="resourcelink" type="string" indexed="true" stored="true" />
<field name="downloadlink" type="string" indexed="true" stored="true" />
<field name="type" type="string" indexed="true" stored="true" />
<field name="mime" type="string" indexed="true" stored="true" />
<field name="size" type="string" indexed="true" stored="true" />
I want to set the title myself. But Tika keeps setting it's own title (that's why I set multiValued="true" temporarily), which I find strange because I have to manually map stuff like stream_size and content_type.
What solution is possible to this issue?
I'd like Tika to override the title I assign, like this:
I have 3 documents, for one of those, Tika doesn't extract a title, in this case, I have my own title I set passing literal.title, when Tika does extract a title, I want it to override the one I passed in literal.title. Is this possible?

I was working on the same issue some time ago, but I hit a wall as well :(
I let Tika take "title", and use literal.other_title_like_field to store proper title.
This is not a best solution, but worked for me.

For those who are still struggling with this problem, I solved it by adding
<str name="fmap.title">ignored_</str>
in my ExtractingRequestHandler defaults.

Related

Solr how to indexing file content to multiple field?

Solr version:
7.3.0
I want to indexing file and register extracted text to multi field (word splitted field and bi-gram field) for search flexibility.
I wrote below configset, but it does not work, solr indexed only to content_text ,or content_text_bi (upper defined fmap.content field only)
solrconfig.xml
...
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">content_text</str>
<str name="fmap.content">content_text_bi</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>
...
schema.xml
...
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- docValues are enabled by default for long type so we don't need to index the version field -->
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="content_text" type="text_ja" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<field name="content_text_bi" type="text_ja_bi" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<field name="filepath" type="string" indexed="true" stored="true" />
<field name="filename" type="string" indexed="true" stored="true" />
<field name="storage_id" type="pint" indexed="true" stored="true" />
...
How can I make it work as I want?
I solved to use copyField in schema.xml.
1. Add this line to schema.xml
<copyField source="content_text" dest="content_text_bi" />
2.and remove this line in in solrconfig.xml
<str name="fmap.content">content_text_bi</str>

langid UpdateRequestProcessor only mapping first field

I am trying to use solr's langid UpdateRequestProcessor. Here is the config:
<updateRequestProcessorChain name="languages">
<processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
<lst name="invariants">
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
<str name="langid.whitelist">en,fr</str>
<str name="langid.fallback">en</str>
<str name="langid.langField">detectedlang</str>
<bool name="langid.map">true</bool>
<bool name="langid.map.keepOrig">false</bool>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
My fields look like this:
<fields>
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<!-- raw fields from sql db -->
<field name="expertise_id" type="int" indexed="true" stored="true" />
<field name="person_id" type="int" indexed="true" stored="true" />
<field name="mod_date" type="date" indexed="true" stored="true" />
<field name="lang" type="string" indexed="true" stored="true" />
<field name="focus" type="text_general" indexed="true" stored="true" />
<field name="expertise" type="text_general" indexed="true" stored="true" />
<field name="platforms" type="text_general" indexed="true" stored="true" />
<field name="partners" type="text_general" indexed="true" stored="true" />
<field name="participation" type="text_general" indexed="true" stored="true" />
<field name="additional" type="text_general" indexed="true" stored="true" />
<field name="tag" type="text_general" termVectors="true" multiValued="true" />
<field name="facet_tag" type="string" stored="false" indexed="false" docValues="true" multiValued="true" default=""/>
<!-- language detected by solr -->
<field name="detectedlang" type="string" indexed="true" stored="true" />
<!-- defined locale fields -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" />
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" />
<copyField source="tag" target="facet_tag"/>
</fields>
When I run an update or a dataimport I know that the "languages" update chain is used because focus is mapped to focus_en and detectedlang is set. However, none of the other fields in langid.fl are mapped. Why?
An example update query:
{
"additional": "here is some other information about me.",
"expertise_id": "10000",
"id": "foo_10000",
"focus": "this is my new focus. It is very exciting. When I am done I expect to be super experienced."
}
And here is the result of a query for expertise_id=10000. Note that additional has not been moved to additional_en:
"response":{"numFound":1,"start":0,"docs":[
{
"additional":"here is some other information about me.",
"expertise_id":10000,
"id":"foo_10000",
"detectedlang":"en",
"focus_en":"this is my new focus. It is very exciting. When I am done I expect to be super experienced.",
"_version_":1447088846110982144}]
}
Turns out that the problem is a syntax error. This line:
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
must be
<str name="langid.fl">focus,expertise,platforms,partners,participation,additional</str>
The docs state that the field list should be comma or space separated values. Evidently, comma and space screws things up (though it works fine in other Solr contexts like fl in a requestHandler which langid.fl is supposedly modelled on). I tried the space-separated syntax as well, but it did not fix my issue.
I hope this helps someone.

Solr More Like This (MLT) not returning results

I'm currently looking to implement more like this functionality based on a on a number of fields in my index.
My current configuration is as follows:
Haystack | PySolr | Solr
For this piece I'm using PySolr and passing the parameters to the more_like_this function. The response finds the document but not any related results. Why is that?
Here is the URL I hit:
http://localhost:8080/solr/mlt?q=django_id:12123412&mlt.fl=industry_ids,loc_state,amount,sector_id&mlt.interestingTerms=details
Here is my response from Solr:
<response>
<object type="{XXXXXX-0F1D-4F28-AAA2-XXXXXXXXXXX}" cotype="cs" id="cosymantecbfw" style="width: 0px; height: 0px; display: block;"/>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">24</int>
</lst>
<result name="match" numFound="1" start="0">
<doc>...</doc>
</result>
<result name="response" numFound="0" start="0"/>
<lst name="interestingTerms"/>
</response>
solrconfig.xml
<!-- More Like This -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
</requestHandler>
schema.xml
<field name="award_amount" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="estatus" type="slong" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="loc_state" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="orgtype_id" type="string" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="sector_id" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="industry_ids" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" />
<field name="award_amount_exact" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="sector_id_exact" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="amount_exact" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true"/>
Any help would be appreciated!
Your text fields must have type text, which processes them to make them searchable. The string fields are stored and queried as they are, so they are not searchable, making them useless for MLT.
Refer copy fields if you ever want to store the same data as both text and string (for example, faceting).
I see you also intend to find numbers closest to our query. MLT is not right for that. You want to compose a function query for that. SolR : More Like This on number fields

The simplest Solr DIH indexing

I'm trying to index data from a database in Solr using the DIH.
So I have modified the two config files as follows:
solrconfig.xml :
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
data-config.xml :
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test" user="root" password="****"/>
<document>
<entity name="source_scellee" query="select * from source_scellee">
</entity>
</document>
</dataConfig>
source_scellee being the name of my table on my test database. It contains many fields.
Obviously, I'm trying to run nothing else than a simple test. When running http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true I get the following result :
<str name="Full Dump Started">2012-01-27 12:27:01</str><str name="">Indexing completed. Added/Updated: 4 documents. Deleted 0 documents.</str><str name="Committed">2012-01-27 12:27:02</str>
<str name="**Total Documents Failed**">4</str>
Besides no warning nor error on the server logs. 4 is my number of records inside table "source_scellee". But it says all documents fail.
If I run a query from http://localhost:8983/solr/admin/
no results appear, at all !! How can I solve it ?
(":" shows no results)
Thank you for your help!!!
----edit---
I have added these lines to my schema.xml :
<field name="ID" type="int" indexed="true" stored="true" />
<field name="reference_catalogue" type="string" indexed="true" stored="true"/>
<field name="reference_capsule" type="string" indexed="true" stored="true"/>
<field name="organisme_certificateur" type="string" indexed="true" stored="true" />
<field name="reference_certificat" type="string" indexed="true" stored="true" />
<field name="duree_d_utilisation" type="string" indexed="true" stored="true" />
<field name="activite_nominale" type="string" indexed="true" stored="true"/>
<field name="activite_minimale" type="string" indexed="true" stored="true"/>
<field name="activite_maximale" type="string" indexed="true" stored="true"/>
<field name="coffret" type="boolean" indexed="true" stored="true"/>
<field name="dispositif_medical" type="boolean" indexed="true" stored="true"/>
<field name="forme_speciale" type="boolean" indexed="true" stored="true" />
<field name="exemption_cpa" type="boolean" indexed="true" stored="true"/>
<field name="marquage_ce" type="boolean" indexed="true" stored="true"/>
<field name="element_cible" type="boolean" indexed="true" stored="true"/>
However the result is still the same: no results when querying (I tried to restart solr, and to re-index all also)
------second edit---
I have tried the dynamic import
Now my data-config.xml looks like this :
<document>
<entity name="source_scellee" query="select * from source_scellee">
<field column="ID" name="ID_i" />
<field column="reference_catalogue" name="reference_catalogue_s" />
<field column="reference_capsule" name="reference_capsule_s" />
<field column="organisme_certificateur" name="organisme_certificateur_s" />
<field column="reference_certificat" name="reference_certificat_s" />
<field column="duree_d_utilisation" name="duree_d_utilisation_s" />
<field column="activite_nominale" name="activite_nominale_s" />
<field column="activite_minimale" name="activite_minimale_s" />
<field column="activite_maximale" name="activite_maximale_s" />
<field column="coffret" name="coffret_b" />
<field column="dispositif_medical" name="dispositif_medical_b" />
<field column="forme_speciale" name="forme_speciale_b" />
<field column="exemption_cpa" name="exemption_cpa_b" />
<field column="marquage_ce" name="marquage_ce_b" />
<field column="element_cible" name="element_cible_b" />
</entity>
</document>
1.) You can take a look to the statistics page to see, how much docs are indexed right now:
http://localhost:8983/solr/admin/stats.jsp
2.) The result of your search depends on your schema.xml, because there it's defined how docs are indexed/stored, which fields are processed and how searchs are handled on query time.
Please take a look at this file or post the field definition from the schema.xml and also the schema/design from your table source_scellee.
Does the columns and the fields have the same name?
//Edit: This should work, if coulmname and filedname are the same:
<document>
<entity name="source_scellee"
pk="ID"
query="select * from source_scellee">
</entity>
</document>
is having NULL values in data an issue ?
that depends on the destination field.
Are your running solr in an tomcat or someting like that?
Take a look in the Java EE Container output, like catalina.out or so.
I am pretty sure the issue lies in how the DIH is trying to map fields. Thanks for adding the information from your schema file... However, I believe that what you have done is added configuration that needs to be added separately to both the schema.xml and the data-config.xml for the DIH.
Based on the Full Import Example from the Solr Wiki, I would try the following.
schema.xml
<field name="ID" type="int" indexed="true" stored="true" />
<field name="reference_catalogue" type="string" indexed="true" stored="true"/>
<field name="reference_capsule" type="string" indexed="true" stored="true"/>
<field name="date_de_creation" type="date" indexed="true" stored="true"/>
<field name="organisme_certificateur" type="string" indexed="true" stored="true" />
<field name="reference_certificat" type="string" indexed="true" stored="true" />
<field name="duree_d_utilisation" type="string" indexed="true" stored="true" />
<field name="activite_nominale" type="string" indexed="true" stored="true"/>
<field name="activite_minimale" type="string" indexed="true" stored="true"/>
<field name="activite_maximale" type="string" indexed="true" stored="true"/>
<field name="coffret" type="int" indexed="true" stored="true"/>
<field name="dispositif_medical" type="int" indexed="true" stored="true"/>
<field name="forme_speciale" type="int" indexed="true" stored="true" />
<field name="exemption_cpa" type="int" indexed="true" stored="true"/>
<field name="marquage_ce" type="int" indexed="true" stored="true"/>
<field name="element_cible" type="int" indexed="true" stored="true"/>
data-config.xml
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test" user="root" password="****"/>
<document>
<entity name="source_scellee" query="select * from source_scellee">
<field column="ID" name="ID"/>
<field column="reference_catalogue" name="reference_catalogue"/>
<field column="reference_capsule" name="reference_capsule"/>
<field column="date_de_creation" name="date_de_creation"/>
<field column="organisme_certificateur" name="organisme_certificateur"/>
<field column="reference_certificat" name="reference_certificat"/>
<field column="duree_d_utilisation" name="duree_d_utilisation"/>
<field column="activite_nominale" name="activite_nominale"/>
<field column="activite_minimale" name="activite_minimale"/>
<field column="activite_maximale" name="activite_maximale"/>
<field column="coffret" name="coffret"/>
<field column="dispositif_medical" name="dispositif_medical"/>
<field column="forme_speciale" name="forme_speciale"/>
<field column="exemption_cpa" name="exemption_cpa"/>
<field column="marquage_ce" name="marquage_ce"/>
<field column="element_cible" name="element_cible"/>
</entity>
</document>
</dataConfig>
There is a way to setup the schema.xml to dynamically add fields that it encounters by using some naming conventions. Please see the Dynamic Fields details in the Solr Wiki for more details and some examples of how this can be done.

Indexing office formats with a custom field type schema

We have the following Solr (3.4) schema for indexing html/text documents:
<fields>
<field name="text" type="text" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="title" type="text" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="created" type="date" indexed="true"
stored="true" required="true" multiValued="false"
omitNorms="false"/>
<field name="modified" type="date" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="filesize" type="integer" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="mimetype" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="id" type="string" indexed="true"
stored="true" required="true" multiValued="false"
omitNorms="false"/>
<field name="tag" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="relpath" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<dynamicField name="tika_*" type="ignored" />
</fields>
The configurations are auto-generated from templates from the solrinstance recipe for zc.buildout.
Now we need to import/index PDF/Office files etc. into Solr for fulltext indexing.
The generated requestHandler for the extraction is:
<requestHandler name="/update/extract"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.text">tika_content</str>
<str name="lowernames">false</str>
<str name="uprefix">tika_</str>
</lst>
</requestHandler>
But after uploading a PDF file through curl I can not find any indication that it
has been index (no changes in the document stats etc.).
What is the trick here?
[Update]
I am using
curl "http://localhost:8983/solr/update/extract?literal.id=2&commit=true&fmap.content=text" -F "myfile=#1.pdf"
to upload a PDF file. Having adding fmap.content=text seems to do the desired mapping (overriding the generated configuration).
This seems to have solved the problem.
fmap is basically field mapping for the content generated by tika.
Tika handler extracts the content of the document uploaded and assigns it to the field name content.
<str name="fmap.content">text</str> maps the content field to the text field defined in the schema.
As you have text field defined in the schema, this will work.
However, for <str name="fmap.text">tika_content</str> there is not field tika_content defined nor I think the text gets generated, so would not result in any matches.

Resources