The simplest Solr DIH indexing - solr

I'm trying to index data from a database in Solr using the DIH.
So I have modified the two config files as follows:
solrconfig.xml :
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
data-config.xml :
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test" user="root" password="****"/>
<document>
<entity name="source_scellee" query="select * from source_scellee">
</entity>
</document>
</dataConfig>
source_scellee being the name of my table on my test database. It contains many fields.
Obviously, I'm trying to run nothing else than a simple test. When running http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true I get the following result :
<str name="Full Dump Started">2012-01-27 12:27:01</str><str name="">Indexing completed. Added/Updated: 4 documents. Deleted 0 documents.</str><str name="Committed">2012-01-27 12:27:02</str>
<str name="**Total Documents Failed**">4</str>
Besides no warning nor error on the server logs. 4 is my number of records inside table "source_scellee". But it says all documents fail.
If I run a query from http://localhost:8983/solr/admin/
no results appear, at all !! How can I solve it ?
(":" shows no results)
Thank you for your help!!!
----edit---
I have added these lines to my schema.xml :
<field name="ID" type="int" indexed="true" stored="true" />
<field name="reference_catalogue" type="string" indexed="true" stored="true"/>
<field name="reference_capsule" type="string" indexed="true" stored="true"/>
<field name="organisme_certificateur" type="string" indexed="true" stored="true" />
<field name="reference_certificat" type="string" indexed="true" stored="true" />
<field name="duree_d_utilisation" type="string" indexed="true" stored="true" />
<field name="activite_nominale" type="string" indexed="true" stored="true"/>
<field name="activite_minimale" type="string" indexed="true" stored="true"/>
<field name="activite_maximale" type="string" indexed="true" stored="true"/>
<field name="coffret" type="boolean" indexed="true" stored="true"/>
<field name="dispositif_medical" type="boolean" indexed="true" stored="true"/>
<field name="forme_speciale" type="boolean" indexed="true" stored="true" />
<field name="exemption_cpa" type="boolean" indexed="true" stored="true"/>
<field name="marquage_ce" type="boolean" indexed="true" stored="true"/>
<field name="element_cible" type="boolean" indexed="true" stored="true"/>
However the result is still the same: no results when querying (I tried to restart solr, and to re-index all also)
------second edit---
I have tried the dynamic import
Now my data-config.xml looks like this :
<document>
<entity name="source_scellee" query="select * from source_scellee">
<field column="ID" name="ID_i" />
<field column="reference_catalogue" name="reference_catalogue_s" />
<field column="reference_capsule" name="reference_capsule_s" />
<field column="organisme_certificateur" name="organisme_certificateur_s" />
<field column="reference_certificat" name="reference_certificat_s" />
<field column="duree_d_utilisation" name="duree_d_utilisation_s" />
<field column="activite_nominale" name="activite_nominale_s" />
<field column="activite_minimale" name="activite_minimale_s" />
<field column="activite_maximale" name="activite_maximale_s" />
<field column="coffret" name="coffret_b" />
<field column="dispositif_medical" name="dispositif_medical_b" />
<field column="forme_speciale" name="forme_speciale_b" />
<field column="exemption_cpa" name="exemption_cpa_b" />
<field column="marquage_ce" name="marquage_ce_b" />
<field column="element_cible" name="element_cible_b" />
</entity>
</document>

1.) You can take a look to the statistics page to see, how much docs are indexed right now:
http://localhost:8983/solr/admin/stats.jsp
2.) The result of your search depends on your schema.xml, because there it's defined how docs are indexed/stored, which fields are processed and how searchs are handled on query time.
Please take a look at this file or post the field definition from the schema.xml and also the schema/design from your table source_scellee.
Does the columns and the fields have the same name?
//Edit: This should work, if coulmname and filedname are the same:
<document>
<entity name="source_scellee"
pk="ID"
query="select * from source_scellee">
</entity>
</document>
is having NULL values in data an issue ?
that depends on the destination field.
Are your running solr in an tomcat or someting like that?
Take a look in the Java EE Container output, like catalina.out or so.

I am pretty sure the issue lies in how the DIH is trying to map fields. Thanks for adding the information from your schema file... However, I believe that what you have done is added configuration that needs to be added separately to both the schema.xml and the data-config.xml for the DIH.
Based on the Full Import Example from the Solr Wiki, I would try the following.
schema.xml
<field name="ID" type="int" indexed="true" stored="true" />
<field name="reference_catalogue" type="string" indexed="true" stored="true"/>
<field name="reference_capsule" type="string" indexed="true" stored="true"/>
<field name="date_de_creation" type="date" indexed="true" stored="true"/>
<field name="organisme_certificateur" type="string" indexed="true" stored="true" />
<field name="reference_certificat" type="string" indexed="true" stored="true" />
<field name="duree_d_utilisation" type="string" indexed="true" stored="true" />
<field name="activite_nominale" type="string" indexed="true" stored="true"/>
<field name="activite_minimale" type="string" indexed="true" stored="true"/>
<field name="activite_maximale" type="string" indexed="true" stored="true"/>
<field name="coffret" type="int" indexed="true" stored="true"/>
<field name="dispositif_medical" type="int" indexed="true" stored="true"/>
<field name="forme_speciale" type="int" indexed="true" stored="true" />
<field name="exemption_cpa" type="int" indexed="true" stored="true"/>
<field name="marquage_ce" type="int" indexed="true" stored="true"/>
<field name="element_cible" type="int" indexed="true" stored="true"/>
data-config.xml
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test" user="root" password="****"/>
<document>
<entity name="source_scellee" query="select * from source_scellee">
<field column="ID" name="ID"/>
<field column="reference_catalogue" name="reference_catalogue"/>
<field column="reference_capsule" name="reference_capsule"/>
<field column="date_de_creation" name="date_de_creation"/>
<field column="organisme_certificateur" name="organisme_certificateur"/>
<field column="reference_certificat" name="reference_certificat"/>
<field column="duree_d_utilisation" name="duree_d_utilisation"/>
<field column="activite_nominale" name="activite_nominale"/>
<field column="activite_minimale" name="activite_minimale"/>
<field column="activite_maximale" name="activite_maximale"/>
<field column="coffret" name="coffret"/>
<field column="dispositif_medical" name="dispositif_medical"/>
<field column="forme_speciale" name="forme_speciale"/>
<field column="exemption_cpa" name="exemption_cpa"/>
<field column="marquage_ce" name="marquage_ce"/>
<field column="element_cible" name="element_cible"/>
</entity>
</document>
</dataConfig>
There is a way to setup the schema.xml to dynamically add fields that it encounters by using some naming conventions. Please see the Dynamic Fields details in the Solr Wiki for more details and some examples of how this can be done.

Related

parent child indexing in apache solr

I'm new to Apache solr search. I'm not getting ho to get solr search result with child documents.
My entity in data-config.xml
<entity name="products" query="SELECT DISTINCT IDENTIFIER,PDT_NAME,PDT_DESCRIPTION FROM **PARENT_TABLE**"
deltaQuery="SELECT IDENTIFIER FROM PARENT_TABLE WHERE LAST_MODIFIED_DATE > '${dataimporter.last_index_time}'">
<field column="IDENTIFIER" name="pdtid" />
<field column="PDT_NAME" name="productname" />
<field column="PDT_DESCRIPTION" name="productdescription" />
<entity name="productVersions" child="true" query="SELECT DISTINCT child_id , child_name FROM WHERE IDENTIFIER = '${**products.IDENTIFIER**}'">
<field column="IDENTIFIER" name="productVersions.pdtesat" />
<field column="VERSION_NUMBER" name="productVersions.versionnum" />
<field column="DISPLAY_NAME" name="productVersions.displayname" />
</entity>
</entity>
field details in managed-schema file:
<field name="pdtid" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="productname" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="productnamerrr" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="productdescription" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="productVersions.childid" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="productVersions.versionnum" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="productVersions.displayname" type="text_general" indexed="true" stored="true" multiValued="false" />
I'm expecting my solr result should be :
"response":{"numFound":26,"start":0,"docs":[
{
"productdescription":" Java",
"productnamerrr":"pdtid",
"pdtid":"6591",
"child_docs" : [
"productVersions":[
"productVersions.childid":"123"
"productVersions.versionnum":"V1"
"productVersions.displayname":"disp"],
"productVersions":[
"productVersions.childid":"456"
"productVersions.versionnum":"V2"
"productVersions.displayname":"disp2"]
],
"id":"92689209-dc5f-4ae6-bd3c-d55dbd0e200c",
"_version_":1599132440456069120},
Please help me in getting the multiple child docs in json format after indexing.
May 2nd edit.
My query result from solr search like below.
"response":{"numFound":38,"start":0,"docs":[
{
"productdescription":" JIRA provides issue (bug) and project tracking
for the software development team.",
"productnamerrr":"Atlassian JIRA",
"productVersions":
["childid:6.x,versionnum:Jira 6.x,displayname :Withdrawn",
"childid:2.0.3,versionnum:Atlassian JIRA,displayname:Planning",
"childid:JIRA Server 5.0.1 - 6.3.15,versionnum:JIRA - JEditor,displayname :Withdrawn",
"childid:1.x,versionnum:Jira 1.x,displayname :Withdrawn"
],
"id":"0b5ba528-ef7a-49ba-a97b-2ea94922cbb5",
"_version_":1599297669816123392},
Edited on May 3-2018
returned data is correct. But the i'm expecting in parent child documents explicitly. getting child docs as below.
"productVersions":["childid:6.x,versionnum:Jira 6.x,displayname :Withdrawn",
"childid:2.0.3,versionnum:Atlassian JIRA,displayname:Planning",
"childid:JIRA Server 5.0.1 - 6.3.15,versionnum:JIRA - JEditor,displayname :Withdrawn",
"childid:1.x,versionnum:Jira 1.x,displayname :Withdrawn"
],
Expecting like below.
"productVersions":[
"productVersions.childid":"123"
"productVersions.versionnum":"V1"
"productVersions.displayname":"disp"],
"productVersions":[
"productVersions.childid":"456"
"productVersions.versionnum":"V2"
"productVersions.displayname":"disp2"]
],
How can i change the query to get child docs separately as a separate entity.??

solr does not import fields other than id

I am using Solr DataImportHandler module. Here is my config;
<dataConfig>
<dataSource type="JdbcDataSource"
name="sql"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://localhost;databaseName=AdventureWorks2008;integratedSecurity=true;"/>
<document>
<entity name="Person" dataSource="sql"
pk="BusinessEntityID"
query="select BusinessEntityID,FirstName,LastName FROM [Person].[Person]"
deltaImportQuery="select BusinessEntityID,FirstName,LastName FROM [Person].[Person] WHERE id='${dih.delta.id}'"
deltaQuery="SELECT BusinessEntityID FROM [Person].[Person] WHERE ModifiedDate > '${dih.last_index_time}'">
<field column="BusinessEntityID" name="id"/>
<field column="FirstName" name="firstname"/>
<field column="LastName" name="lastname"/>
</entity>
</document>
</dataConfig>
for some reason, only id field is importing but not the rest.
What would be the reason? Am I missing something?
You might have missed the below entries in the schema.xml file
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="firstname" type="string" indexed="true" stored="true"/>
<field name="lastname" type="string" indexed="true" stored="true"/>
Here type for id can be int. Just check what you want.
<field name="id" type="int" indexed="true" stored="true" required="true"/>
Make sure your Id and unique field is Proper.
I was facing same issue, change Pk and unique field name and it's working fine.

langid UpdateRequestProcessor only mapping first field

I am trying to use solr's langid UpdateRequestProcessor. Here is the config:
<updateRequestProcessorChain name="languages">
<processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
<lst name="invariants">
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
<str name="langid.whitelist">en,fr</str>
<str name="langid.fallback">en</str>
<str name="langid.langField">detectedlang</str>
<bool name="langid.map">true</bool>
<bool name="langid.map.keepOrig">false</bool>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
My fields look like this:
<fields>
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<!-- raw fields from sql db -->
<field name="expertise_id" type="int" indexed="true" stored="true" />
<field name="person_id" type="int" indexed="true" stored="true" />
<field name="mod_date" type="date" indexed="true" stored="true" />
<field name="lang" type="string" indexed="true" stored="true" />
<field name="focus" type="text_general" indexed="true" stored="true" />
<field name="expertise" type="text_general" indexed="true" stored="true" />
<field name="platforms" type="text_general" indexed="true" stored="true" />
<field name="partners" type="text_general" indexed="true" stored="true" />
<field name="participation" type="text_general" indexed="true" stored="true" />
<field name="additional" type="text_general" indexed="true" stored="true" />
<field name="tag" type="text_general" termVectors="true" multiValued="true" />
<field name="facet_tag" type="string" stored="false" indexed="false" docValues="true" multiValued="true" default=""/>
<!-- language detected by solr -->
<field name="detectedlang" type="string" indexed="true" stored="true" />
<!-- defined locale fields -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" />
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" />
<copyField source="tag" target="facet_tag"/>
</fields>
When I run an update or a dataimport I know that the "languages" update chain is used because focus is mapped to focus_en and detectedlang is set. However, none of the other fields in langid.fl are mapped. Why?
An example update query:
{
"additional": "here is some other information about me.",
"expertise_id": "10000",
"id": "foo_10000",
"focus": "this is my new focus. It is very exciting. When I am done I expect to be super experienced."
}
And here is the result of a query for expertise_id=10000. Note that additional has not been moved to additional_en:
"response":{"numFound":1,"start":0,"docs":[
{
"additional":"here is some other information about me.",
"expertise_id":10000,
"id":"foo_10000",
"detectedlang":"en",
"focus_en":"this is my new focus. It is very exciting. When I am done I expect to be super experienced.",
"_version_":1447088846110982144}]
}
Turns out that the problem is a syntax error. This line:
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
must be
<str name="langid.fl">focus,expertise,platforms,partners,participation,additional</str>
The docs state that the field list should be comma or space separated values. Evidently, comma and space screws things up (though it works fine in other Solr contexts like fl in a requestHandler which langid.fl is supposedly modelled on). I tried the space-separated syntax as well, but it did not fix my issue.
I hope this helps someone.

Tika Solr Metadata mapping ignore document title

I have the following config file for solr:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="lowernames">true</str>
<str name="fmap.content">content</str>
<str name="fmap.application_name">type</str>
<str name="fmap.content_type">mime</str>
<str name="fmap.stream_size">size</str>
<str name="uprefix">ignored_</str>
<str name="captureAttr">false</str>
</lst>
</requestHandler>
and this is my schema:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="access_type" type="string" indexed="true" stored="false"/>
<field name="access_restriction" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="string" indexed="true" stored="true" multiValued="true" />
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_en_splitting" indexed="true" stored="true"/>
<field name="created" type="date" indexed="true" stored="true"/>
<field name="createdby" type="string" indexed="true" stored="true"/>
<field name="modified" type="date" indexed="true" stored="true"/>
<field name="modifiedby" type="string" indexed="true" stored="true"/>
<field name="source" type="string" indexed="true" stored="true" />
<field name="version" type="string" indexed="true" stored="true" />
<field name="resourcelink" type="string" indexed="true" stored="true" />
<field name="downloadlink" type="string" indexed="true" stored="true" />
<field name="type" type="string" indexed="true" stored="true" />
<field name="mime" type="string" indexed="true" stored="true" />
<field name="size" type="string" indexed="true" stored="true" />
I want to set the title myself. But Tika keeps setting it's own title (that's why I set multiValued="true" temporarily), which I find strange because I have to manually map stuff like stream_size and content_type.
What solution is possible to this issue?
I'd like Tika to override the title I assign, like this:
I have 3 documents, for one of those, Tika doesn't extract a title, in this case, I have my own title I set passing literal.title, when Tika does extract a title, I want it to override the one I passed in literal.title. Is this possible?
I was working on the same issue some time ago, but I hit a wall as well :(
I let Tika take "title", and use literal.other_title_like_field to store proper title.
This is not a best solution, but worked for me.
For those who are still struggling with this problem, I solved it by adding
<str name="fmap.title">ignored_</str>
in my ExtractingRequestHandler defaults.

Indexing office formats with a custom field type schema

We have the following Solr (3.4) schema for indexing html/text documents:
<fields>
<field name="text" type="text" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="title" type="text" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="created" type="date" indexed="true"
stored="true" required="true" multiValued="false"
omitNorms="false"/>
<field name="modified" type="date" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="filesize" type="integer" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="mimetype" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="id" type="string" indexed="true"
stored="true" required="true" multiValued="false"
omitNorms="false"/>
<field name="tag" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="relpath" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<dynamicField name="tika_*" type="ignored" />
</fields>
The configurations are auto-generated from templates from the solrinstance recipe for zc.buildout.
Now we need to import/index PDF/Office files etc. into Solr for fulltext indexing.
The generated requestHandler for the extraction is:
<requestHandler name="/update/extract"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.text">tika_content</str>
<str name="lowernames">false</str>
<str name="uprefix">tika_</str>
</lst>
</requestHandler>
But after uploading a PDF file through curl I can not find any indication that it
has been index (no changes in the document stats etc.).
What is the trick here?
[Update]
I am using
curl "http://localhost:8983/solr/update/extract?literal.id=2&commit=true&fmap.content=text" -F "myfile=#1.pdf"
to upload a PDF file. Having adding fmap.content=text seems to do the desired mapping (overriding the generated configuration).
This seems to have solved the problem.
fmap is basically field mapping for the content generated by tika.
Tika handler extracts the content of the document uploaded and assigns it to the field name content.
<str name="fmap.content">text</str> maps the content field to the text field defined in the schema.
As you have text field defined in the schema, this will work.
However, for <str name="fmap.text">tika_content</str> there is not field tika_content defined nor I think the text gets generated, so would not result in any matches.

Resources