Solr DataImportHandler 1 to many relation

I have set up MySQL tables for storing articles.
Basically, they look like this:
article
-------
article_number
title

sets_article
------------
setcode
article_number
I have set up a schema.xml and configured the DataImportHandler. Everything works fine, except that the sets are not stored when I call the DataImportHandler with full-import.
Here is the relevant part of my data-config.xml:
<document name="articles">
  <entity name="article"
          pk="article_number"
          query="select * from article"
          deltaImportQuery="select * from article where article_number='${dih.delta.article_number}'"
          deltaQuery="select article_number from article where tstamp > UNIX_TIMESTAMP(STR_TO_DATE('${dih.last_index_time}', '%Y-%m-%d %H:%i:%s'))">
    <entity name="sets_article" query="select setcode as sets from sets_article where article_number='${article.article_number}'" />
    <entity name="sets_articlel2" query="select distinct setcode as sets2 from sets_article" />
    <entity name="sets_articlel3" query="select distinct setcode as sets3 from sets_article where article_number='11112222'" />
  </entity>
</document>
The entities sets_articlel2 and sets_articlel3 work fine, so I think the problem is with:
where article_number='${article.article_number}'
Does anybody know what is wrong with this setup?

The problem was a simple misconfiguration.
The tables didn't have the proper primary keys set.
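
Separately from the primary-key fix, a one-to-many child entity such as sets_article can return several rows per article, so the target field usually has to accept more than one value. A minimal schema.xml sketch, assuming the field is called sets (the alias used in the query above) and uses a string type; these attribute values are assumptions and are not taken from the original schema:

```xml
<!-- schema.xml sketch (assumed, not from the original post):
     the child entity may return several setcode rows per article,
     so the "sets" field is declared multi-valued -->
<field name="sets" type="string" indexed="true" stored="true" multiValued="true" />
```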

Related

SolR's Tika processor in Data Import Handler does not get filename from DB processor

I have a DIH configuration in which I want to combine data from the DB and Tika by passing the filename from the DB to Tika. The problem is that the filename arrives empty in Tika. The logs say:
ERROR (Thread-16) [ ] o.a.s.h.d.DataImporter Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: (resolved to: C:\Users\jimbo\Desktop\solr-8.9.0\server\.
My configuration xml file is this:
<dataConfig>
  <dataSource name="ds-db" driver="org.mariadb.jdbc.Driver" url="jdbc:mysql://localhost:3306/eepyakm?user=root" user="root" password="wpadmin"/>
  <dataSource name="ds-file" type="BinFileDataSource"/>
  <document>
    <entity name="supplier" query="select * from suppliers_tmp_view" dataSource="ds-db"
            deltaQuery="select id from suppliers_tmp_view where last_modified > '${dataimporter.last_index_time}'"
            deltaImportQuery="select * from suppliers_tmp_view where id='${dataimporter.delta.id}'">
      <entity name="attachment" dataSource="ds-db"
              query="select * from suppliers_tmp_files_view where supplier_tmp_id='${supplier.id}'"
              deltaQuery="select id,supplier_tmp_id from suppliers_tmp_files_view where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select id from suppliers_tmp_view where id='${attachment.supplier_tmp_id}'">
        <field name="path" column="path"/>
        <entity name="file" processor="TikaEntityProcessor" url="${attachment.path}" format="text" dataSource="ds-file">
          <field column="text"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
I found a similar problem in a very old post: Solr's TikaEntityProcessor not working

SolR's Data Import Handler tracks but ignores nested entity's changes

I have two tables, and I'm trying to get the Data Import Handler to update a document in the index when its sub-entity changes. When I fire the "delta-import" command, I get the following:
{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "initArgs":[
    "defaults",[
      "config","db-data-config.xml"]],
  "command":"delta-import",
  "status":"idle",
  "importResponse":"",
  "statusMessages":{
    "Total Requests made to DataSource":"5",
    "Total Rows Fetched":"3",
    "Total Documents Processed":"0",
    "Total Documents Skipped":"0",
    "Delta Dump started":"2021-08-16 11:05:47",
    "Identifying Delta":"2021-08-16 11:05:47",
    "Deltas Obtained":"2021-08-16 11:05:47",
    "Building documents":"2021-08-16 11:05:47",
    "Total Changed Documents":"0",
    "Time taken":"0:0:0.12"}}
My data config is this:
<dataConfig>
  <dataSource driver="org.mariadb.jdbc.Driver" url="jdbc:mysql://localhost:3306/eepyakm?user=root" user="root" password="root"/>
  <document>
    <entity name="supplier" query="select * from suppliers_tmp_view"
            deltaQuery="select id from suppliers_tmp_view where last_modified > '${dataimporter.last_index_time}'"
            deltaImportQuery="select * from suppliers_tmp_view where id='${dataimporter.delta.id}'">
      <entity name="attachment"
              query="select * from suppliers_tmp_files_view where supplier_tmp_id='${supplier.id}'"
              deltaQuery="select id from suppliers_tmp_files_view where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select id from suppliers_tmp_view where id='${attachment.supplier_tmp_id}'">
        <field name="path" column="path" />
      </entity>
    </entity>
  </document>
</dataConfig>
As I understand it, "Total Rows Fetched" shows that 3 entries in the sub-entity table have changed. So why doesn't it index the changed field?
If I do a "full-import", it picks up the changes fine.

The deltaQuery of your attachment entity selects only id, but your parentDeltaQuery references ${attachment.supplier_tmp_id}, so that placeholder is never populated.
Select this column as well in the deltaQuery's SELECT statement.
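
A sketch of the attachment entity with that change applied, assuming everything else in the posted config stays the same; the only difference is that the deltaQuery now returns supplier_tmp_id alongside id:

```xml
<!-- deltaQuery now also returns supplier_tmp_id, so that
     ${attachment.supplier_tmp_id} in parentDeltaQuery has a value -->
<entity name="attachment"
        query="select * from suppliers_tmp_files_view where supplier_tmp_id='${supplier.id}'"
        deltaQuery="select id, supplier_tmp_id from suppliers_tmp_files_view where last_modified > '${dataimporter.last_index_time}'"
        parentDeltaQuery="select id from suppliers_tmp_view where id='${attachment.supplier_tmp_id}'">
  <field name="path" column="path" />
</entity>
```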

Solr templated sql print nothing in dataimport

I have the following dataimport configuration:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://C:/feqh/main.mdb;memory=false" />
  <document>
    <entity name="Book"
            query="select bkid AS id, bkid AS BookID,bk AS BookTitle, betaka AS BookInfo, cat as cat from 0bok">
      <field column="id" name="id"/>
      <field column="BookID" name="BookID"/>
      <field column="BookTitle" name="BookTitle"/>
      <field column="cat" name="cat"/>
      <entity name="Category"
              query="select name as CatName, catord as CatWeight, Lvl as CatLevel from 0cat where id = ${Book.cat}">
        <field column="CatName" name="CatName"/>
        <field column="CatWeight" name="CatWeight"/>
        <field column="CatLevel" name="CatLevel"/>
      </entity>
    </entity>
  </document>
</dataConfig>
This data import fails with the following error in the log:
Unable to execute query: select name as CatName, catord as CatWeight,
Lvl as CatLevel from 0cat where id = Processing Document # 1
When I replace ${Book.cat} with a fixed number such as 128, the import works fine.
So it seems that ${Book.cat} does not resolve to any value. The database I import from is an MS Access database (mdb), accessed through ucanaccess version 2.0.9. I'm using Solr 4.9.0 on Java 8. How can I solve this issue?

For a reason unknown to me, the column name has to be written in upper case in the template, so ${Book.cat} should be ${Book.CAT}. I say unknown because I'm sure that the column name in the database is written in lower case: cat.
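
Applied to the config above, the child entity would look roughly like this. It is a sketch of the workaround described in the answer, with only the placeholder changed. One plausible explanation, which the source does not confirm, is that the JDBC driver reports the column label in upper case, so the lower-case key is never set:

```xml
<!-- only the placeholder changes: ${Book.cat} becomes ${Book.CAT} -->
<entity name="Category"
        query="select name as CatName, catord as CatWeight, Lvl as CatLevel from 0cat where id = ${Book.CAT}">
  <field column="CatName" name="CatName"/>
  <field column="CatWeight" name="CatWeight"/>
  <field column="CatLevel" name="CatLevel"/>
</entity>
```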

Timestamp compatibility while performing delta import in solr

I'm new to Solr. I have successfully indexed an Oracle 10g XE database, and I'm now trying to perform a delta import on it.
The delta query requires comparing the table's last_modified column with ${dih.last_index_time}.
However, my application has no such column, and I cannot add one. Therefore I used 'scn_to_timestamp(ora_rowscn)' to obtain the required timestamps. This query returns a value of type timestamp in the format 24-JUL-13 12.42.32.000000000 PM, whereas dih.last_index_time is in the format 2013-07-24 12:18:03. So I converted dih.last_index_time with to_timestamp('${dih.last_index_time}', 'YYYY/MM/DD HH:MI:SS').
My data-config looks like this:
<dataConfig>
  <dataSource type="JdbcDataSource" driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@XXX.XXX.XX.XX:XXXX:xe" user="XXXXXXXX" password="XXXXXXX" />
  <document name="product_info">
    <entity name="PRODUCT" pk="PID"
            query="SELECT * FROM PRODUCT"
            deltaImportQuery="SELECT * FROM PRODUCT WHERE PID=${dih.delta.id}"
            deltaQuery="SELECT PID FROM PRODUCT WHERE scn_to_timestamp(ora_rowscn) > to_timestamp('${dih.last_index_time}', 'YYYY/MM/DD HH:MI:SS')">
      <field column="PID" name="id" />
      <field column="PNAME" name="itemName" />
      <field column="INITQTY" name="itemQuantity" />
      <field column="REMQTY" name="remQuantity" />
      <field column="PRICE" name="itemPrice" />
      <field column="SPECIFICATION" name="specifications" />
      <entity name="SUB_CATEGORY" query="SELECT * FROM SUB_CATEGORY WHERE SCID=${PRODUCT.SCID}">
        <field column="SUBCATNAME" name="brand" />
        <entity name="CATEGORY" query="SELECT CNAME FROM CATEGORY WHERE CID=${SUB_CATEGORY.CID}">
          <field column="CNAME" name="itemCategory" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
However, this is not working, and I'm getting the following error:
Unable to execute query: SELECT * FROM PRODUCT WHERE PID= Processing Document # 1
Caused by: java.sql.SQLException: ORA-00936: missing expression
Please help me out!

I had a similar issue and had more success with to_date. But looking at this again, it looks like you may just need to quote your delta id in the deltaImportQuery:
deltaImportQuery="SELECT * FROM PRODUCT WHERE PID='${dih.delta.id}'"
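
The error message shows the placeholder resolving to an empty string ("WHERE PID= "), so quoting alone may not be enough. The sketch below rests on two assumptions that the original answer does not state: that the delta variable is named after the column returned by deltaQuery (PID here, not id), and that dih.last_index_time uses a 24-hour clock, which in Oracle calls for HH24 rather than HH. The field mappings and child entities are omitted and would stay as posted:

```xml
<!-- assumed fix, not confirmed by the original answer:
     ${dih.delta.PID} matches the PID column selected by deltaQuery,
     and HH24 parses a 24-hour last_index_time -->
<entity name="PRODUCT" pk="PID"
        query="SELECT * FROM PRODUCT"
        deltaImportQuery="SELECT * FROM PRODUCT WHERE PID='${dih.delta.PID}'"
        deltaQuery="SELECT PID FROM PRODUCT WHERE scn_to_timestamp(ora_rowscn) > to_timestamp('${dih.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')">
  <!-- field mappings and nested entities unchanged from the posted config -->
</entity>
```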

solr: import from different datasources using DIH

I am trying to fill a Solr index from 2 different data sources (XML and DB) using the DataImportHandler.
1st try: Created 2 data-config.xml files, one for the XML import and one for the DB import.
The db-config would read id and, let's say, field A. The xml-config would also read id, plus field B.
That works for both (I could import from both data sources), but the index got overwritten each time (with clean=false, of course), so I ended up with either id and A or id and B.
So on to the 2nd try: merged the 2 files into one:
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource
      name="cr-db"
      jndiName="xyz"
      type="JdbcDataSource" />
  <dataSource
      name="cr-xml"
      type="FileDataSource"
      encoding="utf-8" />
  <document name="doc">
    <entity
        dataSource="cr-xml"
        name="f"
        processor="FileListEntityProcessor"
        baseDir="/path/to/xml"
        filename="*.xml"
        recursive="true"
        rootEntity="false"
        onError="skip">
      <entity
          name="xml-data"
          dataSource="cr-xml"
          processor="XPathEntityProcessor"
          forEach="/root"
          url="${f.fileAbsolutePath}"
          transformer="DateFormatTransformer"
          onError="skip">
        <field column="id" xpath="/root/id" />
        <field column="A" xpath="/root/a" />
      </entity>
      <entity
          name="db-data"
          dataSource="cr-db"
          query="
            SELECT
              id, b
            FROM
              a_table
            WHERE
              id = '${f.file}'">
        <field column="B" name="b" />
      </entity>
    </entity>
  </document>
</dataConfig>
The id = '${f.file}' part is a bit odd, I guess, but that is the id that is used. The select statement is correctly formed, but I get an exception when trying to run that file in dataimport.jsp. The first part (XML) works fine, but when it gets to the DB part it raises:
java.lang.RuntimeException: java.io.FileNotFoundException:
Could not find file: SELECT id, b FROM a_table WHERE id = '12345678.xml'
at org.apache.solr.handler.dataimport.FileDataSource.getFile[..]
Any advice? Thanks in advance
EDIT
I found the cause of the FileNotFoundException: within the entity tags, the data source attribute needs to be camelCased (dataSource, not datasource).
Now it runs through, but with the same outcome as in the first try: only field B gets into the index. If I take the db-entity out, the file contents are indexed (field A).

Try:
<entity name="db-data" dataSource="cr-db"
Attribute names are case-sensitive, so your wrongly cased attribute name is ignored and the entity falls back to the default data source (which somehow is the file one).
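
For the remaining symptom (only field B reaching the index), one thing that may be worth trying is nesting db-data inside xml-data instead of keeping the two as siblings. This rests on an assumption the thread does not confirm: with rootEntity="false" on f, each direct child is treated as its own root entity and produces its own document, so two documents with the same id overwrite each other. A sketch under that assumption, reusing the names from the posted config:

```xml
<!-- placed inside the unchanged outer entity "f";
     db-data is nested so that its columns are added to the same
     document as the XPath fields instead of a second document -->
<entity name="xml-data" dataSource="cr-xml" processor="XPathEntityProcessor"
        forEach="/root" url="${f.fileAbsolutePath}"
        transformer="DateFormatTransformer" onError="skip">
  <field column="id" xpath="/root/id" />
  <field column="A" xpath="/root/a" />
  <entity name="db-data" dataSource="cr-db"
          query="SELECT id, b FROM a_table WHERE id = '${f.file}'">
    <field column="B" name="b" />
  </entity>
</entity>
```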
