Solr - problems with compound Id

For example, I have two tables, book and site, where one book can have n sites.
<entity name="book" dataSource="myDs" pk="id"
transformer="TemplateTransformer"
query="SELECT b.id, b.title, s.id, s.number, s.content
FROM book b, site s WHERE b.id = s.book">
<field column="b.id" name="id" />
<field column="s.id" name="sId" />
<field column="id" template="${id}_${sId}" ignoreMissingVariables="true" />
</entity>
Why doesn't this work? I get only 1 book with 1 site as the result instead of x books with x sites,
and the field 'id' never contains the compound key.

I recently ran into a similar issue, and the following points fixed it for me:
In the SQL query, use aliases so that every column name is distinct. Check the alias syntax, as it may vary for your database.
When you use TemplateTransformer or a sub-select (on inner entities), always refer to columns as entity.COLUMN_NAME_IN_CAPITALS. The capitalization really stumped me until I stumbled upon a Solr forum post where someone suggested it as a solution (unfortunately I no longer have the URL of that post).
I was doing this against an Oracle DB. I am not 100% sure it applies to other databases, but I wanted to share the solution.
Here is an attempt to redefine your DIH config with the above changes:
<entity name="book" dataSource="myDs" pk="id"
transformer="TemplateTransformer"
query="SELECT b.id, b.title, s.id as sid, s.number, s.content
FROM book b, site s WHERE b.id = s.book">
<field column="sid" name="sid" />
<field column="id" template="${book.ID}_${book.SID}"/>
</entity>
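If the template variables still refuse to resolve, one alternative sketch is to build the compound key in SQL and skip TemplateTransformer altogether. This assumes a database whose CONCAT accepts more than two arguments (MySQL, for instance; on Oracle the || operator would be used instead). It is not the configuration from the answer above, just another way to get one document per book/site pair:

```xml
<!-- Sketch only: the compound key is produced by the database, so no
     template variables have to be resolved on the Solr side.
     Assumes a multi-argument CONCAT (e.g. MySQL); on Oracle, write
     b.id || '_' || s.id instead. -->
<entity name="book" dataSource="myDs" pk="id"
        query="SELECT CONCAT(b.id, '_', s.id) AS id,
                      b.title, s.number, s.content
               FROM book b, site s
               WHERE b.id = s.book">
  <!-- One row per (book, site) pair, so each row becomes its own
       document with an id such as 1_1, 1_2, ... -->
  <field column="id" name="id" />
</entity>
```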

Add multiValued="true" to the id and sId fields in the schema.xml file.

Yes, I deleted the index and rebuilt it from scratch.
What I get:
<doc>
<str name="id">1</str>
<str name="title">Im a title</str>
<str name="number">1337</str>
<str name="content">content 23</str>
</doc>
What I want:
<doc>
<str name="id">1_1</str>
<str name="title">Im a title</str>
<str name="number">1337</str>
<str name="content">content 23</str>
</doc>
<doc>
<str name="id">1_2</str>
<str name="title">Im a title for 2</str>
<str name="number">1654654</str>
<str name="content">ekddsd</str>
</doc>
I already tried changing the following:
<field column="id" template="${id}_${sId}" ignoreMissingVariables="true" />
into
<field column="id" template="${id}_${someOtherField}" ignoreMissingVariables="true" />
and the result didn't change. It looks like the TemplateTransformer doesn't work?
EDIT
Found something in my logs:
Unable to resolve variable: id while parsing expression: ${id}_${sId}
Unable to resolve variable: sId while parsing expression: ${id}_${sId}

Related

Assign Unique id's across all documents and its children

So the situation is as follows:
Solr runs a dataimport directly against the database.
I have a table project with a relationship to unit. A project can hold up to 5 units.
IDs are generated automatically by the database, starting at 1.
IDs are unique within each table but not across the database.
Since Solr requires each document to have a unique ID, I created a field solrId which gets its IDs from solr.UUIDUpdateProcessorFactory.
However, the dataimport only fetches a few projects and no units whatsoever. Can someone point me in the right direction?
The relevant passages:
solrconfig.xml:
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">solrId</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
....
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">wiensued-data-config.xml</str>
<str name="update.chain">uuid</str>
</lst>
</requestHandler>
managed-schema:
<uniqueKey>solrId</uniqueKey>
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
<!-- solrId is the real ID -->
<field name="solrId" type="uuid" multiValued="false" indexed="true" stored="true" />
<!-- the ID from the database -->
<field name="id" type="int" multiValued="false" indexed="true" stored="true"/>
The DataImportHandler is configured to index id (from the table) into either projectId or unitId.
The stack trace is:
org.apache.solr.common.SolrException: [doc=null] missing required field: solrId
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:265)
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:107)
at org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:212)
at org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:185)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:259)
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:433)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1384)
at org.apache.solr.update.DirectUpdateHandler2.updateDocument(DirectUpdateHandler2.java:920)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:913)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:302)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:194)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:979)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1192)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:748)
at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:91)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:80)
at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:254)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:526)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:415)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:474)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:457)
at java.lang.Thread.run(Thread.java:748)
However, the solrId is provided as far as I can tell.
Just fix this in your DIH config; it will be cleaner and easier.
Prepend a 'p' to the project id to create the unique id, and supply that to Solr. Do likewise with the units (prepend 'u'). You get the idea:
<entity name="project" pk="id" query="select concat('p', id) as solrid, ...
Of course, the SQL depends on your DB.
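A fuller sketch of that idea, under some assumptions: the tables are named project and unit as in the question, CONCAT is available in your database, and solrId has been switched to a string type in the schema, since values like p1 are not valid UUIDs. The projectId and unitId field names come from the question; the rest is illustrative.

```xml
<document>
  <!-- Projects: solrId becomes p1, p2, ... -->
  <entity name="project"
          query="SELECT CONCAT('p', id) AS solrId, id AS projectId FROM project">
    <field column="solrId" name="solrId" />
    <field column="projectId" name="projectId" />
  </entity>
  <!-- Units: solrId becomes u1, u2, ... so it never collides with a project -->
  <entity name="unit"
          query="SELECT CONCAT('u', id) AS solrId, id AS unitId FROM unit">
    <field column="solrId" name="solrId" />
    <field column="unitId" name="unitId" />
  </entity>
</document>
```

Because every row already carries its own solrId, the import no longer depends on the UUID update chain to fill that field.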

Solr Block Join Children Query Parser with query that matches non Parent Docs

I have been using Solr 6.2.1 with nested documents and was trying to retrieve all child documents of a specific type of parent with the Block Join Children Query Parser, however I am getting the following error:
Parent query yields document which is not matched by parents filter
My documents are similar to:
<add>
<doc>
<field name="id">1</field>
<field name="type">MYDOCTYPE</field>
<field name="isParent">true</field>
<doc>
<field name="id">1_1</field>
<field name="comments">some comments</field>
</doc>
<doc>
<field name="id">1_2</field>
<field name="comments">some more comments</field>
</doc>
</doc>
<doc>
<field name="id">2</field>
<field name="type">MYDOCTYPE</field>
<field name="isParent">true</field>
<doc>
<field name="id">2_1</field>
<field name="comments">some comments</field>
</doc>
<doc>
<field name="id">2_2</field>
<field name="comments">some more comments</field>
</doc>
</doc>
<doc>
<field name="id">3</field>
<field name="type">MYDOCTYPE</field>
</doc>
</add>
And I'm trying to query them with: q={!child of="isParent:true"}type:MYDOCTYPE
I guess the problem is that document 3 has the type MYDOCTYPE but is not a parent document, which makes sense since it has no child documents.
Is there any way to retrieve all the child documents without adding the field isParent to document 3?
I found a workaround, which is to make the query:
{!child of="isParent:true"}type:"EDH/MAG"+isParent:true
This way the second part of the query only matches docs 1 and 2 and does not throw the exception.
It's an old question, but maybe my answer will help somebody.
It's true that the doc with id=3 is treated as a child, which is where the error comes from.
Perhaps we can assume that a parent document is one with isParent:true, or one with a non-empty type field and an id; the query may then look like this:
q={!child of="isParent:true OR (id:* AND type:*)"}type:MYDOCTYPE

Solr Dataimport is not indexing excel file stored as BLOB in DB

Hi, I have configured Solr with TikaEntityProcessor to index a BLOB column from a DB. My problem is that when the contents of an Excel file are stored as a BLOB in the DB, nothing gets indexed into Solr, and no exception is thrown either. Everything works fine when the BLOB field contains simple binary data.
Below is my data-config.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
url="jdbc:oracle:thin:@//x.x.x.x:1521/d11gr21"
user="dummy" password="dummy"/>
<dataSource name="dastream" type="FieldStreamDataSource" />
<document>
<entity name="messages"
query="select MSG_PK,MESSAGE from table1" dataSource="db">
<field column ="MSG_PK" name ="id" />
<entity name="message" dataSource="dastream"
processor="TikaEntityProcessor"
url="message"
dataField="messages.MESSAGE"
format="text">
<field column="text" name="mxMsg" />
</entity>
</entity>
</document>
</dataConfig>
Update
After indexing, when I query, the output looks something like this:
<result name="response" numFound="1" start="0">
<doc>
<arr name="message">
<str>oracle.sql.BLOB@121c39fa</str>
</arr>
<int name="id">992</int>
<arr name="mxMsg">
<str/>
</arr>
<arr name="content">
<str/>
</arr>
<long name="_version_">1454004358166347776</long>
</doc>
</result>
The text in the BLOB field is simply not getting indexed.

How to expose the Solr DataImportHandler dataSource name in the result doc

I am importing data into Solr 4.3.0 from two different dataSources. This all works fine except that the search results do not indicate the original dataSource for each result document.
Is there a "proper" way to get the dataSource (or entity name) into the result document?
My data-config.xml looks like this (based on example given in http://wiki.apache.org/solr/DataImportHandler#Multiple_DataSources ):
<dataConfig>
<dataSource name="ds1" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//oracle-1:1521/DB1" user="SCHEMA1" password="Passw0rd1"/>
<dataSource name="ds2" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//oracle-1:1521/DB2" user="SCHEMA2" password="Passw0rd2"/>
<document>
<entity name="apples" dataSource="ds1" pk="id" query="select id,name,color from apples" />
<entity name="bananas" dataSource="ds2" pk="id" query="select id,name,desc from bananas" />
</document>
</dataConfig>
Sample XML result set from a search looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3</int>
<lst name="params">
<str name="indent">true</str>
<str name="q">yellow</str>
<str name="_">1370321809357</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0">
<doc>
<str name="id">12</str>
<str name="name">Golden Delicious</str>
<str name="color">yellow</str></doc>
<doc>
<str name="id">5</str>
<str name="name">Cavendish group</str>
<str name="desc">Cavendish group is the common name for the triploid AAA group of Musa acuminata, by far the most popular cultivar by export volume. Cavendish bananas have a yellow skin and pale yellow inside when ripe.</str></doc>
</result>
</response>
Note the reason I want to know the dataSource for a given result is that the result entities have different schemas and thus need to be parsed/handled/rendered differently by the client application. Happy to see other answers that address this root problem in a different way.
Instead of storing the dataSource, why not add an entity identifier column to each document?
This identifier field would be a fixed-value column, probably embedded within the query itself,
e.g. using an alias in SQL: SELECT 'APPLE' AS ENTITY_TYPE
You can use this field to determine what type of parsing is needed for the respective entity.
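Applied to the data-config.xml from the question, that might look like the sketch below. The entity_type field name and the APPLE / BANANA values are made up for illustration, and a matching entity_type field would have to exist in the schema. Oracle typically returns unquoted aliases in upper case, hence the explicit column-to-field mapping.

```xml
<document>
  <entity name="apples" dataSource="ds1" pk="id"
          query="select id, name, color, 'APPLE' as entity_type from apples">
    <!-- map the constant column onto a (hypothetical) schema field -->
    <field column="ENTITY_TYPE" name="entity_type" />
  </entity>
  <entity name="bananas" dataSource="ds2" pk="id"
          query="select id, name, desc, 'BANANA' as entity_type from bananas">
    <field column="ENTITY_TYPE" name="entity_type" />
  </entity>
</document>
```

The client application can then branch on entity_type in each result doc to pick the right parser or renderer.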

DataImportHandler can't add/update

I am trying to convince Solr to perform a bulk import of a SQLite database.
I followed all the instructions from the Solr wiki.
I configured DataImportHandler to open the database through JDBC successfully, and I can start the import with http://localhost:8080/solr/dataimport?command=full-import
but whatever I do, DIH doesn't add any documents, even though it seems to index the DB.
The result:
<str name="command">full-import</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">14</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-04-06 01:14:30</str>
<str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
<str name="Committed">2012-04-06 01:14:32</str>
<str name="Optimized">2012-04-06 01:14:32</str>
<str name="Total Documents Processed">0</str>
I use the emp table in Oracle DB
data-config.xml
<dataConfig>
<dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//localhost:1521/ORCL" user="scott" password="tiger"/>
<document>
<entity name="emp" query="select EMPNO, ENAME from EMP">
<field column="EMPNO" name="empno" />
<field column="ENAME" name="ename" />
</entity>
</document>
</dataConfig>
schema.xml
<field name="empno" type="int" indexed="true" stored="true"/>
<field name="ename" type="string" indexed="true" stored="true"/>
It seems to index, but no indexed data is stored.
Any ideas why this problem happens?
EDIT 1
The log shows warning messages like:
WARNING: Error creating document : SolrInputDocument[{ename=ename(1.0)={SMITH}, empno=empno(1.0)={7369}}]
org.apache.solr.common.SolrException: [doc=null] missing required field: id
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:346)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:73)
at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:293)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:636)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Log entries like that followed, and this message shows up at the end of the log:
2012. 4. 6 PM 12:12:25 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*,add=[(null), (null), (null), (null), (null), (null), (null), (null), ... (14 adds)],optimize=} 0 0
I thought "missing required field: id" was related to this setting in
schema.xml:
<uniqueKey>id</uniqueKey>
but after deleting it, I got this message:
HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in solr.xml
-------------------------------------------------------------
org.apache.solr.common.SolrException: QueryElevationComponent requires the schema to have a uniqueKeyField implemented using StrField
at org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:158)
at org.apache.solr.core.SolrResourceLoader.inform
Any advice?
Try:
<entity name="emp" query="select EMPNO, ENAME from EMP">
<field column="EMPNO" name="id" />
<field column="ENAME" name="ename" />
in data-config.xml, put back:
<uniqueKey>id</uniqueKey>
in schema.xml, and keep the id field definition.
Or you can simply replace:
<uniqueKey>id</uniqueKey>
with:
<uniqueKey>empno</uniqueKey>
Hope that works.
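For the first variant above (mapping EMPNO to id), the schema.xml side might look like this sketch. It assumes the stock string field type (solr.StrField), which is what the QueryElevationComponent error in the question asks for; EMPNO values would then be indexed as strings rather than ints.

```xml
<!-- uniqueKey field, fed from EMPNO via <field column="EMPNO" name="id" /> -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="ename" type="string" indexed="true" stored="true"/>

<uniqueKey>id</uniqueKey>
```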
You can also add an auto-incrementing id with a script transformer:
<dataConfig>
<script><![CDATA[
id = 1;
function GenerateId(row) {
row.put('id', (id ++).toFixed());
return row;
}
]]></script>
<dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//localhost:1521/ORCL" user="scott" password="tiger"/>
<document>
<entity name="emp" query="select EMPNO, ENAME from EMP" transformer="script:GenerateId">
<field column="EMPNO" name="empno" />
<field column="ENAME" name="ename" />
</entity>
</document>
</dataConfig>

Resources