DataImportHandler can't add/update - solr

I am trying to get Solr to perform a bulk import of a sqlite database.
I followed all the instructions from the Solr wiki.
I configured DataImportHandler to open that database through JDBC successfully, and I can start the import with http://localhost:8080/solr/dataimport?command=full-import
but whatever I do, DIH doesn't add any documents, even though it seems to index the DB.
The result:
<str name="command">full-import</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">**14**</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-04-06 01:14:30</str>
<str name="">**Indexing completed**. **Added/Updated: 0 documents**. Deleted 0 documents.</str>
<str name="Committed">2012-04-06 01:14:32</str>
<str name="Optimized">2012-04-06 01:14:32</str>
<str name="Total Documents Processed">0</str>
I use the EMP table in an Oracle DB.
data-config.xml
<dataConfig>
<dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//localhost:1521/ORCL" user="scott" password="tiger"/>
<document>
<entity name="emp" query="select EMPNO, ENAME from EMP">
<field column="EMPNO" name="empno" />
<field column="ENAME" name="ename" />
</entity>
</document>
</dataConfig>
schema.xml
<field name="empno" type="int" indexed="true" stored="true"/>
<field name="ename" type="string" indexed="true" stored="true"/>
The import seems to run, but no indexed data is stored.
Any ideas why this problem happens?
EDIT 1
The log shows warning messages like this:
WARNING: Error creating document : SolrInputDocument[{ename=ename(1.0)={SMITH}, empno=empno(1.0)={7369}}]
org.apache.solr.common.SolrException: [doc=null] missing required field: id
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:346)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:73)
at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:293)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:636)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
More log entries of that kind followed, and this message shows up at the end of the log:
2012. 4. 6 PM 12:12:25 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*,add=[(null), (null), (null), (null), (null), (null), (null), (null), ... (14 adds)],optimize=} 0 0
I thought "missing required field: id" was related to this setting in schema.xml:
<uniqueKey>id</uniqueKey>
but after deleting it, I got this message:
HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: false in solr.xml ------------------------------------------------------------- org.apache.solr.common.SolrException: QueryElevationComponent requires the schema to have a uniqueKeyField implemented using StrField at org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:158) at org.apache.solr.core.SolrResourceLoader.inform
Any advice?

Try:
<entity name="emp" query="select EMPNO, ENAME from EMP">
<field column="EMPNO" name="id" />
<field column="ENAME" name="ename" />
in data-config.xml and put back:
<uniqueKey>id</uniqueKey>
in schema.xml, and also keep the id field defined there.
Or you can simply replace:
<uniqueKey>id</uniqueKey>
with:
<uniqueKey>empno</uniqueKey>
Hope that will work.
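For the first option, the schema.xml side could look like the sketch below. It assumes id is declared with the string type (solr.StrField), since the QueryElevationComponent error quoted in the question asks for a unique key implemented using StrField. The attribute values are illustrative and not taken from the asker's actual schema.

```xml
<!-- Sketch only: unique key declared as a string (StrField) field -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="ename" type="string" indexed="true" stored="true"/>

<!-- EMPNO is mapped to this field in data-config.xml -->
<uniqueKey>id</uniqueKey>
```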

You can also add an autoincrement id with
<dataConfig>
<script><![CDATA[
id = 1;
function GenerateId(row) {
row.put('id', (id ++).toFixed());
return row;
}
]]></script>
<dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//localhost:1521/ORCL" user="scott" password="tiger"/>
<document>
<entity name="emp" query="select EMPNO, ENAME from EMP" transformer="script:GenerateId">
<field column="EMPNO" name="empno" />
<field column="ENAME" name="ename" />
</entity>
</document>
</dataConfig>

Related

indexing huge table record in apache solr cloud

I have a Cassandra table with 9 million records, and my data size is 500 MB. I have a SolrCloud setup with 3 nodes (3 shards and 2 replicas) and an external ZooKeeper ensemble of three nodes. My Cassandra is a 1-node cluster. I am trying to index this table using Apache Solr, but my query times out as soon as I start a full import.
I am able to connect with cqlsh and fetch records, but indexing fails.
Here is the relevant part of my solr.log:
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT * from counter.series Processing Document # 1
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:318)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:279)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:54)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
... 5 more
Caused by: java.sql.SQLTransientConnectionException: TimedOutException()
at org.apache.cassandra.cql.jdbc.CassandraStatement.doExecute(CassandraStatement.java:189)
at org.apache.cassandra.cql.jdbc.CassandraStatement.execute(CassandraStatement.java:205)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.executeStatement(JdbcDataSource.java:338)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:313)
... 12 more
Caused by: TimedOutException()
at org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result.read(Cassandra.java:37865)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql3_query(Cassandra.java:1562)
at org.apache.cassandra.thrift.Cassandra$Client.execute_cql3_query(Cassandra.java:1547)
at org.apache.cassandra.cql.jdbc.CassandraConnection.execute(CassandraConnection.java:468)
at org.apache.cassandra.cql.jdbc.CassandraConnection.execute(CassandraConnection.java:494)
at org.apache.cassandra.cql.jdbc.CassandraStatement.doExecute(CassandraStatement.java:164)
... 15 more
I would like some help indexing the table, either batch-wise or by using multiple threads. Any help or suggestion is welcome.
db-data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource" driver="org.apache.cassandra.cql.jdbc.CassandraDriver" url="jdbc:cassandra://192.168.0.7:9160/counter" user="cassandra" password="cassandra" autoCommit="true" />
<document>
<entity name="counter" query="SELECT * from counter.series;" autoCommit="true">
<field column="serial" name="serial" />
<field column="random" name="random" />
<field column="remarks" name="remarks" />
<field column="timestamp" name="timestamp" />
</entity>
</document>
</dataConfig>
solrconfig.xml
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">db-data-config.xml</str>
</lst>
</requestHandler>
schema.xml
<field name="remarks" type="string" indexed="false" stored="false" required="false" />
<field name="serial" type="string" indexed="true" stored="true" required="true" />
<field name="random" type="string" indexed="false" stored="true" required="true" />
<field name="timestamp" type="string" indexed="false" stored="false" required="false" />
The problem is most likely the size of the data payload. When no batchSize is specified in JdbcDataSource, it defaults to 500, which looks like too much in your case. Use a smaller value, or increase the timeout settings on the Solr side.
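As a rough sketch, a smaller batch size can be set with the batchSize attribute on the dataSource element. The value 100 below is only an illustrative starting point, not a tested recommendation; the rest of the config is copied from the question.

```xml
<dataConfig>
  <!-- batchSize="100" is an assumed example value, lower than the default of 500 -->
  <dataSource type="JdbcDataSource"
              driver="org.apache.cassandra.cql.jdbc.CassandraDriver"
              url="jdbc:cassandra://192.168.0.7:9160/counter"
              user="cassandra" password="cassandra"
              batchSize="100" />
  <document>
    <entity name="counter" query="SELECT * from counter.series;">
      <field column="serial" name="serial" />
      <field column="random" name="random" />
      <field column="remarks" name="remarks" />
      <field column="timestamp" name="timestamp" />
    </entity>
  </document>
</dataConfig>
```

If the import still times out, try lowering the value further before changing anything else.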

Assign Unique id's across all documents and its children

The situation is as follows:
Solr runs a data import directly against the database.
I have a table project that is related to a table unit. A project can hold up to 5 units.
IDs are generated automatically by the database, starting at 1.
IDs are unique within each table but not across the database.
Since Solr requires each document to have a unique ID, I created a field solrId which gets its IDs from solr.UUIDUpdateProcessorFactory.
However, the data import only fetches a few projects and no units whatsoever. Can someone point me in the right direction?
The relevant passages:
solrconfig.xml:
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">solrId</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
....
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">wiensued-data-config.xml</str>
<str name="update.chain">uuid</str>
</lst>
</requestHandler>
managed-schema:
<uniqueKey>solrId</uniqueKey>
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
<!-- solrId is the real ID -->
<field name="solrId" type="uuid" multiValued="false" indexed="true" stored="true" />
<!-- the ID from the database -->
<field name="id" type="int" multiValued="false" indexed="true" stored="true"/>
The DataImportHandler is configured to index id (from the table) into either projectId or unitId.
The stack trace is:
org.apache.solr.common.SolrException: [doc=null] missing required field: solrId
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:265)
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:107)
at org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:212)
at org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:185)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:259)
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:433)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1384)
at org.apache.solr.update.DirectUpdateHandler2.updateDocument(DirectUpdateHandler2.java:920)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:913)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:302)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:194)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:979)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1192)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:748)
at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:91)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:80)
at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:254)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:526)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:415)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:474)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:457)
at java.lang.Thread.run(Thread.java:748)
However, solrId is provided as far as I can tell.
Just fix this in your DIH config; it will be cleaner and easier.
Prepend a 'p' to the project id to create the unique ID, and supply that to Solr. Do likewise with the units (prepend 'u'). You get the idea:
<entity name="project" pk="id" query="select concat('p', id) as solrid, ...
Of course, the SQL depends on your DB.
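A fuller sketch of that idea follows. It makes several assumptions that are not in the question: solrId is redefined as a string field (values such as p1 or u1 are not valid UUIDs), projects and units are imported as two separate root entities rather than as parent and child, and the database supports concat() and returns the aliases with the case shown. The table names and the projectId and unitId fields come from the question.

```xml
<document>
  <!-- Assumption: solrId is a string field in the schema, not a UUIDField -->
  <entity name="project"
          query="select concat('p', id) as solrId, id as projectId from project">
    <field column="solrId" name="solrId" />
    <field column="projectId" name="projectId" />
  </entity>
  <!-- Units get a 'u' prefix so their keys cannot collide with project keys -->
  <entity name="unit"
          query="select concat('u', id) as solrId, id as unitId from unit">
    <field column="solrId" name="solrId" />
    <field column="unitId" name="unitId" />
  </entity>
</document>
```

With the key supplied by the query itself, the import no longer depends on the update processor chain to fill in solrId.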

Solr - problems with compound Id

For example, I have 2 tables: table1 = book, table2 = site --> 1 book can have n sites.
<entity name="book" dataSource="myDs" pk="id"
transformer="TemplateTransformer"
query="SELECT b.id, b.title, s.id, s.number, s.content
FROM book b, site s WHERE b.id = s.book">
<field column="b.id" name="id" />
<field column="s.id" name="sId" />
<field column="id" template="${id}_${sId}" ignoreMissingVariables="true" />
</entity>
Why doesn't this work? I get only 1 book with 1 site as a result, not x books with x sites.
I also don't get a compound key in the field 'id'.
I recently ran into a similar issue, and the following points fixed it for me:
In the SQL query, use aliases to get distinct column names. Please recheck the alias syntax, as it may vary for your database.
When you use the template transformer or do a sub-select (on inner entities), always use entity.COLUMN_NAME_IN_CAPITALS. The column-in-capitals part really stumped me until I stumbled upon a Solr forum post where someone suggested it as a solution (unfortunately I do not have the URL of the post that provided this helpful tidbit).
I was doing the above against an Oracle DB. I am not 100% sure it applies to other DBs, but I wanted to share the solution.
Here is an attempt to redefine your DIH config with the above changes:
<entity name="book" dataSource="myDs" pk="id"
transformer="TemplateTransformer"
query="SELECT b.id, b.title, s.id as sid, s.number, s.content
FROM book b, site s WHERE b.id = s.book">
<field column="sid" name="sid" />
<field column="id" template="${book.ID}_${book.SID}"/>
</entity>
Add multiValued="true" to the id and sId fields in the schema.xml file.
Yes, I deleted the index and rebuilt it.
What I get:
<doc>
<str name="id">1</str>
<str name="title">Im a title</str>
<str name="number">1337</str>
<str name="content">content 23</str>
</doc>
What I want:
<doc>
<str name="id">1_1</str>
<str name="title">Im a title</str>
<str name="number">1337</str>
<str name="content">content 23</str>
</doc>
<doc>
<str name="id">1_2</str>
<str name="title">Im a title for 2</str>
<str name="number">1654654</str>
<str name="content">ekddsd</str>
</doc>
I already tried changing the following
<field column="id" template="${id}_${sId}" ignoreMissingVariables="true" />
into
<field column="id" template="${id}_${someOtherField}" ignoreMissingVariables="true" />
and the result didn't change. It looks like the TemplateTransformer doesn't work?
EDIT
Found something in my logs:
Unable to resolve variable: id while parsing expression: ${id}_${sId}
Unable to resolve variable: sId while parsing expression: ${id}_${sId}

Solr is not showing result after indexing table records

The DataImportHandler status shows that indexing completed and 10 documents were added, but no results are shown when I search for a word that is part of an added document. If I search for *:*, it displays all records.
Example of a CLOB record:
<?xml version="1.0" encoding="UTF-8" ?>
<message xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="someurl" xmlns:csp="someurl.xsd" xsi:schemaLocation="somelocation jar: id="002" message-type="create">
<content>
<dsp:row>
<dsp:channel>100</dsp:channel>
<dsp:role>115</dsp:role>
</dsp:row>
<![CDATA[ <ol><li>java</li></ol><li>ASP</li>]]>
</body></content></message>
data-config.xml
<document name="doc">
<entity name="MYCONTENT" transformer="ClobTransformer"
query="SELECT CID,XML FROM MYCONTENT">
<field column="CID" name="CID"/>
<field column="XML" clob="true" name="XML"/>
</entity>
</document>
schema.xml
<field name="CID" type="string" indexed="true" stored="true" required="true"/>
<field name="XML" type="string" indexed="true" stored="true" required="true"/>
<dynamicField name="*" type="ignored" />
<uniqueKey>CID</uniqueKey>
<defaultSearchField>XML</defaultSearchField>
solrconfig.xml
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/data-config.xml</str>
<str name="rows">10</str>
</lst>
</requestHandler>
I do not know why no results are shown when I search for "Java" or "ASP". Any help is greatly appreciated.
thanks in advance
srini
You have two things to fix.
First, the "string" field type treats the entire field value as a single token. You need a text field type.
Second, Solr does not parse the XML in your CLOB; it indexes it as raw text, splitting tokens as specified by your choice of tokenizer for that field. For example, if you used a whitespace tokenizer, it would treat "<dsp:role>115</dsp:role>" as a single token, and a search for "115" would not match.
For testing, I would try using the HTMLStripCharFilterFactory in that field definition before the tokenizer. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
That should strip a fair amount of the XML. If you want to process it a specific way, you will probably want to learn about XPathEntityProcessor, which can extract parts of the XML for indexing. See: http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
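A minimal sketch of such a field definition in schema.xml is shown below. The type name text_xml is made up for this example, and the choice of StandardTokenizerFactory plus a lowercase filter is one reasonable combination rather than the only option.

```xml
<!-- Hypothetical type name; strips markup before tokenizing -->
<fieldType name="text_xml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer, so tags are removed first -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercasing lets a search for "java" match "Java" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="XML" type="text_xml" indexed="true" stored="true" required="true"/>
```

After changing the field type, the data has to be reindexed for the new analysis chain to take effect.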

Total Documents Processed = 0 though Total Rows Fetched is non zero using Solr with Oracle database

I am using DataImportHandler to import data into Solr from an Oracle DB. Though the import and indexing are reported as successful, I am not able to search because the documents do not get created. There are also no errors in the logs. Here are my config file snippets. Kindly help.
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
schema.xml
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
</types>
<fields>
<field name="eid" type="string" indexed="true" stored="true" required="true" />
<field name="nm" type="string" indexed="true" stored="true" required="true" />
</fields>
<uniqueKey>eid</uniqueKey>
<defaultSearchField>nm</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
data-config.xml
<dataConfig>
<dataSource url="jdbc:oracle:thin:@//abc" user="abc" password="abc" />
<document name="client">
<entity name="org" query="select org.code ,org.name from abc org where org_name like 'BB%'">
<field column="code" name="eid"/>
<field column="name" name="nm" />
</entity>
</document>
</dataConfig>
data import status:
<str name="Total Rows Fetched">64</str>
<str name="Total Documents Processed">0</str>
Some ways to debug on the Solr Admin console (i.e. http://[yourhost]:8983/solr/index.html#/ ):
On Dataimport (http://[yourhost]:8983/solr/index.html#/dataimport/), check "Raw Status-Output" for "Total Documents Failed":
"statusMessages": {
"Total Requests made to DataSource": "1",
"Total Rows Fetched": "12966",
"Total Documents Processed": "0",
"Total Documents Skipped": "0",
"Full Dump Started": "2016-08-08 11:15:18",
"": "Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.",
"Committed": "2016-08-08 11:15:20",
**"Total Documents Failed": "12966"**,
"Time taken": "0:0:2.452"
}
On the admin console, go to "Logging" page (http://[yourhost]:8983/solr/index.html#/~logging) to view error logs.
Have you tried debugging it with DIH development mode?
Go to logs on the dashboard, it will show the status of the process
In data-config.xml, the select statement in the entity's query attribute must include the uniqueKey column (e.g. "id").
Had the same issue: no errors in the logs, and I could not start debug mode (I hadn't done the prerequisites). For me:
In this project I am using a modified DIH example project. I had updated my solr-data-config.xml data import for a new query (with additional fields) and had not added the field to managed-schema. After adding the field, and restarting solr it indexed fine.
It would be nice if the error / issue was a little bit more clear in the solr interface.

Resources