Unable to import large data into Solr using DIH

I am trying to import a large data set from MySQL using DIH.
Following is the dataSource with batchSize="-1" for MySQL:
<dataSource batchSize="-1" driver="com.mysql.jdbc.Driver" ..... />
It fetches all 10 million records.
But at the end it says the full import failed.
I get the following exception in the log:
2017-03-14 07:27:04.429 ERROR (Thread-14) [ x:companyData] o.a.s.h.d.DataImporter Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.sql.SQLException: Operation not allowed after ResultSet closed
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:270)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:475)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:458)
at org.apache.solr.handler.dataimport.DataImporter$$Lambda$85/252359661.run(Unknown Source)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.sql.SQLException: Operation not allowed after ResultSet closed
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
... 5 more
Any help regarding this would be appreciated.

The error you're facing does not concern Solr but the way you're accessing your database.
If you look at your exception: java.sql.SQLException: Operation not allowed after ResultSet closed.
I suggest changing the batchSize parameter to a different value, for example 1000.
The batchSize option is used to retrieve the rows of a database table in batches in order to reduce memory usage (it is often used to prevent running out of memory when running the data import handler). While a lower batch size might be slower, the option is not intended to affect the speed of the import process.
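For context, here is a rough plain-JDBC sketch of what DIH's JdbcDataSource does with that parameter; the connection URL, credentials, table and column names below are placeholders, not taken from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement stmt = con.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // batchSize="-1" in data-config.xml is passed on as Integer.MIN_VALUE,
            // which tells MySQL Connector/J to stream rows one at a time instead of
            // buffering the whole result set in memory.
            stmt.setFetchSize(Integer.MIN_VALUE);
            // A positive batchSize (e.g. 1000) would simply become:
            // stmt.setFetchSize(1000);
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM mytable")) {
                while (rs.next()) {
                    // hand each row to the indexer here
                    String name = rs.getString("name");
                }
            }
        }
    }
}

With a streaming result set, the whole table is read over a single connection that must stay open for the entire import, so MySQL-side timeouts (for example net_write_timeout or wait_timeout) are worth checking when a large import dies halfway through.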

Related

One error when importing data from a Hive cluster to a Solr 6.2 cluster

There are two clusters, one for Hive and one for Solr.
I want to import data from a Hive table in the Hive cluster into the Solr cluster.
When the number of fields configured in the managed-schema file is about 34, the import succeeds; once the number is larger than 34 (it is 44), I get a connection reset exception.
The exception info is as follows:
{"id":{"name":"id","value":"176036","boost":1.0},
"_id":{"name":"_id","value":"176036","boost":1.0}
... 44 fields in JSON format ...
org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: IOException occured when talking to server at https://ip:port/solr/gp19edamp_qxb_administrative_punishment_all_cs1_shard2_replica1
Caused by: java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:886)
at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:857)
at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181)
at org.apache.http.impl.io.ContentLengthOutputStream.write(ContentLengthOutputStream.java:115)
at org.apache.http.impl.io.ContentLengthOutputStream.write(ContentLengthOutputStream.java:122)
at org.apache.http.entity.BufferedHttpEntity.writeTo(BufferedHttpEntity.java:115)
at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:96)
at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:112)
at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:216)
at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:686)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:488)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at com.huawei.solr.client.solrj.impl.InsecureHttpClient.execute(InsecureHttpClient.java:143)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:644)
... 10 more
There is no built-in limit on the number of fields; the limit is going to be dictated by your hardware resources.
Hope this answers your question.

Is there any way to fix a Solr index?

I am running a program that crawls the web and saves data into a Solr index. For mysterious reasons, the Solr server crashed, and now I have ended up with a corrupted index that has no segments file, so I risk losing all the data collected over 5 days.
The error message below appears when I try to search on this index. The index folder definitely has data: it contains 182 files and is 2 GB in size.
I have tried to use CheckIndex but get the same error about the missing segments file.
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [chase]
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.solr.core.CoreContainer.lambda$load$6(CoreContainer.java:586)
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Unable to create core [chase]
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:935)
at org.apache.solr.core.CoreContainer.lambda$load$5(CoreContainer.java:558)
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
... 5 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:977)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:830)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:920)
... 7 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2069)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2189)
at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1071)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:949)
... 9 more
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(NRTCachingDirectory(MMapDirectory#/home/zqz/Work/chase/aws/data/solr/chase/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory#51b2fc7e; maxCacheMB=48.0 maxMergeSizeMB=4.0)): files: [_fh2.fdt, _fh2.fdx, _fh2.fnm, _fh2.nvd, _fh2.nvm, _fh2.si, _fh2_Lucene50_0.doc, _fh2_Lucene50_0.pos, _fh2_Lucene50_0.tim, _fh2_Lucene50_0.tip, write.lock]
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:925)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:118)
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:93)
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:248)
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:122)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2030)
... 12 more
2017-06-20 14:38:52.428 INFO (qtp475266352-16) [ ] o.a.s.c.TransientSolrCoreCacheDefault Allocating transient cache for 2147483647 transient cores
2017-06-20 14:38:52.894 INFO (qtp475266352-13) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={indexInfo=false&wt=json&_=1497969532681} status=0 QTime=11
2017-06-20 14:38:52.962 INFO (qtp475266352-20) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={wt=json&_=1497969532684} status=0 QTime=76
The error you mentioned is caused by a missing segments_N file (e.g. segments_3) among the index files:
files: [_fh2.fdt, _fh2.fdx, _fh2.fnm, _fh2.nvd, _fh2.nvm, _fh2.si, _fh2_Lucene50_0.doc, _fh2_Lucene50_0.pos, _fh2_Lucene50_0.tim, _fh2_Lucene50_0.tip, write.lock]
That file specifies the last commit point and the last generation of segments to take into account and apparently it is missing.
Check if that file is there and is readable.
If it is not (for example because the index writer was not closed properly due to the malfunction), do not despair.
Chances are that the transaction log still contains the documents you indexed, so you could just replay it and get the documents back (clean the index directory and start Solr; it should take care of the replay).
Solr also provides backup functionality, so for the future you may want to configure it; see the sketch below.
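As a hedged illustration of triggering a backup through the replication handler, assuming the /replication handler is available (it usually is in recent Solr versions); the core name "chase" comes from the question's paths, while the host and port are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class TriggerBackup {
    public static void main(String[] args) throws Exception {
        // Ask the replication handler to snapshot the current index.
        URL url = new URL("http://localhost:8983/solr/chase/replication?command=backup");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the handler replies with a small XML status document
            }
        }
    }
}

The snapshot is typically written next to the core's data directory, which gives you something to fall back on after a crash like this.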

Solr indexing not working when I try to index 1000000 rows, but works fine when I index 400000 rows or fewer

I am using Solr 4.7.1 and trying to do a full import. My data source is a table in MySQL. It has 10000000 rows and 20 columns.
Whenever I try to do a full import, Solr stops responding. But when I import 400000 rows or fewer, it works fine.
If I try to import more than this, Solr won't index the results; it either stops responding or shows "indexing failed". The error log says "Unable to execute query". But I don't understand how the query runs fine for a smaller number of records and fails for a larger number.
My system configuration is as follows:
CPU: i7
RAM: 6 GB
OS: 64-bit Windows 7
I am not able to figure out what the problem is. I have tried increasing max_allowed_packet to 1000M and even the Java heap size.
Please help. Thanks in advance.
This is the error:
Exception while processing: playername document : SolrInputDocument(fields: []):org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT player_id,firstname,lastname,value1,value2,value3,value4,value5,value6, value7,value8,value9,value10, value11,value18,value19,value20, country_id, playername_modtime,player_flag from playername WHERE 'true' != 'false' OR playername.playername_modtime > '2014-05-23 10:38:56' Processing Document # 1
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:281)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:238)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:42)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:477)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:331)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:239)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464)
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 130,037 milliseconds ago. The last packet sent successfully to the server was 130,038 milliseconds ago.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:409)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1127)
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2288)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:2044)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3549)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:489)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3240)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2411)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2834)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2832)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2781)
at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:908)
at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:788)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:274)
... 12 more
Caused by: java.io.EOFException: Can not read response from server. Expected to read 6 bytes, read 4 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3161)
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2269)
... 23 more
5/23/2014 8:32:18 PM ERROR DataImporter Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT player_id,firstname,lastname,value1,value2,value3,value4,value5,value6, value7,value8,value9,value10, value11,value18,value19,value20, country_id, playername_modtime,player_flag from playername WHERE 'true' != 'false' OR playername.playername_modtime > '2014-05-23 10:38:56' Processing Document # 1
Last Check: 5/23/2014 8:36:34 PM
Added batchSize="-1" to data-config.xml and it worked
http://wiki.apache.org/solr/DataImportHandlerFaq

Solr error when doing full-import 250000 rows org.apache.solr.common.SolrException;null:org.eclipse.jetty.io.EofException

I am using Solr 4.6.0 with Jetty on Windows 7 Enterprise with a max heap of 2G. I can do a full-import of 200,000 records properly from the Solr Admin UI, but as soon as I increase it to 250,000 records, it starts giving me the error below:
webapp=/solr path=/dataimport params={optimize=false&clean=false&indent=true&commit=true&verbose=true&entity=files&command=full-import&debug=true&wt=json&rows=250000} {add=[8065121, 8065126, 8065128, 8065146, 8065963, 7838189, 7838186, 8065155, 8065174, 8065179, ... (250001 adds)],commit=} 0 2693420
org.apache.solr.common.SolrException; null:org.eclipse.jetty.io.EofException
at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)
at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:507)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:170)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)
at sun.nio.cs.StreamEncoder.writeBytes(Unknown Source)
at su
Caused by: java.net.SocketException: Software caused connection abort: socket write error at java.net.SocketOutputStream.socketWrite0(Native Method)
at j......
org.apache.solr.common.SolrException;null:org.eclipse.jetty.io.EofException at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)
org.eclipse.jetty.servlet.ServletHandler; /solr/dihdb/dataimport
java.lang.IllegalStateException: Committed
at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1144)
I have changed example/etc/jetty.xml to set maxIdleTime=3500000.
I changed example/etc/webdefault.xml to set session-timeout=720.
I still keep getting the error above.
TIA,
Vijay
I changed -Xmx to 5120M (e.g. starting Jetty with java -Xmx5120m -jar start.jar) and that seems to have fixed the issue with 500K and 1 million records. Lack of memory, in essence, was the issue behind this misleading error.
I also tried the values 100000 and 1800000 for the DataImportHandler.

Hector error inserting integer

I am running a single node Cassandra instance (for dev purposes) and am looking to insert an integer row into it. My Keyspace and columnfamily are already created on Cassandra.
I am using Cassandra 1.0 with Hector 1.0.5 (Jar version). My code is as follows:
Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "10.40.14.93:9160");
Keyspace keyspaceOperator = HFactory.createKeyspace("mykeyspace", cluster);
Mutator intM = HFactory.createMutator(keyspaceOperator, IntegerSerializer.get());
for each elem in my list {
intM.insert(doc.document_id ,
"mycolfamily",
me.prettyprint.hector.api.factory.HFactory.createColumn("numAdults", doc.numAdults))
}
I get TimedOutException on my client, and in the Cassandra logs, I see a bunch of the following:
ERROR [MutationStage:357] 2012-07-20 08:15:02,106 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[MutationStage:357,5,main]
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: ""
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1228)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Long.parseLong(Long.java:410)
at java.lang.Long.parseLong(Long.java:468)
at org.apache.solr.schema.TrieField.createField(TrieField.java:508)
at org.apache.solr.schema.FieldType.createFields(FieldType.java:292)
at org.apache.solr.schema.SchemaField.createFields(SchemaField.java:106)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.addFieldToDocument(SolrSecondaryIndex.java:382)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.populateDocument(SolrSecondaryIndex.java:280)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.applyIndexUpdates(SolrSecondaryIndex.java:164)
at org.apache.cassandra.db.index.SecondaryIndexManager.applyIndexUpdates(SecondaryIndexManager.java:419)
at org.apache.cassandra.db.Table.apply(Table.java:448)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:256)
at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:415)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1224)
... 3 more
I am trialling DataStax Enterprise (DSE), which packages Cassandra, Hadoop, Solr, etc. I have created my Cassandra CF via Solr configuration (you can post the Solr config and schema XMLs to a DataStax instance to create the keyspace and CF; it is a feature of DSE).
Could someone please help?
Try adding an explicit serializer to your createColumn call...like so:
me.prettyprint.hector.api.factory.HFactory.createColumn("numAdults", doc.numAdults, StringSerializer.get(), IntegerSerializer.get()))
Also, on another note, I see you're doing inserts in a loop. Doing intM.addInsertion inside the loop and then a single intM.execute() once it's done is more efficient; a sketch of both changes follows.
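A minimal sketch combining the two suggestions, assuming doc.document_id is an Integer and doc.numAdults is an int; the cluster, keyspace and column family names are taken from the question, while MyDoc/docs are hypothetical stand-ins for your own list:

import me.prettyprint.cassandra.serializers.IntegerSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "10.40.14.93:9160");
Keyspace keyspace = HFactory.createKeyspace("mykeyspace", cluster);
Mutator<Integer> mutator = HFactory.createMutator(keyspace, IntegerSerializer.get());

for (MyDoc doc : docs) { // MyDoc/docs are placeholders for your own list
    // explicit name/value serializers so Hector does not have to guess the types
    mutator.addInsertion(doc.document_id, "mycolfamily",
            HFactory.createColumn("numAdults", doc.numAdults,
                    StringSerializer.get(), IntegerSerializer.get()));
}
mutator.execute(); // one batched round trip instead of one per insert

Batching the mutations this way also reduces the chance of hitting TimedOutException under load, since the client makes far fewer Thrift calls.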

Resources