Flink Falling back to Kyro when using Custom Avro Deserializer - apache-flink

I have a very simple Flink Streaming job that basically consumes 2 records from a topic (r1 with schema s1 and r2 with schema s2, both s1 and s2 are backward+forward compatible).
Schema definition: s2 has 1 field extra than s1.
I'm trying to make the FlinkKafkaConsumer run using ConfluentRegistryAvroSerializationSchema.forGeneric(), which takes in s1 as reader schema and also schema registry URL, it doesn't work. The deserializer just ignores the additional column as the reader schema doesn't have it. How is schema registry being useful here?
Additionally, I tried writing a custom wrapper for KafkaAvroDeserializer referring this: Is it possible to deserialize Avro message(consuming message from Kafka) without giving Reader schema in ConfluentRegistryAvroDeserializationSchema. This solution explores using only the writer schema without reader schema provided - which is exactly what I need.
This gives me another runtime exception with Flink as follows:
Caused by: com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
Serialization trace:
reserved (org.apache.avro.Schema$Field)
fieldMap (org.apache.avro.Schema$RecordSchema)
schema (org.apache.avro.generic.GenericData$Record)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:143)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:21)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:354)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:191)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:46)
at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:53)
at org.apache.flink.runtime.io.network.api.serialization.NonSpanningWrapper.readInto(NonSpanningWrapper.java:337)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.readNonSpanningRecord(SpillingAdaptiveSpanningRecordDeserializer.java:128)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.readNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:103)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:93)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:95)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:496)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:809)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:761)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.UnsupportedOperationException
at java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:22)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
... 30 more
How do I make this work and understand the core concept of schema registry better?

Related

Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka

I have a simple Flink stream processing application (Flink version 1.13). The Flink app reads from Kakfa, does stateful processing of the record, then writes the result back to Kafka.
After reading from Kafka topic, I choose to use reinterpretAsKeyedStream() and not keyBy() to avoid a shuffle, since the records are already partitioned in Kakfa. The key used to partition in Kakfa is a String field of the record (using the default kafka partitioner). The Kafka topic has 24 partitions.
The mapping class is defined as follows. It keeps track of the state of the record.
public class EnvelopeMapper extends
KeyedProcessFunction<String, Envelope, Envelope> {
...
}
The processing of the record is as follows:
DataStream<Envelope> messageStream =
env.addSource(kafkaSource)
DataStreamUtils.reinterpretAsKeyedStream(messageStream, Envelope::getId)
.process(new EnvelopeMapper(parameters))
.addSink(kafkaSink);
With parallelism of 1, the code runs fine. With parallelism greater than 1 (e.g. 4), I am running into the follow error:
2022-06-12 21:06:30,720 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Map -> Flat Map -> KeyedProcess -> Map -> Sink: Unnamed (4/4) (7ca12ec043a45e1436f45d4b20976bd7) switched from RUNNING to FAILED on 100.101.231.222:44685-bd10d5 # 100.101.231.222 (dataPort=37839).
java.lang.IllegalArgumentException: KeyGroupRange{startKeyGroup=96, endKeyGroup=127} does not contain key group 85
Based on the stack trace, it seems the exception happens when EnvelopeMapper class validates the record is sent to the right replica of the mapper object.
When reinterpretAsKeyedStream() is used, how are the records distributed among the different replicas of the EventMapper?
Thank you in advance,
Ahmed.
Update
After feedback from #David Anderson, replaced reinterpretAsKeyedStream() with keyBy(). The processing of the record is now as follows:
DataStream<Envelope> messageStream =
env.addSource(kafkaSource) // Line x
.map(statelessMapper1)
.flatMap(statelessMapper2);
messageStream.keyBy(Envelope::getId)
.process(new EnvelopeMapper(parameters))
.addSink(kafkaSink);
Is there any difference in performance if keyBy() is done right after reading from Kakfa (marked with "Line x") vs right before the stateful Mapper (EnvelopeMapper).
With
reinterpretAsKeyedStream(
DataStream<T> stream,
KeySelector<T, K> keySelector,
TypeInformation<K> typeInfo)
you are asserting that the records are already distributed exactly as they would be if you had instead used keyBy(keySelector). This will not normally be the case with records coming straight out of Kafka. Even if they are partitioned by key in Kafka, the Kafka partitions won't be correctly associated with Flink's key groups.
reinterpretAsKeyedStream is only straightforwardly useful in cases such as handling the output of a window or process function where you know that the output records are key partitioned in a particular way. To use it successfully with Kafka is can be very difficult: you must either be very careful in how the data is written to Kafka in the first place, or do something tricky with the keySelector so that the keyGroups it computes line up with how the keys are mapped to Kafka partitions.
One case where this isn't difficult is if the data is written to Kafka by a Flink job running with the same configuration as the downstream job that is reading the data and using reinterpretAsKeyedStream.

apache beam application failing in Flink with NotSerializableException

my use case use beam generate sequence to create unbounded source and do further transforms. it is working fine in direct runner. As per docs, Flink serialize Java primitive types and boxed forms, but I am getting the following error.
Caused by: org.apache.flink.api.common.InvalidProgramException: [lastEmitted type:LONG pos:0, startTime type:RECORD pos:1] is not serializable. The object probably contains or references non serializable fields.
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:140)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:115)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:115)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:115)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:115)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:115)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.clean(StreamExecutionEnvironment.java:1558)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.addSource(StreamExecutionEnvironment.java:1470)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.addSource(StreamExecutionEnvironment.java:1414)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.addSource(StreamExecutionEnvironment.java:1396)
at org.apache.beam.runners.flink.FlinkStreamingTransformTranslators$UnboundedReadSourceTranslator.translateNode(FlinkStreamingTransformTranslators.java:221)
... 31 more
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$Field
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at java.util.ArrayList.writeObject(ArrayList.java:768)
source
GenerateSequence.from(0).withRate(1, Duration.standardSeconds(10));
thanks for your help

Why is flume not writing data to S3 in Mumbai region?

I am using flume 1.5.0 version to migrate SQL server data to Amazon S3.
I am migrating only incremental data to s3. So its something like whenever a new record gets inserted into my sql server then it should be replicate on S3.
I can write sql server data to s3 in north virginia region, but when I create a bucket in mumbai region and write data in mumbai region then its throwing below error-
19/09/07 13:01:08 WARN hdfs.HDFSEventSink: HDFS IO error
java.io.IOException: s3n://BUCKET-NAME : 400 : Bad Request
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy8.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:403)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:776)
at org.apache.flume.sink.hdfs.HDFSDataStream.doOpen(HDFSDataStream.java:86)
at org.apache.flume.sink.hdfs.HDFSDataStream.open(HDFSDataStream.java:113)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:273)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:262)
at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:718)
at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:183)
at org.apache.flume.sink.hdfs.BucketWriter.access$1700(BucketWriter.java:59)
at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:715)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.jets3t.service.impl.rest.HttpException: 400 Bad Request
My flume conf sink properties are as below-
# SINK
agent.sinks = s3hdfs
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3n://ACCESS_KEY:SECRET_KEY#BUCKET-NAME/test
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.hdfs.filePrefix = test
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.rollCount = 0
agent.sinks.s3hdfs.hdfs.rollSize = 67108864 #64Mb filesize
agent.sinks.s3hdfs.hdfs.batchSize = 100
agent.sinks.s3hdfs.hdfs.rollInterval = 1
agent.sinks.s3hdfs.channel = ch1
I tried S3a instead of S3n but it also did not work.
So please suggest me what changes I need to make so it can write to S3 bucket in mumbai region. If I need to mention region name in flume conf file, then how would I mention that.
Thanks in advance.
S3n is obsolete. S3A will work with it -look at the hadoop docs on fs.s3a.endpoint and signing algorithm. These are supported in Hadoop 2.8+

Writing dataframe to SQL Server with df.write.jdbc() produces error: Column has a data type that cannot participate in a columnstore index

I'm using pyspark with spark 2.2 and python 2.7 in a cluster of 20 nodes. I'm loading data into a dataframe from cloud blob storage using df = spark.read.jdbc(...) and then I attempt to write it into my SQL Server database with df.write.jdbc(...). However, during the write process I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o67.jdbc.
: com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'my_col' has a data type that cannot participate in a columnstore index.
The df's schema is as follows:
root
|-- my_col: string (nullable = true)
|-- my_other_col: string (nullable = true)
...
This post has me believing that df.write.jdbc(...)maybe trying to create columnstore indexes for all columns in the df when it writes. Unfortunately I do not know how to stop spark from doing this so I can mitigate this issue.
A summarized version of my code looks like this:
spark = (pyspark.sql.SparkSession.builder
.appName('my-app').getOrCreate())
df = (self.spark.read.format('com.databricks.spark.avro')
.options(inferSchema=True, header=True)
.load(blob_storage_path)).repartition(self.num_partitions)
df.write.jdbc(url=self.jdbc_url, table=table, mode='overwrite',
properties=self.jdbc_properties)
Here's the full stack trace:
py4j.protocol.Py4JJavaError: An error occurred while calling o68.jdbc.
: com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'my_col' has a data type that cannot participate in a columnstore index.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:217)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1655)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:885)
at com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:778)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7505)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2445)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:191)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:166)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeUpdate(SQLServerStatement.java:703)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:805)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:90)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:461)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Thanks in advance for any help you can provide!
What SQL Server version do you have? There are some restrictions to the datatypes that could be in columnstore index in versions earlier than 2017. Read restrictions section of this article https://learn.microsoft.com/en-us/sql/t-sql/statements/create-columnstore-index-transact-sql The thing you can do, I guess, is drop the columnstore index or to migrate to Sql Server 2017.

Hector error inserting integer

I am running a single node Cassandra instance (for dev purposes) and am looking to insert an integer row into it. My Keyspace and columnfamily are already created on Cassandra.
I am using Cassandra 1.0 with Hector 1.0.5 (Jar version). My code is as follows:
Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "10.40.14.93:9160");
Keyspace keyspaceOperator = HFactory.createKeyspace("mykeyspace", cluster)
Mutator intM = HFactory.createMutator(keyspaceOperator, IntegerSerializer.get());
for each elem in my list {
intM.insert(doc.document_id ,
"mycolfamily",
me.prettyprint.hector.api.factory.HFactory.createColumn("numAdults", doc.numAdults))
}
I get TimedOutException on my client, and in the Cassandra logs, I see a bunch of the following:
ERROR [MutationStage:357] 2012-07-20 08:15:02,106 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[MutationStage:357,5,main]
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: ""
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1228)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Long.parseLong(Long.java:410)
at java.lang.Long.parseLong(Long.java:468)
at org.apache.solr.schema.TrieField.createField(TrieField.java:508)
at org.apache.solr.schema.FieldType.createFields(FieldType.java:292)
at org.apache.solr.schema.SchemaField.createFields(SchemaField.java:106)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.addFieldToDocument(SolrSecondaryIndex.java:382)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.populateDocument(SolrSecondaryIndex.java:280)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.applyIndexUpdates(SolrSecondaryIndex.java:164)
at org.apache.cassandra.db.index.SecondaryIndexManager.applyIndexUpdates(SecondaryIndexManager.java:419)
at org.apache.cassandra.db.Table.apply(Table.java:448)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:256)
at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:415)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1224)
... 3 more
ERROR [MutationStage:357] 2012-07-20 08:15:02,106 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[MutationStage:357,5,main]
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: ""
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1228)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Long.parseLong(Long.java:410)
at java.lang.Long.parseLong(Long.java:468)
at org.apache.solr.schema.TrieField.createField(TrieField.java:508)
at org.apache.solr.schema.FieldType.createFields(FieldType.java:292)
at org.apache.solr.schema.SchemaField.createFields(SchemaField.java:106)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.addFieldToDocument(SolrSecondaryIndex.java:382)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.populateDocument(SolrSecondaryIndex.java:280)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.applyIndexUpdates(SolrSecondaryIndex.java:164)
at org.apache.cassandra.db.index.SecondaryIndexManager.applyIndexUpdates(SecondaryIndexManager.java:419)
at org.apache.cassandra.db.Table.apply(Table.java:448)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:256)
at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:415)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1224)
}
I am trialling the Datastax Enterprise (DSE) which packages Cassandra, Hadoop, Solr etc. I have created my Cassandra CF via Solr Configuration (You can post Solr config and schema xmls to a Datastax instance to create the Keyspace and CF - its a feature of DSE)
Could someone please help?
Try adding an explicit serializer to your createColumn call...like so:
me.prettyprint.hector.api.factory.HFactory.createColumn("numAdults", doc.numAdults, StringSerializer.get(), IntegerSerializer.get()))
Also, on another note, I see you're doing inserts in a loop. Doing intM.addInsertion inside the loop and then intM.execute() once its done is more efficient.

Resources