Apache Beam ReadFromKafka using Python runs in Flink but no published messages are passing through - apache-flink

I have a local cluster running in Minikube. My pipeline job is written in Python and is a basic Kafka consumer. My pipeline looks as follows:
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions, StandardOptions


def run():
    options = PipelineOptions([
        "--runner=FlinkRunner",
        "--flink_version=1.10",
        "--flink_master=localhost:8081",
        "--environment_type=EXTERNAL",
        "--environment_config=localhost:50000",
        "--streaming",
        "--flink_submit_uber_jar"
    ])
    options.view_as(SetupOptions).save_main_session = True
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | 'Create words' >> ReadFromKafka(
             topics=['mullerstreamer'],
             consumer_config={
                 'bootstrap.servers': '192.168.49.1:9092,192.168.49.1:9093',
                 'auto.offset.reset': 'earliest',
                 'enable.auto.commit': 'true',
                 'group.id': 'BEAM-local'
             })
         | 'print' >> beam.Map(print)
         )


if __name__ == "__main__":
    run()
The Flink runner shows no records passing through in "Records received"
Am I missing something basic?

--environment_type=EXTERNAL means you are starting up the workers manually, and is primarily for internal testing. Does it work if you don't specify an environment_type/config at all?
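For illustration only, a minimal sketch of what that suggestion looks like: the question's options with the environment_type/environment_config flags simply dropped, so Beam falls back to its default (Docker-based) SDK harness. All values are carried over from the question.

# Sketch only: same options minus the EXTERNAL environment.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_version=1.10",
    "--flink_master=localhost:8081",
    "--streaming",
    "--flink_submit_uber_jar",
])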

def run(bootstrap_servers, topic, pipeline_args):
    bootstrap_servers = 'localhost:9092'
    topic = 'wordcount'
    pipeline_args.append('--flink_submit_uber_jar')  # list.append mutates in place and returns None
    pipeline_options = PipelineOptions([
        "--runner=FlinkRunner",
        "--flink_master=localhost:8081",
        "--flink_version=1.12",
    ] + pipeline_args,
        save_main_session=True, streaming=True)

    with beam.Pipeline(options=pipeline_options) as pipeline:
        _ = (
            pipeline
            | ReadFromKafka(
                consumer_config={'bootstrap.servers': bootstrap_servers},
                topics=[topic])
            | beam.FlatMap(lambda kv: log_ride(kv[1])))  # log_ride is defined elsewhere in the example
I'm facing another issue with the latest Apache Beam 2.30.0 and Flink 1.12.4:
2021/06/10 17:39:42 Initializing python harness: /opt/apache/beam/boot --id=1-2 --provision_endpoint=localhost:42353
2021/06/10 17:39:50 Failed to retrieve staged files: failed to retrieve /tmp/staged in 3 attempts: failed to retrieve chunk for /tmp/staged/pickled_main_session
caused by:
rpc error: code = Unknown desc = ; failed to retrieve chunk for /tmp/staged/pickled_main_session
caused by:
rpc error: code = Unknown desc = ; failed to retrieve chunk for /tmp/staged/pickled_main_session
caused by:
rpc error: code = Unknown desc = ; failed to retrieve chunk for /tmp/staged/pickled_main_session
caused by:
rpc error: code = Unknown desc =
2021-06-10 17:39:53,076 WARN org.apache.flink.runtime.taskmanager.Task [] - [3]ReadFromKafka(beam:external:java:kafka:read:v1)/{KafkaIO.Read, Remove Kafka Metadata} -> [1]FlatMap(<lambda at kafka-taxi.py:88>) (1/1)#0 (9d941b13ae9f28fd1460bc242b7f6cc9) switched from RUNNING to FAILED.
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalStateException: No container running for id d727ca3c0690d949f9ed1da9c3435b3ab3af70b6b422dc82905eed2f74ec7a15

Related

cannot read kinesis stream with flink, getting SdkClientException: Unable to execute HTTP request: Current token (VALUE_STRING)

I am trying to read a Kinesis stream from Flink (actually running against a local Kinesis mock). This is my consumer configuration:
val consumerConfig = new Properties()
consumerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")
consumerConfig.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, "x")
consumerConfig.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "x")
consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST")

val kinesis = env
  .addSource(new FlinkKinesisConsumer[String](
    inputStreamName, new SimpleStringSchema(), consumerConfig))
  .print()
but when placing a record on the stream:
aws --endpoint-url=http://localhost:4567 kinesis put-record --data "hello" --partition-key 0 --stream-name ExampleInputStream
I am getting the following error:
org.apache.flink.kinesis.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Current token (VALUE_STRING) not VALUE_EMBEDDED_OBJECT, can not access as binary
at [Source: (org.apache.flink.kinesis.shaded.com.amazonaws.event.ResponseProgressInputStream); line: -1, column: 304]
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1201)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1147)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2809)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2776)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2765)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1292)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:1263)
at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getRecords(KinesisProxy.java:247)
at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.getRecords(ShardConsumer.java:401)
at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:244)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.fasterxml.jackson.core.JsonParseException: Current token (VALUE_STRING) not VALUE_EMBEDDED_OBJECT, can not access as binary
at [Source: (org.apache.flink.kinesis.shaded.com.amazonaws.event.ResponseProgressInputStream); line: -1, column: 304]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:712)
at com.fasterxml.jackson.dataformat.cbor.CBORParser.getBinaryValue(CBORParser.java:1660)
at com.fasterxml.jackson.core.JsonParser.getBinaryValue(JsonParser.java:1484)
at org.apache.flink.kinesis.shaded.com.amazonaws.transform.SimpleTypeCborUnmarshallers$ByteBufferCborUnmarshaller.unmarshall(SimpleTypeCborUnmarshallers.java:198)
at org.apache.flink.kinesis.shaded.com.amazonaws.transform.SimpleTypeCborUnmarshallers$ByteBufferCborUnmarshaller.unmarshall(SimpleTypeCborUnmarshallers.java:196)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.model.transform.RecordJsonUnmarshaller.unmarshall(RecordJsonUnmarshaller.java:61)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.model.transform.RecordJsonUnmarshaller.unmarshall(RecordJsonUnmarshaller.java:29)
at org.apache.flink.kinesis.shaded.com.amazonaws.transform.ListUnmarshaller.unmarshallJsonToList(ListUnmarshaller.java:92)
at org.apache.flink.kinesis.shaded.com.amazonaws.transform.ListUnmarshaller.unmarshall(ListUnmarshaller.java:46)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.model.transform.GetRecordsResultJsonUnmarshaller.unmarshall(GetRecordsResultJsonUnmarshaller.java:53)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.model.transform.GetRecordsResultJsonUnmarshaller.unmarshall(GetRecordsResultJsonUnmarshaller.java:29)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:118)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:43)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:69)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1714)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1434)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1356)
at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
... 20 more
In my case I added these system properties:
System.setProperty(com.amazonaws.SDKGlobalConfiguration.AWS_CBOR_DISABLE_SYSTEM_PROPERTY, "true")
System.setProperty(SDKGlobalConfiguration.AWS_CBOR_DISABLE_SYSTEM_PROPERTY, "true")
System.setProperty(SdkSystemSetting.CBOR_ENABLED.property(), "false")
From the stack trace it looks like the Flink Kinesis connector expects CBOR while your record is written as a simple string.
This post seems to contain some hints on how to align your local setup with the consumer side.

How to generate Cucumber extent reports 4 through Gradle

I am executing the Cucumber automation test scripts through Gradle. I am able to execute the tests successfully, but reports are not getting generated. I tried the Gradle tasks below, one at a time.
The Cucumber Gradle tasks are copied below.
a) For the Gradle task below, I am able to execute the scripts, but it does not generate reports, as I have not included the reporting statements inside args:
task cucumber() {
    dependsOn assemble, testClasses
    doLast {
        javaexec {
            main = "io.cucumber.core.cli.Main"
            classpath = configurations.cucumberRuntime + sourceSets.main.output + sourceSets.test.output
            args = ['--plugin', 'pretty',
                    '--glue', 'com.hal.brands.test.stepdefinition',
                    'src/test/resources', 'src/main/java',
                    '--tags', '#Api'
            ]
        }
    }
}
b) I tried to include a few statements related to reporting, as shown in the Gradle task below, but it failed with an exception:
task cucumber() {
    dependsOn assemble, testClasses
    doLast {
        javaexec {
            main = "io.cucumber.core.cli.Main"
            classpath = configurations.cucumberRuntime + sourceSets.main.output + sourceSets.test.output
            args = ['--plugin', 'pretty', 'json:target/HalBrands.json', 'com.aventstack.extentreports.cucumber.adapter.ExtentCucumberAdapter:',
                    '--monochrome', 'true',
                    '--features', 'src\test\resources\featurefile',
                    '--glue', 'com.hal.brands.test.stepdefinition',
                    'src/test/resources', 'src/main/java',
                    '--tags', '#Api'
            ]
        }
    }
}
Exception details are given below:
BUILD SUCCESSFUL in 1s
6 actionable tasks: 6 up-to-date
E:\GradleProj_workspace\GadleDemoProj>gradle cucumber
> Configure project :
http://api-qa.dpr-prv.halbrands.us.cloudappzero.azureflash.com
E:\GradleProj_workspace\GadleDemoProj
> Task :cucumber FAILED
Exception in thread "main" java.lang.IllegalArgumentException: com.aventstack.extentreports.cucumber.adapter.ExtentCucumberAdapter: is not valid. Try URI[:LINE]*
at io.cucumber.core.model.FeatureWithLines.parse(FeatureWithLines.java:56)
at io.cucumber.core.options.RuntimeOptionsParser.parse(RuntimeOptionsParser.java:131)
at io.cucumber.core.options.CommandlineOptionsParser.parse(CommandlineOptionsParser.java:25)
at io.cucumber.core.options.CommandlineOptionsParser.parse(CommandlineOptionsParser.java:29)
at io.cucumber.core.cli.Main.run(Main.java:29)
at io.cucumber.core.cli.Main.main(Main.java:14)
Caused by: java.lang.IllegalArgumentException: Expected scheme-specific part at index 68: com.aventstack.extentreports.cucumber.adapter.ExtentCucumberAdapter:
at java.net.URI.create(URI.java:852)
at io.cucumber.core.model.FeaturePath.parseProbableURI(FeaturePath.java:72)
at io.cucumber.core.model.FeaturePath.parse(FeaturePath.java:57)
at io.cucumber.core.model.FeatureWithLines.parseFeaturePath(FeatureWithLines.java:77)
at io.cucumber.core.model.FeatureWithLines.parse(FeatureWithLines.java:53)
... 5 more
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 68: com.aventstack.extentreports.cucumber.adapter.ExtentCucumberAdapter:
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.failExpecting(URI.java:2854)
at java.net.URI$Parser.parse(URI.java:3057)
at java.net.URI.<init>(URI.java:588)
at java.net.URI.create(URI.java:850)
... 9 more
FAILURE: Build failed with an exception.
* Where:
Build file 'E:\GradleProj_workspace\GadleDemoProj\build.gradle' line: 104
* What went wrong:
Execution failed for task ':cucumber'.
> Process 'command 'C:\Program Files\Java\jdk1.8.0_202\bin\java.exe'' finished with non-zero exit value 1
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
* Get more help at https://help.gradle.org
Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.5.1/userguide/command_line_interface.html#sec:command_line_warnings
BUILD FAILED in 1s
6 actionable tasks: 1 executed, 5 up-to-date
E:\GradleProj_workspace\GadleDemoProj>
**---------------------------------------------------------------------------------------------------------------**
I am not sure where I am going wrong in the 'b' Gradle task above. Please help; I have spent a lot of time here trying out many possibilities. I should also mention that I am new to Gradle.
I have also copied the CucumberOptions as present in my Cucumber runner file:
@CucumberOptions(features = { "classpath:featurefile" },
        glue = { "classpath:com.hal.brands.test.stepdefinition", "classpath:com.hal.brands.helper" },
        plugin = { "pretty", "json:target/HalBrands.json", "com.aventstack.extentreports.cucumber.adapter.ExtentCucumberAdapter:" },
        monochrome = true, tags = "#Api")
Please take off the json: to get rid of the exception you are getting with respect to the invalid scheme!
The complete solution is given below.
task cucumber() {
    dependsOn assemble, testClasses
    doLast {
        javaexec {
            main = "io.cucumber.core.cli.Main"
            classpath = configurations.cucumberRuntime + sourceSets.main.output + sourceSets.test.output
            args = ['--plugin', 'pretty',
                    '--plugin', 'json:target/HalBrands.json',
                    '--plugin', 'com.aventstack.extentreports.cucumber.adapter.ExtentCucumberAdapter:Report',
                    '--glue', 'com.hal.brands.test.stepdefinition',
                    'src/test/resources', 'src/main/java',
                    '--tags', '#Api'
            ]
        }
    }
}

Flink adding rebalance to stream cause to job failure when StreamExecutionEnvironment set using TimeCharacteristic.IngestionTime

I am trying to run a streaming job that consumes messages from Kafka, transforms them, and sinks them to Cassandra.
The current code snippet is failing:
val env: StreamExecutionEnvironment = getExecutionEnv("dev")
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
// ...

val source = env.addSource(kafkaConsumer)
  .uid("kafkaSource")
  .rebalance

val transformedObjects = source.process(new EnrichEventWithIngestionTimestamp)
  .setParallelism(dataSinkParallelism)

sinker.apply(transformedObjects, dataSinkParallelism)

class EnrichEventWithIngestionTimestamp extends ProcessFunction[RawData, TransforemedObjects] {
  override def processElement(rawData: RawData,
                              context: ProcessFunction[RawData, TransforemedObjects]#Context,
                              collector: Collector[TransforemedObjects]): Unit = {
    val currentTimestamp = context.timerService().currentProcessingTime()
    context.timerService().registerProcessingTimeTimer(currentTimestamp)
    collector.collect(TransforemedObjects.fromRawData(rawData, currentTimestamp))
  }
}
but if rebalance is commented out, or the job is changed to use TimeCharacteristic.EventTime with watermark assignment, as in the following snippet, then it works:
val env: StreamExecutionEnvironment = getExecutionEnv("dev")
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// ...

val source = env.addSource(kafkaConsumer)
  .uid("kafkaSource")
  .rebalance
  .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessRawDataTimestampExtractor[RawData](Time.seconds(1)))

val transformedObjects = source.map(rawData => TransforemedObjects.fromRawData(rawData))
  .setParallelism(dataSinkParallelism)

sinker.apply(transformedObjects, dataSinkParallelism)
The stack trace is:
java.lang.Exception: java.lang.RuntimeException: 1
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.checkThrowSourceExecutionException(SourceStreamTask.java:217)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.processInput(SourceStreamTask.java:133)
at org.apache.flink.streaming.runtime.tasks.StreamTask.run(StreamTask.java:301)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:406)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: 1
at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:110)
at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)
at org.apache.flink.streaming.api.collector.selector.DirectedOutput.collect(DirectedOutput.java:143)
at org.apache.flink.streaming.api.collector.selector.DirectedOutput.collect(DirectedOutput.java:45)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:727)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:705)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$AutomaticWatermarkContext.processAndCollect(StreamSourceContexts.java:176)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$AutomaticWatermarkContext.processAndCollectWithTimestamp(StreamSourceContexts.java:194)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collectWithTimestamp(StreamSourceContexts.java:409)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordWithTimestamp(AbstractFetcher.java:398)
at org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord(Kafka010Fetcher.java:91)
at org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:156)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:715)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:203)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.getBufferBuilder(RecordWriter.java:246)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:169)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:154)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:120)
at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107)
... 16 more
Am I doing something wrong?
Or is there a limitation on using the rebalance function when the TimeCharacteristic is set to IngestionTime?
Thank you in advance.
Can you provide the Flink version that you are using?
Your issue seems to be related to this Jira ticket:
https://issues.apache.org/jira/browse/FLINK-14087
Did you use rebalance only once in your job? The RecordWriter may share the same ChannelSelector, which decides where each record will be forwarded. Your stack trace shows it is trying to select an out-of-bounds channel.

Flow has failed with error Shutting down because of violation of the Reactive Streams specification

It seems I can never get the error handling right when using Akka Streams.
So this is my code
var db = Database.forConfig("oracle")
var mysqlDb = Database.forConfig("mysql_read")
var mysqlDbWrite = Database.forConfig("mysql_write")

implicit val actorSystem = ActorSystem()

val decider: Supervision.Decider = {
  case _: Exception =>
    println("got an exception restarting connections")
    // let us restart our connections
    db.close()
    mysqlDb.close()
    mysqlDbWrite.close()
    db = Database.forConfig("oracle")
    mysqlDb = Database.forConfig("mysql_read")
    mysqlDbWrite = Database.forConfig("mysql_write")
    Supervision.Restart
}

implicit val materializer = ActorMaterializer(ActorMaterializerSettings(actorSystem).withSupervisionStrategy(decider))
and I have a flow like this
val alreadyExistsFilter: Flow[Foo, Foo, NotUsed] = Flow[Foo].mapAsync(10) { foo =>
  try {
    val existsQuery = sql"""SELECT id FROM foo WHERE id = ${foo.id}""".as[Long]
    mysqlDbWrite.run(existsQuery).map(v => (foo, v))
  } catch {
    case e: Throwable =>
      println(s"Lookup failed for ${foo}")
      throw e // will restart the stream
  }
}.collect { case (f, v) if v.isEmpty => f }
So basically if the foo already exists in MySQL then the record should not be processed any further by the stream.
My hope with this code was that if anything fails with the mysql lookup (the mysql machine is pretty bad and timeouts are common), the record will be printed and discarded and the stream will continue with the remaining records courtesy of the supervision.
When I run this code, I see errors like:
[error] (mysql_write network timeout executor) java.lang.RuntimeException: java.sql.SQLException: Invalid socket timeout value or state
java.lang.RuntimeException: java.sql.SQLException: Invalid socket timeout value or state
at com.mysql.jdbc.ConnectionImpl$12.run(ConnectionImpl.java:5576)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: Invalid socket timeout value or state
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:937)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:926)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:872)
at com.mysql.jdbc.MysqlIO.setSocketTimeout(MysqlIO.java:4852)
at com.mysql.jdbc.ConnectionImpl$12.run(ConnectionImpl.java:5574)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Socket is closed
at java.net.Socket.setSoTimeout(Socket.java:1137)
at com.mysql.jdbc.MysqlIO.setSocketTimeout(MysqlIO.java:4850)
at com.mysql.jdbc.ConnectionImpl$12.run(ConnectionImpl.java:5574)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
and
[error] (mysql_write network timeout executor) java.lang.NullPointerException
java.lang.NullPointerException
at com.mysql.jdbc.MysqlIO.setSocketTimeout(MysqlIO.java:4850)
at com.mysql.jdbc.ConnectionImpl$12.run(ConnectionImpl.java:5574)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
One thing which surprises me here is that these exceptions don't come from my catch block, because I don't see the println statement of my catch block. The stack trace doesn't show me where it originated from, but since it says mysql_write I can assume it's the Flow above, because only this Flow uses mysql_write.
Finally the entire stream crashes with the error
[trace] Stack trace suppressed: run last compile:runMain for the full output.
flow has failed with error Shutting down because of violation of the Reactive Streams specification.
14:51:06,973 |-INFO in ch.qos.logback.classic.AsyncAppender[asyncKafkaAppender] - Worker thread will flush remaining events before exiting.
[success] Total time: 3480 s, completed Sep 26, 2017 2:51:07 PM
14:51:07,603 |-INFO in ch.qos.logback.core.hook.DelayingShutdownHook#2320545b - Sleeping for 1 seconds
I don't know what I did to violate the reactive streams specification!!
A first stab at getting a more predictable solution would be removing the blocking behaviour (Await.result) and using mapAsync, handing the database Future directly to the stage. A rewrite of the alreadyExistsFilter flow could be:
val alreadyExistsFilter: Flow[Foo, Foo, NotUsed] = Flow[Foo].mapAsync(3) { foo =>
  val existsQuery = sql"""SELECT id FROM foo WHERE id = ${foo.id}""".as[Long]
  mysqlDbWrite.run(existsQuery).map(res => foo -> res) // no Await.result: mapAsync consumes the Future
}.collect {
  case (foo, res) if res.isEmpty => foo // keep only ids not already present in MySQL
}
More info on blocking in Akka can be found in the docs.
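If the goal is also to drop a record whose lookup fails (rather than restarting the whole stream through the supervision strategy), one possible sketch is to recover the Future itself before mapAsync sees the failure. This is not from the answer above; it assumes the same Foo type, the Slick-style mysqlDbWrite handle, and an implicit ExecutionContext as in the question:

import scala.util.control.NonFatal

val alreadyExistsFilter: Flow[Foo, Foo, NotUsed] =
  Flow[Foo].mapAsync(3) { foo =>
    val existsQuery = sql"""SELECT id FROM foo WHERE id = ${foo.id}""".as[Long]
    mysqlDbWrite.run(existsQuery)
      .map(ids => Option(foo).filter(_ => ids.isEmpty)) // keep only ids not already in MySQL
      .recover { case NonFatal(e) =>
        println(s"Lookup failed for $foo: ${e.getMessage}")
        None // discard this record instead of failing the stage
      }
  }.collect { case Some(foo) => foo }

With this shape, a MySQL timeout only drops that one element and the rest of the stream keeps flowing.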
The answer given by Stefano is correct: the error was indeed coming from blocking code in the flow.
However, my initial program was running against Scala 2.11, and even after switching to mapAsync the problem persisted.
Since this is a command-line tool it was easy for me to switch to Scala 2.12 and try again. With Scala 2.12 it worked perfectly.
One thing which greatly helped me is to have "ch.qos.logback" % "logback-classic" % "1.2.3" in the dependencies. This will show you each and every SQL statement which is being executed, so you can easily see if something is going wrong.
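For reference, that dependency would look roughly like this in an sbt build (the build tool is an assumption; the coordinates are the ones quoted above):

// build.sbt
libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.2.3"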

SnappyData + Zeppelin + Kafka streaming - error while creating streaming table

I'm trying to create a SnappyData streaming table using Zeppelin.
I have an issue with the stream table definition, specifically the 'rowConverter' argument.
The Zeppelin notebook is divided into a few paragraphs:
Paragraph 1:
import org.apache.spark.sql.Row
import org.apache.spark.sql.streaming.{SchemaDStream, StreamToRowsConverter}

class RowsConverter extends StreamToRowsConverter with Serializable {

  override def toRows(message: Any): Seq[Row] = {
    val log = message.asInstanceOf[String]
    val fields = log.split(",")
    val rows = Seq(Row.fromSeq(Seq(
      new java.sql.Timestamp(fields(0).toLong),
      fields(1),
      fields(2),
      fields(3),
      fields(4),
      fields(5).toDouble,
      fields(6)
    )))
    rows
  }
}
Paragraph 2:
snsc.sql(
  "CREATE STREAM TABLE adImpressionStream if not exists (sensor_id string, metric string) " +
  "using kafka_stream " +
  "options (storagelevel 'MEMORY_AND_DISK_SER_2', " +
  "rowConverter 'RowsConverter', " +
  "zkQuorum 'localhost:2181', " +
  "groupId 'streamConsumer', topics 'test')"
)
The first paragraph returns this error:
error: not found: type StreamToRowsConverter
class RowsConverter extends StreamToRowsConverter with Serializable {
^
<console>:13: error: not found: type Row
override def toRows(message: Any): Seq[Row] = {
^
<console>:16: error: not found: value Row
val rows = Seq(Row.fromSeq(Seq(new java.sql.Timestamp(fields(0).toLong),
The second paragraph fails with:
java.lang.RuntimeException: Failed to load class : java.lang.ClassNotFoundException: RowsConverter
I have also been trying to use the default code from the Git repository:
snsc.sql("create stream table streamTable (userId string, clickStreamLog string) " +
"using kafka_stream options (" +
"storagelevel 'MEMORY_AND_DISK_SER_2', " +
" rowConverter 'io.snappydata.app.streaming.KafkaStreamToRowsConverter' ," +
"kafkaParams 'zookeeper.connect->localhost:2181;auto.offset.reset->smallest;group.id->myGroupId', " +
"topics 'test')")
but I get a similar error:
java.lang.RuntimeException: Failed to load class : java.lang.ClassNotFoundException: io.snappydata.app.streaming.KafkaStreamToRowsConverter
Could you help me with this issue?
Thank you a lot.
You need to provide your application-specific classes on the classpath. Please refer to the classpath setup step here; Zeppelin will pick up the classpath set in your spark-env.sh:
https://github.com/SnappyDataInc/snappy-poc#lets-get-this-going
Add the SnappyData interpreter to Apache Zeppelin as described here: https://snappydatainc.github.io/snappydata/howto/use_apache_zeppelin_with_snappydata/
This enables running Zeppelin in the lead node so that the code runs in embedded mode. In particular, you need to set the required jars using the "-classpath" option in the cluster configuration.
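As a purely illustrative sketch (the jar path and the interpreter flag below are assumptions, not taken from the answers above), a lead configuration entry combining both hints might look roughly like this:

# conf/leads -- hypothetical entry
localhost -zeppelin.interpreter.enable=true -classpath=/path/to/your-row-converter.jar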
