Unable to stream with Flink Streaming - apache-flink

I am new to the Flink Streaming framework and I am trying to understand the components and the flow. I am trying to run the basic word count example using the DataStream API in my IDE. The code runs with no issues when I feed data from a collection:
DataStream<String> text = env.fromElements(
        "To be, or not to be,--that is the question:--",
        "Whether 'tis nobler in the mind to suffer",
        "The slings and arrows of outrageous fortune",
        "Or to take arms against a sea of troubles,"
);
But it fails every time I try to read data either from a socket or from a file, as below:
DataStream<String> text = env.socketTextStream("localhost", 9999);
or
DataStream<String> text = env.readTextFile("sample_file.txt");
For both the socket and the text file, I get the following error:
Exception in thread "main" java.lang.reflect.InaccessibleObjectException: Unable to make field private final byte[] java.lang.String.value accessible: module java.base does not "opens java.lang" to unnamed module

Flink currently (as of Flink 1.14) only supports Java 8 and Java 11. Install a suitable JDK and try again.
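For reference, here is a minimal, self-contained sketch of the socket word count (the class name and the JDK assumption are mine; the host and port come from the question). With a Java 8 or Java 11 JDK selected in the IDE, something along these lines should run against a local text source started with nc -lk 9999:
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class SocketWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Lines arrive from a local socket; start a text source first, e.g. with: nc -lk 9999
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        text.flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        out.collect(Tuple2.of(word, 1));
                    }
                }
            })
            // Lambdas lose generic type information, so declare the output type explicitly.
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            .keyBy(t -> t.f0)
            .sum(1)
            .print();

        env.execute("Socket WordCount");
    }
}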

Related

Apache Flink with Kinesis Analytics : java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0

Background:
I have been trying to set up BATCH + STREAMING in the same Flink application, which is deployed on the Kinesis Analytics runtime. The STREAMING part works fine, but I'm having trouble adding support for BATCH.
The logic is something like this:
streamExecutionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);

streamExecutionEnvironment
    .fromSource(FileSource.forRecordStreamFormat(new TextLineFormat(), path).build(),
                WatermarkStrategy.noWatermarks(),
                "Text File")
    .process(/* process function which transforms the input */)
    .assignTimestampsAndWatermarks(WatermarkStrategy
        .<DetectionEvent>forBoundedOutOfOrderness(orderness)
        .withTimestampAssigner(
            (SerializableTimestampAssigner<DetectionEvent>) (event, l) -> event.getEventTime()))
    .keyBy(keyFunction)
    .window(TumblingEventTimeWindows.of(Time.days(x)))
    .process(processWindowFunction);
On doing this I'm getting the below exception:
java.lang.Exception: Exception while creating StreamOperatorStateContext.
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:254)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:441)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:582)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:562)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:571)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for WindowOperator_90bea66de1c231edf33913ecd54406c1_(1/1) from any of the 1 provided restore options.
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163)
... 10 more
Caused by: java.io.IOException: Failed to acquire shared cache resource for RocksDB
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.allocateSharedCachesIfConfigured(RocksDBOperationUtils.java:306)
at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:426)
at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:90)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
... 12 more
Caused by: java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0. Please make sure that all types of managed memory consumers contained in the job are configured with a non-negative weight via `taskmanager.memory.managed.consumer-weights`.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:160)
at org.apache.flink.runtime.memory.MemoryManager.validateFraction(MemoryManager.java:672)
at org.apache.flink.runtime.memory.MemoryManager.computeMemorySize(MemoryManager.java:653)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:521)
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.allocateSharedCachesIfConfigured(RocksDBOperationUtils.java:302)
... 17 more
It seems like Kinesis Analytics does not allow clients to provide a flink-conf.yaml file to set taskmanager.memory.managed.consumer-weights. Is there any way around this?
It's not clear to me what the underlying cause of this exception is, nor how to make batch processing work on KDA.
You can try this (but I'm not sure KDA will allow it):
Configuration conf = new Configuration();
conf.setString("taskmanager.memory.managed.consumer-weights", "put-the-value-here");
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment(conf);

NPE when I submit the jar to Flink

I wrote code to consume a stream from Kafka and then sink it back to another Kafka topic.
The code runs normally in my IDE, but when I submit the jar through the Flink web UI, it throws a NullPointerException on String[] cells = s.split(",");
Any help is appreciated. The full exception is:
java.lang.NullPointerException
at java.base/java.lang.String.split(String.java:2273)
at java.base/java.lang.String.split(String.java:2364)
at ex_filter_operation$SplitterKafkaString.map(ex_filter_operation.java:336)
at ex_filter_operation$SplitterKafkaString.map(ex_filter_operation.java:330)
at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:717)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:692)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:672)
at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:52)
at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:30)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collect(StreamSourceContexts.java:104)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collectWithTimestamp(StreamSourceContexts.java:111)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordsWithTimestamps(AbstractFetcher.java:352)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.partitionConsumerRecordsHandler(KafkaFetcher.java:185)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.runFetchLoop(KafkaFetcher.java:141)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:755)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:213)

Anybody know if OrcTableSource supports S3 file system?

I'm running into some trouble using OrcTableSource to fetch an ORC file from cloud object storage (IBM COS); the code fragment is provided below:
OrcTableSource soORCTableSource = OrcTableSource.builder()
        // path to ORC file
        .path("s3://orders/so.orc")
        // schema of ORC files
        .forOrcSchema(OrderHeaderORCSchema)
        .withConfiguration(orcconfig)
        .build();
It seems this path is not resolved correctly; can anyone help out? Much appreciated!
Caused by: java.io.FileNotFoundException: File /so.orc does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:528)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:170)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:748)
By the way, I've already set up flink-s3-fs-presto-1.6.2 and had the following code running correctly, so the question is limited to OrcTableSource only.
DataSet<Tuple5<String, String, String, String, String>> orderinfoSet =
        env.readCsvFile("s3://orders/so.csv")
           .types(String.class, String.class, String.class,
                  String.class, String.class);
The problem is that Flink's OrcRowInputFormat uses two different file systems: one for generating the input splits and one for reading the actual input splits. For the former it uses Flink's FileSystem abstraction, and for the latter it uses Hadoop's FileSystem. Therefore, you need to add the following snippet to Hadoop's core-site.xml configuration:
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
See this link for more information about setting up S3 for Hadoop.
This is a limitation of Flink's OrcRowInputFormat and should be fixed. I've created the corresponding issue.
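As a possible shortcut, the same property can also be set programmatically on the Hadoop Configuration that is already passed in through withConfiguration. This is only a sketch under the assumption that OrcRowInputFormat consults that configuration when resolving the file system (the core-site.xml route above is the confirmed one); the endpoint and credential values are placeholders, and hadoop-aws with its S3A dependencies must be on the classpath:
import org.apache.hadoop.conf.Configuration;

Configuration orcconfig = new Configuration();
// Route s3:// URIs through the S3A connector and point it at the IBM COS endpoint.
orcconfig.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
orcconfig.set("fs.s3a.endpoint", "<your-cos-endpoint>");   // placeholder
orcconfig.set("fs.s3a.access.key", "<your-access-key>");   // placeholder
orcconfig.set("fs.s3a.secret.key", "<your-secret-key>");   // placeholder

OrcTableSource soORCTableSource = OrcTableSource.builder()
        .path("s3://orders/so.orc")
        .forOrcSchema(OrderHeaderORCSchema)
        .withConfiguration(orcconfig)
        .build();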

Can't Get GAE + GWT + Objectify to Work

As the title says, I'm trying to create a GAE + GWT project using Objectify, but I can't even get it off the ground. I'm sure I'm missing something simple, but it doesn't seem to be working.
Here is what I've done so far:
Created a new project and added guava-17.0.jar, guava-gwt-17.0.jar, objectify-5.0.3.jar, and objectify-gwt-1.1.jar to my WEB-INF\lib folder. These are all the latest versions of these jars.
Ran the application. Sent a simple RPC command; the server responds and the client successfully receives the response (onSuccess() is called).
Added the line <inherits name="com.googlecode.objectify.Objectify" /> to my gwt.xml file, per Objectify-GWT's website, which is supposed to enable Objectify in GWT.
Ran the application again. The application starts and the same RPC command is sent; the server receives it and responds, but the client reports the call as a failure (onFailure() is called).
I am using the boilerplate code that is pre-populated when you first create a new web application. For reference, here is the RPC command:
private void sendNameToServer() {
    // First, we validate the input.
    errorLabel.setText("");
    String textToServer = nameField.getText();
    if (!FieldVerifier.isValidName(textToServer)) {
        errorLabel.setText("Please enter at least four characters");
        return;
    }

    // Then, we send the input to the server.
    sendButton.setEnabled(false);
    textToServerLabel.setText(textToServer);
    serverResponseLabel.setText("");
    greetingService.greetServer(textToServer, new AsyncCallback<String>() {
        public void onFailure(Throwable caught) {
            // Show the RPC error message to the user
            dialogBox.setText("Remote Procedure Call - Failure");
            serverResponseLabel.addStyleName("serverResponseLabelError");
            serverResponseLabel.setHTML(SERVER_ERROR);
            dialogBox.center();
            closeButton.setFocus(true);
        }

        public void onSuccess(String result) {
            dialogBox.setText("Remote Procedure Call");
            serverResponseLabel.removeStyleName("serverResponseLabelError");
            serverResponseLabel.setHTML(result);
            dialogBox.center();
            closeButton.setFocus(true);
        }
    });
}
This is the error I receive after I try to make the RPC call:
[DEBUG] [my_app] - Validating units:
[INFO] [my_app] - Module my_app has been loaded
[ERROR] [my_app] - Errors in 'com/google/gwt/dev/jjs/SourceOrigin.java'
[ERROR] [my_app] - Line 77: The method synchronizedMap(new LinkedHashMap<SourceOrigin,SourceOrigin>(){}) is undefined for the type Collections
[ERROR] [my_app] - Errors in 'com/google/gwt/dev/util/StringInterner.java'
[ERROR] [my_app] - Line 29: No source code is available for type com.google.gwt.thirdparty.guava.common.collect.Interner<E>; did you forget to inherit a required module?
[ERROR] [my_app] - Line 29: No source code is available for type com.google.gwt.thirdparty.guava.common.collect.Interners; did you forget to inherit a required module?
To me it looks like Objectify is interfering with GWT. I know they're supposed to work together, so I'm not sure what I'm doing wrong. Any advice would be appreciated.
Use objectify-gwt 1.2. It's possible that 1.1 has some issues from merging a bad PR.
You can see a sample application that uses objectify-gwt to pass a GeoPt back and forth from the client here: https://github.com/stickfigure/objectify-gwt-test
You should use Objectify on the server side before trying to do this kind of stuff. Objectify is a server-side persistence technology; call it in your server code.
Add a try/catch in your service methods and print the stack trace of the exception to your server console. If you receive onFailure() in GWT, that means there is a failure on the server side; you have to find out what that failure is.
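A minimal sketch of that advice, using the GWT starter app's default service names (GreetingServiceImpl and greetServer are assumptions based on the boilerplate, not taken from the poster's code):
import com.google.gwt.user.server.rpc.RemoteServiceServlet;

public class GreetingServiceImpl extends RemoteServiceServlet implements GreetingService {

    @Override
    public String greetServer(String input) throws IllegalArgumentException {
        try {
            // ... existing server-side logic, e.g. Objectify calls, goes here ...
            return "Hello, " + input + "!";
        } catch (RuntimeException e) {
            // Print the real server-side failure; the client only ever sees onFailure().
            e.printStackTrace();
            throw e;
        }
    }
}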
Now, the second part is advice:
<inherits name="com.googlecode.objectify.Objectify" />
is a weird line to me. GWT doesn't need to know about your persistence layer.
Unless it's a revolutionary concept, I would recommend you don't use this type of technology that takes the control of your DB access out of your hands...

java.lang.OutOfMemoryError when processing large pgp file

I want to use Camel 2.12.1 to decrypt some potentially large PGP files. The following flow results in an out-of-memory exception, and the call stack shows that the PGPDataFormat.unmarshal() function is trying to build a byte array, which is destined to fail if the file is large. Is there a way to pass streams around during unmarshalling?
My route:
from("file:///home/cps/camel/sftp-in?"
+ "include=.*&" // find files using this pattern
+ "move=/home/cps/camel/sftp-archive&" // after done adding records to queue, move file to archive
+ "delay=5000&"
+ "readLock=rename&" // readLock parameters prevent picking up file which is currently changing
+ "readLockCheckInterval=5000")
.choice()
.when(header(Exchange.FILE_NAME_ONLY).regex(".*pgp$|.*PGP$|.*gpg$|.*GPG$")).to("direct:decrypt")
.otherwise()
.to("file:///home/cps/camel/input");
from("direct:decrypt").unmarshal().pgp("file:///home/cps/.gnupg/secring.gpg", "developer", "set42now")
.setHeader(Exchange.FILE_NAME).groovy("request.headers.get('CamelFileNameOnly').replace('.gpg', '')")
.to("file:///home/cps/camel/input/")
.to("log:done");
The exception, which shows the converter trying to create a byte array:
java.lang.OutOfMemoryError: Java heap space
at org.apache.commons.io.output.ByteArrayOutputStream.needNewBuffer(ByteArrayOutputStream.java:128)
at org.apache.commons.io.output.ByteArrayOutputStream.write(ByteArrayOutputStream.java:158)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1026)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:218)
at org.apache.camel.converter.crypto.PGPDataFormat.unmarshal(PGPDataFormat.java:238)
at org.apache.camel.processor.UnmarshalProcessor.process(UnmarshalProcessor.java:65)
Try with 2.13 or 2.12-SNAPSHOT, as we have improved the data format and streaming support recently, so it's likely to be better in the next release.
