I'm running Apache Flink 1.12.7 and configured the streaming execution environment with the number of task slots per task manager set to 3 (just experimenting), but I am unable to see the output of a file read by the environment. Instead, as seen in the logs, the execution graph is stuck in the SCHEDULED state and never switches to RUNNING.
Note that if no configuration is passed from the properties file, everything works fine and the output is shown, since the execution graph switches to RUNNING and the environment is able to read the file.
The code is as follows:
ParameterTool parameters = ParameterTool.fromPropertiesFile("src/main/resources/application.properties");
Configuration config = Configuration.fromMap(parameters.toMap());
TaskExecutorResourceUtils.adjustForLocalExecution(config);
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment(config);
System.out.println("Config Params : " + config.toMap());
DataStream<String> inputStream =
env.readTextFile(FILEPATH);
DataStream<String> filteredData = inputStream.filter((String value) -> {
    String[] tokens = value.split(",");
    return Double.parseDouble(tokens[3]) >= 75.0;
});
filteredData.print(); // no output is seen if the configuration object is set; otherwise everything works as expected
env.execute("Filter Country Details");
I need help understanding this behaviour and knowing what changes should be made so that the output gets printed while still using some custom configuration. Thank you.
Okay, so I found the answer to the above puzzle by referring to the links mentioned below.
Solution: I set the parallelism (env.setParallelism) in the above code just after configuring the streaming execution environment, and the file was read and the output generated as expected.
After that, I experimented with a few things:
set parallelism equal to number of task slots = everything worked
set parallelism greater than number of task slots = intermittent results
set parallelism less than number of task slots = intermittent results.
As per this link on the Flink architecture:
A Flink cluster needs exactly as many task slots as the highest parallelism used in the job
So it's best to set the number of task slots per task manager equal to the configured parallelism.
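For example, a condensed sketch of the change described above (the slot count of 3 is taken from the experiment, and config is the Configuration built from the properties file):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config); // config contains taskmanager.numberOfTaskSlots=3
env.setParallelism(3); // match the parallelism to the number of task slots per task manager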
Related
When using Apache Flink, we can configure values in flink-conf.yaml. But using CLI commands, we can also assign some values dynamically when starting or submitting a job or a task in Flink, e.g.:
bin/taskmanager.sh start-foreground -Dtaskmanager.numberOfSlots=12
But some values, like jobmanager.memory.process.size and taskmanager.memory.process.size, cannot be set dynamically using "-D".
Is there a way to set those values dynamically when starting the JobManager and TaskManager from the CLI?
Maybe this will help you:
ParameterTool parameters = ParameterTool.fromPropertiesFile("src/main/resources/application.properties"); // one can specify the properties defined in conf/flink-conf.yaml in this properties file
Configuration config = Configuration.fromMap(parameters.toMap());
TaskExecutorResourceUtils.adjustForLocalExecution(config);
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment(config);
env.setParallelism(3);
System.out.println("Config Params : " + config.toMap());
Note that you should set the parallelism equal to the number of task slots per task manager, as per this link. By default, the number of task slots for a task manager is one.
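For reference, a hypothetical application.properties along these lines would also cover the memory options from the question (the values shown are just examples, not recommendations):
# src/main/resources/application.properties -- example values only
jobmanager.memory.process.size=1600m
taskmanager.memory.process.size=1728m
taskmanager.numberOfTaskSlots=3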
I am reading lines from Parquet, and for that I am using a source function similar to this one. However, when I try counting the number of lines being processed, nothing is printed even though the job completes:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
lazy val stream: DataStream[Group] = env.addSource(new ParquetSourceFunction)
stream.map(_ => 1)
.timeWindowAll(Time.seconds(180))
.reduce( _ + _).print()
The problem is that you are using ProcessingTime. When you use EventTime, Flink emits a watermark with the value Long.MAX_VALUE once the file is finished, so that all windows are closed; this does not happen when working with ProcessingTime. Simply speaking, Flink doesn't wait for your window to close, and that's why you are not getting any valuable results.
You may want to try switching to the DataSet API, which should be more appropriate for the task you want to achieve.
Alternatively, you may try to play with EventTime and assign a static watermark, since at the end Flink will still emit a watermark with the value Long.MAX_VALUE.
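For illustration, here is a rough Java sketch of that event-time workaround (translate to Scala as needed); it assumes the same custom ParquetSourceFunction and Group element type as above, and uses dummy timestamps since only the final count matters:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
DataStream<Group> parquetStream = env
        .addSource(new ParquetSourceFunction())
        .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Group>() {
            @Override
            public long extractTimestamp(Group element, long previousElementTimestamp) {
                return 0L; // dummy timestamp: ordering is irrelevant for a global count
            }

            @Override
            public Watermark getCurrentWatermark() {
                return new Watermark(0L); // static watermark; Flink still emits Long.MAX_VALUE once the source finishes
            }
        });
// the existing timeWindowAll(...).reduce(...).print() pipeline stays the same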
I have implemented a CEP pattern in Flink which works as expected when connecting to a local Kafka broker. But when I connect to a cluster-based cloud Kafka setup, the Flink CEP does not trigger.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//saves checkpoint
env.getCheckpointConfig().enableExternalizedCheckpoints(
CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
I am using an AscendingTimestampExtractor:
consumer.assignTimestampsAndWatermarks(
new AscendingTimestampExtractor<ObjectNode>() {
@Override
public long extractAscendingTimestamp(ObjectNode objectNode) {
long timestamp;
Instant instant = Instant.parse(objectNode.get("value").get("timestamp").asText());
timestamp = instant.toEpochMilli();
return timestamp;
}
});
I am also getting a warning message:
AscendingTimestampExtractor:140 - Timestamp monotony violated: 1594017872227 < 1594017873133
I also tried using AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks; neither of them works.
I have attached a Flink console screenshot where the watermark is not being assigned.
Updated Flink console screenshot.
Could anyone help?
CEP must first sort the input stream(s), which it does based on the watermarking. So the problem could be with watermarking, but you haven't shown us enough to debug the cause. One common issue is having an idle source, which can prevent the watermarks from advancing.
But there are other possible causes. To debug the situation, I suggest you look at some metrics, either in the Flink Web UI or in a metrics system if you have one connected. To begin, check if records are flowing, by looking at numRecordsIn, numRecordsOut, or numRecordsInPerSecond and numRecordsOutPerSecond at different stages of your pipeline.
If there are events, then look at currentOutputWatermark throughout the different tasks of your job to see if event time is advancing.
Update:
It appears you may be calling assignTimestampsAndWatermarks on the Kafka consumer, which will result in per-partition watermarking. In that case, if you have an idle partition, that partition won't produce any watermarks, and that will hold back the overall watermark. Try calling assignTimestampsAndWatermarks on the DataStream produced by the source instead, to see if that fixes things. (Of course, without per-partition watermarking, you won't be able to use an AscendingTimestampExtractor, since the stream won't be in order.)
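For example, a rough sketch of that second approach, where consumer is assumed to be the FlinkKafkaConsumer<ObjectNode> from your job and the 10-second out-of-orderness bound is just a guess to tune for your data:
DataStream<ObjectNode> events = env
        .addSource(consumer) // no assignTimestampsAndWatermarks on the consumer itself
        .assignTimestampsAndWatermarks(
                // the merged stream is no longer strictly ordered, so allow some out-of-orderness
                new BoundedOutOfOrdernessTimestampExtractor<ObjectNode>(Time.seconds(10)) {
                    @Override
                    public long extractTimestamp(ObjectNode objectNode) {
                        return Instant.parse(objectNode.get("value").get("timestamp").asText()).toEpochMilli();
                    }
                });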
In my Spring Boot application I am running a scheduled job which takes records from one table and saves them to another as an archive. I just ran into a problem: the records I have selected to be saved now number around 400,000 plus, and when the job executes it just keeps on going. Can JPA handle this much data being saved at once? Below are my configuration and the method.
#--Additional Settings for Batch Processing --#
spring.jpa.properties.hibernate.jdbc.batch_size=5000
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.generate_statistics=true
spring.jpa.properties.hibernate.jdbc.batch_versioned_data = true
rewriteBatchedStatements=true
cachePrepStmts=true
The agentAuditTrailArchiveList object has a size of 400,000 plus.
private void archiveAgentAuditTrail(int archiveTimePeriod) {
List<AgentAuditTrail> archivableAgentAuditTrails = agentAuditTrailRepository
.fecthArchivableData(LocalDateTime.now().minusDays(archiveTimePeriod));
List<AgentAuditTrailArchive> agentAuditTrailArchiveList = new ArrayList<>();
archivableAgentAuditTrails
.forEach(agentAuditTrail -> agentAuditTrailArchiveMapper(agentAuditTrailArchiveList, agentAuditTrail));
System.out.println(" agentAuditTrailArchiveList = " + agentAuditTrailArchiveList.size());
System.out.println("archivableAgentAuditTrails = " + archivableAgentAuditTrails.size());
agentAuditTrailArchiveRepository.saveAll(agentAuditTrailArchiveList);
agentAuditTrailRepository.deleteInBatch(archivableAgentAuditTrails);
}
Assuming you are actually doing this as a Spring Batch job, this is already handled. You just need to set the chunk size within the batch job's step config so it will only process a set number of records at a time. See the batch step config example below.
You never want to try to do the entire job as one commit unless it is always going to be really small. Doing it all in one commit leaves you open to a laundry list of failure conditions and data integrity problems.
@Bean
public Step someStepInMyProcess(FlatFileItemReader<MyDataObject> reader,
MyProcesssor processor, ItemWriter<MyDataObject> writer) {
return stepBuilders.get("SomeStepName")
.<MyDataObject, MyDataObject>chunk(50)
.reader( reader )
.processor( processor )
.writer(writer)
.transactionManager(transactionManager)
.build();
}
I have some fairly simple streaming code that aggregates data via time windows. The windows are on the large side (1 hour, with a 2-hour bound), and the values in the streams are metrics coming from hundreds of servers. I keep running out of memory, so I added the RocksDBStateBackend. This caused the JVM to segfault. Next I tried the FsStateBackend. Neither of these backends ever wrote any data to disk; they simply created a directory with the JobID. I'm running this code in standalone mode, not deployed. Any thoughts as to why the state backends aren't writing data, and why the job runs out of memory even when provided with 8 GB of heap?
final SingleOutputStreamOperator<Metric> metricStream =
objectStream.map(node -> new Metric(node.get("_ts").asLong(), node.get("_value").asDouble(), node.get("tags"))).name("metric stream");
final WindowedStream<Metric, String, TimeWindow> hourlyMetricStream = metricStream
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Metric>(Time.hours(2)) { // set how long metrics can come late
@Override
public long extractTimestamp(final Metric metric) {
return metric.get_ts() * 1000; // needs to be in ms since Java epoch
}
})
.keyBy(metric -> metric.getMetricName()) // key the stream so we can run the windowing in parallel
.timeWindow(Time.hours(1)); // setup the time window for the bucket
// create a stream for each type of aggregation
hourlyMetricStream.sum("_value") // we want to sum by the _value
.addSink(new MetricStoreSinkFunction(parameters, "sum"))
.name("hourly sum stream")
.setParallelism(6);
hourlyMetricStream.aggregate(new MeanAggregator())
.addSink(new MetricStoreSinkFunction(parameters, "mean"))
.name("hourly mean stream")
.setParallelism(6);
hourlyMetricStream.aggregate(new ReMedianAggregator())
.addSink(new MetricStoreSinkFunction(parameters, "remedian"))
.name("hourly remedian stream")
.setParallelism(6);
env.execute("flink test");
It is tough to say why you would run out of memory unless you have a very large number of metric names (that is the only explanation I can come up with based on the code you posted).
With respect to the disk writing, RocksDB will actually use a temporary directory by default for its actual database files. You can also pass an explicit directory during configuration; you would do this by calling setDbStoragePath(someDirectory) on the RocksDBStateBackend instance.
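For example, a minimal sketch along those lines, assuming the RocksDBStateBackend API with placeholder paths (the constructor declares IOException, so handle or declare it in real code):
RocksDBStateBackend rocksDb = new RocksDBStateBackend("file:///tmp/flink-checkpoints"); // checkpoint URI (placeholder)
rocksDb.setDbStoragePath("/tmp/flink-rocksdb"); // local directory for the actual RocksDB database files
env.setStateBackend(rocksDb);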
Somewhat confusingly, the FsStateBackend in fact only writes to disk during checkpointing; otherwise it is entirely heap based. So you likely did not see anything in the directory if you did not have checkpointing enabled, and that would also explain why you might still run out of memory when the FsStateBackend is used.
Assuming you do have the RocksDB (or any) state backend working, you can enable checkpointing by doing:
env.enableCheckpointing(5000); // value is in ms, i.e. however frequently you want to checkpoint
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000); // helps ensure the job can still make progress when checkpointing takes a while; for large state, checkpointing can take multiple seconds