Here is my question:
Will Flink finish the sink process and rename the .inprogress files to part-x-x files when it receives a stop command?
My Flink tasks (using flink-1.9.1) do not rename the .inprogress files to part-x-x files. But the source code says the following:
org.apache.flink.client.program.ClusterClient#stopWithSavepoint:
* Stops a program on Flink cluster whose job-manager is configured in this client's configuration.
* Stopping works only for streaming programs. Be aware, that the program might continue to run for
* a while after sending the stop command, because after sources stopped to emit data all operators
* need to finish processing.
The StreamingFileSink does have some limitations in this regard. See this thread from the user@flink.apache.org mailing list.
FLIP-46, which is being tracked as FLINK-13103, is needed in order to fix this. Until then, the StreamingFileSink will remain unable to transition unfinished files to the finished state when a job is stopped. This is described in the documentation as Important Note 2.
I saw this line in Flink documentation but can't figure out what 'detached mode' means. Please help. Thanks.
Run example program in detached mode:
./bin/flink run -d ./examples/batch/WordCount.jar
The Flink CLI runs jobs either in blocking or detached mode. In blocking mode, the CliFrontend (client) process keeps running, blocked, waiting for the job to complete -- after which it will print out some information. In the example below I ran a streaming job, which I cancelled from the WebUI after a few seconds:
$ flink run target/oscon-1.0-SNAPSHOT.jar
Starting execution of program
Program execution finished
Job with JobID b02da01c30585bfbc86a23446559987f has finished.
Job Runtime: 8673 ms
If you run in blocking mode, you can kill the CliFrontend (e.g., with ctrl-C) if you like, and the job will be unaffected, so long as it has run far enough to have submitted the job to the cluster.
In detached mode, the CliFrontend submits the job to the cluster and then exits straight away.
That means that the application is not attached (or bound) to your shell session. So if you close your terminal, the application will keep running (until it finishes its work). For a batch example that might not be a big problem: batch jobs process the given batch of data and then end. As soon as you switch to a streaming approach, the operations take place on an "infinite stream of data" and have no defined end.
Hope that helps.
In our ESB project, we have a lot of routes reading files with the file2 or ftp protocol for further processing. It is important to note that the files we read locally (file2 protocol) are on network shares mounted via different protocols (NFS, SMB).
Now we are facing issues with race conditions: both servers read the same file and process it. We have reduced the likelihood of that by using the preMove option, but from time to time the duplicate reading still occurs when both servers poll at the same millisecond. According to the documentation, an idempotentRepository together with readLock=idempotent could help, for example with Hazelcast.
However, I'm wondering whether this is a suitable solution for my issue, as I don't really know if it will work in all cases. Both servers read the file within milliseconds of each other, so the information that one server has already processed the file needs to be available in the Hazelcast grid at the point in time when the second server tries to read. Is that possible? What happens if there are minimal latencies (e.g. network related)?
In addition to that, the setting readLock=idempotent is only available for file2 but not for ftp. How to solve that issue there?
Again: the issue is not preventing duplicate files in general; it is solely about preventing the race condition.
AFAIK the idempotent repository should, in your case, prevent both consumers from reading the same file.
The latency between detection of the file and the entry in Hazelcast is not relevant, because the file consumers do not record what they have read. Instead they both ask the repository for an exclusive read lock. The first one wins, the second one is denied, so it continues to the next file.
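The "first one wins" behaviour can be sketched with an atomic add-if-absent on a shared map. This is only an illustration, assuming the repository offers such an atomic operation; FirstWinsLock, tryClaim and the file name are hypothetical and are not Camel's or Hazelcast's API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class FirstWinsLock {
    // Stand-in for a shared repository (hypothetical; a clustered map in a real setup).
    private final ConcurrentMap<String, String> claimed = new ConcurrentHashMap<>();

    /** Returns true only for the first consumer that claims the given file name. */
    public boolean tryClaim(String fileName, String consumerId) {
        // putIfAbsent is atomic: exactly one caller sees null for a given key.
        return claimed.putIfAbsent(fileName, consumerId) == null;
    }

    /** Returns the consumer that holds the claim, or null if nobody does. */
    public String owner(String fileName) {
        return claimed.get(fileName);
    }

    public static void main(String[] args) throws InterruptedException {
        FirstWinsLock repo = new FirstWinsLock();
        boolean[] won = new boolean[2];
        // Two "servers" race for the same file at the same moment.
        Thread a = new Thread(() -> won[0] = repo.tryClaim("orders.csv", "server-A"));
        Thread b = new Thread(() -> won[1] = repo.tryClaim("orders.csv", "server-B"));
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("exactly one winner: " + (won[0] ^ won[1]));
    }
}
```

Whichever thread reaches the map first gets the claim; the other is denied no matter how close together the two calls are, which is why the timing of the poll itself does not matter.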
If you want to minimize the potential of conflicts between the consumers you can turn on shuffle=true to randomize the ordering of files to consume.
For the problem with the missing readLock=idempotent on the ftp consumer: you could perhaps build a separate transfer route with only one consumer that downloads the files. Your file consumer route can then process them idempotently.
Spark Streaming provides the awaitTermination() API for termination. Is there a similar API to gracefully shut down a Flink streaming job after t seconds?
Your driver program (i.e. the main method) in Flink doesn't stay running while the streaming job executes. Your program should define a dataflow, call execute, and then terminate. In Spark, the driver program stays running (AFAIK), and awaitTermination relates to that.
Note that a Flink streaming dataflow continues to execute indefinitely, unless you're using a 'bounded' data source with a finite number of elements. You may also cancel or stop a job, and even take a savepoint upon stopping so that the job can be resumed from it later.
Let's assume that I have a File Consumer that polls a directory every 10 seconds and does some sort of processing on the files it has found there.
This processing may take 40 seconds for each file. Does this mean that during that interval the Consumer will poll the directory again and start another similar process?
Is there any way I can avoid that, and not allow the Consumer to poll if the previous poll has not finished?
The file consumer is single threaded, so it will not poll while it is already processing files.
When the consumer finishes, it will wait for 10s before polling again. This is controlled by the useFixedDelay option, which you can read more about in the documentation of the JDK ScheduledExecutorService that Camel uses as the scheduler.
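The fixed-delay behaviour can be observed with the JDK scheduler alone. The following is a minimal sketch with timings scaled down (400 ms of simulated work, 100 ms delay); FixedDelayPoller and its run method are hypothetical names for illustration and are not part of Camel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedDelayPoller {
    /** Runs the given number of simulated polls; returns [start0, end0, start1, end1, ...] in ms. */
    public static List<Long> run(int polls, long workMillis, long delayMillis)
            throws InterruptedException {
        List<Long> stamps = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(polls);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Fixed delay: the delay is counted from the END of one run to the START of the next.
        scheduler.scheduleWithFixedDelay(() -> {
            if (done.getCount() == 0) {
                return; // enough polls recorded, ignore any extra run
            }
            stamps.add(System.nanoTime() / 1_000_000);
            try {
                Thread.sleep(workMillis); // simulated file processing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            stamps.add(System.nanoTime() / 1_000_000);
            done.countDown();
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
        done.await();
        scheduler.shutdownNow();
        return new ArrayList<>(stamps);
    }

    public static void main(String[] args) throws InterruptedException {
        List<Long> t = run(3, 400, 100);
        for (int i = 2; i < t.size(); i += 2) {
            long gap = t.get(i) - t.get(i - 1);
            System.out.println("gap before poll " + (i / 2 + 1) + ": " + gap + " ms");
        }
    }
}
```

Even though the work takes longer than the delay, runs never overlap: each gap is measured from the end of the previous run, so it stays at roughly the configured delay.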
I have two Camel applications whose job is to read files from the same directory, process them and send them to a db consumer. To do this, my endpoints look like this:
file:/data/air?preMove=thread&readLock=fileLock&idempotent=true&idempotentRepository=#fileStore&include=AIROUTPUTCDR_(.*).AIR.gz&move=/data/air/success&moveFailed=error
As you can see, the application polls files from the poll directory based on filters, moves them into the thread directory, reads them there and then moves them to the success folder.
But with this flow, if I kill an application and start it again, the files that were being processed will not be processed, because they are still in the thread folder.
My question is, is there a way to resume reading the files which are just interrupted?
Thanks
No. If you do a hard kill on the application while a file has been pre-moved, you need to move these files manually from the preMove folder back into the source folder so they can be picked up again.
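The manual move could be automated with a small helper that runs before the route starts. This is only a sketch using plain java.nio: PreMoveRecovery and restore are hypothetical names, and the directories are passed in rather than assumed. Two caveats: with two applications sharing the folder, a file in the preMove directory may still be in use by the other instance, and whether the idempotent repository will accept a restored file again should be checked separately.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PreMoveRecovery {
    /** Moves every regular file in preMoveDir back into sourceDir; returns how many were moved. */
    public static int restore(Path preMoveDir, Path sourceDir) throws IOException {
        if (!Files.isDirectory(preMoveDir)) {
            return 0; // nothing to recover
        }
        List<Path> stranded;
        // Collect first, so the directory is not modified while it is being listed.
        try (Stream<Path> entries = Files.list(preMoveDir)) {
            stranded = entries.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : stranded) {
            // ATOMIC_MOVE avoids a half-copied file being visible to a poller.
            Files.move(file, sourceDir.resolve(file.getFileName()), StandardCopyOption.ATOMIC_MOVE);
        }
        return stranded.size();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PreMoveRecovery <preMoveDir> <sourceDir>
        int moved = restore(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("restored " + moved + " file(s)");
    }
}
```

Running it once at startup, while no instance is consuming, keeps the helper from pulling back files that are legitimately in progress.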