Flink example job of long-running streaming processing

I'm looking for a Flink example job of long-running streaming processing for test purposes. I checked the streaming/WordCount example included in the Flink project, but it doesn't seem to be long-running: after processing the input file, it exits.
Do I need to write one by myself? What is the simplest way to get an endless stream input?

The WordCount example exits because its source is finite: once the input has been fully processed, the job finishes.
The Flink Operations Playground is a nice example of a streaming job that runs forever.
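If you just want something self-contained for testing, an endless source is also easy to write yourself. Below is a minimal sketch of my own (not code from the playground) using Flink's SourceFunction interface, which is deprecated in recent Flink versions in favor of the new Source API but is still the shortest way to show the idea; it emits a word every second until the job is cancelled:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class EndlessWordSource {

        // Emits a word once per second, forever, until the job is cancelled.
        public static class EndlessSource implements SourceFunction<String> {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<String> ctx) throws Exception {
                String[] words = {"flink", "streaming", "forever"};
                int i = 0;
                while (running) {
                    ctx.collect(words[i++ % words.length]);
                    Thread.sleep(1000);
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.addSource(new EndlessSource())
               .print();  // stand-in for your real processing
            env.execute("endless word source");
        }
    }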

By definition, every streaming job runs "forever" as long as you don't define any halting criteria or cancel the job manually. I guess you are asking for a job which consumes from some kind of infinite source. The simplest one I could find is the Twitter example, which is included in the Flink project itself:
https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/scala/org/apache/flink/streaming/scala/examples/twitter/TwitterExample.scala
With some tweaking you could also use the socket example (which gives you more control over the source):
https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/scala/org/apache/flink/streaming/scala/examples/socket/SocketWindowWordCount.scala
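For reference, the core of the socket approach (shown here as a Java sketch of my own rather than the linked Scala code) is a one-line source plus whatever processing you want; feed it from a terminal with nc -lk 9000 and the job runs until you cancel it:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class SocketWordCount {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Reads lines from a socket; start a server first with: nc -lk 9000
            env.socketTextStream("localhost", 9000)
               .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                   for (String word : line.split("\\s+")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               })
               .returns(Types.TUPLE(Types.STRING, Types.INT))
               .keyBy(t -> t.f0)
               .sum(1)
               .print();

            env.execute("socket word count");
        }
    }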
Hope I got your question right and this helps.

Related

How to convince Flink to rename .inprogress files to part-xxx

We have unit tests for a streaming workflow (using Flink 1.14.4) with bounded sources, writing Parquet files. Because it's bounded, checkpointing is automatically disabled (as per the INFO message "Disabled Checkpointing. Checkpointing is not supported and not needed when executing jobs in BATCH mode."), which means setting ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true has no effect.
Is the only solution to run the harness with unbounded sources in a separate thread, and force it to terminate when no more data is written to the output? Seems awkward...
The solution, for others, is twofold (both steps are sketched below):
1. Make sure you're using FileSink, not the older StreamingFileSink.
2. Set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true.
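A minimal sketch of both steps, assuming Flink 1.14's configuration API; the path is a placeholder, and I'm using a row format here instead of Parquet just to keep the example self-contained:

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BoundedJobWithFinalCheckpoint {
        public static void main(String[] args) throws Exception {
            // Step 2: allow a final checkpoint after the bounded sources finish,
            // so the FileSink can commit its in-progress files.
            Configuration conf = new Configuration();
            conf.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
            env.enableCheckpointing(1000);

            // Step 1: use the unified FileSink, not the legacy StreamingFileSink.
            FileSink<String> sink = FileSink
                    .forRowFormat(new Path("/tmp/out"), new SimpleStringEncoder<String>("UTF-8"))
                    .build();

            env.fromElements("a", "b", "c")  // bounded source stands in for the real workflow
               .sinkTo(sink);

            env.execute("bounded job with final checkpoint");
        }
    }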

AWS SageMaker Training Job not saving model output

I'm running a training job on SageMaker. The job doesn't fully complete and hits the MaxRuntimeInSeconds stopping condition. The documentation says that when the job is stopping, the artifact will still be saved. I've attached the status progression of my training job below. It looks like the training job finished correctly, yet the output S3 folder is empty. Any ideas on what is going wrong here? The training data is located in the same bucket, so it should have everything it needs.
From the status progression, it seems that the training image download completed at 15:33 UTC, and by that time the stopping condition had already been triggered based on the MaxRuntimeInSeconds parameter you specified. From then, SageMaker allows 2 minutes (15:33 to 15:35) to save any available model artifact, but in your case the training process did not happen at all; the only thing that completed was downloading the pre-built image (containing the ML algorithm). Please refer to the following lines from the documentation, which say that saving the model is subject to the state the training process is in. Maybe you can try increasing MaxRuntimeInSeconds and running the job again. Also, please check the MaxWaitTimeInSeconds value if you have set one; it must be equal to or greater than MaxRuntimeInSeconds.
Please find the excerpt from the AWS documentation:
"The training algorithms provided by Amazon SageMaker automatically save the intermediate results of a model training job when possible. This attempt to save artifacts is only a best effort case as model might not be in a state from which it can be saved. For example, if training has just started, the model might not be ready to save."
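If you want to adjust these values, they live in the StoppingCondition of the training job request. A minimal sketch using the AWS SDK for Java v2 (the numbers are placeholders, not recommendations):

    import software.amazon.awssdk.services.sagemaker.model.StoppingCondition;

    public class StoppingConditionExample {
        public static void main(String[] args) {
            // MaxWaitTimeInSeconds (only used with managed spot training) must be
            // equal to or greater than MaxRuntimeInSeconds.
            StoppingCondition stoppingCondition = StoppingCondition.builder()
                    .maxRuntimeInSeconds(7200)   // enough time to get past image download and train
                    .maxWaitTimeInSeconds(7200)
                    .build();
            System.out.println(stoppingCondition);
        }
    }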
If MaxRuntimeInSeconds is exceeded, then model upload is only best-effort and really depends on whether the algorithm saved any state to /opt/ml/model at all prior to being terminated.
The two-minute wait period between 15:33 and 15:35 in the Stopping stage signifies the maximum time between a SIGTERM and a SIGKILL signal sent to your algorithm (see the SageMaker docs for more detail). If your algorithm traps the SIGTERM, it is supposed to use that as a signal to gracefully save its work and shut down before the SageMaker platform kills it forcibly with a SIGKILL signal 2 minutes later.
Given that the wait period in the Stopping step is exactly 2 minutes, and that the Uploading step started at 15:35 and completed almost immediately, it's likely that your algorithm did not take advantage of the SIGTERM warning and that nothing was saved to /opt/ml/model. To get a definitive answer as to whether this was indeed the case, please create a SageMaker forum post and the SageMaker team can private-message you to gather details of your job.
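The mechanics of trapping SIGTERM depend on your algorithm's language; purely as an illustration (everything here is a hypothetical sketch of mine, not SageMaker API), a JVM-based training container could hook the termination signal and dump whatever state it has to /opt/ml/model before the 2-minute SIGKILL deadline:

    import java.nio.file.Files;
    import java.nio.file.Path;

    public class TrainingEntryPoint {
        public static void main(String[] args) throws Exception {
            // JVM shutdown hooks run on SIGTERM, which SageMaker sends 2 minutes
            // before the SIGKILL when a stopping condition fires.
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try {
                    // Save whatever partial state exists; SageMaker then uploads
                    // the contents of /opt/ml/model to S3.
                    Files.writeString(Path.of("/opt/ml/model/checkpoint.txt"),
                            "partial model state");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }));

            // ... training loop would run here ...
        }
    }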

How to restore state after a restart from chosen source (not necessarily last checkpoint)

I've been trying to restart my Apache Flink from a previous checkpoint without much luck. I've uploaded the code to GitHub, here's the main class:
https://github.com/edu05/wordcount/blob/restart/src/main/java/edu/streaming/AppWithKafka.java
It's a simple word count program, only I'd like the program to continue with the counts it had already calculated after a restart.
I've read the docs and tried a few things, but there must be something silly I'm missing; could someone please help?
Also: The end goal is to produce the output of the wordcount program into a compacted kafka topic, how would I go about loading the state of the app by first consuming the compacted topic, which in this case serves as both the output and the checkpointing mechanism of the program?
Many thanks
Flink's checkpoints are meant for automatic restarts after failures. If you want to do a manual restart, then either use a savepoint or a retained (externalized) checkpoint.
If you've already tried this and are still having trouble, please provide more details about what you tried.
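A sketch of the retained-checkpoint route, with the standard Flink CLI resume command in a comment (paths and the topology are placeholders):

    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RetainedCheckpointSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.enableCheckpointing(10_000);
            // Keep checkpoints around when the job is cancelled, so you can restart from them.
            env.getCheckpointConfig()
               .enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // Stand-in for the real word-count topology.
            env.fromElements("to", "be", "or", "not", "to", "be")
               .print();

            // Resume later from the CLI (or from a savepoint taken with: flink savepoint <jobId>):
            //   flink run -s <checkpoint-or-savepoint-path> your-job.jar
            env.execute("wordcount with retained checkpoints");
        }
    }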

What's the right way to do long-running processes on app engine?

I'm working with Go on App Engine and I'm trying to build an API that needs to perform a long-running background task - in this case it needs to parse and chunk a big file out to task queues. I'd like it to return a 200 and close the user connection immediately and let the process keep running in the background until it's complete (this could take 5-10 minutes). Task queues alone don't really work for my use case because parsing the initial file can take more than the time limit for an API request.
At first I tried a goroutine as a solution for this problem. This failed because my App Engine context expired as soon as the parent function closed the user connection. (I suppose I could try writing a goroutine that doesn't require a context, but then I'd lose logging, and I'd need to fetch the entire remote file and pass it to the goroutine.)
Looking through the docs, it looks like App Engine used to have functionality to support exactly what I want to do: runtime.RunInBackground. But that functionality is now deprecated and the replacement isn't obvious.
Is there a "right" or recommended way to do background processing now?
I suppose I could put a link to my big file into a task queue, but if I understand correctly, even functions called through task queues have to complete execution within a specified amount of time (is it 90 seconds?). I need to be able to run longer than that.
Thanks for any help.
Try using:
appengine.BackgroundContext()
It should be long-lived, but it will only work on GAE Flex.

How can I have my polling service call SSIS sequentially?

I have a polling service that checks a directory for new files; if there is a new file, I call SSIS.
There are instances where I can't have SSIS run while another instance of SSIS is already processing another file.
How can I make SSIS run sequentially during these situations?
Note: parallel SSIS runs are fine in some circumstances but not in others; how can I achieve both?
Note: I don't want to go into when/why it can't run in parallel at times; just assume that sometimes it can and sometimes it can't. The main question is: how can I prevent an SSIS call if it has to run in sequence?
If you want to control the flow sequentially, consider a design where you enqueue requests (for invoking SSIS) into a queue data structure. Only the request at the head of the queue is processed at a time; as soon as it completes, the next request is dequeued. A sketch of this idea follows below.
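The queue idea is language-agnostic. As a hedged illustration, a polling service written in Java could route the must-be-sequential invocations through a single-threaded executor while letting independent ones run in parallel; launchSsisPackage here is a stand-in for however you actually invoke SSIS (e.g. running dtexec as a process):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SsisDispatcher {
        // One thread => queued requests run strictly one at a time, in order.
        private final ExecutorService sequential = Executors.newSingleThreadExecutor();
        // A pool for the files that are allowed to be processed in parallel.
        private final ExecutorService parallel = Executors.newFixedThreadPool(4);

        public void dispatch(String filePath, boolean mustRunSequentially) {
            Runnable job = () -> launchSsisPackage(filePath);
            if (mustRunSequentially) {
                sequential.submit(job);  // enqueued; runs after earlier sequential jobs finish
            } else {
                parallel.submit(job);
            }
        }

        private void launchSsisPackage(String filePath) {
            // Stand-in for the real SSIS invocation.
            System.out.println("Processing " + filePath);
        }
    }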
