Processing job vs. training job in SageMaker - amazon-sagemaker

What is the difference between a processing job and a training job? I am running a training job and did not launch a processing job, so why does my account show a processing job running?

A training job focuses on training a model: for example, it uses a PyTorch or TensorFlow container or a SageMaker built-in algorithm, it may have a hyperparameter tuning job associated with it, and it may include an MPI cluster for distributed training. Finally, it outputs a model artifact.
Processing jobs focus on pre/post-processing, and have an API and containers for ML data-processing tools such as scikit-learn pipelines or Dask/Spark.
Why do you see processing jobs: when a training job is profiled (enabled by default), a matching processing job is created to process the profiler report. You can disable profiling by passing the disable_profiler=True parameter to the estimator object.
Generally speaking, training jobs have a richer API than processing jobs and allow more customization. The right choice will depend on your specific use case.
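For example, a minimal sketch with the SageMaker Python SDK; the role ARN, entry-point script, framework version, instance type, and S3 path below are placeholders:

```python
from sagemaker.pytorch import PyTorch

# Hypothetical training job: all names, versions, and paths are placeholders.
estimator = PyTorch(
    entry_point="train.py",                                 # your training script
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # your execution role
    framework_version="1.13",
    py_version="py39",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    disable_profiler=True,   # no profiler report, so no companion processing job
)

estimator.fit({"training": "s3://my-bucket/train-data"})
```

With disable_profiler left at its default (False), a companion processing job for the profiler report appears alongside the training job.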

Related

Reuse Apache Beam Workflow Graph

I'm using Apache Beam to run Batch pipelines on Flink, running on AWS EMR.
Beam pipelines are being created and submitted to a long-running Flink cluster. However, I see that there is a cost associated with building the job graph and submitting it to EMR, consistently taking in excess of 2 minutes.
I want to reduce this time and was wondering whether there is a way to cache the job graph or prevent Beam from rebuilding the pipeline on every run of the driver cron job.
The job graph is a bit complex, so the 2-minute build time is justified. I'm just looking for ways to reduce it or eliminate it from happening every time.

Flink: What does it mean to embed Flink in other programs?

What does it mean to embed Flink in other programs?
In the link here - https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/api_concepts.html#basic-api-concepts - the second paragraph says Flink can be embedded in other programs.
I would like to know more about this, like how to achieve it. A sample program would be very helpful.
Using the above, is it possible to achieve the following?
Can we run Flink programs as individual actors?
Can we route data between two Flink programs?
Reason: I am asking the above two questions because my requirement is as follows.
I have a set of Flink jobs/programs, and based on a config file I want only certain jobs to process the input data; which jobs are selected keeps changing with the config file. So the Flink jobs (or the code in those jobs) need to be always available, and they need to be able to pass data and communicate with each other.
Kindly share your insights.
Running Flink embedded in other programs refers to Flink's local execution mode. The local execution mode runs a Flink program in your JVM, which means the job won't be executed in a distributed fashion.
What is currently not possible out of the box is to let Flink jobs control other Flink jobs. However, it is possible to build a Flink application which takes as input job descriptions and executes them. RBEA is an example of such a Flink application. The conceptual difference is that you don't have multiple Flink jobs but a single one which processes programs as input records.
Alternatively, you could take a look at Stateful functions which is a virtual actor framework built on top of Apache Flink. The idea is to provide a framework for building distributed stateful applications with strong consistency guarantees. With stateful functions, you would also build a single Flink application which processes events which could represent a form of computation.
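As a minimal sketch of the "embedded" idea, here is a pipeline driven from an ordinary program using the PyFlink API; when the script is run directly rather than submitted to a cluster, the job executes in an embedded local mini-cluster (in the Java/Scala API the analogous entry point is StreamExecutionEnvironment.createLocalEnvironment()):

```python
from pyflink.datastream import StreamExecutionEnvironment


def run_embedded_pipeline():
    # Run directly (not submitted via `flink run`), this executes the job
    # in a local embedded mini-cluster rather than on a remote cluster.
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    env.from_collection(["a", "b", "c"]) \
       .map(lambda s: s.upper()) \
       .print()

    env.execute("embedded-local-job")


if __name__ == "__main__":
    run_embedded_pipeline()
```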

Data/event exchange between jobs

Is it possible in Apache Flink to create an application that consists of multiple jobs which together build a pipeline to process some data?
For example, consider a process with an input/preprocessing stage, a business-logic stage, and an output stage.
In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.
Is it possible in Flink to build this and directly pipe the output of one job to the input of another (without external components)?
If yes, where can I find documentation about this and can it buffer data if one of the jobs is restarted?
If no, does anyone have experience with such a setup and point me to a possible solution?
Thank you!
If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.
Kafka can do buffering of course (up to some maximum amount of data), but that's not a solution to a persistent difference in performance, if the upstream job is generating data faster than the downstream job can consume it.
I imagine you could also use files as the 'bridge' between jobs (a streaming file sink and then a streaming file source), though that would typically add significant latency, as the downstream job has to wait for the upstream job to decide to complete a file before it can be consumed.
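As a rough sketch of the Kafka "bridge" described above, using the PyFlink DataStream API (the topic name, broker address, and connector jar path are placeholders, the connector import path varies by Flink version, and the same pattern applies in the Java/Scala API):

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer, FlinkKafkaProducer


def job_a():
    """Upstream job: publish results to the bridge topic."""
    env = StreamExecutionEnvironment.get_execution_environment()
    # The Kafka connector jar must be available to the job, e.g.:
    # env.add_jars("file:///path/to/flink-sql-connector-kafka.jar")
    sink = FlinkKafkaProducer(
        topic="bridge-topic",
        serialization_schema=SimpleStringSchema(),
        producer_config={"bootstrap.servers": "localhost:9092"},
    )
    env.from_collection(["record-1", "record-2"]).add_sink(sink)
    env.execute("job-a-upstream")


def job_b():
    """Downstream job: consume from the bridge topic."""
    env = StreamExecutionEnvironment.get_execution_environment()
    source = FlinkKafkaConsumer(
        topics="bridge-topic",
        deserialization_schema=SimpleStringSchema(),
        properties={"bootstrap.servers": "localhost:9092",
                    "group.id": "downstream-job"},
    )
    env.add_source(source).map(lambda s: s.upper()).print()
    env.execute("job-b-downstream")
```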
An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.
I've seen this done with purpose-built DSLs, PMML models, Javascript (via Rhino), Groovy, Java classloading, ...
You can use a broadcast stream to communicate/update the dynamic portions of the processing.
Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.

How to schedule a job in Apache Flink

I want to write a task that is triggered by Apache Flink every 24 hours and then processed by Flink. What is a possible way to do this? Does Flink provide any job-scheduling functionality?
Apache Flink is not a job scheduler but an event-processing engine, which is a different paradigm: Flink jobs are supposed to run continuously instead of being triggered on a schedule.
That said, you could achieve this by simply using an off-the-shelf scheduler (e.g. cron) that starts a job on your Flink cluster and then stops it after you receive some sort of notification that the job is done (e.g. through a Kafka topic), or simply use a timeout after which you assume the job is finished and stop it. But again, especially because Flink is not designed for this kind of use case, you would most likely run into edge cases that Flink does not support.
Alternatively, you can simply use a 24-hour tumbling window and run your task in the corresponding trigger function. See https://flink.apache.org/news/2015/12/04/Introducing-windows.html for details on that matter.
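A rough sketch of that tumbling-window approach with the PyFlink DataStream API (the bounded source here is a placeholder; in practice you would read from an unbounded source such as Kafka, and the windowed "task" is shown as a simple aggregation):

```python
from pyflink.common.time import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingProcessingTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()

# Placeholder source of (key, value) events; replace with a real unbounded source.
events = env.from_collection([("task", 1), ("task", 2), ("task", 3)])

(events
    .key_by(lambda e: e[0])
    .window(TumblingProcessingTimeWindows.of(Time.hours(24)))
    # The per-window "task" runs when the 24-hour window fires;
    # here it is just a sum as a stand-in for the real processing.
    .reduce(lambda a, b: (a[0], a[1] + b[1]))
    .print())

env.execute("daily-tumbling-window-job")
```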

How can I access state computed by an external Flink job? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), and then scoring each transaction with a machine learning model.
For now, the features are all kept in Flink state and the same job scores the enriched transaction. But I'd like to separate the feature-computation job from the scoring job, and I'm not sure how to do this.
Queryable state doesn't seem to fit for this, as the job id is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is separating this task into two jobs a bad idea with Flink? We do it this way in the same test with Kafka Streams, in order to avoid complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but am willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
And if you do want to use queryable state, you could use Flink's REST api to determine the job id.
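For example, a minimal sketch of looking up the job id over the REST API with plain Python (the JobManager address and job name are placeholders; the endpoint layout varies by Flink version, with recent versions exposing GET /jobs/overview while older ones such as 1.3 used /joboverview):

```python
import requests

JOBMANAGER = "http://localhost:8081"   # placeholder JobManager address


def find_job_id(job_name: str) -> str:
    """Return the id of a RUNNING job with the given name."""
    jobs = requests.get(f"{JOBMANAGER}/jobs/overview", timeout=10).json()["jobs"]
    running = [j for j in jobs if j["name"] == job_name and j["state"] == "RUNNING"]
    if not running:
        raise RuntimeError(f"No running job named {job_name!r}")
    return running[0]["jid"]


# job_id = find_job_id("feature-computation-job")   # hypothetical job name
```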
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.
