The Dataflow FAQ lists running custom (cron) job processes on Compute Engine as a way to schedule Dataflow pipelines. I am confused about how exactly that should be done: how do I start the Dataflow job from Compute Engine, and how do I set up the cron job?
Thank you!
You can use Cloud Scheduler to execute your Dataflow job. A Cloud Scheduler job fires at a target, which can be an HTTP/S endpoint, a Pub/Sub topic, or an App Engine application, so you can use your Dataflow template as the target. Review this external article for an example: Schedule Your Dataflow Batch Jobs With Cloud Scheduler, or, if you want to add more services to the interaction, Scheduling Dataflow Pipeline using Cloud Run, PubSub and Cloud Scheduler.
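For illustration, here is a rough sketch of the kind of request a Cloud Scheduler HTTP target would be configured to send in order to launch a classic Dataflow template. The project, region, template path, parameters, and token below are placeholders, and in Cloud Scheduler itself you would put this URL and body into the job definition with an attached OAuth/OIDC token rather than writing code:

    // Illustration only: the HTTP call that launches a Dataflow template, i.e. what a
    // Cloud Scheduler HTTP target would be configured to send. Project, region,
    // template path, and parameters are placeholders; real authentication (an OAuth
    // access token for a service account) is omitted for brevity.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class LaunchTemplateExample {
      public static void main(String[] args) throws Exception {
        String url = "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1"
            + "/templates:launch?gcsPath=gs://my-bucket/templates/my-template";

        String body = "{ \"jobName\": \"scheduled-run\", "
            + "\"parameters\": { \"inputFile\": \"gs://my-bucket/input.csv\" } }";

        HttpResponse<String> response = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer <ACCESS_TOKEN>")   // placeholder token
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build(),
            HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());
      }
    }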
I have this working on App Engine, but I imagine it is similar on Compute Engine.
Cron will hit an endpoint on your service at the frequency you specify. So you need to set up a request handler for that endpoint that launches the Dataflow job when hit (essentially, in your request handler you define your pipeline and then call 'run' on it).
That should be the basics of it. An extra step I take is to have the request handler for my cron job launch a Cloud Task, and then have the request handler for my Cloud Task launch the Dataflow job. I do this because I've noticed the 'run' command for pipelines sometimes takes a while, and Cloud Tasks have a 10-minute timeout, compared to the 30s timeout for cron jobs (or was it 60s).
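A minimal sketch of such a handler, assuming the Beam Java SDK and the Dataflow runner (the servlet name, project, bucket, and transforms are placeholders, not the poster's actual pipeline):

    // Sketch: an HTTP handler that builds a Beam pipeline and submits it to Dataflow
    // when cron (or a Cloud Task) hits it. Project, region, and bucket are placeholders.
    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class LaunchDataflowServlet extends HttpServlet {
      @Override
      protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setProject("my-project");               // placeholder project id
        options.setRegion("us-central1");
        options.setTempLocation("gs://my-bucket/tmp");  // placeholder bucket

        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply("ReadInput", TextIO.read().from("gs://my-bucket/input/*.csv"))
            .apply("WriteCopy", TextIO.write().to("gs://my-bucket/output/copy"));

        // run() submits the job to Dataflow; it can take a while, which is why the
        // answer above moves this call behind a Cloud Task instead of the cron handler.
        pipeline.run();
        resp.setStatus(200);
      }
    }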
Related
We are using an AWS EMR cluster setup for submitting Flink jobs. Today we log in to the EMR cluster and run the following steps to submit a job:
Stop the job with a savepoint.
Start the new job from the savepoint created in step 1.
Now I want to tie these steps to pipeline deployments. How can I do that? Are there any existing software/tools for automating Flink deployments?
We would also like to add features such as automatically rolling back a deployment to the previous version in case of exceptions/errors.
I am new to Flink and EMR cluster deployment. Currently we have a Flink job and we deploy it manually on an AWS EMR cluster via the Flink CLI stop/start-job commands.
I want to automate this process (update the Flink job jar, with savepoints, on every deployment that happens via pipelines) and need recommendations on possible approaches that could be explored.
We found an option to automate this process via the Flink REST API, which supports all Flink job operations:
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/
A sample project that uses the same approach: https://github.com/ing-bank/flink-deployer
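A rough sketch of that REST-API approach (this is not the flink-deployer code; the JobManager address, job id, jar id, and savepoint directory are placeholders, and the request/response field names should be double-checked against the REST API docs for your Flink version):

    // Sketch: redeploy a Flink job via the REST API -- take a savepoint while
    // cancelling the old job, then start the new jar from that savepoint.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FlinkRedeploy {
      private static final String FLINK = "http://emr-master:8081"; // placeholder JobManager address
      private static final HttpClient HTTP = HttpClient.newHttpClient();

      public static void main(String[] args) throws Exception {
        String jobId = args[0]; // running job to replace
        String jarId = args[1]; // id returned earlier by POST /jars/upload for the new jar

        // 1. Trigger a savepoint and cancel the running job.
        String triggerBody =
            "{\"target-directory\": \"s3://my-bucket/savepoints\", \"cancel-job\": true}";
        HttpResponse<String> trigger = HTTP.send(
            HttpRequest.newBuilder(URI.create(FLINK + "/jobs/" + jobId + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(triggerBody))
                .build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println("Savepoint trigger response: " + trigger.body());

        // 2. Parse the trigger id from the response above and poll
        //    GET /jobs/<jobId>/savepoints/<triggerId> until it is COMPLETED,
        //    then read the savepoint location from the status response.
        String savepointPath = "s3://my-bucket/savepoints/savepoint-xxxx"; // placeholder from the poll

        // 3. Start the new jar from that savepoint.
        String runBody = "{\"savepointPath\": \"" + savepointPath + "\"}";
        HttpResponse<String> run = HTTP.send(
            HttpRequest.newBuilder(URI.create(FLINK + "/jars/" + jarId + "/run"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(runBody))
                .build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println("Run response: " + run.body());
      }
    }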
My product has an ingestion service written in Java which runs Apache Camel routes. There are multiple ingestion service instances running on different VMs. The ingestion service uses a SQL Server 2016 database. When each route is executed, it creates a job in the database, and then each step of the job updates the job status until it reaches completion.
The requirement is to ensure that Camel routes are executed one after another and that no routes run in parallel (i.e., at the same time). How can this be accomplished?
One option is a home-grown solution where each route checks whether there is a job in running status and proceeds only if there is none. This would require polling the database, which does not seem like a good solution.
I would recommend a central service that manages the ingestion services over a REST API, JMX, or something similar. It would hand jobs to these services and track their status. If the manager service needs persistence of its own, it can use whatever fits: log files, JSON, an embedded SQLite database, a NoSQL database, etc. This removes the need for the ingestion services to know about the other ingestion services or their state.
You can look at CI/CD tools such as JetBrains TeamCity or Jenkins for reference on how they handle jobs across multiple agents/instances.
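A minimal sketch of that manager idea (the endpoints, status values, and class names are invented for illustration, not an existing API): a single dispatcher loop hands one job at a time to an ingestion instance and waits for it to finish before dispatching the next, so routes never overlap.

    // Sketch: a manager that serializes ingestion jobs across several service instances.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class IngestionManager {
      private final HttpClient http = HttpClient.newHttpClient();
      private final BlockingQueue<String> pendingJobs = new LinkedBlockingQueue<>();
      private final List<String> instances; // e.g. http://vm1:8080, http://vm2:8080 (placeholders)
      private int next = 0;

      public IngestionManager(List<String> instances) {
        this.instances = instances;
      }

      public void submit(String jobPayload) {
        pendingJobs.add(jobPayload);
      }

      /** Runs forever: take one job, start it on the next instance, wait for completion. */
      public void runLoop() throws Exception {
        while (true) {
          String job = pendingJobs.take();
          String instance = instances.get(next++ % instances.size());

          // Hypothetical endpoint on the ingestion service that starts a Camel route
          // and returns a job id.
          HttpResponse<String> started = http.send(
              HttpRequest.newBuilder(URI.create(instance + "/jobs"))
                  .header("Content-Type", "application/json")
                  .POST(HttpRequest.BodyPublishers.ofString(job))
                  .build(),
              HttpResponse.BodyHandlers.ofString());
          String jobId = started.body().trim();

          // Poll the instance's (hypothetical) status endpoint; only when the job is done
          // do we loop around and dispatch the next one, guaranteeing serial execution.
          while (true) {
            Thread.sleep(5_000);
            HttpResponse<String> status = http.send(
                HttpRequest.newBuilder(URI.create(instance + "/jobs/" + jobId + "/status"))
                    .GET().build(),
                HttpResponse.BodyHandlers.ofString());
            String state = status.body().trim();
            if ("COMPLETED".equals(state) || "FAILED".equals(state)) {
              break;
            }
          }
        }
      }
    }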
I'm using Apache Beam with the Flink runner and the Java SDK. It seems that deploying a job to Flink means building an 80-megabyte fat jar that gets uploaded to the Flink job manager.
Is there a way to easily deploy a lightweight SQL query to run as Beam SQL? Maybe have a job deployed that can somehow receive and run ad hoc queries?
I don't think that's possible at the moment, if I understand your question correctly. Right now the Beam SDK always builds a fat jar that implements the pipeline and includes all pipeline dependencies, and it cannot accept lightweight ad hoc queries.
If you're interested in a more interactive experience in general, you can look at the ongoing efforts to make Beam more interactive, for example:
SQL shell: https://s.apache.org/beam-sql-packaging . This describes a work-in-progress Beam SQL shell, which should allow you to quickly execute small SQL queries locally in a REPL environment, so that you can interactively explore your data and design the pipeline before submitting a long-running job. It does not change how the job gets submitted to Flink (or any other runner), though, so after you submit the long-running job you will likely still have to use the job management tools you currently have to control it.
Python: https://s.apache.org/interactive-beam . Describes an approach that wraps an existing runner in an interactive wrapper.
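For context, a minimal sketch of how a Beam SQL query is embedded in a Java pipeline today (the schema and field names are made up): the query string is part of the pipeline code, so it ships inside the same fat jar as everything else, which is why there is no ad hoc query endpoint on the running job.

    // Sketch: Beam SQL embedded in a Java pipeline via SqlTransform.
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.extensions.sql.SqlTransform;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    public class BeamSqlExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        Schema schema = Schema.builder().addStringField("name").addInt32Field("cnt").build();

        PCollection<Row> rows = p.apply(Create.of(
                Row.withSchema(schema).addValues("a", 1).build(),
                Row.withSchema(schema).addValues("b", 2).build())
            .withRowSchema(schema));

        // The SQL is compiled into the pipeline graph at construction time; there is
        // no separate query endpoint on the running job.
        rows.apply(SqlTransform.query(
            "SELECT name, SUM(cnt) AS total FROM PCOLLECTION GROUP BY name"));

        p.run();
      }
    }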
How do I schedule a cron job to run at app startup in my GAE application? I just want it to run once, at app startup.
Cron jobs are used for tasks that should run independently of your client application. For example, you may need a cron job to update the totals in your database at the end of the day, or to periodically clean up stale session objects, etc. Typically, you specify the time when cron jobs have to run: e.g. "every midnight".
If you need to execute a task when your application loads, you can simply execute it from your application.
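A minimal sketch of the "just execute it from your application" suggestion for a Java GAE app (the listener and method names are placeholders): a ServletContextListener runs once when the application instance loads, with no cron entry involved.

    // Sketch: run a one-time task when the application instance starts.
    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import javax.servlet.annotation.WebListener;

    @WebListener
    public class StartupTaskListener implements ServletContextListener {
      @Override
      public void contextInitialized(ServletContextEvent event) {
        // Run the one-time startup work here (placeholder).
        runStartupTask();
      }

      @Override
      public void contextDestroyed(ServletContextEvent event) {
        // Nothing to clean up for this example.
      }

      private void runStartupTask() {
        // e.g. warm caches, seed the datastore, etc.
        System.out.println("Startup task executed for this instance.");
      }
    }

Note that this fires once per instance that App Engine starts; if the task must run exactly once overall, you would still need a guard such as a flag stored in your datastore.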