What's the difference between Google Cloud Scheduler and GAE cron job? - google-app-engine

After reading the docs:
Cloud Scheduler - https://cloud.google.com/scheduler/
GAE cron job - https://cloud.google.com/appengine/docs/flexible/nodejs/scheduling-jobs-with-cron-yaml
Cloud Function Pub/Sub trigger - https://cloud.google.com/functions/docs/calling/pubsub
I think they are mostly the same.
I can use a GAE cron job + Pub/Sub + a Cloud Function to implement the same functionality that Cloud Scheduler provides.
In my understanding, it seems there are some differences between them:
Cloud Scheduler makes it more convenient to adjust the frequency. To change the frequency of a GAE cron job, you must update the config (for example, schedule: every 1 hours in cron.yaml) and redeploy.
There is no need to build the cron job architecture yourself (integrating GAE, the GAE cron service, Pub/Sub, Cloud Functions, etc.), which means you no longer need to write code to wire them together.
Am I correct? Are there any other differences?

You're right in that the Google Cloud Scheduler is kind of an evolution of the GAE cron job mechanism to make it more user-friendly and flexible. You can see that they are still related since the Cloud Scheduler doc specifies:
To use Cloud Scheduler your project must contain an App Engine app
that is located in one of the supported regions. If your project does
not have an App Engine app, you must create one.
Historically, the GAE cron job was the only cron service offered by the platform. You could only target a GAE handler to receive the request from cron. From there you could indeed perform actions like publishing to Pub/Sub, calling an HTTP Cloud Function or launching a Dataflow job, but you always had to deploy a GAE service to handle it, which wasn't optimal.
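To make the "GAE handler receives the request from cron" part concrete, here is a minimal sketch of such a handler as a plain WSGI app. It relies on the X-Appengine-Cron: true header that App Engine sets on cron requests. The names make_cron_app and publish_placeholder are made up for illustration, and the task body is only a stand-in for the real work.

```python
import json
from typing import Callable, Iterable

# WSGI form of the "X-Appengine-Cron" header that App Engine adds to
# requests issued by its cron service.
CRON_HEADER = "HTTP_X_APPENGINE_CRON"


def make_cron_app(task: Callable[[], str]):
    """Build a WSGI app that runs `task` only for App Engine cron requests."""

    def app(environ: dict, start_response) -> Iterable[bytes]:
        # Reject anything that does not carry the cron header.
        if environ.get(CRON_HEADER) != "true":
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"cron only\n"]
        result = task()
        body = json.dumps({"status": "ok", "result": result}).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]

    return app


def publish_placeholder() -> str:
    # Hypothetical stand-in for the real work (publish to Pub/Sub, call a
    # Cloud Function, launch a Dataflow job, ...).
    return "published"


app = make_cron_app(publish_placeholder)
```

The cron.yaml entry would point its url at whatever path this app is mounted on. The key point is that a deployed GAE service has to exist just to receive the tick.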
The new Cloud Scheduler makes it more straightforward to target Pub/Sub and Cloud Functions, but also any publicly available HTTP endpoint (which may be on-premises) and, of course, App Engine handlers. More targets may be added in the future for more use cases.
Finally, as you mentioned, the API exposed to manage it decouples it from App Engine and its cron.yaml file and makes it easier to create and update cron jobs dynamically.
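As a rough illustration of "create and update cron jobs dynamically", the sketch below builds the JSON body of a Cloud Scheduler job with a Pub/Sub target and then changes its schedule without touching any cron.yaml. The field names (name, schedule, timeZone, pubsubTarget.topicName, pubsubTarget.data) follow the Cloud Scheduler v1 REST resource. The project, location, job and topic names are placeholders, and actually sending the body to the API (with credentials) is left out.

```python
import base64
import json


def pubsub_job_body(project: str, location: str, job_id: str,
                    schedule: str, topic: str, payload: bytes,
                    time_zone: str = "Etc/UTC") -> dict:
    """Return a Cloud Scheduler v1 job resource targeting a Pub/Sub topic."""
    return {
        "name": f"projects/{project}/locations/{location}/jobs/{job_id}",
        "schedule": schedule,  # unix-cron format, e.g. "0 * * * *"
        "timeZone": time_zone,
        "pubsubTarget": {
            "topicName": f"projects/{project}/topics/{topic}",
            # bytes fields are base64-encoded in the JSON representation
            "data": base64.b64encode(payload).decode("ascii"),
        },
    }


def with_schedule(job: dict, schedule: str) -> dict:
    """Return a copy of `job` with a new schedule, as an update would send."""
    updated = dict(job)
    updated["schedule"] = schedule
    return updated


# Placeholder names, for illustration only.
job = pubsub_job_body("my-project", "us-central1", "hourly-pull",
                      "0 * * * *", "cron-topic", b"run")
print(json.dumps(with_schedule(job, "*/30 * * * *"), indent=2))
```

With the GAE cron service, the same frequency change means editing cron.yaml and redeploying.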

Related

Who starts the google cloud scheduler?

I can't find any documentation on how GCP's Cloud Scheduler works under the hood. An App Engine app is needed in the project, so I assume that the HTTP calls or Pub/Sub messages are started from App Engine.
Currently I can use Cloud Scheduler even without an App Engine app in the project. Apparently a Compute Engine setup with a permanently running VM is also sufficient. Could someone confirm my assumptions, or does anyone have sources on this?
I can't tell you how Cloud Scheduler works under the hood. I can only tell you that it works well!
I'm sure there is a VM, or a cluster of VMs, in Google's serverless environment, and your Cloud Scheduler job runs on it. It's serverless: what happens under the hood doesn't matter, it works, and that's what I want!
Now, the relation with App Engine can be confusing. In fact, there is no longer any relation between the two products, but you need the App Engine API activated on your project to use Cloud Scheduler. This kind of oddity is normal if you have been using Google Cloud for a while. In the beginning, only App Engine existed, and Datastore, Cloud Tasks and Cloud Scheduler were all "modules" of App Engine. Year after year, Google has refactored and extracted these modules into independent products, as you see them today. However, some relations remain, like the API activation.

In GCP, is there a canonical way to scrape data from an API?

I'm building an application that will periodically pull data from several APIs and write them to cloud storage for later processing by Dataflow. There are many different ways to do this so I wanted to sanity check before I jumped in.
My plan is this:
For each API, Cloud Scheduler will hit an endpoint for an App Engine app
The app will create a Compute Engine VM instance with a startup script that pulls the data from the API and writes it to storage
When done, the VM will hit another endpoint on the App Engine app that shuts down the VM.
Is this a reasonable way to perform this sort of action? Is there a simpler or more straight-forward method? Thank you in advance for the replies.
Cloud Scheduler can schedule Compute Engine instances without App Engine; however, it seems that you cannot create and delete the VM with this method.
You can just use App Engine cron jobs to schedule the tasks. Your App Engine cron handler can simply run the script that pulls data from the APIs. Maybe I am missing something, but why do you need a Compute Engine instance to run the script?
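For what "the cron handler simply runs the script" could look like, here is a small sketch of the pull-and-store step on its own. The function names, the raw/<source>/<date> object layout, and the fetch and store callables are all assumptions made for illustration. In a real app, fetch would wrap an HTTP client and store would wrap a Cloud Storage upload.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Dict, List


def object_name(source: str, when: datetime) -> str:
    """Build a dated object path, e.g. 'raw/weather/2024/01/31/0930.json'."""
    return f"raw/{source}/{when:%Y/%m/%d/%H%M}.json"


def pull_all(sources: Dict[str, str],
             fetch: Callable[[str], object],
             store: Callable[[str, bytes], None],
             now: datetime) -> List[str]:
    """Fetch every source once and store each response as one JSON object."""
    written = []
    for source, url in sorted(sources.items()):
        payload = json.dumps(fetch(url)).encode("utf-8")
        name = object_name(source, now)
        store(name, payload)
        written.append(name)
    return written


# In-memory stand-ins so the sketch runs without network or credentials.
fake_bucket: Dict[str, bytes] = {}
names = pull_all(
    {"weather": "https://example.com/weather"},  # hypothetical API URL
    fetch=lambda url: {"fetched_from": url},
    store=fake_bucket.__setitem__,
    now=datetime(2024, 1, 31, 9, 30, tzinfo=timezone.utc),
)
print(names)
```

Writing one dated object per source per run keeps the files easy to pick up later from Dataflow. Whether a single handler request has enough time and memory for your APIs is the thing to check before dropping the VM.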

Where to run continuous jobs on Google Cloud Platform?

I have a job that involves continuously listening to one or more websocket/mqtt feeds and forwarding this data to an event queue. This job is written in javascript and would run 24/7 in a continuous loop.
The most obvious solution is to run this job on a VM with Compute Engine, but I was wondering if there is a more elegant solution. Azure, for example, has WebJobs, which is well-suited to this kind of task. It even restarts the script if there is an error.
Is there some other component on GCP that can run this job in a "managed" way?
Google Cloud does not have a product similar to Azure WebJobs at the moment. Both the standard and flexible environments of Google Cloud App Engine do not currently support websockets. In order to use websockets you can use Compute Engine or Kubernetes Engine.
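If you go the Compute Engine route, the "restart the script on error" behaviour is something you would have to supply yourself. Below is a minimal sketch of a supervisor loop with exponential backoff. The question's job is JavaScript, so treat this Python version purely as an illustration of the idea. The name run_forever and its parameters are made up here.

```python
import time
from typing import Callable, Optional


def run_forever(job: Callable[[], None],
                max_restarts: Optional[int] = None,
                base_delay: float = 1.0,
                max_delay: float = 60.0,
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `job`, restarting it with exponential backoff whenever it raises.

    Returns the number of restarts once `job` returns normally. If
    `max_restarts` is set and exhausted, the last exception is re-raised.
    """
    restarts = 0
    delay = base_delay
    while True:
        try:
            job()
            return restarts  # job finished without error: stop supervising
        except Exception as exc:
            if max_restarts is not None and restarts >= max_restarts:
                raise
            print(f"job failed ({exc!r}); restarting in {delay:.0f}s")
            sleep(delay)
            restarts += 1
            delay = min(delay * 2, max_delay)


# Demo with a job that fails twice and then succeeds (no real waiting).
attempts = {"count": 0}


def flaky_listener() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("feed dropped")


print(run_forever(flaky_listener, sleep=lambda seconds: None))
```

On Kubernetes Engine, the pod restart policy plays a similar role, so a loop like this is mostly relevant when running directly on a VM.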

Google app engine cron job scheduling setup and pricing

I want to trigger an HTTPS endpoint every minute. I was using cron-job.org, but it is not that reliable and goes down often. I have looked at two options: Microsoft Azure Scheduler and the Google App Engine cron scheduler. Microsoft's scheduler pricing is very clear; however, I don't understand how to set up a Google cron job, or what it costs to run the cron job every minute.
To use Google's cron scheduler, you will have to pay for the App Engine app running 24x7, whereas Azure Scheduler is a true microservice: you only pay based on the number of jobs/job collections, not the underlying resources consumed.
Unlike Microsoft's scheduler which appears to be an independently configurable and billable service, the GAE cron service can only be a part of a GAE app.
A standard environment GAE app is charged by instance-hours plus the various services it uses. See App Engine Pricing. But it also comes with fairly generous free daily quotas.
A simple app which would only make a few requests per minute - like the one you describe - should have no problems staying within the free daily limits.
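A quick back-of-the-envelope check of that claim, as a sketch. The worst case assumed here is a single instance that never idles out, i.e. 24 instance-hours per day. The 28 free instance-hours per day is an assumed figure, so check it against the App Engine Pricing page before relying on it.

```python
MINUTES_PER_DAY = 24 * 60


def requests_per_day(interval_minutes: int) -> int:
    """Number of cron-triggered requests per day for a given interval."""
    return MINUTES_PER_DAY // interval_minutes


def fits_free_tier(instance_hours: float, free_instance_hours: float) -> bool:
    """True when the daily instance-hours stay within the free quota."""
    return instance_hours <= free_instance_hours


# Assumption: one instance kept alive all day by the once-a-minute requests.
WORST_CASE_INSTANCE_HOURS = 24.0
# Assumption: free daily quota; verify on the App Engine Pricing page.
ASSUMED_FREE_INSTANCE_HOURS = 28.0

print(requests_per_day(1))  # 1440 requests per day
print(fits_free_tier(WORST_CASE_INSTANCE_HOURS, ASSUMED_FREE_INSTANCE_HOURS))
```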
Check the Quickstart to see how to get a basic app skeleton running. You already have the cron service doc, you only need the cron.yaml Reference to add a cron service to your app.

How to run Google Cloud Dataflow job from App Engine?

After reading Cloud Dataflow docs, I am still not sure how can I run my dataflow job from App Engine. Is it possible? Is it relevant whether my backend written in Python or in Java? Thanks!
Yes, it is possible. You need to use "streaming execution", as mentioned here.
Using Google Cloud Pub/Sub as a streaming source, you can use it as the "trigger" of your pipeline.
From App Engine you can do the publish step to the Pub/Sub topic with the REST API.
One way would indeed be to use Pub/Sub from within App Engine to let Cloud Dataflow know when new data is available. The Cloud Dataflow job would then run continuously and App Engine would provide the data for processing.
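A minimal sketch of that publish step against the Pub/Sub REST API (topics.publish), using only the standard library. The project, topic and message are placeholders, and obtaining the OAuth access token is assumed to happen elsewhere. The request is built but not sent here.

```python
import base64
import json
import urllib.request


def build_publish_request(project: str, topic: str, message: bytes,
                          access_token: str) -> urllib.request.Request:
    """Build the HTTP request that publishes one message to a topic."""
    url = (f"https://pubsub.googleapis.com/v1/projects/{project}"
           f"/topics/{topic}:publish")
    # Message data must be base64-encoded in the JSON body.
    body = {"messages": [{"data": base64.b64encode(message).decode("ascii")}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {access_token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; "TOKEN" stands for a real OAuth access token.
req = build_publish_request("my-project", "new-data",
                            b"gs://my-bucket/input.csv", "TOKEN")
print(req.full_url)
# urllib.request.urlopen(req) would actually send it.
```

The streaming Dataflow job subscribed to that topic then picks the message up, so App Engine never has to start the pipeline itself.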
A different approach would be to add the code that sets up the Cloud Dataflow pipeline to a class in App Engine (including the Dataflow SDK in your GAE project) and set the job options programmatically as explained here:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params
Make sure to set the 'runner' option to DataflowPipelineRunner, so it executes asynchronously on the Google Cloud Platform. Since the pipeline runner (that actually runs your pipeline) does not have to be the same as the code that initiates it, this code (up until pipeline.run()) could be in App Engine.
You can then add an endpoint or servlet to GAE that, when called, runs the code that sets up the pipeline.
To go one step further and schedule it, you could have a cron job in GAE that calls the endpoint that initiates the pipeline.
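For a Python backend, setting the job options programmatically could be sketched as below. The answer above names the Java SDK's DataflowPipelineRunner. The flags here are the Apache Beam Python SDK equivalents, with DataflowRunner as the runner. All values are placeholders, and the list is only built, not handed to Beam, so the snippet runs without the SDK installed.

```python
from typing import List


def dataflow_flags(project: str, region: str, temp_location: str,
                   job_name: str) -> List[str]:
    """Build the command-line style options for a Dataflow-run pipeline."""
    return [
        "--runner=DataflowRunner",  # execute on the Dataflow service
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        f"--job_name={job_name}",
    ]


# Placeholder values, for illustration only.
flags = dataflow_flags("my-project", "us-central1",
                       "gs://my-bucket/tmp", "nightly-import")
print(flags)
# With the SDK installed, these would be passed as
# apache_beam.options.pipeline_options.PipelineOptions(flags).
```

The endpoint that the GAE cron job calls would build these options, construct the pipeline and call run().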
There might be a way to submit your Dataflow job from App Engine, but this is not something that's actively supported, as suggested by the lack of docs. App Engine's runtime environment makes it more difficult to do some of the operations required, e.g. obtaining credentials and submitting Dataflow jobs.

Resources