How to run Google Cloud Dataflow job from App Engine?

After reading the Cloud Dataflow docs, I am still not sure how I can run my Dataflow job from App Engine. Is it possible? Is it relevant whether my backend is written in Python or in Java? Thanks!

Yes, it is possible; you need to use the "Streaming execution" as mentioned here.
Using Google Cloud Pub/Sub as a streaming source, you can use it as the "trigger" of your pipeline.
From App Engine you can perform the "Pub" action to the Pub/Sub topic with the REST API.
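For illustration, a minimal Python sketch of that "Pub" step, using the google-cloud-pubsub client library (a wrapper over the REST API); the project and topic names are placeholders:

```python
# Minimal sketch: publishing a message to Pub/Sub from App Engine code.
# 'my-project-id' and 'dataflow-trigger' are placeholder names; assumes
# the google-cloud-pubsub client library (a wrapper over the REST API).
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic = publisher.topic_path('my-project-id', 'dataflow-trigger')

# Message data must be bytes; extra kwargs become string attributes.
future = publisher.publish(topic, b'new data available', source='gae')
print('published message id:', future.result())
```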

One way would indeed be to use Pub/Sub from within App Engine to let Cloud Dataflow know when new data is available. The Cloud Dataflow job would then run continuously and App Engine would provide the data for processing.
A different approach would be to add the code that sets up the Cloud Dataflow pipeline to a class in App Engine (including the Dataflow SDK in your GAE project) and set the job options programmatically as explained here:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params
Make sure to set the 'runner' option to DataflowPipelineRunner so that it executes asynchronously on the Google Cloud Platform. Since the pipeline runner (which actually runs your pipeline) does not have to be the same environment as the code that initiates it, this code (up until pipeline.run()) could live in App Engine.
You can then add an endpoint or servlet to GAE that when called, runs the code that sets up the pipeline.
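For illustration, a minimal sketch of that setup in Python with the Apache Beam SDK (the answer names the Java SDK's DataflowPipelineRunner; Beam's Python SDK calls the equivalent runner DataflowRunner). The project, bucket, and pipeline steps are placeholders:

```python
# Sketch: construct a pipeline and submit it to Dataflow asynchronously.
# Project and bucket names are placeholders; assumes apache_beam installed.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def launch_dataflow_job():
    options = PipelineOptions(
        runner='DataflowRunner',             # run on GCP, not locally
        project='my-project-id',             # placeholder
        region='us-central1',
        temp_location='gs://my-bucket/tmp',  # placeholder
    )
    pipeline = beam.Pipeline(options=options)
    (pipeline
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')
     | 'Lengths' >> beam.Map(len)
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/lengths'))
    # run() submits the job and returns without waiting for completion,
    # so an App Engine handler can call this and respond immediately.
    pipeline.run()
```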
To take the scheduling further, you could have a cron job in GAE that calls the endpoint that initiates the pipeline.

There might be a way to submit your Dataflow job from App Engine, but this is not something that's actively supported, as suggested by the lack of docs. App Engine's runtime environment makes some of the required operations more difficult, e.g. obtaining credentials and submitting Dataflow jobs.

Related

Which GCP component to use to fetch data from an API

I'm a little bit confused between GCP components; here is my use case:
Daily, I need to fetch data from an external API (the API returns JSON data), store it in GCS, then load it into BigQuery. I have already created the Python script that fetches the data and stores it in GCS, and I'm confused about which component to use for deployment:
Cloud Run: from the docs it is used for deploying services, so I think it's a bad choice
Cloud Functions: I think it would work, but it is used for event-based processing (through single-purpose functions...)
Composer: (I'll use Composer to orchestrate tasks, such as preprocessing of files in GCS, loading them into BQ, and transferring them to an archive bucket) through KubernetesPodOperator, create a task that triggers the script to get the data
Compute Engine: I don't think it's the best choice since there are better ones
App Engine: I also don't think it's a good idea since it is used to deploy and scale web apps...
(Correct me if I'm wrong in what I said.) So my question is: which GCP component is suited for this kind of task?
Cloud Run: from the docs it is used for deploying services
App Engine: I also don't think it's a good idea since it is used to deploy and scale web apps...
I think you've misunderstood. Both Cloud Run and Google App Engine (GAE) are serverless offerings from Google Cloud. You deploy your code to either of them, and you can invoke their URLs, which in turn will cause your code to execute and do things like fetch data from somewhere and save it somewhere.
Google App Engine has a shorter timeout than Cloud Run (I can't remember if Cloud Run has a timeout). So, if your code will take a long time to run, you don't want to use Google App Engine (unless you make it a background task), and if you don't need a UI, then you don't need GAE.
For your specific scenario, you can deploy your code to Cloud Run and use Cloud Scheduler to schedule it to be invoked at specific times. We have that architecture running in a similar scenario (we have a task that runs once daily; it's deployed to Cloud Run; Cloud Scheduler invokes the endpoint; it runs and saves data to Datastore linked to an App Engine app). We wrote a blog article on deploying to Cloud Run and another on securing your Cloud Run service (based on our experience in the scenario described above).
GAE Timeout:
Every request to a Google App Engine (Standard) app must complete within 1-10 minutes for automatic scaling and up to 24 hours for basic scaling (see documentation). For Google App Engine Flexible, the timeout is 60 minutes (documentation).
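As a minimal sketch of that architecture (assuming a Flask app on Cloud Run with the requests and google-cloud-storage libraries; the API URL, bucket, and route are placeholders), Cloud Scheduler would simply be pointed at the /fetch endpoint once a day:

```python
# Sketch: HTTP endpoint that fetches JSON from an external API and
# stores it in GCS. API_URL and BUCKET_NAME are placeholders.
import datetime
import json

import requests
from flask import Flask
from google.cloud import storage

app = Flask(__name__)
API_URL = 'https://api.example.com/data'  # placeholder external API
BUCKET_NAME = 'my-ingest-bucket'          # placeholder bucket

@app.route('/fetch', methods=['POST'])
def fetch():
    # Cloud Scheduler invokes this endpoint on the configured schedule.
    payload = requests.get(API_URL, timeout=60).json()
    blob_name = datetime.date.today().isoformat() + '.json'
    storage.Client().bucket(BUCKET_NAME).blob(blob_name).upload_from_string(
        json.dumps(payload), content_type='application/json')
    return f'wrote gs://{BUCKET_NAME}/{blob_name}', 200
```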

Who starts the Google Cloud Scheduler?

I can't find any documentation on how GCP's Cloud Scheduler works under the hood. An App Engine app is needed in the project, so I assume that the HTTP calls or Pub/Sub messages are started from the App Engine app.
Currently I can use Cloud Scheduler even without an App Engine app in the project. Apparently a Compute Engine setup that contains a permanently running VM is also sufficient. Could someone confirm my assumptions, please, or does anyone have sources on this?
I can't tell you how Cloud Scheduler works under the hood; I can just tell you that it works well!
I'm sure there is a VM, or a cluster of VMs, in Google's serverless environment, and your Cloud Scheduler job is set up on it. It's serverless: what's under the hood doesn't matter; it works, and that's what I want!
Now, the relation with App Engine can be confusing. In fact, there is no longer any relation between the two products, but you need the App Engine API activated on your project to use Cloud Scheduler. This strange requirement makes sense if you have been using Google Cloud for a while. At the beginning, only App Engine existed, and Datastore, Cloud Tasks, and Cloud Scheduler were all "modules" of App Engine. Year after year, Google refactored and extracted these modules into the independent products you see today. However, some relations are still present, like the API activation.

What's the difference between Google Cloud Scheduler and GAE cron job?

After reading the docs
cloud scheduler - https://cloud.google.com/scheduler/
GAE cron job -
https://cloud.google.com/appengine/docs/flexible/nodejs/scheduling-jobs-with-cron-yaml
cloud function pub/sub trigger -
https://cloud.google.com/functions/docs/calling/pubsub
I think they are mostly the same.
I can use GAE cron job + Pub/Sub + Cloud Function to implement the same functionality that Cloud Scheduler provides.
In my understanding, it seems there are some differences between them:
Cloud Scheduler makes it more convenient to adjust the frequency. To update the frequency of a GAE cron job, you must update the config, like schedule: every 1 hours in cron.yaml, and redeploy.
There is no need to implement the cron job architecture (integrating GAE, the GAE cron service, Pub/Sub, Cloud Functions, etc.) yourself, which means you no longer need to write code to combine them.
Am I correct? Or, any other differences?
You're right in that Google Cloud Scheduler is kind of an evolution of the GAE cron job mechanism, making it more user-friendly and flexible. You can see that they are still related, since the Cloud Scheduler doc specifies:
To use Cloud Scheduler your project must contain an App Engine app
that is located in one of the supported regions. If your project does
not have an App Engine app, you must create one.
Historically, the GAE cron job was the only cron service offered by the platform. You could only target a GAE handler to receive the request from cron. From there you could indeed perform actions like publishing on Pub/Sub, calling an HTTP Cloud Function, or launching a Dataflow job, but you always had to deploy a GAE service to handle it, which wasn't optimal.
The new Cloud Scheduler makes it more straightforward to use with Pub/Sub and Cloud Functions, but also with any publicly available HTTP endpoint (possibly on-premises), and of course App Engine handlers. More targets may be added in the future for more use cases.
Finally, as you mentioned, the API exposed to manage it decouples it from App Engine and its cron.yaml file and makes it easier to create and update cron jobs dynamically.
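To illustrate that decoupling, here is a sketch of creating a job through the Cloud Scheduler API with the google-cloud-scheduler Python client; the project, region, job name, and target URL are all placeholders:

```python
# Sketch: create a Cloud Scheduler job programmatically -- no cron.yaml
# redeploy needed. Project, region, and URI below are placeholders.
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = client.common_location_path('my-project-id', 'us-central1')

job = scheduler_v1.Job(
    name=parent + '/jobs/hourly-refresh',
    schedule='0 * * * *',  # unix-cron: every hour
    time_zone='Etc/UTC',
    http_target=scheduler_v1.HttpTarget(
        uri='https://example.com/tasks/refresh',  # any public endpoint
        http_method=scheduler_v1.HttpMethod.POST,
    ),
)
created = client.create_job(parent=parent, job=job)
print('created', created.name)
```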

In GCP, is there a canonical way to scrape data from an API?

I'm building an application that will periodically pull data from several APIs and write it to Cloud Storage for later processing by Dataflow. There are many ways to do this, so I wanted to sanity-check before I jumped in.
My plan is this:
For each API, Cloud Scheduler will hit an endpoint for an App Engine app
The app will create a Compute Engine VM instance with a startup script that pulls the data from the API and writes it to storage (rough sketch after this list)
When done, the VM will hit another endpoint on the App Engine app that shuts down the VM.
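Roughly, step 2 would look like this (a sketch using the Compute Engine API via googleapiclient; the project, zone, image, and script paths are placeholders):

```python
# Sketch: App Engine handler creating a VM whose startup script pulls
# the API data. All names and paths below are placeholders.
import googleapiclient.discovery

PROJECT = 'my-project-id'
ZONE = 'us-central1-a'

STARTUP_SCRIPT = '''#!/bin/bash
python3 /opt/scripts/pull_api.py  # fetch the API, write to GCS
'''

def create_worker_vm(name):
    compute = googleapiclient.discovery.build('compute', 'v1')
    config = {
        'name': name,
        'machineType': f'zones/{ZONE}/machineTypes/e2-small',
        'disks': [{
            'boot': True,
            'autoDelete': True,
            'initializeParams': {
                'sourceImage':
                    'projects/debian-cloud/global/images/family/debian-11',
            },
        }],
        'networkInterfaces': [{'network': 'global/networks/default'}],
        'metadata': {'items': [
            {'key': 'startup-script', 'value': STARTUP_SCRIPT},
        ]},
    }
    return compute.instances().insert(
        project=PROJECT, zone=ZONE, body=config).execute()
```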
Is this a reasonable way to perform this sort of action? Is there a simpler or more straightforward method? Thank you in advance for the replies.
Cloud Scheduler can schedule Compute Engine without App Engine; however, it seems that you cannot create and delete the VM with this method.
You can just use App Engine cron jobs to schedule the tasks. Your App Engine cron handler can simply run the script that pulls data from the APIs. Maybe I am missing something, but why do you need a Compute Engine instance to run the script?

Where to run continuous jobs on Google Cloud Platform?

I have a job that involves continuously listening to one or more websocket/MQTT feeds and forwarding this data to an event queue. This job is written in JavaScript and would run 24/7 in a continuous loop.
The most obvious solution is to run this job on a VM with Compute Engine, but I was wondering if there is a more elegant solution. Azure, for example, has WebJobs, which is well-suited to this kind of task. It even restarts the script if there is an error.
Is there some other component on GCP that can run this job in a "managed" way?
Google Cloud does not have a product similar to Azure WebJobs at the moment. Neither the standard nor the flexible environment of Google App Engine currently supports WebSockets. In order to use WebSockets, you can use Compute Engine or Kubernetes Engine.
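For context, a sketch of the workload itself (the question's job is written in JavaScript; shown in Python here for illustration): a loop that listens to a websocket feed, forwards each message to Pub/Sub, and reconnects on failure. The feed URL, project, and topic are placeholders, and it assumes the websockets and google-cloud-pubsub libraries:

```python
# Sketch: 24/7 listener forwarding a websocket feed into Pub/Sub.
# FEED_URL, project, and topic names are placeholders.
import asyncio

import websockets
from google.cloud import pubsub_v1

FEED_URL = 'wss://feed.example.com/stream'  # placeholder feed
publisher = pubsub_v1.PublisherClient()
topic = publisher.topic_path('my-project-id', 'feed-events')

async def listen_forever():
    while True:  # continuous loop; reconnect after any failure
        try:
            async with websockets.connect(FEED_URL) as ws:
                async for message in ws:
                    data = (message if isinstance(message, bytes)
                            else message.encode('utf-8'))
                    publisher.publish(topic, data)
        except Exception as err:
            print('feed error, reconnecting in 5s:', err)
            await asyncio.sleep(5)

if __name__ == '__main__':
    asyncio.run(listen_forever())
```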
