which GCP component to use to fetch data from an API - google-app-engine

I'm a little bit confused between GCP components. Here is my use case:
Daily, I need to fetch data from an external API (the API returns JSON data), store it in GCS, then load it into BigQuery. I have already written the Python script that fetches the data and stores it in GCS, and I'm confused about which component to use for deployment:
Cloud Run: from the docs it is used for deploying services, so I think it's a bad choice
Cloud Functions: I think it would work, but it is meant for event-based processing (through single-purpose functions...)
Composer: I'll use Composer to orchestrate tasks such as preprocessing the files in GCS, loading them into BQ, and transferring them to an archive bucket; through a KubernetesPodOperator I could create a task that triggers the script to fetch the data
Compute Engine: I don't think it's the best choice since there are better options
App Engine: I also don't think it's a good idea since it is used to deploy and scale web apps...
(Correct me if I'm wrong in what I said.) So my question is: which GCP component is meant for this kind of task?

Cloud Run: from the docs it is used for deploying services
App Engine: I also don't think it's a good idea since it is used to deploy and scale web apps...
I think you've misunderstood. Both Cloud Run and Google App Engine (GAE) are serverless offerings from Google Cloud. You deploy your code to either of them, and you can then invoke their URLs, which causes your code to execute and do things like fetch data from somewhere and save it somewhere.
Google App Engine has a shorter timeout than Cloud Run (Cloud Run also has a request timeout, but it is configurable). So if your code will take a long time to run, you don't want to use Google App Engine (unless you make it a background task), and if you don't need a UI, then you don't need GAE.
For your specific scenario, you can deploy your code to Cloud Run and use Cloud Scheduler to invoke it at specific times. We have that architecture running in a similar scenario: a task that runs once daily is deployed to Cloud Run, Cloud Scheduler invokes the endpoint, and the task runs and saves data to a Datastore linked to an App Engine app. We wrote a blog article on deploying to Cloud Run and another on securing your Cloud Run service (based on our experience with the scenario described above).
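To make that concrete, here is a minimal sketch of what such a Cloud Run service could look like, assuming a Flask app with the requests and google-cloud-storage libraries; the API URL, bucket name and object path are placeholders, not values from the question:

```python
# Minimal sketch of a Cloud Run service for a daily fetch-and-store job.
# API_URL and BUCKET are hypothetical placeholders; adapt them to your project.
import datetime
import json
import os

import requests
from flask import Flask
from google.cloud import storage

app = Flask(__name__)

API_URL = os.environ.get("API_URL", "https://example.com/api/data")  # placeholder
BUCKET = os.environ.get("BUCKET", "my-raw-data-bucket")              # placeholder

@app.route("/fetch", methods=["POST"])
def fetch():
    # Pull the JSON payload from the external API.
    payload = requests.get(API_URL, timeout=60).json()

    # Write it to GCS, one object per day.
    blob_name = f"raw/{datetime.date.today().isoformat()}.json"
    storage.Client().bucket(BUCKET).blob(blob_name).upload_from_string(
        json.dumps(payload), content_type="application/json"
    )
    return f"gs://{BUCKET}/{blob_name}", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```

Cloud Scheduler then simply calls the /fetch endpoint once a day, ideally with an OIDC token so the service does not have to allow unauthenticated invocations.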
GAE Timeout:
Every request to Google App Engine (Standard) must complete within 1 to 10 minutes for automatic scaling and up to 24 hours for basic scaling (see the documentation). For Google App Engine Flexible, the timeout is 60 minutes (documentation).

Related

Who starts the Google Cloud Scheduler?

I can't find any documentation on how the GCP scheduler works under the hood. An App Engine app is needed in the project, so I assume that the HTTP calls or Pub/Sub messages are started from App Engine.
Currently I can use Cloud Scheduler even without an App Engine app in the project. Apparently a Compute Engine setup that contains a permanently running VM is also sufficient. Could someone confirm my assumptions, or does anyone have sources on this?
I can't tell you how Cloud Scheduler works under the hood. I can just tell you that it works well!
I'm sure there is a VM, or a cluster of VMs, in Google's serverless environment, and that your Cloud Scheduler job is set up on it. It's serverless; what's under the hood doesn't matter. It works, and that's what I want!
Now, the relation with App Engine can be confusing. In fact, there is no longer a relation between the two products, but you need the App Engine API activated on your project to use Cloud Scheduler. This strange requirement makes sense if you have been using Google Cloud for a while. At the beginning, only App Engine existed, and Datastore, Cloud Tasks and Cloud Scheduler were all "modules" of App Engine. Year after year, Google refactored and extracted these modules into the independent products you see today. However, some relations are still present, like the API activation.
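For completeness, a job itself can be created with the gcloud CLI or with the google-cloud-scheduler Python client; here is a rough sketch with made-up project, region, schedule and target URL:

```python
# Sketch: creating a Cloud Scheduler job that hits an HTTP endpoint daily.
# Project, region, schedule and target URI are placeholder values.
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/my-project/locations/europe-west1"  # hypothetical project/region

job = scheduler_v1.Job(
    name=f"{parent}/jobs/daily-fetch",
    schedule="0 6 * * *",          # every day at 06:00
    time_zone="Etc/UTC",
    http_target=scheduler_v1.HttpTarget(
        uri="https://my-service-xyz.a.run.app/fetch",  # hypothetical Cloud Run URL
        http_method=scheduler_v1.HttpMethod.POST,
    ),
)

client.create_job(parent=parent, job=job)
```

The App Engine API activation mentioned above remains a prerequisite for this call to succeed, even though nothing in the job itself references App Engine.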

What is the best service to use for data pipelines on GCP

I want to deploy a service (a Python script that uses Apache Beam) in my project on GCP, with execution times sometimes up to 24 h. I need this service with the data pipeline to be always working. I also have a web application that is going to use the results from the data pipeline. My solution was to deploy the web app on GCP App Engine and the Python script on a K8s cluster, because the job can last up to 24 h and App Engine is serverless, so everything serverless should be a short job, something like up to 15 min. Am I thinking along the right lines, or can you suggest better GCP services for this?
If you are using Apache Beam, my advice is to deploy the pipeline on Dataflow. The service is fully managed by GCP, and in fact this is the product that was open-sourced as the Apache Beam project, so using it should be straightforward.
Once the data has been processed by Dataflow, you can write your results to several possible destinations, like BigQuery, GCS, Pub/Sub or Datastore, and consume those results from your web app. Please see the relevant documentation.
Just pay attention to the required processing time: Dataflow will scale as needed, but even so, a job that takes 24 hours to run is certainly something you must test and study carefully. Also review the possible associated costs.
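As a rough illustration (not the asker's actual pipeline), a Beam job that reads JSON lines from GCS and appends them to an existing BigQuery table could be submitted to Dataflow like this; the project, region, bucket and table names are all made up:

```python
# Sketch of a Beam pipeline submitted to Dataflow; project, region, bucket
# and table names are placeholders, not from the original question.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                       # hypothetical
    region="europe-west1",                      # hypothetical
    temp_location="gs://my-temp-bucket/tmp",    # hypothetical
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-raw-data-bucket/raw/*.json")
        | "Parse" >> beam.Map(json.loads)
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:analytics.results",     # hypothetical existing table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```

The DataflowRunner option is what moves execution off your machine and onto the managed service; swapping it for DirectRunner runs the same pipeline locally for testing.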

In GCP, is there a canonical way to scrape data from an API?

I'm building an application that will periodically pull data from several APIs and write them to cloud storage for later processing by Dataflow. There are many different ways to do this so I wanted to sanity check before I jumped in.
My plan is this:
For each API, Cloud Scheduler will hit an endpoint for an App Engine app
The app will create a Compute Engine VM instance with a startup script that pulls the data from the API and writes it to storage
When done, the VM will hit another endpoint on the App Engine app that shuts down the VM.
Is this a reasonable way to perform this sort of action? Is there a simpler or more straightforward method? Thank you in advance for the replies.
Cloud Scheduler can trigger Compute Engine without App Engine; however, it seems that you cannot create and delete the VM with that method.
You can just use App Engine cron jobs to schedule the tasks. Your App Engine cron handler can simply run the script that pulls data from the APIs. Maybe I am missing something, but why do you need a Compute Engine instance to run the script?
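If you do go the App Engine cron route, the handler can stay very small. A hedged sketch follows, with the endpoint path, API URL and bucket name invented for illustration; the X-Appengine-Cron header check is how the handler can verify that the request really came from cron:

```python
# Sketch of an App Engine cron handler that pulls data from an API and
# stores it in GCS. Endpoint path, URL and bucket name are placeholders.
import datetime
import json

import requests
from flask import Flask, abort, request
from google.cloud import storage

app = Flask(__name__)

@app.route("/cron/pull-data")
def pull_data():
    # App Engine adds this header to requests issued by cron; reject others.
    if request.headers.get("X-Appengine-Cron") != "true":
        abort(403)

    data = requests.get("https://example.com/api/data", timeout=60).json()  # placeholder URL
    blob = storage.Client().bucket("my-raw-data-bucket").blob(              # placeholder bucket
        f"pulls/{datetime.datetime.utcnow().isoformat()}.json"
    )
    blob.upload_from_string(json.dumps(data), content_type="application/json")
    return "ok"
```

A matching cron.yaml entry pointing at /cron/pull-data with the desired schedule completes the setup.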

How to make computations with Google Cloud?

I have a machine learning project, and for this project I have to get data from a website every 15 minutes. I decided to use Google Cloud Platform to do it. I've written a Python script to do the processing (get the data from the website and write it to a CSV file), and when I run this script on my computer it works well. I need to run this script for a couple of weeks, so it should run on Google Cloud's machines and continue running when I close my computer. How can I do this?
I could also use another cloud service if required, but Google Cloud would be preferable.
Disclaimer: I'm with Google Cloud Platform Support
Google Cloud Compute Engine is defined as Infrastructure as a Service. It basically provides access to virtual machines (VMs), disks and networking functionality. By using this product you are able to configure your resources from scratch, defining one or multiple VM instances, configuring your work environment, etc. It might require more configuration and boilerplate than needed, but it offers the most control. You can always use some resources for free, but in my opinion it is a lot to build from scratch.
Google Cloud App Engine is defined as a Platform as a Service: it is basically a managed app platform, and the management can be automated to certain degrees. It is built on Compute Engine, in the sense that it provides functionality, a platform, on top of the infrastructure defined by Compute Engine VMs. You can thus deploy your Python script in an App Engine Flexible Python environment. You can define your whole application as a collection of interrelated microservices, e.g. one service gets the data from a website, another writes CSV files, and another might trigger ML jobs.
App Engine also provides the possibility to schedule jobs as cron jobs, so if your application needs to run jobs periodically or at a specific time, this is the tool to use. App Engine pricing is correlated with the resources used, but you can estimate eventual budgets using the Google Cloud Platform Pricing Calculator.
You can store the CSV files in Google Cloud Storage as objects in buckets, or as data in Datastore, Cloud SQL or BigQuery. Components of Google Cloud Platform can communicate with each other via service accounts. This allows your App Engine deployment, for example, to perform CRUD operations on your Cloud SQL instance programmatically, or to trigger a Cloud Machine Learning job.
Your question is very broad and can be addressed in many ways. I would initially deploy the Python script in App Engine Flexible, add a cron job to fetch data every 15 minutes, and upload the CSV files to Google Cloud Storage buckets. I would then use the Cloud Machine Learning Python client to trigger machine learning jobs programmatically.
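To illustrate just the fetch-and-store step (the URL, the "parsing" and the bucket name below are placeholders; the real parsing depends on the website), the periodic job could boil down to something like this:

```python
# Sketch of the periodic step only: fetch a page, build rows, and write
# a CSV object to a GCS bucket. URL, parsing logic and bucket are made up.
import csv
import datetime
import io

import requests
from google.cloud import storage

def fetch_and_store():
    page = requests.get("https://example.com/metrics", timeout=30).text  # placeholder URL
    rows = [[datetime.datetime.utcnow().isoformat(), len(page)]]          # stand-in "parsing"

    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)

    bucket = storage.Client().bucket("my-ml-data-bucket")                 # placeholder bucket
    blob_name = f"samples/{datetime.datetime.utcnow():%Y%m%d-%H%M}.csv"
    bucket.blob(blob_name).upload_from_string(buffer.getvalue(), content_type="text/csv")
```

With App Engine, a cron entry scheduled every 15 minutes that points at a handler calling this function covers the scheduling part.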
There are other products that might interest you:
Cloud Dataflow - configure stream/batch data processing
Cloud Dataprep - transform/clean raw data
Cloud Pub/Sub - global real-time messaging.
All the products/components and sub-products/sub-components can communicate with each other and processes can easily be automated in the Cloud. So the whole project can run in Google's Cloud infrastructure when you close your computer. But, of course, you have to configure it beforehand, in your Google Cloud Platform Project(s).
I am aware that I met your broad question with a broad answer. For any specific issues along your path of implementing the project in the Cloud, the community will be here to provide support.
Good luck!

"Sample DB" for Google App Engine Datastore

MySQL has the Sakila sample DB, where we can start playing around with a bunch of data for our application. Is there something like this for the Datastore on Google App Engine / GAE/J?
I recently started to experiment with Google App Engine and was confronted with the same question. I was interested in a REST-based App Engine backbone which I could easily load/unload with data, but couldn't find anything to play around with.
So I started to build two projects on GitHub which support me in this kind of work.
clb-appEngineTemplate is a skeleton application for a Google App Engine Java REST backend. It provides some sample code for a standardized REST-API-based persistence layer at the business-object level and can be easily extended (using Objectify and GSON).
clb-test is a utility class which allows you to load test data from an Excel CSV file into your Google App Engine REST backend.
Both projects are Maven-based and allow me to easily define data objects which I can upload into App Engine. Mainly I run them against the local test server, which serves for my initial testing.
I just released a first version and will extend it incrementally over the next weeks.
AFAIK, there is no sample DB for GAE, probably because Datastore write operations are expensive. There are demos bundled with the GAE SDK; if you are using Eclipse, you can import the samples into your workspace. Some of them involve the Datastore, so you can run the application and add data yourself.
Another way is to use the bulkloader to upload data all at once using CSV files. But you can quickly run out of free quota for Datastore writes.
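If you just need something to play with, seeding the Datastore yourself from a CSV file is only a few lines with the google-cloud-datastore Python client; the kind name "Film" and the CSV columns below are invented for illustration:

```python
# Sketch: seeding Cloud Datastore with rows from a local CSV file.
# The kind name "Film" and the columns "title"/"year" are placeholders.
import csv

from google.cloud import datastore

client = datastore.Client()

with open("sample_films.csv", newline="") as f:
    for row in csv.DictReader(f):
        entity = datastore.Entity(key=client.key("Film"))
        entity.update({"title": row["title"], "year": int(row["year"])})
        client.put(entity)
```

For larger files, collecting entities into batches and calling client.put_multi() keeps the number of write calls (and the consumed quota) down.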
