I have a machine learning project and for this project, I have to get data from a website in evey 15 minutes. And I decided to use google cloud platform to do it. I've coded a python script to do the process(get the data from website and write down to a csv file) and when I run this script on my computer, it works well. I need to run this script for a couple weeks. So it should be running in google cloud's computers and it should continue running when I close my computer. How can I do this?
I can also use another cloud service if it's required to but google cloud would be better.
Disclaimer: I'm with Google Cloud Platform Support
Google Cloud Compute Engine is defined as an Infrastructure as a Service. It basically provides access to Virtual Machines (VMs), Disks and Networking functionalities. By using this product, you are able to configure your resources from scratch, defining one or multiple VM instances, configuring your work environment, etc. It might require more configuration and boiler plating than needed, but it offers the most control. You can always use some resources for free but in my opinion it is a lot of scratch to start from.
Google Cloud App Engine is defined as a Platform as a Service. It is basically a managed app platform. The management can be automatised to certain degrees. It is based on Compute Engine, in the sense that it provides functionalities, a platform, on top of the infrastructure defined by Compute Engine VMs. You can thus deploy your python script in an App Engine Flexible Python Environment. You can define your whole application as a collection of interrelated microservices, i.e. one service gets the data from a website, maybe another writes csv files and another might trigger ML jobs.
App Engine also provides the possibility to schedule jobs as cron jobs. So if your application needs to run periodical jobs or at a specific time, this is the tool to use. App Engine pricing is correlated with the used resources, but you can estimate eventual budgets by using the Google Cloud Platform Pricing Calculator.
You can store the csv files in Google Cloud Storage as objects in buckets or as data in Datastore, Cloud SQL or BigQuery. Components of Google Cloud Platform can communicate with each other via service acounts. This allows your App Engine deployment, for example, to perform CRUD operations in your Cloud SQL instance, programatically. Or... to trigger a Cloud Machine Learning job.
Your question is very broad and can be addressed in multiple, various ways. I would initially deploy the python script in App Engine Flexible. I would deploy a cron job if needed, to fetch data every 15 minutes. I would upload the csv files in Google Cloud Storage Buckets. I would then use the Cloud Machine Learning python client to trigger Machine Learning jobs programatically.
There are other products that might interest you:
Cloud Dataflow - configure stream/batch data processing
Cloud Dataprerp - transform/clean raw data
Cloud Pub/Sub - global real-time messaging.
All the products/components and sub-products/sub-components can communicate with each other and processes can easily be automated in the Cloud. So the whole project can run in Google's Cloud infrastructure when you close your computer. But, of course, you have to configure it beforehand, in your Google Cloud Platform Project(s).
I am aware that I met your broad question with a broad answer. For any specific issues along your path of implementing the project in the Cloud, the community will be here to provide support.
Good luck!
Related
I'm a little bit confused between gcp components, here is my use case :
daily, I need to fetch data from an external API (the API return json data), store it in GCS then load it in Bigquery, I already created the python script fetching the data and store it in GCS and i'm confused which component to use for deployment :
Cloud run : from the doc it is used for deploying services, so I think its a bad choose
Cloud function: I think it works, but it is used for even based processing (through single purpose function...)
composer :(I'll use composer to orchestrate tasks, such as preprocessing of files in GCS, load them to BQ, transfert them to an archive Bucket) through kubernetesPodOperator, create a task that trigger the script to get the data
compute engine: I don't think that its the best chose since there are better ones
app engine: also I don't think it a good idea since it is used to deploy and scale web app ...
(correcte me if i'm wrong in what I said, ) so my question is : what is the GCP component used for this kind of task
Cloud run : from the doc it is used for deploying services
app engine: also I don't think it a good idea since it is used to deploy and scale web app ...
I think you've misunderstood. Both Cloud run and Google App Engine (GAE) are serverless offerings from Google Cloud. You deploy your code to any of them and you can invoke their urls which in turn will cause your code to execute and do stuff like go fetch data from somewhere and save it somewhere.
Google App Engine has a shorter timeout than Cloud Run (can't remember if Cloud Run has time out). So, if your code will take a long time to run, you don't want to use Google App Engine (unless you make it a background task) and if you don't need a UI, then you don't need GAE.
For your specific scenario, you can deploy your code to Cloud Run and use Cloud Scheduler to schedule it to be invoked at specific times. We have that architecture running in a similar scenario (we have a task that runs once daily; it's deployed to Cloud Run; Google Scheduler invokes the endpoint, it runs and saves data to datastore linked to an App Engine App). We wrote a blog article on deploying to Cloud Run and another on securing your cloud run (based off our experience in the earlier described scenario)
GAE Timeout:
Every request to a Google App Engine (Standard) must complete within 1 - 10 minutes for automatic scaling and up to 24 hours for basic scaling (see documentation). For Google App Engine Flexible, the timeout is 60 minutes (documentation).
I can't find any documentation on how gcp scheduler works under the hood. An App Engine is needed in the project, so I assume that the Http calls or Pub/Sub messages are started from the App Engine.
Currently I can use a cloud scheduler even without an App Engine in the project. Apparently a compute engine that also contains a permanently running VM is also sufficient. Could someone confirm my assumptions please or does anyone have sources on this?
I can't tell you how work Cloud Scheduler under the hood. I can just tell you that works well!
I'm sure there is a VM, or a cluster of VM, on Google serverless environment, and your Cloud Scheduler job is set on it. It's serverless, the under the hood doesn't matter, it works, and it's what I want!
Now, the relation with App Engine can be confusing. In fact, there is no longer relation between the product now, but you need the App Engine API activated on your project to use Cloud Scheduler. This strange things is normal if you have been using Google Cloud for a while. At the beginning, only App Engine existed, and Datastore, Cloud Task, Cloud Scheduler was all "modules" of App Engine. Years, after years, google has refactored and extracted these modules to create independent products, as you can see them today. However, some relations are still present, like the API activation.
After reading available answers/comments on similar questions on this forum, it is now evident that GAE app is not straight-forward ready to be deployed on Compute engine. I fully understand that what all the managed services(mostly as APIs be it, datastore, document/index Search, memcache, cloud storage, task queues, cron jobs etc.), App Engine offered being a platform, won't be the same-fashioned accessible/integration-ready if available on Compute engine at all.
We have a 5 years old fully-grown App engine app now.
I am considering a scenario to support high-level of customization/control and adding third party softwares/middlewares to our server environment which is not possible with App engine. So if we have all solutions(Compute Engines, Container Engines etc.) other than App engine, to migrate our application to meet such requirements, what is the cost of such migration?
Need of server provisioning and configuration at Compute engines with different pricing model[Understood, should not be a problem :)]
Full or partial code rewrite to continue using the same APIs esp. Datastore, Cloud Storage, Task Queues, Cron jobs, Document Search, Memcache etc.[Need confirmation here and any reference/link to migration guide would be help!!]
Does this lead to risk of losing any managed service/API offered from App Engine? Document Search, Memcache, Task Queues, Cron jobs seem the possible candidates. Please confirm.
As per my reading, Big Query, Cloud storage, Pub-Sub APIs integration should not be much affected with such migration(Client-libraries or Rest APIs should still help!). Please confirm.
In nut-shell, We wanted it fully managed in the beginning so PaaS seemed the right choice 5 years ago. Now we want App minus platform-managed plus customized/flexible to our choice. How complicated this transition is going to be?
Full or partial code rewrite to continue using the same APIs esp. Datastore, Cloud Storage, Task Queues, Cron jobs, Document Search, Memcache etc.[Need confirmation here and any reference/link to migration guide would be help!!]
Unfortunately, some of those service only be provided on GAE, such as Document Search. But most of service can be use directly for GCP, such as Datastore, Cloud Storage. GAE Flexible Environment is much like GCP environment, so you can read this article first Migration to GAE Flexible Environment
In following articles also have some answer:
How to migrate Google App Engine Project to Compute Engine completely
Google App Engine Blobstore to Google Cloud Storage Migration Tool
Does this lead to risk of losing any managed service/API offered from App Engine? Document Search, Memcache, Task Queues, Cron jobs seem the possible candidates. Please confirm.
Yes, Document Search only available on GAE.
As per my reading, Big Query, Cloud storage, Pub-Sub APIs integration should not be much affected with such migration(Client-libraries or Rest APIs should still help!). Please confirm.
Yes, but you may need to change SDK or library. It dependency on your language and how to call those service by Rest API directly or SDK.
I have a job that involves continuously listening to one or more websocket/mqtt feeds and forwarding this data to an event queue. This job is written in javascript and would run 24/7 in a continuous loop.
The most obvious solution is to run this job on a VM with Compute Engine, but I was wondering is there is a more elegant solution. Azure, for example, has WebJobs that's well-suited to this kind of task. It even restarts the script if there is an error.
Is there some other component on GCP that can run this job in a "managed" way?
Google Cloud does not have a product similar to Azure WebJobs at the moment. Both the standard and flexible environments of Google Cloud App Engine do not currently support websockets. In order to use websockets you can use Compute Engine or Kubernetes Engine.
I was able to create the basic 'hello world' program.
When I tried to understand the difference between a cloud and a server I learned that Cloud is where you have an access to virtual instance created exclusively for you and you are free to choose and install software of your choice.Why Google App Engine(GAE) is used widely where as tomcat is not used. What are major differences between GAE and Tomcat?
Cloud is Google Cloud Platform at this case. App Engine is just one of their services.
App Engine is a platform to build your apps on top of it. A Platform As A Service or PaaS. It simplifies the process of building a scalable application, and you should use it when you understand what you really need and understand principles of scalable application.
Tomcat is a Java web container, and there're many alternatives. Google App Engine is using Jetty. You could actually use it with Tomcat by using Flexible VM, though it doesn't make much sense.
App Engine is not about web server, it's a set of services that helps you to build a scalable app. It includes Memcache, Datastore, Task Queue, Images API, deployments tools and versioning, CDN for static files, and most important automatic scale.
Actually you aren't limited to App Engine on Google Cloud Platform. There is more traditional service, like own server in the cloud, called Compute Engine. There you can run your Tomcat or anything else.