How to synchronise instance data on google app engine? - google-app-engine

If I write a CRON job which fetches some external data and saves it on the GAE instance - will it become available across all instances?
I am using nodejs flexible env.

While you do have access to the GAE instance disk, the data on one instance is not replicated to the others. This is why it is recommended that you write all information to a Cloud Storage bucket; that bucket then acts as the shared store that every instance reads from.
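A minimal sketch of that idea, in Python for brevity (the question uses Node.js, but the pattern is the same). It assumes a bucket object from the google-cloud-storage client library; the function names save_shared and load_shared, and the bucket and object names in the comments, are made up for illustration.

```python
import json

# Assumption: `bucket` is a google.cloud.storage Bucket, obtained for example with
#   from google.cloud import storage
#   bucket = storage.Client().bucket("my-shared-data")  # hypothetical bucket name
# It is passed in as a parameter so the functions do not depend on instance-local state.


def save_shared(bucket, name, payload):
    """Serialise payload as JSON and upload it as the object `name`."""
    blob = bucket.blob(name)
    blob.upload_from_string(json.dumps(payload), content_type="application/json")


def load_shared(bucket, name):
    """Download the object `name` and decode it from JSON."""
    blob = bucket.blob(name)
    return json.loads(blob.download_as_bytes())
```

The cron handler would call save_shared once per run, and any instance (including ones started later by autoscaling) would call load_shared instead of reading from its own disk.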

I believe you are referring to a cron job created directly on the flexible instances, not to the App Engine Cron Service. So the answer is no, it won't be available to all instances. It will only be stored on the instance(s) that executed that cron job at that time, meaning that if you have autoscaling enabled, none of the new instances will contain the external data.

Related

What's the difference between Google Cloud Scheduler and GAE cron job?

After reading the docs:
Cloud Scheduler - https://cloud.google.com/scheduler/
GAE cron job - https://cloud.google.com/appengine/docs/flexible/nodejs/scheduling-jobs-with-cron-yaml
Cloud Functions Pub/Sub trigger - https://cloud.google.com/functions/docs/calling/pubsub
I think they are mostly the same.
I can use a GAE cron job + Pub/Sub + a Cloud Function to implement the same functionality that Cloud Scheduler provides.
As I understand it, there are some differences between them:
Cloud Scheduler makes it more convenient to adjust the frequency. To update the frequency of a GAE cron job, you must update the config (for example, schedule: every 1 hours in cron.yaml) and redeploy.
There is no need to build the cron job architecture (integrating GAE, the GAE cron service, Pub/Sub, Cloud Functions, etc.) yourself, which means you no longer need to write code to tie them together.
Am I correct? Or, any other differences?
You're right in that the Google Cloud Scheduler is kind of an evolution of the GAE cron job mechanism to make it more user-friendly and flexible. You can see that they are still related since the Cloud Scheduler doc specifies:
To use Cloud Scheduler your project must contain an App Engine app
that is located in one of the supported regions. If your project does
not have an App Engine app, you must create one.
Historically, the GAE cron job was the only cron service offered by the platform. You could only target a GAE handler to receive the request from cron. From there you could indeed perform actions like publishing to Pub/Sub, calling an HTTP Cloud Function or launching a Dataflow job, but you always had to deploy a GAE service to handle it, which wasn't optimal.
The new Cloud Scheduler makes it more straightforward to target Pub/Sub and Cloud Functions, but also any publicly available HTTP endpoint (which may be on-premise), and of course App Engine handlers. More targets may be added in the future for more use cases.
Finally, as you mentioned, the API exposed to manage it decouples it from App Engine and its cron.yaml file and makes it easier to create and update cron jobs dynamically.
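To illustrate what "dynamically" can look like, here is a rough Python sketch that only builds the JSON body for a Cloud Scheduler job with a Pub/Sub target, as I understand the REST Job resource (name, schedule, timeZone, pubsubTarget). The helper name pubsub_job_body and all project, location, job and topic values are hypothetical, and actually sending the body to the API (with authentication) is left out.

```python
import base64


def pubsub_job_body(project, location, job_id, schedule, topic, message,
                    time_zone="Etc/UTC"):
    """Build a request body for a Scheduler job that publishes to Pub/Sub.

    Assumption: field names follow the Cloud Scheduler REST Job resource;
    check them against the current API reference before relying on this.
    """
    return {
        "name": f"projects/{project}/locations/{location}/jobs/{job_id}",
        "schedule": schedule,  # unix-cron string, e.g. "0 * * * *" for hourly
        "timeZone": time_zone,
        "pubsubTarget": {
            "topicName": f"projects/{project}/topics/{topic}",
            # The REST API expects the message payload base64-encoded.
            "data": base64.b64encode(message.encode("utf-8")).decode("ascii"),
        },
    }
```

Changing the frequency then amounts to sending an updated schedule value for the same job name, with no cron.yaml edit and no redeploy.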

In GCP, is there a canonical way to scrape data from an API?

I'm building an application that will periodically pull data from several APIs and write them to cloud storage for later processing by Dataflow. There are many different ways to do this so I wanted to sanity check before I jumped in.
My plan is this:
For each API, Cloud Scheduler will hit an endpoint for an App Engine app
The app will create a Compute Engine VM instance with a startup script that pulls the data from the API and writes it to storage
When done, the VM will hit another endpoint on the App Engine app that shuts down the VM.
Is this a reasonable way to perform this sort of action? Is there a simpler or more straight-forward method? Thank you in advance for the replies.
Cloud Scheduler can schedule Compute Engine without App Engine; however, it seems that you cannot create and delete the VM with this method.
You can just use App Engine cron jobs to schedule the tasks. Your App Engine app cron handler can simply run the script that pulls data from the APIs. Maybe I am missing something: why do you need to use a Compute Engine instance to run the script?
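A minimal sketch of such a cron handler, written as a plain WSGI app so it has no dependencies. The /tasks/pull path and the pull_data function are placeholders I made up; the X-Appengine-Cron header check relies on App Engine setting that header on cron requests and stripping it from external ones.

```python
def pull_data():
    """Placeholder for the script that pulls from the external APIs."""
    return "pulled"


def app(environ, start_response):
    """WSGI entry point: run the pull job only for App Engine cron requests."""
    headers = [("Content-Type", "text/plain")]
    if environ.get("PATH_INFO") != "/tasks/pull":  # hypothetical handler path
        start_response("404 Not Found", headers)
        return [b"not found"]
    # WSGI exposes the X-Appengine-Cron request header under this key.
    if environ.get("HTTP_X_APPENGINE_CRON") != "true":
        start_response("403 Forbidden", headers)
        return [b"forbidden"]
    result = pull_data()
    start_response("200 OK", headers)
    return [result.encode("utf-8")]
```

The cron configuration would then point a job at /tasks/pull on the desired schedule.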

GAE Microservices with dedicated Cron Jobs per microservice

Can different microservices in GAE own dedicated cron jobs?
Background
We have written multiple services in a GAE microservices application.
One microservice, say Service1 (default) [Java in the GAE Standard environment], has 10 cron jobs, whereas another microservice, say Service2 [Python in the GAE Flexible environment], has 5 other cron jobs.
When we deploy both services, the cron jobs get replaced with those of the most recently deployed service.
I know that the Task Queue is a shared resource in GAE microservices, and hence cron jobs may be shared too. But is it impossible to let each microservice have its own dedicated cron jobs, scoped to that service, and have them uploaded to the server so that all cron jobs can co-exist?
Timely response is highly appreciated.
The cron configuration is also an application level configuration, not a module/service level one, which is why when you deploy it for one service it overwrites the previous one from another service.
You need to combine all cron jobs for all your services into a single cron configuration file and deploy that one instead, preferably using the specific cron deployment command, not by uploading it together with a particular service (sometimes that fails for multi service apps).
There are other such app level configurations as well, see https://stackoverflow.com/a/42361987/4495081
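One way to keep per-service ownership while still deploying a single file is to generate the combined cron.yaml from per-service job lists. This is only a sketch: the render_cron_yaml helper and the job values are invented for illustration, and it assumes the target field is used to route a job to a non-default service.

```python
def render_cron_yaml(jobs_by_service):
    """Render one app-level cron.yaml from {service_name: [job, ...]}.

    Each job is a dict with 'description', 'url' and 'schedule' keys.
    """
    lines = ["cron:"]
    for service, jobs in jobs_by_service.items():
        for job in jobs:
            lines.append('- description: "{}"'.format(job["description"]))
            lines.append("  url: {}".format(job["url"]))
            lines.append("  schedule: {}".format(job["schedule"]))
            # Jobs without a target go to the default service.
            if service != "default":
                lines.append("  target: {}".format(service))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    combined = render_cron_yaml({
        "default": [{"description": "service1 cleanup", "url": "/cron/cleanup",
                     "schedule": "every 24 hours"}],
        "service2": [{"description": "service2 sync", "url": "/cron/sync",
                      "schedule": "every 1 hours"}],
    })
    print(combined)
```

The generated file would then be deployed on its own, for example with gcloud app deploy cron.yaml, rather than together with any single service.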

How to make computations with Google Cloud?

I have a machine learning project, and for this project I have to get data from a website every 15 minutes. I decided to use Google Cloud Platform to do it. I've written a Python script that does this (it gets the data from the website and writes it to a csv file), and when I run this script on my computer, it works well. I need to run this script for a couple of weeks, so it should run on Google Cloud's computers and keep running when I close my computer. How can I do this?
I can also use another cloud service if it's required to but google cloud would be better.
Disclaimer: I'm with Google Cloud Platform Support
Google Cloud Compute Engine is defined as Infrastructure as a Service. It basically provides access to Virtual Machines (VMs), disks and networking functionality. By using this product, you are able to configure your resources from scratch, defining one or multiple VM instances, configuring your work environment, etc. It might require more configuration and boilerplate than you need, but it offers the most control. You can always use some resources for free, but in my opinion it is a lot to build from scratch.
Google Cloud App Engine is defined as a Platform as a Service. It is basically a managed app platform. The management can be automated to a certain degree. It is based on Compute Engine, in the sense that it provides functionality, a platform, on top of the infrastructure defined by Compute Engine VMs. You can thus deploy your Python script in an App Engine Flexible Python environment. You can define your whole application as a collection of interrelated microservices, i.e. one service gets the data from a website, maybe another writes csv files and another might trigger ML jobs.
App Engine also provides the possibility to schedule jobs as cron jobs. So if your application needs to run periodical jobs or at a specific time, this is the tool to use. App Engine pricing is correlated with the used resources, but you can estimate eventual budgets by using the Google Cloud Platform Pricing Calculator.
You can store the csv files in Google Cloud Storage as objects in buckets, or as data in Datastore, Cloud SQL or BigQuery. Components of Google Cloud Platform can communicate with each other via service accounts. This allows your App Engine deployment, for example, to perform CRUD operations on your Cloud SQL instance programmatically, or to trigger a Cloud Machine Learning job.
Your question is very broad and can be addressed in multiple ways. I would initially deploy the Python script in App Engine Flexible. I would deploy a cron job, if needed, to fetch data every 15 minutes. I would upload the csv files to Google Cloud Storage buckets. I would then use the Cloud Machine Learning Python client to trigger Machine Learning jobs programmatically.
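As a rough sketch of the "fetch every 15 minutes and upload a csv" step: the code below builds the csv in memory and uploads it under a timestamped object name, so each run produces its own file. It assumes a google-cloud-storage bucket object is passed in; the function names, the prefix and the object naming scheme are my own choices for illustration.

```python
import csv
import io
from datetime import datetime, timezone


def rows_to_csv(rows, header):
    """Serialise a header and rows to csv text without touching local disk."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def object_name(prefix, when):
    """Name the object after the run time, e.g. prefix/2020/01/02/0315.csv."""
    return f"{prefix}/{when:%Y/%m/%d/%H%M}.csv"


def upload_snapshot(bucket, prefix, rows, header, when=None):
    """Upload one run's rows as a csv object and return the object name.

    Assumption: `bucket` is a google.cloud.storage Bucket.
    """
    when = when or datetime.now(timezone.utc)
    name = object_name(prefix, when)
    bucket.blob(name).upload_from_string(rows_to_csv(rows, header),
                                         content_type="text/csv")
    return name
```

The cron handler would call upload_snapshot with whatever rows the scraping script produced on that run.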
There are other products that might interest you:
Cloud Dataflow - configure stream/batch data processing
Cloud Dataprep - transform/clean raw data
Cloud Pub/Sub - global real-time messaging.
All the products/components and sub-products/sub-components can communicate with each other and processes can easily be automated in the Cloud. So the whole project can run in Google's Cloud infrastructure when you close your computer. But, of course, you have to configure it beforehand, in your Google Cloud Platform Project(s).
I am aware that I met your broad question with a broad answer. For any specific issues along your path of implementing the project in the Cloud, the community will be here to provide support.
Good luck!

How to create multiple instances of an application in cloud

I want to create multiple instances of my application in either Google Cloud or EC2. I have two queries regarding this:
1. How to achieve this?
2. Can we create virtual instances by using ZooKeeper?
Google App Engine instances are started automatically as your traffic rises. You may also have always-on instances or backend instances. Just read the docs: http://code.google.com/intl/pt-BR/appengine/docs/adminconsole/instances.html
Google App Engine is not well suited to use with ZooKeeper. Since Java code runs in a limited sandbox, you may not be able to communicate with ZooKeeper at all. Also, you will have to start and stop your backends programmatically, which leads to a lot of work.
As for EC2, see this:
http://www.mail-archive.com/zookeeper-user#hadoop.apache.org/msg01083.html
