Data computation in Google App Engine Flex - google-app-engine

We have a project where 2 datasets (kinds) are stored in Google Datastore, totaling 1.1 million records. We are also planning to add more datasets going forward. We are now considering a move to App Engine Flex so that statistical libraries such as numpy and pandas, and the ML framework scikit-learn, can be used to build predictive models. As part of data transformation/computation, pandas and numpy will be used to extract new features from the datasets stored in Google Datastore.
Question - what is an effective approach to executing the computation logic (data aggregation and transformation) on large datasets in the App Engine Flex environment? Initially I was thinking of using the task queue for this heavy-duty transformation, given its 10-minute timeout, but I am not sure whether that is feasible in the flex environment.

The trouble is that task queues have limited support in the flex environment. From Migrating Services from the Standard Environment to the Flexible Environment:
Task Queue
The Task Queue service has limited availability outside of the standard environment. If you want to use the service outside of the standard environment, you can sign up for the Cloud Tasks alpha.
Outside of the standard environment, you can't add tasks to push queues, but a service running in the flexible environment can be the target of a push task. You can specify this using the target parameter when adding a task to a queue or by specifying the default target for the queue in queue.yaml.
In many cases where you might use pull queues, such as queuing up tasks or messages that will be pulled and processed by separate workers, Cloud Pub/Sub can be a good alternative as it offers similar functionality and delivery guarantees.
One approach is already mentioned in the above quote: using Cloud Pub/Sub.
Another approach is also hinted at in the quote:
keep part of the existing app as a standard environment service/module, populating the datasets and pushing processing tasks onto push queues
use the flexible environment in the processing service(s)/module(s) where you need those libraries; these services would be specified as the targets of the pushed tasks (sketched below)
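For illustration, a minimal sketch of the standard environment side, assuming a flex worker service named transform-worker and a handler path /transform (both hypothetical names):

from google.appengine.api import taskqueue

def enqueue_transform(dataset_key):
    # Push a task whose target is the flex service running numpy/pandas.
    # "transform-worker" and "/transform" are hypothetical names.
    taskqueue.add(
        url='/transform',
        target='transform-worker',
        params={'dataset_key': dataset_key})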

Related

Bulk Enqueue Google Cloud Tasks

As part of migrating my Google App Engine Standard project from python2 to python3, it looks like I also need to switch from using the Taskqueue API & Library to google-cloud-tasks.
In the taskqueue library I could enqueue up to 100 tasks at a time like this:
taskqueue.Queue('default').add([...task objects...])
as well as enqueue tasks asynchronously.
In the new library, as well as the new API, it looks like you can only enqueue tasks one at a time:
https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/create
https://googleapis.dev/python/cloudtasks/latest/gapic/v2/api.html#google.cloud.tasks_v2.CloudTasksClient.create_task
I have an endpoint where it receives a batch with thousands of elements, each of which need to get processed in an individual task. How should I go about this?
According to the official documentation (reference 1, reference 2), adding tasks to queues asynchronously (as this post suggests for bulk-adding tasks to a queue) is NOT available via the Cloud Tasks API. It is, however, available to users of the App Engine SDK.
However, the documentation mentions a workaround for adding a large number of Cloud Tasks to a queue: the double-injection pattern (this post might be useful too).
To implement this pattern, you create a new injector queue, where each single task carries the information needed to add multiple (e.g., 100) tasks to your original queue. On the receiving end of this injector queue is a service that does the actual addition of the intended tasks to your original queue. Although the addition of tasks in that service is synchronous and one by one, it provides an asynchronous interface for your main application to bulk-add tasks. In this way you can overcome the limits of synchronous, one-by-one task addition in your main application.
Note that the 500/50/5 pattern of task addition (start at no more than 500 operations per second, then increase the rate by 50% every 5 minutes) is the suggested method to avoid overloading the queue or its target.
I have not yet found a published example of this implementation; I will edit the answer as soon as I find one.
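In the meantime, here is a minimal sketch of the pattern in Python with the google-cloud-tasks v2 client; the queue names, handler paths, and batch size are illustrative assumptions, not a definitive implementation:

import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
PROJECT, LOCATION = 'my-project', 'us-central1'  # assumed values

def make_task(relative_uri, payload):
    # Build an App Engine push task carrying a JSON payload.
    return {'app_engine_http_request': {
        'http_method': tasks_v2.HttpMethod.POST,
        'relative_uri': relative_uri,
        'body': json.dumps(payload).encode()}}

def enqueue_batches(elements, batch_size=100):
    # Main app: one injector task per batch of elements.
    parent = client.queue_path(PROJECT, LOCATION, 'injector-queue')
    for i in range(0, len(elements), batch_size):
        client.create_task(parent=parent,
                           task=make_task('/bulk-inject', elements[i:i + batch_size]))

def bulk_inject_handler(batch):
    # Injector service: fan the batch out to the real work queue, one by one.
    parent = client.queue_path(PROJECT, LOCATION, 'work-queue')
    for element in batch:
        client.create_task(parent=parent, task=make_task('/process', element))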
Since you are in the middle of a migration, this link may be useful, as it concerns migrating from Task Queue to Cloud Tasks (which you stated you are considering).
Additional information on migrating your code, with all the available details, can be found here and here, covering the pull queues to Cloud Pub/Sub migration and the push queues to Cloud Tasks migration respectively.
To recreate a batch pull mechanism, you would have to switch to Pub/Sub; Cloud Tasks does not have pull queues. With Pub/Sub you can batch-publish and batch-pull messages.
If you are using a push queue architecture, I would recommend passing those elements as the task payload; note, however, that the maximum task size is limited to 100KB.
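As an illustration, a minimal batch-pull sketch with the google-cloud-pubsub client (the project and subscription names, and the process function, are assumptions):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'my-subscription')

# Pull up to 100 messages in one call, process them, then ack in bulk.
response = subscriber.pull(request={'subscription': subscription_path,
                                    'max_messages': 100})
ack_ids = []
for received in response.received_messages:
    process(received.message.data)  # placeholder for your processing logic
    ack_ids.append(received.ack_id)
if ack_ids:
    subscriber.acknowledge(request={'subscription': subscription_path,
                                    'ack_ids': ack_ids})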

Does Google App Engine support multiprocessing via python, and does the DB support multiple writes in localhost?

Regarding a production environment, I would like to know whether the Python standard environment (2.7) on Google App Engine supports code that uses multiprocessing and pooling, using Google's Datastore. Or should MapReduce be used instead?
And regarding a development environment on localhost, I would also like to know how to avoid a database lock when writing to the same database from processes started in different shell terminals.
Thanks
You can have a look at this post on Google Groups, where it is confirmed that multiprocessing is not available in the Google App Engine (GAE) Standard environment, but you can use it in GAE Flexible. You might also be interested in this post about parallel execution in GAE, and in Tasklets in particular, with a Cloud Datastore example.
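As an illustration, a minimal NDB tasklet sketch that fetches entities in parallel (the Account model is a hypothetical example):

from google.appengine.ext import ndb

class Account(ndb.Model):
    email = ndb.StringProperty()

@ndb.tasklet
def get_emails(keys):
    # The async gets are issued in parallel; yield suspends the tasklet
    # until all of the futures have results.
    accounts = yield ndb.get_multi_async(keys)
    raise ndb.Return([a.email for a in accounts if a])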
Regarding the database lock:
Updates are actually done within a datastore transaction, and NDB by default will retry the operation three times before failing altogether. It is recommended that you update an entity group at most once per second. If you are seeing database locks, you're probably doing something wrong. We implemented a version of the "fork-join queue" described by Brett Slatkin in his 2010 data pipelines talk, which is a method of "joining" many updates to the same entity so that they can all be applied at once at a controlled rate: https://www.youtube.com/watch?v=zSDC_TU7rtc&feature=youtu.be&t=33m37s
Also, see the discussion here:
How to deal with eventual consistency in fork-join-queue

Google AppEngine use Google Compute for Task?

I need to run some specific code that can't be run on Google AppEngine (because of restrictions).
Since these workers are asynchronous, I thought about launching a Compute Engine instance every time I need one and connecting to it via specific tasks on the Task Queue from Google App Engine, but I cannot find any documentation on whether this is possible.
TL;DR: Is it possible to specify a Google Compute Engine instance as the target of a task queue?
No, there is no way to specify a Google Compute Engine instance as the target of a task queue.
But did you consider using the flexible environment instead (possibly with a custom runtime to address the restrictions)? Or the alternatives suggested for the flexible environment, which only has limited task queue support? From Task Queue:
The Task Queue service has limited availability outside of the standard environment. If you want to use the service outside of the standard environment, you can sign up for the Cloud Tasks alpha.
Outside of the standard environment, you can't add tasks to push queues, but a service running in the flexible environment can be the target of a push task. You can specify this using the target parameter when adding a task to a queue or by specifying the default target for the queue in queue.yaml.
In many cases where you might use pull queues, such as queuing up tasks or messages that will be pulled and processed by separate workers, Cloud Pub/Sub can be a good alternative as it offers similar functionality and delivery guarantees.
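For the Pub/Sub alternative, a minimal sketch of a worker that could run on a Compute Engine instance (the subscription name and do_restricted_work are assumptions):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'work-items')

def callback(message):
    # Run the code that can't run on App Engine, then ack on success.
    do_restricted_work(message.data)  # placeholder for your logic
    message.ack()

# Streaming pull: the client keeps a connection open and dispatches
# messages to the callback as they arrive.
future = subscriber.subscribe(subscription_path, callback=callback)
future.result()  # blocks the main thread; stop with future.cancel()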

Google Cloud Dataflow ETL (Datastore -> Transform -> BigQuery)

We have an application running on Google App Engine using Datastore as its persistence back-end. The application currently has mostly 'OLTP' features and some rudimentary reporting. While implementing reports, we found that processing large amounts of data (millions of objects) is very difficult using Datastore and GQL. To enhance our application with proper reports and business intelligence features, we think it is better to set up an ETL process to move data from Datastore to BigQuery.
Initially we thought of implementing the ETL process as an App Engine cron job, but it looks like Dataflow can also be used for this. We have the following requirements for setting up the process:
Be able to push all existing data to BigQuery using the non-streaming API of BigQuery.
Once the above is done, push any new data to BigQuery using the streaming API whenever it is updated or created in Datastore.
My questions are:
Is Cloud Dataflow the right candidate for implementing this pipeline?
Will we be able to push the existing data? Some of the kinds have millions of objects.
What would be the right approach to implement it? We are considering two approaches.
The first approach is to go through Pub/Sub: for existing data, create a cron job that pushes all data to Pub/Sub; for any new updates, push the data to Pub/Sub at the same time it is written to Datastore. The Dataflow pipeline will pick it up from Pub/Sub and push it to BigQuery.
The second approach is to create a batch pipeline in Dataflow that queries Datastore and pushes any new data to BigQuery.
Are these two approaches doable? Which one is better cost-wise? Is there any other way that is better than the above two?
Thank you,
rizTaak
Dataflow can absolutely be used for this purpose. In fact, Dataflow's scalability should make the process fast and relatively easy.
Both of your approaches should work -- I'd give preference to the second one: use a batch pipeline to move the existing data, then a streaming pipeline to handle new data via Cloud Pub/Sub. In addition to data movement, Dataflow allows arbitrary analytics/manipulation to be performed on the data itself.
That said, BigQuery and Datastore can be connected directly. See, for example, Loading Data From Cloud Datastore in BigQuery documentation.
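As an illustration, a minimal batch pipeline sketch with the Apache Beam Python SDK; the project, kind, table, schema, and the to_row mapping are assumptions:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

def to_row(entity):
    # Map Datastore entity properties to a BigQuery row (assumed schema).
    return {'id': entity.key.path_elements[-1],
            'name': entity.properties.get('name')}

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> ReadFromDatastore(Query(kind='MyKind', project='my-project'))
     | 'ToRow' >> beam.Map(to_row)
     | 'Write' >> beam.io.WriteToBigQuery(
         'my-project:my_dataset.my_table',
         schema='id:INTEGER,name:STRING',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))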

Keeping Consistent Count in Google App Engine

I am looking for suggestions on a very common problem on the Google App Engine platform: keeping consistent counters.
I have a task that loads the groups of a domain and then creates a task for each group to load its members in a separate task. As there are thousands of groups and members, there will be very many tasks.
I will be creating one task to get one page of groups, and within that task I will create multiple tasks, one per group, to get its members. To know whether I have loaded all groups, I check the nextPageToken and then set the groups-loading flag to finished.
However, as there will be a separate task for each group to load its members, I need to track whether all group-member tasks have finished. Here I have a problem: various tasks accessing a single count, numGroupMembersFinished, will create concurrency issues, and at some point the count will get corrupted and no longer be correct.
My answer is general because your question doesn't include any code or a proposed solution, and you don't say where you plan to keep that counter.
Many articles on the web cover this. Google for "sharding counters" for a semi-scalable way to count datastore entities quickly in O(1) time.
More importantly, look at the memcache API. It has functions to atomically increment/decrement counters stored there. These are guaranteed to never have concurrency issues; however, you would still need some way to recover and/or double-check that the memcache entry wasn't evicted, for example by also keeping the count in a datastore entity that you set asynchronously and "get by key" to always read its latest value.
This still isn't 100% bulletproof, because the cache could be evicted at the same moment that you have many concurrent attempts to modify it, so your backup datastore entity could miss a "set".
You need to calculate, based on your expected concurrent usage, whether the chance of missing an increment/decrement is greater than that of a comet hitting the earth. Hopefully you won't use it for an air traffic controller.
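For illustration, a minimal sketch of the atomic memcache counter with a datastore backup (the model, key names, and the handler function are assumptions):

from google.appengine.api import memcache
from google.appengine.ext import ndb

COUNTER_KEY = 'numGroupMembersFinished'  # assumed cache key

class CounterBackup(ndb.Model):
    count = ndb.IntegerProperty(default=0)

def mark_group_finished():
    # memcache.incr is atomic; initial_value creates the entry if missing.
    count = memcache.incr(COUNTER_KEY, initial_value=0)
    # Best-effort datastore backup in case the cache entry is evicted;
    # as noted above, this backup itself is not bulletproof.
    backup = CounterBackup.get_or_insert('groups')
    backup.count = count
    backup.put_async()
    return count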
You could use the MapReduce or Pipeline API:
https://github.com/GoogleCloudPlatform/appengine-mapreduce
https://github.com/GoogleCloudPlatform/appengine-pipelines
These allow you to split your problem into smaller, manageable parts, and the library handles all of the details of signaling/blocking between tasks, gathering the results, and handing them back to you when it's done.
Google I/O 2010 - Data pipelines with Google App Engine:
https://www.youtube.com/watch?v=zSDC_TU7rtc
Google I/O 2011: Large-scale Data Analysis Using the App Engine Pipeline API:
https://www.youtube.com/watch?v=Rsfy_TYA2ZY
Google I/O 2011: App Engine MapReduce:
https://www.youtube.com/watch?v=EIxelKcyCC0
Google I/O 2012 - Building Data Pipelines at Google Scale:
https://www.youtube.com/watch?v=lqQ6VFd3Tnw
Zig Mandel mentioned it; here's the link to Google's own recipe for implementing a sharded counter:
https://cloud.google.com/appengine/articles/sharding_counters
I copy-pasted the configurable sharded counter into my app (renaming some variables, etc.) and it's working great!
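For reference, a condensed sketch along the lines of that article (the shard count and kind name are assumptions; see the article for the full, configurable version):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # assumed; more shards = more write throughput

class CounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(name):
    # Pick a random shard so concurrent writers rarely contend.
    shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id)
    if shard is None:
        shard = CounterShard(id=shard_id)
    shard.count += 1
    shard.put()

def get_count(name):
    # Sum all shards, fetching them by key to avoid query staleness.
    keys = [ndb.Key(CounterShard, '%s-%d' % (name, i))
            for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s)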
I used this tutorial, https://cloud.google.com/appengine/articles/sharding_counters, together with the hashids library and created this golang library:
https://github.com/janekolszak/go-gae-uid
gen := gaeuid.NewGenerator("Kind", "HASH'S SALT", 11 /*id length*/) // datastore kind, hashids salt, id length
c := appengine.NewContext(r)
id, err := gen.NewID(c) // allocates the next counter value and encodes it as an id
The same approach should be easy to implement in other languages.
