Google Cloud Dataflow ETL (Datastore -> Transform -> BigQuery) - google-app-engine

We have an application running on Google App Engine using Datastore as its persistence back-end. Currently the application has mostly 'OLTP' features and some rudimentary reporting. While implementing reports we found that processing large amounts of data (millions of objects) is very difficult using Datastore and GQL. To enhance our application with proper reports and Business Intelligence features, we think it's better to set up an ETL process to move data from Datastore to BigQuery.
Initially we thought of implementing the ETL process as an App Engine cron job, but it looks like Dataflow can also be used for this. We have the following requirements for setting up the process:
- Be able to push all existing data to BigQuery using the non-streaming API of BigQuery.
- Once the above is done, push any new data to BigQuery using the streaming API whenever it is updated/created in Datastore.
My questions are:
- Is Cloud Dataflow the right candidate for implementing this pipeline?
- Will we be able to push existing data? Some of the Kinds have millions of objects.
What should be the right approach to implement it? We are considering two approaches.
The first approach is to go through Pub/Sub: for existing data, create a cron job that pushes all data to Pub/Sub, and for any new updates, push the data to Pub/Sub at the same time it is updated in Datastore. The Dataflow pipeline will pick it up from Pub/Sub and push it to BigQuery.
The second approach is to create a batch pipeline in Dataflow that queries Datastore and pushes any new data to BigQuery.
The question is: are these two approaches doable? Which one is better cost-wise? Is there any other way that is better than the above two?
Thank you,
rizTaak

Dataflow can absolutely be used for this purpose. In fact, Dataflow's scalability should make the process fast and relatively easy.
Both of your approaches should work -- I'd give preference to the second one: use a batch pipeline to move the existing data, and then a streaming pipeline to handle new data via Cloud Pub/Sub. In addition to the data movement, Dataflow allows arbitrary analytics/manipulation to be performed on the data itself.
That said, BigQuery and Datastore can be connected directly. See, for example, Loading Data From Cloud Datastore in the BigQuery documentation.
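For the batch half, a minimal sketch with the Apache Beam Python SDK of what this could look like is below. The project, Kind, table, and the entity-to-row mapping are placeholders, not part of the original answer; the real pipeline would map your own entities to your own schema.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query


def entity_to_row(entity):
    # Hypothetical mapping from a Datastore entity to a BigQuery row dict;
    # adjust the property names to your own Kind and table schema.
    props = entity.properties
    return {'id': entity.key.path_elements[-1], 'name': props.get('name')}


def run():
    options = PipelineOptions(project='my-project', temp_location='gs://my-bucket/tmp')
    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromDatastore' >> ReadFromDatastore(Query(kind='MyKind', project='my-project'))
         | 'ToRow' >> beam.Map(entity_to_row)
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
             'my-project:analytics.my_kind',
             schema='id:INTEGER,name:STRING',
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))


if __name__ == '__main__':
    run()
```

The streaming half would swap the Datastore read for a Pub/Sub source (beam.io.ReadFromPubSub) and rely on streaming inserts on the BigQuery sink.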

Related

What is the best place to run BigQuery queries in Google Cloud platform?

I have some files that contain thousands of rows that I need to insert into Google BigQuery, so, because the execution time exceeds the 60s request limit in AppEngine, I moved the BQ queries into a task queue.
For now it works very well, but I don't know if this is the best place to put BQ queries. I am saying this because the requests are taking up to 3 minutes to complete, and I think that this is a bit slow. Do you think there's a faster / better place to query BQ?
PS: I am using the google bigquery api to send the queries.
There are two options:
Your file with data is formatted to be used with BQ load jobs. In this case you start the load job from the task queue, store the job ID you get from the REST call in the datastore, and quit the task. As a separate process you set up an App Engine cron that runs, say, every minute and just checks all running job IDs, updates their status if it changed (the cron handler runs as a task queue task, so it stays under the 10 min limit), and initiates another process if needed. I think this will be pretty scalable (a rough sketch follows after the second option).
You process the file and insert the rows manually. In this case the best course of action will be to use Pub/Sub, or again to start multiple tasks in the task queue by manually splitting the data into small pieces and using the BQ streaming insert API. Of course it depends on the size of your rows, but I found that 1,000-5,000 records per process works well here.
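For the first option, a minimal sketch of the start-then-poll pattern using the google-cloud-bigquery client (which wraps the REST calls the answer mentions), assuming the files are staged in Cloud Storage; the URI, table ID, and format settings are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()


def start_load_job(gcs_uri, table_id):
    """Kick off an asynchronous load job and return its ID for later polling."""
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    # Persist load_job.job_id (e.g. in Datastore) instead of blocking on the result.
    return load_job.job_id


def check_job(job_id):
    """Called from the cron/task to see whether a previously started job finished."""
    job = client.get_job(job_id)
    if job.state == 'DONE':
        return 'FAILED' if job.error_result else 'SUCCEEDED'
    return 'RUNNING'
```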
Also check out Potens.io (also available on Cloud Launcher).
Magnus - Workflow Automator, which is part of the Potens suite, supports BigQuery, Cloud Storage and most Google APIs, as well as multiple simple utility-type tasks like BigQuery Task, Export to Storage Task, Loop Task and many more.
Disclosure: I am the creator of those tools and the leader of the Potens team.
If you have your text files in Google Cloud Storage, Cloud Dataflow could be a natural solution for your situation {1}.
You can use a Google-Provided Template to save some time in the process of creating a Cloud Dataflow pipeline {2}.
This way you can create a batch pipeline to move (and transform if you want) data from Google Cloud Storage (files) to BigQuery.
{1}: https://cloud.google.com/dataflow/
{2}: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloud-storage-text-to-bigquery
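As a rough illustration of launching that provided template programmatically (the bucket, dataset, table, and UDF names are made up; verify the exact parameter names against the template documentation in {2}):

```python
from googleapiclient.discovery import build


def launch_text_to_bq(project, bucket):
    # Launch the Cloud-Storage-Text-to-BigQuery provided template via the Dataflow API.
    dataflow = build('dataflow', 'v1b3')
    body = {
        'jobName': 'gcs-text-to-bq',
        'parameters': {
            'inputFilePattern': 'gs://{}/input/*.csv'.format(bucket),
            'JSONPath': 'gs://{}/schemas/schema.json'.format(bucket),
            'javascriptTextTransformGcsPath': 'gs://{}/udf/transform.js'.format(bucket),
            'javascriptTextTransformFunctionName': 'transform',
            'outputTable': '{}:my_dataset.my_table'.format(project),
            'bigQueryLoadingTemporaryDirectory': 'gs://{}/tmp'.format(bucket),
        },
    }
    request = dataflow.projects().templates().launch(
        projectId=project,
        gcsPath='gs://dataflow-templates/latest/GCS_Text_to_BigQuery',
        body=body,
    )
    return request.execute()
```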

data computation in google app engine flex

We have a project where 2 datasets (kinds) are stored in Google Datastore, with 1.1 million records between them. We are also planning to add more datasets going forward. Now we are thinking of moving to App Engine flex so that statistical libraries such as numpy and pandas and the ML framework scikit-learn can be used to build predictive models. As part of data transformation/computation, pandas and numpy will be used to extract new features out of the datasets stored in Datastore.
Question: what is an effective approach to executing the computation logic (data aggregation and transformation) on these large datasets in the App Engine flex environment? Initially I was thinking of using a task queue to do this heavy-duty transformation, considering it has a 10 min timeout, but I am not sure whether that is feasible in the flex environment.
The trouble is that task queues have limited support in the flex environment. From Migrating Services from the Standard Environment to the Flexible Environment:
Task Queue
The Task Queue service has limited availability outside of the standard environment. If you want to use the service outside of the standard environment, you can sign up for the Cloud Tasks alpha.
Outside of the standard environment, you can't add tasks to push queues, but a service running in the flexible environment can be the target of a push task. You can specify this using the target parameter when adding a task to queue or by specifying the default target for the queue in queue.yaml.
In many cases where you might use pull queues, such as queuing up tasks or messages that will be pulled and processed by separate workers, Cloud Pub/Sub can be a good alternative as it offers similar functionality and delivery guarantees.
One approach is already mentioned in the above quote: using Cloud Pub/Sub.
Another approach is also hinted at in the quote:
- keep part of the existing app as a standard env service/module, populating the datasets and pushing processing tasks into push task queues;
- use the flex environment in the processing service(s)/module(s) where you need to use those libraries. These would be specified as targets for those pushed tasks (a minimal sketch of the enqueueing side follows below).
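A minimal sketch of that enqueueing side, assuming a standard env module pushing to a flex service; the queue name, handler URL, and service name are illustrative placeholders:

```python
# Runs in the standard environment service/module.
from google.appengine.api import taskqueue


def enqueue_processing(dataset_key):
    taskqueue.add(
        queue_name='processing',
        url='/tasks/transform',   # handler exposed by the flex service
        target='flex-workers',    # flex service that runs numpy/pandas/scikit-learn
        params={'dataset_key': dataset_key},
    )
```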

Loading data from google cloud storage to BigQuery

I have a requirement to load 100's of tables into BigQuery from Google Cloud Storage (GCS -> temp table -> main table). I have created a Python process to load the data into BigQuery and scheduled it in AppEngine. Since we have a maximum 10 min timeout in AppEngine, I submit the jobs in asynchronous mode and check the job status at a later point in time. Since I have 100's of tables, I need to create a monitoring system to check the status of the load jobs.
That means maintaining a couple of tables and a bunch of views to check the job status.
The operational process is a little complex. Is there any better way?
Thanks
When we did this, we simply used a message queue like Beanstalkd, where we pushed something that later had to be checked, and we wrote a small worker that subscribed to the channel and dealt with the task.
On the other hand: BigQuery offers support for querying data directly from Google Cloud Storage.
Use cases:
- Loading and cleaning your data in one pass by querying the data from a federated data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
- Having a small amount of frequently changing data that you join with other tables. As a federated data source, the frequently changing data does not need to be reloaded every time it is updated.
https://cloud.google.com/bigquery/federated-data-sources
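As a sketch of that federated pattern with the BigQuery Python client (the bucket path, table alias, destination table, and SQL are all placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Treat CSV files in GCS as an external (federated) table for a single query.
external_config = bigquery.ExternalConfig('CSV')
external_config.source_uris = ['gs://my-bucket/staging/orders_*.csv']
external_config.options.skip_leading_rows = 1
external_config.autodetect = True

job_config = bigquery.QueryJobConfig(
    table_definitions={'staging_orders': external_config},
    destination='my-project.warehouse.orders',
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Clean and load in one pass: query the external data, write the result into BigQuery.
sql = "SELECT * FROM staging_orders WHERE order_total IS NOT NULL"
client.query(sql, job_config=job_config).result()
```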

Automatically push engine datastore data to bigquery tables

To move data from Datastore to BigQuery tables I currently follow a manual and time-consuming process, that is, backing up to Google Cloud Storage and restoring to BigQuery. There is scant documentation on the restoring part, so this post is handy: http://sookocheff.com/posts/2014-08-04-restoring-an-app-engine-backup/
Now, there is a seemingly outdated article (with code) to do it: https://cloud.google.com/bigquery/articles/datastoretobigquery
I have, however, been waiting for access to this experimental tester program that seems to automate the process, but have gotten no access for months: https://docs.google.com/forms/d/1HpC2B1HmtYv_PuHPsUGz_Odq0Nb43_6ySfaVJufEJTc/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
For some entities, I'd like to push the data to BigQuery as it comes (inserts and possibly updates). For more biz-intelligence type of analysis, a daily push is fine.
So, what's the best way to do it?
There are three ways of entering data into BigQuery:
through the UI
through the command line
via API
If you choose the API, then you have two different ways: "batch" mode or the streaming API.
If you want to send data "as it comes" then you need to use the streaming API. Every time you detect a change in your Datastore (or maybe once every few minutes, depending on your needs), you have to call the insertAll method of the API. Please note that you need to have a table created beforehand with the structure of your Datastore. (This can be done via the API too, if needed.)
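As an illustration of that streaming path with the google-cloud-bigquery client (insert_rows_json wraps the insertAll call; the table ID and row fields are made up):

```python
from google.cloud import bigquery

client = bigquery.Client()


def stream_entity_change(row):
    # 'row' is a dict matching the pre-created table's schema, e.g.
    # {'id': 123, 'name': 'foo'}; the table ID below is a placeholder.
    errors = client.insert_rows_json(
        'my-project.analytics.my_kind',
        [row],
        row_ids=[str(row['id'])],  # best-effort deduplication if the call is retried
    )
    if errors:
        raise RuntimeError('Streaming insert failed: %s' % errors)
```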
For your second requirement, ingesting data once a day, you have the full code in the link you provided. All you need to do is adjust the JSON schema to that of your data store and you should be good to go.

AppEngine & BigQuery - Where would you put stat/monitoring data?

I have an AppEngine application that processes files from Cloud Storage and inserts them into BigQuery.
Because I would like to know the sanity/performance of the application, both now and in the future, I would like to store stats data in either Cloud Datastore or in a Cloud SQL instance.
I have two questions I would like to ask:
Cloud Datastore vs Cloud SQL - what would you use and why? What downsides have you experienced so far?
Would you use a task or a direct call to insert data, and why? Would you add a task and then have some consumers insert the data, or would you do a direct insert [regardless of the solution chosen above]? What downsides have you experienced so far?
Thank you.
Cloud SQL is better if you want to perform JOINs or SUMs later; Cloud Datastore will scale more if you have a lot of data to store. Also, in the Datastore, if you want to update a stats entity transactionally, you will need to shard it or you will be limited to 5 updates per second.
If the data to insert is small (one row to insert in BQ or one entity in the datastore) then you can do it with a direct call, but you must accept that the call may fail. If you want to retry in case of failure, or if the data to insert is big and will take time, it is better to run it asynchronously in a task. Note that with tasks you must be cautious, because they can be run more than once.
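Since push tasks can run more than once, one way to keep the insert harmless on retries, sketched here with the App Engine ndb client and a made-up stats entity, is to derive the entity key from the thing being measured so a repeated execution overwrites rather than duplicates:

```python
from google.appengine.ext import ndb


class LoadStat(ndb.Model):
    # Hypothetical stats entity; the fields are placeholders.
    rows_loaded = ndb.IntegerProperty()
    duration_ms = ndb.IntegerProperty()
    recorded_at = ndb.DateTimeProperty(auto_now=True)


def record_stat(file_id, rows_loaded, duration_ms):
    # Task handler body: using file_id as the entity key makes a duplicate
    # task execution overwrite the same entity instead of creating a new one.
    LoadStat(id=file_id, rows_loaded=rows_loaded, duration_ms=duration_ms).put()
```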
