Google App Engine: Using Big Query on datastore? - google-app-engine

Have a GAE datastore kind with several 100,000s of objects in it. Want to do several involved queries (involving counting queries). BigQuery seems a good fit for doing this.
Is there currently an easy way to query a live AppEngine Datastore using Big Query?

You can't run BigQuery directly on Datastore entities, but you can write a Mapper Pipeline that reads entities out of the Datastore, writes them to CSV in Google Cloud Storage, and then ingests those files into BigQuery; you can even automate the process. Here's an example that uses the Mapper API classes for just the Datastore-to-CSV step:
import logging
import re
import time
from datetime import datetime
import urllib
import httplib2
import pickle
from google.appengine.ext import blobstore
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.ext.webapp import blobstore_handlers
from google.appengine.ext.webapp import util
from google.appengine.ext.webapp import template
from mapreduce.lib import files
from google.appengine.api import taskqueue
from google.appengine.api import users
from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op
from apiclient.discovery import build
from google.appengine.api import memcache
from oauth2client.appengine import AppAssertionCredentials

# Number of shards to use in the Mapper pipeline
SHARDS = 20

# Name of the project's Google Cloud Storage bucket
GS_BUCKET = 'your bucket'

# Datastore model
class YourEntity(db.Expando):
    field1 = db.StringProperty()  # etc, etc

ENTITY_KIND = 'main.YourEntity'


class MapReduceStart(webapp.RequestHandler):
    """Handler that provides a link for the user to start the MapReduce pipeline."""

    def get(self):
        pipeline = IteratorPipeline(ENTITY_KIND)
        pipeline.start()
        path = pipeline.base_path + "/status?root=" + pipeline.pipeline_id
        logging.info('Redirecting to: %s' % path)
        self.redirect(path)


class IteratorPipeline(base_handler.PipelineBase):
    """A pipeline that iterates through the datastore."""

    def run(self, entity_type):
        output = yield mapreduce_pipeline.MapperPipeline(
            "DataStore_to_Google_Storage_Pipeline",
            "main.datastore_map",
            "mapreduce.input_readers.DatastoreInputReader",
            output_writer_spec="mapreduce.output_writers.FileOutputWriter",
            params={
                "input_reader": {
                    "entity_kind": entity_type,
                },
                "output_writer": {
                    "filesystem": "gs",
                    "gs_bucket_name": GS_BUCKET,
                    "output_sharding": "none",
                }
            },
            shards=SHARDS)


def datastore_map(entity):
    # Mapper function: called once per entity, emits one CSV line per entity.
    props = GetPropsFor(entity)
    data = db.to_dict(entity)
    result = ','.join(['"%s"' % str(data.get(k)) for k in props])
    yield ('%s\n' % result)


def GetPropsFor(entity_or_kind):
    if isinstance(entity_or_kind, basestring):
        kind = entity_or_kind
    else:
        kind = entity_or_kind.kind()
    cls = globals().get(kind)
    return cls.properties()


application = webapp.WSGIApplication(
    [('/start', MapReduceStart)],
    debug=True)


def main():
    run_wsgi_app(application)


if __name__ == "__main__":
    main()
If you append yield CloudStorageToBigQuery(output) to the end of IteratorPipeline's run method, you can pipe the resulting list of CSV file handles into a BigQuery ingestion pipeline, like this:
class CloudStorageToBigQuery(base_handler.PipelineBase):
    """A pipeline that kicks off a BigQuery ingestion job."""

    def run(self, output):
        # BigQuery API settings
        SCOPE = 'https://www.googleapis.com/auth/bigquery'
        PROJECT_ID = 'Some_ProjectXXXX'
        DATASET_ID = 'Some_DATASET'

        # Create a new API service for interacting with BigQuery
        credentials = AppAssertionCredentials(scope=SCOPE)
        http = credentials.authorize(httplib2.Http())
        bigquery_service = build("bigquery", "v2", http=http)

        jobs = bigquery_service.jobs()
        table_name = 'datastore_dump_%s' % datetime.utcnow().strftime(
            '%m%d%Y_%H%M%S')
        files = [str(f.replace('/gs/', 'gs://')) for f in output]
        result = jobs.insert(
            projectId=PROJECT_ID,
            body=build_job_data(PROJECT_ID, DATASET_ID, table_name, files)
        ).execute()
        logging.info(result)


def build_job_data(project_id, dataset_id, table_name, files):
    return {"projectId": project_id,
            "configuration": {
                "load": {
                    "sourceUris": files,
                    "schema": {
                        # put your schema here
                        "fields": fields
                    },
                    "destinationTable": {
                        "projectId": project_id,
                        "datasetId": dataset_id,
                        "tableId": table_name,
                    },
                }
            }
            }

With the streaming inserts API (new as of September 2013) you can import records from your app into BigQuery as they are written.
The data is available in BigQuery immediately, so this should satisfy your "live" requirement.
Whilst this question is now a bit old, this may be an easier solution for anyone stumbling across it.
At the moment, though, getting this to work from the local dev server is patchy at best.
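For reference, a minimal sketch of a streaming insert using the same apiclient/AppAssertionCredentials pattern as the answer above. The project, dataset and table names are placeholders, and the destination table is assumed to already exist:
import httplib2
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

SCOPE = 'https://www.googleapis.com/auth/bigquery'
PROJECT_ID = 'your-project-id'  # placeholder

credentials = AppAssertionCredentials(scope=SCOPE)
bigquery = build("bigquery", "v2", http=credentials.authorize(httplib2.Http()))

def stream_row(row_dict, insert_id):
    # insertId lets BigQuery de-duplicate retries of the same record.
    body = {"rows": [{"insertId": insert_id, "json": row_dict}]}
    return bigquery.tabledata().insertAll(
        projectId=PROJECT_ID,
        datasetId='my_dataset',   # placeholder
        tableId='my_table',       # placeholder
        body=body).execute()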

We're doing a Trusted Tester program for moving from Datastore to BigQuery in two simple operations:
1. Back up the datastore using Datastore Admin's backup functionality
2. Import the backup directly into BigQuery
It automatically takes care of the schema for you.
More info (to apply): https://docs.google.com/a/google.com/spreadsheet/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
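Outside the Trusted Tester flow, a Datastore Admin backup can also be loaded with an ordinary load job by pointing sourceUris at the .backup_info file and setting sourceFormat to DATASTORE_BACKUP. A sketch of the job body, in the same style as build_job_data above; project, dataset and URI values are placeholders:
def build_backup_load_job(project_id, dataset_id, table_name, backup_info_uri):
    # backup_info_uri is the gs:// path of the .backup_info file written by
    # Datastore Admin; BigQuery derives the schema from the backup itself.
    return {"projectId": project_id,
            "configuration": {
                "load": {
                    "sourceFormat": "DATASTORE_BACKUP",
                    "sourceUris": [backup_info_uri],
                    "destinationTable": {
                        "projectId": project_id,
                        "datasetId": dataset_id,
                        "tableId": table_name,
                    },
                }
            }}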

For BigQuery you have to export those kinds into a CSV or delimited record structure, load it into BigQuery, and then you can query it. There is no facility that I know of which allows querying the live GAE Datastore.
BigQuery is an analytical query engine, which means you can't change the records: no updates or deletes are allowed, you can only append.

No, BigQuery is a different product that needs the data to be uploaded to it. It cannot work over the Datastore directly. You can use GQL to query the Datastore.
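For instance, a simple GQL query against the live Datastore; the kind and property names here are placeholders:
from google.appengine.ext import db

results = db.GqlQuery(
    "SELECT * FROM YourEntity WHERE field1 = :1", "some value").fetch(100)
for entity in results:
    # process each matching entity
    pass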

As of 2016, this is very possible! You must do the following:
1. Make a new bucket in Google Cloud Storage.
2. Back up your entities using the Datastore Admin at console.developers.google.com (I have a complete tutorial).
3. Head to the BigQuery web UI and import the files generated in step 2.
See this post for a complete example of this workflow!

Related

Querying datastore with NDB

I'm trying to use Python and NDB to access the datastore, which contains one entity.
I've defined my NDB model with the following code:
class Test(ndb.Model):
    name = ndb.StringProperty()
    val = ndb.IntegerProperty()
Then I run a query for the entity:
query = Test.get_by_id("Testing")
This returns None, so there is no val field to read. I tried setting the argument to name=Testing instead of Testing, but that doesn't help.
What can I do to access my entity in Python? Do I need to identify the project's ID somewhere?
Also, I've been using Flask to serve as the microframework. But all the NDB example code I've seen uses webapp2. Should I use webapp2 instead?
Capitalization matters in Python: instead of "Test", you need to query from the model "test", the kind name the entity was actually stored under.
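If the entity really was stored under the lowercase kind "test", one sketch (assuming that kind name) is to keep the Test class but point it at the existing kind by overriding _get_kind:
from google.appengine.ext import ndb

class Test(ndb.Model):
    # Map this model onto the existing lowercase kind 'test'
    @classmethod
    def _get_kind(cls):
        return 'test'

    name = ndb.StringProperty()
    val = ndb.IntegerProperty()

entity = Test.get_by_id("Testing")  # now looks up ndb.Key('test', 'Testing')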

My first BigQuery python script

I would like to know how to create a Python script to access a BigQuery database.
I found a lot of scripts, but not really a complete one.
So, I would like to have a standard script that connects to a project, runs a query on a specific table, and creates a CSV file from the result.
Thanks for your help.
Jérôme.
#!/usr/bin/python
from google.cloud import bigquery
import pprint
import argparse
import sys
from apiclient.discovery import build

def export_data_to_gcs(dataset_id, table_id, destination):
    bigquery_client = bigquery.Client(project='XXXXXXXX-web-data')
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    job = bigquery_client.extract_table(table_ref, destination)
    job.result()  # Waits for job to complete
    print('Exported {}:{} to {}'.format(
        dataset_id, table_id, destination))

export_data_to_gcs('2XXXX842', 'ga_sessions_201XXXXX',
                   'gs://analytics-to-deci/file-name.json')
Destination format
BigQuery supports CSV, JSON, and Avro export formats.
Nested or repeated data cannot be exported to CSV, but it can be exported to JSON or Avro.
As per the Google documentation:
Google Document
Try the other formats as described; for example:
from google.cloud import bigquery
from google.cloud.bigquery.job import DestinationFormat, ExtractJobConfig, Compression

def export_table_to_gcs(dataset_id, table_id, destination):
    """
    Exports data from BigQuery to an object in Google Cloud Storage.
    For more information, see the README.rst.
    Example invocation:
        $ python export_data_to_gcs.py example_dataset example_table \\
            gs://example-bucket/example-data.csv
    The dataset and table should already exist.
    """
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    job_config = ExtractJobConfig()
    job_config.destination_format = DestinationFormat.NEWLINE_DELIMITED_JSON
    job_config.compression = Compression.GZIP
    job = bigquery_client.extract_table(table_ref, destination, job_config=job_config)
    job.result(timeout=300)  # Waits for job to complete
    print('Exported {}:{} to {}'.format(dataset_id, table_id, destination))
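For example, a call producing a gzipped newline-delimited JSON export; the dataset, table, and bucket names here are placeholders:
export_table_to_gcs('my_dataset', 'ga_sessions_20170101',
                    'gs://my-bucket/ga_sessions_20170101.json.gz')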

How to delete multiple entities from GAE datastore by keys

I am using Google App Engine on localhost. I have 2000 entities of kind Book in the Datastore. I want to delete the first 1900 (the keys range from 1 to 1901). How would I do that from the interactive console? I am using ndb as opposed to db.
Maybe there is some sort of range functionality.
For example, I try the following, but nothing happens.
from myddb import Book

list = Book.gql("WHERE ID < 193")
for entity in list:
    db.delete(entity)
EDIT:
Based on the response from @Lipis, the following is working:
from myddb import Book
from google.appengine.ext import ndb

book_keys = Book.query().fetch(keys_only=True)
ndb.delete_multi(book_keys)
But that deletes everything. What I need to work is a query by key (aka ID), like:
book_keys = Book.query(Article._Key < 1901).fetch(keys_only=True)
You should use ndb.delete_multi():
from google.appengine.ext import ndb

book_keys = Book.query().fetch(keys_only=True)
ndb.delete_multi(book_keys)
You should go through the NDB Queries documentation to see what other options you have and what you can achieve.
EDIT
I have not tested the solution below, but test it and let me know.
Also, this should help greatly: ndb cheat sheet
q = Book.query(default_options=QueryOptions(keys_only=True))
if Book.ID < 1901:
    ndb.delete_multi([m.key for m in q.fetch(1900)])
In ndb you use q = Book.query(<query>).fetch(<number>).
Then, iterate and delete.
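For the key-range part, a sketch with the shape the asker wants, assuming numeric IDs and filtering on NDB's special _key property (also untested here):
from google.appengine.ext import ndb
from myddb import Book

# Keys for all Book entities whose numeric ID is below 1901
book_keys = Book.query(
    Book._key < ndb.Key(Book, 1901)).fetch(keys_only=True)
ndb.delete_multi(book_keys)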

only one database create while using a multi-database setup

I am trying to set up Celery in one of my Django projects. I want Celery to use a separate database. Currently, as the project is in the development phase, we are using sqlite3. In order to set up multiple databases I did the following.
Defined the databases in the settings.py file:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'devel',
        'USER': '',
        'PASSWORD': '',
        'HOST': '',
        'PORT': '',
    },
    'celery': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'celery',
        'USER': '',
        'PASSWORD': '',
        'HOST': '',
        'PORT': '',
    },
}
Created a router class in the db_routers.py file:
class CeleryRouter(object):
    """
    This class will route all celery-related models to a
    separate database.
    """
    # Define the applications to be used in the celery database
    APPS = (
        'django',
        'djcelery',
    )
    # Define the database alias
    DB = 'celery'

    def db_for_read(self, model, **hints):
        """Point read operations to the celery database."""
        if model._meta.app_label in self.APPS:
            return self.DB
        return None

    def db_for_write(self, model, **hints):
        """Point write operations to the celery database."""
        if model._meta.app_label in self.APPS:
            return self.DB
        return None

    def allow_relation(self, obj1, obj2, **hints):
        """Allow any relation between two objects in the db pool."""
        if (obj1._meta.app_label in self.APPS) and \
           (obj2._meta.app_label in self.APPS):
            return True
        return None

    def allow_syncdb(self, db, model):
        """Make sure the celery tables appear only in the celery database."""
        if db == self.DB:
            return model._meta.app_label in self.APPS
        elif model._meta.app_label in self.APPS:
            return False
        return None
Updated the DATABASE_ROUTERS variable in the settings.py file:
DATABASE_ROUTERS = [
    'appname.db_routers.CeleryRouter',
]
Now, when I run python manage.py syncdb I see that the tables are created for celery, but there is only one database created, i.e. devel. Why are the tables being created in the devel database and not in the celery database?
Quote from Django docs:
The syncdb management command operates on one database at a time. By default, it operates on the default database, but by providing a --database argument, you can tell syncdb to synchronize a different database.
Try running:
./manage.py syncdb --database=celery

How to delete data from Google App Engine?

I created one table in Google App Engine. I stored and retrieved data from Google App Engine.
However, I don't know how to delete data from the Google App Engine Datastore.
An application can delete an entity from the datastore using a model instance or a Key. The model instance's delete() method deletes the corresponding entity from the datastore. The delete() function takes a Key or list of Keys and deletes the entity (or entities) from the datastore:
q = db.GqlQuery("SELECT * FROM Message WHERE msg_date < :1", earliest_date)
results = q.fetch(10)
for result in results:
result.delete()
# or...
q = db.GqlQuery("SELECT __key__ FROM Message WHERE msg_date < :1", earliest_date)
results = q.fetch(10)
db.delete(results)
Source and further reading:
Google App Engine: Creating, Getting and Deleting Data
If you want to delete all the data in your datastore, you may want to check the following Stack Overflow post:
How to delete all datastore in Google App Engine?
You need to find the entity, and then delete it.
So in Python it would be:
q = db.GqlQuery("SELECT __key__ FROM Message WHERE create_date < :1", earliest_date)
results = q.get()
db.delete(results)
or in Java it would be:
pm.deletePersistent(results);
The relevant App Engine docs are:
http://code.google.com/appengine/docs/java/datastore/creatinggettinganddeletingdata.html#Deleting_an_Object
http://code.google.com/appengine/docs/python/datastore/creatinggettinganddeletingdata.html#Deleting_an_Entity
In Java
I am assuming that you have an endpoint:
Somethingendpoint endpoint = CloudEndpointUtils.updateBuilder(endpointBuilder).build();
And then:
endpoint.remove<ModelName>(long ID);