Using deferred.defer within a transaction - google-app-engine

The Google App Engine docs state:
You can enqueue a task as part of a Google Cloud Datastore
transaction, such that the task is only enqueued—and guaranteed to be
enqueued—if the transaction is committed successfully.
and give this example:
@ndb.transactional
def do_something_in_transaction():
    taskqueue.add(url='/path/to/my/worker', transactional=True)
But it isn't clear to me if the same holds true for tasks created with the deferred library. For this:
@ndb.transactional
def do_something_in_transaction():
    deferred.defer(my_function)
is the task only enqueued if the transaction is successfully committed?

Fundamentally deferred.defer is just a wrapper around taskqueue.add. From the SDK's
google/appengine/ext/deferred/deferred.py file:
def defer(obj, *args, **kwargs):
    ...
    transactional = kwargs.pop("_transactional", False)
    ...
    try:
        task = taskqueue.Task(payload=pickled, **taskargs)
        return task.add(queue, transactional=transactional)
So, if you want the deferred task enqueued transactionally, you just need to do the equivalent:
@ndb.transactional
def do_something_in_transaction():
    deferred.defer(my_function, _transactional=True)
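For example, here is a minimal sketch (the Invoice model and the send_receipt function are hypothetical, used only for illustration) that combines a datastore write with a transactionally deferred task:
from google.appengine.ext import deferred, ndb

class Invoice(ndb.Model):
    # Hypothetical model, for illustration only.
    status = ndb.StringProperty()

def send_receipt(invoice_key):
    # Hypothetical deferred function, for illustration only.
    pass

@ndb.transactional
def create_and_notify(invoice_key):
    Invoice(key=invoice_key, status='new').put()
    # The task is enqueued only if the surrounding transaction commits.
    deferred.defer(send_receipt, invoice_key, _transactional=True)
The usual restrictions on transactionally enqueued tasks apply: they cannot be named, and at most five tasks can be added in a single transaction.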

Related

Creating a cluster before sending a job to dataproc programmatically

I'm trying to schedule a PySpark job. I followed the GCP documentation and ended up deploying a small Python script to App Engine which does the following:
authenticate using a service account
submit a job to a cluster
The problem is, I need the cluster to be up and running, otherwise the job won't be sent (duh!), but I don't want the cluster to always be up and running, especially since my job needs to run once a month.
I wanted to add the creation of a cluster in my Python script, but the call is asynchronous (it makes an HTTP request) and thus my job is submitted after the cluster creation call but before the cluster is really up and running.
How can I do this?
I'd like something cleaner than just waiting for a few minutes in my script!
Thanks
EDIT: Here's what my code looks like so far:
To launch the job
class EnqueueTaskHandler(webapp2.RequestHandler):
    def get(self):
        task = taskqueue.add(
            url='/run',
            target='worker')
        self.response.write(
            'Task {} enqueued, ETA {}.'.format(task.name, task.eta))

app = webapp2.WSGIApplication([('/launch', EnqueueTaskHandler)], debug=True)
The job
class CronEventHandler(webapp2.RequestHandler):

    def create_cluster(self, dataproc, project, zone, region, cluster_name):
        zone_uri = 'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(project, zone)
        cluster_data = {...}
        dataproc.projects().regions().clusters().create(
            projectId=project,
            region=region,
            body=cluster_data).execute()

    def wait_for_cluster(self, dataproc, project, region, clustername):
        print('Waiting for cluster to run...')
        while True:
            result = dataproc.projects().regions().clusters().get(
                projectId=project,
                region=region,
                clusterName=clustername).execute()
            # Handle exceptions
            if result['status']['state'] != 'RUNNING':
                time.sleep(60)
            else:
                return result

    def wait_for_job(self, dataproc, project, region, job_id):
        print('Waiting for job to finish...')
        while True:
            result = dataproc.projects().regions().jobs().get(
                projectId=project,
                region=region,
                jobId=job_id).execute()
            # Handle exceptions
            print(result['status']['state'])
            if result['status']['state'] == 'ERROR' or result['status']['state'] == 'DONE':
                return result
            else:
                time.sleep(60)

    def submit_job(self, dataproc, project, region, clusterName):
        job = {...}
        result = dataproc.projects().regions().jobs().submit(
            projectId=project, region=region, body=job).execute()
        return result['reference']['jobId']

    def post(self):
        dataproc = googleapiclient.discovery.build('dataproc', 'v1')
        project = '...'
        region = "..."
        zone = "..."
        clusterName = '...'
        self.create_cluster(dataproc, project, zone, region, clusterName)
        self.wait_for_cluster(dataproc, project, region, clusterName)
        job_id = self.submit_job(dataproc, project, region, clusterName)
        self.wait_for_job(dataproc, project, region, job_id)
        dataproc.projects().regions().clusters().delete(
            projectId=project, region=region, clusterName=clusterName).execute()
        self.response.write("JOB SENT")

app = webapp2.WSGIApplication([('/run', CronEventHandler)], debug=True)
Everything works until the deletion of the cluster. At this point I get a "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded." Any idea?
In addition to general polling, either through list or get requests on the Cluster or on the Operation returned by the CreateCluster request, for single-use clusters like this you can also consider using the Dataproc Workflows API, and possibly its InstantiateInline interface if you don't want to use full-fledged workflow templates. With this API you use a single request to specify the cluster settings along with the jobs to submit; the jobs automatically run as soon as the cluster is ready to take them, after which the cluster is deleted automatically.
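Here is a minimal sketch of the inline-workflow approach with the same discovery-based client. All names, URIs and the config body below are made up, and the exact required fields (and whether your client version exposes workflow templates under v1 or only v1beta2) should be verified against the current Dataproc docs:
import googleapiclient.discovery

dataproc = googleapiclient.discovery.build('dataproc', 'v1')

# A managed cluster plus the jobs to run, in one request: Dataproc creates the
# cluster, runs the jobs, then deletes the cluster when the workflow finishes.
template = {
    'placement': {
        'managedCluster': {
            'clusterName': 'monthly-job-cluster',  # hypothetical name
            'config': {},                          # your cluster config here
        }
    },
    'jobs': [{
        'stepId': 'monthly-pyspark-step',          # hypothetical step id
        'pysparkJob': {'mainPythonFileUri': 'gs://my-bucket/job.py'},
    }],
}

operation = dataproc.projects().regions().workflowTemplates().instantiateInline(
    parent='projects/{}/regions/{}'.format('my-project', 'my-region'),
    body=template).execute()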
You can use the Google Cloud Dataproc API to create, delete and list clusters.
The list operation can be (repeatedly) performed after create and delete operations to confirm that they completed successfully, since it provides the ClusterStatus of the clusters in the results with the relevant State information:
UNKNOWN The cluster state is unknown.
CREATING The cluster is being created and set up. It is not ready for use.
RUNNING The cluster is currently running and healthy. It is ready for use.
ERROR The cluster encountered an error. It is not ready for use.
DELETING The cluster is being deleted. It cannot be used.
UPDATING The cluster is being updated. It continues to accept and process jobs.
To avoid plain waiting between the (repeated) list invocations (in general not a good thing to do on GAE), you can enqueue delayed tasks in a push task queue (with the relevant context information), allowing you to perform such list operations at a later time. For example, in Python, see taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.
If, at task execution time, the result indicates that the operation of interest is still in progress, simply enqueue another such delayed task - effectively polling, but without an actual wait/sleep.
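A minimal sketch of that pattern, assuming a hypothetical /poll-cluster push-queue handler (and a hypothetical /submit-job handler) alongside the discovery-based client from the question:
from google.appengine.api import taskqueue
import googleapiclient.discovery
import webapp2

class PollClusterHandler(webapp2.RequestHandler):
    def post(self):
        project = self.request.get('project')
        region = self.request.get('region')
        cluster = self.request.get('cluster')
        dataproc = googleapiclient.discovery.build('dataproc', 'v1')
        result = dataproc.projects().regions().clusters().get(
            projectId=project, region=region, clusterName=cluster).execute()
        state = result['status']['state']
        params = {'project': project, 'region': region, 'cluster': cluster}
        if state == 'RUNNING':
            # Cluster is ready: enqueue the task that submits the job.
            taskqueue.add(url='/submit-job', params=params)
        elif state in ('CREATING', 'UNKNOWN'):
            # Not ready yet: re-check in 60 seconds instead of sleeping in-process.
            taskqueue.add(url='/poll-cluster', countdown=60, params=params)
        # ERROR and DELETING states would be reported/handled here.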

Why does aiomysql lock the table even when using a context manager?

I noticed that even though I execute SQL statements inside a "with" context manager, after the request is finished the queried table is still locked and I can't execute "truncate" on it until I stop the event loop.
Here is example of my code:
import logging
import asyncio
import aiomysql
from aiohttp import web
from aiomysql.cursors import DictCursor

logging.basicConfig(level=logging.DEBUG)

async def index(request):
    async with request.app["mysql"].acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute("SELECT * FROM my_table")
            lines = await cur.fetchall()
    return web.Response(text='Hello Aiohttp!')

async def get_mysql_pool(loop):
    pool = await aiomysql.create_pool(
        host="localhost",
        user="test",
        password="test",
        db="test",
        cursorclass=DictCursor,
        loop=loop
    )
    return pool

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    mysql = loop.run_until_complete(get_mysql_pool(loop))
    app = web.Application(loop=loop, debug=True)
    app["mysql"] = mysql
    app.router.add_get("/", index)
    web.run_app(app)
After executing curl 'http://localhost:8080/', I connect to the MySQL server with the mysql CLI and try to execute "truncate my_table" - it won't finish until I stop aiohttp. How can I change this behavior?
Locks are held because the connection is not in autocommit mode by default. Adding autocommit=True should solve the issue.
pool = await aiomysql.create_pool(
    host="localhost",
    user="test",
    password="test",
    db="test",
    autocommit=True,
    cursorclass=DictCursor,
    loop=loop)
Alternatively, it is possible to release the transaction with an explicit command:
await cur.execute("COMMIT;")
The primary purpose of the context managers here is to close the cursor, not to commit the transaction.
aiomysql has an SQLAlchemy Core extension with context-manager support for transactions; see the example here:
https://github.com/aio-libs/aiomysql/blob/93aa3e5f77d77ad5592c3e9519cfc9f9587bf9ac/tests/pep492/test_async_with.py#L214-L234
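For example, a minimal sketch of the aiomysql.sa transaction context manager, reusing the test credentials from the question (the function name is made up):
import asyncio
from aiomysql.sa import create_engine

async def run_in_transaction(loop):
    engine = await create_engine(
        host="localhost", user="test", password="test", db="test", loop=loop)
    async with engine.acquire() as conn:
        # Commits on success, rolls back if the block raises.
        async with conn.begin():
            await conn.execute("DELETE FROM my_table WHERE id = 1")
    engine.close()
    await engine.wait_closed()

loop = asyncio.get_event_loop()
loop.run_until_complete(run_in_transaction(loop))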

Deleting a massive number of entities from Google App Engine NDB

The previous developers caused some problems in our Google App Engine app. Currently, the app is saving entities with NULL values, but it would be better if we could clean up all these values.
Here is the ndb.Model:
class Day(ndb.Model):
    date = ndb.DateProperty(required=True, indexed=True)
    items = ndb.StringProperty(repeated=True, indexed=False)
    reason = ndb.StringProperty(name="cancelled", indexed=False)
    is_hole = ndb.ComputedProperty(lambda s: not bool(s.items or s.reason))
Somehow, we need to delete all Day entities where is_hole is true.
There are around 4,000,000 entities, of which around 2,000,000 should be deleted on the server.
Code so far
I thought it would be good to first count how many entities we should delete using this code:
count = Day.query(Day.is_hole != False).count(10000)
This (with the limit of 10,000) takes around 5 seconds to run. Without the limit, it would cause a DeadlineExceededError.
For deleting, I've tried this code:
ndb.delete_multi([key for key in Day.query(Day.is_hole != False).fetch(10000, keys_only=True)])
This (with the limit) takes around 30 seconds.
Question
How can I delete all Day entities where is_hole != False faster?
(We are using Python)
No, there is no faster way to delete entities - the deadline is fixed.
But there are some tricks.
You can get a longer deadline if you use the task queue (https://cloud.google.com/appengine/docs/python/taskqueue/): put a task in the queue and have it generate the next task after the first one finishes (recurrence).
Another option, similar to the task queue, is to redirect to the same handler after deleting some of the bad records, repeating until the last record is deleted. This requires keeping the browser open until the end.
if at_least_one_bad_record:
    delete_some_records (not longer than 30s)
    spawn again this task or redirect to this handler (next call will have next 30s)
Remember that it has an exit point when no more matching records remain. It will delete all matching records without clicking again.
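Here is a minimal sketch of the recurring-task variant using the deferred library and the Day model from the question (the batch size is an assumption; tune it to stay well under the task deadline):
from google.appengine.ext import deferred, ndb

BATCH_SIZE = 500  # assumed batch size

def delete_holes_batch():
    keys = Day.query(Day.is_hole == True).fetch(BATCH_SIZE, keys_only=True)
    if keys:
        ndb.delete_multi(keys)
        # More matching entities may remain: re-enqueue this function as a new task.
        deferred.defer(delete_holes_batch)
    # Exit point: when no keys are left, no new task is enqueued.

# Kick it off once, e.g. from an admin handler:
# deferred.defer(delete_holes_batch)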
The best way is to use MapReduce, which runs in the task queue; you can also use sharding to parallelize the work. Here is the Python code. Let me know if you need any clarification.
main.py
import webapp2

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op
from mapreduce.input_readers import InputReader
from google.appengine.api import app_identity

def deleteEntity(entity):
    yield op.db.Delete(entity)

class DeleteEntitiesPipeline(base_handler.PipelineBase):
    def run(self):
        bucket_name = app_identity.get_default_gcs_bucket_name()
        yield mapreduce_pipeline.MapPipeline(
            "job_name",
            "main.deleteEntity",
            "mapreduce.input_readers.DatastoreInputReader",
            params={
                "entity_kind": 'models.Day',
                "filters": [("is_hole", "=", True)],
                "bucket_name": bucket_name
            },
            shards=5)

class StartDelete(webapp2.RequestHandler):
    def get(self):
        pipeline = DeleteEntitiesPipeline()
        pipeline.start()

application = webapp2.WSGIApplication([
    ('/deleteentities', StartDelete),
], debug=True)

GAE taskqueue access application storage

My GAE application is written in Python with webapp2. The application analyzes a user's online social network. Users can log in and authorize my application, and the access token is stored for further crawling of the data. I then use the task queue to launch a backend task, as the crawling process is time consuming. However, when I access the datastore from the task to fetch the access token, I cannot get it. I wonder whether there is a way to access the data of the frontend, rather than the temporary storage for the task queue.
The handler that processes the HTTP request from the user:
class Callback(webapp2.RequestHandler):
    def get(self):
        global client
        global r
        code = self.request.get('code')
        try:
            client = APIClient(app_key=APP_KEY, app_secret=APP_SECRET, redirect_uri=CALLBACK_URL)
            r = client.request_access_token(code)
            access_token = r.access_token
            record = model.getAccessTokenByUid(r.uid)
            if record is None or r.access_token != record.accessToken:
                # logging.debug("access token stored")
                model.insertAccessToken(long(r.uid), access_token, r.expires_in, "uncrawled", datetime.datetime.now())  # data stored here
            session = self.request.environ['beaker.session']
            session['uid'] = long(r.uid)
            self.redirect(CLUSTER_PAGE % ("true"))
        except Exception, e:
            logging.error("callback:%s" % (str(e)))
            self.redirect(CLUSTER_PAGE % ("false"))
The handler that processes the task submitted to the task queue:
class CrawlWorker(webapp2.RequestHandler):
    def post(self):  # should run at most 1/s
        uid = self.request.get('uid')
        logging.debug("start crawling uid:%s in the backend" % (str(uid)))
        global client
        global client1
        global r
        tokenTuple = model.getAccessTokenByUid(uid)
        if tokenTuple is None:  # here i always get a None
            logging.error("CounterWorker:oops, authorization token is missed.")
            return
The question is not clear (is it "can" or "can't"?), but if you want to access frontend data from the task queue, pass it as parameters to the task.
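A minimal sketch of that approach, reusing the names from the question (the /crawl URL is assumed to be the route for CrawlWorker):
from google.appengine.api import taskqueue

# In the frontend handler (Callback.get), after the token has been obtained and stored:
taskqueue.add(url='/crawl',
              params={'uid': str(r.uid), 'access_token': access_token})

# In the worker (CrawlWorker.post), the values come back with the request:
#     uid = self.request.get('uid')
#     access_token = self.request.get('access_token')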

Why is an entity not being fetched from NDB's in-context cache?

I have an entity that is used to store some global app settings. These settings can be edited via an admin HTML page, but very rarely change. I have only one instance of this entity (a singleton of sorts) and always refer to this instance when I need access to the settings.
Here's what it boils down to:
class Settings(ndb.Model):
    SINGLETON_DATASTORE_KEY = 'SINGLETON'

    @classmethod
    def singleton(cls):
        return cls.get_or_insert(cls.SINGLETON_DATASTORE_KEY)

    foo = ndb.IntegerProperty(
        default = 100,
        verbose_name = "Some setting called 'foo'",
        indexed = False)

@ndb.tasklet
def foo():
    # Even though settings has already been fetched from memcache and
    # should be available in NDB's in-context cache, the following call
    # fetches it from memcache anyways. Why?
    settings = Settings.singleton()

class SomeHandler(webapp2.RequestHandler):
    @ndb.toplevel
    def get(self):
        settings = Settings.singleton()
        # Do some stuff
        yield foo()
        self.response.write("The 'foo' setting value is %d" % settings.foo)
I was under the assumption that calling Settings.singleton() more than once per request handler would be pretty fast, as the first call would most probably retrieve the Settings entity from memcache (since the entity is seldom updated) and all subsequent calls within the same request handler would retrieve it from NDB's in-context cache. From the documentation:
The in-context cache persists only for the duration of a single incoming HTTP request and is "visible" only to the code that handles that request. It's fast; this cache lives in memory.
However, Appstats is showing that my Settings entity is being retrieved from memcache multiple times within the same request handler. I know this by looking at a request handler's detailed page in Appstats, expanding the call trace of each call to memcache.Get and looking at the memcache key that is being retrieved.
I am using a lot of tasklets in my request handlers, and I call Settings.singleton() from within the tasklets that need access to the settings. Could this be the reason why the Settings entity is being fetched from memcache again instead of from the in-context cache? If so, what are the exact rules that govern if/when an entity can be fetched from the in-context cache or not? I have not been able to find this information in the NDB documentation.
Update 2013/02/15: I am unable to reproduce this in a dummy test application. Test code is:
class Foo(ndb.Model):
    prop_a = ndb.DateTimeProperty(auto_now_add = True)

def use_foo():
    foo = Foo.get_or_insert('singleton')
    logging.info("Function using foo: %r", foo.prop_a)

@ndb.tasklet
def use_foo_tasklet():
    foo = Foo.get_or_insert('singleton')
    logging.info("Function using foo: %r", foo.prop_a)

@ndb.tasklet
def use_foo_async_tasklet():
    foo = yield Foo.get_or_insert_async('singleton')
    logging.info("Function using foo: %r", foo.prop_a)

class FuncGetOrInsertHandler(webapp2.RequestHandler):
    def get(self):
        for i in xrange(10):
            logging.info("Iteration %d", i)
            use_foo()

class TaskletGetOrInsertHandler(webapp2.RequestHandler):
    @ndb.toplevel
    def get(self):
        logging.info("Toplevel")
        use_foo()
        for i in xrange(10):
            logging.info("Iteration %d", i)
            use_foo_tasklet()

class AsyncTaskletGetOrInsertHandler(webapp2.RequestHandler):
    @ndb.toplevel
    def get(self):
        logging.info("Toplevel")
        use_foo()
        for i in xrange(10):
            logging.info("Iteration %d", i)
            use_foo_async_tasklet()
Before running any of the test handlers, I make sure that the Foo entity with keyname singleton exists.
Contrary to what I am seeing in my production app, all of these request handlers show a single call to memcache.Get in Appstats.
Update 2013/02/21: I am finally able to reproduce this in a dummy test application. Test code is:
class ToplevelAsyncTaskletGetOrInsertHandler(webapp2.RequestHandler):
    @ndb.toplevel
    def get(self):
        logging.info("Toplevel 1")
        use_foo()
        self._toplevel2()

    @ndb.toplevel
    def _toplevel2(self):
        logging.info("Toplevel 2")
        use_foo()
        for i in xrange(10):
            logging.info("Iteration %d", i)
            use_foo_async_tasklet()
This handler does show 2 calls to memcache.Get in Appstats, just like my production code.
Indeed, in my production request handler codepath, I have a toplevel called by another toplevel. It seems like a toplevel creates a new ndb context.
Changing the nested toplevel to a synctasklet fixes the problem.
It seems like a toplevel creates a new ndb context.
Exactly, each handler with a toplevel decorator has its own context and therefore a separate cache. You can take a look at the code for toplevel in the link below; the function documentation states that toplevel is "A sync tasklet that sets a fresh default Context".
https://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/ndb/tasklets.py#1033
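A minimal sketch of the fix described above, replacing the nested toplevel with a synctasklet (the handler name is made up) so both calls share a single context and therefore a single in-context cache:
class ToplevelSynctaskletGetOrInsertHandler(webapp2.RequestHandler):
    @ndb.toplevel
    def get(self):
        logging.info("Toplevel 1")
        use_foo()
        self._inner()

    # A synctasklet runs in the caller's ndb context instead of setting a fresh
    # default Context, so repeated get_or_insert calls can hit the in-context cache.
    @ndb.synctasklet
    def _inner(self):
        logging.info("Inner synctasklet")
        use_foo()
        for i in xrange(10):
            logging.info("Iteration %d", i)
            use_foo_async_tasklet()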
