The previous developers left some problems in our Google App Engine app. Currently, the app is saving entities with NULL values, but it would be better if we could clean up all these values.
Here is the ndb.Model:
class Day(ndb.Model):
    date = ndb.DateProperty(required=True, indexed=True)
    items = ndb.StringProperty(repeated=True, indexed=False)
    reason = ndb.StringProperty(name="cancelled", indexed=False)
    is_hole = ndb.ComputedProperty(lambda s: not bool(s.items or s.reason))
Somehow, we need to delete all Days where is_hole is true.
There are around 4,000,000 entities, of which around 2,000,000 should be deleted on the server.
Code so far
I thought it would be good to first count how many entities we should delete using this code:
count = Day.query(Day.is_hole != False).count(10000)
This (with the limit of 10,000) takes around 5 seconds to run. Without the limit, it would cause a DeadlineExceededError.
For deleting, I've tried this code:
ndb.delete_multi([key for key in Day.query(Day.is_hole != False).fetch(10000, keys_only=True)])
This (with the limit) takes around 30 seconds.
Question
How can I delete all Day entities where is_hole != False faster?
(We are using Python)
No, there is no faster way to delete entities - the deadline is fixed.
But there are some tricks.
You can get a longer effective deadline by using the task queue (https://cloud.google.com/appengine/docs/python/taskqueue/): put a task in the queue and have it enqueue the next task after the first one finishes (recurrence).
Another option, similar to the task queue, is to have the handler delete a batch of bad records and then redirect to itself, repeating until the last record is deleted. The browser needs to stay open until the end.
if at_least_one_bad_record:
    delete_some_records (not longer than 30s)
    spawn this task again, or redirect to this handler (the next call gets a fresh 30s)
Remember that it has an exit point when no bad records remain. It will delete all matching records without you having to click again.
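A minimal sketch of this recurring-task approach using the deferred library (assuming the Day model from the question is importable; the batch size of 500 is an arbitrary choice):
from google.appengine.ext import deferred, ndb

def delete_holes(cursor=None):
    # One batch of keys per task invocation; each task gets a fresh deadline.
    keys, next_cursor, more = Day.query(Day.is_hole == True).fetch_page(
        500, start_cursor=cursor, keys_only=True)
    if keys:
        ndb.delete_multi(keys)
    if more:
        # Re-enqueue ourselves to continue where this batch left off.
        deferred.defer(delete_holes, next_cursor)

# Kick it off once, e.g. from a handler:
# deferred.defer(delete_holes)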
The best way is to use MapReduce, which runs in the task queue, and where you can also use sharding to parallelize the work. Here is the Python code. Let me know if you need any clarification.
main.py
import webapp2

from google.appengine.api import app_identity

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op

def deleteEntity(entity):
    yield op.db.Delete(entity)

class DeleteEntitiesPipeline(base_handler.PipelineBase):
    def run(self):
        bucket_name = app_identity.get_default_gcs_bucket_name()
        yield mapreduce_pipeline.MapPipeline(
            "job_name",
            "main.deleteEntity",
            "mapreduce.input_readers.DatastoreInputReader",
            params={
                "entity_kind": 'models.Day',
                "filters": [("is_hole", "=", True)],
                "bucket_name": bucket_name
            },
            shards=5)

class StartDelete(webapp2.RequestHandler):
    def get(self):
        pipeline = DeleteEntitiesPipeline()
        pipeline.start()

application = webapp2.WSGIApplication([
    ('/deleteentities', StartDelete),
], debug=True)
The code below shows what I would normally do in a Python program.
class LogOnline(ndb.Model):
    _timeOnline = ndb.DateTimeProperty(default=None)

    @property
    def timeOnline(self):
        return self._timeOnline

    @timeOnline.setter
    def timeOnline(self, dateTime):
        self._timeOnline = dateTime
        # set memcache with all current online users
        # .....
However, this code doesn't work, as App Engine does not allow model properties to start with a '_'.
I also feel this type of architecture could be bad practice, as it could cause problems when doing queries on the class.
What is the best way to approach this?
What you could do is make timeOnline a property without the underscore, but add a _post_put_hook to update memcache.
class LogOnline(ndb.Model):
    timeOnline = ndb.DateTimeProperty(default=None)

    def _post_put_hook(self, future):
        future.get_result()  # wait until the put operation has completed
        # set memcache with all current online users
        ...
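The body of the hook depends on how you track online users; as one hypothetical illustration (the cache key and the set-of-ids scheme are made up for this example):
from google.appengine.api import memcache

ONLINE_USERS_KEY = 'online_users'  # hypothetical cache key

def _post_put_hook(self, future):
    future.get_result()  # wait until the put operation has completed
    # Hypothetical cache refresh: add this entity's id to the cached set.
    online = memcache.get(ONLINE_USERS_KEY) or set()
    online.add(self.key.id())
    memcache.set(ONLINE_USERS_KEY, online)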
We are currently building a small and simple central HTTP service that maps "external identities" (like a facebook id) to an "internal (uu)id", unique across all our services to help with analytics.
The first prototype in "our stack" (flask+postgresql) was done within a day. But since we want the service to (almost) never fail and scale automagically, we decided to use Google App Engine.
After a week of reading, trying, and benchmarking, this question emerges:
What response times are considered "normal" on App Engine (with NDB)?
We are getting response times that are consistently above 500 ms on average and well above 1 s at the 90th percentile.
I've attached a stripped-down version of our code below, hoping somebody can point out the obvious flaw. We really like the autoscaling and the distributed storage, but we cannot imagine that 500 ms is really the expected performance in our case. The SQL-based prototype responded much faster (and consistently so), hosted on a single Heroku dyno using the free, cache-less PostgreSQL (even with an ORM).
We tried both synchronous and asynchronous variants of the code below and looked at the appstats profile. It's always the RPC calls (both memcache and datastore) that take very long (50 ms-100 ms), made worse by the fact that there are always multiple calls (e.g. mc.get() + ds.get() + ds.set() on a write). We also tried deferring as much as possible to the task queue, without noticeable gains.
import json
import uuid

import webapp2
from google.appengine.ext import ndb
from webapp2_extras.routes import RedirectRoute

def _parse_request(request):
    if request.content_type == 'application/json':
        try:
            body_json = json.loads(request.body)
            provider_name = body_json.get('provider_name', None)
            provider_user_id = body_json.get('provider_user_id', None)
        except ValueError:
            return webapp2.abort(400, detail='invalid json')
    else:
        provider_name = request.params.get('provider_name', None)
        provider_user_id = request.params.get('provider_user_id', None)
    return provider_name, provider_user_id

class Provider(ndb.Model):
    name = ndb.StringProperty(required=True)

class Identity(ndb.Model):
    user = ndb.KeyProperty(kind='GlobalUser')

class GlobalUser(ndb.Model):
    uuid = ndb.StringProperty(required=True)

    @property
    def identities(self):
        return Identity.query(Identity.user == self.key).fetch()

class ResolveHandler(webapp2.RequestHandler):
    @ndb.toplevel
    def post(self):
        provider_name, provider_user_id = _parse_request(self.request)
        if not provider_name or not provider_user_id:
            return self.abort(400, detail='missing provider_name and/or provider_user_id')
        identity = ndb.Key(Provider, provider_name, Identity, provider_user_id).get()
        if identity:
            user_uuid = identity.user.id()
        else:
            user_uuid = uuid.uuid4().hex
            GlobalUser(
                id=user_uuid,
                uuid=user_uuid
            ).put_async()
            Identity(
                parent=ndb.Key(Provider, provider_name),
                id=provider_user_id,
                user=ndb.Key(GlobalUser, user_uuid)
            ).put_async()
        return webapp2.Response(
            status='200 OK',
            content_type='application/json',
            body=json.dumps({
                'provider_name': provider_name,
                'provider_user_id': provider_user_id,
                'uuid': user_uuid
            })
        )

app = webapp2.WSGIApplication([
    RedirectRoute('/v1/resolve', ResolveHandler, 'resolve', strict_slash=True)
], debug=False)
For completeness' sake, the (almost default) app.yaml:
application: GAE_APP_IDENTIFIER
version: 1
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: .*
  script: main.app

libraries:
- name: webapp2
  version: 2.5.2
- name: webob
  version: 1.2.3

inbound_services:
- warmup
In my experience, RPC performance fluctuates by orders of magnitude, between 5 ms and 100 ms for a datastore get. I suspect it's related to GAE datacenter load; sometimes it gets better, sometimes it gets worse.
Your operation looks very simple. I expect that with 3 requests, it should take about 20ms, but it could be up to 300ms. A sustained average of 500ms sounds very high though.
ndb does local caching when fetching objects by ID. That should kick in if you're accessing the same users, and those requests should be much faster.
I assume you're doing perf testing on the production and not dev_appserver. dev_appserver performance is not representative.
Not sure how many iterations you've tested, but you might want to try a larger number to see if 500ms is really your average.
When you're blocked on simple RPC calls, there's not much optimizing you can do.
The first obvious thing I see: do you really need a transaction on every request?
I believe that unless most of your requests create new entities, it's better to do .get_by_id() outside of a transaction; if the entity is not found, then start a transaction, or better yet, defer creation of the entity.
from google.appengine.ext import ndb
from google.appengine.ext.deferred import defer

def request_handler(key, data):
    entity = key.get()
    if entity:
        return 'ok'
    else:
        defer(_deferred_create, key, data)
        return 'ok'

def _deferred_create(key, data):
    @ndb.transactional
    def _tx():
        entity = key.get()
        if not entity:
            entity = CreateEntity(data)  # placeholder for your model construction
            entity.put()
    _tx()
That should give much better response time for user facing requests.
The second optimization I see is to use ndb.put_multi() to minimize RPC calls.
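Applied to the handler in the question, the two put_async() calls on the miss path could be batched into a single RPC; a sketch:
user = GlobalUser(id=user_uuid, uuid=user_uuid)
identity = Identity(
    parent=ndb.Key(Provider, provider_name),
    id=provider_user_id,
    user=ndb.Key(GlobalUser, user_uuid))
# One batched RPC instead of two separate ones:
ndb.put_multi_async([user, identity])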
P.S. I'm not 100% sure, but you can try disabling multithreading (threadsafe: no) to get more stable response times.
I'd been searching for a way to do cookie based authentication/sessions in Google App Engine because I don't like the idea of memcache based sessions, and I also don't like the idea of forcing users to create google accounts just to use a website. I stumbled across someone's posting that mentioned some signed cookie functions from the Tornado framework and it looks like what I need. What I have in mind is storing a user's id in a tamper proof cookie, and maybe using a decorator for the request handlers to test the authentication status of the user, and as a side benefit the user id will be available to the request handler for datastore work and such. The concept would be similar to forms authentication in ASP.NET. This code comes from the web.py module of the Tornado framework.
According to the docstrings, it "Signs and timestamps a cookie so it cannot be forged" and
"Returns the given signed cookie if it validates, or None."
I've tried to use it in an App Engine Project, but I don't understand the nuances of trying to get these methods to work in the context of the request handler. Can someone show me the right way to do this without losing the functionality that the FriendFeed developers put into it? The set_secure_cookie, and get_secure_cookie portions are the most important, but it would be nice to be able to use the other methods as well.
#!/usr/bin/env python
import Cookie
import base64
import time
import hashlib
import hmac
import datetime
import re
import calendar
import email.utils
import logging
def _utf8(s):
    if isinstance(s, unicode):
        return s.encode("utf-8")
    assert isinstance(s, str)
    return s

def _unicode(s):
    if isinstance(s, str):
        try:
            return s.decode("utf-8")
        except UnicodeDecodeError:
            raise HTTPError(400, "Non-utf8 argument")
    assert isinstance(s, unicode)
    return s

def _time_independent_equals(a, b):
    if len(a) != len(b):
        return False
    result = 0
    for x, y in zip(a, b):
        result |= ord(x) ^ ord(y)
    return result == 0

@property  # a property in the Tornado source, so self.cookies below works without a call
def cookies(self):
    """A dictionary of Cookie.Morsel objects."""
    if not hasattr(self, "_cookies"):
        self._cookies = Cookie.BaseCookie()
        if "Cookie" in self.request.headers:
            try:
                self._cookies.load(self.request.headers["Cookie"])
            except:
                self.clear_all_cookies()
    return self._cookies

def _cookie_signature(self, *parts):
    self.require_setting("cookie_secret", "secure cookies")
    hash = hmac.new(self.application.settings["cookie_secret"],
                    digestmod=hashlib.sha1)
    for part in parts:
        hash.update(part)
    return hash.hexdigest()

def get_cookie(self, name, default=None):
    """Gets the value of the cookie with the given name, else default."""
    if name in self.cookies:
        return self.cookies[name].value
    return default

def set_cookie(self, name, value, domain=None, expires=None, path="/",
               expires_days=None):
    """Sets the given cookie name/value with the given options."""
    name = _utf8(name)
    value = _utf8(value)
    if re.search(r"[\x00-\x20]", name + value):
        # Don't let us accidentally inject bad stuff
        raise ValueError("Invalid cookie %r: %r" % (name, value))
    if not hasattr(self, "_new_cookies"):
        self._new_cookies = []
    new_cookie = Cookie.BaseCookie()
    self._new_cookies.append(new_cookie)
    new_cookie[name] = value
    if domain:
        new_cookie[name]["domain"] = domain
    if expires_days is not None and not expires:
        expires = datetime.datetime.utcnow() + datetime.timedelta(
            days=expires_days)
    if expires:
        timestamp = calendar.timegm(expires.utctimetuple())
        new_cookie[name]["expires"] = email.utils.formatdate(
            timestamp, localtime=False, usegmt=True)
    if path:
        new_cookie[name]["path"] = path

def clear_cookie(self, name, path="/", domain=None):
    """Deletes the cookie with the given name."""
    expires = datetime.datetime.utcnow() - datetime.timedelta(days=365)
    self.set_cookie(name, value="", path=path, expires=expires,
                    domain=domain)

def clear_all_cookies(self):
    """Deletes all the cookies the user sent with this request."""
    for name in self.cookies.iterkeys():
        self.clear_cookie(name)

def set_secure_cookie(self, name, value, expires_days=30, **kwargs):
    """Signs and timestamps a cookie so it cannot be forged."""
    timestamp = str(int(time.time()))
    value = base64.b64encode(value)
    signature = self._cookie_signature(name, value, timestamp)
    value = "|".join([value, timestamp, signature])
    self.set_cookie(name, value, expires_days=expires_days, **kwargs)

def get_secure_cookie(self, name, include_name=True, value=None):
    """Returns the given signed cookie if it validates, or None."""
    if value is None:
        value = self.get_cookie(name)
    if not value:
        return None
    parts = value.split("|")
    if len(parts) != 3:
        return None
    if include_name:
        signature = self._cookie_signature(name, parts[0], parts[1])
    else:
        signature = self._cookie_signature(parts[0], parts[1])
    if not _time_independent_equals(parts[2], signature):
        logging.warning("Invalid cookie signature %r", value)
        return None
    timestamp = int(parts[1])
    if timestamp < time.time() - 31 * 86400:
        logging.warning("Expired cookie %r", value)
        return None
    try:
        return base64.b64decode(parts[0])
    except:
        return None
uid=1234|1234567890|d32b9e9c67274fa062e2599fd659cc14
Parts:
1. uid is the name of the key
2. 1234 is your value in clear
3. 1234567890 is the timestamp
4. d32b9e9c67274fa062e2599fd659cc14 is the signature made from the value and the timestamp
Tornado was never meant to work with App Engine (it's "its own server" through and through). Why don't you pick instead some framework that was meant for App Engine from the word "go" and is lightweight and dandy, such as tipfy? It gives you authentication using its own user system or any of App Engine's own users, OpenID, OAuth, and Facebook; sessions with secure cookies or the GAE datastore; and much more besides, all in a superbly lightweight "non-framework" approach based on WSGI and Werkzeug. What's not to like?!
For those who are still looking, we've extracted just the Tornado cookie implementation that you can use with App Engine at ThriveSmart. We're using it successfully on App Engine and will continue to keep it updated.
The cookie library itself is at:
http://github.com/thrivesmart/prayls/blob/master/prayls/lilcookies.py
You can see it in action in our example app that's included. If the structure of our repository ever changes, you can look for lilcookies.py within github.com/thrivesmart/prayls
I hope that's helpful to someone out there!
This works if anyone is interested:
from google.appengine.ext import webapp
import Cookie
import base64
import time
import hashlib
import hmac
import datetime
import re
import calendar
import email.utils
import logging
def _utf8(s):
if isinstance(s, unicode):
return s.encode("utf-8")
assert isinstance(s, str)
return s
def _unicode(s):
if isinstance(s, str):
try:
return s.decode("utf-8")
except UnicodeDecodeError:
raise HTTPError(400, "Non-utf8 argument")
assert isinstance(s, unicode)
return s
def _time_independent_equals(a, b):
if len(a) != len(b):
return False
result = 0
for x, y in zip(a, b):
result |= ord(x) ^ ord(y)
return result == 0
class ExtendedRequestHandler(webapp.RequestHandler):
"""Extends the Google App Engine webapp.RequestHandler."""
def clear_cookie(self,name,path="/",domain=None):
"""Deletes the cookie with the given name."""
expires = datetime.datetime.utcnow() - datetime.timedelta(days=365)
self.set_cookie(name,value="",path=path,expires=expires,
domain=domain)
def clear_all_cookies(self):
"""Deletes all the cookies the user sent with this request."""
for name in self.cookies.iterkeys():
self.clear_cookie(name)
def cookies(self):
"""A dictionary of Cookie.Morsel objects."""
if not hasattr(self,"_cookies"):
self._cookies = Cookie.BaseCookie()
if "Cookie" in self.request.headers:
try:
self._cookies.load(self.request.headers["Cookie"])
except:
self.clear_all_cookies()
return self._cookies
def _cookie_signature(self,*parts):
"""Hashes a string based on a pass-phrase."""
hash = hmac.new("MySecretPhrase",digestmod=hashlib.sha1)
for part in parts:hash.update(part)
return hash.hexdigest()
def get_cookie(self,name,default=None):
"""Gets the value of the cookie with the given name,else default."""
if name in self.request.cookies:
return self.request.cookies[name]
return default
def set_cookie(self,name,value,domain=None,expires=None,path="/",expires_days=None):
"""Sets the given cookie name/value with the given options."""
name = _utf8(name)
value = _utf8(value)
if re.search(r"[\x00-\x20]",name + value): # Don't let us accidentally inject bad stuff
raise ValueError("Invalid cookie %r:%r" % (name,value))
new_cookie = Cookie.BaseCookie()
new_cookie[name] = value
if domain:
new_cookie[name]["domain"] = domain
if expires_days is not None and not expires:
expires = datetime.datetime.utcnow() + datetime.timedelta(days=expires_days)
if expires:
timestamp = calendar.timegm(expires.utctimetuple())
new_cookie[name]["expires"] = email.utils.formatdate(timestamp,localtime=False,usegmt=True)
if path:
new_cookie[name]["path"] = path
for morsel in new_cookie.values():
self.response.headers.add_header('Set-Cookie',morsel.OutputString(None))
def set_secure_cookie(self,name,value,expires_days=30,**kwargs):
"""Signs and timestamps a cookie so it cannot be forged"""
timestamp = str(int(time.time()))
value = base64.b64encode(value)
signature = self._cookie_signature(name,value,timestamp)
value = "|".join([value,timestamp,signature])
self.set_cookie(name,value,expires_days=expires_days,**kwargs)
def get_secure_cookie(self,name,include_name=True,value=None):
"""Returns the given signed cookie if it validates,or None"""
if value is None:value = self.get_cookie(name)
if not value:return None
parts = value.split("|")
if len(parts) != 3:return None
if include_name:
signature = self._cookie_signature(name,parts[0],parts[1])
else:
signature = self._cookie_signature(parts[0],parts[1])
if not _time_independent_equals(parts[2],signature):
logging.warning("Invalid cookie signature %r",value)
return None
timestamp = int(parts[1])
if timestamp < time.time() - 31 * 86400:
logging.warning("Expired cookie %r",value)
return None
try:
return base64.b64decode(parts[0])
except:
return None
It can be used like this:
class MyHandler(ExtendedRequestHandler):
    def get(self):
        self.set_cookie(name="MyCookie", value="NewValue", expires_days=10)
        self.set_secure_cookie(name="MySecureCookie", value="SecureValue", expires_days=10)

        value1 = self.get_cookie('MyCookie')
        value2 = self.get_secure_cookie('MySecureCookie')
If you only want to store the user's user ID in the cookie (presumably so you can look their record up in the datastore), you don't need 'secure' or tamper-proof cookies; you just need a namespace that's big enough to make guessing user IDs impractical, e.g., GUIDs or other random data.
One pre-made option for this, which uses the datastore for session storage, is Beaker. Alternately, you could handle this yourself with set-cookie/cookie headers, if you really just need to store their user ID.
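A minimal sketch of issuing such a random session identifier in webapp2 (the cookie name, the 30-day lifetime, and the handler are made up for this example):
import uuid

import webapp2

class LoginHandler(webapp2.RequestHandler):
    def get(self):
        # A 128-bit random id is impractical to guess.
        session_id = uuid.uuid4().hex
        # Store the session_id -> user mapping in the datastore here.
        self.response.set_cookie('session', session_id, max_age=30 * 86400)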
Someone recently extracted the authentication and session code from Tornado and created a new library specifically for GAE.
Perhaps this is more than you need, but since they did it specifically for GAE, you shouldn't have to worry about adapting it yourself.
Their library is called gaema. Here is their announcement in the GAE Python group on 4 Mar 2010:
http://groups.google.com/group/google-appengine-python/browse_thread/thread/d2d6c597d66ecad3/06c6dc49cb8eca0c?lnk=gst&q=tornado#06c6dc49cb8eca0c
Does anyone know how to delete all of the data in the datastore in Google App Engine?
If you're talking about the live datastore, open the dashboard for your app (log in on App Engine), then Datastore --> Dataviewer, select all the rows for the table you want to delete, and hit the delete button (you'll have to do this for all your tables).
You can do the same programmatically through the remote_api (but I never used it).
If you're talking about the development datastore, you'll just have to delete the following file: "./WEB-INF/appengine-generated/local_db.bin". The file will be generated for you again next time you run the development server and you'll have a clear db.
Make sure to clean your project afterwards.
This is one of the little gotchas that come in handy when you start playing with the Google App Engine. You'll find yourself persisting objects into the datastore, then changing the JDO object model for your persistable entities, and ending up with obsolete data that'll make your app crash all over the place.
The best approach is the remote API method as suggested by Nick, he's an App Engine engineer from Google, so trust him.
It's not that difficult to do, and the latest 1.2.5 SDK provides remote_shell_api.py off the shelf. So go download the new SDK, then follow these steps:
Connect to the remote server from your command line: remote_shell_api.py yourapp /remote_api
The shell will ask for your login info and, if authorized, will open a Python shell for you. You need to set up a URL handler for /remote_api in your app.yaml.
Fetch the entities you'd like to delete; the code looks something like this:
from google.appengine.ext import db  # needed for db.delete
from models import Entry

query = Entry.all(keys_only=True)
entries = query.fetch(1000)
db.delete(entries)
# This can bulk-delete up to 1000 entities at a time
Update 2013-10-28:
remote_shell_api.py has been replaced by remote_api_shell.py, and you should connect with remote_api_shell.py -s your_app_id.appspot.com, according to the documentation.
There is a new experimental feature Datastore Admin, after enabling it in app settings, you can bulk delete as well as backup your datastore through the web ui.
The fastest and most efficient way to handle bulk deletes on Datastore is by using the new mapper API announced at the latest Google I/O.
If your language of choice is Python, you just have to register your mapper in a mapreduce.yaml file and define a function like this:
from mapreduce import operation as op

def process(entity):
    yield op.db.Delete(entity)
On Java, you should have a look at this article, which suggests a function like this:
@Override
public void map(Key key, Entity value, Context context) {
    log.info("Adding key to deletion pool: " + key);
    DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
        .getMutationPool();
    mutationPool.delete(value.getKey());
}
EDIT:
Since SDK 1.3.8, there's a Datastore admin feature for this purpose
You can clear the development server datastore when you run the server:
/path/to/dev_appserver.py --clear_datastore=yes myapp
You can also abbreviate --clear_datastore with -c.
If you have a significant amount of data, you need to use a script to delete it. You can use remote_api to clear the datastore from the client side in a straightforward manner, though.
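A sketch of what that client-side loop might look like (assuming /_ah/remote_api is already configured in your app.yaml; the host name is a placeholder):
import getpass

from google.appengine.ext import db
from google.appengine.ext.remote_api import remote_api_stub

def auth_func():
    return raw_input('Email: '), getpass.getpass('Password: ')

remote_api_stub.ConfigureRemoteApi(
    None, '/_ah/remote_api', auth_func, 'your-app.appspot.com')

# Kindless keys-only query: deletes entities of every kind, batch by batch.
while True:
    keys = db.Query(keys_only=True).fetch(500)
    if not keys:
        break
    db.delete(keys)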
Here you go: Go to Datastore Admin, and then select the Entity type you want to delete and click Delete. Mapreduce will take care of deleting!
There are several ways you can use to remove entries from App Engine's Datastore:
First, think whether you really need to remove entries. This is expensive and it might be cheaper to not remove them.
You can delete all entries by hand using the Datastore Admin.
You can use the Remote API and remove entries interactively.
You can remove the entries programmatically using a couple lines of code.
You can remove them in bulk using Task Queues and Cursors (a sketch follows below).
Or you can use Mapreduce to get something more robust and fancier.
Each one of these methods is explained in the following blog post:
http://www.shiftedup.com/2015/03/28/how-to-bulk-delete-entries-in-app-engine-datastore
Hope it helps!
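As a concrete example of the Task Queues and Cursors approach from the list above, a hypothetical purge handler (the kind name and URL are made up) could re-enqueue itself until the kind is empty:
import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class PurgeHandler(webapp2.RequestHandler):
    def post(self):
        cursor = None
        if self.request.get('cursor'):
            cursor = ndb.Cursor(urlsafe=self.request.get('cursor'))
        # SomeKind is a placeholder for the kind you want to purge.
        keys, next_cursor, more = SomeKind.query().fetch_page(
            500, start_cursor=cursor, keys_only=True)
        if keys:
            ndb.delete_multi(keys)
        if more:
            # Chain the next batch as a fresh task with its own deadline.
            taskqueue.add(url='/tasks/purge',
                          params={'cursor': next_cursor.urlsafe()})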
The zero-setup way to do this is to send an execute-arbitrary-code HTTP request to the admin service that your running app already, automatically, has:
import urllib
import urllib2
urllib2.urlopen('http://localhost:8080/_ah/admin/interactive/execute',
                data=urllib.urlencode({'code': 'from google.appengine.ext import db\n' +
                                               'db.delete(db.Query())'}))
Source
I got this from http://code.google.com/appengine/articles/remote_api.html.
Create the Interactive Console
First, you need to define an interactive App Engine console. So, create a file called appengine_console.py and enter this:
#!/usr/bin/python
import code
import getpass
import sys

# These are for my OSX installation. Change them to match your google_appengine paths.
sys.path.append("/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine")
sys.path.append("/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/yaml/lib")

from google.appengine.ext.remote_api import remote_api_stub
from google.appengine.ext import db

def auth_func():
    return raw_input('Username:'), getpass.getpass('Password:')

if len(sys.argv) < 2:
    print "Usage: %s app_id [host]" % (sys.argv[0],)
    sys.exit(1)  # exit if no app_id was given

app_id = sys.argv[1]
if len(sys.argv) > 2:
    host = sys.argv[2]
else:
    host = '%s.appspot.com' % app_id

remote_api_stub.ConfigureRemoteDatastore(app_id, '/remote_api', auth_func, host)
code.interact('App Engine interactive console for %s' % (app_id,), None, locals())
Create the Mapper base class
Once that's in place, create this Mapper class. I just created a new file called utils.py and threw this in:
from google.appengine.ext import db

class Mapper(object):
    # Subclasses should replace this with a model class (eg, model.Person).
    KIND = None

    # Subclasses can replace this with a list of (property, value) tuples to filter by.
    FILTERS = []

    def map(self, entity):
        """Updates a single entity.

        Implementers should return a tuple containing two iterables (to_update, to_delete).
        """
        return ([], [])

    def get_query(self):
        """Returns a query over the specified kind, with any appropriate filters applied."""
        q = self.KIND.all()
        for prop, value in self.FILTERS:
            q.filter("%s =" % prop, value)
        q.order("__key__")
        return q

    def run(self, batch_size=100):
        """Executes the map procedure over all matching entities."""
        q = self.get_query()
        entities = q.fetch(batch_size)
        while entities:
            to_put = []
            to_delete = []
            for entity in entities:
                map_updates, map_deletes = self.map(entity)
                to_put.extend(map_updates)
                to_delete.extend(map_deletes)
            if to_put:
                db.put(to_put)
            if to_delete:
                db.delete(to_delete)
            q = self.get_query()
            q.filter("__key__ >", entities[-1].key())
            entities = q.fetch(batch_size)
Mapper is supposed to be just an abstract class that allows you to iterate over every entity of a given kind, be it to extract their data, or to modify them and store the updated entities back to the datastore.
Run with it!
Now, start your appengine interactive console:
$python appengine_console.py <app_id_here>
That should start the interactive console. In it, create a subclass of Mapper:
from utils import Mapper
# import your model class here

class MyModelDeleter(Mapper):
    KIND = <model_name_here>

    def map(self, entity):
        return ([], [entity])
And, finally, run it (from your interactive console):
mapper = MyModelDeleter()
mapper.run()
That's it!
You can do it using the web interface. Log in to your account and navigate with the links on the left-hand side. In Datastore management, you have options to modify and delete data; use the respective options.
I've created an add-in panel that can be used with your deployed App Engine apps. It lists the kinds that are present in the datastore in a dropdown, and you can click a button to schedule "tasks" that delete all entities of a specific kind or simply everything. You can download it here:
http://code.google.com/p/jobfeed/wiki/Nuke
For Python, 1.3.8 includes an experimental admin built-in for this. They say: "enable the following builtin in your app.yaml file:"
builtins:
- datastore_admin: on
"Datastore delete is currently available only with the Python runtime. Java applications, however, can still take advantage of this feature by creating a non-default Python application version that enables Datastore Admin in the app.yaml. Native support for Java will be included in an upcoming release."
Open "Datastore Admin" for your application and enable Admin. Then all of your entities will be listed with check boxes. You can simply select the unwanted entites and delete them.
This is what you're looking for...
db.delete(Entry.all(keys_only=True))
Running a keys-only query is much faster than a full fetch, and your quota will take a smaller hit because keys-only queries are considered small ops.
Here's a link to an answer from Nick Johnson describing it further.
Below is an end-to-end REST API solution to truncating a table...
I set up a REST API to handle database transactions, where routes are mapped directly through to the proper model/action. This can be called by entering the right URL (example.com/inventory/truncate) and logging in.
Here's the route:
Route('/inventory/truncate', DataHandler, defaults={'_model':'Inventory', '_action':'truncate'})
Here's the handler:
class DataHandler(webapp2.RequestHandler):
    @basic_auth
    def delete(self, **defaults):
        model = defaults.get('_model')
        action = defaults.get('_action')
        module = __import__('api.models', fromlist=[model])
        model_instance = getattr(module, model)()
        result = getattr(model_instance, action)()
It starts by loading the model dynamically (i.e., Inventory found under api.models), then calls the correct method (Inventory.truncate()) as specified in the action parameter.
The @basic_auth is a decorator/wrapper that provides authentication for sensitive operations (i.e., POST/DELETE). There's also an OAuth decorator available if you're concerned about security.
Finally, the action is called:
def truncate(self):
    db.delete(Inventory.all(keys_only=True))
It looks like magic but it's actually very straightforward. The best part is, delete() can be re-used to handle deleting one-or-many results by adding another action to the model.
You can delete all of the datastore by deleting all kinds one by one, using the Google App Engine dashboard. Please follow these steps:
1. Log in to https://console.cloud.google.com/datastore/settings
2. Click Open Datastore Admin. (Enable it if it is not enabled.)
3. Select all entities and press Delete. (This step runs a MapReduce job to delete all the selected kinds.)
For more information, see this image: http://storage.googleapis.com/bnifsc/Screenshot%20from%202015-01-31%2023%3A58%3A41.png
If you have a lot of data, using the web interface could be time consuming. The App Engine Launcher utility lets you delete everything in one go with the 'Clear datastore on launch' checkbox. This utility is now available for both Windows and Mac (Python framework).
For the development server, instead of running the server through the Google App Engine Launcher, you can run it from the terminal like this:
dev_appserver.py --port=[portnumber] --clear_datastore=yes [nameofapplication]
For example, my application "reader" runs on port 15080. After modifying the code and restarting the server, I just run "dev_appserver.py --port=15080 --clear_datastore=yes reader".
It works well for me.
Adding answer about recent developments.
Google recently added datastore admin feature. You can backup, delete or copy your entities to another app using this console.
https://developers.google.com/appengine/docs/adminconsole/datastoreadmin#Deleting_Entities_in_Bulk
I often don't want to delete the whole datastore, so I pull a clean copy of /war/WEB-INF/local_db.bin out of source control. It may just be me, but it seems that even with Dev Mode stopped, I have to physically remove the file before pulling it. This is on Windows, using the Subversion plugin for Eclipse.
PHP variation:
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.DatastoreServiceFactory;

define('DATASTORE_SERVICE', DatastoreServiceFactory::getDatastoreService());

function get_all($kind) {
    $query = new Query($kind);
    $prepared = DATASTORE_SERVICE->prepare($query);
    return $prepared->asIterable();
}

function delete_all($kind, $amount = 0) {
    if ($entities = get_all($kind)) {
        $r = $t = 0;
        $delete = array();
        foreach ($entities as $entity) {
            if ($r < 500) {
                $delete[] = $entity->getKey();
            } else {
                DATASTORE_SERVICE->delete($delete);
                $delete = array();
                $r = -1;
            }
            $r++; $t++;
            if ($amount && $amount < $t) break;
        }
        if ($delete) {
            DATASTORE_SERVICE->delete($delete);
        }
    }
}
Yes, it will take time, and 30 s is the limit. I'm thinking of putting up an AJAX app sample to automate it beyond 30 s.
for amodel in db.Model.__subclasses__():
    dela = []
    print amodel
    try:
        m = amodel()
        mq = m.all()
        print mq.count()
        for mw in mq:
            dela.append(mw)
        db.delete(dela)
        #~ print len(dela)
    except:
        pass
If you're using ndb, the method that worked for me for clearing the datastore:
ndb.delete_multi(ndb.Query(default_options=ndb.QueryOptions(keys_only=True)))
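Note that this one-liner pulls every key into memory at once; for a large kind, a batched variant of the same idea (a sketch, per kind rather than kindless) keeps memory bounded:
from google.appengine.ext import ndb

def wipe_kind(model_class, batch_size=500):
    # Delete one page of keys at a time until the kind is empty.
    while True:
        keys = model_class.query().fetch(batch_size, keys_only=True)
        if not keys:
            break
        ndb.delete_multi(keys)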
For any datastore that's on app engine, rather than local, you can use the new Datastore API. Here's a primer for how to get started.
I wrote a script that deletes all non-built in entities. The API is changing pretty rapidly, so for reference, I cloned it at commit 990ab5c7f2063e8147bcc56ee222836fd3d6e15b
from gcloud import datastore
from gcloud.datastore import SCOPE
from gcloud.datastore.connection import Connection
from gcloud.datastore import query

from oauth2client import client

def get_connection():
    client_email = 'XXXXXXXX@developer.gserviceaccount.com'
    private_key_string = open('/path/to/yourfile.p12', 'rb').read()
    svc_account_credentials = client.SignedJwtAssertionCredentials(
        service_account_name=client_email,
        private_key=private_key_string,
        scope=SCOPE)
    return Connection(credentials=svc_account_credentials)

def connect_to_dataset(dataset_id):
    connection = get_connection()
    datastore.set_default_connection(connection)
    datastore.set_default_dataset_id(dataset_id)

if __name__ == "__main__":
    connect_to_dataset(DATASET_NAME)
    gae_entity_query = query.Query()
    gae_entity_query.keys_only()
    for entity in gae_entity_query.fetch():
        if entity.kind[0] != '_':  # skip built-in kinds
            print entity.kind
            entity.key.delete()
Continuing svpino's idea: rather than removing unused records, it is wise to mark them as "deleted" and reuse them. A little cache/memcache to hold the working copy, writing only the difference of states (before and after the desired task) to the datastore, will make it better. For big tasks, it is possible to write intermediate difference chunks to the datastore to avoid data loss if memcache disappears. To make it loss-proof, it is possible to check the integrity/existence of the memcached results and restart the task (or the required part) to repeat the missing computations. When the data difference is written to the datastore, the required computations are discarded from the queue.
Another idea, similar to map reduce, is to shard the entity kind into several different entity kinds, so they are collected together and visible as a single entity kind to the final user. Entries are only marked as "deleted". When the amount of "deleted" entries per shard passes some limit, the "alive" entries are distributed between the other shards, and that shard is closed forever and then deleted manually from the dev console (presumably at lower cost). Update: it seems there is no "drop table" in the console, only record-by-record deletion at the regular price.
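A minimal sketch of the mark-as-deleted idea with ndb (the kind and flag names are made up for this example):
from google.appengine.ext import ndb

class ShardedRecord(ndb.Model):  # hypothetical kind
    data = ndb.StringProperty()
    deleted = ndb.BooleanProperty(default=False)

def soft_delete(key):
    # Mark the record instead of removing it, so its slot can be reused.
    record = key.get()
    record.deleted = True
    record.put()

def live_records():
    return ShardedRecord.query(ShardedRecord.deleted == False).fetch()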
It is possible to delete a large set of records chunk by chunk with a query, without GAE failing (at least it works locally), and with the possibility of continuing in a later attempt when time runs out:
qdelete.getFetchPlan().setFetchSize(100);

while (true)
{
    long result = qdelete.deletePersistentAll(candidates);
    LOG.log(Level.INFO, String.format("deleted: %d", result));
    if (result <= 0)
        break;
}
Also, it is sometimes useful to make an additional field in the primary table instead of putting the candidates (related records) into a separate table. And yes, the field may be an unindexed/serialized array, with little computation cost.
For all the people who need a quick solution for the dev server (as of the time of writing, Feb. 2016):
Stop the dev server.
Delete the target directory.
Rebuild the project.
This will wipe all data from the datastore.
I was so frustrated about the existing solutions for deleting all data in the live datastore that I created a small GAE app that can delete quite a large amount of data within its 30 seconds.
How to install etc: https://github.com/xamde/xydra
For Java:
DatastoreService db = DatastoreServiceFactory.getDatastoreService();
List<Key> keys = new ArrayList<Key>();

for (Entity e : db.prepare(new Query().setKeysOnly()).asIterable())
    keys.add(e.getKey());

db.delete(keys);
This works well on the development server.
You have 2 simple ways:
#1: To save cost, delete the entire project.
#2: Use ts-datastore-orm:
https://www.npmjs.com/package/ts-datastore-orm
await Entity.truncate();
The truncate can delete around 1K rows per second.
Here's how I did this naively from a vanilla Google Cloud Shell (no GAE) with python3:
from google.cloud import datastore

client = datastore.Client()
query = client.query()  # kindless query over all entities (this line was missing)
query.keys_only()

for counter, entity in enumerate(query.fetch()):
    if entity.kind.startswith('_'):  # skip reserved kinds
        continue
    print(f"{counter}: {entity.key}")
    client.delete(entity.key)
This takes a very long time even with a relatively small amount of keys but it works.
More info about the Python client library: https://googleapis.dev/python/datastore/latest/client.html
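Deleting key-by-key costs one RPC per entity; the same client can batch deletes with delete_multi. A sketch (the 500-key batch size follows the usual datastore write limit):
from google.cloud import datastore

client = datastore.Client()
query = client.query()
query.keys_only()

keys = [entity.key for entity in query.fetch()
        if not entity.key.kind.startswith('_')]  # skip reserved kinds
for i in range(0, len(keys), 500):
    client.delete_multi(keys[i:i + 500])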
As of 2022, there are, to the best of my knowledge, two ways to delete a kind from a (largeish) datastore. Google recommends using a Dataflow template. The template basically pulls each entity one by one, subject to a GQL query, and then deletes it. Interestingly, if you are deleting a large number of rows (> 10M), you will run into datastore troubles: it will fail to provide enough capacity, and your operations on the datastore will start timing out. However, only the kind you are mass-deleting from will be affected.
If you have less than 10m rows, you can just use this go script:
package main

import (
    "context"
    "fmt"
    "log"
    "sync"
    "time"

    "cloud.google.com/go/datastore"
    "google.golang.org/api/option"
)

const (
    batchSize       = 10000 // number of keys to get in a single batch
    deleteBatchSize = 500   // number of keys to delete in a single batch
    projectID       = "name-of-your-GCP-project"
    serviceAccount  = "path-to-sa-file"
    table           = "kind-to-delete"
)

func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}

func deleteBatch(table string) int {
    ctx := context.Background()
    client, err := datastore.NewClient(ctx, projectID, option.WithCredentialsFile(serviceAccount))
    if err != nil {
        log.Fatalf("Failed to open client: %v", err)
    }
    defer client.Close()

    query := datastore.NewQuery(table).KeysOnly().Limit(batchSize)
    keys, err := client.GetAll(ctx, query, nil)
    if err != nil {
        fmt.Printf("%s Failed to get %d keys : %v\n", table, batchSize, err)
        return -1
    }

    var wg sync.WaitGroup
    for i := 0; i < len(keys); i += deleteBatchSize {
        wg.Add(1)
        go func(i int) {
            batch := keys[i : i+min(len(keys)-i, deleteBatchSize)]
            if err := client.DeleteMulti(ctx, batch); err != nil {
                // not a big problem, we'll get them next time ;)
                fmt.Printf("%s Failed to delete multi: %v", table, err)
            }
            wg.Done()
        }(i)
    }
    wg.Wait()

    return len(keys)
}

func main() {
    var globalStartTime = time.Now()
    fmt.Printf("Deleting \033[1m%s\033[0m\n", table)
    for {
        startTime := time.Now()
        count := deleteBatch(table)
        if count >= 0 {
            rate := float64(count) / time.Since(startTime).Seconds()
            fmt.Printf("Deleted %d keys from %s in %.2fs, rate %.2f keys/s\n", count, table, time.Since(startTime).Seconds(), rate)
            if count == 0 {
                fmt.Printf("%s is now clear.\n", table)
                break
            }
        } else {
            fmt.Printf("Retrying after short cooldown\n")
            time.Sleep(10 * time.Second)
        }
    }
    fmt.Printf("Total time taken %s.\n", time.Since(globalStartTime))
}