Say I have some model Object, and ~2000 entities in my datastore. Using app stats with the following code
ndb.get_context().set_cache_policy(lambda x: False)
ndb.get_context().set_memcache_policy(lambda x: False)
objects = Object.query().fetch()
I get the following profile
What is being done for the ~18 seconds that it's not waiting for RPCs?
It's likely deserializing those entities into Python objects, and that process is very slow. You shouldn't be fetching that many entities during a single web request from a client anyway, and if it's for some kind of batch job, the time shouldn't matter that much. Note also that once you go over several thousand items your requests will likely time out at some point, so you will need to use something like query cursors.
You may also find this and this blog post helpful for some hacks that speed up the deserialization process.
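As a rough sketch of the query-cursor approach: ndb's `fetch_page(page_size, start_cursor=...)` returns a `(results, cursor, more)` tuple, so a batch job can work through one page at a time. The `fake_fetch_page` function below is a hypothetical in-memory stand-in (its "cursor" is just a list offset) so the loop can run without a datastore; in a real job you would pass something like `Object.query().fetch_page` instead.

```python
def iter_pages(fetch_page, page_size):
    """Yield lists of results, one page at a time, following cursors.

    fetch_page must behave like ndb's Query.fetch_page: it takes a page
    size and a start_cursor keyword and returns (results, cursor, more).
    """
    cursor = None
    more = True
    while more:
        results, cursor, more = fetch_page(page_size, start_cursor=cursor)
        if results:
            yield results


# Hypothetical stand-in for a datastore query over 2000 entities.
# Here the "cursor" is just an integer offset into a list.
_FAKE_ENTITIES = list(range(2000))


def fake_fetch_page(page_size, start_cursor=None):
    start = start_cursor or 0
    end = start + page_size
    results = _FAKE_ENTITIES[start:end]
    more = end < len(_FAKE_ENTITIES)
    return results, end, more


pages = list(iter_pages(fake_fetch_page, 500))
total = sum(len(page) for page in pages)
```

Each page can be processed and discarded before the next one is fetched, and the cursor can be handed to a follow-up task if a request is about to run out of time.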
Also, unrelated but this is one of the many cases where Golang would shine and way outperform Python on the exact same task (that delay would be almost non-existent).
In my app I have a few batch operations that I perform.
Unfortunately, it sometimes takes forever to update 400-500 entities.
What I have is all the entity keys. I need to get the entities, update a property and save them to the datastore, and saving them can take up to 40-50 seconds, which is not what I'm looking for.
I'll simplify my model to explain what I do (which is pretty simple anyway):
class Entity(ndb.Model):
    title = ndb.StringProperty()

keys = [key1, key2, key3, key4, ..., key500]
entities = ndb.get_multi(keys)
for e in entities:
    e.title = 'the new title'
ndb.put_multi(entities)
Getting and modifying does not take too long. I tried get_async, getting in a tasklet and whatever else is possible, which only changes whether the get or the for loop takes longer.
But what really bothers me is that a put takes up to 50 seconds...
What is the most efficient way to do these operations in a decent amount of time? Of course I know that it depends on many factors, like the complexity of the entity, but the time the put takes is really over the acceptable limit for me.
I have already tried async operations and tasklets...
I wonder if doing smaller batches of e.g. 50 or 100 entities will be faster. If you make that into a tasklet you can try running those tasklets concurrently.
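A minimal sketch of the smaller-batches idea, assuming the work can be split by key. The `get_multi` and `put_multi` arguments are passed in so the example runs without App Engine; in a real app they would be `ndb.get_multi` and `ndb.put_multi` (or their `*_async` variants inside a tasklet). The in-memory store and the dict-shaped entities are hypothetical stand-ins.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def update_titles_in_batches(keys, new_title, get_multi, put_multi,
                             batch_size=100):
    """Get, modify and put entities one batch at a time.

    Returns the number of batches written.
    """
    batches = 0
    for batch_keys in chunked(keys, batch_size):
        entities = get_multi(batch_keys)
        for entity in entities:
            # With ndb models this would be `entity.title = new_title`.
            entity['title'] = new_title
        put_multi(entities)
        batches += 1
    return batches


# Hypothetical in-memory "datastore" used only to exercise the sketch.
store = {k: {'title': 'old'} for k in range(450)}
put_sizes = []


def fake_get_multi(keys):
    return [store[k] for k in keys]


def fake_put_multi(entities):
    put_sizes.append(len(entities))


n_batches = update_titles_in_batches(list(range(450)), 'the new title',
                                     fake_get_multi, fake_put_multi)
```

Whether 50, 100 or some other batch size is fastest is something to measure with Appstats rather than assume.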
I also recommend looking at this with Appstats to see if that shows something surprising.
Finally assuming this uses the HRD you may find that there is a limit on the number of entity groups per batch. This limit defaults very low. Try raising it.
Sounds like what MapReduce was designed for. You can do this fast, by simultaneously getting and modifying all the entities at the same time, scaled across multiple server instances. Your cost goes up by using more instances though.
I'm going to assume that you have the entity design that you want (i.e. I'm not going to ask you what you're trying to do and how maybe you should have one big entity instead of a bunch of small ones that you have to update all the time). Because that wouldn't be very nice. ( =
What if you used the Task Queue? You could create multiple tasks and each task could take as URL params the keys it is responsible for updating and the property and value that should be set. That way the work is broken up into manageable chunks and the user's request can return immediately while the work happens in the background? Would that work?
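One way the task splitting could look, as a sketch only: the helper below builds one parameter dict per task, each carrying its share of the keys plus the property and value to set. The helper name, the parameter names and the handler URL in the trailing comment are assumptions made for illustration; `taskqueue.add(url=..., params=...)` is the call that would enqueue each dict.

```python
def build_task_params(urlsafe_keys, prop, value, keys_per_task=50):
    """Split the keys into per-task parameter dicts.

    The keys are joined into one comma-separated string because task
    parameters are plain strings.
    """
    tasks = []
    for start in range(0, len(urlsafe_keys), keys_per_task):
        chunk = urlsafe_keys[start:start + keys_per_task]
        tasks.append({
            'keys': ','.join(chunk),
            'property': prop,
            'value': value,
        })
    return tasks


# Hypothetical key strings; in ndb they would come from key.urlsafe().
keys = ['key%d' % i for i in range(120)]
tasks = build_task_params(keys, 'title', 'the new title')

# In a request handler each dict would then be enqueued, for example:
#   taskqueue.add(url='/tasks/update_title', params=task_params)
# where '/tasks/update_title' is an assumed handler URL.
```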
What is the proper way to perform mass updates on entities in a Google App Engine Datastore? Can it be done without having to retrieve the entities?
For example, what would be the GAE equivalent to something like this in SQL:
UPDATE dbo.authors
SET city = replace(city, 'Salt', 'Olympic')
WHERE city LIKE 'Salt%';
There isn't a direct translation. The datastore really has no concept of updates; all you can do is overwrite old entities with a new entity at the same address (key). To change an entity, you must fetch it from the datastore, modify it locally, and then save it back.
There's also no equivalent to the LIKE operator. While wildcard suffix matching is possible with some tricks, if you wanted to match '%Salt%' you'd have to read every single entity into memory and do the string comparison locally.
So it's not going to be quite as clean or efficient as SQL. This is a tradeoff with most distributed object stores, and the datastore is no exception.
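The prefix-matching trick mentioned above boils down to a range scan: every string starting with 'Salt' sorts between 'Salt' and 'Salt' followed by a very high code point. The sketch below emulates that scan over a sorted list with `bisect`; the list of cities is made up. With the db API the same bounds would be expressed as two inequality filters on `city`, one `>=` the prefix and one `<` the prefix plus `u'\ufffd'`.

```python
import bisect


def prefix_range(prefix):
    """Return (low, high) so that low <= s < high matches the prefix.

    U+FFFD is a very high code point, so appending it gives an upper
    bound that sorts after practically every string with this prefix.
    """
    return prefix, prefix + u'\ufffd'


def scan_prefix(sorted_values, prefix):
    """Emulate an index range scan over a sorted list of strings."""
    low, high = prefix_range(prefix)
    start = bisect.bisect_left(sorted_values, low)
    end = bisect.bisect_left(sorted_values, high)
    return sorted_values[start:end]


cities = sorted(['Boston', 'Salt Lake City', 'Salem', 'Salta',
                 'Saltillo', 'San Jose', 'Basalt'])
matches = scan_prefix(cities, 'Salt')
```

Note that 'Basalt' is not found: a match in the middle of the string is not a contiguous range in the index, which is why '%Salt%' needs a full scan.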
That said, the mapper library is available to facilitate such batch updates. Follow the example and use something like this for your process function:
from mapreduce import operation as op

def process(entity):
    if entity.city.startswith('Salt'):
        entity.city = entity.city.replace('Salt', 'Olympic')
        yield op.db.Put(entity)
There are other alternatives besides the mapper. The most important optimization tip is to batch your updates; don't save back each updated entity individually. If you use the mapper and yield puts, this is handled automatically.
No, it can't be done without retrieving the entities.
There's no such thing as a '1000 max record limit', but there is of course a timeout on any single request - and if you have large amounts of entities to modify, a simple iteration will probably fall foul of that. You could manage this by splitting it up into multiple operations and keeping track with a query cursor, or potentially by using the MapReduce framework.
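You could use the Query class, http://code.google.com/appengine/docs/python/datastore/queryclass.html
query = authors.all().filter('city >=', 'Salt').filter('city <', u'Salt\ufffd')
records = query.fetch(1000)
for record in records:
    record.city = record.city.replace('Salt', 'Olympic')
db.put(records)  # save the whole batch; without a put the changes are lost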
Hi, I read a few posts on this topic. Let's say I want to delete every object for a given db.Model class such as LinkRating2. Is there a way to delete it on startup with a simple command? I thought I remembered seeing this somewhere, e.g. --clear datastore? Otherwise, I have been trying various methods in the SDK console, but all seem to crash from lack of memory, as the file is taking forever just to load and start, and I don't think there is any memory left. So what would be the best way in code to delete all entities of a kind?
class LinkRating2(db.Model):
    user = db.StringProperty()
    link = db.StringProperty()
    rating2 = db.FloatProperty()
tried this but this is super slow
results2 = LinkRating2.all()
results = results2.fetch(500)
while results:
    db.delete(results)
    results = results2.fetch(500)
"is there a way to delete it on startup with a simple command? I thought i remember seeing this somewhere eg --clear datastore?"
If you're talking about the development server, yes. dev_appserver.py --clear_datastore myapp will start you up with a fresh datastore. This is the best option when you're working locally.
"I didn't bother putting it in a while loop, as trying to delete all of the entities in one go probably stands a good chance of exceeding the 30-second deadline. You'll just need to do this until all of the entities are gone."
If you do want to wipe out every entity of a given type in production, this is a good excuse to use remote_api instead of writing a web-based handler, both to circumvent the 30 second deadline and to reduce the chance of your entitypocalypse code being run again by accident.
One last thing: if you want to delete some entities but you don't want to bother loading the model definition, you can use the low level datastore API:
from google.appengine.api import datastore

kind = 'LinkRating2'
batch_size = 1000
query = datastore.Query(kind=kind, keys_only=True)
results = query.Get(batch_size)
while results:
    # Report the number actually fetched; the last batch may be smaller.
    print "Deleting %d %s entities" % (len(results), kind)
    datastore.Delete(results)
    results = query.Get(batch_size)
You are much better off working simply with keys than with entire entities. Querying just for entity keys is much faster than fetching entire entities, and you only need the keys for delete.
results = db.Query(LinkRating2, keys_only=True).fetch(1000)
if len(results) > 0:
    db.delete(results)
I didn't bother putting it in a while loop, as trying to delete all of the entities in one go probably stands a good chance of exceeding the 30-second deadline. You'll just need to do this until all of the entities are gone.
results = db.Query(Final, keys_only=True).fetch(1000)
while len(results) > 0:
    db.delete(results)
    results = db.Query(Final, keys_only=True).fetch(1000)
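To reconcile the loop with the request deadline discussed above, one option is to give the loop a time budget and stop early. This is a sketch under assumptions, not something taken from the answers: `fetch_keys` and `delete` are passed in so it runs without the SDK (in production they might be `lambda n: db.Query(LinkRating2, keys_only=True).fetch(n)` and `db.delete`), and the 20-second default budget is an arbitrary value chosen to stay under a 30-second deadline.

```python
import time


def delete_in_batches(fetch_keys, delete, batch_size=1000,
                      budget_seconds=20.0, clock=time.time):
    """Delete keys batch by batch until none remain or time runs out.

    fetch_keys(batch_size) returns up to batch_size keys still stored;
    delete(keys) removes them. Returns (deleted_count, finished).
    """
    deadline = clock() + budget_seconds
    deleted = 0
    while clock() < deadline:
        keys = fetch_keys(batch_size)
        if not keys:
            return deleted, True
        delete(keys)
        deleted += len(keys)
    return deleted, False


# Hypothetical in-memory stand-in for the datastore.
remaining = set(range(2500))


def fake_fetch_keys(limit):
    return sorted(remaining)[:limit]


def fake_delete(keys):
    remaining.difference_update(keys)


deleted, finished = delete_in_batches(fake_fetch_keys, fake_delete)
```

When `finished` comes back false, the caller can simply run the function again (for example from another request or task) until everything is gone.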
The --clear_datastore flag will clear all the data from the datastore, not just the entities of a specific class, which seems to be what you want.
This question has already been asked: Delete all data for a kind in Google App Engine
I think this answer will provide you with what you want. It deletes items by their keys, which should be much faster:
Delete all data for a kind in Google App Engine
As an example, Google App Engine uses Google Datastore, not a standard database, to store data. Does anybody have any tips for using Google Datastore instead of databases? It seems I've trained my mind to think 100% in object relationships that map directly to table structures, and now it's hard to see anything differently. I can understand some of the benefits of Google Datastore (e.g. performance and the ability to distribute data), but some good database functionality is sacrificed (e.g. joins).
Does anybody who has worked with Google Datastore or BigTable have any good advice to working with them?
There are two main things to get used to about the App Engine datastore when compared to 'traditional' relational databases:
The datastore makes no distinction between inserts and updates. When you call put() on an entity, that entity gets stored to the datastore with its unique key, and anything that has that key gets overwritten. Basically, each entity kind in the datastore acts like an enormous map or sorted list.
Querying, as you alluded to, is much more limited. No joins, for a start.
The key thing to realise - and the reason behind both these differences - is that Bigtable basically acts like an enormous ordered dictionary. Thus, a put operation just sets the value for a given key - regardless of any previous value for that key, and fetch operations are limited to fetching single keys or contiguous ranges of keys. More sophisticated queries are made possible with indexes, which are basically just tables of their own, allowing you to implement more complex queries as scans on contiguous ranges.
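The "enormous ordered dictionary" picture can be made concrete with a toy model. The class below is purely illustrative and is not how Bigtable is implemented; it only shows the three operations described above (put that overwrites, get by key, scan of a contiguous key range) and how an index can be built as a second sorted table. The key formats such as `'author:1'` are invented for the example.

```python
import bisect


class OrderedTable(object):
    """Toy model of a table kept sorted by key (not the real Bigtable)."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._rows = {}    # key -> value

    def put(self, key, value):
        # Insert and update are the same operation: the value stored
        # under `key` is simply replaced.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


authors = OrderedTable()
authors.put('author:1', {'city': 'Salt Lake City'})
authors.put('author:2', {'city': 'Boston'})
authors.put('author:1', {'city': 'Olympic Lake City'})  # overwrites

# An index is just another sorted table, keyed by (value, entity key).
city_index = OrderedTable()
for key, row in authors.scan('author:', 'author:~'):
    city_index.put((row['city'], key), None)

boston = city_index.scan(('Boston', ''), ('Boston', '~'))
```

A query such as "all authors in Boston" becomes a scan over one contiguous slice of `city_index`, which is why equality and range filters are cheap while joins and substring matches are not available.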
Once you've absorbed that, you have the basic knowledge needed to understand the capabilities and limitations of the datastore. Restrictions that may have seemed arbitrary probably make more sense.
The key thing here is that although these are restrictions over what you can do in a relational database, these same restrictions are what make it practical to scale up to the sort of magnitude that Bigtable is designed to handle. You simply can't execute the sort of query that looks good on paper but is atrociously slow in an SQL database.
In terms of how to change the way you represent data, the most important thing is precalculation. Instead of doing joins at query time, precalculate data and store it in the datastore wherever possible. If you want to pick a random record, generate a random number and store it with each record. There's a whole cookbook of these sorts of tips and tricks here.
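The random-record tip can be sketched as follows. The sorted list stands in for the index the datastore would maintain on the stored random number, and the field name `rand` is an assumption. Picking a record then amounts to a query ordered by `rand`, with one inequality filter and a limit of one.

```python
import bisect
import random


def add_random_keys(records, rng):
    """Store a precalculated random number with every record."""
    for record in records:
        record['rand'] = rng.random()
    # Stand-in for the index the datastore would keep on 'rand'.
    return sorted(records, key=lambda r: r['rand'])


def pick_random(indexed, rng):
    """Return the first record whose 'rand' is >= a fresh random number.

    If no record has a large enough value, wrap around to the record
    with the smallest one.
    """
    r = rng.random()
    rands = [rec['rand'] for rec in indexed]
    pos = bisect.bisect_left(rands, r)
    return indexed[pos % len(indexed)]


rng = random.Random(42)  # seeded only to make the example repeatable
records = [{'name': 'record-%d' % i} for i in range(10)]
indexed = add_random_keys(records, rng)
choice = pick_random(indexed, rng)
```

The selection is only approximately uniform, since a record that follows a large gap in the stored values is picked more often; that is usually acceptable for this kind of use.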
The way I have been going about the mind switch is to forget about the database altogether.
In the relational db world you always have to worry about data normalization and your table structure. Ditch it all. Just lay out your web pages. Lay them all out. Now look at them. You're already 2/3 there.
If you forget the notion that database size matters and data shouldn't be duplicated then you're 3/4 there and you didn't even have to write any code! Let your views dictate your Models. You don't have to take your objects and make them 2 dimensional anymore as in the relational world. You can store objects with shape now.
Yes, this is a simplified explanation of the ordeal, but it helped me forget about databases and just make an application. I have made 4 App Engine apps so far using this philosophy and there are more to come.
I always chuckle when people come out with "it's not relational". I've written cellectr in Django, and here's a snippet of my model below. As you'll see, I have leagues that are managed or coached by users. From a league I can get all the managers, or from a given user I can return the leagues she coaches or manages.
Just because there's no specific foreign key support doesn't mean you can't have a database model with relationships.
My two pence.
class League(BaseModel):
    name = db.StringProperty()
    managers = db.ListProperty(db.Key)  # all the users who can view/edit this league
    coaches = db.ListProperty(db.Key)  # all the users who are able to view this league

    def get_managers(self):
        # This returns the models themselves, not just the stored keys
        return UserPrefs.get(self.managers)

    def get_coaches(self):
        # This returns the models themselves, not just the stored keys
        return UserPrefs.get(self.coaches)

    def __str__(self):
        return self.name

    # Need to delete all the associated games, teams and players
    def delete(self):
        for player in self.leagues_players:
            player.delete()
        for game in self.leagues_games:
            game.delete()
        for team in self.leagues_teams:
            team.delete()
        super(League, self).delete()


class UserPrefs(db.Model):
    user = db.UserProperty()
    league_ref = db.ReferenceProperty(reference_class=League,
                                      collection_name='users')  # league the users are managing

    def __str__(self):
        return self.user.nickname()

    # many-to-many relationship, a user can coach many leagues, a league can be
    # coached by many users
    @property
    def managing(self):
        return League.gql('WHERE managers = :1', self.key())

    @property
    def coaching(self):
        return League.gql('WHERE coaches = :1', self.key())

    # remove all references to me when I'm deleted
    def delete(self):
        for manager in self.managing:
            manager.managers.remove(self.key())
            manager.put()
        for coach in self.coaching:
            coach.coaches.remove(self.key())
            coach.put()
        super(UserPrefs, self).delete()
I came from the relational database world and then found this Datastore thing. It took several days to get the hang of it. Here are some of my findings.
You probably already know that the Datastore is built to scale, and that is what separates it from an RDBMS. To scale better with large datasets, App Engine has made some changes (and "some" means a lot of changes).
RDBMS VS DataStore
Structure
In a database we usually structure our data in tables and rows; in the Datastore these become kinds and entities.
Relations
In an RDBMS most people follow one-to-one, many-to-one and many-to-many relationships. The Datastore has the "no joins" thing, but we can still achieve normalization using "ReferenceProperty", e.g. One-to-One Relationship Example.
Indexes
In an RDBMS we usually create indexes such as primary key, foreign key, unique key and plain index keys to speed up searches and boost database performance. In the Datastore you have to have at least one index per kind (it will be generated automatically whether you like it or not), because the Datastore finds your entities on the basis of these indexes, and believe me, that is the best part. In an RDBMS you can search on a non-indexed field; it will take some time, but it will work. In the Datastore you cannot search on a non-indexed property.
Count
In an RDBMS it is easy to do count(*), but in the Datastore please don't even think about it in the normal way (yes, there is a count function): it has a limit of 1000 and it costs as many small operations as there are entities counted, which is not good. But we always have good choices: we can use sharded counters.
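A sharded counter keeps the count in several small entities instead of counting rows on demand. The class below is an in-memory sketch of that idea only; in a real app each shard would be a datastore entity updated in a transaction, and the shard count of 20 is an arbitrary default.

```python
import random


class ShardedCounter(object):
    """In-memory sketch of a sharded counter.

    Each shard stands in for one small datastore entity. Writes are
    spread over the shards so no single entity is updated too often;
    reading the total means fetching all shards and summing them.
    """

    def __init__(self, num_shards=20, rng=None):
        self.shards = [0] * num_shards
        self._rng = rng or random.Random()

    def increment(self, amount=1):
        # Pick a shard at random and update only that one.
        index = self._rng.randrange(len(self.shards))
        self.shards[index] += amount

    def total(self):
        return sum(self.shards)


counter = ShardedCounter(num_shards=5)
for _ in range(1000):
    counter.increment()
```

Reading the total costs one fetch per shard no matter how many entities were counted, which is the point of the technique.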
Unique Constraints
In an RDBMS we love this feature, right? But the Datastore has its own way: you cannot define a property as unique :(.
Query
The GAE Datastore provides a feature much LIKE SQL (oh no, the Datastore does not have a LIKE keyword!), which is GQL.
Data Insert/Update/Delete/Select
This is what we are all interested in. In an RDBMS we need one query each for insert, update, delete and select; similarly, the Datastore has put, delete and get (don't get too excited), because Datastore calls are billed in terms of write, read and small operations (see "Read Costs for Datastore Calls"), and that's where data modeling comes into action. You have to minimize these operations to keep your app running. To reduce read operations you can use Memcache.
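The Memcache suggestion usually takes the shape of a read-through cache: look in the cache first and only fall back to a billed datastore read on a miss. In the sketch below `cache` can be anything with `get(key)` and `set(key, value)`, which is the shape of App Engine's `memcache` module; `DictCache` and `fake_load` are hypothetical stand-ins so the example runs on its own.

```python
def cached_get(key, cache, load):
    """Return the value for `key`, reading the datastore only on a miss.

    `load` is whatever performs the real (billed) datastore read.
    """
    value = cache.get(key)
    if value is None:
        value = load(key)
        if value is not None:
            cache.set(key, value)
    return value


class DictCache(object):
    """Hypothetical stand-in for memcache, backed by a dict."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


datastore_reads = []


def fake_load(key):
    datastore_reads.append(key)
    return {'title': 'entity for %s' % key}


cache = DictCache()
first = cached_get('k1', cache, fake_load)
second = cached_get('k1', cache, fake_load)
```

The second call is served from the cache, so only one datastore read is recorded. A real implementation also has to update or delete the cached value whenever the entity is written.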
Take a look at the Objectify documentation. The first comment at the bottom of the page says:
"Nice, although you wrote this to describe Objectify, it is also one of the most concise explanation of appengine datastore itself I've ever read. Thank you."
https://github.com/objectify/objectify/wiki/Concepts
If you're used to thinking about ORM-mapped entities then that's basically how an entity-based datastore like Google's App Engine works. For something like joins, you can look at reference properties. You don't really need to be concerned about whether it uses BigTable for the backend or something else since the backend is abstracted by the GQL and Datastore API interfaces.
The way I look at the datastore is this: a kind identifies a table, so to speak, and an entity is an individual row within that table. If Google were to take out kinds, then it would be just one big table with no structure, and you could dump whatever you want into an entity. In other words, if entities are not tied to a kind, you can give an entity pretty much any structure and store everything in one location (like a big file with no overall structure, where each line has a structure of its own).
Now back to the original comment: Google Datastore and Bigtable are two different things, so do not confuse Google Datastore with a data store in the generic data-storage sense. Bigtable is more expensive than BigQuery (the primary reason we didn't go with it). BigQuery does have proper joins and an RDBMS-like SQL language, and it's cheaper, so why not use BigQuery? That being said, BigQuery does have some limitations; depending on the size of your data you might or might not encounter them.
Also, instead of "thinking in terms of datastore", I think the proper statement would have been "thinking in terms of NoSQL databases". There are many of them available these days, but when it comes to Google products, everything except Google Cloud SQL (which is MySQL) is NoSQL.
Being rooted in the database world, a data store to me would be a giant table (hence the name "bigtable"). BigTable is a bad example though because it does a lot of other things that a typical database might not do, and yet it is still a database. Chances are unless you know you need to build something like Google's "bigtable", you will probably be fine with a standard database. They need that because they are handling insane amounts of data and systems together, and no commercially available system can really do the job the exact way they can demonstrate that they need the job to be done.
(bigtable reference: http://en.wikipedia.org/wiki/BigTable)