I am trying to do my reads and writes for GAE as efficiently as possible and I was wondering which is the best of the following two options.
I have a website where users are able to post different things and right now whenever I want to show all posts by that user I do a query for all posts with that user's user ID and then I display them. Would it be better to store all of the post IDs in the user entity and do a get_by_id(post_ID_list) to return all of the posts? Or would that extra space being used up not be worth it?
Is there anywhere I can find more information like this to optimize my web app?
Thanks!
The main reason you would want to store the list of IDs would be so that you can get each entity separately for better consistency - entity gets by id are consistent with the latest version in the datastore, while queries are eventually consistent.
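For illustration, here is a minimal NDB sketch of the two options; the model names and properties are my own assumptions, not from the question:

from google.appengine.ext import ndb

class Post(ndb.Model):
    author_id = ndb.StringProperty()

class Account(ndb.Model):
    post_keys = ndb.KeyProperty(kind=Post, repeated=True)  # the "stored IDs" option

def posts_by_query(user_id):
    # Eventually consistent: a just-written post may be missing from the results.
    return Post.query(Post.author_id == user_id).fetch()

def posts_by_stored_keys(account):
    # Strongly consistent: gets by key always return the latest version.
    return ndb.get_multi(account.post_keys)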
Check datastore costs and optimize for cost:
https://developers.google.com/appengine/docs/billing
Getting entities by key wouldn't be any cheaper than querying all the posts. The query makes use of an index.
If you use projection queries, you can reduce your costs quite a bit.
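A quick sketch of a projection query in NDB (the model and properties are assumed); it is served from the index alone and is billed as small operations rather than full entity reads:

from google.appengine.ext import ndb

class Post(ndb.Model):
    title = ndb.StringProperty()
    body = ndb.TextProperty()

titles = Post.query().fetch(projection=[Post.title])  # returns partial entities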
There are several cases.
First, if you keep track of all the IDs of a user's posts, you must use an entity group for consistency. That means the write rate to the datastore is limited to roughly 1 entity per second, and the cost is 1 read for the object holding the IDs plus 1 read per entity.
Second, if you just use a query, no consistency guarantee is needed. The cost is 1 read + 1 read per entity retrieved.
Third, if you query for keys only and then fetch the entities, the cost is 1 read + 1 small operation per key retrieved. See: Keys-Only Queries. This is equal in cost to a projection query.
And if you have many results and use pagination, you should use Query Cursors, which prevent wasted datastore work.
The most economical solution is the third case. See: Batch Operations.
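A hedged NDB sketch of that third case combined with cursors; the Post model and page size are assumptions:

from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import ndb

class Post(ndb.Model):
    author_id = ndb.StringProperty()

def post_page(user_id, websafe_cursor=None, page_size=20):
    cursor = Cursor(urlsafe=websafe_cursor) if websafe_cursor else None
    # Keys-only query: 1 read for the query + 1 small operation per key.
    keys, next_cursor, more = Post.query(Post.author_id == user_id).fetch_page(
        page_size, keys_only=True, start_cursor=cursor)
    posts = ndb.get_multi(keys)  # batch get: 1 read per entity actually fetched
    return posts, next_cursor.urlsafe() if (more and next_cursor) else None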
If you have a list of IDs because they are stored with your entity, a call to ndb.get_multi (assuming you are using NDB, though it would be similar with any other framework that uses memcache to cache single entities) saves you further datastore calls if all (or most) of the entities corresponding to the keys are already in memcache.
So in the best possible case (everything is in memcache), the datastore wouldn't be touched at all, while a query always would touch it.
See this issue for a discussion and caveats: http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118.
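As a small sketch of that caching behaviour (key_list stands in for whatever list of keys you stored):

from google.appengine.ext import ndb

def cached_get(key_list):
    # Keys already warm in NDB's in-context cache or in memcache never touch
    # the datastore; only the misses fall through.
    return ndb.get_multi(key_list, use_cache=True, use_memcache=True)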
Related
I have three entities: user, post and comment. A user may have multiple posts and a post may have multiple comments.
I know I can add ancestor relations like this:
user (grandparent) → post (parent) → comment (child)
I'm a little confused about ancestors. I read in the documentation and elsewhere that ancestors are used for transactions, that an entity and its ancestors all belong to the same entity group, and that an entity group is stored on the same datastore node, which makes it less scalable. Is this right?
Is creating user as parent of posts and post as parent of comments a good thing?
Rather than this, we could add one extra property to the post entity, such as user_id, as shown in the example, and filter by it.
Which is better/more scalable: filter posts by ancestors or add an extra property user_id in the post Entity and filter by it?
I know both approaches can get the same results, but I want to know which one is better in performance and scalability.
Sorry, I'm new to the datastore.
Update 11/4/2017
A large number of users is using this app. It's quite possible there is more than one post per second overall. A single user cannot create more than one post per second, but multiple users together may. The documentation describes a maximum entity-group write rate of 1/s. Is it still possible to use an ancestor?
The same goes for comments: multiple users can add comments in the same entity group, so more than one comment per second is quite possible.
Are ancestor queries faster?
I have read in many places that ancestor queries are much faster than other queries.
As I understand it, the reason they are fast is that an entity group keeps related data on the same node, so getting the data from a single node takes less time than gathering it from several.
For example: if a post is stored on a node in Asia and a comment on a node in Europe, and I want to fetch posts together with their comments, the datastore API has to reach two nodes to complete the request, which makes it slow. If instead I create an ancestor relation and form an entity group, performance should be better.
But what if I never need post and comment data at the same time - if I show posts on one page and comments on another? Then the datastore API only has to reach one node per request, and it doesn't matter whether the data lives on a single node or on several. Can an ancestor still make queries faster in that case?
Yes, you are correct: all ancestry-related entities are in the same entity group, which raises 2 scalability issues: data contention and maximum entity group write rate of 1/s. See somehow related Is there an Entity Group Max Size?
There are advantages of using ancestries and some may be willing to sacrifice scalability for them (see What would be the purpose of putting all datastore entities in a single group?), but IMHO not for your kind of app: I think you'll agree that it's not really critical to see every new user/post/comment in random searches immediately after it is created (i.e. strong consistency) - the fact that it eventually appears is IMHO good enough.
Simply having no ancestry at all and adding additional model properties (entity keys or even just entity key IDs for entities which never have ancestors) to allow cross-referencing entities is the more scalable approach and IMHO fits well with your app.
I think the question to ask is: are you expecting:
A user to create posts more than once per second? (I doubt it :)
People to comment on a post more than once per second? (could happen)
If not, then ancestor queries will be faster than normal queries. So it depends on your use case. I'd go for query speed unless you know you will have thousands of comments on posts.
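A sketch of the two query shapes being compared here (NDB; the models and property names are assumptions):

from google.appengine.ext import ndb

class Post(ndb.Model):
    user_key = ndb.KeyProperty(kind='User')  # plain cross-reference property

def posts_with_ancestor(user_key):
    # Strongly consistent, but the whole group is capped at ~1 write/sec.
    return Post.query(ancestor=user_key).fetch()

def posts_by_property(user_key):
    # Eventually consistent, with no group-wide write limit.
    return Post.query(Post.user_key == user_key).fetch()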
Using Google App Engine's NDB datastore, how do I ensure a strongly consistent read of a list of entities after creating a new entity?
The example use case is that I have entities of the Employee kind.
Create a new employee entity
Immediately load a list of employees (including the one that was added)
I understand that the approach below will yield an eventually consistent read of the list of employees which may or may not contain the new employee. This leads to a bad experience in the case of the latter.
e = Employee(...)
e.put()
# Eventually consistent: this query may not see e yet.
Employee.query().fetch(...)
Now here are a few options I've thought about:
IMPORTANT QUALIFIERS
I only care about a consistent list read for the user who added the new employee. I don't care if other users get an eventually consistent read.
Let's assume I do not want to put all the employees under an Ancestor to enable a strongly consistent ancestor query. In the case of thousands and thousands of employee entities, the 5 writes / second limitation is not worth it.
Let's also assume that I want the write and the list read to be the result of two separate HTTP requests. I could theoretically put both write and read into a single transaction (?) but then that would be a very non-RESTful API endpoint.
Option 1
Create a new employee entity in the datastore
Additionally, write the new employee object to memcache, local browser cookie, local mobile storage.
Query datastore for list of employees (eventually consistent)
If new employee entity is not in this list, add it to the list (in my application code) from memcache / local memory
Render results to user. If user selects the new employee entity, retrieve the entity using key.get() (strongly consistent).
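A sketch of Option 1's merge step; the memcache key naming is an assumption, and the key is presumed to have been stored there by the create request:

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Employee(ndb.Model):
    name = ndb.StringProperty()

def list_employees(current_user_id):
    employees = Employee.query().fetch(100)     # eventually consistent
    recent_key = memcache.get('recent-employee:%s' % current_user_id)
    if recent_key and recent_key not in [e.key for e in employees]:
        recent = recent_key.get()               # strongly consistent get by key
        if recent:
            employees.append(recent)
    return employees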
Option 2
Create a new employee entity using a transaction
Query datastore for list of employees in a transaction
I'm not sure Option #2 actually works.
Technically, does the previous write transaction get written to all the servers before the read transaction of that entity occurs? Or is this not correct behavior?
Transactions (including XG) have a limit on number of entity groups and a list of employees (each is its own entity group) could exceed this limit.
What are the downsides of read-only transactions vs. normal reads?
Thoughts? Option #1 seems like it would work, but it seems like a lot of work to ensure consistency on a follow-on read.
If you do not use an entity group, you can do a keys-only query followed by a get_multi(keys) lookup for entity consistency. For the new employee, you have to pass the new key into the key list of the get_multi.
Docs: A combination of the keys-only, global query with a lookup method will read the latest entity values. But it should be noted that a keys-only global query can not exclude the possibility of an index not yet being consistent at the time of the query, which may result in an entity not being retrieved at all. The result of the query could potentially be generated based on filtering out old index values. In summary, a developer may use a keys-only global query followed by lookup by key only when an application requirement allows the index value not yet being consistent at the time of a query.
More info and magic here : Balancing Strong and Eventual Consistency with Google Cloud Datastore
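A sketch of that keys-only + lookup pattern, using the Employee model from the question; passing the newly created key explicitly covers the index-lag case the docs quote warns about:

from google.appengine.ext import ndb

def consistent_list(new_key=None):
    keys = Employee.query().fetch(100, keys_only=True)
    if new_key and new_key not in keys:
        keys.append(new_key)   # make sure the fresh entity is looked up too
    # Lookup by key returns the latest entity values; drop any stale misses.
    return [e for e in ndb.get_multi(keys) if e is not None]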
I had the same problem, option #2 doesn't really work: a read using the key will work, but a query might still miss the new employee.
Option #1 could work, but only within the same request. The saved memcache key can disappear at any time, and a subsequent query on the same instance, or on another instance potentially running on different hardware, would still miss the new employee.
The only "solution" that comes to mind for consistent query results is to not attempt to force the new employee into the results at all, and rather let things flow naturally until it shows up. I'd just add a warning that creating the new user will take "a while". If tolerable, maybe keep polling/querying in the original request until it appears - that is the only place where the employee creation event is known with certainty.
This question is old as I write this. However, it is a good question and will be relevant long term.
Option #2 from the original question will not work.
If the entity creation and the subsequent query are truly independent, with no context linking them, then you are really just stuck - or you don't care. The trick is that there is almost always some relationship or some use case that must be covered. In other words if the query is truly some kind of, essentially, ad hoc query, then you really don't care. In that case, you just quote CAP theorem and remind the client executing the query how great it is that this system scales. However, almost always, if you are worried about the eventual consistency, there is some use case or set of cases that must be handled. For example, if you have a high score list, the highest score must be at the top of the list. The highest score may have just been achieved by the user who is now looking at the list. Another example might be that when an employee is created, that employee must be on the "new employees" list.
So what you usually do is exploit these known cases to balance the throughput needed with consistency. For example, for the high score example, you may be able to afford to keep a secondary index (an entity) that is the list of the high scores. You always get it by key and you can write to it as frequently as needed because high scores are not generated that often presumably. For the new employee example, you might use an approach that you started to suggest by storing the timestamp of the last employee in memcache. Then when you query, you check to make sure your list includes that employee ... or something along those lines.
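A hedged sketch of that "secondary index entity" idea for the high-score example; the model and key name are invented for illustration:

from google.appengine.ext import ndb

class ScoreBoard(ndb.Model):
    top = ndb.JsonProperty(default=[])  # [[user_id, score], ...] sorted descending

@ndb.transactional
def record_score(user_id, score):
    key = ndb.Key(ScoreBoard, 'global')
    board = key.get() or ScoreBoard(key=key)
    board.top = sorted(board.top + [[user_id, score]],
                       key=lambda pair: pair[1], reverse=True)[:10]
    board.put()  # single-entity read-modify-write, always fetched by key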
The price in balancing write throughput and consistency on App Engine and similar systems is always the same. It requires increased model complexity / code complexity to bridge the business needs.
I am making a mobile iOS app. A user can create an account, and upload strings. It will be like twitter, you can follow people, have profile pictures etc. I cannot estimate the user base, but if the app takes off, the total dataset may be fairly large.
I am storing the actual objects on Amazon S3 and the keys in a database; listing Amazon S3 keys is slow. So which would be better for storing the keys?
This is my knowledge of SimpleDB and DynamoDB:
SimpleDB:
Cheap
Performs well
Designed for small/medium datasets
Can query using select expressions
DynamoDB:
Costly
Extremely scalable
Performs great; millisecond response
Cannot query
These points match my understanding: DynamoDB is more about killer speed and scalability, while SimpleDB is more about querying and price (while still delivering good performance). But look at it this way: which will be faster, downloading ALL keys from DynamoDB (and then matching them myself), or doing a select query with SimpleDB? Hard, right? One uses a blazing-fast database to download a lot, the other uses a reasonably good-performance database to query for and download only the few correct objects. So, which is faster:
DynamoDB downloading everything and matching OR SimpleDB querying and downloading that
(NOTE: Matching just means using -rangeOfString and string comparison - nothing power-hungry or time-inefficient, and nothing server-side)
My S3 keys will use this format for every type of object
accountUsername:typeOfObject:randomGeneratedKey
E.g. If you are referencing to an account object
Rohan:Account:shd83SHD93028rF
Or a profile picture:
Rohan:ProfilePic:Nck83S348DD93028rF37849SNDh
I include the randomly generated key for uniqueness; it does not refer to anything. It is simply there so that keys are never repeated, which would make two objects overlap.
In my app, I can either choose SimpleDB or DynamoDB, so here are the two options:
Use SimpleDB, store keys with the format but not use the format for any reference, instead use attributes stored with SimpleDB. So, I store the key with attributes like username, type and maybe others I would also have to include in the key format. So if I want to get the account object from user 'Rohan'. I just use SimpleDB Select to query the attribute 'username' and the attribute 'type'. (where I match for 'account')
DynamoDB, store keys and each key will have the illustrated format. I scan the whole database returning every single key. Then get the key and take advantage of the key format, I can use -rangeOfString to match the ones I want and then download from S3.
Also, SimpleDB is apparently geographically-distributed, how can I enable that though?
So which is quicker and more reliable? Using SimpleDB to query keys with attributes. Or using DynamoDB to store all keys, scan (download all keys) and match using e.g. -rangeOfString? Mind the fact that these are just short keys that are pointers to S3 objects.
Here is my last question (the number of objects in the database will depend on the answer): should I:
Create a separate key/object for every single object a user has
Create an account key/object and store all information inside there
There are obviously advantages and disadvantages to each option. For example, retrieval would be quicker if everything is separate, but storing it all in one account object is more organized and keeps the dataset smaller.
So what do you think?
Thanks for the help! I have put a bounty on this, really need an answer ASAP.
Wow! What a Question :)
OK, let's discuss some aspects:
S3
S3 performance is most likely low because you're not adding a prefix when listing keys.
If you shard by storing the objects like type/owner/id, listing all the IDs for a given owner (prefixed as type/owner/) will be fast - or at least faster than listing everything at once.
Dynamo Versus SimpleDB
In general, that's my advice:
Use SimpleDB when:
Your entity storage isn't going to exceed 10 GB
You need to apply complex queries involving multiple fields
Your queries aren't well defined
You can leverage Multi-Valued Data Types
Use DynamoDB when:
Your entity storage will exceed 10 GB
You want to scale demand / throughput as it goes
Your queries and model are well-defined and unlikely to change
Your model is dynamic, involving a loose schema
You can cache your queries on the client side (so you can save on throughput by querying the cache prior to Dynamo)
You want to do aggregate/rollup summaries, using Atomic Updates
Given your current description, it seems SimpleDB is actually the better fit, since:
- Your model isn't completely defined
- You can defer some decisions, since it takes a while to hit the 10 GiB limit
Geographical SimpleDB
It isn't supported; SimpleDB works only from us-east-1, as far as I know.
Key Naming
This applies mostly to Dynamo: whenever you can, use a Hash + Range key. But you could also create keys using just a Hash and apply some queries, like:
List all my records on table T which start with accountid:
List all my records on table T which start with accountid:image
However, those are all Scans. Bear that in mind.
(See this for an overview: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_Scan.html)
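For what it's worth, here is a rough sketch of such a prefix Scan written against today's boto3 (which postdates this answer); the table and attribute names are invented:

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource('dynamodb').Table('s3keys')  # hypothetical table name

# "List all records which start with Rohan:" -- note this is still a full-table
# Scan with a filter, so every item is read (and billed) regardless of matches.
resp = table.scan(FilterExpression=Attr('s3key').begins_with('Rohan:'))
matches = [item['s3key'] for item in resp['Items']]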
Bonus Track
If you're using Java, cloudy-data on Maven Central includes SimpleJPA with some extensions to map blob fields to S3. So give it a look:
http://bitbucket.org/ingenieux/cloudy
Thank you
What is the proper way to perform mass updates on entities in a Google App Engine Datastore? Can it be done without having to retrieve the entities?
For example, what would be the GAE equivalent of something like this in SQL:
UPDATE dbo.authors
SET city = replace(city, 'Salt', 'Olympic')
WHERE city LIKE 'Salt%';
There isn't a direct translation. The datastore really has no concept of updates; all you can do is overwrite old entities with a new entity at the same address (key). To change an entity, you must fetch it from the datastore, modify it locally, and then save it back.
There's also no equivalent to the LIKE operator. While wildcard suffix matching is possible with some tricks, if you wanted to match '%Salt%' you'd have to read every single entity into memory and do the string comparison locally.
So it's not going to be quite as clean or efficient as SQL. This is a tradeoff with most distributed object stores, and the datastore is no exception.
That said, the mapper library is available to facilitate such batch updates. Follow the example and use something like this for your process function:
from mapreduce import operation as op

def process(entity):
    if entity.city.startswith('Salt'):
        entity.city = entity.city.replace('Salt', 'Olympic')
        yield op.db.Put(entity)
There are other alternatives besides the mapper. The most important optimization tip is to batch your updates; don't save back each updated entity individually. If you use the mapper and yield puts, this is handled automatically.
No, it can't be done without retrieving the entities.
There's no such thing as a '1000 max record limit', but there is of course a timeout on any single request - and if you have large amounts of entities to modify, a simple iteration will probably fall foul of that. You could manage this by splitting it up into multiple operations and keeping track with a query cursor, or potentially by using the MapReduce framework.
You could use the query class: http://code.google.com/appengine/docs/python/datastore/queryclass.html
records = authors.all().filter('city >=', 'Salt').filter('city <', 'Salu').fetch(500)
for record in records:
    record.city = record.city.replace('Salt', 'Olympic')
db.put(records)  # batch the writes instead of saving one entity at a time
As an example, Google App Engine uses Google Datastore, not a standard database, to store data. Does anybody have any tips for using Google Datastore instead of databases? It seems I've trained my mind to think 100% in object relationships that map directly to table structures, and now it's hard to see anything differently. I can understand some of the benefits of Google Datastore (e.g. performance and the ability to distribute data), but some good database functionality is sacrificed (e.g. joins).
Does anybody who has worked with Google Datastore or BigTable have any good advice to working with them?
There are two main things to get used to about the App Engine datastore compared to 'traditional' relational databases:
The datastore makes no distinction between inserts and updates. When you call put() on an entity, that entity gets stored to the datastore with its unique key, and anything that has that key gets overwritten. Basically, each entity kind in the datastore acts like an enormous map or sorted list.
Querying, as you alluded to, is much more limited. No joins, for a start.
The key thing to realise - and the reason behind both these differences - is that Bigtable basically acts like an enormous ordered dictionary. Thus, a put operation just sets the value for a given key - regardless of any previous value for that key, and fetch operations are limited to fetching single keys or contiguous ranges of keys. More sophisticated queries are made possible with indexes, which are basically just tables of their own, allowing you to implement more complex queries as scans on contiguous ranges.
Once you've absorbed that, you have the basic knowledge needed to understand the capabilities and limitations of the datastore. Restrictions that may have seemed arbitrary probably make more sense.
The key thing here is that although these are restrictions over what you can do in a relational database, these same restrictions are what make it practical to scale up to the sort of magnitude that Bigtable is designed to handle. You simply can't execute the sort of query that looks good on paper but is atrociously slow in an SQL database.
In terms of how to change how you represent data, the most important thing is precalculation. Instead of doing joins at query time, precalculate data and store it in the datastore wherever possible. If you want to pick a random record, generate a random number and store it with each record. There's a whole cookbook of this sort of tips and tricks here.
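The random-record trick, sketched in NDB (the model is hypothetical):

import random
from google.appengine.ext import ndb

class Quote(ndb.Model):
    text = ndb.StringProperty()
    rand = ndb.FloatProperty()  # set to random.random() when the entity is written

def random_quote():
    r = random.random()
    # First record at or above r; wrap around if r landed past the last one.
    q = Quote.query(Quote.rand >= r).order(Quote.rand).get()
    return q or Quote.query().order(Quote.rand).get()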
The way I have been going about the mind switch is to forget about the database altogether.
In the relational DB world you always have to worry about data normalization and your table structure. Ditch it all. Just lay out your web pages. Lay them all out. Now look at them. You're already 2/3 of the way there.
If you forget the notion that database size matters and that data shouldn't be duplicated, then you're 3/4 of the way there and you didn't even have to write any code! Let your views dictate your models. You don't have to flatten your objects into two dimensions anymore as in the relational world. You can store objects with shape now.
Yes, this is a simplified explanation of the ordeal, but it helped me forget about databases and just make an application. I have made 4 App Engine apps so far using this philosophy and there are more to come.
I always chuckle when people come out with "it's not relational". I've written cellectr in django and here's a snippet of my model below. As you'll see, I have leagues that are managed or coached by users. From a league I can get all the managers, or from a given user I can return the leagues she coaches or manages.
Just because there's no specific foreign key support doesn't mean you can't have a database model with relationships.
My two pence.
class League(BaseModel):
    name = db.StringProperty()
    managers = db.ListProperty(db.Key)  # all the users who can view/edit this league
    coaches = db.ListProperty(db.Key)   # all the users who are able to view this league

    def get_managers(self):
        # This returns the models themselves, not just the keys stored in the list
        return UserPrefs.get(self.managers)

    def get_coaches(self):
        # This returns the models themselves, not just the keys stored in the list
        return UserPrefs.get(self.coaches)

    def __str__(self):
        return self.name

    # Need to delete all the associated games, teams and players
    def delete(self):
        for player in self.leagues_players:
            player.delete()
        for game in self.leagues_games:
            game.delete()
        for team in self.leagues_teams:
            team.delete()
        super(League, self).delete()


class UserPrefs(db.Model):
    user = db.UserProperty()
    league_ref = db.ReferenceProperty(reference_class=League,
                                      collection_name='users')  # league the users are managing

    def __str__(self):
        return self.user.nickname()

    # many-to-many relationship: a user can coach many leagues, a league can be
    # coached by many users
    @property
    def managing(self):
        return League.gql('WHERE managers = :1', self.key())

    @property
    def coaching(self):
        return League.gql('WHERE coaches = :1', self.key())

    # remove all references to me when I'm deleted
    def delete(self):
        for manager in self.managing:
            manager.managers.remove(self.key())
            manager.put()
        for coach in self.coaching:
            coach.coaches.remove(self.key())
            coach.put()
        super(UserPrefs, self).delete()
I came from the relational database world and then found this Datastore thing. It took several days to get the hang of it. Here are some of my findings.
You probably already know that the Datastore is built to scale, and that is what separates it from an RDBMS. To scale better with large datasets, App Engine has made some changes (and by "some" I mean a lot).
RDBMS VS DataStore
Structure
In a database we usually structure our data in tables and rows; in the Datastore these become kinds and entities.
Relations
In an RDBMS, most people use one-to-one, many-to-one, and many-to-many relationships. The Datastore has that "no joins" thing, but we can still achieve normalization using ReferenceProperty, e.g. the One-to-One Relationship Example (see the sketch below).
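A tiny db.ReferenceProperty sketch of that idea (the models are illustrative):

from google.appengine.ext import db

class Author(db.Model):
    name = db.StringProperty()

class Book(db.Model):
    author = db.ReferenceProperty(Author, collection_name='books')

# book.author dereferences to the Author entity; author.books runs the
# reverse query -- no join needed.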
Indexes
Usually in an RDBMS we create indexes - primary keys, foreign keys, unique keys, index keys - to speed up search and boost database performance. In the Datastore, every kind has at least one index (generated automatically whether you like it or not), because the Datastore finds your entities through these indexes - and believe me, that is the best part. In an RDBMS you can search on a non-indexed field; it may take some time, but it works. In the Datastore you cannot search on a non-indexed property.
Count
In an RDBMS it is easy to count(*), but in the Datastore, please don't even think about it the normal way (yes, there is a count function): it has a 1000 limit and costs one small operation per entity counted, which is not good. But we always have better choices: we can use shard counters, as sketched below.
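A minimal shard-counter sketch (a simplified version of the standard GAE recipe; the names and shard count are assumptions):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20

class CounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(name):
    # Spread writes over NUM_SHARDS entities to dodge the 1 write/sec limit.
    shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
    key = ndb.Key(CounterShard, shard_id)
    shard = key.get() or CounterShard(key=key)
    shard.count += 1
    shard.put()

def get_count(name):
    keys = [ndb.Key(CounterShard, '%s-%d' % (name, i)) for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s)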
Unique Constraints
In an RDBMS we love this feature, right? But the Datastore has its own way: you cannot define a property as unique :(. One common workaround is sketched below.
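The workaround: use the unique value itself as a key name and claim it in a transaction (the model name is illustrative):

from google.appengine.ext import ndb

class UniqueUsername(ndb.Model):
    pass  # the key's mere existence is the constraint

@ndb.transactional
def claim_username(username):
    key = ndb.Key(UniqueUsername, username.lower())
    if key.get() is not None:
        return False          # already taken
    UniqueUsername(key=key).put()
    return True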
Query
The GAE Datastore provides a query language much LIKE SQL (oh no - the Datastore does not have a LIKE keyword), called GQL.
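For example, a GQL query roughly equivalent to a SQL SELECT (the model and property are assumed):

from google.appengine.ext import ndb

posts = ndb.gql("SELECT * FROM Post WHERE author_id = :1", 'rohan').fetch(20)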
Data Insert/Update/Delete/Select
This is where we are all interested. In an RDBMS we write one query each for insert, update, delete and select; the Datastore instead has put, delete and get (don't get too excited), and everything is billed in terms of write, read and small operations (read costs for Datastore calls). That is where data modeling comes into action: you have to minimize these operations to keep your app running affordably. To reduce read operations you can use memcache.
Take a look at the Objectify documentation. The first comment at the bottom of the page says:
"Nice, although you wrote this to describe Objectify, it is also one of the most concise explanation of appengine datastore itself I've ever read. Thank you."
https://github.com/objectify/objectify/wiki/Concepts
If you're used to thinking about ORM-mapped entities then that's basically how an entity-based datastore like Google's App Engine works. For something like joins, you can look at reference properties. You don't really need to be concerned about whether it uses BigTable for the backend or something else since the backend is abstracted by the GQL and Datastore API interfaces.
The way I look at the Datastore is: a kind identifies a table, per se, and an entity is an individual row within that table. If Google were to take out kind, then it's just one big table with no structure, and you could dump whatever you want into an entity. In other words, if entities are not tied to a kind, you can pretty much give an entity any structure and store everything in one location (a kind of big file with no overall structure, where each line has a structure of its own).
Now, back to the original comment: Google Datastore and Bigtable are two different things, so do not confuse Google Datastore with "datastore" in the general data-storage sense. Bigtable is more expensive than BigQuery (the primary reason we didn't go with it). BigQuery has proper joins and an RDBMS-like SQL language, and it's cheaper, so why not use BigQuery? That said, BigQuery does have some limitations; depending on the size of your data you may or may not encounter them.
Also, rather than "thinking in terms of the datastore", I think the proper statement would be "thinking in terms of NoSQL databases". There are too many of them available these days, but when it comes to Google products, except for Google Cloud SQL (which is MySQL), everything else is NoSQL.
Being rooted in the database world, a data store to me would be a giant table (hence the name "bigtable"). BigTable is a bad example though because it does a lot of other things that a typical database might not do, and yet it is still a database. Chances are unless you know you need to build something like Google's "bigtable", you will probably be fine with a standard database. They need that because they are handling insane amounts of data and systems together, and no commercially available system can really do the job the exact way they can demonstrate that they need the job to be done.
(bigtable reference: http://en.wikipedia.org/wiki/BigTable)