Is there any side effect of not having a physical entity to act as the parent key?

If I go through the Google App Engine tutorial, their examples seem to encourage giving entities a parent.
Hence, I have the following working code for user creation (with email as the unique key):
import datetime
import time

import webapp2
from google.appengine.ext import ndb

def parent_key():
    return ndb.Key('parent', 'parent')

class User(ndb.Model):
    email = ndb.StringProperty(required=True)
    timestamp = ndb.DateTimeProperty(required=True)

class RegisterHandler2(webapp2.RequestHandler):
    def get(self):
        email = self.request.get('email')
        user_timestamp = int(time.time())
        user = User.get_or_insert(
            email,
            parent=parent_key(),
            email=email,
            timestamp=datetime.datetime.fromtimestamp(user_timestamp))
Note that the parent entity doesn't physically exist.
Although the above code runs fine, I was wondering what problems could occur if the parent entity doesn't physically exist.
One of my concerns about not having a parent is eventual consistency. After a write operation, I want my read operation to be able to fetch the latest written value. I'm using User.get_or_insert to write (and read), and User.get_by_id to read only.
I want User.get_by_id on the next request to return the latest value after I execute User.get_or_insert. I was wondering: to achieve strong consistency, is the parent key important?

There are no problems as long as you don't actually need this parent entity.
You should not make the decision to use parent entities lightly. In fact, using entity groups (parent-child entities) limits the number of entities you can update per second and makes it necessary to know the parent key to retrieve a child entity.
You may run into serious problems. For example, if entity "User" is a child of some parent entity, and all other entities are children of "User" entities, that turns all of your data into one big entity group. Assuming your app is fairly active, you will see datastore operation failures because of this performance limitation.
Note also that an entity's key gets longer if it has to include the key of a parent entity. If you create a chain of entities (e.g. parent -> user -> album -> photo), the key for each "photo" entity will include the key for the album, the key for the user, and the key for the parent entity. It becomes a nightmare to manage and requires much more storage space.
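For illustration, a minimal sketch (kind names hypothetical) of how such a chained key accumulates every ancestor pair:

from google.appengine.ext import ndb

# Chain parent -> user -> album -> photo: every ancestor (kind, id)
# pair is embedded in the child's key.
photo_key = ndb.Key('parent', 'parent',
                    'User', 'alice@example.com',
                    'Album', 42,
                    'Photo', 7)

album_key = photo_key.parent()   # drops only the last (kind, id) pair
pairs = photo_key.pairs()        # four pairs stored in every Photo key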

Using a parent key that doesn't correspond to an entity that actually has properties (which is what I think you're referring to as a 'physical entity') is a standard technique.
You can even decide later to add properties to that key.
I've been using this technique for years.
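As a minimal sketch of the technique, reusing the User model from the question (the ParentData model and its _get_kind override are illustrative additions, not part of the original code):

import datetime
from google.appengine.ext import ndb

root = ndb.Key('parent', 'parent')   # no entity is ever stored at this key

# Children can still be created and queried under it:
User(parent=root, email='a@example.com',
     timestamp=datetime.datetime.utcnow()).put()
users = User.query(ancestor=root).fetch()   # strongly consistent

# Later, you can decide to store properties at that same key:
class ParentData(ndb.Model):
    note = ndb.StringProperty()

    @classmethod
    def _get_kind(cls):
        return 'parent'   # match the kind of the bare key above

ParentData(key=root, note='now the key has an entity').put()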

Related

How to ensure isolation with non-ancestor query

I want to create a user using ndb as below:
def create_user(self, google_id, ....):
    user_keys = UserInformation.query(
        UserInformation.google_id == google_id).fetch(keys_only=True)
    if user_keys:  # check whether the user exists
        # already created
        ...(SNIP)...
    else:
        # create a new user entity
        UserInformation(
            # primary key is an incomplete key
            google_id = google_id,
            facebook_id = None,
            twitter_id = None,
            name =
            ...(SNIP)...
        ).put()
If this function is called twice at the same time, two users are created ("isolation" is not ensured between get() and put()).
So I added @ndb.transactional to the above function.
But the following error occurred:
BadRequestError: Only ancestor queries are allowed inside transactions.
How to ensure isolation with non-ancestor query?
The ndb library doesn't allow non-ancestor queries inside transactions. So if you make create_user() transactional you get the above error because you call UserInformation.query() inside it (without an ancestor).
If you really want to do that you'd have to place all your UserInformation entities inside the same entity group by specifying a common ancestor and make your query an ancestor one. But that has performance implications, see Ancestor relation in datastore.
Otherwise, even if you split the function in two (a non-transactional part making the query, followed by a transactional one just creating the user, which would avoid the error), you'd still face the datastore's eventual consistency, which is actually the root cause of your problem: the result of the query may not immediately include a recently added entity, because it takes some time for the index corresponding to the query to be updated. This leaves room for creating duplicate entities for the same user. See Balancing Strong and Eventual Consistency with Google Cloud Datastore.
One possible approach would be to check later/periodically if there are duplicates and remove them (eventually merging the info inside into a single entity). And/or mark the user creation as "in progress", record the newly created entity's key and keep querying until the key appears in the query result, when you finally mark the entity creation as "done" (you might not have time to do that inside the same request).
Another approach would be (if possible) to determine an algorithm that derives a (unique) key from the user information, and just check whether an entity with that key exists instead of making a query. Key lookups are strongly consistent and can be done inside transactions, so that would solve your duplicates problem. For example, you could use the google_id as the key ID. That's just an example, and not ideal either: some users may not have a google_id, users may want to change their google_id without losing other info, etc. Maybe also track the in-progress user creation in the session info to prevent repeated attempts to create the same user in the same session (though that won't help with attempts from different sessions).
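As a sketch of that key-based approach (assuming, per the caveats above, that google_id is stable and always present), a transactional create-if-absent could use a key lookup instead of a query:

from google.appengine.ext import ndb

@ndb.transactional
def create_user_if_absent(google_id, **fields):
    # Key lookups are strongly consistent and allowed inside
    # transactions, unlike non-ancestor queries.
    key = ndb.Key(UserInformation, google_id)
    user = key.get()
    if user is None:
        user = UserInformation(id=google_id, google_id=google_id, **fields)
        user.put()
    return user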
For your use case, perhaps you could use ndb models' get_or_insert method, which according to the API docs:
Transactionally retrieves an existing entity or creates a new one.
So you can do:
user = UserInformation.get_or_insert(*args, **kwargs)
without risking the creation of a new user.
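For example, assuming google_id is acceptable as the key name (see the caveats above):

user = UserInformation.get_or_insert(
    google_id,            # key name: makes the lookup strongly consistent
    google_id=google_id,  # constructor kwargs, used only when creating
    facebook_id=None,
    twitter_id=None)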
The complete docs:
classmethod get_or_insert(*args, **kwds)

Transactionally retrieves an existing entity or creates a new one.

Positional Args:
    name: Key name to retrieve or create.

Keyword Arguments:
    namespace – Optional namespace.
    app – Optional app ID.
    parent – Parent entity key, if any.
    context_options – ContextOptions object (not keyword args!) or None.
    **kwds – Keyword arguments to pass to the constructor of the model
        class if an instance for the specified key name does not already
        exist. If an instance with the supplied key_name and parent
        already exists, these arguments will be discarded.

Returns:
    An existing instance of the Model class with the specified key name
    and parent, or a new one that has just been created.

What would be the purpose of putting all datastore entities in a single group?

I have started working on an existing project which uses Google Datastore where for some of the entity kinds every entity is assigned the same ancestor. Example:
class BaseModel(ndb.Model):
    @classmethod
    def create(cls, **kwargs):
        return cls(parent=cls.make_key(), **kwargs)

    @classmethod
    def make_key(cls):
        return ndb.Key('Group', cls.key_name())

class Vehicle(BaseModel):
    @classmethod
    def key_name(cls):
        return 'vehicle_group'
So the keys end up looking like this:
Key(Group, 'vehicle_group', Vehicle, 5068993417183232)
There is no such kind as 'Group' nor an entity 'vehicle_group', but that's OK according to the docs: "note that unlike in a file system, the parent entity need not actually exist".
I understand from reading that this might have a performance benefit in that all the entities of a kind are colocated in the distributed datastore.
But putting all these entities in a single group would, in my mind, create problems as this project scales, since the one-write-per-second limit would apply to the entire kind. There doesn't appear to be any transactional reason for the group.
No one on the project knows why it was originally done like this. My questions are:
Does anyone know where this "xxx_group" single entity scheme comes from?
And is it as bunk as it appears to be?
Grouping many entities inside a single entity group offers at least 2 advantages I can think of:
ability to perform (ancestor) queries inside transactions - non-ancestor (or cross-group) queries are not allowed inside transactions
ability to access many entities inside the same transaction - cross-group transactions are limited to max 25 entity groups
The 1 write/second/group limit might not be a scalability issue at all for some applications (think write once read a lot kind of apps, for example, or apps for which 1 write per sec is more than enough).
As for the mechanics, the (unique) parent "entity" key for the group is the ndb.Key('Group', 'xxx_group') key (which has "xxx_group" as its key ID). The corresponding "entity" or its model doesn't need to exist (unless the entity itself needs to be created, but that doesn't appear to be the case). The parent key is used simply to establish the group's "namespace" in the datastore, if you want.
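For instance, with the Vehicle model from the question, the artificial group key is what makes a strongly consistent, transaction-safe query possible (a sketch, not code from the project):

group_key = ndb.Key('Group', 'vehicle_group')  # same key BaseModel.make_key() builds

# Ancestor queries are strongly consistent and allowed inside transactions:
vehicles = Vehicle.query(ancestor=group_key).fetch()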
You can see a somewhat similar use in the examples from the Entity Keys documentation; check out the Message usage (except Message is just a "parent" entity in the ancestor path, not the root entity):
class Revision(ndb.Model):
    message_text = ndb.StringProperty()

ndb.Key('Account', 'sandy@foo.com', 'Message', 123, 'Revision', '1')
ndb.Key('Account', 'sandy@foo.com', 'Message', 123, 'Revision', '2')
ndb.Key('Account', 'larry@foo.com', 'Message', 456, 'Revision', '1')
ndb.Key('Account', 'larry@foo.com', 'Message', 789, 'Revision', '2')
...
Notice that Message is not a model class. This is because we are
using Message purely as a way to group Revisions, not to store data.
This was probably done to achieve strongly consistent queries within the group. As you've pointed out, this design has... drawbacks.
If this is solely reference data (i.e. read-many, write-once), that may mitigate some of the negatives, but it also mostly invalidates the positives (i.e. eventual consistency is not a problem if the data doesn't update often).

Achieving Strong Consistency Using get_or_insert

I have a model like this:
class UserModel(ndb.Model):
    ''' model class which stores all the user information '''
    fname = ndb.StringProperty(required=True)
    lname = ndb.StringProperty(required=True)
    sex = ndb.StringProperty(required=True, choices=['male', 'female'])
    age = ndb.IntegerProperty(required=True)
    dob = ndb.DateTimeProperty(required=True)
    email = ndb.StringProperty(default=None)
    mobile = ndb.StringProperty(required=True)
    city = ndb.StringProperty(required=True)
    state = ndb.StringProperty(required=True)
Since none of the above fields is unique (not even email, because many people may not have email ids), I am using the following logic to create a string id:
1. Take the first two letters of 'state' and change them to upper case.
2. Take the first two letters of 'city' and change them to upper case.
3. Get the count of all records in the database and increment it by one.
4. Append them all together.
I am using get_or_insert for inserting the entity.
Though adding a user will not happen too often, any kind of clash would be catastrophic: the probability of contention is low, but its impact is very high.
My questions are:
1. Will using get_or_insert guarantee that I will never have duplicate IDs?
2. The get_or_insert documentation says "Transactionally retrieves an existing entity or creates a new one." How can something perform an operation "transactionally" without using an ancestor query?
PS: For several reasons I can't keep all the user entities in the same entity group.
In order to provide transactionality, get_or_insert uses a Datastore transaction. To use a query in a transaction it must be an ancestor query; however, transactions can also get and put, which don't require a parent to be set on the entity.
However, as @Greg mentioned, you absolutely do not want to use this scheme for generating user ids. In particular, doing a count on your db is incredibly slow, will not scale, and is eventually consistent. Because the query is eventually consistent, it may return a count smaller than the actual count for as long as results lag behind writes (which for a large app will be all the time). This means you could wait several hours before an insert would actually succeed.
If you want to provide a customer ID with a State and City, I would recommend doing the following:
Do a put using automatic ids.
Expose to the user a "Customer ID" which is the State + City + ID.
When you want to lookup a customer given their "Customer ID", just do a get for the ID portion.
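A rough sketch of that recommendation (helper names are hypothetical), reusing the UserModel above:

def register_user(state, city, **fields):
    user = UserModel(state=state, city=city, **fields)
    user.put()  # the datastore assigns a unique numeric ID
    # Display-only "Customer ID"; only the numeric part is the real key.
    return '%s%s%d' % (state[:2].upper(), city[:2].upper(), user.key.id())

def lookup_user(customer_id):
    numeric_id = int(customer_id[4:])        # strip the 4 prefix letters
    return UserModel.get_by_id(numeric_id)   # strongly consistent lookup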
If you keep that ID scheme (for which you honestly don't really need steps 1 and 2, just 3), there is no reason for it to create duplicate IDs. With get_or_insert, it'll look for the exact ID you provide and fetch it if it exists, or simply create it if it doesn't, as explained here. So you CANNOT have duplicate IDs (provided this ID is the forced key in your model). If you follow the link provided, it clearly states that:
The get and subsequent (possible) put operations are wrapped in a transaction to ensure atomicity. This means that get_or_insert() will never overwrite an existing entity, and will insert a new entity if and only if no entity with the given kind and name exists.
And the fact that it does this transactionally means it'll lock up the entity group to be sure you don't have contention. Since you don't seem to have ancestors, I think it'll just lock the entity you're updating.

How can I fetch the latest entry of a model newly put into NDB?

How can I get the latest entry of a model newly put into NDB?
1. If I use the same parent key, how?
I see the documentation says:
Entities whose keys have the same root form an entity group or group. If entities are in different groups, then changes to those entities might sometimes seem to occur "out of order". If the entities are unrelated in your application's semantics, that's fine. But if some entities' changes should be consistent, your application should make them part of the same group when creating them.
Does this mean that, with the same parent key, the order is insertion order? But how do I get the last one?
2. What if I don't use the same parent key (the model is the same)?
If you're OK with eventual consistency (i.e. you might not see the very latest one immediately) you can just add a DateTimeProperty with auto_now_add=True and then run a query sorting by that property to get the latest one. (This is also approximate since you might have several entities saved close together which are ordered differently than you expect.)
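A minimal sketch of that eventually consistent approach, with a hypothetical Entry model:

class Entry(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)  # stamped once at creation

def get_latest_entry():
    # Eventually consistent: a just-written entry may not show up yet.
    return Entry.query().order(-Entry.created).get()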
If you need it to be exactly correct, the only way I can see is to create an entity whose job it is to hold a reference to the latest entry, and update that entity in the same transaction as the entry you're creating. Something like:
class LatestHolder(ndb.Model):
    latest = ndb.KeyProperty(kind='Entry')

# code to update:
@ndb.transactional(xg=True)
def put_new_entry(entry):
    entry.put()   # put first so entry.key is complete
    holder = LatestHolder.get_or_insert('fixed-key')
    holder.latest = entry.key
    holder.put()
Note that I've used a globally fixed key name here with no parent for the holder class. This is a bottleneck; you might prefer to make several LatestHolder entities with different parents if your "latest entry" only needs to be from a particular parent, in which case you just pass a parent key to get_or_insert.
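A per-parent variant might look like this (a sketch; it assumes the entry is created under the same parent, so a single-group transaction suffices):

@ndb.transactional
def put_new_entry_for(parent_key, entry):
    entry.put()   # entry must share parent_key's entity group
    holder = LatestHolder.get_or_insert('latest', parent=parent_key)
    holder.latest = entry.key
    holder.put()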

Using ancestors or reference properties in Google App Engine?

Currently, a lot of my code makes extensive use of ancestors to put and fetch objects. However, I'm looking to change some stuff around.
I initially thought that ancestors helped make querying faster if you knew who the ancestor of the entity you're looking for was. But I think it turns out that ancestors are mostly useful for transaction support. I don't make use of transactions, so I'm wondering if ancestors are more of a burden on the system here than a help.
What I have is a User entity, and a lot of other entities such as say Comments, Tags, Friends. A User can create many Comments, Tags, and Friends, and so whenever a user does so, I set the ancestor for all these newly created objects as the User.
So when I create a Comment, I set the ancestor as the user:
comment = Comment(aUser, key_name=commentId)
Now the only reason I'm doing this is strictly for querying purposes. I thought it would be faster when I wanted to get all comments by a certain user to just get all comments with a common ancestor rather than querying for all comments where authorEmail = userEmail.
So when I want to get all comments by a certain user, I do:
commentQuery = db.GqlQuery('SELECT * FROM Comment WHERE ANCESTOR IS :1', userKey)
So my question is, is this a good use of ancestors? Should each Comment instead have a ReferenceProperty that references the User object that created the comment, and filter by that?
(Also, my thinking was that using ancestors instead of an indexed ReferenceProperty would save on write costs. Am I mistaken here?)
You are right about the writing cost: an ancestor is part of the key, which comes "free". Using a reference property will increase your writing cost if the reference property is indexed.
Since you query on that reference property, it will need to be indexed.
Ancestors are not only important for transactions. In the HRD (the default datastore implementation), if you don't create each comment with the same ancestor, the queries will not be strongly consistent.
-- Adding Nick's comment ---
Every entity with the same parent will be in the same entity group, and writes to entity groups are serialized, so using ancestors here will slow things down iff you're writing multiple entities concurrently. Since all the entities in a group are 'owned' by the user that forms the root of the group in your instance, though, this shouldn't be a problem - and in fact, what you're doing is actually a recommended design pattern.
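For comparison, a sketch of the ReferenceProperty alternative the question asks about (collection_name is an assumption):

from google.appengine.ext import db

class Comment(db.Model):
    # Indexed by default, so each write costs extra index writes,
    # but comments no longer share one entity group per user.
    author = db.ReferenceProperty(User, collection_name='comments')

# Eventually consistent query by author instead of by ancestor:
comments = Comment.all().filter('author =', aUser).fetch(100)
# Or via the automatic back-reference:
comments = aUser.comments.fetch(100)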
