Google App Engine Assigning Entity Groups / Parent Keys, Unique constraint - google-app-engine

I have a Kind of 'Customer'. I want to run a transaction that locks the entire Kind when a new 'Customer' is about to be inserted. The transaction would first query to check that the new 'Customer' Name does not already exist, then the 2nd part of the transaction runs the insert if no matches are found. This way I'm enforcing a Unique Constraint (and also restricting the operation to approx 1 insert per second).
My unsatisfactory solution to getting all my 'Customer' entities in the same entity group is to create a Kind called 'EntityGroups', with a single record called 'CustomersGroup'. This one record is used every time as the Parent of newly created 'Customer' entities, thereby grouping the entire Kind into one entity group.
My question is: I am concerned about using a phantom record such as 'CustomerGroup' because if anything happened and it were lost or deleted, I could not assign any new 'Customer' entities to the same group! I imagine it would be better to give each 'Customer' entity a static, arbitrary parent, such as '1111111'? I think the terminology is "virtual root entity"; how do I do this?
Please help with any advice on how I can best handle this!

Why don't you use NDB's get_or_insert? It "transactionally retrieves an existing entity or creates a new one."
https://developers.google.com/appengine/docs/python/ndb/modelclass#Model_get_or_insert
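A minimal sketch of how that could look for the question's use case (the Customer model below, with the customer name used as the key name, is an assumption, not taken from the question):

from google.appengine.ext import ndb

class Customer(ndb.Model):
    # The customer name is used as the key name, so it is unique by construction.
    created = ndb.DateTimeProperty(auto_now_add=True)

def create_customer(name):
    # Returns the existing Customer with this key name, or creates it;
    # get_or_insert runs the get-then-put inside a transaction.
    return Customer.get_or_insert(name)

The uniqueness check and the insert then happen atomically without having to lock the whole Kind.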

Your CustomerGroup record does not need to exist for it to act as a parent. Just create its key by hand and assign it as the parent of the record in question.
You don't need to worry about it being deleted if it does not exist!
When you create a model and set another as its parent, the system does not check (nor does it need to) that that model actually exists at all.
So for example:
rev_key = ndb.Key('CustomerGroup', '11111', 'Customer', 'New_Customer_Name')
Here a model with a key of ('CustomerGroup', '11111') does not actually exist, yet it can still be in the ancestor chain.
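A minimal sketch of how the whole insert could then run inside a transaction with an ancestor query under that virtual root (the Customer model with a name property is an assumption):

from google.appengine.ext import ndb

ROOT_KEY = ndb.Key('CustomerGroup', '11111')  # never stored; purely a virtual root

class Customer(ndb.Model):
    name = ndb.StringProperty(required=True)

@ndb.transactional()
def insert_customer(name):
    # Ancestor queries are allowed inside transactions and are strongly consistent.
    existing = Customer.query(Customer.name == name,
                              ancestor=ROOT_KEY).get(keys_only=True)
    if existing is not None:
        raise ValueError('Customer name already taken')
    return Customer(parent=ROOT_KEY, name=name).put()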

GrantsV, you can achieve this by creating a proxy entity for each unique constraint and using cross-group transactions to commit the constraints with the normal writes.
from google.appengine.ext import db

class UniqueConstraint(db.Model):
    # Consider adding a reference to the owner of the constraint.

    @classmethod
    @db.transactional(propagation=db.MANDATORY, xg=True)
    def reserve(cls, kind, property, value):
        key = cls.__get_key(kind, property, value)
        if db.get(key):
            raise Exception  # Already exists
        cls(key=key).put()

    @classmethod
    @db.transactional(propagation=db.MANDATORY, xg=True)
    def release(cls, kind, property, value):
        db.delete(cls.__get_key(kind, property, value))

    @classmethod
    def __get_key(cls, kind, property, value):
        # Consider using a larger entity group.
        return db.Key.from_path(cls.kind(), '%s:%s:%s' % (kind, property, value))
        # To restrict to 1 insert per second per kind, use:
        # return db.Key.from_path(cls.kind(), kind, cls.kind(), '%s:%s' % (property, value))
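A possible usage sketch (the Customer kind and the create_customer helper below are assumptions for illustration):

class Customer(db.Model):
    name = db.StringProperty(required=True)

@db.transactional(xg=True)
def create_customer(name):
    # reserve() joins this transaction (propagation=MANDATORY) and raises
    # if the name has already been taken.
    UniqueConstraint.reserve('Customer', 'name', name)
    return Customer(name=name).put()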

You can create a parent entity, like this:
class CustomerParent(ndb.Model):
    pass
Then you instantiate and store your parent entity:
customers_parent = CustomerParent()
customers_parent.put()
Finally, when you create all your customer entities, you specify the parent:
a_customer = Customer(parent=customers_parent.key, ...)
a_customer.put()
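To then enforce the unique customer name, the insert can run in a transaction with an ancestor query against that parent, along these lines (a sketch, assuming Customer has an indexed name property):

@ndb.transactional()
def add_customer(name):
    duplicate = Customer.query(Customer.name == name,
                               ancestor=customers_parent.key).get(keys_only=True)
    if duplicate is not None:
        raise ValueError('Customer name already exists')
    return Customer(parent=customers_parent.key, name=name).put()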
Hope this helps!

Related

How to ensure isolation with non-ancestor query

I want to create a user using ndb, as below:
def create_user(self, google_id, ....):
    user_keys = UserInformation.query(UserInformation.google_id == google_id).fetch(keys_only=True)
    if user_keys:  # check whether the user exists
        # already created
        ...(SNIP)...
    else:
        # create a new user entity
        UserInformation(
            # primary key is an incomplete key
            google_id = google_id,
            facebook_id = None,
            twitter_id = None,
            name =
            ...(SNIP)...
        ).put()
If this function is called twice at the same time, two users are created ("isolation" is not ensured between the get() and the put()).
So I added @ndb.transactional to the above function.
But the following error occurred:
BadRequestError: Only ancestor queries are allowed inside transactions.
How to ensure isolation with non-ancestor query?
The ndb library doesn't allow non-ancestor queries inside transactions. So if you make create_user() transactional you get the above error because you call UserInformation.query() inside it (without an ancestor).
If you really want to do that you'd have to place all your UserInformation entities inside the same entity group by specifying a common ancestor and make your query an ancestor one. But that has performance implications, see Ancestor relation in datastore.
Otherwise, even if you split the function in two (one non-transactional part making the query, followed by a transactional one just creating the user), which would avoid the error, you'd still be facing the datastore's eventual consistency, which is actually the root cause of your problem: the result of the query may not immediately include a recently added entity because it takes some time for the index corresponding to the query to be updated. This leaves room for creating duplicate entities for the same user. See Balancing Strong and Eventual Consistency with Google Cloud Datastore.
One possible approach would be to check later/periodically if there are duplicates and remove them (eventually merging the info inside into a single entity). And/or mark the user creation as "in progress", record the newly created entity's key and keep querying until the key appears in the query result, when you finally mark the entity creation as "done" (you might not have time to do that inside the same request).
Another approach would be (if possible) to determine an algorithm to obtain a (unique) key based on the user information and simply check whether an entity with such a key exists instead of making a query. Key lookups are strongly consistent and can be done inside transactions, so that would solve your duplicates problem. For example you could use the google_id as the key ID. That is just an example and not ideal either: you may have users without a google_id, users may want to change their google_id without losing other info, etc. Maybe also track the user creation in progress in the session info to prevent repeated attempts to create the same user in the same session (but that won't help with attempts from different sessions).
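A minimal sketch of that key-based approach (the create_user_if_absent helper and the exact properties passed are illustrative assumptions):

@ndb.transactional()
def create_user_if_absent(google_id, **props):
    key = ndb.Key(UserInformation, google_id)
    user = key.get()  # key lookup: strongly consistent, allowed inside the transaction
    if user is None:
        user = UserInformation(key=key, google_id=google_id, **props)
        user.put()
    return user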
For your use case, perhaps you could use ndb models' get_or_insert method, which according to the API docs:
Transactionally retrieves an existing entity or creates a new one.
So you can do:
user = UserInformation.get_or_insert(*args, **kwargs)
without risking the creation of a new user.
The complete docs:
classmethod get_or_insert(*args, **kwds)
    Transactionally retrieves an existing entity or creates a new one.

    Positional Args:
        name: Key name to retrieve or create.

    Keyword Arguments:
        namespace – Optional namespace.
        app – Optional app ID.
        parent – Parent entity key, if any.
        context_options – ContextOptions object (not keyword args!) or None.
        **kwds – Keyword arguments to pass to the constructor of the model class if an instance for the specified key name does not already exist. If an instance with the supplied key_name and parent already exists, these arguments will be discarded.

    Returns:
        Existing instance of Model class with the specified key name and parent or a new one that has just been created.
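For example, using the question's google_id as the key name (only a sketch; the keyword arguments are whatever UserInformation properties you want set on creation):

user = UserInformation.get_or_insert(
    google_id,              # key name: fetched if it already exists, created otherwise
    google_id=google_id,
    facebook_id=None,
    twitter_id=None)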

NDB query with projection on an attribute used in .IN()

Let's say I have a model:
class Pet(ndb.Model):
    age = ndb.IntegerProperty(indexed=False)
    name = ndb.StringProperty(indexed=True)
    owner = ndb.KeyProperty(indexed=True)
And I have a list of keys named owners. To do a query for Pets I would do:
pets = Pet.query(Pet.owner.IN(owners)).fetch()
The problem is that this query returns the whole entity.
How can I do a projected query and get just the owner and the name?
Or how should I structure the data to just get the name and the owner.
I can do a projection for the name, but then I lose the reference from the pet to the owner. And owner can't be in the projection.
As you have noticed, you can't do that with the exact context you mentioned, because you hit one of the Limitations on projections:
Properties referenced in an equality (=) or membership (IN) filter cannot be projected.
Since owner is used in an IN filter it can't be projected. And since you need the owner and you can't project it, you'll have to drop the projection, and thus you'll always get the entire entity.
One alternative would be to split your entity into 2 peer entities, always into a 1:1 relationship, using the same entity IDs:
class PetA(ndb.Model):
    name = ndb.StringProperty(indexed=True)
    owner = ndb.KeyProperty(indexed=True)

class PetB(ndb.Model):
    age = ndb.IntegerProperty(indexed=False)
This way you can do the same query, except on the PetA kind instead of the original Pet, and the result you'd get would be the equivalent of the projection query you were seeking.
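For example, the earlier query rewritten against the split kind:

pets = PetA.query(PetA.owner.IN(owners)).fetch()
# Each result carries only name and owner; fetch the matching PetB by the
# same ID later if the age is ever needed.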
Unfortunately this will only work with one or a very few such projection queries for the same entity, otherwise you'd need to split the entity in too many pieces. So you may have to compromise.
You can find more details about the entity splitting in re-using an entity's ID for other entities of different kinds - sane idea?

Is there any side effect of not having a physical entity for it to act as parent key

If I go through the Google App Engine tutorial, I can see their examples seem to encourage us to have a parent for entities.
Hence, I have the following working code for user creation (with email as the unique key):
def parent_key():
    return ndb.Key('parent', 'parent')

class User(ndb.Model):
    email = ndb.StringProperty(required=True)
    timestamp = ndb.DateTimeProperty(required=True)

class RegisterHandler2(webapp2.RequestHandler):
    def get(self):
        email = self.request.get('email')
        user_timestamp = int(time.time())
        user = User.get_or_insert(email, parent=parent_key(), email=email,
                                  timestamp=datetime.datetime.fromtimestamp(user_timestamp))
Note that the parent entity physically doesn't exist.
Although the above code runs totally fine, I was wondering whether any problems can occur if the parent entity doesn't physically exist?
One of my concerns about not having a parent is eventual consistency. After a write operation, I want my read operation to be able to fetch the latest written value. I'm using User.get_or_insert to write (and read), and User.get_by_id to read only.
I want that, after I execute User.get_or_insert, the next request's User.get_by_id will return the latest value. I was wondering: to achieve strong consistency, is the parent key important?
There are no problems as long as you don't actually need this parent entity.
You should not make the decision to use parent entities lightly. In fact, using entity groups (parent-child entities) limits the number of entities you can update per second and makes it necessary to know the parent key to retrieve a child entity.
You may run into serious problems. For example, if entity "User" is a child of some parent entity, and all other entities are children of "User" entities, that turns all of your data into one big entity group. Assuming your app is fairly active, you will see datastore operation failures because of this performance limitation.
Note also that a key of an entity gets longer if you have to include a key of a parent entity into it. If you create a chain of entities (e.g. parent -> user -> album -> photo), a key for each "photo" entity will include a key for album, a key for user and a key for parent entity. It becomes a nightmare to manage and requires much more storage space.
Using a parent key that doesn't correspond to an entity that actually has properties (which is what I think you're referring to as a 'physical entity') is a standard technique.
You can even decide later to add properties to that key.
I've been using this technique for years.
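A minimal sketch of the technique with the question's User model (the 'UserParent' kind name is arbitrary and hypothetical):

import datetime
from google.appengine.ext import ndb

PARENT_KEY = ndb.Key('UserParent', 'default')  # never put(); no stored entity behind it

user = User(parent=PARENT_KEY,
            email='someone@example.com',
            timestamp=datetime.datetime.utcnow())
user.put()

# Ancestor queries against that key are strongly consistent:
users = User.query(ancestor=PARENT_KEY).fetch(10)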

Achieving Strong Consistency Using get_or_insert

I have a model like this:
class UserModel(ndb.Model):
    ''' model class which stores all the user information '''
    fname = ndb.StringProperty(required=True)
    lname = ndb.StringProperty(required=True)
    sex = ndb.StringProperty(required=True, choices=['male', 'female'])
    age = ndb.IntegerProperty(required=True)
    dob = ndb.DateTimeProperty(required=True)
    email = ndb.StringProperty(default=None)
    mobile = ndb.StringProperty(required=True)
    city = ndb.StringProperty(required=True)
    state = ndb.StringProperty(required=True)
Since none of the above fields is unique (not even email, because many people may not have email IDs), I am using the following logic to create a string ID:
1. Take first two letters of 'state' and change it to upper case.
2. Take first two letters of 'city' and change it to upper case.
3. Get the count of all records in the database and increment by one.
4. Append all of them together.
I am using get_or_insert for inserting the entity.
Though adding a user will not happen too often, any kind of clash would be catastrophic; the probability of contention is low, but its impact is very high.
My questions are:
1. Will using get_or_insert guarantee that I will never have duplicate IDs?
2. The get_or_insert documentation says "Transactionally retrieves an existing entity or creates a new one." How can something perform an operation "transactionally" without using an ancestor query?
PS: For several reasons I can't keep all the user entities in the same entity group.
In order to provide transactionality, get_or_insert uses a Datastore transaction. In order to use a query in a transaction it must be an ancestor query, however transactions can also get and put, which don't require a parent to be set on the entity.
However, as @Greg mentioned, you absolutely do not want to use this scheme for generating user IDs. In particular, doing a count on your datastore is incredibly slow, will not scale, and is eventually consistent. Because the query is eventually consistent, it may return a count smaller than the actual count whenever the indexes haven't caught up (which for a large app will be all the time). This means you could wait several hours before an insert would actually succeed.
If you want to provide a customer ID with a State and City, I would recommend doing the following (a sketch follows the list):
Do a put using automatic ids.
Expose to the user a "Customer ID" which is the State + City + ID.
When you want to lookup a customer given their "Customer ID", just do a get for the ID portion.
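A rough sketch of that scheme against the question's UserModel (the helper names and the exact ID format are assumptions):

def create_user(**props):
    key = UserModel(**props).put()  # datastore assigns an automatic integer ID
    # "Customer ID" shown to the user: STATE + CITY + numeric ID
    # (assumes the state and city prefixes are two letters each)
    return '%s%s%d' % (props['state'][:2].upper(), props['city'][:2].upper(), key.id())

def lookup_user(customer_id):
    # Only the numeric tail is needed for the datastore lookup.
    return UserModel.get_by_id(int(customer_id[4:]))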
If you keep that ID scheme (for which you honestly don't really need steps 1 and 2, just 3), there is no reason for it to create duplicate IDs. With get_or_insert, it'll look for the exact ID you provide and fetch it if it exists, or simply create it if it doesn't, as explained here. So you CANNOT have duplicate IDs (provided you use this ID as the forced key in your model). If you follow the link provided, it clearly states that:
The get and subsequent (possible) put operations are wrapped in a transaction to ensure atomicity. This means that get_or_insert() will never overwrite an existing entity, and will insert a new entity if and only if no entity with the given kind and name exists.
And the fact that it does this transactionally means it will lock up the entity group to make sure you don't have contention. Since you don't seem to have ancestors, I think it will just lock the entity you're updating.

How can I fetch the latest entry of a model newly put into NDB?

How can I get the latest entry of a model newly put into NDB?
1. If I use the same parent key, how do I do it?
I see the documentation says:
Entities whose keys have the same root form an entity group or group.
If entities are in different groups, then changes to those entities
might sometimes seem to occur "out of order". If the entities are
unrelated in your application's semantics, that's fine. But if some
entities' changes should be consistent, your application should make
them part of the same group when creating them.
Does this mean that, with the same parent key, the order is the insertion order?
But how do I get the last one?
2. If I don't use the same parent key (the model is the same), how do I do it?
If you're OK with eventual consistency (i.e. you might not see the very latest one immediately) you can just add a DateTimeProperty with auto_now_add=True and then run a query sorting by that property to get the latest one. (This is also approximate since you might have several entities saved close together which are ordered differently than you expect.)
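A minimal sketch of that eventually consistent approach (Entry and its properties are placeholder names):

from google.appengine.ext import ndb

class Entry(ndb.Model):
    payload = ndb.StringProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

def get_latest_entry():
    # Eventually consistent: a just-written entity may not appear yet.
    return Entry.query().order(-Entry.created).get()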
If you need it to be exactly correct, the only way I can see is to create an entity whose job it is to hold a reference to the latest entry, and update that entity in the same transaction as the entry you're creating. Something like:
class LatestHolder(ndb.Model):
    latest = ndb.KeyProperty(kind='Entry')

# code to update:
@ndb.transactional(xg=True)
def put_new_entry(entry):
    entry.put()  # put first so entry.key is complete
    holder = LatestHolder.get_or_insert('fixed-key')
    holder.latest = entry.key
    holder.put()
Note that I've used a globally fixed key name here with no parent for the holder class. This is a bottleneck; you might prefer to make several LatestHolder entities with different parents if your "latest entry" only needs to be from a particular parent, in which case you just pass a parent key to get_or_insert.
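Reading the latest entry back is then a pair of strongly consistent key lookups (reusing the 'fixed-key' name and the hypothetical Entry kind from above):

holder = LatestHolder.get_by_id('fixed-key')
latest_entry = holder.latest.get() if holder and holder.latest else None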
