AppEngine allocateIdRange: clarification about CONTENTION state

I'm copying entities from one kind to another, and want to map their long ids in a predictable way. After the mapping is over, I want auto-generation of ids to kick in.
To protect the entities I copy, I want to use allocateIdRange and manually allocate each id as I copy it. My hope is that this will cause the datastore to protect these new ids, and only assign other ids to new entities created after the copy.
One return code has me worried: CONTENTION
Indicates the given KeyRange is empty but the datastore's automatic ID
allocator may assign new entities keys in this range. However it is
safe to manually assign Keys in this range if either of the following
is true:
1. No other request will insert entities with the same kind and parent as the given KeyRange until all entities with manually assigned keys from this range have been written.
2. Overwriting entities written by other requests with the same kind and parent as the given KeyRange is acceptable.
Number 2 is out for me. It is not acceptable for these entities to be overwritten.
Number 1 I think is acceptable, but the wording is scary enough that I want to make sure. If I allocate 5 ids, from 100 to 104, and I get CONTENTION back, this seems to indicate that the entities I copy MAY be overwritten with new entities with automatic ids in the future. BUT, if I hurry up and write my own entities with ids manually set to 100, 101, 102, 103, and 104, I will be safe and new entities with automatic ids will NOT receive these ids.
I'm worried because I don't understand how this would work. I don't think of the id allocator as paying attention to what gets written.
TL;DR
Imagine the following scenario:
allocateIdRange(100, 104); // returns CONTENTION
putEntityWithManualId(100);
putEntityWithManualId(101);
putEntityWithManualId(102);
putEntityWithManualId(103);
putEntityWithManualId(104);
// all puts succeed
Now, when I later call
putNewEntityWithAutomaticId();
is there any risk that the automatic id will be 100, 101, 102, 103, or 104?

The relevant documentation says:
The datastore's automatic ID allocator will not assign a key to a new entity that will overwrite an existing entity, so once the range is populated there will no longer be any contention.
Thus, you don't need to worry that your newly copied entities will be overwritten.
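For illustration, a minimal Python sketch of the copy flow described above, using the db API's allocate_id_range (the question refers to the Java API; the CopiedKind model and payload field are made-up names, and the exact constants/signature should be checked against your SDK version):

from google.appengine.ext import db

class CopiedKind(db.Model):
    payload = db.TextProperty()

def copy_with_manual_ids(source_entities):
    # Reserve the ids we are about to assign manually, so the automatic
    # allocator is aware of the range.
    ids = [e.key().id() for e in source_entities]
    template_key = db.Key.from_path('CopiedKind', 1)
    state = db.allocate_id_range(template_key, min(ids), max(ids))
    if state == db.KEY_RANGE_COLLISION:
        raise ValueError('range already contains entities')
    # KEY_RANGE_CONTENTION: safe as long as we write these manually keyed
    # entities before other requests insert into this kind/parent.
    for e in source_entities:
        key = db.Key.from_path('CopiedKind', e.key().id())
        CopiedKind(key=key, payload=e.payload).put()
    # Once the entities exist, the automatic allocator will not hand out
    # their ids to new entities.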

Related

cross-group transaction need to be explicitly specified exception on get_or_insert

We're loading a list of emails from file while putting a large number of datastore entities concurrently for each email address and getting occasional errors in the form:
BadRequestError: cross-group transaction need to be explicitly specified, see TransactionOptions.Builder.withXG
The failing Python method call:
EmailObj.get_or_insert(key_name=email, source=source, reason=reason)
where email is the address string and source and reason are StringProperties.
Question: how can the get_or_insert call start a transaction for one simple datastore model (2 string properties) and get entities involved of different groups? I expect the method above should either read the existing object matching the given key or store the new entity.
Note: I don't know the exact internal implementation, this is just a theory...
Since you didn't specify a parent, there is no entity group that could be "locked"/established from the beginning as the group to which the transaction would be limited.
The get_or_insert operation would normally be a check if the entity with that keyname exists and, if not, then create a new entity. Without a parent, this new entity would be in its own group, let's call it new_group.
Now to complete the transaction automatically associated with get_or_insert a conflict check would need to be done.
The conflict would mean an entity with the same keyname was created by one of the concurrent tasks (after our check that such entity exists but before our transaction end), which entity would also have its own, different group, let's call it other_group.
This very conflict check, only in the case in which the conflict is real, would effectively access both new_group and other_group, causing the exception.
By specifying a parent the issue doesn't exist, since both new_group and other_group would actually be the same group - the parent's group. The get_or_insert transaction would be restricted to this group.
The theory could be verified, I think, even in production (the errors are actually harmless if the theory is correct - the addresses are, after all, duplicates).
The first step would be to confirm that the occurrences are related to the same email address being inserted twice in a very short time - concurrently.
Wrap the call in try/except, catch the exception and log the corresponding email address, then check it against the input files - they should be duplicates and I imagine located quite close to each other in the file.
Then you could force that condition - intentionally inserting duplicate addresses every 5-10 emails (or some other rate you consider relevant).
Depending on how the app is actually implemented you could insert the duplicates in the input files, or, if using tasks, create/enqueue duplicate tasks, or simply call get_or_insert twice (not sure if the last one will be effective enough as it may happen on the same thread/instance). In any case, log or keep track of the injected duplicate addresses.
The occurrence rate of the exception should now increase, potentially proportionally with the forced duplicate insertion rate. Most if not all of the corresponding logged addresses should match the tracked duplicates being injected.
If you get that far, the duplicate entries being the trigger is confirmed. What would be left to confirm is the different entity groups for the same email address - I can't think of a way to check that exactly, but I also can't think of any other group related to the duplicate email addresses.
Anyway, just add xg=True to the get_or_insert calls' context options and, if my theory is correct, the errors should disappear :)
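For illustration, a rough ndb sketch of both workarounds: an explicit cross-group transaction around the check-then-insert, and the parent-based variant that keeps everything in one entity group (model and property names follow the question; treat this as a sketch under those assumptions, not the definitive fix):

from google.appengine.ext import ndb

class EmailObj(ndb.Model):
    source = ndb.StringProperty()
    reason = ndb.StringProperty()

def get_or_insert_xg(email, source, reason):
    # Explicit XG transaction around the check-then-insert, so a conflict
    # that touches a second entity group no longer raises BadRequestError.
    def txn():
        obj = EmailObj.get_by_id(email)
        if obj is None:
            obj = EmailObj(id=email, source=source, reason=reason)
            obj.put()
        return obj
    return ndb.transaction(txn, xg=True)

# Alternative: give every EmailObj the same parent so the implicit
# get_or_insert transaction is confined to a single entity group.
ROOT = ndb.Key('EmailRoot', 'root')

def get_or_insert_same_group(email, source, reason):
    return EmailObj.get_or_insert(email, parent=ROOT,
                                  source=source, reason=reason)

Note that the single-parent variant serializes writes through one entity group, which may throttle a bulk load.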

Google App Engine / NDB - Strongly Consistent Read of Entity List after Put

Using Google App Engine's NDB datastore, how do I ensure a strongly consistent read of a list of entities after creating a new entity?
The example use case is that I have entities of the Employee kind.
Create a new employee entity
Immediately load a list of employees (including the one that was added)
I understand that the approach below will yield an eventually consistent read of the list of employees which may or may not contain the new employee. This leads to a bad experience in the case of the latter.
e = Employee(...)
e.put()
Employee.query().fetch(...)
Now here are a few options I've thought about:
IMPORTANT QUALIFIERS
I only care about a consistent list read for the user who added the new employee. I don't care if other users get an eventually consistent read.
Let's assume I do not want to put all the employees under an Ancestor to enable a strongly consistent ancestor query. In the case of thousands and thousands of employee entities, the 5 writes / second limitation is not worth it.
Let's also assume that I want the write and the list read to be the result of two separate HTTP requests. I could theoretically put both write and read into a single transaction (?) but then that would be a very non-RESTful API endpoint.
Option 1
Create a new employee entity in the datastore
Additionally, write the new employee object to memcache, local browser cookie, local mobile storage.
Query datastore for list of employees (eventually consistent)
If new employee entity is not in this list, add it to the list (in my application code) from memcache / local memory
Render results to user. If user selects the new employee entity, retrieve the entity using key.get() (strongly consistent).
Option 2
Create a new employee entity using a transaction
Query datastore for list of employees in a transaction
I'm not sure Option #2 actually works.
Technically, does the previous write transaction get written to all the servers before the read transaction of that entity occurs? Or is this not correct behavior?
Transactions (including XG) have a limit on number of entity groups and a list of employees (each is its own entity group) could exceed this limit.
What are the downsides of read-only transactions vs. normal reads?
Thoughts? Option #1 seems like it would work, but it seems like a lot of work to ensure consistency on a follow-on read.
If you don't use an entity group, you can do a keys-only query followed by a get_multi(keys) lookup for entity consistency. For the new employee you have to add the new key to the key list passed to get_multi.
Docs: A combination of the keys-only, global query with a lookup method will read the latest entity values. But it should be noted that a keys-only global query can not exclude the possibility of an index not yet being consistent at the time of the query, which may result in an entity not being retrieved at all. The result of the query could potentially be generated based on filtering out old index values. In summary, a developer may use a keys-only global query followed by lookup by key only when an application requirement allows the index value not yet being consistent at the time of a query.
More info and magic here: Balancing Strong and Eventual Consistency with Google Cloud Datastore
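A minimal ndb sketch of that keys-only-plus-lookup pattern, assuming the Employee model from the question and that new_key is the key returned by the put in the earlier request:

from google.appengine.ext import ndb

def list_employees(new_key=None):
    # Keys-only global query: cheap, but the index may lag recent writes.
    keys = Employee.query().fetch(keys_only=True)
    # Make sure the just-created employee is looked up even if the index
    # has not caught up yet.
    if new_key is not None and new_key not in keys:
        keys.append(new_key)
    # Lookup by key returns the latest entity values.
    return [e for e in ndb.get_multi(keys) if e is not None]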
I had the same problem, option #2 doesn't really work: a read using the key will work, but a query might still miss the new employee.
Option #1 could work, but only within the same request. The saved memcache key can disappear at any time, so a subsequent query on the same instance - or one on another instance, potentially running on another piece of hardware - could still miss the new employee.
The only "solution" that comes to mind for consistent query results is to actually not attempt to force the new employee into the results and rather leave things flow naturally until it does. I'd just add a warning that creating the new user will take "a while". If tolerable maybe keep polling/querying in the original request until it shows up? - that would be the only place where the employee creation event is known with certainty.
This question is old as I write this. However, it is a good question and will be relevant long term.
Option #2 from the original question will not work.
If the entity creation and the subsequent query are truly independent, with no context linking them, then you are really just stuck - or you don't care. The trick is that there is almost always some relationship or some use case that must be covered. In other words if the query is truly some kind of, essentially, ad hoc query, then you really don't care. In that case, you just quote CAP theorem and remind the client executing the query how great it is that this system scales. However, almost always, if you are worried about the eventual consistency, there is some use case or set of cases that must be handled. For example, if you have a high score list, the highest score must be at the top of the list. The highest score may have just been achieved by the user who is now looking at the list. Another example might be that when an employee is created, that employee must be on the "new employees" list.
So what you usually do is exploit these known cases to balance the throughput needed with consistency. For example, for the high score example, you may be able to afford to keep a secondary index (an entity) that is the list of the high scores. You always get it by key and you can write to it as frequently as needed because high scores are not generated that often presumably. For the new employee example, you might use an approach that you started to suggest by storing the timestamp of the last employee in memcache. Then when you query, you check to make sure your list includes that employee ... or something along those lines.
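As an illustration of the secondary-index idea (all names invented): a single board entity, always read by key and updated in a transaction, holds the denormalized top-N list:

from google.appengine.ext import ndb

class HighScoreBoard(ndb.Model):
    scores = ndb.JsonProperty()  # list of [score, player] pairs

BOARD_KEY = ndb.Key(HighScoreBoard, 'global')

@ndb.transactional
def record_score(player, score, top_n=10):
    board = BOARD_KEY.get() or HighScoreBoard(key=BOARD_KEY, scores=[])
    board.scores = sorted(board.scores + [[score, player]], reverse=True)[:top_n]
    board.put()

def top_scores():
    # Get by key, so the read is strongly consistent.
    board = BOARD_KEY.get()
    return board.scores if board else []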
The price in balancing write throughput and consistency on App Engine and similar systems is always the same. It requires increased model complexity / code complexity to bridge the business needs.

Creating Fixed Width ID Based On Serial Number in Python NDB Datastore

I have a model named UserModel and I know that it will never grow beyond 10000 entities. I don't have anything unique in the UserModel which I can use for creating a key. Hence I decided to have string keys which are of this format USRXXXXX.
Where XXXXX represents the serial count, e.g. USR00001, USR12345.
Hence I chose the following way to generate the IDs:
def generate_unique_id():
    qry = UserModel.query()
    num = qry.count() + 1
    id = 'USR' + '%0.5d' % num
    return id

def create_entity(model, id, **kwargs):
    ent = model.get_or_insert(id, **kwargs)
    # check if it's the newly created record or an existing one
    if ent.key.id() != id:
        raise InsertError('failed to add new user, please retry the operation')
    return True
Questions:
1. Is this the best way of achieving a serial count of fixed width? Is this solution optimal and idiomatic?
2. Does using get_or_insert like above guarantee that I will never have duplicate records?
3. Will it increase my billing? For counting the number of records I am doing UserModel.query() without any filters, so in a way I am fetching all the records. Or does billing not come into the picture until I use the fetch API on the qry object?
Since you only need a unique key for the UserModel entities, I don't quite understand why you need to create the key manually. The IDs that are generated automatically by App Engine are guaranteed to be unique.
Regarding your questions, we have the following:
1. I think not. Maybe you should first allocate IDs (check the section Using Numeric Key IDs), order them, and use them.
2. Even though get_or_insert is strongly consistent, the query you perform (qry = UserModel.query()) is not. Thus, you may end up overwriting existing entities. For more information about eventual consistency, take a look here.
3. No, it will not increase your billing. When you execute Model.query().count(), the datastore under the hood executes a Model.query().fetch(keys_only=True) and counts the number of results. Keys-only queries generate small datastore operations, which based on the latest pricing changes by Google are not billable.
Another answer to the same three questions:
1. Probably not. You might get away with what you are trying to do if your UserModel entities have ancestors for stronger consistency.
2. No, get_or_insert does not guarantee that you won't have duplicates. Although you are unlikely to have duplicates in this case, you are more likely to lose data. Say you are inserting two entities with no ancestors - Model.query().count() might take some time to reflect the creation of the first entity, causing the second entity to have the same ID as the first one and thus overwriting it (i.e. you end up with the 2nd entity only, which has the ID of the first one).
3. Model.query().count() is short for len(Model.query().fetch()) (although with some optimizations), so every time you generate an ID you fetch all entities.
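A sketch of the allocate-IDs suggestion from the first answer, assuming ndb. Note that allocated IDs are guaranteed unique but not dense or strictly sequential, so the five-digit formatting is only illustrative:

def generate_unique_id():
    # Reserve one numeric ID from the datastore allocator; it will never
    # be handed out again for this kind, so there is no count-based race.
    first, last = UserModel.allocate_ids(1)
    return 'USR%05d' % first

If the numbers must be densely serial, a transactionally incremented counter entity would do it, at the cost of serializing every insert through that one entity group.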

Data Modeling - modeling an Append-only list in NDB

I'm trying to make a general purpose data structure. Essentially, it will be an append-only list of updates that clients can subscribe to. Clients can also send updates.
I'm curious for suggestions on how to implement this. I could have an ndb.Model, 'Update', that contains the data and an index, or I could use a StructuredProperty with repeated=True on the main entity. I could also just store a list of keys somehow and keep the actual update data in a not-strongly-linked structure.
I'm not sure how repeated properties work - does appending to the list of them (via the Python API) have to rewrite them all?
I'm also worried about consistency. Since multiple clients might be sending updates, I don't want them to overwrite each other and lose an update, or somehow end up with two updates with the same index.
The problem is that there's a maximum total size for each entity in the datastore (1 MB).
So any single entity that accumulates updates (storing the data directly or via collecting keys) will eventually run out of space (I'm not sure how the limit applies with regard to structured properties, however).
Why not have a model "Update", as you say? A simple version would be to have each provided update create and save a new entity. If you track the save date as a field in the model you can sort the updates by time when you query for them (presumably there is an upper limit at some level anyway).
Also that way you don't have to worry about simultaneous client updates overwriting each other; the datastore will worry about that for you. And you don't need to worry about what "index" they've been assigned - it's done automatically.
As that might be costly in datastore reads, I'm sure you could implement a version that uses repeated properties in a single entity, moving to a new entity after N updates are stored, but then you'd have to wrap it in a transaction to be sure multiple updates don't clash, and so on.
You can also cache the query generating the results and invalidate it only when a new update is saved. Look at NDB also as it provides some automatic caching (not for a query however).
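A minimal sketch of the one-entity-per-update approach (model, property, and function names are invented):

from google.appengine.ext import ndb

class Update(ndb.Model):
    stream = ndb.StringProperty()    # which append-only list this belongs to
    payload = ndb.JsonProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

def append_update(stream, payload):
    # One entity per update: concurrent clients can never overwrite
    # each other, and no manual index needs to be maintained.
    return Update(stream=stream, payload=payload).put()

def read_updates(stream, cursor=None, page_size=50):
    # Time-ordered, paged read of the list (requires a composite index
    # on stream + created).
    q = Update.query(Update.stream == stream).order(Update.created)
    return q.fetch_page(page_size, start_cursor=cursor)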

Check if numeric is reserved in App Engine Datastore

I would like to implement deleted-key detection for App Engine, if possible without storing any extra entities/markers upon deletion, so I can return a 404 or 410 response accordingly.
AFAIK numeric IDs for new entity keys are assigned without any particular (or at least simple) order, but they are of course reserved/allocated and never implicitly reused for new entities.
So is there a way to check if a particular key was previously allocated, but entity stored under this key was since deleted?
I do not care if a key was manually allocated and never used to store any data, I'll treat it as deleted.
No, there's no way to determine if a key has already been allocated.
You mention that you'll treat allocated but unused keys as deleted, but note that this will result in returning the wrong status code in these cases - including in the potential situation where a key is allocated and later used: you'll mistakenly report it as deleted until it's first used.
