I would like to implement deleted-key detection for App Engine, if possible without storing any extra entities/markers upon deletion, so I can return a 404 or 410 response accordingly.
AFAIK new entity keys' numeric IDs are assigned in no particular order (at least not a simple one), but they are of course reserved/allocated and never implicitly reused for new entities.
So is there a way to check if a particular key was previously allocated, but entity stored under this key was since deleted?
I do not care if a key was manually allocated and never used to store any data; I'll treat it as deleted.
No, there's no way to determine if a key has already been allocated.
You mention that you'll treat allocated but unused keys as deleted, but note that this will result in returning the wrong status code in these cases - including in the potential situation where a key is allocated and later used: you'll mistakenly report it as deleted until it's first used.
Related
Assuming the IDs have not been used by calling put() for an entity, how long do the allocated IDs stick around? Are they ever put back into use by the datastore, or are they allocated forever?
The documentation says,
These keys are guaranteed not to have been returned previously by the
data store's internal ID generator, nor will they be returned by
future calls to the internal ID generator.
I'll go out on a limb and say that the use of 'guarantee' and 'future' above means forever.
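For illustration, here's a minimal Python sketch of explicitly allocating IDs (assuming the db.allocate_ids helper and a made-up MyModel kind):

from google.appengine.ext import db

class MyModel(db.Model):
    pass

# Reserve 10 ids for the MyModel kind. Per the guarantee quoted above, the
# returned range will never be handed out again by the automatic id allocator.
first, last = db.allocate_ids(db.Key.from_path('MyModel', 1), 10)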
We're loading a list of emails from a file while concurrently putting a large number of datastore entities, one per email address, and we're getting occasional errors of the form:
BadRequestError: cross-group transaction need to be explicitly specified, see TransactionOptions.Builder.withXG
The failing Python method call:
EmailObj.get_or_insert(key_name=email, source=source, reason=reason)
where email is the address string and source and reason are StringProperties.
Question: how can the get_or_insert call start a transaction for one simple datastore model (2 string properties) and end up involving entities from different entity groups? I expect the method above to either read the existing object matching the given key or store the new entity.
Note: I don't know the exact internal implementation, this is just a theory...
Since you didn't specify a parent, there is no entity group that could be "locked"/established from the beginning as the group to which the transaction would be limited.
The get_or_insert operation would normally check whether an entity with that key name exists and, if not, create a new entity. Without a parent, this new entity would be in its own group; let's call it new_group.
Now, to complete the transaction automatically associated with get_or_insert, a conflict check needs to be done.
A conflict would mean an entity with the same key name was created by one of the concurrent tasks (after our check that such an entity exists, but before our transaction ends); that entity would also have its own, different group, let's call it other_group.
This conflict check, in the case where the conflict is real, would effectively access both new_group and other_group, causing the exception.
By specifying a parent the issue doesn't exist, since both new_group and other_group would actually be the same group: the parent's group. The get_or_insert transaction would be restricted to this group.
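As a rough sketch of this parent approach (using the db API; the EmailRoot parent key is just an illustrative choice):

from google.appengine.ext import db

class EmailObj(db.Model):
    source = db.StringProperty()
    reason = db.StringProperty()

# A single fixed parent puts every EmailObj in one entity group, so the
# transaction behind get_or_insert stays within that group.
EMAIL_PARENT_KEY = db.Key.from_path('EmailRoot', 'emails')

def record_email(email, source, reason):
    return EmailObj.get_or_insert(key_name=email, parent=EMAIL_PARENT_KEY,
                                  source=source, reason=reason)

Keep in mind that funnelling all inserts into one entity group limits write throughput on that group, so this fits best when inserts aren't extremely frequent.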
The theory could be verified, I think, even in production (the errors are actually harmless if the theory is correct - the addresses are, after all, duplicates).
The first step would be to confirm that the occurrences are related to the same email address being inserted twice in a very short time, i.e. concurrently.
Wrap the call in try/except, log the corresponding email address, then check it against the input files: the addresses should be duplicates, and I imagine they're located quite close to each other in the file.
Then you could force that condition - intentionally inserting duplicate addresses every 5-10 emails (or some other rate you consider relevant).
Depending on how the app is actually implemented you could insert the duplicates in the input files, or, if using tasks, create/enqueue duplicate tasks, or simply call get_or_insert twice (not sure if the last one will be effective enough, as it may happen on the same thread/instance). Finally, log or keep track of the injected duplicate addresses.
The occurrence rate of the exception should now increase, potentially in proportion to the forced duplicate insertion rate. Most if not all of the corresponding logged addresses should match the tracked duplicates being injected.
If you get that far, the duplicate entries being the trigger is confirmed. What would be left to confirm is the different entity groups for the same email address; I can't think of a way to check that exactly, but I also can't think of any other group that would be related to the duplicate email addresses.
Anyway, just add xg=True in the get_or_insert calls' context options and, if my theory is correct, the errors should disappear :)
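If your SDK's get_or_insert doesn't accept such options directly, here's one way to run the same check-then-insert logic with cross-group transactions enabled (a sketch using the db transaction-options API; the helper name is mine):

from google.appengine.ext import db

def get_or_insert_xg(model_class, key_name, **kwds):
    # Same check-then-insert logic as get_or_insert, but run with
    # cross-group (XG) transactions enabled.
    def txn():
        entity = model_class.get_by_key_name(key_name)
        if entity is None:
            entity = model_class(key_name=key_name, **kwds)
            entity.put()
        return entity
    return db.run_in_transaction_options(
        db.create_transaction_options(xg=True), txn)

# e.g. get_or_insert_xg(EmailObj, email, source=source, reason=reason)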
My current understanding of Google AppEngine's High Replication DataStore is the following:
Gets and puts of individual entities are always strongly consistent, i.e. once a put of this entry completes, no later get will ever return a version earlier than the completed put. Or, more precisely, as soon as any one get returns the new version, no later get will ever return the old version again.
Gets and puts of multiple entities are strongly consistent, if they belong to the same ancestor group and are performed in a transaction, i.e. if I have two entities that are both being modified in a transaction by a put and "simultaneously" read in a different transaction with a get, the get will either return the old version of both entries or the new version of both entries, depending on whether the put-transaction has completed at the time of the get or not, but it will never return the old value of one entity and the new value of the other.
Queries with an ancestor filter can be chosen to be strongly or eventually consistent, where a strongly consistent query takes longer to complete, but will always return the "same" version (old or new) of all entities updated in the same transaction in this ancestor group and never some old and some new versions.
Queries that span ancestors are always eventually consistent, i.e. might return an old version of one result entity and a new version of another.
Did I get this right? Is this actually documented anywhere? (I only found some documentation about the query consistency here (between the first and second "Note") and here, but it doesn't talk about gets and puts...)
Yes, you're correct. They just word it slightly differently:
https://developers.google.com/appengine/docs/java/datastore/
Right at the beginning there are five point-form features. The last two describe your question, except that they refer to "reads" instead of "gets".
This probably adds to your confusion, but when they say "read" or "get", it really means fetching an entity directly, by key or id. If you call the Python 'get' function with an attribute other than the key or id, it's actually issuing a query, which is eventually consistent (unless it's an ancestor query).
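To make the distinction concrete, here's a small sketch using the Python db API (the Task model and ancestor key are made up for illustration):

from google.appengine.ext import db

class Task(db.Model):
    owner = db.StringProperty()

list_key = db.Key.from_path('TaskList', 'default')  # illustrative ancestor

# Fetch by key: a "get", strongly consistent.
task = Task.get_by_key_name('task-1', parent=list_key)

# Filter on a property with no ancestor: really a query, eventually consistent.
maybe_stale = Task.all().filter('owner =', 'alice').fetch(20)

# Ancestor query: strongly consistent within that entity group.
fresh = Task.all().ancestor(list_key).fetch(20)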
I'm copying entities from one kind to another, and want to map their long ids in a predictable way. After the mapping is over, I want auto-generation of ids to kick in.
To protect the entities I copy, I want to use allocateIdRange and manually allocate each id as I copy it. My hope is that this will cause the datastore to protect these new ids, and only assign other ids to new entities created after the copy.
One return code has me worried: CONTENTION
Indicates the given KeyRange is empty but the datastore's automatic ID allocator may assign new entities keys in this range. However it is safe to manually assign Keys in this range if either of the following is true:
1. No other request will insert entities with the same kind and parent as the given KeyRange until all entities with manually assigned keys from this range have been written.
2. Overwriting entities written by other requests with the same kind and parent as the given KeyRange is acceptable.
Number 2 is out for me. It is not acceptable for these entities to be overwritten.
Number 1 I think is acceptable, but the wording is scary enough that I want to make sure. If I allocate 5 ids, from 100 to 104, and I get CONTENTION back, this seems to indicate that the entities I copy MAY be overwritten with new entities with automatic ids in the future. BUT, if I hurry up and write my own entities with ids manually set to 100, 101, 102, 103, and 104, I will be safe and new entities with automatic ids will NOT receive these ids.
I'm worried because I don't understand how this would work. I don't think of the id allocator as paying attention to what gets written.
TL;DR
Imagine the following scenario:
allocateIdRange(100, 104); // returns CONTENTION
putEntityWithManualId(100);
putEntityWithManualId(101);
putEntityWithManualId(102);
putEntityWithManualId(103);
putEntityWithManualId(104);
// all puts succeed
Now, when I later call
putNewEntityWithAutomaticId();
is there any risk that the automatic id will be 100, 101, 102, 103, or 104?
The documentation says:
The datastore's automatic ID allocator will not assign a key to a new entity that will overwrite an existing entity, so once the range is populated there will no longer be any contention.
Thus, you don't need to worry that your newly copied entities will be overwritten.
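For reference, a rough Python-side sketch of that flow (assuming the db.allocate_id_range helper, which mirrors Java's allocateIdRange, and a made-up Product model):

from google.appengine.ext import db

class Product(db.Model):
    data = db.TextProperty()

# Reserve ids 100-104 for the copied entities; the return value reports
# whether the range is empty, contended, or collides with existing entities.
state = db.allocate_id_range(db.Key.from_path('Product', 1), 100, 104)

# Write the copied entities under the manually chosen ids. Once they are
# stored, the automatic allocator will not assign those ids to new entities.
for old_id in range(100, 105):
    db.put(Product(key=db.Key.from_path('Product', old_id), data='copied'))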
I am trying out Google App Engine Java, however the absence of a unique constraint is making things difficult.
I have been through this post, and this blog suggests a method to implement something similar. My background is in MySQL. Moving to the datastore without a unique constraint makes me jittery, because I never had to worry about duplicate values before, and checking each value before inserting a new one still leaves room for error.
"No, you still cannot specify unique
during schema creation."
-- David Underhill talks about GAE and the unique constraint (post link)
What are you guys using to implement something similar to a unique or primary key?
I heard about an abstract datastore layer created using the low-level API that worked like a regular RDB; it was not free, however (I don't remember the name of the software).
Schematic view of my problem
1. sNo = biggest serial_number in the db
2. sNo++
3. Insert new entry with sNo as serial_number value // checkpoint
4. User adds data pertaining to current serial_number
5. Update entry with data where serial_number is sNo
However, at line number 3 (checkpoint), I feel two users might add the same sNo, and that is what is preventing me from working with App Engine.
This and other similar questions come up often when talking about transitioning from a traditional RDB to a BigTable-like datastore like App Engine's.
It's often useful to discuss why the datastore doesn't support unique keys, since it informs the mindset you should be in when thinking about your data storage schemes. The reason unique constraints are not available is that they greatly limit scalability. Like you've said, enforcing the constraint means checking all other entities for that property. Whether you do it manually in your code or the datastore does it automatically behind the scenes, it still needs to happen, and that means lower performance. Some optimizations can be made, but it still needs to happen in one way or another.
The answer to your question is, really think about why you need that unique constraint.
Secondly, remember that keys do exist in the datastore, and are a great way of enforcing a simple unique constraint.
my_user = MyUser(key_name=users.get_current_user().email())
my_user.put()
This will guarantee that no MyUser will ever be created with that email ever again, and you can also quickly retrieve the MyUser with that email:
my_user = MyUser.get_by_key_name(users.get_current_user().email())
In the python runtime you can also do:
my_user = MyUser.get_or_insert(key_name=users.get_current_user().email())
which will retrieve the user with that email, or insert it if it doesn't exist yet.
Anything more complex than that will not be scalable though. So really think about whether you need that property to be globally unique, or if there are ways you can remove the need for that unique constraint. Often you'll find that, with some small workarounds, you didn't need that property to be unique after all.
You can generate unique serial numbers for your products without needing to enforce unique IDs or querying the entire set of entities to find out what the largest serial number currently is. You can use transactions and a singleton entity to generate the 'next' serial number. Because the operation occurs inside a transaction, you can be sure that no two products will ever get the same serial number.
This approach will, however, be a potential performance chokepoint and limit your application's scalability. If it is the case that the creation of new serial numbers does not happen so often that you get contention, it may work for you.
EDIT:
To clarify, the singleton that holds the current (or next) serial number to be assigned is completely independent of any entities that actually have serial numbers assigned to them. They do not all need to be part of one entity group. You could have entities from multiple models using the same mechanism to get a new, unique serial number.
I don't remember Java well enough to provide sample code, and my Python example might be meaningless to you, but here's pseudo-code to illustrate the idea:
1. Receive request to create a new inventory item.
2. Enter transaction.
3. Retrieve current value of the single entity of the SerialNumber model.
4. Increment value and write it to the database.
5. Return value as you exit transaction.
Now, the code that does all the work of actually creating the inventory item and storing it along with its new serial number DOES NOT need to run in a transaction.
Caveat: as I stated above, this could be a major performance bottleneck, as only one serial number can be created at any one time. However, it does provide you with the certainty that the serial number that you just generated is unique and not in-use.
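Since the Python version was left out above, here's a minimal sketch of that pseudo-code (the SerialNumber model and its key name are assumptions):

from google.appengine.ext import db

class SerialNumber(db.Model):
    # Singleton entity holding the next serial number to hand out.
    next_value = db.IntegerProperty(default=1)

def next_serial_number():
    # Runs in a transaction so two concurrent requests can never
    # receive the same number.
    def txn():
        counter = SerialNumber.get_by_key_name('counter')
        if counter is None:
            counter = SerialNumber(key_name='counter')
        value = counter.next_value
        counter.next_value = value + 1
        counter.put()
        return value
    return db.run_in_transaction(txn)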
I encountered this same issue in an application where users needed to reserve a timeslot. I needed to "insert" exactly one unique timeslot entity while expecting users to simultaneously request the same timeslot.
I have isolated an example of how to do this on app engine, and I blogged about it. The blog posting has canonical code examples using Datastore, and also Objectify. (BTW, I would advise to avoid JDO.)
I have also deployed a live demonstration where you can advance two users toward reserving the same resource. In this demo you can experience the exact behavior of app engine datastore click by click.
If you are looking for the behavior of a unique constraint, these should prove useful.
-broc
I first thought an alternative to the transaction technique in broc's blog could be to make a singleton class which contains a synchronized method (say addUserName(String name)) responsible for adding a new entry only if it is unique, or throwing an exception otherwise. Then make a ServletContextListener which instantiates a single instance of this singleton, adding it as an attribute to the ServletContext. Servlets can then call the addUserName() method on the singleton instance, which they obtain through getServletContext().
However, this is NOT a good idea, because GAE is likely to split the app across multiple JVMs, so multiple instances of the singleton class could still occur, one in each JVM. See this thread.
A more GAE-like alternative would be to write a GAE module responsible for checking uniqueness and adding new entries, then use manual or basic scaling with...
<max-instances>1</max-instances>
Then you have a single instance running on GAE which acts as a single point of authority, adding users one at a time to the datastore. If you are concerned about this instance being a bottleneck you could improve the module, adding queuing or an internal master/slave architecture.
This module-based solution would allow many unique usernames to be added to the datastore in a short space of time, without risking entity group contention issues.