AppEngine, DataStore: Preallocating normally-distributed IDs (*not* monotonically incrementing) - google-app-engine

There are three schemes to set IDs on datastore entities:
Provide your own string or int64 ID.
Don't provide them and let AE allocate int64 IDs for you.
Pre-allocate a block of int64 IDs.
The documentation has this to say about ID generation:
This (1):
Cloud Datastore can be configured to generate auto IDs using two
different auto id policies:
The default policy generates a random sequence of unused IDs that are approximately uniformly distributed. Each ID can be up to 16
decimal digits long.
The legacy policy creates a sequence of non-consecutive smaller integer IDs.
If you want to display the entity IDs to the user, and/or depend upon
their order, the best thing to do is use manual allocation.
and this (2):
Note: Instead of using key name strings or generating numeric IDs
automatically, advanced applications may sometimes wish to assign
their own numeric IDs manually to the entities they create. Be aware,
however, that there is nothing to prevent Cloud Datastore from
assigning one of your manual numeric IDs to another entity. The only
way to avoid such conflicts is to have your application obtain a block
of IDs with the datastore.AllocateIDs function. Cloud Datastore's
automatic ID generator will keep track of IDs that have been allocated
with this function and will avoid reusing them for another entity, so
you can safely use such IDs without conflict.
and this (3):
Cloud Datastore generates a random sequence of unused IDs that are
approximately uniformly distributed. Each ID can be up to 16 decimal
digits long.
System-allocated ID values are guaranteed unique to the entity group.
If you copy an entity from one entity group or namespace to another
and wish to preserve the ID part of the key, be sure to allocate the
ID first to prevent Cloud Datastore from selecting that ID for a
future assignment.
I have a particular entity-type that is stored with an ancestor. However, I'd like to have globally-unique IDs and AE's IDs (allocated via datastore.AllocateIDs with Go) will not be globally unique when stored under an ancestor (in an entity-group). So, pre-allocation would solve this (they're ancestor-agnostic). However, you are obviously given an interval in response... a continuous range of IDs that have been reserved.
Isn't there some way to preallocate those nice, opaque, uniformly distributed IDs?
While we're on the subject, I had assumed that the opaque IDs from AE were the result of some pseudorandom number generator with a persisted state for each entity type, but the word "track" in (2) seems to imply that there is a cost to optimistically generating and buffering IDs that might not be used. It'd be great if someone could clarify this.

The simple solution is to do the following:
When trying to allocate a new ID for an entity:
Repeat the following:
Generate a random K bit integer. Use it for the entity ID field. [Use a uniform random distribution].
Create a Cloud Datastore transaction.
Insert the new entity. [If the transaction aborts because the entity already exists try again with a new random number].
If you make K big enough (for example 128) and have a properly seeded random number generator, a collision becomes so improbable that you can treat it as impossible in practice and remove the retry loop.
If you make K that big, the number no longer fits in the int64 ID field of the entity key, so use the string field instead: Base64 URL-encode the random number and store it as a string.

Related

Google App Engine (datastore) - will a deleted key regenerate?

I've got a simple question about datastore keys. If I delete an entity, is there any possibility that the same key will be created again? Or is each key unique and generated only once?
Thanks.
It is definitely possible to re-use keys.
Easy to test, for example using the datastore admin page:
create an entity for one of your entity models using a custom/specified key name and some property values
delete the entity
create another one using the same key name and different property values...
As for the keys with auto-generated IDs it is theoretically possible, but I guess rather unlikely due to the high number of possibilities. From Assigning identifiers:
Cloud Datastore can be configured to generate auto IDs using two
different auto id policies:
The default policy generates a random sequence of unused IDs that are approximately uniformly distributed. Each ID can be up to 16
decimal digits long.
The legacy policy creates a sequence of non-consecutive smaller integer IDs.

Managing IDs in an asynchronous world

When a new record can be created at potentially any number of locations (i.e. different mobile devices), how do you guarantee that record a unique identity?
(In my SQL-steeped worldview, the default type of an ID is an int or long, though I gladly consider other possibilities.)
The solutions I've considered are
Assign each device a pile of IDs which is (hopefully) more than they will use between syncs, and replenish it when syncing.
Assign each newly created record a temporary ID (Guid) until it can be assigned a "real" ID by the System of Record.
Use Guids as IDs.
Block the creation process until the ID is provided by the System of Record (not preferred due to possible network interruption).
Use a primary value (e.g. Name) as an ID (also not preferred due to potential of primary value to change).
These are what I've come up with on my own, but since this is the type of problem that has certainly already been solved ten million times, what are the accepted solutions?
You could have a unique ID for each device (it could be set during the initial online registration), and each device would do its own numbering. The records themselves would use a composite primary key, (originDeviceId, recordId), which is then guaranteed to be unique across all devices. It has several other advantages too: there is no need to change the key when syncing with the server, and the key can be used to build relations on the offline remote device right from the start.
The main downside is that you need two columns to reference a record. A slightly hacky workaround would be to define those columns as ints and add a third, computed column as a big int built from the other two. Most RDBMSs have no left-shift operator, but multiplying by a power of two does the same job. Then you just build relations on that computed field, for example:
SELECT f.* FROM t_File AS f
JOIN t_User AS u ON f.UserId = u.Id
-- t_File.UserId is big int and t_User.Id is deviceId * POWER(2, 32) + recordId
-- (aliases f and u are used because FILE and USER are reserved words in some RDBMSs)
Another downside is that each device is limited to max-int records, which might or might not be enough in your case, but at least you have guaranteed uniqueness.
The last downside I see is the need for that initial registration, so that each device gets assigned a unique device ID.
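For illustration, the arithmetic behind that computed column might look like this in Python. The function names and the range limits are assumptions for the sketch; the device ID is capped at 31 bits here so the packed value stays inside a signed 64-bit big int.

```python
SHIFT = 2 ** 32  # multiplying by 2**32 replaces a 32-bit left shift


def pack_key(device_id, record_id):
    """Combine (device_id, record_id) into one 64-bit-safe integer."""
    if not 0 <= device_id < 2 ** 31:
        raise ValueError("device_id out of range")
    if not 0 <= record_id < 2 ** 32:
        raise ValueError("record_id out of range")
    return device_id * SHIFT + record_id


def unpack_key(packed):
    """Split a packed key back into (device_id, record_id)."""
    return divmod(packed, SHIFT)
```

For example, `pack_key(3, 7)` gives `12884901895`, and `unpack_key(12884901895)` gives back `(3, 7)`.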

Datastore why use key and id?

I had a question regarding why Google App Engine's Datastore uses a key and an ID. Coming from a relational database background I am comparing entities with rows, so why, when storing an entity, does it require a key (which is a long automatically generated string) and an ID (which can be manually or automatically entered)? This seems like a big waste of space to identify a record. Again, I am new to this type of database, so I may be missing something.
Key design is a critical part of efficient Datastore operations. The keys are what are stored in the built-in and custom indexes and when you are querying, you can ask to have only keys returned (in Python: keys_only=True). A keys-only query costs a fraction of a regular query, both in $$ and to a lesser extent in time, and has very low deserialization overhead.
So, if you have useful/interesting things stored in your key IDs, you can perform keys-only queries and get back lots of useful data in a hurry and very cheaply.
Note that this extends into parent keys and namespaces, which are all part of the key and therefore additional places you can "store" useful data and retrieve all of it with keys-only queries.
It's an important optimization to understand and a big part of our overall design.
Basically, the key is built from two pieces of information :
The entity type (in Objectify, it is the class of the object)
The id/name of the entity
So, for a given entity type, key and id are quite the same.
If you do not specify the ID yourself, then a random ID is generated and the key is created based on that random id.

Google App Engine - What's the recommended way to keep entity amount within limit?

I have some entities of a kind, and I need to keep their number within a limit by discarding old ones, much like log entry maintenance. Is there a good approach on GAE to do this?
Options in my mind:
Option 1. Add a Date property for each of these entities. Create cron job to check datastore statistics daily. If it exceeds the limit, query some entities of that kind and sort by date with oldest first. Delete them until the size is less than, for example, 0.9 * max_limit.
Option 2. Option 1 requires an additional indexed property. I observed that the entity key IDs seem to be increasing, so I'd like to query only keys, sort them in ascending order, and delete the ones with the smallest IDs. That requires no additional property (date) or index. But I'm seriously worried about whether the key ID is guaranteed to keep increasing.
I think this is a common data maintenance task. Is there any mature way to do it?
By the way, a tiny ad for my app, free and purely for coder's fun! http://robotypo.appspot.com
You cannot assume that the IDs are always increasing. The docs about ID gen only guarantee that:
IDs allocated in this manner will not be used by the Datastore's
automatic ID sequence generator and can be used in entity keys without
conflict.
The default sort order is also not guaranteed to be sorted by ID number:
If no sort orders are specified, the results are returned in the order
they are retrieved from the Datastore.
which is vague and doesn't say that the default order is by ID.
One solution may be to use a rotating counter that keeps track of the first element. When you want to add new entities: fetch the counter, increment it, mod it by the limit, and add a new element with an ID as the value of the counter. This must all be done in a transaction to guarantee that the counter isn't being incremented by another request. The new element with the same key will overwrite one that was there, if any.
When you want to fetch them all, you can manually generate the keys (since they are all known), do a bulk fetch, sort by ID, then split them into two parts at the value of the counter and swap those.
If you want the IDs to be unique, you could maintain a separate counter (using transactions to modify and read it so you're safe) and create entities with IDs as its value, then delete IDs when the limit is reached.
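A rough Python sketch of the rotating-counter idea described above, using a plain dict in place of the datastore. The class and method names are hypothetical, and in a real app the counter read, increment and put would all happen inside one transaction.

```python
class RotatingLog:
    """Keeps at most `limit` entities by reusing IDs 0..limit-1 in a ring."""

    def __init__(self, limit):
        self.limit = limit
        self.counter = -1    # ID of the most recently written slot
        self.entities = {}   # stand-in for the datastore: ID -> entity

    def add(self, entity):
        # Fetch the counter, increment it, mod it by the limit, and write
        # the entity under that ID, overwriting whatever was there before.
        self.counter = (self.counter + 1) % self.limit
        self.entities[self.counter] = entity

    def fetch_oldest_first(self):
        # All possible IDs are known, so "bulk fetch" them and sort by ID.
        ids = sorted(i for i in range(self.limit) if i in self.entities)
        # Split at the counter and swap the two parts: IDs above the
        # counter were written before the last wrap-around.
        older = [i for i in ids if i > self.counter]
        newer = [i for i in ids if i <= self.counter]
        return [self.entities[i] for i in older + newer]
```

With a limit of 3, adding `a`, `b`, `c`, `d`, `e` in that order leaves `c`, `d`, `e` stored, and `fetch_oldest_first()` returns them in that order.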
You can create a second entity (let's call it A) that keeps a list of the keys of the entities you want to limit, like this (pseudo-code):
class A:
List<Key> limitedEntities;
When you add a new entity, you add its key to the list in A. If the length of the list exceeds the limit, you take the first element of the list and then remove the corresponding entity.
Notice that when you add or delete an entity, you should modify the list of entity A in a transaction. Since these entities belong to different entity groups, you should consider using Cross-Group Transactions.
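Something along these lines, sketched in Python with in-memory structures only. `LimitedKind` and its attributes are hypothetical names; the deque plays the role of the key list in A, and a real implementation would wrap the body of `add` in a (cross-group) transaction.

```python
from collections import deque


class LimitedKind:
    """Caps the number of stored entities using a separate list of keys."""

    def __init__(self, limit):
        self.limit = limit
        self.limited_keys = deque()  # the key list held by entity A
        self.entities = {}           # stand-in for the limited entities

    def add(self, key, entity):
        # Store the entity and record its key at the end of the list.
        self.entities[key] = entity
        self.limited_keys.append(key)
        # Over the limit: drop the first (oldest) key and its entity.
        if len(self.limited_keys) > self.limit:
            oldest = self.limited_keys.popleft()
            del self.entities[oldest]
```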
Hope this helps!

Will the automatic id allocator return ids I set arbitrarily on other entities?

If I arbitrarily set the id of a new entity to 1000, without using allocateIds first, could a later entity be automatically given the id 1000 and overwrite my original entity?
In pseudocode, if I have:
Entity e1;
e1.setId(1000);
datastore.put(e1);
... later...
Entity e2;
datastore.put(e2);
Is there any chance e2 will automatically be given the id 1000 and overwrite e1?
As explained in the Java API documentation
Creating an entity for the purpose of insertion (as opposed to update)
with this constructor is discouraged unless the id was obtained from a
key returned by a KeyRange obtained from
AsyncDatastoreService.allocateIds(String, long) or
DatastoreService.allocateIds(String, long) for the same kind.
See also the Python documentation:
Be aware, however, that the datastore is not guaranteed to avoid app-assigned IDs. It is possible, though unlikely, that the datastore will assign a numeric ID that will cause a conflict with an entity with an app-assigned ID. The only way to avoid this problem is to have your app use allocate_ids() to obtain a batch of IDs. The datastore's automatic ID generator will not use IDs that have been obtained using allocate_ids(), so your app can use these IDs without conflict.
You should not create entities with IDs that have not been allocated by the datastore. If you want to supply a user-defined identifier as part of the entity key, use key_name instead.
