Compound Key Performance For App Engine Datastore - google-app-engine

I am creating string datastore IDs for entity "A" from auto-generated IDs of entity "B". Should I build the A ID as "A-PREFIX" + B.IntID() or as B.IntID() + "A-PREFIX"?
I assume I should start with the B ID, because B IDs are uniformly distributed and that would prevent hotspots?
From https://cloud.google.com/developers/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore
Anti-Pattern #1: Sequential Numbering of Entity Keys
Thanks,
Andrew

You do not need any prefix at all. A key consists of an entity kind and an ID. So two entities may have the same ID and still have unique keys if they belong to different kinds.
This example works perfectly fine (example in Java):
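// Assumes the App Engine low-level API (com.google.appengine.api.datastore).
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Entity userEntity = new Entity("User");
long id = datastore.put(userEntity).getId(); // auto-allocated numeric ID
Entity loginEntity = new Entity("Login", id); // same ID, different kind
datastore.put(loginEntity);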
Note that if you take a Long id and convert it into a String, your keys will take up much more space. So using a Long for an id is a better option.

Related

Google App Engine (datastore) - will a deleted key regenerate?

I've got a simple question about datastore keys. If I delete an entity, is there any possibility that the same key will be created again? Or is each key unique and generated only once?
Thanks.
It is definitely possible to re-use keys.
Easy to test, for example using the datastore admin page:
create an entity for one of your entity models using a custom/specified key name and some property values
delete the entity
create another one using the same key name and different property values...
As for the keys with auto-generated IDs it is theoretically possible, but I guess rather unlikely due to the high number of possibilities. From Assigning identifiers:
Cloud Datastore can be configured to generate auto IDs using two different auto id policies:
The default policy generates a random sequence of unused IDs that are approximately uniformly distributed. Each ID can be up to 16 decimal digits long.
The legacy policy creates a sequence of non-consecutive smaller integer IDs.

Generating a unique id GAE datastore

In MySQL I used auto-increment to generate an id for every user. I would like to create a similar user table in Google Datastore where the id for a user will be unique. According to these docs: https://cloud.google.com/appengine/docs/java/datastore/entities
System-allocated ID values are guaranteed unique to the entity group.
But according to this post: Ever see duplicate IDs when using Google App Engine and ndb? the ids are not unique. I need this id to be unique. It is confusing because the docs say the id is unique, while the post says the id is not unique and only the key is. My objective is for no two users to have the same id. How can I guarantee this? I would prefer the database to take care of this for me, as opposed to creating large ids manually with something like UUIDs.
As Igor correctly observed, IDs of a given kind are always unique as long as the entity has no parent.
I can't think of any reason to make user entities children of some other entities, so you are safe.
Note that IDs will not be sequential, as it helps to spread the load equally across the entire dataset - it's a by-product of how the Datastore is designed.

GAE datastore index vs normalisation

Given the entity below in the Google App Engine datastore, is it better to define an index on reportingIds or to define a separate entity that has only the personId and reportingIds fields? From the documentation, I understand that defining an index increases the number of operations counted against the datastore quota.
Below are the entities in GAE Go. My code needs to scan through Person entities frequently, and it needs to limit the scan to Person entities that have at least one reporting person. I see two approaches: (1) define an index on reportingIds and query with filters, or (2) create/update a PersonWithReporters entity whenever a Person gets a new reporting person. In the second case, my code iterates through all the entities in PersonWithReporters and does not need any index or query. I can iterate by key, which is always guaranteed to return the latest data. I am not sure which approach is better in terms of datastore operations counted against the quota limit.
type Person struct {
	Id string // unique person id
	// many other personal details, his personal settings etc
	ReportingIds []string `datastore:"reportingIds"` // ids of the Persons this person manages; exported so the datastore package saves it
}
type PersonWithReporters struct {
	Id string // Person managing reportees
	ReportingIds []string `datastore:"reportingIds"` // ids of the Persons this person manages
}
An approach with a separate entity gives you two advantages:
As you have already mentioned, you don't need to index/query all Person entities.
Every time a Person gets a new reporting person, you will create a new entity, which may be significantly cheaper than updating a Person entity which has many other properties, some of which, presumably, are indexed.
Your approach with a separate entity is also not ideal. When you index a property with multiple values, under the hood the Datastore creates an index entry for each value. So, when you add reporting person number 3 to this entity, you have to update 3 index entries instead of 1.
You can optimize your data model even further by creating a Reporter entity with no properties! Every time a new reporting person is added, you create this Reporter entity with ID set to the ID of a reporting person, and make it a child entity of a Person entity representing a person to whom this reporter reports.
Now, when you need to iterate through all persons with someone reporting to them, you run a simple query on this Reporter entity with no filters. This query can be set to keys-only (there is nothing other than a key in this entity anyway, but keys-only queries are treated differently: they are basically free).
For every entity returned by this query you retrieve its key, and this key contains an ID (which is an ID of a reporting person), and a parent key, which includes an ID of a person who this reporter reports to.
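The key layout described above can be sketched in plain Python. This is only an in-memory model, not the Datastore API: each key is represented as a tuple of (kind, ID) pairs, and the names reporter_key and managers_with_reporters are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict

def reporter_key(manager_id, reporter_id):
    """Model a Reporter key whose parent is the manager's Person key."""
    return (("Person", manager_id), ("Reporter", reporter_id))

def managers_with_reporters(keys):
    """Group reporter IDs by manager ID using only what the keys contain."""
    result = defaultdict(list)
    for parent, (_kind, reporter_id) in keys:
        _parent_kind, manager_id = parent
        result[manager_id].append(reporter_id)
    return dict(result)

# Stand-in for the result of a keys-only query on the Reporter kind.
keys = [
    reporter_key("alice", "bob"),
    reporter_key("alice", "carol"),
    reporter_key("dave", "erin"),
]
report = managers_with_reporters(keys)
```

Everything needed here (who manages whom) comes from the keys alone, which is why the Reporter entity can have no properties.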
Unless App Engine's datastore in Go is very different from how it works in Java or Python, you cannot index an array natively, so option 1 is out of the question, and so is option 2.
I suggest a third option, which is to define:
type PersonWithReporters struct {
	Id string // concatenate(managing_Person_id, separator, reporter_Person_id) to avoid id collisions
	ReportingId string `datastore:"reportingId"` // indexed
	ManagingId string `datastore:"managingId"` // probably indexed as well
}
You would create multiple of these entities instead of a single entity with an array. Also you add an index on reportingId. Now you can create a filter query on this entity and should be able to retrieve the desired information.
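As a rough illustration of the one-entity-per-link idea, here is a plain-Python sketch. It does not call the Datastore; the entities are dictionaries, the filter query is simulated with a list comprehension, and the separator character and helper names are assumptions made for the example.

```python
SEPARATOR = "|"  # assumption: this character never occurs inside a person ID

def link_id(managing_id, reporting_id):
    """Build the concatenated ID so each (manager, reporter) pair is distinct."""
    return f"{managing_id}{SEPARATOR}{reporting_id}"

def make_link(managing_id, reporting_id):
    """Model one PersonWithReporters entity as a dictionary."""
    return {
        "Id": link_id(managing_id, reporting_id),
        "managingId": managing_id,
        "reportingId": reporting_id,
    }

links = [
    make_link("alice", "bob"),
    make_link("alice", "carol"),
    make_link("dave", "bob"),
]

# Simulates a filter query on reportingId == "bob".
managers_of_bob = sorted(l["managingId"] for l in links if l["reportingId"] == "bob")

# Every person who has at least one reporter.
managers = sorted({l["managingId"] for l in links})
```

Because the ID combines both person IDs, the same reporter can appear under two managers without the entities overwriting each other.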
I would worry more about performance and not too much about the quota limits, they are pretty high. Just implement it, see how it works and whether quota is your main concern here.

Key vs ID/Name?

I do not want to create an autogenerated key for my entities so I specify my own:
Entity employee = Entity.newBuilder().setKey(makeKey("Employee", "bobby"))
    .addProperty(makeProperty("fname", makeValue("fname").setIndexed(false)))
    .addProperty(makeProperty("lname", makeValue("lname").setIndexed(false)))
    .build();
CommitRequest request = CommitRequest.newBuilder()
    .setMode(CommitRequest.Mode.NON_TRANSACTIONAL)
    .setMutation(Mutation.newBuilder().addInsert(employee))
    .build();
datastore.commit(request);
When I check to see what the entity looks like it looks like this:
Why is this auto-generated key generated if I specified my own key (bobby)? It seems bobby was also created, but now I have bobby and this autogenerated key. What is the difference between the key and id/name?
You can't specify the whole key yourself, because keys also contain information the datastore needs for its own operation. This note in the documentation gives you an idea:
Note: The URL-safe string looks cryptic, but it is not encrypted! It can easily be decoded to recover the original entity's kind and identifier:
key = Key(urlsafe=url_string)
kind_string = key.kind()
ident = key.id()
If you use such URL-safe keys, don't use sensitive data such as email addresses as entity identifiers. (A possible solution would be to use the MD5 hash of the sensitive data as the identifier. This stops third parties, who can see the encrypted keys, from using them to harvest email addresses, though it doesn't stop them from independently generating their own hash of a known email address and using it to check whether that address is present in the Datastore.)
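The hashing suggestion in the note could look roughly like this in Python, using only the standard library. The function name hashed_identifier is hypothetical, and normalizing the address (trimming and lowercasing) before hashing is an assumption of this sketch, not something the documentation prescribes.

```python
import hashlib

def hashed_identifier(email):
    """Return the MD5 hex digest of a normalized email, for use as a key name."""
    normalized = email.strip().lower()  # assumption: treat case variants as one user
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

ident = hashed_identifier("Bobby@Example.com")
```

As the note points out, this only hides the address from someone reading the key; anyone who already knows an address can compute the same hash and check for it.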
What you can specify is the ID portion of the key, either as a number or as a string:
A key is a series of kind-ID pairs. You want to make sure each entity has a key that is unique within its application and namespace. An application can create an entity without specifying an ID; the Datastore automatically generates a numeric ID. If an application picks some IDs "by hand" and they're numeric and the application lets the Datastore generate some IDs automatically, the Datastore might choose some IDs that the application already used. To avoid this, the application should "reserve" the range of numbers it will use to choose IDs (or use string IDs to avoid this issue entirely).
This is the url-safe version of your key, suitable for use in links. Use KeyFactory.stringToKey to convert it to an actual key, and you'll see that it contains your string name.
What you create with makeKey("Employee", "bobby") is a key for an entity of kind Employee with the name bobby. What you see as Key in the datastore viewer is a representation of exactly that.
Generally speaking a key always consists of
optional parent key (with entity type and name/id)
entity type
entity name/id
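The three components listed above can be modeled with a small Python class. This is an illustrative sketch, not the Datastore's own Key type; the class, its field names, and the Badge kind are all hypothetical.

```python
from typing import NamedTuple, Optional, Union

class Key(NamedTuple):
    kind: str                       # entity type
    id_or_name: Union[int, str]     # numeric ID or string name
    parent: Optional["Key"] = None  # optional parent key

    def path(self):
        """Flatten the key into (kind, id/name) pairs, root ancestor first."""
        prefix = self.parent.path() if self.parent is not None else ()
        return prefix + ((self.kind, self.id_or_name),)

employee = Key("Employee", "bobby")
badge = Key("Badge", 42, parent=employee)  # hypothetical child entity

employee_path = employee.path()
badge_path = badge.path()
```

Two keys are equal only if every component matches, so the same ID under a different kind or a different parent still identifies a different entity.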
Maybe someone here can tell you how to decode the key into its components, but rest assured that you're doing everything right and the behavior is as expected.

Datastore why use key and id?

I have a question about why Google App Engine's Datastore uses both a key and an ID. Coming from a relational database background, I am comparing entities with rows, so why does storing an entity require a key (which is a long, automatically generated string) and an ID (which can be manually or automatically entered)? This seems like a big waste of space to identify a record. Again, I am new to this type of database, so I may be missing something.
Key design is a critical part of efficient Datastore operations. The keys are what are stored in the built-in and custom indexes and when you are querying, you can ask to have only keys returned (in Python: keys_only=True). A keys-only query costs a fraction of a regular query, both in $$ and to a lesser extent in time, and has very low deserialization overhead.
So, if you have useful/interesting things stored in your key id's, you can perform keys-only queries and get back lots of useful data in a hurry and very cheaply.
Note that this extends into parent keys and namespaces, which are all part of the key and therefore additional places you can "store" useful data and retrieve all of it with keys-only queries.
It's an important optimization to understand and a big part of our overall design.
Basically, the key is built from two pieces of information:
The entity type (in Objectify, it is the class of the object)
The id/name of the entity
So, for a given entity type, key and id are quite the same.
If you do not specify the ID yourself, then a random ID is generated and the key is created based on that random id.
