Why do entities need keys in the App Engine datastore?

What are keys used for in the App Engine datastore? I am new to App Engine, so any information would be helpful.

Comparison
To keep things simple, let's assume MySQL stores all the rows of a table in a single file. That way, it can find all the rows by scanning that file.
App Engine's datastore (BigTable) does not have a concept of tables. Each entity (~row in MySQL) is stored separately. [It can also have an individual structure (~columns).] Because entities are not connected in any way, there is no "default" method to go through all of them. Each entity needs an ID and must be indexed.
Key Structure
A key consists of:
App ID (the closest thing in MySQL is a database).
Kind (the closest thing in MySQL is a table).
ID or name (the closest thing in MySQL is a primary key).
(Optionally) Parent key (all the above of another entity). (Details omitted for the sake of simplicity.)
Please note that what is meant by the closest thing is conceptual similarity. Technically, these things are not related. In MySQL, databases and tables represent actual storage structures. In BigTable they are just IDs, and the storage is actually flat, i.e. every entity is essentially a file.
In other words, identity-wise, a key is to an entity as the database + table + primary key are to a row in a MySQL table.
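As a rough illustration of the structure described above, here is a minimal sketch in plain Java. EntityKey and its fields are hypothetical names chosen for this example; this is not the App Engine SDK's own key class, only a model of the four components and of how a parent key nests inside a child key.

```java
import java.util.Objects;

// Hypothetical model of a datastore key; not the App Engine SDK's key class.
final class EntityKey {
    final String appId;      // which application the entity belongs to
    final String kind;       // e.g. "User" or "Article"
    final String idOrName;   // numeric ID (as text) or key name
    final EntityKey parent;  // optional parent key, null for a root entity

    EntityKey(String appId, String kind, String idOrName, EntityKey parent) {
        this.appId = Objects.requireNonNull(appId);
        this.kind = Objects.requireNonNull(kind);
        this.idOrName = Objects.requireNonNull(idOrName);
        this.parent = parent;
    }

    // Renders the full path from the root ancestor down to this key.
    String path() {
        String own = kind + ":" + idOrName;
        return parent == null ? appId + "/" + own : parent.path() + "/" + own;
    }

    // Two keys are the same only if every component, including the parent, matches.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EntityKey)) return false;
        EntityKey other = (EntityKey) o;
        return appId.equals(other.appId) && kind.equals(other.kind)
                && idOrName.equals(other.idOrName)
                && Objects.equals(parent, other.parent);
    }

    @Override
    public int hashCode() {
        return Objects.hash(appId, kind, idOrName, parent);
    }
}
```

In this sketch, a key for an Article with the name "first-post" under a User with ID 42 in the application "myapp" would render as myapp/User:42/Article:first-post, which mirrors the "database + table + primary key" analogy.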
Key's Responsibilities
An entity's key:
States which application the entity belongs to.
States which kind (class, table) it is of.
Identifies the entity uniquely, by means of the above and either a numeric key ID or a textual key name.
(Optionally) States what the parent entity of the entity is. (Details omitted for the sake of simplicity.)
Usage
So that you can retrieve all entities of a kind, App Engine automatically builds indexes. That means App Engine maintains a list of all your entities. More specifically, it maintains a list of your entities' keys.
Complex indexes may be defined to run queries on multiple properties (~columns).
In contrast to MySQL, every BigTable query requires an index. Whenever a query is run, the corresponding index is scanned to find the entities that meet the query's conditions, and then the individual entities are retrieved by key.
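The "scan the index, then retrieve by key" flow can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: TinyDatastore, its string keys, and its string-valued properties are hypothetical, and the real datastore is distributed and far more involved. The sketch also assumes each key is stored only once, so it never has to remove stale index entries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory model of "scan the index, then fetch by key". Hypothetical names.
final class TinyDatastore {
    // Entities are stored flat and are addressed only by their key.
    private final Map<String, Map<String, String>> entitiesByKey = new HashMap<>();
    // One index per property: property value -> keys of the entities having it.
    private final Map<String, TreeMap<String, List<String>>> indexes = new HashMap<>();

    // Stores the entity and adds its key to the index of every property it has.
    // Assumes each key is put only once (no stale index entries are cleaned up).
    void put(String key, Map<String, String> properties) {
        entitiesByKey.put(key, properties);
        for (Map.Entry<String, String> p : properties.entrySet()) {
            indexes.computeIfAbsent(p.getKey(), k -> new TreeMap<>())
                   .computeIfAbsent(p.getValue(), v -> new ArrayList<>())
                   .add(key);
        }
    }

    // Direct lookup by key: no index is involved.
    Map<String, String> get(String key) {
        return entitiesByKey.get(key);
    }

    // Query: consult the index first, then fetch each matching entity by key.
    List<Map<String, String>> query(String property, String value) {
        List<Map<String, String>> results = new ArrayList<>();
        TreeMap<String, List<String>> index = indexes.get(property);
        if (index == null) {
            return results; // no index for this property, so nothing can be found
        }
        for (String key : index.getOrDefault(value, new ArrayList<String>())) {
            results.add(entitiesByKey.get(key));
        }
        return results;
    }
}
```

Note how query never looks at the entities themselves to decide what matches; it only sees keys listed in an index, which is why a property without an index cannot be queried in this model.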
A common high-level use is to identify an entity in a URL, as every key can be represented as a URL-safe string. When an entity's key is passed in the URL, the entity can be retrieved unambiguously, as the key identifies it uniquely.
Moreover, retrieving an entity by its key is strongly consistent, as opposed to queries on indexes. This means that when an entity is retrieved by its key, it is guaranteed to be the latest version.
Tips
Every entity stored in BigTable has a key. Such a key may be programmatically created in your application and given an arbitrary key name. If it's not, a numeric ID will be allocated transparently, as the entity is being stored.
Once an entity is stored, its key may not be changed.
The optional parent component might be used to define a hierarchy of entities, but what it's really important for is transactions and strong consistency.
Entities that share a parent are said to belong to the same entity group.
Queries within a group are strongly consistent.
Just to reiterate, retrieving an entity by its key or querying an index by a parent key are strongly consistent. Retrieving entities in other ways (e.g. by a query on a property) is eventually consistent.
Glossary
Entity - a single key-value document.
Eventual consistency - retrieving an entity (often a number of them) without the guarantee that the replication has completed, which may result in some entities being an old version and some being missing, as they have not yet been brought from the server they were stored on.
Key - an entity's ID.
Kind - arbitrary textual name of a class of entities, such as User or Article.
Key ID - a numeric identifier of a key. Usually automatically allocated.
Key name - a textual identifier of a key.
Strong consistency - retrieving an entity in such a way that its latest version is retrieved.
(I intentionally used MySQL in the examples, as I'm much more familiar with it than with any other relational database.)

Please read https://developers.google.com/appengine/docs/java/datastore/#Java_Entities ... you may want to delete your question and ask again after you have studied this documentation section.
(This is meant to help you, not complain.)

Related

Is it more common to use table_id or id in database design

I would like to know whether it is more commonplace to use table_id or just id. (In my opinion, using the table_ prefix causes slight confusion as to whether the column is a foreign key.) Which do people prefer, and is there really any difference between the two? Or should it just be left up to picking one and being consistent?
There are two main currents in terms of naming columns in tables:
Schema Namespace
This strategy is the traditional strategy that was conceived by teams documenting the "data dictionary" of a database in the 70s. The idea is that the name itself of the column tells you which table it belongs to across the whole schema or database. For example, CLIENT_NAME would represent the name of the client in the CLIENT table.
There are variations of this strategy where a limited number of letters are assigned as prefixes (especially for M:N relationship tables) because at the time column names were limited to 6 or 8 characters in many databases. For example, the date of purchase of a car by a client could take the form CLI_CAR_DATE, CLICAR_DATE, or even CLCADT.
Examples:
A primary key "id" column of the entity table "car" would be named CAR_ID.
A foreign key on a child table "document" that points to "car" would take the same form: CAR_ID. This allows the use of natural joins; however, it should be pointed out that there are compelling reasons to avoid natural joins at all costs, which are not discussed here.
Foreign keys on a table "transfer" that has multiple (two) relationships (seller and buyer) with "person" pollute this strategy. They could be named PERSON_BUYER_ID and PERSON_SELLER_ID, because both cannot have the same name PERSON_ID; this no longer allows natural joins (good).
Table Namespace
In this strategy (that is newer) column names do not include the name of the entity they belong to, but only their property name. This strategy aligns more with object design, and produces shorter names (i.e. less typing). The name of the table must be indicated when mentioning a column. For example, you would need to say the column NAME on the table CLIENT.
Examples:
A primary key "id" column of the entity table "car" would be named ID.
A foreign key on a child table "document" that points to "car" would take the form: CAR_ID; this is the same solution as the previous strategy.
Foreign keys on a table "transfer" that has multiple (two) relationships (seller and buyer) with "person" could be named BUYER_ID and SELLER_ID. They could use the longer names from the previous strategy, but the goal here is typically to have shorter names so that the app source code is easier to write and debug.
Summary
I personally like the second one, but there are teams who adhere to each strategy and there's no clear winner. I lean towards the second one because, I think, the first one suffers from longer names (more typing), longer SQL (more errors), cryptic names (they don't play well with ORMs and app objects), and foreign keys that cannot follow the strategy well. In fact, virtually all the primary keys in my databases are named ID regardless of the specific entities.
But on the flip side, some teams value very highly the idea of knowing the table name of a column by just looking at it. And this is great for big databases (with 200-1000 relational fact tables) that can become quite complex, especially for new members of a team.
But above all, pick one and be consistent.

Key-Value database with key aliasing or searching by value

Is there an existing in-memory, production-ready KV store that allows me to retrieve a single value via any of multiple keys?
Let's say I have millions of immutable entities, each with an associated primary key. Any of these entities can have multiple aliases, and the most common scenario is to retrieve the entity by such an alias (90% of all requests). The second most common scenario is to retrieve the entity via the primary key and then store a new alias record (the last 10%). One special thing about this step: it is always preceded by the alias search and happens only if the alias search was unsuccessful.
The entire dataset fits into RAM, but probably won't if the entire record data is duplicated across all aliases.
I'm highly concerned about data retrieval latency and less concerned about write speed.
This can be done with Redis in two sequential lookups or via any SQL database or MongoDB. I think both ways are suboptimal: the first one obviously because of two round trips for every search attempt, and the second one because of latency concerns.
Any suggestions?
Can you use two hashmaps: one that maps pk -> record data and another that maps alias -> pk?
Another option is to have some sort of deterministic alias, so that you can go from the alias to the primary key directly in code without doing a lookup in a datastore.
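The two-hashmap suggestion could look roughly like the sketch below. AliasedStore and its method names are hypothetical, and the sketch assumes a single-process, in-memory setup with no concurrency control. Because both maps live in the same process, the two lookups involve no network round trips, and each record is stored once no matter how many aliases point to it.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the two-map layout: alias -> primary key -> record. Hypothetical names.
// Not thread-safe; a concurrent setup would need synchronization or concurrent maps.
final class AliasedStore<V> {
    private final Map<String, V> recordsByPk = new HashMap<>();
    private final Map<String, String> pkByAlias = new HashMap<>();

    void putRecord(String pk, V record) {
        recordsByPk.put(pk, record);
    }

    // Registers an alias for an existing record; returns false if the pk is unknown.
    boolean addAlias(String alias, String pk) {
        if (!recordsByPk.containsKey(pk)) {
            return false;
        }
        pkByAlias.put(alias, pk);
        return true;
    }

    V getByPk(String pk) {
        return recordsByPk.get(pk);
    }

    // Two in-process hash lookups; returns null when the alias is unknown.
    V getByAlias(String alias) {
        String pk = pkByAlias.get(alias);
        return pk == null ? null : recordsByPk.get(pk);
    }

    // The flow from the question: try the alias first, fall back to the
    // primary key, and remember the alias if the fallback succeeded.
    V resolve(String alias, String pk) {
        V record = getByAlias(alias);
        if (record == null) {
            record = getByPk(pk);
            if (record != null) {
                pkByAlias.put(alias, pk);
            }
        }
        return record;
    }
}
```

The alias map only holds short primary-key strings, so memory grows with the number of aliases rather than with the size of the duplicated record data.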

Is query by key faster than query by indexed property in Google Datastore?

Consider the below datastore entity:
public class Employee {
    @Id String id;
    @Index String userName;
}
My understanding is that only those properties which are part of the filter criteria in the queries need to be annotated with @Index. Indexing in datastore is not for performance but for fetching the data.
Should id also be annotated with @Index to query by id? If not, does datastore automatically create indexes for keys?
The @Id annotation makes sure to manage uniqueness, but it has no performance advantage over indexed properties. Is that right?
Will a query by id be faster than a query by userName in the above example?
1:
No, you don't need to explicitly index it. Datastore uses your key as a primary key for your entities (in the Entities table).
2 & 3:
Querying by primary key is more efficient: you only require a single scan on the primary table, instead of a scan on the index followed by a lookup in the primary table. However, it also allows you to do a Lookup instead of a query:
Employee e = ofy().load().type(Employee.class).id("<id>").now();
Besides avoiding the query planning and index scan to lookup this Employee, this is Strongly Consistent. If you don't do this, you may write a new Employee but then not actually see them when you query for them.
While Strong Consistency is important from an application correctness point-of-view, it will be slower. In particular, when you do a strongly consistent lookup, Datastore may need to talk to the other replicas (in other data centers) to catch up your entity group.
If you are ok with eventual consistency, you can perform a Lookup with eventual consistency to avoid the index scans and the replica catch up using a read policy. In objectify, this looks like:
Employee e = ofy().consistency(Consistency.EVENTUAL).load()
        .type(Employee.class).id("<id>").now();
Note: This answer talks a lot about indexes and tables. In general, I recommend not thinking about Datastore in terms of indexes and tables (since it is not a relational storage system). However, it is implemented on a relational DB, so this framing is useful for answering your questions. This page has a lot of good background.
No, the index for the key will be created automatically.
@Id makes sure the field is the entity's key.
I can't find confirmation, but it must be faster. It's also cheaper than a query: 1 read for a get vs 2 reads for a query. See https://cloud.google.com/datastore/docs/pricing
Also, keep in mind that if you decide to add the @Index annotation later, the index will be created only for new entities; all existing entities will remain unindexed. This means you need to reindex the db, or only new records will be returned from a query that filters by this field.
Objectify always does a get by key: if you run a query, it does a keys-only query, then fetches the results by id. This works well because it has cache integration, and it also means that you get accurate results (as in, the data is strongly consistent, even though the query results aren't). You can control this using the .hybrid(boolean) method on a query.
You cannot query by id - you can only get by key. If you want to do that, you need a duplicate indexed field, and to query on that. This is an artifact of how keys work in the datastore.

Are entity types different than keys?

Do entity types/kinds have any special property or restriction compared to a Key? It seems to me that the entity type is just a key except it has no parent and the API clients use this concept to avoid collisions but technically there is no difference at the datastore level. It's all a big key. Am I right?
TL;DR: Yes, Entity Kind is different from the Key in that it is used for indexing purposes. Think of it as roughly analogous to a table name.
By entity type, I'm assuming you are referring to Entity Kind.
The Key of an entity is globally unique to your project. It is composed of the Entity Kind, its Id or Name, and optionally an ancestor path (which consists of more Kind and Id/Name pairs).
In simple cases, you can think of a 'Kind' as a table name. Cloud Datastore automatically indexes every Entity by its Kind, which allows you to do 'global' queries for Entities of that Kind, regardless of whether they are a root Entity or the descendant of another entity.
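To make the distinction concrete, here is a small sketch in plain Java. KindIndex and its string-based key format are hypothetical and only model the idea: a key is a path of Kind and Id/Name pairs, the entity's Kind is the Kind of the last pair, and a per-Kind index lets you find entities of that Kind wherever they sit in the hierarchy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a key is an ancestor path of (kind, id-or-name) pairs,
// and a per-kind index lists every key whose final pair has that kind.
final class KindIndex {
    private final Map<String, List<String>> keysByKind = new HashMap<>();

    // pathElements alternate kind, idOrName, kind, idOrName, ... with the root first.
    // Returns the flattened key and records it under the kind of the last pair.
    String add(String... pathElements) {
        if (pathElements.length == 0 || pathElements.length % 2 != 0) {
            throw new IllegalArgumentException("expected kind/id pairs");
        }
        String key = String.join("/", pathElements);
        String kind = pathElements[pathElements.length - 2]; // kind of the last pair
        keysByKind.computeIfAbsent(kind, k -> new ArrayList<>()).add(key);
        return key;
    }

    // "Global" query by kind: root entities and descendants alike.
    List<String> keysOfKind(String kind) {
        return keysByKind.getOrDefault(kind, new ArrayList<String>());
    }
}
```

In this model, a root Comment and a Comment stored under an Article are both returned by keysOfKind("Comment"), even though their keys differ in length and ancestry.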

Can a database table contain more than one primary key?

Can a database table contain more than one primary key?
Yes, I am talking about RDBMS.
A table can have:
No primary keys;
One primary key consisting of one column; or
One composite primary key consisting of two or more columns.
Other than that you can have any number of unique indexes, which will do basically the same thing.
The primary key of a relational table uniquely identifies each record in the table.
So, in order to keep the uniqueness of each record, you can't have more than one primary key for the table.
It can either be a normal attribute that is guaranteed to be unique (such as Social Security Number in a table with no more than one record per person) or it can be generated by the DBMS (such as a globally unique identifier, or GUID, in Microsoft SQL Server). Primary keys may consist of a single attribute or multiple attributes in combination.
That's why it is called the Primary Key: because it is, well, PRIMARY.
Yes, you can have Composite primary keys, that is, having two fields as a primary key.
"First of all, you have to understand the history of entity-relationship design methodology as well as understand the word "relational" in relational database management systems (RDBMS)."
May I suggest politely that you first get YOURSELF educated on these very same subjects before leading other people into flawed beliefs ? I'll respond to the two worst ones of your stupidities below.
"According to relational methodology principles, each entity should only have one and only one means to identify it."
That is about the biggest crap I have ever heard anybody spawn around about relational data design. The relational model does not constrain any "entity", as you erroneously call it, to have any precise number of keys. Any "entity" can have any number of keys, and EACH key is, by definition of its very property of making the "rows" unique, a valid candidate for any purpose of "identification". Choosing the most useful/appropriate one for use in certain contexts (foreign keys in referencing tables, e.g.), is a design issue, and the relational model does not have anything to say on such things.
"Therefore, "R"DBMS attempts to facilitate the modeling of entity relationships."
Codd's paper "A Relational Model of Data for Large Shared Data Banks", which marks the birth of the relational model, predates the invention of E-R by a number of years. So to say that the Relational model attempts to facilitate the modeling of E-R concepts is having things COMPLETELY backwards, and nothing but a display of one's own complete and utter ignorance of "the history" that you referred to in your own answer.
The short answer is yes. A primary key is a candidate key and is in principle no different to any other candidate key. It is a widely observed convention that one candidate key per table is designated as the "primary" one - meaning that it is "preferred" or has some special meaning for the database designer or user. This is just convention however. It is only a label of convenience and a reminder about the potential significance of one key. In practice all keys can serve the same purpose and the "primary" one is not special or unique in any fundamental way.
First of all, you have to understand the history of entity-relationship design methodology as well as understand the word "relational" in relational database management systems (RDBMS).
In order to define the bounds of an entity and relationships to be formed, there must be a unique handle or a unique combination of handles to identify each single instance of an entity and then to form relationships between them.
You also need to understand the meaning/root of the word "identify" which is to zero in on the "identity" of each instance of an entity. "identity" being the mathematical term meaning "one" or a singularity.
According to relational methodology principles, each entity should only have one and only one means to identify it. Therefore, "R"DBMS attempts to facilitate the modeling of entity relationships. Note the differences between "Entity/Class" and "Entity/Class instance".
However, RDBMS is used widely and mostly by people not so interested in accurately portraying the E-R design principles. As a result, we frequently have more than one possible entity definition sitting inside a table, which I call entity-aliasing. As opposed to identity-aliasing, where two or more instances of an entity-set hide under the same key, entity-aliasing is when a table like
EmpProj([empId], empName, empAddr, projId, projLoc)
actually has two entity-sets aliased under the same table:
Emp([empId], empName, empAddr)
Proj([projId], projLoc, empId)
That is when normalisation comes in - to separate these entities out. Try as we might to do a decent design normalisation, computer scientists may not have as good a perspective on the information as a statistician. The computer scientist (which in this discussion includes everyone with a decent knowledge of ER design) tries his/her best in creating a schema that cleanly defines entities and their relationships.
However, after 18 months of analysing voluminous information from the database, the statistician begins to see principal components emerge whose analysis is terribly crippled by the misalignment of the principal components with the boundaries of the computer scientists' perceived entities.
That is what alternate unique keys are good for: identifying instances of entities that arise from the principal components existing as ghost-entities in the database.
Therefore, a table has a single primary key because that table is perceived to be a perfect entity, and an entity should have only one primary key, be it singular or composite.
As far as the statistician is concerned, even though the database allows only one primary key per table, the alternative unique keys are, to the statistician, the primary keys of those ghost-entities. This is why you are sometimes frustrated by statisticians who seem to do double work by downloading the data into the local database of their workstation/PC.
In conclusion, the constraint placed by the "R"DBMS manufacturer in allowing only one primary key per table is their pretense in believing that they know how information behaves and believing that the principal components of the information due to the population do not mutate over time.
If more than one unique key is possible in a table, it means one or more of the following:
Like me, you are too lazy to separate them, since they seem to work quite well.
For performance's sake, mixing the entities into the same table makes the application run incredibly faster.
Like the statistician, you gradually discover ghost entities in your information.

Resources