Do entity types/kinds have any special property or restriction compared to a Key? It seems to me that the entity type is just a key except it has no parent and the API clients use this concept to avoid collisions but technically there is no difference at the datastore level. It's all a big key. Am I right?
TL;DR: Yes, Entity Kind is different from the Key in that it is used for indexing purposes. Think of it as roughly analogous to a table name.
By entity type, I'm assuming you are referring to Entity Kind.
The Key of an entity is globally unique within your project. It is composed of the Entity Kind, its Id or Name, and optionally an ancestor path (which is a sequence of further Kind and Id/Name pairs).
In simple cases, you can think of a 'Kind' as a table name. Cloud Datastore automatically indexes every Entity by its Kind, which allows you to run 'global' queries for Entities of that Kind, regardless of whether they are a root Entity or the descendant of another Entity.
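As a rough illustration of the structure described above, here is a minimal Python sketch that models a key as a path of (kind, id-or-name) pairs. It does not use the Datastore client library; `Key`, `kind_index`, and the sample data are hypothetical names made up for this example, and the real service stores and indexes keys differently under the hood.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

PathElement = Tuple[str, Union[int, str]]  # (kind, id or name)


@dataclass(frozen=True)
class Key:
    """Toy model of a key: a project plus a path of (kind, id/name) pairs."""
    project: str
    path: Tuple[PathElement, ...]

    @property
    def kind(self) -> str:
        # The entity's own kind is the kind of the last path element.
        return self.path[-1][0]

    @property
    def parent(self) -> Optional["Key"]:
        # Everything before the last element is the ancestor path.
        if len(self.path) == 1:
            return None
        return Key(self.project, self.path[:-1])


def kind_index(keys: List[Key]) -> Dict[str, List[Key]]:
    """Group keys by kind, mimicking the automatic per-kind index."""
    index: Dict[str, List[Key]] = {}
    for key in keys:
        index.setdefault(key.kind, []).append(key)
    return index


user = Key("demo-project", (("User", "alice"),))
root_task = Key("demo-project", (("Task", 1),))
child_task = Key("demo-project", (("User", "alice"), ("Task", 1)))

index = kind_index([root_task, child_task, user])
print(len(index["Task"]))  # both the root Task and the descendant Task
```

Note that `root_task` and `child_task` share the same kind and id but are still different keys, because the ancestor path is part of the key. The kind, on the other hand, is what lets a query find both of them at once.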
Related
I have a situation where I would like to know if it is more commonplace to use table_id or just id? (In my opinion, using table_ would cause slight confusion as to whether it is a foreign key.) Which do people prefer, and is there really any difference between the two? Or should it just be left up to picking one and being consistent?
There are two main currents in terms of naming columns in tables:
Schema Namespace
This strategy is the traditional strategy that was conceived by teams documenting the "data dictionary" of a database in the 70s. The idea is that the name itself of the column tells you which table it belongs to across the whole schema or database. For example, CLIENT_NAME would represent the name of the client in the CLIENT table.
There are variations of this strategy where a limited number of letters are assigned as prefixes (especially for M:N relationship tables) because at the time column names were limited to 6 or 8 characters in many databases. For example, the date of purchase of a car by a client could take the form CLI_CAR_DATE, CLICAR_DATE, or even CLCADT.
Examples:
A primary key "id" column of the entity table "car" would be named CAR_ID.
A foreign key on a child table "document" that points to "car" would take the same form: CAR_ID. This allows the use of natural joins; however, it should be pointed out that there are compelling reasons, not discussed here, to avoid natural joins at all costs.
Foreign keys on a table "transfer" that has multiple (two) relationships (seller and buyer) with "person" pollute this strategy. They could be named PERSON_BUYER_ID and PERSON_SELLER_ID, because both cannot have the same name PERSON_ID; this no longer allows natural joins (good).
Table Namespace
In this strategy (that is newer) column names do not include the name of the entity they belong to, but only their property name. This strategy aligns more with object design, and produces shorter names (i.e. less typing). The name of the table must be indicated when mentioning a column. For example, you would need to say the column NAME on the table CLIENT.
Examples:
A primary key "id" column of the entity table "car" would be named ID.
A foreign key on a child table "document" that points to "car" would take the form: CAR_ID; this is the same solution as the previous strategy.
Foreign keys on a table "transfer" that has multiple (two) relationships (seller and buyer) with "person" could be named: BUYER_ID and SELLER_ID. They could follow the longer names as the previous strategy, but the goal here is typically to have shorter names so the app source code gets easier to write and to debug.
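To make the second strategy concrete, here is a small sketch using Python's built-in sqlite3 module. The tables and sample rows are invented for illustration; only the naming pattern (ID for every primary key, BUYER_ID and SELLER_ID for the two foreign keys) comes from the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE transfer (
        id        INTEGER PRIMARY KEY,
        buyer_id  INTEGER NOT NULL REFERENCES person(id),
        seller_id INTEGER NOT NULL REFERENCES person(id)
    );
    INSERT INTO person (id, name) VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO transfer (id, buyer_id, seller_id) VALUES (10, 1, 2);
""")

# Because every primary key is just "id", each column has to be
# qualified with its table (or alias) whenever tables are joined.
rows = conn.execute("""
    SELECT buyer.name, seller.name
    FROM transfer
    JOIN person AS buyer  ON buyer.id  = transfer.buyer_id
    JOIN person AS seller ON seller.id = transfer.seller_id
""").fetchall()
print(rows)
```

The join conditions are explicit here, which is the trade-off of this strategy: shorter column names, but the table or alias must always be spelled out.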
Summary
I personally like the second one, but there are teams who adhere to each strategy and there's no clear winner. I lean towards the second one because [I think] the first one suffers from longer names (more typing), longer SQL (more errors), cryptic names (they don't play well with ORMs and app objects), and foreign keys that cannot follow the strategy well. In fact, virtually all the primary keys in my databases are named ID regardless of the specific entities.
But on the flip side, some teams value very highly the idea of knowing the table name of a column by just looking at it. And this is great for big databases (with 200-1000 relational fact tables) that can become quite complex, especially for new members of a team.
But above all, pick one and be consistent.
What is the usage of keys in the appengine datastore: I am new to Appengine, any info on it would be great.
Comparison
To keep things simple, let's assume MySQL stores all the rows of a table in a single file. That way, it can find all the rows by scanning that file.
App Engine's datastore (BigTable) does not have a concept of tables. Each entity (~row in MySQL) is stored separately. [It can also have an individual structure (~columns).] Because entities are not connected in any way, there is no "default" method to go through all of them. Each entity needs an ID and must be indexed.
Key Structure
A key consists of:
App ID (the closest thing in MySQL is a database).
Kind (the closest thing in MySQL is a table).
ID or name (the closest thing in MySQL is a primary key).
(Optionally) Parent key (all the above of another entity). (Details omitted for the sake of simplicity.)
Please note that what is meant by the closest thing is conceptual similarity. Technically, these things are not related. In MySQL, databases and tables represent actual storage structures. In BigTable they are just IDs, and the storage is actually flat, i.e. every entity is essentially a file.
In other words, identity-wise, a key is to an entity as the database + table + primary key are to a row in a MySQL table.
Key's Responsibilities
An entity's key:
States what application the entity belongs to.
States what kind (class, table) the entity is of.
Identifies the entity uniquely, by means of the above and either a numeric key ID or a textual key name.
(Optionally) States what the parent entity of the entity is. (Details omitted for the sake of simplicity.)
Usage
So that you can retrieve all entities of a kind, App Engine automatically builds indexes. That means App Engine maintains a list of all your entities. More specifically, it maintains a list of your entities' keys.
Complex indexes may be defined to run queries on multiple properties (~columns).
In contrast to MySQL, every BigTable query requires an index. Whenever a query is run, the corresponding index is scanned to find the entities that meet the query's conditions, and then the individual entities are retrieved by key.
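The two-step flow described above (scan an index, then fetch the entities by key) can be sketched with plain Python data structures. This is only a toy model: `entities`, `author_index`, and `query_by_author` are hypothetical names, and a real index would be range-scanned rather than walked linearly.

```python
from typing import Dict, List, Tuple

EntityKey = Tuple[str, int]  # (kind, numeric id)

# Flat storage: every entity lives under its own key.
entities: Dict[EntityKey, dict] = {
    ("Article", 1): {"author": "ann", "title": "Keys"},
    ("Article", 2): {"author": "bob", "title": "Kinds"},
    ("Article", 3): {"author": "ann", "title": "Indexes"},
}

# A single-property index: sorted (kind, property value, key) rows.
author_index: List[Tuple[str, str, EntityKey]] = sorted(
    (key[0], entity["author"], key) for key, entity in entities.items()
)


def query_by_author(kind: str, author: str) -> List[dict]:
    # Step 1: scan the index for matching rows; no entity is read yet.
    matching_keys = [
        key for (row_kind, value, key) in author_index
        if row_kind == kind and value == author
    ]
    # Step 2: fetch each matching entity by its key.
    return [entities[key] for key in matching_keys]


titles = [entity["title"] for entity in query_by_author("Article", "ann")]
print(titles)
```

Without `author_index` there would be nothing to scan, which is the sense in which every query needs an index.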
A common high-level use is to identify an entity in a URL, as every key can be represented as a URL-safe string. When an entity's key is passed in the URL, the entity can be retrieved unambiguously, as the key identifies it uniquely.
Moreover, retrieving an entity by its key is strongly consistent, as opposed to queries on indexes, which means that when an entity is retrieved by its key, it's guaranteed to be the latest version.
Tips
Every entity stored in BigTable has a key. Such a key may be programmatically created in your application and given an arbitrary key name. If it's not, a numeric ID will be allocated transparently, as the entity is being stored.
Once an entity is stored, its key may not be changed.
The optional parent component might be used to define a hierarchy of entities, but what it's really important for is transactions and strong consistency.
Entities that share a parent are said to belong to the same entity group.
Queries within a group are strongly consistent.
Just to reiterate, retrieving an entity by its key or querying an index by a parent key are strongly consistent. Retrieving entities in other ways (e.g. by a query on a property) is eventually consistent.
Glossary
Entity - a single key-value document.
Eventual consistency - retrieving an entity (often a number of them) without the guarantee that the replication has completed, which may result in some entities being an old version and some being missing, as they have not yet been brought from the server they were stored on.
Key - an entity's ID.
Kind - arbitrary textual name of a class of entities, such as User or Article.
Key ID - a numeric identifier of a key. Usually automatically allocated.
Key name - a textual identifier of a key.
Strong consistency - retrieving an entity in such a way that its latest version is retrieved.
(I intentionally used MySQL in the examples, as I'm much more familiar with it than with any other relational database.)
Please read https://developers.google.com/appengine/docs/java/datastore/#Java_Entities ... you may want to delete your question and ask again after you have studied this documentation section.
(This is meant to help you, not complain.)
I'll keep it simple. A Restful URI is something like:
example.com/rest/customer/1
What is the best common practice for what '1' is? Is it the db generated primary key?
Using a system generated primary key makes me think that it won't be conducive to:
Database merges
Importing/exporting data
Not using the primary key has its own set of issues. Looking for prevailing thoughts on this topic.
I would expect the id to be the primary key as that is how you would identify the record. If you want, and you have one, you could use a natural primary key e.g. someone's employee id rather than an identity which is a surrogate key.
If your issue is that it is an integer rather than it being the database primary key (and hence I suppose guessable) you could use a GUID instead. They can be generated on either the client or the server side, either in the application or in the DB.
They would help with database merges etc., and collisions between them are so unlikely that they can be treated as unique in practice.
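As a small sketch of generating such an identifier in the application, Python's standard uuid module can produce a random (version 4) GUID. The function name `new_customer_id` and the URI layout are assumptions made up for this example.

```python
import uuid


def new_customer_id() -> str:
    # uuid4() uses 122 random bits, so a collision is extremely
    # unlikely, though not strictly impossible.
    return str(uuid.uuid4())


customer_id = new_customer_id()
uri = f"https://example.com/rest/customer/{customer_id}"
print(uri)
```

Because the value does not depend on any database sequence, two databases can generate ids independently and still be merged later without renumbering.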
If you design your RESTful API correctly, the choice of numbering scheme for IDs becomes opaque and irrelevant to your API's consumers.
Applications coded against your API will navigate around it using hyperlinks within your representations, as long as they don't attempt URI construction. Applying the principles of HATEOAS allows you to use database keys as your resource IDs without worry.
Generally, it doesn't matter.
But you should always be returning an href as your unique identifier to the consumer, not just an ID. If you just return an ID, that means they have to know where the resource lives and combine that information with the ID to make a unique request for the resource. Returning the href for them eliminates coupling.
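A minimal sketch of what that could look like when building a response body is shown below. `BASE_URL`, `customer_representation`, and the field names are hypothetical; the point is only that the consumer receives a ready-to-use link rather than a bare ID.

```python
import json

BASE_URL = "https://example.com/rest"  # hypothetical API root


def customer_representation(customer_id: int, name: str) -> dict:
    # The client follows "href" as-is and never has to know how
    # customer URIs are constructed.
    return {
        "href": f"{BASE_URL}/customer/{customer_id}",
        "name": name,
    }


body = json.dumps(customer_representation(1, "Ann"))
print(body)
```

If the ID scheme later changes (say, from integers to GUIDs), clients that only follow the `href` keep working unchanged.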
For example, I have 2 entities: book and copy with a 1-n relationship, as a book can have many copies.
if copy is a strong entity,
book(PK_ISBN#, title, edition, date)
copy(PK_copy#, condition, FK_ISBN#)
if copy is a weak entity,
book(PK_ISBN#, title, edition, date)
copy(ISBN#, copy#, condition)
Primary key (ISBN#, copy#)
Foreign key ISBN# references book(PK_ISBN#)
Question: why would the copy entity be the weak entity instead of strong entity when both cases, I think, are similar.
P/S: one more question: How can we model partial or total participation constraint in SQL code.
As you've probably realised, in practice there is little or no difference in SQL implementations between a table representing a strong entity and one representing a weak entity. The concept exists in ER notation but has very little relevance to the relational model or SQL except as a way of understanding the semantics of the domain of discourse.
Your examples are a little sketchy on details however. It appears that the copy attribute is unique in the first example but not in the second, which suggests the copy attribute means something different in each case.
A total participation constraint between two tables is usually impossible to enforce in SQL because Standard SQL doesn't support multiple assignment (you can't update both tables at once). The workaround is to disable or defer constraint checking during updates, which means such constraints are of only limited value. Partial participation is essentially what a foreign key constraint achieves.
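For illustration, the weak-entity version of the question's schema can be written as an ordinary table with a composite primary key and a foreign key. The sketch below uses Python's sqlite3 module with invented column names and sample rows (`copy_no` and `copy_condition` stand in for copy# and condition); it shows only that the foreign key rejects a copy whose book does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE book (
        isbn  TEXT PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE copy (
        isbn           TEXT    NOT NULL REFERENCES book(isbn),
        copy_no        INTEGER NOT NULL,
        copy_condition TEXT,
        PRIMARY KEY (isbn, copy_no)
    );
    INSERT INTO book VALUES ('111', 'First'), ('222', 'Second');
    -- copy_no 1 appears twice: it is unique only within a book.
    INSERT INTO copy VALUES ('111', 1, 'good'), ('222', 1, 'worn');
""")

try:
    # No book with ISBN '999' exists, so this copy has no owner.
    conn.execute("INSERT INTO copy VALUES ('999', 1, 'new')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

copy_count = conn.execute("SELECT COUNT(*) FROM copy").fetchone()[0]
print(orphan_rejected, copy_count)
```

Nothing in this schema forces every book to have at least one copy, which is the kind of constraint the answer describes as hard to enforce declaratively.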
Can a database table contain more than one primary key?
Yes, I am talking about RDBMS.
A table can have:
No primary keys;
One primary key consisting of one column; or
One composite primary key consisting of two or more columns.
Other than that you can have any number of unique indexes, which will do basically the same thing.
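The options listed above can be checked quickly with Python's sqlite3 module. This sketch is specific to SQLite and uses made-up table and column names; other databases report the same mistakes with different error messages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two separate PRIMARY KEY declarations on one table are rejected.
try:
    conn.execute(
        "CREATE TABLE bad (a INTEGER PRIMARY KEY, b INTEGER PRIMARY KEY)"
    )
    two_primary_keys_allowed = True
except sqlite3.Error:
    two_primary_keys_allowed = False

# One composite primary key plus an additional unique constraint is fine.
conn.execute("""
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL,
        course_id  INTEGER NOT NULL,
        ticket_no  TEXT    NOT NULL UNIQUE,
        PRIMARY KEY (student_id, course_id)
    )
""")
conn.execute("INSERT INTO enrollment VALUES (1, 10, 'T-1')")

try:
    # Different primary key, but the same ticket_no as the first row.
    conn.execute("INSERT INTO enrollment VALUES (2, 10, 'T-1')")
    duplicate_ticket_allowed = True
except sqlite3.IntegrityError:
    duplicate_ticket_allowed = False

print(two_primary_keys_allowed, duplicate_ticket_allowed)
```

The unique constraint on `ticket_no` enforces uniqueness just as the primary key does; the difference is only which key was designated as primary.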
The primary key of a relational table uniquely identifies each record in the table.
So, in order to keep the uniqueness of each record, you can't have more than one primary key for the table.
It can either be a normal attribute that is guaranteed to be unique (such as Social Security Number in a table with no more than one record per person) or it can be generated by the DBMS (such as a globally unique identifier, or GUID, in Microsoft SQL Server). Primary keys may consist of a single attribute or multiple attributes in combination.
That's why it is called Primary Key: because it is, well, PRIMARY.
Yes, you can have Composite primary keys, that is, having two fields as a primary key.
"First of all, you have to understand the history of entity-relationship design methodology as well as understand the word "relational" in relational database management systems (RDBMS)."
May I suggest politely that you first get YOURSELF educated on these very same subjects before leading other people into flawed beliefs? I'll respond to the two worst ones of your stupidities below.
"According to relational methodology principles, each entity should only have one and only one means to identify it."
That is about the biggest crap I have ever heard anybody spawn around about relational data design. The relational model does not constrain any "entity", as you erroneously call it, to have any precise number of keys. Any "entity" can have any number of keys, and EACH key is, by definition of its very property of making the "rows" unique, a valid candidate for any purpose of "identification". Choosing the most useful/appropriate one for use in certain contexts (foreign keys in referencing tables, e.g.), is a design issue, and the relational model does not have anything to say on such things.
"Therefore, "R"DBMS attempts to facilitate the modeling of entity relationships."
Codd's paper "A Relational Model of Data for Large Shared Data Banks", which marks the birth of the relational model, predates the invention of E-R by a number of years. So to say that the Relational model attempts to facilitate the modeling of E-R concepts, is having things COMPLETELY backwards, and nothing but a display of one's own complete and utter ignorance of "the history" that you referred to in your own answer.
The short answer is yes. A primary key is a candidate key and is in principle no different to any other candidate key. It is a widely observed convention that one candidate key per table is designated as the "primary" one - meaning that it is "preferred" or has some special meaning for the database designer or user. This is just convention however. It is only a label of convenience and a reminder about the potential significance of one key. In practice all keys can serve the same purpose and the "primary" one is not special or unique in any fundamental way.
First of all, you have to understand the history of entity-relationship design methodology as well as understand the word "relational" in relational database management systems (RDBMS).
In order to define the bounds of an entity and relationships to be formed, there must be a unique handle or a unique combination of handles to identify each single instance of an entity and then to form relationships between them.
You also need to understand the meaning/root of the word "identify" which is to zero in on the "identity" of each instance of an entity. "identity" being the mathematical term meaning "one" or a singularity.
According to relational methodology principles, each entity should only have one and only one means to identify it. Therefore, "R"DBMS attempts to facilitate the modeling of entity relationships. Note the differences between "Entity/Class" and "Entity/Class instance".
However, RDBMS is used widely and mostly by people not so interested in accurately portraying the E-R design principles. As a result, we frequently have more than one possible entity-definition sitting inside a table, which I call entity-aliasing. As opposed to identity-aliasing, where two or more instances of an entity-set hide under the same key, entity-aliasing is when a table such as
EmpProj([empId], empName, empAddr, projId, projLoc)
actually has two entity-sets aliased under the same table:
Emp([empId], empName, empAddr)
Proj([projId], projLoc, empId)
That is when normalisation comes in - to separate these entities out. Try as we might to do a decent design normalisation, computer scientists may not have as good a perspective on the information as a statistician. The computer scientist (which in this discussion includes everyone with a decent knowledge of ER design) tries his/her best in creating a schema that cleanly defines entities and their relationships.
However, after 18 months of analysing voluminous information from the database, the statistician begins to see principal components emerge whose analysis is terribly crippled by the misalignment of those principal components with the boundaries of the computer scientists' perceived entities.
That is what alternate unique keys are good for: identifying instances of entities that exist in the database as ghost-entities because of those principal components.
Therefore, a table has a primary key because that table is perceived to be a perfect entity, and an entity should have only one primary key, be it singular or composite.
As far as the statistician is concerned, even though the database allows only one primary key per table, the alternative unique keys are, to the statistician, the primary keys of those ghost-entities. Which is why sometimes you are frustrated by statisticians who seem to do double work by downloading the data into the local database of their workstation/PC.
In conclusion, the constraint placed by the "R"DBMS manufacturer in allowing only one primary key per table is their pretense in believing that they know how information behaves and believing that principal components of the information due to the population do not mutate over time.
If more than one unique key is possible in a table, it means one or more of the following:
Like myself, you are too lazy to separate them since they seem to work quite well.
For performance's sake, mixing the entities into the same table makes the application run incredibly faster.
Like the statistician, you gradually discover ghost entities in your information.