Datastore why use key and id? - google-app-engine

I had a question regarding why Google App Engine's Datastore uses a key and an ID. Coming from a relational database background, I am comparing entities with rows, so why, when storing an entity, does it require a key (which is a long, automatically generated string) and an ID (which can be entered manually or generated automatically)? This seems like a big waste of space to identify a record. Again, I am new to this type of database, so I may be missing something.

Key design is a critical part of efficient Datastore operations. The keys are what gets stored in the built-in and custom indexes, and when you are querying, you can ask to have only keys returned (in Python: keys_only=True). A keys-only query costs a fraction of a regular query, both in $$ and, to a lesser extent, in time, and has very low deserialization overhead.
So, if you have useful/interesting things stored in your key IDs, you can perform keys-only queries and get back lots of useful data in a hurry and very cheaply.
Note that this extends into parent keys and namespaces, which are all part of the key and therefore additional places you can "store" useful data and retrieve all of it with keys-only queries.
It's an important optimization to understand and a big part of our overall design.
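For illustration, here is roughly what a keys-only query looks like in the Go runtime, as a minimal sketch using the google.golang.org/appengine/datastore package (the Comment kind and the Author filter are made-up examples, and this is the Go counterpart of Python's keys_only=True):

import (
    "context"

    "google.golang.org/appengine/datastore"
)

// commentKeys runs a keys-only query: only keys are read from the index,
// no entity data is fetched or deserialized.
func commentKeys(ctx context.Context) ([]*datastore.Key, error) {
    q := datastore.NewQuery("Comment").
        Filter("Author =", "alice").
        KeysOnly()
    keys, err := q.GetAll(ctx, nil) // a nil destination is fine for keys-only queries
    if err != nil {
        return nil, err
    }
    for _, k := range keys {
        // The key alone already carries the kind, the numeric ID or string
        // name, any parent keys, and the namespace.
        _ = k.IntID()
    }
    return keys, nil
}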

Basically, the key is built from two pieces of information:
The entity type (in Objectify, it is the class of the object)
The id/name of the entity
So, for a given entity type, the key and the ID are essentially the same thing.
If you do not specify the ID yourself, an ID is generated automatically and the key is created from that generated ID.
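As a concrete illustration, here is how those two pieces show up when you build keys by hand in the Go SDK (a minimal sketch; the Person kind, the e-mail name, and the number 42 are arbitrary placeholders):

import (
    "context"

    "google.golang.org/appengine/datastore"
)

func keyExamples(ctx context.Context) {
    // A key is essentially: kind + (string name OR numeric ID) + optional parent.
    byName := datastore.NewKey(ctx, "Person", "alice@example.com", 0, nil) // string name chosen by you
    byID := datastore.NewKey(ctx, "Person", "", 42, nil)                   // numeric ID

    // An incomplete key asks the datastore to allocate the numeric ID on Put.
    auto := datastore.NewIncompleteKey(ctx, "Person", nil)

    _, _, _ = byName, byID, auto
}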

Related

GAE datastore index vs normalisation

Given the entities below in the Google App Engine datastore, is it better to define an index on reportingIds or to define a separate entity which has only personId and reportingIds fields? Based on the documentation, my understanding is that defining an index increases the count of operations against the datastore quota.
Below are the entities in GAE Go. My code needs to scan through Person entities frequently, and it needs to limit its scan to Person entities that have at least one reporting person. I see two approaches: (1) define an index on reportingIds and query by specifying filters; (2) create/update a PersonWithReporters entity whenever a Person gets a new reporting person. In the second case, my code needs to iterate through all the entities in PersonWithReporters and need not construct any index or query; I can fetch by Key, which is always guaranteed to have the latest data. I am not sure which approach is beneficial considering datastore operation counts against the quota limit.
// Fields are exported (capitalized) so the datastore package can actually store them.
type Person struct {
    Id           string   // unique person ID
    // ... many other personal details, his personal settings, etc.
    ReportingIds []string // IDs of the Person entities this guy manages
}

type PersonWithReporters struct {
    Id           string   // the Person managing reportees
    ReportingIds []string // IDs of the Person entities this guy manages
}
An approach with a separate entity gives you two advantages.
As you have already mentioned, you don't need to index/query all Person entities.
Every time a Person gets a new reporting person, you will create a new entity, which may be significantly cheaper than updating a Person entity which has many other properties, some of which, presumably, are indexed.
Your approach with a separate entity is also not ideal. When you index a property with multiple values, under the hood the Datastore creates an index entry for each value. So, when you add reporting person number 3 to this entity, you have to update 3 index entries instead of 1.
You can optimize your data model even further by creating a Reporter entity with no properties! Every time a new reporting person is added, you create this Reporter entity with ID set to the ID of a reporting person, and make it a child entity of a Person entity representing a person to whom this reporter reports.
Now, when you need to iterate through all persons with someone reporting to them, you run a simple query on this Reporter entity - no filters. This query can be set to keys-only (there is nothing other than a key in this entity anyway, but keys-only queries are treated differently - they are basically free).
For every entity returned by this query you retrieve its key, and this key contains an ID (the ID of a reporting person) and a parent key, which includes the ID of the person to whom this reporter reports.
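A rough Go sketch of this layout, assuming string IDs for persons (the Reporter name and the helper functions are illustrative, not taken from the question):

import (
    "context"

    "google.golang.org/appengine/datastore"
)

// Reporter deliberately has no properties; all information lives in its key.
type Reporter struct{}

// addReporter records that reporterID now reports to managerID.
func addReporter(ctx context.Context, managerID, reporterID string) error {
    parent := datastore.NewKey(ctx, "Person", managerID, 0, nil)
    key := datastore.NewKey(ctx, "Reporter", reporterID, 0, parent)
    _, err := datastore.Put(ctx, key, &Reporter{})
    return err
}

// managersWithReporters returns managerID -> reporter IDs using only keys.
func managersWithReporters(ctx context.Context) (map[string][]string, error) {
    keys, err := datastore.NewQuery("Reporter").KeysOnly().GetAll(ctx, nil)
    if err != nil {
        return nil, err
    }
    out := make(map[string][]string)
    for _, k := range keys {
        manager := k.Parent().StringID() // the person being reported to
        out[manager] = append(out[manager], k.StringID())
    }
    return out, nil
}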
Unless App Engine's datastore in Go is very different from how it works in Java or Python, you cannot index an array natively - so option 1 is out of the question, and so is option 2.
I suggest option three, which is to define a
type PersonWithReporters struct {
    Id          string // concatenate(managing_Person_id, separator, reporter_Person_id) to avoid ID collisions
    ReportingId string // indexed
    ManagingId  string // probably indexed as well
}
You would create multiple of these entities instead of a single entity with an array, and you add an index on ReportingId. Now you can run a filter query on this entity and retrieve the desired information, as sketched below.
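As a hedged sketch of what that filter query could look like in Go, reusing the struct defined just above (filtering on ManagingId is shown; filtering on ReportingId works the same way, and the function name is just an example):

import (
    "context"

    "google.golang.org/appengine/datastore"
)

// reportersOf returns the PersonWithReporters rows for one manager via a
// simple equality filter; both ManagingId and ReportingId can be filtered.
func reportersOf(ctx context.Context, managerID string) ([]PersonWithReporters, error) {
    var rows []PersonWithReporters
    _, err := datastore.NewQuery("PersonWithReporters").
        Filter("ManagingId =", managerID).
        GetAll(ctx, &rows)
    return rows, err
}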
I would worry more about performance and not too much about the quota limits; they are pretty high. Just implement it, see how it works, and check whether quota is really your main concern here.

General database design: Is it ever considered "okay" to create a non-normalized table on purpose?

After-edit: Wow, this question got long. Please forgive =\
I am creating a new table consisting of over 30 columns. These columns are largely populated by selections made from dropdown lists and their options are largely logically related. For example, a dropdown labeled Review Period will have options such as Monthly, Semi-Annually, and Yearly. I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table. With this view in place, the table of record can contain raw data that only the application understands while allowing external applications and admins to run SQL against the view and return data that is translated into friendly information.
It just got complicated. Now these dropdown lists are going to have non-logically-related items. For example, the Review Period dropdown list now needs to have options of NA and Manual. This blows my entire grouping scheme out of the water.
Similar constructs that have been used in this application have resorted to storing repeated string values across multiple records. This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
The only other way I can think of doing this using my initial method, while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time, is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check whether each string value already exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record. It seems so complicated, but is this what one has to go through for the sake of normalized data?
Anything is possible. Nobody is going to haul you off to denormalization jail and revoke your DBA card. I would say that you should know the rules and what breaking them means. Once you have those in hand, it's up to you and your best judgement to do what you think is best.
I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table.
Replacing text with ID numbers has nothing at all to do with normalization. You're describing a choice of surrogate keys over natural keys. Sometimes surrogate keys are a good choice, and sometimes surrogate keys are a bad choice. (More often a bad choice than you might believe.)
This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
Storing the string "Monthly" in multiple rows has nothing to do with normalization. (Or with denormalization.) This seems to be related to the notion that normalization means "replace all text with id numbers". Storing text in your database shouldn't make you cringe. VARCHAR(n) is there for a reason.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record.
Let's think about this informally for a minute.
Foreign keys provide referential integrity. Their purpose is to limit the values allowed in a column. Informally, the referenced table provides a set of valid values. Values that aren't in that table aren't allowed in the referencing column of other tables.
But no matter what the user types in, you're going to add it to that table of valid values.
If you're going to accept everything the user types in the first place, why use a foreign key at all?
The main problem here is that you've been poorly served by the people who taught you (mis-taught you) the relational model. (And, probably, equally poorly by the people who taught you SQL.) I hope you can unlearn those mistaken notions quickly, and soon make real progress.

Fetching by key vs fetching by filter in Google App Engine

I want to be as efficient as possible and plan properly. Since read and write costs are important when using Google App Engine, I want to be sure to minimize those. I'm not understanding the "key" concept in the datastore. What I want to know is would it be more efficient to fetch an entity by its key, considering I know what it is, than by fetching by some kind of filter?
Say I have a model called User, and a user has an array (list) of commentIds. Now I want to get all of this user's comments. I have two options:
The user's array of commentIds is an array of keys, where each key is a key to a Comment entity. Since I have all the keys, I can just fetch all the comments by their keys.
The user's array of commentIds consists of custom identifiers I create myself; in this case, let's just say they're auto-incrementing regular integers, and each comment in the datastore has a unique commentIntegerId. So now, if I wanted to get all the comments, I'd do a filtered fetch for all comments whose ID is in my array of IDs.
Which implementation would be more efficient, and why?
Fetching by key is the fastest way to get an entity from the datastore, since it is the most direct operation and doesn't need to go through an index lookup.
Each time you create an entity (unless you specified a key_name), App Engine will generate a unique (per parent entity) numeric ID; you should use that as the ID for your comments.
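As a rough Go illustration of the by-key path, assuming the user stores its comments' keys as []*datastore.Key (the Comment type and its field are placeholders):

import (
    "context"

    "google.golang.org/appengine/datastore"
)

type Comment struct {
    Text string
}

// commentsFor fetches the user's comments directly by key in one batch call,
// with no query or index lookup involved.
func commentsFor(ctx context.Context, commentKeys []*datastore.Key) ([]Comment, error) {
    comments := make([]Comment, len(commentKeys))
    if err := datastore.GetMulti(ctx, commentKeys, comments); err != nil {
        return nil, err
    }
    return comments, nil
}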
You should design a NoSQL database (such as the GAE Datastore) based on usage patterns:
If you need to get all of a user's comments at once and never need to get one or some of them based on some criteria (e.g. query them), then the most efficient way, in terms of speed and cost, would be to serialize all comments as a binary blob inside an entity (or save it to the Blobstore).
But I guess this is not the case, as comments are usually tied to both users and posts, right? In that case the advice above would not be viable.
To answer your title question: a get by key is always faster than a query by a property, because the query first goes through the index to satisfy the property condition, where it gets the key, and then it does a get with that key.
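Side by side, the two access paths described above might look like this in Go (CommentIntegerId stands in for the custom numeric ID from option 2; all names are illustrative, not from the question):

import (
    "context"

    "google.golang.org/appengine/datastore"
)

type Comment struct {
    CommentIntegerId int64
    Text             string
}

func compareAccessPaths(ctx context.Context) {
    // Option 1: direct get by key - no index lookup, a single direct fetch.
    var byKey Comment
    key := datastore.NewKey(ctx, "Comment", "", 42, nil)
    _ = datastore.Get(ctx, key, &byKey)

    // Option 2: query on a property - the index is read first to find the
    // matching keys, and only then are the entities behind them fetched.
    var byFilter []Comment
    _, _ = datastore.NewQuery("Comment").
        Filter("CommentIntegerId =", 42).
        GetAll(ctx, &byFilter)
}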

Getting values out of DynamoDB

I've just started looking into Amazon's DynamoDB. Obviously the scalability appeals, but I'm trying to get my head out of SQL mode and into no-sql mode. Can this be done (with all the scalability advantages of dynamodb):
Have a load of entries (say 5 - 10 million) indexed by some number. One of the fields in each entry will be a creation date. Is there an effective way for dynamo db to give my web app all the entries created between two dates?
A simpler question - can dynamo db give me all entries in which a field matches a certain number? That is, there'll be another field that is a number; for argument's sake, let's say between 0 and 10. Can I ask dynamodb to give me all the entries which have value e.g. 6?
Do both of these queries need a scan of the entire dataset (which I assume is a problem given the dataset size?)
many thanks
Is there an effective way for dynamo db to give my web app all the entries created between two dates?
Yup, please have a look at the Primary Key concept within the Amazon DynamoDB Data Model, specifically the Hash and Range Type Primary Key:
In this case, the primary key is made of two attributes. The first attribute is the hash attribute and the second one is the range attribute. Amazon DynamoDB builds an unordered hash index on the hash primary key attribute and a sorted range index on the range primary key attribute. [...]
The listed samples cover your use case exactly: the Reply (Id, ReplyDateTime, ...) table uses a primary key of type Hash and Range, with the hash attribute Id and the range attribute ReplyDateTime.
You'll use this via the Query API, see RangeKeyCondition for details and Querying Tables in Amazon DynamoDB for respective examples.
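A hedged Go sketch of such a date-range Query with aws-sdk-go, using the Reply / Id / ReplyDateTime names from the documentation example; note it uses KeyConditionExpression, the newer expression-style parameter, rather than the legacy RangeKeyCondition the links above describe:

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/dynamodb"
)

// repliesBetween queries one hash key (Id) for all items whose range key
// (ReplyDateTime) falls between two dates - a Query, not a Scan.
func repliesBetween(threadID, from, to string) (*dynamodb.QueryOutput, error) {
    svc := dynamodb.New(session.Must(session.NewSession()))
    return svc.Query(&dynamodb.QueryInput{
        TableName:              aws.String("Reply"),
        KeyConditionExpression: aws.String("Id = :id AND ReplyDateTime BETWEEN :from AND :to"),
        ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
            ":id":   {S: aws.String(threadID)},
            ":from": {S: aws.String(from)},
            ":to":   {S: aws.String(to)},
        },
    })
}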
can dynamo db give me all entries in which a field matches a certain number. [...] Can I ask dynamodb to give me all the entries which have value e.g. 6?
This is possible as well, albeit by means of the Scan API only (i.e. it does require reading every item in the table); see ScanFilter for details and Scanning Tables in Amazon DynamoDB for respective examples.
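And a corresponding Scan sketch in Go; it uses FilterExpression, the expression-style equivalent of the legacy ScanFilter parameter, and the Score attribute name is purely an assumption:

import (
    "strconv"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/dynamodb"
)

// itemsWithScore scans the whole table and keeps only items whose Score
// attribute equals the given number; every item is still read and billed.
func itemsWithScore(table string, score int) (*dynamodb.ScanOutput, error) {
    svc := dynamodb.New(session.Must(session.NewSession()))
    return svc.Scan(&dynamodb.ScanInput{
        TableName:        aws.String(table),
        FilterExpression: aws.String("Score = :s"),
        ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
            ":s": {N: aws.String(strconv.Itoa(score))},
        },
    })
}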
Do both of these queries need a scan of the entire dataset (which I assume is a problem given the dataset size?)
As mentioned, the first approach works with a Query while the second requires a Scan, and generally, a query operation is more efficient than a scan operation. This is good advice to get started, though the details are more complex and depend on your use case; see the section Scan and Query Performance within the Query and Scan in Amazon DynamoDB overview:
For quicker response times, design your tables in a way that can use the Query, Get, or BatchGetItem APIs, instead. Or, design your application to use scan operations in a way that minimizes the impact on your table's request rate. For more information, see Provisioned Throughput Guidelines in Amazon DynamoDB.
So, as usual when applying NoSQL solutions, you might need to adjust your architecture to accommodate these constraints.

Use a ListProperty or custom tuple property in App Engine?

I'm developing an application with Google App Engine and stumbled across the following scenario, which can perhaps be described as "MVP-lite".
When modeling many-to-many relationships, the standard property to use is the ListProperty. Most likely, your list is comprised of the foreign keys of another model.
However, in most practical applications, you'll usually want at least one more detail when you get a list of keys - the object's name - so you can construct a nice hyperlink to that object. This requires looping through your list of keys and grabbing each object to use its "name" property.
Is this the best approach? Because "reads are cheap", is it okay to get each object even if I'm only using one property for now? Or should I use a special property like tipfy's JsonProperty to save a (key, name) "tuple" to avoid the extra gets?
Though datastore reads are comparatively cheaper than datastore writes, they can still add significant time to a request handler. Including the objects' names as well as their foreign keys sounds like a good use of denormalization (e.g., use two list properties to simulate a tuple: one containing the foreign keys and the other containing the corresponding names).
If you decide against this denormalization, then I suggest you batch-fetch the entities which the foreign keys refer to (rather than getting them one by one) so that you at least minimize the number of round trips you make to the datastore.
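A small Go sketch of both ideas, since the Python ListProperty discussion translates directly to slice fields (Post, Tag, and all field names here are placeholders):

import (
    "context"

    "google.golang.org/appengine/datastore"
)

// Post denormalizes the linked objects' names next to their keys, so
// building hyperlinks needs no extra datastore reads.
type Post struct {
    TagKeys  []*datastore.Key // the "foreign keys"
    TagNames []string         // kept in sync with TagKeys; index i matches key i
}

// If you skip the denormalization, batch-fetch instead of getting one by one.
func tagNames(ctx context.Context, keys []*datastore.Key) ([]string, error) {
    type Tag struct{ Name string }
    tags := make([]Tag, len(keys))
    if err := datastore.GetMulti(ctx, keys, tags); err != nil {
        return nil, err
    }
    names := make([]string, len(tags))
    for i, t := range tags {
        names[i] = t.Name
    }
    return names, nil
}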
When modeling one-to-many (or in some cases, many-to-many) relationships, the standard property to use is the ListProperty.
No, when modeling one-to-many relationships, the standard property to use is a ReferenceProperty, on the 'many' side. Then, you can use a query to retrieve all matching entities.
Returning to your original question: If you need more data, denormalize. Store a list of titles alongside the list of keys.
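In Go the equivalent of a ReferenceProperty is a key field on the 'many' side plus a query on it (a sketch; the Comment and Author names are placeholders):

import (
    "context"

    "google.golang.org/appengine/datastore"
)

// Comment points back at its User - the counterpart of a ReferenceProperty.
type Comment struct {
    Author *datastore.Key
    Text   string
}

// commentsByAuthor queries the 'many' side for everything referencing one user.
func commentsByAuthor(ctx context.Context, userKey *datastore.Key) ([]Comment, error) {
    var cs []Comment
    _, err := datastore.NewQuery("Comment").
        Filter("Author =", userKey).
        GetAll(ctx, &cs)
    return cs, err
}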
