Fetching by key vs fetching by filter in Google App Engine - database

I want to be as efficient as possible and plan properly. Since read and write costs are important when using Google App Engine, I want to be sure to minimize those. I'm not understanding the "key" concept in the datastore. What I want to know is would it be more efficient to fetch an entity by its key, considering I know what it is, than by fetching by some kind of filter?
Say I have a model called User and a user has an array(list) of commentIds. Now I want to get all this user's comments. I have two options:
The user's array of commentId's is an array of keys, where each key is a key to a Comment entity. Since I have all the keys, I can just fetch all the comments by their keys.
The user's array of commentId's are custom made identifiers by me, in this case let's just say that they're auto-incrementing regular integers, and each comment in the datastore has a unique commentIntegerId. So now if I wanted to get all the comments, I'd do a filtered fetch based on all comments with ID that is in my array of ids.
Which implementation would be more efficient, and why?

Fetching by key is the fastest way to get an entity from the datastore, since it is the most direct operation and doesn't need to go through an index lookup.
Each time you create an entity (unless you specify a key_name), App Engine will generate a numeric ID that is unique per parent entity. You should use that as the ID for your comments.

You should design a NoSql database (= GAE Datastore) based on usage patterns:
If you need to get all of a user's comments at once and never need to get one or some of them based on some criteria (e.g. query them), then the most efficient way, in terms of speed and cost, would be to serialize all comments as a binary blob inside an entity (or save it to Blobstore).
But I guess this is not the case, as comments are usually tied to both users and posts, right? In that case the above advice would not be viable.
To answer your title question: a get by key is always faster than a query by property, because the query first goes through an index to satisfy the property condition, where it finds the key, and then does the get with that key.
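The difference between the two options can be sketched with a toy in-memory model. This is not the App Engine API: ToyDatastore and its read counters are hypothetical names, used only to illustrate that a filtered fetch does an index lookup on top of the same entity reads.

```python
# Toy in-memory model of a datastore (NOT the App Engine API).
class ToyDatastore:
    def __init__(self):
        self.entities = {}     # key -> entity dict
        self.index = {}        # (property, value) -> list of keys
        self.index_reads = 0
        self.entity_reads = 0

    def put(self, key, entity):
        self.entities[key] = entity
        for prop, value in entity.items():
            self.index.setdefault((prop, value), []).append(key)

    def get(self, key):
        """Direct lookup: one entity read, no index involved."""
        self.entity_reads += 1
        return self.entities[key]

    def query(self, prop, value):
        """Filtered fetch: consult the index first, then get each matching key."""
        self.index_reads += 1
        keys = self.index.get((prop, value), [])
        return [self.get(k) for k in keys]


store = ToyDatastore()
for n in range(1, 4):
    store.put(("Comment", n), {"commentIntegerId": 100 + n, "text": f"comment {n}"})

# Option 1: the user stores keys, so each comment is a direct get.
by_key = [store.get(("Comment", n)) for n in (1, 2)]
key_cost = (store.index_reads, store.entity_reads)

# Option 2: the user stores custom integer ids, so each id needs a query.
store.index_reads = store.entity_reads = 0
by_filter = [e for cid in (101, 102) for e in store.query("commentIntegerId", cid)]
filter_cost = (store.index_reads, store.entity_reads)
```

Both options return the same comments, but in this model the filtered version pays for an extra index read per id.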

Related

Amazon DynamoDB Single Table Design For Blog Application

New to this community. I need some help in designing the Amazon Dynamo DB table for my personal projects.
Overview, this is a simple photo gallery application with following attributes.
UserID
PostID
S3URL
Caption
Likes
Reports
UploadTime
I wish to perform the following queries:
For a given user, fetch 'N' most recent posts
For a given user, fetch 'N' most liked posts
Give 'N' most recent posts (Newsfeed)
Give 'N' most liked posts (Newsfeed)
My solution:
Keeping UserID as the partition key, PostID as the sort key, and Likes and UploadTime as local secondary indexes, I can solve the first two queries.
I'm confused about how to perform the query operation for 3 and 4 (Newsfeed). I know that without a partition key I cannot query, and a scan is not an effective solution. Is there any workaround for operations 3 and 4?
Any idea on how should I design my DB ?
It looks like you're off to a great start with your current design, well done!
For access pattern #3, you want to fetch the most recent posts. One way to approach this is to create a global secondary index (GSI) to aggregate posts by their creation time. For example, you could create an attribute named GSI1PK on your main table, assign it a value of POSTS, and use the upload_time field as the sort key. That would look something like this:
Viewing the secondary index (I've named it GSI1), your data would look like this:
This would allow you to query for Posts and sort by upload_time. This is a great start. However, your POSTS partition will grow quite large over time. Instead of choosing POSTS as the partition key for your secondary index, consider using a truncated timestamp to group posts by date. For example, here's how you could store posts by the month they were created:
Storing posts using a truncated timestamp will help you distribute your data across partitions, which will help your DB scale. If a month is too long, you could use truncated timestamps for a week/day/hour/etc. Whatever makes sense.
To fetch the N most recent posts, you'd simply query your secondary index for POSTS in the current month (e.g. POSTS#2021-01-00). If you don't get enough results, run the same query against the prior month (e.g. POSTS#2020-12-00). Keep doing this until your application has enough posts to show the client.
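The "walk back one month at a time" loop described above can be sketched as follows. The names month_partitions, fetch_recent_posts and query_partition are hypothetical; query_partition stands in for whatever function your application uses to query the secondary index for one partition key, newest first, with a limit.

```python
from datetime import date


def month_partitions(start, limit):
    """Yield partition keys for the month of `start` and the months before it."""
    year, month = start.year, start.month
    for _ in range(limit):
        yield f"POSTS#{year:04d}-{month:02d}-00"
        month -= 1
        if month == 0:
            year, month = year - 1, 12


def fetch_recent_posts(query_partition, n, start, max_months=12):
    """Collect up to n posts, querying older months only while more are needed."""
    posts = []
    for pk in month_partitions(start, max_months):
        posts.extend(query_partition(pk, n - len(posts)))
        if len(posts) >= n:
            break
    return posts[:n]


# Stand-in for the real index: each partition lists its posts newest first.
fake_index = {
    "POSTS#2021-01-00": ["p5", "p4"],
    "POSTS#2020-12-00": ["p3", "p2", "p1"],
}


def fake_query(pk, limit):
    return fake_index.get(pk, [])[:limit]


recent = fetch_recent_posts(fake_query, 4, date(2021, 1, 15))
```

The max_months cap is an assumption added here so the loop terminates even when the table holds fewer than N posts.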
For the fourth access pattern, you'd like to fetch the most liked posts. One way to implement this access pattern is to define another GSI with "LIKES" as the partition key and the number of likes as the sort key.
If you intend on introducing a date range on the number of likes (e.g. most popular posts this week/month/year/etc.) you could utilize the truncated timestamp approach I outlined for the previous access pattern.
When you find yourself implementing "fetch most recent" access patterns, you may want to check out KSUIDs. KSUIDs, or K-Sortable Unique Identifiers, are unique identifiers that are sortable by their creation date/time. Think of them as UUIDs and timestamps combined into one attribute. This could be useful in supporting your first access pattern, where you are fetching the most recent posts for a user. If you were to use a KSUID for the Post ID, your table would look like this:
I've replaced the POST ID's in this example with KSUIDs. Because the KSUIDs are unique and sortable by the time they were created, you are able to support your first access pattern without any additional indexing.
There are KSUID libraries for most popular programming languages, so implementing this feature is pretty simple.
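To show why a time-prefixed identifier sorts by creation time, here is a simplified sketch. It is not the real KSUID encoding (actual KSUIDs use a custom epoch and base62 text); sortable_id is a hypothetical helper that only mimics the layout of a timestamp followed by random bytes.

```python
import os
import struct


def sortable_id(unix_seconds):
    """4-byte big-endian timestamp followed by 16 random bytes, hex-encoded."""
    return (struct.pack(">I", unix_seconds) + os.urandom(16)).hex()


older = sortable_id(1_609_459_200)  # 2021-01-01T00:00:00Z
newer = sortable_id(1_612_137_600)  # 2021-02-01T00:00:00Z

# Plain string sorting puts the ids in creation order, whatever the random part is.
ordered = sorted([newer, older])
```

Because the timestamp comes first and has a fixed width, a sort key holding such an id returns a user's posts in chronological order. In practice you would use one of the existing KSUID libraries instead of this sketch.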
You could add two Global Secondary Indexes.
For 3):
Create a static attribute type with the value post, which serves as the Partition Key for the GSI and use the attribute UploadTime as the Sort Key. You can then query for type="post" and get the most recent items based on the sort key.
The solution for 4) is very similar:
Create another Global secondary index with the aforementioned item type as the partition key and Likes as the sort key. You can then query in a similar way as above. Note, that GSIs are eventually consistent, so it may take time until your like counters are updated.
Explanation and additional infos
Using this approach you group all posts in a single item collection, which allows for efficient queries. To save on storage space and RCUs, you can also choose to only project a subset of attributes into the index.
If you have more than 10GB of post-data, this design isn't ideal, but for a smaller application it will work fine.
If you're going for a Single Table Design, I'd recommend using generic names for the index attributes: PK, SK, GSI1PK, GSI1SK, GSI2PK, GSI2SK. You can then duplicate the attribute values into these attributes. This will make it less confusing if you store different entities in the table. Adding a type attribute that holds the entity type is also common.
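As a rough illustration of the generic attribute names, a post item could be assembled like this before being written to the table. The build_post_item helper and the USER#/POST# key prefixes are assumptions made for this sketch, not something the table design above prescribes.

```python
def build_post_item(user_id, post_id, upload_time, likes):
    """Build one post item; the values are duplicated into the generic key attributes."""
    return {
        "PK": f"USER#{user_id}",   # main table partition key (prefix is an assumption)
        "SK": f"POST#{post_id}",   # main table sort key (prefix is an assumption)
        "type": "post",            # entity type attribute
        "UploadTime": upload_time,
        "Likes": likes,
        "GSI1PK": "post",          # index 1: all posts, sorted by upload time
        "GSI1SK": upload_time,
        "GSI2PK": "post",          # index 2: all posts, sorted by likes
        "GSI2SK": likes,
    }


item = build_post_item("u1", "p42", "2021-01-15T10:00:00Z", 7)
```

The original attributes stay readable under their own names, while the indexes only ever see the generic ones.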

When should I use ObjectId vs UUID in MongoDB

I'm making a simple CRUD application with MongoDB so I can learn more about it.
The application is a simple blog, I have a collection named "articles" which stores various documents, each one representing a post for my blog.
When I display the list of all blog posts, I can do a db.collection.find(), and list all of them.
But the question lies when I need to show a single post individually, when I need to query the collection for a single, specific document.
The logical solution would be to use a RDBMS and an auto increment feature, but MongoDB is NoSQL and does not have auto increment.
I'm using the auto generated _id field of the document which stores an ObjectId by default, which means that my url's look like this:
http://localhost/blog/article.php?_id=5d41f6e5fc1a2f3d80645185
I saw in the documentation that the ObjectId contains a unique identifier for the server, together with a timestamp and a counter, isn't exposing these things a security risk?
As a solution, I stumbled into UUID https://docs.mongodb.com/manual/reference/method/UUID/ which is an auto-generated unique ID, that doesn't expose timestamp and machine info in it. It seems like a logical solution to use this instead of the _id that contains my ObjectId for querying and finding a document.
So I can make my url's look like this:
http://localhost/blog/article.php?_id=23829651-26f7-4092-99d0-5be8658c966e
But still, should I keep the _id property? should I add another one called "id" that stores the UUID? should I even use UUID's at all?
Here's what I would consider before choosing an identifier:
Collision
Risk of collision is very low for both UUIDs and ObjectIDs. This has been discussed in detail in another question.
Nature
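UUIDs are random, whereas ObjectID values always increase over time. This makes ObjectIDs a poor choice as a range-based shard key, because every new insert lands at the high end of the key range and therefore on the same shard.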
Other uses
ObjectIDs include the creation timestamp and can be used as a substitute for the commonly used createdAt field. A sort by ObjectID is a sort by creation time.
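The timestamp can be recovered without any MongoDB driver, since the first four bytes of an ObjectID are the creation time in seconds since the Unix epoch. A minimal sketch, where object_id_created_at is a hypothetical helper name, using the id from the question's URL:

```python
from datetime import datetime, timezone


def object_id_created_at(object_id_hex):
    """Decode the leading 4 bytes (8 hex characters) as a Unix timestamp in UTC."""
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = object_id_created_at("5d41f6e5fc1a2f3d80645185")
print(created.isoformat())  # 2019-07-31T20:15:33+00:00
```

This is also why anyone who sees the id in a URL can tell when the document was created.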
Insecure object references (OWASP)
Short definition: an attacker who has the ID of one object should not be able to deduce the ID of another object. You can read more about this here. Neither UUIDs nor ObjectIDs are vulnerable to this.
Link to another question that discusses the security of ObjectIDs (thanks zbee).
Ease of use
Note: This is subjective
Using ObjectIDs is a lot easier in the Mongo ecosystem. The existence of special aggregation operators for dealing with ObjectIDs, plus library support, adds to this.
Portability
UUIDs are more portable than ObjectIDs. I do not know of any other system that uses ObjectIDs internally except for Mongo. Whereas there are other DBs such as Postgres that have a special data type for UUIDs + extensions for random generation etc.

Datastore why use key and id?

I had a question regarding why Google App Engine's Datastore uses a key and an ID. Coming from a relational database background, I am comparing entities with rows, so why does storing an entity require a key (which is a long automatically generated string) and an ID (which can be manually or automatically entered)? This seems like a big waste of space to identify a record. Again, I am new to this type of database, so I may be missing something.
Key design is a critical part of efficient Datastore operations. The keys are what are stored in the built-in and custom indexes and when you are querying, you can ask to have only keys returned (in Python: keys_only=True). A keys-only query costs a fraction of a regular query, both in $$ and to a lesser extent in time, and has very low deserialization overhead.
So, if you have useful/interesting things stored in your key id's, you can perform keys-only queries and get back lots of useful data in a hurry and very cheaply.
Note that this extends into parent keys and namespaces, which are all part of the key and therefore additional places you can "store" useful data and retrieve all of it with keys-only queries.
It's an important optimization to understand and a big part of our overall design.
Basically, the key is built from two pieces of information :
The entity type (in Objectify, it is the class of the object)
The id/name of the entity
So, for a given entity type, key and id are quite the same.
If you do not specify the ID yourself, then a random ID is generated and the key is created based on that random id.

App Engine Datastore: entity design and query optimization

I have a system where users can vote on entities, if they like or hate them. It will be bazillion votes and trazillion records, hopefully, some time in the future :)
At the moment i store a vote in an Entity like this:
UserRecordVote: recordId, userId, hateOrLike
And when i want to get every Record the user liked i do a query like this:
I query the "UserRecordVote" table for all the "likes", then i take the recordIds from that resultset, create a key of that property and get the record from the Record Table.
Then i aggregate all that in a list and return it.
Here's the question:
I came up with a different approach and i want to find out if that one is 1. faster and 2. how much is the difference in cost.
I would create an Entity whose name would be userId + "likes" and whose key would be the record id:
new Entity(userId + "likes", recordId)
So when i would do a query to get all the likes i could simply query for all, no filters needed. AND i could just grab the entity key! which would be much cheaper if i remember the documentation of app engine right. (can't find the pricing page anymore). Then i could take the Iterable of keys and do a single get(Iterable keys). Ok so i guess this approach is faster and cheaper right? But what if i want to grab all the votes of a user or better said, i want to grab all the records a user didn't vote on yet.
Here's the real question:
I want to load all the records a user didn't vote on yet:
So i would have entities like this:
new Entity(userId+"likes", recordId);
and
new Entity(userId+"hates", recordId);
I would query both vote tables for all entity keys and query the record table for all entity keys. Then i would remove all the record entity keys matching one of the vote entity keys, and with the result i would get(Iterable keys) the full entities and have all the record entities which are not in one of the two voting tables.
Is that a useful approach? Is that the fastest and cost efficient way to do a datastore query? Am i totally wrong and i should store the information as list properties?
EDIT:
With that approach i would have 2 entity groups for each user, which would result in millions of different entity groups. How would GAE Datastore handle that? At least the Datastore Viewer entity select box would probably crash :) ?
To answer the Real Question, you probably want to have your hateOrLike field store an integer that indicates either hated/liked/notvoted. Then you can filter on hateOrLike=notVoted.
The other solutions you propose with the dynamically named entities make it impossible to query on other aspects of your entities, since you don't know their names.
The other thing is that, since you expect this to be huge, you will likely want to keep a running counter of your votes rather than tabulating them every time you pull up a UserRecord. Querying all the votes and then counting them on each view is very slow, especially since App Engine will only return 1000 results per query; if you have more than 1000 votes, you'll have to make repeated queries to get all the results.
If you think people will vote quickly, you should look into using a sharded counter for performance. There are examples with code available if you search for them.
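The idea behind a sharded counter can be sketched in plain Python. This is an in-memory illustration only, with a hypothetical ShardedCounter class; in the datastore each shard would be its own entity, so that concurrent votes rarely contend on the same one.

```python
import random


class ShardedCounter:
    """Spread increments across several shards; the total is the sum of all shards."""

    def __init__(self, num_shards=20):
        self.shards = [0] * num_shards

    def increment(self, rng=random):
        # Pick a shard at random so writers rarely update the same one.
        self.shards[rng.randrange(len(self.shards))] += 1

    def total(self):
        return sum(self.shards)


likes = ShardedCounter(num_shards=5)
for _ in range(1000):
    likes.increment()

print(likes.total())  # 1000
```

Reading the total costs one read per shard, so the shard count is a trade-off between write throughput and read cost.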
Consider serializing user hate/like votes in two separate TextProperties inside the entity. Use the userId as key_name.
rec = UserRecordVote.get_by_key_name(userId)
hates = len(rec.hates.split('_')) if rec.hates else 0  # ''.split('_') returns [''], so guard the empty case
etc.

Getting values out of DynamoDB

I've just started looking into Amazon's DynamoDB. Obviously the scalability appeals, but I'm trying to get my head out of SQL mode and into no-sql mode. Can this be done (with all the scalability advantages of dynamodb):
Have a load of entries (say 5 - 10 million) indexed by some number. One of the fields in each entry will be a creation date. Is there an effective way for dynamo db to give my web app all the entries created between two dates?
A more simple question - can dynamo db give me all entries in which a field matches a certain number. That is, there'll be another field that is a number, for argument's sake lets say between 0 and 10. Can I ask dynamodb to give me all the entries which have value e.g. 6?
Do both of these queries need a scan of the entire dataset (which I assume is a problem given the dataset size?)
many thanks
Is there an effective way for dynamo db to give my web app all the
entries created between two dates?
Yup, please have a look at the Primary Key concept within the Amazon DynamoDB Data Model, specifically the Hash and Range Type Primary Key:
In this case, the primary key is made of two attributes. The first
attributes is the hash attribute and the second one is the range
attribute. Amazon DynamoDB builds an unordered hash index on the hash
primary key attribute and a sorted range index on the range primary
key attribute. [...]
The listed samples feature your use case exactly, namely the Reply ( Id, ReplyDateTime, ... ) table facilitates a primary key of type Hash and Range with a hash attribute Id and a range attribute ReplyDateTime.
You'll use this via the Query API, see RangeKeyCondition for details and Querying Tables in Amazon DynamoDB for respective examples.
can dynamo db give me all entries in which a field matches a certain
number. [...] Can I ask dynamodb to give
me all the entries which have value e.g. 6?
This is possible as well, albeit by means of the Scan API only (i.e. it does indeed require reading every item in the table), see ScanFilter for details and Scanning Tables in Amazon DynamoDB for respective examples.
Do both of these queries need a scan of the entire dataset (which I
assume is a problem given the dataset size?)
As mentioned, the first approach works with a Query while the second requires a Scan, and generally a Query operation is more efficient than a Scan operation. This is good advice to get started, though the details are more complex and depend on your use case, see section Scan and Query Performance within the Query and Scan in Amazon DynamoDB overview:
For quicker response times, design your tables in a way that can use
the Query, Get, or BatchGetItem APIs, instead. Or, design your
application to use scan operations in a way that minimizes the impact
on your table's request rate. For more information, see Provisioned Throughput Guidelines in Amazon DynamoDB.
So, as usual when applying NoSQL solutions, you might need to adjust your architecture to accommodate these constraints.
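The difference between the two operations can be sketched with a toy model of a hash-and-range table. This is not the DynamoDB API: ToyTable and its items_read counter are hypothetical, used only to show that a Query touches the matching slice of one partition while a Scan reads every item.

```python
from bisect import bisect_left, bisect_right


class ToyTable:
    """Toy hash-and-range table (NOT the DynamoDB API)."""

    def __init__(self):
        self.partitions = {}  # hash key -> list of (range key, item), kept sorted
        self.items_read = 0

    def put(self, hash_key, range_key, item):
        rows = self.partitions.setdefault(hash_key, [])
        rows.append((range_key, item))
        rows.sort(key=lambda row: row[0])

    def query(self, hash_key, low, high):
        """Read only the items of one hash key whose range key lies in [low, high]."""
        rows = self.partitions.get(hash_key, [])
        keys = [row[0] for row in rows]
        selected = rows[bisect_left(keys, low):bisect_right(keys, high)]
        self.items_read += len(selected)
        return [item for _, item in selected]

    def scan(self, predicate):
        """Read every item in the table and filter afterwards."""
        result = []
        for rows in self.partitions.values():
            for _, item in rows:
                self.items_read += 1
                if predicate(item):
                    result.append(item)
        return result


table = ToyTable()
for day in range(1, 11):
    table.put("thread-1", f"2012-01-{day:02d}", {"day": day, "score": day % 3})

in_range = table.query("thread-1", "2012-01-03", "2012-01-05")
query_reads = table.items_read

table.items_read = 0
matching = table.scan(lambda item: item["score"] == 0)
scan_reads = table.items_read
```

Both calls return three items here, but the query reads 3 items while the scan reads all 10, and that gap grows with the size of the table.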
