In MySQL I used auto-increment to generate an ID for every user. I would like to create a similar user table in Google Datastore where the ID for a user will be unique. According to these docs: https://cloud.google.com/appengine/docs/java/datastore/entities
System-allocated ID values are guaranteed unique to the entity group.
But according to this post: Ever see duplicate IDs when using Google App Engine and ndb? the IDs are not unique. I need this ID to be unique. It is confusing because the docs say the ID is unique, but the post says the ID is not unique and that it is the key that is unique. My objective is for no two users to have the same ID. How can I guarantee this? I would prefer the database to take care of this for me, as opposed to having to create large IDs manually using things such as UUIDs.
As Igor correctly observed, IDs are always unique as long as the entity has no parent.
I can't think of any reason to make user entities children of some other entities, so you are safe.
Note that IDs will not be sequential; this helps spread the load evenly across the entire dataset and is a by-product of how the Datastore is designed.
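For example, with the Python ndb client (the model and property here are just for illustration), saving an entity without a parent lets Datastore allocate the ID for you:

    from google.appengine.ext import ndb  # classic GAE runtime; google.cloud.ndb works similarly

    class User(ndb.Model):
        # no parent key is ever set, so each entity lives in its own entity group
        email = ndb.StringProperty()

    user = User(email='alice@example.com')
    key = user.put()   # Datastore allocates an ID no other parentless User entity will get
    print(key.id())    # e.g. 5629499534213120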
Related
New to this community. I need some help in designing an Amazon DynamoDB table for my personal project.
Overview: this is a simple photo gallery application with the following attributes.
UserID
PostID
S3URL
Caption
Likes
Reports
UploadTime
I wish to perform the following queries:
For a given user, fetch 'N' most recent posts
For a given user, fetch 'N' most liked posts
Give 'N' most recent posts (Newsfeed)
Give 'N' most liked posts (Newsfeed)
My solution:
Keeping UserID as the partition key and PostID as the sort key, with Likes and UploadTime as local secondary indexes, I can solve the first two queries; a rough table definition is sketched below.
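Here is roughly how I would create that table with boto3 (names are just my working choices):

    import boto3

    dynamodb = boto3.client('dynamodb')

    dynamodb.create_table(
        TableName='PhotoGallery',
        AttributeDefinitions=[
            {'AttributeName': 'UserID', 'AttributeType': 'S'},
            {'AttributeName': 'PostID', 'AttributeType': 'S'},
            {'AttributeName': 'UploadTime', 'AttributeType': 'S'},
            {'AttributeName': 'Likes', 'AttributeType': 'N'},
        ],
        KeySchema=[
            {'AttributeName': 'UserID', 'KeyType': 'HASH'},   # partition key
            {'AttributeName': 'PostID', 'KeyType': 'RANGE'},  # sort key
        ],
        LocalSecondaryIndexes=[
            {   # query 1: N most recent posts of a user
                'IndexName': 'ByUploadTime',
                'KeySchema': [
                    {'AttributeName': 'UserID', 'KeyType': 'HASH'},
                    {'AttributeName': 'UploadTime', 'KeyType': 'RANGE'},
                ],
                'Projection': {'ProjectionType': 'ALL'},
            },
            {   # query 2: N most liked posts of a user
                'IndexName': 'ByLikes',
                'KeySchema': [
                    {'AttributeName': 'UserID', 'KeyType': 'HASH'},
                    {'AttributeName': 'Likes', 'KeyType': 'RANGE'},
                ],
                'Projection': {'ProjectionType': 'ALL'},
            },
        ],
        BillingMode='PAY_PER_REQUEST',
    )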
I'm confused about how to perform the query operation for 3 and 4 (Newsfeed). I know that without a partition key I cannot query, and a scan is not an effective solution. Is there any workaround for operations 3 and 4?
Any ideas on how I should design my DB?
It looks like you're off to a great start with your current design, well done!
For access pattern #3, you want to fetch the most recent posts. One way to approach this is to create a global secondary index (GSI) to aggregate posts by their creation time. For example, you could create an attribute named GSI1PK on your main table, assign it the value POSTS, and use the upload_time field as the sort key. That would look something like this:
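For illustration (attribute names and values are assumed), a single post item would then carry the extra attribute like this:

    post_item = {
        'UserID':      'USER#alice',             # table partition key
        'PostID':      'POST#1234',              # table sort key
        's3_url':      's3://bucket/photo.jpg',
        'caption':     'sunset over the bay',
        'likes':       42,
        'upload_time': '2021-01-15T09:30:00Z',
        'GSI1PK':      'POSTS',                  # same constant value on every post;
                                                 # GSI1 = (GSI1PK, upload_time)
    }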
Viewing the secondary index (I've named it GSI1), every post ends up in the single POSTS partition, sorted by upload_time.
This would allow you to query for Posts and sort by upload_time. This is a great start. However, your POSTS partition will grow quite large over time. Instead of choosing POSTS as the partition key for your secondary index, consider using a truncated timestamp to group posts by date. For example, here's how you could store posts by the month they were created:
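For illustration, that value could be derived from the upload time like this (the exact format is up to you):

    from datetime import datetime, timezone

    def gsi1pk_for(upload_time: datetime) -> str:
        # all posts from the same month share one GSI partition, e.g. 'POSTS#2021-01-00'
        return upload_time.strftime('POSTS#%Y-%m-00')

    print(gsi1pk_for(datetime(2021, 1, 15, 9, 30, tzinfo=timezone.utc)))  # POSTS#2021-01-00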
Storing posts using a truncated timestamp will help you distribute your data across partitions, which will help your DB scale. If a month is too long, you could use truncated timestamps for a week/day/hour/etc. Whatever makes sense.
To fetch the N most recent posts, you'd simply query your secondary index for POSTS in the current month (e.g. POSTS#2021-01-00). If you don't get enough results, run the same query against the prior month (e.g. POSTS#2020-12-00). Keep doing this until your application has enough posts to show the client.
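Here's a rough sketch of that loop using boto3 (table, index, and attribute names are assumed from the examples above):

    import boto3
    from boto3.dynamodb.conditions import Key
    from datetime import datetime, timedelta, timezone

    table = boto3.resource('dynamodb').Table('PhotoGallery')

    def most_recent_posts(n, max_months_back=12):
        posts = []
        month = datetime.now(timezone.utc)
        for _ in range(max_months_back):
            if len(posts) >= n:
                break
            resp = table.query(
                IndexName='GSI1',
                KeyConditionExpression=Key('GSI1PK').eq(month.strftime('POSTS#%Y-%m-00')),
                ScanIndexForward=False,        # newest first within the month
                Limit=n - len(posts),
            )
            posts.extend(resp['Items'])
            month = month.replace(day=1) - timedelta(days=1)   # step back to the prior month
        return posts[:n]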
For the fourth access pattern, you'd like to fetch the most liked posts. One way to implement this access pattern is to define another GSI with "LIKES" as the partition key and the number of likes as the sort key.
If you intend on introducing a date range on the number of likes (e.g. most popular posts this week/month/year/etc.), you could utilize the truncated timestamp approach I outlined for the previous access pattern.
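Querying that index would look roughly like this (GSI2 and GSI2PK are assumed names; the table name is reused from the sketch above):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource('dynamodb').Table('PhotoGallery')

    # every post item also carries GSI2PK = 'LIKES'; GSI2 sorts on the numeric Likes attribute
    resp = table.query(
        IndexName='GSI2',
        KeyConditionExpression=Key('GSI2PK').eq('LIKES'),
        ScanIndexForward=False,   # highest like count first
        Limit=10,
    )
    most_liked = resp['Items']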
When you find yourself with "fetch most recent" access patterns, you may want to check out KSUIDs. A KSUID, or K-Sortable Unique Identifier, is a unique identifier that is sortable by its creation date/time. Think of it as a UUID and a timestamp combined into one attribute. This could be useful in supporting your first access pattern, where you are fetching the most recent posts for a user: if you were to use a KSUID for the Post ID, the sort key itself would order each user's posts by creation time.
Because KSUIDs are unique and sortable by the time they were created, you would be able to support your first access pattern without any additional indexing.
There are KSUID libraries for most popular programming languages, so implementing this feature is pretty simple.
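If you just want to see the idea, here is a simplified sketch (this is not the real KSUID format, which uses a custom epoch and base62 encoding): a sortable timestamp prefix plus random bytes gives you IDs that order by creation time.

    import os
    import time

    def sortable_post_id() -> str:
        # zero-padded seconds since the epoch, so string comparison == time comparison,
        # followed by random bytes for uniqueness (ordering within the same second is random)
        return f"{int(time.time()):012d}-{os.urandom(16).hex()}"

    print(sortable_post_id())  # e.g. '001610706600-9f86d081884c7d65...'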
You could add two Global Secondary Indexes.
For 3):
Create a static attribute type with the value post, which serves as the Partition Key for the GSI and use the attribute UploadTime as the Sort Key. You can then query for type="post" and get the most recent items based on the sort key.
The solution for 4) is very similar:
Create another global secondary index with the aforementioned item type as the partition key and Likes as the sort key. You can then query in a similar way as above. Note that GSIs are eventually consistent, so it may take some time until your like counters are updated.
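In boto3 terms (table, index, and attribute names are assumed), the two queries would look roughly like this:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource('dynamodb').Table('PhotoGallery')

    # 3) most recent posts: GSI with 'type' as partition key and UploadTime as sort key
    recent = table.query(
        IndexName='TypeByUploadTime',
        KeyConditionExpression=Key('type').eq('post'),
        ScanIndexForward=False,
        Limit=20,
    )['Items']

    # 4) most liked posts: GSI with 'type' as partition key and Likes as sort key
    most_liked = table.query(
        IndexName='TypeByLikes',
        KeyConditionExpression=Key('type').eq('post'),
        ScanIndexForward=False,
        Limit=20,
    )['Items']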
Explanation and additional info
Using this approach you group all posts in a single item collection, which allows for efficient queries. To save on storage space and RCUs, you can also choose to only project a subset of attributes into the index.
If you have more than 10GB of post-data, this design isn't ideal, but for a smaller application it will work fine.
If you're going for a single-table design, I'd recommend using generic names for the index attributes: PK, SK, GSI1PK, GSI1SK, GSI2PK, GSI2SK. You can then duplicate the actual attribute values into these generic attributes. This makes it less confusing if you store different entities in the same table. Adding a type attribute that holds the entity type is also common.
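Purely as an illustration, a post item in such a single-table layout might look like this:

    post_item = {
        'PK':     'USER#123',               # generic key attributes...
        'SK':     'POST#456',
        'GSI1PK': 'post',                   # ...holding duplicated values: the entity type
        'GSI1SK': '2021-01-15T09:30:00Z',   # the upload time
        'GSI2PK': 'post',
        'GSI2SK': 42,                       # the like count
        'type':   'post',                   # entity type marker
        'UserID': '123',                    # the original attributes stay as well
        'PostID': '456',
        'UploadTime': '2021-01-15T09:30:00Z',
        'Likes':  42,
    }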
One common strategy for one-to-many relationships in DynamoDB is using a composite primary key: a broader partition key as the parent, and a narrower sort key that contains the child relationships. The example in this article by data scientist Alex DeBrie uses an organization as the partition key, with the organization's metadata and its users stored as separate items under different sort keys.
This strategy solves the most common retrieval use cases:
Retrieve all metadata and users in an organization
Retrieve all the users within an organization
Retrieve metadata about an organization
Retrieve specific users within an organization
The use case I am trying to solve does not seem to be covered by this model. What if you wanted to retrieve all the metadata from all organizations combined? Or what if you wanted to know all the users across organizations? Even with a Global Secondary Index, the items are still split into one partition per organization, which would require a scan to retrieve them all. Does DynamoDB have any features to facilitate these kinds of retrievals?
To retrieve the metadata for all organizations, you can model it the following way: The SK for the respective metadata items doesn’t need to contain the company name. Just give it the value “META” for every company. Create a GSI that uses the SK as PK. This way you can query all items from the index using the PK “META”.
To retrieve all users across organizations, add an attribute “type” with the value “USER” to all user entries. Create a GSI with this type as PK. This way you can query all users from the index using the PK “USER”.
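A sketch of both queries with boto3 (table and index names are assumed):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource('dynamodb').Table('Organizations')

    # metadata items of all organizations: GSI whose partition key is the table's SK
    all_metadata = table.query(
        IndexName='SK-index',
        KeyConditionExpression=Key('SK').eq('META'),
    )['Items']

    # all users across organizations: GSI whose partition key is the 'type' attribute
    all_users = table.query(
        IndexName='type-index',
        KeyConditionExpression=Key('type').eq('USER'),
    )['Items']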
I've got a simple question about datastore keys. If I delete an entity, is there any possibility that the key will be created again? or each key is unique and can be generated only one-time?
Thanks.
It is definitely possible to re-use keys.
Easy to test, for example using the datastore admin page (or programmatically, as sketched after these steps):
create an entity for one of your entity models using a custom/specified key name and some property values
delete the entity
create another one using the same key name and different property values...
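The same test in code, using the Python ndb client (the kind and key name are just examples):

    from google.appengine.ext import ndb

    class Note(ndb.Model):
        text = ndb.StringProperty()

    key = Note(id='my-key-name', text='first version').put()
    key.delete()                                          # the entity is gone...
    Note(id='my-key-name', text='second version').put()   # ...and the same key is reused
    print(ndb.Key(Note, 'my-key-name').get().text)        # 'second version'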
As for keys with auto-generated IDs, it is theoretically possible, but I guess rather unlikely due to the huge number of possibilities. From Assigning identifiers:
Cloud Datastore can be configured to generate auto IDs using two different auto ID policies:
The default policy generates a random sequence of unused IDs that are approximately uniformly distributed. Each ID can be up to 16 decimal digits long.
The legacy policy creates a sequence of non-consecutive smaller integer IDs.
I want to be as efficient as possible and plan properly. Since read and write costs are important when using Google App Engine, I want to be sure to minimize those. I'm not understanding the "key" concept in the Datastore. What I want to know is: would it be more efficient to fetch an entity by its key, given that I know what it is, than to fetch it using some kind of filter?
Say I have a model called User and a user has an array(list) of commentIds. Now I want to get all this user's comments. I have two options:
The user's array of commentIds is an array of keys, where each key is the key of a Comment entity. Since I have all the keys, I can just fetch all the comments by their keys.
The user's array of commentIds consists of identifiers I made up myself; in this case, let's say they're auto-incrementing regular integers, and each comment in the datastore has a unique commentIntegerId. So if I wanted to get all the comments, I'd do a filtered fetch for all comments whose ID is in my array of IDs.
Which implementation would be more efficient, and why?
Fetching by key is the fastest way to get an entity from the datastore, since it is the most direct operation and doesn't need to go through an index lookup.
Each time you create an entity (unless you specify a key_name), App Engine will generate a unique (per parent entity) numeric ID; you should use those as the IDs for your comments.
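In ndb terms (the model, property, and ID values are illustrative), the two options from the question look like this; the first is the direct, cheaper lookup:

    from google.appengine.ext import ndb

    class Comment(ndb.Model):
        comment_int_id = ndb.IntegerProperty()   # only needed for option 2
        text = ndb.StringProperty()

    # Option 1: the user stores real keys - one batch lookup, no index involved
    comment_keys = [ndb.Key(Comment, 5629499534213120), ndb.Key(Comment, 6473924464345088)]
    comments = ndb.get_multi(comment_keys)

    # Option 2: the user stores custom integer IDs - needs an indexed property and a query
    comments = Comment.query(Comment.comment_int_id.IN([1, 2, 3])).fetch()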
You should design a NoSql database (= GAE Datastore) based on usage patterns:
If you need to get all of a user's comments at once and never need to get one or some of them based on some criteria (e.g. query them), then the most efficient way, in terms of speed and cost, would be to serialize all comments as a binary blob inside an entity (or save it to the Blobstore).
But I guess this is not the case, as comments are usually tied to both users and to posts, right? In this case above advice would not be viable.
To answer your title question: get by key is always faster than a query on a property, because a query first goes through the index to satisfy the property condition, where it gets the key, and then it does a get with that key.
I have a web app in which I can create notes; each time I create a new note, it is inserted into a table with an auto_increment id (quite obvious).
Now I want to develop an Android app in which I can create notes too (saving them locally in SQLite), and then synchronize those notes with the server.
The problem is, when I create notes on my phone they will have their own auto_increment ids, which will often be the same as the ids of notes already on the server!
I don't mind having duplicated notes (actually, I don't think there is a way to tell whether a new note is a duplicate or not, because notes don't have any natural id); the problem is that if they have the same id (primary key), I won't be able to insert them into the server.
Any suggestion?
You could use a UUID as the key for your note.
That way, each entry will have a unique id, whether it was created on the server or on the client.
To create a UUID, you can use UUID.randomUUID().
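Sketching the idea in Python with sqlite3 purely for illustration (on Android, the UUID.randomUUID().toString() call above plays the same role):

    import sqlite3
    import uuid

    conn = sqlite3.connect('notes.db')
    conn.execute('CREATE TABLE IF NOT EXISTS notes (id TEXT PRIMARY KEY, body TEXT)')

    # the id is generated on whichever device creates the note and will not
    # collide with ids generated elsewhere
    note_id = str(uuid.uuid4())
    conn.execute('INSERT INTO notes (id, body) VALUES (?, ?)', (note_id, 'my first note'))
    conn.commit()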
The most obvious solution would be to give each note its own unique hash or GUID in addition to the database's auto_increment_id.
You'd then use these unique values as the basis for synchronisation in conjunction with a "last synced" timestamp in each of the tables so that you know what data needs to be synced and can easily determine if the data already exists in the destination (and should be updated) or whether it's a new note.
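As an illustrative sketch (column names are just an example), the client-side table and the "what still needs syncing" query could look like this:

    import sqlite3

    conn = sqlite3.connect('notes.db')
    conn.execute('''CREATE TABLE IF NOT EXISTS notes (
                        guid        TEXT PRIMARY KEY,    -- identity shared with the server
                        body        TEXT,
                        modified_at INTEGER,             -- unix time of the last local change
                        synced_at   INTEGER DEFAULT 0    -- unix time of the last successful sync
                   )''')

    # anything changed since it was last pushed still needs to go to the server
    rows_to_push = conn.execute(
        'SELECT guid, body, modified_at FROM notes WHERE modified_at > synced_at'
    ).fetchall()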
I'm sorry, but I think that your DB structure is wrong. You cannot use auto-increment fields this way across different databases in a disconnected architecture. Auto-increment values are designed for a specific use; if you need to merge two tables like this, you have to implement different logic. Use a note_id that identifies a note in a unique way, using additional data (i.e. the user id, the device id, etc.) to make this id unique. Auto-increment will only give you a messy architecture at best in this scenario.