When should I use ObjectId vs UUID in MongoDB?

I'm making a simple CRUD application with MongoDB so I can learn more about it.
The application is a simple blog; I have a collection named "articles" which stores various documents, each one representing a post for my blog.
When I display the list of all blog posts, I can do a db.collection.find(), and list all of them.
The question arises when I need to show a single post individually, i.e. when I need to query the collection for a single, specific document.
In an RDBMS the obvious solution would be an auto-increment column, but MongoDB is NoSQL and has no auto-increment feature.
I'm using the auto-generated _id field of the document, which stores an ObjectId by default, so my URLs look like this:
http://localhost/blog/article.php?_id=5d41f6e5fc1a2f3d80645185
I saw in the documentation that an ObjectId contains a unique identifier for the server, together with a timestamp and a counter. Isn't exposing these things a security risk?
As a solution, I stumbled upon UUID (https://docs.mongodb.com/manual/reference/method/UUID/), an auto-generated unique ID that doesn't expose a timestamp or machine info. It seems logical to use this instead of the ObjectId in _id for querying and finding a document.
That way my URLs would look like this:
http://localhost/blog/article.php?_id=23829651-26f7-4092-99d0-5be8658c966e
But still: should I keep the _id property? Should I add another field called "id" that stores the UUID? Should I even use UUIDs at all?
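For reference, here's a rough pymongo sketch of what I have in mind (collection and field names simplified; the UUID is stored as a string for simplicity, though the driver can also store native BSON binary UUIDs):

```python
import uuid
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client.blog.articles

# Store a randomly generated UUID as the _id (as a string, for simplicity).
post_id = str(uuid.uuid4())
articles.insert_one({"_id": post_id, "title": "Hello", "body": "..."})

# Later, look the post up from the id passed in the URL query string.
post = articles.find_one({"_id": post_id})
```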

Here's what I would consider before choosing an identifier:
Collision
Risk of collision is very low for both UUIDs and ObjectIDs. This has been discussed in detail in another question.
Nature
UUIDs are random, whereas ObjectID values always increase over time. This makes ObjectIDs a poor choice as a shard key: because new IDs are monotonically increasing, all inserts are routed to the shard holding the highest key range, creating a hot spot.
Other uses
ObjectIDs embed the creation timestamp and can substitute for the commonly used createdAt field: a sort by ObjectID is a sort by creation time.
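For instance, a quick pymongo sketch (the collection name is an assumption):

```python
from bson import ObjectId
from pymongo import MongoClient

articles = MongoClient().blog.articles

# Every ObjectId carries its creation time; no separate createdAt needed.
oid = ObjectId("5d41f6e5fc1a2f3d80645185")
print(oid.generation_time)  # timezone-aware datetime of creation

# Sorting by _id (descending) returns the newest documents first.
newest = articles.find().sort("_id", -1).limit(10)
```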
Insecure object references (OWASP)
Short definition: an attacker should not be able to deduce the ID of another object given the ID of one object. You can read more about this here. Neither UUIDs nor ObjectIDs are vulnerable to this.
Link to another question that discusses the security of ObjectIDs (thanks zbee).
Ease of use
Note: This is subjective
Using ObjectIDs is a lot easier within the Mongo ecosystem. The special aggregation operators for dealing with ObjectIDs, plus driver and library support, add to it.
Portability
UUIDs are more portable than ObjectIDs. I do not know of any system other than Mongo that uses ObjectIDs internally, whereas other databases such as Postgres have a dedicated data type for UUIDs, plus extensions for random generation and so on.

Related

How do we give an identification to a relationship in OntoRefine RDF mapping?

I'm working on a transformation task in which I need to transform a property graph dataset into an RDF dataset. There are many n-ary relationships that need to be treated as a class, but I do not know how to assign a unique identifier to these relations. I tried to use the row index, but I have more than one file in this project, so that can't work. So I would like to know how you assign a unique identifier to relationships; if a URI is the solution, how do we do this in OntoRefine mapping? Thank you for your answers.
There are several ways to address this:
Ideally, use some characteristics of the related entities to make a deterministic URL. E.g., if you're making a position (membership) node between a person and an org that involves a mandatory role and start date, you could use a URL like org/<org_id>/person/<person_id>/role/<role_id>/date/<date>
Use a blank node. In that case you don't need to worry about a URN.
Use the row index if you prepend it with the table/file name (as a constant).
Use the GREL function random(). It doesn't produce a globally unique identifier, but if you ask for a large enough range, it'll be unique with a very high probability.
Use a Jython function, as shown at How to create UUID in OpenRefine based on the MD5 hash of the values (a minimal sketch of this idea follows below).
If you do your mapping using SPARQL, use the built-in uuid() function.
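For the Jython route, here is a minimal sketch (column names are hypothetical) of deriving a stable identifier from the MD5 hash of a row's defining values, so the same relationship always gets the same ID across files:

```python
# Jython/Python sketch: build a deterministic identifier for an n-ary
# relationship from the values that define it. Column names are
# hypothetical; adapt them to your dataset.
import hashlib

def relationship_id(file_name, person_id, org_id, role, start_date):
    # Concatenate the defining values; the same inputs always yield
    # the same hash, so the identifier is stable across runs and files.
    raw = u"|".join([file_name, person_id, org_id, role, start_date])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

# e.g. use the result to mint a URI for the membership node:
# http://example.com/membership/<relationship_id(...)>
print(relationship_id(u"file1.csv", u"p42", u"org7", u"chair", u"2020-01-01"))
```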

Amazon DynamoDB Single Table Design For Blog Application

New to this community. I need some help designing the Amazon DynamoDB table for my personal project.
Overview: this is a simple photo gallery application with the following attributes.
UserID
PostID
S3URL
Caption
Likes
Reports
UploadTime
I wish to perform the following queries:
For a given user, fetch 'N' most recent posts
For a given user, fetch 'N' most liked posts
Give 'N' most recent posts (Newsfeed)
Give 'N' most liked posts (Newsfeed)
My solution:
Keeping UserID as the partition key and PostID as the sort key, with Likes and UploadTime as local secondary indexes, I can solve the first two queries.
I'm confused about how to perform the query operations for 3 and 4 (newsfeed). I know I cannot query without a partition key, and a scan is not an effective solution. Is there any workaround for operations 3 and 4?
Any idea on how I should design my DB?
It looks like you're off to a great start with your current design, well done!
For access pattern #3, you want to fetch the most recent posts. One way to approach this is to create a global secondary index (GSI) that aggregates posts by their creation time. For example, you could create an attribute named GSI1PK on your main table, assign it the constant value POSTS, and use the upload_time field as the sort key.
In that secondary index (I've named it GSI1), every post lands in a single POSTS partition, sorted by upload_time, so you can query for POSTS and get posts ordered by upload time. This is a great start. However, the POSTS partition will grow quite large over time. Instead of choosing POSTS as the partition key for your secondary index, consider using a truncated timestamp to group posts by date, e.g. storing each post under the month it was created.
Storing posts using a truncated timestamp will help you distribute your data across partitions, which will help your DB scale. If a month is too long, you could use truncated timestamps for a week/day/hour/etc. Whatever makes sense.
To fetch the N most recent posts, you'd simply query your secondary index for POSTS in the current month (e.g. POSTS#2021-01-00). If you don't get enough results, run the same query against the prior month (e.g. POSTS#2020-12-00). Keep doing this until your application has enough posts to show the client.
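A rough boto3 sketch of that fallback loop (the table name, index name, and the POSTS#&lt;month&gt; key format are assumptions from this discussion, not a definitive implementation):

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table name; GSI1/GSI1PK follow the naming used above.
table = boto3.resource("dynamodb").Table("gallery")

def fetch_recent_posts(n, months):
    """Query GSI1 month by month (newest first) until we have n posts."""
    posts = []
    for month in months:  # e.g. ["2021-01-00", "2020-12-00", ...]
        resp = table.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("GSI1PK").eq("POSTS#" + month),
            ScanIndexForward=False,  # descending upload_time, newest first
            Limit=n - len(posts),
        )
        posts.extend(resp["Items"])
        if len(posts) >= n:
            break
    return posts

print(fetch_recent_posts(10, ["2021-01-00", "2020-12-00"]))
```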
For the fourth access pattern, you'd like to fetch the most liked posts. One way to implement this access pattern is to define another GSI with "LIKES" as the partition key and the number of likes as the sort key.
If you intend to introduce a date range on the number of likes (e.g. most popular posts this week/month/year), you could utilize the truncated-timestamp approach I outlined for the previous access pattern.
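Continuing the same sketch (a hypothetical GSI2 whose partition key GSI2PK holds the constant LIKES and whose sort key is the like count):

```python
# Most-liked posts: one partition ("LIKES"), sorted by the like count.
resp = table.query(
    IndexName="GSI2",
    KeyConditionExpression=Key("GSI2PK").eq("LIKES"),
    ScanIndexForward=False,  # descending like count, most liked first
    Limit=10,
)
most_liked = resp["Items"]
```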
When you find yourself with "fetch most recent" access patterns, you may want to check out KSUIDs. KSUIDs, or K-Sortable Unique Identifiers, are unique identifiers that are sortable by their creation date/time. Think of them as a UUID and a timestamp combined into one attribute. This could be useful in supporting your first access pattern, where you fetch the most recent posts for a user.
If you were to use a KSUID in place of the PostID sort key, the IDs would be unique yet sortable by the time they were created, so you could support your first access pattern without any additional indexing.
There are KSUID libraries for most popular programming languages, so implementing this feature is pretty simple.
You could add two Global Secondary Indexes.
For 3):
Create a static attribute type with the value post, which serves as the partition key for the GSI, and use the attribute UploadTime as the sort key. You can then query for type="post" and get the most recent items based on the sort key.
The solution for 4) is very similar:
Create another global secondary index with the aforementioned item type as the partition key and Likes as the sort key. You can then query in a similar way as above. Note that GSIs are eventually consistent, so it may take some time until your like counters are updated.
Explanation and additional info
Using this approach you group all posts in a single item collection, which allows for efficient queries. To save on storage space and RCUs, you can also choose to project only a subset of attributes into the index.
If you have more than 10 GB of post data, this design isn't ideal, but for a smaller application it will work fine.
If you're going for a single-table design, I'd recommend using generic names for the index attributes: PK, SK, GSI1PK, GSI1SK, GSI2PK, GSI2SK. You can then duplicate the entity-specific attribute values into these generic attributes. This will make it less confusing if you store different entities in the table. Adding a type column that holds the entity type is also common.
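As a non-authoritative sketch of that convention (table name, billing mode, and projections are assumptions), a single-table definition with generic key names and the two GSIs discussed above could look like this in boto3:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical single-table definition with generic key names and two GSIs.
dynamodb.create_table(
    TableName="gallery",
    BillingMode="PAY_PER_REQUEST",
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "GSI1PK", "AttributeType": "S"},
        {"AttributeName": "GSI1SK", "AttributeType": "S"},
        {"AttributeName": "GSI2PK", "AttributeType": "S"},
        {"AttributeName": "GSI2SK", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},
        {"AttributeName": "SK", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[
        {   # recent posts: GSI1PK = "POSTS#<month>", GSI1SK = upload time
            "IndexName": "GSI1",
            "KeySchema": [
                {"AttributeName": "GSI1PK", "KeyType": "HASH"},
                {"AttributeName": "GSI1SK", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        },
        {   # most liked: GSI2PK = "LIKES", GSI2SK = like count; project
            # only what the newsfeed needs, to save storage and RCUs
            "IndexName": "GSI2",
            "KeySchema": [
                {"AttributeName": "GSI2PK", "KeyType": "HASH"},
                {"AttributeName": "GSI2SK", "KeyType": "RANGE"},
            ],
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["Caption", "S3URL"],
            },
        },
    ],
)
```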

Datastore why use key and id?

I had a question regarding why Google App Engine's Datastore uses both a key and an ID. Coming from a relational database background, I am comparing entities with rows, so why, when storing an entity, does it require a key (which is a long, automatically generated string) and an ID (which can be entered manually or automatically)? This seems like a big waste of space to identify a record. Again, I am new to this type of database, so I may be missing something.
Key design is a critical part of efficient Datastore operations. Keys are what get stored in the built-in and custom indexes, and when you are querying, you can ask to have only keys returned (in Python: keys_only=True). A keys-only query costs a fraction of a regular query, both in $$ and, to a lesser extent, in time, and has very low deserialization overhead.
So, if you have useful/interesting things stored in your key IDs, you can perform keys-only queries and get back lots of useful data in a hurry and very cheaply.
Note that this extends into parent keys and namespaces, which are all part of the key and therefore additional places you can "store" useful data and retrieve all of it with keys-only queries.
It's an important optimization to understand and a big part of our overall design.
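A minimal sketch of a keys-only query with the Python ndb client (the model and field names are hypothetical):

```python
from google.appengine.ext import ndb

class Article(ndb.Model):
    title = ndb.StringProperty()

# Keys-only query: returns ndb.Key objects rather than full entities,
# at a fraction of the cost of a regular query.
keys = Article.query().fetch(100, keys_only=True)

# Each key carries its kind and id/name without touching the entity.
for key in keys:
    print(key.kind(), key.id())
```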
Basically, the key is built from two pieces of information:
The entity type (in Objectify, it is the class of the object)
The id/name of the entity
So, for a given entity type, the key and the ID are essentially the same thing.
If you do not specify the ID yourself, a random ID is generated and the key is created based on that random ID.
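In other words (an ndb sketch with a hypothetical kind name), the key is just kind plus id, so there is no extra data to waste space on:

```python
from google.appengine.ext import ndb

# A key is just the entity kind plus the id/name (optionally with a
# parent key and a namespace).
key = ndb.Key("Article", 42)

entity = key.get()    # fetch directly by key, no query needed
print(key.id())       # -> 42
print(key.urlsafe())  # the "long automatically generated string" is just
                      # a websafe encoding of (kind, id), not extra data
```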

Fetching by key vs fetching by filter in Google App Engine

I want to be as efficient as possible and plan properly. Since read and write costs are important when using Google App Engine, I want to be sure to minimize those. I don't fully understand the "key" concept in the datastore. What I want to know is: would it be more efficient to fetch an entity by its key, given that I know what it is, than to fetch it by some kind of filter?
Say I have a model called User and a user has an array(list) of commentIds. Now I want to get all this user's comments. I have two options:
The user's array of commentIds is an array of keys, where each key is the key of a Comment entity. Since I have all the keys, I can just fetch all the comments by their keys.
The user's array of commentIds holds custom-made identifiers; in this case, let's say they're auto-incrementing regular integers, and each comment in the datastore has a unique commentIntegerId. If I wanted to get all the comments, I'd do a filtered fetch for all comments whose ID is in my array of IDs.
Which implementation would be more efficient, and why?
Fetching by key is the fastest way to get an entity from the datastore, since it is the most direct operation and doesn't need to go through an index lookup.
Each time you create an entity (unless you specify a key_name), App Engine generates a unique (per parent entity) numeric ID; you should use that as the ID for your comments.
You should design a NoSQL database (= GAE Datastore) based on usage patterns:
If you need to get all of a user's comments at once and never need to get one or some of them based on some criteria (i.e. query them), then the most efficient way, in terms of speed and cost, would be to serialize all comments as a binary blob inside an entity (or save it to Blobstore).
But I guess this is not the case, as comments are usually tied both to users and to posts, right? In that case the above advice would not be viable.
To answer your title question: a get by key is always faster than a query by a property, because a query first goes through the index to satisfy the property condition, obtains the key there, and then does a get with that key.
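A sketch contrasting the two options with ndb (model and property names are hypothetical):

```python
from google.appengine.ext import ndb

class Comment(ndb.Model):
    text = ndb.StringProperty()
    comment_int_id = ndb.IntegerProperty()  # hypothetical custom ID

# Option 1: the user stores real keys. One direct batch get, no index.
comment_keys = [ndb.Key("Comment", 1001), ndb.Key("Comment", 1002)]
comments = ndb.get_multi(comment_keys)

# Option 2: custom integer IDs. The query must consult the index first
# to find the keys, then fetch the entities with those keys.
ids = [7, 8, 9]
comments = Comment.query(Comment.comment_int_id.IN(ids)).fetch()
```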

Use a ListProperty or custom tuple property in App Engine?

I'm developing an application with Google App Engine and stumbled across the following scenario, which can perhaps be described as "MVP-lite".
When modeling many-to-many relationships, the standard property to use is the ListProperty. Most likely, your list is composed of the foreign keys of another model.
However, in most practical applications, you'll usually want at least one more detail when you get a list of keys - the object's name - so you can construct a nice hyperlink to that object. This requires looping through your list of keys and grabbing each object to use its "name" property.
Is this the best approach? Because "reads are cheap", is it okay to get each object even if I'm only using one property for now? Or should I use a special property like tipfy's JsonProperty to save a (key, name) "tuple" to avoid the extra gets?
Though datastore reads are comparatively cheaper than datastore writes, they can still add significant time to a request handler. Including the objects' names as well as their foreign keys sounds like a good use of denormalization (e.g., use two list properties to simulate a tuple: one contains the foreign keys and the other contains the corresponding names).
If you decide against this denormalization, then I suggest you batch-fetch the entities that the foreign keys refer to (rather than getting them one by one), so that you at least minimize the number of round trips you make to the datastore.
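A rough sketch of both ideas with the old db API the question is using (model and property names are hypothetical):

```python
from google.appengine.ext import db

class Project(db.Model):
    # Denormalized "tuple": two parallel lists, where index i in one
    # list corresponds to index i in the other.
    member_keys = db.ListProperty(db.Key)
    member_names = db.StringListProperty()

# Rendering hyperlinks then needs no extra gets at all:
project = Project.all().get()
for key, name in zip(project.member_keys, project.member_names):
    print(name, key)

# Without the denormalization, batch-fetch instead of one get per key:
members = db.get(project.member_keys)  # one round trip for all entities
```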
"When modeling one-to-many (or in some cases, many-to-many) relationships, the standard property to use is the ListProperty."
No, when modeling one-to-many relationships, the standard property to use is a ReferenceProperty, on the 'many' side. Then, you can use a query to retrieve all matching entities.
Returning to your original question: If you need more data, denormalize. Store a list of titles alongside the list of keys.
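For completeness, a minimal sketch of the ReferenceProperty approach (hypothetical models):

```python
from google.appengine.ext import db

class Author(db.Model):
    name = db.StringProperty()

class Book(db.Model):
    title = db.StringProperty()
    # One-to-many: each Book references its Author; the reverse query
    # is exposed on Author as the 'books' collection.
    author = db.ReferenceProperty(Author, collection_name="books")

author = Author(name="Ann")
author.put()
Book(title="First", author=author).put()

# Query the 'many' side through the back-reference:
for book in author.books:
    print(book.title)
```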
