I am designing a course review system and I have Review documents that refer to a review made for a course by a user.
I also have course documents and I am having trouble designing a data model that satisfies my needs.
The relation between course and review is one to many.
I have 2 options:
Option 1: embed the Course in each of its many Review objects.
In this case, Course objects do not exist on their own. But I have to let users search through courses, so I would need to query Review objects to find courses.
Option 2: store courses in a separate collection and reference the reviews through has_many: reviews.
I also need to find the reviews for a course when the user clicks on a course after searching. With this design, I need one query to retrieve the reviews for a course and another to fetch the course itself when displaying a review.
What would be the best design in this case? I wondered whether I could keep Courses as a separate entity and still embed them inside Reviews as well.
Edit: I have decided to embed reviews inside courses as suggested but I have some new questions now:
For following questions please assume that I have embedded reviews inside course.
When inserting reviews, should I do it in ReviewController by finding its course by id and inserting inside its reviews array?
When a user searches for a course, I would like to return last 10 reviews with the course information instead of all reviews because it may slow down fetching the search results. How can I achieve this after putting all reviews inside courses as you mentioned?
I also have users who enter the reviews (one to many again), I am planning to show recent reviews with usernames, is there a way to embed only username field of user collection inside review?
To find a certain user's reviews I will need to iterate over all courses, right? It is not a very common query, but is there a way to make it faster with an index?
Modeling suggestions for the case where a course has reviews and reviews are made by users.
I have decided to embed reviews inside courses as suggested but I have some new questions now:
When inserting reviews, should I do it in ReviewController by finding its course by id and inserting inside its reviews array?
Inserting a review means updating a document in the course collection. The update filter is the course id (or name), and the $push update operator appends a review sub-document (embedded document) to the course document's reviews array field.
The course collection document can be like this:
{
  _id: <ObjectId>,
  name: <string>,
  description: <string>,
  reviews: [
    { _id: <some id>, date: <date>, content: <string>, user: <...> },
    { _id: <some id>, date: <date>, content: <string>, user: <...> },
    ...
  ]
}
Each review sub-document can store both the user's name and id, or just one of them.
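A minimal sketch of the insert described above, assuming the document shape shown. It only builds the filter and update documents for an updateOne call; the helper name buildAddReview and all sample ids and values are hypothetical, introduced for illustration.

```javascript
// Build the arguments for an updateOne call that appends one review
// to a course's embedded reviews array.
// buildAddReview is a hypothetical helper name, not part of any library.
function buildAddReview(courseId, review) {
  return {
    filter: { _id: courseId },              // select the course by id
    update: { $push: { reviews: review } }, // append to the reviews array
  };
}

// Sample values are made up for the example.
const { filter, update } = buildAddReview("course-1", {
  _id: "review-1",
  date: new Date("2021-01-15"),
  content: "Clear lectures, fair exams.",
  user: { _id: "user-7", name: "alice" }, // id and name stored together
});

// In mongosh this would be sent as: db.course.updateOne(filter, update)
console.log(JSON.stringify({ filter, update }));
```

Whether this lives in a ReviewController or a CourseController is an application-level choice; the database operation is the same either way.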
When a user searches for a course, I would like to return last 10 reviews with the course information instead of all reviews because it may slow down fetching the search results. How can I achieve this after putting all reviews inside courses as you mentioned?
You can make this an aggregation query. For example:
db.course.aggregate([
  { $match: { _id: <some course id> } }, // or filter by the course name field
  { $addFields: {
      latestTenReviews: {
        // use the $function aggregation operator to sort the reviews by the
        // date field descending and limit to the first ten array elements
      }
  } },
])
The $match stage can use the default unique index on _id, or you can define an index on the course's name field.
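One possible way to fill in the sort-and-limit step is sketched here. Note that it swaps in the $sortArray and $slice aggregation operators instead of $function; $sortArray is only available in MongoDB 5.2 and later, so treat this as an assumption about your server version. The function names are hypothetical, and the second function is a plain JavaScript equivalent included only to check the logic without a database.

```javascript
// Pipeline sketch: sort the embedded reviews by date (newest first) and keep
// the first `limit` elements. Assumes MongoDB 5.2+ for $sortArray.
function latestReviewsPipeline(courseId, limit = 10) {
  return [
    { $match: { _id: courseId } },
    {
      $addFields: {
        latestTenReviews: {
          $slice: [
            { $sortArray: { input: "$reviews", sortBy: { date: -1 } } },
            limit,
          ],
        },
      },
    },
    { $project: { reviews: 0 } }, // drop the full array from the result
  ];
}

// Plain JavaScript equivalent of the array expression above.
// Copies the array first so the caller's data is not reordered.
function latestReviews(reviews, limit = 10) {
  return [...reviews].sort((a, b) => b.date - a.date).slice(0, limit);
}

// In mongosh: db.course.aggregate(latestReviewsPipeline(<some course id>))
console.log(JSON.stringify(latestReviewsPipeline("course-1"), null, 2));
```

If reviews are always appended in chronological order, a simpler alternative is a negative $slice (the last ten elements) without sorting.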
I also have users who enter the reviews (one to many again), I am planning to show recent reviews with usernames, is there a way to embed only username field of user collection inside review?
Yes, you store user information in a review as shown in the previous point (2). You can store only the id, only the name, or both, depending on your needs. When the reviews are queried for a course, the user names will show if they are stored. If names are not stored, you may have to use the $lookup aggregation stage to perform a "join" and get the user details, such as the name.
To find a certain user's reviews I will need to iterate over all courses, right? It is not a very common query but is there a way to make it faster with an index?
You can define an index on the user field of the sub-documents in the reviews array. Indexes on array fields are called multikey indexes. A query that filters on the user field will benefit from this index.
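A rough sketch of the index and a matching query, assuming each review stores the user's id directly in its user field (if user is a sub-document, the path would be longer, for example "reviews.user._id"). The function names are hypothetical, and the in-memory function only mirrors what the pipeline is meant to return.

```javascript
// Index specification. Passing it to db.course.createIndex(...) creates a
// multikey index, because reviews is an array.
const reviewUserIndex = { "reviews.user": 1 };

// Pipeline sketch: find courses reviewed by one user and keep only that
// user's reviews from each course.
function reviewsByUserPipeline(userId) {
  return [
    { $match: { "reviews.user": userId } }, // this stage can use the index
    {
      $project: {
        name: 1,
        reviews: {
          $filter: {
            input: "$reviews",
            as: "r",
            cond: { $eq: ["$$r.user", userId] },
          },
        },
      },
    },
  ];
}

// In-memory equivalent, for checking the logic without a database.
function reviewsByUser(courses, userId) {
  return courses
    .filter((c) => c.reviews.some((r) => r.user === userId))
    .map((c) => ({
      name: c.name,
      reviews: c.reviews.filter((r) => r.user === userId),
    }));
}

console.log(JSON.stringify(reviewUserIndex));
console.log(JSON.stringify(reviewsByUserPipeline("user-7")));
```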
Related
I am currently exploring MongoDB.
I built a notes web app and for now the DB has 2 collections: notes and users.
The user can create, read and update his notes.
I want to create a page called /my-notes that will display all the notes that belong to the connected user.
My question is:
Should the notes model have an ownerId field, or the opposite: should the user model have a noteIds field of type list?
Points I found relevant for the decision making:
noteIds approach:
There is no need to query the notes that hold the desired ownerId (with a lot of notes, that would require indexes and a search across the whole notes collection). We just need to find the user by user ID and then get all the notes by their IDs.
In this case there are 2 calls to the DB.
The data is ordered by the order of insertion into the noteIds field in the document.
ownerId approach:
We do need to find the notes by their ownerId field across the notes collection, which might be more compute-intensive.
We can paginate / sort the data as we want - more control over the data.
Are there any more points you can think of?
As far as I can tell, this comes down to whether you want less compute-intensive DB calls or more control over the data.
What are the "best practices"?
Thanks,
A similar use case is explained in the documentation. If there is no limit on the number of notes a user can have, it might be better to store a userId reference field in each note document.
As you've figured out already, pagination would be easier in the second approach. Also, when updating notes, you can simply call updateOne({ _id: "note_id", userId: 1 }, <update>) instead of checking the user's document to see whether the note actually belongs to the user.
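To illustrate why pagination is easier with a userId reference on each note, here is a small sketch that builds the pieces of a paged query. The createdAt field, the page size and the function name are assumptions made for the example; only userId comes from the answer above.

```javascript
// Build the parts of a paged "my notes" query, newest first.
// createdAt is a hypothetical timestamp field on each note.
function notesPageQuery(userId, page, pageSize = 20) {
  return {
    filter: { userId },
    sort: { createdAt: -1 },
    skip: (page - 1) * pageSize, // pages are numbered from 1
    limit: pageSize,
  };
}

const q = notesPageQuery(1, 3);
// In mongosh:
//   db.notes.find(q.filter).sort(q.sort).skip(q.skip).limit(q.limit)
// A compound index such as { userId: 1, createdAt: -1 } would support both
// the filter and the sort.
console.log(JSON.stringify(q));
```

With the noteIds approach, the same page would require slicing the id array in the user document first and then fetching the notes by id.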
I am building an application using DynamoDB. The high-level details are: there are users, there are communities (which users can join), and there are posts (essentially the same use case as Reddit).
My question is how to construct the data in DynamoDB. I am currently using the pattern of having main items (these items are users, posts, communities) which have the exact same partition key and sort key, and these items will always have all details. I'll call these items "detailed" items.
For example, a "detailed" user item would look like this:
Partition Key: USER#<id>
Sort Key: USER#<id>
It would be similar with posts and communities:
Partition Key: POST#<id>
Sort Key: POST#<id>
Partition Key: COMMUNITY#<id>
Sort Key: COMMUNITY#<id>
Now, in order to have relations between these entities, other items will be created, which I am going to call "relational" items. So, if a user posts something, a relational item will be created like this:
Partition Key: USER#<id>
Sort Key: POST#<id>
The whole purpose of this "relational" item is just to make it apparent the user has created this post, and it allows for a simple query to get all the posts a user has created.
Now the problem: these "relational" items do not have any of the data of the detailed item, meaning that after doing a query to get all of the user's posts, a batch get would then have to be used to get the "detailed" items (costing more RCUs).
To be clear, the data is not replicated in the "relational" item because posts can be edited, so duplicating the details could lead to inconsistencies.
Is this an appropriate way to access data, or are there better ways? Is the cost of doing a batch get negligible enough? Should the data just be duplicated and, if something is edited, both items updated? Just looking for outside opinions.
I have tried having no "detailed" items and having the "relational" items hold all the details. However, this complicates the requests, since I need both the PK and SK to delete or update an item (compared to a single key, since PK and SK would be the same). Additionally, this pattern seems more streamlined to implement: if it's an object/model in the code, then it is a "detailed" item in the database.
You can avoid the "link entity" by placing the user id in the PK of the post.
PK                        SK
POST_USER_ID#<user_id>    POST_ID#<post_id>
This way you can do two types of queries:
A query with PK==POST_USER_ID#123 gives you all posts of a user.
A query with PK==POST_USER_ID#123 and SK==POST_ID#<post_id> gives you a specific post by its id.
As for "should data be duplicated and updated when needed", this is very common with NoSQL so don't worry about it.
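The two access patterns above can be sketched as plain request objects. They are shaped for the document client of the AWS SDK for JavaScript v3 (QueryCommand and GetCommand in @aws-sdk/lib-dynamodb), which accepts plain JavaScript values; the table name and function names are hypothetical, and the key attributes are assumed to be named PK and SK.

```javascript
const TABLE = "app-table"; // hypothetical table name

// Composite key for a post item: user id in the PK, post id in the SK.
function postKey(userId, postId) {
  return { PK: `POST_USER_ID#${userId}`, SK: `POST_ID#${postId}` };
}

// Input for a Query call: all posts created by one user.
function postsByUserQuery(userId) {
  return {
    TableName: TABLE,
    KeyConditionExpression: "PK = :pk",
    ExpressionAttributeValues: { ":pk": `POST_USER_ID#${userId}` },
  };
}

// Input for a GetItem call: one specific post of that user.
function singlePostGet(userId, postId) {
  return { TableName: TABLE, Key: postKey(userId, postId) };
}

console.log(JSON.stringify(postsByUserQuery("123")));
console.log(JSON.stringify(singlePostGet("123", "abc")));
```

One consequence of this key design is that fetching a post requires knowing its author's id as well as the post id.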
Using a classic example - Suppose that we have an application which has a Courses collection and a Students collection, while each student can participate in many courses, and each course can have many participants.
We will need to efficiently query all the courses that one student participates in, but we also need to query the students who are participating in a single course.
I know that a relational database would be the optimal solution for this, but for now I want to use just one type of database, which is MongoDB. Could this schema design work efficiently? What are its pros and cons? And which design could be better?
User: {
  _id,
  //...properties
}
Course: {
  _id,
  //...properties
}
CourseParticipate: {
  _id,
  userId,
  courseId,
  //...properties
}
CourseAdmin: {
  _id,
  userId,
  courseId,
  //...properties
}
I like this design because, if I am able to work with multiple databases in the future, it will be easy to transfer these collections to a relational DB (or not?). I also like it because it is fast to write the data and to remove the relations between the objects, but as far as I can see it will make the read queries a little bit slower (or a lot?).
Because I have never seen this design on the internet before, I assume there are better solutions (I hope I don't get hurtful comments and answers, because I'm new).
I also want to hear whether Neo4j can handle this problem, and which relational DBs work best alongside MongoDB.
Links to documentations and articles will be very helpful!
Thanks!
This is a many-to-many relationship. I assume there are a few thousand students and a few hundred courses in your database.
To start with, you can use the following design, with course details embedded in each student as an array of sub-documents called courses.
- students collection
    id:
    name:
    courses: [ { id: 1, name: }, { id: 5, name: }, ... ]
- courses collection
    id:
    name:
    description:
Note, the course id and name are stored in both collections. This is duplication of data. This should be okay, as the duplicated details do not change often (or may not change at all).
Query all courses a student is enrolled in, for example: db.students.find( { name: "John" } ). This will return one student document with the matching name and all the courses (the array field). See db.collection.find.
Query all students enrolled in a particular course: db.students.find( { "courses.name": "Java Programming" } ). This will return all the student documents that have a course name matching the criteria "Java Programming". See Query an Array of Embedded Documents.
Further, you can use projection to exclude and include fields from the result.
NOTES:
You can embed student info within the courses collection, instead of courses within the students. The queries will be similar to the ones above, but you will be querying the courses collection. It depends on your use case.
You can store just the course id field in the courses array of the students collection; this suits the case where the course name changes often. The queries will then use the aggregation $lookup stage (a "join" operation) to get the course name from the courses collection.
See the MongoDB documentation on Data Model Design for document-based data for more information.
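For the variant that stores only course ids in each student, the $lookup could look roughly like the sketch below. It assumes the field names from the layout above (id on both sides, an array named courses in each student); courseDetails and the function names are hypothetical. The in-memory function mirrors the join so the logic can be checked without a database.

```javascript
// Pipeline sketch: fetch one student and join the referenced courses.
// When localField resolves to an array, $lookup matches each element.
function studentWithCoursesPipeline(studentName) {
  return [
    { $match: { name: studentName } },
    {
      $lookup: {
        from: "courses",
        localField: "courses.id",
        foreignField: "id",
        as: "courseDetails", // hypothetical output field name
      },
    },
  ];
}

// In-memory equivalent of the join for a single student document.
function joinCourses(student, courses) {
  const ids = new Set(student.courses.map((c) => c.id));
  return { ...student, courseDetails: courses.filter((c) => ids.has(c.id)) };
}

// In mongosh: db.students.aggregate(studentWithCoursesPipeline("John"))
console.log(JSON.stringify(studentWithCoursesPipeline("John"), null, 2));
```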
I am working on a project where a user can have several posts.
So I want to know which approach is better for querying all his posts
Having a posts field in the user model that contains all the post ids, and then looking each one up through findById in the posts collection, which contains all the posts from all the users.
Querying the whole posts collection at once to find all the posts from the given user.
I'll try to answer both the question in the title as well as your specific use case, starting with the title.
Outside the context of your use case, findOne is likely marginally faster when given valid ObjectIds, in the sense that passing an ID through .findOne({ _id }) does not ensure that _id is a valid ObjectId.
However, given malformed IDs, .findById(_id) will prevent a query from executing at all and throw a CastError, which is more performant if you are unsure of the origin of the ID (e.g. user input) and has the added benefit of letting you return an error message rather than an empty result.
As previously mentioned, these details seem irrelevant for your use case, since the ObjectIds in question are already stored in the database. Besides, this appears to be a question of which queries, in which order, would be theoretically faster, not of one query helper over the other.
If you already have the ObjectId of the user in question, you save time by skipping the User collection altogether and querying the Posts collection directly. This assumes you have an index on the field containing the user's ID. As you suggested:
Posts.find({ userId })
If you don't already know the user's ID, it would be wise to use the document population technique, since this matches your use case and has likely been optimized for this purpose. One (untested) example given a Post and User model could be:
User
  .findOne({ foo: 'bar' })
  .populate('posts')
  .select('-_id posts')
This should return an object with a single key posts containing an Array of the posts.
I want to implement a user follow system. A user can follow other users. I'm considering two approaches. One is to have followers and followees in the User schema, both of them arrays of user _id. The other is to have only followers in the schema. Whenever I want to find the users that a user follows, I have to search all users' followers arrays, that is, db.user.find( { followers: "_id" } );. What are the pros and cons of the two approaches? Thanks.
What you're considering is a classic "many-to-many" relationship. Unlike an RDBMS, where there is a single "correct" normal form for this schema, in MongoDB the correct schema design depends on the way you'll be using your data, as well as a couple of other factors you haven't mentioned here.
Note that for this discussion I'm assuming that the "follows" relationship is NOT symmetric -- that is, that A can follow B without B having to follow A.
1) There are two basic ways to model this relationship in MongoDB.
You can have an indexed "following" array embedded in the user document.
You can have a separate collection of "following" documents, like this:
{ user: ObjectID("x"), following: ObjectID("y") }
You'd have one document in this collection for each following relationship. You'd need to have two indexes on this collection, one for "user" and one for "following".
Note that the second suggestion in your question (having arrays of both "following" and "followed" in the user document) is simply a variation of the first.
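A small sketch of the separate-collection option, assuming one document per relationship with the user and following fields shown above. The index specifications and filters are plain objects that could be passed to createIndex and find; the collection name follows is hypothetical, and matches is a tiny in-memory stand-in for an equality find, included only to check the filters.

```javascript
// One index per lookup direction, as described above.
const followIndexes = [{ user: 1 }, { following: 1 }];

// Filter for "whom does this user follow?" (served by the index on user).
const followedByFilter = (userId) => ({ user: userId });

// Filter for "who follows this user?" (served by the index on following).
const followersOfFilter = (userId) => ({ following: userId });

// In-memory stand-in for find() with an equality-only filter.
function matches(docs, filter) {
  return docs.filter((d) =>
    Object.entries(filter).every(([key, value]) => d[key] === value)
  );
}

// In mongosh, with a hypothetical collection named follows:
//   followIndexes.forEach((spec) => db.follows.createIndex(spec))
//   db.follows.find(followersOfFilter(someUserId))
console.log(JSON.stringify(followIndexes));
```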
2) The correct design depends on a few factors that you haven't mentioned here.
How many followers can one person have, and how many people can one person follow?
What is your most common query? Is it to present a list of followers, or to present a list of users that are being followed?
How often will you be updating the followers/following list(s)?
3) The trade-offs are as follows:
The advantages of the embedded array approach are that the code is simpler, and you can fetch the entire array of followed users in a single document. If you index the 'following' array, then the query to find all of a user's followers will be relatively quick, as long as that index fits entirely in RAM. (This is no different from a relational database.)
The disadvantages to the embedded array approach occur if you are frequently updating the followers, or if you allow an unlimited number of followers / following.
If you allow an unlimited number of followers/following, then you can potentially overflow the maximum size of a MongoDB document (16 MB). It's not unheard-of for some people to have 100K followers or more. If this is the case, then you'll need to go to the separate collection approach.
If you know that there will be frequent updates to the followers, then you'll probably want to use the separate collection approach as well. The reason is that every time you add a follower, you grow the size of the 'followers' array. When it reaches a certain size, it will outgrow the amount of space reserved for it on disk, and MongoDB will have to move the document. This will incur additional write overhead, as all of the indexes for that document will have to be updated as well.
4) If you want to use the embedded array approach, there are a couple of things that you can do to make that more feasible.
First, when you create a new user, you can create the document with a large number of dummy followers pre-created. (E.g., you populate the 'followers' array with a large number of entries that you know don't refer to any actual user -- perhaps ID 0.) That way, when you add a new follower, you replace one of the ID 0 entries with a real entry, and the document size doesn't grow.
Second, you can limit the number of followers that someone can have, and check for that in the application.
Note that if you use the two-array approach in your document, you will cut the maximum number of followers that one person can have (since a portion of the document will be taken up with the array of users that they are following).
5) As an optimization, you can change the "following" documents to be bucketed. So, instead of one document for each following relationship, you might bucket them by user:
{ user: "X", following: [ "A", "B", "C" ... ] }
{ user: "X", following: [ "H", "I", "J" ... ] }
{ user: "Y", following: [ "A", "X", "K" ... ] }
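One common way to fill such buckets is an upsert that targets a bucket with free space. The sketch below builds the updateOne arguments and simulates the behaviour in memory. The count field, the bucket size and the function names are assumptions for illustration and do not appear in the documents above.

```javascript
const BUCKET_SIZE = 3; // hypothetical and deliberately tiny for the demo

// Arguments for updateOne: append to a bucket that still has room, or let
// the upsert create a new bucket when none matches.
function buildFollowUpsert(userId, followedId) {
  return {
    filter: { user: userId, count: { $lt: BUCKET_SIZE } },
    update: { $push: { following: followedId }, $inc: { count: 1 } },
    options: { upsert: true },
  };
}

// In-memory simulation of the same behaviour.
function addFollow(buckets, userId, followedId) {
  let bucket = buckets.find((b) => b.user === userId && b.count < BUCKET_SIZE);
  if (!bucket) {
    bucket = { user: userId, following: [], count: 0 }; // the "upsert" case
    buckets.push(bucket);
  }
  bucket.following.push(followedId);
  bucket.count += 1;
  return buckets;
}

const buckets = [];
["A", "B", "C", "H"].forEach((id) => addFollow(buckets, "X", id));
console.log(JSON.stringify(buckets));
```

Listing everyone a user follows then means reading all of that user's buckets and concatenating their following arrays.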
6) For more about the ways to model many-to-many, see this presentation:
http://www.10gen.com/presentations/mongosf2011/schemabasics
For more information about the "bucketing" design pattern, see this entry in the MongoDB docs:
http://docs.mongodb.org/manual/use-cases/storing-comments/#hybrid-schema-design
If you provide both followers and followees then you can probably service most of your queries efficiently without a secondary index on either of those fields. For example, you can retrieve the current user and then use the default index on _id to retrieve lists of all of their connections.
db.users.find({_id: {$in: user_A.followers}})
If you don't include followees, you need to create a secondary index on followers in order to service some queries without a collection scan. For example, to determine all of the followees of user A, you would use a query as follows:
db.users.find({followers: user_A._id})
The secondary index costs you some memory and disk space but avoids potential data inconsistencies (mismatched follower and followee lists).
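To make the inconsistency risk concrete: with both arrays, every follow needs two writes, one per user document, and unless they run inside a transaction one can succeed while the other fails. The sketch below builds the two updates and adds a simple one-directional in-memory check for mismatches. All function names are hypothetical; followers and followees are the field names used above.

```javascript
// The two updates needed when followerId starts following followeeId.
// $addToSet avoids duplicate entries if the operation is retried.
function buildFollowUpdates(followerId, followeeId) {
  return [
    { filter: { _id: followeeId }, update: { $addToSet: { followers: followerId } } },
    { filter: { _id: followerId }, update: { $addToSet: { followees: followeeId } } },
  ];
}

// In-memory check: every followee entry should have a matching follower
// entry on the other user. Only this direction is checked here.
function findMismatches(users) {
  const byId = new Map(users.map((u) => [u._id, u]));
  const problems = [];
  for (const u of users) {
    for (const followeeId of u.followees) {
      const target = byId.get(followeeId);
      if (!target || !target.followers.includes(u._id)) {
        problems.push({ follower: u._id, followee: followeeId });
      }
    }
  }
  return problems;
}

// In mongosh, each entry would be sent as db.users.updateOne(filter, update).
console.log(JSON.stringify(buildFollowUpdates("a", "b")));
```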