I am working on a project where a user can have several posts.
So I want to know which approach is better for querying all of a user's posts:
1. Keep a posts field in the user model that contains all the post ids, then fetch each one with findById from the posts collection (which holds the posts of all users).
2. Query the posts collection once and find all the posts belonging to the given user.
I'll try to answer both the question in the title as well as your specific use case, starting with the title.
Outside the context of your use case, findOne is likely insignificantly faster when provided valid ObjectIds, because .findOne({ _id }) does not first check that _id is a valid ObjectId.
However, given a malformed ID, .findById(_id) will prevent the query from executing at all and throw a CastError instead, which would be more performant if you are unsure of the origin of the ID, e.g. user input, and has the added benefit of letting you return an error message rather than an empty result.
As previously mentioned, these details seem irrelevant for your use case, since the ObjectIds in question are already stored in the database; besides, this appears to be a question of which queries in which order would be theoretically faster, not of one query helper over the other.
If you already have the ObjectId of the user in question, you save time by querying the Posts collection directly, removing the need to query the User collection altogether. This assumes you have an index on the field containing the user's ID. As you suggested:
Posts.find({ userId })
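For comparison, the first approach from the question (storing post ids on the user document) translates to a single $in query rather than N findById calls. A minimal sketch of both filter documents, where the ids and field names are hypothetical:

```javascript
// Approach 1 (from the question): the user document stores an array of post ids,
// so all posts can be fetched in one query with $in instead of N findById calls.
const postIds = ["p1", "p2", "p3"];           // hypothetical ids read from the user doc
const byIdFilter = { _id: { $in: postIds } }; // Posts.find(byIdFilter)

// Approach 2 (this answer): each post stores its owner's id.
const byOwnerFilter = { userId: "u1" };       // Posts.find(byOwnerFilter), needs an index on userId
```

Either filter works with Mongoose or the plain Node.js driver; the second avoids reading the user document at all.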
If you don't already know the user's ID, it would be wise to use the document population technique, since this matches your use case and has likely been optimized for this purpose. One (untested) example, given a Post and User model, could be:
User
.findOne({ foo: 'bar' })
.populate('posts')
.select('-_id posts')
This should return an object with a single key posts containing an Array of the posts.
Related
I am currently exploring MongoDB.
I built a notes web app and for now the DB has 2 collections: notes and users.
The user can create, read and update his notes.
I want to create a page called /my-notes that will display all the notes that belong to the connected user.
My question is:
Should the notes model have an ownerId field, or the opposite: should the user model have a noteIds field of type list?
Points I found relevant for the decision making:
noteIds approach:
There is no need to query for the notes that hold the desired ownerId (if we had a lot of notes we would need indexes and a search across the whole notes collection). We just find the user by user ID and then get all the notes by their IDs.
In this case there are 2 calls to the DB.
The data is ordered by the order of insertion into the noteIds field in the document.
ownerId approach:
We do need to find the notes by their ownerId field across the notes collection, which might be more compute-intensive.
We can paginate / sort the data as we want - more control over the data.
Are there any more points you can think of?
As far as I can tell, this is a question of less compute-intensive DB calls vs. more control over the data.
What are the "best practices"?
Thanks,
A similar use case is explained in the documentation. If there is no limit on the number of notes a user can have, it is better to store a userId reference field in the notes documents.
As you've figured out already, pagination would be easier in the second approach. Also, when updating notes, you can simply updateOne({ _id: "note_id", userId: 1 }) instead of checking the user's document to see whether the note actually belongs to the user.
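As a hedged sketch of that ownership-checked update (the ids, field names, and $set body below are assumptions for illustration, not from the question):

```javascript
// Ownership check folded into the update itself: the filter matches only a
// note whose _id AND userId both match, so a non-owner's request updates nothing.
const noteId = "n1", userId = "u1";               // hypothetical ids
const filter = { _id: noteId, userId };
const update = { $set: { body: "edited text" } }; // hypothetical edit
// notes.updateOne(filter, update) -> result.matchedCount === 0 means the note
// does not exist or does not belong to this user.
```

This turns the "does this note belong to this user?" check and the edit into one round trip.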
I am building an application using DynamoDB. High level details are: there are users, there are communities (which users can join), and there are posts (essentially the same use case as Reddit).
My question is how to construct the data in DynamoDB. I am currently using the pattern of having main items (these items are users, posts, communities) which have the exact same partition key and sort key, and these items will always have all details. I'll call these items "detailed" items.
For example, a "detailed" user item would look like this:
Partition Key: USER#<id>
Sort Key: USER#<id>
It would be similar with posts and communities:
Partition Key: POST#<id>
Sort Key: POST#<id>
Partition Key: COMMUNITY#<id>
Sort Key: COMMUNITY#<id>
Now, in order to have relations between these entities, other items will be created, which I am going to call "relational" items. So, if a user posts something, a relational item will be created like this:
Partition Key: USER#<id>
Sort Key: POST#<id>
The whole purpose of this "relational" item is just to make it apparent the user has created this post, and it allows for a simple query to get all the posts a user has created.
Now the problem: these "relational" items do not have any of the data of the detailed item, meaning that after a query to get all of a user's posts, a batch get then has to be used to fetch the "detailed" items (costing more RCUs).
To be clear, the data is not replicated in the "relational" item because posts can be edited, so duplicating the details could lead to inconsistencies.
Is this an appropriate way to access data, or are there better ways? Is the cost of doing a batch get negligible? Should the data just be duplicated, and if something is edited, both items updated? Just looking for outside opinions.
I have tried having no "detailed" items and letting the "relational" items hold all the details. However, this complicates the requests, since I then need both the PK and SK to delete or update an item (compared to a single key when PK and SK are the same). Additionally, the current pattern seems more streamlined to implement: if it's an object/model in the code, then it is a "detailed" item in the database.
You can avoid the "link entity" by placing the user id in the SK of the post.
PK                      SK
POST_USER_ID#<user_id>  POST_ID#<post_id>
This way you can do two types of queries:
Query with PK == POST_USER_ID#123, which gives you all posts of a user.
Query with PK == POST_USER_ID#123 and SK == POST_ID#<post_id>, which gives you a specific post by its id.
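A small sketch of what those keys and the corresponding Query parameters could look like with the AWS SDK for JavaScript; the table name and concrete ids are assumptions:

```javascript
// Key builder for the single-table layout described above.
const postKey = (userId, postId) => ({
  PK: `POST_USER_ID#${userId}`,
  SK: `POST_ID#${postId}`,
});

// Query params (DocumentClient.query shape): all posts of user 123.
const allPostsParams = {
  TableName: "AppTable", // hypothetical table name
  KeyConditionExpression: "PK = :pk",
  ExpressionAttributeValues: { ":pk": "POST_USER_ID#123" },
};

// Narrowing to one specific post just adds the SK condition.
const onePostParams = {
  TableName: "AppTable",
  KeyConditionExpression: "PK = :pk AND SK = :sk",
  ExpressionAttributeValues: { ":pk": "POST_USER_ID#123", ":sk": "POST_ID#42" },
};
```

Both queries hit the same partition, so no GSI or "link entity" is needed for the "all posts of a user" access pattern.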
As for "should data be duplicated and updated when needed", this is very common with NoSQL so don't worry about it.
I'm currently trying to get my head around NoSQL (which is kinda hard coming from SQL). Since I'd like to go through some examples to get a better understanding I'm currently a bit stuck with the following:
Assuming I've got the following collections: user, posts and votes. A user can up- or downvote posts and filter them. How do I need to structure my collections to efficiently query something like "most upvoted posts within the last 24h"?
My first guess would be something like:
votes:
user (user id)
post (post id)
value (down or upvote)
posts:
title
votes:
user (user id)
date
value (down or upvote)
What immediately caught my attention with this approach: I'd need to update votes within both votes and posts every time a user changes his vote, right? Other than that, that'd be my solution to this problem, since I can access the vote's date on every post. My only other concern at this point is that this may become problematic with thousands of votes.
In a perfect world, you could just keep a count on the post. Upvotes would increment it, downvotes would decrement it. (There are atomic operations just for this case.) This means every "show me a post (with votes)" takes one tiny read. Unfortunately, that doesn't prevent users from duplicate voting.
So use 2 tables: one for the Posts, and one for the users.
When a user votes, you store their vote in the users table. That table stores UID + POSTID (primary keys) and also stores their VOTE. You can use DynamoDB Streams to trigger a Lambda function that "copies" the up/down vote onto the Post. Every per-user vote is stored on users first, where it is de-duplicated, and only rolled up onto the actual Post if it's a valid change in the user's vote.
This means every "show me a post (with votes)" takes one tiny read, and every "user vote" is two simple writes.
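The de-duplication step that Lambda would apply boils down to computing a delta from the previously stored vote. A minimal sketch, assuming votes are encoded as +1, -1, or 0 (no vote):

```javascript
// Returns the amount to add to the post's counter when a user's stored vote
// changes from previousVote to newVote (votes encoded as +1, -1, or 0).
function voteDelta(previousVote, newVote) {
  if (previousVote === newVote) return 0; // duplicate vote: nothing to roll up
  return newVote - previousVote;          // e.g. +1 -> -1 is a delta of -2
}
```

The Lambda would then issue a single atomic counter update on the post item (e.g. an ADD update expression with that delta), keeping the per-post read a single tiny item.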
I want to be as efficient as possible and plan properly. Since read and write costs are important when using Google App Engine, I want to be sure to minimize those. I'm not understanding the "key" concept in the datastore. What I want to know is: would it be more efficient to fetch an entity by its key, assuming I know what it is, than to fetch it by some kind of filter?
Say I have a model called User and a user has an array(list) of commentIds. Now I want to get all this user's comments. I have two options:
The user's array of commentIds is an array of keys, where each key is a key to a Comment entity. Since I have all the keys, I can just fetch all the comments by their keys.
The user's commentIds are custom-made identifiers; in this case, let's say they're auto-incrementing regular integers, and each comment in the datastore has a unique commentIntegerId. Now if I wanted to get all the comments, I'd do a filtered fetch for all comments whose ID is in my array of ids.
Which implementation would be more efficient, and why?
Fetching by key is the fastest way to get an entity from the datastore, since it is the most direct operation and doesn't need to go through an index lookup.
Each time you create an entity (unless you specify key_name), App Engine will generate a unique (per parent entity) numeric id; you should use that as the id for your comments.
You should design a NoSQL database (= GAE Datastore) based on usage patterns:
If you need to get all of a user's comments at once and never need to get one or some of them based on some criteria (e.g. query them), then the most efficient way, in terms of speed and cost, would be to serialize all comments as a binary blob inside an entity (or save it to the Blobstore).
But I guess this is not the case, as comments are usually tied both to users and to posts, right? In that case the above advice would not be viable.
To answer your title question: a get by key is always faster than a query by a property, because the query first goes through an index to satisfy the property condition, where it obtains the key, and then performs a get with that key.
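A toy model of those two access paths, with plain Maps standing in for the entity store and a property index (purely for illustration, not datastore API):

```javascript
// Entity store keyed by datastore key, plus a property index used by queries.
const entities = new Map([["key1", { author: "alice", text: "hi" }]]);
const authorIndex = new Map([["alice", "key1"]]); // property value -> key

// Get by key: one direct lookup.
const byKey = entities.get("key1");

// Query by property: first an index lookup to find the key, then the same get.
const byQuery = entities.get(authorIndex.get("alice"));
// Both return the same entity, but the query pays for the extra index hop.
```

This is why holding an array of real keys (option 1 in the question) beats filtering on a custom integer id: the batch get skips the index hop entirely.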
In reference to this question, I am facing almost the same scenario, except that in my case the questions are mostly static (they are subject to change from time to time; I still think adding a column per question is a bad idea, but even if I did, how would the answers be specified and retrieved?). The answers, however, are of different types: for example, the answer could be yes/no, list items, free text, list items OR free text ("Other, please specify"), multiple-selectable list items, etc.
What would be an efficient way to implement this?
Shimmy, I have written a four-part article that addresses this issue - see Creating a Dynamic, Data-Driven User Interface. The article looks at how to let a user define what data to store about clients, so it's not an exact examination of your question, but it's pretty close. Namely, my article shows how to let an end user define the type of data to store, which is along the lines of what you want.
The following ER diagram gives the gist of the data model:
Here, DynamicAttributesForClients is the table that indicates what user-created attributes a user wants to track for his clients. In short, each attribute has a DataTypeId value, which indicates whether it's a Boolean attribute, a Text attribute, a Numeric attribute, and so on. In your case, this table would store the questions of the survey.
The DynamicValuesForClients table holds the values stored for a particular client for a particular attribute. In your case, this table would store the answers to the questions of the survey. The actual value is stored in the DynamicValue column, which is of type sql_variant, allowing any type of data - numeric, bit, string, etc. - to be stored there.
My article does not address how to handle multiple-choice questions, where a user may select one option from a preset list of options, but enhancing the data model to allow this is pretty straightforward. You would create a new table named DynamicListOptions with the following columns:
DynamicListOptionId - a primary key
DynamicAttributeId - specifies what attribute these questions are associated with
OptionText - the option text
So if you had an attribute that was a multiple-choice option you'd populate the drop-down list in the user interface with the options returned from the query:
SELECT OptionText
FROM DynamicListOptions
WHERE DynamicAttributeId = ...
Finally, you would store the selected DynamicListOptionId value in the DynamicValuesForClients.DynamicValue column to record the list option they selected (or use NULL if they did not choose an item).
Give the article a read through. There is a complete, working demo you can download, which includes the complete database and its model. Also, the four articles that make up the series explore the data model in depth and show how to build a web-based (ASP.NET) user interface for letting users define dynamic attributes, how to display them for data entry, and so forth.
Happy Programming!
This may not fit you exactly, but here's what I've got at my part-time job.
I have a questions table, an answers table, and a survey table. For each new survey I create a survey build (because each survey is unique, but questions and answers are repeated a lot). I then have a respondent table that contains some information about the respondent (it also links back to the survey table; I forgot that in the diagram). I also have a response table that links the respondent and the survey build. This probably isn't the best way, but it's the way that works for me, and it works pretty fast (we're at about 1 million+ rows in the response table and it handles like a dream).
With this model I get reusable questions, reusable answers (a lot of our questions use "Yes" and "No"), and a rather slim response table.