MongoDB - Search _id in an array or not

I want to implement a user follow system, where a user can follow other users. I'm considering two approaches. One is to keep both followers and followees in the User schema, each an array of user _ids. The other is to keep only followers in the schema; whenever I want to find the users a given user follows (their followees), I have to search every user's followers array, that is, db.user.find( { followers: "_id" } );. What are the pros and cons of the two approaches? Thanks.

What you're considering is a classic "many-to-many" relationship here. Unlike an RDBMS, where there is a single "correct" normal form for this schema, in MongoDB the correct schema design depends on the way you'll be using your data, as well as a couple of other factors you haven't mentioned here.
Note that for this discussion I'm assuming that the "follows" relationship is NOT symmetric -- that is, that A can follow B without B having to follow A.
1) There are two basic ways to model this relationship in MongoDB.
You can have an indexed "following" array embedded in the user document.
You can have a separate collection of "following" documents, like this:
{ user: ObjectID("x"), following: ObjectID("y") }
You'd have one document in this collection for each following relationship. You'd need to have two indexes on this collection, one for "user" and one for "following".
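If it helps to see that concretely, here's a minimal pymongo sketch of the separate-collection approach (the database name, collection name, and helper functions are assumptions for illustration, not a prescribed API):
import pymongo
from pymongo import MongoClient, ASCENDING
client = MongoClient()          # assumes a local mongod
db = client.app                 # hypothetical database name
# one index per lookup direction
db.following.create_index([("user", ASCENDING)])
db.following.create_index([("following", ASCENDING)])
def follow(user_id, target_id):
    # one document per following relationship
    db.following.insert_one({"user": user_id, "following": target_id})
def followees_of(user_id):
    # everyone user_id follows
    return [d["following"] for d in db.following.find({"user": user_id})]
def followers_of(user_id):
    # everyone who follows user_id
    return [d["user"] for d in db.following.find({"following": user_id})]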
Note that the first suggestion in your question (having arrays of both "following" and "followed" in the user document) is simply a variation of the first approach (the embedded array).
2) The correct design depends on a few factors that you haven't mentioned here.
How many followers can one person have, and how many people can one person follow?
What is your most common query? Is it to present a list of followers, or to present a list of users that are being followed?
How often will you be updating the followers/following list(s)?
3) The trade-offs are as follows:
The advantages to the embedded array approach are that the code is simpler, and you can fetch the entire array of followed users in a single document. If you index the 'following' array, then the query to find all of a user's followers will be relatively quick, as long as that index fits entirely in RAM. (This is no different than a relational database.)
The disadvantages to the embedded array approach occur if you are frequently updating the followers, or if you allow an unlimited number of followers / following.
If you allow an unlimited number of followers/following, then you can potentially overflow the maximum size of a MongoDB document. It's not unheard-of for some people to have 100K followers or more. If this is the case, then you'll need to go to the separate collection approach.
If you know that there will be frequent updates to the followers, then you'll probably want to use the separate collection approach as well. The reason is that every time you add a follower, you grow the size of the 'followers' array. When it reaches a certain size, it will outgrow the amount of space reserved for it on disk, and MongoDB will have to move the document. This will incur additional write overhead, as all of the indexes for that document will have to be updated as well.
4) If you want to use the embedded array approach, there are a couple of things that you can do to make it more feasible.
First, you can limit the total number of followers that one person can have, and check for that limit in the application.
Second, when you create a new user, you can create the document with a large number of dummy followers pre-created. (E.g., you populate the 'followers' array with a large number of entries that you know don't refer to any actual user -- perhaps ID 0.) That way, when you add a new follower, you replace one of the ID 0 entries with a real entry, and the document size doesn't grow. A sketch of this technique follows.
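Here's a minimal pymongo sketch of the pre-padding idea (the users collection, the PAD sentinel, and the follower cap are all illustrative assumptions):
from pymongo import MongoClient
db = MongoClient().app          # hypothetical database name
PAD = 0                         # sentinel id that matches no real user
MAX_FOLLOWERS = 1000            # illustrative cap, checked in the application
def create_user(user_doc):
    # pre-allocate the array so later updates never grow the document
    user_doc["followers"] = [PAD] * MAX_FOLLOWERS
    db.users.insert_one(user_doc)
def add_follower(user_id, follower_id):
    # overwrite one dummy slot in place via the positional operator;
    # the document size stays constant
    res = db.users.update_one(
        {"_id": user_id, "followers": PAD},
        {"$set": {"followers.$": follower_id}})
    return res.modified_count == 1   # False once the array is full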
Note that if you use the two-array approach in your document, you will cut the maximum number of followers that one person can have (since a portion of the document will be taken up with the array of users that they are following).
5) As an optimization, you can change the 'following' documents to be bucketed. So, instead of one document for each following relationship, you might bucket them by user:
{ user: "X", following: [ "A", "B", "C" ... ] }
{ user: "X", following: [ "H", "I", "J" ... ] }
{ user: "Y", following: [ "A", "X", "K" ... ] }
6) For more about the ways to model many-to-many, see this presentation:
http://www.10gen.com/presentations/mongosf2011/schemabasics
For more information about the "bucketing" design pattern, see this entry in the MongoDB docs:
http://docs.mongodb.org/manual/use-cases/storing-comments/#hybrid-schema-design

If you provide both followers and followees then you can probably service most of your queries efficiently without a secondary index on either of those fields. For example, you can retrieve the current user and then use the default index on _id to retrieve lists of all of their connections.
db.users.find({_id: {$in: user_A.followers}})
If you don't include followees, you need to create a secondary index on followers in order to service some queries without a collection scan. For example, to determine all of the followees of user A, you would use a query as follows:
db.users.find({followers: user_A._id})
The secondary index costs you some memory and disk space but avoids potential data inconsistencies (mismatched follower and followee lists).
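For reference, a minimal pymongo sketch of creating that secondary index and running both queries (the database and collection names, and the user_A lookup, are assumptions):
from pymongo import MongoClient
db = MongoClient().app                      # hypothetical database name
db.users.create_index("followers")          # multikey index over the array
user_A = db.users.find_one({"name": "A"})   # hypothetical lookup
# followees of user A: users whose followers array contains A's _id
followees = list(db.users.find({"followers": user_A["_id"]}))
# followers of user A: resolved through the default index on _id
followers = list(db.users.find({"_id": {"$in": user_A["followers"]}}))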

Related

Amazon DynamoDB Single Table Design For Blog Application

New to this community. I need some help designing the Amazon DynamoDB table for my personal project.
Overview: this is a simple photo gallery application with the following attributes.
UserID
PostID
S3URL
Caption
Likes
Reports
UploadTime
I wish to perform the following queries:
For a given user, fetch 'N' most recent posts
For a given user, fetch 'N' most liked posts
Give 'N' most recent posts (Newsfeed)
Give 'N' most liked posts (Newsfeed)
My solution:
Keeping UserID as the partition key, PostID as the sort key, and Likes and UploadTime as local secondary indexes, I can solve the first two queries.
I'm confused about how to perform the query operation for 3 and 4 (Newsfeed). I know I cannot query without a partition key, and a scan is not an effective solution. Any workaround for operations 3 and 4?
Any idea on how should I design my DB ?
It looks like you're off to a great start with your current design, well done!
For access pattern #3, you want to fetch the most recent posts. One way to approach this is to create a global secondary index (GSI) to aggregate posts by their creation time. For example, you could create an attribute named GSI1PK on your main table, assign it the static value POSTS, and use the upload_time field as the sort key.
Viewing the secondary index (I've named it GSI1), every post would then land in a single POSTS partition, sorted by upload_time. This would allow you to query for posts and sort by upload_time. This is a great start. However, your POSTS partition will grow quite large over time. Instead of choosing POSTS as the partition key for your secondary index, consider using a truncated timestamp to group posts by date -- for example, storing posts under the month they were created.
Storing posts using a truncated timestamp will help you distribute your data across partitions, which will help your DB scale. If a month is too long, you could use truncated timestamps for a week/day/hour/etc. Whatever makes sense.
To fetch the N most recent posts, you'd simply query your secondary index for POSTS in the current month (e.g. POSTS#2021-01-00). If you don't get enough results, run the same query against the prior month (e.g. POSTS#2020-12-00). Keep doing this until your application has enough posts to show the client.
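As an illustration, here's a hedged boto3 sketch of that query pattern (the table name gallery, the index name GSI1, and the month-bucket format are assumptions based on the description above):
import boto3
from boto3.dynamodb.conditions import Key
table = boto3.resource("dynamodb").Table("gallery")   # hypothetical table name
def recent_posts(n, month_bucket):
    # newest first: ScanIndexForward=False reads the sort key descending
    resp = table.query(
        IndexName="GSI1",                             # the index described above
        KeyConditionExpression=Key("GSI1PK").eq("POSTS#" + month_bucket),
        ScanIndexForward=False,
        Limit=n)
    return resp["Items"]
# walk backwards month by month until enough posts are collected
posts = []
for bucket in ("2021-01-00", "2020-12-00", "2020-11-00"):
    posts += recent_posts(10 - len(posts), bucket)
    if len(posts) >= 10:
        break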
For the fourth access pattern, you'd like to fetch the most liked posts. One way to implement this access pattern is to define another GSI with "LIKES" as the partition key and the number of likes as the sort key.
If you intend on introducing a data range on the number of likes (e.g. most popular posts this week/month/year/etc) you could utilize the truncated timestamp approach I outlined for the previous access pattern.
When you find yourself implementing "fetch most recent" access patterns, you may want to check out KSUIDs. A KSUID, or K-Sortable Unique Identifier, is a unique identifier that is sortable by its creation date/time. Think of them as UUIDs and timestamps combined into one attribute. This could be useful in supporting your first access pattern, where you are fetching the most recent posts for a user.
If you were to use KSUIDs in place of the Post IDs, then, because KSUIDs are unique and sortable by the time they were created, you could support your first access pattern without any additional indexing.
There are KSUID libraries for most popular programming languages, so implementing this feature is pretty simple.
You could add two Global Secondary Indexes.
For 3):
Create a static attribute type with the value post, which serves as the Partition Key for the GSI, and use the attribute UploadTime as the Sort Key. You can then query for type="post" and get the most recent items based on the sort key.
The solution for 4) is very similar:
Create another Global Secondary Index with the aforementioned item type as the partition key and Likes as the sort key. You can then query in a similar way as above. Note that GSIs are eventually consistent, so it may take time until your like counters are updated.
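A minimal boto3 sketch of both queries (the table and index names are hypothetical; the attribute names follow the description above):
import boto3
from boto3.dynamodb.conditions import Key
table = boto3.resource("dynamodb").Table("gallery")   # hypothetical table name
# access pattern 3: most recent posts (GSI: type -> UploadTime)
recent = table.query(
    IndexName="TypeByUploadTime",                     # hypothetical index name
    KeyConditionExpression=Key("type").eq("post"),
    ScanIndexForward=False,                           # newest first
    Limit=10)["Items"]
# access pattern 4: most liked posts (GSI: type -> Likes)
liked = table.query(
    IndexName="TypeByLikes",                          # hypothetical index name
    KeyConditionExpression=Key("type").eq("post"),
    ScanIndexForward=False,                           # highest like count first
    Limit=10)["Items"]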
Explanation and additional info
Using this approach you group all posts in a single item collection, which allows for efficient queries. To save on storage space and RCUs, you can also choose to only project a subset of attributes into the index.
If you have more than 10GB of post-data, this design isn't ideal, but for a smaller application it will work fine.
If you're going for a Single Table Design, I'd recommend using generic names for the index attributes: PK, SK, GSI1PK, GSI1SK, GSI2PK, GSI2SK. You can then duplicate the attribute values into these items. This will make it less confusing if you store different entities in the table. Adding a type column that holds the entity type is also common.

Efficiently modelling a Feed schema on Google Cloud Datastore?

I'm using GCP/App Engine to build a Feed that returns posts for a given user in descending order of the post's score (a modified timestamp). Posts that are not 'seen' are returned first, followed by posts where 'seen' = true.
When a user creates a post, a Feed entity is created for each one of their followers (i.e. a fan-out inbox model)
Will my current index model result in an exploding index and/or contention on the 'score' index if many users load their feed simultaneously?
index.yaml
indexes:
- kind: "Feed"
  properties:
  - name: "seen"   # Boolean
  - name: "uid"    # The user this feed belongs to
  - name: "score"  # Int timestamp
    direction: desc
# Other entity fields include: authorUid, postId, postType
A user's feed is fetched by:
SELECT postId FROM Feed WHERE uid = abc123 AND seen = false ORDER BY score DESC
Would I be better off prefixing the 'score' with the user id? Would this improve the performance of the score index? e.g. score="{alphanumeric user id}-{unix timestamp}"
From the docs:
You can improve performance with "sharded queries", that prepend a fixed length string to the expiration timestamp. The index is sorted on the full string, so that entities at the same timestamp will be located throughout the key range of the index. You run multiple queries in parallel to fetch results from each shard.
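To make the quoted technique concrete, here is a hedged Python sketch using the old App Engine db API, assuming scores are stored as strings of the form "<shard>|<timestamp>" (the shard count and score encoding are illustrative assumptions):
from google.appengine.ext import db
NUM_SHARDS = 8   # assumption; sized to the expected write rate
def feed_for(uid, limit=50):
    results = []
    for shard in range(NUM_SHARDS):
        prefix = u"%d|" % shard
        q = db.GqlQuery(
            "SELECT * FROM Feed WHERE uid = :1 AND seen = :2 "
            "AND score >= :3 AND score < :4 ORDER BY score DESC",
            uid, False, prefix, prefix + u"\ufffd")
        results.extend(q.fetch(limit))
    # merge the shards back into one list ordered by the raw timestamp
    results.sort(key=lambda f: f.score.split("|", 1)[1], reverse=True)
    return results[:limit]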
With just 4 entities I'm seeing 44 indexes, which seems excessive.
You do not have an exploding indexes problem, that problem is specific to queries on entities with repeated properties (i.e properties with multiple values) when those properties are used in composite indexes. From Index limits:
The situation becomes worse in the case of entities with multiple properties, each of which can take on multiple values. To accommodate such an entity, the index must include an entry for every possible combination of property values. Custom indexes that refer to multiple properties, each with multiple values, can "explode" combinatorially, requiring large numbers of entries for an entity with only a relatively small number of possible property values. Such exploding indexes can dramatically increase the storage size of an entity in Cloud Datastore, because of the large number of index entries that must be stored. Exploding indexes also can easily cause the entity to exceed the index entry count or size limit.
The 44 built-in indexes are nothing more than the indexes created for the multiple indexed properties of your 4 entities (your entity model probably has about 11 indexed properties), which is normal. You can reduce the number by scrubbing your model usage and marking as unindexed all properties which you do not plan to use in queries.
You do however have the problem of a potentially high number of index updates in a short time: when a user with many followers creates a post, all of those index updates fall in a narrow range -- hotspots, which the article you referenced applies to. Prepending the score with the follower's user ID (not the post creator's ID, which won't help, since the same number of updates on the same index range will happen for one user's posting event regardless of whether sharding is used) should help. The impact of followers reading the post (when the score property is updated) is less of a concern, since it's unlikely that all followers read the post at exactly the same time.
Unfortunately prepending the follower ID doesn't help with the query you intend to do as the result order will be sorted by follower ID first, not by timestamp.
What I'd do:
combine the functionality of the seen and score properties into one: a score value of 0 can be used to indicate that a post was not yet seen, and any other value would indicate the timestamp when it was seen. Fewer indexes, fewer index updates, less storage space (see the sketch after this list).
I wouldn't bother with sharding in this particular case:
reading a post takes a bit of time; one follower reading multiple posts won't typically happen fast enough for the index updates for that particular follower to be a serious problem. In the rare worst case an already-read post may appear as unread -- IMHO not bad enough to justify sharding
delays in updating the indexes for all followers are again IMHO not a big problem - it may just take a bit longer for the post to appear in a follower's feed
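A minimal sketch of the combined seen/score idea with the old App Engine db API (the query shapes and limit handling are assumptions for illustration):
from google.appengine.ext import db
def feed_for(uid, limit=50):
    # score == 0 marks "not yet seen"; any other value is the seen-timestamp
    unseen = db.GqlQuery(
        "SELECT * FROM Feed WHERE uid = :1 AND score = :2",
        uid, 0).fetch(limit)
    seen = db.GqlQuery(
        "SELECT * FROM Feed WHERE uid = :1 AND score > :2 "
        "ORDER BY score DESC",
        uid, 0).fetch(limit - len(unseen))
    return unseen + seen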

Storing Likes in a Non-Relational Database

Gist
I implemented a like button in my application. Let's imagine users are able to like other users' products.
Issue
I am now wondering which of the following is the most effective and robust method to store those likes in a non-relational Database (in my case MongoDB). It's important that no user can like a product twice.
Possible Solutions
(1) Store the user ids of those, who liked on the product itself and keep track of the number of likes via likes.length
// Product in database
{
likes: [
'userId1',
'userId2',
'userId3',
...
],
...
}
(2) Store all products, that a user liked on the user itself and keep track of the number of likes through a number on the product
// User in database
{
likedProducts: [
'productId1',
'productId2',
'productId3',
...
]
...
}
// Product in database
{
numberOfLikes: 42,
...
}
(3) Maybe there is even a better solution for this?
Either way, if a product has many likes or a user has liked many products, there is a large amount of data that has to be loaded just to show likes and to check whether the user has already liked the product.
Which approach to use, (1) or (2), depends on your use case; specifically, you should think about which data you will need to access more often: all products liked by a particular user (2), or all users who liked a particular product (1). It looks more likely that (1) is the more frequent case - that way you would easily know whether the user has already liked the product, as well as the number of likes for the product, which is simply the array length.
I would argue that any further improvement would likely be a premature optimization - it's better to optimize with a problem in hand.
If showing the number of likes, for example, turns out to be a bottleneck, you can denormalize your data further by storing the array length under a separate key. That way, displaying the product list wouldn't require fetching the arrays of likes with userIds from the database.
Even more unlikely, with millions of likes on a single product, you'll see a significant slowdown from looping through the likes array to check whether a userId is already in it. You can, of course, use something like a sorted array to keep likes ordered, but the database communication would still be slow (slower than looping through an array in memory, anyway). It's better to use database indexing for the binary search, and instead of storing likes as an array embedded in the product (or user), store them in a separate collection:
{
_id: $oid1,
productId: $oid2,
userId: $oid3
}
That, assuming the product document keeps a key with the number of likes, should be the fastest way of accessing likes, provided all three keys are indexed.
You can also be creative and use the concatenation of $oid2+$oid3 as $oid1, which would automatically enforce uniqueness of the user-product like pairs. You'd then just try saving the document and ignore the duplicate-key error (this might lead to subtle bugs, so it'd be safer to check whether the like exists after a failure to save). A sketch of this separate-collection approach follows.
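As a variation on the concatenated-_id trick, a unique compound index gives the same guarantee; here's a hedged pymongo sketch (the database and collection names are assumptions):
from pymongo import ASCENDING, MongoClient
from pymongo.errors import DuplicateKeyError
db = MongoClient().app          # hypothetical database name
# one like per (product, user) pair, enforced by the database
db.likes.create_index(
    [("productId", ASCENDING), ("userId", ASCENDING)], unique=True)
def like(product_id, user_id):
    try:
        db.likes.insert_one({"productId": product_id, "userId": user_id})
    except DuplicateKeyError:
        return False            # user already liked this product
    # keep the denormalized counter on the product in sync
    db.products.update_one({"_id": product_id},
                           {"$inc": {"numberOfLikes": 1}})
    return True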
Why not simply amend the requirements and use either a relational database or an RDBMS-like solution? Basically, use the right tool for the right job:
Create another table Likes that keeps pairs of productId and userId under a unique key. For example:
userId1 - productId2
userId2 - productId3
userId2 - productId2
userId1 - productId5
userId3 - productId2
Then you can query by userId and get number of likes per user or query by productId and get number of likes per product.
Moreover, the unique key userId_productId will guarantee that a user can like a given product only once.
Additionally, you can keep extra information in other columns, such as the timestamp when the user liked the product.
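For illustration, a minimal sketch of that table using Python's built-in sqlite3 module (the table and column names are assumptions):
import sqlite3
conn = sqlite3.connect("app.db")    # hypothetical database file
conn.execute(
    "CREATE TABLE IF NOT EXISTS likes ("
    "  user_id    TEXT NOT NULL,"
    "  product_id TEXT NOT NULL,"
    "  liked_at   TEXT DEFAULT CURRENT_TIMESTAMP,"
    "  UNIQUE (user_id, product_id))")
def like(user_id, product_id):
    try:
        with conn:
            conn.execute(
                "INSERT INTO likes (user_id, product_id) VALUES (?, ?)",
                (user_id, product_id))
        return True
    except sqlite3.IntegrityError:
        return False                # pair already exists: already liked
# likes per product
count = conn.execute(
    "SELECT COUNT(*) FROM likes WHERE product_id = ?",
    ("productId2",)).fetchone()[0]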
You might also need to consider the document size: storing a user id on each product, or a product id on each user, can blow up document sizes and won't scale very well.
An RDBMS would be a better solution for this problem.

App Engine Datastore: entity design and query optimization

I have a system where users can vote on entities, if they like or hate them. It will be bazillion votes and trazillion records, hopefully, some time in the future :)
At the moment i store a vote in an Entity like this:
UserRecordVote: recordId, userId, hateOrLike
And when I want to get every Record the user liked, I do a query like this:
I query the "UserRecordVote" table for all the "likes", then I take the recordIds from that result set, create keys from that property, and get the records from the Record table.
Then I aggregate all of that in a list and return it.
Here's the question:
I came up with a different approach and I want to find out 1. whether it is faster and 2. how big the difference in cost is.
I would create an Entity whose kind name would be userId + "likes" and whose key would be the record id:
new Entity(userId + "likes", recordId)
So when I do a query to get all the likes, I could simply query for all, no filters needed. AND I could just grab the entity keys, which would be much cheaper if I remember the App Engine documentation right (can't find the pricing page anymore). Then I could take the Iterable of keys and do a single get(Iterable keys). Ok, so I guess this approach is faster and cheaper, right? But what if I want to grab all the votes of a user or, better said, all the records a user didn't vote on yet?
Here's the real question:
I want to load all the records a user didn't vote on yet:
So i would have entities like this:
new Entity(userId+"likes", recordId);
and
new Entity(userId+"hates", recordId);
I would query both vote tables for all entity keys and query the record table for all entity keys. Then I would remove all the record entity keys matching one of the vote entity keys, and with the result I would get(Iterable keys) the full entities -- leaving all the record entities which are not in either of the two voting tables.
Is that a useful approach? Is that the fastest and most cost-efficient way to do a datastore query? Am I totally wrong, and should I store the information as list properties instead?
EDIT:
With that approach I would have 2 entity kinds for each user, which would result in millions of different kinds. How would the GAE Datastore handle that? At least the Datastore Viewer's entity select box would probably crash :) ?
To answer the Real Question, you probably want to have your hateOrLike field store an integer that indicates either hated/liked/notvoted. Then you can filter on hateOrLike=notVoted.
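A minimal sketch of that filter with the old App Engine db API, assuming a vote entity is pre-created for every (user, record) pair (the integer encoding and model definition here are assumptions, not the asker's actual schema):
from google.appengine.ext import db
NOT_VOTED, LIKED, HATED = 0, 1, 2       # hypothetical encoding
class UserRecordVote(db.Model):
    recordId = db.StringProperty()
    userId = db.StringProperty()
    hateOrLike = db.IntegerProperty(default=NOT_VOTED)
# records this user has not voted on yet
unvoted = (UserRecordVote.all()
           .filter("userId =", "user123")     # hypothetical user id
           .filter("hateOrLike =", NOT_VOTED)
           .fetch(100))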
The other solutions you propose with the dynamically named entities make it impossible to query on other aspects of your entities, since you don't know their names.
The other thing is that since you expect this to be huge, you likely want to keep a running counter of your votes rather than tabulating them every time you pull up a UserRecord - querying all the votes and then calculating them on each view is very slow, especially since App Engine will only return 1000 results per query, and if you have more than 1000 votes, you'll have to keep making repeated queries to get all the results.
If you think people will vote quickly, you should look into using a sharded counter for performance. There's examples of that with code available if you do a google search.
Consider serializing user hate/like votes in two separate TextProperties inside the entity. Use the userId as key_name.
rec = UserRecordVote.get_by_key_name(userId)              # one entity per user
hates = len(rec.hates.split('_')) if rec.hates else 0     # '_'-separated record ids
etc.

Efficient group membership test for ACLs on AppEngine

I'm creating an access control list for objects in my datastore. Each ACL entry could have a list of all user ids allowed to access the corresponding entry. Then my query to get the list of entities a user can access would be pretty simple:
select * from ACL where accessors = {userId} and searchTerms >= {search}
The problem is that this can only support 2500 users before it hits the index entry limit, and of course it would be very expensive to put an ACL entry with a lot of users because many index entries would need to be changed.
So I thought about adding a list of GROUPs of users that are allowed to access an entity. That could drastically lower the number of index entries needed for each ACL entry, but querying gets longer because I have to query for every possible group that a user is in:
select * from ACL where accessors = {userId} and searchTerms >= {search}
for (GroupId id : theSetOfGroupsTheUserBelongsTo) {
select * from ACL where accessingGroups = {id} and searchTerms >= {search}
}
mergeAllTheseResultsTogether()
which would take a long time, be much more difficult to page through, etc.
Can anyone recommend a way to fetch a list of entities from an ACL that doesn't limit the number of accessing users?
Edit for more detail:
I'm searching and sorting on a long set of academic topics in use at a school. Some of the topics are created by administrators and should be school-wide. Others are created by teachers and are probably only relevant to those teachers. I want to create a google-docs-list-like hierarchy of collections that treats each topic like a document. The searchTerms field would be a list of words in the topic name - there is not a lot of internal text to search. Each topic will be in at least one collection (the organization's "root" collection) and could be in as many as 10-20 other collections, all managed by different people. Ideally there'd be no upper limit to the number of collections a document might appear in. My struggle here is to produce a list of all of the entities a particular user has at least read access to - the analog in google docs would be the "All Items" view.
Assuming that your documents and group permissions change less often (or are less time critical) than user queries, I suggest this (which is how I'm solving a similar problem):
In your ACL, include the fields
accessors <-- all userids that can access the document
numberOfAccessors <-- store the length of accessors whenever you change that field
searchTerms
The key_name for ACL would be something like "indexed_document_id||index_num"
index_num in the key allows you to potentially have multiple entities storing the list of users, in case there are more than 5000 (the datastore limit on items in a list), or however many you want to have in a list to reduce the cost of loading one up (though you won't need to do that often).
Don't forget that the document to be accessed should be the parent of the index entity. That way you can do a select __key__ query rather than a select * (this avoids having to deserialize the accessors and searchTerms fields). You can search and return the parent() of the entity without needing to access any of the fields. More on that and other GAE search designs in this blog post. Sadly, that blog post doesn't cover ACL indexes like ours.
Disclaimer: I've now encountered a problem with this design, in that which documents a user has access to is controlled by whether they are following that user. That means that if they follow or unfollow, there could be a large number of existing documents the user needs to be added to or removed from. If this is the case for you, then you might be stuck in the same hole as me if you follow my technique. I currently plan to handle this by updating the indexes for old documents in the background, over time. Someone else answering this question might have a solution to it baked in; if not, I may post it as a separate question.
Analysis of operations on this datastructure:
Add an indexed document:
For each group that has access to the document, create an entity which includes all users that can access it in the accessors field
If there are too many to fit in one field, make more entities and increment that index_num value (using sharded counters).
O(n*m), where n is the number of users and m is the number of search terms
Query an indexed document:
select __key__ from ACL where accessors = {userid} and searchTerms >= {search} (though I'm not sure why you use ">=" -- in my queries it's always "=")
Get all the parent keys from these keys
Filter out duplicates
Get those parent documents
O(n+m), where n is the number of users and m is the number of search terms - this is pretty fast. It uses the zig-zag merge join of two indexes (one on accessors, one on searchTerms). This assumes that GAE index scans are linear; they might be logarithmic for "=" queries, but I'm not privy to the design of their indexes, nor have I done any tests to verify. Note also that you don't need to load any of the properties of the index entity. A sketch of this key-only query follows.
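A hedged Python sketch of that key-only query with the old App Engine db API (the entity and property names follow the discussion above; the helper is illustrative):
from google.appengine.ext import db
def documents_for(userid, term):
    # key-only query: no need to deserialize accessors or searchTerms
    q = db.GqlQuery(
        "SELECT __key__ FROM ACL WHERE accessors = :1 "
        "AND searchTerms = :2",
        userid, term)
    doc_keys = set(k.parent() for k in q)   # parent key = the indexed document
    return db.get(list(doc_keys))           # batch-fetch, duplicates removed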
Add access for a user to a particular document
Check if the user already has access: select __key__ from ACL where accessor = {userid} and parent = {key(document)}
If not, add it: select * from ACL where parent = {key(document)} and numberOfAccessors < {5000 (or whatever your max is)} limit 1
Append {userid} to accessors and put the entity
O(n) where n is the number of people who have access to the document.
Remove access for a user to a particular document
select * from ACL where accessor = {userid} and parent = {key(document)}
Remove {userid} from accessors and put the entity
O(n) where n is the number of people who have access to the document.
Compact the indexes
You'll have to do this once in a while if you do a lot of removals; I'm not sure of the best way to detect when it's needed.
To find out whether there's anything to compact for a particular document: select * from ACL where parent = {key(document)} and numberOfAccessors < {2500 (or half whatever your max is)}
For each/any pair of these: delete one, appending the accessors to the other
O(n) where n is the number of people who have access to the document
