Gist
I implemented a like button in my application. Let's imagine users are able to like other users' products.
Issue
I am now wondering which of the following is the most effective and robust way to store those likes in a non-relational database (in my case MongoDB). It's important that no user can like a product twice.
Possible Solutions
(1) Store the ids of the users who liked it on the product itself and keep track of the number of likes via likes.length
// Product in database
{
likes: [
'userId1',
'userId2',
'userId3',
...
],
...
}
(2) Store all products that a user liked on the user itself and keep track of the number of likes through a counter on the product
// User in database
{
likedProducts: [
'productId1',
'productId2',
'productId3',
...
]
...
}
// Product in database
{
numberOfLikes: 42,
...
}
(3) Maybe there is even a better solution for this?
Either way, if a product has many likes or a user has liked many products, a large amount of data has to be loaded just to show the likes and to check whether the user has already liked the product.
Which approach to use, (1) or (2), depends on your use case. Specifically, think about which data you will need to access more often: all products liked by a particular user (2), or all users who liked a particular product (1). It looks more likely that (1) is the more frequent case - that way you can easily tell whether the user has already liked the product, and the number of likes for the product is simply the array length.
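For illustration, a minimal pymongo sketch of those two checks with approach (1), assuming a products collection shaped like the example in the question (the connection details and database name are placeholders):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]   # assumed connection and db name

def has_liked(product_id, user_id):
    # Matching a scalar against an array field checks membership in MongoDB
    return db.products.find_one({"_id": product_id, "likes": user_id}, {"_id": 1}) is not None

def like_count(product_id):
    doc = db.products.find_one({"_id": product_id}, {"likes": 1})
    return len(doc["likes"]) if doc else 0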
I would argue that any further improvement would likely be a premature optimization - it's better to optimize with a problem in hand.
If showing the number of likes, for example, turns out to be a bottleneck, you can denormalize your data further by storing the array length as a separate key-value. That way, displaying the product list wouldn't require fetching the arrays of userIds from the database.
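As a sketch of that denormalization with pymongo (field names as in the question; the two updates are issued separately, so this is not atomic - fine for a sketch):

from pymongo import MongoClient

db = MongoClient()["shop"]                              # assumed db name, as above

def add_like(product_id, user_id):
    # $addToSet is a no-op if the user already liked the product
    result = db.products.update_one({"_id": product_id},
                                    {"$addToSet": {"likes": user_id}})
    if result.modified_count == 1:                      # the like was new, so bump the counter
        db.products.update_one({"_id": product_id},
                               {"$inc": {"numberOfLikes": 1}})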
Even more unlikely, with millions of likes on a single product, you might see a significant slowdown from looping through the likes array to check whether a userId is already in it. You could, of course, use something like a sorted array to keep the likes ordered, but the database round trip would still be slow (slower than looping through an array in memory, anyway). It's better to rely on the database's indexing for the lookup, and instead of storing the likes as an array embedded in the product (or user), store them in a separate collection:
{
_id: $oid1,
productId: $oid2,
userId: $oid3
}
Assuming the product also keeps a key with its number of likes, this should be the fastest way of accessing likes, provided all three keys are indexed.
You can also be creative and use the concatenation $oid2 + $oid3 as $oid1, which automatically enforces uniqueness of the user-product pair. You would then simply try to save the like and ignore the database error on a duplicate (this might lead to subtle bugs, so it would be safer to check whether the like already exists when a save fails).
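A common alternative to the concatenated-id trick is a unique compound index on the pair, which lets the database reject a second like from the same user. A rough pymongo sketch (database name assumed, field names as above):

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

db = MongoClient()["shop"]                      # assumed database name

# One-time setup: the unique index enforces "one like per user per product"
db.likes.create_index([("productId", 1), ("userId", 1)], unique=True)
db.likes.create_index("userId")                 # for "all products liked by a user"

def like(product_id, user_id):
    try:
        db.likes.insert_one({"productId": product_id, "userId": user_id})
        db.products.update_one({"_id": product_id}, {"$inc": {"numberOfLikes": 1}})
        return True
    except DuplicateKeyError:
        return False                            # the user had already liked this product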
Why not simply amend the requirements and use either a relational database or an RDBMS-like solution? Basically, use the right tool for the right job:
Create another table, Likes, that keeps pairs of productId and userId under a unique key. For example:
userId1 - productId2
userId2 - productId3
userId2 - productId2
userId1 - productId5
userId3 - productId2
Then you can query by userId to get the number of likes per user, or query by productId to get the number of likes per product.
Moreover, the unique key userId_productId guarantees that a user can like a given product only once.
Additionally, you can keep extra information in other column(s), such as a timestamp of when the user liked the product.
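For illustration, a minimal sketch of that table using SQLite through Python's sqlite3 (any RDBMS works the same way; the in-memory database and exact column names are assumptions):

import sqlite3

conn = sqlite3.connect(":memory:")              # stand-in for any RDBMS
conn.execute("""CREATE TABLE likes (
                    user_id    TEXT NOT NULL,
                    product_id TEXT NOT NULL,
                    liked_at   TEXT DEFAULT CURRENT_TIMESTAMP,
                    PRIMARY KEY (user_id, product_id))""")

# Inserting the same pair twice violates the primary key, so a double-like fails
conn.execute("INSERT INTO likes (user_id, product_id) VALUES (?, ?)",
             ("userId1", "productId2"))

likes_for_product = conn.execute(
    "SELECT COUNT(*) FROM likes WHERE product_id = ?", ("productId2",)).fetchone()[0]
likes_by_user = conn.execute(
    "SELECT COUNT(*) FROM likes WHERE user_id = ?", ("userId1",)).fetchone()[0]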
You might also need to consider document size; storing a user id on every product, or a product id on every user, can make documents very large and won't scale very well.
An RDBMS will be a better solution for this problem.
Related
I have a database consisting of reviews, follows, and users, where users following other users is a many-to-many relationship modeled by the follow collection. In total, my schema looks as follows:
follow (collection) - key: fid
following (uid)
follower (uid)
review (collection) - key: rid
title (string)
author (uid)
posted (timestamp)
user (collection) - key: uid
created (timestamp)
email (string)
I want to run a query to get the T most recent reviews where the user is following the author. In a SQL environment I would do this with two joins and a where clause.
Let us consider a user following n people, where each person they're following has m reviews. I was considering fetching all reviews for all of the n people the user is following and then keeping only the T most recent, but I recognize the number of reads will be n*m. As we can easily expect n > 100 and m > 1000, this is not a viable solution. I recognize there is probably no great way to do this in Firestore. Any suggestions?
UPDATE: The top answer to a similar question gives an n*k solution (where k is an arbitrary limit) for the number of reads. It also answers an easier question: "get the T most recent reviews for each person one is following", not "get the T most recent reviews across all people one is following." That answer suggests keeping an updated copy of all followers in every review and then using a whereArrayContains clause to find reviews by people one is following. But if user A follows a user B who has B_m reviews, we will perform B_m writes for each follow or unfollow. We will also be massively denormalizing our database, storing and updating the same information in thousands of locations.
Your data seems highly relational, so the best option for you is to switch to a relational database. The only way to get this to work in Firestore is to either completely denormalize your data or chain a ton of queries together to get all of the data you need; neither is ideal in my opinion.
There's a way that worked for me: creating a composite index. Go to Firestore => Indexes and add an index. With it you can link the fields you want to query on, and then you will be able to run the query in your code.
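For what it's worth, the query that needs such a composite index here might look roughly like this with the Python client (google-cloud-firestore). The 'in' operator only accepts a small batch of values per query (the chunk size of 10 below is an assumption), so the followee list has to be chunked and the results merged client-side - essentially the n*k read pattern mentioned in the question:

from google.cloud import firestore

db = firestore.Client()

def recent_reviews_from_followees(followee_ids, t):
    reviews = []
    for i in range(0, len(followee_ids), 10):           # chunk the followee ids
        chunk = followee_ids[i:i + 10]
        query = (db.collection("review")
                   .where("author", "in", chunk)
                   .order_by("posted", direction=firestore.Query.DESCENDING)
                   .limit(t))                           # typically needs a composite index on (author, posted)
        reviews.extend(doc.to_dict() for doc in query.stream())
    reviews.sort(key=lambda r: r["posted"], reverse=True)
    return reviews[:t]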
I am making a system similar to the Play Store's star rating system, where a product or entity is given ratings and reviews by multiple users and, for each entity, the average rating is displayed.
The problem is whether I should store the ratings on each entity, together with the list of users who rated it and the rating each gave. But that makes it hard for a user to check which entities he has rated, as we would need to check every entity for the user's existence.
Or should I store each rated entity, with its rating, on the user? But that makes rendering an entity harder.
So, is there a simple and efficient way in which this can be done?
Or is storing the same data in both places efficient? I also found one example of such a system on Stack Overflow itself, which stores up and down votes on a question and gives the asker +5 for an upvote and a deduction for a downvote. That means they definitely need to store each up and down vote with the question; but when a user opens the question he can see his own vote, so it must also be stored with the user.
Thanks for the help.
I would indeed store the 'raw' version at least: have a big table that stores the productId/entityId, userId and rating. You can query that table directly to get any kind of result you want. Based on that you can also calculate (or re-calculate) projections if you want, so it's a safe bet to store this as the source of truth.
You can start out with a simple aggregate query, as long as that is fast enough. To optimize it, you can make projections of the data in a different format, for instance the average review score per product. This can be achieved using (materialized) views, or you can just store the aggregated rating separately whenever a vote is cast.
Updating that projected aggregate can be very lightweight as well, because you can store the average rating for an entity, together with the number of votes. So when you update the rating, you can do:
NewAverage = (AverageRating * NumberOfRatings + NewRating) / (NumberOfRatings + 1)
After that, you store the new average and increment the number of ratings. So there is no need to do a full aggregation every time somebody casts a vote, and you get the additional benefit of tracking the number of votes, which is often displayed on websites as well.
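As a tiny plain-Python sketch of that update (the function and argument names are just illustrative):

def apply_new_rating(average_rating, number_of_ratings, new_rating):
    # Incremental mean: fold the new rating into the stored aggregate
    new_count = number_of_ratings + 1
    new_average = (average_rating * number_of_ratings + new_rating) / new_count
    return new_average, new_count

# e.g. an entity currently at 4.2 stars over 10 votes receiving a 5-star vote
print(apply_new_rating(4.2, 10, 5))   # -> (4.2727..., 11)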
The easiest way to achieve this is by creating a review table that holds the user and the product. So your database should look like this:
product
--id
--name
--price
user
--id
--firstname
--lastname
review
--id
--userId
--productId
--vote
Then, if you want to get all reviews for a product by a user, you can just query the review table. Hope this solves your problem.
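For illustration, a minimal sketch of those queries using SQLite through Python's sqlite3 (any database works the same way; the in-memory database and placeholder ids are assumptions):

import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for whatever database you use
conn.execute("""CREATE TABLE review (
                    id INTEGER PRIMARY KEY,
                    userId INTEGER, productId INTEGER, vote INTEGER)""")

# All reviews of a product by a given user
rows = conn.execute(
    "SELECT vote FROM review WHERE userId = ? AND productId = ?", (1, 2)).fetchall()

# Average rating to display on the product page
avg_vote = conn.execute(
    "SELECT AVG(vote) FROM review WHERE productId = ?", (2,)).fetchone()[0]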
I am trying to figure out the fastest way to access data stored in a junction object. The example below is analogous to my problem, but with a different context, because the actual dataset I am dealing with is somewhat unintuitive in its relationships.
We have 3 classes: User, Product, and Rating. User has a many-to-many relationship to Product with Rating as the junction/'through' class.
The Rating object stores the answers to several questions which are integer ratings on a scale of 1-5 (Example questions: How is the quality of the Product, how is the value of the Product, how user-friendly is the Product). For simplification assume every User rates every Product they buy.
Now here is the calculation I want to perform: for a User, calculate the average rating of all the Products they have bought (that is, the average over the ratings from all Users who bought each product, one of which will be this User's own rating). Then we can tell the user "On average, you buy products rated 3/5 for value by all customers who bought that product".
The simple and slow way is just to iterate over all of a user's review objects. If we assume that each user has bought a small (<100) number of products, and each product has n ratings, this is O(100n) = O(n).
However, I could also do the following: on the Product class, keep a counter of the number of Ratings that selected each value (e.g. how many Users rated this product 3/5 for value). If you increment that counter every time a Product is rated, then computing the average for a given Product just requires checking the five counters for each Rating criterion.
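For what it's worth, a tiny plain-Python sketch of that counter idea for a single criterion (purely illustrative; in practice the counters would live on the Product record):

# Per-product tally of how many Ratings chose each value for one criterion
value_counts = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}

def record_rating(score):
    value_counts[score] += 1            # O(1) work on every new Rating

def average_rating():
    total_votes = sum(value_counts.values())
    if total_votes == 0:
        return None
    weighted = sum(score * count for score, count in value_counts.items())
    return weighted / total_votes       # no need to touch the Rating rows at all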
Is this a valid technique? Is it commonly employed/is there a name for it? It seems intuitive to me, but I don't know enough about databases to tell whether there's some fundamental flaw or not.
This is normal. It is ultimately caching: encoding of state redundantly to benefit some patterns of usage at the expense of others. Of course it's also a complexification.
Just because the RDBMS data structure is relations doesn't mean you can't rearrange how you encode state away from some straightforward form, e.g. by denormalization.
(Sometimes redundant designs, including ones like yours, are called "denormalized" even when they are not actually the result of denormalization and the redundancy is not the kind that denormalization causes or normalization removes; see Cross Table Dependency/Constraint in SQL Database. Indeed, one could reasonably describe your case as involving normalization without preserving FDs (functional dependencies). Start with a table holding a user's id and other columns, their ratings (a relation) and its counter. Then ratings functionally determines counter, since counter = select count(*) from ratings. Decompose into user etc + counter, i.e. table User, and user + ratings, which ungroups to table Rating.)
Do you have a suggestion as to the best term to use when googling this?
A frequent comment by me: Google many clear, concise & specific phrasings of your question/problem/goal/desiderata with various subsets of terms & tags as you may discover them with & without your specific names (of variables/databases/tables/columns/constraints/etc). Eg 'when can i store a (sum OR total) redundantly in a database'. Human phrasing, not just keywords, seems to help. Your best bet may be along the lines of optimizing SQL database designs for performance. There are entire books ('amazon isbn'), some online ('pdf'). (But maybe mostly re queries). Investigate techniques relevant to warehousing, since an OLTP database acts as an input buffer to an OLAP database, and using SQL with big data. (Eg snapshot scheduling.)
PS My calling this "caching" (so does tag caching) is (typical of me) rather abstract, to the point where there are serious-jokes that everything in CS is caching. (Googling... "There are only two hard problems in Computer Science: cache invalidation and naming things."--Phil Karlton.) (Welcome to both.)
I want to implement a user follow system, where a user can follow other users. I'm considering two approaches. One is to have both followers and followees in the User schema, each an array of user _ids. The other is to have only followers in the schema; then, whenever I want to find the users someone follows, I have to search all users' followers arrays, that is, db.user.find( { followers: "_id" } );. What are the pros and cons of the two approaches? Thanks.
What you're considering is a classic "many-to-many" relationship here. Unlike a RDBMS, where there is a single "correct" normal form for this schema, in MongoDB the correct schema design depends on the way you'll be using your data, as well as a couple of other factors you haven't mentioned here.
Note that for this discussion I'm assuming that the "follows" relationship is NOT symmetric -- that is, that A can follow B without B having to follow A.
1) There are two basic ways to model this relationship in MongoDB.
You can have an indexed "following" array embedded in the user document.
You can have a separate collection of "following" documents, like this:
{ user: ObjectID("x"), following: ObjectID("y") }
You'd have one document in this collection for each following relationship. You'd need to have two indexes on this collection, one for "user" and one for "following" (a quick sketch of this is shown below).
Note that the second suggestion in your question (having arrays of both "following" and "followed" in the user document) is simply a variation of the first.
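For concreteness, a rough pymongo sketch of the separate-collection variant and its two indexes (the database name and ids are placeholders):

from pymongo import MongoClient

db = MongoClient()["social"]                  # assumed database name
x_id, y_id = "x", "y"                         # placeholder user ids

# One index per lookup direction
db.following.create_index("user")             # "who does X follow?"
db.following.create_index("following")        # "who follows Y?"

followees_of_x = db.following.find({"user": x_id})
followers_of_y = db.following.find({"following": y_id})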
2) The correct design depends on a few factors that you haven't mentioned here.
How many followers can one person have, and how many people can one person follow?
What is your most common query? Is it to present a list of followers, or to present a list of users that are being followed?
How often will you be updating the followers/following list(s)?
3) The trade-offs are as follows:
The advantages of the embedded array approach are that the code is simpler, and you can fetch the entire array of followed users in a single document. If you index the 'following' array, then the query to find all of a user's followers will be relatively quick, as long as that index fits entirely in RAM. (This is no different from a relational database.)
The disadvantages of the embedded array approach occur if you are frequently updating the followers, or if you allow an unlimited number of followers / following.
If you allow an unlimited number of followers/following, then you can potentially overflow the maximum size of a MongoDB document. It's not unheard-of for some people to have 100K followers or more. If this is the case, then you'll need to go to the separate collection approach.
If you know that there will be frequent updates to the followers, then you'll probably want to use the separate collection approach as well. The reason is that every time you add a follower, you grow the size of the 'followers' array. When it reaches a certain size, it will outgrow the amount of space reserved for it on disk, and MongoDB will have to move the document. This will incur additional write overhead, as all of the indexes for that document will have to be updated as well.
4) If you want to use the embedded array approach, there are a couple of things you can do to make it more feasible.
First, you can limit the total number of followers that one person can have, and check for that limit in the application. Second, when you create a new user, you can create the document with a large number of dummy followers pre-created. (E.g., you populate the 'followers' array with a large number of entries that you know don't refer to any actual user -- perhaps ID 0.) That way, when you add a new follower, you replace one of the ID 0 entries with a real entry, and the document size doesn't grow. (A rough sketch of this follows below.)
Note that if you use the two-array approach in your document, you will cut the maximum number of followers that one person can have (since a portion of the document will be taken up with the array of users that they are following).
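A rough sketch of that pre-allocation trick with pymongo (the dummy value 0 comes from the description above; the limit, database name and collection name are assumptions, and the sketch doesn't deduplicate followers by itself):

from pymongo import MongoClient

db = MongoClient()["social"]                        # assumed database name
MAX_FOLLOWERS = 1000                                # assumed application limit

def create_user(user_id):
    # Pre-fill the array so later follows don't grow the document
    db.users.insert_one({"_id": user_id, "followers": [0] * MAX_FOLLOWERS})

def add_follower(user_id, follower_id):
    # Overwrite one dummy slot (ID 0) in place via the positional operator
    result = db.users.update_one({"_id": user_id, "followers": 0},
                                 {"$set": {"followers.$": follower_id}})
    return result.modified_count == 1               # False once all slots are used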
5) As an optimization, you can change the 'following' documents to be bucketed. So, instead of one document for each following relationship, you might bucket them by user:
{ user: "X", following: [ "A", "B", "C" ... ] }
{ user: "X", following: [ "H", "I", "J" ... ] }
{ user: "Y", following: [ "A", "X", "K" ... ] }
6) For more about the ways to model many-to-many, see this presentation:
http://www.10gen.com/presentations/mongosf2011/schemabasics
For more information about the "bucketing" design pattern, see this entry in the MongoDB docs:
http://docs.mongodb.org/manual/use-cases/storing-comments/#hybrid-schema-design
If you provide both followers and followees then you can probably service most of your queries efficiently without a secondary index on either of those fields. For example, you can retrieve the current user and then use the default index on _id to retrieve lists of all of their connections.
db.users.find({_id: {$in: user_A.followers}})
If you don't include followees, you need to create a secondary index on followers in order to service some queries without a collection scan. For example, to determine all of the followees of user A, you would use a query as follows:
db.users.find({followers: user_A._id})
The secondary index costs you some memory and disk space, but this approach avoids potential data inconsistencies (mismatched follower and followee lists).
I have a system where users can vote on entities, whether they like or hate them. It will be a bazillion votes and a trazillion records, hopefully, some time in the future :)
At the moment I store a vote in an Entity like this:
UserRecordVote: recordId, userId, hateOrLike
And when I want to get every Record the user liked, I do a query like this:
I query the "UserRecordVote" table for all the "likes", then I take the recordIds from that result set, create keys from that property, and get the records from the Record table.
Then I aggregate all of that into a list and return it.
Here's the question:
I came up with a different approach and I want to find out 1. whether it is faster and 2. how big the difference in cost is.
I would create an Entity whose kind name would be userId + "likes" and whose key would be the record id:
new Entity(userId + "likes", recordId)
So when I do a query to get all the likes, I can simply query for everything, no filters needed. AND I could just grab the entity keys, which would be much cheaper if I remember the App Engine pricing documentation right (can't find the pricing page anymore). Then I could take the Iterable of keys and do a single get(Iterable keys). OK, so I guess this approach is faster and cheaper, right? But what if I want to grab all the votes of a user, or better said, all the records a user didn't vote on yet?
Here's the real question:
I want to load all the records a user didn't vote on yet:
So I would have entities like this:
new Entity(userId+"likes", recordId);
and
new Entity(userId+"hates", recordId);
I would query both vote tables for all entity keys and query the record table for all entity keys. Then I would remove all record entity keys that match one of the vote entity keys, and with the result I would get(Iterable keys) the full entities, leaving me with all the record entities that are not in either of the two voting tables.
Is that a useful approach? Is that the fastest and most cost-efficient way to do a datastore query? Am I totally wrong, and should I store the information as list properties instead?
EDIT:
With that approach I would have 2 entity kinds for each user, which would result in millions of different kinds. How would the GAE Datastore handle that? At least the Datastore Viewer's entity select box would probably crash :)
To answer the Real Question: you probably want your hateOrLike field to store an integer that indicates hated/liked/not voted. Then you can filter on hateOrLike = notVoted.
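A rough sketch of that with the legacy App Engine Python db API (the same API family as the get_by_key_name example further down); the constants and property options are assumptions, and it presumes a UserRecordVote entity exists per user/record pair:

from google.appengine.ext import db

NOT_VOTED, LIKED, HATED = 0, 1, 2           # assumed encoding for hateOrLike

class UserRecordVote(db.Model):
    recordId = db.StringProperty(required=True)
    userId = db.StringProperty(required=True)
    hateOrLike = db.IntegerProperty(default=NOT_VOTED)

# Records this user has not voted on yet - a plain filter, no dynamic kind names
pending = (UserRecordVote.all()
           .filter('userId =', 'some-user-id')
           .filter('hateOrLike =', NOT_VOTED)
           .fetch(100))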
The other solutions you propose with the dynamically named entities make it impossible to query on other aspects of your entities, since you don't know their names.
The other thing is that, since you expect this to be huge, you likely want to keep a running counter of your votes rather than tabulating them every time you pull up a UserRecord. Querying all the votes and then calculating them on each view is very slow, especially since App Engine will only return 1000 results per query; if you have more than 1000 votes, you'll have to keep making repeated queries to get all the results.
If you think people will vote quickly, you should look into using a sharded counter for performance. There are examples of that, with code, available if you do a Google search.
Consider serializing user hate/like votes in two separate TextProperties inside the entity. Use the userId as key_name.
rec = UserRecordVote.get_by_key_name(userId)   # one entity per user, keyed by userId
hates = len(rec.hates.split('_'))              # count of '_'-separated record ids; guard against a missing entity or empty string in real code
etc.