Database design for a product voting system

I am making a system where a user can vote up or down on a product. I need to be able to work out explicitly how many up votes and down votes a product has, as well as a total score for a recent period.
Each vote can optionally carry a comment, and users need to be able to echo/boost other people's comments (kind of like a retweet). A boost also adds to or subtracts from the product's total score, depending on the parent vote being boosted.
Here are my current proposed tables:
Product
ID, name, category_id
Vote
ID, user_id, product_id, parent_id, comment, score, datetime
User
ID, username etc.
I am thinking I may need a separate comments table to do this effectively. The vote's score field is either 1 or -1, following some advice I read on StackOverflow, which would let me take the SUM() of that column to calculate the total score. Another possibility would be to have separate vote_up and vote_down tables, but I am just not sure.

Depending on what you want to do, this can be an incredibly sophisticated problem, but here's my take on the simplest way (i.e. what I can throw together in the 10 minutes before I leave work ;-P).
I would try the StackOverflow/HotOrNot-style approach and store each product's ranking as an unsigned integer (use a signed integer instead if the net score should be allowed to drop below zero).
PRODUCTS(
id,
category_id,
name,
rating INTEGER UNSIGNED NOT NULL DEFAULT 0
);
Then you store each vote (up/down) in your 'VOTES' table. I think the table you have for 'VOTES' looks fine, although I would either use an enumeration as the SCORE datatype or apply some other strategy to make sure a vote can't be manipulated by tampering with the request. For example, if someone modifies their request so that their up vote counts as +10,000 instead of +1, that would not be cool.
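As a rough sketch of that idea, here is one way to let the database itself refuse anything other than +1 or -1, and then read the ups, downs and total for a recent period. It uses Python's built-in sqlite3 only so it can run anywhere; the `created_at` column name and the 7-day window are my own assumptions, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE vote (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        parent_id  INTEGER REFERENCES vote(id),   -- set when boosting another vote
        comment    TEXT,
        score      INTEGER NOT NULL CHECK (score IN (-1, 1)),
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- assumed column name
    )
""")

conn.execute("INSERT INTO vote (user_id, product_id, score) VALUES (1, 10, 1)")
conn.execute("INSERT INTO vote (user_id, product_id, score) VALUES (2, 10, 1)")
conn.execute("INSERT INTO vote (user_id, product_id, score) VALUES (3, 10, -1)")

# A tampered request that tries to submit +10000 is rejected by the CHECK constraint.
try:
    conn.execute("INSERT INTO vote (user_id, product_id, score) VALUES (4, 10, 10000)")
    tampered_vote_rejected = False
except sqlite3.IntegrityError:
    tampered_vote_rejected = True

# Ups, downs and net score for one product over the last 7 days (assumed window).
ups, downs, total = conn.execute("""
    SELECT COALESCE(SUM(score = 1), 0),
           COALESCE(SUM(score = -1), 0),
           COALESCE(SUM(score), 0)
    FROM vote
    WHERE product_id = ?
      AND created_at >= datetime('now', '-7 days')
""", (10,)).fetchone()

print(tampered_vote_rejected, ups, downs, total)
```

The same CHECK idea should carry over to most SQL databases, though the date functions will differ.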
For a small fun app, you can probably get by with incrementing or decrementing the count when the user clicks, but if you are doing anything with aspirations of scaling out, then you would do the vote calculation and ranking via some batch process that runs every 10-15 minutes.
Also, at this level you would start using an algorithm to weight the vote values. For example, if the same user votes (up or down) on the same product more than once a day (or more than once in whatever period you choose), then the votes after the first should not count towards the rank of the product.
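To make the "only the first vote per day counts" rule concrete, here is a small sketch of how a batch job might apply it. The function name and the tuple layout are made up for illustration, and a real weighting scheme would likely be more involved.

```python
from datetime import datetime


def count_first_vote_per_day(votes):
    """Sum scores per product, counting only each user's first vote per product per day.

    `votes` is an iterable of (user_id, product_id, score, voted_at) tuples.
    """
    seen = set()
    totals = {}
    # Process votes in time order so "first" really means earliest.
    for user_id, product_id, score, voted_at in sorted(votes, key=lambda v: v[3]):
        key = (user_id, product_id, voted_at.date())
        if key in seen:
            continue  # a later vote by the same user on the same day is ignored
        seen.add(key)
        totals[product_id] = totals.get(product_id, 0) + score
    return totals


votes = [
    (1, 10, 1, datetime(2024, 1, 1, 9, 0)),
    (1, 10, 1, datetime(2024, 1, 1, 9, 5)),    # same user, same day: ignored
    (1, 10, 1, datetime(2024, 1, 2, 9, 0)),    # next day: counted
    (2, 10, -1, datetime(2024, 1, 1, 12, 0)),
]
totals = count_first_vote_per_day(votes)
print(totals)
```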
For example, see how Quora's ranking algorithm works.
If the user is a "Power User" or has a more active account, maybe their vote should count for more than a new user's vote. I think on Yelp, if you don't have more than one or two reviews, your ratings and reviews don't get counted until you reach some minimum number of reviews. Really, the sky's the limit.
PS. I would also recommend checking out this O'Reilly book on some of the strategies for solving these kinds of problems.

If you expect a large number of users simultaneously voting, you really need to consider performance....
A naive approach might look something like this (my apologies if I oversimplify your example, and if T-SQL isn't your poison):
create table Products(
ProductId BIGINT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
Score INT NOT NULL...
ProductDetails...
where you will be performing updates to Products by summing up/down vote tables. BAD!
If you have a large number of users voting, lock contention and deadlocks become likely when you are constantly inserting into, updating, and selecting from the same table.
A better approach would be to drop the Score column altogether, only insert into the up/down vote tables, and select as needed. There's no reason you can't calculate the sum in code (i.e. PHP, C#, or whatever), and it avoids ever having to update the Products table (at least for calculating the Score). In other words, storing the Score on the Products table buys you nothing, and is just unnecessary overhead.
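A minimal sketch of that insert-only approach, assuming two hypothetical tables `vote_up` and `vote_down` and using Python's sqlite3 rather than T-SQL so it is easy to run. Nothing ever updates a product row; the score is derived when it is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vote_up   (product_id INTEGER NOT NULL, user_id INTEGER NOT NULL);
    CREATE TABLE vote_down (product_id INTEGER NOT NULL, user_id INTEGER NOT NULL);
    -- Index the lookup column so counting votes for one product stays cheap.
    CREATE INDEX idx_vote_up_product   ON vote_up (product_id);
    CREATE INDEX idx_vote_down_product ON vote_down (product_id);
""")


def cast_vote(product_id, user_id, up):
    """Record a vote with a plain INSERT; no existing row is updated."""
    table = "vote_up" if up else "vote_down"  # chosen from two fixed names only
    conn.execute(
        f"INSERT INTO {table} (product_id, user_id) VALUES (?, ?)",
        (product_id, user_id),
    )


def score(product_id):
    """Compute the score on demand as ups minus downs."""
    ups = conn.execute(
        "SELECT COUNT(*) FROM vote_up WHERE product_id = ?", (product_id,)
    ).fetchone()[0]
    downs = conn.execute(
        "SELECT COUNT(*) FROM vote_down WHERE product_id = ?", (product_id,)
    ).fetchone()[0]
    return ups - downs


for user_id in (1, 2, 3):
    cast_vote(10, user_id, up=True)
cast_vote(10, 4, up=False)
print(score(10))
```

Whether counting on every read is cheap enough depends on your volumes, so it is worth measuring before committing to either design.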
I speak from experience when I say updates "can" be bad in high volume systems. Updating is expensive when compared to inserting at the end or selects (assuming your table is properly indexed), and it's very easy to unknowingly take out a substantial lock in situations like this.

The book Beginning CakePHP has an Ajax tutorial on implementing a working (up/down) voting system on comments. I did the tutorial a few years ago. I am not sure how secure it is or whether it would be a good foundation for your project, but it is probably worth a look for some ideas.

How to store feedback like stars or votes of users with efficiency?

I am making a system similar to the Play Store's star rating system, where a product or entity is given ratings and reviews by multiple users, and the average rating is displayed for each entity.
The problem is whether I should store the ratings in each entity's database record, with a list of the users who rated it and the rating each gave. That would make it hard for a user to check which entities they have rated, since we would need to check every entity for the user's presence.
Or should I store each entity and its rating in the user's database record? That would make rendering an entity harder.
So, is there a simple and efficient way to do this?
Or is storing the same data in both places efficient? I also found an example of this kind of system on Stack Overflow: they store the up and down votes on a question, give the asking user +5 for an up vote and take points away for a down vote, which means they must store each up and down vote with the question. But when a user opens the question they can see their own vote, so it is also stored with the user.
Thanks for the help.
I would indeed store the 'raw' version at least, so have a big table that stores the productid/entityid, userid and rating. You can query that table directly to get any kind of result you want. Based on it you can also calculate (or re-calculate) projections if you want, so it's a safe bet to store this as the source of truth.
You can start out with a simple aggregate query, as long as that is fast enough. To optimize it, you can make projections of the data in a different format, for instance the average review score per product. This can be achieved using (materialized) views, or you can just store the aggregated rating separately and update it whenever a vote is cast.
Updating that projected aggregate can be very lightweight as well, because you can store the average rating for an entity, together with the number of votes. So when you update the rating, you can do:
NewAverage = (AverageRating * NumberOfRatings + NewRating) / (NumberOfRatings + 1)
After that, you store the new average and increment the number of ratings. So there is no need to do a full aggregation again whenever somebody casts a vote, and you get the additional benefit of tracking the number of votes too, which websites often display as well.
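The incremental update above can be sketched in a few lines. The function name `add_rating` is just for illustration.

```python
def add_rating(average, count, new_rating):
    """Return the updated (average, count) after one more rating arrives.

    Implements NewAverage = (AverageRating * NumberOfRatings + NewRating)
                            / (NumberOfRatings + 1)
    """
    new_count = count + 1
    new_average = (average * count + new_rating) / new_count
    return new_average, new_count


# Start with no ratings, then apply three ratings one at a time.
average, count = 0.0, 0
for rating in (5, 3, 4):
    average, count = add_rating(average, count, rating)

print(average, count)
```

If the stored average is rounded, small errors can build up over many updates; storing the running sum and the count instead of the average avoids that.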
The easiest way to achieve this is to create a review table that references the user and the product, so your database would look like this:
product
--id
--name
--price
user
--id
--firstname
--lastname
review
--id
--userId
--productId
--vote
Then if you want to get all the reviews for a product, or all the reviews by a user, you can just query the review table. Hope this solves your problem.
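Here is a small sketch of that review table in use, again with sqlite3 so it runs without setup. The 1 to 5 range, the unique constraint (one review per user per product) and the index name are my assumptions. It shows that both directions of the lookup, "what has this user rated" and "what is this product's average", come from the same table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (
        id        INTEGER PRIMARY KEY,
        userId    INTEGER NOT NULL,
        productId INTEGER NOT NULL,
        vote      INTEGER NOT NULL CHECK (vote BETWEEN 1 AND 5),  -- assumed star range
        UNIQUE (userId, productId)                                 -- one review per user per product
    );
    -- The UNIQUE constraint already indexes userId; add one for product lookups.
    CREATE INDEX idx_review_product ON review (productId);
""")

conn.executemany(
    "INSERT INTO review (userId, productId, vote) VALUES (?, ?, ?)",
    [(1, 100, 5), (1, 200, 3), (2, 100, 4)],
)

# Everything user 1 has rated.
rated_by_user_1 = conn.execute(
    "SELECT productId, vote FROM review WHERE userId = ? ORDER BY productId", (1,)
).fetchall()

# Average rating of product 100.
average_for_100 = conn.execute(
    "SELECT AVG(vote) FROM review WHERE productId = ?", (100,)
).fetchone()[0]

print(rated_by_user_1, average_for_100)
```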

Is it better to cache a value in a column or query another table [duplicate]

I am trying to figure out the fastest way to access data stored in a junction object. The example below is analogous to my problem, but with a different context, because the actual dataset I am dealing with is somewhat unintuitive in its relationships.
We have 3 classes: User, Product, and Rating. User has a many-to-many relationship to Product with Rating as the junction/'through' class.
The Rating object stores the answers to several questions which are integer ratings on a scale of 1-5 (Example questions: How is the quality of the Product, how is the value of the Product, how user-friendly is the Product). For simplification assume every User rates every Product they buy.
Now here is the calculation I want to perform: For a User, calculate the average rating of all the Products they have bought (that is, the average rating from all other Users, one of which will be from this User themself). Then we can tell the user "On average, you buy products rated 3/5 for value by all customers who bought that product".
The simple and slow way is just to iterate over all of a user's review objects. If we assume that each user has bought a small (<100) number of products, and each product has n ratings, this is O(100n) = O(n).
However, I could also do the following: On the Product class, keep a counter of the number of Ratings that selected each number (e.g. how many Users rated this product 3/5 for value). If you increment that counter every time a Product is rated, then computing the average for a given Product just requires checking the 5 counters for each Rating criterion.
Is this a valid technique? Is it commonly employed/is there a name for it? It seems intuitive to me, but I don't know enough about databases to tell whether there's some fundamental flaw or not.
This is normal. It is ultimately caching: encoding of state redundantly to benefit some patterns of usage at the expense of others. Of course it's also a complexification.
Just because the RDBMS data structure is relations doesn't mean you can't rearrange how you are encoding state from some straightforward form. Eg denormalization.
(Sometimes redundant designs (including ones like yours) are called "denormalized" when they are not actually the result of denormalization and the redundancy is not the kind that denormalization causes or normalization removes; see "Cross Table Dependency/Constraint in SQL Database". Indeed one could reasonably describe your case as involving normalization without preserving FDs (functional dependencies). Start with a table with a user's id & other columns, their ratings (a relation) & its counter. Then ratings functionally determines counter since counter = select count(*) from ratings. Decompose to user etc + counter, ie table User, and user + ratings, which ungroups to table Rating.)
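For what it's worth, the per-value counters described in the question can be sketched like this. The class and method names are invented for illustration, and in a real system the counters would live in columns on the Product row rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class ProductRatingCounters:
    """Redundant (cached) counters: how many ratings chose each value 1..5, per criterion."""

    # counts[criterion][k] is the number of ratings that chose the value k + 1
    counts: dict = field(default_factory=dict)

    def add_rating(self, criterion, value):
        if not 1 <= value <= 5:
            raise ValueError("rating must be between 1 and 5")
        buckets = self.counts.setdefault(criterion, [0] * 5)
        buckets[value - 1] += 1

    def average(self, criterion):
        """Average computed from the 5 counters only, without touching any Rating rows."""
        buckets = self.counts.get(criterion, [0] * 5)
        total = sum(buckets)
        if total == 0:
            return None
        return sum((i + 1) * n for i, n in enumerate(buckets)) / total


product = ProductRatingCounters()
for value in (3, 3, 5, 1):
    product.add_rating("value", value)
print(product.average("value"))
```

The usual cost of such a cache applies: every write path that creates, changes or deletes a Rating has to keep the counters in step.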
Do you have a suggestion as to the best term to use when googling this?
A frequent comment by me: Google many clear, concise & specific phrasings of your question/problem/goal/desiderata with various subsets of terms & tags as you may discover them with & without your specific names (of variables/databases/tables/columns/constraints/etc). Eg 'when can i store a (sum OR total) redundantly in a database'. Human phrasing, not just keywords, seems to help. Your best bet may be along the lines of optimizing SQL database designs for performance. There are entire books ('amazon isbn'), some online ('pdf'). (But maybe mostly re queries). Investigate techniques relevant to warehousing, since an OLTP database acts as an input buffer to an OLAP database, and using SQL with big data. (Eg snapshot scheduling.)
PS My calling this "caching" (so does tag caching) is (typical of me) rather abstract, to the point where there are serious-jokes that everything in CS is caching. (Googling... "There are only two hard problems in Computer Science: cache invalidation and naming things."--Phil Karlton.) (Welcome to both.)

one to many relationship vs. multiple records in a single table

I'm designing a payment system. Which of the following two designs is more practical, generally implemented and considered a good practice?
Design 1
Consider two entities — order and credit_card_details.
A credit card might be used for payment of several orders. So we have a 1:M relationship between credit_card_details and order. Keep in mind that each record in credit_card_details is unique with the attributes like card_holder_name, cvv, expiry_date, etc. These are filled in a form while making the payment. This design requires that whenever a payment is made, I would need to lookup the credit_card_details table to check whether a new/old credit card is being used. If the credit card is —
Old: The corresponding FK is added to the order table.
New: A new record is added in credit_card_details and then the corresponding FK is added to the order table
Design 2
This is relatively simpler. I use a single order table into which all the attributes of credit_card_details from the previous design are merged. Whenever an order is placed, I need not check whether the entered credit card details already exist; I simply insert them into the order table. However, this comes at the cost of possible duplicate credit card details.
Personally, I think option 1 makes sense. Option 2 does not give you 3NF: the data is denormalized, so you may end up with duplicated data. What if the customer returns the order and you want to make a reverse payment, but the card has expired? These are just some common curveballs I am throwing out. It all depends on the given scenarios.
Also, imagine that you wanted a history of all the credit cards associated with a user and with their orders. What would be a logical way to store these in the database? Surely a separate table, right?
So a given user may have 0 to many cards.
A card can be associated with 1 or many orders.
And an order is always associated with one card.
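Those three relationships, plus the "look up the card, insert it if it is new, then attach the order" flow from Design 1, might be sketched like this. Note the assumptions: the table is called `orders` to avoid the reserved word, and `card_fingerprint` is a hypothetical stable identifier for a card (for example a token from your payment provider) that I have introduced so the lookup has something to match on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE credit_card_details (
        id               INTEGER PRIMARY KEY,
        user_id          INTEGER NOT NULL,      -- a user may have 0 to many cards
        card_fingerprint TEXT NOT NULL UNIQUE,  -- hypothetical identifier for the card
        card_holder_name TEXT NOT NULL,
        expiry_date      TEXT NOT NULL
    );
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        -- every order references exactly one card; a card may appear on many orders
        card_id INTEGER NOT NULL REFERENCES credit_card_details(id)
    );
    CREATE INDEX idx_orders_card ON orders (card_id);
""")


def place_order(user_id, fingerprint, holder_name, expiry_date):
    """Reuse the card row if it already exists, otherwise create it, then add the order."""
    with conn:  # one transaction for both steps
        conn.execute(
            "INSERT OR IGNORE INTO credit_card_details "
            "(user_id, card_fingerprint, card_holder_name, expiry_date) "
            "VALUES (?, ?, ?, ?)",
            (user_id, fingerprint, holder_name, expiry_date),
        )
        card_id = conn.execute(
            "SELECT id FROM credit_card_details WHERE card_fingerprint = ?",
            (fingerprint,),
        ).fetchone()[0]
        cursor = conn.execute(
            "INSERT INTO orders (user_id, card_id) VALUES (?, ?)", (user_id, card_id)
        )
        return cursor.lastrowid


place_order(1, "fp-aaa", "A. Customer", "2030-01")
place_order(1, "fp-aaa", "A. Customer", "2030-01")  # old card: reused
place_order(1, "fp-bbb", "A. Customer", "2031-06")  # new card: inserted

card_count = conn.execute("SELECT COUNT(*) FROM credit_card_details").fetchone()[0]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(card_count, order_count)
```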
Consider searching options and lookup speed as well; it is better to have a single foreign key to the card in the order table.
A third option might be to have an Order table, a Card table and an OrderCard table, although again it depends on your domain, and I think option three may be overkill.
Hope this helps in your design

Newbie Database Design - "Stack Overflow style voting system"

This is a simplified version of the database design I have so far, for a 'Stack Overflow' style voting system.
The question is: if the user has a score for the total number of votes they got for their responses, should that score be worked out 'on the fly', or should there be a field in the users table holding their score? Also, if it is the latter, what would be the recommended method for keeping it up to date?
Users Table
-id
-name
-email
Question Table
-id
-text
-poster (user id)
Responses Table
-id
-text
-question (question id)
-poster (user id)
Votes Table
-id
-response (response id)
-voter (user id)
De-normalizing the database model so a few critical scenarios can have better performance is justified, as long as you do it in a careful and deliberate fashion.
So after you have benchmarked the realistic amount of data and determined that counting votes on the fly causes performance problems, go right ahead and cache the vote count.
Probably the most robust way to keep the cached value up-to-date is to implement a database trigger that increments the cached value whenever a row is inserted into Votes and decrements it when a row is deleted.
(NOTE: Having a SELECT COUNT(*)... trigger can introduce subtle concurrency issues.)
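A sketch of the increment/decrement trigger idea, using SQLite through Python so it is runnable as-is. The `vote_count` column on Responses and the trigger names are assumptions on my part, and trigger syntax differs between databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE responses (
        id         INTEGER PRIMARY KEY,
        text       TEXT,
        vote_count INTEGER NOT NULL DEFAULT 0   -- cached value (assumed column)
    );
    CREATE TABLE votes (
        id       INTEGER PRIMARY KEY,
        response INTEGER NOT NULL REFERENCES responses(id),
        voter    INTEGER NOT NULL
    );

    -- Adjust the cached count by one instead of re-running SELECT COUNT(*).
    CREATE TRIGGER votes_after_insert AFTER INSERT ON votes
    BEGIN
        UPDATE responses SET vote_count = vote_count + 1 WHERE id = NEW.response;
    END;

    CREATE TRIGGER votes_after_delete AFTER DELETE ON votes
    BEGIN
        UPDATE responses SET vote_count = vote_count - 1 WHERE id = OLD.response;
    END;
""")

conn.execute("INSERT INTO responses (id, text) VALUES (1, 'an answer')")
conn.executemany(
    "INSERT INTO votes (response, voter) VALUES (?, ?)", [(1, 7), (1, 8), (1, 9)]
)
conn.execute("DELETE FROM votes WHERE response = 1 AND voter = 8")

cached = conn.execute("SELECT vote_count FROM responses WHERE id = 1").fetchone()[0]
actual = conn.execute("SELECT COUNT(*) FROM votes WHERE response = 1").fetchone()[0]
print(cached, actual)
```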
I'd suggest working out the vote total on demand, in the beginning at least. Demand will probably be low as it begins use, and it can be changed later to being stored as a field in the Responses table. When/if that happens, update it with a trigger. Also, set up a view to report the responses and totals off of, so when/if the change is made, you have a central place to make an interface update instead of hunting down queries in your code.
This has the added benefit of being more flexible if the requirements change during the development or after the initial release. Watch for feature creep.
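The view suggestion could look roughly like this. The view name `response_totals` and its column names are made up. The point is that callers read from the view, so its definition can later switch from counting on demand to reading a stored column without the callers changing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE responses (id INTEGER PRIMARY KEY, text TEXT, question INTEGER, poster INTEGER);
    CREATE TABLE votes (id INTEGER PRIMARY KEY, response INTEGER NOT NULL, voter INTEGER NOT NULL);

    -- Totals are worked out on demand; LEFT JOIN keeps responses with zero votes.
    CREATE VIEW response_totals AS
        SELECT r.id AS response_id, r.poster AS poster, COUNT(v.id) AS vote_total
        FROM responses r
        LEFT JOIN votes v ON v.response = r.id
        GROUP BY r.id, r.poster;
""")

conn.executemany(
    "INSERT INTO responses (id, text, question, poster) VALUES (?, ?, ?, ?)",
    [(1, "first answer", 1, 50), (2, "second answer", 1, 51)],
)
conn.executemany("INSERT INTO votes (response, voter) VALUES (?, ?)", [(1, 7), (1, 8)])

rows = conn.execute(
    "SELECT response_id, vote_total FROM response_totals ORDER BY response_id"
).fetchall()
print(rows)
```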
You can also remove the id field in the Votes table. If each vote is unique to response and voter, it's sufficient to make those fields keys to the table.
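A quick sketch of that composite key, assuming the column names from the question. A useful side effect is that the database itself rejects a second vote by the same voter on the same response.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE votes (
        response INTEGER NOT NULL,
        voter    INTEGER NOT NULL,
        PRIMARY KEY (response, voter)   -- replaces the separate id column
    )
""")

conn.execute("INSERT INTO votes (response, voter) VALUES (1, 7)")

# The same voter voting on the same response again violates the primary key.
try:
    conn.execute("INSERT INTO votes (response, voter) VALUES (1, 7)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

vote_rows = conn.execute("SELECT COUNT(*) FROM votes").fetchone()[0]
print(duplicate_rejected, vote_rows)
```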
You would probably want to let the schema allow for votes on questions as well (meaning you would need another vote table). I think it would be a good idea to have a vote sum (reputation) in the Users table, since it is more efficient to query a single row in one table than to sum over two tables each time you want to display a user's reputation. You could update this either from triggers or from your business logic. You also need to think about how you would represent up/down votes. You could do this with either a bit value (representing up or down) or a signed number (1/-1) in both vote tables. You would have to adjust the total in the Users table every time an entry is inserted, deleted or updated in either of the vote tables. It would probably be a bit more "foolproof" to update the total through triggers, but you could also argue that it should live in your business logic layer, and that people should not be playing around in the tables directly anyway. As long as you make an informed decision, I personally don't think it matters much.
