Newbie Database Design - "Stack Overflow style voting system" - database

This is a simplified version of the database design I have so far, for a 'Stack Overflow' style voting system.
The question is: if the user has a score for the total number of votes they got for a response, should that score be worked out 'on the fly' or should there be a field in the users table referring to their score. Also if the case is the the later, what would the recommended method be for keeping it up to date?
Users Table
-id
-name
-email
Question Table
-id
-text
-poster (user id)
Responses Table
-id
-text
-question (question id)
-poster (user id)
Votes Table
-id
-response (response id)
-voter (user id)

De-normalizing the database model so a few critical scenarios can have better performance is justified, as long as you do it in a careful and deliberate fashion.
So after you have benchmarked the realistic amount of data and determined that counting votes on the fly causes performance problems, go right ahead and cache the vote count.
Probably the most robust way to keep the cached value up-to-date is to implement a database trigger that increments the cached value whenever a row is inserted into Votes and decrements it when a row is deleted.
(NOTE: Having a SELECT COUNT(*)... trigger can introduce subtle concurrency issues.)

I'd suggest working out the vote total on demand, in the beginning at least. Demand will probably be low as it begins use, and it can be changed later to being stored as a field in the Responses table. When/if that happens, update it with a trigger. Also, set up a view to report the responses and totals off of, so when/if the change is made, you have a central place to make an interface update instead of hunting down queries in your code.
This has the added benefit of being more flexible if the requirements change during the development or after the initial release. Watch for feature creep.
You can also remove the id field in the Votes table. If each vote is unique to response and voter, it's sufficient to make those fields keys to the table.

You would probably want let the schema allow for votes on questions as well (meaning that would need another vote table). I think it would be a good idea to have a vote sum (reputation) in the Users table, since it would be more effective to query a single row in a table than to query a sum over two tables each time you wanted to display a user's reputation. You could either update this from triggers or from your business logic. You also need to think about how you would represent up/down-votes. You could do this with either a bit value (representing up or down) or an absolute number (1/-1) in both the vote tables. You would have to adjust the total in the Users table every time an entry is inserted, deleted or updated in both the voting tables. It would probably be a bit more "fool proof" to update the total through triggers, but you could also argue that it should live in your business logic layer, and that people should not be playing around in the tables directly anyways. But as long as you take an informed decision I personally don't think that it matters much.

Related

How to store feedback like stars or votes of users with efficiency?

I am making a system similar to our Play Store's star rating system, where a product or entity is given ratings and reviews by multiple users and for each entity, the average rating is displayed.
But the problem is, whether i should store the ratings in database of each entity with a list of users who rated it and rating given, but it will make it hard for a user to check which entities he has rated, as we need to check every entity for user's existence,
Or, should i store each entity with rating in user database but it will make rendering of entity harder
So, is there a simple and efficient way in which it can be done
Or is storing same data in both databases efficient, also i found one example of this system in stackoverflow, when the store up and down votes of a question, and give +5 for up vote while - for down vote to the asking user, which means they definitely need to store each up and down vote in question database, but when user opens the question, he can see his vote, therefore it is stored in user's database
Thanx for help
I would indeed store the 'raw' version at least, so have a big table that stores the productid/entityid, userid and rating. You can query from that table directly to get any kind of result you want. Based on that you can also calculate (or re-calculate) projections if you want, so its a safe bet to store this as the source of truth.
You can start out with a simple aggregate query, as long as that is fast enough, but to optimize it, you can make projections of the data in a different format, for instance the average review score per product. This van be achieved using (materialized) views, or you can just store the aggregated rating separately whenever a vote is cast.
Updating that projected aggregate can be very lightweight as well, because you can store the average rating for an entity, together with the number of votes. So when you update the rating, you can do:
NewAverage = (AverageRating * NumberOfRatings + NewRating) / (NumberOfRatings + 1)
After that, you store the new average and increment number of ratings. So there is no need to do a full aggregation again whenever somebody casts a vote, and you got the additional benefit of tracking the number of votes too, which is often displayed as well on websites.
The easiest way to achieve this is by creating a review table that holds the user and product. so your database should look like this.
product
--id
--name
--price
user
--id
-- firstname
--lastname
review
--id
--userId
--productId
--vote
then if you want to get all review for a product by a user then you can just query
the review table. hope this solves your problem?

Database design, an included attribute vs multiple joins? Confused

So I am taking a class in database design and management and am kind of confused from a design perspective. My example is an invoice system. I just made it up quick so it doesn't have a ton of complexity in it.
There are Customers, Orders, Invoices and Payments entities
Customers
CustId(PK),
Street,
Zip,
City,
..
Orders
OrderID(PK)
CustID(FK)
Date
Amt
....
Invoices
InvoiceID(PK),
OrderID(FK),
Date,
AmtDue,
AmtPaid,
....
Payments
PaymentNo(PK),
InvoiceID(FK),
PayMethod,
Date,
Amt,
...
Customer entity has a one to many relationship with Orders
Purchases entity has a one to many relationship with Invoices
Invoices Entity has a one to many relationship with Payments.
To get the results of a query to list all Payments made by a Customer the query would have to join Payments with the Invoice table, the Invoice table with the Orders table and the Orders table with the Customer table.
Is this the correct way to do it? One could also just put a custID in the payment entity which would then just require one join, but then there is unneeded information in the payment entity. Is this just a design thing or is it a performance issue?
Bonus question. Lets say there should be a report that says what the total customer balance is. Does there need to be a customer balance field in the database or can this be a calculated item that is produced by joining tables and adding up the amount billed vs amount paid?
Thanks!
Is this the correct way to do it?
Yes. Based on the information provided, it looks reasonable.
One could also just put a custID in the payment entity which would then just require one join, but then there is unneeded information in the payment entity. Is this just a design thing or is it a performance issue?
The question you're asking falls under "normal forms", often called normalization. Your target should be Boyce-Codd normal form (similar to 3NF), which should be described in your textbook. I will warn you that misinformation and misuderstanding of database design issues is very abundant on the interwebs, so beware of which answers you pay attention to.
The goal of normalization is to eliminate redundancy, and thus to eliminate "anomaliies", whereby two logically equivalent queries produce inconsistent results. If the same information is kept in two places, and is updated in only one, then two queries against the two different values will produce different -- i.e, inconsistent -- results.
In your example, if there is a Payments.CustID, should I believe that one, or the one derived from joining Payments to Orders? The same goes for total customer balance: do I believe the stored total, or the one I computed from the consituents?
If you are going to "denomalize for performance", as is so often alleged to be necessary, what are you going to do to ensure the redundant values are consistent?
Bonus question. Lets say there should be a report that says what the total customer balance is.
As a matter of fact, in practice balances are sort of a special case. It's often necessary to know the balance at points in time. While it's possible to compute, say, monthy account balances from inception based on transactions, as a practical matter applications usually "draw a line in the sand" and record the balance for future reference. Step are taken -- must be, for the sake of the business -- to ensure the historical information does not change or, if it does, that the recorded balance is updated to reflect the change. From that description alone, you can imagine that the work of enforcing consistency throughout the system is much more work than relying on the DBMS to enforce it. And that is why, insofar as is feasible, it's better to elimate all redundant data, and let the DBMS do the job it was designed to do.
In your analysis, seek Boyce-Codd normal form. Understand your data, eliminate the redundancies, and recognize the relations. Let the DBMS enforce referential integrity. Countless errors will be avoided, and time saved. Only when specific circumstances conspire to show that specific business requirements cannot be satisfied on a particular system with a given, correct design, does one begin the tedious and error-prone work of introducing redundant information and compensating for it with external controls.
"Is this the correct way to do it?" Of course, given your current design. But it's not the ONLY way. So you're studying DB "normalization" and seeing the pros and cons of the various "forms" of normalization. In the "real world" things can change on a dime, due to a management decision or whatever. I tend to use "compound primary keys" instead of simply one field for primary and others as FK. I handle my "FK" programmatically instead of relegating that responsibility to the DB.
I also create and utilize a number of "intermediate" tables, or sometimes "VIEWS", that I use more easily than a bunch of code with too many JOINs. (3rd Normal form addicts can hate, but my code runs faster than a scalded rabbit).
An Order means nothing without a Customer; an Invoice means nothing without an Order; a Payment is great, but means nothing without both an Order and Invoice. So lemme throw this out there -- what's wrong with having a "summary" type of entity that has Cust, Order, Invoice #, and Payment Id ?

Database design - system default items and custom user items

This question applies to any database table design, where you would have system default items and custom user defaults of the same type (ie user can add his own custom items/settings).
Here is an example of invoicing and paymenttypes, By default an invoice can have payment terms of DueOnReceipt, NET10, NET15, NET30 (this is the default for all users!) therefore you would have two tables "INVOICE" and "PAYMENT_TERM"
INVOICE
Id
...
PaymentTermId
PAYMENT_TERM (System default)
Id
Name
Now what is the best way to allow a user to store their own custom "PaymentTerms" and why? (ie user can use system default payment terms OR user's own custom payment terms that he created/added)
Option 1) Add UserId to PaymentTerm, set userid for the user that has added the custom item and system default userid set to null.
INVOICE
Id
...
PaymentTermId
PaymentTerm
Id
Name
UserId (System Default, UserId=null)
Option 2) Add a flag to Invoice "IsPaymentTermCustom" and Create a custom table "PAYMENT_TERM_CUSTOM"
INVOICE
Id
...
PaymentTermId
PaymentTermCustomId
IsPaymentTermCustom (True for custom, otherwise false for system default)
PaymentTerm
Id
Name
PAYMENT_TERM_CUSTOM
Id
Name
UserId
Now check via SQL query if the user is using a custom payment term or not, if IsPaymentTermCustom=True, it means the user is using custom payment term otherwise its false.
Option 3) ????
...
As a general rule:
Prefer adding columns to adding tables
Prefer adding rows to adding columns
Generally speaking, the considerations are:
Effects of adding a table
Requires the most changes to the app: You're supporting a new kind of "thing"
Requires more complicated SQL: You'll have to join to it somehow
May require changes to other tables to add a foreign key column referencing the new table
Impacts performance because more I/O is needed to join to and read from the new table
Note that I am not saying "never add tables". Just know the costs.
Effects of adding a column
Can be expensive to add a column if the table is large (can take hours for the ALTER TABLE ADD COLUMN to complete and during this time the table wil be locked, effectively bringing your site "down"), but this is a one-time thing
The cost to the project is low: Easy to code/maintain
Usually requires minimal changes to the app - it's a new aspect of a thing, rather than a new thing
Will perform with negligible performance difference. Will not be measurably worse, but may be a lot faster depending on the situation (if having the new column avoids joining or expensive calculations).
Effects of adding rows
Zero: If your data model can handle your new business idea by just adding more rows, that's the best option
(Pedants kindly refrain from making comments such as "there is no such thing as 'zero' impact", or "but there will still be more disk used for more rows" etc - I'm talking about material impact to the DB/project/code)
To answer the question: Option 1 is best (i.e. add a column to the payment option table).
The reasoning is based on the guidelines above and this situation is a good fit for those guidelines.
Further,
I would also store "standard" payment options in the same table, but with a NULL userid; that way you only have to add new payment options when you really have one, rather than for every customer even if they use a standard one.
It also means your invoice table does not need changing, which is a good thing - it means minimal impact to that part of your app.
It seems to me that there are merely "Payment Terms" and "Users". The decision of what are the "Default" payment terms is a business rule, and therefore would be best represented in the business layer of your application.
Assuming that you would like to have a set of pre-defined "default" payment terms present in your application from the start, these would already be present in the payment terms table. However, I would put a reference table in between USERS and PAYMENT TERMS:
USERS:
user-id
user_namde
USER_PAYMENT_TERMS:
userID
payment_term_id
PAYMENT_TERMS:
payment_term_id
payment_term
Your business layer should offer up to the user (or more likely, the administrator) through a GUI the ability to:
Assign 0 to many payment term options to a particular user (some
users may not want one of the defaults to even be available, for
example.
Add custom payment terms, which then become available for assignment to one or more users (but which avoids the creation of duplicate payment terms by different users)
Allows the definition of a custom payment term to be assigned to more than one user (say the user's company a unique payment process which requires all of their users to utilize a payment term other than one of the defaults? Create the custom term once, and assign to all users.
Your application business layer would establish rules governing access to payment terms, which could then be accessed by your user interface.
Your UI would then (again, likely through an administrator function) allow the set up of one or more payment terms in addition to the standards you describe, and then make them available to one or more users through something like a checked list box (for example).
Option 1 is definately better for the following reasons:-
Correctness
You can implement a database constraint for uniqueness of the payment term name
You can implement a foreign key constraint from Invoice to PaymentTerm
Ease of Use
Conducting queries will be much simplier because you will always join from Invoice to PaymentTerm rather than requiring a more complex join. Most of the time when you select you will not care if it is an inbuilt or custom payment term. The optimizer will have an easier time with a normal join instead of one that depends on another column to decide which table to join.
Easier to display a list of PaymentTerms coming from one table
We use Option 1 in our data-model quite alot.
Part of the problem, as I see it, is that different payment terms lead to different calculations, too. If I were still in the welding supply business, I'd want to add "2% 10 NET 30", which would mean 2% discount if the payment is made in full within 10 days, otherwise, net 30."
Setting that issue aside, I think ownership of the payment terms makes sense. Assume that the table of users (not shown) includes the user "system" as, say, user_id 0.
create table payment_terms (
payment_term_id integer primary key,
payment_term_owner_id integer not null references users (user_id),
payment_term_desc varchar(30) not null unique,
);
insert into payment_terms values (1, 0, 'Net 10');
insert into payment_terms values (2, 0, 'Net 15');
...
insert into payment_terms values (5, 1, '2% 10, Net 30');
This keeps foreign keys simple, and it makes it easy to select payment terms at run time for presentation in the user interface.
Be very careful here. You probably want to store the description, not the ID number, with your invoices. (It's unique; you can set a foreign key reference to it.) If you store only the ID number, updating a user's custom description might subtly corrupt all the data that references it.
For example, let's say that the user created a custom payment term number 5, '2% 10, Net 30'. You store the ID number 5 in your table of invoices. Then the user decides that things will be different starting today, and updates that description to '2% 10, Net 20'. Now on all your past invoices, the arithmetic no longer matches the payment terms.
Your auditor will kill you. Twice.
You'll want to prevent ordinary users from deleting rows owned by the system user. There are several ways to do that.
Use a BEFORE DELETE trigger.
Add another table with foreign key references to the rows owned by the system user.
Restrict all access through stored procedures that prevent deleting system rows.
(And flags are almost never the best idea.)
Applying general rules of database design to the problem at hand:
one table for system payment terms
one table for user payment terms
a view of join of the two above
Now you can join invoice on the view of payment terms.
Benefits:
No flag columns
No nulls
You separate system defaults from user data
Things become straight forward for the db

Database design for a product voting system

I'm am making a system where a user can vote up or down on a product, I need to be able to explicitly work out the amount of ups and downs a product has, as well as a total score for a recent period.
Each vote can optionally have a comment with it, and users need the ability to echo/boost other peoples comments (kinda like a retweet), and this will also add/subtract the total score of the product depending on the parent vote being retweeted.
Here are my current proposed tables:
Product
ID, name, category_id
Vote
ID, user_id, product_id, parent_id, comment, score, datetime
User
ID, username etc.
I am thinking I will possibly need a comments table to do this effectively? The votes' score field is either 1 or -1 as per some advice I read on StackOverflow which would allow me to gather the SUM() of that column to calculate total votes, another possibility would be to have separate vote_up and vote_down tables...but I am just not sure.
Depending on what you want to do, this can be an incredibly sophisticated problem, but here's my take on the simplest way (eg. what i can throw together in the 10 min before I leave work ;-P)
I would try the StackOverflow/HotOrNot style approach, and Store their ranking as an unsigned integer.
PRODUCTS(
id,
category_id,
name,
rating INTEGER UNSIGNED NOT NULL DEFAULT 0
);
Then in your 'VOTES' table, you store the Vote (up/down). I think the table you have for your 'VOTES' table looks fine( although I would use either an enumeration as the SCORE datatype, or some strategy to ensure that a vote can't be manipulated via XSS. eg. someone modifies the vote so that their vote up is +10,000 instead of +1, then that would not be cool )
For a small fun app, you can probably get by with incrementing or decrementing the count when the user clicks, but if you are doing anything with aspirations of scaling out, then you would do the vote calculation and ranking via some batch process that runs every 10-15 minutes.
Also at this level, you would start using an algorithm to weight the vote values. For example, if the same user votes (up or down) the same product more than once a day(or once every) then the votes after the first should not count towards calculating the rank of the product.
For Example, here is how Quora's Ranking Algorithm works
If the user is a "Power User" or has an account that is more active, maybe their vote is more important than a new users vote. I think on Yelp, if you don't have more than one or two reviews, your rating and reviews don't get counted until you meet some minimum number of reviews. Really, the skies the limit.
PS. I would also recommend checking out this o'Reilly book on some of the strategies for solving these kinds of problems
If you expect a large number of users simultaneously voting, you really need to consider performance....
A naive approach might look something like this (my apologies if i oversimplify your example, and if T-SQL isn't your poison):
create table Products(
ProductId BIGINT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
Score INT NOT NULL...
ProductDetails...
where you will be performing updates to Products by summing up/down vote tables. BAD!
If you have a large number of users voting, deadlocks are sure to occur by constantly inserting/updating/selecting against the same table.
A better approach would be to drop the Score column altogether, only insert into the up/down vote tables, and select as needed. There's no reason you can't calculate the sum in code (i.e. PHP, C#, or whatever), and it avoids ever having to update the Products table (at least for calculating the Score). In other words, storing the Score on the Products table buys you nothing, and is just unnecessary overhead.
I speak from experience when I say updates "can" be bad in high volume systems. Updating is expensive when compared to inserting at the end or selects (assuming your table is properly indexed), and it's very easy to unknowingly take out a substantial lock in situations like this.
Book Beginning CakePHP has an Ajax tutorial on implementing a working voting (up/down) system on comments. I did the tutorial a few years ago. I am not sure how secure it is or if it would be a good foundation for your project, but it would probably be worth having a look at for some ideas.

Where should I break up my user records to keep track of revisions

I am putting together a staff database and I need to be able to revise the staff member information, but also keep track of all the revisions. How should I structure the database so that I can have multiple revisions of the same user data but be able to query against the most recent revision? I am looking at information that changes rarely, like Last Name, but that I will need to be able to query for out of date values. So if Jenny Smith changes her name to Jenny James I need to be able to find the user's current information when I search against her old name.
I assume that I will need at least 2 tables, one that contains the uid and another that contains the revisions. Then I would join them and query against the most recent revision. But should I break it out even further, depending on how often the data changes or the type of data? I am looking at about 40 fields per record and only one or two fields will probably change per update. Also I cannot remove any data from the database, I need to be able to look back on all previous records.
A simple way of doing this is to add a deleted flag and instead of updating records you set the deleted flag on the existing record and insert a new record.
You can of course also write the existing record to an archive table, if you prefer. But if changes are infrequent and the table is not big I would not bother.
To get the active record, query with 'where deleted = 0', the speed impact will be minimal when there is an index on this field.
Typically this is augmented with some other fields like a revision number, when the record was last updated, and who updated it. The revision number is very useful to get the previous versions and also to do optimistic locking. The 'who updated this last and when' questions usually come once the system is running instead of during requirements gathering, and are useful fields to put in any table containing 'master' data.
I would use the separate table because then you can have a unique identifier that points to all the other child records that is also the PK of the table which I think makes it less likely you will have data integrity issues. For instance, you have Mary Jones who has records in the address table and the email table and performance evaluation table, etc. If you add a change record to the main table, how are you going to relink all the existing information? With a separate history table, it isn't a problem.
With a deleted field in one table, you then have to have an non-autogenerated person id and an autogenrated recordid.
You also have the possiblity of people forgetting to use the where deleted = 0 where clause that is needed for almost every query. (If you do use the deleted flag field, do yourself a favor and set a view with the where deleted = 0 and require developers to use the view in queries not the orginal table.)
With the deleted flag field you will also need a trigger to ensure one and only one record is marked as active.
#Peter Tillemans' suggestion is a common way to accomplish what you're asking for. But I don't like it.
The structure of a database should reflect the real-world facts that are being modeled.
I would create a separate table for obsolete_employee, and just store the historical information that would need to be searched in the future. This way you can keep your real employee data table clean and keep only the old data that is necessary. This approach will also simplify reporting and other features of the application that are not related to searching historical data.
Just think of that warm feeling you'll get when you type select * from employee and nothing but current, correct goodness comes flowing back!

Resources