Q&A database map like Stack Overflow - separate table for question titles - database

I am planning to make a Q&A system (quite specific, has nothing to do with IT).
I was looking at the Stack Overflow database map: https://meta.stackexchange.com/questions/2677/anatomy-of-a-data-dump/2678#2678
And I am wondering whether it would be better practice to make a separate table for question titles, with a "firstPostId".
Instead of
|- PostTypeId
| - 1: Question
| - 2: Answer
So I want to know why Stack Overflow did not use a separate table for question titles. Is it a case of "do not optimize yet", or is there some logic behind it?

Based just on the schema as shown in your link, I surmise that Questions and Answers have so many attributes in common that it was convenient to model them as was done. In short, symmetry and not multiplying entities unnecessarily seem credible reasons for the approach.
I also suspect they use a key/value (a.k.a. nosql) database for the backing store which allows entries to not possess all possible attributes. For example, a question can have tags but an answer will not. Key/value databases don't fret over differences like that.
Disclaimer: I have no actual knowledge of how SO is implemented.
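To make the "attributes in common" point concrete, here is a minimal sketch of a single posts table with a type discriminator, using SQLite from Python. It is not Stack Overflow's actual schema: only the PostTypeId values (1 = question, 2 = answer) come from the linked data dump description, and the column names, the CHECK constraint and the helper function are assumptions for illustration. It also shows that a plain relational table can tolerate attributes that apply to only one post type, by leaving them NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    id           INTEGER PRIMARY KEY,
    post_type_id INTEGER NOT NULL,              -- 1 = question, 2 = answer
    parent_id    INTEGER REFERENCES posts(id),  -- NULL for questions
    title        TEXT,                          -- NULL for answers
    body         TEXT NOT NULL,
    -- questions carry a title and no parent; answers the reverse
    CHECK ((post_type_id = 1 AND parent_id IS NULL AND title IS NOT NULL)
        OR (post_type_id = 2 AND parent_id IS NOT NULL AND title IS NULL))
);
""")
conn.executemany(
    "INSERT INTO posts (id, post_type_id, parent_id, title, body) VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, None, "How do I X", "Details of the question"),
        (2, 2, 1, None, "Try Y."),
        (3, 2, 1, None, "Or Z."),
    ],
)


def question_with_answers(conn, question_id):
    """Return (title, [answer bodies]) for one question."""
    title = conn.execute(
        "SELECT title FROM posts WHERE id = ? AND post_type_id = 1",
        (question_id,),
    ).fetchone()[0]
    answers = [
        row[0]
        for row in conn.execute(
            "SELECT body FROM posts WHERE parent_id = ? ORDER BY id",
            (question_id,),
        )
    ]
    return title, answers
```

The title lives on the question row itself, so no separate titles table or "firstPostId" is needed to find it.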

Related

How to reuse a model to relate with multiple models

I've worked on big projects before, but I'm trying to improve my best practices, and one thing that I'm stuck on is not to create many models.
This might seem a little bit confusing, so let me put an example:
Let's suppose I have a Post model, and an Answer model, the answer one relates to the Post in a One-Many relationship.
Then, I want to add a Comment model, both to Post and Answer.
I could add two nullable Foreign Key columns on the Comment, to show which model it belongs to.
But I could also create PostComment and AnswerComment models, removing the nullable columns, but creating more boilerplate.
Which practice is the best?
It depends.
I'm assuming the design is primarily to support a transactional application (OLTP), and not reporting (OLAP). I'm also assuming that model = table.
There's nothing inherently wrong with having multiple tables, as long as the design makes sense (can be easily supported), can be extended / modified with relative ease (maintained), and does not lead to poorly performing queries (e.g. if there's a mismatch between the database schema and how calling applications want to consume its data).
If data is the same, it should probably go into the same table; e.g. if you're dealing with birds then don't have tblHawk, tblParrot, etc. But if you had all animals then sure, you'd probably want to separate them out somehow (tblBird, tblFish, tblMammal, etc.) because the data would be too different and too hard to model effectively.
You have answers and posts - I assume these are different enough that having separate tables makes sense? If so, what about comments to them? If comments are essentially the same regardless of post/answer then one table, as you described, is probably a good idea.
Also consider the application: if you have separate post/answer comment tables there's more code to be developed and maintained - but it's separate, so more code but possibly more flexible with less complexity. Using one table will have the opposite effect. Neither is wrong, but one approach is probably better than the other depending on your situation.
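As a rough illustration of the single-table option, the sketch below keeps one comment table with two nullable foreign keys and adds a CHECK constraint so that every comment has exactly one parent. It uses SQLite from Python; the table names, column names and the add_comment helper are assumptions, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
CREATE TABLE answer (
    id      INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES post(id),
    body    TEXT NOT NULL
);
CREATE TABLE comment (
    id        INTEGER PRIMARY KEY,
    post_id   INTEGER REFERENCES post(id),
    answer_id INTEGER REFERENCES answer(id),
    body      TEXT NOT NULL,
    -- exactly one of the two parent columns must be set
    CHECK ((post_id IS NOT NULL) + (answer_id IS NOT NULL) = 1)
);
INSERT INTO post (id, body) VALUES (1, 'A post');
INSERT INTO answer (id, post_id, body) VALUES (1, 1, 'An answer');
""")


def add_comment(conn, body, post_id=None, answer_id=None):
    """Insert a comment on either a post or an answer; return its id."""
    cur = conn.execute(
        "INSERT INTO comment (post_id, answer_id, body) VALUES (?, ?, ?)",
        (post_id, answer_id, body),
    )
    return cur.lastrowid


add_comment(conn, "Comment on the post", post_id=1)
add_comment(conn, "Comment on the answer", answer_id=1)
```

The constraint removes the main risk of the nullable-column design (orphan or doubly-owned comments) while keeping one code path for comments.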

Designing a database for an e-commerce store

Hi, I am trying to design a database for an e-commerce website, but I can't seem to find a way to do this right. This is what I have so far:
The problem appears with the products. I have 66 types of products, most of them having different fields. I have two ideas, but neither of them seems very practical:
OPTION A:
At first I thought to make a table for each product type, but that would result in 66 tables, which is not very easy to maintain. I already started to do that: I created the Product_Notebook and Product_NotebookBag tables. And then I stopped and thought about it a bit, and this solution is not very good.
OPTION B
After thinking about it a bit more I came up with option B which is storing the data into a separate field called description. For example:
"Color : Red & Compatibility : 15.6 & CPU : Intel"
In this approach I could take the string and manipulate it after retrieving it from the database.
I know this approach is also not a very good idea, that's why I am asking for a more practical approach.
See my answer to this question here on Stack Overflow. For your situation I recommend using Entity Attribute Value (EAV).
As I explain in the linked answer, EAV is to be avoided almost all of the time for many good reasons. However, tracking product attributes for an online catalog is one application where the problems with EAV are minimal and the benefits are extensive.
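A minimal sketch of what an EAV layout for product attributes could look like, using SQLite from Python. The table names, column names and sample rows are assumptions for illustration; only the attribute names (Color, Compatibility, CPU) and the two product types echo the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    id    INTEGER PRIMARY KEY,
    type  TEXT NOT NULL,   -- e.g. 'Notebook', 'NotebookBag'
    name  TEXT NOT NULL,
    price REAL NOT NULL    -- fields common to all products stay here
);
CREATE TABLE product_attribute (
    product_id INTEGER NOT NULL REFERENCES product(id),
    name       TEXT NOT NULL,   -- the "attribute"
    value      TEXT NOT NULL,   -- the "value", stored as text
    PRIMARY KEY (product_id, name)
);
""")
conn.executemany(
    "INSERT INTO product (id, type, name, price) VALUES (?, ?, ?, ?)",
    [(1, "NotebookBag", "Red sleeve", 19.9), (2, "Notebook", "Basic laptop", 499.0)],
)
conn.executemany(
    "INSERT INTO product_attribute (product_id, name, value) VALUES (?, ?, ?)",
    [
        (1, "Color", "Red"),
        (1, "Compatibility", "15.6"),
        (2, "CPU", "Intel"),
        (2, "Color", "Black"),
    ],
)


def find_by_attribute(conn, name, value):
    """Return names of products that have the given attribute value."""
    rows = conn.execute(
        """SELECT p.name
           FROM product p
           JOIN product_attribute a ON a.product_id = p.id
           WHERE a.name = ? AND a.value = ?
           ORDER BY p.id""",
        (name, value),
    )
    return [r[0] for r in rows]
```

Unlike the delimited description string of option B, each attribute is a row, so it can be filtered and indexed directly.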
Simply create a ProductProperties table and put all the possible fields there. (You can actually just add more fields to your Products table)
Then, when you list your products, just use the fields you need.
Surely, there are many fields in common as well.
By the way, if you're thinking of storing the data in an array (option B?) you'll regret it later. You won't be able to easily sort your table that way.
Also, that option will make it hard to find a particular item by a specific characteristic.

Logic for recommender application

I am developing an application in which users would answer maybe 10 questions, each with 3-4 options. At the end of the 10th question, based on the responses, it would need to suggest a certain solution. Since there are hundreds of permutations and combinations, what logic and database design would be required?
thanks
EDIT some more detailed explanation
Suppose my application is used to recommend a data plan from various mobile operators, based on the user answering questions like the time spent on the internet, the type of files being downloaded and so on. So, if the response to question 1 was a and question 2 was c, etc., then it would be a certain plan. If the response to question 1 was b and for question 2 it was c, then it would recommend a different plan. So, if there were 10 questions, the number of combinations could be quite large. So is there a certain algorithm that can handle this?
I. what would be the logic?
If I understand correctly, you would define "rules" such as
If the answer to question 5. is either A or B then the suggested plan would be planB, otherwise execute the rest of the rules.
So you would use a rule engine e.g.: http://www.jboss.org/drools/
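The "first matching rule wins" idea can be sketched in a few lines of plain Python. This is not Drools and not its API, just a minimal stand-in for the concept; the rule format, the recommend function and the plan names planA and planC are assumptions (planB is the one from the example rule above).

```python
def recommend(answers, rules, default=None):
    """Return the plan of the first rule whose conditions all match.

    answers: dict mapping question number -> chosen option
    rules:   ordered list of (conditions, plan), where conditions maps
             question number -> set of options that satisfy the rule
    """
    for conditions, plan in rules:
        if all(answers.get(q) in allowed for q, allowed in conditions.items()):
            return plan
    return default  # no rule matched


# Rules are evaluated top to bottom; a handful of rules can cover many
# answer combinations because each rule only names the questions it cares about.
RULES = [
    ({5: {"A", "B"}}, "planB"),        # question 5 is A or B
    ({1: {"a"}, 2: {"c"}}, "planA"),   # hypothetical plan name
    ({1: {"b"}, 2: {"c"}}, "planC"),   # hypothetical plan name
]
```

For example, `recommend({1: "b", 2: "c"}, RULES)` falls through the first two rules and returns the plan of the third.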
II. what would be the database design?
This is quite simple:
USERS table,
QUESTIONS table and
ANSWERS table which would refer to the two others
Possibly there would be a QUESTIONNAIRE table as well, and the QUESTIONS table would refer to it.
Just a 'quick' comment: consider letting the user see how the recommended companies change as they answer each question.
For example, if I am most interested in price that would be the question I would answer first and immediately see the 3 cheapest plans/products recommended to me.
The second question could be coverage and if I then could see the 3 plans with best coverage (in my area) that would be interesting too.
When I answer the third question about smart phone features and I say I want internet, then the first question should spit out the 3 cheapest plans/products that include internet, obviously they could change.
And so on...
Maybe it also could be a good idea to let the user "dive into" each question and see the full range of options for that answer. As a user I would appreciate that.
The above comments are just how I would appreciate a form being made for me. I don't want to answer 10 questions about stuff I don't really put any value on; each user is different and will prefer to make their choice based on their own questions.
So, based on the above, it would be like a checklist where the top results would be the plans/products with the most matching check marks. And to give immediate responses (as the user answers/alters each question), AJAX would probably be your choice.
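The checklist idea can be sketched as a scoring function: count how many of the user's answers each plan satisfies and show the best few, re-running it after every answer. The plan data, feature names and the price tie-break below are assumptions made up for illustration.

```python
def rank_plans(plans, preferences, top=3):
    """Return the names of the `top` plans with the most matching check marks.

    Ties are broken by lower price (an assumption, chosen so that an empty
    checklist simply lists the cheapest plans).
    """
    def score(plan):
        return sum(
            1 for key, wanted in preferences.items()
            if plan["features"].get(key) == wanted
        )

    ranked = sorted(plans, key=lambda p: (-score(p), p["price"]))
    return [p["name"] for p in ranked[:top]]


# Hypothetical sample data
PLANS = [
    {"name": "Basic", "price": 10, "features": {"internet": False, "coverage": "good"}},
    {"name": "Surf", "price": 20, "features": {"internet": True, "coverage": "good"}},
    {"name": "Max", "price": 35, "features": {"internet": True, "coverage": "best"}},
    {"name": "Talk", "price": 8, "features": {"internet": False, "coverage": "poor"}},
]
```

Calling `rank_plans(PLANS, preferences)` with the answers given so far would be the server-side part of each AJAX round trip.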

Survey Data Model

I'm developing a simple survey module for an ASP application I'm working on and I'd like to get some suggestions on the data model.
Questions can be one of three types - multiple choice, multiple answer; multiple choice, single answer, and free response.
I'm thinking of the following tables:
Question - with a question type discriminator field
PossibleAnswers- with a questionID and answer text field
SurveyQuestionResponse- with a questionID, a clientID, and answer text
Am I making this too simple?
Take a look at the Data Model library at databaseanswers.org
Models #76 thru #81 seem pertinent, if only for "inspiration".
A lot depends on the level of sophistication of the surveys you manage, as some surveys, in particular dynamic ones (aimed at removing some of the bias), require additional fields for storing properties such as the probabilities with which a particular question (or reply) is used, the many forms of a question and their associated probabilities, and a record of the questions and suggested replies that were effectively offered to a given surveyee.

schema design

Let's say you are a GM DBA and you have to design around the GM models.
Is it better to do this?
table_model
type {cadillac, saturn, chevrolet}
Or this?
table_cadillac_model
table_saturn_model
table_chevrolet_model
Let's say that the business lines have the same columns for a model and that there are over a million records for each subtype.
EDIT:
there is a lot of CRUD
there are a lot of very processor intensive reports
in either schema, there is a model_detail table that contains 3-5 records for each model and the details for each model differ (you can't add a cadillac detail to a saturn model)
the dev team doesn't have any issues with db complexity
I'm not really sure that this is a normalization question. Even though the structures are the same, they might be thought of as different entities.
EDIT:
Reasons for partitioning the structure into multiple tables
- business lines may have different business rules regarding parts
- addModelDetail() could be different for each business line (even though the data format is the same)
- high add/update activity - better performance with partitioned structure instead of single structure (I'm guessing and not sure here)?
I think this is a variation of the EAV problem. When posed as an EAV design, the single table structure generally gets voted as a bad idea. When posed in this manner, the single table structure generally gets voted as a good idea. Interesting...
I think the most interesting answer is having two different structures - one for CRUD and one for reporting. I think I'll try a concatenated/flattened view for reporting and multiple tables for CRUD and see how that works.
Definitely the former example. Do you want to be adding tables to your database whenever you add a new model to your product range?
On data with a lot of writes (e.g. an OLTP application), it is better to have more, narrower tables (i.e. tables with fewer fields). There will be less lock contention because you're only writing small amounts of data into different tables.
So, based on the criteria you have described, the table structure I would have is:
Vehicle
    VehicleType
    Other common fields
CadillacVehicle
    Fields specific to a Caddy
SaturnVehicle
    Fields specific to a Saturn
For reporting, I'd have an entirely different database on an entirely different server that does not have the normalized structure (i.e. it just has CadillacVehicle and SaturnVehicle tables with all of the fields from the Vehicle table duplicated into them).
With proper indexes, even the OLTP database could be performant in your SELECT's, regardless of the fact that there are tens of millions of rows. However, since you mentioned that there are processor-intensive reports, that's why I would have a completely separate reporting database.
One last comment. About the business rules... the data store cares not about the business rules. If the business rules are different between models, that really shouldn't factor into your design decisions about the database schema (other than to help dictate which fields are nullable and their data types).
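The Vehicle / CadillacVehicle / SaturnVehicle structure described above could look roughly like this in SQLite, driven from Python. The table names follow the answer; every column other than the type discriminator (name, trim_level, panel_material), the sample rows and the helper function are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- common fields live once, in the supertype table
CREATE TABLE vehicle (
    id           INTEGER PRIMARY KEY,
    vehicle_type TEXT NOT NULL CHECK (vehicle_type IN ('cadillac', 'saturn')),
    name         TEXT NOT NULL
);
-- one narrow table per subtype, sharing the supertype's key
CREATE TABLE cadillac_vehicle (
    vehicle_id INTEGER PRIMARY KEY REFERENCES vehicle(id),
    trim_level TEXT                      -- hypothetical Cadillac-only field
);
CREATE TABLE saturn_vehicle (
    vehicle_id     INTEGER PRIMARY KEY REFERENCES vehicle(id),
    panel_material TEXT                  -- hypothetical Saturn-only field
);
INSERT INTO vehicle VALUES (1, 'cadillac', 'Escalade'), (2, 'saturn', 'Vue'),
                           (3, 'saturn', 'Ion');
INSERT INTO cadillac_vehicle VALUES (1, 'Premium');
INSERT INTO saturn_vehicle VALUES (2, 'polymer'), (3, 'polymer');
""")


def count_by_type(conn):
    """Cross-type questions only need the common table."""
    return dict(conn.execute(
        "SELECT vehicle_type, COUNT(*) FROM vehicle GROUP BY vehicle_type"
    ))


# Subtype-specific fields are reached with a join on the shared key.
cadillacs = conn.execute(
    """SELECT v.name, c.trim_level
       FROM vehicle v JOIN cadillac_vehicle c ON c.vehicle_id = v.id"""
).fetchall()
```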
Use the former. Setting up separate tables for the specialisations will complicate your code and doesn't bring any advantages that can't be achieved in other ways. Using a single table will also massively simplify your reports.
If the tables really do have the same columns, then the former is the best way to do it. Even if they had different columns, you'd probably still want to have the common columns be in their own table, and store a type designator.
You could try having two separate databases.
One is an OLTP (OnLine Transaction Processing) system which should be highly normalized so that the data model is highly correct. Report performance should not be a concern here, and you would deal with non-reporting query performance with indexes/denormalization etc. on a case-by-case basis. The data model should try to match up very closely with the conceptual model.
The other is a Reports system which should pull data from the OLTP system periodically, and massage and rearrange that data in a way that makes report-generation easier and more performant. The data model should not try to match up too closely with the conceptual model. You should be able to regenerate all the data in the reporting database at any time from the data currently in the main database.
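A small sketch of the "regenerate the reporting data at any time" property, with two in-memory SQLite databases standing in for the two systems. The model and model_detail tables echo the question; their columns, the flattened model_report table and the rebuild_report function are assumptions for illustration.

```python
import sqlite3

# Normalized OLTP side
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
CREATE TABLE model (id INTEGER PRIMARY KEY, type TEXT NOT NULL, name TEXT NOT NULL);
CREATE TABLE model_detail (
    id       INTEGER PRIMARY KEY,
    model_id INTEGER NOT NULL REFERENCES model(id),
    detail   TEXT NOT NULL
);
INSERT INTO model VALUES (1, 'cadillac', 'Escalade'), (2, 'saturn', 'Vue');
INSERT INTO model_detail (model_id, detail) VALUES
    (1, 'detail a'), (1, 'detail b'), (2, 'detail c');
""")

# Separate reporting side
report = sqlite3.connect(":memory:")


def rebuild_report(oltp, report):
    """Drop and rebuild the flattened reporting table from current OLTP data."""
    rows = oltp.execute(
        """SELECT m.id, m.type, m.name, COUNT(d.id)
           FROM model m LEFT JOIN model_detail d ON d.model_id = m.id
           GROUP BY m.id, m.type, m.name
           ORDER BY m.id"""
    ).fetchall()
    report.executescript("""
        DROP TABLE IF EXISTS model_report;
        CREATE TABLE model_report (
            model_id     INTEGER PRIMARY KEY,
            type         TEXT,
            name         TEXT,
            detail_count INTEGER   -- precomputed so reports need no join
        );
    """)
    report.executemany("INSERT INTO model_report VALUES (?, ?, ?, ?)", rows)
    report.commit()


rebuild_report(oltp, report)
```

Because the reporting table is always derived from the OLTP data, running the rebuild twice yields the same result, and reports never take locks on the transactional tables.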
I would say the first way looks better.
Are there reasons you would want to do it the second way?
The first way follows normalization better and is closer to how most relational database schemas are developed.
The second way seems to be harder to maintain.
Unless there is a really good reason for doing it the second way I would go with the first method.
Given the description that you have given us, the answer is either.
In other words you haven't given us enough information to give a decent answer. Please describe what kind of queries you expect to perform on the data.
[Having said that, I think the answer is going to be the first one ;-)
As I imagine, even though they are different models, the data for each model is probably going to be quite similar.
But this is a complete guess at the moment.]
Edit:
Given your updated edit, I'd say the first one definitely. As they have all the same data then they should go into the same table.
Another thing to consider in defining "better"--will end users be querying this data directly? Highly normalized data is difficult for end-users to work with. Of course this can be overcome with views but it's still something to think about as you're finalizing your design.
I do agree with the other two folks who answered: which form is "better" is subjective and dependent on what you're hoping to achieve. If you're hoping to achieve very quick queries that's one thing. If you're hoping to achieve high programmer productivity--that's a different goal again and possibly conflicts with quick queries.
Choice depends on required performance.
The best database is a normalized database. But there can be performance issues in a normalized database, in which case you have to denormalize it.
The principle "Normalize first, denormalize for performance" works well.
It depends on the datamodel and the use case. If you ever need to report on a query that wants data out of the "models" then the former is preferable because otherwise (with the latter) you'd have to change the query (to include the new table) every time you added a new model.
Oh and by "former" we mean this option:
table_model
* type {cadillac, saturn, chevrolet}
@mson has asked the question "What do you do when a question is not satisfactorily answered on SO?", which is a direct reference to the existing answers to this question.
I contributed the following answer to that discussion, primarily critiquing the way the question was asked.
Quote (verbatim):
I looked at the original question yesterday, and decided not to contribute an answer.
One problem was the use of the term 'model' as in 'GM models' - which cited 'Chevrolet, Saturn, Cadillac' as 'models'. To my understanding, these are not models at all; they are 'brands', though there might also be an industry-insider term for them that I'm not familiar with, such as 'division'. A model would be a 'Saturn Vue' or 'Chevrolet Impala' or 'Cadillac Escalade'. Indeed, there could well be models at a more detailed level than that - different variants of the Saturn Vue, for example.
So, I didn't think that the starting point was well framed. I didn't critique it; it wasn't quite compelling enough, and there were answers coming in, so I let other people try it.
The next problem is that it is not clear what your DBMS is going to be storing as data. If you're storing a million records per 'model' ('brand'), then what sorts of data are you dealing with? Lurking in the background is a different scenario - the real scenario - and your question has used an analogy that failed to be sufficiently realistic. That means that the 'it depends' parts of the answer are far more voluminous than the 'this is how to do it' ones. There is just woefully too little background information on the data to be modelled to allow us to guess what might be best.
Ultimately, it will depend on what uses people have for the data. If the information is going to go flying off in all different directions (different data structures in different brands; different data structures at the car model levels; different structures for the different dealerships - the Chevrolet dealers are handled differently from the Saturn dealers and the Cadillac dealers), then the integrated structure provides limited benefit. If everything is the same all the way down, then the integrated structure provides a lot of benefit.
Are there legal reasons (or benefits) to segregating the data? To what extent are the different brands separate legal entities where shared records could be a liability? Are there privacy issues, such that it will be easier to control access to the data if the data for the separate brands is stored separately?
Without a lot more detail about the scenario being modelled, no-one can give a reliable general answer - at least, not more than the top-voted one already gives (or doesn't give).
Data modelling is not easy.
Data modelling without sufficient information is impossible to do reliably.
I have copied the material here since it is more directly relevant. I do think that to answer this question satisfactorily, a lot more context should be given. And it is possible that there needs to be enough extra context to make SO the wrong place to ask it. SO has its limitations, and one of those is that it cannot deal with questions which require long explanations.
From the SO FAQs page:
What kind of questions can I ask here?
Programming questions, of course! As long as your question is:
detailed and specific
written clearly and simply
of interest to at least one other programmer somewhere
...
What kind of questions should I not ask here?
Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!
This question is, IMO, close to the 'require extended discussion' limit.

Resources