db design: commenting multiple entities - database

So I currently have two entities for which comments are possible, say Pic and Text.
I was looking at this question for a possible DB design:
Implementing Comments and Likes in database
There we'd have a single Comment table:
#comment_id
#entity_id
Now, my client, which is tech-savvy as well, doesn't like the idea of having a common super class for commentable entities - for whatever reason I don't know.
So I currently have a Comment table which has a relationship with either Text and Pic (I am doing reasonably ok at DBs but am not an expert). This would mean that I'd have a Comment table with:
#comment_id
#pic_id
#text_id
which would result, for comments on pics, to have to query where pic_id is not null (the same way for text comments, query for text_id != null). This feels rather odd to me (nevertheless, the queries we'd probably have more are "get_comments_for_pic", or "get_comments_for_text", which would entail querying by pic_id or text_id).
But the client is pushing to have separate tables for each Text_Comment, and Pic_Comment, which would basically achieve the same but would better separate the different comments.
Is there any other reason to prefer one over the other which I can't see right now? Any other suggestion on how to implement this?

Your client is right. Your way, your DB table will have many NULL entries, which usually is a sign of bad DB design. The size of each row will be larger. Also, you will require two indexes in fields with too many NULL values and the DB table will be double the size, than splitting into two DB tables, also increasing the size of the corresponding indexes.
Overall, with your client suggestion, smaller rows, smaller DB tables, smaller indexes => faster performance.

Related

Prevent continuously adding columns by postgres json data type

I wanted to ask, if there may be a different and better approach than mine.
I have a model entity that can have an arbitrary amount of hyperparameters. Depending on the specific model I want to insert as row into the model table, I may have specific hyperparameters. I do not want to continuously add new columns to my model table for new hyperparameters that I encounter when trying out new models (+ I don't like having a lot of columns that are null for many rows). I also want to easily filter models on specific hyperparameter values, e.g. "select * from models where model.hyperparameter_x.value < 0.5". So, an n-to-n relationship to a hyperparameter table comes to mind. The issue is, that the datatype for hyperparameters can be different, so I cannot define a general value column on the relationship table, with a datatype, that's easily comparable across different models.
So my idea is, to define a json type "value" column in the relationship table to support different value datatypes (float, array, string, ...). What I don't like about that idea and what was legitimately critizised by colleagues is that this can result in chaos within the value column pretty fast, e.g. people inserting data with very different json structures for the same hyperparameters. To mitigate this issue, I would introduce a "json_regex_template" column in the hyperparameter table, so that on API level I can easily validate wheter the json for a value for hyperparameter x is correctly defined by the user. An additional "json_example" column in the hyperaparameter table would further help the user on the other side of the API make correct requests.
This solution would still not guarentee non-chaos on database request level (even though no User should directly insert data without using the API, so I don't think thats a very big deal). And the solution still feels a bit hacky. I would believe, that I'm not the first person with this problem and maybe there is a best practice to solve it?
Is my aversion against continuously adding columns reasonable? It's about probl. 3-5 new columns per month, may saturate at some point to a lower number, but thats speculative.
I'm aware of this post (Storing JSON in database vs. having a new column for each key), but it's pretty old, so my hope is that there may be new stuff I could use. The model-hyperparameter thing is of course just a small part of my full database model. Changing to a non-relational database is not an option.
Opinions are much appreciated

What is wrong with this database design?

I was pointed out by someone else that the following database design have serious issues, can anyone tell me why?
a tb_user table saves all the users information
tb_user table will have 3 - 8 users only.
each user's data will be saved in a separate table, naming after the user's name.
Say a user is called: bill_admin, then he has a seperate table, i.e. bill_admin_data, to save all data belongs to him. All users' data shared the same structure.
The person who pointed out this problem said I should merge all the data into one table, and uses FK to distinguish them, but I have the following statement:
users will only be 3 - 8, so there's not gonna be a lot of tables anyway.
each user has a very large data table, say 500K records.
Is it a bad practise to design database like this? And why? Thank you.
Because it isn't very maintainable.
1) Adding data to a database should never require modifying the structure. In your model, if you ever need to add another person you will need a new table (or two). You may not think you will ever need to do this, but trust me. You will.
So assume, for example, you want to add functionality to your application to add a new user to the database. With this structure you will have to give your end users rights to create new tables, which creates security problems.
2) It violates the DRY principle. That is, you are creating multiple copies of the same table structure that are identical. This makes maintenance a pain in the butt.
3) Querying across multiple users will be unnecessarily complicated. There is no good reason to split each user into a separate table other than having a vendetta against the person who has to write queries against this DB model.
4) If you are splitting it into multiple tables for performance because each user has a lot of rows, you are reinventing the wheel. The RDBMS you are using undoubtedly has an indexing feature which allows it to efficiently query large tables. Your home-grown hack is not going to outperform the platform's approach for handling large data.
I wouldn't say it's bad design per se. It is just not the type of design for which relational databases are designed and optimized for.
Of course, you can store your data as you mention, but many operations won't be trivial. For example:
Adding a new person
Removing a person
Generating reports based on data across all your people
If you don't really care about doing this. Go ahead and do your tables as you propose, although I would recommend using a non relational database, such as MongoDB, which is more suited for this type of structure.
If you prefer using relational databases, by aggregating data by type, and not by person gives you lots of flexibility when adding new people and calculating reports.
500k lines is not "very large", so don't worry about size when making your design.
it is good to use Document based database like mongoDB for these type of requirement.

Can you have 2 tables with identical structure in a good DB schema?

2 tables:
- views
- downloads
Identical structure:
item_id, user_id, time
Should I be worried?
I don't think that there is a problem, per se.
When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence.
Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often.
This alone could be a good reason not to burden it with records from the downloads.
Also, the fact that they are indentical now does not mean they will be in the future: views and downloads are different, after all, so sooner or later one or both could grow an extra field or two.
These tables are the same NOW but may schema change in the future. If they represent 2 different concepts it is good to keep them separate. What if you wanted to have a foreign key from another table to the downloads table but not the views table, if they were that same table you could not do this.
I think the answer has to be "it depends". As someone else pointed out, if the schema of one or both tables is likely to evolve then no. I can think of other cases well (simplifying the security model by allow apps/users access to one or the other).
Having said this, I work with a legacy DB where this is a problem. We have multiple identical tables for customer invoices. Data is actually moved between then at different stages in the processing life-cycle. It makes for a complicated mess when trying to access data. It would have been easily solved by a state flag in the original schema, but we now have 20+ years of code written against the multi-table version.
Short answer: depends on why they are the same schema :).
From a E/R modelling point of view I don't see a problem with that, as long as they represent two semantically different entities.
From an implementation point of view, it really depends on how you plan to query that data:
If you plan to query those tables independently from each other, keeping them separate is a good choice
If you plan to query those tables together (maybe with a UNION of a JOIN operation) you should consider storing them in a single table with a discriminator column to distinguish their type
When considering whether to consolidate them into a single table you should also take into account other factors like:
The amount of data stored in each table
The rate at which data grows in each table
The ratio of read/write operations executed on each table
Chris Date and Dave McGoveran formalised the "Principle of Orthogonal Design". Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result.
Arguably it isn't always totally practical to do that and it isn't necessarily clear cut exactly when the principle is being broken. However, I do think it's a good guiding rule, if only because it avoids the problem of duplicate logic in data access code or constraints, i.e. it's a good DRY principle. Avoid having tables with potentially overlapping meanings unless there is some database constraint that prevents duplication between them.
It depends on the context - what is a View and what is a Download? Does a Download imply a View (how else would it be downloaded)?
It's possible that you have well-defined, separate concepts there - but it is a smell I'd want to investigate further. It seems likely that a View and a Download are related somehow, but your model doesn't show anything.
Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding.

Database Table Normalisation

We are storing the metadata of the content in the database tables. Out which there is a column named keywords which contains the keywords related to the content. So, should we normalise the table based on the keyword column because the keyword column will be holding multiple values.
I don't think the answer to this question is complicated at all. This is a textbook normalization problem and the answer is yes, you should definitely normalize this.
Storing multiple values in one column is a violation of first-normal form (most designers try to get to third-normal). The only reason you would want to do this is as a performance optimization, i.e. if the database is extremely large and you are able to use some special indexing strategy on the denormalized column to pull off some specific optimization (i.e. a materialized path). That is not the case here.
Not normalizing will make it difficult to write queries and impossible to index properly. And if you are storing keywords, I can only assume that the reason is for a keyword search - so you definitely want to be able to index this data and write simple queries.
Please - normalize your data.
I think the answer to the question is very complicated. All the big books about database design says that normalization is a good thing. But in your case it depends on how you are going to query the data. For example if you need to get all the rows that contains a key word you will have to use the like operator which is not very fast. But if you have all the key words in a table then you have only a where which is much faster.

SQL-Server DB design time scenario (distributed or centralized)

We've an SQL Server DB design time scenario .. we've to store data about different Organizations in our database (i.e. like Customer, Vendor, Distributor, ...). All the diff organizations share the same type of information (almost) .. like Address details, etc... And they will be referred in other tables (i.e. linked via OrgId and we have to lookup OrgName at many diff places)
I see two options:
We create a table for each organization like OrgCustomer, OrgDistributor, OrgVendor, etc... all the tables will have similar structure and some tables will have extra special fields like the customer has a field HomeAddress (which the other Org tables don't have) .. and vice-versa.
We create a common OrgMaster table and store ALL the diff Orgs at a single place. The table will have a OrgType field to distinguish among the diff types of Orgs. And the special fields will be appended to the OrgMaster table (only relevant Org records will have values in such fields, in other cases it'll be NULL)
Some Pros & Cons of #1:
PROS:
It helps distribute the load while accessing diff type of Org data so I believe this improves performance.
Provides a full scope for accustomizing any particular Org table without effecting the other existing Org types.
Not sure if diff indexes on diff/distributed tables work better then a single big table.
CONS:
Replication of design. If I have to increase the size of the ZipCode field - I've to do it in ALL the tables.
Replication in manipulation implementation (i.e. we've used stored procedures for CRUD operations so the replication goes n-fold .. 3-4 Inert SP, 2-3 SELECT SPs, etc...)
Everything grows n-fold right from DB constraints\indexing to SP to the Business objects in the application code.
Change(common) in one place has to be made at all the other places as well.
Some Pros & Cons of #2:
PROS:
N-fold becomes 1-fold :-)
Maintenance gets easy because we can try and implement single entry points for all the operations (i.e. a single SP to handle CRUD operations, etc..)
We've to worry about maintaining a single table. Indexing and other optimizations are limited to a single table.
CONS:
Does it create a bottleneck? Can it be managed by implementing Views and other optimized data access strategy?
The other side of centralized implementation is that a single change has to be tested and verified at ALL the places. It isn't abstract.
The design might seem a little less 'organized\structured' esp. due to those few Orgs for which we need to add 'special' fields (which are irrelevant to the other tables)
I also got in mind an Option#3 - keep the Org tables separate but create a common OrgAddress table to store the common fields. But this gets me in the middle of #1 & #2 and it is creating even more confusion!
To be honest, I'm an experienced programmer but not an equally experienced DBA because that's not my main-stream job so please help me derive the correct tradeoff between parameters like the design-complexity and performance.
Thanks in advance. Feel free to ask for any technical queries & suggestions are welcome.
Hemant
I would say that your 2nd option is close, just few points:
Customer, Distributor, Vendor are TYPES of organizations, so I would suggest:
Table [Organization] which has all columns common to all organizations and a primary key for the row.
Separate tables [Vendor], [Customer], [Distributor] with specific columns for each one and FK to the [Organization] row PK.
The sounds like a "supertype/subtype relationship".
I have worked on various applications that have implemented all of your options. To be honest, you probably need to take account of the way that your users work with the data, how many records you are expecting, commonality (same organisation having multiple functions), and what level of updating of the records you are expecting.
Option 1 worked well in an app where there was very little commonality. I have used what is effectively your option 3 in an app where there was more commonality, and didn't like it very much (there is more work involved in getting the data from different layers all of the time). A rewrite of this app is implementing your option 2 because of this.
HTH

Resources