We are storing the metadata of the content in database tables. One of the columns is named keywords and contains the keywords related to the content. Should we normalise the table on the keywords column, given that it will hold multiple values?
I don't think the answer to this question is complicated at all. This is a textbook normalization problem and the answer is yes, you should definitely normalize this.
Storing multiple values in one column is a violation of first normal form (most designers try to get to third normal form). The only reason you would want to do this is as a performance optimization, i.e. if the database is extremely large and you are able to use some special indexing strategy on the denormalized column to pull off some specific optimization (e.g. a materialized path). That is not the case here.
Not normalizing will make it difficult to write queries and impossible to index properly. And if you are storing keywords, I can only assume that the reason is for a keyword search - so you definitely want to be able to index this data and write simple queries.
Please - normalize your data.
I think the answer to the question is very complicated. All the big books about database design say that normalization is a good thing. But in your case it depends on how you are going to query the data. For example, if you need to get all the rows that contain a keyword, you will have to use the LIKE operator on the multi-value column, which is not very fast. But if you have all the keywords in a separate table, then you only need a plain WHERE comparison, which is much faster.
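As a rough sketch of what the normalized version could look like, here is a small example using Python's built-in sqlite3 module. The table and column names (content, keyword, content_keyword) and the sample rows are made up for illustration; the question does not give the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content (
        content_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL
    );
    CREATE TABLE keyword (
        keyword_id INTEGER PRIMARY KEY,
        word       TEXT NOT NULL UNIQUE
    );
    -- Junction table: one row per (content, keyword) pair.
    CREATE TABLE content_keyword (
        content_id INTEGER NOT NULL REFERENCES content(content_id),
        keyword_id INTEGER NOT NULL REFERENCES keyword(keyword_id),
        PRIMARY KEY (content_id, keyword_id)
    );
    -- Index that supports "find content by keyword" lookups.
    CREATE INDEX idx_content_keyword_keyword ON content_keyword(keyword_id);

    INSERT INTO content VALUES (1, 'Intro to SQL'), (2, 'Cooking basics');
    INSERT INTO keyword VALUES (1, 'sql'), (2, 'database'), (3, 'food');
    INSERT INTO content_keyword VALUES (1, 1), (1, 2), (2, 3);
""")


def titles_for_keyword(word):
    """Return titles tagged with the given keyword, using an equality match."""
    rows = conn.execute(
        """
        SELECT c.title
        FROM content AS c
        JOIN content_keyword AS ck ON ck.content_id = c.content_id
        JOIN keyword AS k ON k.keyword_id = ck.keyword_id
        WHERE k.word = ?
        ORDER BY c.title
        """,
        (word,),
    ).fetchall()
    return [title for (title,) in rows]


print(titles_for_keyword("database"))
```

The search is a plain equality comparison on an indexed column, with no LIKE against a packed string.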
Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code were worth it in my situation. Is this a defensible design choice, or should I have normalized it from the start?
Some more context: this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with, and one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
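To make a few of those points concrete, here is a minimal sketch using Python's sqlite3 module. The tables option_lookup and form_selection and their contents are hypothetical. It shows the database itself rejecting duplicates and unknown values, and a single value being deleted without touching the rest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE option_lookup (
        option_id INTEGER PRIMARY KEY,
        label     TEXT NOT NULL
    );
    -- One row per selected value instead of a string like '1,2,3'.
    CREATE TABLE form_selection (
        form_id   INTEGER NOT NULL,
        option_id INTEGER NOT NULL REFERENCES option_lookup(option_id),
        PRIMARY KEY (form_id, option_id)
    );
    INSERT INTO option_lookup VALUES (1, 'red'), (2, 'green'), (3, 'blue'), (5, 'black');
    INSERT INTO form_selection VALUES (10, 1), (10, 2), (10, 3), (11, 2);
""")


def try_insert(form_id, option_id):
    """Insert one selection; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO form_selection VALUES (?, ?)", (form_id, option_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok_new = try_insert(11, 5)    # valid, new selection
dup = try_insert(10, 3)       # duplicate, rejected by the primary key
unknown = try_insert(10, 99)  # no such option, rejected by the foreign key

# Delete a single value without fetching or rewriting a whole list.
with conn:
    conn.execute("DELETE FROM form_selection WHERE form_id = 10 AND option_id = 2")

remaining = [
    option_id
    for (option_id,) in conn.execute(
        "SELECT option_id FROM form_selection WHERE form_id = 10 ORDER BY option_id"
    )
]
print(ok_new, dup, unknown, remaining)
```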
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
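For what it's worth, the bit-flag idea could look roughly like this in Python. The flag names are invented for the example, and each check box gets one bit of a single integer column.

```python
from enum import IntFlag


class Options(IntFlag):
    """Hypothetical check boxes, one bit each."""
    NEWSLETTER = 1
    SMS = 2
    PHONE = 4
    POST = 8


def encode(selected):
    """Pack the selected options into one integer for storage."""
    value = Options(0)
    for option in selected:
        value |= option
    return int(value)


def decode(stored):
    """Unpack a stored integer back into the list of selected options."""
    flags = Options(stored)
    return [option for option in Options if option & flags]


stored = encode([Options.NEWSLETTER, Options.PHONE])
print(stored, decode(stored))
```

This avoids the typo and duplicate problems of a text list, but a search for "all rows with PHONE set" still needs a bitwise test on every row, so it is a compromise rather than a replacement for a proper child table.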
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
I needed a multi-value column; it could be implemented as an XML field.
It could be converted to a comma-delimited list as necessary.
querying an XML list in sql server using Xquery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited lists AND can be converted to a delimited list as needed.
Yes, it is that bad. My view is that if you don't like using relational databases, then look for an alternative that suits you better; there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well, I've been using a tab-separated list of key/value pairs in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/depersists the key/value pairs then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR(0) (nullable) for each. You could also use a SET (I forget the exact syntax).
So I currently have two entities for which comments are possible, say Pic and Text.
I was looking at this question for a possible DB design:
Implementing Comments and Likes in database
There we'd have a single Comment table:
#comment_id
#entity_id
Now, my client, who is tech-savvy as well, doesn't like the idea of having a common super class for commentable entities, for reasons I don't know.
So I currently have a Comment table which has a relationship with either Text or Pic (I am doing reasonably OK at DBs but am not an expert). This would mean that I'd have a Comment table with:
#comment_id
#pic_id
#text_id
which would mean that, to get comments on pics, I have to query where pic_id is not null (and likewise for text comments, where text_id is not null). This feels rather odd to me (although the queries we'd probably run most are "get_comments_for_pic" or "get_comments_for_text", which would entail querying by pic_id or text_id).
But the client is pushing to have separate tables, Text_Comment and Pic_Comment, which would basically achieve the same thing but would better separate the different comments.
Is there any other reason to prefer one over the other which I can't see right now? Any other suggestion on how to implement this?
Your client is right. With your design, the table will have many NULL entries, which is usually a sign of bad DB design, and each row will be larger. You will also need two indexes on columns that are mostly NULL, and the single table will be about twice the size of each of the two split tables, which also increases the size of the corresponding indexes.
Overall, your client's suggestion gives smaller rows, smaller DB tables and smaller indexes => faster performance.
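To compare the two layouts side by side, here is a small sketch with Python's sqlite3 module. The body column and the sample rows are assumptions, since the question only lists the id columns. If you do keep the single table, a CHECK constraint can at least guarantee that exactly one of pic_id and text_id is filled in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Variant A: one table, two nullable references, exactly one must be set.
    CREATE TABLE comment (
        comment_id INTEGER PRIMARY KEY,
        pic_id     INTEGER,
        text_id    INTEGER,
        body       TEXT NOT NULL,
        CHECK ((pic_id IS NULL) <> (text_id IS NULL))
    );

    -- Variant B: one table per commentable entity, no NULL columns.
    CREATE TABLE pic_comment (
        comment_id INTEGER PRIMARY KEY,
        pic_id     INTEGER NOT NULL,
        body       TEXT NOT NULL
    );
    CREATE TABLE text_comment (
        comment_id INTEGER PRIMARY KEY,
        text_id    INTEGER NOT NULL,
        body       TEXT NOT NULL
    );
    CREATE INDEX idx_pic_comment_pic ON pic_comment(pic_id);
    CREATE INDEX idx_text_comment_text ON text_comment(text_id);
""")

conn.execute("INSERT INTO pic_comment (pic_id, body) VALUES (?, ?)", (7, "Nice shot"))
conn.execute("INSERT INTO text_comment (text_id, body) VALUES (?, ?)", (3, "Well written"))


def get_comments_for_pic(pic_id):
    """Variant B lookup: no IS NOT NULL filtering needed."""
    rows = conn.execute(
        "SELECT body FROM pic_comment WHERE pic_id = ? ORDER BY comment_id",
        (pic_id,),
    )
    return [body for (body,) in rows]


def single_table_accepts(pic_id, text_id):
    """Variant A: report whether the CHECK constraint lets the row in."""
    try:
        conn.execute(
            "INSERT INTO comment (pic_id, text_id, body) VALUES (?, ?, ?)",
            (pic_id, text_id, "example"),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(get_comments_for_pic(7))
print(single_table_accepts(7, None), single_table_accepts(7, 3))
```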
I have a table that holds Test Questions. Each row of the table contains details of a question for a test for a particular user.
The user is presented with 3-5 possible answers and I would like to store details of the answers that have been checked in the row. I don't really want to add new rows for every answer as this would create a huge number of rows.
Is there a way that I can store something like an array of answers in a column in SQL Server? Presently I am storing the data as a JSON string but I remember that Oracle had some way to store array data and I am wondering if SQL Server has the same.
Generally, denormalizing is rarely a good idea. However, it is sometimes necessary for performance reasons. So, if the normalized design is not too slow, don't even consider it.
If you make a secondary answers table in your case with the TestQuestionID (or whatever you call the answers for a single question) to be the clustered index, it won't be much of a performance difference at all compared to a denormalized table.
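A minimal sketch of such a secondary answers table follows, with SQLite standing in for SQL Server (via Python's sqlite3 module). The table names, columns and sample data are assumptions. A WITHOUT ROWID table is stored ordered by its primary key, which is roughly analogous to clustering on TestQuestionID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_question (
        test_question_id INTEGER PRIMARY KEY,
        user_id          INTEGER NOT NULL,
        question_text    TEXT NOT NULL
    );
    -- One row per checked answer, physically ordered by question id.
    CREATE TABLE test_question_answer (
        test_question_id INTEGER NOT NULL
            REFERENCES test_question(test_question_id),
        answer_no        INTEGER NOT NULL CHECK (answer_no BETWEEN 1 AND 5),
        PRIMARY KEY (test_question_id, answer_no)
    ) WITHOUT ROWID;

    INSERT INTO test_question VALUES (1, 42, 'Which of these are prime?');
    INSERT INTO test_question_answer VALUES (1, 2), (1, 3), (1, 5);
""")


def checked_answers(test_question_id):
    """Return the checked answer numbers for one question, in order."""
    rows = conn.execute(
        "SELECT answer_no FROM test_question_answer "
        "WHERE test_question_id = ? ORDER BY answer_no",
        (test_question_id,),
    )
    return [answer_no for (answer_no,) in rows]


print(checked_answers(1))
```

All answers for one question sit next to each other in the key order, so fetching them is a single short range scan.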
If I were denormalizing the table you describe, I would probably just create 5 columns in the table. You could also use an XML field, but all you are storing is 5 answers, so I would not use XML in this simple case.
Since you are asking this question, you are not really a seasoned professional (we all start as novices) and you should consult the local sql expert before you denormalize.
ADDED CAVEAT:
Since you accepted this answer, you really need to understand for certain that denormalizing is almost always the wrong thing to do. That is what everyone, including me, was trying to tell you. Don't do this without talking to your DBA. If you don't have a local DBA (unfortunately all too common), take the collective advice and don't denormalize. I can think of only one time in my career when I think denormalizing was the correct solution. And I have been bitten by bad design (forced on me) through inappropriate denormalization on many occasions.
Consider Microsoft SQL Server 2008
I need to create a table which can be created two different ways as follows.
Structure Columnwise
StudentId number, Name Varchar, Age number, Subject varchar
eg.(1,'Dharmesh',23,'Science')
(2,'David',21,'Maths')
Structure Rowwise
AttributeName varchar,AttributeValue varchar
eg.('StudentId','1'),('Name','Dharmesh'),('Age','23'),('Subject','Science')
('StudentId','2'),('Name','David'),('Age','21'),('Subject','Maths')
In the first case there will be fewer records; in the second approach there will be 4 times as many, but with 2 fewer columns.
So which approach is better in terms of performance, disk storage and data retrieval?
Your second approach is commonly known as an EAV design - Entity-Attribute-Value.
IMHO, 1st approach all the way. That allows you to type your columns properly allowing for most efficient storage of data and greatly helps with ease and efficiency of queries.
In my experience, the EAV approach usually causes a world of pain. Here's one example of a previous question about this, with good links to best practices. If you do a search, you'll find more - well worth a sift through.
A common reason why people head down the EAV route is to model a flexible schema, which is relatively difficult to do efficiently in RDBMS. Other approaches include storing data in XML fields. This is one reason where NOSQL (non-relational) databases can come in very handy due to their schemaless nature (e.g. MongoDB).
The first one will be better in terms of performance, disk storage and data retrieval.
Having attribute names as varchars will make it impossible to change names, datatypes or apply any kind of validation
It will be impossible to index desired search actions
Saving integers as varchars will use more space
Ordering, adding or summing integers will be a headache, and will have bad performance
The programming language using this database will not have any possibility to have strong typed data
There are many more reasons for using the first approach.
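Here is a small illustration of the difference, using Python's sqlite3 module and the sample rows from the question. Note that entity_id in the row-wise table is an added assumption, because without some such column the rows belonging to one student cannot be tied together. The same question ("names of students older than 22") is asked of both layouts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column-wise: one typed column per attribute.
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        age        INTEGER NOT NULL,
        subject    TEXT NOT NULL
    );
    INSERT INTO student VALUES
        (1, 'Dharmesh', 23, 'Science'),
        (2, 'David', 21, 'Maths');

    -- Row-wise (EAV): every value is a string; entity_id is assumed.
    CREATE TABLE student_eav (
        entity_id       INTEGER NOT NULL,
        attribute_name  TEXT NOT NULL,
        attribute_value TEXT NOT NULL
    );
    INSERT INTO student_eav VALUES
        (1, 'Name', 'Dharmesh'), (1, 'Age', '23'), (1, 'Subject', 'Science'),
        (2, 'Name', 'David'),    (2, 'Age', '21'), (2, 'Subject', 'Maths');
""")

# Column-wise: a direct, typed comparison.
columnwise = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM student WHERE age > 22 ORDER BY name"
    )
]

# Row-wise: a self-join per attribute plus a cast from text to integer.
rowwise = [
    name
    for (name,) in conn.execute(
        """
        SELECT n.attribute_value
        FROM student_eav AS a
        JOIN student_eav AS n
          ON n.entity_id = a.entity_id AND n.attribute_name = 'Name'
        WHERE a.attribute_name = 'Age'
          AND CAST(a.attribute_value AS INTEGER) > 22
        ORDER BY n.attribute_value
        """
    )
]

print(columnwise, rowwise)
```

Both queries return the same rows, but every extra attribute in the row-wise version costs another self-join.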
2 tables:
- views
- downloads
Identical structure:
item_id, user_id, time
Should I be worried?
I don't think that there is a problem, per se.
When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence.
Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often.
This alone could be a good reason not to burden it with records from the downloads.
Also, the fact that they are identical now does not mean they will be in the future: views and downloads are different, after all, so sooner or later one or both could grow an extra field or two.
These tables are the same NOW but the schema may change in the future. If they represent 2 different concepts it is good to keep them separate. What if you wanted to have a foreign key from another table to the downloads table but not the views table? If they were the same table you could not do this.
I think the answer has to be "it depends". As someone else pointed out, if the schema of one or both tables is likely to evolve then no. I can think of other cases as well (simplifying the security model by allowing apps/users access to one or the other).
Having said this, I work with a legacy DB where this is a problem. We have multiple identical tables for customer invoices. Data is actually moved between them at different stages in the processing life-cycle. It makes for a complicated mess when trying to access data. It would have been easily solved by a state flag in the original schema, but we now have 20+ years of code written against the multi-table version.
Short answer: depends on why they are the same schema :).
From an E/R modelling point of view I don't see a problem with that, as long as they represent two semantically different entities.
From an implementation point of view, it really depends on how you plan to query that data:
If you plan to query those tables independently from each other, keeping them separate is a good choice
If you plan to query those tables together (maybe with a UNION or a JOIN operation) you should consider storing them in a single table with a discriminator column to distinguish their type
When considering whether to consolidate them into a single table you should also take into account other factors like:
The amount of data stored in each table
The rate at which data grows in each table
The ratio of read/write operations executed on each table
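As an illustration of the "queried together" case, here is a sketch with Python's sqlite3 module. The combined table item_event, its event_type discriminator column and the sample rows are made up for the example. It counts total activity per item both ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Separate tables with identical structure.
    CREATE TABLE views     (item_id INTEGER NOT NULL, user_id INTEGER NOT NULL, time TEXT NOT NULL);
    CREATE TABLE downloads (item_id INTEGER NOT NULL, user_id INTEGER NOT NULL, time TEXT NOT NULL);

    -- Single table with a discriminator column.
    CREATE TABLE item_event (
        event_type TEXT NOT NULL CHECK (event_type IN ('view', 'download')),
        item_id    INTEGER NOT NULL,
        user_id    INTEGER NOT NULL,
        time       TEXT NOT NULL
    );

    INSERT INTO views VALUES
        (1, 10, '2024-01-01T10:00'), (1, 11, '2024-01-01T11:00'), (2, 10, '2024-01-02T09:00');
    INSERT INTO downloads VALUES
        (1, 10, '2024-01-01T10:05');

    INSERT INTO item_event SELECT 'view', item_id, user_id, time FROM views;
    INSERT INTO item_event SELECT 'download', item_id, user_id, time FROM downloads;
""")

# Separate tables: the combined query needs a UNION ALL.
activity_separate = conn.execute(
    """
    SELECT item_id, COUNT(*)
    FROM (SELECT item_id FROM views
          UNION ALL
          SELECT item_id FROM downloads)
    GROUP BY item_id
    ORDER BY item_id
    """
).fetchall()

# Single table: the combined query is a plain GROUP BY.
activity_single = conn.execute(
    "SELECT item_id, COUNT(*) FROM item_event GROUP BY item_id ORDER BY item_id"
).fetchall()

# Single table: the per-type query just filters on the discriminator.
download_count = conn.execute(
    "SELECT COUNT(*) FROM item_event WHERE event_type = 'download'"
).fetchone()[0]

print(activity_separate, activity_single, download_count)
```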
Chris Date and David McGoveran formalised the "Principle of Orthogonal Design". Roughly speaking, it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim is to avoid certain types of redundancy and ambiguity that could result.
Arguably it isn't always totally practical to do that and it isn't necessarily clear cut exactly when the principle is being broken. However, I do think it's a good guiding rule, if only because it avoids the problem of duplicate logic in data access code or constraints, i.e. it's a good DRY principle. Avoid having tables with potentially overlapping meanings unless there is some database constraint that prevents duplication between them.
It depends on the context - what is a View and what is a Download? Does a Download imply a View (how else would it be downloaded)?
It's possible that you have well-defined, separate concepts there - but it is a smell I'd want to investigate further. It seems likely that a View and a Download are related somehow, but your model doesn't show anything.
Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields accordingly to avoid this kind of misunderstanding.