Is delimiting data in a database field something that would be ok to do?
Something like
create table column_names (
id int identity (1,1) PRIMARY KEY,
column_name varchar(5000)
);
and then storing data in it as follows
INSERT INTO column_names (column_name) VALUES ('stocknum|name|price');
No. this is bad:
in order to create new queries you have to track down how things are stored.
queries that join on price or name or stocknum are going to be nasty
the database can't assign data types to the data or validate it
you can't create constraints on any of this data now
Basically you're subverting the RDBMS' scheme for handling things and making up your own, so you're limiting how much the RDBMS tools can help you and you've made the system harder to understand for new people.
The only possible advantage of this kind of system that I can think of is that it can serve as a workaround to avoid dealing with a totally impossible DBA who vetoes all schema changes regardless of merit. Which can happen, unfortunately.
Of course there's an exception to everything. I'm currently on a project with audit-logging requirements that are pretty stringent. the logging is done to a database, we're using delimited fields for storing the fields because the application is never going to interact with this data, it gets written once and left alone.
Almost certainly not.
It violates principles of normalization. The data stored in a particular row of a particular column should be atomic-- you shouldn't be able to parse the data into smaller component parts.
It makes it substantially more difficult to get acceptable performance. Every piece of code that queries this table will need to know how to parse the data which is generally going to mean that more data needs to be read off disk and potentially sent over the network to the client. Every query that has to parse this data is going to have to be more complex which tends to cause grief for the query optimizer. Concatenated data cannot generally be indexed effectively for searches-- you'd have to do something like a full-text index with custom delimiters rather than a nice standard index on a character string. And if you ever have to update one of the delimited values (i.e. because a product name changes), those updates are going to have to scan every row in the table, parse the data, decide whether to actually update the row, and then update a ton of rows.
It makes the application much more brittle. What happens when someone decides to include a | character in the name attribute, for example? Even if you specify an optional enclosure in the spec (i.e. | is allowed if the entire token is enclosed in double quotes), what fraction of the bits of code that actually parse this column are going to implement and test that correctly?
Related
Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation, is this a defensible design choice, or should I have normalized it from the start?
Some more context, this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and make it more maintainable. There are some things in there I'm not entirely happy with, one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as #OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
I needed a multi-value column, it could be implemented as an xml field
It could be converted to a comma delimited as necessary
querying an XML list in sql server using Xquery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.**
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.**
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited list AND can be converted to a delimited list as needed
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well I've been using a key/value pair tab separated list in a NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries but on the other hand, if you have a library that persists/derpersists the key value pair then it's not a that bad idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use a INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR (0) (nullable) for each. You could also use a SET (I forget the exact syntax).
here is a simple question to which I would like an answer to:
We have a member table. Each member practices one, many or no sports. Initially we (the developers) created a [member] table, a [sports] table and a [member_sports] table, just as we have always done.
However our client here doesn't like this and wants to store all the sports that the member practices in a single varchar column, separated with a special character.
So if:
1 is football
2 is tennis
3 is ping-pong
4 is swimming
and I like swimming and ping-pong, my favourite sports will be stored into the varchar column as:
x3,x4
Now we don't want to just walk up to the client and claim that his system isn't right. We would like to back it up with proof that the operation to fetch the sports from [member_sports] is more efficient than simply storing the fields as a varchar.
Is there any documentation that can back our claims? Help!
Ask your client if they care about storing accurate information1 rather than random strings.
Then set them a series of challenges. First, ensure that the sport information is in the correct "domain". For the member_sports table, that is:
sport_id int not null
^
|--correct type
For their "store everything in a varchar column" solution, I guess you're writing a CHECK constraint. A regex would probably help here but there's no native support for regex in SQL Server - so you're either bodging it or calling out to a CLR function to make sure that only actual int values are stored.
Next, we not only want to make sure that the domain is correct but that the sports are actually defined in your system. For member_sports, that's:
CONSTRAINT FK_Member_Sports_Sports FOREIGN KEY (Sport_ID) references Sports (Sport_ID)
For their "store everything in a varchar column" I guess this is going to be a far more complex CHECK constraint using UDFs to query other tables. It's going to be messy and procedural. Plus if you want to prevent a row from being removed from sports while it's still referenced by any member, you're talking about a trigger on the sports table that has to query every row in members2`.
Finally, let's say that it's meaningless for the same sport to be recorded for a single member multiple times. For member_sports, that is (if it's not the PK):
CONSTRAINT UQ_Member_Sports UNIQUE (Member_ID,Sport_ID)
For their "store everything in a varchar column" it's another horrifically procedural UDF called from a CHECK constraint.
Even if the varchar variant performed better (unlikely since you need to be ripping strings apart and T-SQL's string manipulation functions are notoriously weak (see above re: regex)) for certain values of "performs better", how do they propose that the data is meaningful and not nonsense?
Writing the procedural variants that can also cope with nonsense is an even more challenging endeavour.
In case it's not clear from the above - I am a big fan of Declarative Referential Integrity (DRI). Stating what you want versus focussing on mechanisms is a huge part of why SQL appeals to me. You construct the right DRI and know that your data is always correct (or, at least, as you expect it to be)
1"The application will always do this correctly" isn't a good answer. If you manage to build an application and related database in which nobody ever writes some direct SQL to fix something, I guess you'll be the first.
But in most circumstances, there's always more than one application, and even if the other application is a direct SQL client only employed by developers, you're already beyond being able to trust that the application will always act correctly. And bugs in applications are far more likely than bugs in SQL database engine's implementations of constraints, which have been tested far more times than any individual application's attempt to enforce constraints.
2Let alone the far more likely query - find all members who are associated with a particular sport. A second index on member_sports makes this a trivial query3. No indexes help the "it's somewhere in this string" solution and you're looking at a table scan with no indexing opportunities.
3Any index that has sport_id first should be able to satisfy such a query.
I'm not that experienced with databases. If I have a database table containing a lot of empty cells, what's the best way to leave them (e.g. so performance isn't degraded, memory is not consumed, if this is even possible)?
I know there's a "null" value. Is there a "none" value or equivalent that has no drawbacks? Or by just not filling the cell, it's considered empty, so there's nothing left to do? Sorry if it's silly question. Sometimes you don't know what you don't know...
Not trying to get into a discussion of normalizing the database. Just wondering what the conventional wisdom is for blank/empty/none cells.
Thanks
The convention is to use null to signify a missing value. That's the purpose of null in SQL.
Noted database researcher C. J. Date writes frequently about his objections to the handling of null in SQL at a logical level, and he would say any column that may be missing belongs in a separate table, so that the absence of a row corresponds to a missing value.
I'm not aware of any serious efficiency drawbacks of using null. Efficiency of any features depend on the specific database implementation you use. You haven't said if you use MySQL, Oracle, Microsoft SQL Server, or other.
MySQL's InnoDB storage engine, for example, doesn't store nulls among the columns of a row, it just stores the non-null columns. Other databases may do this differently. Likewise nulls in indexes should be handled efficiently, but it varies from product to product.
Use NULL. That's what it's for.
Normally databases are said to have rows and columns. If the column does not require a value, it holds nothing (aka NULL) until it is updated with a value. That is best practice for most databases, though not all databases have the NULL value--some use an empty string, but they are the exception.
With regard to space utilization -- disk is relative inexpensive these days, so worries about space consumption are no longer as prevalent as they once used to be, except in gargantuan databases, perhaps. You can get better performance out of a database if you use all fixed-size datatypes, but once you start allowing variable sized string (e.g. varchar, nvarchar) types, that optimization is no longer possible.
In brief, don't worry about performance for the time being, at least until you get your feet wet.
It is possible, but consider:
Are they supposed to be not-empty? Should you implement not null?
Is it a workflow -- so they are empty now, but most of them will be filled in the future?
If both are NO, then you may consider re-design. Edit your question and post the schema you have now.
There are several schools of thought in this. The first is to use null when the data is not known - that's what it's for.
The second is to not allow nulls and either separate out all the fields that could be null to relational tables or to create "fake" values to replace null. For varchar this would usually be the empty string but the problem arises as to what should be the fake value for a date field or or an numeric. Then you have to write code to exclude the fake data just like you have to write code to deal with the nulls.
Personally I prefer to use nulls with some judicious moving of data to child tables if the data is truly a different entity (and often these fields turn out to need the one-to-many structure of a parent-child relationship anyway, such as when you may or may not know the phone number of a person, put it in a separate phone table and then you will often discover you needed to store multiple phone numbers anyway).
i m designing a database using sql server 2005
main concept of our side is to import xml feeds from suppliers
different supplier can have different representation of data
the problem is i need to design table to store imported information
some of the columns are fixed means all supplier products must have similar data coming from the feed like , name, code, price, status, etc
but some product have optional details like
one product have might color property other might dont.
what is the best way to store these kind of scenario into the database.
should i create a table for mandatory columns and other tables to hold optional column.
or i should i list down all the column first and put them into the one table. (there might a lot of null values)
there will thousands of products and database speed is very essential .
we will be doing a lot of product comparison from different supplier
our database will be something like www.pricerunner.co.uk
i hope i explain the concept well
Thousands of products (so thousands of rows.) Thats really not many at all, so you could normalize the the optional data to a few separate tables without having a dramatic effect on query time.
I would say put your indexes in the correct place, optimize your queries, make sure you have filegroups split up nicely, etc (just the usual regular old database stuff) and you should be good.
Depends on how you want to access it.
As you say, speed is important - but what are you going t do with those extra, optional, bits of information? Do you need to store them at all? Assuming you do, how often do you need to access them?
Essentially, if you will always need to at least check if they're there, probably better to put them into one table. If you need to check anyway, might as well get it over with as part of the initial query.
If, on the other hand, you can usually run without bothering to check for these extra pieces, and only need to bother when specilly requested, then it might be better to put them into a different table. The join (or subsequent lookup) will be expensive - much more expensive than pulling nulls for empty columns - but if it's very infrequent, would probably cost less in runtime execution in the long run.
Also bear in mind the tradeoff in storage and transport terms - storing lots of empty fields does take some space, and sending back lots of empty fields takes network bandwidth.
If disk space is not a concern, but bandwidth is, make the application is carfully designed to minimse unecessary lookups, and then with tight queries you can store the extra (optional) data, but not pass it back unless it's requested.
So, it really all depends on what's important to you. Once you know what your overriding design concerns are, you will know which compromises to make to address those concerns at the expense of others. A balancing act.
I have some question regarding database performance in general. I'm using Sqlite but I assume that the performance remarks are applicable to all relational databases?
I have a database that contains a table that stores data of about 200 variables. I write about 50 variables per second to the table. A writen variable contains the id of the variable, a value and a timestamp. Readig is done very rarely but needs to be as fast as possible to get the data per variable in chronological order. When I do a query I always just need to get the data of 1 variable.
How do I design the database so the reading is as fast as possible:
1. I make 1 tabel that contains all the
variables. The variable is stored as
an id. I index the table on the id
and timestamp. The bad part is that
the index makes the write slowe(r).
2. I make 200 tables for each variable
and index the timestamp.
I think the second solution is the most performant but creaying a table for each variable doesn't seem right. Someone can give me some advice?
Thanks
If you really want to use a database, use the first approach, but make sure you are inserting your data in a single transaction; benchmarks show it makes writing much faster.
Are your searches performed on variable name/id AND timestamp, or variable name only. Indexing on timestamp may not be necessary...
Are you sure you need a database? By the sounds of it, a flat-file will work well enough for you, and you don't sound like you actually need any of the trappings of a database. Just create a flat-file for each variable and keep handles to each open. Write to them through your standard buffered IO as often as you need. To read, just open one file and deserialize.
If you are using a relational database, I am guessing those variables are all related? If they are just values, for instance, settings, then maybe a file or something similar may be better.
If you only ever have to query values for ONE variable, then, if you insist on using a database (which may not be a bad thing!), then you should create one table per variable:
id (unsigned int, auto-increment, primary key)
timestamp (datetime)
variable (whatever it is supposed to be)
Do not skimp on data just because "it might take more room on the hard drive" - that only leads to trouble.